Version 0.88: custom content processors!
This is the long-awaited release that brings full customizability to web content processors.
There is a lot to say about it:
The starting point is that it should never lower the parsing quality of a 1flow instance. New parsers are applied selectively and can be tested on a few articles before being applied globally to a website. Other websites aren't affected, except in some very unusual conditions you should never reach.
There is already a new parser bundled with 1flow (based on the newspaper Python module), which has already solved empty-content problems on some websites.
The pure-1flow interface for this feature is not finished yet (it is missing the ChainedItem part, which is the trickiest to make simple), but there is an interim interface in the Django Admin.
Processing errors can now be consulted individually in the Django Admin.
You can write your own content parsers, without needing to submit them to 1flow via a PR or anything.
In the near future, your 1flow instances will be able to share their custom parsers via the generic synchronization mechanism, which is underway. There will also be a way to share parsers globally with the 1flow community, via a peer-based review system.
As you can see, the newspaper processor's accept code is practically a no-op: it just calls a shared accept function that suits any processor that fetches and parses an article's content. No need to duplicate code, even between your own processors (see documentation).
And here is the process code:
This processor is only 26 lines long, including imports and logging… It lacks a few comments, but you get the point: writing a new parser is FAST.
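To give a rough idea of the shape of such a processor, here is a minimal sketch. It is not the bundled newspaper processor: the function names (`accept_article_content`, `accept`, `process`), the dict-based article and its `html` / `content` keys are all assumptions made for illustration, and the standard-library HTML parser stands in for newspaper so that the example runs on its own.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (stand-in for newspaper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def accept_article_content(article):
    """Hypothetical shared accept code: any article with HTML but no content yet."""
    return bool(article.get("html")) and not article.get("content")


def accept(article):
    # Nothing specific to this processor: delegate to the shared helper.
    return accept_article_content(article)


def process(article):
    # Extract the paragraphs and store them as the article content.
    parser = ParagraphText()
    parser.feed(article["html"])
    parser.close()
    article["content"] = "\n\n".join(
        text.strip() for text in parser.paragraphs if text.strip()
    )
    return article
```

The point of the split is that `accept` stays a one-liner for every processor of the same family, so only `process` has to be written for a new parser.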
You can test it right away, without bothering much with the Python shell. You can even test it on real articles, thanks to the new staff insert in the article view:
Then in the Django management shell:
a = Article.objects…
# not needed every time.
# a.reset()
# a.absolutize_url()
a.fetch_content()
And you see right there how your processor behaves.
If you need something more packaged, just attach your new processor chain to the website, and run:
./manage.py reparse_articles <website_ID_or_URL> test
A random article will be processed and its new content printed. If you're satisfied, reparse all articles with the new processor chain:
./manage.py reparse_articles <website_ID_or_URL>
And you're ready to stare at some matrix-like screen.
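The "try one random article first, then reparse everything" workflow can be pictured with the sketch below. It only illustrates the idea and is not the implementation of the management command: the function signature, the list-of-dicts articles and the `chain` of plain functions are assumptions.

```python
import random


def reparse_articles(articles, chain, test=False, rng=random):
    """Run every step of `chain` over the articles.

    In test mode, only a copy of one random article is processed and
    returned, so the stored articles are left untouched.
    """

    def run(article):
        for step in chain:
            article = step(article)
        return article

    if test:
        sample = dict(rng.choice(articles))  # shallow copy, original is kept
        return [run(sample)]

    for index, article in enumerate(articles):
        articles[index] = run(article)
    return articles
```

Calling it with `test=True` gives a preview of what the new chain produces; calling it without the flag applies the chain to every article.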
Sharing processors with other people
In the meantime, sharing your content parsers (which are processors and processing chains in 1flow terminology) is still easy, using two management commands:
./manage.py dump_processors
./manage.py load_processors
Run dump_processors on one side and load_processors on the other. Provided you copy the file created by dump_processors to the machine where you will run load_processors, the commands take care of the rest and will tell you all you need to know.
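Conceptually, this is a dump-to-file / merge-from-file round trip. The sketch below shows that idea with plain JSON; the file format, the `slug` key and the return values are assumptions for illustration and do not describe what the real commands write or read.

```python
import json


def dump_processors(processors, path):
    """Write a list of processor descriptions (dicts) to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(processors, handle, indent=2, sort_keys=True)


def load_processors(path, existing):
    """Merge the processors stored in `path` into `existing`, keyed by slug.

    Returns the merged list plus the slugs that were created and updated,
    so the caller can report what happened.
    """
    with open(path, encoding="utf-8") as handle:
        incoming = json.load(handle)

    by_slug = {proc["slug"]: proc for proc in existing}
    created, updated = [], []
    for proc in incoming:
        target = updated if proc["slug"] in by_slug else created
        target.append(proc["slug"])
        by_slug[proc["slug"]] = proc
    return list(by_slug.values()), created, updated
```

Keying on a stable identifier is what lets the receiving side update a processor it already has instead of duplicating it.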
I will gradually write new content parsers for a paying customer, and fortunately they will benefit the whole 1flow community. I will also finish the management interface for them, and implement parameter-driven processors, so that they can be used to their full potential.
It's not currently an achievement per se. It's a start. Processors are a very generic mechanism, and many currently-fixed things in 1flow will switch to processor-based code (post-processing of original data, processing of documents of any type, such as PDF, ODT, images, audio…), allowing nearly full customization of any processing chain at any point.
Think of a dead-simple plugin system (only two methods to implement) that has access to every part of the engine, and you have only a vague idea of what can be done.
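As a rough picture of what such a two-method plugin chain could look like, here is a small sketch. The class names, the `accept` / `process` method names and the dict-based items are assumptions chosen for the example, not the 1flow API.

```python
class Processor:
    """Hypothetical base class: a plugin only implements these two methods."""

    def accept(self, item):
        raise NotImplementedError

    def process(self, item):
        raise NotImplementedError


class NormalizeWhitespace(Processor):
    def accept(self, item):
        return item.get("type") == "text"

    def process(self, item):
        # Collapse runs of whitespace into single spaces.
        return {**item, "content": " ".join(item["content"].split())}


class DescribePdf(Processor):
    def accept(self, item):
        return item.get("type") == "pdf"

    def process(self, item):
        # Placeholder processing: only report the size of the raw document.
        return {**item, "content": "[PDF, %d bytes]" % len(item["raw"])}


def run_chain(chain, item):
    """Pass the item through every processor of the chain that accepts it."""
    for processor in chain:
        if processor.accept(item):
            item = processor.process(item)
    return item
```

An item that no processor accepts goes through the chain unchanged, which is what keeps a new processor from affecting content it was not written for.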