1flow.io news blog @1flowio - Tumblr Blog

1flow 0.103, 0.104 & 0.105 : blogging embryo, basic search engine, CAS

All administrators are urged to update: recent releases include security / privacy fixes.

New features

For end-users

blogs & writing features begin. You can now create your own blogs, but not yet create anything inside them. This is a WIP feature and will evolve soon.

search engine is back. It's still quite rough and not as polished as a real search engine. It doesn't handle accented letters, for example, and doesn't work well with multi-word searches. But it's far faster than the previous mongodb incarnation, and it searches in excerpts and contents too.

For administrators

1flow is now CAS-aware (Centralized Authentication Service). Just define the environment variable ONEFLOW_CAS_SERVER_URL and you're done.
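As a rough illustration of what defining that variable could look like on the settings side, here is a minimal sketch. Only the ONEFLOW_CAS_SERVER_URL name comes from this post; the helper function and the keys it returns are hypothetical, and this is not the actual 1flow settings code.

```python
import os


def cas_settings(environ=None):
    """Build hypothetical CAS settings from environment variables."""
    environ = os.environ if environ is None else environ
    # Only this variable name comes from the release notes.
    url = environ.get("ONEFLOW_CAS_SERVER_URL", "").strip()
    # CAS is considered enabled as soon as a server URL is defined.
    return {"cas_enabled": bool(url), "cas_server_url": url or None}
```

Leaving the variable undefined keeps CAS disabled in this sketch, which matches the "define it and you're done" spirit of the feature.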

Important fixes

security/privacy issues in REST API permissions are now all gone, thanks to sparks 5.12. Thanks to David Larlet for finding and reporting the problem.

fix decoding of some very strange emails. Makes the mailfeed engine much more robust.

fix tweets URLs.

Interface changes

no-date articles are now at the end. no visual pollution, at last.

small enhancements making the reading interface more usable on medium-sized devices (iPads 2/mini).

the staff-related article information is now in a dedicated pane (on computers, not mobile). This helps me experiment with new kinds of modern interface widgets that will help make the next 1flow interface a lot better (streamlined, one-page app…)

Other

a lot of French translation updates.

constance settings for the basic search engine. See constance config.

a bunch of fixes and small interface enhancements, broken links fixed…

#release #cas #enhancements #search #security #fixes

1flow 0.102: fix for a non-exploited privacy/security non-issue on twitter/mail accounts/feeds

This version brings an important fix regarding security / privacy.

In previous versions, all users had access to all accounts (twitter / email) created on the system to push data into their own feeds. This was not access to the account credentials (nothing could leak on that side), but access to the account itself, to pull data from it once the platform is connected.

This has very little importance on twitter because the vast majority of data is already public. Pulling tweets from one account or another doesn't change anything if you follow only unlocked accounts. As with email accounts, no access to API tokens was possible.

By the way, Twitter feeds are not yet enabled on 1flow.io so no data was pulled at all.

Concerning email, there was no data disclosed because we don't have any mailfeed (yet) deployed on 1flow.io, and other instances are private and users already share the mail accounts used to get newsletters.

Nevertheless, having another user pull data from your own IMAP account is not cool.

By the way, no user could access another user's email account without first registering one, even a fake one. The number of users having linked email accounts being very low (only beta testers), I consider this whole thing a non-issue in practice, while considering it important in theory.

Thus, this non-issue is now fixed. I've known about it from the beginning and was looking for a simple way to solve it without writing spaghetti code. That is now done, and the solution is easily transposable to any future kind of account in the platform.

PS: I repeat: when using the “newsletters to feed” feature, it's best to dedicate an email account to that; one that doesn't store your personal emails. That's a best practice, be it for 1flow or not.

Cheers, Olivier

#release #security #issue #fix

1flow 0.100 & 0.101: sharing and email-based feeds !

These 2 versions bring loads of small fixes, some cleanups for the upcoming 1.0 and two major features:

first, sharing items between users is now live. This means a lot of visible and invisible changes, the most obvious being that:

you can now magically invite your friends to the platform by just sharing items with their email address. As long as you use the same email address, they will be able to find their history of shared items when they decide to use their account (but that's completely optional).

sent items and received items reading lists are now available in the interface. No more blurred entries in the source selector.

shared items (what we call “pokes” in the 1flow world) are notified to you by email, but you can disable notifications in your preferences.

second, 1flow is now fully able to turn email-based newsletters into regular feeds. The full procedure includes as little work as possible. 1flow already ships with a processing chain suited to parse HTML newsletters.

User or interface changes

any modal can now be closed (sending its contents) with control-enter. This makes sharing particularly fast with the autocompletion feature. But it works in all dialogs.

fix navbar container width.

small CSS / visual enhancements or fixes in reading lists.

refresh the mailfeeds GUI to include processing parameters & chain directly, avoiding a round-trip and permissions for staff feeds management.

Internal / system-wide changes or fixes

add a global config directive to allow instance local users to share between each other. As 1flow instances are considered private, this is enabled by default. You should disable it on public 1flow instances.

upgrade bootstrap to 3.3.4.

export enhancements & fixes. Notably, exclude items without a publication date from date-clamped exports, and include all pages of multi-page items when present, now that the FTR extractor allows us to support multi-page web articles.

new URL crawler processor (in v0.100). Disabled everywhere except in emails for now. This will allow 1flow instances to crawl the whole web in the future, but this needs a lot of love and parameters to avoid saturating the local database before wider enabling.

the URL crawler processor did not extract URLs correctly (fixed in v0.101). This affected only emails because web crawling is not yet enabled, but it's fixed for all.

all content-related stock processors know how to handle emails.

empty articles versions are cleaned every hour. Don't clutter the database.

failed items are retried every hour / day / week. This resolves temporary network problems and makes more successfully processed items available to users.

upgrade the twitter API library, handle more errors gracefully, and handle already-known errors better.

dump/restore processors procedure enhanced to avoid corner-case errors.

make mail accounts context managers acquirable multiple times.

templates and models cleanups and refactoring, as much as possible, to avoid code obesity.

#release #sharing #email #newsletter #enhancements #fixes

1flow.io is now https only

It should have been the case for a long time, but for some reason I didn't care, or didn't manage to do it before.

For a few days now, http://1flow.io/ has been httpS://1flow.io/, for your privacy pleasure :-)

And the http virtualhost redirects to the https one, so there is no way to get the unencrypted version.

Sorry for all these insecure times until now.

There are still some rough cases where Django tries to generate HTTP URLs, but I'm already submitting patches to address them.

#ssl #security #https

1flow version 0.98 & 0.99: schema.org microdata & on-disk processors

I didn't write any announcement for the 0.98, and the 0.99 is already out. So here are all the changes:

a nearly-complete REST API (YEAH !!) for all 1flow objects (in 0.98, see below),

a whole rewrite of the processors architecture (in v0.99), allowing processors to be real on-disk python modules. This makes tracking changes and the developer's job much easier, while still allowing admins to test new processors directly in the GUI if they want.

new processors internal documentation (in 0.99) ; full sphinx documentation should show up on readthedocs in the near future.

a schema.org microdata extractor (in 0.98),

a staff button to manually refresh feeds in reading lists (idem),

a refresh of all python dependencies (idem),

small enhancements or fixes on some processors (opengraph, HTML title…) (in 0.98 and 0.99),

a small twitter feeds enhancement, which allows “API rate exceeded” events to be better detected and handled (in 0.98).

Anecdotally, we reached 909 versions. Given what is missing functionally to release 1.0 and how I release new versions, I can make the wild guess that version 1.0 will show up after nearly 1000 releases, which means exactly nothing but could be fun to observe.

REST API

The 1flow REST API is built on django-rest-framework, and gives access to nearly everything in 1flow to the outside world. It's still in its early phase though: user interface bits are missing (to generate an API token, or see the one auto-generated for you).

You still need an account on the 1flow instance, obviously not everything is public !

And there are a small number of fields missing from the API because of generic foreign keys woes. This should be fixed soon, as with the missing interfaces.

Until then, if you want to help testing the API, just drop me a mail and I'll give you your API token for full access.

About on-disk-processors

All currently shipped processors have been ported to python modules. I feel this is how definitive ones should be written, and consider writing code in the interface as a transient state only.

Recent versions have been a nice playground to test writing code directly in the DB, but in the end it is much more error-prone work and makes me lose time more than it brings any benefit for middle/long-term processor maintenance.

Developing / testing is still easier with an IDE-like editor (at least for me), and writing processors has proven to be more of a developer job than an instance administrator one in real life.

Anecdotally again, the processing framework seems to be faster than before (for obvious reasons with byte-compiled code and without DB access), but I didn't benchmark it.

#release #processors #fixes #features #rest #django-rest-framework #api #microdata

Version 0.97 : Five-Filters compatible processor

This version brings the long awaited FiveFilter processor, based on the recently published python-ftr module.

Besides that, it includes a fix in the HTML characters encoding detection and an up-to-date corresponding processor, and some minor and internal fixes to chain processing code.

Not much to say: the FTR processor was a lengthy piece of work. Nevertheless, it's only a start: it brings a huge potential quality increase in article parsing by allowing us to specify custom website parsing directives, but we now need to create and maintain specific rules for a community-wide good reading experience (there are already ~1k of them, and the count grows very regularly).

By porting the PHP code to python, my goal was to keep our user communities unified and not waste resources on different engines. In the current setup, website parsing rules are only one or two PRs away, which should ease extending them.

My best regards and a big UP to Five Filters for releasing their code under the AGPL license. This is what I call Libre Software.

#release #features #parsing #extended #metadata #quality

Version 0.96: reading history

This version brings a new feature for users: the reading history. It also enhances the way items are marked read, depending on whether they are collapsed or not.

the reading history is a new reading list (global, folder-related or subscription-related) that shows you what you read, in reverse chronological order. It answers the question “what is this super-cool article I read yesterday in the current feed ?”. Or “what are the last 2 articles I read in this feed ?”, regardless of when they were published.

next to this, when you mark an item read while it's collapsed, it will not be recorded in your history. It's like marking all read (shortcut M-A-R) : in fact you didn't really read all the articles, it's just a facility to clear your unread items. So your history will not be cluttered with these.

users have a new preference controlling whether articles marked this way are kept visible (though less prominently than others) or hidden from the current reading list, possibly after a delay.

There are still rough edges and questions:

while in your history, you can make the system “forget” some articles from it by marking them “read without history” from there. I consider this a feature for now, but the future will tell us if it's a good thing to keep this possible or if the action is not relevant in this particular reading list. What is the point of being able to remove them from the reading history ? It can be considered a real privacy feature to manually remove some articles, but doesn't it defeat the purpose in some ways ?

what to do when marking a starred item “read without history” ? Is this relevant ? I'm not sure. Either the starred state should be cleared if the item is removed from the starred reading list, or the ability to remove from history should be disabled in this particular list.

same question for the “read later” case. In this case I tend to think the “read later” state should simply be cleared when marking an item read without wanting to read it. But that's not so obvious for starred items.

Even the name, “marking read without recording it in your history”, seems a bit complicated. It could be "forget about this". This would conflict with the “mark all read” feature, which would become “forget all unread items”. Someone used to RSS readers, or any reader, would need some explanations before really grasping the whole feature and what it has to offer.

Anyway, as it helps me a lot, the reading history seems a cool feature for me, even if not yet completely finished. I will be glad to get your feedback about it.

History note: this feature has been lying dormant since 2013, in the is_auto_read attribute. Items were to be automatically marked read by the system if you didn't read them within a certain delay. This should help you avoid infinitely growing lists of unread items.

While this particular feature is not yet implemented, I will do it in the near future, it's not that complicated to code. The reading history and forgetting items marked read while they are collapsed is just an extension of this attribute : in fact they are just marked auto-read.
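To make the relation between the history and auto-read items concrete, here is a minimal sketch, assuming a simplified in-memory record. Only the is_auto_read name comes from 1flow; the Read class, its other fields and the reading_history() helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Read:
    """Hypothetical, simplified stand-in for a user's read record."""
    title: str
    date_read: Optional[datetime] = None
    is_auto_read: bool = False  # attribute name taken from the post


def reading_history(reads):
    """Return really-read items, most recently read first."""
    kept = [
        r for r in reads
        # Unread items and auto-read ones (e.g. marked read while
        # collapsed) never show up in the history.
        if r.date_read is not None and not r.is_auto_read
    ]
    return sorted(kept, key=lambda r: r.date_read, reverse=True)
```

Sorting on the read date rather than the publication date is what lets the list answer "what did I read last?" whenever the articles were published.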

See you soon,

#enhancement #user #reading #history

Version 0.95: chunked exports

This version brings a very useful feature: chunked export. Instead of waiting for the webserver to exhaust memory and never start transferring a 15-days export, you can now ask for nearly whatever you want and see it coming feed-by-feed.

Beware though: as exporting your entire database is possible, you could get into trouble either handling the resulting JSON file, or bringing your server to its knees, memory-wise.

You probably didn't know about this export feature yet. My bad: it was totally undocumented until today. That's because the feature is still a little rough for wide use.

As of now, it is designed around the use case of administrators exporting the whole database content (or a user folder) between 2 dates, to feed external systems with any kind of web-fetched contents.

If you're interested, the procedure is not straightforward, but still simple.

Note: big feeds, like twitter ones which can hold 5k+ items for a single day, can still cause pauses and hiccups in the download (depending on your hardware).
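The feed-by-feed idea can be sketched with a plain generator. This is only an illustration of the technique, not the actual 1flow export code: the iter_export_chunks() name and the shape of the feed dictionaries are assumptions.

```python
import json


def iter_export_chunks(feeds):
    """Yield a JSON array piece by piece, one feed at a time."""
    yield "["
    for index, feed in enumerate(feeds):
        # Each feed is serialized on its own, so only one feed needs
        # to be held in memory at any given moment.
        separator = "," if index else ""
        yield separator + json.dumps(feed)
    yield "]"
```

In a Django view, such a generator could be handed to django.http.StreamingHttpResponse so that the download starts immediately. As the note above says, a single very big feed still has to be serialized in one go, which is where the pauses would come from.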

#enhancement #export #release #speedups #resources

Version 0.94: HTML cleaner is back, and future plans

This version brings a massive processors pack update:

a new HTML cleaner processor, integrated in default chain. It tries to clean the downloaded HTML as much as possible. It's like what the legacy hard-coded 1flow parser did before, but I enhanced it a little, made it more robust and more unicode/encodings friendly. The way it's integrated now opens the door to future dynamic (parameterized) cleaning features.

HTML downloader : fix for unicode handling, fix logging indentation and logging strings.

rework all statsd calls in all processors for cleaned HTML count.

full newspaper processor: enhanced to extract lang/description/title/top_image if available.

bookmark-extractor processor: cleanups and fixes, re-order in default chain to benefit from the HTML downloader instead of downloading the HTML again.

in all 1flow stock processors: more comments, variable normalization, logging calls normalization.

cleaned and reset processors and default chains categories now that the processing workflow is getting more mature.

probably some minor things I forgot.

In the rest of the application, you get:

internal change: statistics about processing errors are now automatically handled by the chain, when errors are created or cleared. Writers of processors don't need to worry about it anymore.

an important fix in the transaction management of the processing chains: now a stop chain exception will not rollback the transaction anymore.

removal of ancient code, unused or now integrated in the processors.

staff members get noticeable enhancements in the management UI, like the minimal grep: filter (just try “grep:statsd” in the processors management page), and the new and as yet undocumented has: and no: filter keywords in the websites management UI, to easily filter those that have processing parameters (eg. “has:param”) or a custom processing chain set (eg. “has:chain”).

staff members also get access to the full history of every article, with direct links to inspect each version of article excerpts and contents. This will make it easier to customize HTML cleaners or specific extractors when they come (believe me, they will!).

some smaller [but still important] fixes, like twitter feeds refresh which were broken in 0.93 because of wrong arguments passing.
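For readers wondering how keyword filters like grep:, has: and no: can be told apart from free text in a search box, here is a minimal sketch. The parse_filter_query() function and its return shape are hypothetical and do not reflect the real management UI code; only the three keywords come from the list above.

```python
def parse_filter_query(query, keywords=("grep", "has", "no")):
    """Split a search string into keyword filters and free text."""
    filters = {}
    free_text = []
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and key in keywords and value:
            # "has:param" becomes {"has": ["param"]}.
            filters.setdefault(key, []).append(value)
        else:
            # Anything else is kept as a plain search term.
            free_text.append(token)
    return filters, " ".join(free_text)
```

For example, parse_filter_query("has:param no:chain lemonde") separates the two keyword filters from the remaining search term.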

Next iterations will include (in no particular order):

probably an interface to test custom chains on articles, individually.

most probably a FiveFilter processor re-implementation in python to take advantage of instapaper-like configuration files.

a full refresh of the user UI, making it work again on mobile devices (yeah I know it's been sucking for a while), and revamping the “individual article display” page because it's mostly broken and needs a big refactor with the reading list code to share code.

perhaps the first implementation of instances synchronization, to push websites & feeds descriptions & images automatically, and perhaps processors & chains, manually at first, but avoiding the need to load/dump them.

and why not a sharing feature, finally?

wow, this would need support for publishing RSS/Atom to the outside, and also integrating themes to nicely display articles to guests.

ho my, and a blog writing feature. Yes.

Note that version 1.0 is tied to the sharing feature. As it seems essential for normal users, I will probably wait until then to release 1.0. But we are clearly on the right path.

A very exciting program, if you ask me.

#release #enhancements #parsing #processing #reliability #quality #plans #future

Version 0.93: read history, staff interface enhancements

This version brings a new user feature: read history.

For each subscription and folder, and even for all your subscriptions at once, you can display which articles / tweets (anything, in fact) you read, in the reverse order you read them.

This hopefully makes it easier to find something you read recently. The global read history replaces the all articles selector link, which had virtually no use case anyway. Feel free to ask, whimper or complain if you want this link back.

Besides this, the version includes small enhancements for the staff members, like a visual refactoring of the feeds/websites management interface, and new has: and category: filters for processing chains and processors.

And last, a small fix: when marking something unread, not bookmarked, not starred or not anything, the associated date will be cleared from the database, instead of being updated to when the “undo” action is made.
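The behaviour of that last fix can be pictured with a small sketch. The set_flag() helper and the is_… / date_… key names are assumptions used for illustration, not the actual 1flow model code.

```python
from datetime import datetime, timezone


def set_flag(read, name, value, now=None):
    """Set a boolean flag on a read record and keep its date in sync."""
    read["is_" + name] = value
    if value:
        # Doing the action records when it happened.
        read["date_" + name] = now or datetime.now(timezone.utc)
    else:
        # Undoing the action clears the date instead of refreshing it.
        read["date_" + name] = None
    return read
```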

#release #enhancements #management #history

Version 0.92: new processors features, websites processing parameters

This version (and the quick 0.91 published a few days ago) brings some exciting features, and a handful of fixes:

a new “excerpt to content” processor. It's best used on RSS feeds that produce already good content, avoiding the need to fetch the full article. In a few of these cases, the RSS content is not only already full, but ad-less, while fetching the article will clutter the content with unwanted things. With this processor (internally called 1fs-content-from-excerpt) you can completely bypass the downloading phase, and even the HTML to markdown (internally called content) phase. See doc for details. This processor is already included in the default chain, and activated for some web sites on 1flow.io.

some updates to processors and chains names, refreshing them to better reflect the features they offer.

dead-feeds refreshing: suggested long ago by kouio, we check every day some feeds that are considered dead, to see if they came back to life. Coupled with the auto-reopen-on-external-subscribe feature, this should keep the number of dead feeds as low as possible on the platform, while avoiding consuming too many resources on feeds that are really dead.

documentation updates. The processing framework starts to mature to my taste, and the documentation tries to reflect this.

enhanced interface for staff members, regarding web sites and processing chains. Parameter icons are now in a stronger color when parameters are set, so that the manager can quickly see which web sites / chains have parameters set, and which do not. I will probably code a dedicated query filter (with a shortcut button) soon.

the load_processors management command will now reset SQL sequences at the end of the process, thus allowing you to create processors/chains without problems. Before, in some conditions, you could not create custom local processors/chains if you had already loaded the 1flow processors pack.

refined model cache configuration: on big 1flow instances, this will bring a noticeable response speedup for users that have > 200 feeds.

#release #features #speedups #enhancements #fixes #processing #documentation

Sparks 5.1 released

This version brings:

import the benchmark function from 1flow to mutualize between projects.

create a new symlookup() template tag to get symbolic value of NamedTupleChoices.

allow the lookup() tag to work on named tuple choices, not only dict(). For NTC, it will return the description, given a hash key.

allow reversing the named tuple choice to get the symbolic values and descriptions without regenerating a dict() every time.

views.SortMixin: decompress the sort_by parameter when it's an iterable.

views.SortMixin: log the exception so that the developer can know whether something went wrong.

honor is_staff_or_superuser_and_enabled (if present; it's 1flow-specific and thus completely optional) in OwnerQuerySetMixin.

new native_filters feature for the filter mixin (self documented in class).

fabfile: import django settings at some point to get PROJECT_ROOT.

Don't crash when using the new_fixture_name() function outside of a Fabric environment. Use Django settings, and '.' if settings not available.

Allow to specify a custom suffix to fixture names if wanted.

Enhance the Django-Rest-Framework logger, make it work with PATCH requests.

It's needed for recent versions of 1flow (eg. > 0.80) to take advantage of the load_processors and dump_processors management commands, and to get detailed item statuses in reading lists.

#release #sparks #features

Version 0.90: opengraph processor and small enhancements

This version embodies the new processor architecture. It brings:

the big new feature: a new opengraph processor, which helps gathering metadata of articles that come from wild sources (eg. Twitter or classic web import, with no RSS metadata present, but only a URL). The current implementation deals with title, published date (the most important in my eyes), tags, language and description. I will implement author next, but I couldn't find any website that implements this field to test the processor on.

new processor categories (metadata and test) and new chains, each with a defined purpose. Notably the test opengraph extractor chain name is self-explanatory, but it's now easier to test a chain on any source. This live-test facility will hopefully get an interface soon. In the current state, processors/chains tests need to be run in the 1flow shell; but most of the work can be done via the interface. Doing it all via the interface would speed up processor development a lot, and it's planned anyway.

small fixes on the user side; the article panes of an already read article are no longer empty after closing the reading screen. There is no longer “by ” (with no author) or “in ” (with no tag) in side panes or reading lists.

chained items (in processing chains) bare graphical interface. Not as polished as I would like (lacks autocompleter for processor / chain, which is the “next big thing” to code in this particular area).

bare processors parameters implementation, allowing for example to specify a custom user-agent for the HTML downloader. The parameters format is YAML, which seems the easiest to read/write for humans while staying compatible with python function arguments passing. See note below.

management interface enhancements; the most notable one being that ESC and click-outside will not close source code modals, which has been causing a lot of lost work while editing processors code recently.

small performance enhancements with lower database hits in some low-level — and much used everywhere — model properties, and factorized code between processors and models to avoid desynchronization on condition changes.

internal code cleanups (a bunch of code moved to processors is now officially retired from models code).

Interface note: we still lack the processing category interface, but as categories are manageable via Django admin, I consider this harmless. Not many people are going to create processing categories every day. At most you will create one or two categories for your own processors, which hardly calls for a dedicated, efficient management interface.

Processors parameters note: there is absolutely nothing enforced/secured on this feature for now. Expect unexpected crashes if you tweak your own processor parameters. See the Legacy simple HTML downloader processor for a live example.
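To illustrate why a mapping of parameters stays "compatible with python function arguments passing", and what a basic safety check could look like, here is a minimal sketch. Everything in it is hypothetical: fake_downloader() is a stand-in that downloads nothing, the dictionary plays the role of already-parsed YAML parameters, and the validation shown here is exactly what the note above says 1flow does not do yet.

```python
import inspect


def fake_downloader(url, user_agent="1flow", timeout=30):
    """Hypothetical stand-in for a processor; it downloads nothing."""
    return {"url": url, "user_agent": user_agent, "timeout": timeout}


def call_with_parameters(func, parameters, *args):
    """Call func with parameters given as a mapping (e.g. parsed YAML)."""
    accepted = inspect.signature(func).parameters
    unknown = sorted(set(parameters) - set(accepted))
    if unknown:
        # Fail early with a readable message instead of a bare TypeError.
        raise ValueError("unsupported parameters: " + ", ".join(unknown))
    # The mapping is simply unpacked as keyword arguments.
    return func(*args, **parameters)
```

A parameters document containing a single user_agent key would thus end up as one keyword argument of the processor function, while a misspelled key is rejected before the call.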

See you for next release. Not yet 1.0, but we are approaching it every day :-)

#release #features #processors #opengraph #enhancements #bugfixes #interface #tests

Version 0.89 : lowering database hits and new staff tools

This version brings a few user-invisible — but extremely useful nonetheless — changes:

integration of the cacheops django module, to avoid the database hammering introduced by the processor infrastructure. I much prefer hitting Redis rather than PostgreSQL to (re-(re-))load processors and processing chains at every article.

addition of interface button to trigger selective or global cache clearing; this helps make static-files and templates upgrades immediately visible to the user, or fix Javascript issues when non-cached templates fragments are desynchronized with too-aggressively cached JS fragments (or the other way around).

resolution of some small bugs like logging strings arguments or corner-case typos (yeah I know, coverage-driven tests could have helped find these earlier ; help welcome)…

The first point still needs love and thinking. We could probably benefit from a dedicated task fetching and parsing chunks of articles at once in a batched manner. Chains and Processors could remain in memory, avoiding repeated and useless hits to any database (even Redis will not be touched thanks to the Django internal instance cache).

At worst, this would only speed things up. It could also lower the resources needed by celery, at the cost of less parallelization.

But as the workers are currently all on the same machine, this would certainly bring a benefit, be it small.

Some satisfying level of parallelization could still be achieved by lowering the size of the chunks of articles to fetch (in the constance configuration, for example), while letting workers handle more jobs before being dropped from memory.

But this last point needs particular caution, because of external parser's memory leaks that greatly affect the machine. This issue is the only one that made me drop workers after a fixed amount of processing.

In case our new processors can be made to leak less (or not at all!), we could raise the number of jobs a worker can handle in its life, and thus again benefit from the Django internal cache.

One single optimization, multiple levels of benefits. I love this! But hunting in our current (very efficient) Cython-based parser is a very difficult task, and also very error prone.
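The batched approach discussed above can be sketched as follows. This is only an illustration under assumptions: process_in_batches(), load_chain and the default chunk size are hypothetical names and values, and a real implementation would live in a celery task.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_in_batches(articles, load_chain, chunk_size=50):
    """Process articles chunk by chunk, loading the chain once per chunk."""
    for chunk in chunked(articles, chunk_size):
        # The chain stays in memory for the whole chunk, so no cache
        # or database is hit again for every single article.
        chain = load_chain()
        for article in chunk:
            chain(article)
```

A smaller chunk_size gives back some parallelization (more, shorter jobs), at the price of loading the chain more often.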

Back to code. See ya,

#release #features #database #optimisations #speed #enhancements #fixes #PostgreSQL #celery #Python

Major speed enhancements on http://1flow.io

Hello all,

Those who have not been frustrated enough by the recent sluggishness have probably noticed that most of the speed problems have been resolved on http://1flow.io.

How did I resolve the problem ? I just shut MongoDB down.

The PostgreSQL code has been reliable for long enough to run standalone, and I finally took the time to deactivate all mongodb related code and tasks.

In fact, most of the [ported] queries run faster on the SQL engine, even with inheritance and other goodies enabled. The machine has started to enjoy a quasi-permanent ~5 system load (even with sentry, stats, mail server, and other services running locally…).

MongoDB freed a gorgeous 47Gb of RAM, immediately used by the OS cache for full-in-memory table and index cache :-)

There are still some hiccups (sometimes a particular page takes a few seconds to respond for no obvious reason) and some known-to-be-slow feeds (eg. the “all items” or “all unread” ones). The latter will be addressed by some upcoming code optimizations, and the former probably with some pjaxer magic.

See you back on http://1flow.io/ !

#speed #enhancements #memory #database #indexes #cache #MongoDB #PostgreSQL

Version 0.88: custom content processors!

This is the long-awaited release that brings full customizability to web content processors.

There is a lot to say about it :

the starting point is that it should never lower the parsing quality of a 1flow instance. New parsers are applied selectively and can be tested on some articles before being applied globally to a website. Other websites aren't affected, except in some very unusual conditions you should never reach.

There is already a new parser bundled with 1flow (based on the newspaper python module), that has already solved empty-content-problems on some websites.

The pure-1flow interface for this feature is not yet finished (it lacks the ChainedItem part, which is the trickiest to make simple), but there is an interim interface in the Django Admin.

processing errors can now be individually consulted in the Django Admin.

you can write your own content parsers, without needing to submit them to 1flow via a PR or anything.

in the near future, your 1flow instances will be able to share their custom parsers via the generic synchronization mechanism, which is underway. There will also be a way to share parsers globally with the 1flow community, via a peer-based review system.

As you can see, the newspaper processor's accept code is a no-op: it calls a shared accept routine that suits any processor that fetches and parses an article's content. There is no need to duplicate code, even between your own processors (see documentation).

And here is the process code:

This processor is only 26 lines long, including imports and logging… It lacks a few comments, but you get the point : writing a new parser is FAST.

Testing your processor

You can test it right away, without bothering much with the Python shell. You can even test it on real articles, thanks to the new staff insert in the article view:

Then in the django management shell:

a = Article.objects…
# not needed every time.
# a.reset()
# a.absolutize_url()
a.fetch_content()

And you see right there how your processor behaves.

If you need something more packaged, just attach your new processor chain to the website, and run:

./manage.py reparse_articles <website_ID_or_URL> test

A random article will then be processed, and its new content printed for you. If you're satisfied, reparse all articles with the new processor chain:

./manage.py reparse_articles <website_ID_or_URL>

And you're ready to stare at some matrix-like screen.

Sharing processors with other people

In the meantime, sharing your content parsers (which are processors and processing chains in the 1flow terminology) is still easy, via the use of 2 management commands:

./manage.py dump_processors

On one side, and:

./manage.py load_processors

On the other side. Provided you copy the file created by dump_processors to the machine where you will run load_processors, the commands take care of the rest and tell you everything you need to know.

I will gradually write new content parsers for a paying customer, and with luck they will benefit the whole 1flow community. I will also finish the management interface for them, and implement parameter-driven processors, so that they can be used to their full potential.

It's not currently an achievement per se. It's a start. Processors are a very generic mechanism, and many currently-fixed things in 1flow will switch to processor-based code (post-processing of original data, processing of documents of any type such as PDFs, ODT, images, audio…), allowing nearly full customization of any processing chain at any point.

Think of a dead-simple plugin (only 2 methods to implement) system that has access to every part of the engine, and you have only a vague idea of what can be done.
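To make the "only 2 methods" idea concrete, here is a minimal, self-contained sketch of an accept/process plugin chain. It is not 1flow's actual code or API: the Item class, both toy processors, the run_chain helper, and the "first processor that accepts and succeeds wins" rule are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical stand-in for an article (not a 1flow model)."""
    url: str
    raw_html: str
    content: str = ""
    errors: list = field(default_factory=list)


def _strip_tags(html):
    """Replace tags with spaces, then collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


class ExampleSiteProcessor:
    """Toy site-specific processor: keeps only the <article> element."""
    name = "example-site"

    def accept(self, item):
        # Only applies to one (hypothetical) website.
        return "example.com" in item.url

    def process(self, item):
        match = re.search(r"<article>(.*?)</article>", item.raw_html, re.S)
        if match is None:
            raise ValueError("no <article> element found")
        item.content = _strip_tags(match.group(1))


class StripTagsProcessor:
    """Toy generic fallback: accepts anything that has HTML."""
    name = "strip-tags"

    def accept(self, item):
        return bool(item.raw_html)

    def process(self, item):
        item.content = _strip_tags(item.raw_html)


def run_chain(chain, item):
    """Run processors in order; the first one that accepts the item and
    succeeds wins. Failures are recorded on the item, not raised.
    Returns the winning processor's name, or None."""
    for processor in chain:
        if not processor.accept(item):
            continue
        try:
            processor.process(item)
        except Exception as exc:
            item.errors.append(f"{processor.name}: {exc}")
            continue
        return processor.name
    return None


chain = [ExampleSiteProcessor(), StripTagsProcessor()]
item = Item("http://example.com/a",
            "<nav>menu</nav><article><p>Hello <b>world</b></p></article>")
winner = run_chain(chain, item)
```

In this sketch the site-specific processor only touches the website it accepts, the generic one acts as a fallback, and every failure is kept on the item so it can be inspected individually afterwards.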

YEAH !

#feature #release #content #parsing #quality #customization

Version 0.87 : staff management tools, fixes, and engine changes

This version brings:

bug fixes for users: the main one being that subscribing to a feed via the bookmarklet works again (in fact it worked but created the feed in the old MongoDB database…); other small fixes included.

data export enhancements:

ability to export all database content (feeds, articles, tweets…)

ability to export a folder (with subfolders) only,

ability to match the export to the contents of a user selector. This one is best explained with an example: say your 1flow instance has 30 feeds and user A is subscribed to 15 of them; exporting the whole DB as this user will yield the 15 subscribed feeds with the folders they belong to attached, while the 15 others have no folder attached because the user is not subscribed to them.

new staff views to ease management of feeds and websites, because the Django admin sucks on this side.

a lot of template re-organizing and refactoring, resulting in less code overall, and more snippets/widgets to include in various places without code duplication.

a lot of internal and invisible changes to prepare the upcoming processors architecture that will be part of the next version. The processors are the things that will allow implementing custom parsers and language-specific pre/post-processing on any content, fully configurable, and nearly accessible to end-users (at least staff members). See this, and understand I'm done with most parts of the underlying engine. I've already coded some processors, and translated the fixed 1flow legacy parsing chain into a dynamic one that will soon be enhanced with many bells and whistles. What's coming is HUGE. Documentation will give you another idea, and it reflects what's implemented in my develop branch.

as a prerequisite for the previous point, all 1flow main models are now fully historized. This means that versions of database objects are now available, at least internally (I've yet to write the interfaces to compare them and switch/restore a given one).

The data export is not yet documented. Sorry for that, I lack time to do it properly. Please contact me directly if you need the feature.
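Until then, the user-matched export described in the list above can be pictured with a small sketch. This is not the real export code or format (which is undocumented): the export_feeds function, the feed names, the "News/Tech" folder, and the dictionary layout are all hypothetical and only mirror the 30 feeds / 15 subscriptions example.

```python
def export_feeds(all_feeds, subscriptions):
    """Export every feed; attach a folder only where the user is subscribed.

    all_feeds: list of feed names (hypothetical).
    subscriptions: dict mapping a subscribed feed name to its folder path.
    """
    return [
        {"feed": name, "folder": subscriptions.get(name)}
        for name in all_feeds
    ]


# 30 feeds on the instance, user A subscribed to the first 15.
all_feeds = [f"feed-{n:02d}" for n in range(1, 31)]
subscriptions = {name: "News/Tech" for name in all_feeds[:15]}

exported = export_feeds(all_feeds, subscriptions)
with_folder = [entry for entry in exported if entry["folder"] is not None]
without_folder = [entry for entry in exported if entry["folder"] is None]
```

All 30 feeds end up in the export; only the 15 the user is subscribed to carry a folder.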

#features #release #upcoming #huge #versions #export #database #history #documentation #thrilling
