Version 0.89: lowering database hits and new staff tools
This version brings a few user-invisible, but nonetheless extremely useful, changes:
integration of the cacheops Django module, to avoid the database hammering introduced by the processor infrastructure. I much prefer hitting Redis over PostgreSQL to (re-(re-))load processors and processing chains for every article.
addition of an interface button to trigger selective or global cache clearing; this helps make static-file and template upgrades immediately visible to the user, or fix JavaScript issues when non-cached template fragments are out of sync with too-aggressively cached JS fragments (or the other way around).
resolution of some small bugs, like logging string arguments or corner-case typos (yeah I know, coverage-driven tests could have helped find these earlier; help welcome)…
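To make the first item more concrete, here is a minimal sketch of what a cacheops setup for processors and chains could look like in the Django settings. The `core.processor` and `core.processingchain` labels, the Redis host, port and database, and the one-hour timeout are all assumptions for illustration, not the project's real values; `cacheops` also has to be listed in `INSTALLED_APPS`.

```python
# settings.py (fragment): a hypothetical django-cacheops setup.
# All names and values below are illustrative assumptions.

# Redis connection used by cacheops.
CACHEOPS_REDIS = {
    "host": "localhost",
    "port": 6379,
    "db": 1,
}

# Cache all read operations on the (hypothetical) models behind
# processors and processing chains for one hour. Cacheops invalidates
# the cached entries itself when the matching rows are saved or deleted.
CACHEOPS = {
    "core.processor": {"ops": "all", "timeout": 60 * 60},
    "core.processingchain": {"ops": "all", "timeout": 60 * 60},
}
```

For the cache-clearing button, cacheops exposes `invalidate_all()` and `invalidate_model()`; a staff view could plausibly call one or the other depending on whether a global or selective clearing is requested, but this is a guess at the wiring, not a description of the actual view.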
The first point still needs love and thinking. We could probably benefit from a dedicated task that fetches and parses articles in batches. Chains and processors could then remain in memory, avoiding repeated and useless hits to any database (even Redis would not be touched, thanks to the Django internal instance cache).
At the very least, this would speed things up. It could also lower the resources needed by Celery, at the cost of less parallelization.
But as the workers are currently all on the same machine, this would certainly bring a benefit, however small.
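The batching idea can be sketched in plain Python, without Django or Celery. Everything here is hypothetical: `load_chain()` stands in for the database (or Redis) lookup of a processing chain, and the two string methods stand in for real processors. The only point is that the chain is loaded once per chunk instead of once per article.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Counts how many times the (simulated) storage is hit.
load_count = 0


def load_chain() -> List[Callable[[str], str]]:
    """Hypothetical stand-in for loading a processing chain from storage."""
    global load_count
    load_count += 1
    return [str.strip, str.lower]


def process_batch(articles: List[str]) -> List[str]:
    """Process a whole chunk with a chain loaded once for the chunk."""
    chain = load_chain()
    results = []
    for article in articles:
        for processor in chain:
            article = processor(article)
        results.append(article)
    return results


articles = ["  First ", "SECOND", " Third  ", "FOURTH ", "fifth"]
processed = [
    result
    for chunk in chunked(articles, 2)
    for result in process_batch(chunk)
]
```

With five articles and a chunk size of 2, the chain is loaded three times instead of five. Larger chunks mean fewer loads but fewer independent jobs to spread across workers.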
A satisfying level of parallelization could still be achieved by lowering the size of the article chunks to fetch (in the constance configuration, for example). Another lever would be letting workers handle more jobs before being dropped from memory.
But this last point needs particular caution, because of the external parser's memory leaks, which greatly affect the machine. This issue is the only reason I drop workers after a fixed amount of processing.
If our new processors can be made to leak less (or not at all!), we could raise the number of jobs a worker can handle in its lifetime, and thus again benefit from the Django internal cache.
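For reference, Celery has settings for recycling workers after a fixed amount of work. I am not claiming this is exactly how it is wired here; the fragment below is a sketch using the lowercase setting names from Celery 4 and later (older releases spell the first one `CELERYD_MAX_TASKS_PER_CHILD`), and both values are made up.

```python
# celeryconfig.py (fragment): hypothetical values, for illustration only.

# Replace a pool worker process after it has executed this many tasks.
# Raising it keeps in-memory caches alive longer, but gives a leaking
# parser more time to eat memory.
worker_max_tasks_per_child = 100

# Alternative safety net: replace a worker once its resident memory
# exceeds this amount, expressed in kilobytes (about 200 MB here).
worker_max_memory_per_child = 200_000
```

If the leaks are reduced, the first value is the one to raise; the memory limit can stay as a guard in case a leak comes back.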
One single optimization, multiple levels of benefits. I love this! But hunting leaks in our current (very efficient) Cython-based parser is a very difficult and error-prone task.