Niels Hageman: Reliable distributed task scheduling

published May 22, 2015

Niels Hageman talks about reliable distributed task scheduling at PyGrunn.

See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.

I am a member of the operations team at Paylogic, responsible for deployments and support.

The subject: handling tasks that are too lengthy for a request/response loop.

Option 1: Celery plus RabbitMQ as backend. Widely used, relatively easy. RabbitMQ proved unreliable for us. Our two queues in two data centres would go out of sync and you needed to reset the queue: manual work and you lose data. This is a known problem. Also, queues could get clogged: data going in but not out.

Option 2: Celery plus MySQL as backend. So no separate queue, just our already running database. But it was not production ready, not maintained, buggy.

Option 3: Gearman (with MySQL). The Python bindings were again buggy. It could run only one daemon at a time. By default there is no result store.

Option 4: do it yourself. Generally not a great idea, but it does offer a "perfect fit". We built a minimal prototype, which grew into Taskman.

MySQL as backend: it may be fine, but it is not a natural fit. "Thou shalt not use thy database as a task queue." Polling: there is no built-in event system, though there are hacks that simulate one. Lock contention between tasks is a bit of a hard problem. Some mitigations: enable autocommit, so you do not need a separate commit; use a plain SELECT instead of SELECT FOR UPDATE; accept fuzzy delays. Data growth: the queue grows over time, but you can remove old data.
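
A minimal sketch of claiming a task without SELECT FOR UPDATE: pick candidates with a plain SELECT, then claim one atomically with a conditional UPDATE; if another worker got there first, the rowcount is 0 and you try the next candidate. (sqlite3 stands in for MySQL here, and the table and column names are my own illustration, not Taskman's actual schema.)

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO task (state) VALUES ('pre-run'), ('pre-run')")

def claim_next_task(conn):
    # Plain SELECT: no locks held while we look at candidates.
    candidates = conn.execute(
        "SELECT id FROM task WHERE state = 'pre-run' ORDER BY id").fetchall()
    for (task_id,) in candidates:
        # The conditional UPDATE is the atomic claim: only one worker
        # can flip the state from 'pre-run' to 'running'.
        cur = conn.execute(
            "UPDATE task SET state = 'running' "
            "WHERE id = ? AND state = 'pre-run'", (task_id,))
        if cur.rowcount == 1:  # we won the race; the task is ours
            return task_id
    return None  # nothing claimable: sleep and poll again later

print(claim_next_task(conn))  # claims task 1
```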

The task server is a Python daemon managed by supervisor. Its loop: claim a task from the database, spawn a runner, sleep, repeat.
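
That loop could look something like this sketch; `serve_once`, `POLL_INTERVAL` and the callback names are illustrative, not Taskman's real API.

```python
import time

POLL_INTERVAL = 5  # seconds to sleep when no task is waiting (assumed value)

def serve_once(claim_next_task, spawn_runner, sleep=time.sleep):
    """One iteration of the daemon loop: claim, spawn, or sleep."""
    task_id = claim_next_task()
    if task_id is None:
        sleep(POLL_INTERVAL)   # queue empty: wait before polling again
        return None
    spawn_runner(task_id)      # hand the task to a separate runner process
    return task_id

def serve_forever(claim_next_task, spawn_runner):
    while True:                # supervisor restarts the daemon if it dies
        serve_once(claim_next_task, spawn_runner)
```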

The task runner is a separate process. It sets up the Python environment in which the task runs, runs the task, and does the post-mortem: get the result and report it back.
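
In spirit, the runner's job is something like the sketch below: apply the environment, call the function, then report either the JSON-encoded result or the failure. The task dict shape and `record_result` callback are assumptions for illustration.

```python
import json
import os
import traceback

def run_task(task, record_result):
    """Run one task and report back: the runner's setup/run/post-mortem."""
    os.environ.update(task["environment"])  # environment setup for the task
    try:
        value = task["function"](*task["args"], **task["kwargs"])
        # Post-mortem on success: results are stored as a JSON string.
        record_result(task["id"], "ended/finished", json.dumps(value))
    except Exception:
        # Post-mortem on failure: store the traceback instead of a value.
        record_result(task["id"], "ended/failed", traceback.format_exc())
```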

Task record (database row), simplified: function, positional and keyword arguments, environment, version of the environment, state (pre-run, running, ended/finished, ended/failed), return_value.
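
As a table, that record could look roughly like this (sqlite3 standing in for MySQL; the exact column names and types are my guess, only the fields come from the talk):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task (
        id           INTEGER PRIMARY KEY,
        function     TEXT NOT NULL,  -- dotted path to the callable
        args         TEXT NOT NULL,  -- positional arguments, as JSON
        kwargs       TEXT NOT NULL,  -- keyword arguments, as JSON
        environment  TEXT NOT NULL,  -- which application the task belongs to
        env_version  TEXT NOT NULL,  -- version of that environment
        state        TEXT NOT NULL,  -- pre-run / running /
                                     -- ended/finished / ended/failed
        return_value TEXT            -- JSON result, once ended
    )
""")
```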

The task server is an independent application. It does not know about the application whose tasks it actually runs. Applications register with the server via a plugin that provides four methods: get_name, get_version, get_environment_variables, get_python_path. The result of a task must be a JSON string.
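
A plugin built from those four methods could look like this sketch; the class name, the absence of a base class, and the example return values are all assumptions.

```python
class MyAppPlugin:
    """Tells the task server how to run tasks for one application."""

    def get_name(self):
        return "myapp"

    def get_version(self):
        return "1.0"

    def get_environment_variables(self):
        # Anything the task process needs in its environment,
        # e.g. a settings module (hypothetical example).
        return {"DJANGO_SETTINGS_MODULE": "myapp.settings"}

    def get_python_path(self):
        # Where the application's code lives (hypothetical path).
        return ["/srv/myapp/src"]
```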

A task can report progress by accessing its own task instance, so it can say "20% done". Tasks can be aborted. Task start time can be constrained: if, say, a task has not started within ten minutes, delete it because it is no longer useful. There is exception handling.
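
The progress/abort idea could be sketched like this: the task gets hold of its own instance and uses it to report progress and to check for a cooperative abort. `set_progress` and `is_aborted` are illustrative names, not Taskman's real API, and `FakeTask` is a stand-in for the real task instance.

```python
import json

class FakeTask:
    """Stand-in for a task instance the running task can access."""
    def __init__(self):
        self.progress = None
        self.aborted = False
    def set_progress(self, message):
        self.progress = message  # e.g. written back to the task's row
    def is_aborted(self):
        return self.aborted      # the server flips this to abort the task

def crunch_numbers(task, items):
    done = []
    for i, item in enumerate(items, start=1):
        if task.is_aborted():    # cooperative abort: check, then stop
            break
        done.append(item * 2)
        task.set_progress("%d%% done" % (100 * i // len(items)))
    return json.dumps(done)      # task results must be a JSON string
```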

Taskman's properties are optimized for long-running tasks, not for minor ones. It is designed for reliability: tasks will not get lost, they will get executed, and they will get executed only once. It is less suited for a blizzard of lots of small tasks, more for heavy database processing. There is no management interface yet; if you must, you can currently use phpmyadmin...

I really want it to be open source, but we have not gotten around to it yet. We first want to add package documentation on how to set it up.