Weblog

Oscar Vilaplana: Orchestrating Python projects using CoreOS

published May 22, 2015

Oscar Vilaplana talks about Orchestrating Python projects using CoreOS, at PyGrunn.

See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.

This is about orchestrating stuff. Not specifically Python, but the examples will be with Python.

Reliability: your app does not care where it is running; it runs locally or in the cloud, same thing. Portability, repeatability. Loose coupling: compose microservices, which makes it easier to scale, mix and extend them.

Cluster-first mentality. Even development machines can run in clusters. Lots of different containers, servers and ports, all connected: how do you manage this? Do you need to be developer and operations in one? Let the system figure it out; let other smart people determine where the containers run.

One deployment tool for any service makes the tooling better and better, and a new service is then not much extra work.

Demo: deploy a database and a Flask app, and scale them. RethinkDB with 3 replicas. Give the service a name and connect to it from the app by that name. With kubectl, start the service with the desired number of replicas.
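
A rough sketch of what the app side of such a demo could look like (this is not the speaker's code; the service name "rethinkdb", the database and the table are assumptions):

    # Minimal sketch: a Flask app that connects to RethinkDB by its
    # service name instead of a hard-coded IP. The service name
    # "rethinkdb", database "test" and table "visits" are made up.
    import rethinkdb as r
    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        # Service discovery resolves "rethinkdb" to whichever
        # pod(s) currently run the database.
        conn = r.connect(host="rethinkdb", port=28015)
        count = r.db("test").table("visits").count().run(conn)
        conn.close()
        return "visits so far: %d" % count

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)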

CoreOS: kernel + Docker + etcd. Read-only root, no package manager, automatic self-updates, systemd for setting limits on a service and making sure it starts and is restarted when it stops. Socket activation: starting a service only when someone starts using it.

etcd is a distributed configuration store. Atomic compare-and-swap: change this setting from 1 to 2, otherwise fail. HTTP API. Configurable at runtime: etcdctl set /pygrunn/current_year 2015
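
As an illustration of that HTTP API, a small sketch using requests against the etcd v2 keys API (the endpoint and key are assumptions for a local etcd):

    # Sketch of talking to etcd's v2 HTTP API; adjust host/port for
    # your cluster.
    import requests

    BASE = "http://127.0.0.1:2379/v2/keys"

    # Plain set, like: etcdctl set /pygrunn/current_year 2015
    requests.put(BASE + "/pygrunn/current_year", data={"value": "2015"})

    # Atomic compare-and-swap: only update if the current value is 2014,
    # otherwise etcd reports an error and nothing changes.
    resp = requests.put(
        BASE + "/pygrunn/current_year",
        data={"value": "2015"},
        params={"prevValue": "2014"},
    )
    print(resp.json())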

fleet is a distributed systemd: start a service somewhere in a cluster, with coordination across the cluster. Rules like: do not start A and B on the same server, because they are both CPU hungry.

Service discovery: ambassador pattern, talk to someone who knows where the services are.

flannel does something like that. Per-cluster and per-machine subnets: every container gets an IP in a subnet.

Kubernetes: a process manager. A pod is the unit of scheduling: a name, container image, resources. Labels are used for finding things.

Replication controllers create pods from a template and ensure that exactly the wanted number of pods is running. They also handle upgrades.

Service discovery for pods. An ip per service.

Demo.

It took me a while to wrap my head around it. Look at the slides in the calm of your home.

Me on Twitter: https://twitter.com/grimborg

And see http://oscarvilaplana.cat

Lars de Ridder: Advanced REST APIs

published May 22, 2015

Lars de Ridder talks about Advanced REST APIs, at PyGrunn.

See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.

See slides at http://todayispotato.github.io/rest-design-talk

I am head of tech at Paylogic.

Goals of REST: loose coupling between web client and server. Use existing web infrastructure.

Step 1, 2, 3: design your process.

Example with data as the basis for the modeling. Simplistic database tables: CoffeeType, CupOfCoffee, Order, Barista. So you GET or POST to /coffeetype, /cupofcoffee, /order, /barista. A POST will have keyword arguments, like the number of cups. What is missing? You want to get the price without ordering first.
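
A hypothetical sketch of such a table-driven API, just to make the problem concrete (endpoint and field names are invented, this is not from the talk):

    # Hypothetical sketch of the table-driven design being criticised:
    # one endpoint per database table, and the client is left to work
    # out things like the price itself.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/coffeetype", methods=["GET"])
    def coffee_types():
        return jsonify(types=["espresso", "latte"])

    @app.route("/order", methods=["POST"])
    def place_order():
        data = request.get_json()
        # "number of cups" arrives as a keyword argument; there is no
        # way to ask for a price without actually creating an order.
        return jsonify(order_id=1, cups=data["cups"]), 201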

So far this is easy to design, and the server is easy to build. But you get logic in the clients, like computing the price, and tight coupling between API design and database design. Your table names should not determine your API names.

Model your process as seen from the end user. For every step, 'invent' a resource. In our case they might be /coffeetypes, /quote, /orders (or /payments), maybe /baristas. For every resource, determine the relations; this is the most important part. Never rely on URLs, but use link relations; IANA.org has defined standards for this. Also consider which data is involved for every resource.

Media types for APIs: a standard for how to format your (JSON) response. There are standards for this, so do not reinvent the wheel. We use HAL. It is minimalistic, basically only describing how to embed information. By visiting a URL you can discover other URLs that you can call. Others: http://jsonapi.org, Mason, http://jsonpatch.com, http://json-schema.org.
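
As an illustration with made-up data: a HAL-style representation embeds its links, so a client follows link relations instead of constructing URLs itself:

    # Made-up example of a HAL-style representation of an order.
    # The client follows the "payment" link relation instead of
    # building the URL from a table name.
    order = {
        "_links": {
            "self": {"href": "/orders/12"},
            "payment": {"href": "/orders/12/payment"},
        },
        "_embedded": {
            "quote": {
                "_links": {"self": {"href": "/quote?type=espresso&cups=2"}},
                "total": "4.20",
                "currency": "EUR",
            },
        },
        "status": "pending",
    }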

"I don't need to write documentation, my API is discoverable." This is of course not true. Discoverable APIs help when you are developing an application that uses the API. But do document the process that apps should use.

You should learn HTTP, learn what the verbs really mean. REST really is HTTP.

We chose to evolve our API instead of versioning it, using deprecation links.

Erik Groeneveld: Generators

published May 22, 2015

Erik Groeneveld talks about generators, at PyGrunn.

See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.

I started my own company, Seecr. This talk is about PEP 342 (Coroutines via Enhanced Generators) and PEP 380 (Syntax for Delegating to a Subgenerator) and beyond. Most of this session is detailed on http://weightless.io/compose

I will talk about five simple things. Once you start working with it, it can get quite complicated, but it is really great.

Basis

Basis: list comprehensions with round instead of square brackets. That gives you a lazy generator instead of a completely populated list. You call .next() on the result to get the next item. Very useful for creating lazy and efficient programs.
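
For example:

    # A generator expression: nothing is computed until you ask for it.
    squares = (n * n for n in range(10))
    print(next(squares))  # 0  (next(g) here; g.next() on Python 2)
    print(next(squares))  # 1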

Generator functions: using yield instead of return automatically turns a function into a generator.
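
For example:

    # The yield turns this function into a generator function.
    def countdown(n):
        while n > 0:
            yield n
            n -= 1

    for i in countdown(3):
        print(i)  # 3, 2, 1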

Generalized generators, PEP 342

Accepting input in a generator: use .send(value) to send a new value into a generator. The generator then goes both ways, and you basically have a coroutine.
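
A small illustration:

    # A generator that also accepts input via send(): a coroutine.
    def running_total():
        total = 0
        while True:
            value = yield total  # hand out the total, wait for a new value
            total += value

    gen = running_total()
    next(gen)            # advance to the first yield
    print(gen.send(5))   # 5
    print(gen.send(10))  # 15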

Decomposition, PEP 380

Decomposing a function into two functions is easy. With generators it was not possible; now it is. Add the @compose decorator to the main generator and call yield sub() on the sub-generator. On Python 3.3 and higher it is nicer: yield from sub(). But you need two new concepts to write real programs with it.
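
A minimal illustration in Python 3.3+ syntax (not code from the talk):

    # Decomposing one generator into two with "yield from".
    def sub():
        yield 1
        yield 2

    def main():
        yield 0
        yield from sub()  # delegate to the sub-generator
        yield 3

    print(list(main()))   # [0, 1, 2, 3]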

Push back, beyond

Implicitly, at the end of a normal function there is a return None if there is no explicit return. For generator functions you implicitly get raise StopIteration. With Python 3.3 and higher you can explicitly write return 42, which translates into raise StopIteration(42), to return a value.
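
A minimal Python 3.3+ illustration:

    # A return inside a generator raises StopIteration(42); "yield from"
    # delivers that value as the value of the expression.
    def answer():
        yield "working..."
        return 42

    def main():
        result = yield from answer()
        yield result

    print(list(main()))  # ['working...', 42]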

But you can also do raise StopIteration(42, 56, 18). We use this to push data back into the stream: the main generator will get 42, and the next main yields will give you 56 and then 18.

None protocol, beyond

If you yield nothing, you yield None, so yield None is the same. With yield None you get the generator ready to receive data, putting it at the first yield statement, so you do not need to call .next() on it first. You alternate reading and writing stages: you send a few values, and then you tell the generator that you are ready to receive data again by passing it None.
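
Roughly, the idea in plain Python (a sketch of the convention, not the Weightless implementation):

    # The generator yields None when it is ready to receive, and yields
    # real values when it has something to write back.
    def echo_upper():
        while True:
            data = yield None    # "I am ready to receive"
            yield data.upper()   # write a result back

    gen = echo_upper()
    gen.send(None)            # prime: run to the first yield None
    print(gen.send("hello"))  # HELLO
    gen.send(None)            # back to the receiving state
    print(gen.send("world"))  # WORLD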

Applications

Weightless is an I/O framework. It ties generators to sockets. See http://weightless.io/weightless and http://weightless.io/compose and source code at https://github.com/seecr/weightless-core

Bob Voorneveld: Implementing the Gmail API in our CRM system

published May 22, 2015

Bob Voorneveld talks about implementing the Gmail API in a CRM system, at PyGrunn.

See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.

I am working for Spindle. We are primarily working on a CRM system. Most Spindle programmers used to work at Voys. We are growing bigger and are still hiring, so come and join us.

Why we started building yet another CRM: communication should be the central point. We are building 'HelloLily'. Lily should be funny, nerdy, smart. HelloLily focuses on accounts, contacts, cases, deals, email, phone calls (future).

Focusing on the email part in this talk. Last year I said: instead of the really old IMAP, let's switch to the Gmail API.

Implementation currently: Heroku, Python 2.7, Django 1.7, Django Rest, Django Pipeline, Celery, IronMQ, PostgreSQL, ElasticSearch, Angular 1.3, Logentries.

Celery: 1 scheduler; every five minutes it syncs every account. Two functions to sync email: first sync and incremental sync. Many functions for sending, moving, deleting and drafting email. This is asynchronous, to keep responses quick.
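
A sketch of what that might look like in Celery (the task and module names are hypothetical):

    # Hypothetical sketch: a beat schedule entry that starts a sync
    # every five minutes, and a task that would fan out per account.
    from datetime import timedelta

    from celery import shared_task

    # This part would live in the (Django) settings:
    CELERYBEAT_SCHEDULE = {
        "sync-all-email-accounts": {
            "task": "email.tasks.synchronize_email_scheduler",
            "schedule": timedelta(minutes=5),
        },
    }

    @shared_task
    def synchronize_email_scheduler():
        # Fan out one asynchronous task per email account here (first
        # sync or incremental sync), so web responses stay quick.
        pass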

With IMAP we previously had email sync problems: authentication (we should not store your password), no easy way to keep track of changes to a mailbox, no partial download of only one attachment, and IMAP implementations differ. Also, searching in the database was not very efficient, even with indexes. We had PostgreSQL problems: models spread over many tables, searching was slow because of that, the search time increased with every email, and partial matching was difficult.

So we wanted a search index, and to use the Gmail API.

Gmail API: an easy API, installable with pip, keeps track of messages (like: since the last sync there are 5 new mails, these three have been deleted and one was edited), partial download. Since February there is tracking of the type of change. Downside: it is limited to Gmail / Google Apps for Business.
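
A sketch of what an incremental sync roughly boils down to with the google-api-python-client (building the authorised service object is left out, and the field handling is simplified):

    # Ask the Gmail API what changed since the last known history id.
    # "service" is an authorised Gmail API service object.
    def incremental_sync(service, last_history_id):
        response = service.users().history().list(
            userId="me", startHistoryId=last_history_id).execute()
        for record in response.get("history", []):
            for added in record.get("messagesAdded", []):
                message_id = added["message"]["id"]
                # Fetch only what we need for this one message.
                service.users().messages().get(
                    userId="me", id=message_id, format="metadata").execute()
        return response.get("historyId", last_history_id)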

PostgreSQL is used together with Elasticsearch. Emails are mapped to documents in ES; models are pushed to ES with a post_save signal in Django. Fast response times, averaging 50 milliseconds.
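
A sketch of such a signal handler (the app label, model, index and field names are hypothetical):

    # Push a saved email message to Elasticsearch from a Django
    # post_save signal. "email.EmailMessage", the "emails" index and
    # the fields are made-up names.
    from django.db.models.signals import post_save
    from django.dispatch import receiver
    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # defaults to localhost:9200

    @receiver(post_save, sender="email.EmailMessage")
    def index_email(sender, instance, **kwargs):
        es.index(
            index="emails",
            doc_type="email_message",
            id=instance.pk,
            body={"subject": instance.subject, "body": instance.body},
        )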

Problems that we encountered, and fixes: encoding and decoding of messages (do not trust that the claimed encoding is correct); sending and forwarding email with attachments; losing labels entirely, instead of just for one message, due to a coding error; high memory usage (scaled up for now); sending big messages, which needed to be sent in chunks.

Some colleagues are now saying: it does not work properly, let's switch back to IMAP. But we are getting there.

Still beta, testing it out ourselves. See source code on https://github.com/HelloLily

Me on Twitter: ijspaleisje

Niels Hageman: Reliable distributed task scheduling

published May 22, 2015

Niels Hageman talks about Reliable distributed task scheduling, at PyGrunn.

See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.

I am a member of the operations team at Paylogic, responsible for deployments and support.

Handling tasks that are too lengthy for a request/response loop.

Option 1: Celery plus RabbitMQ as backend. Widely used, relatively easy. But RabbitMQ proved unreliable for us: our two queues in two data centres would go out of sync, and then you needed to reset the queue, which is manual work and loses data. This is a known problem. Also, queues could get clogged: data going in but not out.

Option 2: Celery plus MySQL as backend. So no separate queue, just our already-running database. But it was not production-ready, not maintained, and buggy.

Option 3: Gearman (with MySQL). The Python bindings were again buggy, it could run only one daemon at a time, and by default there is no result store.

Option 4: do it yourself. Generally not a great idea, but it does offer a "perfect fit". We built a minimal prototype, which grew into Taskman.

MySQL as backend: may be fine, but it is not a natural fit. "Thou shalt not use thy database as a task queue." Polling: there is no built-in event system, though there are hacks that pretend there is. Lock contention between tasks is a bit hard. Some options: enable autocommit, so you do not need a separate commit; use a plain SELECT instead of SELECT FOR UPDATE; use fuzzy delays. Data growth: the queue can grow over time, but you can remove old data.

Task server: a daemon built with Python and supervisor. Its loop: claim a task from the database, spawn a runner, sleep, repeat.
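
A stripped-down sketch of such a loop (table, column and command names are invented; this is not the actual Taskman code):

    # Poll the database, claim one pending task with an optimistic
    # UPDATE (no SELECT FOR UPDATE), spawn a runner process, sleep.
    import subprocess
    import time

    import MySQLdb

    def claim_task(conn):
        cursor = conn.cursor()
        cursor.execute("SELECT id FROM tasks WHERE state = 'pre-run' LIMIT 1")
        row = cursor.fetchone()
        if row is None:
            return None
        task_id = row[0]
        # Only one server can flip the state from pre-run to running.
        claimed = cursor.execute(
            "UPDATE tasks SET state = 'running' "
            "WHERE id = %s AND state = 'pre-run'", (task_id,))
        return task_id if claimed else None

    def main():
        conn = MySQLdb.connect(host="localhost", user="taskman",
                               passwd="secret", db="taskman")
        conn.autocommit(True)  # no separate commit needed
        while True:
            task_id = claim_task(conn)
            if task_id is not None:
                # Hand the task to a separate runner process.
                subprocess.Popen(["taskman-runner", str(task_id)])
            time.sleep(5)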

The task runner is a separate process. It sets up the Python environment in which the task runs, runs the task, and does the post-mortem: get the results and report back.

Task record (database row), simplified: function, positional and keyword arguments, environment, version of the environment, state (pre-run, running, ended/finished, ended/failed), return_value.
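
For example (all values made up):

    # A simplified task record as a Python dict.
    task = {
        "function": "reports.generate_settlement_report",
        "args": [2015],
        "kwargs": {"format": "pdf"},
        "environment": "paylogic",
        "environment_version": "1.4.2",
        "state": "pre-run",    # pre-run, running, ended/finished, ended/failed
        "return_value": None,  # a JSON string once the task has ended
    }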

The task server is an independent application. It does not know about the application that actually runs the tasks. Applications need to register with the server through a plugin, with methods get_name, get_version, get_environment_variables and get_python_path. The result of a task must be a JSON string.
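
A sketch of such a plugin (the method names come from the talk; the return values are made-up examples):

    # The task server only knows about these hooks, not about the
    # application itself.
    class MyAppPlugin(object):

        def get_name(self):
            return "myapp"

        def get_version(self):
            return "1.4.2"

        def get_environment_variables(self):
            return {"DJANGO_SETTINGS_MODULE": "myapp.settings"}

        def get_python_path(self):
            # Exact format is not described in these notes; a path to
            # the application's code is assumed here.
            return "/srv/myapp/current"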

A task can report progress by accessing its own instance, so it can say '20% done'. Tasks can be aborted. The task start time can be constrained, e.g. if the task has not started within ten minutes, delete it, because it is no longer useful. There is exception handling.

Taskman is optimized for long-running tasks, not for minor tasks. It is designed for reliability: tasks will not get lost, they will get executed, and executed only once. It is less suited for a blizzard of lots of small tasks, and more for heavy database processing. There is no management interface yet; if you must, you can currently use phpMyAdmin...

I really want it to be open source, but we have not gotten around to that yet. We first want to add packaging and documentation on how to set it up.