David Glick: Tales from a production Plone cluster
Talk by David Glick at Plone conference 2023, Eibar, Basque Country.
By "production" I mean: serving public traffic (or an intranet), highly available, with maintenance and monitoring.
I will talk about dlr.de, which gets about 15 million hits per month and runs in their own data center.
First, availability. We have stacks in two zones, so if one zone goes down, the second zone can keep serving.
Internet traffic comes in at a load balancer / cache, which sends it to one of the app instances. The apps talk to the database (primary or secondary) and to one of the two search machines (Solr).
We use Docker Swarm mode. Docker Swarm used to be a separate project, but now it is part of Docker Engine. It provides a good way to manage several apps on different machines. It is containerised: we build an image once and deploy it on multiple machines, so there is no buildout that you need to run on each machine before you can run the result. It takes care of process management: it restarts processes when they fail or run out of memory, and it can start a service on a new machine.
It allows rolling deployments: take down one of the app instances, update it, start it, wait until it is receiving traffic again, and then update the next one.
With Docker Swarm you can have declarative infrastructure: we write it down, it gets checked into git, and others can review it.
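As an illustration, here is a minimal sketch of what such a declarative stack file could look like. All names, image tags, and numbers are made up; a real stack has more services and settings.

```yaml
# stack.yml - illustrative Docker Swarm stack file (names and values are made up)
version: "3.8"

services:
  backend:
    image: registry.example.com/project/plone-backend:1.2.3  # built once, deployed to every node
    deploy:
      replicas: 4
      restart_policy:
        condition: any          # restart the container when it fails or runs out of memory
      update_config:
        parallelism: 1          # rolling deployment: update one replica at a time
        order: start-first      # start the new container before stopping the old one
        delay: 30s              # wait before moving on to the next replica
      placement:
        preferences:
          - spread: node.labels.zone   # spread the replicas over the two zones
```

You deploy the file with `docker stack deploy -c stack.yml <stackname>`, and that same file is what gets reviewed in git.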
How do you design for availability? Avoid single points of failure. What I dislike about that framing is that it makes failure seem like something that rarely happens. Instead you should plan for downtime: for example, a machine may need to restart after installing package updates. Services should be stateless: the data should live somewhere else (unless the service is the actual database server).
Varnish as proxy cache is a bit trickier. Varnish stores responses, so it is not really stateless. We have two copies of Varnish, which means purge requests need to reach both of them. You could add multiple addresses in Plone's caching control panel, but in our setup both copies would resolve to the same http://varnish address, so that does not help. Instead we send the purge requests to a purger that sits in front of the two Varnishes.
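A sketch of that part of the stack, assuming a hypothetical purger image that simply forwards every purge request to each Varnish instance (the image names and the environment variable are made up for illustration):

```yaml
services:
  varnish:
    image: varnish:7
    deploy:
      replicas: 2
      placement:
        preferences:
          - spread: node.labels.zone   # one Varnish per zone

  purger:
    # Hypothetical fan-out service: every purge request it receives is
    # forwarded to each task of the varnish service, so both caches drop
    # the stale object.
    image: registry.example.com/project/purger:latest
    environment:
      - VARNISH_SERVICE=varnish   # made-up setting telling the purger which service to fan out to
```

In Plone's caching control panel you then configure a single caching proxy address, http://purger, instead of the two Varnishes.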
For databases it also gets tricky: you need a primary database and a secondary that is kept up to date.
What if the two zones cannot talk to each other? On top of that, maybe the first zone, with the primary database, cannot be reached from the internet. That is why you need cluster managers in three different zones, so there are always two left that can form a majority and decide what to do.
Think about what resource limits you need to enforce.
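In a Swarm stack file this is configured per service under deploy.resources; a minimal sketch with made-up numbers:

```yaml
services:
  backend:
    deploy:
      resources:
        limits:
          cpus: "1.0"       # hard cap: the container gets throttled above this
          memory: 1500M     # the container is killed (and restarted) if it uses more
        reservations:
          cpus: "0.25"      # used by the scheduler to pick a node with enough room
          memory: 500M
```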
Now maintenance. Tune backend environment variables, like the RelStorage blob cache size and the ZODB cache size. Tune the Postgres configuration, especially the memory-related settings. Configure cron jobs: we use the crazymax/swarm-cronjob image to run jobs like backups and database packing. With Postgres you also need vacuuming. This happens automatically, but the blobs table in particular may not get vacuumed often, because not much changes in it.
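A sketch of how a scheduled job can look with crazymax/swarm-cronjob: one swarm-cronjob service watches for Swarm services that carry swarm.cronjob labels and starts them on schedule. The image names, the pack command, and the schedule below are placeholders.

```yaml
services:
  swarm-cronjob:
    image: crazymax/swarm-cronjob:latest
    environment:
      - TZ=Europe/Berlin
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # needed so it can start the other services
    deploy:
      placement:
        constraints:
          - node.role == manager

  pack-db:
    image: registry.example.com/project/plone-backend:1.2.3
    # Placeholder command: run whatever script packs the database in your image.
    command: ["/app/bin/pack-database.sh"]
    deploy:
      replicas: 0                            # never runs on its own
      restart_policy:
        condition: none                      # do not restart after the job finishes
      labels:
        - swarm.cronjob.enable=true
        - swarm.cronjob.schedule=0 3 * * 6   # placeholder: Saturday at 03:00
        - swarm.cronjob.skip-running=true    # do not start a new run while one is still busy
```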
Lastly, monitoring. The most basic information you need is: is the server down for everyone or just for me? Use something like Uptime Robot or Uptime Kuma for that. You can use SigNoz for log aggregation. There are also metrics that you can watch over time; for example, Traefik metrics can tell you how long requests take, how many requests per minute you get, and what percentage of requests are failing.
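Traefik, for example, can expose Prometheus metrics through its static configuration; a minimal sketch (Traefik v2/v3 YAML, port number made up):

```yaml
# traefik.yml (static configuration): enable the built-in Prometheus metrics
metrics:
  prometheus:
    entryPoint: metrics          # serve /metrics on the entry point defined below
    addEntryPointsLabels: true   # request counts and durations per entry point
    addServicesLabels: true      # the same, broken down per backend service

entryPoints:
  metrics:
    address: ":8082"
```

A Prometheus-compatible collector can then scrape this endpoint and chart request durations, request rates, and error percentages over time.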
I like "distributed tracing" better than logs, where you can drill down into details of a request, like how many SQL queries are done.
Also use something like Sentry to get reports/alerts about errors.
With collective.opentelemetry, which I am working on, we get more OpenTelemetry data about Plone and Zope.