Jens W. Klein: Big Fat Fast Plone - Scale Up, Speed Up.

Published Oct 30, 2014, last modified Nov 16, 2014

Talk by Jens W. Klein at the Plone Conference 2014 in Bristol.

I am the owner of Klein & Partner, a member of the BlueDynamics Alliance, Austria. I have been doing Plone since version 1.0.

Default Plone is not so fast. It scales great horizontally (by adding machines), but there are still bottlenecks, primarily loading objects from the ZODB.

First customer Noeku, over 30 Plone sites, high availability, low to mid budget, self hosted on hardware, VMs. The pipeline is: nginx, varnish, pound, several Plone instances, databases (zodb, mysql, samba).

Second customer Zumtobel, brand specific international product portals, customer extranets, b2b e-shop, hosting on dedicated machines.

Third customer HTU Graz, one Plone site with several subsites (with lineage), lots of students looking at it, so we have a peak load.

Main Plone database is ZEO or PostgreSQL, plus blobstorage (NFS, NAS). Load balancer: haproxy or pound. Caching proxy: varnish (don't use squid please). Webserver: nginx (better not use apache).

The Plone instances and the client connection pool (between Plones and database) can use memcached (maybe multiple), LDAP, other third party services.

If you want to improve things, you must measure them: use munin, everywhere you can. fio is a simple but powerful tool for measuring your I/O. Read up on how Linux manages disk and RAM. Know your hardware and your VMs (if any).

Database level

Noeku: ZEO server, blobstorage, both replicated with DRBD.
Zumtobel: RelStorage on PostgreSQL, blobs from NAS over NFS.
HTU Graz: RelStorage on PostgreSQL, all on one machine.

First things first: never store blobs in the ZODB: use blobstorage. In standard Plone 4.3, images from news items are stored in the ZODB. You can change that. Check your code and add-ons.

ZEO server plus blobstorage: ensure a fast IO to harddisk or RAM, and have enough RAM for disk buffering.

Blobstorage on NFS/NAS: share the blobs and mount them on each node. Mount them read-only on the web server node and use collective.xsendfile (X-Accel-Redirect in nginx) to make it faster.

RelStorage plus blobstorage: never store blobs in the SQL database (same as zodb). No MySQL if you can avoid it. Configure your SQL database.

Connection pool: ZEO vs RelStorage. The ZEO server pushes invalidations to its clients. With RelStorage, the Zope client polls for invalidated objects. There is a disk cache of pickled objects per Zope instance. On the RelStorage side you can use memcached, which is a big advantage, reducing load on the database.

Noeku: ZEO.
Zumtobel: RelStorage, history free, 2 VMs, 16 instances plus some worker instances for asynchronous work, each 2 or 4 threads. RAM cache 30,000 or 100,000 objects, memcached as shared connection cache. If packing takes too long, try relstorage_packer.
HTU: RelStorage, history free. 6 instances, each 1 thread. RAM cache 30,000 objects. This is something you need to tweak: try out some values and measure the effect. Memcached. Poll interval 120 seconds. Blobstorage: shared folder on same machine.

The above is not specific to Plone. The below is.

Plone

Turn off debug mode, logging, deprecation warnings.
Configure plone.app.caching, even if you are not using varnish. Browsers cache things too and it can really help.
Multiple instances: use memcached instead of standard ram cache.
Know plone.memoize and use it.
Never calculate search twice. Check your Python and template code to avoid things that boil down to: if expensive_operation(): expensive_operation().
Use the catalog.
Do not overuse metadata: if you add too much metadata to the catalog brains, they may become bigger than the actual objects, slowing your site down.
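
The "never calculate twice" advice can be sketched in plain Python. This is a minimal illustration, not Plone code: expensive_operation is a hypothetical stand-in for a catalog search or similar costly call, and functools.lru_cache is used only as a standard-library stand-in for the kind of caching that plone.memoize offers (it is not the plone.memoize API).

```python
import functools

CALLS = {"count": 0}  # counts how often the expensive work actually runs


def expensive_operation(query):
    """Hypothetical stand-in for a costly call such as a catalog search."""
    CALLS["count"] += 1
    return [item for item in ("news-1", "news-2", "event-1") if item.startswith(query)]


def render_twice(query):
    """Anti-pattern: the expensive call runs once for the test and once for the use."""
    if expensive_operation(query):
        return ", ".join(expensive_operation(query))
    return ""


def render_once(query):
    """Fix: bind the result to a name and reuse it."""
    results = expensive_operation(query)
    if results:
        return ", ".join(results)
    return ""


@functools.lru_cache(maxsize=128)
def cached_operation(query):
    """Memoized variant; returns a tuple so the cached value cannot be mutated."""
    return tuple(expensive_operation(query))
```

Calling render_once("news") triggers one expensive call where render_twice("news") triggers two; repeated calls to cached_operation with the same argument trigger only one in total.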

Write conflicts:

90% of write conflicts happen in the catalog.
To avoid it, try to reduce the time of the transaction. Hard in standard situations, but you may be able to first prepare some data and later commit it to the database.
Use collective.solr or collective.indexing. I hope that in Plone 6 we will no longer have our own catalog, but use SOLR.
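
One way to read the "prepare first, commit later" advice is sketched below. Everything here is an assumption for illustration: FakeStore is a hypothetical stand-in for a transactional database (it is not the ZODB or transaction API), and normalize stands for whatever slow preparation your code does. The point is only the shape: the slow part runs before the write phase starts, so the write phase stays short.

```python
import contextlib


class FakeStore:
    """Hypothetical transactional store, for illustration only."""

    def __init__(self):
        self.items = []
        self.in_transaction = False

    @contextlib.contextmanager
    def transaction(self):
        # Marks the window in which a conflicting write could hurt us.
        self.in_transaction = True
        try:
            yield self
        finally:
            self.in_transaction = False

    def add(self, item):
        if not self.in_transaction:
            raise RuntimeError("writes are only allowed inside a transaction")
        self.items.append(item)


def normalize(row):
    """Stand-in for slow preparation work (parsing, fetching, converting)."""
    return {"id": row["id"], "title": row["title"].strip().title()}


def import_rows(rows, store):
    # Slow part first, without touching the store.
    prepared = [normalize(row) for row in rows]
    # Short write phase: only cheap appends happen while the transaction is open.
    with store.transaction():
        for item in prepared:
            store.add(item)
    return len(prepared)
```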

Lots of objects, hundreds of thousands? Are catalog queries slow? Use a separate mount point for the portal_catalog, with higher cache sizes.

Archetypes versus Dexterity. In AT, you should avoid waking up the object; ask the catalog instead. With Dexterity, it is sometimes cheaper to wake up the object: if objects are small and you iterate over a folder or subtree, or if you would otherwise need to add lots of metadata to the catalog.

Third party services, like LDAP and other databases, need to be cached. Talking to external systems over the network is slow.

In case of serious trouble: measure! munin, fio, collective.traceview, Products.ZopeProfiler, haufe.requestmonitoring, Products.LongRequestLogger. Change one thing at a time and measure it. Really important!
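
The tools above do the real work, but the "change one thing, then measure" loop can also be sketched in a few lines of Python. This is a minimal sketch under assumptions: measure and compare are hypothetical helper names, and the callables you pass in stand for whatever you are testing before and after a single change.

```python
import statistics
import time


def measure(func, repeats=20):
    """Call func repeatedly and return (median, worst) wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), max(timings)


def compare(baseline, candidate, repeats=20):
    """Return how many times faster the candidate's median is than the baseline's."""
    base_median, _ = measure(baseline, repeats)
    cand_median, _ = measure(candidate, repeats)
    if cand_median == 0:
        return float("inf")
    return base_median / cand_median
```

The median is used rather than the mean so that a single slow outlier does not hide the effect of the change; the worst case is returned separately because peak latency matters too.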

plone.app.caching: always install it. For custom add-ons with their own types and templates, you need extra configuration for each type and template. Do this! Budget some time for this task: it is some work, but it is well documented.

Under high traffic, introduce a new caching rule for one-, two- or five-minute caches; this really helps against peak load.
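
Why such a short-lived rule helps can be shown with a small Python sketch. It is not plone.app.caching configuration: short_cache_headers and backend_renders are hypothetical helpers, and the header layout (browsers revalidate, the proxy keeps its copy for a few minutes) is an assumption about how such a rule might be set up.

```python
import math


def short_cache_headers(minutes):
    """Build a Cache-Control header that lets a caching proxy keep a page briefly.

    Assumption: browsers always revalidate (max-age=0) while the proxy may
    serve its copy for the given number of minutes (s-maxage).
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    seconds = int(minutes * 60)
    return {"Cache-Control": "max-age=0, s-maxage=%d, must-revalidate" % seconds}


def backend_renders(duration_seconds, ttl_seconds):
    """Rough upper bound on how often one URL reaches Plone while the proxy caches it."""
    return math.ceil(duration_seconds / ttl_seconds)
```

During a one-hour peak, a one-minute rule means at most about 60 renders of a given page and a five-minute rule about 12, no matter how many visitors request it.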

Load balancer. Pound is stable but old, and it is difficult to get measurements out of it. HAProxy is newer and not as simple, but it has a nice web UI for stats. You should point the same request type to the same instance.
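
The idea of pointing the same request type to the same instance can be sketched in Python; in practice this lives in the load balancer configuration, not in Python code. The backend addresses, the request_type categories and the helper names are all hypothetical. Presumably the benefit is that each instance keeps the objects for its kind of request warm in its own cache.

```python
import zlib

# Hypothetical backend addresses, for illustration only.
BACKENDS = ["instance1:8081", "instance2:8082", "instance3:8083"]

STATIC_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")


def request_type(path):
    """Classify a request path into a coarse, made-up request type."""
    path = path.lower().split("?", 1)[0]
    if "search" in path:
        return "search"
    if path.endswith(STATIC_SUFFIXES) or "/@@images/" in path:
        return "static"
    return "page"


def pick_backend(path, backends=BACKENDS):
    """Map a request type to a backend with a stable hash.

    crc32 is deterministic across processes (unlike Python's built-in hash
    for strings), so every balancer process picks the same backend for the
    same type. Two types may still land on the same backend.
    """
    key = request_type(path).encode("utf-8")
    return backends[zlib.crc32(key) % len(backends)]
```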

Web server: nginx. Set the proxy_* directives to the recommended values.