Philip Bauer: Growing Pains: PosKeyErrors and Other Malaises

published Dec 11, 2020, last modified Jan 05, 2021

Talk at Plone Conference 2020

This talk is about the issues that you face when your project grows, the code base grows, the database grows, the problems grow. This is about the causes and some of the remedies.

Symptom 1: huge database

Cause 1: a huge number of revisions or versions.

Remedies:

  • Remove all versions and pack the database. When you migrate to a new Plone version, and you ask your client, they will usually be okay with this.
  • Manage or limit revisions. Easiest is to use collective.revisionmanager for this. Especially, revisions may have been left behind for content that no longer exists. You can easily remove it with this tool.
  • Disable versioning of Files. It is disabled by default, but maybe someone has switched it on.
  • Enable manual versioning instead of automatic. Then the editor needs to check a box when they make a major change that they want to be able to rollback.

Cause 2: no packing.

Remedy: just pack it. Use the zeopack script, which part of plone.recipe.zeoserver. Add a cronjob for this, weekly seems best for most sites.

Cause 3: unused content.

Remedy: delete it. You have to find it first. Of course no code can tell you which content is safe to delete. You could use statistics.py from collective.migrationhelpers to get an idea of where which content is.

Cause 4: the SearchableText index is huge

Remedies:

  • Use solr or elasticsearch and possibly remove the SearchableText index.
  • Don't index files. They are converted to text, but this may not be needed for your site.

Cause 5: large blobs For example, plone.de had a Linux iso image, which was huge.

Remedies:

  • Limit the upload size. You could do this in nginx/apache. Archetypes had something, you can likely do this in Dexterity too.
  • Get stats and remove or replace too large items.

Cause 6: aborted uploads (rare)

Remedy: check IAnnotations(portal).get('file_upload_map').

Symptom 2: slow site

Cause 1: unneeded full renders of content

Remedy: use Python in page templates. By default, page templates use path expressions like this: tal:define="foo context/foo". But this tries to render  foo as html if possible. Use foo python:context.foo instead.

Cause 2: wake up many objects

Remedies:

  • Always try to use brains and metadata. The difference is huge, also with Dexterity.
  • Listing 3000 brains: 0.2 seconds
  • Listing 3000 objects: 2 seconds
  • Same is true for Volto when you use the search-endpoint with fullobjects.

Of course most page templates in Plone will not list thousands of objects, but will be paginated. Still: just use brains, they are so much tastier.

Cause 3: no caching

Remedies:

  • Switch on the built-in caching
  • Add varnish
  • Manage the zeocache (that is a bit of science, ask the community)
  • Use memoize in your code.

Cause 4: hardware

Remedies:

  • Don't be cheap.
  • Buy enough ram to keep the database in memory.
  • Remember that your consulting time probably costs more than buying better hardware would.

Cause 5: slow code

Remedies:

  • Learn and use profiling. A very handy toy for that is py-spy. Sample use: sudo py-spy top --pid 12345
  • Do not call methods multiple times from templates. Call them once, store the result, and use this.

Cause 6: slow data sources

Remedies:

  • decouple, for example using redis or celery
  • Use your choice of async implementations
  • Use lazyloading of images if they come from outside of your Plone site.

Symptom 3: conflict errors

Conflict errors happen when two requests work at the same time and both change the same object. This is complicated, but Zope and the ZODB have built-in conflict resolution.

Cause 1: conflict resolving is not enabled. The zeoserver needs access to the same code that your zeoclient has, otherwise conflicts cannot be resolved and the transaction will be aborted.

Remedy: add all application code to the zeoserver:

[zeoserver]
eggs = ${buildout:eggs}

Cause 2: long running requests change data

Remedies:

  • Prevent writes.
  • If it takes long, do intermediate commits when possible.
  • Prevent crossfire: disable cronjobs and editors when a long request needs to run.
  • Use async. Talk to Asko about that probably.

Symptom 4: PosKeyErrors

Cause 1: missing blobs

Remedies:

  • Copy all blobs of course.
  • Use experimental.gracefulblobmissing in development to create dummy blobs where needed.
  • Find and delete afflicted content in a browser view.
  • There can be cases when you have two zeoclients and the syncing does not work well. Talk to Alessandro about that.

Symptom 5: broken data

Now for the really interesting part. These are errors like:

ModuleNotFoundError
AttributeError
ImportError
PostKeyError
BrokenObject

I could read you my whole blog post about zodb debugging.

Cause 1: code to unpickle som data is missing

Remedies:

  • Ignore the errors, if normal operation still works, and the site only has to stay up for a limited time, because zeopack probably also fails.
  • Fix it with a rename_dict. See zest.zodbupdate for some examples that are actually really useful. [Thanks! MvR]
  • Work around it with an alias_module patch, like plone.app.upgrade does in several cases. Then imports can work again.
  • Find out what and where broken objects are and then fix or remove them safely. Use zodbverify.

Steps for the last one:

  • Call bin/zodbverify -f var/filestorage/Data.fs to get all broken objects.
  • Pick one error type at a time, with an oid (object id) that has a problem.
  • Call bin/zodbverify -f var/filestorage/Data.fs -o -D to inspect one object and find out where it is referenced.
  • For the extra options, you should use the branch from my pull request, which I still have not finished yet, but it runs fine.
  • Remove or fix the object.
  • Important: make notes, write upgrade steps, keep the terminal log, because you will forget it and need it again.

To remove or fix the object, it helps to start the actual Plone site with some special zodbverify sauce:

./bin/instance zodbverify -f var/filestorage/Data.fs -o  -D

Then you can use your debugging skills to try and fix things. Note that after you fixed it, you need to commit the changes explicitly:

import transaction
transaction.commit()

Note that the bad object is still in the database, until you pack it.

Frequent culprits are IntIds and Relations, especially if you migrated from Archetypes to Dexterity. Using collective.relationhelpers you can clean this up:

from collective.relationhelpers.api import cleanup_intids
from collective.relationhelpers.api import purge_relations
from collective.relationhelpers.api import restore_relations
from collective.relationhelpers.api import store_relations
# store all relations in a annotation on the portal
store_relations()
# empty the relation-catalog
purge_relations()
# remove all relationvalues and refs to broken objects from intid
cleanup_intids()
# recreate all relations from a annotation on the portal
restore_relations()

Symptom 6: bad code

Unreadable, untested, unused, undocumented, unmaintained, complicated, overly complex, too much code. If you can convince a client to not want a feature because they will only use it once, that is a win. Every line of code that is not written, is a good line of code.