David Glick: Nice blobs! It'd be a shame if anything happened to them...

published Oct 18, 2017

Talk by David Glick at the Plone Conference 2017 in Barcelona.

This is about storing blobs in the cloud. I will mostly talk about storing them in Amazon S3, but you can store them elsewhere as well.

Earlier this year, the site in question had about 750 GB of blob data, growing by a few GB a day, and the hard disk was 90 percent full. It was also hard to keep copies of the site up to date and to keep backups. What to do?

Requirements:

  • move blob data to cloud storage
  • keep reasonable performance
  • no changes at the app level (the Plone Site)
  • smooth migration path: we did not want to bring the site down.

So should we handle it at the app level, storing an S3 URL instead of the blob? The problem is that you also need the image locally when you create scales.

Inspiration: the ZODB client cache. Could we keep some blobs in there? I looked further, and Jim Fulton had already created s3blobstorage and s3blobserver. They looked experimental and did not have a lot of documentation, though that did not need to be a deal breaker.

So I created collective.s3blobs, which works a bit like the other two packages. When the app needs a blob, it first looks in the ZODB, then in a filesystem cache, then in an S3 bucket. When it gets a blob from S3, it puts it in the filesystem cache, so recently used blobs end up available locally. The cache is size-limited.
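The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the actual collective.s3blobs code: the `BlobLookup` class, its `primary` dict (standing in for the ZODB blob storage), the `fetch_remote` callable (standing in for an S3 download) and the in-memory cache (standing in for the filesystem cache) are all hypothetical names chosen for the example.

```python
from collections import OrderedDict


class BlobLookup:
    """Toy model: primary storage first, then a size-limited cache, then remote."""

    def __init__(self, primary, fetch_remote, max_bytes):
        self.primary = primary            # dict: blob id -> bytes (stand-in for ZODB blobs)
        self.fetch_remote = fetch_remote  # callable: blob id -> bytes (stand-in for S3)
        self.max_bytes = max_bytes        # size limit of the local cache
        self.cache = OrderedDict()        # blob id -> bytes, least recently used first

    def load(self, blob_id):
        # 1. The underlying storage wins if it still holds the blob.
        if blob_id in self.primary:
            return self.primary[blob_id]
        # 2. Next, try the local cache and mark the entry as recently used.
        if blob_id in self.cache:
            self.cache.move_to_end(blob_id)
            return self.cache[blob_id]
        # 3. Finally, fetch from the remote bucket and remember the result.
        data = self.fetch_remote(blob_id)
        self.cache[blob_id] = data
        self._evict()
        return data

    def _evict(self):
        # Drop least recently used entries until the cache fits the limit,
        # but always keep the entry that was just added.
        total = sum(len(value) for value in self.cache.values())
        while total > self.max_bytes and len(self.cache) > 1:
            _, dropped = self.cache.popitem(last=False)
            total -= len(dropped)
```

In this sketch a blob that was moved to the remote bucket is downloaded once and then served from the cache until it is evicted, which matches the case where only a small set of blobs is requested regularly.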

You can choose which blobs to move with an archive-blobs script, which accepts some options. This avoided the need to migrate everything at once.

In your buildout you add a storage-wrapper with collective.s3blobs:

storage-wrapper =
  %%import collective.s3blobs
  <s3blobcache>
    cache-dir ${buildout:directory}/var/blobcache
    cache-size 100000000
    bucket-name your-s3-bucket
    %s
  </s3blobcache>

This requires a recent plone.recipe.zope2instance recipe.

It was successful. On this site in particular, lots of large original images are never shown, only their smaller scales, so it helps a lot here.

S3 claims to offer secure storage, so that you do not lose your data. We have set up rules to prevent someone with direct access to S3, outside of Plone, from accidentally deleting the blobs. We also have versioning enabled in S3.
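The talk did not show the exact rules, so the following is only one way such protection could be set up, under stated assumptions. The helper names `build_delete_deny_policy` and `protect_bucket` are hypothetical; `client` is assumed to be an S3 client such as the one returned by `boto3.client("s3")`, whose `put_bucket_versioning` and `put_bucket_policy` methods are used here.

```python
import json


def build_delete_deny_policy(bucket_name):
    """Build a bucket policy that denies object deletion for every principal.

    Hypothetical example policy, not the rules used on the site from the talk.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyBlobDeletion",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": "arn:aws:s3:::%s/*" % bucket_name,
            }
        ],
    }


def protect_bucket(client, bucket_name):
    """Enable versioning and attach the delete-deny policy to the bucket."""
    # Versioning keeps earlier versions of overwritten objects.
    client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )
    # The policy must be passed to S3 as a JSON string.
    client.put_bucket_policy(
        Bucket=bucket_name,
        Policy=json.dumps(build_delete_deny_policy(bucket_name)),
    )
```

Note that a blanket deny like this would also block legitimate deletions, for example by a future packing script, so a real setup would probably scope the rule more narrowly.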

Ideas:

  • Packing needs to be implemented. The standard ZODB packing will not remove blobs from S3. I have some ideas.
  • We could use CloudFront CDN to serve the images. That has security implications, although for this site it would be fine. But it costs extra, and they did not need it.

It is in stable production use, but unreleased. Needs more documentation and tests.

Code: https://github.com/collective/collective.s3blobs