Weblog

NLP based Recommender System for Plone

published Oct 14, 2022

Talk by Jan Mevissen and Richard Braun at the Plone Conference 2022 in Namur.

Find out how to level up the recommendations on your website with scikit-learn, the open source machine learning library for Python. Now that we have tried its simple and efficient tools ourselves, we will show you hands-on how you can benefit from them. We developed a useful add-on for both Plone Classic and Plone Volto. Get smart content recommendations by using basic Natural Language Processing to integrate this content recommendation system, which is accessible to everybody.

Vectorisation: create a vocabulary of all unique words, a bag of words. With CountVectorizer you count the number of appearances of each word. This turns each text into a vector; think of a line in a graph. For an unknown text you can then compare its vector with the list of known vectors and see what the nearest neighbour is, using sklearn.neighbors.
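
A minimal sketch of that pipeline, assuming scikit-learn is installed; the sample documents are made up for illustration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neighbors import NearestNeighbors

    documents = [
        "Plone is an open source CMS written in Python",
        "Volto is the React frontend for Plone 6",
        "scikit-learn offers simple and efficient machine learning tools",
    ]

    # Build the vocabulary (bag of words) and count appearances per document.
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(documents)

    # Index the known vectors, then ask for the nearest neighbours of a new text.
    knn = NearestNeighbors(n_neighbors=2).fit(vectors)
    distances, indices = knn.kneighbors(
        vectorizer.transform(["machine learning in Python"]))
    print(indices[0])  # positions of the most similar known documents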

interaktiv.recommendations is a Plone package for Classic and Volto. It provides a behaviour that can be added to the content type of your choice, plus a control panel. Say you add this to the Page content type. Then on each page you can show a list of other pages that are similar, and so are recommended for the user to read.

Future plans:

  • auto tagging
  • dimensionality reduction with part-of-speech filtering and lemmatisation, which breaks words down to their base form (see the sketch below).
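
As a hedged sketch of what that could look like with spaCy (the library and the en_core_web_sm model are my assumption, not necessarily what the add-on will use):

    import spacy

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The cats were chasing smaller mice through the old gardens.")

    # Keep only content words (nouns, verbs, adjectives) and reduce them to
    # their base form, shrinking the vocabulary before vectorisation.
    terms = [tok.lemma_ for tok in doc if tok.pos_ in {"NOUN", "VERB", "ADJ"}]
    print(terms)  # e.g. ['cat', 'chase', 'small', 'mouse', 'old', 'garden']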

Fast tests

published Oct 14, 2022

Talk by Neyts Zupan at the Plone Conference 2022 in Namur.

Slow tests suck. They are annoying and slow you and your team down. Hours and hours of engineer time are lost waiting for the CI to finish. I'll go through a number of approaches and principles to help you write fast tests.

I started plone.api way back; it is now in core. I have organised about twenty sprints. Next up is the Nix(OS) sprint, 21 to 25 November in Lanzarote.

I developed an app for security on Mac, see https://paretosecurity.com/. We had tests for this and they were fast, but when I checked again after a while, it took 3 minutes for only 300 unit tests. What? We needed to bring this down. Slow tests can cost money, and they always cost developer time. If the CI pipeline takes twenty minutes, you cannot wait for it, and you cannot start anything new in that time. There are a few things you can try.

Try one approach, measure it locally, measure it on CI, merge when it is an improvement, wait a few days before trying the next thing.

Use hyperfine to benchmark commands and compare their run times.

Running tests

Throw money at the problem: make sure your team has fast laptops and that you have fast CI runners. Buy a Mac mini for less than 1000 euros and use it as a fast CI runner for GitHub or GitLab.

Try pytest --collect-only. Ballpark: it should take 1 second for 1000 tests. If it takes longer, something is wrong. You are probably telling it to look in too many directories.
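
Two standard ways to narrow collection are the testpaths setting in your pytest configuration, or a conftest.py at the project root; the names below are just examples:

    # conftest.py -- tell pytest what not to collect.
    # collect_ignore and collect_ignore_glob are standard pytest variables.
    collect_ignore = ["setup.py"]
    collect_ignore_glob = ["node_modules/*", "docs/*"]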

Try pytest --collect-only --noconftest. Does this differ much from the previous result? If collection is much faster without them, your conftest.py files are doing expensive work at import time.

Set export PYTHONDONTWRITEBYTECODE=1 so Python does not write .pyc bytecode files during test runs.

Usually we know a few tests that are slow. You can mark them as slow and skip them by default, running them only when you ask for them explicitly. Then do run them on CI.
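
The pytest documentation has a well-known recipe for this; a sketch:

    # conftest.py
    import pytest

    def pytest_addoption(parser):
        parser.addoption(
            "--runslow", action="store_true", default=False,
            help="also run tests marked as slow",
        )

    def pytest_configure(config):
        config.addinivalue_line("markers", "slow: mark test as slow to run")

    def pytest_collection_modifyitems(config, items):
        if config.getoption("--runslow"):
            return  # on CI: run everything
        skip_slow = pytest.mark.skip(reason="needs --runslow option to run")
        for item in items:
            if "slow" in item.keywords:
                item.add_marker(skip_slow)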

Try pytest-incremental or pytest-testmon.

Writing tests

Make sure tests are not doing any network I/O, for example to GitHub or Gravatar. Use pytest-socket to make such tests fail, and then fix them by mocking the network access.
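
Once pytest-socket (run pytest with --disable-socket) has flagged a test, you can mock the call away; "myapp" and "fetch_avatar" below are illustrative names, not a real API:

    from unittest import mock

    import myapp

    def test_avatar_without_network():
        # Replace the real HTTP call with a canned response.
        with mock.patch.object(myapp, "fetch_avatar", return_value=b"\x89PNG"):
            assert myapp.avatar_for("user@example.com") == b"\x89PNG"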

Use pyfakefs to get a fake filesystem in memory.
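
Installing pyfakefs gives pytest an "fs" fixture; even plain open() calls then hit the in-memory filesystem instead of the real disk:

    def test_reads_config(fs):
        fs.create_file("/etc/myapp.conf", contents="debug = true")
        with open("/etc/myapp.conf") as f:
            assert "debug" in f.read()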

Do all tests need a database? If a test only does simple things, like some calculations, do not set up a database. Or maybe not all tables are needed for the test.

Can I create the database only once? At the end of a test you can truncate the tables, so you don't need to recreate them.

Can I populate the database only once?
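
A sketch of the create-once, clean-cheaply idea, assuming SQLAlchemy 2.x and a Postgres test database; the module and database names are illustrative:

    # conftest.py
    import pytest
    import sqlalchemy as sa

    from myapp.models import metadata  # your table definitions

    @pytest.fixture(scope="session")
    def engine():
        engine = sa.create_engine("postgresql:///myapp_test")
        metadata.create_all(engine)  # expensive: create tables only once
        yield engine
        metadata.drop_all(engine)

    @pytest.fixture
    def connection(engine):
        with engine.connect() as conn:
            yield conn
            # Cheap per-test cleanup instead of recreating the schema.
            for table in reversed(metadata.sorted_tables):
                conn.execute(table.delete())
            conn.commit()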

Parallelisation

The big wins are here, but it is more involved, so start with the others first.

pytest-xdist helps here (pytest -n auto spreads the tests over all CPU cores). Make sure your parallel runs don't influence each other, for example one process writing to the database while another truncates it.

pytest-split is good for the CI, not for local runs.

Now that your tests are fast, how do you keep them fast? We made BlueRacer.io. It is free for personal use and for open source software.

Extra tip: run pytest --last-failed (or --lf) to only run the tests that failed in the last run.

See https://github.com/zupo/awesome-pytest-speedup for the full list with more information.

Audience comment: gocept.pytestlayer helps for running Plone tests in pytest.

Using Plone to build a business application

published Oct 14, 2022

Talk by Gauthier Bastien and Olivier Delaere at the Plone Conference 2022 in Namur.

This talk is about ways to optimize and use Plone to serve hundreds of logged-in users concurrently.

iA.Délib is used by more than 250 public authorities in Belgium: cities, public social welfare centres, fire fighters, etc. It is a set of packages for building applications. Plone has many features to support this, but it needs to be optimised. We have a theme, dashboards, office document production, and contacts management. When all this is set up, you can build a new application within weeks.

We use the collective.contact.* and collective.eeafaceted.* packages. We have special workflows for documents. When a user is not allowed to do a transition, we still show a grey button for it. When they hover over it, they see the reason why they are not allowed. We use dexterity.localrolesfield, linking roles to an organisation.

We worked on caching and performance. We never use Varnish; plone.app.caching gives us enough power.

When you reindex an object in Plone, it updates all metadata, even when you only update one index. There are 36 metadata columns in a fresh Plone site, and we have 59.
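
A small sketch of the pitfall; "obj" stands for any catalogued Plone content object:

    def touch_modified(obj):
        # This asks the catalog to update just one index...
        obj.reindexObject(idxs=["modified"])
        # ...but ZCatalog's catalog_object() defaults to update_metadata=1,
        # so every metadata column (36 in a fresh site, 59 here) is rebuilt
        # too, making each extra column a small tax on every reindex.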

Our biggest organisation creates 500 to 1000 new items every week. In peak hours there are several transactions per second: 1150 users, hundreds of them concurrent, while only using 30 GB of RAM. It is still performant out of the box. We optimised a standard Plone ZODB to manage one million portal_catalog records. Optimisations for big installations also benefit smaller ones: they use the same setup. Some actions in Plone, like in z3c.form, are done two or three times, so we are looking into optimising that.

We have configured the default Plone RAM cache to hold 100,000 entries. We have improved the cache so that "hot", recently used cache items will not be deleted.

We use the ZCA for low-level adaptations, but we moved the adaptability to through-the-web (TTW) configuration. This fits our audience better.

We have used in-place migration from Plone 2.1.2 up to Plone 4.3.20. Most content types have been moved to Dexterity and use GenericSetup. We will probably evaluate collective.exportimport to go to Plone 6, but we will first try the in-place migration. This is in Plone itself and lets users feel confident.

Second site: www.deliberations.be. This brings the politics back to the people. It is a sharing tool between municipalities and citizens. One single portal to find them all.

It uses Plone 5.2, Python 3, Dexterity, and eea.facetednavigation, and it fetches data from iA.Délib through the REST API.

Conclusion: Plone needs some optimisations, but then it performs quite well.

Plone Newsroom Live

published Oct 13, 2022

The Plone Newsroom podcast is live at the Plone Conference 2022 in Namur, Belgium.

Welcome from Fred van Dijk and Philip Bauer. This is episode 11 of the Plone Newsroom podcast.

Plone 6 beta 3 was released last week. Your password needs to be longer. Tables look nicer now. Meanwhile there were four Volto alpha releases. We need more documentation and marketing.

Community member news: Ramon got married. Peter Holzer has left Plone and is working for a new employer. We appreciate everything he has done, and everything done by others who have left. People come and go, and some come back: David Glick.

What were some highlights of the past season? Having Victor on the show. The composite pages episode. The growing up of Plone Classic and Volto side by side.

How about the presentations this week so far? Fred's talk obviously, about how some managers and organisations handle Plone. Volto talks: we can see it is working. Plone beyond 2022, with technical planning on the frontend. Keynote about minerals. Diversity talk: we can fix stuff, improve stuff.

Trainings: Happy that we have the Effective Volto training now, to have the brains of Tiberiu and Victor not only in their heads, but also on a site for us to read. Classic UI Theming: good to see how fast it is to start a new theme with plonecli. Patternslib is great, and it was there before React.

Conference advice:

  • The talks here are live recorded and almost live uploaded to YouTube, which is great. Skip a beer tonight and watch an extra video.
  • When talking with people, be an open circle, so others can join.

For next year's conference organisers, here is some advice from Martijn: 

  • Do it.
  • Push for early bird tickets earlier. You want to know earlier how many people are coming.

The software is an artefact of the Plone community. Johannes, you organised a sprint? Yes, the Buschenschanksprint in Austria. Buschenschanks are in the Südsteiermark area of Austria close to Slovenia. Lots of vineyards there. Organising is not that hard. Have a place for people to work, maybe a place to stay at night as well, organise some food and drinks and internet. Then announce it. The Plone Foundation has funding for this. The sprints are important, are social, and so much work is done, it is awesome!

Rikupekka, please come on stage. He has been working on marketing and especially the new plone.org. We have been at it for eight months already, so it will be done soon. There are designs, there is code, there is migration, there is progress. This weekend you can sprint on the site. First we need a running Volto site with the design that the Italians made. Write good content for plone.org and come to me with it.

When you support a website, be it Plone or WordPress, and a feature request comes in, you can duck and deny the request. Or you can accept them all, and the site will grow into an unmaintainable monster. With both of these approaches, every other system starts looking appealing. I call this product fatigue. We see talks here from universities with thousands of Plone sites, while other universities step away from Plone. Why?

Plugins:

  • pas.plugins.oidc for authenticating with OpenID Connect
  • pas.plugins.authomatic got a Volto port

More about Volto. We both agree that Volto is by rights the default frontend of Plone 6. Will this be the only frontend in five to ten years? A few hands go up in the audience. The difference is: do you have an HTML canvas with some JavaScript, or a JavaScript canvas with some HTML? The requirements that I see coming in are better met by Volto. Based on the reaction, you would say that no one here needs to fear that the Classic UI will die out in the next five to ten years. The path to the more dynamic frontend is there, but we are not pushing you.

See you next time!

DevOps Bird's Eye View on Plone 6

published Oct 13, 2022

Talk by Jens Klein at the Plone Conference 2022 in Namur.

pip is the tool almost every Pythonista learns to use early. Plone 6 installs just with "pip install -c ... Plone".

But it needs more: Zope configuration, including existing add-ons or development of your own, using newer package versions, configuring RelStorage, CI/CD integration, building Docker images, deployment on Swarm or K8s, an ingress, some load balancing, caching, ...

The talk is not a tutorial but gives a 3000-foot view and acts as a starter to dig deeper.

In the past we only had Plone as the backend. Now there is also the frontend, running on a different port, as a different process. The backend talks to the database.

The Plone frontend is a Node project. It pre-renders pages on first request, and for this it talks to the backend.

The Plone backend is a WSGI server based on Zope, with a complete CMS REST API. The backend still has all the Classic UI features in it, even when these are not used with the new default frontend.

You would use pip to install the backend: pip install Plone. But that is not perfect, so I created mxdev to override constraints. Previously Buildout was used to generate a directory structure and configuration files; currently you use cookiecutter-zope-instance for this. And then there is a WSGI server to start Zope; by default this is waitress.

A web request comes in at the web server (nginx, Traefik); most requests go to the frontend, API requests go to the backend, and the frontend talks to the backend for the first request. You can scale the backend horizontally by adding more backends (ZEO clients). The backends talk to the ZEO database server, usually with a shared filesystem for the blobs (binary large objects). You could scale the frontend as well, if that turns out to be the bottleneck.

If you use multiple backends or frontends, you will need to put a load balancer in between.

If your sysadmins don't like this ZEO setup, you can use a relational database instead, via the RelStorage package. PostgreSQL would be the best choice, but MySQL should work as well. The blobs are stored as large objects in the database, which is more performant for a transaction.
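
A hedged sketch of a programmatic RelStorage setup, following the RelStorage documentation; in a real Plone deployment this would live in zope.conf (for example generated by cookiecutter-zope-instance), and the DSN here is made up:

    from relstorage.adapters.postgresql import PostgreSQLAdapter
    from relstorage.options import Options
    from relstorage.storage import RelStorage
    from ZODB import DB

    options = Options(keep_history=False)  # history-free storage is common
    adapter = PostgreSQLAdapter(dsn="dbname=plone user=plone", options=options)
    db = DB(RelStorage(adapter, options=options))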

At some point you need some kind of caching. You can go to a cloud provider for a cache, but you can also do it yourself with a Varnish cache. You put this between the web server and the backend, and/or between the web server and the frontend. You need to configure it. Under very high load you could even say: cache this request for three seconds.

Now on to hosting. Let's focus on Docker.

Disclaimer: Docker swarm could die in the long term. But it is nice to start with, and it gives you knowledge that is also useful for Kubernetes.

You get images from Docker Hub or other container registries. Usually you will build your own frontend, so you need to create a custom image for it. For the backend you can do the same. You could build an image on your laptop and upload it, but it is better to build and store images in CI/CD, to avoid big uploads over slow connections.

Now it is time to deploy. Basics: there is docker compose, but you should use Docker Swarm with 1 to n nodes, or do the same with Kubernetes. You need storage for your database; this could be a managed database from your provider. And you need fixed IPs.

A simple deployment for one site would be a single-node Docker Swarm: one docker stack with everything included, all in one YAML file. 1 Traefik, 2 frontends, 2 backends, 1 Postgres. The 2 frontends and 2 backends are there so that you can upgrade one, bring it up, then upgrade the second and bring it up: no downtime.

You can add Varnish in here. Traefik sends all requests there. Varnish returns a cached response, or it adds a header. Based on that header, Traefik sends the request to the frontend or the backend.

Now some tools. With the Traefik UI you can inspect what is configured or what is wrong. Traefik has Let's Encrypt built in. Portainer is a UI and management interface for Docker Swarm and Kubernetes; you can add it with an image. You can inspect the state of the cluster, stop and start services, view logs, and open a web console.

You really need a CI/CD system, otherwise this gets nasty with images. You need workflows and a container registry. GitLab and GitHub both have these.

Want to play with this? See the deployment training.