Peter Odding and Bart Kroon - Understanding PyPy and using it in production

published May 13, 2016, last modified May 17, 2016

Peter Odding and Bart Kroon talk about understanding PyPy and using it in production, at PyGrunn.

PyPy is a Python implementation with a JIT (Just In Time) compiler. It is also known as Python in Python. The standard Python interpreter is CPython, which is Python written in C. There are other alternative implementations, of which PyPy is the most mature.

PyPy is compliant with CPython 2.7.10 and 3.2.5 at the moment. It is fast, which was not the case in earlier versions. Contrary to popular belief, PyPy can actually reduce memory usage. It also has features for multicore programming, plus the stackless feature for massively concurrent programming: microthreads and greenlets.

PyPy is written in RPython. RPython is a restricted subset of Python that is statically typed. It is translated to C and compiled to produce an interpreter. It provides a framework for PyPy and other interpreters.

Run it: pypy your_python_file.py.
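To check which interpreter actually ran a script, the standard library can tell you. This is a minimal sketch; the function name describe_interpreter is made up for illustration and does not come from the talk.

```python
import platform
import sys


def describe_interpreter():
    """Return a short string naming the running Python implementation."""
    # platform.python_implementation() returns e.g. "CPython" or "PyPy".
    name = platform.python_implementation()
    # sys.version_info is the Python language version, not the PyPy release.
    version = ".".join(str(part) for part in sys.version_info[:3])
    return "{0} (Python {1})".format(name, version)


if __name__ == "__main__":
    print(describe_interpreter())
```

Run the same file once with python and once with pypy to see the difference.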

But when you use C extensions, it is not so easy. Some may work, some may not. If your module needs C code, the PyPy folks would have you use cffi, the C Foreign Function Interface.
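To give an idea of what cffi looks like, here is a minimal sketch that calls strlen from the C standard library in cffi's ABI mode. It assumes a POSIX system, where ffi.dlopen(None) opens the standard C library, and it assumes cffi is installed when running on CPython. The helper name c_strlen is made up for illustration.

```python
def c_strlen(data):
    """Return the length of a bytes object as computed by C strlen()."""
    # Imported here because cffi is bundled with PyPy but is a separate
    # install on CPython.
    from cffi import FFI

    ffi = FFI()
    # Declare only the C function this example calls.
    ffi.cdef("size_t strlen(const char *s);")
    # Assumption: POSIX. dlopen(None) loads the standard C library there;
    # it is not expected to work on Windows.
    libc = ffi.dlopen(None)
    # A bytes object can be passed directly where C expects a char pointer.
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"PyPy"))
```

Real code would create the FFI object once at module level instead of on every call; it is done inside the function here only to keep the sketch short.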

Software Transactional Memory: Python without the Global Interpreter Lock. It is actually slower on a single thread, but with two threads you already get a performance increase. Side effects make transactions inevitable, so watch out with concurrent logging and file I/O in general: side effects will result in the other threads being rolled back to try again. This is interesting to follow, even if you are using other languages.

How PayLogic came to use PyPy. We sell tickets, which can lead to a lot of visitors in a very short time, so you get a DDoS from your customers. So we started using a CDN for the HTML, and sent only small JSON requests to the servers. For the JSON we still needed lots of servers, and state synchronisation was still a problem. We did not use any C modules, and the biggest part was Tornado. So we just changed to PyPy.

It almost worked. What went wrong?

  • Garbage collection works quite differently in PyPy. PyPy periodically stops execution to mark reachable objects. So objects could stay alive after they were no longer used, and we ran out of file descriptors in seconds. A cache solved this.
  • The UUID4 implementation in PyPy was wrong, resulting in non-random ids that were far from unique.
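The file descriptor problem is easy to run into when code relies on CPython closing files the moment the last reference goes away. The talk mentions a cache as their fix; the sketch below shows a different, general technique: closing files explicitly with a with statement, which behaves the same on both interpreters. The function names are made up for illustration.

```python
def read_leaky(path):
    """Read a file and leave closing it to the garbage collector."""
    # CPython's reference counting closes the file right away. PyPy may
    # keep the descriptor open until a later collection, so calling this
    # in a tight loop can exhaust the available file descriptors.
    return open(path).read()


def read_safely(path):
    """Read a file and close it deterministically."""
    # The with statement closes the file on exit, on any implementation.
    with open(path) as handle:
        return handle.read()
```

The same idea applies to sockets and other objects that hold a descriptor: close them explicitly instead of waiting for collection.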

Our results:

  • Quadrupled performance, in 2013 already. It is now around eight times, and every upgrade brings a bit more performance improvement.
  • Real saving on hosting costs, fewer servers needed.
  • Our queue works for at least two million visitors now.
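If you want to see what PyPy does for your own code, a rough way is to time the same function under both interpreters. This is a minimal sketch using timeit from the standard library. The workload function is a made-up CPU-bound loop, not the code from the talk, and the numbers it gives say nothing about the results above.

```python
import timeit


def workload(n=10000):
    """A hypothetical CPU-bound loop, used only as something to time."""
    total = 0
    for i in range(n):
        total += i * i % 7
    return total


def best_time(repeat=3, number=20):
    """Return the best time in seconds for `number` calls of workload."""
    # timeit.repeat returns one total per round; the minimum is the
    # measurement least disturbed by other processes.
    return min(timeit.repeat(workload, repeat=repeat, number=number))


if __name__ == "__main__":
    print("best of 3: {0:.4f} seconds".format(best_time()))
```

Run the file with python and then with pypy and compare. The JIT needs some time to warm up, so very short runs are likely to understate what PyPy can do.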

Other things you can do: run JavaScript, Lisp.

Guido van Rossum said: "If you want your code to run faster, you should probably just use PyPy."

Slides: http://peterodding.com/presentations/2016/pypy