Martijn Faassen: Morepath under the hood
Martijn Faassen gives the first keynote at PyGrunn, about Morepath under the hood.
Python and web developer since 1998. I did Zope, and for a little while it was as popular as Python itself.
What is this about? Implementation details, concepts, creativity, software development.
Morepath is a web microframework. The planet Zope exploded and Morepath came out. It has a unique approach to routing and link generation with Traject. Easy and powerful generic code with Reg. Extensible and overridable with Dectate.
In the early nineties you had simple file system traversal to publish a file on the web. Zope 2, in 1998, had traversal through an object tree, conceptually similar to filesystem traversal. Drawback: all objects need to have code to support web stuff. Creativity: filesystem traversal is translated to an object tree. Similar: JavaScript client frameworks that mimic what used to be done on the server.
Zope 3 got traversal with components: adapt an object to an interface that knows how to publish to html, or to json. So the base object can be web agnostic again.
Pyramid simplified traversal, with __getitem__. So the object needs to be web aware again. Might not be an issue.
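A minimal sketch of what __getitem__-based traversal looks like (the class and resource names are illustrative, not Pyramid's actual API): the framework walks the URL path and indexes into each resource in turn.

```python
# Hypothetical resource tree; the framework resolves /docs/intro by
# calling __getitem__ for each path segment. A KeyError typically
# becomes a 404 response.
class Folder:
    def __init__(self, children):
        self.children = children

    def __getitem__(self, name):
        return self.children[name]

root = Folder({"docs": Folder({"intro": "Intro page"})})

def traverse(root, path):
    resource = root
    for segment in path.strip("/").split("/"):
        resource = resource[segment]
    return resource

print(traverse(root, "/docs/intro"))  # "Intro page"
```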
Routing: map a route to a view function. As developer you need to handle a 404 yourself, instead of letting the framework do this.
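In a route-based setup that difference looks roughly like this (a hypothetical view function, not any specific framework's API):

```python
# The route /documents/{name} maps to this view; the developer has to
# produce the 404 response for missing objects themselves.
documents = {"intro": "Intro page"}

def document_view(name):
    doc = documents.get(name)
    if doc is None:
        return 404, "Not Found"
    return 200, doc

print(document_view("intro"))    # (200, 'Intro page')
print(document_view("missing"))  # (404, 'Not Found')
```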
Frameworks can fight about these approaches. But Morepath has it all: it is a synthesis.
I experimented with a nicer developer API than Zope was offering to get a view for traversal. So I created experimental packages like iface and crom. I threw them together in Reg. It was just a rewrite of the Zope Component Architecture with a simpler API.
Class dispatch: foo.bar() has self as its first argument. Reg builds on the idea of functools.singledispatch to provide multiple dispatch. But then I generalised it even more to predicate dispatch, as Pyramid had.
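For comparison, this is what the standard library's single dispatch looks like; Reg generalises the same idea to dispatch on more than one argument and, later, on predicates (the example below only shows stdlib singledispatch, not Reg's own API):

```python
from functools import singledispatch

@singledispatch
def view(obj):
    # fallback when no more specific implementation is registered
    raise NotImplementedError("no view registered for %r" % type(obj))

@view.register(int)
def _(obj):
    return "int view: %d" % obj

@view.register(str)
def _(obj):
    return "str view: %s" % obj

print(view(42))       # dispatches on the type of the first argument
print(view("hello"))
```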
Don't be afraid to break stuff when you refactor things.
Dectate is a meta framework for code configuration. Old history involving Zope, Grok, martian, venusian, but now Dectate. With this you can extend or override configuration in your app, for example when you need to change something for one website.
Detours are good for learning.
Splitting things off into a library helps for focus, testing, documentation.
Morepath uses all these superpowers to form a micro framework.
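Put together, a minimal Morepath app looks roughly like this (based on the project's quickstart; details may differ per version):

```python
import morepath

class App(morepath.App):
    pass

@App.path(path="/hello/{name}")
class Hello:
    def __init__(self, name):
        self.name = name

@App.view(model=Hello)
def hello_view(self, request):
    return "Hello " + self.name

if __name__ == "__main__":
    # routing, link generation and views are all configured above
    morepath.run(App())
```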
Twitter: @faassen
Bart Wesselink - Processing large quantities of online payments
Bart Wesselink talks about processing large quantities of online payments, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
I am here in a personal capacity, not on behalf of a company, although I am a finance guy at SFX Entertainment, and formerly at PayLogic.
- Online payments: digital certainty of cash over cash. It is real-time confirmation that you will get your money and can deliver a product.
- Large quantities: high volume in a short span of time.
- Unit of measurement is transactions per second.
- Every second of downtime is revenue lost. This can be as high as 3000 euro per second for some companies.
Standard online payment environment has four corners:
- consumer
- merchant
- issuer/consumer bank
- acquiring/merchant bank
The scheme or payment method can be VISA, which is the intermediary between the two banks.
This model has several 'single points of failure'. What we can make redundant is the PSP/gateway between the merchant and his bank. Service Level Agreements are useless here: you will never get back anything close to the money you lose. So redundancy is key.
Part of the solution: offer the consumer different ways to pay: credit card, iDEAL, etcetera. And when the consumer decides to pay via VISA, you want to have a few options: if one payment provider has problems, another can fill the gap.
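A hypothetical failover sketch (the provider objects and PaymentError are made up for illustration): try payment routes in order of preference and fall back when one is failing.

```python
class PaymentError(Exception):
    pass

def charge_with_failover(providers, payment):
    """Try each payment service provider in turn until one succeeds."""
    for provider in providers:
        try:
            return provider.charge(payment)
        except PaymentError:
            # log and monitor this; the next route fills the gap
            continue
    raise PaymentError("all payment routes failed")
```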
Monitor how well each route is performing. For iDEAL (a payment system in the Netherlands) you can now get a message when there is a known problem with a bank.
Card number plus expiry date is enough for most credit card payments, but there is also 3DSecure: extra security for VISA. It means there is more that you need to monitor.
From the first six credit card digits you can learn a lot. There are databases for this, showing card brand, country, card level, etcetera.
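Conceptually such a lookup is just a prefix query (the entry below is a made-up example using a well-known VISA test prefix; real BIN databases are large and updated regularly by commercial providers):

```python
# Hypothetical in-memory BIN database keyed on the first six digits.
BIN_DATABASE = {
    "411111": {"brand": "VISA", "country": "US", "level": "classic"},
}

def lookup_bin(card_number):
    return BIN_DATABASE.get(card_number[:6])

print(lookup_bin("4111111111111111"))  # VISA test card number
```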
Remember: stay compliant to https://www.pcisecuritystandards.org
There are horror stories, like a local payment method that could only handle 1.5 transactions per second.
Lessons learned. Big names do not mean big performance, even for banks. You see sloppy implementations. Do lots of logging and monitoring.
Adam Powell and Denis Dallinga - Recommendation systems @ Catawiki
Adam Powell and Denis Dallinga talk about recommendation systems at Catawiki, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
Catawiki is an online auction platform, for all kinds of things, including Napoleon's hair. Some projects are interesting from a programming perspective.
Auction listing page optimisations. Which options do we recommend to users? We base this on similar users. We can use a Jaccard normalised model. The Co-Occurrence model gives different recommendations.
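A minimal sketch of the Jaccard-normalised idea: the similarity of two users (or two lots) is the overlap of the sets they bid on, divided by the union; plain co-occurrence counts skip that normalisation and therefore favour popular items.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

# bids of two users, as sets of lot categories (made-up data)
print(jaccard({"stamps", "coins", "art"}, {"coins", "art", "watches"}))  # 0.5
```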
Bids go into the data warehouse, then to Spark, then to a personalisation service built with the Ruby 'grape' framework. We A/B test new ways of doing recommendations. We need to balance popularity against novelty, freshness (don't keep showing the same ones), and diversity.
Problem: new users about whom we don't yet know much. We use the Python library Theano [see the machine learning talk, Maurits].
We have 35 thousand auction lots per week, and 350 thousand bidders, which leads to a lot of data. Recommendations for all users can be recalculated every five minutes. Using Snowplow for recommendations to all users.
Recommend a category for a new lot that someone enters: run the gensim Python module for natural language processing over all categories, and put the result in Elasticsearch.
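A rough sketch of the gensim part, assuming a short text description per category (the data and the TF-IDF choice are assumptions; the real pipeline also indexes the result in Elasticsearch):

```python
from gensim import corpora, models, similarities

category_texts = [
    "antique pocket watches and clocks".split(),
    "rare stamps and postal history".split(),
    "classic cars and automobilia".split(),
]

dictionary = corpora.Dictionary(category_texts)
corpus = [dictionary.doc2bow(text) for text in category_texts]
tfidf = models.TfidfModel(corpus)
index = similarities.MatrixSimilarity(tfidf[corpus])

new_lot = "old pocket watches".split()
scores = index[tfidf[dictionary.doc2bow(new_lot)]]
print(scores)  # similarity of the new lot to each category
```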
Peter Odding and Bart Kroon - Understanding PyPy and using it in production
Peter Odding and Bart Kroon talk about understanding PyPy and using it in production, at PyGrunn.
PyPy is a JIT (Just In Time) compiler for Python. Also known as: Python in Python. The standard Python interpreter is CPython, which is Python written in C. There are other options, of which PyPy is the most mature.
PyPy is a Python implementation, compliant with CPython 2.7.10 and 3.2.5 at the moment. It is fast; this was not the case earlier, but speed keeps getting better. Contrary to popular belief, PyPy can actually reduce memory usage. It supports multicore programming and has a stackless feature for massively concurrent programming: microthreads, greenlets.
PyPy is written in RPython. RPython is a strict subset of Python, statically typed, translated to C and compiled to produce an interpreter. It provides a framework for PyPy and others.
Run it: pypy your_python_file.py.
But when you use C extensions, it is not so easy. Some may work. What then? The PyPy folks would have you use cffi: C Foreign Function Interface, if your module needs C code.
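The canonical example from the cffi documentation gives an idea of the ABI-level approach:

```python
from cffi import FFI

ffi = FFI()
ffi.cdef("int printf(const char *format, ...);")  # declare the C function
C = ffi.dlopen(None)                 # load the standard C library (POSIX)
arg = ffi.new("char[]", b"world")    # allocate a C string
C.printf(b"hi there, %s.\n", arg)    # call C's printf
```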
Software Transactional Memory: Python without the Global Interpreter Lock. Actually slower on a single thread, but with two threads you already have a performance increase. Side effects make transactions 'inevitable' (they can no longer be rolled back), so watch out with concurrent logging and file I/O in general: such side effects will result in the other threads being rolled back to try again. Interesting to follow, also if you are using other languages.
How PayLogic came to use PyPy. We sell tickets, which can lead to a lot of visitors in a very short time, so you effectively get a DDoS from your own customers. So we started using a CDN for the HTML, with only small JSON requests going to the servers. For the JSON we still needed lots of servers, and state synchronisation was still a problem. We did not use any C modules, and the biggest part was Tornado. So we just changed to PyPy.
It almost worked. What went wrong?
- Garbage collection works quite differently in PyPy: it periodically stops execution to mark reachable objects. Objects can stay alive long after they are no longer used, and we ran out of file descriptors in seconds. A cache solved this (see the sketch after this list).
- The UUID4 implementation in PyPy was wrong, resulting in ids that were not random and far from unique.
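The file descriptor problem from the garbage collection point above boils down to code like this; a sketch of the difference:

```python
def read_config_leaky(path):
    # On CPython, reference counting closes the file almost immediately.
    # On PyPy, the file object may stay alive until the next GC cycle,
    # so under load the process can run out of file descriptors.
    return open(path).read()

def read_config_safe(path):
    # Closing deterministically does not depend on the garbage collector.
    with open(path) as f:
        return f.read()
```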
Our results:
- Quadrupled performance, already in 2013. Now around eight times faster; every upgrade brings a bit more performance improvement.
- Real savings on hosting costs: fewer servers needed.
- Our queue works for at least two million visitors now.
Other things you can do: run JavaScript, Lisp.
Guido van Rossum said: "If you want your code to run faster, you should probably just use PyPy."
Ben Meijering - Hello, Machine Learning!
Ben Meijering says hello to Machine Learning, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
I am using a computer vision task as example because it looks nice, but this works just as well for other data.
Machine learning is: teaching computers to learn how to perform tasks.
Task: determine the digit that is in a picture.
Think of tasks in terms of probability. What is the chance that a picture as a whole shows the number zero? Regression model: squash the data into a value between zero and one so you can read it as a probability. We take a weighted sum of all pixel values.
Linear regression model. Inputs: all the pixels. Outputs: the probability that the digit is a zero, a one, a two, and so on. We apply softmax (instead of sigmoid) to the output nodes, because the classifications are mutually exclusive. This is a constraint we choose for this example task.
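Softmax itself is a small formula; a quick NumPy sketch of what it does (NumPy is only used here for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))         # probabilities that sum to 1, one per class
```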
Machine learning libraries:
- Theano, nice for getting started in machine learning
- TensorFlow, good for rapid prototyping, used a lot by Google
- Keras, a unified API to both Theano and TensorFlow. Good, because both libraries have their own strengths and Keras makes it easy to switch.
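With Keras, the single-layer softmax model described above fits in a few lines (a sketch; the layer sizes assume 28x28 pixel digit images and ten classes, and the optimizer and loss choices are illustrative):

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# 28*28 = 784 pixel inputs, 10 output probabilities (one per digit)
model.add(Dense(10, activation="softmax", input_shape=(784,)))
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```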
Code is compiled to GPU instructions, which can be 30 times faster than CPU.
Goal of training: learn the best possible weights. Inputs plus weights plus model (transform) give the outputs. During training we know what the correct answer is. Errors should be propagated back through the model, through the network of nodes, so the model can update its weights.
I do the plotting with the pandas library: show how well the model is performing after each iteration over the training data. Our first model starts out bad, and improves only a little bit. A second model, with an extra layer of nodes, performs much better at first and still improves afterwards.
So: keep adding layers for more accuracy? No, because you run out of memory. We can use a convolutional network, inspired by the visual cortex. In this model we use fewer connections: the nodes in the second layer are only connected to three nodes, instead of to all of them. And we use the same three weights everywhere.
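The convolutional variant in Keras looks roughly like this (the layer sizes are illustrative, not the speaker's exact architecture):

```python
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

model = Sequential()
# 8 small 3x3 filters, each sharing its weights across the whole image
model.add(Conv2D(8, (3, 3), activation="relu", input_shape=(28, 28, 1)))
model.add(Flatten())
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
```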
Machine learning is a very powerful tool, in solving all sorts of tasks. And it is a creative endeavour: how do you compose your models?
Overfitting: the model knows the training data too well, performing well on that data and poorly on new data. How to deal with this? Split the data into data for training and data for testing. Or use dropout: obscure part of the image when training.
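Splitting off test data can be as simple as this (scikit-learn's train_test_split is an assumption here; any split works):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 784)            # stand-in for the digit images
y = np.random.randint(0, 10, size=1000)  # stand-in for the labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```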
Slides are at https://github.com/bitwise-ben/pygrunn
My website: http://lambda-ds.com