Weblog
Bart Wesselink - Processing large quantities of online payments
Bart Wesselink talks about processing large quantities of online payments, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
I am here on a personal basis, not on behalf of a company, although I am a finance guy at SFX Entertainment, and formerly at PayLogic.
- Online payments: digital certainty of cash over cash. It is real-time confirmation that you will get your money and can deliver a product.
- Large quantities: high volume in a short span of time.
- Unit of measurement is transactions per second.
- Every second of downtime is revenue lost. This can be as high as 3000 euro per second for some companies.
Standard online payment environment has four corners:
- consumer
- merchant
- issuer/consumer bank
- acquiring/merchant bank
The scheme or payment method, for example VISA, is the intermediary between the two banks.
This model has several 'single points of failure'. What we can make redundant, is the PSP/Gateway between the merchant and his bank. Service Level Agreements are useless here: you will never get back anything close to the money you lose. So redundancy is key.
Part of this is offering the consumer different ways to pay: credit card, iDeal, etcetera. And when the consumer decides to pay via VISA, you want to have a few options: if one payment provider has problems, another can fill the gap.
Monitor how well each route is performing. On iDeal (a payment system in the Netherlands) you can now get a message when there is a known problem with a bank.
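The idea of falling back to a second payment provider can be sketched in a few lines of Python. This is only an illustration: the provider names, the ProviderDown exception and the charge functions are hypothetical and do not come from the talk or from any real PSP library.

```python
class ProviderDown(Exception):
    """Raised by a (hypothetical) provider client when it cannot take the payment."""


def pay_with_failover(providers, amount_cents):
    """Try each (name, charge) pair in order and return the first success."""
    errors = {}
    for name, charge in providers:
        try:
            return name, charge(amount_cents)
        except ProviderDown as exc:
            # Keep the failure around so it can be logged and monitored.
            errors[name] = str(exc)
    raise ProviderDown("all providers failed: %r" % errors)


def flaky_provider(amount_cents):
    # Stand-in for a gateway that is currently having problems.
    raise ProviderDown("gateway timeout")


def healthy_provider(amount_cents):
    # Stand-in for a gateway that accepts the payment.
    return {"status": "authorised", "amount_cents": amount_cents}


providers = [("psp_a", flaky_provider), ("psp_b", healthy_provider)]
used, result = pay_with_failover(providers, 2500)
print(used, result["status"])
```

In a real setup the order of the providers would probably follow from the monitoring data, so that the route that is performing best is tried first.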
Credit card number plus expiry date is enough for most credit card payments. But there is also 3DSecure: extra security for VISA. So there is more that you need to monitor.
From the first six credit card digits you can learn a lot. There are databases for this, showing card brand, country, card level, etcetera.
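As a rough sketch, a lookup on the first six digits could look like this. The table rows are made-up placeholders, not real card data, and the function name is an assumption; a real system would query one of the databases mentioned above.

```python
# Placeholder rows for illustration only, NOT real data from such a database.
BIN_TABLE = {
    "400000": {"brand": "VISA", "country": "NL", "level": "classic"},
    "510000": {"brand": "MasterCard", "country": "DE", "level": "gold"},
}


def lookup_bin(card_number):
    """Return the table entry for the first six digits, or None if unknown."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    return BIN_TABLE.get(digits[:6])


info = lookup_bin("4000 0012 3456 7890")
print(info["brand"], info["country"])
```

Only the first six digits are used for the lookup, so the full card number does not need to be stored or logged for this.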
Remember: stay compliant with https://www.pcisecuritystandards.org
There are horror stories, like a local payment method that could only handle 1.5 transactions per second.
Lessons learned. Big names do not mean big performance, even for banks. You see sloppy implementations. Do logging and monitoring, lots of it.
Adam Powell and Denis Dallinga - Recommendation systems @ Catawiki
Adam Powell and Denis Dallinga talk about recommendation systems at Catawiki, at PyGrunn.
Catawiki is an online auction platform, for all kinds of things, including Napoleon's hair. Some projects are interesting programmatically.
Auction listing page optimisations. Which options do we recommend to users? We base this on similar users. We can use a Jaccard normalised model. The Co-Occurrence model gives different recommendations.
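To see why the two models can give different recommendations, here is a small sketch with made-up bidding data. It is not Catawiki's code: the data, the function names and the item-to-item approach are assumptions for illustration. Co-occurrence counts how many users bid on both items, while Jaccard divides that count by the number of users who bid on either item, which penalises items that are simply popular.

```python
from collections import defaultdict

# Made-up data: which user has bid on which kind of lot.
bids = {
    "ann": {"comics", "coins", "posters"},
    "bob": {"comics", "coins"},
    "cat": {"coins"},
    "dan": {"coins"},
    "eve": {"coins"},
    "fay": {"coins"},
}


def users_per_item(bids):
    """Invert the mapping: item -> set of users who bid on it."""
    users = defaultdict(set)
    for user, items in bids.items():
        for item in items:
            users[item].add(user)
    return users


def cooccurrence(a, b, users):
    """Number of users who bid on both items."""
    return len(users[a] & users[b])


def jaccard(a, b, users):
    """Shared users divided by all users of either item."""
    union = users[a] | users[b]
    return len(users[a] & users[b]) / float(len(union)) if union else 0.0


def rank(target, score, users):
    """Other items, best match for `target` first."""
    others = [item for item in users if item != target]
    return sorted(others, key=lambda item: score(target, item, users), reverse=True)


users = users_per_item(bids)
print(rank("comics", cooccurrence, users))
print(rank("comics", jaccard, users))
```

For someone looking at comics, co-occurrence puts the popular coins first, while Jaccard prefers the less popular posters. That is the popularity versus novelty balance in miniature.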
Bids go into the data warehouse, to Spark, to the Ruby 'grape' framework which is a personalisation service. We A/B test new ways of doing recommendations. We need to balance popularity and novelty, freshness (don't keep showing the same ones), diversity.
Problem: new users about whom we don't yet know much. We use the Python library Theano [see the machine learning talk, Maurits].
We have 35 thousand auction lots per week, and 350 thousand bidders, which leads to a lot of data. Recommendations for all users can be recalculated every five minutes. Using Snowplow for recommendations to all users.
Recommend a category for a new lot that someone enters. We use the gensim Python module for natural language processing, run this over all categories, and put the result in Elasticsearch.
Peter Odding and Bart Kroon - Understanding PyPy and using it in production
Peter Odding and Bart Kroon talk about understanding PyPy and using it in production, at PyGrunn.
PyPy is the JIT (Just In Time) compiler for Python. Also known as: Python in Python. The standard Python interpreter is CPython, so Python written in C. There are other options, of which PyPy is the most mature.
PyPy is a Python implementation. Compliant with CPython 2.7.10 and 3.2.5 at the moment. It is fast. This was not the case earlier. Speed is better. Contrary to popular belief, PyPy can actually reduce memory usage. Multicore programming, stackless feature for massively concurrent programming, so microthreads, greenlets.
PyPy is written in RPython. RPython is a strict subset of Python, statically typed, translated to C and compiled to produce an interpreter. It provides a framework for PyPy and others.
Run it: pypy your_python_file.py.
But when you use C extensions, it is not so easy. Some may work. What then? The PyPy folks would have you use cffi: C Foreign Function Interface, if your module needs C code.
Software Transactional Memory: Python without the Global Interpreter Lock. Actually slower on a single thread, but with two threads you already have performance increase. Side effects make transactions inevitable, so watch out with concurrent logging and file I/O in general: side effects will result in the other threads being rolled back to try again. Interesting to follow, also if you are using other languages.
How PayLogic came to use PyPy. We sell tickets, which can lead to a lot of visitors in a very short time, so you get a DDoS from your customers. So we started using a CDN for the HTML, and only small JSON requests to servers. For the JSON we still needed lots of servers, and state synchronisation was still a problem. We did not use any C modules, the biggest part was Tornado. So we just changed to PyPy.
It almost worked. What went wrong?
- Garbage collection works quite differently in PyPy. PyPy periodically stops execution to mark reachable objects. And objects could be alive after they were no longer used, and we ran out of file descriptors in seconds. A cache solved this.
- UUID4 implementation in PyPy was wrong, resulting in far from unique non-random ids.
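About the file descriptors: CPython closes a file as soon as the last reference to it goes away, but PyPy only does so when its garbage collector gets around to it. The speakers solved their case with a cache. A different and commonly recommended approach, sketched here as an assumption and not as what PayLogic did, is to close files explicitly with a with statement. The function name and the temporary file are just for the example.

```python
import os
import tempfile


def read_text(path):
    # Anti-pattern on PyPy: `return open(path).read()` leaves the file open
    # until the garbage collector happens to clean up the file object.
    with open(path) as handle:  # closed as soon as the block ends
        return handle.read()


# Small demonstration with a temporary file.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as handle:
    handle.write("ticket")

contents = [read_text(path) for _ in range(2000)]
os.remove(path)
print(len(contents), contents[0])
```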
Our results:
- Quadrupled performance, in 2013 already. Now around eight times, with every upgrade there is a bit more performance improvement.
- Real saving on hosting costs, fewer servers needed.
- Our queue works for at least two million visitors now.
Other things you can do: run javascript, lisp.
Guido van Rossum said: "If you want your code to run faster, you should probably just use PyPy."
Ben Meijering - Hello, Machine Learning!
Ben Meijering says hello to Machine Learning, at PyGrunn.
I am using a computer vision task as example because it looks nice, but this works just as well for other data.
Machine learning is: teaching computers to learn how to perform tasks.
Task: determine the digit that is in a picture.
Think of tasks in terms of probability. What is the chance that a picture as a whole is of the number zero? Regression model: change data into a value between zero and one so you can see it as a probability. We do a weighted sum of all pixel values.
Linear regression model. As inputs all the pixels. As outputs, the probability that the digit is zero, or one, or two. We apply softmax (instead of sigmoid) to output nodes, because classifications are mutually exclusive. This is a constraint we choose for this example task.
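A minimal sketch of the weighted sum followed by softmax, in plain Python instead of one of the libraries below. The tiny four-pixel "image" and the weights are made up for the example; in practice the weights come out of training.

```python
import math


def weighted_sums(pixels, weights, biases):
    """One score per class: the weighted sum of all pixel values plus a bias."""
    return [sum(w * p for w, p in zip(row, pixels)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn scores into probabilities that are positive and sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


pixels = [0.0, 1.0, 1.0, 0.0]          # made-up four-pixel image
weights = [
    [0.5, -1.0, -1.0, 0.5],            # class 0
    [-0.5, 1.0, 1.0, -0.5],            # class 1
    [0.0, 0.0, 0.0, 0.0],              # class 2
]
biases = [0.0, 0.0, 0.0]

probabilities = softmax(weighted_sums(pixels, weights, biases))
print(probabilities)
```

Because the probabilities sum to one, a higher probability for one class automatically means lower probabilities for the others, which is what you want when the classes are mutually exclusive.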
Machine learning libraries:
- Theano, nice for getting started in machine learning
- TensorFlow, good for rapid prototyping, used a lot by Google
- Keras, unified API to both Theano and TensorFlow. Good, because both libraries have their own strength and Keras makes it easy to switch.
Code is compiled to GPU instructions, which can be 30 times faster than CPU.
Goal of training: learn the best possible weights. Inputs plus weights plus model (transform) give the outputs. During training we know what the correct answer is. Errors should be propagated back through the model, through the network of nodes, so the model can update its weights.
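The update of the weights can be sketched for the simplest case: a single output node with a sigmoid, trained by gradient descent. This is a toy illustration under my own assumptions (made-up samples, a fixed learning rate of 0.5), not the speaker's code. With more layers the same error is propagated further back through the network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, inputs):
    """Inputs plus weights plus model (weighted sum and sigmoid) give the output."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


def train_step(weights, inputs, target, rate=0.5):
    """Move every weight against the error, in proportion to its input."""
    error = predict(weights, inputs) - target
    return [w - rate * error * x for w, x in zip(weights, inputs)]


# Made-up training data: (inputs, correct answer). The last input is a constant bias.
samples = [([1.0, 0.0, 1.0], 1.0), ([0.0, 1.0, 1.0], 0.0)]

weights = [0.0, 0.0, 0.0]
for epoch in range(200):
    for inputs, target in samples:
        weights = train_step(weights, inputs, target)

print([round(predict(weights, inputs), 2) for inputs, _ in samples])
```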
I do the plotting with the pandas library: show how well the model is performing after each iteration over the training data. Our first model starts out bad, and improves only a little bit. A second model, with an extra layer of nodes, performs much better at first and still improves afterwards.
So: keep adding layers for more accuracy? No, because you run out of memory. We can use a convolutional network. Inspired by the visual cortex. In this model, we use fewer connections: the nodes in the second layer are only connected to three nodes, instead of all. And we use the same three weights everywhere.
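A one-dimensional sketch of this idea: every output node looks at three neighbouring inputs, and the same three weights are reused at every position. The input row and the weights are made up for the example.

```python
def conv1d(inputs, kernel):
    """Slide the kernel over the inputs; each output sees len(kernel) inputs."""
    width = len(kernel)
    return [sum(k * x for k, x in zip(kernel, inputs[i:i + width]))
            for i in range(len(inputs) - width + 1)]


row = [0, 0, 1, 1, 1, 0, 0]   # made-up row of pixel values
edge_kernel = [-1, 0, 1]      # the same three weights, used everywhere

print(conv1d(row, edge_kernel))
# prints [1, 1, 0, -1, -1]
```

A fully connected layer from seven inputs to five outputs would need 35 weights. Here there are only three, which is where the memory saving comes from.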
Machine learning is a very powerful tool, in solving all sorts of tasks. And it is a creative endeavour: how do you compose your models?
Overfitting: the model knows the training data too well, performing well on this data and poorly on others. How to deal with this? Split the data into data for training and data for testing. Or dropout: obscure part of the image when training.
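Splitting the data can be sketched as follows. The function name, the 20 percent test fraction and the fixed seed are assumptions for the example; the libraries mentioned above have their own helpers for this.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold back a fraction of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split every run
    n_test = int(round(len(shuffled) * test_fraction))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(10))
print(len(train), len(test))
# prints 8 2
```

The model is trained on the first part only. If it scores much better on the training part than on the held-back part, it is overfitting.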
Slides are at https://github.com/bitwise-ben/pygrunn
My website: http://lambda-ds.com
Reinout van Rees - Improve your django admin: big gains with little effort
Reinout van Rees talks about improving your django admin: big gains with little effort, at PyGrunn.
The admin interface is Django's wrapper around the database. It has lots of possibilities that usually just take a few lines of code for a big improvement.
Three parts: you have the main admin page, a list of objects, and an object edit page.
What I tell you is all in the documentation of the Django admin.
Main admin page:
- In admin.py register all your models so they end up here.
- Give a verbose_name and verbose_name_plural in models.py in the Meta.
List of objects. I usually change these five items:
- Add a __unicode__ method on your model. Otherwise you see Token object as name. Don't make it too long, needing to access five database models for your name.
- Sort order. In your model Meta add something like: ordering = ('name',). Same in your admin model if you need a different order there.
- Columns. In the admin model specify fields to show: list_display = ['name', 'url']. These can also be method names defined in your model.
- Search. Please add a search box in admin: search_fields = ['name', 'url']. For this half minute of work I got lots of praise from colleagues.
- Filters. In the admin add a filter: list_filter = ['portals', 'organisations'].
All these are just one line. If you take over a project, this is a nice way of improving it and at the same time getting to know the database.
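Put together, the one-liners for the main page and the list of objects could look roughly like this. It is a fragment for a Django project, not a standalone script. The Token model and its fields are hypothetical, chosen to match the examples above, and the Portal and Organisation models are assumed to exist in the same app.

```python
# models.py
from django.db import models


class Token(models.Model):
    name = models.CharField(max_length=100)
    url = models.URLField()
    portals = models.ManyToManyField("Portal", blank=True)
    organisations = models.ManyToManyField("Organisation", blank=True)

    class Meta:
        verbose_name = "token"
        verbose_name_plural = "tokens"
        ordering = ("name",)          # default sort order

    def __str__(self):                # __unicode__ on Python 2
        return self.name              # cheap: no extra database queries


# admin.py
from django.contrib import admin
from .models import Token


@admin.register(Token)                # makes the model show up on the main page
class TokenAdmin(admin.ModelAdmin):
    list_display = ["name", "url"]                # columns
    search_fields = ["name", "url"]               # search box
    list_filter = ["portals", "organisations"]    # filters in the sidebar
```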
Object edit page:
- Order of fields. Default is just the order in which fields are specified in your model. So in your model put them in the right order. In admin you can add fieldsets to group fields, though this is not really one line.
- Specify verbose_name, help_text on the model: looks much nicer in the admin interface.
- readonly_fields in admin.
- Easier selection. You may have a dropdown of four hundred items, hard to select. But you can use a nicer widget: filter_horizontal with list of fields in admin.
- Faster selection. Tens of thousands of related model objects in a dropdown can make it very slow to display. Solution: raw_id_fields in admin.
- Inline objects. Edit an object and on this edit page add a different model inline. In the admin define one model as TabularInline and add inlines on the parent admin model.
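A sketch of what these options look like in admin.py. Again this is a fragment for a Django project, and all model and field names are hypothetical: it assumes an Organisation model with name, url, created, portals (many-to-many) and owner (foreign key) fields, and a Token model with a foreign key to Organisation.

```python
# admin.py
from django.contrib import admin
from .models import Organisation, Token


class TokenInline(admin.TabularInline):
    model = Token                     # Token needs a ForeignKey to Organisation
    extra = 1                         # number of empty rows to show


@admin.register(Organisation)
class OrganisationAdmin(admin.ModelAdmin):
    fieldsets = [                     # group the fields on the edit page
        (None, {"fields": ["name", "url", "created"]}),
        ("Access", {"fields": ["portals", "owner"]}),
    ]
    readonly_fields = ["created"]     # shown, but not editable
    filter_horizontal = ["portals"]   # nicer widget for many-to-many fields
    raw_id_fields = ["owner"]         # id input instead of a huge dropdown
    inlines = [TokenInline]           # edit tokens on the organisation page
```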
Others:
- Queryset filtering, to see only items that you have for example edit permission for.
- Are you missing the option to add a model? Then you are missing an admin model.
- You can define actions.
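Queryset filtering and actions take a few more lines. A rough sketch, as a fragment for a Django project: the Token model with an owner and an active field is hypothetical, and so is the rule that non-superusers only see their own tokens.

```python
# admin.py
from django.contrib import admin
from .models import Token


@admin.register(Token)
class TokenAdmin(admin.ModelAdmin):
    actions = ["deactivate"]

    def get_queryset(self, request):
        """Superusers see everything, other staff only their own tokens."""
        queryset = super(TokenAdmin, self).get_queryset(request)
        if request.user.is_superuser:
            return queryset
        return queryset.filter(owner=request.user)

    def deactivate(self, request, queryset):
        """Action that appears in the dropdown above the list of objects."""
        queryset.update(active=False)
    deactivate.short_description = "Deactivate selected tokens"
```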
So: you can do various really easy things to make the django admin much nicer.
If you want to give users access to the admin, you can give them staff status and give permissions to view, add, or edit.
Documentation: https://docs.djangoproject.com/en/1.9/ref/contrib/admin/
Twitter: @reinoutvanrees
