Python Meetup 22 November 2017
Summary of the meeting of the Amsterdam Python Meetup Group on 22 November 2017.
Byte is not really a hosting company anymore; it is moving to the cloud. We use lots of Python, and Magento. We are creating our own service panel. We use tools like Django, SQLAlchemy and Celery, so we like to learn from you.
Wouter van Atteveldt: Large scale search and text analysis with Python, Elastic, Celery and a bit of R
I got my PhD in artificial intelligence. I am now a social scientist.
Why do you need text analysis? An example: in 2009 Israel and Palestine were fighting. How did the media report on it? I downloaded lots of articles and did text analysis:
- U.S. media: lots about Hamas and their rocket attacks that provoked Israel.
- Chinese media: more about the Gaza strip, actions on the ground by Israel.
So you can learn from text analysis. There is a flood of digital information, and you cannot read it all, so you use text analysis to get an overview. You need to go from text to structured data.
Facebook did research: they changed the timelines of 600,000 people; some were shown more positive messages, some more negative. More positive messages resulted in fewer negative messages written by those people, but also fewer positive messages [if I saw it right]. Lots of things were wrong with this research.
We can do much better, with Python. Python is platform independent, easy to learn. Half the data science community is in Python.
Or we can do stuff with the R language. It is aimed at statistics and data. There is convergence via numpy and pandas and R packages.
I helped create https://amcat.nl/ where you can search data, for example finding articles about Dutch politics in the Telegraaf. You can search there for 'excellentie', and see that this Dutch polite term for ministers was used until the sixties, and then resurfaced recently when a minister wanted to reintroduce this, and got satirical comments.
AmCat is a text search front end and an API, written in Python and R scripts on top of Elasticsearch. We upload texts to Elasticsearch on two separate nodes. You can never change an article: articles are unique and hashed, so you get the same results when you search again next year.
For the Telegraaf paper, there is no data between 1995 and 1998. The previous years were digitised by a library, and the later years by Telegraaf themselves, but the current owner is not interested in filling the gap.
Our script interface is a Django form plus a 'run' method. This means our CLI, HTTP and API front end can use the same code.
For async jobs we use Celery. I know a professor who likes to write a query of several pages, which can take more than an hour to handle, so he would normally get a timeout. So we do that asynchronously.
We want to do NLP (Natural Language Processing). Or preprocessing. There are many good tools, like Stanford CoreNLP for English. We developed NLPipe, a simple NLP job manager. It caches results. It is separated into a server and workers. Workers can run on AWS via Docker, but since we have no money we are looking at SURF HPC; they don't support Docker, so we are looking at Singularity instead. Experts welcome.
[Same talk as on PyGrunn this year. I copied my summary and extended it.]
Goal: show what is possible. Everything is in the Django documentation. Just remember a few things you see here. If you know it is available, you can look it up.
The example case I will use is a time registration system. Everyone seems to build one of these. Oh, fewer hands here than at PyGrunn. The tables we will use are person, group, project and booking. A Person belongs to a Group. A Booking belongs to a Project and a Person.
The Django ORM gives you a mapping between the database and Python. You should not write your own SQL: Django writes pretty well optimised SQL.
Show all objects:
from trs.models import Person, Project, Booking
Person.objects.all()
case insensitive searching for part of a name:
Person.objects.filter(name__icontains='something')
or part of a group name:
Person.objects.filter(group__name__icontains='something')
name starting with:
Person.objects.filter(name__istartswith='a')
- sometimes .exclude() is easier, the reverse of filter
- you can stack: .filter().filter().filter()
- query sets are lazy: only really executed at the moment you need it. You can use that for readability: just assign the query to a variable, to make complicated queries more understandable
- start with the model you want
This will use one initial query and then one extra query for each person:
systems = Person.objects.filter(group__name='Systemen')
for person in systems:
    print(person.name, person.group.name)
Not handy. Instead use the following.
select_related: does a big join in SQL so you get one set of results:
for person in systems.select_related('group'):
    print(person.name, person.group.name)
This does one query.
prefetch_related: does one query for one table, and then one query to get all related items:
for person in systems.prefetch_related('group'):
    print(person.name, person.group.name)
This does two queries. Both can be good, depending on the situation.
It is expensive to instantiate a model. If you need only one or two fields, Django can give you a plain dictionary or list instead.
List of several fields in tuples:
Person.objects.all().values_list('name', 'group__name')
Single list of values for a single field:
Person.objects.all().values_list('name', flat=True)
Annotation and aggregation:
- annotate: sum, count, avg
- groupby via values (bit of a weird syntax)
Aggregation gives totals:
from django.db.models import Sum
relevant_persons = Booking.objects.filter(
    booked_by__group__name='Systemen')
relevant_persons.aggregate(Sum('hours'))
Annotation adds extra info to each result row:
Filter on bookings for maternity leave, group bookings by year, give sums:
Booking.objects.filter(
    booked_on__description__icontains='ouderschap'
).values(
    'booked_by__name', 'year_week__year'
).annotate(Sum('hours'))
Note: I used the faker library to replace the names of actual coworkers in my database with random other names.
Practice this with your own code and data! You'll get the hang of it and get to know your data and it is fun.
What I hope you take away from here:
- Read the docs, you now have the overview.
- Make your queries readable.
- Practice, practice, practice.
[I will toss something extra in from PyGrunn, which was probably a question from the audience.]
If you need to do special queries, you can create a sub query yourself:
from django.db.models import Q
query = Q(group__name='Systemen')
Person.objects.filter(query)
That way you can build filters that plain keyword arguments to .filter() cannot express.
I worked at Byte and am now working at Optiver, in the tooling team, using Python for data center management. I use Flask and Click, made by the same people. Our sd tool is internal; it is not on the Internet. It deploys Python applications to servers with virtualenv.
On the server side we use Stash, Artifactory, Ansible plus Jenkins, Supervisord and JIRA. On the user side we have our sd tool to talk to the programs on the server.
You would normally need passwords on some servers, auth keys or ssh keys on others. Some ports open, a different one for each new app. Messy.
So sd talks to an api that sits in between. The api then handles the messy talking to the servers, and you store authentication and port numbers in a central config instead of on the computer of each user. All users talk to the api, and the api talks to the servers.
For the server side api we use flask and flask_restful. For the client side we use click.
When you install Django, you get a house. When you install Flask, you get a brick. Sometimes you want a house. Other times all you need is a brick. You can build something entirely different with a brick. It is easy.
Making an api is done easily with:
from flask_restful import Api, Resource
Then click for the command line tool. I always struggle with argparse, but I like working with click:
import click

@click.command()
@click.option('--count', default=1, help='...')
def main(count):
    ...
Click makes you structure your application in a nice way:
tool/
    cli.py
    actions/
        action1.py
        action2.py
We use a trick to force users to upgrade. With every request to the api we send the version of our cli. The api checks this against a minimum version and aborts if the version is too old.
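A sketch of that check in plain Python; the version numbers are made up. Comparing tuples of ints rather than strings keeps '1.10.0' newer than '1.9.9':

```python
# minimum version the api still accepts (made-up value)
MINIMUM_VERSION = (1, 4, 0)

def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def version_ok(client_version):
    """The api would abort the request when this returns False."""
    return parse_version(client_version) >= MINIMUM_VERSION

assert version_ok("1.4.2")
assert not version_ok("1.3.9")
```

Raising MINIMUM_VERSION on the server is then all it takes to force every user onto a newer cli.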