Weblog
Keynote Google - Machine Learning APIs for Python Developers
Keynote talk from Google about Machine Learning APIs for Python Developers, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
Lee Boonstra and Dmitriy Novakovskiy give this talk; they work for Google Cloud, one of the gold sponsors of PyGrunn.
Python at Google
Google loves Python. :) It is widely used internally and externally. We sponsor conferences. We have open source libraries, like the Google Data Python Client Library, and libraries for YouTube, App Engine, etcetera. We use it for build systems, report generation, log analysis, etc.
How can you use Google Cloud Platform for your app or website? You can deploy at scale. You can embed intelligence powered by machine learning; we provide multiple pre-trained models. And you can use serverless data processing and analytics.
Machine learning
Let me explain it in a simple way. You want to teach something to a kid: what is a car, what is a bike? You point at a car or a bike and explain what it is called. With machines we feed in lots of data and they start to see patterns.
- Artificial intelligence: process of building smarter computers
- Machine learning: process of making a computer learn
Machine learning is much easier.
Our CEO: "We no longer build mobile-first applications, but AI-first."
We have a lot of data, better models, and more computing power. That is why machine learning is happening a lot now.
Google created the open source TensorFlow Python framework for machine learning. And we have hardware to match. We have ready-to-use models for vision, speech, jobs, translation, natural language, and video intelligence.
- Vision API: object recognition (landmarks), sentiment on faces, extract text, detect inappropriate content. Disney game: search with your phone for a chair, and we show a dragon on the chair. Point your camera at a house, and you see a price tag.
- Speech API: speech recognition (write out the transcript, support for 80 languages).
- Natural language API: really understand the text that is written, recognise nouns and verbs and sentiment.
- Translation API: realtime subtitles, automatic language translation using more context than earlier versions.
- Beta video intelligence: label detection, enable video search (in which frame did the dog first appear).
Demo
Go to the Google Cloud console and create a free account to play with. You need to enable the APIs that you want to use. Install the command line tools if you want to run it on your local machine. And pip install google-cloud.
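As a hedged illustration (not shown in the talk): once the Translation API is enabled and credentials are set up via the command line tools, a call can look roughly like this. The Dutch sample sentence is made up, and newer versions of the client library moved this class to the translate_v2 module:
from google.cloud import translate  # newer library versions: translate_v2

# Uses the credentials configured with the gcloud command line tools.
client = translate.Client()

# Translate a made-up Dutch sentence to English.
result = client.translate('Fijne conferentie!', target_language='en')
print(result['translatedText'])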
We use machine learning for example in Gmail to suggest a possible reply to an email you receive.
Walkthrough of machine learning and TensorFlow
Google Cloud Dataflow. Dataflow is a unified programming model for batch or stream data processing. MapReduce-like operations. Parallel workloads. It is open sourced as Apache Beam, and you can run it on Google Cloud Platform.
You put files in Cloud Storage. Process this in batches, with Python and Dataflow. This uses pre-trained machine learning models. Then store results in BigQuery, and visualize the insights in Data Studio.
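A minimal sketch of what such a pipeline could look like with the Apache Beam Python SDK; the bucket, table name, and processing step are hypothetical stand-ins:
import apache_beam as beam

def label_image(gcs_path):
    # Placeholder for a call to a pre-trained model such as the Vision API.
    return {'file': gcs_path, 'label': 'unknown'}

with beam.Pipeline() as pipeline:
    (pipeline
     | 'list files' >> beam.io.ReadFromText('gs://my-bucket/file-list.txt')
     | 'label' >> beam.Map(label_image)
     | 'to BigQuery' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.labels',
           schema='file:STRING,label:STRING'))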
Reinout van Rees - Querying Django models: fabulous & fast filtering
Reinout van Rees talks about querying Django models: fabulous & fast filtering
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
Goal: show what is possible. Everything is in the Django documentation. Just remember a few things you see here.
Example case: time registration system. Everyone seems to do this. A Person belongs to a Group. A Booking belongs to a Project and a Person.
Django's ORM gives you a mapping between the database and Python objects.
standard:
Person.objects.all()
basic filtering:
Person.objects.filter(group=1)
specific name:
Person.objects.filter(group__name='Systemen')
case insensitive searching for part of a name:
Person.objects.filter(group__name__icontains='onderhoud')
name starting with:
Person.objects.filter(name__startswith='Reinout')
without group:
Person.objects.filter(group__isnull=True)
Filtering strategy:
- sometimes .exclude() is easier
- you can stack: .filter().filter().filter()
- querysets are lazy: they are only executed at the moment you actually need the results
- just assign the query to a variable to make complicated queries more understandable (see the sketch after this list)
- start with the model you want
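A small sketch of those last points, using the models from the example case (the filter values are made up):
# Stack filters step by step; nothing hits the database yet.
persons = Person.objects.filter(group__name='Systemen')
persons = persons.exclude(name__icontains='test')
persons = persons.filter(group__isnull=False)

# The query only runs here, when the results are actually needed.
for person in persons:
    print(person.name)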
Speed:
select_related: does a big join in SQL so you get one set of results
prefetch_related: does one query for one table, and then one query to get all related items
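A hedged sketch of the difference, assuming Booking has foreign keys booked_by (Person) and booked_on (Project), and Person has a reverse relation called bookings (as in the annotate examples below):
# One SQL query with joins; each booking's person and project come along.
bookings = Booking.objects.select_related('booked_by', 'booked_on')

# One query for the persons plus one extra query for all their bookings.
persons = Person.objects.prefetch_related('bookings')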
if you need only one or two fields, Django does not need to instantiate a model, but can give you a plain dictionary or list instead:
Person.objects.filter(group__name='Systemen').values('name', 'group__name')
Person.objects.filter(group__name='Systemen').values_list('name', 'group__name')
Person.objects.filter(group__name='Systemen').values_list('group__name', flat=True)
Annotation and aggregation:
- annotate: sum, count, avg
- aggregation
- groupby via values (bit of a weird syntax)
Aggregation gives totals:
from django.db.models import Sum
Booking.objects.filter(
booked_by__group__name='Systemen'
).aggregate(Sum('hours'))
Annotation adds extra info to each result row:
Person.objects.filter(
    group__name='Systemen'
).annotate(Sum('bookings__hours'))[10].bookings__hours__sum
Group bookings by year, give sums:
Booking.objects.filter(
    booked_on__description__icontains='Zwanger'
).values(
    'booked_by__name', 'year_week__year'
).annotate(Sum('hours'))
Practice this with your own code and data! You'll get the hang of it, get to know your data, and it is fun.
If you need to do special queries, you can build up a query object (Q) yourself:
from django.db.models import Q
query = Q(group__name='Systemen')
Person.objects.filter(query)
That way you can write filters that are not available in default Django.
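For example, Q objects can be combined with | and & to express OR conditions that plain .filter() keyword arguments cannot (a small sketch with the same models):
from django.db.models import Q

# Persons in the 'Systemen' group OR without any group at all.
query = Q(group__name='Systemen') | Q(group__isnull=True)
Person.objects.filter(query)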
Twitter: @reinoutvanrees
Òscar Vilaplana - Let's make a GraphQL API in Python
Òscar Vilaplana talks about making a GraphQL API in Python, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
This talk is about GraphQL, Graphene, and Python.
I care a lot about value. If something has no value, why are you doing it?
"Our frontend engineers want us to use GraphQL."
Sample case: there are friends, they have plans, the plans are at a location. So various relations between friends, plans and locations.
With REST you usually fetch too much or too little; you need many calls; there is some documentation but no real standard; it is hard to discover; there is no real standard client; and a lot of postprocessing and decisions are needed.
So you can try to fix some stuff, giving options to include more data, or not include that many fields. I don't really like it. What can we do?
If you go back to the data, you can see a graph: data that is linked to each other.
"Our frontend engineers want us to use GraphQL. They can just ask for what they want."
In the backend you are trying to decide or guess what the client wants. The client wants a nice-looking website. What we have is a bunch of data in too many boring tables.
GraphQL is a query language for graphs. You can ask stuff like this and get data in this format back:
{
  plans {
    name
    description
    creator {
      name
    }
  }
}
You define possible queries:
type Query {
  plans(limit: Int): [Plan]
}
type Plan {
...
}
With Graphene you can do this in Python. And there is Django support in graphene_django, to easily wrap this around some models. It is smart about combining queries.
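A minimal sketch of how this could look in plain Graphene; the Plan type, the resolver, and the in-memory data are made up for illustration (graphene_django would instead derive the types from Django models):
import graphene

class Plan(graphene.ObjectType):
    name = graphene.String()
    description = graphene.String()

class Query(graphene.ObjectType):
    plans = graphene.List(Plan, limit=graphene.Int())

    def resolve_plans(self, info, limit=None):
        # Made-up data; a real resolver would query a database.
        data = [Plan(name='PyGrunn', description='One-day Python conference')]
        return data[:limit] if limit is not None else data

schema = graphene.Schema(query=Query)
result = schema.execute('{ plans { name description } }')
print(result.data)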
GraphQL makes it easier to expose data. It is closer to the data, so less waste. Easy to get started.
You can play with the GitHub GraphQL api.
Twitter: @grimborg
Jonathan Barnoud - Looking at molecules using Python
Jonathan Barnoud talks about looking at molecules using Python, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
In this presentation, I will demonstrate how Python can be used throughout the workflow of molecular dynamics simulations. We will see how Python can be used to set up simulations, and how we can visualize simulations in a Jupyter Notebook with NGLView. We will also see the MDAnalysis library for writing analysis tools, and datreant for organizing the data.
I work at the University of Groningen. I look at fats and proteins, at the level of molecules and atoms. We can simulate them using molecular dynamics. Force equals mass times acceleration (F = m*a). We need initial positions and initial velocities.
My workflow: prepare the system, run a simulation, visualise and analyse in a Jupyter notebook (which may take several iterations through this cycle), and then I can write a report.
Preparing a simulation: what is the topology, what are the initial coordinates, what are the simulation parameters? I use some Bash and Python scripts to prepare those text files. These go into the simulation engine, which gives as output a trajectory: how all those molecules move over time.
There are lots of simulation engines, which need different file formats as input, and give different output formats. So I use Python to create a library that abstracts these differences away.
One such library is MDAnalysis. The main object is a universe, with a topology and a trajectory. The universe is full of atoms. Each atom has attributes attached to it, like name, position, and mass. Everything is in arrays. You can select atoms: universe.select_atoms('not resname SOL'). Sample code:
for time_step in universe.trajectory[:10]:
    print(universe.atoms[0].position)
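A slightly fuller, hedged sketch of the MDAnalysis usage described above; the file names are hypothetical:
import MDAnalysis as mda

# A topology plus a trajectory file define the universe (hypothetical names).
universe = mda.Universe('system.gro', 'trajectory.xtc')

# Select everything that is not solvent, as in the talk.
not_solvent = universe.select_atoms('not resname SOL')

# Positions are numpy arrays and update as you step through the trajectory.
for time_step in universe.trajectory[:10]:
    print(not_solvent.positions.mean(axis=0))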
NGLView can show a system from MDAnalysis (or other libraries) in the notebook, using a JavaScript library to visualise it.
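In a notebook cell that could look roughly like this (reusing the universe object from the sketch above):
import nglview as nv

view = nv.show_mdanalysis(universe.select_atoms('not resname SOL'))
view  # the widget renders when it is the last expression in the cell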
Now you may end up with lots of simulation data in lots of directories and files. Your filesystem is now a mess! So we use datreant. (A treant is a talking tree in Dungeons and Dragons.) This helps you discover where the outcome of which simulation lives, and access the data from it.
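A rough, hedged sketch of the idea with datreant.core; the directory layout and metadata are made up:
import datreant.core as dtr

# Mark a simulation directory as a Treant and attach searchable metadata.
sim = dtr.Treant('simulations/run-042')
sim.tags.add('production')
sim.categories['temperature'] = 300

# Later: rediscover all Treants under a directory tree.
bundle = dtr.discover('simulations/')
for treant in bundle:
    print(treant.name)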
To conclude:
- Python is awesome.
- Jupyter is awesome too. [See also the talk about a billion stars earlier today.]
- The Python science stack is awesome as well.
- Each field develops awesome tools based on the above.
Maarten Breddels - A billion stars in the Jupyter Notebook
Maarten Breddels talks about a billion stars in the Jupyter Notebook, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
How do you deal with a billion stars in the data of your universe?
The Gaia satellite is scanning the skies. It was launched by ESA in 2013. From this satellite we have data on a billion stars, and soon this will be more. We want to visualise this, explore the data, 'see' the data.
If you give each star a single pixel and plot them all, you get a totally black figure. So we use different colours when there are more stars at the same spot.
The data we need for this is about 15 GB. Memory bandwidth is 10-20 GB/s, so just reading it takes about a second. A multicore CPU (a few GHz, 4-8 cores) gives roughly 12-24 billion cycles per second, so you only have a cycle or two per byte.
Storage: native, column-based files. The normal (POSIX read) method has lots of overhead: data goes from disk to the OS cache and is then copied to your memory. So get direct access to that cache (memory mapping) to speed this up.
Visualisation can be 0-3 dimensional, with different calculation costs.
Solution: vaex. It is a Python library, like pandas but for larger datasets. Everything is computed in chunks, so as not to waste memory. For multiple dimensions.
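A hedged sketch of the vaex style of working; vaex.example() loads a small built-in dataset, whereas the real data would be opened from a (hypothetical) HDF5 file with vaex.open:
import vaex

df = vaex.example()  # or: vaex.open('gaia.hdf5')

# Statistics are computed in chunks, without loading everything into memory.
print(df.mean(df.x), df.std(df.y))

# A 2D histogram of counts per bin: the basis for 'darker where more stars'.
counts = df.count(binby=[df.x, df.y], shape=(256, 256))
print(counts.shape)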
[Demo of cool things in a Jupyter notebook in the browser.]
I wrote ipyvolume to visualize 3D volumes and glyphs in the Jupyter notebook.
Since it works in the browser, you can also use it on your phone, also in 3D with a VR device.
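A tiny, hedged ipyvolume example, with random points standing in for stars:
import numpy as np
import ipyvolume as ipv

# 10,000 random points as stand-ins for stars.
x, y, z = np.random.normal(size=(3, 10000))
ipv.quickscatter(x, y, z, size=1, marker='sphere')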
