Jonathan Barnoud - Looking at molecules using Python

published May 19, 2017

Jonathan Barnoud talks about looking at molecules using Python, at PyGrunn.

See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.

In this presentation, I will demonstrate how Python can be used throughout the workflow of molecular dynamics simulations. We will see how Python can be used to set up simulations, and how we can visualize simulations in a Jupyter Notebook with NGLView. We will also use the MDAnalysis library to write analysis tools, and datreant to organize the data.

I work at the University of Groningen. I look at fat and proteins, at the level of molecules and atoms. We can simulate them using molecular dynamics. Force is equal to the mass times the acceleration (F = m*a), so from the forces on each atom we can compute how it moves, step by step. We need initial positions and initial velocities.
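
As a minimal sketch of what F = m*a means in practice (this is not the code of any real simulation engine): given initial positions and velocities, the force gives the acceleration, which updates both step by step. The example assumes a single particle on a spring in one dimension and uses velocity Verlet integration, a common scheme in molecular dynamics. All names are made up for illustration.

```python
def spring_force(x, k=1.0):
    """Force of a harmonic spring pulling the particle back to x = 0."""
    return -k * x


def simulate(x, v, mass=1.0, dt=0.01, n_steps=1000, force=spring_force):
    """Integrate F = m*a with the velocity Verlet scheme.

    Returns the trajectory as a list of (position, velocity) tuples,
    starting with the initial state.
    """
    trajectory = [(x, v)]
    a = force(x) / mass  # a = F / m
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt  # new position
        a_new = force(x) / mass  # force at the new position
        v = v + 0.5 * (a + a_new) * dt  # new velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the particle on the spring."""
    return 0.5 * mass * v * v + 0.5 * k * x * x
```

The list returned by simulate() plays the role of the trajectory: one frame per time step. A quick sanity check is that total_energy() stays nearly constant along it.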

My workflow: prepare the system, run a simulation, visualise and analyse in a Jupyter notebook. This may take several loops through these steps, and then I can write a report.

Preparing a simulation requires a topology, the initial coordinates, and the simulation parameters. I use some bash and Python scripts to prepare those text files. These go into the simulation engine, which gives a trajectory as output: how all those molecules will move.

There are lots of simulation engines, which need different file formats as input, and give different output formats. So I use Python to create a library that abstracts these differences away.

One such library is MDAnalysis. The main object is a universe, with a topology and a trajectory. The universe is full of atoms. Each atom has attributes attached to it, like name, position, mass. Everything is in arrays. You can select atoms: universe.select_atoms('not resname SOL'). Sample code:

for time_step in universe.trajectory[:10]:
    print(universe.atoms[0].position)

NGLView can show data from MDAnalysis (or other libraries), using a JavaScript library to visualise it.

Now you may end up with lots of simulation data in lots of directories and files. Your filesystem is now a mess! So we use datreant. (A treant is a talking tree in Dungeons and Dragons.) This helps you to find out where the outcome of each simulation is, and to access the data from it.

To conclude:

Python is awesome.
Jupyter is awesome too. [See also the talk about a billion stars earlier today.]
The Python science stack is awesome as well.
Each field develops awesome tools based on the above.

Maarten Breddels - A billion stars in the Jupyter Notebook

published May 19, 2017

Maarten Breddels talks about a billion stars in the Jupyter Notebook, at PyGrunn.

How do you deal with a billion stars in the data of your universe?

The Gaia satellite, launched by ESA in 2013, is scanning the skies. From this satellite we have data on a billion stars, and soon there will be more. We want to visualise this, explore the data, 'see' the data.

If you give each star a single pixel, and plot them all, you get a totally black figure. So we can use different colours when there are more stars at the same spot.
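
A minimal sketch of this idea in plain Python, with made-up names: count how many stars fall into each pixel, then map the count to a colour intensity. The logarithmic scaling is an assumption; the talk does not say which scaling was used. With NumPy, numpy.histogram2d does the counting step.

```python
import math


def count_per_pixel(points, width, height, x_range, y_range):
    """Count how many (x, y) points land in each pixel of a width x height grid."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # outside the picture
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


def to_intensity(grid):
    """Scale counts to 0..1 with log(1 + n), so dense pixels do not drown out the rest."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0] * len(row) for row in grid]
    return [[math.log1p(n) / math.log1p(peak) for n in row] for row in grid]
```

The intensity grid can then be fed to any colour map: empty pixels get 0, the densest pixel gets 1.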

The data we need for this is about 15 GB. Memory bandwidth is 10-20 GB/s, so reading it takes about 1 second. On the CPU side, with 2 GHz and 4-8 cores, that leaves on the order of 12-24 cycles per star.

Storage: native, column based. The normal method (POSIX read) has lots of overhead: data is copied from disk to the OS cache, and then to the memory of your program. So get direct access to the cache to speed this up.
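
One way to get this kind of direct access is memory mapping, which is assumed here to be the mechanism meant. The sketch below uses only the standard library and hypothetical helper names: a column of 64-bit floats is stored as raw bytes, and reading it back goes through a memory map instead of read(), so the values are viewed in place rather than copied into a new buffer first.

```python
import mmap
from array import array


def write_column(path, values):
    """Store one column as raw native 64-bit floats, nothing else."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def column_sum(path):
    """Sum a column through a memory map instead of read()."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = memoryview(mm)
            column = raw.cast("d")  # view the bytes as doubles, no copy
            try:
                return sum(column)
            finally:
                # Views must be released before the map can be closed.
                column.release()
                raw.release()
```

With NumPy, numpy.memmap offers the same idea with array semantics.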

Visualisation can be 0-3 dimensional, with different calculation costs.

Solution: vaex. It is a Python library, like pandas but for larger datasets. Everything is computed in chunks, so as not to waste memory. It handles multiple dimensions.
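
The chunking idea can be sketched without vaex. This is not vaex's API and the function names are illustrative: only one chunk is held in memory at a time, and small running totals are combined at the end.

```python
from itertools import islice


def chunks(values, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size=1_000_000):
    """Mean of a (possibly huge) stream, keeping only one chunk in memory."""
    total = 0.0
    count = 0
    for chunk in chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count
```

The same pattern works for any statistic that can be built from per-chunk partial results, such as counts per pixel.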

[Demo of cool things in a Jupyter notebook in the browser.]

I wrote ipyvolume to visualize 3D volumes and glyphs in the Jupyter notebook.

Since it works in the browser, you can also use it on your phone, also in 3D with a VR device.

Kilian Evang - Viasock: Automagically Serverize Your Scripts

published May 19, 2017

Kilian Evang talks about Viasock: Automagically Serverize Your Scripts

Viasock's tagline: Automagically serverize your pipelines.

Pipelines. In the Parallel Meaning Bank, you input some text and it analyses each word. So each word goes through a pipeline.

The Unix philosophy (Doug McIlroy):

Write programs that do one thing and do it well.
Write programs to work together.
Write programs to handle text streams, because that is a universal interface.

A pipeline for us can be this:

$ cat data/01.txt | ./bin/tokenize -m models/tokenizer.model | ./bin/parse -m models/parser.model > out/01.parse

Get input, tokenize it, parse it, give output.

We might update our parsing module while the pipeline is running. We prefer not to redo the tokenize part then, especially if that has taken a long time.

We have a daemon that runs various make processes which update files, orchestrating which makefile is used for which document.

Problem: the tokenizer is meant to work on a big dataset, and needs ten seconds to start. If you run it on one sentence, it takes less than a second to process, but it still needs the ten-second startup. What to do?

We serverize it. The traditional approach would be to split the tokenizer into a server part and a client part. The server keeps running, only suffering the ten-second startup penalty once. The clients start up quickly, contact the server, and get an answer back quickly. Problem: you would need to split this up yourself. And you may need to do this for various tools.

So viasock was born. This is a client and server. The viasock server interacts with your normal, unchanged tool, the tokenizer in our case. The viasock server makes sure the tool is started once, and keeps it running. The viasock clients then talk to your tool via the viasock server.

There are limitations. Your tool must read from standard input, process what it read, write the result to standard output, and then repeat.
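
A tool that fits this pattern could look like the sketch below. It assumes that one line is one unit of work, which is an illustrative choice and not necessarily how viasock delimits requests. The uppercase transformation stands in for real processing.

```python
import sys


def load_model():
    """Stand-in for the slow startup step (for example loading a model)."""
    return str.upper


def serve(stdin=sys.stdin, stdout=sys.stdout):
    process = load_model()  # pay the startup cost once
    for line in stdin:  # read, process, write, repeat
        stdout.write(process(line.rstrip("\n")) + "\n")
        stdout.flush()  # do not keep the answer in a buffer
```

Flushing after every answer matters: a long-running tool that buffers its output would leave the client waiting.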

We want to automagically serverize tools, without you needing to keep track of changes in your tool, and without you having to make sure the viasock server is running before using the viasock client. So:

$ cat input1.txt | viasock run mytool -m mymodel > output1.txt
$ cat input2.txt | viasock run myothertool > output2.txt
$ cat input3.txt | viasock run mytool -m myothermodel > output3.txt

When needed, this starts a new instance of your tool, for example when the running instance is out of date. Any old instance is kept running until we notice that it has not had any new client connections for some time, and then we stop it.

Viasock calculates a SERVERID hash, based on your toolname, modification date, arguments, etcetera, and then uses or starts a server that listens on ./.viasock/sockets/$SERVERID.
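
A hypothetical sketch of how such an ID could be derived. The exact inputs and the hash function that viasock uses are not given in the talk; SHA-1 over the tool path, its modification time and its arguments is an assumption, and the function names are made up.

```python
import hashlib
import os


def server_id(tool_path, args):
    """Hash the tool path, its modification time and its arguments."""
    mtime = os.stat(tool_path).st_mtime_ns
    parts = [os.path.abspath(tool_path), str(mtime), *args]
    digest = hashlib.sha1()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ["ab", "c"] and ["a", "bc"] differ.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def socket_path(tool_path, args, base=".viasock/sockets"):
    """Where a server for this exact tool version and arguments would listen."""
    return os.path.join(base, server_id(tool_path, args))
```

Because the modification time is part of the hash, an updated tool gets a new ID, and so a fresh server, without any bookkeeping by the user.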

If you frequently run a program with high startup overhead on small data, and you don't want to split it into server/client, then give viasock a try.

See the code at https://github.com/texttheater/viasock

See the slides.

Jaap Bresser - Beyond Role Based Auth: Discretionary Access Control with Postgres & SQLAlchemy

published May 19, 2017

Jaap Bresser goes beyond role based authorization: Discretionary Access Control with Postgres & SQLAlchemy

I work at http://www.profileermij.nl and this talk is about what we did there. It is an online career platform.

Role based authentication: grant access based on a role, instead of on a user. Generally it is quite simple and coarse grained: for example, you can edit either all products or none. [I would call it authorization: what you are allowed to do instead of who you are.]

The core part of our website is the user profile, with hard skills (education, positions) and soft skills (personality, preferences). Sensitive data.

In the first version, we used role based authentication. Each customer has its own subdomain. Roles were: user, hr (human resources), admin.

What we wanted: allow users to share their profiles with others. So we looked at discretionary access control (DAC). The first mention I found was from the Department of Defense in 1985. The biggest thing: a subject can pass on permission to another subject. So you get authorization per object.

The Linux file system works like this, in a very basic way, where you can give access on a file to one user as well as one group or everyone.

Amazon Web Services has resource based permissions and identity based permissions, and lots more options. Quite complex, but needed for their wildly varying use cases.

For the profileermij case, we created an ACL (access control list) in a Postgres database. The app uses Flask. The ACL:

a principal uuid
actions: read, write, read_acl, write_acl
paths: dot separated, including wildcard and glob characters
a unique key for external reference
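
A minimal sketch of how such an ACL entry could be evaluated in Python. The field names, the example values and the use of fnmatch for the wildcard and glob characters are assumptions for illustration; the talk does not show the actual matching rules.

```python
from fnmatch import fnmatchcase

RECRUITER = "3f2b6c1e-0000-4000-8000-000000000001"  # made-up principal uuid

acl = [
    {
        "principal": RECRUITER,
        "actions": ["read"],
        "paths": ["profile.education.*"],
        "key": "share-with-recruiter",
    },
]


def is_allowed(acl, principal, action, path):
    """True if any ACL entry grants `action` on `path` to `principal`."""
    return any(
        entry["principal"] == principal
        and action in entry["actions"]
        and any(fnmatchcase(path, pattern) for pattern in entry["paths"])
        for entry in acl
    )
```

Note that with fnmatch a `*` also matches dots, so `profile.education.*` covers everything below that path, at any depth.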

For implementing the ACL, we use the JSONB data type and the JSONB containment operator @>, with a GIN index. You need Postgres version 9.4 or higher.

Sample query:

SELECT * FROM profile WHERE profile.acl @> {...};

In Python we use classes to wrap the JSON, with an SQLAlchemy TypeDecorator. We need to handle paths, and integrate the ACL in an existing application or framework. Processing a request would look like this:

incoming write request
validate against schema
convert to SQLAlchemy model
load ACL and check permissions
save SQLAlchemy object
return response

Takeaways from our experience:

DAC meets our requirements
Designing a UX that is understandable for lots of people is hard
The complexity can be managed.
Don't apply it everywhere. You don't need it everywhere. For our profiles we needed it, for most other data not.

[Audience: Django REST framework has this.]

Marco Vellinga - Creating abstraction between consumer and datastore

published May 19, 2017

Marco Vellinga talks about Creating abstraction between consumer and datastore, at PyGrunn.

Marco Vellinga is a programmer at Devhouse Spindle.

Default: a web app connects to an HTTP API that talks to your data store. And another web app talks to another HTTP API, and another, etcetera.

Better: let all apps talk to one data access layer. You need core logic, a contract on how apps can talk to you, validation, error handling.

Some cons:

It is quite difficult to find information about this on the internet.
It is a bit overkill for small projects.
It can take a long time to build before you get it right, at least longer than we expected.

In our solution, data needs to be (de)serializable in JSON. We are using marshmallow.

Filtering with and/or/not. We created FilterQL.
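
A minimal sketch of and/or/not filtering over plain dictionaries. The nested-dict query syntax below is invented for this example and is not FilterQL's actual syntax.

```python
def matches(record, query):
    """Evaluate a nested and/or/not query against one record (a dict)."""
    if "and" in query:
        return all(matches(record, q) for q in query["and"])
    if "or" in query:
        return any(matches(record, q) for q in query["or"])
    if "not" in query:
        return not matches(record, query["not"])
    # Leaf: every listed field must equal the given value.
    return all(record.get(field) == value for field, value in query.items())


def filter_records(records, query):
    """Return the records that satisfy the query, in their original order."""
    return [record for record in records if matches(record, query)]
```

For example, {"and": [{"city": "Groningen"}, {"not": {"active": False}}]} selects the active records from Groningen. A limitation of this sketch is that records cannot have fields named "and", "or" or "not".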

We wanted versioning of code paths and made a decorator for this. With that, you can call method.v1() and method.v2(). You can version classes or functions this way. We created versionary. You do want some deprecation policy here: having sixteen different versions of the same function seems like a bad idea.

We have something for validation too, but we are still working on that: it is too verbose currently.
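
A minimal sketch of how a decorator can make calls like func.v1() and func.v2() possible. This is not versionary's API; the class and method names are made up, and the sketch only covers plain functions.

```python
class Versioned:
    """Wrap a function so later versions can be attached as .v1, .v2, ..."""

    def __init__(self, func):
        self._latest = func
        self._count = 1
        self.v1 = func

    def version(self, func):
        """Register `func` as the next version and make it the default."""
        self._count += 1
        setattr(self, f"v{self._count}", func)
        self._latest = func
        return self  # keep the name bound to the Versioned object

    def __call__(self, *args, **kwargs):
        # Calling without an explicit version uses the newest one.
        return self._latest(*args, **kwargs)


@Versioned
def greet(name):
    return f"Hello {name}"


@greet.version
def greet(name):
    return f"Hi {name}!"
```

Old callers can keep using greet.v1("Ann") while new code calls greet.v2("Ann") or simply greet("Ann").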

In the end: is it worth building such an abstraction layer? The first day this went live, we found lots of mismatched data, so it helped us find errors in the data. If you can start from scratch, it helps.