Weblog
Maarten Breddels - A billion stars in the Jupyter Notebook
Maarten Breddels talks about a billion stars in the Jupyter Notebook, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
How do you deal with a billion stars in the data of your universe?
The Gaia satellite, launched by ESA in 2013, is scanning the skies. From this satellite we have data on a billion stars, and soon there will be more. We want to visualise this, explore the data, 'see' the data.
If you give each star a single pixel, and plot them all, you get a totally black figure. So instead we use different colours depending on how many stars are at the same spot.
The data we need for this is about 15 GB. Memory bandwidth is 10-20 GB per second, so reading it takes about 1 second. A CPU of 2 GHz with 4-8 cores then leaves only a small budget of cycles per star in that second, noted in the talk as 12-24 cycles.
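The numbers above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the one-second target and the assumption that every core runs at full clock speed are mine, not from the talk. With the 2 GHz written down here the budget comes out at 8-16 cycles per star; 12-24 is what a 3 GHz clock would give.

```python
STARS = 1_000_000_000      # one billion stars
DATA_GB = 15               # size of the data we need


def read_seconds(bandwidth_gb_per_s):
    """Seconds needed to stream all the data through memory once."""
    return DATA_GB / bandwidth_gb_per_s


def cycles_per_star(clock_ghz, cores, seconds=1.0):
    """CPU cycles available per star if all cores work for `seconds`."""
    return clock_ghz * 1e9 * cores * seconds / STARS


print(read_seconds(10), read_seconds(20))            # 1.5 and 0.75 seconds
print(cycles_per_star(2, 4), cycles_per_star(2, 8))  # 8.0 and 16.0
print(cycles_per_star(3, 4), cycles_per_star(3, 8))  # 12.0 and 24.0
```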
Storage: native, column based. Normal (POSIX read) method has lots of overhead: from disk to OS cache, to memory. So get direct access to the cache to speed this up.
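One common way to get at the data without the extra copy is memory mapping, where the pages in the OS cache are exposed directly to your program. The sketch below assumes that this is the kind of direct access meant here. It uses only the standard library, and the file name and column layout (raw 64-bit floats) are made up for the example.

```python
import mmap
import os
import struct
import tempfile

# Write a small column of raw float64 values (hypothetical file layout).
values = [0.5, 1.5, 2.5, 3.5]
path = os.path.join(tempfile.mkdtemp(), "x.column")
with open(path, "wb") as f:
    f.write(struct.pack(f"{len(values)}d", *values))


def column_sum(path):
    """Sum a float64 column through a memory map, without read() copies."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # View the mapped bytes as doubles; no data is copied here.
            with memoryview(mm).cast("d") as column:
                return sum(column)


print(column_sum(path))
```

For real work you would hand the mapped buffer to an array library instead of summing in pure Python, but the principle is the same.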
Visualisation can be 0-3 dimensional, with different calculation costs.
Solution: vaex. It is a Python library, like pandas but for larger datasets. Everything is computed in chunks, so as not to waste memory, and it works for multiple dimensions.
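To illustrate computing in chunks, and counting stars per spot instead of plotting single pixels, here is a minimal sketch that counts points on a 2D grid one chunk at a time, so only one chunk has to be in memory. It is plain Python and not the vaex API: the function names, grid size and random test data are assumptions for the example.

```python
import random


def count_on_grid(chunks, limits, shape):
    """Accumulate a 2D grid of counts from an iterable of (xs, ys) chunks."""
    (xmin, xmax), (ymin, ymax) = limits
    nx, ny = shape
    grid = [[0] * ny for _ in range(nx)]
    for xs, ys in chunks:                      # one chunk in memory at a time
        for x, y in zip(xs, ys):
            if xmin <= x < xmax and ymin <= y < ymax:
                # min() guards against rounding up at the upper edge.
                i = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
                j = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
                grid[i][j] += 1
    return grid


def random_chunks(n_chunks, chunk_size, seed=42):
    """Stand-in for reading columns from disk in pieces."""
    rng = random.Random(seed)
    for _ in range(n_chunks):
        xs = [rng.gauss(0, 1) for _ in range(chunk_size)]
        ys = [rng.gauss(0, 1) for _ in range(chunk_size)]
        yield xs, ys


grid = count_on_grid(random_chunks(10, 1000), ((-3, 3), (-3, 3)), (8, 8))
print(sum(map(sum, grid)), "of 10000 points fall inside the grid")
```

Each cell of the grid can then be mapped to a colour, so crowded spots stand out.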
[Demo of cool things in a Jupyter notebook in the browser.]
I wrote ipyvolume to visualize 3D volumes and glyphs in the Jupyter notebook.
Since it works in the browser, you can also use it on your phone, also in 3D with a VR device.
Kilian Evang - Viasock: Automagically Serverize Your Scripts
Kilian Evang talks about Viasock: Automagically Serverize Your Scripts, at PyGrunn.
Viasock's tagline: Automagically serverize your pipelines.
Pipelines. In the Parallel Meaning Bank, you input some text and it analyses each word. So each word goes through a pipeline.
The Unix philosophy (Doug McIlroy):
- Write programs that do one thing and do it well.
- Write programs to work together.
- Write programs to handle text streams, because that is a universal interface.
A pipeline for us can be this:
$ cat data/01.txt | ./bin/tokenize -m models/tokenizer.model | ./bin/parse -m models/parser.model > out/01.parse
Get input, tokenize it, parse it, give output.
We might update our parsing module while the pipeline is running. We prefer not to redo the tokenize part then, especially if that has taken a long time.
We have a daemon that runs various make processes which update files, orchestrating which makefile is used for which document.
Problem: the tokenizer is meant to work on a big dataset, and needs ten seconds to start. If you run it on one sentence, it takes less than a second to process, but it still needs the ten-second startup. What to do?
We serverize it. The traditional approach would be to split the tokenizer into a server and a client part. The server keeps running, only suffering the ten-second startup penalty once. The clients start up quickly, contact the server and get an answer back quickly. Problem: you would need to split this up yourself. And you may need to do this for various tools.
So viasock was born. This is a client and server. The viasock server interacts with your normal, unchanged tool, the tokenizer in our case. The viasock server makes sure the tool is started once, and keeps it running. The viasock clients then talk to your tool via the viasock server.
There are limitations. Your tool must read standard input, process it, output it on standard output, and then repeat.
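The core idea, starting a tool once and then feeding it many small requests, can be sketched in a few lines. This is not viasock's code: the socket layer between client and server is left out, the class name is made up, and the 'tool' is a tiny stand-in that uppercases each line it reads.

```python
import subprocess
import sys

# Hypothetical stand-in for a slow-starting tool: it reads one line from
# standard input, writes one line to standard output, and repeats.
TOOL = [sys.executable, "-u", "-c",
        "import sys\n"
        "for line in sys.stdin:\n"
        "    print(line.strip().upper(), flush=True)\n"]


class PersistentTool:
    """Start the tool once and reuse the same process for every request."""

    def __init__(self, command):
        self.proc = subprocess.Popen(
            command, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, bufsize=1)

    def process(self, line):
        """Send one line to the running tool and read one line back."""
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        """Closing standard input lets the tool finish its loop and exit."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


tool = PersistentTool(TOOL)          # the startup cost is paid here, once
for sentence in ["first sentence", "second sentence"]:
    print(tool.process(sentence))
tool.close()
```

This also shows why the limitation matters: the wrapper can only pair requests with answers if the tool answers each input before reading the next.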
We want to automagically serverize tools, without needing to keep track of changes in your tool and without having to make sure the viasock server is running before using the viasock client. So:
$ cat input1.txt | viasock run mytool -m mymodel > output1.txt
$ cat input2.txt | viasock run myothertool > output2.txt
$ cat input3.txt | viasock run mytool -m myothermodel > output3.txt
This starts a new instance of your tool when needed, for example when the running one is out of date. Any old instance is kept running, until we notice it does not get any new client connections for some time, and then we stop it.
Viasock calculates a SERVERID hash, based on your toolname, modification date, arguments, etcetera, and then uses or starts a server that listens on ./.viasock/sockets/$SERVERID.
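A hash like this could be computed roughly as follows. This is a guess at the idea, not viasock's actual code: the choice of SHA-1, the exact ingredients and both function names are assumptions, and the tool is expected to be given as a path to an existing file.

```python
import hashlib
import os
import sys


def server_id(command):
    """Hash the tool path, its modification time and its arguments."""
    tool = os.path.abspath(command[0])
    mtime = os.stat(tool).st_mtime_ns      # changes when the tool is updated
    parts = [tool, str(mtime)] + list(command[1:])
    digest = hashlib.sha1("\0".join(parts).encode("utf-8"))
    return digest.hexdigest()


def socket_path(command, base="./.viasock/sockets"):
    """Location of the socket for the server that runs this command."""
    return os.path.join(base, server_id(command))


print(socket_path([sys.executable, "-m", "mymodel"]))
```

Because the modification time is part of the hash, an updated tool or different arguments automatically lead to a different socket, and so to a fresh server.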
If you frequently run a program with high startup overhead on small data, and you don't want to split it into server/client, then give viasock a try.
See the code at https://github.com/texttheater/viasock
See the slides.
Jaap Bresser - Beyond Role Based Auth: Discretionary Access Control with Postgres & SQLAlchemy
Jaap Bresser goes beyond role based authorization: Discretionary Access Control with Postgres & SQLAlchemy, at PyGrunn.
I work at http://www.profileermij.nl and this talk is about what we did there. It is an online career platform.
Role based authentication: grant access control based on a role, instead of on a user. Generally it is quite simple and coarse grained: you can edit either all products or none. [I would call it authorization: what you are allowed to do instead of who you are.]
The core part of our website is the user profile, with hard skills (education, positions) and soft skills (personality, preferences). Sensitive data.
In the first version, we used role based authentication. Each customer has its own subdomain. Roles were: user, hr (human resources), admin.
What we wanted: allow users to share their profiles with others. So we looked at discretionary access control (DAC). The first mention I found was from the Department of Defense in 1985. The biggest thing: you can pass on a permission to another subject. So you get authorization per object.
The Linux file system works like this, in a very basic way, where you can give access on a file to one user as well as one group or everyone.
Amazon Web Services has resource based permissions and identity based permissions, and lots more options. Quite complex, needed for their wildly varying use cases.
For the profileermij case, we created an ACL (access control list) in a Postgres database. The app uses Flask. The ACL:
- a principal uuid
- actions: read, write, read_acl, write_acl
- paths: dot separated, including wildcard and glob characters
- a unique key for external reference
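In Python, an ACL entry with these fields and a permission check could look something like the sketch below. The class and function names are made up, the principals are short stand-ins for uuids, and matching is done with fnmatch from the standard library, where * also matches across dots. The real path matching may well work differently.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class AclEntry:
    principal: str        # uuid of the user or group that gets access
    actions: frozenset    # subset of read, write, read_acl, write_acl
    paths: tuple          # dot separated patterns, may contain wildcards
    key: str              # unique key for external reference


def is_allowed(acl, principal, action, path):
    """True if any entry grants `action` on `path` to `principal`."""
    return any(
        entry.principal == principal
        and action in entry.actions
        and any(fnmatchcase(path, pattern) for pattern in entry.paths)
        for entry in acl
    )


acl = [
    AclEntry("user-1", frozenset({"read"}), ("profile.skills.*",), "share-1"),
]
print(is_allowed(acl, "user-1", "read", "profile.skills.hard"))   # True
print(is_allowed(acl, "user-1", "write", "profile.skills.hard"))  # False
```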
For implementing the ACL, we use the JSONB data type and the JSONB containment operator @>, with a GIN index. You need Postgres version 9.4 or higher.
Sample query:
SELECT * FROM profile WHERE profile.acl @> {...};
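If you have not used @> before: it asks whether the JSON document on the left contains the one on the right. Below is a simplified Python approximation for parsed JSON values. It ignores some Postgres details, such as the special case where a top-level array contains a bare scalar, and the example documents are invented, not the real ACL layout.

```python
def contains(left, right):
    """Rough Python equivalent of `left @> right` for parsed JSON values."""
    if isinstance(right, dict):
        # Every key on the right must be present and contained on the left.
        return isinstance(left, dict) and all(
            key in left and contains(left[key], value)
            for key, value in right.items()
        )
    if isinstance(right, list):
        # Every element on the right must be contained in some left element;
        # order and duplicates do not matter.
        return isinstance(left, list) and all(
            any(contains(item, wanted) for item in left)
            for wanted in right
        )
    return left == right


stored = {"principal": "user-1", "actions": ["read", "read_acl"]}
print(contains(stored, {"actions": ["read"]}))    # True
print(contains(stored, {"actions": ["write"]}))   # False
```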
In Python we use classes to wrap the JSON, with an SQLAlchemy TypeDecorator. We need to handle paths, and integrate the ACL in an existing application or framework. Processing a request would look like this:
- incoming write request
- validate against schema
- convert to SQLAlchemy model
- load ACL and check permissions
- save SQLAlchemy object
- return response
Takeaways from our experience:
- DAC meets our requirements
- Designing a UX that is understandable for lots of people is hard
- The complexity can be managed.
- Don't apply it everywhere. You don't need it everywhere. For our profiles we needed it, for most other data not.
[Audience: Django REST framework has this.]
Marco Vellinga - Creating abstraction between consumer and datastore
Marco Vellinga talks about Creating abstraction between consumer and datastore, at PyGrunn.
Marco Vellinga is a programmer at Devhouse Spindle.
Default: a web app connects to an HTTP API that talks to your data store. And another web app talks to another HTTP API, and another, etcetera.
Better: let all apps talk to one data access layer. You need core logic, a contract on how apps can talk to you, validation, error handling.
Some cons:
- It is quite difficult to find information about this on the internet.
- It is a bit overkill for small projects.
- It can take a long time to build before you get it right, at least longer than we expected.
In our solution, data needs to be (de)serializable in JSON. We are using marshmallow.
Filtering with and/or/not. We created FilterQL.
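To give an idea of what filtering with and/or/not involves, here is a tiny evaluator for nested filters on dictionaries. It is not FilterQL and does not follow its syntax: the query layout and the function name are invented for this sketch, and it assumes no record has a field that is itself called and, or, or not.

```python
def matches(record, query):
    """Evaluate a nested and/or/not filter against one record (a dict)."""
    if "and" in query:
        return all(matches(record, sub) for sub in query["and"])
    if "or" in query:
        return any(matches(record, sub) for sub in query["or"])
    if "not" in query:
        return not matches(record, query["not"])
    # Leaf: every listed field must equal the given value.
    return all(record.get(field) == value for field, value in query.items())


records = [
    {"name": "Ann", "city": "Groningen", "active": True},
    {"name": "Bob", "city": "Assen", "active": True},
    {"name": "Cas", "city": "Groningen", "active": False},
]
query = {"and": [{"active": True}, {"not": {"city": "Assen"}}]}
print([r["name"] for r in records if matches(r, query)])   # ['Ann']
```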
We wanted versioning of code paths and made a decorator for this. With that, you can call method.v1() and method.v2(). You can version classes or functions this way. We created versionary. You do want some deprecation policy here: having sixteen different versions of the same function seems a bad idea.
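A decorator that gives you method.v1() and method.v2() could be built along these lines. This is my own minimal sketch and not the versionary API: the class name, the register method and the rule that a plain call uses the highest version are all assumptions.

```python
class Versioned:
    """Callable that dispatches to numbered implementations."""

    def __init__(self, name):
        self.name = name
        self.versions = {}

    def register(self, number):
        """Decorator that stores a function as version `number`."""
        def decorator(func):
            self.versions[number] = func
            setattr(self, f"v{number}", func)   # enables method.v1()
            return func
        return decorator

    def __call__(self, *args, **kwargs):
        """A plain call uses the highest registered version."""
        latest = max(self.versions)
        return self.versions[latest](*args, **kwargs)


greet = Versioned("greet")


@greet.register(1)
def _greet_v1(name):
    return "Hello " + name


@greet.register(2)
def _greet_v2(name, punctuation="!"):
    return f"Hello, {name}{punctuation}"


print(greet.v1("PyGrunn"))   # old code path
print(greet.v2("PyGrunn"))   # new code path
print(greet("PyGrunn"))      # latest version
```

A deprecation policy would then come down to removing old entries from the versions dictionary at some point.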
We have something for validation too, but we are still working on that: it is currently too verbose.
In the end: is it worth building such an abstraction layer? The first day this went live, we found lots of mismatched data, so it helped us find errors in the data. If you can start from scratch, it helps.
Pizzas and wallets
Column 'Klokgelui' for the Overschiese Krant.
Two weeks ago I was in Italy for work. In Naples things quickly went wrong: on a crowded platform a pickpocket took my wallet. Cash, bank cards, OV-chipkaart, identity card, a discount voucher for a croissant, you know how it goes. Annoying, and above all a lot of hassle and arranging. Luckily I could borrow money from a colleague, otherwise it would all have been even harder.
I had blocked my bank cards right away, and that was in time. I also tried to block my OV-chipkaart, because I have its balance topped up automatically. But then I would lose all kinds of subscriptions that were on it, which I would later have to get onto a new card with some difficulty. Or I could order a new one right away, and then it would all go well. But then I had to pay online immediately and that was not possible, because I needed my bank card for that. Oh well.
I could stop the automatic top-up online. Problem: I had to confirm this instruction by taking my OV-chipkaart to a ticket machine. Sigh. I gave up: the chance that the thief would travel all the way to the Netherlands to use my card seemed small to me.
I was not really angry. Disappointed in a Neapolitan thief, yes. But at the hotel I was helped well: they took care of filing the report with the police in Italian. Just as well, because they spoke no Dutch or English there, and my knowledge of Italian does not go much further than the food. After only two days I had an emergency passport through the consulate, so that went nicely fast. And luckily the travel insurance reimbursed part of the costs.
It is not right, stealing. And I will not make up a pitiful, understanding story about how the thief probably used the proceeds to feed his poor little children. But something tells me that he (she?) has become poorer from it than I have.
Jesus said: 'If someone strikes you on the cheek, offer him the other cheek as well.' If someone picks your wallet, point out to him the five euro note that falls out of it. Love your enemy. With some reluctance I take his advice anyway. It is said to be healthy.
P.S. My congratulations to fellow columnist Kees van der Meer, who has been appointed Member of the Orde van Oranje Nassau!