Kilian Evang - Viasock: Automagically Serverize Your Scripts

Kilian Evang talks about Viasock: Automagically Serverize Your Scripts

See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.

Viasock's tagline: Automagically serverize your pipelines.

Pipelines. In the Parallel Meaning Bank, you input some text and it analyses each word. So each word goes through a pipeline.

The Unix philosophy (Doug McIlroy):

- Write programs that do one thing and do it well.
- Write programs to work together.
- Write programs to handle text streams, because that is a universal interface.

A pipeline for us can be this:

$ cat data/01.txt | ./bin/tokenize -m models/tokenizer.model | ./bin/parse -m models/parser.model > out/01.parse

Get input, tokenize it, parse it, give output.

We might update our parsing module while the pipeline is running. In that case we prefer not to redo the tokenization step, especially if it has taken a long time.

We have a daemon that runs various make processes which update files, orchestrating which makefile is used for which document.

Problem: the tokenizer is meant to work on a big dataset, and needs ten seconds to start. If you run it on one sentence, the processing itself takes less than a second, but you still pay the ten-second startup. What to do?

We serverize it. The traditional approach would be to split the tokenizer into a server part and a client part. The server keeps running, so it only suffers the ten-second startup penalty once. The clients start up quickly, contact the server, and get an answer back quickly. Problem: you would need to do this split yourself, and you may need to do it for various tools.

So viasock was born. This is a client and server. The viasock server interacts with your normal, unchanged tool, the tokenizer in our case. The viasock server makes sure the tool is started once, and keeps it running. The viasock clients then talk to your tool via the viasock server.
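To make the "start once, keep running" idea concrete, here is a minimal sketch of the core of it, not viasock's actual code. It leaves out the socket layer entirely and only shows one long-lived child process answering several requests. The class name `PersistentTool` and the tiny uppercasing child program are made up for illustration, and the sketch assumes the tool answers every input line with exactly one output line.

```python
import subprocess
import sys

# Hypothetical stand-in for a slow-starting tool: reads a line, answers a line.
CHILD_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)


class PersistentTool:
    """Start a line-oriented tool once and reuse it for many requests."""

    def __init__(self, argv):
        # The startup cost is paid here, once, not once per request.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def ask(self, line):
        # Send one request line and read back exactly one answer line.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin signals end of input, so the child can exit cleanly.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


tool = PersistentTool([sys.executable, "-u", "-c", CHILD_SOURCE])
print(tool.ask("hello"))  # prints HELLO
print(tool.ask("world"))  # prints WORLD, answered by the same process
tool.close()
```

A real server would put something like `ask` behind a socket, so that many short-lived clients can share the one running process.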

There are limitations. Your tool must read standard input, process it, output it on standard output, and then repeat.
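As an illustration of that contract, here is a sketch of a tool shaped the way the limitation requires: do the expensive setup once, then loop over input, answering and flushing after each piece. The names `load_model` and `serve` are hypothetical, whitespace splitting stands in for a real tokenizer, and the assumption that one request equals one line is ours, not something the talk specified.

```python
import io


def load_model():
    # Stand-in for the expensive startup step (hypothetical): a real tool
    # would load its model here, once.
    return str.split


def serve(infile, outfile, tokenize):
    # Read a request, process it, write the answer, then repeat.
    for line in infile:
        outfile.write(" ".join(tokenize(line)) + "\n")
        # Flush after every answer so a waiting client is not left hanging
        # on output that is still sitting in a buffer.
        outfile.flush()


# Demo with in-memory streams instead of real standard input and output.
demo_out = io.StringIO()
serve(io.StringIO("Hello  world\nOne more line\n"), demo_out, load_model())
print(demo_out.getvalue(), end="")
```

In a real tool the last lines would be replaced by a call such as `serve(sys.stdin, sys.stdout, load_model())`. The flush matters: a tool that only writes its output when it exits cannot be kept running between requests.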

We want to serverize tools automagically: you should not have to keep track of changes to your tool, or make sure the viasock server is running before you use the viasock client. So:

$ cat input1.txt | viasock run mytool -m mymodel > output1.txt
$ cat input2.txt | viasock run myothertool > output2.txt
$ cat input3.txt | viasock run mytool -m myothermodel > output3.txt

If no up-to-date instance of your tool is running yet, this starts a new one. Any old instance is kept running until we notice that it has not received any new client connections for some time, and then we stop it.
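The "stop it after some quiet time" rule can be sketched with a small idle tracker. This is only an illustration of the idea, not viasock's implementation: the class name `IdleTracker`, the timeout value, and the injectable clock are all assumptions made for the example.

```python
import time


class IdleTracker:
    """Remember when the last client connected and report idleness."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout  # seconds without clients before shutdown
        self.clock = clock      # injectable, so the logic is easy to test
        self.last_seen = clock()

    def touch(self):
        # Call this whenever a new client connects.
        self.last_seen = self.clock()

    def expired(self):
        # True once no client has connected for `timeout` seconds.
        return self.clock() - self.last_seen >= self.timeout


tracker = IdleTracker(timeout=60.0)
tracker.touch()           # a client just connected
print(tracker.expired())  # prints False: the server should keep running
```

A server loop could check `expired()` periodically and shut down its tool when it returns true.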

Viasock calculates a SERVERID hash, based on your toolname, modification date, arguments, etcetera, and then uses or starts a server that listens on ./.viasock/sockets/$SERVERID.
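A hash like that could be computed along the following lines. This is a sketch under assumptions, not viasock's real algorithm: the choice of SHA-1, the exact ingredients (resolved tool path, modification time, arguments) and the function names `server_id` and `socket_path` are ours. Only the `./.viasock/sockets/` location comes from the talk.

```python
import hashlib
import os
import shutil
import sys


def server_id(argv):
    """Hash the tool's path, its modification time and its arguments."""
    # Resolve the tool through PATH if possible; fall back to the name as given.
    tool = shutil.which(argv[0]) or argv[0]
    mtime = os.stat(tool).st_mtime_ns
    digest = hashlib.sha1()
    for part in [os.path.abspath(tool), str(mtime), *argv[1:]]:
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator, so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def socket_path(argv):
    # Socket location as described in the talk, relative to the current directory.
    return os.path.join(".viasock", "sockets", server_id(argv))


# Demo: the Python interpreter stands in for "mytool".
print(socket_path([sys.executable, "-m", "mymodel"]))
```

Because the modification time is part of the hash, updating the tool yields a new ID and therefore a fresh server, while different arguments (such as another model) get servers of their own.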

If you frequently run a program with high startup overhead on small data, and you don't want to split it into server/client, then give viasock a try.

See the code at https://github.com/texttheater/viasock

See the slides.