Weblog
Story: Loslaten
A short story; I used its opening as my audition for the Topklas of the SKVR.
It did not get me into the Topklas, but here is a short magical realist story of mine: Loslaten.
Holger Krekel - re-inventing Python packaging & testing
Holger Krekel gives the keynote at PyGrunn, about re-inventing Python packaging and testing.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
I am @hpk42 on Twitter.
I started programming in 1984. I am going to tell you how distribution and installation worked back then, as you are too young to know. A friend and I would sit down after school with a magazine. One of us would read hexadecimal numbers from it and the other typed them in. One and a half hours later we could play a pacman game.
Apprentice: "Can anyone tell me why X isn't finished?"
Master: "It takes a long time to write software."
Projects take time. CPython is 22 years old.
Where does all this effort go? Into mathematical algorithms? No. Deployment takes a huge bite. Software needs to run on different machines, needs to be configured, tested, packaged, distributed, installed, executed, maintained, monitored, etcetera.
The problem with deployment is the real world. Machines are different, users are different, networks are different, operating systems are different, software versions are different.
There are producers of software. If, as a producer, I change an API or a UI, that creates a danger for my users. Releasing a new version is therefore dangerous, because deploying the new version is potentially dangerous for the users.
A lot can be solved by automation. Automated tests help. You need to communicate (allow users to track changes, have discussions). Configurations should be versioned so you can go back to earlier versions or at least see what the difference is. You need a packaging system and a deployment system. This may be more important than choosing which language to use.
The modern idea to simplify programming is usually: let's have only one way so it is clear for everyone what to do. Oh, and it should be my way.
Standardization fosters collaboration, even if the standard is not perfect. But tools that come out of this standardization are more important than the standardization document itself.
Are standardized platforms a win? For example the C64/Amiga, iOS, Android, Debian, .NET, or company-wide choices for virtual machines and packaging. This reduces complexity, but increases lock-in. You may not want to bet your whole business on one platform.
Modernism: have one true order. For example, Principia Mathematica for having one system of mathematics that could do everything. Gödel proved this was impossible.
Let's check the koans of Perl and Python. Perl says there is more than one way to do it. Python says there should be one - and preferably only one - obvious way to do it. Both say there are multiple ways. You need to take that into account.
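For reference (not part of the talk): the Python koan quoted here is one line of the Zen of Python, which the interpreter will print for you.

# Prints the Zen of Python; one of its lines is the koan quoted above:
# "There should be one-- and preferably only one --obvious way to do it."
import this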
A note on the Python standard library: Python includes lots of functionality. This was a good idea in the past. Today, PyPI often provides better APIs, and those can still be improved.
Perl has the CPAN, Comprehensive Perl Archive Network. Lots of good structure in there.
Python is still catching up. Python is growing declarative packaging metadata instead of keeping it in the executable setup.py file. There are attempts to standardize on pip and wheels, but easy_install remains a possibility. Uploading to PyPI and interacting with the server is hard today. The server is hard to deploy on a laptop. There are no enforced version semantics. It has a brittle protocol. It is hard to move away from setup.py though.
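To illustrate the setup.py point (a sketch of mine, not from the talk; the project name and dependency are made up): the packaging metadata below only exists after executing arbitrary Python code, which is exactly what declarative metadata tries to avoid.

# A minimal, hypothetical setup.py: name, version and dependencies are
# only known after *running* this script.
from setuptools import setup, find_packages

setup(
    name="example-package",
    version="0.1",
    packages=find_packages(),
    install_requires=["requests"],
)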
http://pypi-mirrors.org lists about eight mirrors of the official http://pypi.python.org server. Most are not up to date, or not updating at all. Not good.
Perl and Python are both not living up to their koans. Python has a lot to improve.
What needs to be improved? setuptools and distribute are being merged. The bandersnatch tool is being deployed, which is much better and faster for mirroring. Several PEPs are being discussed and considered. The people proposing these PEPs are talking to each other, so communication is good. New version comparison, new packaging metadata version, new rules on PyPI, etcetera. A lot is happening.
We should be aware of the standardization trap: you try to solve the five existing ways of doing something by adding a sixth way. To avoid this, don't demand that the world changes first before your tool or idea can be used. To a certain degree Python fell into that trap, but that is outside the scope for this talk.
I would like to focus on integration of meta tools. These can configure and invoke existing tools and make them work for most use cases. You can enable and facilitate new technology there.
Testing
Python has lots of testing tools, like nose, py.test, unittest, unittest2, zope.testing, make test, setup.py test.
tox is a "meta" test running tool. Its mission is to standardize testing in Python. It is a bit like a Makefile. It runs the tests with the tools of your choice. It acts as a front-end to CI servers. See http://tox.testrun.org for details.
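As an illustration (mine, not from the talk), a minimal tox.ini could look like the sketch below, assuming a project that runs its pytest suite against two Python versions; tox creates an isolated virtualenv per environment and runs the commands there.

# Hypothetical minimal tox.ini
[tox]
envlist = py27, py33

[testenv]
deps = pytest
commands = py.test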
Travis CI is a "meta" test running service. It configures all kinds of dependencies, priming your environment.
devpi
I have a new idea, devpi: support packaging, testing and deployment. The devpi-server part is a new PyPI-compatible index and upload service. The client part has subcommands for managing release and QA workflows.
Why a new index server? In the existing solutions I missed things like an automatically tested, extensible code base, among other things.
devpi-server is self-updating. It is a selective mirror. It does not try to update all packages on the original PyPI, just the ones that you actually use.
But working with multiple indexes is burdensome, so devpi provides "workflow" subcommands: use to set the current PyPI index, upload to build and upload packages from a checkout, and test to download and test a package. So you can create a package, upload it to a local test PyPI, test the package and then upload it to the real PyPI.
I did the last pytest releases using devpi.
Development plans: MIT licensed, test driven development. Get early adopters.
The main messages from this talk:
- Evolve and build standards, do not impose them.
- Integrate existing solutions, do not add yet another way, if possible.
- Let's share this tooling and collaborate. Maybe you have some tool to reliably create a Debian package from a Python package. Make it available and get feedback and code from others.
Strive for something simpler; see the requests library. Simplicity is usually something that emerges from using a piece of software.
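As a small illustration of my own (not from the talk) of the kind of API simplicity requests is praised for, assuming PyPI's JSON API is reachable:

import requests

# One call, no session/opener/handler setup needed.
response = requests.get("https://pypi.python.org/pypi/pytest/json")
print(response.status_code)
print(response.json()["info"]["version"])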
Luuk van der Velden - Best practices for the lone coder syndrome
Luuk van der Velden talks about best practices for the lone coder syndrome, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
I do a PhD at the Center for Neuroscience, University of Amsterdam. I switched from Matlab to Python a few years ago. I am a passionate and critical programmer.
Programming is not a substantial part of most science educations, apart from obvious fields like computer science. Experiments in the sciences generate more and more data, so the demands on computing power and data analysis keep growing.
A PhD student, whom we take as an example of a lone coder, is responsible for his or her own project. He or she does the work alone: experiments, analysis. Collaborations do happen, but are asymmetric. I can talk to others, but they usually do not program together with me. Or they pass me some Matlab code that I then have to translate into Python.
A PhD will take about four years, so your code needs to keep running for all that time, maybe longer. Development is continuous.
Cutting corners when working on your own is attractive. You are the only one who uses it, and it works, so why bother improving it for corner cases? High standards demand discipline. So you end up with duplicated code, unreadable code, no documentation, unstructured functionality with no eye for reuse, code rot.
There is a scripting pitfall. Scripting languages like Python are a flexible tool to link different data-producing systems, process data and create summaries and figures. Pitfalls of typical scripts are: data hiding, hiding of complexity, poor division of functionality (housekeeping versus processing), lack of scalability, no handles for code reuse.
What a script for scientific analysis should do is define what you want, concisely.
Prototyping is essential for researching a solution. It is used continuously. Consolidation is very different from prototyping. Some things are better left as a prototype.
You should have a hard core of software that is tested well. In your scripts you use this, instead of copying an old full script. 'Soft' code sits between the hard core and the script, as an interface.
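A hypothetical sketch of that layering (the names and the toy analysis are mine, not the speaker's): a small, well-tested core that scripts can trust, and a thin script part that only states what you want.

def spike_rate(spike_times, duration):
    """The well-tested 'hard core': mean firing rate in Hz over `duration` seconds."""
    return len(spike_times) / float(duration)

def test_spike_rate():
    # The core is covered by tests, so scripts can rely on it.
    assert spike_rate([0.1, 0.5, 0.9], duration=2.0) == 1.5

# The 'script' part just states what you want, instead of copying an old script:
if __name__ == "__main__":
    print(spike_rate([0.1, 0.5, 0.9], duration=2.0))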
As a scientist you were not educated as a programmer, so you should get that education, and as Python programmers we should help educate scientists. Presently the emphasis is on getting work done, not on programming. Matlab is the default language; it was originally a stripped-down tool for teaching students, but everyone kept using it. Closed source software goes against the scientific ethos.
Python offers a full-featured scientific computing stack. Python scales with your skills: you can write imperative, functional, object-oriented or meta-programming code. Python is free, so you can use the latest version without paying for an upgrade as with Matlab.
We can organize courses and workshops, for example Software Carpentry.
Álex González - Python and Scala smoke the peace pipe
Álex González talks about Python and Scala, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
Thrift is an interface definition language. You can use it to work with several languages at the same time. It gives you basic types, transport, protocol, versioning, processors (input, output).
It helps for example your Python client talk to the Scala server or the other way around.
Types: bool, byte, several integers, string, struct. Also containers: list, set, map (dict in Python). And exceptions and services (a service contains methods, for example).
Transport:
- TFileTransport uses files.
- TFramedTransport, for non-blocking servers, sends chunked data.
- TMemoryTransport uses memory for IO.
- TSocket is for blocking sockets.
- TZlibTransport is for compressed transport.
Protocols: binary, compact, dense, with and without metadata.
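A rough sketch of what the Python client side can look like (mine, not from the talk), assuming a hypothetical PingService generated from a .thrift file by the Thrift compiler; the TSocket transport and binary protocol are the ones listed above.

from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
# from example import PingService   # hypothetical module generated by the Thrift compiler

transport = TSocket.TSocket("localhost", 9090)         # blocking socket transport
protocol = TBinaryProtocol.TBinaryProtocol(transport)  # binary protocol
transport.open()                                       # assumes a server (Scala or Python) is listening
# client = PingService.Client(protocol)                # hypothetical generated client
# client.ping()
transport.close()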
Versioning. For every field in a struct you should add an integer identifier, otherwise you automatically get negative numbers.
Similar things are SOAP, CORBA, COM, Pillar, Protocol buffers (Google's protobuf is really similar to Thrift).
I am @agonzalezro on Twitter. See also http://agonzalezro.github.io
Armin Ronacher - A year with MongoDB
Armin Ronacher talks about MongoDB, at PyGrunn.
See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.
I do computers, currently at Fireteam. We do the internet for pointy-shooty games.
I started out hating MongoDB a year ago. Then it started making sense after a while, but I ended up not liking it much. But: MongoDB is a pretty okay data store, as Jared Heftly says. We are just really good at finding corner cases.
MongoDB is like a nuclear reactor, but if you use it well, it is safe. I said that in October. Currently I am less enthusiastic.
We had a game. Christmas came around. Server load went up a lot.
Why did we pick MongoDB initially? It is schemaless, the database is sharded automatically, it is sessionless. But schemaless is just wrong, MongoDB's sharding is annoying, thinking in records is hard, and there is a reason people use sessions.
MongoDB has several parts: mongod, mongoc, mongos. So it has many moving parts.
First we failed ourselves. We were on Amazon, which is not good for databases. mongos and mongod were split but ran on the same server, which meant that they were constantly waiting on each other. We went to two cores and then it was fine. Still, EBS (Elastic Block Storage) is not good for IO, so not good for databases. Try writing a lot of data for a minute, just with dd, and you will see what I mean.
MongoDB has no transactions. You can work around this, but we really did need them. It is meant for document-level operations, storing documents within documents, but that did not really work for us. Your mileage may vary.
MongoDB is stateful. It assumes that the data is saved correctly. If you want to be sure, you need to ask it explicitly.
It crashes a lot. We did not update from 2.0 for a while because we would have hit lots of segfaults.
To break your cluster: add a new primary, remove the old primary, but don't shut down the old primary (this step is the bad one!); then a network partition happens and one of them overrides the config of the other in the mongoc. That happened to us during Christmas.
Schema versus schemaless is like static typing versus dynamic typing. Ever since C# and TypeScript, static typing with an escape hatch to dynamic typing wins. I think almost everyone adds schemas to MongoDB. It is what we do anyway.
getLastError() is just disappointing. Because you have to ask this all the time, things are always slower.
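In pymongo terms (a sketch of mine, assuming pymongo 3 or later and a local mongod), the write concern is what controls that acknowledgement round trip:

from pymongo import MongoClient

# w=1: every write waits for the primary's acknowledgement (the
# getLastError-style round trip), which is safer but slower on every write.
# w=0 would fire-and-forget: the driver just assumes the write worked.
client = MongoClient(w=1)
client.demo.events.insert_one({"kind": "example"})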
There is a lack of joins. This is called a 'feature'. I see people joining in their code by hand. The database should be much better at doing this than the average user. MongoDB does not have Map-Reduce, except a version that hardly counts.
When using the find or aggregate functions in the API to get records, you can basically get the equivalent of SQL injection when a user manages to get a dollar sign at the beginning of a string, as MongoDB handles such values differently.
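A minimal pymongo sketch of that class of problem (my reconstruction, not the speaker's code): if user-supplied JSON goes straight into a query, a value that is an object starting with a $-operator turns an exact-match lookup into a much broader one.

import json
from pymongo import MongoClient

users = MongoClient().demo.users  # assumes a local mongod

# Expected input: '{"username": "alice"}'. Malicious input instead:
raw = '{"username": {"$gt": ""}}'
query = json.loads(raw)
# $gt "" matches (almost) any username instead of one exact value:
print(users.find_one(query))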
Even MySQL supports MVCC, so transactions. MongoDB: no.
MongoDB can only use one index per query, so quite limited. Negations never use indexes; not too unreasonable, but very annoying. There is a query optimizer though.
Making MongoDB far less slow on OS X:
mongod --noprealloc --smallfiles --nojournal run
Do not use : or | in your collection names, or it will not work if you try to import it on Windows.
A third of the data is the keys: the field names are stored again in every single document. That is just insane. A reason to use schemas.
A MongoDB cluster needs to boot in a certain order.
MongoDB is a pretty good data dump thing. It is not a SQL database, but you probably want a SQL database, at least until RethinkDB is ready. Probably we would have had similar problems with RethinkDB though.
It is improving. There is a lot of backing from really big companies.
I don't want to do this again. I want to use Postgres. If I ever get data so large that Postgres cannot handle it, I have apparently done something successful and I will start doing something else. Postgres has already solved so many problems at the database level that you do not have to come up with solutions yourself at a higher level.
Without a doubt, MongoDB will get better and be the database of choice for some problems.
The project we use it for still runs on MongoDB and that will probably stay that way.