Maarten Breddels - A billion stars in the Jupyter Notebook

published May 19, 2017

Maarten Breddels talks about a billion stars in the Jupyter Notebook, at PyGrunn.

See the PyGrunn website for more info about this one-day Python conference in Groningen, The Netherlands.

How do you deal with a billion stars in the data of your universe?

The Gaia satellite, launched by ESA in 2013, is scanning the skies. From this satellite we have data on a billion stars, and soon there will be more. We want to visualise this, explore the data, 'see' the data.

If you give each star a single pixel, and plot them all, you get a totally black figure. So instead we can use different colours when there are more stars at the same spot.
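The idea of colouring by the number of stars per spot comes down to counting points on a grid. Here is a minimal sketch in plain Python, assuming made-up Gaussian test data instead of real Gaia positions; the function name `density_grid` and the grid size are only for illustration and are not taken from the talk.

```python
import random


def density_grid(xs, ys, limits, shape):
    """Count how many points fall in each cell of a shape x shape grid."""
    (xmin, xmax), (ymin, ymax) = limits
    grid = [[0] * shape for _ in range(shape)]
    for x, y in zip(xs, ys):
        # Skip points outside the plotted region.
        if not (xmin <= x < xmax and ymin <= y < ymax):
            continue
        # Scale the coordinate to a cell index; clamp to guard against rounding.
        i = min(int((x - xmin) / (xmax - xmin) * shape), shape - 1)
        j = min(int((y - ymin) / (ymax - ymin) * shape), shape - 1)
        grid[j][i] += 1
    return grid


# Hypothetical test data: 100,000 "stars" clustered around the origin.
random.seed(42)
n_points = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(n_points)]
ys = [random.gauss(0.0, 1.0) for _ in range(n_points)]

grid = density_grid(xs, ys, limits=((-4.0, 4.0), (-4.0, 4.0)), shape=8)
n_in_range = sum(
    1 for x, y in zip(xs, ys) if -4.0 <= x < 4.0 and -4.0 <= y < 4.0
)
```

Each count can then be mapped to a colour, often on a logarithmic scale, so that dense and sparse regions are both visible.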

The data we need for this is about 15 GB. Memory bandwidth is 10-20 GB/s, so reading it takes about 1 second. CPU: 2 GHz, multicore with 4-8 cores, which leaves only some 12-24 cycles per star in that second.
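The back-of-the-envelope numbers can be checked with a few lines of Python. This is a sketch under assumptions: two columns of 8-byte floats per star (which would explain the roughly 15 GB), and the helper names are invented here. Note that this arithmetic gives 8-16 cycles per star at 2 GHz; the 12-24 range corresponds to a clock of 3 GHz.

```python
N_STARS = 1_000_000_000
N_COLUMNS = 2        # assumption: two coordinates per star
BYTES_PER_VALUE = 8  # assumption: 64-bit floats

data_bytes = N_STARS * N_COLUMNS * BYTES_PER_VALUE
data_gib = data_bytes / 2**30  # size in GiB, close to the quoted 15 GB


def scan_seconds(bandwidth_gb_per_s):
    """Time to stream all the data through memory once."""
    return data_bytes / (bandwidth_gb_per_s * 1e9)


def cycles_per_star(clock_ghz, cores, seconds=1.0):
    """CPU cycles available per star within the given time budget."""
    return clock_ghz * 1e9 * cores * seconds / N_STARS


for clock_ghz in (2, 3):
    low = cycles_per_star(clock_ghz, cores=4)
    high = cycles_per_star(clock_ghz, cores=8)
    print(f"{clock_ghz} GHz, 4-8 cores: {low:.0f}-{high:.0f} cycles per star")
```

Either way the budget is tiny: a few arithmetic operations per star, so there is no room for per-row Python overhead.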

Storage: native, column based. The normal method (POSIX read) has lots of overhead: the data goes from disk to the OS cache, and is then copied into the program's memory. So get direct access to the cache (memory mapping) to speed this up.
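Getting direct access to the OS cache is what memory mapping does. The sketch below uses only the Python standard library and a tiny temporary file standing in for one column of the dataset; the file name and helper functions are hypothetical, and this is not how vaex itself is implemented.

```python
import array
import mmap
import os
import tempfile


def write_column(path, values):
    """Store one column as raw 64-bit floats, nothing else in the file."""
    with open(path, "wb") as f:
        array.array("d", values).tofile(f)


def sum_column_mmap(path):
    """Sum a column through a memory map, without read() copying the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # View the mapped bytes as doubles; no copy of the data is made.
            with memoryview(mm) as raw, raw.cast("d") as column:
                return sum(column)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "x_column.bin")  # hypothetical file name
    write_column(path, [1.0, 2.0, 3.5])
    total = sum_column_mmap(path)
```

Because each column is a contiguous block of raw numbers, only the columns that are actually used need to be touched, and the operating system pages them in on demand.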

Visualisation can be 0- to 3-dimensional, from a single number (0D) up to a volume (3D), with different calculation costs.

Solution: vaex. It is a Python library, like pandas but for larger datasets. Everything is computed in chunks, so as not to waste memory, and it works for multiple dimensions.
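Computing in chunks means that a statistic is accumulated piece by piece, so only one chunk has to be in memory at a time. Here is a minimal sketch of that pattern for a 1D histogram in plain Python. It is not vaex's actual code or API; the function names and the test column are made up for illustration.

```python
def chunk_ranges(n_rows, chunk_size):
    """Yield (start, stop) index pairs that together cover all rows."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def chunked_histogram(column, limits, bins, chunk_size=1024):
    """Accumulate a histogram one chunk at a time."""
    low, high = limits
    counts = [0] * bins
    for start, stop in chunk_ranges(len(column), chunk_size):
        chunk = column[start:stop]  # only this slice is worked on
        for value in chunk:
            if low <= value < high:
                index = min(int((value - low) / (high - low) * bins), bins - 1)
                counts[index] += 1
    return counts


# Hypothetical column: 10,000 values spread evenly over ten bins.
column = [(i % 10) + 0.5 for i in range(10_000)]
counts = chunked_histogram(column, limits=(0.0, 10.0), bins=10)
counts_small_chunks = chunked_histogram(
    column, limits=(0.0, 10.0), bins=10, chunk_size=7
)
```

The result does not depend on the chunk size, which is what makes it possible to tune the chunks to the cache or to hand them to different cores.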

[Demo of cool things in a Jupyter notebook in the browser.]

I wrote ipyvolume to visualize 3D volumes and glyphs in the Jupyter notebook.

Since it works in the browser, you can also use it on your phone, and even in 3D with a VR device.