“90% of the world’s data has been generated in the last two years alone”.

I tried to find an updated version of this mind-boggling and widely referenced SINTEF statistic from 2013 and came up empty, but one can only imagine how much data we’ve generated since then. With great data comes great responsibility, and over the last decade just about every tech company has had to allocate tremendous resources to navigating the data landscape. On January 27th, we brought together experts from SF’s data world to talk about specific ways they’ve approached this: using data pipelines to gather and disseminate valuable data from myriad sources.

Maxime Beauchemin, Data Engineer, Airbnb

Max is responsible for Airflow, Airbnb’s open source data pipeline management tool, which is used internally at Airbnb as well as at other companies such as Blue Apron, Lyft, Stripe and Yahoo.

Max walked us through some of Airflow’s very cool features:

  • Scales out
  • Queues: target specific workers/configuration
  • Pools: limit concurrency + prioritize
  • Complex dependency rules: branching, joining, subworkflows
  • Hackability: define your own constructs, callbacks, …
  • SLA monitoring / alerting
  • Easily alter states: perform DAG surgeries
  • Clarity around ownership and versioning
  • Data profiling
  • Templating
  • Centralized connection management
  • Plugins!
  • Built-in integrations
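
To make the "complex dependency rules" bullet a little more concrete, here is a minimal sketch of branching and joining in a task graph. It uses only Python's standard library (`graphlib`, Python 3.9+) and is not Airflow's API; the task names and the `run_order` helper are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean_users": {"extract"},               # branch 1
    "clean_events": {"extract"},              # branch 2
    "join": {"clean_users", "clean_events"},  # waits for both branches
    "report": {"join"},
}


def run_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

The two cleaning tasks can come out in either order, since neither depends on the other. A scheduler is free to run independent branches like these in parallel.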

Watch his talk here:

Dr. Samantha Zeitlin, Data Scientist, Sighten

Samantha walked us through her use of pandas, an open source Python data analysis library. While she did not bring along a fuzzy and bumbling bamboo-eating companion, her presentation was packed with useful insights.

Dr. Zeitlin showed us how she creates pipelines to consume energy data, and some of the pitfalls in managing this complex data. She recommends and expands on the following points in her talk:

  • Validate data and filetypes
  • Simple merges and masks are great ways to handle data (with pandas!)
  • Use multiple masks for complicated tasks (poem!)
  • Use dynamic naming tricks
  • Validate assumptions about reference data
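
As a rough illustration of the merge and mask points above, here is a small pandas sketch. The data, column names and validation rules are invented for this example and are not taken from Samantha's talk.

```python
import pandas as pd

# Made-up meter readings and a made-up reference table of known meters.
readings = pd.DataFrame({
    "meter_id": [1, 1, 2, 3],
    "kwh": [12.5, -1.0, 30.2, 8.1],
})
meters = pd.DataFrame({
    "meter_id": [1, 2],
    "site": ["north", "south"],
})

# Validate an assumption about the reference data: one row per meter.
assert meters["meter_id"].is_unique

# Simple merge: attach the site to each reading. indicator=True adds a
# "_merge" column saying whether a matching reference row was found.
merged = readings.merge(meters, on="meter_id", how="left", indicator=True)

# Multiple masks, combined with & for a more complicated filter.
has_site = merged["_merge"] == "both"   # reading matches a known meter
plausible = merged["kwh"] >= 0          # negative energy use looks wrong

clean = merged[has_site & plausible].drop(columns="_merge")
rejected = merged[~(has_site & plausible)]

print(clean)
print(rejected)
```

Keeping each condition as its own named mask makes it easier to see why a given row ended up in `rejected`.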

Here she is walking through all of the above:

Bradford Stephens, Founder, 22Acacia

Bradford walked us through some of the reasons stream processing can be sub-optimal for data processing, and introduced us to his new open-source collaborative stream processing tool, Sossity. In Bradford’s words, “stream processing should really be about defining what you want to do and not having to write a bunch of code to do it”. Here’s how to use Sossity to quickly build a collaborative data pipeline:

  1. Write operations in Python or Java
  2. Compose pipelines in a config file or subscribe to others
  3. Check into GitHub
  4. Autoscaling REST endpoints, streaming jobs, and outputs are created

Bam!
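
The general idea of writing small operations and composing them from a config can be sketched in a few lines of plain Python. This is not Sossity's actual config format or API; the `OPERATIONS` registry, the `CONFIG` layout and `build_pipeline` are all hypothetical.

```python
# Hypothetical operations: each one takes a record and returns a new record.
def strip(record):
    return record.strip()


def upper(record):
    return record.upper()


def exclaim(record):
    return record + "!"


# Registry mapping the names used in the config to the functions above.
OPERATIONS = {"strip": strip, "upper": upper, "exclaim": exclaim}

# Hypothetical config: the pipeline is just an ordered list of operation names.
CONFIG = {"pipeline": ["strip", "upper", "exclaim"]}


def build_pipeline(config, registry):
    """Look up each named step and return a function that applies them in order."""
    steps = [registry[name] for name in config["pipeline"]]

    def run(record):
        for step in steps:
            record = step(record)
        return record

    return run


pipeline = build_pipeline(CONFIG, OPERATIONS)
result = pipeline("  hello ")
print(result)
```

Changing what the pipeline does then means editing the config rather than the code that runs it.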

Watch Bradford walk us through Sossity while sporting a stylish Norwegian sweater:

Dan Kador, COO, Keen IO

At Keen, we’re in the business of solving data headaches for our customers. This means we had to solve some pretty massive, and interesting, meta-headaches first! Dan Kador has overseen the evolution of our tech stack since we were founded in 2011, and he walked us through what solving some of those key problems looked like. Some of those problems included:

  • Collection: how to get data from our customers to our servers?
  • Exploration: we figured out how to store all this data, but how do we now explore and actually do the analytics on it?
  • Now what? We know how to explore the collected data. What’s next? Here’s where our APIs come in so handy.

Watch Dan talk through all this and more:

Q&A Panel Session

We wrapped up the night with a panel Q&A session. Some of the great questions included:

  • Have any of you solved some of the problems around MySQL data capture?
  • What are some design tradeoffs between config files and writing configuration in code? When should you choose one over the other?
  • Processing and scaling pipelines — how to know what to run when, and how to run autoscaling tools in a financially responsible way?

Watch it all here: