Manu, one of our platform infrastructure engineers, sent out an amazing write-up of a recent stress test on the Keen IO write event path. I wanted to share this to give a peek into the inner workings of our platform team as they build out, test, and strengthen the Keen IO infrastructure. Enjoy, and thanks again to Manu for sharing! –Kevin

tl;dr With the help of Terry’s awesome ETL tool we were able to do a stress test with our write event path. We were able to sustain a 3x write rate for almost half an hour without any impact. We were able to try out a couple of changes to confirm what made the throughput improve and the fact that we did actually improve throughput.

Background

Last month, we had a particular client sending us 2x more events than we were doing before that. We ran into some problems with not being able to keep up with the write load. Another similar event occured today during which we made multiple config changes to get everything back into a stable state. With Terry’s bulk insert of events we were able to confirm the impact in a more definitive way.

The image below overlays the write event volume graph on the write event delta graph and shows the timeline of the changes.

Observations

Observation #1: We were able to increase throughput by adding more IO executors and running more JVMs. Adding IO executors was hinted by Storm suggesting that capacity of those bolts was > 1.0 so that was fairly obvious. But increasing executors also seemed to increase the CPU load on some boxes. We were able to offset that by running more JVMs (or so I think)

Observation #2: Our overall throughput decreases when we are already backlogged. There was some talk about this being similar to a ‘thundering herd’ problem. When we are backlogged we need more resources to get out of the backlog but we read and write more data from Kafka, so Kafka (or the Kafka consumer / producer code in Storm) could get overbooked. It’s a theory but we need to investigate this further.

The reason why this is a real thing is that while we were able to handle a 3x increase in load in steady state, we couldn’t chew through the delta faster than incoming events, even after reducing the load by 2/3rds. Increasing executors and JVMs reduced the slope of the increase but couldn’t get it moving downwards like something that worked in steady state.

Observation #3: Our max Kafka fetch size was set to 2MB (later 1MB accidentally) and that was limiting the rate at which we read from Kafka. These values seem too low because we were stuck in a situation where Storm executors were free but they were not able to drain the queue. We just weren’t able to read from Kafka fast enough. Increasing this had the immediate effect that all the bolts in the topology started seeing a lot more load and showing high ‘capacity’ values. It also reflected in an immediate drop in delta. This is one of the fastest recoveries I’ve seen 🙂 ​

None of this caused a significant increase in load on Kafka or Cassandra. So I’m pretty sure we can go higher if we need to. This also demonstrates the usefulness of stress testing. We should try and make this a regular feature and possibly add better ways to stress test other parts of our stack like queries.