The What, How, and Why of OR Filters

Hello Keen community! Back in September, we announced the release of “or” filters. In this post, we’ll take a deep dive into the what, how, and why.

Why, Part 1: Foundation for a Richer Query Language

Let’s start with the big question: why build “or” filters? One simple answer is that it has been among our most-requested features for a long time, and that in and of itself was enough to justify building it. On our quest to make Keen ever more flexible, this capability allows our customers to quickly query their data in ways that until now were either complicated and expensive, or outright impossible.

But the scenarios unlocked by the new architecture on which “or” filters were built are even more exciting. We’ll discuss the new model in detail below, but just as a teaser here are some of the enticing scenarios that it will enable us to build:

  • Filtering on functions of properties, e.g. (player.level mod 5) eq 0
  • Filtering on relationships between properties, e.g.
    player.level < (1.2 * monster.level)
  • Using expressions in the target_property of a query, e.g.
    sum (order.count * order.unit_price)
  • Using expressions in the group_by of a query, e.g.
    group_by (user.point_program exists true) or
    group_by bin(customer.age, [0, 20, 40, 60, 80, 100])

Adding these capabilities will open up whole new paradigms of querying in Keen (if you’ve got ideas, we’d love to hear them), and bring us one step closer to parity/compatibility with SQL and other mature query languages. But before we go any deeper let’s go back and cover the basics.

What, Part 1: A Simple Example

To satisfy those of you who arrived here just looking for some sample code, here’s what an “or” filter looks like:
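
(The original request isn’t reproduced here, so below is a minimal sketch of the same idea using the Keen Python client. The collection name, property names, and the exact shape of the nested “or” filter are illustrative assumptions; the API docs linked below are the authoritative reference.)

from keen.client import KeenClient

client = KeenClient(
    project_id="YOUR_PROJECT_ID",
    read_key="YOUR_READ_KEY"
)

# Count clickthrough events from "high-value" customers: lifetime revenue
# over $100 OR an explicit premium-tier subscription.
# NOTE: the nested "or" filter shape below is an assumption; check the API docs.
high_value_clickthroughs = client.count(
    "clickthroughs",
    timeframe="this_30_days",
    filters=[{
        "operator": "or",
        "operands": [
            {"property_name": "customer.lifetime_revenue",
             "operator": "gt", "property_value": 100},
            {"property_name": "customer.subscription_tier",
             "operator": "eq", "property_value": "premium"},
        ],
    }],
)
print(high_value_clickthroughs)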

(You can see a similar example and read more in our API docs.)

How, Part 1: A Fundamentally New Concept

Looking at the sample request above, it may seem like “or” filters are a simple feature - and in many respects they are. They allow you to query against events that match any one of a set of conditions, rather than matching all conditions. But under the hood, the implementation was actually quite complex, and it’s worth going into why.

Prior to the introduction of “or” filters, all supported filter types conformed to the same pattern: a 3-tuple of `(property_name, operator, property_value)`. This was reflected in our implementation, which explicitly defined a filter as a POJO (Plain Old Java Object) with those properties:
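
(The original class isn’t shown here; for illustration, here is a rough Python analogue of that Java POJO, with field names mirroring the 3-tuple above.)

from dataclasses import dataclass
from typing import Any

@dataclass
class Filter:
    """One condition, e.g. Filter("player.level", "gte", 10)."""
    property_name: str
    operator: str
    property_value: Any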

This allowed for simple and efficient code, but it lacked the flexibility necessary to introduce filters with fundamentally different structures. We could have hacked in support for “or” filters by just glomming onto that existing `Filter` class, but it would have degraded overall code quality and would have left us in even worse shape when we inevitably add the next filter operator with yet another structure. It seemed like there must be a better way…

How, Part 2: What is a Filter?

Good software architecture often starts with asking simple questions. What, really, is a filter? Logically speaking it’s just a predicate on an event, nothing more and nothing less. We could have modeled it that way and it would have worked, but there was an even broader generalization to make: an expression is a function that takes an event as input and produces some value (in the Keen type system) as output, and a filter is just an expression that always produces a boolean (i.e. true or false) output.

An expression can be a constant, a reference to a property, or some function with one or more expressions as its operands. Since functions are themselves expressions, they can appear as operands to other functions and form a so-called Abstract Syntax Tree (AST). For now we have defined functions for the existing filter types (such as “eq”, “gte”, and now “or”), but the logic allows for expressions using any mathematical operations you can think of: addition, subtraction, multiplication, division, modulus, logarithms, exponentiation, even binning, and so on.
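
(To make the expression-tree idea concrete, here is a purely illustrative sketch in Python. It mirrors the description above; it is not Keen API syntax, current or promised.)

from dataclasses import dataclass
from typing import Any, List, Union

@dataclass
class Constant:
    value: Any                    # e.g. 5, 0, "premium"

@dataclass
class PropertyRef:
    name: str                     # e.g. "player.level"

@dataclass
class FunctionCall:
    operator: str                 # e.g. "mod", "eq", "or"
    operands: List["Expression"]  # functions take expressions as operands

Expression = Union[Constant, PropertyRef, FunctionCall]

# A filter is just an expression whose root produces a boolean:
# (player.level mod 5) eq 0
filter_expr = FunctionCall("eq", [
    FunctionCall("mod", [PropertyRef("player.level"), Constant(5)]),
    Constant(0),
])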

What, Part 2: Using “or” Filters

The example from “What, Part 1” illustrates the mechanics of running an “or” filter query, but what problem is that query actually solving? Suppose that you want to add a graph to your embedded customer-facing analytics dashboard showing how many of their clickthrough events came from their “high-value” customers, which you define as customers who either (a) have a lifetime revenue over $100 or (b) explicitly subscribe to a premium tier. The example query above solves exactly this problem.

Without “or” filters that would be much trickier to accomplish. You could query the two parts separately and sum them, i.e.:

Chart A: How many clickthroughs from high-value customers?

(Chart caption: the double-counted sum of all customers with LTR > $100 plus the 2,135 premium tier customers.)

But this will end up double-counting customers who are subscribed to the premium tier and have a lifetime revenue of over $100. You could correct for this (using the inclusion-exclusion principle) by subtracting out this double-counted amount:

Chart B: How many clickthroughs from high-value customers?

(Chart caption: 7,863 total, minus the 1,601 premium tier accounts with LTR > $100 that were double-counted, equals 6,262.)
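
(In code, that inclusion-exclusion workaround looks roughly like this; the sketch reuses the client and the illustrative property names from the earlier example.)

ltr_filter = {"property_name": "customer.lifetime_revenue",
              "operator": "gt", "property_value": 100}
premium_filter = {"property_name": "customer.subscription_tier",
                  "operator": "eq", "property_value": "premium"}

high_ltr = client.count("clickthroughs", timeframe="this_30_days",
                        filters=[ltr_filter])
premium = client.count("clickthroughs", timeframe="this_30_days",
                       filters=[premium_filter])
# Multiple filters in one list are ANDed, giving the overlap.
both = client.count("clickthroughs", timeframe="this_30_days",
                    filters=[ltr_filter, premium_filter])

high_value = high_ltr + premium - both  # subtract the double-counted overlap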

This works, but now you are running three queries to get the result you want - which means three times the compute usage, plus extra load time on your dashboard. So even this simple case illustrates the value of native “or” filters, and in more involved cases (such as an “or” of three or more conditions) the savings in time, cost, and complexity can be great.

Why, Part 2: Making Keen a One-Stop Shop

Our mission is to make it as easy as possible for you to turn your data into a valuable resource for your users. Keen already provides a lightweight and low-friction way to do simple analyses, but there are many more scenarios that can be enabled by increased expressivity. The more questions that Keen can answer for your users, and the more efficiently those questions can be structured, the higher the value it can provide. Support for “or” filters is just one such feature that we’ve recently implemented, and we’ll share many more in the future.

Until Keen can efficiently solve all your analytics needs, we’ll always have our work cut out for us - but we’re making great progress, and we’re happy to have you along for the ride. Drop us a note at team@keen.io with any feedback about “or” filters, expressions, or anything else you’d like to see in the product.

Thanks!
Kevin Litwack
Chief Platform Architect


Keen and GDPR

You’ve probably heard all about the EU’s new regulation, the General Data Protection Regulation (GDPR). The GDPR applies not only to EU-based businesses but also to any business that controls or processes data of EU citizens. Not only is GDPR an important step in protecting privacy for European citizens, it also raises the bar for data protection, security, and compliance in the industry.

At Keen, we’ve been hard at work to ensure that our own practices are GDPR-compliant. A big piece of that is ensuring that our product makes it easy for you to use Keen to handle data in compliance with GDPR requirements. In March 2018 we published a blog post that detailed the steps we would take in order to accomplish this.

Since that time, we’ve accomplished the following:

  • Appointed a Data Protection Officer and a data protection working team
  • Built a formal data map
  • Performed internal threat modeling and gap analysis (and set up a recurring schedule)
  • Adopted and formalized written policies around core areas, including (but not necessarily limited to): data protection, data backup, data retention, access management, and breach management and reporting
  • Conducted formal data protection training for all Keen employees
  • Encrypted data at rest (still in progress for some data)
  • Begun working with a 3rd party auditor to schedule annual security audits
  • Completed legal paperwork to confirm that our Data Sub-processors (primarily Amazon) are GDPR-compliant
  • Offer a Data Processor Agreement to our customers upon request
  • Received Privacy Shield certification

There are several additional security enhancements that we will continue to iterate on and improve over time:

  • More granular access controls, allowing Keen employees to be granted access according to the Principle of Least Privilege
  • Full customer data access audit history
  • Lockdown of Keen employee devices, and/or limiting access to customer data to certain approved devices

A note about data deletion

During our many conversations with customers about their GDPR compliance efforts and concerns, the most common theme was the need for various types of data deletion. Some examples that we’ve heard include:

  • specific property removal from all events
  • deletion (or anonymization) of all events matching certain filters (e.g. all events with a specific user.id for “right to be forgotten” requests)
  • one-time deletions of all data before some time threshold
  • on-going “expiration” of data older than some horizon

While the Keen delete API endpoint can handle some of these at small scale, for larger use cases we felt that a more powerful toolset was needed. That toolset is now under active, ongoing development and is used internally. It can be run on customers’ behalf on a case-by-case basis. If you have GDPR-related deletion needs please contact us for more details.

Keep a lookout for more updates on our blog as we continue to make performance and security enhancements to Keen.


Keen and the EU General Data Protection Regulation (GDPR)

Update on Keen and GDPR Compliance

Keen is deeply committed to doing our part to ensure that personal data is adequately protected. As such, we are actively reviewing the requirements of EU Regulation 2016/679 (more commonly referred to as “GDPR”) and how they affect us and our customers. In this blog post we’ll try to provide as much information and guidance as possible to help you remain GDPR-compliant while using Keen.

Our Data Protection Philosophy

Keen stores two different classes of data: (a) the account information of our direct customers, as provided to us via accounts on the keen.io website and/or through support channels such as e-mail or chat; and (b) data about our customers’ customers in the form of events submitted to our streams API.

We have designed our system to be resistant to attack against either class of data, but the second category (Keen’s customers’ event data) is more complicated because we allow highly flexible content and cannot directly control what information is included or how personally identifiable or sensitive it might be. For this reason we always recommend against the storage of any Personally Identifiable Information (PII) or otherwise sensitive data in event properties.

We believe that most use cases for Keen do not inherently rely on personal data, and that such data can be anonymized, pseudonymized, or omitted entirely without losing value. As such, it is more valuable for our customer base as a whole for us to focus our engineering effort on other aspects of the product, rather than building high-assurance security protections that most customers do not need.

That said, we strive to be as secure as possible, and will continue to improve our security posture. We also recognize that some customers do have legitimate use cases for storing some amount of low-sensitivity PII (such as e-mail or IP addresses, for example), and those require a somewhat more rigorous data protection strategy than what we have in place now. So over the coming months we are making investments to move in that direction.

How Keen Secures Data Today

Our data protection strategy spans several dimensions: technology, people, and processes.

Technology

The most direct way that we protect data is by limiting access to it using standard industry best practices. All data is stored on hardware in Amazon’s AWS cloud, using a VPC to isolate all servers from the outside internet. These systems can only be accessed via a set of bastion hosts which are regularly updated with the latest security patches, and which can only be connected to using SSH channels secured by a select group of Keen employees’ cryptographic access keys. We’ve also adopted strict requirements around access to the AWS environment itself, including mandatory Multi-Factor Authentication (MFA) and complex passwords.

This structure makes direct access to our internal systems quite difficult for an unauthorized person, but it cannot protect the public-facing endpoints such as keen.io (i.e. our website) or api.keen.io. We secure these via the access keys available in each Keen Project or Organization, which adhere to cryptographic best practices.

(Please note that we currently do not encrypt traffic between various internal services within our VPC, nor do we encrypt data at rest. Up to this point we have not felt that there was much value in doing so, since the only practical exploit of this would require direct physical access to Amazon infrastructure. However we do plan to enable basic data-at-rest encryption soon; see roadmap below.)

People

The Keen web UI includes a mechanism by which authorized Keen employees can view customer data directly. This is used to help investigate and address any issues or questions reported to us by customers, as well as occasionally by our operational engineering team to diagnose and mitigate degradation of service. The mechanism is password-protected and limited to those who require it to provide customer support or to fulfill other responsibilities.

We also adhere to a policy of only using this access when it is necessary, and will seek permission before viewing customers’ raw event data. (In rare circumstances where the need is urgent, such as a system-wide outage, we may skip this step — but only as a last resort.)

Currently this “root” access is all or nothing and we rely on our hiring and training processes to mitigate the risk of unnecessary access by a Keen employee. The build out of a granular access control system is on our roadmap (see below).

Processes

We adhere to the following processes to help ensure that data is kept safe:

  • Access management: when a Keen employee leaves the company, we follow a checklist to ensure that all of their permissions are revoked.
  • Design and code reviews: all changes to the system are reviewed carefully by senior engineers, as well as tested in an isolated staging environment prior to deployment to production.
  • Threat modeling: periodically we review the threat model and try to identify gaps, assess risk, and determine what mitigations (if any) should be prioritized.
  • Automated backups: all data is automatically backed up to Amazon S3 to allow us to recover in the event of a catastrophic loss, whether due to malicious attack or other unexpected events. These backups age out over time, so any data which is removed from the source will eventually no longer appear in the backups. (We currently can’t offer any guarantees about how long it will be for any specific piece of data.)
  • Data retention: Keen stores data for as long as it is necessary to provide services to our customers and for an indefinite period after a customer stops using Keen. In most cases, data associated with a customer account will be kept until a customer requests deletion. (There is also a self-service delete API which is suitable for removing small amounts of data.)

Our Security and Privacy Roadmap

We will be making improvements to all of the above according to the following roadmap.

What we are intending to deliver by the GDPR deadline

GDPR goes into effect on May 25, 2018. Prior to that time Keen intends to:

  • Appoint a Data Protection Officer and a data protection working team
  • Build a formal data map
  • Perform internal threat modeling and gap analysis (and set up a recurring schedule)
  • Adopt and/or formalize written policies around core areas, including (but not necessarily limited to): data protection, data backup, data retention, access management, and breach management and reporting
  • Institute formal data protection training for all Keen employees
  • Encrypt data at rest
  • Schedule annual security audit with a 3rd party auditor (however the audit may not be completed until later in 2018)

We also intend to do the necessary legal paperwork to be able to confirm that our Data Sub-processors (primarily Amazon) are GDPR-compliant, and to be able to offer a Data Sub-processor Addendum to the contracts of customers who request it.

What we hope to improve over time

The following are examples of additional security enhancements that will not be addressed by the May 25 deadline:

  • More granular access controls, allowing Keen employees to be granted access according to the Principle of Least Privilege
  • Full data access audit history
  • Lockdown of Keen employee devices, and/or limiting access to customer data to certain approved devices
  • Integration with an intrusion detection system/service
  • Industry certifications

In addition, we expect that threat modeling and gap analysis (both our own and those done by a 3rd party auditor) will identify opportunities to further harden the system and provide redundant layers of risk mitigation. Those will be prioritized and incorporated into our roadmap as appropriate.

Next Steps

Ultimately our goal is to make Keen as valuable as possible to all of our customers. We appreciate your understanding, and also greatly value your input. If you have questions, concerns, or feedback about our approach or how it will affect your own GDPR compliance efforts, please reach out to us at team@keen.io!

Thanks,
Kevin


Order and Limit Results of Grouped Queries (Hooray!)

Greetings Keen community! I’d like to make a quick feature announcement that will (hopefully) make many of you happy 😊

At Keen IO we’ve created a platform for collecting and analyzing data. In addition to the ability to count the individuals who performed a particular action, the API includes the ability to group results by one or more properties of the events (similar to the GROUP BY clause in SQL). For example: count the number of individuals who made a purchase and group by the country they live in. This makes it possible to see who made purchases in the United States versus Australia or elsewhere.
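
(A minimal sketch of that example with the Keen Python client; the collection and property names are assumptions.)

from keen.client import KeenClient

client = KeenClient(project_id="YOUR_PROJECT_ID", read_key="YOUR_READ_KEY")

# Count purchases, grouped by the purchaser's country.
purchases_by_country = client.count(
    "purchases",
    timeframe="this_30_days",
    group_by="customer.country",
)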

This grouping functionality can be very powerful, but there’s one annoying drawback: if there are many different values for your group_by property then the results can get quite large. (Picture a pie chart of the example above: lots of tiny slivers representing countries with only a handful of purchases.) What if I’m only interested in the top 5 or 10? Until now the only option was to post-process the response on the client (e.g. using Python or JavaScript) to sort and then discard the unwanted groups.

Today I’m excited to announce that, by popular demand, we’ve made this much easier! We recently added a feature called order_by that allows you to rank and return only the results that you’re most interested in. (To those familiar with SQL: this works very much like the ORDER BY clause, as you might expect.)

The order_by parameter orders results returned by a group_by query. The feature includes the ability to specify ascending (ASC) or descending (DESC) ordering, and allows you to order by multiple properties and/or by the result of the analysis.

Most importantly the new order_by feature includes the ability to limit the number of groups that are returned (again, mirroring the SQL LIMIT clause). This type of analysis can help answer important questions such as:

  • Who are the top 100 game players in the US?
  • What are the top 10 most popular article titles from last week?
  • Which 5 authors submitted the most articles last week?
  • What are the top 3 grossing states based on sum purchases during Black Friday?

order_by can be used with any Keen query that has a group_by, which in turn can be used with most Keen analysis types. (limit can be used with any order_by query.) For more details on the exact API syntax please check out the order_by API docs.
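
(As a sketch, here are the parameters for one of the questions above, “Which 5 authors submitted the most articles last week?” The collection and property names are assumptions, and the exact order_by and limit field names should be checked against the linked API docs.)

# Parameters for a count query, grouped by author, ordered by the analysis
# result in descending order, and limited to the top 5 groups.
query_params = {
    "event_collection": "article_submissions",   # illustrative collection name
    "timeframe": "previous_7_days",
    "group_by": "author.name",
    "order_by": [{"property_name": "result", "direction": "DESC"}],
    "limit": 5,
}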

There is one important caveat to call out: using order_by and limit in and of itself won’t make your queries faster or cheaper, because Keen still has to compute the full result in order to be able to sort and truncate it. But being able to have the API take care of this clean-up for you can be a real time saver; during our brief internal beta I’ve already come to rely on it as a key part of my Keen analysis toolbox.

I’d like to extend a huge thanks to our developer community for all the honest constructive feedback they’ve given us over the years (on this issue and many others). You’re all critical in helping us understand where we can focus our engineering efforts to provide the most value. On that note: we have many more product enhancements on the radar for 2018, so if you want to place your votes we’re all ears! Feedback (both positive and negative) on the order_by feature is also welcome, of course. Please reach out to us at team@keen.io any time 🚀

Cheers,
Kevin Litwack | Platform Engineer


Tracking GitHub Data with Keen IO

Today we’re announcing a new webhook-based integration with one of our favorite companies, GitHub!

We believe an important aspect of creating healthy, sustainable projects is having good visibility into how well the people behind them are collaborating. At Keen IO, we’re pretty good at capturing JSON data from webhooks and making it useful, which is exactly what we’ve done with GitHub’s event stream. By allowing you to track and analyze GitHub data, we’ve made it easy for open source maintainers, community managers, and developers to view and discover more information to quantify the success of their projects.

This integration records everything from pushes, pull requests, and comments, to administrative events like project creation, team member additions, and wiki updates.

Once the integration is set up, you can use Keen IO’s visualization tools like the Explorer, Dashboards, and the Compute API to dig into granular workflow metrics, like:

  • Total number of first-time vs. repeat contributors over time
  • Average comments per issue or commits per pull request, segmented by repo
  • Pull request additions or deletions across all repositories, segmented by contributor
  • Total number of pull requests that are actually merged into a given branch
(Example charts: number of comments per day on Keen IO’s JavaScript library repos; number of pull requests per day merged in Keen IO’s repos, where “false” represents not merged; and the percentage of different author associations of pull request reviews.)

Ready to try it out?

Assigning webhooks for each of these event types can be a tedious process, so we created a simple script to handle this setup work for you.

Check out the setup instructions here. With four steps, you will be set up and ready to rock in no time.

What metrics are you excited to discover?

We’d love to hear from you! What metrics and charts would you like to see in a dashboard? What are challenges you have had with working with GitHub data? We’ve talked to a lot of open source maintainers, but we want to hear more from you. Feel free to respond to this blog post or send an email to opensource@keen.io. Also, if you build anything with your GitHub data, we’d love to see it! ❤


Announcing Hacktoberfest 2017 with Keen IO

It’s October, which you probably already know! 👻 But more importantly, that means it is time for Hacktoberfest! Keen IO is happy to announce we will be joining Hacktoberfest this year.

What is Hacktoberfest?

Digital Ocean, together with GitHub, launched Hacktoberfest in 2014 to encourage contributions to open source projects. If you open four pull requests on any public GitHub repo, you get a free limited edition shirt from Digital Ocean. You can find issues in hundreds of different projects on GitHub using the hacktoberfest label. Last year, 29,616 registered participants opened at least four pull requests to complete Hacktoberfest successfully, which is amazing. 👏

Hacktoberfest with Keen IO

If you have ever seen our Twitter feed, you know at Keen IO we love sending our community t-shirts. So, we have something to sweeten the deal this year. If you open and get at least one pull request merged on any Keen IO repo, we will send you a free Keen IO shirt and sticker too.

You might wonder… What kind of issues are open on Keen IO GitHub repos? Most of them are on our SDK repos for JavaScript, iOS/Swift, Java/Android, Ruby, PHP, and .NET. Since we value documentation as a form of open source contribution, there’s a chunk of them that are related to documentation updates. We labeled issues with “hacktoberfest” that have a well-defined scope and are self-contained. You can search through them here.

Some examples are…

If you have an issue in mind that doesn’t already exist, feel free to open an issue on a Keen IO repository and we can discuss if it is an issue that is a good fit for Hacktoberfest.

Now, how do you get your swag from Keen IO?

First, submit a pull request for any of the issues labeled with “hacktoberfest”. It isn’t required, but it is also helpful to comment on the issue you are working on to say you want to complete it. This prevents other people from doing duplicate work.

If you are new to contributing to open source, this guide from GitHub is super helpful. We are always willing to walk you through it too. You can reach out in issues and pull requests, email us at opensource@keen.io, or join our Community Slack at keen.chat.

Then, once you have submitted a pull request, gone through the review process, and gotten your PR merged, we will ask you to fill out a form for your shirt.

Also, don’t forget to register at hacktoberfest.digitalocean.com for your limited edition Hacktoberfest shirt from Digital Ocean if you complete four pull requests on any public GitHub repository. They have more details on the month-long event.

These candy corns are really excited about Hacktoberfest

Thank you! 💖

We really appreciate your interest in contributing to open source projects at Keen IO. Currently, we are working to make it easier to contribute to any of the Keen IO SDKs and are happy to see any interest in the projects. There’s an open issue for everyone, from folks who want to practice writing documentation to those who want to improve the experience of using the SDKs. Every contribution makes a difference and matters to us. At the same time, we are happy to help others try contributing to open source software. Can’t wait to see what you create!

See you on GitHub! 👋

 


P.S. Keen IO has an open source software discount that is available to any open source or open data project. We’d love to hear more about your project of any size and share more details about the discount. We’d especially like to hear about how you are using Keen IO or any analytics within your project. Please feel free to reach out to opensource@keen.io for more info.


SendGrid and Keen IO have partnered to provide a robust email analytics solution

Today we’re announcing our partnership with SendGrid to provide the most powerful email analytics for SendGrid users.


SendGrid Email Analytics — Powered by Keen IO

Connect to Keen from your SendGrid account in seconds. Start collecting and storing email data for as long as you need it. No code or engineering work required!

The SendGrid Email Analytics App operates right out of the box to provide the essential dashboards and metrics needed to compare and analyze email campaigns and marketing performance. Keen’s analytics include capabilities for detailed drill-downs to understand users and their behavior.

Keen IO’s analytics with SendGrid enables you to:

  • Know who is receiving, opening, and clicking emails in realtime
  • Build targeted campaigns based on user behavior and campaign performance
  • Find your most or least engaged users
  • Extract lists of users for list-cleaning and segmentation
  • Drill in with a point-and-click data explorer to reveal exactly what’s happening with your emails
  • Keep ALL of your raw email event data (No forced archiving)
  • Build analytics for your customers directly into your SaaS platform
  • Programmatically query your email event data by API


The solution includes campaign reports, as well as an exploratory query interface, segmentation capabilities, and the ability to drill down into raw email data.

Interested in learning more? Check out the Keen IO Email Analytics Solution on SendGrid’s Partners Marketplace.


.NET Summer Hackfest Round One Recap

We kicked off the .NET Summer Hackfest with the goal of porting our existing Keen IO .NET SDK to .NET Standard 2.0, and I’m excited to say that we just about accomplished our goal! Our entire SDK, unit tests, and CI builds have been converted to run cross-platform on .NET Standard. All that’s left to do is a little bit of cleanup and some documentation updates that are in the works.

There are some big benefits to adopting .NET Standard 2.0; here are some highlights:

  • The Keen .NET SDK can be used with .NET Core, which means it can be included in apps deployed on Linux, Mac OS, and cool stuff like Raspberry Pi
  • Mono-based projects, which may or may not have worked before, will be officially supported in Mono’s next version, so now it’ll work for sure. This also means Unity can use the new .NET Standard library!
  • We can multi-target to reduce the size and complexity of the codebase
  • All the Xamarin variations will be supported in their next version

Everyone who contributed during this event was open, collaborative, and ready to learn and teach. We were very happy to be a part of this and look forward to future ‘hackfests’.

I’d like to give a special shoutout and thanks to our community contributors who jumped in on the project: Doni Ivanov & Tarun Pothulapati.

I’d also like to thank Justin & Brian from our team, Jon & Immo from Microsoft, & Microsoft MVP Oren for all their work and support during our two week sprint.


9 Projects Showcased at Open Source Show and Tell 2017

The 4th annual Open Source Show & Tell has wrapped up, and we had a great time seeing and experiencing some cool open source projects.

We got interactive with Ashley going on a journey building smart musical IoT plushies, and were wowed by Beth’s talk on unifying the .NET developer community.

Joel walked us through the inner workings of software development (the good, bad, and ugly), and showed us how the purely functional and open source package manager, Nix, can help with package and configuration management. Zach took us on a journey into why the open source project Steeltoe was built, and showed us how developers can write in .NET and still implement industry best practices when building services for the cloud.

We learned from Josh at Algolia how you can scale a developer community by creating webhooks for community support, and Sarah took us on a journey to understand open source’s role in cloud computing at companies like Google.

Julia presented about internationalizing if-me, an open source non-profit mental health communication platform maintained by contributors from many backgrounds and skill sets.

There were lots of other excellent talks about open source projects, like Eiso’s presentation of Babelfish, a self-hosted server for source code parsing, and Nicolas’s talk about helping people build better APIs following best practices via Apicur.io.

Check out all of the topics and talks here.

Big thanks to GitHub, Google, and Microsoft for co-organizing and hosting. Looking forward to seeing you at Open Source Show and Tell next year!


We ❤ open source. We’d love to hear more about your project and share it with others. To help with any analytics needs, Keen IO has an open source software discount available to any open source or open data project. Please feel free to reach out to community@keen.io for more info.


Visualizing your Keen IO Data with Python and Bokeh

In a previous post I wrote, we created a basic example that analyzed earthquakes using the Keen Python Client with Jupyter Notebook. In this post we’re going to be looking at creating visualizations in Python using a visualization library called Bokeh.

Getting Started

To install Bokeh, run pip install bokeh in your shell. After Bokeh has finished installing, open up Jupyter Notebook by running jupyter notebook. In the first cell, you’ll need to set up a Keen Client in Python:

import keen
import arrow  # used below to build the query timeframe
from keen.client import KeenClient

KEEN_PROJECT_ID = "572dfdae3831443195b2f30c"
KEEN_READ_KEY = "5de7f166da2e36f6c8617347a7a729cfda6d5413db8d88d7f696b61ddaa4fe1e5cdb7d019de9bb0ac846d91e83cdac01e973585d0fba43fadf92f06a695558b890665da824a0cf6a946ac09f5746c9102d228a1165323fdd0c52c92b80e78eca"
client = KeenClient(
    project_id=KEEN_PROJECT_ID,
    read_key=KEEN_READ_KEY
)

We’ll need to run a query similar to the one we used last time. In the next cell, make a count_unique query on the earthquakes collection with a daily interval. This will return the number of unique earthquakes for each day in the timeframe.

earthquakes_by_day = client.count_unique("earthquakes",
    timeframe={
        "start": arrow.get(2017, 6, 12).format(),
        "end": arrow.get(2017, 7, 12).format()
    },
    target_property="id",
    interval="daily"
)
Output of the “count_unique” query

Let’s import Bokeh so we can visualize earthquakes_by_day. Run the code below in a new cell.

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()

The first line imports figure and show, two functions that will let us plot our data. The next line imports a function called output_notebook. We need to call this function before we start plotting data so our plots are drawn below our notebook cells.

Plot Our Data

A line graph would be a great choice to plot this data. In order to plot this, we need to pull out the number of earthquakes for a timeframe and the corresponding date.

y = list(map(lambda row: row["value"], earthquakes_by_day))
x = list(map(lambda row: arrow.get(row["timeframe"]["start"]).datetime.replace(tzinfo=None), earthquakes_by_day))
# Now we can plot!
# `figure` initializes the chart object
pl = figure(title="Number of Earthquakes per Day", x_axis_type="datetime")
# `line` takes lists of the x and y values and plots.
pl.line(x, y)
show(pl)
Line graph generated by Bokeh

In the code sample above, y is a list containing the counts per day, x is a list of the corresponding datetime values, and figure initializes the chart object. The line method takes x and y and plots those values as a line. show(pl) is the call that actually draws the chart in our notebook.

Customize Our Chart

Bokeh even lets us add tooltips to our charts! We can import HoverTool by calling from bokeh.models import HoverTool in a new cell and add an instance of HoverTool to our figure object.

from bokeh.models import HoverTool
pl = figure(title="Number of Earthquakes per Day",
    x_axis_type="datetime")
pl.line(x, y)
hover = HoverTool(
    tooltips=[
        ("Date", "@x{%F}"),
        ("Count", "@y{int}")
    ],
    formatters={"x": "datetime"},
    mode='vline'
)
pl.add_tools(hover)
show(pl)
Tooltips in our graph!

We have to do a little bit of configuration in HoverTool to make sure the tooltips display the correct date and don’t display any values in between the data points (try removing the tooltips option and see what’s displayed). You can check out the Bokeh docs on HoverTool if you want the tooltips to look different.

Notice that there are a lot of earthquakes on 6/17! This might be an interesting place to dive deeper.

We pulled data from a Keen project using Python, drew a line graph for a month’s worth of data, and added interactivity to the chart we drew. The code for this example is available on GitHub. Try playing around with it yourself! If you want to use this example to visualize your own event data, sign up for your own Keen account and read how to get started!

Next time, we’ll plot the earthquakes that happened in that time period using Basemap and see if we can find anything interesting.


11 Beautiful Event Data Models

One of the most common requests that I get here at Keen is for help with data modeling. After all, you’ve got to collect the right data in order to get any value out of it. Here’s an inventory of common, well-modeled events across a variety of industries:

  1. B2B SaaS (create_account, subscribe, payment, use_feature)
  2. E-Commerce (view_item, add_to_cart, purchase)
  3. Gaming (create_user, level_start, level_complete, purchase)

All of the examples are live code samples that you can run and test yourself by clicking “Edit in JSFiddle”.

B2B SaaS Event Data Models

Track what’s happening in your business so that you can make good decisions. With just a handful of key events, you have the foundation for the classic SaaS pirate metrics (AARR: Acquisition, Activation, Revenue, Retention).

Create Account Event (Acquisition)

Capture an event when someone signs up for the first time or creates an account in some other way.

https://jsfiddle.net/7dtm77nc/6/?tabs=js
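
(The fiddle isn’t reproduced here; as a rough sketch, an event like this might be sent with the Keen Python client. All property names are illustrative, and the fiddles themselves use the JavaScript SDK.)

import keen

keen.project_id = "YOUR_PROJECT_ID"
keen.write_key = "YOUR_WRITE_KEY"

# Illustrative create_account event.
keen.add_event("create_account", {
    "user": {"id": "u-1234", "email": "jane@example.com"},
    "referrer": "organic_search",
    "plan": "trial",
})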

Subscribe (Acquisition)

Track an event when someone subscribes to your newsletter, chatbot, etc.

https://jsfiddle.net/7dtm77nc/7/?tabs=js

Use Feature (Activation)

It’s really common for product managers and marketers to want to know who is doing what in their products, so they can make roadmap decisions and set up marketing automation. Here’s an example of an event where the user has used the feature “Subscribe to SMS alerts”.

By including details about the feature on the event, you can provide yourself a nice dataset for later A/B testing and analysis. (e.g. did changing the button text increase or decrease usage?).

https://jsfiddle.net/3rezjb1h/?tabs=js
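
(A sketch of such a use_feature event, with illustrative property names; note how the button text and variant are recorded for later A/B analysis.)

# Assumes keen.project_id and keen.write_key are configured as in the
# create_account sketch above.
keen.add_event("use_feature", {
    "user": {"id": "u-1234"},
    "feature": {
        "name": "subscribe_to_sms_alerts",
        "button_text": "Get SMS alerts",   # lets you compare wording changes later
        "variant": "B",
    },
})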

Invoice Payment (Revenue & Retention)

This is a simplified example of an invoice payment event. If you use Stripe for payments, you can consume their event firehose into Keen directly and don’t need to model it yourself.

You can see the full Stripe Invoice object here.

https://jsfiddle.net/ff9y8ppw/3/?tabs=js

Check out more SaaS analytics uses and applications.

E-commerce Event Data Models

Track what’s happening in your store so that you can maximize sales, marketing investments, and provide detailed analytics to your vendors.

View Item / View Product Event

People checking out your goods? Good. Track it.

https://jsfiddle.net/3rezjb1h/1/?tabs=js

Add Item to Cart

Track every time someone adds a product to their cart, bag, or basket.

https://jsfiddle.net/2wb21enc/?tabs=js

Successful Checkout Event

Track an event every time an order is successfully completed. Use this event to count the total number of orders that happen on your site(s).

Use the Purchase Product Event (below) to track trends in purchases of individual items.

https://jsfiddle.net/7yqkjjsd/1/?tabs=js

Product Purchase Event

Track an individual event for each item purchased. That way you can include lots of rich details about the product and easily run trends on specific products.

https://jsfiddle.net/xxdjmkss/?tabs=js
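
(A sketch of a per-item purchase event with illustrative property names; one event per line item keeps product details easy to trend on.)

# Assumes keen.project_id and keen.write_key are configured as above.
keen.add_event("purchase_product", {
    "order": {"id": "ord-789"},
    "product": {
        "id": "sku-42",
        "name": "Wireless Mouse",
        "category": "electronics",
        "unit_price": 24.99,
        "quantity": 2,
    },
    "customer": {"id": "u-1234", "country": "US"},
})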

Gaming Event Data Models

Track what’s important in your game so that you can measure activation, engagement, retention, and purchasing behavior. Check out this related guide: Data Models & Code Samples for Freemium Gaming Analytics

New Player Event

Track every time a new player starts your game for the first time.

https://jsfiddle.net/p0csfttc/1/?tabs=js

Level Start Event

Track each time a player starts a new level in your game.

https://jsfiddle.net/ndfdeu4s/?tabs=js

Level Complete Event

Track each time a player successfully defeats a level in your game. The data model is the same as level_start, but you’ll have far fewer of these events, depending on what type of game you’ve designed.

https://jsfiddle.net/yf96tyny/1/?tabs=js

In-Game Purchase Event

Track when players make a purchase in your game so you get that $.

https://jsfiddle.net/p0csfttc/1/?tabs=js


A new way to debug your data models

We’re excited to announce the new and improved Streams Manager for inspecting the data schema of your event collections in Keen IO. We built the Streams Manager so you can ensure your data is structured well and set up to get the answers you need.

With Streams Manager you can:

  • Inspect and review the data schema for each of your event collections
  • Review the last 10 events for each of your event collections
  • Delete event collections that are no longer needed
  • Inspect the trends across your combined data streams over the last 30-day period

The Streams Manager can be found within the ‘Streams’ tab of your Project Console.

Inspect your data models with the Streams Manager

Ready to get started? Log in to your Keen IO account or create a new account to start streaming data.

Questions or feedback? Hit us up anytime on Slack.


Architecture of Giants: Data Stacks at Facebook, Netflix, Airbnb, and Pinterest

Here at Keen IO, we believe that companies who learn to wield event data will have a competitive advantage. That certainly seems to be the case at the world’s leading tech companies. We continue to be amazed by the data engineering teams at Facebook, Amazon, Airbnb, Pinterest, and Netflix. Their work sets new standards for what software and businesses can know.

Because their products have massive adoption, these teams must continuously redefine what it means to do analytics at scale. They’ve invested millions into their data architectures, and have data teams that outnumber the entire engineering departments at most companies.

We built Keen IO so that most software engineering teams could leverage the latest large-scale event data technologies without having to set up everything from scratch. But, if you’re curious about what it would be like to be a giant, continue on for a collection of architectures from the best of them.

Netflix

With 93 million MAU, Netflix has no shortage of interactions to capture. As their engineering team describes in the Evolution of the Netflix Data Pipeline, they capture roughly 500 billion events per day, which translates to roughly 1.3 PB per day. At peak hours, they’ll record 8 million events per second. They employ over 100 people as data engineers or analysts.

Here’s a simplified view of their data architecture from the aforementioned post, showing Apache Kafka, Elastic Search, AWS S3, Apache Spark, Apache Hadoop, and EMR as major components.

Source: Evolution of Netflix Data Pipeline

Facebook

With over 1B active users, Facebook has one of the largest data warehouses in the world, storing more than 300 petabytes. The data is used for a wide range of applications, from traditional batch processing to graph analytics, machine learning, and real-time interactive analytics.

In order to do interactive querying at scale, Facebook engineering invented Presto, a custom distributed SQL query engine optimized for ad-hoc analysis. It’s used by over a thousand employees, who run more than 30,000 queries daily across a variety of pluggable backend data stores like Hive, HBase, and Scribe.

Airbnb

Airbnb supports over 100M users browsing over 2M listings, and their ability to intelligently make new travel suggestions to those users is critical to their growth. Their team runs an amazing blog, AirbnbEng, where they wrote about Data Infrastructure at Airbnb last year.

At a meetup we hosted last year, “Building a World-Class Analytics Team”, Elena Grewal, a Data Science Manager at Airbnb, mentioned that they had already scaled Airbnb’s data team to 30+ engineers. That’s a $5M+ annual investment on headcount alone.

Keen IO

Keen IO is an event data platform that my team built. It provides big data infrastructure as a service to thousands of companies. With APIs for capturing, analyzing, streaming, and embedding event data, we make it relatively easy for any developer to run world-class event data architecture, without having to staff a huge team and build a bunch of infrastructure. Our customers capture billions of events and query trillions of data points daily.

Although a typical developer using Keen would never need to know what’s happening behind the scenes when they send an event or run a query, here’s what the architecture looks like that processes their requests.

Keen IO Event Data Platform

On the top row (the ingestion side), load balancers handle billions of incoming POST requests as events stream in from apps, web sites, connected devices, servers, billing systems, etc. Events are validated, queued, and optionally enriched with additional metadata like IP-to-geo lookups. This all happens within seconds.

Once safely stored in Apache Cassandra, event data is available for querying via a REST API. Our architecture (built on technologies like Apache Storm, DynamoDB, Redis, and AWS Lambda) supports various querying needs, from real-time data exploration on the raw incoming data to cached queries which can be instantly loaded in applications and customer-facing reports.

Pinterest

Pinterest serves over 100M MAU doing over 10B+ pageviews per month. As of 2015, they had scaled their data team to over 250 engineers. Their infrastructure relies heavily on Apache Kafka, Storm, Hadoop, HBase, and Redshift.

Pinterest Data Architecture Overview

Not only does the Pinterest team need to keep track of enormous amounts of data related to Pinterest’s customer base; like any social platform, they also need to provide detailed analytics to their ad buyers. Tongbo Huang wrote “Behind the Pins: Building Analytics at Pinterest” about their work revamping their analytics stack to meet that need. Here’s how they used Apache Kafka, AWS S3, and HBase to do it:

Data Architecture for Pinterest Analytics for Businesses
User View of Pinterest Analytics for Businesses

Twitter / Crashlytics

In Handling 5 Billion Sessions Per Day — in Real Time, Ed Solovey describes some of the architecture built by the Crashlytics Answers team to handle billions of daily mobile device events.

(The post includes architecture diagrams for each stage: event reception, archival, batch computation, speed computation, and the combined view.)

Thank You

Thank you to the collaborative data engineering community who continue to not only invent new data technology, but to open source it and write about their learnings. Our work wouldn’t be possible without the foundational work of so many engineering teams who have come before us. Nor would it be possible without those who continue to collaborate with us day in and day out. Comments and feedback welcome on this post.

Special thanks to the authors and architects of the posts mentioned above: Steven Wu at Netflix, Martin Traverso at Facebook Presto, AirbnbEng, Pinterest Engineering, and Ed Solovey at Crashlytics Answers.

Thanks also to editors Terry Horner, Dan Kador, Manu Mahajan, and Ryan Spraetz.


Building an Empire on Event Data

Photo by Joshua K Jackson

Facebook, Google, Amazon, and Netflix have built their businesses on event data. They’ve invested hundreds of millions behind data scientists and engineers, all to help them get to a deep understanding and analysis of the actions their users or customers take, to inform decisions all across their businesses.

Other companies hoping to compete in a space where event data is crucial to their success must find a way to mirror the capabilities of the market leaders with far fewer resources. They’re starting to do that with event data platforms like Keen IO.

What does “Event Data” mean?

Event data isn’t like its older counterpart, entity data, which describes objects and is stored in tables. Event data describes actions, and its structure allows many rich attributes to be recorded about the state of something at a particular point in time.

Every time someone loads a webpage, clicks an ad, pauses a song, updates a profile, or even takes a step into a retail location, their actions can be tracked and analyzed. These events span so many channels and so many types of interactions that they paint an extremely detailed picture of what captivates customers.

Event data is sufficiently unique that it demands a specialized approach, specialized architecture, and specialized access patterns.
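
(As a hypothetical illustration of the difference: an entity record is one row describing what something is, while an event records what happened, with a timestamp and a snapshot of relevant state.)

# Entity data: a row describing an object.
user_row = {"user_id": "u-1234", "name": "Jane", "plan": "pro"}

# Event data: an action, when it happened, and the state around it.
pause_song_event = {
    "action": "pause_song",
    "timestamp": "2017-05-04T18:21:07Z",
    "user": {"id": "u-1234", "plan": "pro", "session_minutes": 42},
    "song": {"id": "s-987", "genre": "jazz", "position_seconds": 103},
    "device": {"type": "mobile", "os": "iOS"},
}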

In the early days of data analysis, it took huge teams of data scientists and specialized data engineers to process event data for companies the size of Walmart. Now, however, even a single developer can capture billions of detailed interactions and begin running queries in seconds, accessing the data programmatically and in real time. This makes it possible to build intelligent apps and services that use insights from event data, to personalize the user experience, and display information dynamically.

One Major Challenge, but Many Rewards

A few industry giants have been able to build event data powerhouses because of the incredible access they have to talent. They hire expensive, specialized teams who build their own home-grown technology stacks. In many cases, companies like Facebook end up inventing their own distributed systems technologies to handle emergent data needs.

Most other companies lack this endless flow of resources. They can’t afford to build the infrastructure and acquire the headcount needed to maintain it. Even those that have the capital are struggling against a massive shortage of talent for roles in data infrastructure and data science. New candidates won’t materialize fast enough to build and support the world-class data capabilities every company wishes they had.

However, capturing event data is extremely important. It lets companies build a new class of products and experiences, and identify patterns that otherwise would be impossible to see. It also lets them build apps that perform far more advanced, programmatic analysis, and make real-time decisions on how to engage the user — suggesting the right product, showcasing the right content, and asking for the right actions.

Just as organizations migrated en masse from on-premise servers to cloud hosting and storage in the mid-2000s, many companies are starting to adopt data platforms like Keen so they can compete in areas they couldn’t build in-house.

Keen IO: The Event Data Platform

We built Keen to let customers use our platform as the foundation for all of the powerful event data capabilities they want to build. By leaving the analytics infrastructure to Keen, any developer or company can wield the power of event data extremely well, without a specialized background in data engineering.

We help over 3,500 customers crunch trillions of data points every day, gathering data with our APIs and storing it for them to analyze with programmatic queries to fuel any metrics or tools they need to build. Once they adopt Keen, customers report huge savings in engineering and analyst resources, far better accuracy in measuring crucial app and user analytics, and the ability to infuse real-time analytics into every part of their operations.

Event data is increasingly interwoven into software. Photo by Carlos Muza.

Event Data in Action

When companies build on an event data platform, they can accelerate their businesses in ways that weren’t possible before.

  • They anticipate what users will need and take the product in the right direction, by using event data to improve the user experience and test changes to the application or hardware.
  • They show users extremely relevant content and demand higher ad revenue from top advertisers because of the engagement metrics they derive from event data.
  • They provide deep reporting and quantify ROI for their customers — when SaaS products can provide reliable and accurate reporting, they deepen customer trust, engagement, and spend.

Can Event Data Bring a Richer Future?

The ability for companies to operate like they have Facebook’s data infrastructure is a game-changer. They can scale faster, make better decisions, and create smarter, helpful products people don’t even know they need yet. Event data will inevitably shape the way almost every company grows, and those who don’t embrace it will likely lose out to the ones who do.

Comments welcome, or start a conversation with us over at Keen IO.


Data Science Cultures: Archaeology vs. Astronomy

I’ve been writing a lot about intentionality in data science, about how having a sense of history (present and future) can be incredibly powerful for any enterprise.

Think about how archaeologists use data to seek the truth, as compared with how astronomers do it.

Clay pot remnants. Credit: Wessex Archaeology

Archaeology starts with digging. It’s all about studying the data that’s buried in the system (i.e. the fossil record), which means studying things that probably weren’t put there intentionally (depending on your belief system). Without a time machine, it’s impossible to change the structure of the record, to apply intention to the signal, so we do the best we can with what we’ve got: we mine through the accidental signals, discarding the (literal) mountains of noise, in an effort to find the truth about history. Perhaps as should be expected, this effort is expensive and leads to mixed results.

On the other hand, Astronomy is a very different field. Astronomy starts way earlier than digging — it starts with planting. At instrumentation-time, astronomers can point the telescopes where they want, measuring and recording the signals they want. Unlike archaeologists, astronomers have the ability to design the record and its structure, to choose the signals with intention. Doing this intentionally sets up the data record to yield the discoveries they already know they want to make, but also to be somewhat future-proof (which means it can yield unpredicted, emergent discoveries, to be harvested down the road — often by different people than the planters).

The spacecraft Dawn’s spiral descent toward dwarf planet Ceres. Credit: NASA/JPL

Now let’s compare the results of these two fields.

Astronomy (with its cousin Astrophysics) has taught us amazing lessons, things about the motion of the galaxy, the origin of the universe, and the underlying physical principles of multi-dimensional reality.

Turning our gaze to the stars, we learn about the earth. That’s pretty impressive.

Meanwhile, as of last year, Archaeology is still struggling to figure out how many years ago Homo sapiens emerged. And they can’t seem to agree on it, even though all the data is right under our noses. This isn’t because they’re incompetent (some of the best pattern-seeking humans in all of science work in archaeology), but rather because the data sucks.

Clearly, one of these truth-seeking disciplines is a lot more powerful than the other, and at Keen IO, we contend this is because they can control the data model. Data modeling is powerful indeed.

Inspection of any kind — be it human introspection or scientific inquiry — is more powerful when you can apply a variety of observational frameworks, choosing the best of them.