The Art and Science of Storytelling Through Data at Jawbone

Published in Insight on Apr 12, 2016

Kirstin Aschbacher was a Fellow in the fall 2015 session of Insight’s Data Science Fellows program in Silicon Valley. She is now working as a Data Scientist at Jawbone. Kirstin recently wrote about how getting a good night’s sleep may influence how much, and what, we eat the next day, for the Jawbone blog. In the post below, Kirstin describes the process of developing a data science project from concept to publication.

NOTE: ALL THE DATA USED BELOW, AND IN KIRSTIN’S ORIGINAL POST, CAME FROM AN ANONYMIZED DATA SET OF JAWBONE UP USERS.

Data science and media communications may seem as far apart as the math team and the cheerleaders in high school. However, to write a good data science blog you have to find their common ground.

Recently, I wrote a blog post for Jawbone on sleep and calories, and spent a good deal of time thinking about how exactly to find this common ground. Many data blogs focus on the models and mathematics, and I thought it might be helpful to share a little bit about the art and science of storytelling through data. There are three main principles to keep in mind.

1. Start with the Question: So What?

The first thing I did was think about who the users are and what motivates them. What problems are they trying to solve when they purchase a Jawbone UP? I asked my friends who wear trackers what they like and don’t like. I asked coworkers over coffee what they think motivates UP users. Second, I considered where the data are strong (i.e., big in size, high in signal-to-noise, and measuring sexy constructs). Themes that emerged were that people really like the UP sleep data, and that many people use trackers to help balance vices and virtues, like eating delicious foods but exercising enough to maintain their weight. Picking the right question is half the battle.

Next, I read the scientific literature on sleep and eating, and identified a hypothesis supported by experimental and mechanistic evidence. I also identified the key gap I could fill. It appears that no one has examined this relationship in a very large sample, especially using passive sleep signals (versus self-reported sleep). Moreover, people sleep differently in a lab than in the comfort of their homes. So there is real value in demonstrating that these lab findings translate broadly.

Finally, you must develop a one-sentence pitch that anyone standing next to you at the bus station can relate to. For this sleep study the main question we asked was, “Does poor sleep influence how much food you eat the next day?” Stay focused on why the product you are providing is valuable. For people who buy the UP, if we can detect the relationship between sleep and calories, we can begin to develop Smart Coach insights to give users tailored feedback. These findings are actionable.

2. Simplify. Simplify. Simplify.

Some people are so busy being brilliant you can’t understand anything they are saying. True brilliance is making a complicated idea seem simple. I asked for a lot of feedback while writing the blog from fellow data scientists, writers, and marketing experts at Jawbone. I cut a lot of words, simplified my language (a blog is not a scientific publication), and made difficult decisions about how to ride the line between being understood and perfectly accurate. A dear friend of mine once said, “You can’t love people the way you want to love them, you have to love them in the way they want to be loved.”

For example, I started with the graph below. This was the way I had always represented an association between variables in papers. It was my automatic, academic way of thinking about things.

However, when I sat down to implement the scatterplot in Highcharts, I realized I needed a different solution. Here’s what we ended up publishing from Jawbone:

The information is clearer and more interpretable when there are fewer points, in this case binned by the hour. The overall trend should stand out the most. People want to look just long enough to find out how late they can stay up before it has caloric consequences. Here is a simplified example of what the JavaScript code looks like.
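The binning that feeds such a chart happens before anything is drawn, and it can be sketched separately in Python. The records, values, and function names below are invented for illustration and are not Jawbone data or code; the idea is simply to average next-day calories within each bedtime hour and order the hours so that times after midnight come last.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (bedtime_hour, next_day_calories) records; the values are
# invented for illustration only.
records = [
    (21.5, 1900), (21.9, 1950),
    (22.2, 2000), (22.8, 2040),
    (23.1, 2100), (23.7, 2080),
    (0.4, 2200), (0.9, 2260),
]

def bin_by_hour(rows):
    """Group calories by the whole hour of bedtime and average each group."""
    bins = defaultdict(list)
    for bedtime, calories in rows:
        bins[int(bedtime) % 24].append(calories)
    return {hour: mean(values) for hour, values in bins.items()}

def evening_order(hour):
    """Sort key that places hours after midnight after the late-evening hours."""
    return hour + 24 if hour < 12 else hour

binned = bin_by_hour(records)
# One [x, y] pair per hour, ordered from early evening to after midnight.
series = [[hour, binned[hour]] for hour in sorted(binned, key=evening_order)]
print(series)
```

A list of [x, y] pairs like `series` is one of the data shapes a Highcharts series accepts, so a handful of binned points can replace thousands of raw ones.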

3. Return on Investment (ROI) = Product/Time

In academia, I could spend months to a year perfecting a paper. At Jawbone, I need to build a data product in a few weeks. This calls for a shift from perfection to production.

SQL can optimize time by performing computations, recodes, and dropping missing or invalid values (example below).
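As a rough sketch of what that can look like, the snippet below runs a single query against an in-memory SQLite database from Python. The table name, column names, sample rows, and the 11 pm cut-off for an "early bird" are all assumptions made for illustration; they are not the Jawbone schema or definitions. The query computes a derived column, recodes bedtime into a category, and drops missing or invalid rows in one pass.

```python
import sqlite3

# In-memory database with a hypothetical sleep_log table (not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sleep_log (user_id INTEGER, bedtime_hour REAL, minutes_asleep INTEGER)"
)
conn.executemany(
    "INSERT INTO sleep_log VALUES (?, ?, ?)",
    [
        (1, 21.5, 480),
        (2, 23.75, 420),
        (3, 1.0, 300),
        (4, None, 450),  # missing bedtime, should be dropped
        (5, 22.0, -10),  # invalid duration, should be dropped
    ],
)

query = """
SELECT user_id,
       minutes_asleep / 60.0 AS hours_asleep,            -- computation
       CASE WHEN bedtime_hour >= 12 AND bedtime_hour < 23
            THEN 'early bird'
            ELSE 'night owl' END AS sleeper_type         -- recode (assumed cut-off)
FROM sleep_log
WHERE bedtime_hour IS NOT NULL                           -- drop missing values
  AND minutes_asleep BETWEEN 1 AND 1440                  -- drop invalid values
ORDER BY user_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Doing the cleaning in the query means only analysis-ready rows ever leave the database, which keeps the downstream Python short.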

In this blog, the food type analysis (see below) is an example of a task that could easily become time-consuming. SQL calculations and common sense heuristics can often get you 80% of the way there in 20% of the time. I used SQL to pull one million food entries from “early birds”, another million from “night owls” and counted how many times each food occurred. I calculated an index of overrepresentation — i.e., the ratio of how frequently early birds versus night owls logged each food.

However, initially, some foods that rose to the top of the list were uninteresting (e.g., “water”). Others were highly overrepresented, but infrequent (i.e., the ratio was high but the total occurrence was low), and therefore they didn’t highlight a concern relevant to most UP users. To solve these problems, I filtered out words like “water” and selected only food items with at least 50,000 total entries. One could use fancier techniques to determine an optimal cut-off, but this heuristic was fast and it worked very well. Finally, I ported the data into Python and used SciPy’s chi-square tests to verify whether food items differed significantly between the groups. The take-home message here is that smart SQL calculations and simple statistics can be a means to achieve high ROI.
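The ratio, the two filters, and the significance check can be sketched in a few lines of Python. Everything here is illustrative: the food lists are toy data, the threshold is scaled down from 50,000 to suit them, and the function names are hypothetical. To keep the sketch dependency-free, the chi-square statistic for a 2x2 table is computed by hand without a continuity correction, standing in for the SciPy call mentioned above.

```python
from collections import Counter
from math import erfc, sqrt

# Toy food logs, 100 entries per group (invented for illustration).
early_birds = ["water"] * 40 + ["oatmeal"] * 30 + ["coffee"] * 20 + ["pizza"] * 9 + ["kale"]
night_owls = ["water"] * 40 + ["oatmeal"] * 10 + ["coffee"] * 20 + ["pizza"] * 30

def overrepresentation(early, night, exclude, min_total):
    """Ratio of early-bird to night-owl logging rates for each retained food."""
    e, n = Counter(early), Counter(night)
    e_total, n_total = len(early), len(night)
    ratios = {}
    for food in set(e) | set(n):
        if food in exclude or e[food] + n[food] < min_total:
            continue  # skip uninteresting or infrequent foods
        e_rate, n_rate = e[food] / e_total, n[food] / n_total
        ratios[food] = e_rate / n_rate if n_rate else float("inf")
    return ratios

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for [[a, b], [c, d]]."""
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = total * (a * d - b * c) ** 2 / denom
    return chi2, erfc(sqrt(chi2 / 2))  # survival function for 1 degree of freedom

ratios = overrepresentation(early_birds, night_owls, exclude={"water"}, min_total=10)
e_counts, n_counts = Counter(early_birds), Counter(night_owls)
for food, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    chi2, p = chi_square_2x2(
        e_counts[food], len(early_birds) - e_counts[food],
        n_counts[food], len(night_owls) - n_counts[food],
    )
    print(f"{food}: ratio={ratio:.2f}, chi2={chi2:.2f}, p={p:.4f}")
```

Filtering on total count before computing the ratio also avoids most divisions by zero, since foods logged by only one group tend to be the rare ones.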

Lastly, we did an additional analysis (not included in the blog) looking at how the overrepresentation of these food categories differed by major cities across the globe. That was fascinating, and it might become a follow-up post. I played with Google Translate to help manage the fact that some of the food items were entered in different languages. For example, I used Python to return a string that could be integrated back into a SQL pull (see below).

This program allowed me to generate a line of code to feed back into SQL, to facilitate the extraction of food categories, or the exclusion of food items, where the string data were encoded in different languages (example below).

SELECT meal_item, meal_date FROM meals_table
WHERE LOWER(meal_item) NOT IN ('water', 'eau', 'wasser', 'agua', 'água', 'воды', '水', 'পানি', 'maji', 'ماء', 'اب');
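A string like the one in the query above could be assembled along these lines. This is only a sketch: the helper name `sql_string_list` is hypothetical, and the translations are typed in by hand here rather than fetched from a translation service.

```python
def sql_string_list(words):
    """Return a parenthesized, comma-separated list of SQL string literals."""
    seen = []
    for word in words:
        cleaned = word.strip().lower()
        if cleaned and cleaned not in seen:  # drop blanks and duplicates
            seen.append(cleaned)
    # Double any single quote so each word stays a valid SQL literal.
    quoted = ", ".join("'" + w.replace("'", "''") + "'" for w in seen)
    return "(" + quoted + ")"

# Hand-entered translations of "water" (a subset, for illustration).
water_words = ["water", "eau", "Wasser", "agua", "água", "水", "水"]
clause = "WHERE LOWER(meal_item) NOT IN " + sql_string_list(water_words)
print(clause)
```

Lowercasing the words matches the `LOWER(meal_item)` comparison in the query. For one-off analysis pulls, pasting the generated clause into SQL is convenient; where the words come from untrusted input, parameterized queries are the safer choice.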

As you can see, the analyses for this blog did not require an extremely complex model. Sometimes, the more complex the model, the harder it is to explain in a simple way. In practice, however, it becomes complex to integrate multiple programming languages (SQL, Python, JavaScript, and HTML) while reviewing the scientific literature, working with the press team, and crafting a clear story on a tight timeline.

Insight prepared me very well for these challenges. Beyond the technical skills we acquired, the program forced us to down-scope an idea in a very practical way. In academia, I find that the emphasis is often placed on tackling big problems. Paradoxically, however, simpler stories are often the most successful. So, it can be helpful to focus less on the data, and more on what the data reveal about the problems that matter most to your users.

Interested in transitioning to a career in data science?
Find out more about the Insight Data Science Fellows Program in New York and Silicon Valley, apply today, or sign up for program updates.

Already a data scientist or engineer?
Find out more about our Advanced Workshops for Data Professionals. Register for two-day workshops in Apache Spark and Data Visualization, or sign up for workshop updates.
