Learning to Think Like a Data Scientist

Esther Richler
Published in Insight
10 min read · Nov 10, 2020

Two rolling hills covered in poppy blooms with the sky and wisps of cloud visible in the background.

Originally published in The ML Rebellion.

Earlier this year I made the transition from academic neuroscientist to applied data scientist. As part of this process, I had to shift out of my familiar research-oriented mindset and into a more business-relevant, product-driven kind of thinking. To ease and guide this adjustment, I participated in the Insight Data Science fellowship where I used a business-minded approach to develop a data product. In this blog, I will share what I learned from the experience by walking through the steps I took to complete this project. Part 1 covers the project planning phase, and Part 2 covers the project execution phase. Throughout both, I will highlight key concepts and outline a framework for developing data products in a business setting.

Project Planning Phase

The main task of this phase is project ideation, and the main goal is to develop a project roadmap. No coding or machine learning is used. The three steps in this phase are:

  1. Identifying the business need
  2. Defining a minimum viable product
  3. Identifying potential data sources

Properly completed, these steps represent the roots of a healthy project.

Step 1. Identify the need

The overarching purpose of any data product is to provide value by filling an unmet need. It is important to define this need right at the start of the project planning phase because it will become the guide by which all subsequent decisions are made. You can think of this guiding need (as I will call it) as a fixed target that you are navigating toward. It begins broad and can be approached from various directions. Then, as you build your product, it narrows and becomes more defined. It may be tempting to try to grow a product around a particular dataset or tool, but without a guiding need you will quickly find yourself directionless and unfocused. This is poor business practice and not recommended.

I started by brainstorming needs that could be met with data science approaches. I then converted the needs I had generated into a shortlist of potential project ideas. At this point, it was helpful to bounce my ideas off of colleagues and friends, especially those with relevant domain knowledge. I found that some of my project ideas had already been implemented, or did not really solve a need, so I weeded those out. Finally, after some deliberation, I settled on a project that would address a wildflower-related tourism need. More specifically, I would build a product to let users know where and when wildflowers were blooming.

Some of the factors that I took into consideration when making this choice were scalability potential, impact potential, and user base potential. Because I only had four weeks for this project, the scale had to stay small. Even so, I wanted to ensure that my product had enough room for future growth: I wanted to choose a guiding need that was broad enough to allow for added functionality, given more time and resources. I also wanted to make sure that the product could be widely relevant across users and use-cases, and that it could provide disruptive value. These factors are not essential in a personal data project, but cultivating this kind of thinking is good preparation for working in a profit-driven environment.

Another reason that I chose this particular project was that there was a lot of tension and stress in the air at the time (June 2020), and I figured it would do me some good to think about flowers for a little while.

Wildflower Bloom Season

Wildflowers bloom in the springtime, and every year visitors flock to state and national parks to see the blooms and enjoy their expansive displays of color. But wildflower season only lasts a few weeks, and the exact timing, location, and quality of the blooms vary from year to year. To help people time their visits, many parks provide periodic bloom status updates. However, updates are not available for blooms outside of these parks. I wanted to explore the needs caused by this lack of information, so I created a few scenarios to get me thinking about the potential end-users of my product:

Michelle lives in San Francisco and would like a quick last-minute getaway before starting a new job. Her friends have told her about the beautiful fields of wildflowers they saw a week earlier in Yosemite, but she does not want to travel that far. Is there someplace else close by where she can enjoy the bloom season?

David is planning a three-month RV road trip through the western US. He likes painting watercolors of nature scenes, and wants to be able to set up his easel and capture the wildflower scenery at spots along the way. How can he plan a route that gets him to the right places at the right times?

Emily, an Instagram influencer living in LA, is always on the lookout for good photo-ops. She heard that the nearby Antelope Valley Poppy Reserve is bursting with color, but she wants to avoid the crowds. Besides, she doesn’t want her photos to look like everyone else’s, and she’s willing to travel to remote locations for unique opportunities. Where can she go to get poppy photos that will set her apart from the crowd?

Crowds at the Antelope Valley California Poppy Reserve (image courtesy of California State Parks, 2019)

Putting myself into the shoes of the end users helped me explore the need and figure out what kind of value I could provide. I knew I would not be able to solve each individual’s unique need, but I realized that all of their needs had a common theme: difficulty with planning trips to see blooms in non-park locations. I decided I wanted to build a product that would grant users access to more wildflower locations by making bloom information easily available.

Step 2. Define a minimum viable product

Once a guiding need is identified, it is time to define a minimum viable product. A minimum viable product is the most stripped down version of the product that can still fill the guiding need. In other words, it is a proof-of-concept prototype that meets the basic expectations of the end-user. It also serves as a starting point in the product development process, with improvements and functionalities added in an iterative manner.

Learning about this concept was incredibly freeing for me because it let me circumvent my perfectionist tendencies and rapidly work to produce a deliverable. In academic research, the ultimate goal is to unearth new truths. Results must be accurate and exact, and project timelines are long to allow for multiple validation steps. But I was now on new terrain: the bottom line was business value, and the goal was to get there as fast as possible. Business needs can shift quickly, so it is imperative to maintain maneuverability by identifying problems early and nipping them in the bud. Products are validated by their usefulness, not by their exactness, and the ability to rapidly recognize and discard dead-end directions is far more valuable than the ability to burrow deep into a problem and produce flawless work. It felt strange to abandon the pursuit of perfection, but I soon came to find the fast pace exhilarating and the productivity rewarding. It also helped to know that I could correct errors on subsequent iterations if needed.

Techniques to define a minimum viable product

I used two techniques to help me define my minimum viable product. The first was to think about the product in specific detail: it would be an interactive web application that provides current bloom status and a two-week bloom forecast for poppies in California. However, my mentors pointed out that targeting the entire state was too ambitious for a minimum viable version of the product. So I decided to provide information for just three specific California locations. Other regions could be added to later versions of the product if I chose to scale it up and get the necessary data engineering support.

The three California locations I chose to include in my minimum viable product

These kinds of specific details serve to focus your efforts — but they are not set in stone. As the project progresses it may become necessary to change them a bit to account for limitations, roadblocks, or shifting demands. This is okay as long as they remain within the scope of the guiding need. And as you will see later, I did in fact have to change some details as I moved through the steps of the project.

In retrospect, I think it is a good idea to start out with an overly inflated list of specific details, and then whittle them down one by one as you work towards your minimum viable product. This lets you explore the future potential of the product, and gives you wiggle room for pivots, if needed later.

It can even be useful to think about related products that are beyond the scope of the guiding need. For example, I wondered if there was a need for products to predict fall foliage color changes in the northeastern US, or Jacaranda blossoms in South Africa. I had no intention of building such products, but this kind of thinking helped me establish the boundaries of the product I was building. And who knows, I might find that once I’ve established a data pipeline for my product, it would not be so difficult to build these additional products!

The second technique I used to define the minimum viable product was to visualize in detail what it would look like. This helped me dig down even further into specific details. I got out some paper and markers and drew a few sample mockups until I had one I was happy with. This process forced me to think about all the inputs and outputs of the app, and about how I would communicate the results to the user. I decided that the landing page would have two user input boxes — one for location and another for date — and a clickable ‘Find Poppies’ button. The predicted bloom level for the requested location and date would then be outputted in text format as either ‘low’, ‘medium’, or ‘high’. I gave the app a temporary name (PoppyPredictor) as a placeholder until I could come up with something more catchy.
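To make the mockup's inputs and outputs concrete, here is a minimal Python sketch of that interface. It is only an illustration, not the code behind PoppyPredictor: the location names are placeholders (the three locations are not named here), the `predict` callable stands in for whatever model is eventually built, and the idea of a score between 0 and 1 with evenly spaced cutoffs is an assumption made for the example.

```python
from datetime import date

# Hypothetical placeholders: the three supported locations are not named in the post.
SUPPORTED_LOCATIONS = ("location_a", "location_b", "location_c")


def bloom_level(score: float) -> str:
    """Map a bloom score in [0, 1] to the text label shown to the user.

    The evenly spaced cutoffs are an assumption made for illustration.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < 1 / 3:
        return "low"
    if score < 2 / 3:
        return "medium"
    return "high"


def find_poppies(location: str, when: date, predict) -> str:
    """Handle one 'Find Poppies' request: two inputs in, one text label out.

    `predict` is a hypothetical callable (location, date) -> score in [0, 1].
    """
    if location not in SUPPORTED_LOCATIONS:
        raise ValueError(f"unsupported location: {location!r}")
    return bloom_level(predict(location, when))


if __name__ == "__main__":
    # A stand-in model that always returns the same score.
    print(find_poppies("location_a", date(2020, 4, 1), lambda loc, d: 0.8))
```

Writing the interface down this way has the same benefit as drawing the mockup: it forces a decision about what the user supplies, what the app returns, and what happens when a request falls outside the minimum viable product's scope.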

With these specific goals planted firmly in mind, I was now prepared to start thinking about what data I needed and to identify potential data sources.

An early product mockup

Step 3. Identify potential data sources

The most important data I needed was information about current and past poppy blooms in California. I thought of a few possible ways to get this information:

  1. Scrape bloom updates from park websites
  2. Access poppy sighting reports from specialized websites like GBIF or iNaturalist
  3. Identify large blooms in satellite images
  4. Use poppy mentions on Twitter or Instagram as a proxy for bloom occurrences

I also looked for sources of temperature and precipitation data because these could be useful predictors of poppy growth. Fortunately, current and past California meteorological records were freely available and easy to access.

Now, no data source is perfect, and each comes with its own set of challenges (e.g. accessibility, dirtiness, missing labels, not enough data). But the top priority when deciding which one to use should be how close it can get you to meeting the guiding need. Technical difficulties can be overcome, but an irrelevant dataset, no matter how clean or big or easy to work with, will do nothing to get you to your goal. Besides, challenges can be fun, and they’re a great way to showcase creativity and problem-solving skills!

With that in mind, I evaluated the usefulness of each of the poppy data sources under consideration. Social media and park websites yield rich and fertile data about poppy blooms in popular locations. That kind of data is useful for building forecasting models, but it lacks the key information needed for my product: data about blooms in non-park locations. A quick perusal of botany-related websites revealed poppy sighting reports in non-park locations, but these reports were sparse and infrequent, and they were limited to the initial poppy sightings of the season. While that information could help predict the start of poppy season in non-park locations, it was still not sufficient to meet the guiding need. Satellite data, on the other hand, is both agnostic about location and not reliant on human observers. These characteristics make it an excellent and consistent source of information about off-the-beaten-path wildflower blooms. And while satellite data can suffer a loss of granularity compared to data from other sources, this deficiency is more than compensated for by the unique value it provides.
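The comparison above can be sketched as a simple relevance filter. This is a rough illustration rather than a tool used in the project: the `DataSource` class and its two flags are hypothetical, and the True/False values are one reading of the paragraph above, not measured properties of the sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    covers_non_park: bool     # has information about blooms outside the parks
    covers_full_season: bool  # not limited to the first sightings of the season


# Flags reflect an interpretation of the evaluation in the text (an assumption).
CANDIDATES = [
    DataSource("park website updates", covers_non_park=False, covers_full_season=True),
    DataSource("social media mentions", covers_non_park=False, covers_full_season=True),
    DataSource("poppy sighting reports", covers_non_park=True, covers_full_season=False),
    DataSource("satellite images", covers_non_park=True, covers_full_season=True),
]


def meets_guiding_need(source: DataSource) -> bool:
    """Relevance to the guiding need comes first; technical difficulty is ignored here."""
    return source.covers_non_park and source.covers_full_season


relevant = [s.name for s in CANDIDATES if meets_guiding_need(s)]
print(relevant)
```

Note that the filter deliberately has no field for cleanliness or ease of access: those are challenges to work through after a source has passed the relevance check.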

Artist’s rendition of the Sentinel-2 satellite (image courtesy ESA, EADS Astrium)

I had now identified two potential streams of data — satellite and weather — and it was time to finally move out of the planning phase and start getting my hands dirty with some actual datasets.

In Part 2 of this blog, I will discuss data preparation and project execution. Spoiler alert: the final product looks different from what I had envisioned in the planning stages. Stay tuned to hear about the plot twists that arose and how I navigated them!

For more information about the project check out the GitHub repository.

Are you ready to make a change & transition to a career in tech? Sign up to learn more about Insight Fellows programs and start your application today.

A note to my readers: I’d love to get to know you! Just for fun, I planted botany puns throughout this blog post. Hit me up on Twitter @EstherRichler and tell me which ones you’ve found. Even better, come up with new ones I haven’t thought of!

Data scientist | Communicator | Former neuroscientist | Lover of anything word related