Preparing for Insight

Insight
5 min read, Apr 16, 2014


John Joo is an Insight alumnus from the August 2013 session with a PhD in applied physics from Harvard. He joined Insight as a Program Director in January, leading the most recent cohort of Fellows in their transition from academia to industry.

When I was first considering making the transition from applied physics to data science, I had a lot of questions. What skills did I need to develop to get started in data science? What courses should I take? Did I need to know how to program and code? What languages? How much statistics did I need to know? The list goes on. Now that I’ve spent a few months as a Program Director here at Insight, I think it’s time I shared with you the tools and tips that got me, and nearly 100 other Insight Fellows, started on our transition to data science.

Data Science Resources from Insight Data Science

Python

Python has become the standard programming language for data science at almost all of Insight’s mentor companies, so the more you can do to practice coding in Python, the better. As a data scientist, you’ll use Python to write scripts that take your data set (in whatever format it happens to be in) and get it into a format that you can work with. Once you have your data cleaned, you’ll store it in a SQL database.
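As a rough illustration of the kind of cleaning script described above, here is a minimal sketch using only Python's standard library. The raw data, the column names, and the clean_rows helper are all hypothetical, invented for this example; real data sets will need their own rules.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, mixed-case labels,
# one blank measurement, and one non-numeric measurement.
RAW = """sample_id, material ,temperature_k
 A1, Silicon ,300.5
A2,silicon,
a3, GaAs ,n/a
A4,gaas,77
"""


def clean_rows(text):
    """Parse CSV text and return a list of cleaned record dicts.

    Rows whose temperature is missing or not a number are dropped.
    """
    reader = csv.reader(io.StringIO(text))
    header = [name.strip() for name in next(reader)]
    cleaned = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, (field.strip() for field in row)))
        try:
            temperature = float(record["temperature_k"])
        except ValueError:
            continue  # missing or non-numeric measurement
        cleaned.append({
            "sample_id": record["sample_id"].upper(),
            "material": record["material"].lower(),
            "temperature_k": temperature,
        })
    return cleaned


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Dropping bad rows is only one possible policy. Depending on the project, you might instead keep them and mark the measurement as missing.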

Action Item: Complete Google’s Python Class. Insight Fellows from previous sessions have raved about this course and it should only take a few days to complete.

MySQL

Scientists working in academia often just store data in text files, but in industry a database is almost always used because of the increased performance, among other reasons. MySQL is one of the most widely used databases in industry. Although you will most likely be using Hive or Pig to access “big data,” we have found that these languages are so close to SQL that Fellows have no trouble crossing the bridge to Hive and Pig after mastering SQL.

Action Item: Work through Tutorials 1–6 on the SQLzoo website. Feel free to explore the rest of the website for anything you need to complete the tutorial questions. As a bonus, if you feel comfortable with the SQL on the SQLzoo website, try to work through the SQL homework questions on the Databases course on Coursera. If you want to start a data project using MySQL and Python, follow Zetcode’s tutorial to learn to interface with MySQL through Python.
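To give a feel for what talking to a SQL database from Python looks like, here is a small sketch. It is not the Zetcode tutorial's code: so that it runs without a database server, it assumes Python's built-in sqlite3 module as a stand-in for MySQL. The table and column names are hypothetical. The overall pattern (connect, get a cursor, execute, fetch) carries over to MySQL drivers, though connection setup and placeholder syntax differ.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table for cleaned measurements.
cur.execute(
    "CREATE TABLE measurements ("
    "sample_id TEXT PRIMARY KEY, material TEXT, temperature_k REAL)"
)

rows = [
    ("A1", "silicon", 300.5),
    ("A4", "gaas", 77.0),
    ("B2", "silicon", 295.0),
]

# Parameterized insert: the driver handles quoting of the values.
# sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
cur.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)
conn.commit()

# Aggregate in SQL rather than in Python.
cur.execute(
    "SELECT material, COUNT(*), AVG(temperature_k) "
    "FROM measurements GROUP BY material ORDER BY material"
)
summary = cur.fetchall()
conn.close()

for material, count, mean_temperature in summary:
    print(material, count, mean_temperature)
```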

Data Analysis using Python

Now that you have gotten your data into MySQL using your Python coding skills, you can start doing data analysis to find meaningful insights in the data or to make predictions. Up to a few years ago, the R statistical programming language was the best way to do this (and it continues to be one of the best, if you’re already proficient in R). However, in the past couple of years, Python, with its libraries such as SciPy, NumPy, pandas, IPython, and matplotlib, has become the tool of choice for data scientists.

Action Item: Watch Wes McKinney’s video on how to use pandas in Python for data analysis. Follow along with his examples using the data and IPython notebooks found here. For a quick introduction to pandas, check the 10 Minutes to pandas tutorial. Wes McKinney also wrote the book Python for Data Analysis, and you may want to order it to use as a reference.
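For a taste of what analysis with pandas looks like before you dive into the video, here is a very small sketch. It assumes pandas is installed, and the data and column names are made up for illustration. It shows two everyday operations: filtering rows with a boolean mask and summarizing with a group-by.

```python
import pandas as pd

# Hypothetical measurements; in a real project these might come from
# a SQL query or a CSV file instead of being typed in by hand.
df = pd.DataFrame({
    "material": ["silicon", "gaas", "silicon", "gaas"],
    "temperature_k": [300.5, 77.0, 295.0, 80.0],
    "resistance_ohm": [10.0, 2.0, 12.0, 4.0],
})

# Boolean mask: keep only the low-temperature rows.
cold = df[df["temperature_k"] < 100]

# Group-by: mean resistance for each material.
mean_by_material = df.groupby("material")["resistance_ohm"].mean()

print(cold)
print(mean_by_material)
```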

Machine Learning

Having an understanding of the most common machine learning techniques and experience using them is becoming a must-have for many data science teams. You could spend years studying the entire field, of course, but even having basic exposure at the level of an introductory class will get you most of the way toward being able to speak intelligently about the topic, knowing when to use a specific algorithm for a given problem and being able to start working with a given machine learning technique to build your project.

Action Item: Take Andrew Ng’s machine learning course on Coursera. This is widely viewed in industry and by Insight Fellows as the best introduction and overview to machine learning that exists today. Insight Fellows from previous sessions found it very helpful to take summary notes while going through the course and attempting the exercises in Python.
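As an example of what attempting an exercise in Python might look like, here is a sketch of one of the first techniques an introductory course usually covers: fitting a line by gradient descent. This is not taken from the course materials. The fit_line function, the learning rate, the step count, and the toy data are all assumptions chosen for this illustration, and it uses plain Python lists rather than a numerical library.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = slope * x + intercept by batch gradient descent.

    Minimizes the mean squared error over all points at every step.
    """
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for each point under the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error.
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        # Step downhill.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The learning rate matters: if it is too large for the scale of your data, the updates diverge instead of converging, so expect to tune it on real problems.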

Computer Science Fundamentals

Data scientists don’t deal much with the fundamentals of computer science (CS), and most of your interviews to be a data scientist will focus on topics that correspond to doing data science (i.e., the topics above). However, for better or worse, a part of the interview process at each company will include coding interviews. At Insight, Fellows spend a decent amount of time doing CS coding exercises in preparation for these types of interviews. In particular, the two (and pretty much the only two) topics you’ll need to understand (at a CS 101 level) are algorithms and data structures.

Action Item: Code the examples in Problem Solving with Algorithms and Data Structures in Python. In particular, become familiar with stacks, queues, linked lists, merge sort, quick sort, and searching and hashing. If you prefer to learn by watching lectures, check out the MIT Introduction to Algorithms course. Bonus: For each algorithm or data structure you learn about, try to program it from scratch in Python, from memory. Many Fellows have also found LeetCode useful for the CS portion of their interview prep.
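To show what programming an algorithm from scratch can look like, here is one possible version of merge sort, one of the algorithms listed above. It is a sketch written for this post, not the textbook's implementation: it returns a new list instead of sorting in place.

```python
def merge_sort(items):
    """Return a new sorted list using top-down merge sort."""
    if len(items) <= 1:
        return list(items)  # zero or one element is already sorted

    # Split in half and sort each half recursively.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # "<=" keeps equal items in original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1

    # One half is exhausted; append whatever remains of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1, 5, 6]))
```

A good self-check in interview prep is to explain why this runs in O(n log n) time: the list is halved about log n times, and each level of recursion does O(n) work merging.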

Daily Reading

Here are some of the primary news sources read by people in tech. We highly recommend starting to skim some of these news sources on a daily basis:

Concluding thoughts

Working through the textbooks and tutorials above is great, but the best way to learn data science is by doing data science. The next step is to actually use these tools and techniques to start a data-related side project.

This list was started in the first sessions of Insight as a way to make sure that our Fellows were prepared for Day 1 of the program — we called it “Preparing for Insight.” Each session since, we have worked with the current Fellows and mentors to iterate and improve this list to be one of the most effective “getting started” guides out there. With nearly 100 Insight alumni now working as data scientists, we feel confident that this list will prepare you to begin your own transition.

Interested in transitioning to a career in data science?
Find out more about the Insight Data Science Fellows Program in New York and Silicon Valley, apply today, or sign up for program updates.

Already a data scientist or engineer?
Find out more about our Advanced Workshops for Data Professionals. Register for two-day workshops in Apache Spark and Data Visualization, or sign up for workshop updates.
