It’s a pun. On the word awesome. You know? It’s punny!
Here you are: you’ve ventured into the vast and exciting world of data science. You’ve thought of your first unique and exciting project, based on an idea you love, and you feel it is going to revolutionize your industry!
You’ve spent so much time learning Python and exploring its capabilities and you know that it’s go time. You open your computer, start your Jupyter Notebook, annddd…
Nothing. Absolutely nothing is feeding into that big brain of yours because you are stuck before you even start. If you feel as if you’re the only person in the world that is going through this, whether it be on project #1 or #100, believe me, you are not. My name is Acusio Bivona, I am a Part-Time Online Data Science student at Flatiron School, and this blog post is all about teaching you a simple, general workflow that can be followed for many data science projects. You just have to do one thing — be O.S.E.M.N!
Step 1a: Obtain your Data
The first thing you have to do on any data science project is obtain your data. After all, you are a Data Scientist — not a nothing scientist!
There are a variety of ways to obtain your data. One of the more common ways is through finding a dataset that has already been created. Kaggle.com is one of the most popular sources for finding datasets that have been vetted and trusted by the data science community. Another very popular way of obtaining data is through using SQL. SQL allows you to query data in relational databases so you can get exactly the kind of data you need.
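As a minimal sketch of the SQL route, the snippet below builds a tiny in-memory SQLite database and queries it straight into a pandas dataframe. The table name `wines`, its columns, and its rows are all made up for illustration; in a real project you would connect to your own database instead.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database (hypothetical table and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wines (title TEXT, country TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO wines VALUES (?, ?, ?)",
    [
        ("Wine A", "Italy", 87),
        ("Wine B", "France", 92),
        ("Wine C", "Italy", 95),
    ],
)
conn.commit()

# Ask only for the rows and columns you actually need.
query = "SELECT title, points FROM wines WHERE country = 'Italy' ORDER BY points DESC"
italian_wines = pd.read_sql_query(query, conn)
conn.close()

print(italian_wines)
```

The query does the filtering and sorting inside the database, so only the data you asked for ends up in your notebook.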
If you can’t find a dataset that is to your liking, another way to obtain data is through web scraping. Simply put from the peeps over at scrapinghub.com, web scraping is “…the process of retrieving or ‘scraping’ data from a website. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.”
Web scraping sounds amazing, but it comes with a very big caveat: this technique can raise serious ethical issues. Whenever you want or need to use data that, unlike the datasets on Kaggle, is not generally known to be open for public use, it is recommended (and sometimes necessary) to ask for permission to use that information. If you don’t ask for permission, then you should at the very least cite that source so it is known where that information came from. If you falsely claim this information as your own, you are committing plagiarism, which, according to the source cited here, can in some cases be treated as a misdemeanor with punishments such as “…fines of anywhere between $100 and $50,000 — and up to one year in jail.” (1)
When done correctly, web scraping is a remarkably powerful tool that allows for the easy obtainment of data. But, please use caution and be respectful of other people’s hard work. Nobody likes a stealer.
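To make the idea a bit more concrete, here is a minimal sketch of the parsing half of web scraping, using only Python’s standard library. The HTML snippet is invented for illustration and stands in for a page you have permission to scrape; downloading the page is left out on purpose.

```python
from html.parser import HTMLParser

# Invented HTML standing in for a page you are allowed to scrape.
PAGE = """
<table>
  <tr><td>Wine A</td><td>87</td></tr>
  <tr><td>Wine B</td><td>92</td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # a new table row begins
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


parser = TableCellParser()
parser.feed(PAGE)
scraped_rows = parser.rows
print(scraped_rows)
```

Dedicated scraping libraries handle messy real-world pages far better than this; the point here is only that scraping boils down to pulling structured values out of a page’s markup.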
Probably the third most popular way of obtaining data is to collect it yourself or with your marketing team. There’s nothing quite like a good old-fashioned survey… or 100… or 10,000. In many industries, surveys and questionnaires are a primary way of obtaining information and feedback from customers. One team may be dedicated to creating and distributing the survey, another team can compile the results, and yet another team may be in charge of taking those compilations and converting them into a computer-friendly format, such as a .csv file. Those teams create what we, the data scientists, get the pleasure of analyzing. However, in certain circumstances all those steps may have to be done by you, and that’s okay! Stay organized, have a plan, and you will get all those beautiful bits of computer gold called data that you need.
Step 1b: Opening your Data
Once you’ve settled on the data you’re going to work with, it is now time to open it and get to work. For this particular post, I am going to demo how to open a .csv file. Below, you will see the code for opening a file. This dataset is from kaggle.com:
import pandas as pd
df = pd.read_csv("winemag-data-130k-v2.csv")
df.head()
The first step here is to import the package known as pandas. This package gives you the ability to create a dataframe out of your csv file. In the first line, note where it says as pd. This simply allows you to use pandas throughout your notebook by only having to type pd — not the entire word pandas.
In the second line, we have the step that actually converts the file into a workable dataframe, using pd.read_csv. Inside the parentheses, you place the name of your csv file in quotes, which passes it in as a string.
*IMPORTANT* Whenever you create a dataframe, you have to save it into a variable. This allows you to call the dataframe at any time during the project, which is especially necessary when you make edits to it. Whenever you run pd.read_csv again as shown above, you are only going to pull up the original dataframe from the file, without any of your edits.
After opening the file, you want to display your dataframe to make sure that it was loaded in correctly. That is what’s done in line 3 with df.head(). Attaching .head() to the end of df means that the first 5 data entries will be displayed. In contrast, .tail() will display the final 5 entries of data.
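If you want to try .head() and .tail() without downloading anything, here is a small sketch. The CSV text below is made up and is read from memory with io.StringIO, which simply stands in for a real file on disk.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file.
csv_text = """title,country,points
Wine A,Italy,87
Wine B,France,92
Wine C,Italy,95
Wine D,Spain,88
Wine E,US,90
Wine F,Chile,85
Wine G,Portugal,91
"""

df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail(2)   # pass a number to change how many rows you get

print(first_rows)
print(last_rows)
```

Both methods return a new, smaller dataframe and leave df itself untouched.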
Congrats! You’ve made it through step 1. Now, it’s time to get into the fun — sometimes not so fun — stuff.
Step 2: Scrubbing/Cleaning your Data — AKA Preprocessing
I’m gonna give you guys a couple insider secrets. You ready?
- Get used to cleaning your data
- Get really good at cleaning data
Wicked good stuff right? As simple as it may be, they are unequivocally true. And what is also true is that most dataframes aren’t as neat, organized, or filled out to the liking of a data scientist. James Irving, my fantastic professor at Flatiron School, told me and my cohort, “you will probably spend about 80% of your time cleaning and preprocessing your data.”
Yup. 80%. And believe me, it’s not always easy, nor is it always fun. But, it is the most important step in your workflow. Preprocessing your data makes your data workable and easier for your computer to understand. Remember, data science projects are just a form of communication between you and your machine. If you don’t give it good instructions, you’re not going to get good results.
There are a lot — and I do mean a lot — of different ways to preprocess your data. However, within that complexity, there are a handful of steps that must be considered for any data science project.
Note* The key word there was consider. Not every project will require the following steps — but you have to at least look at your data and make a decision on whether or not you will perform any/all of them in your work.
The major preprocessing steps are:
- Filling in null/missing values
- Drop/Add columns as you see fit
- Removing outliers
- Scaling/Normalizing your data
Filling in null/missing values is probably the most important of the four, especially when working with numerical data. Going back to the communication metaphor, having missing data is like having a talk with your friend and they’re randomly skipping words, making their sentence(s) incoherent. If you don’t get the full idea of what they’re saying, how are you supposed to know how to respond? Your computer is the same: feed it proper values and it will do its job in getting you proper results.
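Here is one possible way to fill in missing values with pandas, as a sketch on a tiny made-up dataframe. Using the median for numbers and an "Unknown" label for text are my own assumptions; the right fill strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe with one missing value per column.
df = pd.DataFrame(
    {
        "country": ["Italy", "France", None, "Spain"],
        "price": [15.0, np.nan, 30.0, 45.0],
    }
)

# First, see how many values are missing in each column.
missing_counts = df.isna().sum()
print(missing_counts)

# Fill numeric gaps with the column median (an assumed strategy).
df["price"] = df["price"].fillna(df["price"].median())

# Fill text gaps with a placeholder label (also an assumption).
df["country"] = df["country"].fillna("Unknown")

print(df)
```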
Dropping or adding columns is not always going to be a part of your workflow because this step is entirely at your discretion. It’s ultimately up to you how much data you feel is necessary for your project. But, you may find that some columns are of no use, so in that sense removing a column is totally okay. On the flipside, you may want to create new columns based on what you learned. And that’s okay too, as long as your data is pertinent to your task and you keep everything organized!
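A quick sketch of both moves is below. The columns are hypothetical, and points_per_dollar is just an invented example of a column you might derive from the ones you already have.

```python
import pandas as pd

# Made-up dataframe; the column names are for illustration only.
df = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C"],
        "taster_twitter": ["@a", "@b", "@c"],
        "price": [20.0, 40.0, 10.0],
        "points": [88, 92, 85],
    }
)

# Drop a column you have decided you don't need.
df = df.drop(columns=["taster_twitter"])

# Add a new column computed from existing ones.
df["points_per_dollar"] = df["points"] / df["price"]

print(df)
```

Notice that the result of .drop() is saved back into df. Without that assignment, the original dataframe would stay unchanged.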
*Personal Opinion* For me, data is data. And in general, the more data you have available, the better your model will perform, especially with machine learning. So don’t go about chopping as much data as you can so that everything runs faster. Instead, keep as much of your data as you can, and find ways to make your models run more efficiently.
Removing outliers helps prevent a skew or bias in your data. Under many circumstances, you want your data to follow a normal distribution (see image below).
Removing the outliers from your data, using a tool such as zscore from scipy.stats to flag values that sit far from the mean, brings your data closer to a normal distribution, which helps your models produce better results.
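Here is a minimal sketch of that idea on made-up prices. The cutoff of 3 standard deviations is a common rule of thumb that I am assuming here, not a fixed rule, so adjust it to your data.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up prices: twenty ordinary values plus one extreme one.
df = pd.DataFrame({"price": [10, 11, 12, 13] * 5 + [500]})

# z-score = how many standard deviations a value is from the mean.
z = zscore(df["price"].to_numpy())

# Keep only rows within 3 standard deviations (assumed threshold).
cleaned = df[np.abs(z) < 3]

print(len(df), "rows before,", len(cleaned), "rows after")
```

One thing to keep in mind: on very small datasets a single extreme value inflates the standard deviation so much that its own z-score may never reach the cutoff, so always check how many rows were actually removed.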
Scaling/Normalizing your data means taking your data and making sure it’s all on the same scale. For example, let’s say that you have a column in your dataframe where the data points are in U.S. dollars. There are some data points where dollars are in the hundreds, while there are other data points that are in the billions of dollars. One billion (1,000,000,000) is tremendously larger than one hundred (100), which means they are not on the same scale. Scaling or normalization transforms your data so that it is on the same scale, which makes it easier for your model to work with. However, there is an important catch here. When you scale or normalize your data, it can become very hard to interpret what the scaled data actually means. This is where creating visuals with either matplotlib or seaborn becomes very important.
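To show what that transformation looks like, here is a sketch of two common approaches written with plain pandas arithmetic. The revenue column and its values are made up, and picking between min-max scaling and standardization is a judgment call that depends on your model.

```python
import pandas as pd

# Made-up dollar amounts ranging from hundreds to a billion.
df = pd.DataFrame({"revenue": [100.0, 5_000.0, 2_000_000.0, 1_000_000_000.0]})
col = df["revenue"]

# Min-max scaling: squeezes every value into the range 0 to 1.
df["revenue_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization: shifts to a mean of 0 and a standard deviation of 1.
df["revenue_standardized"] = (col - col.mean()) / col.std()

print(df)
```

Looking at the printed output, you can see the catch mentioned above: the scaled numbers no longer look like dollars, so it helps to keep the original column around for interpretation.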
So, you’ve followed my excellent advice on preprocessing and are ready for the next step. YES!!
Step 3: Exploring your Data
All aboard, explorers! This is the step where you begin to get a better idea of what kind of data you’re actually working with. This step can be as long or as short as you want it to be. And truthfully, this step and scrubbing/cleaning/preprocessing tend to blend in with each other pretty often. In fact, they should be intertwined. Whenever you make a preprocessing change, you want to check that the change you wanted actually happened and decide what the next step should be from there.
The most common name for exploring your data is EDA, or Exploratory Data Analysis. Generally, this involves creating charts, graphs, and other visual aids that can give you a visual idea of what’s happening.
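As one small example of an EDA visual, here is a sketch of a histogram made with matplotlib. The points values are made up, and the "Agg" backend line is only there so the snippet also runs outside a notebook; inside Jupyter you can drop it and the chart will appear under the cell.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd

# Made-up review scores.
df = pd.DataFrame({"points": [85, 87, 88, 88, 90, 90, 90, 92, 93, 95]})

fig, ax = plt.subplots()
ax.hist(df["points"], bins=5)  # 5 bars showing how the scores are spread
ax.set_title("Distribution of wine review points")
ax.set_xlabel("points")
ax.set_ylabel("number of wines")
```

Titles and axis labels take seconds to add and make a chart far easier to read later on.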
*Big Time Hint* Visuals are your best friend in presentations. Although they are sometimes challenging and time-consuming to make, they are invaluable. When you present your results, more often than not you will be speaking to an audience that has no clue what you actually do — commonly referred to as a “non-technical audience”. Creating high quality visuals can bridge the gap between your data science skills and their business insights to create a mutual understanding.
However, creating visuals is not the only way to perform EDA. If you’re a numbers person like me, using commands such as .info(), .describe(), and .dtypes are great tools to better understand your data.
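Here is a short sketch of those three commands on a made-up dataframe, so you can see what each one gives you.

```python
import pandas as pd

# Made-up dataframe with one text column and two numeric columns.
df = pd.DataFrame(
    {
        "country": ["Italy", "France", "Spain"],
        "price": [15.0, 30.0, 45.0],
        "points": [87, 92, 95],
    }
)

# .info() prints column names, non-null counts, and data types.
df.info()

# .describe() returns summary statistics for the numeric columns.
summary = df.describe()
print(summary)

# .dtypes is an attribute (no parentheses) listing each column's type.
column_types = df.dtypes
print(column_types)
```

By default .describe() skips text columns such as country, which is why only price and points show up in the summary.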
So, you have obtained your data, successfully opened it, preprocessed everything, and performed your EDA. Congrats! Because now is where we put all that hard work to use: modeling.
Step 4: Modeling your Data
Okay, not that kind of modeling, but computer modeling! This is where you put your data to the test. There are many different ways to approach this step, but in general, there are 4 easy steps to follow for this part of your workflow:
- Create a ‘vanilla’ model
- Make changes to model structure once you get results
- Re-run model
- Repeat steps 2 & 3 until your brain explodes
… okay maybe not until your brain explodes. But you get the idea. Similar to your EDA step, this part can be as long or as short as you want it to be. You want to start with a vanilla model, which is basically just the default or blandest configuration of whatever kind of model you’re using. Remember — your vanilla model should not be your only model because it is not a good idea to only run one model.
There are so many ways to tweak the way your model performs (known as hyperparameter tuning), so just going off of the default options is lazy and oftentimes produces sub-optimal results. Whether it’s running different types of machine learning models (e.g. decision trees vs. random forests), adding or removing layers from a neural network, or something else, stay engaged and always strive to get the best results possible. Time and available computer resources are very often factors that can limit the modeling process, so stay focused and use your time and resources wisely!
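The vanilla-then-tune loop described above can be sketched with scikit-learn. Everything specific here is my own assumption for illustration: the built-in iris dataset stands in for your data, and the parameter grid is deliberately tiny so it runs quickly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Built-in example data standing in for your own preprocessed dataframe.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Step 1: a 'vanilla' model with default settings.
vanilla = DecisionTreeClassifier(random_state=42)
vanilla.fit(X_train, y_train)
vanilla_score = vanilla.score(X_test, y_test)

# Steps 2 and 3: try a different model type and several hyperparameter
# combinations, re-running the model for each one.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [2, 4, None]},
    cv=3,
)
search.fit(X_train, y_train)
tuned_score = search.score(X_test, y_test)

print("vanilla accuracy:", vanilla_score)
print("tuned accuracy:", tuned_score)
print("best settings:", search.best_params_)
```

The tuned model is not guaranteed to beat the vanilla one, and that is part of the lesson: you only find out by running both and comparing the results.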
Step 5: Interpret Results
Before I start — yes, I know that the word “interpret” does not start with the letter “n”. Yes, it is a phonetics thing. No, O.S.E.M.I does not sound as cool as O.S.E.M.N!
Anyways, onto more important things. You’ve made it to the final step! Now is the time to provide meaning, insight, and direction into your hard work. I mentioned earlier that the preprocessing step is the most important step in your workflow. From a data science perspective, that is true. But from the perspective of increasing the success of the company you work for, this is the only step your non-technical audience is going to care about. Take this step seriously, be professional, be prepared, and be confident.
When presenting your results to a non-technical audience, do your best to not use overly technical terms. Use professional, everyday language that everybody in the room can understand. If you are asked a question you don’t know the answer to, do not, and I repeat, do not lie or fib to make yourself sound better. Answer honestly to whoever is asking the question, and tell them that you don’t know, but you will do everything you can to get that person an accurate answer as quickly as possible.
This step in the process is where you begin to make a name and add value for yourself. At any occupation you have in any industry, you want to do everything you can to make yourself invaluable. You want to separate yourself from the pack and grab the attention of your superiors in a positive way. Create a great presentation, provide high-quality visuals, and present an energy showing that you are in charge and know what you’re talking about.
The O.S.E.M.N process has enabled me to better focus and game-plan what steps I need to take in order to create a high-quality data science project. To whoever has read this, I hope you have found this helpful and wish you the best of luck on your data science journey!