Going over three of the most common ways to visualize your data in Python.
Data visualization is a crucial part of being a complete and successful data scientist. Manipulating and analyzing data can only go so far if you can’t properly display what your solution is. While super cool or fancy visualizations can be a mind-blowing visual experience, ultimately your objective as a data scientist is to display your findings in a way that is clear and to the point. So sometimes sticking to and mastering the basics is the best way to go. In this blog post, I’m going to go over three of the most common ways to visualize data in Python: scatter plots, bar graphs, and histograms.
The most common library for plotting and data visualization is Matplotlib. It is very user-friendly and has a wide array of functions to create a variety of data visualizations. In order to use this library, you’ll want to make sure you import it using the following code:
import matplotlib.pyplot as plt
%matplotlib inline
The magic command %matplotlib inline shows your plots inside your Jupyter Notebook, if you are using one. Magic commands only work in IPython and Jupyter, so leave that line out of a plain Python script. For external plots, you would use %matplotlib qt. An external plot is typically used if you are planning to create an interactive visualization. In most other circumstances, you’ll want to use inline.
Scatter Plots — plt.scatter()
Scatter plots, which can also be referred to as correlation plots, are a data visualization that uses individual data points to show the relationship, or correlation, between two variables.
When creating your scatter plot, you will need to pass two arguments to plt.scatter(): your x and y variables. You can also add other parameters, such as label="", to provide a little more context on what each point represents. Some other useful functions to add context to your plot include:
- plt.title(), which adds a title to your plot
- plt.legend(), which will provide a key that describes what is being plotted. *Pro tip* If you use the label parameter mentioned above while also using plt.legend(), it will take what you put in the label parameter and include it in your key.
- plt.xlabel(), which labels the x-axis
- plt.ylabel(), which labels the y-axis
- plt.figure(figsize=(width, height)), which sets the size of your plot, in inches. Call it before your plotting function, since it starts a new figure
Then, to actually display your plot, you will want to use plt.show(). Some example code is below:
# x and y are two equal-length sequences of numbers
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label="This will appear in legend")
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.title('Scatter Plot in Matplotlib')
plt.legend()
plt.show()
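To make that snippet something you can run as-is, here is a minimal sketch with made-up data. The hours and scores lists, the seed, and the axis labels are all my own assumptions for illustration, not real measurements. It shows how the label passed to plt.scatter() ends up in the key drawn by plt.legend().

```python
import random

import matplotlib.pyplot as plt

# Hypothetical data: hours studied vs. exam score for 50 students.
random.seed(0)  # fixed seed so the plot looks the same on every run
hours = [random.uniform(0, 10) for _ in range(50)]
scores = [50 + 4 * h + random.gauss(0, 5) for h in hours]

fig = plt.figure(figsize=(10, 6))  # 10 inches wide, 6 inches tall
points = plt.scatter(hours, scores, label="Students")
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")
plt.title("Exam Score vs. Hours Studied")
legend = plt.legend()  # the key picks up the label given to plt.scatter()
# Call plt.show() here to display the figure when running as a script.
```

Because the made-up scores rise with the hours, the points should drift upward from left to right, which is the kind of relationship a scatter plot is meant to reveal.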
Bar Graphs — plt.bar()/plt.barh()
Bar charts are another popular data visualization technique. They are excellent to use when working with categorical data because each category is represented by a rectangular bar, and the height or length of the bar is determined by that category’s frequency. In a chart of popular programming languages, for example, the x-axis has each programming language (category), and the y-axis has the values (frequency) for each language. Bar graphs can also be plotted horizontally using plt.barh(), which puts the categories on the y-axis and the values on the x-axis.
The thing that makes Matplotlib great is that no matter what kind of plot you want to create, you can use the same functions outlined earlier to add context to your graph. With that in mind, here is an example code block for plotting a bar graph:
# x holds the category names, y holds the value for each category
plt.figure(figsize=(10, 6))
plt.bar(x, y, label='Sample Data')
plt.xlabel('Programming Languages')
plt.ylabel('Language Frequency')
plt.title('Popular Programming Languages')
plt.legend()
plt.show()
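For the horizontal version, here is a small sketch using plt.barh(). The language names and counts are invented purely for illustration. The main thing to notice is that the axis labels swap, because the categories now run up the y-axis.

```python
import matplotlib.pyplot as plt

# Hypothetical survey counts, made up for illustration only.
languages = ["Python", "JavaScript", "Java", "C++"]
counts = [42, 35, 28, 15]

fig = plt.figure(figsize=(10, 6))
# barh() takes the categories first and the bar lengths second,
# so the categories end up on the y-axis.
bars = plt.barh(languages, counts, label="Sample Data")
plt.xlabel("Language Frequency")     # values now run along the x-axis
plt.ylabel("Programming Languages")  # categories now sit on the y-axis
plt.title("Popular Programming Languages")
plt.legend()
# Call plt.show() here to display the figure when running as a script.
```

Horizontal bars tend to be handy when the category names are long, since the labels read left to right without overlapping.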
Histograms — plt.hist()
It is common to confuse histograms with bar graphs. However, the fundamental difference is that a histogram plots the frequency distribution of your data. In other words, your data is split into groups, instead of every value getting its own bar. For example, if each group covers a range of ten values, the height of its bar is the total number of occurrences that fall within that range. Histograms allow you to observe certain characteristics of your data that can’t be seen in a bar graph, such as outliers or some form of skewness. To plot a histogram, use a block of code like the one below:
plt.hist(x, bins=10, edgecolor='black')
plt.xlabel('Groups')
plt.ylabel('Frequency of Values')
plt.title('A Histogram in Matplotlib')
plt.show()
In line one of the code, there are two very helpful arguments to help manipulate and clarify the data. The first is bins=, which sets how many groups you want displayed in your plot. In the code above, bins was set to 10, meaning that the data will be split into 10 groups. But what if you want fewer groups? Just change bins to a smaller number, such as five. When doing this, the range of values will remain the same, but each bar will become wider because each group now covers a larger share of that range. The opposite is true if you increase your number of bins. If you set bins to 50, each bar will become thinner because there are more groups to split the data into. Adding more bins can also make it easier to see whether or not your data is normally distributed.
The second helpful argument is edgecolor="". This argument adds a color to the border of each bin, making it easier to see where each one begins and ends. With an edge color set, it is very easy to distinguish each separate group in the histogram, whereas without one, neighboring bars blend together and the boundaries are much less precise.
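To see the effect of bins for yourself, here is a rough sketch that plots the same made-up data twice, once with 5 bins and once with 50. The values list and the seed are assumptions for illustration. It relies on the fact that plt.hist() returns the count in each bin and the bin edges (plus the drawn bars, which are ignored here), so you can check the bin widths in numbers as well as by eye.

```python
import random

import matplotlib.pyplot as plt

# Hypothetical data: 1,000 values drawn from a normal distribution.
random.seed(1)
values = [random.gauss(50, 10) for _ in range(1000)]

# Same data, few bins: wide bars.
plt.figure(figsize=(10, 6))
counts_few, edges_few, _ = plt.hist(values, bins=5, edgecolor="black")
plt.title("Same Data, 5 Bins")

# Same data, many bins: thin bars.
plt.figure(figsize=(10, 6))
counts_many, edges_many, _ = plt.hist(values, bins=50, edgecolor="black")
plt.title("Same Data, 50 Bins")

# Every bin in one histogram has the same width, so the first is enough.
width_few = edges_few[1] - edges_few[0]
width_many = edges_many[1] - edges_many[0]
print(f"5 bins: each bin spans about {width_few:.2f}")
print(f"50 bins: each bin spans about {width_many:.2f}")
# Call plt.show() here to display both figures when running as a script.
```

Both histograms cover the same range, from the smallest value to the largest, so each of the 5 wide bins spans ten times as much as one of the 50 thin ones. Try removing edgecolor="black" from either call to see how much harder the bars are to tell apart.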
And that concludes the overview of three of the most common data visualization techniques in Python. Thank you for reading!