In today’s data visualization tutorial, we’ll learn how to use Python during exploratory data analysis to subset two or more columns from a pandas DataFrame and draw a simple scatter chart that helps detect outlier observations.
Create example data
We’ll start by importing the pandas data analysis library, which we’ll use to render the scatter chart. pandas draws its charts with Matplotlib, so that library needs to be installed as well.
import pandas as pd
We’ll then create some dummy data that you can use to follow along with this tutorial.
language = ['Java', 'Python', 'Javascript', 'R', 'Python', 'Python']
interviews = [12, 15, 15, 14, 18, 8]
salary = [157.0, 107.0, 172.0, 102.0, 175.0, 143.0]
# collect the lists in a dict; a separate name avoids overwriting the interviews list
data = dict(language=language, salary=salary, interviews=interviews)
my_df = pd.DataFrame(data=data)
my_df.head()
This will render the first five rows of our DataFrame:
| | language | salary | interviews |
|---|---|---|---|
| 0 | Java | 157.0 | 12 |
| 1 | Python | 107.0 | 15 |
| 2 | Javascript | 172.0 | 15 |
| 3 | R | 102.0 | 14 |
| 4 | Python | 175.0 | 18 |
Plotting a pandas scatter from two columns
First, we’ll subset a couple of DataFrame columns:
subset = my_df[['salary', 'interviews']]
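As a quick check on the subsetting step, here is a minimal sketch that rebuilds the same example data and compares the two bracket forms. The name salary_only is only an illustrative variable and is not used elsewhere in this tutorial.

```python
import pandas as pd

# Same example data as in the tutorial
my_df = pd.DataFrame({
    "language": ["Java", "Python", "Javascript", "R", "Python", "Python"],
    "salary": [157.0, 107.0, 172.0, 102.0, 175.0, 143.0],
    "interviews": [12, 15, 15, 14, 18, 8],
})

# A list of column names returns a DataFrame
subset = my_df[["salary", "interviews"]]

# A single column name returns a Series (illustrative variable name)
salary_only = my_df["salary"]

print(type(subset).__name__)       # DataFrame
print(type(salary_only).__name__)  # Series
print(subset.shape)                # (6, 2)
```

The double brackets matter here because the scatter chart needs both the x and y columns to be available in the object we plot from.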
Calling the plot method on my_df and also passing our subset into the data parameter will fail, because the DataFrame the method is called on already serves as the data:
scatter = my_df.plot(data=subset, x='interviews', y='salary', kind='scatter')
This raises the following TypeError:
TypeError: plot() got multiple values for argument 'data'
Instead, we’ll call the plot method directly on our subset DataFrame:
scatter = subset.plot(x='interviews', y='salary', kind='scatter', c='green')
scatter.set_title('Interviews vs salary')
# export the pandas chart to an image
scatter.figure.savefig('interviews_salary.png')
This will render the following chart:
Note that we can tweak the kind parameter to plot the most commonly used chart types: line charts, bar charts, box plots, histograms and so on.
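To illustrate the kind parameter, here is a small sketch that draws a box plot and a histogram from the same example data. The chart titles and the bins value are arbitrary choices for illustration, and the Agg backend line is only there so the script runs without a display; it can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import pandas as pd

# Same example data as in the tutorial
my_df = pd.DataFrame({
    "language": ["Java", "Python", "Javascript", "R", "Python", "Python"],
    "salary": [157.0, 107.0, 172.0, 102.0, 175.0, 143.0],
    "interviews": [12, 15, 15, 14, 18, 8],
})

# Box plot of the salary column; any points beyond the whiskers
# would be drawn as individual markers
box = my_df[["salary"]].plot(kind="box")
box.set_title("Salary distribution")

# Histogram of the interviews column (bins=5 is an arbitrary choice)
hist = my_df[["interviews"]].plot(kind="hist", bins=5)
hist.set_title("Interviews distribution")
```

Each call returns a Matplotlib Axes object, so the set_title and figure.savefig calls shown earlier work the same way for these charts.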
Scatter plot with multiple marker colors in pandas
A follow-up question we received is how to draw a pandas chart with multiple marker colors.
To exemplify this, we’ll first insert a new numeric column into our DataFrame.
years_experience = [1, 1.5, 3, 4, 3, 2]
my_df.insert(3, column='experience', value = years_experience)
We’ll then render the chart. We’ll use the experience column and a color map to define the marker colors.
scatter = my_df.plot(x='interviews', y='salary', kind='scatter', c='experience', colormap='magma')
scatter.set_title('Interviews vs salary by experience')
scatter.figure.savefig('interviews_salary_experience.png')
Here’s the result:
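The color map approach relies on a numeric column. For a categorical column such as language, one possible approach is to plot each group onto the same axes with its own color. The sketch below assumes a hand-picked color per language (the colors dictionary is purely illustrative), and the Agg backend line is only needed when running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import pandas as pd

# Same example data as in the tutorial
my_df = pd.DataFrame({
    "language": ["Java", "Python", "Javascript", "R", "Python", "Python"],
    "salary": [157.0, 107.0, 172.0, 102.0, 175.0, 143.0],
    "interviews": [12, 15, 15, 14, 18, 8],
})

# Illustrative color choice per language (an assumption, not part of the tutorial)
colors = {
    "Java": "tab:blue",
    "Javascript": "tab:orange",
    "Python": "tab:green",
    "R": "tab:red",
}

# Draw one scatter layer per language, reusing the same axes
ax = None
for name, group in my_df.groupby("language"):
    ax = group.plot(x="interviews", y="salary", kind="scatter",
                    color=colors[name], label=name, ax=ax)

ax.set_title("Interviews vs salary by language")
```

Passing label to each call adds one legend entry per language, and passing ax keeps all the groups on a single chart instead of creating a new figure for each group.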