As part of the data wrangling process, we often clean our dataset and remove outlier observations before proceeding with further analysis and visualization.
In today’s tutorial we’ll learn how to use the Pandas library’s DataFrame.dropna() method to get rid of rows containing missing values.
Example Dataset
We’ll get started by importing the Pandas and NumPy libraries and creating a very simple DataFrame from a dictionary.
import pandas as pd
import numpy as np
employees = {'employee': ['John', 'Don', 'Joe'],
             'salary': [110, 120, 190],
             'employer': [np.nan, 'ABC Corp', np.nan]}
my_data = pd.DataFrame(data=employees)
my_data.head()
Here’s our data:
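  employee  salary  employer
0     John     110       NaN
1      Don     120  ABC Corp
2      Joe     190       NaN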
Counting the number of missing values
You might want to first identify and count the missing values in your DataFrame. Here’s the code and result:
my_data.isna().sum()
The result is a Pandas Series containing the number of missing values in each column.
employee    0
salary      0
employer    2
dtype: int64
Drop rows with missing values from our Python DataFrame
As mentioned before, we’ll use the DataFrame.dropna() method.
We can create a new DataFrame containing only the rows with non-empty values:
my_data1 = my_data.dropna(axis=0)
my_data1.head()
Here’s the result:
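  employee  salary  employer
1      Don     120  ABC Corp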
We can also use the inplace=True parameter to persist the changes in our original DataFrame:
my_data.dropna(axis=0, inplace=True)
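Note that when inplace=True is passed, dropna() modifies my_data directly and returns None, so there is no need to assign the result to a new variable.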
Delete rows with NaN based on a condition
What if we would like to drop rows with NaN values, but only if the empty values are located in specific columns?
Luckily, we can use the subset parameter to pass the relevant columns to the dropna() method. The following code will search for empty values in two specific columns only:
my_data.dropna(axis=0, subset=['employee', 'salary'])
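Assuming we start from the original three-row DataFrame, neither the employee nor the salary column contains any missing values, so no rows are removed here. If we wanted to drop only the rows where the employer value is missing, we could pass that column instead:
my_data.dropna(axis=0, subset=['employer'])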
Remove columns with NaN
If we would like to delete columns containing NaN values, then we’ll pass the axis=1 parameter to dropna():
my_data2 = my_data.dropna(axis=1)
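Starting again from the original DataFrame, the employer column is the only one containing NaN values, so my_data2 keeps just the employee and salary columns:
  employee  salary
0     John     110
1      Don     120
2      Joe     190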
Next Learning
We have several related tutorials that you might want to look into, covering subsetting and slicing DataFrames according to certain conditions.