In today’s quick tutorial we’ll learn how to sum columns in Pandas DataFrames which contains missing or non available data, which might be marked as NAN by Pandas.
Create a DataFrame
We’ll start by creating a simple DataFrame for you to follow along with this example.
# import pandas and numpy libraries
import pandas as pd, numpy as np
rev_dict = {'year':['2012', '2014', '2013', '2022'], 'h1':[np.nan, 250, 175, 190], 'h2': [235, 245, 240, np.nan]}
revenue = pd.DataFrame (data = rev_dict)
print (revenue)
year | h1 | h2 | |
---|---|---|---|
0 | 2012 | NaN | 235.0 |
1 | 2014 | 250.0 | 245.0 |
2 | 2013 | 175.0 | 240.0 |
3 | 2022 | 190.0 | NaN |
Sum columns with nan cell values in Pandas
We can now go ahead and use the fillna() DataFrame method in order to handle cells with missing values and then sum the columns.
We will replace the NAN missing values with zeros and sum the columns. This way we can actually skip / ignore the missing values.
revenue['total'] = revenue['h1'].fillna(0) + revenue['h2'].fillna(0)
print (revenue)
Here’s our DataFrame:
year | h1 | h2 | total | |
---|---|---|---|---|
0 | 2012 | NaN | 235.0 | 235.0 |
1 | 2014 | 250.0 | 245.0 | 495.0 |
2 | 2013 | 175.0 | 240.0 | 415.0 |
3 | 2022 | 190.0 | NaN | 190.0 |
We can alternatively fill all the missing values in the entire DataFrame and then simply sum the columns:
revenue.fillna(0, axis=1, inplace=True)
revenue['total'] = revenue['h1'] + revenue['h2']
Sum a list of Pandas columns
A follow up question i got is how to use a Python list to define columns in a DataFrame that should be aggregated:
revenue['total'] = revenue[['h1', 'h2']].sum(axis=1)