To find the average of one or several pandas columns, use the mean() method, which is available on both Series and DataFrame objects.
Assuming your DataFrame is named mydf and its columns are col_1 and col_2, here are two ways to calculate the average of one or multiple numeric columns:
# one column
mydf['col_1'].mean()
# multiple
mydf[['col_1', 'col_2']].mean()
Compute average of selected pandas columns – Example
Let’s get started by prepping our test DataFrame. As usual, we’ll use the auto-generated candidates data.
Here we go:
import pandas as pd
data = pd.read_csv('survey.csv')
print(data)
Here’s the result.
| | month | salary | num_candidates |
---|---|---|---|
1 | April | 118.0 | 83.0 |
2 | February | 127.0 | 80.0 |
3 | May | 122.0 | 75.0 |
4 | July | 146.0 | 82.0 |
5 | September | 122.0 | 79.0 |
6 | February | 130.0 | 90.0 |
7 | July | 118.0 | 73.0 |
8 | November | 116.0 | 77.0 |
9 | February | 114.0 | 88.0 |
10 | October | 147.0 | 78.0 |
Note that if you want to follow along with this example, you can copy the table above and use the pd.read_clipboard() function to populate your own DataFrame.
import pandas as pd
your_df = pd.read_clipboard()
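If neither survey.csv nor the clipboard is handy, the same table can also be typed directly into code. The sketch below simply retypes the ten rows printed above; the 1-based index is an assumption that mirrors the row labels in the table, not something the original file is known to contain.

```python
import pandas as pd

# Rows retyped from the table printed above
data = pd.DataFrame(
    {
        "month": ["April", "February", "May", "July", "September",
                  "February", "July", "November", "February", "October"],
        "salary": [118.0, 127.0, 122.0, 146.0, 122.0,
                   130.0, 118.0, 116.0, 114.0, 147.0],
        "num_candidates": [83.0, 80.0, 75.0, 82.0, 79.0,
                           90.0, 73.0, 77.0, 88.0, 78.0],
    },
    index=range(1, 11),  # assumption: mirror the 1..10 row labels of the table
)

print(data.shape)  # (10, 3)
```

Building the frame in code keeps the example reproducible without any external file.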
Find the mean / average of one column
To find the average of one column (Series), we simply type:
data['salary'].mean()
The result will be 126.0.
Calculate mean of multiple columns
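In our case, we can simply invoke the mean() method on the DataFrame itself. Because the month column holds text, we pass numeric_only=True so that only the numeric columns are averaged; recent pandas versions (2.0 and later) raise a TypeError otherwise.
data.mean(numeric_only=True)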
The result will be:
salary 126.0
num_candidates 80.5
dtype: float64
Chances are that your DataFrame will be wider and contain many more columns. In that case, we first subset the DataFrame to the relevant columns and then calculate the mean.
cols = ['salary', 'num_candidates']
data[cols].mean()
The result will be the same as above, since we selected the same two columns.
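When a DataFrame mixes text and numeric columns, there are a few ways to restrict the mean to the numeric ones. The following is a minimal sketch on a three-row toy frame (the name df and its values are illustrative only) showing that selecting by name, selecting by dtype, and passing numeric_only=True all give the same Series.

```python
import pandas as pd

# Toy frame with one text column and two numeric columns (illustrative values)
df = pd.DataFrame({
    "month": ["April", "February", "May"],
    "salary": [118.0, 127.0, 122.0],
    "num_candidates": [83.0, 80.0, 75.0],
})

by_name = df[["salary", "num_candidates"]].mean()      # explicit column list
by_dtype = df.select_dtypes(include="number").mean()   # every numeric column
by_flag = df.mean(numeric_only=True)                   # let mean() skip text columns

print(by_flag)
```

Selecting by name is the most explicit option; the other two are convenient when the list of numeric columns is long or not known in advance.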
Find pandas columns average with describe()
We typically use the describe() DataFrame method to get a quick look at the descriptive statistics of our dataset, such as the count, minimum and maximum values, standard deviation, etc.
We can use the following snippet to get a Series representing the mean of each numeric column in our DataFrame:
stats = data.describe()
stats.loc['mean']
This will return the following Series:
salary 126.0
num_candidates 80.5
Name: mean, dtype: float64
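As a quick sanity check, the mean row returned by describe() should match what mean() returns directly. The sketch below uses the first four rows of the table above as a stand-in; the variable names are illustrative.

```python
import pandas as pd

# First four rows of the example table, numeric columns only
df = pd.DataFrame({
    "salary": [118.0, 127.0, 122.0, 146.0],
    "num_candidates": [83.0, 80.0, 75.0, 82.0],
})

from_describe = df.describe().loc["mean"]  # Series named 'mean'
direct = df.mean()                         # unnamed Series with the same index

# Largest absolute difference between the two approaches
print((from_describe - direct).abs().max())
```

describe() computes several statistics at once, so calling mean() directly is the lighter option when only the average is needed.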
Creating a DataFrame or list from your columns mean values
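You can easily turn your mean values into a new DataFrame or a Python list:
# create a one-column DataFrame of mean values
data_mean = pd.DataFrame(data.mean(numeric_only=True), columns=['mean_values'])
# create a list of mean values (note the parentheses: to_list is a method)
mean_lst = data.mean(numeric_only=True).to_list()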
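The Series returned by mean() can be converted in several ways. Here is a small sketch, using toy values chosen so the averages come out round, that shows the to_frame(), to_list() and to_dict() conversions; the variable names are illustrative.

```python
import pandas as pd

# Toy values (not the survey data), picked so the means are whole numbers
df = pd.DataFrame({
    "salary": [118.0, 127.0, 121.0],
    "num_candidates": [83.0, 80.0, 77.0],
})

means = df.mean()

mean_df = means.to_frame(name="mean_values")  # one-column DataFrame
mean_list = means.to_list()                   # plain Python list
mean_dict = means.to_dict()                   # column name -> mean

print(mean_list)  # [122.0, 80.0]
print(mean_dict)  # {'salary': 122.0, 'num_candidates': 80.0}
```

The dictionary form is handy when the column names need to travel with the values.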
Plot column average in Pandas
As the pandas library contains basic methods for plotting, making a simple chart to visualize multiple column averages is a breeze. The call below draws one bar per numeric column:
data.mean(numeric_only=True).plot(kind='bar');
Calculate the mean of your columns with df.describe()
We can use the DataFrame describe() method to quickly look at the key statistics of our DataFrame's numeric columns, including the mean. Here, round() rounds every statistic to whole numbers:
data.describe().round()
Calculate the median of a DataFrame
For completeness, here’s a simple snippet to calculate the median of multiple DataFrame columns:
data.median(numeric_only=True)
This will render the following result, which represents the median observation of each numeric column of the dataset.
salary 122.0
num_candidates 79.5
dtype: float64
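To see the mean and the median side by side, both statistics can be requested in one call with agg(). The sketch below retypes the two numeric columns from the table above so that it runs on its own.

```python
import pandas as pd

# Numeric columns retyped from the example table
data = pd.DataFrame({
    "salary": [118.0, 127.0, 122.0, 146.0, 122.0,
               130.0, 118.0, 116.0, 114.0, 147.0],
    "num_candidates": [83.0, 80.0, 75.0, 82.0, 79.0,
                       90.0, 73.0, 77.0, 88.0, 78.0],
})

# One row per statistic, one column per DataFrame column
summary = data.agg(["mean", "median"])
print(summary)
```

For this data, the mean row holds 126.0 and 80.5, and the median row holds 122.0 and 79.5, matching the results shown earlier.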
How to calculate variance and covariance in Pandas DataFrame?
As a follow up topic, we will also learn to calculate the statistical variance of one or multiple columns in a Python Pandas DataFrame. We will focus on several use cases:
- Variance of a Series or Pandas DataFrame column
- Variance of all columns in a Pandas DataFrame
- Variance of a Pandas Groupby object
- Pandas covariance
Create a DataFrame
As we typically do, we'll start by importing the pandas library into our favorite data analysis environment and then go ahead and create some example data. Feel free to use the DataFrame below to follow along with this example.
import pandas as pd
#Define Dataframe columns
language = ['Go', 'Kotlin', 'Swift', 'Java']
first_interview = (76, 78, 84, 83)
second_interview = (51, 59, 58, 58)
third_interview = (15, 12, 19, 24)
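# gather data in a dictionary
# note: interview_3 reuses second_interview, so its values (and the results below) match interview_2; third_interview is left unused
hr = dict(interview_1=first_interview, interview_2=second_interview, interview_3=second_interview)
# Construct the DataFrame from a dictionary
interviews = pd.DataFrame(hr, index=language)
print(interviews.head())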
Here’s an output of our test data:
| | language | interview_1 | interview_2 | interview_3 |
---|---|---|---|---|
0 | Go | 76 | 51 | 51 |
1 | Kotlin | 78 | 59 | 59 |
2 | Swift | 84 | 58 | 58 |
3 | Java | 83 | 58 | 58 |
Variance of a Pandas Column / Series
To calculate the variance of a pandas column, we simply select the column and use the var() Series method.
interviews['interview_1'].var().round(2)
Note that we used the round() method to limit the result to two decimal places.
Alternatively, we can define a new Series:
my_s = interviews['interview_1']
my_s.var().round(2)
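It is worth knowing which variance var() actually computes. By default, pandas returns the sample variance (ddof=1, dividing by n - 1); passing ddof=0 gives the population variance instead. The sketch below retypes the interview_1 scores and also recomputes the sample variance by hand; the variable names are illustrative.

```python
import pandas as pd

# interview_1 scores from the example DataFrame
s = pd.Series([76, 78, 84, 83], name="interview_1")

sample_var = s.var()            # default ddof=1: divide by n - 1
population_var = s.var(ddof=0)  # divide by n

# Manual sample variance: sum of squared deviations over (n - 1)
manual = ((s - s.mean()) ** 2).sum() / (len(s) - 1)

print(round(sample_var, 2))      # 14.92
print(round(population_var, 2))  # 11.19
```

The default matters when comparing against NumPy, whose var() function defaults to ddof=0.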
Variance of all DataFrame columns
If we want to calculate the variance of all columns, we can use the DataFrame var() method, as shown below:
interviews.var().round(2)
This will render the following result:
interview_1 14.92
interview_2 13.67
interview_3 13.67
dtype: float64
You might also want to use the select_dtypes() DataFrame method, to subset the columns by data type:
interviews.select_dtypes(include='int64').var().round()
Calculate variance of some columns
You can use the Pandas ‘brackets’ notation to subset several columns and then apply the calculation.
my_df = interviews[['interview_2', 'interview_3']]
my_df.var().round(2)
This will result in:
interview_2 13.67
interview_3 13.67
dtype: float64
Pandas groupby variance
We'll first add a new column to our DataFrame, then use it to group the data and calculate the variance of each group.
interviews['area'] = ['Full_Stack','Full_Stack','Server','Server']
interviews.groupby('area').agg('var')
This will result in the following:
area | interview_1 | interview_2 | interview_3 |
---|---|---|---|
Full_Stack | 2.0 | 32.0 | 32.0 |
Server | 0.5 | 0.0 | 0.0 |
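The grouped calculation can be reproduced end to end with a self-contained sketch. It rebuilds only the first two interview columns of the example data, and additionally shows how to request several statistics for a single column; the names by_area and one_col are illustrative.

```python
import pandas as pd

# First two interview columns of the example data, indexed by language
interviews = pd.DataFrame(
    {"interview_1": [76, 78, 84, 83], "interview_2": [51, 59, 58, 58]},
    index=["Go", "Kotlin", "Swift", "Java"],
)
interviews["area"] = ["Full_Stack", "Full_Stack", "Server", "Server"]

grouped = interviews.groupby("area")

by_area = grouped.var()                                # variance of every column per group
one_col = grouped["interview_1"].agg(["mean", "var"])  # several statistics for one column

print(by_area)
print(one_col)
```

With only two rows per group, each variance is based on a single degree of freedom, so these figures are illustrative rather than statistically meaningful.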
Get the Pandas covariance
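In the same fashion, we can calculate the covariance of the DataFrame columns. Because we added the text column area above, we pass numeric_only=True to restrict the calculation to the numeric columns:
interviews.cov(numeric_only=True)
Or, if we want to calculate the covariance of a specific subset of columns: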
interviews[['interview_1','interview_2']].cov()
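To make the covariance matrix easier to read, it helps to remember that its diagonal holds each column's variance and that it is symmetric. The sketch below, which retypes two of the example columns, checks both properties and also computes the covariance of a single pair of columns with Series.cov(); the variable names are illustrative.

```python
import pandas as pd

# Two interview columns retyped from the example data
interviews = pd.DataFrame({
    "interview_1": [76, 78, 84, 83],
    "interview_2": [51, 59, 58, 58],
})

cov_matrix = interviews.cov()  # 2 x 2 sample covariance matrix

# Covariance of one specific pair of columns
pair_cov = interviews["interview_1"].cov(interviews["interview_2"])

# The diagonal entry equals the column's own (sample) variance
diag_matches_var = abs(
    cov_matrix.loc["interview_1", "interview_1"] - interviews["interview_1"].var()
) < 1e-9

print(cov_matrix.round(2))
print(round(pair_cov, 2))  # 9.17
print(diag_matches_var)    # True
```

Like var(), cov() uses the sample formula (dividing by n - 1) by default.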