In Data Analysis we often execute arithmetic operations on our dataset. In today’s tutorial we would like to show how you can easily multiply two or more columns in a single DataFrame or on multiple ones.
Multiply Pandas DataFrame columns
In order to create a new column that contains the product of two or more DataFrame numeric columns, multiply the column values as following:
your_df['product_column'] = your_df['column_1'] * your_df['column_2'] * your_df['column_n']
Data Preparation
We will start by defining two DataFrames to run through the tutorials examples using the pd.DataFrame() constructor.
import pandas as pd
#define DataFrames
df1 = pd.DataFrame({'employee': [ 'Larry', 'Liam', 'Niall'],
'hours_worked': [90,80,80],
'salary_hour' : [24.3,31.5,45.4]})
df2= pd.DataFrame({'employee': [ 'Larry', 'Liam', 'Niall'],
'hourly_sold_units': [120,90,100]})
Multiply DataFrame columns (by row) element wise
In this example, we’ll create a new DataFrame column by multiplying the values of two additional columns. Here’s the code:
df1['total_salary'] = df1['hours_worked'] * df1['salary_hour']
df1.head()
And here is our DataFrame:
Multiply based on condition
What if we would choose to perform arithmetic operations only on rows values whicht answer a specific boolean condition?
In this example we would like to calculate the salary only for employees who worked over 80 hours.
df1['overtime_salary'] = (df1['hours_worked'] >80)* df1['hours_worked'] * df1['salary_hour']
print(df1.head())
Multiply columns from different DataFrames
If you have your data in different DataFrames you can obviously concatenate or join then together. You can also create new columns in your Python DataFrame by performing arithmetic operations between matching rows element wise.
df1['total_sales'] = df1['hours_worked'] * df2['hourly_sold_units']
df1.head()
An alternative method for accomplishing the same result using apply and lambda:
df1['overtime_salary'] = df1.apply(lambda row: row['hours_worked'] * row['salary_hour'] if row['hours_worked'] > 80 else 0 , axis=1)
Performance considerations
Based on my experience, i have seen that when performing calculations on pandas columns, using vectorized operations is typically way more efficient than applying a function on each row of our DataFrame.
# apply
%timeit df1['overtime_salary'] = df1.apply(lambda row: row['hours_worked'] * row['salary_hour'] if row['hours_worked'] > 80 else 0 , axis=1)
#vectorized
%timeit df1['overtime_salary'] = (df1['hours_worked'] >80)* df1['hours_worked'] * df1['salary_hour']
I have used the %timeit magic function, to assess the performance of the apply function vs vectorized operations. Even on our very tiny dataset the vectorized calculation runs over twice as fast. For larger datasets the performance gain of vectorized operations is way higher.
612 µs ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) 283 µs ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Multiply and sum calculated column
We can easily now go ahead and sum the values of the new calculated column:
print (df1 ['total_sales'].sum())
Multiply DataFrame columns by specific constant / scalar / value
If we just want to multiply column values by a constant value, we should use the following code. In the example below, we calculate the net revenue of our company by factoring the total sales by a scalar:
df1['net_revenue'] = df1['total_sales'] * 0.77