Step #1: Create example data
We will start by importing the pandas library and creating a very simple DataFrame that you can use to follow along.
import pandas as pd
dates = pd.Series(pd.date_range(start='12/26/22', end = '01/07/23', freq='B'))
channel = ['direct', 'indirect', 'web', 'direct', 'indirect', 'web','indirect', 'web', 'web', 'direct']
sales = [120, 140, 134, 156, 188, 200, 178, 213, 234, 98]
#create the DataFrame
performance = pd.DataFrame(dict(sales=sales, dates=dates, channel=channel))
print (performance.head())
This will return the following data:
sales | dates | channel | |
---|---|---|---|
0 | 120 | 2022-12-26 | direct |
1 | 140 | 2022-12-27 | indirect |
2 | 134 | 2022-12-28 | web |
3 | 156 | 2022-12-29 | direct |
4 | 188 | 2022-12-30 | indirect |
Step #2: Aggregate by datetime /timestamp
Aggregating by our date range is relatively simple:
performance.groupby ([dates.dt.year]).agg(overall_sales= ('sales' , 'sum'))
This will return the following DataFrame object:
overall_sales | |
---|---|
2022 | 738 |
2023 | 923 |
Note that we have used the dt accessor to derive the year value from our datetime column.
Step #3: Groupby pandas by date and other column
In the second example we will aggregate our data by the date column and by channel. Syntax is similar to the previous example, with key difference being the column names we pass to the groupby method:
performance.groupby ([dates.dt.year,channel]).agg(overall_sales= ('sales' , 'sum'))
This will return the following DataFrame:
Step #4: Group by Dataframe by year using the date index
Let’s start by creating a new DataFrame for this example:
performance1 = pd.DataFrame(dict(sales=sales), index=dates)
In this time series, the index consists of datetime objects. We can use those to groupby:
performance1.groupby(performance.index.year)['sales'].sum()
The result will be:
2022 738 2023 923 Name: sales, dtype: int64
Note that in order to groupby year, we didn’t need the dt accessor. We can directly call the year property of the Datetime Index to aggregate by year / month / month name / week etc’.
Step# 5: Groupby timestamp and count
Last example for this tutorial is aggregating by a timestamp and then counting distinct values.
performance.groupby ([dates.dt.month_name()]).agg(sales_count= ('sales' , 'count'))
And the result will be:
sales_count | |
---|---|
December | 5 |
January | 5 |