How to group by multiple columns in R DataFrames?

To aggregate data in an R dataFrame you can use the code below:

library (dplyr)
my_df %>%
  group_by (var_1,var_2) %>%
  summarise(total= sum(value_col))

Aggregating R data by multiple categorical variables

Step 1: Install dplyr package

If the dplyr package is not yet installed in your environment make sure to install it using the following command:

install.packages ('dplyr')

Alternatively consider installing tidyverse, which is a mega library of very useful R packages such as ggplot, tidyr, stringr and dplyr (and many more). In your RStudio script or Jupyter Notebook, run the following command to install tidyverse:

install.packages ('tidyverse')

Step 2: Create your DataFrame

Next, i will create a simple DataFrame using the following code:

month <- c ('March', 'March', 'November', 'November', 'March', 'November')
language <- c ('Python', 'Javascript', 'Javascript', 'Python', 'Javascript', 'Python')
median_salary <- c (142, 154, 125, 102, 168, 175)
interviews <- data.frame (month = month, language = language, salary = median_salary)

print(interviews)

Here’s our data:

monthlanguagesalary
1MarchPython142
2MarchJavascript154
3NovemberJavascript125
4NovemberPython102
5MarchJavascript168
6NovemberPython175

Step 3: Group and Summarize by multiple columns

I can now use the dplyr package capabilities to group and summarize data by multiple variables, in our case: month and language.

library (dplyr)
interviews_gp <- interviews %>%
  group_by (month,language) %>%
  summarise(total= sum(salary))

After running the snippet in RStudio / Jupyter we get the following DataFrame:

monthlanguagetotal
1MarchJavascript322
2MarchPython142
3NovemberJavascript125
4NovemberPython277

We can also get the mean salary by month and language by using a somewhat modified snippet:

library(dplyr)

interviews_gp <- interviews %>%
  group_by (language, month,) %>%
  summarise(average_salary= mean(salary))
languagemonthaverage_salary
1JavascriptMarch161.0
2JavascriptNovember125.0
3PythonMarch142.0
4PythonNovember138.5

Step 4: Plotting our results

We can use the ggplot2 library to render a simple bar chart plot to better visualize our findings:

interviews %>%
  group_by (language, month) %>%
  summarise(average_salary= mean(salary)) %>%
  ggplot(aes (fill= month, y=average_salary, x=language))  + 
  geom_col(position="dodge") + 
  labs(title="Salary by month and language", x = "Language", y="Avg salary") +
  theme_minimal()

Note: the setting (position=”dodge”) is required to have the columns stacked right to each other.

Here’s the chart: