To aggregate data in an R dataFrame you can use the code below:
library (dplyr)
my_df %>%
group_by (var_1,var_2) %>%
summarise(total= sum(value_col))
Aggregating R data by multiple categorical variables
Step 1: Install dplyr package
If the dplyr package is not yet installed in your environment make sure to install it using the following command:
install.packages ('dplyr')
Alternatively consider installing tidyverse, which is a mega library of very useful R packages such as ggplot, tidyr, stringr and dplyr (and many more). In your RStudio script or Jupyter Notebook, run the following command to install tidyverse:
install.packages ('tidyverse')
Step 2: Create your DataFrame
Next, i will create a simple DataFrame using the following code:
month <- c ('March', 'March', 'November', 'November', 'March', 'November')
language <- c ('Python', 'Javascript', 'Javascript', 'Python', 'Javascript', 'Python')
median_salary <- c (142, 154, 125, 102, 168, 175)
interviews <- data.frame (month = month, language = language, salary = median_salary)
print(interviews)
Here’s our data:
month | language | salary | |
---|---|---|---|
1 | March | Python | 142 |
2 | March | Javascript | 154 |
3 | November | Javascript | 125 |
4 | November | Python | 102 |
5 | March | Javascript | 168 |
6 | November | Python | 175 |
Step 3: Group and Summarize by multiple columns
I can now use the dplyr package capabilities to group and summarize data by multiple variables, in our case: month and language.
library (dplyr)
interviews_gp <- interviews %>%
group_by (month,language) %>%
summarise(total= sum(salary))
After running the snippet in RStudio / Jupyter we get the following DataFrame:
month | language | total | |
---|---|---|---|
1 | March | Javascript | 322 |
2 | March | Python | 142 |
3 | November | Javascript | 125 |
4 | November | Python | 277 |
We can also get the mean salary by month and language by using a somewhat modified snippet:
library(dplyr)
interviews_gp <- interviews %>%
group_by (language, month,) %>%
summarise(average_salary= mean(salary))
language | month | average_salary | |
---|---|---|---|
1 | Javascript | March | 161.0 |
2 | Javascript | November | 125.0 |
3 | Python | March | 142.0 |
4 | Python | November | 138.5 |
Step 4: Plotting our results
We can use the ggplot2 library to render a simple bar chart plot to better visualize our findings:
interviews %>%
group_by (language, month) %>%
summarise(average_salary= mean(salary)) %>%
ggplot(aes (fill= month, y=average_salary, x=language)) +
geom_col(position="dodge") +
labs(title="Salary by month and language", x = "Language", y="Avg salary") +
theme_minimal()
Note: the setting (position=”dodge”) is required to have the columns stacked right to each other.
Here’s the chart: