In today’s R Data Analysis tutorial we will learn how to subset and filter one or multiple rows in an R DataFrame, then keep those rows as a new DataFrame for further analysis and visualization.
In this tutorial we will mostly use the dplyr package, which makes R DataFrames easier to manipulate.
Initialize R DataFrame
We will start by creating a very simple DataFrame:
# R: build the example DataFrame from four vectors
area <- c("North", "South", "West", "East")
indirect <- c(275, 218, 217, 226)
direct <- c(353, 350, 326, 368)
online <- c(150, 186, 132, 136)
revenue <- data.frame(area = area, indirect = indirect, direct = direct, online = online)
After running this R script in RStudio or another R development environment, we get the following DataFrame:
| | area | indirect | direct | online |
---|---|---|---|---|
1 | North | 275 | 353 | 150 |
2 | South | 218 | 350 | 186 |
3 | West | 217 | 326 | 132 |
4 | East | 226 | 368 | 136 |
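Select by index with base R
We can use the script below to keep specific rows by row position. Using base R, we first define a vector of row positions and then use that vector to subset our DataFrame. The comma after the vector matters: it tells R that we are selecting rows and keeping all columns.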
# select by index with base r
selected_rows <- c(2,3,4)
subset <- revenue[selected_rows,]
print (subset)
The result is a DataFrame:
| | area | indirect | direct | online |
---|---|---|---|---|
2 | South | 218 | 350 | 186 |
3 | West | 217 | 326 | 132 |
4 | East | 226 | 368 | 136 |
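Base R positional indexing accepts more than a plain vector of positions. The sketch below rebuilds the same revenue DataFrame so it runs on its own; the variable names (without_first, middle_rows, south_sales) are made up for illustration.

```r
# Rebuild the tutorial's DataFrame so the snippet is self-contained
revenue <- data.frame(
  area = c("North", "South", "West", "East"),
  indirect = c(275, 218, 217, 226),
  direct = c(353, 350, 326, 368),
  online = c(150, 186, 132, 136)
)

# Negative positions drop rows: keep everything except row 1
without_first <- revenue[-1, ]

# A contiguous range of positions, written with the colon operator
middle_rows <- revenue[2:3, ]

# One row, restricted to two named columns
south_sales <- revenue[2, c("area", "online")]

print(without_first)
print(middle_rows)
print(south_sales)
```

Note that leaving out the comma, as in revenue[c(2, 3, 4)], selects columns 2 to 4 rather than rows.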
Subset rows by position with dplyr
With the slice function from the dplyr package we can extract one or more rows by their position:
library(dplyr)
selected_rows <- c(2,3,4)
subset <- slice(revenue, selected_rows)
print (subset)
Note: make sure the dplyr package is installed before loading it, for example by running install.packages("dplyr") once. Otherwise, library(dplyr) fails with the following error:
Error in library(dplyr) : there is no package called ‘dplyr’
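One way to avoid that error is to install dplyr only when it is missing. The following is a minimal sketch, assuming a CRAN mirror is reachable from your session; it also shows that slice accepts ranges and negative positions. The variable names are hypothetical.

```r
# Install dplyr only if it is not already available
# (assumes a CRAN mirror is configured for this session)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)

# Rebuild the tutorial's DataFrame so the snippet is self-contained
revenue <- data.frame(
  area = c("North", "South", "West", "East"),
  indirect = c(275, 218, 217, 226),
  direct = c(353, 350, 326, 368),
  online = c(150, 186, 132, 136)
)

# Positions can be passed directly, without an intermediate vector
rows_2_to_4 <- slice(revenue, 2:4)

# Negative positions drop rows instead of keeping them
without_first <- slice(revenue, -1)

print(rows_2_to_4)
print(without_first)
```

Unlike base R indexing, slice renumbers the rows of the result starting from 1, so the original positions are not kept as row names.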
Extract rows containing certain column values
We can use the filter function provided by dplyr to keep only the rows whose column values meet a condition:
library(dplyr)
subset <- filter (revenue, online >= 150)
print (subset)
You’ll get the following DataFrame:
| | area | indirect | direct | online |
---|---|---|---|---|
1 | North | 275 | 353 | 150 |
2 | South | 218 | 350 | 186 |
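Conditions passed to filter are not limited to numeric comparisons. As a rough sketch on the same revenue data, the example below matches a text column exactly, matches any of several values with %in%, and compares a column against its own mean. The variable names are illustrative only.

```r
library(dplyr)

# Rebuild the tutorial's DataFrame so the snippet is self-contained
revenue <- data.frame(
  area = c("North", "South", "West", "East"),
  indirect = c(275, 218, 217, 226),
  direct = c(353, 350, 326, 368),
  online = c(150, 186, 132, 136)
)

# Exact match on a text column
north_only <- filter(revenue, area == "North")

# Match any of several values with %in%
east_or_west <- filter(revenue, area %in% c("East", "West"))

# Compare each value against a statistic computed from the same column
above_average_online <- filter(revenue, online > mean(online))

print(north_only)
print(east_or_west)
print(above_average_online)
```

The mean of the online column is 151, so only the South row (186) passes the last condition.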
Subset data by multiple conditions
In the same fashion, we can combine several conditions to select rows. Here the & operator keeps only the rows that satisfy both conditions:
library(dplyr)
subset <- filter(revenue, online < 150 & direct >= 350)
print (subset)
This returns a single-row DataFrame:
| | area | indirect | direct | online |
---|---|---|---|---|
1 | East | 226 | 368 | 136 |
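Conditions can be combined in other ways as well. The sketch below, again on the revenue data, shows that comma-separated conditions behave like &, that | keeps rows meeting at least one condition, and that ! negates a condition. The thresholds and variable names are chosen purely for illustration.

```r
library(dplyr)

# Rebuild the tutorial's DataFrame so the snippet is self-contained
revenue <- data.frame(
  area = c("North", "South", "West", "East"),
  indirect = c(275, 218, 217, 226),
  direct = c(353, 350, 326, 368),
  online = c(150, 186, 132, 136)
)

# Comma-separated conditions are combined with AND, just like &
both_conditions <- filter(revenue, online < 150, direct >= 350)

# | keeps rows that meet at least one of the conditions
either_condition <- filter(revenue, online >= 180 | indirect > 250)

# ! negates a condition: every area except West
not_west <- filter(revenue, !(area == "West"))

# Base R equivalent of the AND filter, for comparison
both_base <- revenue[revenue$online < 150 & revenue$direct >= 350, ]

print(both_conditions)
print(either_condition)
print(not_west)
print(both_base)
```

The base R version returns the same East row but keeps its original row name (4), while filter renumbers it to 1.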
Keep random rows
In our last example, we’ll use dplyr’s sample_n function to pick a given number of rows at random from our DataFrame. Here we select two rows. Because replace=TRUE samples with replacement, the same row can be returned twice.
library(dplyr)
random_subset <- sample_n(revenue, 2, replace = TRUE)
print(random_subset)
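Random selections change on every run unless the random seed is fixed. The following sketch assumes dplyr 1.0.0 or later, where slice_sample supersedes sample_n; the seed value 42 is arbitrary and the variable names are hypothetical.

```r
library(dplyr)

# Rebuild the tutorial's DataFrame so the snippet is self-contained
revenue <- data.frame(
  area = c("North", "South", "West", "East"),
  indirect = c(275, 218, 217, 226),
  direct = c(353, 350, 326, 368),
  online = c(150, 186, 132, 136)
)

# Fix the seed so the same rows are drawn on every run (42 is arbitrary)
set.seed(42)

# Two distinct rows: slice_sample samples without replacement by default
two_distinct <- slice_sample(revenue, n = 2)

# Resetting the seed reproduces exactly the same draw
set.seed(42)
same_draw <- slice_sample(revenue, n = 2)

# Sample a proportion of the rows instead of a fixed count
half_of_rows <- slice_sample(revenue, prop = 0.5)

print(two_distinct)
print(half_of_rows)
```

Sampling without replacement guarantees that the two rows differ, which is usually what is wanted when picking a random subset of a small table.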