In today’s tutorial we’ll go through several examples in which we’ll subset Pandas DataFrame and keep certain specific columns for our analysis.
Example DataFrame
We will start by defining a DataFrame which we will populate with fictitious data. You can use the provided data to follow along this tutorial.
import pandas as pd
# define dictionary with data
candidates_dict = {'domain': ['Python', 'Python', 'Python'],
'area': ['Seattle', 'New York', 'Madrid'],
'interviewees': [24, 98, 78],
'salary': [162, 184, 187]}
#initialize a DataFrame
candidates = pd.DataFrame (candidates_dict)
candidates.head()
Here’s a quick view of our simple DataFrame data:
domain | area | interviewees | salary | |
---|---|---|---|---|
0 | Python | Seattle | 24 | 162 |
1 | Python | New York | 98 | 184 |
2 | Python | Madrid | 78 | 187 |
Important note: Don’t forget to import the pandas library into your Python namespace. Here’s how to troubleshoot Pandas module not found errors.
Keep multiple columns (in list) and drop the rest
We can easily define a list of columns to keep and slice our DataFrame accordingly. In the example below, we pass a list containing multiple columns to slice accordingly. You can obviously pass as many columns as needed:
subset = candidates [['area', 'salary']]
subset.head()
area | salary | |
---|---|---|
0 | Seattle | 162 |
1 | New York | 184 |
2 | Madrid | 187 |
To slice a single column (Pandas Series), just use the simple bracket notation:
subset = candidates['area']
Get the first columns only – keeping columns by index
Let’s assume that we want to subset our data so that the first 3 columns will be kept. We can use the iloc accessor and achieve that:
# find first n columns
subset = candidates.iloc[:,0:3]
subset.head()
domain | area | interviewees | |
---|---|---|---|
0 | Python | Seattle | 24 |
1 | Python | New York | 98 |
2 | Python | Madrid | 78 |
Keep numeric columns
Let’s now assume that we have a very wide DataFrame and we would like to focus on the numeric columns only. We can accomplish that with using the select_dtypes() function.
subset = candidates.select_dtypes(include = 'number')
subset.head()
interviewees | salary | |
---|---|---|
0 | 24 | 162 |
1 | 98 | 184 |
2 | 78 | 187 |
Conversely, we can looking into the non numeric value by using the exclude parameter:
subset = candidates.select_dtypes(exclude = 'number')
subset.head()
Find all columns in list
Let’s assume that would like to pass a list of columns and keep only those columns which name matches a value in the list:
subset = candidates.loc[:,candidates.columns.isin(['area', 'salary'])]
subset.head()
area | salary | |
---|---|---|
0 | Seattle | 162 |
1 | New York | 184 |
2 | Madrid | 187 |
Find columns containing a specific string
In the next example we’ll look for a specific string in a column name and retain those columns only:
subset = candidates.loc[:,candidates.columns.str.find('ar') > -1]
subset.head()
Find columns using conditions / with prefix
In the last example we’ll leave those columns which name starts with a specific string
subset = candidates.loc[:,candidates.columns.str.startswith('a')]
subset.head()
area | |
---|---|
0 | Seattle |
1 | New York |
2 | Madrid |