How to keep specific columns in a Pandas DataFrame?

In today’s tutorial we’ll go through several examples in which we’ll subset Pandas DataFrame and keep certain specific columns for our analysis.

Example DataFrame

We will start by defining a DataFrame which we will populate with fictitious data. You can use the provided data to follow along this tutorial.

import pandas as pd

# define dictionary with data

candidates_dict = {'domain': ['Python', 'Python', 'Python'],
 'area': ['Seattle', 'New York', 'Madrid'],
 'interviewees': [24, 98, 78],
 'salary': [162, 184, 187]}

#initialize a DataFrame
candidates = pd.DataFrame (candidates_dict)

candidates.head()

Here’s a quick view of our simple DataFrame data:

domainareaintervieweessalary
0PythonSeattle24162
1PythonNew York98184
2PythonMadrid78187

Important note: Don’t forget to import the pandas library into your Python namespace. Here’s how to troubleshoot Pandas module not found errors.

Keep multiple columns (in list) and drop the rest

We can easily define a list of columns to keep and slice our DataFrame accordingly. In the example below, we pass a list containing multiple columns to slice accordingly. You can obviously pass as many columns as needed:

subset = candidates [['area', 'salary']]
subset.head()
areasalary
0Seattle162
1New York184
2Madrid187

To slice a single column (Pandas Series), just use the simple bracket notation:

subset = candidates['area']

Get the first columns only – keeping columns by index

Let’s assume that we want to subset our data so that the first 3 columns will be kept. We can use the iloc accessor and achieve that:

# find first n columns
subset = candidates.iloc[:,0:3]
subset.head()
domainareainterviewees
0PythonSeattle24
1PythonNew York98
2PythonMadrid78

Keep numeric columns

Let’s now assume that we have a very wide DataFrame and we would like to focus on the numeric columns only. We can accomplish that with using the select_dtypes() function.

subset =  candidates.select_dtypes(include = 'number')
subset.head()
intervieweessalary
024162
198184
278187

Conversely, we can looking into the non numeric value by using the exclude parameter:

subset =  candidates.select_dtypes(exclude = 'number')
subset.head()

Find all columns in list

Let’s assume that would like to pass a list of columns and keep only those columns which name matches a value in the list:

subset = candidates.loc[:,candidates.columns.isin(['area', 'salary'])]
subset.head()
areasalary
0Seattle162
1New York184
2Madrid187

Find columns containing a specific string

In the next example we’ll look for a specific string in a column name and retain those columns only:

subset = candidates.loc[:,candidates.columns.str.find('ar') > -1]
subset.head()

Find columns using conditions / with prefix

In the last example we’ll leave those columns which name starts with a specific string

subset = candidates.loc[:,candidates.columns.str.startswith('a')]
subset.head()
area
0Seattle
1New York
2Madrid

Suggested learning