How to subset a dataframe by one or multiple columns?

When trying to make sense of a large DataFrame in Pandas, you might need to subset it by columns and rows. In this tutorial we’ll show the most prevalent use cases of column partitioning your DataFrame.

Create an example DataFrame

We’ll start by setting up our example DataFrame, which we’ll do by running the following Python code in our favorite environment (for simplicity, i use Anaconda and Jupyter Lab).

import pandas as pd # import the pandas library

languages = ({"language": [ "Python", "C-Sharp", "Javascript","PHP"] ,
         "avg_difficulty_level": [3, 2, 2, 1],
              "avg_salary": [120, 100, 120, 80],
          "applications": [10,15,14,20]})

# Now let's convert our dict to a DataFrame

languages = pd.DataFrame.from_dict(languages)

Now, let us look into our DataFrame first rows using the head() method.

languages.head()

As you can see, this is a pretty simple DataFrame we’ll use as an example in this post:

languageavg_difficulty_levelavg_salaryapplications
0Python312010
1C-Sharp210015
2Javascript212014
3PHP18020

Select a single column

We’ll start with the simplest case, which is to subset one column out of our dataset. Running the following command will create a Series object:

By name / label

languages["language"]

The output is a pandas Series:

0        Python
1       C-Sharp
2    Javascript
3           PHP
Name: language, dtype: object

By index

You can use the iloc accessor to slice your DataFrame by the row or column index. The snippet below subsets the leftmost column:

languages.iloc[:,0]

Select multiple columns in Pandas

By name

When passing a list of columns, Pandas will return a DataFrame containing part of the data.

languages[["language", "applications"]]
languageapplications
0Python10
1C-Sharp15
2Javascript14
3PHP20

By label (with loc)

df.loc[:,["language","applications"]]

The result will be similar.

By index

Using the iloc accessor you can also retrieve specific multiple columns.

languages.iloc[: ,[0,3]]
languageapplications
0Python10
1C-Sharp15
2JavaScript14
3PHP20

By condition

In this case, we’ll just show the columns which name matches a specific expression. We’ll use the quite handy filter method:

languages.filter(axis = 1, like="avg")

Notes:

  1. we can also filter by a specific regular expression (regex).
  2. We can apply the parameter axis=0 to filter by specific row value.

Filter specific rows by condition

Here’s a pretty straightforward way to subset the DataFrame according to a row value:

languages[(languages["applications"] > 15)]

We have only one result:

languages[(languages["applications"] > 15)]
languageavg_difficulty_levelavg_salaryapplications
3PHP18020