How to subset a dataframe by one or multiple columns?

When trying to make sense of a large DataFrame in Pandas, you might need to subset it by columns and rows. In this tutorial we’ll show the most prevalent use cases of column partitioning your DataFrame.

We’ll start by setting up our example DataFrame, which we’ll do by running the following Python code in our favorite environment (for simplicity, i use Anaconda and Jupyter Lab).

import pandas as pd # import the pandas library

languages = ({"language": [ "Python", "C-Sharp", "Javascript","PHP"] ,
         "avg_difficulty_level": [3, 2, 2, 1],
              "avg_salary": [120, 100, 120, 80],
          "applications": [10,15,14,20]})

# Now let's convert our dict to a DataFrame

languages = pd.DataFrame.from_dict(languages)

Now, let us look into our Dataframe first rows using the head() method.

languages.head()

As you can see, this is a pretty simple DataFrame we’ll use as an example in this post:

languageavg_difficulty_levelavg_salaryapplications
0Python312010
1C-Sharp210015
2Javascript212014
3PHP18020

Filtering a single column

We’ll start with the simplest case, which is to subset one column out of our dataset. Running the following command will create a Series object:

By name / label

languages["language"]

The output is a Series:

0        Python
1       C-Sharp
2    Javascript
3           PHP
Name: language, dtype: object
type(languages["language"])

pandas.core.series.Series

By index

The following command will also return a Series containing the first column

languages.iloc[:,0]

Selecting multiple columns

By name

When passing a list of columns, Pandas will return a DataFrame containing part of the data.

languages[["language", "applications"]]
languageapplications
0Python10
1C-Sharp15
2Javascript14
3PHP20

By label (with loc)

df.loc[:,["language","applications"]]

The result will be similar.

By index

Now let’s retrieve multiple columns by using the index:

languages.iloc[: ,[0,3]]
languageapplications
0Python10
1C-Sharp15
2Javascript14
3PHP20

By condition

In this case, we’ll just show the columns which name matches a specific expression. We’ll use the quite handy filter method:

languages.filter(axis = 1, like="avg")

Notes:

  1. we can also filter by a specific regular expression (regex).
  2. We can apply the parameter axis=0 to filter by specific row value.

Filter specific rows by condition

Here’s a pretty straightforward way to subset the DataFrame according to a row value:

languages[(languages["applications"] > 15)]

We have only one result:

languages[(languages["applications"] > 15)]
languageavg_difficulty_levelavg_salaryapplications
3PHP18020