When trying to make sense of a large DataFrame in Pandas, you might need to subset it by columns and rows. In this tutorial we’ll show the most prevalent use cases of column partitioning your DataFrame.
Create an example DataFrame
We’ll start by setting up our example DataFrame, which we’ll do by running the following Python code in our favorite environment (for simplicity, i use Anaconda and Jupyter Lab).
import pandas as pd # import the pandas library
languages = ({"language": [ "Python", "C-Sharp", "Javascript","PHP"] ,
"avg_difficulty_level": [3, 2, 2, 1],
"avg_salary": [120, 100, 120, 80],
"applications": [10,15,14,20]})
# Now let's convert our dict to a DataFrame
languages = pd.DataFrame.from_dict(languages)
Now, let us look into our DataFrame first rows using the head() method.
languages.head()
As you can see, this is a pretty simple DataFrame we’ll use as an example in this post:
language | avg_difficulty_level | avg_salary | applications | |
---|---|---|---|---|
0 | Python | 3 | 120 | 10 |
1 | C-Sharp | 2 | 100 | 15 |
2 | Javascript | 2 | 120 | 14 |
3 | PHP | 1 | 80 | 20 |
Select a single column
We’ll start with the simplest case, which is to subset one column out of our dataset. Running the following command will create a Series object:
By name / label
languages["language"]
The output is a pandas Series:
0 Python 1 C-Sharp 2 Javascript 3 PHP Name: language, dtype: object
By index
You can use the iloc accessor to slice your DataFrame by the row or column index. The snippet below subsets the leftmost column:
languages.iloc[:,0]
Select multiple columns in Pandas
By name
When passing a list of columns, Pandas will return a DataFrame containing part of the data.
languages[["language", "applications"]]
language | applications | |
---|---|---|
0 | Python | 10 |
1 | C-Sharp | 15 |
2 | Javascript | 14 |
3 | PHP | 20 |
By label (with loc)
df.loc[:,["language","applications"]]
The result will be similar.
By index
Using the iloc accessor you can also retrieve specific multiple columns.
languages.iloc[: ,[0,3]]
language | applications | |
---|---|---|
0 | Python | 10 |
1 | C-Sharp | 15 |
2 | JavaScript | 14 |
3 | PHP | 20 |
By condition
In this case, we’ll just show the columns which name matches a specific expression. We’ll use the quite handy filter method:
languages.filter(axis = 1, like="avg")
Notes:
- we can also filter by a specific regular expression (regex).
- We can apply the parameter axis=0 to filter by specific row value.
Filter specific rows by condition
Here’s a pretty straightforward way to subset the DataFrame according to a row value:
languages[(languages["applications"] > 15)]
We have only one result:
languages[(languages["applications"] > 15)]
language | avg_difficulty_level | avg_salary | applications | |
---|---|---|---|---|
3 | PHP | 1 | 80 | 20 |