Our task for today will be to slices DataFrames according to specific criteria and keep the relevant rows.
Example Data
We will import the pandas library and initialize a simple DataFrame that contains some HR related data.
import pandas as pd
sales_dict = {
'team' : ['B2B', 'B2C', 'Retail', 'Online', 'B2B'],
'direct' : [2564, 2301, 4961, 4652, 2013],
'indirect' : [3112, 2373, 2918, 3210, 3432],
'telesales' : [4513, 4095, 3317, 4534, 3083],
}
sales = pd.DataFrame (sales_dict)
sales['total'] =sales.sum(numeric_only=True,axis = 1)
sales.head()
Subset top n rows
We can use the nlargest DataFrame method to slice the top n rows from our DataFrame and keep them in a new DataFrame object.
sales_subset = sales.nlargest(n=3,columns = 'total')
print (sales_subset)
We’ll get the top 3 observations with the biggest total sales amount:
team | direct | indirect | telesales | total | |
---|---|---|---|---|---|
3 | Online | 4652 | 3210 | 4534 | 12396 |
2 | Retail | 4961 | 2918 | 3317 | 11196 |
0 | B2B | 2564 | 3112 | 4513 | 10189 |
We can accomplish the same in a little bit elegant way by sorting the DataFrame and use the iloc indexer.
sales_subset = sales.sort_values(by='total', ascending = False).iloc[0:3]
Keep rows by value and condition
Let’s assume that we would like to slice our DataFrame to show only rows which total revenue is below 10,000.
filt = sales['total'] < 10000
sales_subset = sales[filt]
print (sales_subset)
This returns a couple of records:
team | direct | indirect | telesales | total | |
---|---|---|---|---|---|
1 | B2C | 2301 | 2373 | 4095 | 8769 |
4 | B2B | 2013 | 3432 | 3083 | 8528 |
Subset your DataFrame by index
Next case is to slice the data by row index. We can easily do it with the iloc accessor. Assume that we would like to keep records 1,2 and 4. We’ll do that by passing the corresponding values as a Python list.
sales_subset = sales.iloc[[1,3,4],]
print (sales_subset)
You can use a similar technique to keep the first n DataFrame rows. For example display the first 3 records in your DataFrame:
sales.iloc[:3,]
Keep unique records
We can use the drop_duplicates() DataFrame method to remove all duplicated records from our DataFrame. This allows us to get rid of dups according to specific criteria. In our case we’ll drop any records that have the same ‘team’ value (except the first occurrence).
sales_subset = sales.drop_duplicates('team')
print (sales_subset)
Subset every nth row
Last case for today is that we would like to keep the rows at odd positions (every 2nd row). You can use the bracket notation to slice through the DataFrame rows. To keep the even index rows, set the step to 2 as showed below:
subset_even_rows = sales[::2]
Next learning
How to create a pandas crosstab from multiple DataFrame columns?