How to keep specific rows in pandas DataFrame objects?

Our task for today will be to slices DataFrames according to specific criteria and keep the relevant rows.

Example Data

We will import the pandas library and initialize a simple DataFrame that contains some HR related data.

import pandas as pd

sales_dict = {
             'team' : ['B2B', 'B2C', 'Retail', 'Online', 'B2B'],
             'direct' : [2564, 2301, 4961, 4652, 2013],
             'indirect' : [3112, 2373, 2918, 3210, 3432],
             'telesales' : [4513, 4095, 3317, 4534, 3083],
}

sales = pd.DataFrame (sales_dict)
sales['total'] =sales.sum(numeric_only=True,axis = 1)

sales.head()

Subset top n rows

We can use the nlargest DataFrame method to slice the top n rows from our DataFrame and keep them in a new DataFrame object.

sales_subset = sales.nlargest(n=3,columns = 'total')
print (sales_subset)

We’ll get the top 3 observations with the biggest total sales amount:

teamdirectindirecttelesalestotal
3Online46523210453412396
2Retail49612918331711196
0B2B25643112451310189

We can accomplish the same in a little bit elegant way by sorting the DataFrame and use the iloc indexer.

sales_subset = sales.sort_values(by='total', ascending = False).iloc[0:3]

Keep rows by value and condition

Let’s assume that we would like to slice our DataFrame to show only rows which total revenue is below 10,000.

filt = sales['total'] < 10000
sales_subset = sales[filt]
print (sales_subset)

This returns a couple of records:

teamdirectindirecttelesalestotal
1B2C2301237340958769
4B2B2013343230838528

Subset your DataFrame by index

Next case is to slice the data by row index. We can easily do it with the iloc accessor. Assume that we would like to keep records 1,2 and 4. We’ll do that by passing the corresponding values as a Python list.

sales_subset = sales.iloc[[1,3,4],]
print (sales_subset)

You can use a similar technique to keep the first n DataFrame rows. For example display the first 3 records in your DataFrame:

sales.iloc[:3,]

Keep unique records

We can use the drop_duplicates() DataFrame method to remove all duplicated records from our DataFrame. This allows us to get rid of dups according to specific criteria. In our case we’ll drop any records that have the same ‘team’ value (except the first occurrence).

sales_subset = sales.drop_duplicates('team')
print (sales_subset)

Subset every nth row

Last case for today is that we would like to keep the rows at odd positions (every 2nd row). You can use the bracket notation to slice through the DataFrame rows. To keep the even index rows, set the step to 2 as showed below:

subset_even_rows = sales[::2]

Next learning

How to create a pandas crosstab from multiple DataFrame columns?