As part of your Data wrangling process you might need to quickly count occurrences of same or different values in your entire dataset or in specific columns. In today’s tutorial we’ll go over several use cases and provide some short example code snippets to help you tackle each scenario.
We’ll first acquire our dataset and import the Pandas library into Jupyter Notebook/Lab or other Python IDE.
import pandas as pd data = pd.read_csv('hr_data.csv') data.info()
Using value_counts to count unique values in a column
We can easily enumerate unique occurrences of a column values using the Series value_counts() method. In our case we’ll invoke value_counts and pass the language column as a parameter.
Here’s the result:
Note: Running the value_counts method on the DataFrame (rather than on a specific column) will return the number of unique values in all the DataFrame columns.
Groupby count specific values example
An alternative technique is to use the Groupby.size() method to count occurrences in a specific column. Here’s an example:
Count specific value occurrences with condition
Next case is to enumerate occurrences of specific values that meet a certain condition. In our case we’ll just count the occurrences of the string ‘Python’ in our languages column.
# condition (if) filtr = data['language'] == 'Python' filtr.sum()
The result will be 4.
Here’s another example – this time we apply a condition on the salary figure:
filtr = data['salary'] > 123 filtr.sum()
The result will be 8.
Note the usage of the Series.sum method to enumerate the number of matching rows.
You can obviously concatenate conditions as need in case you need a more complex filter:
# Using an OR condition filtr = data['language'] == 'Python' | data['language'] == 'R' filtr.sum()
Find number of unique occurrences in a column
# how many unique values in column data['language'].nunique()
Counting empty values in a column / Series
#count NaN values data['days_to_hire'].isnull().sum()
Finding number of repeated / duplicated values in a Series
To quickly identify the number of duplicated we’ll substract the number of non-duplicated items in the column (which we’ll identify using the drop_duplicates method) from the overall number of records.
# how many duplicates len(data['language']) - len(data['language'].drop_duplicates())
Find a string value across multiple columns
If you would like to count the number of occurrences of a specific string in your entire DataFrame you could use the following code:
filtr = data == 'Python' data[filtr].count()
In our case the result will be: