This tutorial will teach you how to cast a pandas column of objects to numeric types such as integer or floats. This is a common task you will typically execute during the data cleaning phase of your data acquisition workflow; typically when harmonizing the data types in a column that contains figures, but also non-numeric data such as strings.
Create example data
As we typically do, we will first create a simple DataFrame that you can use to follow along.
import pandas as pd
dates = pd.date_range(start='1/10/2023', periods = 5, freq = 'B' )
sales = [162, '142', 136, '134', None]
campaign = pd.DataFrame (dict (dates = dates, sales = sales))
campaign.head()
This will render the following DataFrame rows:
dates | sales | |
---|---|---|
0 | 2023-01-10 | 162 |
1 | 2023-01-11 | 142 |
2 | 2023-01-12 | 136 |
3 | 2023-01-13 | 134 |
4 | 2023-01-16 | None |
Let’s look at the columns data types:
campaign.dtypes
This will return:
dates datetime64[ns] sales object dtype: object
The sales column has mixed data types, hence as expected its type is object.
Note: The sales column also contains an empty value, which we would like to replace. If we’ll run the fillna() command on the column we will get the following TypeError exception:
#TypeError: unsupported operand type(s) for +: 'int' and 'str'
Before attempting to replace the empty values in our DataFrame we should first convert the column to numeric values.
Convert pandas column to numbers
There are two main options to cast a Series/ column to integers or float numbers: the pd.to_numeric function and the astype() method. As we have empty values in our column, we’ll opt for pd.to_numeric.
pd.to_numeric vs astype
- Unlike pd.to_numeric, astype() doesn’t handle NAN empty values. To circumvent this, we use the parameter errors=’ignore’.
- While pd.to_numeric know to infer the expected data type for the conversion, when using astype() we need to provide the target data type as a parameter.
Using pd.to_numeric
campaign['sales'] = pd.to_numeric(campaign['sales'], errors = 'raise')
campaign['sales'].head()
0 162.0 1 142.0 2 136.0 3 134.0 4 NaN Name: sales, dtype: float64
Replace empty values
We can now easily replace the empty value in the sales column by the column average value:
campaign['sales'] = campaign['sales'].fillna(campaign['sales'].mean())
campaign['sales'].head()
Here’s the result:
0 162.0 1 142.0 2 136.0 3 134.0 4 143.5 Name: sales, dtype: float64
Converting pandas column from float to integer
Now we can use the astype() method:
campaign['sales'] = campaign['sales'].astype('int64')
campaign['sales'].head()
And a final look at our column:
0 162 1 142 2 136 3 134 4 143 Name: sales, dtype: int64