In today’s Pandas Data Analysis tutorial, I would like to cover the basics of converting Pandas DataFrame columns to strings.
We will focus on several key use cases here:
- Converting specific columns to strings using the astype() method.
- Exporting a DataFrame to a string object.
- Converting a Datetime object to a string.
Example data
We will start by creating some test data so that you can follow along with this exercise:
#import Pandas
import pandas as pd
#Define data dictionary
cand_dict = {'office_id': ['ny', 2, 3],
             'city': ['nyc', 'boston', 'austin'],
             'num_candidates': [10, 20, 58]}
#Initialize DataFrame
candidates = pd.DataFrame(cand_dict)
Let’s find out the respective data types of our DataFrame columns:
candidates.dtypes
The result will be:
office_id object
city object
num_candidates int64
dtype: object
The column office_id includes both numeric integers and text characters, hence it is assigned an object data type.
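To see why the column ends up as object, it can help to look at the Python type of each individual value. The snippet below is a minimal sketch that rebuilds only the office_id column from the example data; the value_types name is just an illustrative choice.

```python
import pandas as pd

# Rebuild only the mixed-type column from the example data
candidates = pd.DataFrame({'office_id': ['ny', 2, 3]})

# Map every value to the name of its Python type
value_types = candidates['office_id'].map(lambda v: type(v).__name__)
print(value_types.tolist())  # ['str', 'int', 'int']
```

Because the values do not share a single type, Pandas falls back to the generic object dtype for the whole column.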
Convert DataFrame columns to strings
Let’s assume that we would like to concatenate the office_id and the city columns.
candidates['city_id'] = candidates['office_id'] + '_'+ candidates['city']
This will raise a TypeError, because we are trying to concatenate integers and strings:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Let’s then go ahead and convert the office_id column to the string data type, after which we can easily combine the columns:
candidates['office_id'] = candidates['office_id'].astype('string')
candidates['city_id'] = candidates['office_id'] + '_'+ candidates['city']
print(candidates.head())
We’ll get the city_id column in our DataFrame:
| | office_id | city | num_candidates | city_id |
|---|---|---|---|---|
| 0 | ny | nyc | 10 | ny_nyc |
| 1 | 2 | boston | 20 | 2_boston |
| 2 | 3 | austin | 58 | 3_austin |
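If you would rather not overwrite office_id, one possible variation is to convert it on the fly while building the new values. This is only a sketch: it uses astype(str) instead of astype('string'), and the city_id variable is a standalone Series rather than a new column.

```python
import pandas as pd

candidates = pd.DataFrame({'office_id': ['ny', 2, 3],
                           'city': ['nyc', 'boston', 'austin']})

# Convert office_id only for the concatenation; the original column is untouched
city_id = candidates['office_id'].astype(str) + '_' + candidates['city']
print(city_id.tolist())  # ['ny_nyc', '2_boston', '3_austin']
```

The office_id column keeps its original mixed values here, which can be handy if you still need the integers later on.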
Cast DataFrame object to string
We found out earlier that the city column was stored with the Pandas object dtype. We can cast several columns at once by passing astype() a dictionary that maps column names to dtypes, as shown below:
candidates.astype({'city':'string', 'num_candidates':'int32'}).dtypes
And the result that we will get for the two converted columns is:
city string
num_candidates int32
dtype: object
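Here is a small, self-contained sketch of the dictionary form of astype(), using the same example data. It illustrates that only the listed columns are converted and that the original DataFrame is left as it was; the converted name is an illustrative choice.

```python
import pandas as pd

candidates = pd.DataFrame({'office_id': ['ny', 2, 3],
                           'city': ['nyc', 'boston', 'austin'],
                           'num_candidates': [10, 20, 58]})

# Only the columns named in the dictionary are cast
converted = candidates.astype({'city': 'string', 'num_candidates': 'int32'})

print(converted.dtypes)    # dtypes of the new DataFrame
print(candidates.dtypes)   # the original DataFrame is unchanged
```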
Export Pandas Dataframe to strings using to_string()
The to_string() method renders a single column or an entire DataFrame as one plain-text string. Here’s an example:
# Single DataFrame column
print(candidates['city'].to_string())
#Entire DataFrame
print(candidates.to_string())
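As a quick sketch of what to_string() hands back, the example below stores the result in a variable (table_text is just an illustrative name) and uses the index=False parameter to leave out the row labels.

```python
import pandas as pd

candidates = pd.DataFrame({'city': ['nyc', 'boston', 'austin'],
                           'num_candidates': [10, 20, 58]})

# to_string() returns a regular Python str; index=False omits the row labels
table_text = candidates.to_string(index=False)

print(type(table_text).__name__)  # str
print(table_text)
```

Since the result is an ordinary string, you can write it to a file or embed it in a log message like any other text.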
Convert Datetime to string
In our next example, we’ll turn a column containing Datetime objects into strings.
We’ll first generate a simple DataFrame:
import pandas as pd
week = pd.date_range('2022-10-10', periods=7, freq='D')
sales = [120, 130, 150, 167, 180, 120, 150]
sales_df = pd.DataFrame(dict(week=week, sales=sales))
sales_df.dtypes
This will return:
week datetime64[ns]
sales int64
dtype: object
Now we can use the astype() method, as shown above, to return a Series of strings:
sales_df['week'].astype('string')
Note that the conversion to strings won’t be persisted in your original DataFrame, because astype() returns a new object. You can easily keep your changes by creating a new DataFrame:
sales_df_2 = sales_df.astype({'week':'string'})
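The sketch below illustrates the point with the same sales data: calling astype() without storing the result leaves the column as it was, while assigning the result back to the column is another way to keep the conversion. Assigning back in place is an alternative to the new-DataFrame approach, not a replacement for it.

```python
import pandas as pd

week = pd.date_range('2022-10-10', periods=7, freq='D')
sales = [120, 130, 150, 167, 180, 120, 150]
sales_df = pd.DataFrame(dict(week=week, sales=sales))

# The result is discarded, so the column keeps its datetime dtype
sales_df['week'].astype('string')
print(sales_df['week'].dtype)

# Assigning the result back to the column keeps the conversion
sales_df['week'] = sales_df['week'].astype('string')
print(sales_df['week'].dtype)
```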
Change Datetime format and convert to string
In the next example we’ll modify the format of a datetime column using the strftime formatter. We’ll reuse the data that we generated previously.
sales_df['week'].dt.strftime('%d-%m-%y')
The result is a Series of strings in day-month-year format; the first value, for example, is 10-10-22.
In a similar fashion you can modify the datetime value to other formats, including years, months, days, hours, minutes and so forth.
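To illustrate a few other formats, here is a minimal sketch that applies different strftime directives to the first three dates of the example range. The variable names are illustrative only, and the directives shown are standard Python format codes.

```python
import pandas as pd

week = pd.Series(pd.date_range('2022-10-10', periods=3, freq='D'))

# Day-month-year with a two-digit year, as in the example above
short_dates = week.dt.strftime('%d-%m-%y')
# Four-digit year first, separated by slashes
slash_dates = week.dt.strftime('%Y/%m/%d')
# Hours and minutes only (midnight for every date in this range)
times = week.dt.strftime('%H:%M')

print(short_dates.tolist())  # ['10-10-22', '11-10-22', '12-10-22']
print(slash_dates.tolist())  # ['2022/10/10', '2022/10/11', '2022/10/12']
print(times.tolist())        # ['00:00', '00:00', '00:00']
```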
Example of Renaming a DataFrame column
A common requirement is to rename a column after it has been cast to a new data type. Here is a quick snippet that you can use as an example:
cand_2 = candidates.astype({'city':'string', 'num_candidates':'int32'})
cand_2 = cand_2.rename(columns={'city': 'my_city'})
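Keep in mind that rename() returns a new DataFrame by default, so the result needs to be stored somewhere. The following sketch shows this with a trimmed-down version of the example data; renamed is an illustrative variable name.

```python
import pandas as pd

candidates = pd.DataFrame({'city': ['nyc', 'boston', 'austin'],
                           'num_candidates': [10, 20, 58]})

cand_2 = candidates.astype({'city': 'string', 'num_candidates': 'int32'})

# rename() leaves cand_2 untouched and returns a renamed copy
renamed = cand_2.rename(columns={'city': 'my_city'})

print(list(cand_2.columns))   # ['city', 'num_candidates']
print(list(renamed.columns))  # ['my_city', 'num_candidates']
```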
Converting Pandas column to int
A couple of readers have asked about a simple way to convert a column containing numeric data from string to integer types. I wanted to point you to this tutorial, which explains how to cast columns to int in Pandas.
The gist of it is simply to use the astype method on a specific DataFrame column. In our case:
candidates['num_candidates'].astype('int')
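Since num_candidates already holds integers in our example data, the sketch below uses a hypothetical column of numeric strings to show the actual string-to-int conversion. It also shows pd.to_numeric with errors='coerce' as one possible option for values that cannot be parsed; that part is an addition for illustration, not something covered in the steps above.

```python
import pandas as pd

# Hypothetical data: numbers stored as text
df = pd.DataFrame({'num_candidates': ['10', '20', '58']})

# Cast the string column to integers
as_int = df['num_candidates'].astype('int')
print(as_int.tolist())  # [10, 20, 58]

# Values that cannot be parsed become NaN instead of raising an error
raw = pd.Series(['10', 'n/a', '58'])
cleaned = pd.to_numeric(raw, errors='coerce')
print(cleaned.isna().tolist())  # [False, True, False]
```

As with the other conversions, astype() returns a new Series, so assign the result back to the column if you want to keep it.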