In a recent tutorial, we learn how to add rows to existing DataFrames in Pandas. Today, we’ll work with DF columns.
In this post, we would like to double click on several use cases that are foundational when wrangling tabular data with Pandas:
Adding columns into Python DataFrames
- New columns based on other columns
- Adding columns with default / constant / same value (could be a column of zeros).
- Inserting a column based on values in another DataFrame
- Adding new column to col index
- Create a Series from a column
- Export a column to a csv file
Create Data
# Import libraries
import pandas as pd
import numpy as np
np.random.seed(100)
# Create data frame, with specific column names
dates =pd.date_range(start="20210101",periods=5)
scores = pd.DataFrame(data = np.random.randint(60,100, size = (5,2)), columns = ['score_1', 'score_2'], index =dates )
display(scores)
Here’s our DataFrame:
score_1 | score_2 | |
---|---|---|
2021-01-01 | 68 | 84 |
2021-01-02 | 63 | 99 |
2021-01-03 | 83 | 75 |
2021-01-04 | 70 | 90 |
2021-01-05 | 94 | 62 |
2. Column based on other column values
Simple example that showcases a calculated column:
# based on other columns
scores['dif'] = scores['score_2']-scores['score_1']
Here’s our output:
score_1 | score_2 | dif | |
---|---|---|---|
2021-01-01 | 68 | 84 | 16 |
2021-01-02 | 63 | 99 | 36 |
2021-01-03 | 83 | 75 | -8 |
2021-01-04 | 70 | 90 | 20 |
2021-01-05 | 94 | 62 | -32 |
We could as well use a lambda function:
scores['dif2'] = scores.apply (lambda x:scores['score_2']-scores['score_1'])
3. Column with default values
Let’s assume we want to assign all our students to the Engineering department:
# with constant value
scores['area'] = 'Engineering'
scores
score_1 | score_2 | dif | area | |
---|---|---|---|---|
2021-01-01 | 68 | 84 | 16 | Engineering |
2021-01-02 | 63 | 99 | 36 | Engineering |
2021-01-03 | 83 | 75 | -8 | Engineering |
2021-01-04 | 70 | 90 | 20 | Engineering |
2021-01-05 | 94 | 62 | -32 | Engineering |
If we want to insert a column of zeros:
scores['zeros'] = 0
4. Column with values from other DataFrame
Let’s start by defining a new DataFrame from a list:
# from other dataframe
cities = pd.DataFrame(['NYC', 'FRA', 'SFO', 'MUC', 'SYD'], columns=['city'], index=dates)
We can then assign as following:
scores['city'] = cities['city']
scores
score_1 | score_2 | dif | city | ||
---|---|---|---|---|---|
2021-01-01 | 68 | 84 | 16 | NYC | |
2021-01-02 | 63 | 99 | 36 | FRA | |
2021-01-03 | 83 | 75 | -8 | SFO | |
2021-01-04 | 70 | 90 | 20 | MUC | |
2021-01-05 | 94 | 62 | -32 | SYD |
Or alternatively use pd.concat():
# recreating our original DF
scores = pd.DataFrame(data = np.random.randint(60,100, size = (5,2)), columns = ['score_1', 'score_2'], index =dates )
scores_cities = pd.concat([scores, cities], axis=1)
Note: the method above will work well if both DataFrames have a similar index. If that’s not the case you might want to use:
scores['city'] = cities['city'].values
scores
4. Adding new column to index
One we added a new column, we can easily add it to the DataFrame index:
scores_cities.set_index('city', append=True)
5. Create new Series from Column
We are easily able to drop column values to a Series.
city_s = scores_cities['city']
type(city_s)
Calling the type function of city_s will return:
pandas.core.series.Series
6. Export new column to CSV file
Last topic in today’s tutorial is show how you can easily export a specific column to a CSV file:
city_s.to_csv('cities.csv')