As part of the data wrangling process, we often need to slice and subset existing datasets to focus on the data most relevant to our analysis.
In today’s tutorial, we’ll show how to use Python and pandas to create a new DataFrame from a list of columns of an existing one.
Preparation
We’ll import the pandas library and create a simple dataset by reading a CSV file.
import pandas as pd
# construct a DataFrame
hr = pd.read_csv('hr_data.csv')
# display the column index
hr.columns
Here are the column labels / names:
Index(['language', 'month', 'salary', 'num_candidates', 'days_to_hire'], dtype='object')
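If you don't have hr_data.csv at hand, a small stand-in with the same column labels should be enough to follow along. This is only a sketch: the values below are invented for illustration and are not the tutorial's data.

```python
import pandas as pd

# Hypothetical stand-in for hr_data.csv: same column labels, invented values
hr = pd.DataFrame({
    'language': ['Python', 'R', 'Java', 'Python'],
    'month': ['January', 'January', 'February', 'March'],
    'salary': [120.0, 110.0, 115.0, 125.0],
    'num_candidates': [80, 65, 72, 90],
    'days_to_hire': [30, 45, 38, 28],
})

# The column labels as a plain Python list
column_names = hr.columns.tolist()
print(column_names)
```

Converting the column index with tolist() is handy when you want to build a list of columns to keep by filtering the labels.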
New DataFrame from a list of columns
# define list of column names
cols = ['language', 'num_candidates', 'days_to_hire']
# create a DataFrame by selecting those columns from the source DataFrame
subset = hr[cols]
Let’s verify the type of the created object.
type(subset)
As expected, the result is a DataFrame:
pandas.core.frame.DataFrame
Let’s look at the first rows of the new DataFrame:
subset.head()
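Two details of this selection are worth seeing side by side: a list of labels returns a DataFrame while a single label returns a Series, and an explicit copy() keeps later edits to the subset away from the source. The sketch below uses invented sample values, and the names as_frame, as_series and via_loc are only illustrative.

```python
import pandas as pd

# Hypothetical sample data with the tutorial's column labels
hr = pd.DataFrame({
    'language': ['Python', 'R', 'Java', 'Python'],
    'month': ['January', 'January', 'February', 'March'],
    'salary': [120.0, 110.0, 115.0, 125.0],
    'num_candidates': [80, 65, 72, 90],
    'days_to_hire': [30, 45, 38, 28],
})

cols = ['language', 'num_candidates', 'days_to_hire']

as_frame = hr[cols]          # list of labels -> DataFrame
as_series = hr['language']   # single label -> Series
via_loc = hr.loc[:, cols]    # equivalent label-based selection

# An explicit copy makes the subset independent of the source DataFrame
subset = hr[cols].copy()
subset.loc[0, 'num_candidates'] = 0

print(type(as_frame).__name__, type(as_series).__name__)
print(hr.loc[0, 'num_candidates'])  # the source keeps its original value
```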
DataFrame from multiple column index positions
In this example, we’ll construct a new DataFrame by selecting two columns from our source DataFrame, looking up their labels by position in the column index:
cols = [hr.columns[0], hr.columns[3]]
subset = hr[cols]
subset.head()
The result contains only the language and num_candidates columns.
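The same position-based selection can also be written with iloc, which takes integer positions directly instead of looking up the labels first. This is a sketch on invented sample values; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample data with the tutorial's column labels
hr = pd.DataFrame({
    'language': ['Python', 'R', 'Java', 'Python'],
    'month': ['January', 'January', 'February', 'March'],
    'salary': [120.0, 110.0, 115.0, 125.0],
    'num_candidates': [80, 65, 72, 90],
    'days_to_hire': [30, 45, 38, 28],
})

# Look up the labels by position, then select by label (as in the text)
by_label = hr[[hr.columns[0], hr.columns[3]]]

# Pass a list of positions to the column index in one step
by_index_list = hr[hr.columns[[0, 3]]]

# Select all rows and the columns at positions 0 and 3
by_position = hr.iloc[:, [0, 3]]

print(by_position.columns.tolist())
```

All three expressions select the same two columns; iloc is usually the shortest form when you only know the positions.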
Construct DataFrame from Series
In this case, we will use a Series to initialize a new DataFrame.
s = hr['language']
subset = pd.DataFrame(s)
subset.head()
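Besides the DataFrame constructor, a Series can be turned into a single-column DataFrame with to_frame(), or the column can be selected with double brackets in the first place. The sketch below compares the three on invented sample values; the new column name programming_language is just an example.

```python
import pandas as pd

# Hypothetical sample data
hr = pd.DataFrame({
    'language': ['Python', 'R', 'Java', 'Python'],
    'num_candidates': [80, 65, 72, 90],
})

s = hr['language']

from_constructor = pd.DataFrame(s)        # the approach used in the text
from_to_frame = s.to_frame()              # Series method, same result
from_double_brackets = hr[['language']]   # one-element list -> DataFrame

# to_frame can also rename the column on the way
renamed = s.to_frame(name='programming_language')

print(from_constructor.columns.tolist())
print(renamed.columns.tolist())
```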
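Add a column based on a Series
In this example, we will insert a column based on a pandas Series into an existing DataFrame. We start again from the two-column subset built from the column index values. pandas aligns the Series with the DataFrame by index label, so rows without a matching label are filled with NaN.
# recreate the two-column subset (language, num_candidates) as an independent copy
subset = hr[cols].copy()
# define a new Series holding the integers 0 to 19
s = pd.Series(range(20))
# insert the new Series as the last column
subset.insert(len(subset.columns), 'new_col', s)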
# look at the DataFrame column index
subset.columns
Here’s the result:
Index(['language', 'num_candidates', 'new_col'], dtype='object')
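Because insert() matches the Series to the DataFrame by index label, a Series shorter than the DataFrame leaves the unmatched rows empty. The sketch below shows that behaviour on a tiny invented DataFrame and, as an alternative, uses assign(), which returns a new DataFrame instead of modifying the existing one. The names short and another_col are illustrative.

```python
import pandas as pd

# Hypothetical four-row DataFrame
df = pd.DataFrame({'language': ['Python', 'R', 'Java', 'Python']})

# A Series with only two values: it carries index labels 0 and 1
short = pd.Series([10, 20])

# insert() modifies df in place; rows 2 and 3 have no matching label
df.insert(len(df.columns), 'new_col', short)
print(df['new_col'].isna().tolist())

# assign() leaves df untouched and returns a new DataFrame
with_assign = df.assign(another_col=range(len(df)))
print(with_assign.columns.tolist())
```

Note that insert() raises a ValueError if the column name already exists, unless allow_duplicates=True is passed.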