As part of your data acquisition pipeline you might need to read data from multiple sources, including database tables or views.
In this post we would like to show how you are able to leverage the Pandas and the SQLAlchemy Python libraries to read data from a database . In our case, we’ll use MySQL as the database, but as you are using the SQLAlchemy library, you can use it with minimal changes on PostgreSQL, SQLite, MS SQL Server, Oracle, Access and others.
Pandas pd.read_sql example
We’ll start by importing the Pandas and SQLAlchemy libraries into our Python data analysis libraries.
#Python3
import pandas as pd
from sqlalchemy import create_engine
Then create the connection string to the database:
# We'll define a connection with user root (password: root) to server localhost and MySQL database #world that is shipped with the software. Replace this as needed with your database credentials and #database
sqlengine = create_engine( "mysql://root:root@localhost/world?charset=utf8mb4",
isolation_level="READ UNCOMMITTED")
You might receive the following error:
ModuleNotFoundError: No module named 'MySQLdb'
If that’s the case, you might simply need to import the mysqlclient client library using pip.
Proceed to your Python3 terminal and type:
pip install mysqlclient
Next, we’ll execute a select SQL statement and extract the data into a Pandas dataframe:
# we'll populate the countries dataframe by reading the select * statement on the country table using the connection string defined above.
countries = pd.read_sql("select * from country", sqlengine)
Note: pd.read_sql can be used to retrieve complete table data or run a specific query. In that sense, it generalizes both pd.read_sql_table and pd.read_sql_query methods in Pandas.
Note: You are able to retrieve data from one or multiple columns in your table. Just tweak the select statement appropriately.
Back to our analysis. We can quickly explore the new dataframe:
countries.shape
(239, 15)
Let’s list the new dataframe columns:
countries.columns
We’ll get back the columns list
Index(['Code', 'Name', 'Continent', 'Region', 'SurfaceArea', 'IndepYear',
'Population', 'LifeExpectancy', 'GNP', 'GNPOld', 'LocalName',
'GovernmentForm', 'HeadOfState', 'Capital', 'Code2'],
dtype='object')
Now we can go ahead and plot the GNP by continent using Pandas. We can obviosuly improve the chart with Matplotlib and Seaborn.
continent_gnp = countries.groupby("Continent")["GNP"].mean()
continent_gnp.plot(kind="bar",figsize=(10,6));
Here’s our simple chart:
pd.read_sql() related errors
Possible errors and way to solve:
- Can’t connect to MySQL server on localhost (10061): Ensure that the MySQL server service is running and try again.
- ‘charmap’ codec can’t decode byte 0x81 in position 8: character maps to: Add the ?charset=utf8mb4″ to the connection string defined above.
Next steps
Now that you have your data captured into the Pandas dataframe, you can start your data wrangling and if needed, export the results in an Excel or csv formats.