In any organization that depends on continuous batches of data for decision-making analytics, it is essential to streamline and automate data processing workflows. Larger teams will usually consist of a Data Architect, who carefully creates the blueprints of the data infrastructure; a Data Engineer, who dives into the code to build out that infrastructure; a Data Analyst, who gathers and assesses the data needs across different functional teams and ensures the reliability of the data; and a Data Scientist, who uses the data to create business value through machine learning. For a data science team to work cohesively, I think it is important that every person on the team has some knowledge of the other members’ roles and functions. I also think this is the best way to elevate yourself as a team player and become a well-rounded data professional.
In this blog post, I aim to demonstrate how a Data Scientist can expand their data engineering knowledge and skills by creating simple data pipelines with Apache Airflow. In addition to Airflow, the technology stack includes Amazon S3, Snowflake, and Slack, to show how rich a Data Scientist’s toolkit can be. I hope to show how powerful these tools can be for improving your data products and data science projects.
There are many reasons why one might want to generate a recurring report on metrics that are reviewed at regular intervals. Maybe every Friday your supervisor wants to know the week-to-week performance of their employees, or senior analysts would like to know how much revenue the company made each day. Whatever the application may be, these repetitive checks on data become easily accessible once you build a data pipeline for them.
In this post, I demonstrate this pipeline by turning a standard Amazon Redshift query into an automated e-mail report. This post assumes that you have the following already set up and ready to use:
An Amazon Redshift data warehouse or database with tables to query from
The pipeline works as follows.
A cronjob is set up using Terminal in macOS. The cronjob schedules the times at which a Python script containing all the necessary code for a specific report is executed. The report is generated by loading data from Amazon Redshift, structuring it into a simple pandas dataframe, and converting that dataframe into an HTML table so that it can be displayed in an e-mail sent through Gmail.
Connect Amazon Redshift to Python and Load Data
First, we read in a simple query from Amazon Redshift by running the following code in a Python script:
import psycopg2 as pg
import pandas as pd
query = 'select * from mytable'
connection = pg.connect(dbname='database_name', host='host_url',
                        port='5439', user='user', password='password')
cursor = connection.cursor()
cursor.execute(query)
data = pd.DataFrame(cursor.fetchall())
cursor.close()
connection.close()
mytable is the table that contains the data for the report. Executing the code above loads a pandas dataframe, assigned to the variable data, into the Python environment. connection.close() closes the connection between Python and Amazon Redshift, since keeping it open can tie up memory and compute resources. Go here for further assistance in finding the appropriate values for the connection variable in the code above, such as database_name and host_url.
Format Pandas Dataframe into HTML
Next, we render the dataframe into an HTML table using the following code (this is the same styling code that appears in the full script later in the post):
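# Style the dataframe and render it as an HTML table
data_html = data.style
data_html = data_html.set_table_styles([{'selector': 'table', 'props': [('border-collapse', 'collapse')]},
                                        {'selector': 'table, td, th', 'props': [('border', '1px solid black')]}])
data_html = data_html.render()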
Running the above code converts the dataframe contents into HTML text. One important thing to note is that it produces only the body of the HTML table; no enclosing <html> or <body> tags are included. This HTML text is pasted into a string that becomes the e-mail message, which is described in the next section. For those familiar with HTML, the .set_table_styles method lets you style the table however you see fit.
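As a purely illustrative example (my own styling choices, not part of the original report), the header cells could be shaded by targeting the th selector:

# Hypothetical styling: shaded header row plus cell borders
data_html = data.style.set_table_styles([{'selector': 'th', 'props': [('background-color', '#f2f2f2'), ('border', '1px solid black')]},
                                         {'selector': 'td', 'props': [('border', '1px solid black')]}])
data_html = data_html.render()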
Format E-mail and Connect Python to Gmail
Here, we format the components of the e-mail and have Python connect to Gmail’s SMTP server.
import datetime
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
date = datetime.datetime.now()
date = date.strftime('%a %b %d')
fromaddress = 'artemio@artemiorimando.com'
toaddress = ['artemio@artemiorimando.com']
# Paste the rendered HTML table into the body of the e-mail
html = '<html><body>%s</body></html>' % data_html
message = MIMEMultipart("alternative", None, [MIMEText(html,'html')])
message['From'] = fromaddress
message['To'] = ', '.join(toaddress)
message['Subject'] = 'My Report - ' + date
server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddress, 'email_password')
text = message.as_string()
server.sendmail(fromaddress, toaddress, text)
server.quit()
For testing and debugging purposes, it is extremely useful to test the contents of the e-mail by sending it to and from yourself, as seen with toaddress and fromaddress. I use the datetime module to add the date to the subject line of the e-mail, which can be particularly useful for tracking e-mail reports. Notice that the html variable is a string of HTML code with the dataframe’s HTML body pasted into it. Another thing to note is that the variable toaddress is a list, which makes it possible to send the e-mail to multiple recipients. Finally, the smtplib module is used to connect to Gmail’s SMTP server; the SMTP host and port for Gmail are provided above. The only things left to do are to provide your Gmail account password, as seen with email_password, and to close the connection with server.quit() once the e-mail is sent.
Running the above code sends the table to your Gmail inbox with the subject line as coded. When you open the e-mail, you will see the HTML-formatted dataframe sitting nicely in the message.
Putting the Script Together and Scheduling a Cronjob
Once everything has been tested, all that is left is to tidy up the script, save it as a .py file, and set up a cronjob. The following Python code shows how the above demonstration looks as a single script. I decided to save it as myscript.py.
#!/Users/artemiorimando/anaconda3/bin/python
# -*- coding: utf-8 -*-
"""
Created on Thu Jul 18 21:06:10 2018
@author: artemiorimando
"""
import datetime
import psycopg2 as pg
import smtplib
import pandas as pd
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
# Write Query and Connect to Redshift Database
query = 'select * from mytable'
connection = pg.connect(dbname='database_name', host='host_url',
                        port='5439', user='user', password='password')
cursor = connection.cursor()
cursor.execute(query)
data = pd.DataFrame(cursor.fetchall())
cursor.close()
connection.close()
# Style Dataframe with HTML
data_html = data.style
data_html = data_html.set_table_styles([{'selector': 'table', 'props': [('border-collapse', 'collapse')]},
                                        {'selector': 'table, td, th', 'props': [('border', '1px solid black')]}])
data_html = data_html.render()
# Set up E-mail
date = datetime.datetime.now()
date = date.strftime('%a %b %d')
fromaddr = 'artemio@artemiorimando.com'
toaddr = ['artemio@artemiorimando.com']
html = '<html><body>%s</body></html>' % data_html
msg = MIMEMultipart("alternative", None, [MIMEText(html,'html')])
msg['From'] = fromaddr
msg['To'] = ', '.join(toaddr)
msg['Subject'] = 'My Report - ' + date
server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddr, 'email_password')
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
server.quit()
The big thing to note here is that it is important to have the path to your Python interpreter in the very first line of the script, as shown above. When using Spyder in Anaconda, this is done for you automatically. The last part is to type a few lines into Terminal, the macOS command-line application.
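As a sketch, assuming the script was saved to /Users/artemiorimando/myscript.py (adjust the path to wherever you saved yours), the two commands look like this:

chmod +x /Users/artemiorimando/myscript.py
crontab -e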
The first line gives your Mac permission to execute the Python script as if it were you running it. The second line opens an editor where you can set up a schedule for your script.
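Inside the editor, each job is a single line: the schedule followed by the path to the script. As a hypothetical example (the timing here is my own choice), the entry below would run the report every Friday at 8:00 a.m.:

0 8 * * 5 /Users/artemiorimando/myscript.py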
The first five fields of the crontab entry represent the schedule on which the cronjob executes your script. Read this useful blog post to get more of an idea of how you can schedule your script executions to run whenever you like.
After typing out the schedule and the path to your Python script as shown above, do the following steps:
press ctrl+o
hit return
press ctrl+x
With that, you will have successfully set up your cronjob! The cronjob will execute your Python script at the specified times: behind the scenes, data is pulled from Amazon Redshift, turned into a dataframe, compiled into a nice HTML table, and sent out in the body of an e-mail.
In this post, I demonstrate a simple ETL process for Yelp data by calling the Yelp API in Python, then transforming and loading the data from Python into a Microsoft SQL Server database. This process applies to any data science project that requires raw data to be extracted and stored so it can be consumed by other applications or used for further analysis.
Before we begin, the steps in this ETL process assume the following four things:
We have Microsoft SQL Server and SQL Server Management Studio installed. This guide can help us install both Microsoft SQL Server 2014 Express and SQL Server 2014 Management Studio.
We have Python and an IDE installed. This guide can help us install Anaconda which installs Python 3.6 and the Spyder IDE.
The pyodbc module is installed after Anaconda. Using the Anaconda Prompt, refer to these instructions to install pyodbc.
We have a valid connection between Microsoft SQL Server and other local systems through the ODBC Data Source Administrator tool. Follow these simple steps to set up this connection.
EXTRACTION
To extract the raw Yelp data, we must make an API call to Yelp’s repositories.
Obtain App ID and App Secret
First, we go to the Yelp Developer page and scroll to the bottom and click ‘Get Started’.
Next, we click on ‘Manage App’ in the left menu bar and record our App ID and App Secret. I whited out my App ID, but you would see some text there. We will need these values in order to call the API from Python.
Run Yelp API Python Script
Next, using the App ID and App Secret, we run the following Python script which calls the Yelp API. In this example, I will be requesting business data for Kiku Sushi, a sushi restaurant that I have ordered from a few times.
# We import the requests module which allows us to make the API call
import requests
# Replace [app_id] with the App ID and [app_secret] with the App Secret
app_id = '[app_id]'
app_secret = '[app_secret]'
data = {'grant_type': 'client_credentials',
'client_id': app_id,
'client_secret': app_secret}
token = requests.post('https://api.yelp.com/oauth2/token', data = data)
access_token = token.json()['access_token']
headers = {'Authorization': 'bearer %s' % access_token}
# Call Yelp API to pull business data for Kiku Sushi
biz_id = 'kiku-sushi-burnaby'
url = 'https://api.yelp.com/v3/businesses/%s' % biz_id
response = requests.get(url = url, headers = headers)
response_data = response.json()
A successful API call will return the data in JSON format which is read by Python as a dictionary object.
Notice how the url variable within the script is a string whose value depends on the Yelp API documentation provided specifically for requesting business data.
The Request section in the documentation tells you the appropriate url to use. The Yelp API documentation also provides a brief overview of the data points and data types received from the API call. The different data points and their respective data types are important to know when we load the data into the Microsoft SQL Server database later on.
Accessing the Dictionary
Using the documentation, we can extract a few data points of interest by accessing the dictionary as you normally would using Python syntax. The following lines of code will provide examples of some data extractions.
# Extract the business ID, name, price, rating and address
biz_id = response_data['id']
biz_name = response_data['name']
price = response_data['price']
rating = response_data['rating']
review_count = response_data['review_count']
location = response_data['location']
address = location['display_address']
street = address[0]
city_prov_pc = address[1]
country = address[2]
At this point, the extraction of the data is complete and we move onto transforming the data for proper storage into Microsoft SQL Server.
TRANSFORMATION
To transform the extracted data points, we simply reassign the data types. If we do not complete this step, we will run into data type conversion issues when storing it within Microsoft SQL Server.
The following code simply reassigns the data types to the extracted data points that we would like to store.
# Reassign data types to extracted data points
biz_id = str(biz_id)
biz_name = str(biz_name)
price = str(price)
rating = float(rating)
review_count = int(review_count)
street = str(street)
city_prov_pc = str(city_prov_pc)
country = str(country)
After the transformations are complete, we move into the final stage of loading the data into Microsoft SQL Server.
LOADING
In order to load data into a database such as Microsoft SQL Server, we need to ensure that we have a table created with the appropriate column names and column types.
Microsoft SQL Server Table Creation
After we log into our default database engine in SQL Server Management Studio, we set up and run the following T-SQL code.
-- Note that the number assigned to each varchar represents the maximum number of characters the data point can take up
CREATE TABLE Yelp (
    id varchar(50),
    name varchar(50),
    price varchar(5),
    rating float,
    review_count int,
    street varchar(50),
    city_prov_pc varchar(50),
    country varchar(50)
)
This effectively creates a table with the appropriate data types that allows us to store the Yelp data we extracted and transformed.
SELECT * FROM YELP
When we run the T-SQL code, we should see an empty table. This verifies successful table creation.
Transferring Data from Python to Microsoft SQL Server
The last step is to run a Python script that takes the data points and saves them into Microsoft SQL Server. We run the following Python code to accomplish this task.
# We import the pyodbc module which gives us the ability and functionality to transfer data straight into Microsoft SQL Server
import pyodbc
# Connect to the appropriate database by replacing [datasource_name] with the data source name as set up through the ODBC Data Source Administrator and by replacing [database] with the database name within SQL Server Management Studio
datasource_name = '[datasource_name]'
database = '[database_name]'
connection_string = 'dsn=%s; database=%s' % (datasource_name, database)
connection = pyodbc.connect(connection_string)
# After a connection is established, we write out the data storage commands to send to Microsoft SQL Server
cursor = connection.cursor()
cursor.execute('INSERT INTO YELP (id, name, price, rating, review_count, street, city_prov_pc, country) values (?, ?, ?, ?, ?, ?, ?, ?)', biz_id, biz_name, price, rating, review_count, street, city_prov_pc, country)
cursor.commit()
After this script is run, we can do a final check that the data has been successfully loaded onto the Microsoft SQL Server database by rerunning a Yelp table query. Once we do, we see that we have in fact successfully transferred the data over.
FURTHER WORK
This simple ETL process for Yelp data demonstrated how to tap into Yelp’s data repository using Python, handle some simple data type considerations, and load the data into Microsoft SQL Server.
One thing to note here is that we did not consider the more difficult data points to extract. For example, the Yelp API provides a data point corresponding to a restaurant’s operational hours which is stored as a dictionary within a list within a dictionary. Although not too difficult to extract, these kinds of data points do require more work.
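As a rough sketch of that extra work, assuming the response contains an hours field structured the way Yelp documents it (a list holding a dictionary whose open key is itself a list of per-day dictionaries):

# Hypothetical example: pull the operating hours out of the nested structure
hours = response_data.get('hours')
if hours:
    for day_slot in hours[0]['open']:
        print(day_slot['day'], day_slot['start'], day_slot['end'])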
Secondly, we should note that some data points are not always available because restaurant owners choose not to fill out that information. Also, as documented by Yelp, an API call will return no data if the restaurant does not have any reviews (even if it clearly has a Yelp page)! We would have to account for the potential errors that come from being unable to extract specific information. For example, we could set up try-except blocks in the Python code and have Microsoft SQL Server store NULL values.
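A minimal sketch of that idea, falling back to None (which pyodbc inserts as NULL) when a field is missing from the response:

# Hypothetical example: store None (NULL in SQL Server) when 'price' is not provided
try:
    price = str(response_data['price'])
except KeyError:
    price = None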
Another thing to note is that there are security and efficiency considerations when loading data into a database. This exercise did not consider database design, where it is almost always more efficient to have row keys and essential to keep data types as small as possible. It also did not demonstrate access to a secure database (where a username and password are required).
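For illustration only, connecting to a password-protected SQL Server database with pyodbc might look something like the sketch below; the driver name, server, database, and credentials are placeholders rather than values from this post:

import pyodbc

# Hypothetical secure connection; replace every placeholder with your own values
connection = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=your_server_name;'
    'DATABASE=your_database_name;'
    'UID=your_username;'
    'PWD=your_password'
)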
Although there is clearly more that could be done, this post shows some of the many ways we might build on this data. Now that the data is stored in a nice tabular format within Microsoft SQL Server, we can use it for further analysis or other purposes within our data science projects. Further work could automate the data extraction process and set up more advanced SQL tables. Finally, there are a wide variety of social media APIs out there to try out and master.