There are myriad reasons to generate recurring reports on metrics that are checked at regular intervals. Maybe every Friday your supervisor wants to know the week-over-week performance of their employees, or senior analysts would like to know how much revenue the company made each day. Whatever the application may be, these repetitive checks on data become easy to run once you create a data pipeline for them.
In this post, I demonstrate this pipeline by turning a standard Amazon Redshift query into an automated e-mail report. This post assumes that you have the following already set up and ready to use:
- An Amazon Redshift data warehouse or database with tables to query from
- A script editor or IDE, such as Visual Studio Code or Spyder (from Anaconda)
- Python modules pandas and psycopg2 installed
- A Gmail account that allows SMTP access (for example, via an app password)
The following diagram shows how the pipeline works.
A cronjob is set up using Terminal in macOS. The cronjob schedules the times at which a Python script containing all the necessary code for a specific report is executed. The report is generated by loading data from Amazon Redshift, structuring it into a simple pandas dataframe, and converting that dataframe into an HTML table so it can be displayed in an e-mail sent through Gmail.
Connect Amazon Redshift to Python and Load Data
First, we read in a simple query from Amazon Redshift by running the following code in a Python script:
```python
import psycopg2 as pg
import pandas as pd

query = 'select * from mytable'

connection = pg.connect(dbname='database_name', host='host_url', port='5439',
                        user='user', password='password')
cursor = connection.cursor()
cursor.execute(query)
data = pd.DataFrame(cursor.fetchall())
cursor.close()
connection.close()
```
mytable is the table that contains the data for the report. Executing the above loads a pandas dataframe, assigned to the variable data, into the Python environment. connection.close() closes the connection between Python and Amazon Redshift, since leaving it open ties up memory and other resources. Go here for further assistance in finding the appropriate values for the connection call above, such as database_name and host_url.
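One detail worth noting: cursor.fetchall() returns plain tuples, so the dataframe above ends up with numeric column labels. If you would rather have the report show the table's actual column names, a small variation like the sketch below, which uses the standard DB-API cursor.description attribute, takes care of it (the connection values are placeholders, as before).

```python
import psycopg2 as pg
import pandas as pd

# Placeholder connection details -- swap in your own cluster's values
connection = pg.connect(dbname='database_name', host='host_url', port='5439',
                        user='user', password='password')
cursor = connection.cursor()
cursor.execute('select * from mytable')

# cursor.description holds one entry per result column; the first field is its name
columns = [desc[0] for desc in cursor.description]
data = pd.DataFrame(cursor.fetchall(), columns=columns)

cursor.close()
connection.close()
```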
Format Pandas Dataframe into HTML
Next, we render the table into HTML using the following code:
```python
data_html = data.style
data_html = data_html.set_table_styles([
    {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
    {'selector': 'table, td, th', 'props': [('border', '1px solid black')]}
])
data_html = data_html.render()
```
Running the above code converts the dataframe contents into HTML text. One important thing to note is that it produces only the markup for the table itself; there are no surrounding <html> or <body> tags included. This HTML text is pasted into a string that becomes the body of the e-mail, which is described in the next section. For those familiar with HTML and CSS, the .set_table_styles method lets you style the table however you see fit.
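As one illustration of that flexibility (this particular styling is not from the original post), the same approach can pad the cells and shade the header row before rendering:

```python
data_html = data.style.set_table_styles([
    {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
    # Draw borders around the cells and give them some breathing room
    {'selector': 'td, th', 'props': [('border', '1px solid black'),
                                     ('padding', '4px 8px')]},
    # Shade the header row
    {'selector': 'th', 'props': [('background-color', '#f2f2f2')]},
]).render()
```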
Format E-mail and Connect Python to Gmail
Here, we format the components of the e-mail and have Python send it through Gmail's SMTP server.
```python
import datetime
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

date = datetime.datetime.now()
date = date.strftime('%a %b %d')

fromaddress = 'artemio@artemiorimando.com'
toaddress = ['artemio@artemiorimando.com']

# Paste the rendered table into a minimal HTML body
html = '<html><body>%s</body></html>' % data_html

message = MIMEMultipart("alternative", None, [MIMEText(html, 'html')])
message['From'] = fromaddress
message['To'] = ', '.join(toaddress)
message['Subject'] = 'My Report - ' + date

server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddress, 'email_password')
text = message.as_string()
server.sendmail(fromaddress, toaddress, text)
server.quit()
```
For testing and debugging purposes, it is extremely useful to check the contents of the e-mail by sending it to yourself, as seen with toaddress and fromaddress. I use the datetime module to add the date to the subject line, which is particularly useful for tracking e-mail reports. Notice that the html variable is a string of HTML with the rendered dataframe pasted into it. Another thing to note is that toaddress is a list, which lets you send the report to multiple recipients. Finally, the smtplib module is used to send the message through Gmail's SMTP server; the host and port used above are Gmail's standard SMTP settings. The only things left to do are to provide your Gmail account password in place of email_password, and to make sure that once the e-mail is sent, you close the connection with server.quit().
Running the above code sends your table to your Gmail inbox with the subject line as coded. When you click on the e-mail, you will see the HTML-formatted dataframe sitting nicely in the message.
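As an aside, one way to avoid hard-coding email_password is to read it from an environment variable. The sketch below is not from the original post; it assumes a variable named GMAIL_APP_PASSWORD (the name is just an example) and reuses fromaddress, toaddress, and message from the code above.

```python
import os
import smtplib

# Assumes the Gmail password (or app password) was exported beforehand,
# e.g. export GMAIL_APP_PASSWORD='...', so it never appears in the script.
password = os.environ['GMAIL_APP_PASSWORD']

server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddress, password)
server.sendmail(fromaddress, toaddress, message.as_string())
server.quit()
```

Keep in mind that cron runs with a minimal environment, so a variable exported in your shell profile may also need to be defined in the crontab itself.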
Putting the Script Together and Scheduling a Cronjob
Once everything has been tested, all that is left is to tidy up the script, save it as a .py file, and set up a cronjob. The following Python code shows how the above demonstration looks as a single script. I decided to save it as myscript.py.
```python
#!/Users/artemiorimando/anaconda3/bin/python
# -*- coding: utf-8 -*-
"""
Created on Thu Jul 18 21:06:10 2018

@author: artemiorimando
"""

import datetime
import psycopg2 as pg
import smtplib
import pandas as pd

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Write query and connect to the Redshift database
query = 'select * from mytable'

connection = pg.connect(dbname='database_name', host='host_url', port='5439',
                        user='user', password='password')
cursor = connection.cursor()
cursor.execute(query)
data = pd.DataFrame(cursor.fetchall())
cursor.close()
connection.close()

# Style the dataframe with HTML
data_html = data.style
data_html = data_html.set_table_styles([
    {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
    {'selector': 'table, td, th', 'props': [('border', '1px solid black')]}
])
data_html = data_html.render()

# Set up the e-mail
date = datetime.datetime.now()
date = date.strftime('%a %b %d')

fromaddr = 'artemio@artemiorimando.com'
toaddr = ['artemio@artemiorimando.com']

# Paste the rendered table into a minimal HTML body
html = '<html><body>%s</body></html>' % data_html

msg = MIMEMultipart("alternative", None, [MIMEText(html, 'html')])
msg['From'] = fromaddr
msg['To'] = ', '.join(toaddr)
msg['Subject'] = 'My Report - ' + date

server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddr, 'email_password')
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
server.quit()
```
The big thing to note here is that the path to your Python interpreter must appear in the very first line of the script (the shebang). When using Spyder in Anaconda, this is done for you automatically. The last part is to type a few lines of code in the macOS application, Terminal.
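The Terminal commands were originally shown as a screenshot; assuming a standard setup, they look something like this (the path to myscript.py is a placeholder, and `which python` will tell you the interpreter path to use in the shebang):

```bash
# Make the script executable
chmod +x /path/to/myscript.py

# Open the crontab editor to schedule the script
crontab -e
```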
The first line gives your Mac permission to execute the Python script as if it were you running it. The second line opens an editor where you can set up a schedule for your script.
The first five fields of the crontab entry represent the schedule on which the cronjob executes your script. Read this useful blog post to get more of an idea of how you can schedule your script executions whenever you like.
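As an illustration (the schedule and path below are placeholders, not taken from the original screenshot), an entry that runs the report every Friday at 9:00 AM would look like this:

```bash
# minute hour day-of-month month day-of-week  command
0 9 * * 5 /path/to/myscript.py
```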
After typing out the schedule and the path to your Python script, do the following steps:
- press ctrl+o
- hit return
- press ctrl+x
With that, you will have successfully set up your cronjob! It will execute your Python script at the specified times, and behind the scenes data is pulled from Amazon Redshift, turned into a dataframe, compiled into a nice HTML table, and sent out in the body of an e-mail.