Turning Amazon Redshift Queries into Automated E-mail Reports using Python in Mac OS X

There are myriad reasons to generate reports on metrics that are checked at regular intervals. Maybe every Friday your supervisor wants the week-over-week performance of their employees, or senior analysts want to know their company's revenue on a daily basis. Whatever the application may be, these repetitive checks on data become easily accessible once you create a data pipeline for them.

In this post, I demonstrate this pipeline by turning a standard Amazon Redshift query into an automated e-mail report. This post assumes that you have the following already set up and ready to use:

  1. An Amazon Redshift data warehouse or database with tables to query from
  2. A script editor or IDE, such as Visual Studio Code or Spyder from Anaconda
  3. Python modules pandas and psycopg2 installed
  4. A Gmail account that allows SMTP sign-in (for example, via an app password)

The following diagram shows how the pipeline works.

A cronjob is set up using Terminal in Mac OS. The cronjob executes, at scheduled times, a Python script containing all the necessary code for a specific report. The report is generated by loading data from Amazon Redshift, structuring it into a simple pandas dataframe, and converting that dataframe into an HTML table so that it can be displayed in an e-mail sent through Gmail.

Connect Amazon Redshift to Python and Load Data

First, we read in a simple query from Amazon Redshift by running the following code in a Python script:


import psycopg2 as pg
import pandas as pd

query = 'select * from mytable'

connection = pg.connect(dbname='database_name', host='host_url',
                        port='5439', user='user', password='password')

cursor = connection.cursor()
cursor.execute(query)

data = pd.DataFrame(cursor.fetchall())

cursor.close()
connection.close()

mytable is the table that contains the data for the report. Executing the above loads a pandas dataframe, assigned to the variable data, into the Python environment. connection.close() closes the connection between Python and Amazon Redshift, since connections left open tie up memory and database resources. The values needed for the pg.connect() call above, such as database_name and host_url, are available in your Amazon Redshift console.
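One wrinkle with this approach: cursor.fetchall() carries no column names, so the dataframe's columns come out as 0, 1, 2, and so on. The DB-API cursor.description attribute fixes that. The sketch below uses Python's built-in sqlite3 module as a stand-in (with a made-up table) so it runs without a Redshift cluster; the cursor code is identical with psycopg2:

```python
import sqlite3

# Stand-in for the Redshift connection: sqlite3 implements the same DB-API 2.0
# cursor interface as psycopg2, including cursor.description.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table mytable (name text, revenue real)")
cursor.execute("insert into mytable values ('alice', 120.5), ('bob', 99.0)")

cursor.execute("select * from mytable")

# The first element of each description tuple is the column name, so the
# dataframe gets real headers instead of integer positions:
columns = [desc[0] for desc in cursor.description]
rows = cursor.fetchall()
# data = pd.DataFrame(rows, columns=columns)  # same call as in the snippet above

cursor.close()
connection.close()
```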

Format Pandas Dataframe into HTML

Next, we render the table into HTML using the following code:


data_html = data.style
data_html = data_html.set_table_styles([
    {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
    {'selector': 'table, td, th', 'props': [('border', '1px solid black')]}
])
data_html = data_html.render()

Running the above code converts the dataframe contents into HTML text. One important thing to note is that it produces only the table markup; no enclosing <html> or <body> tags are included. This HTML text is pasted into a string that becomes the e-mail message, as described in the next section. For those familiar with HTML and CSS, the .set_table_styles method lets you style the table however you see fit.
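Because the rendered output has no enclosing tags, it needs to be wrapped in a full HTML document before it goes into the e-mail. A minimal sketch, using a stand-in string for the rendered table:

```python
# Stand-in for the output of data_html.render(); the real string is the same
# kind of <table> fragment, just longer.
data_html = "<table><tr><td>example</td></tr></table>"

# Wrap the fragment in <html>/<body> tags so mail clients display it properly.
html = "<html><body>%s</body></html>" % data_html
```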

Format E-mail and Connect Python to Gmail

Here, we format the components of the e-mail and connect Python to Gmail's SMTP server.


import datetime
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

date = datetime.datetime.now()
date = date.strftime('%a %b %d')

fromaddress = 'artemio@artemiorimando.com'
toaddress = ['artemio@artemiorimando.com']

html = '<html><body>%s</body></html>' % data_html

message = MIMEMultipart("alternative", None, [MIMEText(html, 'html')])
message['From'] = fromaddress
message['To'] = ', '.join(toaddress)
message['Subject'] = 'My Report - ' + date

server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddress, 'email_password')
text = message.as_string()
server.sendmail(fromaddress, toaddress, text)
server.quit()

For testing and debugging purposes, it is extremely useful to check the contents of the e-mail by sending it to and from yourself, as seen with toaddress and fromaddress. I use the datetime module to add the date to the subject line, which is particularly useful for tracking e-mail reports. Notice that the html variable is a string of HTML with the dataframe's table markup pasted into it, and that toaddress is a list, which lets you send the report to multiple recipients. Finally, the smtplib module connects to Gmail's SMTP server; the host and port (smtp.gmail.com, 587) are given above. The only things left to do are to supply your Gmail account password in place of email_password and, once the e-mail is sent, to close the connection with server.quit().
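Everything before the smtplib calls can be exercised offline: build the message and inspect its raw text without logging in to Gmail. A small sketch, reusing the same structure with a stand-in HTML body:

```python
import datetime
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Stand-in HTML body; in the real script this is the wrapped styler output.
html = "<html><body><table><tr><td>42</td></tr></table></body></html>"
date = datetime.datetime.now().strftime('%a %b %d')

message = MIMEMultipart("alternative", None, [MIMEText(html, 'html')])
message['From'] = 'artemio@artemiorimando.com'
message['To'] = ', '.join(['artemio@artemiorimando.com'])
message['Subject'] = 'My Report - ' + date

# The raw message text that smtplib would transmit:
text = message.as_string()
```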


Running the above code will have your table ready and sent to your Gmail inbox with the subject line as coded. When clicking on the e-mail, you will see the html formatted dataframe sitting nicely in the message.


Putting the Script Together and Scheduling a Cronjob

Once everything has been tested, all that is left is to tidy up the script, save it as a .py file, and set up a cronjob. The following Python code shows how the above demonstration looks as a single script. I decided to save it as myscript.py.


#!/Users/artemiorimando/anaconda3/bin/python
# -*- coding: utf-8 -*-
"""
Created on Thu Jul 18 21:06:10 2018

@author: artemiorimando
"""

import datetime
import psycopg2 as pg
import smtplib
import pandas as pd
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Write Query and Connect to Redshift Database
query = 'select * from mytable'

connection = pg.connect(dbname='database_name', host='host_url',
                        port='5439', user='user', password='password')

cursor = connection.cursor()
cursor.execute(query)
data = pd.DataFrame(cursor.fetchall())
cursor.close()
connection.close()

# Style Dataframe with HTML
data_html = data.style
data_html = data_html.set_table_styles([
    {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
    {'selector': 'table, td, th', 'props': [('border', '1px solid black')]}
])
data_html = data_html.render()

# Set up E-mail
date = datetime.datetime.now()
date = date.strftime('%a %b %d')

fromaddr = 'artemio@artemiorimando.com'
toaddr = ['artemio@artemiorimando.com']

html = '<html><body>%s</body></html>' % data_html

msg = MIMEMultipart("alternative", None, [MIMEText(html,'html')])
msg['From'] = fromaddr
msg['To'] = ', '.join(toaddr)
msg['Subject'] = 'My Report - ' + date

server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddr, 'email_password')
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
server.quit()

The big thing to note here is that the path to your executable Python interpreter must appear in the very first line of the script. When using Spyder from Anaconda, this is done for you automatically. The last part is to type a few lines of code into the Mac OS application, Terminal.
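If you are unsure what to put on that first line, the shell can tell you (shown here for python3; an Anaconda install may expose a differently named or located interpreter):

```shell
# Prints the absolute path of the interpreter; paste it after "#!" in the script.
command -v python3
```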

Two Terminal commands are needed: the first gives your Mac permission to execute the Python script as if it were you running it, and the second, crontab -e, opens an editor where you can set up a schedule for your script.
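For illustration, the two commands look like the following (using a throwaway path in /tmp so the snippet runs as-is; substitute the real path to myscript.py):

```shell
touch /tmp/myscript.py      # stand-in for your saved report script
chmod +x /tmp/myscript.py   # first command: make the script executable
ls -l /tmp/myscript.py      # the listed mode now includes the x bits
# crontab -e                # second command: opens the cron editor (interactive)
```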

The first five space-separated fields of a crontab entry represent the times at which the cronjob executes your script. A quick read on cron scheduling syntax will show how you can schedule your script executions whenever you like.
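As an illustration, a crontab entry that runs the report every Friday at 8:00 AM might look like this (the schedule and path are examples, not requirements):

```
# min  hour  day-of-month  month  day-of-week  command
0 8 * * 5 /Users/artemiorimando/myscript.py
```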

After typing out the schedule and the path to your Python script in the crontab editor, do the following steps:

  1. press ctrl+o 
  2. hit return
  3. press ctrl+x

With that, you will have successfully set up your cronjob! At the scheduled times, it executes your Python script, which in the backend pulls data from Amazon Redshift into a dataframe, compiles it into a nice HTML table, and sends it out in the body of an e-mail.
