Artemio Rimando - Evaluating a life measured in smiles | A data scientist lifestyle blog
Applied Learning•Data Science

Data Science for Social Change: Reporting on Suicide

As part of the data science team at Cisco, I’ve had the privilege of using my skills for social impact. This project aims to shift the way the media reports on suicide using artificial intelligence. Articles written about suicide often contain harmful or graphic language that can act as a contagion for those living with a mental illness or experiencing suicidal ideation.

The data science development team has partnered with The Erika Legacy Foundation and worked with mental health experts from The World Health Organization to develop a web-based tool that uses machine learning and natural language processing to identify harmful language in articles.

A Cisco-specific blog publication can be found here, where the VP of Data & Analytics at Cisco, Shanthi Iyer, speaks more about this initiative.

The tool can be found here.

It has been a huge blessing to be a part of this effort, and I look forward to seeing the evolution of the tool.

Amazon S3•Applied Learning•Data Engineering•Data Science•Python•Slack•Snowflake

Data Engineering using Airflow with Amazon S3, Snowflake and Slack

In any organization that depends on continuous batches of data for decision-making analytics, it is essential to streamline and automate data processing workflows. Larger teams usually consist of a Data Architect, who carefully creates the blueprints of the data infrastructure; a Data Engineer, who dives into the code to build out that infrastructure; a Data Analyst, who gathers and assesses the data needs across functional teams and ensures the reliability of the data; and a Data Scientist, who uses the data to create business value through machine learning. For a data science team to work cohesively, I think it is important that every person on the team has some knowledge of the other members’ roles and functions. I also think this is the best way to elevate yourself as a team player and become a well-rounded data professional.

In this blog post, I aim to demonstrate how a Data Scientist can expand their data engineering knowledge and skills by creating simple data pipelines with Apache Airflow. In addition to Airflow, this post includes Amazon S3, Snowflake and Slack as part of the technology stack to demonstrate how fruitful a Data Scientist’s toolkit can be. I hope to show how powerful these tools can be for improving your data products and data science projects.

Continue reading
Applied Learning•Data Science

5 Things I Learned from Building Data Pipelines from the Ground Up

Over the past year, I have continued to refine my career as a data scientist and expand my technological skillset. One aspect of this career growth was joining Game Hive as its first Data Scientist and being part of a hardworking team jumpstarting its internal data science and business intelligence processes. This new beginning has given me exposure to digital marketing, where I work with rich mobile gaming user-level and device-level data, drawing on different digital marketing platform APIs and game app-related data. With this new kind of exposure, I have engaged in some really interesting data pipelining experiences and have pushed myself to learn more and apply these learnings to further my growth in data analytics.

From gathering digital marketing data to building out a data warehouse to building dashboards and performing predictive modelling, here are 5 things I learned from building data pipelines from the ground up.

1. Finding the right tools and technologies for the pipeline is a real balancing act

There needs to be some pre-existing knowledge of the available technologies out there to be able to piece together how the data pipeline will fit together. I needed to consider the value of time, human resources, and technological costs when selecting the appropriate tools for the job and preparing an execution plan. The following were valid questions to ask when I prepared to build out a pipeline:

  • Is a data warehouse a viable solution to store your data?
  • Are there going to be multiple databases or a big data warehouse or both?
  • Who is going to be responsible and accountable for the data warehouse?
  • What are the costs for this data warehouse? Do we know our data ingestion and storage needs enough to propose a customizable plan? Is it within budget?
  • What skills are needed to build the data warehouse? Is this a one person job or do we need a team? What human resources currently exist to support this project?
  • Who are the end users of this data? How much of a learning curve is there if we adopt new analytical technologies?
  • Do end-users require fast and quick data insights? Do they require some flexibility to dive into the data themselves?
  • How much will it cost to adopt a new analytics platform?
  • How long will the pipeline take to build? What is a reasonable timeline to drive insights as soon as possible?
  • How long will it take to test and QA the pipeline?

By taking some ownership in building out the data pipeline, there were decisions that needed to be made based on my current technical skill level and my understanding of the needs of the organization. Buying the right tools and technologies that suited the workflows and processes of the organization while staying within budget was also an important consideration.

2. It is important to have a deeper understanding of the components of a data file, the best practices for file storage, and how these affect data loading

To most professionals who digest or analyze data, this may seem trivial or obvious, but I definitely learned not to take these things for granted (I appreciate the little data things!). I learned to pay closer attention to the components of a csv file that help define the difference between one data point and another, as well as the difference between the positions of data points in the document. Sometimes csv files contain messy data, such as columns with unpredictable strings. These strings might contain the same character as the delimiter that tells us how columns are separated from one another. We either have to learn how to handle these throughout the entire document, or check whether a data ingestion tool can distinguish a delimiter from a normal string containing the same character. The most popular delimiter is the “,” comma, and countless times I have seen the comma act as a delimiter while also co-existing inside a string.
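
As a minimal sketch of handling this in Python (assuming the messy values are at least quoted, which is what lets a proper parser tell a literal comma apart from a delimiter), compare naive splitting with the csv module:

import csv
import io

# A row where the second field contains a comma inside quotes.
raw = 'device,description,installs\niPhone X,"64GB, Space Grey",120\n'

# Naive splitting breaks the quoted field into two pieces.
naive_rows = [line.split(',') for line in raw.strip().splitlines()]
print(naive_rows[1])   # ['iPhone X', '"64GB', ' Space Grey"', '120'] -- wrong

# csv.reader respects the quoting and keeps the field intact.
parsed_rows = list(csv.reader(io.StringIO(raw)))
print(parsed_rows[1])  # ['iPhone X', '64GB, Space Grey', '120']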

Best practices for file storage include knowing the best ways to organize and retrieve data depending on its use. If you have data that depends on time and will most likely be used for time series analysis, or if your historical data needs to be represented visually, then it may be beneficial to include the date in your file name. This becomes extremely useful when your data begins to scale with the organization. For example, a sales database probably won’t contain data from 3 years ago, but you might want data from that specific time range for future analyses. Loading and retrieving that data becomes easier since it is easily identifiable by the file name.
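
As a small, hypothetical example of this naming convention (the bucket and prefix are made up for illustration):

from datetime import date

# Hypothetical date-stamped file name; the bucket and prefix are placeholders.
file_date = date(2019, 6, 30)
file_name = f"sales_{file_date.isoformat()}.csv"            # sales_2019-06-30.csv
s3_key = "s3://my-example-bucket/daily-sales/" + file_name

print(s3_key)  # files for any date range can then be found by their key prefix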

3. It is necessary to have deeper knowledge about metadata

When ingesting data into a database, it becomes so important to know not only the contents of your data, but also the data that describes your data! For example, a table I created contained different brand names of phones. I assumed that phone names could not possibly be longer than maybe 15 to 20 characters. My own personal bias led me to believe that phone names could never be longer than names such as “iPhone X” or “Samsung Galaxy Note”, and was I ever wrong. I encountered a situation where the name of a device had 72 characters in it! I learned that best practice in building out tables within a database requires a deeper understanding of the contents of your data columns, the data types, and the appropriate amount of space needed to store that data.

When ingesting data coming from other data sources or APIs, your best bet is to read the documentation! It is so important to familiarize yourself with the details of their metadata, so that you set yourself up for success by translating the appropriate data types and schemas into your own tables. The next paragraph describes the opposite situation: something I would not recommend in principle, but have nonetheless done during my learning experiences because I had not thought of a better solution.

I have learned to prepare for new metadata that I have not seen before and whose data types I could not determine ahead of time. For example, there was a situation where I attempted to store a string of length 90, but the column it was stored in was set to varchar(75), so the table failed to record it. This happened simply because I had no way of determining the maximum size of the strings being ingested. It may be wise to prepare for unpredictable data ingestions due to a lack of documentation or questionable data architecture. In this situation, if the longest string I had ever seen was 100 characters, I would probably set the column data type to varchar(200). You might be wondering how a business with this data strategy survived without any proper data architecture, and my only response is that some businesses thrive on their core service and product. Sometimes it takes years to redevelop or redesign an internal data infrastructure, and some may not see the benefit in doing so, or shy away from treating it as urgent. Why fix something that isn’t broken?
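
A tiny sketch of that headroom heuristic, using made-up device names and a hypothetical table, might look like this:

import pandas as pd

# Hypothetical device names; the real ones came from ad-network and app data.
devices = pd.Series([
    "iPhone X",
    "Samsung Galaxy Note",
    "a-72-character-device-name-reported-by-some-obscure-handset-manufacturer",
])

observed_max = int(devices.str.len().max())   # longest string seen so far
column_size = max(2 * observed_max, 100)      # leave generous headroom

ddl = f"CREATE TABLE devices (device_name VARCHAR({column_size}));"
print(ddl)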

4. It is possible to seamlessly handle errors while monitoring data ingestion and data manipulation

One of the cooler parts of the data pipelining process is the ability to monitor how successful your data ingestion processes were. One important thing I learned is to manage the data pipeline in such a way that if any errors occur during processing, I am notified right away and can fix the problem directly and with ease. For example, I worked on a data pipeline that downloaded data using an API and pushed it into a table inside a database. Sometimes the Python script would fail to communicate with the third-party data provider’s server, resulting in a server error. One way to handle future occurrences of this error was to set up error-handling cases within the script so that I could easily re-run the API call without having to touch the code.
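
As a hedged sketch of that kind of error handling (the endpoint is made up, and the real script’s retry policy may well differ), a simple retry loop around the API call could look like this:

import time
import requests

def fetch_report(url, retries=3, backoff_seconds=30):
    # Call a third-party API, retrying on failures instead of crashing the whole pipeline.
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=60)
            response.raise_for_status()        # raise on 4xx/5xx responses
            return response.json()
        except requests.RequestException:
            if attempt == retries:
                raise                           # give up after the final attempt
            time.sleep(backoff_seconds * attempt)

# data = fetch_report("https://api.example.com/v1/installs")   # hypothetical endpoint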

Another thing I learned is that the job of monitoring and validating your data can fit seamlessly into your workflows. At Game Hive, most of my communication with the team is done through Slack, and it was here that I realized Slack was not only my main form of communication with my team, but could also be my main form of communication with myself! Within my data ingestion Python scripts, I would include a little messaging snippet so that whenever an error occurred, I would receive a Slack notification indicating the source of the error, the reason behind it, and how to address it quickly.
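
A minimal sketch of such a snippet, assuming a Slack incoming webhook has been set up for the channel (the webhook URL, message text, and ingestion step below are placeholders):

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder webhook URL

def notify_slack(message):
    # Post a short alert to a Slack channel through an incoming webhook.
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

def run_ingestion():
    # Placeholder for the real data ingestion step.
    raise RuntimeError("third-party server returned a 500 error")

try:
    run_ingestion()
except Exception as error:
    notify_slack(f"Ingestion failed: {error}. Check the provider status and re-run the API call.")
    raise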

5. The best kind of data validation practice is a team effort, not a solo endeavour

There are many points along the data pipeline that allow stakeholders to really maximize the integrity of the data being used. Reiterating learning point 4, on the programming and engineering front, error notifications can also be used to verify that the correct number of rows and columns from the data set was uploaded to the database. Any uploading errors that occur in that process can be sent as notifications to ensure that all the correct data has been uploaded.
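
As a hedged sketch of that kind of check (the file name, table, and credentials are placeholders carried over from the earlier examples, and in practice the mismatch message would go to Slack rather than stdout):

import pandas as pd
import psycopg2 as pg

source = pd.read_csv("sales_2019-06-30.csv")        # hypothetical source file

connection = pg.connect(dbname='database_name', host='host_url',
                        port='5439', user='user', password='password')
cursor = connection.cursor()
cursor.execute("SELECT COUNT(*) FROM mytable")      # hypothetical target table
loaded_rows = cursor.fetchone()[0]
cursor.close()
connection.close()

if loaded_rows != len(source):
    print(f"Row count mismatch: file has {len(source)} rows, table has {loaded_rows}.")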

During the analytical processes of end users, it is also crucial to have experienced and number-savvy members of your team look at the data being ingested, through dashboards and analytics. Even the slightest numerical trend can be detected by an experienced professional, and it becomes good practice to communicate those internal gut feelings about whether the data looks wrong.

Obviously, the fewer human-made checkpoints there are in the data pipeline, the more time and effort can go into actually using the data to derive invaluable insights. It does not hurt to have different stages of data validation throughout the pipeline, but there are ways to strike a balance between accuracy, efficiency, and credibility among your data and your peers.

 

 

A special thanks to the lovely Michael Ma for inspiring me to write about my experiences in data pipelining.

Amazon Redshift•Applied Learning•Data Science•How-To•Python

Turning Amazon Redshift Queries into Automated E-mail Reports using Python in Mac OS X

There are a myriad of reasons why one would want to generate a report on metrics that are reviewed at regular frequencies. Maybe every Friday your supervisor wants to know the week-to-week performance of their employees, or senior analysts would like to know the amount of revenue made by the company on a daily basis. Whatever the application may be, these repetitive checks on data can be made easily accessible once you create a data pipeline that allows for it.

In this post, I demonstrate this pipeline by turning a standard Amazon Redshift query into an automated e-mail report. This post assumes that you have the following already set up and ready to use:

  1. An Amazon Redshift data warehouse or database with tables to query from
  2. A script editor or IDE, such as Visual Studio Code or Spyder from Anaconda
  3. Python modules pandas and psycopg2 installed
  4. A Gmail account that allows SMTP access from your script (the code below sends mail through smtplib rather than the Gmail API)

The following diagram shows how the pipeline works.

[Diagram: e-mail automation pipeline]

A cronjob is set up using Terminal in macOS. The cronjob schedules the times at which it executes a Python script containing all the necessary code for a specific report. The report is generated by loading data from Amazon Redshift, structuring it into a simple pandas dataframe, and converting the dataframe into an HTML table so that it can be displayed in an e-mail sent through Gmail.

Connect Amazon Redshift to Python and Load Data

First, we read in a simple query from Amazon Redshift by running the following code in a Python script:


import psycopg2 as pg
import pandas as pd

query = 'select * from mytable'

# Connect to the Redshift cluster; swap in your own credentials for these placeholders.
connection = pg.connect(dbname='database_name', host='host_url',
                        port='5439', user='user', password='password')

cursor = connection.cursor()
cursor.execute(query)

# Load the query result into a pandas dataframe.
# (Column names are not set here; they can be taken from cursor.description if needed.)
data = pd.DataFrame(cursor.fetchall())

cursor.close()
connection.close()

[Screenshot: sample rows from mytable in Redshift]

mytable is the table shown above that contains the data for the report. Executing the above loads a pandas dataframe, assigned to the variable data, into the Python environment. connection.close() is used to close the connection between Python and Amazon Redshift, since keeping it open can tie up memory and other computer resources. Go here for further assistance in getting the appropriate information for the connection variable in the code above, such as database_name and host_url.

Format Pandas Dataframe into HTML

Next, we render the table into HTML using the following code:


# Style the dataframe and render it as an HTML table body.
data_html = data.style
data_html = data_html.set_table_styles([
    {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
    {'selector': 'table, td, th', 'props': [('border', '1px solid black')]}
])
data_html = data_html.render()

[Screenshot: HTML text produced by rendering the styled dataframe]

Running the above code converts the dataframe contents into HTML text, as seen above. One important thing to note is that it only produces the body of the HTML; notice that no enclosing <html> or <body> tags are included. This HTML text is pasted into a string that becomes the e-mail message, as described in the next section. For those familiar with HTML, the .set_table_styles method lets you style the table however you see fit.

Format E-mail and Connect Python to Gmail

Here, we format the components of the e-mail and have Python connect to Gmail’s SMTP server.


import datetime
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Date string for the subject line, e.g. 'Thu Jul 19'.
date = datetime.datetime.now()
date = date.strftime('%a %b %d')

fromaddress = 'artemio@artemiorimando.com'
toaddress = ['artemio@artemiorimando.com']

# Paste the rendered table body into a full HTML document.
html = '<html><body>%s</body></html>' % data_html

message = MIMEMultipart("alternative", None, [MIMEText(html, 'html')])
message['From'] = fromaddress
message['To'] = ', '.join(toaddress)
message['Subject'] = 'My Report - ' + date

# Send the message through Gmail's SMTP server.
server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddress, 'email_password')
text = message.as_string()
server.sendmail(fromaddress, toaddress, text)
server.quit()

For testing and debugging purposes, it is extremely useful to test the contents of the e-mail by sending it to, and receiving it from, yourself, as seen with toaddress and fromaddress. I use the datetime module to add the date to the subject line of the e-mail, which can be particularly useful for tracking e-mail reports. Notice that the html variable is a string of HTML with the dataframe’s HTML body pasted into it. Another thing to note is that toaddress is a list, which provides the functionality to send the e-mail to multiple recipients. Finally, the smtplib module is used to connect to Gmail’s SMTP server; the host and port for Gmail are given above. The only things left to do are to provide your Gmail account password, as seen with email_password, and to ensure that once the e-mail is sent, you close the connection with server.quit().

[Screenshot: the report e-mail as it appears in the Gmail inbox]

Running the above code sends the table to your Gmail inbox with the subject line as coded. When you open the e-mail, you will see the HTML-formatted dataframe sitting nicely in the message.

[Screenshot: the HTML table rendered in the e-mail body]

Putting the Script Together and Scheduling a Cronjob

Once everything has been tested, all that is left is to tidy up the script, save it as a .py file, and set up a cronjob. The following Python code shows how the above demonstration would look as a single script. I decided to save it as myscript.py.


#!/Users/artemiorimando/anaconda3/bin/python
# -*- coding: utf-8 -*-
"""
Created on Thu Jul 18 21:06:10 2018

@author: artemiorimando
"""

import datetime
import psycopg2 as pg
import smtplib
import pandas as pd
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Write Query and Connect to Redshift Database
query = 'select * from mytable'
connection = pg.connect(dbname='database_name', host='host_url',
                        port='5439', user='user', password='password')

cursor = connection.cursor()
cursor.execute(query)
data = pd.DataFrame(cursor.fetchall())
cursor.close()
connection.close()

# Style Dataframe with HTML
data_html = data.style
data_html = data_html.set_table_styles([
    {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
    {'selector': 'table, td, th', 'props': [('border', '1px solid black')]}
])
data_html = data_html.render()

# Set up E-mail
date = datetime.datetime.now()
date = date.strftime('%a %b %d')

fromaddr = 'artemio@artemiorimando.com'
toaddr = ['artemio@artemiorimando.com']

html = '<html><body>%s</body></html>' % data_html

msg = MIMEMultipart("alternative", None, [MIMEText(html, 'html')])
msg['From'] = fromaddr
msg['To'] = ', '.join(toaddr)
msg['Subject'] = 'My Report - ' + date

server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
server.login(fromaddr, 'email_password')
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
server.quit()

The big thing to note here is that you must include the path to your Python interpreter in the very first line of the script (the shebang). When using Spyder in Anaconda, this is done for you automatically. The last part is to type a few lines into the macOS application Terminal.

[Screenshot: Terminal commands used to set up the cronjob]

The first line gives your Mac permission to execute the Python script as if it were you running it. The second line opens an editor where you can set up a schedule for your script.
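
The screenshot itself is not reproduced here, but the two commands described are typically along these lines; treat the exact path and flags as assumptions rather than the original commands:

chmod a+x /Users/artemiorimando/myscript.py    # let the script be executed directly
crontab -e                                     # open the crontab editor to add a schedule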

[Screenshot: crontab entry scheduling the Python script]

The first five fields of the crontab entry represent the times at which the cronjob executes your script. Read this useful blog post to get more of an idea of how you can schedule your script executions whenever you like.
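
For instance, a hypothetical entry that runs myscript.py every Friday at 9:00 AM would look like the following (the script path is a placeholder):

# minute, hour, day of month, month, day of week, then the command to run
0 9 * * 5 /Users/artemiorimando/myscript.py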

After typing out the time and the pathway to your Python script as seen in the above image, do the following steps:

  1. press ctrl+o 
  2. hit return
  3. press ctrl+x

With that, you will have successfully set up your cronjob! The cronjob will execute your Python script at the specified times; in the backend, data is pulled from Amazon Redshift and turned into a dataframe, compiled into a nice HTML table, and pasted into the body of an e-mail that is then sent out.


Meet Artemio!

Artemio is a Torontonian-at-heart living in Vancouver, BC. You can find him in and around the city sipping bubble tea and playing Pokemon GO.

Welcome!

You will find blog posts written about a passion for data science, travel, and the joys of life.


© 2019 ARTEMIO RIMANDO // All rights reserved.