Extract, Transform, and Load Yelp Data using Python and Microsoft SQL Server

In this post, I will demonstrate a simple ETL process of Yelp data by calling the Yelp API in Python, and transforming and loading the data from Python into a Microsoft SQL Server database. This process is exemplary for any data science project that requires raw data to be extracted and stored to be consumed by other applications or used for further analysis.

Before we begin, this ETL process assumes the following five things:

  1. We have a verified and registered Yelp account.
  2. We have Microsoft SQL Server and SQL Server Management Studio installed. This guide can help us install both Microsoft SQL Server 2014 Express and SQL Server 2014 Management Studio.
  3. We have Python and an IDE installed. This guide can help us install Anaconda which installs Python 3.6 and the Spyder IDE.
  4. We have the pyodbc module installed. After installing Anaconda, pyodbc can be installed from the Anaconda Prompt; refer to these instructions.
  5. We have a valid connection between Microsoft SQL Server and other local systems through the ODBC Data Source Administrator tool. Follow these simple steps to set up this connection.

EXTRACTION

To extract the raw Yelp data, we must make an API call to Yelp’s repositories.

Obtain App ID and App Secret

First, we go to the Yelp Developer page and scroll to the bottom and click ‘Get Started’.

[Screenshot: Get_Started]

Next, we click 'Manage App' in the left menu bar and record our App ID and App Secret. (I have whited out the App ID below, but you would see text there.) We will need these values in order to call the API from Python.

[Screenshot: My_App]

Run Yelp API Python Script

Next, using the App ID and App Secret, we run the following Python script which calls the Yelp API. In this example, I will be requesting business data for Kiku Sushi, a sushi restaurant that I have ordered from a few times.


# We import the requests module, which allows us to make the API call
import requests

# Replace [app_id] with the App ID and [app_secret] with the App Secret
app_id = '[app_id]'
app_secret = '[app_secret]'

# Request an OAuth2 bearer token using the app credentials
data = {'grant_type': 'client_credentials',
        'client_id': app_id,
        'client_secret': app_secret}
token = requests.post('https://api.yelp.com/oauth2/token', data=data)
access_token = token.json()['access_token']
headers = {'Authorization': 'bearer %s' % access_token}

# Call the Yelp API to pull business data for Kiku Sushi
biz_id = 'kiku-sushi-burnaby'
url = 'https://api.yelp.com/v3/businesses/%s' % biz_id
response = requests.get(url=url, headers=headers)
response_data = response.json()

A successful API call will return the data in JSON format, which Python reads as a dictionary object.

[Screenshots: Response_Data, Response_Data_2]
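Before extracting any fields, it is worth confirming that the call actually succeeded. Here is a minimal check, assuming the response object from the script above:

# Check the HTTP status code before parsing the payload
if response.status_code == 200:
    response_data = response.json()
else:
    print('Request failed with status %d: %s' % (response.status_code, response.text))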

Notice that the url variable in the script is a string whose format comes from the Yelp API documentation for requesting business data.

[Screenshot: Yelp_Biz_Request]

The Request section in the documentation tells us the appropriate URL to use. The documentation also provides a brief overview of the data points and data types returned by the API call. Knowing the different data points and their respective data types is important when we load the data into the Microsoft SQL Server database later on.

Accessing the Dictionary

Using the documentation, we can extract a few data points of interest by accessing the dictionary with ordinary Python syntax. The following lines of code provide examples of some data extractions.


# Extract the business ID, name, price, rating, review count and address
biz_id = response_data['id']
biz_name = response_data['name']
price = response_data['price']
rating = response_data['rating']
review_count = response_data['review_count']
location = response_data['location']
address = location['display_address']  # a list of display lines
street = address[0]
city_prov_pc = address[1]
country = address[2]
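If we are not certain a field exists for a given business, a more defensive variation (hypothetical, not part of the original script) is to use dict.get so that a missing key yields None instead of raising a KeyError:

# Defensive variation: missing fields become None rather than raising KeyError
price = response_data.get('price')
rating = response_data.get('rating')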

At this point, the extraction of the data is complete and we move onto transforming the data for proper storage into Microsoft SQL Server.

TRANSFORMATION

To transform the extracted data points, we simply reassign their data types. If we do not complete this step, we will run into data type conversion issues when storing the data in Microsoft SQL Server.

The following code simply reassigns the data types to the extracted data points that we would like to store.


# Reassign data types to extracted data points
biz_id = str(biz_id)
biz_name = str(biz_name)
price = str(price)
rating = float(rating)
review_count = int(review_count)
street = str(street)
city_prov_pc = str(city_prov_pc)
country = str(country)

After the transformations are complete, we move into the final stage of loading the data into Microsoft SQL Server.

LOADING

In order to load data into a database such as Microsoft SQL Server, we need to ensure that a table has been created with the appropriate column names and column types.

Microsoft SQL Server Table Creation

After we log into our default database engine in SQL Server Management Studio, we set up and run the following T-SQL code.


-- The number assigned to each varchar is the maximum number of characters that the data point can occupy

CREATE TABLE Yelp (
    id varchar(50),
    name varchar(50),
    price varchar(5),
    rating float,
    review_count int,
    street varchar(50),
    city_prov_pc varchar(50),
    country varchar(50)
)

This effectively creates a table with the appropriate data types that allows us to store the Yelp data we extracted and transformed.


SELECT * FROM Yelp

When we run the T-SQL code, we should see an empty table. This verifies successful table creation.

[Screenshot: Empty_Table]

Transferring Data from Python to Microsoft SQL Server

The last step is to run a Python script that takes the data points and saves them into Microsoft SQL Server. We run the following Python code to accomplish this task.


# We import the pyodbc module, which gives us the functionality to transfer data straight into Microsoft SQL Server
import pyodbc

# Connect to the appropriate database by replacing [datasource_name] with the
# data source name set up through the ODBC Data Source Administrator, and
# [database_name] with the database name in SQL Server Management Studio
datasource_name = '[datasource_name]'
database = '[database_name]'
connection_string = 'dsn=%s; database=%s' % (datasource_name, database)
connection = pyodbc.connect(connection_string)

# After a connection is established, we write out the data storage commands to send to Microsoft SQL Server
cursor = connection.cursor()
cursor.execute('INSERT INTO Yelp (id, name, price, rating, review_count, '
               'street, city_prov_pc, country) '
               'VALUES (?, ?, ?, ?, ?, ?, ?, ?)',
               biz_id, biz_name, price, rating, review_count,
               street, city_prov_pc, country)
cursor.commit()

After this script is run, we can do a final check that the data has been successfully loaded into the Microsoft SQL Server database by rerunning the Yelp table query. Once we do, we see that we have in fact successfully transferred the data over.

[Screenshot: SQL_Table]
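The same check can also be made from Python itself; here is a small sketch, assuming the cursor object from the script above:

# Query the Yelp table through the existing pyodbc connection
cursor.execute('SELECT * FROM Yelp')
for row in cursor.fetchall():
    print(row)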

FURTHER WORK

This simple ETL process for Yelp data demonstrated how to tap into Yelp's data repository with Python, handle basic data type considerations, and load the results into Microsoft SQL Server.

One thing to note here is that we did not consider the more difficult data points to extract. For example, the Yelp API provides a data point for a restaurant's operational hours, which is stored as a dictionary within a list within a dictionary. Although not too difficult to extract, these kinds of data points do require more work, as sketched below.
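As a rough sketch of what that extra work looks like, the nested hours structure could be unpacked as follows (field names taken from the Yelp API documentation; treat this as illustrative):

# 'hours' is a list of dicts; each dict holds an 'open' list of
# day-by-day dicts with 'day', 'start' and 'end' fields
hours = response_data['hours']
open_blocks = hours[0]['open']
monday_hours = [block for block in open_blocks if block['day'] == 0]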

Secondly, we should note that some data points are not always available, because restaurant owners choose not to fill out the information. Also, as documented by Yelp, an API call returns no data if the restaurant has no reviews (even if it clearly has a Yelp page)! We would have to account for the potential errors that arise when specific information cannot be extracted. For example, we could set up try-except blocks in the Python code and have Microsoft SQL Server store NULL values.
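For instance, here is a minimal sketch of that approach, using the same response_data dictionary (pyodbc writes a Python None to SQL Server as NULL):

# Fall back to None when the key is absent; None is stored as NULL
try:
    price = response_data['price']
except KeyError:
    price = None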

Another thing to note is that there are security and efficiency considerations when loading data into a database. This exercise did not consider database design, where it is almost always more efficient to define row keys and to minimize the memory footprint of each data type. It also did not demonstrate access to a secured database (where a username and password are required).
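For reference, a connection to a secured database would carry the credentials in the connection string; one possible pyodbc form (placeholder values, and the exact driver name depends on the installation):

# Hypothetical secured connection; replace the bracketed placeholders
connection = pyodbc.connect('DRIVER={SQL Server};SERVER=[server];'
                            'DATABASE=[database];UID=[username];PWD=[password]')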

Although there is clearly more that can be done, this post shows how much becomes possible once the data is within reach. Now that the data sits in a clean tabular format within Microsoft SQL Server, we can use it for further analysis or other purposes in our data science projects. Further work could automate the data extraction process and set up more advanced SQL tables. Finally, there is a wide variety of social media APIs out there to try out and master.
