Data Science

Scorecard Building in R – Part V – Rejected Sample Inference, Grade Analysis and Scoring Techniques

In the previous section, part IV of the scorecard building process, I trained, validated and tested a logistic regression model serving as the heart of the scorecard. In this section, I address the obvious sample selection problem: loans are accepted based on merit and personal credit information and rejected for lack of credentials, so the model only ever sees accepted applicants. I also look into analyzing model assumptions, where the predicted scores for the training set are used to build a grading scheme. As an extra exercise, I scale the log-odds score into a more understandable scoring function.

To fully account for the sample selection bias in our model, performance inference methods are utilized to predict the performance of rejected clients if they were actually given a loan. The first step is creating a function that maps some sort of domain knowledge of features to the log-odds of accepted customers. This function will then be used to predict the probabilities of rejected customers. Here, the rejected data set does not include which applicants applied for a 36-month loan term so I assume that all of these applicants were considered.

I will utilize the following packages:


I will require the dataset of rejected applicants and the data available from them. This can be downloaded here.

rejected_data <- read.csv("C:/Users/artemior/Desktop/Lending Club Model/RejectStatsD.csv")

To start off, I create a new logistic regression model with the same Bad_Binary variable and only the features that are common to both accepted and rejected applicants. In this case, both contain zip code, state and employment length. The data that I will use comes from section 2, more specifically WOE_matrix_final and Bad_Binary. It is assumed that building this inference model takes into account all variable WOE transformations. Bad_Binary_Original takes the original Bad variable from the features_36 data frame.

Bad_Binary_Original <- features_36$Bad
sample_inference_features <- WOE_matrix_final[c("zip_code", "addr_state", "emp_length")]
sample_inference_features["Bad_Binary"] <- Bad_Binary_Original

Run a simple generalized linear model on the accepted applicants data set.

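# family = binomial is needed for a logistic fit; glm() defaults to a Gaussian linear model
sample_inference_model <- glm(Bad_Binary ~ ., data = sample_inference_features, family = binomial)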

Here, I calculate the WOEs for the rejected applicants by applying the WOE tables from the accepted applicants onto the rejected applicant data set. The code and methodology of transformation is exactly that in part III.

features_36_inference <- rejected_data %>% select(-Amount.Requested, -Application.Date,
                                                  -Loan.Title, -Risk_Score,
                                                  -Debt.To.Income.Ratio, -Policy.Code)

features_36_inference_names <- colnames(features_36_inference)

Initiate cluster for parallel processing.


number_cores <- detectCores() - 1
cluster <- makeCluster(number_cores)
clusterExport(cluster, c("IV", "min_function", "max_function",
"only_features_36", "recode", "WOE_tables"))

Create the WOE matrix table for the rejected data applicants.

WOE_matrix_table_inference <- parSapply(cluster, as.matrix(features_36_inference_names),
FUN = WOE_tables_function)

WOE_matrix_inference is the converted WOE matrix for the rejected applicant data set. This is the dataset we will be using to predict their scores for model performance inference.

WOE_matrix_inference <- parSapply(cluster, features_36_inference_names,
FUN = create_WOE_matrix)

Using WOE_matrix_inference to come up with predicted probabilities using the sample_inference_model, which was built on the accepted applicant data set.

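# predict() takes new observations through newdata; the columns must carry the same
# names as the model features (zip_code, addr_state, emp_length)
rejected_inference_prob <- predict(sample_inference_model,
                                   newdata = as.data.frame(WOE_matrix_inference),
                                   type = "response")
rejected_inference_prob_matrix <- as.matrix(rejected_inference_prob)
rejected_inference_prob_dataframe <- as.data.frame(rejected_inference_prob_matrix)
colnames(rejected_inference_prob_dataframe) <- c("Probabilities")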
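The inference step can be illustrated end to end on synthetic data. This is only a minimal sketch: the data frames, their simulated values and the object names below are made up for illustration and stand in for the WOE-transformed accepted and rejected data sets. The one detail worth noting is that predict() for a glm receives new observations through the newdata argument.

```r
# Minimal sketch of reject inference on synthetic data (all names are hypothetical).
set.seed(42)

# Accepted applicants: WOE-style numeric features shared with the rejected file.
n_accepted <- 500
accepted <- data.frame(
  zip_code   = rnorm(n_accepted),
  addr_state = rnorm(n_accepted),
  emp_length = rnorm(n_accepted)
)

# Simulate the binary outcome from a known logistic relationship.
true_log_odds <- 1 + 0.8 * accepted$emp_length - 0.5 * accepted$zip_code
accepted$Bad_Binary <- rbinom(n_accepted, size = 1, prob = plogis(true_log_odds))

# family = binomial makes glm() fit a logistic regression.
inference_model <- glm(Bad_Binary ~ ., data = accepted, family = binomial)

# Rejected applicants: same column names, no observed outcome.
n_rejected <- 200
rejected <- data.frame(
  zip_code   = rnorm(n_rejected, mean = 0.5),
  addr_state = rnorm(n_rejected),
  emp_length = rnorm(n_rejected, mean = -0.5)
)

# Score the rejected applicants with the model trained on accepted applicants.
rejected_prob <- predict(inference_model, newdata = rejected, type = "response")
summary(rejected_prob)
```

Because the rejected data frame carries the same column names as the training features, the model can score it directly.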


Now I obtain the predicted probabilities for all accepted applicants using the cross-validated elastic-net logistic regression model from section 4.

number_cores <- detectCores()
cluster <- makeCluster(number_cores)

accepted_prob <- predict(, LC_WOE_Dataset, type = "prob")
accepted_prob_matrix <- as.matrix(accepted_prob[,2])
accepted_prob_dataframe <- as.data.frame(accepted_prob_matrix)
colnames(accepted_prob_dataframe) <- c("Probabilities")

I combine the probabilities from both the rejected and accepted applicants and generate a graph that depicts the distribution of the probabilities.

probability_matrix <- rbind(accepted_prob_matrix, rejected_inference_prob_matrix)
probability_matrix <- as.data.frame(probability_matrix)
colnames(probability_matrix) <- c("Probabilities")

I use ggplot to plot the distribution of probabilities of default. Here we see that the distribution is left skewed. This is a representation of both accepted and rejected applications. It is also useful to analyze the distribution of the accepted and rejected applicants separately.

probability_distribution <- ggplot(data = probability_matrix, aes(Probabilities))
probability_distribution <- probability_distribution + geom_histogram(bins = 50)

accepted_probability_distribution <- ggplot(data = accepted_prob_dataframe, aes(Probabilities))
accepted_probability_distribution <- accepted_probability_distribution + geom_histogram(bins = 50)


rejected_probability_distribution <- ggplot(data = rejected_inference_prob_dataframe, aes(Probabilities))
rejected_probability_distribution <- rejected_probability_distribution + geom_histogram(bins = 50)


The accepted probability distribution is left-skewed while the distribution of the rejected applicants is approximately normal. This could be interpreted as rejected applicants exhibiting a normally distributed score if they were to be funded. Therefore, rejected sample selection bias is minimal if applicants are being rejected randomly through a normal distribution. If this were not the case, I would suspect some bias in the acceptance and rejection of applicants.

After gaining some insight into how our model will perform on a population it has never seen before, we look to formalize the scorecard by creating a grading scheme that defines several levels of risk.

Here, I am going to organize the accepted applicant probability data set into bins to initiate a lift analysis. We use lift analysis to help us determine which bins of scores are going to be described by particular letter grades. I create 25 different bins to mimic the sub-grade system that LC has from their given public data set. I then append the “Bad” column from features_36 to this vector and summarize the information so that I may be able to calculate the proportions of bads within each bin.

bins = 25
Bad_Binary_Values <- features_36$Bad
prob_bad_matrix <- as.data.frame(cbind(accepted_prob_matrix, Bad_Binary_Values))
colnames(prob_bad_matrix) <- c("Probabilities", "Bad_Binary_Values")

I sort the probabilities and binary values in an increasing order based on the probabilities column. By ordering the first column, the corresponding values in the second column are also sorted.

Probabilities <- prob_bad_matrix[,1]
Bad_Binary_Values <- prob_bad_matrix[,2]
order_accepted_prob <- prob_bad_matrix[order(Probabilities, Bad_Binary_Values, decreasing = FALSE),]

I create the bins based on the sorted probabilities and create a new data frame consisting of only the bins and bad binary values. This will be the data frame I use to conduct a lift analysis.

bin_prob <- cut(order_accepted_prob$Probabilities, breaks = bins, labels = 1:bins)
order_bin <- as.data.frame(cbind(bin_prob, order_accepted_prob[,2]))
colnames(order_bin) <- c("Bin", "Bad")

I summarize the information where I calculate the proportion of bads within each bin.

bin_table <- table(order_bin$Bin, order_bin$Bad)

Bin_Summary <- group_by(order_bin, Bin)

Bad_Summary <- summarize(Bin_Summary, Total = n(), Good = sum(Bad), Bad = 1 - Good/Total)

Using Bad_Summary, I plot a bar plot that represents the lift analysis.

lift_plot <- ggplot(Bad_Summary, aes(x = Bin, y = Bad))
lift_plot <- lift_plot + geom_bar(stat = "identity", colour = "skyblue", fill = "skyblue")
lift_plot <- lift_plot + xlab("Bin")
lift_plot <- lift_plot + ylab("Proportion of Bad")
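As a rough illustration of the binning behind a lift table, here is a base R sketch on simulated probabilities. The data and object names are hypothetical. It also contrasts two ways of forming 25 bins: cut() with a single number of breaks produces equal-width intervals, whereas quantile break points produce bins with roughly equal counts. Which one suits a grading scheme is a judgment call.

```r
set.seed(1)
n <- 5000
prob_good <- rbeta(n, 8, 2)              # hypothetical predicted probabilities
good_flag <- rbinom(n, 1, prob_good)     # 1 = repaid, 0 = did not repay

bins <- 25

# Equal-width bins: the result of cut() when breaks is a single number.
equal_width_bin <- cut(prob_good, breaks = bins, labels = 1:bins)
table(equal_width_bin)                   # counts vary a lot from bin to bin

# Equal-count bins built from quantile break points.
quantile_breaks <- quantile(prob_good, probs = seq(0, 1, length.out = bins + 1))
equal_count_bin <- cut(prob_good, breaks = quantile_breaks,
                       labels = 1:bins, include.lowest = TRUE)

# Proportion of bads per bin is one minus the mean of the good flag.
lift_table <- data.frame(
  Bin   = 1:bins,
  Total = as.vector(table(equal_count_bin)),
  Bad   = 1 - as.vector(tapply(good_flag, equal_count_bin, mean))
)
head(lift_table)
```

With a classifier that ranks well, the Bad column should fall as the bin number rises, since higher bins hold higher predicted probabilities of repayment.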


Here, the graph shows 25 bins and the proportion of bad customers within each bin. As expected, the proportion of bads decreases from bin to bin, which shows the effectiveness of our classifier. By separating them into 25 bins, I mimic LC’s sub-grading system and could apply the exact same logic to this scorecard.

To finalize the scorecard, I generate a linear function of log-odds and apply a three-digit score mapping system that will assist upper management in understanding the risk score obtained from the scorecard. First I convert the probability scores into log-odds form and figure out what type of linear transformation I would like to apply.

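Accepted_Probabilities <- Probabilities
# Note: log(p) is the log-probability; the log-odds proper would be log(p / (1 - p)).
# The line is kept as written because the slope and intercept quoted below come from it.
LogOdds <- log(Accepted_Probabilities)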

The score is up to the analyst and how they feel is the best way to present it to upper management for easy interpretation. Here, I will apply a three-digit score transformation to the log-odds using a simple linear function. To calculate the slope of this linear line, I use the minimum and maximum log-odds and use that as my range for a score range from 100 – 1000.

max_score <- 1000
min_score <- 100
max_LogOdds <- max(LogOdds)
min_LogOdds <- min(LogOdds)

linear_slope <- (max_score - min_score)/(max_LogOdds - min_LogOdds)
linear_intercept <- max_score - linear_slope * max_LogOdds

Here, the linear slope is 82.6 and the intercept is 1000. Since the intercept equals the maximum score, the largest log-odds in the sample is approximately zero, and an applicant at that value receives the maximum score of 1000. As the log-odds decreases based on the features that are inputted into the model, the score decreases linearly: for every 1-unit decrease in log-odds, the score decreases by 82.6 points, so an applicant whose log-odds is two units lower than another's scores 165.2 points lower. Therefore, every 82.6 points in a score corresponds to one unit of log-odds, a fixed step in riskiness that is easily explained to non-technical audiences.
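To make the scaling arithmetic concrete, here is a small sketch of the same two-point linear map applied to log-odds computed with qlogis(), which returns log(p / (1 - p)). The five probabilities are invented for illustration, so the resulting slope and intercept will not match the figures quoted for the Lending Club data.

```r
# Hypothetical probabilities of repayment for five applicants.
prob_good <- c(0.55, 0.70, 0.85, 0.95, 0.99)

# qlogis(p) is log(p / (1 - p)), the log-odds.
log_odds <- qlogis(prob_good)

min_score <- 100
max_score <- 1000

# Two-point linear map: the lowest log-odds gets min_score, the highest gets max_score.
slope     <- (max_score - min_score) / (max(log_odds) - min(log_odds))
intercept <- max_score - slope * max(log_odds)

score <- intercept + slope * log_odds
round(score)
```

Anchoring the map on the sample minimum and maximum means the score range shifts whenever the data change; fixing two reference points in advance is an alternative if scores need to stay comparable over time.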

Concluding Remarks

There you have it! After five parts of the scorecard building process, the scorecard is ready for presentation and production. Now, I could go even further into explaining some of the things that could be changed for the scorecard, such as adjusting scoring thresholds to meet business requirements or accepted levels of risk. I could also explain how to link the scorecard and grades to expected losses and profit margins for the company. This goes beyond what I have demonstrated here but is definitely worth considering, especially for a financial company.

Source Code

The R code for all 5 parts of the scorecard building process can be found at my Github page.

Scorecard Building in R – Part II – Data Preparation and Analysis

I used the dataframe manipulation package ‘dplyr’, some basic parallel processing to get the code running faster with the package ‘parallel’, and the ‘Information’ package which allows me to analyze the features within the data set using weight-of-evidence and information value.


First, I read in the Lending Club csv file downloaded from Lending Club website. The file is saved on my local desktop which is easily accessed by the read.csv function.

data <- read.csv("C:/Users/artemior/Desktop/Lending Club model/LoanStats3d.csv")

Next, I create a column that indicates whether I will keep an observation (row) or not. This will be based on the loan statuses because for a predictive logistic regression model, I would like all the statuses that will be strictly defined as a ‘Good’ loan or a ‘Bad’ loan.

data <- mutate(data,
Keep = ifelse(loan_status == "Charged Off" |
loan_status == "Default" |
loan_status == "Fully Paid" |
loan_status == "Late (16-30 days)" |
loan_status == "Late (31-120 days)",
"Keep", "Remove"))

After creating the ‘Keep’ column I filter the data depending on whether the observation had “Keep” or “Remove”.

sample <- filter(data, Keep == "Keep")

I further filter the data set to create two new samples. The Lending Club offers two exclusive types of loan products. To improve predictability of the riskiness of its loans, we can create two sub-risk models, one for all 36-month term loans and one for all 60-month term loans.

sample_36 <- filter(sample, term == " 36 months")
sample_60 <- filter(sample, term == " 60 months")

For the purposes of this scorecard building demonstration I will create a model using the 36-month term loans. Using the mutate function, I create a new column called ‘Bad’ which will be my binary dependent variable used in the logistic regression. Note that, as coded below, ‘Bad’ equals 1 for fully paid loans and 0 otherwise.

sample_36 <- mutate(sample_36, Bad = ifelse(loan_status == "Fully Paid", 1, 0))

The next step is to clean up the table to remove any data points I do not want to include in the prediction model. Variables such as employment title would take more time to analyze so for the purposes of this analysis I remove them.

features_36 <- sample_36 %>% select(-id, -member_id, -loan_amnt,
-funded_amnt, -funded_amnt_inv, -term, -int_rate, -installment,
-grade, -sub_grade, -pymnt_plan, -purpose, -loan_status,
-emp_title, -out_prncp, -out_prncp_inv, -total_pymnt, -total_pymnt_inv,
-total_rec_int, -total_rec_late_fee, -recoveries, -last_pymnt_d, -last_pymnt_amnt,
-next_pymnt_d, -policy_code, -total_rec_prncp, -Keep)

To further understand the data, I want to take a look at the number of observations per category under each variable. This will weed out any data points that could be problematic in future algorithms.

Once the features table is complete, I use the methodology of information value to transform the raw feature data. In theory, transforming the raw data into a proportional log-odds value, as seen in the Weight-of-Evidence, maps better onto a logistic regression fitted curve.

IV <- create_infotables(data = features_36, y = "Bad")

We can generate a summary of the IV’s for each feature. The IV for a particular feature represents the sum of individual bin IV’s.
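For intuition, weight-of-evidence and information value can be computed by hand for a single categorical feature. The sketch below uses a tiny invented data set and one common convention, WOE = log(share of 1s in the category / share of 0s in the category); it is an illustration of the idea, not a reproduction of the Information package's internals.

```r
# Toy data: a categorical feature and a binary outcome (1 = fully paid, as in this post).
home <- c("OWN", "OWN", "OWN", "OWN", "RENT", "RENT", "RENT", "RENT",
          "MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTGAGE")
paid <- c(1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0)

counts <- table(home, paid)                    # rows: categories, columns: outcome 0 / 1
dist_1 <- counts[, "1"] / sum(counts[, "1"])   # share of all 1s falling in each category
dist_0 <- counts[, "0"] / sum(counts[, "0"])   # share of all 0s falling in each category

woe    <- log(dist_1 / dist_0)                 # weight of evidence per category
iv_bin <- (dist_1 - dist_0) * woe              # information value contributed per category
iv     <- sum(iv_bin)                          # feature-level IV is the sum over categories

data.frame(category = names(woe), WOE = as.vector(woe), IV = as.vector(iv_bin))
iv
```

A category with equal shares of 1s and 0s has a WOE of zero and contributes nothing to the IV, which is why only categories that separate the two outcomes add information.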


We can even check the IV tables for individual features and see how each feature was binned, the percentage of observations that the bin represents out of the total number of observations, the WOE attributed to the bin, as well as the IV. The following code is an example of presenting the feature summary for the last credit pull date.


I analyze the behaviors of continuous and ordered-discrete variables by plotting their weights-of-evidence. In theory, the best possible transformation occurs when the weights-of-evidence exhibit a monotonic relationship. First, I define features_36_names_plot as the vector of column names. This is the vector I pass, one name at a time, to a function that plots the WOE graph for each feature. I remove features from the list that are categorical and would generate way too many bins to plot later on. For example, I removed the feature zip_code as there would be over 500 different kinds.

features_36_names_plot <- colnames(features_36)[c(-7, -11, -ncol(features_36))]

Here is the code for the plotWOE function as I previously mentioned. This function generates a WOE plot for input x, where x is a string that represents the column name of a specific feature. Recall that I generated a vector of strings in features_36_names_plot.

plotWOE <- function(x) {
p <- plot_infotables(IV, variable = x, show_values = TRUE)
return(p) }

To make my for loop code clean and faster, I define a number as the length of the features name vector.

feature_name_vector_length_plot <- length(features_36_names_plot)

Now for the fun part: to generate a graph for each feature, I use a for loop which will go over every string in the features_36_names_plot vector, and plot a WOE graph for each string name that corresponds to a feature in the features_36 data frame. To be safe, I created an error-handling portion of code because somewhere in this huge set of features, I may have missed a feature or two for which a WOE plot cannot be created. This would occur if a particular feature only contained 1 category or value for every observed loan.

for (i in 1:feature_name_vector_length_plot) {
  p <- tryCatch(plotWOE(features_36_names_plot[i]),
                error = function(e) {
                  print(paste("Removed variable: ", features_36_names_plot[i]))
                  NaN
                })
  print(p)
}
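The error-handling pattern can be tried in isolation. In this sketch, describe_feature() is a hypothetical stand-in for the plotting step that fails on single-valued features, and the feature list is invented; the point is only that tryCatch() lets the loop report the problem and continue.

```r
# Hypothetical stand-in for a plotting step that fails on single-valued features.
describe_feature <- function(values) {
  if (length(unique(values)) < 2) stop("feature has a single value")
  paste("levels:", length(unique(values)))
}

features <- list(grade = c("A", "B", "C"), policy = c(1, 1, 1), term = c(36, 60))

results <- list()
for (name in names(features)) {
  results[[name]] <- tryCatch(
    describe_feature(features[[name]]),
    error = function(e) {
      message("Removed variable: ", name)   # report and carry on with the loop
      NA_character_
    }
  )
}
results
```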

About 90 graphs are generated using the for loop. Below I present and discuss two examples of the kinds of graphs produced and what they mean.


The home ownership weight-of-evidence plot displays how a greater proportion of good consumer loan customers own their homes and a greater proportion of bad consumer loan customers pay rent where they live. Those who still pay a mortgage are slightly better customers.


The months since delinquency (or time since you failed to pay off some form of credit) weight-of-evidence plot presents another intuitive relationship. The more months that have passed since a customer’s most recent delinquency, the more likely they are to be a good customer in paying off their loan. Fewer months since a customer’s most recent delinquency means that they have only recently failed to pay off other forms of credit. This goes to show that even if you had a delinquency in your lifetime, you can improve your credit management and behaviors over time.

In the plot, something weird happens when customers had their delinquency between 19 – 31 months before they received another consumer loan. This could suggest a lagging effect where it takes some time to fully chase down a customer. It could be the case that sometimes months and months of notification is given before the customer is actually classified as delinquent.

In the next post, Scorecard Building – Part III – Data Transformation, I am going to describe how the data we prepared and analyzed using Information Theory will be transformed to better suit a logistic regression model.

Scorecard Building in R – Part I – Introduction

Part of my job as a Data Scientist is to create, update and maintain a small-to-medium business scorecard. This machine learning generated application allows its users to identify applicants that are more likely to pay back their loan or not. Here, I take the opportunity to showcase the steps I take in building a reliable scorecard, and the analysis associated with evaluating it by using R. I will accomplish this with the use of public data provided by the consumer and commercial lending company, Lending Club (downloaded here).

Here is an overview of the essential steps to take when building this scorecard:

  1. Data Collection, Cleaning and Manipulation
  2. Data Transformations: Weight-of-Evidence and Information Value
  3. Training, Validating and Testing a Model: Logistic Regression
  4. Scorecard Evaluation and Analysis
  5. Finalizing Scorecard with other Techniques

See the next post, Scorecard Building – Part II – Data Preparation and Analysis to see how the data is prepared for further scorecard building.

Time-Series Model Building for TSX Stock Prices Using R

Time-series modelling and forecasting was definitely a core concept to learn and one of the more important technical skills to pick up in a masters program in economics. Having primarily focused on economic applications of time-series such as the estimation and prediction of future Canadian Real GDP and Real Interest Rates, I was able to get a sense of the power of time-series modelling.

One of the things I wished these applied econometric courses taught me was how to at least follow some sort of standard procedure in building a good time-series model! It can be misleading to have a time-series model project that you worked really hard on, knowing that you followed every step in the textbook, but realize in practice that it is a horrible model!

Luckily, I was able to pick up on one of the fundamental statistical model building procedures: splitting your data into a training and testing set with time-series data.

In this mini-project, I use a data set of 141 observations from Yahoo Finance Canada where each observation represents the S&P/TSX stock price for a particular month. Here, I wanted to demonstrate the importance of building a forecasting model on a training set, using this model to forecast future values, and comparing the forecasted values with actual observed values to see how well the model performed. Take note of the Charizard and Venusaur colour palettes of the graphs during the read!

The following describes the procedures taken to create this model:

  1. Perform decomposition of time-series using LOESS on full data set of stock prices.
  2. Obtain a vector of seasonally adjusted stock prices (remainder) after the seasonal and trend components are removed.
  3. Separate this remainder into a training and test set where the training set consists of all seasonally adjusted stock prices from January 2006 to December 2015.
  4. Build an ARIMA model on the training set and obtain predicted values into the present month (October 2016).
  5. Compare predicted values to actual values observed and assess model performance.
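The steps above can be sketched with base R alone. The series below is simulated, since the actual Yahoo Finance data is not reproduced here, and the object names, the 130-month length and the choice of method = "ML" are assumptions made for the illustration.

```r
set.seed(123)

# Synthetic monthly series standing in for the stock prices: trend + seasonality + AR noise.
n_months <- 130                                   # January 2006 to October 2016
noise    <- as.numeric(arima.sim(list(ar = 0.6), n = n_months, sd = 50))
prices   <- ts(12000 + 20 * seq_len(n_months) +
                 200 * sin(2 * pi * seq_len(n_months) / 12) + noise,
               start = c(2006, 1), frequency = 12)

# Steps 1-2: LOESS decomposition; keep the remainder component.
decomposition <- stl(prices, s.window = "periodic")
remainder     <- decomposition$time.series[, "remainder"]

# Step 3: train on 2006-2015, hold out the 2016 months.
train <- window(remainder, end = c(2015, 12))
test  <- window(remainder, start = c(2016, 1))

# Step 4: fit an ARIMA(5,0,0) and forecast over the test horizon.
fit      <- arima(train, order = c(5, 0, 0), method = "ML")
forecast <- predict(fit, n.ahead = length(test))

# Step 5: compare forecasts with the held-out values.
rmse <- sqrt(mean((test - forecast$pred)^2))
rmse
```

Note that the decomposition here is run on the full series before splitting, as described in the steps; a stricter evaluation would decompose the training window only.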

Using the above procedures, I was able to obtain the following graph using an ARIMA(5,0,0) model or more simply an AR(5) model.


When we see the estimated model (Fitted line) compared to the actual behaviour of the stocks, the two are seemingly close. It is evident though that the model predictions into 2016 come close to what was actually observed but still over-predict the seasonally adjusted stock prices. About 6 months into the forecast, the predictions start to decrease and under-predict actual stock performance. Now, forecasts such as these are not meant to go out for too long as they become unreliable. For demonstration purposes, it is nice to see that the model can reliably predict behaviour for a few months into the future.

Since I do not want to put complete faith into this one model, I also ran a Simple Exponential Smoothing time-series model using HoltWinters.

The following describes the procedures taken to create this model:

  1. Separate the raw time-series data into a training and test set where the training set consists of all stock prices from January 2006 to December 2015.
  2. Build a Simple Exponential Smoothing (SES) model using Holt-Winters on the training set and obtain predicted values into the present month (October 2016).
  3. Compare predicted values to actual values observed and assess model performance.
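A minimal sketch of these steps, again on a simulated series with hypothetical object names: in base R, HoltWinters() reduces to simple exponential smoothing when the trend and seasonal components are switched off with beta = FALSE and gamma = FALSE.

```r
set.seed(7)

# Simulated monthly prices standing in for the raw stock series.
n_months <- 130
prices <- ts(12000 + cumsum(rnorm(n_months, mean = 10, sd = 150)),
             start = c(2006, 1), frequency = 12)

train <- window(prices, end = c(2015, 12))
test  <- window(prices, start = c(2016, 1))

# beta = FALSE and gamma = FALSE switch off trend and seasonality,
# which reduces Holt-Winters to simple exponential smoothing.
ses_fit <- HoltWinters(train, beta = FALSE, gamma = FALSE)

# Point forecasts with prediction intervals over the held-out months.
ses_forecast <- predict(ses_fit, n.ahead = length(test), prediction.interval = TRUE)
ses_forecast
```

The returned matrix holds the point forecast alongside upper and lower bounds, which is the information a shaded interval on a forecast plot is drawn from.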

Using the above procedures, I was able to obtain the following graph:

I am happy with the results of this graph. First, one needs to note that this model takes in raw stock prices and as such should be interpreted as their raw prices. The SES model fits closely with the actual values and the forecasting behaviour is similar to that obtained from the ARIMA(5,0,0) model. The confidence intervals shaded in blue show some boundaries to these values which is also very nice.

Further Work

In this post, I addressed the idea of being able to split a time-series data set into a training and testing set and model accordingly. In the first model, I used LOESS, a decomposition method to make the stock price data stationary by removing the seasonal and trending components. This allowed me to reliably apply an ARIMA model and make subsequent predictions. In the second model, I directly applied the Holt-Winters method and obtained similar results.

Although one could see that the time-series models were trained well and to a certain degree able to predict well, there is much more left to be said here. It is much better practice to compare the performance between the two models through calculations of their prediction errors. To take it even further, we could apply more advanced time-series modelling via neural networks.
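As a sketch of what such a comparison could look like, the snippet below computes root mean squared error and mean absolute error for two sets of forecasts. The numbers are invented and the helper functions rmse() and mae() are defined here for illustration.

```r
# Hypothetical held-out values and forecasts from two models.
actual     <- c(100, 102, 105, 103, 108)
forecast_a <- c(101, 103, 104, 105, 107)
forecast_b <- c( 98, 100, 108, 100, 112)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

errors <- data.frame(
  model = c("A", "B"),
  RMSE  = c(rmse(actual, forecast_a), rmse(actual, forecast_b)),
  MAE   = c(mae(actual, forecast_a),  mae(actual, forecast_b))
)
errors
```

For the comparison to be fair, both sets of forecasts need to be on the same scale, for example by adding the trend and seasonal components back to forecasts made on a remainder series before comparing them with forecasts of raw prices.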

Source Code

This project has been done in R. The source code can be found at my github here.

My Personal Vancouver Transit Usage: Analysis using Tableau

One of the things I love about Vancouver is its public transportation system, Translink. I grew up loving trains, and so it only seemed natural that riding the Skytrain was one of the funnest things I experienced when I first came to Vancouver. It has been around two years since I moved here and I still use it to go everywhere in and out of the Vancouver area. A cool feature of the Translink system is the Compass Card, a re-loadable fare pass that frequent riders use to tap themselves on and off the transit system through fare gates. Part of the reason why I love the idea of tapping on and off the fare gates or on bus rides is because of how the system records data of where and when you have tapped.

The thought of Translink’s ability to easily conduct commuter analysis using the millions of data points recorded every day for strategic pricing and vehicle allocation is intriguing. As such, this is what motivates riders like me to analyze my personal rider behavior. Conveniently, the Compass Card website allows you to download your own personal .csv file. The file contains lines of transactions representing every single time you have tapped on and off the system.

The motivation behind this post is to showcase some data analysis. I would love to present what I have learned about my transit behavior between September 2016 and August 2017 using Tableau Public. For any Pokemon fans out there, visualizations take on a Charizard colour palette.

On average, I began my travels with the bus 2.7 times more than the train each month. Equivalently, the bus began 73% of my trips.

Rider Usage Growth

73 Percent Bus Usage

This makes sense as the bus begins my commute to almost everywhere I go when I begin at home. It is interesting to see that my ridership has consistently increased up until the second quarter in 2017. The slight kink in the graph is due to the fact that I spent most of the month of May 2017 travelling (I went to Japan for the first time!)

I used the transit system in 289 out of 365 days and most days, I took 2-3 trips.

Trip Data

Here, I defined a trip as one where I would be required to make a new full fare payment. It is possible that multiple forms of transit may be used within an hour and a half time interval before having to pay again. These potential ways of transferring between types of transit (i.e. bus to train) are not counted as separate trips.
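This trip definition can be expressed as a small rule: a tap belongs to the current trip if it falls within 90 minutes of the tap that started the trip, otherwise it starts a new paid trip. The sketch below applies that rule to invented tap times; the variable names and the exact treatment of the window boundary are assumptions, not the Compass Card system's actual logic.

```r
# Hypothetical tap-in times for one day.
taps <- as.POSIXct(c("2017-03-01 07:00", "2017-03-01 07:40", "2017-03-01 08:20",
                     "2017-03-01 08:45", "2017-03-01 17:30", "2017-03-01 18:10"),
                   tz = "UTC")
transfer_window <- 90 * 60   # 90 minutes, in seconds

trip_id      <- integer(length(taps))
trip_start   <- taps[1]
current_trip <- 1L
for (i in seq_along(taps)) {
  elapsed <- as.numeric(difftime(taps[i], trip_start, units = "secs"))
  # A tap after the window has closed starts a new paid trip.
  if (elapsed > transfer_window) {
    current_trip <- current_trip + 1L
    trip_start   <- taps[i]
  }
  trip_id[i] <- current_trip
}
trip_id
```

Counting the distinct values of trip_id per day then gives the number of trips in the sense used here.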

I tried avoiding the morning transit rush. I am more likely to use transit during evening rush hour. Weekend usage often starts in the late morning.

Daily Trip Schedule

This huge morning spread in my transit usage behavior reflects my choice to go to the gym before I go to work, especially during the spring and summer seasons when it gets brighter outside earlier in the day. Therefore, I can begin using transit as early as 5:00am! Also, being given a flexible work schedule, sometimes I choose to head to work as late as anywhere between 8:00am and 9:00am.

I often go out Friday and Saturday evenings and as much as I love taking transit, there is no clear increase in transit usage behavior during this time because depending on the activity, I may already be in walking distance of what I want to do, or transit may not be my ideal form of transportation.

I saved money getting a Zone 1 Monthly Pass at $91.00 with my behavior! I would have spent on average, $103.00 a month on individual fares.

Fare Usage

Oftentimes, I don’t think about how many times I tap on and off the system and overlook what I would be paying if I did not have a monthly pass. This is proof that getting a Zone 1 monthly pass is worth it as a frequent transit user and I do not have to worry about other financial alternatives.

If I were more nit-picky, I could definitely save more money by not getting the monthly pass during the months where it would not be worth it. For example, every December, I fly out to Toronto for two weeks to visit family for the holidays. One might think that this could signal a behavioral change to pay closer attention to my budget allocation towards public transit. In reality, I actually prefer not having to worry about loading my Compass Card every month. Hence, I have it set to auto-load where the system automatically charges my credit card and loads a monthly pass to my Compass Card.

Further Considerations

This analysis was a great introductory way for me to explore Tableau as an analytical tool. I will definitely be using it more often to create vibrant visualizations and home in on insights from interesting data. One future consideration I have for these kinds of analyses is to utilize maps and locations to enhance the visualization and stories behind transit data. In this particular case, almost all of my trips began in Vancouver and rarely in any other surrounding city, so geographic visuals may not have been of much use.

Another future consideration is to augment the existing transit data with other data sources such as the distances traveled using transit possibly obtained from Google Maps for example. Some analysis on how much it would cost per kilometer traveled or personal summary statistics on distances traveled also sounds interesting.

Amidst the world of available data in everyday life, one last future consideration is that the next time you tap off the transit system, think about how that is one more data point for your next analysis!

Implementing a Predictive Model Pipeline using R and Microsoft Azure Machine Learning

In this post, I aim to demonstrate the process of building a simple machine learning model in R and implementing it as a predictive web service in Microsoft Azure Machine Learning (Azure ML). One is not limited to the built-in machine learning capabilities of Azure ML since the Azure ML environment enables the use of R scripts, and the ability to upload and utilize R packages.

In practice, this gives the Data Scientist the flexibility they need to use their own carefully R-crafted machine learning models within the Azure ML environment. Once an R-built machine learning model is fully implemented within the Azure ML environment, the web-service provides an API in which calls can be made by external applications. This provides large value to any organization that seeks to automate decision processes using predictive modelling.

This demonstration is subdivided into three sections:

  1. Building a Machine Learning Model in R
  2. Preparing for Model Implementation in Microsoft Azure Machine Learning
  3. Creating a Predictive Web Service in Microsoft Azure Machine Learning

Before we begin, this demonstration assumes all of the following are satisfied:

  1. We have a verified and registered Microsoft Azure Machine Learning account. Microsoft allows you to try the product for free if you do not have a subscription. Click the “Sign-up here” on the far right using the link above and follow the steps to gain access to Azure ML.
  2. We have R and an associated IDE installed. I will be using the free version of R-Studio throughout this process.
  3. We have the following R packages installed: dplyr, dummies, caret, caretEnsemble
  4. We have the Human Resources Analytics data set downloaded as this will be the main source of data for building our predictive model.

Building a Machine Learning Model in R

To keep things simple, I will be building a simple logistic regression model.

Data Pre-processing

First, I pre-process the Human Resources Analytics data set to one-hot encode the categorical features sales and salary.


dataset <- read.csv("HRdata.csv")
dataset <- dummy.data.frame(dataset, names = c("sales", "salary"), sep = "_")
dataset <- dataset[-c(15,19)] 

# Columns 15 and 19 represent sales_RandD and salary_high which are removed to prevent the dummy variable trap
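For readers without the dummies package, the same idea can be sketched in base R with model.matrix(). The toy data frame below is invented, and the generated column names (for example salesIT rather than sales_IT) differ from those used in this post.

```r
# Toy stand-in for the HR data set.
hr <- data.frame(
  satisfaction_level = c(0.4, 0.8, 0.6, 0.9),
  sales  = factor(c("IT", "sales", "RandD", "IT")),
  salary = factor(c("low", "high", "medium", "low"))
)

# Removing the intercept (- 1) keeps an indicator column for every level.
sales_dummies  <- model.matrix(~ sales - 1, data = hr)
salary_dummies <- model.matrix(~ salary - 1, data = hr)

encoded <- cbind(hr["satisfaction_level"], sales_dummies, salary_dummies)

# Drop one level per feature as the reference category to avoid the dummy variable trap.
encoded <- encoded[, !(colnames(encoded) %in% c("salesRandD", "salaryhigh"))]
encoded
```

Dropping columns by name rather than by position keeps the step robust if the column order of the input ever changes.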

Training the Logistic Regression Model

Next, I train a Logistic Regression Model and check that it can successfully generate predictions for new data.


# Create logistic regression model (family = binomial gives a logistic fit)
glm_model <- glm(left ~ ., data = dataset, family = binomial)

# Generate predictions for new data
newdata <- data.frame(satisfaction_level = 0.5, last_evaluation = 0.5, number_project = 1,
                      average_montly_hours = 160, time_spend_company = 2, Work_accident = 0,
                      promotion_last_5years = 1, sales_accounting = 0, sales_hr = 0, sales_IT = 0,
                      sales_management = 0, sales_marketing = 0, sales_product_mng = 0,
                      sales_sales = 1, sales_support = 0, sales_technical = 0,
                      salary_low = 0, salary_medium = 1)
prediction <- predict(object = glm_model, newdata = newdata, type = "response")


Executing the above gives us a probability of 0.193 indicating that this employee has low risk of leaving.
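
Whether predict() returns a probability depends on the model family and on the type argument. The sketch below uses simulated data (not the HR data set, and all variable names are made up for illustration) to show the difference between the log-odds scale and the probability scale for a logistic regression.

```r
# Simulated data for illustration only
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * toy$x))

# Logistic regression requires family = binomial
fit <- glm(y ~ x, data = toy, family = binomial)

new_obs <- data.frame(x = 0.5)

# Default scale for predict.glm is the linear predictor (log-odds)
log_odds <- predict(fit, newdata = new_obs, type = "link")

# type = "response" applies the inverse logit and returns a probability
probability <- predict(fit, newdata = new_obs, type = "response")

# plogis() is the inverse logit, so both routes agree
print(all.equal(unname(plogis(log_odds)), unname(probability)))
```

Any threshold-based policy applied downstream should be defined on the same scale that the prediction call returns.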

Saving the R-Built Model

Since the goal is to use our very own R-built model in Microsoft Azure Machine Learning, we need to be able to use our model without having to re-run the code above. We run the following code to save our model and all of its parameters:

saveRDS(glm_model, file = "glm_model.rds")

Running this code will save the glm_model.rds file in the active working directory within R-Studio. In this demonstration, the glm_model.rds file is saved to my desktop.
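
A quick way to check that a saved model survives the round trip is to reload it and compare predictions. This is a minimal sketch using the built-in mtcars data and a temporary file, both chosen only for illustration; the file name toy_model.rds is hypothetical.

```r
# Fit a small logistic regression on built-in data (illustration only)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Save to a temporary location and read it back
path <- file.path(tempdir(), "toy_model.rds")
saveRDS(fit, file = path)
restored <- readRDS(path)

# The restored object should produce the same predictions as the original
same <- all.equal(predict(fit, newdata = mtcars, type = "response"),
                  predict(restored, newdata = mtcars, type = "response"))
print(same)
```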


Creating Package Project Environment

The next couple of steps are crucial in ensuring that we end up with a package file that can be uploaded to Azure ML. The reason for creating this package is to ensure that Azure ML can call on our logistic regression model to generate predictions.

First, we must initialize the package creation process by starting a new project. In R-Studio this is achieved by the following:

  • Click “File” in the top left corner
  • Click “New Project…” and a pop-up screen will appear


  • Click “New Directory”


  • Click “R Package”


  • Type in a package name. Here I used “azuremlglm” as my package name. Make sure to create the project folder by setting a project subdirectory. Here, I used my desktop as the location for this project folder.
  • After clicking “Create Project”, a new R-Studio working environment will open with the default R file being “hello.R”. Since I saved my project to my desktop, I also noticed that a new folder was created.


  • Now we are set to build our package. Within our package environment in R-Studio, we can close the “hello.R” file and create three new R scripts by hitting Ctrl + Shift + N three times. These three scripts will be needed in the following sections.

Filling the Package with Necessary Items

By successfully setting up the package creation environment, we are now free to fill this package with anything that we may find useful in our predictive modelling pipeline. For the purposes of this demonstration, this package will only include the logistic regression model built from the first section, and a function that Azure ML can use to generate predictions.

We start with the first of the new R scripts, where the following R code adds the glm_model.rds file to our package. For this to work, first copy or drag the .rds file into the azuremlglm folder, since the R-Studio working directory is that project folder.

# Read the .rds file into the package environment
glm_model_rds <- readRDS("glm_model.rds")


By reading the .rds file that contains our logistic regression model into the package environment, we are now free to use the model in any way we wish. It is important that we save this script within the project folder. Here, I saved it as glm_model_rds.R as seen on the tab.


The Prediction Function

Since the primary use of this package is to utilize the logistic regression model to produce predictions, we need to create a function that takes in a data frame containing new data and outputs a prediction. This is very similar to the prediction verification procedure we did in the first section after building the model and using the predict function on new data.

In the new R script that we created, we write the following R code:

# Create function that allows Azure ML to generate predictions using logistic regression model

prediction_function <- function(newdata) {
 prediction <- predict(glm_model_rds, newdata = newdata, type = "response")
 return(prediction)
}

Here, I saved this function as prediction_function.R


The Decision Function

When Azure ML receives new data and passes its arguments to this function, we expect the resulting predictive web-service to produce the predicted probability. What if our decision process required more than just the predicted probability?

Being able to create your own models and packages for use in Azure Machine Learning has considerable added benefits. In many cases, you may want the Azure ML API call to output decisions based on the predictions created by your machine learning model. Consider the following example:

# Create function with decision policy

decision_policy <- function(probability) {
 if (probability < 0.2) {return("Employee is low risk, occasional check-up where necessary.")}
 else if (probability >= 0.2 && probability < 0.6) {return("Employee is medium risk, take action in employee retention where necessary.")}
 else {return("Employee is high risk, notify upper management to ensure risk is mitigated in work environment.")}
}

This decision function takes the logistic regression model’s predicted probability for a new observation and applies a Human Resources policy that meets the organization’s needs. As you can see, instead of a predicted probability, this function recommends some form of action to be taken once the predictive model is used. The result of a predictive model can trigger many different company-wide policies, whatever the industry-specific application.
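
A policy written with if/else handles one probability at a time. If several observations need to be scored in one call, the same thresholds can be expressed with cut(). The following is a sketch only: the function name decision_policy_vec and the shortened labels are hypothetical and not part of the package described in this post.

```r
# Hypothetical vectorized variant of a three-tier policy
decision_policy_vec <- function(probability) {
  labels <- c("low risk", "medium risk", "high risk")
  # right = FALSE gives the intervals [-Inf, 0.2), [0.2, 0.6), [0.6, Inf)
  tiers <- cut(probability,
               breaks = c(-Inf, 0.2, 0.6, Inf),
               labels = labels,
               right = FALSE)
  as.character(tiers)
}

result <- decision_policy_vec(c(0.05, 0.2, 0.59, 0.6, 0.9))
print(result)
# "low risk" "medium risk" "medium risk" "high risk" "high risk"
```

Using half-open intervals closed on the left keeps the boundary behaviour identical to the "less than 0.2" and "less than 0.6" comparisons in the if/else version.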

Here’s another example in the alternative business-financing industry. A predicted probability of risk to a specific business owner can trigger different loan-product pricing policies, and trigger different employee actions to be taken. If the Azure ML API call can output a series of policies and rules, there is huge value in being able to automate decision processes in order to get that loan out faster or rejected faster.

Creating your own models in R and including decision policies within your R packages could be the solution to an automated decision process within any organization.

Now, back to package creation. Given the newly created decision_policy function, we need to be sure to update our prediction_function to be able to implement these new policies.

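# Create function that allows Azure ML to generate predictions and decisions using the logistic regression model

prediction_function <- function(newdata) {
 prediction <- predict(glm_model_rds, newdata = newdata, type = "response")
 # Apply the decision policy to each predicted probability
 return(sapply(prediction, decision_policy))
}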

With the prediction function and decision function ready to go, it is important that we run these functions so that they are saved within the package environment.

It is also important that we save this R script within the package folder. Here, I saved the decision_policy.R function and re-saved the prediction_function.R as shown in the tabs.



Once these three separate R scripts are saved, we are ready to build and save our package. To build and save the package, we do the following:

  • Click the “Build” tab in the top right corner


  • Click “Build & Reload”


  • Verify that the package was built and saved by going to the R library folder, “R/win-library/3.3”. Here, my library is saved in my Documents folder.


  • With the package folder from above, you want to create a .zip file of it. You can do this by right-clicking the file, going to “send to”, then selecting “Compressed (zipped) folder”. After doing so, it will create a .zip file of your package. Do not rename this .zip file. I also proceeded to drag this .zip file to my desktop.


  • This next step is extremely important. With the newly saved .zip file, you want to create ANOTHER .zip file of it. This is needed because of the way Azure ML reads in package files. This time I renamed the new .zip file as “2_azuremlglm”. You should now have a .zip file that contains a .zip file that contains the actual azuremlglm folder package. You can delete the first .zip file created from the previous step as it is no longer needed.


  • This is the resulting package file that will be uploaded to Azure ML.

Creating a Predictive Web Service in Microsoft Azure Machine Learning

We are in the final stretch of the implementation process! This last section will describe how to configure Microsoft Azure Machine Learning to utilize our logistic regression model and decision rules.

Uploading the Package File and Creating a New Experiment

Once we have logged in, we want to do the following steps:

  • Click the “NEW” button in the bottom left corner, click “DATASET”, and then click “FROM LOCAL FILE” as shown


  • Upload the .zip file created from the previous section


  • When the upload is successful, you should receive the following message at the bottom of the screen


  • Next, we create a new blank experiment. We do this by clicking the “NEW” button at the bottom left corner again, click “EXPERIMENT”, and then click the first option “Blank Experiment”


  • Now we are ready to configure our Azure Machine Learning experiment


Setting up the Experiment Platform

In order for Microsoft Azure Machine Learning to utilize our logistic regression model, we need to set up the platform in such a way that it knows to take in new data inputs and produce prediction outputs. We accomplish this with the following layout.


  • The Execute R Script on the left defines the schema of the inputs. This module will connect to the first input node of the second Execute R Script. The code inputted in this module is as follows


  • A module was placed for the package so that it can be installed within the Azure ML environment. This module is inputted into the third node of the second Execute R Script module.
  • The Execute R Script in the center is where we utilize the logistic regression model package. This module contains the following code


  • Once all of the above are satisfied, we are ready to deploy the predictive web service.
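
As a rough idea of what the schema-defining module might contain, the sketch below builds a one-row placeholder data frame with the feature columns used when testing the model earlier. This is an assumption about the module's content, not the original code. Inside Azure ML Studio (classic), an Execute R Script module would pass such a data frame on with maml.mapOutputPort("schema"); that call is only available inside Azure ML, so it appears here as a comment.

```r
# Feature names taken from the newdata example used to test the model
feature_names <- c("satisfaction_level", "last_evaluation", "number_project",
                   "average_montly_hours", "time_spend_company", "Work_accident",
                   "promotion_last_5years", "sales_accounting", "sales_hr",
                   "sales_IT", "sales_management", "sales_marketing",
                   "sales_product_mng", "sales_sales", "sales_support",
                   "sales_technical", "salary_low", "salary_medium")

# One placeholder row of zeros; only the column names and types matter here
schema <- as.data.frame(matrix(0, nrow = 1, ncol = length(feature_names),
                               dimnames = list(NULL, feature_names)))

str(schema)

# In Azure ML Studio (classic) this would be followed by:
# maml.mapOutputPort("schema")
```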

Deploying the Predictive Web service

  • At the bottom of the screen, we will deploy the web service by clicking on “DEPLOY WEB SERVICE”, then clicking “Deploy Web service (classic)”.


  • Azure ML will then automatically add the Web service input and Web service output modules to the appropriate nodes as follows


  • The Web service output automatically connected to the second output node of the Execute R Script module. We actually want this to connect to the first output node of the Execute R Script as shown


  • Click the “RUN” button at the bottom of the screen to verify the web service
  • Click the “DEPLOY WEB SERVICE” button once again, and select “Deploy Web service (Classic)”. The following page will show up


  • Finally, we are able to test that our predictive model works by clicking the blue “Test” button in the “REQUEST/RESPONSE” row.


  • After confirming the test, we should get the following result


  • This confirms that our predictive model works and all decision policies have been correctly implemented. The API is ready to go and can be consumed by external applications.

Further Considerations

Throughout this post, I showcased the process of implementing a simple predictive model using R and Microsoft Azure Machine Learning. Of course, there are much more efficient ways of using predictive models, such as directly using the Azure ML platform to train, validate and test machine learning models, or directly using the Execute R Script module and doing all the R hard-coding there.

I want to emphasize that the process outlined here may seem less efficient to build and carry out, but I think it offers a good way to organize and automate decision pipelines. By going through the process of building and creating R packages that can then be uploaded to Azure ML, we are able to implement many decision rules within the R package. For example, an organization may choose to implement several product pricing rules or internal decision policies as a result of what the predictive model outputs. There is plenty of room to automate these decisions for faster turnaround of work. Creating packages also gives us the ability to train, validate, and test more complex machine learning models and save their results accordingly. I am sure there are plenty of other reasons, beyond the ones I stated here, why building your own machine learning R packages and then uploading them to Azure ML is highly beneficial.

In the future, I look to implement this process by using more complex machine learning models rather than the simple logistic regression. I also look to learn some more software application development as this is clearly not the end of the data science pipeline. With Azure ML producing an API, it would be nice to be able to see the full extent of this pipeline by utilizing the API through my own created applications. Finally, some important takeaways from this post are the abilities to organize and automate an operational data science pipeline and the thought-process behind automating company-related decisions.

Overcoming the First Hurdle: From Knowing a Little to Learning a Lot

Growth as a data scientist will take on many forms and scale up several different paths depending on the function that you serve within your work environment. The learning curve as an early-stage Data Scientist will vary on several things such as your background education and knowledge, prior experiences within the field and industry, and whether you work within a team of data scientists, or as a standalone data scientist.

For myself, the learning curve was and continues to be steep and challenging. I began my career as a standalone data scientist for a start-up company, coming straight out of school and having very limited knowledge of the financial industry. All I had in my knowledge base at the time was an in-depth understanding of logistic regression, some economic analysis projects involving time series, and a toolkit consisting of R and Microsoft Excel. Out of uplifting encouragement, I could have done more to add to my skill set before I started my job, but with what I knew, with an eagerness to learn, and with an immense curiosity, I had exactly what I needed to begin my career.

My role as a data scientist is to build and maintain proprietary credit scoring models, and provide ad hoc analysis and reports upon request. There was already a whole list of challenges that I faced when I first started: a lack of appropriate credit scorecard building knowledge, a lack of knowledge of advanced data analytic techniques, the need to verify that my work met industry standards, and a lack of knowledge of how to closely monitor model effects.

These challenges pushed me to figure out the best practices and processes in the best way I thought possible. Here are some of the ways I went about addressing the challenges I faced during the start of my career.

Conducting Independent Research

My first instinct when approaching a problem where I have almost no background experience and no one to turn to for answers is to research! Having obtained a Master’s degree from a program that infused independent research heavily within its curriculum, this came naturally to me. For example, the most important thing in tackling a scorecard building project was first understanding it in its entirety and breaking it down into manageable and understandable pieces. It was extremely important to know why it is used, how it is used, and how it will benefit my company’s operations.

What often happened throughout my research was that I would find complex solutions that were difficult to implement without advanced enterprise software or advanced programming knowledge, or I would find solutions that seemed too easy and not convincing enough to use. This process of researching and attempting to reproduce certain projects on the internet definitely increased my technical understanding and in many ways helped me boost my proficiency in R. Along the way, I even picked up some Python and I also learned how to write queries in Microsoft SQL Server and MySQL to better streamline my data and model building processes.


Another challenge was ensuring that the credit scoring models were built following best practices within the financial industry. This was a little more difficult for two reasons. First, a scorecard for an alternative business-lending company differs immensely from the more common scorecards developed in the industry, such as those for personal loans. Second, the modelling practices for alternative subprime business-lending are still relatively new, with the emergence of these industries dating back to the 2008 Financial Crisis. Therefore, research is limited and most of the ideas behind these driving forces are proprietary.

To overcome this challenge, I engaged in some more internet research, but more importantly, I networked with industry professionals and took what I could from my discussions with them. Most of our discussions involved understanding what techniques were widely used in the industry. During this time, LinkedIn and my personal connections helped me learn how to overcome this challenge. I learned to set up interactions with professionals online, as well as to generate and connect ideas from those professionals within my own work.

Engaging in Trial and Error

There is high pressure when you first start as a data scientist, with expectations of completing your projects within specified deadlines. The scorecard was my very first project and with the limited knowledge that I had, I was almost forced into a situation of trial and error. Initially, my practices involved researching and building in an endless cycle, where I often updated the scorecard to meet new standards and practices I learned along the way. At the time, there was very little internal user feedback on the scorecard because it was assumed that it was performing exactly the way it should be. It was essential throughout this trial and error process that there was constant communication and understanding across the company in order to continue building a robust scorecard. Here, I learned a lot about not only the technical side of model building, but also found that my role as a standalone data scientist has a unique place within the operational team.

Being Prepared and Building Confidence

No data science problem at a high level of technicality and knowledge can be solved easily. As a standalone data scientist, where you are mostly doing things on your own and are expected to make educated executive decisions, you are bound to run into personal hurdles such as worries and frustrations. When something goes wrong with your models, you become the first person accountable, which in many ways can be unsettling. I came to realize that all of these feelings were natural and it was perfectly fine!

In order to overcome this challenge, it was always in my best interest to be prepared to provide thorough answers to questions that the company asked me, and be able to address concerns. Whenever there was a problem or concern raised with the models I built, or the data analysis methodologies, I was always forward with a positive answer or came up with a solution. It was in my best interest to be accountable and honest with my abilities. This stemmed from the realization that I do not know everything, but I do want to learn to make sure I do my best work in order to help the company grow. With the appropriate communication among upper management and their moral support, these personal challenges slowly faded and I actually began to expedite my learning of more applied business data science.

Moving Forward

What I appreciate the most about the early stages of my career is the amounts of learning that I have done and the huge amounts of growth I experienced as a person. With that said, the learning never ends as new modelling needs occur, data repositories grow with new data to be analyzed, and new modelling techniques and solutions are introduced with new technologies.

I know that as I continue along this career path, I am bound to learn some more programming, apply other predictive models, and conduct interesting kinds of analysis. With these ongoing changes within a fast-growing company, there is bound to be one problem solved with ten more problems arising. The best part of being in the early-stage of my career is that I know I still have a lot to learn, and as I move forward, I will anticipate the challenges ahead, and be more than happy to tackle them one step at a time.

Employee Turnover: A Risk Segmenting Investigation

In this post, I conduct a simple risk analysis of employee turnover using the Human Resources Analytics data set from Kaggle.

I describe this analysis as an example of simple risk segmenting because I would like to have a general idea of which combination of employee characteristics can provide evidence towards higher employee turnover.

To accomplish this, I developed a function in R that takes a data frame and two characteristics of interest in order to generate a matrix whose entries represent the probability of employee turnover given the two characteristics. I call these values turnover rates.
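
The core idea, a matrix of class-1 rates for each combination of two characteristics, can be sketched in a few lines of base R with tapply. This is an illustration on a made-up toy data frame and is not the function developed in this post.

```r
# Toy data, made up for illustration
toy <- data.frame(
  left   = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L),
  salary = c("low", "low", "low", "high", "high", "high", "low", "low"),
  dept   = c("hr", "hr", "it", "it", "hr", "it", "it", "hr")
)

# The mean of a 0/1 variable within each cell is the turnover rate for that cell
rates <- tapply(toy$left, list(toy$salary, toy$dept), mean)
print(round(rates, 2))
```

Here the low-salary hr cell contains three employees, two of whom left, giving a rate of 0.67.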

Human Resources Analytics Data

Firstly, let us go over the details of the human resources analytics data set.

hr_data <- read.csv("HR_comma_sep.csv", header = TRUE)



The variables are described as follows:

  • satisfaction_level represents the employee’s level of satisfaction on a 0 – 100% scale
  • last_evaluation represents the employee’s numeric score on their last evaluation
  • number_project is the number of projects accomplished by an employee to date
  • average_montly_hours is the average monthly hours an employee spends at work
  • time_spend_company is the amount of years an employee worked at this company
  • Work_accident is a binary variable where 1 means the employee experienced an accident, and 0 otherwise
  • left variable represents the binary class where 1 means the employee left, and 0 otherwise.
  • promotion_last_5years is a binary variable where 1 means the employee was promoted in the last 5 years, and 0 otherwise
  • sales is a categorical variable representing the employee’s main job function
  • salary is a categorical variable representing an employee’s salary level

The Rate Function

The following R code presents the function used to conduct this analysis.

# To use rate_matrix, a data frame df must be supplied and two column names from df must be known. The data frame must contain a numeric binary class feature y.
# If any of the characteristics are numeric on a continuous scale, a cut must be specified to place the values into categorical ranges or buckets.

rate_matrix <- function(df, y, c1 = NA, c2 = NA, cut = 10, avg = TRUE) {

# If y is not a binary integer, then stop the function.
if (is.integer(df[[y]]) != TRUE) { stop("Please ensure y is a binary class integer.") }

df_col_names <- colnames(df)

# If c1 and c2 are not available
if (is.na(c1) & is.na(c2)) { stop("Please recall function with a c1 and/or c2 value.") }

# If only c1 is provided
else if (is.na(c2)) {

if (is.integer(df[[c1]])) {
var1 <- as.character(df[[c1]])
var1 <- unique(var1)
var1 <- as.numeric(var1)
var1 <- sort(var1, decreasing = FALSE) }

else if (is.numeric(df[[c1]])) {
var1 <- cut(df[[c1]], cut)
df[[c1]] <- var1
var1 <- levels(var1) }

else {
var1 <- df[[c1]]
var1 <- as.character(var1)
var1 <- unique(var1)
var1 <- sort(var1, decreasing = FALSE) }

c1_pos <- which(df_col_names == c1) # Number of column of characteristic c1

var1_len <- length(var1)

m <- matrix(NA, nrow = var1_len, ncol = 1)

rownames(m) <- var1
colnames(m) <- c1

for (i in 1:var1_len) {
bad <- df[,1][which(df[,c1_pos] == var1[i] & df[[y]] == 1)]
bad_count <- length(bad)

good <- df[,1][which(df[,c1_pos] == var1[i] & df[[y]] == 0)]
good_count <- length(good)

m[i,1] <- round(bad_count / (bad_count + good_count), 2) } }

# If c1 and c2 are provided
else {
if (is.integer(df[[c1]])) {
var1 <- as.character(df[[c1]])
var1 <- unique(var1)
var1 <- as.numeric(var1)
var1 <- sort(var1, decreasing = FALSE) }

else if (is.numeric(df[[c1]])) {
var1 <- cut(df[[c1]], cut)
df[[c1]] <- var1
var1 <- levels(var1) }

else {
var1 <- df[[c1]]
var1 <- as.character(var1)
var1 <- unique(var1)
var1 <- sort(var1, decreasing = FALSE) }

if (is.integer(df[[c2]])) {
var2 <- as.character(df[[c2]])
var2 <- unique(var2)
var2 <- as.numeric(var2)
var2 <- sort(var2, decreasing = FALSE) }

else if (is.numeric(df[[c2]])) {
var2 <- cut(df[[c2]], cut)
df[[c2]] <- var2
var2 <- levels(var2) }

else {
var2 <- df[[c2]]
var2 <- as.character(var2)
var2 <- unique(var2)
var2 <- sort(var2, decreasing = FALSE) }

c1_pos <- which(df_col_names == c1) # Number of column of characteristic c1
c2_pos <- which(df_col_names == c2) # Number of column of characteristic c2

var1_len <- length(var1)
var2_len <- length(var2)

m <- matrix(NA, nrow = var1_len, ncol = var2_len)

rownames(m) <- var1
colnames(m) <- var2

class_1 <- max(df[[y]])
class_0 <- min(df[[y]])

for (i in 1:var1_len) {
for (j in 1:var2_len) {
bad <- df[,1][which(df[,c1_pos] == var1[i] & df[,c2_pos] == var2[j] & df[[y]] == class_1)]
bad_count <- length(bad)

good <- df[,1][which(df[,c1_pos] == var1[i] & df[,c2_pos] == var2[j] & df[[y]] == class_0)]
good_count <- length(good)
m[i,j] <- round(bad_count / (bad_count + good_count), 2) } } }

# Create class 1 matrix report that includes averages
if (avg == TRUE) {
ColumnAverage <- apply(m, 2, mean, na.rm = TRUE)
ColumnAverage <- round(ColumnAverage, 2)
RowAverage <- apply(m, 1, mean, na.rm = TRUE)
RowAverage <- round(RowAverage, 2)
RowAverage <- c(RowAverage, NA)
m <- rbind(m, ColumnAverage)
m <- cbind(m, RowAverage)
return(m) }
else {
return(m) }
}


Employee Turnover Data Investigation

To begin this data investigation, I use the assumption that I have gained significant amounts of experience and field knowledge within Human Resources. I begin this heuristic analysis with the thought that employee turnover is greatly affected by how an employee feels about their job and about the company.

Are employees with low satisfaction levels more likely to leave?

The first thing I would like to confirm is that employees with low satisfaction levels are more likely to leave.

satisfaction <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", cut = 20, avg = TRUE)



The function call here uses a cut value of 20 for no particular reason. I want a large enough cut value to provide evidence of my claim.
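
For readers unfamiliar with how a single cut value turns a continuous variable into ranges, here is a minimal sketch using base R's cut() on made-up satisfaction values. When the second argument is a single number, cut() splits the observed range into that many equal-width intervals.

```r
# Made-up satisfaction levels for illustration
satisfaction_toy <- c(0.09, 0.11, 0.38, 0.45, 0.72, 1.00)

# Split the observed range into 4 equal-width buckets
buckets <- cut(satisfaction_toy, 4)

# Each value is assigned to exactly one bucket
print(table(buckets))
```

cut() widens the outermost edges by 0.1% of the range so that the minimum and maximum fall inside a bucket, which is why the lowest range in the output starts slightly below the smallest observed value.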

As seen in the matrix, 92% of employees with satisfaction levels between 0.0891 and 0.136 will leave. This provides evidence that employees with low satisfaction levels are at the highest risk of leaving the company.

As we would expect, the highest levels of satisfaction of 0.954 to 1 experience 0% employee turnover.

For simplicity and ease of understanding, I define 0.5 as the average satisfaction level. Looking at the below average satisfaction levels of 0.363 to 0.408 and 0.408 to 0.454, there is an oddly large increase in the risk of employees leaving. This particular area of employee satisfaction requires more investigation because it goes against intuition.

Are employees with below average satisfaction levels more likely to leave across different job functions?

To address this concern of odd satisfaction levels defying our intuition, I continue the investigation by seeing whether satisfaction levels vary across other characteristics from the data. It is possible that these below average satisfaction levels are tied to job function.

satisfaction_salary <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "sales", cut = 20, avg = TRUE)



Here, the same ranges of 0.363 to 0.408 and 0.408 to 0.454 satisfaction levels are generally at high risk to leave even across all job functions. There is evidence to suggest that somewhat unhappy workers are willing to leave regardless of their job function.

Is an unhappy employee’s likelihood of leaving related to average monthly hours worked?

To continue answering why below average satisfaction levels ranges experience higher employee turnover than we expect, I take a look at the relationship between satisfaction levels and average monthly hours worked. It could be that below average satisfaction levels at this company are tied to employees being overworked.

# First, convert the integer variable average_montly_hours into a numeric variable to take advantage of the function's ability to breakdown numeric variables into ranges.

average_montly_hours <- hr_data["average_montly_hours"]
average_montly_hours <- unlist(average_montly_hours)
average_montly_hours <- as.numeric(average_montly_hours)

hr_data["average_montly_hours"] <- average_montly_hours

satisfaction_avghours <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "average_montly_hours", cut = 20, avg = TRUE)
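
The conversion matters because of how R classifies values: an integer vector satisfies both is.integer() and is.numeric(), so a function that checks is.integer() first treats it as a set of discrete values, while a double vector only satisfies is.numeric() and gets bucketed into ranges. A small sketch with made-up hours:

```r
# Made-up monthly hours stored as integers
hours_int <- c(150L, 160L, 200L)

# The same values stored as doubles
hours_num <- as.numeric(hours_int)

print(is.integer(hours_int))  # TRUE
print(is.numeric(hours_int))  # TRUE: integers are also numeric
print(is.integer(hours_num))  # FALSE: falls through to the numeric check
print(is.numeric(hours_num))  # TRUE
```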



To reiterate, the row ranges represent the satisfaction levels and the column ranges represent the average monthly hours worked. Here, there is strong evidence to suggest that the employees at risk within the below average satisfaction level ranges of 0.363 to 0.408 and 0.408 to 0.454 work between 117 and 160 hours a month.

Using domain knowledge, typically, a full-time employee will work at least 160 hours a month, given that a full-time position merits 40 hours a week for 4 weeks in any given month. The data suggests here that we have a higher probability of workers leaving given they work less than a regular full-time employee! This was different from my initial train of thought that the employees were potentially overworked.

Given this finding, I come to one particular conclusion: employees with highest risk of leaving are those that are on contract, seasonal employees, or are part-time employees.

By considering other variables such as the number of projects worked on by an employee, it is possible to further support this conclusion.

satisfaction_projects <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)



Here, it is evident that the below average satisfaction levels of 0.363 to 0.408 and 0.408 to 0.454 may in fact correspond to contract or part-time employees, as the probability of turnover sharply decreases after 2 projects completed.

Are contract, part-time or seasonal employees more likely to be unhappy if the job is accident-prone?

Now that we identified the high risk groups of employee turnover within this data set, this question comes to mind because we would like to address the fact that an employee’s enjoyment in their role should be tied to their satisfaction levels. It could be that these part-time employees are experiencing hardships during their time at work, thereby contributing to their risk of leaving.

To answer this question, I take a look at the satisfaction level and number of projects completed given that an employee experienced a workplace accident.

# I use the package dplyr in order to filter the hr_data data frame to only include observations that experienced a workplace accident
library(dplyr)

accident_obs <- filter(hr_data, Work_accident == 1)

satisfaction_accident <- rate_matrix(df = accident_obs, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)



Here, given the below average satisfaction levels of 0.363 to 0.408 and 0.408 to 0.454 for number of projects equal to 2 and given that employees experienced a workplace accident, there is evidence to suggest that there is a higher chance of turnover.

Further Work

The purpose of this analysis was to apply a risk segmenting method on human resources analytics data to identify potential reasons for employee turnover. I used probabilities or turnover rates to help identify some groups of employees that were at risk of leaving the company.

I found that there were higher chances of turnover when an employee had an extremely low satisfaction level, and also discovered that certain types of employees (contract, part-time, seasonal) could be identified as groups at high risk of turnover. I also explored the possibility that the unhappiness of part-time employees could be attributed to them working in jobs that were accident-prone.

With the example presented in this post, Human Resources can use this information to put more efforts into ensuring contract, part-time, or seasonal employees experience lower turnover rates. This analysis allowed us to identify which groups of employees are at risk and allowed us to identify potential causes.

This risk analysis approach can be applied to any other field of practice other than Human Resources, including Health and Finance. It is useful to be able to come up with quick generic risk segments within your population so that further risk management solutions can be implemented for specific problems at hand.

Lastly, this post only provides a simple way to segment and analyze risk groups, but it is not the only way! More advanced methods such as clustering and decision trees can help identify risk groups more thoroughly and informatively to provide an even bigger picture. For quick checks against domain expertise in any particular field of practice, the rate function I present here can be sufficient for identifying risk groups.

Extract, Transform, and Load Yelp Data using Python and Microsoft SQL Server

In this post, I will demonstrate a simple ETL process of Yelp data by calling the Yelp API in Python, and transforming and loading the data from Python into a Microsoft SQL Server database. This process is exemplary for any data science project that requires raw data to be extracted and stored to be consumed by other applications or used for further analysis.

Before we begin, the steps in this ETL process assume the following five things:

  1. We have a verified and registered Yelp account.
  2. We have Microsoft SQL Server and SQL Server Management Studio installed. This guide can help us install both Microsoft SQL Server 2014 Express and SQL Server 2014 Management Studio.
  3. We have Python and an IDE installed. This guide can help us install Anaconda which installs Python 3.6 and the Spyder IDE.
  4. The pyodbc module is installed after the installation of Anaconda. Refer to these instructions to install pyodbc using the Anaconda Prompt.
  5. We have a valid connection between Microsoft SQL Server and other local systems through the ODBC Data Source Administrator tool. Follow these simple steps to set up this connection.


To extract the raw Yelp data, we must make an API call to Yelp’s repositories.

Obtain App ID and App Secret

First, we go to the Yelp Developer page and scroll to the bottom and click ‘Get Started’.


Next, we click on ‘Manage App’ in the left menu bar and record our App ID and App Secret. I whited-out the App ID below but you would see some form of text there. We will need these values in order to call the API within Python.


Run Yelp API Python Script

Next, using the App ID and App Secret, we run the following Python script which calls the Yelp API. In this example, I will be requesting business data for Kiku Sushi, a sushi restaurant that I have ordered from a few times.

# We import the requests module which allows us to make the API call
import requests

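# Replace [app_id] with the App ID and [app_secret] with the App Secret
app_id = '[app_id]'
app_secret = '[app_secret]'
data = {'grant_type': 'client_credentials',
        'client_id': app_id,
        'client_secret': app_secret}

# Request an access token; replace [token_url] with the token endpoint given in the Yelp API documentation
token = requests.post('[token_url]', data = data)
access_token = token.json()['access_token']
headers = {'Authorization': 'bearer %s' % access_token}

# Call Yelp API to pull business data for Kiku Sushi
# Replace [business_url] with the business endpoint given in the Yelp API documentation; %s is where the business ID is substituted
biz_id = 'kiku-sushi-burnaby'
url = '[business_url]/%s' % biz_id
response = requests.get(url = url, headers = headers)
response_data = response.json()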

A successful API call will return the data in JSON format which is read by Python as a dictionary object.



Notice how the url variable within the script is a string whose value depends on the Yelp API documentation provided specifically for requesting business data.


The Request section in the documentation tells you the appropriate url to use. The Yelp API documentation provides a brief overview of the data points and data types received from the API call. The different data points and their respective data types are important to know when we load the data to the Microsoft SQL Server database later on.

Accessing the Dictionary

Using the documentation, we can extract a few data points of interest by accessing the dictionary as you normally would using Python syntax. The following lines of code will provide examples of some data extractions.

# Extract the business ID, name, price, rating, review count and address

biz_id = response_data['id']
biz_name = response_data['name']
price = response_data['price']
rating = response_data['rating']
review_count = response_data['review_count']
location = response_data['location']
address = location['display_address']
street = address[0]
city_prov_pc = address[1]
country = address[2]

At this point, the extraction of the data is complete and we move onto transforming the data for proper storage into Microsoft SQL Server.


To transform the extracted data points, we simply reassign the data types. If we do not complete this step, we will run into data type conversion issues when storing it within Microsoft SQL Server.

The following code simply reassigns the data types to the extracted data points that we would like to store.

# Reassign data types to extracted data points
biz_id = str(biz_id)
biz_name = str(biz_name)
price = str(price)
rating = float(rating)
review_count = int(review_count)
street = str(street)
city_prov_pc = str(city_prov_pc)
country = str(country)

After the transformations are complete, we move into the final stage of loading the data into Microsoft SQL Server.


In order to load data into a database such as those in Microsoft SQL Server, we need to ensure that we have a table created with the appropriate column fields and column types.

Microsoft SQL Server Table Creation

After we log into our default database engine in SQL Server Management Studio, we set up and run the following T-SQL code.

-- Note that the number assigned to each varchar represents the number of characters that the data point can take up

CREATE TABLE Yelp (id varchar(50), name varchar(50), price varchar(5), rating float, review_count int, street varchar(50), city_prov_pc varchar(50), country varchar(50))

This effectively creates a table with the appropriate data types that allows us to store the Yelp data we extracted and transformed.


When we run the T-SQL code, we should see an empty table. This verifies successful table creation.


Transferring Data from Python to Microsoft SQL Server

The last step is to run a Python script that takes the data points and saves them into Microsoft SQL Server. We run the following Python code to accomplish this task.

# We import the pyodbc module which gives us the ability and functionality to transfer data straight into Microsoft SQL Server
import pyodbc

# Connect to the appropriate database by replacing [datasource_name] with the data source name as set up through the ODBC Data Source Administrator and by replacing [database_name] with the database name within SQL Server Management Studio
datasource_name = '[datasource_name]'
database = '[database_name]'
connection_string = 'dsn=%s; database=%s' % (datasource_name, database)
connection = pyodbc.connect(connection_string)

# After a connection is established, we write out the data storage commands to send to Microsoft SQL Server
cursor = connection.cursor()

cursor.execute('INSERT INTO Yelp (id, name, price, rating, review_count, street, city_prov_pc, country) values (?, ?, ?, ?, ?, ?, ?, ?)', biz_id, biz_name, price, rating, review_count, street, city_prov_pc, country)

# pyodbc does not autocommit by default, so we commit the transaction to save the row and then close the connection
connection.commit()
connection.close()


After this script is run, we can do a final check that the data has been successfully loaded onto the Microsoft SQL Server database by rerunning a Yelp table query. Once we do, we see that we have in fact successfully transferred the data over.
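The insert, commit and check pattern can also be rehearsed without a running SQL Server instance. The following is only a minimal sketch: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for Microsoft SQL Server (an assumption made purely so the example is self-contained), and the row values are made up for illustration rather than taken from a real API call.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Microsoft SQL Server (assumption)
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# Same column layout as the Yelp table created earlier
cursor.execute('CREATE TABLE Yelp (id varchar(50), name varchar(50), price varchar(5), '
               'rating float, review_count int, street varchar(50), '
               'city_prov_pc varchar(50), country varchar(50))')

# Made-up values standing in for the transformed data points
row = ('example-business', 'Example Sushi', '$$', 4.5, 12,
       '123 Example St', 'Burnaby, BC V5H 0A0', 'Canada')

# Parameterized insert followed by a commit so that the row is saved
cursor.execute('INSERT INTO Yelp (id, name, price, rating, review_count, street, '
               'city_prov_pc, country) VALUES (?, ?, ?, ?, ?, ?, ?, ?)', row)
connection.commit()

# Final check: query the table and confirm that the row is there
cursor.execute('SELECT id, name, rating, review_count FROM Yelp')
loaded_rows = cursor.fetchall()
print(loaded_rows)

connection.close()
```

With pyodbc the same three steps apply: execute the parameterized insert, commit, then query the table to confirm the load.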



This simple ETL process for Yelp data demonstrated the ability to tap into Yelp’s data repository using Python, simple data type considerations and loading data into Microsoft SQL Server.

One thing to note here is that we did not consider the more difficult data points to extract. For example, the Yelp API provides a data point corresponding to a restaurant’s operational hours which is stored as a dictionary within a list within a dictionary. Although not too difficult to extract, these kinds of data points do require more work.
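As a rough illustration of that extra work, the sketch below flattens a nested hours structure into one row per opening period. The sample dictionary is hand-written to mimic the shape described above (a dictionary within a list within a dictionary); the key names 'hours', 'open', 'day', 'start' and 'end' are assumptions that should be checked against the Yelp API documentation, and flatten_hours is a hypothetical helper rather than part of any library.

```python
def flatten_hours(response_data):
    """Return a list of (day, start, end) tuples from the nested hours data."""
    rows = []
    # 'hours' is assumed to be a list of dictionaries, each holding a list under 'open'
    for hours_block in response_data.get('hours', []):
        for period in hours_block.get('open', []):
            rows.append((int(period['day']), str(period['start']), str(period['end'])))
    return rows

# Hand-written sample mimicking the assumed structure (not real API output)
sample_response = {
    'id': 'example-business',
    'hours': [
        {'open': [
            {'day': 0, 'start': '1100', 'end': '2100'},
            {'day': 1, 'start': '1100', 'end': '2100'},
        ]}
    ],
}

hours_rows = flatten_hours(sample_response)
print(hours_rows)
```

Each tuple could then be stored as its own row in a separate hours table that is keyed by the business ID.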

Secondly, we should note that some data points are not always readily available because restaurant owners choose not to fill out this information. Also, as documented by Yelp, there will be no data available from an API call if the restaurant does not have any reviews (even if it is clear that they have a Yelp page)! We would have to account for the potential errors from the inability to extract specific information. For example, we could set up try-except blocks in the Python code and have Microsoft SQL Server store NULL values.
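One way this error handling might look is sketched below. The helper safe_get is hypothetical, and the sample dictionary is hand-written with the price field deliberately left out. Missing or unconvertible values come back as None, which pyodbc sends to SQL Server as NULL when passed as a query parameter.

```python
def safe_get(data, key, cast=str):
    """Return cast(data[key]), or None if the key is missing or cannot be converted."""
    try:
        value = data[key]
        if value is None:
            return None
        return cast(value)
    except (KeyError, TypeError, ValueError):
        return None

# Hand-written sample that is missing the 'price' data point (not real API output)
sample_response = {'id': 'example-business', 'name': 'Example Sushi',
                   'rating': 4.5, 'review_count': 'unknown'}

biz_id = safe_get(sample_response, 'id')
price = safe_get(sample_response, 'price')                     # missing key
rating = safe_get(sample_response, 'rating', float)
review_count = safe_get(sample_response, 'review_count', int)  # cannot be converted

print(biz_id, price, rating, review_count)
```

Catching only KeyError, TypeError and ValueError keeps unrelated bugs visible instead of silently turning them into NULL values.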

Another thing to note is that there are security and efficiency considerations for loading data into a database. This exercise did not consider database design, where it is almost always more efficient to have row keys and important to minimize the storage space taken up by each data type. It also did not demonstrate access to a secure database (where a username and password are required).
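To give a feel for what a row key buys us, here is a small sketch that declares the business ID as a primary key and shows a duplicate load being rejected. It again uses Python's built-in sqlite3 module as a stand-in for Microsoft SQL Server (an assumption made for self-containment), with a shortened table and made-up values.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Microsoft SQL Server (assumption)
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# Shortened version of the Yelp table with the business ID as the row key
cursor.execute('CREATE TABLE Yelp (id varchar(50) PRIMARY KEY, name varchar(50), rating float)')

insert_sql = 'INSERT INTO Yelp (id, name, rating) VALUES (?, ?, ?)'
cursor.execute(insert_sql, ('example-business', 'Example Sushi', 4.5))
connection.commit()

# Loading the same business a second time violates the primary key
try:
    cursor.execute(insert_sql, ('example-business', 'Example Sushi', 4.5))
    connection.commit()
    duplicate_rejected = False
except sqlite3.IntegrityError:
    connection.rollback()
    duplicate_rejected = True

cursor.execute('SELECT COUNT(*) FROM Yelp')
row_count = cursor.fetchone()[0]
print(duplicate_rejected, row_count)

connection.close()
```

Without the key, rerunning the load script would quietly add a second identical row for the same restaurant.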

Although it is obvious that there is more that can be done, this post hints at the many ways we may choose to further work with this data. Now that this data is stored in a nice tabular format within Microsoft SQL Server, we can use it for further analysis or other purposes within our data science projects. Further work can be done to automate the data extraction process and set up more advanced SQL tables. Finally, there are a wide variety of social media APIs out there to try out and master.

5 Lessons in Applied Data Science from Alternative Business Lending

After being part of a fast-growing financial company, Merchant Advance Capital, for about a year, I have come to accept the limitations of eagerly diving into data that is unique to the industry. Initially, it was frustrating to see that so many modelling practices and standards learned throughout my education could not simply be followed within the alternative business lending industry. As I slowly started to peel back what I knew, and began to open myself up more to things that I did not know in practice, I soon noticed that I needed to adapt my attitude and skills to what the company really needed from my role. I want to talk a little bit about what I have learned thus far and reflect on these lessons so that they may help me push forward in becoming a better Data Scientist.

1. Applying data science is pointless if you don’t know the data you’re working with and how it relates to your problem at hand.

The bulk of Merchant Advance Capital’s alternative lending practice is providing loans to subprime businesses within Canada. Many of these businesses lack the collateral to successfully obtain loans from a bank or are considering quick and cheap alternatives for their business needs. One important thing to note here is that building models to predict risk levels of different businesses requires knowing exactly what kinds of businesses you are lending to. It is easy and sometimes tempting to gather a bunch of business characteristics and immediately send them through a machine learning algorithm to obtain predictions. It is always better to carefully choose, craft and analyze these characteristics and ensure that the relationships drawn from them make intuitive business sense. Domain expertise is crucial.

2. Refrain from using machine learning algorithms where you cannot fully interpret the relationship between business characteristics and your model predictions.

I had to learn this the hard way when several reporting issues came about through different avenues. One such avenue was within the operations department, where loan application administrators had a difficult time translating machine learning predictive outcomes to business owners and their respective sales representatives. As a result, there began to be a lack of trust in the scorecard regime. In the event that a loan is rejected, these respective parties deserve a fair reason as to why they have been declined. If you were to build risk scorecards using black-box methods, more often than not, your predictions will be very hard to interpret at a characteristic-by-characteristic level. It would also be difficult to explain why a business owner scored a certain way if a sales representative demanded a specific reason for the decline.

3. Refrain from using machine learning algorithms where you cannot fully understand the costs and benefits of your model predictions.

When first developing a risk scorecard, little did I know how significantly involved its use would be within the core business of the company. The predictions of your machine learning model can translate into restrictions on product pricing and the promotion of certain products to different segmented populations. It is so important that the characteristics used to describe and understand your target population are quantifiable and make intuitive business sense. It could so happen that these characteristics will be a unique aspect of your customer base that generates the most money or generates the most loss.

4. There must be a balance between the implementation of machine learning algorithms and the use of them at the operational level.

One of the biggest hypes in data science is the ability to utilize, understand and process big data in a matter of minutes. Applied data scientists often face challenges that are operation-specific, such as a lack of data automation, collection and organization. In a subprime lending industry where the bulk of our customer base is somewhat technologically averse, the simplest solution for loan applications is through e-mail and paper submissions.

With huge technological inefficiencies as a restriction on the data pipeline, I often run into a give-and-take situation with respect to predictive modelling and process automation. Sometimes efficiency is accomplished by not including every business characteristic in the model because it either cannot be automated, its availability is costly or it is simply untrustworthy. I often run into unfavourable validation statistics that could have easily been solved with the provision of more uncorrelated predictive features, but the data collection is inefficient and expensive.

Sometimes predictive prowess and operational efficiency have to go hand in hand. Of course, short-term downfalls such as these can slowly be overcome as operational changes improve, technological capabilities are enhanced, and further research is done to understand which data points are worth collecting.

5. The Financial Industry is well-known for its standard modelling practices and conservatism. Sometimes, it is more beneficial to use these practices as benchmarks and gain flexibility using alternative underwriting practices.

It is important to know what kinds of data are unique to the company and what would not typically be looked at by major financial institutions. With the uproar of social media presence among today’s businesses, bad online reviews, nicely composed websites or product images can make or break the decision to receive financing. In cases like these, data science can immensely enhance the power of underwriting applications. The utilization of social media text analytics, geo-locational analysis, and the human experience can trump the analysis of a few financial ratios that financial institutions would normally be restricted to using.