Scorecard Building in R – Part V – Rejected Sample Inference, Grade Analysis and Scoring Techniques

In the previous section, part IV of the scorecard-building process, I trained, validated and tested the logistic regression model that serves as the heart of the scorecard. In this section, I address the obvious sample selection problem: loans are accepted based on merit and personal credit information, and rejected for lack of credentials. I also analyze model behavior by using the predicted scores for the training set to build a grading scheme. As an extra exercise, I scale the log-odds score into a more understandable scoring function.

To fully account for the sample selection bias in our model, performance inference methods are used to predict how rejected clients would have performed had they actually been given a loan. The first step is to create a function that maps the features shared across applicants to the log-odds of the accepted customers; this function is then used to predict probabilities for the rejected customers. The rejected data set does not indicate which applicants applied for a 36-month loan term, so I assume that all of these applicants did.

I will utilize the following packages:

library(dplyr)     # data frame manipulation
library(parallel)  # parallel processing
library(ggplot2)   # plotting
I will need the data set of rejected applicants and the fields available for them; this can be downloaded from the Lending Club website.

rejected_data <- read.csv("C:/Users/artemior/Desktop/Lending Club Model/RejectStatsD.csv")

To start off, I create a new logistic regression model with the same Bad_Binary response and only the features common to both accepted and rejected applicants: zip code, state and employment length. The data I use comes from part II, specifically WOE_matrix_final and the Bad variable. Building this inference model is assumed to take into account all of the variable WOE transformations. Bad_Binary_Original calls upon the original Bad variable from the features_36 data set.

Bad_Binary_Original <- features_36$Bad
sample_inference_features <- WOE_matrix_final[c("zip_code", "addr_state", "emp_length")]
sample_inference_features["Bad_Binary"] <- Bad_Binary_Original

Run a simple generalized linear model on the accepted applicants data set.

sample_inference_model <- glm(Bad_Binary ~ ., data = sample_inference_features,
                              family = binomial(link = "logit"))

Here, I calculate the WOEs for the rejected applicants by applying the WOE tables from the accepted applicants to the rejected-applicant data set. The code and transformation methodology are exactly those of part III.

features_36_inference <- rejected_data %>%
  select(-Amount.Requested, -Application.Date,
         -Loan.Title, -Risk_Score,
         -Debt.To.Income.Ratio, -Policy.Code)

features_36_inference_names <- colnames(features_36_inference)

Initiate cluster for parallel processing.


number_cores <- detectCores() - 1
cluster <- makeCluster(number_cores)
clusterExport(cluster, c("IV", "min_function", "max_function",
"only_features_36", "recode", "WOE_tables"))

Create the WOE matrix table for the rejected data applicants.

WOE_matrix_table_inference <- parSapply(cluster, as.matrix(features_36_inference_names),
FUN = WOE_tables_function)

WOE_matrix_inference is the WOE-transformed matrix for the rejected-applicant data set. This is the data set we will use to predict their scores for model performance inference.

WOE_matrix_inference <- parSapply(cluster, features_36_inference_names,
FUN = create_WOE_matrix)

I use WOE_matrix_inference to generate predicted probabilities from sample_inference_model, which was built on the accepted-applicant data set.

rejected_inference_prob <- predict(sample_inference_model,
                                   newdata = WOE_matrix_inference,
                                   type = "response")
rejected_inference_prob_matrix <- as.matrix(rejected_inference_prob)
rejected_inference_prob_dataframe <- as.data.frame(rejected_inference_prob_matrix)
colnames(rejected_inference_prob_dataframe) <- c("Probabilities")


Now I obtain the predicted probabilities for all accepted applicants using the cross-validated elastic-net logistic regression model from part IV.

number_cores <- detectCores()
cluster <- makeCluster(number_cores)

# "elastic_net_model" is a placeholder for the cross-validated elastic-net
# model object fitted in part IV.
accepted_prob <- predict(elastic_net_model, LC_WOE_Dataset, type = "prob")
accepted_prob_matrix <- as.matrix(accepted_prob[, 2])
accepted_prob_dataframe <- as.data.frame(accepted_prob_matrix)
colnames(accepted_prob_dataframe) <- c("Probabilities")

I combine the probabilities from both the rejected and accepted applicants and generate a graph that depicts the distribution of the probabilities.

probability_matrix <- rbind(accepted_prob_matrix, rejected_inference_prob_matrix)
probability_matrix <- as.data.frame(probability_matrix)
colnames(probability_matrix) <- c("Probabilities")

I use ggplot to plot the distribution of the probabilities of default. Here we see that the distribution, which represents both accepted and rejected applications, is left-skewed. It is also useful to analyze the distributions of the accepted and rejected applicants separately.

probability_distribution <- ggplot(data = probability_matrix, aes(Probabilities))
probability_distribution <- probability_distribution + geom_histogram(bins = 50)

accepted_probability_distribution <- ggplot(data = accepted_prob_dataframe, aes(Probabilities))
accepted_probability_distribution <- accepted_probability_distribution + geom_histogram(bins = 50)


rejected_probability_distribution <- ggplot(data = rejected_inference_prob_dataframe, aes(Probabilities))
rejected_probability_distribution <- rejected_probability_distribution + geom_histogram(bins = 50)


The accepted probability distribution is left-skewed, while the distribution for rejected applicants is roughly normal. This could be interpreted as rejected applicants exhibiting normally distributed scores had they been funded. Sample selection bias is therefore minimal if applicants are effectively rejected at random across that normal distribution; if this were not the case, I would suspect bias in how applicants are accepted and rejected.

After gaining some insight into how the model will perform on a population it has never seen before, we formalize the scorecard by creating a grading scheme that defines several levels of risk.

Here, I organize the accepted-applicant probabilities into bins to initiate a lift analysis. Lift analysis helps determine which score bins should be described by particular letter grades. I create 25 bins to mimic the sub-grade system LC uses in its public data set. I then append the “Bad” column from features_36 to this vector and summarize the information so that I can calculate the proportion of bads within each bin.

bins = 25
Bad_Binary_Values <- features_36$Bad
prob_bad_matrix <- as.data.frame(cbind(accepted_prob_matrix, Bad_Binary_Values))
colnames(prob_bad_matrix) <- c("Probabilities", "Bad_Binary_Values")

I sort the probabilities and binary values in increasing order of the probabilities column. By ordering on the first column, the corresponding values in the second column are sorted along with it.

Probabilities <- prob_bad_matrix[,1]
Bad_Binary_Values <- prob_bad_matrix[,2]
order_accepted_prob <- prob_bad_matrix[order(Probabilities, Bad_Binary_Values, decreasing = FALSE),]

I create the bins based on the sorted probabilities and create a new data frame consisting of only the bins and bad binary values. This will be the data frame I use to conduct a lift analysis.

bin_prob <- cut(order_accepted_prob$Probabilities, breaks = bins, labels = 1:bins)
order_bin <- as.data.frame(cbind(bin_prob, order_accepted_prob[, 2]))
colnames(order_bin) <- c("Bin", "Bad")

I summarize the information where I calculate the proportion of bads within each bin.

bin_table <- table(order_bin$Bin, order_bin$Bad)

Bin_Summary <- group_by(order_bin, Bin)

Bad_Summary <- summarize(Bin_Summary, Total = n(), Good = sum(Bad), Bad = 1 - Good/Total)

Using Bad_Summary, I plot a bar plot that represents the lift analysis.

lift_plot <- ggplot(Bad_Summary, aes(x = Bin, y = Bad))
lift_plot <- lift_plot + geom_bar(stat = "identity", colour = "skyblue", fill = "skyblue")
lift_plot <- lift_plot + xlab("Bin")
lift_plot <- lift_plot + ylab("Proportion of Bad")


Here, the graph shows the 25 bins and the proportion of bad customers within each. As expected, the proportion of bads decreases across bins, which shows the effectiveness of our classifier. By separating the scores into 25 bins, I mimic LC’s sub-grading system and could apply the exact same logic to this scorecard.

To finalize the scorecard, I generate a linear function of log-odds and apply a three-digit score mapping system that will assist upper management in understanding the risk score obtained from the scorecard. First I convert the probability scores into log-odds form and figure out what type of linear transformation I would like to apply.

Accepted_Probabilities <- Probabilities
LogOdds <- log(Accepted_Probabilities / (1 - Accepted_Probabilities))

The score scale is up to the analyst and how they feel it is best presented to upper management for easy interpretation. Here, I apply a three-digit score transformation to the log-odds using a simple linear function. To calculate the slope of the line, I map the range between the minimum and maximum log-odds onto a score range of 100 to 1000.

max_score <- 1000
min_score <- 100
max_LogOdds <- max(LogOdds)
min_LogOdds <- min(LogOdds)

linear_slope <- (max_score - min_score)/(max_LogOdds - min_LogOdds)
linear_intercept <- max_score - linear_slope * max_LogOdds

Here, the linear slope is 82.6 and the intercept is 1000, which means an applicant at the maximum log-odds receives the maximum score of 1000. As the log-odds decrease based on the features fed into the model, the score decreases proportionally. The way to interpret the score is that for every one-unit drop in log-odds, the score decreases by 82.6 points; someone two log-odds units riskier scores 165.2 points lower. Every 82.6-point step therefore indicates a level of riskiness that is easily understood by non-technical audiences.
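The mapping can be packaged as a small helper function; this is a sketch with the slope and intercept hard-coded to the values reported above (the function name is my own):

```r
# Convert a log-odds value to the three-digit score using the linear
# transformation derived above: score = slope * log_odds + intercept.
log_odds_to_score <- function(log_odds, slope = 82.6, intercept = 1000) {
  slope * log_odds + intercept
}

log_odds_to_score(0)   # maps the highest log-odds region to 1000
log_odds_to_score(-1)  # one unit lower in log-odds: 917.4
```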

Concluding Remarks

There you have it! After five parts of the scorecard-building process, the scorecard is ready for presentation and production. I could go even further into some of the things that could be changed, such as adjusting scoring thresholds to meet business requirements or accepted levels of risk. I could also explain how to link the scorecard and grades to expected losses and profit margins for the company. This extends beyond what I have demonstrated here, but it is definitely worth considering, especially for a financial company.

Source Code

The R code for all 5 parts of the scorecard building process can be found at my Github page.

Scorecard Building in R – Part II – Data Preparation and Analysis

I used the data frame manipulation package ‘dplyr’, some basic parallel processing to speed up the code with the package ‘parallel’, and the ‘Information’ package, which allows me to analyze the features within the data set using weight-of-evidence and information value.

library(dplyr)
library(parallel)
library(Information)
First, I read in the Lending Club csv file downloaded from the Lending Club website. The file is saved on my local desktop and easily accessed with the read.csv function.

data <- read.csv("C:/Users/artemior/Desktop/Lending Club model/LoanStats3d.csv")

Next, I create a column that indicates whether I will keep an observation (row) or not. This is based on loan status: for a predictive logistic regression model, I want only the statuses that can be strictly defined as a ‘Good’ or a ‘Bad’ loan.

data <- mutate(data,
Keep = ifelse(loan_status == "Charged Off" |
loan_status == "Default" |
loan_status == "Fully Paid" |
loan_status == "Late (16-30 days)" |
loan_status == "Late (31-120 days)",
"Keep", "Remove"))

After creating the ‘Keep’ column I filter the data depending on whether the observation had “Keep” or “Remove”.

sample <- filter(data, Keep == "Keep")

I further filter the data set to create two new samples. Lending Club offers two mutually exclusive loan terms. To better predict the riskiness of its loans, we can create two sub-risk models: one for 36-month term loans and one for 60-month term loans.

sample_36 <- filter(sample, term == " 36 months")
sample_60 <- filter(sample, term == " 60 months")

For the purposes of this scorecard-building demonstration I will create a model using the 36-month term loans. Using the mutate function, I create a new column called ‘Bad’, which will be my binary dependent variable in the logistic regression.

sample_36 <- mutate(sample_36, Bad = ifelse(loan_status == "Fully Paid", 1, 0))

The next step is to clean up the table to remove any data points I do not want to include in the prediction model. Variables such as employment title would take more time to analyze so for the purposes of this analysis I remove them.

features_36 <- sample_36 %>% select(-id, -member_id, -loan_amnt,
-funded_amnt, -funded_amnt_inv, -term, -int_rate, -installment,
-grade, -sub_grade, -pymnt_plan, -purpose, -loan_status,
-emp_title, -out_prncp, -out_prncp_inv, -total_pymnt, -total_pymnt_inv,
-total_rec_int, -total_rec_late_fee, -recoveries, -last_pymnt_d, -last_pymnt_amnt,
-next_pymnt_d, -policy_code, -total_rec_prncp, -Keep)

To further understand the data, I want to take a look at the number of observations per category under each variable. This will weed out any data points that could be problematic in future algorithms.
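A quick way to do this is to count the distinct values in each column. The sketch below uses a toy data frame for illustration; in practice the same sapply call would be run on features_36:

```r
# Number of distinct values per feature; a feature with a single level
# carries no information and is a candidate for removal.
features_toy <- data.frame(emp_length = c("1 year", "2 years", "1 year"),
                           policy     = c(1, 1, 1))
category_counts <- sapply(features_toy, function(col) length(unique(col)))
category_counts  # emp_length: 2, policy: 1
```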

Once the features table is complete, I use the methodology of information value to transform the raw feature data. In theory, transforming the raw data into a proportional log-odds value, as seen in the weight-of-evidence, maps better onto a logistic regression fitted curve.

IV <- create_infotables(data = features_36, y = "Bad")

We can generate a summary of the IV’s for each feature. The IV for a particular feature represents the sum of individual bin IV’s.
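With the Information package, that summary is available directly from the object returned by create_infotables; a sketch, assuming the IV object created above:

```r
# IV$Summary is a data frame with one row per feature (Variable, IV),
# sorted from most to least predictive.
print(head(IV$Summary, 10))
```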


We can even check the IV tables for individual features and see how each feature was binned, the percentage of observations that the bin represents out of the total number of observations, the WOE attributed to the bin and as well as the IV. The following code is an example of presenting the feature summary for the last credit pull date.
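The per-bin table for a single feature lives under IV$Tables; for the last credit pull date the Lending Club column name is last_credit_pull_d (a sketch, assuming the IV object from above):

```r
# Each element of IV$Tables shows, per bin: the bin label, N, the Percent
# of observations the bin represents, its WOE and its IV contribution.
print(IV$Tables$last_credit_pull_d)
```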


I analyze the behavior of continuous and ordered-discrete variables by plotting their weight-of-evidences. In theory, the best possible transformation occurs when the weight-of-evidences exhibit a monotonic relationship. First, I define features_36_names_plot as the vector of column names over which I will apply a function that plots a WOE graph for each feature. I remove categorical features that would generate far too many bins to plot; for example, I removed zip_code, which has over 500 distinct values.

features_36_names_plot <- colnames(features_36)[c(-7, -11, -ncol(features_36))]

Here is the code for the plotWOE function I mentioned. This function generates a WOE plot for input x, where x is a string representing the column name of a specific feature. Recall that I generated the list of strings in features_36_names_plot.

plotWOE <- function(x) {
p <- plot_infotables(IV, variable = x, show_values = TRUE)
return(p) }

To keep the for loop code clean and fast, I store the length of the feature name vector.

feature_name_vector_length_plot <- length(features_36_names_plot)

Now for the fun part: to generate a graph for each feature, I use a for loop that goes over every string in features_36_names_plot and plots a WOE graph for the corresponding feature in the features_36 matrix. To be safe, I added an error-handling portion of code, because somewhere in this huge matrix of features there may be a feature or two for which a WOE plot cannot be created. This occurs if a particular feature contains only one category or value across every observed loan.

for (i in 1:feature_name_vector_length_plot) {
p <- tryCatch(plotWOE(features_36_names_plot[i]),
error = function(e)
{print(paste("Removed variable: ",
features_36_names_plot[i])); NaN})
print(p) }

About 90 graphs are generated by the for loop. Below I present and discuss two examples of the kinds of graphs produced and what they mean.


The home ownership weight-of-evidence plot displays how a greater proportion of good consumer-loan customers own their homes, while a greater proportion of bad consumer-loan customers pay rent where they live. Those still paying off a mortgage are slightly better customers.


The months-since-delinquency (time since you last failed to pay off some form of credit) weight-of-evidence plot presents another intuitive relationship. The more months that have passed since a customer’s most recent delinquency, the more likely they are to be a good customer and pay off their loan. Fewer months since the most recent delinquency means the customer has only recently failed to pay off other forms of credit. This goes to show that even if you have had a delinquency, you can improve your credit management and behavior over time.

In the plot, something odd happens for customers whose delinquency occurred 19 to 31 months before they received another consumer loan. This could suggest a lagging effect: it can take time to fully chase down a customer, and months of notification may be given before the customer is actually classified as delinquent.

In the next post, Scorecard Building – Part III – Data Transformation, I am going to describe how the data we prepared and analyzed using Information Theory will be transformed to better suit a logistic regression model.

Scorecard Building in R – Part I – Introduction

Part of my job as a Data Scientist is to create, update and maintain a small-to-medium business scorecard. This machine learning generated application allows its users to identify applicants that are more likely to pay back their loan or not. Here, I take the opportunity to showcase the steps I take in building a reliable scorecard, and the analysis associated with evaluating it by using R. I will accomplish this with the use of public data provided by the consumer and commercial lending company, Lending Club (downloaded here).

Here is an overview of the essential steps to take when building this scorecard:

  1. Data Collection, Cleaning and Manipulation
  2. Data Transformations: Weight-of-Evidence and Information Value
  3. Training, Validating and Testing a Model: Logistic Regression
  4. Scorecard Evaluation and Analysis
  5. Finalizing Scorecard with other Techniques

See the next post, Scorecard Building – Part II – Data Preparation and Analysis to see how the data is prepared for further scorecard building.

A Friendly BC Hydro Electricity Consumption Analysis using Tableau

If there is something to appreciate about the Canadian West Coast, it is the way it leads by example through environmentally friendly practices. One way British Columbia takes on this initiative is by administering electrical energy in the cleanest and most cost-efficient way possible, through BC Hydro, a Canadian-controlled crown corporation responsible for providing British Columbia residents with reliable and affordable electricity. British Columbia prides itself on delivering electrical energy and is known to have some of the lowest consumer electricity prices in Canada. One way to show appreciation for natural resource consumption is, of course, to take a look at our very own personal electricity consumption. Before diving into any analysis, it might be useful to provide some context on how BC Hydro prices electricity consumption.

Electricity Pricing

BC Hydro uses a two-stage pricing algorithm where consumers are required to pay $0.0858 per kWh up to a max consumption threshold of 1350 kWh within a two-month period. This rate increases to the second stage at $0.1287 per kWh if the consumer uses over 1350 kWh within the two months. In addition to the two-stage pricing algorithm, consumers are required to pay a base rate of $0.1899 times the number of days in their billing period, and pay a rider-rate which is a buffer cost to consumers to cover unpredictable economic circumstances such as abnormal market prices or inaccurate water level forecasts. The entire cost becomes subject to GST and the total is the payable amount during every billing period.  If you want to read for knowledge’s sake, BC Hydro thoroughly explains the pricing of electricity consumption on their website.
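The pricing just described can be sketched as a small billing function. The step rates, threshold and basic daily charge are the figures quoted above; the rider rate and GST percentages used here are illustrative assumptions:

```r
# Sketch of a two-stage BC Hydro residential bill for one billing period.
hydro_bill <- function(kwh, days,
                       step1_rate = 0.0858,   # $/kWh up to the threshold
                       step2_rate = 0.1287,   # $/kWh beyond the threshold
                       threshold = 1350,      # kWh per two-month period
                       basic_daily = 0.1899,  # $ per day in the billing period
                       rider = 0.05,          # assumed rider rate
                       gst = 0.05) {          # assumed GST
  energy <- if (kwh <= threshold) {
    kwh * step1_rate
  } else {
    threshold * step1_rate + (kwh - threshold) * step2_rate
  }
  subtotal <- (energy + days * basic_daily) * (1 + rider)
  round(subtotal * (1 + gst), 2)
}

hydro_bill(900, 60)   # 900 kWh over 60 days, first stage only: 97.70
hydro_bill(1500, 60)  # crosses into the second stage: 161.55
```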


The motivation behind this post is to analyze a very special electricity consumption data set (special because I was given permission by my great friend Skye to analyze his electricity consumption data!) I will be analyzing Skye’s personal electricity consumption in the full calendar years of 2015 and 2016.

Data, Set, Go!

It is amazing how accessible BC Hydro makes personal electricity consumption data to its paying customers. Skye has revealed that to obtain your personalized data set, you would simply login to your MyHydro account, and request an exported .csv file. Within 24 hours of submitting a request, you will receive an e-mail with a personalized .csv file.

The file contains two data columns: the first, Start Interval Time/Date, is the time stamp at the beginning of each hour of every day of the year; the second, Net Consumption (kWh), is the amount of electricity used up to the end of that hour, measured in kilowatt-hours.

One thing I noticed in Skye’s data set is that some Net Consumption values were missing, indicated by N/A attached to some time stamps. Without any prior knowledge of Skye’s electricity consumption behavior, there is no way to know for certain why the data is missing. To rectify this, I simply replaced each missing value with the most recent observed level of net consumption. For example, for a missing value on September 13, 2016 at 10:00am, I would carry the last observation forward and replace it with the net consumption value from 9:00am. Consecutive missing values are all replaced with the most recent available net consumption value.
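The fill rule described above is last-observation-carried-forward; here is a base-R sketch on a toy vector (in practice the zoo package’s na.locf does the same thing):

```r
# Replace each NA with the most recent non-missing value (LOCF).
# A leading NA has nothing to carry, so it is left untouched.
locf <- function(x) {
  for (i in seq_along(x)) {
    if (i > 1 && is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

locf(c(0.42, NA, NA, 0.31, NA))  # 0.42 0.42 0.42 0.31 0.31
```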

Without further ado, I shall begin reporting what I found from Skye’s electricity consumption data using Tableau Public. Pokemon Fans, please take notice in the Venusaur colour palette!

There was an increase of about 3% in net consumption expenditure from 2015 to 2016. Skye exhibits typical seasonal trends in electricity consumption.

Electricity Consumption Seasonal

Net Expenditure is what Skye is charged under the two-stage pricing algorithm on a per-month basis. He never actually reaches stage two of pricing, as he is well below the 1350 kWh threshold every two months (which is amazing: save energy and save money!)

Visually, Skye exhibits typical consumption behavior: his expenditure trends downward over the first three quarters of each year, hits a minimum in September, then scales back up in the fall and winter months. But is what we see visually also verified statistically? We can test the theory that Skye is behaving typically. Consider the following R code:

table2015 <- c(42.87, 39.01, 33.9, 29.25, 22.5, 26.68, 33.06, 24.45, 15.57,
               27.11, 49.37, 49.39)
table2016 <- c(44.46, 31.57, 32.68, 29, 25.53, 25.43, 24.53, 29.42, 20.43,
               32.03, 43.96, 67.62)
chisq.test(table2015, table2016)

Behavior Distribution

Here, I perform a simple chi-square test to check whether these data points reflect Skye’s typical behavior. The null hypothesis is that the expenditures in 2015 and 2016 are independent; in simpler terms, we entertain the possibility that Skye’s behavior has changed and with it his electricity consumption. Since the result presents a p-value of 0.2329 (much greater than the 0.05 benchmark), we fail to reject the null hypothesis and conclude there is no evidence that Skye’s consumption behavior has significantly changed: he is behaving how he should be!

Although we have statistical evidence of his typical behavior, one still needs to question what happened in December 2016, when Skye’s expenditure increased by almost 37% from 2015! Could this be a result of one of the coldest winters the Lower Mainland has experienced in years?

Skye has more expenditure control in 2016. His net daily expenditures are more sporadic in 2015 with monthly averages between $0.80 and $1.40 per day, whereas 2016 was less sporadic with monthly averages between $0.70 and $1.20 (exception, December 2016 at average of $2.20).


Daily Expenditure 2015


Daily Expenditure 2016

Here, I use the term sporadic to describe the range in distributions of net daily expenditures per month.  For example, the box-plot ranges in the first quarter of 2015 are much wider than the box-plots in the first quarter of 2016. This is especially evident in the summer months. To put it simply, Skye has more consistency and better control of his electricity consumption and expenditures in 2016.

We have seen that Skye’s expenditure increased by 3.42% from 2015 to 2016. One would think that with more controlled electricity consumption in 2016, his expenditure would be lower. Looking at December 2016, we can see that his expenditure was abnormally high (by almost an additional $0.70 per day!). Interestingly, his consumption behavior in 2016 was otherwise kept at an all-time low until the winter season.

Another thing to point out is that outlying value of about $4.00 in December 2016. Anecdotally, Skye says it is most likely because he forgot to turn off the stove that day!

Skye’s favourite days of electricity consumption are Tuesdays, Wednesdays, Saturdays and Sundays. Net hourly expenditure is seasonally and consistently higher (more expenditures are greater than $0.04 per hour) during these days.

Net Hourly Expenditure 2015 ($)

Monthly Daily Expenditure 2015

Net Hourly Expenditure 2016 ($)

Monthly Daily Expenditure 2016

Consistent with what we have been seeing, the winter months continue to observe the highest expenditure per hour. In addition, it seems that Skye has more of a liking to use higher levels of electricity on Tuesdays, Wednesdays and on the weekends with expenditures ranging between $0.03 to $0.09 per hour.

As a smaller difference, Skye used more electricity every day throughout November 2015, but decreased his hourly expenditure over November weekends in 2016.

Skye exhibits behavioral change in 2016 mornings and mid-evenings with higher net hourly expenditures compared to 2015.

Net Hourly Expenditure 2015 ($)

Daily Hour Charge 2015

Net Hourly Expenditure 2016 ($)

Daily Hour Charge 2016

It seems that Skye consumed more electricity in the mornings of 2016, evident in an additional $0.01 to $0.02 per hour over 2015 rates. A behavioral change is signaled at the 7AM to 8AM mark in the morning and the 6PM to 7PM mark in the evening. Even late-night slots, such as Mondays at 11PM in 2016, exhibit an increase in electricity consumption.

Skye spends about 57% less than the average British Columbian household.

Compare Rates

According to BC Hydro, the average BC household consumes about 900 kWh of electricity per month (not accounting for seasonality). Applying the same logic to Skye’s average consumption over the past two years, we see that he consumes far less than the average household.

Watt to Consider for the Future

This analysis was a great way to continue displaying fun data in Tableau. With just a two-column personalized electricity consumption data set, I was able to dig a little deeper on the spending behavior of Skye. Some things came to mind as I was conducting this analysis which could be used as motivation for further analysis posts.

This analysis serves as a perfect transition into utilizing machine learning methods to forecast future expenditures. We can definitely come to understand what exactly our forecasting machine learning models intend to capture and how these behaviors will help us predict future behavior. In this analysis, 2015 and 2016 data was used but in reality, data up to the current date and data before 2015 can be obtained. This gives more opportunity to build and test forecasting models accordingly.

In addition to the amazing visualizations produced by Tableau, a perfect consideration for future time-series modelling is to plot the data using ggplot2 in R. Interchangeably using these two visualization tools can serve as good practice and could provide more insights when used in conjunction with one another.

A special thanks to Skye for letting me use his data, it was fun!


My Personal Vancouver Transit Usage: Analysis using Tableau

One of the things I love about Vancouver is its public transportation system, Translink. I grew up loving trains, and so it only seemed natural that riding the Skytrain be one of the funnest things I have experienced when I first came to Vancouver. It has been around two years since I have moved here and I still use it to go everywhere in and out of the Vancouver area. A cool feature of the Translink system is the Compass Card, a re-loadable fare pass in which frequent riders will use to tap themselves on and off the transit system through fare gates. Part of the reason why I love the idea of tapping on and off the fare gates or on bus rides is because of how the system records data of where and when you have tapped.

The thought of Translink’s ability to easily conduct commuter analysis using the millions of data recorded everyday for strategic pricing and vehicle allocation is intriguing. As such, this is what motivates riders like me to analyze my personal rider behavior. Conveniently, the Compass Card website allows you to download your own personal .csv file. The file contains lines of transactions representing every single time you have tapped on and off the system.

The motivation behind this post is to showcase some data analysis. I would love to present what I have learned about my transit behavior between September 2016 and August 2017 using Tableau Public. For any Pokemon fans out there, visualizations take on a Charizard colour palette.

On average, I began my travels with the bus 2.7 times more than the train each month. Equivalently, the bus began 73% of my trips.

Rider Usage Growth

73 Percent Bus Usage

This makes sense as the bus begins my commute to almost everywhere I go when I begin at home. It is interesting to see that my ridership has consistently increased up until the second quarter in 2017. The slight kink in the graph is due to the fact that I spent most of the month of May 2017 travelling (I went to Japan for the first time!)

I used the transit system in 289 out of 365 days and most days, I took 2-3 trips.

Trip Data

Here, I defined a trip as one that requires a new full fare payment. Multiple forms of transit may be used within the hour-and-a-half transfer window before having to pay again; transfers between types of transit (i.e. bus to train) are not counted as separate trips.
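That trip definition can be made concrete with a small counting function: a new trip starts whenever a tap falls outside the 90-minute window opened by the current trip's first tap. This is a sketch following the fare rule described above; the function name is my own:

```r
# Count fare trips from tap-in times (seconds, or POSIXct coerced to
# numeric): taps within 90 minutes of a trip's first tap are transfers.
count_trips <- function(tap_times, window_mins = 90) {
  times <- sort(as.numeric(tap_times))
  trips <- 0
  trip_start <- -Inf
  for (t in times) {
    if (t - trip_start > window_mins * 60) {
      trips <- trips + 1   # outside the transfer window: new full fare
      trip_start <- t
    }
  }
  trips
}

# Taps at 8:00, 8:30 (a transfer) and 10:00 (a new fare): 2 trips.
count_trips(c(0, 30, 120) * 60)  # 2
```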

I tried avoiding the morning transit rush. I am more likely to use transit during evening rush hour. Weekend usage often starts in the late morning.

Daily Trip Schedule

This huge morning spread in my transit usage behavior reflects my choice to go to the gym before I go to work, especially during the spring and summer seasons when it gets brighter outside earlier in the day. Therefore, I can begin using transit as early as 5:00am! Also, being given a flexible work schedule, sometimes I choose to head to work as late as anywhere between 8:00am and 9:00am.

I often go out on Friday and Saturday evenings, but as much as I love taking transit, there is no clear increase in usage during this time: depending on the activity, I may already be within walking distance of where I want to be, or transit may not be my ideal form of transportation.

Given my behavior, I saved money with the Zone 1 Monthly Pass at $91.00! I would have spent, on average, $103.00 a month on individual fares.
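The break-even arithmetic can be sketched as follows. The $2.10 single-trip stored-value fare is an assumed figure for illustration; only the $91.00 pass price and the $103.00 monthly average come from my data:

```r
pass_cost <- 91.00           # Zone 1 Monthly Pass
avg_monthly_fares <- 103.00  # what my taps would have cost individually
fare <- 2.10                 # assumed single-trip stored-value fare

# Minimum number of trips per month before the pass pays for itself
break_even_trips <- ceiling(pass_cost / fare)
break_even_trips               # 44 trips

avg_monthly_fares - pass_cost  # average monthly savings: $12
```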

Fare Usage

Oftentimes, I don’t think about how many times I tap on and off the system, and I overlook what I would be paying if I did not have a monthly pass. This is proof that a Zone 1 monthly pass is worth it for a frequent transit user like me, and I do not have to worry about other fare options.

If I were more nit-picky, I could definitely save more money by skipping the monthly pass during the months where it would not be worth it. For example, every December I fly out to Toronto for two weeks to visit family for the holidays. One might think this would prompt me to pay closer attention to my budget allocation towards public transit. In reality, I prefer not having to worry about loading my Compass Card every month, so I have it set to auto-load, where the system automatically charges my credit card and loads a monthly pass onto my card.

Further Considerations

This analysis was a great introductory way for me to explore Tableau as an analytical tool. I will definitely be using it more often to create vibrant visualizations and hone in on insights from interesting data. One future consideration for this kind of analysis is to utilize maps and locations to enhance the visualization and the stories behind transit data. In this particular case, almost all of my trips began in Vancouver and rarely in any surrounding city, so geographic visuals may not have been of much use.

Another future consideration is to augment the existing transit data with other sources, such as distances traveled, possibly obtained from Google Maps. Analysis of the cost per kilometer traveled, or personal summary statistics on distances traveled, also sounds interesting.

Amidst the world of data available in everyday life, one last thought: the next time you tap off the transit system, think about how that is one more data point for your next analysis!

Employee Turnover: A Risk Segmenting Investigation

In this post, I conduct a simple risk analysis of employee turnover using the Human Resources Analytics data set from Kaggle.

I describe this analysis as an example of simple risk segmenting because I would like to have a general idea of which combination of employee characteristics can provide evidence towards higher employee turnover.

To accomplish this, I developed a function in R that takes a data frame and two characteristics of interest and generates a matrix whose entries represent the probability of employee turnover given the two characteristics. I call these values turnover rates.
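Before presenting the full function, the core computation can be illustrated on a toy data frame: for each combination of two characteristics, the turnover rate is simply the proportion of employees in that cell whose binary class equals 1. The data frame and column names below are made up for illustration:

```r
toy <- data.frame(dept   = c("sales", "sales", "tech", "tech", "tech"),
                  salary = c("low",   "low",   "low",  "high", "high"),
                  left   = c(1L,      0L,      1L,     0L,     0L))

# Cross-tabulate the mean of the binary class by the two characteristics
rates <- tapply(toy$left, list(toy$dept, toy$salary), mean)
round(rates, 2)
# sales/low = 0.5, tech/low = 1, tech/high = 0, sales/high = NA (no employees)
```

The rate_matrix function below does essentially this, with the added work of bucketing continuous characteristics into ranges and appending row and column averages.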

Human Resources Analytics Data

Firstly, let us go over the details of the human resources analytics data set.

hr_data <- read.csv("HR_comma_sep.csv", header = TRUE)



The variables are described as follows:

  • satisfaction_level represents the employee’s level of satisfaction on a scale from 0 to 1
  • last_evaluation represents the employee’s numeric score on their last evaluation
  • number_project is the number of projects accomplished by an employee to date
  • average_montly_hours is the average monthly hours an employee spends at work
  • time_spend_company is the amount of years an employee worked at this company
  • Work_accident is a binary variable where 1 means the employee experienced a workplace accident, and 0 otherwise
  • left is the binary class variable where 1 means the employee left, and 0 otherwise
  • promotion_last_5years is a binary variable where 1 means the employee was promoted in the last 5 years, and 0 otherwise
  • sales is a categorical variable representing the employee’s main job function
  • salary is a categorical variable representing an employee’s salary level

The Rate Function

The following R code presents the function used to conduct this analysis.

# To use rate_matrix, a data frame df must be supplied and two column names from df must be known. The data frame must contain a numeric binary class feature y.
# If any of the characteristics are numeric on a continuous scale, a cut must be specified to place the values into categorical ranges or buckets.

rate_matrix <- function(df, y, c1 = NA, c2 = NA, cut = 10, avg = TRUE) {

# If y is not a binary integer, then stop the function.
if (is.integer(df[[y]]) != TRUE) { stop("Please ensure y is a binary class integer.") }

df_col_names <- colnames(df)

# If c1 and c2 are not available
if ( & { stop("Please recall function with a c1 and/or c2 value.") }

# If only c1 is provided
else if ( {

if (is.integer(df[[c1]])) {
var1 <- as.character(df[[c1]])
var1 <- unique(var1)
var1 <- as.numeric(var1)
var1 <- sort(var1, decreasing = FALSE) }

else if (is.numeric(df[[c1]])) {
var1 <- cut(df[[c1]], cut)
df[[c1]] <- var1
var1 <- levels(var1) }

else {
var1 <- df[[c1]]
var1 <- as.character(var1)
var1 <- unique(var1)
var1 <- sort(var1, decreasing = FALSE) }

c1_pos <- which(df_col_names == c1) # Number of column of characteristic c1

var1_len <- length(var1)

m <- matrix(NA, nrow = var1_len, ncol = 1)

rownames(m) <- var1
colnames(m) <- c1

for (i in 1:var1_len) {
bad <- df[,1][which(df[,c1_pos] == var1[i] & df[[y]] == 1)]
bad_count <- length(bad)

good <- df[,1][which(df[,c1_pos] == var1[i] & df[[y]] == 0)]
good_count <- length(good)

m[i,1] <- round(bad_count / (bad_count + good_count), 2) } }

# If c1 and c2 are provided
else {
if (is.integer(df[[c1]])) {
var1 <- as.character(df[[c1]])
var1 <- unique(var1)
var1 <- as.numeric(var1)
var1 <- sort(var1, decreasing = FALSE) }

else if (is.numeric(df[[c1]])) {
var1 <- cut(df[[c1]], cut)
df[[c1]] <- var1
var1 <- levels(var1) }

else {
var1 <- df[[c1]]
var1 <- as.character(var1)
var1 <- unique(var1)
var1 <- sort(var1, decreasing = FALSE) }

if (is.integer(df[[c2]])) {
var2 <- as.character(df[[c2]])
var2 <- unique(var2)
var2 <- as.numeric(var2)
var2 <- sort(var2, decreasing = FALSE) }

else if (is.numeric(df[[c2]])) {
var2 <- cut(df[[c2]], cut)
df[[c2]] <- var2
var2 <- levels(var2) }

else {
var2 <- df[[c2]]
var2 <- as.character(var2)
var2 <- unique(var2)
var2 <- sort(var2, decreasing = FALSE) }

c1_pos <- which(df_col_names == c1) # Number of column of characteristic c1
c2_pos <- which(df_col_names == c2) # Number of column of characteristic c2

var1_len <- length(var1)
var2_len <- length(var2)

m <- matrix(NA, nrow = var1_len, ncol = var2_len)

rownames(m) <- var1
colnames(m) <- var2

class_1 <- max(df[[y]])
class_0 <- min(df[[y]])

for (i in 1:var1_len) {
for (j in 1:var2_len) {
bad <- df[,1][which(df[,c1_pos] == var1[i] & df[,c2_pos] == var2[j] & df[[y]] == class_1)]
bad_count <- length(bad)

good <- df[,1][which(df[,c1_pos] == var1[i] & df[,c2_pos] == var2[j] & df[[y]] == class_0)]
good_count <- length(good)
m[i,j] <- round(bad_count / (bad_count + good_count), 2) } } }

# Create class 1 matrix report that includes averages
if (avg == TRUE) {
ColumnAverage <- apply(m, 2, mean, na.rm = TRUE)
ColumnAverage <- round(ColumnAverage, 2)
RowAverage <- apply(m, 1, mean, na.rm = TRUE)
RowAverage <- round(RowAverage, 2)
RowAverage <- c(RowAverage, NA)
m <- rbind(m, ColumnAverage)
m <- cbind(m, RowAverage)
return(m) }
else {
return(m) }
}


Employee Turnover Data Investigation

To begin this data investigation, I proceed as though I have significant experience and field knowledge within Human Resources. I start this heuristic analysis with the thought that employee turnover is greatly affected by how an employee feels about their job and about the company.

Are employees with low satisfaction levels more likely to leave?

The first thing I would like to confirm is that employees with low satisfaction levels are more likely to leave.

satisfaction <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", cut = 20, avg = TRUE)



The function call here uses a cut value of 20, chosen somewhat arbitrarily; I simply want ranges fine enough to provide evidence for my claim.

As seen in the matrix, the satisfaction range 0.0891 to 0.136 shows that 92% of employees in this range left. This provides evidence that employees with low satisfaction levels are at the highest risk of leaving the company.

As we would expect, the highest satisfaction levels, 0.954 to 1, experience 0% employee turnover.

For simplicity and ease of understanding, I define 0.5 as the average satisfaction level. Looking at the below-average satisfaction ranges 0.363 to 0.408 and 0.408 to 0.454, there is an unexpectedly large increase in the risk of employees leaving. This particular region of employee satisfaction requires more investigation because it goes against intuition.

Are employees with below-average satisfaction levels more likely to leave across different job functions?

To address this concern of satisfaction levels defying our intuition, I continue the investigation by checking whether these turnover rates vary across other characteristics in the data. It is possible that these below-average satisfaction levels are tied to job function.

satisfaction_sales <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "sales", cut = 20, avg = TRUE)



Here, the same satisfaction ranges, 0.363 to 0.408 and 0.408 to 0.454, remain at generally high risk of leaving across all job functions. This suggests that somewhat unhappy workers are willing to leave regardless of their job function.

Is an unhappy employee’s likelihood of leaving related to average monthly hours worked?

To continue answering why the below-average satisfaction ranges experience higher employee turnover than we expect, I take a look at the relationship between satisfaction levels and average monthly hours worked. It could be that below-average satisfaction levels at this company are tied to employees being overworked.

# First, convert the integer variable average_montly_hours into a numeric variable to take advantage of the function's ability to break down numeric variables into ranges. (Note that the misspelling of "monthly" comes from the dataset itself.)

hr_data$average_montly_hours <- as.numeric(hr_data$average_montly_hours)

satisfaction_avghours <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "average_montly_hours", cut = 20, avg = TRUE)



To reiterate, the row ranges represent satisfaction levels and the column ranges represent average monthly hours worked. Here, there is strong evidence that employees within the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 work between 117 and 160 hours a month.

Typically, a full-time employee works at least 160 hours a month, given that a full-time position merits 40 hours a week for 4 weeks. The data suggest a higher probability of leaving among workers who work fewer hours than a regular full-time employee! This was different from my initial train of thought that the employees were potentially overworked.

Given this finding, I come to one particular conclusion: the employees at highest risk of leaving are contract, seasonal, or part-time employees.

By considering other variables such as the number of projects worked on by an employee, it is possible to further support this conclusion.

satisfaction_projects <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)



Here, it is evident that the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 may in fact correspond to contract or part-time employees, as the probability of turnover decreases sharply after 2 completed projects.

Are contract, part-time or seasonal employees more likely to be unhappy if the job is accident-prone?

Now that we have identified the high-risk groups of employee turnover within this data set, this question comes to mind because an employee’s enjoyment of their role should be tied to their satisfaction level. It could be that these part-time employees are experiencing hardships at work, thereby contributing to their risk of leaving.

To answer this question, I take a look at the satisfaction level and number of projects completed given that an employee experienced a workplace accident.

# I use the dplyr package to filter the hr_data data frame down to observations that experienced a workplace accident.

library(dplyr)
accident_obs <- filter(hr_data, Work_accident == 1)

satisfaction_accident <- rate_matrix(df = accident_obs, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)



Here, given the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454, a project count of 2, and the fact that these employees experienced a workplace accident, there is evidence of a higher chance of turnover.

Further Work

The purpose of this analysis was to apply a risk segmenting method on human resources analytics data to identify potential reasons for employee turnover. I used probabilities or turnover rates to help identify some groups of employees that were at risk of leaving the company.

I found higher chances of turnover given an extremely low satisfaction level, but also discovered that certain types of employee (contract, part-time, seasonal) could be identified as groups at high risk of turnover. I also explored whether the unhappiness of part-time employees could be attributed to working accident-prone jobs.

With the example presented in this post, Human Resources can use this information to put more effort into ensuring contract, part-time, or seasonal employees experience lower turnover rates. This analysis allowed us to identify which groups of employees are at risk and to identify potential causes.

This risk analysis approach can be applied to fields well beyond Human Resources, including Health and Finance. It is useful to be able to quickly produce generic risk segments within a population so that risk management solutions can be implemented for the specific problem at hand.

Lastly, this post provides only one simple way to segment and analyze risk groups! More advanced methods such as clustering and decision trees can identify risk groups more thoroughly and paint an even bigger picture. For quick checks against domain expertise in any particular field, however, the rate function I present here can be sufficient for identifying risk groups.
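As a small teaser of the decision-tree alternative, here is a minimal sketch using the rpart package on simulated data. The variables and the planted rule below are made up for illustration; with the real hr_data you would model left against the actual columns:

```r
library(rpart)

set.seed(1)
toy <- data.frame(satisfaction = runif(200),
                  hours        = sample(100:300, 200, replace = TRUE))
# Plant the kind of segment found in this post: unhappy, under-worked employees leave
toy$left <- as.integer(toy$satisfaction < 0.4 & toy$hours < 160)

# Fit a classification tree; its splits should recover the planted segment
tree <- rpart(factor(left) ~ satisfaction + hours, data = toy, method = "class")
tree
```

Unlike the pairwise rate matrices above, a tree searches over all characteristics at once and chooses its own cut points, so the risk segments come with no manual bucketing.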