Analysis

My Personal Vancouver Transit Usage: Analysis using Tableau

One of the things I love about Vancouver is its public transportation system, Translink. I grew up loving trains, so it seemed only natural that riding the Skytrain was one of the most fun things I experienced when I first came to Vancouver. It has been around two years since I moved here, and I still use it to go everywhere in and out of the Vancouver area. A cool feature of the Translink system is the Compass Card, a reloadable fare pass that frequent riders use to tap on and off the transit system at fare gates. Part of the reason I love tapping on and off at fare gates or on buses is that the system records where and when you have tapped.

The thought that Translink can easily conduct commuter analysis for strategic pricing and vehicle allocation, using the millions of data points recorded every day, is intriguing. It is also what motivates riders like me to analyze our own rider behavior. Conveniently, the Compass Card website allows you to download your own personal .csv file. The file contains one line per transaction, representing every single time you have tapped on and off the system.

The motivation behind this post is to showcase some data analysis. I would love to present what I have learned about my transit behavior between September 2016 and August 2017 using Tableau Public. For any Pokemon fans out there, the visualizations take on a Charizard colour palette.

On average, I began my travels with the bus 2.7 times as often as with the train each month. Equivalently, the bus began 73% of my trips.

Rider Usage Growth

73 Percent Bus Usage

This makes sense, as the bus begins my commute to almost everywhere I go when I start from home. It is interesting to see that my ridership consistently increased up until the second quarter of 2017. The slight kink in the graph is due to the fact that I spent most of May 2017 travelling (I went to Japan for the first time!).
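The 2.7-to-1 ratio and the 73% share quoted above are two views of the same figure. A minimal sketch of the conversion, assuming every trip starts on either a bus or a train (the variable names are made up for illustration):

```r
# Ratio of bus-started trips to train-started trips, as quoted in the post.
bus_to_train_ratio <- 2.7

# If the bus starts r trips for every 1 train trip, its share is r / (r + 1).
bus_share <- bus_to_train_ratio / (bus_to_train_ratio + 1)

round(100 * bus_share)  # 73
```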

I used the transit system on 289 out of 365 days, and on most days I took 2-3 trips.

Trip Data

Here, I define a trip as a journey that required a new full fare payment. Within an hour-and-a-half window after paying, I can transfer between multiple forms of transit before having to pay again. These transfers (e.g., bus to train) are not counted as separate trips.
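The trip definition above can be sketched in R. This is only an illustration under stated assumptions: the tap times are invented, the helper name assign_trips is hypothetical, and I assume the 90-minute window is anchored at the first tap of each trip. The real Compass Card export has its own column names and formats, which are not shown in this post.

```r
# Hypothetical tap-on times, invented for illustration.
taps <- as.POSIXct(c("2017-03-01 07:10", "2017-03-01 07:35", "2017-03-01 08:50",
                     "2017-03-01 17:20", "2017-03-01 18:40"), tz = "UTC")

# Give each tap a trip number: a new trip starts once a tap falls outside
# the window (in minutes) that opened at the current trip's first tap.
assign_trips <- function(tap_times, window_mins = 90) {
  tap_times <- sort(tap_times)
  trip_id <- integer(length(tap_times))
  trip_start <- NULL
  current <- 0L
  for (i in seq_along(tap_times)) {
    outside_window <- is.null(trip_start) ||
      as.numeric(difftime(tap_times[i], trip_start, units = "mins")) > window_mins
    if (outside_window) {
      current <- current + 1L
      trip_start <- tap_times[i]
    }
    trip_id[i] <- current
  }
  trip_id
}

trip_ids <- assign_trips(taps)
trip_ids              # 1 1 2 3 3
max(trip_ids)         # 3 trips from 5 taps
```

The 07:35 and 18:40 taps count as transfers because they fall within 90 minutes of the tap that opened their trip, while the 08:50 tap is 100 minutes after 07:10 and so starts a new trip.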

I tried avoiding the morning transit rush. I am more likely to use transit during evening rush hour. Weekend usage often starts in the late morning.

Daily Trip Schedule

The wide morning spread in my transit usage reflects my choice to go to the gym before work, especially during the spring and summer when it gets bright earlier in the day. As a result, I can begin using transit as early as 5:00am! Also, since I have a flexible work schedule, I sometimes head to work as late as 8:00am to 9:00am.

I often go out on Friday and Saturday evenings. As much as I love taking transit, there is no clear increase in my transit usage during these times: depending on the activity, I may already be within walking distance of what I want to do, or transit may not be my ideal form of transportation.

Given my behavior, I saved money by getting a Zone 1 Monthly Pass at $91.00! On average, I would have spent $103.00 a month on individual fares.

Fare Usage

Oftentimes, I don’t think about how many times I tap on and off the system, and I overlook what I would be paying if I did not have a monthly pass. This is proof that getting a Zone 1 monthly pass is worth it for a frequent transit user like me, and I do not have to worry about other payment alternatives.

If I were more nit-picky, I could definitely save more money by skipping the monthly pass in months where it would not be worth it. For example, every December, I fly out to Toronto for two weeks to visit family for the holidays. One might think this should prompt me to pay closer attention to how much I budget for public transit. In reality, I prefer not having to worry about loading my Compass Card every month. Hence, I have it set to auto-load: the system automatically charges my credit card and loads a monthly pass onto my Compass Card.
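The month-by-month comparison described above could be sketched as follows. The $91 pass price comes from this post, but the monthly fare totals are made up for illustration and are not my actual data.

```r
# Price of the Zone 1 monthly pass quoted in the post.
pass_price <- 91

# Made-up pay-per-trip totals for four months, for illustration only.
fares_by_month <- c(Oct = 110.00, Nov = 98.50, Dec = 62.00, Jan = 105.25)

# The pass pays off in any month where single fares would have cost more.
pass_worth_it <- fares_by_month > pass_price

# Cheapest option each month, and the extra cost of always buying the pass.
cheapest <- pmin(fares_by_month, pass_price)
extra_cost_of_always_buying <- sum(pass_price - cheapest)

pass_worth_it
extra_cost_of_always_buying  # 29, all of it from the December example
```

With numbers like these, auto-loading the pass every month costs a little extra in the quiet month, which is the convenience trade-off described above.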

Further Considerations

This analysis was a great introductory way for me to explore Tableau as an analytical tool. I will definitely be using it more often to create vibrant visualizations and home in on insights from interesting data. One future consideration for this kind of analysis is to use maps and locations to enhance the visualizations and the stories behind transit data. In this particular case, almost all of my trips began in Vancouver and rarely in any surrounding city, so geographic visuals may not have been of much use.

Another future consideration is to augment the existing transit data with other data sources, such as distances traveled on transit, possibly obtained from Google Maps. An analysis of the cost per kilometer traveled, or personal summary statistics on distances traveled, also sounds interesting.

Amidst the world of available data in everyday life, one last future consideration is that the next time you tap off the transit system, think about how that is one more data point for your next analysis!

Employee Turnover: A Risk Segmenting Investigation

In this post, I conduct a simple risk analysis of employee turnover using the Human Resources Analytics data set from Kaggle.

I describe this analysis as an example of simple risk segmenting because I would like to have a general idea of which combination of employee characteristics can provide evidence towards higher employee turnover.

To accomplish this, I developed a function in R that takes a data frame and two characteristics of interest and generates a matrix whose entries represent the probability of employee turnover given the two characteristics. I call these values turnover rates.
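To make the idea of a turnover rate concrete, here is a minimal sketch on a toy data frame. The data and column values are invented for illustration, and base R's tapply stands in for the function described above; it is not that function.

```r
# Toy data, invented for illustration; not taken from the Kaggle data set.
toy <- data.frame(
  left   = c(1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L),
  salary = c("low", "low", "low", "low", "high", "high", "high", "high"),
  dept   = c("sales", "sales", "IT", "IT", "sales", "sales", "IT", "IT")
)

# The mean of a 0/1 column is the share of 1s, so the mean of `left`
# within each salary/dept cell is that cell's turnover rate.
turnover_rates <- tapply(toy$left, list(toy$salary, toy$dept), mean)
turnover_rates
```

Each entry answers the question "of the employees with this combination of characteristics, what fraction left?". Here both low-salary sales employees left, so that cell is 1, while neither high-salary IT employee left, so that cell is 0.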

Human Resources Analytics Data

Firstly, let us go over the details of the human resources analytics data set.


hr_data <- read.csv("HR_comma_sep.csv", header = TRUE)

str(hr_data)

hr_analytics_data_summary

The variables are described as follows:

  • satisfaction_level represents the employee’s level of satisfaction on a 0 – 100% scale
  • last_evaluation represents the employee’s numeric score on their last evaluation
  • number_project is the number of projects accomplished by an employee to date
  • average_montly_hours is the average monthly hours an employee spends at work
  • time_spend_company is the number of years an employee has worked at this company
  • Work_accident is a binary variable where 1 means the employee experienced an accident, and 0 otherwise
  • left is the binary class variable where 1 means the employee left, and 0 otherwise
  • promotion_last_5years is a binary variable where 1 means the employee was promoted in the last 5 years, and 0 otherwise
  • sales is a categorical variable representing the employee’s main job function
  • salary is a categorical variable representing an employee’s salary level

The Rate Function

The following R code presents the function used to conduct this analysis.


# To use rate_matrix, supply a data frame df, the name y of its binary class column (stored as integer 0/1), and the names of one or two other columns of df.
# If any of the characteristics are numeric on a continuous scale, cut sets how many categorical ranges or buckets the values are placed into.

rate_matrix <- function(df, y, c1 = NA, c2 = NA, cut = 10, avg = TRUE) {

  # If y is not stored as an integer, then stop the function.
  if (!is.integer(df[[y]])) { stop("Please ensure y is a binary class integer.") }

  df_col_names <- colnames(df)

  # If c1 and c2 are not available
  if (is.na(c1) && is.na(c2)) { stop("Please recall function with a c1 and/or c2 value.") }

  # If only c2 is provided, treat it as c1
  if (is.na(c1)) {
    c1 <- c2
    c2 <- NA
  }

  # If only c1 is provided
  if (is.na(c2)) {

    if (is.integer(df[[c1]])) {
      # Integer column: every distinct value is its own level
      var1 <- as.character(df[[c1]])
      var1 <- unique(var1)
      var1 <- as.numeric(var1)
      var1 <- sort(var1, decreasing = FALSE)
    } else if (is.numeric(df[[c1]])) {
      # Continuous column: split into `cut` ranges
      var1 <- cut(df[[c1]], cut)
      df[[c1]] <- var1
      var1 <- levels(var1)
    } else {
      # Categorical column: every distinct label is its own level
      var1 <- df[[c1]]
      var1 <- as.character(var1)
      var1 <- unique(var1)
      var1 <- sort(var1, decreasing = FALSE)
    }

    c1_pos <- which(df_col_names == c1) # Number of column of characteristic c1

    var1_len <- length(var1)

    m <- matrix(NA, nrow = var1_len, ncol = 1)

    rownames(m) <- var1
    colnames(m) <- c1

    for (i in seq_len(var1_len)) {
      bad <- df[,1][which(df[,c1_pos] == var1[i] & df[[y]] == 1)]
      bad_count <- length(bad)

      good <- df[,1][which(df[,c1_pos] == var1[i] & df[[y]] == 0)]
      good_count <- length(good)

      m[i,1] <- round(bad_count / (bad_count + good_count), 2)
    }

  # If c1 and c2 are provided
  } else {

    if (is.integer(df[[c1]])) {
      var1 <- as.character(df[[c1]])
      var1 <- unique(var1)
      var1 <- as.numeric(var1)
      var1 <- sort(var1, decreasing = FALSE)
    } else if (is.numeric(df[[c1]])) {
      var1 <- cut(df[[c1]], cut)
      df[[c1]] <- var1
      var1 <- levels(var1)
    } else {
      var1 <- df[[c1]]
      var1 <- as.character(var1)
      var1 <- unique(var1)
      var1 <- sort(var1, decreasing = FALSE)
    }

    if (is.integer(df[[c2]])) {
      var2 <- as.character(df[[c2]])
      var2 <- unique(var2)
      var2 <- as.numeric(var2)
      var2 <- sort(var2, decreasing = FALSE)
    } else if (is.numeric(df[[c2]])) {
      var2 <- cut(df[[c2]], cut)
      df[[c2]] <- var2
      var2 <- levels(var2)
    } else {
      var2 <- df[[c2]]
      var2 <- as.character(var2)
      var2 <- unique(var2)
      var2 <- sort(var2, decreasing = FALSE)
    }

    c1_pos <- which(df_col_names == c1) # Number of column of characteristic c1
    c2_pos <- which(df_col_names == c2) # Number of column of characteristic c2

    var1_len <- length(var1)
    var2_len <- length(var2)

    m <- matrix(NA, nrow = var1_len, ncol = var2_len)

    rownames(m) <- var1
    colnames(m) <- var2

    # Values of y that mark the two classes (1 and 0 for a binary class)
    class_1 <- max(df[[y]])
    class_0 <- min(df[[y]])

    for (i in seq_len(var1_len)) {
      for (j in seq_len(var2_len)) {
        bad <- df[,1][which(df[,c1_pos] == var1[i] & df[,c2_pos] == var2[j] & df[[y]] == class_1)]
        bad_count <- length(bad)

        good <- df[,1][which(df[,c1_pos] == var1[i] & df[,c2_pos] == var2[j] & df[[y]] == class_0)]
        good_count <- length(good)

        # Empty cells give 0 / 0 = NaN, which the averages below skip
        m[i,j] <- round(bad_count / (bad_count + good_count), 2)
      }
    }
  }

  # Create class 1 matrix report that includes averages
  if (avg) {
    ColumnAverage <- apply(m, 2, mean, na.rm = TRUE)
    ColumnAverage <- round(ColumnAverage, 2)
    RowAverage <- apply(m, 1, mean, na.rm = TRUE)
    RowAverage <- round(RowAverage, 2)
    RowAverage <- c(RowAverage, NA)
    m <- rbind(m, ColumnAverage)
    m <- cbind(m, RowAverage)
  }

  return(m)
}

Employee Turnover Data Investigation

To begin this data investigation, I assume the role of someone with significant experience and field knowledge in Human Resources. I begin this heuristic analysis with the thought that employee turnover is greatly affected by how employees feel about their job and about the company.

Are employees with low satisfaction levels more likely to leave?

The first thing I would like to confirm is that employees with low satisfaction levels are more likely to leave.


satisfaction <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", cut = 20, avg = TRUE)

View(satisfaction)

satisfaction_level_single

The function call here uses a cut value of 20 for no particular reason, other than that I want enough ranges to provide evidence for my claim.
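For readers unfamiliar with how a cut value turns a continuous variable into ranges, here is a minimal sketch using base R's cut(). The scores are generated for illustration and only assume that satisfaction levels run from 0.09 to 1; they are not the Kaggle data.

```r
# Evenly spaced scores from 0.09 to 1; an assumed range, for illustration only.
scores <- seq(0.09, 1, by = 0.01)

# Given a single number, cut() splits the observed range into that many
# equal-width intervals and returns a factor of interval labels.
ranges <- cut(scores, 20)

nlevels(ranges)        # 20 intervals
levels(ranges)[1:2]    # labels of the two lowest ranges
table(ranges)[1:3]     # how many scores land in the first few ranges
```

As far as I understand, cut() widens the outermost breaks slightly so that the minimum and maximum fall inside an interval, which is why a lowest range can start just below the smallest observed value. A larger cut value gives finer ranges but fewer employees in each one.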

As seen in the matrix, 92% of employees with satisfaction levels between 0.0891 and 0.136 left the company. This provides evidence that employees with low satisfaction levels are at the highest risk of leaving.

As we would expect, the highest levels of satisfaction of 0.954 to 1 experience 0% employee turnover.

For simplicity and ease of understanding, I treat 0.5 as the average satisfaction level. Looking at the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454, there is an odd, marked increase in the risk of employees leaving. This particular area of employee satisfaction requires more investigation because it goes against intuition.

Are employees with below average satisfaction levels more likely to leave across different job functions?

To address this concern about satisfaction levels that defy our intuition, I continue the investigation by checking whether the pattern varies across other characteristics in the data. It is possible that these below-average satisfaction levels are tied to job function.


# The sales column holds the employee's job function
satisfaction_sales <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "sales", cut = 20, avg = TRUE)

View(satisfaction_sales)

satisfaction_sales

Here, employees in the same satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 are generally at high risk of leaving across all job functions. There is evidence to suggest that somewhat unhappy workers are willing to leave regardless of their job function.

Is an unhappy employee’s likelihood of leaving related to average monthly hours worked?

To continue exploring why the below-average satisfaction ranges experience higher employee turnover than expected, I take a look at the relationship between satisfaction levels and average monthly hours worked. It could be that below-average satisfaction levels at this company are tied to employees being overworked.


# First, convert the integer variable average_montly_hours into a numeric variable so that the function breaks it down into ranges instead of treating every distinct value as its own level.

hr_data$average_montly_hours <- as.numeric(hr_data$average_montly_hours)

satisfaction_avghours <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "average_montly_hours", cut = 20, avg = TRUE)

View(satisfaction_avghours)

satisfaction_hours
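The conversion to numeric before the call above matters because the rate function checks for integer columns first and gives every distinct integer its own row or column. A small sketch of the difference, using made-up hours rather than the real data:

```r
# Hypothetical monthly hours stored as whole numbers, as read.csv would store them.
hours <- c(157L, 262L, 272L, 223L, 159L)

is.integer(hours)   # TRUE: each distinct value would become its own level
is.numeric(hours)   # also TRUE, but the integer check is made first

# After conversion the values are doubles, so they are cut into ranges instead.
hours_num <- as.numeric(hours)
is.integer(hours_num)          # FALSE
nlevels(cut(hours_num, 3))     # 3 ranges instead of 5 distinct values
```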

To reiterate, the row ranges represent satisfaction levels and the column ranges represent average monthly hours worked. Here, there is strong evidence to suggest that, within the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454, turnover is concentrated among employees who work between 117 and 160 hours a month.

Using domain knowledge, a full-time employee will typically work at least 160 hours a month, given that a full-time position means 40 hours a week for 4 weeks in any given month. The data suggests that workers have a higher probability of leaving when they work less than a regular full-time employee! This was different from my initial train of thought that the employees were potentially overworked.

Given this finding, I come to one particular conclusion: the employees at highest risk of leaving are those who are on contract, seasonal, or part-time.

By considering other variables such as the number of projects worked on by an employee, it is possible to further support this conclusion.


satisfaction_projects <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)

View(satisfaction_projects)

satisfaction_projects

Here, it is evident that the below-average satisfaction levels of 0.363 to 0.408 and 0.408 to 0.454 may in fact correspond to contract or part-time employees, as the probability of turnover drops sharply beyond 2 completed projects.

Are contract, part-time or seasonal employees more likely to be unhappy if the job is accident-prone?

Now that we have identified the high-risk groups for employee turnover within this data set, this question comes to mind because an employee’s satisfaction level should be tied to how much they enjoy their role. It could be that these part-time employees are experiencing hardships during their time at work, thereby contributing to their risk of leaving.

To answer this question, I take a look at the satisfaction level and number of projects completed given that an employee experienced a workplace accident.


# I use the dplyr package to filter the hr_data data frame so that it only includes employees who experienced a workplace accident
library(dplyr)

accident_obs <- filter(hr_data, Work_accident == 1)

satisfaction_accident <- rate_matrix(df = accident_obs, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)

View(satisfaction_accident)

satisfaction_accident

Here, for employees with below-average satisfaction levels of 0.363 to 0.408 and 0.408 to 0.454 and 2 projects who experienced a workplace accident, there is evidence to suggest a higher chance of turnover.

Further Work

The purpose of this analysis was to apply a risk segmenting method on human resources analytics data to identify potential reasons for employee turnover. I used probabilities or turnover rates to help identify some groups of employees that were at risk of leaving the company.

I found that there was a higher chance of turnover when the employee had an extremely low satisfaction level, and also discovered that the type of employee (contract, part-time, seasonal) could identify groups at high risk of turnover. I also explored the possibility that unhappiness among part-time employees was attributable to working in jobs that were accident-prone.

With the example presented in this post, Human Resources can use this information to put more effort into ensuring that contract, part-time, or seasonal employees experience lower turnover rates. This analysis allowed us to identify which groups of employees are at risk and to identify potential causes.

This risk analysis approach can be applied to fields of practice other than Human Resources, including Health and Finance. It is useful to be able to come up with quick generic risk segments within your population so that further risk management solutions can be implemented for the specific problems at hand.

Lastly, this post only provides a simple way to segment and analyze risk groups, but it is not the only way! More advanced methods such as clustering and decision trees can help identify risk groups more thoroughly and informatively to provide an even bigger picture. For quick checks against domain expertise in any particular field of practice, the rate function I present here can be sufficient for identifying risk groups.