Month: April 2017

Overcoming the First Hurdle: From Knowing a Little to Learning a Lot

Growth as a data scientist takes many forms and follows different paths depending on the function you serve within your work environment. The learning curve for an early-stage data scientist varies with several things, such as your educational background and knowledge, prior experience in the field and industry, and whether you work within a team of data scientists or as a standalone data scientist.

For myself, the learning curve was and continues to be steep and challenging. I began my career as a standalone data scientist for a start-up company, coming straight out of school with very limited knowledge of the financial industry. All I had in my knowledge base at the time was an in-depth understanding of logistic regression, some economic analysis projects involving time series, and a toolkit consisting of R and Microsoft Excel. In hindsight, I could have done more to add to my skill set before I started my job, but with what I knew, an eagerness to learn, and an immense curiosity, I had exactly what I needed to begin my career.

My role as a data scientist is to build and maintain proprietary credit scoring models and to provide ad hoc analysis and reports upon request. I faced a whole list of challenges when I first started: a lack of credit scorecard building knowledge, a lack of knowledge of advanced data analytic techniques, difficulty verifying that my work met industry standards, and not knowing how to closely monitor model effects.

These challenges pushed me to figure out best practices and processes in the best way I could. Here are some of the ways I went about addressing the challenges I faced at the start of my career.

Conducting Independent Research

My gut instinct when approaching a problem with virtually no background experience and no one to turn to for answers is to research! Having obtained a Master’s degree from a program that infused independent research heavily into its curriculum, this came naturally to me. For example, the most important step in tackling the scorecard building project was first understanding it in its entirety and breaking it down into manageable, understandable pieces. It was extremely important to know why a scorecard is used, how it is used, and how it would benefit my company’s operations.

What often happened throughout my research was that I would find complex solutions that were difficult to implement without advanced enterprise software or advanced programming knowledge, or I would find solutions that seemed too easy and not convincing enough to use. This process of researching and attempting to reproduce projects from the internet definitely increased my technical understanding and in many ways boosted my proficiency in R. Along the way, I even picked up some Python, and I learned how to write queries in Microsoft SQL Server and MySQL to better streamline my data and model building processes.


Another challenge was ensuring that the credit scoring models were built following best practices within the financial industry. This was more difficult for two reasons. First, a scorecard for an alternative business-lending company differs immensely from the more common scorecards developed in the industry, such as those for personal loans. Second, modelling practices for alternative subprime business lending are still relatively new, as these industries only emerged in the wake of the 2008 Financial Crisis. Published research is therefore limited, and most of the ideas behind these driving forces are proprietary.

To overcome this challenge, I engaged in some more internet research, but more importantly, I networked with industry professionals and took what I could from my discussions with them. Most of our discussions involved understanding which techniques were widely used in the industry. During this time, LinkedIn and my personal connections contributed to overcoming this challenge. I learned to set up interactions with professionals online, and to generate and connect ideas from those discussions within my own work.

Engaging in Trial and Error

There is high pressure when you first start as a data scientist, with expectations of completing your projects within specified deadlines. The scorecard was my very first project, and with the limited knowledge that I had, I was almost forced into a process of trial and error. My early practice was an endless cycle of researching and building, in which I often updated the scorecard to meet new standards and practices I learned along the way. At the time, there was very little internal user feedback on the scorecard because it was assumed to be performing exactly as it should. Throughout this trial-and-error process, constant communication and understanding across the company were essential to continue building a robust scorecard. Here, I learned a lot about the technical side of model building, but I also found that my role as a standalone data scientist has a unique place within the operational team.

Being Prepared and Building Confidence

No highly technical data science problem can be solved easily. As a standalone data scientist, where you mostly work on your own accord and are expected to make educated executive decisions, you are bound to run into personal hurdles such as worry and frustration. When something goes wrong with your models, you are the first person held accountable, which in many ways can be unsettling. I came to realize that all of these feelings were natural and that this was perfectly fine!

In order to overcome this challenge, it was always in my best interest to be prepared to provide thorough answers to the questions the company asked me and to be able to address concerns. Whenever a problem or concern was raised with the models I built or with my data analysis methodologies, I was always forthcoming with an answer or came up with a solution. It was in my best interest to be accountable and honest about my abilities. This stemmed from the realization that I do not know everything, but I do want to learn, so that I do my best work and help the company grow. With appropriate communication with upper management and their moral support, these personal challenges slowly faded, and my learning of applied business data science actually accelerated.

Moving Forward

What I appreciate most about the early stages of my career is how much I have learned and how much I have grown as a person. That said, the learning never ends: new modelling needs arise, data repositories grow with new data to be analyzed, and new modelling techniques and solutions are introduced with new technologies.

I know that as I continue along this career path, I am bound to learn more programming, apply other predictive models, and conduct new kinds of analysis. With these ongoing changes within a fast-growing company, every problem solved is bound to raise ten more. The best part of being in the early stage of my career is knowing that I still have a lot to learn; as I move forward, I will anticipate the challenges ahead and be more than happy to tackle them one step at a time.

Employee Turnover: A Risk Segmenting Investigation

In this post, I conduct a simple risk analysis of employee turnover using the Human Resources Analytics data set from Kaggle.

I describe this analysis as simple risk segmenting because I would like a general idea of which combinations of employee characteristics provide evidence of higher employee turnover.

To accomplish this, I developed a function in R that takes a data frame and up to two characteristics of interest and generates a matrix whose entries represent the probability of employee turnover given those characteristics. I call these values turnover rates.

Human Resources Analytics Data

Firstly, let us go over the details of the human resources analytics data set.

hr_data <- read.csv("HR_comma_sep.csv", header = TRUE)
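Before walking through the variables, a quick structural check (a minimal sketch of my own, not part of the original analysis) confirms the column types and the overall turnover proportion:

str(hr_data)                      # column names and types
prop.table(table(hr_data$left))   # share of employees who left (1) vs. stayed (0)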



The variables are described as follows:

  • satisfaction_level represents the employee’s level of satisfaction on a scale from 0 to 1
  • last_evaluation represents the employee’s numeric score on their last evaluation
  • number_project is the number of projects accomplished by an employee to date
  • average_montly_hours is the average monthly hours an employee spends at work
  • time_spend_company is the amount of years an employee worked at this company
  • Work_accident is a binary variable where 1 means the employee experienced a workplace accident, and 0 otherwise
  • left is the binary class variable where 1 means the employee left, and 0 otherwise
  • promotion_last_5years is a binary variable where 1 means the employee was promoted in the last 5 years, and 0 otherwise
  • sales is a categorical variable representing the employee’s main job function
  • salary is a categorical variable representing an employee’s salary level
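As a concrete illustration of what the rate function computes, the turnover rate for a subgroup is simply the number of leavers divided by the total number of employees in that subgroup. Below is a hand-rolled sketch for one arbitrary subgroup of my own choosing (low-salary employees in the sales job function); it is not part of the original analysis.

# Turnover rate for one illustrative subgroup: leavers / (leavers + stayers)
grp <- hr_data$salary == "low" & hr_data$sales == "sales"
leavers <- sum(hr_data$left == 1 & grp)
stayers <- sum(hr_data$left == 0 & grp)
round(leavers / (leavers + stayers), 2)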

The Rate Function

The following R code presents the function used to conduct this analysis.

# To use rate_matrix, a data frame df must be supplied and two column names from df must be known. The data frame must contain a numeric binary class feature y.
# If any of the characteristics are numeric on a continuous scale, a cut must be specified to place the values into categorical ranges or buckets.

rate_matrix <- function(df, y, c1 = NA, c2 = NA, cut = 10, avg = TRUE) {

# If y is not a binary integer, then stop the function.
if (is.integer(df[[y]]) != TRUE) { stop("Please ensure y is a binary class integer.") }

df_col_names <- colnames(df)

# If c1 and c2 are not available
if (is.na(c1) & is.na(c2)) { stop("Please recall function with a c1 and/or c2 value.") }

# If only c1 is provided
else if (is.na(c2)) {

if (is.integer(df[[c1]])) {
var1 <- as.character(df[[c1]])
var1 <- unique(var1)
var1 <- as.numeric(var1)
var1 <- sort(var1, decreasing = FALSE) }

else if (is.numeric(df[[c1]])) {
var1 <- cut(df[[c1]], cut)
df[[c1]] <- var1
var1 <- levels(var1) }

else {
var1 <- df[[c1]]
var1 <- as.character(var1)
var1 <- unique(var1)
var1 <- sort(var1, decreasing = FALSE) }

c1_pos <- which(df_col_names == c1) # Number of column of characteristic c1

var1_len <- length(var1)

m <- matrix(NA, nrow = var1_len, ncol = 1)

rownames(m) <- var1
colnames(m) <- c1

for (i in 1:var1_len) {
bad <- df[,1][which(df[,c1_pos] == var1[i] & df[[y]] == 1)]
bad_count <- length(bad)

good <- df[,1][which(df[,c1_pos] == var1[i] & df[[y]] == 0)]
good_count <- length(good)

m[i,1] <- round(bad_count / (bad_count + good_count), 2) } }

# If c1 and c2 are provided
else {
if (is.integer(df[[c1]])) {
var1 <- as.character(df[[c1]])
var1 <- unique(var1)
var1 <- as.numeric(var1)
var1 <- sort(var1, decreasing = FALSE) }

else if (is.numeric(df[[c1]])) {
var1 <- cut(df[[c1]], cut)
df[[c1]] <- var1
var1 <- levels(var1) }

else {
var1 <- df[[c1]]
var1 <- as.character(var1)
var1 <- unique(var1)
var1 <- sort(var1, decreasing = FALSE) }

if (is.integer(df[[c2]])) {
var2 <- as.character(df[[c2]])
var2 <- unique(var2)
var2 <- as.numeric(var2)
var2 <- sort(var2, decreasing = FALSE) }

else if (is.numeric(df[[c2]])) {
var2 <- cut(df[[c2]], cut)
df[[c2]] <- var2
var2 <- levels(var2) }

else {
var2 <- df[[c2]]
var2 <- as.character(var2)
var2 <- unique(var2)
var2 <- sort(var2, decreasing = FALSE) }

c1_pos <- which(df_col_names == c1) # Number of column of characteristic c1
c2_pos <- which(df_col_names == c2) # Number of column of characteristic c2

var1_len <- length(var1)
var2_len <- length(var2)

m <- matrix(NA, nrow = var1_len, ncol = var2_len)

rownames(m) <- var1
colnames(m) <- var2

class_1 <- max(df[[y]])
class_0 <- min(df[[y]])

for (i in 1:var1_len) {
for (j in 1:var2_len) {
bad <- df[,1][which(df[,c1_pos] == var1[i] & df[,c2_pos] == var2[j] & df[[y]] == class_1)]
bad_count <- length(bad)

good <- df[,1][which(df[,c1_pos] == var1[i] & df[,c2_pos] == var2[j] & df[[y]] == class_0)]
good_count <- length(good)
m[i,j] <- round(bad_count / (bad_count + good_count), 2) } } }

# Create class 1 matrix report that includes averages
if (avg == TRUE) {
ColumnAverage <- apply(m, 2, mean, na.rm = TRUE)
ColumnAverage <- round(ColumnAverage, 2)
RowAverage <- apply(m, 1, mean, na.rm = TRUE)
RowAverage <- round(RowAverage, 2)
RowAverage <- c(RowAverage, NA)
m <- rbind(m, ColumnAverage)
m <- cbind(m, RowAverage)
return(m) }
else {
return(m) }
}
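As a quick sanity check of the function, it can be called with a single categorical characteristic (an illustrative call of my own; salary is just an example):

# Turnover rate by salary level, with row and column averages appended
rate_matrix(df = hr_data, y = "left", c1 = "salary", avg = TRUE)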


Employee Turnover Data Investigation

To begin this data investigation, I assume that I have significant experience and field knowledge within Human Resources. I begin this heuristic analysis with the thought that employee turnover is greatly affected by how an employee feels about their job and about the company.

Are employees with low satisfaction levels more likely to leave?

The first thing I would like to confirm is that employees with low satisfaction levels are more likely to leave.

satisfaction <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", cut = 20, avg = TRUE)



The function call here uses a cut value of 20 for no particular reason; I simply want enough buckets to provide evidence for my claim.

As seen in the matrix, for satisfaction levels between 0.0891 and 0.136, 92% of employees in this range left. This provides evidence that employees with low satisfaction levels are at the highest risk of leaving the company.

As we would expect, the highest levels of satisfaction of 0.954 to 1 experience 0% employee turnover.
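The buckets can also be ranked directly from the returned matrix rather than read off visually (a small sketch of my own; the ColumnAverage row added by avg = TRUE is dropped first):

# Order the satisfaction buckets from highest to lowest turnover rate
rates <- satisfaction[rownames(satisfaction) != "ColumnAverage", "satisfaction_level"]
sort(rates, decreasing = TRUE)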

For simplicity and ease of understanding, I define 0.5 as the average satisfaction level. Looking at the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454, there is an oddly significant increase in the risk of employees leaving. This particular area of employee satisfaction requires more investigation because it goes against intuition.

Are employees with below average satisfaction levels more likely to leave across different job functions?

To address this concern about satisfaction levels defying our intuition, I continue the investigation by seeing whether these satisfaction levels vary across other characteristics in the data. It is possible that these below-average satisfaction levels are tied to the employee’s job function.

satisfaction_sales <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "sales", cut = 20, avg = TRUE)



Here, the same satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 are generally at high risk of leaving across all job functions. There is evidence to suggest that somewhat unhappy workers are willing to leave regardless of their job function.

Is an unhappy employee’s likelihood of leaving related to average monthly hours worked?

To continue exploring why the below-average satisfaction ranges experience higher employee turnover than we expect, I look at the relationship between satisfaction levels and average monthly hours worked. It could be that below-average satisfaction levels at this company are tied to employees being overworked.

# First, convert the integer variable average_montly_hours into a numeric variable
# to take advantage of the function's ability to break down numeric variables into ranges.

hr_data$average_montly_hours <- as.numeric(hr_data$average_montly_hours)

satisfaction_avghours <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "average_montly_hours", cut = 20, avg = TRUE)



To reiterate, the row ranges represent satisfaction levels and the column ranges represent average monthly hours worked. Here, there is strong evidence that employees within the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 tend to work between 117 and 160 hours a month.
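For a two-characteristic matrix like this one, a simple heatmap can make the riskier cells easier to spot (an optional sketch using base R, not part of the original post; the average row and column added by avg = TRUE are dropped before plotting):

# Heatmap of turnover rates: rows are satisfaction buckets, columns are monthly-hour buckets
m <- satisfaction_avghours
m <- m[rownames(m) != "ColumnAverage", colnames(m) != "RowAverage"]
heatmap(m, Rowv = NA, Colv = NA, scale = "none",
        xlab = "average_montly_hours", ylab = "satisfaction_level")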

Using domain knowledge: typically, a full-time employee works at least 160 hours a month, given that a full-time position entails 40 hours a week for 4 weeks in a month. The data suggest a higher probability of workers leaving given that they work less than a regular full-time employee! This was different from my initial train of thought that these employees were potentially overworked.

Given this finding, I come to one particular conclusion: the employees at highest risk of leaving are those on contract, seasonal employees, or part-time employees.

By considering other variables such as the number of projects worked on by an employee, it is possible to further support this conclusion.

satisfaction_projects <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)



Here, it is evident that the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 may in fact correspond to contract or part-time employees, as the probability of turnover decreases sharply after 2 completed projects.
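A one-way view of turnover by project count (an additional call of my own, not in the original post) is a quick way to check this directly:

# Turnover rate by number of projects alone
rate_matrix(df = hr_data, y = "left", c1 = "number_project", avg = FALSE)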

Are contract, part-time or seasonal employees more likely to be unhappy if the job is accident-prone?

Now that we have identified the high-risk groups for employee turnover within this data set, this question comes to mind because an employee’s enjoyment of their role should be tied to their satisfaction level. It could be that these part-time employees are experiencing hardships at work, thereby contributing to their risk of leaving.

To answer this question, I take a look at the satisfaction level and number of projects completed given that an employee experienced a workplace accident.

# I use the dplyr package to filter the hr_data data frame down to only the
# observations that experienced a workplace accident.
library(dplyr)

accident_obs <- filter(hr_data, Work_accident == 1)

satisfaction_accident <- rate_matrix(df = accident_obs, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)



Here, for the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 with the number of projects equal to 2, and given that these employees experienced a workplace accident, there is evidence of an even higher chance of turnover.

Further Work

The purpose of this analysis was to apply a risk segmenting method to human resources analytics data to identify potential reasons for employee turnover. I used probabilities, or turnover rates, to help identify groups of employees that were at risk of leaving the company.

I found that there was a higher chance of turnover when an employee had an extremely low satisfaction level, but I also discovered that certain types of employees (contract, part-time, seasonal) could be identified as high-risk groups for turnover. I also considered the possibility that the unhappiness of part-time employees was attributable to working on accident-prone jobs.

With the example presented in this post, Human Resources can put more effort into ensuring that contract, part-time, or seasonal employees experience lower turnover. This analysis allowed us to identify which groups of employees are at risk and to identify potential causes.

This risk analysis approach can be applied to fields other than Human Resources, including health and finance. It is useful to be able to come up with quick, generic risk segments within your population so that further risk management solutions can be implemented for the specific problems at hand.

Lastly, this post provides only a simple way to segment and analyze risk groups, but it is not the only way! More advanced methods such as clustering and decision trees can identify risk groups more thoroughly and informatively to provide an even bigger picture. For quick checks against domain expertise in any particular field of practice, the rate function presented here can be sufficient for identifying risk groups.