
A Friendly BC Hydro Electricity Consumption Analysis using Tableau

If there is something to appreciate about the Canadian West Coast, it is the way it leads by example through environmentally friendly practices. One of the ways British Columbia takes on this initiative is by administering electrical energy in the cleanest and most cost-efficient way it can. This is all made possible through BC Hydro, a Canadian-controlled crown corporation responsible for providing British Columbia residents with reliable and affordable electricity. British Columbia prides itself on how it delivers electrical energy and is known to have some of the lowest consumer electricity prices in Canada. One way to show appreciation for natural resource consumption is, of course, to take a look at our very own personal electricity consumption. Before diving into any analysis, it might be useful to provide some context on how BC Hydro prices electricity consumption.

Electricity Pricing

BC Hydro uses a two-stage pricing scheme in which consumers pay $0.0858 per kWh up to a consumption threshold of 1350 kWh within a two-month billing period. The rate increases to the second stage, $0.1287 per kWh, for any consumption beyond 1350 kWh within those two months. In addition to the two-stage energy charge, consumers pay a base rate of $0.1899 times the number of days in their billing period, as well as a rider rate, a buffer cost that covers unpredictable economic circumstances such as abnormal market prices or inaccurate water level forecasts. GST is then applied to the entire cost, and the total is the amount payable each billing period. If you want to read for knowledge’s sake, BC Hydro thoroughly explains its electricity pricing on their website.
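To make the pricing concrete, here is a minimal R sketch of that billing calculation. The bc_hydro_bill function is my own illustrative helper, and the rider rate is modelled as a simple percentage with a placeholder value of 5%, since its actual size is not stated here.

# Hypothetical helper: two-stage energy charge, base rate, rider rate, then GST
bc_hydro_bill <- function(kwh, days, rider_rate = 0.05, gst = 0.05) {
  step1_rate <- 0.0858   # $ per kWh up to the threshold
  step2_rate <- 0.1287   # $ per kWh beyond the threshold
  threshold  <- 1350     # kWh per two-month billing period

  energy_charge <- step1_rate * min(kwh, threshold) + step2_rate * max(kwh - threshold, 0)
  basic_charge  <- 0.1899 * days

  subtotal <- (energy_charge + basic_charge) * (1 + rider_rate)  # rider rate is a placeholder value
  round(subtotal * (1 + gst), 2)                                 # GST applied to the entire cost
}

# Example: 900 kWh over a 61-day billing period
bc_hydro_bill(kwh = 900, days = 61)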

Motivation

The motivation behind this post is to analyze a very special electricity consumption data set (special because I was given permission by my great friend Skye to analyze his electricity consumption data!). I will be analyzing Skye’s personal electricity consumption for the full calendar years of 2015 and 2016.

Data, Set, Go!

It is amazing how accessible BC Hydro makes personal electricity consumption data to its paying customers. Skye revealed that to obtain your personalized data set, you simply log in to your MyHydro account and request an exported .csv file. Within 24 hours of submitting the request, you will receive an e-mail with a personalized .csv file.

The file contains two data columns: the first, Start Interval Time/Date, is the time stamp at the beginning of each hour of every day of the year; the second, Net Consumption (kWh), is the amount of electricity used up to the end of that hour, measured in kilowatt-hours.

One thing I noticed within Skye’s data set is that there were some missing Net Consumption values, indicated by N/A next to some time stamps. Looking further into the data, and without any prior knowledge of Skye’s electricity consumption behavior, there is no way to know for certain why the data is missing. To rectify this, I simply replaced each missing value with the most recent level of net consumption. For example, if there was a missing value on September 13, 2016 at 10:00am, I would carry forward the net consumption value from 9:00am. If there were trailing or consecutive missing values, I replaced them all with the most recent available net consumption value.
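For illustration, here is a minimal sketch of that forward-fill (last observation carried forward) step using the zoo package; the file name and column names are placeholders I chose, not the actual headers of the MyHydro export.

require(zoo)

# Hypothetical file and column names for the exported MyHydro data
hydro <- read.csv("skye_consumption.csv", na.strings = "N/A", stringsAsFactors = FALSE)
names(hydro) <- c("interval_start", "net_kwh")

# Replace each missing value with the most recent observed net consumption,
# which also handles consecutive and trailing gaps
hydro$net_kwh <- na.locf(hydro$net_kwh, na.rm = FALSE)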

Without further ado, I shall begin reporting what I found from Skye’s electricity consumption data using Tableau Public. Pokemon fans, please take notice of the Venusaur colour palette!

There was an increase of about 3% in net consumption expenditure from 2015 to 2016. Skye exhibits typical seasonal trends in electricity consumption.

Electricity Consumption Seasonal

Net Expenditure is what Skye is charged under the two-stage pricing scheme on a per-month basis. He never actually steps into stage two of pricing, as he is well below the 1350 kWh threshold every two months (which is amazing: save energy and save money!)

Visually, it seems that Skye exhibits typical consumption behavior: his expenditure trends downward over the first three quarters of each year, hits a minimum in September, then scales back up in the fall and winter months. But is it possible that what we see visually is not supported statistically? We can check whether Skye’s pattern really is consistent from one year to the next. Consider the following R code:

# Monthly net expenditure ($) for each year
table2015 <- c(42.87, 39.01, 33.9, 29.25, 22.5, 26.68, 33.06, 24.45, 15.57,
               27.11, 49.37, 49.39)
table2016 <- c(44.46, 31.57, 32.68, 29, 25.53, 25.43, 24.53, 29.42, 20.43,
               32.03, 43.96, 67.62)
chisq.test(table2015, table2016)

Behavior Distribution

Here, I perform a simple chi-square test to check whether these data points are consistent with Skye’s typical behavior. The null hypothesis is that the pattern of monthly expenditure is the same in 2015 and 2016, or in simpler terms, that Skye’s behavior has not significantly changed between the two years. Since the result presents a p-value of 0.2329 (much greater than the usual 0.05 benchmark), we fail to reject the null hypothesis and conclude that there is no evidence of a change; Skye appears to be behaving the way he usually does!

Although the test gives no statistical evidence of a change in his typical behavior, one still needs to question what happened in December 2016, when Skye’s expenditure increased by almost 37% over December 2015! Could this be a result of one of the coldest winters that the Lower Mainland has experienced in years?

Skye showed better expenditure control in 2016. His net daily expenditures were more sporadic in 2015, with monthly averages between $0.80 and $1.40 per day, whereas 2016 was less sporadic, with monthly averages between $0.70 and $1.20 (the exception being December 2016, with an average of $2.20).

2015

Daily Expenditure 2015

2016

Daily Expenditure 2016

Here, I use the term sporadic to describe the spread of the distributions of net daily expenditure per month. For example, the box-plot ranges in the first quarter of 2015 are much wider than those in the first quarter of 2016. This is especially evident in the summer months. To put it simply, Skye showed more consistency and better control of his electricity consumption and expenditures in 2016.

We have seen that Skye’s expenditure increased by 3.42% from 2015 to 2016. One would think that with more controlled electricity consumption in 2016, his expenditure would be lower. Taking a look at December 2016, we can see that his expenditure was abnormally high (by almost an additional $0.70 per day!). His consumption behavior was noticeably different in 2016, staying at lower levels right up until the winter season.

Another thing to point out is the outlying value of about $4.00 in December 2016. Anecdotally, Skye says it is most likely because he forgot to turn off the stove that day!

Skye’s favourite days for electricity consumption are Tuesdays, Wednesdays, Saturdays and Sundays. Net hourly expenditure is seasonally and consistently higher (more expenditures greater than $0.04 per hour) on these days.

Net Hourly Expenditure 2015 ($)

Monthly Daily Expenditure 2015

Net Hourly Expenditure 2016 ($)

Monthly Daily Expenditure 2016

Consistent with what we have seen so far, the winter months continue to show the highest expenditure per hour. In addition, it seems that Skye prefers to use more electricity on Tuesdays, Wednesdays and the weekends, with expenditures ranging between $0.03 and $0.09 per hour.

As a smaller difference, we can see that Skye used more electricity every day throughout November 2015, but decreased his hourly expenditure over November weekends in 2016.

Skye exhibits behavioral change in 2016 mornings and mid-evenings with higher net hourly expenditures compared to 2015.

Net Hourly Expenditure 2015 ($)

Daily Hour Charge 2015

Net Hourly Expenditure 2016 ($)

Daily Hour Charge 2016

It seems that Skye consumed more electricity in the mornings in 2016, spending an additional $0.01 to $0.02 per hour compared to 2015. A behavioral change is signaled around the 7AM to 8AM mark in the morning and the 6PM to 7PM mark in the evening. Even late-night periods, such as Mondays at 11PM in 2016, exhibit an increase in electricity consumption.

Skye spends about 57% less than the average British Columbian household.

Compare Rates

According to BC Hydro, the average BC household consumes about 900 kWh of electricity per month (not accounting for seasonality). If we apply the same logic by taking Skye’s average consumption over the past two years, we see that he actually consumes far less than the average household.
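As a rough check, reusing the hypothetical hydro data frame from the missing-value sketch earlier, Skye’s average monthly consumption over the two years could be compared against the 900 kWh benchmark as follows.

# Average monthly consumption across the 24 months of 2015 and 2016
skye_monthly_kwh <- sum(hydro$net_kwh, na.rm = TRUE) / 24

# Fraction of the average BC household's 900 kWh per month
skye_monthly_kwh / 900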

Watt to Consider for the Future

This analysis was a great way to continue displaying fun data in Tableau. With just a two-column personalized electricity consumption data set, I was able to dig a little deeper on the spending behavior of Skye. Some things came to mind as I was conducting this analysis which could be used as motivation for further analysis posts.

This analysis serves as a natural transition into utilizing machine learning methods to forecast future expenditures. It helps us understand what exactly a forecasting model would need to capture and how these observed behaviors can help predict future consumption. In this analysis, only 2015 and 2016 data were used, but in reality data up to the current date, as well as data before 2015, can be obtained. This gives more opportunity to build and test forecasting models accordingly.

In addition to the amazing visualizations produced by Tableau, a perfect consideration for future time-series modelling is to plot the data using ggplot2 in R. Interchangeably using these two visualization tools can serve as good practice and could provide more insights when used in conjunction with one another.

A special thanks to Skye for letting me use his data, it was fun!

 

Implementing a Predictive Model Pipeline using R and Microsoft Azure Machine Learning

In this post, I aim to demonstrate the process of building a simple machine learning model in R and implementing it as predictive web-service in Microsoft Azure Machine Learning (Azure ML). One is not limited to the built-in machine learning capabilities of Azure ML since the Azure ML environment enables the use of R scripts, and the ability to upload and utilize R packages.

In practice, this gives the Data Scientist the flexibility they need to use their own carefully R-crafted machine learning models within the Azure ML environment. Once an R-built machine learning model is fully implemented within the Azure ML environment, the web-service provides an API in which calls can be made by external applications. This provides large value to any organization that seeks to automate decision processes using predictive modelling.

This demonstration is subdivided into three sections:

  1. Building a Machine Learning Model in R
  2. Preparing for Model Implementation in Microsoft Azure Machine Learning
  3. Creating a Predictive Web Service in Microsoft Azure Machine Learning

Before we begin, this demonstration assumes all of the following are satisfied:

  1. We have a verified and registered Microsoft Azure Machine Learning account. Microsoft allows you to try the product for free if you do not have a subscription. Click the “Sign-up here” on the far right using the link above and follow the steps to gain access to Azure ML.
  2. We have R and an associated IDE installed. I will be using the free version of R-Studio throughout this process.
  3. We have the following R packages installed: dplyr, dummies, caret, and caretEnsemble
  4. We have the Human Resources Analytics data set downloaded as this will be the main source of data for building our predictive model.

Building a Machine Learning Model in R

To keep things simple, I will build a basic logistic regression model.

Data Pre-processing

First, I pre-process the Human Resources Analytics data set to one-hot encode the categorical features sales and salary.


require(dummies)

dataset <- read.csv("HRdata.csv")
dataset <- dummy.data.frame(dataset, names = c("sales", "salary"), sep = "_")

# Columns 15 and 19 represent sales_RandD and salary_high, which are removed to prevent the dummy variable trap
dataset <- dataset[-c(15, 19)]

Training the Logistic Regression Model

Next, I train a Logistic Regression Model and check that it can successfully generate predictions for new data.

 

# Create the logistic regression model
glm_model <- glm(left ~ ., data = dataset, family = binomial)

# Generate predictions (predicted probabilities) for new data
newdata <- data.frame(satisfaction_level = 0.5, last_evaluation = 0.5, number_project = 1, average_montly_hours = 160, time_spend_company = 2, Work_accident = 0, promotion_last_5years = 1, sales_accounting = 0, sales_hr = 0, sales_IT = 0, sales_management = 0, sales_marketing = 0, sales_product_mng = 0, sales_sales = 1, sales_support = 0, sales_technical = 0, salary_low = 0, salary_medium = 1)
prediction <- predict(object = glm_model, newdata = newdata, type = "response")

print(prediction)

Executing the above gives us a probability of 0.193 indicating that this employee has low risk of leaving.

Saving the R-Built Model

Since the goal is to use our very own R-built model in Microsoft Azure Machine Learning, we need to be able to utilize our model without having to generate the above code over again. We run the following code to save our model and all of its parameters:


saveRDS(glm_model, file = "glm_model.rds")

Running this code will save the glm_model.rds file in the active working directory within R-Studio. In this demonstration, the glm_model.rds file is saved to my desktop.

glm_modelrds

Creating Package Project Environment

The next couple of steps are crucial in ensuring that we end up with a package file that can be uploaded to Azure ML. The reason for creating this package is to ensure that Azure ML can call on our logistic regression model to generate predictions.

First, we must initialize the package creation process by starting a new project. In R-Studio this is achieved by the following:

  • Click “File” in the top left corner
  • Click “New Project…” and a pop-up screen will appear

    new_project_popup.PNG

  • Click “New Directory”

    new_package

  • Click “R Package”

    glm_createproject

  • Type in a package name. Here I used “azuremlglm” as my package name. Make sure to create the project folder by setting a project subdirectory. Here, I used my desktop as the location for this project folder.
  • After clicking “Create Project”, a new R-Studio working environment will open with the default R file being “hello.R”. Since I saved my project to my desktop, I also noticed that a new folder was created.

    glm_azuremlglm

  • Now we are set to build our package. Within our package environment in R-Studio, we can close the “hello.R” file and create three new R scripts by hitting ctrl + shift + N three times. These three scripts will be needed in the following sections.

Filling the Package with Necessary Items

By successfully setting up the package creation environment, we are now free to fill this package with anything that we may find useful in our predictive modelling pipeline. For the purposes of this demonstration, this package will only include the logistic regression model built from the first section, and a function that Azure ML can use to generate predictions.

In the first of the new R scripts, we write the following code to add the glm_model.rds file to our package. To make this easier, we can drag the .rds file into the azuremlglm folder, since the R-Studio working directory is now that project folder.


# Read the .rds file into the package environment
glm_model_rds <- readRDS("glm_model.rds")

glm_workingenvironment_1

By reading the .rds file containing our logistic regression model into the package environment, we are now free to utilize the model in any way we wish. It is important that we save this script within the project folder. Here, I saved it as glm_model_rds.R, as seen on the tab.

glm_tab_1

The Prediction Function

Since the primary use of this package is to utilize the logistic regression model to produce predictions, we need to create a function that takes in a data frame containing new data and outputs a prediction. This is very similar to the prediction verification procedure we did in the first section after building the model and using the predict function on new data.

In the new R script that we created, we write the following R code:


# Create function that allows Azure ML to generate predictions using logistic regression model

prediction_function <- function(newdata) {
  prediction <- predict(glm_model_rds, newdata = newdata, type = "response")
  return(data.frame(prediction))
}

Here, I saved this function as prediction_function.R

glm_predictionfunction

The Decision Function

When Azure ML receives new data and passes it to this function, we expect the resulting predictive web-service to produce the predicted probability. But what if our decision process requires more than just the predicted probability?

The benefits of being able to create your own models and packages for use in Azure Machine Learning are manifold. In many cases, you may want the Azure ML API call to output decision processes as a result of the predictions created by your machine learning model. Consider the following example:


# Create function with decision policy

decision_policy <- function(probability) {
  if (probability < 0.2) { return("Employee is low risk, occasional check-up where necessary.") }
  else if (probability >= 0.2 & probability < 0.6) { return("Employee is medium risk, take action in employee retention where necessary.") }
  else { return("Employee is high risk, notify upper management to ensure risk is mitigated in work environment.") }
}

This decision function takes the logistic regression model’s predicted probability for a new observation and applies a Human Resources policy that meets the organization’s needs. As you can see, instead of a predicted probability, this function recommends some form of action to be taken once the predictive model is used. It is possible that the result of a predictive model can trigger many different company-wide policies, no matter what the industry-specific application.

Here’s another example from the alternative business-financing industry. A predicted probability of risk for a specific business owner can trigger different loan-product pricing policies and different employee actions. If the Azure ML API call can output a series of policies and rules, there is huge value in being able to automate decision processes so that a loan is approved or rejected faster.

Creating your own models in R and including decision policies within your R packages could be the solution to an automated decision process within any organization.

Now, back to package creation. Given the newly created decision_function, we need to be sure to update our prediction_function to be able to implement these new policies.


# Create function that allows Azure ML to generate predictions using the logistic regression model and decision policy

prediction_function <- function(newdata) {
  prediction <- predict(glm_model_rds, newdata = newdata, type = "response")
  return(data.frame(decision_policy(prediction)))
}
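
As a quick sanity check, we can call the updated function on the newdata frame from the first section. Since the model returned a probability of 0.193, which falls below the 0.2 threshold, we expect the low-risk recommendation back.

# 0.193 < 0.2, so the low-risk policy message should be returned
prediction_function(newdata)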

With the prediction function and decision function ready to go, it is important that we run these functions so that they are saved within the package environment.

It is also important that we save this R script within the package folder. Here, I saved the decision_policy.R function and re-saved the prediction_function.R as shown in the tabs.

glm_decisionfunction

glm_predictionfunction_2

Once these three separate R scripts are saved, we are ready to build and save our package. To build and save the package, we do the following:

  • Click the “Build” tab in the top right corner

    stack_build

  • Click “Build & Reload”

    glm_buildandreload

  • Verify that the package was built and saved by going to the R library folder, “R/win-library/3.3”. Here, my library is saved in my Documents folder.

    glm_azuremlgml_doc

  • With the package folder from above, you want to create a .zip file of it. You can do this by right-clicking the file, going to “send to”, then selecting “Compressed (zipped) folder”. After doing so, it will create a .zip file of your package. Do not rename this .zip file. I also proceeded to drag this .zip file to my desktop.

    glm_1_azuremlglm

  • This next step is extremely important. With the newly saved .zip file, you want to create ANOTHER .zip file of it. The reason for this is the somewhat unusual way that Azure ML reads in package files. This time I renamed the new .zip file as “2_azuremlglm”. You should now have a .zip file that contains a .zip file that contains the actual azuremlglm package folder. You can delete the first .zip file created in the previous step as it is no longer needed.

    glm_2_azuremlglm

  • This is the resulting package file that will be uploaded to Azure ML.

Creating a Predictive Web Service in Microsoft Azure Machine Learning

We are in the final stretch of the implementation process! This last section will describe how to configure Microsoft Azure Machine Learning to utilize our logistic regression model and decision rules.

Uploading the Package File and Creating a New Experiment

Once we have logged in, we want to do the following steps:

  • Click the “NEW” button in the bottom left corner, click “DATASET”, and then click “FROM LOCAL FILE” as shown

    stack_azureml_newdataset

  • Upload the .zip file created from the previous section

    glm_uploaddataset

  • When the upload is successful, you should receive the following message at the bottom of the screen

    glm_uploadcomplete

  • Next, we create a new blank experiment. We do this by clicking the “NEW” button at the bottom left corner again, click “EXPERIMENT”, and then click the first option “Blank Experiment”

    stack_blankexperiment

  • Now we are ready to configure our Azure Machine Learning experiment

    stack_azuremlenvironment

Setting up the Experiment Platform

In order for Microsoft Azure Machine Learning to utilize our logistic regression model, we need to set up the platform in such a way that it knows to take in new data inputs and produce prediction outputs. We accomplish this with the following layout.

glm_layout

  • The Execute R Script module on the left defines the schema of the inputs. This module connects to the first input node of the second Execute R Script module. The code entered in this module is as follows

    glm_schema

  • A module was placed for the 2_azuremlglm.zip package so that it can be installed within the Azure ML environment. This module is connected to the third input node of the second Execute R Script module.
  • The Execute R Script module in the center is where we utilize the logistic regression model package. This module contains the following code (a rough sketch of both scripts is given after this list)

    glm_readpackagemodule

  • Once all of the above are satisfied, we are ready to deploy the predictive web service.
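
Since the module code above is only shown in the screenshots, here is a rough, hedged sketch of what the two Execute R Script modules might contain. The column names mirror the newdata frame built earlier; maml.mapInputPort and maml.mapOutputPort are the standard port functions of the Execute R Script module in Azure ML Studio (classic); and the src/azuremlglm.zip path is an assumption based on how the nested zip was created, since the Script Bundle is unpacked under the src folder.

# --- First Execute R Script: define the input schema ---
data.set <- data.frame(satisfaction_level = 0.5, last_evaluation = 0.5, number_project = 1,
                       average_montly_hours = 160, time_spend_company = 2, Work_accident = 0,
                       promotion_last_5years = 1, sales_accounting = 0, sales_hr = 0, sales_IT = 0,
                       sales_management = 0, sales_marketing = 0, sales_product_mng = 0,
                       sales_sales = 1, sales_support = 0, sales_technical = 0,
                       salary_low = 0, salary_medium = 1)
maml.mapOutputPort("data.set")

# --- Second Execute R Script: install the uploaded package and score incoming data ---
install.packages("src/azuremlglm.zip", repos = NULL, lib = ".", verbose = TRUE)  # inner zip name is an assumption
library(azuremlglm, lib.loc = ".")

newdata <- maml.mapInputPort(1)            # data frame arriving from the web service input
data.set <- prediction_function(newdata)   # returns the decision-policy recommendation
maml.mapOutputPort("data.set")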

Deploying the Predictive Web service

  • At the bottom of the screen, we will deploy the web service by clicking on “DEPLOY WEB SERVICE”, then clicking “Deploy Web service (classic)”.

    glm_azuremenu

  • Azure ML will then automatically add the Web service input and Web service output modules to the appropriate nodes as follows

    glm_azuremllayout2

  • The Web service output automatically connected to the second output node of the Execute R Script module. We actually want this to connect to the first output node of the Execute R Script as shown

    glm_firstnode

  • Click the “RUN” button at the bottom of the screen to verify the web service
  • Click the “DEPLOY WEB SERVICE” button once again, and select “Deploy Web service (Classic)”. The following page will show up

glm_apipage

  • Finally, we are able to test that our predictive model works by clicking the blue “Test” button in the “REQUEST/RESPONSE” row.

    glm_enterdatatopredict

  • After confirming the test, we should get the following result

glm_outputachieved

  • This confirms that our predictive model works and all decision policies have been correctly implemented. The API is ready to go and can be consumed by external applications.

Further Considerations

Throughout this post, I showcased the process of implementing a simple predictive model using R and Microsoft Azure Machine Learning. Of course, there are more efficient ways of utilizing predictive models, such as training, validating, and testing machine learning models directly on the Azure ML platform, or doing all of the R coding directly in the Execute R Script module.

I want to emphasize that the process outlined here may seem less efficient to build and carry out, but I think it offers a good way to organize and automate decision pipelines. By building R packages that can then be uploaded to Azure ML, we are able to implement many decision rules within the R package. For example, an organization may choose to implement several product pricing rules or internal decision policies based on what the predictive model outputs. There is plenty of room to automate these decisions for a faster turnaround of work. Creating packages also gives us the ability to train, validate, and test more complex machine learning models and save their results accordingly. I am sure there are plenty of other reasons, beyond the ones stated here, why building your own machine learning R packages and uploading them to Azure ML is highly beneficial.

In the future, I look to implement this process by using more complex machine learning models rather than the simple logistic regression. I also look to learn some more software application development as this is clearly not the end of the data science pipeline. With Azure ML producing an API, it would be nice to be able to see the full extent of this pipeline by utilizing the API through my own created applications. Finally, some important takeaways from this post are the abilities to organize and automate an operational data science pipeline and the thought-process behind automating company-related decisions.

Employee Turnover: A Risk Segmenting Investigation

In this post, I conduct a simple risk analysis of employee turnover using the Human Resources Analytics data set from Kaggle.

I describe this analysis as an example of simple risk segmenting because I would like to get a general idea of which combinations of employee characteristics provide evidence of higher employee turnover.

To accomplish this, I developed a function in R that takes a data frame and two characteristics of interest and generates a matrix whose entries represent the probability of employee turnover given those two characteristics. I call these values turnover rates.

Human Resources Analytics Data

Firstly, let us go over the details of the human resources analytics data set.


hr_data <- read.csv("HR_comma_sep.csv", header = TRUE)

str(hr_data)

hr_analytics_data_summary

The variables are described as follows:

  • satisfaction_level represents the employee’s level of satisfaction on a 0 – 100% scale
  • last_evaluation represents the employee’s numeric score on their last evaluation
  • number_project is the number of projects accomplished by an employee to date
  • average_montly_hours is the average monthly hours an employee spends at work
  • time_spend_company is the amount of years an employee worked at this company
  • Work_accident is a binary variable where 1 means the employee experienced a workplace accident, and 0 otherwise
  • left is the binary class variable where 1 means the employee left the company, and 0 otherwise
  • promotion_last_5years is a binary variable where 1 means the employee was promoted in the last 5 years, and 0 otherwise
  • sales is a categorical variable representing the employee’s main job function
  • salary is a categorical variable representing an employee’s salary level

The Rate Function

The following R code presents the function used to conduct this analysis.


# To use rate_matrix, a data frame df must be supplied and two column names from df must be known. The data frame must contain a numeric binary class feature y.
# If any of the characteristics are numeric on a continuous scale, a cut must be specified to place the values into categorical ranges or buckets.

rate_matrix <- function(df, y, c1 = NA, c2 = NA, cut = 10, avg = TRUE) {

  # If y is not a binary integer, then stop the function.
  if (is.integer(df[[y]]) != TRUE) {
    stop("Please ensure y is a binary class integer.")
  }

  df_col_names <- colnames(df)

  if (is.na(c1) & is.na(c2)) {
    # Neither c1 nor c2 is available
    stop("Please recall function with a c1 and/or c2 value.")

  } else if (is.na(c2)) {
    # Only c1 is provided

    if (is.integer(df[[c1]])) {
      var1 <- as.character(df[[c1]])
      var1 <- unique(var1)
      var1 <- as.numeric(var1)
      var1 <- sort(var1, decreasing = FALSE)
    } else if (is.numeric(df[[c1]])) {
      var1 <- cut(df[[c1]], cut)
      df[[c1]] <- var1
      var1 <- levels(var1)
    } else {
      var1 <- df[[c1]]
      var1 <- as.character(var1)
      var1 <- unique(var1)
      var1 <- sort(var1, decreasing = FALSE)
    }

    c1_pos <- which(df_col_names == c1)  # Number of column of characteristic c1

    var1_len <- length(var1)

    m <- matrix(NA, nrow = var1_len, ncol = 1)
    rownames(m) <- var1
    colnames(m) <- c1

    for (i in 1:var1_len) {
      bad <- df[, 1][which(df[, c1_pos] == var1[i] & df[[y]] == 1)]
      bad_count <- length(bad)

      good <- df[, 1][which(df[, c1_pos] == var1[i] & df[[y]] == 0)]
      good_count <- length(good)

      m[i, 1] <- round(bad_count / (bad_count + good_count), 2)
    }

  } else {
    # Both c1 and c2 are provided

    if (is.integer(df[[c1]])) {
      var1 <- as.character(df[[c1]])
      var1 <- unique(var1)
      var1 <- as.numeric(var1)
      var1 <- sort(var1, decreasing = FALSE)
    } else if (is.numeric(df[[c1]])) {
      var1 <- cut(df[[c1]], cut)
      df[[c1]] <- var1
      var1 <- levels(var1)
    } else {
      var1 <- df[[c1]]
      var1 <- as.character(var1)
      var1 <- unique(var1)
      var1 <- sort(var1, decreasing = FALSE)
    }

    if (is.integer(df[[c2]])) {
      var2 <- as.character(df[[c2]])
      var2 <- unique(var2)
      var2 <- as.numeric(var2)
      var2 <- sort(var2, decreasing = FALSE)
    } else if (is.numeric(df[[c2]])) {
      var2 <- cut(df[[c2]], cut)
      df[[c2]] <- var2
      var2 <- levels(var2)
    } else {
      var2 <- df[[c2]]
      var2 <- as.character(var2)
      var2 <- unique(var2)
      var2 <- sort(var2, decreasing = FALSE)
    }

    c1_pos <- which(df_col_names == c1)  # Number of column of characteristic c1
    c2_pos <- which(df_col_names == c2)  # Number of column of characteristic c2

    var1_len <- length(var1)
    var2_len <- length(var2)

    m <- matrix(NA, nrow = var1_len, ncol = var2_len)
    rownames(m) <- var1
    colnames(m) <- var2

    class_1 <- max(df[[y]])
    class_0 <- min(df[[y]])

    for (i in 1:var1_len) {
      for (j in 1:var2_len) {
        bad <- df[, 1][which(df[, c1_pos] == var1[i] & df[, c2_pos] == var2[j] & df[[y]] == class_1)]
        bad_count <- length(bad)

        good <- df[, 1][which(df[, c1_pos] == var1[i] & df[, c2_pos] == var2[j] & df[[y]] == class_0)]
        good_count <- length(good)

        m[i, j] <- round(bad_count / (bad_count + good_count), 2)
      }
    }
  }

  # Create class 1 matrix report that includes averages
  if (avg == TRUE) {
    ColumnAverage <- apply(m, 2, mean, na.rm = TRUE)
    ColumnAverage <- round(ColumnAverage, 2)
    RowAverage <- apply(m, 1, mean, na.rm = TRUE)
    RowAverage <- round(RowAverage, 2)
    RowAverage <- c(RowAverage, NA)
    m <- rbind(m, ColumnAverage)
    m <- cbind(m, RowAverage)
    return(m)
  } else {
    return(m)
  }
}

Employee Turnover Data Investigation

To begin this data investigation, I use the assumption that I have gained significant amounts of experience and field knowledge within Human Resources. I begin this heuristic analysis with the thought that employee turnover is greatly affected by how an employee feels about their job and about the company.

Are employees with small satisfaction levels more likely to leave?

The first thing I would like to confirm is that employees with small satisfaction levels are more likely to leave.


satisfaction <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", cut = 20, avg = TRUE)

View(satisfaction)

satisfaction_level_single

The function call here uses a cut value of 20, chosen for no particular reason other than being large enough to provide evidence for my claim.

As seen in the matrix, the satisfaction level range of 0.0891 to 0.136 shows that 92% of employees in this range left. This provides evidence that employees with low satisfaction levels are at the highest risk of leaving the company.

As we would expect, the highest levels of satisfaction of 0.954 to 1 experience 0% employee turnover.

For simplicity and ease of understanding, I define 0.5 as the average satisfaction level. Taking a look at the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454, there is an oddly significant increase in the risk of employees leaving. This particular area of employee satisfaction requires more investigation because it goes against intuition.

Are employees with below average satisfaction levels more likely to leave across different job functions?

To alleviate this concern of satisfaction levels defying our intuition, I continue the investigation by checking whether these satisfaction levels vary across other characteristics in the data. It is possible that these below-average satisfaction levels are tied to job function.


satisfaction_sales <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "sales", cut = 20, avg = TRUE)

View(satisfaction_sales)

satisfaction_sales

Here, the same satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 are generally at high risk of leaving across all job functions. There is evidence to suggest that somewhat unhappy workers are willing to leave regardless of their job function.

Is an unhappy employee’s likelihood of leaving related to average monthly hours worked?

To continue answering why below average satisfaction levels ranges experience higher employee turnover than we expect, I take a look at the relationship between satisfaction levels and average monthly hours worked. It could be that below average satisfaction levels at this company are tied to employees being overworked.


# First, convert the integer variable average_montly_hours into a numeric variable to take advantage of the function's ability to break down numeric variables into ranges.

average_montly_hours <- hr_data["average_montly_hours"]
average_montly_hours <- unlist(average_montly_hours)
average_montly_hours <- as.numeric(average_montly_hours)

hr_data["average_montly_hours"] <- average_montly_hours

satisfaction_avghours <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "average_montly_hours", cut = 20, avg = TRUE)

View(satisfaction_avghours)

satisfaction_hours

To reiterate, the row ranges represent satisfaction levels and the column ranges represent average monthly hours worked. Here, there is strong evidence to suggest that employees within the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 work between 117 and 160 hours a month.

Using domain knowledge, a full-time employee will typically work at least 160 hours a month, given that a full-time position entails 40 hours a week for four weeks in any given month. The data suggests that there is a higher probability of workers leaving given that they work less than a regular full-time employee! This was different from my initial train of thought that the employees were potentially overworked.

Given this finding, I come to one particular conclusion: the employees at highest risk of leaving are those on contract, seasonal employees, or part-time employees.

By considering other variables such as the number of projects worked on by an employee, it is possible to further support this conclusion.


satisfaction_projects <- rate_matrix(df = hr_data, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)

View(satisfaction_projects)

satisfaction_projects

Here, it is evident that the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454 may in fact correspond to contract or part-time employees, as the probability of turnover sharply decreases after two completed projects.

Are contract, part-time or seasonal employees more likely to be unhappy if the job is accident-prone?

Now that we have identified the high-risk groups for employee turnover within this data set, this question comes to mind because we would like to address the fact that an employee’s enjoyment of their role should be tied to their satisfaction level. It could be that these part-time employees are experiencing hardships during their time at work, thereby contributing to their risk of leaving.

To answer this question, I take a look at the satisfaction level and number of projects completed given that an employee experienced a workplace accident.


# I use the package dplyr in order to filter the hr_data dataframe to only include observations that experienced a workplace accident
require(dplyr)

accident_obs <- filter(hr_data, Work_accident == 1)

satisfaction_accident <- rate_matrix(df = accident_obs, y = "left", c1 = "satisfaction_level", c2 = "number_project", cut = 20, avg = TRUE)

View(satisfaction_accident)

satisfaction_accident

Here, given the below-average satisfaction ranges of 0.363 to 0.408 and 0.408 to 0.454, a number of projects equal to 2, and the fact that these employees experienced a workplace accident, there is evidence to suggest a higher chance of turnover.

Further Work

The purpose of this analysis was to apply a risk segmenting method on human resources analytics data to identify potential reasons for employee turnover. I used probabilities or turnover rates to help identify some groups of employees that were at risk of leaving the company.

I found that there were higher chances of turnover given that the employee had an extremely low satisfaction level, but I also discovered that the type of employment (contract, part-time, seasonal) could identify groups at high risk of turnover. I also addressed the possibility that the unhappiness of part-time employees was attributable to working in accident-prone jobs.

With the example presented in this post, Human Resources can use this information to put more efforts into ensuring contract, part-time, or seasonal employees experience lower turnover rates. This analysis allowed us to identify which groups of employees are at risk and allowed us to identify potential causes.

This risk analysis approach can be applied to any other field of practice other than Human Resources, including Health and Finance. It is useful to be able to come up with quick generic risk segments within your population so that further risk management solutions can be implemented for specific problems at hand.

Lastly, this post only provides a simple way to segment and analyze risk groups, but it is not the only way! More advanced methods such as clustering and decision trees can help identify risk groups more thoroughly and informatively to provide an even bigger picture. For quick checks against domain expertise in any particular field of practice, the rate function I present here can be sufficient for identifying risk groups.