Quickly go to any section of the Scorecard Building in R 5-Part Series:
ii. Data Collection, Cleaning and Manipulation
iii. Data Transformations: Weight-of-Evidence
iv. Scorecard Evaluation and Analysis
v. Finalizing Scorecard with other Techniques
In the previous section, part IV of the scorecard building process, I trained, validated and tested a logistic regression model that serves as the heart of the scorecard. In this section, I address the obvious sample selection problem: loans are accepted based on merit and personal credit information, and rejected for lack of credentials. I also analyze the model's predictions, using the predicted scores for the training set to build a grading scheme. As an extra exercise, I scale the log-odds score into a more understandable scoring function.
To account for the sample selection bias in our model, performance inference methods are used to predict how rejected clients would have performed had they actually been given a loan. The first step is creating a function that maps domain knowledge of features to the log-odds of accepted customers. This function is then used to predict the probabilities of rejected customers. The rejected data set does not indicate which applicants applied for a 36-month loan term, so I assume that all of these applicants did.
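As a minimal sketch of that idea, using made-up data rather than the Lending Club files (the column names `emp_length` and `bad` here are illustrative, not the real data set's):

```r
# Toy reject-inference sketch: fit a model only on accepted applicants,
# then use it to score applicants who were rejected.
set.seed(1)
accepted <- data.frame(
  emp_length = runif(200, 0, 10),
  bad        = rbinom(200, 1, 0.2)
)
rejected <- data.frame(emp_length = runif(50, 0, 10))

# Logistic regression trained on accepted applicants only
fit <- glm(bad ~ emp_length, data = accepted, family = binomial)

# Inferred probabilities for the rejected applicants
rejected$prob <- predict(fit, newdata = rejected, type = "response")
```

The key point is that the rejected applicants never enter the fitting step; the model only ever sees their features at prediction time.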
I will utilize the following packages:
```r
library(dplyr)
library(ggplot2)
library(scales)
```
I will require the data set of rejected applicants and the data available on them. This can be downloaded here.

```r
rejected_data <- read.csv("C:/Users/artemior/Desktop/Lending Club Model/RejectStatsD.csv")
```
To start off, I create a new logistic regression model with the same Bad_Binary response but only the features that are common to both accepted and rejected applicants: zip code, state and employment length. The data I use comes from section II, specifically WOE_matrix and Bad_Binary. It is assumed that building this inference model takes into account all variable WOE transformations. Bad_Binary_Original calls upon the original Bad variable from the features_36 vector.

```r
Bad_Binary_Original <- features_36$Bad
sample_inference_features <- WOE_matrix_final[c("zip_code", "addr_state", "emp_length")]
sample_inference_features["Bad_Binary"] <- Bad_Binary_Original
```
Run a simple generalized linear model on the accepted applicants data set.

```r
# The binomial family makes this a logistic regression on the binary response
sample_inference_model <- glm(Bad_Binary ~ .,
                              data = sample_inference_features,
                              family = binomial(link = "logit"))
```
Here, I calculate the WOEs for the rejected applicants by applying the WOE tables from the accepted applicants onto the rejected applicant data set. The code and methodology of the transformation are exactly those in part III.

```r
features_36_inference <- rejected_data %>%
  select(-Amount.Requested, -Application.Date, -Loan.Title, -Risk_Score,
         -Debt.To.Income.Ratio, -Policy.Code)
features_36_inference_names <- colnames(features_36_inference)
```
Initiate a cluster for parallel processing.

```r
require(parallel)
number_cores <- detectCores() - 1
cluster <- makeCluster(number_cores)
clusterExport(cluster, c("IV", "min_function", "max_function",
                         "features_36_inference", "features_36_inference_names",
                         "only_features_36", "recode", "WOE_tables"))
```
Create the WOE matrix table for the rejected data applicants.

```r
WOE_matrix_table_inference <- parSapply(cluster,
                                        as.matrix(features_36_inference_names),
                                        FUN = WOE_tables_function)
```
WOE_matrix_inference is the converted WOE matrix for the rejected applicant data set. This is the data set we will use to predict their scores for model performance inference.

```r
WOE_matrix_inference <- parSapply(cluster, features_36_inference_names,
                                  FUN = create_WOE_matrix)
```
I use WOE_matrix_inference to obtain predicted probabilities from sample_inference_model, which was built on the accepted applicant data set.

```r
# newdata (not data) tells predict() to score the rejected applicants
rejected_inference_prob <- predict(sample_inference_model,
                                   newdata = WOE_matrix_inference,
                                   type = "response")
rejected_inference_prob_matrix <- as.matrix(rejected_inference_prob)
rejected_inference_prob_dataframe <- as.data.frame(rejected_inference_prob_matrix)
colnames(rejected_inference_prob_dataframe) <- c("Probabilities")
stopCluster(cluster)
```
Now I obtain the predicted probabilities for all accepted applicants using the cross-validated elastic-net logistic regression model from section IV.

```r
require(doParallel)
number_cores <- detectCores()
cluster <- makeCluster(number_cores)
registerDoParallel(cluster)
accepted_prob <- predict(glmnet.fit, LC_WOE_Dataset, type = "prob")
accepted_prob_matrix <- as.matrix(accepted_prob[, 2])
accepted_prob_dataframe <- as.data.frame(accepted_prob_matrix)
colnames(accepted_prob_dataframe) <- c("Probabilities")
```
I combine the probabilities from both the rejected and accepted applicants and generate a graph that depicts the distribution of the probabilities.

```r
probability_matrix <- rbind(accepted_prob_matrix, rejected_inference_prob_matrix)
probability_matrix <- as.data.frame(probability_matrix)
colnames(probability_matrix) <- c("Probabilities")
```
I use ggplot to plot the distribution of probabilities of default. Here we see that the distribution is left-skewed. This represents both accepted and rejected applications together; it is also useful to analyze the distributions of the accepted and rejected applicants separately.

```r
probability_distribution <- ggplot(data = probability_matrix, aes(Probabilities))
probability_distribution <- probability_distribution + geom_histogram(bins = 50)

accepted_probability_distribution <- ggplot(data = accepted_prob_dataframe, aes(Probabilities))
accepted_probability_distribution <- accepted_probability_distribution + geom_histogram(bins = 50)
```
```r
rejected_probability_distribution <- ggplot(data = rejected_inference_prob_dataframe, aes(Probabilities))
rejected_probability_distribution <- rejected_probability_distribution + geom_histogram(bins = 50)
```
The accepted probability distribution is left-skewed, while the distribution of the rejected applicants is roughly normal. This can be interpreted as rejected applicants exhibiting a normally distributed score if they were to be funded. A roughly normal distribution suggests the rejections are not concentrated in any one score region, so the sample selection bias from rejection appears minimal; if this were not the case, I would suspect some bias in the acceptance and rejection of applicants.
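This visual comparison can be made more concrete with a two-sample Kolmogorov-Smirnov test, an optional check not part of the original workflow. The toy vectors below stand in for the accepted and rejected probability vectors computed above:

```r
# Hypothetical shapes mimicking the two distributions described in the text:
# a left-skewed accepted distribution and a roughly normal rejected one.
set.seed(2)
accepted_p <- rbeta(500, 2, 8)       # skewed, like the accepted scores
rejected_p <- rnorm(500, 0.5, 0.1)   # roughly normal, like the rejected scores

# Two-sample KS test: small p-value means the distributions differ
ks <- ks.test(accepted_p, rejected_p)
```

On the real data, `ks.test(accepted_prob_matrix, rejected_inference_prob_matrix)` would quantify how far apart the two score distributions actually are.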
After gaining some insight into how our model performs on a population it has never seen, I formalize the scorecard by creating a grading scheme that defines several levels of risk.
Here, I organize the accepted applicant probabilities into bins to initiate a lift analysis. The lift analysis helps determine which bins of scores will be described by particular letter grades. I create 25 bins to mimic the sub-grade system LC uses in its public data set. I then append the "Bad" column from features_36 and summarize the information so that I can calculate the proportion of bads within each bin.

```r
bins <- 25
Bad_Binary_Values <- features_36$Bad
prob_bad_matrix <- as.data.frame(cbind(accepted_prob_matrix, Bad_Binary_Values))
colnames(prob_bad_matrix) <- c("Probabilities", "Bad_Binary_Values")
```
I sort the rows in increasing order of the probabilities column; ordering by the first column carries the corresponding values in the second column along with it.

```r
Probabilities <- prob_bad_matrix[, 1]
Bad_Binary_Values <- prob_bad_matrix[, 2]
order_accepted_prob <- prob_bad_matrix[order(Probabilities, Bad_Binary_Values,
                                             decreasing = FALSE), ]
```
I create the bins based on the sorted probabilities and build a new data frame consisting of only the bins and bad binary values. This is the data frame I use to conduct the lift analysis.

```r
bin_prob <- cut(order_accepted_prob$Probabilities, breaks = bins, labels = 1:bins)
order_bin <- as.data.frame(cbind(bin_prob, order_accepted_prob[, 2]))
colnames(order_bin) <- c("Bin", "Bad")
```
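As a quick sanity check of how cut() assigns observations to equal-width bins (toy numbers, not the real probabilities):

```r
# cut() splits a numeric vector into equal-width intervals and
# returns a factor of bin labels
p <- c(0.05, 0.12, 0.48, 0.51, 0.95)
bins_demo <- cut(p, breaks = 5, labels = 1:5)
as.integer(bins_demo)  # 1 1 3 3 5
```

Note that cut() produces equal-width intervals over the observed range, so the bins generally hold unequal numbers of applicants; quantile-based breaks would give equal-count bins instead.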
I summarize the information, calculating the proportion of bads within each bin.

```r
bin_table <- table(order_bin$Bin, order_bin$Bad)
Bin_Summary <- group_by(order_bin, Bin)
Bad_Summary <- summarize(Bin_Summary, Total = n(), Good = sum(Bad),
                         Bad = 1 - Good / Total)
```
Using Bad_Summary, I draw a bar plot that represents the lift analysis.

```r
lift_plot <- ggplot(Bad_Summary, aes(x = Bin, y = Bad))
lift_plot <- lift_plot + geom_bar(stat = "identity", colour = "skyblue", fill = "skyblue")
lift_plot <- lift_plot + xlab("Bin")
lift_plot <- lift_plot + ylab("Proportion of Bad")
```
The graph shows the 25 bins and the proportion of bad customers within each. As expected, the proportion of bads decreases across the bins, which shows the effectiveness of our classifier. By separating scores into 25 bins, I mimic LC's sub-grading system and could apply the exact same logic to this scorecard.
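To push the LC analogy one step further, the 25 bins could be labelled with sub-grades. The mapping below is my own illustration (the bin-to-grade assignment is an assumption, not LC's actual scheme): bin 1, which has the highest proportion of bads, gets the riskiest grade.

```r
# Hypothetical sub-grade labels "A1" through "E5", assigned so that
# bin 1 (highest proportion of bads) maps to E5 and bin 25 to A1
grades <- paste0(rep(LETTERS[1:5], each = 5), rep(1:5, times = 5))
grade_table <- data.frame(Bin = 1:25, Grade = rev(grades))
```

A join of `grade_table` onto `Bad_Summary` by `Bin` would then attach a letter grade to every applicant's bin.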
To finalize the scorecard, I generate a linear function of the log-odds and apply a three-digit score mapping that will help upper management understand the risk score produced by the scorecard. First I convert the probability scores into log-odds form and decide what type of linear transformation to apply.

```r
Accepted_Probabilities <- Probabilities
# Log-odds transform: log(p / (1 - p))
LogOdds <- log(Accepted_Probabilities / (1 - Accepted_Probabilities))
```
The presentation of the score is up to the analyst and how they feel it is best communicated to upper management for easy interpretation. Here, I apply a three-digit score transformation to the log-odds using a simple linear function. To calculate the slope of the line, I use the minimum and maximum log-odds as the input range and map it to a score range of 100 to 1000.

```r
max_score <- 1000
min_score <- 100
max_LogOdds <- max(LogOdds)
min_LogOdds <- min(LogOdds)
linear_slope <- (max_score - min_score) / (max_LogOdds - min_LogOdds)
linear_intercept <- max_score - linear_slope * max_LogOdds
```
Here, the linear slope is 82.6 and the intercept is 1000, so the applicant with the highest observed log-odds receives the maximum score of 1000. As the log-odds decreases based on the features fed into the model, the score decreases proportionally: every one-unit change in log-odds moves the score by 82.6 points, so an applicant two log-odds units below another scores 165.2 points lower. Each 82.6-point step in the score therefore corresponds to one unit of log-odds, a spacing of risk levels that is easily understood by non-technical audiences.
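To make the arithmetic concrete, here is the same linear mapping packaged as a standalone function, with made-up log-odds bounds chosen for illustration (the real slope and intercept come from the fitted model above):

```r
# Hypothetical log-odds range for illustration only; these bounds give a
# slope close to the 82.6 reported in the text
min_LogOdds <- -9.0
max_LogOdds <-  1.9
max_score <- 1000
min_score <- 100

linear_slope <- (max_score - min_score) / (max_LogOdds - min_LogOdds)
linear_intercept <- max_score - linear_slope * max_LogOdds

# Map any log-odds value onto the three-digit score scale
score <- function(log_odds) linear_intercept + linear_slope * log_odds
```

By construction, `score(max_LogOdds)` returns 1000 and `score(min_LogOdds)` returns 100; anything in between falls linearly along the scale.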
There you have it! After five parts, the scorecard is ready for presentation and production. I could go even further into some of the things that could still be changed, such as adjusting scoring thresholds to meet business requirements or accepted levels of risk. I could also explain how to link the scorecard and its grades to expected losses and profit margins for the company. This extends beyond what I have demonstrated here, but it is definitely possible and well worth considering, especially for a financial company.
The R code for all 5 parts of the scorecard building process can be found at my Github page.