Quickly go to any section of the Scorecard Building in R 5-Part Series:
ii. Data Collection, Cleaning and Manipulation
iii. Data Transformations: Weight-of-Evidence
iv. Scorecard Evaluation and Analysis
v. Finalizing Scorecard with other Techniques
This part continues from Part III, where the Weight-of-Evidence (WOE) matrix and information values gave us an idea of how consumer credit information could help predict the performance of 36-month loans. In this part, we train, test and validate an elastic-net logistic regression model. Logistic regression is one of the most widely used machine learning techniques; it models the log-odds of a Bernoulli-distributed outcome as a continuous linear function of the predictors. We also use parallel processing to speed up the large number of computations that the caret package performs on the data set.
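As a reminder of what "log-odds" means here, the plain (unpenalized) logistic regression model can be written as below. The symbols are my own notation for illustration, not names taken from the code: p_i is the probability of the modelled class for loan i, x_ij is the j-th WOE feature, and the betas are the coefficients.

```latex
\log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij},
\qquad p_i = \Pr(y_i = 1 \mid x_i)
```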
    library(caret)
    library(doParallel)
    library(pROC)
    library(glmnet)
    library(Matrix)
Set the seed so that we obtain reproducible results when we train our model.
    set.seed(20160727)
Redefine the WOE matrix obtained from part III as our main dataset for this section. Please see part III to see how WOE_matrix_final was obtained.
    LC_WOE_Dataset <- WOE_matrix_final
Use createDataPartition to divide the random sample into a training set and a test set, where 75% of the data goes to training and 25% goes to testing.
    partition <- createDataPartition(LC_WOE_Dataset$Bad_Binary, p = 0.75, list = FALSE)
    training <- LC_WOE_Dataset[partition,]
    testing <- LC_WOE_Dataset[-partition,]
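createDataPartition samples within each class of the outcome, so the training and test sets keep roughly the same Good/Bad mix. The base R sketch below only illustrates that idea on a made-up data frame (toy, its x column and the 20/80 class mix are all hypothetical); it is not caret's implementation.

```r
set.seed(42)

# Hypothetical toy data: Bad_Binary mimics the outcome column used in this post
toy <- data.frame(
  Bad_Binary = factor(rep(c("Bad", "Good"), times = c(20, 80))),
  x = rnorm(100)
)

# Sample 75% of the row indices separately within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$Bad_Binary)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Both subsets keep the original class proportions
prop.table(table(toy_train$Bad_Binary))
prop.table(table(toy_test$Bad_Binary))
```

With these toy counts, each class contributes exactly three quarters of its rows to the training subset.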
Define the type of resampling that will be used. Here, I use k-fold cross-validation with the number of folds set to 3. Note that method = "cv" runs the folds once; method = "repeatedcv" would be needed to repeat them. Later on, I want to select the model that maximizes the AUC for this classification problem. I set savePredictions to TRUE to save the hold-out predictions from each step of the cross-validation. I set classProbs to TRUE so that class probabilities are computed alongside the predicted classes in each resample. summaryFunction is set to twoClassSummary, which computes the AUC, sensitivity and specificity for each resample.
    fitControl <- trainControl(method = "cv",
                               number = 3,
                               savePredictions = TRUE,
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary)
Now we train the model. First we set the seed so that the cross-validation folds, and therefore the results, are reproducible.
    set.seed(1107)
Here, parallel processing is initiated to speed up the computations that follow.
    number_cores <- detectCores()
    cluster <- makeCluster(number_cores)
    registerDoParallel(cluster)
Set up the lambda and alpha grids that the train function will use to fit an elastic-net logistic regression model. The alpha term acts as a weight between the L1 and L2 penalties: at the extremes, alpha = 1 gives the LASSO and alpha = 0 gives ridge regression. Penalized regression models aim to balance the bias-variance trade-off by accepting some increase in bias in exchange for a decrease in variance. The lambda parameter controls the overall strength of the penalty: larger values shrink the coefficient estimates further toward 0, and with an L1 component some estimates are set exactly to 0, which indirectly serves as variable reduction. An elastic-net regression also handles collinear predictors very well.
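To make the roles of alpha and lambda concrete, the objective that glmnet minimizes for a binomial model can be sketched as follows. The notation is mine, for illustration: N is the number of training rows, the loss term is the log-likelihood contribution of row i, and beta is the coefficient vector excluding the intercept.

```latex
\min_{\beta_0, \beta} \;
-\frac{1}{N} \sum_{i=1}^{N} \ell\!\left(y_i, \beta_0 + x_i^{\top}\beta\right)
+ \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2
+ \alpha \lVert \beta \rVert_1 \right]
```

Setting alpha = 1 leaves only the L1 term (LASSO), setting alpha = 0 leaves only the squared L2 term (ridge), and lambda scales the whole penalty.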
Caution: the following code takes a long time to run. Here, I let the algorithm use its default grid of alpha and lambda values (sized by tuneLength = 5) to obtain a solution. I could have more control over this by setting up my own sequences of alpha and lambda values and passing them to the train function through the tuneGrid argument.
    glmnet.fit <- train(Bad_Binary ~ .,
                        data = training,
                        method = "glmnet",
                        family = "binomial",
                        metric = "ROC",
                        trControl = fitControl,
                        tuneLength = 5)
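For readers who want that extra control, a custom grid could be built along these lines. The alpha and lambda values below are purely illustrative assumptions, not the ones used in this series, and the train call is left commented out because it depends on the training and fitControl objects defined earlier.

```r
# Hypothetical grid values; adjust the ranges to your own data
glmnet.grid <- expand.grid(
  alpha = seq(0, 1, by = 0.25),            # 0 = ridge, 1 = LASSO
  lambda = 10^seq(-4, 0, length.out = 10)  # penalty strengths on a log scale
)

# One row per (alpha, lambda) combination to be cross-validated
nrow(glmnet.grid)

# The grid would replace tuneLength in the call to train(), e.g.:
# glmnet.fit <- train(Bad_Binary ~ ., data = training, method = "glmnet",
#                     family = "binomial", metric = "ROC",
#                     trControl = fitControl, tuneGrid = glmnet.grid)
```

Every extra grid row adds one more set of cross-validated fits, so a finer grid lengthens the run time accordingly.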
When training completes, we take a look at the summary of the tuning parameters that were used in the cross-validation process. Here, the AUC (reported as ROC) is used to select alpha and lambda. Alpha = 1 is selected and therefore the model reduces to a LASSO regression model. We also plot how the AUC changes with the tuning parameter lambda.
    glmnet.fit
    plot(glmnet.fit)
Some terminology needs addressing. Specificity, as presented in the summary, is the fraction of good loans that the model predicted as good. Sensitivity is the fraction of bad loans that the model predicted as bad. The glmnet.fit summary presents the sensitivity and specificity for every combination of the alpha and lambda tuning parameters. The corresponding values in the ROC column are the AUC values.
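A tiny worked example may help pin these two definitions down. The eight loans below are made up for illustration, with "Bad" treated as the class of interest as in the definitions above.

```r
# Hypothetical actual and predicted labels for eight loans
actual <- factor(
  c("Bad", "Bad", "Bad", "Good", "Good", "Good", "Good", "Good"),
  levels = c("Bad", "Good")
)
predicted <- factor(
  c("Bad", "Bad", "Good", "Good", "Good", "Good", "Bad", "Good"),
  levels = c("Bad", "Good")
)

confusion <- table(Predicted = predicted, Actual = actual)

# Sensitivity: bad loans predicted bad / all bad loans
sensitivity <- confusion["Bad", "Bad"] / sum(confusion[, "Bad"])
# Specificity: good loans predicted good / all good loans
specificity <- confusion["Good", "Good"] / sum(confusion[, "Good"])

sensitivity  # 2 of 3 bad loans caught
specificity  # 4 of 5 good loans kept
```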
Given the code above, 3-fold cross-validation splits the training set into 3 parts, labelled Set 1, 2 and 3. The algorithm holds out 1 of the sets as a test set and trains the model on the other two. It then calculates the AUC and stores it in memory. The algorithm repeats this procedure for every combination of training and test sets, i.e. Training = ((1,2), (1,3), (2,3)), Testing = ((3), (2), (1)). After all ROCAUCs are calculated, it averages them and reports that average for the given combination of alpha and lambda. In this case, the train function runs 3-fold cross-validation for every parameter combination in the grid, so each ROCAUC reported by the algorithm is the average over the 3 models trained for that combination.
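The procedure can be sketched by hand in base R. To keep the example self-contained, it uses simulated data and a plain glm fit rather than glmnet, and computes the AUC with the rank (Mann-Whitney) formula; the data, the variable names and the auc_rank helper are all assumptions made for illustration, not what caret does internally.

```r
set.seed(1)

# Simulated data standing in for the WOE training set
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.5 + 1.5 * toy$x1 - toy$x2))

# AUC via the rank (Mann-Whitney) formula; labels are 0/1
auc_rank <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Randomly assign every row to one of 3 folds
folds <- sample(rep(1:3, length.out = n))

# Hold out each fold in turn, train on the other two, score the held-out fold
fold_auc <- sapply(1:3, function(k) {
  fit <- glm(y ~ x1 + x2, data = toy[folds != k, ], family = binomial)
  p <- predict(fit, newdata = toy[folds == k, ], type = "response")
  auc_rank(toy$y[folds == k], p)
})

fold_auc        # one AUC per held-out fold
mean(fold_auc)  # the cross-validated AUC for this single model setting
```

In the real run, train repeats this loop once for each alpha and lambda combination and keeps the combination with the highest average.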
We obtained an ROCAUC of 0.8754938 from the cross-validation step. We now need to test how well the model performs on data it has never seen: our initial test set. If testing the model on the test set gives a similar ROCAUC, then we can conclude that overfitting has been appropriately addressed. It is up to the discretion of the analyst to decide how close the cross-validation ROCAUC and the test set ROCAUC need to be. test.glmnet.fit holds the predicted probabilities used to generate the ROCAUC.
    test.glmnet.fit <- predict(glmnet.fit, testing, type = "prob")
    auc.condition <- ifelse(testing$Bad_Binary == "Good", 1, 0)
    # Assumes the probability column for the "Good" class, matching auc.condition
    auc.test.glmnet.fit <- roc(auc.condition, test.glmnet.fit[, "Good"])
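It can help to remember what the AUC measures: the probability that a randomly chosen loan from the class coded 1 receives a higher predicted probability than a randomly chosen loan from the class coded 0. The sketch below checks this by brute force on six made-up loans; the labels and scores are hypothetical and pROC itself is not used here.

```r
# Hypothetical labels (1 = Good, 0 = Bad) and predicted probabilities of Good
labels <- c(1, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.3, 0.6, 0.4)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Compare every (Good, Bad) pair; ties count as half a win
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_pairwise <- mean(wins)

auc_pairwise  # 8 of the 9 pairs are ranked correctly
```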
auc.test.glmnet.fit gives an ROCAUC of 0.8792, which is very close to our cross-validated ROCAUC. I am happy with this result and proceed to use this as the final model for LC’s risk scorecard.
    auc.test.glmnet.fit
I then visualize the ROC curve with a plot from the pROC package.
    plot(auc.test.glmnet.fit, col = "red", grid = TRUE)
Since the primary focus of this project is to set up a logistic regression scorecard for Lending Club, the model obtained here is sufficient. I could go further and test several other classification models such as random forests, decision trees, etc. With the elastic-net logistic regression, I obtain the coefficient estimates corresponding to the selected regularization parameter. The heart of the model lies within these coefficient estimates. As previously mentioned, the algorithm acts as a form of variable selection in that it shrinks the coefficients of features to 0 if overfitting is suspected or collinearities are present within the WOE dataset.
    final.model <- glmnet.fit$finalModel
    coef.final.model <- as.matrix(coef(final.model, glmnet.fit$bestTune$lambda))
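Once the coefficients are in a one-column matrix, the features the penalty kept can be separated from those it dropped. The matrix below is a hypothetical stand-in with made-up feature names, values and column name, shaped like the output of as.matrix(coef(...)); the same indexing should carry over to coef.final.model.

```r
# Hypothetical coefficient matrix; zeros mark features removed by the penalty
coef.toy <- matrix(
  c(-1.20, 0.85, 0, 0.40, 0),
  ncol = 1,
  dimnames = list(
    c("(Intercept)", "feature_a", "feature_b", "feature_c", "feature_d"),
    "s1"
  )
)

# Rows with a non-zero estimate stay in the scorecard
kept <- coef.toy[coef.toy[, 1] != 0, , drop = FALSE]
# Names of the features whose coefficients were shrunk to exactly 0
dropped <- rownames(coef.toy)[coef.toy[, 1] == 0]

kept
dropped
```

Using drop = FALSE keeps the result as a matrix with its row names even when only one feature survives.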
In the next section, Scorecard Building – Part V – Rejected Sample Inference, Grade Analysis and Scoring Techniques, I discuss how the model is evaluated and analyzed for further business implications.