Quickly go to any section of the Scorecard Building in R 5-Part Series:
i. Introduction
ii. Data Collection, Cleaning and Manipulation
iii. Data Transformations: Weight-of-Evidence
iv. Scorecard Evaluation and Analysis
v. Finalizing Scorecard with other Techniques

I used the dataframe manipulation package ‘dplyr’, the ‘parallel’ package for some basic parallel processing to speed up the code, and the ‘Information’ package, which lets me analyze the features in the data set using weight-of-evidence and information value.
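Before any of the code below will run, these three packages need to be attached (assuming they are already installed):

```r
library(dplyr)        # dataframe manipulation (mutate, filter, select)
library(parallel)     # basic parallel processing
library(Information)  # weight-of-evidence and information value tables
```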


First, I read in the Lending Club csv file downloaded from the Lending Club website. The file is saved on my local desktop, where it is easily accessed by the read.csv function.

data <- read.csv("C:/Users/artemior/Desktop/Lending Club model/LoanStats3d.csv")

Next, I create a column that indicates whether I will keep an observation (row) or not. This is based on loan status: for a predictive logistic regression model, I only want the statuses that can be strictly defined as a ‘Good’ loan or a ‘Bad’ loan.

data <- mutate(data,
               Keep = ifelse(loan_status %in% c("Charged Off",
                                                "Default",
                                                "Fully Paid",
                                                "Late (16-30 days)",
                                                "Late (31-120 days)"),
                             "Keep", "Remove"))

After creating the ‘Keep’ column, I filter the data depending on whether the observation is marked “Keep” or “Remove”.

sample <- filter(data, Keep == "Keep")

I further filter the data set to create two new samples. Lending Club offers two mutually exclusive loan products, so to better predict the riskiness of its loans, we can build two sub-risk models: one for 36-month term loans and one for 60-month term loans.

sample_36 <- filter(sample, term == " 36 months")
sample_60 <- filter(sample, term == " 60 months")

For the purposes of this scorecard-building demonstration, I will create a model using the 36-month term loans. Using the mutate function, I create a new column called ‘Bad’, which will be the binary dependent variable in the logistic regression.

sample_36 <- mutate(sample_36, Bad = ifelse(loan_status == "Fully Paid", 0, 1))

The next step is to clean up the table by removing any variables I do not want to include in the prediction model. Variables such as employment title would take more time to analyze, so for the purposes of this analysis I remove them.

features_36 <- sample_36 %>% select(-id, -member_id, -loan_amnt,
                                    -funded_amnt, -funded_amnt_inv, -term, -int_rate, -installment,
                                    -grade, -sub_grade, -pymnt_plan, -purpose, -loan_status,
                                    -emp_title, -out_prncp, -out_prncp_inv, -total_pymnt, -total_pymnt_inv,
                                    -total_rec_int, -total_rec_late_fee, -recoveries, -last_pymnt_d, -last_pymnt_amnt,
                                    -next_pymnt_d, -policy_code, -total_rec_prncp, -Keep)

To further understand the data, I want to look at the number of observations per category under each variable. This weeds out any variables that could be problematic in future algorithms.
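A quick way to do this is to count the distinct values in each column, then tabulate any suspicious ones individually (a sketch; `home_ownership` is just one example column from the Lending Club data):

```r
# Number of distinct values in each feature
sapply(features_36, function(x) length(unique(x)))

# Frequency table for a single categorical feature
table(features_36$home_ownership)
```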

Once the features table is complete, I use the information value methodology to transform the raw feature data. In theory, transforming the raw data into a proportional log-odds value, as seen in the weight-of-evidence, maps better onto a logistic-regression fitted curve.

IV <- create_infotables(data = features_36, y = "Bad")

We can generate a summary of the IVs for each feature. The IV for a particular feature is the sum of the individual bin IVs.
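In the ‘Information’ package, that summary lives in the `Summary` element of the object returned by create_infotables, with one row per feature and its total IV:

```r
# Feature-level information values, one row per feature
IV$Summary
```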


We can even check the IV tables for individual features and see how each feature was binned, the percentage of observations each bin represents out of the total, the WOE attributed to the bin, as well as the IV. The following code is an example of presenting the feature summary for the last credit pull date.
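Assuming the raw Lending Club column for the last credit pull date is named `last_credit_pull_d` (as in the downloadable CSV), its per-bin table can be pulled from the `Tables` element:

```r
# Bin-level breakdown for one feature: bins, % of observations, WOE, IV
IV$Tables$last_credit_pull_d
```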


I analyze the behavior of continuous and ordered-discrete variables by plotting their weight-of-evidence values. In theory, the best possible transformation occurs when the weight-of-evidence exhibits a monotonic relationship. First, I define features_36_names_plot as the vector of column names; I will loop over this vector with a function that plots the WOE graph for each corresponding feature in features_36. I remove categorical features that would generate far too many bins to plot later on. For example, I removed the feature zip_code, which has over 500 distinct values.

features_36_names_plot <- colnames(features_36)[c(-7, -11, -ncol(features_36))]

Here is the code for the plotWOE function I just mentioned. This function generates a WOE plot for input x, where x is a string giving the column name of a specific feature. Recall that I generated the vector of names in features_36_names_plot.

plotWOE <- function(x) {
  p <- plot_infotables(IV, variable = x, show_values = TRUE)
  return(p)
}

To keep my for loop clean and fast, I store the length of the feature-name vector in a variable.

feature_name_vector_length_plot <- length(features_36_names_plot)

Now for the fun part: to generate a graph for each feature, I use a for loop that iterates over every string in the features_36_names_plot vector and plots a WOE graph for each name that corresponds to a feature in the features_36 data frame. To be safe, I wrap the call in error handling, because somewhere in this large set of features there may be one or two for which a WOE plot cannot be created. This happens when a particular feature contains only one category or value across all observed loans.

for (i in 1:feature_name_vector_length_plot) {
  p <- tryCatch(plotWOE(features_36_names_plot[i]),
                error = function(e) {
                  print(paste("Removed variable: ", features_36_names_plot[i]))
                  NaN
                })
  print(p)
}

About 90 graphs are generated by the for loop. Below I present two examples of the kinds of graphs produced and discuss what they mean.


The home-ownership weight-of-evidence plot shows that a greater proportion of good consumer-loan customers own their homes, while a greater proportion of bad consumer-loan customers pay rent. Those still paying a mortgage are slightly better customers.


The months-since-delinquency plot (time since a customer last failed to pay off some form of credit) presents another intuitive relationship. The more months that have passed since a customer’s most recent delinquency, the more likely they are to be a good customer who pays off their loan. A low number of months since the most recent delinquency means the customer has only just failed to pay off other forms of credit. This goes to show that even if you have had a delinquency, you can improve your credit management and behavior over time.

In the plot, something odd happens for customers whose delinquency occurred 19 to 31 months before they received another consumer loan. This could suggest a lagging effect: it takes time to fully chase down a customer, and months of notifications may be sent before the customer is actually classified as delinquent.

In the next post, Scorecard Building – Part III – Data Transformations, I will describe how the data we prepared and analyzed using information theory will be transformed to better suit a logistic regression model.