Quickly go to any section of the Scorecard Building in R 5-Part Series:
i. Introduction
ii. Data Collection, Cleaning and Manipulation
iii. Data Transformations: Weight-of-Evidence
iv. Scorecard Evaluation and Analysis
v. Finalizing Scorecard with other Techniques


In Part II of the scorecard-building process, I prepared the Lending Club data for a logistic regression model that will act as a scorecard, separating good customers from bad ones.

In this section, I transform the data set by replacing raw values with their weight-of-evidence (WOE) values. In the previous section, a WOE value was computed for each binning group; here, for every feature column, I replace each data point with the WOE value of its corresponding bin. For example, for the home ownership variable, all customers who are paying a mortgage on their homes have a WOE of 0.29, so every 'Mortgage' entry in the home ownership column is replaced with 0.29. Applying this transformation to every feature yields a new matrix containing only WOE values.
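As a minimal sketch of this replacement (the 0.29 for 'Mortgage' comes from the example above; the RENT and OWN values are made up for illustration):

# Hypothetical illustration: swap each category for its bin's WOE value
home_ownership <- c("MORTGAGE", "RENT", "MORTGAGE", "OWN")
woe_lookup <- c(MORTGAGE = 0.29, RENT = -0.24, OWN = 0.11)  # assumed WOE values
unname(woe_lookup[home_ownership])
# [1]  0.29 -0.24  0.29  0.11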

The following code begins this process, continuing from the code in Part II. I use the 'parallel' package for some basic parallel processing that makes the code run faster.

library(parallel)

First, here are two helper functions that extract the minimum and maximum values from a bin range stored as a string (e.g., "[500,5000]").

# Extract the lower bound of a bin label such as "[500,5000]"
min_function <- function(x) {
  remove_brackets <- gsub("\\[|\\]", "", x = x)  # drop the square brackets
  take_min <- gsub(",.*", "", remove_brackets)   # keep everything before the comma
  min_value <- as.numeric(take_min)
  return(min_value)
}

# Extract the upper bound of a bin label such as "[500,5000]"
max_function <- function(x) {
  remove_brackets <- gsub("\\[|\\]", "", x = x)  # drop the square brackets
  take_max <- gsub(".*,", "", remove_brackets)   # keep everything after the comma
  max_value <- as.numeric(take_max)
  return(max_value)
}
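A quick sanity check on sample inputs. Note that category labels come back as NA, which the next function relies on to tell numeric bins apart from categories:

min_function("[500,5000]")  # 500
max_function("[500,5000]")  # 5000
min_function("MORTGAGE")    # NA (with a warning) - category labels are not ranges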

The following function tabulates each variable's WOE values against their respective bins or categories. This groups all variables into lookup tables and makes later lookups easier once NA bins are recoded as -1.

# Feature names and columns (the last column of features_36 is the Good/Bad label)
features_36_names_WOE <- colnames(features_36)[-ncol(features_36)]
features_36_names_WOE_vector_length <- length(features_36_names_WOE)
only_features_36 <- features_36[-ncol(features_36)]

WOE_tables_function <- function(x) {
  # Pull the binning table for variable x out of the IV object
  create_table <- IV$Tables[[x]]
  bins <- create_table[, 1]

  # Parse the "[min,max]" bin labels; category labels and the NA bin come back as NA
  MIN <- suppressWarnings(min_function(bins))
  MAX <- suppressWarnings(max_function(bins))
  WOE <- create_table$WOE

  if (all(is.na(MIN))) {
    # Categorical variable: keep the category labels next to their WOE values
    table <- cbind.data.frame(categories = factor(bins), WOE)
    return(table)

  } else {
    # Numeric variable: recode the missing-value (NA) bin, if any, as -1
    MIN[is.na(MIN)] <- -1
    MAX[is.na(MAX)] <- -1
    table <- cbind(MIN, MAX, WOE)
    return(table)
  }
}
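To illustrate the two output shapes (with made-up bins and WOE values, not the real Lending Club tables):

# Illustrative only - suppose IV$Tables held toy entries like:
#   loan_amnt:      bins "NA", "[500,5000]", "[5001,35000]"; WOE -0.10, 0.25, -0.31
#   home_ownership: categories "MORTGAGE", "OWN", "RENT";    WOE  0.29, 0.11, -0.24
WOE_tables_function("loan_amnt")       # MIN/MAX/WOE matrix; the NA bin becomes -1/-1
WOE_tables_function("home_ownership")  # categories/WOE data frame, labels kept as a factor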

To obtain the results of WOE_tables_function quickly, we spread the work across three of my laptop's cores. This is known as parallel processing, and it speeds up the otherwise slow task of applying a function to each feature.

First, detect how many cores the laptop has. The work of applying WOE_tables_function will be distributed across all cores but one; the remaining core is kept free so the machine stays responsive for any other tasks, programming-related or not.

number_cores <- detectCores() - 1

Next, initiate the cluster: the group of worker processes, one per designated core, that will carry out the computation.

cluster <- makeCluster(number_cores)

Then I export the objects that the workers need: the IV object and the helper functions called inside the main function WOE_tables_function.

clusterExport(cluster, c("IV", "min_function", "max_function"))

WOE_tables holds the lookup tables produced by the cluster. We build it with parSapply, which works like sapply except that it runs in parallel; because the tables differ in shape, the result comes back as a named list with one table per feature.

WOE_tables <- parSapply(cluster, features_36_names_WOE, FUN = WOE_tables_function)

Usually we would stop the cluster at this point so the computer could reclaim the memory for other work. Since we still need it, we leave it running and continue toward the final aggregated WOE matrix.

recode is a helper function that takes a feature vector and its column name, looks the name up in the WOE_tables list, and replaces every raw value in the vector with its corresponding WOE value.

recode <- function(x, y) {
  # Look up the WOE table for feature y
  create_r_WOE_table <- WOE_tables[[y]]
  data_type_indicator <- create_r_WOE_table[1, 1]

  if (is.factor(data_type_indicator)) {
    # Categorical variable: match each raw category to its WOE value
    category_Table <- as.character(create_r_WOE_table[, 1])
    corresponding_WOE_Table <- as.numeric(create_r_WOE_table[, 2])
    raw_variable <- corresponding_WOE_Table[match(as.character(x), category_Table)]

    return(raw_variable)

  } else if (data_type_indicator == -1) {
    # Numeric variable with a missing-value bin: row 1 holds the WOE for NAs
    min_r_Table <- create_r_WOE_table[, 1]
    max_r_Table <- create_r_WOE_table[, 2]
    corresponding_WOE_Table <- as.numeric(create_r_WOE_table[, 3])
    raw_variable <- x

    # Assign the missing-value WOE first so the range comparisons below
    # cannot select NA positions
    missing <- is.na(x)
    raw_variable[missing] <- corresponding_WOE_Table[1]

    for (i in 2:length(min_r_Table)) {
      # Compare against the original x, not the partially recoded vector
      condition_1 <- x >= min_r_Table[i]
      condition_2 <- x <= max_r_Table[i]
      raw_variable[!missing & condition_1 & condition_2] <- corresponding_WOE_Table[i]
    }

    return(as.numeric(raw_variable))

  } else {
    # Numeric variable without a missing-value bin
    min_r_Table <- create_r_WOE_table[, 1]
    max_r_Table <- create_r_WOE_table[, 2]
    corresponding_WOE_Table <- as.numeric(create_r_WOE_table[, 3])
    raw_variable <- x

    for (i in 1:length(min_r_Table)) {
      condition_1 <- x >= min_r_Table[i]
      condition_2 <- x <= max_r_Table[i]
      raw_variable[condition_1 & condition_2] <- corresponding_WOE_Table[i]
    }

    return(as.numeric(raw_variable))
  }
}
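As a quick, self-contained check of the missing-value path (hypothetical table values):

# Illustrative only - if WOE_tables$loan_amnt were the toy table
#   MIN:  -1    500   5001
#   MAX:  -1   5000  35000
#   WOE: -0.10 0.25  -0.31
# then recoding a vector with one NA would give:
recode(c(600, 20000, NA), "loan_amnt")
# [1]  0.25 -0.31 -0.10   (the NA falls into the missing-value bin's WOE)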

WOE_matrix_final is built by applying recode over the entire vector of feature names; the helper function create_WOE_matrix wraps the recoding of a single named column.

create_WOE_matrix <- function(x) {
  # Look up the raw feature column by its name and recode it to WOE values
  variable <- only_features_36[[x]]
  WOE_vector <- recode(variable, x)
  return(WOE_vector)
}

Finally, create WOE_matrix through parallel processing. Again, we export to the cluster the objects that the main function create_WOE_matrix calls. After building WOE_matrix, it is important to append the binary vector recording whether each loan turned out "Good" or "Bad". With that, we reach the goal of this section: WOE_matrix_final. Notice that the last line of code stops the cluster.

clusterExport(cluster, c("only_features_36", "create_WOE_matrix", "recode", "WOE_tables"))

WOE_matrix <- parSapply(cluster, features_36_names_WOE, FUN = create_WOE_matrix)
WOE_matrix <- as.data.frame(WOE_matrix)
Bad_Binary <- features_36$Bad
Bad_Condition_1 <- Bad_Binary == 1
Bad_Condition_0 <- Bad_Binary == 0
Bad_Binary[Bad_Condition_1] <- "Good"
Bad_Binary[Bad_Condition_0] <- "Bad"
Bad_Binary <- as.factor(Bad_Binary)
WOE_matrix["Bad_Binary"] <- Bad_Binary
WOE_matrix_final <- WOE_matrix

stopCluster(cluster)
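An optional sanity check on the result: every feature column should now hold numeric WOE values, with only the label stored as a factor.

# Expect TRUE: all columns except the label are numeric after the transformation
all(sapply(WOE_matrix_final[-ncol(WOE_matrix_final)], is.numeric))
levels(WOE_matrix_final$Bad_Binary)  # "Bad" "Good"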

In the next section, Scorecard Building – Part IV – Training, Testing and Validating the Logistic Regression Model, I will take the transformed data set and apply various machine learning techniques to produce a preliminary scorecard.