Use R to explore a real-life data set, then preprocess it so that it is in the appropriate format before applying the credit risk models. First, I examined the data set loan_data, which is discussed in the video and used throughout the DataCamp exercises.
- Goal: understand the number and percentage of defaults.
- Learn more about variable structures and spot unexpected tendencies in the data.
- Examine the relationship between loan_status and certain factor variables.
Default information is stored in the response variable loan_status, where 1 represents a default and 0 represents a non-default.
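Because loan_status is coded 0/1, the number and percentage of defaults can be read off with base R. The sketch below uses a small made-up vector in place of loan_data$loan_status (the values are hypothetical, not taken from the course data):

```r
# Hypothetical toy values standing in for loan_data$loan_status
loan_status <- c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0)

# Number of non-defaults (0) and defaults (1)
default_counts <- table(loan_status)
print(default_counts)

# Share of each outcome, expressed as a percentage
default_pct <- 100 * prop.table(default_counts)
print(default_pct)

# With 0/1 coding, the mean of loan_status is the default rate
default_rate <- mean(loan_status)
print(default_rate)
```

On the real data, the same calls would be applied to loan_data$loan_status directly.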
For example, when comparing loan_status across grade, you would expect the proportion of defaults among customers with grade G (worst credit rating score) to be substantially higher than among customers with grade A (best credit rating score).
- EL = PD * EAD * LGD
Components of expected loss (EL): probability of default (PD), exposure at default (EAD), and loss given default (LGD).
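As a quick worked example of the formula, here is the arithmetic for a single loan. All three input values are hypothetical and chosen only to make the multiplication easy to follow:

```r
# Hypothetical inputs for one loan (illustrative values only)
pd  <- 0.10    # probability of default: 10%
ead <- 10000   # exposure at default, in currency units
lgd <- 0.40    # loss given default: 40% of the exposure is lost

# Expected loss is the product of the three components
el <- pd * ead * lgd
print(el)      # 400, in the same currency units as ead
```

PD and LGD are proportions, so EL comes out in the same units as EAD.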
# Call CrossTable() on grade and loan_status (CrossTable() comes from the gmodels package)
library(gmodels)
CrossTable(loan_data$grade, loan_data$loan_status, prop.r = TRUE,
           prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
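If the gmodels package is not installed, the row proportions that CrossTable() reports with prop.r = TRUE can be approximated with base R. This is a minimal sketch on a made-up data frame; toy_loans and its values are hypothetical stand-ins for loan_data:

```r
# Hypothetical toy data standing in for loan_data (values are made up)
toy_loans <- data.frame(
  grade       = c("A", "A", "A", "A", "B", "B", "G", "G", "G", "G"),
  loan_status = c(  0,   0,   0,   1,   0,   1,   1,   1,   0,   1)
)

# Counts of each loan_status value within each grade
counts <- table(toy_loans$grade, toy_loans$loan_status)

# Row proportions: each row sums to 1, so the column named "1"
# holds the default rate for that grade
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```

In this toy data the default rate rises from grade A to grade G by construction; on the real data, the same table is what you would inspect to check that expectation.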
- Use hist() to create a histogram with only one argument: loan_data$loan_amnt. Assign the result to a new object called hist_1.
- Use $breaks along with the object hist_1 to get more information on the histogram breaks. Knowing the location of the breaks is important because if they are poorly chosen, the histogram may be misleading.
- Change the number of breaks in hist_1 to 200 by specifying the breaks argument. Additionally, name the x-axis "Loan amount" using the xlab argument and title it "Histogram of the loan amount" using the main argument.
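The three steps above can be sketched as follows. The loan_amnt vector is simulated here so the snippet runs on its own; it is a hypothetical stand-in for loan_data$loan_amnt. The name hist_2 is also my own choice for the second histogram.

```r
# Hypothetical stand-in for loan_data$loan_amnt (simulated values)
set.seed(1)
loan_amnt <- round(runif(500, min = 500, max = 35000))

# Step 1: default histogram, stored so its components can be inspected
hist_1 <- hist(loan_amnt)

# Step 2: locations of the bin boundaries that hist() chose
print(hist_1$breaks)

# Step 3: ask for about 200 bins and label the axis and title
hist_2 <- hist(loan_amnt, breaks = 200,
               xlab = "Loan amount",
               main = "Histogram of the loan amount")
```

When breaks is a single number, hist() treats it as a suggestion and picks "pretty" break points, so the number of bins actually drawn may differ from 200.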