Use R to explore a real-life data set, then preprocess the data set so that it is in the appropriate format before applying the credit risk models. First, I examined the loan_data data set, which is discussed in the video and used throughout the DataCamp exercises.

  • Goal: understand the number and percentage of defaults.
  • Learn more about the variable structures and spot unexpected tendencies in the data.
  • Examine the relationship between loan_status and certain factor variables.

Default information is stored in the response variable loan_status, where 1 represents a default and 0 represents a non-default.
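A quick way to get the number and percentage of defaults is to tabulate loan_status. The sketch below uses a tiny made-up data frame as a stand-in for loan_data (the values are purely illustrative, not the course data), and the names default_counts and default_pct are my own.

```r
# Toy stand-in for loan_data (made-up values, for illustration only)
loan_data <- data.frame(loan_status = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0))

# Number of non-defaults (0) and defaults (1)
default_counts <- table(loan_data$loan_status)

# Share of each outcome, expressed as a percentage
default_pct <- 100 * prop.table(default_counts)

print(default_counts)
print(default_pct)
```

On the real loan_data, only the two table calls are needed; the data frame construction is there just to make the snippet run on its own.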

For example, you would expect that the proportion of defaults in the group of customers with grade G (worst credit rating score) is substantially higher than the proportion of defaults in the grade A group (best credit rating score).

  • EL = PD × EAD × LGD

Components of expected loss (EL): probability of default (PD), exposure at default (EAD), and loss given default (LGD).
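To make the expected loss formula concrete, here is a small worked example in R. All three inputs are hypothetical numbers chosen for illustration; they do not come from loan_data.

```r
# Hypothetical loan: all three inputs are made-up numbers
pd  <- 0.05    # probability of default (5%)
ead <- 10000   # exposure at default, in currency units
lgd <- 0.40    # loss given default (40% of the exposure is lost)

# Expected loss is the product of the three components
expected_loss <- pd * ead * lgd
print(expected_loss)
```

With these inputs the expected loss is 200 currency units: 5% of 10,000 is 500, and 40% of that is lost on default.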

Cross table of grade and loan_status:

# Call CrossTable() on grade and loan_status (CrossTable() comes from the gmodels package)
library(gmodels)
CrossTable(loan_data$grade, loan_data$loan_status, prop.r = TRUE,
           prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
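If gmodels is not installed, the same row proportions can be approximated in base R with table() and prop.table(). This is only a sketch on a made-up stand-in for loan_data (two grades, eight loans, invented values), and the names counts and row_props are my own.

```r
# Toy stand-in for loan_data (made-up values, for illustration only)
loan_data <- data.frame(
  grade       = factor(c("A", "A", "A", "A", "G", "G", "G", "G")),
  loan_status = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Counts of each loan_status value within each grade
counts <- table(loan_data$grade, loan_data$loan_status)

# Row proportions: each row sums to 1, comparable to prop.r = TRUE in CrossTable()
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```

The column for loan_status 1 is the default rate per grade, which is the number to compare across grades.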



  • Use hist() to create a histogram with only one argument: loan_data$loan_amnt. Assign the result to a new object called hist_1.
  • Use $breaks along with the object hist_1 to get more information on the histogram breaks. Knowing the location of the breaks is important because if they are poorly chosen, the histogram may be misleading.
  • Change the number of breaks in hist_1 to 200 by specifying the breaks argument. Additionally, label the x-axis "Loan amount" using the xlab argument and title the plot "Histogram of the loan amount" using the main argument.
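The three histogram steps above can be sketched as follows. The loan amounts here are simulated as a stand-in for loan_data$loan_amnt (an assumption made so the snippet runs on its own), and hist_2 is simply my name for the second histogram object.

```r
# Toy stand-in for loan_data$loan_amnt (simulated values, for illustration only)
set.seed(1)
loan_data <- data.frame(loan_amnt = round(runif(500, min = 500, max = 35000)))

# Default histogram; hist() returns an object describing the bins
hist_1 <- hist(loan_data$loan_amnt)

# Locations of the bin boundaries that hist() chose
print(hist_1$breaks)

# Finer bins, plus an x-axis label and a title
hist_2 <- hist(loan_data$loan_amnt, breaks = 200,
               xlab = "Loan amount",
               main = "Histogram of the loan amount")
```

Note that a single number passed to breaks is treated as a suggestion: hist() rounds the boundaries to "pretty" values, so the actual number of bins may not be exactly 200. Checking hist_2$breaks shows what was used.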