Credit Risk Detection in Peer-to-Peer Lending Using CatBoost

Peer-to-peer (P2P) lending has gained popularity among private borrowers, small businesses, and MSMEs because it provides direct access to loans without the strict requirements imposed by traditional banks and financial institutions. However, P2P lending faces a significant challenge in terms of credit risk, which results in a high rate of loan repayment failures. To address this issue, this study develops a credit risk detection system on a loan dataset obtained from the Bondora company using CatBoost (Categorical Boosting), a gradient boosting algorithm. The performance of the CatBoost algorithm was evaluated using ROC (Receiver Operating Characteristic) curves and AUC (Area Under the Curve). Three scenarios were considered, and the results revealed that scenario 2, with a data splitting ratio of 90:10, achieved the best outcome with an AUC value of 0.80329. This outperformed scenario 1, with a data splitting ratio of 80:20 and an AUC value of 0.789583, as well as scenario 3, with a data splitting ratio of 70:30 and an AUC value of 0.781066.


Introduction
P2P lending has become increasingly popular among individuals, small businesses, and MSMEs because it offers more flexible requirements and less strict criteria than traditional banks and financial institutions [1]. However, P2P lending combines low interest rates with a high rate of borrowers failing to repay their loans, a problem known as credit risk [1]-[3]. Credit risk is a major problem in P2P lending because borrowers who fail to make loan payments cause the lender to suffer losses [2]. To minimize credit risk, a system is needed that can assist in assessing it.
To overcome this challenge, a survey paper by M. Bazarbash [4] suggests utilizing machine learning to assess credit risk in P2P lending. Machine learning involves the application of sophisticated algorithms that can be executed by machines to identify patterns in data, with the primary aim of making predictions. Machine learning models are specifically designed to analyze extensive information from diverse data sources, enabling them to detect patterns that conventional econometric models may overlook. By capturing the complex and nonlinear relationships between risk factors and credit risk outcomes, machine learning techniques offer a promising solution to address credit risk in P2P lending. They have the capability to mitigate the issue of information asymmetry, which is at the core of the challenges in P2P lending. Consequently, the detection of credit risk in P2P lending can play a crucial role in minimizing loan repayment failures.
Many studies on credit risk prediction using machine learning approaches have been conducted. In 2018, J. Mezei [5] conducted a study titled "Credit risk evaluation in peer-to-peer lending with linguistic data transformation and supervised learning". The study employed fuzzy sets-based linguistic data transformation combined with a neural network method using the Bondora dataset spanning from February 2, 2009, to June 17, 2017. The study achieved an AUC value of 0.855. In 2019, Linnea Machado [6] conducted a study titled "Credit risk modeling and prediction: Logistic regression versus machine learning boosting algorithms". The research utilized two datasets and employed various algorithms including CatBoost, Logistic Regression, and XGBoost. The findings indicated that the CatBoost algorithm outperformed the others. On dataset 1, CatBoost achieved an AUC value of 0.863 and an accuracy of 0.836 on the training data, while on the test data it achieved an AUC value of 0.731 and an accuracy of 0.817. On dataset 2, CatBoost achieved an AUC value of 1 and an accuracy of 0.999 on the training data, and an AUC value of 0.925 and an accuracy of 0.946 on the test data.
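In 2022, Nghia Nguyen [7] conducted a study titled "A Proposed Model for Card Fraud Detection Based on CatBoost and Deep Neural Network." This study presented two models: a neural network model and a CatBoost model. The experiment utilized a dataset from IEEE CIS, which consisted of real-world transactions provided by Vesta Corporation. The results demonstrated that the CatBoost model outperformed the neural network model, achieving an AUC value of 0.974, whereas the neural network model achieved an AUC value of 0.84. During the same year, Xingyun Li [8] conducted a study titled "Prediction of Loan Default Based on Multi-Model Fusion." This research introduced a novel fusion model that combined Logistic Regression, Random Forest, and CatBoost techniques to predict loan defaults. The results indicated that the multi-model fusion slightly outperformed the CatBoost model: the CatBoost model achieved an AUC value of 0.992, whereas the multi-model fusion proposed by the study achieved an AUC value of 0.994.
Based on insights from prior research, the present study adopts the CatBoost model due to its strong performance. The evaluation metrics employed are the ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve), as these were utilized in previous studies to assess the effectiveness of credit risk detection. The dataset utilized in this study originates from the Bondora company. The primary objective of this research is to identify credit risk in the context of P2P lending.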

Research Methods
The objective of this study is to develop a credit risk assessment system for P2P lending that harnesses the CatBoost algorithm. The whole process is presented in Figure 1. Based on the system design in Figure 1, the first step involves preprocessing the P2P lending dataset. Subsequently, the dataset is divided into two subsets, training data and test data, which are used for the training and testing stages, respectively. Following the training phase, a validation step employing three-fold cross-validation is conducted to determine the optimal AUC result. The model yielding the best AUC result is then utilized in the testing phase, where its performance is evaluated using ROC curves and AUC.

Credit Risk in P2P Lending
P2P lending is an internet-based loan service that establishes a direct connection between borrowers and lenders [9]. The emergence of P2P lending has eliminated the necessity for traditional banks to serve as intermediaries for borrowers [10]. Small businesses previously encountered challenges in accessing loans from conventional banks due to factors such as elevated default rates, limited data accessibility, and the perception that small business loans were not lucrative [10]. The accessibility and low interest rates offered by P2P lending increase the credit risk, as the large number of borrowers raises concerns about loan repayment defaults [11], [12]. Furthermore, the inadequate implementation of risk control measures by P2P lending platforms has exacerbated the rise in credit risk [13].
Credit risk poses a significant challenge in the realm of P2P lending as borrowers frequently default on loan payments, leading to financial losses for lenders [14]-[16]. The problem of credit risk is compounded by the information imbalance between borrowers and lenders, particularly in the context of P2P lending, where unsecured loans are commonly involved [17]. Consequently, there is a pressing need for an effective and precise credit risk assessment method that leverages machine learning techniques. Such an approach would serve to mitigate credit risk and ensure the sustainable growth of the P2P lending sector [1].

Bondora Dataset (Peer-to-Peer Lending Dataset)
The dataset utilized is a publicly available loan dataset from the Bondora company, which is updated daily [5]. The loan dataset contains transaction records from February 28, 2009, to November 14, 2022, with amounts in euros (EUR). It comprises 263,213 records and 112 attribute columns, from which the columns relevant to predicting loan repayment failure are selected. The selected attribute columns in Table 2 are based on an analysis of the variables that will be used to identify loan defaults.
The dataset contains both categorical and numerical data. The categorical data is located in the New Credit Customer, Country, Status, Marital Status, Use of Loan, and Employment Status columns, while the numerical data is in the Age, Applied Amount, Interest, Loan Duration, and Income Total columns (see Table 1).

Pre-processing
In the first stage, pre-processing is conducted before the data is used to build the model. The preprocessing steps are outlined in Figure 2. Initially, given the substantial amount of data, duplicate entries and records with missing values are removed, a widely employed technique that helps model optimization at a later stage. Subsequently, the analysis focuses specifically on data labeled as "repaid" and "late", and data labeled as "current" is excluded. Next, data normalization is applied: categorical data is encoded into numerical values, while numerical data is scaled using a min-max scaler. Finally, the dataset is split into training and test sets according to a specified ratio.
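The preprocessing steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the toy records, the field names (`Status`, `Country`, `Age`, `AppliedAmount`), and the alphabetical category codes are all hypothetical, and only two of the numerical columns are shown.

```python
from typing import Dict, List

# Hypothetical toy records; the field names only mirror attributes named in the text.
RAW: List[Dict[str, object]] = [
    {"Status": "Repaid", "Country": "EE", "Age": 30, "AppliedAmount": 1000.0},
    {"Status": "Late", "Country": "FI", "Age": 45, "AppliedAmount": 3000.0},
    {"Status": "Current", "Country": "EE", "Age": 52, "AppliedAmount": 500.0},
    {"Status": "Late", "Country": "FI", "Age": 45, "AppliedAmount": 3000.0},      # duplicate
    {"Status": "Repaid", "Country": "ES", "Age": None, "AppliedAmount": 2000.0},  # missing value
    {"Status": "Repaid", "Country": "ES", "Age": 60, "AppliedAmount": 2000.0},
]


def preprocess(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    # Step 1: drop rows with missing values and exact duplicates (first copy is kept).
    seen = set()
    clean = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        clean.append(dict(row))

    # Step 2: keep only finished loans ("Repaid" or "Late"); "Current" is excluded.
    clean = [row for row in clean if row["Status"] in ("Repaid", "Late")]
    if not clean:
        return clean

    # Step 3: encode the target (Repaid -> 0, Late -> 1) and a categorical column.
    # The alphabetical integer coding is an assumption made for this sketch.
    codes = {c: i for i, c in enumerate(sorted({row["Country"] for row in clean}))}
    for row in clean:
        row["Status"] = 0 if row["Status"] == "Repaid" else 1
        row["Country"] = codes[row["Country"]]

    # Step 4: min-max scale the numerical columns to the range [0, 1].
    for col in ("Age", "AppliedAmount"):
        low = min(row[col] for row in clean)
        span = max(row[col] for row in clean) - low
        for row in clean:
            row[col] = (row[col] - low) / span if span else 0.0
    return clean


processed = preprocess(RAW)
```

In this sketch the scaler statistics are computed on all remaining rows before any split, which follows the order of steps described in the text; fitting the scaler on the training portion only is a common alternative when leakage into the test set is a concern.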

CatBoost
CatBoost is an implementation of gradient boosting that uses binary decision trees as its base predictors [18]. Gradient boosting is a robust machine learning technique that effectively addresses the challenges arising from diverse features, noisy data, and complex relationships [19]. Notably, the CatBoost algorithm excels not only in handling numeric features but also in managing categorical features [20]. It employs an oblivious decision tree structure, in which all non-terminal nodes at the same level of the tree use the same splitting criterion. This ensures that the length of the path from the root node to each leaf is equal to the depth of the tree [21]. An illustration of CatBoost can be seen in Figure 3. CatBoost employs ordered target statistics to handle categorical features [18]. This approach replaces each category with an estimate of the expected target value for that category, computed from the output values of the observations that share it [6].
The average of the output $y$ is denoted by $\hat{x}^i_k$, where $x^i_k$ is the value of the i-th categorical variable in the k-th training observation of the given dataset. It is given by the following equation [6]:

$$\hat{x}^i_k = \frac{\sum_{j=1}^{n} \mathbb{1}\{x^i_j = x^i_k\} \cdot y_j + a \cdot P}{\sum_{j=1}^{n} \mathbb{1}\{x^i_j = x^i_k\} + a}$$

Here $P$ is the prior, i.e. the prior probability of the default class, and the parameter $a > 0$ sets the weight of the prior [18]. In the ordered variant, the sums run only over the observations that precede the k-th observation in a random permutation of the training data [18].
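To make the ordered target statistic concrete, the following is a minimal pure-Python sketch, not CatBoost's internal implementation. It assumes the smoothed form (sum of targets for the category + a * P) / (count for the category + a), evaluated for each observation using only the observations that come earlier in one random permutation. The function name, the seed, and the default `a=1.0` are hypothetical choices for illustration.

```python
import random
from typing import Hashable, List, Sequence


def ordered_target_statistics(
    categories: Sequence[Hashable],
    targets: Sequence[float],
    prior: float,
    a: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Encode one categorical column with ordered target statistics.

    Each observation is encoded from the targets of observations with the
    same category that appear BEFORE it in a random permutation, so its own
    target never leaks into its encoding.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # the random permutation

    running_sum = {}    # sum of targets seen so far, per category
    running_count = {}  # number of observations seen so far, per category
    encoded = [0.0] * n
    for k in order:
        cat = categories[k]
        s = running_sum.get(cat, 0.0)
        m = running_count.get(cat, 0)
        encoded[k] = (s + a * prior) / (m + a)  # smoothed average with prior P
        running_sum[cat] = s + targets[k]
        running_count[cat] = m + 1
    return encoded


# Toy usage: "Late" = 1, "Repaid" = 0; the prior is the overall late rate.
countries = ["EE", "FI", "EE", "ES", "FI", "EE"]
late = [1, 0, 1, 0, 1, 0]
encoded_country = ordered_target_statistics(countries, late, prior=sum(late) / len(late))
```

The first observation of each category in the permutation has no history, so it is encoded with the prior alone; later observations move toward the category's observed average as more history accumulates.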

Training
Once the data has been pre-processed, the training phase for the CatBoost model commences. The process begins by determining the value of N, the number of trees trained in the CatBoost model. Subsequently, a random permutation is performed, followed by the creation of a new tree that predicts the permutations based on the trained model. The permutation estimates for each model are recalculated, and the model with the best permutation is used for the testing stage. The training process continues until N iterations have been completed. A visual representation of the CatBoost training stages can be found in Figure 4.

Testing
Once the CatBoost training is completed, the model proceeds to the testing phase. The steps involved in CatBoost testing are illustrated in Figure 5. During the model testing phase, the parameters obtained from the best tree structure during training are employed by the CatBoost model. Predictions are made by traversing the trees down to their leaves. The prediction outcomes are then evaluated using AUC and ROC curves to assess the effectiveness of CatBoost in credit risk assessment for P2P lending.
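As a rough illustration of how a trained ensemble of symmetric trees (trees whose nodes at the same level all share one split) turns a feature vector into a prediction, consider the sketch below. It is a simplified model, not CatBoost's code: the tree structure, thresholds, and leaf values are hypothetical, and the final probability is assumed to be the logistic function of the summed leaf values.

```python
import math
from typing import List, Sequence, Tuple

Split = Tuple[int, float]          # (feature index, threshold) shared by a whole level
Tree = Tuple[List[Split], List[float]]  # (one split per level, 2**depth leaf values)


def leaf_index(x: Sequence[float], splits: List[Split]) -> int:
    """Each level contributes one bit, so the path length equals the tree depth."""
    index = 0
    for feature, threshold in splits:
        bit = 1 if x[feature] > threshold else 0
        index = (index << 1) | bit
    return index


def predict_proba(x: Sequence[float], trees: List[Tree]) -> float:
    """Sum the selected leaf value of every tree, then squash with a sigmoid."""
    raw = sum(leaves[leaf_index(x, splits)] for splits, leaves in trees)
    return 1.0 / (1.0 + math.exp(-raw))


# Hypothetical two-tree ensemble over two scaled features.
TREES: List[Tree] = [
    ([(0, 0.5), (1, 0.3)], [-1.0, -0.5, 0.5, 1.0]),  # depth 2, 4 leaves
    ([(1, 0.05)], [0.2, -0.2]),                      # depth 1, 2 leaves
]

probability_late = predict_proba([0.7, 0.1], TREES)
```

Because every level uses a single comparison, locating the leaf takes exactly as many comparisons as the tree has levels, regardless of the input.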

Model Evaluation
The ROC curve is a visual representation utilized to evaluate a classifier's performance [6], [22]. In the ROC curve, the Sensitivity or True Positive Rate (TPR), which represents the proportion of actual positive instances that are correctly classified as positive, is depicted on the Y-axis [22]. The False Positive Rate (FPR), which represents the proportion of actual negative instances that are incorrectly classified as positive, is depicted on the X-axis. The ROC curve is utilized to visually represent the CatBoost model's capacity to classify credit risk by relating these two rates across classification thresholds.
An important index of the ROC curve is the AUC, which is the area between the ROC curve and the abscissa [22]. The AUC (Area Under the Curve) ranges from 0 to 1, and a higher value indicates better credit risk classification [22].
TPR and FPR are calculated as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

In these equations, TP (True Positive) denotes the count of positive instances that are accurately classified, FN (False Negative) denotes the count of positive instances that are mistakenly classified as negative, TN (True Negative) denotes the count of negative instances that are accurately classified, and FP (False Positive) denotes the count of negative instances that are mistakenly classified as positive.
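The relationship between TPR, FPR, the ROC curve, and AUC can be illustrated with a short pure-Python sketch. The labels and scores are hypothetical toy values (1 stands for the positive class, here "Late"), and the area is computed with the trapezoidal rule, which is one common way to obtain AUC from ROC points.

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the threshold score by score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1  # a positive predicted as positive
            else:
                fp += 1  # a negative predicted as positive
            i += 1
        points.append((fp / negatives, tp / positives))  # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical toy data: 3 late loans (1) and 3 repaid loans (0).
labels = [1, 0, 1, 1, 0, 0]
predicted = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, predicted)
area = auc(curve)  # 7/9: seven of the nine (late, repaid) pairs are ranked correctly
```

The toy result also shows the ranking interpretation of AUC: it equals the fraction of positive-negative pairs in which the positive instance receives the higher score.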

Results and Discussions
To fulfill the study's objectives, three scenarios were executed to assess: (1) the effectiveness of the CatBoost algorithm's hyperparameters in credit risk classification; (2) a comparison of different training and testing data ratios; and (3) the system's performance as measured by the AUC and ROC metrics.

Preprocessed Dataset
After the pre-processing stage, the dataset is reduced to 171,257 instances, consisting of 94,258 rows with Repaid status and 76,999 rows with Late status. The data distribution can be seen in Figure 6. The status values are encoded, where "Repaid" is represented as "0" and "Late" is represented as "1". In the process of normalizing categorical attributes, "New Credit Customer" and "Country" are manually encoded, while the "Marital Status", "Use of Loan", and "Employment Status" attributes are encoded using the sklearn library. The dataset is divided according to three scenarios: 70:30, 80:20, and 90:10. These scenarios were chosen to assess the CatBoost model's ability to identify patterns in the training data and produce good results on the test data. The preprocessed data is presented in Table 3.
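A minimal sketch of how the three split scenarios could be produced is shown below. It assumes a simple shuffled, non-stratified split with a fixed seed; the text does not state how the split was performed, so the function, the seed, and the rounding rule are assumptions, and the exact subset sizes used in the study may differ by a row.

```python
import random
from typing import Dict, List, Tuple

N_ROWS = 171_257  # number of instances after pre-processing, as reported in the text

# Training fraction per scenario, following the ratios given in the text.
SCENARIOS: Dict[str, float] = {
    "Scenario 1 (80:20)": 0.80,
    "Scenario 2 (90:10)": 0.90,
    "Scenario 3 (70:30)": 0.70,
}


def split_indices(n_rows: int, train_fraction: float, seed: int = 1) -> Tuple[List[int], List[int]]:
    """Shuffle row indices once and cut them into a training and a test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seed is a hypothetical choice
    n_train = int(round(n_rows * train_fraction))
    return indices[:n_train], indices[n_train:]


split_sizes = {}
for name, fraction in SCENARIOS.items():
    train_idx, test_idx = split_indices(N_ROWS, fraction)
    split_sizes[name] = (len(train_idx), len(test_idx))
```

Under these assumptions the 90:10 scenario trains on the most rows but is evaluated on the smallest test set, which is worth keeping in mind when comparing AUC values across scenarios.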

Result
The experiments in this study were conducted to obtain the best parameters using the grid search method, as shown in Table 4. The grid search is used to find the optimal values of the depth and learning rate parameters. The iterations parameter is manually set to 500 for each scenario. Additional parameters, such as random_seed and cat_features, are included to enhance the performance of CatBoost. In this study, the random_seed parameter is set to 1 for each scenario, while the cat_features parameter is determined based on the total number of attribute columns in the dataset, which is 10. Since the study uses the CPU as the computing unit, the task_type parameter is set to CPU. Furthermore, since the evaluation focuses on the AUC value, the eval_metric parameter is set to AUC.
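The parameter setup described above might look roughly like the following sketch. The fixed parameters come from the text; the candidate values for `depth` and `learning_rate` are hypothetical placeholders, since the actual grid is not reproduced here. `run_grid_search` assumes the `catboost` package is installed and that `cat_features` holds the names or indices of the categorical columns; it is a hedged outline rather than the study's code.

```python
from itertools import product
from typing import Dict, List

# Parameters fixed in every scenario, as stated in the text.
FIXED_PARAMS: Dict[str, object] = {
    "iterations": 500,
    "random_seed": 1,
    "task_type": "CPU",
    "eval_metric": "AUC",
}

# Hypothetical search space; the values actually searched are not given here.
PARAM_GRID: Dict[str, List[float]] = {
    "depth": [4, 6, 8],
    "learning_rate": [0.03, 0.1],
}


def candidate_params(grid=PARAM_GRID, fixed=FIXED_PARAMS) -> List[Dict[str, object]]:
    """Enumerate every parameter combination a grid search would evaluate."""
    keys = sorted(grid)
    return [
        dict(fixed, **dict(zip(keys, values)))
        for values in product(*(grid[key] for key in keys))
    ]


def run_grid_search(X_train, y_train, cat_features):
    """Search depth and learning rate with three-fold cross-validation."""
    from catboost import CatBoostClassifier  # imported lazily; requires catboost

    model = CatBoostClassifier(cat_features=cat_features, verbose=False, **FIXED_PARAMS)
    # Returns a dict that includes the best parameters and the CV results.
    return model.grid_search(PARAM_GRID, X=X_train, y=y_train, cv=3, verbose=False)


candidates = candidate_params()
```

With this placeholder grid, six combinations are evaluated per scenario, each trained for 500 iterations, so the search cost grows multiplicatively with every value added to either list.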
The ROC curves for scenarios 1, 2, and 3 are displayed in Figures 7, 8, and 9, which show AUC values of 0.789583, 0.80329, and 0.781066, respectively, and provide a visual representation of the experimental results. The AUC values of the three scenarios are summarized in Table 5. The results from all three scenarios can be considered favorable, as the AUC values of roughly 0.78 to 0.80 lie well above the 0.5 level of a random classifier and toward the ideal value of 1. This indicates that the model was effective in identifying credit risk. Despite these satisfactory results, the outcomes obtained by this model still fall short of optimal. This limitation can be attributed to the large number of data records and features, which hinder the optimal performance of the model. The model's performance could potentially be enhanced by using a feature importance method in the selection of features. Nevertheless, based on the obtained results, it can be concluded that the CatBoost model performs effectively in identifying credit risk.

Conclusion
This study succeeded in identifying credit risk in P2P lending, as shown by the experimental results obtained on 171,257 rows of Bondora loan data. Three scenarios were conducted with data splits of 80:20, 90:10, and 70:30, yielding AUC values of 0.789583, 0.80329, and 0.781066, respectively. Scenario 2 exhibited superior AUC results compared to scenarios 1 and 3. However, the obtained results fell short of optimal outcomes, primarily due to the large volume of data records and features, as well as the fact that a feature importance method was not employed. Consequently, the selection of features used in the model remained suboptimal. Overall, this research concludes that the CatBoost model performs well in detecting credit risk. It is the authors' hope that these findings contribute to minimizing credit risk in P2P lending. Future research is recommended to improve on the feature selection approach used in this study. Alternatively, employing the Principal Component Analysis (PCA) method can be considered.

Figure 4. CatBoost Training Flowchart

Figure 5. CatBoost Testing Flowchart

Figure 7. The result of the ROC curve for Scenario 1 (80:20)
Figure 8.
Figure 9.

Table 3. Sample Data Preprocessed

Table 5. Result of AUC from 3-Scenario