# Python implements the credit scoring card model based on variancethreshold and tpotregressor

2022-06-24 09:03:03

explain ： This is a practical project of machine learning （ Incidental data + Code + file + Video Explanation ）, If you need data + Code + file + Video Explanation You can get it directly at the end of the article .

### 1. Project background

Now , More and more 80,90 Buy a house or a car with a loan , For a while , The loan business provided by banks has become the trend of the times “ New favorite ”. Bank loan means that an individual or enterprise lends funds to an individual or enterprise who needs funds at a certain interest rate according to the policies of the country where the bank is located , An economic act in which a time limit for return is agreed .

In order to reduce the non-performing loan ratio , Ensure the safety of their own funds , Improve the level of risk control , Banks and other financial institutions will build a credit scoring card model to score customers according to their credit history data . According to the customer's credit score , It can estimate the possibility of customers' repayment on time , And decide whether to grant loans and the amount and interest rate of loans .

In this project, low variance method is used for feature selection , Using genetic algorithm to build credit scoring card model .

### 2. Data acquisition

The modeling data comes from the network ( Compiled by the author of this project ), The statistics of data items are as follows ：

The details of the data are as follows ( Part of the show )：

### 3. Data preprocessing

3.1 use Pandas Tool view data

Use Pandas The tool head() Method to view the first five rows of data ：

Key code ：

3.2 Missing data view

Use Pandas The tool info() Method to view data information ：

You can see from the above picture that , All in all 6 A variable , There are no missing values in the data , common 1000 Data .

Key code ：

3.3 Descriptive statistics

adopt Pandas The tool describe() Method to see the average of the data 、 Standard deviation 、 minimum value 、 quantile 、 Maximum .

The key codes are as follows ：

### 4. Exploratory data analysis

4.1 Line chart of credit score

use Matplotlib The tool plot() Methods draw a line chart ：

As can be seen from the above figure , Most people have a credit rating of 65~75 and 80~90.

4.2 Credit score histogram

use Matplotlib The tool hist() Method draw histogram ：

As you can see from the picture above , The credit score is at 80~90 The majority of people are divided , It shows that most people have good credit .

4.3 Scatter plot of data

The trend relationship between monthly income and credit score is shown through the fitting line of the scatter chart ：

As you can see from the picture above , There is no linear relationship between monthly income and credit score .

4.4 correlation analysis

As you can see from the above figure , The larger the value, the stronger the correlation , A positive value is a positive correlation 、 A negative value is a negative correlation .

### 5. Feature Engineering

5.1 Establish characteristic data and label data

The key codes are as follows ：

5.2 Data set splitting

adopt train_test_split() Method according to 80% Training set 、20% Divide the test set , The key codes are as follows ：

5.3 Low variance filtering feature selection

Use VarianceThreshold() Low variance filtering method for feature selection , The key codes are as follows ：

The result returned ：

As can be seen from the above figure , The threshold is 0.21, The variance values of all features are greater than 0.21, So there is no need to remove some features .

### 6. Build genetic algorithm regression model

Genetic algorithm combines population members iteratively based on creating initial population , Thus according to the parents “ features / Parameters ” The idea of creating children . At the end of each iteration , We do fitting tests , And the most suitable individuals will be taken from the original population + New populations are created . therefore , In each iteration , We will create new descendants , If offspring perform better , They can be used to replace existing individuals . This increases overall performance or at least remains the same for each iteration .

TPOT The main regressors supported are decision trees 、 Ensemble tree 、 Linear model 、xgboost.

The main use of TPOTRegressor Algorithm , For target regression .

6.1 Model parameters

### 7. Model to evaluate

7.1 Evaluation indicators and results

The evaluation index mainly includes the interpretable variance value 、 Mean absolute error 、 Mean square error 、R Square value, etc .

As can be seen from the table above ,R Party for 73.07%  The interpretable variance is 73.33%,GBDT The regression model works well , If you want to achieve better results , You can adjust the parameters ,generations Adjusted for 100,population_size Adjusted for 1000, But it will take a long time .

The key codes are as follows ：

7.2 Comparison between real value and predicted value

It can be seen from the above figure that the fluctuations of the real value and the predicted value are basically the same , The fitting effect of the model is good .

### 8. Conclusion and Prospect

in summary , This paper adopts genetic algorithm regression model , Finally, it is proved that the model we proposed is effective . This model can be used for daily credit scoring .

The materials needed for the actual combat of this machine learning project , The project resources are as follows ：

Project description ：