Learning to Build a Credit Scorecard

2024-08-20 22:08
Tags: learning, building, scoring


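These are my notes from learning how to build a scorecard by reproducing an existing experiment. Reference article: 基于Python的信用评分卡模型分析 (Credit scorecard model analysis based on Python), parts 1 and 2.

It is only a reproduction, but I am still pretty excited~
Jupyter Notebook runs code cell by cell, which makes it very convenient for studying and taking notes.

I renamed the dataset's attributes to Chinese so they are easier to read, and the code prints many intermediate results so that I can follow the whole scorecard-building process. The hard parts are binning and the score calculation (I am still working through the formulas).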

Data processing and analysis

import pandas as pd
import matplotlib.pyplot as plt  # plotting library
import matplotlib
import seaborn as sns
from sklearn.metrics import roc_curve,auc
import statsmodels.api as sm
data = pd.read_csv('dataSet/cs-training.csv')
data.describe().to_csv('dataSet/cs-trainingDes.csv')
data.head()
   SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age  \
0                 1                              0.766127   45   
1                 0                              0.957151   40   
2                 0                              0.658180   38   
3                 0                              0.233810   30   
4                 0                              0.907239   49   

   NumberOfTime30-59DaysPastDueNotWorse  DebtRatio  MonthlyIncome  \
0                                     2   0.802982         9120.0   
1                                     0   0.121876         2600.0   
2                                     1   0.085113         3042.0   
3                                     0   0.036050         3300.0   
4                                     1   0.024926        63588.0   

   NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  \
0                               13                        0   
1                                4                        0   
2                                2                        1   
3                                5                        0   
4                                7                        0   

   NumberRealEstateLoansOrLines  NumberOfTime60-89DaysPastDueNotWorse  \
0                             6                                     0   
1                             0                                     0   
2                             0                                     0   
3                             0                                     0   
4                             1                                     0   

   NumberOfDependents  
0                 2.0  
1                 1.0  
2                 0.0  
3                 0.0  
4                 0.0  
# Inspect the descriptive statistics of data
dataDes = pd.read_csv('dataSet/cs-trainingDes.csv')
dataDes
  Unnamed: 0  SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines            age  \
0      count     150000.000000                         150000.000000  150000.000000   
1       mean          0.066840                              6.048438      52.295207   
2        std          0.249746                            249.755371      14.771866   
3        min          0.000000                              0.000000       0.000000   
4        25%          0.000000                              0.029867      41.000000   
5        50%          0.000000                              0.154181      52.000000   
6        75%          0.000000                              0.559046      63.000000   
7        max          1.000000                          50708.000000     109.000000   

   NumberOfTime30-59DaysPastDueNotWorse      DebtRatio  MonthlyIncome  \
0                         150000.000000  150000.000000   1.202690e+05   
1                              0.421033     353.005076   6.670221e+03   
2                              4.192781    2037.818523   1.438467e+04   
3                              0.000000       0.000000   0.000000e+00   
4                              0.000000       0.175074   3.400000e+03   
5                              0.000000       0.366508   5.400000e+03   
6                              0.000000       0.868254   8.249000e+03   
7                             98.000000  329664.000000   3.008750e+06   

   NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  \
0                    150000.000000            150000.000000   
1                         8.452760                 0.265973   
2                         5.145951                 4.169304   
3                         0.000000                 0.000000   
4                         5.000000                 0.000000   
5                         8.000000                 0.000000   
6                        11.000000                 0.000000   
7                        58.000000                98.000000   

   NumberRealEstateLoansOrLines  NumberOfTime60-89DaysPastDueNotWorse  NumberOfDependents  
0                 150000.000000                         150000.000000       146076.000000  
1                      1.018240                              0.240387            0.757222  
2                      1.129771                              4.155179            1.115086  
3                      0.000000                              0.000000            0.000000  
4                      0.000000                              0.000000            0.000000  
5                      1.000000                              0.000000            0.000000  
6                      2.000000                              0.000000            1.000000  
7                     54.000000                             98.000000           20.000000  

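Rename the columns of data

# The Chinese names are used as column identifiers in the rest of the notes;
# the comment on each line gives the literal meaning of the new name.
data.rename(columns={
    'SeriousDlqin2yrs': '是否逾期',                              # delinquent or not
    'RevolvingUtilizationOfUnsecuredLines': '信用额度',          # credit line
    'NumberOfTime30-59DaysPastDueNotWorse': '逾期30到60天次数',  # times 30 to 60 days past due
    'DebtRatio': '债务占收入比',                                 # debt-to-income ratio
    'NumberOfOpenCreditLinesAndLoans': '未偿还贷款',             # outstanding loans
    'NumberOfTimes90DaysLate': '逾期90天次数',                   # times 90 days past due
    'NumberRealEstateLoansOrLines': '抵押财产',                  # mortgaged property
    'NumberOfTime60-89DaysPastDueNotWorse': '逾期60到89天次数',  # times 60 to 89 days past due
    'NumberOfDependents': '家庭人数',                            # family size
}, inplace=True)
data.head()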
   是否逾期      信用额度  age  逾期30到60天次数    债务占收入比  MonthlyIncome  未偿还贷款  逾期90天次数  \
0     1  0.766127   45           2  0.802982         9120.0     13        0   
1     0  0.957151   40           0  0.121876         2600.0      4        0   
2     0  0.658180   38           1  0.085113         3042.0      2        1   
3     0  0.233810   30           0  0.036050         3300.0      5        0   
4     0  0.907239   49           1  0.024926        63588.0      7        0   

   抵押财产  逾期60到89天次数  家庭人数  
0     6           0   2.0  
1     0           0   1.0  
2     0           0   0.0  
3     0           0   0.0  
4     1           0   0.0  

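In the describe() table above, the count for MonthlyIncome and NumberOfDependents is not 150000, so both columns contain missing values.

The missing MonthlyIncome values are filled with predictions from a random forest. MonthlyIncome is first moved to the first column to serve as the label y, and the remaining columns serve as features. The random forest is fitted on the rows where MonthlyIncome is known and then predicts it for the rows where it is missing. The predictions are written back into the original data.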

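from sklearn.ensemble import RandomForestRegressor
# Fill missing values with random forest predictions
def set_missing(df):
    # Take the numeric features and reorder the columns so that MonthlyIncome
    # (column 5) comes first as the label y; the other columns are the features.
    # .ix and .as_matrix() were removed from pandas, so .iloc and .values are used.
    process_df = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    # Split into rows where MonthlyIncome is known and rows where it is missing
    known = process_df[process_df.MonthlyIncome.notnull()].values
    unknown = process_df[process_df.MonthlyIncome.isnull()].values
    # X holds the feature values
    X = known[:, 1:]
    # y holds the label values
    y = known[:, 0]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the missing values with the fitted model
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    # Fill the predictions into the original missing entries
    df.loc[df.MonthlyIncome.isnull(), 'MonthlyIncome'] = predicted
    return df
data = set_missing(data)  # random forest fill for the column with many missing values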
[8311. 1159. 8311. ... 1159. 2554. 2554.]
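data = data.dropna()  # drop the rows of the column with few missing values
data = data.drop_duplicates()  # drop duplicate rows
data.to_csv('MissingData.csv', index=False)
# Dropping rows leaves gaps in the row index, so the file is read again to get a fresh index
data = pd.read_csv('MissingData.csv')
#print(data)
data = data[data['age'] > 0]  # remove the outliers whose age equals 0
# .ix was removed from pandas; columns 1 and 2 are 信用额度 and age
data.loc[:100, ['信用额度', 'age']].boxplot()  # plot.box() also works
print(data.head())
plt.show()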
   是否逾期      信用额度  age  逾期30到60天次数    债务占收入比  MonthlyIncome  未偿还贷款  逾期90天次数  \
0     1  0.766127   45           2  0.802982         9120.0     13        0   
1     0  0.957151   40           0  0.121876         2600.0      4        0   
2     0  0.658180   38           1  0.085113         3042.0      2        1   
3     0  0.233810   30           0  0.036050         3300.0      5        0   
4     0  0.907239   49           1  0.024926        63588.0      7        0   

   抵押财产  逾期60到89天次数  家庭人数  
0     6           0   2.0  
1     0           0   1.0  
2     0           0   0.0  
3     0           0   0.0  
4     1           0   0.0  
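A box plot is a convenient way to inspect the value range of a variable. Here two variables are drawn in the same figure, and the range of one of them on the Y axis squeezes the other one's box so much that it cannot be seen. Plotting each variable on its own avoids the problem.

[Figure: box plots of 信用额度 and age]

Remove the outliers of the variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse.
In the dataset good customers are labelled 0 and defaulting customers 1. It is more natural for a customer who performs normally and pays interest to be 1, so we invert the label.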

# Remove outliers
data = data[data['逾期30到60天次数'] < 90]
# Invert the variable SeriousDlqin2yrs
data['是否逾期'] = 1 - data['是否逾期']
# sklearn.cross_validation was removed; train_test_split lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
Y = data['是否逾期']
X = data.iloc[:, 1:]
# The test set takes 30% of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# print(Y_train)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
clasTest = test.groupby('是否逾期')['是否逾期'].count()
train.to_csv('TrainData.csv', index=False)
test.to_csv('TestData.csv', index=False)

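Binning is a name for the discretization of continuous variables. Credit scorecard development commonly uses equal-width binning, equal-frequency binning and optimal binning. In equal-width binning (equal length intervals) every interval has the same width, for example age in ten-year bands. In equal-frequency binning (equal frequency intervals) the number of bins is fixed first and each bin then holds roughly the same number of records. Optimal binning, also called supervised discretization, uses recursive partitioning to split a continuous variable into bins, based on an algorithm that uses conditional inference to search for a good grouping.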
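The difference between the first two schemes can be sketched with pandas. This is only an illustration: the ages below are made-up values, and `pd.cut` and `pd.qcut` stand in for equal-width and equal-frequency binning respectively.

```python
import pandas as pd

# Hypothetical ages, used only for illustration
ages = pd.Series([22, 25, 31, 38, 40, 47, 52, 58, 63, 79])

# Equal-width: every interval spans the same range of values
equal_width = pd.cut(ages, bins=3)

# Equal-frequency: every interval holds roughly the same number of rows
equal_freq = pd.qcut(ages, q=5)

# Count how many rows fall into each interval, in interval order
width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()
print(width_counts)
print(freq_counts)
```

With skewed data the equal-width bins end up unevenly filled, while the equal-frequency bins stay balanced, which is why the quantile-based `pd.qcut` is the starting point of the automatic binning function.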

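import numpy as np
from scipy import stats

# Automatic binning function (optimal binning): the number of quantile bins is
# reduced until the bin means of X and Y are perfectly monotonic (|Spearman r| = 1)
def mono_bin(Y, X, n=20):
    r = 0
    good = Y.sum()
    bad = Y.count() - good
    while np.abs(r) < 1:
        # Assign every value of X to its Bucket interval
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    #print(d1)
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    print(d3)    # debug output
    d3['min'] = d2.min().X
    print(d2.X)  # debug output
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y        # number of good customers in the bin
    d3['total'] = d2.count().Y    # number of customers in the bin
    d3['rate'] = d2.mean().Y      # share of good customers in the bin
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d3['IV'] = iv
    # sort_index(by=...) was removed from pandas; sort_values does the same job
    d4 = d3.sort_values(by='min')
    print("=" * 60)
    print(d4)
    # Cut points: -inf, the inner quantiles, +inf
    cut = []
    cut.append(float('-inf'))
    for i in range(1, n + 1):
        qua = X.quantile(i / (n + 1))
        cut.append(round(qua, 4))
    cut.append(float('inf'))
    woe = list(d4['woe'].round(3))
    return d4, iv, cut, woe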

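WoE analysis bins an indicator, computes the WoE value of each bin, and observes how the WoE value changes as the indicator changes. The mathematical definition of WoE is:
woe = ln(goodattribute / badattribute)
Here goodattribute is the bin's share of all good customers and badattribute is its share of all bad customers. The expression np.log((rate/(1-rate))/(good/bad)) in mono_bin is algebraically the same quantity, because rate/(1-rate) is the good-to-bad ratio inside the bin.
For the analysis, each indicator is sorted in ascending order and the WoE value of each bin is computed. For a positive indicator, the larger the indicator, the smaller the WoE value; for an inverse indicator, the larger the indicator, the larger the WoE value. The steeper the negative slope of a positive indicator's WoE, or the positive slope of an inverse indicator's WoE, the better the indicator discriminates. A WoE curve that is close to a flat line means the indicator has weak discriminating power. If a positive indicator shows a positive relationship with WoE, or an inverse indicator shows a negative relationship, the indicator contradicts its economic meaning and should be removed.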
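A minimal sketch of the WoE and IV arithmetic on three bins. The counts are made-up numbers, and the column names `good` and `bad` are chosen for this illustration only; the formulas are the ones used in mono_bin, with IV taken as the sum over bins of (goodattribute - badattribute) * woe.

```python
import numpy as np
import pandas as pd

# Hypothetical counts of good and bad customers per bin, for illustration only
bins = pd.DataFrame(
    {"good": [80, 60, 40], "bad": [5, 10, 25]},
    index=["low", "mid", "high"],
)

good_total = bins["good"].sum()
bad_total = bins["bad"].sum()

# Share of all good / all bad customers that falls into each bin
bins["goodattribute"] = bins["good"] / good_total
bins["badattribute"] = bins["bad"] / bad_total
bins["woe"] = np.log(bins["goodattribute"] / bins["badattribute"])

# The same quantity written the way mono_bin does: bin odds over overall odds
rate = bins["good"] / (bins["good"] + bins["bad"])
woe_from_rate = np.log((rate / (1 - rate)) / (good_total / bad_total))

# Information Value of the variable: one number for all bins together
iv = ((bins["goodattribute"] - bins["badattribute"]) * bins["woe"]).sum()
print(bins.round(4))
print("IV =", round(iv, 4))
```

A bin with a positive WoE holds relatively more good customers than the population as a whole, and a bin with a negative WoE holds relatively more bad ones.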

import numpy as np
import scipy.stats as stats  # scipy.stats.stats is a deprecated import path
dfx1, ivx1, cutx1, woex1 = mono_bin(data.是否逾期, data.信用额度, n=10)
#print(dfx1, ivx1,cutx1,woex1)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD6A0>
============================================================
                       min           max    sum  total      rate       woe  \
Bucket                                                                       
(-0.001, 0.0311]  0.000000      0.031125  35659  36339  0.981287  1.322345   
(0.0311, 0.158]   0.031128      0.158089  35590  36338  0.979415  1.225098   
(0.158, 0.558]    0.158100      0.558255  34499  36338  0.949392  0.294389   
(0.558, 50708.0]  0.558278  50708.000000  29900  36339  0.822807 -1.101834   

                  goodattribute  badattribute        IV  
Bucket                                                   
(-0.001, 0.0311]       0.262879      0.070060  0.989174  
(0.0311, 0.158]        0.262370      0.077066  0.989174  
(0.158, 0.558]         0.254327      0.189470  0.989174  
(0.558, 50708.0]       0.220423      0.663404  0.989174  
dfx2, ivx2,cutx2,woex2=mono_bin(data.是否逾期, data.age, n=10)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD828>
============================================================
                min  max    sum  total      rate       woe  goodattribute  \
Bucket                                                                      
(20.999, 33.0]   21   33  14471  16287  0.888500 -0.561809       0.106681   
(33.0, 40.0]     34   40  16073  17737  0.906185 -0.369403       0.118491   
(40.0, 45.0]     41   45  14683  16043  0.915228 -0.258113       0.108243   
(45.0, 49.0]     46   49  13619  14828  0.918465 -0.215647       0.100400   
(49.0, 54.0]     50   54  16516  17814  0.927136 -0.093814       0.121756   
(54.0, 59.0]     55   59  15757  16670  0.945231  0.210985       0.116161   
(59.0, 64.0]     60   64  15923  16613  0.958466  0.501509       0.117385   
(64.0, 71.0]     65   71  14194  14608  0.971659  0.897390       0.104638   
(71.0, 107.0]    72  107  14412  14754  0.976820  1.103687       0.106246   

                badattribute        IV  
Bucket                                  
(20.999, 33.0]      0.187101  0.241178  
(33.0, 40.0]        0.171440  0.241178  
(40.0, 45.0]        0.140120  0.241178  
(45.0, 49.0]        0.124562  0.241178  
(49.0, 54.0]        0.133732  0.241178  
(54.0, 59.0]        0.094066  0.241178  
(59.0, 64.0]        0.071090  0.241178  
(64.0, 71.0]        0.042654  0.241178  
(71.0, 107.0]       0.035236  0.241178  
dfx4, ivx4,cutx4,woex4 =mono_bin(data.是否逾期, data.债务占收入比, n=20)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD6A0>
============================================================
                        min            max    sum  total      rate       woe  \
Bucket                                                                         
(-0.001, 0.236]    0.000000       0.235948  45593  48452  0.940993  0.131963   
(0.236, 0.545]     0.235953       0.544862  45434  48451  0.937731  0.074679   
(0.545, 329664.0]  0.544864  329664.000000  44621  48451  0.920951 -0.181979   

                   goodattribute  badattribute        IV  
Bucket                                                    
(-0.001, 0.236]         0.336113      0.294560  0.019231  
(0.236, 0.545]          0.334940      0.310839  0.019231  
(0.545, 329664.0]       0.328947      0.394601  0.019231  
dfx5, ivx5,cutx5,woex5 =mono_bin(data.是否逾期, data.MonthlyIncome, n=10) 
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6FBF28>
============================================================
                        min        max    sum  total      rate       woe  \
Bucket                                                                     
(-0.001, 3400.0]        0.0     3400.0  44952  48760  0.921903 -0.168828   
(3400.0, 6850.0]     3401.0     6850.0  44600  48145  0.926368 -0.105123   
(6850.0, 3008750.0]  6851.0  3008750.0  46096  48449  0.951433  0.337716   

                     goodattribute  badattribute        IV  
Bucket                                                      
(-0.001, 3400.0]          0.331387      0.392335  0.047012  
(3400.0, 6850.0]          0.328792      0.365238  0.047012  
(6850.0, 3008750.0]       0.339821      0.242427  0.047012  
# Binning with manually chosen cut points, for variables that mono_bin cannot bin
def self_bin(Y, X, cat):
    good = Y.sum()
    bad = Y.count() - good
    d1 = pd.DataFrame({'X': X, 'Y': Y, 'Bucket': pd.cut(X, cat)})
    d2 = d1.groupby('Bucket', as_index=True)
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    # sort_index(by=...) was removed from pandas; sort_values does the same job
    d4 = d3.sort_values(by='min')
    print("=" * 60)
    print(d4)
    woe = list(d4['woe'].round(3))
    return d4, iv, woe
# Discretize the continuous variables
pinf = float('inf')   # positive infinity
ninf = float('-inf')  # negative infinity
cutx3 = [ninf, 0, 1, 3, 5, pinf]
cutx6 = [ninf, 1, 2, 3, 5, pinf]
cutx7 = [ninf, 0, 1, 3, 5, pinf]
cutx8 = [ninf, 0, 1, 2, 3, pinf]
cutx9 = [ninf, 0, 1, 3, pinf]
cutx10 = [ninf, 0, 1, 2, 3, 5, pinf]
dfx3, ivx3, woex3 = self_bin(data.是否逾期, data['逾期30到60天次数'], cutx3)
dfx6, ivx6, woex6 = self_bin(data.是否逾期, data['未偿还贷款'], cutx6)
dfx7, ivx7, woex7 = self_bin(data.是否逾期, data['逾期90天次数'], cutx7)
dfx8, ivx8, woex8 = self_bin(data.是否逾期, data['抵押财产'], cutx8)
dfx9, ivx9, woex9 = self_bin(data.是否逾期, data['逾期60到89天次数'], cutx9)
dfx10, ivx10, woex10 = self_bin(data.是否逾期, data['家庭人数'], cutx10)
============================================================
             min  max     sum   total      rate       woe  goodattribute  \
Bucket                                                                     
(-inf, 0.0]    0    0  117077  122020  0.959490  0.527540       0.863094   
(0.0, 1.0]     1    1   13381   15744  0.849911 -0.903415       0.098645   
(1.0, 3.0]     2    3    4467    6279  0.711419 -1.735033       0.032931   
(3.0, 5.0]     4    5     606    1075  0.563721 -2.381042       0.004467   
(5.0, inf]     6   13     117     236  0.495763 -2.654269       0.000863   

             badattribute  
Bucket                     
(-inf, 0.0]      0.509273  
(0.0, 1.0]       0.243458  
(1.0, 3.0]       0.186689  
(3.0, 5.0]       0.048321  
(5.0, inf]       0.012260  
============================================================
             min  max    sum   total      rate       woe  goodattribute  \
Bucket                                                                    
(-inf, 1.0]    0    1   4438    5322  0.833897 -1.023817       0.032717   
(1.0, 2.0]     2    2   5577    6162  0.905063 -0.382525       0.041114   
(2.0, 3.0]     3    3   7853    8519  0.921822 -0.169958       0.057892   
(3.0, 5.0]     4    5  22082   23622  0.934807  0.025661       0.162789   
(5.0, inf]     6   58  95698  101729  0.940715  0.126966       0.705488   

             badattribute  
Bucket                     
(-inf, 1.0]      0.091078  
(1.0, 2.0]       0.060272  
(2.0, 3.0]       0.068617  
(3.0, 5.0]       0.158665  
(5.0, inf]       0.621368  
============================================================
             min  max     sum   total      rate       woe  goodattribute  \
Bucket                                                                     
(-inf, 0.0]    0    0  131008  137449  0.953139  0.375256       0.965794   
(0.0, 1.0]     1    1    3396    5130  0.661988 -1.965152       0.025035   
(1.0, 3.0]     2    3    1041    2178  0.477961 -2.725530       0.007674   
(3.0, 5.0]     4    5     142     417  0.340528 -3.298263       0.001047   
(5.0, inf]     6   17      61     180  0.338889 -3.305569       0.000450   

             badattribute  
Bucket                     
(-inf, 0.0]      0.663610  
(0.0, 1.0]       0.178652  
(1.0, 3.0]       0.117144  
(3.0, 5.0]       0.028333  
(5.0, inf]       0.012260  
============================================================
             min  max    sum  total      rate       woe  goodattribute  \
Bucket                                                                   
(-inf, 0.0]    0    0  48757  53172  0.916968 -0.235478       0.359438   
(0.0, 1.0]     1    1  48477  51191  0.946983  0.245347       0.357373   
(1.0, 2.0]     2    2  29410  31155  0.943990  0.187261       0.216811   
(2.0, 3.0]     3    3   5812   6230  0.932905 -0.005120       0.042846   
(3.0, inf]     4   54   3192   3606  0.885191 -0.594782       0.023531   

             badattribute  
Bucket                     
(-inf, 0.0]      0.454873  
(0.0, 1.0]       0.279621  
(1.0, 2.0]       0.179786  
(2.0, 3.0]       0.043066  
(3.0, inf]       0.042654  
============================================================
             min  max     sum   total      rate       woe  goodattribute  \
Bucket                                                                     
(-inf, 0.0]    0    0  130993  138127  0.948352  0.272953       0.965683   
(0.0, 1.0]     1    1    3905    5647  0.691518 -1.830095       0.028788   
(1.0, 3.0]     2    3     688    1415  0.486219 -2.692457       0.005072   
(3.0, inf]     4   11      62     165  0.375758 -3.144914       0.000457   

             badattribute  
Bucket                     
(-inf, 0.0]      0.735009  
(0.0, 1.0]       0.179477  
(1.0, 3.0]       0.074902  
(3.0, inf]       0.010612  
============================================================
             min   max    sum  total      rate       woe  goodattribute  \
Bucket                                                                    
(-inf, 0.0]  0.0   0.0  81248  86234  0.942181  0.153553       0.598962   
(0.0, 1.0]   1.0   1.0  24370  26291  0.926933 -0.096812       0.179656   
(1.0, 2.0]   2.0   2.0  17929  19500  0.919436 -0.202612       0.132173   
(2.0, 3.0]   3.0   3.0   8646   9479  0.912122 -0.297501       0.063738   
(3.0, 5.0]   4.0   5.0   3241   3605  0.899029 -0.450836       0.023893   
(5.0, inf]   6.0  20.0    214    245  0.873469 -0.705330       0.001578   

             badattribute  
Bucket                     
(-inf, 0.0]      0.513703  
(0.0, 1.0]       0.197919  
(1.0, 2.0]       0.161859  
(2.0, 3.0]       0.085823  
(3.0, 5.0]       0.037503  
(5.0, inf]       0.003194  
# Set the fonts before drawing so that the labels pick them up
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # display the minus sign correctly
corr = data.corr()  # correlation coefficients between the variables
#print(corr.index)
#xticks = ['x0','x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']  # x-axis labels
xticks = list(corr.index)  # x-axis labels
yticks = list(corr.index)  # y-axis labels
fig = plt.figure(figsize=(22, 20))  # a large figsize makes the heatmap bigger
ax1 = fig.add_subplot(1, 1, 1)
# Draw the heatmap of the correlation coefficients;
# cmap sets the colours, other options tried: 'PuRd', 'YlGnBu', 'rainbow'
sns.heatmap(corr, annot=True, cmap='RdPu', ax=ax1, annot_kws={'size': 16, 'weight': 'light', 'color': 'green'})
ax1.set_xticklabels(xticks, rotation=90, fontsize=20)
ax1.set_yticklabels(yticks, rotation=0, fontsize=20)
plt.show()

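[Figure: heatmap of the correlation coefficients between the variables]

The IV (Information Value) indicator is generally used to measure the predictive power of an independent variable.
Each field yields one IV value after binning, which represents how strongly the field influences the label. A larger IV means the bins separate good and bad customers better and the field has more influence on the label.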

list(data.columns)  # column names
['是否逾期','信用额度','age','逾期30到60天次数','债务占收入比','MonthlyIncome','未偿还贷款','逾期90天次数','抵押财产','逾期60到89天次数','家庭人数']
ivlist=[ivx1,ivx2,ivx3,ivx4,ivx5,ivx6,ivx7,ivx8,ivx9,ivx10]
ivlist
[0.9891738801650342,0.24117787840722144,0.7189254612784397,0.019231014490398168,0.04701224378739177,0.07968800751468878,0.8426781922043317,0.059857660209756414,0.5586891401396025,0.03472056480690539]
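As a sketch of how these IV values can drive variable selection: the names and values below are copied from the two outputs above, while the cut-off of 0.1 is my own assumption and is not stated in the article. It happens to reproduce the five variables that are dropped before the logistic regression later on.

```python
# Column names (without the label 是否逾期) and IV values as printed above
names = ['信用额度', 'age', '逾期30到60天次数', '债务占收入比', 'MonthlyIncome',
         '未偿还贷款', '逾期90天次数', '抵押财产', '逾期60到89天次数', '家庭人数']
ivlist = [0.9891738801650342, 0.24117787840722144, 0.7189254612784397,
          0.019231014490398168, 0.04701224378739177, 0.07968800751468878,
          0.8426781922043317, 0.059857660209756414, 0.5586891401396025,
          0.03472056480690539]

IV_THRESHOLD = 0.1  # assumed cut-off, chosen for this illustration

# Rank the variables from the highest IV to the lowest
ranked = sorted(zip(names, ivlist), key=lambda pair: pair[1], reverse=True)
kept = [name for name, iv in ranked if iv >= IV_THRESHOLD]
dropped = [name for name, iv in ranked if iv < IV_THRESHOLD]

for name, iv in ranked:
    print(f"{name}: {iv:.4f}")
print("kept:", kept)
print("dropped:", dropped)
```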

Forgive this old soul's girlish heart, the chart just had to be pink, hahaha~

ivlist = [ivx1, ivx2, ivx3, ivx4, ivx5, ivx6, ivx7, ivx8, ivx9, ivx10]  # IV of each variable
index = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']  # short variable ids
fig1 = plt.figure(figsize=(14, 8))
ax1 = fig1.add_subplot(1, 1, 1)
x = np.arange(len(index)) + 1
#ax1.bar(x, ivlist, width=0.4, facecolor='hotpink')  # bar chart
ax1.bar(x, ivlist, width=0.4, facecolor='lightcoral')  # bar chart
ax1.set_xticks(x)
# Label each bar with its column name; the label column 是否逾期 is skipped
# so that the ten labels line up with the ten bars
ax1.set_xticklabels(list(data.columns)[1:], rotation=90, fontsize=14)
ax1.set_ylabel('IV(Information Value)', fontsize=14)
# Add the numeric value above each bar
for a, b in zip(x, ivlist):
    plt.text(a, b + 0.01, '%.4f' % b, ha='center', va='bottom', fontsize=10)
plt.show()

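[Figure: bar chart of the IV of each variable]

The Weight of Evidence (WOE) transformation can turn a logistic regression model into the standard scorecard format. The purpose of introducing the WOE transformation is not to improve model quality. Some variables should not be included in the model, either because they add no value to it or because the error associated with their model coefficients is large. A standard credit scorecard can in fact also be built without the WOE transformation. In that case the logistic regression model has to handle a larger number of independent variables. This makes the modelling procedure more complex, but the resulting scorecard is the same.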

Before building the model, the selected variables need to be converted to WoE values, which are then used for credit scoring.
# Remove outliers of one column with the 1.5 * IQR rule
def outlier_processing(df, col):
    s = df[col]
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df = df[df[col] <= upper]
    df = df[df[col] >= lower]
    return df
from sklearn.model_selection import train_test_split
data = pd.read_csv('MissingData.csv')
# Remove the outliers whose age equals 0
data = data[data['age'] > 0]
data = data[data['逾期30到60天次数'] < 90]  # remove outliers
data['是否逾期'] = 1 - data['是否逾期']
Y = data['是否逾期']
X = data.iloc[:, 1:]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# print(Y_train)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
clasTest = test.groupby('是否逾期')['是否逾期'].count()
train.to_csv('TrainData.csv', index=False)
test.to_csv('TestData.csv', index=False)
print(train.shape)
print(test.shape)
(101747, 11)
(43607, 11)
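import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import math
# Replace every value of a series with the WoE of its bin.
# The series must have a default 0..n-1 index (it is read fresh from CSV below).
# Note: the comparison value >= cut[j] treats the bins as left-closed,
# [cut[m], cut[m+1]), while pd.cut and pd.qcut build right-closed bins, so a value
# that sits exactly on a cut point (for example 0 overdue events) gets the WoE of
# the next bin.
def replace_woe(series, cut, woe):
    result = []
    i = 0
    while i < len(series):
        value = series[i]
        j = len(cut) - 2
        m = len(cut) - 2
        # Walk down from the last bin until the lower edge is not above the value
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        result.append(woe[m])
        i += 1
    return result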
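The boundary behaviour of replace_woe can be compared with a lookup that follows the right-closed bins of `pd.cut`. This is a sketch, not the method of the reference article: the helper names `woe_right_closed` and `woe_left_closed` are hypothetical, the cut points are cutx7, and the WoE list is the rounded WoE column printed for 逾期90天次数. It assumes finite, non-missing values.

```python
from bisect import bisect_left, bisect_right

def woe_right_closed(value, cut, woe):
    """Bin (cut[k], cut[k+1]] maps to woe[k], like the pd.cut / pd.qcut default."""
    return woe[bisect_left(cut, value) - 1]

def woe_left_closed(value, cut, woe):
    """Bin [cut[k], cut[k+1]) maps to woe[k], which is what the loop in replace_woe does."""
    return woe[bisect_right(cut, value) - 1]

ninf, pinf = float('-inf'), float('inf')
cutx7 = [ninf, 0, 1, 3, 5, pinf]
woex7 = [0.375, -1.965, -2.726, -3.298, -3.306]  # rounded WoE of 逾期90天次数

for value in [0, 1, 2, 6]:
    print(value, woe_right_closed(value, cutx7, woex7), woe_left_closed(value, cutx7, woex7))
```

For a customer with 0 such events the right-closed lookup returns the WoE of the bin (-inf, 0], while the left-closed one returns the WoE of the bin (0, 1]. For continuous variables the two conventions rarely differ, but for count variables every value on a cut point moves by one bin.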

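Every variable is replaced by its WoE values and the results are saved to CSV files (trainWoeData.csv and TestWoeData.csv):
The whole dataset is split into two parts, one for training and one for testing.
The attribute values of the training part and of the test part are both converted to WOE.
A logistic regression is trained on the training set to obtain the model, which is then evaluated on the test set.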

# Replace the values in TrainData with WoE
data=pd.read_csv('TrainData.csv')
#print(data.head())
data['信用额度'] = Series(replace_woe(data['信用额度'], cutx1, woex1))
#print(data['信用额度'][1400:1500])
data['age'] = Series(replace_woe(data['age'], cutx2, woex2))
data['逾期30到60天次数'] = Series(replace_woe(data['逾期30到60天次数'], cutx3, woex3))
data['债务占收入比'] = Series(replace_woe(data['债务占收入比'], cutx4, woex4))
data['MonthlyIncome'] = Series(replace_woe(data['MonthlyIncome'], cutx5, woex5))
data['未偿还贷款'] = Series(replace_woe(data['未偿还贷款'], cutx6, woex6))
data['逾期90天次数'] = Series(replace_woe(data['逾期90天次数'], cutx7, woex7))
data['抵押财产'] = Series(replace_woe(data['抵押财产'], cutx8, woex8))
data['逾期60到89天次数'] = Series(replace_woe(data['逾期60到89天次数'], cutx9, woex9))
data['家庭人数'] = Series(replace_woe(data['家庭人数'], cutx10, woex10))
data.to_csv('trainWoeData.csv', index=False)
# Replace the values in TestData with WoE
test = pd.read_csv('TestData.csv')
test['信用额度'] = Series(replace_woe(test['信用额度'], cutx1, woex1))
test['age'] = Series(replace_woe(test['age'], cutx2, woex2))
test['逾期30到60天次数'] = Series(replace_woe(test['逾期30到60天次数'], cutx3, woex3))
test['债务占收入比'] = Series(replace_woe(test['债务占收入比'], cutx4, woex4))
test['MonthlyIncome'] = Series(replace_woe(test['MonthlyIncome'], cutx5, woex5))
test['未偿还贷款'] = Series(replace_woe(test['未偿还贷款'], cutx6, woex6))
test['逾期90天次数'] = Series(replace_woe(test['逾期90天次数'], cutx7, woex7))
test['抵押财产'] = Series(replace_woe(test['抵押财产'], cutx8, woex8))
test['逾期60到89天次数'] = Series(replace_woe(test['逾期60到89天次数'], cutx9, woex9))
test['家庭人数'] = Series(replace_woe(test['家庭人数'], cutx10, woex10))
test.to_csv('TestWoeData.csv', index=False)
# Training
matplotlib.rcParams['axes.unicode_minus'] = False
# Load the data
data = pd.read_csv('trainWoeData.csv')
# Dependent variable
Y = data['是否逾期']
# Independent variables; the variables with little effect on the dependent variable are dropped
X = data.drop(['是否逾期', '债务占收入比', 'MonthlyIncome', '未偿还贷款', '抵押财产', '家庭人数'], axis=1)
X1 = sm.add_constant(X)
logit = sm.Logit(Y, X1)
result = logit.fit()
print(result.params)
# Testing
test = pd.read_csv('TestWoeData.csv')
Y_test = test['是否逾期']
# The test features must be the same columns as in training. The original run dropped
# '信用额度' here instead of '债务占收入比', so the AUC it reported may differ.
X_test = test.drop(['是否逾期', '债务占收入比', 'MonthlyIncome', '未偿还贷款', '抵押财产', '家庭人数'], axis=1)
X3 = sm.add_constant(X_test)
resu = result.predict(X3)
fpr, tpr, threshold = roc_curve(Y_test, resu)
rocauc = auc(fpr, tpr)
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % rocauc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.show()
Optimization terminated successfully.
         Current function value: 0.186940
         Iterations 8
const         9.555259
信用额度          0.630777
age           0.511745
逾期30到60天次数    1.035706
逾期90天次数       1.747674
逾期60到89天次数    1.085101
dtype: float64

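[Figure: ROC curve of the model on the test set]

The model's fit is evaluated with the ROC curve and AUC. The figure above is the ROC curve. The AUC is 0.81, which indicates that the model separates good and bad customers reasonably well.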
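To make the AUC figure easier to read: AUC equals the probability that a randomly chosen positive sample receives a higher predicted value than a randomly chosen negative one. The sketch below computes it by brute force on six made-up samples. The function name `pairwise_auc` is hypothetical, and the quadratic loop is meant for illustration only, not for the full test set.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = good customer) and predicted probabilities
labels = [1, 1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.7, 0.2]
auc_value = pairwise_auc(labels, scores)
print(auc_value)  # 7 of the 8 pairs are ordered correctly -> 0.875
```

An AUC of 0.5 corresponds to random ordering (the dashed diagonal in the ROC plot) and 1.0 to a perfect ordering.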

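Computing the score

# Compute the score
# coe holds the coefficients of the logistic regression model, intercept first.
# These values differ slightly from result.params printed above; they appear to be
# the coefficients given in the reference article.
coe = [9.738849, 0.638002, 0.505995, 1.032246, 1.790041, 1.131956]
# The base score is 600, PDO is 20 (every 20 points doubles the good/bad odds),
# and the base good/bad odds are 20.
p = 20 / math.log(2)
q = 600 - 20 * math.log(20) / math.log(2)
baseScore = round(q + p * coe[0], 0)
baseScore
795.0
# Part scores of one variable: coefficient * WoE * factor for every bin
def get_score(coe, woe, factor, label):
    scores = []
    for w in woe:
        score = round(coe * w * factor, 0)
        scores.append(score)
    print(list(data.columns)[label], 'woe:', woe, 'score:', scores)
    return scores
# Part scores of each variable
x1 = get_score(coe[1], woex1, p, 1)
x2 = get_score(coe[2], woex2, p, 2)
x3 = get_score(coe[3], woex3, p, 3)
x7 = get_score(coe[4], woex7, p, 7)
x9 = get_score(coe[5], woex9, p, 9)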
信用额度 woe: [1.322, 1.225, 0.294, -1.102] score: [24.0, 23.0, 5.0, -20.0]
age woe: [-0.562, -0.369, -0.258, -0.216, -0.094, 0.211, 0.502, 0.897, 1.104] score: [-8.0, -5.0, -4.0, -3.0, -1.0, 3.0, 7.0, 13.0, 16.0]
逾期30到60天次数 woe: [0.528, -0.903, -1.735, -2.381, -2.654] score: [16.0, -27.0, -52.0, -71.0, -79.0]
逾期90天次数 woe: [0.375, -1.965, -2.726, -3.298, -3.306] score: [19.0, -101.0, -141.0, -170.0, -171.0]
逾期60到89天次数 woe: [0.273, -1.83, -2.692, -3.145] score: [9.0, -60.0, -88.0, -103.0]
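The scaling behind p and q can be checked with a few lines. In this sketch the names `factor`, `offset` and `score_from_odds` are mine; the constants and the two coefficients are the ones used above. The score is a linear function of the log odds, score = offset + factor * ln(odds), where factor = PDO / ln(2) and offset is chosen so that the base odds map to the base score.

```python
import math

PDO = 20          # points needed to double the good/bad odds
BASE_SCORE = 600  # score assigned at the base odds
BASE_ODDS = 20    # base good/bad odds

factor = PDO / math.log(2)                            # p in the code above
offset = BASE_SCORE - factor * math.log(BASE_ODDS)    # q in the code above

def score_from_odds(odds):
    """Map good/bad odds to a score on the scorecard scale."""
    return offset + factor * math.log(odds)

print(score_from_odds(20))  # the base odds give the base score
print(score_from_odds(40))  # doubling the odds adds PDO points

# The logistic regression predicts ln(odds) = intercept + sum(coef_i * woe_i),
# so the intercept goes into the base score and each bin contributes
# coef_i * woe_i * factor points.
base_score = round(offset + factor * 9.738849)
first_bin_x1 = round(0.638002 * 1.322 * factor)
print(base_score, first_bin_x1)
```

Because the model is linear in the WoE values, the total score splits into the base score plus one part score per variable, which is what makes the table form of a scorecard possible.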

Scoring standard

[Figure: scoring standard]

# Look up the part score of every value of a variable.
# Like replace_woe, this treats the bins as left-closed: value >= cut[j].
def compute_score(series, cut, score):
    result = []
    i = 0
    while i < len(series):
        value = series[i]
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        result.append(score[m])
        i += 1
    return result
test1 = pd.read_csv('TestData.csv')
test1['BaseScore'] = Series(np.zeros(len(test1))) + baseScore
test1['x1'] = Series(compute_score(test1['信用额度'], cutx1, x1))
test1['x2'] = Series(compute_score(test1['age'], cutx2, x2))
test1['x3'] = Series(compute_score(test1['逾期30到60天次数'], cutx3, x3))
test1['x7'] = Series(compute_score(test1['逾期90天次数'], cutx7, x7))
test1['x9'] = Series(compute_score(test1['逾期60到89天次数'], cutx9, x9))
test1['Score'] = test1['x1'] + test1['x2'] + test1['x3'] + test1['x7'] + test1['x9'] + baseScore
test1.to_csv('ScoreData.csv', index=False)

x1, x2, x3, x7 and x9 are the part scores of the corresponding fields. Adding them to the base score of 795 gives the final score.

test1.head()
   是否逾期      信用额度  age  逾期30到60天次数       债务占收入比  MonthlyIncome  \
0     1  0.617352   41           4     0.167589        15000.0   
1     1  0.084176   58           0     0.388851        14583.0   
2     1  0.307757   47           0     0.181313        18900.0   
3     1  0.003265   65           0     0.304616         6000.0   
4     1  0.018517   38           0  5870.000000         2554.0   

   未偿还贷款  逾期90天次数  抵押财产  逾期60到89天次数  家庭人数  \
0     14        1     1           0   2.0   
1     12        0     3           0   1.0   
2     10        0     2           0   3.0   
3      6        0     2           0   0.0   
4      6        0     1           0   0.0   

   BaseScore    x1    x2    x3     x7    x9  Score  
0      795.0 -20.0  -4.0 -71.0 -141.0 -60.0  499.0  
1      795.0  23.0   3.0 -27.0 -101.0 -60.0  633.0  
2      795.0   5.0  -3.0 -27.0 -101.0 -60.0  609.0  
3      795.0  24.0  13.0 -27.0 -101.0 -60.0  644.0  
4      795.0  24.0  -5.0 -27.0 -101.0 -60.0  626.0  
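The Score column can be reproduced by hand from the part scores. The sketch below copies x1, x2, x3, x7 and x9 of the first three rows of the table above and adds them to the base score; the variable names `rows` and `totals` are used only for this illustration.

```python
BASE_SCORE = 795.0

# Part scores (x1, x2, x3, x7, x9) copied from the first three rows of the table
rows = [
    [-20.0, -4.0, -71.0, -141.0, -60.0],
    [23.0, 3.0, -27.0, -101.0, -60.0],
    [5.0, -3.0, -27.0, -101.0, -60.0],
]

# Final score = base score + sum of the part scores
totals = [BASE_SCORE + sum(parts) for parts in rows]
print(totals)  # [499.0, 633.0, 609.0]
```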

Reference blogs
[1]: http://math.stackexchange.com/
[2]: https://www.jianshu.com/p/159f381c661d

I will keep adding to these notes if I do further research.
