Learning to Build a Credit Scorecard

2024-08-20 22:08
Tags: learning, building, scoring


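These are my notes from studying how a credit scorecard is built, reproducing the experiment in the reference articles: 基于Python的信用评分卡模型分析 (Python-based credit scorecard model analysis), parts 1 and 2.

It is only a reproduction, but I am still pretty excited~
Running the code cell by cell in Jupyter Notebook makes it very convenient to study and take notes.

I renamed the data set's columns to Chinese to make them easier to read, and the code prints many intermediate results so that I can follow the whole scorecard-building process. The hard parts are the binning and the score calculation (I am still working through the formula).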

Data processing and analysis

import pandas as pd
import matplotlib.pyplot as plt  # plotting library
import matplotlib
import seaborn as sns
from sklearn.metrics import roc_curve,auc
import statsmodels.api as sm
data = pd.read_csv('dataSet/cs-training.csv')
data.describe().to_csv('dataSet/cs-trainingDes.csv')
data.head()
   SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age  \
0                 1                              0.766127   45   
1                 0                              0.957151   40   
2                 0                              0.658180   38   
3                 0                              0.233810   30   
4                 0                              0.907239   49   

   NumberOfTime30-59DaysPastDueNotWorse  DebtRatio  MonthlyIncome  \
0                                     2   0.802982         9120.0   
1                                     0   0.121876         2600.0   
2                                     1   0.085113         3042.0   
3                                     0   0.036050         3300.0   
4                                     1   0.024926        63588.0   

   NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  \
0                               13                        0   
1                                4                        0   
2                                2                        1   
3                                5                        0   
4                                7                        0   

   NumberRealEstateLoansOrLines  NumberOfTime60-89DaysPastDueNotWorse  \
0                             6                                     0   
1                             0                                     0   
2                             0                                     0   
3                             0                                     0   
4                             1                                     0   

   NumberOfDependents  
0                 2.0  
1                 1.0  
2                 0.0  
3                 0.0  
4                 0.0  
# Look at the summary statistics of data
dataDes = pd.read_csv('dataSet/cs-trainingDes.csv')
dataDes
   Unnamed: 0  SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  \
0       count     150000.000000                         150000.000000   
1        mean          0.066840                              6.048438   
2         std          0.249746                            249.755371   
3         min          0.000000                              0.000000   
4         25%          0.000000                              0.029867   
5         50%          0.000000                              0.154181   
6         75%          0.000000                              0.559046   
7         max          1.000000                          50708.000000   

   Unnamed: 0            age  NumberOfTime30-59DaysPastDueNotWorse      DebtRatio  \
0       count  150000.000000                         150000.000000  150000.000000   
1        mean      52.295207                              0.421033     353.005076   
2         std      14.771866                              4.192781    2037.818523   
3         min       0.000000                              0.000000       0.000000   
4         25%      41.000000                              0.000000       0.175074   
5         50%      52.000000                              0.000000       0.366508   
6         75%      63.000000                              0.000000       0.868254   
7         max     109.000000                             98.000000  329664.000000   

   Unnamed: 0  MonthlyIncome  NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  \
0       count   1.202690e+05                    150000.000000            150000.000000   
1        mean   6.670221e+03                         8.452760                 0.265973   
2         std   1.438467e+04                         5.145951                 4.169304   
3         min   0.000000e+00                         0.000000                 0.000000   
4         25%   3.400000e+03                         5.000000                 0.000000   
5         50%   5.400000e+03                         8.000000                 0.000000   
6         75%   8.249000e+03                        11.000000                 0.000000   
7         max   3.008750e+06                        58.000000                98.000000   

   Unnamed: 0  NumberRealEstateLoansOrLines  NumberOfTime60-89DaysPastDueNotWorse  NumberOfDependents  
0       count                 150000.000000                         150000.000000       146076.000000  
1        mean                      1.018240                              0.240387            0.757222  
2         std                      1.129771                              4.155179            1.115086  
3         min                      0.000000                              0.000000            0.000000  
4         25%                      0.000000                              0.000000            0.000000  
5         50%                      1.000000                              0.000000            0.000000  
6         75%                      2.000000                              0.000000            1.000000  
7         max                     54.000000                             98.000000           20.000000  

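Rename the columns of data

# The Chinese names below are used in the rest of the post; the comments give a literal translation
data.rename(columns={
    'SeriousDlqin2yrs': '是否逾期',                              # overdue or not
    'RevolvingUtilizationOfUnsecuredLines': '信用额度',          # credit limit
    'NumberOfTime30-59DaysPastDueNotWorse': '逾期30到60天次数',  # times 30 to 60 days past due
    'DebtRatio': '债务占收入比',                                 # debt-to-income ratio
    'NumberOfOpenCreditLinesAndLoans': '未偿还贷款',             # outstanding loans
    'NumberOfTimes90DaysLate': '逾期90天次数',                   # times 90 days past due
    'NumberRealEstateLoansOrLines': '抵押财产',                  # mortgaged property
    'NumberOfTime60-89DaysPastDueNotWorse': '逾期60到89天次数',  # times 60 to 89 days past due
    'NumberOfDependents': '家庭人数',                            # family size
}, inplace=True)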
data.head()
   是否逾期      信用额度  age  逾期30到60天次数    债务占收入比  MonthlyIncome  未偿还贷款  逾期90天次数  \
0     1  0.766127   45           2  0.802982         9120.0     13        0   
1     0  0.957151   40           0  0.121876         2600.0      4        0   
2     0  0.658180   38           1  0.085113         3042.0      2        1   
3     0  0.233810   30           0  0.036050         3300.0      5        0   
4     0  0.907239   49           1  0.024926        63588.0      7        0   

   抵押财产  逾期60到89天次数  家庭人数  
0     6           0   2.0  
1     0           0   1.0  
2     0           0   0.0  
3     0           0   0.0  
4     1           0   0.0  

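In the summary table above, the counts for MonthlyIncome (120,269) and NumberOfDependents (146,076) are below 150,000, so both columns contain missing values.

The missing MonthlyIncome values are filled in with predictions from a random forest. MonthlyIncome is first moved to the first column to serve as the label y and the other columns serve as features. The random forest is fitted on the rows where MonthlyIncome is known and then predicts it for the rows where it is missing. The predictions are written back into the original data.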

from sklearn.ensemble import RandomForestRegressor
# Fill missing values with random forest predictions
def set_missing(df):
    # Take the numeric columns and reorder them so that MonthlyIncome (position 5)
    # comes first as the label y; the remaining columns are the features
    process_df = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    # Split into rows where MonthlyIncome is known and rows where it is missing
    known = process_df[process_df.MonthlyIncome.notnull()].values
    unknown = process_df[process_df.MonthlyIncome.isnull()].values
    # X holds the feature values
    X = known[:, 1:]
    # y holds the label values
    y = known[:, 0]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the unknown values with the fitted model
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    # Fill the original missing entries with the predictions
    df.loc[(df.MonthlyIncome.isnull()), 'MonthlyIncome'] = predicted
    return df
data = set_missing(data)  # the random forest fills the column with many missing values
[8311. 1159. 8311. ... 1159. 2554. 2554.]
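data = data.dropna()  # drop the rows of the column with few missing values
data = data.drop_duplicates()  # drop duplicate rows
data.to_csv('MissingData.csv', index=False)
# Dropping rows leaves gaps in the row index, so read the file back in to get a fresh index
data = pd.read_csv('MissingData.csv')
#print(data)
data = data[data['age'] > 0]  # remove the outliers with age equal to 0
data.loc[:100, data.columns[[1, 2]]].boxplot()  # plot.box() also works
print(data.head())
plt.show()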
   是否逾期      信用额度  age  逾期30到60天次数    债务占收入比  MonthlyIncome  未偿还贷款  逾期90天次数  \
0     1  0.766127   45           2  0.802982         9120.0     13        0   
1     0  0.957151   40           0  0.121876         2600.0      4        0   
2     0  0.658180   38           1  0.085113         3042.0      2        1   
3     0  0.233810   30           0  0.036050         3300.0      5        0   
4     0  0.907239   49           1  0.024926        63588.0      7        0   

   抵押财产  逾期60到89天次数  家庭人数  
0     6           0   2.0  
1     0           0   1.0  
2     0           0   0.0  
3     0           0   0.0  
4     1           0   0.0  
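A box plot is a convenient way to see the range of a variable. Here two variables are drawn in the same figure, and the value range of one of them stretches the y-axis so far that the box of the other cannot be seen. Drawn one at a time, both display fine.

[Figure: box plot of 信用额度 and age]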

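Remove the outliers in NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse (renamed above to 逾期30到60天次数, 逾期90天次数 and 逾期60到89天次数).
In the data set a good customer is 0 and a defaulting customer is 1. The more natural reading is that a customer who meets their obligations and pays interest is 1, so we invert the label.

# Remove outliers (the filter is applied to 逾期30到60天次数 only)
data = data[data['逾期30到60天次数'] < 90]
# Invert the variable SeriousDlqin2yrs (now 是否逾期): 1 = good customer, 0 = default
data['是否逾期'] = 1 - data['是否逾期']
from sklearn.model_selection import train_test_split
Y = data['是否逾期']
X = data.iloc[:, 1:]
# 30% of the data goes to the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# print(Y_train)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
clasTest = test.groupby('是否逾期')['是否逾期'].count()
train.to_csv('TrainData.csv', index=False)
test.to_csv('TestData.csv', index=False)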

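Variable binning is a name for the discretization of continuous variables. Credit scorecard development commonly uses equal-width, equal-frequency and optimal binning. Equal-width binning (equal length intervals) uses intervals of the same length, for example ten-year age bands. Equal-frequency binning (equal frequency intervals) fixes the number of bins first and then makes each bin hold roughly the same number of records. Optimal binning, also called supervised discretization, uses recursive partitioning to split a continuous variable into bins; behind it is an algorithm based on conditional inference that searches for a good grouping.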
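To make the difference between the first two schemes concrete, here is a minimal sketch using pandas' `pd.cut` (equal-width) and `pd.qcut` (equal-frequency). The `ages` values are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical ages, chosen only to illustrate the two schemes.
ages = pd.Series([21, 23, 25, 27, 30, 34, 39, 45, 52, 60, 71, 90])

# Equal-width: 3 intervals of the same length across the range 21..90.
equal_width = pd.cut(ages, bins=3)

# Equal-frequency: 3 intervals that each hold about the same number of rows.
equal_freq = pd.qcut(ages, q=3)

# Count the rows per bin, in bin order.
width_counts = equal_width.cat.codes.value_counts().sort_index().tolist()
freq_counts = equal_freq.cat.codes.value_counts().sort_index().tolist()

print(equal_width.cat.categories)
print(width_counts)
print(equal_freq.cat.categories)
print(freq_counts)
```

With skewed data the equal-width bins fill unevenly, while the equal-frequency bins stay balanced. The mono_bin function in this post starts from `pd.qcut` for this reason.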

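# Automatic binning function (optimal binning)
def mono_bin(Y, X, n=20):
    r = 0
    good = Y.sum()
    bad = Y.count() - good
    # Reduce the number of quantile buckets until the bucket means of X and Y
    # are perfectly rank-correlated (Spearman r = +1 or -1), i.e. monotonic
    while np.abs(r) < 1:
        # Assign each value of X to its quantile bucket
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    #print(d1)
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    print(d3)      # intermediate result
    d3['min'] = d2.min().X
    print(d2.X)    # intermediate result
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y        # good customers in the bucket
    d3['total'] = d2.count().Y    # all customers in the bucket
    d3['rate'] = d2.mean().Y      # share of good customers in the bucket
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d3['IV'] = iv
    d4 = d3.sort_values(by='min')
    print("=" * 60)
    print(d4)
    # Cut points: the quantiles that separate the n + 1 final buckets
    cut = []
    cut.append(float('-inf'))
    for i in range(1, n + 1):
        qua = X.quantile(i / (n + 1))
        cut.append(round(qua, 4))
    cut.append(float('inf'))
    woe = list(d4['woe'].round(3))
    return d4, iv, cut, woe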

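WoE analysis bins an indicator, computes the WoE value of each bin and looks at how WoE changes as the indicator changes. The mathematical definition of WoE is:
woe = ln(goodattribute / badattribute)
where, as in the code above, goodattribute is a bin's share of all good customers and badattribute is its share of all bad customers.
For the analysis, each indicator is sorted in ascending order and the WoE value of each bin is computed. For a positive indicator, the larger the indicator the smaller the WoE; for a negative indicator, the larger the indicator the larger the WoE. The steeper the negative slope of WoE for a positive indicator, or the positive slope for a negative indicator, the better the indicator discriminates. A WoE curve that is close to a flat line means the indicator discriminates poorly. If a positive indicator trends upward with WoE, or a negative indicator trends downward with WoE, the indicator does not make economic sense and should be removed.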
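A small worked example may help with the definition. The sketch below computes WoE per bin and the IV sum the same way mono_bin does, namely IV = sum over bins of (goodattribute - badattribute) * woe. The bin names and counts are invented for illustration.

```python
import math

# Hypothetical counts per bin as (good, bad); not taken from the data set.
bins = {"low": (90, 10), "mid": (80, 20), "high": (30, 20)}

total_good = sum(g for g, _ in bins.values())  # 200 good customers overall
total_bad = sum(b for _, b in bins.values())   # 50 bad customers overall

woe = {}
iv = 0.0
for name, (g, b) in bins.items():
    good_attr = g / total_good  # this bin's share of all good customers
    bad_attr = b / total_bad    # this bin's share of all bad customers
    woe[name] = math.log(good_attr / bad_attr)
    iv += (good_attr - bad_attr) * woe[name]

for name, value in woe.items():
    print(name, round(value, 3))
print("IV:", round(iv, 3))
```

The "mid" bin holds 40% of the good and 40% of the bad customers, so its WoE is 0 and it contributes nothing to IV. Bins whose shares differ push WoE away from 0 and raise IV.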

import numpy as np
from scipy import stats
dfx1, ivx1,cutx1,woex1=mono_bin(data.是否逾期,data.信用额度,n=10)
#print(dfx1, ivx1,cutx1,woex1)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD6A0>
============================================================
                       min           max    sum  total      rate       woe  \
Bucket                                                                       
(-0.001, 0.0311]  0.000000      0.031125  35659  36339  0.981287  1.322345   
(0.0311, 0.158]   0.031128      0.158089  35590  36338  0.979415  1.225098   
(0.158, 0.558]    0.158100      0.558255  34499  36338  0.949392  0.294389   
(0.558, 50708.0]  0.558278  50708.000000  29900  36339  0.822807 -1.101834   

                  goodattribute  badattribute        IV  
Bucket                                                   
(-0.001, 0.0311]       0.262879      0.070060  0.989174  
(0.0311, 0.158]        0.262370      0.077066  0.989174  
(0.158, 0.558]         0.254327      0.189470  0.989174  
(0.558, 50708.0]       0.220423      0.663404  0.989174  
dfx2, ivx2,cutx2,woex2=mono_bin(data.是否逾期, data.age, n=10)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD828>
============================================================
                min  max    sum  total      rate       woe  goodattribute  \
Bucket                                                                      
(20.999, 33.0]   21   33  14471  16287  0.888500 -0.561809       0.106681   
(33.0, 40.0]     34   40  16073  17737  0.906185 -0.369403       0.118491   
(40.0, 45.0]     41   45  14683  16043  0.915228 -0.258113       0.108243   
(45.0, 49.0]     46   49  13619  14828  0.918465 -0.215647       0.100400   
(49.0, 54.0]     50   54  16516  17814  0.927136 -0.093814       0.121756   
(54.0, 59.0]     55   59  15757  16670  0.945231  0.210985       0.116161   
(59.0, 64.0]     60   64  15923  16613  0.958466  0.501509       0.117385   
(64.0, 71.0]     65   71  14194  14608  0.971659  0.897390       0.104638   
(71.0, 107.0]    72  107  14412  14754  0.976820  1.103687       0.106246   

                badattribute        IV  
Bucket                                  
(20.999, 33.0]      0.187101  0.241178  
(33.0, 40.0]        0.171440  0.241178  
(40.0, 45.0]        0.140120  0.241178  
(45.0, 49.0]        0.124562  0.241178  
(49.0, 54.0]        0.133732  0.241178  
(54.0, 59.0]        0.094066  0.241178  
(59.0, 64.0]        0.071090  0.241178  
(64.0, 71.0]        0.042654  0.241178  
(71.0, 107.0]       0.035236  0.241178  
dfx4, ivx4,cutx4,woex4 =mono_bin(data.是否逾期, data.债务占收入比, n=20)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD6A0>
============================================================
                        min            max    sum  total      rate       woe  \
Bucket                                                                         
(-0.001, 0.236]    0.000000       0.235948  45593  48452  0.940993  0.131963   
(0.236, 0.545]     0.235953       0.544862  45434  48451  0.937731  0.074679   
(0.545, 329664.0]  0.544864  329664.000000  44621  48451  0.920951 -0.181979   

                   goodattribute  badattribute        IV  
Bucket                                                    
(-0.001, 0.236]         0.336113      0.294560  0.019231  
(0.236, 0.545]          0.334940      0.310839  0.019231  
(0.545, 329664.0]       0.328947      0.394601  0.019231  
dfx5, ivx5,cutx5,woex5 =mono_bin(data.是否逾期, data.MonthlyIncome, n=10) 
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6FBF28>
============================================================
                        min        max    sum  total      rate       woe  \
Bucket                                                                     
(-0.001, 3400.0]        0.0     3400.0  44952  48760  0.921903 -0.168828   
(3400.0, 6850.0]     3401.0     6850.0  44600  48145  0.926368 -0.105123   
(6850.0, 3008750.0]  6851.0  3008750.0  46096  48449  0.951433  0.337716   

                     goodattribute  badattribute        IV  
Bucket                                                      
(-0.001, 3400.0]          0.331387      0.392335  0.047012  
(3400.0, 6850.0]          0.328792      0.365238  0.047012  
(6850.0, 3008750.0]       0.339821      0.242427  0.047012  
# Bin X at the given cut points cat and compute WoE and IV per bin
def self_bin(Y, X, cat):
    good = Y.sum()
    bad = Y.count() - good
    d1 = pd.DataFrame({'X': X, 'Y': Y, 'Bucket': pd.cut(X, cat)})
    d2 = d1.groupby('Bucket', as_index=True)
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min')
    print("=" * 60)
    print(d4)
    woe = list(d4['woe'].round(3))
    return d4, iv, woe
# Discretize the continuous variables
pinf = float('inf')   # positive infinity
ninf = float('-inf')  # negative infinity
cutx3 = [ninf, 0, 1, 3, 5, pinf]
cutx6 = [ninf, 1, 2, 3, 5, pinf]
cutx7 = [ninf, 0, 1, 3, 5, pinf]
cutx8 = [ninf, 0,1,2, 3, pinf]
cutx9 = [ninf, 0, 1, 3, pinf]
cutx10 = [ninf, 0, 1, 2, 3, 5, pinf]
dfx3, ivx3,woex3 = self_bin(data.是否逾期, data['逾期30到60天次数'], cutx3)
dfx6, ivx6 ,woex6= self_bin(data.是否逾期, data['未偿还贷款'], cutx6)
dfx7, ivx7,woex7 = self_bin(data.是否逾期, data['逾期90天次数'], cutx7)
dfx8, ivx8,woex8 = self_bin(data.是否逾期, data['抵押财产'], cutx8)
dfx9, ivx9,woex9 = self_bin(data.是否逾期, data['逾期60到89天次数'], cutx9)
dfx10, ivx10,woex10 = self_bin(data.是否逾期, data['家庭人数'], cutx10)
============================================================
             min  max     sum   total      rate       woe  goodattribute  \
Bucket                                                                     
(-inf, 0.0]    0    0  117077  122020  0.959490  0.527540       0.863094   
(0.0, 1.0]     1    1   13381   15744  0.849911 -0.903415       0.098645   
(1.0, 3.0]     2    3    4467    6279  0.711419 -1.735033       0.032931   
(3.0, 5.0]     4    5     606    1075  0.563721 -2.381042       0.004467   
(5.0, inf]     6   13     117     236  0.495763 -2.654269       0.000863   

             badattribute  
Bucket                     
(-inf, 0.0]      0.509273  
(0.0, 1.0]       0.243458  
(1.0, 3.0]       0.186689  
(3.0, 5.0]       0.048321  
(5.0, inf]       0.012260  
============================================================
             min  max    sum   total      rate       woe  goodattribute  \
Bucket                                                                    
(-inf, 1.0]    0    1   4438    5322  0.833897 -1.023817       0.032717   
(1.0, 2.0]     2    2   5577    6162  0.905063 -0.382525       0.041114   
(2.0, 3.0]     3    3   7853    8519  0.921822 -0.169958       0.057892   
(3.0, 5.0]     4    5  22082   23622  0.934807  0.025661       0.162789   
(5.0, inf]     6   58  95698  101729  0.940715  0.126966       0.705488   

             badattribute  
Bucket                     
(-inf, 1.0]      0.091078  
(1.0, 2.0]       0.060272  
(2.0, 3.0]       0.068617  
(3.0, 5.0]       0.158665  
(5.0, inf]       0.621368  
============================================================
             min  max     sum   total      rate       woe  goodattribute  \
Bucket                                                                     
(-inf, 0.0]    0    0  131008  137449  0.953139  0.375256       0.965794   
(0.0, 1.0]     1    1    3396    5130  0.661988 -1.965152       0.025035   
(1.0, 3.0]     2    3    1041    2178  0.477961 -2.725530       0.007674   
(3.0, 5.0]     4    5     142     417  0.340528 -3.298263       0.001047   
(5.0, inf]     6   17      61     180  0.338889 -3.305569       0.000450   

             badattribute  
Bucket                     
(-inf, 0.0]      0.663610  
(0.0, 1.0]       0.178652  
(1.0, 3.0]       0.117144  
(3.0, 5.0]       0.028333  
(5.0, inf]       0.012260  
============================================================
             min  max    sum  total      rate       woe  goodattribute  \
Bucket                                                                   
(-inf, 0.0]    0    0  48757  53172  0.916968 -0.235478       0.359438   
(0.0, 1.0]     1    1  48477  51191  0.946983  0.245347       0.357373   
(1.0, 2.0]     2    2  29410  31155  0.943990  0.187261       0.216811   
(2.0, 3.0]     3    3   5812   6230  0.932905 -0.005120       0.042846   
(3.0, inf]     4   54   3192   3606  0.885191 -0.594782       0.023531   

             badattribute  
Bucket                     
(-inf, 0.0]      0.454873  
(0.0, 1.0]       0.279621  
(1.0, 2.0]       0.179786  
(2.0, 3.0]       0.043066  
(3.0, inf]       0.042654  
============================================================
             min  max     sum   total      rate       woe  goodattribute  \
Bucket                                                                     
(-inf, 0.0]    0    0  130993  138127  0.948352  0.272953       0.965683   
(0.0, 1.0]     1    1    3905    5647  0.691518 -1.830095       0.028788   
(1.0, 3.0]     2    3     688    1415  0.486219 -2.692457       0.005072   
(3.0, inf]     4   11      62     165  0.375758 -3.144914       0.000457   

             badattribute  
Bucket                     
(-inf, 0.0]      0.735009  
(0.0, 1.0]       0.179477  
(1.0, 3.0]       0.074902  
(3.0, inf]       0.010612  
============================================================
             min   max    sum  total      rate       woe  goodattribute  \
Bucket                                                                    
(-inf, 0.0]  0.0   0.0  81248  86234  0.942181  0.153553       0.598962   
(0.0, 1.0]   1.0   1.0  24370  26291  0.926933 -0.096812       0.179656   
(1.0, 2.0]   2.0   2.0  17929  19500  0.919436 -0.202612       0.132173   
(2.0, 3.0]   3.0   3.0   8646   9479  0.912122 -0.297501       0.063738   
(3.0, 5.0]   4.0   5.0   3241   3605  0.899029 -0.450836       0.023893   
(5.0, inf]   6.0  20.0    214    245  0.873469 -0.705330       0.001578   

             badattribute  
Bucket                     
(-inf, 0.0]      0.513703  
(0.0, 1.0]       0.197919  
(1.0, 2.0]       0.161859  
(2.0, 3.0]       0.085823  
(3.0, 5.0]       0.037503  
(5.0, inf]       0.003194  
plt.rcParams['font.sans-serif'] = ['SimHei']  # so that the Chinese labels render correctly
plt.rcParams['axes.unicode_minus'] = False    # so that minus signs render correctly
corr = data.corr()  # correlation coefficients between the variables
#print(corr.index)
xticks = list(corr.index)  # x-axis labels
yticks = list(corr.index)  # y-axis labels
fig = plt.figure(figsize=(22, 20))  # a larger figsize makes the heat map bigger
ax1 = fig.add_subplot(1, 1, 1)
# Draw the heat map of the correlation coefficients; cmap sets the colors
# (other color maps tried: 'PuRd', 'YlGnBu', 'rainbow')
sns.heatmap(corr, annot=True, cmap='RdPu', ax=ax1, annot_kws={'size': 16, 'weight': 'light', 'color': 'green'})
ax1.set_xticklabels(xticks, rotation=90, fontsize=20)
ax1.set_yticklabels(yticks, rotation=0, fontsize=20)
plt.show()

[Figure: heat map of the correlation coefficients]

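IV (Information Value) is generally used to measure the predictive power of an independent variable.
Binning a field yields one IV value, which represents how strongly the field influences the label. The larger the IV, the better the binning separates good from bad customers and the stronger the field's influence on the label.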

list(data.columns)  # column names; ivx1 to ivx10 belong to the ten columns after 是否逾期
['是否逾期','信用额度','age','逾期30到60天次数','债务占收入比','MonthlyIncome','未偿还贷款','逾期90天次数','抵押财产','逾期60到89天次数','家庭人数']
ivlist=[ivx1,ivx2,ivx3,ivx4,ivx5,ivx6,ivx7,ivx8,ivx9,ivx10]
ivlist
[0.9891738801650342,0.24117787840722144,0.7189254612784397,0.019231014490398168,0.04701224378739177,0.07968800751468878,0.8426781922043317,0.059857660209756414,0.5586891401396025,0.03472056480690539]
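As a sketch of how these IV values can drive variable selection: the five columns that the training code further down leaves out are exactly the ones whose IV is below 0.1. The threshold of 0.1 is my assumption (it is a common rule of thumb and is not stated in the post), and the dictionary simply pairs the column names with the rounded ivlist values above.

```python
# IV values copied from the ivlist output above, rounded to 4 decimals.
iv_by_column = {
    '信用额度': 0.9892,
    'age': 0.2412,
    '逾期30到60天次数': 0.7189,
    '债务占收入比': 0.0192,
    'MonthlyIncome': 0.0470,
    '未偿还贷款': 0.0797,
    '逾期90天次数': 0.8427,
    '抵押财产': 0.0599,
    '逾期60到89天次数': 0.5587,
    '家庭人数': 0.0347,
}

IV_THRESHOLD = 0.1  # assumption: cut-off for "too weak to keep"

kept = [col for col, iv in iv_by_column.items() if iv >= IV_THRESHOLD]
dropped = [col for col, iv in iv_by_column.items() if iv < IV_THRESHOLD]

print("kept:", kept)
print("dropped:", dropped)
```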

Forgive this old-timer's girlish heart, the chart just had to be pink, ha ha~

ivlist = [ivx1, ivx2, ivx3, ivx4, ivx5, ivx6, ivx7, ivx8, ivx9, ivx10]  # IV of each variable
labels = list(data.columns)[1:]  # the ten feature names, in the same order as ivlist
fig1 = plt.figure(figsize=(14, 8))
ax1 = fig1.add_subplot(1, 1, 1)
x = np.arange(len(labels)) + 1
#ax1.bar(x, ivlist, width=0.4,facecolor = 'hotpink')
ax1.bar(x, ivlist, width=0.4, facecolor='lightcoral')  # draw the bar chart
ax1.set_xticks(x)
ax1.set_xticklabels(labels, rotation=90, fontsize=14)
ax1.set_ylabel('IV(Information Value)', fontsize=14)
# Add the value above each bar
for a, b in zip(x, ivlist):
    plt.text(a, b + 0.01, '%.4f' % b, ha='center', va='bottom', fontsize=10)
plt.show()

[Figure: bar chart of the IV (Information Value) of each variable]

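The weight of evidence (WOE) transformation turns a logistic regression model into the standard scorecard format. The purpose of introducing the WOE transformation is not to improve model quality. Some variables should not be included in the model, either because they add no value to it or because the errors associated with their model coefficients are large. A standard credit scorecard can in fact be built without the WOE transformation; in that case the logistic regression model has to handle a larger number of independent variables. That makes the modelling procedure more complex, but the resulting scorecard is the same.

Before building the model, the selected variables need to be converted to their WoE values, which are then used for credit scoring.
# Keep only the rows whose value in col lies within 1.5 * IQR of the quartiles
def outlier_processing(df, col):
    s = df[col]
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df = df[df[col] <= upper]
    df = df[df[col] >= lower]
    return df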
data = pd.read_csv('MissingData.csv')
# Remove the outliers with age equal to 0
data = data[data['age'] > 0]
data = data[data['逾期30到60天次数'] < 90]  # remove outliers
data['是否逾期'] = 1 - data['是否逾期']
Y = data['是否逾期']
X = data.iloc[:, 1:]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# print(Y_train)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
clasTest = test.groupby('是否逾期')['是否逾期'].count()
train.to_csv('TrainData.csv',index=False)
test.to_csv('TestData.csv',index=False)
print(train.shape)
print(test.shape)
(101747, 11)
(43607, 11)
import pandas as pd
import numpy as np
from pandas import Series,DataFrame
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import math
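# Replace each value with the WoE of its bin
def replace_woe(series, cut, woe):
    result = []
    i = 0
    while i < len(series):
        value = series[i]
        # Walk down from the highest cut point until value >= cut[j].
        # Note: with >=, a value equal to a cut point lands in the bin above
        # that point, whereas the pd.cut / pd.qcut bins used to compute woe
        # are right-closed, (cut[j-1], cut[j]].
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        result.append(woe[m])
        i += 1
    return result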

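Each variable is replaced in turn and the results are saved to CSV files (trainWoeData.csv and TestWoeData.csv):
Split the full data set into two parts, one for training and one for testing.
Convert the attribute values of the training part and of the test part to WoE.
Fit a logistic regression on the training set to get the model, then evaluate it on the test set.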
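The WoE replacement step can also be written without an explicit loop. The sketch below is my own alternative to replace_woe, not the method used in this post: `woe_by_bin` is a hypothetical helper that uses `pd.cut`, so a value equal to a cut point falls into the right-closed bin that ends there, the same convention self_bin uses when it computes the WoE. It assumes the series has no missing values. The cut points are cutx3 from above and the WoE values are the rounded ones from the self_bin output for 逾期30到60天次数.

```python
import numpy as np
import pandas as pd

def woe_by_bin(values, cut, woe):
    """Map each value to the WoE of its right-closed bin (cut[k], cut[k+1]].

    Hypothetical helper; assumes `values` contains no missing entries.
    """
    codes = pd.cut(values, bins=cut, labels=False)  # integer bin index per row
    return pd.Series(np.asarray(woe)[codes.to_numpy()], index=values.index)

# cutx3 and the rounded WoE values for 逾期30到60天次数 shown earlier in the post.
cutx3 = [float('-inf'), 0, 1, 3, 5, float('inf')]
woex3 = [0.528, -0.903, -1.735, -2.381, -2.654]

sample = pd.Series([0, 1, 2, 4, 7])  # hypothetical input values
mapped = woe_by_bin(sample, cutx3, woex3)
print(mapped.tolist())
```

Because the bins are right-closed here, a count of 0 maps to the first bin (-inf, 0]. For values that sit exactly on a cut point this differs from what the `>=` comparison in replace_woe returns.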

# Replace the TrainData values with WoE
data=pd.read_csv('TrainData.csv')
#print(data.head())
data['信用额度'] = Series(replace_woe(data['信用额度'], cutx1, woex1))
#print(data['信用额度'][1400:1500])
data['age'] = Series(replace_woe(data['age'], cutx2, woex2))
data['逾期30到60天次数'] = Series(replace_woe(data['逾期30到60天次数'], cutx3, woex3))
data['债务占收入比'] = Series(replace_woe(data['债务占收入比'], cutx4, woex4))
data['MonthlyIncome'] = Series(replace_woe(data['MonthlyIncome'], cutx5, woex5))
data['未偿还贷款'] = Series(replace_woe(data['未偿还贷款'], cutx6, woex6))
data['逾期90天次数'] = Series(replace_woe(data['逾期90天次数'], cutx7, woex7))
data['抵押财产'] = Series(replace_woe(data['抵押财产'], cutx8, woex8))
data['逾期60到89天次数'] = Series(replace_woe(data['逾期60到89天次数'], cutx9, woex9))
data['家庭人数'] = Series(replace_woe(data['家庭人数'], cutx10, woex10))
data.to_csv('trainWoeData.csv', index=False)
# Replace the TestData values with WoE
test = pd.read_csv('TestData.csv')
test['信用额度'] = Series(replace_woe(test['信用额度'], cutx1, woex1))
test['age'] = Series(replace_woe(test['age'], cutx2, woex2))
test['逾期30到60天次数'] = Series(replace_woe(test['逾期30到60天次数'], cutx3, woex3))
test['债务占收入比'] = Series(replace_woe(test['债务占收入比'], cutx4, woex4))
test['MonthlyIncome'] = Series(replace_woe(test['MonthlyIncome'], cutx5, woex5))
test['未偿还贷款'] = Series(replace_woe(test['未偿还贷款'], cutx6, woex6))
test['逾期90天次数'] = Series(replace_woe(test['逾期90天次数'], cutx7, woex7))
test['抵押财产'] = Series(replace_woe(test['抵押财产'], cutx8, woex8))
test['逾期60到89天次数'] = Series(replace_woe(test['逾期60到89天次数'], cutx9, woex9))
test['家庭人数'] = Series(replace_woe(test['家庭人数'], cutx10, woex10))
test.to_csv('TestWoeData.csv', index=False)
# Training
matplotlib.rcParams['axes.unicode_minus'] = False
# Load the data
data = pd.read_csv('trainWoeData.csv')
# Dependent variable
Y = data['是否逾期']
# Independent variables, leaving out those with little influence on the dependent variable
X = data.drop(['是否逾期', '债务占收入比', 'MonthlyIncome', '未偿还贷款', '抵押财产', '家庭人数'], axis=1)
X1 = sm.add_constant(X)
logit = sm.Logit(Y, X1)
result = logit.fit()
print(result.params)
# Testing
test = pd.read_csv('TestWoeData.csv')
Y_test = test['是否逾期']
# Drop the same columns as in training so that the test features line up with the fitted coefficients
X_test = test.drop(['是否逾期', '债务占收入比', 'MonthlyIncome', '未偿还贷款', '抵押财产', '家庭人数'], axis=1)
X3 = sm.add_constant(X_test)
resu = result.predict(X3)
fpr, tpr, threshold = roc_curve(Y_test, resu)
rocauc = auc(fpr, tpr)
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % rocauc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.show()
Optimization terminated successfully.
         Current function value: 0.186940
         Iterations 8
const         9.555259
信用额度          0.630777
age           0.511745
逾期30到60天次数    1.035706
逾期90天次数       1.747674
逾期60到89天次数    1.085101
dtype: float64

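[Figure: ROC curve]

The ROC curve and the AUC are used to evaluate how well the model fits. The figure above is the ROC curve. The AUC is 0.81, which indicates that the model separates good and bad customers reasonably well.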
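One way to read the AUC: it equals the probability that a randomly chosen positive case gets a higher predicted score than a randomly chosen negative case. The sketch below computes it directly from that definition on made-up labels and scores. `auc_by_pairs` is a hypothetical helper written for illustration; the post itself uses `roc_curve` and `auc` from scikit-learn.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = good customer, as in this post) and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0]
y_score = [0.9, 0.8, 0.35, 0.4, 0.7, 0.2]

print(auc_by_pairs(y_true, y_score))
```

Here 7 of the 8 positive/negative pairs are ordered correctly, so the function returns 0.875. A model that ranks at random would score about 0.5.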

Computing the scores

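# Compute the scores
# coe holds the coefficients of the logistic regression model, constant first
# (hard-coded here; the values differ slightly from the result.params printed above)
coe=[9.738849,0.638002,0.505995,1.032246,1.790041,1.131956]
# Take 600 as the base score, a PDO of 20 (every 20 extra points doubles the good:bad odds) and good:bad odds of 20.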
p = 20 / math.log(2)
q = 600 - 20 * math.log(20) / math.log(2)
baseScore = round(q + p * coe[0], 0)
baseScore
795.0
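Since I was still puzzling over the formula, here is how I currently read it, as a sketch rather than a definitive derivation. The score is a linear function of the log odds, Score = q + p * ln(odds). Requiring Score = 600 at odds of 20 and +20 points whenever the odds double gives p = 20 / ln(2) and q = 600 - p * ln(20), which are the p and q above. Because the label was inverted (1 = good), the logistic model gives ln(odds) = b0 + b1*woe1 + ... for the good:bad odds, so Score = (q + p*b0) + p*b1*woe1 + ... The first term is baseScore and each remaining term is the per-bin points coe * woe * factor. The names `score_from_odds`, `factor` and `offset` below are mine.

```python
import math

PDO = 20          # points needed to double the good:bad odds
BASE_SCORE = 600  # score assigned at the reference odds
BASE_ODDS = 20    # reference good:bad odds

factor = PDO / math.log(2)                          # p in the code above
offset = BASE_SCORE - factor * math.log(BASE_ODDS)  # q in the code above

def score_from_odds(odds):
    """Score as a linear function of the log odds (hypothetical helper)."""
    return offset + factor * math.log(odds)

print(round(score_from_odds(20), 6))  # the reference odds map to the base score
print(round(score_from_odds(40), 6))  # doubling the odds adds PDO points

intercept = 9.738849                        # coe[0] from the code above
print(round(offset + factor * intercept))   # baseScore
```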
# Compute the points of each bin for one variable
def get_score(coe, woe, factor, label):
    scores = []
    for w in woe:
        score = round(coe * w * factor, 0)
        scores.append(score)
    print(list(data.columns)[label], 'woe:', woe, 'score:', scores)
    return scores
# Points for each variable
x1 = get_score(coe[1], woex1, p,1)
x2 = get_score(coe[2], woex2, p,2)
x3 = get_score(coe[3], woex3, p,3)
x7 = get_score(coe[4], woex7, p,7)
x9 = get_score(coe[5], woex9, p,9)
信用额度 woe: [1.322, 1.225, 0.294, -1.102] score: [24.0, 23.0, 5.0, -20.0]
age woe: [-0.562, -0.369, -0.258, -0.216, -0.094, 0.211, 0.502, 0.897, 1.104] score: [-8.0, -5.0, -4.0, -3.0, -1.0, 3.0, 7.0, 13.0, 16.0]
逾期30到60天次数 woe: [0.528, -0.903, -1.735, -2.381, -2.654] score: [16.0, -27.0, -52.0, -71.0, -79.0]
逾期90天次数 woe: [0.375, -1.965, -2.726, -3.298, -3.306] score: [19.0, -101.0, -141.0, -170.0, -171.0]
逾期60到89天次数 woe: [0.273, -1.83, -2.692, -3.145] score: [9.0, -60.0, -88.0, -103.0]

Scoring criteria

[Figure: scoring criteria]

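# Look up the points for each value of a variable
def compute_score(series, cut, score):
    result = []
    i = 0
    while i < len(series):
        value = series[i]
        # Same lookup as replace_woe: walk down from the highest cut point
        # until value >= cut[j]. With >=, a value equal to a cut point gets
        # the points of the bin above that point.
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        result.append(score[m])
        i += 1
    return result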
test1 = pd.read_csv('TestData.csv')
test1['BaseScore']=Series(np.zeros(len(test1)))+baseScore
test1['x1'] = Series(compute_score(test1['信用额度'], cutx1, x1))
test1['x2'] = Series(compute_score(test1['age'], cutx2, x2))
test1['x3'] = Series(compute_score(test1['逾期30到60天次数'], cutx3, x3))
test1['x7'] = Series(compute_score(test1['逾期90天次数'], cutx7, x7))
test1['x9'] = Series(compute_score(test1['逾期60到89天次数'], cutx9, x9))
test1['Score'] = test1['x1'] + test1['x2'] + test1['x3'] + test1['x7'] +test1['x9']  + baseScore
test1.to_csv('ScoreData.csv', index=False)

x1 to x9 are the points of the corresponding fields; adding them to the base score of 795 gives the final score.
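For a single applicant the lookup can be sketched with the standard library alone. This is an illustrative alternative, not the code used for the table below: it covers only the three past-due counters, whose cut points (cutx3, cutx7, cutx9) and points (x3, x7, x9) are listed in full above, and it treats the bins as right-closed like `pd.cut`. For values that sit exactly on a cut point it therefore gives different points than compute_score with its `>=` comparison. The `applicant` record is made up.

```python
from bisect import bisect_left

NINF, PINF = float('-inf'), float('inf')

# Cut points and per-bin points as listed earlier in the post.
scorecard = {
    '逾期30到60天次数': ([NINF, 0, 1, 3, 5, PINF], [16, -27, -52, -71, -79]),
    '逾期90天次数': ([NINF, 0, 1, 3, 5, PINF], [19, -101, -141, -170, -171]),
    '逾期60到89天次数': ([NINF, 0, 1, 3, PINF], [9, -60, -88, -103]),
}

def points_for(value, cut, points):
    """Points of the right-closed bin (cut[k], cut[k+1]] that holds value."""
    return points[bisect_left(cut, value) - 1]

def partial_score(applicant):
    """Sum of the points over the variables in `scorecard` (base score not included)."""
    return sum(points_for(applicant[name], cut, pts)
               for name, (cut, pts) in scorecard.items())

# Hypothetical applicant: never 30-60 days late, once 90 days late, four times 60-89 days late.
applicant = {'逾期30到60天次数': 0, '逾期90天次数': 1, '逾期60到89天次数': 4}
print(partial_score(applicant))
```

Adding the base score and the points for the remaining variables in the same way would give the full score.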

test1.head()
   是否逾期      信用额度  age  逾期30到60天次数       债务占收入比  MonthlyIncome  未偿还贷款  逾期90天次数  \
0     1  0.617352   41           4     0.167589        15000.0     14        1   
1     1  0.084176   58           0     0.388851        14583.0     12        0   
2     1  0.307757   47           0     0.181313        18900.0     10        0   
3     1  0.003265   65           0     0.304616         6000.0      6        0   
4     1  0.018517   38           0  5870.000000         2554.0      6        0   

   抵押财产  逾期60到89天次数  家庭人数  BaseScore    x1    x2    x3     x7    x9  Score  
0     1           0   2.0      795.0 -20.0  -4.0 -71.0 -141.0 -60.0  499.0  
1     3           0   1.0      795.0  23.0   3.0 -27.0 -101.0 -60.0  633.0  
2     2           0   3.0      795.0   5.0  -3.0 -27.0 -101.0 -60.0  609.0  
3     2           0   0.0      795.0  24.0  13.0 -27.0 -101.0 -60.0  644.0  
4     1           0   0.0      795.0  24.0  -5.0 -27.0 -101.0 -60.0  626.0  

Reference blogs
[1]: http://math.stackexchange.com/
[2]: https://www.jianshu.com/p/159f381c661d

I will add more here if I do further research.
