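These are my notes from studying scorecard construction by reproducing an existing experiment. Reference article: 基于Python的信用评分卡模型分析 (Parts 1 and 2).
It is only a reproduction, but I am still pretty excited~
Jupyter Notebook runs code cell by cell, which makes it very convenient for studying and taking notes.
I renamed the dataset's columns to Chinese to make them easier to read, and the code prints many intermediate results so that I can follow the whole scorecard-building process. The hard parts are binning and the score calculation (I am still working through the formula).
Data processing and analysis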
import pandas as pd
import matplotlib.pyplot as plt  # plotting library
import matplotlib
import seaborn as sns
from sklearn.metrics import roc_curve,auc
import statsmodels.api as sm
data = pd.read_csv('dataSet/cs-training.csv')
data.describe().to_csv('dataSet/cs-trainingDes.csv')
data.head()
| | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
3 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
# Load and display the descriptive statistics of data
dataDes = pd.read_csv('dataSet/cs-trainingDes.csv')
dataDes
| | Unnamed: 0 | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | count | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 1.202690e+05 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 146076.000000 |
1 | mean | 0.066840 | 6.048438 | 52.295207 | 0.421033 | 353.005076 | 6.670221e+03 | 8.452760 | 0.265973 | 1.018240 | 0.240387 | 0.757222 |
2 | std | 0.249746 | 249.755371 | 14.771866 | 4.192781 | 2037.818523 | 1.438467e+04 | 5.145951 | 4.169304 | 1.129771 | 4.155179 | 1.115086 |
3 | min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
4 | 25% | 0.000000 | 0.029867 | 41.000000 | 0.000000 | 0.175074 | 3.400000e+03 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
5 | 50% | 0.000000 | 0.154181 | 52.000000 | 0.000000 | 0.366508 | 5.400000e+03 | 8.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
6 | 75% | 0.000000 | 0.559046 | 63.000000 | 0.000000 | 0.868254 | 8.249000e+03 | 11.000000 | 0.000000 | 2.000000 | 0.000000 | 1.000000 |
7 | max | 1.000000 | 50708.000000 | 109.000000 | 98.000000 | 329664.000000 | 3.008750e+06 | 58.000000 | 98.000000 | 54.000000 | 98.000000 | 20.000000 |
Rename the columns of data
data.rename(columns={'SeriousDlqin2yrs':'是否逾期','RevolvingUtilizationOfUnsecuredLines':'信用额度','NumberOfTime30-59DaysPastDueNotWorse':'逾期30到60天次数','DebtRatio':'债务占收入比','NumberOfOpenCreditLinesAndLoans':'未偿还贷款','NumberOfTimes90DaysLate':'逾期90天次数','NumberRealEstateLoansOrLines':'抵押财产','NumberOfTime60-89DaysPastDueNotWorse':'逾期60到89天次数','NumberOfDependents':'家庭人数'},inplace = True)
data.head()
| | 是否逾期 | 信用额度 | age | 逾期30到60天次数 | 债务占收入比 | MonthlyIncome | 未偿还贷款 | 逾期90天次数 | 抵押财产 | 逾期60到89天次数 | 家庭人数 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
3 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
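In the describe() table above, the count for MonthlyIncome and NumberOfDependents is less than 150000, so both columns contain missing values.
The missing MonthlyIncome values are filled in with predictions from a random forest. MonthlyIncome is moved to the first column and used as the label y, the other columns serve as features, and a random forest regressor is trained on the rows where MonthlyIncome is known. Its predictions are then written back into the rows where the value is missing.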
from sklearn.ensemble import RandomForestRegressor
# Fill the missing MonthlyIncome values with random forest predictions
def set_missing(df):
    # Take the numeric features, reordered so that MonthlyIncome (column 5) comes first
    # as the label y; the remaining columns are the features
    process_df = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    # Split into rows where MonthlyIncome is known and rows where it is missing
    known = process_df[process_df.MonthlyIncome.notnull()].values
    unknown = process_df[process_df.MonthlyIncome.isnull()].values
    # X holds the feature values
    X = known[:, 1:]
    # y holds the label values
    y = known[:, 0]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the missing values with the fitted model
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    # Write the predictions back into the missing entries
    df.loc[(df.MonthlyIncome.isnull()), 'MonthlyIncome'] = predicted
    return df
data = set_missing(data)  # fill the column with many missing values using the random forest
[8311. 1159. 8311. ... 1159. 2554. 2554.]
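data = data.dropna()  # drop the rows with the few remaining missing values
data = data.drop_duplicates()  # drop duplicate rows
data.to_csv('MissingData.csv',index=False)
# Dropping rows leaves gaps in the row index, so the file is read back in
data=pd.read_csv('MissingData.csv')
#print(data)
data = data[data['age'] > 0]  # remove the outliers with age equal to 0
data.iloc[:100, [1, 2]].boxplot()  # plot.box() also works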
print(data.head())
plt.show()
是否逾期 信用额度 age 逾期30到60天次数 债务占收入比 MonthlyIncome 未偿还贷款 逾期90天次数 \
0 1 0.766127 45 2 0.802982 9120.0 13 0
1 0 0.957151 40 0 0.121876 2600.0 4 0
2 0 0.658180 38 1 0.085113 3042.0 2 1
3 0 0.233810 30 0 0.036050 3300.0 5 0
4 0 0.907239 49 1 0.024926 63588.0 7 0

抵押财产 逾期60到89天次数 家庭人数
0 6 0 2.0
1 0 0 1.0
2 0 0 0.0
3 0 0 0.0
4 1 0 0.0
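A box plot is a convenient way to see the range of a variable. Here two variables are drawn in one figure, and the much larger range of one of them squashes the other's box so that it is barely visible. Plotting each variable separately avoids the problem.
Remove the outliers in NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse (the describe() table shows a maximum of 98 for each of them).
In the dataset, good customers are labelled 0 and defaulting customers 1. Since it is more intuitive for customers who repay normally and pay interest to be labelled 1, the label is inverted.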
# Remove outliers
data = data[data['逾期30到60天次数'] < 90]
# Invert the SeriousDlqin2yrs label (是否逾期)
data['是否逾期'] = 1 - data['是否逾期']
from sklearn.model_selection import train_test_split
Y = data['是否逾期']
X = data.iloc[:, 1:]
# Hold out 30% of the data as the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# print(Y_train)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
clasTest = test.groupby('是否逾期')['是否逾期'].count()
train.to_csv('TrainData.csv',index=False)
test.to_csv('TestData.csv',index=False)
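Binning is a name for the discretization of continuous variables. Credit scorecard development commonly uses equal-width binning, equal-frequency binning and optimal binning. Equal-width binning (equal length intervals) uses intervals of the same width, for example age in ten-year bands. Equal-frequency binning (equal frequency intervals) fixes the number of bins first and then makes each bin hold roughly the same number of records. Optimal binning, also called supervised discretization, uses recursive partitioning to split the continuous variable into bins; behind it is an algorithm based on conditional inference that searches for a better grouping.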
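To make the difference between equal-width and equal-frequency binning concrete, here is a minimal pure-Python sketch. The helper names and the small `ages` list are made up for illustration and are not part of the original notebook, which uses `pd.cut` and `pd.qcut` instead.

```python
def equal_width_edges(values, k):
    """Return k+1 edges that split [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k + 1)]


def equal_width_counts(values, k):
    """Count how many values fall into each of the k equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # The maximum value is placed in the last interval
        idx = min(int((v - lo) / step), k - 1)
        counts[idx] += 1
    return counts


def equal_frequency_bins(values, k):
    """Sort the values and cut them into k groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical sample, skewed towards younger ages
ages = [21, 23, 25, 30, 34, 38, 41, 45, 52, 60, 75, 99]

width_edges = equal_width_edges(ages, 3)
width_counts = equal_width_counts(ages, 3)
freq_bins = equal_frequency_bins(ages, 3)

print(width_edges)   # same width, very uneven counts
print(width_counts)
print(freq_bins)     # same count, very uneven widths
```

On skewed data the equal-width bins end up with very different numbers of records, while the equal-frequency bins have very different widths.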
# Automatic binning function (optimal binning)
def mono_bin(Y, X, n=20):
    r = 0
    good = Y.sum()
    bad = Y.count() - good
    # Reduce the number of quantile buckets until the bucket means of X and Y
    # are perfectly rank-correlated (|Spearman r| = 1)
    while np.abs(r) < 1:
        # Assign each value of X to its bucket
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    #print(d1)
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    print(d3)  # debug output
    d3['min'] = d2.min().X
    print(d2.X)  # debug output
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d3['IV'] = iv
    d4 = d3.sort_values(by='min')
    print("=" * 60)
    print(d4)
    # Cut points: -inf, the inner quantiles of X, +inf
    cut = []
    cut.append(float('-inf'))
    for i in range(1, n + 1):
        qua = X.quantile(i / (n + 1))
        cut.append(round(qua, 4))
    cut.append(float('inf'))
    woe = list(d4['woe'].round(3))
    return d4, iv, cut, woe
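The stopping rule in mono_bin can be read as follows: a Spearman correlation of ±1 between the bucket means of X and Y means that the bucket-level rate is strictly monotone in X, so the loop keeps lowering the bucket count until that holds. Below is a minimal pure-Python sketch of the same idea; the function names and the toy data are hypothetical, and it uses equal-sized chunks of the sorted data in place of `pd.qcut`.

```python
def bucket_rates(xs, ys, k):
    """Split the points, sorted by x, into k equal-sized buckets and return the mean of y in each."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    rates = []
    for i in range(k):
        chunk = pairs[i * n // k:(i + 1) * n // k]
        rates.append(sum(y for _, y in chunk) / len(chunk))
    return rates


def is_strictly_monotone(seq):
    """True if the sequence is strictly increasing or strictly decreasing."""
    increasing = all(a < b for a, b in zip(seq, seq[1:]))
    decreasing = all(a > b for a, b in zip(seq, seq[1:]))
    return increasing or decreasing


def monotone_bucket_count(xs, ys, start):
    """Lower the bucket count from `start` until the bucket rates are strictly monotone."""
    for k in range(start, 1, -1):
        if is_strictly_monotone(bucket_rates(xs, ys, k)):
            return k
    return 1


# Toy data: the rate of y rises with x overall, but not within every small bucket
xs = list(range(12))
ys = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1]

k = monotone_bucket_count(xs, ys, start=4)
print(k, bucket_rates(xs, ys, k))
```

With four buckets the rates go down and then up, so the sketch falls back to three buckets, where they rise steadily.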
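WoE analysis bins an indicator, computes the WoE value of each bin, and looks at how the WoE changes as the indicator changes. WoE is defined mathematically as:
woe=ln(goodattribute/badattribute)
For the analysis, each indicator is sorted in ascending order and the WoE of each bin is computed. For a positive indicator, the larger the value, the smaller the WoE; for a negative indicator, the larger the value, the larger the WoE. The steeper the negative slope of a positive indicator's WoE, or the positive slope of a negative indicator's WoE, the better the indicator discriminates. If the WoE values lie close to a flat line, the indicator has weak discriminating power. If a positive indicator's WoE trends upward, or a negative indicator's WoE trends downward, the indicator does not make economic sense and should be dropped.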
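As a worked example of the definition, the sketch below recomputes the WoE values of the first variable from the per-bin counts in the "sum" and "total" columns of its binning output in this post. The helper names are made up for illustration. It also checks that the expression used in the binning code, ln((rate/(1-rate))/(good/bad)), is the same quantity as ln(goodattribute/badattribute).

```python
import math

# Per-bin counts for the first variable (label 1 = good customer)
good_per_bin = [35659, 35590, 34499, 29900]    # "sum" column
total_per_bin = [36339, 36338, 36338, 36339]   # "total" column

good = sum(good_per_bin)          # all good customers
bad = sum(total_per_bin) - good   # all bad customers


def woe_from_shares(g, t):
    """ln(goodattribute / badattribute) for a bin with g goods out of t records."""
    good_share = g / good
    bad_share = (t - g) / bad
    return math.log(good_share / bad_share)


def woe_from_rate(g, t):
    """The form used in the binning code: ln((rate / (1 - rate)) / (good / bad))."""
    rate = g / t
    return math.log((rate / (1 - rate)) / (good / bad))


woe_a = [woe_from_shares(g, t) for g, t in zip(good_per_bin, total_per_bin)]
woe_b = [woe_from_rate(g, t) for g, t in zip(good_per_bin, total_per_bin)]

for a, b in zip(woe_a, woe_b):
    print(round(a, 6), round(b, 6))
```

The two columns agree, and the values match the woe column printed for that variable.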
import numpy as np
from scipy import stats
dfx1, ivx1,cutx1,woex1=mono_bin(data.是否逾期,data.信用额度,n=10)
#print(dfx1, ivx1,cutx1,woex1)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD6A0>
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute | IV |
---|---|---|---|---|---|---|---|---|---|
(-0.001, 0.0311] | 0.000000 | 0.031125 | 35659 | 36339 | 0.981287 | 1.322345 | 0.262879 | 0.070060 | 0.989174 |
(0.0311, 0.158] | 0.031128 | 0.158089 | 35590 | 36338 | 0.979415 | 1.225098 | 0.262370 | 0.077066 | 0.989174 |
(0.158, 0.558] | 0.158100 | 0.558255 | 34499 | 36338 | 0.949392 | 0.294389 | 0.254327 | 0.189470 | 0.989174 |
(0.558, 50708.0] | 0.558278 | 50708.000000 | 29900 | 36339 | 0.822807 | -1.101834 | 0.220423 | 0.663404 | 0.989174 |
dfx2, ivx2,cutx2,woex2=mono_bin(data.是否逾期, data.age, n=10)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD828>
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute | IV |
---|---|---|---|---|---|---|---|---|---|
(20.999, 33.0] | 21 | 33 | 14471 | 16287 | 0.888500 | -0.561809 | 0.106681 | 0.187101 | 0.241178 |
(33.0, 40.0] | 34 | 40 | 16073 | 17737 | 0.906185 | -0.369403 | 0.118491 | 0.171440 | 0.241178 |
(40.0, 45.0] | 41 | 45 | 14683 | 16043 | 0.915228 | -0.258113 | 0.108243 | 0.140120 | 0.241178 |
(45.0, 49.0] | 46 | 49 | 13619 | 14828 | 0.918465 | -0.215647 | 0.100400 | 0.124562 | 0.241178 |
(49.0, 54.0] | 50 | 54 | 16516 | 17814 | 0.927136 | -0.093814 | 0.121756 | 0.133732 | 0.241178 |
(54.0, 59.0] | 55 | 59 | 15757 | 16670 | 0.945231 | 0.210985 | 0.116161 | 0.094066 | 0.241178 |
(59.0, 64.0] | 60 | 64 | 15923 | 16613 | 0.958466 | 0.501509 | 0.117385 | 0.071090 | 0.241178 |
(64.0, 71.0] | 65 | 71 | 14194 | 14608 | 0.971659 | 0.897390 | 0.104638 | 0.042654 | 0.241178 |
(71.0, 107.0] | 72 | 107 | 14412 | 14754 | 0.976820 | 1.103687 | 0.106246 | 0.035236 | 0.241178 |
dfx4, ivx4,cutx4,woex4 =mono_bin(data.是否逾期, data.债务占收入比, n=20)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6CD6A0>
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute | IV |
---|---|---|---|---|---|---|---|---|---|
(-0.001, 0.236] | 0.000000 | 0.235948 | 45593 | 48452 | 0.940993 | 0.131963 | 0.336113 | 0.294560 | 0.019231 |
(0.236, 0.545] | 0.235953 | 0.544862 | 45434 | 48451 | 0.937731 | 0.074679 | 0.334940 | 0.310839 | 0.019231 |
(0.545, 329664.0] | 0.544864 | 329664.000000 | 44621 | 48451 | 0.920951 | -0.181979 | 0.328947 | 0.394601 | 0.019231 |
dfx5, ivx5,cutx5,woex5 =mono_bin(data.是否逾期, data.MonthlyIncome, n=10)
Empty DataFrame
Columns: [min]
Index: []
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x0000024ECE6FBF28>
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute | IV |
---|---|---|---|---|---|---|---|---|---|
(-0.001, 3400.0] | 0.0 | 3400.0 | 44952 | 48760 | 0.921903 | -0.168828 | 0.331387 | 0.392335 | 0.047012 |
(3400.0, 6850.0] | 3401.0 | 6850.0 | 44600 | 48145 | 0.926368 | -0.105123 | 0.328792 | 0.365238 | 0.047012 |
(6850.0, 3008750.0] | 6851.0 | 3008750.0 | 46096 | 48449 | 0.951433 | 0.337716 | 0.339821 | 0.242427 | 0.047012 |
# Binning with hand-picked cut points
def self_bin(Y, X, cat):
    good = Y.sum()
    bad = Y.count() - good
    d1 = pd.DataFrame({'X': X, 'Y': Y, 'Bucket': pd.cut(X, cat)})
    d2 = d1.groupby('Bucket', as_index=True)
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min')
    print("=" * 60)
    print(d4)
    woe = list(d4['woe'].round(3))
    return d4, iv, woe
# Discretize the remaining continuous variables
pinf = float('inf')  # positive infinity
ninf = float('-inf')  # negative infinity
cutx3 = [ninf, 0, 1, 3, 5, pinf]
cutx6 = [ninf, 1, 2, 3, 5, pinf]
cutx7 = [ninf, 0, 1, 3, 5, pinf]
cutx8 = [ninf, 0,1,2, 3, pinf]
cutx9 = [ninf, 0, 1, 3, pinf]
cutx10 = [ninf, 0, 1, 2, 3, 5, pinf]
dfx3, ivx3,woex3 = self_bin(data.是否逾期, data['逾期30到60天次数'], cutx3)
dfx6, ivx6 ,woex6= self_bin(data.是否逾期, data['未偿还贷款'], cutx6)
dfx7, ivx7,woex7 = self_bin(data.是否逾期, data['逾期90天次数'], cutx7)
dfx8, ivx8,woex8 = self_bin(data.是否逾期, data['抵押财产'], cutx8)
dfx9, ivx9,woex9 = self_bin(data.是否逾期, data['逾期60到89天次数'], cutx9)
dfx10, ivx10,woex10 = self_bin(data.是否逾期, data['家庭人数'], cutx10)
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute |
---|---|---|---|---|---|---|---|---|
(-inf, 0.0] | 0 | 0 | 117077 | 122020 | 0.959490 | 0.527540 | 0.863094 | 0.509273 |
(0.0, 1.0] | 1 | 1 | 13381 | 15744 | 0.849911 | -0.903415 | 0.098645 | 0.243458 |
(1.0, 3.0] | 2 | 3 | 4467 | 6279 | 0.711419 | -1.735033 | 0.032931 | 0.186689 |
(3.0, 5.0] | 4 | 5 | 606 | 1075 | 0.563721 | -2.381042 | 0.004467 | 0.048321 |
(5.0, inf] | 6 | 13 | 117 | 236 | 0.495763 | -2.654269 | 0.000863 | 0.012260 |
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute |
---|---|---|---|---|---|---|---|---|
(-inf, 1.0] | 0 | 1 | 4438 | 5322 | 0.833897 | -1.023817 | 0.032717 | 0.091078 |
(1.0, 2.0] | 2 | 2 | 5577 | 6162 | 0.905063 | -0.382525 | 0.041114 | 0.060272 |
(2.0, 3.0] | 3 | 3 | 7853 | 8519 | 0.921822 | -0.169958 | 0.057892 | 0.068617 |
(3.0, 5.0] | 4 | 5 | 22082 | 23622 | 0.934807 | 0.025661 | 0.162789 | 0.158665 |
(5.0, inf] | 6 | 58 | 95698 | 101729 | 0.940715 | 0.126966 | 0.705488 | 0.621368 |
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute |
---|---|---|---|---|---|---|---|---|
(-inf, 0.0] | 0 | 0 | 131008 | 137449 | 0.953139 | 0.375256 | 0.965794 | 0.663610 |
(0.0, 1.0] | 1 | 1 | 3396 | 5130 | 0.661988 | -1.965152 | 0.025035 | 0.178652 |
(1.0, 3.0] | 2 | 3 | 1041 | 2178 | 0.477961 | -2.725530 | 0.007674 | 0.117144 |
(3.0, 5.0] | 4 | 5 | 142 | 417 | 0.340528 | -3.298263 | 0.001047 | 0.028333 |
(5.0, inf] | 6 | 17 | 61 | 180 | 0.338889 | -3.305569 | 0.000450 | 0.012260 |
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute |
---|---|---|---|---|---|---|---|---|
(-inf, 0.0] | 0 | 0 | 48757 | 53172 | 0.916968 | -0.235478 | 0.359438 | 0.454873 |
(0.0, 1.0] | 1 | 1 | 48477 | 51191 | 0.946983 | 0.245347 | 0.357373 | 0.279621 |
(1.0, 2.0] | 2 | 2 | 29410 | 31155 | 0.943990 | 0.187261 | 0.216811 | 0.179786 |
(2.0, 3.0] | 3 | 3 | 5812 | 6230 | 0.932905 | -0.005120 | 0.042846 | 0.043066 |
(3.0, inf] | 4 | 54 | 3192 | 3606 | 0.885191 | -0.594782 | 0.023531 | 0.042654 |
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute |
---|---|---|---|---|---|---|---|---|
(-inf, 0.0] | 0 | 0 | 130993 | 138127 | 0.948352 | 0.272953 | 0.965683 | 0.735009 |
(0.0, 1.0] | 1 | 1 | 3905 | 5647 | 0.691518 | -1.830095 | 0.028788 | 0.179477 |
(1.0, 3.0] | 2 | 3 | 688 | 1415 | 0.486219 | -2.692457 | 0.005072 | 0.074902 |
(3.0, inf] | 4 | 11 | 62 | 165 | 0.375758 | -3.144914 | 0.000457 | 0.010612 |
============================================================
Bucket | min | max | sum | total | rate | woe | goodattribute | badattribute |
---|---|---|---|---|---|---|---|---|
(-inf, 0.0] | 0.0 | 0.0 | 81248 | 86234 | 0.942181 | 0.153553 | 0.598962 | 0.513703 |
(0.0, 1.0] | 1.0 | 1.0 | 24370 | 26291 | 0.926933 | -0.096812 | 0.179656 | 0.197919 |
(1.0, 2.0] | 2.0 | 2.0 | 17929 | 19500 | 0.919436 | -0.202612 | 0.132173 | 0.161859 |
(2.0, 3.0] | 3.0 | 3.0 | 8646 | 9479 | 0.912122 | -0.297501 | 0.063738 | 0.085823 |
(3.0, 5.0] | 4.0 | 5.0 | 3241 | 3605 | 0.899029 | -0.450836 | 0.023893 | 0.037503 |
(5.0, inf] | 6.0 | 20.0 | 214 | 245 | 0.873469 | -0.705330 | 0.001578 | 0.003194 |
corr = data.corr()  # correlation coefficients between all variables
#print(corr.index)
#xticks = ['x0','x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']  # x-axis labels
xticks = list(corr.index)
yticks = list(corr.index)  # y-axis labels
fig = plt.figure(figsize=(22,20))  # figsize controls how large the heatmap is drawn, e.g. figsize=(14,12)
ax1 = fig.add_subplot(1, 1, 1)
#sns.heatmap(corr, annot=True, cmap='PuRd' ,ax=ax1, annot_kws={'size': 16, 'weight': 'light','color': 'black'})
#sns.heatmap(corr, annot=True, cmap='YlGnBu' ,ax=ax1, annot_kws={'size': 16, 'weight': 'light','color': 'black'})
#sns.heatmap(corr, annot=True, cmap='rainbow' ,ax=ax1, annot_kws={'size': 16, 'weight': 'light','color': 'black'})
sns.heatmap(corr, annot=True, cmap='RdPu' ,ax=ax1, annot_kws={'size': 16, 'weight': 'light','color': 'green'})
### Draw the correlation heatmap
# cmap="YlGnBu" (or rainbow) sets the heatmap colours
ax1.set_xticklabels(xticks, rotation=90, fontsize=20)
ax1.set_yticklabels(yticks, rotation=0, fontsize=20)
plt.rcParams['font.sans-serif']=['SimHei']  # so that Chinese labels render correctly
plt.rcParams['axes.unicode_minus']=False  # so that minus signs render correctly
plt.show()
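IV (Information Value) is generally used to measure the predictive power of an independent variable.
Binning a field yields an IV value that represents the field's influence on the label. The larger the IV, the better the bins separate good from bad customers and the stronger the field's influence on the label.
list(data.columns)  # column names, used as axis labels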
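As a worked example, the sketch below recomputes the IV of two variables from the per-bin "sum" and "total" counts printed in the binning outputs of this post, using IV = Σ (goodattribute − badattribute) × woe. The function name and the dictionary layout are made up for illustration.

```python
import math


def information_value(good_per_bin, total_per_bin):
    """IV = sum over bins of (good share - bad share) * ln(good share / bad share)."""
    good = sum(good_per_bin)
    bad = sum(total_per_bin) - good
    iv = 0.0
    for g, t in zip(good_per_bin, total_per_bin):
        good_share = g / good
        bad_share = (t - g) / bad
        iv += (good_share - bad_share) * math.log(good_share / bad_share)
    return iv


# Counts copied from the "sum" and "total" columns of the binning outputs
bins = {
    '信用额度': ([35659, 35590, 34499, 29900], [36339, 36338, 36338, 36339]),
    '债务占收入比': ([45593, 45434, 44621], [48452, 48451, 48451]),
}

ivs = {name: information_value(g, t) for name, (g, t) in bins.items()}
for name, iv in ivs.items():
    print(name, round(iv, 6))
```

The first variable comes out near 0.989 and the second near 0.019, matching the IV column of the outputs: the bins of 信用额度 separate good and bad customers far better than those of 债务占收入比.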
['是否逾期','信用额度','age','逾期30到60天次数','债务占收入比','MonthlyIncome','未偿还贷款','逾期90天次数','抵押财产','逾期60到89天次数','家庭人数']
ivlist=[ivx1,ivx2,ivx3,ivx4,ivx5,ivx6,ivx7,ivx8,ivx9,ivx10]
ivlist
[0.9891738801650342,0.24117787840722144,0.7189254612784397,0.019231014490398168,0.04701224378739177,0.07968800751468878,0.8426781922043317,0.059857660209756414,0.5586891401396025,0.03472056480690539]
Forgive me for insisting that the chart be pink, hahaha~
ivlist=[ivx1,ivx2,ivx3,ivx4,ivx5,ivx6,ivx7,ivx8,ivx9,ivx10]  # IV of each variable
#xticks = list(data.columns)  # y-axis labels
index=['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']  # x-axis labels
fig1 = plt.figure(figsize=(14,8))  # figsize controls the size of the figure
ax1 = fig1.add_subplot(1, 1, 1)
x = np.arange(len(index))+1
#ax1.bar(x, ivlist, width=0.4,facecolor = 'hotpink')  # draw the bar chart
ax1.bar(x, ivlist, width=0.4,facecolor = 'lightcoral')  # draw the bar chart
ax1.set_xticks(x)
# Use the ten labels in index so that the number of labels matches the number of bars
ax1.set_xticklabels(index, rotation=90, fontsize=14)
ax1.set_ylabel('IV(Information Value)', fontsize=14)
# Add the numeric value above each bar
for a, b in zip(x, ivlist):
    plt.text(a, b + 0.01, '%.4f' % b, ha='center', va='bottom', fontsize=10)
plt.show()
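The Weight of Evidence (WOE) transformation lets a logistic regression model be converted into the standard scorecard format. The purpose of the WOE transformation is not to improve model quality. Some variables should not be included in the model, either because they add no value to it or because the error associated with their model coefficients is large. A standard credit scorecard can in fact be built without the WOE transformation; in that case the logistic regression model has to handle a larger number of independent variables. This makes the modelling procedure more complex, but the resulting scorecard is the same.
Before building the model, the selected variables need to be converted to their WoE values, which are then used for credit scoring.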
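Why WoE-encoded inputs lead to an additive scorecard can be written out as a short derivation. This is a sketch under the usual scorecard conventions rather than something stated in the reference article: the symbols $A$ (offset), $B$ (factor), $\beta_i$ (regression coefficients) and $w_i(x_i)$ (the WoE of the bin that $x_i$ falls into) are notation introduced here.

```latex
\begin{aligned}
\ln\frac{p_{\text{good}}}{1-p_{\text{good}}} &= \beta_0 + \sum_{i} \beta_i\, w_i(x_i) \\
\text{Score} &= A + B\,\ln(\text{odds})
             = \underbrace{(A + B\,\beta_0)}_{\text{base score}}
             + \sum_{i} \underbrace{B\,\beta_i\, w_i(x_i)}_{\text{points for variable } i}
\end{aligned}
```

The first term is a constant, and each remaining term depends only on the bin that one variable falls into, so every bin of every variable can be given a fixed number of points.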
# Drop the rows whose value in col lies outside the 1.5 * IQR fences
def outlier_processing(df, col):
    s = df[col]
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df = df[df[col] <= upper]
    df = df[df[col] >= lower]
    return df
data = pd.read_csv('MissingData.csv')
# Remove the outliers with age equal to 0
data = data[data['age'] > 0]
data = data[data['逾期30到60天次数'] < 90]  # remove outliers
data['是否逾期']=1-data['是否逾期']
Y = data['是否逾期']
X = data.iloc[:, 1:]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# print(Y_train)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
clasTest = test.groupby('是否逾期')['是否逾期'].count()
train.to_csv('TrainData.csv',index=False)
test.to_csv('TestData.csv',index=False)
print(train.shape)
print(test.shape)
(101747, 11)
(43607, 11)
import pandas as pd
import numpy as np
from pandas import Series,DataFrame
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import math
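# Replace each value of a variable with the WoE of the bin it falls into
def replace_woe(series, cut, woe):
    result = []
    i = 0
    while i < len(series):
        value = series[i]
        # Walk down from the last bin until value >= cut[j]; m ends up as the bin index.
        # Note: `>=` treats each bin as left-closed, while the pd.cut / pd.qcut buckets
        # are right-closed, so a value exactly on a cut point (e.g. 0) goes to the next bin up.
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        result.append(woe[m])
        i += 1
    return result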
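The nested loop in replace_woe is a search for the bin a value falls into. Below is a minimal sketch of the same lookup with the standard-library `bisect` module; the two helper names are made up for illustration. It uses the cut points defined for 逾期30到60天次数 and the rounded WoE values this post reports for that variable, and compares the `>=` rule of the loop with the right-closed convention of the `pd.cut` / `pd.qcut` buckets.

```python
import bisect

ninf, pinf = float('-inf'), float('inf')
cutx3 = [ninf, 0, 1, 3, 5, pinf]                 # cut points for 逾期30到60天次数
woex3 = [0.528, -0.903, -1.735, -2.381, -2.654]  # rounded WoE per bin, as reported in this post


def bin_left_closed(value, cut):
    """Bin index under the `value >= cut[j]` rule: bins are [cut[j], cut[j+1])."""
    return min(bisect.bisect_right(cut, value) - 1, len(cut) - 2)


def bin_right_closed(value, cut):
    """Bin index when bins are (cut[j], cut[j+1]], as in the printed bucket labels."""
    return max(bisect.bisect_left(cut, value) - 1, 0)


for v in (0, 1, 4, 6):
    print(v, woex3[bin_left_closed(v, cutx3)], woex3[bin_right_closed(v, cutx3)])
```

Values strictly between two cut points get the same WoE either way, but a value sitting exactly on a cut point, such as 0 or 1 here, is assigned to different bins by the two conventions, so it is worth checking which one is intended.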
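Each variable is replaced by its WoE values and the result is saved to a CSV file (trainWoeData.csv for the training set and TestWoeData.csv for the test set):
The data is split into two parts, one for training and one for testing.
The attribute values of both the training part and the test part are converted to WOE.
A logistic regression is fitted on the training set to obtain the model, which is then evaluated on the test set.
# Replace the values in TrainData with WoE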
data=pd.read_csv('TrainData.csv')
#print(data.head())
data['信用额度'] = Series(replace_woe(data['信用额度'], cutx1, woex1))
#print(data['信用额度'][1400:1500])
data['age'] = Series(replace_woe(data['age'], cutx2, woex2))
data['逾期30到60天次数'] = Series(replace_woe(data['逾期30到60天次数'], cutx3, woex3))
data['债务占收入比'] = Series(replace_woe(data['债务占收入比'], cutx4, woex4))
data['MonthlyIncome'] = Series(replace_woe(data['MonthlyIncome'], cutx5, woex5))
data['未偿还贷款'] = Series(replace_woe(data['未偿还贷款'], cutx6, woex6))
data['逾期90天次数'] = Series(replace_woe(data['逾期90天次数'], cutx7, woex7))
data['抵押财产'] = Series(replace_woe(data['抵押财产'], cutx8, woex8))
data['逾期60到89天次数'] = Series(replace_woe(data['逾期60到89天次数'], cutx9, woex9))
data['家庭人数'] = Series(replace_woe(data['家庭人数'], cutx10, woex10))
data.to_csv('trainWoeData.csv', index=False)
# Replace the values in TestData with WoE
test= pd.read_csv('TestData.csv')
# Replace with WoE
test['信用额度'] = Series(replace_woe(test['信用额度'], cutx1, woex1))
test['age'] = Series(replace_woe(test['age'], cutx2, woex2))
test['逾期30到60天次数'] = Series(replace_woe(test['逾期30到60天次数'], cutx3, woex3))
test['债务占收入比'] = Series(replace_woe(test['债务占收入比'], cutx4, woex4))
test['MonthlyIncome'] = Series(replace_woe(test['MonthlyIncome'], cutx5, woex5))
test['未偿还贷款'] = Series(replace_woe(test['未偿还贷款'], cutx6, woex6))
test['逾期90天次数'] = Series(replace_woe(test['逾期90天次数'], cutx7, woex7))
test['抵押财产'] = Series(replace_woe(test['抵押财产'], cutx8, woex8))
test['逾期60到89天次数'] = Series(replace_woe(test['逾期60到89天次数'], cutx9, woex9))
test['家庭人数'] = Series(replace_woe(test['家庭人数'], cutx10, woex10))
test.to_csv('TestWoeData.csv', index=False)
# Training part
matplotlib.rcParams['axes.unicode_minus'] = False
# Load the data
data = pd.read_csv('trainWoeData.csv')
# Dependent variable
Y=data['是否逾期']
# Independent variables, dropping those with little influence on the dependent variable
X=data.drop(['是否逾期','债务占收入比','MonthlyIncome', '未偿还贷款','抵押财产','家庭人数'],axis=1)
X1=sm.add_constant(X)
logit=sm.Logit(Y,X1)
result=logit.fit()
print(result.params)
# Test part
test = pd.read_csv('TestWoeData.csv')
Y_test = test['是否逾期']
# Drop the same columns as for training so that the test features match the fitted model
X_test = test.drop(['是否逾期', '债务占收入比', 'MonthlyIncome', '未偿还贷款','抵押财产', '家庭人数'], axis=1)
X3 = sm.add_constant(X_test)
resu = result.predict(X3)
fpr, tpr, threshold = roc_curve(Y_test, resu)
rocauc = auc(fpr, tpr)
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % rocauc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.show()
Optimization terminated successfully.
Current function value: 0.186940
Iterations 8
const 9.555259
信用额度 0.630777
age 0.511745
逾期30到60天次数 1.035706
逾期90天次数 1.747674
逾期60到89天次数 1.085101
dtype: float64
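The model is evaluated with the ROC curve and AUC. The plot above is the ROC curve. The AUC is 0.81, which suggests the model separates good and bad customers reasonably well.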
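To make the AUC figure easier to interpret: it equals the probability that a randomly chosen positive example gets a higher predicted score than a randomly chosen negative one. The sketch below computes it that way on a tiny made-up example. The function name and the data are hypothetical, and this pairwise form is meant for illustration only, since `roc_curve` and `auc` are far more efficient on real data.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

value = auc_by_pairs(labels, scores)
print(value)
```

Three of the four positive/negative pairs are ordered correctly here, so the result is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.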
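Computing the scores
# Compute the scores
# coe holds the logistic regression coefficients, intercept first
# (hard-coded here; the values differ slightly from the parameters printed above)
coe=[9.738849,0.638002,0.505995,1.032246,1.790041,1.131956]
# Take 600 as the base score, PDO = 20 (every 20 extra points doubles the good/bad odds), and good/bad odds of 20.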
p = 20 / math.log(2)
q = 600 - 20 * math.log(20) / math.log(2)
baseScore = round(q + p * coe[0], 0)
baseScore
795.0
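The scaling constants above can be checked with a small worked example. In the sketch below, `factor` and `offset` are the same quantities as `p` and `q` in the code, while the constant names and `score_from_odds` are made up for illustration. It confirms that odds of 20 map to 600 points, that doubling the odds adds 20 points, and that the intercept gives the base score of 795.

```python
import math

PDO = 20          # points needed to double the good/bad odds
BASE_SCORE = 600  # score assigned at the base odds
BASE_ODDS = 20    # good/bad odds at the base score

factor = PDO / math.log(2)                             # p in the code above
offset = BASE_SCORE - factor * math.log(BASE_ODDS)     # q in the code above


def score_from_odds(odds):
    """Score = offset + factor * ln(odds)."""
    return offset + factor * math.log(odds)


print(score_from_odds(20))   # base odds
print(score_from_odds(40))   # odds doubled

# Base score from the hard-coded intercept coe[0]
intercept = 9.738849
base = round(offset + factor * intercept, 0)
print(base)

# Points for one bin: coefficient * WoE * factor (first bin of 信用额度)
first_bin_points = round(0.638002 * 1.322 * factor, 0)
print(first_bin_points)
```

Because the score is linear in ln(odds), the intercept contributes a constant and each variable contributes coefficient × WoE × factor points.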
# Compute the score of every bin of one variable
def get_score(coe, woe, factor, label):
    scores = []
    for w in woe:
        score = round(coe * w * factor, 0)
        scores.append(score)
    print(list(data.columns)[label], 'woe:', woe, 'score:', scores)
    return scores
# Scores for each variable
x1 = get_score(coe[1], woex1, p,1)
x2 = get_score(coe[2], woex2, p,2)
x3 = get_score(coe[3], woex3, p,3)
x7 = get_score(coe[4], woex7, p,7)
x9 = get_score(coe[5], woex9, p,9)
信用额度 woe: [1.322, 1.225, 0.294, -1.102] score: [24.0, 23.0, 5.0, -20.0]
age woe: [-0.562, -0.369, -0.258, -0.216, -0.094, 0.211, 0.502, 0.897, 1.104] score: [-8.0, -5.0, -4.0, -3.0, -1.0, 3.0, 7.0, 13.0, 16.0]
逾期30到60天次数 woe: [0.528, -0.903, -1.735, -2.381, -2.654] score: [16.0, -27.0, -52.0, -71.0, -79.0]
逾期90天次数 woe: [0.375, -1.965, -2.726, -3.298, -3.306] score: [19.0, -101.0, -141.0, -170.0, -171.0]
逾期60到89天次数 woe: [0.273, -1.83, -2.692, -3.145] score: [9.0, -60.0, -88.0, -103.0]
Scoring criteria
# Look up the score of each value of a variable
def compute_score(series, cut, score):
    result = []
    i = 0
    while i < len(series):
        value = series[i]
        # Same bin lookup as in replace_woe
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        result.append(score[m])
        i += 1
    return result
test1 = pd.read_csv('TestData.csv')
test1['BaseScore']=Series(np.zeros(len(test1)))+baseScore
test1['x1'] = Series(compute_score(test1['信用额度'], cutx1, x1))
test1['x2'] = Series(compute_score(test1['age'], cutx2, x2))
test1['x3'] = Series(compute_score(test1['逾期30到60天次数'], cutx3, x3))
test1['x7'] = Series(compute_score(test1['逾期90天次数'], cutx7, x7))
test1['x9'] = Series(compute_score(test1['逾期60到89天次数'], cutx9, x9))
test1['Score'] = test1['x1'] + test1['x2'] + test1['x3'] + test1['x7'] +test1['x9'] + baseScore
test1.to_csv('ScoreData.csv', index=False)
x1 to x9 are the scores of the corresponding fields. Adding them to the base score of 795 gives the final score.
test1.head()
| | 是否逾期 | 信用额度 | age | 逾期30到60天次数 | 债务占收入比 | MonthlyIncome | 未偿还贷款 | 逾期90天次数 | 抵押财产 | 逾期60到89天次数 | 家庭人数 | BaseScore | x1 | x2 | x3 | x7 | x9 | Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.617352 | 41 | 4 | 0.167589 | 15000.0 | 14 | 1 | 1 | 0 | 2.0 | 795.0 | -20.0 | -4.0 | -71.0 | -141.0 | -60.0 | 499.0 |
1 | 1 | 0.084176 | 58 | 0 | 0.388851 | 14583.0 | 12 | 0 | 3 | 0 | 1.0 | 795.0 | 23.0 | 3.0 | -27.0 | -101.0 | -60.0 | 633.0 |
2 | 1 | 0.307757 | 47 | 0 | 0.181313 | 18900.0 | 10 | 0 | 2 | 0 | 3.0 | 795.0 | 5.0 | -3.0 | -27.0 | -101.0 | -60.0 | 609.0 |
3 | 1 | 0.003265 | 65 | 0 | 0.304616 | 6000.0 | 6 | 0 | 2 | 0 | 0.0 | 795.0 | 24.0 | 13.0 | -27.0 | -101.0 | -60.0 | 644.0 |
4 | 1 | 0.018517 | 38 | 0 | 5870.000000 | 2554.0 | 6 | 0 | 1 | 0 | 0.0 | 795.0 | 24.0 | -5.0 | -27.0 | -101.0 | -60.0 | 626.0 |
Reference blogs
[1]: http://math.stackexchange.com/
[2]: https://www.jianshu.com/p/159f381c661d
I will add to this post if I look into anything new.