This article shows how to use logistic regression, a common data mining algorithm, to predict whether a borrower will fall behind on loan repayments.
Contents
- 1. Background
- 2. Workflow
- 3. Goals
- 4. Data description
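1. Background
When an applicant applies to a P2P platform for a loan, the platform has the customer fill in a loan application form, online or offline, to collect basic information, and supplements it with data from third parties such as credit bureaus. These attributes are used to build a logistic regression model that predicts whether the applicant will default, and the platform uses that prediction to decide whether to issue the loan.
The algorithm needs to build a model from historical loan data to make this prediction.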
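As a minimal sketch of what such a model computes: logistic regression passes a weighted sum of the applicant's features through the sigmoid function to obtain a probability, then applies a threshold. The coefficients, bias, and feature values below are made-up numbers for illustration, not values fitted on the Lending Club data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one applicant's numeric features (assumed values).
weights = np.array([0.8, -1.2])
bias = 0.5
applicant = np.array([1.0, 0.5])

score = weights @ applicant + bias   # linear part: w . x + b
prob_repaid = sigmoid(score)         # estimated probability of class 1 (repaid)
decision = int(prob_repaid >= 0.5)   # default decision threshold of 0.5
print(round(float(prob_repaid), 3), decision)
```

Raising the threshold above 0.5 makes the lender stricter: fewer applicants are approved, at the cost of turning away some who would have repaid.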
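2. Workflow
- Data processing (cleaning, filtering, deletion, feature engineering; splitting the data into training and test sets; selecting the explanatory variables used to compute the prediction yhat)
- Building the model
- Model validation (plotting the ROC curve and computing the AUC)
- Prediction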
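The four steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with `make_classification`, not the Lending Club table, and the split ratio and `max_iter` value are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the cleaned loan table (assumption).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.15, 0.85], random_state=0)

# Step 1: hold out a test set, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 2: build the model on the training set only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 3: validate with the ROC curve and the AUC on the held-out data.
proba = model.predict_proba(X_test)[:, 1]          # P(class 1) for each test row
fpr, tpr, thresholds = roc_curve(y_test, proba)    # points of the ROC curve
auc = roc_auc_score(y_test, proba)
print(f"AUC on the held-out test set: {auc:.3f}")

# Step 4: predict labels for unseen applicants.
predicted = model.predict(X_test)
```

Evaluating on held-out data matters because a score computed on the training rows overstates how well the model will do on new applicants.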
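3. Goals
- Be able to clean data and do feature engineering
- Know how to handle imbalanced samples
4. Data description
The data set contains loan records from the Lending Club platform, with 52 variables and 39,522 records.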
(1) Data preprocessing
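1. Inspecting the data
import warnings
warnings.filterwarnings("ignore")
# Drop columns that carry no useful signal (desc, url, ...) and save the rest to a new CSV file.
import pandas as pd
loans_2020 = pd.read_csv("./LoanStats3a.csv", skiprows=1)  # the first row of the file is a text note, so skip it
half_count = len(loans_2020) / 2  # half the number of rows (19767.5 in this run)
loans_2020 = loans_2020.dropna(thresh=half_count, axis=1)  # keep only columns with at least half_count non-null values
loans_2020 = loans_2020.drop(["desc", "url"], axis=1)  # drop the free-text description and the URL columns
loans_2020.to_csv("loans_2020.csv", index=False)  # write to loans_2020.csv without the index
# Print the first record and the column count to spot features that are obviously useless.
loans_2020 = pd.read_csv("loans_2020.csv")
print("First row of the data\n", loans_2020.iloc[0])
print("Number of columns in the raw data =", loans_2020.shape[1])  # shape[0] is the row count, shape[1] the column count
Output:
First row of the data
id 1077501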
member_id 1.2966e+06
loan_amnt 5000
funded_amnt 5000
funded_amnt_inv 4975
term 36 months
int_rate 10.65%
installment 162.87
grade B
sub_grade B2
emp_title NaN
emp_length 10+ years
home_ownership RENT
annual_inc 24000
verification_status Verified
issue_d Dec-11
loan_status Fully Paid
pymnt_plan n
purpose credit_card
title Computer
zip_code 860xx
addr_state AZ
dti 27.65
delinq_2yrs 0
earliest_cr_line Jan-85
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.70%
total_acc 9
initial_list_status f
out_prncp 0
out_prncp_inv 0
total_pymnt 5863.16
total_pymnt_inv 5833.84
total_rec_prncp 5000
total_rec_int 863.16
total_rec_late_fee 0
recoveries 0
collection_recovery_fee 0
last_pymnt_d Jan-15
last_pymnt_amnt 171.62
last_credit_pull_d Nov-16
collections_12_mths_ex_med 0
policy_code 1
application_type INDIVIDUAL
acc_now_delinq 0
chargeoff_within_12_mths 0
delinq_amnt 0
pub_rec_bankruptcies 0
tax_liens 0
Name: 0, dtype: object
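Number of columns in the raw data = 52
Common sense says that id and member_id have nothing to do with whether a loan should be granted. funded_amnt and funded_amnt_inv are the amounts funded after the lending decision has been made, so they are not known at prediction time and should not be used either. The features to remove were therefore chosen together with the product manager and the rest of the team, and the code in the next step drops them.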
2. Dropping useless features
loans_2020 = loans_2020.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans_2020 = loans_2020.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
loans_2020 = loans_2020.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print("Number of columns now =", loans_2020.shape[1])
Output:
Number of columns now = 32
The table had 52 columns before this step; 20 were dropped.
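The two column filters used so far (dropping sparse columns with `dropna(thresh=...)`, then dropping named columns) are easy to try on a toy frame. The frame below is invented for illustration; only the column names `id`, `url`, and `loan_amnt` echo the real data.

```python
import numpy as np
import pandas as pd

# Invented four-row frame; "mostly_empty" is a hypothetical sparse column.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "loan_amnt": [5000, 2500, 2400, 10000],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],  # only 1 of 4 values present
    "url": ["u1", "u2", "u3", "u4"],
})

# thresh is the minimum number of NON-null values a column needs in order to be kept.
min_non_null = len(toy) // 2
toy = toy.dropna(thresh=min_non_null, axis=1)   # removes "mostly_empty"

# Identifier and link columns carry no predictive signal, so drop them by name.
toy = toy.drop(["id", "url"], axis=1)
print(list(toy.columns))
```

Note that `thresh` counts values that are present, not values that are missing, which is easy to get backwards.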
3. Determining the current loan status
print(loans_2020["loan_status"].value_counts())  # count how often each status occurs
Output:
Fully Paid 33693
Charged Off 5612
Current 201
Late (31-120 days) 10
In Grace Period 9
Late (16-30 days) 5
Default 1
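Turn the status into a binary label, encoded as 0 and 1:
# Keep only the two final outcomes, then encode them as 0/1.
loans_2020 = loans_2020[(loans_2020["loan_status"] == "Fully Paid") | (loans_2020["loan_status"] == "Charged Off")]
status_replace = {"loan_status": {"Fully Paid": 1, "Charged Off": 0}}
# The outer key is the column name; the inner dict maps "Fully Paid" to 1 (repaid) and "Charged Off" to 0 (default).
loans_2020 = loans_2020.replace(status_replace)  # find-and-replace within that column
loans_2020["loan_status"]
Output: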
0 1
1 0
2 1
…………………………
39530 1
Name: loan_status, Length: 39305, dtype: int64
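The same filter-then-encode step can also be written with `isin` and `Series.map`, which some readers find easier to follow than a nested `replace` dictionary. This is an alternative sketch on an invented four-element Series, not the article's own code.

```python
import pandas as pd

# Invented sample of loan statuses.
status = pd.Series(["Fully Paid", "Charged Off", "Current", "Fully Paid"])

# Only the two final outcomes become labels; everything else is filtered out.
label_map = {"Fully Paid": 1, "Charged Off": 0}
kept = status[status.isin(list(label_map))]

# map() looks each value up in the dictionary; astype(int) makes the dtype explicit.
labels = kept.map(label_map).astype(int)
print(labels.tolist())
```

Filtering first matters: statuses such as "Current" have no known outcome yet, so they cannot be labeled either way.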
4. Dropping features that take only one value: a constant column carries no information for a classifier.
# Drop columns that contain a single distinct value.
orig_columns = loans_2020.columns  # all column names
drop_columns = []
for col in orig_columns:
    # Ignore missing values, then collect the distinct values.
    col_series = loans_2020[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2020 = loans_2020.drop(drop_columns, axis=1)
print(drop_columns)
print(30 * "-")
print(loans_2020.shape)
loans_2020.to_csv("filtered_loans_2020", index=False)
Output:
['initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
------------------------------
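(39305, 24)
24 columns remain (23 features plus the loan_status label).
The features and the label have now been selected, but the 24 columns still need cleaning: missing values, punctuation, percent signs, and string values. The first step is to count the missing values in each column.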
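The constant-column filter from step 4 can be expressed more compactly with `DataFrame.nunique()`, which ignores missing values by default just like the `dropna().unique()` loop. The toy frame below is invented; `policy_code` and `tax_liens` merely reuse names from the output above.

```python
import numpy as np
import pandas as pd

# Invented frame: two constant columns (one with a gap) and one informative column.
toy = pd.DataFrame({
    "policy_code": [1, 1, 1, 1],
    "tax_liens": [0.0, np.nan, 0.0, 0.0],
    "loan_amnt": [5000, 2500, 2400, 10000],
})

n_unique = toy.nunique()                              # distinct non-null values per column
constant_cols = n_unique[n_unique == 1].index.tolist()
toy = toy.drop(columns=constant_cols)
print(constant_cols, list(toy.columns))
```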
5. Handling missing values
# 5. Handle missing values
import pandas as pd
loans = pd.read_csv("filtered_loans_2020")
null_counts = loans.isnull().sum()  # number of missing values in each column
print(null_counts)
Output:
loan_amnt 0
term 0
int_rate 0
installment 0
emp_length 1073
home_ownership 0
annual_inc 0
verification_status 0
loan_status 0
pymnt_plan 0
purpose 0
title 11
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
last_credit_pull_d 1
pub_rec_bankruptcies 449
dtype: int64
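The counts show that title and revol_util have only a few missing values, so dropping those rows has little effect on the data. pub_rec_bankruptcies has many more (449), which suggests the field was poorly recorded, so the whole column is dropped. All remaining rows with a missing value, including the 1,073 rows without emp_length, are then removed with dropna. The data types of the remaining columns are checked at the same time.
# Drop the pub_rec_bankruptcies column, then drop every row that still has a missing value.
loans = loans.drop("pub_rec_bankruptcies", axis=1)
loans = loans.dropna(axis=0)
# Count how many columns are of type object, int, and float.
print(loans.dtypes.value_counts())
Output: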
object 12
float64 10
int64 1
dtype: int64
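The order of the two drops matters: removing a sparse column first saves rows that a plain `dropna` would otherwise discard. A small sketch on invented data (the column names are borrowed from the table above, the values are not):

```python
import numpy as np
import pandas as pd

# Invented frame with gaps in three columns.
toy = pd.DataFrame({
    "emp_length": ["10+ years", None, "3 years", "1 year"],
    "revol_util": ["83.70%", "9.40%", None, "21%"],
    "pub_rec_bankruptcies": [0.0, np.nan, np.nan, np.nan],
    "loan_amnt": [5000, 2500, 2400, 10000],
})

null_counts = toy.isnull().sum()                 # missing values per column

# Column first, rows second: the sparse column no longer costs us any rows.
cleaned = toy.drop("pub_rec_bankruptcies", axis=1).dropna(axis=0)

# Rows only: every row with a gap in ANY column is lost.
rows_only = toy.dropna(axis=0)

print(null_counts.to_dict())
print(len(cleaned), len(rows_only))
```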
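6. Converting data types
The columns to focus on are the string (object) ones: scikit-learn estimators need numeric input, so these columns have to be converted.
# Select only the string (object) columns with select_dtypes.
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])
Output: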
term 36 months
int_rate 10.65%
emp_length 10+ years
home_ownership RENT
verification_status Verified
pymnt_plan n
purpose credit_card
title Computer
addr_state AZ
earliest_cr_line Jan-85
revol_util 83.70%
last_credit_pull_d Nov-16
Name: 0, dtype: object
# Candidates for one-hot encoding: list the distinct values of each column.
cols = ["home_ownership", "verification_status", "emp_length", "term", "addr_state"]
for c in cols:
    print(loans[c].value_counts())
Output:
RENT 18237
MORTGAGE 17035
OWN 2805
OTHER 96
NONE 1
Name: home_ownership, dtype: int64
Not Verified 16182
Verified 12251
Source Verified 9741
Name: verification_status, dtype: int64
10+ years 8794
< 1 year 4492
2 years 4339
3 years 4052
4 years 3397
5 years 3262
1 year 3182
6 years 2201
7 years 1747
8 years 1463
9 years 1245
Name: emp_length, dtype: int64
36 months 27980
60 months 10194
Name: term, dtype: int64
CA 6876
NY 3644
FL 2739
TX 2657
NJ 1799
IL 1478
PA 1470
VA 1355
GA 1342
MA 1278
OH 1176
MD 1019
AZ 824
WA 788
CO 758
NC 739
CT 725
MI 688
MO 654
MN 589
NV 477
SC 457
OR 431
WI 429
AL 424
LA 422
KY 320
OK 292
KS 257
UT 248
AR 233
DC 211
RI 196
NM 180
HI 168
WV 167
NH 160
DE 109
MT 78
WY 78
AK 77
SD 61
VT 54
MS 19
TN 16
ID 6
IA 5
NE 1
Name: addr_state, dtype: int64
# purpose and title carry similar information; title has far too many distinct values, so it is discarded.
print(loans["purpose"].value_counts())
print(35 * "--")
print(loans["title"].value_counts())
Output:
debt_consolidation 18057
credit_card 4927
other 3761
home_improvement 2846
major_purchase 2103
small_business 1745
car 1489
wedding 924
medical 665
moving 551
house 364
vacation 347
educational 300
renewable_energy 95
Name: purpose, dtype: int64
----------------------------------------------------------------------
Debt Consolidation 2122
Debt Consolidation Loan 1670
Personal Loan 625
Consolidation 502
debt consolidation 483
Credit Card Consolidation 348
Home Improvement 343
Debt consolidation 320
Small Business Loan 310
Credit Card Loan 302
Personal 296
Consolidation Loan 252
Home Improvement Loan 237
personal loan 224
Wedding Loan 207
personal 206
Loan 205
consolidation 193
Car Loan 192
Other Loan 177
Wedding 151
Credit Card Payoff 149
Credit Card Refinance 140
Major Purchase Loan 136
Consolidate 126
Medical 114
Credit Card 111
home improvement 105
My Loan 90
Credit Cards 90
...
Business Loan - SEO 1
The Stock Loan 1
a little help from my friends 1
Education is Money 1
J.L. 1
Debt Consolidation for a new beginning 1
reorganize finances 1
2010 Success 1
Photography Studio Startup 1
Home improvement/ consolidate debt 1
IN THE GREEN 1
Cosolidate Loan 1
consolidate 1 1
Going credit free, marrying my girlfriend and visiting home! 1
Paying for Dream Vacation! 1
Knocking out Debt 1
purchase home appliances and remodeling 1
Consolidate 2 high interest credit cards 1
Legal Expenses Loan 1
Eliminating credit card 1
katie's personal loan 1
Sep-11 1
Weddings are expensive 1
SCOOTER 1
Baby1 1
Medical Transcriptionist Course Fees 1
Surgery Expenses 1
our new start 1
credit cards be gone 1
TGM VACATION 1
Name: title, Length: 18933, dtype: int64
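emp_length is converted with a nested mapping dictionary: the outer key is the column name, and the inner dictionary maps each string to a number, for example "10+ years": 10.
# replace() applies the mapping further down; the % signs in the interest rate and in revol_util
# (share of revolving credit in use) are stripped and the columns cast to float with astype.
# The keys must match the data exactly ("10+ years", not "10+year"), otherwise the value is left unchanged.
mapping_dict = {"emp_length": {"10+ years": 10, "9 years": 9, "8 years": 8, "7 years": 7, "6 years": 6, "5 years": 5, "4 years": 4, "3 years": 3, "2 years": 2, "1 year": 1, "< 1 year": 0, "n/a": 0}}
# Drop a few more features that are not used.
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
# Strip the trailing % sign and convert to float.
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)
loans.iloc[0]
Output:
loan_amnt 5000
term 36 months
int_rate 10.65
installment 162.87
emp_length 10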
home_ownership RENT
annual_inc 24000
verification_status Verified
loan_status 1
pymnt_plan n
purpose credit_card
dti 27.65
delinq_2yrs 0
inq_last_6mths 1
open_acc 3
pub_rec 0
revol_bal 13648
revol_util 83.7
total_acc 9
Name: 0, dtype: object
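A mapping key that does not match the data exactly is silently ignored by `replace`, so the string stays in the column. One way to catch this, sketched here on invented two-row data, is `Series.map`, which returns NaN for any value missing from the dictionary. The dictionaries below are shortened for illustration.

```python
import pandas as pd

# Invented two-row sample using the same string formats as the loan data.
toy = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%"],
    "emp_length": ["10+ years", "< 1 year"],
})

# Percent strings: strip the trailing sign, then cast to float.
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype(float)

# A dictionary with a misspelled key: "10+year" does not match "10+ years".
typo_map = {"10+year": 10, "< 1 year": 0}
unmapped = toy["emp_length"].map(typo_map).isnull().sum()   # counts values the map missed

# The corrected dictionary maps every value, so nothing is left as NaN.
good_map = {"10+ years": 10, "< 1 year": 0}
toy["emp_length"] = toy["emp_length"].map(good_map)
print(unmapped, toy["emp_length"].tolist(), toy["int_rate"].tolist())
```

Checking `isnull().sum()` right after the mapping is a cheap guard against this kind of typo.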
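The remaining categorical features are converted to numeric indicator columns with pandas' get_dummies() function (one-hot encoding), so that each value of a feature becomes its own column.
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
# columns= forces every listed column to be encoded; dtype=int yields 0/1 integers.
dummy_df = pd.get_dummies(loans[cat_columns], columns=cat_columns, dtype=int)
loans = pd.concat([loans, dummy_df], axis=1)
# Drop the original columns now that they are encoded, together with pymnt_plan.
loans = loans.drop(cat_columns + ["pymnt_plan"], axis=1)
loans.to_csv("cleaned_loans_2020.csv", index=False)
loans = pd.read_csv("cleaned_loans_2020.csv")  # the cleaned data now contains only float and int columns
print(loans.info())
Output: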
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38174 entries, 0 to 38173
Data columns (total 48 columns):
loan_amnt 38174 non-null float64
int_rate 38174 non-null float64
installment 38174 non-null float64
annual_inc 38174 non-null float64
loan_status 38174 non-null int64
home_ownership_MORTGAGE 38174 non-null int64
home_ownership_NONE 38174 non-null int64
home_ownership_OTHER 38174 non-null int64
home_ownership_OWN 38174 non-null int64
………………
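To see what one-hot encoding does to a single column, here is a small sketch on an invented three-row frame. The `dtype=int` argument is a choice made here so the indicators print as 0/1; recent pandas versions otherwise return booleans.

```python
import pandas as pd

# Invented sample with one categorical and one numeric column.
toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "loan_amnt": [5000, 2500, 2400],
})

# One indicator column per distinct value, named <column>_<value>.
dummies = pd.get_dummies(toy[["term"]], dtype=int)

# Replace the string column with its indicator columns.
encoded = pd.concat([toy.drop(columns=["term"]), dummies], axis=1)
print(encoded)
```

The original string column has to be removed after encoding; otherwise the model would still receive text it cannot use.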
Model training:
from sklearn.linear_model import LogisticRegression  # classifier
lr = LogisticRegression()  # logistic regression estimator
cols = loans.columns  # all 48 columns of the cleaned table
train_cols = cols.drop("loan_status")  # every column except the label
features = loans[train_cols]  # feature matrix (47 columns)
target = loans["loan_status"]  # label vector
lr.fit(features, target)  # train the model
predictions = lr.predict(features)  # predict on the training data itself
lr.predict_proba(features)  # class probabilities: column 0 is P(default), column 1 is P(repaid)
Output:
array([[0.23870984, 0.76129016],
       [0.35466694, 0.64533306],
       [0.32152312, 0.67847688],
       ...,
       [0.30736624, 0.69263376],
       [0.10044728, 0.89955272],
       [0.09492527, 0.90507473]])
predictions[:10]  # 1 means the loan was repaid
Output:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
# Build the confusion matrix
# False positives (FP): negatives predicted as positive
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
print(fp)
print("----------------------------------------")
# True positives (TP): positives predicted as positive
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
print(tp)
print("----------------------------------------")
# False negatives (FN): positives predicted as negative
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
print(fn)
print("----------------------------------------")
# True negatives (TN): negatives predicted as negative
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
print(tn)
Output:
5357
----------------------------------------
32788
----------------------------------------
21
----------------------------------------
8
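The four counts can also be obtained in one call with `sklearn.metrics.confusion_matrix`. As a self-contained check, the sketch below rebuilds label and prediction arrays that have exactly the counts printed above (FP 5357, TP 32788, FN 21, TN 8) and derives the two rates from them; the arrays are reconstructed for illustration and are not the model's real row-by-row output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Counts taken from the output above.
tp, fp, fn, tn = 32788, 5357, 21, 8

# Rebuild arrays with exactly these counts (row order is arbitrary).
y_true = np.concatenate([np.ones(tp), np.zeros(fp), np.ones(fn), np.zeros(tn)]).astype(int)
y_pred = np.concatenate([np.ones(tp), np.ones(fp), np.zeros(fn), np.zeros(tn)]).astype(int)

# For binary labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP.
tn_, fp_, fn_, tp_ = confusion_matrix(y_true, y_pred).ravel()

tpr = tp_ / (tp_ + fn_)   # share of repaid loans predicted as repaid
fpr = fp_ / (fp_ + tn_)   # share of defaulted loans predicted as repaid
print(f"TPR = {tpr:.4f}, FPR = {fpr:.4f}")
```

With these counts the script prints TPR = 0.9994 and FPR = 0.9985: nearly every loan, good or bad, is predicted as repaid.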
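Model evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict  # cross-validated predictions
lr = LogisticRegression()
predictions = cross_val_predict(lr, features, target, cv=10)
predictions = pd.Series(predictions)
print(predictions[:1000])
# Recompute the confusion-matrix counts from the cross-validated predictions.
fp = ((predictions == 1) & (loans["loan_status"] == 0)).sum()
tp = ((predictions == 1) & (loans["loan_status"] == 1)).sum()
fn = ((predictions == 0) & (loans["loan_status"] == 1)).sum()
tn = ((predictions == 0) & (loans["loan_status"] == 0)).sum()
# Rates:
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)  # true positive rate
print(fpr)  # false positive rate
Output: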
0.7121521533725502
0.4648648648648649
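Both TPR and FPR are very high: the confusion-matrix counts above give TPR = 32788 / (32788 + 21) ≈ 0.9994 and FPR = 5357 / (5357 + 8) ≈ 0.9985. In other words, the model approves almost every applicant, so it is useless as a classifier. Note that the two values printed just above are identical to the output of the weighted run that follows. The next step uses class weights, which assign a different weight to each class (not to each feature), and then checks the effect again.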
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
"""
The class weights can be chosen freely.
Class 0 (default) gets weight 5.
Class 1 (repaid) gets weight 1.
"""
penalty = {0: 5, 1: 1}
lr = LogisticRegression(class_weight=penalty)
kf = 10  # number of cross-validation folds
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)
Output:
0.7121521533725502
0.4648648648648649
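Because the classes are imbalanced, a classifier tends to assign every sample to the majority class. There are several remedies. One is data augmentation, which increases the number of minority-class samples; but the extra data must either be collected or synthesized, so this is hard in practice. Another is class weighting: giving the minority class a larger weight in the hope that the model reaches a more balanced state.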
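The effect of class weighting can be observed on synthetic data. The sketch below is an illustration under stated assumptions: the data comes from `make_classification` with roughly 15% of samples in class 0 (not from the loan table), and three settings are compared: no weights, the `{0: 5, 1: 1}` dictionary used above, and scikit-learn's built-in `"balanced"` option.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced data: about 15% class 0 (default), 85% class 1 (repaid).
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                           weights=[0.15, 0.85], flip_y=0.05, random_state=0)

settings = [("unweighted", None), ("weighted", {0: 5, 1: 1}), ("balanced", "balanced")]
results = {}
for name, class_weight in settings:
    clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
    pred = cross_val_predict(clf, X, y, cv=5)
    tpr = recall_score(y, pred, pos_label=1)        # repaid loans correctly approved
    fpr = 1 - recall_score(y, pred, pos_label=0)    # defaults wrongly approved
    results[name] = (tpr, fpr)
    print(f"{name:>10}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```

Weighting the minority class lowers the FPR at the cost of a lower TPR; which trade-off is acceptable depends on how expensive a default is compared with a rejected good customer.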
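The point of this analysis is not to deliver an accurate prediction model but to show the general workflow of machine-learning modeling, which has two main parts: data processing and model training. The first part requires a good deal of domain knowledge to clean the raw data and extract features. The second part involves lengthy tuning of model parameters, and the direction and strategy of that tuning need further study. When the model performs poorly, the following adjustments are worth considering:
1. Adjust the weights of the positive and negative classes.
2. Switch to a different model.
3. Predict with several models at once and combine their outputs into the final result.
4. Derive new features from the raw data.
5. Tune the model's hyperparameters.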
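Items 2 and 3 of the list can be combined in a single sketch: a soft-voting ensemble that averages the predicted probabilities of a logistic regression and a random forest. The choice of models, their parameters, and the synthetic data are all assumptions made for illustration; the article itself only trains logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the cleaned loan table (assumption).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.15, 0.85], random_state=0)

# Soft voting averages the class probabilities of the member models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                      random_state=0)),
    ],
    voting="soft",
)

# AUC is less sensitive to class imbalance than plain accuracy.
scores = cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc")
mean_auc = scores.mean()
print(f"Mean cross-validated AUC: {mean_auc:.3f}")
```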