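Data Mining: Predicting Overdue Loan Repayment with Logistic Regression

This article shows how to use logistic regression, a data mining algorithm, to predict whether a customer will fall behind on a loan.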

Contents

- 1. Background
- 2. Workflow
- 3. Goals
- 4. Data

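1. Background

When an applicant requests a loan from a peer-to-peer (P2P) lending platform, the platform has the customer fill in an application form, online or offline, to collect basic information, and it also draws on third-party sources such as credit bureaus. These attributes are used to fit a logistic regression model on historical loan data. The platform then uses the model's prediction of whether an applicant will default to decide whether to grant the loan.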

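2. Workflow

Data processing: cleaning, filtering, deletion, and feature engineering; splitting the data into training and test sets; selecting the explanatory variables used to compute yhat
Model building
Model validation (plot the ROC curve and compute the AUC)
Prediction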
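
The workflow above can be sketched end to end on synthetic data. This is only an illustration, not the article's pipeline: the data comes from sklearn's make_classification rather than the Lending Club file, and the split ratio, sample size, and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned loan data (assumption: 1 = repaid, 0 = default).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.15, 0.85], random_state=0)

# Split into training and test sets, keeping the class ratio the same in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Build the model on the training set only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Validate on the held-out test set: ROC curve points and the area under the curve.
scores = model.predict_proba(X_test)[:, 1]  # estimated probability of class 1
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print("AUC on the test set:", round(auc, 3))
```

The fpr and tpr arrays are the points of the ROC curve and can be passed to any plotting library.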

3. Goals

Be able to clean data and engineer features
Be able to handle imbalanced samples

4. Data

The data set contains loan records from the Lending Club platform, with 52 variables and 39,522 records.

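(1) Data preprocessing
1. Inspect the data as a whole

import warnings
import pandas as pd

warnings.filterwarnings("ignore")

# Drop features that carry no useful information, such as desc and url, and save the rest to a new CSV file:
loans_2020 = pd.read_csv("./LoanStats3a.csv", skiprows=1)  # skip the first line of the file, which is a text string and not the header
half_count = len(loans_2020) // 2  # about 40,000 rows, so half is about 19,767
loans_2020 = loans_2020.dropna(thresh=half_count, axis=1)  # drop columns that are more than half empty; thresh is the minimum number of non-null values a column needs to be kept
loans_2020 = loans_2020.drop(["desc", "url"], axis=1)  # drop the description and URL columns
loans_2020.to_csv("loans_2020.csv", index=False)  # write to loans_2020.csv; index=False leaves out the index

# Print the column labels to get a first idea of which features are useless:
loans_2020 = pd.read_csv("loans_2020.csv")
print("First row of the data\n", loans_2020.iloc[0])  # the first row
print("Number of columns =", loans_2020.shape[1])  # shape[1] is the column count, shape[0] the row count

Output:

First row of the data
 id                                1077501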
member_id                      1.2966e+06
loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                            Dec-11
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                   Jan-85
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                         83.70%
total_acc                               9
initial_list_status                     f
out_prncp                               0
out_prncp_inv                           0
total_pymnt                       5863.16
total_pymnt_inv                   5833.84
total_rec_prncp                      5000
total_rec_int                      863.16
total_rec_late_fee                      0
recoveries                              0
collection_recovery_fee                 0
last_pymnt_d                       Jan-15
last_pymnt_amnt                    171.62
last_credit_pull_d                 Nov-16
collections_12_mths_ex_med              0
policy_code                             1
application_type               INDIVIDUAL
acc_now_delinq                          0
chargeoff_within_12_mths                0
delinq_amnt                             0
pub_rec_bankruptcies                    0
tax_liens                               0
Name: 0, dtype: object
Number of columns = 52
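
The thresh argument of dropna is easy to misread: it is the minimum number of non-null values a column needs in order to be kept, not the number of nulls allowed. A small sketch on a made-up frame (the column names a, b, and c are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" is mostly empty, column "c" is exactly half empty.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, 5.0, np.nan, 6.0],
})

half_count = len(df) // 2  # 2
# Keep only the columns that have at least half_count non-null values.
kept = df.dropna(thresh=half_count, axis=1)
print(list(kept.columns))  # ['a', 'c']
```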

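Common sense says that id and member_id have nothing to do with whether a loan should be granted. funded_amnt and funded_amnt_inv are the amounts actually funded after the lending decision, so they are not useful for the prediction either. The features were therefore selected after discussion with the product manager and the rest of the team; the code below drops the ones that were rejected.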

2. Drop the useless features:

loans_2020 = loans_2020.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans_2020 = loans_2020.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
loans_2020 = loans_2020.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print("Columns remaining =", loans_2020.shape[1])

Output: Columns remaining = 32
There were 52 columns before.

3. Determine the current loan status

print(loans_2020["loan_status"].value_counts())  # count the rows for each loan status

Output:
Fully Paid            33693
Charged Off            5612
Current                 201
Late (31-120 days)       10
In Grace Period           9
Late (16-30 days)         5
Default                   1

Reduce the target to a binary label encoded as 0 and 1:

loans_2020 = loans_2020[(loans_2020["loan_status"] == "Fully Paid") | (loans_2020["loan_status"] == "Charged Off")]
# The outer key is the column name; the inner dict maps "Fully Paid" to 1 (repaid in full) and "Charged Off" to 0 (default)
status_replace = {"loan_status": {"Fully Paid": 1, "Charged Off": 0}}
loans_2020 = loans_2020.replace(status_replace)  # find and replace the values
loans_2020["loan_status"]

Output:
0        1
1        0
2        1
        ...
39530    1
Name: loan_status, Length: 39305, dtype: int64
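
One possible alternative to the nested-dictionary replace is isin plus Series.map, which confines the recoding to a single column. The sketch below uses a made-up four-row frame. It also shows that filtering keeps the original index labels, which is why the last index in the output above (39530) is larger than the number of rows (39305).

```python
import pandas as pd

df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"],
    "purpose": ["car", "other", "car", "house"],
})

# Keep only the two final outcomes, then recode them as 1 and 0.
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
df["loan_status"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

print(df["loan_status"].tolist())  # [1, 0, 1]
print(df.index.tolist())           # [0, 1, 3]: the label of the dropped row is skipped
```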

4. Drop features that take only one value, since they do not help a classification model:

# Drop the columns that contain a single unique value:
orig_columns = loans_2020.columns  # all column names
drop_columns = []
for col in orig_columns:
    # Drop the nulls first, then collect the unique values
    col_series = loans_2020[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2020 = loans_2020.drop(drop_columns, axis=1)
print(drop_columns)
print(30 * "-")
print(loans_2020.shape)
loans_2020.to_csv("filtered_loans_2020", index=False)

Output:

['initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
------------------------------
(39305, 24)
24 columns remain.
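
The same check can be written without a loop using DataFrame.nunique(), which skips missing values by default. A minimal sketch on a toy frame (real column names, made-up values):

```python
import pandas as pd

df = pd.DataFrame({
    "policy_code": [1, 1, 1, 1],
    "tax_liens": [0.0, None, 0.0, 0.0],
    "loan_amnt": [5000, 2500, 2400, 10000],
})

# Count distinct non-null values per column and collect the constant columns.
unique_counts = df.nunique()
drop_columns = unique_counts[unique_counts == 1].index.tolist()
df = df.drop(columns=drop_columns)

print(drop_columns)      # ['policy_code', 'tax_liens']
print(list(df.columns))  # ['loan_amnt']
```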

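The features and the label have now been selected, but missing values, punctuation, % signs, character sets, and string values still need to be handled. Start with the missing values by counting them for each of the 24 columns.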

5. Handle missing values

import pandas as pd

loans = pd.read_csv("filtered_loans_2020")
null_counts = loans.isnull().sum()  # isnull() flags the missing cells; sum() totals them per column
print(null_counts)

Output:

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1073
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
pymnt_plan                 0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         1
pub_rec_bankruptcies     449
dtype: int64

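The counts show that title and revol_util have few missing values, so dropping those rows has little effect on the data. pub_rec_bankruptcies has many more, which suggests that the field was poorly recorded, so the whole column is dropped. Any remaining row with a missing value, including the rows that lack emp_length, is then dropped by dropna(axis=0). The same step also checks the data types of the features.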

# Check the data types
loans = loans.drop("pub_rec_bankruptcies", axis=1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())  # count how many columns are object, int, or float

Output:

object     12
float64    10
int64       1
dtype: int64

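6. Convert data types
Pay particular attention to the string columns: sklearn does not accept string features, so the object columns above have to be converted:

# Data type conversion, focusing on the string columns:
# select_dtypes picks out only the object (string) columns
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])

Output: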
term                     36 months
int_rate                    10.65%
emp_length               10+ years
home_ownership                RENT
verification_status       Verified
pymnt_plan                       n
purpose                credit_card
title                     Computer
addr_state                      AZ
earliest_cr_line            Jan-85
revol_util                  83.70%
last_credit_pull_d          Nov-16
Name: 0, dtype: object

# Candidate columns for one-hot encoding
cols = ["home_ownership", "verification_status", "emp_length", "term", "addr_state"]
for c in cols:
    print(loans[c].value_counts())

Output:

RENT        18237
MORTGAGE    17035
OWN          2805
OTHER          96
NONE            1
Name: home_ownership, dtype: int64
Not Verified       16182
Verified           12251
Source Verified     9741
Name: verification_status, dtype: int64
10+ years    8794
< 1 year     4492
2 years      4339
3 years      4052
4 years      3397
5 years      3262
1 year       3182
6 years      2201
7 years      1747
8 years      1463
9 years      1245
Name: emp_length, dtype: int64
36 months    27980
60 months    10194
Name: term, dtype: int64
CA    6876
NY    3644
FL    2739
TX    2657
NJ    1799
IL    1478
PA    1470
VA    1355
GA    1342
MA    1278
OH    1176
MD    1019
AZ     824
WA     788
CO     758
NC     739
CT     725
MI     688
MO     654
MN     589
NV     477
SC     457
OR     431
WI     429
AL     424
LA     422
KY     320
OK     292
KS     257
UT     248
AR     233
DC     211
RI     196
NM     180
HI     168
WV     167
NH     160
DE     109
MT      78
WY      78
AK      77
SD      61
VT      54
MS      19
TN      16
ID       6
IA       5
NE       1
Name: addr_state, dtype: int64

# purpose and title carry similar meaning, but title has far more distinct values, so title is dropped:
print(loans["purpose"].value_counts())
print(35 * "--")
print(loans["title"].value_counts())

Output:
debt_consolidation    18057
credit_card            4927
other                  3761
home_improvement       2846
major_purchase         2103
small_business         1745
car                    1489
wedding                 924
medical                 665
moving                  551
house                   364
vacation                347
educational             300
renewable_energy         95
Name: purpose, dtype: int64
----------------------------------------------------------------------
Debt Consolidation                                              2122
Debt Consolidation Loan                                         1670
Personal Loan                                                    625
Consolidation                                                    502
debt consolidation                                               483
Credit Card Consolidation                                        348
Home Improvement                                                 343
Debt consolidation                                               320
Small Business Loan                                              310
Credit Card Loan                                                 302
Personal                                                         296
Consolidation Loan                                               252
Home Improvement Loan                                            237
personal loan                                                    224
Wedding Loan                                                     207
personal                                                         206
Loan                                                             205
consolidation                                                    193
Car Loan                                                         192
Other Loan                                                       177
Wedding                                                          151
Credit Card Payoff                                               149
Credit Card Refinance                                            140
Major Purchase Loan                                              136
Consolidate                                                      126
Medical                                                          114
Credit Card                                                      111
home improvement                                                 105
My Loan                                                           90
Credit Cards                                                      90
                                                                ... 
Business Loan - SEO                                                1
The Stock Loan                                                     1
a little help from my friends                                      1
Education is Money                                                 1
J.L.                                                               1
Debt Consolidation for a new beginning                             1
reorganize finances                                                1
2010 Success                                                       1
Photography Studio Startup                                         1
Home improvement/ consolidate debt                                 1
IN THE GREEN                                                       1
Cosolidate Loan                                                    1
consolidate 1                                                      1
Going credit free, marrying my girlfriend and visiting home!       1
Paying for Dream Vacation!                                         1
Knocking out Debt                                                  1
purchase home appliances and remodeling                            1
Consolidate 2 high interest credit cards                           1
Legal Expenses Loan                                                1
Eliminating credit card                                            1
katie's personal loan                                              1
Sep-11                                                             1
Weddings are expensive                                             1
SCOOTER                                                            1
Baby1                                                              1
Medical Transcriptionist Course Fees                               1
Surgery Expenses                                                   1
our new start                                                      1
credit cards be gone                                               1
TGM VACATION                                                       1
Name: title, Length: 18933, dtype: int64

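Map emp_length through a dictionary: the column name is the outer key and the inner dictionary maps each category to a number, for example "10+ years": 10. The keys must match the values in the data exactly, because replace leaves unmatched values untouched.
Then strip the % sign from the interest rate (int_rate) and the revolving credit utilization (revol_util) and cast both columns to float with astype.

mapping_dict = {"emp_length": {"10+ years": 10, "9 years": 9, "8 years": 8, "7 years": 7, "6 years": 6, "5 years": 5, "4 years": 4, "3 years": 3, "2 years": 2, "1 year": 1, "< 1 year": 0, "n/a": 0}}
# Drop some other features that are not useful:
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
# Strip the trailing % character and convert to float:
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)
loans.iloc[0]

Output:

loan_amnt                     5000
term                     36 months
int_rate                     10.65
installment                 162.87
emp_length                      10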
home_ownership                RENT
annual_inc                   24000
verification_status       Verified
loan_status                      1
pymnt_plan                       n
purpose                credit_card
dti                          27.65
delinq_2yrs                      0
inq_last_6mths                   1
open_acc                         3
pub_rec                          0
revol_bal                    13648
revol_util                    83.7
total_acc                        9
Name: 0, dtype: object
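
Because an unmatched key fails silently, it can help to compare the mapping keys with the values actually present before converting. A minimal sketch on a two-row frame; the emp_map and unmatched names are illustrative, not from the original code:

```python
import pandas as pd

df = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%"],
    "emp_length": ["10+ years", "< 1 year"],
})

# Remove the trailing percent sign, then cast the strings to float.
df["int_rate"] = df["int_rate"].str.rstrip("%").astype("float")

# Map the employment-length categories to numbers.
emp_map = {"10+ years": 10, "< 1 year": 0}
unmatched = set(df["emp_length"]) - set(emp_map)  # categories the mapping would miss
df["emp_length"] = df["emp_length"].map(emp_map)

print(df["int_rate"].tolist())    # [10.65, 15.27]
print(df["emp_length"].tolist())  # [10, 0]
print(unmatched)                  # set()
```

Unlike replace, Series.map turns unmatched values into NaN, so a missed category shows up in isnull() counts.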

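The remaining categorical features are converted to numeric columns with the pandas get_dummies() function. This is one-hot encoding: each category of a feature becomes its own 0/1 column.

cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
# columns= encodes every listed column regardless of dtype; dtype=int yields 0/1 instead of booleans
dummy_df = pd.get_dummies(loans[cat_columns], columns=cat_columns, dtype=int)
loans = pd.concat([loans, dummy_df], axis=1)
# Drop the original categorical columns, which are now encoded, together with pymnt_plan
loans = loans.drop(cat_columns + ["pymnt_plan"], axis=1)
loans.to_csv("cleaned_loans_2020.csv", index=False)

loans = pd.read_csv("cleaned_loans_2020.csv")  # reload the cleaned data; every column is now float or int
loans.info()

Output: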
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38174 entries, 0 to 38173
Data columns (total 48 columns):
loan_amnt                              38174 non-null float64
int_rate                               38174 non-null float64
installment                            38174 non-null float64
annual_inc                             38174 non-null float64
loan_status                            38174 non-null int64
home_ownership_MORTGAGE                38174 non-null int64
home_ownership_NONE                    38174 non-null int64
home_ownership_OTHER                   38174 non-null int64
home_ownership_OWN                     38174 non-null int64
...
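
To make the effect of one-hot encoding concrete, here is a minimal sketch on a three-row frame with made-up values. It encodes a single column and shows the resulting column names:

```python
import pandas as pd

df = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400],
    "term": ["36 months", "60 months", "36 months"],
})

# One new column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df[["term"]], dtype=int)
encoded = pd.concat([df.drop(columns=["term"]), dummies], axis=1)

print(list(encoded.columns))               # ['loan_amnt', 'term_36 months', 'term_60 months']
print(encoded["term_36 months"].tolist())  # [1, 0, 1]
```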

Model training:

from sklearn.linear_model import LogisticRegression  # classifier
lr = LogisticRegression()  # create the logistic regression estimator
cols = loans.columns  # 38,174 rows x 48 columns
train_cols = cols.drop("loan_status")  # every column except the loan_status target
features = loans[train_cols]  # feature matrix with 47 columns
target = loans["loan_status"]  # label vector
lr.fit(features, target)  # train
predictions = lr.predict(features)  # predict on the same data used for training
lr.predict_proba(features)  # predicted probability of each class

Output:

array([[0.23870984, 0.76129016],[0.35466694, 0.64533306],[0.32152312, 0.67847688],...,[0.30736624, 0.69263376],[0.10044728, 0.89955272],[0.09492527, 0.90507473]])

predictions[:10]  # 1 means the loan is predicted to be repaid

Output: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
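
For a binary model, the predicted label is simply the class with the larger probability, so class 1 is chosen when its probability exceeds 0.5. The sketch below reproduces this for the first three probability rows printed above; the stricter 0.7 cutoff is a hypothetical value added for illustration, not something the article uses.

```python
# First three rows of the predict_proba output: [P(class 0), P(class 1)].
proba = [
    [0.23870984, 0.76129016],
    [0.35466694, 0.64533306],
    [0.32152312, 0.67847688],
]

# Default rule: predict class 1 when P(class 1) > 0.5.
labels = [1 if p1 > 0.5 else 0 for _, p1 in proba]
print(labels)  # [1, 1, 1]

# Hypothetical stricter cutoff: approve only when P(class 1) > 0.7.
strict = [1 if p1 > 0.7 else 0 for _, p1 in proba]
print(strict)  # [1, 0, 0]
```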

# Build the confusion matrix
# False positives (FP): negatives predicted as positive
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
print(fp)
print("----------------------------------------")
# True positives (TP): positives predicted as positive
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
print(tp)
print("----------------------------------------")
# False negatives (FN): positives predicted as negative
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
print(fn)
print("----------------------------------------")
# True negatives (TN): negatives predicted as negative
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
print(tn)

Output:

5357
----------------------------------------
32788
----------------------------------------
21
----------------------------------------
8
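
The four counts above already say a lot. As a worked example, the true positive rate and false positive rate they imply can be computed directly; the variable names follow the code above and the numbers are copied from its output:

```python
# Counts from the training-set confusion matrix printed above.
fp, tp, fn, tn = 5357, 32788, 21, 8

tpr = tp / (tp + fn)  # share of repaid loans that are predicted as repaid
fpr = fp / (fp + tn)  # share of defaulted loans that are predicted as repaid
predicted_default = fn + tn  # loans the model labels as default

print(round(tpr, 4))      # 0.9994
print(round(fpr, 4))      # 0.9985
print(predicted_default)  # 29
```

Only 29 of the 38,174 loans are predicted as defaults, so on the training data the model labels almost every loan as repaid.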

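Model evaluation

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict  # cross-validated predictions

lr = LogisticRegression()
predictions = cross_val_predict(lr, features, target, cv=10)
predictions = pd.Series(predictions)
print(predictions[:1000])

# Recompute the confusion-matrix counts from the cross-validated predictions
tp = ((predictions == 1) & (target == 1)).sum()
fn = ((predictions == 0) & (target == 1)).sum()
fp = ((predictions == 1) & (target == 0)).sum()
tn = ((predictions == 0) & (target == 0)).sum()

# Rates:
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)  # true positive rate
print(fpr)  # false positive rate

Output: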
0.7121521533725502
0.4648648648648649
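Judging from the results, the TPR and the FPR are both high, which means the model approves almost everyone who applies, so the classification is of little use. We therefore switch to class weights, assigning a different weight to each class, and look at the result again.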

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# The class weights can be chosen freely:
# class 0 (default) gets a weight of 5
# class 1 (fully paid) gets a weight of 1
penalty = {0: 5, 1: 1}
lr = LogisticRegression(class_weight=penalty)
kf = 10  # number of cross-validation folds
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)
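
Besides a hand-written dictionary, class_weight also accepts the string "balanced", which weights each class inversely to its frequency. The sketch below compares the two settings on synthetic imbalanced data. The data set, the rates helper, the fold count, and the seeds are assumptions made for illustration, so the numbers it prints say nothing about the Lending Club data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced data (assumption: about 15% class 0 "default", 85% class 1 "repaid").
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.15, 0.85], random_state=1)

def rates(model):
    """Return (TPR, FPR) computed from 5-fold cross-validated predictions."""
    pred = cross_val_predict(model, X, y, cv=5)
    tp = ((pred == 1) & (y == 1)).sum()
    fn = ((pred == 0) & (y == 1)).sum()
    fp = ((pred == 1) & (y == 0)).sum()
    tn = ((pred == 0) & (y == 0)).sum()
    return tp / (tp + fn), fp / (fp + tn)

manual = rates(LogisticRegression(class_weight={0: 5, 1: 1}, max_iter=1000))
balanced = rates(LogisticRegression(class_weight="balanced", max_iter=1000))
print("manual weights   TPR, FPR:", manual)
print("balanced weights TPR, FPR:", balanced)
```

A useful weighting lowers the FPR substantially while giving up only part of the TPR; which trade-off is acceptable depends on the cost of a default relative to a rejected good customer.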