Predicting Auto Loan Defaults

2023-12-20 05:18
Tags: prediction, default, auto loan

This post walks through predicting auto loan defaults with logistic regression; hopefully it offers a useful reference for developers tackling similar problems.

Logistic Regression

Data description: the dataset records auto loan defaults.

Field descriptions:

application_id: applicant ID
account_number: account number
bad_ind: default indicator
vehicle_year: year the vehicle was purchased
vehicle_make: vehicle manufacturer
bankruptcy_ind: prior-bankruptcy indicator
tot_derog: number of derogatory credit events in the past five years (e.g. a phone account cancelled over unpaid bills)
tot_tr: total number of accounts
age_oldest_tr: age of the oldest account (months)
tot_open_tr: number of accounts in use
tot_rev_tr: number of revolving credit accounts in use (e.g. credit cards)
tot_rev_debt: balance on revolving accounts in use (e.g. credit card debt)
tot_rev_line: revolving credit limit (credit card authorized limit)
rev_util: revolving credit utilization (balance / limit)
fico_score: FICO score
purch_price: vehicle purchase price (yuan)
msrp: manufacturer's suggested retail price
down_pyt: down payment on the installment plan
loan_term: loan term (months)
loan_amt: loan amount
ltv: loan amount / MSRP * 100
tot_income: average monthly income (yuan)
veh_mileage: vehicle mileage (miles)
used_ind: used-car indicator
weight: sample weight
%matplotlib inline
import os
import numpy as np
from scipy import stats
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# os.chdir('E:/data')
pd.set_option('display.max_columns', None)

Importing and cleaning the data

accepts = pd.read_csv('accepts.csv', skipinitialspace=True)
accepts = accepts.dropna(axis=0, how='any')
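Note that dropna(how='any') silently discards every row with at least one missing value, which can cost a lot of data. It is worth checking per-column missingness first; a minimal sketch on a toy frame (the values and NaN pattern here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for accepts; the NaN pattern is illustrative only
df = pd.DataFrame({
    'fico_score': [700, np.nan, 650, 720],
    'tot_income': [3000, 4000, np.nan, 5000],
    'bad_ind':    [0, 1, 0, 1],
})

missing_share = df.isna().mean()                    # fraction missing per column
rows_before, rows_after = len(df), len(df.dropna(how='any'))
print(missing_share)
print(rows_before, rows_after)                      # 4 2
```

On the real accepts frame the same check shows which columns drive the row loss before committing to a blanket dropna.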

Associations between categorical variables

  • Is a prior bankruptcy related to default?

Cross tabulation

cross_table = pd.crosstab(accepts.bankruptcy_ind, accepts.bad_ind, margins=True)
cross_table
bad_ind            0    1   All
bankruptcy_ind
N               3076  719  3795
Y                243   67   310
All             3319  786  4105

Contingency table (row proportions)

def percConvert(ser):
    return ser / float(ser.iloc[-1])

cross_table.apply(percConvert, axis=1)

bad_ind                0         1  All
bankruptcy_ind
N               0.810540  0.189460  1.0
Y               0.783871  0.216129  1.0
All             0.808526  0.191474  1.0
print('''chisq = %6.4f 
p-value = %6.4f
dof = %i 
expected_freq = %s'''  %stats.chi2_contingency(cross_table.iloc[:2, :2]))
chisq = 1.1500 
p-value = 0.2835
dof = 1 
expected_freq = [[3068.35688185  726.64311815]
 [ 250.64311815   59.35688185]]
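The p-value above says the association is not significant at conventional levels. An effect-size measure such as Cramér's V (not part of the original analysis) makes the same point on a 0-to-1 scale; a sketch reusing the counts from the crosstab:

```python
import numpy as np
from scipy import stats

# 2x2 counts from the bankruptcy_ind x bad_ind crosstab above
table = np.array([[3076, 719],
                  [243,   67]])

chi2, p, dof, expected = stats.chi2_contingency(table)
n = table.sum()
# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(round(p, 4), round(cramers_v, 4))   # tiny V: essentially no association
```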

Logistic regression

accepts.plot(x='fico_score', y='bad_ind', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x63c4ef0>

(Figure: scatter plot of bad_ind against fico_score)

  • Random sampling: build the training and test sets

train = accepts.sample(frac=0.7, random_state=1234).copy()
test = accepts[~ accepts.index.isin(train.index)].copy()
print('Training set size: %i \nTest set size: %i' % (len(train), len(test)))
Training set size: 2874 
Test set size: 1231
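A plain random sample does not guarantee that the default rate matches between the two halves. sklearn's train_test_split with stratify= keeps the class ratio essentially exact; a sketch on synthetic labels (the ~19% positive rate mimics this dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~19% positives, roughly the default rate above
rng = np.random.default_rng(1234)
df = pd.DataFrame({'x': rng.normal(size=4105),
                   'bad_ind': (rng.random(4105) < 0.19).astype(int)})

train_df, test_df = train_test_split(df, test_size=0.3,
                                     stratify=df['bad_ind'],
                                     random_state=1234)
# Class rates in the two halves now agree almost exactly
print(train_df['bad_ind'].mean(), test_df['bad_ind'].mean())
```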
lg = smf.glm('bad_ind ~ fico_score', data=train,
             family=sm.families.Binomial()).fit()  # logit is the Binomial default link
lg.summary()
Generalized Linear Model Regression Results

Dep. Variable:   bad_ind            No. Observations:  2874
Model:           GLM                Df Residuals:      2872
Model Family:    Binomial           Df Model:          1
Link Function:   logit              Scale:             1.0
Method:          IRLS               Log-Likelihood:    -1267.8
Date:            Tue, 29 May 2018   Deviance:          2535.7
Time:            15:04:24           Pearson chi2:      2.75e+03
No. Iterations:  5

                 coef   std err        z   P>|z|   [0.025   0.975]
Intercept      8.8759     0.648   13.702   0.000    7.606   10.146
fico_score    -0.0151     0.001  -15.687   0.000   -0.017   -0.013
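The fico_score coefficient is on the log-odds scale, so exponentiating it gives an odds multiplier. A quick check using the fitted value from the summary above:

```python
import numpy as np

beta = -0.0151                  # fico_score coefficient from the summary
or_per_point = np.exp(beta)     # odds multiplier per 1-point FICO increase
or_per_50 = np.exp(beta * 50)   # per 50 points

print(round(or_per_point, 4))   # 0.985: each FICO point cuts the odds by ~1.5%
print(round(or_per_50, 4))      # ~0.47: 50 points roughly halve the odds
```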
formula = '''bad_ind ~ fico_score + bankruptcy_ind
+ tot_derog + age_oldest_tr + rev_util + ltv + veh_mileage'''

lg_m = smf.glm(formula=formula, data=train,
               family=sm.families.Binomial()).fit()
lg_m.summary().tables[1]

                         coef    std err        z   P>|z|     [0.025     0.975]
Intercept              4.9355      0.828    5.960   0.000      3.312      6.559
bankruptcy_ind[T.Y]   -0.4181      0.195   -2.143   0.032     -0.801     -0.036
fico_score            -0.0131      0.001  -11.053   0.000     -0.015     -0.011
tot_derog              0.0529      0.016    3.260   0.001      0.021      0.085
age_oldest_tr         -0.0043      0.001   -6.673   0.000     -0.006     -0.003
rev_util               0.0008      0.001    1.593   0.111     -0.000      0.002
ltv                    0.0290      0.003    8.571   0.000      0.022      0.036
veh_mileage         2.502e-06   1.51e-06    1.654   0.098  -4.63e-07   5.47e-06
# Forward selection by AIC
def forward_select(data, response):
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {}".format(response, ' + '.join(selected + [candidate]))
            aic = smf.glm(formula=formula, data=data,
                          family=sm.families.Binomial()).fit().aic
            aic_with_candidates.append((aic, candidate))
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate = aic_with_candidates.pop()
        if current_score > best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print('aic is {},continuing!'.format(current_score))
        else:
            print('forward selection over!')
            break

    formula = "{} ~ {} ".format(response, ' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = smf.glm(formula=formula, data=data,
                    family=sm.families.Binomial()).fit()
    return model

candidates = ['bad_ind', 'fico_score', 'bankruptcy_ind', 'tot_derog',
              'age_oldest_tr', 'rev_util', 'ltv', 'veh_mileage']
data_for_select = train[candidates]

lg_m1 = forward_select(data=data_for_select, response='bad_ind')
lg_m1.summary().tables[1]
aic is 2539.6525973826097,continuing!
aic is 2448.972227745799,continuing!
aic is 2406.5983198124773,continuing!
aic is 2401.0559077596185,continuing!
aic is 2397.9413617381233,continuing!
aic is 2397.0135732954586,continuing!
aic is 2396.212716240673,continuing!
final formula is bad_ind ~ fico_score + ltv + age_oldest_tr + tot_derog + bankruptcy_ind + veh_mileage + rev_util 
                         coef    std err        z   P>|z|     [0.025     0.975]
Intercept              4.9355      0.828    5.960   0.000      3.312      6.559
bankruptcy_ind[T.Y]   -0.4181      0.195   -2.143   0.032     -0.801     -0.036
fico_score            -0.0131      0.001  -11.053   0.000     -0.015     -0.011
ltv                    0.0290      0.003    8.571   0.000      0.022      0.036
age_oldest_tr         -0.0043      0.001   -6.673   0.000     -0.006     -0.003
tot_derog              0.0529      0.016    3.260   0.001      0.021      0.085
veh_mileage         2.502e-06   1.51e-06    1.654   0.098  -4.63e-07   5.47e-06
rev_util               0.0008      0.001    1.593   0.111     -0.000      0.002

The results seem wrong when using 'statsmodels.stats.outliers_influence.variance_inflation_factor', so VIF is computed by hand here:

def vif(df, col_i):
    from statsmodels.formula.api import ols

    cols = list(df.columns)
    cols.remove(col_i)
    cols_noti = cols
    formula = col_i + '~' + '+'.join(cols_noti)
    r2 = ols(formula, df).fit().rsquared
    return 1. / (1. - r2)

exog = train[candidates].drop(['bad_ind', 'bankruptcy_ind'], axis=1)

for i in exog.columns:
    print(i, '\t', vif(df=exog, col_i=i))
fico_score 	 1.542313308954432
tot_derog 	 1.347832436613074
age_oldest_tr 	 1.1399926313381807
rev_util 	 1.0843803200842592
ltv 	 1.0246247922768867
veh_mileage 	 1.0105135995489778

Prediction

train['proba'] = lg_m1.predict(train)
test['proba'] = lg_m1.predict(test)

test['proba'].head()

4     0.123459
6     0.002545
10    0.071279
11    0.219843
13    0.241252
Name: proba, dtype: float64

Model evaluation

Setting a threshold

test['prediction'] = (test['proba'] > 0.5).astype('int')

Confusion matrix

pd.crosstab(test.bad_ind, test.prediction, margins=True)
prediction        0    1   All
bad_ind
0               969   33  1002
1               199   30   229
All            1168   63  1231
  • Computing accuracy

acc = sum(test['prediction'] == test['bad_ind']) / float(len(test))
print('The accuracy is %.2f' % acc)
The accuracy is 0.81
for i in np.arange(0, 1, 0.1):
    prediction = (test['proba'] > i).astype('int')
    confusion_matrix = pd.crosstab(test.bad_ind, prediction, margins=True)
    precision = confusion_matrix.iloc[1, 1] / confusion_matrix.loc['All', 1]
    recall = confusion_matrix.iloc[1, 1] / confusion_matrix.loc[1, 'All']
    f1_score = 2 * (precision * recall) / (precision + recall)
    print('threshold: %s, precision: %.2f, recall:%.2f , f1_score:%.2f' \
          % (i, precision, recall, f1_score))
threshold: 0.0, precision: 0.19, recall:1.00 , f1_score:0.31
threshold: 0.1, precision: 0.26, recall:0.92 , f1_score:0.41
threshold: 0.2, precision: 0.34, recall:0.70 , f1_score:0.46
threshold: 0.30000000000000004, precision: 0.41, recall:0.46 , f1_score:0.43
threshold: 0.4, precision: 0.45, recall:0.25 , f1_score:0.32
threshold: 0.5, precision: 0.48, recall:0.13 , f1_score:0.21
threshold: 0.6000000000000001, precision: 0.50, recall:0.05 , f1_score:0.09
threshold: 0.7000000000000001, precision: 0.67, recall:0.02 , f1_score:0.03
threshold: 0.8, precision: 0.50, recall:0.00 , f1_score:0.01
threshold: 0.9, precision: 0.50, recall:0.00 , f1_score:0.01
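The 0.1-step scan above suggests the F1-optimal threshold sits near 0.2. sklearn's precision_recall_curve evaluates every candidate threshold at once; a sketch with synthetic labels and scores (standing in for test.bad_ind and test.proba):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic stand-ins for test.bad_ind / test.proba
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.19).astype(int)
proba = np.clip(0.14 + 0.25 * y + rng.normal(0, 0.1, 1000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y, proba)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
# The last precision/recall pair has no threshold attached, hence f1[:-1]
best = thresholds[np.argmax(f1[:-1])]
print(round(float(best), 3), round(float(f1.max()), 3))
```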
  • Plotting the ROC curve
import sklearn.metrics as metrics

fpr_test, tpr_test, th_test = metrics.roc_curve(test.bad_ind, test.proba)
fpr_train, tpr_train, th_train = metrics.roc_curve(
train.bad_ind, train.proba)

plt.figure(figsize=[3, 3])
plt.plot(fpr_test, tpr_test, 'b--')
plt.plot(fpr_train, tpr_train, 'r-')
plt.title('ROC curve')
plt.show()

(Figure: ROC curves, test set in blue dashed, training set in red)

print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))
AUC = 0.7619
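Alongside AUC, credit-scoring work usually reports the KS statistic: the maximum gap between the cumulative score distributions of the two classes, which falls straight out of roc_curve as max(tpr - fpr). A sketch with synthetic labels and scores (standing in for test.bad_ind and test.proba):

```python
import numpy as np
from sklearn import metrics

# Synthetic labels and proba-like scores for illustration
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.19).astype(int)
score = 0.2 + 0.3 * y + rng.normal(0, 0.1, 1000)

fpr, tpr, _ = metrics.roc_curve(y, score)
ks = float(np.max(tpr - fpr))       # KS statistic
auc = metrics.auc(fpr, tpr)
print(round(ks, 3), round(auc, 3))
```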
print(metrics.classification_report(test.bad_ind, test.prediction))  # evaluation metrics
             precision    recall  f1-score   support

          0       0.83      0.97      0.89      1002
          1       0.48      0.13      0.21       229

avg / total       0.76      0.81      0.77      1231

Standardizing the predictors (statsmodels does not standardize by default)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

cols = ['fico_score', 'ltv', 'age_oldest_tr', 'tot_derog']
train1 = train[cols]; test1 = test[cols]

train2 = pd.DataFrame(scaler.fit_transform(train1), columns=cols, index=train1.index)
test2 = pd.DataFrame(scaler.transform(test1), columns=cols, index = test1.index)
train3 = train2.join(train.bad_ind).join(train.bankruptcy_ind)
test3 = test2.join(test.bad_ind).join(test.bankruptcy_ind)

formula2 = 'bad_ind ~' + '+'.join(cols) + '+ bankruptcy_ind'
lg_m2 = smf.glm(formula=formula2, data=train3,
                family=sm.families.Binomial()).fit()
# formula2
train3['proba'] = lg_m2.predict(train3)
test3['proba'] = lg_m2.predict(test3)

fpr_test, tpr_test, th_test = metrics.roc_curve(test3.bad_ind, test3.proba)
fpr_train, tpr_train, th_train = metrics.roc_curve(
train3.bad_ind, train3.proba)

plt.figure(figsize=[6, 6])
plt.plot(fpr_test, tpr_test, 'b-')
plt.plot(fpr_train, tpr_train, 'r-')
plt.title('ROC curve')
print('AUC = %.4f' % metrics.auc(fpr_test, tpr_test))

test3['prediction'] = (test3['proba'] > 0.5).astype('int')
pd.crosstab(test3.bad_ind, test3.prediction, margins=True)

AUC = 0.7614
prediction        0    1   All
bad_ind
0               971   31  1002
1               198   31   229
All            1169   62  1231

(Figure: ROC curves for the standardized model)

That wraps up this walkthrough of auto loan default prediction; hopefully it is a useful reference.



