客户购买意愿预测

本文主要是介绍客户购买意愿预测，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

1.提出问题

2.理解数据

3.查看数据

4.数据分析

4.1离散变量分析

4.2连续变量分析

5.特征工程

5.1特征处理

5.2统计检验与特征筛选

5.3.不平衡数据集处理

6.数据建模

6.1构建模型以及模型评估

6.2方案实施

1.提出问题

本项目是与葡萄牙银行机构的营销活动相关。这些营销活动一般以电话为基础，银行的客服人员至少联系客户一次，以确认客户是否有意愿购买该银行的产品（定期存款）。

因此本文的任务是基本类型为分类任务，即预测客户是否购买该银行的产品。

2.理解数据

数据来源：https://www.kesci.com/home/competition/5c234c6626ba91002bfdfdd3/content/1

项目提供了2个数据集，分别为训练数据（train_set）、测试数据（test_set）

了解数据字段：

	NO	字段名称	数据类型
1	ID	Int	客户唯一标识
2	age	Int	客户年龄
3	job	String	客户的职业
4	marital	String	婚姻状况
5	education	String	受教育水平
6	default	String	是否有违约记录
7	balance	Int	每年账户的平均余额
8	housing	String	是否有住房贷款
9	loan	String	是否有个人贷款
10	contact	String	与客户联系的沟通方式
11	day	Int	最后一次联系的时间（几号）
12	month	String	最后一次联系的时间（月份）
13	duration	Int	最后一次联系的交流时长
14	campaign	Int	在本次活动中，与该客户交流过的次数
15	pdays	Int	距离上次活动最后一次联系该客户，过去了多久（999表示没有联系过）
16	previous	Int	在本次活动之前，与该客户交流过的次数
17	poutcome	String	上一次活动的结果
18	y	Int	预测客户是否会订购定期存款业务

3.查看数据

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as snsplt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = Falseimport  warnings
warnings.filterwarnings("ignore")

In [2]:

train=pd.read_csv('D:/data/predict/train_set.csv')
test=pd.read_csv('D:/data/predict/test_set.csv')
train.head()

Out[2]:

	ID	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome
0	1	43	management	married	tertiary	no	291	yes	no	unknown	9	may	150	2	-1	0	unknown
1	2	42	technician	divorced	primary	no	5076	yes	no	cellular	7	apr	99	1	251	2	other
2	3	47	admin.	married	secondary	no	104	yes	yes	cellular	14	jul	77	2	-1	0	unknown
3	4	28	management	single	secondary	no	-994	yes	yes	cellular	18	jul	174	2	-1	0	unknown
4	5	42	technician	divorced	secondary	no	2974	yes	no	unknown	21	may	187	5	-1	0	unknown

In [3]:

print('训练数据集大小', train.shape,'测试数据集大小',test.shape)

训练数据集大小 (25317, 18) 测试数据集大小 (10852, 17)

In [4]:

# 从训练集查看是否平衡数据集
plt.rc('font', family='SimHei', size=13)
fig = plt.figure()
plt.pie(train['y'].value_counts(),labels=train['y'].value_counts().index,autopct='%1.2f%%',counterclock = False)
plt.title('购买率')
plt.show()

数据集是不平衡数据集

In [5]:

# 查看是否有空值
data = pd.concat([train.drop(['y'],axis=1),test],axis=0).reset_index(drop=True)
data.isnull().sum()
无空值

In [7]:

# 对object型数据查看unique
str_features = []
num_features=[]
for col in train.columns:if train[col].dtype=='object':str_features.append(col)print(col,':  ',train[col].unique())if train[col].dtype=='int64' and col not in ['ID','y']:num_features.append(col)

job :   ['management' 'technician' 'admin.' 'services' 'retired' 'student''blue-collar' 'unknown' 'entrepreneur' 'housemaid' 'self-employed''unemployed']
marital :   ['married' 'divorced' 'single']
education :   ['tertiary' 'primary' 'secondary' 'unknown']
default :   ['no' 'yes']
housing :   ['yes' 'no']
loan :   ['no' 'yes']
contact :   ['unknown' 'cellular' 'telephone']
month :   ['may' 'apr' 'jul' 'jun' 'nov' 'aug' 'jan' 'feb' 'dec' 'oct' 'sep' 'mar']
poutcome :   ['unknown' 'other' 'failure' 'success']

观察发现分类字段有‘unknown’这个类别，此时将该类别也当作缺失值，进一步查看

In [8]:

train.isin(['unknown']).mean()*100

Out[8]:

ID            0.000000
age           0.000000
job           0.643836
marital       0.000000
education     4.206660
default       0.000000
balance       0.000000
housing       0.000000
loan          0.000000
contact      28.759332
day           0.000000
month         0.000000
duration      0.000000
campaign      0.000000
pdays         0.000000
previous      0.000000
poutcome     81.672394
y             0.000000
dtype: float64

通常对于缺失值的处理，最常用的方法无外乎删除法、替换法和插补法。

删除法是指将缺失值所在的观测行删除（前提是缺失行的比例非常低，如5%以内），或者删除缺失值所对应的变量（前提是该变量中包含的缺失值比例非常高，如70%左右）
替换法是指直接利用缺失变量的均值、中位数或众数替换该变量中的缺失值，其好处是缺失值的处理速度快，弊端是易产生有偏估计，导致缺失值替换的准确性下降
插补法则是利用有监督的机器学习方法（如回归模型、树模型、网络模型等）对缺失值作预测，其优势在于预测的准确性高，缺点是需要大量的计算，导致缺失值的处理速度大打折扣

这里观察到contact和poutcome的‘unknow’类别分别达到28.76%和81.67%，在展示数据后考虑进一步处理，job和education的unknown占比较小，考虑不对这两个特征的unknow进行处理

In [9]:

print(str_features)
print(num_features)

['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

4.数据分析

4.1离散变量分析

In [10]:

plt.figure(figsize=(15,15))
i=1
for col in str_features:plt.subplot(3,3,i)# 这里用mean是因为标签是0，1二分类，0*0的行数(即没购买的人数)+1*1的行数(购买的人数)/所有行数=购买率train.groupby([col])['y'].mean().plot(kind='bar',stacked=True,rot=90,title='Purchase rate of {}'.format(col))plt.subplots_adjust(wspace=0.2,hspace=0.7)  # 调整子图间距i=i+1
plt.show()

结论：通过可视图可以初步观察到这些特征是否会影响购买率

4.2连续变量分析

In [11]:

num_features

Out[11]:

['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

In [12]:

train[num_features].describe()

Out[12]:

	age	balance	day	duration	campaign	pdays	previous
count	25317.000000	25317.000000	25317.000000	25317.000000	25317.000000	25317.000000	25317.000000
mean	40.935379	1357.555082	15.835289	257.732393	2.772050	40.248766	0.591737
std	10.634289	2999.822811	8.319480	256.975151	3.136097	100.213541	2.568313
min	18.000000	-8019.000000	1.000000	0.000000	1.000000	-1.000000	0.000000
25%	33.000000	73.000000	8.000000	103.000000	1.000000	-1.000000	0.000000
50%	39.000000	448.000000	16.000000	181.000000	2.000000	-1.000000	0.000000
75%	48.000000	1435.000000	21.000000	317.000000	3.000000	-1.000000	0.000000
max	95.000000	102127.000000	31.000000	3881.000000	55.000000	854.000000	275.000000

#用热力图考察数值型特征与目标之间的关系
num_train = train[['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous','y']]
plt.figure(figsize = (10,8))
corr = num_train.corr()
sns.heatmap(corr ,annot=True, cmap= 'Spectral_r')

<matplotlib.axes._subplots.AxesSubplot at 0x23a886ed9a0>

从中可见duration(最后一次联系的交流时长)和是否购买之间存在着较大的相关性， balance存在着较低的相关性，特征previous和pdays, campaign和day之间相关性较强。

age 年龄

In [13]:

plt.figure()
sns.boxenplot(x='y', y=u'age', data=train)
plt.show()

In [14]:

train[train['y']==0]['age'].plot(kind='kde',label='0')
train[train['y']==1]['age'].plot(kind='kde',label='1')
plt.legend()
plt.show()

两类客户的购买年龄分布差异不大；

balance 每年账户的平均余额

In [15]:

train['balance'].plot(kind='hist')
plt.show()

duration 最后一次联系的交流时长

In [16]:

plt.figure()
sns.boxplot(y=u'duration', data=train)
plt.show()

campaign 在本次活动中，与该客户交流过的次数

In [17]:

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.boxenplot(x='y', y=u'campaign', data=train)
plt.subplot(1,2,2)
sns.boxplot(y=u'campaign', data=train)
plt.show()

pdays 距离上次活动最后一次联系该客户，过去了多久（999表示没有联系过）

In [18]:

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.boxenplot(x='y', y=u'pdays', data=train)
plt.subplot(1,2,2)
sns.boxplot(y=u'pdays', data=train)
plt.show()

previous 在本次活动之前，与该客户交流过的次数

In [19]:

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.boxenplot(x='y', y=u'previous', data=train)
plt.subplot(1,2,2)
sns.boxplot(y=u'previous', data=train)
plt.show()

5.特征工程

In [20]:

from scipy.stats import chi2_contingency       # 数值型特征检验，检验特征与标签的关系
from scipy.stats import f_oneway,ttest_ind     # 分类型特征检验，检验特征与标签的关系

In [21]:

#----------数据集处理--------------#
from sklearn.model_selection import train_test_split        # 划分训练集和验证集
from sklearn.model_selection import KFold,StratifiedKFold   # k折交叉
from imblearn.combine import SMOTETomek,SMOTEENN            # 综合采样
from imblearn.over_sampling import SMOTE                    # 过采样
from imblearn.under_sampling import RandomUnderSampler      # 欠采样#----------数据处理--------------#
from sklearn.preprocessing import StandardScaler # 标准化
from sklearn.preprocessing import OneHotEncoder  # 热独编码
from sklearn.preprocessing import OrdinalEncoder

5.1特征处理

数据清洗的步骤为：选择子集——列名重命名——删除重复值——缺失值处理——数据类型转换——数据排序——异常值处理，在这个案例中，在查看空值与缺失值以及对数据的可视化观测后，对连续变量检测异常值后进行处理，保留所有离散变量，并对其做虚拟变量处理。

连续变量即数值化数据做异常值处理

In [22]:

# 异常值处理
def outlier_processing(dfx):df = dfx.copy()q1 = df.quantile(q=0.25)q3 = df.quantile(q=0.75)iqr = q3 - q1Umin = q1 - 1.5*iqrUmax = q3 + 1.5*iqr df[df>Umax] = df[df<=Umax].max()df[df<Umin] = df[df>=Umin].min()return df

In [23]:

train['age']=outlier_processing(train['age'])
train['day']=outlier_processing(train['day'])
train['duration']=outlier_processing(train['duration'])
train['campaign']=outlier_processing(train['campaign'])test['age']=outlier_processing(test['age'])
test['day']=outlier_processing(test['day'])
test['duration']=outlier_processing(test['duration'])
test['campaign']=outlier_processing(test['campaign'])

In [24]:

train[num_features].describe()

Out[24]:

	age	balance	day	duration	campaign	pdays	previous
count	25317.000000	25317.000000	25317.000000	25317.000000	25317.000000	25317.000000	25317.000000
mean	40.859502	1357.555082	15.835289	234.235138	2.391437	40.248766	0.591737
std	10.387365	2999.822811	8.319480	175.395559	1.599851	100.213541	2.568313
min	18.000000	-8019.000000	1.000000	0.000000	1.000000	-1.000000	0.000000
25%	33.000000	73.000000	8.000000	103.000000	1.000000	-1.000000	0.000000
50%	39.000000	448.000000	16.000000	181.000000	2.000000	-1.000000	0.000000
75%	48.000000	1435.000000	21.000000	317.000000	3.000000	-1.000000	0.000000
max	70.000000	102127.000000	31.000000	638.000000	6.000000	854.000000	275.000000

分类变量做编码处理

In [25]:

dummy_train=train.join(pd.get_dummies(train[str_features])).drop(str_features,axis=1).drop(['ID','y'],axis=1)
dummy_test=test.join(pd.get_dummies(test[str_features])).drop(str_features,axis=1).drop(['ID'],axis=1)

5.2统计检验与特征筛选

连续变量-连续变量相关分析

连续变量-分类变量 T检验/方差分析

分类变量-分类变量卡方检验

对类别标签（离散变量）用卡方检验分析重要性

卡方检验认为显著水平大于95%是差异性显著的，这里即看p值是否是p>0.05，若p>0.05，则说明特征不会呈现差异性。

In [26]:

for col in str_features:obs=pd.crosstab(train['y'],train[col],rownames=['y'],colnames=[col])chi2, p, dof, expect = chi2_contingency(obs)print("{} 卡方检验p值: {:.4f}".format(col,p))

job 卡方检验p值: 0.0000
marital 卡方检验p值: 0.0000
education 卡方检验p值: 0.0000
default 卡方检验p值: 0.0001
housing 卡方检验p值: 0.0000
loan 卡方检验p值: 0.0000
contact 卡方检验p值: 0.0000
month 卡方检验p值: 0.0000
poutcome 卡方检验p值: 0.0000

结果显示所有类型变量均显著。

对连续变量做方差分析进行特征筛选

In [27]:

from sklearn.feature_selection import SelectKBest,f_classiff,p=f_classif(train[num_features],train['y'])
k = f.shape[0] - (p > 0.05).sum()
selector = SelectKBest(f_classif, k=k)
selector.fit(train[num_features],train['y'])print('scores_:',selector.scores_)
print('pvalues_:',selector.pvalues_)
print('selected index:',selector.get_support(True))

scores_: [  13.38856992   84.16396612   25.76507245 4405.56959938  193.97418155296.33099313  199.09942912]
pvalues_: [2.53676251e-04 4.89124305e-20 3.88332900e-07 0.00000000e+006.26768275e-44 4.93591331e-66 4.86613654e-45]
selected index: [0 1 2 3 4 5 6]

结果显示所有连续变量均显著。

In [28]:

# 标准化，返回值为标准化后的数据
standardScaler=StandardScaler()
ss=standardScaler.fit(dummy_train.loc[:,num_features])
dummy_train.loc[:,num_features]=ss.transform(dummy_train.loc[:,num_features])
dummy_test.loc[:,num_features]=ss.transform(dummy_test.loc[:,num_features])

5.3.不平衡数据集处理

In [29]:

X=dummy_train
y=train['y']

先拆分数据集为训练集和验证集，对训练集做采样处理，验证集不动

In [30]:

X_train,X_valid,y_train,y_valid=train_test_split(X,y,test_size=0.2,random_state=2021)

In [31]:

smote_tomek = SMOTETomek(random_state=2021)
X_resampled, y_resampled = smote_tomek.fit_resample(X_train, y_train)

6.数据建模

6.1构建模型以及模型评估

In [41]:

#----------建模工具--------------#
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipelineimport lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [42]:

#----------模型评估工具----------#
from sklearn.metrics import confusion_matrix # 混淆矩阵
from sklearn.metrics import classification_report
from sklearn.metrics import recall_score,f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

本文主要考虑逻辑回归、支持向量机、随机森林和LGBM算法

In [43]:

#逻辑回归
param = {"penalty": ["l1", "l2", ], "C": [0.1, 1, 10], "solver": ["liblinear","saga"]}
gs = GridSearchCV(estimator=LogisticRegression(), param_grid=param, cv=2, scoring="roc_auc",verbose=10) 
gs.fit(X_resampled,y_resampled) 
print(gs.best_params_) 
y_pred = gs.best_estimator_.predict(X_valid) 
print(classification_report(y_valid, y_pred))

{'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}precision    recall  f1-score   support0       0.93      0.94      0.93      44681       0.51      0.51      0.51       596accuracy                           0.88      5064macro avg       0.72      0.72      0.72      5064
weighted avg       0.88      0.88      0.88      5064

In [44]:

# 训练集
confusion_matrix(y_resampled,gs.best_estimator_.predict(X_resampled),labels=[1,0])

Out[44]:

array([[16129,  1742],[ 1080, 16791]], dtype=int64)

In [45]:

# 验证集
confusion_matrix(y_valid,y_pred,labels=[1,0])

Out[45]:

array([[ 302,  294],[ 290, 4178]], dtype=int64)

In [46]:

#画roc-auc曲线
def get_rocauc(X,y,clf):from sklearn.metrics import roc_curveFPR,recall,thresholds=roc_curve(y,clf.predict_proba(X)[:,1],pos_label=1)area=roc_auc_score(y,clf.predict_proba(X)[:,1])maxindex=(recall-FPR).tolist().index(max(recall-FPR))threshold=thresholds[maxindex]plt.figure()plt.plot(FPR,recall,color='red',label='ROC curve (area = %0.2f)'%area)plt.plot([0,1],[0,1],color='black',linestyle='--')plt.scatter(FPR[maxindex],recall[maxindex],c='black',s=30)plt.xlim([-0.05,1.05])plt.ylim([-0.05,1.05])plt.xlabel('False Positive Rate')plt.ylabel('Recall')plt.title('Receiver operating characteristic example')plt.legend(loc='lower right')plt.show()return threshold

In [47]:

threshold=get_rocauc(X_resampled, y_resampled,gs.best_estimator_)

左上角的点即是Recall和FPR平衡的点，threshold即为平衡点对应的阀值。可以根据阀值重新来构建预测结果：

In [48]:

# 阈值调整
def get_ypred(X,clf,threshold):y_pred=[]for i in clf.predict_proba(X)[:,1]:if i > threshold:y_pred.append(1)else:y_pred.append(0)return y_pred# ytrain_pred=get_ypred(Xtrain,clf,threshold)

In [49]:

# 支持向量机
param = {"C": list(np.linspace(0,1,10))} 
gs = GridSearchCV(estimator=SVC(probability=True), param_grid=param, cv=2, scoring="roc_auc", n_jobs=-1, verbose=10) 
gs.fit(X_resampled,y_resampled)
print(gs.best_params_) 
y_pred = gs.best_estimator_.predict(X_valid) 
print(classification_report(y_valid, y_pred))

Fitting 2 folds for each of 10 candidates, totalling 20 fits
{'C': 1.0}precision    recall  f1-score   support0       0.94      0.93      0.94      44681       0.52      0.58      0.55       596accuracy                           0.89      5064macro avg       0.73      0.75      0.74      5064
weighted avg       0.89      0.89      0.89      5064

In [53]:

# 随机森林
param = {'n_estimators':[500,520,530],'max_features':[0.74,0.75,0.77],'min_samples_leaf':[2,3,4]}
gs = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param, cv=2, scoring="roc_auc", n_jobs=-1, verbose=10) 
gs.fit(X_resampled,y_resampled)
print(gs.best_params_) 
y_pred = gs.best_estimator_.predict(X_valid) 
print(classification_report(y_valid, y_pred))

Fitting 2 folds for each of 27 candidates, totalling 54 fits
{'max_features': 0.74, 'min_samples_leaf': 2, 'n_estimators': 530}precision    recall  f1-score   support0       0.95      0.92      0.93      44681       0.51      0.64      0.57       596accuracy                           0.89      5064macro avg       0.73      0.78      0.75      5064
weighted avg       0.90      0.89      0.89      5064

In [56]:

from lightgbm.sklearn import LGBMClassifier
param = {'max_depth': [5,7,9,11,13],'learning_rate': [0.01,0.03,0.05,0.1],'num_leaves': [30,90,120],'n_estimators': [1000,1500,2000,2500]}
gs = GridSearchCV(estimator=LGBMClassifier(), param_grid=param, cv=5, scoring="roc_auc", n_jobs=-1, verbose=10) 
gs.fit(X_resampled,y_resampled)
print(gs.best_params_) 
y_pred = gs.best_estimator_.predict(X_valid) 
print(classification_report(y_valid, y_pred))

Fitting 5 folds for each of 240 candidates, totalling 1200 fits
{'learning_rate': 0.01, 'max_depth': 13, 'n_estimators': 2000, 'num_leaves': 120}precision    recall  f1-score   support0       0.94      0.95      0.94      44681       0.59      0.53      0.56       596accuracy                           0.90      5064macro avg       0.77      0.74      0.75      5064
weighted avg       0.90      0.90      0.90      5064

from sklearn.metrics import roc_auc_score
LR_auc = roc_auc_score(y_valid, gs1.best_estimator_.predict_proba(X_valid)[:, 1])
svc_auc = roc_auc_score(y_valid, gs2.best_estimator_.predict_proba(X_valid)[:, 1])
rf_auc = roc_auc_score(y_valid, gs3.best_estimator_.predict_proba(X_valid)[:, 1])
LGBM_auc = roc_auc_score(y_valid, gs4.best_estimator_.predict_proba(X_valid)[:, 1])print("AUC for LR: {:.3f}".format(LR_auc))
print("AUC for SVC: {:.3f}".format(svc_auc))
print("AUC for Random Forest: {:.3f}".format(rf_auc))
print("AUC for LGBM: {:.3f}".format(LGBM_auc))

AUC for LR: 0.894
AUC for SVC: 0.907
AUC for Random Forest: 0.917
AUC for LGBM: 0.926

6.2方案实施

通过对比逻辑回归、支持向量机、随机森林和LGBM算法的AUC得出最佳参数所构建的模型，得出LGBM算法参数为{'learning_rate': 0.01, 'max_depth': 13, 'n_estimators': 2000, 'num_leaves': 120}时效果最好，precision、recall、f1-score的weighted avg都为0.9，AUC最高为0.926。

In [57]:

X_test = dummy_test
y_test=gs.best_estimator_.predict_proba(X_test)
test['pred']=y_test[:,1]

In [58]:

test[['ID','pred']].to_csv('lgb.csv')

提交结果，最终得分：0.9253091

这篇关于客户购买意愿预测的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！

客户购买意愿预测

1.提出问题

2.理解数据

3.查看数据

4.数据分析

4.1离散变量分析

4.2连续变量分析

5.特征工程

5.1特征处理

5.2统计检验与特征筛选

5.3.不平衡数据集处理

6.数据建模

6.1构建模型以及模型评估

6.2方案实施

相关文章

WiFi6时代来临! 华三H3C NX54路由器还值得购买吗?

购买磨轮平衡机时应该注意什么问题和技巧

客户案例：安全海外中继助力知名家电企业化解海外通邮困境

Tensorflow lstm实现的小说撰写预测

临床基础两手抓！这个12+神经网络模型太贪了，免疫治疗预测、通路重要性、基因重要性、通路交互作用性全部拿下！

【数据库实战】1_Oracle_命中关联人或黑名单或反洗钱客户

结合Python与GUI实现比赛预测与游戏数据分析

CNN-LSTM模型中应用贝叶斯推断进行时间序列预测

多维时序 | Matlab基于SSA-SVR麻雀算法优化支持向量机的数据多变量时间序列预测

力扣 | 递归 | 区间上的动态规划 | 486. 预测赢家