[075] Cardiovascular Disease Prediction with KNN and Logistic Regression

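Contents

I. Project Overview

II. Data Exploration

1. Basic information about the dataset

2. Sex and disease

3. Age and disease

4. Height, weight, and disease

III. Modeling

1. Computing correlation coefficients

2. Defining helper functions

3. Splitting the dataset

4. A first logistic regression model

5. Standardizing the data

6. Tuning with KNN

7. Model selection

8. KNN performance

9. Logistic regression performance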

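I. Project Overview

1. Background

The dataset contains 70,000 patient records with features such as age, sex, systolic blood pressure, and diastolic blood pressure, 12 columns in total (11 features plus the target).
The target class "cardio" is 1 when the patient has cardiovascular disease and 0 when the patient is healthy.

Original article: WeChat official account python宝

2. Importing packages and loading the data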

# Import the required packages
import warnings

import pandas as pd  # data processing
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # plotting
import pandas_profiling  # newer releases are published as ydata-profiling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore")

# Raw string, so the backslashes in the Windows path are not read as escapes
data = pd.read_csv(r'D:\A\AI-master\py-data\cardio_train.csv', sep=';')
data.head()
data.shape  # (rows, columns)

   id    age  gender  height  weight  ap_hi  ap_lo  cholesterol  gluc  smoke  alco  active  cardio
0   0  18393       2     168    62.0    110     80            1     1      0     0       1       0
1   1  20228       1     156    85.0    140     90            3     1      0     0       1       1
2   2  18857       1     165    64.0    130     70            3     1      0     0       0       1
3   3  17623       2     169    82.0    150    100            1     1      0     0       1       1
4   4  17474       1     156    56.0    100     60            1     1      0     0       0       0

(70000, 13)

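II. Data Exploration

1. Basic information about the dataset

# The outputs below no longer list the id column, so it was presumably
# dropped at this point (this step is missing from the original listing).
data = data.drop('id', axis=1)

# info() gives an overview of the data: number of rows and columns,
# column names, non-null counts, column dtypes, and memory usage.
data.info()

# describe() returns basic statistics for each column: mean, standard
# deviation, minimum, maximum, and quartiles.
data.describe()

# pandas-profiling builds a detailed report directly from a DataFrame,
# with far more detail than describe().
pfr = pandas_profiling.ProfileReport(data)

# Export the report. pandas-profiling currently only exports HTML. To get
# an image, generate the HTML file first and then take a screenshot with
# Chrome's built-in capture feature.
pfr.to_file('report.html')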

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 12 columns):
age            70000 non-null int64
gender         70000 non-null int64
height         70000 non-null int64
weight         70000 non-null float64
ap_hi          70000 non-null int64
ap_lo          70000 non-null int64
cholesterol    70000 non-null int64
gluc           70000 non-null int64
smoke          70000 non-null int64
alco           70000 non-null int64
active         70000 non-null int64
cardio         70000 non-null int64
dtypes: float64(1), int64(11)
memory usage: 6.4 MB

               age        gender        height        weight         ap_hi         ap_lo   cholesterol          gluc         smoke          alco        active        cardio
count  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000  70000.000000
mean   19468.865814      1.349571    164.359229     74.205690    128.817286     96.630414      1.366871      1.226457      0.088129      0.053771      0.803729      0.499700
std     2467.251667      0.476838      8.210126     14.395757    154.011419    188.472530      0.680250      0.572270      0.283484      0.225568      0.397179      0.500003
min    10798.000000      1.000000     55.000000     10.000000   -150.000000    -70.000000      1.000000      1.000000      0.000000      0.000000      0.000000      0.000000
25%    17664.000000      1.000000    159.000000     65.000000    120.000000     80.000000      1.000000      1.000000      0.000000      0.000000      1.000000      0.000000
50%    19703.000000      1.000000    165.000000     72.000000    120.000000     80.000000      1.000000      1.000000      0.000000      0.000000      1.000000      0.000000
75%    21327.000000      2.000000    170.000000     82.000000    140.000000     90.000000      2.000000      1.000000      0.000000      0.000000      1.000000      1.000000
max    23713.000000      2.000000    250.000000    200.000000  16020.000000  11000.000000      3.000000      3.000000      1.000000      1.000000      1.000000      1.000000

2. Sex and disease

sns.countplot(x='cardio',data=data,hue='gender')
plt.show()

3. Age and disease

sns.boxplot(x='cardio',y='age',data=data)
plt.show()

4. Height, weight, and disease

plt.figure(figsize=(14,6))
plt.subplot(1,2,1)
sns.boxplot(x='cardio',y='height',data=data,palette='winter')
plt.subplot(1,2,2)
sns.boxplot(x='cardio',y='weight',data=data,palette='summer')
plt.show()

III. Modeling

1. Computing correlation coefficients

correlations = data.corr()['cardio'].drop('cardio')
print(correlations)

age            0.238159
gender         0.008109
height        -0.010821
weight         0.181660
ap_hi          0.054475
ap_lo          0.065719
cholesterol    0.221147
gluc           0.089307
smoke         -0.015486
alco          -0.007330
active        -0.035653
Name: cardio, dtype: float64

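2. Defining helper functions

def feat_select(threshold):
    # Keep the features whose absolute correlation with 'cardio'
    # exceeds the threshold.
    abs_cor = correlations.abs()
    features = abs_cor[abs_cor > threshold].index.tolist()
    return features

def model(mod, X_tr, X_te):
    # Fit on the training features and report accuracy on the test features.
    # Relies on the global y_train and y_test defined in the next step.
    mod.fit(X_tr, y_train)
    pred = mod.predict(X_te)
    print('Model score = ', mod.score(X_te, y_test) * 100, '%')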
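To make the thresholding step concrete, here is a minimal sketch of the same idea on synthetic data. The DataFrame `toy`, its column names, and the helper `select_by_correlation` are hypothetical and only illustrate the technique; they are not part of the original project.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical, not the cardio dataset): one column that
# tracks the target closely and one column of pure noise.
rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
toy = pd.DataFrame({
    'strong': signal + 0.1 * rng.normal(size=n),
    'noise': rng.normal(size=n),
    'target': (signal > 0).astype(int),
})

# Pearson correlation of every feature with the target.
toy_corr = toy.corr()['target'].drop('target')

def select_by_correlation(corr, threshold):
    """Return the feature names whose |correlation| exceeds the threshold."""
    abs_corr = corr.abs()
    return abs_corr[abs_corr > threshold].index.tolist()

selected = select_by_correlation(toy_corr, 0.5)
print(toy_corr.round(3))
print(selected)
```

Passing the correlations in as an argument, rather than reading a global as `feat_select` does, makes the helper easier to reuse on other data.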

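3. Splitting the dataset

# Hold out roughly 15% of the rows at random as a validation set
msk = np.random.rand(len(data)) < 0.85
df_train_test = data[msk]
df_val = data[~msk]

# Split the remaining rows into training and test sets (80/20)
X = df_train_test.drop('cardio', axis=1)
y = df_train_test['cardio']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=70)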
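The random mask above is not seeded, so the validation set changes from run to run. A reproducible variant could call `train_test_split` twice with a fixed `random_state`. This is a sketch on a hypothetical toy DataFrame, and the `stratify` argument is an addition that the original code does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the cardio DataFrame.
toy = pd.DataFrame({'feature': np.arange(100), 'cardio': np.tile([0, 1], 50)})

# First split: hold out 15% of the rows as a validation set.
df_rest, df_holdout = train_test_split(
    toy, test_size=0.15, random_state=70, stratify=toy['cardio'])

# Second split: divide the remaining rows into training and test sets.
X_rest = df_rest.drop('cardio', axis=1)
y_rest = df_rest['cardio']
X_tr, X_te, y_tr, y_te = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=70, stratify=y_rest)

print(len(X_tr), len(X_te), len(df_holdout))
```

With a fixed seed, every run produces the same three subsets, which makes the reported scores repeatable.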

4. A first logistic regression model

# Logistic regression, evaluated on the features selected at each threshold
lr = LogisticRegression()
threshold = [0.001, 0.002, 0.005, 0.01, 0.05, 0.1]
for i in threshold:
    print("Threshold is {}".format(i))
    feature_i = feat_select(i)
    X_train_i = X_train[feature_i]
    X_test_i = X_test[feature_i]
    model(lr, X_train_i, X_test_i)

Threshold is 0.001
Model score =  70.79104102004865 %
Threshold is 0.002
Model score =  70.79104102004865 %
Threshold is 0.005
Model score =  70.79104102004865 %
Threshold is 0.01
Model score =  70.64843553393172 %
Threshold is 0.05
Model score =  72.30098146128681 %
Threshold is 0.1
Model score =  61.30358191426893 %

The accuracy is not high, so the data is processed further.

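5. Standardizing the data

# Standardize the training features to zero mean and unit variance
scale = StandardScaler()
scale.fit(X_train)
X_train_scaled = scale.transform(X_train)
X_train_ = pd.DataFrame(X_train_scaled, columns=X_train.columns)

# Note: the scaler is refitted on the test set here, as in the original
# listing. The more common practice is to reuse the statistics fitted on
# the training set.
scale.fit(X_test)
X_test_scaled = scale.transform(X_test)
X_test_ = pd.DataFrame(X_test_scaled, columns=X_test.columns)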
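For comparison, the usual pattern fits the scaler on the training split only and reuses those statistics for the test split, so no information from the test data enters the preprocessing. The sketch below uses hypothetical random arrays. It is not the procedure that produced the results reported in this article.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical arrays standing in for the training and test features.
rng = np.random.default_rng(0)
X_tr_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_te_toy = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only...
X_tr_scaled = scaler.fit_transform(X_tr_toy)
# ...and apply the same statistics to the test data.
X_te_scaled = scaler.transform(X_te_toy)

print(X_tr_scaled.mean(axis=0).round(3))  # close to 0 by construction
print(X_te_scaled.mean(axis=0).round(3))  # near 0, but not exactly 0
```

The test-set means are close to zero without being exactly zero, which is expected: the test data is expressed in the units learned from the training data.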

6. Tuning with KNN

# For each threshold, plot the test error of KNN for k = 1..29
for i in threshold:
    feature = feat_select(i)
    X_train_k = X_train_[feature]
    X_test_k = X_test_[feature]
    err = []
    for j in range(1, 30):
        knn = KNeighborsClassifier(n_neighbors=j)
        knn.fit(X_train_k, y_train)
        pred_j = knn.predict(X_test_k)
        err.append(np.mean(y_test != pred_j))
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, 30), err)
    plt.title('Threshold of {}'.format(i))
    plt.xlabel('K value')
    plt.ylabel('Error')
plt.show()
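The loop above reads the best k off a plot of test error. One alternative, sketched here on synthetic data from `make_classification`, estimates the error with 5-fold cross-validation and picks the k with the lowest mean error. The names `X_toy`, `y_toy`, and `cv_error` are hypothetical, and cross-validation is a swapped-in technique, not what the original article does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data (hypothetical stand-in for the scaled features).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

k_values = list(range(1, 30))
cv_error = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds; the error is its complement.
    acc = cross_val_score(knn_k, X_toy, y_toy, cv=5).mean()
    cv_error.append(1 - acc)

# k with the lowest cross-validated error
best_k = k_values[int(np.argmin(cv_error))]
print(best_k, round(min(cv_error), 3))
```

Because the folds are carved out of the training data, the test set stays untouched until the final evaluation.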

7. Model selection

We finally select the features that pass the 0.05 threshold as the model input:

feat_final = feat_select(0.05)
print(feat_final)

['age', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc']

8. KNN performance

X_train = X_train_[feat_final]
X_val = np.asanyarray(df_val[feat_final])
y_val = np.asanyarray(df_val['cardio'])

# Standardize the validation features (the scaler is refitted on the
# validation set, as in the original listing)
scale.fit(X_val)
X_val_scaled = scale.transform(X_val)
X_val_ = pd.DataFrame(X_val_scaled, columns=df_val[feat_final].columns)

# KNN with k=15
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
pred = knn.predict(X_val_)

# Reports
print('Confusion Matrix =\n', confusion_matrix(y_val, pred))
print('\n', classification_report(y_val, pred))

Confusion Matrix =
 [[3775 1507]
 [1464 3774]]

               precision    recall  f1-score   support

           0       0.72      0.71      0.72      5282
           1       0.71      0.72      0.72      5238

    accuracy                           0.72     10520
   macro avg       0.72      0.72      0.72     10520
weighted avg       0.72      0.72      0.72     10520

9. Logistic regression performance

# Logistic regression on the same selected, standardized features
lr.fit(X_train, y_train)
pred = lr.predict(X_val_)

# Reports
print('Confusion Matrix =\n', confusion_matrix(y_val, pred))
print('\n', classification_report(y_val, pred))

Confusion Matrix =
 [[4261 1021]
 [1951 3287]]

               precision    recall  f1-score   support

           0       0.69      0.81      0.74      5282
           1       0.76      0.63      0.69      5238

    accuracy                           0.72     10520
   macro avg       0.72      0.72      0.72     10520
weighted avg       0.72      0.72      0.72     10520

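10. Summary

KNN and logistic regression perform about the same: both reach an accuracy of 0.72 on the validation set. Logistic regression has higher recall for class 0 (0.81 vs. 0.71) but lower recall for class 1 (0.63 vs. 0.72), so it misses more actual patients.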
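The figures in the two classification reports can be recomputed from the confusion matrices printed above. The sketch below copies the two matrices from the outputs in sections 8 and 9. The helper `summarize` is a hypothetical name introduced for illustration, and it assumes scikit-learn's layout of true labels in rows and predicted labels in columns.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: true class 0/1, columns: predicted class 0/1).
cm_knn = np.array([[3775, 1507], [1464, 3774]])
cm_lr = np.array([[4261, 1021], [1951, 3287]])

def summarize(cm):
    """Accuracy plus precision and recall for class 1 from a 2x2 matrix."""
    tn, fp, fn, tp = cm.ravel()
    return {
        'accuracy': (tp + tn) / cm.sum(),
        'precision_1': tp / (tp + fp),
        'recall_1': tp / (tp + fn),
    }

for name, cm in [('KNN', cm_knn), ('Logistic regression', cm_lr)]:
    metrics = {k: round(float(v), 2) for k, v in summarize(cm).items()}
    print(name, metrics)
```

Rounded to two decimals, the results agree with the reports: both models have an accuracy of 0.72, while class-1 recall is 0.72 for KNN and 0.63 for logistic regression.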