[075] Cardiovascular Disease Prediction with KNN and Logistic Regression
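Contents
I. Project Overview
II. Data Exploration
1. Basic information about the dataset
2. Gender and disease
3. Age and disease
4. Height, weight, and disease
III. Modeling
1. Computing correlation coefficients
2. Writing helper functions
3. Splitting the dataset
4. Initial logistic regression model
5. Standardizing the data
6. Tuning with KNN
7. Model selection
8. KNN performance
9. Logistic regression performance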
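I. Project Overview
1. Background
The dataset contains 70,000 patient records. Besides the id column, each record has 12 columns: 11 features such as age, gender, systolic blood pressure, and diastolic blood pressure, plus the target.
The target class "cardio" is 1 when the patient has cardiovascular disease and 0 when the patient is healthy.
Original article: see the WeChat public account python宝
2. Importing packages and data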
# Import the required packages
import warnings

import pandas as pd # data processing
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # plotting
import pandas_profiling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore")

# Raw string so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r'D:\A\AI-master\py-data\cardio_train.csv',sep=';')
data.head()
data.shape
# The outputs from info() onward no longer contain id, so it is presumably dropped at this point
data = data.drop('id',axis=1)
   id    age  gender  height  weight  ap_hi  ap_lo  cholesterol  gluc  smoke  alco  active  cardio
0   0  18393       2     168    62.0    110     80            1     1      0     0       1       0
1   1  20228       1     156    85.0    140     90            3     1      0     0       1       1
2   2  18857       1     165    64.0    130     70            3     1      0     0       0       1
3   3  17623       2     169    82.0    150    100            1     1      0     0       1       1
4   4  17474       1     156    56.0    100     60            1     1      0     0       0       0
(70000, 13)
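II. Data Exploration
1. Basic information about the dataset
# info() gives an overview of the data: row count, column count, column index, non-null count per column, column dtypes, and memory usage
data.info()
# describe() returns basic statistics: mean, standard deviation, maximum, minimum, quantiles, and so on
data.describe()
# pandas-profiling generates a detailed report from a DataFrame, far more detailed than the describe() summary
pandas_profiling.ProfileReport(data)
# Export the report. pandas-profiling currently only exports HTML files.
# To get an image, generate the HTML file first and then capture it with Chrome's built-in screenshot feature.
pfr = pandas_profiling.ProfileReport(data)
pfr.to_file('report.html')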
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 12 columns):
age 70000 non-null int64
gender 70000 non-null int64
height 70000 non-null int64
weight 70000 non-null float64
ap_hi 70000 non-null int64
ap_lo 70000 non-null int64
cholesterol 70000 non-null int64
gluc 70000 non-null int64
smoke 70000 non-null int64
alco 70000 non-null int64
active 70000 non-null int64
cardio 70000 non-null int64
dtypes: float64(1), int64(11)
memory usage: 6.4 MB
age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio
count 70000.000000 70000.000000 70000.000000 70000.000000 70000.000000 70000.000000 70000.000000 70000.000000 70000.000000 70000.000000 70000.000000 70000.000000
mean 19468.865814 1.349571 164.359229 74.205690 128.817286 96.630414 1.366871 1.226457 0.088129 0.053771 0.803729 0.499700
std 2467.251667 0.476838 8.210126 14.395757 154.011419 188.472530 0.680250 0.572270 0.283484 0.225568 0.397179 0.500003
min 10798.000000 1.000000 55.000000 10.000000 -150.000000 -70.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 17664.000000 1.000000 159.000000 65.000000 120.000000 80.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000
50% 19703.000000 1.000000 165.000000 72.000000 120.000000 80.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000
75% 21327.000000 2.000000 170.000000 82.000000 140.000000 90.000000 2.000000 1.000000 0.000000 0.000000 1.000000 1.000000
max 23713.000000 2.000000 250.000000 200.000000 16020.000000 11000.000000 3.000000 3.000000 1.000000 1.000000 1.000000 1.000000
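Two things stand out in the summary above: age is stored in days (the median of 19703 is about 54 years), and the blood pressure columns contain values that cannot be real measurements (a minimum of -150 and a maximum of 16020 for ap_hi). A minimal sketch of how these could be inspected follows. The sample rows are hypothetical, built from values visible in the outputs above, and the plausibility bounds are illustrative assumptions rather than clinical guidance.

```python
import pandas as pd

# Hypothetical mini-sample: ages are from data.head(), the last two
# blood pressure pairs combine the extremes reported by describe()
sample = pd.DataFrame({
    "age": [18393, 20228, 18857, 17623],
    "ap_hi": [110, 140, -150, 16020],
    "ap_lo": [80, 90, -70, 11000],
})

# Age is stored in days; 365.25 days per year is an approximation
sample["age_years"] = (sample["age"] / 365.25).round(1)

# Assumed plausibility bounds in mmHg (illustrative only)
AP_HI_RANGE = (60, 250)
AP_LO_RANGE = (30, 200)

plausible = (
    sample["ap_hi"].between(*AP_HI_RANGE)
    & sample["ap_lo"].between(*AP_LO_RANGE)
)
n_implausible = int((~plausible).sum())

print(sample)
print("implausible rows:", n_implausible)
```

The article itself trains on the data without this kind of filtering, so the scores reported later include such rows.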
2. Gender and disease
sns.countplot(x='cardio',data=data,hue='gender')
plt.show()
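The count plot shows absolute counts, which are hard to compare when the two gender groups differ in size. A numeric complement is the disease rate within each group. The sketch below uses only the five rows printed by data.head(), so the resulting rates illustrate the method and say nothing about the full dataset.

```python
import pandas as pd

# gender and cardio columns of the five rows shown by data.head()
toy = pd.DataFrame({
    "gender": [2, 1, 1, 2, 1],
    "cardio": [0, 1, 1, 1, 0],
})

# Mean of a 0/1 target per group = share of diseased patients in that group
rate_by_gender = toy.groupby("gender")["cardio"].mean()

# The same information as a row-normalized contingency table
table = pd.crosstab(toy["gender"], toy["cardio"], normalize="index")

print(rate_by_gender)
print(table)
```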
3. Age and disease
sns.boxplot(x='cardio',y='age',data=data)
plt.show()
4. Height, weight, and disease
plt.figure(figsize=(14,6))
plt.subplot(1,2,1)
sns.boxplot(x='cardio',y='height',data=data,palette='winter')
plt.subplot(1,2,2)
sns.boxplot(x='cardio',y='weight',data=data,palette='summer')
plt.show()
III. Modeling
1. Computing correlation coefficients
correlations = data.corr()['cardio'].drop('cardio')
print(correlations)
age 0.238159
gender 0.008109
height -0.010821
weight 0.181660
ap_hi 0.054475
ap_lo 0.065719
cholesterol 0.221147
gluc 0.089307
smoke -0.015486
alco -0.007330
active -0.035653
Name: cardio, dtype: float64
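By default, data.corr() computes the Pearson correlation coefficient, so each number above measures the linear association between one feature and the 0/1 target. The sketch below recomputes the coefficient by hand on a toy frame (the weight and cardio values of the five data.head() rows) and checks it against pandas. The helper name pearson is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# weight and cardio columns of the five rows shown by data.head()
toy = pd.DataFrame({
    "weight": [62.0, 85.0, 64.0, 82.0, 56.0],
    "cardio": [0, 1, 1, 1, 0],
})


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


manual = pearson(toy["weight"], toy["cardio"])
pandas_value = float(toy.corr()["cardio"].drop("cardio")["weight"])

print(manual, pandas_value)
```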
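2. Writing helper functions
def feat_select(threshold):
    # Return the features whose absolute correlation with cardio exceeds the threshold
    abs_cor = correlations.abs()
    features = abs_cor[abs_cor > threshold].index.tolist()
    return features

def model(mod,X_tr,X_te):
    # Fit on the training split and print the accuracy on the test split
    # (relies on the global y_train and y_test defined in the next step)
    mod.fit(X_tr,y_train)
    pred = mod.predict(X_te)
    print('Model score = ',mod.score(X_te,y_test)*100,'%')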
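3. Splitting the dataset
# Hold out roughly 15% of the rows as a validation set; the mask is random and unseeded
msk = np.random.rand(len(data))<0.85
df_train_test = data[msk]
df_val = data[~msk]

X = df_train_test.drop('cardio',axis=1)
y = df_train_test['cardio']
# Split the remaining rows 80/20 into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=70)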
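Because the mask comes from np.random.rand without a seed, the validation set changes on every run, and so do the exact numbers reported later. A minimal sketch of a reproducible variant of the same three-way split is shown below on synthetic data. The seed value, the toy column x, and the use of stratify are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(70)  # seed chosen arbitrarily for the example

# Synthetic stand-in for the real data: one feature and a 0/1 target
toy = pd.DataFrame({
    "x": rng.normal(size=1000),
    "cardio": rng.integers(0, 2, size=1000),
})

# Seeded random mask: about 85% for train/test, the rest for validation
msk = rng.random(len(toy)) < 0.85
df_train_test = toy[msk]
df_val = toy[~msk]

X = df_train_test.drop("cardio", axis=1)
y = df_train_test["cardio"]

# stratify keeps the class balance the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=70, stratify=y
)

print(len(X_train), len(X_test), len(df_val))
```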
4. Initial logistic regression model
# Logistic regression
lr = LogisticRegression()
threshold = [0.001,0.002,0.005,0.01,0.05,0.1]
# Train and score one model per correlation threshold
for i in threshold:
    print("Threshold is {}".format(i))
    feature_i = feat_select(i)
    X_train_i = X_train[feature_i]
    X_test_i = X_test[feature_i]
    model(lr,X_train_i,X_test_i)
Threshold is 0.001
Model score = 70.79104102004865 %
Threshold is 0.002
Model score = 70.79104102004865 %
Threshold is 0.005
Model score = 70.79104102004865 %
Threshold is 0.01
Model score = 70.64843553393172 %
Threshold is 0.05
Model score = 72.30098146128681 %
Threshold is 0.1
Model score = 61.30358191426893 %
The accuracy is not high, so the data is processed further.
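The first three thresholds give exactly the same score because they select exactly the same features. The sketch below reruns the feature selection on the correlation values printed in step 1 and counts the features kept at each threshold. The function mirrors feat_select from step 2, with the series passed in as a default argument so that the block is self-contained.

```python
import pandas as pd

# Correlations with cardio, copied from the output in step 1
correlations = pd.Series({
    "age": 0.238159, "gender": 0.008109, "height": -0.010821,
    "weight": 0.181660, "ap_hi": 0.054475, "ap_lo": 0.065719,
    "cholesterol": 0.221147, "gluc": 0.089307, "smoke": -0.015486,
    "alco": -0.007330, "active": -0.035653,
})


def feat_select(threshold, cor=correlations):
    """Features whose absolute correlation with the target exceeds the threshold."""
    abs_cor = cor.abs()
    return abs_cor[abs_cor > threshold].index.tolist()


thresholds = [0.001, 0.002, 0.005, 0.01, 0.05, 0.1]
counts = {t: len(feat_select(t)) for t in thresholds}
print(counts)
# {0.001: 11, 0.002: 11, 0.005: 11, 0.01: 9, 0.05: 6, 0.1: 3}
```

Thresholds up to 0.005 keep all 11 features, 0.01 drops gender and alco, 0.05 keeps six features, and 0.1 keeps only age, weight, and cholesterol.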
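5. Standardizing the data
# Scale every feature to zero mean and unit variance
scale = StandardScaler()
scale.fit(X_train)
X_train_scaled = scale.transform(X_train)
X_train_ = pd.DataFrame(X_train_scaled,columns=data.columns[:-1])
# Note: the scaler is refit on the test set here; the more common practice is to
# reuse the statistics fitted on the training set
scale.fit(X_test)
X_test_scaled = scale.transform(X_test)
X_test_ = pd.DataFrame(X_test_scaled,columns=data.columns[:-1])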
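The code above fits the scaler a second time on the test set. A more common pattern is to fit StandardScaler once on the training data and apply the same mean and standard deviation to every other split, so that no information from the test data enters preprocessing. The sketch below shows that pattern on synthetic data. The column names and distribution parameters are assumptions chosen to resemble the age and weight columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = ["age", "weight"]

# Synthetic stand-ins for the training and test features
X_train = pd.DataFrame(rng.normal([19000, 74], [2400, 14], size=(800, 2)), columns=cols)
X_test = pd.DataFrame(rng.normal([19000, 74], [2400, 14], size=(200, 2)), columns=cols)

# Fit on the training data only, then reuse the fitted statistics
scaler = StandardScaler().fit(X_train)
X_train_ = pd.DataFrame(scaler.transform(X_train), columns=cols, index=X_train.index)
X_test_ = pd.DataFrame(scaler.transform(X_test), columns=cols, index=X_test.index)

# Training means are zero by construction; test means are only close to zero
print(X_train_.mean().round(3).tolist())
print(X_test_.mean().round(3).tolist())
```

Passing index=X_train.index keeps the scaled frame aligned with y_train, which matters if the two are later combined by label rather than by position.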
6. Tuning with KNN
# For each threshold, plot the test error of KNN for k = 1..29
for i in threshold:
    feature = feat_select(i)
    X_train_k = X_train_[feature]
    X_test_k = X_test_[feature]
    err = []
    for j in range(1,30):
        knn = KNeighborsClassifier(n_neighbors=j)
        knn.fit(X_train_k,y_train)
        pred_j = knn.predict(X_test_k)
        err.append(np.mean(y_test != pred_j))
    plt.figure(figsize=(10,6))
    plt.plot(range(1,30),err)
    plt.title('Threshold of {}'.format(i))
    plt.xlabel('K value')
    plt.ylabel('Error')
    plt.show()
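The loop above produces one error curve per threshold, and the value of k is then read off the plots (the article later uses k=15). The same choice can be made numerically by taking the k with the lowest error. The sketch below does this on synthetic data from make_classification, so the selected k applies only to the toy problem and is not a recommendation for the cardio data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (an assumption, not the cardio data)
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

k_values = list(range(1, 30))
err = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Misclassification rate on the held-out split
    err.append(float(np.mean(knn.predict(X_te) != y_te)))

# argmin returns the first (smallest) k among ties
best_k = k_values[int(np.argmin(err))]
print("best k:", best_k, "error:", min(err))
```

Choosing k on the test split makes the test score optimistic, which is one reason to keep a separate validation set as the article does.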
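7. Model selection
We finally choose the features selected at a threshold of 0.05 as the model input. At this threshold the logistic regression score above was the highest, 72.30%.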
feat_final = feat_select(0.05)
print(feat_final)
['age', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc']
8. KNN performance
X_train = X_train_[feat_final]
X_val = np.asanyarray(df_val[feat_final])
y_val = np.asanyarray(df_val['cardio'])

# Standardize the validation features (the scaler is refit on the validation set)
scale.fit(X_val)
X_val_scaled = scale.transform(X_val)
X_val_ = pd.DataFrame(X_val_scaled,columns=df_val[feat_final].columns)

# knn with k=15
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train,y_train)
pred = knn.predict(X_val_)

print('Confusion Matrix =\n',confusion_matrix(y_val,pred))
print('\n',classification_report(y_val,pred))
Confusion Matrix =
 [[3775 1507]
 [1464 3774]]

              precision    recall  f1-score   support

           0       0.72      0.71      0.72      5282
           1       0.71      0.72      0.72      5238

    accuracy                           0.72     10520
   macro avg       0.72      0.72      0.72     10520
weighted avg       0.72      0.72      0.72     10520
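Every number in the classification report can be derived from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. As a worked check, the sketch below recomputes precision, recall, F1, and accuracy from the KNN matrix printed above using plain NumPy.

```python
import numpy as np

# KNN confusion matrix from the output above: rows = true class, columns = predicted
cm = np.array([[3775, 1507],
               [1464, 3774]])

correct = np.diag(cm)          # correctly classified samples per class
support = cm.sum(axis=1)       # true samples per class
predicted = cm.sum(axis=0)     # predicted samples per class

recall = correct / support
precision = correct / predicted
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1:       ", np.round(f1, 2))
print("accuracy: ", round(float(accuracy), 2))
```

For class 0, for example, recall is 3775 / 5282 ≈ 0.71 and precision is 3775 / (3775 + 1464) ≈ 0.72, which matches the report.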
9. Logistic regression performance
# Logistic regression
lr.fit(X_train,y_train)
pred = lr.predict(X_val_)

# reports
print('Confusion Matrix =\n',confusion_matrix(y_val,pred))
print('\n',classification_report(y_val,pred))
Confusion Matrix =
 [[4261 1021]
 [1951 3287]]

              precision    recall  f1-score   support

           0       0.69      0.81      0.74      5282
           1       0.76      0.63      0.69      5238

    accuracy                           0.72     10520
   macro avg       0.72      0.72      0.72     10520
weighted avg       0.72      0.72      0.72     10520
10. Summary
KNN and logistic regression perform about the same: both reach an accuracy of 0.72 on the validation set.
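Equal accuracy does not mean the two models make the same mistakes. The sketch below puts the two confusion matrices reported above side by side and splits the errors into false positives (healthy patients flagged as diseased) and false negatives (diseased patients missed). Only the printed matrices are used. The dictionary keys are labels chosen for this example.

```python
import numpy as np

# Confusion matrices copied from the two reports above (rows = true, columns = predicted)
matrices = {
    "KNN": np.array([[3775, 1507], [1464, 3774]]),
    "LogReg": np.array([[4261, 1021], [1951, 3287]]),
}

summary = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = (int(v) for v in cm.ravel())
    summary[name] = {
        "accuracy": (tn + tp) / cm.sum(),
        "fp": fp,                       # healthy predicted as diseased
        "fn": fn,                       # diseased predicted as healthy
        "recall_diseased": tp / (tp + fn),
    }

for name, row in summary.items():
    print(name, {k: round(v, 3) if isinstance(v, float) else v for k, v in row.items()})
```

On this validation set the accuracies differ by about 0.0001, but logistic regression misses 1951 diseased patients against 1464 for KNN, while raising fewer false alarms (1021 against 1507). Which trade-off is preferable depends on the cost assigned to each kind of error.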
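About Me: 小婷儿
● Author: 小婷儿, who focuses on Python, data analysis, data mining, and machine learning, with an emphasis on applying these techniques in practice
● Author's blog: https://blog.csdn.net/u010986753
● The material in this series comes from the author's study notes, and parts are compiled from the internet; please excuse any infringement or inaccuracy
● All rights reserved. Sharing is welcome; please credit the source when reposting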