对kaggle:travel-insurance的分析

2023-12-12 12:10
文章标签 分析 travel kaggle insurance

本文主要是介绍对kaggle:travel-insurance的分析,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

https://www.kaggle.com/mhdzahier/travel-insurance

1.导入数据
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
import pandas as pd
sns.set()
import warnings
warnings.filterwarnings('ignore')
导入数据时出现错误,参考:https://blog.csdn.net/qq_35318838/article/details/80564938
df = pd.read_csv(r'C:/Users/admin/Desktop/论文/kaggle1/travel-insurance.csv',engine='python')
df.head()
AgencyAgency TypeDistribution ChannelProduct NameClaimDurationDestinationNet SalesCommision (in value)GenderAge
0CBHTravel AgencyOfflineComprehensive PlanNo186MALAYSIA-29.09.57F81
1CBHTravel AgencyOfflineComprehensive PlanNo186MALAYSIA-29.09.57F71
2CWTTravel AgencyOnlineRental Vehicle Excess InsuranceNo65AUSTRALIA-49.529.70NaN32
3CWTTravel AgencyOnlineRental Vehicle Excess InsuranceNo60AUSTRALIA-39.623.76NaN32
4CWTTravel AgencyOnlineRental Vehicle Excess InsuranceNo79ITALY-19.811.88NaN41
2.属性介绍及分析
order = ['Agency', 'Agency Type', 'Distribution Channel', 'Product Name', 'Duration','Destination','Net Sales', 'Commision (in value)', 'Age','Gender','Claim']
df = df[order]
df.columns = ['Agency', 'AgencyType', 'DistributionChannel', 'ProductName', 'Duration','Destination','NetSales', 'Commision', 'Age','Gender','Claim']
df.head()
AgencyAgencyTypeDistributionChannelProductNameDurationDestinationNetSalesCommisionAgeGenderClaim
0CBHTravel AgencyOfflineComprehensive Plan186MALAYSIA-29.09.5781FNo
1CBHTravel AgencyOfflineComprehensive Plan186MALAYSIA-29.09.5771FNo
2CWTTravel AgencyOnlineRental Vehicle Excess Insurance65AUSTRALIA-49.529.7032NaNNo
3CWTTravel AgencyOnlineRental Vehicle Excess Insurance60AUSTRALIA-39.623.7632NaNNo
4CWTTravel AgencyOnlineRental Vehicle Excess Insurance79ITALY-19.811.8841NaNNo
2.1数据的基本了解
df.shape
(63326, 11)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 11 columns):
Agency                 63326 non-null object
AgencyType             63326 non-null object
DistributionChannel    63326 non-null object
ProductName            63326 non-null object
Duration               63326 non-null int64
Destination            63326 non-null object
NetSales               63326 non-null float64
Commision              63326 non-null float64
Age                    63326 non-null int64
Gender                 18219 non-null object
Claim                  63326 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 5.3+ MB
df.isnull().sum()
Agency                     0
AgencyType                 0
DistributionChannel        0
ProductName                0
Duration                   0
Destination                0
NetSales                   0
Commision                  0
Age                        0
Gender                 45107
Claim                      0
dtype: int64
数据总共63326条,但只有性别这个属性存在较多缺失值
df.describe()
DurationNetSalesCommisionAge
count63326.00000063326.00000063326.00000063326.000000
mean49.31707440.7020189.80999239.969981
std101.79156648.84563719.80438814.017010
min-2.000000-389.0000000.0000000.000000
25%9.00000018.0000000.00000035.000000
50%22.00000026.5300000.00000036.000000
75%53.00000048.00000011.55000043.000000
max4881.000000810.000000283.500000118.000000
从上面可以看出Duration的最小值是-2,但是旅行的时间应该是正值,所以应考虑把负值的样本删除。
2.2 对属性进行分析
属性Claim
df['Claim'].value_counts()
No     62399
Yes      927
Name: Claim, dtype: int64
属性Agency
df['Agency'].value_counts()
EPX    35119
CWT     8580
C2B     8267
JZI     6329
SSI     1056
JWT      749
RAB      725
LWC      689
TST      528
KML      392
ART      331
CCR      194
CBH      101
TTW       98
CSR       86
ADM       82
Name: Agency, dtype: int64
plt.figure(figsize=(15,7))
sns.countplot(x='Agency', data=df,orient='h')
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-0YaLBF1G-1587692825216)(output_18_0.png)]

df['Claim'].groupby(df['Agency']).value_counts()
Agency  Claim
ADM     No          82
ART     No         330Yes          1
C2B     No        7720Yes        547
CBH     No         100Yes          1
CCR     No         191Yes          3
CSR     No          85Yes          1
CWT     No        8494Yes         86
EPX     No       34924Yes        195
JWT     No         745Yes          4
JZI     No        6298Yes         31
KML     No         384Yes          8
LWC     No         653Yes         36
RAB     No         724Yes          1
SSI     No        1049Yes          7
TST     No         526Yes          2
TTW     No          94Yes          4
Name: Claim, dtype: int64
pd.crosstab(df.Agency,df.Claim,margins=True).style.background_gradient()
ClaimNoYesAll
Agency
ADM82082
ART3301331
C2B77205478267
CBH1001101
CCR1913194
CSR85186
CWT8494868580
EPX3492419535119
JWT7454749
JZI6298316329
KML3848392
LWC65336689
RAB7241725
SSI104971056
TST5262528
TTW94498
All6239992763326
AGency = pd.read_excel('C:/Users/admin/Desktop/论文/kaggle1/Agency.xlsx')
AGency['Percent']=AGency['Yes']/AGency['All']
AGency.sort_values("Percent",inplace=True)
AGency
AgencyNoYesAllPercent
0ADM540540.000000
12RAB70917100.001408
1ART31513160.003165
14TST51725190.003854
9JZI61283061580.004872
8JWT71647200.005556
7EPX34188193343810.005614
13SSI1024710310.006790
3CBH931940.010638
5CSR851860.011628
6CWT71328672180.011915
4CCR18731900.015789
10KML38083880.020619
15TTW944980.040816
11LWC623366590.054628
2C2B753154780780.067715
从以上数据可以看到索赔率高的代理商可能是索赔率低的代理商的20倍(0.0677150.00316520倍左右),
所以不同代理商之间还是有明显差异的。应保留这个特征,同时有些代理商的样本数量足够多,
可以考虑直接对某一个代理商的样本进行建模。
属性Agency Type
df['AgencyType'].value_counts()
Travel Agency    45869
Airlines         17457
Name: AgencyType, dtype: int64
pd.crosstab(df.AgencyType,df.Claim,margins=True).style.background_gradient()
ClaimNoYesAll
AgencyType
Airlines1686659117457
Travel Agency4553333645869
All6239992763326
AgencyType这个属性里面:Travel Agency的数量多,但是索赔人数少;Airlines的人数少,但是索赔人数多。
所以代理商的类别是航空、旅行社对索赔有明显影响,此属性保留。
属性Distribution Channel
df['DistributionChannel'].value_counts()
Online     62219
Offline     1107
Name: DistributionChannel, dtype: int64
pd.crosstab(df.DistributionChannel,df.Claim,margins=True).style.background_gradient()
ClaimNoYesAll
DistributionChannel
Offline1090171107
Online6130991062219
All6239992763326
忽略其他因素,分别索赔率:
Offline:17/1107=0.0153
Online:910/62219=0.0146
All:927/63326=0.0146
差异不是很大,结合现实也可以知道,购买保险人是否索赔应该与自己的经历有关,而与当初购买的方式关系不大。故此属性删除
属性ProductName
df['ProductName'].value_counts()
Cancellation Plan                       18630
2 way Comprehensive Plan                13158
Rental Vehicle Excess Insurance          8580
Basic Plan                               5469
Bronze Plan                              4049
1 way Comprehensive Plan                 3331
Value Plan                               2715
Silver Plan                              2249
Annual Silver Plan                       1423
Ticket Protector                         1056
Travel Cruise Protect                     527
Comprehensive Plan                        364
Gold Plan                                 352
24 Protect                                247
Single Trip Travel Protect Gold           204
Annual Gold Plan                          194
Premier Plan                              194
Single Trip Travel Protect Silver         173
Annual Travel Protect Gold                100
Annual Travel Protect Silver               86
Individual Comprehensive Plan              74
Single Trip Travel Protect Platinum        73
Annual Travel Protect Platinum             53
Spouse or Parents Comprehensive Plan       15
Child Comprehensive Plan                    9
Travel Cruise Protect Family                1
Name: ProductName, dtype: int64
pd.crosstab(df.ProductName,df.Claim,margins=True).style.background_gradient()
ClaimNoYesAll
ProductName
1 way Comprehensive Plan332293331
2 way Comprehensive Plan1301614213158
24 Protect2470247
Annual Gold Plan17321194
Annual Silver Plan12661571423
Annual Travel Protect Gold9010100
Annual Travel Protect Platinum50353
Annual Travel Protect Silver82486
Basic Plan5446235469
Bronze Plan38392104049
Cancellation Plan185864418630
Child Comprehensive Plan909
Comprehensive Plan3595364
Gold Plan33418352
Individual Comprehensive Plan71374
Premier Plan1913194
Rental Vehicle Excess Insurance8494868580
Silver Plan21081412249
Single Trip Travel Protect Gold19410204
Single Trip Travel Protect Platinum68573
Single Trip Travel Protect Silver1694173
Spouse or Parents Comprehensive Plan14115
Ticket Protector104971056
Travel Cruise Protect5252527
Travel Cruise Protect Family101
Value Plan2696192715
All6239992763326
Productnames = pd.read_excel('C:/Users/admin/Desktop/论文/kaggle1/Product Name.xlsx')
Productnames['Percent']=Productnames['Yes']/Productnames['All']
Productnames.sort_values("Percent",inplace=True)
Productnames
ClaimNoYesAllPercent
224 Protect24702470.000000
24Travel Cruise Protect Family1010.000000
11Child Comprehensive Plan9090.000000
10Cancellation Plan1858644186300.002362
01 way Comprehensive Plan3322933310.002702
23Travel Cruise Protect52525270.003795
8Basic Plan54462354690.004206
22Ticket Protector1049710560.006629
25Value Plan26961927150.006998
16Rental Vehicle Excess Insurance84948685800.010023
12 way Comprehensive Plan13016142131580.010792
12Comprehensive Plan35953640.013736
15Premier Plan19131940.015464
20Single Trip Travel Protect Silver16941730.023121
14Individual Comprehensive Plan713740.040541
7Annual Travel Protect Silver824860.046512
18Single Trip Travel Protect Gold194102040.049020
13Gold Plan334183520.051136
9Bronze Plan383921040490.051865
6Annual Travel Protect Platinum503530.056604
17Silver Plan210814122490.062695
21Spouse or Parents Comprehensive Plan141150.066667
19Single Trip Travel Protect Platinum685730.068493
5Annual Travel Protect Gold90101000.100000
3Annual Gold Plan173211940.108247
4Annual Silver Plan126615714230.110330
同理,容易看出不同保险产品之间的索赔率差异较大,此属性保留。
属性Destination
df['Destination'].value_counts().head(10)
SINGAPORE        13255
MALAYSIA          5930
THAILAND          5894
CHINA             4796
AUSTRALIA         3694
INDONESIA         3452
UNITED STATES     2530
PHILIPPINES       2490
HONG KONG         2411
INDIA             2251
Name: Destination, dtype: int64
由于国家有149个,这里就不列出了,发现前10名的目的地去的人数很多,后续可以考虑仅针对去某个国家的客户进行研究。
属性Age
plt.figure(figsize=(15,7))
df['Age'].hist(bins=12)
plt.xlim(0,120)
plt.xlabel('Age',fontsize=14)
plt.ylabel('count',fontsize=14)
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-L2zB2PQv-1587692825226)(output_41_0.png)]

年龄分布范围比较广泛,所以保留比较好
3.数据清洗与准备
df.drop(["Destination","Gender","DistributionChannel"],inplace=True,axis=1)
df = df.loc[(df["Duration"]>0) & (df["NetSales"]>0)]
df.shape
(60700, 8)
删除的样本数量:63326-60700=2626
df1 = df.loc[(df["Agency"]=="EPX")]
df1.drop(["Agency"],inplace=True,axis=1)
df1.head()
AgencyTypeProductNameDurationNetSalesCommisionAgeClaim
33Travel Agency2 way Comprehensive Plan925.00.036No
34Travel Agency2 way Comprehensive Plan1224.00.036No
35Travel Agency2 way Comprehensive Plan4520.00.036No
36Travel Agency2 way Comprehensive Plan3527.00.036No
37Travel Agency2 way Comprehensive Plan2820.00.044No
df1['Claim'].value_counts()
No     34188
Yes      193
Name: Claim, dtype: int64
df1['ProductName'].value_counts()
Cancellation Plan           18212
2 way Comprehensive Plan    12907
1 way Comprehensive Plan     3262
Name: ProductName, dtype: int64
df1['Commision'].value_counts()   #删除
0.0    34381
Name: Commision, dtype: int64
将分类变量转为示性变量
dummyX = pd.get_dummies(df1[['ProductName']])
dummyX.head()
ProductName_1 way Comprehensive PlanProductName_2 way Comprehensive PlanProductName_Cancellation Plan
33010
34010
35010
36010
37010
from sklearn import preprocessing
# 标签矩阵
labelList= df1["Claim"]
# 将标签矩阵二值化
lb = preprocessing.LabelBinarizer()
Y=lb.fit_transform(labelList)
Y
array([[0],[0],[0],..., [0],[0],[0]])
合并特征
X = pd.concat([dummyX,df1[['Duration','NetSales','Age']]], axis=1)
X.head()
ProductName_1 way Comprehensive PlanProductName_2 way Comprehensive PlanProductName_Cancellation PlanDurationNetSalesAge
33010925.036
340101224.036
350104520.036
360103527.036
370102820.044
平衡样本
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
df_x,df_y = sm.fit_sample(X,Y)
df_x= pd.DataFrame(df_x)
df_x.columns = X.columns
df_y = pd.DataFrame(df_y)
df_y.columns = ["Claim"]
X = df_x
Y = df_y
决策树
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)
# Decision Tree's
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import accuracy_score
print('accuracy is',accuracy_score(y_pred,y_test))
accuracy is 0.983620941796
可视化直接崩掉!!,电脑卡住了,样本数太大了。
import graphviz
import pydotplus
from sklearn import tree
dot_data = tree.export_graphviz(classifier, out_file=None,feature_names=X.columns,class_names=['0', '1'],     #类别为0,1,也可以设置为yes,nofilled=True, rounded=True,  #rounded=True,框是圆边special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
with open('1.png', 'wb') as f:f.write(graph.create_png())

这篇关于对kaggle:travel-insurance的分析的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/484601

相关文章

Go标准库常见错误分析和解决办法

《Go标准库常见错误分析和解决办法》Go语言的标准库为开发者提供了丰富且高效的工具,涵盖了从网络编程到文件操作等各个方面,然而,标准库虽好,使用不当却可能适得其反,正所谓工欲善其事,必先利其器,本文将... 目录1. 使用了错误的time.Duration2. time.After导致的内存泄漏3. jsO

Spring事务中@Transactional注解不生效的原因分析与解决

《Spring事务中@Transactional注解不生效的原因分析与解决》在Spring框架中,@Transactional注解是管理数据库事务的核心方式,本文将深入分析事务自调用的底层原理,解释为... 目录1. 引言2. 事务自调用问题重现2.1 示例代码2.2 问题现象3. 为什么事务自调用会失效3

找不到Anaconda prompt终端的原因分析及解决方案

《找不到Anacondaprompt终端的原因分析及解决方案》因为anaconda还没有初始化,在安装anaconda的过程中,有一行是否要添加anaconda到菜单目录中,由于没有勾选,导致没有菜... 目录问题原因问http://www.chinasem.cn题解决安装了 Anaconda 却找不到 An

Spring定时任务只执行一次的原因分析与解决方案

《Spring定时任务只执行一次的原因分析与解决方案》在使用Spring的@Scheduled定时任务时,你是否遇到过任务只执行一次,后续不再触发的情况?这种情况可能由多种原因导致,如未启用调度、线程... 目录1. 问题背景2. Spring定时任务的基本用法3. 为什么定时任务只执行一次?3.1 未启用

C++ 各种map特点对比分析

《C++各种map特点对比分析》文章比较了C++中不同类型的map(如std::map,std::unordered_map,std::multimap,std::unordered_multima... 目录特点比较C++ 示例代码 ​​​​​​代码解释特点比较1. std::map底层实现:基于红黑

Spring、Spring Boot、Spring Cloud 的区别与联系分析

《Spring、SpringBoot、SpringCloud的区别与联系分析》Spring、SpringBoot和SpringCloud是Java开发中常用的框架,分别针对企业级应用开发、快速开... 目录1. Spring 框架2. Spring Boot3. Spring Cloud总结1. Sprin

Spring 中 BeanFactoryPostProcessor 的作用和示例源码分析

《Spring中BeanFactoryPostProcessor的作用和示例源码分析》Spring的BeanFactoryPostProcessor是容器初始化的扩展接口,允许在Bean实例化前... 目录一、概览1. 核心定位2. 核心功能详解3. 关键特性二、Spring 内置的 BeanFactory

MyBatis-Plus中Service接口的lambdaUpdate用法及实例分析

《MyBatis-Plus中Service接口的lambdaUpdate用法及实例分析》本文将详细讲解MyBatis-Plus中的lambdaUpdate用法,并提供丰富的案例来帮助读者更好地理解和应... 目录深入探索MyBATis-Plus中Service接口的lambdaUpdate用法及示例案例背景

MyBatis-Plus中静态工具Db的多种用法及实例分析

《MyBatis-Plus中静态工具Db的多种用法及实例分析》本文将详细讲解MyBatis-Plus中静态工具Db的各种用法,并结合具体案例进行演示和说明,具有很好的参考价值,希望对大家有所帮助,如有... 目录MyBATis-Plus中静态工具Db的多种用法及实例案例背景使用静态工具Db进行数据库操作插入

Go使用pprof进行CPU,内存和阻塞情况分析

《Go使用pprof进行CPU,内存和阻塞情况分析》Go语言提供了强大的pprof工具,用于分析CPU、内存、Goroutine阻塞等性能问题,帮助开发者优化程序,提高运行效率,下面我们就来深入了解下... 目录1. pprof 介绍2. 快速上手:启用 pprof3. CPU Profiling:分析 C