本文主要是介绍对kaggle:travel-insurance的分析,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
https://www.kaggle.com/mhdzahier/travel-insurance
1.导入数据
from matplotlib import pyplot as plt
% matplotlib inline
plt. style. use( 'fivethirtyeight' )
import seaborn as sns
import pandas as pd
sns. set ( )
import warnings
warnings. filterwarnings( 'ignore' )
导入数据时出现错误,参考:https: // blog. csdn. net/ qq_35318838/ article/ details/ 80564938
df = pd. read_csv( r'C:/Users/admin/Desktop/论文/kaggle1/travel-insurance.csv' , engine= 'python' )
df. head( )
Agency Agency Type Distribution Channel Product Name Claim Duration Destination Net Sales Commision (in value) Gender Age 0 CBH Travel Agency Offline Comprehensive Plan No 186 MALAYSIA -29.0 9.57 F 81 1 CBH Travel Agency Offline Comprehensive Plan No 186 MALAYSIA -29.0 9.57 F 71 2 CWT Travel Agency Online Rental Vehicle Excess Insurance No 65 AUSTRALIA -49.5 29.70 NaN 32 3 CWT Travel Agency Online Rental Vehicle Excess Insurance No 60 AUSTRALIA -39.6 23.76 NaN 32 4 CWT Travel Agency Online Rental Vehicle Excess Insurance No 79 ITALY -19.8 11.88 NaN 41
2.属性介绍及分析
order = [ 'Agency' , 'Agency Type' , 'Distribution Channel' , 'Product Name' , 'Duration' , 'Destination' , 'Net Sales' , 'Commision (in value)' , 'Age' , 'Gender' , 'Claim' ]
df = df[ order]
df. columns = [ 'Agency' , 'AgencyType' , 'DistributionChannel' , 'ProductName' , 'Duration' , 'Destination' , 'NetSales' , 'Commision' , 'Age' , 'Gender' , 'Claim' ]
df. head( )
Agency AgencyType DistributionChannel ProductName Duration Destination NetSales Commision Age Gender Claim 0 CBH Travel Agency Offline Comprehensive Plan 186 MALAYSIA -29.0 9.57 81 F No 1 CBH Travel Agency Offline Comprehensive Plan 186 MALAYSIA -29.0 9.57 71 F No 2 CWT Travel Agency Online Rental Vehicle Excess Insurance 65 AUSTRALIA -49.5 29.70 32 NaN No 3 CWT Travel Agency Online Rental Vehicle Excess Insurance 60 AUSTRALIA -39.6 23.76 32 NaN No 4 CWT Travel Agency Online Rental Vehicle Excess Insurance 79 ITALY -19.8 11.88 41 NaN No
2.1数据的基本了解
df. shape
(63326, 11)
df. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 11 columns):
Agency 63326 non-null object
AgencyType 63326 non-null object
DistributionChannel 63326 non-null object
ProductName 63326 non-null object
Duration 63326 non-null int64
Destination 63326 non-null object
NetSales 63326 non-null float64
Commision 63326 non-null float64
Age 63326 non-null int64
Gender 18219 non-null object
Claim 63326 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 5.3+ MB
df. isnull( ) . sum ( )
Agency 0
AgencyType 0
DistributionChannel 0
ProductName 0
Duration 0
Destination 0
NetSales 0
Commision 0
Age 0
Gender 45107
Claim 0
dtype: int64
数据总共63326 条,但只有性别这个属性存在较多缺失值
df. describe( )
Duration NetSales Commision Age count 63326.000000 63326.000000 63326.000000 63326.000000 mean 49.317074 40.702018 9.809992 39.969981 std 101.791566 48.845637 19.804388 14.017010 min -2.000000 -389.000000 0.000000 0.000000 25% 9.000000 18.000000 0.000000 35.000000 50% 22.000000 26.530000 0.000000 36.000000 75% 53.000000 48.000000 11.550000 43.000000 max 4881.000000 810.000000 283.500000 118.000000
从上面可以看出Duration的最小值是- 2 ,但是旅行的时间应该是正值,所以应考虑把负值的样本删除。
2.2 对属性进行分析
属性Claim
df[ 'Claim' ] . value_counts( )
No 62399
Yes 927
Name: Claim, dtype: int64
属性Agency
df[ 'Agency' ] . value_counts( )
EPX 35119
CWT 8580
C2B 8267
JZI 6329
SSI 1056
JWT 749
RAB 725
LWC 689
TST 528
KML 392
ART 331
CCR 194
CBH 101
TTW 98
CSR 86
ADM 82
Name: Agency, dtype: int64
plt. figure( figsize= ( 15 , 7 ) )
sns. countplot( x= 'Agency' , data= df, orient= 'h' )
plt. show( )
df[ 'Claim' ] . groupby( df[ 'Agency' ] ) . value_counts( )
Agency Claim
ADM No 82
ART No 330Yes 1
C2B No 7720Yes 547
CBH No 100Yes 1
CCR No 191Yes 3
CSR No 85Yes 1
CWT No 8494Yes 86
EPX No 34924Yes 195
JWT No 745Yes 4
JZI No 6298Yes 31
KML No 384Yes 8
LWC No 653Yes 36
RAB No 724Yes 1
SSI No 1049Yes 7
TST No 526Yes 2
TTW No 94Yes 4
Name: Claim, dtype: int64
pd. crosstab( df. Agency, df. Claim, margins= True ) . style. background_gradient( )
Claim No Yes All Agency ADM 82 0 82 ART 330 1 331 C2B 7720 547 8267 CBH 100 1 101 CCR 191 3 194 CSR 85 1 86 CWT 8494 86 8580 EPX 34924 195 35119 JWT 745 4 749 JZI 6298 31 6329 KML 384 8 392 LWC 653 36 689 RAB 724 1 725 SSI 1049 7 1056 TST 526 2 528 TTW 94 4 98 All 62399 927 63326
AGency = pd. read_excel( 'C:/Users/admin/Desktop/论文/kaggle1/Agency.xlsx' )
AGency[ 'Percent' ] = AGency[ 'Yes' ] / AGency[ 'All' ]
AGency. sort_values( "Percent" , inplace= True )
AGency
Agency No Yes All Percent 0 ADM 54 0 54 0.000000 12 RAB 709 1 710 0.001408 1 ART 315 1 316 0.003165 14 TST 517 2 519 0.003854 9 JZI 6128 30 6158 0.004872 8 JWT 716 4 720 0.005556 7 EPX 34188 193 34381 0.005614 13 SSI 1024 7 1031 0.006790 3 CBH 93 1 94 0.010638 5 CSR 85 1 86 0.011628 6 CWT 7132 86 7218 0.011915 4 CCR 187 3 190 0.015789 10 KML 380 8 388 0.020619 15 TTW 94 4 98 0.040816 11 LWC 623 36 659 0.054628 2 C2B 7531 547 8078 0.067715
从以上数据可以看到索赔率高的代理商可能是索赔率低的代理商的20 倍(0.067715 是0.003165 的20 倍左右),
所以不同代理商之间还是有明显差异的。应保留这个特征,同时有些代理商的样本数量足够多,
可以考虑直接对某一个代理商的样本进行建模。
属性Agency Type
df[ 'AgencyType' ] . value_counts( )
Travel Agency 45869
Airlines 17457
Name: AgencyType, dtype: int64
pd. crosstab( df. AgencyType, df. Claim, margins= True ) . style. background_gradient( )
Claim No Yes All AgencyType Airlines 16866 591 17457 Travel Agency 45533 336 45869 All 62399 927 63326
AgencyType这个属性里面:Travel Agency的数量多,但是索赔人数少;Airlines的人数少,但是索赔人数多。
所以代理商的类别是航空、旅行社对索赔有明显影响,此属性保留。
属性Distribution Channel
df[ 'DistributionChannel' ] . value_counts( )
Online 62219
Offline 1107
Name: DistributionChannel, dtype: int64
pd. crosstab( df. DistributionChannel, df. Claim, margins= True ) . style. background_gradient( )
Claim No Yes All DistributionChannel Offline 1090 17 1107 Online 61309 910 62219 All 62399 927 63326
忽略其他因素,分别索赔率:
Offline: 17 / 1107 = 0.0153
Online: 910 / 62219 = 0.0146
All: 927 / 63326 = 0.0146
差异不是很大,结合现实也可以知道,购买保险人是否索赔应该与自己的经历有关,而与当初购买的方式关系不大。故此属性删除
属性ProductName
df[ 'ProductName' ] . value_counts( )
Cancellation Plan 18630
2 way Comprehensive Plan 13158
Rental Vehicle Excess Insurance 8580
Basic Plan 5469
Bronze Plan 4049
1 way Comprehensive Plan 3331
Value Plan 2715
Silver Plan 2249
Annual Silver Plan 1423
Ticket Protector 1056
Travel Cruise Protect 527
Comprehensive Plan 364
Gold Plan 352
24 Protect 247
Single Trip Travel Protect Gold 204
Annual Gold Plan 194
Premier Plan 194
Single Trip Travel Protect Silver 173
Annual Travel Protect Gold 100
Annual Travel Protect Silver 86
Individual Comprehensive Plan 74
Single Trip Travel Protect Platinum 73
Annual Travel Protect Platinum 53
Spouse or Parents Comprehensive Plan 15
Child Comprehensive Plan 9
Travel Cruise Protect Family 1
Name: ProductName, dtype: int64
pd. crosstab( df. ProductName, df. Claim, margins= True ) . style. background_gradient( )
Claim No Yes All ProductName 1 way Comprehensive Plan 3322 9 3331 2 way Comprehensive Plan 13016 142 13158 24 Protect 247 0 247 Annual Gold Plan 173 21 194 Annual Silver Plan 1266 157 1423 Annual Travel Protect Gold 90 10 100 Annual Travel Protect Platinum 50 3 53 Annual Travel Protect Silver 82 4 86 Basic Plan 5446 23 5469 Bronze Plan 3839 210 4049 Cancellation Plan 18586 44 18630 Child Comprehensive Plan 9 0 9 Comprehensive Plan 359 5 364 Gold Plan 334 18 352 Individual Comprehensive Plan 71 3 74 Premier Plan 191 3 194 Rental Vehicle Excess Insurance 8494 86 8580 Silver Plan 2108 141 2249 Single Trip Travel Protect Gold 194 10 204 Single Trip Travel Protect Platinum 68 5 73 Single Trip Travel Protect Silver 169 4 173 Spouse or Parents Comprehensive Plan 14 1 15 Ticket Protector 1049 7 1056 Travel Cruise Protect 525 2 527 Travel Cruise Protect Family 1 0 1 Value Plan 2696 19 2715 All 62399 927 63326
Productnames = pd. read_excel( 'C:/Users/admin/Desktop/论文/kaggle1/Product Name.xlsx' )
Productnames[ 'Percent' ] = Productnames[ 'Yes' ] / Productnames[ 'All' ]
Productnames. sort_values( "Percent" , inplace= True )
Productnames
Claim No Yes All Percent 2 24 Protect 247 0 247 0.000000 24 Travel Cruise Protect Family 1 0 1 0.000000 11 Child Comprehensive Plan 9 0 9 0.000000 10 Cancellation Plan 18586 44 18630 0.002362 0 1 way Comprehensive Plan 3322 9 3331 0.002702 23 Travel Cruise Protect 525 2 527 0.003795 8 Basic Plan 5446 23 5469 0.004206 22 Ticket Protector 1049 7 1056 0.006629 25 Value Plan 2696 19 2715 0.006998 16 Rental Vehicle Excess Insurance 8494 86 8580 0.010023 1 2 way Comprehensive Plan 13016 142 13158 0.010792 12 Comprehensive Plan 359 5 364 0.013736 15 Premier Plan 191 3 194 0.015464 20 Single Trip Travel Protect Silver 169 4 173 0.023121 14 Individual Comprehensive Plan 71 3 74 0.040541 7 Annual Travel Protect Silver 82 4 86 0.046512 18 Single Trip Travel Protect Gold 194 10 204 0.049020 13 Gold Plan 334 18 352 0.051136 9 Bronze Plan 3839 210 4049 0.051865 6 Annual Travel Protect Platinum 50 3 53 0.056604 17 Silver Plan 2108 141 2249 0.062695 21 Spouse or Parents Comprehensive Plan 14 1 15 0.066667 19 Single Trip Travel Protect Platinum 68 5 73 0.068493 5 Annual Travel Protect Gold 90 10 100 0.100000 3 Annual Gold Plan 173 21 194 0.108247 4 Annual Silver Plan 1266 157 1423 0.110330
同理,容易看出不同保险产品之间的索赔率差异较大,此属性保留。
属性Destination
df[ 'Destination' ] . value_counts( ) . head( 10 )
SINGAPORE 13255
MALAYSIA 5930
THAILAND 5894
CHINA 4796
AUSTRALIA 3694
INDONESIA 3452
UNITED STATES 2530
PHILIPPINES 2490
HONG KONG 2411
INDIA 2251
Name: Destination, dtype: int64
由于国家有149 个,这里就不列出了,发现前10 名的目的地去的人数很多,后续可以考虑仅针对去某个国家的客户进行研究。
属性Age
plt. figure( figsize= ( 15 , 7 ) )
df[ 'Age' ] . hist( bins= 12 )
plt. xlim( 0 , 120 )
plt. xlabel( 'Age' , fontsize= 14 )
plt. ylabel( 'count' , fontsize= 14 )
plt. show( )
年龄分布范围比较广泛,所以保留比较好
3.数据清洗与准备
df. drop( [ "Destination" , "Gender" , "DistributionChannel" ] , inplace= True , axis= 1 )
df = df. loc[ ( df[ "Duration" ] > 0 ) & ( df[ "NetSales" ] > 0 ) ]
df. shape
(60700, 8)
删除的样本数量:63326 - 60700 = 2626
df1 = df. loc[ ( df[ "Agency" ] == "EPX" ) ]
df1. drop( [ "Agency" ] , inplace= True , axis= 1 )
df1. head( )
AgencyType ProductName Duration NetSales Commision Age Claim 33 Travel Agency 2 way Comprehensive Plan 9 25.0 0.0 36 No 34 Travel Agency 2 way Comprehensive Plan 12 24.0 0.0 36 No 35 Travel Agency 2 way Comprehensive Plan 45 20.0 0.0 36 No 36 Travel Agency 2 way Comprehensive Plan 35 27.0 0.0 36 No 37 Travel Agency 2 way Comprehensive Plan 28 20.0 0.0 44 No
df1[ 'Claim' ] . value_counts( )
No 34188
Yes 193
Name: Claim, dtype: int64
df1[ 'ProductName' ] . value_counts( )
Cancellation Plan 18212
2 way Comprehensive Plan 12907
1 way Comprehensive Plan 3262
Name: ProductName, dtype: int64
df1[ 'Commision' ] . value_counts( )
0.0 34381
Name: Commision, dtype: int64
将分类变量转为示性变量
dummyX = pd. get_dummies( df1[ [ 'ProductName' ] ] )
dummyX. head( )
ProductName_1 way Comprehensive Plan ProductName_2 way Comprehensive Plan ProductName_Cancellation Plan 33 0 1 0 34 0 1 0 35 0 1 0 36 0 1 0 37 0 1 0
from sklearn import preprocessing
labelList= df1[ "Claim" ]
lb = preprocessing. LabelBinarizer( )
Y= lb. fit_transform( labelList)
Y
array([[0],[0],[0],..., [0],[0],[0]])
合并特征
X = pd. concat( [ dummyX, df1[ [ 'Duration' , 'NetSales' , 'Age' ] ] ] , axis= 1 )
X. head( )
ProductName_1 way Comprehensive Plan ProductName_2 way Comprehensive Plan ProductName_Cancellation Plan Duration NetSales Age 33 0 1 0 9 25.0 36 34 0 1 0 12 24.0 36 35 0 1 0 45 20.0 36 36 0 1 0 35 27.0 36 37 0 1 0 28 20.0 44
平衡样本
from imblearn. over_sampling import SMOTE
sm = SMOTE( random_state= 42 )
df_x, df_y = sm. fit_sample( X, Y)
df_x= pd. DataFrame( df_x)
df_x. columns = X. columns
df_y = pd. DataFrame( df_y)
df_y. columns = [ "Claim" ]
X = df_x
Y = df_y
决策树
from sklearn. model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.2 , random_state = 1 )
from sklearn. tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier( )
classifier. fit( X_train, y_train)
y_pred = classifier. predict( X_test)
from sklearn. metrics import accuracy_score
print ( 'accuracy is' , accuracy_score( y_pred, y_test) )
accuracy is 0.983620941796
可视化直接崩掉!!,电脑卡住了,样本数太大了。
import graphviz
import pydotplus
from sklearn import tree
dot_data = tree. export_graphviz( classifier, out_file= None , feature_names= X. columns, class_names= [ '0' , '1' ] , filled= True , rounded= True , special_characters= True )
graph = pydotplus. graph_from_dot_data( dot_data)
with open ( '1.png' , 'wb' ) as f: f. write( graph. create_png( ) )
这篇关于对kaggle:travel-insurance的分析的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!