Using Data Science to Make Your Next Trip on Boston Airbnb
Introduction
Boston is the capital and most populous city of the State of Massachusetts in the United States. Its economy, culture, history, and education attract hundreds of thousands of tourists each year. I had long been eager to visit this beautiful city and finally made plans to go this March. However, an unexpected global pandemic locked me down in NYC and delayed my trip. While staying home, I have been planning my next Airbnb trip to Boston with data science techniques. I think that infusing data science into a trip plan makes it more scientific and more interesting. If you are interested, you can treat this post as a fun and possibly insightful guide for your next trip to Boston.
The open dataset I use comes from here and was compiled on 10 June 2020. The original dataset consists of 3440 Airbnb listings with 16 features across 25 Boston neighborhoods. In this post, I provide data visualization and machine learning solutions for three main questions that you might care about:
- Location: Which regions give you more choices, or where are you more likely to stay with Boston Airbnb?
- Room Type: Which types of rooms are most popular for a stay?
- Price: What are the important features that influence price? Can you predict the price of a Boston Airbnb?
Preliminary Data Visualization
First, I review pairwise relationships using seaborn, which provides general information and patterns among 9 useful numerical features.
Some insightful points (see Figure 1):
- Latitude: From south (42.25) to north (42.40), the number of Boston Airbnb listings increases.
- Longitude: From west (-71.15) to east (-71.00), the number of Boston Airbnb listings increases.
- The number of reviews and reviews per month are positively correlated.
Then I applied a Spearman correlation heatmap (Figure 2) to review the correlations among the 9 features.
Some insightful points:
- Latitude is positively correlated with price (r=0.31) and with longitude (r=0.30).
- The number of reviews and reviews per month are positively correlated (r=0.44).
- Availability 365 and calculated host listings count are positively correlated (r=0.25).
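As a rough illustration of how such a Spearman correlation matrix can be computed, here is a minimal sketch with pandas. The data is synthetic and the column names (`latitude`, `longitude`, `price`, `number_of_reviews`) are assumptions about the listings table, not the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the listings table; column names are assumed.
rng = np.random.default_rng(0)
n = 200
latitude = rng.uniform(42.25, 42.40, n)
listings = pd.DataFrame({
    "latitude": latitude,
    "longitude": rng.uniform(-71.15, -71.00, n),
    # Price rises with latitude plus noise, only to give the matrix some structure.
    "price": 100 + 400 * (latitude - 42.25) + rng.normal(0, 20, n),
    "number_of_reviews": rng.integers(0, 300, n),
})

# Spearman correlation works on ranks, so it captures monotonic
# (not only linear) relationships between features.
spearman = listings.corr(method="spearman")
print(spearman.round(2))
```

The resulting square matrix can be passed to a plotting call such as `seaborn.heatmap(spearman, annot=True)` to draw a heatmap similar to the one described.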
Furthermore, for analytical purposes, I handle outliers by removing the rows whose price is above $500, create dummy variables for room type, and exclude the minor room types (Figure 3): Hotel room and Shared room.
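The cleaning steps above can be sketched roughly as follows. The tiny table is made up, the column names `price` and `room_type` are assumptions, and the order of the steps is one reasonable choice rather than necessarily the order used in the original analysis.

```python
import pandas as pd

# Made-up rows for illustration only.
listings = pd.DataFrame({
    "price": [80, 150, 650, 120, 45, 300],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt",
                  "Hotel room", "Shared room", "Entire home/apt"],
})

# 1. Treat prices above $500 as outliers and drop those rows.
cleaned = listings[listings["price"] <= 500]

# 2. Keep only the two major room types.
major_types = ["Entire home/apt", "Private room"]
cleaned = cleaned[cleaned["room_type"].isin(major_types)]

# 3. One-hot encode room type into 0/1 dummy columns.
cleaned = pd.get_dummies(cleaned, columns=["room_type"], dtype=int)
print(cleaned)
```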
Therefore, the new Spearman correlation heatmap with 11 features (Figure 4) should be more accurate.
Lastly, the more important points:
1. Price and Home Type: Entire home/apt is positively correlated with price (r=0.67). Private room is negatively correlated with price (r=-0.66). The average price for Entire home/apt is higher than the average price for Private room (shown later).
2. Latitude is positively correlated with price. As latitude increases from south to north, Airbnb prices tend to increase.
3. The number of reviews and reviews per month are positively correlated.
4. The number of reviews and reviews per month are negatively correlated with minimum nights (as required by hosts).
Location: Which regions give you more choices, or where are you more likely to stay with Boston Airbnb?
Figure 5 shows the number of Airbnb listings across 25 different neighborhoods in Boston.
The top 5 neighborhoods with the most Airbnb listings are Dorchester, Downtown, Jamaica Plain, Roxbury, and Back Bay.
Figure 6 shows the proportion of Airbnb listings across neighborhoods in Boston.
Remarkably, Dorchester has a clearly higher proportion of Airbnb listings than the other neighborhoods, at 12%.
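Counts and proportions per neighborhood like these can be derived with a couple of pandas calls. The sketch below uses a handful of made-up rows and assumes the neighborhood column is named `neighbourhood`.

```python
import pandas as pd

# Made-up rows; only the neighborhood names come from the post.
listings = pd.DataFrame({
    "neighbourhood": ["Dorchester", "Downtown", "Dorchester", "Back Bay",
                      "Roxbury", "Dorchester", "Downtown", "Jamaica Plain"],
})

# Number of listings per neighborhood, sorted from most to fewest.
counts = listings["neighbourhood"].value_counts()

# Share of all listings that each neighborhood holds (sums to 1).
shares = listings["neighbourhood"].value_counts(normalize=True)

top5 = counts.head(5)
print(top5)
print(shares.round(3))
```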
Figure 7, a density plot, shows the distribution of Airbnb listings across Boston. The brightest area has the highest number of listings. You can also review the actual map of Boston Airbnb listings in Figure 8.
I find that Boston Airbnb listings are densely concentrated between longitude -71.08 (west) and -71.06 (east) and between latitude 42.34 (south) and 42.36 (north).
Look at them closely in Figure 9, an Airbnb scatterplot of the 25 different neighborhoods across Boston. Longitude and latitude are represented on the x-axis and y-axis, respectively.
If you would like to know which neighborhoods give you a higher chance of finding your Airbnb, the top 5 neighborhoods and their locations, shown in Figure 10, provide useful information for your plan.
Room Type: Which types of rooms are most popular for a stay?
Figure 11 shows that Entire home/apt and Private room are the most available room types by number of listings.
If you consider the minimum nights to stay, Figure 12 shows that minimum nights (required by hosts) of 91, 1, and 2 probably give travelers more choices.
If you look at Figure 13 and Figure 14, you will find about 576 Airbnb listings with no availability on any of the 365 days (they are either very popular or permanently closed).
Interestingly, there are also about 452 Airbnb listings available on all 365 days.
Next, let's review some statistics about the different room types.
Figure 15 shows that travelers tend to stay longer in Shared rooms than in Private rooms and Entire homes/apts.
Figure 16 and Figure 17 both confirm that Private room and Entire home/apt have a higher average number of reviews than Shared room and Hotel room.
This may indicate that Entire home/apt and Private room are the popular room types to choose from.
Lastly, Figure 18 shows the average days of availability in a year by room type.
The Shared room has a much lower availability of 131.94 days. However, it has only 16 listings, so the figure is likely biased and should not be treated as decisive evidence of popularity.
Compared to the Hotel room, Private room and Entire home/apt have fewer days of availability in a year. In particular, Entire home/apt has about 30 fewer days of availability than Hotel room. Therefore, we can probably assume that Entire home/apt is more popular.
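Per-room-type statistics such as these are a natural fit for a grouped aggregation. The following sketch uses made-up numbers and assumed column names (`room_type`, `availability_365`); reporting the listing count next to each mean makes thin groups, like the Shared room case above, easy to spot.

```python
import pandas as pd

# Made-up values for illustration only.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Hotel room", "Shared room"],
    "availability_365": [120, 200, 180, 220, 260, 90],
})

# Listing count and mean availability per room type.
summary = (
    listings.groupby("room_type")["availability_365"]
    .agg(n_listings="count", mean_availability="mean")
    .sort_values("n_listings", ascending=False)
)
print(summary)
```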
Price: What are the important features that influence price? Can you predict the price of a Boston Airbnb?
After dealing with outliers, dummy variables, and missing values, I use 3357 observations and 11 variables to build three models: Linear Regression, Lasso Regression, and Random Forests. The response variable is the price.
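A minimal sketch of fitting these three model families with scikit-learn is shown below. Everything about the data is an assumption: it is synthetic, it has only two features, and the hyperparameters (other than the Lasso `alpha=0.1` mentioned later in the post) are library defaults rather than the settings behind the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price depends on an entire-home dummy and on latitude.
rng = np.random.default_rng(0)
n = 400
entire_home = rng.integers(0, 2, n)
latitude = rng.normal(42.33, 0.03, n)
X = np.column_stack([entire_home, latitude])
y = 80 + 100 * entire_home + 300 * (latitude - 42.33) + rng.normal(0, 20, n)

# Hold out a test set so the models are scored on unseen listings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R-squared on the test set
    mean_pred = model.predict(X_test).mean()
    print(f"{name}: R2={scores[name]:.3f}, mean predicted price={mean_pred:.2f}")
```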
Meanwhile, I find that the average price of a Private room is about $81.22. The average price of an Entire home/apt is much higher, at about $189.38.
The actual average price for Boston Airbnb in the test dataset is about $147.85, and the predicted average price in the test dataset is about $149.53. These predictions come from the Random Forests model. You can also check the distributions of actual vs. predicted prices for Boston Airbnb in Figure 20.
Model Performance
An R-squared (R²) of 0.549 indicates that, of the three models, the Random Forest model best explains the variability of the response data. A Mean Absolute Error (MAE) of 40.423 indicates that the Random Forest model has a lower absolute difference between predictions and actual observations than the other two, which means it has lower prediction error. Random Forests is the best of the three models.
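For readers who want to see what these two metrics measure, here is a small sketch that computes them directly from their standard definitions with NumPy. The four prices are invented for the example and are unrelated to the reported results.

```python
import numpy as np

# Invented actual and predicted prices.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

# MAE: average absolute gap between prediction and observation, in dollars.
mae = np.mean(np.abs(y_true - y_pred))

# R-squared: 1 minus the ratio of residual variation to total variation
# around the mean of the observed prices.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.2f}, R2={r2:.3f}")
```

Read this way, an MAE near 40 means predictions miss by roughly $40 on average, and an R² near 0.55 means a bit over half of the price variation is explained.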
Feature Explanation: Regression Coefficients, Tree-Based Feature Importance, and Shapley Values
Figure 22 shows the coefficients of the Lasso Regression (alpha=0.1). As latitude, room type Entire home/apt, and longitude increase, the response variable price increases. Conversely, room type Private room is associated with a decrease in price. These coefficients also agree with my expectation that location and room type are important influences on price.
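Reading Lasso coefficients by sign and size can be sketched as follows. The data is synthetic, the feature names are illustrative, and standardizing the features first is an assumption made here so that coefficient magnitudes are comparable; the post does not say whether the original analysis did so.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data with two real drivers of price and one unrelated feature.
rng = np.random.default_rng(1)
n = 500
entire_home = rng.integers(0, 2, n)
latitude = rng.normal(42.33, 0.03, n)
unrelated = rng.normal(0, 1, n)
y = 80 + 100 * entire_home + 300 * (latitude - 42.33) + rng.normal(0, 20, n)

feature_names = ["entire_home", "latitude", "unrelated"]
X = StandardScaler().fit_transform(
    np.column_stack([entire_home, latitude, unrelated])
)

# The L1 penalty shrinks coefficients and can push weak ones toward zero.
lasso = Lasso(alpha=0.1).fit(X, y)
coefs = dict(zip(feature_names, lasso.coef_))

for name, value in sorted(coefs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:12s} {value:+.2f}")
```

A positive coefficient means price rises with the feature, a negative one means it falls, and a coefficient at or near zero means the penalized linear model found little use for that feature.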
Figure 23 shows the feature importance ranking plot from the Random Forests model. Room type Entire home/apt, latitude, and longitude are still the most important features for predicting the price. Interestingly, calculated host listings count and host id are bigger influencers in Random Forests.
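A feature importance ranking of this kind can be obtained from scikit-learn's `feature_importances_` attribute, as in the sketch below. The data and feature names are synthetic assumptions, not the real listings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with one dominant feature, one weaker one, and pure noise.
rng = np.random.default_rng(2)
n = 500
entire_home = rng.integers(0, 2, n)
latitude = rng.normal(42.33, 0.03, n)
unrelated = rng.normal(0, 1, n)
y = 80 + 100 * entire_home + 300 * (latitude - 42.33) + rng.normal(0, 20, n)

feature_names = ["entire_home", "latitude", "unrelated"]
X = np.column_stack([entire_home, latitude, unrelated])

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:12s} {importance:.3f}")
```

These importances carry no sign, and impurity-based scores are known to favor features with many distinct values, which may be one reason an identifier-like column such as host id can rank high.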
I also use Shapley values to analyze and explain the Random Forests predictions of Airbnb prices.
As Figure 24 indicates, the Shapley value plot can further show the positive and negative relationships of the predictors with the target variable price [1].
- Feature importance: Variables are ranked in descending order.
- Impact: The horizontal location shows whether the effect of that value is associated with a higher or lower prediction.
- Original value: Color shows whether that variable is high (in red) or low (in blue) for that observation.
- Correlation: A high value of “Room type Entire home/apt” has a high and positive impact on the price. The “high” comes from the red color, and the “positive” impact is shown on the x-axis. Similarly, “minimum nights” is negatively correlated with the target variable price.
Figure 25 is a simpler version of the Shapley value plot. It shows the average impact of each variable on the model's output price in descending order, ignoring whether the effect on price is positive or negative. Shapley values can also be used to explain more complex models such as deep learning. Next time, you can use this method to explain your black-box deep learning models to your audience.
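To make the idea behind Shapley values concrete, here is a brute-force sketch of the textbook definition: each feature's value is its marginal contribution averaged over all orders in which features could be added. This is not how the SHAP library computes its values for tree models, and `toy_value` is a hypothetical pricing function invented for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (small n only)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical predicted price given which features are 'known'."""
    price = 100.0  # baseline prediction with no features
    if "entire_home" in coalition:
        price += 60.0
    if "latitude" in coalition:
        price += 20.0
    if "minimum_nights" in coalition:
        price -= 10.0
    if "entire_home" in coalition and "latitude" in coalition:
        price += 10.0  # interaction, shared between the two features
    return price


features = ["entire_home", "latitude", "minimum_nights"]
phi = shapley_values(features, toy_value)
for name, contribution in phi.items():
    print(f"{name:15s} {contribution:+.2f}")
```

The contributions add up to the gap between the full prediction and the baseline, which is the property that lets a Shapley plot split one predicted price into per-feature pushes up or down.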
Conclusion:
Data science can not only help make business decisions but also make life more interesting and scientific. Based on the data science for Boston Airbnb so far, I will apply this guidance to my next trip to Boston:
1. Location: Which regions give you more choices, or where are you more likely to stay with Boston Airbnb?
The top 5 neighborhoods to consider are Dorchester, Downtown, Jamaica Plain, Roxbury, and Back Bay. Geographically speaking, you would want to look between longitude -71.08 (west) and -71.06 (east) and between latitude 42.25 (south) and 42.40 (north) in Boston.
2. Room Type: Which types of rooms are most popular for a stay?
Generally speaking, you have a higher chance of finding your Airbnb among the room types Entire home/apt and Private room. Comparing the two, Entire home/apt has higher numbers in terms of listings and average reviews per month. Private room has higher numbers in terms of average minimum nights, average number of reviews, and average days of availability in 365 days.
3. Price: What are the important features that influence price? Can you predict the price of a Boston Airbnb?
Lasso Regression and Random Forests both agree that location (longitude and latitude) and room type (Entire home/apt and Private room) are important for predicting the prices of Boston Airbnb.
Very interestingly, the feature importance function in Random Forest and the Shapley values both suggest that calculated host listings count is important, while its coefficient in the Lasso Regression is zero.
If you care about the price, you might choose a Private room for your next trip. Also keep in mind that from southwest to northeast, the price of Boston Airbnb tends to increase.
This is my first post on Medium. I hope it helps! I welcome feedback and constructive criticism. You can contact me on LinkedIn: https://www.linkedin.com/in/lanxiao12.
Before you go, the code can be found on my GitHub here. Happy coding, happy life!
Special Thanks to Menoua Keshishian.
Reference:
[1] Dr. Dataman, Explain Your Model with the SHAP Values (2019), Towards Data Science
Translated from: https://towardsdatascience.com/using-data-science-to-make-your-next-trip-on-boston-airbnb-952030cad433