[Data Analysis] goodbooks-10k
Ten thousand books, one million ratings. Also books marked to read, and tags.
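Data source: https://www.kaggle.com/zygmunt/goodbooks-10k
Goal: examine how a book's publication year relates to the number of books and to their ratings.
Columns used: book_id, original_publication_year, average_rating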
import pandas as pd
from matplotlib import pyplot as plt

file_path = './books_data/books.csv'
# 'ansi' is a Windows-only codec alias; on other platforms an explicit encoding such as 'utf-8' may be needed
df = pd.read_csv(file_path, encoding='ansi')

# Drop rows where the publication year is NaN
data = df[pd.notnull(df['original_publication_year'])]

# Mean rating per publication year
grouped = data['average_rating'].groupby(data['original_publication_year']).mean()
print(grouped)
# Number of books per publication year
grouped1 = data.groupby(data['original_publication_year']).count()['book_id']
print(grouped1)

year = grouped.index
rating = grouped.values
year1 = grouped1.index
books_num = grouped1.values

# SimHei lets matplotlib render the Chinese labels below
plt.rcParams['font.sans-serif'] = ['SimHei']
fig = plt.figure(figsize=(15, 8))
# In add_subplot(111), the first two digits give the grid size (rows x columns) and the last is the subplot index
ax1 = fig.add_subplot(111)
ax1.plot(range(len(year)), rating, label='平均评分')  # label: "mean rating"
# Secondary y-axis
ax2 = ax1.twinx()
plt.bar(range(len(year1)), books_num, label='数量')  # label: "count"
plt.xticks(list(range(len(year)))[::10], year[::10].astype(int))
plt.show()
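The plot shows that very few works were published before 1841, so the average rating for those years fluctuates widely and is dominated by individual books.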
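One way to check how thin a given year is: compute the count and the mean together, then list the years that have only a handful of books. The sketch below uses a small made-up DataFrame with the same column names as books.csv. The values and the `min_books` threshold are assumptions for illustration, not figures from the dataset.

```python
import pandas as pd

# Toy stand-in for books.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "book_id": [1, 2, 3, 4, 5, 6],
    "original_publication_year": [1813.0, 1990.0, 1990.0, 1990.0, 2005.0, 2005.0],
    "average_rating": [4.5, 3.5, 4.0, 4.5, 3.0, 4.0],
})

# Count and mean rating per year in a single groupby pass (named aggregation).
per_year = toy.groupby("original_publication_year").agg(
    books_num=("book_id", "count"),
    mean_rating=("average_rating", "mean"),
)

# Hypothetical threshold: years with fewer books than this are treated as sparse.
min_books = 2
sparse_years = per_year[per_year["books_num"] < min_books]

print(per_year)
print(sparse_years.index.tolist())
```

A year that shows up in `sparse_years` has a mean determined by one or two books, which is why its point on the rating line can jump around.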
Keep only works published after 1975:
data = data[data['original_publication_year'] > 1975]
Show the value at each data point:
for i, (_x, _y) in enumerate(zip(range(len(year)), rating)):
    plt.text(_x, _y, round(rating[i], 3), color='black', fontsize=10)
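- enumerate(): wraps an iterable (such as a list, tuple, or string) and yields (index, item) pairs.
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
list(enumerate(seasons))
[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
- zip(): takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns an iterator over those tuples, so wrap it in list() to get a list.
a = [1, 2, 3]
b = [4, 5, 6]
zipped = zip(a, b)  # pairs up corresponding elements
list(zipped)
[(1, 4), (2, 5), (3, 6)]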
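A quick check of both built-ins under Python 3, where zip() returns a lazy iterator and not a list. The last line mirrors the loop pattern used for the plot labels; the variable names are only illustrative.

```python
seasons = ["Spring", "Summer", "Fall", "Winter"]

# enumerate() yields (index, item) pairs lazily; list() materializes them.
indexed = list(enumerate(seasons))

a = [1, 2, 3]
b = [4, 5, 6]

# zip() returns an iterator in Python 3, so wrap it in list() to see the pairs.
zipped = list(zip(a, b))

# Same shape as the labelling loop: a position, then an unpacked (x, y) pair.
labels = [(i, x, y) for i, (x, y) in enumerate(zip(a, b))]

print(indexed)
print(zipped)
print(labels)
```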
Complete script for works published after 1975:
import pandas as pd
from matplotlib import pyplot as plt

file_path = './books_data/books.csv'
# 'ansi' is a Windows-only codec alias; on other platforms an explicit encoding such as 'utf-8' may be needed
df = pd.read_csv(file_path, encoding='ansi')

# Drop rows where the publication year is NaN
data = df[pd.notnull(df['original_publication_year'])]
data = data[data['original_publication_year'] > 1975]

# Mean rating per publication year
grouped = data['average_rating'].groupby(data['original_publication_year']).mean()
# Number of books per publication year
grouped1 = data.groupby(data['original_publication_year']).count()['book_id']

year = grouped.index
rating = grouped.values
year1 = grouped1.index
books_num = grouped1.values

# SimHei lets matplotlib render the Chinese labels below
plt.rcParams['font.sans-serif'] = ['SimHei']
fig = plt.figure(figsize=(15, 8))
# In add_subplot(111), the first two digits give the grid size (rows x columns) and the last is the subplot index
ax1 = fig.add_subplot(111)
ax1.plot(range(len(year)), rating, color='black', alpha=0.8, marker='.', label='平均评分')  # label: "mean rating"
# Write the rounded rating next to each point
for i, (_x, _y) in enumerate(zip(range(len(year)), rating)):
    plt.text(_x, _y, round(rating[i], 3), color='black', fontsize=10)
# Legend
ax1.legend(loc='upper left')
# y-axis range
ax1.set_ylim([3.95, 4.11])
# Secondary y-axis
ax2 = ax1.twinx()
plt.bar(range(len(year1)), books_num, alpha=0.3, label='数量')  # label: "count"
plt.xticks(list(range(len(year))), year.astype(int))
plt.legend(loc=1)
plt.show()
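As a variation on the script above (not the author's method), the sketch below calls the plotting methods on each axes object explicitly, uses the years themselves as x positions, and merges both series into a single legend. The data is invented, and the Agg backend is chosen only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
from matplotlib import pyplot as plt

# Invented values for illustration only.
years = [2001, 2002, 2003, 2004]
rating = [4.01, 4.03, 3.99, 4.05]
books_num = [120, 150, 170, 210]

fig, ax1 = plt.subplots(figsize=(8, 4))
ax1.plot(years, rating, color="black", marker=".", label="mean rating")
ax1.set_ylabel("mean rating")

# Second y-axis that shares the x-axis with ax1.
ax2 = ax1.twinx()
ax2.bar(years, books_num, alpha=0.3, label="number of books")
ax2.set_ylabel("number of books")

# Tick at every real year; no 0..n-1 positions to relabel.
ax1.set_xticks(years)

# Gather legend entries from both axes so one legend covers both series.
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
ax2.legend(h1 + h2, l1 + l2, loc="upper left")
```

Calling `fig.savefig(...)` or `plt.show()` afterwards would render the figure. Plotting against the real years removes the need to relabel the ticks later.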