Predicting Gaokao Admissions with Simple Linear Regression in Python


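Abstract

This article collects the number of gaokao (China's national college entrance exam) takers from 2000 to 2023 and the number of students admitted from 2000 to 2022, and uses simple linear regression on these data to predict the number of students admitted in 2023.

Year	Exam takers (10,000s)	Admitted (10,000s)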
2000	375.0	221.0
2001	454.0	268.0
2002	510.0	320.0
2003	613.0	382.0
2004	729.0	447.0
2005	877.0	504.0
2006	950.0	546.0
2007	1010.0	566.0
2008	1050.0	599.0
2009	1020.0	629.0
2010	946.0	657.0
2011	933.0	675.0
2012	915.0	685.0
2013	912.0	684.0
2014	939.0	697.0
2015	942.0	700.0
2016	940.0	705.0
2017	940.0	700.0
2018	975.0	790.99
2019	1031.0	820.0
2020	1071.0	856.0
2021	1078.0	1001.32
2022	1193.0	1014.53
2023	1291.0	?

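Linear regression predictions

1. Predicting each year's admissions from the previous year's

This model looks only at the yearly admission figures and ignores the number of exam takers: the previous year's admissions are used to predict the next year's.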

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Exam takers, 2000-2023 (unit: 10,000 people)
x1 = np.array([375.0, 454.0, 510.0, 613.0, 729.0, 877.0, 950.0, 1010.0, 1050.0, 1020.0, 946.0, 933.0, 915.0, 912.0, 939.0, 942.0, 940.0, 940.0, 975.0, 1031.0, 1071.0, 1078.0, 1193.0, 1291.0]).reshape(-1, 1)
# Students admitted, 2000-2022 (unit: 10,000 people)
x2 = np.array([221.0, 268.0, 320.0, 382.0, 447.0, 504.0, 546.0, 566.0, 599.0, 629.0, 657.0, 675.0, 685.0, 684.0, 697.0, 700.0, 705.0, 700.0, 790.99, 820.0, 856.0, 1001.32, 1014.53]).reshape(-1, 1)

# Create the linear regression model
model = LinearRegression()
# Train: each year's admissions are the input, the following year's are the target
model.fit(x2[:-1], x2[1:])

# Predict 2023 from the last known admission figure (2022)
last_x2 = np.array([[x2[-1][0]]])
last_y2 = model.predict(last_x2)

plt.scatter(range(len(x1)), [i[0] for i in x1], label='Exam takers')
plt.scatter(range(len(x2)), [i[0] for i in x2], label='Admitted')
plt.scatter(23, last_y2[0][0], label='Predicted admissions')
# Each fitted value belongs to the year after its input, so the x-axis starts at 1
plt.plot(range(1, len(x1)), [i[0] for i in model.predict(x2)], label='Fitted line')
plt.legend()
plt.text(23-0.5, last_y2[0][0]+20, int(last_y2[0][0]))
plt.savefig('prediction1.jpg')
plt.show()
print("Predicted value:", last_y2[0][0])

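Result:
The predicted value is 1039 (in units of 10,000, i.e. about 10.39 million students).
This approach is clearly inadequate, because it ignores the number of exam takers.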
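
To make the setup of this first model more concrete, here is a minimal sketch of how the year-to-year pairs are formed and what the fitted line looks like. It is an illustration rather than the code used above: it swaps in `np.polyfit` for `LinearRegression` (for a single feature with an intercept, both solve the same least-squares problem), and the names `admitted`, `prev_year`, `next_year` and `forecast_2023` are assumptions introduced here.

```python
import numpy as np

# Students admitted, 2000-2022, in units of 10,000 (from the table above)
admitted = np.array([221.0, 268.0, 320.0, 382.0, 447.0, 504.0, 546.0, 566.0,
                     599.0, 629.0, 657.0, 675.0, 685.0, 684.0, 697.0, 700.0,
                     705.0, 700.0, 790.99, 820.0, 856.0, 1001.32, 1014.53])

# Pair each year with the following one: (2000 -> 2001), ..., (2021 -> 2022)
prev_year = admitted[:-1]
next_year = admitted[1:]

# Degree-1 least-squares fit; polyfit returns the highest power first
slope, intercept = np.polyfit(prev_year, next_year, deg=1)

# The 2023 forecast is the fitted line evaluated at the 2022 value
forecast_2023 = intercept + slope * admitted[-1]

print(f"next = {intercept:.2f} + {slope:.4f} * prev")
print(f"forecast for 2023: {forecast_2023:.2f}")
```

Written this way, it is easy to see that the forecast depends on nothing but the 2022 admission figure, which is exactly the weakness noted above.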

2. Predicting admissions from the same year's number of exam takers

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# First series: exam takers, 2000-2023 (unit: 10,000 people)
x1 = np.array([375.0, 454.0, 510.0, 613.0, 729.0, 877.0, 950.0, 1010.0, 1050.0, 1020.0, 946.0, 933.0, 915.0, 912.0, 939.0, 942.0, 940.0, 940.0, 975.0, 1031.0, 1071.0, 1078.0, 1193.0, 1291.0]).reshape(-1, 1)
# Second series: students admitted, 2000-2022 (unit: 10,000 people)
x2 = np.array([221.0, 268.0, 320.0, 382.0, 447.0, 504.0, 546.0, 566.0, 599.0, 629.0, 657.0, 675.0, 685.0, 684.0, 697.0, 700.0, 705.0, 700.0, 790.99, 820.0, 856.0, 1001.32, 1014.53]).reshape(-1, 1)

# Create the linear regression model
model = LinearRegression()
# Train: exam takers 2000-2022 are the input, admissions 2000-2022 are the target
model.fit(x1[:-1], x2)

# Predict 2023 admissions from the 2023 number of exam takers
last_x2 = np.array([[x1[-1][0]]])
last_y2 = model.predict(last_x2)

plt.scatter(range(len(x1)), [i[0] for i in x1], label='Exam takers')
plt.scatter(range(len(x2)), [i[0] for i in x2], label='Admitted')
plt.scatter(23, last_y2[0][0], label='Predicted admissions')
plt.plot(range(len(x1)), [i[0] for i in model.predict(x1)], label='Fitted line')
plt.legend()
plt.text(23-0.5, last_y2[0][0]+20, int(last_y2[0][0]))
plt.savefig('prediction2.jpg')
plt.show()
print("Predicted value:", last_y2[0][0])

Result:
The predicted value is 986 (in units of 10,000).
The fit is poor: the prediction depends too heavily on a single variable.
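
One way to put a number on "the fit is poor" is the coefficient of determination R² on the training data, together with the years where the fitted line misses by the most. The sketch below is only a suggestion for how one might check this and is not part of the original analysis; it uses `np.polyfit` in place of `LinearRegression` (equivalent for one feature with an intercept), and the names `takers`, `admitted`, `r2` and `worst` are illustrative.

```python
import numpy as np

# Exam takers and students admitted, 2000-2022, in units of 10,000
takers = np.array([375.0, 454.0, 510.0, 613.0, 729.0, 877.0, 950.0, 1010.0,
                   1050.0, 1020.0, 946.0, 933.0, 915.0, 912.0, 939.0, 942.0,
                   940.0, 940.0, 975.0, 1031.0, 1071.0, 1078.0, 1193.0])
admitted = np.array([221.0, 268.0, 320.0, 382.0, 447.0, 504.0, 546.0, 566.0,
                     599.0, 629.0, 657.0, 675.0, 685.0, 684.0, 697.0, 700.0,
                     705.0, 700.0, 790.99, 820.0, 856.0, 1001.32, 1014.53])

# Least-squares line: admitted ~ intercept + slope * takers
slope, intercept = np.polyfit(takers, admitted, deg=1)
fitted = intercept + slope * takers
residuals = admitted - fitted

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((admitted - admitted.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Indices of the three largest absolute residuals
worst = np.argsort(-np.abs(residuals))[:3]

print(f"R^2 = {r2:.3f}")
for i in worst:
    print(2000 + i, f"residual = {residuals[i]:+.1f}")
```

The table already hints at why a single straight line struggles: between 2008 and 2013 the number of exam takers fell from 1050 to 912 while admissions rose from 599 to 684, so the same number of exam takers corresponds to very different admission figures in different years.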

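3. Combining the number of exam takers with time

This model combines the number of exam takers with a time index. The head counts are also divided by 50, a simple rescaling intended to reduce the effect of the two features being measured on different scales.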

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# First series: exam takers, 2000-2023 (unit: 10,000 people)
x1 = np.array([375.0, 454.0, 510.0, 613.0, 729.0, 877.0, 950.0, 1010.0, 1050.0, 1020.0, 946.0, 933.0, 915.0, 912.0, 939.0, 942.0, 940.0, 940.0, 975.0, 1031.0, 1071.0, 1078.0, 1193.0, 1291.0]).reshape(-1, 1)
# Two features per year: exam takers divided by 50 (rescaling) and a time index 1..24
x1 = [[x1[i][0]/50, (i+1)] for i in range(len(x1))]
# Second series: students admitted, 2000-2022 (unit: 10,000 people)
x2 = np.array([221.0, 268.0, 320.0, 382.0, 447.0, 504.0, 546.0, 566.0, 599.0, 629.0, 657.0, 675.0, 685.0, 684.0, 697.0, 700.0, 705.0, 700.0, 790.99, 820.0, 856.0, 1001.32, 1014.53]).reshape(-1, 1)
# Target divided by 50 as well (rescaling)
x2 = [[x2[i][0]/50] for i in range(len(x2))]

# Create the linear regression model
model = LinearRegression()
# Train on 2000-2022
model.fit(x1[:-1], x2)

# Predict 2023: exam takers divided by 50, time index 24
last_x2 = np.array([[1291.0/50, 24]])
last_y2 = model.predict(last_x2)

# Multiply by 50 to return to the original units when plotting
plt.scatter(range(len(x1)), [i[0]*50 for i in x1], label='Exam takers')
plt.scatter(range(len(x2)), [i[0]*50 for i in x2], label='Admitted')
plt.scatter(23, last_y2[0][0]*50, label='Predicted admissions')
plt.plot(range(len(x1)), [i[0]*50 for i in model.predict(x1)], label='Fitted line')
plt.legend()  # added so that the labels above are actually displayed
plt.text(23-0.5, last_y2[0][0]*50+20, int(last_y2[0][0]*50))
plt.savefig('prediction3.jpg')
plt.show()
print("Predicted value:", last_y2[0][0]*50)

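Result:
The predicted value is 1020 (in units of 10,000).
The fit is comparatively good, and because the model accounts for both the number of exam takers and the time trend, its prediction is the most credible of the three.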
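
Two points in this section can be checked directly: whether adding the time index improves the fit, and what dividing by 50 actually does. The sketch below is a hedged illustration, not the original code: it solves the least-squares problem with `np.linalg.lstsq` instead of `LinearRegression`, and the helpers `ols_fit`, `ols_predict` and `r_squared` are hypothetical names introduced here. Because the one-feature model is a special case of the two-feature model, the training R² of the latter cannot be lower. For ordinary least squares, dividing a feature and the target by a constant should leave the forecast in original units unchanged, so the division by 50 is expected to change only the size of the coefficients.

```python
import numpy as np

# Exam takers 2000-2023 and students admitted 2000-2022, in units of 10,000
takers = np.array([375.0, 454.0, 510.0, 613.0, 729.0, 877.0, 950.0, 1010.0,
                   1050.0, 1020.0, 946.0, 933.0, 915.0, 912.0, 939.0, 942.0,
                   940.0, 940.0, 975.0, 1031.0, 1071.0, 1078.0, 1193.0, 1291.0])
admitted = np.array([221.0, 268.0, 320.0, 382.0, 447.0, 504.0, 546.0, 566.0,
                     599.0, 629.0, 657.0, 675.0, 685.0, 684.0, 697.0, 700.0,
                     705.0, 700.0, 790.99, 820.0, 856.0, 1001.32, 1014.53])
t = np.arange(1, 25, dtype=float)  # time index 1..24, as in the code above


def ols_fit(features, target):
    """Least-squares coefficients, with an intercept column prepended."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def ols_predict(coef, features):
    """Evaluate the fitted linear model on a 2-D feature array."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


def r_squared(target, fitted):
    """Coefficient of determination on the given data."""
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


n = len(admitted)  # 23 training years, 2000-2022
one = takers[:n].reshape(-1, 1)             # exam takers only
two = np.column_stack([takers[:n], t[:n]])  # exam takers + time index

coef_one = ols_fit(one, admitted)
coef_two = ols_fit(two, admitted)
r2_one = r_squared(admitted, ols_predict(coef_one, one))
r2_two = r_squared(admitted, ols_predict(coef_two, two))

# 2023 forecast from the unscaled two-feature model
forecast_raw = ols_predict(coef_two, np.array([[takers[-1], t[-1]]]))[0]

# Same model with head counts divided by 50, then converted back
two_scaled = np.column_stack([takers[:n] / 50, t[:n]])
coef_scaled = ols_fit(two_scaled, admitted / 50)
forecast_scaled = ols_predict(
    coef_scaled, np.array([[takers[-1] / 50, t[-1]]]))[0] * 50

print(f"R^2, exam takers only:   {r2_one:.3f}")
print(f"R^2, exam takers + time: {r2_two:.3f}")
print(f"forecast without scaling: {forecast_raw:.2f}")
print(f"forecast with /50 scaling: {forecast_scaled:.2f}")
```

Note that a higher training R² only shows that the two-feature model describes the past data better; with 23 observations it says little about how reliable the 2023 forecast is.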

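Visualization

All the data, including the prediction from the third model, are plotted in a single chart for presentation.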

import matplotlib.pyplot as plt
from pandas import DataFrame

data = DataFrame()
data['Year'] = range(2000, 2024)
data['Exam takers'] = [375.0, 454.0, 510.0, 613.0, 729.0, 877.0, 950.0, 1010.0, 1050.0, 1020.0, 946.0, 933.0, 915.0, 912.0, 939.0, 942.0, 940.0, 940.0, 975.0, 1031.0, 1071.0, 1078.0, 1193.0, 1291.0]
a = [221.0, 268.0, 320.0, 382.0, 447.0, 504.0, 546.0, 566.0, 599.0, 629.0, 657.0, 675.0, 685.0, 684.0, 697.0, 700.0, 705.0, 700.0, 790.99, 820.0, 856.0, 1001.32, 1014.53]
a.append(1020.4566124819598)  # append the predicted value for 2023
data['Admitted'] = a
data['Admitted'] = [float(i) for i in data['Admitted']]
data['Exam takers'] = [float(i) for i in data['Exam takers']]

plt.xticks(rotation=45)
plt.title('Gaokao data')
plt.xlabel('Year')
plt.ylabel('People (10,000s)')
plt.plot(data['Year'], data['Exam takers'], marker='o', markersize=3, label='Exam takers')
# Annotate every other point to keep the chart readable
for i, j in zip(data['Year'][1::2], data['Exam takers'][1::2]):
    plt.text(i-0.5, j+20, int(j))
# Solid line for the actual admissions, dashed segment for the 2023 prediction
plt.plot(data['Year'][:-1], data['Admitted'][:-1], 'x-', markersize=3, label='Admitted')
plt.plot(data['Year'][-2:], data['Admitted'][-2:], 'x--', markersize=3, label='Predicted admissions')
for i, j in zip(data['Year'][1::2], data['Admitted'][1::2]):
    plt.text(i-0.5, j+20, int(j))
plt.legend()
plt.savefig('gaokao_line_chart.jpg')
plt.show()