[Competition Baseline] "深水云脑" Residential Community Secondary Water Supply Demand Forecasting: a DL Baseline


v1.1 Notes
  1. Running this notebook requires the competition data. Please register on the competition website, download the data, and place it in the corresponding directory ('./work/data/')!

  2. Updated the leaderboard ranking at the time.
    [Figure: leaderboard screenshot]

  3. To reach the score shown above, you can combine this with the approach from the other baseline.

I previously wrote [Competition Baseline] "深水云脑" Water Purification Plant Process Control - Aeration Prediction: a DL Baseline. The "深水云脑" series also includes a Residential Community Secondary Water Supply Demand Forecasting competition, which is likewise a time-series problem, so while the iron was hot I put together a baseline for that one as well~

This article is roughly organized as:

  • Problem analysis
  • Baseline code
  • Result analysis

Here we go ~~~

Problem Analysis

1. Task and Data

Quoting the task description:

This task uses historical readings from residential-community smart master water meters and post-pump flow meters in the secondary water supply system, combined with internet-sourced data such as weather and epidemic data, to build regression and time-series models of residential water demand in the area. Using the historical water-usage and sensor data of multiple residential communities provided by the organizer, participants must predict the hourly water demand of each community over specific periods, in order to guide actual water-supply operations.

Let the data speak for itself ~

Note: the dataset is not provided here. Before running, register for the competition and place the corresponding data files in the directory above!

!pip install --user -q -r requirements.txt
import os
import numpy as np
import pandas as pd
import time
import functools
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_log_error as msle
from sklearn.model_selection import StratifiedKFold, KFold
import matplotlib.pyplot as plt
%matplotlib inline
DATA_PATH = './work/data/'
df_daily = pd.read_csv(DATA_PATH + 'daily_dataset.csv')
df_min = pd.read_csv(DATA_PATH + 'per5min_dataset.csv')
df_hour = pd.read_csv(DATA_PATH + 'hourly_dataset.csv')
df_test = pd.read_csv(DATA_PATH + 'test_public.csv')
df_sub = pd.read_csv(DATA_PATH + 'sample_submission.csv')
df_weather = pd.read_csv(DATA_PATH + 'weather.csv')
df_epidemic = pd.read_csv(DATA_PATH + 'epidemic.csv')
df_hour.head()
| index | time | flow_1 | flow_2 | flow_3 | flow_4 | flow_5 | flow_6 | flow_7 | flow_8 | flow_9 | ... | flow_12 | flow_13 | flow_14 | flow_15 | flow_16 | flow_17 | flow_18 | flow_19 | flow_20 | train or test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-01 01:00:00 | 29.7 | 14.6 | 54.7 | 40.1 | 3.0 | 49.7 | 10.9 | 1.1 | 5.0 | ... | 2.914 | 1.7 | 3.2 | 1.3 | 3.5 | 6.8 | NaN | 1.806 | 1.4 | train |
| 1 | 2022-01-01 02:00:00 | 21.9 | 9.0 | 38.0 | 27.7 | 2.4 | 30.2 | 6.4 | 0.4 | 2.6 | ... | 1.108 | 1.3 | 2.2 | 0.8 | 2.3 | 4.5 | NaN | 3.847 | 0.8 | train |
| 2 | 2022-01-01 03:00:00 | 16.9 | 4.5 | 28.9 | 22.9 | 1.3 | 19.7 | 3.8 | 0.5 | 1.4 | ... | 0.772 | 0.6 | 1.5 | 0.6 | 1.1 | 2.4 | NaN | NaN | 0.5 | train |
| 3 | 2022-01-01 04:00:00 | 14.3 | 3.2 | 25.5 | 20.0 | 1.5 | 15.4 | 2.7 | 0.4 | 1.2 | ... | 0.414 | 0.2 | 1.2 | 0.7 | 0.8 | 1.8 | NaN | NaN | 0.2 | train |
| 4 | 2022-01-01 05:00:00 | 14.9 | 3.5 | 26.4 | 20.6 | 1.2 | 17.5 | 2.2 | 0.5 | 1.2 | ... | 0.279 | 0.8 | 1.1 | 0.4 | 0.9 | 1.9 | NaN | NaN | 0.3 | train |

5 rows × 22 columns

df_hour.tail()
| index | time | flow_1 ~ flow_20 | train or test |
|---|---|---|---|
| 5731 | 2022-08-27 20:00:00 | NaN | test4 |
| 5732 | 2022-08-27 21:00:00 | NaN | test4 |
| 5733 | 2022-08-27 22:00:00 | NaN | test4 |
| 5734 | 2022-08-27 23:00:00 | NaN | test4 |
| 5735 | 2022-08-28 00:00:00 | NaN | test4 |

5 rows × 22 columns

df_hour.describe()
[Output: df_hour.describe(), summary statistics for flow_1 ~ flow_20. Non-null counts range from roughly 4084 to 5061 out of 5736 rows; several columns have negative minimums (e.g. flow_8 min -32.3, flow_17 min -121.4) and some maximums are extreme outliers (e.g. flow_1 max 3797.4, flow_6 max 2458.5), which already hints at the anomalies discussed later.]
figure = plt.figure(figsize=(16, 3))
ax1 = plt.subplot(141)
plt.plot(df_hour['flow_1'])
ax2=plt.subplot(142)
plt.plot(df_hour['flow_2'])
ax3=plt.subplot(143)
plt.plot(df_hour['flow_3'])
ax4=plt.subplot(144)
plt.plot(df_hour['flow_4'])
plt.show()

[Figure: line plots of flow_1 to flow_4]

df_test
| index | time | flow_1 ~ flow_20 | train or test |
|---|---|---|---|
| 0 | 2022-05-01 01:00:00 | NaN | test1 |
| 1 | 2022-05-01 02:00:00 | NaN | test1 |
| 2 | 2022-05-01 03:00:00 | NaN | test1 |
| 3 | 2022-05-01 04:00:00 | NaN | test1 |
| 4 | 2022-05-01 05:00:00 | NaN | test1 |
| ... | ... | ... | ... |
| 667 | 2022-08-27 20:00:00 | NaN | test4 |
| 668 | 2022-08-27 21:00:00 | NaN | test4 |
| 669 | 2022-08-27 22:00:00 | NaN | test4 |
| 670 | 2022-08-27 23:00:00 | NaN | test4 |
| 671 | 2022-08-28 00:00:00 | NaN | test4 |

672 rows × 22 columns

df_test.groupby('train or test')['time'].count()
train or test
test1    168
test2    168
test3    168
test4    168
Name: time, dtype: int64
SEQ_LEN = 168
# based on the open-source project https://github.com/lhrgo/Competition-code/blob/main/baseline.ipynb
test_list1 = df_test.groupby('train or test')['time'].first().reset_index()
test_list1 = test_list1['time'].values.tolist()
test_list2 = df_test.groupby('train or test')['time'].last().reset_index()
test_list2 = test_list2['time'].values.tolist()
test_list1.extend(test_list2)
test_list1.sort()
test_list1
['2022-05-01 01:00:00','2022-05-08 00:00:00','2022-06-01 01:00:00','2022-06-08 00:00:00','2022-07-21 01:00:00','2022-07-28 00:00:00','2022-08-21 01:00:00','2022-08-28 00:00:00']

From the test data (df_test) we can see that predictions are required at an hourly resolution. To keep the baseline simple, the training data used here is likewise only the hourly data (df_hour).

Compared with the [Competition Baseline] "深水云脑" Water Purification Plant Process Control - Aeration Prediction: a DL Baseline task, the biggest difference is that one predicts a companion control quantity of the time series (a new column in the aeration task), while this one predicts the time series itself (new rows in the water-demand task).

A further peculiarity of this task is that the whole time series is divided into four segments, and each prediction window may only be forecast using data from before that window.

Quoting the explanation from the competition website:

[Figure: official illustration of the four training/test segments]

Example rules:

  • Training set 1 may be used to predict test set 1.
  • Training sets 1, 2 and 3 may be used to predict test set 3.
  • Semi-supervised learning with training set 1, test set 1 and training set 2 may be used to predict test set 2.
  • Using training set 4 to predict test sets 1, 2 or 3 is forbidden.

Looking at the data in more detail:

  1. There are 5736 records in total (one per hour).
  2. 672 of them belong to the test set, split into four segments of 7 days (168 records) each:
    • 2022-05-01 01:00:00 ~ 2022-05-08 00:00:00
    • 2022-06-01 01:00:00 ~ 2022-06-08 00:00:00
    • 2022-07-21 01:00:00 ~ 2022-07-28 00:00:00
    • 2022-08-21 01:00:00 ~ 2022-08-28 00:00:00
  3. The data features are flow_1 ~ flow_20, 20 in total, and these are also the fields to be predicted.
  4. The train or test column marks whether a record belongs to the training or the test set.
  5. This is a time-series numeric regression problem.
  6. The data contain anomalies: NaN values, negative values, and abnormally large values (a common way of handling these is sketched right after this list).
  7. The competition also provides daily data, 5-minute data, weather data and epidemic data, which can be used later for feature engineering.
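Since the anomalies in item 6 matter in practice, here is a minimal, illustrative sketch of one common way to tame them before modeling. The helper name clean_flows is hypothetical and not part of the original baseline; the baseline below instead handles this with 1st-99th percentile clipping inside its Trans class.

def clean_flows(df, cols):
    # Illustrative cleanup only: clip negative values and extreme spikes, then fill NaN.
    out = df.copy()
    for c in cols:
        upper = out[c].quantile(0.99)                 # cap extreme spikes at the 99th percentile
        out[c] = out[c].clip(lower=0, upper=upper)    # remove negative readings
        out[c] = out[c].fillna(out[c].median())       # simple NaN imputation with the column median
    return out

# usage, assuming df_hour is loaded as above:
# df_clean = clean_flows(df_hour, ['flow_{}'.format(i) for i in range(1, 21)])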

2. Modeling Analysis

Most of the existing baselines use lightgbm; here we try LSTM and Transformer models for this time-series problem instead.

Before modeling we need to construct the data. The figure below shows how to build a data structure suitable for LSTM / Transformer models:

[Figure: sliding-window construction of the X / Y sample pairs]

Here the test-set length T = 168 is used as the time span of each constructed sample: every window of length T becomes an input X, the immediately following window of length T becomes the target Y, and samples are generated by rolling forward with a stride of s = 1 hour.

Each time step t of X contains m feature values, such as flow_1, flow_2, ..., plus constructed features such as day, hour, and so on.

Each time step t of Y contains 20 target values, corresponding to flow_1 ~ flow_20.

Once the data are constructed this way, the training samples for each of the four test segments are simply sliced out of the sequence at the corresponding time points.

The final test data actually consist of only 4 samples: for each test window, the single window of length T immediately preceding it.
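Before the full implementation, here is a minimal standalone sketch of just this windowing logic. The helper name make_windows is hypothetical; normalization and NaN handling are left to the generate_xy_pair function in the code section below.

import numpy as np

def make_windows(values, T=168, stride=1):
    # values: array of shape (n_steps, n_features).
    # Returns X, Y where Y is the window immediately following each X window.
    xs, ys = [], []
    for i in range(0, len(values) - 2 * T + 1, stride):
        xs.append(values[i:i + T])            # input window of length T
        ys.append(values[i + T:i + 2 * T])    # target: the next T steps
    return np.asarray(xs), np.asarray(ys)

# toy check: 500 hourly steps with 20 flows
X, Y = make_windows(np.random.rand(500, 20))
print(X.shape, Y.shape)   # (165, 168, 20) (165, 168, 20)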

With the data constructed, the corresponding model structure is shown in the figure below:

[Figure: model structure: an input projection plus positional/time embeddings, followed by an LSTM or Transformer encoder and fully connected output layers]

From the data construction and model structure above, we can see that solving this problem with lightgbm would require building:

4 × 20 × k = 80k models

where k is the number of cross-validation folds; in other words, at least 80 models! (Taking the currently published baselines as the example; if flow_n were also treated as a feature, the number of models could be greatly reduced.)

How many individual values need to be predicted?

4 × 168 × 20 × k = 13440k

With the LSTM / Transformer approach used here, the total number of models is only:

4 × k

Only one model is needed per test window, test-set prediction runs just 4 times, and there are only 4 X samples to predict!

OK, this is not to claim that one approach is better than the other; that is ultimately decided by the final score. The point here is simply to offer a more concise and, I think, more interesting scheme ~~~

With the data construction and model structure covered, see the code below for the concrete implementation~

Baseline Code

COLUMNS_Y = ['flow_{}'.format(i) for i in range(1, 21)]
COLUMNS_X = COLUMNS_Y + ['day', 'hour', 'dayofweek']
COLUMNS_X, COLUMNS_Y
(['flow_1','flow_2','flow_3','flow_4','flow_5','flow_6','flow_7','flow_8','flow_9','flow_10','flow_11','flow_12','flow_13','flow_14','flow_15','flow_16','flow_17','flow_18','flow_19','flow_20','day','hour','dayofweek'],['flow_1','flow_2','flow_3','flow_4','flow_5','flow_6','flow_7','flow_8','flow_9','flow_10','flow_11','flow_12','flow_13','flow_14','flow_15','flow_16','flow_17','flow_18','flow_19','flow_20'])
def add_time_feat(data):
    data['time'] = pd.to_datetime(data['time'])
    data['day'] = data['time'].dt.day
    data['hour'] = data['time'].dt.hour
    data['minute'] = data['time'].dt.minute
    data['dayofweek'] = data['time'].dt.dayofweek
    return data.sort_values('time').reset_index(drop=True)

def add_other_feat(data, columns):
    data['flow_sum'] = data[columns].sum()
    data['flow_median'] = data[columns].median()
    data['flow_mean'] = data[columns].mean()
    return data
df_hour = add_time_feat(df_hour)
df_hour.head()
| index | time | flow_1 | flow_2 | flow_3 | flow_4 | flow_5 | flow_6 | flow_7 | flow_8 | flow_9 | ... | flow_16 | flow_17 | flow_18 | flow_19 | flow_20 | train or test | day | hour | minute | dayofweek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-01 01:00:00 | 29.7 | 14.6 | 54.7 | 40.1 | 3.0 | 49.7 | 10.9 | 1.1 | 5.0 | ... | 3.5 | 6.8 | NaN | 1.806 | 1.4 | train | 1 | 1 | 0 | 5 |
| 1 | 2022-01-01 02:00:00 | 21.9 | 9.0 | 38.0 | 27.7 | 2.4 | 30.2 | 6.4 | 0.4 | 2.6 | ... | 2.3 | 4.5 | NaN | 3.847 | 0.8 | train | 1 | 2 | 0 | 5 |
| 2 | 2022-01-01 03:00:00 | 16.9 | 4.5 | 28.9 | 22.9 | 1.3 | 19.7 | 3.8 | 0.5 | 1.4 | ... | 1.1 | 2.4 | NaN | NaN | 0.5 | train | 1 | 3 | 0 | 5 |
| 3 | 2022-01-01 04:00:00 | 14.3 | 3.2 | 25.5 | 20.0 | 1.5 | 15.4 | 2.7 | 0.4 | 1.2 | ... | 0.8 | 1.8 | NaN | NaN | 0.2 | train | 1 | 4 | 0 | 5 |
| 4 | 2022-01-01 05:00:00 | 14.9 | 3.5 | 26.4 | 20.6 | 1.2 | 17.5 | 2.2 | 0.5 | 1.2 | ... | 0.9 | 1.9 | NaN | NaN | 0.3 | train | 1 | 5 | 0 | 5 |

5 rows × 26 columns

class Trans:
    def __init__(self, data, name):
        # clip range: the 1st and 99th percentiles of the column (never below 0)
        self.min = max(0, np.percentile(data, 1))
        self.max = np.percentile(data, 99)
        self.base = self.max - self.min

    def transform(self, data, scale=True):
        _data = np.clip(data, self.min, self.max)
        if not scale:
            return _data
        return (_data - self.min) / self.base


class TransUtil:
    def __init__(self, data, exclude_cols=None):
        self.columns = data.columns
        self.exclude_cols = exclude_cols
        self.trans = {}
        for c in self.columns:
            if data[c].dtype not in [int, float]:
                print('column "{}" not init trans...'.format(c))
                continue
            if exclude_cols is None or (exclude_cols is not None and c not in exclude_cols):
                print('init trans column...', c)
                self.trans[c] = Trans(data[c].fillna(method='backfill').fillna(method='ffill'), c)

    def transform(self, data, col_name, scale=True):
        if self.exclude_cols is not None and col_name in self.exclude_cols:
            return data
        for t in self.trans:
            if t.startswith(col_name):
                return self.trans[t].transform(data, scale=scale)
        return data
trans_util = TransUtil(df_hour, exclude_cols=None)  # data normalization
column "time" not init trans...
init trans column... flow_1
init trans column... flow_2
init trans column... flow_3
init trans column... flow_4
init trans column... flow_5
init trans column... flow_6
init trans column... flow_7
init trans column... flow_8
init trans column... flow_9
init trans column... flow_10
init trans column... flow_11
init trans column... flow_12
init trans column... flow_13
init trans column... flow_14
init trans column... flow_15
init trans column... flow_16
init trans column... flow_17
init trans column... flow_18
init trans column... flow_19
init trans column... flow_20
column "train or test" not init trans...
init trans column... day
init trans column... hour
init trans column... minute
init trans column... dayofweek
def generate_xy_pair(data, seq_len, trans_util, columns_x, columns_y):
    data_x = pd.DataFrame()
    for c in columns_x:
        data_x[c] = trans_util.transform(data[c].fillna(data[c].median()), c)
    data_y = pd.DataFrame()
    for c in columns_y:
        data_y[c] = trans_util.transform(data[c].fillna(data[c].median()), c, scale=False)
    data_x = data_x.values
    data_y = data_y.values
    print(data_x.shape, data_y.shape)
    d_x = []
    d_y = []
    for i in range(len(data_x) - seq_len * 2 + 1):
        _x = data_x[i:i + seq_len]
        _y = data_y[i + seq_len:i + seq_len + seq_len]
        assert len(_x) == len(_y) == seq_len, (_x, _y, _x.shape, _y.shape, i, len(data_x))
        d_x.append(_x.T)
        d_y.append(_y.T)
    return np.asarray(d_x).transpose((0, 2, 1)), np.asarray(d_y).transpose((0, 2, 1))
data_x, data_y = generate_xy_pair(df_hour, seq_len=SEQ_LEN, trans_util=trans_util, columns_x=COLUMNS_X, columns_y=COLUMNS_Y)
(5736, 23) (5736, 20)
data_x.shape, data_y.shape
((5401, 168, 23), (5401, 168, 20))
data_x[0], data_y[0]
(array([[0.19510716, 0.2526096 , 0.26320132, ..., 0.        , 0.04347826,0.83333333],[0.11625556, 0.13569937, 0.12541254, ..., 0.        , 0.08695652,0.83333333],[0.06570966, 0.04175365, 0.05033003, ..., 0.        , 0.13043478,0.83333333],...,[0.63687829, 0.98538622, 0.92739274, ..., 0.2       , 0.95652174,0.66666667],[0.92094622, 0.6993737 , 0.67986799, ..., 0.2       , 1.        ,0.66666667],[0.26991508, 0.44050104, 0.38118812, ..., 0.23333333, 0.        ,0.83333333]]),array([[ 23.6  ,  12.2  ,  40.6  , ...,   3.932,   1.15 ,   1.4  ],[ 15.6  ,   5.   ,  32.6  , ...,   1.575,   0.509,   0.3  ],[ 12.4  ,   3.9  ,  25.1  , ...,   1.042,   0.394,   0.3  ],...,[ 71.3  ,  46.3  , 133.3  , ...,  14.968,   6.192,   4.8  ],[ 60.7  ,  37.   , 105.5  , ...,  12.944,   5.072,   4.   ],[ 35.   ,  19.8  ,  67.5  , ...,   8.908,   2.912,   2.4  ]]))
# Extract the train/test indices corresponding to each test segment
_train_idx_1 = df_hour[df_hour['time'] < test_list1[0]].index.values.tolist()
_train_idx_2 = df_hour[(df_hour['time'] > test_list1[1]) & (df_hour['time'] < test_list1[2])].index.values.tolist()
_train_idx_3 = df_hour[(df_hour['time'] > test_list1[3]) & (df_hour['time'] < test_list1[4])].index.values.tolist()
_train_idx_4 = df_hour[(df_hour['time'] > test_list1[5]) & (df_hour['time'] < test_list1[6])].index.values.tolist()

# Each segment's training data also includes all earlier segments
train_idx_1 = _train_idx_1[:-SEQ_LEN*2]
train_idx_2 = train_idx_1 + _train_idx_2[:-SEQ_LEN*2]
train_idx_3 = train_idx_2 + _train_idx_3[:-SEQ_LEN*2]
train_idx_4 = train_idx_3 + _train_idx_4[:-SEQ_LEN*2]

# The test sample for each segment is the window of SEQ_LEN hours right before it
test_idx_1 = _train_idx_1[-SEQ_LEN]
test_idx_2 = _train_idx_2[-SEQ_LEN]
test_idx_3 = _train_idx_3[-SEQ_LEN]
test_idx_4 = _train_idx_4[-SEQ_LEN]
len(_train_idx_1), len(_train_idx_2), len(_train_idx_3), len(_train_idx_4)
(2880, 576, 1032, 576)
len(train_idx_1), len(train_idx_2), len(train_idx_3), len(train_idx_4)
(2544, 2784, 3480, 3720)
test_idx_1, test_idx_2, test_idx_3, test_idx_4
(2712, 3456, 4656, 5400)
train_x_1 = data_x[train_idx_1]
train_y_1 = data_y[train_idx_1]
train_x_2 = data_x[train_idx_2]
train_y_2 = data_y[train_idx_2]
train_x_3 = data_x[train_idx_3]
train_y_3 = data_y[train_idx_3]
train_x_4 = data_x[train_idx_4]
train_y_4 = data_y[train_idx_4]

test_x_1 = data_x[test_idx_1]
test_x_2 = data_x[test_idx_2]
test_x_3 = data_x[test_idx_3]
test_x_4 = data_x[test_idx_4]

FEATURE_SIZE = train_x_1.shape[-1]
OUTPUT_SIZE = train_y_1.shape[-1]
train_x_1.shape, train_y_1.shape, test_x_1.shape
((2544, 168, 23), (2544, 168, 20), (168, 23))
import paddle
import paddle.nn as nn
import paddle.nn.functional as F


class Tt(nn.Layer):
    def __init__(self,
                 seq_len,
                 feature_size,
                 output_size,
                 use_model='lstm',
                 hidden_size=576,
                 num_hidden_layers=6,
                 num_attention_heads=6,
                 intermediate_size=3072,
                 hidden_act="gelu",
                 hidden_dropout_prob=0.1,
                 attention_probs_dropout_prob=0.1,
                 max_position_embeddings=512,
                 max_hour=25,
                 max_min=61,
                 max_dow=8,
                 max_ts=1441):
        super(Tt, self).__init__()
        self.use_model = use_model
        self.feature_size = feature_size
        # Optional time embeddings (hour / minute / day-of-week / timestamp), used if the corresponding inputs are provided
        self.th_embeddings = nn.Embedding(max_hour, hidden_size)
        self.tm_embeddings = nn.Embedding(max_min, hidden_size)
        self.td_embeddings = nn.Embedding(max_dow, hidden_size)
        self.tt_embeddings = nn.Embedding(max_ts, hidden_size)
        # Positional embeddings
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.fc_inputs = nn.Linear(feature_size, hidden_size)
        encoder_layer = nn.TransformerEncoderLayer(
            hidden_size,
            num_attention_heads,
            intermediate_size,
            dropout=hidden_dropout_prob,
            activation=hidden_act,
            attn_dropout=attention_probs_dropout_prob,
            act_dropout=0)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_hidden_layers)
        self.lstm = paddle.nn.LSTM(input_size=hidden_size, hidden_size=hidden_size, num_layers=2)
        self.fc_output_1 = nn.Linear(hidden_size, hidden_size)
        self.fc_output_2 = nn.Linear(hidden_size, hidden_size)
        self.fc_output_3 = nn.Linear(hidden_size, output_size)

    def forward(self,
                inputs,
                inputs_th=None,
                inputs_tm=None,
                inputs_td=None,
                inputs_tt=None,
                position_ids=None,
                attention_mask=None):
        if position_ids is None:
            ones = paddle.ones(inputs.shape[:2], dtype="int64")
            seq_length = paddle.cumsum(ones, axis=1)
            position_ids = seq_length - ones
            position_ids.stop_gradient = True
        position_embeddings = self.position_embeddings(position_ids)
        inputs = self.fc_inputs(inputs)
        inputs = nn.Tanh()(inputs)
        inputs = inputs + position_embeddings
        # Optional time embeddings, added if the corresponding inputs are provided
        if inputs_th is not None:
            inputs += self.th_embeddings(inputs_th)
        if inputs_tm is not None:
            inputs += self.tm_embeddings(inputs_tm)
        if inputs_td is not None:
            inputs += self.td_embeddings(inputs_td)
        if inputs_tt is not None:
            inputs += self.tt_embeddings(inputs_tt)
        inputs = self.layer_norm(inputs)
        # Choose between the LSTM and the Transformer encoder
        if self.use_model == 'lstm':
            encoder_outputs, (h, c) = self.lstm(inputs)
        elif self.use_model == 'transformer':
            if attention_mask is None:
                attention_mask = paddle.unsqueeze(
                    (paddle.zeros(inputs.shape[:2])).astype(self.fc_inputs.weight.dtype) * -1e4,
                    axis=[1, 2])
            encoder_outputs = self.encoder(inputs, src_mask=attention_mask)
        output = self.fc_output_1(encoder_outputs)
        output = nn.ReLU()(output)
        output = self.fc_output_2(output)
        output = self.fc_output_3(output)
        return output
import paddle
import paddle.nn.functional as F
from paddle.metric import Accuracy
from paddle.io import DataLoader, BatchSampler
from paddlenlp.datasets import MapDataset
from paddlenlp.data import DataCollatorWithPadding
from paddlenlp.data import Dict, Stack, Pad
def calc_score(y_true, y_pred):
    return 1 / (1 + msle(np.clip(np.reshape(y_true, -1), 0, None), np.clip(np.reshape(y_pred, -1), 0, None)))


def eval_model(model, data_loader):
    model.eval()
    y_pred = []
    y_true = []
    for step, batch in enumerate(data_loader, start=1):
        data = batch['data'].astype('float32')
        label = batch['label'].astype('float32')
        # Compute the model output
        output = model(inputs=data)
        y_pred.extend(output.numpy())
        y_true.extend(label.numpy())
    score = calc_score(y_true, y_pred)
    model.train()
    return score


def make_data_loader(data_x, idx, batch_size, data_y=None, shuffle=False):
    data = [{'data': data_x[i], 'label': 0 if data_y is None else data_y[i]} for i in idx]
    ds = MapDataset(data)
    batch_sampler = BatchSampler(ds, batch_size=batch_size, shuffle=shuffle)
    return DataLoader(dataset=ds, batch_sampler=batch_sampler)
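To spell out what calc_score above computes: the validation score used to pick the best checkpoint is the inverse of one plus the mean squared logarithmic error, with both targets and predictions first clipped to be non-negative:

score = 1 / (1 + MSLE(clip(y_true, 0), clip(y_pred, 0)))

so a perfect prediction scores 1 and larger errors push the score toward 0. (Whether the leaderboard uses exactly the same formula is not stated here; this is only the validation metric used during training.)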
EPOCHS = 30
BATCH_SIZE = 256
CKPT_DIR = 'work/output'
K_FOLD = 5
epoch_base = 0
step_eval = 5
step_log = 100


def do_train(train_x, train_y, prefix):
    print('-' * 20)
    print('training ...', prefix)
    print('train x:', np.shape(train_x), 'train y:', np.shape(train_y))
    paddle.seed(2022)
    for kfold, tv_idx in enumerate(KFold(n_splits=K_FOLD, shuffle=True, random_state=2022).split(train_x)):
        print('training fold...', kfold)
        train_idx, valid_idx = tv_idx
        model = Tt(seq_len=SEQ_LEN, feature_size=FEATURE_SIZE, output_size=OUTPUT_SIZE)
        train_data_loader = make_data_loader(train_x, train_idx, BATCH_SIZE, data_y=train_y, shuffle=True)
        valid_data_loader = make_data_loader(train_x, valid_idx, BATCH_SIZE, data_y=train_y, shuffle=False)
        optimizer = paddle.optimizer.AdamW(learning_rate=1e-4, parameters=model.parameters())
        criterion = paddle.nn.MSELoss()
        epochs = EPOCHS      # number of training epochs
        save_dir = CKPT_DIR  # directory for saving model parameters during training
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        global_step = 0      # iteration counter
        tic_train = time.time()
        model.train()
        best_score = 0
        for epoch in range(1 + epoch_base, epochs + epoch_base + 1):
            for step, batch in enumerate(train_data_loader, start=1):
                data = batch['data'].astype('float32')
                label = batch['label'].astype('float32')
                # Compute the model output
                output = model(inputs=data)
                loss = criterion(output, label)
                # Periodically evaluate on the validation fold and log loss / score / speed
                global_step += 1
                if global_step % step_eval == 0:
                    score = eval_model(model, valid_data_loader)
                    if score > best_score:
                        _save_dir = os.path.join(save_dir, '{}_kfold_{}_best_model.pdparams'.format(prefix, kfold))
                        paddle.save(model.state_dict(), _save_dir)
                        best_score = score
                if global_step % step_log == 0:
                    print('global step %d, epoch: %d, batch: %d, loss: %.5f, valid score: %.5f, speed: %.2f step/s'
                          % (global_step, epoch, step, loss, score, 10 / (time.time() - tic_train)))
                    tic_train = time.time()
                # Back-propagate and update parameters
                loss.backward()
                optimizer.step()
                optimizer.clear_grad()
def do_pred(test_x, prefix):
    print('-' * 20)
    print('predict ...', prefix)
    print('predict x:', np.shape(test_x))
    # Prediction
    test_data_loader = make_data_loader([test_x], [0], BATCH_SIZE, data_y=None, shuffle=False)
    sub_df = []
    save_dir = CKPT_DIR
    for kfold in range(K_FOLD):
        print('predict kfold...', kfold)
        model = Tt(seq_len=SEQ_LEN, feature_size=FEATURE_SIZE, output_size=OUTPUT_SIZE)
        model.set_dict(paddle.load(os.path.join(save_dir, '{}_kfold_{}_best_model.pdparams'.format(prefix, kfold))))
        model.eval()
        y_pred = []
        for step, batch in enumerate(test_data_loader, start=1):
            data = batch['data'].astype('float32')
            label = batch['label'].astype('float32')
            # Compute the model output
            output = model(inputs=data)
            y_pred.extend(output.numpy())
        sub_df.append(np.clip(y_pred, 0, None))
    return sub_df
# Train the model for each test segment in turn
do_train(train_x_1, train_y_1, 'm1')
do_train(train_x_2, train_y_2, 'm2')
do_train(train_x_3, train_y_3, 'm3')
do_train(train_x_4, train_y_4, 'm4')
--------------------
training ... m1
train x: (2544, 168, 23) train y: (2544, 168, 20)
training fold... 0
W0928 21:34:13.226250   365 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0928 21:34:13.229223   365 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
global step 100, epoch: 13, batch: 4, loss: 189.34042, valid score: 0.74267, speed: 0.67 step/s
global step 200, epoch: 25, batch: 8, loss: 26.75570, valid score: 0.94225, speed: 0.75 step/s
training fold... 1
global step 100, epoch: 13, batch: 4, loss: 179.81596, valid score: 0.75175, speed: 0.88 step/s
global step 200, epoch: 25, batch: 8, loss: 27.06740, valid score: 0.94496, speed: 0.75 step/s
training fold... 2
global step 100, epoch: 13, batch: 4, loss: 192.32230, valid score: 0.74129, speed: 0.91 step/s
global step 200, epoch: 25, batch: 8, loss: 27.35677, valid score: 0.94298, speed: 0.75 step/s
training fold... 3
global step 100, epoch: 13, batch: 4, loss: 176.71466, valid score: 0.75317, speed: 0.87 step/s
global step 200, epoch: 25, batch: 8, loss: 24.32207, valid score: 0.94430, speed: 0.75 step/s
training fold... 4
global step 100, epoch: 13, batch: 4, loss: 196.51141, valid score: 0.73796, speed: 0.88 step/s
global step 200, epoch: 25, batch: 8, loss: 27.48337, valid score: 0.94143, speed: 0.74 step/s
--------------------
training ... m2
train x: (2784, 168, 23) train y: (2784, 168, 20)
training fold... 0
global step 100, epoch: 12, batch: 1, loss: 192.12552, valid score: 0.74218, speed: 0.83 step/s
global step 200, epoch: 23, batch: 2, loss: 26.67301, valid score: 0.94218, speed: 0.73 step/s
training fold... 1
global step 100, epoch: 12, batch: 1, loss: 181.16043, valid score: 0.75225, speed: 0.85 step/s
global step 200, epoch: 23, batch: 2, loss: 26.28015, valid score: 0.94389, speed: 0.73 step/s
training fold... 2
global step 100, epoch: 12, batch: 1, loss: 194.71078, valid score: 0.74261, speed: 0.87 step/s
global step 200, epoch: 23, batch: 2, loss: 28.19350, valid score: 0.93948, speed: 0.72 step/s
training fold... 3
global step 100, epoch: 12, batch: 1, loss: 181.40471, valid score: 0.75267, speed: 0.86 step/s
global step 200, epoch: 23, batch: 2, loss: 27.63694, valid score: 0.94298, speed: 0.72 step/s
training fold... 4
global step 100, epoch: 12, batch: 1, loss: 194.80693, valid score: 0.73768, speed: 0.85 step/s
global step 200, epoch: 23, batch: 2, loss: 27.04206, valid score: 0.93785, speed: 0.73 step/s
--------------------
training ... m3
train x: (3480, 168, 23) train y: (3480, 168, 20)
training fold... 0
global step 100, epoch: 10, batch: 1, loss: 195.62051, valid score: 0.74132, speed: 0.80 step/s
global step 200, epoch: 19, batch: 2, loss: 29.17942, valid score: 0.93782, speed: 0.70 step/s
global step 300, epoch: 28, batch: 3, loss: 22.93004, valid score: 0.94732, speed: 0.90 step/s
training fold... 1
global step 100, epoch: 10, batch: 1, loss: 191.73341, valid score: 0.74899, speed: 0.85 step/s
global step 200, epoch: 19, batch: 2, loss: 28.48909, valid score: 0.94111, speed: 0.70 step/s
global step 300, epoch: 28, batch: 3, loss: 24.10351, valid score: 0.94549, speed: 0.83 step/s
training fold... 2
global step 100, epoch: 10, batch: 1, loss: 200.53751, valid score: 0.74166, speed: 0.84 step/s
global step 200, epoch: 19, batch: 2, loss: 32.34964, valid score: 0.93378, speed: 0.70 step/s
global step 300, epoch: 28, batch: 3, loss: 22.18238, valid score: 0.94529, speed: 0.86 step/s
training fold... 3
global step 100, epoch: 10, batch: 1, loss: 190.54114, valid score: 0.74929, speed: 0.83 step/s
global step 200, epoch: 19, batch: 2, loss: 29.43060, valid score: 0.93647, speed: 0.70 step/s
global step 300, epoch: 28, batch: 3, loss: 22.63792, valid score: 0.94633, speed: 0.84 step/s
training fold... 4
global step 100, epoch: 10, batch: 1, loss: 199.86848, valid score: 0.73911, speed: 0.82 step/s
global step 200, epoch: 19, batch: 2, loss: 30.84038, valid score: 0.93401, speed: 0.71 step/s
global step 300, epoch: 28, batch: 3, loss: 25.37951, valid score: 0.94664, speed: 0.82 step/s
--------------------
training ... m4
train x: (3720, 168, 23) train y: (3720, 168, 20)
training fold... 0
global step 100, epoch: 9, batch: 4, loss: 196.55203, valid score: 0.74267, speed: 0.81 step/s
global step 200, epoch: 17, batch: 8, loss: 31.35485, valid score: 0.93497, speed: 0.70 step/s
global step 300, epoch: 25, batch: 12, loss: 24.27215, valid score: 0.94545, speed: 0.80 step/s
training fold... 1
global step 100, epoch: 9, batch: 4, loss: 191.64560, valid score: 0.74758, speed: 0.83 step/s
global step 200, epoch: 17, batch: 8, loss: 30.92274, valid score: 0.93813, speed: 0.69 step/s
global step 300, epoch: 25, batch: 12, loss: 24.90816, valid score: 0.94470, speed: 0.90 step/s
training fold... 2
global step 100, epoch: 9, batch: 4, loss: 197.55722, valid score: 0.74337, speed: 0.84 step/s
global step 200, epoch: 17, batch: 8, loss: 31.99613, valid score: 0.93345, speed: 0.70 step/s
global step 300, epoch: 25, batch: 12, loss: 24.23726, valid score: 0.94481, speed: 0.77 step/s
training fold... 3
global step 100, epoch: 9, batch: 4, loss: 186.58867, valid score: 0.74806, speed: 0.79 step/s
global step 200, epoch: 17, batch: 8, loss: 29.82816, valid score: 0.93393, speed: 0.71 step/s
global step 300, epoch: 25, batch: 12, loss: 25.93081, valid score: 0.94440, speed: 0.84 step/s
training fold... 4
global step 100, epoch: 9, batch: 4, loss: 198.73732, valid score: 0.74012, speed: 0.81 step/s
global step 200, epoch: 17, batch: 8, loss: 31.71860, valid score: 0.92987, speed: 0.70 step/s
global step 300, epoch: 25, batch: 12, loss: 24.98176, valid score: 0.94471, speed: 0.83 step/s
# Predict each test segment in turn
pred_1 = do_pred(test_x_1, 'm1')
pred_2 = do_pred(test_x_2, 'm2')
pred_3 = do_pred(test_x_3, 'm3')
pred_4 = do_pred(test_x_4, 'm4')
--------------------
predict ... m1
predict x: (168, 23)
predict kfold... 0
predict kfold... 1
predict kfold... 2
predict kfold... 3
predict kfold... 4
--------------------
predict ... m2
predict x: (168, 23)
predict kfold... 0
predict kfold... 1
predict kfold... 2
predict kfold... 3
predict kfold... 4
--------------------
predict ... m3
predict x: (168, 23)
predict kfold... 0
predict kfold... 1
predict kfold... 2
predict kfold... 3
predict kfold... 4
--------------------
predict ... m4
predict x: (168, 23)
predict kfold... 0
predict kfold... 1
predict kfold... 2
predict kfold... 3
predict kfold... 4
np.shape(pred_1), np.shape(pred_2), np.shape(pred_3), np.shape(pred_4)
((5, 1, 168, 20), (5, 1, 168, 20), (5, 1, 168, 20), (5, 1, 168, 20))
result = np.vstack((np.mean(pred_1, axis=0).squeeze(),
                    np.mean(pred_2, axis=0).squeeze(),
                    np.mean(pred_3, axis=0).squeeze(),
                    np.mean(pred_4, axis=0).squeeze()))
result[result < 0] = 0
result = pd.concat([df_sub['time'], pd.DataFrame(result)], axis=1)
result.columns = df_sub.columns
result.to_csv('work/result/result_0929_1.csv', index=False, encoding='utf-8')
result
[Output: result, 672 rows × 21 columns: a time column plus the predicted flow_1 ~ flow_20 for every hour of the four test windows, from 2022-05-01 01:00:00 through 2022-08-28 00:00:00.]

Result Analysis

Since Paddle results fluctuate slightly between runs, only a simple comparison is given here:

| model | epochs | score |
|---|---|---|
| LSTM | 30 | 0.441 |
| LSTM | 50 | 0.442 |


Some analysis and remarks about the model were already discussed in the other article, [Competition Baseline] "深水云脑" Water Purification Plant Process Control - Aeration Prediction: a DL Baseline, so they are not repeated here.

A few additional points:

  1. If the Transformer structure is used, a Decoder step can be added on top, similar to generative models in NLP, which makes the model more flexible; only the Encoder part is written here.
  2. To push the score higher, try constructing more data features, e.g. time differences or non-linear transforms (a small sketch follows this list).
  3. The biggest advantage of deep learning models is structural flexibility: as the model above shows, it can output 168 × 20 = 3360 values at once and back-propagate through all of them in a single pass. Multi-task learning has been shown to work well in NLP, and it is worth trying in traditional regression problems as well.
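As an illustration of point 2 (the helper name add_extra_feat and the specific transforms are only examples, not part of the original baseline), time-difference and non-linear features could be added alongside the existing day/hour features:

def add_extra_feat(data, columns):
    # Illustrative only: first-order time differences and a log1p transform per flow column
    for c in columns:
        data[c + '_diff1'] = data[c].diff(1)                    # hour-over-hour change
        data[c + '_diff24'] = data[c].diff(24)                  # change vs. the same hour one day earlier
        data[c + '_log1p'] = np.log1p(data[c].clip(lower=0))    # compress large spikes
    return data

# e.g. df_hour = add_extra_feat(df_hour, COLUMNS_Y), then extend COLUMNS_X accordingly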

Finally, a more complex model does not necessarily score better: in one baseline uploaded by another participant, a simple mean strategy scores far better than this model, which is food for thought ~~~ haha [facepalm]

OK, I hope this article is helpful; feel free to discuss any questions~

Appendix:

Other open-source projects:

[Competition Baseline] "深水云脑" Water Purification Plant Process Control - Aeration Prediction: a DL Baseline

[Experiment] "Characters" or "words"? That is the question!

[Competition] iFLYTEK, paper-abstract text classification and query-style QA: thoughts from finishing 4th (tied for 3rd)


