Capital Inflow and Outflow Prediction: Feature Engineering

2024-01-24 02:58


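Feature Engineering


The main job of feature engineering is to extract potentially valuable features on the basis of data analysis and exploration.

Feature engineering plays a crucial role in building the model.

The first step of feature engineering is feature extraction; the second is combining the extracted features.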

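Feature extraction

  • Extraction based on data analysis and exploration
  • Box plot analysis
    Trading volume depends on the day of the week ⇒ this yields 6 binary (0/1) features (is Monday, is Tuesday, …, is weekend)

  • Point plot analysis
    user_start_level (user star level) and user_occupation_id (user occupation) are both related to is_trade (whether a trade
    occurred) ⇒ although user_start_level is continuous, it is worth discretizing

  • Why discrete features matter:
    • They can be used to design rules
    • They are easy for models to fit; xgboost, lightgbm, catboost and similar libraries all use decision trees as base learners
    • They are easy to interpret
    • They are convenient for feature combination
    • They are common in areas such as recommender systems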
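
The two ideas in the list above (weekday indicators and discretizing a numeric column) can be sketched as follows. This is a minimal illustration on invented data: the column name star_level, the values, and the bin edges are all assumptions, not taken from the competition data set.

```python
import pandas as pd

# Hypothetical toy data: the dates and star levels are invented for illustration.
df = pd.DataFrame({
    "date": pd.date_range("2014-04-07", periods=7, freq="D"),  # Monday to Sunday
    "star_level": [0, 1, 2, 3, 5, 8, 10],
})

# Six 0/1 features: one per weekday from Monday to Friday, plus one for the weekend.
weekday = df["date"].dt.weekday  # Monday = 0, ..., Sunday = 6
for day, name in enumerate(["mon", "tue", "wed", "thu", "fri"]):
    df["is_" + name] = (weekday == day).astype(int)
df["is_weekend"] = (weekday >= 5).astype(int)

# Discretize a numeric column into three ordered bins (the bin edges are arbitrary here).
df["star_level_bin"] = pd.cut(
    df["star_level"], bins=[-1, 2, 5, 10], labels=["low", "mid", "high"]
)
```

In practice the bin edges would be chosen from the point plot rather than fixed in advance.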

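Feature selection

  • Eliminating the weak: discard features that are almost irrelevant and keep a large number of features
  • Selecting the strong: pick out the good features to form the best feature subset
  • Methods for analyzing feature importance:
    • Mean Variance Test
    • SHAP (SHapley Additive exPlanations)
      • The Python shap library
      • Can be used to explain the output of any machine learning model
      • A positive SHAP value means the variable pushes the prediction up
      • A negative SHAP value means the variable pushes the prediction down
      • The larger the absolute value ⇒ the greater the impact on the prediction
    • Permutation Importance
      • The Python eli5 library
      • If model performance drops noticeably after a feature's values are randomly shuffled, the feature is considered important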
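
The permutation idea in the last bullet can be written out by hand. The sketch below uses only NumPy with a least-squares linear model on invented data; it is not the eli5 implementation, and for brevity it scores the model on the same data it was fitted on, whereas a held-out set would normally be used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y depends strongly on column 0 and not at all on column 1.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Fit ordinary least squares with an intercept column.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def mse(features: np.ndarray) -> float:
    """Mean squared error of the fitted linear model on the given features."""
    pred = np.column_stack([features, np.ones(len(features))]) @ coef
    return float(np.mean((y - pred) ** 2))

baseline = mse(X)
importance = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between column j and y
    importance.append(mse(X_perm) - baseline)     # increase in error = importance
```

Shuffling column 0 destroys most of the model's accuracy, while shuffling column 1 changes the error very little, so the first importance is much larger than the second.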

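Feature combination

  • Simple ways to combine features:
    • Apply addition, subtraction, multiplication, division, log, exp and similar operations directly
    • This easily generates a large number of features, but it is prone to overfitting and hard to interpret
  • Better ways to extract and combine features:
    • Analyze and explore the data on the basis of a thorough understanding of the problem background
    • Taking time series problems as an example, common features include statistics (maximum, minimum, median, skewness, kurtosis, etc.), rankings (the rank of each statistic among the same periods in history), quantiles (the quantile of each statistic's rank among the same periods in history), and so on. Periodic factors can also be viewed as a combination of several features
    • Taking recommender systems as an example, common feature categories include user features, item features, behavior features (aggregated by time period), and so on. The discrete features among them can be combined directly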
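
As a small illustration of the simple combinations listed above, the sketch below builds arithmetic combinations of two numeric columns and crosses two discrete columns. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are invented for illustration.
df = pd.DataFrame({
    "purchase": [120.0, 80.0, 200.0, 150.0],
    "redeem": [60.0, 100.0, 50.0, 150.0],
    "is_weekend": [0, 0, 1, 1],
    "is_holiday": [0, 1, 0, 1],
})

# Arithmetic combinations of numeric features.
df["net_flow"] = df["purchase"] - df["redeem"]
df["flow_ratio"] = df["purchase"] / df["redeem"]
df["log_purchase"] = np.log(df["purchase"])

# Crossing two discrete features: each distinct pair of values becomes one category.
df["weekend_x_holiday"] = df["is_weekend"].astype(str) + "_" + df["is_holiday"].astype(str)
```

Each pair of columns can produce several such features, which is why this approach grows the feature set quickly and calls for a selection step afterwards.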

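Implementing the feature engineering

The analysis below is carried out only for the redemption total (total_redeem_amt); the feature engineering process for the purchase total is similar.

  • First extract features and visualize how the redemption total is distributed across them. From the plots, features that cannot split the data set effectively and features with large variance can be discarded directly
  • Analyze the correlation between the redemption total and the other features, and discard the features that are weakly correlated with it
  • Use MVTest to recover, from the discarded weakly correlated set, the features that the redemption total still depends on
  • Use SHAP and Permutation Importance to select the winning features
  • Combine the results to obtain the final feature set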

Initialization

Import the required packages:

## Import libraries
import pandas as pd
import numpy as np
import datetime
import shap
import eli5
import seaborn as sns
import matplotlib.pyplot as plt
from mvtest import *
from wordcloud import WordCloud
from scipy import stats
from eli5.sklearn import PermutationImportance
from sklearn import tree
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from typing import *
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

To simplify later operations, set up global index variables:

## Global lists of the label columns and the date-part columns
labels = ['total_purchase_amt','total_redeem_amt']
date_indexs = ['week','year','month','weekday','day']

Data preprocessing

Define the functions used to process the data:

## Read user_balance_table (user purchase and redemption records)
def load_data(path: str = 'user_balance_table.csv') -> pd.DataFrame:
    data_balance = pd.read_csv(path)
    return data_balance.reset_index(drop=True)

## Add date-part columns
def add_timestamp(data: pd.DataFrame, time_index: str = 'report_date') -> pd.DataFrame:
    data_balance = data.copy()
    data_balance['date'] = pd.to_datetime(data_balance[time_index], format="%Y%m%d")
    data_balance['day'] = data_balance['date'].dt.day
    data_balance['month'] = data_balance['date'].dt.month
    data_balance['year'] = data_balance['date'].dt.year
    # ISO week number (isocalendar() replaces the Series.dt.week accessor removed in pandas 2.0)
    data_balance['week'] = data_balance['date'].dt.isocalendar().week.astype(int)
    data_balance['weekday'] = data_balance['date'].dt.weekday  # Monday = 0, ..., Sunday = 6
    return data_balance.reset_index(drop=True)

## Compute the daily purchase and redemption totals
def get_total_balance(data: pd.DataFrame, date: str = '2014-03-31') -> pd.DataFrame:
    df_tmp = data.copy()
    # Select the two label columns with a list; tuple-style selection is no longer supported
    df_tmp = df_tmp.groupby(['date'])[['total_purchase_amt', 'total_redeem_amt']].sum()
    df_tmp.reset_index(inplace=True)
    return df_tmp[(df_tmp['date'] >= date)].reset_index(drop=True)

## Generate the test data
def generate_test_data(data: pd.DataFrame) -> pd.DataFrame:
    total_balance = data.copy()
    start = datetime.datetime(2014, 9, 1)
    testdata = []
    while start != datetime.datetime(2014, 10, 15):
        temp = [start, np.nan, np.nan]  # initialize the values to be predicted with NaN
        testdata.append(temp)
        start += datetime.timedelta(days=1)
    testdata = pd.DataFrame(testdata)
    testdata.columns = total_balance.columns
    # Append the generated test rows to the original total_balance
    total_balance = pd.concat([total_balance, testdata], axis=0)
    return total_balance.reset_index(drop=True)

## Read user_profile_table (user profile table)
def load_user_information(path: str = 'user_profile_table.csv') -> pd.DataFrame:
    return pd.read_csv(path)

Read and process the data:

## Read and process the data
balance_data = load_data('Dataset/user_balance_table.csv')
balance_data = add_timestamp(balance_data, time_index='report_date')
total_balance = get_total_balance(balance_data)
total_balance = generate_test_data(total_balance)
total_balance = add_timestamp(total_balance, 'date')
user_information = load_user_information('Dataset/user_profile_table.csv')

The loaded results are as follows:

balance_data

total_balance

user_information

Feature extraction

Date-based features

Extract date features such as whether a day falls in the early, middle, or late part of the month, whether it is a weekend, and whether it is a public holiday:

## Build the set of public holiday dates
def get_holiday_set() -> Set[datetime.date]:
    holiday_set = set()
    # Qingming Festival
    holiday_set = holiday_set | {datetime.date(2014,4,5), datetime.date(2014,4,6), datetime.date(2014,4,7)}
    # Labor Day
    holiday_set = holiday_set | {datetime.date(2014,5,1), datetime.date(2014,5,2), datetime.date(2014,5,3)}
    # Dragon Boat Festival
    holiday_set = holiday_set | {datetime.date(2014,5,31), datetime.date(2014,6,1), datetime.date(2014,6,2)}
    # Mid-Autumn Festival
    holiday_set = holiday_set | {datetime.date(2014,9,6), datetime.date(2014,9,7), datetime.date(2014,9,8)}
    # National Day
    holiday_set = holiday_set | {datetime.date(2014,10,1), datetime.date(2014,10,2), datetime.date(2014,10,3),
                                 datetime.date(2014,10,4), datetime.date(2014,10,5), datetime.date(2014,10,6),
                                 datetime.date(2014,10,7)}
    # Mid-Autumn Festival (2013)
    holiday_set = holiday_set | {datetime.date(2013,9,19), datetime.date(2013,9,20), datetime.date(2013,9,21)}
    # National Day (2013)
    holiday_set = holiday_set | {datetime.date(2013,10,1), datetime.date(2013,10,2), datetime.date(2013,10,3),
                                 datetime.date(2013,10,4), datetime.date(2013,10,5), datetime.date(2013,10,6),
                                 datetime.date(2013,10,7)}
    return holiday_set

Extract all yes/no features:

## Extract all yes/no features
def extract_is_feature(data: pd.DataFrame) -> pd.DataFrame:
    total_balance = data.copy().reset_index(drop=True)

    # Is it a weekend
    total_balance['is_weekend'] = 0
    total_balance.loc[total_balance['weekday'].isin((5, 6)), 'is_weekend'] = 1

    # Is it a public holiday
    total_balance['is_holiday'] = 0
    total_balance.loc[total_balance['date'].isin(get_holiday_set()), 'is_holiday'] = 1

    # Is it the first day of a holiday
    last_day_flag = 0  # whether the previous day was a holiday
    total_balance['is_firstday_of_holiday'] = 0
    # iterrows() is a generator over the rows of the frame; it yields each row's index
    # together with the row itself, so it can be used to walk through the rows
    for index, row in total_balance.iterrows():
        if last_day_flag == 0 and row['is_holiday'] == 1:
            # today is a holiday but yesterday was not
            total_balance.loc[index, 'is_firstday_of_holiday'] = 1
        last_day_flag = row['is_holiday']

    # Is it the last day of a holiday
    total_balance['is_lastday_of_holiday'] = 0
    for index, row in total_balance.iterrows():
        if index == len(total_balance) - 1:  # the last row has no next day to compare with
            break
        if row['is_holiday'] == 1 and total_balance.loc[index+1, 'is_holiday'] == 0:
            total_balance.loc[index, 'is_lastday_of_holiday'] = 1

    # Is it the first working day after a holiday
    total_balance['is_firstday_of_work'] = 0
    last_day_flag = 0
    for index, row in total_balance.iterrows():
        if last_day_flag == 1 and row['is_holiday'] == 0:
            total_balance.loc[index, 'is_firstday_of_work'] = 1
        last_day_flag = row['is_lastday_of_holiday']

    # Is it a working day (every day except holidays and weekends)
    total_balance['is_work'] = 1
    total_balance.loc[(total_balance['is_holiday'] == 1) | (total_balance['is_weekend'] == 1), 'is_work'] = 0
    special_work_day_set = {datetime.date(2014,5,4), datetime.date(2014,9,28)}  # make-up working days
    total_balance.loc[total_balance['date'].isin(special_work_day_set), 'is_work'] = 1

    # Is tomorrow a working day (today is a day off, tomorrow is not)
    total_balance['is_gonna_work_tomorrow'] = 0
    for index, row in total_balance.iterrows():
        if index == len(total_balance) - 1:
            break
        if row['is_work'] == 0 and total_balance.loc[index+1, 'is_work'] == 1:
            total_balance.loc[index, 'is_gonna_work_tomorrow'] = 1

    # Was yesterday a working day
    total_balance['is_worked_yestday'] = 0
    for index, row in total_balance.iterrows():
        if index < 1:  # the first row has no previous day
            continue
        if total_balance.loc[index-1, 'is_work'] == 1:
            total_balance.loc[index, 'is_worked_yestday'] = 1

    # Is it the day before a holiday
    total_balance['is_lastday_of_workday'] = 0
    for index, row in total_balance.iterrows():
        if index == len(total_balance) - 1:
            break
        if row['is_holiday'] == 0 and total_balance.loc[index+1, 'is_holiday'] == 1:
            total_balance.loc[index, 'is_lastday_of_workday'] = 1

    # Is it a Sunday that is a working day
    total_balance['is_work_on_sunday'] = 0
    for index, row in total_balance.iterrows():
        if index == len(total_balance) - 1:
            break
        if row['weekday'] == 6 and row['is_work'] == 1:
            total_balance.loc[index, 'is_work_on_sunday'] = 1

    # Is it the first day of the month
    total_balance['is_firstday_of_month'] = 0
    total_balance.loc[total_balance['day'] == 1, 'is_firstday_of_month'] = 1

    # Is it the second day of the month
    total_balance['is_secday_of_month'] = 0
    total_balance.loc[total_balance['day'] == 2, 'is_secday_of_month'] = 1

    # Is it the early part of the month (days 1 to 10)
    total_balance['is_premonth'] = 0
    total_balance.loc[total_balance['day'] <= 10, 'is_premonth'] = 1

    # Is it the middle of the month (days 11 to 20)
    total_balance['is_midmonth'] = 0
    total_balance.loc[(10 < total_balance['day']) & (total_balance['day'] <= 20), 'is_midmonth'] = 1

    # Is it the late part of the month (day 21 onwards)
    total_balance['is_tailmonth'] = 0
    total_balance.loc[20 < total_balance['day'], 'is_tailmonth'] = 1

    # Week of the month, approximated as the ISO week number modulo 4
    # Is it the first week of the month
    total_balance['is_first_week'] = 0
    total_balance.loc[total_balance['week'] % 4 == 1, 'is_first_week'] = 1

    # Is it the second week of the month
    total_balance['is_second_week'] = 0
    total_balance.loc[total_balance['week'] % 4 == 2, 'is_second_week'] = 1

    # Is it the third week of the month
    total_balance['is_third_week'] = 0
    total_balance.loc[total_balance['week'] % 4 == 3, 'is_third_week'] = 1

    # Is it the fourth week of the month
    total_balance['is_fourth_week'] = 0
    total_balance.loc[total_balance['week'] % 4 == 0, 'is_fourth_week'] = 1

    return total_balance.reset_index(drop=True)

## Add all yes/no features to the data set
total_balance = extract_is_feature(total_balance)
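
The row-by-row loops above compare each day with its neighbor. As an alternative sketch (not the author's code), the same holiday-boundary flags can be computed without loops by shifting the is_holiday column. The toy calendar below is invented, and the only behavioral difference is at the final row, where this version treats the missing next day as a non-holiday.

```python
import pandas as pd

# Hypothetical toy calendar: 1 marks a holiday, one row per consecutive day.
cal = pd.DataFrame({"is_holiday": [0, 1, 1, 1, 0, 0, 1, 0]})

prev_day = cal["is_holiday"].shift(1, fill_value=0)   # holiday flag of the previous day
next_day = cal["is_holiday"].shift(-1, fill_value=0)  # holiday flag of the next day

# First day of a holiday: today is a holiday and yesterday was not.
cal["is_firstday_of_holiday"] = ((cal["is_holiday"] == 1) & (prev_day == 0)).astype(int)
# Last day of a holiday: today is a holiday and tomorrow is not.
cal["is_lastday_of_holiday"] = ((cal["is_holiday"] == 1) & (next_day == 0)).astype(int)
```

On a few hundred rows the loops are fast enough, so the shifted version mainly helps readability.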

One-hot encode the weekday feature:

## One-hot encode the weekday feature
def encode_data(data: pd.DataFrame, feature_name: str = 'weekday', encoder=OneHotEncoder()) -> pd.DataFrame:
    total_balance = data.copy()
    week_feature = encoder.fit_transform(np.array(total_balance[feature_name]).reshape(-1, 1)).toarray()
    week_feature = pd.DataFrame(week_feature, columns=[feature_name + '_onehot_' + str(x) for x in range(len(week_feature[0]))])
    featureWeekday = pd.concat([total_balance, week_feature], axis=1)
    return featureWeekday

## Add the encoded weekday feature to the data set
total_balance = encode_data(total_balance)
total_balance.head()

Build the set of yes/no features:

## Build the set of yes/no features
feature = total_balance[[x for x in total_balance.columns if x not in date_indexs]]  # date_indexs = ['week','year','month','weekday','day']
feature.head()

Analyze the label distribution:

Draw box plots:

## Draw box plots
def draw_boxplot(data: pd.DataFrame) -> None:
    f, axes = plt.subplots(7, 4, figsize=(18, 24))
    global date_indexs, labels
    count = 0
    for i in [x for x in data.columns if x not in date_indexs + labels + ['date']]:
        sns.boxenplot(x=i, y='total_redeem_amt', data=data, ax=axes[count // 4][count % 4])
        count += 1

## Plot total_redeem_amt against each feature
draw_boxplot(feature)

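Based on the box plots, discard some features that look weak:

## Discard features that look weak
redeem_feature_seems_useless = [
    # Too few samples to be useful in modeling; if this is known to be a valid rule, it can still be used to adjust the results
    'is_work_on_sunday',
    # The medians show no clear difference
    'is_premonth'
]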

Correlation analysis:

Draw the correlation heatmap:

## Draw the correlation heatmap
def draw_correlation_heatmap(data: pd.DataFrame, way: str = 'pearson') -> None:
    feature = data.copy()
    plt.figure(figsize=(20, 10))
    plt.title('The ' + way + ' correlation between total redeem and each feature')
    sns.heatmap(feature[[x for x in feature.columns if x not in ['total_purchase_amt', 'date']]].corr(way),
                linecolor='white', linewidths=0.1, cmap="RdBu")

draw_correlation_heatmap(feature, 'spearman')

Discard the features that are weakly correlated with total_redeem_amt:

## Discard weakly correlated features (absolute Spearman correlation below 0.1)
temp = np.abs(feature[[x for x in feature.columns if x not in ['total_purchase_amt', 'date']]].corr('spearman')['total_redeem_amt'])
feature_low_correlation = list(set(temp[temp < 0.1].index))
print('Correlation of each feature with total_redeem_amt')
print(temp)
print('Features weakly correlated with total_redeem_amt')
print(feature_low_correlation)
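
To make the threshold rule concrete, here is a small worked example on invented data. The column names monotone and u_shaped are hypothetical; the point is that Spearman correlation measures monotone association, so a feature whose effect is not monotone can fall below the 0.1 cutoff even though it is related to the target.

```python
import pandas as pd

# Hypothetical toy data: "monotone" rises with the target, "u_shaped" is 1 only at both extremes.
toy = pd.DataFrame({
    "target": [1.0, 2.0, 4.0, 8.0, 16.0, 32.0],
    "monotone": [1, 2, 3, 4, 5, 6],
    "u_shaped": [1, 0, 0, 0, 0, 1],
})

# Absolute Spearman (rank) correlation of every column with the target.
corr = toy.corr("spearman")["target"].abs()
# Columns whose absolute correlation falls below the 0.1 threshold.
low_corr = [c for c in corr.index if corr[c] < 0.1]
```

Here u_shaped lands in low_corr, which is the kind of feature the later MVTest step is meant to recover.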

Distance-based features

Extract the distance features:

## Extract distance features
def extract_distance_feature(data: pd.DataFrame) -> pd.DataFrame:
    total_balance = data.copy()

    # Days until the next day off
    total_balance['dis_to_nowork'] = 0
    for index, row in total_balance.iterrows():
        if row['is_work'] == 0:
            # On reaching a day off, fill dis_to_nowork for the working days before it
            step = 1
            flag = 1
            while flag:
                if index - step >= 0 and total_balance.loc[index - step, 'is_work'] == 1:
                    # row index - step exists and is a working day
                    total_balance.loc[index - step, 'dis_to_nowork'] = step
                    step += 1
                else:
                    flag = 0

    # Days since the last day off, i.e. consecutive days worked so far
    total_balance['dis_from_nowork'] = 0
    step = 0
    for index, row in total_balance.iterrows():
        step += 1
        if row['is_work'] == 1:
            total_balance.loc[index, 'dis_from_nowork'] = step
        else:
            step = 0

    # Days until the next working day
    total_balance['dis_to_work'] = 0
    for index, row in total_balance.iterrows():
        if row['is_work'] == 1:
            step = 1
            flag = 1
            while flag:
                if index - step >= 0 and total_balance.loc[index - step, 'is_work'] == 0:
                    total_balance.loc[index - step, 'dis_to_work'] = step
                    step += 1
                else:
                    flag = 0

    # Consecutive days off so far
    total_balance['dis_from_work'] = 0
    step = 0
    for index, row in total_balance.iterrows():
        step += 1
        if row['is_work'] == 0:
            total_balance.loc[index, 'dis_from_work'] = step
        else:
            step = 0

    # Days until the next public holiday
    total_balance['dis_to_holiday'] = 0
    for index, row in total_balance.iterrows():
        if row['is_holiday'] == 1:
            step = 1
            flag = 1
            while flag:
                if index - step >= 0 and total_balance.loc[index - step, 'is_holiday'] == 0:
                    total_balance.loc[index - step, 'dis_to_holiday'] = step
                    step += 1
                else:
                    flag = 0

    # Days since the last public holiday
    total_balance['dis_from_holiday'] = 0
    step = 0
    for index, row in total_balance.iterrows():
        step += 1
        if row['is_holiday'] == 0:
            total_balance.loc[index, 'dis_from_holiday'] = step
        else:
            step = 0

    # Days until the last day of the next public holiday
    total_balance['dis_to_holiendday'] = 0
    for index, row in total_balance.iterrows():
        if row['is_lastday_of_holiday'] == 1:
            step = 1
            flag = 1
            while flag:
                if index - step >= 0 and total_balance.loc[index - step, 'is_lastday_of_holiday'] == 0:
                    total_balance.loc[index - step, 'dis_to_holiendday'] = step
                    step += 1
                else:
                    flag = 0

    # Days since the last day of the previous public holiday
    total_balance['dis_from_holiendday'] = 0
    step = 0
    for index, row in total_balance.iterrows():
        step += 1
        if row['is_lastday_of_holiday'] == 0:
            total_balance.loc[index, 'dis_from_holiendday'] = step
        else:
            step = 0

    # Day of the month, i.e. distance from the start of the month
    total_balance['dis_from_startofmonth'] = np.abs(total_balance['day'])

    # Distance from the middle of the month
    total_balance['dis_from_middleofmonth'] = np.abs(total_balance['day'] - 15)

    # Distance from the middle of the week
    total_balance['dis_from_middleofweek'] = np.abs(total_balance['weekday'] - 3)

    # Distance from Sunday
    total_balance['dis_from_endofweek'] = np.abs(total_balance['weekday'] - 6)

    return total_balance

## Add the distance features to the data set
total_balance = extract_distance_feature(total_balance)
total_balance.head()
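
The "days since" counters above restart whenever a day off is reached. As an alternative sketch (again not the author's code), the same run-length count for dis_from_nowork can be obtained with a grouped cumulative sum; the toy calendar below is invented.

```python
import pandas as pd

# Hypothetical toy calendar: 1 = working day, 0 = day off, one row per consecutive day.
cal = pd.DataFrame({"is_work": [1, 1, 0, 0, 1, 1, 1, 0, 1]})

# Each day off starts a new block; cumsum() gives every row its block label.
block = (cal["is_work"] == 0).cumsum()
# Within a block, count the working days seen so far; the days off themselves get 0.
cal["dis_from_nowork"] = cal.groupby(block)["is_work"].cumsum()
```

The "days until" features such as dis_to_nowork can be built the same way after reversing the row order.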

Analysis of the distance features:

## Get the column names of the distance features
feature = total_balance[[x for x in total_balance.columns if x not in date_indexs]]
dis_feature_indexs = [x for x in feature.columns if (x not in date_indexs + labels + ['date']) and ('dis' in x)]
dis_feature_indexs

Plot total_redeem_amt against each distance feature as point plots:

## Plot total_redeem_amt against each distance feature as point plots
def draw_point_feature(data: pd.DataFrame) -> None:
    feature = data.copy()
    f, axes = plt.subplots(data.shape[1] // 3, 3, figsize=(30, data.shape[1] // 3 * 4))
    count = 0
    # Skip the date parts, the labels, and the date column itself:
    # date_indexs + labels + ['date'] = ['week', 'year', 'month', 'weekday', 'day', 'total_purchase_amt', 'total_redeem_amt', 'date']
    for i in [x for x in feature.columns if (x not in date_indexs + labels + ['date'])]:
        sns.pointplot(x=i, y="total_redeem_amt", markers="^", linestyles="-", data=feature,
                      ax=axes[count // 3][count % 3] if data.shape[1] > 3 else axes[count])
        count += 1

draw_point_feature(feature[['total_redeem_amt'] + dis_feature_indexs])

Handle time points that are too far away:

## Collapse every distance greater than 5 into the single value 10
def dis_change(x):
    if x > 5:
        x = 10
    return x

## Apply the rule to the holiday and month distances
dis_holiday_feature = [x for x in total_balance.columns if 'dis' in x and 'holi' in x]
# ['dis_to_holiday', 'dis_from_holiday', 'dis_to_holiendday', 'dis_from_holiendday']
dis_month_feature = [x for x in total_balance.columns if 'dis' in x and 'month' in x]
# ['dis_from_startofmonth', 'dis_from_middleofmonth']
total_balance[dis_holiday_feature] = total_balance[dis_holiday_feature].applymap(dis_change)
total_balance[dis_month_feature] = total_balance[dis_month_feature].applymap(dis_change)

Draw the point plots after this processing:

## Point plots after processing
feature = total_balance[[x for x in total_balance.columns if x not in date_indexs]]
draw_point_feature(feature[['total_redeem_amt'] + dis_feature_indexs])

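Based on the point plots, discard the features with large variance:

## Discard features that look of little use
redeem_feature_seems_useless += [
    # Even after processing, the variance is too large to trust and no clear pattern is visible
    'dis_to_holiday',
    # Variance too large to trust
    'dis_from_startofmonth',
    # Variance too large to trust
    'dis_from_middleofmonth'
]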

Correlation analysis:

Draw the correlation heatmap:

## Draw the correlation heatmap
draw_correlation_heatmap(feature[['total_redeem_amt'] + dis_feature_indexs])

Discard the features that are weakly correlated with total_redeem_amt:

# Discard weakly correlated features (absolute Pearson correlation below 0.1)
temp = np.abs(feature[[x for x in feature.columns if ('dis' in x) or (x in ['total_redeem_amt'])]].corr()['total_redeem_amt'])
feature_low_correlation += list(set(temp[temp < 0.1].index))
feature_low_correlation

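Peak and valley features

Extract the peak features:

Draw the time series to observe where the peaks and valleys fall:

## Inspect the peaks, one subplot per month from April to August
fig = plt.figure(figsize=(15, 15))
for i in range(4, 9):
    plt.subplot(5, 1, i - 3)
    total_balance_2 = total_balance[(total_balance['date'] >= '2014-' + str(i) + '-01') & (total_balance['date'] < '2014-' + str(i+1) + '-01')]
    sns.pointplot(x=total_balance_2['day'], y=total_balance_2['total_redeem_amt'])
    plt.legend().set_title('Month:' + str(i))

The plots show that most peaks fall on Mondays and most valleys on Saturdays.

## Set the peak and valley days
def extract_peak_feature(data: pd.DataFrame) -> pd.DataFrame:
    total_balance = data.copy()
    # Distance from the redemption peak, which the plots place on Monday.
    # Note: pandas encodes Monday as 0, so weekday - 1 measures the distance from weekday value 1 (Tuesday).
    total_balance['dis_from_redeem_peak'] = np.abs(total_balance['weekday'] - 1)
    # Distance from the redemption valley, which the plots place on Saturday.
    # Note: weekday - 6 measures the distance from Sunday, so this equals dis_from_endofweek.
    total_balance['dis_from_redeem_valley'] = np.abs(total_balance['weekday'] - 6)
    return total_balance

## Extract the peak features
total_balance = extract_peak_feature(total_balance)
feature = total_balance[[x for x in total_balance.columns if x not in date_indexs]]
feature.head()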

Draw the point plots of the peak and valley features:

draw_point_feature(feature[['total_redeem_amt'] + ['dis_from_redeem_peak','dis_from_redeem_valley']])

Correlation analysis:

temp = np.abs(feature[[x for x in feature.columns if ('peak' in x) or ('valley' in x) or (x in ['total_redeem_amt'])]].corr()['total_redeem_amt'])
print('Correlation between total_redeem_amt and the peak and valley features')
print(temp)
draw_correlation_heatmap(feature[['total_redeem_amt'] + ['dis_from_redeem_peak','dis_from_redeem_valley']])

The results show that total_redeem_amt has some correlation with both the peak and the valley feature.

Periodic factor features

Extract the periodic factors:

## Generate the periodic factors
def generate_rate(df, month_index):
    total_balance = df.copy()
    pure_balance = total_balance[['date', 'total_purchase_amt', 'total_redeem_amt']]
    # Use only the data before the month being predicted
    pure_balance = pure_balance[(pure_balance['date'] >= '2014-03-01') & (pure_balance['date'] < '2014-' + str(month_index) + '-01')]
    pure_balance['weekday'] = pure_balance['date'].dt.weekday
    pure_balance['day'] = pure_balance['date'].dt.day
    pure_balance['week'] = pure_balance['date'].dt.isocalendar().week.astype(int)
    pure_balance['month'] = pure_balance['date'].dt.month

    # Weekday factor
    weekday_rate = pure_balance[['weekday'] + labels].groupby('weekday', as_index=False).mean()  # average the totals by day of the week
    for name in labels:
        weekday_rate = weekday_rate.rename(columns={name: name + '_weekdaymean'})
    weekday_rate['total_purchase_amt_weekdaymean'] /= np.mean(pure_balance['total_purchase_amt'])  # weekday mean / overall mean
    weekday_rate['total_redeem_amt_weekdaymean'] /= np.mean(pure_balance['total_redeem_amt'])
    pure_balance = pd.merge(pure_balance, weekday_rate, on='weekday', how='left')

    # Day-of-month factor
    # Weight the weekday factor by frequency to obtain the day-of-month factor:
    # day factor = weekday factor * (number of times each weekday fell on that day of the month (1-31) / number of months)
    weekday_count = pure_balance[['day', 'weekday', 'date']].groupby(['day', 'weekday'], as_index=False).count()  # frequency of each (day, weekday) pair
    weekday_count = pd.merge(weekday_count, weekday_rate, on='weekday')
    weekday_count['total_purchase_amt_weekdaymean'] *= weekday_count['date'] / (len(set(pure_balance['month'])) - 1)
    weekday_count['total_redeem_amt_weekdaymean'] *= weekday_count['date'] / (len(set(pure_balance['month'])) - 1)
    day_rate = weekday_count.drop(['weekday', 'date'], axis=1).groupby('day', as_index=False).sum()
    weekday_rate.columns = ['weekday', 'purchase_weekdayrate', 'redeem_weekdayrate']
    day_rate.columns = ['day', 'purchase_dayrate', 'redeem_dayrate']

    # Map each day of the month to its date (and weekday) in the target month
    day_rate['date'] = datetime.datetime(2014, month_index, 1)
    for index, row in day_rate.iterrows():
        if month_index in (2, 4, 6, 9) and row['day'] == 31:
            continue
        day_rate.loc[index, 'date'] = datetime.datetime(2014, month_index, int(row['day']))
    day_rate['weekday'] = day_rate['date'].dt.weekday
    day_rate = pd.merge(day_rate, weekday_rate, on='weekday')
    day_rate['purchase_dayrate'] = day_rate['purchase_weekdayrate'] / day_rate['purchase_dayrate']
    day_rate['redeem_dayrate'] = day_rate['redeem_weekdayrate'] / day_rate['redeem_dayrate']
    weekday_rate['month'] = month_index
    day_rate['month'] = month_index
    return weekday_rate, day_rate[['day', 'purchase_dayrate', 'redeem_dayrate', 'month']].sort_values('day')

## Generate the periodic factors and merge them into the data set
weekday_rate_list = []
day_rate_list = []
for i in range(3, 10):
    weekday_rate, day_rate = generate_rate(total_balance, i)
    weekday_rate_list.append(weekday_rate.reset_index(drop=True))
    day_rate_list.append(day_rate.reset_index(drop=True))

weekday_rate_list = pd.concat(weekday_rate_list).reset_index(drop=True)
day_rate_list = pd.concat(day_rate_list).reset_index(drop=True)
total_balance = pd.merge(total_balance, weekday_rate_list, on=['weekday','month'], how='left')
total_balance = pd.merge(total_balance, day_rate_list, on=['day','month'], how='left')
total_balance.head()

## Post-process the periodic factors
for i in [x for x in total_balance.columns if 'rate' in x and x not in labels + date_indexs]:
    total_balance[i] = total_balance[i].fillna(np.nanmedian(total_balance[i]))  # fill missing values with the median
total_balance.head()
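
The weekday factor at the heart of generate_rate is simply the mean for each day of the week divided by the overall mean. A worked example on two invented weeks of data (values chosen so the arithmetic is easy to follow) looks like this:

```python
import pandas as pd

# Hypothetical two weeks of daily redemption totals (values invented for illustration).
toy = pd.DataFrame({
    "date": pd.date_range("2014-04-07", periods=14, freq="D"),  # starts on a Monday
    "total_redeem_amt": [140, 100, 100, 100, 100, 60, 100,
                         160, 100, 100, 100, 100, 40, 100],
})
toy["weekday"] = toy["date"].dt.weekday  # Monday = 0, ..., Sunday = 6

overall_mean = toy["total_redeem_amt"].mean()
# Weekday factor = mean of that weekday / overall mean.
weekday_factor = toy.groupby("weekday")["total_redeem_amt"].mean() / overall_mean
```

The overall mean is 100, the Monday mean is 150 and the Saturday mean is 50, so the factors are 1.5 for Monday, 0.5 for Saturday and 1.0 for the other days. A factor above 1 marks a weekday that runs above the average level.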

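Correlation analysis:

Draw the correlation heatmap:

## Draw the correlation heatmap
draw_correlation_heatmap(total_balance[['total_redeem_amt'] + [x for x in total_balance.columns if 'rate' in x and x not in labels + date_indexs]])

Discard the features that are weakly correlated with total_redeem_amt:

## Discard weakly correlated features
feature = total_balance.drop(date_indexs, axis=1)
# Leave out the datetime column, which cannot take part in a numeric correlation
temp = np.abs(feature[[x for x in feature.columns if x != 'date']].corr()['total_redeem_amt'])
print('Correlation between total_redeem_amt and the periodic factors')
temp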

Dynamic time series features

Extract the dynamic time series features:

With the week as the period, compute the mean, median, maximum, minimum, skewness, etc. of the redemption total:

## Extract dynamic features
# With the week as the period, compute the mean, median, maximum, minimum, skewness, etc. of the redemption total
def get_amtfeature_with_time(data: pd.DataFrame) -> pd.DataFrame:
    df_tmp_ = data[labels + date_indexs + ['date']].copy()
    total_balance = data.copy()

    # Add date-part columns
    df_tmp_ = df_tmp_[(df_tmp_['date'] >= '2014-03-03')]
    df_tmp_['weekday'] = df_tmp_['date'].dt.weekday + 1  # Monday = 1, ..., Sunday = 7
    iso_week = df_tmp_['date'].dt.isocalendar().week.astype(int)
    df_tmp_['week'] = iso_week - iso_week.min() + 1  # number of weeks since the start date
    df_tmp_['day'] = df_tmp_['date'].dt.day
    df_tmp_['month'] = df_tmp_['date'].dt.month
    df_tmp_ = df_tmp_.reset_index(drop=True)

    # Start from an empty frame with one column per day of the week
    df_redeem = pd.DataFrame(columns=['weekday1','weekday2','weekday3','weekday4','weekday5','weekday6','weekday7'])
    count = 0
    for i in range(len(df_tmp_)):  # Monday to Sunday become columns, week numbers become rows
        df_redeem.loc[count, 'weekday' + str(df_tmp_.loc[i, 'weekday'])] = df_tmp_.loc[i, 'total_redeem_amt']
        if df_tmp_.loc[i, 'weekday'] == 7:
            count = count + 1

    df_tmp_['redeem_weekday_median'] = np.nan
    df_tmp_['redeem_weekday_mean'] = np.nan
    df_tmp_['redeem_weekday_min'] = np.nan
    df_tmp_['redeem_weekday_max'] = np.nan
    df_tmp_['redeem_weekday_std'] = np.nan
    df_tmp_['redeem_weekday_skew'] = np.nan

    for i in range(len(df_tmp_)):
        # Start computing statistics from 2014-03-31:
        # assuming df_tmp_ begins on 2014-03-03, rows with i > 4*7-1 fall on or after 2014-03-31
        if i > 4*7-1:
            # Same weekday in all earlier weeks (rows 0 .. current week - 2 of df_redeem)
            history = df_redeem.loc[:df_tmp_.loc[i, 'week'] - 2, 'weekday' + str(df_tmp_.loc[i, 'weekday'])].astype(float)
            df_tmp_.loc[i, 'redeem_weekday_median'] = history.median()
            df_tmp_.loc[i, 'redeem_weekday_mean'] = history.mean()
            df_tmp_.loc[i, 'redeem_weekday_min'] = history.min()
            df_tmp_.loc[i, 'redeem_weekday_max'] = history.max()
            df_tmp_.loc[i, 'redeem_weekday_std'] = history.std()
            df_tmp_.loc[i, 'redeem_weekday_skew'] = history.skew()  # skewness

    colList = ['redeem_weekday_median','redeem_weekday_mean','redeem_weekday_min','redeem_weekday_max','redeem_weekday_std','redeem_weekday_skew']
    total_balance = pd.merge(total_balance, df_tmp_[colList + ['day','month']], on=['day','month'], how='left')
    return total_balance

## Merge the features into the data set
total_balance = get_amtfeature_with_time(total_balance)
total_balance.head()
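
The same "statistics of this weekday over earlier weeks" idea can be sketched more compactly with a grouped expanding window. This is an alternative formulation on invented data, not the author's code; the column week_no and the synthetic values are assumptions made so the results are easy to check.

```python
import pandas as pd

# Hypothetical four weeks of daily totals: value = 10 * week number + weekday.
dates = pd.date_range("2014-03-03", periods=28, freq="D")  # starts on a Monday
toy = pd.DataFrame({"date": dates})
toy["weekday"] = toy["date"].dt.weekday
toy["week_no"] = (toy.index // 7) + 1
toy["total_redeem_amt"] = 10 * toy["week_no"] + toy["weekday"]

# For each weekday, take the expanding mean/max over earlier weeks only:
# shift(1) drops the current week's value so the statistic cannot see the target.
grouped = toy.groupby("weekday")["total_redeem_amt"]
toy["redeem_weekday_mean"] = grouped.transform(lambda s: s.shift(1).expanding().mean())
toy["redeem_weekday_max"] = grouped.transform(lambda s: s.shift(1).expanding().max())
```

For the Monday of week 4 the earlier Mondays are 10, 20 and 30, so the mean is 20 and the maximum is 30; the first week has no history and stays NaN. Excluding the current value matters because the target itself must not leak into its own feature.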

Correlation analysis:

Draw the correlation heatmap:

## Draw the correlation heatmap of the dynamic features
draw_correlation_heatmap(total_balance[['total_redeem_amt'] + ['redeem_weekday_median','redeem_weekday_mean','redeem_weekday_min','redeem_weekday_max','redeem_weekday_std','redeem_weekday_skew']])

Save the extracted features as a csv file:

# The label columns come from `labels`, so they are not repeated in the feature list
feature[labels + ['dis_to_nowork', 'dis_to_work', 'dis_from_work', 'purchase_weekdayrate',
                  'redeem_dayrate', 'weekday_onehot_5', 'weekday_onehot_6', 'dis_from_nowork',
                  'is_holiday', 'weekday_onehot_1', 'weekday_onehot_2', 'weekday_onehot_0',
                  'dis_from_middleofweek', 'dis_from_holiendday', 'weekday_onehot_3',
                  'is_lastday_of_holiday', 'is_firstday_of_holiday', 'weekday_onehot_4',
                  'is_worked_yestday', 'is_second_week', 'is_third_week',
                  'dis_from_startofmonth', 'dis_from_holiday', 'date']].to_csv('Feature/0615_residual_redeem_origined.csv', index=False)

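Feature elimination (eliminating the weak)

Discard features that cannot split the data set effectively

Plot the estimated distribution of the data set split by each feature:

## Plot the kernel density estimate of total_redeem_amt for each value of each binary feature
n_rows = len(feature.columns) // 4 + 1  # subplot() needs an integer number of rows
plt.figure(figsize=(4 * 6, 6 * len(feature.columns) / 6))
count = 0
for i in [x for x in feature.columns if (x not in labels + date_indexs + ['date']) and ('amt' not in x) and ('dis' not in x) and ('rate' not in x)]:
    count += 1
    if feature[feature[i] == 0].empty:
        continue
    plt.subplot(n_rows, 4, count)
    ax = sns.kdeplot(feature[feature[i] == 0]['total_redeem_amt'], label=str(i) + ' == 0, redeem')
    ax = sns.kdeplot(feature[feature[i] == 1]['total_redeem_amt'], label=str(i) + ' == 1, redeem')

Discard the features that do not split the data set clearly (features whose two density curves almost coincide):

## Discard features that do not split the data set clearly
redeem_feature_seems_useless += ['is_third_week','is_fourth_week']
redeem_feature_seems_useless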

Use MVTest to recover features that are dependent but uncorrelated

## MVtest Ref: https://github.com/ChuanyuXue/MVTest
l = mvtest()
name_list = []
Tn_list = []
p_list = []
for i in [i for i in feature_low_correlation if 'is' in i or 'discret' in i]:
    pair = l.test(feature['total_redeem_amt'], feature[i])
    name_list.append(str(i))
    Tn_list.append(pair['Tn'])
    p_list.append(pair['p-value'][0])
temp = pd.DataFrame([name_list, Tn_list]).T.sort_values(1)
temp[1] = np.abs(temp[1])
# Keep the features whose test statistic Tn exceeds 0.5984
feature_saved_from_mv_redeem = list(temp.sort_values(1, ascending=False)[temp[1] > 0.5984][0])
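
Why a correlation filter can discard a useful feature is easiest to see on constructed data. The sketch below does not call the mvtest package; it only builds an invented example in which a binary flag changes the spread of y but not its mean, so the correlation is essentially zero even though y clearly depends on the flag.

```python
import numpy as np

# Hypothetical data: a binary flag that changes the spread of y but not its mean.
flag = np.repeat([0, 1], 500)
y = np.concatenate([
    np.linspace(-1.0, 1.0, 500),   # flag == 0: narrow range around 0
    np.linspace(-3.0, 3.0, 500),   # flag == 1: wide range around 0
])

pearson = np.corrcoef(flag, y)[0, 1]                # approximately 0: no linear relationship
std_by_flag = [y[flag == v].std() for v in (0, 1)]  # clearly different spreads
```

A dependence test that compares the conditional distributions of y, rather than only its conditional means, is able to pick up this kind of relationship.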

Discard multicollinear features

Walk through the correlation matrix. If the correlation between feature i and feature k exceeds the threshold, feature k is a candidate for removal; if the correlation between feature i and the product feature i * feature k is below the threshold, keep i*k as a new feature (in effect an interaction term).

feature = feature[[x for x in feature.columns if (x not in feature_low_correlation + redeem_feature_seems_useless) or (x in feature_saved_from_mv_redeem)]]
# Leave out the datetime column, which cannot take part in a numeric correlation
redeem_cors = feature[[x for x in feature.columns if x != 'date']].corr()
redeem_cors['total_redeem_amt'] = np.abs(redeem_cors['total_redeem_amt'])
feature_lists = list(redeem_cors.sort_values(by='total_redeem_amt', ascending=False).index)[2:]  # sorted in descending order
print(feature_lists)
feature_temp = feature.dropna()

## Note: features are kept in descending order of correlation and removed in ascending order
threshold = 0.8
for i in range(len(feature_lists)):  # walk through the rows of the correlation matrix
    for k in range(len(feature_lists)-1, -1, -1):  # walk through the columns in reverse
        if i >= len(feature_lists) or k >= len(feature_lists) or i == k:
            break
        if np.abs(np.corrcoef(feature_temp[feature_lists[i]], feature_temp[feature_lists[k]])[0][1]) > threshold:
            # feature i and feature k are correlated above the threshold
            higher_feature_temp = feature_temp[feature_lists[i]] * feature_temp[feature_lists[k]]
            if np.abs(np.corrcoef(feature_temp[feature_lists[i]], higher_feature_temp)[0][1]) <= threshold:
                # feature i and i*k are correlated below the threshold, so keep i*k as a new feature (interaction term)
                name = str(feature_lists[i]) + '%%%%' + str(feature_lists[k])
                feature_temp[name] = higher_feature_temp
                feature[name] = feature[feature_lists[i]] * feature[feature_lists[k]]
                feature_lists.append(name)
            feature_temp = feature_temp.drop(feature_lists[k], axis=1)  # remove feature k
            feature_lists.remove(feature_lists[k])

feature = feature[[x for x in feature_lists if x not in labels] + labels + ['date']]
feature.to_csv('Feature/redeem_feature_droped_0614.csv', index=False)
feature.head()
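
Stripped of the interaction-term step, the pruning rule reduces to a greedy pass: go through the features in priority order and keep one only if it is not too correlated with anything already kept. The sketch below shows that simplified rule on invented data; it is a reduced illustration, not a replacement for the loop above.

```python
import numpy as np
import pandas as pd

# Hypothetical features, assumed to be listed from most to least correlated with the target.
# "b" is nearly a copy of "a", while "c" is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
X = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=300),
    "c": rng.normal(size=300),
})

threshold = 0.8
kept = []
for col in X.columns:
    # Keep a feature only if it is not too correlated with any feature already kept.
    if all(abs(np.corrcoef(X[col], X[k])[0, 1]) <= threshold for k in kept):
        kept.append(col)
```

Because the higher-priority feature is visited first, the near-duplicate b is the one that is dropped, leaving a and c.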

Feature selection (selecting the strong)

Split the data set, using one part as training data and the other as test data:

## Split the data set
def split_data_underline(data):
    # Train on April through July, test on August
    trainset = data[('2014-04-01' <= data['date']) & (data['date'] < '2014-08-01')]
    testset = data[('2014-08-01' <= data['date']) & (data['date'] < '2014-09-01')]
    return trainset, testset

Use SHAP to select the winning features

shap.initjs()  # load the JavaScript needed to render SHAP plots in the notebook
from sklearn import tree
model = tree.DecisionTreeRegressor()
train, test = split_data_underline(feature.dropna())
features = [x for x in train.columns if x not in date_indexs]
model.fit(train[features].drop(labels+['date'], axis=1), train['total_redeem_amt'])

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test[features].drop(labels+['date'], axis=1))

shap.summary_plot(shap_values, test[features].drop(labels+['date'], axis=1), plot_type='bar')
shap.summary_plot(shap_values, test[features].drop(labels+['date'], axis=1))

# Importance of a feature = mean absolute SHAP value over the test set
tree_important_redeem = pd.DataFrame(np.mean(np.abs(shap_values), axis=0), [x for x in features if x not in labels + date_indexs + ['date']]).reset_index()

Select the 20 most important features:

tree_important_redeem = tree_important_redeem.sort_values(0, ascending=False).reset_index(drop=True)
tree_important_redeem = list(tree_important_redeem[:20]['index'])  # keep the 20 most important features

Display the selected features as a word cloud:

## Display the selected features
def draw_cloud(feature_index: List[str]) -> None:
    plt.figure(figsize=(20, 10))
    plt.subplot(1, 2, 1)
    ciyun = WordCloud(background_color='white', max_font_size=40)
    ciyun.generate(text=''.join([x + ' ' for x in feature_index if x != 'total_redeem_amt']))
    plt.imshow(ciyun, interpolation='bilinear')
    plt.axis("off")

draw_cloud(tree_important_redeem)


Use Permutation Importance to select the winning features

model = LinearRegression()
train, test = split_data_underline(feature.dropna())
model.fit(train[features].drop(labels+['date'], axis=1), train['total_redeem_amt'])
perm = PermutationImportance(model, random_state=42).fit(test[features].drop(labels+['date'], axis=1), test['total_redeem_amt'])
liner_important_redeem = pd.DataFrame(np.abs(perm.feature_importances_), [x for x in features if x not in labels + date_indexs + ['date']]).reset_index()
eli5.show_weights(perm, feature_names=list(str(x) for x in features if x not in labels + ['date']))

Select the 20 most important features:

liner_important_redeem = liner_important_redeem.sort_values(0, ascending=False).reset_index(drop=True)
liner_important_redeem = list(liner_important_redeem[:20]['index'])
liner_important_redeem

Display the selected features as a word cloud:

draw_cloud(liner_important_redeem)

Take the intersection of the two feature sets to obtain the final winning features

winer_features_redeem = list(set(tree_important_redeem) & set(liner_important_redeem))
winer_features_redeem

Draw the word cloud:

draw_cloud(winer_features_redeem)





http://www.chinasem.cn/article/638352
