[Paper Notes] Towards Corrective Deep Imitation Learning in Data Intensive Environments


Full title: Towards Corrective Deep Imitation Learning in Data Intensive Environments: Helping robots to learn faster by leveraging human knowledge

Abstract

Interactive imitation learning refers to learning methods where a human teacher interacts with an agent during the learning process, providing feedback to improve its behavior.

The problem addressed: However, this (i.e., experience replay in deep reinforcement learning) causes conflicts between the data in the buffer, because samples collected by older versions of the policy may be contradictory and could deteriorate the performance of the current policy.

  1. Contribution 1: The present thesis focuses on interactive learning with corrective feedback and, in particular, on the framework Deep Corrective Advice Communicated by Humans (D-COACH), which has been shown to be advantageous in terms of training time and data efficiency.

  2. Contribution 2: The current implementation of D-COACH uses a first-in-first-out buffer with limited size, since the older a sample is, the more likely it is to deteriorate the performance of the learner (the replay buffer is a FIFO queue, which limits the influence of experience collected by older policies on the new policy).

    This approach faces a trade-off: on the one hand the buffer must be kept small to reduce conflicting data, while on the other hand it must avoid forgetting.

  3. Contribution 3: We propose an improved version of D-COACH, which we call Batch Deep COACH (BD-COACH, pronounced “be the coach”). BD-COACH incorporates a human model module that learns the feedback from the teacher and that can be employed to make corrections gathered by older versions of the policy still useful for batch updating the current version of the policy.

  4. Contribution 4: The approach is evaluated both in simulation and in experiments on a real robot.

1 Introduction

However, these examples (of deep reinforcement learning successes) tend to happen in simulated environments with very specific learning tasks.

Furthermore, for many real problems it is easier to demonstrate the task than to design a reward function for applying reinforcement learning.

Behavioral cloning has two main drawbacks. First, it requires demonstrations from an expert teacher, which limits who can train the agent. Second, it suffers from covariate shift, a distribution mismatch problem that starts the moment the agent deviates from the expert trajectory, causing a cascade of errors that will probably make the agent fail the task (one mistake leads to the next).
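For intuition, here is a minimal behavioral-cloning sketch (my own illustration, not code from the thesis; the network sizes and names are assumptions). The policy is fit by plain supervised regression on expert state-action pairs, which is exactly the setting in which covariate shift appears once the agent drifts away from the states the expert visited.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 9, 4  # illustrative dimensions, not taken from the thesis

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def behavioral_cloning_step(expert_obs, expert_act):
    """One supervised update on a batch of expert (observation, action) pairs."""
    loss = nn.functional.mse_loss(policy(expert_obs), expert_act)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```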


Interactive imitation learning (IIL) is a branch of imitation learning that deals with the aforementioned issues by allowing a teacher to help the agent learn during its training.

In this work, we focus on corrections, which give name to the branch of IIL called corrective imitation learning (CIL). In CIL frameworks, the human teacher sends corrections informing the agent whether the value of a taken action should be increased or decreased.


The goal of this master thesis is to create an extension of Deep Corrective Advice Communicated by Humans (D-COACH), a CIL algorithm designed for non-expert humans that uses an artificial neural network as a function approximator for its policy (i.e., no expert teacher is required).

ER (Experience Replay) endows algorithms with two main advantages: higher data efficiency and the ability to train with uncorrelated data.

Collected past experiences can be reused multiple times, and the ANN becomes more robust against locally overfitting to the most recent trajectories, a phenomenon known as catastrophic forgetting. Note that we refer to ER as corrections replay since, in this work, we replay old corrective feedback.


This forces the size of the buffer to be limited, which works under the assumption that the data stored in the replay buffer is still valid for the current version of the policy, even if it was collected by an older version of the policy.

As the size of the replay buffer starts to increase, this assumption no longer holds and the training of the policy will most likely fail, therefore limiting the types of problems that D-COACH can address.

BD-COACH incorporates a human model module that learns the feedback from the teacher, and that can be employed to make corrections gathered by older versions of the policy still useful for batch updating the current version of the policy.


My understanding: D-COACH is a corrective imitation learning method that does not require expert demonstrations, which makes it quite practical within imitation learning, so this work builds on it. However, D-COACH relies on a small, fixed-size replay buffer, while the amount of accumulated experience keeps growing as sequences of actions are produced (the "data intensive" of the title). A growing buffer makes D-COACH degrade or fail, because it rests on the assumption that the buffer stays small and that the history generated by old policies remains consistent with the new policy; hence the authors propose BD-COACH.

2 Background and Related Work

This chapter mainly reviews reinforcement learning.

2.1. Reinforcement Learning

2.1.1. On-policy and Off-policy Reinforcement Learning

According to Sutton and Barto, off-policy methods use two policies: They evaluate or improve a policy, the target policy, while using a different policy to generate the data, the behavior policy.

On the other hand, on-policy methods use a single policy: the policy that is evaluated or improved is the same one used to generate behavior.

2.1.2. On-line and Off-line Reinforcement Learning

Traditional RL algorithms are on-line frameworks where the agent iteratively interacts with its environment, collecting experience to update its policy. The on-line approach works well in simulated environments; however, for real-world settings, on-line learning is impractical because the agent still needs to collect a large and diverse dataset.

Offline reinforcement learning addresses the aforementioned problem. The key idea is that, using only previously collected data, the agent has to learn the best possible policy without additional online data collection. With this offline framework, it is possible to apply RL to real-world domains like robotics, where the agent (the robot) could easily get damaged while collecting data iteratively in an online manner.

2.1.3. Experience Replay

Experience replay provides several benefits.

  1. First, it is an efficient way of taking advantage of previously collected experience by replaying it multiple times.
  2. Furthermore, experience replay provides uncorrelated data for training the neural network, which helps it generalize and minimizes overfitting to the most recent trajectories.

It is important to remark that even if the experiences were collected with a single policy $\pi$, because the policy evolves over time, the policy at time step $t$, $\pi_t$, is not equal to that same policy at a later time step, $\pi_N$. They are considered experiences gathered by different policies and therefore only off-policy methods are applicable.
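As a concrete picture of how such a buffer works, here is a minimal sketch (illustrative only; the class name and the content of the stored tuples are assumptions). Experiences are stored once and sampled uniformly, which both reuses data and decorrelates the mini-batches.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer: once full, the oldest experience is discarded first."""

    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)  # e.g. a (state, action, ...) tuple

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive steps.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```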

2.2. Function Approximation with Artificial Neural Networks

In particular, for this thesis, we will focus on fully connected feedforward neural networks (FNN) where information flows in only one direction without going through any loop.

2.3. Imitation Learning

Imitation learning is more useful than reinforcement learning when it is easier for a teacher to demonstrate the task or provide feedback than to specify a reward function that would lead to the desired behavior.

2.3.1. Interactive Imitation Learning

Interactive imitation learning (IIL) is a branch of imitation learning where human teachers can help intelligent agents to learn during their training.

  1. Demonstrations: the human demonstrates the task when the robot requests it.
  2. Evaluative feedback in the form of scalar values: the human is presented with several executions of a policy and has to decide which one is better according to the goal of the task. Then a reward function that explains the human's decisions is found, and by applying RL the agent learns how to perform the task.
  3. Corrective feedback: corrective imitation learning improves on the informativeness of evaluative feedback by allowing the teacher to inform the agent whether the value of a taken action should be increased or decreased, and it requires less exploration compared to evaluative feedback.

2.3.2. On-Policy and Off-Policy Imitation Learning

  • In off-policy imitation learning, an agent observes demonstrations from a supervisor and tries to recover the behavior via supervised learning; an example of off-policy IL is behavioral cloning.
  • On-policy imitation learning methods sample trajectories from the agent’s current distribution and update the model based on the data received. A common on-policy algorithm is DAgger.

An introduction to DAgger:

But even when referring to the simple version of DAgger, there is another reason to consider it off-policy: by its very nature, DAgger learns from a buffer, in other words, from information gathered by older versions of the policy that are different from the current version.
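For reference, a schematic of the DAgger loop (my own sketch; `rollout_policy`, `query_expert`, and `fit_policy` are hypothetical placeholders for environment rollouts, expert labelling, and supervised training). The aggregated dataset contains states visited by every past version of the policy, which is the sense in which DAgger learns from a buffer.

```python
def dagger(initial_policy, n_iterations, rollout_policy, query_expert, fit_policy):
    """Schematic DAgger loop; the three callables are hypothetical placeholders."""
    policy = initial_policy
    dataset = []                                  # aggregated (state, expert_action) pairs
    for _ in range(n_iterations):
        states = rollout_policy(policy)           # states visited under the CURRENT policy
        dataset += [(s, query_expert(s)) for s in states]  # the expert labels those states
        policy = fit_policy(dataset)              # supervised fit on data from ALL past policies
    return policy
```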


2.3.3. Online and Offline Imitation Learning

In offline imitation learning, the agent learns by imitating a demonstrator without additional online environment interactions, unlike in the case of online IL.

2.4. Corrective Imitation Learning

The new framework that we propose in Chapter 3 is based on the D-COACH algorithm which, in turn, derives from the COACH algorithm; both methods are presented next.

2.4.1. COACH: Corrective Advice Communicated by Humans

The method Corrective Advice Communicated by Humans, COACH [13], is a CIL framework designed for non-expert human teachers, where the person supervising the learning agent provides occasional corrections when the agent behaves wrongly.

Key properties: (1) no expert teachers are required; (2) no continuous demonstrations are needed, corrections are only given when the robot makes a mistake.

This corrective feedback $h$ is a binary signal that indicates the direction in which the executed action $a = \pi_\theta(s)$ should be modified.

  1. The parameters $\theta$ are updated using a stochastic gradient descent (SGD) strategy in a supervised learning manner.
  2. $J(\theta)$ is the mean squared error between the applied and the desired action.
  3. $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$

COACH works under the assumption that teachers are non-experts and that, therefore, they are only able to provide a correction trend that tells the sign of the policy error but not the error itself. (In other words, COACH tells the agent that it is wrong at this point, but not by how much.)


To compute the exact magnitude of the error, COACH incorporates a hyperparameter $e$ that needs to be defined beforehand, resulting in $error_t = h_t \cdot e$.

The error needs to be defined as a function of the parameters in order to compute the gradient in the parameter space of the policy.

Thus, the error can also be described as the difference between the desired action generated with the teacher's feedback, $a_t^{target} = a_t + error_t$, and the current output of the policy, $a_t = \pi_\theta(o_t)$:

$$error_t = a_t^{target} - a_t = a_t^{target} - \pi_\theta(o_t)$$
$$\theta \leftarrow \theta + \alpha \cdot error_t \nabla_\theta \pi_\theta$$
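A minimal sketch of this update rule for a neural-network policy (an illustration under my own assumptions, not the authors' implementation). The binary feedback $h$ scaled by the hyperparameter $e$ builds the target action, and a squared-error gradient step towards that fixed target is equivalent to $\theta \leftarrow \theta + \alpha \cdot error_t \nabla_\theta \pi_\theta$.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 9, 4       # illustrative dimensions
e = 0.1                       # magnitude hyperparameter of COACH (the value is a guess)
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

def coach_update(obs, h):
    """obs: tensor of shape (obs_dim,); h: tensor of shape (act_dim,) with entries in {-1, 0, +1}."""
    action = policy(obs)                          # a_t = pi_theta(o_t)
    target = (action + h * e).detach()            # a_t^target = a_t + error_t, with error_t = h_t * e
    loss = 0.5 * ((target - action) ** 2).sum()   # J(theta): squared error to the desired action
    optimizer.zero_grad()
    loss.backward()                               # d(loss)/d(theta) = -error_t * grad(pi_theta)
    optimizer.step()                              # theta <- theta + alpha * error_t * grad(pi_theta)
    return target                                 # the correction (s_t, a_t^target) can be stored for replay
```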

2.4.2. D-COACH: Deep COACH

Deep COACH, D-COACH, is the “deep” version of the COACH algorithm in the sense that it uses an artificial neural network to represent the policy of the agent.

The current version of D-COACH implements the corrections replay technique to be more data efficient. During learning, tuples of old corrections, $(s_t, a_t^{target})$, are stored in a memory buffer $B$ and then replayed to update the current policy of the agent.

However, the way D-COACH implements the replaying of corrections has limitations. The replay buffer $B$ works under the assumption that the stored feedback is still valid for updating the most recent version of the policy. Due to this assumption, the size of the buffer that D-COACH uses needs to be drastically reduced; otherwise, old corrections could update the policy in undesired directions of the policy's parameter space.

On the other hand, a very small replay buffer will provoke overfitting of the policy to data generated in the most recent trajectories, which limits the current version of D-COACH to problems with low data demands.
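The corrections replay just described could look roughly like this (a sketch under assumed names and sizes, not the authors' code). Corrections $(s_t, a_t^{target})$ enter a small FIFO buffer and are replayed directly as supervised targets for the current policy, which is why old targets can pull the policy in undesired directions once the buffer grows.

```python
import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, act_dim, buffer_size = 9, 4, 1000   # illustrative values; D-COACH keeps the buffer small
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
buffer = deque(maxlen=buffer_size)           # FIFO: the oldest corrections are dropped first

def store_correction(state, a_target):
    buffer.append((state, a_target))         # (s_t, a_t^target) stored as torch tensors

def replay_corrections(batch_size=32):
    """Replay stored corrections directly as supervised targets for the current policy."""
    if len(buffer) < batch_size:
        return
    states, targets = zip(*random.sample(list(buffer), batch_size))
    states, targets = torch.stack(states), torch.stack(targets)
    loss = nn.functional.mse_loss(policy(states), targets)   # old targets may conflict with the new policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```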


3 Batch Deep COACH (BD-COACH)

The current version of the CIL algorithm D-COACH is limited to problems that do not require large amounts of data, as its replay buffer needs to be kept small.

3.1. Difference between D-COACH and BD-COACH

Therefore, a task combining a complex, high-dimensional observation space with a long horizon and a complex observation-to-action mapping would be very challenging for D-COACH to learn.

In D-COACH, batch updates are independent of the policy, since corrections from the buffer do not depend on what the policy is currently doing at those particular states. As a consequence, feedback gathered by older versions of the policy can deteriorate the performance of the current policy.

The human observes the state and the action of the agent at a particular moment and gives a correction accordingly. We introduce a human teacher learner module in our framework as an artificial neural network that takes state-action pairs as input and outputs the appropriate feedback correction. This module, called the human model, is learned in parallel with the policy.

BD-COACH is able to handle more data-demanding tasks thanks to the human model module, which learns to predict the feedback that the teacher provides. These predicted corrections depend on the output of the policy at a particular state, making them suitable for updating the current version of the policy.

3.2. Learning Framework


Policy

Takes the state of the environment as input and outputs the corresponding action.

Replay Buffer

When the human teacher provides a correction, it is stored in the replay buffer as a tuple $(s_t, a_t, h_t)$.

Policy Update Module

The corrections that are fed to the policy update module depend on the actions taken by the policy for those states.

These corrections do not come directly from the replay buffer, as in the case of D-COACH, but are instead the output of the human model module.

Human Model

BD-COACH incorporates a human model, $H(s, a)$, that learns to predict the corrective feedback given by the human teacher for inputs of state-action pairs.

The framework Gaussian Process Coach (GPC) also employs a human model that, like ours, takes actions into account in addition to states. The difference is that GPC uses Gaussian processes as the function approximator for both its policy and its human model, in order to estimate the uncertainty of states and actions. In the case of BD-COACH, the human model is an artificial neural network that generates labels that are useful for the current version of the policy.

During the batch update of the policy, a mini-batch of states uniformly sampled from the replay buffer is passed as input to both the policy and the human model.

To clarify, these mini-batches of states are different from normal mini-batches: for this step, we do not use the actions or corrections stored in the buffer.

Human Model Update

The human model update module is in charge of updating the weights of the ANN that represents the human model, $H(s, a)$. The human model is updated with tuples $(s_t, a_t, h_t)$ stored in the replay buffer, thereby applying the corrections replay technique.
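Putting the modules of Section 3.2 together, here is a hedged sketch of how the two batch updates could look (my interpretation of the text; the architectures, sizes, and learning rates are assumptions). The human model $H(s, a)$ is fit on stored tuples $(s_t, a_t, h_t)$, and the policy is then updated with corrections that $H$ predicts for the actions the current policy takes at states sampled from the buffer.

```python
import random
import torch
import torch.nn as nn

obs_dim, act_dim, e = 9, 4, 0.1    # illustrative values, not taken from the thesis
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
human_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
human_opt = torch.optim.Adam(human_model.parameters(), lr=1e-3)
buffer = []                        # tuples (s_t, a_t, h_t) stored as torch tensors

def update_human_model(batch_size=32):
    """Corrections replay: fit H(s, a) to the teacher's stored feedback."""
    if len(buffer) < batch_size:
        return
    s, a, h = map(torch.stack, zip(*random.sample(buffer, batch_size)))
    loss = nn.functional.mse_loss(human_model(torch.cat([s, a], dim=1)), h)
    human_opt.zero_grad()
    loss.backward()
    human_opt.step()

def update_policy(batch_size=32):
    """Batch update of the policy with corrections predicted for its CURRENT actions."""
    if len(buffer) < batch_size:
        return
    s, _, _ = map(torch.stack, zip(*random.sample(buffer, batch_size)))  # only the states are used here
    a_now = policy(s)                                        # what the current policy would do
    with torch.no_grad():
        h_pred = human_model(torch.cat([s, a_now], dim=1))   # predicted teacher feedback H(s, a_now)
    target = (a_now + e * h_pred).detach()                   # corrected target actions
    loss = nn.functional.mse_loss(a_now, target)
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```

Because the targets are recomputed from the actions the current policy actually takes, corrections collected under older policies remain usable, which is the point of introducing the human model.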

3.3. Discussion

4 Experimental Setting

4.1. Meta-World Benchmark

All the tasks in Meta-World require the agent to execute an action in the environment of the form $[\delta x, \delta y, \delta z, g]$. The first three dimensions of the action correspond to the change in position of the end effector along the three Cartesian axes. The last dimension represents the gripper effort that keeps the fingers of the end effector open or closed. In our case, for this dimension, the expert policy always commands a constant value, keeping the gripper open or closed depending on the task. The observation space is a 9-dimensional space formed by the 3D Cartesian positions of the end effector, the object, and the goal.

This metric, $\|object - goal\|_2 < \epsilon$, is based on the Euclidean distance between the object position and the goal position, where $\epsilon$ is a small distance threshold that varies from task to task.
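Read concretely, the action/observation layout and the success check could be expressed as follows (a small sketch with assumed variable names, not benchmark code):

```python
import numpy as np

# Action: [dx, dy, dz, gripper_effort]; observation: end-effector, object and goal 3D positions.
def split_observation(obs):
    ee, obj, goal = obs[0:3], obs[3:6], obs[6:9]   # the 9-dimensional observation
    return ee, obj, goal

def is_success(obs, eps):
    """Success metric: ||object - goal||_2 < eps, with eps depending on the task."""
    _, obj, goal = split_observation(obs)
    return np.linalg.norm(obj - goal) < eps
```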

4.1.1. Simulated Experiments

The goal of the simulated experiments is to compare the performance between BD-COACH and D-COACH as a function of the amount of data required to solve a task.

The reason behind this design choice is that states formed by relative positions, $s = [[xyz_{\text{end effector}}], [xyz_{\text{object}}], [xyz_{\text{goal}}]]$, make it easier for the robot to generalize as the number of dimensions decreases.

4.1.2. Task plate-slide-v2

If the puck goes inside the goal in fewer than 500 time steps, the episode is considered successful, and a failure otherwise. The task starts with the gripper and the object always initialized at the same position, whereas the goal is initialized randomly within an area of $0.01\,m^2$.


4.1.3. Task drawer-open-v2


4.1.4. Task button-press-top-down-v2


4.1.5. Synthesized Feedback

Using an oracle removes human factors, such as the teacher providing inconsistent feedback or getting tired, which would make comparisons between algorithms unfair. Furthermore, in order to compare the performance of different frameworks, it is necessary to run many simulations to obtain a good average of the results, which would be completely impractical if the teacher were a real human.

The main point is that, with a real human teacher, providing corrections over a long period causes fatigue, so later corrections become weaker. The feedback signal is therefore synthesized and provided by an "oracle" instead.

The oracle used in this work generates feedback by computing $h = \text{sign}(a_{teacher} - a_{agent})$, whereas the decision on whether to provide feedback at each time step is given by the probability $P_h = \alpha \cdot \exp(-\tau \cdot timestep)$, where $\alpha \in \mathbb{R}$, $0 \leq \alpha \leq 1$, and $\tau \in \mathbb{R}$, $0 \leq \tau$. (Whether a correction is issued at each time step follows this exponentially decaying probability, and the correction itself is simply the sign of the difference between the teacher's and the agent's actions.)

Furthermore, this binary feedback $h$ is only provided if the difference between the action of the policy and the action of the teacher is larger than a threshold $\epsilon$.

In other words, a correction is only issued when the difference between the two actions exceeds the threshold; otherwise no feedback is given.
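A sketch of this synthesized-feedback oracle (illustrative; the parameter values are assumptions, and applying the threshold per action dimension is my reading of the text). Feedback is the sign of the teacher-agent difference, emitted with the exponentially decaying probability $P_h$ and only for dimensions whose error exceeds $\epsilon$.

```python
import numpy as np

alpha, tau, eps = 0.6, 0.0003, 0.1   # illustrative values for P_h and the threshold

def oracle_feedback(a_teacher, a_agent, timestep, rng=np.random):
    """Synthesized corrective feedback h in {-1, 0, +1} per action dimension (numpy arrays)."""
    p_h = alpha * np.exp(-tau * timestep)    # probability of giving feedback at this time step
    if rng.random() > p_h:
        return np.zeros_like(a_agent)        # no correction at this time step
    diff = a_teacher - a_agent
    h = np.sign(diff)                        # direction in which each dimension should change
    h[np.abs(diff) < eps] = 0.0              # only correct dimensions with a large enough error
    return h
```

With $\alpha$ and $\tau$ as above, the oracle provides feedback frequently at the beginning of training and progressively less often as training advances.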

4.2. Experiments with KUKA Robot Arm

In order to validate the newly proposed method BD-COACH on a real robotic setup with real human teachers, we devised two tasks involving a KUKA LBR iiwa 7 robot arm pushing a box placed on top of a table.

Several reflective markers were attached to the box so its pose could be tracked by an OptiTrack motion capture system. The pose, captured by the eight cameras of the available OptiTrack system, consists of the position and orientation of the central point of the box defined by the reflective markers.

The human that supervises the learning process conveys the corrections with a joystick.

4.2.1. Task KUKA-push-box

Pushing a box in a straight line without a reactive robot is simply impossible to achieve, as the box will naturally drift away from the desired straight trajectory. (The robot on its own is not precise enough to push the object along a straight line.)

Figure 4.4 shows this problem: a constant velocity is commanded to the end effector but, because the robot does not react to the misalignments, the box keeps deviating from the desired straight trajectory.


4.2.2. Task KUKA-park-box


5 Results

5.1. Results of Simulated Tasks

This fact indicates that at certain time steps the difference between the agent's actions and the oracle's actions is smaller than the oracle's threshold $\epsilon$, and therefore no feedback is required at those time steps.


5.2. Results of Validation in Real System


6 Conclusion

  1. More exhaustive experiments.

    BD-COACH has successfully demonstrated in simulation that it is able to benefit from the corrections replay technique.

    However, it would be necessary to run exhaustive experiments with more human participants to take into account human factors that we have not considered and to really prove the benefits of BD-COACH.

  2. Use images as observations.

    BD-COACH has been validated only with observations formed by Cartesian positions.

    It would be very interesting to see whether it is able to keep its performance when the policy is fed with observations formed by images.

  3. Validation with longer horizon tasks.

    Finally, more complex tasks could be taught to BD-COACH to see to what extent it can get the most out of the replay buffer.
