Paper Reading and Reproduction: Shuffle and Learn: Unsupervised Learning using Temporal Order Verification


Contents

Summary

Details

1. Frame Sampling Strategy

2. Training & Testing

2.1    Training Details

2.2    Testing Details (Details for Action Recognition)

Problems Encountered During Reproduction

Paper: Shuffle and Learn: Unsupervised Learning using Temporal Order Verification (ECCV 2016)

Link: https://arxiv.org/pdf/1603.08561.pdf

Original authors' Caffe code: https://github.com/imisra/shuffle-tuple

My PyTorch reproduction: https://github.com/BizhuWu/ShuffleAndLearn_PyTorch


Summary

This paper designs a self-supervised pretext task for video action recognition:

It samples three frames (a-b-c) from a video, feeds them into three AlexNet-like networks with shared weights, concatenates the features coming out of the fc7 layers, and finally applies fc8 for a binary classification: correct order (a-b-c or c-b-a; note that the reversed order also counts as correct) vs. incorrect order (any other permutation).
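The labeling rule above can be sketched in a few lines (a minimal illustration, not the paper's code; frames are represented by their temporal indices):

```python
def is_correct_order(indices):
    """Return 1 if three frame indices form a valid temporal order.

    Both forward (a < b < c) and reverse (a > b > c) orderings count as
    correct, matching the paper's definition; everything else is negative.
    """
    a, b, c = indices
    return 1 if (a < b < c) or (a > b > c) else 0

print(is_correct_order((3, 7, 12)))   # a-b-c    -> 1
print(is_correct_order((12, 7, 3)))   # c-b-a    -> 1
print(is_correct_order((3, 12, 7)))   # shuffled -> 0
```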

A surprise: I had assumed the unit of temporal order verification would be video clips, so the backbone would be something like C3D; it turns out the unit is individual frames, so the backbone is an AlexNet variant.

Details

1. Frame Sampling Strategy

  • we only sample tuples from temporal windows with high motion.
  • we use coarse frame level optical flow [56] as a proxy to measure the motion between frames. We treat the average flow magnitude per-frame as a weight for that frame, and use it to bias our sampling towards high motion windows.
  • To create positive and negative tuples, we sample five frames (fa, fb, fc, fd, fe) from a temporal window such that a < b < c < d < e. Positive instances are created using (fb, fc, fd), while negative instances are created using (fb, fa, fd) and (fb, fe, fd). Additional training examples are also created by inverting the order of all training instances, e.g., (fd, fc, fb) is positive.
  • During training it is critical to use the same beginning frame fb and ending frame fd while only changing the middle frame for both positive and negative examples.
  • To avoid sampling ambiguous negative frames fa and fe, we enforce that the appearance of the positive fc frame is not too similar (measured by SSD on RGB pixel values) to fa or fe.
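The sampling recipe above can be sketched roughly as follows (a simplified illustration, not the authors' code: the SSD-based appearance check against fa and fe is omitted, and the window length of 30 frames is an assumption, as the paper does not fix it):

```python
import numpy as np

def sample_tuples(flow_mags, window=30, rng=None):
    """Sample one positive and two negative 3-frame tuples from a video.

    flow_mags: per-frame average optical-flow magnitude, used as a weight
    to bias the temporal window towards high-motion regions.
    Returns (frame_indices, label) pairs; label 1 = correct order.
    """
    rng = rng or np.random.default_rng()
    n = len(flow_mags)
    # Bias the window start towards high-motion frames.
    weights = np.array(flow_mags[: n - window], dtype=float)
    weights /= weights.sum()
    start = rng.choice(len(weights), p=weights)
    # Five ordered frames a < b < c < d < e inside the window.
    a, b, c, d, e = sorted(
        rng.choice(np.arange(start, start + window), size=5, replace=False))
    tuples = [((b, c, d), 1),   # positive: middle frame in temporal order
              ((b, a, d), 0),   # negatives: middle frame outside (b, d),
              ((b, e, d), 0)]   # same endpoints fb and fd as the positive
    # Inverting the order of each tuple yields extra examples (same labels).
    tuples += [(t[::-1], lab) for t, lab in tuples]
    return tuples
```

Note how both negatives reuse the positive's endpoints fb and fd, so the network can only solve the task by reasoning about the middle frame.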

 

These are the tuple frames the authors pre-sampled, together with the corresponding labels (0 / 1). They can be downloaded here.

Each line of the file train01_image_keys.txt defines a tuple of three frames.

The corresponding file train01_image_labs.txt has a binary label indicating whether the tuple is in the correct or incorrect order.


2. Training & Testing

  • To learn a feature representation from the tuple ordering task, we use a simple triplet Siamese network. This network has three parallel stacks of layers with shared parameters. Every network stack follows the standard CaffeNet [57] (a slight modification of AlexNet [58]) architecture from the conv1 to the fc7 layer.
  • Each stack takes as input one of the frames from the tuple and produces a representation at the fc7 layer. The three fc7 outputs are concatenated as input to a linear classification layer.
  • We update the parameters of the network by minimizing the regularized cross-entropy loss of the predictions on each tuple.
  • During testing we can obtain the conv1 to fc7 representations of a single input frame by using just one stack, as the parameters across the three stacks are shared.
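The triplet Siamese design can be sketched in PyTorch as follows (the tiny convolutional stack below is only a placeholder standing in for CaffeNet conv1-fc7, not the actual architecture):

```python
import torch
import torch.nn as nn

class TripletSiamese(nn.Module):
    """Three parallel stacks with shared parameters, fc7 features
    concatenated, then one linear layer for the 2-way order decision."""

    def __init__(self, feat_dim=4096):
        super().__init__()
        self.stack = nn.Sequential(            # placeholder for conv1..fc7
            nn.Conv2d(3, 16, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.fc8 = nn.Linear(3 * feat_dim, 2)  # correct / incorrect order

    def forward(self, f1, f2, f3):
        # Shared parameters: the same stack processes all three frames.
        feats = [self.stack(f) for f in (f1, f2, f3)]
        return self.fc8(torch.cat(feats, dim=1))
```

At test time only `self.stack` is needed to embed a single frame, which is exactly the "use just one stack" point from the paper.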

 

2.1    Training Details

  • For unsupervised pre-training, we do not use the semantic action labels.
  • We sample about 900k tuples from the UCF101 training videos.
  • We randomly initialize our network, and train for 100k iterations with a fixed learning rate of 10^−3 and mini-batch size of 128 tuples.
  • Each tuple consists of 3 frames.
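A skeleton of the pre-training loop under these hyper-parameters might look like this (`model` and `tuple_loader` are assumed to exist; SGD momentum 0.9 and weight decay 5e-4 are assumptions standing in for the paper's "regularized cross-entropy loss", since this stage only specifies the learning rate, batch size, and iteration count):

```python
import torch
import torch.nn as nn

def pretrain(model, tuple_loader, iters=100_000):
    """Unsupervised pre-training: fixed lr 1e-3, 100k iterations,
    mini-batches of 128 tuples supplied by tuple_loader."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-3,
                          momentum=0.9, weight_decay=5e-4)  # assumed values
    loss_fn = nn.CrossEntropyLoss()
    it = 0
    while it < iters:
        for f1, f2, f3, label in tuple_loader:
            opt.zero_grad()
            loss = loss_fn(model(f1, f2, f3), label)
            loss.backward()
            opt.step()
            it += 1
            if it >= iters:
                break
```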

 

2.2    Testing Details (Details for Action Recognition)

  • The backbone is the RGB appearance branch from the Two-Stream paper.
  • The parameters of the spatial network are initialized with our unsupervised pre-trained network.
  • At test time, 25 frames are uniformly sampled from each video.
  • Each frame is used to generate 10 inputs after fixed cropping and flipping (5 crops × 2 flips), and the prediction for the video is an average of the predictions across these 25×10 inputs.
  • We initialize the network parameters up to the fc7 layer using the parameters from the unsupervised pre-trained network, and initialize a new fc8 layer for the action recognition task.
  • We finetune the network following [60] for 20k iterations with a batch size of 256, and learning rate of 10^−2 decaying by 10 after 14k iterations, using SGD with momentum of 0.9, and dropout of 0.5
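The test-time aggregation (25 frames × 10 crops, predictions averaged over all 250 inputs) can be sketched as follows (a simplified illustration; `model` is assumed to map a batch of crops to class logits, and the crop helper is a hand-rolled stand-in for torchvision's TenCrop):

```python
import torch

def ten_crop(frame, size):
    """5 fixed crops (4 corners + centre) plus their horizontal flips."""
    _, h, w = frame.shape
    tops = [0, 0, h - size, h - size, (h - size) // 2]
    lefts = [0, w - size, 0, w - size, (w - size) // 2]
    crops = [frame[:, t:t + size, l:l + size] for t, l in zip(tops, lefts)]
    crops += [c.flip(-1) for c in crops]      # horizontal flips
    return torch.stack(crops)                 # (10, C, size, size)

def video_prediction(model, frames, crop=224):
    """Average predictions over 25 uniform frames x 10 crops each."""
    idx = torch.linspace(0, len(frames) - 1, steps=25).long()
    probs = [model(ten_crop(frames[i], crop)).softmax(1).mean(0) for i in idx]
    return torch.stack(probs).mean(0)         # one probability vector per video
```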


Problems Encountered During Reproduction

Note! In the UCF101 dataset, the folder for the HandstandPushups class is named HandstandPushups, but the video files inside are named HandStandPushups (capital S). This mismatch is an easy place to trip up during data preprocessing.
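One way to guard against this is to match class names case-insensitively (a hypothetical helper, assuming the standard `v_<Class>_gXX_cXX.avi` naming):

```python
def class_of(video_name, class_names):
    """Map a UCF101 video file name to its class folder, ignoring case.

    Works around the quirk that the class folder is 'HandstandPushups'
    while the videos inside are named 'v_HandStandPushups_...'.
    """
    stem = video_name.split('_')[1]              # v_<Class>_g01_c01.avi
    lookup = {c.lower(): c for c in class_names}
    return lookup[stem.lower()]

print(class_of('v_HandStandPushups_g01_c01.avi', ['HandstandPushups']))
# -> HandstandPushups
```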


