论文精翻《A Tandem Learning Rule for Effective Training and Rapid Inference of Deep Spiking Neural Networks》


目录

  • 摘要/Abstract
  • I 简介/Introduction
  • II 通过串联网络学习/Learning Through A Tandem Network
    • A 神经元模型/Neuron Model
    • B 编码和解码方案/Encoding and Decoding Schemes
    • C 作为离散神经表示的脉冲计数/Spike Count as a Discrete Neural Representation
    • D 串联网络中的信用分配/Credit Assignment in the Tandem Network
  • III 实验评价与讨论/Experimental Evaluation And Discussion
    • A 实验设置/Experimental Setups
    • B 基于帧的物体识别结果/Frame-Based Object Recognition Results
    • C 基于事件的对象识别结果/Event-Based Object Recognition Results
    • D 卓越的回归能力/Superior Regression Capability
    • E 交错层内的激活方向保持与权重-激活点积比例/Activation Direction Preservation and Weight-Activation Dot Product Proportionality Within the Interlaced Layers
    • F 通过脉冲序列级代理梯度的高效学习/Efficient Learning Through Spike-Train Level Surrogate Gradient
    • G 减少突触操作,实现快速推理/Rapid Inference With Reduced Synaptic Operations
  • IV 讨论与结论/Discussion And Conclusion

《A Tandem Learning Rule for Effective Training and Rapid Inference of Deep Spiking Neural Networks》论文精翻

👉 CSDN-论文原文下载

⚠️ 请注意:译文中,加粗文字为译者认为的重点部分,加粗斜体文字为译者觉得难以翻译/翻译不准的部分。

摘要/Abstract

SNN代表了神经形态计算(NC)架构中最突出的生物启发计算模型。然而,由于脉冲神经元函数的不可微性,标准误差反向传播算法不能直接适用于SNN。在这项工作中,我们提出了一个串联学习框架,由一个SNN和一个通过权值共享耦合的人工神经网络(ANN)组成。人工神经网络是一种辅助结构,便于在脉冲序列级别训练SNN的误差反向传播。为此,我们将脉冲计数作为SNN中的离散神经表示,并设计了一个可以有效近似耦合SNN的脉冲计数的ANN神经元激活函数。所提出的串联学习规则在传统的基于帧和基于事件的视觉数据集上展示了具有竞争力的模式识别和回归能力,与其他最先进的SNN实现相比,至少减少了一个数量级的推理时间和总的突触操作。因此,本文提出的串联学习规则为低计算资源训练高效、低延迟、高精度的深度SNN提供了一种新的解决方案。

Spiking neural networks (SNNs) represent the most prominent biologically inspired computing model for neuromorphic computing (NC) architectures. However, due to the nondifferentiable nature of spiking neuronal functions, the standard error backpropagation algorithm is not directly applicable to SNNs. In this work, we propose a tandem learning framework that consists of an SNN and an artificial neural network (ANN) coupled through weight sharing. The ANN is an auxiliary structure that facilitates the error backpropagation for the training of the SNN at the spike-train level. To this end, we consider the spike count as the discrete neural representation in the SNN and design an ANN neuronal activation function that can effectively approximate the spike count of the coupled SNN. The proposed tandem learning rule demonstrates competitive pattern recognition and regression capabilities on both the conventional frame- and event-based vision datasets, with at least an order of magnitude reduced inference time and total synaptic operations over other state-of-the-art SNN implementations. Therefore, the proposed tandem learning rule offers a novel solution to training efficient, low latency, and high-accuracy deep SNNs with low computing resources.

I 简介/Introduction

深度学习在计算机视觉[1]、语音处理[2]、语言理解[3]等方面极大地提高了模式识别性能。然而,深度人工神经网络(ANN)计算量大,内存效率低,因此限制了它们在计算预算有限的移动和可穿戴设备上的部署。这促使我们寻找节能的解决方案。

Deep learning has greatly improved pattern recognition performance by leaps and bounds in computer vision [1], speech processing [2], language understanding [3], and so on. However, deep artificial neural networks (ANNs) are computationally intensive and memory inefficient, thereby limiting their deployments in mobile and wearable devices that have limited computational budgets. This prompts us to look into energy-efficient solutions.

经过数百万年的进化,人类的大脑在执行复杂的感知与认知任务时具有令人难以置信的效率。虽然分层组织的深层ANN是由大脑启发的,但它们在许多方面与生物大脑有很大不同。从根本上说,信息是通过大脑中的异步动作电位或脉冲来表示和交流的。为了高效、快速地处理这些脉冲队列所携带的信息,生物神经系统进化出事件驱动计算策略,其能量消耗与感官刺激的活动水平相匹配。

The human brain, with millions of years of evolution, is incredibly efficient at performing complex perceptual and cognitive tasks. Although hierarchically organized deep ANNs are brain-inspired, they differ significantly from the biological brain in many ways. Fundamentally, the information is represented and communicated through asynchronous action potentials or spikes in the brain. To efficiently and rapidly process the information carried by these spike trains, biological neural systems evolve the event-driven computation strategy, whereby energy consumption matches with the activity level of sensory stimuli.

神经形态计算(Neuromorphic computing, NC)作为一种新兴的非冯·诺依曼计算范式,旨在通过硅芯片上的脉冲神经网络(SNN)[4]模拟这种异步、事件驱动的信息处理。新型NC架构(例如TrueNorth[5]和Loihi[6])利用低功耗、密集连接的并行计算单元来支持基于脉冲的计算。此外,内存与计算的共址可以有效缓解CPU和内存之间的低带宽问题(即冯·诺依曼瓶颈)[7]。在这些神经形态架构上实现时,深度SNN表现出引人注目的能量效率和低延迟[8]。

Neuromorphic computing (NC), as an emerging non-von Neumann computing paradigm, aims to mimic such asynchronous event-driven information processing with spiking neural networks (SNNs) in silicon [4]. The novel NC architectures—instances include TrueNorth [5] and Loihi [6]—leverage on the low-power, densely connected parallel computing units to support spike-based computation. Furthermore, the colocated memory and computation can effectively mitigate the problem of low bandwidth between the CPU and memory (i.e., von Neumann bottleneck) [7]. When implemented on these neuromorphic architectures, deep SNNs demonstrate compelling energy efficiency and low latency [8].

虽然NC架构提供了诱人的节能优势,但如何训练能够在这些NC架构上高效、有效运行的大规模SNN仍然是一个具有挑战性的研究课题。脉冲神经元表现出丰富的动态行为[9],如相位性放电、爆发式放电和脉冲频率自适应,这相比简化的ANN大大增加了建模复杂性。此外,由于SNN中突触操作的异步性和不连续性,通常用于ANN训练的误差反向传播算法并不能直接应用于SNN。

While NC architectures offer attractive energy-saving, how to train large-scale SNNs that can operate efficiently and effectively on these NC architectures remains a challenging research topic. The spiking neurons exhibit a rich repertoire of dynamical behaviors [9], such as phasic spiking, bursting, and spike frequency adaptation, which significantly increases the modeling complexity over the simplified ANNs. Moreover, due to the asynchronous and discontinuous nature of synaptic operations within the SNN, the error backpropagation algorithm that is commonly used for the ANN training is not directly applicable to the SNN.

多年来,受到神经科学和机器学习研究的启发,越来越多的神经可塑性或学习方法被提出用于SNN[10],[11]。具有生物学合理性的Hebbian学习规则[12]和脉冲时间依赖可塑性(STDP)[13]是计算神经科学研究中很有意义的局部学习规则,对于基于新兴非易失性存储器件[14]的硬件实现也很有吸引力。尽管它们最近在小规模图像识别任务[15],[16]上取得了成功,但由于任务特定的信用分配效果不佳且超参数调优耗时,它们难以直接用于大规模机器学习任务。

Over the years, a growing number of neural plasticities or learning methods, inspired by neuroscience and machine learning studies, have been proposed for SNNs [10], [11]. The biologically plausible Hebbian learning rules [12] and spike-timing-dependent plasticity (STDP) [13] are intriguing local learning rules for computational neuroscience studies and also attractive for hardware implementation with emerging nonvolatile memory devices [14]. Despite their recent successes on the small-scale image recognition tasks [15], [16], they are not straightforward to be used for large-scale machine learning tasks due to the ineffective task-specific credit assignment and time-consuming hyperparameter tuning.

最近的研究[17]-[19]表明,将预训练的ANN转换为SNN是可行的,对分类精度几乎没有不利影响。这种间接训练方法假设模拟神经元的激活值相当于脉冲神经元的平均放电速率,只需要对训练后的神经网络的权重进行解析和归一化。Rueckauer等人[18]对这种方法的性能偏差进行了理论分析,并对用于物体识别任务的卷积神经网络(CNN)模型进行了系统研究。这种转换方法在许多传统的基于帧的视觉数据集上实现了SNN的最佳报告结果,包括具有挑战性的ImageNet-12数据集[18],[19]。然而,这种通用转换方法需要权衡推理速度和分类精度,并且需要至少数百个推理时间步才能达到最佳分类精度。

Recent studies [17]–[19] show that it is viable to convert a pretrained ANN to an SNN with little adverse impacts on the classification accuracy. This indirect training approach assumes that the activation value of analog neurons is equivalent to the average firing rate of spiking neurons and simply requires parsing and normalizing of weights of the trained ANN. Rueckauer et al. [18] provide a theoretical analysis of the performance deviation of such an approach and a systematic study on the convolutional neural network (CNN) models for the object recognition task. This conversion approach achieves the best-reported results for SNNs on many conventional frame-based vision datasets, including the challenging ImageNet-12 dataset [18], [19]. However, this generic conversion approach comes with a tradeoff that has an impact on the inference speed and classification accuracy and requires at least several hundreds of inference time steps to reach optimal classification accuracy.

还有一些研究工作致力于训练可以近似SNN特性的受约束ANN[20],[21],使训练好的模型能够无缝迁移到目标硬件平台。这种约束-然后训练(constrain-then-train)方法以基于速率的脉冲神经元模型为基础,将脉冲神经元的稳态发放速率转换为连续、因而可微的形式,从而可以用传统的误差反向传播算法进行优化。通过在训练过程中显式逼近SNN的特性,该方法在目标神经形态硬件上实现时比前述通用转换方法性能更好。

Additional research efforts are also devoted to training constrained ANNs that can approximate the properties of SNNs [20], [21], which allows the trained model to be transferred to the target hardware platform seamlessly. Grounded on the rate-based spiking neuron model, this constrain-then-train approach transforms the steady-state firing rate of spiking neurons into a continuous and, hence, differentiable form that can be optimized with the conventional error backpropagation algorithm. By explicitly approximating the properties of SNNs during the training process, this approach performs better than the aforementioned generic conversion approach when implemented on the target neuromorphic hardware.

虽然通过通用ANN-to-SNN转换和约束-然后训练方法都显示了有竞争力的分类精度,但基于速率的脉冲神经元模型的基本假设需要较长的编码时间窗口(即图像或样本呈现多少时间步)或较高的发射速率才能达到稳定的神经元发射状态[18],[20],这样就可以消除预训练的ANN和SNN之间的近似误差。这种稳态要求限制了可以从NC架构中获得的计算优势,并且仍然是将这些方法应用于实时模式识别任务的主要障碍。

While competitive classification accuracies are shown with both the generic ANN-to-SNN conversion and the constrain-then-train approaches, the underlying assumption of a rate-based spiking neuron model requires a long encoding time window (i.e., how many time steps the image or sample are presented) or a high firing rate to reach the steady neuronal firing state [18], [20], such that the approximation errors between the pretrained ANN and the SNN can be eliminated. This steady-state requirement limits the computational benefits that can be acquired from the NC architectures and remain a major roadblock for applying these methods to real-time pattern recognition tasks.

为了提高整体能量效率和推理速度,理想的SNN学习规则应该支持较短的编码时间窗口和稀疏的突触活动。为了利用这一理想的特性,时间编码得到了研究,其中第一个脉冲的脉冲时间可被用作微分的代理,以实现误差反向传播算法[22]-[24]。尽管在MNIST数据集上得到了具有竞争力的结果,但仍然难以捉摸的是,时间学习规则如何保持神经元放电的稳定性,以便确定导数,以及如何将其扩大到最先进的深度ANN的大小。鉴于基于速率的SNN的稳态要求和基于时间的SNN的可扩展性问题,有必要开发新的学习方法,能够有效、高效地训练深度SNN在编码时间窗短、突触活动稀疏的情况下运行。

To improve the overall energy efficiency and inference speed, an ideal SNN learning rule should support a short encoding time window with sparse synaptic activities. To exploit this desirable property, the temporal coding has been investigated, whereby the spike timing of the first spike was employed as a differentiable proxy to enable the error backpropagation algorithm [22]–[24]. Despite competitive results on the MNIST dataset, it remains elusive how the temporal learning rule maintains the stability of neuronal firing such that the derivatives can be determined, and how it can be scaled up to the size of state-of-the-art deep ANNs. In view of the steady-state requirement of rate-based SNNs and scalability issues of temporal-based SNNs, it is necessary to develop new learning methods that can effectively and efficiently train deep SNNs to operate under a short encoding time window with sparse synaptic activities.

代理梯度学习(surrogate gradient learning)[25]是最近出现的一种深度SNN训练方法。通过离散时间公式,可以把脉冲神经元有效地建模为非脉冲的递归神经网络(RNN),其中脉冲神经元模型中的泄漏项表示为固定权重的自递归连接。通过建立与RNN的等价性,经典的随时间反向传播(BPTT)算法可以用来训练深度SNN。在误差反向传播过程中,不可微的脉冲产生函数可以用连续函数代替,从而根据每个时间步的瞬时膜电位导出代理梯度。在实践中,代理梯度学习在静态和时间模式识别任务[26]-[29]中都表现得非常好。通过去除基于速率的SNN的稳态发放速率约束和基于时间的SNN的脉冲时间依赖,代理梯度学习支持SNN快速有效的模式识别。

Surrogate gradient learning [25] has emerged recently as an alternative training method for deep SNNs. With a discrete-time formulation, the spiking neuron can be effectively modeled as a nonspiking recurrent neural network (RNN), wherein the leak term in spiking neuron models is formulated as a fixed-weight self-recurrent connection. By establishing the equivalence with RNNs, the canonical error backpropagation through time (BPTT) algorithm can be applied to train deep SNNs. The nondifferentiable spike generation function can be replaced with a continuous function during the error backpropagation, whereby a surrogate gradient can be derived based on the instantaneous membrane potential at each time step. In practice, the surrogate gradient learning performs exceedingly well for both static and temporal pattern recognition tasks [26]–[29]. By removing the constraints of steady-state firing rate for rate-based SNN and spike-timing dependency of temporal-based SNN, the surrogate gradient learning supports rapid and efficient pattern recognition with SNNs.

虽然代理梯度学习在MNIST和CIFAR-10[30]数据集上报告了有竞争力的准确率,但使用BPTT训练深度SNN在内存和计算上都很低效,特别是对于更复杂的数据集和网络结构。此外,vanilla RNN中众所周知的梯度消失问题[31]可能会对时间跨度较长的脉冲模式的学习性能产生不利影响。在本文中,为了提高代理梯度学习的学习效率,我们提出了一种新的基于串联神经网络的学习规则。如图1所示,串联网络结构由一个SNN和一个ANN组成,二者逐层耦合并共享权重。ANN是一种辅助结构,用于在脉冲序列级别为SNN的训练实现误差反向传播,而SNN则用于推导精确的脉冲神经表示。大量实验研究表明,这种串联学习规则支持使用SNN进行快速、高效和可扩展的模式识别。

While competitive accuracies were reported on the MNIST and CIFAR-10 [30] datasets with the surrogate gradient learning, it is both memory and computationally inefficient to train deep SNNs using BPTT, especially for more complex datasets and network structures. Furthermore, the vanishing gradient problem [31] that is well-known for vanilla RNNs may adversely affect the learning performance for spiking patterns with long temporal duration. In this article, to improve the learning efficiency of surrogate gradient learning, we propose a novel learning rule with the tandem neural network. As illustrated in Fig. 1, the tandem network architecture consists of an SNN and an ANN that are coupled layerwise with weights sharing. The ANN is an auxiliary structure that facilitates the error backpropagation for the training of the SNN at the spike-train level, while the SNN is used to derive the exact spiking neural representation. This tandem learning rule allows rapid, efficient, and scalable pattern recognition with SNNs as demonstrated through extensive experimental studies.

图1
🖼️ 图1:所提出的串联学习框架的说明,由一个SNN和一个具有共享权重的ANN组成。在该框架中,将脉冲计数作为主要的信息载体。设计神经网络激活函数来近似耦合SNN的脉冲计数,从而在脉冲序列级别上近似耦合SNN层的梯度。在训练过程中,在正向传递中,从SNN层获得的同步脉冲计数和脉冲序列分别作为后续SNN层和ANN层的输入;误差梯度在误差反向传播过程中通过人工神经网络层向后传递,以更新权重以最小化目标函数。

Fig. 1. Illustration of the proposed tandem learning framework that consists of an SNN and an ANN with shared weights. The spike counts are considered as the main information carrier in this framework. ANN activation function is designed to approximate the spike counts of the coupled SNN, so as to approximate the gradients of the coupled SNN layers at the spike-train level. During training, in the forward pass, the synchronized spike counts and spike trains derived from an SNN layer are taken as the inputs to the subsequent SNN and ANN layers, respectively; the error gradients are passed backward through the ANN layers during error backpropagation, to update the weights so as to minimize the objective function.

本文其余部分的组织如下。在第二节中,我们阐述了所提出的串联学习框架。在第三节中,我们通过与其他SNN实现进行比较,在传统的基于帧的视觉数据集(即MNIST、CIFAR-10和ImageNet-12)和基于事件的视觉数据集(即N-MNIST和DVS-CIFAR10)上评估了所提出的串联学习框架。最后,我们以第四节的讨论作为结束。

The rest of this article is organized as follows. In Section II, we formulate the proposed tandem learning framework. In Section III, we evaluate the proposed tandem learning framework on both the conventional frame-based vision datasets (i.e., MNIST, CIFAR-10, and ImageNet-12) and the event-based vision datasets (i.e., N-MNIST and DVS-CIFAR10) by comparing with other SNN implementations. Finally, we conclude with discussions in Section IV.

II 通过串联网络学习/Learning Through A Tandem Network

在本节中,我们首先介绍在这项工作中使用的脉冲神经元模型。然后,我们提出了一种使用脉冲计数作为跨网络层的信息载体的离散神经表示方案,并设计了ANN激活函数,以有效地近似耦合SNN的脉冲计数,以便在脉冲序列级别上进行误差反向传播。最后,我们介绍了串联网络及其学习规则,称为串联学习规则,用于深度SNN训练。

In this section, we first introduce spiking neuron models that are used in this work. We then present a discrete neural representation scheme using spike count as the information carrier across network layers, and we design ANN activation functions to effectively approximate the spike count of the coupled SNN for error backpropagation at the spike-train level. Finally, we introduce the tandem network and its learning rule, which is called the tandem learning rule, for deep SNN training.

A 神经元模型/Neuron Model

脉冲神经元模型描述了大脑生物神经元丰富的动力学行为[32]。一般来说,神经元模型的计算复杂度随着其生物学合理性的增加而增加。因此,为了在高效的神经形态硬件上实现,可以提供足够的生物学细节的简单而有效的脉冲神经元模型是首选。

The spiking neuron models describe the rich dynamical behaviors of biological neurons in the brain [32]. In general, the computational complexity of spiking neuron models grows with the level of biological plausibility. Therefore, for implementation on efficient neuromorphic hardware, a simple yet effective spiking neuron model that can provide a sufficient level of biological details is preferred.

在这项工作中,我们使用了可以说是最简单、又能用脉冲计数有效描述感知信息的脉冲神经元模型:基于电流的积分发放(IF)神经元[18]和泄漏积分发放(LIF)神经元模型[32]。虽然IF和LIF神经元没有模拟生物神经元丰富的脉冲活动谱,但它们非常适合处理以脉冲发放率或同步脉冲模式编码信息的感觉输入。

In this work, we use the arguably simplest spiking neuron models that can effectively describe the sensory information with spike counts: the current-based integrate-and-fire (IF) neuron [18] and leaky integrate-and-fire (LIF) neuron models [32]. While the IF and LIF neurons do not emulate the rich spectrum of spiking activities of biological neurons, they are, however, ideal for working with sensory input where information is encoded in spike rates or coincident spike patterns.

位于第 $l$ 层的LIF神经元 $i$ 的阈下膜电位 $U^l_i$ 可用以下线性微分方程描述:

The subthreshold membrane potential $U^l_i$ of LIF neuron $i$ at layer $l$ can be described by the following linear differential equation:

\begin{equation}\tau_m\dfrac{dU^l_i}{dt}=-[U^l_i-U_{rest}]+RI^l_i(t)\end{equation}

其中 $\tau_m$ 为膜时间常数,$U_{rest}$ 和 $R$ 分别为脉冲神经元的静息电位和膜电阻,$I^l_i(t)$ 为神经元 $i$ 随时间变化的输入电流。去除LIF神经元中的膜电位泄漏效应后,IF神经元的阈下动态可以描述为:

where $\tau_m$ is the membrane time constant. $U_{rest}$ and $R$ are the resting potential and the membrane resistance of the spiking neuron, respectively. $I^l_i(t)$ refers to the time-dependent input current to the neuron $i$. By removing the membrane potential leaky effect involved in the LIF neuron, the subthreshold dynamics of the IF neuron can be described as follows:

\begin{equation}\frac{dU^l_i}{dt}=RI^l_i(t)\end{equation}

在不失一般性的前提下,本文将静息电位 $U_{rest}$ 设为零,膜电阻 $R$ 设为单位值。每当 $U^l_i$ 越过发放阈值 $\vartheta$ 时,就会产生一个输出脉冲

Without loss of generality, we set the resting potential $U_{rest}$ to zero and the membrane resistance $R$ to unitary in this work. An output spike is generated whenever $U^l_i$ crosses the firing threshold $\vartheta$

\begin{equation}s^l_i(t)=\Theta(U^l_i(t)-\vartheta)\ \text{with}\ \Theta(x)=\begin{cases}1,\quad\text{if}\ x\ge 0\\0,\quad\text{otherwise}\end{cases}\end{equation}

其中 $s^l_i(t)$ 表示神经元 $i$ 在时间步 $t$ 是否产生输出脉冲。

where $s^l_i(t)$ indicates the occurrence of an output spike from the neuron $i$ at time step $t$.

在实践中,给定一个较小的模拟时间步长 $dt$,LIF神经元的线性微分方程可以用以下离散时间公式很好地近似:

In practice, given a small simulation time step $dt$, the linear differential equation of the LIF neuron can be well approximated by the following discrete-time formulation:

\begin{equation}U^l_i[t]=\alpha U^l_i[t-1]+I^l_i[t]-\vartheta s^l_i[t-1]\end{equation}

\begin{equation}I^l_i[t]=\sum_jw^{l-1}_{ij}s^{l-1}_j[t-1]+b^l_i\end{equation}

其中 $\alpha\equiv\exp(-dt/\tau_m)$。以上公式中使用方括号表示离散时间建模。$I^l_i[t]$ 汇总了前一层突触前神经元贡献的突触电流,$w^{l-1}_{ij}$ 为来自第 $l-1$ 层传入神经元 $j$ 的突触连接强度,$b^l_i$ 为注入神经元 $i$ 的恒定电流。如(4)的最后一项所示,每次脉冲产生后,不是将膜电位重置为零,而是从膜电位中减去发放阈值 $\vartheta$。这有效地保留了超出发放阈值的剩余膜电位,减少了跨层的信息损失[18]。类似地,IF神经元的离散时间公式可以表示为:

where $\alpha\equiv\exp(-dt/\tau_m)$. The square brackets are used in the above formulations to reflect the discrete-time modeling. $I^l_i[t]$ summarizes the synaptic current contributions from presynaptic neurons of the preceding layer. $w^{l-1}_{ij}$ denotes the strength of the synaptic connection from the afferent neuron $j$ of layer $l-1$, and $b^l_i$ is the constant injecting current to the neuron $i$. As denoted by the last term of (4), instead of resetting the membrane potential to zero after each spike generation, the firing threshold $\vartheta$ is subtracted from the membrane potential. This effectively preserves the surplus membrane potential that increased over the firing threshold and reduces the information loss across layers [18]. Similarly, the discrete-time formulation of the IF neuron can be expressed as follows:

\begin{equation}U^l_i[t]=U^l_i[t-1]+I^l_i[t]-\vartheta s^l_i[t-1]\end{equation}

在我们的实验中,对于IF和LIF神经元,在处理每个新的输入样本之前,$U^l_i[0]$ 都被重置并初始化为零。我们将脉冲神经元产生的脉冲总数(即脉冲计数)作为主要的信息载体。对于第 $l$ 层的神经元 $i$,脉冲计数 $c^l_i$ 可以通过对编码时间窗口 $T$ 内的所有输出脉冲求和来确定

In our experiments, for both IF and LIF neurons, the $U^l_i[0]$ is reset and initialized to zero before processing each new input example. We consider the total number of spikes (i.e., spike count) generated by spiking neurons as the main information carrier. For neuron $i$ at layer $l$, the spike count $c^l_i$ can be determined by summing all output spikes over the encoding time window $T$

\begin{equation}c^l_i=\sum^T_{t=1}s^l_i[t]\end{equation}
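To make the discrete-time formulations (4)-(7) concrete, here is a minimal PyTorch sketch of one spiking layer. The function name, tensor shapes, and default constants (the threshold and $\tau_m$ follow the experimental settings quoted later in Section III-A) are illustrative assumptions rather than the authors' released code, and the one-step transmission delay in (5) is dropped for brevity.

```python
import torch

def simulate_spiking_layer(spike_in, W, b, T, theta=0.1, tau_m=20.0, leaky=True):
    """Sketch of the discrete-time dynamics in (4)-(7).

    spike_in: (T, N_in) binary spike trains from the preceding layer.
    W: (N_out, N_in) synaptic weights; b: (N_out,) constant injecting current.
    Returns the output spike trains (T, N_out) and the spike counts (N_out,).
    """
    alpha = float(torch.exp(torch.tensor(-1.0 / tau_m))) if leaky else 1.0  # leak factor, dt = 1
    U = torch.zeros(W.shape[0])             # membrane potential, U_i^l[0] = 0
    s_prev = torch.zeros(W.shape[0])        # this layer's spikes at the previous step
    spikes_out = torch.zeros(T, W.shape[0])
    for t in range(T):
        I_t = spike_in[t] @ W.t() + b       # synaptic current, cf. (5)
        U = alpha * U + I_t - theta * s_prev  # soft reset by threshold subtraction, (4)/(6)
        s_prev = (U >= theta).float()       # spike generation, (3)
        spikes_out[t] = s_prev
    return spikes_out, spikes_out.sum(dim=0)  # spike count c_i^l, (7)
```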

在这项工作中,我们使用非脉冲模拟神经元的激活值来近似脉冲神经元的脉冲计数。模拟神经元 $i$ 所进行的转换可以描述为

In this work, we use the activation value of nonspiking analog neurons to approximate the spike count of spiking neurons. The transformation performed by the analog neuron $i$ can be described as

\begin{equation}a^l_i=f\left(\sum_jw^{l-1}_{ij}x^{l-1}_j+b^l_i\right)\end{equation}

其中 $w^{l-1}_{ij}$ 和 $b^l_i$ 分别为模拟神经元的权重项和偏置项,$x^{l-1}_j$ 和 $a^l_i$ 分别对应模拟输入和输出激活值,$f(\cdot)$ 为模拟神经元的激活函数。使用模拟神经元近似脉冲计数的细节将在第II-C节中解释。

where $w^{l-1}_{ij}$ and $b^l_i$ are the weight and bias terms of the analog neuron, respectively. $x^{l-1}_j$ and $a^l_i$ correspond to the analog input and output activation values. $f(\cdot)$ denotes the activation function of analog neurons. Details of the spike count approximation using analog neurons will be explained in Section II-C.

B 编码和解码方案/Encoding and Decoding Schemes

SNN处理以脉冲序列表示的输入,这些输入理想情况下应由基于事件的传感器生成,例如硅视网膜事件相机[33]和硅耳蜗音频传感器[34]。然而,与基于帧的传感器相比,从这些事件驱动传感器收集的数据集并不丰富。为了将基于帧的传感器数据作为输入,SNN需要额外的神经编码机制来将实值样本转换为脉冲序列。

The SNNs process inputs that are represented as spike trains, which ideally should be generated by event-based sensors, for instance, silicon retina event camera [33] and silicon cochlea audio sensor [34]. However, the datasets collected from these event-driven sensors are not abundantly available in comparison to their frame-based counterparts. To take frame-based sensor data as inputs, SNNs will require additional neural encoding mechanisms to transform the real-valued samples into spike trains.

一般来说,通常考虑两种神经编码方案:速率编码和时间编码。速率编码[17],[18]在每个采样时间步按照泊松分布或伯努利分布将实值输入转换为脉冲序列。然而,它存在采样误差,因此需要很长的编码时间窗口来补偿这种误差。因此,速率编码并不适合把信息编码到我们所期望的短时间窗口中。另一方面,时间编码使用单个脉冲的时刻来编码信息,例如首脉冲时间编码[22]、相位编码[32]等,因此具有更高的编码效率和计算优势。但是,时间编码解码复杂,且对噪声敏感[32]。此外,时间编码所必需的高时间分辨率在神经形态芯片上也难以实现。

In general, two neural encoding schemes are commonly considered: rate code and temporal code. Rate code [17], [18] converts real-valued inputs into spike trains at each sampling time step following a Poisson or Bernoulli distribution. However, it suffers from sampling errors, thereby requiring a long encoding time window to compensate for such errors. Hence, the rate code is not the best to encode information into the short time window that we desire. On the other hand, temporal coding uses the timing of a single spike to encode information; instances include the time-to-first-spike [22], phase code [32], and so on. Therefore, it enjoys superior coding efficiency and computational advantages. However, it is complex to decode and sensitive to noise [32]. Moreover, it is also challenging to achieve a high temporal resolution, which is essential for the temporal coding, on neuromorphic chips.

或者,我们将实值输入作为与时间相关的输入电流,并在每个时间步长直接应用于(4)和(6)。该神经编码方案克服了速率码的采样误差;因此,它可以支持准确和快速的推理,如较早的工作所示[28],[35]。如图1所示,从该神经编码层开始,将脉冲序列和脉冲计数分别作为SNN层和ANN层的输入。

Alternatively, we take the real-valued inputs as the time-dependent input currents and directly apply them in (4) and (6) at every time step. This neural encoding scheme overcomes the sampling error of the rate code; therefore, it can support accurate and rapid inference, as demonstrated in earlier works [28], [35]. As shown in Fig. 1, beginning from this neural encoding layer, spike trains and spike counts are taken as input to the SNN and ANN layers, respectively.
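As a small illustration of this encoding scheme, the sketch below (a hypothetical helper, not from the paper's code) simply repeats the real-valued input at every time step so that, after being weighted by the first layer's synapses, it plays the role of the input current in (4)/(6):

```python
import torch

def encode_as_constant_current(x, T):
    """Repeat the real-valued input x (e.g., normalized pixel intensities) at
    every one of the T time steps; this avoids the sampling error of
    Poisson/Bernoulli rate coding and drives the first spiking layer directly."""
    return x.unsqueeze(0).expand(T, *x.shape)
```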

为了便于模式分类,需要一个SNN后端将输出脉冲序列解码为模式类别。解码时,可以从SNN输出层使用离散脉冲计数进行解码,也可以使用在编码时间窗口 $T$ 内累积的连续自由聚合膜电位(不发放脉冲)$U^{l,f}_i$

To facilitate pattern classification, an SNN back end is required to decode the output spike trains into pattern classes. For decoding, it is feasible to decode from the SNN output layer using either the discrete spike counts or the continuous free aggregate membrane potentials (no spiking) $U^{l,f}_i$ accumulated over the encoding time window $T$

\begin{equation}U^{l,f}_i=R\left(\sum_jw^{l-1}_{ij}c^{l-1}_j+b^l_iT\right)\end{equation}

在我们的初步研究中,如图2所示,我们观察到自由聚合膜电位提供了更平滑的学习曲线,因为它允许在输出层导出连续的误差梯度。此外,自由聚合膜电位可以直接作为回归任务的输出。因此,除非另有说明,我们在这项工作中使用自由聚合膜电位进行神经解码。

In our preliminary study, as shown in Fig. 2, we observe that the free aggregate membrane potential provides a much smoother learning curve, as it allows continuous error gradients to be derived at the output layer. Furthermore, the free aggregate membrane potential can be directly used as the output for regression tasks. Therefore, we use the free aggregate membrane potential for neural decoding in this work unless otherwise stated.
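For reference, a minimal sketch of the decoding in (9) could look as follows (assuming unitary membrane resistance as stated earlier; the function name and shapes are illustrative):

```python
import torch

def free_aggregate_potential(c_prev, W, b, T):
    """Non-spiking 'free' aggregate membrane potential of (9), computed from the
    spike counts c_prev of the last hidden SNN layer over the window T. For
    classification, the class with the largest value is the prediction; for
    regression, the value itself serves as the network output."""
    return c_prev @ W.t() + b * T

# Hypothetical usage:
# logits = free_aggregate_potential(c_hidden, W_out, b_out, T)
# predicted_class = logits.argmax()
```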

图2
🖼️ 图2 (a) 不同训练方案下CIFAR-10测试集的分类准确率。(b) CIFAR-10测试集分类准确率随编码窗口大小 $T$ 的变化。实验中使用的是IF神经元。为了更好地可视化,Epoch 100到150的学习曲线被放大并显示在插图中。

Fig. 2. (a) Classification accuracy on the CIFAR-10 test set with different training schemes. (b) Classification accuracy on the CIFAR-10 test set as a function of different encoding window sizes $T$. The IF neurons are used in this experiment. For better visualization, the learning curves from Epoch 100 to 150 are enlarged and provided in the inset.

C 作为离散神经表示的脉冲计数/Spike Count as a Discrete Neural Representation

深度ANN学会用紧凑的潜在表示来描述输入数据。典型的潜在表示形式是连续或离散值向量。虽然大多数研究都集中在连续潜在表示法上,但离散表示法在解决现实问题[36]-[40]上有其独特的优势。例如,它们可能更适合表示自然语言,自然语言本质上是离散的,也适合逻辑推理和预测学习。此外,离散神经表示的思想也被利用在网络量化[41],[42]中,其中网络权值,激活值和梯度被量化用于有效的神经网络训练和推理。

Deep ANNs learn to describe the input data with compact latent representations. A typical latent representation is in the form of a continuous or discrete-valued vector. While most studies have focused on continuous latent representations, discrete representations have their unique advantages in solving real-world problems [36]–[40]. For example, they are potentially a more natural fit for representing natural language, which is inherently discrete, and also native for logical reasoning and predictive learning. Moreover, the idea of discrete neural representation has also been exploited in the network quantization [41], [42], where network weights, activation values, and gradients are quantized for efficient neural network training and inference.

在这项工作中,我们将脉冲计数视为深层SNN中的离散潜在表示,并设计人工神经网络激活函数来近似耦合SNN的脉冲计数,这样就可以有效地从ANN层中获得脉冲序列级别的代理梯度。有了这样的离散潜在表示,有效的SNN层非线性变换可以表示为

In this work, we consider the spike count as a discrete latent representation in deep SNNs and design ANN activation functions to approximate the spike count of the coupled SNN such that spike-train level surrogate gradients can be effectively derived from the ANN layer. With such a discrete latent representation, the effective nonlinear transformation at the SNN layer can be expressed as

\begin{equation}c^l_i=g(s^{l-1}; w^{l-1}_i, b^l_i)\end{equation}

式中,$g(\cdot)$ 为脉冲神经元所进行的有效神经转换。鉴于脉冲产生具有状态依赖性,直接确定从 $s^{l-1}$ 到 $c^l_i$ 的解析表达式是不可行的。为了回避这个问题,我们假设由 $s^{l-1}$ 产生的突触电流随时间均匀分布,从而简化脉冲生成过程。这一假设通过在整个时间窗口内重复施加相同的输入来实现,它也间接保证了后续各层输入电流的稳定性。此外,尽管后续层的输入脉冲序列具有随机性,但CNN中的脉冲神经元通常具有高扇入连接,可以补偿这种输入电流的可变性。第三,将膜时间常数 $\tau_m$ 设置为一个适当大的数值,使其积分时间窗口相对模拟时长而言足够大(因而在模拟时间内近似于一个IF神经元),脉冲神经元的放电速率便能迅速适应并稳定下来。这样就在每个时间步上产生一个恒定的突触电流 $I^{l,c}_i$

where $g(\cdot)$ denotes the effective neural transformation performed by spiking neurons. Given the state-dependent nature of spike generation, it is not feasible to directly determine an analytical expression from $s^{l-1}$ to $c^l_i$. To circumvent this problem, we simplify the spike generation process by assuming that the resulting synaptic currents from $s^{l-1}$ are evenly distributed over time. This assumption is realized by inputting the same inputs repeatedly over the whole time window. It also indirectly ensures the stability of input currents to the subsequent layers. Moreover, despite the randomness of input spike trains to the subsequent layers, the spiking neurons within a CNN typically have high fan-in connections that can compensate for such an input current variability. Third, by properly setting the membrane time constant $\tau_m$ to an appropriately large number so that it has a comparatively large integration time window compared to the duration of the simulation (hence, approximating an IF neuron over the time course of the simulation), the firing rate of spiking neurons can quickly adapt and stabilize. This yields a constant synaptic current $I^{l,c}_i$ at every time step

\begin{equation}I^{l,c}_i=\left(\sum_jw^{l-1}_{ij}c^{l-1}_j+b^l_iT\right)/T\end{equation}

将恒定突触电流 $I^{l,c}_i$ 代入(2),可得IF神经元脉冲间间隔的表达式如下:

Taking the constant synaptic current $I^{l,c}_i$ into (2), we, thus, obtain the following expression for the interspike interval of IF neurons:

\begin{equation}ISI^l_i=\rho\left(\frac{\vartheta}{RI^{l,c}_i}\right)\end{equation}

其中 $\rho(\cdot)$ 表示修正线性单元(ReLU)的非线性变换。如前一节所述,本工作中膜电阻 $R$ 被设为单位值,因此在式中省略。输出脉冲计数可以进一步近似如下:

where $\rho(\cdot)$ denotes the nonlinear transformation of the rectified linear unit (ReLU). As mentioned in the earlier section, the membrane resistance $R$ is assumed to be unitary in this work and, hence, dropped. The output spike count can be further approximated as follows:

\begin{equation}c^l_i=\frac{T}{ISI^l_i}=\frac{1}{\vartheta}\rho\left(\sum_jw^{l-1}_{ij}c^{l-1}_j+b^l_iT\right)\end{equation}

ϑ \vartheta ϑ设为1,(13)的形式与(8)中描述的模拟神经元的激活函数相同。具体地说,将 ρ ( ⋅ ) ρ(\cdot) ρ()设为模拟神经元的激活函数[见 f ( ⋅ ) f(\cdot) f()],脉冲计数 c j l − 1 c^{l−1}_j cjl1作为输入(见 x j l − 1 x^{l−1}_j xjl1),并将聚合常数注入电流 b i l T b^l_iT bilT作为相应模拟神经元的偏置项(见 b i l b^l_i bil),因此,这种配置允许脉冲计数与脉冲序列级误差梯度从耦合的权值共享ANN层近似。如图3所示,所提出的神经网络激活函数可以有效地近似图像分类任务中耦合SNN层的准确脉冲计数。近似误差可以被认为是随机噪声,它被证明可以提高训练神经网络[43]的泛化性。

By setting $\vartheta$ to 1, (13) takes the same form as the activation function of analog neurons, as described in (8). Specifically, by setting $\rho(\cdot)$ as the activation function [see $f(\cdot)$] for analog neurons, spike count $c^{l-1}_j$ as the input (see $x^{l-1}_j$), and aggregated constant injecting current $b^l_iT$ as the bias term (see $b^l_i$) for the corresponding analog neurons, this configuration allows the spike count and, hence, the spike-train level error gradients to be approximated from the coupled weight-sharing ANN layer. As shown in Fig. 3, it is apparent that the proposed ANN activation function can effectively approximate the exact spike count of the coupled SNN layers in an image classification task. The approximation errors can be considered as stochastic noise that was shown to improve the generalizability of the trained neural networks [43].
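Under these substitutions, the IF spike-count activation of (13) reduces to a ReLU over the incoming spike counts. A minimal sketch (illustrative names; $\vartheta = 1$ as in the experiments):

```python
import torch
import torch.nn.functional as F

def if_count_activation(c_prev, W, b, T, theta=1.0):
    """ANN activation of (13) approximating the spike count of the coupled IF
    layer: the incoming spike counts c_prev act as the input x of (8), and the
    aggregated constant injecting current b*T acts as the bias term."""
    return F.relu(c_prev @ W.t() + b * T) / theta
```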

图3
🖼️ 图3 串联网络(IF神经元)中作为离散神经表示的脉冲计数示意。图中给出了从CIFAR-10数据集中随机选取的一个样本的中间激活。每个卷积层的顶行和底行分别为来自SNN层的精确脉冲计数激活和来自耦合ANN层的近似脉冲计数激活。注意,这里只给出前8个特征图,并绘制在单独的块中。

Fig. 3. Illustration of spike counts as the discrete neural representation in the tandem network (IF neurons). The intermediate activations of a randomly selected sample from the CIFAR-10 dataset are provided. The top and bottom rows of each convolution layer refer to the exact and approximated spike count activations, derived from the SNN and the coupled ANN layer, respectively. Note that only the first eight feature maps are given and plotted in separated blocks.

按照与IF神经元相同的近似机制,我们也考虑将(11)中确定的恒定电流注入LIF神经元,此时可以根据(1),通过计算神经元从静息电位充电上升到发放阈值所需的时间来确定脉冲间间隔(更多细节请参阅补充材料)。因此,我们得到

Following the same approximation mechanism for the IF neuron, one can also consider injecting the constant current determined in (11) into the LIF neuron, and the interspike interval can be determined from (1) by calculating the charging period for neurons to rise from the resting potential to the firing threshold (more details in the Supplementary Material). Thus, we obtain

\begin{equation}ISI^l_i=\tau_m\log{\left[1+\dfrac{\vartheta}{\rho(I^{l,c}_i-\vartheta)}\right]}\end{equation}

因此,近似的脉冲计数可以计算为

Hence, the approximated spike count can be evaluated as

\begin{equation}c^l_i=\frac{T}{\tau_m\log{\left[1+\dfrac{\vartheta}{\rho(I^{l,c}_i-\vartheta)}\right]}}\end{equation}

然而,当 $I^{l,c}_i \le \vartheta$ 时,上式没有定义;当 $I^{l,c}_i-\vartheta$ 略大于零时,它在数值上也不稳定。为了解决这个问题,我们将ReLU激活函数 $\rho(\cdot)$ 替换为平滑代理 $\rho_s(\cdot)$,其定义如下:

However, the above equation is undefined when $I^{l,c}_i \le \vartheta$, and it is also numerically unstable when $I^{l,c}_i - \vartheta$ is marginally greater than zero. To address this, we replace the ReLU activation function $\rho(\cdot)$ with a smoothed surrogate $\rho_s(\cdot)$ that is defined as follows:

\begin{equation}\rho_s(x)=\log{(1+e^x)}\end{equation}

与IF神经元的做法相同,将脉冲计数 $c^{l-1}_j$ 和聚合的恒定注入电流 $b^l_iT$ 作为模拟神经元的输入,并以(15)作为激活函数,耦合的模拟神经元就可以很好地近似LIF神经元的脉冲计数。使用脉冲计数作为离散神经表示、并用共享权重的模拟神经元对其进行近似,使得代理梯度可以在脉冲序列级别上近似并用于误差反向传播。因此,其学习效率优于其他在每个时间步进行权重更新的代理梯度方法[25],[27],[28]。

The same as the IF neurons, by taking the spike count $c^{l-1}_j$ and the aggregated constant injecting current $b^l_iT$ as inputs for analog neurons, with (15) as the activation function, the spike count of the LIF neurons can be well approximated by the coupled analog neurons. The discrete neural representation using spike count and its approximation with weight-sharing analog neurons allow the surrogate gradients to be approximated at the spike-train level and applied during error backpropagation. Hence, it has a learning efficiency superior to other surrogate gradient methods [25], [27], [28] that perform weight update at each time step.
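Analogously, a minimal sketch of the LIF spike-count activation of (14)-(16), with the ReLU replaced by the softplus surrogate of (16), could read as follows (names and defaults are assumptions taken from the experimental settings in Section III-A):

```python
import torch
import torch.nn.functional as F

def lif_count_activation(c_prev, W, b, T, theta=0.1, tau_m=20.0):
    """ANN activation approximating the LIF spike count via (14)-(16)."""
    I_const = (c_prev @ W.t() + b * T) / T                           # constant current, (11)
    isi = tau_m * torch.log1p(theta / F.softplus(I_const - theta))   # interspike interval, (14)+(16)
    return T / isi                                                   # approximated spike count, (15)
```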

D 串联网络中的信用分配/Credit Assignment in the Tandem Network

由于自定义的ANN激活函数可以有效地近似脉冲神经元的离散神经表示,这促使我们思考:是否可以直接训练一个受约束的ANN,然后将其权重迁移到等效的SNN,即约束-然后训练方法[20],[21]。这样,深度SNN的训练就可以借助深度ANN的训练来实现,并且可以利用为ANN开发的大量工具和方法。

As the customized ANN activation functions can effectively approximate the discrete neural representation of spiking neurons, it prompts us to think whether it is feasible to directly train a constrained ANN and then transfer its weights to an equivalent SNN, i.e., the constrain-then-train approach [20], [21]. In this way, the training of a deep SNN can be achieved by that of a deep ANN, and a large number of tools and methods developed for ANNs can be leveraged.

我们将(13)作为约束神经网络的激活函数,随后将训练的权重传递给带有IF神经元的SNN。结果网络在MNIST数据集[21]上报告了具有竞争力的分类精度。然而,当将这种方法应用于时间窗口为10的更复杂的CIFAR-10数据集时,SNN与预训练的ANN相比出现了较大的分类精度下降(在我们的实验中约为21%)。通过仔细比较ANN近似的“脉冲计数”和SNN的实际脉冲计数,我们观察到ANN和SNN的同一层之间的脉冲计数差异越来越大,如图4(a)所示。这是因为来自ANN层的近似只提供了输出脉冲计数的平均估计,这忽略了输入脉冲序列的时间结构。因此,它可能会导致实际输出脉冲计数的差异。随着脉冲计数在各层间的移动,差异越来越大。虽然对于用于分类MNIST数据集[21]或编码时间窗口非常长的[20]的浅网络,这种脉冲计数差异可以忽略不计,但它在面对稀疏的突触活动和短时间窗口时具有巨大的影响。

We take (13) as the activation function for a constrained ANN and subsequently transfer the trained weights to the SNN with IF neurons. The resulting network reports a competitive classification accuracy on the MNIST dataset [21]. However, when applying this approach to the more complex CIFAR-10 dataset with a time window of 10, a large classification accuracy drop (around 21% in our experiment) occurred to the SNN from that of the pretrained ANN. By carefully comparing the ANN approximated “spike count” with the actual SNN spike count, we observe an increasing spike count discrepancy between the same layers of ANN and SNN, as shown in Fig. 4(a). This is due to the fact that the approximation from the ANN layer only provides a mean estimation of the output spike count, which ignores the temporal structure of input spike trains. Hence, it could cause discrepancies to the actual output spike counts. The discrepancy grows as the spike counts travel through the layers. While such spike count discrepancies could be negligible for a shallow network used for classifying the MNIST dataset [21] or with a very long encoding time window [20], it has huge impacts in the face of sparse synaptic activities and a short time window.

图4
🖼️ 图4 (a)用约束-然后训练方法产生的神经表示错误的总结,即实际SNN层输出与相同权重的约束ANN的近似输出之间的平均脉冲计数差。实验采用CifarNet(IF神经元)网络结构,编码时间窗口为10。(b)手工示例,用于说明SNN和近似ANN之间的脉冲计数近似误差,这通常发生在编码时间窗口短且神经元活动稀疏时。在这个例子中,虽然突触后IF神经元的聚合膜电位最终保持在放电阈值以下(一个有用的中间量,用于近似输出脉冲计数),但由于兴奋性突触的脉冲提前到达,输出脉冲就产生了。

Fig. 4. (a) Summary of the neural representation error that happened with the constrain-then-train approach, i.e., the mean spike count difference between the actual SNN layer outputs and those approximated from a constrained ANN of the same weights. The experiment is performed with a network structure of CifarNet (IF neurons) and an encoding time window of 10. (b) Handcrafted example for illustration of the spike count approximation error between the SNN and the approximated ANN, which usually happens when the encoding time window is short and neuronal activities are sparse. In this example, although the aggregate membrane potential of the postsynaptic IF neuron stays below the firing threshold in the end (a useful intermediate quantity that is applied to approximate the output spike count), an output spike is generated due to the early arrival of spikes from excitatory synapses.

为了演示在信息正向传播过程中如何出现这种近似误差,即神经表示误差,我们手工制作了一个示例,如图4(b)所示。尽管在编码时间窗口结束时,IF神经元的自由聚合膜电位保持在放电阈值以下,但由于兴奋性突触的脉冲提前到达,可能已经产生了输出脉冲。值得一提的是,在单层上可以很好地控制脉冲计数的差异,如图3和图4(a)所示(第一层的差异不显著)。然而,这种神经表示误差会在层间累积,并显著影响从训练过的ANN转移权重的SNN的分类精度。因此,为了有效地训练具有短编码时间窗口和稀疏突触活动的深度SNN,有必要在训练循环中导出带有SNN的精确神经表示。

To demonstrate how such an approximation error may occur during information forward-propagation, namely, neural representation error, we handcraft an example, as shown in Fig. 4(b). Although the free aggregate membrane potential, at the end of the encoding time window, of an IF neuron stays below the firing threshold, an output spike could have been generated due to the early arrival of spikes from the excitatory synapses. It is worth mentioning that the spike count discrepancy can be well controlled at the single layer, as shown in Figs. 3 and 4(a) (the discrepancy in the first layer is insignificant). However, such a neural representation error will accumulate across layers and significantly affect the classification accuracy of the SNN with weights transferred from a trained ANN. Therefore, to effectively train a deep SNN with a short encoding time window and sparse synaptic activities, it is necessary to derive an exact neural representation with SNN in the training loop.

为了解决这个问题,我们提出了串联学习框架。如图1所示,激活函数由(13)和(15)定义的ANN被用来实现经由ANN层的误差反向传播,而与该ANN共享权重的SNN则用于确定精确的神经表示(即脉冲计数和脉冲序列)。由SNN层确定的同步脉冲计数和脉冲序列分别被传送给后续的ANN层和SNN层。值得一提的是,在前向传递中,ANN层将前一个SNN层的输出作为输入。这样做的目的是通过交错的层使SNN的输入与ANN同步,而不是试图优化ANN的分类性能。

To solve this problem, we propose a tandem learning framework. As shown in Fig. 1, an ANN with activation function defined in (13) and (15) is employed to enable error backpropagation through the ANN layers, while the SNN, sharing weights with the coupled ANN, is employed to determine the exact neural representation (i.e., spike counts and spike trains). The synchronized spike counts and spike trains, determined from the SNN layer, are transmitted to the subsequent ANN and SNN layers, respectively. It is worth mentioning that, in the forward pass, the ANN layer takes the output of the previous SNN layer as the input. This aims at synchronizing the inputs of the SNN with ANN via the interlaced layers, rather than trying to optimize the classification performance of the ANN.

通过在串联网络的训练过程中纳入脉冲神经元的动态,向后续ANN层前向传播的是精确的输出脉冲计数,而不是ANN预测的脉冲计数。所提出的串联学习框架可以有效地防止神经表示误差跨层向前累积。虽然误差反向传播借助了耦合的ANN,但训练完成后,前向推理完全在SNN上执行。算法1给出了所提出的串联学习规则的伪代码。

By incorporating the dynamics of spiking neurons during the training of the tandem network, the exact output spike counts, instead of ANN predicted spike counts, are propagated forward to the subsequent ANN layer. The proposed tandem learning framework can effectively prevent the neural representation error from accumulating forward across layers. While a coupled ANN is harnessed for error backpropagation, the forward inference is executed entirely on the SNN after training. The pseudocode of the proposed tandem learning rule is given in Algorithm 1.

算法1 串联学习规则的伪代码
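Algorithm 1 appears only as an image in the original post. The sketch below shows one way a fully connected tandem (IF) layer could be wired in PyTorch under my reading of Fig. 1 and Section II-D: the SNN branch produces the exact spike trains and counts, the weight-sharing ANN branch evaluates (13) on the incoming spike counts, and a straight-through combination forwards the exact counts while letting gradients flow through the ANN expression. The class name, the straight-through trick, and the shape conventions are assumptions, not the authors' released implementation (their code is linked in Section III-A).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TandemLinearIF(nn.Module):
    """Minimal sketch of one fully connected tandem (IF) layer; cf. Fig. 1 and
    Algorithm 1. Illustrative, not the authors' code."""

    def __init__(self, n_in, n_out, T, theta=1.0):
        super().__init__()
        self.fc = nn.Linear(n_in, n_out)   # shared weights w and bias b
        self.T, self.theta = T, theta

    def forward(self, spike_in, count_in):
        # ANN branch: differentiable spike-count approximation, (13)
        ann_count = F.relu(self.fc(count_in) + self.fc.bias * (self.T - 1)) / self.theta
        # SNN branch: exact IF dynamics, (5)-(7); no gradient is taken through it
        with torch.no_grad():
            U = torch.zeros(spike_in.shape[1], self.fc.out_features)
            s = torch.zeros_like(U)
            spike_out = []
            for t in range(self.T):
                U = U + self.fc(spike_in[t]) - self.theta * s   # soft reset, (6)
                s = (U >= self.theta).float()                   # spike generation, (3)
                spike_out.append(s)
            spike_out = torch.stack(spike_out)                  # (T, batch, n_out)
            snn_count = spike_out.sum(dim=0)
        # Exact counts go forward; gradients come from the ANN approximation.
        count_out = ann_count + (snn_count - ann_count).detach()
        return spike_out, count_out
```

Stacking such layers and decoding the final output with (9) yields the full tandem network; during error backpropagation, gradients flow only through the ANN-side expression, matching the interlaced forward and backward passes described above.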

III 实验评价与讨论/Experimental Evaluation And Discussion

在本节中,我们首先评估了所提出的串联学习框架在基于帧的物体识别和图像重建任务上的学习能力。我们进一步讨论了为什么在串联网络中可以进行有效的学习。然后,我们评估串联学习规则对由事件驱动的相机传感器产生的异步输入的适用性。最后,我们讨论了所提出的串联学习规则可以实现的高学习效率和可扩展性、快速推理和突触操作减少的特性。

In this section, we first evaluate the learning capability of the proposed tandem learning framework on frame-based object recognition and image reconstruction tasks. We further discuss why effective learning can be performed within the tandem network. Then, we evaluate the applicability of the tandem learning rule to the asynchronous inputs generated from the event-driven camera sensors. Finally, we discuss the properties of high learning efficiency and scalability, rapid inference, and synaptic operation reductions that can be achieved with the proposed tandem learning rule.

A 实验设置/Experimental Setups

  1. 数据集:为了对所提出的串联学习规则进行综合评估,我们使用了三个传统的基于帧的图像数据集:MNIST、CIFAR-10[30]和ImageNet-12[44]。此外,我们还研究了串联学习对事件驱动视觉数据集的适用性:N-MNIST[45]和DVS-CIFAR10[46]。遵循[28]中采用的相同数据预处理程序,我们通过积累N-MNIST和DVS-CIFAR10数据集每10毫秒间隔内发生的脉冲来降低时间分辨率。关于实验数据集的更多细节在补充材料中提供。
  1. Datasets: To conduct a comprehensive evaluation on the proposed tandem learning rule, we use three conventional frame-based image datasets: MNIST, CIFAR-10 [30], and ImageNet-12 [44]. In addition, we also investigate the applicability of tandem learning to event-driven vision datasets: N-MNIST [45] and DVS-CIFAR10 [46]. Following the same data preprocessing procedures adopted in [28], we reduce the temporal resolution by accumulating the spikes that occurred within every 10-ms interval for the N-MNIST and DVS-CIFAR10 datasets. More details about the experimental datasets are provided in the Supplementary Material.
  2. 网络和训练配置:如表I所示,我们在基于帧的CIFAR-10和基于事件的DVS-CIFAR10数据集上使用一个具有七个可学习层的CNN(即CifarNet)进行对象识别。为了处理DVS-CIFAR10更高的输入维数,我们将每个卷积层的步幅增加到2,并将第一层的核大小增加到7。由于SNN离散时间建模的计算成本高、内存需求大,我们使用AlexNet[1]在大规模ImageNet-12数据集上进行对象识别。为了在N-MNIST数据集上进行对象识别,我们设计了一个七层CNN,称为DigitNet。对于MNIST数据集上的图像重建任务,我们评估了一个架构为784-256-128-64-128-256-784的脉冲自编码器,其中数字指的是每层神经元的数量。
  2. Network and Training Configurations: As shown in Table I, we use a CNN with seven learnable layers for object recognition on both the frame-based CIFAR-10 and event-based DVS-CIFAR10 datasets, namely, CifarNet. To handle the higher input dimensionality of the DVS-CIFAR10, we increase the stride of each convolution layer to 2 and the kernel size to 7 for the first layer. Due to the high computational cost and large memory requirements for discrete-time modeling of SNNs, we use the AlexNet [1] for object recognition on the large-scale ImageNet-12 dataset. For object recognition on the N-MNIST dataset, we design a seven-layer CNN called the DigitNet. For the image reconstruction task on the MNIST dataset, we evaluate a spiking autoencoder that has an architecture of 784-256-128-64-128-256-784, wherein the numbers refer to the number of neurons at each layer.

表1
📊 表1 用于(a) CIFAR-10和DVS-CIFAR10、(b) ImageNet-2012和(c) N-MNIST实验的网络架构。对于卷积层,括号中的值分别指过滤器的高度、宽度、步幅和数量。对于全连接层,括号中的值指神经元的数量。

TABLE I. NETWORK ARCHITECTURES USED FOR (a) CIFAR-10 AND DVS-CIFAR10, (b) IMAGENET-2012, AND (c) N-MNIST EXPERIMENTS. FOR CONV LAYERS, THE VALUES IN THE BRACKET REFER TO THE HEIGHT, WIDTH, STRIDE, AND NUMBER OF FILTERS, RESPECTIVELY. FOR THE FC LAYER, THE VALUE IN THE BRACKET REFERS TO THE NUMBER OF NEURONS

  3. 实现细节:为了减少对权重初始化的依赖并加快训练过程,我们在每个卷积层和全连接层之后增加了一个批归一化层。考虑到批归一化层只进行仿射变换,我们按照[18]中介绍的方法,将其参数折叠进前一层的权重,然后再应用到耦合的SNN层中(下文给出一个此折叠操作的示意代码)。我们将ANN-to-SNN转换工作中常用的平均池化操作替换为步幅为2的卷积操作,不仅以可学习的方式进行降维,而且还降低了计算成本和延迟[47]。对于使用IF神经元的SNN,我们将发放阈值设置为1;对于使用LIF神经元的SNN,我们将发放阈值设置为0.1,膜时间常数 $\tau_m$ 设置为20个时间步。
  3. Implementation Details: To reduce the dependency on weight initialization and accelerate the training process, we add a batch normalization layer after each convolution and fully connected layer. Given that the batch normalization layer only performs an affine transformation, we follow the approach introduced in [18] and integrate their parameters into the preceding layer's weights before applying them in the coupled SNN layer. We replace the average pooling operations that are commonly used in ANN-to-SNN conversion works, with a stride of two convolution operations, which not only perform dimensionality reduction in a learnable fashion but also reduce the computation cost and latency [47]. For SNNs with IF neurons, we set the firing threshold to 1. For SNNs with LIF neurons, we set the firing threshold to 0.1 and the membrane time constant $\tau_m$ to 20 time steps.

我们使用Pytorch[48]执行所有实验,除了在ImageNet-12数据集上的实验,其中我们使用Tensorpack工具箱[49]。我们遵循与Tensorpack中用于ImageNet-12对象识别任务相同的数据预处理程序(裁剪、翻转、均值归一化等)、优化器和学习速率衰减计划。对于CIFAR-10数据集,我们遵循与[28]相同的数据预处理和训练配置。Pytorch源代码可以在这里找到,更多的实现细节可以在补充材料中找到。

We perform all experiments with Pytorch [48], except for the experiment on the ImageNet-12 dataset, where we use the Tensorpack toolbox [49]. We follow the same data preprocessing procedures (crop, flip, mean normalization, and so on), optimizer, and learning rate decay schedule that are adopted in the Tensorpack for the ImageNet-12 object recognition task. For the CIFAR-10 dataset, we follow the same data preprocessing and training configurations, as in [28]. The Pytorch source codes can be found here, and more implementation details can be found in the Supplementary Material.
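As a small illustration of the batch-normalization folding mentioned in point 3) above, the sketch below absorbs a BatchNorm1d layer into the preceding fully connected layer (the convolutional case folds per output channel in the same way); the function name is hypothetical, and the formula is the standard affine-absorption identity rather than code from the paper.

```python
import torch
import torch.nn as nn

def fold_bn_into_linear(fc: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    """Fold y = gamma * (W x + b - mu) / sqrt(var + eps) + beta into a single
    linear layer with weights W' and bias b', so the result can be copied into
    the coupled SNN layer."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
    folded = nn.Linear(fc.in_features, fc.out_features)
    folded.weight.data = fc.weight.data * scale[:, None]
    folded.bias.data = (fc.bias.data - bn.running_mean) * scale + bn.bias.data
    return folded
```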

  4. 评估指标:对于基于帧和基于事件的视觉数据集上的物体识别任务,我们报告了测试集上的分类精度。对于MNIST数据集上的图像重建任务,我们报告了重建手写数字的均方误差(MSE)。我们为所有任务执行了三次独立的运行,并报告了所有运行的最佳结果,除了ImageNet-12数据集上的对象识别任务,其中只报告了一次运行的性能。
  4. Evaluation Metrics: For object recognition tasks on both the frame- and event-based vision datasets, we report the classification accuracy on the test sets. For the image reconstruction task on the MNIST dataset, we report the mean square error (MSE) of the reconstructed handwritten digits. We perform three independent runs for all tasks and report the best result across all runs, except for the object recognition task on the ImageNet-12 dataset where the performance of only a single run is reported.

为了研究SNN模型相对于ANN模型的计算效率,我们报告了等效ANN模型和SNN模型之间的能量消耗比:

To study the computational efficiency of SNN models over their ANN counterparts, we report the energy consumption ratio between the equivalent ANN and SNN models as follows:

\begin{equation}\frac{E_{\text{ANN}}}{E_{\text{SNN}}}=\frac{N_{\text{A,MAC}}\cdot E_{\text{MAC}}}{N_{\text{S,MAC}}\cdot E_{\text{MAC}}+N_{\text{AC}}\cdot E_{\text{AC}}}\end{equation}

其中,$E_{\text{ANN}}$ 和 $E_{\text{SNN}}$ 分别表示ANN和SNN的总能量消耗。$N_{\text{A,MAC}}$ 和 $N_{\text{S,MAC}}$ 分别为ANN和SNN使用的乘加(MAC)操作总数,$N_{\text{AC}}$ 指SNN使用的累加(AC)操作总数。$E_{\text{MAC}}$ 和 $E_{\text{AC}}$ 分别指每次MAC和每次AC操作的能耗。根据Han等人[50]的报告(并被[51]用于SNN能耗研究),在45 nm工艺下,一次32位浮点MAC操作消耗4.6 pJ,而一次32位浮点AC操作仅消耗0.9 pJ。

where $E_{\text{ANN}}$ and $E_{\text{SNN}}$ denote the total energy consumption by the ANN and the SNN, respectively. $N_{\text{A,MAC}}$ and $N_{\text{S,MAC}}$ are the total number of MAC operations used by the ANNs and SNNs, respectively. $N_{\text{AC}}$ is the total number of AC operations used by the SNN. $E_{\text{MAC}}$ and $E_{\text{AC}}$ refer to the energy cost per MAC and AC operation, respectively. As reported by Han et al. [50] and adopted for the study of the energy consumption of SNNs in [51], a 32-bit floating point MAC consumes 4.6 pJ, while a 32-bit floating point AC operation consumes only 0.9 pJ in 45-nm process technology.
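A short sketch of how (17) can be evaluated, using the per-operation energies quoted above; the operation counts in the usage comment are hypothetical numbers for illustration only.

```python
E_MAC, E_AC = 4.6, 0.9  # pJ per 32-bit floating-point MAC / AC (45 nm, Han et al. [50])

def energy_ratio(n_ann_mac, n_snn_mac, n_snn_ac):
    """E_ANN / E_SNN as in (17); the SNN term retains some MAC operations
    (e.g., from the analog-valued encoding layer) plus its AC-based spike updates."""
    return (n_ann_mac * E_MAC) / (n_snn_mac * E_MAC + n_snn_ac * E_AC)

# Hypothetical example: 1e9 ANN MACs vs. an SNN with 1e7 MACs and 2e8 ACs
# gives a ratio of about 4.6e9 / (4.6e7 + 1.8e8) ≈ 20.
```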

B 基于帧的物体识别结果/Frame-Based Object Recognition Results

对于CIFAR-10,如表II所示,使用IF神经元的SNN(以下记为SNN-IF)在脉冲计数解码和聚合膜电位解码下的测试准确率分别为87.41%和90.98%,而使用LIF神经元的SNN(SNN-LIF)的结果略差,分类准确率为89.04%,这可能是由平滑代理激活函数的近似误差造成的。然而,使用类似的、为近似LIF神经元放电速率而设计的激活函数[20],约束-然后训练方法在CIFAR-10数据集上的分类准确率仅为83.54%。这一结果证实了如串联学习网络所提出的那样,把SNN保留在训练循环中的必要性。此外,我们的脉冲CifarNet取得的结果也与最先进的ANN-to-SNN转换[18],[19],[51]和基于脉冲的学习[28],[51]方法相当。

For CIFAR-10, as provided in Table II, the SNN using IF neurons, denoted as SNN-IF, hereafter, achieves a test accuracy of 87.41% and 90.98% for spike count and aggregate membrane potential decoding, respectively, while the result is slightly worse for the SNN implementation with LIF neurons (SNN-LIF) that achieve a classification accuracy of 89.04%, which may be due to the approximation error of the smoothed surrogate activation function. Nevertheless, with a similar activation function designed to approximate the firing rate of LIF neurons [20], the constrain-then-train approach only achieves a classification accuracy of 83.54% on the CIFAR-10 dataset. This result confirms the necessity of keeping the SNN in the training loop as proposed in the tandem learning network. Moreover, the results achieved by our spiking CifarNet is also as competitive as the state-of-the-art ANN-to-SNN conversion [18], [19], [51] and spike-based learning [28], [51] methods.

表2
📊 表2 在CIFAR-10和ImageNet-12测试集上比较不同SNN实现的分类精度和推理速度

TABLE II. COMPARISON OF CLASSIFICATION ACCURACY AND INFERENCE SPEED OF DIFFERENT SNN IMPLEMENTATIONS ON THE CIFAR-10 AND IMAGENET-12 TEST SETS

如图2(a)所示,我们注意到采用脉冲计数解码时学习动态并不稳定,这是由于输出层导出的误差梯度是离散的。因此,在其余实验中我们使用聚合膜电位解码。尽管其学习收敛速度慢于使用ReLU激活函数的普通CNN和量化CNN(激活值按照量化感知训练方案[42]量化到3位),但SNN-IF的分类精度最终超过了量化CNN。为了研究编码时间窗口大小 $T$ 对分类精度的影响,我们使用IF神经元、令 $T$ 从1到8重复CIFAR-10实验。如图2(b)所示,分类精度随着时间窗口的增大而持续提高。这表明串联学习能够有效利用编码时间窗口(它决定了脉冲计数的上限)来表示信息。值得注意的是,在时间窗口大小仅为1的情况下,准确率仍可达到89.83%,这表明准确且快速的推理可以同时实现。

As shown in Fig. 2(a), we note that the learning dynamics with spike count decoding is unstable, which is attributed to the discrete error gradients derived at the output layer. Therefore, we use the aggregate membrane potential decoding for the rest of the experiments. Although the learning converges slower than the plain CNN with the ReLU activation function and quantized CNN (with activation value quantized to 3 bits following the quantization-aware training scheme [42]), the classification accuracy of the SNN-IF eventually surpasses that of the quantized CNN. To study the effect of the encoding time window size $T$ on the classification accuracy, we repeat the CIFAR-10 experiments using IF neurons with $T$ ranging from 1 to 8. As shown in Fig. 2(b), the classification accuracy improves consistently with a larger time window size. It suggests the effectiveness of tandem learning in making use of the encoding time window, which determines the upper bound of the spike count, to represent information. Notably, 89.83% accuracy can be achieved with a time window size of only 1, suggesting that accurate and rapid inference can be achieved simultaneously.

为了在大规模ImageNet-12数据集上使用基于脉冲的学习规则训练模型,需要大量的计算机内存来存储脉冲神经元的中间状态,并且需要大量的计算成本来进行离散时间模拟。因此,只有少数SNN实现在没有考虑训练过程中神经元脉冲的动态情况下,对这一具有挑战性的任务进行了一些成功的尝试,包括ANN到SNN转换[18],[19]和约束-然后训练[20]方法。

To train a model on the large-scale ImageNet-12 dataset with a spike-based learning rule, it requires a huge amount of computer memory to store the intermediate states of the spiking neurons and huge computational costs for a discrete-time simulation. Hence, only a few SNN implementations, without taking into consideration the dynamics of spiking neurons during training, have made some successful attempts on this challenging task, including ANN-to-SNN conversion [18], [19] and constrain-then-train [20] approaches.

如表II所示,在编码时间窗口为10个时间步的情况下,采用串联学习规则训练的AlexNet的top-1和top-5准确率分别为50.22%和73.60%。该结果与采用相同AlexNet架构的约束-然后训练方法[20](总时间步数为200)相当。值得注意的是,所提出的学习规则只需要10个推理时间步,至少比其他已报道的方法快一个数量级。虽然ANN-to-SNN转换方法[18],[19]在ImageNet-12上实现了更好的分类精度,但它们的成功在很大程度上归功于所使用的更先进的网络模型。

As shown in Table II, with an encoding time window of ten time steps, the AlexNet trained with the tandem learning rule achieves top-1 and top-5 accuracies of 50.22% and 73.60%, respectively. This result is comparable to that of the constrain-then-train approach with the same AlexNet architecture [20] with a total number of time steps of 200. Notably, the proposed learning rule only takes ten inference time steps that are at least an order of magnitude faster than the other reported methods. While the ANN-to-SNN conversion approaches [18], [19] achieve better classification accuracies on the ImageNet-12, their successes are largely credited to the more advanced network models used.

此外,我们注意到,与全精度激活的基线ANN实现(在原始AlexNet模型[1]的基础上修订:将池化层替换为步幅为2的卷积操作以匹配本工作中使用的AlexNet,并添加批归一化层)相比,串联学习的精度下降了约7%。为了研究离散神经表示对精度的影响(即精度下降中有多少源于激活量化,又有多少源于IF神经元的动态),我们将全精度ANN的激活值量化到仅10级。在单次试验中,得到的量化ANN的top-1和top-5错误率分别为50.27%和73.92%。这一结果与我们的SNN实现非常接近,表明激活函数的量化是精度下降的主要原因。

Furthermore, we note that tandem learning suffers an accuracy drop of around 7% from the baseline ANN implementation with full-precision activation (revised from the original AlexNet model [1] by replacing pooling layers with a stride of two convolution operations to match the AlexNet used in this work and adding batch normalization layers). To investigate the effect on the accuracy of the discrete neural representation (how much of the drop in the accuracy is due to activation quantization, and how much of it is due to the dynamics of the IF neuron), we modify the full-precision ANN by quantizing the activation values to only ten levels. In a single trial, the resulting quantized ANN achieves the top-1 and top-5 error rates of 50.27% and 73.92%, respectively. This result is very close to that of our SNN implementation, which suggests that the quantization of the activation function alone accounts for most of the accuracy drop.
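The paper does not spell out the exact quantizer used in this ablation; a plausible sketch that mimics a spike count bounded by a T-step window (here T = 10, with an assumed clipping bound) is:

```python
import torch

def quantize_like_spike_count(a, T=10, a_max=None):
    """Quantize full-precision ReLU activations onto the discrete levels
    {0, 1, ..., T} * (a_max / T), mirroring a spike count limited by an
    encoding window of T steps. a_max is an assumed clipping bound; the
    uniform scheme itself is an assumption."""
    a_max = a.max() if a_max is None else a_max
    return torch.clamp(torch.round(a / a_max * T), 0, T) * (a_max / T)
```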

C 基于事件的对象识别结果/Event-Based Object Recognition Results

仿生事件相机异步地捕捉逐像素的强度变化,表现出高动态范围、高时间分辨率和无运动模糊等引人注目的特性。因此,事件驱动视觉[58]作为传统基于帧的视觉的补充,在计算机视觉界引起了越来越多的关注。事件驱动视觉的早期研究集中在从捕获的事件流构建基于帧的特征表示,以便机器学习模型[56]、[57]或ANN[59]能够有效地处理。尽管这些工作取得了很好的结果,但基于帧的特征的后处理增加了延迟,即使在事件率较低时也会产生很高的计算成本。相比之下,异步SNN天然地处理基于事件的感觉输入,因此在构建完全事件驱动的系统方面具有巨大潜力。

The bioinspired event cameras capture per-pixel intensity change asynchronously, which exhibits compelling properties of high dynamic range, high temporal resolution, and no motion blur. Event-driven vision [58], therefore, attracts growing attention in the computer vision community as a complement to the conventional frame-based vision. The early research on event-driven vision focuses on constructing frame-based feature representation from the captured event streams such that it can be effectively processed by machine learning models [56], [57] or ANNs [59]. Despite the promising results achieved by these works, the postprocessing of frame-based features increases latency and incurs high computational costs even during low event rates. In contrast, the asynchronous SNNs naturally process event-based sensory inputs and, hence, hold great potential to build fully event-driven systems.

在这项工作中,我们研究了串联学习在训练SNN处理事件相机输入方面的适用性。为此,我们在N-MNIST和DVS-CIFAR10数据集上执行目标识别任务。如表III所示,对于N-MNIST数据集,我们的脉冲CNN在SNN-IF和SNN-LIF配置下分别达到99.31%和99.22%的准确率。这些结果优于许多现有的SNN实现[26],[27],[46],[54]和机器学习模型[53],[56],[57],同时与最近提出的一种基于脉冲的学习方法[28]所报告的最佳结果相当。

In this work, we investigate the applicability of tandem learning in training SNNs to handle inputs from event cameras. To this end, we perform object recognition tasks on the N-MNIST and DVS-CIFAR10 datasets. As shown in Table III, for the N-MNIST dataset, our spiking CNNs achieve accuracies of 99.31% and 99.22% for SNN-IF and SNN-LIF, respectively. These results outperform many existing SNN implementations [26], [27], [46], [54] and machine learning models [53], [56], [57] while being on par with the best reported result achieved by a recently introduced spike-based learning method [28].
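💡 译者注:N-MNIST和DVS-CIFAR10是事件相机数据集,输入是 (x, y, t, polarity) 形式的事件流。下面是译者补充的示意代码(并非论文作者的预处理流程,事件格式、分箱方式与张量布局均为译者假设),演示如何把事件流按时间分箱为形状 (T, 2, H, W) 的二值脉冲张量,以便SNN逐时间步处理。

```python
import numpy as np

def events_to_spike_tensor(events, T, H, W):
    """events: 形状为 (N, 4) 的数组,列依次为 x, y, t, polarity(0或1)。"""
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3].astype(int)
    t_norm = (t - t.min()) / (t.max() - t.min() + 1e-9)   # 把时间戳归一化到 [0, 1)
    step = np.minimum((t_norm * T).astype(int), T - 1)     # 每个事件所属的时间步
    tensor = np.zeros((T, 2, H, W), dtype=np.float32)
    tensor[step, p, y, x] = 1.0                             # 在对应位置记一个脉冲
    return tensor

# 示例:用随机合成的事件流演示
ev = np.stack([np.random.randint(0, 32, 1000),   # x
               np.random.randint(0, 32, 1000),   # y
               np.sort(np.random.rand(1000)),    # t
               np.random.randint(0, 2, 1000)],   # polarity
              axis=1)
print(events_to_spike_tensor(ev, T=10, H=32, W=32).shape)  # (10, 2, 32, 32)
```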

📊 表3 N-MNIST和DVS-CIFAR10数据集的目标识别结果比较

TABLE III. COMPARISON OF THE OBJECT RECOGNITION RESULTS ON THE N-MNIST AND DVS-CIFAR10 DATASETS

同样,我们的SNN模型在DVS-CIFAR10数据集上也取得了最先进的性能。这证明了所提出的串联学习规则在处理事件相机数据方面的有效性。为了解决DVS-CIFAR10数据集的数据稀缺问题,我们进一步探索了迁移学习:先在基于帧的CIFAR10数据集上预训练SNN模型,再在DVS-CIFAR10数据集上微调。值得注意的是,以这种方式训练的SNN模型精度提高了约7%。还需指出的是,尽管串联学习规则忽略了脉冲序列的时间结构、只将脉冲计数作为信息载体,它在这些数据集上仍然表现非常好,这可以用这些数据集在采集过程中几乎没有引入时间信息来解释[60]。

Similarly, our SNN models also report state-of-the-art performance on the DVS-CIFAR10 dataset. This demonstrates the effectiveness of the proposed tandem learning rule in handling event-driven camera data. To address the data scarcity of the DVS-CIFAR10 dataset, we further explore transfer learning by fine-tuning SNN models (pretrained on the frame-based CIFAR10 dataset) on the DVS-CIFAR10 dataset. Notably, the SNN models trained in this way achieve approximately 7% accuracy improvements. It is worth noting that, despite neglecting the temporal structure of spike trains and only considering spike counts as the information carrier, the tandem learning rule performs exceedingly well on these datasets, which can be explained by the fact that negligible temporal information is added during the collection of these datasets [60].

D 卓越的回归能力/Superior Regression Capability

为了探索串联学习用于回归任务的能力,我们使用全连接的脉冲自编码器在MNIST数据集上执行图像重建任务。如图5所示,采用本文提出的串联学习规则训练的自编码器可以有效地重建出高质量的图像。当时间窗口大小为32时,SNN-IF和SNN-LIF的MSE分别为0.0038和0.0072,相比等效ANN的0.0025性能略有下降。不过值得一提的是,通过利用脉冲序列的稀疏性,SNN可以提供比ANN的高精度浮点数表示更高的数据压缩率。如图6所示,随着时间窗口的增大,网络性能逐渐接近基线ANN,这与目标识别任务中的观察结果一致。

To explore the capability of tandem learning for regression tasks, we perform an image reconstruction task on the MNIST dataset with a fully connected spiking autoencoder. As shown in Fig. 5, the autoencoder trained with the proposed tandem learning rule can effectively reconstruct images with high quality. With a time window size of 32, the SNN-IF and SNN-LIF achieve MSEs of 0.0038 and 0.0072, respectively, a slight drop from the 0.0025 of an equivalent ANN. However, it is worth mentioning that, by leveraging the sparsity of spike trains, the SNNs can provide a much higher data compression rate than the high-precision floating-point representation of the ANN. As shown in Fig. 6, with a larger time window size, the network performance approaches that of the baseline ANN, which aligns with the observation in the object recognition tasks.
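💡 译者注:关于上文"SNN可以提供更高的数据压缩率"的说法,下面给出一个译者补充的粗略估算(瓶颈层维度等数字均为假设,且只按脉冲计数的位宽估算,未进一步利用脉冲序列的稀疏性):时间窗口为 $T$ 时,脉冲计数最多只需 $\lceil \log_2(T+1) \rceil$ 比特,而全精度浮点表示每个神经元需要32比特。

```python
import math

latent_dim = 128   # 假设的自编码器瓶颈层维度
T = 32             # 编码时间窗口,脉冲计数取值范围为 {0, ..., T}

bits_ann = latent_dim * 32                            # 32位浮点的ANN隐编码
bits_snn = latent_dim * math.ceil(math.log2(T + 1))   # 以脉冲计数表示的SNN隐编码
print(f"ANN: {bits_ann} bit, SNN: {bits_snn} bit, 压缩比约 {bits_ann / bits_snn:.1f} 倍")
```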

🖼️ 图5 MNIST数据集上的脉冲自编码器($T = 32$的IF神经元)重建图像的插图。对于每个数字,左列是原始图像,右列是重建图像。

Fig. 5. Illustration of the reconstructed images from a spiking autoencoder (IF neurons with $T = 32$) on the MNIST dataset. For each digit, the left column is the original image, and the right column is the reconstructed image.

🖼️ 图6 图像重建性能随编码时间窗口大小变化的示意图。

Fig. 6. Illustration of the image reconstruction performance as a function of the encoding time window size.

E 交错层内的激活方向保持与权重-激活点积比例/Activation Direction Preservation and Weight-Activation Dot Product Proportionality Within the Interlaced Layers

在展示了所提出的串联学习规则在物体识别和图像重建任务上的有效性之后,我们希望解释为什么学习可以通过交错的网络层有效地进行。为了回答这个问题,我们借鉴了最近一项关于二值神经网络的理论工作[61]的思想,其中学习同样是跨交错的网络层进行的(二值化激活被前向传播到后续层)。在所提出的串联网络中,第 $l-1$ 层的ANN激活值 $a^{l-1}$ 被替换为来自耦合SNN层的脉冲计数 $c^{l-1}$。我们进一步分析了这两个量之间的不匹配程度及其对激活正向传播和误差反向传播的影响。

After showing how effectively the proposed tandem learning rule performs on object recognition and image reconstruction tasks, we hope to explain why the learning can be performed effectively via the interlaced network layers. To answer this question, we borrow ideas from a recent theoretical work on binary neural networks [61], wherein learning is also performed across the interlaced network layers (binarized activations are forward propagated to subsequent layers). In the proposed tandem network, the ANN layer activation value $a^{l-1}$ at layer $l-1$ is replaced with the spike count $c^{l-1}$ derived from the coupled SNN layer. We further analyze the degree of mismatch between these two quantities and its effect on the activation forward propagation and error backpropagation.
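💡 译者注:下面用一段PyTorch示意代码(译者补充,并非论文作者的实现)说明这种交错的前向传播:SNN分支与ANN分支共享同一组权重,前向传播的数值取SNN给出的精确脉冲计数 $c^l$,而梯度仍通过ANN分支的近似激活 $a^l$ 回传。代码中用ReLU粗略代替论文中为IF神经元设计的近似激活函数,层结构、阈值等均为译者假设。

```python
import torch
import torch.nn as nn

class TandemLayer(nn.Module):
    """一个交错串联层的示意:SNN分支与ANN分支共享权重。"""

    def __init__(self, in_features, out_features, T=10, v_th=1.0):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)   # 共享权重
        self.T, self.v_th = T, v_th

    def snn_spike_count(self, spike_in):
        """spike_in: (T, batch, in_features) 的二值脉冲序列,返回输出脉冲计数 c^l。"""
        batch = spike_in.shape[1]
        v = torch.zeros(batch, self.fc.out_features, device=spike_in.device)
        count = torch.zeros_like(v)
        for t in range(self.T):                  # IF神经元,减阈值复位
            v = v + self.fc(spike_in[t])
            spikes = (v >= self.v_th).float()
            v = v - spikes * self.v_th
            count = count + spikes
        return count

    def forward(self, a_in, spike_in):
        a_out = torch.relu(self.fc(a_in))        # ANN分支:用连续激活近似脉冲计数
        with torch.no_grad():
            c_out = self.snn_spike_count(spike_in)   # SNN分支:精确脉冲计数,不求导
        # 前向数值等于 c_out,反向梯度流经 a_out(直通式替换)
        return a_out + (c_out - a_out).detach()
```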

在CIFAR-10上的数值实验中,我们随机抽取了一个包含256个测试样本的小批量,计算了所有卷积层中向量化的 $c^{l-1}$ 与 $a^{l-1}$ 之间的余弦夹角。如图7所示,它们的余弦夹角平均都在24°以下,并且这一关系在整个学习过程中始终保持。虽然这些角度在低维空间中看起来很大,但在高维空间中却非常小。根据超维计算理论[62]和二值神经网络的理论研究[61],任意两个高维随机向量近似正交。同样值得注意的是,用 $c^{l-1}$ 替换 $a^{l-1}$ 造成的失真比二值化一个随机高维向量要轻,后者在理论上会使余弦夹角改变37°。在后续ANN层的激活函数以及从其反向传播回来的误差梯度保持不变的前提下,误差反向传播的失真在局部上受限于 $a^{l-1}$ 与 $c^{l-1}$ 之间的差异。

In our numerical experiments on CIFAR-10 with a randomly drawn minibatch of 256 test samples, we calculate the cosine angle between the vectorized $c^{l-1}$ and $a^{l-1}$ for all the convolution layers. As shown in Fig. 7, their cosine angles are below 24° on average, and such a relationship is maintained consistently throughout learning. While these angles seem large in low dimensions, they are exceedingly small in a high-dimensional space. According to the hyperdimensional computing theory [62] and the theoretical study of binary neural networks [61], any two high-dimensional random vectors are approximately orthogonal. It is also worth noting that the distortion of replacing $a^{l-1}$ with $c^{l-1}$ is less severe than binarizing a random high-dimensional vector, which changes the cosine angle by 37° in theory. Given that the activation function and the error gradients backpropagated from the subsequent ANN layer remain unchanged, the distortions to the error backpropagation are bounded locally by the discrepancy between $a^{l-1}$ and $c^{l-1}$.
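💡 译者注:上文所说的"向量化 $c^{l-1}$ 与 $a^{l-1}$ 之间的余弦夹角"可以按下面的示意代码计算(译者补充,数据为随机合成,仅用于说明计算方式)。

```python
import numpy as np

def cosine_angle_deg(a, c):
    """两个展平(向量化)张量之间的余弦夹角,单位为度。"""
    a, c = a.ravel(), c.ravel()
    cos = np.dot(a, c) / (np.linalg.norm(a) * np.linalg.norm(c) + 1e-12)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# 示例:非负激活与其"取整成脉冲计数"后的对应量,在高维下夹角通常很小
a = np.random.rand(256, 64) * 3.0           # 假设的ANN激活 a^{l-1}
c = np.floor(a + np.random.rand(*a.shape))  # 假设的SNN脉冲计数 c^{l-1}
print(cosine_angle_deg(a, c))
```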

🖼️ 图7 ANN与SNN层输出脉冲计数之间不匹配误差的分析。(a)第30个和(b)第200个训练周期(epoch)时,所有卷积层的向量化 $c$ 与 $a$ 之间余弦夹角的分布。虽然这些角度在低维空间中看起来很大,但在高维空间中却非常小。(c)第30个和(d)第200个训练周期时,权重-激活点积 $c \cdot W$ 与 $a \cdot W$ 之间PCC的分布。在整个学习过程中,PCC始终保持在0.9以上,这表明权重-激活点积的线性关系大致保持。

Fig. 7. Analysis of mismatch errors between output spike counts of ANN and SNN layers. Distribution of cosine angles between vectorized $c$ and $a$ for all convolution layers at epochs (a) 30 and (b) 200. While these angles seem large in low dimensions, they are exceedingly small in a high-dimensional space. Distribution of PCCs between weight-activation dot products $c \cdot W$ and $a \cdot W$ at epochs (c) 30 and (d) 200. The PCCs maintain consistently above 0.9 throughout learning, suggesting that the linear relationship of weight-activation dot products is approximately preserved.

此外,我们计算了权重-激活点积 $c^l \cdot W$ 与 $a^l \cdot W$ 之间的皮尔逊相关系数(PCC),该点积是我们当前网络配置中一个重要的中间量(批量归一化层的输入)。PCC的取值范围为−1到1,用于度量两个变量之间的线性相关性,取值为1意味着完全的正线性关系。如图7所示,对于大部分样本,PCC在整个学习过程中始终保持在0.9以上,表明权重-激活点积的线性关系大致保持。

Furthermore, we calculate the Pearson correlation coefficient (PCC) between the weight-activation dot products $c^l \cdot W$ and $a^l \cdot W$, which is an important intermediate quantity (input to the batch normalization layer) in our current network configurations. The PCC, ranging from −1 to 1, measures the linear correlation between two variables. A value of 1 implies a perfect positive linear relationship. As shown in Fig. 7, the PCC maintains consistently above 0.9 throughout learning for most of the samples, suggesting that the linear relationship of weight-activation dot products is approximately preserved.
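💡 译者注:权重-激活点积之间的皮尔逊相关系数可以按下面的示意代码计算(译者补充,权重与激活均为随机合成数据,仅用于说明计算方式)。

```python
import numpy as np

def pearson_cc(x, y):
    """两个展平张量之间的皮尔逊相关系数。"""
    x, y = x.ravel(), y.ravel()
    x, y = x - x.mean(), y - y.mean()
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)

W = np.random.randn(64, 128)                # 假设的共享权重矩阵
a = np.random.rand(256, 64) * 3.0           # 假设的ANN激活 a^l
c = np.floor(a + np.random.rand(*a.shape))  # 假设的脉冲计数 c^l
print(pearson_cc(c @ W, a @ W))             # 接近1,说明点积的线性关系大致保持
```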

F 通过脉冲序列级代理梯度的高效学习/Efficient Learning Through Spike-Train Level Surrogate Gradient

在本节中,我们将所提出的串联学习规则的学习效率与流行的代理梯度学习方法[25]进行比较。代理梯度学习方法把脉冲神经元随时间变化的动态描述为一个RNN,并使用基于BPTT的训练方法优化网络参数。在误差反向传播阶段,不可微的脉冲生成函数被连续的代理函数替代,从而可以为每个时间步确定一个代理梯度。相比之下,串联学习在脉冲序列级别确定误差梯度,因此可以显著提高学习效率。

In this section, we compare the learning efficiency of the proposed tandem learning rule to the popular family of surrogate gradient learning methods [25]. The surrogate gradient learning methods describe the time-dependent dynamics of spiking neurons with an RNN, whereby the BPTT-based training method is used to optimize the network parameters. The nondifferentiable spike generation function is replaced by a continuous surrogate function during the error backpropagation phase, such that a surrogate gradient can be determined for each time step. In contrast, tandem learning determines the error gradient at the spike-train level; therefore, it can significantly improve the learning efficiency.

在这里,我们比较了串联学习与[28]中提出的代理梯度学习方法的学习效率。如图8所示,在使用LIF神经元的CIFAR-10实验中,由于基于BPTT的方法需要存储中间神经元状态并参与计算,其计算时间和GPU内存占用随时间窗口大小 $T$ 线性增长。值得注意的是,采用这种基于BPTT的方法,当 $T = 8$ 时,SNN已无法装入显存为11GB的单张Nvidia GeForce GTX 1080Ti GPU。因此,这阻碍了该方法在更具挑战性任务上的大规模部署,其他工作[25]中也提到了这一点。相比之下,串联学习不需要存储每个时间步的中间神经元状态;因此在 $T = 8$ 时,它比基于BPTT的方法加速了2.45倍,GPU内存占用减少了2.37倍。对于更大的时间窗口,计算效率的提升预计会更加明显。此外,对于所有不同的 $T$,使用串联学习方法训练的SNN都获得了比基于BPTT的方法更高的测试准确率。这可以解释为:代理梯度方法的无偏近似通常需要随着时间窗口 $T$ 的增大才能改善,而所提出的串联学习规则则没有这一限制。此外,串联学习规则有效地避免了基于BPTT的方法中存在的梯度消失和梯度爆炸问题,并且可以方便地集成批归一化方法来改善训练收敛性。因此,所提出的串联学习方法具有更好的学习效果、效率和可扩展性。

Here, we compare the learning efficiency of tandem learning with the surrogate gradient learning method presented in [28]. As shown in Fig. 8, for the experiment on the CIFAR-10 dataset with LIF neurons, the computation time and GPU memory usage grow linearly with the time window size $T$ for the BPTT-based method since it requires storing and computing with the intermediate neuronal states. It is worth noting that, with this BPTT-based method, the SNN is unable to fit onto a single Nvidia GeForce GTX 1080Ti GPU card with 11-GB memory space when $T = 8$. Therefore, it prevents a large-scale deployment of this approach for more challenging tasks, as also mentioned in other works [25]. In contrast, the storage of the intermediate neuronal states of each time step is not required for tandem learning; therefore, it shows a speedup of 2.45 times over the BPTT-based approach with 2.37 times lower GPU memory usage at $T = 8$. The improvements in computational efficiency are expected to be further boosted for larger time window sizes. Furthermore, the SNNs trained with the tandem learning approach achieve higher test accuracies than the BPTT-based approach consistently for all different $T$. This can be explained by the fact that the unbiased approximation of surrogate gradient methods generally improves with an increasing time window $T$, which is not the case for the proposed tandem learning rule. Moreover, the tandem learning rule effectively circumvents the vanishing and exploding gradient problems that exist in the BPTT-based methods and allows easy integration of the batch normalization method to improve training convergence. Therefore, the proposed tandem learning approach demonstrates much better learning effectiveness, efficiency, and scalability.

🖼️ 图8 所提出的串联学习方法与基于BPTT的代理梯度方法在计算时间、GPU内存占用和测试精度上的比较。结果作为编码时间窗口大小 $T$ 的函数给出。

Fig. 8. Comparison of the computation time, GPU memory usage, and test accuracy of the proposed tandem learning approach over the BPTT-based surrogate gradient method. The results are provided as a function of the encoding time window size $T$.

G 减少突触操作,实现快速推理/Rapid Inference With Reduced Synaptic Operations

如表II所示,在不影响分类精度的情况下,使用所提出的学习规则训练的SNN的推理速度比其他学习规则至少快一个数量级。此外,如图2和图6所示,所提出的串联学习规则可以处理并利用不同的编码时间窗口大小 $T$。在最具挑战性的情况下,即只允许传输一个脉冲(即 $T = 1$)时,我们在物体识别和图像重建任务上都能获得令人满意的结果。具体而言,对于CIFAR-10数据集上的物体识别任务,我们对IF和LIF神经元分别实现了89.52%和88.82%的准确率。这远高于最近提出的面向神经形态计算(NC)硬件的二值神经网络实现的84.67%[63]。这可能部分归功于我们所采用的编码方案,其中完整的输入信息可以在第一个时间步内完成编码。此外,每个卷积层后添加的批归一化层以及全连接层保证了信息有效地传递到顶层。通过增大时间窗口可以进一步改善结果;因此,可以根据不同的应用需求在推理速度和准确性之间进行权衡。

As shown in Table II, the SNN trained with the proposed learning rule can perform inference at least one order of magnitude faster than other learning rules without compromising the classification accuracy. Moreover, as demonstrated in Figs. 2 and 6, the proposed tandem learning rule can deal with and utilize different encoding time window sizes $T$. In the most challenging case when only one spike is allowed to transmit (i.e., $T = 1$), we are able to achieve satisfying results for both object recognition and image reconstruction tasks. Specifically, for the object recognition task on the CIFAR-10 dataset, we achieve accuracies of 89.52% and 88.82% for the IF and LIF neurons, respectively. This is much higher than a recently introduced binary neural network implementation for the NC hardware with an accuracy of 84.67% [63]. This may be partially credited to the encoding scheme that we have employed, whereby full input information can be encoded in the first time step. Besides, the batch normalization layer, which is added after each convolution, and the fully connected layer ensure effective information transmission to the top layers. The results can be improved further by increasing the time window size; therefore, a tradeoff between inference speed and accuracy can be achieved according to different application requirements.

为了研究训练后SNN的能量效率,我们遵循惯例,在CIFAR-10数据集上计算ANN与SNN的总能耗比。值得注意的是,ANN所需的总能耗是一个与时间窗口大小无关的固定值,而SNN所需的总能耗几乎随时间窗口大小线性增长,如图9(a)所示。在分类精度相当的情况下,ANN模型所需的总能耗分别是SNN-IF和SNN-LIF(在 $T = 8$ 时)的15.96倍和20.67倍。这可以用SNN所需的推理时间短以及突触活动稀疏来解释,如图9(b)所示。相比之下,在类似的VGGNet-9网络[51]上,采用ANN-to-SNN转换和基于脉冲学习方法的最先进SNN实现的总能耗比分别只有0.18和1.4。这表明我们的SNN实现的能效至少高出一个数量级。在ImageNet数据集上,由于需要昂贵MAC操作的第一个卷积层占用了很大比例的能耗,当 $T$ 分别等于5和10时,能耗比略微降低到4.79和3.39。

To study the energy efficiency of the trained SNNs, we follow the convention of calculating the total energy ratio of the ANN to the SNN on the CIFAR-10 dataset. Notably, the total energy consumption required for ANNs is a fixed number that is independent of the time window size, whereas it grows almost linearly with the time window size for SNNs, as shown in Fig. 9(a). With a comparable classification accuracy, the ANN model requires 15.96 and 20.67 times the total energy of the SNN-IF and SNN-LIF (at $T = 8$), respectively. This can be explained by the short inference time required for the SNNs and the sparse synaptic activities, as summarized in Fig. 9(b). In contrast, the state-of-the-art SNN implementations with the ANN-to-SNN conversion and spike-based learning methods require total energy ratios of 0.18 and 1.4, respectively, on a similar VGGNet-9 network [51]. This suggests that our SNN implementation is at least an order of magnitude more efficient. On the ImageNet dataset, due to the large percentage of energy taken by the first convolution layer that requires costly MAC operations, the energy ratio reduces slightly to 4.79 and 3.39 with $T$ equal to 5 and 10, respectively.
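💡 译者注:上文的"总能耗比"遵循的惯例大致是:ANN的代价按固定的MAC运算数计,SNN的代价按随时间步与脉冲率增长的突触操作数(SynOps)计。下面是译者补充的粗略估算示意(层规模与脉冲率均为假设数字,并忽略单次MAC与单次累加操作的能耗差异)。

```python
fan_in  = [1024, 512, 256]   # 假设的各层输入神经元数
fan_out = [512, 256, 10]     # 假设的各层输出神经元数
T = 8                        # 编码时间窗口
rate = [0.10, 0.05, 0.02]    # 每个神经元每时间步的平均脉冲数(类似图9(b)的统计量)

# ANN:每层的MAC数固定,与时间窗口无关
ann_macs = sum(i * o for i, o in zip(fan_in, fan_out))

# SNN:每个前向脉冲触发 fan_out 次突触累加(SynOp),随 T 和脉冲率线性增长
snn_synops = sum(T * r * i * o for r, i, o in zip(rate, fan_in, fan_out))

print("ANN MACs:", ann_macs)
print("SNN SynOps:", snn_synops)
print("能耗比(ANN/SNN)约为", round(ann_macs / snn_synops, 2))
```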

🖼️ 图9 (a) CIFAR-10数据集上ANN与SNN的总能耗比随编码时间窗口大小的变化。(b) 训练后SNN模型每个神经元每时间步的平均脉冲计数($T = 8$)。在所有网络层都可以观察到稀疏的神经元活动,从而在神经形态硬件上实现低功耗。

Fig. 9. (a) Total energy ratio of ANN to SNN as a function of the encoding time window size on the CIFAR-10 dataset. (b) Average spike count per neuron per time step of the trained SNN model ($T = 8$). Sparse neuronal activities can be observed across all network layers, leading to low power consumption when implemented on neuromorphic hardware.

IV 讨论与结论/Discussion And Conclusion

在这项工作中,我们提出了一种新颖的串联神经网络及其学习规则,用于有效地训练SNN完成分类和回归任务。在串联神经网络中,SNN用于确定激活正向传播所需的精确脉冲计数和脉冲序列,而与之共享权重的ANN则用于近似脉冲计数,从而在脉冲序列级别近似耦合SNN的梯度。由于误差反向传播是在简化的ANN上执行的,所提出的学习规则在内存和计算上都比流行的代理梯度学习方法更高效,后者需要在每个时间步进行梯度近似[25],[27]-[29]。

In this work, we introduce a novel tandem neural network and its learning rule to effectively train SNNs for classification and regression tasks. Within the tandem neural network, an SNN is employed to determine exact spike counts and spike trains for the activation forward propagation, while an ANN, sharing the weight with the coupled SNN, is used to approximate the spike counts and, hence, gradients of the coupled SNN at the spike-train level. Given that error backpropagation is performed on the simplified ANN, the proposed learning rule is both memory and computationally more efficient than the popular surrogate gradient learning methods that perform gradient approximation at each time step [25], [27]–[29].

为了理解为什么学习可以在串联学习框架中有效地进行,我们研究了串联网络的学习动态,并将其与完整的ANN进行了比较。在CIFAR-10上的实证研究表明,在高维空间中,向量化的ANN输出 $a^l$ 与耦合SNN输出的脉冲计数 $c^l$ 之间的余弦距离非常小,并且这种关系在整个训练过程中都得以保持。此外,权重-激活点积 $c^l \cdot W$ 与 $a^l \cdot W$(激活正向传播中的重要中间量)之间表现出很强的正相关(PCC),表明权重-激活点积之间的线性关系得到了很好的保持。

To understand why the learning can be effectively performed within the tandem learning framework, we study the learning dynamics of the tandem network and compare it with an intact ANN. The empirical study on the CIFAR-10 reveals that the cosine distances between the vectorized ANN output $a^l$ and the coupled SNN output spike count $c^l$ are exceedingly small in a high-dimensional space and such a relationship is maintained throughout the training. Furthermore, strongly positive PCCs are exhibited between the weight-activation dot products $c^l \cdot W$ and $a^l \cdot W$, an important intermediate quantity in the activation forward propagation, suggesting that a linear relationship of weight-activation dot products is well preserved.

使用所提出的串联学习规则训练的SNN在基于帧和基于事件的物体识别任务中都表现出了有竞争力的分类准确性。通过有效利用时间窗口大小(它决定了脉冲计数的上限)来表示信息,并添加批归一化层以确保有效的信息流,我们在实验中展示了快速推理能力:与其他SNN实现相比至少节省了一个数量级的时间。此外,通过利用稀疏的神经元活动和较短的编码时间窗口,总突触操作数也比基线ANN和其他最先进的SNN实现减少了至少一个数量级。

The SNNs trained with the proposed tandem learning rule have demonstrated competitive classification accuracies on both the frame- and event-based object recognition tasks. By making efficient use of the time window size, which determines the upper bound of the spike count, to represent information and by adding batch normalization layers to ensure effective information flow, rapid inference, with at least an order of magnitude time saving compared with other SNN implementations, is demonstrated in our experiments. Furthermore, by leveraging the sparse neuronal activities and the short encoding time window, the total synaptic operations are also reduced by at least an order of magnitude over the baseline ANNs and other state-of-the-art SNN implementations.

在未来的工作中,我们将通过为LIF神经元设计更有效的近似函数并评估更先进的网络架构,探索缩小使用LIF神经元的SNN与基线ANN之间精度差距的策略。此外,我们要承认,由于串联学习规则忽略了脉冲序列的时间结构,它并不适用于时间序列学习,因为这类任务需要为每个时间步或每个脉冲(而不是在脉冲计数级别)确定误差函数。为了解决时间结构很重要的任务,如手势识别和光流估计[64],我们有兴趣研究一种混合网络结构是否有用:该结构包含用于特征提取的前馈网络和用于序列建模的循环网络。具体而言,经过串联学习训练的前馈SNN可以在短时间尺度上作为强大的基于发放率的特征提取器[65],而后续的脉冲RNN[35]可以用于在更长的时间尺度上显式地处理底层模式的时间结构。

For future work, we will explore strategies to close the accuracy gap between the baseline ANN and the SNN using LIF neurons by designing a more effective approximating function for the LIF neuron and evaluating more advanced network architectures. In addition, we would like to acknowledge that the tandem learning rule, which neglects the temporal structure of spike trains, is not applicable to temporal sequence learning because the error function is required to be determined for each time step or spike rather than at the spike-count level. To solve tasks where the temporal structure is important, such as gesture recognition and optical flow estimation [64], we are interested in studying whether a hybrid network structure that includes a feedforward network for feature extraction and a recurrent network for sequence modeling could be useful. Specifically, the feedforward SNN trained with tandem learning can work as a powerful rate-based feature extractor on the short time scale [65], while a subsequent spiking RNN [35] can be used to explicitly handle the temporal structure of the underlying patterns on the longer time scale.

