[Paper Close Reading] Few-shot domain-adaptive anomaly detection for cross-site brain images


Paper link: Few-shot domain-adaptive anomaly detection for cross-site brain images | IEEE Journals & Magazine | IEEE Xplore

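The English is typed entirely by hand! It summarizes and paraphrases the original paper. Spelling and grammar mistakes may be hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution!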

1. TL;DR

1.1. Takeaways

1.2. Paper summary figure

2. Section-by-section close reading

2.1. Abstract

2.2. Introduction

2.3. Related work

2.3.1. Classification of mental disorders

2.3.2. Few-shot learning for anomaly detection

2.3.3. Cross-domain few-shot learning

2.4. Materials

2.4.1. Demographic, clinical and imaging information of data

2.4.2. Preprocessing

2.4.3. Functional connectivity measures

2.5. Proposed algorithm

2.5.1. Problem definition

2.5.2. Deep semi-supervised anomaly detection (DSAD)

2.5.3. Residual correction block (RCB)

2.5.4. Conditional adversarial domain adaptation revisited

2.5.5. Overall formulation of the FAAD algorithm

2.6. Experiment

2.6.1. Baseline method

2.6.2. Implementation details

2.6.3. Results and analysis

2.7. Discussion

2.8. Conclusion

3. Supplementary knowledge

3.1. Hypersphere

3.2. Meta-learning

3.3. Manifold

3.4. Canonical Correlation Analysis (CCA)

4. Reference List

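1. TL;DR

1.1. Takeaways

(1) This Intro suddenly lit up my dull days of reading one repetitive paper after another. Is this the charm of TPAMI?

(2) I now suspect that brain image classification keeps underperforming partly because subjects may also have other conditions... (Oh wow, in Section 3.1 of the paper (not my 3.1; it is my 2.4.1) they actually state: "Patients had no history of neurological disease, severe medical illness, substance abuse, or electroconvulsive therapy. All healthy controls were unrelated to the SCZ or MDD patients. They were also assessed according to DSM-IV criteria. None had acute physical illness, substance abuse or dependence, a history of head injury resulting in loss of consciousness, or a major psychiatric or neurological illness." I don't know whether other papers do this; most likely, if they do, it is not in the main text.)

(3) Listing Related Work by author names is really... hard to comment on. Why not give the model names?

(4) The paper also explains why it uses fMRI rather than sMRI: "The pathological changes caused by mental disorders are usually functional rather than structural, especially in the early stages."

(5) The paper explains why it uses ROI-based FC instead of voxel-wise FC: "Voxel-wise FC was not adopted because of its ultra-high dimensionality (on the order of billions) and low signal-to-noise ratio (SNR)."

(6) I finally understand what a label space is: it is like how the indicators measured at different hospitals are not actually the same

(7) My own discussion: it suddenly seems to me that ROIs should be small for attention-based models and large for ordinary ones

1.2. Paper summary figure

2. Section-by-section close reading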

2.1. Abstract

①To address the problem that fMRI data come from different sites, the authors propose few-shot domain-adaptive anomaly detection (FAAD)

②They first apply domain adaptation to reduce the differences between sites, and then combine the features from different sites

③The source-domain dataset is the Human Connectome Project (HCP)

2.2. Introduction

①It is hard to obtain a sufficient number of correctly labeled samples

②⭐Unsupervised methods risk overfitting because the dimension of functional connectivity is very high, the number of samples is limited, and the differences between samples are significant

③⭐In reality, healthy people far outnumber Alzheimer's patients. If a dataset follows this real-world ratio of AD to HC, the accuracy of binary classification may decrease

④⭐Accordingly, they use a large number of healthy samples as the pretraining set and then apply anomaly detection across sites.

⑤The authors raise a label-space issue here: the label space of a purely healthy source domain may differ from that of a target domain containing both healthy and unhealthy subjects, so traditional adaptation methods cannot be used. They argue that "general and conditional domain adaptation need to be applied, so that the feature distributions of the two domains are aligned while the discriminative ability of the trained model is preserved"

⑥The schematic of their FAAD:

⑦Their contributions: a) they are the first to apply anomaly detection to the classification of psychiatric disorders, b) with one class in the source dataset and two classes (only one of them new) in the target dataset, they alleviate the distribution difference between the two datasets, c) they align the general feature distribution and the conditional distribution between the source and target datasets at the same time

interrater adj. between raters; refers to the agreement or reliability across different raters

delineate v. to describe or explain in detail; to mark out (a boundary)

schematic adj. in outline form; diagrammatic n. diagram

authenticity n. genuineness, reliability

2.3. Related work

2.3.1. Classification of mental disorders

①Shen et al. classified schizophrenia (SCZ) and HC by locally linear embedding and C-means clustering

②Zeng et al. classified depression and HC by whole brain FC and SVM

③Zeng et al. later classified SCZ and HC with a discriminant autoencoder network with sparsity constraint (DANS), combining data from different sites

④Sui et al. predicted the cognitive domain scores of SCZ by extracting features from multimodal MRI images

⑤Li et al. classified posttraumatic stress disorder (PTSD) and HC by dynamic FC

⑥Gopinath et al. predicted the stage of AD with a new learnable graph pooling method

⑦Lian et al. extracted multi-scale features of AD with a hierarchical fully convolutional network (H-FCN)

⑧Mourao-Miranda et al. classified patients via anomaly detection with an SVM, but the dataset contained only 38 samples

morphometry n. the quantitative measurement of form or shape

2.3.2. Few-shot learning for anomaly detection

①Anomaly detection, also called outlier detection or novelty detection, tries to enclose all training samples (normal samples) in a hypersphere as tightly as possible. Samples that fall outside the hypersphere are treated as abnormal

②A small number of labeled anomalies helps to delineate the hypersphere better

③Lu et al. proposed a few-shot scene-adaptive outlier detection method

④Ding et al. put forward graph deviation networks (GDN) and a new cross-network meta-learning algorithm

⑤Koizumi et al. proposed a few-shot method to train a cascaded specific anomaly detector

⑥Meta-learning is hard to use here because there is only a single domain (meta-learning needs diversity) and unseen labels can only be used during fine-tuning

a.k.a. abbr. also known as (often used to introduce a nickname or stage name)

2.3.3. Cross-domain few-shot learning

①Most cross-domain methods assume that the source domain and the target domain share the same label space

②Guan et al. proposed the triplet autoencoder (TriAE) model

③Zhao et al. put forward the domain-adversarial prototypical network (DAPN), which uses meta-learning and N-way k-shot classification. N-way k-shot means the support set contains N classes with k samples in each class. A query set containing the same N classes is then used to measure performance. Because N classes are required while the source domain here contains only one class, this method cannot be applied to disease classification
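
To make the N-way k-shot requirement concrete, here is a minimal sketch of episode sampling in plain Python. The function name `sample_episode` and its parameters are hypothetical and not from the paper; the point is only that sampling fails as soon as fewer than N classes are available, which is the situation of an HC-only source domain.

```python
import random


def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Return (support, query) index lists for one N-way k-shot episode."""
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    # Only classes with enough samples for support + query can take part.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + n_query)
    if len(eligible) < n_way:
        raise ValueError(f"need {n_way} classes, only {len(eligible)} available")
    support, query = [], []
    for c in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += picked[:k_shot]   # k labeled samples per class
        query += picked[k_shot:]     # held-out samples of the same class
    return support, query


rng = random.Random(0)
labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 10   # toy labels, 3 classes
support, query = sample_episode(labels, n_way=3, k_shot=2, n_query=3, rng=rng)
```

Calling the same function with labels such as `["HC"] * 50` and `n_way=2` raises `ValueError`, mirroring why a single-class source domain does not fit this protocol.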

2.4. Materials

①The overall pipeline:

（A）Get time series $\overset{Pearson\, \, correlation}{\rightarrow}$ FC $\overset{vectorize}{\rightarrow}$ input vector

（B）Pretraining: input vector (dimension $N=\frac{n(n-1)}{2}$ , where $n$ is the number of ROIs) $\overset{three-layer\, \, autoencoder}{\rightarrow}$ output vector, trained with the reconstruction loss $L_{reconstruction}$ (I don't know how it is used)

（C）Apply three-repeat three-trial validation, with a different random seed in each repeat to shuffle the order of the samples. In each trial, randomly select a few normal and abnormal samples as labeled data. The remaining samples are regarded as the test set

（D）Keep the encoder from (B) and compensate for the differences between domains through the residual correction block and conditional adversarial domain adaptation. The total loss is

$L_{total}=L_{ad}+L_{da}\left ( \beta \right )$

where $L_{ad}$ denotes the loss of anomaly detection and $L_{da}$ denotes the loss of domain adaptation.

②Finally, performance is measured by the AUC on the unlabeled target domain
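
As a reminder of what the AUC measures for an anomaly detector, here is a minimal sketch assuming that a higher score means "more anomalous". It uses the pairwise (Mann-Whitney) definition; the function name and the toy scores are illustrative assumptions, not values from the paper.

```python
def auc_from_scores(scores, is_anomaly):
    """AUC = probability that a random anomaly scores higher than a random
    normal sample (ties count as half a win)."""
    pos = [s for s, a in zip(scores, is_anomaly) if a]
    neg = [s for s, a in zip(scores, is_anomaly) if not a]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two normal samples followed by two anomalies.
auc = auc_from_scores([0.1, 0.4, 0.35, 0.8], [False, False, True, True])
```

Here three of the four anomaly/normal pairs are ranked correctly, so `auc` is 0.75.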

2.4.1. Demographic, clinical and imaging information of data

①Sites: 7

（1）Source domain

①dataset: The Human Connectome Project (HCP) dataset (HCP S1200)

②Samples: 1053 HC with 483 males and 570 females

③Scanning parameters: spatial resolution = 2×2×2 mm³, repetition time (TR) = 720 ms, echo time (TE) = 33.1 ms, field of view (FOV) = 208×80 mm², slices = 72, flip angle (FA) = 52°, TRs = 1200

（2）Target domain

①Datasets: AMU, FMMU#1, FMMU#2, PUTH, UCLA and COBRE (selection criteria: a) rs-fMRI data, b) the same scanner is used within one site, c) the per-site sample size is > 100 when the site contains HC and SCZ, and > 150 when it contains SCZ and MDD)

2.4.2. Preprocessing

①Software: SPM8

②Magnetic saturation: the first five frames of the scanned data are discarded

③Slice timing

④Motion correction: excluding scans with excessive head motion during acquisition (>2.5 mm translation and/or 2.5° rotation)

⑤Normalization with an EPI template in the Montreal Neurological Institute (MNI) atlas space (3-mm isotropic voxels)

⑥Spatial smoothing with a 6-mm full-width at half-maximum (FWHM) Gaussian kernel

⑦Linear detrending and bandpass temporal filtering (0.01–0.08 Hz)

⑧Regression of nuisance variables, including the six parameters obtained by rigid body head motion correction, ventricular and white matter signals, and their first temporal derivatives, quadratic terms, and squares of derivatives
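
The nuisance set in ⑧ can be read as an expansion of the base signals into four groups: the signals themselves, their first temporal derivatives, their squares, and the squared derivatives. The sketch below is my reading, not the paper's code: it assumes 8 base signals (6 motion parameters plus ventricular and white matter signals) and a backward difference for the derivative, which gives 32 regressors per frame.

```python
def expand_nuisance(base):
    """base: list of T frames, each a list of K base regressor values.
    Returns T frames with 4K values: [R, dR/dt, R^2, (dR/dt)^2]."""
    expanded = []
    for t, row in enumerate(base):
        # Backward difference; the derivative of the first frame is set to 0.
        prev = base[t - 1] if t > 0 else row
        deriv = [cur - old for cur, old in zip(row, prev)]
        expanded.append(
            list(row)
            + deriv
            + [v * v for v in row]
            + [d * d for d in deriv]
        )
    return expanded


# Toy input: 3 frames, 8 base signals (values are arbitrary).
base = [[float(t * (k + 1)) for k in range(8)] for t in range(3)]
expanded = expand_nuisance(base)
```

Each row of `expanded` would then be one row of the design matrix that is regressed out of the ROI time series.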

2.4.3. Functional connectivity measures

①The AAL atlas lacks information about functional organization

②The 17-network parcellation has a high SNR but does not contain some subcortical regions, such as the thalamus and amygdala, which are regarded as essential regions for memory, emotional control and various cognitive functions

③Thus, they use the BA512 atlas with eigen clustering (EIC), an unsupervised method

④Pearson correlation coefficients are computed between the time series under each atlas and then Fisher r-to-z transformed so that they are approximately normally distributed

⑤Three atlases:

striatum n. a subcortical structure of the forebrain (part of the basal ganglia)

thalamus n. [anatomy] thalamus; (botany) receptacle

amygdala n. [anatomy] amygdala; tonsil; bitter almond
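
The path from ROI time series to the input vector (Pearson correlation, Fisher r-to-z, vectorization of the upper triangle into $n(n-1)/2$ values) can be sketched in plain Python as follows. This is a minimal illustration: the helper names are mine, and the clipping of r before `atanh` is an assumption added to keep the transform finite.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def fc_vector(time_series):
    """time_series: list of n ROI signals. Returns the n(n-1)/2 Fisher-z
    values of the upper triangle of the correlation matrix."""
    n = len(time_series)
    vec = []
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(time_series[i], time_series[j])
            r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
            vec.append(math.atanh(r))             # Fisher r-to-z
    return vec


# Toy example with 4 ROIs and 4 time points.
ts = [[1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1], [1, 3, 2, 4]]
vec = fc_vector(ts)
```

With 4 ROIs the vector has 6 entries; with an atlas of $n$ ROIs it has $n(n-1)/2$, which is the dimension fed to the autoencoder.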

2.5. Proposed algorithm

2.5.1. Problem definition

① $\mathcal{D}_{s}=\{(x_{si},y_{si})\}_{i=1}^{n_{s}}=\{\mathbf{X}_{s},y_{s}\}$ is the source domain, the HCP dataset, where $y_{si}=+1$

② $\mathcal{D}_{t}$ is the target domain, the AMU, FMMU#1, FMMU#2, PUTH, UCLA and COBRE datasets

③ $\mathcal{D}_{l}=\{(x_{li},y_{li})\}_{i=1}^{n_{l}}=\{\mathbf{X}_{l},y_{l}\}$ is the labeled target, where $y_{li}=+1$ for HC, $y_{li}=-1$ for patients

④ $\mathcal{D}_{u}=\{(x_{ui})\}_{i=1}^{n_{u}}=\{\mathbf{X}_{u}\}$ is the unlabeled target

⑤

$\mathcal{X}_{s}$	the feature space of the source domain $\mathcal{D}_{s}$
$\mathcal{X}_{t}$	the feature space of the target domain $\mathcal{D}_{t}$
$\mathcal{Y}_{s}$	the label space of the source domain $\mathcal{D}_{s}$ , $\mathcal{Y}_{s}\subset \mathcal{Y}_{t}$ . Its class number $C_s=1$
$\mathcal{Y}_{t}$	the label space of the target domain $\mathcal{D}_{t}$ . Its class number $C_t=2$

⑥ $D\left ( \mathcal{X}_{s} \right )=D\left ( \mathcal{X}_{t} \right )$ means they have the same dimension

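⑦⭐The feature distributions of the source and target domains differ, namely $P_{s}(X_{s})\neq P_{t}(X_{t})$ (I'm actually not sure whether this feature distribution means a) the same measures but with differently distributed values, or b) the same number of measures but different measures)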

⑧They aim to alleviate the distribution discrepancy between $\mathcal{D}_{s}$ and $\mathcal{D}_{l}$ and apply anomaly detection in $\mathcal{D}_{u}$

2.5.2. Deep semi-supervised anomaly detection (DSAD)

①An $L$-layer deep support vector data description (deep SVDD) network minimizes:

$\begin{aligned}\min_{\mathcal{W}}\frac{1}{n}\sum_{i=1}^{n}||\phi(x_{i};\mathcal{W})-c||^{2}+\frac{\lambda}{2}\sum_{l=1}^{L}||\mathbf{W}^{l}||_{F}^{2}\end{aligned}$

where $\mathcal{X}\subset\mathbb{R}^{D}$ denotes the input space and $\mathcal{Z}\subset\mathbb{R}^d$ denotes the output space;

$\mathcal{W}=\{\mathbf{W}^{1},...,\mathbf{W}^{L}\}$ , $x_{1},...,x_{n}\in\mathcal{X}$ , $c$ denotes the center of the hypersphere;

This objective minimizes the volume of the hypersphere that encloses all the HC samples;

The first term pulls the HC samples toward the center and the second term is a standard weight decay regularizer with hyperparameter $\lambda > 0$

②Because only HC samples are available for training, the network is initialized by pretraining an autoencoder with a reconstruction loss, which maximizes the mutual information $\mathcal{I}(\mathcal{X},\mathcal{Z})$

③The center $c$ is the mean of the encoded source samples after initialization:

$c=\frac{1}{n}\sum_{i=1}^{n}\phi(x_{si};\mathcal{W}_{0})$

④The anomaly score after training can be:

$s(x)=\|\phi(x;\mathcal{W})-c\|^2$

⑤"Hypersphere collapse" may occur when only HC samples are used: the radius of the hypersphere shrinks to 0, which eliminates the representation capability of the network. A few labeled abnormal samples can mitigate it

⑥The labeled target samples of the two classes are:

$\begin{aligned}&(x_{t1},y_{t1}),...,(x_{tm},y_{tm}),\\&(x_{t(m+1)},y_{t(m+1)}),...,(x_{t(2m)},y_{t(2m)})\in\mathcal{X}_t\times\mathcal{Y}_t\end{aligned}$

⑦After adding the labeled samples, the objective becomes:

$\begin{aligned} \min_{\mathcal{W}}\;& \frac{1}{n}\sum_{i=1}^n(||\phi(x_{si};\mathcal{W})-c||^2)^{y_{si}} \\ &+\frac{1}{2m}\sum_{j=1}^{2m}(||\phi(x_{tj};\mathcal{W})-c||^{2})^{y_{tj}}+\frac{\lambda}{2}\sum_{l=1}^{L}||\mathbf{W}^{l}||_{F}^{2} \end{aligned}$

Because the exponent is $y=-1$ for labeled abnormal samples, they are penalized for lying close to the center and are therefore mapped away from it

⑧The source domain and the target domain share the same center
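
To see how the exponent $y$ works in the objective above, here is a minimal sketch of the center, the anomaly score and the data terms of the loss, operating on already-encoded samples. The weight decay term is omitted, and the small `eps` added before exponentiation is my assumption to avoid division by zero, not something stated in these notes.

```python
def squared_distance(z, c):
    """Anomaly score s(x) = ||phi(x) - c||^2 for an encoded sample z."""
    return sum((a - b) ** 2 for a, b in zip(z, c))


def center_of(embeddings):
    """Center c: coordinate-wise mean of the encoded source samples."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def dsad_loss(source_emb, target_emb, target_labels, c, eps=1e-6):
    """Data terms of the semi-supervised objective (weight decay omitted).
    Labels: +1 for HC, -1 for patients. With y = -1 the term becomes the
    inverse distance, so anomalies near the center are penalized."""
    src = sum(squared_distance(z, c) for z in source_emb) / len(source_emb)
    tgt = sum((squared_distance(z, c) + eps) ** y
              for z, y in zip(target_emb, target_labels)) / len(target_emb)
    return src + tgt


source_emb = [[0.0, 0.0], [2.0, 0.0]]          # encoded HC source samples
c = center_of(source_emb)                       # [1.0, 0.0]
target_emb = [[1.0, 1.0], [3.0, 0.0]]           # one HC, one patient
loss = dsad_loss(source_emb, target_emb, [+1, -1], c)
```

In this toy case the patient at squared distance 4 contributes 1/4; moving that embedding farther from `c` lowers the loss, which is the "mapped away from the center" behavior.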

2.5.3. Residual correction block (RCB)

①Distribution alignment that relies on a discrepancy loss may not completely eliminate the domain discrepancies

②Li et al. proposed the RCB, a two-layer fully connected neural network, for the setting where $\mathcal{Y}_{t}\subset\mathcal{Y}_{s}$

③ $\phi_{s}(x_{s})$ and $\phi_{t}(x_{t})$ are the task-specific features of source data $x_s$ and target data $x_t$

④“The source data $x_s$ only needs to go through the original network, while the target data $x_t$ needs to pass the RCB afterward.” Hence $\phi_{s}(x_{s})=\phi(x_{s})$ (I don't know what this means)

⑤The feature learned by the RCB is denoted as $\Delta\phi_{s}(x_{t})$

⑥The integrated target feature: $\phi_{t}(x_{t})=\phi_{s}(x_{t})+\Delta\phi_{s}(x_{t})$

⑦They further update the objective, i.e. the loss of DSAD:

$\begin{aligned} L_{ad}=& \begin{aligned}\frac{1}{n}\sum_{i=1}^{n}(||\phi_{s}(x_{si};\mathcal{W})-c||^{2})^{y_{si}}\end{aligned} \\ &+\frac1{2m}\sum_{j=1}^{2m}(||\phi_{t}(x_{tj};\mathcal{W})-c||^{2})^{y_{tj}}+\frac\lambda2\sum_{l=1}^{L}||\mathbf{W}^{l}||_{F}^{2} \end{aligned}$
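
The relation $\phi_{t}(x_{t})=\phi_{s}(x_{t})+\Delta\phi_{s}(x_{t})$ can be sketched as a residual branch on top of the shared encoder output. The code below is only an illustration in plain Python: the ReLU between the two fully connected layers and the toy weights are my assumptions, and the function names are hypothetical.

```python
def linear(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def rcb(features, w1, b1, w2, b2):
    """Two-layer fully connected block producing the correction delta."""
    return linear(relu(linear(features, w1, b1)), w2, b2)


def target_feature(phi_s_of_xt, w1, b1, w2, b2):
    """phi_t(x_t) = phi_s(x_t) + delta; source samples skip this step."""
    delta = rcb(phi_s_of_xt, w1, b1, w2, b2)
    return [f + d for f, d in zip(phi_s_of_xt, delta)]


# Toy 2-dimensional encoder output of a target sample and toy RCB weights.
phi_s_xt = [1.0, -2.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[0.5, 0.0], [0.0, 0.5]], [0.1, 0.1]
phi_t_xt = target_feature(phi_s_xt, w1, b1, w2, b2)
```

If the second layer's weights and bias are all zero, `target_feature` returns its input unchanged, so the block only has to learn the residual between the two domains.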

2.5.4. Conditional adversarial domain adaptation revisited

①CDAN was designed for traditional domain adaptation, in which the source and target domains share the same label space

②The domain confusion error:

$\begin{aligned}L_{dc}&=-\frac{1}{n}\sum_{i=1}^{n}\log[D(\phi_s(x_{si}),g(x_{si}))]\\&-\frac{1}{2m}\sum_{j=1}^{2m}\log[1-D(\phi_t(x_{tj}),g(x_{tj}))]\end{aligned}$

③They apply:

$\begin{aligned}&\{g(x_1),g(x_2),...,g(x_B)\}\\&=\text{softmax}(\{-s(x_1),-s(x_2),...,-s(x_B)\})\end{aligned}$

where $s\left ( x_i \right )$ denotes the distance between $x_i$ and $c$

④The adversarial objectives are:

$\begin{aligned}&\min_\phi L_{ad}(\phi)-\beta L_{dc}(D,g)\\&\min_DL_{dc}(D,g)\end{aligned}$

⑤The domain discriminator $D(\phi,g)=D(\phi\otimes g)$

⑥Then, the CDAN can be:

$\begin{aligned} &\min_{\phi}L_{ad}(\phi)+\beta\Big(\frac{1}{n}\sum_{i=1}^{n}w(g(x_{si}))\log[D(\phi_{s}(x_{si})\otimes g(x_{si}))] \\ &\qquad+\frac{1}{2m}\sum_{j=1}^{2m}w(g(x_{tj}))\log[1-D(\phi_{t}(x_{tj})\otimes g(x_{tj}))]\Big) \\ &\max_{D}\frac{1}{n}\sum_{i=1}^{n}w(g(x_{si}))\log[D(\phi_{s}(x_{si})\otimes g(x_{si}))] \\ &\qquad+\frac{1}{2m}\sum_{j=1}^{2m}w(g(x_{tj}))\log[1-D(\phi_{t}(x_{tj})\otimes g(x_{tj}))] \end{aligned}$

where the entropy criterion $w(g)=1+e^{-g}$
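
The small building blocks of this subsection can be checked numerically. The sketch below, in plain Python, covers the batch softmax over negative anomaly scores from ③, the weight $w(g)=1+e^{-g}$ as written in these notes, and the unweighted domain confusion error from ②. The discriminator outputs are passed in as plain numbers in (0, 1); a real discriminator network is outside the scope of this illustration, and the function names are hypothetical.

```python
import math


def softmax_neg_scores(scores):
    """g = softmax(-s) over a batch: samples near the center get larger g."""
    m = max(-s for s in scores)                 # subtract max for stability
    exps = [math.exp(-s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_weight(g):
    """w(g) = 1 + exp(-g), as written in the notes."""
    return 1.0 + math.exp(-g)


def domain_confusion_loss(d_source, d_target):
    """L_dc: d_source / d_target are discriminator outputs in (0, 1)."""
    src = -sum(math.log(d) for d in d_source) / len(d_source)
    tgt = -sum(math.log(1.0 - d) for d in d_target) / len(d_target)
    return src + tgt


g = softmax_neg_scores([0.2, 1.0, 3.0])         # toy anomaly scores
l_dc = domain_confusion_loss([0.5, 0.5], [0.5])
```

When the discriminator outputs 0.5 everywhere (it cannot tell the domains apart), `l_dc` equals $2\log 2$, the value the feature extractor pushes toward in the adversarial game.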

2.5.5. Overall formulation of the FAAD algorithm

①The Few-shot domain-Adaptive Anomaly Detection (FAAD) combines DSAD and RCB:

$\begin{aligned} \min_{\phi}& \frac{1}{n}\sum_{i=1}^{n}(||\phi_{s}(x_{si};\mathcal{W})-c||^{2})^{y_{si}} \\ &+\frac{1}{2m}\sum_{j=1}^{2m}(||\phi_{t}(x_{tj};\mathcal{W})-c||^{2})^{y_{tj}}+\frac{\lambda}{2}\sum_{l=1}^{L}||\mathbf{W}^{l}||_{F}^{2} \end{aligned}$

②FAAD+CDANE:

$\begin{aligned} &\min_{\phi} \begin{aligned}\frac{1}{n}\sum_{i=1}^n(||\phi_s(x_{si};\mathcal{W})-c||^2)^{y_{si}}\end{aligned} \\ &+\frac1{2m}\sum_{j=1}^{2m}(||\phi_{t}(x_{tj};\mathcal{W})-c||^{2})^{y_{tj}}+\frac\lambda2\sum_{l=1}^{L}||\mathbf{W}^{l}||_{F}^{2} \\ &+\beta(\frac1n\sum_{i=1}^nw(g(x_{si}))\log[D(\phi_s(x_{si})\otimes g(x_{si}))] \\ &+\frac1{2m}\sum_{j=1}^{2m}w(g(x_{tj}))\log[1-D(\phi_{t}(x_{tj})\otimes g(x_{tj}))]) \\ &\max_{D} \begin{aligned}\frac{1}{n}\sum_{i=1}^nw(g(x_{si}))\log[D(\phi_s(x_{si})\otimes g(x_{si}))]\end{aligned} \\ &+\frac1{2m}\sum_{j=1}^{2m}w(g(x_{tj}))\log[1-D(\phi_{t}(x_{tj})\otimes g(x_{tj}))], \end{aligned}$

③The pseudo code of FAAD+CDANE:

2.6. Experiment

①They compare their model with a) a machine learning baseline (SVM) and a deep learning baseline (FNN), b) the original anomaly detection method DSAD, c) domain adaptation models

②They evaluate the model's ability to detect a specific disease and its ability to handle multiple diseases

2.6.1. Baseline method

①They apply 95% PCA-SVM because the feature dimension far exceeds the number of samples (is the feature dimension that n(n-1)/2?)

②They build BC-DNN from an FNN followed by a fully connected layer and a Softmax layer, then pretrain BC-DNN to obtain BC-DNN-p

③They go on to introduce other models... (omitted here)

2.6.2. Implementation details

（1）Network and training setup

①Shot: 10-shot and 20-shot applied

②Measurement: AUC

③FNN: input dimensions of layer 1,2,3 are the original dimension of vector, 128, 32 respectively; learning rate=0.001; optimizer: Adam

④FAAD and FAAD+CDANE: learning rate of the RCB = 1/10 of the original learning rate; 12 epochs of pretraining and 16 epochs of FAAD training; the learning rate is divided by 10 at the fourth and eighth epochs; batch size=4; $\lambda =0.0001$ and $\beta =0.1$ (ramped from 0 to 0.1 by the coefficient $(1-\exp(-\delta p))/(1+\exp(-\delta p))$ , where $\delta =10$ and $p$ goes from 0 to 1) (I don't fully understand this); dropout ratio=0.2 (a paragraph that makes your head explode if you look at it too long)

⑤DSAD-DANN: $\beta =1$
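
The ramp on $\beta$ and the step schedule of the learning rate in ④ are easier to follow when written out. The sketch below assumes that $p$ is the training progress in [0, 1], that the base learning rate is 0.001 (the value listed for the FNN), and that the drop takes effect from the start of the fourth and eighth epochs; all three are my assumptions.

```python
import math


def beta_at(p, beta_max=0.1, delta=10.0):
    """Ramp beta from 0 to about beta_max as progress p goes from 0 to 1."""
    return beta_max * (1.0 - math.exp(-delta * p)) / (1.0 + math.exp(-delta * p))


def learning_rate(epoch, base_lr=0.001, drop_epochs=(4, 8)):
    """Step schedule: divide the rate by 10 at each drop epoch (1-indexed)."""
    drops = sum(1 for e in drop_epochs if epoch >= e)
    return base_lr / (10 ** drops)


# beta over the 16 FAAD epochs, with p taken as the fraction of epochs done.
betas = [beta_at(epoch / 16) for epoch in range(17)]
rates = [learning_rate(epoch) for epoch in range(1, 17)]
```

The coefficient starts at 0, rises quickly and saturates just below 1, so the adversarial term is almost switched off early in training and reaches nearly its full weight of 0.1 later.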

（2）Data augmentation

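①Why do they say here, again, that the feature dimension is smaller than the sample size!?

②⭐They assume that a partial fMRI scan has the same label as the full scan

③⭐"During training, each time course is randomly cropped (the crop should start from the first frame of the scan and be longer than half the original length) and then used to compute whole-brain FC. During testing, the augmentation is dropped." (This counts as augmentation?... Maybe I just never studied data augmentation)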
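
A minimal sketch of the cropping described above, in plain Python: every ROI signal keeps the frames from the first one up to a random length that is longer than half the scan. The function name is hypothetical, and allowing the crop to equal the full length is my assumption.

```python
import random


def random_prefix_crop(time_series, rng):
    """time_series: list of ROI signals with equal length T.
    Keeps frames [0, L) for a random L with T/2 < L <= T."""
    total = len(time_series[0])
    length = rng.randint(total // 2 + 1, total)   # inclusive bounds
    return [signal[:length] for signal in time_series]


# Toy scan: 2 ROIs, 10 frames each.
ts = [list(range(10)), list(range(10, 20))]
cropped = random_prefix_crop(ts, random.Random(0))
```

During training the FC vector would be recomputed from `cropped` at every draw, while at test time the full time series is used.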

2.6.3. Results and analysis

They compare the mean AUC of 9 trials

（1）FAAD for one mental disorder (SCZ only)

①AMU

②FMMU#1

③FMMU#2

④PUTH

⑤UCLA

⑥COBRE

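⑦After this they spend a lot of space on discussion, but it is all based on the experimental results. Since I have no experimental results of my own, it does not mean much to me for now, so I only read through it without taking notes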

⑧Mean values and standard deviation of AUCs(%):

（2）FAAD for two mental disorders (SCZ & MDD)

①AMU

②FMMU#1

（3）Discriminative FC and brain regions

①They concatenate all the FC vectors in each test set and apply canonical correlation analysis (CCA) to them, then take the mean weight of each FC across the test sets and select the top 10%

②SCZ visualization:

③SCZ or MDD:

（4）Empirical analysis of parameters

①A grid search over $\beta =\left \{ 0,\, 0.05,\, 0.1,\, 0.15,\, 0.2,\, 0.25 \right \}$ shows that FAAD+CDANE is not sensitive to $\beta$

②Table of the tuning:

（5）Distribution of anomaly scores

①Anomaly scores in FMMU#1 with AAL:

（6）Brain parcellation and model performance

①Comparison of datasets and atlases:

2.7. Discussion

①This model can also be generalized to other networks

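②⭐"The definition of the graph and the Laplacian representation of the graph are not always satisfactory" hahaha, this cracks me up. But your average accuracy is not that high either: the best can reach 80, yet on average I feel it is only in the 60s to 70s. For 2021 that is actually good enough

③Most of the samples in HCP are young people, which might influence the results

④⭐They did not account for the different preprocessing pipelines of different sites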

2.8. Conclusion

I won't bother writing a conclusion; it is what it is

3. Supplementary knowledge

3.1. Hypersphere

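Reference: Hypersphere (超球面), Baidu Baike (baidu.com)

3.2. Meta-learning

Reference: An Introduction to Meta-Learning in One Article (with code), Zhihu (zhihu.com)

3.3. Manifold

Reference 1: One of the Greatest Inventions in Geometry: the Manifold, and the Geometric Intuition and Mathematical Methods Behind It (baidu.com)

Reference 2: Manifold (流形), Baidu Baike (baidu.com)

3.4. Canonical Correlation Analysis (CCA)

Reference: Canonical Correlation Analysis, Zhihu (zhihu.com)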

4. Reference List

Su J. et al. (2021) 'Few-shot domain-adaptive anomaly detection for cross-site brain images', IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1. doi: 10.1109/TPAMI.2021.3125686
