Paper Notes: "Part-of-Speech Tagging for Twitter with Adversarial Neural Networks"

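These notes cover the papers I read over the past two months that were closest to my project's needs. This post covers one of them, and it is the one whose ideas I borrowed the most.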

0.Basic Paper Information
Authors: Tao Gui, Qi Zhang∗, Haoran Huang, Minlong Peng, Xuanjing Huang
School: Fudan University
Published 2017 in EMNLP

1.Introduction
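Tagging Twitter content is hard for three reasons: social-media text is informal and full of non-standard vocabulary; text differs considerably between domains; and there is a lack of training data together with many out-of-vocabulary words.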

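The paper proposes the TPANN network, which uses a large amount of annotated data from other domains, unlabeled in-domain data, and a small amount of labeled in-domain data. To overcome the drawback that an adversarial network can only learn general (domain-shared) features, an autoencoder that operates only on the target dataset is used to preserve the target domain's specific characteristics. To address the out-of-vocabulary problem, the method includes a character-level CNN that exploits subword information.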

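Main contributions of the paper:
It combines large-scale unlabeled in-domain data, out-of-domain labeled data, and in-domain labeled data in an RNN-based model. It learns domain-invariant representations from the in-domain and out-of-domain data, builds a cross-domain POS tagger on the learned representations, and tries to preserve the characteristics of the target domain.
Experimental results show that the method achieves better performance on three different datasets.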

2.Approach
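TPANN learns features shared between the resource-rich domain and the target domain while preserving the characteristics of the target domain.
The model extends a bidirectional LSTM with an adversarial network and an autoencoder. It consists of four parts: Feature Extractor ($F$), POS Tagging Classifier ($P$), Domain Discriminator ($Q$), and Target Domain Autoencoder ($R$).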

2.1 Feature Extractor
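$F$ uses a CNN to extract character embedding features, which addresses the out-of-vocabulary problem. To incorporate word embedding features as well, the word embedding and the character embedding are concatenated and fed to the next layer, a bi-LSTM that models the sentence. In this way $F$ can extract sequential relations and contextual information.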

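The input is a sentence $x$ whose $i$-th word is $x_{i}$, with $x\in S(x)$ or $x\in T(x)$, where $S(x)$ and $T(x)$ denote the input samples from the source domain and the target domain, respectively.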

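$\theta _{f}$: the parameters of $F$
$\nu$: the vocabulary of words
$C$: the vocabulary of characters
$d$: the dimensionality of the character embeddings
$Q\in\mathbb{R}^{d\times \left |C \right |}$: the representation (embedding) matrix of the character vocabulary (this $Q$ is unrelated to the domain discriminator $Q$ of Section 2.2)
Suppose the word $x_{i}\in \nu$ is made up of the character sequence $c^{i} = [c_{1},c_{2},\ldots,c_{l}]$, where $l$ is the maximum word length and every word is padded to this length. Stacking the embeddings of these characters gives the matrix $c^{i} \in\mathbb{R}^{d\times l}$, which is the input to the CNN.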
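
The padding and lookup step above can be sketched in plain Python. This is only an illustration under assumptions: the padding symbol `<pad>`, the toy alphabet, the random values in `Q`, and the choice to map unknown characters to the padding column are all hypothetical and do not come from the paper.

```python
import random

PAD = "<pad>"  # hypothetical padding symbol; the notes do not name one


def build_char_matrix(word, char_vocab, Q, max_len):
    """Return c^i as a d x l nested list whose column j embeds character j."""
    chars = list(word)[:max_len]             # defensive truncation (assumption)
    chars += [PAD] * (max_len - len(chars))  # pad every word to length l
    # Assumption: characters missing from the vocabulary reuse the PAD column.
    ids = [char_vocab.get(ch, char_vocab[PAD]) for ch in chars]
    return [[Q[row][idx] for idx in ids] for row in range(len(Q))]


random.seed(0)
alphabet = [PAD] + list("abcdefghijklmnopqrstuvwxyz#@")  # toy character vocabulary C
char_vocab = {ch: i for i, ch in enumerate(alphabet)}
d, l = 4, 8                                              # toy sizes
Q = [[random.uniform(-1, 1) for _ in alphabet] for _ in range(d)]  # d x |C|

c_i = build_char_matrix("#nlp", char_vocab, Q, l)
print(len(c_i), len(c_i[0]))  # 4 8
```

Each column of `c_i` is one column of `Q`, so the result has shape $d\times l$ regardless of the word's actual length.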

A narrow convolution is applied between $c^{i}$ and a filter $H\in\mathbb{R}^{d\times k}$, where $k$ is the filter width. Adding a bias $b$ and applying a nonlinearity yields a feature map $m^{i} \in\mathbb{R}^{l-k+1}$, whose $j$-th element is $m^{i}[j]=\tanh(\langle c^{i}[*, j:j+k-1], H\rangle+b)$

$\langle A,B\rangle=\mathrm{Tr}(AB^{T})$ is the Frobenius inner product.
A max-over-time pooling operation is then applied over the feature map.
The CNN uses filters of different widths to obtain the feature vector $\vec{c_{i}}$ for word $x_{i}$.
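
As a rough sketch of the character-level CNN described above, the following plain-Python code computes the feature map for one filter and then max-pools over time. The input matrix and filters are random toy values, and using a single filter per width is a simplification assumed here for brevity, not a detail from the paper.

```python
import math
import random


def frobenius(A, B):
    """<A, B> = Tr(A B^T): the sum of element-wise products of two same-shape matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def feature_map(c, H, b):
    """Narrow convolution of c (d x l) with filter H (d x k); returns l - k + 1 values."""
    l, k = len(c[0]), len(H[0])
    m = []
    for j in range(l - k + 1):
        window = [row[j:j + k] for row in c]  # the k columns starting at position j
        m.append(math.tanh(frobenius(window, H) + b))
    return m


def char_cnn(c, filters):
    """Max-over-time pooling keeps one value per filter."""
    return [max(feature_map(c, H, b)) for H, b in filters]


random.seed(0)
d, l = 4, 8
c = [[random.uniform(-1, 1) for _ in range(l)] for _ in range(d)]  # toy c^i
# Assumption: one random filter for each width 2, 3, 4, each with zero bias.
filters = [([[random.uniform(-1, 1) for _ in range(k)] for _ in range(d)], 0.0)
           for k in (2, 3, 4)]

vec = char_cnn(c, filters)
print(len(feature_map(c, *filters[0])), len(vec))  # 7 3
```

With $l=8$ and $k=2$ the feature map has $l-k+1=7$ entries, and pooling reduces every map to a single number, so the length of the final vector depends only on the number of filters.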

The character-level feature vector $\vec{c_{i}}$ is concatenated with the word embedding $\vec{w_{i}}$ to form the input to the next layer, the bi-LSTM.
The word embeddings $\vec{w_{i}}$ are pre-trained on 30 million tweets.
The hidden states $h$ of the bi-LSTM then serve as the features that are passed to $P$, $Q$, and $R$, i.e. $F (x) = h$.
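
To make the data flow of $F$ concrete, here is a minimal plain-Python sketch: it concatenates each word vector with its character vector and runs a forward and a backward LSTM over the sentence. Several things are assumptions rather than details from the paper: the standard LSTM gate equations, the concatenation of forward and backward states, the random weights, and the toy dimensions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_lstm(input_dim, hidden_dim, rng):
    """One (weights, bias) pair per gate: input i, forget f, output o, candidate g."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]
    return {g: (mat(), [0.0] * hidden_dim) for g in "ifog"}


def lstm_step(params, x, h_prev, c_prev):
    z = x + h_prev  # list concatenation [x; h_prev]

    def gate(name, act):
        W, b = params[name]
        return [act(sum(w * v for w, v in zip(row, z)) + bias)
                for row, bias in zip(W, b)]

    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def run_lstm(params, inputs, hidden_dim):
    h, c, outs = [0.0] * hidden_dim, [0.0] * hidden_dim, []
    for x in inputs:
        h, c = lstm_step(params, x, h, c)
        outs.append(h)
    return outs


def feature_extractor(word_vecs, char_vecs, fwd, bwd, hidden_dim):
    """F(x) = h: concatenate [w_i; c_i], then join forward and backward states."""
    inputs = [w + c for w, c in zip(word_vecs, char_vecs)]
    h_fwd = run_lstm(fwd, inputs, hidden_dim)
    h_bwd = run_lstm(bwd, inputs[::-1], hidden_dim)[::-1]  # re-align to word order
    return [hf + hb for hf, hb in zip(h_fwd, h_bwd)]


rng = random.Random(0)
n_words, word_dim, char_dim, hidden_dim = 5, 6, 3, 4  # toy sizes
word_vecs = [[rng.uniform(-1, 1) for _ in range(word_dim)] for _ in range(n_words)]
char_vecs = [[rng.uniform(-1, 1) for _ in range(char_dim)] for _ in range(n_words)]
fwd = init_lstm(word_dim + char_dim, hidden_dim, rng)
bwd = init_lstm(word_dim + char_dim, hidden_dim, rng)

h = feature_extractor(word_vecs, char_vecs, fwd, bwd, hidden_dim)
print(len(h), len(h[0]))  # 5 8
```

Every word ends up with one hidden vector that has seen both its left and right context, which is the representation shared by the tagger, the discriminator, and the autoencoder.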

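2.2 POS Tagging Classifier and Domain Discriminator
$P$ and $Q$ take $F (x)$ as input. Both are standard feed-forward networks with a softmax layer for classification. $P$ predicts the POS tagging label to gain classification ability, while $Q$ discriminates the domain label in order to make $F (x)$ domain-invariant. $P$ maps the feature vector $F(x_{i})$ to its label, and the parameters of this mapping are denoted $\theta _{y}$. $P$ is trained on the $N_{s}$ samples of the source domain with the cross-entropy loss $L_{task}=-\sum_{i=1}^{N_{s}}y_{i}\cdot \log\hat{y_{i}}$ .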

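$y_{i}$ is the one-hot vector of the POS tagging label corresponding to $x_{i}\in S(x)$.
$\hat{y_{i}}=P(F(x_{i}))$ is the output of the top softmax layer.
The parameters $\theta _{f},\theta _{y}$ are optimized by minimizing the classification loss $L_{task}$, which ensures that $P(F(x_{i}))$ can make predictions on the source domain.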
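
As a small worked example of $L_{task}$, the sketch below applies a softmax to hypothetical tagger scores and sums the cross-entropy over the samples. The logits, the three-tag label set, and the function names are made up for illustration.

```python
import math


def softmax(scores):
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def task_loss(one_hot_labels, predictions):
    """L_task = -sum_i y_i . log(y_hat_i), summed over source-domain samples."""
    return -sum(y * math.log(p)
                for y_vec, p_vec in zip(one_hot_labels, predictions)
                for y, p in zip(y_vec, p_vec)
                if y > 0)                     # zero entries of y_i contribute nothing


logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # hypothetical scores for 2 words, 3 tags
y = [[1, 0, 0], [0, 0, 1]]                    # one-hot gold tags
y_hat = [softmax(row) for row in logits]
loss = task_loss(y, y_hat)
print(round(loss, 4))
```

Because $y_{i}$ is one-hot, only the log-probability assigned to the gold tag contributes for each word, so the loss shrinks as that probability approaches 1.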

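In contrast, the domain discriminator maps the same hidden states $h$ to the domain labels. Its parameters $\theta _{d}$ are trained to discriminate the domain labels with the loss function $L_{type}=-\sum_{i=1}^{N_{s}+N_{t}}\left \{ d_{i}\log\hat{d_{i}} +(1-d_{i})\log(1-\hat{d_{i}})\right \}$, where $N_{t}$ is the number of target-domain samples.
$d_{i}$ is the ground-truth domain label of sample $i$, and $\hat{d_{i}}=Q(F(x_{i}))$ is the output of the top layer.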
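
The discriminator objective is a binary cross-entropy between $d_{i}$ and $\hat{d_{i}}$. The sketch below writes it with the conventional leading minus sign, so that lower values mean better discrimination. The label convention (1 for source, 0 for target), the clipping constant `eps`, and the example probabilities are assumptions made for illustration.

```python
import math


def domain_loss(domain_labels, domain_probs, eps=1e-12):
    """Binary cross-entropy summed over all N_s + N_t samples."""
    total = 0.0
    for d, p in zip(domain_labels, domain_probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += d * math.log(p) + (1 - d) * math.log(1.0 - p)
    return -total


d_true = [1, 1, 0, 0]             # assumption: 1 = source sample, 0 = target sample
confident = [0.9, 0.8, 0.1, 0.2]  # discriminator separates the two domains well
confused = [0.5, 0.5, 0.5, 0.5]   # discriminator cannot tell the domains apart

print(domain_loss(d_true, confident) < domain_loss(d_true, confused))  # True
```

Features from which $Q$ cannot tell the domains apart push its outputs toward 0.5, where the loss equals $\log 2$ per sample. That is the sense in which $F (x)$ becomes domain-invariant.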