Reading notes: Cheap Translation for Cross-Lingual NER (2017), University of Illinois at Urbana-Champaign
- Proposes a cheap translation algorithm built around a translation lexicon (dictionary)
- The algorithm can be combined with wikifier features, Brown cluster features, etc., for better results
- Experiments show that when the source and target languages are highly similar, performance can be improved further
- Investigates the dictionary size cheap translation needs, and how other language-specific techniques affect model performance
Abstract
- 目标
- NER for low-resource languages
- relying only on very minimal resources
- 方法概述
- use a lexicon to translate annotated high-resource language data into the target language (producing training data)
- then train a monolingual NER model on the translated target-language data
- 结果
- when Wikipedia data for the target language is available, the method further improves Wikipedia-based methods and achieves state-of-the-art results
Cheap Translation
- core idea: translate annotated high-resource language data into the target language
- using: a lexicon, rather than parallel text
- challenge: the lexicon maps a source word to several options, like a phrase translation table into the target language
- prominence score: normalized co-occurrence counts within each candidate set
- greedy decoding method (see the sketch at the end of this section)
- choose a candidate and copy the source labels onto it
- options from the lexicon are resolved by a language model score (Stolcke, 2002) multiplied by each option's prominence score
- Caution: no reordering of the translated result
- Result
- annotated data in the target language
- despite translation mistakes, the context around the entities (which matters for NER) is reasonably well preserved
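A minimal sketch of the greedy decoding step, assuming a lexicon that maps each source word to (candidate, prominence) pairs; `lm_score` is a uniform stand-in for the n-gram LM (SRILM per the Stolcke, 2002 citation), and all names are illustrative rather than the authors' code.

```python
from typing import Dict, List, Tuple

# Assumed lexicon shape: source word -> [(target candidate, prominence)],
# where prominence = normalized co-occurrence count within the candidate set.
Lexicon = Dict[str, List[Tuple[str, float]]]

def lm_score(prev_words: List[str], word: str) -> float:
    """Placeholder for an n-gram LM probability (SRILM in the paper)."""
    return 1.0  # uniform; a real LM would score the n-gram here

def cheap_translate(tokens: List[str], labels: List[str],
                    lexicon: Lexicon) -> Tuple[List[str], List[str]]:
    """Greedy left-to-right decoding: pick the candidate maximizing
    lm_score * prominence and copy the source NER label to it.
    No reordering is performed, matching the caveat above."""
    out_tokens, out_labels = [], []
    for tok, lab in zip(tokens, labels):
        candidates = lexicon.get(tok.lower(), [(tok, 1.0)])  # copy OOV through
        best_word, _ = max(
            candidates,
            key=lambda c: lm_score(out_tokens[-2:], c[0]) * c[1])
        out_tokens.append(best_word)
        out_labels.append(lab)  # label is copied verbatim
    return out_tokens, out_labels
```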
NER Model
- Illinois NER system
- standard features: word form, capitalization, affixes, previous word, next word, etc. (see the feature sketch after this list)
- Brown cluster features
- multilingual gazetteers
- wikifier features
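A hedged sketch of what such token-level features might look like (surface form, capitalization, affixes, neighboring words, Brown cluster bit-string prefixes). The feature names and the `brown_clusters` lookup are assumptions for illustration, not the Illinois NER implementation.

```python
from typing import Dict, List

def token_features(tokens: List[str], i: int,
                   brown_clusters: Dict[str, str]) -> Dict[str, str]:
    """Illustrative feature map for token i."""
    w = tokens[i]
    feats = {
        "form": w,
        "is_cap": str(w[:1].isupper()),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
    }
    # Brown cluster features: bit-string prefixes at several depths,
    # so rare words share features with frequent cluster-mates.
    path = brown_clusters.get(w.lower())
    if path:
        for depth in (4, 6, 10):
            feats[f"brown{depth}"] = path[:depth]
    return feats
```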
Experiment
baseline
- train on English => apply the model directly to the target language
- no gazetteers
- result
- doesn't work on non-Latin-script languages
cheap translation
- translate English into the target language (lexicon coverage)
- Yoruba
- normalize the text (remove all pronunciation/diacritic markers) => makes the data less sparse (see the normalization sketch after this list)
- Bengali & Tamil
- omit the word surface form as a feature
- result
- 14.6 F1 points of improvement over the baseline
- the proposed approach is orthogonal to other approaches, and can be combined with them to great effect
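For the Yoruba normalization step, one plausible implementation (an assumption, not necessarily the authors' exact procedure) is to strip combining diacritics via Unicode decomposition:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone/pronunciation diacritics) so variants
    of the same word collapse together, making the data less sparse."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

print(strip_diacritics("Ọjọ́ àjíǹde"))  # -> "Ojo ajinde"
```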
+ Wikifier Features
- obtained by grounding words & phrases to English Wikipedia pages
- using the categories of the linked page as NER features for the surface text (see the toy sketch after this list)
- result
- improves scores for all 7 languages
- improvement on European languages > other languages => it is advantageous to select a source language similar to the target language
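A toy illustration of the wikifier-feature idea: ground a surface phrase to an English Wikipedia page and reuse that page's categories as language-independent features. The hard-coded lookup table is a stand-in for a real cross-lingual wikifier.

```python
from typing import Dict, List

# Toy stand-in for a cross-lingual wikifier: phrase -> page categories.
WIKIFIER: Dict[str, List[str]] = {
    "parís": ["Capitals in Europe", "Cities in France"],
    "unesco": ["United Nations agencies"],
}

def wikifier_features(phrase: str) -> List[str]:
    """Categories of the linked English Wikipedia page, used as NER
    features for the surface text."""
    cats = WIKIFIER.get(phrase.lower(), [])
    return [f"wiki_cat={c.replace(' ', '_')}" for c in cats]

print(wikifier_features("París"))  # ['wiki_cat=Capitals_in_Europe', ...]
```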
Google Translate
- translate English into the other languages with Google Translate
- align the source-target data using fast_align
- the alignments can be noisy given the relatively small amount of text
- project labels across the alignments (see the sketch after this list)
- result
- “As with the other approaches, Brown cluster features are an important signal.”
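A minimal sketch of projecting labels over word alignments in the `i-j` Pharaoh format that fast_align emits; the projection heuristic (unaligned target tokens default to O) is an assumption about a standard setup, not the paper's exact rule.

```python
from typing import List, Tuple

def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse fast_align output like '0-0 1-2 2-1' into (src, tgt) pairs."""
    return [tuple(map(int, pair.split("-"))) for pair in line.split()]

def project_labels(src_labels: List[str], tgt_len: int,
                   alignment: List[Tuple[int, int]]) -> List[str]:
    """Copy each aligned source token's label to its target token;
    unaligned target tokens default to 'O'. Noisy alignments therefore
    yield noisy labels, as the notes point out."""
    tgt_labels = ["O"] * tgt_len
    for src_i, tgt_i in alignment:
        tgt_labels[tgt_i] = src_labels[src_i]
    return tgt_labels

print(project_labels(["B-LOC", "O", "O"], 3, parse_alignment("0-1 1-0 2-2")))
# ['O', 'B-LOC', 'O']
```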
Translation from a Similar Language
- requires 2 new resources
- annotated data in a similar Lan. S
- “it is likely that there exists an annotated dataset in a closer language.”
- a lexicon: S -> T (the target language)
- pivot through English: collect all English words that appear in both dictionaries => two sets of candidate translations (see the pivot sketch after this list)
- too many entries, some incorrect, but the correct entry is usually among them
- choice of the source language: guided by WALS (World Atlas of Language Structures)
- result
- 5.4 points higher than the English-source setting
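A sketch of the pivot idea as I read it: given S-English and T-English lexicons, link an S word to every T word that shares an English translation. The data structures are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Set

def pivot_lexicon(s_to_en: Dict[str, Set[str]],
                  t_to_en: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Build an S -> T candidate lexicon by pivoting through shared
    English translations. This produces many entries, some incorrect,
    but the correct translation is usually among the candidates."""
    en_to_t = defaultdict(set)  # invert the T -> English dictionary
    for t_word, en_words in t_to_en.items():
        for en in en_words:
            en_to_t[en].add(t_word)
    s_to_t: Dict[str, List[str]] = {}
    for s_word, en_words in s_to_en.items():
        candidates: Set[str] = set()
        for en in en_words:
            candidates |= en_to_t.get(en, set())
        if candidates:
            s_to_t[s_word] = sorted(candidates)
    return s_to_t
```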
Translation from a Similar Language + Wikifier Features
- result
- the best scores among the cross-lingual settings
Other Experiments
Dictionary Ablation: what effect does the size of the lexicon have on the end result?
- only a small number of dictionary entries are likely to be useful
- a small but carefully constructed manual dictionary could have a large impact.
Uyghur
- no wikifier features: the Uyghur Wikipedia is too small
- Dictionary: LORELEI + Rolston => 116k entries
- Name Substitution: replace untranslatable entities with a randomly selected NE from the gazetteer list for the corresponding tag => the sentence is not fluent, but the NEs are fluent in the target text (see the sketch after this list)
- Stemming: remove all possible suffixes
- omit the surface form feature for Bengali & Tamil
- result
- stemming helps => makes the features denser
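A hedged sketch of the name-substitution step; the gazetteer format (tag -> list of target-language names) and the `translatable` set are assumed shapes, not the authors' data structures.

```python
import random
from typing import Dict, List, Set

def substitute_names(tokens: List[str], labels: List[str],
                     translatable: Set[str],
                     gazetteer: Dict[str, List[str]]) -> List[str]:
    """Replace entity tokens the lexicon cannot translate with a random
    target-language NE of the same type. The sentence stays disfluent,
    but the entities themselves become fluent target-language names."""
    out = []
    for tok, lab in zip(tokens, labels):
        if lab != "O" and tok not in translatable:
            tag = lab.split("-")[-1]  # e.g. 'B-PER' -> 'PER'
            out.append(random.choice(gazetteer[tag]))
        else:
            out.append(tok)
    return out
```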
+ generate dictionaries using observations over Uyghur and Uzbek
- manual mapping: 100 words
- edit-distance mapping: a modified edit-distance algorithm
- cross-lingual CCA with word vectors
- use CCA to project Uyghur and Uzbek monolingual vectors into a shared semantic space (see the CCA sketch below)
- take the closest Uyghur word to each Uzbek word
- result
- each language-specific technique alone: not much gain
- combining ALL language-specific techniques: 10 points higher
- the different kinds of training data cover many angles
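A sketch of the CCA mapping with scikit-learn, using random toy vectors in place of real Uzbek/Uyghur embeddings and a seed list standing in for the 100 manually mapped word pairs; it illustrates the technique, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_uz = rng.normal(size=(100, 50))  # Uzbek vectors for seed pairs (toy data)
X_ug = rng.normal(size=(100, 50))  # Uyghur vectors for the same pairs

# Fit CCA on the seed translation pairs to learn a shared semantic space.
cca = CCA(n_components=20)
cca.fit(X_uz, X_ug)

def closest_uyghur(uz_vec, ug_vocab_vecs, ug_vocab):
    """Project both sides into the shared CCA space and return the
    Uyghur word nearest (by cosine similarity) to the Uzbek vector."""
    uz_c, ug_c = cca.transform(uz_vec.reshape(1, -1), ug_vocab_vecs)
    uz_c /= np.linalg.norm(uz_c)
    ug_c /= np.linalg.norm(ug_c, axis=1, keepdims=True)
    return ug_vocab[int(np.argmax(ug_c @ uz_c.T))]

ug_vocab = [f"ug_word_{i}" for i in range(200)]
ug_vecs = rng.normal(size=(200, 50))
print(closest_uyghur(rng.normal(size=50), ug_vecs, ug_vocab))
```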