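I have been following action classification for a while. After the VOC2010 results were released I skimmed through them: the submissions mostly rely on the same image features (dense SIFT + Spatial Pyramid), followed by all sorts of fusion schemes, which in the end boil down to Multiple Kernel Learning and a few approximations of it.
Here are the VOC2010 results for the action classification task: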
Average Precision (AP %)
Method | phoning | playing instrument | reading | riding bike | riding horse | running | taking photo | using computer | walking |
---|---|---|---|---|---|---|---|---|---|
BONN_ACTION | 47.5 | 51.1 | 31.9 | 64.5 | 69.1 | 78.5 | 32.4 | 53.9 | 61.1 |
CVC_BASE | 56.2 | 56.5 | 34.7 | 75.1 | 83.6 | 86.5 | 25.4 | 60.0 | 69.2 |
CVC_SEL | 49.8 | 52.8 | 34.3 | 74.2 | 85.5 | 85.1 | 24.9 | 64.1 | 72.5 |
INRIA_SPM_HT | 53.2 | 53.6 | 30.2 | 78.2 | 88.4 | 84.6 | 30.4 | 60.9 | 61.8 |
NUDT_SVM_WHGO_SIFT_CENTRIST_LLM | 47.2 | 47.9 | 24.5 | 74.2 | 81.0 | 79.5 | 24.9 | 58.6 | 71.5 |
SURREY_MK_KDA | 52.6 | 53.5 | 35.9 | 81.0 | 89.3 | 86.5 | 32.8 | 59.2 | 68.6 |
UCLEAR_SVM_DOSP_MULTFEATS | 47.0 | 57.8 | 26.9 | 78.8 | 89.7 | 87.3 | 32.5 | 60.0 | 70.1 |
UMCO_DHOG_KSVM | 53.5 | 43.0 | 32.0 | 67.9 | 68.8 | 83.0 | 34.1 | 45.9 | 60.4 |
WILLOW_A_SVMSIFT_1-A_LSVM | 49.2 | 37.7 | 22.2 | 73.2 | 77.1 | 81.7 | 24.3 | 53.7 | 56.9 |
WILLOW_LSVM | 40.4 | 29.9 | 32.2 | 53.5 | 62.2 | 73.6 | 17.6 | 45.8 | 41.5 |
WILLOW_SVMSIFT | 47.9 | 29.1 | 21.7 | 53.5 | 76.7 | 78.3 | 26.0 | 42.9 | 56.4 |
The results page also includes a description of each method further down.
First, the UCLEAR_SVM_DOSP_MULTFEATS method:
Multiple chi squared kernels are computed: spatial pyramid (SP) w/ dense SIFT, dense overlapping SP w/ HOG, texture filter, LAB values (bag-of-words w/ the above features) and edge dir hists. They are computed on full images, person bounding boxes (BB) and BB of the lower part (simple stretch-scale of person BB) expected to contain horse, bike etc. They are combined with class specific binary weights based on their perf on val set. Finally, class specific SVMs trained on train+val.
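To make the kernel combination step concrete, here is a minimal sketch of an exponential chi-squared kernel and a binary-weighted sum of per-feature Gram matrices. The submission does not say which chi-squared variant it uses, so the exponential form, the `gamma` parameter, the function names and the toy histograms are all assumptions made for illustration.

```python
import math

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel between two non-negative histograms.

    Assumed form: exp(-gamma * sum((x_i - y_i)^2 / (x_i + y_i))).
    """
    dist = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip bins that are empty in both histograms
            dist += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * dist)

def gram_matrix(histograms, gamma=1.0):
    """Pairwise kernel values for a list of histograms."""
    return [[chi2_kernel(u, v, gamma) for v in histograms] for u in histograms]

def combine_kernels(grams, weights):
    """Weighted sum of Gram matrices; 0/1 weights switch a channel off/on."""
    n = len(grams[0])
    combined = [[0.0] * n for _ in range(n)]
    for gram, w in zip(grams, weights):
        if w == 0:
            continue
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * gram[i][j]
    return combined

# Toy, made-up histograms for three images and two feature channels.
sift_hists = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.1, 0.9]]
hog_hists = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]

K_sift = gram_matrix(sift_hists)
K_hog = gram_matrix(hog_hists)
K_both = combine_kernels([K_sift, K_hog], weights=[1, 1])
K_sift_only = combine_kernels([K_sift, K_hog], weights=[1, 0])
```

In a setup like the one described, the binary weights would be picked per class from validation performance, and the combined matrix could then be handed to an SVM that accepts precomputed kernels (for example scikit-learn's `SVC(kernel="precomputed")`).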
Doesn't that look simple?
Next, the SURREY_MK_KDA method:
Kernel-level fusion with Spatial Pyramid Grids, Soft Assignment and Kernel Discriminant Analysis using spectral regression. 18 kernels have been generated from 18 variants of SIFT. Fusion again.
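The "Soft Assignment" part is easy to picture in code. The following is only a sketch of one common variant, in which each descriptor spreads a unit of mass over all codewords with Gaussian weights; the submission does not spell out its exact scheme, so the Gaussian weighting, `sigma`, and the toy codebook are assumptions.

```python
import math

def soft_assign_histogram(descriptors, codebook, sigma=1.0):
    """Bag-of-words histogram where each descriptor votes softly for every codeword."""
    hist = [0.0] * len(codebook)
    counted = 0
    for d in descriptors:
        # Gaussian weight of this descriptor for each codeword
        weights = []
        for c in codebook:
            sq_dist = sum((di - ci) ** 2 for di, ci in zip(d, c))
            weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        total = sum(weights)
        if total == 0.0:
            continue  # all weights underflowed; ignore this descriptor
        for k, w in enumerate(weights):
            hist[k] += w / total  # each descriptor contributes a total mass of 1
        counted += 1
    return [h / counted for h in hist] if counted else hist

# Toy 2-D "descriptors" and a two-word codebook (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]]
hist = soft_assign_histogram(descriptors, codebook)
```

Compared with hard assignment, a descriptor that sits between two codewords (like the third one here) no longer has to be forced into a single bin.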
The CVC_SEL method:
Enhanced CVC submission built upon CVC-BASE for action recognition. Standard BoW model over multiple features from CVC-BASE plus contextual object descriptors. Cross-validation procedure for action-specific feature and kernel selection. Foreground/background/neighborhood modeled separately, spatial pyramid over several features for foreground representation. Object detection based on deformable part-based detector incorporated. Late fusion of feature-specific SVM outputs for final action score.
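Late fusion, as opposed to kernel-level fusion, combines the outputs of separately trained classifiers. A rough sketch, assuming each feature-specific SVM has already produced one decision score per image: standardize each channel's scores and take a weighted average. The standardization step, the weights and the toy scores are my assumptions, not details from the submission.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Shift and scale one channel's scores to zero mean and unit variance."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sd for s in scores]

def late_fusion(channel_scores, weights=None):
    """Weighted average of standardized per-channel scores, one value per image."""
    if weights is None:
        weights = [1.0] * len(channel_scores)
    normalized = [standardize(s) for s in channel_scores]
    total_weight = sum(weights)
    n_images = len(channel_scores[0])
    return [
        sum(w * channel[i] for w, channel in zip(weights, normalized)) / total_weight
        for i in range(n_images)
    ]

# Made-up decision scores for four images from two hypothetical channels.
sift_scores = [1.2, -0.3, 0.4, -1.0]
context_scores = [0.8, 0.1, -0.2, -0.9]

fused = late_fusion([sift_scores, context_scores])
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

The fused scores are what would be ranked to compute AP for a given action class.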
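To sum up: a Spatial Pyramid w/ (dense SIFT | overlapping HOG) is the most effective way to describe the image. To use several of them together, fuse them with multiple kernels and learn the fusion weights. Honestly, it works very well.
So for problems of this kind, unless you really have to invent your own descriptors, these off-the-shelf ones are enough to meet typical experimental goals, and they are perfectly usable in practice too.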
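For readers who have not built one, a spatial pyramid is just a stack of bag-of-words histograms computed over finer and finer grids. A minimal sketch, assuming the descriptors have already been quantized to visual-word indices and their positions normalized to [0, 1); the per-level weights of the original pyramid match kernel are left out for brevity, and all names and toy values are hypothetical.

```python
def spatial_pyramid(points, words, vocab_size, levels=2):
    """Concatenate per-cell word histograms for grids of 1x1, 2x2, ... cells.

    points: (x, y) positions normalized to [0, 1)
    words:  visual-word index of the descriptor at each point
    """
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level  # grid is cells x cells at this level
        hist = [0.0] * (cells * cells * vocab_size)
        for (x, y), w in zip(points, words):
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            hist[(cy * cells + cx) * vocab_size + w] += 1.0
        feature.extend(hist)
    total = sum(feature)
    # L1-normalize so images with different numbers of descriptors are comparable
    return [v / total for v in feature] if total else feature

# Three made-up keypoints with a two-word vocabulary.
points = [(0.1, 0.1), (0.9, 0.9), (0.6, 0.2)]
words = [0, 1, 0]
feature = spatial_pyramid(points, words, vocab_size=2, levels=2)
```

With `levels=2` the vector has `vocab_size * (1 + 4 + 16)` entries. Histograms like this, one per feature type, are what the chi-squared kernels would be computed on before fusing them.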