Paper Notes: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields



Original paper: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
Source code: ZheC/Realtime_Multi-Person_Pose_Estimation
CMU-Perceptual-Computing-Lab/openpose

Highlights

Real-time multi-person pose estimation
PCM (Part Confidence Maps) + PAF (Part Affinity Fields)
Bipartite Matching

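Human pose estimation

Human pose estimation faces three challenges: (1) each image contains a varying number of people, each in a different pose; (2) occlusion and contact between limbs make the pose hard to estimate; (3) runtime tends to grow with the number of people in the image, which makes real-time estimation difficult.

There are two mainstream approaches: top-down and bottom-up. A top-down method first detects the people in the image and then estimates the pose of each detected person. A bottom-up method first detects body parts (joints) and then decides which person each part belongs to.

The model proposed in the paper is a bottom-up method. A network in the style of Convolutional Pose Machines produces heatmaps that locate the body parts, and a second CNN branch with the same architecture produces PAFs, a set of 2D vector fields that encode the location and orientation of limbs. Deciding which person each body part belongs to can then be cast as a classic bipartite matching problem. This sidesteps the NP-hard association problem that earlier bottom-up methods faced and makes real-time estimation possible. The overall pipeline is shown in Figure 1.

Figure 1. The input image (a) is fed to the model, which produces the PCM heatmaps (b) and the PAF vector fields (c). Bipartite matching (d) then associates body parts into limbs across multiple people, giving the final output (e).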

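Network structure

Given an input image of size $w\times h$ (after CNN feature extraction), a feed-forward pass predicts the 2D confidence maps $S$ (PCM) of body part locations and the set of affinity vector fields $L$ (PAF) between body parts. The set $S=(S_1,S_2,...,S_J)$ contains $J$ confidence maps, one per body part, each of size $w\times h$. The set $L=(L_1,L_2,...,L_C)$ contains $C$ vector fields, one per limb, each of size $w\times h\times2$. The network structure is shown in Figure 2.

Figure 2. Two-branch, multi-stage CNN architecture. In each stage the first branch predicts the confidence maps $S^t$ and the second branch predicts the PAFs $L^t$. After each stage, the features F and the two branch outputs of that stage are concatenated and used as the input to the next stage.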

$S^t=\rho^t(F),\quad t=1$

$L^t=\phi^t(F),\quad t=1$

$S^t=\rho^t(F,S^{t-1},L^{t-1}),\quad t\geq2$

$L^t=\phi^t(F,S^{t-1},L^{t-1}),\quad t\geq2$

Here $\rho^t$ and $\phi^t$ denote the CNN computations of stage $t$.
The loss functions are computed as follows:

$f_S^t=\sum_{j=1}^J\sum_pW(p)\cdot||S_j^t(p)-S_j^*(p)||_2^2$

$f_L^t=\sum_{c=1}^C\sum_pW(p)\cdot||L_c^t(p)-L_c^*(p)||_2^2$

$S_j^*$ is the ground-truth confidence map of body part $j$, $L_c^*$ is the ground-truth PAF of limb $c$, and $W$ is a binary mask with $W(p)=0$ when pixel $p$ has no annotation (some datasets do not label every person). The overall loss is:

$f=\sum_{t=1}^T(f_S^t+f_L^t)$

Because every stage contributes to the loss, this intermediate supervision helps prevent vanishing gradients.
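A minimal NumPy sketch of the masked loss, assuming predictions and labels are stored as `(h, w, channels)` arrays and the mask $W$ as an `(h, w)` array. The names `stage_loss` and `total_loss` are illustrative and do not come from the paper's code.

```python
import numpy as np

def stage_loss(pred, target, mask):
    """Sum over channels and pixels of W(p) * (pred - target)^2.

    pred, target: arrays of shape (h, w, channels)
    mask: array of shape (h, w), 0 where the annotation is missing
    """
    squared_error = (pred - target) ** 2
    return float(np.sum(mask[..., None] * squared_error))

def total_loss(S_preds, L_preds, S_gt, L_gt, mask):
    """Sum f_S^t + f_L^t over all stages t = 1..T."""
    return sum(
        stage_loss(S_t, S_gt, mask) + stage_loss(L_t, L_gt, mask)
        for S_t, L_t in zip(S_preds, L_preds)
    )

# Toy example: 2x2 image, one body part, one limb, two stages.
mask = np.array([[1.0, 1.0], [1.0, 0.0]])   # one unlabeled pixel
S_gt = np.zeros((2, 2, 1))
L_gt = np.zeros((2, 2, 2))
S_preds = [np.ones((2, 2, 1))] * 2
L_preds = [np.ones((2, 2, 2))] * 2
loss = total_loss(S_preds, L_preds, S_gt, L_gt, mask)
```

The masked pixel contributes nothing, so each stage adds 3 for the confidence map and 6 for the two PAF channels.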

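Additional model details found in the source code rather than in the paper:

F is obtained by passing the input image through the first 10 layers of VGG-19 and then through two $3\times3$ convolutional layers (the Convolutional Pose Machines structure). These two layers keep the spatial size of the feature map and only reduce the number of channels, from 512 to 128.
The PCM branch outputs 19 channels and the PAF branch outputs 38 channels, i.e. $J=19,C=19$ (two channels per limb). No stage changes the spatial size.
The number of stages is 6.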
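To make the stage wiring concrete, the sketch below tracks tensor shapes only. `dummy_branch` is a hypothetical placeholder standing in for $\rho^t$ and $\phi^t$ (it just returns zeros), and a channels-last layout is assumed. The channel counts are the ones listed above.

```python
import numpy as np

J, C = 19, 19        # 19 PCM channels, 19 limbs (38 PAF channels)
NUM_STAGES = 6

def stage_input(F, S_prev=None, L_prev=None):
    """Stage 1 sees only F; later stages see F stacked with S^{t-1} and L^{t-1}."""
    if S_prev is None:
        return F
    return np.concatenate([F, S_prev, L_prev], axis=-1)

def dummy_branch(x, out_channels):
    """Placeholder for a real CNN branch: keeps spatial size, sets channel count."""
    return np.zeros(x.shape[:2] + (out_channels,), dtype=x.dtype)

def run_stages(F, num_stages=NUM_STAGES):
    S = L = None
    input_channels = []
    for _ in range(num_stages):
        x = stage_input(F, S, L)           # both branches share this input
        input_channels.append(x.shape[-1])
        S = dummy_branch(x, J)             # confidence maps S^t
        L = dummy_branch(x, 2 * C)         # PAFs L^t (x and y per limb)
    return S, L, input_channels

F = np.zeros((46, 46, 128))                # 128-channel feature map
S, L, input_channels = run_stages(F)
```

With a 128-channel F, stage 1 receives 128 input channels and stages 2 to 6 receive 128 + 19 + 38 = 185.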

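PCM label generation for the dataset

The formula is:

$S_{j,k}^*(p)=\exp(-\frac{||p-x_{j,k}||_2^2}{\sigma^2})$

$S_{j,k}^*(p)$ is the confidence of body part $j$ of person $k$ at pixel $p$, and $x_{j,k}$ is the ground-truth position of body part $j$ of person $k$. In other words, each PCM is a Gaussian-shaped peak centered on the ground-truth position of the body part. When a pixel receives confidence values from the same body part of several people, the maximum is taken rather than the average: $S_j^*(p)=\max_kS_{j,k}^*(p)$. Averaging would flatten the peaks, whereas taking the maximum keeps the values near each peak accurate, as shown in Figure 3.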

Figure 3
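A short NumPy sketch of this label generation for one body part, assuming keypoints are given as `(x, y)` pixel coordinates. The function name `confidence_map` and the toy coordinates are illustrative.

```python
import numpy as np

def confidence_map(h, w, keypoints, sigma):
    """Ground-truth PCM for one body part.

    keypoints: list of (x, y) positions, one per person.
    Each person contributes exp(-||p - x||^2 / sigma^2); people are
    combined with a pixel-wise maximum, not an average.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    result = np.zeros((h, w))
    for x, y in keypoints:
        squared_dist = (xs - x) ** 2 + (ys - y) ** 2
        result = np.maximum(result, np.exp(-squared_dist / sigma ** 2))
    return result

# Two people whose same body part lies on the same image row.
pcm = confidence_map(20, 30, [(5, 10), (20, 10)], sigma=3.0)
```

Both peaks keep the value 1 at their centers, and a pixel between them takes the value of the nearer peak.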

PAF label generation for the dataset

A PAF is a 2D vector field for each limb: for every pixel lying on the limb, the 2D vector at that pixel encodes the strength and direction of the affinity between the two body parts that the limb connects.
For a single limb, let $x_{j_1,k}$ and $x_{j_2,k}$ be the ground-truth positions of body parts $j_1$ and $j_2$ of person $k$. If $p$ lies on the limb, then $L_{c,k}^*(p)=v$, where $v=(x_{j_2,k}-x_{j_1,k})/||x_{j_2,k}-x_{j_1,k}||_2$ is the unit vector along the limb. If $p$ does not lie on the limb, $L_{c,k}^*(p)$ is the zero vector.
A pixel $p$ is considered to lie on the limb when both of the following conditions hold:

$0\leq v\cdot(p-x_{j_1,k})\leq||x_{j_2,k}-x_{j_1,k}||_2$ (along the limb)

$|v_\bot\cdot(p-x_{j_1,k})|\leq\sigma_l$ (perpendicular to the limb)

where $v_\bot$ is the unit vector perpendicular to $v$ and $\sigma_l$ is a preset limb-width threshold.
When an image contains several people, the fields are averaged:

$L_{c}^*(p)=\frac{1}{n_c(p)}\sum_kL_{c,k}^*(p)$

$n_c(p)$ is the number of non-zero vectors at pixel $p$ across all people.
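The PAF label rules can be sketched in NumPy as below. This assumes `(x, y)` pixel coordinates and uses the absolute perpendicular distance as the limb-width test. The names `limb_paf` and `averaged_paf` are illustrative, not taken from the source repository.

```python
import numpy as np

def limb_paf(h, w, x1, x2, sigma_l):
    """PAF of one person's limb from x1 to x2; returns (field, on_limb mask)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    length = np.linalg.norm(x2 - x1)
    v = (x2 - x1) / length                 # unit vector along the limb
    v_perp = np.array([-v[1], v[0]])       # unit vector perpendicular to it
    ys, xs = np.mgrid[0:h, 0:w]
    rel = np.stack([xs - x1[0], ys - x1[1]], axis=-1)   # p - x1, shape (h, w, 2)
    along = rel @ v
    across = rel @ v_perp
    on_limb = (along >= 0) & (along <= length) & (np.abs(across) <= sigma_l)
    return on_limb[..., None] * v, on_limb

def averaged_paf(h, w, limbs, sigma_l):
    """Average the per-person fields over the people whose limb covers each pixel."""
    total = np.zeros((h, w, 2))
    count = np.zeros((h, w))
    for x1, x2 in limbs:
        field, on_limb = limb_paf(h, w, x1, x2, sigma_l)
        total += field
        count += on_limb
    return total / np.maximum(count, 1)[..., None]   # avoid division by zero

# Two people whose limbs cross: one horizontal, one vertical.
paf = averaged_paf(11, 11, [((2, 5), (8, 5)), ((5, 2), (5, 8))], sigma_l=1.0)
```

At the crossing pixel both limbs contribute, so the label is the average of the two unit vectors. Elsewhere on a limb the label is that limb's unit vector.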

Inference algorithm

Suppose there are only two predicted candidate locations $d_{j_1}$ and $d_{j_2}$. The confidence $E$ of the association between these two points is defined as:

$E=\int_{u=0}^{u=1}L_c(p(u))\cdot\frac{d_{j_2}-d_{j_1}}{||d_{j_2}-d_{j_1}||_2}du$

where

$p(u)=(1-u)d_{j_1}+ud_{j_2}$

In practice, the integral is usually approximated by summing over equally spaced sample points between the two locations.
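A small sketch of the sampled approximation of $E$, assuming the PAF of one limb type is stored as an `(h, w, 2)` array holding the x and y components and that candidates are `(x, y)` coordinates. `association_score` and the default of 10 samples are choices made for this illustration.

```python
import numpy as np

def association_score(paf, d1, d2, num_samples=10):
    """Approximate E by averaging PAF . unit(d2 - d1) at evenly spaced points."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    norm = np.linalg.norm(d2 - d1)
    if norm == 0:
        return 0.0                          # coincident candidates: no direction
    unit = (d2 - d1) / norm
    us = np.linspace(0.0, 1.0, num_samples)
    points = (1 - us)[:, None] * d1 + us[:, None] * d2   # p(u) for each u
    cols = np.rint(points[:, 0]).astype(int)
    rows = np.rint(points[:, 1]).astype(int)
    samples = paf[rows, cols]               # shape (num_samples, 2)
    return float(np.mean(samples @ unit))

# A field that points in +x everywhere.
paf = np.zeros((10, 10, 2))
paf[..., 0] = 1.0
aligned = association_score(paf, (1, 5), (8, 5))
reversed_dir = association_score(paf, (8, 5), (1, 5))
perpendicular = association_score(paf, (5, 1), (5, 8))
```

A candidate pair aligned with the field scores high, a reversed pair scores negative, and a perpendicular pair scores zero, which is why the direction of the PAF helps reject false connections.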
For multi-person estimation, candidate body part locations are first selected with non-maximum suppression. The source code is as follows:

# Adapted to Python 3 syntax; expects heatmap_avg and param from the earlier steps.
import numpy as np
from scipy.ndimage import gaussian_filter

print(heatmap_avg.shape)
# plt.imshow(heatmap_avg[:,:,2])

all_peaks = []
peak_counter = 0

for part in range(19-1):
    x_list = []
    y_list = []
    map_ori = heatmap_avg[:,:,part]
    map = gaussian_filter(map_ori, sigma=3)

    # shifted copies used to compare each pixel with its four neighbours
    map_left = np.zeros(map.shape)
    map_left[1:,:] = map[:-1,:]
    map_right = np.zeros(map.shape)
    map_right[:-1,:] = map[1:,:]
    map_up = np.zeros(map.shape)
    map_up[:,1:] = map[:,:-1]
    map_down = np.zeros(map.shape)
    map_down[:,:-1] = map[:,1:]

    # a peak is >= all four neighbours and above the threshold thre1
    peaks_binary = np.logical_and.reduce((map>=map_left, map>=map_right, map>=map_up, map>=map_down, map > param['thre1']))
    peaks = list(zip(np.nonzero(peaks_binary)[1], np.nonzero(peaks_binary)[0])) # note reverse
    peaks_with_score = [x + (map_ori[x[1],x[0]],) for x in peaks]
    id = range(peak_counter, peak_counter + len(peaks))
    peaks_with_score_and_id = [peaks_with_score[i] + (id[i],) for i in range(len(id))]

    all_peaks.append(peaks_with_score_and_id)
    peak_counter += len(peaks)
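The peak test in the code above can be tried on a synthetic heatmap with the self-contained sketch below. It keeps the same rule (a pixel is a peak if it is at least as large as its four neighbours and above a threshold) but omits the Gaussian smoothing, and `find_peaks` is a name chosen for this illustration.

```python
import numpy as np

def find_peaks(heatmap, threshold):
    """Return (x, y, score) for every 4-neighbour local maximum above threshold."""
    padded = np.pad(heatmap, 1, mode="constant", constant_values=0.0)
    center = padded[1:-1, 1:-1]
    is_peak = (
        (center >= padded[:-2, 1:-1])      # neighbour in the previous row
        & (center >= padded[2:, 1:-1])     # neighbour in the next row
        & (center >= padded[1:-1, :-2])    # neighbour in the previous column
        & (center >= padded[1:-1, 2:])     # neighbour in the next column
        & (center > threshold)
    )
    rows, cols = np.nonzero(is_peak)
    return [(int(c), int(r), float(heatmap[r, c])) for r, c in zip(rows, cols)]

heatmap = np.zeros((7, 7))
heatmap[2, 3] = 0.9     # strong peak
heatmap[2, 4] = 0.5     # shoulder of the strong peak, not a local maximum
heatmap[5, 1] = 0.6     # second peak
peaks = find_peaks(heatmap, threshold=0.1)
```

As in the original code, coordinates are returned as `(x, y)`, i.e. column first, which is the reverse of NumPy's `(row, column)` indexing.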

At this point each body part may have several candidate locations, $D_J=\{d_j^m: j\in\{1...J\},m\in\{1...N_j\}\}$, where $N_j$ is the number of candidate locations for body part $j$. Associating the candidates can be cast as a bipartite matching problem.
For the $c$-th limb:

$\max E_c=\max\sum_{m\in D_{j_1}}\sum_{n\in D_{j_2}}E_{mn}\cdot z_{j_1j_2}^{mn}$

s.t.

$\forall m\in D_{j_1},\sum_{n\in D_{j_2}}z_{j_1j_2}^{mn}\leq1$

$\forall n\in D_{j_2},\sum_{m\in D_{j_1}}z_{j_1j_2}^{mn}\leq1$

$z_{j_1j_2}^{mn}\in \{0,1\}$ indicates whether the candidates $d_{j_1}^m$ and $d_{j_2}^n$ are connected, and $E_{mn}$ is the confidence of that connection. The two constraints ensure that each candidate takes part in at most one limb of type $c$.
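The paper solves this bipartite matching problem with the Hungarian algorithm. Note that the demo code does not call a Hungarian solver: it sorts the candidate connections by score and accepts them greedily. The source code is as follows: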

# Adapted to Python 3 syntax; expects paf_avg, all_peaks, oriImg and param from the earlier steps.
import math
import numpy as np

# find connection in the specified sequence, center 29 is in the position 15
limbSeq = [[2,3], [2,6], [3,4], [4,5], [6,7], [7,8], [2,9], [9,10],
           [10,11], [2,12], [12,13], [13,14], [2,1], [1,15], [15,17],
           [1,16], [16,18], [3,17], [6,18]]
# the middle joints heatmap correspondence
mapIdx = [[31,32], [39,40], [33,34], [35,36], [41,42], [43,44], [19,20], [21,22],
          [23,24], [25,26], [27,28], [29,30], [47,48], [49,50], [53,54], [51,52],
          [55,56], [37,38], [45,46]]

connection_all = []
special_k = []
mid_num = 10

for k in range(len(mapIdx)):
    score_mid = paf_avg[:,:,[x-19 for x in mapIdx[k]]]
    candA = all_peaks[limbSeq[k][0]-1]
    candB = all_peaks[limbSeq[k][1]-1]
    nA = len(candA)
    nB = len(candB)
    indexA, indexB = limbSeq[k]
    if(nA != 0 and nB != 0):
        connection_candidate = []
        for i in range(nA):
            for j in range(nB):
                # unit vector from candidate A to candidate B
                vec = np.subtract(candB[j][:2], candA[i][:2])
                norm = math.sqrt(vec[0]*vec[0] + vec[1]*vec[1])
                vec = np.divide(vec, norm)

                # mid_num equally spaced sample points between the two candidates
                startend = list(zip(np.linspace(candA[i][0], candB[j][0], num=mid_num),
                                    np.linspace(candA[i][1], candB[j][1], num=mid_num)))

                vec_x = np.array([score_mid[int(round(startend[I][1])), int(round(startend[I][0])), 0]
                                  for I in range(len(startend))])
                vec_y = np.array([score_mid[int(round(startend[I][1])), int(round(startend[I][0])), 1]
                                  for I in range(len(startend))])

                # sampled approximation of the line integral E
                score_midpts = np.multiply(vec_x, vec[0]) + np.multiply(vec_y, vec[1])
                score_with_dist_prior = sum(score_midpts)/len(score_midpts) + min(0.5*oriImg.shape[0]/norm-1, 0)
                criterion1 = len(np.nonzero(score_midpts > param['thre2'])[0]) > 0.8 * len(score_midpts)
                criterion2 = score_with_dist_prior > 0
                if criterion1 and criterion2:
                    connection_candidate.append([i, j, score_with_dist_prior, score_with_dist_prior+candA[i][2]+candB[j][2]])

        # greedy assignment: best score first, each candidate used at most once
        connection_candidate = sorted(connection_candidate, key=lambda x: x[2], reverse=True)
        connection = np.zeros((0,5))
        for c in range(len(connection_candidate)):
            i,j,s = connection_candidate[c][0:3]
            if(i not in connection[:,3] and j not in connection[:,4]):
                connection = np.vstack([connection, [candA[i][3], candB[j][3], s, i, j]])
                if(len(connection) >= min(nA, nB)):
                    break

        connection_all.append(connection)
    else:
        special_k.append(k)
        connection_all.append([])

# last number in each row is the total parts number of that person
# the second last number in each row is the score of the overall configuration
subset = -1 * np.ones((0, 20))
candidate = np.array([item for sublist in all_peaks for item in sublist])

for k in range(len(mapIdx)):
    if k not in special_k:
        partAs = connection_all[k][:,0]
        partBs = connection_all[k][:,1]
        indexA, indexB = np.array(limbSeq[k]) - 1

        for i in range(len(connection_all[k])): #= 1:size(temp,1)
            found = 0
            subset_idx = [-1, -1]
            for j in range(len(subset)): #1:size(subset,1):
                if subset[j][indexA] == partAs[i] or subset[j][indexB] == partBs[i]:
                    subset_idx[found] = j
                    found += 1

            if found == 1:
                j = subset_idx[0]
                if(subset[j][indexB] != partBs[i]):
                    subset[j][indexB] = partBs[i]
                    subset[j][-1] += 1
                    subset[j][-2] += candidate[partBs[i].astype(int), 2] + connection_all[k][i][2]
            elif found == 2: # if found 2 and disjoint, merge them
                j1, j2 = subset_idx
                print("found = 2")
                membership = ((subset[j1]>=0).astype(int) + (subset[j2]>=0).astype(int))[:-2]
                if len(np.nonzero(membership == 2)[0]) == 0: #merge
                    subset[j1][:-2] += (subset[j2][:-2] + 1)
                    subset[j1][-2:] += subset[j2][-2:]
                    subset[j1][-2] += connection_all[k][i][2]
                    subset = np.delete(subset, j2, 0)
                else: # as like found == 1
                    subset[j1][indexB] = partBs[i]
                    subset[j1][-1] += 1
                    subset[j1][-2] += candidate[partBs[i].astype(int), 2] + connection_all[k][i][2]

            # if find no partA in the subset, create a new subset
            elif not found and k < 17:
                row = -1 * np.ones(20)
                row[indexA] = partAs[i]
                row[indexB] = partBs[i]
                row[-1] = 2
                row[-2] = sum(candidate[connection_all[k][i,:2].astype(int), 2]) + connection_all[k][i][2]
                subset = np.vstack([subset, row])
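The paper formulates per-limb association as maximum-weight bipartite matching, while the loop above accepts candidate connections in descending score order. The toy sketch below contrasts the two on a hand-picked score matrix. `greedy_match` and `optimal_match` are illustrative names, and the exact solver here is a brute-force search that is only practical for tiny inputs. For real workloads a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations

def greedy_match(scores):
    """Accept the highest-scoring pairs first; each candidate is used at most once."""
    pairs = sorted(
        ((s, m, n) for m, row in enumerate(scores) for n, s in enumerate(row) if s > 0),
        reverse=True,
    )
    used_m, used_n, chosen = set(), set(), []
    for s, m, n in pairs:
        if m not in used_m and n not in used_n:
            chosen.append((m, n))
            used_m.add(m)
            used_n.add(n)
    return sorted(chosen)

def optimal_match(scores):
    """Exhaustively find the matching with the largest total score.

    Assumes len(scores) <= len(scores[0]); pairs with non-positive
    score are left unmatched.
    """
    rows, cols = len(scores), len(scores[0])
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(cols), rows):
        pairs = [(m, n) for m, n in enumerate(perm) if scores[m][n] > 0]
        total = sum(scores[m][n] for m, n in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs

# scores[m][n] = E_mn between candidate m of part j1 and candidate n of part j2
scores = [[0.9, 0.8],
          [0.8, 0.1]]
greedy = greedy_match(scores)
optimal = optimal_match(scores)
```

Here the greedy choice locks in the 0.9 pair and is left with 0.1 (total 1.0), while the optimal matching takes the two 0.8 pairs (total 1.6). Greedy selection is simpler and faster, and the two agree whenever the top-scoring pairs do not compete for the same candidate.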

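Results on MPII

Table 1. Comparison of different methods on the test subset and on the full test set.

Figure 4. mAP curves over different PCKh thresholds.
At the PCKh-0.5 threshold, PAFs achieve an mAP 2.9% higher than the one-midpoint representation and 2.3% higher than the two-midpoints representation. This is because PAFs use both position and orientation information, which helps in images where people overlap. Masking the unlabeled regions of the image during training improves mAP by 2.3%, because it avoids penalizing correct predictions in the loss. With PAFs, the mAP is close to the one obtained with ground-truth connections (79.4% vs. 81.6%).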

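Results on COCO

Figure 5. AP results and runtime on the COCO dataset.
The data in Figure 5(d) were obtained as follows: the original image is $1080\times1920$ and is resized to $368\times654$; the GPU is an NVIDIA GeForce GTX-1080. The results show that the runtime of the top-down method increases markedly with the number of people in the image, whereas the runtime of the bottom-up method grows comparatively slowly.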