MMPose-RTMO Inference Explained and Deployment (Part 2)


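- Foreword
- I. RTMO Inference (Python)
  - 1. RTMO Prediction
  - 2. RTMO Preprocessing
  - 3. RTMO Postprocessing
  - 4. RTMO Inference
- II. RTMO Inference (C++)
  - 1. ONNX Export
  - 2. RTMO Preprocessing
  - 3. RTMO Postprocessing
  - 4. RTMO Inference
- III. RTMO Deployment
  - 1. Source Download
  - 2. Environment Setup
    - 2.1 Configuring CMakeLists.txt
    - 2.2 Configuring the Makefile
  - 3. ONNX Export
  - 4. Engine Generation
  - 5. Source Modifications
  - 6. Running
- Conclusion
- Download Links
- References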

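Foreword

In MMPose-RTMO Inference Explained and Deployment (Part 1) we covered how to export the RTMO ONNX model. This article looks at how to run that model on TensorRT and get results.

Note: before starting, please follow Part 1 to set up the environment and export the RTMO ONNX model; I will not repeat those steps here.

Reference: https://github.com/shouxieai/tensorRT_Pro

Implementation: https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8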

[Figure: RTMO results]

[Figure: RTMO network architecture]

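I. RTMO Inference (Python)

1. RTMO Prediction

Let's first try running the official pretrained weights on an image and saving the result, to check that everything works.

Put the downloaded pretrained weights in the mmpose project directory, then run inference with the following command: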

python demo/inferencer_demo.py ./tests/data/coco/000000197388.jpg --pose2d ./configs/body_2d_keypoint/rtmo/body7/rtmo-s_8xb32-600e_body7-640x640.py --pose2d-weights ./rtmo-s_8xb32-600e_body7-640x640-dac2bf74_20231211.pth --vis-out-dir vis_results

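Note: the config file passed to --pose2d must match the weights passed to --pose2d-weights.

The output is as follows:

[Figure: console output of the inference command]

Inference succeeds and the result is saved to vis_results/000000197388.jpg, as shown below:

[Figure: visualized inference result]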

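2. RTMO Preprocessing

With inference working, the next step is to work out RTMO's preprocessing and postprocessing so that we can reimplement them in C++. We start with preprocessing.

Let's debug demo/inferencer_demo.py:

[Figure: debugging demo/inferencer_demo.py]

I use VS Code for debugging; the launch.json file looks like this: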

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "cwd": "${workspaceFolder}",
            "console": "integratedTerminal",
            "args": [
                "./tests/data/coco/000000000785.jpg",
                "--pose2d", "./configs/body_2d_keypoint/rtmo/body7/rtmo-s_8xb32-600e_body7-640x640.py",
                "--pose2d-weights", "./rtmo-s_8xb32-600e_body7-640x640-dac2bf74_20231211.pth",
                "--vis-out-dir", "vis_results",
                // "--show"
            ],
            "justMyCode": true,
        }
    ]
}

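The demo creates an inferencer through MMPoseInferencer and invokes its __call__ method, but a breakpoint set there is never hit, so we cannot step into it.

Following the official mmpose documentation, I wrote a simpler inference script that can be debugged. See https://mmpose.readthedocs.io/en/latest/user_guides/inference.html for details.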

from mmpose.apis import MMPoseInferencer

img_path = 'tests/data/coco/000000000785.jpg'   # replace this with your own image path

# instantiate the inferencer from a config file and its matching weights
pose2d = "./configs/body_2d_keypoint/rtmo/body7/rtmo-s_8xb32-600e_body7-640x640.py"
pose2d_weights = "./rtmo-s_8xb32-600e_body7-640x640-dac2bf74_20231211.pth"
inferencer = MMPoseInferencer(pose2d, pose2d_weights)

# The MMPoseInferencer API employs a lazy inference approach,
# creating a prediction generator when given input
result_generator = inferencer(img_path, show=False)
result = next(result_generator)

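Debugging this script works as expected:

[Figure: debugger stopped inside the new script]

This time the call does reach MMPoseInferencer.__call__. Let's keep stepping.

__call__ then invokes self.preprocess to do the preprocessing.

Stepping in shows that this is actually the preprocess function of the parent class BaseMMPoseInferencer.

That function in turn calls preprocess_single in the Pose2DInferencer class; it is wrappers all the way down.

Continuing, LoadImage calls the transform function of its parent class.

After that we reach the part that actually performs some of the preprocessing. The code is as follows: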

def transform(self, results: Dict) -> Optional[dict]:
    """The transform function of :class:`BottomupResize` to perform
    photometric distortion on images.

    See ``transform()`` method of :class:`BaseTransform` for details.

    Args:
        results (dict): Result dict from the data pipeline.

    Returns:
        dict: Result dict with images distorted.
    """

    img = results['img']
    img_h, img_w = results['ori_shape']
    w, h = self.input_size

    input_sizes = [(w, h)]
    if self.aug_scales:
        input_sizes += [(int(w * s), int(h * s)) for s in self.aug_scales]

    imgs = []
    for i, (_w, _h) in enumerate(input_sizes):

        actual_input_size, padded_input_size = self._get_input_size(
            img_size=(img_w, img_h), input_size=(_w, _h))

        if self.use_udp:
            center = np.array([(img_w - 1.0) / 2, (img_h - 1.0) / 2],
                              dtype=np.float32)
            scale = np.array([img_w, img_h], dtype=np.float32)
            warp_mat = get_udp_warp_matrix(
                center=center,
                scale=scale,
                rot=0,
                output_size=actual_input_size)
        else:
            center = np.array([img_w / 2, img_h / 2], dtype=np.float32)
            scale = np.array([
                img_w * padded_input_size[0] / actual_input_size[0],
                img_h * padded_input_size[1] / actual_input_size[1]
            ],
                             dtype=np.float32)
            warp_mat = get_warp_matrix(
                center=center,
                scale=scale,
                rot=0,
                output_size=padded_input_size)

        _img = cv2.warpAffine(
            img,
            warp_mat,
            padded_input_size,
            flags=cv2.INTER_LINEAR,
            borderValue=self.pad_val)

        imgs.append(_img)

        # Store the transform information w.r.t. the main input size
        if i == 0:
            results['img_shape'] = padded_input_size[::-1]
            results['input_center'] = center
            results['input_scale'] = scale
            results['input_size'] = padded_input_size

    if self.aug_scales:
        results['img'] = imgs
        results['aug_scales'] = self.aug_scales
    else:
        results['img'] = imgs[0]
        results['aug_scale'] = None

    return results

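Debugging through it (details omitted… 😄) shows that this step is essentially a warpAffine that resizes the image to 640x640x3.

OK, let's keep going.

After the warpAffine, the transform function of the PackPoseInputs class is called, which in turn calls image_to_tensor, shown below: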

def image_to_tensor(img: Union[np.ndarray,
                               Sequence[np.ndarray]]) -> torch.torch.Tensor:
    """Translate image or sequence of images to tensor. Multiple image tensors
    will be stacked.

    Args:
        value (np.ndarray | Sequence[np.ndarray]): The original image or
            image sequence

    Returns:
        torch.Tensor: The output tensor.
    """

    if isinstance(img, np.ndarray):
        if len(img.shape) < 3:
            img = np.expand_dims(img, -1)

        img = np.ascontiguousarray(img)
        tensor = torch.from_numpy(img).permute(2, 0, 1).contiguous()
    else:
        assert is_seq_of(img, np.ndarray)
        tensor = torch.stack([image_to_tensor(_img) for _img in img])

    return tensor

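This step performs the h,w,c→c,h,w transpose and the to_tensor conversion, giving a 3x640x640 tensor. Let's keep debugging.

Further debugging (details omitted… 😄) shows that this is all the preprocess function does; the remaining preprocessing happens inside the forward function.

The forward method of the Pose2DInferencer class calls test_step to do the remaining preprocessing:

[Figure: the test_step call inside Pose2DInferencer.forward]

What actually runs is the forward of the parent class ImgDataPreprocessor. It is wrapped at a lower level, so we cannot step into it in the debugger. The code is as follows: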

def forward(self, data: dict, training: bool = False) -> Union[dict, list]:
    """Performs normalization, padding and bgr2rgb conversion based on
    ``BaseDataPreprocessor``.

    Args:
        data (dict): Data sampled from dataset. If the collate
            function of DataLoader is :obj:`pseudo_collate`, data will be a
            list of dict. If collate function is :obj:`default_collate`,
            data will be a tuple with batch input tensor and list of data
            samples.
        training (bool): Whether to enable training time augmentation. If
            subclasses override this method, they can perform different
            preprocessing strategies for training and testing based on the
            value of ``training``.

    Returns:
        dict or list: Data in the same format as the model input.
    """
    data = self.cast_data(data)  # type: ignore
    _batch_inputs = data['inputs']
    # Process data with `pseudo_collate`.
    if is_seq_of(_batch_inputs, torch.Tensor):
        batch_inputs = []
        for _batch_input in _batch_inputs:
            # channel transform
            if self._channel_conversion:
                _batch_input = _batch_input[[2, 1, 0], ...]
            # Convert to float after channel conversion to ensure
            # efficiency
            _batch_input = _batch_input.float()
            # Normalization.
            if self._enable_normalize:
                if self.mean.shape[0] == 3:
                    assert _batch_input.dim(
                    ) == 3 and _batch_input.shape[0] == 3, (
                        'If the mean has 3 values, the input tensor '
                        'should in shape of (3, H, W), but got the tensor '
                        f'with shape {_batch_input.shape}')
                _batch_input = (_batch_input - self.mean) / self.std
            batch_inputs.append(_batch_input)
        # Pad and stack Tensor.
        batch_inputs = stack_batch(batch_inputs, self.pad_size_divisor,
                                   self.pad_value)
    # Process data with `default_collate`.
    elif isinstance(_batch_inputs, torch.Tensor):
        assert _batch_inputs.dim() == 4, (
            'The input of `ImgDataPreprocessor` should be a NCHW tensor '
            'or a list of tensor, but got a tensor with shape: '
            f'{_batch_inputs.shape}')
        if self._channel_conversion:
            _batch_inputs = _batch_inputs[:, [2, 1, 0], ...]
        # Convert to float after channel conversion to ensure
        # efficiency
        _batch_inputs = _batch_inputs.float()
        if self._enable_normalize:
            _batch_inputs = (_batch_inputs - self.mean) / self.std
        h, w = _batch_inputs.shape[2:]
        target_h = math.ceil(
            h / self.pad_size_divisor) * self.pad_size_divisor
        target_w = math.ceil(
            w / self.pad_size_divisor) * self.pad_size_divisor
        pad_h = target_h - h
        pad_w = target_w - w
        batch_inputs = F.pad(_batch_inputs, (0, pad_w, 0, pad_h),
                             'constant', self.pad_value)
    else:
        raise TypeError('Output of `cast_data` should be a dict of '
                        'list/tuple with inputs and data_samples, '
                        f'but got {type(data)}: {data}')
    data['inputs'] = batch_inputs
    data.setdefault('data_samples', None)
    return data

There are two branches here; ours is evidently the first one (a guess… 😄). What does it do? Mainly:

_batch_input = _batch_input[[2, 1, 0], ...]: bgr→rgb
_batch_input = _batch_input.float(): uint8→float
batch_inputs = stack_batch(batch_inputs, self.pad_size_divisor, self.pad_value): adds the batch dimension
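To make these steps concrete, here is a minimal NumPy sketch that mirrors them on a dummy image. It is only an illustration under the assumption that normalization is effectively a no-op (as the later summary, with no /255.0, suggests); the function name to_model_input is hypothetical and not part of mmpose.

```python
import numpy as np

def to_model_input(img_hwc_bgr: np.ndarray) -> np.ndarray:
    """Mimic the tensor-side preprocessing on an already warped HWC BGR uint8 image."""
    chw = np.ascontiguousarray(img_hwc_bgr.transpose(2, 0, 1))  # h,w,c -> c,h,w
    rgb = chw[[2, 1, 0], ...]                                   # bgr -> rgb
    rgb = rgb.astype(np.float32)                                # uint8 -> float32
    return rgb[None]                                            # add batch dim

# Dummy 640x640 image: B channel = 10, G = 0, R = 30
dummy = np.zeros((640, 640, 3), dtype=np.uint8)
dummy[..., 0] = 10
dummy[..., 2] = 30

batch = to_model_input(dummy)
print(batch.shape, batch.dtype)
```

After the channel swap, index 0 of the channel axis holds the original R values and index 2 the original B values, and the pixel range stays 0 to 255.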

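The input finally returned is shown below:

[Figure: the preprocessed input tensor in the debugger]

Its shape is 1x3x640x640 and its dtype is float32.

Stepping further, execution enters the network's forward pass, so the above is the whole of RTMO's preprocessing.

To summarize, RTMO's preprocessing does the following:

1. warpAffine
2. bgr→rgb
3. h,w,c→c,h,w
4. c,h,w→b,c,h,w

This differs slightly from the usual preprocessing in that there is no /255.0 step; everything else is the same. If you know YOLOv5's preprocessing, you will notice that RTMO's is almost identical, so the corresponding code is easy to write: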

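import cv2
import numpy as np
import torch

def preprocess_warpAffine(image, dst_width=640, dst_height=640):
    # Uniform scale that fits the image inside dst while keeping the aspect ratio
    scale = min((dst_width / image.shape[1], dst_height / image.shape[0]))
    # Offsets that center the scaled image in the destination canvas
    ox = (dst_width  - scale * image.shape[1]) / 2
    oy = (dst_height - scale * image.shape[0]) / 2
    M = np.array([
        [scale, 0, ox],
        [0, scale, oy]
    ], dtype=np.float32)

    img_pre = cv2.warpAffine(image, M, (dst_width, dst_height), flags=cv2.INTER_LINEAR,
                             borderMode=cv2.BORDER_CONSTANT, borderValue=(114, 114, 114))
    # Inverse matrix, used later to map predictions back to the original image
    IM = cv2.invertAffineTransform(M)
    # cv2.imwrite("img_pre.jpg", img_pre)

    img_pre = (img_pre[...,::-1]).astype(np.float32)   # bgr -> rgb, no /255.0
    img_pre = img_pre.transpose(2, 0, 1)[None]         # h,w,c -> b,c,h,w
    img_pre = torch.from_numpy(np.ascontiguousarray(img_pre))
    return img_pre, IM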

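warpAffine is well suited to CUDA acceleration. For the details of the warpAffine transform, see YOLOv5 Inference Explained and High-Performance Preprocessing; I won't repeat them here.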
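As a small worked example of the matrix M and its inverse IM, the sketch below computes both in pure Python for a hypothetical 1280x720 image (the image size and the helper names are assumptions chosen for illustration). Because M is only a uniform scale plus a translation, the inverse can be written in closed form.

```python
def letterbox_matrix(src_w, src_h, dst_w=640, dst_h=640):
    """2x3 matrix that scales the image uniformly and centers it in dst."""
    scale = min(dst_w / src_w, dst_h / src_h)
    ox = (dst_w - scale * src_w) / 2
    oy = (dst_h - scale * src_h) / 2
    return [[scale, 0.0, ox], [0.0, scale, oy]]

def invert_letterbox(M):
    """Closed-form inverse, valid only for scale + translation matrices."""
    scale, ox, oy = M[0][0], M[0][2], M[1][2]
    return [[1 / scale, 0.0, -ox / scale], [0.0, 1 / scale, -oy / scale]]

def apply_affine(M, x, y):
    """Apply a 2x3 affine matrix to a point."""
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])

M = letterbox_matrix(1280, 720)    # scale 0.5, 140 px padding top and bottom
IM = invert_letterbox(M)

corner_in_dst = apply_affine(M, 1280, 720)      # bottom-right corner in the 640x640 input
corner_back = apply_affine(IM, *corner_in_dst)  # mapped back to the original image
print(M, IM, corner_in_dst, corner_back)
```

The same IM is what the postprocessing uses to map box and keypoint coordinates from the 640x640 network input back onto the original image.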

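3. RTMO Postprocessing

Now let's look at postprocessing.

Postprocessing is rather tedious, so we will not debug it inside mmpose. As mentioned in Part 1 when exporting the ONNX model, mmdeploy rewrites the head for deployment. Let's take a look: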

# ========== rtmo_head.py ==========

# mmdeploy/codebase/mmpose/models/heads/rtmo_head.py

@FUNCTION_REWRITER.register_rewriter(
    func_name='mmpose.models.heads.hybrid_heads.'
    'rtmo_head.RTMOHead.forward')
def predict(self,
            x: Tuple[Tensor],
            batch_data_samples: List = [],
            test_cfg: Optional[dict] = None):
    """Get predictions and transform to bbox and keypoints results.
    Args:
        x (Tuple[Tensor]): The input tensor from upstream network.
        batch_data_samples: Batch image meta info. Defaults to None.
        test_cfg: The runtime config for testing process.

    Returns:
        Tuple[Tensor]: Predict bbox and keypoint results.
        - dets (Tensor): Predict bboxes and scores, which is a 3D Tensor,
            has shape (batch_size, num_instances, 5), the last dimension 5
            arrange as (x1, y1, x2, y2, score).
        - pred_kpts (Tensor): Predict keypoints and scores, which is a 4D
            Tensor, has shape (batch_size, num_instances, num_keypoints, 5),
            the last dimension 3 arrange as (x, y, score).
    """

    # deploy context
    ctx = FUNCTION_REWRITER.get_context()
    backend = get_backend(ctx.cfg)
    deploy_cfg = ctx.cfg

    cfg = self.test_cfg if test_cfg is None else test_cfg

    # get predictions
    cls_scores, bbox_preds, _, kpt_vis, pose_vecs = self.head_module(x)[:5]
    assert len(cls_scores) == len(bbox_preds)
    num_imgs = cls_scores[0].shape[0]

    # flatten and concat predictions
    scores = self._flatten_predictions(cls_scores).sigmoid()
    flatten_bbox_preds = self._flatten_predictions(bbox_preds)
    flatten_pose_vecs = self._flatten_predictions(pose_vecs)
    flatten_kpt_vis = self._flatten_predictions(kpt_vis).sigmoid()
    bboxes = self.decode_bbox(flatten_bbox_preds, self.flatten_priors,
                              self.flatten_stride)

    if backend == Backend.TENSORRT:
        # pad for batched_nms because its output index is filled with -1
        bboxes = torch.cat(
            [bboxes,
             bboxes.new_zeros((bboxes.shape[0], 1, bboxes.shape[2]))],
            dim=1)

        scores = torch.cat(
            [scores, scores.new_zeros((scores.shape[0], 1, 1))], dim=1)

    # nms parameters
    post_params = get_post_processing_params(deploy_cfg)
    max_output_boxes_per_class = post_params.max_output_boxes_per_class
    iou_threshold = cfg.get('nms_thr', post_params.iou_threshold)
    score_threshold = cfg.get('score_thr', post_params.score_threshold)
    pre_top_k = post_params.get('pre_top_k', -1)
    keep_top_k = cfg.get('max_per_img', post_params.keep_top_k)

    # do nms
    _, _, nms_indices = multiclass_nms(
        bboxes,
        scores,
        max_output_boxes_per_class,
        iou_threshold,
        score_threshold,
        pre_top_k=pre_top_k,
        keep_top_k=keep_top_k,
        output_index=True)

    batch_inds = torch.arange(num_imgs, device=scores.device).view(-1, 1)

    # filter predictions
    dets = torch.cat([bboxes, scores], dim=2)
    dets = dets[batch_inds, nms_indices, ...]
    pose_vecs = flatten_pose_vecs[batch_inds, nms_indices, ...]
    kpt_vis = flatten_kpt_vis[batch_inds, nms_indices, ...]
    grids = self.flatten_priors[nms_indices, ...]

    # decode keypoints
    bbox_cs = torch.cat(
        bbox_xyxy2cs(dets[..., :4], self.bbox_padding), dim=-1)
    keypoints = self.dcc.forward_test(pose_vecs, bbox_cs, grids)
    pred_kpts = torch.cat([keypoints, kpt_vis.unsqueeze(-1)], dim=-1)

    return dets, pred_kpts

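From the analysis (omitted… 😄), mmpose postprocessing consists of two parts:

nms: non-maximum suppression
decode: decoding of boxes and keypoints

If you know YOLOv8-Pose's postprocessing, you will find that RTMO's is almost identical, so the corresponding code is easy to write: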

def iou(box1, box2):
    def area_box(box):
        return (box[2] - box[0]) * (box[3] - box[1])

    left   = max(box1[0], box2[0])
    top    = max(box1[1], box2[1])
    right  = min(box1[2], box2[2])
    bottom = min(box1[3], box2[3])
    cross  = max((right-left), 0) * max((bottom-top), 0)
    union  = area_box(box1) + area_box(box2) - cross
    if cross == 0 or union == 0:
        return 0
    return cross / union

def NMS(boxes, iou_thres):
    # boxes must already be sorted by confidence in descending order
    remove_flags = [False] * len(boxes)

    keep_boxes = []
    for i, ibox in enumerate(boxes):
        if remove_flags[i]:
            continue

        keep_boxes.append(ibox)
        for j in range(i + 1, len(boxes)):
            if remove_flags[j]:
                continue

            jbox = boxes[j]
            if iou(ibox, jbox) > iou_thres:
                remove_flags[j] = True
    return keep_boxes

def postprocess(pred, IM=[], conf_thres=0.25, iou_thres=0.5):

    # Input is the raw model output, i.e. 2000 predictions
    # 1,2000,56 [left,top,right,bottom,conf,17*3]
    boxes = []
    for img_id, box_id in zip(*np.where(pred[...,4] > conf_thres)):
        item = pred[img_id, box_id]
        left, top, right, bottom, conf = item[:5]
        keypoints = item[5:].reshape(-1, 3)
        # map keypoints back to the original image
        keypoints[:, 0] = keypoints[:, 0] * IM[0][0] + IM[0][2]
        keypoints[:, 1] = keypoints[:, 1] * IM[1][1] + IM[1][2]
        boxes.append([left, top, right, bottom, conf, *keypoints.reshape(-1).tolist()])

    # nothing above the confidence threshold: avoid indexing an empty array
    if len(boxes) == 0:
        return []

    boxes = np.array(boxes)
    # map box corners back to the original image
    lr = boxes[:,[0, 2]]
    tb = boxes[:,[1, 3]]
    boxes[:,[0,2]] = IM[0][0] * lr + IM[0][2]
    boxes[:,[1,3]] = IM[1][1] * tb + IM[1][2]
    boxes = sorted(boxes.tolist(), key=lambda x:x[4], reverse=True)

    return NMS(boxes, iou_thres)

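Note: the difference is that the five box values are directly left, top, right, bottom, conf rather than center point plus width and height. We pointed this out specifically when exporting the ONNX model.

In postprocessing the predicted boxes are decoded with the inverse affine matrix IM. For the details of IM, see YOLOv5 Inference Explained and High-Performance Preprocessing; I won't repeat them here. The NMS code follows the implementation in tensorRT_Pro: yolo.cpp#L119

Keypoints can be mapped back to the original image with IM in the same way, so RTMO's postprocessing is essentially the same as YOLOv8-Pose's. You only need to be clear about what each dimension of the model output means.

For a 640x640 image, RTMO produces 2000 predictions in total, each with 56 values (for the 17 human keypoints of the COCO dataset):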
$\begin{aligned} 2000\times56&=40\times40\times56+20\times20\times56\\ &=40\times40\times(5+51)+20\times20\times(5+51)\\ &=40\times40\times(5+17\times3)+20\times20\times(5+17\times3)\\ \end{aligned}$
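Here the 5 corresponds to left, top, right, bottom, conf: the top-left and bottom-right corners of the bounding box plus its confidence. The 17 corresponds to the 17 human keypoints of the COCO dataset, and the 3 is the information for each keypoint: x, y, visibility, i.e. the keypoint's x and y coordinates and its visibility (or confidence). When visualizing, we only draw keypoints whose visibility is greater than 0.5, because a keypoint below 0.5 is considered occluded or outside the image.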
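The layout above can be sketched in a few lines of pure Python. This is an illustrative sketch only: the strides 16 and 32 are inferred from the 40x40 and 20x20 grids at 640x640 input, and the helper names num_predictions and split_prediction are hypothetical.

```python
NUM_KEYPOINTS = 17

def num_predictions(input_size=640, strides=(16, 32)):
    """One prediction per grid cell on each output level."""
    return sum((input_size // s) ** 2 for s in strides)

def split_prediction(row, vis_thres=0.5):
    """Split one 56-value row into box, score, keypoints and visible keypoint ids."""
    assert len(row) == 5 + 3 * NUM_KEYPOINTS
    box, score = row[:4], row[4]
    kpts = [tuple(row[5 + 3 * i: 8 + 3 * i]) for i in range(NUM_KEYPOINTS)]
    visible = [i for i, (_, _, v) in enumerate(kpts) if v > vis_thres]
    return box, score, kpts, visible

# Synthetic row: even-indexed keypoints are visible, odd-indexed ones are not
row = [10.0, 20.0, 110.0, 220.0, 0.9]
for i in range(NUM_KEYPOINTS):
    row += [float(i), float(2 * i), 0.9 if i % 2 == 0 else 0.1]

total = num_predictions()
box, score, kpts, visible = split_prediction(row)
print(total, box, score, visible)
```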

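Mainstream pose estimation algorithms fall into two families. Top-down methods first detect every person box in the image and then estimate the pose inside each box. Bottom-up methods first detect all keypoints in the image and then group them into individual skeletons. Each has trade-offs: the accuracy of a top-down method depends heavily on the quality of the detected boxes, while a bottom-up method can become ambiguous when two people are very close together and, because it relies on the relation between pairs of keypoints, loses global context.

AlphaPose is a typical top-down model: it detects all person boxes first and then estimates the pose of each person. RTMO needs no separate person detector and MMPose runs it through its bottom-up pipeline (the BottomupResize transform seen earlier), but it is better described as a one-stage model: as the 5 + 17×3 layout above shows, every prediction already carries a person box together with its keypoints, so there is no keypoint grouping step. YOLOv8-Pose uses the same kind of output, which is why the two share almost the same postprocessing.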

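4. RTMO Inference

With preprocessing and postprocessing analyzed, the whole inference flow is straightforward. RTMO inference consists of three parts: image preprocessing, model inference, and postprocessing of the predictions. Preprocessing is mainly the warpAffine transform; postprocessing is mainly decoding the boxes and keypoints, followed by NMS.

Create infer.py in the mmpose project for inference. The complete code is as follows: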

import cv2
import numpy as np
import onnxruntime as ort

def preprocess_warpAffine(image, dst_width=640, dst_height=640):
    scale = min((dst_width / image.shape[1], dst_height / image.shape[0]))
    ox = (dst_width  - scale * image.shape[1]) / 2
    oy = (dst_height - scale * image.shape[0]) / 2
    M = np.array([
        [scale, 0, ox],
        [0, scale, oy]
    ], dtype=np.float32)

    img_pre = cv2.warpAffine(image, M, (dst_width, dst_height), flags=cv2.INTER_LINEAR,
                             borderMode=cv2.BORDER_CONSTANT, borderValue=(114, 114, 114))
    IM = cv2.invertAffineTransform(M)
    # cv2.imwrite("img_pre.jpg", img_pre)

    img_pre = (img_pre[...,::-1]).astype(np.float32)   # bgr -> rgb, no /255.0
    img_pre = img_pre.transpose(2, 0, 1)[None]         # h,w,c -> b,c,h,w
    img_pre = np.ascontiguousarray(img_pre)
    # img_pre = torch.from_numpy(img_pre)
    return img_pre, IM

def iou(box1, box2):
    def area_box(box):
        return (box[2] - box[0]) * (box[3] - box[1])

    left   = max(box1[0], box2[0])
    top    = max(box1[1], box2[1])
    right  = min(box1[2], box2[2])
    bottom = min(box1[3], box2[3])
    cross  = max((right-left), 0) * max((bottom-top), 0)
    union  = area_box(box1) + area_box(box2) - cross
    if cross == 0 or union == 0:
        return 0
    return cross / union

def NMS(boxes, iou_thres):
    # boxes must already be sorted by confidence in descending order
    remove_flags = [False] * len(boxes)

    keep_boxes = []
    for i, ibox in enumerate(boxes):
        if remove_flags[i]:
            continue

        keep_boxes.append(ibox)
        for j in range(i + 1, len(boxes)):
            if remove_flags[j]:
                continue

            jbox = boxes[j]
            if iou(ibox, jbox) > iou_thres:
                remove_flags[j] = True
    return keep_boxes

def postprocess(pred, IM=[], conf_thres=0.25, iou_thres=0.5):

    # Input is the raw model output, i.e. 2000 predictions
    # 1,2000,56 [left,top,right,bottom,conf,17*3]
    boxes = []
    for img_id, box_id in zip(*np.where(pred[...,4] > conf_thres)):
        item = pred[img_id, box_id]
        left, top, right, bottom, conf = item[:5]
        keypoints = item[5:].reshape(-1, 3)
        keypoints[:, 0] = keypoints[:, 0] * IM[0][0] + IM[0][2]
        keypoints[:, 1] = keypoints[:, 1] * IM[1][1] + IM[1][2]
        boxes.append([left, top, right, bottom, conf, *keypoints.reshape(-1).tolist()])

    # nothing above the confidence threshold: avoid indexing an empty array
    if len(boxes) == 0:
        return []

    boxes = np.array(boxes)
    lr = boxes[:,[0, 2]]
    tb = boxes[:,[1, 3]]
    boxes[:,[0,2]] = IM[0][0] * lr + IM[0][2]
    boxes[:,[1,3]] = IM[1][1] * tb + IM[1][2]
    boxes = sorted(boxes.tolist(), key=lambda x:x[4], reverse=True)

    return NMS(boxes, iou_thres)

def hsv2bgr(h, s, v):
    h_i = int(h * 6)
    f = h * 6 - h_i
    p = v * (1 - s)
    q = v * (1 - f * s)
    t = v * (1 - (1 - f) * s)

    r, g, b = 0, 0, 0

    if h_i == 0:
        r, g, b = v, t, p
    elif h_i == 1:
        r, g, b = q, v, p
    elif h_i == 2:
        r, g, b = p, v, t
    elif h_i == 3:
        r, g, b = p, q, v
    elif h_i == 4:
        r, g, b = t, p, v
    elif h_i == 5:
        r, g, b = v, p, q

    return int(b * 255), int(g * 255), int(r * 255)

def random_color(id):
    h_plane = (((id << 2) ^ 0x937151) % 100) / 100.0
    s_plane = (((id << 3) ^ 0x315793) % 100) / 100.0
    return hsv2bgr(h_plane, s_plane, 1)

# 1-based keypoint index pairs that form the limbs
skeleton = [[16, 14], [14, 12], [17, 15], [15, 13], [12, 13], [6, 12], [7, 13], [6, 7], [6, 8],
            [7, 9], [8, 10], [9, 11], [2, 3], [1, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5, 7]]
pose_palette = np.array([[255, 128, 0], [255, 153, 51], [255, 178, 102], [230, 230, 0], [255, 153, 255],
                         [153, 204, 255], [255, 102, 255], [255, 51, 255], [102, 178, 255], [51, 153, 255],
                         [255, 153, 153], [255, 102, 102], [255, 51, 51], [153, 255, 153], [102, 255, 102],
                         [51, 255, 51], [0, 255, 0], [0, 0, 255], [255, 0, 0], [255, 255, 255]], dtype=np.uint8)
kpt_color  = pose_palette[[16, 16, 16, 16, 16, 0, 0, 0, 0, 0, 0, 9, 9, 9, 9, 9, 9]]
limb_color = pose_palette[[9, 9, 9, 9, 7, 7, 7, 0, 0, 0, 0, 0, 16, 16, 16, 16, 16, 16, 16]]

if __name__ == "__main__":

    # 1. preprocess
    img = cv2.imread("./tests/data/coco/000000197388.jpg")
    img_pre, IM = preprocess_warpAffine(img)

    model_path = "./rtmo-s_8xb32-600e_body7-640x640.onnx"
    session = ort.InferenceSession(model_path)
    input_name = 'images'
    inputs = {input_name: img_pre}

    # 2. infer
    outputs = session.run(None, inputs)[0]

    # 3. postprocess
    boxes = postprocess(outputs, IM)

    # 4. visualize
    for box in boxes:
        left, top, right, bottom = int(box[0]), int(box[1]), int(box[2]), int(box[3])
        confidence = box[4]
        label = 0
        color = random_color(label)
        cv2.rectangle(img, (left, top), (right, bottom), color, 2, cv2.LINE_AA)
        caption = f"person {confidence:.2f}"
        w, h = cv2.getTextSize(caption, 0, 1, 2)[0]
        cv2.rectangle(img, (left - 3, top - 33), (left + w + 10, top), color, -1)
        cv2.putText(img, caption, (left, top - 5), 0, 1, (0, 0, 0), 2, 16)

        keypoints = box[5:]
        keypoints = np.array(keypoints).reshape(-1, 3)
        for i, keypoint in enumerate(keypoints):
            x, y, conf = keypoint
            color_k = [int(c) for c in kpt_color[i]]
            # skip keypoints that are occluded or outside the image
            if conf < 0.5:
                continue
            if x != 0 and y != 0:
                cv2.circle(img, (int(x), int(y)), 3, color_k, -1, lineType=cv2.LINE_AA)

        for i, sk in enumerate(skeleton):
            pos1 = (int(keypoints[(sk[0] - 1), 0]), int(keypoints[(sk[0] - 1), 1]))
            pos2 = (int(keypoints[(sk[1] - 1), 0]), int(keypoints[(sk[1] - 1), 1]))

            conf1 = keypoints[(sk[0] - 1), 2]
            conf2 = keypoints[(sk[1] - 1), 2]
            if conf1 < 0.5 or conf2 < 0.5:
                continue
            if pos1[0] == 0 or pos1[1] == 0 or pos2[0] == 0 or pos2[1] == 0:
                continue
            cv2.line(img, pos1, pos2, [int(c) for c in limb_color[i]], thickness=2, lineType=cv2.LINE_AA)

    cv2.imwrite("infer-pose.jpg", img)
    print("save done")

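Note: here we run inference with onnxruntime, using the ONNX model exported in MMPose-RTMO Inference Explained and Deployment (Part 1). The visualization code is adapted from YOLOv8-Pose; if you are interested, see YOLOv8-Pose Inference Explained and Deployment.

On success, the result image infer-pose.jpg is written to the current directory, as shown below:

[Figure: infer-pose.jpg inference result]

That completes the whole RTMO inference flow in Python. Next we implement it in C++.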

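II. RTMO Inference (C++)

For the C++ implementation we again use the tensorRT_Pro repo, and build RTMO inference in C++ on top of it.

1. ONNX Export

For the details of ONNX export, see MMPose-RTMO Inference Explained and Deployment (Part 1); I won't repeat them here.

2. RTMO Preprocessing

As mentioned earlier, RTMO's preprocessing is essentially the same as YOLOv5's, so in tensorRT_Pro the RTMO model can reuse the YOLOv5 preprocessing directly. The only thing to watch is that the CUDAKernel::Norm you specify must no longer divide by 255.0.

The tensorRT_Pro preprocessing code is as follows: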

__global__ void warp_affine_bilinear_and_normalize_plane_kernel(uint8_t* src, int src_line_size, int src_width, int src_height, float* dst, int dst_width, int dst_height,
    uint8_t const_value_st, float* warp_affine_matrix_2_3, Norm norm, int edge){

    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if (position >= edge) return;

    float m_x1 = warp_affine_matrix_2_3[0];
    float m_y1 = warp_affine_matrix_2_3[1];
    float m_z1 = warp_affine_matrix_2_3[2];
    float m_x2 = warp_affine_matrix_2_3[3];
    float m_y2 = warp_affine_matrix_2_3[4];
    float m_z2 = warp_affine_matrix_2_3[5];

    int dx      = position % dst_width;
    int dy      = position / dst_width;
    float src_x = m_x1 * dx + m_y1 * dy + m_z1;
    float src_y = m_x2 * dx + m_y2 * dy + m_z2;
    float c0, c1, c2;

    if(src_x <= -1 || src_x >= src_width || src_y <= -1 || src_y >= src_height){
        // out of range
        c0 = const_value_st;
        c1 = const_value_st;
        c2 = const_value_st;
    }else{
        int y_low = floorf(src_y);
        int x_low = floorf(src_x);
        int y_high = y_low + 1;
        int x_high = x_low + 1;

        uint8_t const_value[] = {const_value_st, const_value_st, const_value_st};
        float ly    = src_y - y_low;
        float lx    = src_x - x_low;
        float hy    = 1 - ly;
        float hx    = 1 - lx;
        float w1    = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;
        uint8_t* v1 = const_value;
        uint8_t* v2 = const_value;
        uint8_t* v3 = const_value;
        uint8_t* v4 = const_value;
        if(y_low >= 0){
            if (x_low >= 0)
                v1 = src + y_low * src_line_size + x_low * 3;

            if (x_high < src_width)
                v2 = src + y_low * src_line_size + x_high * 3;
        }

        if(y_high < src_height){
            if (x_low >= 0)
                v3 = src + y_high * src_line_size + x_low * 3;

            if (x_high < src_width)
                v4 = src + y_high * src_line_size + x_high * 3;
        }

        // same to opencv
        c0 = floorf(w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0] + 0.5f);
        c1 = floorf(w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1] + 0.5f);
        c2 = floorf(w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2] + 0.5f);
    }

    if(norm.channel_type == ChannelType::Invert){
        float t = c2;
        c2 = c0;  c0 = t;
    }

    if(norm.type == NormType::MeanStd){
        c0 = (c0 * norm.alpha - norm.mean[0]) / norm.std[0];
        c1 = (c1 * norm.alpha - norm.mean[1]) / norm.std[1];
        c2 = (c2 * norm.alpha - norm.mean[2]) / norm.std[2];
    }else if(norm.type == NormType::AlphaBeta){
        c0 = c0 * norm.alpha + norm.beta;
        c1 = c1 * norm.alpha + norm.beta;
        c2 = c2 * norm.alpha + norm.beta;
    }

    int area = dst_width * dst_height;
    float* pdst_c0 = dst + dy * dst_width + dx;
    float* pdst_c1 = pdst_c0 + area;
    float* pdst_c2 = pdst_c1 + area;
    *pdst_c0 = c0;
    *pdst_c1 = c1;
    *pdst_c2 = c2;
}

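Preprocessing simply calls the CUDA kernel above to implement warpAffine. Since the kernel operates on each pixel individually, operations such as BGR → RGB and /255.0 are very easy to fold in. For a detailed walkthrough of the code, see YOLOv5 Inference Explained and High-Performance Preprocessing; I won't repeat it here.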
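To illustrate the effect of the Norm setting, here is a small Python sketch of the kernel's AlphaBeta branch and channel swap applied to a single pixel. The function name is hypothetical, and alpha = 1.0 for RTMO is my assumption, consistent with "no /255.0" above; check the actual Norm used in the repo.

```python
def normalize_pixel(bgr, alpha, beta=0.0, invert=True):
    """Emulate the per-pixel ChannelType::Invert swap and AlphaBeta normalization."""
    c0, c1, c2 = (float(v) for v in bgr)
    if invert:                      # BGR -> RGB: swap first and last channel
        c0, c2 = c2, c0
    return (c0 * alpha + beta, c1 * alpha + beta, c2 * alpha + beta)

pixel_bgr = (10, 114, 255)

yolov5_style = normalize_pixel(pixel_bgr, alpha=1 / 255.0)  # scaled to [0, 1]
rtmo_style = normalize_pixel(pixel_bgr, alpha=1.0)          # assumed: values stay in [0, 255]
print(yolov5_style, rtmo_style)
```

With alpha = 1.0 the pixel keeps its 0 to 255 range, which matches what the Python preprocessing feeds to the model.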

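3. RTMO Postprocessing

As mentioned earlier, RTMO's box postprocessing is essentially the same as YOLOv8-Pose's, except that the box values are now the top-left and bottom-right corners. For reference, see yolo_pose_decode.cu#L13

So the decode part for RTMO is easy to write: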

static __global__ void decode_kernel_rtmo(float *predict, int num_bboxes, float confidence_threshold, float* invert_affine_matrix, float* parray, int MAX_IMAGE_BOXES){

    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if(position >= num_bboxes) return;

    float* pitem     = predict + (5 + 3 * NUM_KEYPOINTS) * position;
    float left       = *pitem++;
    float top        = *pitem++;
    float right      = *pitem++;
    float bottom     = *pitem++;
    float confidence = *pitem++;
    if(confidence < confidence_threshold)
        return;

    int index = atomicAdd(parray, 1);
    if(index >= MAX_IMAGE_BOXES)
        return;

    affine_project(invert_affine_matrix, left,  top,    &left,  &top);
    affine_project(invert_affine_matrix, right, bottom, &right, &bottom);

    float* pout_item = parray + 1 + index * NUM_BOX_ELEMENT;
    *pout_item++ = left;
    *pout_item++ = top;
    *pout_item++ = right;
    *pout_item++ = bottom;
    *pout_item++ = confidence;
    *pout_item++ = 1; // 1 = keep, 0 = ignore

    for(int i = 0; i < NUM_KEYPOINTS; ++i){
        float keypoint_x          = *pitem++;
        float keypoint_y          = *pitem++;
        float keypoint_confidence = *pitem++;

        affine_project(invert_affine_matrix, keypoint_x, keypoint_y, &keypoint_x, &keypoint_y);

        *pout_item++ = keypoint_x;
        *pout_item++ = keypoint_y;
        *pout_item++ = keypoint_confidence;
    }
}

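The decode implementation launches many threads, each decoding one prediction: both the box coordinates and the keypoint coordinates, which are mapped back onto the original image with the inverse affine matrix IM. For a detailed analysis of the decode code, see Reading the infer Source: yolo.cu; I won't repeat it here.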
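For readers who want to see how the buffer written by the decode kernel can be consumed on the host side, here is a rough Python sketch. The layout is read off the kernel above: element 0 is the box counter, followed by fixed-size items of left, top, right, bottom, confidence, keepflag and 17 keypoint triples, which gives NUM_BOX_ELEMENT = 57 (my inference from the kernel, not a value quoted from the repo). The helper read_boxes is hypothetical.

```python
NUM_KEYPOINTS = 17
NUM_BOX_ELEMENT = 6 + 3 * NUM_KEYPOINTS   # 57 floats per decoded item (inferred)

def read_boxes(parray, max_image_boxes):
    """Parse the flat output buffer and return only items whose keepflag is 1."""
    # The counter may exceed the capacity, since atomicAdd runs before the bounds check
    count = min(int(parray[0]), max_image_boxes)
    boxes = []
    for i in range(count):
        start = 1 + i * NUM_BOX_ELEMENT
        item = parray[start:start + NUM_BOX_ELEMENT]
        left, top, right, bottom, conf, keep = item[:6]
        if keep != 1:
            continue
        kpts = [tuple(item[6 + 3 * k: 9 + 3 * k]) for k in range(NUM_KEYPOINTS)]
        boxes.append({"box": (left, top, right, bottom), "conf": conf, "keypoints": kpts})
    return boxes

# Synthetic buffer with two items; the second one was suppressed (keepflag = 0)
kept = [0.0, 0.0, 10.0, 10.0, 0.9, 1.0] + [1.0, 2.0, 0.8] * NUM_KEYPOINTS
dropped = [1.0, 1.0, 11.0, 11.0, 0.8, 0.0] + [3.0, 4.0, 0.7] * NUM_KEYPOINTS
parray = [2.0] + kept + dropped

result = read_boxes(parray, max_image_boxes=1024)
print(len(result), result[0]["box"], result[0]["conf"])
```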

As for NMS, the RTMO output has no class label dimension, so the kernel needs a small adjustment as well. The adjusted NMS code is as follows:

static __global__ void nms_kernel_rtmo(float* bboxes, int max_objects, float threshold){

    int position = (blockDim.x * blockIdx.x + threadIdx.x);
    int count = min((int)*bboxes, max_objects);
    if (position >= count)
        return;

    // left, top, right, bottom, confidence, keepflag, (keypoint_x, keypoint_y, keypoint_confidence) * 17
    float* pcurrent = bboxes + 1 + position * NUM_BOX_ELEMENT;
    for(int i = 0; i < count; ++i){
        float* pitem = bboxes + 1 + i * NUM_BOX_ELEMENT;
        if(i == position) continue;

        if(pitem[4] >= pcurrent[4]){
            if(pitem[4] == pcurrent[4] && i < position)
                continue;

            float iou = box_iou(
                pcurrent[0], pcurrent[1], pcurrent[2], pcurrent[3],
                pitem[0],    pitem[1],    pitem[2],    pitem[3]
            );

            if(iou > threshold){
                pcurrent[5] = 0;  // 1=keep, 0=ignore
                return;
            }
        }
    }
}

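The NMS implementation also launches many threads, each handling one box. If another box has a confidence greater than or equal to that of the current thread's box, the IoU of the two boxes is computed and used to decide whether the current box is kept. Compared with the CPU NMS there is one fewer nested loop, because the outer loop is replaced by parallel CUDA threads. The code is adapted from yolo_decode.cu#L81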
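The per-thread rule can be emulated sequentially in Python, which makes the tie-breaking easier to see. This is a sketch of the logic in the kernel above, not the repo's code; box_iou here is my own plain IoU and the function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (left, top, right, bottom, ...)."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    inter = max(right - left, 0.0) * max(bottom - top, 0.0)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def keep_flag(boxes, position, threshold):
    """What one CUDA thread computes: 1 = keep, 0 = ignore."""
    cur = boxes[position]
    for i, item in enumerate(boxes):
        if i == position:
            continue
        if item[4] >= cur[4]:
            # on equal confidence, a box with a lower index does not suppress this one
            if item[4] == cur[4] and i < position:
                continue
            if box_iou(cur, item) > threshold:
                return 0
    return 1

def parallel_nms(boxes, threshold=0.5):
    """Run the per-thread rule for every box (sequentially here)."""
    return [keep_flag(boxes, p, threshold) for p in range(len(boxes))]

boxes = [
    (0.0, 0.0, 10.0, 10.0, 0.9),    # highest score, kept
    (1.0, 1.0, 11.0, 11.0, 0.8),    # IoU with the first box is about 0.68, suppressed
    (50.0, 50.0, 60.0, 60.0, 0.7),  # far away, kept
]
flags = parallel_nms(boxes)
tie_flags = parallel_nms([(0.0, 0.0, 10.0, 10.0, 0.9), (0.0, 0.0, 10.0, 10.0, 0.9)])
print(flags, tie_flags)
```

One difference from the greedy CPU version is worth keeping in mind: here every box is compared against all boxes with a higher or equal score, including ones that are themselves suppressed, so the two variants usually agree but are not guaranteed to give identical results.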

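4. RTMO Inference

With preprocessing and postprocessing analyzed, the whole inference flow is straightforward. In C++, RTMO's preprocessing can reuse the YOLOv5 preprocessing as is, while the decode and NMS parts of postprocessing need small changes.

Run the following command in a terminal to perform inference (note: the complete workflow is described later; this is only a quick demonstration):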

make rtmo -j64

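The build output looks like this:

[Figure: build output showing a node parsing error]

The build interface that ships with tensorRT_Pro fails with a node parsing error. The ONNX model we exported contains LayerNormalization nodes; TensorRT only supports the LayerNormalization operator from version 8.6 onwards, whereas the onnx_parser built into tensorRT_Pro is version 8.0, hence the parsing error.

So for now we cannot generate the engine through the TRT::compile interface. As before there are two options: manually replace the onnx-parser, which I covered in detail in RT-DETR Inference Explained and Deployment; or use the trtexec tool from a newer TensorRT to generate the engine.

Let's look at the second option first and generate the engine with a newer trtexec.

I created a build.sh script with the following content: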

#! /usr/bin/bash

TRTEXEC=/home/jarvis/lean/TensorRT-8.6.1.6/bin/trtexec

# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/jarvis/lean/TensorRT-8.6.1.6/lib

# rtmo
${TRTEXEC} --onnx=rtmo-s_8xb32-600e_body7-640x640.onnx --minShapes=images:1x3x640x640 --optShapes=images:1x3x640x640 --maxShapes=images:16x3x640x640 --saveEngine=rtmo-s_8xb32-600e_body7-640x640.FP32.trtmodel

Then run the following command in a terminal:

bash build.sh
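If you prefer to drive trtexec from Python, for example to build engines for several models, the same command line can be assembled programmatically. The sketch below only builds the argument list with the flags used in build.sh; the helper name and its defaults are hypothetical, and the trtexec path is the one from the script, which you need to adapt.

```python
def build_trtexec_cmd(trtexec, onnx, engine, input_name="images",
                      chw=(3, 640, 640), min_bs=1, opt_bs=1, max_bs=16):
    """Assemble the trtexec argument list for a dynamic-batch engine."""
    def shape(bs):
        return f"{input_name}:{bs}x" + "x".join(str(d) for d in chw)

    return [
        trtexec,
        f"--onnx={onnx}",
        f"--minShapes={shape(min_bs)}",
        f"--optShapes={shape(opt_bs)}",
        f"--maxShapes={shape(max_bs)}",
        f"--saveEngine={engine}",
    ]

cmd = build_trtexec_cmd(
    "/home/jarvis/lean/TensorRT-8.6.1.6/bin/trtexec",
    "rtmo-s_8xb32-600e_body7-640x640.onnx",
    "rtmo-s_8xb32-600e_body7-640x640.FP32.trtmodel",
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```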

[Figure: trtexec build log]

The engine is generated successfully; the next step is to use it for inference.

Run the earlier command again to perform inference:

make rtmo -j64

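The output is as follows:

[Figure: console output of the C++ inference run]

The inference result is shown below:

[Figure: C++ inference result]

PS: option one, manually replacing the onnx-parser, is covered in detail in RT-DETR Inference Explained and Deployment, so I won't repeat it here.

After replacing the onnx-parser as in option one, the engine can be built through TRT::compile. The output is shown below:

[Figure: engine build log via TRT::compile]

That completes the whole RTMO inference flow in C++. Next we walk through the complete workflow end to end.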

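III. RTMO Deployment

I created a repository, tensorRT_Pro-YOLOv8, based on shouxieai/tensorRT_Pro and adapted to support the various YOLOv8 tasks. It currently supports classification, detection, segmentation, and pose estimation.

Let's look at how to use the tensorRT_Pro-YOLOv8 repo for RTMO inference.

1. Source Download

The tensorRT_Pro-YOLOv8 code can be downloaded directly from GitHub at https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8. On Linux, clone it with: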

git clone https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8.git

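You can also download it manually by clicking the Code button in the upper right corner. That is all the project setup needed. Alternatively, you can download the source code I prepared from the Download Links section (note that this copy was downloaded on 2024/6/1; if the repo has changed since, refer to the latest version).

2. Environment Setup

The required software is TensorRT, CUDA, cuDNN, OpenCV, and Protobuf. For installing all of them, see Ubuntu 20.04 Software Installation Guide; I won't repeat it here, so please set up the environment yourself 😄. Since access from outside can be slow, I also provide the installation packages I used via Baidu Drive【pwd:yolo】🚀🚀🚀

tensorRT_Pro-YOLOv8 can be built with either CMakeLists.txt or the Makefile; choose one of the two.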

2.1 Configuring CMakeLists.txt

There are five places to change.

1. Line 13: set the OpenCV path

set(OpenCV_DIR   "/usr/local/include/opencv4/")

2. Line 15: set the CUDA path

set(CUDA_TOOLKIT_ROOT_DIR     "/usr/local/cuda-11.6")

3. Line 16: set the cuDNN path

set(CUDNN_DIR    "/usr/local/cudnn8.4.0.27-cuda11.6")

4. Line 17: set the TensorRT path (the version must be 8.6 or later)

set(TENSORRT_DIR "/home/jarvis/lean/TensorRT-8.6.1.6")

5. Line 20: set the protobuf path

set(PROTOBUF_DIR "/home/jarvis/protobuf")

2.2 Configuring the Makefile

There are five places to change.

1. Line 4: set the protobuf path

lean_protobuf  := /home/jarvis/protobuf

2. Line 5: set the TensorRT path (the version must be 8.6 or later)

lean_tensor_rt := /home/jarvis/lean/TensorRT-8.6.1.6

3. Line 6: set the cuDNN path

lean_cudnn     := /usr/local/cudnn8.4.0.27-cuda11.6

4. Line 7: set the OpenCV path

lean_opencv    := /usr/local

5. Line 8: set the CUDA path

lean_cuda      := /usr/local/cuda-11.6

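3. ONNX Export

For the export details, see MMPose-RTMO Inference Explained and Deployment (Part 1); I won't repeat them here. Remember to put the exported ONNX model in the tensorRT_Pro-YOLOv8/workspace folder.

4. Engine Generation

Edit build.sh in the workspace folder so that it reads as follows: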

#! /usr/bin/bash

TRTEXEC=/home/jarvis/lean/TensorRT-8.6.1.6/bin/trtexec

# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/jarvis/lean/TensorRT-8.6.1.6/lib

# rtmo
${TRTEXEC} --onnx=rtmo-s_8xb32-600e_body7-640x640.onnx --minShapes=images:1x3x640x640 --optShapes=images:1x3x640x640 --maxShapes=images:16x3x640x640 --saveEngine=rtmo-s_8xb32-600e_body7-640x640.FP32.trtmodel

Change the TRTEXEC path to your own, then run the following in a terminal:

cd tensorRT_Pro-YOLOv8/workspace
bash build.sh

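5. Source Modifications

If you want to run a model you trained yourself, the source code needs a small change. The RTMO inference code lives mainly in app_rtmo.cpp, so that is the only file to edit. The change is simple:

app_rtmo.cpp, line 292: change "rtmo-s_8xb32-600e_body7-640x640" to the name of your exported ONNX model

For example:

test(TRT::Mode::FP32, "best")	// change 1: line 292, "rtmo-s_8xb32-600e_body7-640x640" becomes "best"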

6. Running

OK! The source is modified, the Makefile is configured, and the engine is ready, so we can now build and run. Just execute the following in a terminal:

make rtmo -j64

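The inference run is shown below:

[Figure: console output of the inference run]

On success, a folder named rtmo-s_8xb32-600e_body7-640x640_RTMO_FP32_result is created, containing the result images.

The inference results look like this:

[Figure: RTMO inference results]

OK, that is the general workflow for running RTMO with tensorRT_Pro-YOLOv8. If you spot any problems, corrections are welcome.

Conclusion

In this article I gave a brief analysis of RTMO's preprocessing and postprocessing and shared the C++ implementation workflow, with the aim of helping you get the overall picture and carry out the rest of the deployment more easily 😄. Thanks for reading to the end. Writing this took effort, so if you found it useful, please leave a 👍⭐️

Working through the export and deployment of the whole MMPose-RTMO model taught me a lot and let me revisit what I had learned before. The process was somewhat painful, but it was worth it 🤗

Finally, if the tensorRT_Pro-YOLOv8 repo is helpful to you, consider giving it a ⭐️. It means a lot to me. Thank you 🙏.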

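Download Links

Software installation packages download link【extraction code: yolo】🚀🚀🚀
Source code and weights download link【extraction code: rtmo】

References

MMPose-RTMO Inference Explained and Deployment (Part 1)
https://github.com/shouxieai/tensorRT_Pro
https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8
YOLOv5 Inference Explained and High-Performance Preprocessing
YOLOv8-Pose Inference Explained and Deployment
RT-DETR Inference Explained and Deployment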
