TVM Pitfall Notes


Installation
AirFace.onnx Experiment
- - Locating the problem:
Automatic optimization

Installation

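Follow the instructions on the official website.
Once installation is done, run the simple example from the official quick-start-py tutorial: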

import numpy as np
from tvm import relay
import tvm
from tvm.contrib import graph_runtime

batch_size = 1
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)

mod, params = relay.testing.resnet.get_workload(num_layers=18, batch_size=batch_size, image_shape=image_shape)

print(mod.astext(show_meta_data=False))

It failed:

Traceback (most recent call last):
  File "./compile.py", line 84, in <module>
    from tvm import relay
  File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/relay/__init__.py", line 27, in <module>
    from . import expr_functor
  File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/relay/expr_functor.py", line 24, in <module>
    from .op import Op
  File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/relay/op/__init__.py", line 20, in <module>
    from .op import get, register, register_schedule, register_compute, register_gradient, \
  File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/relay/op/op.py", line 19, in <module>
    import topi
  File "/usr/local/lib/python3.6/dist-packages/topi-0.6.dev0-py3.6.egg/topi/__init__.py", line 43, in <module>
    from . import nn
  File "/usr/local/lib/python3.6/dist-packages/topi-0.6.dev0-py3.6.egg/topi/nn/__init__.py", line 23, in <module>
    from .deformable_conv2d import *
  File "/usr/local/lib/python3.6/dist-packages/topi-0.6.dev0-py3.6.egg/topi/nn/deformable_conv2d.py", line 23, in <module>
    from ..cpp.image import bilinear_sample_nchw
ImportError: cannot import name 'bilinear_sample_nchw'

To fix this error, add the following to the environment variables:

export LD_LIBRARY_PATH=/home/bokyliu/.local/lib/python3.6/site-packages/topi-0.6.dev0-py3.6.egg/topi

The problems kept coming, one after another:

Connected to pydev debugger (build 182.4505.26)
Traceback (most recent call last):
  File "/home/bokyliu/Work/pycharm-community-2018.2.4/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/home/bokyliu/Work/pycharm-community-2018.2.4/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/bokyliu/Work/pycharm-community-2018.2.4/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/bokyliu/Work/pycharm-community-2018.2.4/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/bokyliu/Project/TVM/quick-start.py", line 13, in <module>
    mod, params = relay.testing.resnet.get_workload(
AttributeError: module 'tvm.relay' has no attribute 'testing'

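This one left me speechless. testing is a subpackage (a folder) of tvm.relay, and `from tvm import relay` does not import it automatically, so relay.testing is not yet an attribute of the module and has to be imported explicitly. I changed the code as follows: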

import numpy as np
# from tvm import relay
from tvm.relay.testing import resnet
import tvm
from tvm.contrib import graph_runtime

batch_size = 1
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)

mod, params = resnet.get_workload(num_layers=18, batch_size=batch_size, image_shape=image_shape)

print(mod.astext(show_meta_data=False))

This successfully prints the computation graph:

v0.0.4
def @main(%data: Tensor[(1, 3, 224, 224), float32], %bn_data_gamma: Tensor[(3), float32], %bn_data_beta: Tensor[(3), float32], %bn_data_moving_mean: Tensor[(3), float32], %bn_data_moving_var: Tensor[(3), float32], %conv0_weight: Tensor[(64, 3, 7, 7), float32], %bn0_gamma: Tensor[(64), float32], %bn0_beta: Tensor[(64), float32], %bn0_moving_mean: Tensor[(64), float32], %bn0_moving_var: Tensor[(64), float32], %stage1_unit1_bn1_gamma: Tensor[(64), float32], %stage1_unit1_bn1_beta: Tensor[(64), float32], %stage1_unit1_bn1_moving_mean: Tensor[(64), float32], %stage1_unit1_bn1_moving_var: Tensor[(64), float32], %stage1_unit1_conv1_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_bn2_gamma: Tensor[(64), float32], %stage1_unit1_bn2_beta: Tensor[(64), float32], %stage1_unit1_bn2_moving_mean: Tensor[(64), float32], %stage1_unit1_bn2_moving_var: Tensor[(64), float32], %stage1_unit1_conv2_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_sc_weight: Tensor[(64, 64, 1, 1), float32], %stage1_unit2_bn1_gamma: Tensor[(64), float32], %stage1_unit2_bn1_beta: Tensor[(64), float32], %stage1_unit2_bn1_moving_mean: Tensor[(64), float32], %stage1_unit2_bn1_moving_var: Tensor[(64), float32], %stage1_unit2_conv1_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit2_bn2_gamma: Tensor[(64), float32], %stage1_unit2_bn2_beta: Tensor[(64), float32], %stage1_unit2_bn2_moving_mean: Tensor[(64), float32], %stage1_unit2_bn2_moving_var: Tensor[(64), float32], %stage1_unit2_conv2_weight: Tensor[(64, 64, 3, 3), float32], %stage2_unit1_bn1_gamma: Tensor[(64), float32], %stage2_unit1_bn1_beta: Tensor[(64), float32], %stage2_unit1_bn1_moving_mean: Tensor[(64), float32], %stage2_unit1_bn1_moving_var: Tensor[(64), float32], %stage2_unit1_conv1_weight: Tensor[(128, 64, 3, 3), float32], %stage2_unit1_bn2_gamma: Tensor[(128), float32], %stage2_unit1_bn2_beta: Tensor[(128), float32], %stage2_unit1_bn2_moving_mean: Tensor[(128), float32], %stage2_unit1_bn2_moving_var: Tensor[(128), float32], 
%stage2_unit1_conv2_weight: Tensor[(128, 128, 3, 3), float32], %stage2_unit1_sc_weight: Tensor[(128, 64, 1, 1), float32], %stage2_unit2_bn1_gamma: Tensor[(128), float32], %stage2_unit2_bn1_beta: Tensor[(128), float32], %stage2_unit2_bn1_moving_mean: Tensor[(128), float32], %stage2_unit2_bn1_moving_var: Tensor[(128), float32], %stage2_unit2_conv1_weight: Tensor[(128, 128, 3, 3), float32], %stage2_unit2_bn2_gamma: Tensor[(128), float32], %stage2_unit2_bn2_beta: Tensor[(128), float32], %stage2_unit2_bn2_moving_mean: Tensor[(128), float32], %stage2_unit2_bn2_moving_var: Tensor[(128), float32], %stage2_unit2_conv2_weight: Tensor[(128, 128, 3, 3), float32], %stage3_unit1_bn1_gamma: Tensor[(128), float32], %stage3_unit1_bn1_beta: Tensor[(128), float32], %stage3_unit1_bn1_moving_mean: Tensor[(128), float32], %stage3_unit1_bn1_moving_var: Tensor[(128), float32], %stage3_unit1_conv1_weight: Tensor[(256, 128, 3, 3), float32], %stage3_unit1_bn2_gamma: Tensor[(256), float32], %stage3_unit1_bn2_beta: Tensor[(256), float32], %stage3_unit1_bn2_moving_mean: Tensor[(256), float32], %stage3_unit1_bn2_moving_var: Tensor[(256), float32], %stage3_unit1_conv2_weight: Tensor[(256, 256, 3, 3), float32], %stage3_unit1_sc_weight: Tensor[(256, 128, 1, 1), float32], %stage3_unit2_bn1_gamma: Tensor[(256), float32], %stage3_unit2_bn1_beta: Tensor[(256), float32], %stage3_unit2_bn1_moving_mean: Tensor[(256), float32], %stage3_unit2_bn1_moving_var: Tensor[(256), float32], %stage3_unit2_conv1_weight: Tensor[(256, 256, 3, 3), float32], %stage3_unit2_bn2_gamma: Tensor[(256), float32], %stage3_unit2_bn2_beta: Tensor[(256), float32], %stage3_unit2_bn2_moving_mean: Tensor[(256), float32], %stage3_unit2_bn2_moving_var: Tensor[(256), float32], %stage3_unit2_conv2_weight: Tensor[(256, 256, 3, 3), float32], %stage4_unit1_bn1_gamma: Tensor[(256), float32], %stage4_unit1_bn1_beta: Tensor[(256), float32], %stage4_unit1_bn1_moving_mean: Tensor[(256), float32], %stage4_unit1_bn1_moving_var: Tensor[(256), 
float32], %stage4_unit1_conv1_weight: Tensor[(512, 256, 3, 3), float32], %stage4_unit1_bn2_gamma: Tensor[(512), float32], %stage4_unit1_bn2_beta: Tensor[(512), float32], %stage4_unit1_bn2_moving_mean: Tensor[(512), float32], %stage4_unit1_bn2_moving_var: Tensor[(512), float32], %stage4_unit1_conv2_weight: Tensor[(512, 512, 3, 3), float32], %stage4_unit1_sc_weight: Tensor[(512, 256, 1, 1), float32], %stage4_unit2_bn1_gamma: Tensor[(512), float32], %stage4_unit2_bn1_beta: Tensor[(512), float32], %stage4_unit2_bn1_moving_mean: Tensor[(512), float32], %stage4_unit2_bn1_moving_var: Tensor[(512), float32], %stage4_unit2_conv1_weight: Tensor[(512, 512, 3, 3), float32], %stage4_unit2_bn2_gamma: Tensor[(512), float32], %stage4_unit2_bn2_beta: Tensor[(512), float32], %stage4_unit2_bn2_moving_mean: Tensor[(512), float32], %stage4_unit2_bn2_moving_var: Tensor[(512), float32], %stage4_unit2_conv2_weight: Tensor[(512, 512, 3, 3), float32], %bn1_gamma: Tensor[(512), float32], %bn1_beta: Tensor[(512), float32], %bn1_moving_mean: Tensor[(512), float32], %bn1_moving_var: Tensor[(512), float32], %fc1_weight: Tensor[(1000, 512), float32], %fc1_bias: Tensor[(1000), float32]) -> Tensor[(1, 1000), float32] {
  %0 = nn.batch_norm(%data, %bn_data_gamma, %bn_data_beta, %bn_data_moving_mean, %bn_data_moving_var, epsilon=2e-05f, scale=False) /* ty=(Tensor[(1, 3, 224, 224), float32], Tensor[(3), float32], Tensor[(3), float32]) */;
  %1 = %0.0;
  %2 = nn.conv2d(%1, %conv0_weight, strides=[2, 2], padding=[3, 3], channels=64, kernel_size=[7, 7]) /* ty=Tensor[(1, 64, 112, 112), float32] */;
  ...
  %89 = nn.bias_add(%88, %fc1_bias, axis=-1) /* ty=Tensor[(1, 1000), float32] */;
  nn.softmax(%89) /* ty=Tensor[(1, 1000), float32] */
}
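
The AttributeError seen earlier is ordinary Python behaviour rather than anything specific to TVM: importing a package does not import its subpackages unless the package's own __init__.py does so. A minimal sketch of this, using a throwaway package built on disk (demo_pkg and get_workload are hypothetical names chosen for illustration, not part of TVM):

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk: demo_pkg/ with a subpackage demo_pkg/testing/
root = tempfile.mkdtemp()
sub_dir = os.path.join(root, "demo_pkg", "testing")
os.makedirs(sub_dir)
with open(os.path.join(root, "demo_pkg", "__init__.py"), "w") as f:
    f.write("")  # the parent __init__.py does not import its subpackage
with open(os.path.join(sub_dir, "__init__.py"), "w") as f:
    f.write("def get_workload():\n    return 'workload'\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_pkg
# The subpackage has not been imported yet, so it is not an attribute
before = hasattr(demo_pkg, "testing")

import demo_pkg.testing
# The explicit import binds the subpackage as an attribute of the parent
after = hasattr(demo_pkg, "testing")

print(before, after, demo_pkg.testing.get_workload())
```

In the TVM case the equivalent explicit import is `import tvm.relay.testing` or, as used above, `from tvm.relay.testing import resnet`.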

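The rest of the example ran without any new problems. However, I noticed that after executing

loaded_lib = tvm.module.load(path_lib)

a deploy_lib.tar.so file appears in the working directory. This appears to be expected: when given a .tar archive, tvm.module.load unpacks the object files and links them into a shared library named after the archive with .so appended.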

AirFace.onnx Experiment

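After getting the above to work I got a bit overconfident. Deploying the super-resolution ONNX model from the tutorial with TVM did not seem to take many steps, so I decided to try the AirFace model, which has given me plenty of headaches lately. I borrowed the tutorial code: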

import numpy as np
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load('/home/bokyliu/Project/TVM/airFace06.onnx')
######################################################################
# Load a test image
# ---------------------------------------------
# A single cat dominates the examples!
from PIL import Image
# img_url = 'https://github.com/dmlc/mxnet.js/blob/master/data/cat.png?raw=true'
# img_path = download_testdata(img_url, 'cat.png', module='data')
img = Image.open('/home/bokyliu/feature1.jpg').resize((112, 112))
img_arr = np.array(img).transpose(2, 0, 1)
# Add the batch dimension: the model input is (1, 3, 112, 112)
x = img_arr[np.newaxis, :]
######################################################################
# Compile the model with relay
# ---------------------------------------------
target = 'llvm'

input_name = '1'
shape_dict = {input_name: x.shape}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with relay.build_config(opt_level=1):
    intrp = relay.build_module.create_executor('graph', mod, tvm.cpu(0), target)

######################################################################
# Execute on TVM
# ---------------------------------------------
dtype = 'float32'
tvm_output = intrp.evaluate()(tvm.nd.array(x.astype(dtype)), **params).asnumpy()

Running it produces:

Traceback (most recent call last):
  File "/home/bokyliu/Work/pycharm-community-2018.2.4/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/home/bokyliu/Work/pycharm-community-2018.2.4/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/bokyliu/Work/pycharm-community-2018.2.4/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/bokyliu/Work/pycharm-community-2018.2.4/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/bokyliu/Project/TVM/airface_from_onnx.py", line 72, in <module>
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/frontend/onnx.py", line 1497, in from_onnx
    mod, params = g.from_onnx(graph, opset)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/frontend/onnx.py", line 1284, in from_onnx
    raise ValueError("Must provide an input shape for `{0}`.".format(i_name))
ValueError: Must provide an input shape for `0`.

The fix for this error is a one-line change:

input_name = '1'  # change to: input_name = 'input.1'

As for why this change is needed: I found that the ONNX computation graph declares its input like this:

graph(%input.1 : Float(1, 3, 112, 112),
      %conv1.conv.weight : Float(64, 3, 3, 3),
      %conv1.bn.weight : Float(64),
      ...)

The key in shape_dict must match this input name, because TVM also walks the computation graph.
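
Rather than guessing the input name, it can be read from the graph itself. The sketch below assumes, as the printed graph suggests, that the exporter lists the weights among the graph inputs, so the real inputs are those that are not also initializers. graph_input_names is a hypothetical helper, and the stand-in object only mimics the `input` and `initializer` fields of an ONNX graph so that the snippet runs without the onnx package:

```python
from types import SimpleNamespace


def graph_input_names(graph):
    """Return the names of graph inputs that are not weights (initializers)."""
    weight_names = {init.name for init in graph.initializer}
    return [inp.name for inp in graph.input if inp.name not in weight_names]


# Stand-in mimicking the graph printed above; with the onnx package installed,
# pass onnx.load(path).graph instead.
fake_graph = SimpleNamespace(
    input=[
        SimpleNamespace(name="input.1"),
        SimpleNamespace(name="conv1.conv.weight"),
        SimpleNamespace(name="conv1.bn.weight"),
    ],
    initializer=[
        SimpleNamespace(name="conv1.conv.weight"),
        SimpleNamespace(name="conv1.bn.weight"),
    ],
)

input_names = graph_input_names(fake_graph)
print(input_names)  # ['input.1']
```

The returned name is what should be used as the key of shape_dict.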
Running it again:

tvm.error.OpNotImplemented: The following operators are not supported for frontend ONNX: ATen

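This means the exported model contains an operation with no standard ONNX equivalent (the exporter emitted PyTorch's own ATen op for it), so the conversion to ONNX itself went wrong. Exporting the ONNX file with PyTorch 1.3 solved this.
With the new ONNX file the error above no longer appeared, but, as one could have guessed, a new problem showed up: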

WARNING:root:Attribute momentum is ignored in relay.sym.batch_norm
.
.
.
WARNING:root:Attribute momentum is ignored in relay.sym.batch_norm
Traceback (most recent call last):
  File "/home/bokyliu/Project/TVM/airface_from_onnx.py", line 73, in <module>
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/frontend/onnx.py", line 1497, in from_onnx
    mod, params = g.from_onnx(graph, opset)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/frontend/onnx.py", line 1325, in from_onnx
    op = self._convert_operator(op_name, inputs, attr, opset)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/frontend/onnx.py", line 1425, in _convert_operator
    sym = convert_map[op_name](inputs, attrs, self._params)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/frontend/onnx.py", line 470, in _impl_v5
    static_shape = infer_value_simulated(shape, params)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/frontend/common.py", line 520, in infer_value_simulated
    output_value = infer_value(input_val, params)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/frontend/common.py", line 494, in infer_value
    graph, lib, params = tvm.relay.build(func, target="llvm", params=params)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/build_module.py", line 244, in build
    graph_json, mod, params = bld_mod.build(func, target, target_host, params)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/relay/build_module.py", line 109, in build
    self._build(func, target, target_host)
  File "/home/bokyliu/Project/incubator-tvm/python/tvm/_ffi/_ctypes/function.py", line 207, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (8) /home/bokyliu/.local/lib/python3.6/site-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::relay::backend::RelayBuildModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x1b5) [0x7f6b3cd6f6e5]
  [bt] (7) /home/bokyliu/.local/lib/python3.6/site-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::relay::backend::RelayBuildModule::BuildRelay(tvm::relay::Function, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::NDArray, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray> > > const&)+0x5e) [0x7f6b3cd6e66e]
  [bt] (6) /home/bokyliu/.local/lib/python3.6/site-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::relay::backend::RelayBuildModule::Optimize(tvm::relay::Function, tvm::Map<tvm::Integer, tvm::Target, void, void> const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::NDArray, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray> > > const&)+0xee) [0x7f6b3cd6d87e]
  [bt] (5) /home/bokyliu/.local/lib/python3.6/site-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::relay::ModuleNode::FromExpr(tvm::relay::Expr const&, tvm::Map<tvm::relay::GlobalVar, tvm::relay::Function, void, void> const&, tvm::Map<tvm::relay::GlobalTypeVar, tvm::relay::TypeData, void, void> const&)+0x1d5) [0x7f6b3ce18825]
  [bt] (4) /home/bokyliu/.local/lib/python3.6/site-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::relay::ModuleNode::Add(tvm::relay::GlobalVar const&, tvm::relay::Function const&, bool)+0x28c) [0x7f6b3ce163cc]
  [bt] (3) /home/bokyliu/.local/lib/python3.6/site-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::relay::InferType(tvm::relay::Function const&, tvm::relay::Module const&, tvm::relay::GlobalVar const&)+0x1d7) [0x7f6b3cd39aa7]
  [bt] (2) /home/bokyliu/.local/lib/python3.6/site-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::relay::TypeInferencer::Infer(tvm::relay::Expr)+0x86) [0x7f6b3cd39326]
  [bt] (1) /home/bokyliu/.local/lib/python3.6/site-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::relay::ErrorReporter::RenderErrors(tvm::relay::Module const&, bool)+0x230c) [0x7f6b3cdf675c]
  [bt] (0) /home/bokyliu/.local/lib/python3.6/site-packages/tvm-0.6.dev0-py3.6-linux-x86_64.egg/tvm/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f6b3c724af2]
  File "/home/bokyliu/Project/incubator-tvm/src/relay/ir/error.cc", line 133
TVMError: 
Error(s) have occurred. The program has been annotated with them:

In `main`: 
v0.0.4
fn () {
  %0 = nn.conv2d(meta[relay.Constant][0], meta[relay.Constant][1], strides=[2, 2], padding=[1, 1], kernel_size=[3, 3]);
  %1 = nn.batch_norm(%0, meta[relay.Constant][2], meta[relay.Constant][3], meta[relay.Constant][4], meta[relay.Constant][5], epsilon=1e-05f);
  %2 = %1.0;
  %3 = expand_dims(meta[relay.Constant][6], axis=1);
  %4 = expand_dims(%3, axis=2);
  %5 = nn.prelu(%2, %4) tensor type `Tensor[(64), float32]` has 1 dimensions, while `Tensor[(64, 1, 1), float32]` has 3 dimensions; unable to unify: `Tensor[(64), float32]` and `Tensor[(64, 1, 1), float32]`; ;
  %6 = nn.conv2d(%5, meta[relay.Constant][7], padding=[1, 1], groups=64, kernel_size=[3, 3]);

Locating the problem:

WARNING:root:Attribute momentum is ignored in relay.sym.batch_norm

This WARNING is emitted when the following statement in python/tvm/relay/frontend/onnx.py is executed:

op = self._convert_operator(op_name, inputs, attr, opset)

where
op_name = BatchNormalization
Stepping in with a breakpoint leads on to this statement, also in frontend/onnx.py (inside _convert_operator):

sym = convert_map[op_name](inputs, attrs, self._params)

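The breakpoint keeps getting hit, once per layer.
Convolution layer:
[Screenshot: breakpoint hit at the convolution layer]
BN layer:
[Screenshot: breakpoint hit at the BN layer]
PReLU:
[Screenshot: PReLU]
What I found interesting were the inputs of PReLU, so I took a separate screenshot of them:
[Screenshot: PReLU inputs]
At first I wondered whether free_var meant a freed variable. In Relay's text format it actually marks a free variable, that is, one not bound by an enclosing function or let, such as a graph input or a weight.
Following this trail eventually leads to python/tvm/relay/op/nn/nn.py: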

def prelu(data, alpha, axis=1):
    """This operator takes data as input and does Leaky version
    of a Rectified Linear Unit.

    .. math::

        `y = x > 0 ? x : alpha * x`

    Parameters
    ----------
    data : tvm.relay.Expr
        The input data to the operator.

    alpha : tvm.relay.Expr
        Slope coefficient for the negative half axis.

    axis : int, optional
        Specify which shape axis the channel is specified.

    Returns
    -------
    result : tvm.relay.Expr
        The computed result.
    """
    return _make.prelu(data, alpha, axis)

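Later, somewhat by accident, I noticed two expand_dims operations in the TVM computation graph, with axis 1 and 2 respectively. Where did they come from?
With that in mind I went back to the graph printed during the ONNX export, and sure enough, it contains Unsqueeze operations too. They turn the PReLU slope from shape (64) into (64, 1, 1), which is exactly the mismatch that nn.prelu reports.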
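
The mismatch can be reproduced on shapes alone. The following is a simplified stand-in for the check implied by the error message, not TVM's actual type relation: it assumes nn.prelu wants a 1-D alpha with one slope per channel of the data along `axis`. Both helper functions are hypothetical, and the data shape is only an illustrative guess at the output of the first conv + BN.

```python
def prelu_alpha_error(data_shape, alpha_shape, axis=1):
    """Return None if alpha is acceptable, otherwise a description of the mismatch."""
    expected = (data_shape[axis],)
    if tuple(alpha_shape) != expected:
        return "expected alpha of shape {}, got {}".format(expected, tuple(alpha_shape))
    return None


def unsqueeze(shape, axis):
    """Shape after inserting a length-1 dimension at `axis` (like expand_dims)."""
    shape = list(shape)
    shape.insert(axis, 1)
    return tuple(shape)


data_shape = (1, 64, 56, 56)  # NCHW, illustrative
alpha_shape = (64,)           # one slope per channel

# The two Unsqueeze ops from the export, mirrored as expand_dims(axis=1) and expand_dims(axis=2)
unsqueezed = unsqueeze(unsqueeze(alpha_shape, 1), 2)

print(prelu_alpha_error(data_shape, alpha_shape))  # None: the 1-D slope is accepted
print(unsqueezed)                                  # (64, 1, 1)
print(prelu_alpha_error(data_shape, unsqueezed))   # mismatch message
```

Seen this way, the slope only passes the check if it reaches nn.prelu without the two extra dimensions, which is why the export step is the place to look.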
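So the problem lies in the ONNX export. I later learned that ONNX comes in several versions, so I switched back to PyTorch 1.0.1, which is currently the version with the fewest pitfalls.
But the ONNX file exported by PyTorch 1.0.1 contains ATen again, so I had no choice but to register the op corresponding to ATen with ONNX myself.
After that everything went smoothly and the model finally ran. There is, however, a speed difference (CPU: i5 7500) between having TVM compile the ONNX model and then run inference, and having TVM load the tar, json, and params files and then run inference: the former takes about 8 s and the latter about 330 ms. Both are slower than running the .pth directly in PyTorch (220 ms).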
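
How these timings were taken is not stated. If the 8 s figure includes the compilation step (an assumption, though a likely one), it is not comparable with the other two numbers. A minimal sketch of a fairer measurement, using warm-up runs followed by an average over repeated calls; average_latency_ms and fake_inference are hypothetical names, and the workload is a placeholder to be replaced by the real inference call:

```python
import time


def average_latency_ms(run_once, warmup=3, repeat=10):
    """Average wall-clock time of run_once() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(repeat):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / repeat


calls = []


def fake_inference():
    # Placeholder workload; replace with the real inference call
    calls.append(sum(i * i for i in range(10000)))


latency = average_latency_ms(fake_inference, warmup=2, repeat=5)
print("average latency: {:.3f} ms".format(latency))
```

Compilation or model loading should happen once, outside run_once, so that only steady-state inference is timed.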