[Notes] Deploying AirFace with TVM

2024-03-26 20:38
Tags: deployment, notes, tvm, airface


Deploying AirFace on the TX2's ARM CPU with TVM (C++)

  • Contents
    • Preface
    • Auto-tuning
    • On-device testing

Preface

Don't ask why I'm using the ARM cores on the TX2. It's simply convenient for development, and I'm used to treating it as my all-purpose workbench.

Auto-tuning

One of TVM's design highlights is that it can tune a network from a PC via RPC, which greatly speeds up the tuning process.
Even with the PC doing the heavy lifting, though, tuning is still slow in practice; it is very much an alchemy-like trial-and-error process. It also depends heavily on CPU performance: the examples TVM provides were all tuned on 32-thread servers. As a side note, during auto-tuning TVM uses at most as many threads as the CPU has.
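Behind the scenes this relies on TVM's RPC tracker running on the PC and an RPC server running on the device. For reference, a sketch of the two stock commands (the <pc_ip> placeholder is whatever address the TX2 can reach the PC at, and the key must match device_key in the script below):

# on the PC: start the tracker the tuner connects to (port 9190, as in the script)
python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190

# on the TX2: start an RPC server and register it with the tracker
python -m tvm.exec.rpc_server --tracker=<pc_ip>:9190 --key=tx2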

According to FrozenGene, the graph tuner cannot be used on ARM yet.
Without further ado, the code:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
"""
Auto-tuning a convolutional network for ARM CPU
===============================================
**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_, `Zhao Wu <https://github.com/FrozenGene>`_, `Eddie Yan <https://github.com/eqy>`_Auto-tuning for a specific ARM device is critical for getting the best
performance. This is a tutorial about how to tune a whole convolutional
network.The operator implementation for ARM CPU in TVM is written in template form.
The template has many tunable knobs (tile factor, vectorization, unrolling, etc).
We will tune all convolution and depthwise convolution operators
in the neural network. After tuning, we produce a log file which stores
the best knob values for all required operators. When the TVM compiler compiles
these operators, it will query this log file to get the best knob values.We also released pre-tuned parameters for some arm devices. You can go to
`ARM CPU Benchmark <https://github.com/apache/incubator-tvm/wiki/Benchmark#arm-cpu>`_
to see the results.
"""######################################################################import os
import onnx
import numpy as np
import tvm
from tvm import autotvm
from tvm import relay
import tvm.relay.testing
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
from tvm.contrib.util import tempdir
import tvm.contrib.graph_runtime as runtime
from tvm.contrib import util
import tvm.testing  # for tvm.testing.assert_allclose used below

model_name = "face_load_weight"
model_dir = '/home/bokyliu/dukto/fxp/AirFace/2d_facerecognition/20191119-1/test/%s.onnx' % model_name
input_name = "0"
#################################################################
# Define network
# --------------
# First we need to define the network in relay frontend API.
# We can load some pre-defined network from :code:`relay.testing`.
# We can also load models from MXNet, ONNX and TensorFlow.

def get_network(name, batch_size):
    """Get the symbol definition and random weight of a network"""
    input_shape = (batch_size, 3, 224, 224)
    output_shape = (batch_size, 1000)
    if "resnet" in name:
        n_layer = int(name.split('-')[1])
        mod, params = relay.testing.resnet.get_workload(num_layers=n_layer, batch_size=batch_size, dtype=dtype)
    elif "vgg" in name:
        n_layer = int(name.split('-')[1])
        mod, params = relay.testing.vgg.get_workload(num_layers=n_layer, batch_size=batch_size, dtype=dtype)
    elif name == 'mobilenet':
        mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size)
    elif name == 'squeezenet_v1.1':
        mod, params = relay.testing.squeezenet.get_workload(batch_size=batch_size, version='1.1', dtype=dtype)
    elif name == 'inception_v3':
        input_shape = (1, 3, 299, 299)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == 'mxnet':
        # an example for mxnet model
        from mxnet.gluon.model_zoo.vision import get_model
        block = get_model('resnet18_v1', pretrained=True)
        mod, params = relay.frontend.from_mxnet(block, shape={'data': input_shape}, dtype=dtype)
        net = mod["main"]
        net = relay.Function(net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs)
        mod = relay.Module.from_expr(net)
    elif name == 'onnx':
        input_shape = (batch_size, 3, 112, 112)
        onnx_model = onnx.load(model_dir)
        shape_dict = {input_name: (1, 3, 112, 112)}
        output_shape = (1, 512)
        mod, params = relay.frontend.from_onnx(onnx_model, shape_dict, dtype="float32")
    else:
        raise ValueError("Unsupported network: " + name)
    return mod, params, input_shape, output_shape

######################################################################
# Set Tuning Options
# ------------------
# Before tuning, we should apply some configurations. Here I use an RK3399 board
# as example. In your setting, you should modify the target and device_key accordingly.
# set :code:`use_android` to True if you use android phone.

#### DEVICE CONFIG ####
# Replace "aarch64-linux-gnu" with the correct target of your board.
# This target is used for cross compilation. You can query it by :code:`gcc -v` on your device.
target = tvm.target.create('llvm -device=arm_cpu -target=aarch64-linux-gnu')

# Also replace this with the device key in your tracker
device_key = 'tx2'

# Set this to True if you use android phone
use_android = False

#### TUNING OPTION ####
network = 'onnx'
log_file = "%s.%s.log" % (device_key, network)
dtype = 'float32'

tuning_option = {
    'log_filename': log_file,
    'tuner': 'xgb',
    'n_trial': 1500,
    'early_stopping': 800,
    'try_spatial_pack_depthwise': True,
    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func='ndk' if use_android else 'default'),
        runner=autotvm.RPCRunner(device_key, host='0.0.0.0', port=9190,
                                 number=5, timeout=10,),
    ),
}

num_threads = 4
os.environ["TVM_NUM_THREADS"] = str(num_threads)

####################################################################
#
# .. note:: How to set tuning options
#
#   In general, the default values provided here work well.
#   If you have enough time budget, you can set :code:`n_trial`, :code:`early_stopping` larger,
#   which makes the tuning run longer.
#   If your device runs very slow or your conv2d operators have many GFLOPs, considering to
#   set timeout larger.
#
#   If your model has depthwise convolution, you could consider setting
#   :code:`try_spatial_pack_depthwise` be :code:`True`, which perform better than default
#   optimization in general. For example, on ARM CPU A53 2.0GHz, we find it could boost 1.6x
#   performance of depthwise convolution on Mobilenet V1 model.

###################################################################
# Begin Tuning
# ------------
# Now we can extract tuning tasks from the network and begin tuning.
# Here, we provide a simple utility function to tune a list of tasks.
# This function is just an initial implementation which tunes them in sequential order.
# We will introduce a more sophisticated tuning scheduler in the future.
# You can skip the implementation of this function for this tutorial.
def tune_tasks(tasks,
               measure_option,
               tuner='xgb',
               n_trial=1000,
               early_stopping=None,
               log_filename='tuning.log',
               use_transfer_learning=True,
               try_winograd=True,
               try_spatial_pack_depthwise=True):
    if try_winograd:
        for i in range(len(tasks)):
            try:  # try winograd template
                tsk = autotvm.task.create(tasks[i].name, tasks[i].args,
                                          tasks[i].target, tasks[i].target_host, 'winograd')
                input_channel = tsk.workload[1][1]
                if input_channel >= 64:
                    tasks[i] = tsk
            except Exception:
                pass

    # if we want to use spatial pack for depthwise convolution
    if try_spatial_pack_depthwise:
        tuner = 'xgb_knob'
        for i in range(len(tasks)):
            if tasks[i].name == 'topi_nn_depthwise_conv2d_nchw':
                tsk = autotvm.task.create(tasks[i].name, tasks[i].args,
                                          tasks[i].target, tasks[i].target_host,
                                          'contrib_spatial_pack')
                tasks[i] = tsk

    # create tmp log file
    tmp_log_file = log_filename + ".tmp"
    if os.path.exists(tmp_log_file):
        os.remove(tmp_log_file)

    for i, tsk in enumerate(reversed(tasks)):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

        # create tuner
        if tuner == 'xgb' or tuner == 'xgb-rank':
            tuner_obj = XGBTuner(tsk, loss_type='rank')
        elif tuner == 'xgb_knob':
            tuner_obj = XGBTuner(tsk, loss_type='rank', feature_type='knob')
        elif tuner == 'ga':
            tuner_obj = GATuner(tsk, pop_size=50)
        elif tuner == 'random':
            tuner_obj = RandomTuner(tsk)
        elif tuner == 'gridsearch':
            tuner_obj = GridSearchTuner(tsk)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        if use_transfer_learning:
            if os.path.isfile(tmp_log_file):
                tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

        # do tuning
        n_trial = min(n_trial, len(tsk.config_space))
        # n_trial = len(tsk.config_space)
        tuner_obj.tune(n_trial=n_trial,
                       early_stopping=early_stopping,
                       measure_option=measure_option,
                       callbacks=[
                           autotvm.callback.progress_bar(n_trial, prefix=prefix),
                           autotvm.callback.log_to_file(tmp_log_file)])

    # pick best records to a cache file
    autotvm.record.pick_best(tmp_log_file, log_filename)
    os.remove(tmp_log_file)

########################################################################
# Finally, we launch tuning jobs and evaluate the end-to-end performance.

def tune_and_evaluate(tuning_opt):
    # extract workloads from relay program
    print("Extract tasks...")
    mod, params, input_shape, outshape = get_network(network, batch_size=1)
    tasks = autotvm.task.extract_from_program(mod["main"], target=target,
                                              params=params,
                                              ops=(relay.op.nn.conv2d,))

    # run tuning tasks
    print("Tuning...")
    tune_tasks(tasks, **tuning_opt)

    # compile kernels with history best records
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with relay.build_config(opt_level=1):
            graph, lib, params = relay.build_module.build(mod, target=target, params=params)

        # export library, graph and params to fixed paths so the results
        # survive after the process exits (the tutorial kept them in /tmp)
        lib_dir = '/home/bokyliu/Project/TVM/%s_tune_lib-fp32.tar' % model_name
        graph_dir = '/home/bokyliu/Project/TVM/%s_tune_graph-fp32.json' % model_name
        params_dir = '/home/bokyliu/Project/TVM/%s_tune_params-fp32' % model_name
        tmp = tempdir()
        if use_android:
            from tvm.contrib import ndk
            filename = "net.so"
            lib.export_library(tmp.relpath(filename), ndk.create_shared)
        else:
            lib.export_library(lib_dir)
        with open(graph_dir, "w") as fo:
            fo.write(graph)
        with open(params_dir, "wb") as fo:
            fo.write(relay.save_param_dict(params))

        # upload module to device
        print("Upload...")
        remote = autotvm.measure.request_remote(device_key, '0.0.0.0', 9190,
                                                timeout=10000)
        remote.upload(lib_dir)
        # load_module takes the basename of the file that was uploaded
        remote_tar = '%s_tune_lib-fp32.tar' % model_name
        rlib = remote.load_module(remote_tar)

        # run on the device
        ctx = remote.context(str(target), 0)
        module = runtime.create(graph, rlib, ctx)
        data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
        module.set_input('0', data_tvm)
        module.set_input(**params)
        module.run()
        out0 = module.get_output(0, tvm.nd.empty(outshape)).asnumpy()

        # test the compiled graph locally against the device output
        # (this check only works when the host can execute the target code)
        ctx = tvm.cpu()
        module = runtime.create(graph, lib, ctx)
        module.set_input("0", data_tvm)
        module.set_input(**params)
        module.run()
        out1 = module.get_output(0, tvm.nd.empty(outshape)).asnumpy()
        tvm.testing.assert_allclose(out0, out1, atol=1e-3)

        # evaluate
        print("Evaluate inference time cost...")
        ftimer = module.module.time_evaluator("run", ctx, number=12, repeat=10)
        prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
        print("Mean inference time (std dev): %.2f ms (%.2f ms)" %
              (np.mean(prof_res), np.std(prof_res)))

# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

tune_and_evaluate(tuning_option)

######################################################################
# Sample Output
# -------------
# The tuning needs to compile many programs and extract feature from them.
# So a high performance CPU is recommended.
# One sample output is listed below.
# It takes about 2 hours on a 32T AMD Ryzen Threadripper.
#
# .. code-block:: bash
#
#    Extract tasks...
#    Tuning...
#    [Task  1/12]  Current/Best:   22.37/  52.19 GFLOPS | Progress: (544/1000) | 406.59 s Done.
#    [Task  2/12]  Current/Best:    6.51/  18.77 GFLOPS | Progress: (608/1000) | 325.05 s Done.
#    [Task  3/12]  Current/Best:    4.67/  24.87 GFLOPS | Progress: (480/1000) | 372.31 s Done.
#    [Task  4/12]  Current/Best:   11.35/  46.83 GFLOPS | Progress: (736/1000) | 602.39 s Done.
#    [Task  5/12]  Current/Best:    1.01/  19.80 GFLOPS | Progress: (448/1000) | 262.16 s Done.
#    [Task  6/12]  Current/Best:    2.47/  23.76 GFLOPS | Progress: (672/1000) | 563.85 s Done.
#    [Task  7/12]  Current/Best:   14.57/  33.97 GFLOPS | Progress: (544/1000) | 465.15 s Done.
#    [Task  8/12]  Current/Best:    1.13/  17.65 GFLOPS | Progress: (576/1000) | 365.08 s Done.
#    [Task  9/12]  Current/Best:   14.45/  22.66 GFLOPS | Progress: (928/1000) | 724.25 s Done.
#    [Task 10/12]  Current/Best:    3.22/  15.36 GFLOPS | Progress: (864/1000) | 564.27 s Done.
#    [Task 11/12]  Current/Best:   11.03/  32.23 GFLOPS | Progress: (736/1000) | 635.15 s Done.
#    [Task 12/12]  Current/Best:    8.00/  21.65 GFLOPS | Progress: (1000/1000) | 1111.81 s Done.
#    Compile...
#    Upload...
#    Evaluate inference time cost...
#    Mean inference time (std dev): 162.59 ms (0.06 ms)

######################################################################
#
# .. note:: **Experiencing Difficulties?**
#
#   The auto tuning module is error-prone. If you always see " 0.00/ 0.00 GFLOPS",
#   then there must be something wrong.
#
#   First, make sure you set the correct configuration of your device.
#   Then, you can print debug information by adding these lines in the beginning
#   of the script. It will print every measurement result, where you can find useful
#   error messages.
#
#   .. code-block:: python
#
#      import logging
#      logging.getLogger('autotvm').setLevel(logging.DEBUG)
#
#   Finally, always feel free to ask our community for help on https://discuss.tvm.ai

Since I have never fully digested TVM, this code is adapted directly from the official tutorial.
The main changes:

  • Set try_spatial_pack_depthwise to True
  • Changed n_trial and early_stopping
  • Saved the tuning artifacts to fixed paths (this part of the tutorial is a real trap: the results of tens of hours of tuning get saved under /tmp/***/ and are deleted automatically as soon as the program exits); a rebuild-from-log sketch follows this list.
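Because the log file now persists, you can recompile the deployable graph/lib/params at any time without re-tuning. A minimal sketch, assuming model_dir and target are defined as in the script above, the log tx2.onnx.log already exists, and the face_rebuild.* output names are purely illustrative:

import onnx
import tvm
from tvm import autotvm, relay

# load the same ONNX model and apply the tuned schedules from the log
onnx_model = onnx.load(model_dir)
mod, params = relay.frontend.from_onnx(onnx_model, {"0": (1, 3, 112, 112)}, dtype="float32")
with autotvm.apply_history_best("tx2.onnx.log"):
    with relay.build_config(opt_level=1):
        graph, lib, params = relay.build_module.build(mod, target=target, params=params)

# persist all three artifacts to stable paths (not /tmp)
lib.export_library("face_rebuild.tar")
with open("face_rebuild.json", "w") as fo:
    fo.write(graph)
with open("face_rebuild.params", "wb") as fo:
    fo.write(relay.save_param_dict(params))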

On-device testing

After auto-tuning finishes, copy the tuned graph (.json), library (.tar) and parameter files to the TX2 and run:

import numpy as np
import tvm
from tvm.contrib import graph_runtime

path_lib = './100-net-fp16.tar'
loaded_json = open("./face_partial_tune_graph-fp16.json").read()
loaded_lib = tvm.module.load(path_lib)
loaded_params = bytearray(open('./face_partial_tune_params-fp16', 'rb').read())
input_data = tvm.nd.array(np.random.uniform(size=(1, 3, 112, 112)).astype('float32'))
input_name = "0"  # the input node name in the graph
ctx = tvm.cpu()
module = graph_runtime.create(loaded_json, loaded_lib, ctx)
module.set_input(input_name, input_data)
# module.set_input(**loaded_params)
module.load_params(loaded_params)

# evaluate
print("Evaluate inference time cost...")
ftimer = module.module.time_evaluator("run", ctx, number=100, repeat=3)
prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
print("Mean inference time (std dev): %.2f ms (%.2f ms)" %(np.mean(prof_res), np.std(prof_res)))

This generates the .so file and measures the inference time: when tvm.module.load is handed a .tar, it extracts the compiled object files and links them into a shared library named <archive>.tar.so next to the archive, which is the file the C++ code below loads.
Next, I recommend checking whether the tuned model's output differs noticeably from the original model. Here is a script for that as well:

import numpy as np
import tvm
import tvm.relay as relay
from tvm.contrib import graph_runtime
import torch
# import cv2 as cv

test_json = '/home/face/tvm_cpp/modelFolder/face_partial_tune_graph-fp16-load.json'
test_lib = '/home/face/tvm_cpp/modelFolder/100-net-fp16-load.tar.so'
test_param = '/home/face/tvm_cpp/modelFolder/face_partial_tune_params-fp16-load'

loaded_json = open(test_json).read()
loaded_lib = tvm.module.load(test_lib)
loaded_params = bytearray(open(test_param, "rb").read())

def preprocess(img_src):
    img_src = cv.cvtColor(img_src, cv.COLOR_BGR2RGB)
    img_src = cv.resize(img_src, (112, 112))
    input_data = np.array(img_src).astype(np.float32)
    input_data = input_data / 255.0
    input_data = np.transpose(input_data, (2, 0, 1))
    input_data[0] = (input_data[0] - 0.5) / 0.5
    input_data[1] = (input_data[1] - 0.5) / 0.5
    input_data[2] = (input_data[2] - 0.5) / 0.5
    input_data = input_data[np.newaxis, :].copy()
    return input_data

# img = cv.imread("/home/face/anna/164_2.jpg")
# img_input = preprocess(img)

ctx = tvm.cpu(0)
module = graph_runtime.create(loaded_json, loaded_lib, ctx)
module.load_params(loaded_params)

tempimg0 = torch.ones(1, 3, 112, 112)
# run the module
module.set_input("0", tempimg0)
module.run()
out_deploy = module.get_output(0).asnumpy()
print(out_deploy)

No surprises here: the output matched torch within a very small error, so I could move on to the C++ deployment. I took quite a few detours getting the C++ deployment to work, mainly because reference material is scarce. Without further ado, the code:
CMakeLists.txt

cmake_minimum_required(VERSION 2.8.12)

project(tvm_cpp)

set(CMAKE_INCLUDE_CURRENT_DIR ON)
set(CMAKE_AUTOMOC ON)

find_package(Qt5Core)

set(OpenCV_DIR /home/face/addition/opencv-3.4.2/build)
find_package(OpenCV REQUIRED)
if(OpenCV_FOUND)
    include_directories(${OpenCV_INCLUDE_DIRS})
    message(STATUS "OpenCV library status:")
    message(STATUS "    version: ${OpenCV_VERSION}")
    message(STATUS "    libraries: ${OpenCV_LIBS}")
    message(STATUS "    include path: ${OpenCV_INCLUDE_DIRS}")
endif()

# TVM headers must be visible before the target is created
include_directories("~/tvm/include")
include_directories("~/tvm/3rdparty/dlpack/include")
include_directories("~/tvm/3rdparty/dmlc-core/include")

add_executable(${PROJECT_NAME} "main.cpp")
target_link_libraries(${PROJECT_NAME} Qt5::Core)
target_link_libraries(tvm_cpp
    "~/tvm/build/libtvm.so"
    "~/tvm/build/libtvm_runtime.so"
    ${OpenCV_LIBS})

main.cpp

#include <QCoreApplication>
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/registry.h>
#include <tvm/runtime/packed_func.h>
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <opencv2/dnn/dnn.hpp>
#include <dirent.h>

// dir_name: directory to scan; v: output list of file paths
int find_dir_file(std::string dir_name, std::vector<std::string> &v)
{
    DIR *dirp;
    struct dirent *dp;
    std::vector<std::string> first;
    dirp = opendir(dir_name.c_str());
    while ((dp = readdir(dirp)) != NULL)
    {
        // skip the '.' and '..' entries
        if (dp->d_name[0] == '.')
            continue;
        first.push_back(dp->d_name);
    }
    (void)closedir(dirp);
    std::cout << "first.size = " << first.size() << std::endl;

    // search subdirectories
    std::vector<std::string> sec;
    for (int i = 0; i < first.size(); i++)
    {
        std::string second = dir_name + "/" + first[i];
        dirp = opendir(second.c_str());
        while ((dp = readdir(dirp)) != NULL)
        {
            if (dp->d_name[0] == '.')
                continue;
            std::string save = second + "/" + dp->d_name;
            sec.push_back(save);
        }
        (void)closedir(dirp);
    }
    std::cout << "sec.size = " << sec.size() << std::endl;

    // search sub-subdirectories
    std::cout << sec[0] << std::endl;
    std::cout << sec[1] << std::endl;
    std::vector<std::string> trd;
    for (int i = 0; i < sec.size(); i++)
    {
        std::string third = sec[i];
        dirp = opendir(third.c_str());
        while ((dp = readdir(dirp)) != NULL)
        {
            if (dp->d_name[0] == '.')
                continue;
            std::string save = third + "/" + dp->d_name;
            v.push_back(save);
        }
        (void)closedir(dirp);
    }
    return 0;
}

// copy an HWC cv::Mat into a CHW float buffer, scaled to [0, 1]
void Mat_to_CHW(float *data, cv::Mat &frame)
{
    assert(data && !frame.empty());
    unsigned int volChl = 112 * 112;
    for (int c = 0; c < 3; ++c)
    {
        for (unsigned j = 0; j < volChl; ++j)
            data[c * volChl + j] = static_cast<float>(float(frame.data[j * 3 + c]) / 255.0);
    }
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    std::vector<std::string> v;
    find_dir_file("/home/face/kaoqin_112/", v);
    int num = v.size();
    std::cout << "total img num = " << num << std::endl;

    // tvm module for compiled functions
    tvm::runtime::Module mod_syslib = tvm::runtime::Module::LoadFromFile("../modelFolder/100-net-fp16-load.tar.so");

    // load graph
    std::ifstream json_in("../modelFolder/face_partial_tune_graph-fp16-load.json");
    std::string json_data((std::istreambuf_iterator<char>(json_in)), std::istreambuf_iterator<char>());
    json_in.close();

    // parameters in binary
    std::ifstream params_in("../modelFolder/face_partial_tune_params-fp16-load", std::ios::binary);
    std::string params_data((std::istreambuf_iterator<char>(params_in)), std::istreambuf_iterator<char>());
    params_in.close();

    // parameters need to be TVMByteArray type to indicate the binary data
    TVMByteArray params_arr;
    params_arr.data = params_data.c_str();
    params_arr.size = params_data.length();

    int dtype_code = kDLFloat;
    int dtype_bits = 32;
    int dtype_lanes = 1;
    int device_type = kDLCPU;
    int device_id = 0;

    // get global function module for graph runtime
    tvm::runtime::Module mod = (*tvm::runtime::Registry::Get("tvm.graph_runtime.create"))(json_data, mod_syslib, device_type, device_id);

    DLTensor *x;
    int in_ndim = 4;
    int64_t in_shape[4] = {1, 3, 112, 112};
    TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);

    // create csv
    std::ofstream rgbData;
    rgbData.open("FeatureData.csv", std::ios::out | std::ios::trunc);

    // load image from cv mat
    float avg_time = 0;
    float totaltime = 0;
    for (int i = 0; i < v.size(); i++)
    {
        cv::Mat tensor = cv::imread(v[i]);
        if (tensor.empty())
            continue;
        cv::cvtColor(tensor, tensor, cv::COLOR_BGR2RGB);
        float testinput[112 * 112 * 3];
        Mat_to_CHW(testinput, tensor);
        int size = sizeof(float);
        memcpy(x->data, &testinput, 3 * 112 * 112 * size);

        // get the function from the module (set input data)
        tvm::runtime::PackedFunc set_input = mod.GetFunction("set_input");
        set_input("0", x);

        // get the function from the module (load parameters)
        tvm::runtime::PackedFunc load_params = mod.GetFunction("load_params");
        load_params(params_arr);

        // get the function from the module (run it)
        tvm::runtime::PackedFunc run = mod.GetFunction("run");
        for (int j = 0; j < 1; j++)
        {
            double t = (double)cv::getTickCount();
            run();
            float timeuse = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
            if (i != 0)
            {
                totaltime += timeuse;
                avg_time = totaltime / (float)i;
            }
            std::cout << v[i] << " time: " << timeuse << " average time: " << avg_time << std::endl;
        }

        tvm::runtime::PackedFunc get_output = mod.GetFunction("get_output");
        tvm::runtime::NDArray res = get_output(0);
        float *p_res = (float *)res->data;

        // L2-normalize the 512-d feature before writing it out
        std::vector<float> f1;
        float ssum = 0;
        for (int j = 0; j < 512; j++)
        {
            ssum += p_res[j] * p_res[j];
        }
        ssum = sqrt(ssum);
        for (int j = 0; j < 512; j++)
        {
            f1.push_back(p_res[j] / ssum);
        }

        rgbData << v[i] << ",";
        for (int j = 0; j < 512; j++)
        {
            rgbData << f1[j] << ",";
        }
        rgbData << std::endl;
    }
    rgbData.close();
    TVMArrayFree(x);
    return 0;
}
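Building and running follows the usual CMake flow; a sketch, assuming Qt5, OpenCV and the TVM libraries are laid out as in the CMakeLists.txt above:

# out-of-source build of the tvm_cpp target defined above
mkdir build && cd build
cmake ..
make
# run from a location where the ../modelFolder/ paths in main.cpp resolve
./tvm_cpp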

As you can see, the deployment code itself is quite simple, but the actual inference speed is still not all that satisfying: MNN runs this model in under 80 ms, while TVM takes 148 ms. It may be that my tuning approach was off, and I hope to learn more later, but this should be the overall process for deploying with TVM on an ARM CPU.
