Object Detection and Text Detection Algorithms: Faster R-CNN and CTPN

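These notes cover the object detection algorithm Faster R-CNN and the text detection algorithm CTPN, followed by the text recognition algorithm CRNN, as a reference for developers working with these models.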

Faster R-CNN: Object Detection Algorithm

Towards Real-Time Object Detection with Region Proposal Networks

R-CNN: Regions with CNN features

Input image
Extract region proposals(~2k)
Compute CNN features
Classify regions

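IoU (Intersection over Union)

A metric for how accurately a detector localizes the corresponding objects on a given dataset.

Predicted region: predicted bounding boxes

Ground-truth bounding boxes: the approximate extent of each object to be detected, hand-labeled on the training images

$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$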
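The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not code from any particular detector: the function name `iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for the example.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), x1 < x2, y1 < y2.

    The corner format is an assumption of this sketch.
    """
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Overlap is 1, union is 4 + 4 - 1 = 7, so the IoU is 1/7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```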


NMS (Non-Maximum Suppression)
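A minimal sketch of greedy NMS, assuming boxes in `(x1, y1, x2, y2)` format and an IoU threshold of 0.5. The function names and the default threshold are choices made for this example only. Boxes are visited in order of decreasing score, and a box is kept only if it does not overlap an already kept box by more than the threshold.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes (helper for this sketch)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    # Visit boxes from highest to lowest score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it overlaps no kept box by more than the threshold
        if all(_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU 81/119
```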

Fast R-CNN


Selective search

Anchors, sliding window, feature extraction

RPN Loss

Cls label: binary classification (object vs. no object), assigned from the IoU between the ground-truth (gt) bounding box and the anchor box

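Loc label
$t_x^* = (x^*-x_a)/w_a, \quad t_y^* = (y^*-y_a)/h_a,\\ t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)$

$t_x = (x-x_a)/w_a, \quad t_y = (y-y_a)/h_a,\\ t_w = \log(w/w_a), \quad t_h = \log(h/h_a)$

Here $x$, $x_a$, and $x^*$ refer to the predicted box, the anchor box, and the ground-truth box respectively (likewise for $y$, $w$, $h$); $(x, y)$ is the box center.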
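The parameterization above can be sketched as a pair of encode and decode helpers. The function names and the `(x_center, y_center, w, h)` tuple layout are assumptions for illustration. Encoding a ground-truth box against an anchor yields the regression targets, and decoding inverts the mapping.

```python
import math


def encode_box(box, anchor):
    """Map a box to (t_x, t_y, t_w, t_h) relative to an anchor.

    Both arguments are (x_center, y_center, w, h); this layout is assumed here.
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode_box(t, anchor):
    """Invert encode_box: recover (x_center, y_center, w, h) from the offsets."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (50.0, 50.0, 20.0, 40.0)
gt_box = (55.0, 60.0, 40.0, 20.0)
targets = encode_box(gt_box, anchor)   # (0.25, 0.25, ln 2, ln 0.5)
print(targets)
print(decode_box(targets, anchor))     # recovers gt_box up to rounding error
```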

Cls loss

Cross entropy

Loc Loss
$z_i = 0.5(x_i-y_i)^2/\beta, \quad \text{if } |x_i-y_i|<\beta\\ z_i = |x_i-y_i|-0.5\beta, \quad \text{otherwise}$
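This piecewise form is commonly called the smooth L1 loss. A minimal element-wise sketch follows; the function name and the default `beta=1.0` are assumptions of this example. The loss is quadratic for small errors and linear for large ones, and the two branches meet at $|x_i-y_i| = \beta$.

```python
def smooth_l1(x, y, beta=1.0):
    """Element-wise smooth L1 loss between a prediction x and a target y."""
    diff = abs(x - y)
    if diff < beta:
        # Quadratic region: small errors are penalized gently
        return 0.5 * diff * diff / beta
    # Linear region: large errors grow linearly, limiting the effect of outliers
    return diff - 0.5 * beta


print(smooth_l1(0.5, 0.0))  # 0.125 (quadratic branch)
print(smooth_l1(3.0, 0.0))  # 2.5 (linear branch)
```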

RoI Head (Region of Interest)

Mask R-CNN

$L = L_{cls}+L_{box}+L_{mask}$

To this we apply a per-pixel sigmoid, and define $L_{mask}$ as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, $L_{mask}$ is only defined on the k-th mask (other mask outputs do not contribute to the loss).
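A minimal pure-Python sketch of the mask loss described above, assuming the mask branch outputs one grid of logits per class as nested lists (`mask_logits[k][i][j]`). The function names and the data layout are hypothetical. Only the grid of the ground-truth class enters the loss.

```python
import math


def bce_with_logit(z, y):
    """Binary cross-entropy of sigmoid(z) against a target y in {0, 1}.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel BCE on the mask of the ground-truth class only."""
    logits_k = mask_logits[gt_class]  # the other K-1 masks are ignored
    losses = [
        bce_with_logit(z, y)
        for row_z, row_y in zip(logits_k, gt_mask)
        for z, y in zip(row_z, row_y)
    ]
    return sum(losses) / len(losses)


logits = [
    [[0.0, 0.0], [0.0, 0.0]],    # class 0: uninformative logits
    [[5.0, -5.0], [5.0, -5.0]],  # class 1: confident and correct
]
gt = [[1, 0], [1, 0]]
print(mask_loss(logits, gt, gt_class=0))  # ln 2, about 0.693
print(mask_loss(logits, gt, gt_class=1))  # close to 0
```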

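RoI Align: coordinates are not quantized to the grid; floating-point values are kept, and each small region (bin) is subdivided further.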
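Keeping floating-point coordinates means feature values must be read at non-integer positions, which is typically done with bilinear interpolation. The sketch below shows only that sampling step on a 2D feature map stored as nested lists. The function name and the clamping at the border are assumptions of this example; RoI Align then aggregates several such samples inside each bin.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a floating-point (y, x).

    Assumes 0 <= y <= H-1 and 0 <= x <= W-1; neighbours are clamped at the border.
    """
    h, w = len(feature), len(feature[0])
    y0 = min(int(math.floor(y)), h - 1)
    x0 = min(int(math.floor(x)), w - 1)
    y1 = min(y0 + 1, h - 1)
    x1 = min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0

    # Interpolate along x on the two neighbouring rows, then along y
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fmap = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear_sample(fmap, 0.5, 0.5))  # 1.5, the mean of the four corners
```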

CTPN: Text Detection Algorithm

Detecting Text in Natural Image with Connectionist Text Proposal Network

Detecting text in fine-scale proposals
Recurrent connectionist text proposals
Side-refinement

$v_c= (c_y-c_y^a)/h^a\\ v_c^* = (c_y^*-c_y^a)/h^a\\ v_h = \log(h/h^a)\\ v_h^* = \log(h^*/h^a)$
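Unlike the four-value box regression in Faster R-CNN, these formulas regress only the vertical center and the height of each proposal. A minimal sketch of that encoding and its inverse, with hypothetical function and argument names:

```python
import math


def encode_vertical(c_y, h, anchor_c_y, anchor_h):
    """Return (v_c, v_h) for a box with center y c_y and height h."""
    return ((c_y - anchor_c_y) / anchor_h, math.log(h / anchor_h))


def decode_vertical(v_c, v_h, anchor_c_y, anchor_h):
    """Invert encode_vertical: recover (c_y, h) from the relative coordinates."""
    return (v_c * anchor_h + anchor_c_y, math.exp(v_h) * anchor_h)


# An anchor centered at y=25 with height 11, and a box at y=30 with height 22
v_c, v_h = encode_vertical(30.0, 22.0, 25.0, 11.0)
print(v_c, v_h)                                # 5/11 and ln 2
print(decode_vertical(v_c, v_h, 25.0, 11.0))   # recovers (30, 22) up to rounding
```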

Text line construction

$o^* = (x^*_{side} -c^a_x)/w^a$

Code

bounding box

CRNN: Text Recognition Algorithm

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

CRNN
Code
CTC
lexicon-based
lexicon-free

Feature sequence: each vector corresponds to a receptive field in the input image

CRNN: CTC
$\pi = \texttt{--hh-e-l-ll-oo--}\\ B(\pi) = \texttt{hello}\\ p(l|y) = \sum_{\pi:B(\pi)=l} p(\pi|y), \quad p(\texttt{hello}|y) = \sum_{\pi:B(\pi)=\texttt{hello}} p(\pi|y)$
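The mapping $B$ first merges repeated symbols and then removes blanks. A minimal sketch, assuming paths are plain strings and `-` is the blank symbol as in the example above (the function name is hypothetical):

```python
from itertools import groupby


def ctc_collapse(path, blank="-"):
    """Apply B: merge runs of identical symbols, then drop the blanks."""
    return "".join(symbol for symbol, _ in groupby(path) if symbol != blank)


print(ctc_collapse("--hh-e-l-ll-oo--"))  # hello
print(ctc_collapse("hheelloo"))          # helo: without a blank, the two l's merge
```

The second call shows why a blank is required between repeated characters: without it, the double `l` collapses into one.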

CTC Theory

$p(l|x) = \sum_{\pi \in B^{-1}(l)} p(\pi|x).\\ h(x) = \arg\max_{l\in L^{\leq T}} p(l|x).\\ O^{ML}(S,N_w) = -\sum_{(x,z)\in S} \ln(p(z|x))=-\sum_{(x,z) \in S} \ln\left(\sum_{\pi \in B^{-1}(z)} p(\pi |x)\right)$
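For very short sequences these definitions can be checked by brute force: enumerate every path, collapse it with $B$, and sum the path probabilities per label. The sketch below assumes `y[t][k]` is the probability of symbol `k` at time step `t` and that symbol `0` is the blank; all function names are hypothetical. The cost is exponential in $T$, so this is only an illustration and not how CTC is computed in practice.

```python
import math
from collections import defaultdict
from itertools import groupby, product


def collapse(path, blank=0):
    """B: merge repeated symbols, then remove blanks."""
    return tuple(s for s, _ in groupby(path) if s != blank)


def label_distribution(y, blank=0):
    """Return {label: p(label | x)} by summing over all paths (brute force)."""
    num_symbols = len(y[0])
    dist = defaultdict(float)
    for path in product(range(num_symbols), repeat=len(y)):
        p = 1.0
        for t, k in enumerate(path):
            p *= y[t][k]  # p(pi | x) is the product of per-step probabilities
        dist[collapse(path, blank)] += p
    return dict(dist)


def decode_exact(y, blank=0):
    """h(x): the label with the highest total probability."""
    dist = label_distribution(y, blank)
    return max(dist, key=dist.get)


def ctc_loss(y, z, blank=0):
    """Negative log-likelihood -ln p(z | x) for one training pair."""
    return -math.log(label_distribution(y, blank)[tuple(z)])


# Two time steps, symbols: 0 = blank, 1 = 'a'
y = [[0.6, 0.4], [0.6, 0.4]]
print(label_distribution(y))  # the empty label gets 0.36, the label (1,) gets 0.64
print(decode_exact(y))        # (1,)
```

In this example the single most probable path is blank-blank (0.36), yet the most probable label is `(1,)`, because three different paths collapse to it.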

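So that every path is represented uniquely and validly in the graph, node transitions obey the following constraints:

Transitions may only move to the right or to the lower right; no other direction is allowed
Two identical characters must be separated by at least one blank
Non-blank characters cannot be skipped
A path must start at one of the first two symbols
A path must end at one of the last two symbols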

forward-backward

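Define the forward probability $\alpha_t(s)$ as the total probability of all prefix sub-paths that pass through node $s$ at time $t$.
$\alpha_3(4) = p(\_ap)+p(aap)+p(a\_p)+p(app)$

Case 1: the $s$-th symbol is the blank
$\alpha_t(s) = (\alpha_{t-1}(s)+\alpha_{t-1}(s-1)) \cdot y^t_{seq(s)}$
Case 2: the $s$-th symbol equals the $(s-2)$-th symbol
$\alpha_t(s) = (\alpha_{t-1}(s)+\alpha_{t-1}(s-1)) \cdot y^t_{seq(s)}$
Case 3: neither case 1 nor case 2 applies
$\alpha_t(s) = (\alpha_{t-1}(s)+\alpha_{t-1}(s-1)+\alpha_{t-1}(s-2)) \cdot y^t_{seq(s)}$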
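The three cases translate directly into a dynamic program over the blank-extended label sequence. This is a minimal sketch under stated assumptions: `y[t][k]` is the probability of symbol `k` at time step `t`, symbol `0` is the blank, indices are 0-based (the formulas above are 1-based), and probabilities are multiplied directly without the log-space scaling a practical implementation would need. The function name is hypothetical.

```python
def ctc_forward(y, label, blank=0):
    """Return p(label | y) using the forward recursion alpha_t(s)."""
    # Extended sequence: a blank before, between, and after the label symbols
    seq = [blank]
    for c in label:
        seq += [c, blank]
    S, T = len(seq), len(y)

    alpha = [[0.0] * S for _ in range(T)]
    # A path must start at one of the first two symbols
    alpha[0][0] = y[0][seq[0]]
    if S > 1:
        alpha[0][1] = y[0][seq[1]]

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]
            if s >= 1:
                total += alpha[t - 1][s - 1]
            # Case 3 only: skipping the blank at s-1 is allowed when seq[s] is
            # not a blank and differs from the symbol two positions back
            if s >= 2 and seq[s] != blank and seq[s] != seq[s - 2]:
                total += alpha[t - 1][s - 2]
            alpha[t][s] = total * y[t][seq[s]]

    # A path must end at one of the last two symbols
    p = alpha[T - 1][S - 1]
    if S > 1:
        p += alpha[T - 1][S - 2]
    return p


# Two time steps, symbols: 0 = blank, 1 = 'a'
y = [[0.6, 0.4], [0.6, 0.4]]
print(ctc_forward(y, [1]))     # about 0.64: paths (0,1), (1,0), (1,1)
print(ctc_forward(y, [1, 1]))  # 0.0: "aa" needs at least three time steps
```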