Journal of Xiamen University (Natural Science)

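Target tracking is subject to dynamic uncertainty. Under it, traditional tracking methods are prone to target drift or even tracking failure, while deep-learning-based trackers tend to produce overly sparse and abstract deep features as the network grows deeper, which does not help with the problem. This paper therefore proposes a new Siamese-network target tracking method that integrates an attention mechanism into the SiamMask three-branch network. The method aims to strengthen the network's ability to learn which features to select, improve the extraction of effective target features, and reduce the burden that redundant information places on the network. A modified ResNet-50 serves as the feature extraction backbone, and deep and shallow features are fused to represent the tracked target effectively. The proposed feature-fusion Siamese network framework is trained on four datasets (COCO, ImageNet-DET 2015, ImageNet-VID 2015, and YouTube-VOS) and tested online on the VOT datasets. Experiments show that, compared with the other tracking methods considered in the paper, the algorithm performs better in scenarios involving target scale change, illumination change, and motion blur.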

Objective: Siamese networks are an important branch of deep-learning-based object tracking, in which the target is located by comparing the similarity between template features and search-area features. The SiamMask algorithm, built on the Siamese network idea, predicts the target region through three branches: classification, regression, and mask. Building on SiamMask, we explore how different feature channels and spatial locations affect tracking performance when features are extracted with convolutional neural networks. We also study the effect of the extracted shallow and deep features on tracking accuracy and robustness.
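The similarity comparison described above can be illustrated with a minimal, dependency-free sketch. It assumes single-channel features stored as nested lists and an unnormalized sliding-window cross-correlation; the function names `cross_correlate` and `best_location` are hypothetical, and this is not the paper's implementation, which operates on multi-channel CNN features.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (2-D lists) and return the response map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            # Similarity score: inner product of the template and the current window.
            score = sum(
                search[y + i][x + j] * template[i][j]
                for i in range(th)
                for j in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def best_location(response):
    """Return the (row, col) of the highest score in the response map."""
    positions = (
        (y, x) for y in range(len(response)) for x in range(len(response[0]))
    )
    return max(positions, key=lambda p: response[p[0]][p[1]])


# Toy data (an assumption for illustration): the template pattern sits at offset (1, 1).
template = [[1, 2], [3, 4]]
search = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
response = cross_correlate(search, template)
print(best_location(response))  # (1, 1)
```

The peak of the response map marks where the search area best matches the template. A tracker would map that position back to image coordinates.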
Methods: A modified ResNet-50 with the last residual block removed is used as the feature extraction backbone. Convolutions reduce the number of feature channels in the outputs of the second and third residual blocks, and channel and spatial attention is applied to the features after the second residual block. The two channel-reduced features are then fused by summation, and the dual attention mechanism is applied again to the fused features. A convolution is then used to learn the parameters. The network parameters are trained on four large-scale datasets, including ImageNet-VID 2015.
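The order of operations in this pipeline can be sketched in plain Python, under several assumptions that are not taken from the paper: the two feature maps already have the same shape (the channel-reducing convolutions are omitted), and both attention gates are parameter-free sigmoids, whereas the actual attention module has learned parameters. All function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(feat):
    """Reweight each channel by a gate derived from its global average."""
    out = []
    for channel in feat:
        mean = sum(map(sum, channel)) / (len(channel) * len(channel[0]))
        gate = sigmoid(mean)  # assumption: parameter-free gate, no learned layers
        out.append([[v * gate for v in row] for row in channel])
    return out


def spatial_attention(feat):
    """Reweight each position by a gate derived from the cross-channel mean and max."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for y in range(h):
        for x in range(w):
            column = [feat[k][y][x] for k in range(c)]
            gate = sigmoid(sum(column) / c + max(column))  # assumption: no learned conv
            for k in range(c):
                out[k][y][x] = feat[k][y][x] * gate
    return out


def dual_attention(feat):
    """Channel attention followed by spatial attention."""
    return spatial_attention(channel_attention(feat))


def fuse(shallow, deep):
    """Element-wise sum of two feature maps with identical shape."""
    return [
        [[a + b for a, b in zip(row_s, row_d)] for row_s, row_d in zip(ch_s, ch_d)]
        for ch_s, ch_d in zip(shallow, deep)
    ]


def fuse_with_attention(shallow, deep):
    """Attend to the shallow features, sum with the deep ones, then attend again."""
    return dual_attention(fuse(dual_attention(shallow), deep))


# Toy features in (channel, height, width) layout; values are illustrative only.
shallow = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
deep = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
fused = fuse_with_attention(shallow, deep)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

Here `shallow` stands in for the second residual block's output and `deep` for the third's. The sketch only shows where attention is applied relative to the summation, not how the gates are learned.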
Results: After a detailed analysis of the SiamMask algorithm, improvements were made to address its deficiencies. The dual attention mechanism strengthens or suppresses the channel and spatial feature parameters for different targets, improving the network's use of the target's useful features. Shallow and deep features are fused in the backbone network to represent the target more comprehensively. The improved algorithm was tested on the VOT2016 and VOT2018 datasets and compared with deep learning and correlation filtering algorithms such as DaSiamRPN and KCF. On VOT2016, the tracking accuracy is 62.3%, the highest among the listed algorithms, while the robustness and expected average overlap are 29.4% and 37.8%, respectively, comparable to the listed algorithms. Specifically: 1) Compared with the baseline SiamMask algorithm, accuracy improves slightly, robustness improves by 1.8 percentage points, and expected average overlap increases by 1.4 percentage points. 2) Compared with SiamFC, the algorithm uses an anchor box mechanism, which fits the target position better, and it improves on all three indicators. 3) Compared with traditional correlation filtering algorithms such as KCF and ECO, the method combines deep learning with an attention mechanism, which effectively improves tracking under target scale change, illumination change, and similar conditions. On VOT2018, the accuracy is 60.1%, 1.2 percentage points higher than the baseline and the best among all compared algorithms. The robustness is 37.5%, a 3.7 percentage point improvement over the baseline. The expected average overlap is 31%, 1.4 percentage points higher than the baseline and comparable to the other compared algorithms. These results indicate that the proposed feature fusion and attention mechanism help handle the dynamic uncertainty in target tracking.
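The accuracy figures above are overlap-based. As a rough illustration, the sketch below computes the intersection-over-union of axis-aligned boxes and averages it over frames. It is a simplification under stated assumptions: boxes are `(x, y, width, height)` tuples, the helper names `iou` and `mean_overlap` are hypothetical, and the official VOT protocol (rotated ground-truth boxes, re-initialization after failures) is not reproduced.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mean_overlap(predictions, ground_truths):
    """Average per-frame overlap, the quantity an accuracy score summarizes."""
    overlaps = [iou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(overlaps) / len(overlaps)


# Two toy frames: a perfect prediction and one shifted by half the box width.
predictions = [(0, 0, 10, 10), (5, 0, 10, 10)]
ground_truths = [(0, 0, 10, 10), (0, 0, 10, 10)]
print(mean_overlap(predictions, ground_truths))
```

The second frame overlaps by 50 of 150 units of area, so the two-frame average is about 0.67.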
Conclusions: The experimental results show that, for the heavy redundancy of spatial and channel features in the neural-network-based SiamMask object tracking algorithm, the attention mechanism can suppress or enhance the channel and spatial features of different targets, making effective use of useful features and reducing the impact of redundant features on performance. Fusing shallow and deep features during convolutional feature extraction avoids the one-sidedness of using only abstract features, adds shallow texture and contour features to the target representation, and makes that representation more comprehensive. With the attention mechanism and the feature fusion strategy, the algorithm copes effectively with challenges such as target scale change, illumination change, and motion blur during tracking.

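Introduction
1 SiamMask three-branch target tracking network structure
2 Proposed method
3 Design of the dual attention model
4 Feature fusion network
5 Loss function
6 Experimental results and analysis
7 Conclusions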

Fig.1 The SiamMask three-branch network for target tracking

Fig.2 The improved target tracking algorithm based on SiamMask network

Fig.3 CSAM attention model

Tab.1 The tracking results of different methods on VOT2016

Fig.4 The A-R graphs of different algorithms in different scenarios of VOT2016 dataset (S=30)

Tab.2 The tracking results of different methods on VOT2018

Fig.5 The A-R graphs of different algorithms in different scenarios of VOT2018 dataset (S=30)

Fig.6 The results of different tracking algorithms under different scenarios in VOT2016 dataset

Fig.7 Target focusing results of the attention mechanism

[1] BOLME D S,BEVERIDGE J R,DRAPER B A,et al.Visual object tracking using adaptive correlation filters[C]∥Computer Society Conference on Computer Vision and Pattern Recognition.San Francisco:IEEE,2010:2544-2550.
[2] HENRIQUES J F,CASEIRO R,MARTINS P,et al.High-speed tracking with kernelized correlation filters[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2014,37(3):583-596.
[3] HENRIQUES J F,CASEIRO R,MARTINS P,et al.Exploiting the circulant structure of tracking-by-detection with kernels[C]∥European Conference on Computer Vision.Berlin:Springer,2012:702-715.
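[4] CHENG Y,LI J Z,LI A H,et al.Confidence-based weighted feature fusion for correlation filter tracking[J].Computer Engineering and Applications,2019,55(20):152-158.(in Chinese)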
[5] BERTINETTO L,VALMADRE J,HENRIQUES J F,et al.Fully-convolutional siamese networks for object tracking[C]∥European Conference on Computer Vision.Cham:Springer,2016:850-865.
[6] KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet classification with deep convolutional neural networks[J].Communications of the ACM,2017,60(6):84-90.
[7] LI B,YAN J,WU W,et al.High performance visual tracking with siamese region proposal network[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE,2018:8971-8980.
[8] LI B,WU W,WANG Q,et al.SiamRPN++:evolution of siamese visual tracking with very deep networks[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition.Long Beach:IEEE,2019:4282-4291.
[9] WANG Q,ZHANG L,BERTINETTO L,et al.Fast online object tracking and segmentation:a unifying approach[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition.Long Beach:IEEE,2019:1328-1338.
[10] VOIGTLAENDER P,LUITEN J,TORR P H S,et al.Siam R-CNN:Visual tracking by re-detection[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle:IEEE,2020:6578-6588.
[11] CHEN Z,ZHONG B,LI G,et al.Siamese box adaptive network for visual tracking[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle:IEEE,2020:6668-6677.
[12] WANG Q,WU B,ZHU P,et al.ECA-Net:efficient channel attention for deep convolutional neural networks[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition.Seattle:IEEE,2020:11534-11542.
[13] WOO S,PARK J,LEE J Y,et al.CBAM:convolutional block attention module[C]∥European Conference on Computer Vision(ECCV).Cham:Springer,2018:3-19.
[14] LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft COCO:common objects in context[C]∥European Conference on Computer Vision.Cham:Springer,2014:740-755.
[15] RUSSAKOVSKY O,DENG J,SU H,et al.ImageNet large scale visual recognition challenge[J].International Journal of Computer Vision,2015,115(3):211-252.
[16] XU N,YANG L,FAN Y,et al.YouTube-VOS:sequence-to-sequence video object segmentation[C]∥European Conference on Computer Vision(ECCV).Cham:Springer,2018:585-601.
[17] KRISTAN M,LEONARDIS A,MATAS J,et al.The visual object tracking VOT2016 challenge results[C]∥European Conference on Computer Vision.Heidelberg:Springer,2016:777-823.
[18] KRISTAN M,LEONARDIS A,MATAS J,et al.The sixth visual object tracking VOT2018 challenge results[C]∥European Conference on Computer Vision.Cham:Springer,2018:3-53.
[19] ZHU Z,WANG Q,LI B,et al.Distractor-aware siamese networks for visual object tracking[C]∥European Conference on Computer Vision.Cham:Springer,2018:101-117.
[20] DANELLJAN M,BHAT G,KHAN F S,et al.ECO:efficient convolution operators for tracking[C]∥IEEE Conference on Computer Vision and Pattern Recognition(CVPR).Honolulu:IEEE,2017:6638-6646.
[21] DANELLJAN M,HÄGER G,KHAN F S,et al.Learning spatially regularized correlation filters for visual tracking[C]∥IEEE International Conference on Computer Vision(ICCV).Santiago:IEEE,2015:4310-4318.
[22] NAM H,HAN B.Learning multi-domain convolutional neural networks for visual tracking[C]∥IEEE Conference on Computer Vision and Pattern Recognition(CVPR).Las Vegas:IEEE,2016:4293-4302.
[23] BERTINETTO L,VALMADRE J,GOLODETZ S,et al.Staple:complementary learners for real-time tracking[C]∥IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas:IEEE,2016:1401-1409.
[24] DANELLJAN M,HÄGER G,KHAN F,et al.Accurate scale estimation for robust visual tracking[C]∥British Machine Vision Conference.Nottingham:BMVA Press,2014:471-482.
[25] VOJIR T,NOSKOVA J,MATAS J.Robust scale-adaptive mean-shift for tracking[J].Pattern Recognition Letters,2014,49(3):250-258.
[26] DANELLJAN M,ROBINSON A,KHAN F S,et al.Beyond correlation filters:learning continuous convolution operators for visual tracking[C]∥European Conference on Computer Vision.Cham:Springer,2016:472-488.
[27] PU S,SONG Y,MA C.Deep attentive tracking via reciprocative learning[C]∥Advances in Neural Information Processing Systems.Cambridge,MA:MIT,2018:1931-1941.
[28] BHAT G,JOHNANDER J,DANELLJAN M,et al.Unveiling the power of deep tracking[C]∥European Conference on Computer Vision(ECCV).Cham:Springer,2018:483-498.
[29] HE A,LUO C,TIAN X,et al.A twofold Siamese network for real-time object tracking[C]∥IEEE Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE,2018:4834-4843.
[30] ABDELPAKEY M H,SHEHATA M S,MOHAMED M M.DensSiam:end-to-end densely-siamese network with self-attention model for object tracking[C]∥International Symposium on Visual Computing.Cham:Springer,2018:463-473.
