Study of the feature fusion Siamese network algorithm for target tracking

FAN Dongjia1, LIN Mingqiang2, DAI Houde2, ZHONG Xungao1*, ZHAO Jing1

(1. College of Electrical Engineering and Automation, Xiamen University of Technology, Xiamen 361024, China; 2. Quanzhou Institute of Equipment Manufacturing, Haixi Research Institute, Chinese Academy of Sciences, Jinjiang 362216, China)

attention mechanism; object tracking; deep learning; Siamese network

DOI: 10.6043/j.issn.0438-0479.202109006

Abstract

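Object tracking is subject to dynamic uncertainty. Traditional tracking methods are prone to target drift or even tracking failure under such conditions, and in deep-learning-based trackers the deep features tend to become overly sparse and abstract as the network gets deeper, which does not help overcome these problems. This paper therefore proposes a new Siamese network tracking method that fuses attention mechanisms into the three-branch SiamMask network. The aim is to strengthen the network's ability to learn feature selection, improve the extraction of effective target features, and reduce the burden that redundant information places on the network. An improved ResNet-50 serves as the feature extraction backbone, and deep and shallow features are fused to express the tracked target's features effectively. The proposed feature fusion Siamese network framework is trained on four datasets (COCO, ImageNet-DET 2015, ImageNet-VID 2015, YouTube-VOS) and tested online on the VOT datasets. Experiments show that, compared with the other tracking methods considered in the paper, the algorithm performs better in scenarios involving dynamic target scale changes, illumination changes, and motion blur.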

Objective: Siamese networks are an important branch of deep-learning-based object tracking algorithms. They locate the target by comparing the similarity between the target template features and the features of the search area. The SiamMask algorithm, built on the Siamese network idea, predicts the target region through three branches: classification, regression, and mask. Building on SiamMask, we explore how different feature channels and spatial positions affect tracking performance when features are extracted with convolutional neural networks. We also study the effect of the extracted shallow and deep features on tracking accuracy and robustness.
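The similarity comparison described in the Objective can be illustrated with a small NumPy sketch. This is a simplified illustration, not the paper's implementation: it sums over channels in the style of SiamFC, whereas SiamMask uses a depth-wise variant that keeps one response map per channel. The function name `cross_correlation` and the toy feature sizes are assumptions made for this example.

```python
import numpy as np

def cross_correlation(search, template):
    """Slide the template features over the search features.

    search:   (C, Hs, Ws) feature map of the search region
    template: (C, Ht, Wt) feature map of the target exemplar
    Returns a (Hs-Ht+1, Ws-Wt+1) response map; a higher value means
    the window at that position is more similar to the template.
    """
    c, hs, ws = search.shape
    ct, ht, wt = template.shape
    assert c == ct, "channel counts must match"
    response = np.zeros((hs - ht + 1, ws - wt + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search[:, i:i + ht, j:j + wt]
            response[i, j] = np.sum(window * template)
    return response

# Toy data (hypothetical sizes): an empty search region with the
# template planted at row 2, column 4.
template = np.arange(1, 37, dtype=float).reshape(4, 3, 3)
search = np.zeros((4, 8, 8))
search[:, 2:5, 4:7] = template

response = cross_correlation(search, template)
peak = np.unravel_index(np.argmax(response), response.shape)
```

In this toy case the peak of `response` falls at the planted position, which is how a Siamese tracker reads off the target location from the response map.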
Methods: A modified ResNet-50, with the last residual block removed, is used as the feature extraction backbone. The output features of the second and third residual blocks are passed through convolutions that reduce the number of feature channels, and channel and spatial attention mechanisms are applied to the features after the second residual block. The two reduced features are then fused by summation, the dual attention mechanism is applied again to the fused features, and a convolution is used to learn the parameters. The network parameters are trained on four large-scale datasets, including ImageNet-VID 2015.
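The fusion pipeline in the Methods can be sketched in NumPy as follows. The abstract does not specify the exact form of the attention modules, so the gates below (global average pooling for channels, channel-wise mean for positions) are a simplified assumption, as are all function names, parameter names, and toy tensor sizes. The sketch also assumes the two residual-block outputs share the same spatial size, and it uses random weights where a real network would use trained ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w):
    """1x1 convolution: mixes channels at every pixel.
    x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)

def channel_attention(x, w):
    """Scale each channel by a gate in (0, 1) computed from its global average."""
    pooled = x.mean(axis=(1, 2))            # (C,)
    gate = sigmoid(w @ pooled)              # (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    """Scale each position by a gate in (0, 1) computed from the channel mean."""
    gate = sigmoid(x.mean(axis=0))          # (H, W)
    return x * gate[None, :, :]

def dual_attention(x, w):
    """Channel attention followed by spatial attention."""
    return spatial_attention(channel_attention(x, w))

def fuse(shallow, deep, p):
    s = conv1x1(shallow, p["reduce_shallow"])   # reduce channels of block-2 output
    d = conv1x1(deep, p["reduce_deep"])         # reduce channels of block-3 output
    s = dual_attention(s, p["att_shallow"])     # attention on the shallow branch
    fused = s + d                               # fusion by summation
    fused = dual_attention(fused, p["att_fused"])  # attention on the fused features
    return conv1x1(fused, p["out"])             # final convolution

# Toy sizes (hypothetical): 8 and 16 input channels reduced to 4.
rng = np.random.default_rng(0)
shallow = rng.normal(size=(8, 5, 5))
deep = rng.normal(size=(16, 5, 5))
params = {
    "reduce_shallow": rng.normal(size=(4, 8)),
    "reduce_deep": rng.normal(size=(4, 16)),
    "att_shallow": rng.normal(size=(4, 4)),
    "att_fused": rng.normal(size=(4, 4)),
    "out": rng.normal(size=(4, 4)),
}
out = fuse(shallow, deep, params)
```

Because both gates lie in (0, 1), attention here only rescales features: informative channels and positions are kept close to their original magnitude while the rest are attenuated.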
Results: After a detailed analysis of the SiamMask algorithm, improvements were made to address its deficiencies. The dual attention mechanism strengthens or suppresses channel and spatial features for different targets, improving the network's use of the target's useful features. Shallow and deep features are fused in the backbone network to express the target features more comprehensively. The improved algorithm was tested on the VOT2016 and VOT2018 datasets and compared with deep learning and correlation filter algorithms such as DaSiamRPN and KCF. On VOT2016, the tracking accuracy is 62.3%, the highest among the listed algorithms, while the robustness and expected average overlap are 29.4% and 37.8%, respectively, comparable to the listed algorithms. Specifically: 1) Compared with the benchmark SiamMask algorithm, accuracy improves slightly, robustness improves by 1.8 percentage points, and expected average overlap increases by 1.4 percentage points. 2) Compared with SiamFC, the algorithm uses an anchor box mechanism, which fits the target position better, and it improves on all three indicators. 3) Compared with traditional correlation filter algorithms such as KCF and ECO, the method combines deep learning with the attention mechanism and tracks more effectively under dynamic target scale changes, illumination changes, and similar scenarios. On VOT2018, the accuracy is 60.1%, 1.2 percentage points higher than the benchmark algorithm and the best among all compared algorithms. The robustness is 37.5%, an improvement of 3.7 percentage points over the benchmark algorithm. The expected average overlap is 31%, 1.4 percentage points higher than the benchmark algorithm and comparable to the other compared algorithms. These results indicate that the proposed feature fusion and attention mechanism help deal with the dynamic uncertainty present in target tracking.
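For context on the accuracy figures, the VOT protocol measures accuracy as the average overlap between predicted and ground-truth regions over successfully tracked frames. A minimal sketch follows, assuming axis-aligned boxes in (x, y, w, h) form. The VOT toolkit itself handles rotated regions and re-initialization after failures, which this example omits, and the boxes are made-up values rather than data from the paper.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy(predictions, ground_truths):
    """Mean overlap over frames where the tracker still overlaps the target."""
    overlaps = [iou(p, g) for p, g in zip(predictions, ground_truths)]
    tracked = [o for o in overlaps if o > 0]
    return sum(tracked) / len(tracked) if tracked else 0.0

# Hypothetical two-frame sequence: a perfect match, then a shifted box.
predictions = [(0, 0, 10, 10), (5, 5, 10, 10)]
ground_truths = [(0, 0, 10, 10), (0, 0, 10, 10)]
acc = accuracy(predictions, ground_truths)
```

Here the second frame overlaps by 25 out of a union of 175 area units, so the sequence accuracy is the mean of 1 and 1/7.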
Conclusions: The experimental results show that, for the heavy spatial and channel feature redundancy in the neural-network-based SiamMask object tracking algorithm, the attention mechanism can suppress or enhance the channel and spatial features of different targets, making effective use of useful features and reducing the impact of redundant features on algorithm performance. Fusing shallow and deep features during convolutional feature extraction avoids the one-sidedness of relying only on abstract features, adds shallow texture and contour features to the target representation, and makes that representation more comprehensive. With the attention mechanism and the feature fusion strategy, the algorithm copes effectively with challenges such as target scale change, illumination change, and motion blur during tracking.