Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition
Yuan, Yuan; Tian, Chunlin; Lu, Xiaoqiang1
Source Publication: IEEE ACCESS
Contribution Rank: 1

Audio-visual speech recognition (AVSR) uses both audio and video modalities for robust automatic speech recognition. Deep neural network (DNN) models have achieved promising performance in AVSR owing to their generalized, nonlinear mapping ability. However, these DNN models have two main disadvantages: 1) most of them address the AVSR problem while neglecting the fact that the frames are correlated; and 2) the features they learn are not reliable. This is because the joint representation learned by fusion fails to consider category-specific information, and the discriminative information is sparse, while noise, reverberation, irrelevant image objects, and background are redundant. To relieve these disadvantages, we propose the auxiliary loss multimodal GRU (alm-GRU) model, which comprises three parts: feature extraction, data augmentation, and fusion & recognition. Feature extraction and data augmentation together form a complete, effective pipeline for processing complete raw video and for training, and they are a precondition for the core part: fusion & recognition using alm-GRU. The alm-GRU is an end-to-end network equipped with a novel loss; it combines fusion and recognition and also takes modal and temporal information into account. Experiments on benchmark data sets show the superiority of our model and the necessity of the data augmentation and the generative component.
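The abstract does not spell out the alm-GRU architecture or its loss, so the following is only an illustrative sketch of the general idea of a multimodal GRU trained with an auxiliary loss, not the authors' implementation. It assumes one GRU per modality, fusion by concatenating the final hidden states, and an auxiliary term made of per-modality cross-entropy losses added to the fused cross-entropy. All names (`GRUCell`, `build_model`, `alm_style_loss`, `aux_weight`) and all dimensions are hypothetical, and biases are omitted for brevity.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    # Small random weights; a real model would learn these by gradient descent.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class GRUCell:
    """Bias-free GRU cell over the concatenated vector [x, h]."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        self.Wz = rand_matrix(hid_dim, in_dim + hid_dim, rng)  # update gate
        self.Wr = rand_matrix(hid_dim, in_dim + hid_dim, rng)  # reset gate
        self.Wh = rand_matrix(hid_dim, in_dim + hid_dim, rng)  # candidate state

    def step(self, x, h):
        xh = x + h  # list concatenation
        z = [sigmoid(v) for v in matvec(self.Wz, xh)]
        r = [sigmoid(v) for v in matvec(self.Wr, xh)]
        gated = x + [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in matvec(self.Wh, gated)]
        # Convention used here: h_t = (1 - z) * h_{t-1} + z * candidate.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, seq):
        # Consume the frames in order so each state depends on earlier frames.
        h = [0.0] * self.hid_dim
        for x in seq:
            h = self.step(x, h)
        return h


def cross_entropy(logits, label):
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def build_model(audio_dim, video_dim, hid_dim, n_classes, seed=0):
    rng = random.Random(seed)
    return {
        "audio_gru": GRUCell(audio_dim, hid_dim, rng),
        "video_gru": GRUCell(video_dim, hid_dim, rng),
        "W_fused": rand_matrix(n_classes, 2 * hid_dim, rng),  # main classifier
        "W_audio": rand_matrix(n_classes, hid_dim, rng),      # auxiliary head
        "W_video": rand_matrix(n_classes, hid_dim, rng),      # auxiliary head
    }


def alm_style_loss(audio_seq, video_seq, label, model, aux_weight=0.5):
    """Return (total, main, aux) for one labelled audio/video pair."""
    h_a = model["audio_gru"].run(audio_seq)
    h_v = model["video_gru"].run(video_seq)
    fused = h_a + h_v  # assumed fusion: concatenate the final states
    main = cross_entropy(matvec(model["W_fused"], fused), label)
    # Assumed auxiliary term: each modality must also predict the class alone.
    aux = (cross_entropy(matvec(model["W_audio"], h_a), label)
           + cross_entropy(matvec(model["W_video"], h_v), label))
    return main + aux_weight * aux, main, aux
```

In this sketch the two modalities may have different frame counts, because each GRU is run independently and only the final hidden states are fused. Setting `aux_weight` to 0 reduces the objective to the plain fused cross-entropy, which makes the contribution of the auxiliary term easy to isolate.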


Keyword: Audio-visual Systems; Recurrent Neural Networks; Generative Adversarial Networks
Indexed By: SCI; EI
WOS ID: WOS:000426304300001
EI Accession Number: 20180704784763
Citation statistics
Cited Times: 11 [WOS]
Document Type: Journal article
Affiliation: 1. Chinese Acad Sci, Xian Inst Opt & Precis Mech, Ctr Opt Imagery Anal & Learning, Xian 710119, Shaanxi, Peoples R China;
2. Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Recommended Citation
GB/T 7714: Yuan, Yuan, Tian, Chunlin, Lu, Xiaoqiang. Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition[J]. IEEE ACCESS, 2018, 6: 5573-5583.
APA: Yuan, Yuan, Tian, Chunlin, & Lu, Xiaoqiang. (2018). Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition. IEEE ACCESS, 6, 5573-5583.
MLA: Yuan, Yuan, et al. "Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition". IEEE ACCESS 6 (2018): 5573-5583.
Files in This Item:
File Name/Size | DocType | Version | Access | License
Auxiliary Loss Multi (1638KB) | Journal article | Published version | Restricted access | CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.