|
Abstract : |
When trying to overcome the significant performance drops of ASR systems in the presence of noise, one road to follow is the integration of the information present in the lip movements of the speaker. Comparisons showed that integrating audio and video data at the decision level yields the best recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this article we develop a weighting process that adapts to various background noise situations. First, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different stream reliability criteria. Based on this comparison, a mapping between the criteria and the free parameter of the fusion process is derived, and its applicability to controlling the fusion is demonstrated. Finally, the possibilities and limitations of a fixed combination compared to an adaptive weighting are discussed. |