Title
Cross-stitched Multi-modal Encoders
Authors
Abstract
In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune them on the target problem. The resulting architecture can be used for continuous token-level classification or utterance-level prediction acting on simultaneous text and speech, and the fused encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource efficient, and can be trained on a single consumer GPU card.
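The abstract's fusion idea can be sketched in a few lines: each modality's pretrained-encoder output attends to the other via multi-headed cross-modal attention, and utterance-level prediction pools the fused sequences. This is only an illustrative sketch under assumed dimensions and wiring (the class name `CrossModalFusion`, the bidirectional query/key arrangement, and the mean-pool-then-concatenate head are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch: fuse text and speech encoder outputs with
    multi-headed cross-modal attention. Dimensions are placeholders."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # one cross-attention per direction: text attends to speech, and vice versa
        self.text_to_speech = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_feats, speech_feats):
        # query comes from one modality; keys/values from the other
        t_fused, _ = self.text_to_speech(text_feats, speech_feats, speech_feats)
        s_fused, _ = self.speech_to_text(speech_feats, text_feats, text_feats)
        return t_fused, s_fused

# toy tensors standing in for pretrained text/speech encoder outputs
text = torch.randn(2, 10, 256)    # (batch, text tokens, dim)
speech = torch.randn(2, 50, 256)  # (batch, speech frames, dim)
fusion = CrossModalFusion()
t_out, s_out = fusion(text, speech)

# token-level classification would act on t_out / s_out directly;
# for utterance-level prediction, pool each fused sequence and concatenate
utt = torch.cat([t_out.mean(dim=1), s_out.mean(dim=1)], dim=-1)
```

The concatenation baseline mentioned in the abstract would instead pool each encoder's output *before* any interaction and concatenate those pooled vectors, so the modalities never attend to each other.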