Title
Keywords Extraction and Sentiment Analysis using Automatic Speech Recognition
Authors
Abstract
Automatic Speech Recognition (ASR) is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers. It incorporates knowledge and research from the fields of linguistics, computer science, and electrical engineering. Sentiment analysis is the contextual mining of text: it identifies and extracts subjective information from source material, helping a business understand the social sentiment around its brand, product, or service while monitoring online conversations. According to the speech structure, three models are used in speech recognition to perform the match: the Acoustic Model, the Phonetic Dictionary, and the Language Model. Any speech recognition program can be evaluated on two factors: Accuracy (the percentage error in converting spoken words to digital data) and Speed (the extent to which the program can keep up with a human speaker). For the purpose of converting speech to text (STT), we study the following open-source toolkits: CMU Sphinx and Kaldi. The toolkits use Mel-Frequency Cepstral Coefficients (MFCC) and i-vectors for feature extraction. CMU Sphinx is used with pre-trained Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM), while Kaldi is used with pre-trained neural networks (NNET) as acoustic models. The n-gram language models contain the phonemes or pdf-ids used to generate the most probable hypothesis (transcription) in the form of a lattice. The speech dataset is stored as .raw or .wav files and is transcribed in .txt files. The system then tries to identify opinions within the text and extract the following attributes: Polarity (whether the speaker expresses a positive or negative opinion) and Keywords (the thing being talked about).
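As a sketch of the feature-extraction step described above, the standard MFCC pipeline (pre-emphasis, framing and windowing, power spectrum, mel filterbank, cepstral decorrelation) can be outlined in NumPy. The frame sizes and filter counts below are common defaults for 16 kHz speech, not values taken from the toolkits themselves:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=400, hop=160):
    """Compute MFCC features: one row per 25 ms frame with a 10 ms hop."""
    # Pre-emphasis boosts high frequencies attenuated in speech production.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop:i * hop + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    # Power spectrum of each frame.
    pow_spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_energy = np.log(pow_spec @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstra.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_energy @ dct.T
```

For a one-second 16 kHz signal this yields a (98, 13) feature matrix; production front ends (as in Sphinx and Kaldi) additionally apply liftering, delta features, and normalization.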
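The role of the n-gram language model in ranking competing hypotheses can be illustrated with a toy add-one-smoothed bigram model over words; the corpus and the two hypotheses below are invented for illustration, and a real decoder would score the lattice paths jointly with acoustic likelihoods:

```python
import math
from collections import defaultdict

def train_bigrams(corpus):
    """Count bigram occurrences over a list of training sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
    return counts

def log_prob(counts, sentence):
    """Add-one-smoothed log-probability of a candidate transcription."""
    vocab = len(counts) + 1
    toks = ["<s>"] + sentence.lower().split() + ["</s>"]
    lp = 0.0
    for a, b in zip(toks, toks[1:]):
        total = sum(counts[a].values())
        lp += math.log((counts[a][b] + 1) / (total + vocab))
    return lp

# Toy corpus; the LM prefers word sequences it has seen before.
model = train_bigrams(["recognize speech", "wreck a nice beach"])
```

Given two acoustically plausible hypotheses, the model assigns the higher log-probability to the word sequence that better matches the training text, which is how the most probable transcription is selected from the lattice.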
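The final opinion-mining step (polarity plus keywords) can be sketched with a minimal lexicon-based approach; the word lists here are hypothetical stand-ins for the much larger sentiment lexicons and stopword lists a real system would use:

```python
# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}
STOPWORDS = {"the", "a", "an", "is", "and", "it", "this", "very", "i"}

def analyze(transcript):
    """Return (polarity, keywords) for an ASR transcript."""
    tokens = [t.strip(".,!?").lower() for t in transcript.split()]
    # Polarity: net count of positive versus negative lexicon hits.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    polarity = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    # Keywords: content words left after removing sentiment and stop words.
    keywords = [t for t in tokens if t not in POSITIVE | NEGATIVE | STOPWORDS]
    return polarity, keywords
```

For a transcript such as "the battery life is great", this sketch reports a positive polarity and surfaces "battery" and "life" as the things being talked about.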