Paper title
Generating Biomedical Question Answering Corpora from Q&A forums
Paper authors
Paper abstract
Question Answering (QA) is a natural language processing task that aims to obtain relevant answers to user questions. Although some progress has been made in this area, biomedical questions remain a challenge for most QA approaches because of the complexity of the domain and the limited availability of training sets. We present a method to automatically extract question-article pairs from Q&A web forums; these pairs can be used for document retrieval, a crucial step of most QA systems. The proposed framework extracts from selected forums the questions and the respective answers that contain citations. In this way, QA systems based on document retrieval can be developed and evaluated using the question-article pairs annotated by the users of these forums. We generated the BiQA corpus by applying our framework to three forums, obtaining 7,453 questions and 14,239 question-article pairs. We evaluated how the number of articles associated with each question and the number of votes on each answer affect the performance of baseline document retrieval approaches. We also demonstrated that the articles given as answers are significantly similar to the questions, and we trained a state-of-the-art deep learning model that obtained performance similar to that achieved with a dataset manually annotated by experts. The proposed framework can be used to update the BiQA corpus from the same forums as new posts are made, and to extend it with other forums whose answers are supported by documents. The BiQA corpus and the framework used to generate it are available at https://github.com/lasigeBioTM/BiQA.
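The extraction step described in the abstract (keep only answers that cite an article, then pair the question with each cited article) can be illustrated with a minimal sketch. This is not the BiQA framework's code: the `Answer` class, the `question_article_pairs` function, the `min_votes` parameter, and the choice of PubMed links as the citation form are all assumptions made for illustration, since the abstract does not say how citations are detected.

```python
import re
from dataclasses import dataclass

# Assumption: citations appear as PubMed links in one of two common URL forms.
# The abstract does not specify which citation formats the framework recognizes.
PUBMED_LINK = re.compile(
    r"https?://(?:www\.)?"
    r"(?:pubmed\.ncbi\.nlm\.nih\.gov|ncbi\.nlm\.nih\.gov/pubmed)/(\d+)"
)


@dataclass(frozen=True)
class Answer:
    """Hypothetical container for one forum answer."""
    text: str
    votes: int


def question_article_pairs(question, answers, min_votes=0):
    """Return (question, article_id) pairs for answers that cite an article.

    Answers with fewer than `min_votes` votes are skipped, and each article
    is paired with the question at most once.
    """
    pairs = []
    seen = set()
    for answer in answers:
        if answer.votes < min_votes:
            continue
        for article_id in PUBMED_LINK.findall(answer.text):
            if article_id not in seen:
                seen.add(article_id)
                pairs.append((question, article_id))
    return pairs
```

A vote threshold is included only because the abstract reports studying how the number of votes on each answer affects retrieval performance; whether and how the actual corpus filters by votes is not stated there.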