Paper title
Information Extraction based on Named Entity for Tourism Corpus
Paper authors
Paper abstract
Tourism information is widely scattered nowadays. Searching for it usually means browsing through search engine results and then selecting and viewing the details of each accommodation, which is time consuming. In this paper, we present a methodology for extracting specific information from the full text returned by a search engine, so that users can go directly to the relevant information they want. The approach can be applied to the same task in other domains. The main steps are 1) building the training data and 2) building the recognition model. First, the tourism data is gathered and the vocabularies are built. The raw corpus is used both to train the vocabulary embeddings and to create the annotated data. The process of creating the named entity annotations is presented. A recognition model for a given entity type can then be built. In the experiments, given a hotel description, the model extracts the desired entities, i.e., name, location, and facility. The extracted data can further be stored as structured information, e.g., in an ontology format, for future querying and inference. The machine-learning-based model for automatic named entity identification yields an error ranging from 8% to 25%.
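The annotation step described in the abstract (using vocabularies to turn raw hotel descriptions into labeled data) can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary entries, the label names `NAME`, `LOC`, and `FAC`, the BIO tagging scheme, and the helper functions `tokenize`, `annotate`, and `to_record` are all assumptions made for illustration. The paper's recognition model is based on machine learning, whereas this sketch only uses dictionary lookup to show how annotated data and a structured record might be produced.

```python
from typing import Dict, List

# Hypothetical vocabularies; the paper's actual word lists are not shown here.
VOCAB: Dict[str, List[str]] = {
    "NAME": ["grand palace hotel"],
    "LOC": ["bangkok", "chiang mai"],
    "FAC": ["swimming pool", "free wifi", "spa"],
}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    return [t for t in tokens if t]


def annotate(tokens: List[str]) -> List[str]:
    """Assign BIO tags by longest-match lookup against VOCAB."""
    # Try longer phrases first so multi-word entries win over shorter ones.
    phrases = sorted(
        ((p.split(), label) for label, ps in VOCAB.items() for p in ps),
        key=lambda item: -len(item[0]),
    )
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for words, label in phrases:
            n = len(words)
            if tokens[i:i + n] == words:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                break
        else:
            # No vocabulary phrase starts at this token.
            i += 1
    return tags


def to_record(tokens: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Group tagged spans into a structured record keyed by entity type."""
    spans: Dict[str, List[List[str]]] = {}
    current: List[str] = []
    inside = False
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = [tok]
            spans.setdefault(tag[2:], []).append(current)
            inside = True
        elif tag.startswith("I-") and inside:
            current.append(tok)
        else:
            inside = False
    return {label: [" ".join(ws) for ws in group] for label, group in spans.items()}


if __name__ == "__main__":
    text = "Grand Palace Hotel in Bangkok offers a swimming pool and free wifi."
    toks = tokenize(text)
    tags = annotate(toks)
    for tok, tag in zip(toks, tags):
        print(f"{tok}\t{tag}")
    print(to_record(toks, tags))
```

Token and tag pairs of this kind are one common input format for training a sequence labeling model, and the dictionary returned by `to_record` stands in for the structured output that the abstract suggests storing, for example, in an ontology format.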