Paper Title

On using Product-Specific Schema.org from Web Data Commons: An Empirical Set of Best Practices

Authors

Ravi Kiran Selvam, Mayank Kejriwal

Abstract

Schema.org has experienced high growth in recent years. Structured descriptions of products embedded in HTML pages are now not uncommon, especially on e-commerce websites. The Web Data Commons (WDC) project has extracted schema.org data at scale from webpages in the Common Crawl and made it available as an RDF "knowledge graph". The portion of this data that specifically describes products offers a golden opportunity for researchers and small companies to leverage it for analytics and downstream applications. Yet, because of the broad and expansive scope of this data, it is not evident whether the data is usable in its raw form. In this paper, we conduct a detailed empirical study of the product-specific schema.org data made available by WDC. Rather than simple analysis, the goal of our study is to devise an empirically grounded set of best practices for using and consuming WDC product-specific schema.org data. Our study reveals six best practices, each of which is justified by experimental data and analysis.
