ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information

Paper title

ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information

Authors

Arnold Overwijk, Chenyan Xiong, Xiao Liu, Cameron VandenBerg, Jamie Callan

Abstract

ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages affiliated with rich information. Its design was influenced by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example, in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the ClueWeb22 corpus is larger, more varied, of higher quality, and aligned with the document distributions in commercial web search. Besides raw HTML, ClueWeb22 includes rich information about the web pages provided by industry-standard document understanding systems, including the visual representation of pages rendered by a web browser, parsed HTML structure information from a neural network parser, and pre-processed cleaned document text to lower the barrier to entry. Many of these signals have been widely used in industry but are available to the research community for the first time at this scale.
