Paper Title

WikiHist.html: English Wikipedia's Full Revision History in HTML Format

Authors

Blagoj Mitrevski, Tiziano Piccardi, Robert West

Abstract

Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia's revision history is publicly available exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia's REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions. We solve these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and release WikiHist.html, English Wikipedia's full revision history in HTML format. We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia's hyperlinks, showing that over half of the wiki links present in HTML are missing from raw wikitext and that the missing links are important for user navigation.
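
The gap between wikitext links and HTML links described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the sample strings, the template name `Cities of Vaud`, and the helper names are hypothetical. It also assumes that internal links in rendered HTML use hrefs of the form `/wiki/Title`, and it does not filter out namespaces such as `File:` or `Category:`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures only Target.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def normalize(title):
    """Rough title normalization: underscores to spaces, first letter upper-cased."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def links_in_wikitext(wikitext):
    """Links written explicitly in the wikitext source (templates are NOT expanded)."""
    return {normalize(m.group(1)) for m in WIKILINK_RE.finditer(wikitext)}


class WikiLinkCollector(HTMLParser):
    """Collects targets of <a href="/wiki/..."> elements (assumed internal-link form)."""

    def __init__(self):
        super().__init__()
        self.titles = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("/wiki/"):
            target = href[len("/wiki/"):].split("#")[0]
            self.titles.add(normalize(unquote(target)))


def links_in_html(html):
    collector = WikiLinkCollector()
    collector.feed(html)
    return collector.titles


# Hypothetical revision: one explicit link plus a navigation template.
SAMPLE_WIKITEXT = "'''Lausanne''' is a city in [[Switzerland]].\n{{Cities of Vaud}}"

# Hypothetical rendering of the same revision after the template is expanded.
SAMPLE_HTML = (
    '<p><b>Lausanne</b> is a city in <a href="/wiki/Switzerland">Switzerland</a>.</p>'
    '<div class="navbox"><a href="/wiki/Montreux">Montreux</a> '
    '<a href="/wiki/Vevey">Vevey</a></div>'
)

# Links that a reader sees but that a wikitext-only analysis would miss.
missing = links_in_html(SAMPLE_HTML) - links_in_wikitext(SAMPLE_WIKITEXT)
print(sorted(missing))
```

In this toy case the two links contributed by the template appear only in the HTML, which is the kind of discrepancy the hyperlink analysis quantifies at scale.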
