Paper Title

Extraction of Relevant Images for Boilerplate Removal in Web Browsers

Paper Author

Bose, Joy

Abstract

Boilerplate refers to unwanted and repeated parts of a webpage (such as ads or a table of contents) that distract the user from reading the core content of the webpage, such as a news article. Accurate detection and removal of boilerplate content can give users a clutter-free view of the webpage or news article. This is useful in features such as reader mode in web browsers. Current implementations of reader mode in web browsers such as Firefox, Chrome and Edge perform reasonably well for textual content. However, they are mostly heuristic-based and are not flexible when the webpage content is dynamic. They also often perform poorly at removing boilerplate content in the form of images and multimedia. Detecting boilerplate images requires knowledge of the actual layout of the images in the webpage, which is only available once the webpage is rendered. In this paper we discuss some of the issues in relevant image extraction. We also present the design of a testing framework to measure accuracy, and a classifier that extracts relevant images by leveraging a headless browser solution that provides rendering information for images.
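
To make the role of rendering information concrete, the following is a minimal sketch of how rendered geometry could feed an image relevance decision. It is not the paper's classifier: the `RenderedImage` record, the `is_relevant` rule, and every threshold are assumptions chosen for illustration, and the geometry is supplied by hand rather than collected from a real headless browser.

```python
from dataclasses import dataclass


@dataclass
class RenderedImage:
    """Geometry of one image as a headless browser might report it after layout.

    Hypothetical record type; the field names are assumptions for this sketch.
    """
    src: str
    x: float       # left edge, in CSS pixels
    y: float       # top edge, in CSS pixels
    width: float   # rendered width, in CSS pixels
    height: float  # rendered height, in CSS pixels


def is_relevant(img, page_width, min_area=40_000, max_aspect=4.0):
    """Illustrative heuristic: keep large, roughly centered, non-banner images.

    All thresholds are assumed values, not figures from the paper.
    """
    if img.width <= 0 or img.height <= 0:
        return False  # never rendered, e.g. a hidden element
    area = img.width * img.height
    aspect = max(img.width / img.height, img.height / img.width)
    center_x = img.x + img.width / 2
    # Treat the middle 60% of the page width as the main content column.
    in_main_column = 0.2 * page_width <= center_x <= 0.8 * page_width
    return area >= min_area and aspect <= max_aspect and in_main_column


def extract_relevant(images, page_width):
    """Return the sources of the images that the heuristic keeps."""
    return [img.src for img in images if is_relevant(img, page_width)]


if __name__ == "__main__":
    page = [
        RenderedImage("hero.jpg", 300, 200, 600, 400),      # article photo
        RenderedImage("banner.png", 236, 0, 728, 90),       # leaderboard ad
        RenderedImage("sidebar-ad.png", 900, 300, 300, 250),
        RenderedImage("pixel.gif", 0, 0, 1, 1),             # tracking pixel
    ]
    print(extract_relevant(page, page_width=1200))
```

None of these signals (rendered size, aspect ratio, horizontal position) can be read reliably from the HTML source alone, which is why the layout has to be computed first. A learned classifier could consume the same features in place of the fixed thresholds used here.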
