Paper Title
TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory
Authors
Abstract
The increasing demand for memory in hyperscale applications has made memory a large portion of overall datacenter spend. The emergence of coherent interfaces like CXL enables main memory expansion and offers an efficient solution to this problem. In such systems, main memory can consist of different memory technologies with varied characteristics. In this paper, we characterize the memory usage patterns of a wide range of datacenter applications across Meta's server fleet. From this characterization, we demonstrate opportunities to offload colder pages to slower memory tiers for these applications. Without efficient memory management, however, such systems can significantly degrade performance. We propose a novel OS-level, application-transparent page placement mechanism (TPP) for CXL-enabled memory. TPP employs a lightweight mechanism to identify hot and cold pages and place them in the appropriate memory tiers. It enables proactive page demotion from local memory to CXL-Memory. This technique ensures memory headroom for new page allocations, which are often related to request processing and tend to be short-lived and hot. At the same time, TPP can promptly promote performance-critical hot pages trapped in the slow CXL-Memory to the fast local memory, while minimizing both sampling overhead and unnecessary migrations. TPP works transparently without any application-specific knowledge and can be deployed globally as a kernel release. We evaluate TPP in the production server fleet with early samples of new x86 CPUs with CXL 1.1 support. TPP makes a tiered memory system as performant as an ideal baseline (<1% gap) that has all of its memory in the local tier. It is 18% better than today's Linux, and 5-17% better than existing solutions including NUMA Balancing and AutoTiering. Most of the TPP patches have been merged in the Linux v5.18 release.
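The demotion and promotion behavior described in the abstract can be illustrated with a toy user-space model. This is only a sketch under stated assumptions, not the kernel implementation: the class name `TieredMemory`, the parameters `local_capacity`, `headroom`, and `promote_after`, the use of LRU order as the coldness signal, and the "promote after N hits" rule are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class TieredMemory:
    """Toy two-tier model: a fast local tier and a slow CXL tier.

    Hypothetical illustration only; names and policies are assumptions.
    """

    def __init__(self, local_capacity, headroom, promote_after=2):
        self.local_capacity = local_capacity  # pages that fit in the local tier
        self.headroom = headroom              # free local pages to keep available
        self.promote_after = promote_after    # CXL hits required before promotion
        self.local = OrderedDict()            # page -> None, coldest first
        self.cxl = {}                         # page -> hits since demotion

    def _demote_cold_pages(self):
        # Proactive demotion: move the coldest local pages to CXL until
        # the headroom for new allocations is restored.
        while self.local and self.local_capacity - len(self.local) < self.headroom:
            page, _ = self.local.popitem(last=False)
            self.cxl[page] = 0

    def allocate(self, page):
        # New pages are assumed hot, so they always start in the local tier.
        self.local[page] = None
        self._demote_cold_pages()

    def access(self, page):
        if page in self.local:
            self.local.move_to_end(page)      # mark as most recently used
            return "local"
        self.cxl[page] += 1
        if self.cxl[page] >= self.promote_after:
            # Repeated hits suggest the page is hot: promote it.
            del self.cxl[page]
            self.local[page] = None
            self._demote_cold_pages()
            return "promoted"
        return "cxl"
```

With `local_capacity=4` and `headroom=1`, allocating a fourth page pushes the oldest page to the CXL tier, and two later accesses to that page bring it back. Requiring more than one hit before promotion is one simple way to avoid migrating pages that are touched only once; the actual criteria used by TPP are described in the paper itself.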