• No se han encontrado resultados

HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM

N/A
N/A
Protected

Academic year: 2023

Share "HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM"

Copied!
21
0
0

Texto completo

(1)

HeMem:

Scalable Tiered Memory Management for Big Data Applications and Real NVM

Amanda Raybuck, Tim Stamler, Wei Zhang, Mattan Erez, and Simon Peter

(2)

DRAM + NVM tiered memory

● 8x capacity

● 2x latency

● Asymmetric read/write bandwidth

● High overhead for small accesses

DRAM NVM

Memory Bus

DRAM

NVM

Storage

(3)

Hardware tiered memory

Application

Hardware Tiered Memory

DRAM NVM

No OS support needed Low overhead

No visibility into apps Limited to simple

management techniques

Example: Intel Optane Memory Mode

(4)

Existing software tiered memory

Application

OS/Library Tiered Memory

DRAM NVM

Evaluated only on emulated NVM:

Does not scale to NVM capacity No support for asymmetric

read/write bandwidth Limited flexibility

Examples: HeteroOS [ISCA ‘17], Nimble Page Management

[ASPLOS ‘19]

Insights into applications Supports complex policies

(5)

Why not access/dirty bits?

• Not scalable

• Takes seconds to scan large memories with base pages

• Overhead of TLB shootdowns to clear bits

(6)

HeMem:

Scalable software tiered memory management system designed for real NVM

Design principles:

• Asynchronous memory access sampling with CPU performance counters

• Asynchronous memory migration with DMA offload

• Focus on asymmetric NVM bandwidth

• Data scalability awareness

• Flexibility

(7)

PEBS memory access sampling

PEBS: processor event-based sampling

Supported in modern Intel processors

Processor records samples of load/store virtual memory address

Records are stored in a memory buffer

We measure DRAM loads, NVM loads, and all stores

Instead of using page table access/dirty bits

Sampling 0.02% of all memory accesses provides sufficient fidelity

(8)

Asynchronous hot/cold classification

PEBS buffer:

.

Sample1 Sample2

...

HeMem PEBS thread

DRAM

NVM

Hot

Hot Cold

Cold Batch

Sample 0.02% of memory accesses

: Hot memory page

: Cold memory page

CPU Counters

(9)

Asynchronous memory migration

WP WP

DRAM

NVM

Hot

Hot Cold

Cold

WP

DMA

DMA Req:

Src: VA1, 2MB Dst: VA2, 2MB

HeMem

policy thread

DMA Req:

Src: VA3, 2MB Dst: VA4, 2MB

DMA Req:

Src: VA5, 2MB Dst: VA6, 2MB

WP WP

WP

(10)

Optimize for real NVM

Keep small objects in DRAM

Avoid the small random reads from NVM that suffer overheads

Small, ephemeral objects remain in DRAM

Limit writes to NVM to avoid write bandwidth bottleneck

Migrate and keep frequently written pages to DRAM

Read Block Size (Bytes) Write Block Size (Bytes)

(11)

Data Scalability Awareness

Tracking hot/cold memory is expensive with lots of memory

Only manage objects that are long-lived and likely to grow

Allow Linux to handle everything else (program text, kernel objects...)

Smaller objects are more likely to be short-lived and can be in DRAM

(12)

Flexible user space mechanisms

HeMem is implemented as a user-level library

Can be modified to better suit applications

Can more closely integrate with managed runtimes to further optimize

Garbage collection

Userfaultfd for handling of page and write-protection faults

Intercepts memory allocation calls to learn size of objects

Works with unmodified applications

(13)

Implementation

HeMem library implemented with ~4100 lines of C code

Relies on a custom Linux kernel with support for /dev/dax

Added ~1300 lines to linux kernel

Both DRAM and NVM exposed as /dev/dax files

DRAM /dev/dax reserved at startup with memmap command line argument

Uses PEBS via the Linux perf interface

MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM for DAM loads

MEM_LOAD_RETIRED.LOCAL_PMM for NVM loads

MEM_INST_RETIRED.ALL_STORES for stores

(14)

Evaluation

(15)

Evaluation setup

Cascade Lake-SP w/ 24 cores, 192 GB DRAM, 768 GB NVM

All DIMMs populated, leveraging all 6 memory channels

Comparisons:

Intel Memory Mode

Linux nimble tiered memory management [ASPLOS ‘19]

(16)

Hot set identification

GUPS microbenchmark with hot set (512 GB working set) 8 byte accesses, non-contiguous hot set

2X

(17)

Dynamic hot set identification

GUPS with a 512 GB working set and a 16 GB hot set

At time t=150, shift hot set over by 4 GB

(18)

FlexKVS key-value store throughput

4KB value size, 90% GET, 10% SET, 20% hot keys accessed 90% of time

15%

(19)

FlexKVS key-value store priority latency

1 prioritized server with 16GB working set

1 non-prioritized server with 500GB working set

(20)

Betweenness Centrality algorithm on graph with 229 vertices

GAPbs execution time

1.3X 1.6X HeMem with PEBS makes 10x fewer writes to NVM

(21)

Summary

• Tiered memory systems need to support real NVM

Need to scale to large capacities

Need to support unique NVM performance features

HeMem: redesign of tiered memory management with real NVM

Sampling-based memory access monitoring without page tables

Asynchronous memory migration in batches with DMA offload

Accurately distinguishes hot from cold memory

Up to 1.6x GAPbs speedup, 2x GUPS, 10x fewer NVM writes

Source code: https://bitbucket.org/ajaustin/hemem/src/sosp-submission/

Referencias

Documento similar

Such application was ported to Java following the steps of the Java StarSs programming model: first, in a task selection interface, a total of five service tasks and seven method

Knowledge Applications Development for SMEs Business Management System Improvement ENTERprise Information Systems (pp.. Real Options in Management and Organizational Strategy: A

How to Construct a Correct and Scalable iBGP Configuration.. IEEE

In this guide for teachers and education staff we unpack the concept of loneliness; what it is, the different types of loneliness, and explore some ways to support ourselves

The objective was to analyse the processing and data management requirements of simulation applications which support research in five scientific areas: astrophysics,

–  Data copies to/from device memory –  Manual work scheduling.. –  Code and data management from

On a system with Intel Optane DC NVM, HeMem outperforms hardware, OS, and PL-based tiered memory management, providing up to 50% runtime reduction for the GAP graph

The package makes it easy to detect different types of outliers (magnitude, shape, and amplitude) in functional data, and some of the implemented methods can be applied to