Towards Bug-free Persistent Memory Applications Ian Neal, Andrew

38 Slides9.43 MB

Towards Bug-free Persistent Memory Applications Ian Neal, Andrew Quinn, Baris Kasikci In collaboration with: Ben Reeves, Ben Stoler, Youngjin Kwon, Simon Peter

Good News! PM has arrived! Great, fast new storage technology called persistent main memory (PM) AKA non-volatile memory (NVM) Higher capacity, only 2 3x higher latency than DRAM Can be used to increase main memory capacity (volatile) Can be used as durable storage and can be used directly in user space! Image Source: https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory.html 2

PM Programming Background Modern CPU architectures require special persistence mechanisms to ensure updates are made durable (Intel Optane PM DIMM) X 1 CPU Cache X 1 Volatile! CPU X 1 1. store 1 into X 2. flush 3. memory fence 3

Problem: PM Programming Requires Care Modern CPU architectures require special persistence mechanisms to ensure updates made durable Writing correct and efficient PM code is hard 1. store 1 into X 2. flush X 3. memory fence 1. store 1 into X 2. flush X 3. memory fence . 4. store 1 into Y 5. flush Y 6. memory fence . X 1 Y 1 1. store 1 into X 2. store 1 into Y . 3. flush X 4. flush Y 5. memory fence . X 1 ? Y 1 4

Problem: PM Programming Requires Care (cont.) Can have major consequences (data inconsistency, loss) Resulting bugs are hard to find (non-determinism, opaque) X 1 This requires PM-specific developer tools! X 1 CPU Cache XY 1Z gdb Eviction! CPU 1. store 1 into X 2. flush X 3. memory fence 5

Our Work PM bug finding PM bug fixing AGAMOTTO [OSDI ’20] HIPPOCRATES [ASPLOS ‘21] Finding PM bugs using PM-optimized symbolic execution Fixing durability bugs in PM programs with provably correct fixes 6

Prior Work We want a tool that is automatic (low effort to use) and accurate (doesn’t miss bugs). 7

Our Contributions Bug Study AGAMOTTO Results 84 found Application independent! No annotations! (Automatic detection) Symbolic Execution! No test suite required! (Accurate detection) Symbolic Execution Engine (KLEE) Reported to developers Showed impact of performance bugs ( 47% throughput) 8

Overview Bug Study AGAMOTTO Results 84 found Reported to developers Symbolic Execution Engine (KLEE) Showed impact of performance bugs ( 47% throughput) 9

PM Bug Study Surveyed 63 previously recorded bugs Most from PMDK’s issue tracker Also analyzed bugs from PMTest, XFDetector Identified two predominant patterns Later use these to build bug detectors (i.e., bug oracles) 10

(1) Missing flush/fence Example: oid should be made durable, but never is Not making updates explicitly durable in PM Not tied to any application-specific logic Accounts for 79% of the bugs in our survey (50/63) Not guaranteed to be false-positive free We don’t discover any in our testing 11

(2) Redundant flush/fence Example: overuse of flush/fence instructions Unnecessary use of durability mechanisms Is a performance issue regardless of program logic Accounts for 11% of the bugs in our survey (7/63) 12

Bug Study Conclusions Two patterns, 90% of the bugs (57/63), are application independent! Can automated their detection Pattern-based detection allows us to avoid source code Application-specific bugs, 10% of the bugs (6/63), are still important modifications! Allow developers to define their own patterns! 13

Overview Bug Study AGAMOTTO Results 84 found Reported to developers Symbolic Execution Engine (KLEE) Showed impact of performance bugs ( 47% throughput) 14

Overview AGAMOTTO Symbolic Execution Engine (KLEE) 15

How Symbolic Execution Works Select state do read ? do read 𝛌 Execute Update state 16

How Symbolic Execution Works Select state do read 𝛌 do read 𝛌 0 Symbolic execution gives us high code coverage! Execute (no need for extensive test suites) do read 0 Update state 17

How does AGAMOTTO work? Select state do read 𝛌 0 Execute pbuf[x]: clean PM Memory Model do read 0 pbuf[x]: dirty Update state 18

(1) PM Memory Model Augment symbolic memory model with PM state Allocated from calls to mmap from pmem files store clean store dirty flush flushed fence 19

How does AGAMOTTO work? Select state do read 𝛌 0 Execute pbuf[x]: clean PM Memory Model Update state PM Bug Oracles do read 0 pbuf[x]: flushed 20

(2) PM Bug Detection Augment PM state transitions with bug signals exit missing flush/fence exit clean dirty redundant flush flushed flush fence flush/fence 21

(2) PM Bug Detection Custom oracles custom state & signals Example: ordering bug (A must be durable before B) store B ordering violation We use 2 custom oracles to reproduce all user-space No bug application-specific bugs found from prior work! store B store (still B no need to modify the source code!) A: clean B: clean A: dirty B: clean A: flushed B: clean 22

What else? AGAMOTTO Why? To handle “State Space Explosion”! Symbolic Execution Engine (KLEE) 23

Problem: State Space Explosion if if How do we get here faster? Drive symbolic execution towards states with the most PM if operations! (because PM updates) bugs bad if if jmp 24

(3) PM State Selection Select state Priority: 2 PM State Selection Priority: 2 do read 𝛌 0 Priority: 0 Priority: 0 InExecute practice, 10% of instructions have non-zero priority! pbuf[x]: clean Priority: 2 PM Memory Model do read 0 Update state PM Bug Oracles pbuf[x]: dirty Priority: 2 Priority: 1 1. Identify PM instructions (static analysis) 2. Assign unit weight 3. Back-propagate priorities 25

Design Summary AGAMOTTO 3. Select states based on PM priority to maximize chances of finding bugs 1. Create a model for PM and add to symbolic states 2. Add rules to detect applicationindependent bug signals Symbolic Execution Engine (KLEE) 26

Overview Bug Study AGAMOTTO Results 84 found Reported to developers Symbolic Execution Engine (KLEE) Showed impact of performance bugs ( 47% throughput) 27

Effectiveness 84 new bugs found so far (across 5 systems) 14 correctness bugs, 70 performance bugs PM Search Strategy: KLEE: after 1 hour, found 30% (25/84) bugs AGAMOTTO: finds all bugs in 40 minutes Performance study on RECIPE [SOSP ’19]: 23 47% overall throughput increase in key-value store workloads! 28

Our Work PM bug finding PM bug fixing AGAMOTTO [OSDI ’20] HIPPOCRATES [ASPLOS ‘21] Finding PM bugs using PM-optimized symbolic execution Fixing durability bugs in PM programs with provably correct fixes 29

Is it enough to just find the bugs? Even after automatically finding PM bugs, fixing them is difficult 23 days from report to fix, up to 66 days! (Study of PMDK bugs) Bugs also found using a PM debugging tool (pmemcheck) Challenging to balance trade-off between correctness and efficiency Need to help PM developers fix these PM-specific bugs! Prevalence of work in PM debugging shows the prevalence of durability bugs in PM General program repair techniques can introduce side-effects; don’t want accidentally introduce more crash consistency bugs! 30

Correctly Fixing Bugs Key insight: durability fixes primarily introduce memory orderings, which cannot introduce the possibility of new incorrect program behavior Durability problems can be fixed, by definition, by adding cache flushes and/or memory ordering primitives Adding these primitives has no impact other than restricting possible memory orderings Restricting memory orderings doesn’t introduce new incorrect behavior No new incorrect behavior, ergo no new bugs! Based on this insight, we can automate the reasoning behind inserting efficient fixes while ensuring they are correct 31

Fixing a Durability Bug Can safely add all the flushes and fences we want! Seems easy enough Results in 2-12x slowdowns! 32

Performance Issues Naïve fixes harm functions which operate on both PM and DRAM! Solution: the Persistent Subprogram Transformation! 33

Persistent Subprogram Transformation For cases where fixes may operate on PM, duplicate subprogram to minimize performance impact This transformation is also provably correct! Duplication preserves program semantics After duplication, only insert flushes and fences! 34

HIPPOCRATES We built HIPPOCRATES, an automated tool which takes input from bug finders like pmemcheck and AGAMOTTO to automatically fix PM bugs (Step 1) Parse bug finder trace (Step 2) Identify bug locations (Step 3) Compute fixes (Phase 1) Compute intraprocedural solution (Phase 2) Perform fix reduction (Phase 3) Attempt to hoist fixes (heuristic) (Step 4) Apply fixes and compile 1. Computes simplest possible solution (flush/fence after offending store) 2. Ensures no duplicate fixes 3. Heuristically determines if fix should be elsewhere in call stack of error Figure out where a fix would be least likely to operate on volatile memory Heuristic also proven safe! 35

Results Fixes bugs correctly (“do no harm”) with efficient fixes Compared against fixed PMDK bugs, HIPPOCRATES fixes are effectively identical to manual fixes Up to 7% performance increase over manually-crafted durability code compared to PM-port of Redis developed by Intel 36

Conclusion PM bugs are prevalent and difficult to manually correct We find 84 new PM bugs and show the potential impact of performance bugs, increasing throughput by 23 47% (AGAMOTTO) We automatically fix PM durability bugs and show that our fixes can outperform manual fixes by up to 7% (HIPPOCRATES) 37

Thank you! Contact: Ian Neal, [email protected]

Back to top button