Summary for “Generic Database Cost Models for Hierarchical Memory Systems”
Published:
This is my summary for the paper Generic Database Cost Models for Hierarchical Memory Systems.
Database cost models consist of two components: the logical component and the physical component. While the physical operators’ specific details are irrelevant to the logical component, the physical component has to know the physical operators’ algorithms and implementations to make a suitable choice for a given logical operator. Although disk-access and CPU time costs have been well-studied in the literature, there is a research gap for modeling memory access costs, especially in hierarchical memory systems. Due to caches in a hierarchical memory system, research on this topic must model memory access patterns. This is the main objective of this paper. Additionally, a model for algorithms’ memory access costs can also extend to I/O costs.
Caches are usually organized into multiple levels. In cache systems, there are three important parameters: cache capacity (C), cache block size (B), and cache associativity (A). Furthermore, access latency consists of sequential access latency and random access latency. The authors generalize every cache/memory/disk level to the same unified cost model and treat the system as a multilevel cache system. Each level is different from one another in the exact hardware characteristics (i.e., the number of cache misses, access latency), and a cache/memory/disk level can be treated as the cache for an immediately lower level cache/memory/disk. The authors hypothesize that with this assumption, we can calculate the total cost of data access across the whole system as a sum of costs across all cache levels:
\[T_{mem} = \Sigma_{i = 1}^N(M_i^s l_{i+1}^s +M_i^r l_{i+1}^r)\]With $x_{{s, r}}$ (i.e., sequential, random), $M_i^x$ is the number of cache misses at level $i$, $l_{i+1}^x$ is the access latency of level $i$. While the access latency depends on the hardware and measurable for an arbitrary computer using a Calibrator (a special program provided by the authors), the number of cache misses also depends on the algorithms. We must derive this number from each algorithm’s access pattern. The authors approach this challenge by hypothesizing that each algorithm/operator’s complex data access pattern can be decomposed into simpler data access patterns. The first building block to this assumption is basic access patterns: single sequential traversal, repetitive sequential traversal, single random traversal, repetitive random traversal, random access, interleaved multi-cursor access. To build up complex access patterns from these basic patterns, the authors introduce compound access patterns, comprising sequential execution and concurrent execution. It is demonstrated that we can describe complicated access patterns by many algorithms using the two building blocks: basic access patterns and compound access patterns.
The authors also provide the formula for estimating the number of cache misses for each basic pattern and how to derive these formulas. Although the formulas vary across the access patterns, the formulas’ parameters are the hardware characteristic at a particular cache level. The final problem is then how to model cache interference between subpatterns in a compound access pattern. Firstly, the authors introduce states to model the cache states between subsequent patterns. This additional concept is enough to model sequential execution. However, to model concurrent execution, the authors propose dividing the hypothetically usable cache to each competing pattern according to its footprint size. With this final piece, the formulation for computing the number of cache misses for concurrent execution is introduced.
The authors report experiments performed on an SGI Origin 2000 with five different algorithms: Quick-Sort, Merge-Join, Hash-Join, Partitioning, and Partitioned Hash-Join on four metrics L1 misses, L2 misses, TLB misses, and total execution time. The experiments show that the data access cost model proposed can predict the trends of the metrics with low errors. Most importantly, the method can predict with high accuracy the important points in which there are noticeable changes in these metrics.
By assuming that a database system with cache, memory, and disk for data access can be treated as a multilevel cache system, the paper proposes a unified model for data access cost in database management systems. The authors suggest a method for decomposing complex data access patterns into manageable components with basic access patterns and compound access patterns. The data access cost of an algorithm can then be estimated by estimating the cost of its data access subpatterns and calculating hardware constants. This method allows calculating the data access cost of an arbitrary algorithm by describing its data access pattern using the concepts provided by this paper.