2024 Literature Review on Write-optimized NoSQL Systems
December 27, 2024 18:47
Do you know the LSM-Tree? It is pretty much the core building block of many of today's KV-store systems. Whatever side one takes in the SQL-versus-NoSQL debate, NoSQL databases keep gaining popularity due to their ability to support high write throughput. Still, extensive research is ongoing to optimize write operations further while maintaining a balance between read, write, and space amplification. In this literature review, let's first discuss the B-Tree before digging deeper into the LSM-Tree and other emerging designs, which have become increasingly prominent in data systems due to their potential for handling big data.
Introduction
Application trends. For the past decades, one of the main challenges in data management has been designing data systems that can sustain fast data ingestion rates while processing queries at low latency [1]. This has prompted a large number of those systems to adopt a highly ingestion-optimized design, such as Log-Structured Merge (LSM)-Trees, at the storage level. This log-structured paradigm has been preferred over B-Trees in many state-of-the-art NoSQL, relational, spatial, and time-series data systems.
B-Trees. Traditionally, B-Trees are used by many RDBMS/SQL databases (e.g., PostgreSQL [2]) and file systems (e.g., ext4 [3]). The B is often said to stand for Bayer, the author of the original data structure, or Boeing, where he worked at the time, although the name was never formally defined. In contrast to a binary tree, which holds only two children per node, B-Trees allow for storing more than two pointers per node. The number of pointers in a node may be adjusted to whatever fits in a page (e.g., 4 KB), although some implementations today use larger node sizes spanning multiple pages [4]. Since B-Trees are sorted and self-balancing, they are popular in read-heavy systems, such as row-store RDBMSs.
Log-Structured Merge (LSM) Tree. In contrast, LSM-Trees are write-optimized and serve as the core data structure for data-intensive KV-stores, which map unique key identifiers to their corresponding data values. KV-stores have been increasingly optimized to handle the industrial demands of ingesting application writes while still enabling fast reads within tight memory budgets, especially with the advent of flash-based solid-state drives (SSDs), where write operations are more costly than reads. The LSM-Tree was introduced by O'Neil et al. [48] as a low-cost disk-based data structure, drawing its design philosophy from the log-structured file system developed in 1992 [6]. An LSM-based KV-store incorporates several key components, such as append-only log files, in-memory tables, compaction, flush mechanisms, and Bloom filters, engineered to ensure high write performance, fast data retrieval, and efficient storage. However, its commercial adoption did not occur until nearly a decade later, with the introduction of Google's BigTable.
Ingesting writes. To support high write throughput, LSM-Trees employ out-of-place ingestion, which eliminates random-access insertions, updates, and deletions. This approach is fundamental to the design of LSM-Trees: writes are batched in memory, and only large chunks of sorted data are written to disk. The files on disk are never edited in place; instead, updates and deletes are applied lazily to immutable files. This strategy achieves high write throughput by minimizing costly random access operations.
Sorted string tables. LSM-Trees maintain the data in immutable sorted files, which allows for high disk space utilization [8] and ease of merging [4]. Each level on disk has a capacity that typically grows exponentially. A key property is that the contents of a Level L are more recent than the contents of the deeper, larger Level L + 1.
Structurally, an SSTable is split into two parts: data blocks and index blocks. The data blocks hold the key-value records, ordered by key, while the index blocks map keys to data-block pointers.
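To make the layout concrete, here is a minimal sketch of the data/index split. The block size, format, and helper names are illustrative, not any engine's actual on-disk encoding.

```python
# Minimal, hypothetical SSTable layout: sorted data blocks plus a sparse
# index that maps each block's first key to the block itself.
import bisect

BLOCK_SIZE = 4  # entries per data block; real systems use ~4 KB byte blocks

def build_sstable(records):
    """records: dict of key -> value; returns (data_blocks, index)."""
    items = sorted(records.items())  # SSTables are ordered by key
    data_blocks = [items[i:i + BLOCK_SIZE]
                   for i in range(0, len(items), BLOCK_SIZE)]
    index = [block[0][0] for block in data_blocks]  # first key of each block
    return data_blocks, index

def sstable_get(data_blocks, index, key):
    """Use the sparse index to locate the one block that may hold `key`."""
    pos = bisect.bisect_right(index, key) - 1
    if pos < 0:
        return None
    return dict(data_blocks[pos]).get(key)

blocks, idx = build_sstable({f"k{i:03d}": i for i in range(10)})
print(sstable_get(blocks, idx, "k007"))  # -> 7, after reading one block
```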
Compaction. When a level reaches its maximum capacity, all or part of its data is sort-merged with the data in the next level that has an overlapping key range. This process is called compaction. Compaction reduces space amplification through periodic garbage collection. It also bounds the number of sorted components, or runs, thereby facilitating fast lookups. However, since compaction involves reading multiple files into memory, followed by sorting, merging, and writing them back to disk, the performance of foreground workloads may be impacted during this process, causing write stalls [9].
Classically, LSM-Trees support two compaction strategies: leveling and tiering, contrasted in the toy model below. In leveling, each level holds at most one run; every time a run from Level i − 1 (the previous level) is moved to Level i, it is immediately and greedily merged with the existing run at that level, if any. With tiering, on the other hand, every level must accumulate a certain number of runs before they can be merged. Tiering is more write-optimized, as each compaction involves less data, thus minimizing write stalls. However, it is less efficient in terms of reads and space [10].
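In the sketch below, a run is represented only by its size, and "merging" sums sizes to show how much data each merge touches; the capacities and fanout are arbitrary simplifications, not a faithful compaction implementation.

```python
# Toy contrast between leveling and tiering; run = its size in entries.
def insert_run_leveling(levels, run, level=0, capacity=100, ratio=10):
    # Leveling: at most one run per level; an arriving run merges greedily
    # with the resident run, and overflow pushes the merged run deeper.
    if levels[level]:
        run += levels[level].pop()
    if run > capacity * ratio ** level and level + 1 < len(levels):
        insert_run_leveling(levels, run, level + 1, capacity, ratio)
    else:
        levels[level].append(run)

def insert_run_tiering(levels, run, level=0, fanout=4):
    # Tiering: accumulate `fanout` runs before merging them all into one
    # run that moves down; each merge touches less data than in leveling.
    levels[level].append(run)
    if len(levels[level]) >= fanout and level + 1 < len(levels):
        merged = sum(levels[level])
        levels[level].clear()
        insert_run_tiering(levels, merged, level + 1, fanout)

levels_l = [[] for _ in range(4)]
levels_t = [[] for _ in range(4)]
for _ in range(20):
    insert_run_leveling(levels_l, 50)
    insert_run_tiering(levels_t, 50)
print(levels_l, levels_t, sep="\n")  # tiering keeps multiple runs per level
```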
Flushing. Typically, for an LSM-Tree with L levels, the first level (Level 0) is retained in memory while the remaining levels (Level 1 to L − 1) reside on disk, such that deeper levels hold exponentially more data with progressively older timestamps [7]. Incoming inserts, updates, and deletes are first buffered in this in-memory component. The buffer is often implemented using a data structure that allows logarithmic-time lookups, such as a binary search tree or a skip list [4].
Every time the memory buffer reaches capacity, it is written to disk as an immutable sorted run, while a new buffer is created to receive the next batch of inserts and thereby avoid write stalls. This process is called flushing; once on disk, the flushed entries can later be compacted. This implies that updates and deletes are handled as new inserts and are lazily applied to the base data. This way, LSM-Trees further reduce write stalls.
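A minimal sketch of this buffering-and-flushing cycle might look as follows; a plain dict sorted at flush time stands in for the skip list or tree a real memtable would use, and a list of sorted runs stands in for the disk levels.

```python
# Hedged sketch of an LSM write path: buffer in memory, flush when full.
class MemTable:
    def __init__(self, capacity=4):
        self.entries = {}          # latest write wins within the buffer
        self.capacity = capacity

    def full(self):
        return len(self.entries) >= self.capacity

class TinyLSM:
    def __init__(self):
        self.memtable = MemTable()
        self.runs = []             # on-"disk" sorted runs, newest first

    def put(self, key, value):     # a delete would put a tombstone value
        self.memtable.entries[key] = value
        if self.memtable.full():
            self.flush()

    def flush(self):
        # Freeze the buffer as an immutable sorted run, then start a new
        # buffer so ingestion continues without stalling.
        self.runs.insert(0, sorted(self.memtable.entries.items()))
        self.memtable = MemTable()

lsm = TinyLSM()
for i in range(10):
    lsm.put(f"k{i % 6}", i)        # updates are simply new inserts
```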
Data retrieval. Retrieving the value associated with a key, known as a point lookup, begins at the memory buffer and traverses the tree from the smallest disk level to the largest. A Bloom filter is often used to determine whether the key might be found in a given block before actually searching that block. If the key exists, the lookup is categorized as a non-zero-result point lookup; otherwise it is a zero-result point lookup. It is ensured that the latest version of an entry is always retained in the most recent run. In addition to Bloom filters, fence pointers are used to quickly identify which page(s) to read [11].
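Continuing the TinyLSM sketch above, a point lookup can be expressed as below; Bloom filters and fence pointers are omitted here, so every run is binary-searched from newest to oldest.

```python
# Point lookup over the TinyLSM sketch: buffer first, then runs newest to
# oldest; the first hit is guaranteed to be the latest version.
import bisect

def get(lsm, key):
    if key in lsm.memtable.entries:
        return lsm.memtable.entries[key]     # found in the buffer
    for run in lsm.runs:                     # newest run first
        keys = [k for k, _ in run]
        i = bisect.bisect_left(keys, key)    # binary search within the run
        if i < len(keys) and keys[i] == key:
            return run[i][1]                 # non-zero-result point lookup
    return None                              # zero-result point lookup

print(get(lsm, "k0"))  # -> 6, the latest version, not the older value 0
```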
Bloom filter. A Bloom filter is a data structure that tests for set membership efficiently [12]; in an LSM-Tree, it facilitates higher point-lookup performance by letting runs that do not contain a target entry be skipped. With a Bloom filter, each item in the set is hashed with h hash functions, and the resulting bits are mapped to positions in a bit array. The Bloom filter of a set is the OR-ing of each item's bit string. To check the membership of an item, h bit positions are computed using the same h hash functions, and the corresponding bits in the filter are checked to see if all are set to 1. This, however, does not guarantee that the item exists in the set, a problem known as false positives. The false positive rate depends on the number of hash functions h, the number of items in the set S, and the length of the bit array m, and can be computed as in the following equation [13].
\(\left( 1 - \left[ 1 - \frac{1}{m} \right]^{hS} \right)^h \approx (1-e^{\frac{-hS}{m}})^h\) Formula 1. Approximation of the false positive rate.
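To tie the mechanism to Formula 1, here is a minimal Bloom filter sketch; salted SHA-256 hashing is one possible way to derive the h hash functions, not a prescribed one, and one byte per bit position is used purely for simplicity.

```python
# Minimal Bloom filter matching the notation of Formula 1:
# m bits, h hash functions, S inserted items.
import hashlib, math

class BloomFilter:
    def __init__(self, m, h):
        self.m, self.h = m, h
        self.bits = bytearray(m)  # one byte per "bit" for simplicity

    def _positions(self, item):
        for i in range(self.h):   # derive h positions via salted hashing
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

m, h, S = 1024, 3, 100
bf = BloomFilter(m, h)
for i in range(S):
    bf.add(f"key{i}")
predicted_fpr = (1 - math.exp(-h * S / m)) ** h   # Formula 1
observed = sum(bf.might_contain(f"other{i}") for i in range(10000)) / 10000
print(f"predicted FPR {predicted_fpr:.4f}, observed {observed:.4f}")
```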
Scanning. Scanning the “database,” known as a range lookup, returns the most recent versions of all keys within a range. This entails scanning and merging keys found on different levels to ensure the system returns only the latest version of each key. Typically, scanning is done in parallel.
Workload. It can be summed up that, following [11], [14], a KV-store system commonly involves four types of operations: zero-result point lookups, non-zero-result point lookups, range lookups, and writes.
Amplifications. All data structures of interest in data systems can be judged with regard to their read, write, and space/memory amplification characteristics. Read amplification is defined as “how much more data do we have to read for every key we are looking for on top of accessing this particular key” [15]. That is, if we need to read 5 pages to answer a query, the read amplification is 5. Extrapolating this to writing, write amplification measures the ratio of the amount of data written to storage to the amount of data written to the database. Each write operation may require flushing an entire page, which means that even if only one byte of data is modified, the entire page may need to be written to disk. As a measurement: if we write 10 MB of data into the database, yet the data system writes a total of 30 MB, then the write amplification is 3. Meanwhile, space or memory amplification is defined as the ratio of the amount of data on the storage device to the amount of data in the database. Read and write amplification are jointly used to measure the performance of an LSM-Tree [16]. With that in mind, we can say that an LSM-Tree will have less write amplification than a B-Tree while, conversely, a B-Tree will have less read amplification than an LSM-Tree. This can be shown as follows.
Assuming that the block size of a B-Tree is B, measured in entries of constant size, we can say that the read amplification of a B-Tree is:
\(O\left( \log_B \frac{N}{B} \right)\) Formula 2. Read amplification of a B-Tree.
where N is the size of the database, and the records, keys, and pointers are all of constant size. Meanwhile, a leveled LSM-Tree must perform a binary search at each level; a binary search over a single run has a read cost of:
\(O\left( \log \frac{N}{B} \right)\) Formula 3. Read cost of a binary search over a single run.
Since a leveled LSM-Tree has a size ratio k, such that Level i is k times larger than Level i − 1, we obtain:
\( \text{level}_i = O\left( \log \frac{N}{B} \right) \newline \text{level}_{i-1} = O\left( \log \frac{N}{kB} \right) \newline \text{level}_{i-2} = O\left( \log \frac{N}{k^2B} \right) \newline \vdots \newline \text{level}_{i-m} = O\left( \log \frac{N}{k^mB} \right) \) Formula 4. Per-level read cost of a leveled LSM-Tree.
where m indicates the number of steps above the lowest level. The search space shrinks in proportion to the size ratio k: if the data size at the lowest level i is O(N), the data at the immediately preceding level i − 1 is O(N/k). The data size at level i − m of a leveled LSM-Tree can therefore be generalized as:
\( O\left(\frac{N}{k^m} \right) \) Formula 5. The generalized size of data at level i-m.
It follows that the total number of disk reads for a leveled LSM-Tree would be:
\( O\left( \log\frac{N}{B} \right) + O\left( \log\frac{N}{kB} \right) + \cdots + O\left( \log\frac{N}{k^mB} \right) \) Formula 6. The read amplification of a leveled LSM-Tree.
We conclude that a leveled LSM-Tree's read performance is not as good as a B-Tree's.
As for the write amplification of an LSM-Tree, by design, a write should never trigger a rewrite of the entire database. To show this, we first calculate the number of levels in the tree, given a size ratio of k, to be:
\( \log_k\frac{N}{B} \) Formula 7. The number of levels in an LSM-Tree.
Assuming that we need to flush data from a given level to the next, or from the buffer to the first level, and that in the worst case an entry is re-merged into the same level up to k times before moving on, the write amplification would be:
\( O\left(k \cdot \log_k\frac{N}{B} \right) \) Formula 8. The write amplification of a leveled LSM-Tree.
In comparison, the design of a B-Tree requires rewriting an entire leaf whenever its structure changes; therefore, the write amplification of a B-Tree is simply:
\( O\left(B \right) \) Formula 9. The write amplification of B-Tree.
From this we conclude that the write performance of a B-Tree is not as good as that of an LSM-Tree.
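To make the asymptotics tangible, the following sketch plugs concrete (and arbitrary) values of N, B, and k into Formulas 2, 6, 7, 8, and 9 under the simplified cost model above, i.e., one binary search per level and no fence pointers or filters.

```python
# Worked example of the amplification formulas under the simple model.
import math

N = 2**30   # database size in entries (~1 billion)
B = 2**8    # entries per block (256)
k = 10      # size ratio between adjacent LSM-Tree levels

btree_read  = math.log(N / B, B)            # Formula 2: O(log_B(N/B))
lsm_levels  = math.log(N / B, k)            # Formula 7: log_k(N/B)
lsm_read    = sum(math.log2(N / (k**m * B))        # Formula 6: one binary
                  for m in range(int(lsm_levels))  # search per level
                  if N / (k**m * B) > 1)
btree_write = B                              # Formula 9: O(B)
lsm_write   = k * lsm_levels                 # Formula 8: O(k * log_k(N/B))

print(f"B-Tree reads ~{btree_read:.1f} blocks, LSM-Tree ~{lsm_read:.0f}")
print(f"B-Tree write amp ~{btree_write}, LSM-Tree ~{lsm_write:.0f}")
```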
Challenges and Solutions
Ideally, there would exist a data structure that guarantees the best read and write performance on all hardware, at all times, with no overhead. But there is no such perfect data structure: none can simultaneously reduce space/memory, read, and write amplification [17]. For example, reducing the number of levels by increasing the level multiplier decreases space and read amplification but increases write amplification. In this section, we shall review such pain points of LSM-Trees and how they have been addressed in the literature.
Space amplification. Indeed, an LSM-Tree is superior to a B-Tree in preserving space because, due to its algorithmic requirements, a B-Tree page is typically only ½ to ⅔ full. Moreover, since each level in an LSM-Tree has a fixed size multiplier, if each level is 10x larger than the previous level, it follows that all of the levels up to the last level combined are only 11.111…% the size of the last level. On the other hand, LSM-Trees occupy more storage space than the size of the user data due to “inefficient removal,” that is, deleted entries take up space until compaction discards them. Also, due to multi-version concurrency, the system is prevented from reclaiming the space occupied by a file until the compaction of said file is completed [10].
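For completeness, the 11.111…% figure is the sum of a geometric series: with size ratio k, the combined size of all levels above the last, relative to the last, is \( \sum_{i=1}^{\infty} k^{-i} = \frac{1}{k - 1} \), which for k = 10 gives 1/9 ≈ 11.1%.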
RocksDB implements two strategies to reduce this space amplification: (1) adapting level sizes dynamically to the size of the data, and (2) applying a number of compression strategies [42]. Spooky, on the other hand, introduces a new method of granulating LSM-Tree merge operations so that the contention between write amplification and space amplification can be reduced. It does so through six design decisions that significantly alter the organization of the database files [10].
Read overhead. LSM-Trees have worse read performance since they are designed primarily to support high ingestion [8], [17], [18]. To offer competitive read performance, the design is augmented with auxiliary, in-memory, lightweight data structures such as filters, indexes, and block caches. A further improvement in this space is the state-of-the-art Bloom filter bits-allocation strategy proposed in Monkey [11], which allocates different false positive rates (FPRs) to the Bloom filters of different levels to achieve an optimal memory allocation.
Inefficient policy transition. Another significant challenge, as noted in Cosine, is the need for a technique to transition a data system's engine design with minimal data movement [19]. Under the classical LSM-Tree, continuously changing the compaction policy necessitates reallocating memory and/or reorganizing the data in the tree, which is an expensive operation. In fact, the compaction process in its own right is already a major performance cost for LSM-Trees [20]. Nutanix reported that for their production workloads, an average of 2.5 hours per day per cluster is spent on compaction operations [21].
Attempts to solve inefficient policy transitions generally fall into two schools of thought: greedy and lazy. With greedy transition, once a level receives a new tuning strategy, the current level must be merged and flushed into the next level, a costly operation that incurs write stalls and significantly degrades performance. The problem is exacerbated when a level is unfilled, since merging it into the next level can incur significant write amplification. Alternatively, lazy transition changes a level's compaction policy only once that level has been merged into the next one. Indeed, lazy transition does not bring an immediate compaction cost, but in certain cases, such as with machine-learning-powered models, this is undesirable as it delays the timeliness of the information that can be fed into the model.
RusKey comes up with the Flexible LSM-Tree (FLSM-Tree) to accommodate flexible transitions between compaction policies by allowing different-sized runs to exist at the same level [16]. This affords more room for an LSM-Tree to morph its structure gradually. At each level, there is an active run that can admit entries; when it reaches its capacity, the run is sealed, and a new one is created and activated. A newly applied policy does not affect sealed runs, and since it only changes the metadata of the active run, it incurs no disk reads, resulting in zero transition cost and zero delay: the LSM-Tree can immediately react to the new policy. In the case where the workload contains only reads, this flexible transition essentially degenerates into lazy transition with no impact.
Inefficient removal. Recall that in an LSM-Tree, an entry at level i is always newer than an entry of the same key at level j, if j > i. LSM-Trees exploit this property to support deletion through tombstones: special metadata entries that invalidate older entries of the same key, with the expectation that those entries will eventually be persistently deleted. This means that, for a long time, deletion is only performed logically, not physically. What are the implications? Apart from privacy considerations, a serious issue given data privacy protection acts like the GDPR [22] and CCPA [23], it turns out that tombstones (1) increase space amplification, (2) hurt read performance, as queries potentially have to discard large collections of invalid entries, and (3) increase write amplification, as logically deleted entries keep being compacted. This problem is exacerbated in applications where deletion is the norm, such as stream processing, where processed data of specific timestamps can be removed. Forceful compaction in those cases is suboptimal, as it causes high latency spikes. In fact, deletion by timestamp when it is not the primary key, that is, when it is a secondary attribute, may require periodic full-tree compaction [24], which is extremely expensive: it involves superfluous disk I/Os that increase write amplification and cause latency spikes, because every file of an LSM-Tree is sorted by the primary key.
The most extensive solution to the problem can be found in Lethe, which proposes FADE and KiWi [1]. FADE ensures that every tombstone adheres to the user-provided delete persistence threshold by assigning a time-to-live (TTL) to every file. Compaction is then triggered not only when a level is saturated, but also when any file on that level has an expired TTL. An extensive algorithm determines which files to compact depending on the trigger, i.e., whether it is saturation-driven or deletion-driven; in case of a tie, FADE chooses the file with the most tombstones. The benefits of FADE are twofold: (1) future compactions involve fewer deleted entries, leading to smaller write amplification; and (2) fewer entries are hashed into the Bloom filters, leading to a smaller overall FPR and thus increased read performance, albeit marginally. KiWi, on the other hand, can be thought of as an additional layer that keeps track of a “secondary index” for attributes used in deletion lookups. That is, in addition to the levels of the tree, the files of a level, and the pages of a file, KiWi introduces delete tiles that belong to a file and consist of pages. The contents of a tile are ordered on the delete key D, which allows ranged deletes to be executed to the point that contiguous pages can be dropped entirely.
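As a rough illustration of FADE's two triggers, consider the sketch below; the field names, threshold, and scoring are invented for exposition and do not mirror Lethe's actual implementation.

```python
# Hedged sketch of a FADE-style, TTL-driven compaction trigger.
import time
from dataclasses import dataclass

DELETE_PERSISTENCE_THRESHOLD = 3600.0  # seconds; the user-provided bound

@dataclass
class SSTableFile:
    created_at: float
    tombstones: int
    size: int

def ttl_expired(f: SSTableFile, now: float) -> bool:
    # A file whose age exceeds the threshold must be compacted so that its
    # tombstones are persisted within the user's bound.
    return now - f.created_at > DELETE_PERSISTENCE_THRESHOLD

def pick_compaction_file(level: list, saturated: bool):
    now = time.time()
    expired = [f for f in level if ttl_expired(f, now)]
    if expired:                      # deletion-driven trigger
        return max(expired, key=lambda f: f.tombstones)
    if saturated:                    # saturation-driven trigger; ties go
        return max(level, key=lambda f: f.tombstones)  # to most tombstones
    return None

level = [SSTableFile(created_at=time.time() - 7200, tombstones=40, size=64),
         SSTableFile(created_at=time.time(), tombstones=5, size=64)]
print(pick_compaction_file(level, saturated=False).tombstones)  # -> 40
```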
Inefficient updates. Apart from deletions, many KV-store workloads exhibit skewed data popularity, where a few hot keys are much more likely to be updated than cold keys. Unfortunately, updates in LSM-Trees are inherently inefficient due to their out-of-place update mechanism. TRIAD addresses this by taking into account whether a key is popular (hot) and optimizing the update process accordingly [21]. In contrast, FASTER ditches the idea of out-of-place updates entirely, opting for an in-place process [25].
TRIAD reduces the frequent compactions triggered by different versions of the same hot keys. It does so through a holistic combination of three techniques. First, it minimizes frequent I/O operations on hot keys. Second, it defers file compaction until the overlap between files becomes large enough to merge a high number of duplicate keys. Finally, it avoids flushing the memory component by enhancing the role played by the commit log, treating commit-log data just like SSTable data at Level 0. This way, TRIAD achieves up to 193% higher throughput than vanilla RocksDB, with 77% less time spent on compaction and flushing, and 4x lower write amplification. Furthermore, TRIAD can easily be combined with other techniques, as it is orthogonal to various approaches designed to reduce write amplification.
Interestingly, FASTER opts for in-place updates. It leverages HybridLog, a novel concurrent log whose log-structured organization differs from LSM-Trees. HybridLog consists of stable, read-only, and mutable regions, with the mutable region supporting in-place updates. HybridLog spans memory and secondary storage, where the in-memory portion acts as a cache for hot records. Experiments show that FASTER achieves orders-of-magnitude better throughput, reaching up to 160M operations per second on a single machine.
Instance optimization. However, trade-offs must be made between the costs of reads, writes, and space. For example, to enable faster point reads, we either need to (1) spend more memory on larger Bloom filters, thereby decreasing the likelihood of false positives in accordance with Formula 1, or (2) increase write cost by compacting runs more greedily so that the number of runs across which false positives can occur is restricted. That is to say, under existing designs, improving one of these three metrics makes one or both of the other metrics worse. The Wacky continuum tackles this conundrum by calculating and then picking the best design for the workload [5]. It includes cost models that predict how overall system behavior would be impacted by changing knobs. With these, we can search analytically within Wacky's design space for the most suitable trade-off for the workload.
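The flavor of such an analytic search can be sketched as follows, using a deliberately toy cost model (not Wacky's actual one) that scores read and write costs for each candidate design and picks the cheapest.

```python
# Hedged illustration: grid search over a design space with a cost model.
import math

N_over_B = 2**22          # data size in blocks (arbitrary)
BITS_PER_ENTRY = [2, 5, 10]
SIZE_RATIOS = [2, 4, 10]
POLICIES = ["leveling", "tiering"]

def cost(k, policy, bits, w_read=0.5, w_write=0.5):
    levels = math.log(N_over_B, k)
    runs = levels if policy == "leveling" else levels * (k - 1)
    fpr = math.exp(-bits * math.log(2) ** 2)   # optimal Bloom FPR per run
    read_cost = 1 + runs * fpr                 # expected I/Os per lookup
    write_cost = (k if policy == "leveling" else 1) * levels
    return w_read * read_cost + w_write * write_cost

best = min(((k, p, b) for k in SIZE_RATIOS for p in POLICIES
            for b in BITS_PER_ENTRY),
           key=lambda design: cost(*design))
print("cheapest design under this toy model:", best)
```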
We regard work in this space as instance optimization, where data systems expose configurable knobs or parameters, including but not limited to the compaction policy, the size ratio between levels, and Bloom filter tuning, to name a few. Apart from the Wacky continuum, research in this domain includes Monkey [11], Dostoevsky [26], Cosine [19], and K-LSM [27]. Monkey formulates the expected I/O cost, which it then uses to co-tune the compaction policy, the memory space for the buffer, and the Bloom filters, arriving at an optimal LSM-Tree design with the minimal I/O cost for a given workload. Dostoevsky then makes the compaction policy hybrid, that is, it fuses leveling and tiering in a single tree to strike a balance between read and write throughput based on the workload.
Not tuning per se, but another interesting area of research in this space focuses on making it easier to visualize the effects of optimization experiments. For example, while compactions are critical to LSM performance, selecting the right strategy typically requires expert knowledge. Sarkar et al. introduced the concept of Compactionary [28], which decomposes compaction into an ensemble of primitives; configuring each primitive allows users to easily observe its impact on performance.
Auto-tuning. Mo et al. regard such instance-level optimization as whitebox modeling, which tries to capture the relationship between various policies (such as compaction) and LSM-Tree performance via a sophisticated formula [16]. Yet not everything can be captured by such a formula. For example, the memory cache can significantly affect performance, yet such bottom-level details are oftentimes unaccounted for in whitebox modeling. In contrast stands blackbox modeling, which involves machine learning: actions modify LSM-Tree policies while a reward traces system performance, and the model adjusts itself accordingly and automatically for the highest benefit.
We would argue that blackbox modeling also helps with reducing complexity. In practice, configuring a data system can be challenging, as these systems are designed to serve a diverse range of applications. As such, modern data systems often come with hundreds of configurable knobs, as in the case of RocksDB and HBase, with performance highly dependent on those configurations. Using blackbox modeling, those configurable knobs can all be accounted for. Works in this field include CAMAL [7], RTune [29], K2vTune [30], and Endure [14], as well as, more recently, ELMo-Tune [31], which uses the GPT-4 Large Language Model (LLM), and RusKey [16].
In the case of CAMAL, an ML model is iteratively fitted to facilitate an active learning cycle based on a complexity-based model. In each round, a (set of) knob(s) to be adjusted is chosen and analyzed using said complexity-based model against pre-existing/pre-set knobs from earlier iterations, such that we have (W, X, Y) samples to train on, where X is the selected knobs, Y is the recorded performance, and W is a workload. Their experiments show that CAMAL achieves a 16% to 18% average improvement across various workloads, which, compared to Monkey, translates to 8x lower latency for some write-heavy workloads.
RusKey [16], on the other hand, attempts to design the LSM-Tree online under dynamic workloads, guided by reinforcement learning. It aims to tune the compaction policy at each level to best process the overall workload. To help it achieve that, RusKey introduces a new LSM-Tree design, named FLSM-Tree, that facilitates seamless transitions between different compaction policies, transitions being themselves a major bottleneck for a dynamic KV-store system. RusKey allows for arbitrary arrivals of operations without a priori knowledge of the query-to-update ratio. It groups those operations into missions that divide the workload, and collects relevant statistics such that, after processing each mission, RusKey may adjust its internal LSM-Tree structure for the coming mission through reinforcement learning that rewards performance. For a concrete example, assume RusKey is initially in a balanced state. The dynamic workload then shifts to write-heavy with 90% updates, before shifting back to read-write-balanced, followed by a read-heavy workload of 90% lookups. At the beginning, RusKey tunes the compaction policy P of the FLSM-Tree from 1 to 10 to achieve the best write performance; note that during this tuning, the system is not fed with workload statistics. The FLSM-Tree promises flexible transition without transition cost or delay. Once the workload shifts back to the balanced state, RusKey re-tunes the compaction policy P from 10 to 3, which is more effective for a balanced workload. Then, when the workload shifts to read-heavy, RusKey re-tunes P from 3 to 1 to achieve the best read efficiency.
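In spirit, and with heavy simplification, such reward-driven tuning resembles the epsilon-greedy loop below; the candidate policies, the synthetic reward, and the step size are all stand-ins, not RusKey's actual agent or state/reward definitions.

```python
# Generic epsilon-greedy stand-in for reward-driven policy tuning.
import random

POLICIES = [1, 3, 10]                  # candidate compaction policies P
q = {p: 0.0 for p in POLICIES}         # running reward estimate per policy

def choose(epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(POLICIES)            # explore
    return max(q, key=q.get)                      # exploit

def observe(policy, write_ratio):
    # Synthetic per-mission throughput: higher P helps writes, lower P
    # helps reads (a caricature of the trade-off, not a measurement).
    return policy * write_ratio + (11 - policy) * (1 - write_ratio)

for mission in range(400):
    write_ratio = 0.9 if mission < 200 else 0.1   # workload shifts mid-way
    p = choose()
    # Fixed step size so estimates track the non-stationary workload.
    q[p] += 0.2 * (observe(p, write_ratio) - q[p])
print("preferred policy:", max(q, key=q.get))     # tracks the recent phase
```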
Now, ELMo-Tune takes an entirely different approach by involving LLMs trained on large datasets comprising websites, blogs, articles, research papers, and open-source implementations. The ELMo-Tune prototype claims to achieve up to 3x improvement in throughput and 9x improvement in p99 latency after several tuning iterations. Parallel to CAMAL, those iterations are needed to allow the model to experiment and learn from past results, enabling a flexible tuning approach that can work with a variety of system configurations, workloads, and hardware. Although it shows great promise, we may argue that this blackbox modeling is probably the “least scientific,” even though the paper discusses how it avoids hallucination and other problems that come with using LLMs. The main reason is that it is still hard to reason about and verify how modern machine learning models arrive at a solution, let alone a large language model, making it sound more like gambling [32]. Nonetheless, this holds promise: if a computer could figure out what to do entirely by itself, without any programming apart from telling it how to look up resources, exciting opportunities abound. One of them is supporting the most radical approach of all, the idea that we can come up with the most adequate data structure on demand. This is no longer auto-tuning, but shape-shifting. The idea lies in the concept of design continuums that unify major distinct data structure designs [15]. Currently, a subset of the idea, without machine learning, has been used to demonstrate how the LSM-Tree design space can be adjusted to fit the workload and machine characteristics, in the case of the Wacky continuum.
Modern hardware. The classical LSM-Tree was designed with older hardware in mind, long before the advent of General-Purpose Graphics Processing Units (GPGPUs) or Solid-State Drives (SSDs), which are now ubiquitous. SSDs, for example, are fundamentally different from the Hard Disk Drives (HDDs) on which traditional LSM-Tree designs were based. Unlike HDDs, SSDs are more prone to wear from repeated writes, such that the high write amplification in classical LSM-Trees can significantly reduce device lifetime. Indeed, modern hardware introduces both unique challenges and untapped potential.
One of the readily deployable solutions in this subarea of research is WiscKey [20]. The central idea behind WiscKey is the separation of keys from values. This decoupling can significantly reduce write amplification by avoiding data movement during sorting, since values are not bundled with their keys. It also significantly trims the size of the LSM-Tree, leading to fewer I/O operations and better caching during lookups.
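The essence of key-value separation can be sketched in a few lines; here a plain dict stands in for the LSM-Tree of keys, and the garbage collection of the value log that the real system performs is omitted.

```python
# Hedged sketch of WiscKey-style key-value separation.
class KeyValueSeparatedStore:
    def __init__(self):
        self.value_log = []   # append-only value log ("vLog")
        self.index = {}       # stands in for the LSM-Tree of key -> offset

    def put(self, key, value):
        # Values go to the log; only the small (key, offset) pair enters
        # the LSM-Tree, so compaction never moves the large values around.
        offset = len(self.value_log)
        self.value_log.append(value)
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        return None if offset is None else self.value_log[offset]

store = KeyValueSeparatedStore()
store.put("user:42", b"a large value blob ...")
print(store.get("user:42"))
```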
Other more recent advancements include the use of General-Purpose Graphics Processing Units (GPGPUs) [33] or Field-Programmable Gate Arrays (FPGAs) [34] to offload compaction from the CPUs, thereby relieving them of such I/O-intensive operations. For these kinds of innovations, however, a common challenge persists: not all machines are equipped with the specialized devices. In NoFTL-KV, this prompted the authors to drop backward compatibility with unsupported hardware altogether [35]. Another, LOCS [36], requires a very specific type of SSD that exposes channel device files. This raises the question of whether these solutions are ready for widespread adoption or remain largely experimental, although the results demonstrated thus far are promising. In the case of NoFTL-KV, the authors claim that their solution improves transactional throughput by 33%, enhances response time by 2.3x, reduces write amplification, and extends storage longevity by 19x, all without altering the fundamental design of the LSM-Tree, aside from integrating physical storage management directly within their system to better exploit the characteristics of modern storage technologies. Other cutting-edge research in this area includes, but is not limited to, processing data near its source by using physical addresses directly [37], applying LSM-Trees on devices whose main memory is physically separated from the CPUs in data centers [38], and LSM-Trees on newer nonvolatile memory technologies [9], [39].
Applications
In this section, we shall review how write-optimized NoSQL databases have been used in production. Although there seems to be a dichotomy between the SQL and NoSQL schools of thought, in reality the boundary between the two is blurred. This supports the argument that SQL databases could absorb the ideas of recent advancements instead of becoming extinct or obsolete [40]. Survival analysis suggests that if X has stayed around for n years, it will likely stay for another n years to come [41]. This creates the possibility that, in the future, OLTP and OLAP data systems could be one and the same, although some sacrifices would surely have to be made.
MyRocks at Facebook. In [42], Dong et al. describe how Facebook's MySQL instances are increasingly being switched over to MyRocks, a storage engine based on RocksDB, which in turn is based on LSM-Trees. RocksDB itself has been used in many applications both within and outside of Facebook [43], such as stream processing [44], storing user activity [45], caching application data at Netflix, and serving as a large distributed database at Yahoo [46], to list a few examples. With RocksDB as their MySQL engine, Facebook found that read performance was affected only marginally, well within their requirements, yet their storage requirement was halved, resulting in increased transaction throughput and decreased write amplification.
X-Engine Data System at Alibaba. Singles' Day is one of the largest online shopping festivals in China, held by Alibaba, which runs the world's largest e-commerce platform serving more than 600 million customers. Online e-commerce transactions pose three notable challenges: (1) a drastic increase in transactions per second at the kickoff of major sales events, (2) a large number of hot records that can overwhelm system buffers, and (3) quick shifts of record temperature between hot, warm, and cold, due to promotions on different categories over short periods of time. The traditional approach of using shared-nothing OLTP databases with sharding to distribute transactions over many database instances works, but is extremely costly.
To address those challenges within budget, Alibaba Group designed their own data system: X-Engine [47], which is based on tiered LSM-Trees with slight modifications to suit their needs. Each of X-Engine's tables is row-oriented, partitioned into sub-tables, and backed by an LSM-Tree. Each LSM-Tree consists of a hot data tier residing in main memory and a warm/cold data tier residing in NVM/SSD/HDD, further partitioned into different levels. A machine learning model is used to categorize the temperature of a record. X-Engine must flush often to avoid out-of-memory failures, yet it does so differently: it introduces intra-Level 0 compactions, where warm runs in Level 0 are actively merged without being pushed to the next level. This is because the strong temporal locality in e-commerce workloads may require access to those records; if they are not sorted, flushed, and compacted, queries will have to access all runs to find potential matches. This pragmatic approach keeps warm records in the first level of the LSM-Tree. The size of Level 0 is kept small so that compactions need to access as few runs as possible. Each run is also kept small at 2 MB and further divided into multiple 16 KB data blocks. Each block maintains zone-map statistics that enable skipping blocks whose key range does not overlap with those of other runs. In addition, they offload compactions to FPGA devices, releasing the CPUs from the heavy burden of merging runs. Evaluation results show that during the Singles' Day sales, which involved a workload consisting of 42% point lookups, 10% range lookups, 32% updates, and 16% inserts, X-Engine demonstrates very stable performance in terms of queries per second (QPS). It outperforms InnoDB and RocksDB by an average of 44% and 31%, respectively, and performs only slightly slower than InnoDB (<10%) on a more read-intensive workload, with 80% point lookups and 11% range lookups.
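Zone-map pruning of the kind described can be sketched as follows, assuming each block keeps only its minimum and maximum key as statistics (the exact statistics X-Engine maintains may be richer).

```python
# Hedged sketch of zone-map pruning over sorted data blocks.
def build_zone_map(blocks):
    # blocks: list of lists of (key, value) pairs, each sorted by key.
    return [(blk[0][0], blk[-1][0]) for blk in blocks]  # (min, max) per block

def blocks_for_range(zone_map, lo, hi):
    # Skip any block whose key range does not overlap [lo, hi].
    return [i for i, (mn, mx) in enumerate(zone_map)
            if not (mx < lo or mn > hi)]

blocks = [[(1, "a"), (5, "b")], [(9, "c"), (12, "d")], [(20, "e"), (33, "f")]]
zm = build_zone_map(blocks)
print(blocks_for_range(zm, 10, 25))  # -> [1, 2]; block 0 is skipped
```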
Future Research
There are several interesting questions raised in some of the papers that remain unanswered. We shall review some of those in this section.
Different level multipliers. The original LSM-Tree paper proved that all levels need to have the same size multiplier in order to optimize for write amplification [48]. Dong et al. raised the question of whether this also holds when optimizing for space, especially considering that different levels may use different compression algorithms, resulting in different compression ratios at each level. They argue that, in practice, it is unlikely that the size of the data stored at the last level will be exactly the multiplier times the target size of the previous level.
Self-learning auto-tuning. Another possible direction for future research is finding out how auto-tuning ML models can recognize deprecated and newer knobs that may be undocumented or even buried deep within the system's source code. In this view, perhaps, models like ELMo-Tune are better suited than CAMAL, since it should be easier for ELMo-Tune to understand outdated knobs: it is based on a language model that can “understand” human language and can be retrained without much supervision. However, again, this is just an assumption.
Secondary indexes. Read overhead can arise when querying specific values other than the primary key. Without an index, querying these attributes requires the database system to perform a full-table scan; secondary indexing is thus an indispensable feature in any serious database system. Yet LSM-Trees lack native secondary indexing, and keeping secondary indexes consistent is difficult. This is because LSM-Trees employ a blind-write pattern, where new data is appended without any checking, in contrast to the read-modify-write step used in B-Trees. As a result, it is hard to handle updates or deletes of secondary keys: synchronization with the primary key is unavoidable, yet doing so would break the blind-write contract and thus degrade the overall write performance of the system. Indeed, this is what typical modern LSM-based storage systems do: store secondary indexes as another LSM-Tree [13]. Other proposals, such as [49], are complex and/or require specific hardware to function. Ideally, we should have a structure as straightforward to understand as the secondary indexes supported in relational, B-Tree-based systems.
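To see why blind writes complicate secondary indexes, consider the sketch below of a secondary index kept as another LSM-like table; validating candidates against the primary table, as done here, is one common workaround for the stale entries that blind writes leave behind (the table layout and field names are invented for exposition).

```python
# Hedged sketch: blind writes leave stale secondary-index entries.
primary = {}     # key -> record; stands in for the primary LSM-Tree
secondary = {}   # attribute value -> set of keys; stands in for the index

def put(key, record):
    primary[key] = record
    # Blind write: we append the new index entry without reading the old
    # record, so entries for the previous attribute value become stale.
    secondary.setdefault(record["city"], set()).add(key)

def lookup_by_city(city):
    # Validate each candidate against the primary table to filter out
    # stale index entries.
    return [k for k in secondary.get(city, ())
            if primary.get(k, {}).get("city") == city]

put("u1", {"city": "Boston"})
put("u1", {"city": "Paris"})     # a stale index entry for Boston remains
print(lookup_by_city("Boston"))  # -> [] after validation
print(lookup_by_city("Paris"))   # -> ['u1']
```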
New emerging data structures. A new design known as the Log-Structured Hash-table (LSH-Table) has emerged for applications that require even faster ingestion and update rates [25]. However, this design is still under active research.
Hardware-Software Co-design. Last but not least is realizing the full potential of modern hardware. We feel the need for some level of co-design between different stakeholders. For example, LOCS cannot fully leverage the capabilities of modern SSDs, such as multi-channel features, if the device drivers do not expose the necessary APIs or device files for data systems to access.
Conclusion
Merely relying on B-Trees for all use cases is no longer sufficient in the context of modern data systems. Recent advancements have driven the need for systems capable of handling large volumes of data efficiently, both for reading and writing. The Log-Structured Merge Tree (LSM-Tree) is one such data structure designed specifically for high-speed data ingestion, powering systems like BigTable. However, like any engineered solution, the LSM-Tree is not without its challenges, the main ones revolving around balancing read, write, and space amplification. Despite that, ongoing research continues to refine and optimize LSM-Trees, enhancing their adaptability to various use cases and solidifying their role as a versatile data structure in the era of big data. Facebook, for example, has successfully implemented LSM-Trees within MySQL, improving both data ingestion rates and disk space utilization with minimal impact on read performance, an exciting development! We conclude that research on optimizing data systems for the era of big data shows no signs of slowing down. It remains an interesting subfield of computer science, full of challenges waiting to be solved.
References
[1] S. Sarkar, T. I. Papon, D. Staratzis, and M. Athanassoulis, “Lethe: A Tunable Delete-Aware LSM Engine,” in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, in SIGMOD ’20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 893–908. doi: 10.1145/3318464.3389757.
[2] H. Suzuki, “The Internals of PostgreSQL,” The Internals of PostgreSQL. [Online]. Available: https://www.interdb.jp/pg/#the-internals-of-postgresql
[3] A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, and L. Vivier, “The new ext4 filesystem: current status and future plans,” 2010. [Online]. Available: https://api.semanticscholar.org/CorpusID:267893896
[4] A. Petrov, “Algorithms behind modern storage systems,” Commun ACM, vol. 61, no. 8, pp. 38–44, Jul. 2018, doi: 10.1145/3209210.
[5] N. Dayan and S. Idreos, “The Log-Structured Merge-Bush & the Wacky Continuum,” in Proceedings of the 2019 International Conference on Management of Data, in SIGMOD ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 449–466. doi: 10.1145/3299869.3319903.
[6] M. Rosenblum and J. K. Ousterhout, “The design and implementation of a log-structured file system,” ACM Trans Comput Syst, vol. 10, no. 1, pp. 26–52, Feb. 1992, doi: 10.1145/146941.146943.
[7] W. Yu, S. Luo, Z. Yu, and G. Cong, “CAMAL: Optimizing LSM-trees via Active Learning,” Proc ACM Manag Data, vol. 2, no. 4, Sep. 2024, doi: 10.1145/3677138.
[8] S. Sarkar and M. Athanassoulis, “Dissecting, Designing, and Optimizing LSM-based Data Stores,” in Proceedings of the 2022 International Conference on Management of Data, in SIGMOD ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 2489–2497. doi: 10.1145/3514221.3522563.
[9] T. Yao et al., “MatrixKV: reducing write stalls and write amplification in LSM-tree based KV stores with a matrix container in NVM,” in Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference, in USENIX ATC’20. USA: USENIX Association, 2020.
[10] N. Dayan, T. Weiss, S. Dashevsky, M. Pan, E. Bortnikov, and M. Twitto, “Spooky: granulating LSM-tree compactions correctly,” Proc VLDB Endow, vol. 15, no. 11, pp. 3071–3084, Jul. 2022, doi: 10.14778/3551793.3551853.
[11] N. Dayan, M. Athanassoulis, and S. Idreos, “Monkey: Optimal Navigable Key-Value Store,” in Proceedings of the 2017 ACM International Conference on Management of Data, in SIGMOD ’17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 79–94. doi: 10.1145/3035918.3064054.
[12] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun ACM, vol. 13, no. 7, pp. 422–426, Jul. 1970, doi: 10.1145/362686.362692.
[13] M. A. Qader, S. Cheng, and V. Hristidis, “A Comparative Study of Secondary Indexing Techniques in LSM-based NoSQL Databases,” in Proceedings of the 2018 International Conference on Management of Data, in SIGMOD ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 551–566. doi: 10.1145/3183713.3196900.
[14] A. Huynh, H. A. Chaudhari, E. Terzi, and M. Athanassoulis, “Endure: a robust tuning paradigm for LSM trees under workload uncertainty,” Proc VLDB Endow, vol. 15, no. 8, pp. 1605–1618, Apr. 2022, doi: 10.14778/3529337.3529345.
[15] S. Idreos et al., “Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn,” in Conference on Innovative Data Systems Research, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:58013807
[16] D. Mo, F. Chen, S. Luo, and C. Shan, “Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads,” Proc ACM Manag Data, vol. 1, no. 3, Nov. 2023, doi: 10.1145/3617333.
[17] M. Athanassoulis et al., “Designing Access Methods: The RUM Conjecture,” in Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016, Bordeaux, France, March 15-16, 2016, E. Pitoura, S. Maabout, G. Koutrika, A. Marian, L. Tanca, I. Manolescu, and K. Stefanidis, Eds., OpenProceedings.org, 2016, pp. 461–466. doi: 10.5441/002/EDBT.2016.42.
[18] M. Athanassoulis and S. Idreos, “Design Tradeoffs of Data Access Methods,” in Proceedings of the 2016 International Conference on Management of Data, in SIGMOD ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 2195–2200. doi: 10.1145/2882903.2912569.
[19] S. Chatterjee, M. Jagadeesan, W. Qin, and S. Idreos, “Cosine: a cloud-cost optimized self-designing key-value storage engine,” Proc VLDB Endow, vol. 15, no. 1, pp. 112–126, Sep. 2021, doi: 10.14778/3485450.3485461.
[20] L. Lu, T. S. Pillai, H. Gopalakrishnan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “WiscKey: Separating Keys from Values in SSD-Conscious Storage,” ACM Trans Storage, vol. 13, no. 1, Mar. 2017, doi: 10.1145/3033273.
[21] O. Balmau et al., “TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17), Santa Clara, CA: USENIX Association, Jul. 2017, pp. 363–375. [Online]. Available: https://www.usenix.org/conference/atc17/technical-sessions/presentation/balmau
[22] M. Goddard, “The EU General Data Protection Regulation (GDPR): European Regulation that has a Global Impact,” Int. J. Mark. Res., vol. 59, no. 6, pp. 703–705, 2017, doi: 10.2501/IJMR-2017-050.
[23] P. Bukaty, “Rights of Consumers and Obligations of the Business,” in The California Consumer Privacy Act (CCPA), in An implementation guide. , IT Governance Publishing, 2019, pp. 55–89. doi: 10.2307/j.ctvjghvnn.9.
[24] G. Huang et al., “X-Engine: An Optimized Storage Engine for Large-scale E-commerce Transaction Processing,” in Proceedings of the 2019 International Conference on Management of Data, in SIGMOD ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 651–665. doi: 10.1145/3299869.3314041.
[25] B. Chandramouli, G. Prasaad, D. Kossmann, J. Levandoski, J. Hunter, and M. Barnett, “FASTER: A Concurrent Key-Value Store with In-Place Updates,” in Proceedings of the 2018 International Conference on Management of Data, in SIGMOD ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 275–290. doi: 10.1145/3183713.3196898.
[26] N. Dayan and S. Idreos, “Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging,” in Proceedings of the 2018 International Conference on Management of Data, in SIGMOD ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 505–520. doi: 10.1145/3183713.3196927.
[27] A. Huynh, H. A. Chaudhari, E. Terzi, and M. Athanassoulis, “Towards flexibility and robustness of LSM trees,” VLDB J., vol. 33, no. 4, pp. 1105–1128, Jul. 2024, doi: 10.1007/s00778-023-00826-9.
[28] S. Sarkar, K. Chen, Z. Zhu, and M. Athanassoulis, “Compactionary: A Dictionary for LSM Compactions,” in Proceedings of the 2022 International Conference on Management of Data, in SIGMOD ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 2429–2432. doi: 10.1145/3514221.3520169.
[29] H. Jin, J. Lee, and S. Park, “RTune: a RocksDB tuning system with deep genetic algorithm,” in Proceedings of the Genetic and Evolutionary Computation Conference, in GECCO ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 1209–1217. doi: 10.1145/3512290.3528726.
[30] J. Lee, S. Seo, J. Choi, and S. Park, “K2vTune: A workload-aware configuration tuning for RocksDB,” Inf Process Manage, vol. 61, no. 1, Feb. 2024, doi: 10.1016/j.ipm.2023.103567.
[31] V. Thakkar, M. Sukumar, J. Dai, K. Singh, and Z. Cao, “Can Modern LLMs Tune and Configure LSM-based Key-Value Stores?,” in Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, in HotStorage ’24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 116–123. doi: 10.1145/3655038.3665954.
[32] J. Huang and K. C.-C. Chang, “Towards Reasoning in Large Language Models: A Survey,” ArXiv, vol. abs/2212.10403, 2022, [Online]. Available: https://api.semanticscholar.org/CorpusID:254877753
[33] H. Sun, J. Xu, X. Jiang, G. Chen, Y. Yue, and X. Qin, “gLSM: Using GPGPU to Accelerate Compactions in LSM-tree-based Key-value Stores,” ACM Trans Storage, vol. 20, no. 1, Jan. 2024, doi: 10.1145/3633782.
[34] T. Zhang et al., “FPGA-Accelerated Compactions for LSM-based Key-Value Store,” in 18th USENIX Conference on File and Storage Technologies (FAST 20), Santa Clara, CA: USENIX Association, Feb. 2020, pp. 225–237. [Online]. Available: https://www.usenix.org/conference/fast20/presentation/zhang-teng
[35] T. Vinçon, S. Hardock, C. Riegger, J. Oppermann, A. Koch, and I. Petrov, “NoFTL-KV: tackling write-amplification on KV-stores with native storage management,” presented at the Advances in database technology - EDBT 2018 : 21st International Conference on Extending Database Technology, Vienna, Austria, March 26-29, 2018. proceedings, M. Böhlen, Ed., Konstanz: Universität Konstanz, 2018, pp. 457–460. [Online]. Available: https://openproceedings.org/html/pages/2018_edbt.html
[36] P. Wang et al., “An efficient design and implementation of LSM-tree based key-value store on open-channel SSD,” in Proceedings of the Ninth European Conference on Computer Systems, in EuroSys ’14. New York, NY, USA: Association for Computing Machinery, 2014. doi: 10.1145/2592798.2592804.
[37] T. Vinçon, A. Bernhardt, I. Petrov, L. Weber, and A. Koch, “nKV: near-data processing with KV-stores on native computational storage,” in Proceedings of the 16th International Workshop on Data Management on New Hardware, in DaMoN ’20. New York, NY, USA: Association for Computing Machinery, 2020. doi: 10.1145/3399666.3399934.
[38] R. Wang, C. Gao, J. Wang, P. Kadam, M. Tamer Özsu, and W. G. Aref, “Optimizing LSM-based indexes for disaggregated memory,” VLDB J., vol. 33, no. 6, pp. 1813–1836, Jun. 2024, doi: 10.1007/s00778-024-00863-y.
[39] D. Kim, J. Lee, K. S. Lim, J. Heo, T. J. Ham, and J. W. Lee, “An LSM Tree Augmented with B+ Tree on Nonvolatile Memory,” ACM Trans Storage, vol. 20, no. 1, Jan. 2024, doi: 10.1145/3633475.
[40] M. Stonebraker and A. Pavlo, “What Goes Around Comes Around… And Around…,” SIGMOD Rec, vol. 53, no. 2, pp. 21–37, Jul. 2024, doi: 10.1145/3685980.3685984.
[41] H. Yanagisawa, “Proper scoring rules for survival analysis,” in Proceedings of the 40th International Conference on Machine Learning, in ICML’23. Honolulu, Hawaii, USA: JMLR.org, 2023.
[42] S. Dong, M. D. Callaghan, L. Galanis, D. Borthakur, T. Savor, and M. Strum, “Optimizing Space Amplification in RocksDB,” in Conference on Innovative Data Systems Research, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:16175593
[43] S. Dong, A. Kryczka, Y. Jin, and M. Stumm, “RocksDB: Evolution of Development Priorities in a Key-value Store Serving Large-scale Applications,” ACM Trans Storage, vol. 17, no. 4, Oct. 2021, doi: 10.1145/3483840.
[44] G. J. Chen et al., “Realtime Data Processing at Facebook,” in Proceedings of the 2016 International Conference on Management of Data, in SIGMOD ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 1087–1098. doi: 10.1145/2882903.2904441.
[45] S. A. Noghabi et al., “Samza: stateful scalable stream processing at LinkedIn,” Proc VLDB Endow, vol. 10, no. 12, pp. 1634–1645, Aug. 2017, doi: 10.14778/3137765.3137770.
[46] B. F. Cooper et al., “PNUTS to Sherpa: lessons from Yahoo!’s cloud database,” Proc VLDB Endow, vol. 12, no. 12, pp. 2300–2307, Aug. 2019, doi: 10.14778/3352063.3352146.
[47] G. Huang et al., “X-Engine: An Optimized Storage Engine for Large-scale E-commerce Transaction Processing,” in Proceedings of the 2019 International Conference on Management of Data, in SIGMOD ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 651–665. doi: 10.1145/3299869.3314041.
[48] P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil, “The log-structured merge-tree (LSM-tree),” Acta Inform., vol. 33, no. 4, pp. 351–385, Jun. 1996, doi: 10.1007/s002360050048.
[49] J. Wang, Y. Lu, Q. Wang, Y. Zhang, and J. Shu, “Perseid: A Secondary Indexing Mechanism for LSM-Based Storage Systems,” ACM Trans Storage, vol. 20, no. 2, Feb. 2024, doi: 10.1145/3633285.