On the design of an efficient, fast, and predictable memory allocator
We are building our own operating system in C for our Principles of Operating Systems class. At one point, we were required to create a memory allocation system: given a whole heap of memory, we should be able to reserve some space within it and later free that space. In this post, I would like to write down the approach that I took.
Our system allocates in a way that respects the double-word boundary; every address and size is thus a multiple of 8. If the system is initialized with a main memory whose size is not a multiple of 8, slightly less space is used so that everything aligns properly. When a block of memory is requested, we use first-fit to find the earliest block that can satisfy the request; such a block must be at least as large as the requested size. Since the returned block may be much larger than what was requested, we employ a buddy-allocator technique to divide the block. We call this division technique “chunking up.” As we demonstrate later, our chunking-up technique is similar to a buddy allocator but differs in one notable way.
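For illustration, here is a minimal sketch of that boundary rounding, using a standard bit trick (the helper name `align8` is ours, not necessarily what appears in our code):

```c
#include <stddef.h>

/* Round a requested size up to the next multiple of 8 so that every
 * address and size stays on the double-word boundary. */
static size_t align8(size_t n) {
    return (n + 7u) & ~(size_t)7u;
}
/* align8(1) == 8, align8(8) == 8, align8(9) == 16 */
```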
When the memory is set up for the first time, which is when the kernel is being initialized, the memory is in the following configuration:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7f2c171e0018 | 0 | yes | 134217720 |
The POS indicates an absolute position counted from 0, while the ADDRESS is the memory address usable by the user. The POS accounts for the overhead struct; the ADDRESS does not. In other words, the ADDRESS is the address of the addressable allocated memory, which can be read, written, checked, and freed. FREE indicates whether the block is, at the time of reporting, reserved or not. The SIZE reports how large the allocated memory is, sans any overhead struct. If we were to sum up the SIZE column, the number would be smaller than the total memory space even in this kernel-fresh state, since we must account for the overhead struct created for each memory block.
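To make this bookkeeping concrete, here is a rough sketch of the overhead struct and the first-fit scan. The field and function names are ours for illustration; the real overhead struct is packed into the 8 bytes that precede each block’s ADDRESS:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bookkeeping ("overhead") struct. In the real allocator
 * this information is packed into 8 bytes just before the ADDRESS. */
typedef struct block {
    size_t        size; /* addressable bytes, excluding this header */
    unsigned      pid;  /* owning process; 0 means the kernel       */
    bool          free; /* may this block satisfy a request?        */
    struct block *next; /* neighbor to the right, in address order  */
    struct block *prev; /* neighbor to the left, in address order   */
} block_t;

/* First-fit: return the earliest free block at least `want` bytes
 * wide, or NULL when no single block can satisfy the request. */
static block_t *first_fit(block_t *head, size_t want) {
    for (block_t *b = head; b != NULL; b = b->next)
        if (b->free && b->size >= want)
            return b;
    return NULL;
}
```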
Now, when our system is fully initialized, the memory map is akin to this:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7fd464e5d018 | 0 | no | 8 |
16 | 0x7fd464e5d028 | 0 | yes | 8 |
32 | 0x7fd464e5d038 | 0 | no | 8 |
48 | 0x7fd464e5d048 | 0 | yes | 72 |
128 | 0x7fd464e5d098 | 0 | yes | 120 |
256 | 0x7fd464e5d118 | 0 | yes | 248 |
512 | 0x7fd464e5d218 | 0 | yes | 504 |
1024 | 0x7fd464e5d418 | 0 | yes | 1016 |
2048 | 0x7fd464e5d818 | 0 | yes | 2040 |
4096 | 0x7fd464e5e018 | 0 | yes | 4088 |
8192 | 0x7fd464e5f018 | 0 | yes | 8184 |
16384 | 0x7fd464e61018 | 0 | yes | 16376 |
32768 | 0x7fd464e65018 | 0 | yes | 32760 |
65536 | 0x7fd464e6d018 | 0 | yes | 65528 |
131072 | 0x7fd464e7d018 | 0 | yes | 131064 |
262144 | 0x7fd464e9d018 | 0 | yes | 262136 |
524288 | 0x7fd464edd018 | 0 | yes | 524280 |
1048576 | 0x7fd464f5d018 | 0 | yes | 1048568 |
2097152 | 0x7fd46505d018 | 0 | yes | 2097144 |
4194304 | 0x7fd46525d018 | 0 | yes | 4194296 |
8388608 | 0x7fd46565d018 | 0 | yes | 8388600 |
16777216 | 0x7fd465e5d018 | 0 | yes | 16777208 |
33554432 | 0x7fd466e5d018 | 0 | yes | 33554424 |
67108864 | 0x7fd468e5d018 | 0 | yes | 67108856 |
As is noticeable, the memory is divided in such a way that the blocks toward the right-hand side, that is, the ones at ever-larger addresses, tend to be of greater size. This is deliberate and carefully designed. Put another way, we believe that blocks at the left-hand side are extremely short-lived, or volatile in nature: they are used and freed quickly. We therefore want a system that can hand out those blocks without scanning far. This can be observed in how the 8-byte block at 0x7fd464e5d028 is free: it must have been used by the shell program in some way and then freed again before the memory mapper got around to it, so the map never caught it in its reserved state. For a fuller picture, the first block stores the PCB (Process Control Block) metadata, which is allocated by the system’s kernel as soon as the OS is running, and will remain there as long as the OS is running.
This technique helps our first-fit algorithm deliver virtually the same benefits as best-fit without compromising speed. Unlike a typical buddy allocator, however, we always make sure that the division leaves the right block slightly bigger than its left-hand cousin; that is, the division is not done evenly. For example, let’s study these blocks:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
2048 | 0x7fd464e5d818 | 0 | yes | 2040 |
4096 | 0x7fd464e5e018 | 0 | yes | 4088 |
If the division were fully fair and equal, the block at POS 4096 would be of size 4080 bytes; instead, it is 8 bytes larger. Indeed, this is how our chunking up differs a little from the standard buddy allocator: we ensure that the right-hand block is always slightly larger than its left-hand cousin. This is done in the hope that the left-hand side will be much smaller and thus more readily suitable for the requested size.
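In code, the division could be sketched as below. This is only one way to realize the bias, not our exact arithmetic, which nudges the split a further 8 bytes to the right, as the numbers above show:

```c
#define HDR 8 /* bytes of overhead per block, as seen in the tables */

/* Compute the buddies' sizes when chunking a block with `usable`
 * addressable bytes. The right buddy pays HDR bytes for its fresh
 * bookkeeping struct; the left buddy keeps the existing one. Rounding
 * the left half down to the 8-byte boundary means any leftover lands
 * on the right, so the right buddy is never smaller than the left. */
static void split_sizes(size_t usable, size_t *left, size_t *right) {
    size_t pool = usable - HDR;       /* what remains after the new header */
    *left  = (pool / 2) & ~(size_t)7; /* aligned half, rounded down        */
    *right = pool - *left;            /* remainder goes to the right buddy */
}
```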
The generally larger blocks on the right also allow the next request to ask for a bigger size; should such a block be able to accommodate it, it can be returned right away without splitting an even bigger one. Seen from a different perspective, we may say that the blocks on the left tend to be short-lived: the algorithm is carefully designed so that blocks of smaller size are concentrated on the left-hand side, which is very near the start of the iteration when we need to find a suitable block.
But we do not stop at first-fit and best-buddy allocation when it comes to allocating memory; otherwise, we run the risk of wasting a lot of space. That is because there comes a point where we can no longer chunk up a block, as doing so would create a buddy of effectively 0 bytes (if not even less), such as in the case of a 16-byte block. When a 16-byte block is chunked, it should end up as two blocks of 8 bytes each. But since we need to create a bookkeeping struct for the new block on the right, the right buddy would end up with 0 bytes of addressable space. The left buddy does not need a new bookkeeping struct, since it reuses the one the original block is attached to.
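As a sketch, the guard that refuses such a split could look like this (building on the `block_t` sketch above):

```c
/* A block is only worth chunking if, after carving out the HDR bytes
 * for the right buddy's bookkeeping struct, both buddies still end up
 * with at least 8 addressable bytes. A 16-byte block fails this test:
 * 16 - HDR leaves only 8 bytes to share between two buddies. */
static bool can_chunk(const block_t *b) {
    return b->free && b->size >= 8 + HDR + 8;
}
```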
Instead of chunking, in this case we perform a process we call space-fitting. For example, say we have memory in this layout:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7f5465a75018 | 0 | no | 8 |
16 | 0x7f5465a75028 | 0 | yes | 80 |
104 | 0x7f5465a75080 | 0 | no | 16 |
128 | 0x7f5465a75098 | 0 | no | 16 |
152 | 0x7f5465a750b0 | 0 | no | 16 |
If we issue `malloc 8` to reserve an 8-byte addressable memory space, we end up with a new block at ADDRESS 0x7f5465a75038 of exactly 8 bytes:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7f5465a75018 | 0 | no | 8 |
16 | 0x7f5465a75028 | 0 | yes | 8 |
32 | 0x7f5465a75038 | 0 | no | 8 |
48 | 0x7f5465a75048 | 0 | yes | 64 |
120 | 0x7f5465a75090 | 0 | no | 16 |
Without space-fitting, this new region would likely be bigger than 8 bytes, though still less than 80 bytes wide.
Space-fitting works in this manner: first, first-fit and best-buddy perform their work to give us the best possible block as soon as possible. Then, if the block is larger than requested and the right-hand side is free, we transfer the extra bytes there. This is almost always possible because of how our memory allocator is designed to work; seen from another angle, the design makes the right-hand side almost always free. In the rare case that the right is not free but the left is, the extra bytes are transferred there instead. In the even rarer case that neither side is free, the block “divides” itself when possible, even if the division is hardly equal and fair, as long as no block ends up with 0 bytes. If none of the above can be done, the block is left as-is.
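Here is a sketch of that decision ladder, showing only the size bookkeeping (the real code also relocates the neighbor’s overhead struct in memory):

```c
/* Space-fitting: trim block `b`, handed to us by first-fit and
 * best-buddy, down to the requested size by pushing the surplus
 * bytes into a free neighbor. */
static void space_fit(block_t *b, size_t want) {
    size_t extra = b->size - want;
    if (extra == 0)
        return;                       /* already an exact fit            */
    if (b->next && b->next->free) {
        b->next->size += extra;       /* common case: surplus goes right */
        b->size = want;
    } else if (b->prev && b->prev->free) {
        b->prev->size += extra;       /* rarer: surplus goes left        */
        b->size = want;
    } else if (extra >= HDR + 8) {
        b->size = want;               /* last resort: uneven division;   */
        /* ... a new free block of (extra - HDR) bytes is carved out
         * here, never one of 0 bytes (bookkeeping omitted) ...          */
    }
    /* otherwise the block is left as-is */
}
```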
This overall design has benefited us tremendously. Our search does not have to iterate far; if we ever need to chunk up, we still have a block big enough to be chunked up again; and nearly every block that is reserved carries few extra, unnecessary bytes.
Now, let’s see what happens if we allocate 128 bytes of memory. First, take a look at this snapshot of our memory map before making such an allocation:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7f5465a75018 | 0 | no | 8 |
16 | 0x7f5465a75028 | 0 | yes | 8 |
32 | 0x7f5465a75038 | 0 | no | 8 |
48 | 0x7f5465a75048 | 0 | yes | 16 |
72 | 0x7f5465a75060 | 0 | no | 32 |
112 | 0x7f5465a75088 | 0 | yes | 40 |
160 | 0x7f5465a750b8 | 0 | no | 16 |
184 | 0x7f5465a750d0 | 0 | no | 16 |
208 | 0x7f5465a750e8 | 0 | yes | 96 |
312 | 0x7f5465a75150 | 0 | yes | 96 |
416 | 0x7f5465a751b8 | 0 | yes | 88 |
512 | 0x7f5465a75218 | 0 | yes | 504 |
1024 | 0x7f5465a75418 | 0 | yes | 1016 |
2048 | 0x7f5465a75818 | 0 | yes | 2040 |
Now, after performing memory allocation, our memory map is:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7f5465a75018 | 0 | no | 8 |
16 | 0x7f5465a75028 | 0 | yes | 8 |
32 | 0x7f5465a75038 | 0 | no | 8 |
48 | 0x7f5465a75048 | 0 | yes | 16 |
72 | 0x7f5465a75060 | 0 | no | 32 |
112 | 0x7f5465a75088 | 0 | yes | 40 |
160 | 0x7f5465a750b8 | 0 | no | 16 |
184 | 0x7f5465a750d0 | 0 | no | 16 |
208 | 0x7f5465a750e8 | 0 | yes | 96 |
312 | 0x7f5465a75150 | 0 | yes | 96 |
416 | 0x7f5465a751b8 | 0 | yes | 88 |
512 | 0x7f5465a75218 | 0 | no | 128 |
648 | 0x7f5465a752a0 | 0 | yes | 368 |
1024 | 0x7f5465a75418 | 0 | yes | 1016 |
Notice that the memory at ADDRESS 0x7f5465a75218 is located right between two worlds: everything on its left is of smaller size, and everything on its right is of bigger size, showing our memory allocator worked just as intended. That is also why, instead of having (at least) 280 bytes as a single block on its left (0x7f5465a750e8, 0x7f5465a75150, and 0x7f5465a751b8 combined), that “potential” block is chunked into 3 blocks, completely upholding the rules of our allocator.
Now, what would happen if we were to allocate 64 bytes worth of memory? The answer: our memory map would be as follows:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7f5465a75018 | 0 | no | 8 |
16 | 0x7f5465a75028 | 0 | yes | 8 |
32 | 0x7f5465a75038 | 0 | no | 8 |
48 | 0x7f5465a75048 | 0 | yes | 16 |
72 | 0x7f5465a75060 | 0 | no | 32 |
112 | 0x7f5465a75088 | 0 | no | 88 |
208 | 0x7f5465a750e8 | 0 | yes | 40 |
256 | 0x7f5465a75118 | 0 | no | 16 |
280 | 0x7f5465a75130 | 0 | no | 16 |
304 | 0x7f5465a75148 | 0 | yes | 48 |
360 | 0x7f5465a75180 | 0 | yes | 48 |
416 | 0x7f5465a751b8 | 0 | yes | 88 |
512 | 0x7f5465a75218 | 0 | no | 128 |
648 | 0x7f5465a752a0 | 0 | yes | 368 |
1024 | 0x7f5465a75418 | 0 | yes | 1016 |
2048 | 0x7f5465a75818 | 0 | yes | 2040 |
4096 | 0x7f5465a76018 | 0 | yes | 4088 |
8192 | 0x7f5465a77018 | 0 | yes | 8184 |
It used the block at ADDRESS 0x7f5465a75088, which is 88 bytes wide. Again, that block is located on the left-hand side of 0x7f5465a75218 (128 bytes), which we allocated earlier.
When a given memory block is freed, it is merged with the block to its right and/or to its left if they are free at the time of merging. When blocks are merged, their overhead structs are reclaimed and become reusable, resulting in the merged block’s SIZE being much bigger than the sum of the blocks’ sizes when they were standing by themselves.
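A sketch of the merge-on-free logic, again on top of the `block_t` sketch; note how each absorbed neighbor contributes its HDR bytes back into the addressable space:

```c
/* Free a block and coalesce it with free neighbors. Every absorbed
 * neighbor's overhead struct is reclaimed, which is why the merged
 * SIZE exceeds the sum of the individual SIZEs. */
static void release(block_t *b) {
    b->free = true;
    if (b->next && b->next->free) {     /* absorb the right neighbor   */
        b->size += HDR + b->next->size;
        b->next = b->next->next;
        if (b->next) b->next->prev = b;
    }
    if (b->prev && b->prev->free) {     /* fold into the left neighbor */
        b->prev->size += HDR + b->size;
        b->prev->next = b->next;
        if (b->next) b->next->prev = b->prev;
    }
}
```

For instance, freeing an 8-byte block that sits between a free 8-byte block and a free 16-byte block yields 8 + 8 + 16 bytes plus two reclaimed overhead structs, or 48 bytes, exactly as the maps further down show.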
To demonstrate, assume we had allocated 16 bytes (0x7f5465a75048) and 8 bytes (0x7f5465a75060) worth of space, ending up with the following memory map:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7f5465a75018 | 0 | no | 8 |
16 | 0x7f5465a75028 | 0 | yes | 8 |
32 | 0x7f5465a75038 | 0 | no | 8 |
48 | 0x7f5465a75048 | 0 | no | 16 |
72 | 0x7f5465a75060 | 0 | no | 32 |
112 | 0x7f5465a75088 | 0 | no | 88 |
208 | 0x7f5465a750e8 | 0 | yes | 16 |
232 | 0x7f5465a75100 | 0 | no | 24 |
264 | 0x7f5465a75120 | 0 | yes | 40 |
312 | 0x7f5465a75150 | 0 | no | 16 |
336 | 0x7f5465a75168 | 0 | no | 16 |
360 | 0x7f5465a75180 | 0 | yes | 64 |
432 | 0x7f5465a751c8 | 0 | yes | 72 |
512 | 0x7f5465a75218 | 0 | no | 128 |
648 | 0x7f5465a752a0 | 0 | yes | 368 |
1024 | 0x7f5465a75418 | 0 | yes | 1016 |
2048 | 0x7f5465a75818 | 0 | yes | 2040 |
4096 | 0x7f5465a76018 | 0 | yes | 4088 |
Now, let’s free 0x7f5465a75048:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7f5465a75018 | 0 | no | 8 |
16 | 0x7f5465a75028 | 0 | yes | 8 |
32 | 0x7f5465a75038 | 0 | no | 8 |
48 | 0x7f5465a75048 | 0 | yes | 16 |
72 | 0x7f5465a75060 | 0 | no | 32 |
112 | 0x7f5465a75088 | 0 | no | 88 |
208 | 0x7f5465a750e8 | 0 | yes | 40 |
If we were to also free the region on its left, 0x7f5465a75038, the blocks would be merged all the way up to 0x7f5465a75028, making a super-sized block out of what used to be 8 bytes, and one far larger than the 32 bytes the blocks account for individually (0x7f5465a75028 + 0x7f5465a75038 + 0x7f5465a75048), since the reclaimed overhead structs are now part of the addressable space.
Position | Address | PID | Free? | Size |
---|---|---|---|---|
0 | 0x7f5465a75018 | 0 | no | 8 |
16 | 0x7f5465a75028 | 0 | yes | 48 |
72 | 0x7f5465a75060 | 0 | no | 32 |
112 | 0x7f5465a75088 | 0 | no | 88 |
208 | 0x7f5465a750e8 | 0 | yes | 40 |
256 | 0x7f5465a75118 | 0 | no | 16 |
280 | 0x7f5465a75130 | 0 | no | 16 |
Of course this block is naturally bigger now, but there is no need, nor reason, to move it to the right-hand side just to follow the rules. In fact, doing so would be extremely undesirable, as we would have to travel further to reserve new blocks that are usually temporary in nature, especially in these hotspot regions.
So far our algorithm has worked beautifully, but it is not yet sufficient. In a production environment where the memory is in a fully chunked-up state, if a user wants to allocate even one byte more than the unallocated tail block can hold, the allocator will refuse to allocate, even if the total free memory suffices to fulfill the request.
For example, given our tails in this configuration:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
4194304 | 0x7fec44cd2018 | 0 | yes | 4194296 |
8388608 | 0x7fec450d2018 | 0 | yes | 8388600 |
16777216 | 0x7fec458d2018 | 0 | yes | 16777208 |
33554432 | 0x7fec468d2018 | 0 | yes | 33554424 |
67108864 | 0x7fec488d2018 | 0 | yes | 67108856 |
Reserving 67108857 bytes of memory, which is then rounded up to meet our double-word boundary expectation, would not be allowed under the technique so far: there is simply no single block that first-fit can find to satisfy such a request. Fortunately, the fix is easy: we just need to stitch memory back together, starting from the tail (the end tip of the memory) and moving toward the head, until the combined sum of those blocks is large enough for the request. We call this technique stitching.
Without stitching, we would not be able to reserve 77108856 bytes of memory since, again, the last block is way too small for that. But with stitching, this is what happens when we reserve 77108856 bytes of memory:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
4194304 | 0x7fec44cd2018 | 0 | yes | 4194296 |
8388608 | 0x7fec450d2018 | 0 | yes | 8388600 |
16777216 | 0x7fec458d2018 | 0 | yes | 40331640 |
57108864 | 0x7fec47f48998 | 0 | no | 77108856 |
Notice that we managed to reserve exactly that size, while transferring any extra bytes to the left buddy. The process goes like this: first, 0x7fec488d2018 is merged with 0x7fec468d2018, which leaves 0x7fec468d2018 with extra bytes. 0x7fec468d2018 then transfers its extra bytes to its left buddy, 0x7fec458d2018, widening it from 16777208 bytes to 40331640 bytes. Since 0x7fec458d2018 has received those extra bytes, 0x7fec468d2018 can no longer stay at its position, so it “indents” itself, ending up at ADDRESS 0x7fec47f48998. The behavior of such a stitched block remains the same as any other: if it is freed, the whole block is freed, and perhaps merged with its left or right buddy as appropriate; it also remains subject to being chunked up, where applicable.
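A sketch of stitching in the same bookkeeping-only style (header relocation, the “indenting,” is omitted here):

```c
/* Stitch free blocks together, walking back from the tail, until the
 * fused block can hold `want` bytes; each fused neighbor donates its
 * HDR bytes too. Any surplus is then handed to the free block on the
 * left, just as space-fitting would do. */
static block_t *stitch(block_t *tail, size_t want) {
    block_t *b = tail;
    while (b && b->free && b->size < want && b->prev && b->prev->free) {
        b->prev->size += HDR + b->size;  /* fuse into the left block   */
        b->prev->next = b->next;
        if (b->next) b->next->prev = b->prev;
        b = b->prev;
    }
    if (!b || !b->free || b->size < want)
        return NULL;                     /* a hurdle block stopped us  */
    if (b->size > want && b->prev && b->prev->free) {
        b->prev->size += b->size - want; /* surplus drifts to the left */
        b->size = want;                  /* the block "indents" itself */
    }
    b->free = false;
    return b;
}
```

Run against the map above, this walks exactly the path described: it fuses 0x7fec488d2018 into 0x7fec468d2018 (33554424 + 8 + 67108856 = 100663288 bytes), then drains the 23554432 surplus bytes into 0x7fec458d2018, reproducing the 40331640 and 77108856 figures in the table.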
Thus, as demonstrated, our algorithm is rather efficient in its use of memory: when a user requests 8 bytes of space and the allocator finds a 16-byte block, the block can be chunked down and space-fitted so as to avoid wasting space. It is also clearly fast when it needs to find a suitable block, as it uses the plain-and-simple first-fit algorithm. On top of that, it is relatively predictable in its behavior, thanks to the buddy allocator, albeit with some adjustments.
There is still room for improvement, however. One thing that needs attention: the current technique may fail to reserve memory, even if the total free size is adequate, when there are hurdle blocks that force the stitchable tail to start at a position past those hurdles. In other words, the effective tail becomes smaller, which significantly reduces the chance of reserving memory of a bigger size. To visualize this, imagine we have this configuration:
Position | Address | PID | Free? | Size |
---|---|---|---|---|
8388608 | 0x7f99ea135018 | 0 | yes | 8388600 |
16777216 | 0x7f99ea935018 | 0 | no | 16777208 |
33554432 | 0x7f99eb935018 | 0 | yes | 33554424 |
67108864 | 0x7f99ed935018 | 0 | no | 67108856 |
Notice that the block at POS 16777216 is a hurdle block because it sits in the middle. If we want to reserve 33554428 bytes (4 bytes more than what the block at POS 33554432 can offer), our allocator will not return any block, although from the free-size standpoint our memory is completely capable of offering the space. To fix this, we would need to defragment our memory by moving the hurdle blocks so that the stitchable tail extends past them; to do so, we may even need to move parts of our memory onto the disk. For now, this is not implemented due to its relative complexity, yet it is not rocket science.
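While the defragmentation itself is future work, detecting the condition is straightforward. A sketch, keeping to the bookkeeping-only style used so far:

```c
/* A request is satisfiable only if some contiguous run of free blocks
 * is big enough (each merge within a run reclaims one HDR). When no
 * run is big enough but the overall free space is, an allocated
 * "hurdle" block is in the way and defragmentation would be needed. */
static bool blocked_by_hurdle(block_t *head, size_t want) {
    size_t run = 0, best = 0, total = 0;
    for (block_t *b = head; b; b = b->next) {
        if (b->free) {
            run = run ? run + HDR + b->size : b->size; /* extend the run */
            if (run > best) best = run;
            total += b->size;    /* rough total of free space            */
        } else {
            run = 0;             /* an allocated block interrupts the run */
        }
    }
    return best < want && total >= want;
}
```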
Although we have only sketched the mechanism here rather than reproducing the allocator’s actual code, I can be quite sure that the real code stays rather simple, because we don’t play with chances. All memory blocks are initialized and known at any given time; they are, in a way, uniform in nature. They all follow exactly the same rules, eliminating the need for many ifs and elses.