Vibrant Bitmap
What is a Bitmap?
Here, a bitmap is a type of index. If you came in thinking of graphics, you can press the back button.
Basic Concept
A bitmap is a form of indexing technology in databases. It represents the values of a specific column as an element or form of a bit array or vector, where each bit indicates whether a specific value for that column exists. Thus, the most basic form assigns a bit index to each row and represents existence as 0 or 1.
For example, let's assume there is a table with a gender column. If this table has 5 rows, and the values of the gender column are as follows:
- Male
- Female
- Male
- Female
- Other
In this case, the bitmap index for the gender column can be represented as follows:
- Male: 10100
- Female: 01010
- Other: 00001
So, where is it actually used?
Therefore, everyone would probably think that it's not simply for representing that. To properly utilize bitmap indexes, let's consider the following scenario.
1CREATE TABLE IF NOT EXISTS Users (
2 id INT PRIMARY KEY,
3 name VARCHAR(100),
4 active BOOLEAN,
5 sex BOOLEAN,
6 age INT
7);
Let's assume this table has 1,000,000 rows. And let's say we execute the following query.
1SELECT * FROM Users WHERE active = TRUE AND SEX = TRUE;
This query finds rows where both the active and sex columns are TRUE. The fastest way to find rows that satisfy the condition where both columns are TRUE would be to use a bitwise AND operation.
- Let's assume a bitmap index for the active column,
1100110011...(a bit array of 1,000,000 bits), has been pre-indexed, and - A bitmap index for the sex column,
1010101010...(a bit array of 1,000,000 bits), has been pre-indexed.
Then, by performing a bitwise AND operation on the two bitmap indexes, rows where both columns are TRUE can be found quickly. Since it is a simple bitwise operation, results can be obtained much faster than other index scans.
Trade-offs
While bitmap indexes can be highly efficient, they come with several trade-offs. Although there are many other disadvantages, we will address only one for now.
Specifically, bitmap indexes are more suitable for low cardinality (i.e., when the number of possible values in a column is small). Using a bitmap index on a high-cardinality column can result in a very large bitmap.
If a bitmap index is used on a high-cardinality column, a separate bitmap must be created for each unique value, which can require a significant amount of storage space. For instance, if a column has 1,000,000 unique values, 1,000,000 bitmaps would need to be generated, which can be highly inefficient.
So We Use
To replace bitmaps in high cardinality scenarios, we can use something called Roaring Bitmap. Roaring Bitmap has various characteristics, but its greatest underlying philosophy and characteristic is to dynamically select the most suitable storage method based on data density to maximize both compression ratio and operation speed.
Basic Concepts of Roaring Bitmap
Two-Phase Indexing
To store a single integer, Roaring Bitmap first undergoes the following two phases. While it's called two-phase indexing, you can simply think of it as a two-dimensional array.
- Extracting the higher 16 bits: The higher 16 bits of the integer are extracted and used as an index. This becomes the index of the outer array within a two-dimensional array of 65536 in length. This array stores the addresses of the containers, which will be discussed later.
- Extracting the lower 16 bits: The lower 16 bits of the integer are extracted and used as the position within the container.
Containers
These containers, as analogized earlier, correspond to the inner arrays within the two-dimensional array. However, unlike the higher 16 bits, the internal structure of these containers varies depending on the amount of internal data. Roaring Bitmap supports the following three types of containers:
- Array Container: This container is used when a small amount of data is stored internally. Specifically, it is used to store up to 4096 elements. This container is simply implemented as a sorted array of integers. Elements can be found using binary search.
- Bitmap Container: This container is used when a large amount of data is stored internally. Specifically, it is used to store more than 4096 elements. This container is implemented as a 65536-bit (8192-byte) bitmap. Each bit indicates whether an integer at that position exists.
- Run Container: This container is used to store consecutive ranges of integers. For example, it is efficient for storing consecutive integers such as 1, 2, 3, 4, 5. This container is implemented as a list of (start value, length) pairs. If all values are present, it simply means all values from the beginning to the end are present, so only (0, 65536) needs to be stored.
This approach enables Roaring Bitmap to utilize memory very efficiently and provide optimal performance for various data densities.
Features of Roaring Bitmap
Basic Features
Roaring Bitmap provides the following basic features:
- Insertion: Adds a new integer to the bitmap. During this process, an appropriate container can be selected or converted as needed.
- Deletion: Removes an existing integer from the bitmap. During this process, appropriate handling is performed based on the container's state.
- Search: Checks whether a specific integer exists in the bitmap. This process is performed very quickly.
Set Operations
Roaring Bitmap also supports the following set operations:
- Union: Creates a new bitmap containing all elements from two bitmaps.
- Intersection: Creates a new bitmap containing only elements present in both bitmaps.
- Symmetric Difference: Creates a new bitmap containing elements present in only one of the two bitmaps.
These operations are optimized based on the bitmap's container type and are executed very quickly.
In Go Language
In the Go language, the github.com/RoaringBitmap/roaring package can be used to easily leverage Roaring Bitmap. This package provides all the functionalities of Roaring Bitmap and is optimized for the characteristics of the Go language.
Installation
1go get github.com/RoaringBitmap/roaring/v2
Usage Example
1package main
2
3import (
4 "log"
5
6 "github.com/RoaringBitmap/roaring/v2"
7)
8
9func main() {
10 b1 := roaring.BitmapOf(1, 2, 3, 4) // Initialize with 1, 2, 3, 4
11 b2 := roaring.BitmapOf(2, 4, 6, 8, 12) // Initialize with 2, 4, 6, 8, 12
12
13 b1.Add(5) // Add 5
14 b2.Add(10) // Add 10
15
16 b2.Remove(12) // Remove 12
17
18 and := roaring.FastAnd(b1, b2)
19 or := roaring.FastOr(b1, b2)
20
21 // Bitmap operations modify the original, so use a clone
22 xor := b1.Clone()
23 xor.Xor(b2)
24
25 log.Printf("and result: %v", and.ToArray())
26 log.Printf("or result: %v", or.ToArray())
27 log.Printf("xor result: %v", xor.ToArray())
28}
12025/09/08 22:16:27 and result: [2 4]
22025/09/08 22:16:27 or result: [1 2 3 4 5 6 8 10]
32025/09/08 22:16:27 xor result: [1 3 5 6 8 10]
Major functions not shown in the example code above include:
Contains(x uint32) bool: Checks if a specific value exists in the bitmap.GetCardinality() uint64: Returns the number of elements contained in the bitmap.Rank(x uint32) uint64: Returns the number of elements less than or equal to a specific value.ToBytes() ([]byte, error): Serializes the bitmap into a byte slice.FromBytes(data []byte) error: Restores the bitmap from a byte slice.Clone() *Bitmap: Creates a clone of the bitmap.Clear(): Initializes the bitmap.NextValue(x uint32) int64: Returns the next element greater than or equal to x.PreviousValue(x uint32) int64: Returns the next element less than x.Maximum() uint32: Returns the largest element in the bitmap.Minimum() uint32: Returns the smallest element in the bitmap.
For more details, please refer to the official documentation.
Note that the roaring package internally includes a package that supports uint64 in addition to uint32. Since they offer almost identical interfaces, they can be directly interchanged.
1package main
2
3import (
4 "log"
5
6 "github.com/RoaringBitmap/roaring/v2/roaring64"
7)
8
9func main() {
10 b1 := roaring64.BitmapOf(1, 2, 3, 4) // Initialize with 1, 2, 3, 4
11 b2 := roaring64.BitmapOf(2, 4, 6, 8, 12) // Initialize with 2, 4, 6, 8, 12
12
13 b1.Add(5) // Add 5
14 b2.Add(10) // Add 10
15
16 b2.Remove(12) // Remove 12
17
18 and := roaring64.FastAnd(b1, b2)
19 or := roaring64.FastOr(b1, b2)
20
21 // Bitmap operations modify the original, so use a clone
22 xor := b1.Clone()
23 xor.Xor(b2)
24
25 log.Printf("and result: %v", and.ToArray())
26 log.Printf("or result: %v", or.ToArray())
27 log.Printf("xor result: %v", xor.ToArray())
28}
Conclusion
Roaring Bitmap is a powerful tool capable of efficiently storing and manipulating high-cardinality integer sets. In the Go language, using the github.com/RoaringBitmap/roaring package allows for easy utilization of all Roaring Bitmap functionalities. This package maximizes memory usage and performance through various container types and optimized operations. Roaring Bitmap can be applied in diverse fields such as database indexing, log analysis, and statistical computations to implement efficient data processing.
Roaring Bitmap demonstrates high performance, especially with large datasets, and can be usefully employed in various applications. Combined with Go language's concise and efficient syntax, Roaring Bitmap offers a powerful tool for developers. Further advancements and optimizations of Roaring Bitmap are anticipated in the future.
Why did I do this, you ask? For indexing each tag and fast union operations in time-series databases.