Vibrant Bitmap
What is a Bitmap?
The term "bitmap" here refers to a type of index. If you came here thinking about graphics, you may navigate back.
Fundamental Concept
A bitmap constitutes a form of indexing technology within databases. It represents the values of a specific column as an element or form of a bit array or vector, where each bit indicates the presence or absence of a particular value for that column. Consequently, the most fundamental form involves assigning a bit index to each row and expressing presence or absence with 0 and 1.
For instance, consider a table possessing a gender column. Assume this table contains five rows, and the values of the gender column are as follows:
- Male
- Female
- Male
- Female
- Other
In this scenario, the bitmap index for the gender column can be represented as follows:
- Male: 10100
- Female: 01010
- Other: 00001
Practical Applications
It is widely understood that this is not merely for simple representation. To appropriately employ bitmap indexes, consider the following case:
1CREATE TABLE IF NOT EXISTS Users (
2 id INT PRIMARY KEY,
3 name VARCHAR(100),
4 active BOOLEAN,
5 sex BOOLEAN,
6 age INT
7);
Assume this table contains 1,000,000 rows. Furthermore, suppose the following query is executed:
1SELECT * FROM Users WHERE active = TRUE AND SEX = TRUE;
This query seeks rows where both the 'active' and 'sex' columns are TRUE. The most expeditious method to identify rows satisfying the condition where both columns are TRUE would be to utilize a bitwise AND operation.
- Assume that a bitmap index for the 'active' column,
1100110011...
(a 1,000,000-bit array), has been pre-indexed, and - Assume that a bitmap index for the 'sex' column,
1010101010...
(a 1,000,000-bit array), has been pre-indexed.
Then, by performing a bitwise AND operation on these two bitmap indexes, rows where both columns are TRUE can be rapidly identified. Due to its nature as a simple bitwise operation, it is likely to yield results significantly faster than other index scans.
Trade-offs
While bitmap indexes can be highly efficient, they entail several trade-offs. Although numerous other disadvantages exist, only one will be discussed at this juncture.
Specifically, bitmap indexes are more suitable for low cardinality (i.e., cases where the number of possible values in a column is small). Employing bitmap indexes for high-cardinality columns can lead to very large bitmaps.
When a bitmap index is used on a high-cardinality column, a separate bitmap must be generated for each unique value, which can necessitate substantial storage space. For instance, if a column contains 1,000,000 unique values, 1,000,000 bitmaps would need to be generated, which can be highly inefficient.
Our Approach
To substitute bitmaps in high-cardinality scenarios, Roaring Bitmap can be employed. While Roaring Bitmap possesses various characteristics, its most fundamental philosophical underpinning and characteristic is dynamically selecting the most suitable storage method based on data density to maximize both compression ratio and operation speed
.
Fundamental Concepts of Roaring Bitmap
Two-Stage Indexing
To store a single integer, Roaring Bitmap first undergoes the following two stages. While referred to as two-stage indexing, it can simply be conceptualized as a two-dimensional array.
- Extraction of the upper 16 bits: The upper 16 bits of the integer are extracted and used as an index. This becomes the index of the outer array within a two-dimensional array of 65536 in total length. This array stores the addresses of containers, which will be discussed subsequently.
- Extraction of the lower 16 bits: The lower 16 bits of the integer are extracted and used as a position within the container.
Containers
This container corresponds to the inner array in the two-dimensional array analogy. However, unlike the upper 16 bits, the internal structure of this container varies depending on the amount of data stored within it. Roaring Bitmap supports the following three types of containers:
- Array Container: This container is utilized when a small amount of data is stored internally. Specifically, it is employed to store 4096 or fewer elements. This container is implemented as a simple sorted array of integers. Elements can be located through binary search.
- Bitmap Container: This container is utilized when a large amount of data is stored internally. Specifically, it is employed to store more than 4096 elements. This container is implemented as a 65536-bit (8192-byte) bitmap. Each bit indicates whether the integer at that position is present.
- Run Container: This container is utilized for storing contiguous ranges of integers. For example, it is efficient for storing consecutive integers such as 1, 2, 3, 4, 5. This container is implemented as a list of (start value, length) pairs. If all values are present, it simply means that all values from the beginning to the end are present, so only (0, 65536) needs to be stored.
This approach enables Roaring Bitmap to utilize memory very efficiently and to provide optimal performance across diverse data densities.
Functionality of Roaring Bitmap
Basic Functionality
Roaring Bitmap provides the following basic functionalities:
- Insertion: Adds a new integer to the bitmap. During this process, an appropriate container is selected, or the container can be converted if necessary.
- Deletion: Removes an existing integer from the bitmap. Appropriate processing also occurs during this process, depending on the state of the container.
- Search: Verifies the existence of a specific integer within the bitmap. This process is executed with high speed.
Set Operations
Roaring Bitmap also supports the following set operations:
- Union: Generates a new bitmap containing all elements from two bitmaps.
- Intersection: Generates a new bitmap containing only elements present in both bitmaps.
- Symmetric Difference: Generates a new bitmap containing elements present in only one of the two bitmaps.
These operations are optimized based on the bitmap's container type and are performed with high speed.
In the Go Language
In the Go language, Roaring Bitmap can be readily utilized by employing the github.com/RoaringBitmap/roaring
package. This package provides all functionalities of Roaring Bitmap and is optimized to align with the characteristics of the Go language.
Installation
1go get github.com/RoaringBitmap/roaring/v2
Usage Example
1package main
2
3import (
4 "log"
5
6 "github.com/RoaringBitmap/roaring/v2"
7)
8
9func main() {
10 b1 := roaring.BitmapOf(1, 2, 3, 4) // Initialize with 1, 2, 3, 4
11 b2 := roaring.BitmapOf(2, 4, 6, 8, 12) // Initialize with 2, 4, 6, 8, 12
12
13 b1.Add(5) // Add 5
14 b2.Add(10) // Add 10
15
16 b2.Remove(12) // Remove 12
17
18 and := roaring.FastAnd(b1, b2)
19 or := roaring.FastOr(b1, b2)
20
21 // Since bitmap operations modify the original, use a clone
22 xor := b1.Clone()
23 xor.Xor(b2)
24
25 log.Printf("and result: %v", and.ToArray())
26 log.Printf("or result: %v", or.ToArray())
27 log.Printf("xor result: %v", xor.ToArray())
28}
12025/09/08 22:16:27 and result: [2 4]
22025/09/08 22:16:27 or result: [1 2 3 4 5 6 8 10]
32025/09/08 22:16:27 xor result: [1 3 5 6 8 10]
Key functionalities not demonstrated in the above example code include:
Contains(x uint32) bool
: Checks if a specific value exists in the bitmap.GetCardinality() uint64
: Returns the count of elements included in the bitmap.Rank(x uint32) uint64
: Returns the count of elements less than or equal to a specific value.ToBytes() ([]byte, error)
: Serializes the bitmap into a byte slice.FromBytes(data []byte) error
: Restores the bitmap from a byte slice.Clone() *Bitmap
: Creates a clone of the bitmap.Clear()
: Initializes the bitmap.NextValue(x uint32) int64
: Returns the next element greater than or equal to x.PreviousValue(x uint32) int64
: Returns the previous element less than x.Maximum() uint32
: Returns the largest element in the bitmap.Minimum() uint32
: Returns the smallest element in the bitmap.
These functionalities are available. For further details, please refer to the official documentation.
Notably, the 'roaring' package includes support for uint64
in addition to uint32
. Since it provides an almost identical interface, it can be directly interchanged.
1package main
2
3import (
4 "log"
5
6 "github.com/RoaringBitmap/roaring/v2/roaring64"
7)
8
9func main() {
10 b1 := roaring64.BitmapOf(1, 2, 3, 4) // Initialize with 1, 2, 3, 4
11 b2 := roaring64.BitmapOf(2, 4, 6, 8, 12) // Initialize with 2, 4, 6, 8, 12
12
13 b1.Add(5) // Add 5
14 b2.Add(10) // Add 10
15
16 b2.Remove(12) // Remove 12
17
18 and := roaring64.FastAnd(b1, b2)
19 or := roaring64.FastOr(b1, b2)
20
21 // Since bitmap operations modify the original, use a clone
22 xor := b1.Clone()
23 xor.Xor(b2)
24
25 log.Printf("and result: %v", and.ToArray())
26 log.Printf("or result: %v", or.ToArray())
27 log.Printf("xor result: %v", xor.ToArray())
28}
Conclusion
Roaring Bitmap represents a robust instrument for efficiently storing and manipulating sets of integers with high cardinality. In the Go language, the github.com/RoaringBitmap/roaring
package facilitates the straightforward utilization of all Roaring Bitmap functionalities. This package maximizes memory usage and performance through diverse container types and optimized operations. Roaring Bitmap can be employed in various domains, including database indexing, log analysis, and statistical computations, to achieve efficient data processing.
Roaring Bitmap exhibits particularly high performance with large datasets and can be beneficially employed in diverse applications. In conjunction with Go language's concise and efficient syntax, Roaring Bitmap furnishes developers with a powerful instrument. Continued development and further optimization of Roaring Bitmap, alongside additional functionalities, are anticipated.
Why did I undertake this endeavor? For indexing by tag and for rapid union operations in time-series databases.