Cool, so how do you replicate huuuuuge databases? Well, you don’t. You partition them first and then replicate the partitions. So we’re just adding additional complexity to an already complex system? Oh, yes. It will be fun.
Partitioning is important for scaling systems efficiently: it distributes the load across multiple nodes, and replicating the partitions keeps the system fault tolerant.
Partitioning of Key-Value databases
So you partition your database into 25 nodes, then it should be able to handle 25 times the load, right? Nah, the world isn’t fair, and neither is partitioning.
When data is partitioned unevenly, the partitioning is said to be skewed. A partition that receives a disproportionately high load compared to the other partitions is called a hot spot.
The goal of partitioning should be to distribute the data and query load evenly across nodes.
Let’s say you are partitioning a simple key-value datastore. You can do it in two ways:
Key range partitioning: We use key ranges to define boundaries for partitions, similar to how encyclopaedias are organised. The keys are kept in sorted order and assigned to specific partitions, making range queries efficient since not all partitions need to be queried. However, this approach can cause hot spots when certain key ranges are accessed far more than others, leading to an imbalanced load. To handle this, partition boundaries must be adapted to the data, and key selection needs to be strategic. Predicting which keys will create hot spots, though, can be challenging.
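To make this concrete, here is a minimal sketch of key-range partitioning; the boundary keys and partition count are made up for illustration. Each partition owns a contiguous range of keys, and a binary search over the boundaries finds the owning partition, so a range scan only touches the partitions its range overlaps.

```python
import bisect

# Hypothetical partition boundaries: like encyclopaedia volumes A-C, D-G, H-O, ...
# Partition i holds keys >= boundaries[i-1] and < boundaries[i].
boundaries = ["d", "h", "p", "t"]  # 5 partitions: <d, d-h, h-p, p-t, >=t

def partition_for(key: str) -> int:
    """Return the index of the partition responsible for `key`."""
    return bisect.bisect_right(boundaries, key)

def partitions_for_range(start: str, end: str) -> range:
    """A range scan only needs the partitions whose ranges overlap [start, end]."""
    return range(partition_for(start), partition_for(end) + 1)

print(partition_for("cat"))                   # 0
print(partition_for("ferret"))                # 1
print(list(partitions_for_range("e", "q")))   # [1, 2, 3] -- not all partitions
```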
Key Hash partitioning: To prevent hot spots in key-range partitioning, hash functions can be used to distribute keys. Even similar keys will have different hashes, provided a good hash function is used. Hash-based partitioning ensures even distribution of keys, but sacrifices the efficiency of range queries, as all partitions must be queried.
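A rough sketch of hash partitioning under the same made-up partition count, using MD5 as a stand-in for whatever hash function a real system would use (Python's built-in hash() is randomized per process, so a stable hash is needed). Each partition owns an equal slice of the hash space:

```python
import hashlib

NUM_PARTITIONS = 8
HASH_SPACE = 2 ** 32

def stable_hash(key: str) -> int:
    # Stable 32-bit hash of the key, independent of the Python process.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def hash_partition(key: str) -> int:
    # Each partition owns an equal slice of the hash space, so even
    # similar keys end up scattered across unrelated partitions.
    return stable_hash(key) * NUM_PARTITIONS // HASH_SPACE

print(hash_partition("user:1001"), hash_partition("user:1002"))
# Range queries on the original key can no longer be narrowed down:
# adjacent keys live on unrelated partitions, so every partition must be asked.
```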
A combined approach uses a compound primary key, where the first part is hashed to determine the partition, and the second part can be used for range queries (e.g., [user_id, timestamp]). This allows for range queries on the second part.
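A sketch of the compound-key idea, loosely modelled on how Cassandra-style primary keys behave; the in-memory "table" below is an invented stand-in. The first column (user_id) is hashed to pick the partition, and rows inside each partition are kept sorted by the second column (timestamp), so a per-user time-range scan stays on a single partition.

```python
import bisect
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8

def partition_of(user_id: str) -> int:
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# partition -> user_id -> sorted list of (timestamp, event)
partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]

def write(user_id: str, timestamp: int, event: str) -> None:
    rows = partitions[partition_of(user_id)][user_id]
    bisect.insort(rows, (timestamp, event))      # keep rows sorted by timestamp

def read_range(user_id: str, start: int, end: int):
    """Range query on timestamp hits only the one partition owning user_id."""
    rows = partitions[partition_of(user_id)][user_id]
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_left(rows, (end + 1,))    # timestamps are ints, so end is inclusive
    return rows[lo:hi]

write("alice", 10, "login")
write("alice", 20, "post")
write("alice", 30, "logout")
print(read_range("alice", 15, 30))   # [(20, 'post'), (30, 'logout')]
```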
However, hash keys don't fully eliminate skew. In cases like social media accounts with millions of followers, hot spots may still occur, requiring custom solutions from developers.
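One common application-level workaround is to split a known hot key by appending a small random number, spreading its writes across several partitions at the cost of reading all the sub-keys back and merging them. The key names and split factor below are made up for illustration.

```python
import random

HOT_KEYS = {"celebrity:42"}   # keys the application knows are hot
SPLIT_FACTOR = 10             # number of sub-keys to spread a hot key over

def write_key(key: str) -> str:
    # Writes to a hot key are spread over key#0 .. key#9,
    # which hash to different partitions.
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SPLIT_FACTOR)}"
    return key

def read_keys(key: str) -> list[str]:
    # Reads of a hot key must fetch every sub-key and merge the results.
    if key in HOT_KEYS:
        return [f"{key}#{i}" for i in range(SPLIT_FACTOR)]
    return [key]
```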
Partitioning and Secondary Indexes
Secondary indexes don’t map neatly onto partitions: since they don’t uniquely identify rows, you have to take some extra steps to handle them. There are two ways to handle secondary indexes in a partitioned database:
By Document (Local Index):
Secondary indexes are stored within the same partition as the primary data, so each partition’s index only covers its own documents. For example, if you search for red Fender guitars, you must query all partitions and combine the results. While this method (scatter/gather) slows down reads by secondary index, it keeps writes fast, since both the primary data and the secondary index are updated within a single partition.
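A toy sketch of the scatter/gather read path with document-partitioned (local) indexes; the guitar data and structure here are invented for illustration.

```python
# Each partition keeps a local secondary index over only its own documents.
partitions = [
    {  # partition 0
        "docs": {1: {"brand": "Fender", "color": "red"},
                 2: {"brand": "Gibson", "color": "black"}},
        "index": {("color", "red"): [1], ("color", "black"): [2]},
    },
    {  # partition 1
        "docs": {7: {"brand": "Fender", "color": "red"}},
        "index": {("color", "red"): [7]},
    },
]

def find_by_color(color: str):
    """Scatter the query to every partition, then gather and merge results."""
    results = []
    for part in partitions:                        # must ask ALL partitions
        for doc_id in part["index"].get(("color", color), []):
            results.append(part["docs"][doc_id])
    return results

print(find_by_color("red"))   # merged results from both partitions
```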
By Term (Global Index):
In a global index strategy, secondary indexes are stored in separate partitions based on the indexed terms, not alongside the primary data. For example, if you’re searching for red Fender guitars, the partition containing the index for color will include references (like primary keys) to all matching items, regardless of which partition the actual data resides in. This allows the search to be directed to specific partitions rather than querying all of them.
Global indexing makes writes more complex, though: a single document write may touch several terms that live on different index partitions, so in practice these index updates are often applied asynchronously.
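A sketch of the term-partitioned (global) index read path, using the same made-up guitar data; which index partition owns a term is decided here by hashing the term itself.

```python
import hashlib

NUM_INDEX_PARTITIONS = 4

def index_partition_for(term: str) -> int:
    digest = hashlib.md5(term.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_INDEX_PARTITIONS

# Each index partition maps term -> primary keys, which may live on ANY data partition.
index_partitions = [dict() for _ in range(NUM_INDEX_PARTITIONS)]

def index_document(doc_id: int, field: str, value: str) -> None:
    term = f"{field}:{value}"
    index_partitions[index_partition_for(term)].setdefault(term, []).append(doc_id)

def find(field: str, value: str) -> list[int]:
    """The read goes to exactly one index partition instead of all of them."""
    term = f"{field}:{value}"
    return index_partitions[index_partition_for(term)].get(term, [])

index_document(1, "color", "red")
index_document(7, "color", "red")   # doc 7 lives on a different data partition
print(find("color", "red"))         # [1, 7] from a single index partition
```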
Rebalancing the Partitions
Now you have multiple partitions, but over time the data grows, the query load changes, and nodes get added or removed, so partitions have to be moved between nodes. This is called rebalancing. So how do you approach that?
Hash Mod N:
Data is assigned to partitions using hash(key) mod N, where N is the number of nodes. However, this method is inefficient when nodes are added or removed, as most keys change partition and get reshuffled, leading to unnecessary data movement. This strategy is generally discouraged.

Fixed Number of Partitions:
Many partitions are created up front, far more than there are nodes, and several partitions are assigned to each node; the total number of partitions across the cluster stays fixed. When a new node is added, it takes over some partitions from existing nodes, so data movement is limited to whole-partition reassignments (the sketch after this list contrasts this with hash mod N). However, it’s challenging to pick the right number of partitions initially, as future data growth can vary.

Dynamic Partitioning:
As data grows, partitions are split dynamically, and shrinking partitions can be merged. Nodes handle multiple partitions, and new partitions are created or deleted as needed. This approach adapts to data changes but requires starting with a reasonable number of partitions; pre-splitting can help if the key distribution is predictable.

Partitioning Proportional to Nodes:
A fixed number of partitions is maintained per node, so the total number of partitions grows with the cluster. When a new node is added, it picks some existing partitions at random, splits them, and takes over half of each. This relies on hash-based partitioning, and the random choice can lead to imbalanced splits, though more advanced algorithms can mitigate unfair partition distribution.
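To make the difference concrete, here is a small sketch with made-up key counts and cluster sizes, comparing how much data has to move when one node is added under hash mod N versus the fixed-number-of-partitions scheme.

```python
import hashlib

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user:{i}" for i in range(10_000)]

# --- hash mod N: going from 4 to 5 nodes reassigns most keys ---
moved = sum(1 for k in keys if stable_hash(k) % 4 != stable_hash(k) % 5)
print(f"mod N: {moved / len(keys):.0%} of keys move")        # roughly 80%

# --- fixed number of partitions: keys never change partition; only whole
#     partitions are handed from the old nodes to the new one ---
NUM_PARTITIONS = 256
old_assignment = {p: p % 4 for p in range(NUM_PARTITIONS)}   # 4 nodes
new_assignment = dict(old_assignment)
for p in range(0, NUM_PARTITIONS, 5):                        # hand ~1/5 to node 4
    new_assignment[p] = 4
moved_parts = sum(1 for p in range(NUM_PARTITIONS)
                  if old_assignment[p] != new_assignment[p])
print(f"fixed partitions: {moved_parts}/{NUM_PARTITIONS} partitions move, "
      f"~{moved_parts / NUM_PARTITIONS:.0%} of the data")
```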
Teams prefer manual rebalancing over automatic rebalancing to avoid operational surprises and unwanted failures: automatic rebalancing can sometimes produce inefficient data placement and overload nodes that are already struggling. Having context about the system and the business logic can prevent such mishaps, hence manual rebalancing is preferred.
Request Routing
When data, nodes, and partitions are being rebalanced, it’s important to ensure clients can still reach the correct nodes. A client needs to know the IP address and port of the node to send its request to. There are three ways to manage this:
Node Forwarding: If a node receives a request for data it doesn't have, it forwards the request to the correct node, retrieves the data, and sends it back to the client.
Routing Tier: A separate routing layer determines which node should handle the request.
Direct Client Knowledge: Clients are aware of which node and partition to contact.
The challenge in all cases is ensuring that clients or routing tiers know about changes in partition-to-node assignments. This can be managed using an external service (e.g., Zookeeper) that tracks node changes and updates either the routing tier or clients accordingly.
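A minimal sketch of the routing-tier idea; everything here, including the assignment-change callback, is a made-up stand-in for what a coordination service like ZooKeeper would provide. The routing tier keeps a partition-to-node map and forwards each request to the node that currently owns the key’s partition, updating the map whenever an assignment changes.

```python
import hashlib

NUM_PARTITIONS = 16

def partition_of(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

class RoutingTier:
    """Keeps an up-to-date partition -> node map and routes requests."""

    def __init__(self, assignment: dict[int, str]):
        self.assignment = dict(assignment)   # partition id -> "host:port"

    def on_assignment_change(self, partition: int, node: str) -> None:
        # In practice this callback would be driven by watches on a
        # coordination service (e.g. ZooKeeper) as rebalancing happens.
        self.assignment[partition] = node

    def route(self, key: str) -> str:
        return self.assignment[partition_of(key)]

router = RoutingTier({p: f"10.0.0.{p % 3 + 1}:6379" for p in range(NUM_PARTITIONS)})
print(router.route("user:1001"))             # node currently owning that partition
router.on_assignment_change(partition_of("user:1001"), "10.0.0.9:6379")
print(router.route("user:1001"))             # reflects the rebalanced assignment
```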
Okay, that’s it for this chapter. This one was easy and short, just how I like it :P