One of the things that often confuses users who are familiar with other databases when they try Redis, is the lack of visibility into the database: There are no set of tables or collections to see, just a plain, flat key space which could (potentially) have millions of keys. The ability to iterate cheaply over this key space thus becomes very important to familiarize oneself with the database contents.
Iterating over the Redis key space has other important use cases too a few that come to mind are:
- garbage collection or cleaning up keys that match a certain pattern
- data movement and schema changes or moving a certain set keys to another data structure
- debugging, data sampling, data fixes or finding and fixing all keys which were messed up by a recent change
In this post, we will dig deep into various key space iteration options available in Redis.
O(N) Iterators: KEYS
The Redis KEYS command returns all the keys in the database that match a pattern (or all the keys in the key space). Similar commands for fetching all the fields stored in a hash is HGETALL and for all fetching the members of a SMEMBERS. The keys in Redis themselves are stored in a dictionary (aka a hash table). The KEYS command works by iterating over this dictionary and sending everything that matches the pattern out as a single Array reply. The other commands work similarly.
The performance of such an operation depends on the size of the collection i.e. O(N). Thus, use of KEYS is severely discouraged in production environments with a large number of keys. Redis being single threaded, gets blocked during this iteration thus blocking other operations. Thus, KEYS should be used only for debugging and other special occasions where performance is not a concern (like when the database has been brought offline to apply a data fix). The other important thing to remember about this algorithm is that it sends out all the matching keys together as a single response. This might be extremely convenient when the key space is small but will create multiple issues on a large key space. KEYS however is a favorite command among developers in their own dev environments.
KEYS in Action:
127.0.0.1:6379> MSET one 1 two 2 three 3 four 4 OK # All the keys 127.0.0.1:6379> keys * 1) "four" 2) "three" 3) "two" 4) "one" # keys that begin with the letter 't' 127.0.0.1:6379> keys t* 1) "three" 2) "two" # keys that have a 'ee' in them 127.0.0.1:6379> keys *ee* 1) "three"
Cursor Based Iterators: SCAN
The SCAN and it’s sister commands, SSCAN (for sets), HSCAN (for hashes) and ZSCAN (for sorted sets) provide the cursor based approach to iterating over Redis data structures. They have been available in Redis since 2.8.0.
Keys are returned in incremental iterations with constant time guarantee for each iteration. A cursor (an integer is this case) is returned when the iterations is initialized and an updated cursor is returned and every iteration. An iteration cycle begins when the cursor is set to 0 in the SCAN request, and terminates when the cursor returned by the server is 0. Due to nuances of Redis architecture and the cursor algorithm implementation here are some peculiarities of this approach:
- A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration.
- A full iteration never returns any element that was NOT present in the collection from the start to the end of a full iteration.
- A given element may be returned multiple times. It is up to the application to handle the case of duplicated elements
- Elements that were not constantly present in the collection during a full iteration, may be returned or not: it is undefined.
- A number of elements returned during each count varies and can be 0 too. However, the iteration is not complete until the server returns the cursor value of 0.
- The COUNT option can be used to limit the number of elements returned in each iteration. The default value is 10. However, it is considered only a suggestion and not enforced in all cases. COUNT value can be changed during each iteration call.
- The MATCH option allows specifying of patterns like the KEYS command allows.
- The cursor implementation is completely stateless on the server side. That allows (potentially) infinite iterations to start in parallel. Also, there are no requirements of ensuring that an iteration continues up to the end and can be stopped anytime.
Despite its peculiarities, SCAN is a very useful command and the right command to pick for key space iterations for most use cases.Despite its peculiarities, SCAN is a very useful command and the right command to pick for key space iterations for most use cases.Click To Tweet
SCAN in Action
127.0.0.1:6379> flushdb OK 127.0.0.1:6379> keys * (empty list or set) 127.0.0.1:6379> debug populate 33 OK 127.0.0.1:6379> scan 0 COUNT 5 1) "4" 2) 1) "key:1" 2) "key:9" 3) "key:13" 4) "key:29" 5) "key:23" 127.0.0.1:6379> scan 4 1) "42" 2) 1) "key:24" 2) "key:28" 3) "key:18" 4) "key:16" 5) "key:12" 6) "key:2" 7) "key:6" 8) "key:31" 9) "key:27" 10) "key:19" 127.0.0.1:6379> scan 42 1) "9" 2) 1) "key:3" 2) "key:4" 3) "key:20" 4) "key:8" 5) "key:32" 6) "key:5" 7) "key:26" 8) "key:10" 9) "key:21" 10) "key:14" 127.0.0.1:6379> scan 9 COUNT 100 1) "0" 2) 1) "key:25" 2) "key:30" 3) "key:22" 4) "key:17" 5) "key:15" 6) "key:0" 7) "key:11" 8) "key:7"
Under the Hood
The algorithm that SCAN (and it’s sister commands) use to scan through is an intriguing one and leads to some of the characteristics of the command that we described above. Antirez described it at a high level in his blog post and it is explained (slightly better) in the comments above the implementation (function dictScan). Describing it in details will make this post too long so I will give enough description to make it’s implications evident.
- Most Redis data structures are internally represented as dictionaries (at least partially in the case of sorted sets). They are implemented as power-of-two sized hash tables with chaining for collisions. The challenge in writing a cursor based iterative algorithm here is to be able to deal with growing and shrinking of the hash without sacrificing the Redis principles of simplicity (of the API) and speed.
- SCAN basically scans a bunch of hash buckets every iteration and returns the elements matching the pattern in those. Since it looks at only a fixed list of buckets some time iterations might return no values at all.
- The cursor that is returned is basically an offset into the hash table being iterated. It deals with the growing and shrinking of hash tables (i.e. rehashing) by clever manipulation of higher level bits of the offset while increasing the offset along with the properties of the hash table. Implications from this approach are that new elements added during the iteration may or may not be returned. However, the cursor itself, wouldn’t need to restart on a change in size of the hash table.
- A given bucket must be visited just once and all its keys must be returned in a single go. This is again to ensure that hash resizing (i.e. rehashing) doesn’t complicate iteration progress. However, this leads to the COUNT argument being not strictly enforceable.
- Since the above approach is completely stateless on the server side, it basically implies that iterations can be stopped or a huge number of iterations can be started in parallel without no increased memory usage.
At a high level, two choices are available to iterate over the Redis key space are:
- Use KEYS when performance is not a concern or when the key space is reasonably sized.
- At all other times, use SCAN.
Did you know we now support Redis? Fully managed Redis® hosting in the safety of your own cloud account. Leverage AWS/Azure credits for Redis hosting. Find out more here.