Background

A colleague recently presented a session on Redlock in a study group I participate in.

The content was so good that I wanted to dig deeper and build my own understanding.

It helps to review the Redlock algorithm beforehand.

What are locks for?

Locks are mainly used to ensure efficiency and correctness.

Efficiency

Prevent the same work from being executed redundantly.

  • e.g., N nodes running the same heavy task (taking 10 minutes) simultaneously, wasting time and compute cost.

Correctness

Enable consistent and accurate data processing on shared resources across concurrent processes.

  • e.g., N nodes processing a user’s withdrawal simultaneously, causing the account to be debited N times.

According to Martin Kleppmann, if you are considering a Redis lock purely for efficiency, Redlock is not worth it; a single Redis instance is sufficient.

| Item | Single Redis Lock | Redis Redlock Algorithm |
| --- | --- | --- |
| Lock Target | Single Redis instance | 5 independent Redis instances |
| Lock Creation Method | `SET key value NX PX <TTL>` | Attempt `SET key value NX PX <TTL>` on all 5 nodes |
| Success Condition | Lock acquired on the one Redis | Lock acquired on a majority (3 of 5) of nodes |
| Failure Handling | Lock information lost if Redis fails | Majority lock remains safe even if some nodes fail |
| Split-Brain Handling | Impossible | Partially possible (not perfect) |
| Consistency | Weak (single instance) | Stronger during lock acquisition (multi-instance) |
| Complexity | Simple (easy to implement) | Complex (must handle acquisition time, clock drift) |
| Fault Tolerance | Low | Relatively higher |
| Performance | Fast (single node access) | Potentially slower (talks to 5 nodes) |
| Main Use Case | Small systems, single-server environments | Globally distributed systems, high-availability lock systems |
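
For reference, a single-instance lock from the left column can be sketched with redis-py roughly as follows. The host, key name, and TTL are illustrative assumptions; the Lua release script follows the commonly recommended compare-and-delete pattern, not a specific library’s API.

```python
# A minimal sketch of a single-instance Redis lock (SET key value NX PX <TTL>).
# Host/port, key name, and TTL are illustrative assumptions.
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)

LOCK_KEY = "lock:heavy-task"
TTL_MS = 30_000  # lock lease; should exceed the expected critical-section time


def acquire(key: str, ttl_ms: int) -> str | None:
    """SET key value NX PX <TTL>; returns the random token on success."""
    token = str(uuid.uuid4())  # random value so only the owner can release
    if r.set(key, token, nx=True, px=ttl_ms):
        return token
    return None


# Release only if we still own the lock. The compare-and-delete must be atomic,
# hence the Lua script; a plain GET+DEL could delete another client's lock.
RELEASE_LUA = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""


def release(key: str, token: str) -> bool:
    return bool(r.eval(RELEASE_LUA, 1, key, token))


if __name__ == "__main__":
    token = acquire(LOCK_KEY, TTL_MS)
    if token:
        try:
            pass  # critical section: the heavy task goes here
        finally:
            release(LOCK_KEY, token)
```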

If the Redis node crashes unexpectedly

Lock acquisition times out → application responses are delayed or business logic fails to execute.

Incomplete Lock

A single Redis node cannot guarantee high availability or stability under failure.

  1. Failure case 1: the lock lease expires during a GC stop-the-world (STW) pause
    • The duration of an STW pause is unpredictable.
    • Even concurrent GC cannot avoid STW pauses entirely.
  2. Failure case 2: after acquiring the lock, external I/O (API, DB, HDFS…) suffers packet loss (see the sketch after this list)
    • Delays during I/O after acquiring the lock → TTL (lease) expires → another thread may acquire the lock and perform the same operation.
    • Delays caused by packet loss in external network calls → TTL (lease) expires …
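
For illustration, the lease-expiry hazard in failure case 2 can be reproduced with a toy script. The Redis address, key name, and timings are assumptions, and the `time.sleep` stands in for a GC pause or slow I/O.

```python
# Toy reproduction of the lease-expiry hazard: a pause longer than the TTL
# lets a second client take the same lock while the first still thinks it
# holds it. Host, key, and timings are illustrative assumptions.
import time
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)
KEY, TTL_MS = "lock:withdrawal", 1_000

token_a = str(uuid.uuid4())
assert r.set(KEY, token_a, nx=True, px=TTL_MS)   # client A acquires the lock

time.sleep(2)                                    # stand-in for GC STW / slow I/O

token_b = str(uuid.uuid4())
print(r.set(KEY, token_b, nx=True, px=TTL_MS))   # client B succeeds: lease expired
# Client A now resumes and writes to shared storage without a valid lock.
```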

SPoF Solution: Master-Slave Structure

If failover takes longer than the lock’s TTL, the lock expires and becomes available to another client, which can corrupt data.

```mermaid
sequenceDiagram
    participant A as Client A
    participant Lock as Lock Service
    participant B as Client B
    participant Storage as Shared Storage
    A->>Lock: Acquire Lock on filename
    Lock-->>A: Lock Granted (with TTL)
    A->>Storage: Read File
    Note over A: Redis failover (longer than TTL)
    Lock-->>B: Lock expired, available again
    B->>Lock: Acquire Lock on filename
    Lock-->>B: Lock Granted
    B->>Storage: Read File
    B->>Storage: Update and Write File
    A->>Storage: Update and Write File (After GC pause)
    Note over Storage: Data corruption!
```

Stability Solution: Safe Lock with Fencing

Similar to first-committer-wins in MVCC, the storage layer validates each transaction against a version (fencing token); a sketch of the storage-side check follows the steps below.

  1. client 1 successfully acquires the lock (with token33) but encounters delay during storage write (GC, network delay, etc.)
  2. client 1’s lock lease expires.
  3. client 2 acquires the lock (with token34) and completes the write operation before client 1 finishes.
  4. client 1 attempts storage write → storage rejects token33 because it’s older than token34 (transaction fail).
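
A minimal sketch of what the storage-side check could look like, assuming a hypothetical FencedStorage class (illustrative only, not any specific product’s API):

```python
# A minimal sketch of a storage-side fencing check (first-committer-wins on the
# fencing token). The class, token source, and field names are illustrative
# assumptions.
import threading


class FencedStorage:
    """Rejects writes carrying a token older than the newest one seen."""

    def __init__(self) -> None:
        self._lock = threading.Lock()      # local mutex guarding the check-and-set
        self._highest_token = -1
        self._data: dict[str, str] = {}

    def write(self, key: str, value: str, fencing_token: int) -> bool:
        with self._lock:
            if fencing_token < self._highest_token:
                return False               # stale writer (e.g., token33 after token34)
            self._highest_token = fencing_token
            self._data[key] = value
            return True


storage = FencedStorage()
print(storage.write("file", "client2-data", fencing_token=34))  # True
print(storage.write("file", "client1-data", fencing_token=33))  # False: rejected
```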

The biggest problem is: who generates the fencing token? Implementing a monotonically increasing counter in a distributed environment requires yet another leader election… (a circular dependency).

Redlock

Operation Flow

  1. Record the current time in milliseconds.
  2. Try to acquire the lock on all N Redis instances sequentially, using the same key and a random value. Set a short timeout for each attempt so that if a node is down, the client moves on to the next instance immediately.
  3. Calculate the time taken to acquire locks, and if locks are successfully acquired on the majority of instances and the time taken is less than the lock’s validity time, the lock is considered acquired.
  4. If the lock is acquired, set the new validity time as (initial validity − elapsed time).
  5. If the lock is not acquired, or if the remaining validity time is negative (exceeded during acquisition), release the locks from all instances.
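
A rough sketch of this flow with redis-py is below. The node addresses, TTL, per-node timeout, and the 0.01 clock-drift allowance (taken from the published Redlock description) are assumptions for illustration rather than production code.

```python
# A rough sketch of the Redlock acquisition flow described above, using redis-py.
# Node addresses, TTL, per-node timeout, and the drift allowance are assumptions.
import time
import uuid

import redis

NODES = [
    redis.Redis(host="localhost", port=p, socket_timeout=0.05)  # short per-node timeout
    for p in (6379, 6380, 6381, 6382, 6383)
]
TTL_MS = 10_000
DRIFT_MS = int(TTL_MS * 0.01) + 2          # clock-drift allowance from the Redlock description


def redlock_acquire(key: str):
    token = str(uuid.uuid4())
    start = time.monotonic()               # step 1: record the current time

    acquired = 0
    for node in NODES:                     # step 2: try each instance in turn
        try:
            if node.set(key, token, nx=True, px=TTL_MS):
                acquired += 1
        except redis.RedisError:
            continue                       # node down: move on immediately

    elapsed_ms = (time.monotonic() - start) * 1000
    validity_ms = TTL_MS - elapsed_ms - DRIFT_MS   # step 4: remaining validity

    # step 3/5: need a majority AND positive remaining validity
    if acquired >= len(NODES) // 2 + 1 and validity_ms > 0:
        return token, validity_ms

    for node in NODES:                     # failed: release on all instances
        try:
            if node.get(key) == token.encode():
                node.delete(key)           # simplified; a Lua compare-and-delete is safer
        except redis.RedisError:
            pass
    return None, 0
```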

Bad Timing Issue

| Category | Description |
| --- | --- |
| General distributed system | Assumes "time cannot be trusted" → safety is guaranteed unconditionally; only liveness depends on timing |
| Redlock | Relies on time (clock accuracy, network delay) to guarantee lock safety |
| Problem | If clocks jump forward/backward (GC, NTP, network delay), lock-expiration calculations can go wrong and the lock can be broken |
| Result | Not just degraded liveness: safety violations (e.g., data corruption, duplicate execution) can occur |
```mermaid
sequenceDiagram
    participant C1 as Client 1
    participant C2 as Client 2
    participant A as Redis Node A
    participant B as Redis Node B
    participant C as Redis Node C
    participant D as Redis Node D
    participant E as Redis Node E
    %% First Scenario: Clock Jump
    C1->>A: Acquire lock
    C1->>B: Acquire lock
    C1->>C: Acquire lock
    Note over C: Clock jumps forward -> lock expires prematurely
    C1->>D: (Network issue, cannot reach)
    C1->>E: (Network issue, cannot reach)
    C2->>C: Acquire lock (C believes no lock exists)
    C2->>D: Acquire lock
    C2->>E: Acquire lock
    Note over C1,C2: Both clients believe they hold the lock
    %% Second Scenario: Process Pause (e.g., GC) or Long Network Delay
    C1->>A: Lock request sent (in-flight)
    C1->>B: Lock request sent (in-flight)
    C1->>C: Lock request sent (in-flight)
    C1->>D: Lock request sent (in-flight)
    C1->>E: Lock request sent (in-flight)
    Note over C1: Client 1 stops (GC pause or process pause)
    Note over A: Locks expire during Client 1 pause
    C2->>A: Acquire new lock
    C2->>B: Acquire new lock
    C2->>C: Acquire new lock
    C2->>D: Acquire new lock
    C2->>E: Acquire new lock
    C1->>C1: Client 1 resumes after GC pause
    C1->>C1: Receives "lock acquired" responses from Redis (stale responses)
    Note over C1,C2: Both clients now believe they hold the lock
```
| Scenario | Description |
| --- | --- |
| First (clock jump) | Redis node C’s clock jumps forward, causing early TTL expiration. Client 1 thinks it still holds the lock, but Client 2 acquires it again, so both believe they own it. |
| Second (GC pause) | Client 1 sends lock requests but pauses (GC); the locks expire during the pause. Client 2 acquires new locks, while Client 1 later processes stale success responses. |

Synchrony assumptions of Redlock

| Condition | Description |
| --- | --- |
| Bounded network delay | Packets must arrive within a guaranteed maximum delay |
| Bounded process pause | GC or system pauses must stay within a limited time |
| Bounded clock drift | Clock drift must be small; NTP synchronization must be reliable |

➔ That is, all delays, pauses, and clock drifts must be much smaller than the lock’s TTL (time-to-live) for Redlock to function correctly.
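
As a back-of-the-envelope illustration, plugging a few delays into the remaining-validity arithmetic from step 4 above (the 10-second TTL is an assumed figure; 90 seconds matches the GitHub incident mentioned next):

```python
# Remaining validity after acquisition, per the Redlock arithmetic sketched above.
# Note: this check only catches delays DURING acquisition; delays after the lock
# is held (failure case 2 above) are not caught at all.
TTL_MS = 10_000                               # assumed lock TTL
DRIFT_MS = int(TTL_MS * 0.01) + 2             # assumed drift allowance
for elapsed_ms in (50, 500, 9_000, 90_000):   # 90 s ~ the GitHub packet delay
    validity_ms = TTL_MS - elapsed_ms - DRIFT_MS
    status = "usable" if validity_ms > 0 else "lock must be abandoned"
    print(f"elapsed={elapsed_ms}ms -> validity={validity_ms}ms ({status})")
```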

Is it realistic to expect such conditions? Remember GitHub’s 90-second packet delay.

Ultimately… Redlock is an algorithm that relies on time, and due to clock jumps, GC STW, and network packet loss, it cannot guarantee correctness.

Since Redis was designed as a key-value store rather than a consensus system, truly reliable locks are better built on solutions such as ZooKeeper or Raft-based systems than on Redlock.