Detect and recover from master worker death during setup #91

Draft
ssteele110 wants to merge 8 commits into master from sammyst/master-setup-heartbeat

@ssteele110
Contributor

Summary

  • Adds a master setup heartbeat mechanism to detect when the master worker dies during queue setup
  • Other workers can atomically take over as master and complete the setup
  • Prevents builds from failing when the master dies before pushing tests to the queue

Problem

When the master worker is killed before completing queue setup:

  • master-status stays at 'setup' (until 8-hour TTL expires)
  • All non-master workers poll for ~120s then fail with LostMaster
  • No recovery mechanism exists - the entire build fails

Solution

  • Master sends periodic heartbeats during setup (every 5s by default)
  • Workers waiting for master check if heartbeat is stale (>30s by default)
  • If stale, workers attempt atomic takeover via Lua script (only one wins)
  • Winning worker runs master setup with its existing test index
  • Guard in push() prevents split-brain (old master can't push after replacement)
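
The heartbeat and staleness check described above can be sketched as follows. This is a minimal illustration, not the code from this PR: HeartbeatStore is a hypothetical in-memory stand-in for the Redis key, and with_heartbeat and heartbeat_stale? are illustrative names that only mirror the idea behind with_master_setup_heartbeat and master_setup_heartbeat_stale?. The defaults (5s interval, 30s timeout) come from the description. Workers on different hosts would need a shared clock, which this sketch does not model.

```ruby
# Hypothetical in-memory stand-in for the Redis key holding the last heartbeat.
class HeartbeatStore
  def initialize
    @mutex = Mutex.new
    @last_beat = nil
  end

  def beat!(now)
    @mutex.synchronize { @last_beat = now }
  end

  def last_beat
    @mutex.synchronize { @last_beat }
  end
end

# Local monotonic clock; an assumption made so the sketch runs in one process.
MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

# A heartbeat is stale when it is missing or older than the timeout.
def heartbeat_stale?(store, now:, timeout: 30)
  last = store.last_beat
  last.nil? || (now - last) > timeout
end

# Beats once immediately, then every `interval` seconds in a background
# thread while the block runs. The thread is stopped when the block exits.
def with_heartbeat(store, interval: 5, clock: MONOTONIC_CLOCK)
  store.beat!(clock.call)
  lock = Mutex.new
  wakeup = ConditionVariable.new
  done = false
  beater = Thread.new do
    lock.synchronize do
      until done
        wakeup.wait(lock, interval) # returns early when signalled
        store.beat!(clock.call) unless done
      end
    end
  end
  yield
ensure
  if beater
    lock.synchronize do
      done = true
      wakeup.signal
    end
    beater.join
  end
end
```

The condition variable lets the heartbeat thread stop as soon as setup finishes instead of sleeping out the rest of the interval.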

Key Changes

  1. Configuration (lib/ci/queue/configuration.rb)

    • master_setup_heartbeat_interval (default: 5s)
    • master_setup_heartbeat_timeout (default: 30s)
  2. Worker (lib/ci/queue/redis/worker.rb)

    • with_master_setup_heartbeat - heartbeat thread during setup
    • attempt_master_takeover - calls Lua script
    • run_master_setup - runs setup after takeover
    • Split-brain guard in push()
  3. Base (lib/ci/queue/redis/base.rb)

    • wait_for_master() now checks for stale heartbeat
    • master_setup_heartbeat_stale? helper
  4. Lua Script (redis/takeover_master.lua)

    • Atomic takeover ensuring only one worker wins
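
To illustrate the "only one worker wins" property, here is a small in-process model in Ruby. It is a sketch under assumptions: the contents of redis/takeover_master.lua are not shown in this PR page, so MasterState and attempt_takeover below are hypothetical, and a Mutex stands in for the atomicity that Redis provides by running a Lua script without interleaving other commands.

```ruby
# Hypothetical model of the state a takeover script would inspect and update.
class MasterState
  attr_reader :master_worker_id

  def initialize(master_worker_id:, heartbeat_at:, status: "setup")
    @mutex = Mutex.new
    @master_worker_id = master_worker_id
    @heartbeat_at = heartbeat_at
    @status = status
  end

  # Check-and-claim in one atomic step: succeed only while setup is still in
  # progress and the heartbeat is stale. Refreshing the heartbeat on success
  # makes every later contender see a live master and back off.
  def attempt_takeover(worker_id, now:, timeout: 30)
    @mutex.synchronize do
      return false unless @status == "setup"
      return false unless @heartbeat_at.nil? || (now - @heartbeat_at) > timeout

      @master_worker_id = worker_id
      @heartbeat_at = now
      true
    end
  end
end

# Eight workers race to replace a master whose last heartbeat was 100s ago.
state = MasterState.new(master_worker_id: "worker-0", heartbeat_at: 0)
threads = (1..8).map do |i|
  Thread.new { state.attempt_takeover("worker-#{i}", now: 100) ? "worker-#{i}" : nil }
end
winners = threads.map(&:value).compact
```

Because the check and the claim happen under one lock, winners always holds exactly one worker id, and it matches state.master_worker_id.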

Test plan

  • Unit test: Verify heartbeat is sent during setup
  • Unit test: Verify takeover when master heartbeat is stale
  • Unit test: Verify NO takeover when master heartbeat is fresh
  • Unit test: Verify only one worker wins takeover race
  • Integration: Full flow with simulated master death

🤖 Generated with Claude Code

ssteele110 force-pushed the sammyst/master-setup-heartbeat branch from 7ac3885 to 8bb5884 Compare February 5, 2026 23:48
ssteele110 and others added 7 commits February 5, 2026 15:49

When the master worker is killed before completing queue setup, the
master-status stays at 'setup' and all non-master workers poll for
~120s then fail with LostMaster. This change adds a master setup
heartbeat mechanism that allows workers to detect a dead master and
atomically re-elect a new one.

Changes:
- Add master_setup_heartbeat_interval (5s) and master_setup_heartbeat_timeout
  (30s) config options
- Master sends heartbeat during setup via background thread
- Workers detect stale heartbeat and attempt atomic takeover via Lua script
- Guard in push() prevents split-brain (old master pushing after replacement)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The previous guard check in push() had a race window between checking
master-worker-id and executing the transaction. A takeover could occur
in that window, causing both old and new master to push tests.

Use Redis WATCH to monitor master-worker-id. If it changes between
WATCH and MULTI execution, the transaction automatically aborts,
preventing duplicate test pushes.
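
The WATCH-based guard can be modelled in plain Ruby as an optimistic check on a version counter. This is a sketch only: WatchableStore and guarded_push are hypothetical names, the key names "queue" and "master-status" are assumptions borrowed from the description, and the real change uses Redis WATCH and MULTI rather than an in-memory store.

```ruby
# Hypothetical in-memory store that tracks a version per key, so a commit can
# be rejected when a watched key changed (the role WATCH plays in Redis).
class WatchableStore
  def initialize
    @mutex = Mutex.new
    @data = {}
    @versions = Hash.new(0)
  end

  def get(key)
    @mutex.synchronize { @data[key] }
  end

  def set(key, value)
    @mutex.synchronize do
      @data[key] = value
      @versions[key] += 1
    end
  end

  # Like WATCH: remember the key's version at this moment.
  def watch(key)
    @mutex.synchronize { @versions[key] }
  end

  # Like MULTI/EXEC after WATCH: apply all writes, or none if the key changed.
  def commit_if_unchanged(key, watched_version, writes)
    @mutex.synchronize do
      return false unless @versions[key] == watched_version

      writes.each do |k, v|
        @data[k] = v
        @versions[k] += 1
      end
      true
    end
  end
end

# Pushes tests only if `worker_id` is still master when the commit happens.
# The optional block is a test hook that runs inside the old race window.
def guarded_push(store, worker_id, tests)
  watched = store.watch("master-worker-id")
  return false unless store.get("master-worker-id") == worker_id

  yield if block_given?
  store.commit_if_unchanged("master-worker-id", watched,
                            { "queue" => tests, "master-status" => "ready" })
end
```

A takeover that lands between the ownership check and the commit bumps the version of master-worker-id, so the old master's commit is rejected instead of producing a second push.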

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Previously, there was a window between acquiring master role and
entering with_master_setup_heartbeat where no heartbeat existed.
Other workers could see status='setup' with no heartbeat and
incorrectly attempt takeover.

Now the initial heartbeat is set immediately after acquiring master
role, closing this window.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…keover

After taking over as master, check if master-status is still "setup"
before running setup. If status is already "ready", the original master
completed its push before we could act, so we skip to avoid duplicate
test pushes.

This fixes the race where:
1. Original master's push() passes WATCH check
2. Heartbeat becomes stale during redis.multi execution
3. Worker takes over (status still "setup")
4. Original master's transaction commits, sets status="ready"
5. Without this fix, takeover worker would also push tests
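
The five steps above can be replayed against a plain Hash to show what the status check prevents. This is an illustrative sketch: setup_after_takeover is a hypothetical helper, and the Hash stands in for the Redis keys, so it does not model the atomicity of the real transaction.

```ruby
# Hypothetical helper: run setup after a takeover only if the original master
# has not already finished. Returns :skipped or :pushed.
def setup_after_takeover(state, tests)
  return :skipped unless state["master-status"] == "setup"

  state["queue"].concat(tests)
  state["master-status"] = "ready"
  :pushed
end

tests = ["test_a", "test_b"]
state = { "master-status" => "setup", "master-worker-id" => "original", "queue" => [] }

# Step 3: a worker takes over while status is still "setup".
state["master-worker-id"] = "takeover"

# Step 4: the original master's in-flight transaction commits.
state["queue"].concat(tests)
state["master-status"] = "ready"

# Step 5: the takeover worker checks status before pushing and skips.
outcome = setup_after_takeover(state, tests)
```

With the check in place, outcome is :skipped and the queue holds each test once. Without it, the second concat would leave every test in the queue twice.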

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Extract shared master setup logic (reorder tests, store chunk metadata,
push to queue) into execute_master_setup method. Used by both initial
master setup in populate() and takeover in run_master_setup().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ssteele110 force-pushed the sammyst/master-setup-heartbeat branch from 8bb5884 to ddfef01 Compare February 5, 2026 23:51