Detect and recover from master worker death during setup#91
Draft
ssteele110 wants to merge 8 commits intomasterfrom
Draft
Detect and recover from master worker death during setup#91ssteele110 wants to merge 8 commits intomasterfrom
ssteele110 wants to merge 8 commits intomasterfrom
Conversation
7ac3885 to
8bb5884
Compare
When the master worker is killed before completing queue setup, the master-status stays at 'setup' and all non-master workers poll for ~120s then fail with LostMaster. This change adds a master setup heartbeat mechanism that allows workers to detect a dead master and atomically re-elect a new one. Changes: - Add master_setup_heartbeat_interval (5s) and master_setup_heartbeat_timeout (30s) config options - Master sends heartbeat during setup via background thread - Workers detect stale heartbeat and attempt atomic takeover via Lua script - Guard in push() prevents split-brain (old master pushing after replacement) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous guard check in push() had a race window between checking master-worker-id and executing the transaction. A takeover could occur in that window, causing both old and new master to push tests. Use Redis WATCH to monitor master-worker-id. If it changes between WATCH and MULTI execution, the transaction automatically aborts, preventing duplicate test pushes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, there was a window between acquiring master role and entering with_master_setup_heartbeat where no heartbeat existed. Other workers could see status='setup' with no heartbeat and incorrectly attempt takeover. Now the initial heartbeat is set immediately after acquiring master role, closing this window. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…keover After taking over as master, check if master-status is still "setup" before running setup. If status is already "ready", the original master completed its push before we could act, so we skip to avoid duplicate test pushes. This fixes the race where: 1. Original master's push() passes WATCH check 2. Heartbeat becomes stale during redis.multi execution 3. Worker takes over (status still "setup") 4. Original master's transaction commits, sets status="ready" 5. Without this fix, takeover worker would also push tests Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…after takeover" This reverts commit 636796e.
Extract shared master setup logic (reorder tests, store chunk metadata, push to queue) into execute_master_setup method. Used by both initial master setup in populate() and takeover in run_master_setup(). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
8bb5884 to
ddfef01
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Problem
When the master worker is killed before completing queue setup:
master-statusstays at'setup'(until 8-hour TTL expires)LostMasterSolution
push()prevents split-brain (old master can't push after replacement)Key Changes
Configuration (
lib/ci/queue/configuration.rb)master_setup_heartbeat_interval(default: 5s)master_setup_heartbeat_timeout(default: 30s)Worker (
lib/ci/queue/redis/worker.rb)with_master_setup_heartbeat- heartbeat thread during setupattempt_master_takeover- calls Lua scriptrun_master_setup- runs setup after takeoverpush()Base (
lib/ci/queue/redis/base.rb)wait_for_master()now checks for stale heartbeatmaster_setup_heartbeat_stale?helperLua Script (
redis/takeover_master.lua)Test plan
🤖 Generated with Claude Code