
Add Buildkite infrastructure for GPU job isolation #432

Open
msaroufim wants to merge 20 commits into main from buildkite-infrastructure

Conversation

@msaroufim
Member

@msaroufim msaroufim commented Feb 4, 2026

  • Add deployment/buildkite/ with setup-node.sh, pipeline.yml, Dockerfile
  • Add BuildkiteLauncher class for submitting jobs to Buildkite
  • Add buildkite-runner.py for job execution in containers
  • Add BuildkiteGPU enum (B200_BK, H100_BK, MI300_BK)
  • Add e2e test script for verifying Buildkite integration

This enables vendors to onboard GPU resources with proper per-GPU isolation using a single setup script that creates Buildkite agents.
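For orientation, the enum surface might look roughly like this (the member names come from the bullets above; the string values and base class are assumptions, not the actual consts.py code):

from enum import Enum

class BuildkiteGPU(Enum):
    # Member names from this PR's description; values here are illustrative only.
    B200_BK = "B200"
    H100_BK = "H100"
    MI300_BK = "MI300"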

It's alive

kernelbot on  buildkite-infrastructure [$] is 📦 v0.1.0 via 🐍 v3.14.2 took 54s
❯   BUILDKITE_API_TOKEN=XXX uv run python scripts/submit_buildkite_job.py --eval vectoradd_py
=== Buildkite Job Submission ===
Org: mark-saroufim
Pipeline: kernelbot
Queue: test
Run ID: manual-test
Eval: vectoradd_py

Task: vectoradd_py
Submission: submission_triton.py
[STATUS] Submitting to Buildkite queue: test
2026-02-04 19:14:58 - libkernelbot.launchers.buildkite - INFO - Submitting job sub-unknown-L40S_BK to Buildkite queue test
2026-02-04 19:14:59 - libkernelbot.launchers.buildkite - INFO - Build created: https://buildkite.com/mark-saroufim/kernelbot/builds/55
[UPDATE] Build created: [55](<https://buildkite.com/mark-saroufim/kernelbot/builds/55>)
[UPDATE] ⏳ Build [55](<https://buildkite.com/mark-saroufim/kernelbot/builds/55>): scheduled (0.3s)
[UPDATE] ⏳ Build [55](<https://buildkite.com/mark-saroufim/kernelbot/builds/55>): running (10.6s)
[UPDATE] ⏳ Build [55](<https://buildkite.com/mark-saroufim/kernelbot/builds/55>): running (20.9s)
[UPDATE] ⏳ Build [55](<https://buildkite.com/mark-saroufim/kernelbot/builds/55>): running (31.3s)
[UPDATE] ⏳ Build [55](<https://buildkite.com/mark-saroufim/kernelbot/builds/55>): running (41.6s)
[UPDATE] Build completed: [55](<https://buildkite.com/mark-saroufim/kernelbot/builds/55>)

=== Result ===
Success: True
System: SystemInfo(gpu='NVIDIA L40S', device_count=1, cpu='AMD EPYC 9254 24-Core Processor', runtime='CUDA', platform='Linux-5.15.0-164-generic-x86_64-with-glibc2.35', torch='2.6.0+cu124', hostname='70c0619235cc')

test:
  Passed: True
  Duration: 3.230035022001175s
  Result: {'test-count': '5', 'test.0.spec': 'size: 127; seed: 4242', 'test.0.status': 'pass', 'test.1.spec': 'size: 128; seed: 5236', 'test.1.status': 'pass', 'test.2.spec': 'size: 129; seed: 1001', 'test.2.status': 'pass', 'test.3.spec': 'size: 256; seed: 5531', 'test.3.status': 'pass', 'test.4.spec': 'size: 512; seed: 9173', 'test.4.status': 'pass', 'check': 'pass'}

kernelbot on  buildkite-infrastructure [$] is 📦 v0.1.0 via 🐍 v3.14.2 took 57s
❯
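The ⏳ updates above come from polling the build until it leaves a non-terminal state. A minimal sketch of that loop against Buildkite's v2 REST API (the endpoint and state names are Buildkite's; the 10-second interval matches the timestamps above, everything else is illustrative):

import time

import httpx

def poll_build(org: str, pipeline: str, number: int, token: str) -> dict:
    """Poll a Buildkite build every 10s until it reaches a terminal state."""
    url = f"https://api.buildkite.com/v2/organizations/{org}/pipelines/{pipeline}/builds/{number}"
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        build = httpx.get(url, headers=headers).json()
        state = build["state"]  # e.g. "scheduled", "running", "passed", "failed"
        if state not in ("scheduled", "running"):
            return build
        time.sleep(10)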

Copilot AI review requested due to automatic review settings February 4, 2026 23:20
@github-actions

github-actions bot commented Feb 4, 2026

Coverage report

Files with coverage changes: src/libkernelbot/consts.py, src/libkernelbot/utils.py

This report was generated by python-coverage-comment-action

Contributor

Copilot AI left a comment


Pull request overview

This PR adds comprehensive Buildkite infrastructure to enable GPU job isolation for kernel evaluation workloads. The implementation provides a vendor-agnostic approach for onboarding GPU resources through automated setup scripts that create per-GPU Buildkite agents.

Changes:

  • Adds BuildkiteLauncher class with API integration for job submission and result polling
  • Implements buildkite-runner.py for containerized job execution with compressed payload handling
  • Adds BuildkiteGPU enum (B200_BK, H100_BK, MI300_BK) and integrates with GPU lookup system
  • Creates deployment infrastructure including setup-node.sh, pipeline.yml, and Dockerfile
  • Provides e2e test script for integration verification

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 13 comments.

Summary per file:
src/libkernelbot/launchers/buildkite.py Implements Buildkite API integration with build submission, polling, and artifact retrieval
src/runners/buildkite-runner.py Container entrypoint that decodes job payloads and executes evaluations
src/libkernelbot/consts.py Adds BuildkiteGPU enum and integrates it into GPU lookup system
src/libkernelbot/launchers/__init__.py Exports BuildkiteLauncher for public API
deployment/buildkite/setup-node.sh Automated node setup creating per-GPU agents with isolation
deployment/buildkite/pipeline.yml Buildkite pipeline configuration for dockerized job execution
deployment/buildkite/Dockerfile Container image with CUDA, PyTorch, and kernelbot dependencies
tests/e2e_buildkite_test.py End-to-end test script for Buildkite integration verification
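
For context on the "compressed payload handling" item: the job config reaches the container through the KERNELBOT_PAYLOAD env var that the pipeline propagates (see the snippet further down). A sketch of what the runner's decode step might look like — the base64 + zlib scheme is an assumption; only the variable name comes from this PR:

import base64
import json
import os
import zlib

def load_payload() -> dict:
    # base64 + zlib is assumed; only KERNELBOT_PAYLOAD appears in the PR.
    raw = os.environ["KERNELBOT_PAYLOAD"]
    return json.loads(zlib.decompress(base64.b64decode(raw)))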


Comment on lines +116 to +120
local cpu_start=$((gpu_idx * CPUS_PER_GPU))
local cpu_end=$((cpu_start + CPUS_PER_GPU - 1))
local agent_name="${NODE_NAME}-gpu${gpu_idx}"
local config_dir="/etc/buildkite-agent/agent-${gpu_idx}"
local build_dir="/var/lib/buildkite-agent/gpu-${gpu_idx}/builds"

Copilot AI Feb 4, 2026


In bash, local is only valid inside a function body; used at the top level of a script (for example in a for loop outside any function) it fails with "local: can only be used in a function". Either declare these variables without local, or move this loop into the setup_agents function so that local is in scope.

Suggested change
local cpu_start=$((gpu_idx * CPUS_PER_GPU))
local cpu_end=$((cpu_start + CPUS_PER_GPU - 1))
local agent_name="${NODE_NAME}-gpu${gpu_idx}"
local config_dir="/etc/buildkite-agent/agent-${gpu_idx}"
local build_dir="/var/lib/buildkite-agent/gpu-${gpu_idx}/builds"
cpu_start=$((gpu_idx * CPUS_PER_GPU))
cpu_end=$((cpu_start + CPUS_PER_GPU - 1))
agent_name="${NODE_NAME}-gpu${gpu_idx}"
config_dir="/etc/buildkite-agent/agent-${gpu_idx}"
build_dir="/var/lib/buildkite-agent/gpu-${gpu_idx}/builds"

Comment on lines 29 to 31
RUN pip install --no-cache-dir \
torch==2.4.0 \
triton \

Copilot AI Feb 4, 2026


The torch version is pinned to 2.4.0, but the CUDA base image is 12.4.0. Consider verifying compatibility or documenting why this specific combination is used. Additionally, the triton package is installed without version pinning, which may lead to version compatibility issues.

Suggested change
RUN pip install --no-cache-dir \
torch==2.4.0 \
triton \
# torch==2.4.0 has been validated with the CUDA 12.4.0 base image used above.
# Pin triton to a version compatible with torch 2.4.0 for reproducible builds.
RUN pip install --no-cache-dir \
torch==2.4.0 \
triton==3.0.0 \

Comment on lines 85 to 90
result = await launcher._launch(
run_id="e2e-test",
config=test_config,
queue=args.queue,
status=SimpleReporter(),
)

Copilot AI Feb 4, 2026


The test accesses a private method _launch which is prefixed with an underscore, indicating it's intended for internal use. This violates encapsulation principles and makes the test brittle to internal implementation changes. Consider either making this a public method if it needs to be tested directly, or testing through the public run_submission method instead.
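A sketch of the suggested direction — going through run_submission rather than _launch. The fixture names and the run_submission signature below are assumptions modeled on the e2e script's arguments, not the actual launcher API:

import pytest

@pytest.mark.asyncio
async def test_launch_via_public_api(launcher, test_config, reporter):
    # launcher, test_config, and reporter are assumed fixtures; the
    # run_submission signature is a guess based on _launch's arguments.
    result = await launcher.run_submission(
        run_id="e2e-test",
        config=test_config,
        status=reporter,
    )
    assert result.success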

Comment on lines +139 to +147
cat > "${config_dir}/environment" << ENVEOF
NVIDIA_VISIBLE_DEVICES=${gpu_idx}
CUDA_VISIBLE_DEVICES=${gpu_idx}
KERNELBOT_GPU_INDEX=${gpu_idx}
KERNELBOT_CPU_START=${cpu_start}
KERNELBOT_CPU_END=${cpu_end}
KERNELBOT_CPUSET=${cpu_start}-${cpu_end}
KERNELBOT_MEMORY=${RAM_PER_GPU}
ENVEOF

Copilot AI Feb 4, 2026


The environment file sets NVIDIA_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES unconditionally, but these are NVIDIA-specific variables. For AMD GPUs (like MI300_BK), the appropriate variables would be ROCR_VISIBLE_DEVICES or HIP_VISIBLE_DEVICES. This will cause GPU isolation to fail on AMD systems. Consider adding conditional logic based on GPU_TYPE to set the appropriate visibility variables.
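The vendor split this comment asks for, expressed as a small Python mapping for clarity (the setup script itself is bash; the variable names are the standard NVIDIA/ROCm ones, and the GPU_TYPE spellings are assumptions):

def gpu_visibility_env(gpu_type: str, gpu_idx: int) -> dict[str, str]:
    """Return the device-visibility variables the GPU vendor's runtime honors."""
    if gpu_type.lower().startswith("mi"):  # AMD, e.g. "mi300"
        return {
            "ROCR_VISIBLE_DEVICES": str(gpu_idx),
            "HIP_VISIBLE_DEVICES": str(gpu_idx),
        }
    # NVIDIA, e.g. "h100", "b200", "l40s"
    return {
        "NVIDIA_VISIBLE_DEVICES": str(gpu_idx),
        "CUDA_VISIBLE_DEVICES": str(gpu_idx),
    }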

- docker#v5.11.0:
image: "${KERNELBOT_IMAGE:-ghcr.io/gpu-mode/kernelbot:latest}"
always-pull: true
runtime: nvidia

Copilot AI Feb 4, 2026


The docker runtime is hardcoded to 'nvidia', which will not work for AMD GPUs. For AMD systems, the runtime should typically be set to a ROCm-compatible runtime or left as default with appropriate device mappings. This configuration prevents the pipeline from working with AMD MI300_BK GPUs that are listed in the BuildkiteGPU enum.

Suggested change
runtime: nvidia
runtime: "${KERNELBOT_DOCKER_RUNTIME:-nvidia}"


print("=== Kernelbot Evaluation ===")
print(f"Run ID: {run_id}")
print(f"GPU: {os.environ.get('NVIDIA_VISIBLE_DEVICES', 'not set')}")

Copilot AI Feb 4, 2026


This log line reads NVIDIA_VISIBLE_DEVICES, but that environment variable may not be set on AMD systems. Consider a more generic label like "Visible GPU devices", or check the actual GPU type before referring to NVIDIA-specific variables.

Suggested change
print(f"GPU: {os.environ.get('NVIDIA_VISIBLE_DEVICES', 'not set')}")
visible_gpus = (
os.environ.get("NVIDIA_VISIBLE_DEVICES")
or os.environ.get("ROCR_VISIBLE_DEVICES")
or os.environ.get("GPU_DEVICE_ORDINAL")
)
print(f"Visible GPU devices: {visible_gpus if visible_gpus is not None else 'not set'}")

- KERNELBOT_PAYLOAD
- KERNELBOT_RUN_ID
- KERNELBOT_GPU_INDEX
- KERNELBOT_CPUSET

Copilot AI Feb 4, 2026


The KERNELBOT_CPUSET environment variable is listed in the propagate-environment section but is not used in the resource constraints. If the intention is to use CPU pinning via cpuset, this should be configured in the docker plugin's cpuset_cpus parameter. Otherwise, this environment variable serves no purpose in the container.

Suggested change
- KERNELBOT_CPUSET

timeout=30.0,
)
return self._client


Copilot AI Feb 4, 2026


The httpx.AsyncClient is created but never explicitly closed. This can lead to resource leaks and warnings about unclosed client sessions. Consider implementing a cleanup method (e.g., async def close()) or using the launcher as an async context manager to ensure the client is properly closed when no longer needed.

Suggested change
async def close(self) -> None:
"""Close the underlying HTTP client, if it has been created."""
if self._client is not None:
await self._client.aclose()
self._client = None
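
The comment also mentions the async-context-manager route; that variant is a thin wrapper over the same close(), sketched here assuming the _client attribute from the snippet above:

class BuildkiteLauncher:  # sketch: lifecycle pieces only
    _client = None  # httpx.AsyncClient, created lazily elsewhere

    async def close(self) -> None:
        if self._client is not None:
            await self._client.aclose()
            self._client = None

    async def __aenter__(self) -> "BuildkiteLauncher":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self.close()

# Usage: `async with BuildkiteLauncher() as launcher: ...` guarantees the
# client is closed even if the submission raises.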

Comment on lines +42 to +63
echo "Installing Docker and NVIDIA Container Toolkit..."

# Docker
if ! command -v docker &> /dev/null; then
curl -fsSL https://get.docker.com | sh
usermod -aG docker ubuntu 2>/dev/null || true
fi

# NVIDIA Container Toolkit
if ! dpkg -l | grep -q nvidia-container-toolkit; then
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
fi

echo "Docker + NVIDIA toolkit installed."

Copilot AI Feb 4, 2026


The function install_docker_nvidia is NVIDIA-specific but is called unconditionally even when GPU_TYPE is set to AMD GPUs like "mi300". For AMD systems, this will install unnecessary NVIDIA tools. Consider either renaming this function to reflect that it's GPU-agnostic, adding conditional logic to install ROCm Docker support for AMD GPUs, or creating separate installation paths for different GPU vendors.

Suggested change
echo "Installing Docker and NVIDIA Container Toolkit..."
# Docker
if ! command -v docker &> /dev/null; then
curl -fsSL https://get.docker.com | sh
usermod -aG docker ubuntu 2>/dev/null || true
fi
# NVIDIA Container Toolkit
if ! dpkg -l | grep -q nvidia-container-toolkit; then
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
fi
echo "Docker + NVIDIA toolkit installed."
echo "Installing Docker..."
# Docker (vendor-agnostic)
if ! command -v docker &> /dev/null; then
curl -fsSL https://get.docker.com | sh
usermod -aG docker ubuntu 2>/dev/null || true
fi
# NVIDIA Container Toolkit (only for NVIDIA GPU types)
case "${GPU_TYPE}" in
b200|h100|nvidia*|*nvidia*)
if ! dpkg -l | grep -q nvidia-container-toolkit; then
echo "Detected NVIDIA GPU_TYPE='${GPU_TYPE}', installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
fi
;;
*)
echo "GPU_TYPE='${GPU_TYPE}' is not NVIDIA; skipping NVIDIA Container Toolkit installation."
;;
esac
echo "Docker installation complete."

@@ -0,0 +1,42 @@
# Kernelbot evaluation image
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04

Copilot AI Feb 4, 2026


The Dockerfile is based on nvidia/cuda:12.4.0-devel-ubuntu22.04, which is NVIDIA-specific and will not work for AMD MI300_BK GPUs listed in the BuildkiteGPU enum. For AMD support, you would need a separate Dockerfile based on a ROCm image, or a multi-stage build that can support both GPU vendors.

Suggested change
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04
ARG GPU_BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04
FROM ${GPU_BASE_IMAGE}

- Add Quick Start section for fast onboarding
- Add Current Status showing working 2x L40S test
- Add Working Docker Pipeline example with key points
- Add troubleshooting for Docker runtime crashes and hook shebang
- Document auto-resource detection (CPU/RAM divided by GPU count)
- Add Summary of Key Decisions section
- Add inline_steps parameter to _launch() for testing without pipeline config
- Add create_artifact_test_steps() helper method
- Update e2e test to use artifact mode by default (no Buildkite config needed)
- Fix pipeline.yml to use gpus: "all" instead of runtime: nvidia
- Add pipeline-artifact-test.yml for manual testing
Buildkite returns a 302 redirect to S3 for artifact downloads.
The auth header shouldn't be forwarded to S3, so we now:
1. Request with follow_redirects=False
2. Extract the S3 URL from the Location header
3. Fetch from S3 with a clean client

Also update test pipeline to write result.json artifact.
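
A sketch of that three-step dance with httpx (the steps are the commit message's; the function shape is illustrative):

import httpx

async def download_artifact(client: httpx.AsyncClient, url: str, token: str) -> bytes:
    # 1. Ask Buildkite for the artifact without following the redirect,
    #    so the Authorization header never reaches S3.
    resp = await client.get(
        url,
        headers={"Authorization": f"Bearer {token}"},
        follow_redirects=False,
    )
    # 2. Pull the pre-signed S3 URL out of the Location header.
    s3_url = resp.headers["location"]
    # 3. Fetch from S3 with a clean, unauthenticated client.
    async with httpx.AsyncClient() as clean:
        return (await clean.get(s3_url)).content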
- Add read_artifacts to required API token scopes
- Add S3 redirect handling to key decisions
- Add complete E2E workflow diagram and explanation
- Include verified test output showing successful artifact download
- Document the 9-step flow from submission to result retrieval
- Add L40S_BK GPU type for test infrastructure
- Create pipeline-eval.yml for running real kernel evaluations
- Create tests/test_buildkite.py with integration tests matching modal/github pattern
- Update submit_buildkite_job.py to support --eval flag for real evaluations
- Add queue mapping for L40S_BK -> test queue
- Changed workdir to /workdir (the mounted checkout directory)
- Copy result.json to /workdir before container exits
- Use uv for Python package management
- Clone from buildkite-infrastructure branch
- Add PyTorch installation step
- Use workdir: /workdir for artifact accessibility
- Copy result.json to workdir before container exits
- Activate venv before running evaluation
- Add real evaluation job submission instructions
- Add integration test documentation
- Document operational model (no pre-built Docker image)
- Clarify when admin action is/isn't needed
- Note shared evaluation logic across all runners
Backend integration:
- Register BuildkiteLauncher in create_backend() when BUILDKITE_API_TOKEN is set
- Add BUILDKITE_API_TOKEN, BUILDKITE_ORG, BUILDKITE_PIPELINE env vars
- Results now flow to database same as GitHub/Modal
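
As a sketch of that wiring (create_backend and the env var names are from the commit message above; the registration hook and constructor arguments are assumptions):

import os

from libkernelbot.launchers import BuildkiteLauncher

def create_backend():
    backend = ...  # existing backend construction elided
    if os.environ.get("BUILDKITE_API_TOKEN"):
        backend.register_launcher(  # assumed hook; actual wiring may differ
            BuildkiteLauncher(
                api_token=os.environ["BUILDKITE_API_TOKEN"],
                org=os.environ["BUILDKITE_ORG"],
                pipeline=os.environ["BUILDKITE_PIPELINE"],
            )
        )
    return backend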

Pre-built Docker image for fast cold starts:
- Add Dockerfile with all dependencies pre-installed
- Add build-image.sh script for local image building
- Add pipeline-fast.yml for using pre-built image (~5s vs ~40s cold start)
- Update setup-node-simple.sh with BUILD_IMAGE=true option

Update skills doc with operational model for both approaches
- Add L40S_BK SM arch (89 - Ada Lovelace) to GPU_TO_SM mapping
- Document env vars by location:
  - Heroku/Backend: BUILDKITE_API_TOKEN, BUILDKITE_ORG, BUILDKITE_PIPELINE
  - GPU Nodes: BUILDKITE_AGENT_TOKEN (set by admin), auto-set vars
  - Jobs: KERNELBOT_* vars passed via API