Add Buildkite infrastructure for GPU job isolation #432
Conversation
- Add deployment/buildkite/ with setup-node.sh, pipeline.yml, Dockerfile
- Add BuildkiteLauncher class for submitting jobs to Buildkite
- Add buildkite-runner.py for job execution in containers
- Add BuildkiteGPU enum (B200_BK, H100_BK, MI300_BK)
- Add e2e test script for verifying Buildkite integration

This enables vendors to onboard GPU resources with proper per-GPU isolation using a single setup script that creates Buildkite agents.
Pull request overview
This PR adds comprehensive Buildkite infrastructure to enable GPU job isolation for kernel evaluation workloads. The implementation provides a vendor-agnostic approach for onboarding GPU resources through automated setup scripts that create per-GPU Buildkite agents.
Changes:
- Adds BuildkiteLauncher class with API integration for job submission and result polling
- Implements buildkite-runner.py for containerized job execution with compressed payload handling
- Adds BuildkiteGPU enum (B200_BK, H100_BK, MI300_BK) and integrates with GPU lookup system
- Creates deployment infrastructure including setup-node.sh, pipeline.yml, and Dockerfile
- Provides e2e test script for integration verification
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 13 comments.
Summary per file:
| File | Description |
|---|---|
| src/libkernelbot/launchers/buildkite.py | Implements Buildkite API integration with build submission, polling, and artifact retrieval |
| src/runners/buildkite-runner.py | Container entrypoint that decodes job payloads and executes evaluations |
| src/libkernelbot/consts.py | Adds BuildkiteGPU enum and integrates it into GPU lookup system |
| src/libkernelbot/launchers/__init__.py | Exports BuildkiteLauncher for public API |
| deployment/buildkite/setup-node.sh | Automated node setup creating per-GPU agents with isolation |
| deployment/buildkite/pipeline.yml | Buildkite pipeline configuration for dockerized job execution |
| deployment/buildkite/Dockerfile | Container image with CUDA, PyTorch, and kernelbot dependencies |
| tests/e2e_buildkite_test.py | End-to-end test script for Buildkite integration verification |
| local cpu_start=$((gpu_idx * CPUS_PER_GPU)) | ||
| local cpu_end=$((cpu_start + CPUS_PER_GPU - 1)) | ||
| local agent_name="${NODE_NAME}-gpu${gpu_idx}" | ||
| local config_dir="/etc/buildkite-agent/agent-${gpu_idx}" | ||
| local build_dir="/var/lib/buildkite-agent/gpu-${gpu_idx}/builds" |
The `local` keyword is used in a non-function context. In bash, `local` is only valid inside a function body and will cause an error when used in a for loop at script level. Either declare these variables without `local`, or make sure this loop actually runs inside the `setup_agents` function so that `local` is in scope.
| local cpu_start=$((gpu_idx * CPUS_PER_GPU)) | |
| local cpu_end=$((cpu_start + CPUS_PER_GPU - 1)) | |
| local agent_name="${NODE_NAME}-gpu${gpu_idx}" | |
| local config_dir="/etc/buildkite-agent/agent-${gpu_idx}" | |
| local build_dir="/var/lib/buildkite-agent/gpu-${gpu_idx}/builds" | |
| cpu_start=$((gpu_idx * CPUS_PER_GPU)) | |
| cpu_end=$((cpu_start + CPUS_PER_GPU - 1)) | |
| agent_name="${NODE_NAME}-gpu${gpu_idx}" | |
| config_dir="/etc/buildkite-agent/agent-${gpu_idx}" | |
| build_dir="/var/lib/buildkite-agent/gpu-${gpu_idx}/builds" |
deployment/buildkite/Dockerfile (outdated)
| RUN pip install --no-cache-dir \ | ||
| torch==2.4.0 \ | ||
| triton \ |
The torch version is pinned to 2.4.0, but the CUDA base image is 12.4.0. Consider verifying compatibility or documenting why this specific combination is used. Additionally, the triton package is installed without version pinning, which may lead to version compatibility issues.
| RUN pip install --no-cache-dir \ | |
| torch==2.4.0 \ | |
| triton \ | |
| # torch==2.4.0 has been validated with the CUDA 12.4.0 base image used above. | |
| # Pin triton to a version compatible with torch 2.4.0 for reproducible builds. | |
| RUN pip install --no-cache-dir \ | |
| torch==2.4.0 \ | |
| triton==3.0.0 \ |
| result = await launcher._launch( | ||
| run_id="e2e-test", | ||
| config=test_config, | ||
| queue=args.queue, | ||
| status=SimpleReporter(), | ||
| ) |
The test accesses a private method _launch which is prefixed with an underscore, indicating it's intended for internal use. This violates encapsulation principles and makes the test brittle to internal implementation changes. Consider either making this a public method if it needs to be tested directly, or testing through the public run_submission method instead.
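For illustration, a test that goes through the public surface might look like the sketch below. This is only a sketch: `run_submission` is the method named above, but its exact parameters are assumptions here and would need to match the real BuildkiteLauncher API.

```python
# Sketch only: drive the launcher through its public surface instead of _launch.
# The run_submission parameter names below are assumptions for illustration.
async def run_e2e(launcher, test_config, args):
    reporter = SimpleReporter()  # same reporter the e2e script already constructs
    return await launcher.run_submission(
        config=test_config,
        queue=args.queue,
        status=reporter,
    )
```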
| cat > "${config_dir}/environment" << ENVEOF | ||
| NVIDIA_VISIBLE_DEVICES=${gpu_idx} | ||
| CUDA_VISIBLE_DEVICES=${gpu_idx} | ||
| KERNELBOT_GPU_INDEX=${gpu_idx} | ||
| KERNELBOT_CPU_START=${cpu_start} | ||
| KERNELBOT_CPU_END=${cpu_end} | ||
| KERNELBOT_CPUSET=${cpu_start}-${cpu_end} | ||
| KERNELBOT_MEMORY=${RAM_PER_GPU} | ||
| ENVEOF |
The environment file sets NVIDIA_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES unconditionally, but these are NVIDIA-specific variables. For AMD GPUs (like MI300_BK), the appropriate variables would be ROCR_VISIBLE_DEVICES or HIP_VISIBLE_DEVICES. This will cause GPU isolation to fail on AMD systems. Consider adding conditional logic based on GPU_TYPE to set the appropriate visibility variables.
deployment/buildkite/pipeline.yml (outdated)
| - docker#v5.11.0: | ||
| image: "${KERNELBOT_IMAGE:-ghcr.io/gpu-mode/kernelbot:latest}" | ||
| always-pull: true | ||
| runtime: nvidia |
The docker runtime is hardcoded to 'nvidia', which will not work for AMD GPUs. For AMD systems, the runtime should typically be set to a ROCm-compatible runtime or left as default with appropriate device mappings. This configuration prevents the pipeline from working with AMD MI300_BK GPUs that are listed in the BuildkiteGPU enum.
| runtime: nvidia | |
| runtime: "${KERNELBOT_DOCKER_RUNTIME:-nvidia}" |
| print("=== Kernelbot Evaluation ===") | ||
| print(f"Run ID: {run_id}") | ||
| print(f"GPU: {os.environ.get('NVIDIA_VISIBLE_DEVICES', 'not set')}") |
The log output refers to "NVIDIA_VISIBLE_DEVICES", but this environment variable may not be set on AMD systems. Consider using a more generic message like "Visible GPU devices" or checking the actual GPU type before referring to NVIDIA-specific variables.
| print(f"GPU: {os.environ.get('NVIDIA_VISIBLE_DEVICES', 'not set')}") | |
| visible_gpus = ( | |
| os.environ.get("NVIDIA_VISIBLE_DEVICES") | |
| or os.environ.get("ROCR_VISIBLE_DEVICES") | |
| or os.environ.get("GPU_DEVICE_ORDINAL") | |
| ) | |
| print(f"Visible GPU devices: {visible_gpus if visible_gpus is not None else 'not set'}") |
| - KERNELBOT_PAYLOAD | ||
| - KERNELBOT_RUN_ID | ||
| - KERNELBOT_GPU_INDEX | ||
| - KERNELBOT_CPUSET |
The KERNELBOT_CPUSET environment variable is listed in the propagate-environment section but is not used in the resource constraints. If the intention is to use CPU pinning via cpuset, this should be configured in the docker plugin's cpuset_cpus parameter. Otherwise, this environment variable serves no purpose in the container.
| - KERNELBOT_CPUSET |
| timeout=30.0, | ||
| ) | ||
| return self._client | ||
The httpx.AsyncClient is created but never explicitly closed. This can lead to resource leaks and warnings about unclosed client sessions. Consider implementing a cleanup method (e.g., async def close()) or using the launcher as an async context manager to ensure the client is properly closed when no longer needed.
| async def close(self) -> None: | |
| """Close the underlying HTTP client, if it has been created.""" | |
| if self._client is not None: | |
| await self._client.aclose() | |
| self._client = None |
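Beyond an explicit `close()`, the launcher could also expose async context-manager hooks so callers can rely on `async with` for cleanup. The standalone sketch below shows the pattern; the class name and lazy-client property are illustrative, not the PR's actual implementation.

```python
from __future__ import annotations

import httpx


class LauncherClientSketch:
    """Illustration only: lazy httpx client plus async context-manager cleanup."""

    def __init__(self) -> None:
        self._client: httpx.AsyncClient | None = None

    @property
    def client(self) -> httpx.AsyncClient:
        # Create the client lazily, mirroring the launcher's existing pattern.
        if self._client is None:
            self._client = httpx.AsyncClient(timeout=30.0)
        return self._client

    async def close(self) -> None:
        # Same cleanup as the close() suggestion above.
        if self._client is not None:
            await self._client.aclose()
            self._client = None

    async def __aenter__(self) -> LauncherClientSketch:
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self.close()
```

Callers could then write `async with LauncherClientSketch() as launcher: ...` and the client is closed even if the body raises.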
| echo "Installing Docker and NVIDIA Container Toolkit..." | ||
| # Docker | ||
| if ! command -v docker &> /dev/null; then | ||
| curl -fsSL https://get.docker.com | sh | ||
| usermod -aG docker ubuntu 2>/dev/null || true | ||
| fi | ||
| # NVIDIA Container Toolkit | ||
| if ! dpkg -l | grep -q nvidia-container-toolkit; then | ||
| curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \ | ||
| gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg | ||
| curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \ | ||
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ | ||
| tee /etc/apt/sources.list.d/nvidia-container-toolkit.list | ||
| apt-get update | ||
| apt-get install -y nvidia-container-toolkit | ||
| nvidia-ctk runtime configure --runtime=docker | ||
| systemctl restart docker | ||
| fi | ||
| echo "Docker + NVIDIA toolkit installed." |
The function install_docker_nvidia is NVIDIA-specific but is called unconditionally even when GPU_TYPE is set to AMD GPUs like "mi300". For AMD systems, this will install unnecessary NVIDIA tools. Consider making the Docker installation vendor-agnostic, adding conditional logic to install ROCm Docker support for AMD GPUs, or creating separate installation paths for different GPU vendors.
| echo "Installing Docker and NVIDIA Container Toolkit..." | |
| # Docker | |
| if ! command -v docker &> /dev/null; then | |
| curl -fsSL https://get.docker.com | sh | |
| usermod -aG docker ubuntu 2>/dev/null || true | |
| fi | |
| # NVIDIA Container Toolkit | |
| if ! dpkg -l | grep -q nvidia-container-toolkit; then | |
| curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \ | |
| gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg | |
| curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \ | |
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ | |
| tee /etc/apt/sources.list.d/nvidia-container-toolkit.list | |
| apt-get update | |
| apt-get install -y nvidia-container-toolkit | |
| nvidia-ctk runtime configure --runtime=docker | |
| systemctl restart docker | |
| fi | |
| echo "Docker + NVIDIA toolkit installed." | |
| echo "Installing Docker..." | |
| # Docker (vendor-agnostic) | |
| if ! command -v docker &> /dev/null; then | |
| curl -fsSL https://get.docker.com | sh | |
| usermod -aG docker ubuntu 2>/dev/null || true | |
| fi | |
| # NVIDIA Container Toolkit (only for NVIDIA GPU types) | |
| case "${GPU_TYPE}" in | |
| b200|h100|nvidia*|*nvidia*) | |
| if ! dpkg -l | grep -q nvidia-container-toolkit; then | |
| echo "Detected NVIDIA GPU_TYPE='${GPU_TYPE}', installing NVIDIA Container Toolkit..." | |
| curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \ | |
| gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg | |
| curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \ | |
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ | |
| tee /etc/apt/sources.list.d/nvidia-container-toolkit.list | |
| apt-get update | |
| apt-get install -y nvidia-container-toolkit | |
| nvidia-ctk runtime configure --runtime=docker | |
| systemctl restart docker | |
| fi | |
| ;; | |
| *) | |
| echo "GPU_TYPE='${GPU_TYPE}' is not NVIDIA; skipping NVIDIA Container Toolkit installation." | |
| ;; | |
| esac | |
| echo "Docker installation complete." |
| @@ -0,0 +1,42 @@ | |||
| # Kernelbot evaluation image | |||
| FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 | |||
The Dockerfile is based on nvidia/cuda:12.4.0-devel-ubuntu22.04, which is NVIDIA-specific and will not work for AMD MI300_BK GPUs listed in the BuildkiteGPU enum. For AMD support, you would need a separate Dockerfile based on a ROCm image, or a multi-stage build that can support both GPU vendors.
| FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 | |
| ARG GPU_BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04 | |
| FROM ${GPU_BASE_IMAGE} |
…TTPS, proper isolation
- Add Quick Start section for fast onboarding
- Add Current Status showing working 2x L40S test
- Add Working Docker Pipeline example with key points
- Add troubleshooting for Docker runtime crashes and hook shebang
- Document auto-resource detection (CPU/RAM divided by GPU count)
- Add Summary of Key Decisions section
- Add inline_steps parameter to _launch() for testing without pipeline config
- Add create_artifact_test_steps() helper method
- Update e2e test to use artifact mode by default (no Buildkite config needed)
- Fix pipeline.yml to use gpus: "all" instead of runtime: nvidia
- Add pipeline-artifact-test.yml for manual testing
Buildkite returns a 302 redirect to S3 for artifact downloads. The auth header shouldn't be forwarded to S3, so we now:
1. Request with follow_redirects=False
2. Extract the S3 URL from the Location header
3. Fetch from S3 with a clean client

Also update test pipeline to write result.json artifact.
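A minimal httpx sketch of that three-step flow (function and variable names are illustrative, not the launcher's actual code):

```python
import httpx


async def download_artifact(api_client: httpx.AsyncClient, artifact_url: str) -> bytes:
    # Step 1: ask the Buildkite API for the artifact without following redirects,
    # so the Authorization header stays with the API request only.
    resp = await api_client.get(artifact_url, follow_redirects=False)
    # Step 2: the pre-signed S3 URL arrives in the 302 response's Location header.
    s3_url = resp.headers["location"]
    # Step 3: fetch from S3 with a clean client that carries no Buildkite auth header.
    async with httpx.AsyncClient() as s3_client:
        s3_resp = await s3_client.get(s3_url, follow_redirects=True)
        s3_resp.raise_for_status()
        return s3_resp.content
```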
- Add read_artifacts to required API token scopes
- Add S3 redirect handling to key decisions
- Add complete E2E workflow diagram and explanation
- Include verified test output showing successful artifact download
- Document the 9-step flow from submission to result retrieval
- Add L40S_BK GPU type for test infrastructure
- Create pipeline-eval.yml for running real kernel evaluations
- Create tests/test_buildkite.py with integration tests matching modal/github pattern
- Update submit_buildkite_job.py to support --eval flag for real evaluations
- Add queue mapping for L40S_BK -> test queue
- Changed workdir to /workdir (the mounted checkout directory)
- Copy result.json to /workdir before container exits
- Use uv for Python package management
- Clone from buildkite-infrastructure branch
- Add PyTorch installation step
- Use workdir: /workdir for artifact accessibility
- Copy result.json to workdir before container exits
- Activate venv before running evaluation
- Add real evaluation job submission instructions
- Add integration test documentation
- Document operational model (no pre-built Docker image)
- Clarify when admin action is/isn't needed
- Note shared evaluation logic across all runners
Backend integration:
- Register BuildkiteLauncher in create_backend() when BUILDKITE_API_TOKEN is set
- Add BUILDKITE_API_TOKEN, BUILDKITE_ORG, BUILDKITE_PIPELINE env vars
- Results now flow to database same as GitHub/Modal

Pre-built Docker image for fast cold starts:
- Add Dockerfile with all dependencies pre-installed
- Add build-image.sh script for local image building
- Add pipeline-fast.yml for using pre-built image (~5s vs ~40s cold start)
- Update setup-node-simple.sh with BUILD_IMAGE=true option

Update skills doc with operational model for both approaches
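A rough sketch of what that env-gated registration could look like; the constructor arguments and the `register_launcher` hook are assumptions for illustration, while the environment variable names and `create_backend()` come from the commit message above.

```python
import os

from libkernelbot.launchers import BuildkiteLauncher


def maybe_register_buildkite(backend) -> None:
    """Register a BuildkiteLauncher only when BUILDKITE_API_TOKEN is configured."""
    token = os.environ.get("BUILDKITE_API_TOKEN")
    if not token:
        return  # Buildkite support stays off when no API token is provided
    launcher = BuildkiteLauncher(            # constructor args are assumed here
        api_token=token,
        org=os.environ.get("BUILDKITE_ORG"),
        pipeline=os.environ.get("BUILDKITE_PIPELINE"),
    )
    backend.register_launcher(launcher)      # hypothetical registration hook
```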
- Add L40S_BK SM arch (89 - Ada Lovelace) to GPU_TO_SM mapping
- Document env vars by location:
  - Heroku/Backend: BUILDKITE_API_TOKEN, BUILDKITE_ORG, BUILDKITE_PIPELINE
  - GPU Nodes: BUILDKITE_AGENT_TOKEN (set by admin), auto-set vars
  - Jobs: KERNELBOT_* vars passed via API
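As a rough illustration of that mapping entry (the dictionary shape and key style are assumptions; only the GPU_TO_SM name, the L40S_BK key, and the value 89 come from the commit message above):

```python
# Assumed dict shape; only GPU_TO_SM, L40S_BK, and 89 (Ada Lovelace) come from
# the commit message. The real mapping may key on an enum member, not a string.
GPU_TO_SM = {
    # ...existing GPU entries...
    "L40S_BK": 89,  # Ada Lovelace
}
```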
This enables vendors to onboard GPU resources with proper per-GPU isolation using a single setup script that creates Buildkite agents.
It's alive