feat: add system defaults to buildSlurmConf (TCL-5588)#23

Open
jhu-svg wants to merge 2 commits into slurm-1.0-together-changes from TCL-5588/system-defaults-in-buildSlurmConf

Conversation

jhu-svg commented Apr 20, 2026

Summary

Add system-critical Slurm defaults to the Slinky operator so that every cluster (IC and BM) gets them automatically. These are infrastructure configuration in the same category as AuthType=auth/slurm: always present, managed outside extraConf, so users cannot accidentally delete them.

What is added

slurm.conf (buildSlurmConf)

### SYSTEM DEFAULTS ###
UnkillableStepTimeout=600
HealthCheckInterval=60
HealthCheckNodeState=ANY
HealthCheckProgram=/usr/bin/gpu_healthcheck.sh

These appear before the ### EXTRA CONFIG ### section, so user extraConf can still override any value if needed (Slurm uses the last value set in the config).

cgroup.conf (buildCgroupConf)

ConstrainRAMSpace=yes

Without this, --mem in sbatch is just a scheduling hint. With ConstrainRAMSpace=yes, Slurm enforces memory limits via cgroups — jobs that exceed their allocation get killed cleanly by Slurm instead of triggering the kernel OOM killer.
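For reference, the generated cgroup.conf fragment would look roughly like this (illustrative sketch; ConstrainRAMSpace=yes is the only line this PR adds):

```ini
### cgroup.conf (illustrative sketch)
# Enforce sbatch --mem via cgroups: jobs that exceed their
# allocation are killed by Slurm, not by the kernel OOM killer.
ConstrainRAMSpace=yes
```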

Why

  • UnkillableStepTimeout=600: Prevents "kill task failed" node drains (the default of 60s is too short for GPU job teardown). FA had 20 nodes drain from this.
  • HealthCheckProgram: Auto-resumes nodes drained for "kill task failed" if GPUs healthy. Tested on QA + staging.
  • ConstrainRAMSpace=yes: Tangible confirmed that --mem=500GB did not prevent OOM kills without this. Universal need: every cluster should enforce memory limits.
  • Covers IC and BM: Both use the Slinky operator. The cluster operator only covers IC.
  • No annotation gate needed: Slinky operator rebuilds slurm.conf natively on any Controller CR change.

What is NOT hardcoded (intentionally)

  • MemSpecLimit: Reserves memory for the OS per node (in MB). The right value depends on node memory size (200Gi for 1Ti nodes, 300Gi for 2Ti nodes). Set per-cluster via extraConf/UI.
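As an illustration, a per-cluster extraConf entry for this might look like the following (hypothetical values; 200Gi converted to MB, since MemSpecLimit takes megabytes):

```ini
# Hypothetical per-cluster extraConf entry (NOT part of this PR):
# reserve ~200Gi for the OS on a 1Ti node.
MemSpecLimit=204800
```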

Requires

  • v1.0.7+ worker images for HealthCheckProgram (/usr/bin/gpu_healthcheck.sh must exist). Pre-v1.0.7 workers will log harmless warnings every 60s.

Testing

  • QA (Apr 16): HealthCheckProgram verified — drain "kill task failed" auto-resumed in <60s, maintenance drain NOT resumed.
  • Staging (Apr 15): Same test via /data workaround before image bake.
  • ConstrainRAMSpace verified on Suno cluster via configFileRefs patch (Notion runbook).

Follow-up

After this ships in a new chart version, remove slurmHealthCheckConf() from the cluster operator buildSlurmExtraConf (currently duplicated there via PR #472).

One-pager: https://www.notion.so/345b878aad1a81148a86c29cdb2c7d3b
Ticket: TCL-5588

jhu-svg added 2 commits April 20, 2026 15:11
Add UnkillableStepTimeout=600, HealthCheckInterval=60,
HealthCheckNodeState=ANY, HealthCheckProgram to the generated
slurm.conf as system defaults. These appear before ### EXTRA CONFIG ###
so user extraConf can override if needed (Slurm uses last value).

Covers both IC and BM clusters since both use the Slinky operator.
No annotation gate needed — Slinky operator rebuilds slurm.conf
natively on any Controller CR change.

Requires v1.0.7+ worker images (gpu_healthcheck.sh must exist at
/usr/bin/gpu_healthcheck.sh). Pre-v1.0.7 workers will log harmless
"HealthCheckProgram not found" warnings.

Made-with: Cursor
Without this, --mem in sbatch is just a scheduling hint — Slurm
doesn't enforce memory limits. With ConstrainRAMSpace=yes, jobs
that exceed their memory allocation get killed by Slurm instead
of triggering the kernel OOM killer.

MemSpecLimit is NOT added as a default because the right value
depends on node memory size (per-cluster tuning via extraConf).

Made-with: Cursor