feat: add system defaults to buildSlurmConf (TCL-5588)#23
Open
jhu-svg wants to merge 2 commits intoslurm-1.0-together-changesfrom
Open
feat: add system defaults to buildSlurmConf (TCL-5588)#23jhu-svg wants to merge 2 commits intoslurm-1.0-together-changesfrom
jhu-svg wants to merge 2 commits intoslurm-1.0-together-changesfrom
Conversation
Add UnkillableStepTimeout=600, HealthCheckInterval=60, HealthCheckNodeState=ANY, HealthCheckProgram to the generated slurm.conf as system defaults. These appear before ### EXTRA CONFIG ### so user extraConf can override if needed (Slurm uses last value). Covers both IC and BM clusters since both use the Slinky operator. No annotation gate needed — Slinky operator rebuilds slurm.conf natively on any Controller CR change. Requires v1.0.7+ worker images (gpu_healthcheck.sh must exist at /usr/bin/gpu_healthcheck.sh). Pre-v1.0.7 workers will log harmless "HealthCheckProgram not found" warnings. Made-with: Cursor
Without this, --mem in sbatch is just a scheduling hint — Slurm doesn't enforce memory limits. With ConstrainRAMSpace=yes, jobs that exceed their memory allocation get killed by Slurm instead of triggering the kernel OOM killer. MemSpecLimit is NOT added as a default because the right value depends on node memory size (per-cluster tuning via extraConf). Made-with: Cursor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add system-critical Slurm defaults to the Slinky operator so every cluster (IC and BM) gets them automatically. These are infrastructure config in the same category as AuthType=auth/slurm — always there, not in extraConf, users can not accidentally delete them.
What is added
slurm.conf (buildSlurmConf)
Appears before EXTRA CONFIG, so user extraConf can override any value if needed (Slurm uses last value in config).
cgroup.conf (buildCgroupConf)
Without this, --mem in sbatch is just a scheduling hint. With ConstrainRAMSpace=yes, Slurm enforces memory limits via cgroups — jobs that exceed their allocation get killed cleanly by Slurm instead of triggering the kernel OOM killer.
Why
What is NOT hardcoded (intentionally)
Requires
Testing
Follow-up
After this ships in a new chart version, remove slurmHealthCheckConf() from the cluster operator buildSlurmExtraConf (currently duplicated there via PR #472).
One-pager: https://www.notion.so/345b878aad1a81148a86c29cdb2c7d3b
Ticket: TCL-5588