Augment the scheduler with resources to allow more fine-grained parallelism limitation #184

Merged
AlexJones0 merged 8 commits into lowRISC:master from AlexJones0:scheduler_resource_limits
Apr 17, 2026

Contributor

@AlexJones0 AlexJones0 commented Apr 16, 2026

This PR adds "resources" to the scheduler: a mechanism for finer-grained parallelism limits than the standard --max-parallel flag currently offers. Note: this PR is quite large; let me know if it needs to be split up for review.

The JobSpec model now lets each job declare the resources that it uses, as a mapping from resource name to a number. These resources are managed by a ResourceManager operated by the scheduler, which strictly ensures that running jobs never allocate more resources than are available. For now, resources are only defined for the various sim tools, but this could be extended to the other flows in the future (and will probably be easier once the flows/deploys are better refactored).
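
As a rough illustration of the allocation behaviour described above, here is a minimal sketch. It is not the code in this PR: the `try_allocate`/`release` method names, the all-or-nothing policy and the treatment of unlisted resources as unbounded are all assumptions made for the example.

```python
from typing import Dict, Mapping, Optional


class ResourceManager:
    """Hypothetical sketch of a resource manager; not the class from this PR."""

    def __init__(self, limits: Mapping[str, Optional[int]]) -> None:
        # Assumption: a limit of None (or a missing name) means "unbounded".
        self._limits = dict(limits)
        self._in_use: Dict[str, int] = {}

    def try_allocate(self, request: Mapping[str, int]) -> bool:
        """Allocate every requested resource, or none of them."""
        for name, amount in request.items():
            limit = self._limits.get(name)
            if limit is not None and self._in_use.get(name, 0) + amount > limit:
                return False
        for name, amount in request.items():
            self._in_use[name] = self._in_use.get(name, 0) + amount
        return True

    def release(self, request: Mapping[str, int]) -> None:
        """Return previously allocated resources to the pool."""
        for name, amount in request.items():
            self._in_use[name] = self._in_use.get(name, 0) - amount


manager = ResourceManager({"A": 2})
job_resources = {"A": 1}  # what a single job might declare
```

With a limit of `{"A": 2}`, two such jobs can hold resource `A` at once; a third allocation is refused until one of them calls `release`.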

Resources are determined via a ResourceProvider interface/protocol. Currently, this PR only implements static resources, where users can pass e.g. --resource A=20 --resource B=50 flags on the command line to define static resource limits {"A": 20, "B": 50}. The main goal of this interface is to allow future extension to more dynamic resources if needed, for example license or compute availability that is actively polled via some command. While no dynamic resources are implemented in this PR, the integration of resources into the scheduler is designed such that no changes should be needed if/when they are introduced.
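
To make the static case concrete, here is a hedged sketch of how repeated `--resource NAME=COUNT` flags could be collected into a static provider. Only the `--resource` flag and the `ResourceProvider` name come from this PR; the `limits()` method, the body of `StaticResourceProvider` and the `parse_resource` helper are hypothetical.

```python
import argparse
from typing import Mapping, Protocol, Tuple


class ResourceProvider(Protocol):
    """Assumed shape of the interface: something that reports limits."""

    def limits(self) -> Mapping[str, int]: ...


class StaticResourceProvider:
    """Limits fixed up front, e.g. from the command line."""

    def __init__(self, limits: Mapping[str, int]) -> None:
        self._limits = dict(limits)

    def limits(self) -> Mapping[str, int]:
        return dict(self._limits)


def parse_resource(text: str) -> Tuple[str, int]:
    """Parse one NAME=COUNT argument (hypothetical helper)."""
    name, sep, amount = text.partition("=")
    if not sep or not name or not amount.isdigit():
        raise argparse.ArgumentTypeError(f"expected NAME=COUNT, got {text!r}")
    return name, int(amount)


parser = argparse.ArgumentParser()
parser.add_argument("--resource", action="append", type=parse_resource, default=[])
args = parser.parse_args(["--resource", "A=20", "--resource", "B=50"])

provider: ResourceProvider = StaticResourceProvider(dict(args.resource))
```

A dynamic provider would satisfy the same protocol but compute its limits on demand, which is why the scheduler side would not need to change.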

See the individual commit messages for more details. Also relevant: see my local branch with some experimentation for dynamic resource availability.

Comment thread src/dvsim/scheduler/resources.py
@AlexJones0 AlexJones0 force-pushed the scheduler_resource_limits branch 2 times, most recently from 3bec966 to 4ce9dd9 April 16, 2026 19:21
Contributor Author

AlexJones0 commented Apr 16, 2026

Note: after some further discussion, I also dropped the commit that adds more granular per-tool license knowledge as part of the SimTool plugin. This might be nice eventually, but it is a bit too complex for now and doesn't give us any useful advantage (all useful sub-licenses, e.g. those for formal, would only be used by one job at a time, because those use the GUI). Once this PR is merged I'll create an issue to track this in case we find it useful to add in the future.

Instead we just treat the tool itself as the resource. This has the nice advantage that it generalizes to any other flow that also defines a tool (e.g. linting, I think).

This will eventually be used by the Scheduler to manage more in-depth
parallelism, where jobs will define resources and the scheduler will
have to respect parallelism limits on those resources.

The abstract `ResourceProvider` is designed in such a way that more
complicated resource provider implementations could be added in the
future (when compared to the StaticResourceProviders), with the ability
to eventually support dynamic resource allocation.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

This is just an idempotent constructor; it is confirmed via the arg type
(and by manually inspecting possible call sites) that this will always
be a `Path` already.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

`OrderedDict` is redundant in modern Python, so let's type it properly and
use modern conveniences. Likewise, we shouldn't be returning `dict_keys`
if we intend to return a `Sequence` in the output tuples.
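
A small sketch of the point being made here, using a hypothetical `split_results` function rather than the code touched by this commit: a plain `dict` keeps insertion order, and a `dict_keys` view has to be converted before it can be returned as a `Sequence`.

```python
from typing import Dict, Sequence, Tuple


def split_results(results: Dict[str, float]) -> Tuple[Sequence[str], Sequence[float]]:
    """Hypothetical helper returning names and values as real sequences."""
    # Plain dicts preserve insertion order (guaranteed since Python 3.7),
    # so OrderedDict is not needed just to keep the order stable.
    # results.keys() is a view, not a Sequence: it cannot be indexed,
    # so convert it to a list before returning.
    return list(results), list(results.values())


names, scores = split_results({"line": 91.0, "toggle": 85.5})
keys_view = {"line": 91.0}.keys()
```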

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

pyright is correct, we should be handling the case where
`cov_total is None` which is when the coverage summary doesn't contain
the expected "Score" metric. This should result in an appropriate error
from parsing.
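
The kind of guard this describes might look roughly like the following. This is a sketch under assumptions: the `parse_cov_total` name, the dictionary-shaped summary, the percentage format and the use of `ValueError` are all hypothetical and not taken from the actual parser.

```python
from typing import Mapping, Optional


def parse_cov_total(summary: Mapping[str, str]) -> float:
    """Extract the total coverage, failing loudly if "Score" is absent."""
    cov_total: Optional[str] = summary.get("Score")
    if cov_total is None:
        # Without this check a type checker rightly complains that
        # cov_total may be None when it is used below.
        raise ValueError('coverage summary does not contain the "Score" metric')
    # Assumed format: a percentage such as "87.5 %".
    return float(cov_total.rstrip("%").strip())
```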

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

Rather than use an `int` and make `0` an implementation-defined
"unbounded", it is nicer to explicitly support `max_parallelism=None`
to more clearly refer to this case.
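
For example, a capacity check written against an optional limit could look like this. The `can_launch` helper is hypothetical; only the `max_parallelism=None` convention comes from this commit, and how the real scheduler treats a limit of `0` is not shown here.

```python
from typing import Optional


def can_launch(running: int, max_parallelism: Optional[int] = None) -> bool:
    """Return True if one more job may start under the given limit."""
    if max_parallelism is None:
        # None explicitly means "no limit", with no magic sentinel value.
        return True
    return running < max_parallelism
```

In this sketch `0` simply means zero jobs may run, rather than doubling as a sentinel for "unbounded".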

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

@AlexJones0 AlexJones0 force-pushed the scheduler_resource_limits branch from 4ce9dd9 to e378ebf April 16, 2026 21:34

Integrate the previously introduced `ResourceManager` into the
scheduler. The scheduler now attempts to allocate resources for jobs
when it decides to run them, which are then released when the job
finishes execution (or fails to launch). All resources go through the
manager, which will fail to allocate them if there are not enough
resources to provide within the defined limits. If no resource limits
are defined, behaviour depends on the `ResourceManager` configuration,
but the default is to assume an unbounded limit.

At the start of the scheduler run, all jobs are validated against the
limits defined in the `ResourceManager`. For example, if static
resources are used and there exists some job whose resource requirements
cannot be satisfied by the defined limits, then this will be caught in
advance of execution and reported early as an error. If the resources
are dynamic, then this case only results in a warning.
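
The up-front validation described above can be sketched as follows. This is an illustration only: `validate_jobs`, its arguments and the `ValueError` are assumptions, and resources with no defined limit are treated as unbounded to match the default mentioned earlier.

```python
import logging
from typing import List, Mapping

log = logging.getLogger(__name__)


def validate_jobs(
    jobs: Mapping[str, Mapping[str, int]],
    limits: Mapping[str, int],
    dynamic: bool = False,
) -> List[str]:
    """Return the names of jobs whose requirements exceed the known limits."""
    unsatisfiable = [
        name
        for name, needs in jobs.items()
        # Resources without a defined limit are assumed unbounded here.
        if any(res in limits and amount > limits[res] for res, amount in needs.items())
    ]
    if unsatisfiable and not dynamic:
        # Static limits never change, so fail before anything is launched.
        raise ValueError(f"jobs can never be scheduled: {unsatisfiable}")
    for name in unsatisfiable:
        # Dynamic limits may change later, so only warn.
        log.warning("job %s currently exceeds the available resources", name)
    return unsatisfiable


jobs = {"small": {"A": 1}, "huge": {"A": 99}}
limits = {"A": 20}
```

Here `validate_jobs(jobs, limits)` raises because `huge` can never fit within `A=20`, while passing `dynamic=True` only logs a warning and returns `["huge"]`.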

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

This commit provides command-line options for creating and configuring a
`ResourceManager` with a static resource provider to use passed resource
limits. This can then be given to the scheduler to provide more
fine-grained parallelism limiting than is allowed by `--max-parallel`,
where jobs are now only scheduled such that the scheduler will always
respect the defined resource limits.

Note the TODO about Python 3.11 - when the minimum Python version is
bumped we can make the enum a StrEnum which has much better native str()
behaviour than the existing Enum type and removes some of the extra glue
code.
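
To illustrate the glue code in question, here is a generic sketch with hypothetical enum names (not the enum from this commit). A plain `Enum` prints as `ClassName.MEMBER`, so a `__str__` override is needed to get the bare value; `enum.StrEnum` (Python 3.11+) provides that behaviour natively.

```python
from enum import Enum


class PlainMode(Enum):
    """Hypothetical enum without any glue."""

    STATIC = "static"


class GluedMode(Enum):
    """Same enum with the kind of glue StrEnum would make unnecessary."""

    STATIC = "static"

    def __str__(self) -> str:
        # Print the bare value, e.g. for CLI help text and messages.
        return self.value


plain_text = str(PlainMode.STATIC)
glued_text = str(GluedMode.STATIC)
```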

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

Add some extra scheduler tests to cover the functionality of the new
resource-level parallelism feature that was introduced.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

@AlexJones0 AlexJones0 force-pushed the scheduler_resource_limits branch from e378ebf to 0aad34d April 16, 2026 22:51
@AlexJones0 AlexJones0 requested a review from machshev April 17, 2026 09:37
Collaborator

@machshev machshev left a comment

Thanks @AlexJones0!

@AlexJones0 AlexJones0 added this pull request to the merge queue Apr 17, 2026
Merged via the queue into lowRISC:master with commit 030a584 Apr 17, 2026
6 checks passed