Augment the scheduler with resources to allow more fine-grained parallelism limitation #184
Merged
AlexJones0 merged 8 commits into lowRISC:master on Apr 17, 2026
machshev reviewed Apr 16, 2026
3bec966 to 4ce9dd9
Note: after some further discussion, I also dropped the commit that adds more granular per-tool license knowledge. Instead, we just treat the tool itself as the resource. For now, this has the nice advantage that it generalizes to any other flows that also define a tool (e.g. linting, I think).
This will eventually be used by the Scheduler to manage more in-depth parallelism, where jobs will define resources and the scheduler will have to respect parallelism limits on those resources. The abstract `ResourceProvider` is designed in such a way that more complicated resource provider implementations could be added in the future (when compared to the StaticResourceProviders), with the ability to eventually support dynamic resource allocation.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
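To make the shape of that design concrete, here is a minimal sketch of an abstract provider with a static implementation. The method name `get_limit`, the constructor signature, and the "missing means unbounded" behaviour are assumptions for illustration only; they are not taken from the PR's code.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from collections.abc import Mapping


class ResourceProvider(ABC):
    """Hypothetical interface: reports how much of each resource exists."""

    @abstractmethod
    def get_limit(self, resource: str) -> int | None:
        """Return the current limit for `resource`, or None if unbounded."""


class StaticResourceProvider(ResourceProvider):
    """Limits are fixed up front and never change during a run."""

    def __init__(self, limits: Mapping[str, int]) -> None:
        # Copy so later changes to the caller's mapping have no effect.
        self._limits = dict(limits)

    def get_limit(self, resource: str) -> int | None:
        # Assumption: a resource with no configured limit is unbounded.
        return self._limits.get(resource)


provider = StaticResourceProvider({"A": 20, "B": 50})
```

A dynamic provider could implement the same method by polling an external command each time it is asked, which is why the scheduler only needs to depend on the abstract type.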
This is just an idempotent constructor; it is confirmed via the arg type (and by manually inspecting possible call sites) that this will always be a `Path` already.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
`OrderedDict` is redundant in modern Python, so let's type it properly and use modern conveniences. Likewise, we shouldn't be returning `dict_keys` if we intend to return a `Sequence` in the output tuples.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
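As a small illustration of both points, the sketch below uses a plain `dict` and converts its views into real sequences before returning them. The function name `split_results` and the sample data are hypothetical and do not come from the PR.

```python
from __future__ import annotations

from collections.abc import Sequence


def split_results(results: dict[str, float]) -> tuple[Sequence[str], Sequence[float]]:
    """Split a mapping into (names, values), preserving insertion order."""
    # Plain dicts preserve insertion order (guaranteed since Python 3.7),
    # so OrderedDict is not needed just to keep ordering.
    # list(...) turns the dict_keys / dict_values views into real sequences.
    return list(results), list(results.values())


names, values = split_results({"line": 91.5, "toggle": 80.0})
```

A `dict_keys` view is set-like rather than a `Sequence`, so returning it from a function annotated with `Sequence` is a genuine type error and not just a stylistic nit.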
pyright is correct, we should be handling the case where `cov_total is None` which is when the coverage summary doesn't contain the expected "Score" metric. This should result in an appropriate error from parsing.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
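The pattern being described is roughly the following. This is only a sketch: the function name `parse_cov_total`, the use of a plain dict for the summary, and the choice of `ValueError` are assumptions, not the PR's actual parsing code.

```python
from __future__ import annotations


def parse_cov_total(summary: dict[str, float]) -> float:
    """Return the total coverage score, or raise if it is missing."""
    cov_total = summary.get("Score")
    if cov_total is None:
        # Fail loudly during parsing instead of passing None downstream.
        raise ValueError('coverage summary does not contain a "Score" metric')
    return cov_total


total = parse_cov_total({"Score": 87.5, "Line": 90.0})
```

After the explicit `None` check, a type checker can narrow `cov_total` to `float`, which is what resolves the original complaint.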
Rather than use an `int` and make `0` an implementation-defined "unbounded", it is nicer to explicitly support `max_parallelism=None` to more clearly refer to this case.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
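A minimal sketch of what this looks like at a call site, assuming a hypothetical helper `can_launch` (the real scheduler check is not shown in this conversation):

```python
from __future__ import annotations


def can_launch(running: int, max_parallelism: int | None = None) -> bool:
    """Decide whether one more job may start."""
    if max_parallelism is None:
        # None explicitly means "no limit"; no magic integer is needed.
        return True
    return running < max_parallelism


unbounded = can_launch(running=100)
at_limit = can_launch(running=2, max_parallelism=2)
```

With this convention `0` has no special meaning, so in this sketch `max_parallelism=0` simply never admits a job.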
4ce9dd9 to e378ebf
Integrate the previously introduced `ResourceManager` into the scheduler. The scheduler now attempts to allocate resources for jobs when it decides to run them, which are then released when the job finishes execution (or fails to launch). All resources go through the manager, which will fail to allocate them if there are not enough resources to provide within the defined limits. If no resource limits are defined, behaviour depends on the `ResourceManager` configuration, but the default is to assume an unbounded limit.

At the start of the scheduler run, all jobs are validated against the limits defined in the `ResourceManager`. For example, if static resources are used and there exists some job whose resource requirements cannot be satisfied by the defined limits, then this will be caught in advance of execution and reported early as an error. If the resources are dynamic, then this case only results in a warning.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
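The allocate/release/validate cycle described above could be sketched as follows. The method names (`validate`, `try_allocate`, `release`) and the constructor taking a plain mapping of limits are assumptions made for this example; the real `ResourceManager` works through a provider and may look quite different.

```python
from __future__ import annotations

from collections.abc import Mapping


class ResourceManager:
    """Hypothetical manager tracking usage against static limits."""

    def __init__(self, limits: Mapping[str, int]) -> None:
        self._limits = dict(limits)
        self._in_use: dict[str, int] = {}

    def validate(self, request: Mapping[str, int]) -> list[str]:
        """Return the resources this request could never obtain."""
        return [
            name
            for name, amount in request.items()
            if name in self._limits and amount > self._limits[name]
        ]

    def try_allocate(self, request: Mapping[str, int]) -> bool:
        """Allocate everything in `request`, or nothing at all."""
        for name, amount in request.items():
            limit = self._limits.get(name)  # None is treated as unbounded
            if limit is not None and self._in_use.get(name, 0) + amount > limit:
                return False
        for name, amount in request.items():
            self._in_use[name] = self._in_use.get(name, 0) + amount
        return True

    def release(self, request: Mapping[str, int]) -> None:
        """Give back resources from an earlier successful allocation."""
        for name, amount in request.items():
            self._in_use[name] -= amount


manager = ResourceManager({"A": 2})
first = manager.try_allocate({"A": 2})   # fits within the limit
second = manager.try_allocate({"A": 1})  # would exceed the limit
manager.release({"A": 2})                # first job finished
third = manager.try_allocate({"A": 1})   # fits again after release
impossible = manager.validate({"A": 3})  # can never fit with limit 2
```

Checking the whole request before mutating any counters keeps allocation all-or-nothing, so a job that fails to get one resource does not hold on to another.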
This commit provides command-line options for creating and configuring a `ResourceManager` with a static resource provider to use passed resource limits. This can then be given to the scheduler to provide more fine-grained parallelism limiting than is allowed by `--max-parallel`, where jobs are now only scheduled such that the scheduler will always respect the defined resource limits.

Note the TODO about Python 3.11: when the minimum Python version is bumped we can make the enum a `StrEnum`, which has much better native `str()` behaviour than the existing `Enum` type and removes some of the extra glue code.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
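For readers unfamiliar with the glue code in question: a plain `Enum` member stringifies as `ClassName.MEMBER`, so showing the bare value requires overriding `__str__`. The sketch below shows that workaround; the enum name `ProviderKind` and its single member are hypothetical, since the PR's actual enum is not shown here.

```python
from enum import Enum


class ProviderKind(Enum):
    """Hypothetical enum of resource provider kinds."""

    STATIC = "static"

    def __str__(self) -> str:
        # Glue code: without this override, str() gives "ProviderKind.STATIC".
        # On Python 3.11+, deriving from enum.StrEnum makes this unnecessary.
        return self.value


kind = ProviderKind("static")  # look a member up from its string value
label = str(kind)
```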
Add some extra scheduler tests to cover the functionality of the new resource-level parallelism feature that was introduced.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
e378ebf to 0aad34d
This PR contains the implementation of "resources" for the scheduler, which are essentially a mechanism for more fine-grained parallelism limits than what is currently offered by the standard `--max-parallel` flag. Note: This PR is quite large, let me know if it needs to be split up for review.

The `JobSpec` model is now changed so that each job can declare the resources that it uses, as a mapping of a resource name to some number. These resources are managed by a `ResourceManager` which is operated by the scheduler, which will strictly ensure that the running jobs do not try to allocate more resources than are available. For now, resources are only defined for the various different sim tools, but this could be extended to the other flows in the future (and will probably be easier if the flows/deploys are better refactored).

Resources are determined via a `ResourceProvider` interface/protocol. Currently, this PR only implements static resources, where users can pass e.g. `--resource A=20 --resource B=50` flags on the command line to define static resource limits `{"A": 20, "B": 50}`. The main goal of this interface is that it can be extended in the future to support more dynamic resources if needed, for example license or compute availability that is actively polled via some command. While no dynamic resources are implemented in this PR, the integration of resources into the scheduler is designed such that no changes should be needed if/when they are introduced.

See the individual commit messages for more details. Also relevant: see my local branch with some experimentation for dynamic resource availability.
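For illustration, repeated `NAME=LIMIT` flags can be folded into a limits mapping with `argparse` roughly like this. The `--resource` flag and the resulting mapping come from the description above; the helper `parse_resource`, the error messages, and the parser setup are assumptions and not the PR's actual CLI code.

```python
from __future__ import annotations

import argparse


def parse_resource(text: str) -> tuple[str, int]:
    """Parse one NAME=LIMIT argument into a (name, limit) pair."""
    name, sep, amount = text.partition("=")
    if not sep or not name:
        raise argparse.ArgumentTypeError(f"expected NAME=LIMIT, got {text!r}")
    try:
        return name, int(amount)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"limit must be an integer, got {amount!r}"
        ) from None


parser = argparse.ArgumentParser()
parser.add_argument(
    "--resource",
    action="append",      # collect every occurrence of the flag
    type=parse_resource,  # convert each occurrence to (name, limit)
    default=[],
    metavar="NAME=LIMIT",
)

args = parser.parse_args(["--resource", "A=20", "--resource", "B=50"])
limits = dict(args.resource)
```

The resulting `limits` mapping is the kind of value a static provider would be constructed from.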