Thank you for your interest in BC-Bench.
BC-Bench is open source, and you're welcome to fork and adapt it for your own use. We are not accepting external contributions in this repository at this time.
The instructions below are for teams that fork BC-Bench and replace the dataset with their own tasks.
A high-level overview of the repository structure:

```
BC-Bench/
├── src/bcbench/   # Evaluation harness — agent orchestration, build/test pipeline, results
├── dataset/       # Benchmark dataset tasks
├── scripts/       # Scripts for container setup & test execution; not needed for local development
├── notebooks/     # Analysis and visualization of results
├── evaluator/     # Braintrust scorer integration, used only when uploading results to Braintrust
└── docs/          # GitHub Pages for the leaderboard site
```
Prerequisites:
```shell
# Folder layout example
# C:\depot\BCApps    -> cloned evaluation target repository
# C:\depot\BC-Bench  -> your fork of this repo
gh repo fork microsoft/BC-Bench --clone
cd BC-Bench

# Install Python
uv python install

# Install dependencies
uv sync --all-groups

# Install pre-commit hooks
uv run pre-commit install

# Show CLI help
uv run bcbench --help

# Run Copilot CLI on a single task (generate patch only, no build/test)
# This is very fast, give it a go and see it live!
uv run bcbench run copilot microsoft__BCApps-5633 --category bug-fix --repo-path /path/to/BCApps

# Run tests
uv run pytest --cov=src/bcbench --cov-report=term-missing

# Lint and format
uv run pre-commit run --all-files
```

BC-Bench uses semantic versioning to track changes that may affect evaluation results. The version is stored in `pyproject.toml` and automatically embedded in all evaluation results.
| Version Bump | Change Type | Examples |
|---|---|---|
| Major (X.0.0) | Dataset changes, evaluation methodology changes | Adding/removing benchmark entries, changing pass criteria |
| Minor (0.X.0) | Tooling updates that may affect results | Bumping the GitHub Copilot CLI, changing agent prompts |
| Patch (0.0.X) | Bug fixes, documentation | Fixing a parsing bug, updating docs |
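The bump rules in the table follow standard semantic versioning and can be sketched as a tiny helper (illustrative only; this function is not part of the BC-Bench codebase):

```python
def bump(version: str, change: str) -> str:
    """Apply a semantic-version bump per the policy table.

    change: "major" (dataset/methodology changes), "minor" (tooling updates),
    or "patch" (bug fixes and documentation).
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

# e.g. a dataset change to benchmark version 1.1.2 yields 2.0.0
print(bump("1.1.2", "major"))  # → 2.0.0
```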
Results from different benchmark versions cannot be aggregated together. When you run `bcbench result update`, the system raises an error if you try to combine runs with different `benchmark_version` values. This ensures the leaderboard always compares apples to apples. When bumping versions:
- Update the version in `pyproject.toml`
- Create a GitHub release with release notes describing the changes
- Clear old results from `docs/_data/*.json` if needed
- Re-run evaluations with the new version
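The mixed-version guard described above can be sketched as follows. This is a minimal illustration, not the actual harness code; only the `benchmark_version` field name comes from the text above, and the flat list-of-dicts result shape is an assumption:

```python
def check_same_benchmark_version(runs: list[dict]) -> str:
    """Raise if runs span more than one benchmark_version; return the common one."""
    versions = {run["benchmark_version"] for run in runs}
    if len(versions) != 1:
        raise ValueError(
            f"cannot aggregate runs across benchmark versions: {sorted(versions)}"
        )
    return versions.pop()

# Runs from the same benchmark version aggregate cleanly
runs = [{"benchmark_version": "1.1.2"}, {"benchmark_version": "1.1.2"}]
print(check_same_benchmark_version(runs))  # → 1.1.2
```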
Below are the steps for updating a coding agent's version, which is usually needed when a new model is released. A similar process applies to bumping the AL MCP version: search for "Microsoft.Dynamics.BusinessCentral.Development.Tools" to identify the files to modify.
- Find the corresponding workflow file: `.github/workflows/<agent-name>-evaluation.yml`
- In the file, find the step that installs the coding agent (e.g. "Install GitHub Copilot")
- Manually change the hardcoded version (it's by design that the version is hardcoded)
- When you are done, bump BC-Bench's version in `pyproject.toml` following the Versioning Policy
- Commit your changes and merge into the `main` branch
- Create a new release
You usually need to bump the coding agent's version first to be able to use a newly released model.

- Find the corresponding workflow file: `.github/workflows/<agent-name>-evaluation.yml`
- Add the model as a new input option in the `workflow_dispatch` trigger
- Add the model to the corresponding list in `cli_options.py`
- Commit your changes and merge into the `main` branch
- Do a test run before a full one
After bumping the version in `pyproject.toml` following the Versioning Policy, push your changes and create a new release. The process is straightforward; if you are not sure, check previous releases for reference.

- Create a new tag matching the version in `pyproject.toml` (e.g. `v1.1.2`)
- The title can simply be the same as the newly created tag
- Describe what has changed since the last release; mention only things that might affect evaluation results