feat(server,driver-vm,e2e): gateway-owned readiness + VM compute driver e2e #901

Open
drew wants to merge 4 commits into feat/supervisor-session-grpc-data from drew/vm-driver-install-hangs-on-startup

Conversation

@drew
Collaborator

@drew drew commented Apr 21, 2026

Summary

Makes the VM compute driver end-to-end path work on top of the supervisor-initiated relay in #867, and moves the authoritative "sandbox is Ready" transition from each compute driver onto the gateway. The smoke test against openshell-gateway --drivers vm (mise run e2e:vm) goes from hanging at 180s to passing in ~10s.

Related Issue

Stacked on top of #867. No issue link.

Changes

feat(server): promote sandbox phase on supervisor session connect

  • New SupervisorSessionObserver trait. SupervisorSessionRegistry invokes the observer on register / remove_if_current outside the internal mutex.
  • ComputeRuntime::install_supervisor_observer wires a ComputeSessionObserver bridge; the runtime holds a Weak<SupervisorSessionRegistry> to break the Arc cycle between registry and observer.
  • New mark_sandbox_session_connected / mark_sandbox_session_disconnected flip phase and rewrite the Ready condition with reason=SupervisorConnected / SupervisorDisconnected. Terminal states (Deleting, Error) are preserved.
  • Backfill path in apply_sandbox_update_locked handles the register-before-store race: if a driver snapshot arrives and the registry already holds a live session for that sandbox, phase is promoted on the spot.
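The promote/demote rules in the bullets above can be sketched as a small state machine. This is an illustration only: the `Sandbox` struct, its field names, and the boolean return values are hypothetical, and the real mark_sandbox_session_connected / mark_sandbox_session_disconnected operate on the persisted record rather than an in-memory struct.

```rust
/// Phases named in this PR; the real enum may have more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Unknown,
    Provisioning,
    Ready,
    Deleting,
    Error,
}

/// Hypothetical stand-in for the persisted sandbox record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Sandbox {
    pub phase: Phase,
    /// Reason on the Ready condition, e.g. "SupervisorConnected".
    pub ready_reason: Option<&'static str>,
}

/// Promote to Ready when the supervisor session connects.
/// Returns true if the record changed.
pub fn mark_session_connected(sb: &mut Sandbox) -> bool {
    // Only Provisioning / Unknown are promoted, so terminal states
    // (Deleting, Error) and an already-Ready sandbox are left alone.
    if !matches!(sb.phase, Phase::Provisioning | Phase::Unknown) {
        return false;
    }
    sb.phase = Phase::Ready;
    sb.ready_reason = Some("SupervisorConnected");
    true
}

/// Demote back to Provisioning when the session ends.
/// Assumption: only a Ready sandbox is demoted.
pub fn mark_session_disconnected(sb: &mut Sandbox) -> bool {
    if sb.phase != Phase::Ready {
        return false;
    }
    sb.phase = Phase::Provisioning;
    sb.ready_reason = Some("SupervisorDisconnected");
    true
}
```

Returning a "changed" flag is one way to keep a repeated register idempotent: the caller can skip the store write and the watch-bus notification when nothing moved.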

refactor(driver-vm): drop log-grep readiness; always run gvproxy

  • Delete guest_ssh_ready() and ready_condition(). The driver no longer owns Ready; monitor_sandbox only surfaces Error for launcher-process failures.
  • Critical fix: runtime.rs now starts gvproxy unconditionally. With the SSH port forward removed in #867, port_map was empty by default, which skipped gvproxy startup entirely and left the guest with no eth0 and no route to the host gateway. The guest supervisor's ConnectSupervisor stream needs gvproxy to reach host.containers.internal (rewritten to 192.168.127.1 inside the guest).
  • Remove dead VmContext::set_port_map; mark the libkrun FFI binding #[allow(dead_code)].

e2e(vm): run smoke against openshell-gateway with the VM compute driver

  • Rewrite e2e/rust/e2e-vm.sh for the split-binary flow (former openshell-vm K8s-in-a-VM binary is gone).
  • Pin --driver-dir target/debug so the gateway picks up the freshly cargo-built driver rather than a stale ~/.local/libexec/openshell/openshell-driver-vm from a prior install-vm.sh run.
  • Anchor per-run state under /tmp: the AF_UNIX socket path limit (sun_path) is 104 bytes on macOS, and worktree paths routinely exceed it.
  • On failure, preserve the state dir and dump the gateway log + every sandbox's rootfs-console.log inline for post-mortem.
  • Drop build:docker:gateway and vm:build dependencies from tasks/test.toml's e2e:vm.

Testing

  • mise run pre-commit passes (lint + format + license headers clean; clippy warnings unchanged from baseline)
  • Unit tests added/updated
    • openshell-server lib: 255 pass (+8 compute promotion tests, +4 registry observer tests)
    • openshell-driver-vm lib: 17 pass
    • openshell-server integration (supervisor_relay_integration): 6 pass
  • E2E tests added/updated: mise run e2e:vm passes in ~10s, stable across back-to-back runs

Checklist

Notes for the reviewer

  • Base is feat/supervisor-session-grpc-data, not main. Merge order: land #867 first, then rebase this onto main.
  • The gvproxy-always change in driver-vm/runtime.rs could arguably belong in #867 itself: it fixes a latent bug in that PR (VMs have no network after the SSH port forward was dropped). Happy to split that hunk off into a separate commit against feat/supervisor-session-grpc-data directly if that's the preferred landing path.
  • Readiness semantics change: before = "sshd is bound", after = "supervisor→gateway session is live". The latter is what clients actually need before opening a relay, so this is a strict improvement. The Kubernetes driver is unaffected because its own Ready=True from kubelet still wins (the gateway override only applies when phase is Provisioning / Unknown).

drew added 3 commits April 20, 2026 18:45
The gateway now owns the Ready transition for sandboxes. When a
ConnectSupervisor RPC registers a session in SupervisorSessionRegistry,
ComputeRuntime promotes the sandbox record from Provisioning to Ready
with reason=SupervisorConnected. When the session ends, the sandbox
is demoted back to Provisioning with reason=SupervisorDisconnected.

This replaces compute-driver-specific liveness probes (log grep, TCP
poll) with the authoritative signal that the relay plane for SSH and
exec is live end-to-end. Drivers that still want to report Error
conditions (e.g. the VM driver on a dead launcher process) continue
to do so; only the Ready transition moves.

Wiring is observer-based: SupervisorSessionRegistry holds an optional
trait object installed by ComputeRuntime::install_supervisor_observer
during server startup. Session register/remove_if_current invoke
on_session_connected/on_session_disconnected off the internal lock;
the observer spawns async tasks to update the persisted sandbox record
and notify the watch bus.
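
A minimal sketch of the observer wiring described above, using only the standard library. The trait and method names come from this commit message; the signatures, the `u64` session id, the `install_observer` name, and the choice to fire the connect callback only on the first registration are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Callbacks fired when a supervisor session appears or goes away.
pub trait SupervisorSessionObserver: Send + Sync {
    fn on_session_connected(&self, sandbox_id: &str);
    fn on_session_disconnected(&self, sandbox_id: &str);
}

#[derive(Default)]
struct Inner {
    /// sandbox id -> id of the session currently considered live
    sessions: HashMap<String, u64>,
    observer: Option<Arc<dyn SupervisorSessionObserver>>,
}

#[derive(Default)]
pub struct SupervisorSessionRegistry {
    inner: Mutex<Inner>,
}

impl SupervisorSessionRegistry {
    pub fn install_observer(&self, observer: Arc<dyn SupervisorSessionObserver>) {
        self.inner.lock().unwrap().observer = Some(observer);
    }

    /// Register (or supersede) the session for a sandbox.
    pub fn register(&self, sandbox_id: &str, session_id: u64) {
        let (observer, newly_connected) = {
            let mut inner = self.inner.lock().unwrap();
            let previous = inner.sessions.insert(sandbox_id.to_owned(), session_id);
            (inner.observer.clone(), previous.is_none())
        }; // the lock is released here, before the callback runs
        if newly_connected {
            if let Some(obs) = observer {
                obs.on_session_connected(sandbox_id);
            }
        }
    }

    /// Remove the session only if `session_id` is still the current one,
    /// so a superseded session cannot tear down its replacement.
    pub fn remove_if_current(&self, sandbox_id: &str, session_id: u64) -> bool {
        let observer = {
            let mut inner = self.inner.lock().unwrap();
            if inner.sessions.get(sandbox_id) != Some(&session_id) {
                return false;
            }
            inner.sessions.remove(sandbox_id);
            inner.observer.clone()
        };
        if let Some(obs) = observer {
            obs.on_session_disconnected(sandbox_id);
        }
        true
    }
}
```

Cloning the `Arc` inside the critical section and calling it afterwards is what keeps the callback off the lock: an observer that re-enters the registry cannot deadlock on the mutex.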

Backfill path in apply_sandbox_update_locked handles the
register-before-store race: when a driver snapshot arrives and the
registry already has a live session for that sandbox, the phase is
promoted on the spot.
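
The register-before-store race can be illustrated with a toy model. The `State` struct and its fields are hypothetical stand-ins for the sandbox store and the session registry; only the idea (check for a live session when a driver snapshot is applied) is taken from the text above.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Unknown,
    Provisioning,
    Ready,
    Deleting,
    Error,
}

/// Hypothetical in-memory stand-ins for the store and the registry.
#[derive(Default)]
pub struct State {
    pub store: HashMap<String, Phase>,
    pub live_sessions: HashSet<String>,
}

impl State {
    /// Supervisor connected. If the driver has not reported the sandbox
    /// yet there is no record to promote; the session is only remembered.
    pub fn on_session_connected(&mut self, id: &str) {
        self.live_sessions.insert(id.to_owned());
        if let Some(phase) = self.store.get_mut(id) {
            promote(phase);
        }
    }

    /// Driver snapshot arrived. Backfill: if a session is already live,
    /// promote before storing instead of waiting for another event.
    pub fn apply_sandbox_update(&mut self, id: &str, driver_phase: Phase) {
        let mut phase = driver_phase;
        if self.live_sessions.contains(id) {
            promote(&mut phase);
        }
        self.store.insert(id.to_owned(), phase);
    }
}

/// Promote only non-terminal, not-yet-ready phases.
fn promote(phase: &mut Phase) {
    if matches!(*phase, Phase::Provisioning | Phase::Unknown) {
        *phase = Phase::Ready;
    }
}
```

Without the check in `apply_sandbox_update`, a sandbox whose supervisor connected first would sit in Provisioning until the session reconnected.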

Adds 8 unit tests covering promote/demote transitions, terminal-state
no-ops, idempotent re-register, and the backfill race. Adds 4
registry-level tests covering observer fire-once semantics and
supersede-race guards.

The VM driver no longer owns the Ready transition — the gateway-side
SupervisorSessionObserver now promotes sandboxes to Ready when their
supervisor session connects. Remove guest_ssh_ready() (a brittle
grep over the serial console) and the ready_condition() helper.
monitor_sandbox still watches the launcher child process and emits
Error conditions on ProcessExited / ProcessPollFailed.
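
As a rough illustration of what monitor_sandbox is left with, the sketch below maps one poll of the launcher child to an optional error. The `LauncherError` type, its fields, and `classify_poll` are hypothetical; only the ProcessExited / ProcessPollFailed names come from this commit.

```rust
use std::io;
use std::process::ExitStatus;

/// Hypothetical error conditions, named after the reasons in this commit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LauncherError {
    ProcessExited { code: Option<i32> },
    ProcessPollFailed { message: String },
}

/// Classify the result of `std::process::Child::try_wait()`.
/// `None` means the launcher is still running and nothing is reported;
/// Ready is no longer decided here.
pub fn classify_poll(poll: io::Result<Option<ExitStatus>>) -> Option<LauncherError> {
    match poll {
        Ok(None) => None,
        Ok(Some(status)) => Some(LauncherError::ProcessExited {
            code: status.code(),
        }),
        Err(e) => Some(LauncherError::ProcessPollFailed {
            message: e.to_string(),
        }),
    }
}
```

A monitor loop would call `classify_poll(child.try_wait())` on a timer and surface any `Some(..)` as an Error condition.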

Also always start gvproxy, not just when port_map is non-empty. With
the supervisor-initiated relay migration in #867, the SSH port forward
was dropped; that left port_map empty in the default path, which in
turn skipped gvproxy startup, which left the guest with no eth0 and
no route to the host gateway. The guest supervisor's outbound
ConnectSupervisor stream needs gvproxy to reach
host.containers.internal (rewritten to 192.168.127.1 inside the guest),
so gvproxy is structurally required for any sandbox that talks to
the gateway.

Inline the gvproxy setup into an unconditional block that returns
(guard, api_sock, forwarded_port_map), dropping the mutable plumbing
the prior conditional form needed. Remove the now-dead
VmContext::set_port_map wrapper; mark its libkrun FFI binding
#[allow(dead_code)] so a future reintroduction doesn't need to touch
the symbol table.

Rewrite e2e/rust/e2e-vm.sh for the split-binary flow (openshell-gateway
+ openshell-driver-vm) now that the former openshell-vm K8s-in-a-VM
binary is gone. The new flow:

  1. Stage the embedded VM runtime (libkrun + gvproxy + base rootfs)
     via mise run vm:setup and mise run vm:rootfs -- --base, both
     idempotent and run only when artifacts are missing.
  2. Build openshell-gateway, openshell-driver-vm, and the openshell
     CLI from the current workspace with cargo.
  3. On macOS, codesign the driver with the Hypervisor.framework
     entitlement so libkrun can start the microVM.
  4. Start the gateway with --drivers vm --disable-tls
     --disable-gateway-auth --db-url sqlite::memory:, pinning
     --driver-dir target/debug so the gateway picks up the freshly
     built driver rather than ~/.local/libexec/openshell from a
     prior install-vm.sh run.
  5. Wait for 'Server listening', run the cluster-agnostic Rust smoke
     test against OPENSHELL_GATEWAY_ENDPOINT=http://127.0.0.1:<port>,
     then SIGTERM the gateway.
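
Step 5's "wait for 'Server listening'" is a poll-with-deadline loop. The script itself is shell; the Rust rendering below is only a sketch of the same idea, and the function name, the 100 ms interval, and the log-file approach are assumptions.

```rust
use std::fs;
use std::path::Path;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `log` until it contains `needle` or `timeout` elapses.
/// Returns true if the line showed up in time.
pub fn wait_for_log_line(log: &Path, needle: &str, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        // A missing or unreadable log just means "not ready yet".
        let found = fs::read_to_string(log)
            .map(|contents| contents.contains(needle))
            .unwrap_or(false);
        if found {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(Duration::from_millis(100));
    }
}
```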

State paths root under /tmp rather than target/ because the VM
driver's compute-driver.sock lives under --vm-driver-state-dir; with
the AF_UNIX sun_path limit at 104 bytes on macOS (108 on Linux),
worktree paths under target/ routinely exceed it.
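
The length constraint is easy to check up front. The helper below is a hypothetical sketch, not part of the script; it treats any platform other than macOS as having the Linux value, which is an assumption.

```rust
use std::path::Path;

/// Size of `sockaddr_un.sun_path` in bytes, including the trailing NUL.
/// 104 on macOS, 108 on Linux; other platforms are not handled here.
pub const SUN_PATH_MAX: usize = if cfg!(target_os = "macos") { 104 } else { 108 };

/// True if `path` can be bound as an AF_UNIX socket under `limit`.
/// One byte is reserved for the NUL terminator, hence the strict `<`.
pub fn fits_in_sun_path(path: &Path, limit: usize) -> bool {
    path.as_os_str().len() < limit
}
```

Calling `fits_in_sun_path(&sock_path, SUN_PATH_MAX)` before starting the gateway turns a confusing bind failure into an explicit early error.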

On failure, the trap preserves the per-run state dir and dumps the
gateway log and every sandbox's rootfs-console.log inline so CI
artifacts capture post-mortem data.

Drop the former --vm-port / --vm-name reuse path entirely — the new
gateway is cheap to start (a few seconds, no k3s bootstrap) and that
reuse flow mapped to openshell-vm's StatefulSet rollout, which no
longer exists. Drop the build:docker:gateway and vm:build task
dependencies from tasks/test.toml's e2e:vm for the same reason.
@drew drew self-assigned this Apr 21, 2026
@drew drew requested a review from a team as a code owner April 21, 2026 05:24
@copy-pr-bot

copy-pr-bot bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

With the SSH port forward removed in #867 and no other host→guest port
mappings in play, everything that configured gvproxy's port-forwarder
is dead weight. gvproxy stays because the VM still needs its virtual
NIC, DHCP server, and default router for guest egress, and because
the sandbox supervisor's per-sandbox netns (veth + iptables, see
openshell-sandbox/src/sandbox/linux/netns.rs) needs a real kernel
network stack inside the guest to branch off of — libkrun's built-in
TSI socket impersonation would not satisfy those primitives.

What changes:

* Drop the `-listen` API socket. No one calls
  `/services/forwarder/expose` on it any more.
* Pass `-ssh-port -1`. gvproxy's default 2222 SSH forward binds
  a host-side TCP listener that would race concurrent sandboxes
  and surface a misleading 'sshd is reachable' endpoint.
  `-1` is gvproxy's documented switch for 'no SSH forward'; see
  getForwardsMap in containers/gvisor-tap-vsock cmd/gvproxy/main.go.
* Remove VmLaunchConfig::port_map and the CLI --vm-port flag.
* Remove krun_set_port_map from the libkrun FFI bindings.
* Remove helpers that only made sense when we had a port map to
  manage: plan_gvproxy_ports, parse_port_mapping, expose_port_map,
  gvproxy_expose, pick_gvproxy_ssh_port, kill_stale_gvproxy_by_port,
  kill_stale_gvproxy_by_port_map, kill_gvproxy_pid, is_process_named,
  and the GUEST_SSH_PORT constant.
* Remove the four port-mapping unit tests.

Verified: after `sandbox create -- echo hi`, `lsof` shows gvproxy
opens zero TCP listeners; only its qemu/vfkit unixgram data socket
remains. E2E smoke still passes in ~10s.
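
For orientation, the resulting gvproxy invocation might be assembled roughly as below. This is a sketch, not the driver's code: `gvproxy_args` is a hypothetical helper, and `-listen-vfkit` is assumed to be the flag used for the unixgram data socket; only `-ssh-port -1` and the absence of `-listen` are taken from this commit message.

```rust
/// Build gvproxy's argument list for a sandbox VM.
/// `data_socket_url` is the unixgram endpoint the VM's virtual NIC uses.
pub fn gvproxy_args(data_socket_url: &str) -> Vec<String> {
    vec![
        // Assumed flag for the vfkit-style unixgram data socket.
        "-listen-vfkit".to_string(),
        data_socket_url.to_string(),
        // Disable the default host-side SSH forward (port 2222).
        "-ssh-port".to_string(),
        "-1".to_string(),
        // Deliberately no `-listen`: nothing calls the forwarder API now.
    ]
}
```

The vector can be handed to `std::process::Command::args` when spawning gvproxy.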