feat(server,driver-vm,e2e): gateway-owned readiness + VM compute driver e2e #901
Open
drew wants to merge 4 commits into feat/supervisor-session-grpc-data
The gateway now owns the Ready transition for sandboxes. When a ConnectSupervisor RPC registers a session in SupervisorSessionRegistry, ComputeRuntime promotes the sandbox record from Provisioning to Ready with reason=SupervisorConnected. When the session ends, the sandbox is demoted back to Provisioning with reason=SupervisorDisconnected.

This replaces compute-driver-specific liveness probes (log grep, TCP poll) with the authoritative signal that the relay plane for SSH and exec is live end-to-end. Drivers that still want to report Error conditions (e.g. the VM driver on a dead launcher process) continue to do so; only the Ready transition moves.

Wiring is observer-based: SupervisorSessionRegistry holds an optional trait object installed by ComputeRuntime::install_supervisor_observer during server startup. Session register/remove_if_current invoke on_session_connected/on_session_disconnected outside the internal lock; the observer spawns async tasks to update the persisted sandbox record and notify the watch bus.

The backfill path in apply_sandbox_update_locked handles the register-before-store race: when a driver snapshot arrives and the registry already has a live session for that sandbox, the phase is promoted on the spot.

Adds 8 unit tests covering promote/demote transitions, terminal-state no-ops, idempotent re-register, and the backfill race. Adds 4 registry-level tests covering observer fire-once semantics and supersede-race guards.
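The observer wiring described above can be sketched roughly as follows. This is a simplified stand-in, not the code in this PR: `SessionRegistry`, `RecordingObserver`, the `u64` session id, and the choice to fire `on_session_connected` only for the first live session of a sandbox are all assumptions made for illustration. The one property taken from the commit message is that the callbacks run after the internal lock is released.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified observer trait; the method names follow the commit message.
pub trait SupervisorSessionObserver: Send + Sync {
    fn on_session_connected(&self, sandbox_id: &str);
    fn on_session_disconnected(&self, sandbox_id: &str);
}

#[derive(Default)]
struct Inner {
    /// sandbox id -> id of the session currently considered live (hypothetical layout).
    sessions: HashMap<String, u64>,
    observer: Option<Arc<dyn SupervisorSessionObserver>>,
}

/// Hypothetical stand-in for the real registry.
#[derive(Default)]
pub struct SessionRegistry {
    inner: Mutex<Inner>,
}

impl SessionRegistry {
    pub fn install_observer(&self, observer: Arc<dyn SupervisorSessionObserver>) {
        self.inner.lock().unwrap().observer = Some(observer);
    }

    /// Registers (or supersedes) the session for a sandbox.
    pub fn register(&self, sandbox_id: &str, session_id: u64) {
        let (observer, is_first) = {
            let mut inner = self.inner.lock().unwrap();
            let previous = inner.sessions.insert(sandbox_id.to_string(), session_id);
            (inner.observer.clone(), previous.is_none())
        }; // lock released here, before the callback runs
        // Assumption: a superseding session does not re-fire the callback.
        if is_first {
            if let Some(observer) = observer {
                observer.on_session_connected(sandbox_id);
            }
        }
    }

    /// Removes the session only if `session_id` is still the current one, so a
    /// stale teardown cannot remove the session that superseded it.
    pub fn remove_if_current(&self, sandbox_id: &str, session_id: u64) -> bool {
        let observer = {
            let mut inner = self.inner.lock().unwrap();
            if inner.sessions.get(sandbox_id) != Some(&session_id) {
                return false;
            }
            inner.sessions.remove(sandbox_id);
            inner.observer.clone()
        }; // lock released here, before the callback runs
        if let Some(observer) = observer {
            observer.on_session_disconnected(sandbox_id);
        }
        true
    }
}

/// Test double that records callbacks in order (illustration only).
#[derive(Default)]
pub struct RecordingObserver {
    events: Mutex<Vec<String>>,
}

impl RecordingObserver {
    pub fn events(&self) -> Vec<String> {
        self.events.lock().unwrap().clone()
    }
}

impl SupervisorSessionObserver for RecordingObserver {
    fn on_session_connected(&self, sandbox_id: &str) {
        self.events.lock().unwrap().push(format!("connected:{sandbox_id}"));
    }
    fn on_session_disconnected(&self, sandbox_id: &str) {
        self.events.lock().unwrap().push(format!("disconnected:{sandbox_id}"));
    }
}
```

Cloning the `Arc` under the lock and calling the observer afterwards keeps a slow or re-entrant observer from blocking other registry callers.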
The VM driver no longer owns the Ready transition; the gateway-side SupervisorSessionObserver now promotes sandboxes to Ready when their supervisor session connects. Remove guest_ssh_ready() (a brittle grep over the serial console) and the ready_condition() helper. monitor_sandbox still watches the launcher child process and emits Error conditions on ProcessExited / ProcessPollFailed.

Also always start gvproxy, not just when port_map is non-empty. With the supervisor-initiated relay migration in #867, the SSH port forward was dropped; that left port_map empty in the default path, which in turn skipped gvproxy startup, which left the guest with no eth0 and no route to the host gateway. The guest supervisor's outbound ConnectSupervisor stream needs gvproxy to reach host.containers.internal (rewritten to 192.168.127.1 inside the guest), so gvproxy is structurally required for any sandbox that talks to the gateway.

Inline the gvproxy setup into an unconditional block that returns (guard, api_sock, forwarded_port_map), dropping the mutable plumbing the prior conditional form needed. Remove the now-dead VmContext::set_port_map wrapper; mark its libkrun FFI binding #[allow(dead_code)] so a future reintroduction doesn't need to touch the symbol table.
Rewrite e2e/rust/e2e-vm.sh for the split-binary flow (openshell-gateway
+ openshell-driver-vm) now that the former openshell-vm K8s-in-a-VM
binary is gone. The new flow:
1. Stage the embedded VM runtime (libkrun + gvproxy + base rootfs)
via mise run vm:setup and mise run vm:rootfs -- --base, both
idempotent and run only when artifacts are missing.
2. Build openshell-gateway, openshell-driver-vm, and the openshell
CLI from the current workspace with cargo.
3. On macOS, codesign the driver with the Hypervisor.framework
entitlement so libkrun can start the microVM.
4. Start the gateway with --drivers vm --disable-tls
--disable-gateway-auth --db-url sqlite::memory:, pinning
--driver-dir target/debug so the gateway picks up the freshly
built driver rather than ~/.local/libexec/openshell from a
prior install-vm.sh run.
5. Wait for 'Server listening', run the cluster-agnostic Rust smoke
test against OPENSHELL_GATEWAY_ENDPOINT=http://127.0.0.1:<port>,
then SIGTERM the gateway.
State paths root under /tmp rather than target/ because the VM
driver's compute-driver.sock lives under --vm-driver-state-dir; with
AF_UNIX SUN_LEN = 104 bytes on macOS (108 on Linux), worktree paths
under target/ routinely blow the limit.
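The path-length constraint above can be checked up front. The following is a minimal sketch, assuming hypothetical helper names (`sun_path_capacity`, `fits_unix_socket_path`) that are not part of this PR; the 104 and 108 byte figures are the ones quoted in the commit message.

```rust
use std::path::Path;

/// Capacity of `sockaddr_un.sun_path` in bytes, using the figures quoted in
/// the commit message: 104 on macOS, 108 on Linux.
pub const fn sun_path_capacity() -> usize {
    if cfg!(target_os = "macos") {
        104
    } else {
        108
    }
}

/// Returns true when `path` fits in `capacity` bytes with room left for the
/// trailing NUL byte, which is why the comparison is strict.
pub fn fits_unix_socket_path(path: &Path, capacity: usize) -> bool {
    path.as_os_str().len() < capacity
}
```

A short path such as `/tmp/<run-id>/compute-driver.sock` passes this check on both platforms, whereas a socket placed several directories deep inside a worktree's `target/` tree can easily exceed it.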
On failure, the trap preserves the per-run state dir plus dumps the
gateway log and every sandbox's rootfs-console.log inline so CI
artifacts capture post-mortem data.
Drop the former --vm-port / --vm-name reuse path entirely — the new
gateway is cheap to start (a few seconds, no k3s bootstrap) and that
reuse flow mapped to openshell-vm's StatefulSet rollout, which no
longer exists. Drop the build:docker:gateway and vm:build task
dependencies from tasks/test.toml's e2e:vm for the same reason.
With the SSH port forward removed in #867 and no other host→guest port mappings in play, everything that configured gvproxy's port-forwarder is dead weight. gvproxy stays because the VM still needs its virtual NIC, DHCP server, and default router for guest egress, and because the sandbox supervisor's per-sandbox netns (veth + iptables, see openshell-sandbox/src/sandbox/linux/netns.rs) needs a real kernel network stack inside the guest to branch off of; libkrun's built-in TSI socket impersonation would not satisfy those primitives.

What changes:
* Drop the `-listen` API socket. No one calls `/services/forwarder/expose` on it any more.
* Pass `-ssh-port -1`. gvproxy's default 2222 SSH forward binds a host-side TCP listener that would race concurrent sandboxes and surface a misleading 'sshd is reachable' endpoint. `-1` is gvproxy's documented switch for 'no SSH forward'; see getForwardsMap in containers/gvisor-tap-vsock cmd/gvproxy/main.go.
* Remove VmLaunchConfig::port_map and the CLI --vm-port flag.
* Remove krun_set_port_map from the libkrun FFI bindings.
* Remove helpers that only made sense when we had a port map to manage: plan_gvproxy_ports, parse_port_mapping, expose_port_map, gvproxy_expose, pick_gvproxy_ssh_port, kill_stale_gvproxy_by_port, kill_stale_gvproxy_by_port_map, kill_gvproxy_pid, is_process_named, and the GUEST_SSH_PORT constant.
* Remove the four port-mapping unit tests.

Verified: after `sandbox create -- echo hi`, `lsof` shows gvproxy opens zero TCP listeners; only its qemu/vfkit unixgram data socket remains. E2E smoke still passes in ~10s.
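As a rough illustration of the resulting gvproxy invocation, the sketch below assembles an argument list that disables the SSH forward and never adds a `-listen` API socket. The function name `gvproxy_args` and its parameters are hypothetical; the flag that carries the data socket is left to the caller because the commit message does not say which one the driver uses. Only `-ssh-port -1` is taken from the text above.

```rust
/// Builds a gvproxy argument list with no host-side TCP listeners.
///
/// `data_socket_flag` and `data_socket_url` are supplied by the caller
/// (hypothetical parameters); this sketch does not assume which gvproxy
/// flag the driver uses for its data socket.
pub fn gvproxy_args(data_socket_flag: &str, data_socket_url: &str) -> Vec<String> {
    vec![
        data_socket_flag.to_string(),
        data_socket_url.to_string(),
        // From the commit message: -1 turns off the default 2222 SSH forward.
        "-ssh-port".to_string(),
        "-1".to_string(),
        // Deliberately no `-listen`: nothing calls the forwarder API any more.
    ]
}
```

The returned vector can be passed to `std::process::Command::args` when spawning gvproxy.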
Summary
Makes the VM compute driver end-to-end path work on top of the supervisor-initiated relay in #867, and moves the authoritative "sandbox is Ready" transition from each compute driver onto the gateway. The smoke test against `openshell-gateway --drivers vm` (`mise run e2e:vm`) goes from hanging at 180s to passing in ~10s.

Related Issue
Stacked on top of #867. No issue link.
Changes
feat(server): promote sandbox phase on supervisor session connect
* Add the `SupervisorSessionObserver` trait. `SupervisorSessionRegistry` invokes the observer on `register` / `remove_if_current` outside the internal mutex.
* `ComputeRuntime::install_supervisor_observer` wires a `ComputeSessionObserver` bridge; the runtime holds a `Weak<SupervisorSessionRegistry>` to break the `Arc` cycle between registry and observer.
* `mark_sandbox_session_connected` / `mark_sandbox_session_disconnected` flip phase and rewrite the `Ready` condition with `reason=SupervisorConnected` / `SupervisorDisconnected`. Terminal states (`Deleting`, `Error`) are preserved.
* The backfill path in `apply_sandbox_update_locked` handles the register-before-store race: if a driver snapshot arrives and the registry already holds a live session for that sandbox, phase is promoted on the spot.

refactor(driver-vm): drop log-grep readiness; always run gvproxy
* Remove `guest_ssh_ready()` and `ready_condition()`. The driver no longer owns `Ready`; `monitor_sandbox` only surfaces `Error` for launcher-process failures.
* `runtime.rs` now starts gvproxy unconditionally. With the SSH port forward removed in #867, `port_map` was empty by default, which skipped gvproxy startup entirely, leaving the guest with no `eth0` and no route to the host gateway. The guest supervisor's `ConnectSupervisor` stream needs gvproxy to reach `host.containers.internal` (rewritten to `192.168.127.1` inside the guest).
* Remove `VmContext::set_port_map`; mark the libkrun FFI binding `#[allow(dead_code)]`.

e2e(vm): run smoke against openshell-gateway with the VM compute driver
* Rewrite `e2e/rust/e2e-vm.sh` for the split-binary flow (the former `openshell-vm` K8s-in-a-VM binary is gone).
* Pin `--driver-dir target/debug` so the gateway picks up the freshly cargo-built driver rather than a stale `~/.local/libexec/openshell/openshell-driver-vm` from a prior `install-vm.sh` run.
* Root state paths under `/tmp` (macOS `AF_UNIX` `SUN_LEN` is 104 bytes; worktree paths routinely blow it).
* On failure, dump the gateway log and every sandbox's `rootfs-console.log` inline for post-mortem.
* Drop the `build:docker:gateway` and `vm:build` dependencies from `tasks/test.toml`'s `e2e:vm`.

Testing
* `mise run pre-commit` passes (lint + format + license headers clean; clippy warnings unchanged from baseline)
* `openshell-server` lib: 255 pass (+8 compute promotion tests, +4 registry observer tests)
* `openshell-driver-vm` lib: 17 pass
* `openshell-server` integration (`supervisor_relay_integration`): 6 pass
* `mise run e2e:vm` passes in ~10s, stable across back-to-back runs

Checklist
Notes for the reviewer
* Base branch is `feat/supervisor-session-grpc-data`, not `main`. Merge order: land #867 first, then rebase this onto main.
* The gvproxy fix in `driver-vm/runtime.rs` could arguably belong in #867 itself (it's a latent bug in that PR: VMs have no network after the SSH port forward was dropped). Happy to split that hunk off into a separate commit against `feat/supervisor-session-grpc-data` directly if that's the preferred landing path.
* `Ready=True` from kubelet still wins (the gateway override only applies when phase is `Provisioning` / `Unknown`).
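The phase rules stated in this description (promote only from `Provisioning` / `Unknown`, demote back to `Provisioning` on disconnect, never touch terminal states) can be summarised as a pair of pure functions. This is a sketch under assumptions: the `Phase` enum and the two function names are hypothetical, and the real `mark_sandbox_session_connected` / `mark_sandbox_session_disconnected` also rewrite the `Ready` condition and persist the record, which is omitted here.

```rust
/// Hypothetical, reduced view of the sandbox phase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Unknown,
    Provisioning,
    Ready,
    Deleting,
    Error,
}

/// Phase after a supervisor session connects. Only `Provisioning` and
/// `Unknown` are promoted; every other phase, including the terminal
/// `Deleting` and `Error`, is returned unchanged.
pub fn on_session_connected(phase: Phase) -> Phase {
    match phase {
        Phase::Unknown | Phase::Provisioning => Phase::Ready,
        other => other,
    }
}

/// Phase after the supervisor session ends. Only `Ready` is demoted;
/// terminal states are preserved.
pub fn on_session_disconnected(phase: Phase) -> Phase {
    match phase {
        Phase::Ready => Phase::Provisioning,
        other => other,
    }
}
```

Because both functions return their input for any phase they do not handle, applying the same event twice is a no-op, which matches the idempotent re-register case covered by the unit tests.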