Problem Statement
openshell-server currently has no metrics instrumentation — only structured tracing logs exist. As the gateway for all sandbox lifecycle operations, SSH tunneling, supervisor session management, and policy enforcement, it needs comprehensive Prometheus metrics to support production SLOs, alerting, capacity planning, and incident debugging.
Without metrics, operators cannot:
- Define or track SLIs/SLOs
- Set up meaningful alerting (beyond log-based)
- Build operational dashboards
- Debug production incidents with quantitative data
- Plan capacity based on saturation trends
NOTE: The metrics listed below are ideas meant to encourage thinking like an SRE who has to support this workload. We do not have to implement them all; the goal is to start a discussion about which ones would be most valuable in a first phase.
Proposed Design
Crates & Exposition
Use the metrics facade crate with metrics-exporter-prometheus for Prometheus exposition — this mirrors the existing tracing facade pattern. Add a /metrics GET route to the existing health_router() in http.rs (outside auth, accessible to scrapers). Initialize PrometheusBuilder in run_server() and store PrometheusHandle in ServerState.
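To make concrete what a scraper would receive from /metrics, here is a minimal std-only sketch of the Prometheus text exposition format for one counter family. It is not the metrics-exporter-prometheus API: in the proposed design the PrometheusHandle stored in ServerState would produce this text. The helper name `render_counter` and any sample values are hypothetical.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Render one counter family in the Prometheus text exposition format.
/// Hypothetical helper for illustration only; label values are written
/// verbatim (no escaping of quotes, backslashes, or newlines).
fn render_counter(
    name: &str,
    help: &str,
    samples: &BTreeMap<Vec<(&str, &str)>, u64>,
) -> String {
    let mut out = String::new();
    writeln!(out, "# HELP {name} {help}").unwrap();
    writeln!(out, "# TYPE {name} counter").unwrap();
    for (labels, value) in samples {
        // Each label pair becomes key="value".
        let rendered: Vec<String> = labels
            .iter()
            .map(|(k, v)| format!("{k}=\"{v}\""))
            .collect();
        if rendered.is_empty() {
            writeln!(out, "{name} {value}").unwrap();
        } else {
            writeln!(out, "{name}{{{}}} {value}", rendered.join(",")).unwrap();
        }
    }
    out
}
```

Each distinct label set becomes its own line, which is why every label added to a metric multiplies the number of series a scraper has to store.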
For gRPC and HTTP request metrics, implement a Tower middleware layer in multiplex.rs that wraps both the GrpcRouter and HTTP service — this is the single highest-value instrumentation point.
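The idea behind a single instrumentation point can be sketched without Tower: wrap the call, time it, and record the outcome once. The sketch below is a std-only stand-in, not the proposed middleware; `RedRecorder` and `instrument` are hypothetical names, and a real implementation would be a Tower Layer/Service emitting through the metrics facade with the actual gRPC status code rather than a two-valued "OK"/"ERROR" label.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// In-memory stand-in for a metrics recorder, keyed by (method, code).
#[derive(Default)]
struct RedRecorder {
    // Value is (request count, summed latency).
    series: Mutex<HashMap<(String, String), (u64, Duration)>>,
}

impl RedRecorder {
    fn record(&self, method: &str, code: &str, elapsed: Duration) {
        let mut series = self.series.lock().unwrap();
        let entry = series
            .entry((method.to_owned(), code.to_owned()))
            .or_insert((0, Duration::ZERO));
        entry.0 += 1;
        entry.1 += elapsed;
    }

    fn count(&self, method: &str, code: &str) -> u64 {
        let series = self.series.lock().unwrap();
        series
            .get(&(method.to_owned(), code.to_owned()))
            .map_or(0, |(n, _)| *n)
    }
}

/// Time one call and record rate, errors, and duration in one place.
fn instrument<T, E>(
    recorder: &RedRecorder,
    method: &str,
    call: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let result = call();
    let code = if result.is_ok() { "OK" } else { "ERROR" };
    recorder.record(method, code, start.elapsed());
    result
}
```

Because the wrapper sees every request regardless of handler, the counter and the histogram stay consistent with each other, which is what the availability and latency SLIs rely on.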
Metrics Catalog (16 families, 3 priority tiers)
P0 — Critical (Day-1 Paging Metrics)
| # | Metric | Type | Labels | Instrumentation Point |
|---|---|---|---|---|
| 1 | openshell_grpc_requests_total | counter | method, code | multiplex.rs Tower layer |
| 1 | openshell_grpc_request_duration_seconds | histogram | method, code | multiplex.rs Tower layer |
| 2 | openshell_http_requests_total | counter | path, status | multiplex.rs Tower layer |
| 2 | openshell_http_request_duration_seconds | histogram | path, status | multiplex.rs Tower layer |
| 3 | openshell_supervisor_sessions_active | gauge | (none) | supervisor_session.rs |
| 3 | openshell_supervisor_session_connects_total | counter | superseded | supervisor_session.rs |
| 3 | openshell_supervisor_session_disconnects_total | counter | reason | supervisor_session.rs |
| 4 | openshell_sandboxes_by_phase | gauge | phase | compute/mod.rs |
| 5 | openshell_relay_opens_total | counter | result | supervisor_session.rs |
| 5 | openshell_relay_claims_total | counter | result | supervisor_session.rs |
| 5 | openshell_relay_pending_count | gauge | (none) | supervisor_session.rs |
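Gauges such as openshell_supervisor_sessions_active drift if any exit path forgets to decrement them. One common way to avoid that is an RAII guard, sketched here with std atomics as a stand-in for the real gauge; `Gauge` and `ActiveGuard` are hypothetical names and nothing here reflects the existing supervisor_session.rs code.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Stand-in for a gauge such as `openshell_supervisor_sessions_active`.
#[derive(Default)]
struct Gauge(AtomicI64);

impl Gauge {
    fn get(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Increments the gauge on creation and decrements it on drop, so every
/// exit path (clean disconnect, early return, panic unwind) is covered.
struct ActiveGuard {
    gauge: Arc<Gauge>,
}

impl ActiveGuard {
    fn new(gauge: Arc<Gauge>) -> Self {
        gauge.0.fetch_add(1, Ordering::Relaxed);
        Self { gauge }
    }
}

impl Drop for ActiveGuard {
    fn drop(&mut self) {
        self.gauge.0.fetch_sub(1, Ordering::Relaxed);
    }
}
```

A session task would hold one guard for its whole lifetime; the same pattern would fit the relay pending count and the SSH active-connection gauges.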
P1 — Important (SLO Tracking, Capacity Planning)
| # | Metric | Type | Labels | Instrumentation Point |
|---|---|---|---|---|
| 6 | openshell_sandbox_phase_transition_duration_seconds | histogram | from_phase, to_phase | compute/mod.rs |
| 6 | openshell_sandbox_create_total | counter | result | compute/mod.rs |
| 6 | openshell_sandbox_delete_total | counter | result | compute/mod.rs |
| 7 | openshell_ssh_connections_active | gauge | dimension | ssh_tunnel.rs |
| 7 | openshell_ssh_connection_limit_rejections_total | counter | limit_type | ssh_tunnel.rs |
| 7 | openshell_ssh_tunnel_duration_seconds | histogram | (none) | ssh_tunnel.rs |
| 7 | openshell_ssh_sessions_active | gauge | (none) | ssh_tunnel.rs |
| 7 | openshell_ssh_sessions_reaped_total | counter | reason | ssh_tunnel.rs |
| 8 | openshell_relay_wait_for_session_duration_seconds | histogram | result | supervisor_session.rs |
| 8 | openshell_relay_claim_latency_seconds | histogram | (none) | supervisor_session.rs |
| 8 | openshell_relay_reaped_total | counter | (none) | supervisor_session.rs |
| 9 | openshell_compute_watch_restarts_total | counter | reason | compute/mod.rs |
| 9 | openshell_compute_reconcile_duration_seconds | histogram | (none) | compute/mod.rs |
| 9 | openshell_compute_orphans_pruned_total | counter | (none) | compute/mod.rs |
| 9 | openshell_compute_driver_rpc_duration_seconds | histogram | method | compute/mod.rs |
| 10 | openshell_db_operation_duration_seconds | histogram | operation, backend | persistence/mod.rs |
| 10 | openshell_db_errors_total | counter | operation, backend | persistence/mod.rs |
| 11 | openshell_policy_merge_attempts_total | counter | result | grpc/policy.rs |
| 11 | openshell_policy_merge_retries | histogram | (none) | grpc/policy.rs |
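Many of the P1 entries are histograms, and their usefulness depends on bucket boundaries: a quantile is only as precise as the nearest bucket edge, so a target like 500 ms or 100 ms wants an edge at exactly that value. The std-only sketch below shows Prometheus-style cumulative buckets; the `Histogram` type is a hypothetical illustration, not the exporter's implementation, and any bounds passed to it are assumptions.

```rust
/// Fixed-bucket histogram with Prometheus-style cumulative `le` buckets.
struct Histogram {
    /// Upper bounds in seconds, sorted ascending; `+Inf` is implicit.
    bounds: Vec<f64>,
    /// counts[i] holds observations <= bounds[i]; the last slot is `+Inf`.
    counts: Vec<u64>,
    /// Sum of all observed values, exported as `_sum`.
    sum: f64,
}

impl Histogram {
    fn new(bounds: Vec<f64>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts, sum: 0.0 }
    }

    fn observe(&mut self, seconds: f64) {
        self.sum += seconds;
        // Cumulative: an observation lands in every bucket whose bound
        // is at or above it.
        for (i, bound) in self.bounds.iter().enumerate() {
            if seconds <= *bound {
                self.counts[i] += 1;
            }
        }
        // The +Inf bucket counts every observation.
        let last = self.counts.len() - 1;
        self.counts[last] += 1;
    }
}
```

With bounds of 5 ms, 25 ms, 100 ms, and 500 ms, an observation of 2 s is only visible in the `+Inf` bucket, so a p99 above 500 ms could be detected but not located; long-running operations such as sandbox phase transitions would need wider bounds than request latency.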
P2 — Nice to Have (Deep Operational Insight)
| # | Metric | Type | Instrumentation Point |
|---|---|---|---|
| 12 | openshell_ws_tunnel_connections_active | gauge | ws_tunnel.rs |
| 12 | openshell_ws_tunnel_bytes_total | counter | ws_tunnel.rs |
| 13 | openshell_exec_duration_seconds | histogram | grpc/sandbox.rs |
| 13 | openshell_exec_total | counter | grpc/sandbox.rs |
| 14 | openshell_tcp_connections_active | gauge | lib.rs |
| 14 | openshell_tls_handshake_failures_total | counter | lib.rs |
| 15 | openshell_tracing_bus_subscribers | gauge | tracing_bus.rs |
| 15 | openshell_tracing_bus_messages_published_total | counter | tracing_bus.rs |
| 16 | Process metrics (RSS, FDs, CPU, uptime) | various | separate collector (e.g. the metrics-process crate); PrometheusBuilder does not emit these on its own |
SLI/SLO Definitions
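| SLI | PromQL | Target |
|---|---|---|
| gRPC Availability | `1 - sum(rate(openshell_grpc_requests_total{code!="OK"}[5m])) / sum(rate(openshell_grpc_requests_total[5m]))` | 99.9% / 30d |
| gRPC Latency (p99) | `histogram_quantile(0.99, sum by (le) (rate(openshell_grpc_request_duration_seconds_bucket[5m])))` | <500ms (unary) |
| Sandbox Create Success | `sum(rate(openshell_sandbox_create_total{result="success"}[1h])) / sum(rate(openshell_sandbox_create_total[1h]))` | 99.5% |
| Time to Ready (p50) | `histogram_quantile(0.50, sum by (le) (rate(openshell_sandbox_phase_transition_duration_seconds_bucket{to_phase="ready"}[1h])))` | <30s |
| SSH Tunnel Availability | `1 - sum(rate(openshell_http_requests_total{path="/connect/ssh",status=~"5.."}[5m])) / sum(rate(openshell_http_requests_total{path="/connect/ssh"}[5m]))` | 99.9% |
| Relay Claim Rate | `sum(rate(openshell_relay_claims_total{result="success"}[5m])) / sum(rate(openshell_relay_opens_total{result="success"}[5m]))` | >99% |

The queries use the full openshell_ metric names from the catalog and aggregate with sum before dividing. Without the aggregation, the division matches series one-to-one on identical label sets, so a ratio such as non-OK requests over all requests evaluates to 1 for every non-OK series instead of the overall error ratio.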
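To put the availability targets in operational terms, the arithmetic behind an error budget can be sketched as follows. Both helpers are hypothetical and only illustrate the calculation; they are not part of the proposal.

```rust
/// Minutes of full unavailability allowed by an availability target
/// over a window of `window_days` days.
fn error_budget_minutes(target: f64, window_days: f64) -> f64 {
    (1.0 - target) * window_days * 24.0 * 60.0
}

/// Fraction of the error budget consumed, given good and total request
/// counts observed in the window. Returns 0.0 when there was no traffic.
fn budget_consumed(good: u64, total: u64, target: f64) -> f64 {
    if total == 0 {
        return 0.0;
    }
    let error_ratio = 1.0 - good as f64 / total as f64;
    error_ratio / (1.0 - target)
}
```

For the 99.9% / 30d gRPC target, `error_budget_minutes(0.999, 30.0)` comes to about 43.2 minutes, and 500 failed requests out of 1,000,000 would consume about half of the budget.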
Key Alerting Rules
| Severity | Condition | Duration |
|---|---|---|
| Page | openshell_supervisor_sessions_active == 0 (while sandboxes exist) | 2m |
| Page | gRPC error ratio > 1% | 5m |
| Page | All sandboxes in error, none ready | 5m |
| Warn | openshell_relay_pending_count > 50 | 2m |
| Warn | SSH connection limit rejections firing | instant |
| Warn | Compute watch restarts > 3 in 10m | n/a |
| Warn | DB p99 > 100ms | 5m |
Implementation Phases
- Phase 1, Infrastructure (single PR): Add the metrics and metrics-exporter-prometheus crates, the /metrics endpoint, and Tower middleware for gRPC/HTTP RED metrics.
- Phase 2, P0 Metrics (single PR): Instrument supervisor sessions, relay channels, and the sandbox phase gauge.
- Phase 3, P1 Metrics (2-3 PRs): SSH tunnel saturation, DB latency, compute driver health, policy merge retries.
- Phase 4, P2 Metrics (as needed): WebSocket, exec, TCP/TLS, log bus, process metrics.
Critical Files
- crates/openshell-server/src/multiplex.rs: Tower middleware (highest-value single point)
- crates/openshell-server/src/supervisor_session.rs: Session and relay gauges/counters
- crates/openshell-server/src/lib.rs: PrometheusHandle init, ServerState
- crates/openshell-server/src/http.rs: /metrics endpoint
- crates/openshell-server/src/compute/mod.rs: Phase gauge, driver health
- crates/openshell-server/src/ssh_tunnel.rs: Connection saturation metrics
- crates/openshell-server/src/persistence/mod.rs: DB operation metrics
- crates/openshell-server/src/grpc/policy.rs: Policy merge retry metrics
Alternatives Considered
- OpenTelemetry (opentelemetry + opentelemetry-prometheus): Heavier dependency, but provides OTLP export for Datadog/Grafana Cloud. Only justified if the deployment stack already uses OTLP collectors. The metrics facade is lighter and more idiomatic for Rust.
- tracing-derived metrics (e.g. tracing-opentelemetry): Could derive counters/histograms from existing tracing spans, but provides less control over label cardinality and histogram buckets. Better suited as a complement to explicit metrics, not a replacement.
Agent Investigation
Explored the full openshell-server codebase including:
- All gRPC service definitions in proto/ (openshell.proto, compute_driver.proto, inference.proto, sandbox.proto)
- Server architecture: lib.rs (ServerState), multiplex.rs (protocol multiplexing), grpc/ handlers
- Supervisor session system: supervisor_session.rs (session registry, relay channels, heartbeats, reaper)
- SSH tunnel: ssh_tunnel.rs (connection limits, session reaping, relay integration)
- Compute driver abstraction: compute/mod.rs (reconciliation loop, watch stream, orphan pruning)
- Persistence layer: persistence/mod.rs, persistence/postgres.rs, persistence/sqlite.rs
- All Cargo.toml files: confirmed zero metrics-related dependencies exist today
- Searched for any existing counter/histogram/gauge patterns: none found