From 6dcfe21f53861ef10e7d7b12fa319705227bd96f Mon Sep 17 00:00:00 2001 From: "raghav.mehndiratta" Date: Fri, 17 Apr 2026 13:32:27 -0700 Subject: [PATCH 1/2] filter metrics based on pipeline type --- docs/metric_context.md | 58 ++++++++++++++----- docs/metrics/faithfulness.md | 4 +- docs/metrics/speakability.md | 2 +- docs/metrics/stt_wer.md | 2 +- .../transcription_accuracy_key_entities.md | 2 +- docs/metrics/user_behavioral_fidelity.md | 2 +- src/eva/metrics/base.py | 2 +- src/eva/metrics/diagnostic/speakability.py | 3 +- src/eva/metrics/diagnostic/stt_wer.py | 3 +- .../transcription_accuracy_key_entities.py | 3 +- src/eva/metrics/runner.py | 10 ++-- tests/unit/metrics/test_runner.py | 5 +- 12 files changed, 65 insertions(+), 31 deletions(-) diff --git a/docs/metric_context.md b/docs/metric_context.md index 36037f8e..10ba391e 100644 --- a/docs/metric_context.md +++ b/docs/metric_context.md @@ -193,40 +193,72 @@ These are absolute paths to output files saved during benchmark execution. - **`audio_user_path: Optional[str]`** - Path to the user simulator's audio channel (mono WAV file). - **`audio_mixed_path: Optional[str]`** - Path to the mixed stereo audio (left=assistant, right=user). -## Audio-Native (S2S, S2T+TTS) vs Cascade Architecture +## Audio-Native (S2S, AUDIO_LLM) vs Cascade Architecture This section explains the architectural differences that affect which variables are reliable. For a quick summary, see [Why Multiple Representations?](#why-multiple-representations-of-the-same-conversation). -"Audio-native" is an umbrella term for architectures where the model processes raw audio input directly, as opposed to cascade where the model receives STT text. This includes Speech-to-Speech (S2S) and Speech-to-Text + TTS (S2T+TTS) architectures. +"Audio-native" is an umbrella term for architectures where the model processes raw audio input directly, as opposed to cascade where the model receives STT text. 
There are three pipeline types (`PipelineType` enum): + +- **`CASCADE`** — separate STT → LLM → TTS steps +- **`AUDIO_LLM`** — audio input to LLM + separate TTS (S2T+TTS) +- **`S2S`** — end-to-end speech-to-speech model (no separate TTS) ### Pipelines -**Cascade:** +**CASCADE:** `User audio → Agent STT → Text → LLM → Text → TTS → Assistant audio` The LLM processes **transcribed text**, so `transcribed_user_turns` reflects what the assistant actually saw. -**Audio-native:** -`User audio → Raw audio directly to model → Assistant audio (S2S)` or -`User audio → Raw audio directly to model → Text → TTS → Assistant audio (S2T+TTS)` +**AUDIO_LLM:** +`User audio → Raw audio directly to model → Text → TTS → Assistant audio` + +**S2S:** +`User audio → Raw audio directly to model → Assistant audio` + +For both AUDIO_LLM and S2S, the model processes **raw audio**. The audit log may contain a transcript from the service's own secondary STT, but this is **not what the model used** — it's just for reference. This is why `transcribed_user_turns` is unreliable for audio-native models and `intended_user_turns` should be used instead. + +Check `context.pipeline_type` to determine which mode was used, or `context.is_audio_native` for a boolean grouping of `AUDIO_LLM` and `S2S`. + +### Writing Pipeline-Aware Metrics + +**Controlling which pipelines a metric runs on:** -The model processes **raw audio**. The audit log may contain a transcript from the service's own secondary STT, but this is **not what the model used** — it's just for reference. This is why `transcribed_user_turns` is unreliable for audio-native models and `intended_user_turns` should be used instead. +Set `supported_pipeline_types` on your metric class. The runner will skip the metric automatically for unsupported pipelines. The default is all three types. -Check `context.pipeline_type` to determine which mode was used, or `context.is_audio_native` for a boolean grouping of `S2S` and `AUDIO_LLM`. 
+```python +from eva.models.config import PipelineType + +class MyMetric(BaseMetric): + # Only run on cascade (e.g. STT/transcription metrics) + supported_pipeline_types = frozenset({PipelineType.CASCADE}) + + # Run on cascade and audio LLM but not S2S (e.g. metrics that require a TTS step) + supported_pipeline_types = frozenset({PipelineType.CASCADE, PipelineType.AUDIO_LLM}) -### Writing Audio-Native-Aware Metrics + # Run on all pipelines (default — no declaration needed) + supported_pipeline_types = frozenset(PipelineType) +``` + +**Branching on pipeline type within `compute()`:** -If your metric needs user text directly (rather than via `conversation_trace`, which handles this automatically), branch on `context.is_audio_native`: +If your metric runs on multiple pipelines but needs different behavior per type, branch on `context.pipeline_type` or use `context.is_audio_native`: ```python async def compute(self, context: MetricContext) -> MetricScore: - # Option 1: manual branching + # Branch by exact pipeline type + if context.pipeline_type == PipelineType.CASCADE: + user_turns = context.transcribed_user_turns + else: + user_turns = context.intended_user_turns + + # Or use is_audio_native (True for both AUDIO_LLM and S2S) user_turns = context.intended_user_turns if context.is_audio_native else context.transcribed_user_turns - # Option 2: use conversation_trace (handles S2S vs Cascade automatically) + # Or use conversation_trace (handles S2S vs Cascade automatically) for entry in context.conversation_trace: if entry["role"] == "user": - # "intended" for S2S, "transcribed" for Cascade + # "intended" for audio-native, "transcribed" for cascade user_text = entry["content"] ``` diff --git a/docs/metrics/faithfulness.md b/docs/metrics/faithfulness.md index 47812e48..427c6ce5 100644 --- a/docs/metrics/faithfulness.md +++ b/docs/metrics/faithfulness.md @@ -25,7 +25,7 @@ Uses the following MetricContext fields: - `conversation_trace`: Full conversation with tool calls (via 
`format_transcript`) - `agent_instructions`, `agent_role`, `agent_tools`: Agent configuration for policy evaluation - `current_date_time`: Simulated date/time for temporal reasoning -- `is_audio_native` (audio-native): Architecture flag (controls which prompt variant is used) +- `pipeline_type` / `is_audio_native`: Architecture flag (controls which prompt variant is used — cascade vs. audio-native) ### Audio-Native vs Cascade @@ -35,7 +35,7 @@ This metric has **significantly different behavior** depending on the architectu - User turns in the trace are **STT transcripts** — the text the assistant's text LLM actually received. - The judge evaluates faithfulness against what the assistant saw (the transcript), not what the user actually said. - If STT transcribed "Kim" but the user said "Kin", using "Kim" is faithful (the assistant can only work with what it received). This issue would be captured by the transcription accuracy key entities metric. -**Audio-native (S2S, S2T+TTS):** +**Audio-native (AUDIO_LLM, S2S):** - User turns in the trace are **intended text** (what the user simulator was instructed to say), since audio-native models do not use transcriptions. - The judge evaluates whether the assistant **correctly understood the audio**. If the assistant misheard the user and used incorrect information, that IS a faithfulness violation — accurate audio understanding is part of the audio-native model's responsibility. diff --git a/docs/metrics/speakability.md b/docs/metrics/speakability.md index 4ecb9f76..af2f7dba 100644 --- a/docs/metrics/speakability.md +++ b/docs/metrics/speakability.md @@ -25,7 +25,7 @@ Uses `intended_assistant_turns` from MetricContext — the text sent to the TTS ### Audio-Native vs Cascade - **Cascade**: Fully applicable — evaluates whether the LLM's text output is appropriate for TTS. -- **Audio-native (S2S, S2T+TTS):** **Skipped entirely** (`skip_s2s = True` (audio-native)). 
Audio-native models generate audio directly, so there is no separate text-to-TTS step and "speakability" of intermediate text is not meaningful. +- **S2S:** **Skipped** (`supported_pipeline_types = {CASCADE, AUDIO_LLM}`). S2S models generate audio directly without a separate TTS step, so there is no intermediate text whose speakability can be evaluated. AUDIO_LLM models do have a TTS step and are evaluated. ### Evaluation Methodology diff --git a/docs/metrics/stt_wer.md b/docs/metrics/stt_wer.md index f254a7ae..7128ca6f 100644 --- a/docs/metrics/stt_wer.md +++ b/docs/metrics/stt_wer.md @@ -26,7 +26,7 @@ Uses the following MetricContext fields: ### Audio-Native vs Cascade - **Cascade**: Fully applicable — measures the quality of the assistant's STT pipeline, which directly affects the LLM's input. -- **Audio-native (S2S, S2T+TTS):** **Skipped entirely** (`skip_s2s = True` (audio-native)). Audio-native models receive raw audio, not STT transcripts, so measuring STT accuracy is not meaningful. The `transcribed_user_turns` field in audio-native systems comes from a secondary transcription service, not the model's actual input. +- **AUDIO_LLM / S2S:** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT transcripts, so measuring STT accuracy is not meaningful. The `transcribed_user_turns` field in audio-native systems comes from a secondary transcription service, not the model's actual input. 
### Evaluation Methodology diff --git a/docs/metrics/transcription_accuracy_key_entities.md b/docs/metrics/transcription_accuracy_key_entities.md index 676ca27e..11c9c02b 100644 --- a/docs/metrics/transcription_accuracy_key_entities.md +++ b/docs/metrics/transcription_accuracy_key_entities.md @@ -29,7 +29,7 @@ The judge receives both texts side by side and identifies key entities to compar ### Audio-Native vs Cascade - **Cascade**: Fully applicable — measures whether the assistant's STT correctly captured key entities, which directly affects downstream tool calls and responses. -- **Audio-native (S2S, S2T+TTS):** **Skipped entirely** (`skip_s2s = True` (audio-native)). Audio-native models receive raw audio, not STT output, so entity-level STT accuracy is not meaningful. Entity perception issues in audio-native systems are captured instead by `faithfulness` (which treats mishearing as a faithfulness violation). +- **AUDIO_LLM / S2S:** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT output, so entity-level STT accuracy is not meaningful. Entity perception issues in audio-native systems are captured instead by `faithfulness` (which treats mishearing as a faithfulness violation). ### Evaluation Methodology diff --git a/docs/metrics/user_behavioral_fidelity.md b/docs/metrics/user_behavioral_fidelity.md index a03344ef..d3b86d5b 100644 --- a/docs/metrics/user_behavioral_fidelity.md +++ b/docs/metrics/user_behavioral_fidelity.md @@ -31,7 +31,7 @@ Uses the following MetricContext fields: This metric has **pipeline-specific prompt text** and provides the judge with **two views** of the conversation: - **Cascade**: The judge sees the agent-side transcript (`conversation_trace`, where user turns are STT transcriptions) alongside the `intended_user_turns` (ground truth). The prompt explains that discrepancies between the two are transcription errors — the user should not be penalized for the agent mishearing. 
-- **Audio-native (S2S, S2T+TTS):** The judge sees the conversation trace (where user turns are already intended text) alongside the `intended_user_turns`. The prompt explains this is an audio-native system and that discrepancies in agent behavior may be due to audio perception errors, not user corruption. +- **Audio-native (AUDIO_LLM, S2S):** The judge sees the conversation trace (where user turns are already intended text) alongside the `intended_user_turns`. The prompt explains this is an audio-native system and that discrepancies in agent behavior may be due to audio perception errors, not user corruption. In both cases, `intended_user_turns` serves as ground truth for what the user actually said. diff --git a/src/eva/metrics/base.py b/src/eva/metrics/base.py index 8572fed0..b31ad5cd 100644 --- a/src/eva/metrics/base.py +++ b/src/eva/metrics/base.py @@ -159,7 +159,7 @@ class BaseMetric(ABC): metric_type: MetricType = MetricType.CODE # Override in subclasses pass_at_k_threshold: float = 0.5 # Normalized score threshold for pass@k pass/fail exclude_from_pass_at_k: bool = False # Set True for metrics not suitable for pass@k - skip_audio_native: bool = False # Set True for metrics that should not run on audio-native records (S2S, AudioLLM) + supported_pipeline_types: frozenset[PipelineType] = frozenset(PipelineType) # Pipeline types this metric supports def __init__(self, config: dict[str, Any] | None = None): """Initialize the metric. 
diff --git a/src/eva/metrics/diagnostic/speakability.py b/src/eva/metrics/diagnostic/speakability.py index ad6c0a4a..bc9d8a45 100644 --- a/src/eva/metrics/diagnostic/speakability.py +++ b/src/eva/metrics/diagnostic/speakability.py @@ -8,6 +8,7 @@ from eva.metrics.base import MetricContext, PerTurnConversationJudgeMetric from eva.metrics.registry import register_metric +from eva.models.config import PipelineType @register_metric @@ -32,7 +33,7 @@ class SpeakabilityJudgeMetric(PerTurnConversationJudgeMetric): description = "Debug metric: LLM judge evaluation of text voice-friendliness per turn" category = "diagnostic" exclude_from_pass_at_k = True - skip_audio_native = True + supported_pipeline_types = frozenset({PipelineType.CASCADE, PipelineType.AUDIO_LLM}) rating_scale = (0, 1) def get_expected_turn_ids(self, context: MetricContext) -> list[int]: diff --git a/src/eva/metrics/diagnostic/stt_wer.py b/src/eva/metrics/diagnostic/stt_wer.py index 15747978..f94b6ba2 100644 --- a/src/eva/metrics/diagnostic/stt_wer.py +++ b/src/eva/metrics/diagnostic/stt_wer.py @@ -11,6 +11,7 @@ from eva.metrics.base import CodeMetric, MetricContext from eva.metrics.registry import register_metric from eva.metrics.utils import aggregate_wer_errors, extract_wer_errors, reverse_word_error_rate +from eva.models.config import PipelineType from eva.models.results import MetricScore from eva.utils.wer_normalization import normalize_text @@ -42,7 +43,7 @@ class STTWERMetric(CodeMetric): description = "Debug metric: Speech-to-Text transcription accuracy using Word Error Rate" category = "diagnostic" exclude_from_pass_at_k = True - skip_audio_native = True + supported_pipeline_types = frozenset({PipelineType.CASCADE}) def __init__(self, config: dict | None = None): """Initialize the metric with language configuration.""" diff --git a/src/eva/metrics/diagnostic/transcription_accuracy_key_entities.py b/src/eva/metrics/diagnostic/transcription_accuracy_key_entities.py index c8232867..0044d57a 100644 
--- a/src/eva/metrics/diagnostic/transcription_accuracy_key_entities.py +++ b/src/eva/metrics/diagnostic/transcription_accuracy_key_entities.py @@ -9,6 +9,7 @@ from eva.metrics.base import MetricContext, TextJudgeMetric from eva.metrics.registry import register_metric from eva.metrics.utils import aggregate_per_turn_scores, parse_judge_response_list, resolve_turn_id +from eva.models.config import PipelineType from eva.models.results import MetricScore @@ -45,7 +46,7 @@ class TranscriptionAccuracyKeyEntitiesMetric(TextJudgeMetric): description = "Debug metric: LLM judge evaluation of STT key entity transcription accuracy for entire conversation" category = "diagnostic" exclude_from_pass_at_k = True - skip_audio_native = True + supported_pipeline_types = frozenset({PipelineType.CASCADE}) rating_scale = None # Custom scoring (not 1-3 scale) default_aggregation = "mean" diff --git a/src/eva/metrics/runner.py b/src/eva/metrics/runner.py index d31e9de6..f765ddea 100644 --- a/src/eva/metrics/runner.py +++ b/src/eva/metrics/runner.py @@ -388,12 +388,10 @@ async def compute_metric(metric: BaseMetric) -> tuple[str, MetricScore]: ) # Filter out metrics incompatible with the pipeline type - applicable_metrics = metrics_to_run - if context.is_audio_native: - skipped = [m.name for m in metrics_to_run if m.skip_audio_native] - if skipped: - logger.info(f"[{record_id}] Skipping metrics incompatible with audio-native pipeline: {skipped}") - applicable_metrics = [m for m in metrics_to_run if not m.skip_audio_native] + skipped = [m.name for m in metrics_to_run if context.pipeline_type not in m.supported_pipeline_types] + if skipped: + logger.info(f"[{record_id}] Skipping metrics incompatible with {context.pipeline_type} pipeline: {skipped}") + applicable_metrics = [m for m in metrics_to_run if context.pipeline_type in m.supported_pipeline_types] # Run all metrics in parallel tasks = [compute_metric(metric) for metric in applicable_metrics] diff --git 
a/tests/unit/metrics/test_runner.py b/tests/unit/metrics/test_runner.py index 23a4cd3d..e6bcd354 100644 --- a/tests/unit/metrics/test_runner.py +++ b/tests/unit/metrics/test_runner.py @@ -7,6 +7,7 @@ import yaml from eva.metrics.runner import MetricsRunner +from eva.models.config import PipelineType from eva.models.results import MetricScore, RecordMetrics from tests.unit.conftest import make_evaluation_record @@ -14,11 +15,11 @@ class _FakeMetric: - """Minimal stand-in for BaseMetric — only ``name``, ``skip_audio_native``, and pass@k attrs are read.""" + """Minimal stand-in for BaseMetric — only ``name``, ``supported_pipeline_types``, and pass@k attrs are read.""" def __init__(self, name: str): self.name = name - self.skip_audio_native = False + self.supported_pipeline_types = frozenset(PipelineType) self.exclude_from_pass_at_k = False self.pass_at_k_threshold = 0.5 From 37ad38a7a05d65bfc73f0bf48c830baedb51b154 Mon Sep 17 00:00:00 2001 From: raghavm243512 Date: Tue, 21 Apr 2026 10:25:58 -0700 Subject: [PATCH 2/2] Apply suggestions from code review Co-authored-by: Gabrielle Gauthier-Melancon --- docs/metrics/stt_wer.md | 2 +- docs/metrics/transcription_accuracy_key_entities.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/metrics/stt_wer.md b/docs/metrics/stt_wer.md index 7128ca6f..283ffc53 100644 --- a/docs/metrics/stt_wer.md +++ b/docs/metrics/stt_wer.md @@ -26,7 +26,7 @@ Uses the following MetricContext fields: ### Audio-Native vs Cascade - **Cascade**: Fully applicable — measures the quality of the assistant's STT pipeline, which directly affects the LLM's input. -- **AUDIO_LLM / S2S:** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT transcripts, so measuring STT accuracy is not meaningful. The `transcribed_user_turns` field in audio-native systems comes from a secondary transcription service, not the model's actual input. 
+- **Audio-native (AUDIO_LLM / S2S):** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT transcripts, so measuring STT accuracy is not meaningful. The `transcribed_user_turns` field in audio-native systems comes from a secondary transcription service, not the model's actual input. ### Evaluation Methodology diff --git a/docs/metrics/transcription_accuracy_key_entities.md b/docs/metrics/transcription_accuracy_key_entities.md index 11c9c02b..8264422d 100644 --- a/docs/metrics/transcription_accuracy_key_entities.md +++ b/docs/metrics/transcription_accuracy_key_entities.md @@ -29,7 +29,7 @@ The judge receives both texts side by side and identifies key entities to compar ### Audio-Native vs Cascade - **Cascade**: Fully applicable — measures whether the assistant's STT correctly captured key entities, which directly affects downstream tool calls and responses. -- **AUDIO_LLM / S2S:** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT output, so entity-level STT accuracy is not meaningful. Entity perception issues in audio-native systems are captured instead by `faithfulness` (which treats mishearing as a faithfulness violation). +- **Audio-native (AUDIO_LLM / S2S):** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT output, so entity-level STT accuracy is not meaningful. Entity perception issues in audio-native systems are captured instead by `faithfulness` (which treats mishearing as a faithfulness violation). ### Evaluation Methodology
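
---

Note for reviewers: the runner-side filtering this patch introduces can be sketched with a minimal, self-contained example. This does not import the real `eva` modules; the `PipelineType` member names and the `supported_pipeline_types` attribute are taken from the diff above, while the `FakeMetric` classes and the `filter_metrics` helper are simplified stand-ins for `BaseMetric` and the runner code, not the actual implementations.

```python
from enum import Enum


class PipelineType(Enum):
    # Member names mirror the enum referenced in the diff; values are illustrative.
    CASCADE = "cascade"      # STT -> LLM -> TTS
    AUDIO_LLM = "audio_llm"  # audio into the LLM, separate TTS step
    S2S = "s2s"              # end-to-end speech-to-speech, no TTS step


class FakeMetric:
    """Stand-in for BaseMetric: the default supports every pipeline type."""

    supported_pipeline_types = frozenset(PipelineType)

    def __init__(self, name: str):
        self.name = name


class FakeSTTWER(FakeMetric):
    # STT accuracy is only meaningful when a separate STT step exists.
    supported_pipeline_types = frozenset({PipelineType.CASCADE})


class FakeSpeakability(FakeMetric):
    # Needs an intermediate text-to-TTS step, which S2S lacks.
    supported_pipeline_types = frozenset({PipelineType.CASCADE, PipelineType.AUDIO_LLM})


def filter_metrics(metrics, pipeline_type):
    """Runner-style filter: keep only metrics that support this pipeline."""
    return [m for m in metrics if pipeline_type in m.supported_pipeline_types]


metrics = [
    FakeSTTWER("stt_wer"),
    FakeSpeakability("speakability"),
    FakeMetric("faithfulness"),  # default: runs everywhere
]

for pt in PipelineType:
    print(pt.name, [m.name for m in filter_metrics(metrics, pt)])
# CASCADE ['stt_wer', 'speakability', 'faithfulness']
# AUDIO_LLM ['speakability', 'faithfulness']
# S2S ['faithfulness']
```

Because the declaration is a class attribute with a permissive default, existing metrics that never set it keep running on all pipelines, which matches the stated default behavior of the patch.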