Merged
58 changes: 45 additions & 13 deletions docs/metric_context.md
@@ -193,40 +193,72 @@ These are absolute paths to output files saved during benchmark execution.
- **`audio_user_path: Optional[str]`** - Path to the user simulator's audio channel (mono WAV file).
- **`audio_mixed_path: Optional[str]`** - Path to the mixed stereo audio (left=assistant, right=user).

## Audio-Native (S2S, S2T+TTS) vs Cascade Architecture
## Audio-Native (S2S, AUDIO_LLM) vs Cascade Architecture

This section explains the architectural differences that affect which variables are reliable. For a quick summary, see [Why Multiple Representations?](#why-multiple-representations-of-the-same-conversation).

"Audio-native" is an umbrella term for architectures where the model processes raw audio input directly, as opposed to cascade where the model receives STT text. This includes Speech-to-Speech (S2S) and Speech-to-Text + TTS (S2T+TTS) architectures.
"Audio-native" is an umbrella term for architectures where the model processes raw audio input directly, as opposed to cascade where the model receives STT text. There are three pipeline types (`PipelineType` enum):

- **`CASCADE`** — separate STT → LLM → TTS steps
- **`AUDIO_LLM`** — audio input to LLM + separate TTS (S2T+TTS)
- **`S2S`** — end-to-end speech-to-speech model (no separate TTS)
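For orientation, here is a minimal sketch of what such an enum could look like (the string values are assumptions; the actual definition lives in `eva.models.config` and may differ):

```python
from enum import Enum

class PipelineType(str, Enum):
    """Sketch of the pipeline-type enum; real member values may differ."""
    CASCADE = "cascade"      # STT -> LLM -> TTS
    AUDIO_LLM = "audio_llm"  # audio-in LLM + separate TTS
    S2S = "s2s"              # end-to-end speech-to-speech

# Iterating the enum yields all members, so frozenset(PipelineType)
# denotes "all pipelines" in metric declarations.
assert sorted(m.name for m in PipelineType) == ["AUDIO_LLM", "CASCADE", "S2S"]
```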

### Pipelines

**Cascade:**
**CASCADE:**
`User audio → Agent STT → Text → LLM → Text → TTS → Assistant audio`

The LLM processes **transcribed text**, so `transcribed_user_turns` reflects what the assistant actually saw.

**Audio-native:**
`User audio → Raw audio directly to model → Assistant audio (S2S)` or
`User audio → Raw audio directly to model → Text → TTS → Assistant audio (S2T+TTS)`
**AUDIO_LLM:**
`User audio → Raw audio directly to model → Text → TTS → Assistant audio`

**S2S:**
`User audio → Raw audio directly to model → Assistant audio`

For both AUDIO_LLM and S2S, the model processes **raw audio**. The audit log may contain a transcript from the service's own secondary STT, but this is **not what the model used** — it's just for reference. This is why `transcribed_user_turns` is unreliable for audio-native models and `intended_user_turns` should be used instead.

Check `context.pipeline_type` to determine which mode was used, or `context.is_audio_native` for a boolean grouping of `AUDIO_LLM` and `S2S`.
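One plausible way `is_audio_native` could be derived from `pipeline_type`, sketched here with a hypothetical stand-in for `MetricContext` (the real class is richer; only the grouping logic is illustrated):

```python
from dataclasses import dataclass
from enum import Enum

class PipelineType(str, Enum):  # stand-in for eva.models.config.PipelineType
    CASCADE = "cascade"
    AUDIO_LLM = "audio_llm"
    S2S = "s2s"

@dataclass
class Context:
    """Hypothetical stand-in for MetricContext, for illustration only."""
    pipeline_type: PipelineType

    @property
    def is_audio_native(self) -> bool:
        # True for both audio-native pipeline types, False for cascade
        return self.pipeline_type in (PipelineType.AUDIO_LLM, PipelineType.S2S)

assert Context(PipelineType.S2S).is_audio_native
assert Context(PipelineType.AUDIO_LLM).is_audio_native
assert not Context(PipelineType.CASCADE).is_audio_native
```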

### Writing Pipeline-Aware Metrics

**Controlling which pipelines a metric runs on:**

Set `supported_pipeline_types` on your metric class. The runner will skip the metric automatically for unsupported pipelines. The default is all three types.

```python
from eva.models.config import PipelineType

class MyMetric(BaseMetric):
    # Only run on cascade (e.g. STT/transcription metrics)
    supported_pipeline_types = frozenset({PipelineType.CASCADE})

    # Run on cascade and audio LLM but not S2S (e.g. metrics that require a TTS step)
    supported_pipeline_types = frozenset({PipelineType.CASCADE, PipelineType.AUDIO_LLM})

    # Run on all pipelines (default — no declaration needed)
    supported_pipeline_types = frozenset(PipelineType)
```
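Conceptually, the runner-side filtering that `supported_pipeline_types` enables reduces to a set-membership check. A simplified, self-contained sketch (plain strings and `SimpleNamespace` stand in for the real enum and metric classes):

```python
from types import SimpleNamespace

# Plain strings stand in for PipelineType members in this sketch
CASCADE, AUDIO_LLM, S2S = "cascade", "audio_llm", "s2s"

def filter_metrics(metrics, pipeline_type):
    """Keep only metrics whose declared support includes this run's pipeline."""
    return [m for m in metrics if pipeline_type in m.supported_pipeline_types]

metrics = [
    SimpleNamespace(name="stt_wer", supported_pipeline_types=frozenset({CASCADE})),
    SimpleNamespace(name="speakability", supported_pipeline_types=frozenset({CASCADE, AUDIO_LLM})),
    SimpleNamespace(name="faithfulness", supported_pipeline_types=frozenset({CASCADE, AUDIO_LLM, S2S})),
]

print([m.name for m in filter_metrics(metrics, S2S)])      # only faithfulness survives
print([m.name for m in filter_metrics(metrics, CASCADE)])  # all three survive
```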

**Branching on pipeline type within `compute()`:**

If your metric needs user text directly (rather than via `conversation_trace`, which handles this automatically), branch on `context.is_audio_native`:
If your metric runs on multiple pipelines but needs different behavior per type, branch on `context.pipeline_type` or use `context.is_audio_native`:

```python
async def compute(self, context: MetricContext) -> MetricScore:
    # Branch by exact pipeline type
    if context.pipeline_type == PipelineType.CASCADE:
        user_turns = context.transcribed_user_turns
    else:
        user_turns = context.intended_user_turns

    # Or use is_audio_native (True for both AUDIO_LLM and S2S)
    user_turns = context.intended_user_turns if context.is_audio_native else context.transcribed_user_turns

    # Or use conversation_trace (handles audio-native vs cascade automatically)
    for entry in context.conversation_trace:
        if entry["role"] == "user":
            # "intended" for audio-native, "transcribed" for cascade
            user_text = entry["content"]
```

4 changes: 2 additions & 2 deletions docs/metrics/faithfulness.md
@@ -25,7 +25,7 @@ Uses the following MetricContext fields:
- `conversation_trace`: Full conversation with tool calls (via `format_transcript`)
- `agent_instructions`, `agent_role`, `agent_tools`: Agent configuration for policy evaluation
- `current_date_time`: Simulated date/time for temporal reasoning
- `is_audio_native` (audio-native): Architecture flag (controls which prompt variant is used)
- `pipeline_type` / `is_audio_native`: Architecture flag (controls which prompt variant is used — cascade vs. audio-native)

### Audio-Native vs Cascade

@@ -35,7 +35,7 @@ This metric has **significantly different behavior** depending on the architecture
- User turns in the trace are **STT transcripts** — the text the assistant's text LLM actually received.
- The judge evaluates faithfulness against what the assistant saw (the transcript), not what the user actually said.
- If STT transcribed "Kim" but the user said "Kin", using "Kim" is faithful (the assistant can only work with what it received). This issue would be captured by the transcription accuracy key entities metric.
**Audio-native (S2S, S2T+TTS):**
**Audio-native (AUDIO_LLM, S2S):**
- User turns in the trace are **intended text** (what the user simulator was instructed to say), since audio-native models do not use transcriptions.
- The judge evaluates whether the assistant **correctly understood the audio**. If the assistant misheard the user and used incorrect information, that IS a faithfulness violation — accurate audio understanding is part of the audio-native model's responsibility.

2 changes: 1 addition & 1 deletion docs/metrics/speakability.md
@@ -25,7 +25,7 @@ Uses `intended_assistant_turns` from MetricContext — the text sent to the TTS
### Audio-Native vs Cascade

- **Cascade**: Fully applicable — evaluates whether the LLM's text output is appropriate for TTS.
- **Audio-native (S2S, S2T+TTS):** **Skipped entirely** (`skip_s2s = True` (audio-native)). Audio-native models generate audio directly, so there is no separate text-to-TTS step and "speakability" of intermediate text is not meaningful.
- **S2S:** **Skipped** (`supported_pipeline_types = {CASCADE, AUDIO_LLM}`). S2S models generate audio directly without a separate TTS step, so there is no intermediate text whose speakability can be evaluated. AUDIO_LLM models do have a TTS step and are evaluated.

### Evaluation Methodology

2 changes: 1 addition & 1 deletion docs/metrics/stt_wer.md
@@ -26,7 +26,7 @@ Uses the following MetricContext fields:
### Audio-Native vs Cascade

- **Cascade**: Fully applicable — measures the quality of the assistant's STT pipeline, which directly affects the LLM's input.
- **Audio-native (S2S, S2T+TTS):** **Skipped entirely** (`skip_s2s = True` (audio-native)). Audio-native models receive raw audio, not STT transcripts, so measuring STT accuracy is not meaningful. The `transcribed_user_turns` field in audio-native systems comes from a secondary transcription service, not the model's actual input.
- **Audio-native (AUDIO_LLM / S2S):** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT transcripts, so measuring STT accuracy is not meaningful. The `transcribed_user_turns` field in audio-native systems comes from a secondary transcription service, not the model's actual input.
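For background, word error rate is word-level edit distance divided by reference length. A generic sketch (not the project's implementation, which also normalizes text before scoring):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("call kim tomorrow", "call kin tomorrow"))  # 1 substitution / 3 words
```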

### Evaluation Methodology

2 changes: 1 addition & 1 deletion docs/metrics/transcription_accuracy_key_entities.md
@@ -29,7 +29,7 @@ The judge receives both texts side by side and identifies key entities to compare
### Audio-Native vs Cascade

- **Cascade**: Fully applicable — measures whether the assistant's STT correctly captured key entities, which directly affects downstream tool calls and responses.
- **Audio-native (S2S, S2T+TTS):** **Skipped entirely** (`skip_s2s = True` (audio-native)). Audio-native models receive raw audio, not STT output, so entity-level STT accuracy is not meaningful. Entity perception issues in audio-native systems are captured instead by `faithfulness` (which treats mishearing as a faithfulness violation).
- **Audio-native (AUDIO_LLM / S2S):** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT output, so entity-level STT accuracy is not meaningful. Entity perception issues in audio-native systems are captured instead by `faithfulness` (which treats mishearing as a faithfulness violation).

### Evaluation Methodology

2 changes: 1 addition & 1 deletion docs/metrics/user_behavioral_fidelity.md
@@ -31,7 +31,7 @@ Uses the following MetricContext fields:
This metric has **pipeline-specific prompt text** and provides the judge with **two views** of the conversation:

- **Cascade**: The judge sees the agent-side transcript (`conversation_trace`, where user turns are STT transcriptions) alongside the `intended_user_turns` (ground truth). The prompt explains that discrepancies between the two are transcription errors — the user should not be penalized for the agent mishearing.
- **Audio-native (S2S, S2T+TTS):** The judge sees the conversation trace (where user turns are already intended text) alongside the `intended_user_turns`. The prompt explains this is an audio-native system and that discrepancies in agent behavior may be due to audio perception errors, not user corruption.
- **Audio-native (AUDIO_LLM, S2S):** The judge sees the conversation trace (where user turns are already intended text) alongside the `intended_user_turns`. The prompt explains this is an audio-native system and that discrepancies in agent behavior may be due to audio perception errors, not user corruption.

In both cases, `intended_user_turns` serves as ground truth for what the user actually said.

2 changes: 1 addition & 1 deletion src/eva/metrics/base.py
@@ -159,7 +159,7 @@ class BaseMetric(ABC):
    metric_type: MetricType = MetricType.CODE  # Override in subclasses
    pass_at_k_threshold: float = 0.5  # Normalized score threshold for pass@k pass/fail
    exclude_from_pass_at_k: bool = False  # Set True for metrics not suitable for pass@k
    skip_audio_native: bool = False  # Set True for metrics that should not run on audio-native records (S2S, AudioLLM)
    supported_pipeline_types: frozenset[PipelineType] = frozenset(PipelineType)  # Pipeline types this metric supports

    def __init__(self, config: dict[str, Any] | None = None):
        """Initialize the metric.
3 changes: 2 additions & 1 deletion src/eva/metrics/diagnostic/speakability.py
@@ -8,6 +8,7 @@

from eva.metrics.base import MetricContext, PerTurnConversationJudgeMetric
from eva.metrics.registry import register_metric
from eva.models.config import PipelineType


@register_metric
@@ -32,7 +33,7 @@ class SpeakabilityJudgeMetric(PerTurnConversationJudgeMetric):
    description = "Debug metric: LLM judge evaluation of text voice-friendliness per turn"
    category = "diagnostic"
    exclude_from_pass_at_k = True
    skip_audio_native = True
    supported_pipeline_types = frozenset({PipelineType.CASCADE, PipelineType.AUDIO_LLM})
    rating_scale = (0, 1)

    def get_expected_turn_ids(self, context: MetricContext) -> list[int]:
3 changes: 2 additions & 1 deletion src/eva/metrics/diagnostic/stt_wer.py
@@ -11,6 +11,7 @@
from eva.metrics.base import CodeMetric, MetricContext
from eva.metrics.registry import register_metric
from eva.metrics.utils import aggregate_wer_errors, extract_wer_errors, reverse_word_error_rate
from eva.models.config import PipelineType
from eva.models.results import MetricScore
from eva.utils.wer_normalization import normalize_text

@@ -42,7 +43,7 @@ class STTWERMetric(CodeMetric):
    description = "Debug metric: Speech-to-Text transcription accuracy using Word Error Rate"
    category = "diagnostic"
    exclude_from_pass_at_k = True
    skip_audio_native = True
    supported_pipeline_types = frozenset({PipelineType.CASCADE})

    def __init__(self, config: dict | None = None):
        """Initialize the metric with language configuration."""
@@ -9,6 +9,7 @@
from eva.metrics.base import MetricContext, TextJudgeMetric
from eva.metrics.registry import register_metric
from eva.metrics.utils import aggregate_per_turn_scores, parse_judge_response_list, resolve_turn_id
from eva.models.config import PipelineType
from eva.models.results import MetricScore


@@ -45,7 +46,7 @@ class TranscriptionAccuracyKeyEntitiesMetric(TextJudgeMetric):
    description = "Debug metric: LLM judge evaluation of STT key entity transcription accuracy for entire conversation"
    category = "diagnostic"
    exclude_from_pass_at_k = True
    skip_audio_native = True
    supported_pipeline_types = frozenset({PipelineType.CASCADE})
    rating_scale = None  # Custom scoring (not 1-3 scale)
    default_aggregation = "mean"

10 changes: 4 additions & 6 deletions src/eva/metrics/runner.py
@@ -388,12 +388,10 @@ async def compute_metric(metric: BaseMetric) -> tuple[str, MetricScore]:
)

# Filter out metrics incompatible with the pipeline type
applicable_metrics = metrics_to_run
if context.is_audio_native:
skipped = [m.name for m in metrics_to_run if m.skip_audio_native]
if skipped:
logger.info(f"[{record_id}] Skipping metrics incompatible with audio-native pipeline: {skipped}")
applicable_metrics = [m for m in metrics_to_run if not m.skip_audio_native]
skipped = [m.name for m in metrics_to_run if context.pipeline_type not in m.supported_pipeline_types]
if skipped:
logger.info(f"[{record_id}] Skipping metrics incompatible with {context.pipeline_type} pipeline: {skipped}")
applicable_metrics = [m for m in metrics_to_run if context.pipeline_type in m.supported_pipeline_types]

# Run all metrics in parallel
tasks = [compute_metric(metric) for metric in applicable_metrics]
5 changes: 3 additions & 2 deletions tests/unit/metrics/test_runner.py
@@ -7,18 +7,19 @@
import yaml

from eva.metrics.runner import MetricsRunner
from eva.models.config import PipelineType
from eva.models.results import MetricScore, RecordMetrics
from tests.unit.conftest import make_evaluation_record

from .conftest import make_metric_score


class _FakeMetric:
    """Minimal stand-in for BaseMetric — only ``name``, ``skip_audio_native``, and pass@k attrs are read."""
    """Minimal stand-in for BaseMetric — only ``name``, ``supported_pipeline_types``, and pass@k attrs are read."""

    def __init__(self, name: str):
        self.name = name
        self.skip_audio_native = False
        self.supported_pipeline_types = frozenset(PipelineType)
        self.exclude_from_pass_at_k = False
        self.pass_at_k_threshold = 0.5
