Merged
58 changes: 45 additions & 13 deletions docs/metric_context.md
@@ -193,40 +193,72 @@ These are absolute paths to output files saved during benchmark execution.
- **`audio_user_path: Optional[str]`** - Path to the user simulator's audio channel (mono WAV file).
- **`audio_mixed_path: Optional[str]`** - Path to the mixed stereo audio (left=assistant, right=user).

## Audio-Native (S2S, S2T+TTS) vs Cascade Architecture
## Audio-Native (S2S, AUDIO_LLM) vs Cascade Architecture

This section explains the architectural differences that affect which variables are reliable. For a quick summary, see [Why Multiple Representations?](#why-multiple-representations-of-the-same-conversation).

"Audio-native" is an umbrella term for architectures where the model processes raw audio input directly, as opposed to cascade where the model receives STT text. This includes Speech-to-Speech (S2S) and Speech-to-Text + TTS (S2T+TTS) architectures.
"Audio-native" is an umbrella term for architectures where the model processes raw audio input directly, as opposed to cascade where the model receives STT text. There are three pipeline types (`PipelineType` enum):

- **`CASCADE`** — separate STT → LLM → TTS steps
- **`AUDIO_LLM`** — audio input to LLM + separate TTS (S2T+TTS)
- **`S2S`** — end-to-end speech-to-speech model (no separate TTS)
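For orientation, here is a minimal sketch of what such an enum could look like (the string values are assumptions; the actual definition lives in `eva.models.config` and may differ):

```python
from enum import Enum

class PipelineType(str, Enum):
    """Sketch of the pipeline-type enum; real member values may differ."""
    CASCADE = "cascade"      # STT -> LLM -> TTS
    AUDIO_LLM = "audio_llm"  # audio-in LLM + separate TTS
    S2S = "s2s"              # end-to-end speech-to-speech

# Iterating the enum yields all members, so frozenset(PipelineType)
# denotes "all pipelines" in metric declarations.
assert sorted(m.name for m in PipelineType) == ["AUDIO_LLM", "CASCADE", "S2S"]
```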

### Pipelines

**Cascade:**
**CASCADE:**
`User audio → Agent STT → Text → LLM → Text → TTS → Assistant audio`

The LLM processes **transcribed text**, so `transcribed_user_turns` reflects what the assistant actually saw.

**Audio-native:**
`User audio → Raw audio directly to model → Assistant audio (S2S)` or
`User audio → Raw audio directly to model → Text → TTS → Assistant audio (S2T+TTS)`
**AUDIO_LLM:**
`User audio → Raw audio directly to model → Text → TTS → Assistant audio`

**S2S:**
`User audio → Raw audio directly to model → Assistant audio`

For both AUDIO_LLM and S2S, the model processes **raw audio**. The audit log may contain a transcript from the service's own secondary STT, but this is **not what the model used** — it's just for reference. This is why `transcribed_user_turns` is unreliable for audio-native models and `intended_user_turns` should be used instead.

Check `context.pipeline_type` to determine which mode was used, or `context.is_audio_native` for a boolean grouping of `AUDIO_LLM` and `S2S`.
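One plausible way `is_audio_native` could be derived from `pipeline_type`, sketched here with a hypothetical stand-in for `MetricContext` (the real class is richer; only the grouping logic is illustrated):

```python
from dataclasses import dataclass
from enum import Enum

class PipelineType(str, Enum):  # stand-in for eva.models.config.PipelineType
    CASCADE = "cascade"
    AUDIO_LLM = "audio_llm"
    S2S = "s2s"

@dataclass
class Context:
    """Hypothetical stand-in for MetricContext, for illustration only."""
    pipeline_type: PipelineType

    @property
    def is_audio_native(self) -> bool:
        # True for both audio-native pipeline types, False for cascade
        return self.pipeline_type in (PipelineType.AUDIO_LLM, PipelineType.S2S)

assert Context(PipelineType.S2S).is_audio_native
assert Context(PipelineType.AUDIO_LLM).is_audio_native
assert not Context(PipelineType.CASCADE).is_audio_native
```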

### Writing Pipeline-Aware Metrics

**Controlling which pipelines a metric runs on:**

Set `supported_pipeline_types` on your metric class. The runner will skip the metric automatically for unsupported pipelines. The default is all three types.

```python
from eva.models.config import PipelineType

class MyMetric(BaseMetric):
    # Only run on cascade (e.g. STT/transcription metrics)
    supported_pipeline_types = frozenset({PipelineType.CASCADE})

    # Run on cascade and audio LLM but not S2S (e.g. metrics that require a TTS step)
    supported_pipeline_types = frozenset({PipelineType.CASCADE, PipelineType.AUDIO_LLM})

    # Run on all pipelines (default — no declaration needed)
    supported_pipeline_types = frozenset(PipelineType)
```
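Conceptually, the runner-side filtering that `supported_pipeline_types` enables reduces to a set-membership check. A simplified, self-contained sketch (plain strings and `SimpleNamespace` stand in for the real enum and metric classes):

```python
from types import SimpleNamespace

# Plain strings stand in for PipelineType members in this sketch
CASCADE, AUDIO_LLM, S2S = "cascade", "audio_llm", "s2s"

def filter_metrics(metrics, pipeline_type):
    """Keep only metrics whose declared support includes this run's pipeline."""
    return [m for m in metrics if pipeline_type in m.supported_pipeline_types]

metrics = [
    SimpleNamespace(name="stt_wer", supported_pipeline_types=frozenset({CASCADE})),
    SimpleNamespace(name="speakability", supported_pipeline_types=frozenset({CASCADE, AUDIO_LLM})),
    SimpleNamespace(name="faithfulness", supported_pipeline_types=frozenset({CASCADE, AUDIO_LLM, S2S})),
]

print([m.name for m in filter_metrics(metrics, S2S)])      # only faithfulness survives
print([m.name for m in filter_metrics(metrics, CASCADE)])  # all three survive
```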

**Branching on pipeline type within `compute()`:**

If your metric needs user text directly (rather than via `conversation_trace`, which handles this automatically), branch on `context.is_audio_native`:
If your metric runs on multiple pipelines but needs different behavior per type, branch on `context.pipeline_type` or use `context.is_audio_native`:

```python
async def compute(self, context: MetricContext) -> MetricScore:
    # Branch by exact pipeline type
    if context.pipeline_type == PipelineType.CASCADE:
        user_turns = context.transcribed_user_turns
    else:
        user_turns = context.intended_user_turns

    # Or use is_audio_native (True for both AUDIO_LLM and S2S)
    user_turns = context.intended_user_turns if context.is_audio_native else context.transcribed_user_turns

    # Or use conversation_trace (handles audio-native vs cascade automatically)
    for entry in context.conversation_trace:
        if entry["role"] == "user":
            # "intended" for audio-native, "transcribed" for cascade
            user_text = entry["content"]
```

4 changes: 2 additions & 2 deletions docs/metrics/faithfulness.md
@@ -25,7 +25,7 @@ Uses the following MetricContext fields:
- `conversation_trace`: Full conversation with tool calls (via `format_transcript`)
- `agent_instructions`, `agent_role`, `agent_tools`: Agent configuration for policy evaluation
- `current_date_time`: Simulated date/time for temporal reasoning
- `is_audio_native` (audio-native): Architecture flag (controls which prompt variant is used)
- `pipeline_type` / `is_audio_native`: Architecture flag (controls which prompt variant is used — cascade vs. audio-native)

### Audio-Native vs Cascade

@@ -35,7 +35,7 @@ This metric has **significantly different behavior** depending on the architecture
- User turns in the trace are **STT transcripts** — the text the assistant's text LLM actually received.
- The judge evaluates faithfulness against what the assistant saw (the transcript), not what the user actually said.
- If STT transcribed "Kim" but the user said "Kin", using "Kim" is faithful (the assistant can only work with what it received). This issue would be captured by the transcription accuracy key entities metric.
**Audio-native (S2S, S2T+TTS):**
**Audio-native (AUDIO_LLM, S2S):**
- User turns in the trace are **intended text** (what the user simulator was instructed to say), since audio-native models do not use transcriptions.
- The judge evaluates whether the assistant **correctly understood the audio**. If the assistant misheard the user and used incorrect information, that IS a faithfulness violation — accurate audio understanding is part of the audio-native model's responsibility.

2 changes: 1 addition & 1 deletion docs/metrics/speakability.md
@@ -25,7 +25,7 @@ Uses `intended_assistant_turns` from MetricContext — the text sent to the TTS
### Audio-Native vs Cascade

- **Cascade**: Fully applicable — evaluates whether the LLM's text output is appropriate for TTS.
- **Audio-native (S2S, S2T+TTS):** **Skipped entirely** (`skip_s2s = True` (audio-native)). Audio-native models generate audio directly, so there is no separate text-to-TTS step and "speakability" of intermediate text is not meaningful.
- **S2S:** **Skipped** (`supported_pipeline_types = {CASCADE, AUDIO_LLM}`). S2S models generate audio directly without a separate TTS step, so there is no intermediate text whose speakability can be evaluated. AUDIO_LLM models do have a TTS step and are evaluated.

### Evaluation Methodology

2 changes: 1 addition & 1 deletion docs/metrics/stt_wer.md
@@ -26,7 +26,7 @@ Uses the following MetricContext fields:
### Audio-Native vs Cascade

- **Cascade**: Fully applicable — measures the quality of the assistant's STT pipeline, which directly affects the LLM's input.
- **Audio-native (S2S, S2T+TTS):** **Skipped entirely** (`skip_s2s = True` (audio-native)). Audio-native models receive raw audio, not STT transcripts, so measuring STT accuracy is not meaningful. The `transcribed_user_turns` field in audio-native systems comes from a secondary transcription service, not the model's actual input.
- **Audio-native (AUDIO_LLM / S2S):** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT transcripts, so measuring STT accuracy is not meaningful. The `transcribed_user_turns` field in audio-native systems comes from a secondary transcription service, not the model's actual input.
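For background, word error rate is word-level edit distance divided by reference length. A generic sketch (not the project's implementation, which also normalizes text before scoring):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("call kim tomorrow", "call kin tomorrow"))  # 1 substitution / 3 words
```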

### Evaluation Methodology

2 changes: 1 addition & 1 deletion docs/metrics/transcription_accuracy_key_entities.md
@@ -29,7 +29,7 @@ The judge receives both texts side by side and identifies key entities to compare
### Audio-Native vs Cascade

- **Cascade**: Fully applicable — measures whether the assistant's STT correctly captured key entities, which directly affects downstream tool calls and responses.
- **Audio-native (S2S, S2T+TTS):** **Skipped entirely** (`skip_s2s = True` (audio-native)). Audio-native models receive raw audio, not STT output, so entity-level STT accuracy is not meaningful. Entity perception issues in audio-native systems are captured instead by `faithfulness` (which treats mishearing as a faithfulness violation).
- **Audio-native (AUDIO_LLM / S2S):** **Skipped entirely** (`supported_pipeline_types = {CASCADE}`). Audio-native models receive raw audio, not STT output, so entity-level STT accuracy is not meaningful. Entity perception issues in audio-native systems are captured instead by `faithfulness` (which treats mishearing as a faithfulness violation).

### Evaluation Methodology

2 changes: 1 addition & 1 deletion docs/metrics/user_behavioral_fidelity.md
@@ -31,7 +31,7 @@ Uses the following MetricContext fields:
This metric has **pipeline-specific prompt text** and provides the judge with **two views** of the conversation:

- **Cascade**: The judge sees the agent-side transcript (`conversation_trace`, where user turns are STT transcriptions) alongside the `intended_user_turns` (ground truth). The prompt explains that discrepancies between the two are transcription errors — the user should not be penalized for the agent mishearing.
- **Audio-native (S2S, S2T+TTS):** The judge sees the conversation trace (where user turns are already intended text) alongside the `intended_user_turns`. The prompt explains this is an audio-native system and that discrepancies in agent behavior may be due to audio perception errors, not user corruption.
- **Audio-native (AUDIO_LLM, S2S):** The judge sees the conversation trace (where user turns are already intended text) alongside the `intended_user_turns`. The prompt explains this is an audio-native system and that discrepancies in agent behavior may be due to audio perception errors, not user corruption.

In both cases, `intended_user_turns` serves as ground truth for what the user actually said.

2 changes: 1 addition & 1 deletion src/eva/metrics/base.py
@@ -159,7 +159,7 @@ class BaseMetric(ABC):
    metric_type: MetricType = MetricType.CODE  # Override in subclasses
    pass_at_k_threshold: float = 0.5  # Normalized score threshold for pass@k pass/fail
    exclude_from_pass_at_k: bool = False  # Set True for metrics not suitable for pass@k
    skip_audio_native: bool = False  # Set True for metrics that should not run on audio-native records (S2S, AudioLLM)
    supported_pipeline_types: frozenset[PipelineType] = frozenset(PipelineType)  # Pipeline types this metric supports

    def __init__(self, config: dict[str, Any] | None = None):
        """Initialize the metric.
3 changes: 2 additions & 1 deletion src/eva/metrics/diagnostic/speakability.py
@@ -8,6 +8,7 @@

from eva.metrics.base import MetricContext, PerTurnConversationJudgeMetric
from eva.metrics.registry import register_metric
from eva.models.config import PipelineType


@register_metric
@@ -32,7 +33,7 @@ class SpeakabilityJudgeMetric(PerTurnConversationJudgeMetric):
    description = "Debug metric: LLM judge evaluation of text voice-friendliness per turn"
    category = "diagnostic"
    exclude_from_pass_at_k = True
    skip_audio_native = True
    supported_pipeline_types = frozenset({PipelineType.CASCADE, PipelineType.AUDIO_LLM})
    rating_scale = (0, 1)

    def get_expected_turn_ids(self, context: MetricContext) -> list[int]:
3 changes: 2 additions & 1 deletion src/eva/metrics/diagnostic/stt_wer.py
@@ -11,6 +11,7 @@
from eva.metrics.base import CodeMetric, MetricContext
from eva.metrics.registry import register_metric
from eva.metrics.utils import aggregate_wer_errors, extract_wer_errors, reverse_word_error_rate
from eva.models.config import PipelineType
from eva.models.results import MetricScore
from eva.utils.wer_normalization import normalize_text

@@ -42,7 +43,7 @@ class STTWERMetric(CodeMetric):
    description = "Debug metric: Speech-to-Text transcription accuracy using Word Error Rate"
    category = "diagnostic"
    exclude_from_pass_at_k = True
    skip_audio_native = True
    supported_pipeline_types = frozenset({PipelineType.CASCADE})

    def __init__(self, config: dict | None = None):
        """Initialize the metric with language configuration."""
@@ -9,6 +9,7 @@
from eva.metrics.base import MetricContext, TextJudgeMetric
from eva.metrics.registry import register_metric
from eva.metrics.utils import aggregate_per_turn_scores, parse_judge_response_list, resolve_turn_id
from eva.models.config import PipelineType
from eva.models.results import MetricScore


@@ -45,7 +46,7 @@ class TranscriptionAccuracyKeyEntitiesMetric(TextJudgeMetric):
    description = "Debug metric: LLM judge evaluation of STT key entity transcription accuracy for entire conversation"
    category = "diagnostic"
    exclude_from_pass_at_k = True
    skip_audio_native = True
    supported_pipeline_types = frozenset({PipelineType.CASCADE})
    rating_scale = None  # Custom scoring (not 1-3 scale)
    default_aggregation = "mean"

10 changes: 4 additions & 6 deletions src/eva/metrics/runner.py
@@ -388,12 +388,10 @@ async def compute_metric(metric: BaseMetric) -> tuple[str, MetricScore]:
)

# Filter out metrics incompatible with the pipeline type
applicable_metrics = metrics_to_run
if context.is_audio_native:
skipped = [m.name for m in metrics_to_run if m.skip_audio_native]
if skipped:
logger.info(f"[{record_id}] Skipping metrics incompatible with audio-native pipeline: {skipped}")
applicable_metrics = [m for m in metrics_to_run if not m.skip_audio_native]
skipped = [m.name for m in metrics_to_run if context.pipeline_type not in m.supported_pipeline_types]
if skipped:
logger.info(f"[{record_id}] Skipping metrics incompatible with {context.pipeline_type} pipeline: {skipped}")
applicable_metrics = [m for m in metrics_to_run if context.pipeline_type in m.supported_pipeline_types]

# Run all metrics in parallel
tasks = [compute_metric(metric) for metric in applicable_metrics]
5 changes: 3 additions & 2 deletions tests/unit/metrics/test_runner.py
@@ -7,18 +7,19 @@
import yaml

from eva.metrics.runner import MetricsRunner
from eva.models.config import PipelineType
from eva.models.results import MetricScore, RecordMetrics
from tests.unit.conftest import make_evaluation_record

from .conftest import make_metric_score


class _FakeMetric:
    """Minimal stand-in for BaseMetric — only ``name``, ``skip_audio_native``, and pass@k attrs are read."""
    """Minimal stand-in for BaseMetric — only ``name``, ``supported_pipeline_types``, and pass@k attrs are read."""

    def __init__(self, name: str):
        self.name = name
        self.skip_audio_native = False
        self.supported_pipeline_types = frozenset(PipelineType)
        self.exclude_from_pass_at_k = False
        self.pass_at_k_threshold = 0.5
