adk web: Native-audio models fail when TEXT modality is requested #4206

@kazunori279

Description

Problem

When using adk web with a native-audio model (e.g., gemini-live-2.5-flash-native-audio), requesting the TEXT modality causes an error, because native-audio models only support the AUDIO modality.
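
A minimal reproduction at the API level (the agent name and user ID are placeholders, and this assumes the current keyword signature of Runner.run_live):

import asyncio

from google.adk.agents import Agent, LiveRequestQueue
from google.adk.agents.run_config import RunConfig
from google.adk.runners import InMemoryRunner

# Placeholder agent wired to the native-audio model from the report
agent = Agent(name="assistant", model="gemini-live-2.5-flash-native-audio")
runner = InMemoryRunner(agent=agent)

async def repro():
    session = await runner.session_service.create_session(
        app_name=runner.app_name, user_id="u1"
    )
    # Requesting TEXT from a native-audio model is what triggers the error
    async for event in runner.run_live(
        user_id="u1",
        session_id=session.id,
        live_request_queue=LiveRequestQueue(),
        run_config=RunConfig(response_modalities=["TEXT"]),
    ):
        print(event)

asyncio.run(repro())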

Root Cause

In src/google/adk/cli/adk_web_server.py, the /run_live WebSocket endpoint accepts modalities from query parameters without validating them against the model's capabilities:

# Lines 1636-1638
modalities: List[Literal["TEXT", "AUDIO"]] = Query(
    default=["AUDIO"]
),  # Only allows "TEXT" or "AUDIO"
# Line 1655
run_config = RunConfig(response_modalities=modalities)
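
For illustration, a client request that triggers the failure might look like this (query parameter names other than modalities are assumptions):

ws://localhost:8000/run_live?app_name=my_app&user_id=u1&session_id=s1&modalities=TEXT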

The code at runners.py:985-988 only applies a default when response_modalities is None; it does not reject an explicit TEXT request against a native-audio model.
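
In paraphrase, the defaulting logic there has roughly this shape (not the exact source):

# Rough paraphrase of runners.py:985-988, not the exact source
if run_config.response_modalities is None:
    # A default is applied only when nothing was requested
    run_config.response_modalities = ["AUDIO"]
# An explicit ["TEXT"] passes through untouched, even for native-audio models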

Proposed Solution

  1. Detect native-audio models by checking if the model name contains "native-audio"
  2. Force AUDIO modality for these models instead of returning an error
  3. Ensure output_audio_transcription is enabled so users can see text transcription of the audio response

Implementation

In adk_web_server.py, around lines 1653-1655:

async def forward_events():
    runner = await self.get_runner_async(app_name)
    
    # Check if agent uses a native-audio model
    agent_or_app = self.agent_loader.load_agent(app_name)
    root_agent = self._get_root_agent(agent_or_app)
    model = root_agent.model
    # The model may be a plain string or a model object; falling back to a
    # `.model` attribute here is an assumption about ADK's model wrappers
    model_name = model if isinstance(model, str) else getattr(model, "model", "")
    
    # Native audio models only support AUDIO modality
    if "native-audio" in model_name:
        effective_modalities = ["AUDIO"]
    else:
        effective_modalities = modalities
    
    run_config = RunConfig(response_modalities=effective_modalities)
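
Step 3 of the proposal is not covered by the snippet above. A minimal sketch, assuming RunConfig's output_audio_transcription field is backed by google.genai's AudioTranscriptionConfig:

from google.genai import types

# Keep audio transcription enabled when AUDIO is forced, so the UI can
# still surface text for the response
run_config = RunConfig(
    response_modalities=effective_modalities,
    output_audio_transcription=types.AudioTranscriptionConfig(),
)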

Additional Issue: Transcription not displayed in UI

When using the AUDIO modality, output_audio_transcription is enabled by default in RunConfig, and transcription events are created by TranscriptionManager. However, the frontend does not currently render outputTranscription.text; it only displays content.parts[].text.

This means users won't see any text when using native-audio models, even though transcription data is available in the events.

Options to address this:

  1. Update the frontend to display outputTranscription.text when present
  2. Convert transcription to content in the backend so the existing UI can display it (see the sketch below)
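
A minimal sketch of option 2, assuming live events expose an output_transcription with a .text field (the helper name is hypothetical):

from google.genai import types

def transcription_to_content(event):
    # Hypothetical helper: mirror the output transcription into
    # event.content so the existing UI renders it as ordinary text
    transcription = getattr(event, "output_transcription", None)
    if transcription and transcription.text and not event.content:
        event.content = types.Content(
            role="model",
            parts=[types.Part(text=transcription.text)],
        )
    return event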

Affected Files

  • src/google/adk/cli/adk_web_server.py - WebSocket endpoint for live sessions
  • src/google/adk/runners.py - Default modality handling
  • src/google/adk/flows/llm_flows/transcription_manager.py - Transcription event creation

Metadata

Assignees

No one assigned

Labels

  • good first issue: [Community] This issue is good for newcomers to participate
  • tools: [Component] This issue is related to tools
  • web: [Component] This issue will be transferred to adk-web
