feat(examples): add say-server - streaming Pocket TTS with karaoke highlighting #304

ochafik · 2026-01-19T17:24:47Z

Summary

Adds a new example MCP App: say-server - a real-time text-to-speech server with karaoke-style text highlighting, powered by Kyutai's Pocket TTS 🐐 🔥.

MCP App Features Demonstrated

Single-file executable: Python server with embedded React UI - no build step required
Partial tool inputs (ontoolinputpartial): Widget receives streaming text as it's being generated
Queue-based streaming: Demonstrates how to stream text out and audio in via a polling tool (adds text to an input queue, retrieves audio chunks from an output queue)
Model context updates: Widget updates the LLM with playback progress
Native theming: Uses CSS variables for automatic dark/light mode adaptation
Fullscreen mode: Toggle fullscreen via requestDisplayMode() API
Multi-widget speak lock: Coordinates multiple TTS widgets via localStorage so only one plays at a time (see below)
Hidden tools (visibility: ["app"]): Private tools only accessible to the widget, not the model
External links (openLink): Attribution popup uses app.openLink() to open external URLs
CSP metadata: Resource declares required domains (esm.sh) for in-browser transpilation
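
The queue-based streaming item above can be sketched roughly as follows. This is a minimal illustration, not the code in server.py: `TTSQueue`, `fake_synthesize`, `run_worker`, and `add_tts_text` are hypothetical names, and the "audio" is just the encoded text. `poll_tts_audio` and `end_tts_queue` borrow tool names mentioned in the commit messages, but their signatures here are assumptions.

```python
import queue
import threading


class TTSQueue:
    """Hypothetical per-session state: text goes in, audio chunks come out."""

    def __init__(self):
        self.text_in = queue.Queue()       # text deltas pushed by the widget
        self.audio_out = queue.Queue()     # audio chunks produced by the worker
        self.finished = threading.Event()  # set once the end signal is processed


def fake_synthesize(text):
    # Placeholder for the real TTS model: returns the text as bytes.
    return text.encode("utf-8")


def run_worker(q):
    # Consume text until the end marker (None) arrives.
    while True:
        text = q.text_in.get()
        if text is None:
            q.finished.set()
            return
        q.audio_out.put(fake_synthesize(text))


def add_tts_text(q, text):
    q.text_in.put(text)


def end_tts_queue(q):
    q.text_in.put(None)


def poll_tts_audio(q):
    # Read the flag before draining: if it is already set, every chunk
    # has been queued, so the drain below cannot miss one.
    done = q.finished.is_set()
    chunks = []
    while True:
        try:
            chunks.append(q.audio_out.get_nowait())
        except queue.Empty:
            break
    return {"chunks": chunks, "done": done}
```

The widget would call the polling function repeatedly until `done` is true, playing chunks as they arrive.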

Multi-Widget Speak Lock

When multiple TTS widgets exist in the same browser (e.g., multiple chat messages each with their own say widget), they coordinate via localStorage to ensure only one plays at a time:

| Step | Description |
| --- | --- |
| 1. Unique IDs | Each widget receives a UUID via `toolResult._meta.widgetUUID` |
| 2. Announce | On play, widget writes `{uuid, timestamp}` to `localStorage["mcp-tts-playing"]` |
| 3. Poll | Every 200ms, playing widgets check if another took the lock |
| 4. Yield | If another widget started, pause and yield gracefully |
| 5. Clean up | On pause/finish, clear the lock (only if owned) |

This "last writer wins" protocol ensures clicking play on any widget immediately pauses others, without requiring cross-iframe postMessage coordination.
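
The protocol above can be simulated outside the browser. The sketch below is a language-neutral model in Python, not the widget's actual JavaScript: a plain dict stands in for `localStorage`, the `Widget` class and its method names are hypothetical, and the UUID is generated locally rather than received via `toolResult._meta.widgetUUID`.

```python
import json
import uuid

LOCK_KEY = "mcp-tts-playing"


class Widget:
    """Hypothetical model of one TTS widget sharing a storage dict."""

    def __init__(self, storage):
        self.storage = storage          # stand-in for window.localStorage
        self.uuid = str(uuid.uuid4())   # assumption: real widgets get this from the host
        self.playing = False

    def _owner(self):
        raw = self.storage.get(LOCK_KEY)
        return json.loads(raw)["uuid"] if raw is not None else None

    def play(self, now):
        # Announce: the last writer wins the lock.
        self.storage[LOCK_KEY] = json.dumps({"uuid": self.uuid, "timestamp": now})
        self.playing = True

    def poll(self):
        # Would run every 200ms while playing; yield if another widget took over.
        owner = self._owner()
        if self.playing and owner is not None and owner != self.uuid:
            self.playing = False

    def pause(self):
        # Clean up: clear the lock only if this widget still owns it.
        self.playing = False
        if self._owner() == self.uuid:
            del self.storage[LOCK_KEY]
```

Because a yielding widget never clears a lock it does not own, the widget that started last keeps playing undisturbed.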

Features

Streaming TTS: Audio starts playing as text is being generated
Karaoke highlighting: Words are highlighted in sync with speech
Interactive controls: Click to pause/resume, double-click to restart
Low latency: Uses a polling-based queue for minimal delay

Testing

E2E tests added in tests/e2e/servers.spec.ts
Tested locally via stdio transport

Host config:

{
    "mcpServers": {
        "say": {
            "command": "uv",
            "args": [
                "run",
                "--index",
                "https://pypi.org/simple",
                "https://raw.githubusercontent.com/modelcontextprotocol/ext-apps/ochafik/say-server/examples/say-server/server.py",
                "--stdio"
            ]
        }
    }
}

Less trusting host config (note that the Docker container can still scan your host's local ports; security is always relative):

{
    "mcpServers": {
        "say": {
            "command": "docker",
            "args": [
                "run",
                "--quiet",
                "--rm",
                "-i",
                "ghcr.io/astral-sh/uv:debian",
                "uv",
                "run",
                "--index",
                "https://pypi.org/simple",
                "https://raw.githubusercontent.com/modelcontextprotocol/ext-apps/ochafik/say-server/examples/say-server/server.py",
                "--stdio"
            ]
        }
    }
}

Dependencies

Uses Python with uv or docker for zero-config execution:

mcp - MCP Python SDK
pocket-tts - Kyutai's TTS library
uvicorn, starlette - HTTP server

Third-party Licenses

All third-party code and models should be properly attributed (see the LICENSE comments in server.py and the info bubble):

pocket-tts python package (Apache 2.0)
pocket-tts model (CC BY 4.0)
kyutai/tts-voices (CC BY 4.0)

- Pin Zod to v3.25.1 via esm.sh deps param (fixes .custom() error)
- Add external=react,react-dom to prevent duplicate React instances
- Update React from 19.1.0 to 19.2.0
- Import React for Babel's classic JSX transform
- Consolidate imports: only use @modelcontextprotocol/ext-apps/react
- Use Babel from unpkg.com instead of esm.sh
- Simplify resource decorator with meta= parameter
- Update CSP to allow both esm.sh and unpkg.com

- Add fullscreen toggle button (appears on hover when available)
- Remove max-height/scroll in inline mode for cleaner display
- Enable scrolling only in fullscreen mode
- Track displayMode and fullscreenAvailable state
- Handle onHostContextChanged to detect fullscreen support

…e audio streams

- Check initial host context for fullscreen availability (not just updates)
- Add audioOperationInProgressRef guard to prevent concurrent audio operations
- Prevents race conditions from rapid clicks/double-clicks

- Remove onClick from text display (was conflicting with double-click)
- Always show overlay button (with pause icon when playing)
- Lower opacity (0.3) when playing, full on hover
- Button shows: ▶️ (play), ⏸️ (pause), 🔄 (restart)
- Double-click text to restart still works
- Eliminates click/dblclick race condition

- Add promise lock to initTTSQueue to prevent concurrent initialization
- Reset queueIdRef and lastTextRef in ontoolresult for next tool call
- Close existing AudioContext when starting new session
- Reset all playback state (chunks, timings, position) on new session
- Fixes: text deltas not sent when new tool call has shorter text

…ef reset

- Don't reset queueIdRef in ontoolresult - audio should keep playing
- Detect new sessions in ontoolinputpartial/ontoolinput by checking if new text starts with lastTextRef (continuation) or not (new session)
- Only reset and create new queue when genuinely new session detected
- Fixes: pause/play was creating new streams because queueIdRef was reset
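
The prefix check described above can be modelled in a few lines. This is a simplified Python sketch of the idea, not the widget's JavaScript: `SessionTracker` and `on_partial` are hypothetical names, and queue IDs are plain counters here rather than IDs returned by the server.

```python
class SessionTracker:
    """Hypothetical model of the continuation-vs-new-session check."""

    def __init__(self):
        self.last_text = ""   # plays the role of lastTextRef
        self.queue_id = None  # plays the role of queueIdRef
        self._next_id = 0

    def on_partial(self, text):
        """Return the text delta to send for this partial input."""
        is_continuation = self.queue_id is not None and text.startswith(self.last_text)
        if not is_continuation:
            # Genuinely new session: start a fresh queue and forget old text.
            self._next_id += 1
            self.queue_id = self._next_id
            self.last_text = ""
        delta = text[len(self.last_text):]
        self.last_text = text
        return delta
```

A new tool call whose text is shorter than the previous one fails the prefix check, so it gets a fresh queue and its full text is sent as the first delta.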

Theming:
- Use CSS variables (--font-sans, --color-text-primary/secondary) with fallbacks
- Apply host fonts via applyHostFonts() on connect and context changes
- Apply host style variables via applyHostStyleVariables()
- Apply document theme via applyDocumentTheme()

Floating button:
- Button follows cursor position (below current line being spoken)
- Uses invisible cursor marker element for accurate positioning
- Shows ⏸️ when playing, ▶️ when paused, 🔄 when finished
- Low opacity when playing, full on hover

Keyboard shortcuts:
- Space: toggle play/pause
- Enter: toggle fullscreen (when available)

UI changes:
- Large overlay only shown for initial play (idle state)
- Floating button used during playback for less intrusive controls

… keyboard shortcuts

Reverted to working commit 413beca and carefully reapplied:
- CSS variables for theming (--font-sans, --color-text-primary/secondary)
- Dark mode support via @media (prefers-color-scheme)
- Host theming via applyHostStyleVariables/applyHostFonts/applyDocumentTheme
- Keyboard shortcuts: Space = play/pause, Enter = fullscreen

Preserved working behavior:
- Play overlay visible when not playing, hover to show when playing
- Pause on click works correctly
- Karaoke highlighting works

… version

Only CSS changes from working commit 413beca:
- Added CSS variables for theming foundation
- No JavaScript changes (reverted theming calls and keyboard shortcuts)

This isolates whether the issue is CSS or JS related.

- Add autoPlay parameter (default: true) with Pydantic Field description
- Description visible in MCP tool schema: 'browsers may block autoplay'
- Add Reset button next to Fullscreen button (bottom right)
- Both buttons share controlBtn base styles, shown on hover
- Widget reads autoPlay from tool input params

Logging added to:
- togglePlayPause (all branches)
- initTTSQueue (entry and AudioContext creation)
- restartPlayback (entry)
- ontoolinputpartial/ontoolinput (entry and new session detection)

Open browser DevTools console and reproduce the issue - logs will show what's being called.

… corners

- Top left: playPauseTopBtn for quick access
- Bottom right: playPauseBtn alongside reset and fullscreen
- Both show on hover with same styling as other control buttons
- Icons: ▶️ (play), ⏸️ (pause), 🔄 (restart when finished)

- Remove central play overlay, use single toolbar in top-right
- Fix audio duplication on pause/resume (only defer to pendingChunks for initial autoplay)
- Remove double-click to restart (use toolbar restart button)
- Sync displayMode when host changes it externally
- Click on text to play/pause

- Add list_voices tool to show available voices
- Add voice parameter to say tool (predefined names, HuggingFace URLs, or local paths)
- Predefined voices: alba, marius, javert, jean, cosette, eponine, azelma, fantine
- Update README with voice documentation

- Add English-only constraint prominently in first line and note
- List common trigger patterns (say, speak, read aloud, etc.)
- Update text parameter description to mention English
- Keep description concise for quick LLM scanning

When multiple TTS widgets exist, only one plays at a time:
- Widget announces to localStorage when starting playback
- Playing widgets poll localStorage every 200ms
- If another widget starts, previous one pauses automatically
- Lock is cleared on pause/finish/teardown

Uses _meta.widgetUUID in tool result for widget identification.

- Add say-server to demo gallery with 300x300 grid-cell.png
- Fix qr-server grid-cell.png dimensions (was 672x668, now 300x300)
- Add golden snapshot for say-server E2E tests
- Add full screenshot and grid-cell thumbnail for say-server

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>

- Add onKeyDown handler to main element for Escape key
- Update README with full list of MCP App features:
  - Model context updates
  - Native theming
  - Fullscreen mode
  - Multi-widget speak lock
- Update server.py docstring with feature documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>

The TTS queue could hang forever if:
- The LLM stops generating mid-stream (no end signal)
- User cancels before tool completes
- Network interruption prevents end_tts_queue call

Changes:
- Add last_activity timestamp to track queue activity
- Add 30-second timeout (QUEUE_TIMEOUT_SECONDS)
- Poll with 5-second timeout, checking for stale queues
- Mark stale queues as "error" so widget stops polling
- Return error message in poll_tts_audio response
- Add error logging in widget polling loop

This prevents tool calls from running forever when queues are abandoned.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
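
The stale-queue check described above might look roughly like this. It is a sketch under assumptions, not the code in server.py: `QueueState`, `check_stale`, and `poll_response` are hypothetical names, and the response shape is invented for illustration. Only `QUEUE_TIMEOUT_SECONDS`, `last_activity`, and the "error" status come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

QUEUE_TIMEOUT_SECONDS = 30.0


@dataclass
class QueueState:
    """Hypothetical per-queue bookkeeping."""
    last_activity: float           # timestamp of the last text or audio activity
    status: str = "active"
    error: Optional[str] = None


def check_stale(state, now):
    """Mark an active queue as errored once it has been idle for too long."""
    idle = now - state.last_activity
    if state.status == "active" and idle > QUEUE_TIMEOUT_SECONDS:
        state.status = "error"
        state.error = f"Queue timed out after {QUEUE_TIMEOUT_SECONDS:.0f}s of inactivity"
        return True
    return False


def poll_response(state, now):
    """Build the poll result; an 'error' status tells the widget to stop polling."""
    check_stale(state, now)
    response = {"status": state.status, "chunks": []}
    if state.error is not None:
        response["error"] = state.error
    return response
```

Passing `now` explicitly keeps the check easy to test; a real server would presumably read a monotonic clock instead.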

Add console logging to debug why end_tts_queue might not be called:

Widget-side:
- Log queueId in ontoolinputpartial, ontoolinput, ontoolresult
- Log when new session is detected (queue reset)
- Log when initTTSQueue fails
- Log when end_tts_queue is called/not called

Server-side:
- Log when end_tts_queue is called
- Log warnings for unknown queues
- Log info for already-ended queues

This will help identify if the issue is:
- User interruption (ontoolresult never called)
- Session reset (queueId becomes null)
- Host bug (callback dropped)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>

Add 150ms delay before first poll to give TTS model time to generate the first audio chunk. This reduces unnecessary polling and server load.

Current backoff after that:
- 30ms if chunks received (keep streaming fast)
- 80ms if no chunks (wait longer for generation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>

Previous approach: Always wait 150ms before first poll
Problem: Even when chunks are ready, we wait unnecessarily

New approach: Adaptive polling
- Start polling immediately (no initial delay)
- 20ms backoff when receiving chunks (fast streaming)
- Exponential backoff when no chunks: 50ms → 100ms → 150ms max
- Reset backoff when chunks start flowing

This reduces latency when TTS is fast while being polite when it's slow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
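
The adaptive schedule above can be expressed as a small state machine. This Python sketch only models the delay calculation; the widget itself is JavaScript, and `PollBackoff` and its constants are hypothetical names. Doubling from 50ms with a 150ms cap is an assumption that reproduces the 50 → 100 → 150 sequence listed in the commit message.

```python
FAST_DELAY_MS = 20      # delay after a poll that returned chunks
BACKOFF_BASE_MS = 50    # first delay after an empty poll
BACKOFF_MAX_MS = 150    # cap for consecutive empty polls
MAX_EXPONENT = 8        # keeps the doubling from growing without bound


class PollBackoff:
    """Hypothetical helper that picks the delay before the next poll."""

    def __init__(self):
        self.empty_polls = 0  # consecutive polls that returned no chunks

    def next_delay_ms(self, got_chunks):
        if got_chunks:
            # Chunks are flowing: reset the backoff and poll quickly.
            self.empty_polls = 0
            return FAST_DELAY_MS
        delay = min(BACKOFF_BASE_MS * 2 ** self.empty_polls, BACKOFF_MAX_MS)
        self.empty_polls = min(self.empty_polls + 1, MAX_EXPONENT)
        return delay
```

The first poll happens immediately; `next_delay_ms` is then consulted after each poll result.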

- Expand License section in README with component attribution table
- Add voice collection license matrix with commercial use indicators
- Add warning about non-commercial expresso/ears voice collections
- Clarify NON-COMMERCIAL restriction in list_voices() output
- Use consistent CC-BY formatting (hyphenated)

pkg-pr-new · 2026-01-19T17:28:04Z

@modelcontextprotocol/ext-apps

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/ext-apps@304

@modelcontextprotocol/server-basic-react

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-basic-react@304

@modelcontextprotocol/server-basic-vanillajs

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-basic-vanillajs@304

@modelcontextprotocol/server-budget-allocator

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-budget-allocator@304

@modelcontextprotocol/server-cohort-heatmap

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-cohort-heatmap@304

@modelcontextprotocol/server-customer-segmentation

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-customer-segmentation@304

@modelcontextprotocol/server-map

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-map@304

@modelcontextprotocol/server-pdf

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-pdf@304

@modelcontextprotocol/server-scenario-modeler

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-scenario-modeler@304

@modelcontextprotocol/server-shadertoy

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-shadertoy@304

@modelcontextprotocol/server-sheet-music

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-sheet-music@304

@modelcontextprotocol/server-system-monitor

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-system-monitor@304

@modelcontextprotocol/server-threejs

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-threejs@304

@modelcontextprotocol/server-transcript

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-transcript@304

@modelcontextprotocol/server-video-resource

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-video-resource@304

@modelcontextprotocol/server-wiki-explorer

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-wiki-explorer@304

commit: 335ba34

- Replace emoji play/pause buttons with neutral white SVG icons
- Add (i) info button in bottom-right with attribution popup
- Use app.openLink() API for external links in popup
- Add dynamic padding to fit popup when shown
- Make create_tts_queue async to fix AnyIO event loop error
- Add better error logging for TTS queue initialization
- Update README with openLink feature documentation
- Remove unnecessary HuggingFace login prerequisites

ochafik and others added 30 commits January 16, 2026 11:44

Add say-server: streaming TTS MCP App with karaoke UI

ac5444a

- Streaming TTS MCP App with real-time word highlighting
- React widget with pause/resume controls
- Debounced model context updates
- Self-contained server.py for standalone uv run

Co-authored-by: Claude <[email protected]>

Update server.py

31609f4

fix(say-server): show play button when paused, restart when finished

50d8091

fix(qr-server): use meta parameter instead of manual handler patching

fe668e7

Simplified CSP metadata handling to match say-server's approach:
- Use meta parameter on @mcp.resource decorator directly
- Removed custom _read_resource_with_meta handler and handler replacement

fix(say-server): only show pause button on hover

0d16e28

- Play/restart buttons visible normally
- Pause button hidden until hover (doesn't block karaoke text)

docs(say-server): add TODO items for future improvements

6cbe30d

fix(say-server): simplify to fixed play button at top right

8bb83d1

- Remove floating cursor-following button (was too complex)
- Fixed play button at top-right corner
- Low opacity when playing, full on hover
- Keep theming and keyboard shortcuts (Space/Enter)
- Add say-server to E2E tests

test(e2e): add say-server to tests but skip in CI

7f915b4

Say-server requires Pocket TTS model (~500MB) and GPU support, which isn't available in the Docker CI environment. Skip it for now.

test(e2e): add say-server and qr-server to screenshot generation

8bb7a98

- Remove say-server from SKIP_SERVERS (widget works without TTS model)
- Add say-server masks for play buttons
- Add both say-server and qr-server to grid screenshot generator

fix(say-server): update FastMCP API for new SDK version

b3f9059

Python SDK refactored transport params from constructor to run():
- Remove port, stateless_http from FastMCP() constructor
- Pass stateless_http=True to streamable_http_app() instead

update package.json of say

2db2946

style(say-server): apply prettier formatting

8be6749

- Add blank lines in README for better readability
- Add trailing newline to package.json

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>

ochafik and others added 8 commits January 17, 2026 02:22

feat(say-server): add speaker icon to server info

11e3d9c

Uses SVG data URI for a speaker with sound waves icon.

Update last activity to prevent timeout during active polling

b339a00

Merge branch 'ochafik/say-server' of github.com:modelcontextprotocol/…

e06afdc

…ext-apps into ochafik/say-server

ochafik requested a review from antonpk1 January 19, 2026 20:29

docs(say-server): document multi-widget speak lock mechanism

ffd6035

antonpk1 previously approved these changes Jan 19, 2026

ochafik changed the title ~~feat(examples): add say-server - streaming TTS with karaoke highlighting~~ feat(examples): add say-server - streaming Pocket TTS with karaoke highlighting Jan 19, 2026

Merge branch 'main' into ochafik/say-server

d0b62a1

ochafik dismissed antonpk1’s stale review via d0b62a1 January 20, 2026 00:06

prettier:fix

335ba34

ochafik merged commit e6983b3 into main Jan 20, 2026
18 of 19 checks passed

ochafik mentioned this pull request Jan 20, 2026

fix(say-server): clean up README #307

Closed
