@ochafik ochafik commented Jan 19, 2026

Summary

Adds a new example MCP App: say-server - a real-time text-to-speech server with karaoke-style text highlighting, powered by Kyutai's Pocket TTS 🐐 🔥.

Screenshot

MCP App Features Demonstrated

  • Single-file executable: Python server with embedded React UI - no build step required
  • Partial tool inputs (ontoolinputpartial): Widget receives streaming text as it's being generated
  • Queue-based streaming: Demonstrates how to stream text out and audio in via a polling tool (adds text to an input queue, retrieves audio chunks from an output queue)
  • Model context updates: Widget updates the LLM with playback progress
  • Native theming: Uses CSS variables for automatic dark/light mode adaptation
  • Fullscreen mode: Toggle fullscreen via requestDisplayMode() API
  • Multi-widget speak lock: Coordinates multiple TTS widgets via localStorage so only one plays at a time (see below)
  • Hidden tools (visibility: ["app"]): Private tools only accessible to the widget, not the model
  • External links (openLink): Attribution popup uses app.openLink() to open external URLs
  • CSP metadata: Resource declares required domains (esm.sh) for in-browser transpilation
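
The queue-based streaming item above can be illustrated with a minimal sketch. It assumes a stand-in `fake_tts` function in place of Pocket TTS, and the names `TTSQueue`, `add_text`, `end`, and `poll_audio` are hypothetical; they are not taken from server.py. The idea is only that text deltas go into one queue, a worker turns them into audio chunks, and a polling call drains whatever is ready.

```python
import queue
import threading
from typing import List, Tuple


def fake_tts(text: str) -> bytes:
    """Stand-in for a real TTS model: returns the text bytes as fake 'audio'."""
    return text.encode("utf-8")


class TTSQueue:
    """Hypothetical text-in / audio-out queue pair driven by a worker thread."""

    def __init__(self) -> None:
        self.text_in = queue.Queue()    # text deltas, terminated by None
        self.audio_out = queue.Queue()  # synthesized audio chunks
        self.done = threading.Event()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self) -> None:
        # Consume text deltas until the None sentinel arrives.
        while (text := self.text_in.get()) is not None:
            self.audio_out.put(fake_tts(text))
        self.done.set()

    def add_text(self, delta: str) -> None:
        self.text_in.put(delta)

    def end(self) -> None:
        self.text_in.put(None)

    def poll_audio(self) -> Tuple[List[bytes], bool]:
        """Drain the audio that is ready; the flag reports whether the stream ended."""
        # Read the flag before draining so no chunk is missed when it is True.
        finished = self.done.is_set()
        chunks = []
        while True:
            try:
                chunks.append(self.audio_out.get_nowait())
            except queue.Empty:
                break
        return chunks, finished
```

A caller would push deltas with `add_text()` as they arrive, call `end()` once the text is complete, and keep calling `poll_audio()` until the returned flag is true.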

Multi-Widget Speak Lock

When multiple TTS widgets exist in the same browser (e.g., multiple chat messages each with their own say widget), they coordinate via localStorage to ensure only one plays at a time:

Step           Description
1. Unique IDs  Each widget receives a UUID via toolResult._meta.widgetUUID
2. Announce    On play, widget writes {uuid, timestamp} to localStorage["mcp-tts-playing"]
3. Poll        Every 200ms, playing widgets check if another took the lock
4. Yield       If another widget started, pause and yield gracefully
5. Clean up    On pause/finish, clear the lock (only if owned)

This "last writer wins" protocol ensures that clicking play on any widget pauses the others within one poll interval (about 200 ms), without requiring cross-iframe postMessage coordination.
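
The five steps can be modelled outside the browser as a rough sketch. Here a plain dict stands in for localStorage, the `Widget` class is hypothetical, and the UUID is generated locally, whereas the real widget receives it via toolResult._meta.widgetUUID. The actual widget code is JavaScript; this only mirrors the protocol.

```python
import time
import uuid

LOCK_KEY = "mcp-tts-playing"


class Widget:
    """Hypothetical widget model; `storage` is a dict standing in for localStorage."""

    def __init__(self, storage: dict) -> None:
        self.storage = storage
        self.uuid = str(uuid.uuid4())
        self.playing = False

    def play(self) -> None:
        # Announce: the last writer owns the lock.
        self.storage[LOCK_KEY] = {"uuid": self.uuid, "timestamp": time.time()}
        self.playing = True

    def poll(self) -> None:
        # Meant to run every ~200 ms while playing: yield if another widget took the lock.
        owner = self.storage.get(LOCK_KEY)
        if self.playing and owner is not None and owner["uuid"] != self.uuid:
            self.playing = False

    def pause(self) -> None:
        self.playing = False
        # Clean up only if this widget still owns the lock.
        owner = self.storage.get(LOCK_KEY)
        if owner is not None and owner["uuid"] == self.uuid:
            del self.storage[LOCK_KEY]
```

Because a yielding widget never deletes a lock it does not own, the newest player's entry survives until that player pauses or finishes.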

Features

  • Streaming TTS: Audio starts playing as text is being generated
  • Karaoke highlighting: Words are highlighted in sync with speech
  • Interactive controls: Click to pause/resume, double-click to restart
  • Low latency: Uses a polling-based queue for minimal delay

Testing

  • E2E tests added in tests/e2e/servers.spec.ts

  • Tested locally via stdio transport

  • Host config:

    {
        "mcpServers": {
            "say": {
                "command": "uv",
                "args": [
                    "run",
                    "--index",
                    "https://pypi.org/simple",
                    "https://raw.githubusercontent.com/modelcontextprotocol/ext-apps/ochafik/say-server/examples/say-server/server.py",
                    "--stdio"
                ]
            }
        }
    }
  • Less trusting host config (note that the Docker container can still scan your host's local ports; security is always relative):

    {
        "mcpServers": {
            "say": {
                "command": "docker",
                "args": [
                    "run",
                    "--quiet",
                    "--rm",
                    "-i",
                    "ghcr.io/astral-sh/uv:debian",
                    "uv",
                    "run",
                    "--index",
                    "https://pypi.org/simple",
                    "https://raw.githubusercontent.com/modelcontextprotocol/ext-apps/ochafik/say-server/examples/say-server/server.py",
                    "--stdio"
                ]
            }
        }
    }

Dependencies

Uses Python with uv or Docker for zero-config execution:

  • mcp - MCP Python SDK
  • pocket-tts - Kyutai's TTS library
  • uvicorn, starlette - HTTP server

Third-party Licenses

All third-party code and models are, hopefully, properly attributed (see LICENSE comments in server.py and in the info bubble):

Info popup

ochafik and others added 30 commits January 16, 2026 11:44
- Streaming TTS MCP App with real-time word highlighting
- React widget with pause/resume controls
- Debounced model context updates
- Self-contained server.py for standalone uv run

Co-authored-by: Claude <[email protected]>
- Pin Zod to v3.25.1 via esm.sh deps param (fixes .custom() error)
- Add external=react,react-dom to prevent duplicate React instances
- Update React from 19.1.0 to 19.2.0
- Import React for Babel's classic JSX transform
- Consolidate imports: only use @modelcontextprotocol/ext-apps/react
- Use Babel from unpkg.com instead of esm.sh
- Simplify resource decorator with meta= parameter
- Update CSP to allow both esm.sh and unpkg.com
- Add fullscreen toggle button (appears on hover when available)
- Remove max-height/scroll in inline mode for cleaner display
- Enable scrolling only in fullscreen mode
- Track displayMode and fullscreenAvailable state
- Handle onHostContextChanged to detect fullscreen support
…e audio streams

- Check initial host context for fullscreen availability (not just updates)
- Add audioOperationInProgressRef guard to prevent concurrent audio operations
- Prevents race conditions from rapid clicks/double-clicks
Simplified CSP metadata handling to match say-server's approach:
- Use meta parameter on @mcp.resource decorator directly
- Removed custom _read_resource_with_meta handler and handler replacement
- Remove onClick from text display (was conflicting with double-click)
- Always show overlay button (with pause icon when playing)
- Lower opacity (0.3) when playing, full on hover
- Button shows: ▶️ (play), ⏸️ (pause), 🔄 (restart)
- Double-click text to restart still works
- Eliminates click/dblclick race condition
- Play/restart buttons visible normally
- Pause button hidden until hover (doesn't block karaoke text)
- Add promise lock to initTTSQueue to prevent concurrent initialization
- Reset queueIdRef and lastTextRef in ontoolresult for next tool call
- Close existing AudioContext when starting new session
- Reset all playback state (chunks, timings, position) on new session
- Fixes: text deltas not sent when new tool call has shorter text
…ef reset

- Don't reset queueIdRef in ontoolresult - audio should keep playing
- Detect new sessions in ontoolinputpartial/ontoolinput by checking if
  new text starts with lastTextRef (continuation) or not (new session)
- Only reset and create new queue when genuinely new session detected
- Fixes: pause/play was creating new streams because queueIdRef was reset
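
The prefix check described in this commit can be sketched as follows. The function name `classify_update` and its return shape are hypothetical, and the widget itself does this in JavaScript with lastTextRef; the sketch only shows the continuation-versus-new-session decision.

```python
from typing import Optional, Tuple


def classify_update(last_text: Optional[str], new_text: str) -> Tuple[str, str]:
    """Return (kind, delta), where kind is "continuation" or "new_session"."""
    if last_text is not None and new_text.startswith(last_text):
        # Same session: only the unseen suffix needs to be queued.
        return "continuation", new_text[len(last_text):]
    # Text diverged (or nothing was seen yet): start over with the full text.
    return "new_session", new_text
```

A shorter text that does not extend the previous one falls into the second branch, so its full content is sent instead of an empty delta.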
Theming:
- Use CSS variables (--font-sans, --color-text-primary/secondary) with fallbacks
- Apply host fonts via applyHostFonts() on connect and context changes
- Apply host style variables via applyHostStyleVariables()
- Apply document theme via applyDocumentTheme()

Floating button:
- Button follows cursor position (below current line being spoken)
- Uses invisible cursor marker element for accurate positioning
- Shows ⏸️ when playing, ▶️ when paused, 🔄 when finished
- Low opacity when playing, full on hover

Keyboard shortcuts:
- Space: toggle play/pause
- Enter: toggle fullscreen (when available)

UI changes:
- Large overlay only shown for initial play (idle state)
- Floating button used during playback for less intrusive controls
- Remove floating cursor-following button (was too complex)
- Fixed play button at top-right corner
- Low opacity when playing, full on hover
- Keep theming and keyboard shortcuts (Space/Enter)
- Add say-server to E2E tests
Say-server requires Pocket TTS model (~500MB) and GPU support,
which isn't available in the Docker CI environment. Skip it for now.
- Remove say-server from SKIP_SERVERS (widget works without TTS model)
- Add say-server masks for play buttons
- Add both say-server and qr-server to grid screenshot generator
… keyboard shortcuts

Reverted to working commit 413beca and carefully reapplied:
- CSS variables for theming (--font-sans, --color-text-primary/secondary)
- Dark mode support via @media (prefers-color-scheme)
- Host theming via applyHostStyleVariables/applyHostFonts/applyDocumentTheme
- Keyboard shortcuts: Space = play/pause, Enter = fullscreen

Preserved working behavior:
- Play overlay visible when not playing, hover to show when playing
- Pause on click works correctly
- Karaoke highlighting works
… version

Only CSS changes from working commit 413beca:
- Added CSS variables for theming foundation
- No JavaScript changes (reverted theming calls and keyboard shortcuts)

This isolates whether the issue is CSS or JS related.
- Add autoPlay parameter (default: true) with Pydantic Field description
  - Description visible in MCP tool schema: 'browsers may block autoplay'
- Add Reset button next to Fullscreen button (bottom right)
- Both buttons share controlBtn base styles, shown on hover
- Widget reads autoPlay from tool input params
Python SDK refactored transport params from constructor to run():
- Remove port, stateless_http from FastMCP() constructor
- Pass stateless_http=True to streamable_http_app() instead
Logging added to:
- togglePlayPause (all branches)
- initTTSQueue (entry and AudioContext creation)
- restartPlayback (entry)
- ontoolinputpartial/ontoolinput (entry and new session detection)

Open browser DevTools console and reproduce the issue - logs will show what's being called.
… corners

- Top left: playPauseTopBtn for quick access
- Bottom right: playPauseBtn alongside reset and fullscreen
- Both show on hover with same styling as other control buttons
- Icons: ▶️ (play), ⏸️ (pause), 🔄 (restart when finished)
- Remove central play overlay, use single toolbar in top-right
- Fix audio duplication on pause/resume (only defer to pendingChunks for initial autoplay)
- Remove double-click to restart (use toolbar restart button)
- Sync displayMode when host changes it externally
- Click on text to play/pause
- Add list_voices tool to show available voices
- Add voice parameter to say tool (predefined names, HuggingFace URLs, or local paths)
- Predefined voices: alba, marius, javert, jean, cosette, eponine, azelma, fantine
- Update README with voice documentation
- Add English-only constraint prominently in first line and note
- List common trigger patterns (say, speak, read aloud, etc.)
- Update text parameter description to mention English
- Keep description concise for quick LLM scanning
When multiple TTS widgets exist, only one plays at a time:
- Widget announces to localStorage when starting playback
- Playing widgets poll localStorage every 200ms
- If another widget starts, previous one pauses automatically
- Lock is cleared on pause/finish/teardown

Uses _meta.widgetUUID in tool result for widget identification.
- Add say-server to demo gallery with 300x300 grid-cell.png
- Fix qr-server grid-cell.png dimensions (was 672x668, now 300x300)
- Add golden snapshot for say-server E2E tests
- Add full screenshot and grid-cell thumbnail for say-server

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add blank lines in README for better readability
- Add trailing newline to package.json

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add onKeyDown handler to main element for Escape key
- Update README with full list of MCP App features:
  - Model context updates
  - Native theming
  - Fullscreen mode
  - Multi-widget speak lock
- Update server.py docstring with feature documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
ochafik and others added 8 commits January 17, 2026 02:22
The TTS queue could hang forever if:
- The LLM stops generating mid-stream (no end signal)
- User cancels before tool completes
- Network interruption prevents end_tts_queue call

Changes:
- Add last_activity timestamp to track queue activity
- Add 30-second timeout (QUEUE_TIMEOUT_SECONDS)
- Poll with 5-second timeout, checking for stale queues
- Mark stale queues as "error" so widget stops polling
- Return error message in poll_tts_audio response
- Add error logging in widget polling loop

This prevents tool calls from running forever when queues are abandoned.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
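
A minimal sketch of the stale-queue check from this commit, under stated assumptions: `QueueState`, `check_stale`, and the response fields are hypothetical names, and the clock is passed in explicitly so the logic can be exercised without waiting. Only QUEUE_TIMEOUT_SECONDS, last_activity, and the "error" status come from the commit message.

```python
QUEUE_TIMEOUT_SECONDS = 30.0


class QueueState:
    """Hypothetical per-queue bookkeeping; field names are illustrative."""

    def __init__(self, now: float) -> None:
        self.last_activity = now
        self.status = "active"

    def touch(self, now: float) -> None:
        # Call whenever text is added or audio is produced.
        self.last_activity = now


def check_stale(state: QueueState, now: float) -> dict:
    """Mark the queue as errored once it has been idle past the timeout."""
    idle = now - state.last_activity
    if state.status == "active" and idle > QUEUE_TIMEOUT_SECONDS:
        state.status = "error"
    if state.status == "error":
        # An error response tells the widget to stop polling.
        return {"done": True, "error": "queue timed out"}
    return {"done": False}
```

In a real server the `now` argument would likely come from a monotonic clock such as `time.monotonic()`.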
Add console logging to debug why end_tts_queue might not be called:

Widget-side:
- Log queueId in ontoolinputpartial, ontoolinput, ontoolresult
- Log when new session is detected (queue reset)
- Log when initTTSQueue fails
- Log when end_tts_queue is called/not called

Server-side:
- Log when end_tts_queue is called
- Log warnings for unknown queues
- Log info for already-ended queues

This will help identify if the issue is:
- User interruption (ontoolresult never called)
- Session reset (queueId becomes null)
- Host bug (callback dropped)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add 150ms delay before first poll to give TTS model time to generate
the first audio chunk. This reduces unnecessary polling and server load.

Current backoff after that:
- 30ms if chunks received (keep streaming fast)
- 80ms if no chunks (wait longer for generation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Previous approach: Always wait 150ms before first poll
Problem: Even when chunks are ready, we wait unnecessarily

New approach: Adaptive polling
- Start polling immediately (no initial delay)
- 20ms backoff when receiving chunks (fast streaming)
- Exponential backoff when no chunks: 50ms → 100ms → 150ms max
- Reset backoff when chunks start flowing

This reduces latency when TTS is fast while being polite when it's slow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
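
The adaptive schedule above can be sketched as a small pure function. This is one interpretation, assuming the idle delay doubles and is capped at 150 ms, which reproduces the stated 50ms → 100ms → 150ms sequence; the name `next_delay` and the constants are hypothetical, and the widget implements its polling in JavaScript.

```python
from typing import Tuple

FAST_DELAY_MS = 20   # delay while chunks are flowing
IDLE_START_MS = 50   # first delay after an empty poll
IDLE_MAX_MS = 150    # cap for the idle backoff


def next_delay(got_chunks: bool, idle_ms: int) -> Tuple[int, int]:
    """Return (delay_ms, next_idle_ms) for one polling step."""
    if got_chunks:
        # Chunks are flowing: poll quickly and reset the idle backoff.
        return FAST_DELAY_MS, IDLE_START_MS
    # Nothing ready: wait idle_ms now, then double it up to the cap.
    return idle_ms, min(idle_ms * 2, IDLE_MAX_MS)
```

Starting from `IDLE_START_MS`, four empty polls in a row wait 50, 100, 150, and 150 ms; the first poll that returns chunks drops the delay back to 20 ms.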
Uses SVG data URI for a speaker with sound waves icon.
- Expand License section in README with component attribution table
- Add voice collection license matrix with commercial use indicators
- Add warning about non-commercial expresso/ears voice collections
- Clarify NON-COMMERCIAL restriction in list_voices() output
- Use consistent CC-BY formatting (hyphenated)
pkg-pr-new bot commented Jan 19, 2026

@modelcontextprotocol/ext-apps

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/ext-apps@304

@modelcontextprotocol/server-basic-react

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-basic-react@304

@modelcontextprotocol/server-basic-vanillajs

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-basic-vanillajs@304

@modelcontextprotocol/server-budget-allocator

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-budget-allocator@304

@modelcontextprotocol/server-cohort-heatmap

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-cohort-heatmap@304

@modelcontextprotocol/server-customer-segmentation

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-customer-segmentation@304

@modelcontextprotocol/server-map

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-map@304

@modelcontextprotocol/server-pdf

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-pdf@304

@modelcontextprotocol/server-scenario-modeler

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-scenario-modeler@304

@modelcontextprotocol/server-shadertoy

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-shadertoy@304

@modelcontextprotocol/server-sheet-music

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-sheet-music@304

@modelcontextprotocol/server-system-monitor

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-system-monitor@304

@modelcontextprotocol/server-threejs

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-threejs@304

@modelcontextprotocol/server-transcript

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-transcript@304

@modelcontextprotocol/server-video-resource

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-video-resource@304

@modelcontextprotocol/server-wiki-explorer

npm i https://pkg.pr.new/modelcontextprotocol/ext-apps/@modelcontextprotocol/server-wiki-explorer@304

commit: 335ba34

- Replace emoji play/pause buttons with neutral white SVG icons
- Add (i) info button in bottom-right with attribution popup
- Use app.openLink() API for external links in popup
- Add dynamic padding to fit popup when shown
- Make create_tts_queue async to fix AnyIO event loop error
- Add better error logging for TTS queue initialization
- Update README with openLink feature documentation
- Remove unnecessary HuggingFace login prerequisites
@ochafik ochafik requested a review from antonpk1 January 19, 2026 20:29
antonpk1 previously approved these changes Jan 19, 2026
@ochafik ochafik changed the title feat(examples): add say-server - streaming TTS with karaoke highlighting feat(examples): add say-server - streaming Pocket TTS with karaoke highlighting Jan 19, 2026
@ochafik ochafik merged commit e6983b3 into main Jan 20, 2026
18 of 19 checks passed