
Streaming Agent Responses — Implementation Recommendation

Status: Deferred (implement when pipeline is stable)
Effort: 4-6 hours
Priority: Low (UX polish, not functional)


Summary

Stream LLM responses to the browser token by token (as Claude.ai and ChatGPT do) instead of waiting for the full response. The approach is hybrid: display text streams in real time, while structured data (extracted_data, workflow_updates) is sent once at completion.

Approach

Backend:
  • Use llm.astream() instead of llm.ainvoke() in llm_conversation.py
  • Chat endpoint returns StreamingResponse (SSE chunks)
  • Each chunk: data: {"token": "word "}
  • Final chunk: data: {"done": true, "workflow_state": {...}, "extracted_data": {...}}
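
The backend event format can be sketched as an async generator. This is a minimal illustration, not the project's code: fake_llm_stream() is a hypothetical stand-in for llm.astream(), and the empty workflow_state and extracted_data values are placeholders for whatever the real pipeline produces at completion.

```python
import asyncio
import json
from typing import AsyncIterator


def sse_event(payload: dict) -> str:
    """Format one Server-Sent Events message: a `data:` line plus a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


async def fake_llm_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for llm.astream(); yields text pieces one at a time."""
    for piece in ["Hello ", "there", "!"]:
        await asyncio.sleep(0)  # hand control back to the event loop between pieces
        yield piece


async def chat_event_stream(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Emit one token event per piece, then a single final `done` event."""
    async for piece in tokens:
        yield sse_event({"token": piece})
    # Structured data is sent once, after the text has finished streaming.
    yield sse_event({
        "done": True,
        "workflow_state": {},  # placeholder: filled in by the orchestrator
        "extracted_data": {},  # placeholder: parsed from the full response
    })


async def collect(stream: AsyncIterator[str]) -> list[str]:
    """Drain a stream into a list (handy for tests)."""
    return [event async for event in stream]
```

In a FastAPI route, a generator like this would typically be returned as StreamingResponse(chat_event_stream(...), media_type="text/event-stream").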

Frontend:
  • fetch() with response.body.getReader() reads chunks
  • Append tokens to the assistant message in state
  • Blinking cursor at the end of the growing message
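
One detail the chunk reader has to handle is that network chunks do not line up with SSE events: a single read may contain half an event or several events. The sketch below shows the buffering logic in Python, for consistency with the backend; the TypeScript reader would mirror it, using TextDecoder with the stream option in place of the incremental decoder. The parse_sse and render names are illustrative, not existing project functions.

```python
import codecs
import json
from typing import Iterable, Iterator, Optional, Tuple


def parse_sse(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON payloads from raw byte chunks of an SSE stream."""
    # Incremental decoding copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # Events end with a blank line; whatever follows the last one is partial.
        while "\n\n" in buffer:
            raw_event, buffer = buffer.split("\n\n", 1)
            for line in raw_event.split("\n"):
                if line.startswith("data: "):
                    yield json.loads(line[len("data: "):])


def render(chunks: Iterable[bytes]) -> Tuple[str, Optional[dict]]:
    """Accumulate token text and capture the final `done` payload."""
    message = ""
    final = None
    for event in parse_sse(chunks):
        if event.get("done"):
            final = event
        else:
            message += event["token"]  # the UI would update state here
    return message, final
```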

Challenges

| Challenge | Solution |
|---|---|
| JSON structured output | Split: stream message text, parse full JSON at completion for data extraction |
| Output validation | Run guardrail check on completed text before sending done event |
| Rich cards (matches, checklists) | Don't stream; send as complete done payload |
| Orchestrator workflow updates | Process after completion, same as current |
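
The completion step implied by the first two rows might look like the sketch below. It rests on assumptions: the model's full output is taken to be a JSON object with message and extracted_data fields, the guardrail is passed in as a hypothetical passes_guardrail callable, and the error field is an invented convention for signalling failure to the frontend. Workflow state is left out because the orchestrator adds it afterwards.

```python
import json
from typing import Callable


def build_done_event(raw_output: str, passes_guardrail: Callable[[str], bool]) -> dict:
    """Parse the completed model output and build the final `done` payload."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # The full response was not valid JSON, so nothing can be extracted.
        return {"done": True, "error": "invalid_json"}
    if not isinstance(parsed, dict):
        return {"done": True, "error": "invalid_json"}

    # Validate the completed text before any structured data is released.
    message = parsed.get("message", "")  # assumed field name
    if not passes_guardrail(message):
        return {"done": True, "error": "blocked_by_guardrail"}

    return {"done": True, "extracted_data": parsed.get("extracted_data", {})}
```

Because the guardrail runs only on the completed text, tokens have already reached the browser by the time it fires, so the frontend needs a way to replace or retract the streamed message when the done event carries an error.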

Files to Change

  • app/agents/llm_conversation.py: astream() + yield tokens
  • app/routers/cases.py: StreamingResponse for chat endpoint
  • src/pages/ConversationApp.tsx: chunk reader + progressive render
  • src/services/caseApi.ts: streaming fetch instead of JSON

When to Implement

After: the current attachment pipeline is stable, guardrails are tuned, and demo feedback is incorporated.
Before: a public launch or investor demo where UX polish matters.