Streaming Agent Responses: Implementation Recommendation

Status: Deferred (implement when the pipeline is stable)
Effort: 4-6 hours
Priority: Low (UX polish, not functional)

Summary

Stream LLM responses to the browser token by token (as Claude.ai and ChatGPT do) instead of waiting for the full response. The approach is hybrid: display text streams in real time, while structured data (extracted_data, workflow_updates) is sent once at completion.

Approach

Backend:
- llm.astream() instead of llm.ainvoke() in llm_conversation.py
- Chat endpoint returns StreamingResponse (SSE chunks)
- Each chunk: data: {"token": "word "}
- Final chunk: data: {"done": true, "workflow_state": {...}, "extracted_data": {...}}
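The backend flow above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: fake_token_stream stands in for llm.astream(), and stream_chat, sse_event, and finalize_stub are hypothetical names. In a FastAPI or Starlette app (an assumption based on the StreamingResponse mention), a generator like stream_chat would typically be passed to StreamingResponse with media_type="text/event-stream".

```python
import asyncio
import json
from typing import AsyncIterator, Callable


def sse_event(payload: dict) -> str:
    """Frame one payload as an SSE message: a data line followed by a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


async def fake_token_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for llm.astream(); yields text fragments."""
    for token in ["Hello ", "there", "!"]:
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield token


async def stream_chat(
    tokens: AsyncIterator[str], finalize: Callable[[str], dict]
) -> AsyncIterator[str]:
    """Yield one SSE event per token, then a single final `done` event."""
    parts = []
    async for token in tokens:
        parts.append(token)
        yield sse_event({"token": token})
    # Structured data is computed once, from the completed text.
    yield sse_event({"done": True, **finalize("".join(parts))})


def finalize_stub(full_text: str) -> dict:
    """Hypothetical placeholder for the post-completion extraction step."""
    return {"workflow_state": {}, "extracted_data": {"length": len(full_text)}}


async def collect() -> list:
    return [event async for event in stream_chat(fake_token_stream(), finalize_stub)]


events = asyncio.run(collect())
```

Keeping the framing in one small function means the token events and the final event cannot drift apart in format, which simplifies the reader on the other side.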

Frontend:
- fetch() with response.body.getReader() reads chunks
- Append tokens to the assistant message in state
- Blinking cursor at the end of the growing message
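One detail the reader loop has to handle is that network chunks do not necessarily line up with SSE events: a single read may contain half an event or several. The buffering logic is language-independent, so it is sketched here in Python for illustration; in the browser the same shape would sit around the getReader() loop. SSEBuffer and apply_events are hypothetical names, and the sketch assumes each event is a single `data: ` line carrying JSON, as in the chunk format above.

```python
import json
from typing import Optional, Tuple


class SSEBuffer:
    """Accumulates raw text chunks and emits the payloads of complete events."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> list:
        self._pending += chunk
        # Everything before the last blank line is complete; the tail may be partial.
        *complete, self._pending = self._pending.split("\n\n")
        payloads = []
        for raw in complete:
            for line in raw.splitlines():
                if line.startswith("data: "):
                    payloads.append(json.loads(line[len("data: "):]))
        return payloads


def apply_events(message: str, payloads: list) -> Tuple[str, Optional[dict]]:
    """Append streamed tokens to the message; return the final payload once `done` arrives."""
    final = None
    for payload in payloads:
        if payload.get("done"):
            final = payload
        elif "token" in payload:
            message += payload["token"]
    return message, final


# Chunks deliberately split in the middle of events.
chunks = [
    'data: {"token": "Hel',
    'lo "}\n\ndata: {"token": "there"}\n\nda',
    'ta: {"done": true, "extracted_data": {}}\n\n',
]
buffer = SSEBuffer()
message, final = "", None
for chunk in chunks:
    message, done_payload = apply_events(message, buffer.feed(chunk))
    final = done_payload or final
```

The message string plays the role of the assistant message in state, and the final payload is where rich cards and workflow data would be read from.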

Challenges

Challenge	Solution
JSON structured output	Split: stream `message` text, parse full JSON at completion for data extraction
Output validation	Run guardrail check on completed text before sending `done` event
Rich cards (matches, checklists)	Don't stream — send as complete `done` payload
Orchestrator workflow updates	Process after completion, same as current
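The first row is the least obvious in practice, because the model emits JSON while the user should only see the `message` text. One possible shape for that split is sketched below. It assumes `message` is a top-level string field and that no earlier string value contains the literal `"message": "` pattern. MessageFieldStreamer is a hypothetical name, not an existing helper.

```python
import json
import re

_MESSAGE_START = re.compile(r'"message"\s*:\s*"')


class MessageFieldStreamer:
    """Feeds on raw JSON fragments and returns newly decoded `message` text."""

    def __init__(self) -> None:
        self.raw = ""        # full raw output, kept for the final json.loads
        self._pos = None     # index of the next unread character inside the string
        self._closed = False

    def feed(self, fragment: str) -> str:
        self.raw += fragment
        if self._closed:
            return ""
        if self._pos is None:
            match = _MESSAGE_START.search(self.raw)
            if match is None:
                return ""  # the key has not fully arrived yet
            self._pos = match.end()
        out = []
        i = self._pos
        while i < len(self.raw):
            ch = self.raw[i]
            if ch == '"':  # unescaped quote ends the message string
                self._closed = True
                i += 1
                break
            if ch == "\\":
                width = 6 if self.raw[i + 1 : i + 2] == "u" else 2
                if i + width > len(self.raw):
                    break  # escape sequence is split across fragments; wait for more
                # Decode the escape with the JSON parser itself.
                # Limitation: escaped surrogate pairs are not recombined here.
                out.append(json.loads('"' + self.raw[i : i + width] + '"'))
                i += width
            else:
                out.append(ch)
                i += 1
        self._pos = i
        return "".join(out)


raw_json = json.dumps(
    {"message": 'Line one\nsaid "hi"', "extracted_data": {"case_id": 7}}
)
fragments = [raw_json[i : i + 5] for i in range(0, len(raw_json), 5)]

streamer = MessageFieldStreamer()
shown = "".join(streamer.feed(fragment) for fragment in fragments)
parsed = json.loads(streamer.raw)  # full parse happens once, at completion
```

Because the streamer keeps the complete raw output, the guardrail check and data extraction can both run on the parsed result before the `done` event is sent.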

Files to Change

app/agents/llm_conversation.py — astream() + yield tokens
app/routers/cases.py — StreamingResponse for chat endpoint
src/pages/ConversationApp.tsx — chunk reader + progressive render
src/services/caseApi.ts — streaming fetch instead of JSON

When to Implement

After: the current attachment pipeline is stable, guardrails are tuned, and demo feedback is incorporated.
Before: a public launch or investor demo where UX polish matters.