Streaming Agent Responses: Implementation Recommendation
- Status: Deferred (implement when the pipeline is stable)
- Effort: 4-6 hours
- Priority: Low (UX polish, not functional)
Summary
Stream LLM responses token-by-token to the browser (like Claude.ai/ChatGPT) instead of waiting for the full response. The approach is hybrid: display text streams in real time, while structured data (extracted_data, workflow_updates) is sent once at completion.
Approach
Backend:
- llm.astream() instead of llm.ainvoke() in llm_conversation.py
- Chat endpoint returns StreamingResponse (SSE chunks)
- Each chunk: data: {"token": "word "}
- Final chunk: data: {"done": true, "workflow_state": {...}, "extracted_data": {...}}
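The backend flow above can be sketched roughly as follows. This is a minimal, framework-free illustration rather than the project's implementation: fake_llm_stream is a hypothetical stand-in for iterating llm.astream() (its chunk type is not specified here and is assumed to be plain text), and finalize is a hypothetical callback that produces the structured fields once the full text is known.

```python
import asyncio
import json
from typing import Any, AsyncIterator, Callable, Dict, List


def sse_event(payload: Dict[str, Any]) -> str:
    """Format a payload as one SSE frame; the blank line ends the frame."""
    return f"data: {json.dumps(payload)}\n\n"


async def fake_llm_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for llm.astream(); yields pieces of text."""
    for piece in ["Hello ", "there", "!"]:
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield piece


async def stream_chat(
    tokens: AsyncIterator[str],
    finalize: Callable[[str], Dict[str, Any]],
) -> AsyncIterator[str]:
    """Emit one frame per token, then a single final frame with structured data."""
    parts: List[str] = []
    async for token in tokens:
        parts.append(token)
        yield sse_event({"token": token})
    # Structured fields are computed only after the full text is available.
    structured = finalize("".join(parts))
    yield sse_event({"done": True, **structured})


async def collect(frames: AsyncIterator[str]) -> List[str]:
    """Helper for trying the generator outside a web framework."""
    return [frame async for frame in frames]
```

Assuming the chat endpoint is built on FastAPI or Starlette, a generator like stream_chat(...) could be passed to StreamingResponse with media_type="text/event-stream".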
Frontend:
- fetch() with response.body.getReader() reads chunks
- Append tokens to assistant message in state
- Blinking cursor at end of growing message
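One detail the chunk reader has to handle is that network chunks do not line up with event boundaries, so a frame may arrive split across several reads. The sketch below shows that buffering logic in Python, for consistency with the backend example; a TypeScript reader built on response.body.getReader() would presumably follow the same steps. The function names parse_sse_chunks and render are hypothetical.

```python
import codecs
import json
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def parse_sse_chunks(chunks: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield JSON payloads from 'data:' frames split across arbitrary chunks."""
    # An incremental decoder copes with multi-byte characters cut mid-chunk.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # A frame is complete only once its terminating blank line has arrived.
        while "\n\n" in buffer:
            frame, buffer = buffer.split("\n\n", 1)
            for line in frame.split("\n"):
                if line.startswith("data:"):
                    yield json.loads(line[len("data:"):].strip())


def render(chunks: Iterable[bytes]) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Fold events into the growing message text plus the final done payload."""
    text = ""
    done: Optional[Dict[str, Any]] = None
    for event in parse_sse_chunks(chunks):
        if "token" in event:
            text += event["token"]  # append to the assistant message
        elif event.get("done"):
            done = event  # structured data arrives once, at the end
    return text, done
```

In the real frontend, the append step would update the assistant message in state on every token instead of accumulating a local string.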
Challenges
| Challenge | Solution |
|---|---|
| JSON structured output | Split: stream message text, parse full JSON at completion for data extraction |
| Output validation | Run guardrail check on completed text before sending done event |
| Rich cards (matches, checklists) | Don't stream — send as complete done payload |
| Orchestrator workflow updates | Process after completion, same as current |
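The completion step from the table (parse the full JSON, run the guardrail on the finished text, then build the done event) might look something like the sketch below. Several things here are assumptions for illustration only: that the model's complete output is a JSON object with message and extracted_data keys, that the guardrail is a callable returning a boolean, and that a failed check is reported through a hypothetical guardrail_passed field.

```python
import json
from typing import Any, Callable, Dict


def build_done_event(
    raw_output: str,
    workflow_state: Dict[str, Any],
    guardrail_ok: Callable[[str], bool],
) -> Dict[str, Any]:
    """Build the final event from the complete model output."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        # Not a JSON object: treat the whole output as display text.
        parsed = {"message": raw_output}

    message = str(parsed.get("message", ""))
    passed = guardrail_ok(message)
    return {
        "done": True,
        "workflow_state": workflow_state,
        # Withhold extracted data when the completed text fails the check.
        "extracted_data": parsed.get("extracted_data", {}) if passed else {},
        "guardrail_passed": passed,  # hypothetical field, not in the original payload
    }
```

Because the guardrail runs only on the completed text, tokens already shown to the user cannot be unsent; how the UI should react to a failed check is left open here.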
Files to Change
- app/agents/llm_conversation.py: astream() + yield tokens
- app/routers/cases.py: StreamingResponse for chat endpoint
- src/pages/ConversationApp.tsx: chunk reader + progressive render
- src/services/caseApi.ts: streaming fetch instead of JSON
When to Implement
- After: the current attachment pipeline is stable, guardrails are tuned, and demo feedback is incorporated.
- Before: a public launch or investor demo where UX polish matters.