Lunover Engineering Notes

What We Learned Building Our Own Support Bot Widget (Chat With Your Content)

A technical teardown of what actually matters when building a support chat widget: retrieval failure modes, document-as-files navigation, caching, access control, and guardrails for reliable answers.

June 18, 2025 · Lunover


We built our own support bot widget: a chat bubble you embed on a website so visitors can ask questions and get answers from your content (docs, knowledge base, service pages, policies). If you’ve ever shipped one of these, you know the pitch is easy and the reality is messy. The problems you run into are rarely “model problems”. They’re engineering problems:
  • retrieval breaks in predictable ways
  • latency kills trust before the first answer lands
  • access control is easy to get wrong
  • prompt injection becomes “content poisoning”
  • answers must be auditable or you end up with support risk
This post is a detailed teardown: architecture, data model, retrieval strategy, tool design, caching, RBAC, UI patterns, and the guardrails that made our widget reliable.

What we were building (requirements and constraints)

We set a few non-negotiable requirements upfront:
  • Fast first token: users should see a response quickly, even if the assistant needs to “search” in the background.
  • Grounded answers: answers must be supported by content we actually ship.
  • Read-only by default: the assistant must not mutate content.
  • Multi-locale ready: our site is localized; the assistant should not mix languages casually.
  • Safe escalation: when it can’t answer, it should route the user to contact/support without hallucinating.
From those, we derived a simple success definition:
The assistant should behave like a careful engineer reading our website, not like a creative writer improvising.

A working mental model: “support bot = search + navigation + synthesis”

A good support assistant does three things in sequence:
  1. Search: identify candidate pages/sections.
  2. Navigate: open pages, scan headings, follow links, refine the search.
  3. Synthesize: produce a short, correct answer with pointers to where it came from.
Most implementations only do (1) and (3). They skip (2), and that’s where accuracy dies.
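The three-step loop can be sketched as plain control flow over an in-memory corpus. This is a minimal sketch, not our production code; the `corpus` data and function names (`search`, `navigate`, `synthesize`) are illustrative.

```typescript
// A toy corpus: each "page" has a path, text, and internal links.
type Page = { path: string; text: string; links: string[] };

const corpus: Page[] = [
  { path: "/services/seo", text: "We audit canonical tags and sitemaps.", links: ["/legal/privacy"] },
  { path: "/legal/privacy", text: "We store analytics data for 30 days.", links: [] },
];

// 1. Search: rank candidate pages by naive term overlap.
function search(query: string): string[] {
  const terms = query.toLowerCase().split(/\s+/);
  return corpus
    .filter((p) => terms.some((t) => p.text.toLowerCase().includes(t)))
    .map((p) => p.path);
}

// 2. Navigate: open candidate pages and follow internal links a few hops.
function navigate(paths: string[], hops = 1): Page[] {
  const seen = new Map<string, Page>();
  let frontier = paths;
  for (let i = 0; i <= hops; i++) {
    const next: string[] = [];
    for (const path of frontier) {
      const page = corpus.find((p) => p.path === path);
      if (page && !seen.has(path)) {
        seen.set(path, page);
        next.push(...page.links);
      }
    }
    frontier = next;
  }
  return [...seen.values()];
}

// 3. Synthesize: answer only from evidence that was actually read.
function synthesize(evidence: Page[]): string {
  if (evidence.length === 0) return "Could not verify; please contact support.";
  return evidence.map((p) => `From ${p.path}: ${p.text}`).join("\n");
}

const answer = synthesize(navigate(search("canonical")));
```

The point of the middle step is that evidence grows by following links and headings, not by hoping the first retrieval was complete.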

Lesson 1: Chunk-only RAG fails when the answer spans pages

Top-K chunk retrieval breaks in consistent ways:
  • the answer is split across two pages
  • the “exact syntax” is in a code block that didn’t rank
  • the right answer is on a page with weak lexical overlap
  • the user’s question is underspecified, so the embedding match is “close enough”
When a model sees only a handful of chunks, it can sound confident while being incomplete. The user experience looks like:
  • correct tone
  • wrong details
  • no way to verify
Our takeaway: retrieval isn’t enough. You need exploration.

Lesson 2: Give the assistant deterministic primitives (docs-as-files)

Humans don’t answer docs questions by reading five random paragraphs. We:
  • find the page
  • open it
  • scan headings
  • search for exact tokens
  • follow internal links
  • repeat until we can prove the answer
So we shaped the assistant’s tool interface like a tiny filesystem over our content:
  • each page is a “file”
  • directories represent sections (or URL path segments)
  • the assistant can ls, find, cat, and grep
This is not a gimmick. It gives the model primitives that are:
  • auditable (you can log tool calls)
  • deterministic (same query yields the same reads)
  • scalable (it can explore without you hardcoding flows)
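A minimal sketch of that tool surface, virtualized over in-memory content. The class name and data shapes are illustrative; the important property is what is absent: there is no write, fetch, or exec.

```typescript
// Read-only "filesystem" tools over indexed content. Shapes are illustrative.
class ContentTools {
  constructor(
    private dirs: Record<string, string[]>,   // directory -> child entries
    private pages: Record<string, string>,    // path -> full page text
  ) {}

  // ls: list entries under a directory, resolved from the path tree.
  ls(dir: string): string[] {
    return this.dirs[dir] ?? [];
  }

  // find: match page paths by substring, like `find -name`.
  find(name: string): string[] {
    return Object.keys(this.pages).filter((p) => p.includes(name));
  }

  // cat: return the full page text for a path.
  cat(path: string): string {
    const page = this.pages[path];
    if (page === undefined) throw new Error(`cat: ${path}: No such page`);
    return page;
  }

  // grep: case-insensitive line search across all readable pages.
  grep(pattern: string): { path: string; line: string }[] {
    const re = new RegExp(pattern, "i");
    const hits: { path: string; line: string }[] = [];
    for (const [path, text] of Object.entries(this.pages)) {
      for (const line of text.split("\n")) {
        if (re.test(line)) hits.push({ path, line });
      }
    }
    return hits;
  }
  // Deliberately absent: write(), fetch(), exec() -- the surface is read-only.
}
```

Because every call is a pure lookup, each one can be logged and replayed, which is what makes the surface auditable.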

Architecture (the simplest version that works)

At a high level, we ended up with three layers:
Browser widget
  -> /api/support-chat (stream)
      -> Retrieval layer (lexical + vector)
      -> Tool layer ("filesystem" over content)
      -> Answer synthesis (grounded prompt + citations)
      -> Observability (logs + traces + eval hooks)
The key point: we treat “reading our content” as a tool, not as a side effect of retrieval.

Data model: pages, chunks, and a path tree

We kept the content representation intentionally boring:
  • page: canonical URL/slug + title + locale + visibility + lastmod + headings
  • chunk: { page, chunk_index, text, embedding, tokens, hash }
  • path_tree: { "services/seo": { isPublic: true, groups: [] }, ... }
Why a path tree?
  • the assistant needs a map of “what exists”
  • access control becomes structural (prune paths before sessions begin)
  • ls/find becomes fast and cacheable
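The same model written as types. These interfaces are a sketch of the records described above, not a schema we ship; field names mirror the bullet list.

```typescript
// Page-level record: used for navigation, locale filtering, and citations.
interface PageRecord {
  slug: string;                                    // canonical URL path
  title: string;
  locale: string;
  visibility: "public" | "client-only" | "internal";
  lastmod: string;                                 // ISO date
  headings: string[];                              // outline for scanning/ranking
}

// Chunk-level record: the unit of retrieval.
interface ChunkRecord {
  page: string;          // slug of the owning page
  chunk_index: number;   // order within the page, so pages can be reassembled
  text: string;
  embedding: number[];   // vector for semantic search
  tokens: number;
  hash: string;          // change detection for incremental re-indexing
}

// Path tree: maps URL paths to access metadata, so ls/find and RBAC
// pruning are structural lookups rather than per-query checks.
type PathTree = Record<string, { isPublic: boolean; groups: string[] }>;

const tree: PathTree = {
  "services/seo": { isPublic: true, groups: [] },
  "internal/runbooks": { isPublic: false, groups: ["staff"] },
};
```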

Ingestion pipeline: turn a website into a reliable knowledge surface

This was the biggest time sink. “Index your docs” sounds easy until you see what real content looks like:
  • marketing copy + components
  • MDX and headings that don’t map cleanly to HTML
  • navigation pages that repeat content blocks
  • locale variants that partially diverge
Our ingestion pipeline does:
  1. Discover pages: sitemap + known route list.
  2. Fetch rendered HTML: what users and crawlers actually see.
  3. Extract main content: strip nav, footers, cookie banners, repeated UI.
  4. Normalize: collapse whitespace, remove tracking query strings from links.
  5. Segment:
    • page-level record for navigation and citations
    • chunk-level records for retrieval
  6. Annotate:
    • locale
    • visibility (public, client-only, internal)
    • headings outline
    • outbound internal links
The headline lesson: index the rendered site, not just the source repo, or you’ll miss what the user actually sees and you’ll end up citing content that doesn’t exist in production.
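One concrete piece of step 4 is stripping tracking query strings so identical links dedupe cleanly. A sketch, assuming a typical parameter list; adjust `TRACKING_PARAMS` to whatever your analytics actually appends.

```typescript
// Common tracking parameters (an assumption -- extend for your own stack).
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"];

// Normalize an internal link: resolve against a base, drop tracking params.
function normalizeLink(href: string, base = "https://example.com"): string {
  const url = new URL(href, base);
  for (const p of TRACKING_PARAMS) url.searchParams.delete(p);
  // url.search is the empty string once all params were tracking noise.
  return url.pathname + url.search;
}
```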

Retrieval: combine lexical and vector before you “ask the model”

We don’t trust a single retrieval strategy. We do:
  • lexical search for exact tokens (great for identifiers, acronyms, error codes)
  • vector search for semantic match (great for vague questions)
Then we merge candidates and apply basic sanity checks:
  • reject pages that don’t match the user’s locale (unless explicitly asked)
  • prefer canonical pages over tag pages / duplicate listings
  • prefer pages with strong heading overlap with query terms
Only then do we let the assistant open pages and synthesize.
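One simple way to merge the two candidate lists is reciprocal rank fusion, with the locale check applied during the merge. This is a sketch of the idea, not our exact scoring; the candidate shape and constant `k = 60` are illustrative.

```typescript
interface Candidate { path: string; locale: string }

// Merge lexical and vector rankings with reciprocal rank fusion (RRF):
// each list contributes 1 / (k + rank) to a page's score, so pages that
// rank well in both lists rise to the top.
function mergeCandidates(
  lexical: Candidate[],
  vector: Candidate[],
  userLocale: string,
  k = 60,
): string[] {
  const scores = new Map<string, number>();
  for (const list of [lexical, vector]) {
    list.forEach((c, rank) => {
      if (c.locale !== userLocale) return; // sanity check: reject wrong-locale pages
      scores.set(c.path, (scores.get(c.path) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([path]) => path);
}
```

The canonical-page and heading-overlap preferences would slot in as further score adjustments before the final sort.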

Tool design: what the assistant can do (and what it cannot)

We kept the tool surface small and strict. Allowed:
  • list directories: ls /services
  • search paths: find -name "billing"
  • read a full page: cat /services/seo
  • search within content: grep -ri "canonical" /
Not allowed:
  • writing or editing content
  • fetching arbitrary URLs
  • arbitrary network calls
The safety benefit is huge: it makes “prompt injection” mostly a content-quality issue, not a system compromise issue.

The latency lesson: real filesystems are too slow for interactive chat

If you spin up a sandbox/container per session to provide a real filesystem:
  • cold start becomes visible
  • chat feels broken
  • you’re tempted to add complexity like warm pools
For a support widget, the user is staring at the UI. You want instant tools. So we virtualized the filesystem operations over our index:
  • ls and find resolve from the cached path tree
  • cat reassembles the page from stored chunks (sorted by chunk_index)
  • the results are cached per-session (and partially globally, when safe)
The assistant gets the illusion of a shell over files, but there are no actual files.
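The virtualized `cat` is almost the whole trick: reassemble the page from stored chunks instead of reading a real file. A minimal sketch, using the chunk shape from the data model above.

```typescript
interface Chunk { page: string; chunk_index: number; text: string }

// "cat" a virtual file: gather the page's chunks, sort by chunk_index,
// and join them back into full page text. No container, no filesystem.
function cat(path: string, chunks: Chunk[]): string {
  const own = chunks
    .filter((c) => c.page === path)
    .sort((a, b) => a.chunk_index - b.chunk_index);
  if (own.length === 0) throw new Error(`cat: ${path}: No such file`);
  return own.map((c) => c.text).join("\n\n");
}
```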

Cache what repeats: directory tree, pages, and “grep targets”

Caching “answers” is weak because questions vary. What repeats during real conversations is:
  • listing the same sections
  • opening the same 5-10 important pages
  • grepping for the same tokens
So we cached:
  • the path tree
  • reconstructed full pages
  • recent grep candidates
This mattered more than micro-optimizing embeddings, because it improved follow-up speed and reduced “searching…” loops.
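The cache itself can be boring. A per-session sketch; the key scheme (`"cat:/path"` and so on) is an assumption, and a production version would add TTLs and size bounds.

```typescript
// A minimal read-through cache for the operations that repeat within a
// session: path-tree listings, reconstructed pages, grep candidates.
class SessionCache {
  private store = new Map<string, unknown>();
  hits = 0;
  misses = 0;

  // Return the cached value for `key`, computing and storing it on a miss.
  get<T>(key: string, compute: () => T): T {
    if (this.store.has(key)) {
      this.hits++;
      return this.store.get(key) as T;
    }
    this.misses++;
    const value = compute();
    this.store.set(key, value);
    return value;
  }
}
```

Usage looks like `cache.get(`cat:${path}`, () => reassemblePage(path))`, so the second read of the same page is a map lookup.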

RBAC: access control must be structural, not prompt-based

If some docs are not public (drafts, internal notes, client-only docs), you cannot rely on prompting. We enforced RBAC before the assistant runs a single tool call:
  • build a user-scoped path tree
  • prune everything the user can’t access
  • apply the same filter to every query and page read
If the assistant can’t ls a file, it can’t cat it, and it can’t cite it. That’s the only mental model that holds under pressure.
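Structural pruning can be a single pass over the path tree before the session starts. A sketch, reusing the path-tree shape from the data model; group semantics here are illustrative.

```typescript
type Acl = { isPublic: boolean; groups: string[] };
type PathTree = Record<string, Acl>;

// Build the user-scoped tree: keep a path only if it is public or the
// user belongs to one of its groups. Every ls/find/cat then resolves
// against the scoped tree, so unreadable paths are simply invisible.
function pruneTree(tree: PathTree, userGroups: string[]): PathTree {
  const scoped: PathTree = {};
  for (const [path, acl] of Object.entries(tree)) {
    const allowed = acl.isPublic || acl.groups.some((g) => userGroups.includes(g));
    if (allowed) scoped[path] = acl;
  }
  return scoped;
}
```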

Guardrails: how we stopped confident wrong answers

We added a few rules that dramatically reduced bad answers:

Rule 1: No evidence, no answer

If the assistant can’t find supporting content with tools, it should:
  • say it couldn’t verify
  • show what it checked (pages or sections)
  • offer a next step (contact/support)

Rule 2: Prefer quoting over paraphrasing for critical details

When the answer is sensitive to exact wording (requirements, limitations, legal copy):
  • quote the relevant sentence(s) from the page it read
  • keep the synthesis minimal

Rule 3: Be strict about locale and canonical pages

If the user is on a locale route, the assistant should:
  • prefer that locale’s content
  • avoid mixing languages in a single response
  • fall back to default locale only if the locale page does not exist

The “grep problem”: the killer feature needs a two-phase plan

Naive recursive grep is slow if it reads everything over the network. We used a two-phase approach:
  1. coarse filter using the index (which pages might contain the token)
  2. fine filter in memory over cached page text to extract exact matches + context
This made the assistant feel like it could search like a developer, not guess like a chatbot.
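The two phases can be sketched as one function. The inverted index here is a toy stand-in for whatever your lexical index provides; the point is that phase 1 never touches page text, and phase 2 only scans text that is already cached in memory.

```typescript
interface GrepHit { path: string; lineNo: number; line: string }

function twoPhaseGrep(
  token: string,
  index: Map<string, Set<string>>,   // token -> paths that MAY contain it
  pageText: Map<string, string>,     // cached full page text
): GrepHit[] {
  // Phase 1: coarse filter -- only pages the index says might match.
  const candidates = index.get(token.toLowerCase()) ?? new Set<string>();

  // Phase 2: fine filter -- exact, case-insensitive line scan in memory,
  // producing the line number and context the assistant can quote.
  const hits: GrepHit[] = [];
  for (const path of candidates) {
    const text = pageText.get(path);
    if (!text) continue;
    text.split("\n").forEach((line, i) => {
      if (line.toLowerCase().includes(token.toLowerCase())) {
        hits.push({ path, lineNo: i + 1, line });
      }
    });
  }
  return hits;
}
```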

Widget UX: the UI makes or breaks trust

We shipped multiple UI iterations before the widget felt reliable. What mattered most:
  • streaming tokens with a stable layout (avoid jumpy reflow)
  • clear states:
    • “Searching…”
    • “Reading page…”
    • “Answering…”
  • short, in-line citations:
    • “From: /services/seo”
    • “From: /legal/privacy”
  • fallbacks that don’t feel like failure:
    • “I checked X and Y but couldn’t confirm; here’s how to reach us.”
This pairs well with making the assistant’s “work” visible. Users forgive a bot that says “I couldn’t confirm” far more than one that invents a confident answer.

Observability: log the reads, not just the tokens

If you want to improve quality, you need to know what happened. We logged:
  • tool calls (paths listed/read/searched)
  • which pages were used as evidence
  • the final answer length and latency buckets
  • “could not verify” rates
  • escalation rates (contact clicks)
That let us answer practical questions:
  • which pages are missing important information?
  • which questions never find evidence?
  • which pages cause confusion and need restructuring?
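A sketch of the per-conversation log record and one derived metric. The field names are illustrative; the useful habit is that tool calls and evidence pages are first-class fields, not buried in free-text transcripts.

```typescript
interface ChatLog {
  toolCalls: { tool: string; arg: string }[];  // every ls/find/cat/grep
  evidencePages: string[];                     // pages cited in the answer
  answerTokens: number;
  latencyMs: number;
  verified: boolean;                           // did tools find supporting content?
  escalated: boolean;                          // did the user click contact?
}

// The "could not verify" rate: the share of conversations where no
// evidence was found. Rising per-topic rates point at missing content.
function couldNotVerifyRate(logs: ChatLog[]): number {
  if (logs.length === 0) return 0;
  return logs.filter((l) => !l.verified).length / logs.length;
}
```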

A practical build checklist (what we’d do again)

If you’re building a support widget that chats with your content, we’d start with:
  • index the rendered site and store page-level records
  • combine lexical + vector retrieval
  • add an exploration tool layer (files + grep)
  • prune content for RBAC before sessions start
  • keep tools read-only by default
  • add a “no evidence, no answer” rule
  • make “searching/reading/answering” visible in the UI
  • log tool calls so you can debug failures
If you want help implementing this (widget + indexing + safe architecture), reach out.