Lunover Engineering Notes

What We Learned Building Our Own Support Bot Widget (Chat With Your Content)

A technical teardown of what actually matters when building a support chat widget: retrieval failure modes, document-as-files navigation, caching, access control, and guardrails for reliable answers.

June 18, 2025 · Lunover


We built our own support bot widget: a chat bubble you embed on a website so visitors can ask questions and get answers from your content (docs, knowledge base, service pages, policies). If you’ve ever shipped one of these, you know the pitch is easy and the reality is messy. The problems you run into are rarely “model problems”. They’re engineering problems:
  • retrieval breaks in predictable ways
  • latency kills trust before the first answer lands
  • access control is easy to get wrong
  • prompt injection becomes “content poisoning”
  • answers must be auditable or you end up with support risk
This post is a detailed teardown: architecture, data model, retrieval strategy, tool design, caching, RBAC, UI patterns, and the guardrails that made our widget reliable.

What we were building (requirements and constraints)

We set a few non-negotiable requirements upfront:
  • Fast first token: users should see a response quickly, even if the assistant needs to “search” in the background.
  • Grounded answers: answers must be supported by content we actually ship.
  • Read-only by default: the assistant must not mutate content.
  • Multi-locale ready: our site is localized; the assistant should not mix languages casually.
  • Safe escalation: when it can’t answer, it should route the user to contact/support without hallucinating.
From those, we derived a simple success definition:
The assistant should behave like a careful engineer reading our website, not like a creative writer improvising.

A working mental model: “support bot = search + navigation + synthesis”

A good support assistant does three things in sequence:
  1. Search: identify candidate pages/sections.
  2. Navigate: open pages, scan headings, follow links, refine the search.
  3. Synthesize: produce a short, correct answer with pointers to where it came from.
Most implementations only do (1) and (3). They skip (2), and that’s where accuracy dies.
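The three-step loop can be sketched as plain control flow over an in-memory corpus. This is a minimal sketch, not our production code; the `corpus` data and function names (`search`, `navigate`, `synthesize`) are illustrative.

```typescript
// A toy corpus: each "page" has a path, text, and internal links.
type Page = { path: string; text: string; links: string[] };

const corpus: Page[] = [
  { path: "/services/seo", text: "We audit canonical tags and sitemaps.", links: ["/legal/privacy"] },
  { path: "/legal/privacy", text: "We store analytics data for 30 days.", links: [] },
];

// 1. Search: rank candidate pages by naive term overlap.
function search(query: string): string[] {
  const terms = query.toLowerCase().split(/\s+/);
  return corpus
    .filter((p) => terms.some((t) => p.text.toLowerCase().includes(t)))
    .map((p) => p.path);
}

// 2. Navigate: open candidate pages and follow internal links a few hops.
function navigate(paths: string[], hops = 1): Page[] {
  const seen = new Map<string, Page>();
  let frontier = paths;
  for (let i = 0; i <= hops; i++) {
    const next: string[] = [];
    for (const path of frontier) {
      const page = corpus.find((p) => p.path === path);
      if (page && !seen.has(path)) {
        seen.set(path, page);
        next.push(...page.links);
      }
    }
    frontier = next;
  }
  return [...seen.values()];
}

// 3. Synthesize: answer only from evidence that was actually read.
function synthesize(evidence: Page[]): string {
  if (evidence.length === 0) return "Could not verify; please contact support.";
  return evidence.map((p) => `From ${p.path}: ${p.text}`).join("\n");
}

const answer = synthesize(navigate(search("canonical")));
```

The point of the middle step is that evidence grows by following links and headings, not by hoping the first retrieval was complete.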

Lesson 1: Chunk-only RAG fails when the answer spans pages

Top-K chunk retrieval breaks in consistent ways:
  • the answer is split across two pages
  • the “exact syntax” is in a code block that didn’t rank
  • the right answer is on a page with weak lexical overlap
  • the user’s question is underspecified, so the embedding match is “close enough”
When a model sees only a handful of chunks, it can sound confident while being incomplete. The user experience looks like:
  • correct tone
  • wrong details
  • no way to verify
Our takeaway: retrieval isn’t enough. You need exploration.

Lesson 2: Give the assistant deterministic primitives (docs-as-files)

Humans don’t answer docs questions by reading five random paragraphs. We:
  • find the page
  • open it
  • scan headings
  • search for exact tokens
  • follow internal links
  • repeat until we can prove the answer
So we shaped the assistant’s tool interface like a tiny filesystem over our content:
  • each page is a “file”
  • directories represent sections (or URL path segments)
  • the assistant can ls, find, cat, and grep
This is not a gimmick. It gives the model primitives that are:
  • auditable (you can log tool calls)
  • deterministic (same query yields the same reads)
  • scalable (it can explore without you hardcoding flows)
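A minimal sketch of that tool surface, virtualized over in-memory content. The class name and data shapes are illustrative; the important property is what is absent: there is no write, fetch, or exec.

```typescript
// Read-only "filesystem" tools over indexed content. Shapes are illustrative.
class ContentTools {
  constructor(
    private dirs: Record<string, string[]>,   // directory -> child entries
    private pages: Record<string, string>,    // path -> full page text
  ) {}

  // ls: list entries under a directory, resolved from the path tree.
  ls(dir: string): string[] {
    return this.dirs[dir] ?? [];
  }

  // find: match page paths by substring, like `find -name`.
  find(name: string): string[] {
    return Object.keys(this.pages).filter((p) => p.includes(name));
  }

  // cat: return the full page text for a path.
  cat(path: string): string {
    const page = this.pages[path];
    if (page === undefined) throw new Error(`cat: ${path}: No such page`);
    return page;
  }

  // grep: case-insensitive line search across all readable pages.
  grep(pattern: string): { path: string; line: string }[] {
    const re = new RegExp(pattern, "i");
    const hits: { path: string; line: string }[] = [];
    for (const [path, text] of Object.entries(this.pages)) {
      for (const line of text.split("\n")) {
        if (re.test(line)) hits.push({ path, line });
      }
    }
    return hits;
  }
  // Deliberately absent: write(), fetch(), exec() -- the surface is read-only.
}
```

Because every call is a pure lookup, each one can be logged and replayed, which is what makes the surface auditable.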

Architecture (the simplest version that works)

At a high level, we ended up with three layers:
Browser widget
  -> /api/support-chat (stream)
      -> Retrieval layer (lexical + vector)
      -> Tool layer ("filesystem" over content)
      -> Answer synthesis (grounded prompt + citations)
      -> Observability (logs + traces + eval hooks)
The key point: we treat “reading our content” as a tool, not as a side effect of retrieval.

Data model: pages, chunks, and a path tree

We kept the content representation intentionally boring:
  • page: canonical URL/slug + title + locale + visibility + lastmod + headings
  • chunk: { page, chunk_index, text, embedding, tokens, hash }
  • path_tree: { "services/seo": { isPublic: true, groups: [] }, ... }
Why a path tree?
  • the assistant needs a map of “what exists”
  • access control becomes structural (prune paths before sessions begin)
  • ls/find becomes fast and cacheable
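The same model written as types. These interfaces are a sketch of the records described above, not a schema we ship; field names mirror the bullet list.

```typescript
// Page-level record: used for navigation, locale filtering, and citations.
interface PageRecord {
  slug: string;                                    // canonical URL path
  title: string;
  locale: string;
  visibility: "public" | "client-only" | "internal";
  lastmod: string;                                 // ISO date
  headings: string[];                              // outline for scanning/ranking
}

// Chunk-level record: the unit of retrieval.
interface ChunkRecord {
  page: string;          // slug of the owning page
  chunk_index: number;   // order within the page, so pages can be reassembled
  text: string;
  embedding: number[];   // vector for semantic search
  tokens: number;
  hash: string;          // change detection for incremental re-indexing
}

// Path tree: maps URL paths to access metadata, so ls/find and RBAC
// pruning are structural lookups rather than per-query checks.
type PathTree = Record<string, { isPublic: boolean; groups: string[] }>;

const tree: PathTree = {
  "services/seo": { isPublic: true, groups: [] },
  "internal/runbooks": { isPublic: false, groups: ["staff"] },
};
```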

Ingestion pipeline: turn a website into a reliable knowledge surface

This was the biggest time sink. “Index your docs” sounds easy until you see what real content looks like:
  • marketing copy + components
  • MDX and headings that don’t map cleanly to HTML
  • navigation pages that repeat content blocks
  • locale variants that partially diverge
Our ingestion pipeline does:
  1. Discover pages: sitemap + known route list.
  2. Fetch rendered HTML: what users and crawlers actually see.
  3. Extract main content: strip nav, footers, cookie banners, repeated UI.
  4. Normalize: collapse whitespace, remove tracking query strings from links.
  5. Segment:
    • page-level record for navigation and citations
    • chunk-level records for retrieval
  6. Annotate:
    • locale
    • visibility (public, client-only, internal)
    • headings outline
    • outbound internal links
The headline lesson: index the rendered site, not just the source repo, or you’ll miss what the user actually sees and you’ll end up citing content that doesn’t exist in production.
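One concrete piece of step 4 is stripping tracking query strings so identical links dedupe cleanly. A sketch, assuming a typical parameter list; adjust `TRACKING_PARAMS` to whatever your analytics actually appends.

```typescript
// Common tracking parameters (an assumption -- extend for your own stack).
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"];

// Normalize an internal link: resolve against a base, drop tracking params.
function normalizeLink(href: string, base = "https://example.com"): string {
  const url = new URL(href, base);
  for (const p of TRACKING_PARAMS) url.searchParams.delete(p);
  // url.search is the empty string once all params were tracking noise.
  return url.pathname + url.search;
}
```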

Retrieval: combine lexical and vector before you “ask the model”

We don’t trust a single retrieval strategy. We do:
  • lexical search for exact tokens (great for identifiers, acronyms, error codes)
  • vector search for semantic match (great for vague questions)
Then we merge candidates and apply basic sanity checks:
  • reject pages that don’t match the user’s locale (unless explicitly asked)
  • prefer canonical pages over tag pages / duplicate listings
  • prefer pages with strong heading overlap with query terms
Only then do we let the assistant open pages and synthesize.
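One simple way to merge the two candidate lists is reciprocal rank fusion, with the locale check applied during the merge. This is a sketch of the idea, not our exact scoring; the candidate shape and constant `k = 60` are illustrative.

```typescript
interface Candidate { path: string; locale: string }

// Merge lexical and vector rankings with reciprocal rank fusion (RRF):
// each list contributes 1 / (k + rank) to a page's score, so pages that
// rank well in both lists rise to the top.
function mergeCandidates(
  lexical: Candidate[],
  vector: Candidate[],
  userLocale: string,
  k = 60,
): string[] {
  const scores = new Map<string, number>();
  for (const list of [lexical, vector]) {
    list.forEach((c, rank) => {
      if (c.locale !== userLocale) return; // sanity check: reject wrong-locale pages
      scores.set(c.path, (scores.get(c.path) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([path]) => path);
}
```

The canonical-page and heading-overlap preferences would slot in as further score adjustments before the final sort.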

Tool design: what the assistant can do (and what it cannot)

We kept the tool surface small and strict. Allowed:
  • list directories: ls /services
  • search paths: find -name "billing"
  • read a full page: cat /services/seo
  • search within content: grep -ri "canonical" /
Not allowed:
  • writing or editing content
  • fetching arbitrary URLs
  • arbitrary network calls
The safety benefit is huge: it makes “prompt injection” mostly a content-quality issue, not a system compromise issue.

The latency lesson: real filesystems are too slow for interactive chat

If you spin up a sandbox/container per session to provide a real filesystem:
  • cold start becomes visible
  • chat feels broken
  • you’re tempted to add complexity like warm pools
For a support widget, the user is staring at the UI. You want instant tools. So we virtualized the filesystem operations over our index:
  • ls and find resolve from the cached path tree
  • cat reassembles the page from stored chunks (sorted by chunk_index)
  • the results are cached per-session (and partially globally, when safe)
The assistant gets the illusion of a shell over files, but there are no actual files.
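The virtualized `cat` is almost the whole trick: reassemble the page from stored chunks instead of reading a real file. A minimal sketch, using the chunk shape from the data model above.

```typescript
interface Chunk { page: string; chunk_index: number; text: string }

// "cat" a virtual file: gather the page's chunks, sort by chunk_index,
// and join them back into full page text. No container, no filesystem.
function cat(path: string, chunks: Chunk[]): string {
  const own = chunks
    .filter((c) => c.page === path)
    .sort((a, b) => a.chunk_index - b.chunk_index);
  if (own.length === 0) throw new Error(`cat: ${path}: No such file`);
  return own.map((c) => c.text).join("\n\n");
}
```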

Cache what repeats: directory tree, pages, and “grep targets”

Caching “answers” is weak because questions vary. What repeats during real conversations is:
  • listing the same sections
  • opening the same 5-10 important pages
  • grepping for the same tokens
So we cached:
  • the path tree
  • reconstructed full pages
  • recent grep candidates
This mattered more than micro-optimizing embeddings, because it improved follow-up speed and reduced “searching…” loops.
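The cache itself can be boring. A per-session sketch; the key scheme (`"cat:/path"` and so on) is an assumption, and a production version would add TTLs and size bounds.

```typescript
// A minimal read-through cache for the operations that repeat within a
// session: path-tree listings, reconstructed pages, grep candidates.
class SessionCache {
  private store = new Map<string, unknown>();
  hits = 0;
  misses = 0;

  // Return the cached value for `key`, computing and storing it on a miss.
  get<T>(key: string, compute: () => T): T {
    if (this.store.has(key)) {
      this.hits++;
      return this.store.get(key) as T;
    }
    this.misses++;
    const value = compute();
    this.store.set(key, value);
    return value;
  }
}
```

Usage looks like `cache.get(`cat:${path}`, () => reassemblePage(path))`, so the second read of the same page is a map lookup.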

RBAC: access control must be structural, not prompt-based

If some docs are not public (drafts, internal notes, client-only docs), you cannot rely on prompting. We enforced RBAC before the assistant runs a single tool call:
  • build a user-scoped path tree
  • prune everything the user can’t access
  • apply the same filter to every query and page read
If the assistant can’t ls a file, it can’t cat it, and it can’t cite it. That’s the only mental model that holds under pressure.
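Structural pruning can be a single pass over the path tree before the session starts. A sketch, reusing the path-tree shape from the data model; group semantics here are illustrative.

```typescript
type Acl = { isPublic: boolean; groups: string[] };
type PathTree = Record<string, Acl>;

// Build the user-scoped tree: keep a path only if it is public or the
// user belongs to one of its groups. Every ls/find/cat then resolves
// against the scoped tree, so unreadable paths are simply invisible.
function pruneTree(tree: PathTree, userGroups: string[]): PathTree {
  const scoped: PathTree = {};
  for (const [path, acl] of Object.entries(tree)) {
    const allowed = acl.isPublic || acl.groups.some((g) => userGroups.includes(g));
    if (allowed) scoped[path] = acl;
  }
  return scoped;
}
```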

Guardrails: how we stopped confident wrong answers

We added a few rules that dramatically reduced bad answers:

Rule 1: No evidence, no answer

If the assistant can’t find supporting content with tools, it should:
  • say it couldn’t verify
  • show what it checked (pages or sections)
  • offer a next step (contact/support)

Rule 2: Prefer quoting over paraphrasing for critical details

When the answer is sensitive to exact wording (requirements, limitations, legal copy):
  • quote the relevant sentence(s) from the page it read
  • keep the synthesis minimal

Rule 3: Be strict about locale and canonical pages

If the user is on a locale route, the assistant should:
  • prefer that locale’s content
  • avoid mixing languages in a single response
  • fall back to default locale only if the locale page does not exist

The “grep problem”: the killer feature needs a two-phase plan

Naive recursive grep is slow if it reads everything over the network. We used a two-phase approach:
  1. coarse filter using the index (which pages might contain the token)
  2. fine filter in memory over cached page text to extract exact matches + context
This made the assistant feel like it could search like a developer, not guess like a chatbot.
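The two phases can be sketched as one function. The inverted index here is a toy stand-in for whatever your lexical index provides; the point is that phase 1 never touches page text, and phase 2 only scans text that is already cached in memory.

```typescript
interface GrepHit { path: string; lineNo: number; line: string }

function twoPhaseGrep(
  token: string,
  index: Map<string, Set<string>>,   // token -> paths that MAY contain it
  pageText: Map<string, string>,     // cached full page text
): GrepHit[] {
  // Phase 1: coarse filter -- only pages the index says might match.
  const candidates = index.get(token.toLowerCase()) ?? new Set<string>();

  // Phase 2: fine filter -- exact, case-insensitive line scan in memory,
  // producing the line number and context the assistant can quote.
  const hits: GrepHit[] = [];
  for (const path of candidates) {
    const text = pageText.get(path);
    if (!text) continue;
    text.split("\n").forEach((line, i) => {
      if (line.toLowerCase().includes(token.toLowerCase())) {
        hits.push({ path, lineNo: i + 1, line });
      }
    });
  }
  return hits;
}
```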

Widget UX: the UI makes or breaks trust

We shipped multiple UI iterations before the widget felt reliable. What mattered most:
  • streaming tokens with a stable layout (avoid jumpy reflow)
  • clear states:
    • “Searching…”
    • “Reading page…”
    • “Answering…”
  • short, in-line citations:
    • “From: /services/seo”
    • “From: /legal/privacy”
  • fallbacks that don’t feel like failure:
    • “I checked X and Y but couldn’t confirm; here’s how to reach us.”
This pairs well with making the assistant’s “work” visible. Users forgive a bot that says “I couldn’t confirm” far more than one that invents a confident answer.

Observability: log the reads, not just the tokens

If you want to improve quality, you need to know what happened. We logged:
  • tool calls (paths listed/read/searched)
  • which pages were used as evidence
  • the final answer length and latency buckets
  • “could not verify” rates
  • escalation rates (contact clicks)
That let us answer practical questions:
  • which pages are missing important information?
  • which questions never find evidence?
  • which pages cause confusion and need restructuring?
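A sketch of the per-conversation log record and one derived metric. The field names are illustrative; the useful habit is that tool calls and evidence pages are first-class fields, not buried in free-text transcripts.

```typescript
interface ChatLog {
  toolCalls: { tool: string; arg: string }[];  // every ls/find/cat/grep
  evidencePages: string[];                     // pages cited in the answer
  answerTokens: number;
  latencyMs: number;
  verified: boolean;                           // did tools find supporting content?
  escalated: boolean;                          // did the user click contact?
}

// The "could not verify" rate: the share of conversations where no
// evidence was found. Rising per-topic rates point at missing content.
function couldNotVerifyRate(logs: ChatLog[]): number {
  if (logs.length === 0) return 0;
  return logs.filter((l) => !l.verified).length / logs.length;
}
```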

A practical build checklist (what we’d do again)

If you’re building a support widget that chats with your content, we’d start with:
  • index the rendered site and store page-level records
  • combine lexical + vector retrieval
  • add an exploration tool layer (files + grep)
  • prune content for RBAC before sessions start
  • keep tools read-only by default
  • add a “no evidence, no answer” rule
  • make “searching/reading/answering” visible in the UI
  • log tool calls so you can debug failures
If you want help implementing this (widget + indexing + safe architecture), reach out.