Security & Quality

Production AI needs guardrails. This page describes the graduated defense behind the assistant: what runs on every request today, and what is tested and ready behind a flag.

Always-on defense (the live default)

Live

Every request runs through a severity-weighted regex detector and an input normalization pipeline before it reaches the model. This layer is always on: it has no feature flag and no external dependency, so it cannot be turned off and cannot time out. Patterns are scored by severity (instruction-override and jailbreak attempts score highest), and obfuscated variants are caught after normalization.
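A minimal sketch of severity-weighted regex scoring, to make the idea concrete. The rule names, patterns, and weights below are hypothetical placeholders, not the production rule set; the only property taken from the description is that instruction-override and jailbreak patterns carry the highest weights.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    severity: int  # higher means more dangerous


# Hypothetical rules and weights, for illustration only.
RULES = [
    Rule("instruction_override",
         re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I), 10),
    Rule("jailbreak",
         re.compile(r"\b(jailbreak|developer mode|do anything now)\b", re.I), 9),
    Rule("prompt_leak",
         re.compile(r"(reveal|show|print) (your |the )?system prompt", re.I), 6),
]


def score(text: str) -> tuple[int, list[str]]:
    """Return the summed severity and the names of every rule that matched."""
    hits = [rule for rule in RULES if rule.pattern.search(text)]
    return sum(rule.severity for rule in hits), [rule.name for rule in hits]
```

In this sketch each rule contributes its weight at most once per request, so a message that repeats one phrase does not score higher than a message that uses it once.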

Normalization pipeline

1. Strip zero-width and invisible characters
2. Unicode NFKD compatibility decomposition
3. Remove combining marks (accents/diacritics)
4. Map Cyrillic/confusable homoglyphs to Latin
5. Collapse whitespace, then bounded base64 decode + re-scan

Normalizing first means a payload hidden with zero-width characters, homoglyphs, or base64 is scored on its decoded intent, not its disguise.
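The five steps can be sketched with the Python standard library as follows. This is an illustration under assumptions: the invisible-character set, the homoglyph table, and the base64 bounds (`MAX_B64_LEN`, `MAX_B64_TOKENS`) are hypothetical and far smaller than a real deployment would need.

```python
import base64
import binascii
import re
import unicodedata

# Step 1: zero-width and invisible code points (illustrative subset).
INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff\u00ad]")

# Step 4: a tiny, hypothetical Cyrillic-to-Latin confusables table.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x", "\u0456": "i", "\u0455": "s", "\u0443": "y",
})

# Step 5 bounds (assumed values): only decode a few, reasonably short tokens.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
MAX_B64_LEN = 4096
MAX_B64_TOKENS = 8


def normalize(text: str) -> str:
    text = INVISIBLE.sub("", text)                                       # step 1
    text = unicodedata.normalize("NFKD", text)                           # step 2
    text = "".join(c for c in text if not unicodedata.combining(c))      # step 3
    text = text.translate(HOMOGLYPHS)                                    # step 4
    return re.sub(r"\s+", " ", text).strip()                             # step 5a


def decoded_payloads(text: str) -> list[str]:
    """Step 5b: decode base64-looking tokens so they can be re-scanned."""
    payloads = []
    for token in B64_TOKEN.findall(text)[:MAX_B64_TOKENS]:
        if len(token) > MAX_B64_LEN or len(token) % 4:
            continue
        try:
            raw = base64.b64decode(token, validate=True)
            payloads.append(normalize(raw.decode("utf-8")))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 text; nothing to re-scan
    return payloads
```

A detector would then score both `normalize(message)` and every string returned by `decoded_payloads(...)`, which is what lets a disguised payload be judged on its decoded content.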

Cumulative-severity blocking + appeal

Severity accumulates per identity over a rolling 24-hour window. Once it crosses the block threshold, further requests are refused — but the block is not a dead end: a visitor can appeal through a flow that requires a verified identity, so an honest false positive has a way back in.
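A minimal in-memory sketch of cumulative scoring over a rolling window. The threshold value, the `SeverityLedger` name, and the `clear` method (standing in for whatever a granted appeal does) are assumptions; a production version would need shared storage rather than a process-local dictionary.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60
BLOCK_THRESHOLD = 20  # hypothetical value


class SeverityLedger:
    def __init__(self, window=WINDOW_SECONDS, threshold=BLOCK_THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._events = defaultdict(deque)  # identity -> deque of (timestamp, severity)

    def _prune(self, identity, now):
        # Drop events that have aged out of the rolling window.
        events = self._events[identity]
        while events and events[0][0] <= now - self.window:
            events.popleft()

    def record(self, identity, severity, now):
        self._prune(identity, now)
        if severity > 0:
            self._events[identity].append((now, severity))

    def is_blocked(self, identity, now):
        self._prune(identity, now)
        total = sum(severity for _, severity in self._events[identity])
        return total >= self.threshold  # assumption: reaching the threshold blocks

    def clear(self, identity):
        # Hypothetical hook for a granted appeal: forget the accumulated score.
        self._events.pop(identity, None)
```

Timestamps are passed in explicitly so the window logic can be tested without waiting; callers would normally pass `time.time()`.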

ML classifiers

Tested & available · off by default

A second, heavier layer is built and validated but feature-flagged off by default: two small transformer models — ProtectAI's DeBERTa v3 prompt-injection-v2 (~184M params) and Meta's Prompt Guard 22M. When the flag is enabled they run on CPU via ONNX, in parallel, each with its own hard timeout; on any error or timeout they fail open (allow), so a model problem can never break chat. They are a safety net layered on top of the always-on regex, not a replacement for it.
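The parallel, fail-open pattern can be sketched with `concurrent.futures`. The classifiers here are plain callables returning an injection probability; the ONNX inference itself is out of scope, and the `classify` name, the 0.5 threshold, and the timeout values are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_POOL = ThreadPoolExecutor(max_workers=4)


def classify(text, classifiers, threshold=0.5):
    """classifiers maps name -> (callable, timeout_seconds).

    Returns (blocked, per_model_flags). Any error or timeout counts as
    "not flagged" (fail open), so a model problem never blocks a request.
    """
    start = time.monotonic()
    # Submit everything first so the models run in parallel.
    submitted = {
        name: (_POOL.submit(fn, text), timeout)
        for name, (fn, timeout) in classifiers.items()
    }
    flagged = {}
    for name, (future, timeout) in submitted.items():
        # Each deadline is measured from submission, not from this loop step.
        remaining = max(0.0, start + timeout - time.monotonic())
        try:
            flagged[name] = future.result(timeout=remaining) >= threshold
        except Exception:  # includes timeouts and model errors
            flagged[name] = False
    return any(flagged.values()), flagged
```

One caveat of this sketch: Python cannot kill a running thread, so a timed-out model keeps running in the background; the timeout only bounds how long the request waits for it.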

CI regression suite

A 105-case labeled suite runs in CI to catch accuracy drift across both models.

Benchmark

A deeper 1,823-sample benchmark validates any threshold change before it ships.

Keeping the ML layer behind a flag is deliberate: it can be enabled or disabled without a redeploy if a model regresses in production. The honest status today is tested and available, not live.

Operational guardrails

Sliding-window rate limiting

An atomic Redis (Lua) sliding-window limiter throttles abuse, keyed on HMAC-hashed IPs — raw addresses are never stored.
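The two ideas in this guardrail, a sliding window and keys derived from HMAC-hashed IPs, can be illustrated without Redis. The sketch below is an in-memory stand-in, not the Lua script: it omits the atomicity that Redis provides across processes, and `SECRET`, `client_key`, and `SlidingWindowLimiter` are hypothetical names.

```python
import hashlib
import hmac
from collections import defaultdict, deque

SECRET = b"replace-with-a-real-secret"  # hypothetical; load from configuration


def client_key(ip: str, secret: bytes = SECRET) -> str:
    """Derive a stable limiter key so the raw address is never stored."""
    return hmac.new(secret, ip.encode("utf-8"), hashlib.sha256).hexdigest()


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Evict timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Using a keyed HMAC rather than a bare hash matters because the IPv4 space is small enough to brute-force an unkeyed digest back to the address.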

Per-token MCP limits

The MCP server rate-limits tool calls per bearer token. Self-service access issues distinct tokens as revocable grants, so abuse exhausts only that token's own bucket of 60 calls per minute, and revocation can target the offending grant alone.

Output filtering

A streaming output filter redacts anything resembling an API key, internal file path, or retrieval source marker before it reaches the browser.
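Redacting a stream is harder than redacting a finished string, because a secret can be split across two chunks. The sketch below handles that by only emitting text up to the last whitespace seen, which works under the assumption that no sensitive pattern contains whitespace. The three patterns, including the `[source:...]` marker format, are hypothetical examples rather than the production list.

```python
import re

# Hypothetical patterns; none of them can match across whitespace.
PATTERNS = [
    re.compile(r"\b(?:sk|pk|api)[-_][A-Za-z0-9_-]{16,}"),        # API-key-like tokens
    re.compile(r"(?<![\w/])/(?:home|srv|var|app|usr|etc)/\S+"),  # internal file paths
    re.compile(r"\[source:[^\]\s]+\]"),                          # retrieval source markers
]


def redact(text: str) -> str:
    for pattern in PATTERNS:
        text = pattern.sub("[redacted]", text)
    return text


def filter_stream(chunks):
    """Yield redacted text, holding back the trailing partial token."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        cut = max(buffer.rfind(" "), buffer.rfind("\n"), buffer.rfind("\t"))
        if cut >= 0:
            # Everything up to the last whitespace consists of complete tokens.
            yield redact(buffer[:cut + 1])
            buffer = buffer[cut + 1:]
    if buffer:
        yield redact(buffer)  # flush the final token at end of stream
```

The cost of this design is a small delay: the last word of each chunk is held until the next whitespace arrives. A real filter would also cap the buffer so a very long unbroken token cannot stall the stream.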

Security headers

CSP, HSTS, framing, and content-type protections are asserted by an end-to-end test, not just configured.
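As an illustration of asserting headers rather than trusting configuration, the helper below checks a response's header mapping. The exact header names and accepted values are assumptions (for example, framing protection could instead be expressed through a CSP `frame-ancestors` directive), and a real end-to-end test would feed it headers from a live response.

```python
# Assumed header set and checks; adjust to the policy actually deployed.
REQUIRED = {
    "content-security-policy": lambda v: "default-src" in v,
    "strict-transport-security": lambda v: "max-age=" in v,
    "x-frame-options": lambda v: v.upper() in {"DENY", "SAMEORIGIN"},
    "x-content-type-options": lambda v: v.lower() == "nosniff",
}


def header_failures(headers: dict) -> list:
    """Return the names of required headers that are missing or malformed."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    return sorted(
        name for name, is_valid in REQUIRED.items()
        if name not in lowered or not is_valid(lowered[name])
    )
```

A test would then assert that `header_failures(response_headers)` is an empty list, so a dropped or weakened header fails the build instead of passing silently.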

These guardrails sit in front of the retrieval, voice, and MCP features described on the AI Features page, within the architecture covered under Architecture.