Security & Quality
Production AI needs guardrails. This is the real, graduated defense behind the assistant — what runs on every request today, and what is tested and ready behind a flag.
Always-on defense (the live default)
Live
Every request runs through a severity-weighted regex detector and an input normalization pipeline before it reaches the model. This layer is always on: it has no feature flag and no external dependency, so it cannot be switched off or time out. Patterns are scored by severity (instruction-override and jailbreak attempts score highest), and obfuscated variants are caught after normalization.
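As a rough illustration of severity-weighted matching, the sketch below scores a message against a small rule table. The rule names, patterns, and weights are hypothetical and are not the production rule set; only the idea that override and jailbreak patterns carry the highest weight comes from the description above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    severity: int


# Hypothetical rules and weights, for illustration only.
RULES = [
    Rule(
        "instruction_override",
        re.compile(r"\bignore\s+(?:all\s+)?(?:previous|prior)\s+instructions\b", re.I),
        10,
    ),
    Rule("jailbreak_persona", re.compile(r"\b(?:developer|dan)\s+mode\b", re.I), 10),
    Rule(
        "prompt_extraction",
        re.compile(r"\b(?:reveal|print|show)\s+(?:your\s+)?system\s+prompt\b", re.I),
        6,
    ),
]


def score(text: str) -> tuple[int, list[str]]:
    """Return the summed severity and the names of the rules that matched."""
    hits = [rule for rule in RULES if rule.pattern.search(text)]
    return sum(rule.severity for rule in hits), [rule.name for rule in hits]
```

Because the detector is plain regex over an in-memory string, it has nothing to wait on, which is what lets it run on every request without a timeout path.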
Normalization pipeline
1. Strip zero-width and invisible characters
2. Unicode NFKD compatibility decomposition
3. Remove combining marks (accents/diacritics)
4. Map Cyrillic/confusable homoglyphs to Latin
5. Collapse whitespace, then run a bounded base64 decode and re-scan the decoded text
Normalizing first means a payload hidden with zero-width characters, homoglyphs, or base64 is scored on its decoded intent, not its disguise.
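The five steps can be sketched with Python's standard library. This is a minimal sketch, not the production pipeline: the invisible-character set, the homoglyph table, the 16-character minimum run, and the `MAX_B64_CHARS` bound are all assumptions chosen for illustration.

```python
import base64
import binascii
import re
import unicodedata

# Step 1 table: a partial, illustrative set of zero-width/invisible code points.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff\u00ad"))

# Step 4 table: a tiny, illustrative Cyrillic-to-Latin homoglyph map.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x", "\u0456": "i", "\u0455": "s",
})

B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
MAX_B64_CHARS = 4096  # assumed bound on how much of a run is decoded


def normalize(text: str) -> str:
    text = text.translate(INVISIBLE)                     # 1. strip invisibles
    text = unicodedata.normalize("NFKD", text)           # 2. NFKD decomposition
    text = "".join(c for c in text
                   if not unicodedata.combining(c))      # 3. drop combining marks
    text = text.translate(HOMOGLYPHS)                    # 4. homoglyphs to Latin
    return re.sub(r"\s+", " ", text).strip()             # 5a. collapse whitespace


def decoded_payloads(text: str) -> list[str]:
    """5b. Decode base64-looking runs (bounded) so they can be re-scanned."""
    payloads = []
    for match in B64_RUN.finditer(text):
        chunk = match.group(0)[:MAX_B64_CHARS]
        try:
            raw = base64.b64decode(chunk, validate=True)
            payloads.append(normalize(raw.decode("utf-8")))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 text; nothing extra to scan
    return payloads
```

A caller would score `normalize(message)` and then score each string returned by `decoded_payloads(...)` with the same detector, so a disguised payload is weighed the same as its plain form.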
Cumulative-severity blocking + appeal
Severity accumulates per identity over a rolling 24-hour window. Once it crosses the block threshold, further requests are refused — but the block is not a dead end: a visitor can appeal through a flow that requires a verified identity, so an honest false positive has a way back in.
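The rolling-window accounting can be pictured with the in-memory sketch below. The class name, the threshold value, and the choice to block at "greater than or equal to" are assumptions; the identity verification behind an appeal is out of scope here and is represented only by a method that clears history once an appeal has been granted.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60
BLOCK_THRESHOLD = 20  # hypothetical value


class SeverityLedger:
    def __init__(self, window=WINDOW_SECONDS, threshold=BLOCK_THRESHOLD):
        self.window = window
        self.threshold = threshold
        # identity -> deque of (timestamp, severity), oldest first
        self._events = defaultdict(deque)

    def _total(self, identity, now):
        events = self._events[identity]
        # Drop events that have aged out of the rolling window.
        while events and events[0][0] <= now - self.window:
            events.popleft()
        return sum(severity for _, severity in events)

    def record(self, identity, severity, now=None):
        """Add a scored request and return the identity's current total."""
        now = time.time() if now is None else now
        if severity > 0:
            self._events[identity].append((now, severity))
        return self._total(identity, now)

    def is_blocked(self, identity, now=None):
        now = time.time() if now is None else now
        return self._total(identity, now) >= self.threshold

    def grant_appeal(self, identity):
        """Clear history after an appeal succeeds (verification happens elsewhere)."""
        self._events.pop(identity, None)
```

Because old events expire on their own, a block lifts once enough of the offending requests fall outside the window, even without an appeal.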
ML classifiers
Tested & available · off by default
A second, heavier layer is built and validated but feature-flagged off by default. It consists of two small transformer models: ProtectAI's DeBERTa v3 prompt-injection-v2 (~184M params) and Meta's Prompt Guard 22M. When the flag is enabled they run on CPU via ONNX, in parallel, each with its own hard timeout; on any error or timeout they fail open (allow), so a model problem can never break chat. They are a safety net layered on top of the always-on regex, not a replacement for it.
A 105-case labeled suite runs in CI to catch accuracy drift across both models.
A deeper 1,823-sample benchmark validates any threshold change before it ships.
Keeping the ML layer behind a flag is deliberate: it can be enabled or disabled without a redeploy if a model regresses in production. The honest status today is tested and available, not live.
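The flag, the parallel run, and the fail-open rule fit together roughly as in this sketch. The function names, the timeout and threshold defaults, and the "block if either model flags" combination rule are assumptions; each classifier is modeled as an async callable returning an injection probability rather than a real ONNX session.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical interface: takes the message, returns an injection probability.
Classifier = Callable[[str], Awaitable[float]]


async def guarded(classify: Classifier, text: str,
                  timeout: float, threshold: float) -> bool:
    """Return True to block. Any error or timeout fails open (False)."""
    try:
        probability = await asyncio.wait_for(classify(text), timeout)
    except Exception:  # includes the timeout raised by wait_for
        return False
    return probability >= threshold


async def ml_layer_blocks(text: str, classifiers: Sequence[Classifier],
                          enabled: bool, timeout: float = 0.5,
                          threshold: float = 0.9) -> bool:
    if not enabled:  # flag off: the layer never runs
        return False
    # Run every classifier concurrently, each under its own timeout.
    verdicts = await asyncio.gather(
        *(guarded(c, text, timeout, threshold) for c in classifiers)
    )
    return any(verdicts)  # assumed combination rule
```

Real model inference is a blocking call, so in practice it would need to run in a worker thread or process for the timeout to take effect; the sketch leaves that detail to the classifier callables.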
Operational guardrails
An atomic Redis (Lua) sliding-window limiter throttles abuse, keyed on HMAC-hashed IPs — raw addresses are never stored.
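To show the shape of that limiter without a Redis dependency, the sketch below swaps the Redis Lua script for an in-process dictionary. That swap means it is not atomic across processes, which is exactly what the Lua script provides in the real system. The secret, limit, and class name are hypothetical.

```python
import hashlib
import hmac
import time
from collections import defaultdict, deque

SECRET = b"replace-with-a-real-secret"  # hypothetical; load from config in practice


def ip_key(ip: str, secret: bytes = SECRET) -> str:
    """Keyed hash of the address, so the raw IP is never used as a storage key."""
    return hmac.new(secret, ip.encode("utf-8"), hashlib.sha256).hexdigest()


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # hashed key -> request timestamps

    def allow(self, ip: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[ip_key(ip)]
        # Evict timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Using an HMAC rather than a bare hash matters because the IPv4 space is small enough to brute-force an unkeyed digest back to the original address.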
The MCP server rate-limits tool calls per bearer token. Self-service access issues distinct tokens as revocable grants, so abuse burns that token's own 60/min bucket and revocation can target the offending grant.
A streaming output filter redacts anything resembling an API key, internal file path, or retrieval source marker before it reaches the browser.
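One difficulty with filtering a stream is that a secret can be split across two chunks. A minimal sketch of one way to handle that is to hold back the trailing partial token until whitespace arrives. The three patterns below (an `sk-` style key, a path ending in a known extension, and a `[[src:...]]` marker) are hypothetical stand-ins, not the formats the real filter matches.

```python
import re
from typing import Iterable, Iterator

# Hypothetical patterns, for illustration only.
REDACTIONS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),                    # API-key-like token
    re.compile(r"(?:/[\w.-]+){2,}\.(?:py|ts|js|json|md)\b"),   # file-path-like
    re.compile(r"\[\[src:[^\]\s]*\]\]"),                       # source marker
]


def redact(text: str) -> str:
    for pattern in REDACTIONS:
        text = pattern.sub("[redacted]", text)
    return text


def filter_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield redacted text, holding back any token that may be incomplete."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Only text up to the last space or newline is safe to emit; the
        # remainder may be the first half of a secret split across chunks.
        cut = max(buffer.rfind(" "), buffer.rfind("\n")) + 1
        if cut:
            yield redact(buffer[:cut])
            buffer = buffer[cut:]
    if buffer:
        yield redact(buffer)  # flush whatever is left at end of stream
```

This relies on the sensitive patterns containing no whitespace; a pattern that could span spaces would need a longer hold-back window.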
CSP, HSTS, frame and content-type protections are asserted by an end-to-end test, not just configured.
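The kind of check such a test performs can be sketched as a function over response headers. The header names are the standard ones; which exact values the real test asserts is not stated here, so the rules below are assumptions, and fetching the headers from a running deployment is left out.

```python
# Each rule takes the header value and returns True when it looks acceptable.
REQUIRED = {
    "content-security-policy": lambda v: bool(v.strip()),
    "strict-transport-security": lambda v: "max-age=" in v.lower(),
    "x-content-type-options": lambda v: v.strip().lower() == "nosniff",
}


def header_problems(headers: dict[str, str]) -> list[str]:
    """Return the names of protections that are missing or malformed."""
    h = {name.lower(): value for name, value in headers.items()}
    problems = [
        name for name, ok in REQUIRED.items()
        if name not in h or not ok(h[name])
    ]
    # Framing can be controlled by either X-Frame-Options or CSP frame-ancestors.
    csp = h.get("content-security-policy", "").lower()
    if "x-frame-options" not in h and "frame-ancestors" not in csp:
        problems.append("frame protection")
    return problems
```

An end-to-end test would request a real page and assert that `header_problems(response_headers)` is empty, which catches a proxy or CDN silently dropping a header that the application configured.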
These guardrails sit in front of the retrieval, voice, and MCP features described on the AI Features page, within the architecture covered under Architecture.