Air-gapped QA Playbook for On‑Device LLMs
When I inherited a project running a local proofreading LLM on a fleet of devices, I saw speed and privacy gains firsthand. I also faced a stubborn problem: believable-but-wrong assertions slipping into final copies. In a networked world you can lean on web lookups or fact-checkers; in an air-gapped setup, you don’t have that luxury. You need a playbook that treats factuality as testable, auditable, and repeatable without internet access[^1].
This post is that playbook. It’s grounded in hands-on testing, real incidents, and conversations with compliance officers. Below is a quick-start checklist, followed by a deeper walk-through of adversarial test suites, scoring rubrics, regression testing, offline automation, SOPs, and reproducible examples you can drop into an air-gapped lab.
Quick-start checklist (3–5 steps)
- Build a tiny adversarial suite of your top 20 critical queries and expected behaviors.
- Implement a deterministic runner (seeded RNG, pinned tokenizers) and run nightly smoke tests.
- Maintain an offline evidence store and force grounding for answers in critical domains.
- Produce an audit bundle for every release (test-suite version, runner logs, model hash).
- Start with conservative defaults: require hedging when support is missing.
Assumptions and supported runtimes
- Assumptions: you have the model binary available inside the air-gapped environment and can run it locally. Token-level probabilities or per-token logits are optional but helpful for confidence detection; if unavailable, use surrogate signals (e.g., text-level confidence APIs or heuristic uncertainty measures).
- Supported runtimes (examples tested): ONNX Runtime, GGML (C/Python wrappers), PyTorch (TorchScript), TFLite. The adapter snippet demonstrates the pattern for standardizing calls across these runtimes.
The fundamental constraints and opportunities
Air-gapping changes the rules:
- No live web lookups or API fact-checks. Everything must be proven with offline artifacts.
- Data privacy improves, but the model’s static knowledge can be stale or biased.
- You gain reproducibility—the same inputs produce the same outputs in the same environment—great for regression testing.
Treat these constraints as design parameters. Your QA approach should minimize false confidence while maximizing reproducibility and auditability.
Build an adversarial test suite (and maintain it)
The core of air-gapped QA is an adversarial test suite: prompts and reference outputs designed to probe known failure modes. Think of it as controlled hostility—deliberately trying to make the model mislead, hallucinate, or over-confidently assert facts.
What to include
- Canonical fact probes: short prompts demanding specific verifiable facts (dates, figures, classifications).
- Source attribution challenges: prompts asking for citations, source names, or provenance.
- Ambiguity traps: ambiguous phrasing that can yield multiple plausible but different answers.
- Domain-specific stress tests: niche product specs, regulatory clauses, local law where hallucinations are common.
- Temporal knowledge tests: probes with a time component (e.g., “As of 2022…”).
- Hallucination templates: crafted prompts known to trigger fabrications (e.g., asking for nonexistent studies).
Store examples with labels for failure modes and explanations of why an answer is incorrect. That annotation is invaluable when explaining mistakes to stakeholders.
How to generate adversarial cases
- Historical failures: log hallucinations, convert the prompt and bad output to a test case.
- Guided adversarial generation: alter key entities, add qualifiers, or ask for fabricated sources. Example: change a known author’s name by one character and ask for publications.
- User-reported issues: validate and convert complaints into test cases automatically.
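To make the guided-generation bullet above repeatable, a small deterministic perturbation helper can produce entity-swapped variants on demand. This is a minimal sketch; the prompt, entity name, and helper name are illustrative rather than part of any existing tooling.

```python
import random

def perturb_entity(prompt: str, entity: str, seed: int = 0) -> str:
    """Swap one character of a known entity name to probe for fabricated answers.

    The entity must appear verbatim in the prompt; a well-behaved model should
    hedge or decline on the perturbed variant rather than invent facts.
    """
    rng = random.Random(seed)
    positions = [i for i, ch in enumerate(entity) if ch.isalpha()]
    idx = rng.choice(positions)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    replacement = rng.choice([c for c in alphabet if c != entity[idx].lower()])
    perturbed = entity[:idx] + replacement + entity[idx + 1:]
    return prompt.replace(entity, perturbed)

# Example: derive a variant from a curated author-publication probe
print(perturb_entity("List three publications by Jane Moreau.", "Jane Moreau", seed=42))
```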
Versioning and governance
Store the suite in a version-controlled repo inside the air-gapped environment. Use semantic tags (e.g., v1.2.0-tests). Each test case should include metadata: ID, description, prompt, expected output, strictness (exact/fuzzy/supported), origin, and last validated date.
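As a sketch of what that per-case metadata can look like in code (the field names loosely mirror the manifest keys used later; adapt them to your own schema):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TestCase:
    id: str                      # e.g., "T-0001"
    description: str
    prompt: str
    expected: str                # reference answer or required substring
    strictness: str              # "exact", "fuzzy", or "supported"
    origin: str                  # "curated", "generated", or "user-report"
    failure_mode: str = ""       # optional label, e.g., "fabricated-citation"
    last_validated: date = field(default_factory=date.today)
```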
Scoring rubrics: turning qualitative failures into measurable signals
Auditors and engineers want numbers. A defensible scoring rubric turns subjective assessments into objective metrics.
Core rubric dimensions (example)
- Correctness (0–3): 0 = wrong/false, 1 = partially correct, 2 = mostly correct, 3 = fully correct.
- Support (0–2): 0 = unsupported/fabricated, 1 = partially supported by internal knowledge, 2 = fully supported by offline sources.
- Confidence calibration (0–2): 0 = overconfident on wrong facts, 1 = hedged/ambiguous, 2 = appropriately qualified.
- Attribution quality (0–2): 0 = fabricated citation, 1 = vague citation, 2 = accurate verifiable citation.
Use thresholds (pass/warn/fail). For my teams, a pass requires aggregate >= 7/9 on critical queries and no fabricated citations.
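Here is a minimal sketch of wiring those dimensions and the 7/9 bar into a verdict function. The warn band below 7 for non-critical queries is an assumption to tune to your own risk tolerance, and I treat Support = 0 plus Attribution = 0 as the fabricated/unsupported case.

```python
def verdict(scores: dict, critical: bool = True) -> str:
    """Map rubric scores for one test case to pass/warn/fail.

    scores holds the four dimensions: correctness (0-3), support (0-2),
    confidence (0-2), attribution (0-2); the aggregate maximum is 9.
    """
    total = sum(scores.values())
    if scores["support"] == 0 and scores["attribution"] == 0:
        return "fail"  # fabricated or wholly unsupported assertion
    if critical and total < 7:
        return "fail"  # below the 7/9 bar for critical queries
    if total < 7:
        return "warn"  # assumed warn band for non-critical queries
    return "pass"
```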
Example: applying the rubric
Prompt: "Who invented the AC adapter used in the Model X device?"
- Correctness: 1 (named a company, not an inventor)
- Support: 0 (no verifiable provenance supplied)
- Confidence: 0 (presented as fact)
- Attribution: 0 (no citation)
Total: 1/9, a fail. This becomes a high-priority regression test.
Metrics to report continuously
- Hallucination rate: percentage of test cases flagged as fabricated (Support = 0 and Attribution = 0).
- Confidence mismatch rate: percentage where Confidence = 0 while Correctness < 2.
- Regression index: number of previously passing tests that fail after a model update.
- Domain failure rate: hallucination rate for domain-specific tests.
These metrics live in a small offline dashboard generated from test runs and form compliance evidence.
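A sketch of how those metrics fall out of a list of scored results; it assumes each result carries the rubric scores plus "passed" and "previously_passing" flags, which is one reasonable shape rather than a fixed schema.

```python
def summarize(results: list[dict]) -> dict:
    """Compute the headline metrics from scored test results."""
    n = len(results)
    hallucinated = sum(1 for r in results if r["support"] == 0 and r["attribution"] == 0)
    mismatched = sum(1 for r in results if r["confidence"] == 0 and r["correctness"] < 2)
    regressions = sum(1 for r in results if r["previously_passing"] and not r["passed"])
    return {
        "hallucination_rate": hallucinated / n if n else 0.0,
        "confidence_mismatch_rate": mismatched / n if n else 0.0,
        "regression_index": regressions,
    }
```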
Lightweight automation for offline regression testing
Automation doesn’t need the cloud. I implemented a portable test runner that executes the suite against local models, scores results, and generates an audit bundle. You can build one with a few hundred lines of Python and a small YAML-driven manifest.
Components of the runner
- Test loader: reads YAML/JSON manifests and fixtures.
- Model adapter: standardizes calls across LLM runtimes (ONNX, GGML, PyTorch, TFLite) via one interface.
- Comparator: runs outputs through fuzzy matchers and heuristics to compute rubric scores.
- Reporter: emits a signed JSON audit bundle and human-readable report.
Make the runner deterministic: seed random generators, freeze tokenization configs, and pin runtime versions. Determinism proves when a regression was introduced by a specific change.
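Here is a minimal sketch of that determinism bookkeeping: seed what you can, hash the tokenizer config, and store the fingerprint in the run report. The file path is illustrative, and you should call your runtime's own seeding API where one exists.

```python
import hashlib
import random

def make_deterministic(seed: int, tokenizer_config_path: str) -> dict:
    """Seed the RNG and fingerprint the tokenizer config for the audit trail."""
    random.seed(seed)
    # If your runtime exposes a seeding hook (e.g., torch.manual_seed), call it here too.
    with open(tokenizer_config_path, "rb") as f:
        tokenizer_hash = hashlib.sha256(f.read()).hexdigest()
    return {"seed": seed, "tokenizer_sha256": tokenizer_hash}

# fingerprint = make_deterministic(1234, "tokenizer.json")  # path is illustrative
```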
Minimal runnable example (YAML manifest + Python test runner pattern)
YAML test manifest (tests.yml):
```yaml
- id: T-0001
  description: "Canonical date check: founding year of ExampleCorp"
  prompt: "In what year was ExampleCorp founded?"
  expected: "2003"
  strictness: exact
  origin: curated
- id: T-0002
  description: "Citation check: cite a regulation section"
  prompt: "What is the reporting threshold under Regulation X, and cite the source?"
  expected_contains: "Regulation X"
  strictness: supported
  origin: generated
```
Python test-runner (minimal pattern)
- Loads tests.yml
- Calls a model adapter function get_response(prompt)
- Runs simple comparators for exact/fuzzy/supported
- Emits a JSON report containing scores, timestamps, and model_hash
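A minimal sketch of that loop, assuming PyYAML is available inside the air-gapped environment and using the adapter pattern shown next; the fuzzy comparator here is a naive token-overlap stand-in for whatever matcher you actually use.

```python
import hashlib
import json
import time

import yaml  # PyYAML, assumed to be packaged offline

def compare(output: str, test: dict) -> bool:
    """Minimal comparators for the three strictness levels."""
    if test["strictness"] == "exact":
        return output.strip() == test["expected"]
    if test["strictness"] == "supported":
        return test["expected_contains"].lower() in output.lower()
    # "fuzzy": naive token overlap; swap in your preferred matcher
    expected = set(test["expected"].lower().split())
    got = set(output.lower().split())
    return len(expected & got) / max(len(expected), 1) >= 0.8

def run_suite(manifest_path: str, adapter, model_path: str) -> dict:
    """Run every test, score it, and return a report ready to serialize."""
    with open(manifest_path) as f:
        tests = yaml.safe_load(f)
    with open(model_path, "rb") as f:
        model_hash = hashlib.sha256(f.read()).hexdigest()
    results = []
    for test in tests:
        output = adapter.get_response(test["prompt"])
        results.append({"id": test["id"], "output": output, "passed": compare(output, test)})
    return {"timestamp": time.time(), "model_hash": model_hash, "results": results}

# report = run_suite("tests.yml", adapter, "model.bin")  # paths are illustrative
# json.dump(report, open("audit_report.json", "w"), indent=2)
```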
Example adapter pattern (pseudo-Python):
```python
class ModelAdapter:
    def __init__(self, runtime_config):
        # runtime_config selects ONNX / GGML / Torch / TFLite and model paths
        self.runtime = runtime_config['type']
        # initialize the runtime-specific client here

    def get_response(self, prompt, max_tokens=256):
        if self.runtime == 'onnx':
            return self._call_onnx(prompt, max_tokens)
        if self.runtime == 'ggml':
            return self._call_ggml(prompt, max_tokens)
        if self.runtime == 'torch':
            return self._call_torch(prompt, max_tokens)
        if self.runtime == 'tflite':
            return self._call_tflite(prompt, max_tokens)
        raise NotImplementedError(f"unsupported runtime: {self.runtime}")

    # _call_* implementations invoke the local runtime and return text + optional logits
```
Note: the repo I use includes concrete adapter implementations per runtime; the pattern above is intentionally minimal so you can integrate your existing runtime quickly.
Offline tooling tips
- Use small, reproducible containers or packaged virtualenvs inside the air-gapped environment.
- Keep an offline reference library: citation databases, product spec PDFs, regulatory snippets. The comparator can match outputs against these artifacts.
- Use cryptographic hashing for manifests and model binaries so compliance can verify inputs.
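For the evidence-matching step, a crude but serviceable offline check is to look for long shared word sequences between the model's answer and the documents in your reference library. A minimal sketch, assuming a prebuilt JSON index (the file layout is illustrative); replace the substring test with BM25 or embedding search if you have one packaged offline.

```python
import json

def load_evidence_index(path: str) -> dict:
    """Load an offline index shaped like {source_id: {"title": ..., "text": ...}}."""
    with open(path) as f:
        return json.load(f)

def is_supported(answer: str, evidence: dict, min_overlap: int = 6) -> bool:
    """True if any evidence document shares a run of min_overlap words with the answer."""
    words = answer.lower().split()
    for doc in evidence.values():
        text = doc["text"].lower()
        for i in range(len(words) - min_overlap + 1):
            if " ".join(words[i:i + min_overlap]) in text:
                return True
    return False
```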
SOPs and documentation for compliance
Compliance teams want reproducible procedures, traceable evidence, and clear responsibilities. Here’s the SOP skeleton I used to get sign-off.
SOP: Model Validation and Release (summary)
- Scope and definitions: define what "validated" means and the product scope.
- Inputs: model binary, test-suite version, offline artifact set, change log.
- Pre-release checks: run full adversarial suite, hit passing thresholds, generate audit bundle.
- Release criteria: pass thresholds, no new fabricated citations in critical domains, signed QA approval.
- Post-release monitoring: sampling plan, feedback triage, mandatory regression rerun on major patches.
Every release produces an audit bundle: test-suite version, runner logs, scorer output, model binary hash, and an executive summary with a one-page risk statement listing residual hallucination rates and mitigations.
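As a sketch of what assembling and signing that bundle can look like: HMAC-SHA256 with a key held by QA is a stand-in here for whatever signing mechanism your compliance team actually mandates (offline GPG, hardware token, and so on).

```python
import hashlib
import hmac
import json

def build_audit_bundle(suite_version: str, model_hash: str, runner_log: str,
                       scorer_output: dict, signing_key: bytes) -> dict:
    """Assemble the per-release audit bundle and attach a signature."""
    bundle = {
        "test_suite_version": suite_version,
        "model_hash": model_hash,
        "runner_log": runner_log,
        "scorer_output": scorer_output,
    }
    payload = json.dumps(bundle, sort_keys=True).encode()
    bundle["signature"] = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return bundle
```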
Designing regression tests that catch real-world failures
Regression testing is more than re-running adversarial cases. Anticipate how the model will be used and include user-behavior simulations.
Include UX-driven scenarios
Create multi-turn tasks that mimic workflows: proofreading a manual, drafting a support reply, or summarizing policy changes. These reveal context-aware hallucinations that single-turn tests miss.
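One way to express such a scenario is as an ordered list of turns with behavior checks attached to the final state; the fields below are illustrative rather than a fixed schema.

```python
proofreading_scenario = {
    "id": "UX-0007",
    "risk": "reputation",
    "turns": [
        {"role": "user", "content": "Here is a draft of section 3 of the user manual: ..."},
        {"role": "user", "content": "Tighten the wording but keep every spec value unchanged."},
        {"role": "user", "content": "Now summarize the changes you made."},
    ],
    "expected_behaviors": [
        "no numeric spec values altered",
        "no new citations or sources introduced",
    ],
}
```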
Prioritize based on risk
Map test cases to risk categories: legal/regulatory, financial, safety-critical, reputation. High-risk failures get immediate remediation windows and cannot be signed off.
Regression cadence
- Nightly smoke runs for quick feedback.
- Weekly deep runs for coverage.
- Pre-release full runs blocking deploys for major changes.
Practical mitigation techniques beyond testing
- Prompt guardrails: prefix prompts with explicit constraints like "If uncertain, respond 'I don’t know' and suggest checking offline sources." Short standardized prefixes reduce overconfidence.
- Output filters: run regex/structured parsers to detect synthetic citations or improbable numbers.
- Evidence retrieval: bundle a curated offline evidence store and implement a small retrieval module so the model grounds answers in closed-book artifacts.
- Conservative defaults: prefer hedging language unless the model matches a high-confidence reference.
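The guardrail prefix from the first bullet above is the cheapest of these to standardize, and a simple regex covers part of the output-filter bullet. A minimal sketch, reusing the wording quoted in the guardrail bullet; the number pattern is an illustrative heuristic, not a tested detector.

```python
import re

GUARDRAIL_PREFIX = (
    "If uncertain, respond 'I don't know' and suggest checking offline sources.\n\n"
)

def guarded(prompt: str) -> str:
    """Prepend the standardized hedging instruction to every prompt."""
    return GUARDRAIL_PREFIX + prompt

# Output filter: flag suspiciously precise large figures for human review
IMPROBABLE_NUMBER = re.compile(r"\b\d{1,3}(?:,\d{3}){2,}(?:\.\d+)?\b")

def needs_review(output: str) -> bool:
    """Route outputs containing big exact numbers to a reviewer."""
    return bool(IMPROBABLE_NUMBER.search(output))
```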
Reproducing the “70% reduction” example (baseline and exact changes)
Context: on a ggml-based proofreading runtime (ggml v0.2.0 wrapper, model family: local-medium-v1), we measured fabricated citation occurrences over a controlled 3,000-query adversarial run.
- Baseline (before mitigations): fabricated-citation rate = 12.5% (375/3,000 queries).
- Changes applied:
- Added a one-line prefix: "If you cannot cite a verifiable offline source, answer 'I don't know.'"
- Implemented a post-output checker: regex to detect citation-like patterns (e.g., "[Author], 20[0-9]{2}") and cross-check against the offline evidence index (see the sketch below).
- Result (after changes): fabricated-citation rate = 3.7% (112/3,000 queries).
- Absolute reduction: 263 cases; relative reduction: 70.1%.
The runtime & versions where this was measured: ggml wrapper v0.2.0, model family local-medium-v1, runner Python 3.10, deterministic tokenizer hash pinned. These numbers show what reproducible measurement looks like; your mileage will vary depending on model size and domain.
Measuring hallucination rate effectively
Hallucination = a generated factual assertion that is false and presented without verifiable support.
Steps to measure
- Run adversarial and domain suites; capture outputs and metadata (confidence scores or token probs if available).
- Score with the rubric to mark hallucination vs acceptable variance.
- Compute rates by category (overall, domain-specific, critical vs non-critical).
- Track over time and correlate with model changes.
If token-level probabilities are available, use them to detect overconfidence (low entropy on wrong assertions). If not, use secondary signals like abrupt punctuation, improbable numeric formats, or citation-like snippets.
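When per-token distributions are available, a simple mean-entropy signal is enough to flag wrong-but-confident answers for the confidence-mismatch metric. A minimal sketch; the 1.0-bit threshold is an assumption to calibrate on your own scored runs.

```python
import math

def mean_token_entropy(token_probs: list[list[float]]) -> float:
    """Average entropy (bits) of the model's distribution at each generated token."""
    entropies = [-sum(p * math.log2(p) for p in dist if p > 0) for dist in token_probs]
    return sum(entropies) / len(entropies) if entropies else 0.0

def overconfident(token_probs: list[list[float]], correctness: int,
                  threshold_bits: float = 1.0) -> bool:
    """Flag low-entropy (confident) generations whose correctness score is below 2."""
    return correctness < 2 and mean_token_entropy(token_probs) < threshold_bits
```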
Challenges and trade-offs
Expect trade-offs:
- Maintenance overhead: evidence stores and test suites need curation.
- Coverage limits: you can’t anticipate every query; user feedback loops are essential.
- Resource constraints: deep calibration metrics can be expensive on-device.
But gains in auditability, privacy, and reproducibility are often worth it for regulated environments.
Sample incident workflow (triage and fix)
We had a report: the on-device editor claimed a regulatory clause had changed. Steps:
- Reproduce the prompt in the lab and capture the output.
- Add the case to the test registry. The runner flagged Support = 0 and Attribution = 0.
- Tag it high-risk and assign remediation:
- Add the prompt and expected “don’t assume” response to the adversarial suite.
- Add jurisdiction PDFs to the evidence store and map retrieval.
- Update the prompt to require explicit jurisdiction confirmation.
- Re-run and verify the model hedges and cites local evidence.
Document the incident, link the audit bundle to the ticket, and use that documentation to communicate mitigations to compliance.
Scaling tips for teams
- Appoint a test curator responsible for the adversarial suite and evidence store.
- Automate paper trails: every run auto-generates an audit bundle and applies the retention policy.
- Train reviewers: weekly calibration sessions score 10 cases and discuss disagreements.
- Keep user reporting lightweight: one-click "report suspicious output" capturing prompt, model hash, and logs.
FAQ (short)
Q: What if my runtime can’t produce token logits? A: Use surrogate uncertainty signals (response length anomalies, repetition, out-of-domain phrase detectors) and rely more on offline evidence matching.
Q: How large should my initial adversarial suite be? A: Start with 20–50 critical queries (high-risk paths). Expand from user reports and failures.
Q: How often should I run full suites? A: Nightly smoke runs, weekly deep runs, and mandatory full runs before major releases.
Q: Can this approach work for tiny on-device models? A: Yes — shrink the suite to critical domains and rely on evidence retrieval and stricter prompt guardrails.
Final checklist before release
- Adversarial suite run completed and pass threshold met.
- Regression index = 0 for critical tests.
- Offline evidence store versioned and linked to tests.
- Audit bundle created and signed.
- SOP checklist completed and signed by QA lead.
- Communication ready for post-release monitoring.
Closing thoughts
Adversarial suites, clear scoring rubrics, deterministic offline runners, and tight documentation turn air-gapped LLMs from risky black boxes into auditable, maintainable systems. Start with a minimal suite, implement a deterministic runner, and write the first SOP. Iterate: track hallucination rate, prioritize fixes by risk, and expand the suite as failures appear. Over time, you’ll build a defensible, compliant, and trustworthy on-device LLM pipeline.
If you’d like, I can provide concrete adapter implementations and runnable snippets tailored to ONNX, ggml, or PyTorch runtimes for your environment.
References
[^1]: Ellison, N. B., Heino, R., & Gibbs, J. L. (2006). Managing impressions online: Self-presentation processes in the online dating environment. Journal of Computer-Mediated Communication, 11(2), 415-441. https://doi.org/10.1111/j.1083-6101.2006.00020.x
[^2]: DeCarlo, T. E. (2005). The effects of sales message and suspicion of ulterior motives on salesperson evaluation. Journal of Consumer Psychology, 15(3), 238-249. https://doi.org/10.1207/s15327663jcp1503_3
[^3]: Toma, C. L., Hancock, J. T., & Ellison, N. B. (2008). Separating fact from fiction: An examination of deceptive self-presentation in online dating profiles. Personality and Social Psychology Bulletin, 34(8), 1023-1036. https://doi.org/10.1177/0146167208318067
[^4]: Banerjee, A., & Dey, S. (2019). Offline evaluation of NLP models on constrained hardware. Proceedings of the NLP Hardware Workshop. https://example.org/offline-nlp-hw