
Air‑gapped Proofreading Pipeline for Agencies

I still remember the first time a client told me their legal team would only accept deliverables that had never touched the internet. At first it felt like a throwback — paranoia from the early cloud era — but that constraint quickly became clarifying: it forced process, accountability, and tools that behave deterministically. This guide is the result of that work. It shows how to build an air-gapped proofreading and QA pipeline using local LLMs or deterministic linters, secure transfer checkpoints, offline version control, and audit-ready logs for billable revisions.

Why an air-gapped approach matters for agencies

If your agency handles legal, financial, healthcare, or government content, cloud-based proofreading can be a non-starter. Beyond policy, there's real business risk: a misconfigured integration or a contractor auto-syncing a folder to a personal cloud can leak months of drafts. An air-gapped workflow reduces the attack surface by design. It's not simple, but it's defensible and auditable.

This isn’t fear-driven. It’s about predictable, reproducible outcomes. Local LLMs add context-aware edits without external calls. Deterministic linters guarantee repeatable checks. Combined with operational controls, you get a system you can certify to a client or auditor.

Overview: seven steps to an air-gapped proofreading pipeline

These are the high-level steps I use with small and midsize agencies. Each step mixes tooling with operational controls so teams stay confident, fast, and billable.

  • Set up local proofreading tools (LLMs or linters)
  • Establish secure data transfer checkpoints
  • Build version control without cloud commits
  • Develop your offline QA review workflow
  • Generate audit-ready logs
  • Export data with controlled access
  • Train staff and update tools offline

Below are practical, reproducible runbooks, costs, and a minimal hardware bill of materials so you can get started quickly.

Set up local proofreading tools (practical runbook)

Choose tools deliberately. I recommend a hybrid approach: deterministic linters for precision (citations, numbers, brand rules) and local LLMs for readability, tone, and high-level suggestions.

What “local” means:

  • Models running on hardware you control (on-prem or locked VM). Examples: open-source LLMs packaged for offline use or vendor on-prem deployments that do not phone home.
  • Command-line binaries installed with checksum-verified installers.

Concrete tool suggestions (verify current maintenance status before deploying):

  • Deterministic linters: Vale (style/linting), LanguageTool (offline dictionaries), and firm-specific regex scripts.
  • Local LLMs: small open models (e.g., Llama 2 derivatives, Mistral small variants) packaged via container or binary for offline inference.

Note: tool maintenance and version compatibility change over time. Verify each project's activity and last release date before adoption, and record the date and source of that verification in your runbook.

Operational controls for the tool layer

  • Always verify signed binaries and checksums. Example commands (Linux/macOS):

    • Download: curl -O https://example.com/tool.tar.gz
    • Verify checksum: shasum -a 256 tool.tar.gz
    • Verify signature: gpg --verify tool.tar.gz.sig tool.tar.gz
  • Keep model artifacts on encrypted drives inside the air-gapped network.

  • Provide a read-only canonical style guide repository (Git) for deterministic checks.

Local LLM inference example (PyTorch/Transformer wrapper)

  • Run inference on an isolated machine (example using a hypothetical local binary local-llm):

    local-llm --model /mnt/models/compact-llm --input document.txt --output suggestions.json --max-tokens 512

  • Expected output: suggestions.json (JSON with suggestion IDs, location, confidence score). Store these outputs in the Git repo as artifacts for traceability.
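As a sketch of how that artifact can be consumed downstream, here is a small filter that keeps only high-confidence suggestions for editor review. The JSON schema (field names "id", "location", "confidence") is an assumption based on the description above; adapt it to whatever your wrapper actually emits.

```python
import json

def filter_suggestions(path, min_confidence=0.8):
    """Load a suggestions.json artifact and keep only high-confidence entries.

    Assumed (hypothetical) schema: a JSON array of objects, each with
    "id", "location", and "confidence" fields.
    """
    with open(path) as fh:
        suggestions = json.load(fh)
    return [s for s in suggestions if s.get("confidence", 0) >= min_confidence]
```

Low-confidence suggestions still live in the stored artifact for traceability; filtering only changes what editors see first.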

Establish secure data transfer checkpoints (runbook)

Transfer kiosks and explicit gates reduce human error.

Secure import pattern (step-by-step):

  1. Receive media. Record manifest and sender identity on an intake form (paper plus a digitized, signed file stored offline).
  2. On a dedicated transfer kiosk (never online), compute checksums and scan for malware:

     • Compute SHA-256: shasum -a 256 /media/drive/incoming.zip > incoming.sha256
     • Verify against the manifest if provided: shasum -a 256 -c incoming.sha256
     • Run an offline antivirus scan (placeholder; confirm your vendor's offline signature update process) and check its exit code rather than discarding it: clamscan -r /media/drive

  3. Mount read-only when possible and copy to a quarantined workspace.
  4. Record transfer metadata (sender, time, media serial, checksums) in the intake log (CSV or structured JSON).
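Steps 2 and 4 can be scripted so the checksum and the log entry come from the same code path, which removes one source of transcription error. A minimal sketch (function and field names are my own; align them with your intake form):

```python
import datetime
import hashlib
import json

def sha256_file(path):
    """Stream-hash a file so large media never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def record_intake(log_path, media_path, sender, media_serial, reviewer_id):
    """Append a structured intake entry to a JSON-lines log and return it."""
    entry = {
        "sender": sender,
        "received_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "media_serial": media_serial,
        "sha256": sha256_file(media_path),
        "reviewer_id": reviewer_id,
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

Appending one JSON object per line keeps the log greppable and makes later tamper-evidence (hashing the whole file) trivial.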

Operational checklist for imports (short)

  • Verify sender identity and file manifest.
  • Mount read-only; copy to quarantined workspace for validation.
  • Record metadata: sender, date/time, media serial, checksums, reviewer ID.
  • No intermediary syncs: no network shares that are backed up externally.

Build version control without cloud commits (commands and examples)

Git decouples cleanly from the cloud. You can host a bare repo on an air-gapped server or exchange signed bundles.

Air-gapped Git server (bare repo) quick setup:

  • Initialize bare repo on server: git init --bare /srv/git/client-project.git
  • Locally: git remote add airgap ssh://user@airgap-host/srv/git/client-project.git
  • Push: git push airgap main

Signed commits & GPG (examples):

  • Generate an offline GPG key on the air-gapped machine:

    gpg --full-generate-key

    (Choose RSA 4096, set expiry if desired.)

  • Configure Git to sign commits:

    git config --global user.signingkey <your-gpg-key-id>
    git config --global commit.gpgsign true

  • Make a signed commit:

    git add .
    git commit -m "CLIENT123: Draft update; 1.2 billable hours" --author="Editor Name <editor@example.com>"

  • Example verify: git log --show-signature -1

Exchange via git bundle (if no server):

  • Create bundle: git bundle create client-patch.bundle --all
  • Transport bundle on encrypted media. On recipient machine: git clone client-patch.bundle -b main client-project

Version metadata to capture (storable in a JSON manifest alongside commits)

  • commit_hash, author_name, author_email, gpg_signature, client_id, project_code, billable_hours

  • Example manifest entry:

    {
      "commit_hash": "a1b2c3...",
      "author": "editor@example.com",
      "gpg_signed": true,
      "client_id": "CLIENT123",
      "billable_hours": 1.2
    }

Develop an offline QA review workflow (practical flow)

A sample, deterministic workflow I’ve run in production:

  1. Intake & quarantine (checksum recorded).
  2. Baseline runs: deterministic linters produce a report.
  3. Contextual LLM pass: suggestions exported as JSON.
  4. Editor review: accept/reject recorded per suggestion.
  5. Legal review & finalize.
  6. Package artifact, generate hashes, and stage for export.
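One cheap control worth scripting is the stage order itself, so a document cannot reach export without passing every gate. A minimal sketch (stage names mirror the six steps above; this is an illustration, not a published tool):

```python
# QA stages, in the only order a document may traverse them.
STAGES = ["intake", "lint", "llm_pass", "editor_review", "legal_review", "export"]

def advance(current):
    """Return the next QA stage, refusing to skip or repeat steps."""
    i = STAGES.index(current)  # raises ValueError for unknown stages
    if i == len(STAGES) - 1:
        raise ValueError("document is already at the final stage")
    return STAGES[i + 1]
```

A guard this small still pays off: the pipeline driver calls advance() instead of hard-coding stage names, so a skipped gate fails loudly instead of silently.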

Example linter run (Vale):

  • vale --output=JSON document.md > vale-report.json

Editors should record accept/reject with a short command-line helper (example):

  • record-action --suggestion-id 12345 --action accepted --editor-id editor01 --time "2025-01-10T13:05:00Z" --signature gpg-sig
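record-action here is a hypothetical helper, not a published tool; the core of it is just an append-only JSON-lines log. A sketch of that core in Python (GPG signing of the log file happens as a separate step, as in the audit-log section):

```python
import datetime
import json

def record_action(log_path, suggestion_id, action, editor_id):
    """Append one accept/reject decision to an append-only JSON-lines log."""
    if action not in ("accepted", "rejected"):
        raise ValueError("action must be 'accepted' or 'rejected'")
    entry = {
        "suggestion_id": suggestion_id,
        "action": action,
        "editor_id": editor_id,
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```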

Generate audit-ready logs (commands & tamper-evidence)

What to log:

  • Transfer metadata (intake CSV), tool outputs (linter JSON, LLM suggestion files), editorial actions, exports.

Make logs tamper-evident: Merkle-style manifest example (Python snippet)

  • Python script to build a Merkle chain of file hashes (minimal example):

    import hashlib

    def sha256(data):
        return hashlib.sha256(data).hexdigest()

    def merkle_hash(hashes):
        if len(hashes) == 1:
            return hashes[0]
        if len(hashes) % 2 == 1:
            hashes.append(hashes[-1])
        new = []
        for i in range(0, len(hashes), 2):
            new.append(sha256((hashes[i] + hashes[i+1]).encode()))
        return merkle_hash(new)

Example usage:

    files = ["final.pdf", "vale-report.json", "llm-suggestions.json"]
    hashes = []
    for f in files:
        with open(f, 'rb') as fh:
            hashes.append(sha256(fh.read()))
    root = merkle_hash(hashes)
    print('Merkle root:', root)

Expected output: Merkle root (hex). Store root and timestamp in an append-only log and sign with GPG.

Sign log example:

  • gpg --armor --output log.sig --detach-sign log.json

Immutable storage options:

  • Periodically write concatenated snapshots to WORM media or append-only files. Keep copies hashed and signed.

Billing-friendly logs

  • Capture revision counts and editor time as structured fields (CSV/JSON). Example CSV columns: commit_hash, editor_id, timestamp, billable_hours, short_description.
  • Example reconciliation: a client was billed $75 per revision; CSV showed 12 billable revisions => $900. Attach commit hashes to each revision for audit.
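That reconciliation is easy to automate from the CSV itself. A sketch (column names match the example above; the function name is my own):

```python
import csv
import io

def reconcile(csv_text, rate_per_revision):
    """Count billable revisions and total hours from the revision CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total_hours = sum(float(r["billable_hours"]) for r in rows)
    return {
        "revisions": len(rows),
        "hours": total_hours,
        "invoice": len(rows) * rate_per_revision,
    }
```

At $75 per revision, twelve rows reconcile to the $900 figure in the example, and each row carries a commit hash the client can verify independently.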

Export data with controlled access (runbook)

Egress steps I use:

  1. Freeze repository state with a signed tag:

git tag -s v1.0-final -m "Final release for CLIENT123"
git show-ref --tags

  2. Package artifact with logs and checksums; compute SHA-256 for the bundle:

tar -czf CLIENT123-final.tar.gz final.pdf logs/ && shasum -a 256 CLIENT123-final.tar.gz > CLIENT123-final.tar.gz.sha256

  3. Copy to encrypted physical media with a tamper-evident seal.
  4. Hand off via an authenticated channel: in-person, registered courier, or encrypted shipment with a pre-shared passphrase.

If client provides a public key, encrypt prior to transfer:

  • gpg --output CLIENT123-final.tar.gz.gpg --encrypt --recipient client@example.com CLIENT123-final.tar.gz

Maintain an export register with dual sign-off for high-risk files.

Train staff and update local models and linters (practical cadence)

Training and maintenance are non-negotiable.

Training focus:

  • Transfer protocol drills with dummy files.
  • Paired reviews so editors understand automated suggestions.
  • Incident response for checksum mismatch or unknown media.

Offline updates (runbook):

  • Vendors provide signed update packages. Transfer via encrypted media and verify signature before installing.
  • Staged updates: install to a staging kiosk, run regression tests against representative documents, then promote to production.
  • Keep a patch manifest: tool, version, install_date. Example entry: {"tool":"LanguageTool","version":"6.6","installed":"2025-02-01"}
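With one JSON object per install appended in order, answering "what version is running in production?" becomes a single scan of the manifest. A sketch (manifest format as in the example entry above; the helper is my own):

```python
import json

def current_version(manifest_path, tool):
    """Return the most recently installed version of a tool, or None.

    Reads a JSON-lines patch manifest: one {"tool", "version", "installed"}
    object per line, appended in install order, so later lines win.
    """
    version = None
    with open(manifest_path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if entry["tool"] == tool:
                version = entry["version"]
    return version
```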

Practical costs, scaling, and expected outcomes

Minimal hardware bill of materials (small vs. large teams)

Small team (1–3 editors) - minimal setup:

  • Lockable workstation (Intel i7 or AMD Ryzen 7, 32GB RAM, 1TB NVMe): ~$1,500–$2,500
  • Encrypted external SSD (1TB with hardware encryption): $150
  • USB transfer kiosk (locked case + OS image): $300
  • GPU (optional for faster LLM inference): NVIDIA RTX 4060 or equivalent: $400–$700
  • HSM / GPG hardware token (YubiKey or similar): $50–$100

Estimated small-team total: ~$2k–$4k (without enterprise HSM)

Medium/Large (10+ editors) - scale-up:

  • Multiple workstations and transfer kiosks: $8k–$15k
  • Dedicated on-prem server (2x GPUs, 64–128GB RAM, 4TB storage): $8k–$20k depending on GPUs
  • Enterprise HSM or dedicated key management: $5k–$20k
  • Tamper-evident packaging and chain-of-custody services: $1k–$3k annually

Estimated scaling total: $20k–$50k initial, then ongoing maintenance and personnel costs.

Quantified outcomes from projects I’ve run (typical ranges):

  • Time-to-onboard initial workflow: 2–6 weeks (policy, kiosks, basic tooling, training).
  • Throughput per editor (after steady state): 5–12 typical legal pages/day for high-review documents; up to 20–30 for light editorial passes. LLM/linters reduce review time by ~20–40% depending on doc type.
  • Billing reconciliation example: client billed per revision at $75/revision. In one project we delivered 12 billed revisions over 3 weeks: invoice = 12 * $75 = $900 plus fixed retainer.
  • Latency: local LLM inference (compact models) typical per-document latency: 2–20 seconds per prompt on GPU; 10–120 seconds on CPU-only machines depending on size.

Compliance and caveats

  • GDPR/HIPAA: air-gapping does not remove compliance obligations. Ensure data minimization, access controls, documented consent, and retention policies. Record processing activities and model-training logs if you ever fine-tune on client content.
  • Vendor claims & version notes: always check each suggested tool's maintenance status and last release (e.g., offline antivirus signature update methods vary by vendor). Flag vendor/version-specific steps in your runbook and annotate with the date and source of verification.

Common pitfalls and how to avoid them

  • Treating the air-gap as a checkbox. Prevention: schedule audits and tabletop exercises.
  • Silent automated rewrites that surprise clients. Prevention: require editor approval for automated changes above a trivial threshold (e.g., >2% of document length).
  • Poor chain-of-custody. Prevention: tamper-evident seals and dual sign-off for exports.
  • Stale malware signatures. Prevention: include signature update in monthly maintenance and track update manifests.
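The approval threshold for automated rewrites is straightforward to enforce in code. A sketch using difflib (the 2% figure and the character-level measure are policy choices, not a standard):

```python
import difflib

def change_ratio(original, revised):
    """Fraction of the document changed, by character-level similarity."""
    return 1.0 - difflib.SequenceMatcher(None, original, revised).ratio()

def needs_editor_approval(original, revised, threshold=0.02):
    """Flag automated rewrites that alter more than the trivial threshold."""
    return change_ratio(original, revised) > threshold
```

Running this check before committing an automated pass guarantees that large rewrites always pass through a human editor.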

Final thoughts

Building an air-gapped proofreading pipeline is an investment in trust. The first time I presented a signed export manifest and commit-backed revision history to a compliance officer, the tension in the room evaporated. Clients who need this level of assurance will pay for it — not just in hardware, but in process maturity.

Start small: focus first on predictable versioning and airtight transfer checkpoints. Those two controls buy you most benefits quickly. Then add local LLMs for contextual checks, layered logging for auditability, and scale hardware as throughput demands grow.

Good process is itself a security control. When tools, people, and checkpoints align, confidentiality becomes an operational habit rather than a fragile promise.

One near miss sticks with me: I once caught a contractor syncing a shared folder to a personal cloud during intake. A quick checksum audit revealed the sync; we halted the handoff and tightened kiosk procedures the same day, saving the client an avoidable exposure.

Early on, a small client asked for an "air-gapped" proofing workflow on a tight timeline. I set up a locked workstation, a simple transfer kiosk, and a bare Git server on an isolated machine. The first week was awkward: editors tripped over manual manifests and forgot to sign GPG commits. I instituted a short checklist and paired every new editor with a senior reviewer for three days. By week three, commit signatures were routine, the intake manifest matched every bundle, and the team could produce a signed export in under an hour. The client's legal team ran through the logs and signed off within two days. That project taught me two things. First, people are the hardest part of an air gap; the tech is straightforward. Second, small operational frictions, like a one-line checklist, are worth adding if they prevent a costly mistake later.




