---
title: 'Disaster Recovery for Private AI Stacks'
meta_desc: 'Operational playbook for disaster recovery of private AI stacks: backing up models and keys, air-gapped restores, staged failovers, and preserving legal evidence.'
tags: ['disaster-recovery', 'ai-ops', 'security', 'backup', 'incident-response']
date: '2025-11-07'
draft: false
canonical: 'https://protext.app/blog/disaster-recovery-private-ai-stacks'
coverImage: '/images/webp/disaster-recovery-private-ai-stacks.webp'
ogImage: '/images/webp/disaster-recovery-private-ai-stacks.webp'
readingTime: 10
lang: 'en'
---

Disaster Recovery for Private AI Stacks: A Practical, Operational Playbook

Disaster recovery for private AI stacks: models, keys, and air-gapped restores.

I’ve been in war rooms where the lights dim, teams scramble, and leaders ask the same two questions: “Can we still serve customers?” and “Is our intellectual property safe?” Private AI stacks change the calculus. They don’t just host databases and VMs; they house trained models (often the crown jewels), encryption keys, and regulated data you can’t afford to lose.

This playbook is built from hands-on recoveries, post-incident reviews, and sleepless coordination. It’s meant to be operational: backups that actually restore, air-gap rescues that get offline inference back online, and failovers that preserve forensic evidence.

Why private AI stacks need specialized DR

Traditional DR focuses on disks, databases, and network routes. AI broadens the scope rapidly.

Models are both asset and liability. A stolen model is IP loss; a corrupted model is a silent failure that poisons downstream decisions. Keys that unlock model weights or decrypt training data are high-value targets. And AI systems depend on complex runtime graphs—CUDA drivers, exact Python packages, scheduler configs—so storage snapshots alone are often insufficient.

Common failure modes I’ve seen: ransomware encrypting backups due to missing immutability, and rushed restores that fail inference because runtime artifacts were overlooked. Treat models, keys, and environment metadata as first-class citizens in DR planning.

Disaster recovery for AI is about continuity, evidence preservation, and protecting IP. Plan for all three.

What to back up (and why)

Inventory everything that must be recoverable. I break AI artifacts into five practical categories:

  • Model artifacts: weights, tokenizers, config, optimizer state (if resuming training). Use versioned immutable snapshots and store checksums; a checksum-and-upload sketch follows this list. Save both native framework checkpoints and a standard serialized export (e.g., ONNX/TorchScript) when feasible.
  • Training & inference data: raw ingest, feature stores, curated validation sets. For streaming or continuous training, capture ingestion offsets and checkpoint pointers.
  • Encryption keys & secrets: HSM-wrapped keys, wrapped secret blobs, and rotation logs. Backups without keys are useless; keys are first-class DR artifacts.
  • Environment metadata: OS images, driver versions (CUDA/cuDNN), container images, package manifests. Store a reproducible builder script and a minimal bootstrap container.
  • Logs & audits: provenance metadata, pipeline run metadata, and access logs—required for incident response and legal hold.
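
To make the first item concrete, here is a minimal sketch of checksumming model artifacts and pushing a snapshot to object-locked storage. The file paths, the dr-model-snapshots bucket, the llm-v42 prefix, the retention date, and the choice of the aws CLI are all assumptions; any S3-compatible store with object lock works the same way.

# Checksum the artifacts needed for restore (paths and names are placeholders)
sha256sum model/weights.safetensors model/tokenizer.json model/config.json > model/manifest.sha256

# Push a versioned snapshot; object lock keeps it immutable for the retention window
aws s3 cp model/ s3://dr-model-snapshots/llm-v42/ --recursive
aws s3api put-object-retention --bucket dr-model-snapshots --key llm-v42/weights.safetensors \
  --retention '{"Mode":"COMPLIANCE","RetainUntilDate":"2026-02-01T00:00:00Z"}'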

Assign RTO/RPO per artifact. Inference-serving models often need RTOs of minutes to an hour; retrainable checkpoints can tolerate days. Document and test against those targets.

Concrete metric from an ops run: I recovered a 280 GB LLM and its tokenizer into a sealed inference scaffold in 3.5 hours with a 4-person recovery team, meeting a 4-hour RTO target for critical inference.

Backup strategies that survive attacks

Layer backups with distinct threat profiles:

  1. Live (nearline): frequent snapshots to object storage with immutability (object lock/WORM). Fast restores but not the only copy.
  2. Cold (offline/air-gapped): periodic exports to encrypted physical media or an isolated write-once repository. Last line against ransomware.
  3. Geographic diversity: copies in multiple jurisdictions if compliance permits. Flag cross-border legal risk; if you can’t move data, replicate within allowed regions.
  4. Versioning & immutability: retain model snapshots immutable for a window aligned to legal hold and business needs.
  5. Metadata-first backups: capture a small JSON with environment hashes and a minimal integration test that must pass on restore.
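
A minimal sketch of that metadata-first idea: a small manifest written beside the snapshot plus the one integration test that must pass on restore. Every field value, the registry.internal image name, and the localhost:8080 endpoint with its ping/pong golden pair are assumptions to replace with your own.

# Write a small restore manifest next to the snapshot
cat > restore-manifest.json <<'EOF'
{
  "model": "llm-v42",
  "weights_sha256": "replace-with-real-hash",
  "cuda": "12.4",
  "driver": "550.54",
  "container_image": "registry.internal/llm-server:2025-11-01",
  "golden_prompt": "ping",
  "golden_expect": "pong"
}
EOF

# The one test that must pass on restore before the snapshot is trusted
curl -s http://localhost:8080/v1/completions -d '{"prompt":"ping"}' | grep -q "pong" && echo "restore verified"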

For very large models (>1 TB), use incremental block-level backups, deduplicated object stores, and chunked shards so you can restore just the needed parts.
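
A hedged sketch of the chunked approach, assuming GNU split and an arbitrary 8 GB shard size:

# Split oversized weights into 8 GB shards and checksum each shard
split -b 8G -d weights.safetensors weights.part.
sha256sum weights.part.* > weights.parts.sha256

# On restore: verify only the shards you pulled, then reassemble
sha256sum -c weights.parts.sha256
cat weights.part.* > weights.safetensors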

Preserving and restoring encryption keys (practical)

Not every team has an HSM, so here are options for both large and small organizations.

  • Recommended (enterprise): HSM-backed key stores. HSMs export wrapped keys under policy—document unwrapping steps for air-gapped restores.
  • Budget alternative: split-seal key escrow using multiple encrypted USB tokens stored in separate physical safes and guarded by two-person control.
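
One way to implement the split-seal option, sketched with the ssss Shamir secret-sharing utility and gpg; the 2-of-3 threshold, the ESCROW_PASSPHRASE variable, and the file names are assumptions, and each share would live on its own encrypted USB token under two-person control.

# Split the escrow passphrase into 3 shares; any 2 reconstruct it (one share per USB token)
echo -n "$ESCROW_PASSPHRASE" | ssss-split -t 2 -n 3 -q

# Seal the wrapped key under that passphrase for offline storage
gpg --symmetric --cipher-algo AES256 -o wrapped_key.bin.gpg wrapped_key.bin

# Recovery under two-person control: combine two shares, then decrypt
ssss-combine -t 2
gpg --decrypt -o wrapped_key.bin wrapped_key.bin.gpg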

Practical rules:

  • Periodically export wrapped keys to an offline escrow and document the unwrapping procedure separately.
  • Automate rotation with immutable audit logs and trigger backups when keys rotate.
  • Test key restores in drills—backing up a key means nothing if you can’t unwrap it when offline.

Mini-playbook: verifying and unwrapping a wrapped key in an air-gapped environment (reproducible)

Assumptions: you have a wrapped key file (wrapped_key.bin), a detached signature (wrapped_key.sig), and a vendor tool or RSA private key available in the air-gapped environment. Commands below are reproducible for a software-managed wrapped key; vendor HSMs will replace the unwrap step with a pkcs11/vendor CLI.

Prereqs: openssl 3.0+, sha256sum, gpg (optional for signatures).

  1. Verify checksum and signature

sha256sum -c wrapped_key.sha256

If using GPG signatures

gpg --verify wrapped_key.sig wrapped_key.bin

  2. Decrypt/unpack (software unwrap example)

openssl cms -decrypt -inform DER -in wrapped_key.bin -inkey /path/to/private.pem -out unwrapped_key.pem

(-inform DER assumes the wrapped blob is DER-encoded; drop it if your escrow exported S/MIME, which is the openssl cms default.)

Vendor HSM variant (example, replace with vendor CLI)

pkcs11-tool --module /usr/lib/your-hsm.so --unwrap --in wrapped_key.bin --out unwrapped_key.pem --label "recovery-key"

  3. Verify the unwrapped key works (example: round-trip a test encryption)

openssl pkey -in unwrapped_key.pem -pubout -out pub.pem

echo "test" > /tmp/dr-check.txt

openssl pkeyutl -encrypt -pubin -inkey pub.pem -in /tmp/dr-check.txt -out /tmp/dr-check.enc

openssl pkeyutl -decrypt -inkey unwrapped_key.pem -in /tmp/dr-check.enc | grep -q test && echo "key OK"

Important: keep unwrap keys and private material in encrypted media under two-person control. Replace software private.pem with an HSM operation in production.

Air-gap rescues: get inference back safely

Pre-plan an air-gapped rescue environment:

  • Pre-staged hardware or VM images with exact drivers and containers. Keep sealed or on an isolated subnet accessible only during incidents.
  • Clear media handling procedures: authorized transporters, checksum verification locations, and chain-of-custody logs.
  • Minimal inference scaffold: model server binary, small orchestrator, and basic monitoring. Serve predictions and log everything.
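
A minimal sketch of that scaffold, assuming a container runtime is pre-staged on the rescue hardware; the image tarball, the registry.internal/llm-server tag, the 8080 port, and the ping/pong golden check are placeholders for whatever your server actually exposes.

# Load the pre-staged server image from verified media (no registry inside the air gap)
docker load -i /media/restore/llm-server-2025-11-01.tar

# Start the model server read-only against the restored weights, logging locally
docker run -d --gpus all --name rescue-llm \
  -v /restore/model:/model:ro -p 8080:8080 \
  registry.internal/llm-server:2025-11-01

# Golden-prompt gate: no client traffic until this passes
curl -s http://localhost:8080/v1/completions -d '{"prompt":"ping"}' | grep -q "pong" && echo "scaffold ready"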

Execution checklist (short):

  1. Declare incident and start chain-of-custody.
  2. Verify backups against stored checksums/signatures.
  3. Restore HSM-wrapped keys into the air-gapped HSM with two-person control.
  4. Boot staged hardware, inject models/configs, run the minimal integration tests from metadata.
  5. Start inference only after verification. Re-route client traffic using an isolated admin network or trusted DNS controls.

Don’t shortcut chain-of-custody; evidence preservation matters.

Staged failover model

Balance availability and investigation with three stages:

  • Stage 1 — Quarantine Mode: Route requests to read-only environments. Log everything; no writes.
  • Stage 2 — Cold Standby: Bring up immutable backups in a minimized environment.
  • Stage 3 — Promote: If checks pass, promote to active, with manual sign-off required for the final cutover.

Automate the mechanics, but require human approval for production promotion; a minimal gate is sketched below. For Kubernetes users: use separate tenant infrastructure for true forensic separation rather than K8s namespaces alone.
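
A minimal sketch of that human gate: automation stages everything, and the cutover only runs after an operator confirms. The promote.sh script and INCIDENT_ID variable are hypothetical stand-ins for whatever performs your Stage 3 promotion.

# Stage 3 gate: automation prepares the cutover, a human confirms it
read -r -p "Promote cold standby to ACTIVE? Type the incident ID to confirm: " confirm
if [ "$confirm" = "$INCIDENT_ID" ]; then
  ./promote.sh --from cold-standby --to active   # hypothetical promotion script
else
  echo "Promotion aborted; standby stays quarantined." >&2
  exit 1
fi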

Testing: drills that teach

Drills should be realistic: limited staff, partial backups, degraded networks. My cadence: annual full drills and quarterly tabletops.

Compact drill agenda:

  • Pre-drill: verify inventories, manifests, and immutability.
  • Execution: simulate ransomware, do an air-gap rescue for one critical model, and serve for an hour.
  • Validation: golden tests for correctness and latency (a sketch follows this list); collect logs.
  • Post-mortem: blameless review and playbook updates.
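
A hedged sketch of one golden test covering both correctness and a latency budget; the endpoint, the expected substring, and the 2-second budget are placeholders for your own golden set.

# Golden test: correctness (expected substring) plus a latency budget
start=$(date +%s%N)
out=$(curl -s http://localhost:8080/v1/completions -d '{"prompt":"golden-case-1"}')
elapsed_ms=$(( ( $(date +%s%N) - start ) / 1000000 ))

echo "$out" | grep -q "expected-answer" || { echo "correctness FAIL"; exit 1; }
[ "$elapsed_ms" -le 2000 ] || { echo "latency FAIL: ${elapsed_ms}ms"; exit 1; }
echo "golden test passed in ${elapsed_ms}ms"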

Include legal and compliance in at least one drill per year.

A drill is only as good as the changes you make afterward.

Legal hold and compliance

Legal holds can override retention policies. Practical steps:

  • Map artifacts under hold and tag backups at creation with legal-hold metadata (an object-lock sketch follows this list).
  • Preserve an immutable audit of backup/restore operations.
  • Coordinate with legal before air-gap rescues when possible; if immediate action is needed, document and preserve copies.
  • Flag cross-border issues early—don’t assume you can move backups across jurisdictions.
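
On S3-compatible object storage, the first two items can be a single call at backup time; the bucket, key, and audit-log path below are placeholders, and other stores have equivalent legal-hold or WORM flags.

# Apply a legal hold to the backup object at creation (independent of retention windows)
aws s3api put-object-legal-hold \
  --bucket dr-model-snapshots --key llm-v42/weights.safetensors \
  --legal-hold Status=ON

# Append who applied the hold and when to the immutable audit trail
echo "$(date -u +%FT%TZ) legal-hold ON llm-v42/weights.safetensors $USER" >> /var/log/dr-audit.log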

Operational tips and common pitfalls

  • Don’t assume "it worked in dev"—dev lacks HSMs and immutable locks.
  • Automate checksum and signature validation; humans are error-prone under stress.
  • Keep restore scripts and unwrap instructions offline and version-controlled.
  • Use two-person control for key unwraps, media transport, and promotions.
  • Track dependencies as code: Dockerfiles, package lists, and driver versions.
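
A minimal sketch of capturing that state as restorable metadata; adjust for your package manager, and treat the image tag and output paths as assumptions.

# Capture the runtime a model server actually depends on
mkdir -p env
nvidia-smi --query-gpu=driver_version --format=csv,noheader > env/driver.txt
pip freeze > env/requirements.lock
dpkg -l | grep -Ei 'cuda|cudnn' > env/cuda-packages.txt
docker image inspect registry.internal/llm-server:2025-11-01 --format '{{.Id}}' > env/image-digest.txt

# Store the manifest next to the model snapshot so a restore can rebuild the exact environment
sha256sum env/* > env/manifest.sha256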

Alternatives for resource-constrained teams: use cloud-managed key escrow with manual recovery gates, or low-cost split-seal USB tokens instead of HSMs.

A short micro-moment

I once unwrapped a recovery key on a laptop in a sealed room and realized the wrong driver version broke the model server—ten minutes of checksum and version checks saved a full hardware rollback.

Personal anecdote

Early in my ops career I led a midnight recovery when a snapshot pipeline failed and ransomware hit a downstream backup bucket. We had immutable nearline copies but no offline key escrow. I coordinated a two-person key export, drove encrypted USBs to a secure facility with chain-of-custody forms, and staged an air-gapped machine with the exact CUDA stack. The first restore attempt failed—an obscure driver mismatch—and we had to pivot to a different staged image. In total it took about 6 hours, and we missed the initial RTO. What mattered afterward wasn't the time lost but the changes we made: documented unwrap steps, a reproducible minimal inference test, and a dedicated stash of offline keys with clear two-person procedures. That incident taught me to plan for human friction and to reify assumptions into scripts and checklists. The playbook above is the result of those changes: shorter restore paths, fewer surprises, and evidence trails that survived legal review.

Conclusion: make DR part of your AI culture

DR for private AI stacks is a discipline: people, processes, and practice. Start by inventorying models and keys, set clear RTO/RPO targets, and build layered backups with an air-gapped last line of defense. Run realistic drills, involve legal, keep procedures simple and verifiable. In an incident, clear repeatable steps buy time, preserve evidence, and protect your most valuable AI assets.

If you take one thing away: treat models and keys as first-class citizens in DR planning.


Appendix: Annual DR Drill Checklist (compact)

  • Inventory and contact verification completed
  • Immutable backups verified and checksum validated
  • Offline key escrow test passed
  • Air-gapped hardware boot and minimal inference test successful
  • Quarantine mode routing test executed
  • Legal hold tagging and audit trail verified
  • Post-mortem scheduled within 72 hours

I keep the smallest possible restore script and the minimal artifacts needed to validate restores. When the alarm sounds, simplicity wins.



