Launch Day Stress-Test Checklist

A launch isn’t a date; it’s a test. The moment real users arrive, your app either absorbs the spike or buckles under the weight of logins, exports, webhooks, and background jobs. The stakes are high: studies report that up to 70% of users abandon an app that feels slow, and crashes drive roughly 71% of uninstalls – two brutal ways to squander hard-won acquisition.
And the macro picture is just as stark: the CISQ 2022 report estimates US$2.41 trillion in annual US losses from poor software quality, while QA can consume ~40% of development budgets – a cost that’s justified if it prevents launch-day failures.

This checklist turns that risk into a plan. It’s written for SaaS teams preparing for go-live, with an emphasis on practical tests, clear pass/fail gates, and the operational plumbing that keeps incidents short and recoveries fast.

1) Set targets that define “good enough”

Before you spin up a single load generator, decide what success looks like. Agree the peak traffic model (concurrency, requests per second, typical/large payloads) and the SLOs you’ll hold the product to (p95/p99 latency, error rate, CPU/RAM ceilings, queue depth, autoscale time). Capture these on a one-pager and get sign-off from engineering and product. Without this, a “passed” test means nothing – because no one has agreed what a pass is.

Checklist

  • Peak-hour model documented and shared.
  • SLOs per critical endpoint (auth, search, checkout, exports).
  • “Red lines” set for cost (e.g., egress) and platform quotas.
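
One practical way to make the one-pager enforceable is to encode the SLOs as data and gate every load-test run against them. A minimal sketch in Python; the endpoints, thresholds, and sample numbers are placeholders, not recommendations:

# slo_gate.py - fail the run if any measured value breaches its SLO.
SLOS = {
    "auth":     {"p95_ms": 300, "p99_ms": 800,  "error_rate": 0.010},
    "search":   {"p95_ms": 500, "p99_ms": 1200, "error_rate": 0.010},
    "checkout": {"p95_ms": 400, "p99_ms": 1000, "error_rate": 0.005},
}

def gate(results):
    """Return a list of SLO breaches; an empty list means the run passes."""
    breaches = []
    for endpoint, slo in SLOS.items():
        measured = results.get(endpoint, {})
        for metric, limit in slo.items():
            value = measured.get(metric)
            if value is None or value > limit:
                breaches.append(f"{endpoint}: {metric}={value} (limit {limit})")
    return breaches

if __name__ == "__main__":
    # Feed in numbers exported from your load-test tool.
    sample = {"auth": {"p95_ms": 280, "p99_ms": 910, "error_rate": 0.004}}
    for breach in gate(sample):
        print("SLO breach:", breach)

Wire a check like this into the load-test pipeline so “pass” is computed, not eyeballed.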

2) Rehearse the three core load shapes

Every launch sees three patterns: step load (a climb to peak), spiky bursts (flash crowds), and soak (sustained traffic). Each exposes different failure modes.

Step load validates that latency and errors stay within SLOs as you approach 1.2–1.5× forecast peak. Spiky bursts uncover brittle code paths (login storms, export fan-out, webhook cascades). Soak (2–4 hours at ~0.8× peak) surfaces memory leaks, connection churn, and file-descriptor drift that step tests miss.

Checklist

  • Step test to 1.2–1.5× peak: p95/p99 within SLO, <1% errors.
  • Burst test on auth, exports, and webhook endpoints: no rate-limit thrash.
  • Soak test: no upwards drift (>10%) in memory, handles, or DB connections.
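
If you drive the tests with Locust (a Python load-testing tool), the step shape can be written directly as a load-test shape class. A sketch assuming Locust is installed; the routes, credentials, and user counts are placeholders for your own traffic model:

# locustfile.py - step load towards ~1.3x forecast peak, then hold and stop.
from locust import HttpUser, LoadTestShape, between, task

class LaunchUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def search(self):
        self.client.get("/search?q=launch")  # placeholder route

    @task(1)
    def login(self):
        self.client.post("/login", json={"user": "demo", "password": "demo"})  # placeholder

class StepLoad(LoadTestShape):
    step_users = 200        # users added per step
    step_seconds = 120      # how long each step holds
    max_users = 2600        # ~1.3x forecast peak concurrency
    hold_at_peak_s = 600    # keep peak load for 10 minutes, then stop

    def tick(self):
        run_time = self.get_run_time()
        ramp_s = (self.max_users / self.step_users) * self.step_seconds
        if run_time > ramp_s + self.hold_at_peak_s:
            return None     # returning None ends the test
        users = min(self.max_users, (int(run_time // self.step_seconds) + 1) * self.step_users)
        return users, self.step_users   # (target user count, spawn rate)

Burst and soak variants reuse the same idea with different tick() logic: jump straight to a high user count for the burst, or hold roughly 0.8× peak for a few hours for the soak.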

3) Make autoscaling arrive before the pain

Scaling is only useful if capacity lands ahead of the curve. Tune scale-out triggers (CPU, RPS, queue length) and cool-downs, then pre-warm instances/images or serverless functions to avoid cold-start cliffs. Confirm that quota ceilings (max pods, instances, IPs) leave ≥30% headroom above the forecast peak – quotas are silent failure modes on launch day.

Checklist

  • Pre-warming in place (containers/images/functions).
  • Scale-out adds capacity before p95 breaches under step load.
  • Quotas/limits reviewed; max scale documented with ≥30% headroom.
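
The ≥30% headroom rule is worth checking with arithmetic rather than instinct. A hypothetical sketch; the quota, per-instance throughput, and forecast figures are illustrative:

# headroom_check.py - does the platform quota leave >=30% above forecast peak?
import math

FORECAST_PEAK_RPS = 4000   # from the traffic model one-pager
RPS_PER_INSTANCE = 250     # measured under step load
INSTANCE_QUOTA = 24        # hard platform limit (max pods / VMs / functions)
HEADROOM = 1.30            # capacity target relative to forecast peak

instances_needed = math.ceil(FORECAST_PEAK_RPS * HEADROOM / RPS_PER_INSTANCE)
rps_at_quota = INSTANCE_QUOTA * RPS_PER_INSTANCE

print(f"Need {instances_needed} instances to serve {HEADROOM:.0%} of peak; quota allows {INSTANCE_QUOTA}.")
if instances_needed > INSTANCE_QUOTA:
    print("FAIL: raise the quota or reduce per-instance load before launch day.")
else:
    print(f"OK: quota supports {rps_at_quota} RPS vs a {FORECAST_PEAK_RPS} RPS forecast peak.")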

4) Harden the edge (CDN, caching, TLS)

Your CDN and cache policy determine how much traffic the origin must endure. Cache aggressively for static and semi-static assets; validate hit ratios and purge plans. Enforce modern TLS, HSTS, HTTP/2 or HTTP/3, and compress text assets. This keeps origin egress (and bills) sane and shields users from transient backend blips.

Checklist

  • CDN hit-ratio meets target; purge rehearsed.
  • TLS chain, HSTS, and protocol settings verified (no mixed content).
  • Asset compression (gzip/brotli) measured end-to-end.
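
A quick external spot-check of compression, HSTS, and cache behaviour catches most edge misconfigurations before launch. A sketch using the requests library; the asset URL is a placeholder and the cache-status headers checked are vendor-specific examples you would adapt to your CDN:

# edge_check.py - spot-check compression, HSTS and cache headers from outside.
import requests

URL = "https://app.example.com/assets/app.js"   # placeholder asset URL

resp = requests.get(URL, timeout=10)            # requests advertises gzip by default
checks = {
    "compressed":    resp.headers.get("Content-Encoding") in ("gzip", "br", "zstd"),
    "hsts":          "Strict-Transport-Security" in resp.headers,
    "cacheable":     "max-age" in resp.headers.get("Cache-Control", ""),
    "served_by_cdn": any(h in resp.headers for h in ("X-Cache", "CF-Cache-Status", "Age")),
}
for name, ok in checks.items():
    print(f"{name:13} {'OK' if ok else 'CHECK MANUALLY'}")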

5) Put guardrails in front of hot paths (WAF & rate limits)

Launch day attracts both real spikes and accidental abuse. Enable WAF managed rules (OWASP-style) on login, search, and payment routes, then tune to reduce false positives. Apply rate limits globally and per user/IP for authentication, exports, and any expensive APIs. The goal is graceful degradation – throttle the few to protect the many.

Checklist

  • WAF enabled and tuned on critical paths; allow/deny lists validated.
  • Rate-limit policies exercised; no collateral damage to normal users.
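
Per-user/IP throttling is typically a token bucket: each key earns tokens at a steady rate, short bursts spend the saved-up tokens, and sustained abuse gets a 429. A minimal in-process sketch to illustrate the policy; the rates are illustrative, and in production this would live at the gateway or in a shared store such as Redis:

# token_bucket.py - allow short bursts per key, throttle sustained abuse.
import time
from collections import defaultdict

RATE = 5.0     # tokens refilled per second per key (steady-state allowance)
BURST = 20.0   # bucket capacity (largest burst a key may spend at once)

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(key):
    """Return True if this request may proceed, False if it should get a 429."""
    bucket = _buckets[key]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False

if __name__ == "__main__":
    rejected = sum(not allow("203.0.113.7") for _ in range(100))
    print(f"rejected {rejected}/100 back-to-back requests from one IP")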

6) Prove networking & deploy mechanics under load

Blue/green or canary deploys should not create 5xx spikes. Verify load-balancer health checks, connection draining, and zero-downtime rollout while under step load. Keep firewall/NSG rules least-privilege and confirm no accidental listeners are exposed to the internet.

Checklist

  • Rollout rehearsal under load: no error spikes; health checks green.
  • External scan shows only expected ports; bastion/JIT access works.
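
During the rollout rehearsal it helps to run a simple watcher that polls a key endpoint and counts non-2xx responses, so a 5xx spike shows up the moment health checks or connection draining misbehave. A sketch using the requests library; the URL and timings are placeholders:

# rollout_watch.py - poll a key endpoint while the blue/green or canary rollout runs.
import time
import requests

URL = "https://app.example.com/healthz"   # placeholder; watch a real user-facing route too
DURATION_S = 300                          # cover the whole rollout window
INTERVAL_S = 1.0

errors = total = 0
deadline = time.monotonic() + DURATION_S
while time.monotonic() < deadline:
    total += 1
    try:
        r = requests.get(URL, timeout=2)
        if r.status_code >= 500:
            errors += 1
            print(f"{time.strftime('%H:%M:%S')} got {r.status_code}")
    except requests.RequestException as exc:
        errors += 1
        print(f"{time.strftime('%H:%M:%S')} request failed: {exc}")
    time.sleep(INTERVAL_S)

print(f"error rate during rollout: {errors / max(total, 1):.2%}")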

7) Exercise application hotspots like a hostile user

Failures concentrate in predictable places: authentication and session stores, search/listing endpoints with worst-case filters, and media/file workflows. Test them with ugly payloads and edge-case filters. Validate idempotency for external calls (payments, webhooks) so retries don’t double-charge or duplicate records.

Checklist

  • Login storm test: p95 < target; no session lock contention.
  • Search/listing queries use indices; no table scans at scale.
  • Upload/download/transcode throughput meets targets; back-pressure works.
  • Idempotency keys enforced for external calls.
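
Idempotency for external calls usually comes down to deriving a stable key per logical operation and returning the stored result on retries instead of repeating the side effect. A minimal sketch; charge() is a hypothetical payment call and the in-memory dict stands in for a database table or Redis entry with a TTL:

# idempotency.py - never perform the same external side effect twice.
import uuid

_completed = {}   # in production: a table or Redis keyed by idempotency key, with a TTL

def charge(amount_pence):
    """Hypothetical external payment call."""
    return {"charge_id": str(uuid.uuid4()), "amount": amount_pence}

def charge_once(idempotency_key, amount_pence):
    # On a retry (client timeout, webhook redelivery) return the stored result
    # instead of charging again. A real store needs an atomic set-if-absent.
    if idempotency_key in _completed:
        return _completed[idempotency_key]
    result = charge(amount_pence)
    _completed[idempotency_key] = result
    return result

if __name__ == "__main__":
    key = "order-1234-charge"        # derive from the order, never random per attempt
    first = charge_once(key, 4999)
    retry = charge_once(key, 4999)
    assert first == retry            # same charge, no duplicate
    print("retry returned the original charge:", retry["charge_id"])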

8) Size the data tier for concurrency – not just storage

Launch traffic is concurrency pressure: connection pools, lock contention, hot indices. Verify pool sizes and timeouts for app→DB and DB→cache, test retry strategies (exponential back-off, caps), and sample query plans for top routes. If migrations are part of launch, measure them on a full-size copy and choose online/backfill strategies where possible.

Checklist

  • Pool sizing prevents exhaustion; retries are exponential + capped.
  • Hot queries indexed; p95 within SLO on real data.
  • Long-running DDL has an online plan or maintenance window.
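
“Exponential + capped” retries mean the delay doubles per attempt up to a ceiling, ideally with jitter so thousands of clients don’t retry in lockstep. A sketch; the attempt counts and delays are illustrative:

# backoff.py - capped exponential back-off with full jitter.
import random
import time

def retry(call, attempts=5, base_s=0.2, cap_s=5.0):
    """Retry `call` on exception; re-raise the last error if every attempt fails."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(cap_s, base_s * (2 ** attempt))   # 0.2, 0.4, 0.8 ... capped at 5s
            time.sleep(random.uniform(0, delay))          # full jitter avoids retry storms

if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("transient")
        return "ok"

    print(retry(flaky), "after", state["calls"], "calls")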

9) Don’t let vendors be your single point of failure

Third-party dependencies (email/SMS, payments, auth, storage, maps) fail in two ways on launch: slow and down. Exercise timeouts, circuit breakers, and fallbacks. For critical services, validate a contingency (secondary SMTP/SMS or alternate region). “Graceful degradation” is a product decision – show friendly messaging, queue work, and recover later.

Checklist

  • Timeouts & circuit breakers tested; user messaging for brownouts.
  • Secondary provider or region rehearsed for one critical dependency.
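
A circuit breaker is a small state machine: after a run of consecutive failures it stops calling the vendor for a cool-off period and serves the fallback (queue the work, show friendly messaging) instead. A sketch; the thresholds are illustrative, and sms_client/outbox in the usage comment are hypothetical names:

# breaker.py - stop calling a failing vendor, serve the fallback, recover later.
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooloff_s=30.0):
        self.threshold = threshold   # consecutive failures before the circuit opens
        self.cooloff_s = cooloff_s   # how long to fail fast before retrying the vendor
        self.failures = 0
        self.opened_at = None        # None = closed (healthy)

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooloff_s:
                return fallback()    # fail fast; don't hammer a struggling vendor
            self.opened_at = None    # half-open: allow one real attempt
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()

# Hypothetical usage: wrap the vendor call and queue work when it browns out.
# breaker = CircuitBreaker()
# breaker.call(lambda: sms_client.send(message), fallback=lambda: outbox.enqueue(message))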

10) Restore, don’t just back up

Backups that have never been restored are stories, not safety nets. Enable DB point-in-time recovery, ensure file/object backups are immutable where the platform supports it, and rehearse a timed restore into staging. Compare row/file counts and checksums to validate integrity and measure RTO/RPO realistically. This is your escape hatch from bad data or bad deploys.

Checklist

  • Timed DB restore meets RTO/RPO; integrity checks green.
  • Files/objects restorable; immutability/retention verified.
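
For the integrity check, comparing per-table row counts and checksums between the source and the restored copy is usually enough to catch silent corruption. A sketch assuming PostgreSQL via psycopg2; the DSNs and table list are placeholders, and hot tables should be compared against a quiesced snapshot rather than live production:

# restore_check.py - compare row counts and checksums between source and restore.
import psycopg2

TABLES = ["users", "orders", "invoices"]                  # placeholder table names
SOURCE_DSN = "postgresql://readonly@prod-db/app"          # placeholder DSNs
RESTORED_DSN = "postgresql://readonly@staging-db/app_restored"

def fingerprint(dsn, table):
    # Table names come from the fixed list above, never from user input.
    sql = (f"SELECT count(*), coalesce(md5(string_agg(md5(t::text), '' "
           f"ORDER BY md5(t::text))), '') FROM {table} AS t")
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchone()

for table in TABLES:
    source, restored = fingerprint(SOURCE_DSN, table), fingerprint(RESTORED_DSN, table)
    status = "OK" if source == restored else "MISMATCH"
    print(f"{table:10} source_rows={source[0]} restored_rows={restored[0]} {status}")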

11) Make observability actionable (not noisy)

Launch day is where dashboards either earn their keep or prove ornamental. Build a single pane for golden signals (latency, traffic, errors, saturation), the business KPIs that product cares about (sign-ups, checkouts), and infra health. Alerts should page only for things a human must fix now, and each must link to a run-book with owners, first actions, and escalation. Test the pager path end-to-end.

Checklist

  • Dashboards load fast and show red/amber/green at a glance.
  • High-signal alerts only; paging route tested with an on-call drill.
  • Logs searchable at high cardinality; PII scrubbing verified.
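
Writing the paging condition down as explicit thresholds over a window of samples keeps alerts high-signal: a sustained SLO burn pages, a single blip does not. A sketch; the thresholds and sample format are illustrative:

# paging_rule.py - page on sustained SLO burn, not on a single blip.
def p95(values):
    ordered = sorted(values)
    return ordered[max(0, int(round(0.95 * len(ordered))) - 1)]

def should_page(window, p95_limit_ms=500, error_limit=0.02):
    """window: request samples from the last few minutes,
    e.g. {"latency_ms": 231, "error": False}."""
    if not window:
        return False
    error_rate = sum(s["error"] for s in window) / len(window)
    return error_rate > error_limit or p95(s["latency_ms"] for s in window) > p95_limit_ms

if __name__ == "__main__":
    window = [{"latency_ms": 120, "error": False}] * 98 + [{"latency_ms": 900, "error": True}] * 2
    print("page?", should_page(window))   # two blips in a hundred requests should not page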

12) Rehearse failure – and your way back

Pull cables, kill nodes, and simulate a vendor outage. You’re checking that graceful degradation works (queues buffer; users see helpful messages) and that failover keeps data correct. Capture timings: time to detect, time to mitigate, time to recover. These numbers become your incident baseline for post-launch improvements.

Checklist

  • AZ/host failover test passes under user-like load.
  • Dependency brownout/blackout drills: clear UX + fast recovery.
  • Incident timer captured (detect → mitigate → recover).
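
Capturing detect → mitigate → recover timings is easiest with a small timer you run alongside the drill and file with the incident notes. A sketch; the mark names are just a convention:

# drill_timer.py - capture detect/mitigate/recover timings during a failure drill.
import time

class DrillTimer:
    def __init__(self):
        self.start = time.monotonic()   # create this the moment you inject the fault
        self.marks = {}

    def mark(self, name):
        self.marks[name] = time.monotonic() - self.start
        print(f"{name}: {self.marks[name]:.0f}s after fault injection")

    def summary(self):
        detect = self.marks.get("detect", 0.0)
        mitigate = self.marks.get("mitigate", detect)
        recover = self.marks.get("recover", mitigate)
        return {"time_to_detect": detect,
                "time_to_mitigate": mitigate - detect,
                "time_to_recover": recover - mitigate}

# During the drill: timer = DrillTimer(); timer.mark("detect") when the alert fires,
# timer.mark("mitigate") when user impact stops, timer.mark("recover") when the system
# is fully healthy; file timer.summary() with the incident notes.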

Day-of run-book (print this bit)

  • Traffic model & SLOs: link to the one-pager.
  • Rollout plan: who flips, where, and rollback trigger.
  • Dashboards: links for product KPIs, golden signals, infra.
  • Pager tree: on-call primary/secondary; vendor portals; comms templates.
  • Kill-switches: feature flags for heavy features (exports, AI, reports).
  • Freeze rules: what’s allowed during the launch window (yes: config toggles; no: schema changes).
  • Hyper-care: staffing and stand-up cadence for the first 24–72 hours; schedule the success review.

Why this matters (and pays for itself)

Users punish delay and instability instantly, hence the focus on latency, crashes, and graceful failure paths. The data bears it out: slow experiences trigger abandonment, crashes drive uninstalls, and at the macro level software defects cost trillions – large enough to justify disciplined, repeatable testing even when QA already commands a hefty share of your budget.
