Why a Runbook Is Different from a Playbook

Many organizations have a DDoS incident response playbook. It describes the technical steps for detecting and mitigating an attack: identify the attack vector, deploy firewall rules, escalate to upstream mitigation if needed, verify service recovery. That is valuable, but it is only part of what a NOC team needs.

A runbook is a comprehensive operational document that covers not just the technical response, but the entire organizational process around a DDoS incident. It defines severity levels so every analyst classifies the same attack the same way. It specifies who gets paged at 2 AM and who gets an email in the morning. It includes communication templates so your customer-facing team does not have to write status updates from scratch during a crisis. It provides decision trees so a junior analyst at 3 AM on a Sunday can make the right call without needing to wake up the senior engineer.

The playbook answers "how do we stop the attack?" The runbook answers "how does our organization respond to the attack?" Both are necessary. This guide focuses on building the runbook.

The goal of a runbook is to remove ambiguity. During a DDoS attack, your NOC team should never have to ask "who should I call?" or "is this bad enough to wake up the VP?" The runbook answers those questions in advance.

Severity Classification Matrix

The severity matrix is the foundation of the entire runbook. Every decision — who gets notified, what mitigation to deploy, whether to activate the incident bridge — flows from the severity level. A well-defined matrix ensures consistent classification regardless of which analyst is on shift.

Four severity levels

Severity   Name              Definition
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
S1         Service Down      One or more production services are
                             completely unreachable. Customer impact
                             is total. Revenue is being lost.

S2         Degraded          Services are reachable but experiencing
                             measurable degradation: elevated latency,
                             packet loss, intermittent timeouts.
                             Customer impact is significant.

S3         Elevated Traffic  Anomalous traffic detected. Services are
                             currently unaffected, but traffic volume
                             or patterns suggest an attack is building
                             or is being partially mitigated.

S4         Anomaly Detected  Monitoring has flagged unusual traffic
                             patterns that may indicate reconnaissance,
                             a low-rate attack, or a false positive.
                             No measurable service impact.

Each severity level must have clear, measurable criteria. Avoid subjective language like "significant attack" or "moderate impact." Instead, define specific thresholds that leave no room for interpretation.

Classification criteria

  • S1 triggers: Service health check failures for 2+ minutes, complete link saturation (>95% utilization on any production uplink), customer-reported outage confirmed by monitoring, or automated mitigation has been active for 5+ minutes without resolving the impact.
  • S2 triggers: Latency increase of more than 3x baseline for production services, packet loss exceeding 5% on production links, partial service degradation confirmed by monitoring, or attack volume exceeding 50% of link capacity.
  • S3 triggers: Attack detected and mitigation is active but service impact is not yet measurable, traffic volume exceeds normal baseline by more than 2x, or automated mitigation is containing the attack but volume is still increasing.
  • S4 triggers: Anomalous traffic pattern detected by monitoring, traffic volume above baseline but below the 2x threshold that triggers S3, or a known attack signature detected at low volume.
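As a rough illustration, these criteria could be encoded as an ordered series of checks, most severe first. The Python sketch below is not taken from any monitoring product: the `Metrics` fields and the `classify` function are hypothetical names, and the thresholds are simply the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    """Hypothetical monitoring snapshot; every field name is an assumption."""
    health_check_failing_s: float = 0.0   # seconds health checks have been failing
    link_utilization: float = 0.0         # busiest production uplink, 0.0 to 1.0
    outage_confirmed: bool = False        # customer report confirmed by monitoring
    unresolved_mitigation_s: float = 0.0  # seconds mitigation has run without resolving impact
    latency_ratio: float = 1.0            # current latency divided by baseline
    packet_loss: float = 0.0              # production links, 0.0 to 1.0
    degradation_confirmed: bool = False   # partial degradation confirmed by monitoring
    attack_share_of_link: float = 0.0     # attack volume divided by link capacity
    mitigation_active: bool = False       # attack detected and mitigation running
    traffic_ratio: float = 1.0            # current traffic divided by baseline
    anomaly_flagged: bool = False         # monitoring flagged an unusual pattern


def classify(m: Metrics) -> Optional[str]:
    """Return the most severe level whose triggers match, or None."""
    if (m.health_check_failing_s >= 120 or m.link_utilization > 0.95
            or m.outage_confirmed or m.unresolved_mitigation_s >= 300):
        return "S1"
    if (m.latency_ratio > 3 or m.packet_loss > 0.05
            or m.degradation_confirmed or m.attack_share_of_link > 0.5):
        return "S2"
    if m.mitigation_active or m.traffic_ratio > 2:
        return "S3"
    if m.anomaly_flagged:
        return "S4"
    return None
```

Checking the most severe level first means an incident that matches several levels is always reported at the highest one.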

Severity can change. An incident that starts as S4 can escalate to S1 within seconds. Your runbook must include re-classification triggers: as soon as an S3 incident meets any S1 criterion, such as service health check failures lasting 2+ minutes, it is reclassified as S1. Flowtriq's automated severity classification updates in real time as attack conditions change.
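One way to keep re-classification auditable is to record every severity change with a timestamp. The following is a minimal Python sketch; the `Incident` class and its methods are hypothetical and not part of Flowtriq or any other tool.

```python
from datetime import datetime, timezone

# Least severe to most severe, so a higher index means a worse incident.
SEVERITY_ORDER = ["S4", "S3", "S2", "S1"]


class Incident:
    """Hypothetical incident record that keeps a severity history."""

    def __init__(self, severity, now=None):
        self.severity = severity
        self.history = [(now or datetime.now(timezone.utc), severity)]

    def reclassify(self, new_severity, now=None):
        """Record a severity change. Returns True only when it is an escalation."""
        if new_severity == self.severity:
            return False
        escalated = (SEVERITY_ORDER.index(new_severity)
                     > SEVERITY_ORDER.index(self.severity))
        self.severity = new_severity
        self.history.append((now or datetime.now(timezone.utc), new_severity))
        return escalated
```

The history list also gives the post-incident review a ready-made answer to whether, and when, the severity changed.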

Escalation Tiers and Notification Matrix

Every severity level maps to a specific set of people who need to be notified, and the notification method must match the urgency. An S4 anomaly does not need to page the VP of Engineering at 3 AM. An S1 outage does.

Tier 1: NOC on-shift analyst

The first responder for all DDoS events. The on-shift NOC analyst receives automated alerts for all severity levels and is responsible for initial triage and classification. For S4 and S3 events, the analyst monitors the situation and may resolve it without escalation. For S2 and S1 events, the analyst initiates the escalation chain immediately.

Tier 2: NOC senior engineer / DDoS specialist

Paged for S2 and S1 events. The senior engineer has deep expertise in DDoS mitigation techniques and is authorized to make decisions about advanced mitigation actions: activating BGP FlowSpec, triggering RTBH, or engaging cloud scrubbing. The senior engineer also takes over incident command for S1 events.

Tier 3: Network operations manager / VP of Engineering

Notified for S1 events. The operations manager is responsible for cross-functional coordination: engaging the customer communication team, authorizing emergency changes, and making decisions about service sacrifices (e.g., blackholing a non-critical IP to protect critical services). For prolonged S1 events (more than 30 minutes), the VP of Engineering is notified.

Tier 4: Executive leadership / ISP contacts

Engaged for S1 events that exceed 1 hour or that affect the entire network. Executive leadership needs visibility for business continuity decisions. ISP upstream contacts are engaged when the attack requires coordination with transit providers (e.g., requesting upstream filtering or activating contracted scrubbing services).

Severity   Tier 1 (NOC)    Tier 2 (Sr Eng)   Tier 3 (Mgmt)     Tier 4 (Exec)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
S1         Page + Bridge   Page + Bridge     Page + Bridge     Email (1hr)
S2         Page            Page              Email             —
S3         Alert (chat)    Email             —                 —
S4         Alert (chat)    —                 —                 —
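If alert routing is automated, the matrix can live in configuration as a plain lookup table. This Python sketch is one possible encoding; the tier keys and channel strings are assumed labels, not identifiers from any paging tool.

```python
# Severity -> tier -> channel, mirroring the matrix above.
# A tier missing from a row receives no notification at that severity.
NOTIFICATION_MATRIX = {
    "S1": {
        "tier1": "page+bridge",
        "tier2": "page+bridge",
        "tier3": "page+bridge",
        "tier4": "email",  # sent after the delay noted in the matrix
    },
    "S2": {"tier1": "page", "tier2": "page", "tier3": "email"},
    "S3": {"tier1": "chat", "tier2": "email"},
    "S4": {"tier1": "chat"},
}


def notifications_for(severity):
    """Return (tier, channel) pairs for a severity, ordered by tier."""
    return sorted(NOTIFICATION_MATRIX[severity].items())
```

For example, `notifications_for("S3")` yields the chat alert for Tier 1 and the email for Tier 2, and nothing for the higher tiers.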

Notification channels

  • Page: PagerDuty, Opsgenie, or equivalent. Requires acknowledgment within 5 minutes. Auto-escalates if unacknowledged.
  • Bridge: Dedicated incident bridge (Zoom, Teams, or Slack huddle) opened immediately for S1 events. All paged personnel join within 10 minutes.
  • Alert (chat): Slack or Teams channel notification. Informational. Does not require immediate response.
  • Email: Summary notification. For awareness, not action.
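Paging tools such as PagerDuty and Opsgenie implement acknowledgment timeouts themselves, so the following Python sketch only illustrates the rule the runbook needs to state. The function name and its arguments are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ACK_TIMEOUT = timedelta(minutes=5)  # acknowledgment window for a page


def needs_auto_escalation(paged_at, acked_at, now):
    """True if a page is still unacknowledged once the window has elapsed."""
    if acked_at is not None:
        return False
    return now - paged_at >= ACK_TIMEOUT
```

Writing the rule down this explicitly makes it easy to confirm that the paging tool's escalation policy matches what the runbook promises.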

Mitigation Decision Trees

Decision trees remove ambiguity from the mitigation process. A junior analyst at 3 AM should be able to follow the tree and arrive at the correct mitigation action without needing to make judgment calls that require senior expertise.

Primary decision tree

The first decision point is whether automated mitigation is handling the attack. If you are using an automated detection and mitigation platform like Flowtriq, the system may have already deployed firewall rules, BGP FlowSpec, or other mitigation before a human even looks at the dashboard. In that case, the analyst's role shifts from "decide what to do" to "verify that automation is working."

  1. Is automated mitigation active? If yes, check whether it is effective (service impact resolved). If effective, monitor and document. If not effective, proceed to manual escalation.
  2. Is the attack volumetric (link saturation)? If yes, local firewall rules cannot help. Escalate to BGP FlowSpec or RTBH immediately. If no, deploy or verify local firewall rules first.
  3. Is the target IP critical (cannot be blackholed)? If yes, skip RTBH and go directly to cloud scrubbing. If no, RTBH is an acceptable last resort.
  4. Has the ISP been engaged? For S1 events lasting more than 15 minutes, contact the upstream ISP. They may be able to apply filtering at their edge or activate contracted DDoS mitigation services.
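The four questions above can be sketched as a single function that returns an ordered list of actions. This Python example is one possible reading of the tree, not a definitive implementation: the function name, parameters, and action strings are hypothetical, and it treats step 3 as the fallback chosen after the step 2 action.

```python
def mitigation_plan(auto_active, auto_effective, volumetric,
                    target_critical, severity, minutes_elapsed):
    """Walk the primary decision tree and return actions in order."""
    # Step 1: automation is active and effective, so only oversight is needed.
    if auto_active and auto_effective:
        return ["monitor and document"]

    plan = []
    # Step 2: local firewall rules cannot relieve a saturated link.
    if volumetric:
        plan.append("BGP FlowSpec")
    else:
        plan.append("deploy or verify local firewall rules")

    # Step 3: a critical target must never be blackholed.
    if target_critical:
        plan.append("cloud scrubbing")
    else:
        plan.append("RTBH (last resort)")

    # Step 4: S1 incidents running longer than 15 minutes go to the ISP.
    if severity == "S1" and minutes_elapsed > 15:
        plan.append("contact upstream ISP")
    return plan
```

Returning a list rather than a single action keeps the fallback visible to the analyst before the first action has been tried.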

When to handle locally vs. engage upstream

A common source of confusion for NOC teams is knowing when to handle an attack locally and when to engage the upstream ISP or activate a cloud scrubbing service. The decision depends on two factors: whether the attack is saturating your inbound link, and whether your local mitigation tools can filter the attack traffic effectively.

  • Handle locally: Attack volume is under 50% of link capacity AND local firewall rules or rate limiting are effectively reducing the attack. No upstream engagement needed.
  • Engage upstream ISP: Attack is saturating the inbound link (traffic is being dropped by congestion before reaching your firewall) OR local mitigation has been active for 10+ minutes without resolving the impact.
  • Activate cloud scrubbing: The attack exceeds the ISP's filtering capability, the targeted service is too critical for RTBH, or the attack is multi-vector and cannot be described by simple filtering rules.
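The three cases above can be written as a short function. In this Python sketch the names are hypothetical, and the `"undetermined"` result is an added assumption covering combinations the list does not address, such as an attack above 50% of link capacity that local rules are still containing.

```python
def response_scope(attack_share, link_saturated, local_effective,
                   local_minutes, exceeds_isp_capability,
                   too_critical_for_rtbh, multi_vector):
    """Pick local handling, upstream ISP, or cloud scrubbing."""
    # Cloud scrubbing conditions take priority over the other two cases.
    if exceeds_isp_capability or too_critical_for_rtbh or multi_vector:
        return "cloud scrubbing"
    # Saturated link, or 10+ minutes of local mitigation without relief.
    if link_saturated or (local_minutes >= 10 and not local_effective):
        return "upstream ISP"
    # Under half of link capacity and local filtering is working.
    if attack_share < 0.5 and local_effective:
        return "local"
    # Not covered by the list above; needs a human decision.
    return "undetermined"
```

An explicit "undetermined" outcome is a useful prompt during runbook reviews, since every case that lands there is a branch the decision tree is missing.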

Communication Templates

During a DDoS attack, your NOC team should not be writing status updates from scratch. Pre-written templates ensure consistent, professional communication and save critical minutes during an incident.

Internal status update (S1/S2)

Subject: [S{SEVERITY}] DDoS Attack — {SERVICE} — {STATUS}

Current Status: {Active / Mitigating / Resolved}
Severity: S{SEVERITY}
Start Time: {UTC TIMESTAMP}
Duration: {DURATION}
Affected Service(s): {SERVICE LIST}
Attack Type: {VECTOR(S)}
Peak Volume: {BPS} / {PPS}
Mitigation: {ACTIONS TAKEN}
Customer Impact: {DESCRIPTION}
Next Update: {TIME} or upon status change

Incident Commander: {NAME}
Bridge: {LINK}
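If status updates are sent from a script or chat bot, the placeholders can be filled programmatically. The Python sketch below assumes the `{NAME}` placeholder style used in the template; the `render` helper is hypothetical and deliberately fails on a missing field rather than sending an incomplete update.

```python
import re

# Three lines excerpted from the internal template above.
TEMPLATE = (
    "Severity: S{SEVERITY}\n"
    "Start Time: {UTC TIMESTAMP}\n"
    "Affected Service(s): {SERVICE LIST}"
)


def render(template, fields):
    """Replace each {PLACEHOLDER}; raise KeyError if a value is missing."""
    def fill(match):
        key = match.group(1)
        if key not in fields:
            raise KeyError(f"missing template field: {key}")
        return str(fields[key])
    return re.sub(r"\{([^{}]+)\}", fill, template)


message = render(TEMPLATE, {
    "SEVERITY": 1,
    "UTC TIMESTAMP": "2024-01-01 03:00",  # illustrative value
    "SERVICE LIST": "api, dns",           # illustrative value
})
```

A regular expression is used instead of `str.format` because several placeholder names contain spaces.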

Customer notification (for MSPs and hosting providers)

Subject: Service Advisory — Network Event Affecting {SERVICE}

We are currently experiencing elevated network traffic targeting
our infrastructure. Our automated mitigation systems activated
at {TIME} UTC and are filtering malicious traffic.

Impact: {DESCRIPTION OF CUSTOMER-VISIBLE IMPACT}
Expected Resolution: We are actively monitoring the situation.
{UPDATE CADENCE — e.g., "We will provide updates every 30 minutes."}

If you have questions, contact support at {CONTACT}.

Key principles for customer communication: never speculate about the attacker or motive, provide factual information about impact and response, commit to a specific update cadence and stick to it, and always include a support contact.

Template customization: These templates are starting points. Customize them for your organization's tone, your customer base, and your SLA language. The important thing is that they exist before the attack, not that they are perfect. A good template sent in 5 minutes is better than a perfect message crafted in 30 minutes while customers are wondering what is happening.

Post-Incident Review Checklist

Every DDoS incident at S2 or above should trigger a post-incident review (PIR). The review is not about assigning blame; it is about identifying what worked, what did not, and what needs to change. Schedule the PIR within 72 hours of incident resolution while memories are fresh.

Review agenda

  1. Timeline reconstruction: Walk through the incident timeline from detection to resolution. Verify timestamps and sequence of events. Identify any gaps in the timeline.
  2. Detection effectiveness: How long did it take to detect the attack? Was the detection automatic or manual? If manual, why did automated detection fail? Were the alerts actionable?
  3. Classification accuracy: Was the initial severity classification correct? Did the severity change during the incident? Were the classification criteria clear, or was there confusion?
  4. Escalation effectiveness: Were the right people notified at the right time? Did anyone receive unnecessary notifications? Was anyone who should have been notified left out? How long did it take each tier to respond?
  5. Mitigation effectiveness: Which mitigation actions were taken? How effective was each one? Was the mitigation proportional to the attack? Were there any unintended consequences (e.g., legitimate traffic dropped)?
  6. Communication quality: Were internal and external communications sent on time? Were they accurate? Did customers or stakeholders report confusion or lack of information?
  7. Documentation completeness: Was the incident fully documented? Are PCAP samples, traffic graphs, and mitigation logs preserved? Is the documentation sufficient for compliance requirements?
  8. Runbook adherence: Did the team follow the runbook? Were there situations where the runbook did not provide guidance? Were any runbook procedures skipped or modified? Why?

Action items

Every PIR must produce specific, assignable action items. "Improve monitoring" is not an action item. "Add alerting threshold for GRE protocol floods to Flowtriq detection rules by June 15" is an action item. Each action item needs an owner, a due date, and a follow-up mechanism.
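Teams that track action items in a script or ticketing integration can enforce these three requirements mechanically. The following Python sketch uses hypothetical names throughout; the year in the usage example is illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """Hypothetical PIR action item record."""
    description: str
    owner: str = ""
    due: Optional[date] = None
    follow_up: str = ""  # e.g. "review at next quarterly runbook review"


def missing_fields(item: ActionItem) -> List[str]:
    """List which required parts of an action item are absent."""
    missing = []
    if not item.owner.strip():
        missing.append("owner")
    if item.due is None:
        missing.append("due date")
    if not item.follow_up.strip():
        missing.append("follow-up mechanism")
    return missing


vague = ActionItem("Improve monitoring")
specific = ActionItem(
    "Add alerting threshold for GRE protocol floods",
    owner="NOC senior engineer",
    due=date(2025, 6, 15),  # illustrative year
    follow_up="verify at next quarterly review",
)
```

Here `missing_fields(vague)` reports all three gaps, while `missing_fields(specific)` returns an empty list.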

Runbook Maintenance Schedule

A runbook that is written once and never updated is a runbook that will fail when you need it. Infrastructure changes, team members rotate, ISP contracts change, and new attack vectors emerge. The runbook must evolve with your environment.

Quarterly reviews

Every quarter, review the runbook for accuracy. Verify that contact lists are current (people leave, change roles, change phone numbers). Confirm that escalation paths still reflect the organizational structure. Update any technical procedures that have changed due to infrastructure modifications. Check that ISP and vendor contacts are still valid.
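Part of the quarterly check can be automated if each contact entry records when it was last verified. This Python sketch assumes such a record exists; the data layout, the function name, and the 92-day interval used as "one quarter" are all assumptions.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=92)  # roughly one quarter (assumption)


def stale_contacts(contacts, today):
    """Return names whose last verification is older than the interval.

    contacts maps a contact name to the date it was last verified.
    """
    return [
        name
        for name, last_verified in sorted(contacts.items())
        if today - last_verified > REVIEW_INTERVAL
    ]


# Illustrative data only.
contacts = {
    "noc-lead": date(2024, 1, 10),
    "isp-noc": date(2024, 5, 1),
}
overdue = stale_contacts(contacts, today=date(2024, 6, 1))
```

With the sample data, only `noc-lead` is flagged, because it was last verified more than 92 days before the review date.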

Post-incident updates

After every S1 or S2 incident, update the runbook based on PIR findings. If the PIR identified a gap in the decision tree, add the missing branch. If communication templates needed modification, update them. If escalation thresholds were wrong, adjust them. The runbook should improve after every real incident.

Annual tabletop exercise

At least once per year, run a tabletop exercise where the NOC team walks through the runbook using a simulated DDoS scenario. The exercise should test the full process: detection, classification, escalation, mitigation decisions, communication, and post-incident review. This identifies gaps that only become apparent when the process is exercised end-to-end.

Flowtriq supports dry-run simulations that walk through the auto-escalation chain without deploying any actual mitigation. Use this feature during tabletop exercises to validate that automated responses trigger correctly for different attack scenarios.

Integrating Automation into Your Runbook

A modern DDoS runbook should account for automated detection and mitigation. If you use a platform like Flowtriq, your runbook shifts from "step-by-step manual mitigation" to "verify and oversee automated response." This is a significant improvement because it reduces the number of critical decisions that must be made under pressure.

With automation, the analyst's responsibilities change:

  • Instead of "detect the attack": Verify the automated detection is accurate (not a false positive).
  • Instead of "deploy firewall rules": Verify that auto-deployed rules are effective and not blocking legitimate traffic.
  • Instead of "decide when to escalate": Monitor the automated escalation and override if necessary.
  • Instead of "decide when to de-escalate": Verify that automated de-escalation is safe and confirm the attack has truly ended.

The runbook still needs every section described above. Automation handles the technical response, but the organizational response — severity classification, escalation, communication, documentation, and post-incident review — still requires human coordination guided by a well-written runbook.

Build your runbook before the next attack. Try Flowtriq free for 7 days and pair automated DDoS detection with a NOC runbook that ensures your entire team responds effectively.
