
The 3 AM Scenario Without Runbooks

It is 3:07 AM on a Tuesday. PagerDuty fires. The on-call network engineer wakes up, squints at the alert: "High traffic anomaly detected on edge-router-03." She opens her laptop. SSH into the router. Pull up traffic graphs. Try to figure out what kind of attack this is.

By 3:15 AM, she has confirmed it is a UDP amplification attack targeting a customer subnet. She needs to apply a filter, but the filter syntax for this router platform is something she does not have memorized. She checks the wiki. The wiki page was last updated 18 months ago and references a router OS version they no longer run.

By 3:22 AM, she has paged the senior network engineer. He is awake by 3:30 AM. They are both on a call. The senior engineer dictates filter rules while the on-call engineer types them in. By 3:38 AM, the filter is applied and traffic is dropping.

But nobody updated the status page. Nobody notified the affected customer. Nobody documented what happened. Those tasks fall to the morning shift, which has to reconstruct the incident from chat logs and memory.

Total time from detection to mitigation: 31 minutes. Total people involved: 2 (plus the support team in the morning). Total documentation quality: poor.

The Same Scenario With a Runbook

Same attack. Same 3:07 AM. The Flowtriq agent detects the anomaly and classifies it as a UDP amplification attack within 2 seconds. The runbook triggers automatically.

T+2s: The runbook's first action fires: apply a FlowSpec rule to drop the specific amplification vector (NTP, memcached, or whatever protocol the attack is using). The rule propagates to edge routers via BGP within 1 to 3 seconds.

T+5s: The runbook's second action fires: publish an incident to the status page with the affected component and severity level. Subscribers receive email notifications.

T+8s: The runbook's third action fires: send a Slack notification to the #incidents channel with attack classification, volume, target, and the mitigation action taken.

T+10s: The runbook captures a PCAP sample for forensic analysis and logs the full incident timeline.

The on-call engineer's phone does not ring. The attack is mitigated in under 10 seconds. The status page is accurate. The documentation is complete. In the morning, the team reviews the execution log and confirms everything worked as expected.

The goal of a runbook is not to remove humans from incident response. It is to remove humans from the first 5 minutes, where speed matters most and human response time is worst.

Anatomy of a Flowtriq Runbook

A runbook in Flowtriq is a sequence of actions that execute automatically when specific conditions are met. Every runbook has three components: triggers, conditions, and actions.

Triggers

Triggers define when the runbook activates. The most common trigger is an attack detection event, but you can also trigger on threshold crossings, status changes, or manual invocation.

  • Attack detected: Fires when the detection engine classifies an attack on any monitored node.
  • Threshold crossed: Fires when a specific metric (PPS, BPS, flow count) exceeds a defined value.
  • Severity escalation: Fires when an ongoing incident's severity increases (e.g., medium to high).
  • Manual: Triggered by an operator from the dashboard or via API call.

Conditions

Conditions act as filters on the trigger. The attack.detected trigger fires on every detection, so conditions let you narrow down which events actually run the runbook:

Trigger: attack.detected
Conditions:
  - attack.severity >= "high"
  - attack.classification IN ["udp_amplification", "udp_flood"]
  - target.node.tag == "production"
  - attack.volume_bps >= 1000000000   # 1 Gbps+

If the trigger fires but the conditions are not met, the runbook does not execute. This prevents your NTP amplification runbook from firing on a SYN flood, or your production runbook from firing on a staging node.
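
As a rough illustration of how a condition set like the one above might be evaluated, here is a minimal Python sketch. The function name, the event dictionary layout, and the "low" and "critical" severity levels are assumptions made for the example, not Flowtriq's actual schema or API.

```python
# Hypothetical sketch: evaluating runbook conditions against a detection event.
# Field names mirror the example above; the event layout is an assumption.

SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def conditions_met(event: dict) -> bool:
    """Return True only if every condition in the example holds."""
    checks = [
        SEVERITY_ORDER[event["severity"]] >= SEVERITY_ORDER["high"],
        event["classification"] in ("udp_amplification", "udp_flood"),
        event["node_tag"] == "production",
        event["volume_bps"] >= 1_000_000_000,  # 1 Gbps+
    ]
    return all(checks)


ntp_attack = {
    "severity": "high",
    "classification": "udp_amplification",
    "node_tag": "production",
    "volume_bps": 4_200_000_000,
}
syn_flood = dict(ntp_attack, classification="syn_flood")
staging_attack = dict(ntp_attack, node_tag="staging")

print(conditions_met(ntp_attack))      # all four conditions hold
print(conditions_met(syn_flood))       # skipped: wrong classification
print(conditions_met(staging_attack))  # skipped: not a production node
```

All conditions are combined with a logical AND here, which matches how the paragraph above describes them; a real engine may offer other combinators.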

Actions

Actions are the steps the runbook executes, in order. Flowtriq supports several action types:

  • FlowSpec rule: Push a BGP FlowSpec rule to your edge routers to filter specific traffic patterns.
  • RTBH: Announce a remotely triggered blackhole route for the target IP. This drops all traffic to that IP, legitimate or not, so it is a last resort for volumetric attacks.
  • Webhook: Send an HTTP POST to any URL with incident data (Slack, PagerDuty, custom systems).
  • Status page update: Create or update an incident on your Flowtriq status page.
  • PCAP capture: Trigger a packet capture on the affected node for forensic analysis.
  • Notify: Send email or Slack notifications to specific teams or individuals.
  • Wait: Pause execution for a specified duration (useful for staged mitigation).
  • Conditional branch: Check a condition mid-execution and fork to different action paths.
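
To make "in order" concrete, the sketch below runs a list of actions sequentially and records the result of each one. It is a simplified model, not Flowtriq's engine: the function names and the incident fields are hypothetical, and halting at the first failure is an assumption (a real runbook may be configured to continue instead).

```python
# Hypothetical sketch of an ordered action runner; not Flowtriq's engine.
from typing import Callable

Action = Callable[[dict], str]  # takes the incident context, returns output text


def run_actions(actions: list[tuple[str, Action]], incident: dict) -> list[dict]:
    """Run actions in order, recording each result; stop at the first failure."""
    log = []
    for name, action in actions:
        try:
            output = action(incident)
            log.append({"action": name, "status": "success", "output": output})
        except Exception as exc:  # record the error and halt the runbook
            log.append({"action": name, "status": "failure", "output": str(exc)})
            break
    return log


def push_flowspec(incident: dict) -> str:
    return f"drop {incident['protocol']} src-port {incident['src_port']}"


def notify(incident: dict) -> str:
    return f"notified #incidents about {incident['classification']}"


def branch_on_volume(incident: dict) -> str:
    # Conditional branch: pick a path based on a mid-execution check.
    return "escalate" if incident["volume_bps"] > 10_000_000_000 else "continue"


incident = {"protocol": "udp", "src_port": 123,
            "classification": "udp_amplification", "volume_bps": 2_000_000_000}
steps = [("flowspec", push_flowspec), ("notify", notify), ("branch", branch_on_volume)]
for entry in run_actions(steps, incident):
    print(entry)
```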

Example Runbook: Game Server Provider

A game hosting company runs 200 nodes across three data centers. Their primary concern is player experience: any disruption longer than 5 seconds causes player disconnects, and players who disconnect during a match often do not come back.

Runbook: "Game Server Auto-Mitigate"
Trigger: attack.detected
Conditions:
  - attack.severity >= "medium"
  - target.node.tag IN ["game-us", "game-eu", "game-ap"]

Actions:
  1. FlowSpec: Drop traffic matching attack signature
     - Protocol: from detection engine
     - Source port: from detection engine (if amplification)
     - Rate limit: 100 Mbps per source (if flood)

  2. Status Page: Create incident
     - Component: map node tag to status page component
     - Severity: from detection engine
     - Message: "DDoS attack detected and auto-mitigated.
                 Players may experience brief connectivity issues."

  3. Notify: Slack #game-ops
     - Include: attack type, volume, target node, action taken

  4. PCAP: Capture 30 seconds on target node
     - Filter: attack traffic only
     - Store: incident attachment

  5. Wait: 300 seconds

  6. Conditional: If attack still active after 5 min
     Yes -> Escalate: Page on-call engineer
     No  -> Status Page: Resolve incident

This runbook handles 90% of their incidents without human intervention. The on-call engineer only gets paged for attacks that survive the initial FlowSpec mitigation, which typically means the attack is either very large (requiring upstream intervention) or very sophisticated (requiring manual analysis).
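
Step 1 of this runbook derives the rule from detection data. The sketch below shows one way that mapping could look. The dictionary keys and the `is_amplification` flag are hypothetical, and the sketch applies a single rate limit to the matched traffic rather than the per-source limit the runbook describes.

```python
# Hypothetical sketch: turning detection output into a mitigation rule.
# The dictionary keys are illustrative, not Flowtriq's or any router's schema.

FLOOD_RATE_LIMIT_BPS = 100_000_000  # 100 Mbps, from the runbook above


def build_rule(detection: dict) -> dict:
    """Build a drop rule for amplification, or a rate limit for floods."""
    match = {
        "destination": detection["target"],
        "protocol": detection["protocol"],
    }
    if detection["is_amplification"]:
        # Amplification replies come from a well-known source port
        # (for example 123 for NTP), so match on it and discard.
        match["source_port"] = detection["source_port"]
        return {"match": match, "action": "discard"}
    # FlowSpec's traffic-rate action is expressed in bytes per second,
    # so convert the bits-per-second limit before emitting the rule.
    return {
        "match": match,
        "action": "rate-limit",
        "bytes_per_second": FLOOD_RATE_LIMIT_BPS // 8,
    }


ntp = {"target": "192.0.2.10/32", "protocol": "udp",
       "is_amplification": True, "source_port": 123}
flood = {"target": "192.0.2.10/32", "protocol": "udp",
         "is_amplification": False}

print(build_rule(ntp))
print(build_rule(flood))
```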

Example Runbook: Enterprise

This enterprise runs a multi-homed network under strict change management requirements. It needs documentation for every mitigation action and cannot apply network changes without logging them.

Runbook: "Enterprise DDoS Response"
Trigger: attack.detected
Conditions:
  - attack.severity >= "high"
  - target.subnet IN ["203.0.113.0/24", "198.51.100.0/24"]

Actions:
  1. Notify: PagerDuty (P2 severity)
     - Include full incident context

  2. PCAP: Capture 60 seconds on all affected nodes
     - Full packet capture (headers + payload)

  3. Webhook: POST to ITSM system
     - Create change ticket for mitigation action
     - Attach: attack classification, volume, target

  4. Wait: 30 seconds (allow change ticket creation)

  5. FlowSpec: Apply mitigation rule
     - Log: Include change ticket ID in rule comment

  6. Status Page: Create internal incident
     - Visibility: Internal only (staff)
     - Include: change ticket reference

  7. Notify: Email [email protected]
     - Subject: "DDoS Incident [ID] - Auto-mitigated"
     - Body: Full incident report with PCAP reference

The enterprise runbook prioritizes documentation and audit trail over speed. Every action is logged, every mitigation is tied to a change ticket, and the security team gets a full report without having to reconstruct events from chat logs.
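
The link between steps 3 and 5 is the change ticket ID. The sketch below shows what the webhook body and the rule comment might contain. The payload field names, the `INC-0001` and `CHG-0001` identifiers, and both helper functions are invented for illustration; a real ITSM system defines its own schema, and the sketch builds the payload without sending it.

```python
# Hypothetical sketch: change-ticket payload and rule comment for audit trails.
import json


def change_ticket_payload(incident: dict) -> str:
    """Serialize the fields the runbook attaches to the change ticket."""
    body = {
        "type": "change_request",
        "summary": f"DDoS mitigation for incident {incident['id']}",
        "attack_classification": incident["classification"],
        "attack_volume_bps": incident["volume_bps"],
        "target": incident["target"],
    }
    return json.dumps(body, sort_keys=True)


def rule_comment(incident_id: str, ticket_id: str) -> str:
    """Comment stored with the FlowSpec rule so audits can trace it."""
    return f"incident={incident_id} change={ticket_id}"


incident = {"id": "INC-0001", "classification": "udp_amplification",
            "volume_bps": 3_500_000_000, "target": "203.0.113.0/24"}
payload = change_ticket_payload(incident)
print(payload)
print(rule_comment(incident["id"], "CHG-0001"))
```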

Example Runbook: ISP

This ISP protects subscriber subnets across a regional network. Its challenge is per-subscriber detection at scale, and its mitigation options include both FlowSpec (surgical) and RTBH (last resort).

Runbook: "ISP Subscriber Protection"
Trigger: attack.detected
Conditions:
  - target.type == "subscriber_prefix"

Actions:
  1. FlowSpec: Apply rate limit on attack vector
     - Max rate: 500 Mbps per source prefix
     - Protocol/port: from detection engine

  2. Notify: NOC Slack channel
     - Include: subscriber ID, attack type, volume

  3. Status Page: Create incident (subscriber-facing)
     - Component: Subscriber's region
     - Auto-resolve: when attack ends

  4. Wait: 120 seconds

  5. Conditional: If attack volume > 10 Gbps AND still active
     Yes -> RTBH: Blackhole target IP
            Notify: Subscriber via email
            "Your IP has been temporarily blackholed
             due to a volumetric DDoS attack."
     No  -> Continue monitoring

  6. Wait: 600 seconds

  7. Conditional: If RTBH active AND attack subsided
     Yes -> Remove RTBH route
            Notify: Subscriber "Service restored"
     No  -> Page on-call NOC engineer
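
The two conditional steps in this runbook reduce to a pair of small decisions, sketched below. The function names and return values are hypothetical, and step 7 is modeled exactly as written above.

```python
# Hypothetical sketch of the two conditional branches in the ISP runbook.

RTBH_THRESHOLD_BPS = 10_000_000_000  # 10 Gbps, from step 5


def after_first_wait(volume_bps: int, still_active: bool) -> str:
    """Step 5: escalate to RTBH only for large attacks that are still running."""
    if volume_bps > RTBH_THRESHOLD_BPS and still_active:
        return "announce_rtbh"
    return "continue_monitoring"


def after_second_wait(rtbh_active: bool, attack_subsided: bool) -> str:
    """Step 7: withdraw the blackhole if the attack is over, else page a human."""
    if rtbh_active and attack_subsided:
        return "remove_rtbh"
    return "page_noc"


print(after_first_wait(25_000_000_000, still_active=True))
print(after_first_wait(4_000_000_000, still_active=True))
print(after_second_wait(rtbh_active=True, attack_subsided=True))
print(after_second_wait(rtbh_active=True, attack_subsided=False))
```

Read literally, the "No" branch of step 7 also pages the NOC when no blackhole was ever announced, so a production version would likely add a separate check for that case.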

Execution Logs and Audit Trail

Every runbook execution produces a detailed log that records exactly what happened, when, and why. The execution log includes:

  • Trigger event: The detection event that started the runbook, including full attack metadata.
  • Condition evaluation: Which conditions were checked and their results.
  • Action execution: Each action with its start time, end time, status (success/failure), and output.
  • Conditional branches: Which branch was taken and why.
  • Error details: If any action failed, the full error message and whether the runbook continued or halted.

Execution logs are retained for 90 days on standard plans and indefinitely on enterprise plans. They are exportable as JSON for integration with your SIEM, compliance systems, or incident post-mortem tools.
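
For a sense of what an exported log could look like, the sketch below assembles a record with the five parts listed above and serializes it as JSON. The key names, timestamps, and values are invented for illustration and are not Flowtriq's export format.

```python
# Hypothetical sketch of an execution log record; key names are assumptions.
import json
from datetime import datetime, timedelta, timezone


def action_entry(name: str, started: datetime, ended: datetime,
                 status: str, output: str) -> dict:
    """One action's record: start time, end time, status, and output."""
    return {
        "action": name,
        "started_at": started.isoformat(),
        "ended_at": ended.isoformat(),
        "status": status,
        "output": output,
    }


t0 = datetime(2025, 1, 7, 3, 7, 2, tzinfo=timezone.utc)  # made-up timestamp

log = {
    "trigger": {"event": "attack.detected",
                "classification": "udp_amplification"},
    "conditions": [{"expr": 'attack.severity >= "high"', "result": True}],
    "actions": [
        action_entry("flowspec", t0, t0 + timedelta(seconds=2),
                     "success", "rule announced"),
        action_entry("status_page", t0 + timedelta(seconds=3),
                     t0 + timedelta(seconds=4), "success", "incident created"),
    ],
    "branches": [],  # which branch was taken and why
    "errors": [],    # full error messages for failed actions
}

exported = json.dumps(log, indent=2)
print(exported)
```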

Rollback Detection

Sometimes a mitigation action makes things worse. A FlowSpec rule that is too broad might block legitimate traffic along with the attack. Flowtriq monitors the impact of each mitigation action and can detect when a rule is causing collateral damage.

If the system detects that legitimate traffic dropped significantly after a FlowSpec rule was applied, it flags the rule in the execution log and can optionally roll it back automatically. You configure the rollback sensitivity per runbook:

Rollback settings:
  Monitor period: 60 seconds after rule application
  Legitimate traffic drop threshold: 30%
  Action on threshold breach: Roll back rule + alert engineer

This safety net prevents automated mitigation from becoming automated disruption. The system can act fast because it can also undo fast.
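
The threshold check itself is simple arithmetic. The sketch below compares legitimate traffic before and after a rule is applied, using the 30% threshold from the settings above. How Flowtriq actually separates legitimate traffic from attack traffic is not described here, so the two input rates are assumed to be given, and treating a drop of exactly 30% as a breach is also an assumption.

```python
# Hypothetical sketch of the rollback decision; inputs are assumed to be
# legitimate-traffic rates measured before and after the rule was applied.

def legit_drop_pct(before_bps: float, after_bps: float) -> float:
    """Percentage by which legitimate traffic fell (0 if it did not fall)."""
    if before_bps <= 0:
        return 0.0
    return max(0.0, (before_bps - after_bps) / before_bps * 100)


def should_roll_back(before_bps: float, after_bps: float,
                     threshold_pct: float = 30.0) -> bool:
    """True when the drop reaches the configured threshold."""
    return legit_drop_pct(before_bps, after_bps) >= threshold_pct


# 800 Mbps of legitimate traffic falls to 500 Mbps: a 37.5% drop.
print(legit_drop_pct(800e6, 500e6), should_roll_back(800e6, 500e6))
# 800 Mbps falls to 700 Mbps: a 12.5% drop, below the threshold.
print(legit_drop_pct(800e6, 700e6), should_roll_back(800e6, 700e6))
```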

Building Your First Runbook

Start simple. Your first runbook should handle the most common attack type you see and perform two actions: apply mitigation and send a notification. Do not try to automate everything on day one.

  1. Go to Runbooks in the Flowtriq dashboard and click Create Runbook.
  2. Set the trigger to attack.detected and add a severity condition (start with >= high to limit scope).
  3. Add a FlowSpec action using the detection engine's classification data.
  4. Add a Slack or email notification action.
  5. Enable the runbook in dry-run mode first. In dry-run mode, the runbook logs what it would do without actually applying any mitigation. Run it for a week to verify it would have made the right decisions.
  6. Once you trust the logic, switch to live mode.

After your first runbook is stable, expand: add status page updates, add PCAP captures, add conditional branches for different attack types, and build separate runbooks for different infrastructure segments.

Dry-run mode is your friend. Every new runbook should spend at least one week in dry-run mode before going live. Review the execution logs daily to verify the runbook would have taken the correct action for each event. Automated mitigation that blocks legitimate traffic is worse than no automation at all.
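
Conceptually, dry-run mode is a guard around each side-effecting action: the decision is logged, but nothing is applied. The sketch below illustrates the idea; `run_step` and `push_rule` are hypothetical names, and this is not how Flowtriq implements the feature.

```python
# Hypothetical sketch of a dry-run guard around a mitigation action.
from typing import Callable


def run_step(name: str, apply_fn: Callable[[], None], dry_run: bool) -> dict:
    """Log the step; only call apply_fn when not in dry-run mode."""
    if dry_run:
        return {"action": name, "applied": False, "note": "dry-run: would apply"}
    apply_fn()
    return {"action": name, "applied": True, "note": "applied"}


pushed_rules: list[str] = []


def push_rule() -> None:
    # Stand-in for announcing a FlowSpec rule to the edge routers.
    pushed_rules.append("drop udp src-port 123")


print(run_step("flowspec", push_rule, dry_run=True))
print(pushed_rules)  # still empty: nothing was applied
print(run_step("flowspec", push_rule, dry_run=False))
print(pushed_rules)  # now contains the rule
```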

The "It Runs Itself" Pitch

The value proposition of automated runbooks is not about replacing your network engineers. It is about letting them sleep. Your best engineers should be designing network architecture, planning capacity, and building resilience. They should not be typing FlowSpec rules from memory at 3 AM while half asleep.

Automated runbooks handle the first 5 minutes of every incident: the detection, the classification, the initial mitigation, the notification, and the documentation. Your engineers handle the exceptions: the attacks that survive auto-mitigation, the novel vectors that need manual analysis, and the strategic decisions about long-term defense.

The result is faster mitigation (seconds instead of minutes), better documentation (automated logs instead of reconstructed timelines), and an on-call rotation that people do not dread.

Flowtriq runbooks are included on all plans starting at $9.99/node/month. Start your free 14-day trial or explore pricing to see what fits your infrastructure.
