
The question "how should I set up DDoS alerting?" has no single answer. It depends entirely on how many servers you are protecting. A bash script that polls /proc/net/dev is perfectly fine for a single VPS. That same approach will bury you in noise at 50 servers and become completely unusable at 500. Every stage of infrastructure growth demands a different alerting architecture, different threshold strategies, and different escalation workflows.

This guide walks through four distinct stages of scale and explains exactly what changes at each one. If you are planning to grow from a handful of servers to a fleet, this is the roadmap for keeping your alerting useful instead of overwhelming.

1 Server: The Basics

When you are protecting a single server, the alerting setup is straightforward. You know the server's normal traffic profile. You know its capacity. You are probably the only person who will respond to an alert. The goal at this stage is simple: detect abnormal traffic and get notified fast.

The most practical approach for a single server is a lightweight monitoring script that reads packet and byte counters from the kernel. On Linux, /proc/net/dev gives you real-time interface statistics. A cron job that runs every minute (cron's native minimum; pair it with a second entry offset by a sleep if you want 30-second cadence) can calculate packets per second (PPS) and bits per second (bps), compare them against static thresholds, and fire an alert when something looks wrong.

Here is a minimal monitoring script that does exactly that:

#!/bin/bash
# Minimal single-server DDoS alerting: sample /proc/net/dev twice, compute
# per-second RX rates, and post to Slack when either threshold is exceeded.
IFACE="eth0"
THRESHOLD_PPS=50000
THRESHOLD_BPS=500000000  # 500 Mbps, in bits per second
WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

# RX bytes and RX packets for $IFACE (assumes the modern /proc/net/dev
# layout, where a space follows the interface colon; very old kernels omit it)
read_counters() {
  awk -v iface="$IFACE:" '$1==iface {print $2, $3}' /proc/net/dev
}

BEFORE=($(read_counters))
sleep 5
AFTER=($(read_counters))

RX_BYTES=$(( (AFTER[0] - BEFORE[0]) / 5 ))  # bytes per second
RX_PKTS=$(( (AFTER[1] - BEFORE[1]) / 5 ))   # packets per second
RX_BITS=$(( RX_BYTES * 8 ))                 # bits per second, to match THRESHOLD_BPS

if [ "$RX_PKTS" -gt "$THRESHOLD_PPS" ] || [ "$RX_BITS" -gt "$THRESHOLD_BPS" ]; then
  BPS_HUMAN=$(numfmt --to=iec "$RX_BITS")
  curl -s -X POST "$WEBHOOK" \
    -H 'Content-type: application/json' \
    -d "{\"text\":\"DDoS Alert: ${RX_PKTS} PPS / ${BPS_HUMAN}bps on $IFACE\"}"
fi

This works. For one server, it is honestly all you need to get started. Set your PPS threshold to roughly 3x your normal peak traffic, point it at a Slack channel or email, and you will know when something abnormal happens.
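
To run the check continuously, schedule it with cron. Cron's native resolution is one minute, so the usual trick for a 30-second cadence is a second entry offset by a sleep. The script path below is an assumption; adjust it to wherever you save the script.

# /etc/cron.d/ddos-check -- assumes the script above is saved as /usr/local/bin/ddos-check.sh
* * * * * root /usr/local/bin/ddos-check.sh
* * * * * root sleep 30 && /usr/local/bin/ddos-check.sh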

The problems at this stage are minor but worth noting. Static thresholds require manual tuning. Your "normal" traffic at 2 PM on a Tuesday looks very different from midnight on a Sunday. You will get false positives during legitimate traffic spikes (marketing campaigns, Reddit front page, etc.) and you might miss slow-ramp attacks that stay just under your threshold. But for a single server, these tradeoffs are acceptable.

The biggest risk at the single-server stage is not detection quality. It is the fact that you are the only responder. If the alert fires at 3 AM and you sleep through it, nobody else is watching. Keep this in mind as you grow.

10 Servers: Per-Node Baselines

At 10 servers, the bash-script-per-server approach starts showing cracks. The core problem is that different servers have wildly different traffic profiles. Your web frontend might handle 20,000 PPS normally. Your database server should never see more than 2,000 PPS from external sources. Your game server sees bursty UDP traffic that would trigger alerts on any other node. A single static threshold applied across all servers is either too sensitive for some or too lenient for others.

This is where per-node baselines become essential. Instead of setting a fixed PPS threshold, you need each server to learn its own normal traffic pattern over time. A web server with a 20,000 PPS average should alert at 60,000 PPS. A database server with a 1,500 PPS average should alert at 4,500 PPS. Same multiplier, very different absolute thresholds.

What changes at 10 servers

  • Centralized alerting: Instead of each server sending alerts independently, you need a single dashboard where all nodes report. One Slack channel for all DDoS alerts across every server. One place to check the status of all nodes at once.
  • Per-node thresholds: Each server needs its own baseline, either manually configured or dynamically calculated from historical data. Manual tuning at 10 servers is tedious but still possible.
  • Node identification in alerts: Every alert must include which server is affected. "50,000 PPS detected" is useless without knowing whether that is your edge proxy (normal) or your database (very bad).
  • Protocol-level visibility: At this scale, you start needing to know not just total PPS but what kind of traffic is spiking. A SYN flood looks very different from a UDP amplification attack, and the response is different for each.
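
If you are still on the script-based approach at this stage, a rough protocol-level view is already available from kernel counters. The sketch below compares TCP segments against UDP datagrams received over a five-second window; the field positions assume the standard /proc/net/snmp layout, so treat it as illustrative rather than portable.

#!/bin/bash
# Rough RX protocol breakdown from /proc/net/snmp (illustrative sketch).
# On the Tcp value line, InSegs is the 10th counter (awk field $11);
# on the Udp value line, InDatagrams is the 1st counter (awk field $2).
snap() {
  awk '/^Tcp:/ && $11 ~ /^[0-9]+$/ {tcp=$11}
       /^Udp:/ && $2  ~ /^[0-9]+$/ {udp=$2}
       END {print tcp, udp}' /proc/net/snmp
}
B=($(snap)); sleep 5; A=($(snap))
echo "TCP segments/s: $(( (A[0]-B[0]) / 5 ))   UDP datagrams/s: $(( (A[1]-B[1]) / 5 ))"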

Flowtriq handles this stage by running a lightweight agent on each server that reports metrics to the central dashboard. Each node automatically builds two baselines: a fast baseline (last 5 minutes, catches sudden spikes) and a slow baseline (last 24 hours, adapts to daily traffic patterns). Alerts fire when traffic exceeds both baselines by a configurable multiplier, meaning each node effectively tunes itself.

# Flowtriq agent config (/etc/flowtriq/agent.yml)
# Each node reports to the central dashboard
# Baselines are calculated per-node automatically

server: https://app.flowtriq.com
api_key: ft_node_a3b8c1d4e5f6
interface: eth0
fast_baseline_window: 300    # 5 minutes
slow_baseline_window: 86400  # 24 hours
alert_multiplier_warning: 3  # 3x baseline = warning
alert_multiplier_critical: 8 # 8x baseline = critical
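
To make the multiplier semantics concrete, here is a toy version of the check with made-up numbers (illustrative only, not Flowtriq's internal logic). An alert fires only when current traffic exceeds both baselines by the multiplier, so a burst that is unusual for the last 5 minutes but normal for the last 24 hours stays quiet.

#!/bin/bash
# Toy dual-baseline check (illustrative; not Flowtriq internals).
CURRENT_PPS=95000
FAST_BASELINE=18000    # mean PPS over the last 5 minutes
SLOW_BASELINE=22000    # mean PPS over the last 24 hours
MULTIPLIER=3           # warning threshold

if [ "$CURRENT_PPS" -gt $(( FAST_BASELINE * MULTIPLIER )) ] && \
   [ "$CURRENT_PPS" -gt $(( SLOW_BASELINE * MULTIPLIER )) ]; then
  echo "warning: ${CURRENT_PPS} PPS exceeds ${MULTIPLIER}x both baselines"
fi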

The key insight at 10 servers is that alerting becomes a data management problem. You are no longer asking "is this server under attack?" You are asking "which of my servers, if any, are experiencing abnormal traffic right now?" That requires centralization and per-node context.

50 Servers: Fighting Alert Fatigue

At 50 servers, alert fatigue becomes the dominant problem. Even with per-node baselines and a 3x multiplier, you will see multiple alerts per day from normal traffic variance across 50 nodes. A legitimate CDN cache flush, a deployment that temporarily changes traffic patterns, a scheduled load test. Each of these generates alerts that someone has to look at, evaluate, and dismiss.

If every alert goes to the same Slack channel with the same urgency, your team will start ignoring that channel. This is not a theoretical risk. It is the most common failure mode for DDoS alerting at this scale. The solution requires three capabilities that were unnecessary at 10 servers:

Severity classification

Not all anomalies are equal. A 3x baseline deviation that lasts 8 seconds is very different from a 15x deviation sustained for 5 minutes. Your alerting system must classify events into severity tiers (warning, high, critical) and route them to different channels. Warnings go to a Slack channel that engineers check during business hours. Critical alerts page the on-call engineer immediately.
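
A simple way to reason about the tiers is to combine the deviation ratio with how long it has been sustained. The cutoffs in the sketch below are illustrative, not a Flowtriq specification.

#!/bin/bash
# Illustrative severity classification from deviation ratio and duration.
RATIO=15        # current traffic divided by baseline
DURATION=300    # seconds the anomaly has persisted

if   [ "$RATIO" -ge 10 ] && [ "$DURATION" -ge 60 ]; then SEVERITY=critical
elif [ "$RATIO" -ge 5  ] && [ "$DURATION" -ge 30 ]; then SEVERITY=high
elif [ "$RATIO" -ge 3  ] && [ "$DURATION" -ge 15 ]; then SEVERITY=warning
else SEVERITY=none
fi
echo "severity: $SEVERITY"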

Escalation policies

When a critical alert fires, what happens if the on-call engineer does not respond within 5 minutes? At 50 servers, you need a defined escalation chain. Flowtriq supports this through escalation policies that you configure per severity level:

# Flowtriq escalation policy example
#
# Level 1: Notify primary on-call via PagerDuty (immediate)
# Level 2: Notify secondary on-call + team lead (after 5 min)
# Level 3: Notify VP Engineering (after 15 min)
#
# Severity mapping:
#   warning  -> Slack #ddos-alerts only, no escalation
#   high     -> PagerDuty Level 1
#   critical -> PagerDuty Level 1 + auto-escalate through all levels

Maintenance windows

Deployments, migrations, and planned maintenance cause traffic anomalies that look like attacks. At 50 servers, someone is deploying something almost every day. Without maintenance windows, your team wastes time investigating alerts that were caused by a known deployment. Flowtriq lets you schedule maintenance windows per node or per node group. During a window, alerts are suppressed or downgraded to informational severity, and the audit log records that the suppression was active.
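
A window definition needs surprisingly little information. The field names below are an illustrative sketch, not a documented Flowtriq schema:

# Hypothetical maintenance window definition
#
# name:      weekly-deploy
# nodes:     web-*                       # node or node group pattern
# schedule:  Tue 14:00-15:00 UTC         # recurring window
# behavior:  downgrade to informational  # logged, never paged
# audit:     suppression recorded in the audit log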

Alert grouping

If an attacker targets your network with a carpet-bombing attack (low-volume traffic spread across many IPs), 30 of your 50 servers might trigger alerts simultaneously. Receiving 30 individual alert messages is worse than useless. It creates panic and makes it harder to understand the scope of the attack. At this scale, your alerting system needs to group related events. Instead of "Server A: warning, Server B: warning, Server C: warning..." repeated 30 times, you need "Cluster-wide anomaly detected: 30 of 50 nodes showing elevated traffic. Likely carpet-bomb pattern."
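
If you route alerts through your own webhook layer, grouping can be as simple as counting the distinct nodes that alerted inside a short window before anything is forwarded. The log path and line format in this sketch are assumptions.

#!/bin/bash
# Illustrative grouping: collapse per-node alerts from the last 60 seconds
# into one summary. Assumes lines of "<epoch> <node> <severity>" in the log;
# below the node threshold, alerts would be forwarded individually instead.
NOW=$(date +%s)
awk -v now="$NOW" '$1 > now - 60 && !($2 in seen) {seen[$2]=1; n++} END {
  if (n >= 10)
    printf "Cluster-wide anomaly: %d nodes alerting in the last 60s\n", n
}' /var/log/ddos-alerts.log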

The rule of thumb at 50 servers: if your on-call engineer gets paged more than twice per week for non-critical events, your severity classification needs tuning. The goal is zero false pages, not zero false detections. Detections can be logged. Pages must be actionable.

500 Servers: Automated Response

At 500 servers, human response to individual alerts is no longer viable as a primary defense. The math is simple: if each server has even a 0.1% chance of experiencing an alert-worthy anomaly on any given day, you are looking at one event every two days across the fleet. During an actual coordinated attack targeting multiple servers, you might see dozens of critical alerts within seconds. No human team can triage, investigate, and respond to that volume in real time.

This is where alerting transitions from "notify a human" to "trigger an automated response and inform the human about what was done." The alerting system becomes the first line of defense, not just a notification mechanism.

Auto-mitigation

At enterprise scale, Flowtriq integrates with upstream mitigation systems to respond automatically. When a critical-severity attack is detected, the system can push BGP FlowSpec rules to your edge routers within seconds, filtering malicious traffic before it reaches the target server. For cloud deployments, it can trigger scrubbing center activation via API (Cloudflare, AWS Shield Advanced, Akamai Prolexic) without waiting for a human to log in and click buttons.

# Auto-mitigation rule (Flowtriq dashboard)
#
# Trigger:    severity >= critical AND duration > 30s
# Action:     Push FlowSpec rule to edge routers
# FlowSpec:   Rate-limit source-prefix/32 matching
#             detected attack protocol to 1000 PPS
# Notify:     PagerDuty (informational) + Slack #noc
# Auto-expire: 30 minutes (re-evaluate, extend if needed)
# Audit:      Full log entry with rule details + trigger data

The key word is "automatically." The system detects, classifies, mitigates, and logs the event. The on-call engineer receives a notification that says "Attack detected on nodes web-042 through web-048. FlowSpec rule deployed at 02:14 UTC. Attack mitigated. PPS returned to baseline at 02:16 UTC." The engineer reviews the action taken, confirms it was appropriate, and goes back to sleep.

Correlation and pattern detection

At 500 servers, you need to distinguish between isolated incidents and coordinated campaigns. Is this a single server being targeted, or is this a carpet-bomb attack spread across your entire fleet? Are the attack vectors rotating (SYN flood followed by UDP amplification followed by DNS reflection), indicating a sophisticated attacker? Are the source IPs part of a known botnet?

Flowtriq's correlation engine groups events across nodes by time window, source IP overlap, and attack vector similarity. Instead of 47 individual alerts, you get a single incident: "Coordinated multi-vector attack targeting web cluster nodes 12 through 58. Primary vector: UDP amplification from 2,400 unique sources. Secondary vector: SYN flood from 800 sources. 3 source IP ranges overlap with Mirai IOC database."
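
Conceptually, a correlation rule needs little more than a grouping window and a similarity test. The fields below are an illustrative sketch, not Flowtriq's actual rule schema:

# Hypothetical correlation rule
#
# group_window:   120s                   # events this close in time are candidates
# group_by:       source-prefix overlap >= 20% OR same attack vector
# min_nodes:      5                      # below this, keep events as individual alerts
# output:         one incident with per-node breakdown and combined source list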

Role-based access and audit trails

Enterprise deployments have compliance requirements. Your NOC team needs to see alerts and mitigation status. Your network engineers need to modify detection policies and escalation rules. Your security team needs forensic data (PCAP captures, attack timelines). Your compliance team needs audit logs proving that attacks were detected and responded to within SLA timelines.

Flowtriq supports four roles (owner, admin, analyst, readonly) with granular permissions. Every action is logged to the audit trail: who changed a threshold, who triggered a manual mitigation, who acknowledged an alert and when. This audit data is essential for SOC 2, PCI DSS, and ISO 27001 compliance reporting.

Common Mistakes at Every Scale

Regardless of how many servers you operate, these mistakes show up repeatedly:

  • Single-channel alerting: Sending every alert to one Slack channel or one email inbox. Warnings and critical alerts must go to different destinations with different urgency levels. A warning in Slack is fine. A critical alert needs to page someone.
  • Identical thresholds across all servers: A web proxy and a database server do not have the same traffic profile. Using the same PPS threshold for both means one is too sensitive and the other is not sensitive enough.
  • No escalation policy: If the primary on-call does not respond in 5 minutes, the alert should escalate. Without escalation, a single missed notification means an attack goes without a response for hours.
  • No maintenance windows: Every deployment, every migration, every load test triggers false alerts. Without a way to suppress alerts during planned changes, your team stops trusting the alerting system entirely.
  • No automated response at scale: If you have more than 100 servers and your only mitigation strategy is "an engineer logs in and does something manually," you will not respond fast enough to prevent damage during a real attack.
  • Alerting on every anomaly instead of confirmed attacks: A 10-second traffic spike is not the same as a sustained DDoS attack. If your system does not distinguish between transient anomalies and real attacks, you will drown in noise.

Solving Alert Fatigue

Alert fatigue is the number one reason DDoS detection systems fail in production. The system detects real attacks perfectly, but the team has been conditioned to ignore alerts because 95% of them are false positives or low-severity events that did not need attention.

The fix is a combination of better detection and better routing:

  • Dynamic baselines instead of static thresholds: A server that normally handles 40,000 PPS at 2 PM should not alert at 50,000 PPS. Dynamic baselines adapt to time-of-day and day-of-week patterns, reducing false positives by 60-80% compared to static thresholds.
  • Duration-based filtering: Require anomalies to persist for at least 15-30 seconds before generating an alert. This eliminates the vast majority of transient spikes while still catching real attacks.
  • Severity-based routing: Warnings go to Slack. High-severity events go to PagerDuty with standard urgency. Critical events page immediately. Daily summary emails cover everything that happened in the last 24 hours for review during business hours.
  • Deduplication: One attack should produce one alert, not a new notification every 30 seconds for the duration of the event. Use dedup keys to group ongoing events into a single incident.
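
Deduplication usually comes down to a stable key. The PagerDuty Events API v2, for example, accepts a dedup_key field: repeated triggers with the same key update the existing incident instead of opening a new one. The key scheme and environment variable below are assumptions.

#!/bin/bash
# One incident per ongoing attack: reuse the same dedup_key for its duration.
# PD_ROUTING_KEY is the integration key from your PagerDuty service (assumed
# to be exported in the environment).
NODE_NAME="web-042"
ATTACK_VECTOR="udp-amplification"
DEDUP_KEY="ddos-${NODE_NAME}-${ATTACK_VECTOR}"

curl -s -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d "{
    \"routing_key\": \"${PD_ROUTING_KEY}\",
    \"event_action\": \"trigger\",
    \"dedup_key\": \"${DEDUP_KEY}\",
    \"payload\": {
      \"summary\": \"DDoS: ${ATTACK_VECTOR} on ${NODE_NAME}\",
      \"source\": \"${NODE_NAME}\",
      \"severity\": \"critical\"
    }
  }"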

Channel Strategy at Scale

The right channel strategy distributes alerts based on urgency and purpose:

  • Slack or Discord: Warnings and informational events. The team monitors this channel during work hours. No expectation of immediate response.
  • PagerDuty or OpsGenie: High and critical events. These page the on-call engineer with phone calls and SMS. Escalation policies ensure someone always responds.
  • Webhooks: Auto-mitigation triggers. When a critical event fires, a webhook pushes FlowSpec rules or activates cloud scrubbing. No human in the loop for the initial response.
  • Email: Daily and weekly summary reports. Aggregate data on detection events, false positive rates, and mitigation actions taken. Useful for management reporting and compliance documentation.

Flowtriq supports all of these channels natively. You configure them once in the dashboard, assign severity filters to each channel, and the system routes every event to the right destination automatically.

How Flowtriq Scales With You

The same Flowtriq agent runs on every server regardless of fleet size. At 1 server, you get dynamic baselines and Slack alerts out of the box. At 10 servers, the centralized dashboard shows all nodes with per-node baselines. At 50 servers, you add escalation policies and maintenance windows to manage alert volume. At 500 servers, you enable auto-mitigation hooks and correlation rules to handle events that no human team could triage manually.

You do not need to rearchitect your alerting as you grow. You add capabilities on top of what already exists. The detection engine is the same at every tier. The agent footprint is the same. What changes is how you route, escalate, and respond to the events that detection produces.

Start where you are. Flowtriq works for a single server at $9.99/mo per node with a 7-day free trial. As your infrastructure grows, add escalation policies, maintenance windows, and auto-mitigation without changing your detection setup. Start your free trial and set up alerting that scales with your infrastructure.
