Why Every DDoS Tool Blocks Legitimate Traffic (And How to Fix It)

We build Flowtriq, a DDoS detection platform, and we have false positives too. This post is not a claim that we solved the problem. It is an honest look at why false positives happen across all DDoS tools, what architectural decisions make them worse or better, and where the industry still has work to do.

The false positive problem is universal

Every DDoS detection system faces the same fundamental challenge: distinguishing between a legitimate traffic spike and a malicious flood. Both look like a sudden increase in packets per second. Both deviate from normal baselines. The difference is in the structure of the traffic, and detecting that structure accurately, in real time, is genuinely hard.

A product launch sends your web server from 5,000 requests/second to 50,000 requests/second in minutes. An HTTP flood does the same thing, and a SYN flood produces a comparable jump in packets per second. To a threshold-based detector, these events look identical: traffic exceeded the configured limit, so it must be an attack. The result is that your product launch gets classified as a DDoS attack, your detector fires mitigation rules, and your legitimate customers get blocked.

This is not a theoretical problem. It happens regularly to operators running every major DDoS tool on the market.

"We had a flash sale that brought our site traffic to 10x normal. Our DDoS tool triggered a BGP blackhole within 30 seconds and took down our entire e-commerce operation for 12 minutes. The attack was our own customers."

Why false positives happen: the four root causes

1. Static thresholds

The oldest and most common approach to DDoS detection is a static threshold: "if packets per second exceeds X, trigger an alert." Many open-source detection tools work this way, and it is the default configuration for many commercial tools as well.

The problem with static thresholds is that they require a human to pick the right number. Set it too low and you get false positives on every traffic spike. Set it too high and you miss real attacks. The "right" number changes constantly as your traffic patterns evolve, and most operators set it once during initial deployment and never touch it again.

Static thresholds also cannot account for servers with different traffic profiles. For a busy mail server, 50,000 connections/second may be normal. For a small DNS server, a sudden 50,000 queries/second might mean it is being abused in a reflection attack. A single threshold applied across both servers will be wrong for at least one of them.
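
To make the mismatch concrete, here is a minimal Python sketch. The server names, the rates, and the 1.5x headroom factor are made-up values for illustration, not recommendations.

```python
# Hypothetical peak rates seen during normal operation (units: whatever
# your detector counts per second). These numbers are illustrative only.
NORMAL_PEAK = {"mail-01": 50_000, "dns-01": 4_000}

GLOBAL_THRESHOLD = 20_000  # one static limit shared by every server


def exceeds_global(rate: float) -> bool:
    """Single static threshold applied to all servers."""
    return rate > GLOBAL_THRESHOLD


def exceeds_per_server(server: str, rate: float, headroom: float = 1.5) -> bool:
    """Threshold derived from each server's own normal peak."""
    return rate > NORMAL_PEAK[server] * headroom


# mail-01 at its normal peak: only the global limit fires (a false positive).
print(exceeds_global(50_000), exceeds_per_server("mail-01", 50_000))
# dns-01 at almost 4x its normal peak: only the per-server limit fires.
print(exceeds_global(15_000), exceeds_per_server("dns-01", 15_000))
```

The shared limit is wrong in both directions at once: it blocks the mail server on a normal day and stays silent while the DNS server sees several times its usual load.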

2. Short learning periods

Some tools attempt to learn "normal" traffic automatically, but use learning windows that are too short to capture real-world traffic patterns. A 24-hour learning period misses weekly cycles. A 7-day window misses monthly billing spikes. A 30-day window misses seasonal patterns.

Worse, if an attack happens during the learning period, the tool learns the attack traffic as "normal" and then fails to detect similar attacks afterward. This is a known issue with several commercial products that use fixed learning windows without attack-aware filtering.
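
One way to picture attack-aware filtering is a baseline that simply refuses to learn from samples flagged as attack traffic. This is a simplified sketch, not any vendor's implementation. The `under_attack` flag is an assumption: in practice it would have to come from the detector itself or from an operator's confirmation.

```python
from collections import deque


class AttackAwareBaseline:
    """Rolling baseline that does not learn from samples flagged as attack."""

    def __init__(self, window: int = 300):
        self.samples = deque(maxlen=window)

    def observe(self, pps: float, under_attack: bool) -> None:
        # Flagged samples are still seen by the detector, but they never
        # enter the history that defines "normal".
        if not under_attack:
            self.samples.append(pps)

    def mean(self):
        """Average of learned samples, or None before any have been learned."""
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)


baseline = AttackAwareBaseline()
for _ in range(10):
    baseline.observe(1_000, under_attack=False)
for _ in range(5):
    baseline.observe(900_000, under_attack=True)  # excluded from learning
print(baseline.mean())  # 1000.0
```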

3. Aggressive blocking without confidence scoring

Many DDoS tools treat detection as binary: traffic is either normal or it is an attack. There is no confidence score, no severity gradient, no middle ground. When the tool decides something is an attack, it immediately takes action, whether that is a firewall rule, a BGP blackhole, or a rate limit. There is no "I think this might be an attack, let me alert first and act second" mode.

This is where the most damaging false positives come from. It is one thing to send a false alert to Slack. It is another to push a BGP blackhole for a /24 that contains 200 customer IPs because one server showed unusual traffic patterns. The former wastes attention. The latter causes an outage.
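
A graduated response can be sketched as a mapping from a confidence score to the least destructive action that fits. The cut-offs below (0.3, 0.7, 0.9) and the `auto_mitigate` switch are hypothetical values chosen for illustration; real tools will define their own.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    ALERT = "alert"
    RATE_LIMIT = "rate_limit"
    BLACKHOLE = "blackhole"


def choose_action(confidence: float, auto_mitigate: bool = True) -> Action:
    """Pick a response that scales with how sure the detector is."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence < 0.3:
        return Action.NONE
    if confidence < 0.7 or not auto_mitigate:
        # Alert first: a human decides whether anything gets blocked.
        return Action.ALERT
    if confidence < 0.9:
        return Action.RATE_LIMIT
    return Action.BLACKHOLE


print(choose_action(0.5))                        # Action.ALERT
print(choose_action(0.95))                       # Action.BLACKHOLE
print(choose_action(0.95, auto_mitigate=False))  # Action.ALERT
```

The point of the design is that the most damaging action is reserved for the highest confidence, and it can be switched off entirely without losing the alert.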

4. Signature-based detection without context

Signature-based detection matches traffic against known attack patterns (specific packet flags, payload sizes, protocol anomalies). This approach is fast and accurate for known attack types, but it produces false positives when legitimate traffic happens to match a signature.

For example, some DDoS signatures flag any UDP packet over 1400 bytes as a potential amplification response. But legitimate DNS responses (especially DNSSEC-signed responses), QUIC connections, and VPN tunnel traffic all produce UDP packets in that size range. A signature-only detector will flag all of these as potential attacks.
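
The sketch below contrasts a size-only signature with one that adds two pieces of context: the source port, and whether this host recently sent a query to the sender. The `UdpPacket` type and the `recently_queried` set are hypothetical helpers, and the port list is a small illustrative sample of services commonly abused for amplification, not a complete one.

```python
from dataclasses import dataclass

# Illustrative sample: chargen, DNS, NTP, CLDAP, SSDP, memcached.
AMPLIFICATION_SOURCE_PORTS = {19, 53, 123, 389, 1900, 11211}


@dataclass(frozen=True)
class UdpPacket:
    src_ip: str
    src_port: int
    size: int


def size_only_match(pkt: UdpPacket) -> bool:
    """The context-free signature: any large UDP packet is suspicious."""
    return pkt.size > 1400


def contextual_match(pkt: UdpPacket, recently_queried: set) -> bool:
    """Flag large UDP only if it is an unsolicited reply from a reflector port."""
    if pkt.size <= 1400:
        return False
    if pkt.src_port not in AMPLIFICATION_SOURCE_PORTS:
        return False  # e.g. QUIC on 443 or a VPN tunnel
    return pkt.src_ip not in recently_queried  # solicited replies pass


resolver = "198.51.100.53"
dnssec_reply = UdpPacket(resolver, 53, 1450)
quic = UdpPacket("203.0.113.7", 443, 1450)
unsolicited = UdpPacket("203.0.113.99", 53, 1450)

for pkt in (dnssec_reply, quic, unsolicited):
    print(size_only_match(pkt), contextual_match(pkt, {resolver}))
```

The size-only check flags all three packets. The contextual check flags only the unsolicited reply.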

What makes false positives worse

Beyond the root causes, certain operational patterns amplify the false positive problem:

One baseline for the entire network. Tools that build a single traffic baseline for an entire subnet or network miss per-server variance. Your web servers, mail servers, DNS servers, and game servers all have different traffic profiles. Averaging them into one baseline guarantees false positives on the outliers.

No startup grace period. When a new server comes online, it has no baseline. Some tools apply the network-wide threshold immediately, which means any legitimate traffic on the new server that exceeds the average triggers a false detection. Other tools use aggressive defaults during the learning period, producing a burst of false positives until the baseline stabilizes.

No hysteresis. Hysteresis means requiring traffic to exceed a threshold by a significant margin before triggering, and requiring it to drop well below the threshold before clearing. Without hysteresis, traffic that oscillates around the threshold produces rapid-fire alerts, each one a potential false positive that triggers mitigation actions.

No distinction between inbound and outbound. Some tools monitor total traffic without distinguishing direction. A server that sends a large backup to another server (high outbound) gets flagged the same as a server receiving a volumetric attack (high inbound). These are fundamentally different events that require different responses.

How adaptive baselining reduces false positives

Adaptive baselining replaces static thresholds with dynamic thresholds that adjust continuously based on observed traffic. The core idea is simple: instead of a human picking a number, the system observes traffic over time and sets thresholds based on what it has seen.

The implementation details matter enormously. Here is what separates useful adaptive baselining from the kind that just adds a different flavor of false positives:

P99 sliding windows

Rather than using the average traffic rate as the baseline (which gets skewed by spikes), effective baselining uses percentile-based measurements. The p99 (99th percentile) of a rolling window captures the "almost-highest" traffic the server has seen, excluding extreme outliers. This produces a threshold that accommodates normal spikes but still catches genuine anomalies.

At Flowtriq, we use the 99th percentile of a 300-sample rolling window. This means the threshold reflects the server's recent behavior, adapts as traffic patterns change, and is not thrown off by a single unusual spike. When a traffic burst exceeds the p99, the system evaluates it further (packet structure, protocol distribution, source entropy) before deciding whether to alert.
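
A rolling p99 over 300 samples can be sketched in a few lines of Python. This is an illustration of the idea, not our production code: the nearest-rank percentile method, the `min_samples` guard, and the class name are assumptions made for the example.

```python
from collections import deque


class P99Baseline:
    """Rolling 99th-percentile threshold over the most recent samples."""

    def __init__(self, window: int = 300, min_samples: int = 30):
        self.samples = deque(maxlen=window)  # oldest samples fall off
        self.min_samples = min_samples

    def add(self, pps: float) -> None:
        self.samples.append(pps)

    def p99(self):
        """Nearest-rank p99, or None until enough samples exist."""
        n = len(self.samples)
        if n < self.min_samples:
            return None
        ordered = sorted(self.samples)
        rank = (99 * n + 99) // 100  # ceil(0.99 * n) in integer arithmetic
        return ordered[rank - 1]

    def exceeds(self, pps: float) -> bool:
        threshold = self.p99()
        return threshold is not None and pps > threshold


baseline = P99Baseline()
for pps in range(1, 301):      # 300 samples: 1, 2, ..., 300
    baseline.add(pps)
print(baseline.p99())          # 297
baseline.add(10_000)           # one extreme spike enters the window
print(baseline.p99())          # 298: the threshold barely moves
```

With 300 samples, the top three values sit above the p99, which is why a single spike does not drag the threshold upward.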

Hysteresis bands

Instead of a single threshold, effective detection uses two: a trigger threshold (when to start investigating) and a clear threshold (when to stop). The gap between them prevents rapid-fire alerts when traffic oscillates around the detection boundary.

In practice, this means an attack must exceed the baseline by a meaningful margin to trigger detection, and traffic must drop well below the trigger point before the system considers the attack over. This suppresses the repeated trigger-and-clear cycles, and the false positives that come with them, when traffic hovers near the threshold.
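
The two-threshold idea is a small state machine. In this sketch the trigger and clear levels are expressed as multiples of the baseline; the 1.5x and 1.1x ratios are assumptions picked for the example, not recommended settings.

```python
class HysteresisDetector:
    """Two-threshold detector: trigger high, clear low."""

    def __init__(self, trigger_ratio: float = 1.5, clear_ratio: float = 1.1):
        if clear_ratio >= trigger_ratio:
            raise ValueError("clear_ratio must be below trigger_ratio")
        self.trigger_ratio = trigger_ratio
        self.clear_ratio = clear_ratio
        self.active = False

    def update(self, pps: float, baseline: float) -> bool:
        """Feed one sample; return True while an event is considered active."""
        if not self.active and pps > baseline * self.trigger_ratio:
            self.active = True
        elif self.active and pps < baseline * self.clear_ratio:
            self.active = False
        return self.active


detector = HysteresisDetector()
traffic = [1400, 1600, 1450, 1550, 1200, 1050, 1400]
states = [detector.update(pps, baseline=1000) for pps in traffic]
print(states)  # [False, True, True, True, True, False, False]
```

A single threshold at 1,500 would have triggered at 1,600, cleared at 1,450, and triggered again at 1,550. The band turns that into one event.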

Startup grace periods

When a new server comes online or the agent is first deployed, there is no historical baseline to compare against. Rather than applying aggressive defaults and producing false positives during the learning period, a well-designed system uses a grace period where detection is active but mitigation actions are suppressed.

During the grace period, the system builds its baseline, learns normal traffic patterns, and generates alerts in observation mode. Operators can review these alerts to verify the system's understanding of normal before enabling automated mitigation. This keeps the "day one false positive storm" that plagues many deployments from turning into blocked customers.
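
In code, a grace period is a gate between detection and mitigation. The sketch below is a minimal version: the one-hour default and the `"alert"` / `"mitigate"` action names are hypothetical, and timestamps are passed in explicitly so the behavior is easy to test.

```python
class GraceGate:
    """Keeps detection on but suppresses mitigation for a new node."""

    def __init__(self, started_at: float, grace_seconds: float = 3600.0):
        self.started_at = started_at
        self.grace_seconds = grace_seconds

    def mode(self, now: float) -> str:
        in_grace = now - self.started_at < self.grace_seconds
        return "observe" if in_grace else "enforce"

    def handle(self, detected: bool, now: float) -> list:
        """Return the actions allowed for this detection result."""
        if not detected:
            return []
        actions = ["alert"]  # alerts are always generated
        if self.mode(now) == "enforce":
            actions.append("mitigate")
        return actions


gate = GraceGate(started_at=0.0)
print(gate.handle(True, now=600.0))    # ['alert']
print(gate.handle(True, now=7200.0))   # ['alert', 'mitigate']
```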

Multi-signal classification

Volume alone is not enough to determine whether traffic is an attack. Effective classification examines multiple signals simultaneously:

Packet size distribution: Attacks often produce uniform packet sizes (all 1400-byte UDP, all 40-byte SYN). Legitimate traffic shows varied packet sizes.
Source IP entropy: Attacks from botnets produce high source entropy (thousands of unique source IPs). Legitimate spikes from real users show clustering around geographic regions and repeat visitors.
Protocol ratios: A sudden shift from 80% TCP / 20% UDP to 15% TCP / 85% UDP is suspicious regardless of volume. Normal traffic rarely changes protocol distribution this dramatically.
Flag analysis: SYN floods produce packets with only the SYN flag set. Normal TCP traffic has a balanced mix of SYN, ACK, PSH, and FIN flags.
Port distribution: Legitimate traffic targets specific service ports. Attacks often target random ports or specific amplification-related ports.

When multiple signals align (high volume, uniform packet sizes, high source entropy, abnormal protocol ratios), the confidence that the event is an actual attack increases significantly. When only volume changes but everything else looks normal, the system is more conservative about triggering.
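
The sketch below combines four of these signals into a single confidence value. It is a deliberately simple illustration: the `Packet` type, the equal weighting, and every cut-off (0.9, 0.8) are assumptions for the example, and a real classifier would weigh signals rather than count them.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_ip: str
    size: int
    syn_only: bool = False  # TCP packet with SYN set and no other flags


def source_spread(packets) -> float:
    """Shannon entropy of source IPs scaled to 0..1 (1 = all sources unique)."""
    n = len(packets)
    if n < 2:
        return 0.0
    counts = Counter(p.src_ip for p in packets)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / math.log2(n)


def dominant_size_share(packets) -> float:
    """Fraction of packets that share the single most common size."""
    if not packets:
        return 0.0
    counts = Counter(p.size for p in packets)
    return counts.most_common(1)[0][1] / len(packets)


def syn_only_share(packets) -> float:
    if not packets:
        return 0.0
    return sum(p.syn_only for p in packets) / len(packets)


def attack_confidence(packets, volume_ratio: float) -> float:
    """Fraction of signals that look attack-like (equal weights assumed)."""
    signals = [
        volume_ratio > 1.0,                  # above the volume baseline
        dominant_size_share(packets) > 0.9,  # uniform packet sizes
        source_spread(packets) > 0.9,        # very high source entropy
        syn_only_share(packets) > 0.8,       # SYN-dominated flags
    ]
    return sum(signals) / len(signals)


# Synthetic samples: a SYN flood from 1,000 unique sources, and a
# legitimate surge from 50 repeat visitors with varied packet sizes.
flood = [Packet(f"10.0.{i // 250}.{i % 250}", 40, True) for i in range(1000)]
crowd = [Packet(f"192.0.2.{i % 50}", 60 + i % 200) for i in range(1000)]

print(attack_confidence(flood, volume_ratio=5.0))  # 1.0
print(attack_confidence(crowd, volume_ratio=5.0))  # 0.25
```

Both samples are at five times the baseline volume, but only the flood lights up the other three signals.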

See Flowtriq's adaptive detection in action

Per-node percentile baselines, hysteresis bands, multi-signal classification, and startup grace periods. $9.99/node/month, 14-day free trial.

What we are still building (honestly)

Flowtriq's adaptive baselining is better than static thresholds, but it is not perfect. Here are the areas where we are actively working to improve:

False positive feedback loop. Currently, when an operator identifies a false positive, they can dismiss the alert, but that dismissal does not feed back into the detection model. We are building a feedback mechanism where dismissed false positives adjust the baseline and classification weights for that specific server. This is not live yet.

Time-of-day awareness. Our current baseline adapts continuously but does not maintain separate profiles for different times of day. A server that handles 10x more traffic during business hours than overnight should have different thresholds for each period. We are working on time-bucketed baselines that maintain separate p99 calculations for each hour of the day. This is in development.
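
To show what a time-bucketed baseline means, here is a rough sketch that keeps one rolling window per hour of the day. Since this feature is still in development, treat it as an illustration of the concept only: the class name, the per-bucket window size, and the nearest-rank percentile are all assumptions.

```python
from collections import defaultdict, deque
from datetime import datetime


class HourlyP99:
    """One rolling p99 window per hour of the day (24 buckets)."""

    def __init__(self, window: int = 300):
        self.buckets = defaultdict(lambda: deque(maxlen=window))

    def add(self, ts: datetime, pps: float) -> None:
        self.buckets[ts.hour].append(pps)

    def p99(self, ts: datetime):
        """Nearest-rank p99 for the hour of `ts`, or None if no data yet."""
        samples = self.buckets.get(ts.hour)
        if not samples:
            return None
        ordered = sorted(samples)
        rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
        return ordered[rank - 1]


baseline = HourlyP99()
for minute in range(60):
    baseline.add(datetime(2024, 1, 8, 14, minute), 10_000)  # business hours
    baseline.add(datetime(2024, 1, 8, 3, minute), 1_000)    # overnight

print(baseline.p99(datetime(2024, 1, 9, 14, 30)))  # 10000
print(baseline.p99(datetime(2024, 1, 9, 3, 30)))   # 1000
```

With separate buckets, 8,000 packets/second at 3 a.m. stands out against the overnight threshold, while the same rate at 2 p.m. does not.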

Cross-node correlation. When 10 servers in the same network see traffic spikes simultaneously, the probability that it is a real attack (rather than a legitimate traffic event) increases. We do not currently correlate across nodes for detection decisions, though we do for dashboard visualization. Cross-node correlation for detection is on our roadmap.

Protocol-specific tuning. Our classification engine treats DNS traffic and HTTP traffic with the same analysis pipeline. DNS traffic has specific patterns (query/response ratios, QNAME distribution, response sizes) that could improve detection accuracy if analyzed with protocol-aware logic. This is planned but not yet implemented.

We are sharing this because the DDoS detection industry has a habit of claiming problems are solved when they are merely improved. False positives are an ongoing engineering challenge, not a checkbox feature. Any vendor who tells you they have zero false positives is either lying or not detecting enough.

What operators can do today

Regardless of which DDoS tool you run, there are practical steps to reduce false positives:

Use alert-first, block-second policies. Configure your tool to alert on detection and require human confirmation (or a time delay) before triggering mitigation. This gives you a window to verify that the event is real before taking action that could affect customers.
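
An alert-first policy with a time delay can be modeled as a pending mitigation that an operator may confirm or dismiss before it fires. The class below is a hypothetical sketch, and the two-minute default delay is an arbitrary example value.

```python
class PendingMitigation:
    """Alert immediately; mitigate only after confirmation or a delay."""

    def __init__(self, detected_at: float, delay_seconds: float = 120.0):
        self.detected_at = detected_at
        self.delay_seconds = delay_seconds
        self.confirmed = False
        self.dismissed = False

    def confirm(self) -> None:
        self.confirmed = True

    def dismiss(self) -> None:
        self.dismissed = True

    def should_mitigate(self, now: float) -> bool:
        if self.dismissed:
            return False  # operator marked it as a false positive
        if self.confirmed:
            return True   # operator verified the attack
        return now - self.detected_at >= self.delay_seconds


event = PendingMitigation(detected_at=0.0)
print(event.should_mitigate(now=30.0))   # False: still in the review window
event.dismiss()
print(event.should_mitigate(now=600.0))  # False: dismissed by the operator
```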

Set per-server thresholds. If your tool supports it, configure thresholds for each server based on its actual traffic profile rather than using one threshold for your entire network. A game server and a web server should not share the same detection parameters.

Review and adjust regularly. Traffic patterns change. A threshold that was correct six months ago may produce false positives today because your traffic has grown. Schedule quarterly reviews of your detection configuration.

Maintain a false positive log. Track every false positive, what triggered it, what traffic caused it, and what you changed to prevent recurrence. This log is invaluable for tuning and for evaluating whether your tool is getting better or worse over time.

Test with synthetic traffic. Before enabling automated mitigation, test your detection with controlled traffic spikes that mimic both attacks and legitimate surges. Verify that the tool correctly distinguishes between them before trusting it to take autonomous action.

The honest bottom line

False positives in DDoS detection are not a solved problem. They are a managed problem. The difference between good and bad DDoS tools is not whether they produce false positives, but how many, how often, and how damaging those false positives are.

Static thresholds are the worst approach because they require manual tuning and cannot adapt to changing traffic. Adaptive baselining is better because it tracks real traffic patterns and sets thresholds accordingly. Multi-signal classification is better still because it examines packet structure, not just volume, before triggering.

But even the best detection system will occasionally flag legitimate traffic as an attack. The question is whether your tool gives you the controls to minimize the damage when it gets it wrong: alert-first policies, confidence scoring, hysteresis, grace periods, and feedback mechanisms that learn from mistakes.

We are working on all of these at Flowtriq. Some are shipped. Some are in progress. We will be transparent about which is which.