The Inference API Threat Model
AI inference APIs present an attack surface that differs from that of traditional web applications. Every request to a prediction endpoint triggers a forward pass through a neural network, consuming GPU memory, compute cycles, and often significant CPU resources for pre- and post-processing. A standard HTTP GET flood against a static web page costs the server almost nothing per request. An HTTP flood against an inference endpoint costs the server a full model inference per request, which can range from 10 ms to several seconds of GPU time depending on the model.
This cost asymmetry is the core of the threat model. Attackers do not need massive botnets to overwhelm an inference API. A few hundred requests per second targeting a large language model endpoint can exhaust the GPU's processing capacity entirely. The attack does not need to saturate the network link. It needs to saturate the GPU.
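To make the asymmetry concrete, here is a back-of-the-envelope sketch in Python. The function name and the example figures (8 GPUs, 50 ms or 2 s per inference, no batching) are illustrative assumptions, not measurements from any real deployment.

```python
def saturation_rps(gpu_count: int, seconds_per_inference: float, batch_size: int = 1) -> float:
    """Request rate at which every GPU is continuously busy.

    Simplifying assumptions: each GPU serves one batch at a time, and
    queueing, pre-processing and post-processing overhead are ignored.
    """
    if gpu_count <= 0 or seconds_per_inference <= 0 or batch_size <= 0:
        raise ValueError("all arguments must be positive")
    return gpu_count * batch_size / seconds_per_inference


# Hypothetical small model: 8 GPUs, 50 ms per request (about 160 req/s).
print(saturation_rps(8, 0.05))

# Hypothetical large language model: 8 GPUs, 2 s per request.
print(saturation_rps(8, 2.0))
```

Under these assumed figures, a single-digit request rate is enough to keep the large-model node fully occupied, which is why link saturation is not required.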
Three primary attack patterns target inference APIs:
- HTTP floods on prediction endpoints: High-volume POST requests with valid-looking input payloads. Each request forces a full model inference, burning GPU cycles on malicious input. The requests may use stolen or leaked API keys to bypass authentication.
- Resource exhaustion via expensive inputs: Carefully crafted inputs that maximize inference time. For language models, this means maximum-length prompts that produce maximum-length outputs. For image models, maximum-resolution inputs. The attacker optimizes for cost-per-request to the defender, not volume.
- API key abuse and credential stuffing: Using compromised API keys to make authenticated requests that bypass rate limits applied to unauthenticated traffic. Since the requests are authenticated, they are treated as legitimate by most API gateways.
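The second pattern, expensive inputs, can be illustrated with a rough relative cost estimate. This is a sketch under a hypothetical cost model: the `TextRequest` type, the `estimated_cost` function, and the output weight of 4.0 are all placeholders chosen for illustration, not values taken from any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRequest:
    prompt_tokens: int
    max_output_tokens: int


def estimated_cost(req: TextRequest, output_weight: float = 4.0) -> float:
    """Rough relative cost of one request, in arbitrary units.

    Hypothetical model: generated tokens are weighted more heavily than
    prompt tokens because generation proceeds one token at a time.
    The weight is a placeholder, not a measured value.
    """
    return req.prompt_tokens + output_weight * req.max_output_tokens


typical = TextRequest(prompt_tokens=200, max_output_tokens=100)
worst_case = TextRequest(prompt_tokens=8000, max_output_tokens=4000)

ratio = estimated_cost(worst_case) / estimated_cost(typical)
print(f"worst-case request costs about {ratio:.0f}x a typical one")
```

With a multiplier like this, an attacker sending worst-case requests needs far fewer requests per second to cause the same GPU load.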
Why Rate Limiting Alone Is Not Enough
The instinctive response to API abuse is rate limiting: cap the number of requests per API key, per IP, or per time window. For traditional APIs, this works well enough. For inference APIs, rate limiting creates a fundamental conflict between security and usability.
Legitimate inference workloads frequently look identical to attacks. A customer running batch predictions on a dataset of 100,000 images will send thousands of requests in rapid succession. An ML pipeline that processes a queue of text inputs will sustain high request rates for hours. A real-time application load-testing against an inference endpoint before production deployment will generate traffic patterns indistinguishable from an HTTP flood.
Setting rate limits low enough to block attacks will block legitimate batch processing. Setting them high enough to allow batch processing will allow attacks through. There is no single threshold that correctly separates the two because the traffic patterns overlap completely at the request level.
# The rate limiting dilemma for inference APIs

Legitimate batch customer:
  2,000 req/s for 30 minutes
  Valid API key, authenticated
  Sequential image classification requests
  Total: 3.6M requests

DDoS attack:
  2,000 req/s for 30 minutes
  Stolen API key, authenticated
  Sequential image classification requests
  Total: 3.6M requests

# Rate limiting cannot distinguish these two scenarios
# because at the request level, they are identical.
Per-key rate limits help in theory but create operational overhead in practice. Customers with legitimate high-volume workloads will constantly hit limits and file support tickets. Custom rate limit tiers per customer add complexity and still do not solve the stolen-key problem.
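The dilemma can be reproduced with a minimal token-bucket limiter. This is a generic sketch, not the limiter of any particular API gateway; the class, the `replay` helper, and the limit values are assumptions made for the example.

```python
class TokenBucket:
    """Minimal token-bucket limiter driven by explicit timestamps."""

    def __init__(self, rate_per_s: float, burst: float) -> None:
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def replay(rate_limit: float, burst: float, req_per_s: int, seconds: int) -> list[bool]:
    """Feed evenly spaced requests through a fresh bucket and record each decision."""
    bucket = TokenBucket(rate_limit, burst)
    total = req_per_s * seconds
    return [bucket.allow(i / req_per_s) for i in range(total)]


# Both senders produce 2,000 req/s with a valid key. Only intent differs,
# and intent is not an input to the limiter.
batch_customer = replay(rate_limit=1000, burst=100, req_per_s=2000, seconds=2)
attacker = replay(rate_limit=1000, burst=100, req_per_s=2000, seconds=2)
print(batch_customer == attacker)
```

Because the limiter sees only timestamps and a key, it makes the same decisions for both senders: whatever it rejects for the attacker, it also rejects for the batch customer.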
Multi-Signal Detection for Inference APIs
Effective DDoS detection for inference APIs requires looking beyond request rate. Individual requests may be indistinguishable between legitimate and malicious traffic, but the aggregate behavior differs across multiple signals when observed together.
Packets per second combined with request patterns. A legitimate batch job sends requests sequentially, waiting for each response before sending the next (or using a bounded concurrency pool). An attack typically fires requests without waiting for responses, creating a much higher ratio of open connections to completed requests. Monitoring connection behavior alongside raw PPS reveals patterns that request rate alone misses.
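A minimal sketch of that connection-behavior signal follows. The `ConnectionSample` type, the threshold of 5.0, and the sample numbers are hypothetical; a real deployment would derive them from its own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionSample:
    open_connections: int    # connections currently open to the endpoint
    completed_requests: int  # requests that finished during the sample window


def open_to_completed_ratio(sample: ConnectionSample) -> float:
    """Open connections per completed request in the window."""
    # Guard against division by zero when nothing completed in the window.
    return sample.open_connections / max(sample.completed_requests, 1)


def looks_unbounded(sample: ConnectionSample, threshold: float = 5.0) -> bool:
    """Flag samples whose ratio exceeds a hypothetical threshold."""
    return open_to_completed_ratio(sample) > threshold


# A bounded concurrency pool keeps few connections open per completed request.
bounded_pool = ConnectionSample(open_connections=64, completed_requests=1900)
# A fire-and-forget flood opens connections much faster than requests complete.
flood = ConnectionSample(open_connections=12000, completed_requests=400)

print(looks_unbounded(bounded_pool), looks_unbounded(flood))
```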
Source diversity analysis. A legitimate customer's batch job comes from one or a small number of source IPs. A botnet-driven flood comes from hundreds or thousands of sources. Even when individual sources stay below rate limits, the sudden appearance of many new source IPs all targeting the same inference endpoint is a strong anomaly signal.
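One way to quantify this signal is the share of never-before-seen source IPs in a time window. The sketch below assumes hypothetical thresholds (`min_new`, `min_fraction`) and uses documentation-range IP addresses as sample data.

```python
def new_source_fraction(window_sources: set[str], known_sources: set[str]) -> float:
    """Fraction of source IPs in the current window that were not seen before."""
    if not window_sources:
        return 0.0
    return len(window_sources - known_sources) / len(window_sources)


def source_diversity_anomaly(
    window_sources: set[str],
    known_sources: set[str],
    min_new: int = 100,         # hypothetical: minimum count of new sources
    min_fraction: float = 0.8,  # hypothetical: minimum share of new sources
) -> bool:
    """True when many new sources appear at once and dominate the window."""
    new_count = len(window_sources - known_sources)
    fraction = new_source_fraction(window_sources, known_sources)
    return new_count >= min_new and fraction >= min_fraction


known = {"198.51.100.10", "198.51.100.11"}

# A batch job from the customer's usual addresses.
batch_window = set(known)
# The same endpoint suddenly contacted by 250 previously unseen addresses.
botnet_window = known | {f"203.0.113.{i}" for i in range(1, 251)}

print(source_diversity_anomaly(batch_window, known))
print(source_diversity_anomaly(botnet_window, known))
```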
Temporal correlation with GPU metrics. This is where inference-specific detection becomes powerful. Flowtriq's agent can be configured alongside Prometheus exporters that track GPU utilization, inference latency, and queue depth. When network anomalies correlate with GPU saturation, the detection confidence increases dramatically. A spike in PPS that coincides with GPU utilization jumping from 60% to 100% and inference latency tripling is almost certainly an attack, not organic growth.
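Temporal correlation between a network series and a GPU series can be sketched with a plain Pearson coefficient. This is an illustration only and does not describe how Flowtriq computes its scores; the sample values for `pps` and `gpu_util` are invented.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series carries no correlation information.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Invented samples taken at the same timestamps: packets per second and GPU utilization (%).
pps = [10_000, 11_000, 10_500, 48_000, 52_000, 55_000]
gpu_util = [58, 61, 60, 97, 100, 100]

print(round(pearson(pps, gpu_util), 2))
```

A coefficient near 1 over a short window says the two series rose together; on its own it does not prove an attack, which is why the article combines it with the other signals.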
Prometheus Integration for GPU Correlation
Many GPU cloud providers already run Prometheus and Grafana for infrastructure monitoring. Flowtriq exports its detection metrics in Prometheus format, allowing operators to build correlation dashboards and alerting rules that combine network-level anomalies with GPU-level metrics.
# Example: Prometheus alerting rule combining Flowtriq + GPU metrics
# on(...) restricts label matching; without it, "and" requires identical label sets.
- alert: InferenceEndpointDDoS
  expr: |
    flowtriq_pps_anomaly_score{node="inference-01"} > 3
    and on(node) nvidia_gpu_utilization{node="inference-01"} > 95
    and on() inference_latency_p99{endpoint="/v1/predict"} > 500
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Probable DDoS on inference endpoint"
    description: "Network anomaly detected concurrent with GPU saturation and latency spike"
This multi-signal approach catches attacks that evade any single detection method. The network anomaly alone might be a legitimate traffic surge. GPU saturation alone might be a large customer job. But the combination of both, correlated in time, is a strong indicator of malicious activity.
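The same logic, all signals true at once and sustained over time, can be sketched outside Prometheus. The `Sample` type and the sample data are hypothetical; the thresholds mirror the example rule, and five consecutive evaluations at an assumed 30-second interval span two minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    anomaly_score: float
    gpu_utilization: float  # percent
    latency_p99_ms: float


def all_signals_firing(s: Sample) -> bool:
    """True only when network, GPU and latency thresholds are all exceeded."""
    return s.anomaly_score > 3 and s.gpu_utilization > 95 and s.latency_p99_ms > 500


def sustained(samples: list[Sample], required_consecutive: int) -> bool:
    """True if the combined condition holds for enough consecutive samples."""
    streak = 0
    for s in samples:
        streak = streak + 1 if all_signals_firing(s) else 0
        if streak >= required_consecutive:
            return True
    return False


# GPU saturated by a large customer job, but no network anomaly.
large_job = [Sample(0.8, 99, 650)] * 6
# All three signals elevated for five consecutive evaluations.
attack = [Sample(1.0, 60, 180)] + [Sample(5.2, 100, 900)] * 5
# All three signals elevated, but only briefly.
brief_spike = [Sample(4.0, 97, 600)] * 2 + [Sample(1.0, 70, 200)] * 4

print(sustained(large_job, 5), sustained(attack, 5), sustained(brief_spike, 5))
```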
Auto-Mitigation That Preserves Batch Processing
Detection is only half the problem. Mitigation for inference APIs must be surgical enough to block attack traffic without disrupting legitimate batch customers. Blanket IP blocking or aggressive rate limiting will cause collateral damage that costs more in customer churn than the attack itself.
Flowtriq's mitigation pipeline supports graduated response:
- Source-based filtering: When source diversity analysis identifies a botnet (hundreds of new IPs appearing simultaneously), those sources are blocked at the kernel level via iptables/nftables. Established customer IPs are whitelisted and unaffected.
- Protocol-level filtering: Volumetric UDP or SYN floods targeting the inference server's IP are dropped before reaching the application. HTTP-layer inference traffic on established TCP connections continues uninterrupted.
- BGP FlowSpec escalation: When attack volume exceeds what the server can filter locally, FlowSpec rules push filtering upstream. The rules target specific attack characteristics (source prefixes, protocols, ports) rather than blackholing the entire IP, so legitimate inference traffic continues to flow.
- Webhook integration for API gateway coordination: Flowtriq can trigger webhooks that instruct API gateways to temporarily tighten authentication, require additional verification for new API keys, or activate CAPTCHA challenges on non-API traffic, all without touching established batch processing sessions.
The goal is not to block all suspicious traffic. It is to maintain service for authenticated, established customers while filtering the attack. Batch processing jobs that were running before the attack started should continue running through it.
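The allowlist-first principle behind source-based filtering can be sketched as follows. This is not Flowtriq's implementation: `sources_to_block` is a hypothetical helper, the addresses come from documentation ranges, and turning the result into iptables/nftables rules is deliberately left out.

```python
import ipaddress


def sources_to_block(suspect_ips: list[str], allowlist_cidrs: list[str]) -> list[str]:
    """Return suspect addresses that do not fall inside any allowlisted network."""
    allowed = [ipaddress.ip_network(cidr) for cidr in allowlist_cidrs]
    blocked = []
    for raw in suspect_ips:
        addr = ipaddress.ip_address(raw)
        # Established customer ranges are never blocked, even if flagged.
        if any(addr in net for net in allowed):
            continue
        blocked.append(str(addr))
    return blocked


# Hypothetical input: one flagged address belongs to an established customer range.
suspects = ["203.0.113.5", "198.51.100.10", "203.0.113.77"]
customer_ranges = ["198.51.100.0/24"]

print(sources_to_block(suspects, customer_ranges))
```

Checking the allowlist before building any block rule is what keeps a batch job that was running before the attack from being caught by the mitigation.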
Implementation Checklist for GPU Cloud Providers
If you operate inference APIs on GPU infrastructure, here are the concrete steps to harden your DDoS posture:
- Deploy per-node detection: Every inference server should run its own detection agent. Monitoring only at the network edge can miss attacks aimed at individual endpoints.
- Correlate network and GPU metrics: Connect your DDoS detection to your GPU monitoring stack. The combination of network anomalies and GPU saturation is a far stronger signal than either alone.
- Whitelist established customer sessions: Maintain a dynamic allowlist of authenticated, established customer connections. Mitigation rules should never touch these sessions.
- Monitor per-endpoint, not just per-server: Different inference endpoints (image classification vs. text generation vs. embedding) have different cost profiles and different normal traffic patterns. Per-endpoint baselines catch targeted attacks that per-server baselines average away.
- Automate the response pipeline: A 10-minute manual response to a DDoS on inference infrastructure can cost thousands of dollars in lost GPU time and SLA penalties. Detection-to-mitigation must be automated and execute in seconds.
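The per-endpoint baseline item can be sketched with an exponentially weighted moving average kept separately for each endpoint. The class name, the smoothing factor, the tolerance multiplier, and the endpoint paths `/v1/embed` and `/v1/generate` are assumptions made for this example.

```python
class EndpointBaselines:
    """Tracks a moving-average request rate per endpoint and flags large jumps."""

    def __init__(self, alpha: float = 0.1, tolerance: float = 3.0) -> None:
        self.alpha = alpha          # smoothing factor for the moving average
        self.tolerance = tolerance  # flag rates above tolerance x baseline
        self._baseline: dict[str, float] = {}

    def observe(self, endpoint: str, req_per_s: float) -> bool:
        """Record a sample; return True if it is anomalous for this endpoint."""
        baseline = self._baseline.get(endpoint)
        if baseline is None:
            # First sample for this endpoint seeds its baseline.
            self._baseline[endpoint] = req_per_s
            return False
        anomalous = req_per_s > self.tolerance * baseline
        if not anomalous:
            # Learn only from normal samples so an attack does not raise the baseline.
            self._baseline[endpoint] = (1 - self.alpha) * baseline + self.alpha * req_per_s
        return anomalous


baselines = EndpointBaselines()
for _ in range(10):
    baselines.observe("/v1/embed", 2000)    # cheap, high-volume endpoint
    baselines.observe("/v1/generate", 20)   # expensive, low-volume endpoint

# A 10% rise on the busy endpoint is normal; a 10x rise on the expensive one is not.
embed_flag = baselines.observe("/v1/embed", 2200)
generate_flag = baselines.observe("/v1/generate", 200)
print(embed_flag, generate_flag)
```

In this example the server-wide rate moves from about 2,020 to 2,400 req/s, a change a per-server baseline would likely absorb, while the per-endpoint view flags the expensive endpoint immediately.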