Cloudflare experienced a severe service disruption on November 14, 2024, resulting in a substantial loss of customer log data. The incident, which lasted 3.5 hours, affected its Logpush service and led to the loss of 55% of customer logs.
The scale of the incident was significant, considering Cloudflare’s daily processing of 50 trillion event logs, of which 4.5 trillion are typically delivered to customers. The majority of Cloudflare Logs users were impacted during this period.
The technical breakdown revealed that a misconfiguration in the Logfwdr component triggered a ‘blank configuration,’ initiating a chain reaction. This activated the failsafe system, causing an unprecedented log volume spike. The Buftee buffer system, overwhelmed by 40 times its normal capacity, failed along with multiple safeguards due to improper configuration settings.
In response, Cloudflare implemented three key remediation measures:
– Deployed a new system for detecting and alerting on misconfigurations
– Enhanced Buftee configuration for better spike handling
– Established routine overload testing protocols
The incident served as a crucial reminder of the importance of robust failsafe configurations and comprehensive testing in large-scale logging infrastructure.