
Cloudflare Logs System Hits Bug, Down for 3.5 Hours With 55% of Customer Logs Lost

Cloudflare has disclosed that its Cloudflare Logs service was disrupted for approximately 3.5 hours on November 14, 2024, and that about 55% of the logs that should have been delivered to customers during that period were lost.

The Cloudflare Logs service records events that occur across Cloudflare’s network and sends this log data to customers. Because of the sheer size of the network, it handles roughly 4.5 trillion individual log events a day.

Cloudflare’s log pipeline consists of four software components, all written in Go, that work together. Logfwdr receives logs from Cloudflare’s systems and forwards them in batches. Logreceiver receives those batches and categorizes them. Buftee handles buffering, dividing the work into smaller parts. Finally, Logpush forwards the logs in the buffers to customers.
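The four stages can be pictured with a small Go sketch. Everything in it is hypothetical: the `Event` type, the function names, and the choice of one buffer per customer are assumptions made for illustration, not Cloudflare’s actual code or interfaces.

```go
package main

import "fmt"

// Event is a hypothetical log record; the field names are illustrative only.
type Event struct {
	Customer string
	Message  string
}

// forward stands in for Logfwdr: it groups events into batches.
// batchSize must be greater than zero.
func forward(events []Event, batchSize int) [][]Event {
	var batches [][]Event
	for start := 0; start < len(events); start += batchSize {
		end := start + batchSize
		if end > len(events) {
			end = len(events)
		}
		batches = append(batches, events[start:end])
	}
	return batches
}

// receive stands in for Logreceiver and Buftee together: it sorts the
// batched events into one buffer per customer (an assumption of this sketch).
func receive(batches [][]Event) map[string][]Event {
	buffers := make(map[string][]Event)
	for _, batch := range batches {
		for _, ev := range batch {
			buffers[ev.Customer] = append(buffers[ev.Customer], ev)
		}
	}
	return buffers
}

// push stands in for Logpush: it drains one customer's buffer and
// reports how many events were delivered.
func push(buffers map[string][]Event, customer string) int {
	n := len(buffers[customer])
	delete(buffers, customer)
	return n
}

func main() {
	events := []Event{
		{Customer: "customer-a", Message: "GET /"},
		{Customer: "customer-b", Message: "GET /login"},
		{Customer: "customer-a", Message: "POST /api"},
	}
	buffers := receive(forward(events, 2))
	fmt.Println("delivered to customer-a:", push(buffers, "customer-a"))
}
```

The point of the sketch is only the shape of the flow: batching first, then sorting into buffers, then draining each buffer toward its customer.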

The issue arose when Cloudflare’s team added support for a new data type in Logpush, which required an additional configuration in Logfwdr so that it would recognize the new log type. However, a bug caused this configuration to be delivered blank, leading Logfwdr to believe there were no logs it needed to forward. The team spotted the problem and reverted the change within the first 5 minutes, but the blank configuration had already triggered a second problem: it activated Logfwdr’s failsafe mode, which sends log events for all customers rather than only those selected by the configuration. This failsafe was built when Cloudflare had far fewer customers, without considering the consequences of sending logs for every customer at once. The resulting flood of traffic overwhelmed the components further down the pipeline; Buftee, for example, faced roughly 40 times its usual buffer load. Eventually, the team had to reset the entire system.
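The failsafe behavior described above is a "fail open" decision: a blank configuration is treated as "forward everything". The Go sketch below contrasts that with one possible guard that falls back to the last known good configuration. The `Config` type and both function names are hypothetical, and the guard is an illustration only, not a fix attributed to Cloudflare.

```go
package main

import "fmt"

// Config maps a customer ID to whether log forwarding is enabled for it.
// This type is hypothetical and is not Logfwdr's real configuration format.
type Config map[string]bool

// shouldForwardFailOpen mimics the failsafe described in the article:
// a blank configuration means events are forwarded for every customer.
func shouldForwardFailOpen(cfg Config, customer string) bool {
	if len(cfg) == 0 {
		return true // fail open: blank config is read as "send everything"
	}
	return cfg[customer]
}

// shouldForwardGuarded is one possible alternative (an assumption of this
// sketch): when the new configuration is blank, keep using the last known
// good one instead of forwarding for everyone.
func shouldForwardGuarded(cfg, lastGood Config, customer string) bool {
	if len(cfg) == 0 {
		cfg = lastGood
	}
	return cfg[customer]
}

func main() {
	lastGood := Config{"customer-a": true}
	blank := Config{}

	// customer-b never asked for logs, yet fail-open forwards its events.
	fmt.Println(shouldForwardFailOpen(blank, "customer-b")) // true
	fmt.Println(shouldForwardGuarded(blank, lastGood, "customer-b")) // false
}
```

With a handful of customers the fail-open branch is harmless; with a very large customer base it multiplies downstream traffic, which is the failure pattern the article describes.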

Cloudflare acknowledges that the bug in Logfwdr’s failsafe went unnoticed because of insufficient testing of whether its very large system could cope with such errors. The critical point was Buftee, which was meant to act as a buffer against failures but instead buckled under the immense traffic load. Going forward, Cloudflare plans to run regular overload tests that simulate increased system load to keep such issues from recurring.
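An overload test of the kind mentioned above can be sketched in Go as follows. The `BufferManager` type, its cap on the number of buffers, and the load figures are all assumptions made for illustration; the article does not describe how Buftee is built or how Cloudflare’s tests are run.

```go
package main

import "fmt"

// BufferManager is a hypothetical stand-in for a service that keeps one
// buffer per customer and refuses to create buffers beyond a fixed cap.
type BufferManager struct {
	maxBuffers int
	buffers    map[string][]string
	rejected   int
}

func NewBufferManager(maxBuffers int) *BufferManager {
	return &BufferManager{
		maxBuffers: maxBuffers,
		buffers:    make(map[string][]string),
	}
}

// Append stores an event in the customer's buffer. It returns false and
// counts a rejection if a new buffer would exceed the cap.
func (m *BufferManager) Append(customer, event string) bool {
	if _, ok := m.buffers[customer]; !ok && len(m.buffers) >= m.maxBuffers {
		m.rejected++
		return false
	}
	m.buffers[customer] = append(m.buffers[customer], event)
	return true
}

// simulateLoad sends one event for each of n distinct customers.
func simulateLoad(m *BufferManager, n int) {
	for i := 0; i < n; i++ {
		m.Append(fmt.Sprintf("customer-%d", i), "event")
	}
}

func main() {
	const normalCustomers = 1000 // illustrative baseline, not a real figure

	// Allow twice the normal load, then drive 40 times the normal load.
	m := NewBufferManager(2 * normalCustomers)
	simulateLoad(m, 40*normalCustomers)

	fmt.Printf("buffers=%d rejected=%d\n", len(m.buffers), m.rejected)
	// prints "buffers=2000 rejected=38000"
}
```

The test passes if the manager sheds the excess load in a controlled way instead of growing without bound, which is the property an overload test is meant to confirm before a real incident does.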

TLDR: The Cloudflare Logs service was disrupted for 3.5 hours and lost 55% of the logs from that period after a blank configuration triggered a failsafe that overloaded the pipeline and forced a system reset. Cloudflare plans to run regular overload tests to prevent similar incidents in the future.
