Inside the Data Bottleneck Threatening Streaming at Global Scale


YouTube’s Sean McCarthy and Hydrolix COO Tony Falco on what is breaking modern observability and how teams are rebuilding it

When millions of viewers tune in to a live event, the spectacle on screen is only part of the story. Behind every stream, image load, ad impression, and click sits a torrent of machine data few consumers ever see.

CDN logs operate as the circulatory system of the internet. During major live events, McCarthy and Falco describe environments that can generate millions of log lines per second. In their discussion, they pointed to streaming operations that can produce roughly 8 terabytes of log data in a single day. At global scale, even small performance degradations can ripple outward, disrupting user experiences and, in extreme cases, triggering cascading system failures.

The strain on data teams has intensified. Systems designed for yesterday’s traffic volumes are now operating under radically different velocity and scale.

To examine where modern data pipelines falter, Sean McCarthy, Head of OTT Live Engineering at YouTube, joined Tony Falco, COO of Hydrolix, to outline what they see as five structural constraints undermining observability today.

The Multi-CDN Blind Spot

For companies operating across multiple CDNs, fragmentation becomes the first barrier.

Each vendor structures fields differently. Logs arrive on different schedules. Teams normalize data according to internal priorities. The result is limited visibility at the precise moment clarity is needed.

“We often ran into significant problems getting a single, coherent view of the data. This was the central operational challenge of running a multi-CDN environment,” McCarthy explains.

Before real-time logging matured, teams relied on delayed client analytics or vendor outage alerts. That lag carried risk during live broadcasts.

“What unified, sub-second visibility looks like now versus the past is the difference between a blurry, historical photograph and an immediate, 4K live feed.”

Falco says fragmentation was central to Hydrolix’s founding. Drawing on the team’s experience at Cedexis, where petabytes of CDN data were collected, he recalls the financial pressure that accompanied scale.

“Every CDN request generates dozens of log events that tell you at each step how things are functioning. We were processing billions of transactions a day, and while it scaled, the cost of BigQuery was approaching the cost of headcount. The data is only valuable if you can get the insights out of it. We set out to solve that foundational problem.”

Normalization, both argue, is the prerequisite for effective observability. Without it, every downstream decision becomes reactive.
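In practice, that normalization step amounts to mapping each vendor’s field names and types onto one shared schema before the data is queried. The sketch below is a minimal illustration under assumed names: the vendors, field mappings, and schema are hypothetical, not Hydrolix’s or any CDN’s actual log format.

```python
# Illustrative sketch only: vendor names, field mappings, and the unified schema
# are hypothetical, not Hydrolix's or any CDN's actual log format.
from datetime import datetime, timezone

# Hypothetical per-vendor field mappings into one shared schema.
FIELD_MAPS = {
    "cdn_a": {"ts": "timestamp", "sc": "status_code", "bytes_out": "bytes", "pop": "edge_location"},
    "cdn_b": {"time": "timestamp", "status": "status_code", "size": "bytes", "server_region": "edge_location"},
}

def normalize(vendor: str, raw: dict) -> dict:
    """Map one raw CDN log record into the unified schema."""
    mapping = FIELD_MAPS[vendor]
    record = {unified: raw[src] for src, unified in mapping.items() if src in raw}
    # Coerce types so downstream queries treat every vendor identically.
    record["status_code"] = int(record["status_code"])
    record["bytes"] = int(record["bytes"])
    record["timestamp"] = datetime.fromtimestamp(float(record["timestamp"]), tz=timezone.utc).isoformat()
    record["cdn"] = vendor  # keep provenance for per-vendor comparisons
    return record

print(normalize("cdn_a", {"ts": "1714000000", "sc": "504", "bytes_out": "0", "pop": "FRA"}))
print(normalize("cdn_b", {"time": "1714000003", "status": "200", "size": "18342", "server_region": "eu-central"}))
```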

When Scale Outpaces Architecture

Even after data is unified, volume presents a second constraint.

“The massive amount of data that multi-CDN sources generate is hard to fathom,” McCarthy says. “It’s not enough to simply collect this massive amount of data; you need the ability to query and observe it as it is ingested.”

Legacy architectures were not built for ingestion at this velocity. Falco attributes next-generation scalability to structural shifts in infrastructure.

“It comes down to two major innovations: cloud object storage like S3 and Kubernetes. Together, they make a decoupled architecture where ingest and query scale independently. You can go from 10 pods to 100 pods and back down without over-provisioning.”

Elasticity becomes decisive.

“In a lot of systems, you can’t scale back down once you scale up. Legacy systems ship compute, ingest, and storage as one rigid unit, but Hydrolix breaks it apart. With Kubernetes, the whole front end is elastic.”

The ability to expand during peak demand and contract afterward shapes both performance and cost discipline.
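As a toy illustration of that decoupling, the sketch below sizes an ingest tier independently of query capacity. The per-pod throughput figure and load values are assumptions made for the example; only the 10-to-100-pod range echoes the quote above.

```python
# Toy illustration of independently scaling the ingest tier; the per-pod capacity
# and load figures are assumptions, not Hydrolix defaults.
import math

LINES_PER_POD_PER_SEC = 100_000   # assumed per-pod ingest capacity
MIN_PODS, MAX_PODS = 10, 100      # echoes the "10 pods to 100 pods" range above

def target_ingest_pods(observed_lines_per_sec: float) -> int:
    """Size the ingest tier for current load; query capacity would be sized separately."""
    needed = math.ceil(observed_lines_per_sec / LINES_PER_POD_PER_SEC)
    return max(MIN_PODS, min(MAX_PODS, needed))

# Ramp up for a live event, then contract afterward.
for load in (200_000, 2_500_000, 9_000_000, 300_000):
    print(f"{load:>9,} lines/s -> {target_ingest_pods(load)} ingest pods")
```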

The Economics of Retention

For years, full-fidelity log retention posed financial hurdles. Under many traditional pricing models, retaining 10 terabytes per day beyond a 90-day window can escalate into hundreds of thousands of dollars per month, according to Falco. As a result, sampling became common practice.
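A rough back-of-envelope calculation shows how quickly those numbers compound. The per-gigabyte rate below is an assumed placeholder for a traditional indexed-retention pricing model, not a figure from the article or any vendor’s price list.

```python
# Back-of-envelope only: the per-GB-per-month rate is an assumed placeholder,
# not a quoted vendor price.
DAILY_TB = 10
RETENTION_DAYS = 90
PRICE_PER_GB_MONTH = 0.30   # hypothetical indexed-retention rate, USD

stored_tb = DAILY_TB * RETENTION_DAYS                   # 900 TB held at steady state
monthly_cost = stored_tb * 1_000 * PRICE_PER_GB_MONTH   # TB -> GB
print(f"Steady-state footprint: {stored_tb:,} TB")
print(f"Monthly cost at ${PRICE_PER_GB_MONTH:.2f}/GB: ${monthly_cost:,.0f}")
```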

Sampling, however, conceals edge cases.

“Performance issues are often highly specific events masked by aggregation,” McCarthy notes. “Full-fidelity logs ensure you capture every unique error.”

Hydrolix addresses the cost equation through compression that the company says can reach 25 to 50 times on commodity object storage.

“We use the most cost-effective hot storage and then we layer on our own compression and partition the data,” Falco says. “It retrieves data even from slower storage. All of that adds up to a highly performant, durable database at a fraction of the cost.”
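Running the same hypothetical 10 terabytes per day through the 25-to-50-times compression range the company cites shows why the economics shift. The object-storage rate below is an assumed S3-style figure, not a quoted price.

```python
# Continues the hypothetical 10 TB/day, 90-day example; the object-storage rate
# is an assumed S3-style figure, not a quoted price.
RAW_TB = 10 * 90                 # the same hypothetical 900 TB of raw logs
S3_PRICE_PER_GB_MONTH = 0.023    # assumed standard object-storage rate, USD

for ratio in (25, 50):           # the compression range cited above
    compressed_gb = RAW_TB * 1_000 / ratio
    monthly = compressed_gb * S3_PRICE_PER_GB_MONTH
    print(f"{ratio}x compression: {compressed_gb:,.0f} GB stored, ~${monthly:,.0f}/month object storage")
```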

Extended retention broadens what teams can analyze, including rare QoE anomalies, device-specific failures, regional congestion patterns, SLA drift, and A/B test performance shifts.

Hydrolix has also pointed to commentary circulating in industry forums. One user wrote:

“We moved to Hydrolix. 15+ months retention means we can actually do some analysis…and it’s about 25% the cost of Splunk.” (Username Pik000)

For McCarthy, the broader implication is direct.

“Addressing the problems Hydrolix solves is non-negotiable.”

Visibility at the Business Layer

Observability gaps do not remain isolated within engineering teams. They surface in revenue leakage, infrastructure overages, and security exposure.

Falco points to firewall blind spots as one example.

“We found that a huge percentage of traffic that’s supposed to be blocked is not blocked.” He adds, “We’re seeing as much as 60% of traffic on major brands coming from bots and bypassing their firewall.”

He characterizes this as an observation drawn from Hydrolix’s customer environments rather than a universal market statistic.

The consequences include CDN overage fees, origin strain, latency spikes, and broader security risk.

“The decisions leaders need to make are relatively simple, unless they aren’t getting the information needed to make those simple decisions,” Falco says. “It can take teams two or three days to resolve one alert. Backlogs mount. People burn out. Being able to classify events in real time and act immediately is the key.”
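As a hypothetical illustration of what classifying events at ingest might look like, the sketch below tags normalized log records with simple bot and rate heuristics. The thresholds and user-agent tokens are assumptions made for the example, not Hydrolix’s detection logic.

```python
# Hypothetical illustration of classifying events as they are ingested; the
# heuristics and thresholds are assumptions, not Hydrolix's detection logic.
from collections import defaultdict

REQUESTS_PER_WINDOW_THRESHOLD = 600            # assumed per-client rate ceiling
SUSPECT_AGENT_TOKENS = ("curl", "python-requests", "scrapy")

request_counts = defaultdict(int)

def classify(record: dict) -> str:
    """Tag a normalized log record as it is ingested, before it lands in storage."""
    request_counts[record["client_ip"]] += 1
    agent = record.get("user_agent", "").lower()
    if any(token in agent for token in SUSPECT_AGENT_TOKENS):
        return "likely_bot"
    if request_counts[record["client_ip"]] > REQUESTS_PER_WINDOW_THRESHOLD:
        return "rate_anomaly"
    return "normal"

print(classify({"client_ip": "203.0.113.7", "user_agent": "python-requests/2.31"}))
print(classify({"client_ip": "198.51.100.9", "user_agent": "Mozilla/5.0"}))
```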

McCarthy notes that once visibility becomes normalized and immediate, response patterns shift. Teams act sooner. Incidents are contained before they widen.

The Build Versus Buy Divide

Constructing a bespoke real-time multi-CDN pipeline requires specialized skill sets. It demands expertise in ingestion engineering, data modeling, and distributed infrastructure that many organizations do not maintain internally.

“This often requires a dedicated team,” McCarthy says. “Unless your core business is real-time analytics, it likely does not make sense to build it yourself.”

Falco reinforces that assessment.

“Very few companies can build and maintain something this complex on their own. Companies like Lyft, Uber, and Nielsen scaled because their entire business demanded it. Most companies try, run into intricacies, and fail, all the while spending far more money than they would with a specialized vendor.”

Reducing Mean Time to Resolution becomes the dividing line.

“It’s our superpower,” Falco says. “If you summarize every testimonial, case study, and call, the pain point we fix is the same: Find and fix problems before your customers—or your boss—see them.”

The Nordic electronics retailer Elkjøp provides a case study. During Black Friday 2024, the company detected the onset of a DDoS attack and used TrafficPeak to respond.

When asked what “instant” meant in practice, Jonas Petersson, Elkjøp’s eCommerce team lead, responded, “The entire event from spotting to stopping the attack was instant. No sites went out of service and none of our customers experienced any impact whatsoever.”

Organizations that attempt to replicate similar pipelines internally often find themselves investing heavily in infrastructure while operational friction remains unresolved.

The New Baseline for Streaming

Streaming platforms now operate in an environment where delay is visible and failure is public. Fragmentation, scale, retention cost, architectural rigidity, and skills gaps combine to widen the gap between incident and intervention.

Falco distills the mandate.

“Reduce the time between a problem appearing and a human fixing it. When you do that, everything else — cost, retention, performance — falls into place.”

In an economy measured in milliseconds, compressing that interval has become a defining marker of resilience.
