The Engineering Challenge That Broke Kafka
LinkedIn faces a scale that would make most engineering teams weep. 32 trillion records per day. 17 petabytes of data daily. 400,000 topics spread across 150 Kafka clusters running on 10,000+ machines, all serving 1.2 billion users. At this magnitude, even the infrastructure you invented starts to crack.
This is the story of how LinkedIn, the company that created Apache Kafka and gave it to the world, had to replace their own creation with something entirely new. It’s a fascinating case study in technical pragmatism over sentiment, and a glimpse into infrastructure challenges that 99.9% of companies will never face.
When your own open source project isn’t good enough
LinkedIn’s relationship with Kafka is deeply personal. Jay Kreps, Neha Narkhede, and Jun Rao created Kafka at LinkedIn in 2010, specifically to handle the company’s exploding data needs. For 15 years, it served as LinkedIn’s “circulatory system for data,” processing everything from user interactions to machine learning features.
But by 2024, cracks were showing. The single-controller architecture that worked fine for hundreds of topics became a bottleneck with 400,000 topics. Partition-based rebalancing that took minutes for small clusters now required “stop-the-world” synchronization across massive infrastructure. Metadata management became a nightmare of coordination across 150 separate clusters.
The breaking point wasn’t theoretical; it was operational. LinkedIn’s engineering teams were spending too much time managing Kafka’s complexity instead of building features. Onur Karaman, a former Apache Kafka committer who had worked on Kafka’s scalability improvements, recognized that the architectural changes the company needed couldn’t be implemented within Kafka’s existing design constraints.
NorthGuard: Rethinking distributed log storage
NorthGuard represents a fundamental architectural departure from Kafka. While Kafka organizes data into partitions managed by a central controller, NorthGuard uses a hierarchical model with segment-level replication. Instead of replicating entire partitions, NorthGuard replicates 1GB segments that live for at most one hour. This enables automatic load balancing without the expensive rebalancing operations that plague Kafka at scale.
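NorthGuard’s internals aren’t public in detail, so the following is only an illustrative sketch of the segment-rollover idea described above: a log appends to an active segment until it hits a size or age bound (1GB or one hour in LinkedIn’s case), then seals it. Sealed segments are immutable, which is what makes them cheap units for a placement service to replicate or move. All class and method names here are hypothetical.

```python
import time

class Segment:
    """A bounded chunk of the log, sealed once it hits a size or age limit."""
    def __init__(self, max_bytes: int, max_age_s: float):
        self.records: list[bytes] = []
        self.size = 0
        self.created = time.monotonic()
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s

    def full(self) -> bool:
        return (self.size >= self.max_bytes
                or time.monotonic() - self.created >= self.max_age_s)

    def append(self, record: bytes) -> None:
        self.records.append(record)
        self.size += len(record)

class SegmentedLog:
    """Appends go to the active segment; sealed segments become immutable
    units that a placement service could replicate or rebalance independently,
    without ever moving an entire partition."""
    def __init__(self, max_bytes: int = 1 << 30, max_age_s: float = 3600):
        self.sealed: list[Segment] = []
        self.active = Segment(max_bytes, max_age_s)
        self.max_bytes, self.max_age_s = max_bytes, max_age_s

    def append(self, record: bytes) -> None:
        if self.active.full():
            self.sealed.append(self.active)  # immutable from here on
            self.active = Segment(self.max_bytes, self.max_age_s)
        self.active.append(record)
```

Because balancing decisions operate on small, short-lived segments rather than long-lived partitions, load can shift incrementally as new segments are placed, with no stop-the-world reassignment.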
Metadata management gets completely reimagined. Rather than Kafka’s single controller bottleneck, NorthGuard distributes metadata across multiple nodes using replicated state machines. This eliminates the controller scaling issues that become crippling at LinkedIn’s scale.
The storage layer provides millisecond-level durability across replicas with strong consistency guarantees by default, a stronger baseline than Kafka, where durability and consistency depend on producer acknowledgment and in-sync-replica settings.
xInfra: The virtualization layer that enables migration
Building a Kafka replacement is one thing. Migrating 400,000 topics without downtime is another entirely. xInfra (pronounced “ZIN-frah”) is LinkedIn’s virtualized Pub/Sub layer that abstracts away physical clusters.
xInfra enables dual-write mechanisms during migration, allowing applications to write to both Kafka and NorthGuard simultaneously while consumers can read from either system transparently. Virtual topics can span multiple physical clusters, enabling gradual migration without application changes.
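The dual-write pattern itself is simple to sketch, assuming (hypothetically) that both backends expose a send method. During migration the legacy system typically remains the source of truth, so in this illustration a legacy-side failure propagates while a new-system failure is only logged. This is a generic pattern sketch, not xInfra’s actual code.

```python
class DualWriteProducer:
    """Writes every record to both the legacy and the new system during
    migration, so consumers can switch backends independently."""
    def __init__(self, kafka_client, northguard_client):
        self.legacy = kafka_client
        self.new = northguard_client

    def send(self, topic: str, record: bytes) -> None:
        # The legacy system stays the source of truth until cutover:
        # a failure here propagates to the caller.
        self.legacy.send(topic, record)
        try:
            self.new.send(topic, record)
        except Exception as exc:
            # New-backend errors are tolerated and surfaced for monitoring.
            print(f"dual-write to new backend failed for {topic}: {exc}")
```

Once consumers have verified reads against the new system, the roles flip and the legacy write becomes the best-effort one, completing the cutover.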
This virtualization layer provides a unified control plane for both Kafka and NorthGuard systems, allowing LinkedIn to migrate thousands of topics processing trillions of records daily with zero downtime.
The scale that necessitated this change
The numbers bear restating: 32 trillion records and 17 petabytes of data per day, 400,000 topics, 150 clusters, 10,000+ machines, 1.2 billion users. No off-the-shelf system, including the one LinkedIn itself built, was designed for this.
The migration is already showing results: thousands of topics have moved to NorthGuard with zero downtime. Improvements include automatic load balancing, reduced operational overhead, and “lights-out operations,” where clusters largely manage themselves without constant human intervention.
Industry implications and future impact
LinkedIn’s decision to replace Kafka, their own open-source success story, sends a clear signal about the limitations of current streaming technologies at hyperscale. The architectural innovations in NorthGuard may influence the next generation of messaging systems, with segment-level replication and decentralized metadata management addressing scalability challenges other hyperscale companies will face.
However, adoption beyond LinkedIn faces barriers. The protocol incompatibility with Kafka’s APIs means organizations would need to rewrite existing integrations. And the complexity is designed for LinkedIn’s specific scale; most companies don’t process 32 trillion records daily.
Cloud providers will likely take notice. AWS Kinesis, Google Pub/Sub, and Azure Event Hubs may incorporate similar virtualization and scaling techniques. The Kafka ecosystem itself may evolve in response to these demonstrated limitations.
The bigger picture
LinkedIn’s willingness to replace their own successful infrastructure reflects a unique engineering culture. This is the same company that implemented “Project Inversion” in 2011, a complete feature freeze to rebuild their entire development infrastructure when scaling issues threatened the business.
Onur Karaman, the NorthGuard tech lead, is a former Apache Kafka committer who worked on Kafka’s controller redesign. This isn’t external criticism of Kafka; it’s the original creators acknowledging architectural limits and building past them.
In five years, this change could influence how other companies approach messaging systems. The architectural patterns may become standard for handling extreme scale, while cloud providers incorporate similar techniques. For most organizations, this represents a fascinating case study rather than a blueprint; the scale that justified NorthGuard simply doesn’t exist for typical enterprise use cases.
LinkedIn’s infrastructure evolution demonstrates engineering culture that prioritizes long-term scalability over short-term convenience, even when it means replacing their own open-source contributions to the world.
Original LinkedIn blog post
Additional resources: InfoQ analysis, BigDataWire coverage, Hacker News discussion