Kafka/Cassandra Ingestion Pipeline

The Question
Your Kafka/Cassandra data ingestion pipeline—walk me through how you handled backpressure, data loss prevention, and schema evolution. What was your approach to ensuring exactly-once semantics if that was a requirement?
SmartDrive's Data Problem
SmartDrive installed "smart recorders"—IoT devices in customer vehicle fleets—that captured sensor data and video. That data flowed back to SmartDrive's systems to power safety metrics, vehicle performance analytics, and driver behavior analysis against customer-defined safety criteria.
The existing system was a monolith running on-prem, and it was struggling to scale. We were part of a ground-up rearchitecture to a cloud-based distributed system using open-source technologies. The new system needed to reliably ingest data from more than 40,000 IoT devices transmitting over cellular networks.
My Role and the Architecture
Nick's influence
I talked about Nicholas Brookins in how I got here—he's the one who pushed me to evaluate technologies on their merits instead of defaulting to what was familiar. This pipeline was a direct product of that shift in thinking.
I helped Nick design this ingestion system. We developed a custom protocol for the sensor data—bit-packed byte arrays compressed using Snappy to minimize cellular data costs and transmission time.
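I no longer have the exact field layout, but the decode step looked roughly like the minimal sketch below, assuming a Snappy-compressed payload with a version byte followed by fixed-width fields. The `Reading` struct and its fields are hypothetical; the real protocol packed fields at sub-byte widths.

```go
package protocol

import (
	"encoding/binary"
	"fmt"

	"github.com/golang/snappy"
)

// Reading is a hypothetical decoded sensor sample; the real field layout
// was proprietary, so this is illustrative only.
type Reading struct {
	Version  uint8
	DeviceID uint32
	Value    uint16
}

// Decode decompresses a Snappy payload and unpacks fixed-width fields.
// The actual protocol bit-packed fields at sub-byte widths; this sketch
// uses whole bytes to keep the idea readable.
func Decode(payload []byte) (*Reading, error) {
	raw, err := snappy.Decode(nil, payload)
	if err != nil {
		return nil, fmt.Errorf("snappy decode: %w", err)
	}
	if len(raw) < 7 {
		return nil, fmt.Errorf("payload too short: %d bytes", len(raw))
	}
	return &Reading{
		Version:  raw[0],
		DeviceID: binary.BigEndian.Uint32(raw[1:5]),
		Value:    binary.BigEndian.Uint16(raw[5:7]),
	}, nil
}
```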
I implemented a service called Hydrant in Go, deployed as a stateless service in Kubernetes. Hydrant received packages from the IoT devices, validated them, converted them to Avro format, and placed them into Kafka topics based on a hash. We used 10-20 topics with hash-based partitioning to distribute the load across the Kafka cluster.
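The flow inside Hydrant looked conceptually like the sketch below. The `Encoder` and `Producer` interfaces, the topic naming, and the FNV hash are stand-ins of mine; I'm only illustrating the validate, convert-to-Avro, and hash-to-topic steps, not reproducing the actual code.

```go
package hydrant

import (
	"context"
	"fmt"
	"hash/fnv"
)

// Encoder and Producer are stand-ins for the real Avro codec and Kafka
// client; these interfaces just show the shape of the flow.
type Encoder interface {
	ToAvro(msg DecodedMessage) ([]byte, error)
}

type Producer interface {
	Publish(ctx context.Context, topic string, key, value []byte) error
}

type DecodedMessage struct {
	DeviceID string
	Payload  []byte
}

type Hydrant struct {
	enc       Encoder
	prod      Producer
	numTopics int // e.g. the 10-20 ingest topics
}

// HandleMessage validates one decoded message, converts it to Avro, and
// publishes it to a topic chosen by hashing the device ID. The topic
// naming scheme here is invented.
func (h *Hydrant) HandleMessage(ctx context.Context, msg DecodedMessage) error {
	if msg.DeviceID == "" || len(msg.Payload) == 0 {
		return fmt.Errorf("invalid message from device %q", msg.DeviceID)
	}
	avroBytes, err := h.enc.ToAvro(msg)
	if err != nil {
		return fmt.Errorf("avro encode: %w", err)
	}
	topic := fmt.Sprintf("sensor-ingest-%d", hashToBucket(msg.DeviceID, h.numTopics))
	return h.prod.Publish(ctx, topic, []byte(msg.DeviceID), avroBytes)
}

// hashToBucket maps a key onto one of n buckets; FNV-1a is my choice for
// the sketch, not necessarily what Hydrant used.
func hashToBucket(key string, n int) int {
	hsh := fnv.New32a()
	hsh.Write([]byte(key))
	return int(hsh.Sum32()) % n
}
```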
From there, our data engineering team ran Scala processors that consumed from Kafka and wrote the data, transformed as needed, into different Cassandra keyspaces and tables. The Analytics team then used those tables for various purposes, including tuning the IoT device firmware based on vehicle characteristics.
Backpressure Handling
We designed every layer to scale horizontally. Hydrant was stateless and ran in Kubernetes, so we could add instances to absorb whatever load the IoT devices generated.
Each Hydrant instance could also process messages in a package concurrently, taking advantage of Go's lightweight goroutines. This meant we could handle high-throughput packages efficiently even on a single instance.
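The fan-out within a package looked conceptually like this sketch. The concurrency cap and the use of errgroup are my choices for illustration, not necessarily what Hydrant did.

```go
package hydrant

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// processPackage fans the messages in one device package out across
// goroutines and waits for all of them. handle stands in for the
// validate/convert/publish path.
func processPackage(ctx context.Context, msgs [][]byte, handle func(context.Context, []byte) error) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(8) // bound in-flight work per package; the real limit is a guess
	for _, msg := range msgs {
		msg := msg // capture loop variable (pre-Go 1.22 semantics)
		g.Go(func() error {
			return handle(ctx, msg)
		})
	}
	// The first error wins, which maps naturally to NACKing the whole package.
	return g.Wait()
}
```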
On the Kafka side, we used hash-based partitioning across 10-20 topics to distribute the load. The hash determined which topic and partition a message landed on, spreading writes evenly across the Kafka cluster. I don't recall the exact hashing algorithm or range mapping we used, but the goal was load distribution: prevent any single partition from becoming a bottleneck.
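As a rough illustration of that principle, here's a throwaway snippet that hashes synthetic device IDs into a fixed number of buckets and prints the per-bucket counts. The bucket count, key format, and FNV choice are all made up for the demo.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Sanity-check that hashing device IDs spreads load evenly across a fixed
// set of ingest topics/partitions.
func main() {
	const buckets = 16
	counts := make([]int, buckets)
	for i := 0; i < 40000; i++ {
		key := fmt.Sprintf("device-%06d", i)
		h := fnv.New32a()
		h.Write([]byte(key))
		counts[h.Sum32()%buckets]++
	}
	for b, c := range counts {
		fmt.Printf("bucket %2d: %d messages\n", b, c)
	}
}
```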
The Scala processors on the consumer side could also scale horizontally since Kafka's consumer group model allows multiple consumers to process different partitions in parallel.
Data Loss Prevention
Our approach to data loss prevention worked at multiple levels, though I'll be upfront that it wasn't perfect—we had to balance reliability against the reality of resource-constrained IoT devices.
When Hydrant successfully processed a message, it sent an ACK back to the IoT device. The device wouldn't delete the message from its internal storage until it received the ACK. If Hydrant couldn't process a message for any reason—validation failure, Kafka unavailable, processing error—it returned a NACK. The device would then queue that message in its internal storage and retry later.
The devices sent messages sequentially: once one message was successfully acknowledged, they'd proceed to send any queued messages. This ensured ordering and gave us a retry mechanism at the edge.
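To make the device-side contract concrete, here's a small model of that behavior. It is not the actual firmware, which wasn't mine and wasn't written in Go; the names are illustrative.

```go
package device

import "errors"

// Sender models the device-side behaviour described above: a package stays
// in local storage until Hydrant ACKs it, and packages are sent strictly
// one at a time.
type Sender struct {
	queue    [][]byte                   // pending packages in local storage
	transmit func([]byte) (bool, error) // returns true when Hydrant ACKs
}

var errNack = errors.New("package NACKed, will retry later")

// Drain sends queued packages sequentially. A package is removed from the
// queue only after a successful ACK; a NACK or transport error stops the
// drain so the same package is retried on the next attempt, preserving order.
func (s *Sender) Drain() error {
	for len(s.queue) > 0 {
		acked, err := s.transmit(s.queue[0])
		if err != nil {
			return err // cellular hiccup: keep the package, try again later
		}
		if !acked {
			return errNack
		}
		s.queue = s.queue[1:] // ACK received: safe to drop from local storage
	}
	return nil
}
```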
However, IoT devices have limited memory, so if a device failed completely or its storage became corrupted before messages could be transmitted, that data was lost. For our analytics use case, this was an acceptable trade-off—we didn't need every single data point to provide valuable insights to customers. The system was designed for high-volume telemetry where some loss was tolerable, not for critical transactional data where every message must be preserved.
Messages that failed validation in Hydrant were placed in a separate Kafka topic for the firmware team to investigate, ensuring we could identify and fix systematic issues with the devices or protocol.
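The dead-letter path was conceptually simple. A sketch, with an assumed topic name and the same minimal producer interface as the earlier Hydrant sketch:

```go
package hydrant

import "context"

// Producer is the same minimal publish interface used in the earlier sketch.
type Producer interface {
	Publish(ctx context.Context, topic string, key, value []byte) error
}

// invalidTopic is a placeholder; I don't remember what the real topic was called.
const invalidTopic = "sensor-ingest-invalid"

// publishInvalid routes a message that failed validation to a dedicated topic
// instead of dropping it, publishing the original unparsed bytes so the
// firmware team can diagnose systematic protocol issues offline.
func publishInvalid(ctx context.Context, prod Producer, deviceID string, raw []byte) error {
	return prod.Publish(ctx, invalidTopic, []byte(deviceID), raw)
}
```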
Schema Evolution
We built versioning into the protocol from day one—each bit-packed message included a version field. This allowed Hydrant to handle different protocol versions as they came in from devices in the field.
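Conceptually, reading the version field was the first thing that happened when interpreting a message. A sketch, with invented version numbers and stub decoders:

```go
package protocol

import "fmt"

// decodeV1 and decodeV2 are stand-ins for the real per-version decoders.
func decodeV1(body []byte) (any, error) { return len(body), nil }
func decodeV2(body []byte) (any, error) { return len(body), nil }

// decodeByVersion reads the version field that led every bit-packed message
// before interpreting anything else. The version values here are invented.
func decodeByVersion(raw []byte) (any, error) {
	if len(raw) == 0 {
		return nil, fmt.Errorf("empty payload")
	}
	switch raw[0] {
	case 1:
		return decodeV1(raw[1:])
	case 2:
		// Newer versions had to stay backward compatible with fielded devices,
		// e.g. by appending fields rather than reordering existing ones.
		return decodeV2(raw[1:])
	default:
		return nil, fmt.Errorf("unsupported protocol version %d", raw[0])
	}
}
```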
We required that any new schema versions be backwards compatible, which was critical since we had 40,000 deployed devices that we couldn't update simultaneously. That said, the firmware team was extremely rigorous in their development and testing process, so once we released the initial schema version, it proved to be very stable. Schema changes were rare, which made evolution manageable in practice.
When we did need to evolve the schema, the version field allowed us to route different versions through the appropriate processing logic, though this happened infrequently enough that it wasn't a major operational burden.
Exactly-Once Semantics
We didn't implement exactly-once semantics across the entire pipeline—our system was designed for at-least-once delivery. The ACK/NACK mechanism meant that if a device retried after a network issue, we could potentially receive the same message multiple times.
Hydrant could detect and deduplicate records within a single package. Each message included a CRC or unique hash, and Hydrant compared these hashes to find duplicates, discarding them without fully parsing every message. However, Hydrant didn't track whether it had seen a particular message before across different packages. That would have required maintaining state about previously processed messages, which would have broken Hydrant's stateless, horizontally-scalable design.
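A sketch of that in-package deduplication; recomputing a CRC-32 here is a stand-in for reading the checksum field the real messages carried alongside the data.

```go
package hydrant

import "hash/crc32"

// dedupePackage drops duplicate messages within a single package by comparing
// checksums, without parsing the message bodies.
func dedupePackage(msgs [][]byte) [][]byte {
	seen := make(map[uint32]struct{}, len(msgs))
	out := make([][]byte, 0, len(msgs))
	for _, m := range msgs {
		sum := crc32.ChecksumIEEE(m)
		if _, dup := seen[sum]; dup {
			continue // duplicate within this package: discard
		}
		seen[sum] = struct{}{}
		out = append(out, m)
	}
	return out
}
```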
If cross-message deduplication was needed, it would have been handled in the Scala processors downstream, which had access to Cassandra and could check for duplicate data. I know the Analytics team also had deduplication processes in their workflows, though I can't speak to the specifics of their implementation since that was outside my area of responsibility.
For our analytics use case, this was the right trade-off. We were processing high-volume telemetry data to generate fleet-wide safety insights and driver behavior trends. Some duplication in the data wouldn't significantly impact aggregate analytics, and the complexity and performance hit of guaranteed exactly-once delivery at the ingestion layer just wasn't worth it for what the business needed.