Akka Streams: Our Journey to Fast Data Processing, and a Tech Primer

Akka Streams

In this blog I will walk you through the basics of Akka Streams, and along the way we will also dive into the journey of how it helped us create a stable data ingestion pipeline. I have redacted company and project information from this blog for security reasons. At the end of the blog I have shared a GitHub link for an Akka Streams tech primer; I recommend reading the blog before jumping to the primer for a better understanding.

Akka Streams is a toolkit for event-based processing of streaming data using bounded buffer space. It is an actor-based implementation of the Reactive Streams specification. It not only provides back-pressure out of the box but also promises type safety at every stage of the streaming pipeline.

The main goal of Reactive Streams is to govern the exchange of stream data across an asynchronous boundary (i.e., between the producer and consumer in the flow below). Think of it as passing elements on to another thread or thread pool while ensuring that the receiving side is not forced to buffer arbitrary amounts of data (e.g., more than N messages). In other words, back pressure is an integral part of this model: it allows the queues which mediate between threads to be bounded.

Akka Streams Bounded Buffered communication
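
To make this concrete, here is a minimal, self-contained sketch of bounded-buffer, back-pressured communication (the throttle rate and element count are arbitrary illustration values, not from our pipeline):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object BackPressureDemo extends App {
  implicit val system: ActorSystem = ActorSystem("backpressure-demo")

  // A very fast producer...
  Source(1 to 1000000)
    // ...drained through a stage limited to 10 elements per second.
    .throttle(10, 1.second)
    .runWith(Sink.foreach(println))

  // Demand flows upstream: the source emits only when the throttled stage
  // signals demand, so elements wait in a small bounded buffer instead of
  // piling up without limit in memory.
}
```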

How Akka Streams helped us create a stable, high-performance data ingestion pipeline:

Create a Data Ingestion Client Service (ICS) which can:

  • Handle never-ending, volume-variable flows (sometimes in bursts, other times in a trickle) that run for days without needing any restart.
  • Not only process big volumes of data but also process it fast.
  • Support both pulling and pushing of messages from diverse sources (Kafka, flat files, CDC).
  • Take data in a variety of formats (Avro, XML, JSON, DSV, etc.) and convert it to the desired format (Avro) if needed.
  • Encrypt the data after some validations.
Ingestion Client Service

Akka is a toolkit for building highly concurrent, distributed, and resilient message-driven applications; Actors are basically concurrent processes that communicate by exchanging messages.
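
As a quick illustration of that model, a minimal classic actor might look like the sketch below (names are illustrative):

```scala
import akka.actor.{Actor, ActorSystem, Props}

// A concurrent process that reacts to messages arriving in its mailbox.
class Printer extends Actor {
  def receive: Receive = {
    case msg: String => println(s"received: $msg")
  }
}

object ActorDemo extends App {
  val system  = ActorSystem("actor-demo")
  val printer = system.actorOf(Props[Printer](), "printer")

  // Fire-and-forget: the message is queued in the actor's mailbox,
  // and nothing stops a fast sender from queueing up far more.
  printer ! "hello"
}
```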

To ensure our new application could scale as our data ingest rates grew, we opted to use Akka Actors, a framework that utilizes the Actor model as its core concurrency abstraction. We chose it because it was proven to scale and we loved the simplicity of the API. This worked well for the cases where data was being pushed to us through different channels (e.g., Kafka), and in our initial tests we reached a performance of 50K events/min (epm) with 10 threads, up from 20K epm with a single-threaded flow (2.5 KB/event). But while the basic actor model provides a simple, scalable concurrency paradigm, you still need to think about how work can pile up in your system.

When data was being pulled (e.g., from a file), the actors responsible for pulling the data needed to work in tandem with the rest of the actors. However, the sole way to interact with an actor is via a message, which means slow actors can have messages queued up in memory while fast actors continue to send them messages, creating hot spots. As the load and volume increased we started seeing GC pauses, causing uneven rates of processing across the various stages and eventually bringing the application to a halt. This led to unpredictable over- and under-utilization of resources across different components, sometimes eventually crashing the JVM.

Akka Streams is a toolkit built on top of the actor model for processing sequences of elements using bounded buffer space. It removes the low-level plumbing that developers otherwise have to do to handle streams with plain Akka actors. Streams are reactive and asynchronous.

Akka Streams, built on the concepts of Reactive Streams, provided the following benefits for the kind of problem we were facing.

Back-pressure capabilities are provided transparently and out of the box (back pressure is the ability to say "hey, slow down!"). Hence, when the target system is slower than the source system, the pipeline is flow-controlled and throttled to a maximum rate no larger than what the slower side can survive.

Ability to compose small components into arbitrarily complex pipelines. We can break our big problem into smaller blocks that can then be composed (and reused) to form the solution.

All the Akka Streams building blocks are completely type safe: if a component produces/processes/takes a type A, it is impossible to use a type B by mistake. Compile-time type checks dramatically increase the correctness of our code and reduce the need for additional tests. The compiler does more work so you do less.

Here, when we talk of building blocks in Akka Streams we are referring mainly to the following three:

Source: an operator with exactly one output, emitting data elements whenever downstream operators are ready to receive them.

Flow: an operator with exactly one input and one output, which connects its upstream and downstream by transforming the data elements flowing through it.

Sink: an operator with exactly one input, requesting and accepting data elements, possibly slowing down the upstream producer of elements.
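
A minimal sketch of these three blocks composed into a runnable stream, with the types written out to show where the compiler enforces the boundaries:

```scala
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.Future

object BuildingBlocks extends App {
  implicit val system: ActorSystem = ActorSystem("building-blocks")

  val numbers: Source[Int, NotUsed]    = Source(1 to 10)              // exactly one output
  val double:  Flow[Int, Int, NotUsed] = Flow[Int].map(_ * 2)         // one input, one output
  val printer: Sink[Int, Future[Done]] = Sink.foreach[Int](println)   // exactly one input

  // Connecting a Flow[String, ...] here would be a compile error,
  // not a runtime surprise.
  numbers.via(double).runWith(printer)
}
```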

We can work directly with the low-level Akka Streams API to create our graphs, which gives us complete control over every aspect of the stream. However, Akka Streams also provides a high-level DSL that not only offers a much simpler and more idiomatic way of constructing streams, but is built in such a way that it allows us to translate designs directly into code.

For example, let’s assume we need to build a simple encryption flow that takes clear-text messages, encrypts them according to certain logic, and publishes them to a Kafka topic. This processing stream could look like: Source (clear-text messages) → Flow (encrypt) → Sink (Kafka topic).

And with the provided DSL, the actual code would look like the sketch below.
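
A minimal sketch, assuming an illustrative encrypt function and the Alpakka Kafka connector for the producer sink; the topic name and broker settings are placeholders:

```scala
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.{Flow, Source}
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

object EncryptionPipeline extends App {
  implicit val system: ActorSystem = ActorSystem("encryption-pipeline")

  // Placeholder encryption logic; a real implementation would use a proper cipher.
  def encrypt(clearText: String): String = clearText.reverse

  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092") // assumed broker address

  Source(List("msg-1", "msg-2", "msg-3"))                         // clear-text messages
    .via(Flow[String].map(encrypt))                               // encryption stage
    .map(enc => new ProducerRecord[String, String]("encrypted-topic", enc))
    .runWith(Producer.plainSink(producerSettings))                // publish to Kafka
}
```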

And while the example above is arguably trivial, we can make our design as elaborate and sophisticated as we want, and the DSL will still allow us to express it in a simple way.

Implementation:

Using the Graph DSL based approach, which allows us to segregate logical processing blocks that can be wired together at the end while ensuring type safety across stage boundaries, we created a separate stage for every challenge listed above and wired them together to create the final workflow.
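
A minimal sketch of this wiring style; the stage names below are illustrative placeholders, not the actual ICS stages:

```scala
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Flow, GraphDSL, RunnableGraph, Sink, Source}

object IcsGraphSketch extends App {
  implicit val system: ActorSystem = ActorSystem("ics-graph")

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
    import GraphDSL.Implicits._

    // Each logical processing block is a separate, reusable stage.
    val ingest   = builder.add(Source(1 to 100).map(n => s"record-$n"))
    val validate = builder.add(Flow[String].filter(_.nonEmpty))
    val convert  = builder.add(Flow[String].map(_.getBytes("UTF-8")))
    val publish  = builder.add(Sink.foreach[Array[Byte]](bytes => println(bytes.length)))

    // Wiring is checked at compile time: a type mismatch at any
    // stage boundary will not compile.
    ingest ~> validate ~> convert ~> publish

    ClosedShape
  })

  graph.run()
}
```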

The back-pressured communication ensures that neither the producer end nor the consumer end of the workflow can get overwhelmed by the other. This makes all the stages within the ICS work in tandem while removing the bottlenecks and points of contention from it. To keep resource utilization as efficient as possible, Akka Streams enables elastic spawning of actor instances in response to varying load, which then elastically scale back when not needed.
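
As one illustration of this kind of bounded, demand-driven parallelism inside a stream, mapAsync runs up to a fixed number of asynchronous operations concurrently while still honoring back pressure; the parallelism value and work function below are placeholders, not the ICS's actual configuration:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object BoundedParallelism extends App {
  implicit val system: ActorSystem = ActorSystem("bounded-parallelism")
  import system.dispatcher // execution context for the futures

  // Illustrative async work, e.g. an enrichment or I/O call.
  def process(record: Int): Future[Int] = Future(record * 2)

  Source(1 to 1000)
    // At most 8 `process` calls run concurrently; when the downstream
    // slows, demand drops and fewer futures are in flight.
    .mapAsync(parallelism = 8)(process)
    .runWith(Sink.ignore)
}
```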

Results (app's max memory allocation: 5 GB):

● Improved performance multiple-fold with no concurrent-programming or distributed-systems problems: application performance improved more than 10-fold (25K to 320K epm; there were other contributing factors along with the technology shift, which we were able to identify because of the simplicity added by Akka Streams).

● Avoided the risk of expensive downtime in production: we stress-tested the Streams-based ICS with a volume of 140M records (150 GB) and saw consistent throughput (variance of ~10%) with no GC pauses throughout the test.

● Improved CPU utilization: with Streams, CPU utilization is consistently around 80%.

● Able to scale elastically when needed: the reactive nature of Akka Streams allows us to grow both horizontally and vertically.

Tech Primer:

The projects below walk you through the steps to build a basic flow with a working example. They will help you understand the core concepts of Akka Streams.
