Drinking from the (Kinesis Data) Firehose: Data Streams Made Easy! (Alt title: AVOID RE-INVENTING THE WHEEL!)

FC
4 min read · Sep 5, 2021
How I’d imagine things would go if I had to reinvent Kafka/Kinesis
Me if I were to try to create a data streaming service from scratch

Have you ever gone to a fancy Italian restaurant and thought to yourself: “Oh man this pizza would taste so much better, if only I knew a way to transfer large amounts of data between different services using cloud distributed technologies!”

No? Exactly. Data streaming is one of those things that in your day-to-day life (even as a software developer) you don’t really need to worry about… until you’re faced with a big-data problem where you absolutely need a solid and reliable system for it.

Use-Case Time! Data Analytics!

Let’s take a real-life example now: let’s say you have a massive amount of real-time data being generated on System A (e.g. request data, gigabytes per second) that you want to share so other systems can consume it.

What system would you build for this? I’ll save you some time and give you the answer: YOU DON’T BUILD A SYSTEM! Instead, you use an open-source or otherwise widely available solution that is already adopted across multiple industries, created by people who spent long, borderline unhealthy hours flushing out all the bugs, edge cases, and use cases.

You will not create a better solution in a reasonable timeframe (and if you are able to, I’d suggest also solving “P=NP” and collecting that million-dollar prize). So what do you “borrow”?

Data Streams!

Data streams can continuously capture gigabytes of data per second from hundreds of thousands of sources. The data collected is available in milliseconds to enable real-time analytics. Use cases include real-time dashboards, analytics, real-time anomaly detection, dynamic pricing, and more.

Popular data stream solutions that exist include:

  • Kafka: Apache’s open-source offering, the “OG” data streaming solution that you can download and use right now.
  • AWS Kinesis: AWS copying Kafka’s homework for their own out-of-the-box solution.
  • Google Pub/Sub: Google Cloud copying Kafka’s homework. I’m not too familiar with this one as I mainly use AWS, but I’m sure it’s “GoogleTastic!”

Basically, by spinning up a Kinesis data stream, we’re able to have our “System A” publish everything to the stream, where it can be consumed by any service reading from that stream. In some cases, you can also set up other AWS services to consume from the data stream directly.
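
To give a feel for what “System A publishes everything to the stream” might look like, here is a minimal Python sketch. It assumes a boto3 Kinesis client (its `put_records` call); the stream name `system-a-requests`, the event fields, and the helper names are all made up for illustration.

```python
import json
from typing import Any, Dict, Iterable, List


def to_kinesis_entries(events: Iterable[Dict[str, Any]], key_field: str) -> List[Dict[str, Any]]:
    """Turn plain dicts into the entry shape that PutRecords expects."""
    entries = []
    for event in events:
        entries.append({
            # Kinesis stores opaque bytes; JSON is just a convenient choice here.
            "Data": json.dumps(event).encode("utf-8"),
            # Records that share a partition key land on the same shard.
            "PartitionKey": str(event[key_field]),
        })
    return entries


def publish(client: Any, stream_name: str, events: Iterable[Dict[str, Any]],
            key_field: str = "user_id") -> int:
    """Send one batch and return how many records the service rejected."""
    response = client.put_records(
        StreamName=stream_name,
        Records=to_kinesis_entries(events, key_field),
    )
    # PutRecords can partially fail; a real producer would retry the rejects.
    return response.get("FailedRecordCount", 0)
```

With real credentials you would call something like `publish(boto3.client("kinesis"), "system-a-requests", events)`. A single PutRecords call takes at most 500 records, so a real producer would also split larger batches.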

In our use case, we need this streaming data from “System A” to be pushed to AWS S3 so that our external systems, which can easily be configured to read from S3, can consume the data and give us the analytic charts we desire.

Now, I’ll discuss how you can write a dynamically scalable system in Java or Python that listens to Kinesis Data Streams and publishes each record (configurable to publish in batches) to S3 or Redshift! JUST KIDDING, DO NOT DO THIS WITHOUT A GOOD REASON! AWS already has a solution that is way more efficient and fail-proof than anything a single developer, or a team of developers, can come up with in a reasonable time frame.

Kinesis Data Firehose (This picture is way more interesting than their icon, so I’m going with it)

Kinesis Data Firehose — A wheel that already exists!

AWS Kinesis Data Firehose is a streaming solution that can be either:

  • Used as the data source itself, with your systems publishing records to it directly
  • Configured to read from a Kinesis Data Stream
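
For the first option, publishing straight to Firehose, a rough Python sketch could look like the following. It assumes a boto3 Firehose client (its `put_record_batch` call); the function names and the JSON-lines encoding are illustrative choices, not anything Firehose requires.

```python
import json
from typing import Any, Dict, Iterator, List

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def chunked(items: List[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_to_firehose(client: Any, delivery_stream: str, events: List[Dict[str, Any]]) -> int:
    """Publish events straight to a delivery stream; return the failed count."""
    failed = 0
    for batch in chunked(events, MAX_BATCH):
        response = client.put_record_batch(
            DeliveryStreamName=delivery_stream,
            # Firehose concatenates records as-is, so we add our own newline
            # to keep one JSON object per line in the delivered files.
            Records=[{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in batch],
        )
        # Individual records in a batch can be rejected; retry them in real code.
        failed += response.get("FailedPutCount", 0)
    return failed
```

The chunking is there because of the per-call record limit; beyond that, Firehose handles the buffering and delivery for you.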

What makes Data Firehose shine is that it can be configured to take the records it reads and automatically publish them directly to destinations such as:

  • An S3 bucket
  • Amazon Redshift
Billy Mays here! Bringing you the best memes of the early 2000s!
Billy Mays here! Big Data doesn’t need to be Big Problem!
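
For our use case (Kinesis data stream in, S3 bucket out), the whole “system” boils down to a bit of configuration. Here is a hedged sketch of the arguments you might pass to boto3’s `create_delivery_stream`. The helper name, the `system-a/` prefix, the buffering values, and the use of one IAM role for both reading the stream and writing the bucket are assumptions for illustration.

```python
from typing import Any, Dict


def delivery_stream_config(name: str, stream_arn: str, bucket_arn: str,
                           role_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for a stream-to-S3 Firehose delivery stream."""
    return {
        "DeliveryStreamName": name,
        # Read from an existing Kinesis data stream instead of direct puts.
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": stream_arn,
            "RoleARN": role_arn,  # assumed: role allowed to read the stream
        },
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,  # assumed: same role may write to the bucket
            "BucketARN": bucket_arn,
            "Prefix": "system-a/",  # hypothetical key prefix
            # Flush whenever 64 MB or 300 seconds accumulate, whichever is first.
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
            "CompressionFormat": "GZIP",
        },
    }
```

You would then hand this over with something like `boto3.client("firehose").create_delivery_stream(**config)`, or set up the equivalent in the AWS console.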

You are also able to apply AWS Lambda transforms to the data being collected by Firehose before it gets pumped out to the destination bucket! If you need to do some operations (e.g. filter out certain data), you’re able to do so without having to modify any of your other systems.
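
As a rough sketch of such a transform: Firehose hands the Lambda a batch of base64-encoded records and expects each one back with the same `recordId` and a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The filtering rule (drop health checks, strip IP addresses) and the `path` and `ip` field names are invented for illustration.

```python
import base64
import json
from typing import Any, Dict


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Filter and clean Firehose records before they reach the destination."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        if payload.get("path") == "/health":
            # Dropped records are acknowledged but never delivered.
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
            continue

        payload.pop("ip", None)  # hypothetical field we do not want stored
        cleaned = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(cleaned).decode("utf-8"),
        })
    # Every incoming recordId must appear exactly once in the response.
    return {"records": output}
```

Because the function is plain Python over dicts, you can unit test it locally with a hand-built event before attaching it to a delivery stream.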

AWS Kinesis Data Firehose is a damn powerful tool (and a very affordable one for companies in a position where it can be useful!). Being aware of its existence can be huge for you and your company in the right circumstances.

FC

Enterprise Software Developer By Day, Game Dev by night