ArkFlow: High-performance Rust stream processing engine
I worked on stream processing; it was fun, but I also believe it was over-engineered and brittle. The customers didn't actually want real-time data: they looked at the calculated values once a week, then made decisions based on that.
Then I joined another company that somehow had the money to pay 50-100 people, and they were using CSV files, shell scripts, batch processing, and all that. It solved the clients' needs, and they didn't need to maintain a complicated architecture and code that could otherwise have been difficult to reason about.
After I left, the first company, the one with the stream processing, was bought by a competitor at a fire-sale price. Some of the tech was relevant to the acquirer, but the stream processing stuff was immediately shut down. The acquiring company had just simple batch processing, and in comparison they were printing money.
If you think it's still worth going with stream processing, give your reasoning to the team; most reasonable developers will learn it if they really believe it's a significantly better solution for the given problem.
Not to over-simplify, but if you can't convince 5 out of 10 people to learn something that would make their job better, then either the people are not up to the task, or you are wrong that stream processing would make a difference.
Systems that needed complex streaming architectures in 2015 could probably be handled today with fast disks and a large Postgres instance (or BigQuery).
My concept of stream processing is taking gigabits to gigabytes a second and turning it into something much, much smaller, so that it becomes manageable to store in a database and analyze. To my mind, at 'stream processing' rates even calling malloc is sometimes too expensive, let alone using any of the technologies called out in this tech stack.
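For instance, here's a toy sketch of mine (nothing from ArkFlow or any of these stacks) of what avoiding allocation on the hot path looks like: one preallocated buffer, reused for every read, instead of a fresh allocation per record.

    use std::io::{self, Read};

    // Reuse a single preallocated buffer for every read instead of
    // allocating a fresh Vec per record; at GB/s rates the per-record
    // malloc/free would dominate the actual work.
    fn drain<R: Read>(mut input: R) -> io::Result<u64> {
        let mut buf = vec![0u8; 1 << 20]; // one 1 MiB buffer, allocated once
        let mut total = 0u64;
        loop {
            let n = input.read(&mut buf)?;
            if n == 0 {
                break; // EOF
            }
            // Process buf[..n] in place here; no allocation on the hot path.
            total += n as u64;
        }
        Ok(total)
    }

    fn main() -> io::Result<()> {
        let bytes = drain(io::stdin().lock())?;
        println!("saw {bytes} bytes");
        Ok(())
    }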
I understand backpressure and circuit breakers, but (for my general work) they have to happen at the OS / process level. A metric that auto-scales a microservice worker by going through Prometheus plus an HPA, or something like that, ends up with too many inefficiencies to be practical. A few threads on a single machine just work, while the 'cloud native' solution takes ages to engineer.
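To make "a few threads just work" concrete, here's a minimal sketch (my own toy code, not any framework's API) of process-level backpressure in Rust: a bounded channel between a producer and a consumer thread. When the queue fills, the producer blocks, and that is the entire mechanism; no Prometheus, no HPA.

    use std::sync::mpsc::sync_channel;
    use std::thread;

    fn main() {
        // Capacity 1024: once the queue is full, send() blocks the
        // producer, which is the whole backpressure mechanism.
        let (tx, rx) = sync_channel::<Vec<u8>>(1024);

        let producer = thread::spawn(move || {
            for i in 0..10_000u32 {
                // Stand-in for reading a record off the wire.
                let record = i.to_le_bytes().to_vec();
                tx.send(record).expect("consumer hung up");
            }
        });

        let consumer = thread::spawn(move || {
            let mut bytes = 0usize;
            for record in rx {
                // Stand-in for the actual reduction work.
                bytes += record.len();
            }
            println!("processed {bytes} bytes");
        });

        producer.join().unwrap();
        consumer.join().unwrap();
    }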
Once I'm down to a job a second or less (and each job takes more than a few seconds to run, enough to hide the framework's overhead), things like Airflow start to work rather than just fall flat. But at that point, are these expensive frameworks worth it? I'm only producing 1-1000 jobs a second.
Stream processing frameworks like Faust, Airflow, Kafka Streams, etc. all just seem like brittle overkill once you start trying to actually deploy and use them. How do I tune the PostgreSQL database behind Airflow? How do I manage my S3 lifecycles to minimize cost?
A task queue plus an HPA really feels like the right kind of thing to me at that scale, versus caring too much about backpressure and the like when the data rate is 'low'. But I've generally been told by colleagues to reach for more complicated stream processors that perform worse, are (IMO) harder to orchestrate, and (IMO) harder to manage and deploy.
Companies are organized around an operational tempo that reflects what their systems are capable of. Even if you replace one of their systems with a real-time or quasi-real-time stream processing architecture, nothing else in the organization operates at that low a latency, including the people. It is a very heavy lift even to ask them to reorganize the way they do things.
A related issue is that stream processing systems still work poorly for some data models and often don’t scale well. Most implementations place narrow constraints on the data models' properties and statefulness. If a system sitting in the middle of your operational data model requires logic that does not fit within those limitations, the whole exercise starts to break down. Despite its many downsides, batching generalizes much better and more easily than stream processing. This could be ameliorated with better stream processing tech (as in, core data structures, algorithms, and architecture), but there hasn’t been much progress on that front.
ArkFlow is positioned as a lightweight distributed stream processing engine that unifies streaming and batch processing. With the help of DataFusion's huge ecosystem and ArkFlow's extensibility, we hope to build a broad data processing ecosystem and help the community lower the barrier to data processing, because we believe that flowing data can generate greater value.
Finally, thanks to everyone for their attention.
I haven't dug deep into this project, so take this with a grain of salt.
ArkFlow is a "stateless" stream processor, like Vector or Benthos (now Redpanda Connect). These are great for routing data around your infrastructure while doing simple, stateless transformations on it. They tend to be easy to run and scale, and are programmed by manually constructing the graph of operations.
Arroyo (like Flink or RisingWave) is a "stateful" stream processor, which means it supports operations like windowed aggregations, joins, and incremental SQL view maintenance. Arroyo is programmed declaratively via SQL, which is automatically planned into a dataflow (graph) representation. The tradeoff is that state is hard to manage, and these systems are much harder to operate and scale (although we've done a lot of work in Arroyo to mitigate this!).
I wrote about the difference at length here: https://www.arroyo.dev/blog/stateful-stream-processing
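As a toy illustration (hypothetical code, not ArkFlow's or Arroyo's actual API): a stateless transform needs only the current record, while a stateful windowed aggregation has to keep per-key, per-window state.

    use std::collections::HashMap;

    // Stateless: each record maps to an output independently, so you can
    // run N copies behind a load balancer and scaling is trivial.
    fn stateless_transform(record: &str) -> String {
        record.to_uppercase()
    }

    // Stateful: a tumbling-window count per key. The HashMap is the
    // "state" that a stateful engine must partition, checkpoint, and
    // restore, which is what makes it harder to operate.
    struct WindowedCounter {
        window_secs: u64,
        counts: HashMap<(String, u64), u64>, // (key, window start) -> count
    }

    impl WindowedCounter {
        fn observe(&mut self, key: &str, event_time_secs: u64) {
            let window = event_time_secs - event_time_secs % self.window_secs;
            *self.counts.entry((key.to_string(), window)).or_insert(0) += 1;
        }
    }

    fn main() {
        println!("{}", stateless_transform("hello"));
        let mut c = WindowedCounter { window_secs: 60, counts: HashMap::new() };
        c.observe("user-1", 125);
        c.observe("user-1", 130);
        println!("{:?}", c.counts); // {("user-1", 120): 2}
    }

That in-memory map is the crux: once it no longer fits on one machine, or has to survive a restart, you need everything the stateful engines provide.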
Can you help me understand how this would plug into stream processing? My immediate thought is web page interaction replays, but that seems like a rather exotic use case?
A major difference seems to be converting things to Arrow and using SQL instead of a DSL (VRL).
seems like a simplified equivalent of https://vector.dev/
No? Vector is for observability: getting your metrics/logs, transforming them if needed, and putting them into the necessary backends. Transformation is optional, and it's for cases like downsampling, converting formats, or adding metadata.
ArkFlow gets data from things like databases and message queues/brokers, transforms it, and puts it back into databases and message queues/brokers. Transformation looks like a pretty central use case.
Very different scenarios. It's like saying that a Renault Kangoo is a simplified equivalent of a BTR-80 because both have wheels, engine and space for stuff.