
Real-time processing: Storm / Flink vs standard application (Java, C#...)

Published on 2020-12-03 16:49:54

I am wondering how best to implement an application that processes events coming from Kafka. I have two architecture patterns in mind:

  • an application developed using the Apache Storm or Apache Flink framework that would process events consumed from Kafka
  • a standard Java application (or Python, C#...), deployed X times (scaled according to traffic), which would process events coming from Kafka; essentially a hand-rolled consumer loop (see the sketch after this list)
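For context, the second option boils down to a hand-rolled consumer loop. A minimal sketch using the plain Kafka client could look like the following; the topic name, group id, and handleEvent method are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All instances share a group id, so Kafka balances partitions
        // across however many copies of this application are deployed.
        props.put("group.id", "event-processors");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit manually after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handleEvent(record.value()); // business logic goes here
                }
                // At-least-once semantics: a crash before this commit means
                // the batch is reprocessed on restart.
                consumer.commitSync();
            }
        }
    }

    private static void handleEvent(String event) {
        System.out.println("processing: " + event);
    }
}
```

Scaling this means deploying more instances; everything beyond partition balancing (retries, state, exactly-once delivery) is up to you, which is exactly the trade-off the answer below discusses.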

I find it difficult to tell which of these scenarios is the better choice. Could someone help me with this topic?

Questioner: Xire
Arvid Heise 2020-12-04 22:15:41

It's hard to give definitive advice with so little information available, so I'll keep my response general until you provide more specific details:

Choosing a processing framework over a native implementation gives you the following advantages:

  • Parallel processing with (in theory) infinite scalability: If you ever expect that you cannot process all events in a single thread in a timely manner, you first need to scale up (more threads) and eventually scale out (more machines). A framework takes care of all synchronization between threads and machines, so you just need to write sequential code glued together with some high-level primitives (similar to LINQ in C#; see the Flink sketch after this list).
  • Fault tolerance: What happens when your code screws up (some edge case not implemented)? When you run out of resources? When the network (to Kafka or other machines) temporarily breaks? A framework takes care of all these nasty little details.
  • Exactly-once processing: When you restart the application after a failure, most frameworks give you some form of exactly-once processing. How do you avoid losing data? How do you avoid duplicates when reprocessing old data?
  • Managed state: If your application needs to remember things for a certain time (calculating sums/averages or joining data), how do you ensure that the state is kept in sync with the data in case of failure?
  • Advanced features: time triggers, complex event processing (= pattern matching on events), writing to different sinks (Kafka for low latency, S3 for batch processing).
  • Flexibility of storage: if you want to try out a different storage system, it's much easier to change the source/sink in an application written in a framework.
  • Integration with deployment platforms: If you want to scale to several machines, it's usually much easier to scale on a platform that already offers the related integration (at the time of writing, that is mostly Kubernetes). But all frameworks also support simple local setups where you just scale up on one (bigger) machine.
  • Low-level optimizations: When using newer engines with higher-level abstractions, the framework may generate code that is much more efficient than what you could implement yourself (with specific memory layouts or serialized data processing).
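To make several of these points concrete (checkpointing for fault tolerance, keyed windows backed by managed state, parallelism handled by the runtime), here is a minimal Flink sketch. The topic name, group id, and per-key counting logic are placeholder assumptions, and the connector shown is the FlinkKafkaConsumer API from the Flink 1.11/1.12 era in which this answer was written:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class EventCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Fault tolerance: periodic checkpoints let Flink restore state and
        // Kafka offsets after a failure, giving exactly-once state semantics.
        env.enableCheckpointing(10_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "event-processors"); // hypothetical group id

        env.addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props))
            .map(value -> Tuple2.of(value, 1))             // hypothetical: count events per payload
            .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint needed for the lambda
            .keyBy(t -> t.f0)                              // framework shuffles data across machines
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .sum(1)                                        // managed, checkpointed state
            .print();                                      // swap for a Kafka or S3 sink in production

        env.execute("event-count");
    }
}
```

The sequential-looking pipeline is the point: parallelism, state snapshots, and recovery come from the runtime rather than from code you write yourself.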

The big downsides are usually:

  • Complexity of the framework: you need to understand how the framework works from a user's perspective. However, you usually save time by not going into the details of writing a custom consumer/producer, so it's not as bad as it initially seems.
  • Flexibility in code: you cannot write arbitrary code anymore. Since the framework handles parallelism for you, you need to think in terms of chunks of data and adjust your algorithms accordingly. Standard SQL operations are usually directly supported, though, in one form or another.
  • Less control over resource usage: since the platform schedules tasks across machines, you may end up with unfortunate assignments, and the platform may give you too few options to fix them. Note, though, that most applications suffer more from poor resource utilization caused by data skew and suboptimal algorithms.