
Real-time processing: Storm / Flink vs standard application (Java, C#...)

Published on 2020-12-03 16:49:54

I am wondering how best to implement an application that processes events coming from Kafka. I have two architecture patterns in mind:

  • an application developed using the Apache Storm or Apache Flink framework that would process events consumed from Kafka
  • a standard Java application (or Python, C#...), deployed X times (scaled according to traffic), which would process events coming from Kafka; essentially a hand-rolled consumer loop (see the sketch after this list)
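For context, the second option boils down to a hand-rolled consumer loop. A minimal sketch using the plain Kafka client could look like the following; the topic name, group id, and handleEvent method are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All instances share a group id, so Kafka balances partitions
        // across however many copies of this application are deployed.
        props.put("group.id", "event-processors");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit manually after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handleEvent(record.value()); // business logic goes here
                }
                // At-least-once semantics: a crash before this commit means
                // the batch is reprocessed on restart.
                consumer.commitSync();
            }
        }
    }

    private static void handleEvent(String event) {
        System.out.println("processing: " + event);
    }
}
```

Scaling this means deploying more instances; everything beyond partition balancing (retries, state, exactly-once delivery) is up to you, which is exactly the trade-off the answer below discusses.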

I find it difficult to tell which of these scenarios is the better choice. Could someone help me with this topic?

Questioner: Xire
Arvid Heise 2020-12-04 22:15:41

It's hard to give definitive advice with so little information available, so I'll keep my response general until you provide more specific details:

Choosing a processing framework over a native implementation gives you the following advantages:

  • Parallel processing with (in theory) infinite scalability: If you ever expect that you cannot process all events in a single thread in a timely manner, you first need to scale up (more threads) and eventually scale out (more machines). A framework takes care of all synchronization between threads and machines, so you just need to write sequential code glued together with some high-level primitives (similar to LINQ in C#; see the Flink sketch after this list).
  • Fault tolerance: What happens when your code screws up (some edge case not implemented)? When you run out of resources? When the network (to Kafka or other machines) temporarily breaks? A framework takes care of all these nasty little details.
  • Exactly-once processing: When you restart the application after a failure, most frameworks give you some form of exactly-once processing. How do you avoid losing data? How do you avoid duplicates when reprocessing old data?
  • Managed state: If your application needs to remember things for a certain time (calculating sums/averages or joining data), how do you ensure that the state is kept in sync with the data in case of failure?
  • Advanced features: time triggers, complex event processing (= pattern matching on events), writing to different sinks (Kafka for low latency, S3 for batch processing).
  • Flexibility of storage: if you want to try out a different storage system, it's much easier to change the source/sink in an application written in a framework.
  • Integration with deployment platforms: If you want to scale to several machines, it's usually much easier to scale on a platform that already offers the related integration (at the time of writing, that is mostly Kubernetes). But all frameworks also support simple local setups where you just scale up on one (bigger) machine.
  • Low-level optimizations: When using newer engines with higher-level abstractions, the framework may generate code that is much more efficient than what you could implement yourself (with specific memory layouts or serialized data processing).
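To make several of these points concrete (checkpointing for fault tolerance, keyed windows backed by managed state, parallelism handled by the runtime), here is a minimal Flink sketch. The topic name, group id, and per-key counting logic are placeholder assumptions, and the connector shown is the FlinkKafkaConsumer API from the Flink 1.11/1.12 era in which this answer was written:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class EventCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Fault tolerance: periodic checkpoints let Flink restore state and
        // Kafka offsets after a failure, giving exactly-once state semantics.
        env.enableCheckpointing(10_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "event-processors"); // hypothetical group id

        env.addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props))
            .map(value -> Tuple2.of(value, 1))             // hypothetical: count events per payload
            .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint needed for the lambda
            .keyBy(t -> t.f0)                              // framework shuffles data across machines
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .sum(1)                                        // managed, checkpointed state
            .print();                                      // swap for a Kafka or S3 sink in production

        env.execute("event-count");
    }
}
```

The sequential-looking pipeline is the point: parallelism, state snapshots, and recovery come from the runtime rather than from code you write yourself.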

The big downsides are usually:

  • Complexity of the framework: you need to understand how the framework works from a user's perspective. However, you usually save time by not going into the details of writing a custom consumer/producer, so it's not as bad as it initially seems.
  • Flexibility in code: you cannot write arbitrary code anymore. Since the framework handles parallelism for you, you need to think in terms of chunks of data and adjust your algorithms accordingly. Standard SQL operations are usually directly supported, though, in one form or another.
  • Less control over resource usage: since the platform schedules tasks across machines, you may end up with unfortunate assignments, and the platform may give you too few options to fix them. Note, though, that most applications suffer more from poor resource utilization caused by data skew and suboptimal algorithms.