Reproduced from stackoverflow.com.
Tags: java, java-stream, out-of-memory, lazy-evaluation

Parallel Infinite Java Streams run out of Memory

Posted on 2020-03-31 22:53:26

I'm trying to understand why the following Java program gives an OutOfMemoryError, while the corresponding program without .parallel() doesn't.

System.out.println(Stream
    .iterate(1, i -> i+1)
    .parallel()
    .flatMap(n -> Stream.iterate(n, i -> i+n))
    .mapToInt(Integer::intValue)
    .limit(100_000_000)
    .sum()
);

I have two questions:

  1. What is the intended output of this program?

    Without .parallel() it seems that this simply outputs sum(1+2+3+...), i.e. it "gets stuck" at the first stream produced inside the flatMap, which makes sense.

    With .parallel() I don't know if there is an expected behaviour, but my guess would be that it somehow interleaves the first n or so streams, where n is the number of parallel workers. It could also be slightly different based on the chunking/buffering behaviour.

  2. What causes it to run out of memory? I'm specifically trying to understand how these streams are implemented under the hood.

    I'm guessing that something blocks the stream, so it never finishes and never gets to discard the generated values, but I don't quite know in which order things are evaluated and where buffering occurs.

Edit: In case it is relevant, I'm using Java 11.

Edit 2: Apparently the same thing happens even for the simple program IntStream.iterate(1,i->i+1).limit(1000_000_000).parallel().sum(), so it might have to do with the laziness of limit rather than flatMap.
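
For reference, here is the reproducer from Edit 2 written out as a self-contained program (the class wrapper, its name and the import are additions for illustration; the pipeline itself is exactly the one quoted above, which reportedly also ends in an OutOfMemoryError):

import java.util.stream.IntStream;

public class LimitRepro {
    public static void main(String[] args) {
        // Same pipeline as in Edit 2: an infinite, unsized source plus limit() and parallel().
        System.out.println(IntStream
            .iterate(1, i -> i + 1)
            .limit(1000_000_000)
            .parallel()
            .sum()
        );
    }
}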

Questioner: Thomas Ahle

Answer:

You say “but I don't quite know in which order things are evaluated and where buffering occurs”, which is precisely what parallel streams are about. The order of evaluation is unspecified.

A critical aspect of your example is the .limit(100_000_000). This implies that the implementation can’t just sum up arbitrary values, but must sum up the first 100,000,000 numbers. Note that in the reference implementation, .unordered().limit(100_000_000) doesn’t change the outcome, which indicates that there’s no special implementation for the unordered case, but that’s an implementation detail.
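
For concreteness, this is where the .unordered() call mentioned above would sit in the questioner's pipeline (only that one added line differs from the original snippet); per the observation above, it still runs out of memory in the reference implementation:

System.out.println(Stream
    .iterate(1, i -> i+1)
    .parallel()
    .flatMap(n -> Stream.iterate(n, i -> i+n))
    .mapToInt(Integer::intValue)
    .unordered()   // declaring the pipeline unordered does not change the outcome here
    .limit(100_000_000)
    .sum()
);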

Now, when worker threads process the elements, they can’t just sum them up, as they have to know which elements they are allowed to consume, which depends on how many elements precede their specific workload. Since this stream doesn’t know the sizes, this can only be known when the prefix elements have been processed, which never happens for infinite streams. So the worker threads keep buffering until this information becomes available.
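
As a contrast (a minimal sketch added here, not part of the original answer): with a source that knows its size, such as LongStream.rangeClosed, every chunk’s position and size are known up front, so the limit can be applied positionally and the workers can combine partial sums without unbounded buffering; on my understanding this terminates quickly:

import java.util.stream.LongStream;

public class SizedContrast {
    public static void main(String[] args) {
        // A SIZED/SUBSIZED source: each split knows exactly where it sits in the
        // encounter order, so no worker has to buffer while waiting to learn how
        // many elements precede it.
        long sum = LongStream.rangeClosed(1, 200_000_000)
            .parallel()
            .limit(100_000_000)
            .sum();
        System.out.println(sum);   // 5000000050000000
    }
}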

In principle, when a worker thread knows that it processes the leftmost¹ work-chunk, it could sum up the elements immediately, count them, and signal the end when reaching the limit. So the Stream could terminate, but this depends on a lot of factors.

In your case, a plausible scenario is that the other worker threads are faster at allocating buffers than the leftmost job is at counting. In this scenario, subtle changes to the timing could make the stream occasionally return with a value.

When we slow down all worker threads except the one processing the leftmost chunk, we can make the stream terminate (at least in most runs):

// uses java.util.stream.IntStream and java.util.concurrent.locks.LockSupport
System.out.println(IntStream
    .iterate(1, i -> i+1)
    .parallel()
    // delay every element except 1, i.e. slow down all workers except the one processing the leftmost chunk
    .peek(i -> { if(i != 1) LockSupport.parkNanos(1_000_000_000); })
    .flatMap(n -> IntStream.iterate(n, i -> i+n))
    .limit(100_000_000)
    .sum()
);

¹ I’m following a suggestion by Stuart Marks to use left-to-right order when talking about the encounter order rather than the processing order.