This article is reproduced from stackoverflow.com.
apache-spark hadoop

How to control the number of Hadoop IPC retry attempts for a Spark job submission?

Published on 2020-04-08 09:22:25

Suppose I attempt to submit a Spark (2.4.x) job to a Kerberized cluster, without having valid Kerberos credentials. In this case, the Spark launcher tries repeatedly to initiate a Hadoop IPC call, but fails:

20/01/22 15:49:32 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "node-1.cluster/172.18.0.2"; destination host is: "node-1.cluster":8032; , while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics over null after 1 failover attempts. Trying to failover after sleeping for 35160ms.

This will repeat a number of times (30, in my case), until eventually the launcher gives up and the job submission is considered failed.

Various similar questions mention the following properties (which are actually YARN properties, prefixed with spark. as per the standard mechanism for passing them with a Spark application):

  • spark.yarn.maxAppAttempts
  • spark.yarn.resourcemanager.am.max-attempts

However, neither of these properties affects the behavior I'm describing. How can I control the number of IPC retries in a Spark job submission?

Questioner: Jeff Evans
Viewed: 114
Jeff Evans 2020-02-01 05:49

After a good deal of debugging, I figured out the properties involved here.

  • yarn.client.failover-max-attempts (controls the max attempts)

If this property is not set, the number of attempts appears to be derived from the ratio of the following two properties (the first divided by the second).

  • yarn.resourcemanager.connect.max-wait.ms
  • yarn.client.failover-sleep-base-ms
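
To make the ratio concrete, here is a minimal Python sketch of the arithmetic. The function name is hypothetical, the use of integer (truncating) division is an assumption, and the two input values are illustrative numbers chosen only because they reproduce the 30 attempts seen in the question; they are not confirmed defaults for any particular Hadoop version.

```python
def derived_failover_attempts(connect_max_wait_ms: int, failover_sleep_base_ms: int) -> int:
    """Estimate the retry count when yarn.client.failover-max-attempts is unset.

    Hypothetical helper: mirrors the ratio described above
    (yarn.resourcemanager.connect.max-wait.ms divided by
    yarn.client.failover-sleep-base-ms), assuming integer division.
    """
    if failover_sleep_base_ms <= 0:
        raise ValueError("failover_sleep_base_ms must be positive")
    return connect_max_wait_ms // failover_sleep_base_ms


# Illustrative values only (assumed, not read from a real cluster).
print(derived_failover_attempts(900_000, 30_000))
```

With these assumed inputs the sketch yields 30, which is consistent with the behavior observed in the question. Lowering the wait time or raising the sleep base would shrink the number of attempts accordingly.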

Of course, as with any YARN property, these must be prefixed with spark.hadoop. when they are passed as part of a Spark job submission.
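
As a rough illustration of the prefixing, the following Python sketch builds the `--conf` arguments for `spark-submit` from plain YARN property names. The helper name `yarn_conf_args` is hypothetical, and the value 3 is an arbitrary example rather than a recommendation.

```python
def yarn_conf_args(yarn_props: dict) -> list:
    """Turn YARN properties into spark-submit --conf arguments.

    Hypothetical helper: each key gets the spark.hadoop. prefix so that
    Spark forwards it to the underlying Hadoop configuration.
    """
    args = []
    for key, value in yarn_props.items():
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# Example: cap the failover attempts at 3 (arbitrary illustrative value).
args = yarn_conf_args({"yarn.client.failover-max-attempts": 3})
print(" ".join(["spark-submit", *args]))
```

The printed command line would then be completed with the usual application arguments (master, class, jar, and so on), which are omitted here.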

The relevant class (which resolves all these properties) is RMProxy, within the Hadoop YARN project (source here). All these, and related, properties are documented here.