Note: This question is reproduced from serverfault.com.

amazon emr - spark-submit on AWS EMR runs but fails on accessing S3

Posted on 2020-11-28 21:15:08

I've written a Spark application and compiled it into a .jar file. I can use it just fine from spark-shell --jars myApplication.jar running on the master node of my EMR cluster:

scala> // pass in the existing spark context to the doSomething function, run with a particular argument.
scala> com.MyCompany.MyMainClass.doSomething(spark, "dataset1234")
...

This all works fine. I've also set up my main function so that the application can be submitted with spark-submit:

package com.MyCompany
import org.apache.spark.sql.SparkSession
object MyMainClass {
  val spark = SparkSession.builder()
    .master(("local[*]"))
    .appName("myApp")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    doSomething(spark, args(0))
  }
  
  // implementation of doSomething(...) omitted
}

Using a very simple main method that just prints out args, I confirmed that I can invoke main with spark-submit. However, when I submit my actual production job to the cluster, it fails. I submit it like this:

spark-submit --deploy-mode cluster --class com.MyCompany.MyMainClass s3://my-bucket/myApplication.jar dataset1234

In the console I see a lot of messages, including some warnings, but nothing particularly useful:

20/11/28 19:28:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/28 19:28:47 WARN DependencyUtils: Skip remote jar s3://my-bucket/myApplication.jar.
20/11/28 19:28:47 INFO RMProxy: Connecting to ResourceManager at ip-xxx-xxx-xxx-xxx.region.compute.internal/172.31.31.156:8032
20/11/28 19:28:47 INFO Client: Requesting a new application from cluster with 20 NodeManagers
20/11/28 19:28:48 INFO Configuration: resource-types.xml not found
20/11/28 19:28:48 INFO ResourceUtils: Unable to find 'resource-types.xml'.
20/11/28 19:28:48 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
20/11/28 19:28:48 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
20/11/28 19:28:48 INFO Client: Setting up container launch context for our AM
20/11/28 19:28:48 INFO Client: Setting up the launch environment for our AM container
20/11/28 19:28:48 INFO Client: Preparing resources for our AM container
20/11/28 19:28:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/11/28 19:28:51 INFO Client: Uploading resource file:/mnt/tmp/spark-e34d573d-8f23-403c-ac41-aa5154db8ecd/__spark_libs__8971082428743972083.zip -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/__spark_libs__8971082428743972083.zip
20/11/28 19:28:53 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
20/11/28 19:28:53 INFO Client: Uploading resource s3://my-bucket/myApplication.jar -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/myApplication.jar
20/11/28 19:28:54 INFO S3NativeFileSystem: Opening 's3://my-bucket/myApplication.jar' for reading
20/11/28 19:28:54 INFO Client: Uploading resource file:/mnt/tmp/spark-e34d573d-8f23-403c-ac41-aa5154db8ecd/__spark_conf__5385616689365996012.zip -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/__spark_conf__.zip
20/11/28 19:28:54 INFO SecurityManager: Changing view acls to: hadoop
20/11/28 19:28:54 INFO SecurityManager: Changing modify acls to: hadoop
20/11/28 19:28:54 INFO SecurityManager: Changing view acls groups to:
20/11/28 19:28:54 INFO SecurityManager: Changing modify acls groups to:
20/11/28 19:28:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/11/28 19:28:54 INFO Client: Submitting application application_1606587406989_0005 to ResourceManager
20/11/28 19:28:54 INFO YarnClientImpl: Submitted application application_1606587406989_0005
20/11/28 19:28:55 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
20/11/28 19:28:55 INFO Client:
         client token: N/A

Then, once a second for several minutes (six minutes in this example), I get application reports with state: ACCEPTED, until it fails with no useful information:

20/11/28 19:28:56 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
...
... (lots of these messages)
...
20/11/28 19:31:55 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
20/11/28 19:34:52 INFO Client: Application report for application_1606587406989_0005 (state: FAILED)
20/11/28 19:34:52 INFO Client:
         client token: N/A
         diagnostics: Application application_1606587406989_0005 failed 2 times due to AM Container for appattempt_1606587406989_0005_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: [2020-11-28 19:32:24.087]Exception from container-launch.
Container id: container_1606587406989_0005_02_000001
Exit code: 13

[2020-11-28 19:32:24.117]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
elled)
20/11/28 19:32:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/11/28 19:32:22 WARN TaskSetManager: Lost task 15.0 in stage 1.0 (TID 135, ip-xxx-xxx-xxx-xxx.region.compute.internal, executor driver): TaskKilled (Stage cancelled)

Eventually, the logs indicate:

org.apache.spark.sql.AnalysisException: Path does not exist: s3://my-bucket/dataset1234.parquet;
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:759)

My application creates this file first, but silently ignores a failure to create it and continues on (to cover the case where the job is run, succeeds, and is run again, attempting to overwrite the file). The second part then reads this file and does some further work. So what I can tell from this error message is that my application runs and gets past the first part, but apparently Spark was unable to write the file out to S3. From the second log message it also looks like (?) Spark couldn't download the remote jar file from S3. (I did happen to copy the file into ~hadoop/ before running spark-submit, though I don't know whether it failed to download from S3 and fell back to the local copy.)
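For illustration, the flow described above corresponds roughly to the following sketch. The real doSomething implementation is omitted from the question, so the Try wrapper, the placeholder dataset, and the exact write/read calls here are assumptions, not the actual code:

import scala.util.Try
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the write-then-read flow described above;
// the question does not show the real doSomething implementation.
def doSomething(spark: SparkSession, name: String): Unit = {
  val path = s"s3://my-bucket/$name.parquet"

  // Part 1: write the dataset, silently ignoring a failed write
  // (e.g. when re-running a job that already produced the file).
  Try {
    spark.range(100).toDF("id").write.parquet(path) // placeholder dataset
  }

  // Part 2: read the file back and continue. This is where
  // "Path does not exist" surfaces if the write never happened.
  val df = spark.read.parquet(path)
  df.count()
}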

I got the spark-submit command by checking what the EMR AWS CLI export shows for a step created through the web interface. Is this a problem with EMR not having S3 permissions? That seems unlikely, but then what else could be going wrong here? It does do what I asked: it seems to successfully determine that the file doesn't exist (i.e., it has read access to my bucket), yet it cannot create the file.

How can I get better debugging information about this? Is there any way to make sure the EMR <-> S3 permissions are correct?
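One quick check that separates a permissions problem from a job-configuration problem is a tiny manual write from a spark-shell on the master node. A minimal sketch, assuming an illustrative test prefix in the same bucket (the _permission_test path is a placeholder, not from the question):

// Run inside spark-shell on the EMR master node.
val probePath = "s3://my-bucket/_permission_test"   // illustrative placeholder
spark.range(1).write.mode("overwrite").parquet(probePath) // checks write access
spark.read.parquet(probePath).show()                      // checks read access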

Questioner: sousvide
Viewed: 11

Answered by srikanth holur on 2020-12-22 03:17:44:

Get rid of this: .master(("local[*]")). When you are running on a cluster and accessing S3 files, your master should not be local.
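A minimal sketch of what that change looks like, assuming the rest of MyMainClass from the question stays the same:

import org.apache.spark.sql.SparkSession

// Same builder as in the question, with the hard-coded master removed.
// When launched with spark-submit --deploy-mode cluster on EMR, YARN
// supplies the master instead of it being overridden to local[*] in code,
// so the job actually runs on the cluster and uses the cluster's S3 access.
val spark = SparkSession.builder()
  .appName("myApp")
  .getOrCreate()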