Warm tip: This article is reproduced from serverfault.com, please click

apache-spark hadoop hdfs yarn

Spark on yarn concept understanding

发布于 2014-07-23 12:00:27

I am trying to understand how spark runs on YARN cluster/client. I have the following question in my mind.

Is it necessary that spark is installed on all the nodes in yarn cluster? I think it should because worker nodes in cluster execute a task and should be able to decode the code(spark APIs) in spark application sent to cluster by the driver?
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does client node have to install Hadoop when it is sending the job to cluster?

Questioner

Sporty

Viewed

0

63.9k 2016-06-12 20:03:18

We are running spark jobs on YARN (we use HDP 2.2).

We don't have spark installed on the cluster. We only added the Spark assembly jar to the HDFS.

For example to run the Pi example:

./bin/spark-submit \
  --verbose \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
  --num-executors 2 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 4 \
  hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100

--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - This config tell the yarn from were to take the spark assembly. If you don't use it, it will upload the jar from were you run spark-submit.

About your second question: The client node doesn't not need Hadoop installed. It only needs the configuration files. You can copy the directory from your cluster to your client.

y2k-shubham 2018-03-21 09:37:56

As pointed out by @Jacek Laskowski here, Spark no longer uses --num-executors in YARN mode

y2k-shubham 2018-03-22 06:52:53

I believe we can now fall-back on spark.dynamicAllocation.initialExecutors for --num-executors or spark.executor.instances

热门帖子

1

我又又又来求购 pixel 初代了。。。

2

[内购终身版限时免费][macOS] Barbee - 拯救你的菜单栏空间（刘海屏快来！

3

关于出镜旅游

4

原来凤凰系统已经倒闭 2 年多了

5

B 站必火推广好假啊

6

Apple TV 经常自动开机

7

推特尼日尼亚去也涨价了,羊毛没了

8

求推荐 3k 左右的烘干机

9

真搞笑， homepod 在 windows 上延迟比在 mac 上还小。。

10

[疑难杂症] 安卓面具求助，拯救楼主的 3000 软 QAQ

热门github

1

A multi-platform library for OpenGL, OpenGL ES, Vulkan, window and input

2

Dev tool that writes scalable apps from scratch while the developer oversees the implementation

3

shadcn/ui, but for Svelte. ✨

4

The Python Risk Identification Tool for generative AI (PyRIT) is an open access automation framework to empower security professionals and machine learning engineers to proactively find risks in their generative AI systems.

5

Performance-portable, length-agnostic SIMD with runtime dispatch

6

ZK Credo

7

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

8

Joplin - the secure note taking and to-do app with synchronisation capabilities for Windows, macOS, Linux, Android and iOS.

9

Mamba is a new state space model architecture showing promising performance on information-dense data such as language modeling, where previous subquadratic models fall short of Transformers. It is based on the line of progress on structured state space models, with an efficient hardware-aware design and implementation in the spirit of FlashAttention.

10

This repository contains System Design resources which are useful while preparing for interviews and learning Distributed Systems

11

Curso para aprender el lenguaje de programación Python desde cero y para principiantes. 75 clases, 37 horas en vídeo, código, proyectos y grupo de chat. Fundamentos, frontend, backend, testing, IA...

12

🎓 Path to a free self-taught education in Computer Science!

13

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

14

A collective list of free APIs

15

📚 Freely available programming books