Spark Architecture 系统架构_网络技术_宿州市腾雀网络科技有限公司

Spark Architecture 系统架构

发布时间 - 2025-06-27 00:00:00 点击率：次

let's delve into the apache spark architecture, providing a high-level overview and discussing some key software components in detail.

High-Level Overview Apache Spark's application architecture is composed of several crucial components that work together to process data in a distributed environment. Understanding these components is essential for grasping how Spark functions. The key components include:

Driver Program
Master Node
Worker Node
Executor
Tasks
SparkContext
SQL Context
Spark Session

Here's an overview of how these components integrate within the overall architecture:

Apache Spark application architecture - Standalone mode

Driver Program The Driver Program serves as the primary component of a Spark application. The machine hosting the Spark application process, which initializes SparkContext and Spark Session, is referred to as the Driver Node, and the running process is known as the Driver Process. This program interacts with the Cluster Manager to allocate tasks to executors.

Cluster Manager As the name suggests, a Cluster Manager oversees a cluster. Spark is compatible with various cluster managers such as YARN, Mesos, and a Standalone cluster manager. In a standalone setup, there are two continuously running daemons: one on the master node and one on each worker node. Further details on cluster managers and deployment models will be covered in Chapter 8, Operating in Clustered Mode.

Worker If you're familiar with Hadoop, you'll recognize that a Worker Node is akin to a slave node. These nodes are where the actual computational work occurs within Spark executors. They report their available resources back to the master node. Typically, each node in a Spark cluster, except the master, runs a worker process. Usually, one Spark worker daemon is initiated per worker node, which then launches and oversees executors for the applications.

Executors The master node allocates resources and utilizes workers across the cluster to instantiate Executors for the driver. These executors are employed by the driver to execute tasks. Executors are initiated only when a job begins on a worker node. Each application maintains its own set of executor processes, which can remain active throughout the application's lifecycle and execute tasks across multiple threads. This approach ensures application isolation and prevents data sharing between different applications. Executors are responsible for task execution and managing data in memory or on disk.

Tasks A task represents a unit of work dispatched to an executor. It is essentially a command sent from the Driver Program to an executor, serialized as a Function object. The executor deserializes this command (which is part of your previously loaded JAR) and executes it on a specific data partition.

A partition is a logical division of data spread across a Spark cluster. Spark typically reads data from a distributed storage system and partitions it to facilitate parallel processing across the cluster. For instance, when reading from HDFS, a partition is created for each HDFS partition. Partitions are crucial because Spark executes one task per partition. Consequently, the number of partitions is significant. Spark automatically sets the number of partitions unless manually specified, e.g., sc.parallelize(data, numPartitions).

SparkContext SparkContext serves as the entry point for a Spark session. It connects you to the Spark cluster and enables the creation of RDDs, accumulators, and broadcast variables on that cluster. Ideally, only one SparkContext should be active per JVM. Therefore, you must call stop() on the active SparkContext before initiating a new one. In local mode, when starting a Python or Scala shell, a SparkContext object is automatically created, and the variable sc references this SparkContext object, allowing you to create RDDs from text files without explicitly initializing it.

/** 
 * Read a text file from HDFS, a local file system (available on all nodes), or any 
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 * The text files must be encoded as UTF-8.
 * 
 * @param path path to the text file on a supported file system
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of lines of the text file
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
/** Get an RDD for a Hadoop file with an arbitrary InputFormat
@note Because Hadoop's RecordReader class re-uses the same Writable object for each 
record, directly caching the returned RDD or directly passing it to an aggregation or shuffle 
operation will create many references to the same object.
If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first 
copy them using a map function.
@param path directory to the input data files, the path can be comma separated paths 
as a list of inputs
@param inputFormatClass storage format of the data to be read
@param keyClass Class of the key associated with the inputFormatClass parameter
@param valueClass Class of the value associated with the inputFormatClass parameter
@param minPartitions suggested minimum number of partitions for the resulting RDD
@return RDD of tuples of key and corresponding value
*/
def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
assertNotStopped()
val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
new HadoopRDD(
this,
confBroadcast,
Some(setInputPathsFunc),
inputFormatClass,
keyClass,
valueClass,
minPartitions).setName(path)
}

Spark Session The Spark Session is the entry point for programming with Spark using the dataset and DataFrame API.

For more in-depth information, you can refer to the following resource: Apache Spark Architecture.

# python # apache # ai # red

相关栏目：【网站优化151355 】【网络推广146373 】【网络技术251813 】【 AI营销90571 】

上一篇：义乌网站建设：打造线上商业新门户

下一篇：python snownlp情感分析简易d

相关栏目网站优化
网络推广
网络技术
AI营销

最新文章 Sublime怎么一键压缩JS代码 Su sublime如何在搜索中使用正则表达式 Sublime如何设置透明窗口效果 Su mysql如何设计商品表结构_mysql css属性背景图不显示怎么办_通过检查路如何使用Golang实现排序_Golan 农历闰月是怎么回事_为合回归年加一月调整塑造《刺客信条》艾吉奥传奇的编剧离开育碧 1英里等于多少公里 1mile和km的换 css grid布局中行和列是如何定义的 PS批量旋转和翻转图片，快速校正图片方向 C# Swagger UI自定义方法 C OPPO手机九宫格和全键盘怎么切换_OP Go语言如何实现用户登录注册_Golan 1节飞行速度多少公里每小时 1节是多少公纸嫁衣8千子树第五章庙门怎么开启庙门 Laravel 多行数据编辑表单中实现逐明日之后如何提升钓鱼等级明日之后提升钓支付宝怎样查年度账单_支付宝年度账单查看 C# 多线程UI更新Dispatcher

上一篇：义乌网站建设：打造线上商业新门户

下一篇：python snownlp情感分析简易d