spark支持YARN做资源调度器,所以YARN的原理还是应该知道的:
http://www.socc2013.org/home/program/a5-vavilapalli.pdf 但总体来说,这是一篇写得一般的论文,它的原理没有什么特别突出的,而且它列举的数据没有对比性,几乎看不出YARN有什么优势。反正我看完的感觉是,YARN的资源分配在延迟上估计很糟糕。而实际使用似乎也印证了这个预感。
Abstract
two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs’ control flow, which resulted in endless scalability concerns for the scheduler
以前的资源管理器的两个缺点: 1)与编程模式耦合得紧密,迫使程序员过度使用MapReduce的编程模式;2)过于集中处理jobs的控制流程,使得扩展性不佳; YARN 致力于解决这些问题。
Introduction
Nothing interesting...
Hadoop on Demand shortcomings
我们关心的问题:
- Scalability 扩展性
- Multi-tenancy 多路租用
- Serviceability 可用性
- Locality awareness 本地化
- High Cluster Utiltization 集群使用率
- Reliability/Availability 可靠性
- Secure and auditable operation 安全
- Support for Programming Model Diversity 编程模式多样化
- Flexible Resource Model 资源的可伸缩性
- Backward compatibility 后向兼容
3 Architecture
To address the requirements we discussed in Section 2, YARN lifts some functions into a platform layer responsible for resource management, leaving coordination of logical execution plans to a host of framework implementations. Specifically, a per-cluster ResourceManager (RM) tracks resource usage and node liveness, enforces allocation invariants, and arbitrates contention among tenants. By separating these duties in the JobTracker’s charter, the central allocator can use an abstract description of tenants’ requirements, but remain ignorant of the semantics of each allocation. That responsibility is delegated to an ApplicationMaster (AM), which coordinates the logical plan of a single job by requesting resources from the RM, generating a physical plan from the resources it receives, and coordinating the execution of that plan around faults.
YARN将资源管理的功能提升到平台层面处理,而将逻辑执行计划的协调工作留给一个主机框架实现。具体来说,一个资源管理器(RM)跟踪资源的使用和可用性,根据恒定的分配方式,仲裁资源竞争和租约冲突。 它将每个应用的语意留给了应用管理器(AM),它负责一个job的从RM获取资源到协调逻辑计划、产生物理计划,并执行计划和处理执行中的错误。
3.1 Overview
The RM runs as a daemon on a dedicated machine, and acts as the central authority arbitrating resources among various competing applications in the cluster. Given this central and global view of the cluster resources, it can enforce rich, familiar properties such as fairness [R10], capacity [R10], and locality [R4] across tenants. Depending on the application demand, scheduling priorities, and resource availability, the RM dynamically allocates leases– called containers– to applications to run on particular nodes. 5 The container is a logical bundle of resources (e.g., h2GB RAM, 1 CPUi) bound to a particular node [R4,R9]. In order to enforce and track such assignments, the RM interacts with a special system daemon running on each node called the NodeManager (NM). Communications between RM and NMs are heartbeatbased for scalability. NMs are responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing). The RM assembles its global view from these snapshots of NM state
RM 作为一个后台服务在某个指定的机器上执行,负责集中的仲裁处理。根据应用需求、调度优先级、可用资源,RM动态分配租约(也叫容器)给应用程序在特定的节点上执行。容器是绑定在某个特地节点上的资源的逻辑概念,例如2GB RAM,1CPU. 为了加强资源分配的跟踪,RM将与一个节点管理器交互(NM). RM与NM之间使用心跳进行通信(出于扩展性的考虑)。NM负责将空资源的可用性并向RM会报资源的错误和状态。而RM整合全局的的资源状态。
Jobs are submitted to the RM via a public submission protocol and go through an admission control phase during which security credentials are validated and various operational and administrative checks are performed [R7]. Accepted jobs are passed to the scheduler to be run. Once the scheduler has enough resources, the application is moved from accepted to running state. Aside from internal bookkeeping, this involves allocating a container for the AM and spawning it on a node in the cluster. A record of accepted applications is written to persistent storage and recovered in case of RM restart or failure
主要流程:
提交job -> 进入审核状态 -> 审核通过,进入一接受状态,开始分配资源 -> 分配到资源,进入执行状态,同时在一个NM上分配一个容器来执行AM。 为了能在出错后恢复,一个被接受的应用将被记录到磁盘中。
The ApplicationMaster is the “head” of a job, managing all lifecycle aspects including dynamically increasing and decreasing resources consumption, managing the flow of execution (e.g., running reducers against the output of maps), handling faults and computation skew, and performing other local optimizations. In fact, the AM can run arbitrary user code, and can be written in any programming language since all communication with the RM and NM is encoded using extensible communication protocols 6 —as an example consider 5 We will refer to “containers” as the logical lease on resources and the actual process spawned on the node interchangeably. 6 See: https://code.google.com/p/protobuf/ the Dryad port we discuss in Section 4.2. YARN makes few assumptions about the AM, although in practice we expect most jobs will use a higher level programming framework (e.g., MapReduce, Dryad, Tez, REEF, etc.). By delegating all these functions to AMs, YARN’s architecture gains a great deal of scalability [R1], programming model flexibility [R8], and improved upgrading/testing [R3] (since multiple versions of the same framework can coexist).
AM是应用的牵头人,负责资源的动态增减、执行流程、错误处理、computation skew、和本地化优化等方面的生命周期管理。事实上,AM可以执行任何编程语言的代码,只要满足AM和RM的通信协议要求。
Typically, an AM will need to harness the resources (cpus, RAM, disks etc.) available on multiple nodes to complete a job. To obtain containers, AM issues resource requests to the RM. The form of these requests includes specification of locality preferences and properties of the containers. The RM will attempt to satisfy the resource requests coming from each application according to availability and scheduling policies. When a resource is allocated on behalf of an AM, the RM generates a lease for the resource, which is pulled by a subsequent AM heartbeat. A token-based security mechanism guarantees its authenticity when the AM presents the container lease to the NM [R4]. Once the ApplicationMaster discovers that a container is available for its use, it encodes an application-specific launch request with the lease. In MapReduce, the code running in the container is either a map task or a reduce task. 7 If needed, running containers may communicate directly with the AM through an application-specific protocol to report status and liveness and receive framework-specific commands– YARN neither facilitates nor enforces this communication. Overall, a YARN deployment provides a basic, yet robust infrastructure for lifecycle management and monitoring of containers, while application-specific semantics are managed by each framework [R3,R8].
AM向RM发起资源请求,请求包含本地化偏好和容器属性。RM根据资源可用性和分配策略给AM分配资源。如果有可用的资源,RM给AM分配一个带租期的容器,并且通过AM的下一个心跳返回给AM。一个基于token的安全机制确保AM拥有包含某个节点的容器。 如果有需用,容器可以直接与AM通过指定协议通信,并会报自己的状态给AM。
3.2 Resource Manager (RM)
two public interfaces towards: RM暴露两个公开接口:
1) clients submitting applications, and 客户端提交应用程序;
2) ApplicationMaster(s) dynamically negotiating access to resources, and AM动态协商资源
ps: one internal interface towards NodeManagers for cluster monitoring and resource access management.
,还有一个与NM通信的内部接口。
AM向RM申请资源的格式:
1. number of containers (e.g., 200 containers),
2. resources per container h2GB RAM, 1 CPUi,
3. locality preferences, and
4. priority of requests within the application
point out what the ResourceManager is not responsible for. As discussed, it is not responsible for coordinating application execution or task fault-tolerance, but neither is is charged with 1) providing status or metrics for running applications (now part of the ApplicationMaster), nor 2) serving framework specific reports of completed jobs (now delegated to a per-framework daemon)
需要指出RM不负责的工作:
1) 提供应用的状态信息和度量信息(它是AM的工作)
2) serving framework specific reports of completed jobs
3.3 Application Master (AM)
The ApplicationMaster is the process that coordinates the application’s execution in the cluster, but it itself is run in the cluster just like any other container. A component of
the RM negotiates for the container to spawn this bootstrap process.
AM负责协调应用程序在集群中的执行,但它自己也和应用程序一样执行在集群的容器上,它是由RM启动的。
The AM periodically heartbeats to the RM to affirm its liveness and to update the record of its demand。 AM通过周期的心跳信息确认它需要的资源是否可用。
the allocations to an application are late binding: the process spawned is not bound to the request, but to the lease. The conditions that caused the AM to issue the request may not remain true when it receives its resources, but the semantics of the container are fungible and frameworkspecific
分配给应用程序的资源是推迟绑定的: 资源是绑定给一个租约,而不是帮给给一个(AM的)请求,所以AM收到资源时,可能资源已经不属于它了。
Since the RM does not interpret the container status, the AM determines the semantics of the success or failure of the container exit status reported by NMs through the RM. Since the AM is itself a container running in a cluster of unreliable hardware, it should be resilient to failure.
由于RM不解析容器的状态,AM需要根据容器返回的状态决定应用程序的成功或失败。又由于AM也是使用容器实现的,它应该具有容错性。
3.4 Node Manager (NM)
The NodeManager is the “worker” daemon in YARN. It authenticates container leases, manages containers’ dependencies, monitors their execution, and provides a set of services to containers. Operators configure it to report memory, CPU, and other resources available at this node and allocated for YARN. After registering with the RM, the NM heartbeats its status and receives instructions
NM作为一个工作节点。 它负责容器的租约、依赖关系、检视应用程序的执行、提供服务给容器。 它通过心跳信息会报内存、cpu和其他资源的可用性给RM,
All containers in YARN– including AMs– are described by a container launch context (CLC). This record includes a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payloads for NM services, and the command necessary to create the process.
YARN中所有容器可以用CLC表示。CLC包括环境变量、存储在远端的依赖、安全信息、NM的负载信息、创建进程的命令。
To launch the container, the NM copies all the necessary dependencies– data files, executables, tarballs– to local storage.
为了执行一个进程,NM将必要的信息、依赖拷贝到本地存储。
3.5 YARN framework/application writers
the responsibilities of a YARN application author:
1. Submitting the application by passing a CLC for the ApplicationMaster to the RM. 通过CLC提交应用到RM。
2. When RM starts the AM, it should register with the RM and periodically advertise its liveness and requirements over the heartbeat protocol。
当RM启动AM后,AM应该向RM注册并通过心跳信息周期汇报状态和可用资源。
3. Once the RM allocates a container, AM can construct a CLC to launch the container on the corresponding NM. It may also monitor the status of the running container and stop it when the resource should be reclaimed. Monitoring the progress of work done inside the container is strictly the AM’s responsibility.
当RM分配一个容器后,AM通过CLC在NM上启动应用程序。AM同时也监控容器的执行状态,停止并回收容器。监控容器的工作状态是AM的责任。
4. Once the AM is done with its work, it should unregister from the RM and exit cleanly.
一旦AM完成了所有工作,它应该向RM取消注册,并清理资源后退出。
5. Optionally, framework authors may add control flow between their own clients to report job status and expose a control plane.
7 Conclusion
Thanks to the decoupling of resource management and programming framework, YARN provides: 由于YARN从编程框架中解耦了,它能提供:
1) greater scalability, 可扩展性
2) higher efficiency, and 更有效率(这个我觉得很难说)
3) enables a large number of different frameworks to efficiently share a cluster. 使不同的计算框架可以共享相同的集群。