Spark源码系列（七）Spark on yarn具体实现 -数据库-火龙果软件工程

Spark源码系列（七）Spark on yarn具体实现

作者岑玉海的博客，火龙果软件发布于 2014-11-11

7883 次浏览

本来不打算写的了，但是真的是闲来无事，整天看美剧也没啥意思。这一章打算讲一下Spark on yarn的实现，1.0.0里面已经是一个stable的版本了，可是1.0.1也出来了，离1.0.0发布才一个月的时间，更新太快了，节奏跟不上啊，这里仍旧是讲1.0.0的代码，所以各位朋友也不要再问我讲的是哪个版本，目前为止发布的文章都是基于1.0.0的代码。

在第一章《spark-submit提交作业过程》的时候，我们讲过Spark on yarn的在cluster模式下它的main class是org.apache.spark.deploy.yarn.Client。okay，这个就是我们的头号目标。

提交作业

找到main函数，里面调用了run方法，我们直接看run方法。

val appId = runApp()
    monitorApplication(appId)
    System.exit(0)

运行App，跟踪App，最后退出。我们先看runApp吧。

def runApp(): ApplicationId = {
    // 校验参数，内存不能小于384Mb，Executor的数量不能少于1个。
    validateArgs()
    // 这两个是父类的方法，初始化并且启动Client
    init(yarnConf)
    start()

// 记录集群的信息(e.g, NodeManagers的数量，队列的信息).
logClusterResourceDetails()

// 准备提交请求到ResourcManager (specifically its ApplicationsManager (ASM)// Get a new client application.
val newApp = super.createApplication()
val newAppResponse = newApp.getNewApplicationResponse()
val appId = newAppResponse.getApplicationId()
// 检查集群的内存是否满足当前的作业需求
verifyClusterResources(newAppResponse)

// 准备资源和环境变量.
//1.获得工作目录的具体地址: /.sparkStaging/appId/
val appStagingDir = getAppStagingDir(appId)
　　//2.创建工作目录，设置工作目录权限，上传运行时所需要的jar包
val localResources = prepareLocalResources(appStagingDir)
//3.设置运行时需要的环境变量
val launchEnv = setupLaunchEnv(localResources, appStagingDir)
　　//4.设置运行时JVM参数，设置SPARK_USE_CONC_INCR_GC为true的话，就使用CMS的垃圾回收机制
val amContainer = createContainerLaunchContext(newAppResponse, localResources, launchEnv)

// 设置application submission context.
val appContext = newApp.getApplicationSubmissionContext()
appContext.setApplicationName(args.appName)
appContext.setQueue(args.amQueue)
appContext.setAMContainerSpec(amContainer)
appContext.setApplicationType("SPARK")

// 设置ApplicationMaster的内存，Resource是表示资源的类，目前有CPU和内存两种.
val memoryResource = Records.newRecord(classOf[Resource]).asInstanceOf[Resource]
memoryResource.setMemory(args.amMemory + YarnAllocationHandler.MEMORY_OVERHEAD)
appContext.setResource(memoryResource)

// 提交Application.
submitApp(appContext)
appId
}

monitorApplication就不说了，不停的调用getApplicationReport方法获得最新的Report，然后调用getYarnApplicationState获取当前状态，如果状态为FINISHED、FAILED、KILLED就退出。

说到这里，顺便把跟yarn相关的参数也贴出来一下，大家一看就清楚了。

def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
    for (r <- bySize if r.partitioner.isDefined) {
      return r.partitioner.get
    }
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(bySize.head.partitions.size)
    }
  }

ApplicationMaster

直接看run方法就可以了，main函数就干了那么一件事...

def run() {
    // 设置本地目录，默认是先使用yarn的YARN_LOCAL_DIRS目录，再到LOCAL_DIRS
    System.setProperty("spark.local.dir", getLocalDirs())

// set the web ui port to be ephemeral for yarn so we don't conflict with
// other spark processes running on the same box
System.setProperty("spark.ui.port", "0")

// when running the AM, the Spark master is always "yarn-cluster"
System.setProperty("spark.master", "yarn-cluster")

　　// 设置优先级为30，和mapreduce的优先级一样。它比HDFS的优先级高，因为它的操作是清理该作业在hdfs上面的Staging目录
ShutdownHookManager.get().addShutdownHook(new AppMasterShutdownHook(this), 30)

appAttemptId = getApplicationAttemptId()
　　// 通过yarn.resourcemanager.am.max-attempts来设置，默认是2
　　// 目前发现它只在清理Staging目录的时候用
isLastAMRetry = appAttemptId.getAttemptId() >= maxAppAttempts
amClient = AMRMClient.createAMRMClient()
amClient.init(yarnConf)
amClient.start()

// setup AmIpFilter for the SparkUI - do this before we start the UI
　　// 方法的介绍说是yarn用来保护ui界面的，我感觉是设置ip代理的
addAmIpFilter()
　　// 注册ApplicationMaster到内部的列表里
ApplicationMaster.register(this)

// 安全认证相关的东西，默认是不开启的，省得给自己找事
val securityMgr = new SecurityManager(sparkConf)

// 启动driver程序
userThread = startUserClass()

// 等待SparkContext被实例化，主要是等待spark.driver.port property被使用
　　// 等待结束之后，实例化一个YarnAllocationHandler
waitForSparkContextInitialized()

// Do this after Spark master is up and SparkContext is created so that we can register UI Url.
　　// 向yarn注册当前的ApplicationMaster, 这个时候isFinished不能为true，是true就说明程序失败了
synchronized {
if (!isFinished) {
registerApplicationMaster()
registered = true
}
}

// 申请Container来启动Executor
allocateExecutors()

// 等待程序运行结束
userThread.join()

System.exit(0)
}

run方法里面主要干了5项工作：

1、初始化工作

2、启动driver程序

3、注册ApplicationMaster

4、分配Executors

5、等待程序运行结束

我们重点看分配Executor方法。

private def allocateExecutors() {
    try {
      logInfo("Allocating " + args.numExecutors + " executors.")
      // 分host、rack、任意机器三种类型向ResourceManager提交ContainerRequest
　　　 // 请求的Container数量可能大于需要的数量
      yarnAllocator.addResourceRequests(args.numExecutors)
      // Exits the loop if the user thread exits.
      while (yarnAllocator.getNumExecutorsRunning < args.numExecutors && userThread.isAlive) {
        if (yarnAllocator.getNumExecutorsFailed >= maxNumExecutorFailures) {
          finishApplicationMaster(FinalApplicationStatus.FAILED, "max number of executor failures reached")
        }
　　　　 // 把请求回来的资源进行分配，并释放掉多余的资源
        yarnAllocator.allocateResources()
        ApplicationMaster.incrementAllocatorLoop(1)
        Thread.sleep(100)
      }
    } finally {
      // In case of exceptions, etc - ensure that count is at least ALLOCATOR_LOOP_WAIT_COUNT,
      // so that the loop in ApplicationMaster#sparkContextInitialized() breaks.
      ApplicationMaster.incrementAllocatorLoop(ApplicationMaster.ALLOCATOR_LOOP_WAIT_COUNT)
    }
    logInfo("All executors have launched.")

// 启动一个线程来状态报告
if (userThread.isAlive) {
// Ensure that progress is sent before YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS elapses.
val timeoutInterval = yarnConf.getInt(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, 120000)

// we want to be reasonably responsive without causing too many requests to RM.
val schedulerInterval = sparkConf.getLong("spark.yarn.scheduler.heartbeat.interval-ms", 5000)

// must be <= timeoutInterval / 2.
val interval = math.min(timeoutInterval / 2, schedulerInterval)

launchReporterThread(interval)
}
}

这里面我们只需要看addResourceRequests和allocateResources方法即可。

先说addResourceRequests方法，代码就不贴了。

Client向ResourceManager提交Container的请求，分三种类型：优先选择机器、同一个rack的机器、任意机器。

优先选择机器是在RDD里面的getPreferredLocations获得的机器位置，如果没有优先选择机器，也就没有同一个rack之说了，可以是任意机器。

下面我们接着看allocateResources方法。

1、把从ResourceManager中获得的Container进行选择，选择顺序是按照前面的介绍的三种类别依次进行，优先选择机器 > 同一个rack的机器 > 任意机器。

2、选择了Container之后，给每一个Container都启动一个ExecutorRunner一对一贴身服务，给它发送运行CoarseGrainedExecutorBackend的命令。

3、ExecutorRunner通过NMClient来向NodeManager发送请求。

总结：

把作业发布到yarn上面去执行这块涉及到的类不多，主要是涉及到Client、ApplicationMaster、YarnAllocationHandler、ExecutorRunner这四个类。

1、Client作为Yarn的客户端，负责向Yarn发送启动ApplicationMaster的命令。

2、ApplicationMaster就像项目经理一样负责整个项目所需要的工作，包括请求资源，分配资源，启动Driver和Executor，Executor启动失败的错误处理。

3、ApplicationMaster的请求、分配资源是通过YarnAllocationHandler来进行的。

4、Container选择的顺序是：优先选择机器 > 同一个rack的机器 > 任意机器。

5、ExecutorRunner只负责向Container发送启动CoarseGrainedExecutorBackend的命令。

6、Executor的错误处理是在ApplicationMaster的launchReporterThread方法里面，它启动的线程除了报告运行状态，还会监控Executor的运行，一旦发现有丢失的Executor就重新请求。

7、在yarn目录下看到的名称里面带有YarnClient的是属于yarn-client模式的类，实现和前面的也差不多。

其它的内容更多是Yarn的客户端api使用，我也不太会，只是看到了能懂个意思，哈哈。

7883 次浏览