Spark
Application¿ÉÒÔÖ±½ÓÔËÐÐÔÚYARN¼¯ÈºÉÏ£¬ÕâÖÖÔËÐÐģʽ£¬»á½«×ÊÔ´µÄ¹ÜÀíÓëе÷ͳһ½»¸øYARN¼¯ÈºÈ¥´¦Àí£¬ÕâÑùÄܹ»ÊµÏÖ¹¹½¨ÓÚYARN¼¯ÈºÖ®ÉÏApplicationµÄ¶àÑùÐÔ£¬±ÈÈç¿ÉÒÔÔËÐÐMapReduc³ÌÐò£¬¿ÉÒÔÔËÐÐHBase¼¯Èº£¬Ò²¿ÉÒÔÔËÐÐStorm¼¯Èº£¬»¹¿ÉÒÔÔËÐÐʹÓÃPython¿ª·¢»úÆ÷ѧϰӦÓóÌÐò£¬µÈµÈ¡£
ÎÒÃÇÖªµÀ£¬Spark on YARNÓÖ·ÖΪclientģʽºÍclusterģʽ£ºÔÚclientģʽÏ£¬Spark
ApplicationÔËÐеÄDriver»áÔÚÌá½»³ÌÐòµÄ½ÚµãÉÏ£¬¶ø¸Ã½Úµã¿ÉÄÜÊÇYARN¼¯ÈºÄÚ²¿½Úµã£¬Ò²¿ÉÄܲ»ÊÇ£¬Ò»°ãÀ´ËµÌá½»Spark
ApplicationµÄ¿Í»§¶Ë½Úµã²»ÊÇYARN¼¯ÈºÄÚ²¿µÄ½Úµã£¬ÄÇôÔÚ¿Í»§¶Ë½ÚµãÉÏ¿ÉÒÔ¸ù¾Ý×Ô¼ºµÄÐèÒª°²×°¸÷ÖÖÐèÒªµÄÈí¼þºÍ»·¾³£¬ÒÔÖ§³ÅSpark
ApplicationÕý³£ÔËÐС£ÔÚclusterģʽÏ£¬Spark ApplicationÔËÐÐʱµÄËùÓнø³Ì¶¼ÔÚYARN¼¯ÈºµÄNodeManager½ÚµãÉÏ£¬¶øÇÒ¾ßÌåÔÚÄÄЩNodeManagerÉÏÔËÐÐÊÇÓÉYARNµÄµ÷¶È²ßÂÔËù¾ö¶¨µÄ¡£
¶Ô±ÈÕâÁ½ÖÖģʽ£¬×î¹Ø¼üµÄÊÇSpark ApplicationÔËÐÐʱDriverËùÔڵĽڵ㲻ͬ£¬¶øÇÒ£¬Èç¹ûÏëÒª¶ÔDriverËùÔÚ½ÚµãµÄÔËÐл·¾³½øÐÐÅäÖã¬Çø±ðºÜ´ó£¬µ«Õâ¶ÔÓÚPySpark
ApplicationÔËÐÐÀ´ËµÊǷdz£¹Ø¼üµÄ¡£
PySparkÊÇSparkΪʹÓÃPython³ÌÐò±àдSpark Application¶øÊµÏֵĿͻ§¶Ë¿â£¬Í¨¹ýPySparkÒ²¿ÉÒÔ±àдSpark
Application²¢ÔÚSpark¼¯ÈºÉÏÔËÐС£Python¾ßÓзdz£·á¸»µÄ¿ÆÑ§¼ÆËã¡¢»úÆ÷ѧϰ´¦Àí¿â£¬Èçnumpy¡¢pandas¡¢scipyµÈµÈ¡£ÎªÁËÄܹ»³ä·ÖÀûÓÃÕâЩ¸ßЧµÄPythonÄ£¿é£¬ºÜ¶à»úÆ÷ѧϰ³ÌÐò¶¼»áʹÓÃPythonʵÏÖ£¬Í¬Ê±Ò²Ï£ÍûÄܹ»ÔÚSpark¼¯ÈºÉÏÔËÐС£
PySpark ApplicationÔËÐÐÔÀí
Àí½âPySpark ApplicationµÄÔËÐÐÔÀí£¬ÓÐÖúÓÚÎÒÃÇʹÓÃPython±àдSpark Application£¬²¢Äܹ»¶ÔPySpark
Application½øÐи÷ÖÖµ÷ÓÅ¡£PySpark¹¹½¨ÓÚSparkµÄJava APIÖ®ÉÏ£¬Êý¾ÝÔÚPython½Å±¾ÀïÃæ½øÐд¦Àí£¬¶øÔÚJVMÖлº´æºÍShuffleÊý¾Ý£¬Êý¾Ý´¦ÀíÁ÷³ÌÈçÏÂͼËùʾ£¨À´×ÔApache
Spark Wiki£©£º

Spark Application»áÔÚDriverÖд´½¨pyspark.SparkContext¶ÔÏ󣬺óÐøÍ¨¹ýpyspark.SparkContext¶ÔÏóÀ´¹¹½¨Job
DAG²¢Ìá½»DAGÔËÐС£Ê¹ÓÃPython±àдPySpark Application£¬ÔÚPython±àдµÄDriverÖÐÒ²ÓÐÒ»¸öpyspark.SparkContext¶ÔÏ󣬸Ãpyspark.SparkContext¶ÔÏó»áͨ¹ýPy4JÄ£¿éÆô¶¯Ò»¸öJVMʵÀý£¬´´½¨Ò»¸öJavaSparkContext¶ÔÏó¡£PY4JÖ»ÓÃÔÚDriverÉÏ£¬ºóÐøÔÚPython³ÌÐòÓëJavaSparkContext¶ÔÏóÖ®¼äµÄͨÐÅ£¬¶¼»áͨ¹ýPY4JÄ£¿éÀ´ÊµÏÖ£¬¶øÇÒ¶¼ÊDZ¾µØÍ¨ÐÅ¡£
PySpark ApplicationÖÐÒ²ÓÐRDD£¬¶ÔPython RDDµÄTransformation²Ù×÷£¬¶¼»á±»Ó³Éäµ½JavaÖеÄPythonRDD¶ÔÏóÉÏ¡£¶ÔÓÚÔ¶³Ì½ÚµãÉϵÄPython
RDD²Ù×÷£¬Java PythonRDD¶ÔÏó»á´´½¨Ò»¸öPython×Ó½ø³Ì£¬²¢»ùÓÚPipeµÄ·½Ê½Óë¸ÃPython×Ó½ø³ÌͨÐÅ£¬½«Óû§±àдPython´¦Àí´úÂëºÍÊý¾Ý·¢Ë͵½Python×Ó½ø³ÌÖнøÐд¦Àí¡£
ÏÂÃæ£¬ÎÒÃÇ»ùÓÚSpark on YARNģʽ£¬²¢¸ù¾Ýµ±Ç°ÆóÒµËù¾ßÓеÄʵ¼Ê¼¯ÈºÔËÐл·¾³Çé¿ö£¬À´ËµÃ÷ÈçºÎÔÚSpark¼¯ÈºÉÏÔËÐÐPySpark
Application£¬´óÖ·ÖΪÈçÏÂ3ÖÖÇé¿ö£º
YARN¼¯ÈºÅäÖÃPython»·¾³
ÕâÖÖÇé¿ö£¬Èç¹ûÊdzõʼ°²×°YARN¡¢Spark¼¯Èº£¬²¢¿¼Âǵ½Á˵±Ç°Ó¦Óó¡¾°ÐèÒªÖ§³ÖPython³ÌÐòÔËÐÐÔÚSpark¼¯ÈºÖ®ÉÏ£¬Õâʱ¿ÉÒÔ×¼±¸ºÃ¶ÔÓ¦PythonÈí¼þ°ü¡¢ÒÀÀµÄ£¿é£¬ÔÚYARN¼¯ÈºÖеÄÿ¸ö½ÚµãÉϽøÐа²×°¡£ÕâÑù£¬YARN¼¯ÈºµÄÿ¸öNodeManagerÉ϶¼¾ßÓÐPython»·¾³£¬¿ÉÒÔ±àдPySpark
Application²¢ÔÚ¼¯ÈºÉÏÔËÐС£Ä¿Ç°±È½ÏÁ÷ÐеÄÊÇÖ±½Ó°²×°PythonÐéÄâ»·¾³£¬Ê¹ÓÃAnacondaµÈÈí¼þ£¬¿ÉÒÔ¼«´óµØ¼ò»¯Python»·¾³µÄ¹ÜÀí¹¤×÷¡£
ÕâÖÖ·½Ê½µÄȱµãÊÇ£¬Èç¹ûºóÐøÊ¹ÓÃPython±àдSpark Application£¬ÐèÒªÔö¼ÓеÄÒÀÀµÄ£¿é£¬ÄÇô¾ÍÐèÒªÔÚYARN¼¯ÈºµÄÿ¸ö½ÚµãÉ϶¼½øÐиÃÐÂÔöÄ£¿éµÄ°²×°¡£¶øÇÒ£¬Èç¹ûÒÀÀµPythonµÄ°æ±¾£¬¿ÉÄÜ»¹ÐèÒª¹ÜÀí²»Í¬°æ±¾Python»·¾³¡£ÒòΪÌá½»PySpark
ApplicationÔËÐУ¬¾ßÌåÔÚÄÄЩNodeManagerÉÏÔËÐиÃApplication£¬ÊÇÓÉYARNµÄµ÷¶ÈÆ÷¾ö¶¨µÄ£¬±ØÐ뱣֤ÿ¸öNodeManagerÉ϶¼¾ßÓÐPython»·¾³£¨»ù´¡»·¾³+ÒÀÀµÄ£¿é£©¡£
YARN¼¯Èº²»ÅäÖÃPython»·¾³
ÕâÖÖÇé¿ö£¬¸üÊÊºÏÆóÒµÒѾ°²×°Á˹æÄ£½Ï´óµÄYARN¼¯Èº£¬²¢ÔÚ¿ªÊ¼Ê¹ÓÃʱ²¢Î´¿¼Âǵ½ºóÐø»áʹÓûùÓÚPythonÀ´±àдSpark
Application£¬²¢ÇÒ²»ÏëÔÚYARN¼¯ÈºµÄNodeManagerÉϰ²×°Python»·¾³ÒÀÀµÒÀÀµÄ£¿é¡£ÎÒÃDzο¼ÁËBenjamin
ZaitlenµÄ²©ÎÄ£¨Ïê¼ûºóÃæ²Î¿¼Á´½Ó£©£¬²¢»ùÓÚAnacondaÈí¼þ»·¾³½øÐÐÁËʵ¼ùºÍÑéÖ¤£¬¾ßÌåʵÏÖ˼·ÈçÏÂËùʾ£º
ÔÚÈÎÒâÒ»¸öLInux OSµÄ½ÚµãÉÏ£¬°²×°AnacondaÈí¼þ
ͨ¹ýAnaconda´´½¨ÐéÄâPython»·¾³
ÔÚ´´½¨ºÃµÄPython»·¾³ÖÐÏÂÔØ°²×°ÒÀÀµµÄPythonÄ£¿é
½«Õû¸öPython»·¾³´ò³Ézip°ü
Ìá½»PySpark Applicationʱ£¬²¢Í¨¹ý¨CarchivesÑ¡ÏîÖ¸¶¨zip°ü·¾¶
ÏÂÃæ½øÐÐÏêϸ˵Ã÷£º
Ê×ÏÈ£¬ÎÒÃÇÔÚCentOS 7.2ÉÏ£¬»ùÓÚPython 2.7£¬ÏÂÔØÁËAnaconda2-5.0.0.1-Linux-x86_64.sh°²×°Èí¼þ£¬²¢½øÐÐÁ˰²×°¡£AnacondaµÄ°²×°Â·¾¶Îª/root/anaconda2¡£
È»ºó£¬´´½¨Ò»¸öPythonÐéÄâ»·¾³£¬Ö´ÐÐÈçÏÂÃüÁ
conda create -n
mlpy_env --copy -y -q python= 2 numpy pandas scipy |
ÉÏÊöÃüÁî´´½¨ÁËÒ»¸öÃû³ÆÎªmlpy_envµÄPython»·¾³£¬¨CcopyÑ¡Ï¶ÔÓ¦µÄÈí¼þ°ü¶¼°²×°µ½¸Ã»·¾³ÖУ¬°üÀ¨Ò»Ð©CµÄ¶¯Ì¬Á´½Ó¿âÎļþ¡£Í¬Ê±£¬ÏÂÔØnumpy¡¢pandas¡¢scipyÕâÈý¸öÒÀÀµÄ£¿éµ½¸Ã»·¾³ÖС£
½Ó×Å£¬½«¸ÃPython»·¾³´ò°ü£¬Ö´ÐÐÈçÏÂÃüÁ
cd /root/anaconda2/envs
zip -r mlpy_env.zip mlpy_env
|
¸ÃzipÎļþ´ó¸ÅÓÐ400MB×óÓÒ£¬½«¸ÃzipѹËõ°ü¿½±´µ½Ö¸¶¨Ä¿Â¼ÖУ¬·½±ãºóÐøÌá½»PySpark Application£º
×îºó£¬ÎÒÃÇ¿ÉÒÔÌá½»ÎÒÃǵÄPySpark Application£¬Ö´ÐÐÈçÏÂÃüÁ
PYSPARK_PYTHON= ./ANACONDA/mlpy_env/ bin/python
spark-submit \
--conf spark.yarn.appMasterEnv .PYSPARK_PYTHON =./ANACONDA/mlpy_env/ bin/python
\
--master yarn-cluster \
--archives /tmp/ mlpy_env.zip #ANACONDA \
/var/lib/hadoop-hdfs/ pyspark/test_pyspark_dependencies.py
|
ÉÏÃæµÄtest_pyspark_dependencies.pyÎļþÖУ¬Ê¹ÓÃÁËnumpy¡¢pandas¡¢scipyÕâÈý¸öÒÀÀµ°üµÄº¯Êý£¬Í¨¹ýÉÏÃæÌáµ½µÄYARN¼¯ÈºµÄclusterģʽ¿ÉÒÔÔËÐÐÔÚSpark¼¯ÈºÉÏ¡£
¿ÉÒÔ¿´µ½£¬ÉÏÃæµÄÒÀÀµzipѹËõ°ü½«Õû¸öPythonµÄÔËÐл·¾³¶¼°üº¬ÔÚÀïÃæ£¬ÔÚÌá½»PySpark
Applicationʱ»á½«¸Ã»·¾³zip°üÉÏ´«µ½ÔËÐÐApplicationµÄËùÔÚµÄÿ¸ö½ÚµãÉÏ£¬²¢½âѹËõºóΪPython´úÂëÌṩÔËÐÐʱ»·¾³¡£Èç¹û²»Ïëÿ´Î¶¼´Ó¿Í»§¶Ë½«¸Ã»·¾³ÎļþÉÏ´«µ½¼¯ÈºÖÐÔËÐÐPySpark
ApplicationµÄ½ÚµãÉÏ£¬Ò²¿ÉÒÔ½«zip°üÉÏ´«µ½HDFSÉÏ£¬²¢Ð޸ĨCarchives²ÎÊýµÄֵΪhdfs:///tmp/mlpy_env.zip
#ANACONDA£¬Ò²ÊÇ¿ÉÒԵġ£
ÁíÍ⣬ÐèҪ˵Ã÷µÄÊÇ£¬Èç¹ûÎÒÃÇ¿ª·¢µÄ/var/lib/hadoop-hdfs/pyspark
/test_pyspark_dependencies.pyÎļþÖУ¬Ò²ÒÀÀµµÄһЩÎÒÃÇ×Ô¼ºÊµÏֵĴ¦Àíº¯Êý£¬¾ßÓжà¸öPythonÒÀÀµµÄÎļþ£¬ÏëҪͨ¹ýÉÏÃæµÄ·½Ê½ÔËÐУ¬±ØÐ뽫ÕâЩÒÀÀµµÄPythonÎļþ¿½±´µ½ÎÒÃÇ´´½¨µÄ»·¾³ÖУ¬¶ÔÓ¦µÄĿ¼Ϊmlpy_env/lib/python2.7/site-packages/ÏÂÃæ¡£
»ùÓÚ»ìºÏ±à³ÌÓïÑÔ»·¾³
¼ÙÈçÎÒÃÇ»¹ÊÇÏ£ÍûʹÓÃSpark on YARNģʽÀ´ÔËÐÐPySpark Application£¬µ«²¢²»½«Python³ÌÐòÌá½»µ½YARN¼¯ÈºÉÏÔËÐС£Õâʱ£¬ÎÒÃÇ¿ÉÒÔ¿¼ÂÇʹÓûìºÏ±à³ÌÓïÑԵķ½Ê½£¬À´´¦ÀíÊý¾ÝÈÎÎñ¡£±ÈÈ磬»úÆ÷ѧϰApplication¾ßÓеü´ú¼ÆËãµÄÌØÐÔ£¬¸üÊʺÏÔÚÒ»¸ö¸ßÅäµÄ½ÚµãÉÏÔËÐУ»¶øÆÕͨµÄETLÊý¾Ý´¦Àí¾ßÓжà»ú²¢Ðд¦ÀíµÄÌØµã£¬ÊʺϷŵ½¼¯ÈºÉϽøÐзֲ¼Ê½´¦Àí¡£
Ò»¸öÍêÕûµÄ»úÆ÷ѧϰApplicationµÄÉè¼ÆÓë¹¹½¨£¬¿ÉÒÔ½«Ëã·¨²¿·ÖºÍÊý¾Ý×¼±¸²¿·Ö·ÖÀë³öÀ´£¬Ê¹ÓÃScala/Java½øÐÐÊý¾ÝÔ¤´¦Àí£¬Êä³öÒ»¸ö»úÆ÷ѧϰËã·¨ËùÐèÒª£¨¸ü±ãÓÚµü´ú¡¢Ñ°ÓżÆË㣩µÄÊäÈëÊý¾Ý¸ñʽ£¬Õâ»á¼«´óµØÑ¹ËõËã·¨ÊäÈëÊý¾ÝµÄ¹æÄ££¬´Ó¶øÊ¹Ëã·¨µü´ú¼ÆËã³ä·ÖÀûÓõ¥»ú±¾µØµÄ×ÊÔ´£¨ÄÚ´æ¡¢CPU¡¢ÍøÂ磩£¬Õâ¿ÉÄÜ»á±ÈÖ±½Ó·Åµ½¼¯ÈºÖмÆËãÒª¿ìµÃ¶à¡£
Òò´Ë£¬ÎÒÃÇÔÚ¶Ô»úÆ÷ѧϰApplication×¼±¸Êý¾Ýʱ£¬Ê¹ÓÃÔÉúµÄScala±à³ÌÓïÑÔʵÏÖSpark
ApplicationÀ´´¦ÀíÊý¾Ý£¬°üÀ¨×ª»»¡¢Í³¼Æ¡¢Ñ¹ËõµÈµÈ£¬½«Âú×ãËã·¨ÊäÈë¸ñʽµÄÊý¾ÝÊä³öµ½HDFSÖ¸¶¨Ä¿Â¼ÖС£ÔÚÐÔÄÜ·½Ã棬¶ÔÊý¾Ý¹æÄ£½Ï´óµÄÇé¿öÏ£¬ÔÚSpark¼¯ÈºÉÏ´¦ÀíÊý¾Ý£¬Scala/JavaʵÏÖµÄSpark
ApplicationÔËÐÐÐÔÄÜÒªºÃһЩ¡£È»ºó£¬Ëã·¨µü´ú²¿·Ö£¬»ùÓڷḻ¡¢¸ßÐÔÄܵÄPython¿ÆÑ§¼ÆËãÄ£¿é£¬Ê¹ÓÃPythonÓïÑÔʵÏÖ£¬Æäʵֱ½ÓʹÓÃPySpark
APIʵÏÖÒ»¸ö»úÆ÷ѧϰPySpark Application£¬ÔËÐÐģʽΪYARN clientģʽ¡£Õâʱ£¬¾ÍÐèÒªÔÚËã·¨ÔËÐеĽڵãÉϰ²×°ºÃPython»·¾³¼°ÆäÒÀÀµÄ£¿é£¨¶ø²»ÐèÒªÔÚYARN¼¯ÈºµÄ½ÚµãÉϰ²×°£©£¬Driver³ÌÐò´ÓHDFSÖжÁÈ¡ÊäÈëÊý¾Ý£¨»º´æµ½±¾µØ£©£¬È»ºóÔÚ±¾µØ½øÐÐËã·¨µÄµü´ú¼ÆË㣬×îºóÊä³öÄ£ÐÍ¡£
×ܽá
¶ÔÓÚÖØ¶ÈʹÓÃPySparkµÄÇé¿ö£¬±ÈÈçÆ«Ïò»úÆ÷ѧϰ£¬¿ÉÒÔ¿¼ÂÇÔÚÕû¸ö¼¯ÈºÖж¼°²×°ºÃPython»·¾³£¬²¢¸ù¾Ý²»Í¬µÄÐèÒª½øÐÐÒÀÀµÄ£¿éµÄͳһ¹ÜÀí£¬Äܹ»=¼«´óµØ·½±ãPySpark
ApplicationµÄÔËÐС£
²»ÔÚYARN¼¯ÈºÉϰ²×°Python»·¾³µÄ·½°¸£¬»áʹÌá½»µÄPython»·¾³zip°üÔÚYARN¼¯ÈºÖд«Êä´øÀ´Ò»¶¨¿ªÏú£¬¶øÇÒÿ´ÎÌá½»Ò»¸öPySpark
Application¶¼ÐèÒª´ò°üÒ»¸ö»·¾³zipÎļþ£¬Èç¹ûÓдóÁ¿µÄPythonʵÏÖµÄPySpark ApplicationÐèÒªÔÚSpark¼¯ÈºÉÏÔËÐУ¬¿ªÏú»áÔ½À´Ô½´ó¡£ÁíÍ⣬Èç¹ûPySparkÓ¦ÓóÌÐòÐ޸ģ¬¿ÉÄÜÐèÒªÖØÐ´ò°ü»·¾³¡£µ«ÊÇÕâÑù×öȷʵ²»ÔÚÐèÒª¿¼ÂÇYARN¼¯Èº¼¯Èº½ÚµãÉϵÄPython»·¾³ÁË£¬Èκΰ汾Python±àдµÄPySpark
Application¶¼¿ÉÒÔʹÓü¯Èº×ÊÔ´ÔËÐС£
¹ØÓÚ¸ÃÎÊÌ⣬SPARK-13587£¨Ïê¼ûÏÂÃæ²Î¿¼Á´½Ó£©Ò²ÔÚÌÖÂÛÈç¹ûÓÅ»¯¸ÃÎÊÌ⣬ºóÐøÓ¦¸Ã»áÓÐÒ»¸ö±È½ÏºÏÊʵĽâ¾ö·½°¸¡£ |