ÕªÒª£ºSparkÊÇ·¢Ô´ÓÚÃÀ¹ú¼ÓÖÝ´óѧ²®¿ËÀû·ÖУAMPLabµÄ¼¯Èº¼ÆËãÆ½Ì¨¡£ËüÁ¢×ãÓÚÄÚ´æ¼ÆË㣬´Ó¶àµü´úÅúÁ¿´¦Àí³ö·¢£¬¼æÊÕ²¢ÐîÊý¾Ý²Ö¿â¡¢Á÷´¦ÀíºÍͼ¼ÆËãµÈ¶àÖÖ¼ÆË㷶ʽ£¬ÊǺ±¼ûµÄÈ«ÄÜÑ¡ÊÖ¡£
SparkÒÑÕýʽÉêÇë¼ÓÈëApache·õ»¯Æ÷£¬´ÓÁé»úÒ»ÉÁµÄʵÑéÊÒ¡°µç»ð»¨¡±³É³¤Îª´óÊý¾Ý¼¼Êõƽ̨ÖÐÒì¾üÍ»ÆðµÄÐÂÈñ¡£±¾ÎÄÖ÷Òª½²ÊöSparkµÄÉè¼ÆË¼Ïë¡£SparkÈçÆäÃû£¬Õ¹ÏÖÁË´óÊý¾Ý²»³£¼ûµÄ¡°µç¹âʯ»ð¡±¡£¾ßÌåÌØµã¸ÅÀ¨Îª¡°Çá¡¢¿ì¡¢ÁéºÍÇÉ¡±¡£
Ç᣺Spark
0.6ºËÐÄ´úÂëÓÐ2ÍòÐУ¬Hadoop 1.0Ϊ9ÍòÐУ¬2.0Ϊ22ÍòÐС£Ò»·½Ã棬¸ÐлScalaÓïÑԵļò½àºÍ·á¸»±í´ïÁ¦£»ÁíÒ»·½Ã棬SparkºÜºÃµØÀûÓÃÁËHadoopºÍMesos£¨²®¿ËÀû
ÁíÒ»¸ö½øÈë·õ»¯Æ÷µÄÏîÄ¿£¬Ö÷¹¥¼¯ÈºµÄ¶¯Ì¬×ÊÔ´¹ÜÀí£©µÄ»ù´¡ÉèÊ©¡£ËäÈ»ºÜÇᣬµ«ÔÚÈÝ´íÉè¼ÆÉϲ»´òÕÛ¿Û¡£Ö÷´´ÈËMateiÉù³Æ£º¡°²»°Ñ´íÎóµ±ÌØÀý´¦Àí¡£¡±ÑÔÏÂ
Ö®Ò⣬ÈÝ´íÊÇ»ù´¡ÉèÊ©µÄÒ»²¿·Ö¡£
¿ì£ºSpark¶ÔСÊý¾Ý¼¯ÄÜ´ïµ½ÑÇÃë¼¶µÄÑÓ³Ù£¬Õâ¶ÔÓÚHadoop
MapReduce£¨ÒÔϼò³ÆMapReduce£©ÊÇÎÞ·¨ÏëÏóµÄ£¨ÓÉÓÚ¡°ÐÄÌø¡±¼ä¸ô»úÖÆ£¬½öÈÎÎñÆô¶¯¾ÍÓÐÊýÃëµÄÑÓ³Ù£©¡£¾Í´óÊý¾Ý¼¯¶øÑÔ£¬¶ÔµäÐ͵ĵü´ú»úÆ÷
ѧϰ¡¢¼´Ï¯²éѯ£¨ad-hoc query£©¡¢Í¼¼ÆËãµÈÓ¦Óã¬Spark°æ±¾±È»ùÓÚMapReduce¡¢HiveºÍPregelµÄʵÏÖ¿ìÉÏÊ®±¶µ½°Ù±¶¡£ÆäÖÐÄÚ´æ¼ÆËã¡¢Êý¾Ý±¾µØÐÔ
£¨locality£©ºÍ´«ÊäÓÅ»¯¡¢µ÷¶ÈÓÅ»¯µÈ¸Ã¾ÓÊ×¹¦£¬Ò²ÓëÉè¼ÆÒÁʼ¼´±ü³ÖµÄÇáÁ¿ÀíÄî²»ÎÞ¹ØÏµ¡£
Á飺SparkÌṩÁ˲»Í¬²ãÃæµÄÁé»îÐÔ¡£ÔÚʵÏֲ㣬ËüÍêÃÀÑÝÒïÁËScala
trait¶¯Ì¬»ìÈ루mixin£©²ßÂÔ£¨Èç¿É¸ü»»µÄ¼¯Èºµ÷¶ÈÆ÷¡¢ÐòÁл¯¿â£©£»ÔÚÔÓPrimitive£©²ã£¬ËüÔÊÐíÀ©Õ¹ÐµÄÊý¾ÝËã×Ó
£¨operator£©¡¢ÐµÄÊý¾ÝÔ´£¨ÈçHDFSÖ®ÍâÖ§³ÖDynamoDB£©¡¢ÐµÄlanguage bindings£¨JavaºÍPython£©£»ÔÚ·¶Ê½£¨Paradigm£©²ã£¬SparkÖ§³ÖÄÚ´æ¼ÆËã¡¢¶àµü´úÅúÁ¿´¦Àí¡¢¼´Ï¯²éѯ¡¢Á÷´¦ÀíºÍͼ¼ÆËãµÈ¶àÖÖ
·¶Ê½¡£
ÇÉ£ºÇÉÔÚ½èÊÆºÍ½èÁ¦¡£Spark½èHadoopÖ®ÊÆ£¬ÓëHadoopÎÞ·ì½áºÏ£»½Ó×ÅShark£¨SparkÉϵÄÊý¾Ý²Ö¿âʵÏÖ£©½èÁËHiveµÄÊÆ£»Í¼¼ÆËã½è
ÓÃPregelºÍPowerGraphµÄAPIÒÔ¼°PowerGraphµÄµã·Ö¸î˼Ïë¡£Ò»ÇеÄÒ»ÇУ¬¶¼½èÖúÁËScala£¨±»¹ã·ºÓþΪJavaµÄδÀ´È¡´ú
Õߣ©Ö®ÊÆ£ºSpark±à³ÌµÄLook'n'Feel¾ÍÊÇÔÖÔζµÄScala£¬ÎÞÂÛÊÇÓï·¨»¹ÊÇAPI¡£ÔÚʵÏÖÉÏ£¬ÓÖÄÜÁéÇɽèÁ¦¡£ÎªÖ§³Ö½»»¥Ê½±à
³Ì£¬SparkÖ»Ðè¶ÔScalaµÄShellС×öÐ޸ģ¨Ïà±È֮ϣ¬Î¢ÈíΪ֧³ÖJavaScript Console¶ÔMapReduce½»»¥Ê½±à³Ì£¬²»½öÒª¿çÔ½JavaºÍJavaScriptµÄ˼άÆÁÕÏ£¬ÔÚʵÏÖÉÏ»¹Òª´ó¶¯¸É¸ê£©¡£
˵ÁËÒ»´ó¶ÑºÃ´¦£¬»¹ÊÇÒªÖ¸³öSparkδÕéÍêÃÀ¡£ËüÓÐÏÈÌìµÄÏÞÖÆ£¬²»ÄܺܺõØÖ§³ÖϸÁ£¶È¡¢Òì²½µÄÊý¾Ý´¦Àí£»Ò²ÓкóÌìµÄÔÒò£¬¼´Ê¹ÓкܰôµÄ»ùÒò£¬±Ï¾¹»¹¸Õ¸ÕÆð²½£¬ÔÚÐÔÄÜ¡¢Îȶ¨ÐԺͷ¶Ê½µÄ¿ÉÀ©Õ¹ÐÔÉÏ»¹ÓкܴóµÄ¿Õ¼ä¡£
¼ÆË㷶ʽºÍ³éÏó
SparkÊ×ÏÈÊÇÒ»ÖÖ´ÖÁ£¶ÈÊý¾Ý²¢ÐУ¨data parallel£©µÄ¼ÆË㷶ʽ¡£
Êý¾Ý²¢ÐиúÈÎÎñ²¢ÐУ¨task parallel£©µÄÇø±ðÌåÏÖÔÚÒÔÏÂÁ½·½Ãæ¡£
¼ÆËãµÄÖ÷ÌåÊÇÊý¾Ý¼¯ºÏ£¬¶ø·Ç¸ö±ðÊý¾Ý¡£¼¯ºÏµÄ³¤¶ÈÊÓʵÏÖ¶ø¶¨£¬ÈçSIMD£¨µ¥Ö¸Áî¶àÊý¾Ý£©ÏòÁ¿Ö¸ÁîÒ»°ãÊÇ4µ½64£¬GPUµÄSIMT£¨µ¥Ö¸Áî¶àỊ̈߳©Ò»°ã
ÊÇ32£¬SPMD£¨µ¥³ÌÐò¶àÊý¾Ý£©¿ÉÒÔ¸ü¿í¡£Spark´¦ÀíµÄÊÇ´óÊý¾Ý£¬Òò´Ë²ÉÓÃÁËÁ£¶ÈºÜ´ÖµÄ¼¯ºÏ£¬½Ð×öResilient
Distributed Datasets£¨RDD£©¡£
¼¯ºÏÄÚµÄËùÓÐÊý¾Ý¶¼¾¹ýͬÑùµÄËã×ÓÐòÁС£Êý¾Ý²¢Ðпɱà³ÌÐԺã¬Ò×ÓÚ»ñµÃ¸ß²¢ÐÐÐÔ£¨ÓëÊý¾Ý¹æÄ£Ïà¹Ø£¬¶ø·ÇÓë³ÌÐòÂß¼µÄ²¢ÐÐÐÔÏà¹Ø£©£¬Ò²Ò×ÓÚ¸ßЧµØÓ³Éäµ½µ×²ã
µÄ²¢Ðлò·Ö²¼Ê½Ó²¼þÉÏ¡£´«Í³µÄarray/vector±à³ÌÓïÑÔ¡¢SSE/AVX intrinsics¡¢CUDA/OpenCL¡¢Ct£¨C++
for throughput£©£¬¶¼ÊôÓÚ´ËÀà¡£²»Í¬µãÔÚÓÚ£¬SparkµÄÊÓÒ°ÊÇÕû¸ö¼¯Èº£¬¶ø·Çµ¥¸ö½Úµã»ò²¢Ðд¦ÀíÆ÷¡£
Êý¾Ý²¢Ðеķ¶Ê½¾ö¶¨ÁË SparkÎÞ·¨ÍêÃÀÖ§³ÖϸÁ£¶È¡¢Òì²½¸üеIJÙ×÷¡£Í¼¼ÆËã¾ÍÓдËÀà²Ù×÷£¬ËùÒÔ´ËʱSpark²»ÈçGraphLab£¨Ò»¸ö´ó¹æÄ£Í¼¼ÆËã¿ò¼Ü£©£»»¹ÓÐһЩӦÓã¬
ÐèҪϸÁ£¶ÈµÄÈÕÖ¾¸üкÍÊý¾Ý¼ì²éµã£¬ËüÒ²²»ÈçRAMCloud£¨Ë¹Ì¹¸£µÄÄÚ´æ´æ´¢ºÍ¼ÆËãÑо¿ÏîÄ¿£©ºÍPercolator£¨GoogleÔöÁ¿¼ÆËã¼¼Êõ£©¡£
·´¹ýÀ´£¬ÕâҲʹSparkÄܹ»¾«ÐĸûÔÅËüÉó¤µÄÓ¦ÓÃÁìÓò£¬ÊÔͼ´Öϸͨ³ÔµÄDryad£¨Î¢ÈíÔçÆÚµÄ´óÊý¾Ýƽ̨£©·´¶ø²»Éõ³É¹¦¡£
SparkµÄRDD£¬²ÉÓÃÁËScala¼¯ºÏÀàÐ͵ıà³Ì·ç¸ñ¡£ËüͬÑù²ÉÓÃÁ˺¯ÊýʽÓïÒ壨functional
semantics£©£ºÒ»ÊDZհü£¬¶þÊÇRDDµÄ²»¿ÉÐÞ¸ÄÐÔ¡£Âß¼ÉÏ£¬Ã¿Ò»¸öRDDËã×Ó¶¼Éú³ÉеÄRDD£¬Ã»Óи±×÷Óã¬ËùÒÔËã×ÓÓÖ±»³ÆÎªÊÇÈ·¶¨ÐԵģ»ÓÉÓÚËù
ÓÐËã×Ó¶¼ÊÇÃݵȵ쬳öÏÖ´íÎóʱֻÐè°ÑËã×ÓÐòÁÐÖØÐÂÖ´Ðм´¿É¡£
SparkµÄ¼ÆËã³éÏóÊÇÊý¾ÝÁ÷£¬¶øÇÒÊÇ´øÓй¤×÷¼¯£¨working set£©µÄÊý¾ÝÁ÷¡£Á÷´¦ÀíÊÇÒ»ÖÖÊý¾ÝÁ÷Ä£ÐÍ£¬MapReduceÒ²ÊÇ£¬Çø±ðÔÚÓÚMapReduceÐèÒªÔÚ¶à´Îµü´úÖÐά»¤¹¤×÷¼¯¡£¹¤×÷¼¯µÄ³éÏóºÜÆÕ±é£¬Èç¶à
µü´ú»úÆ÷ѧϰ¡¢½»»¥Ê½Êý¾ÝÍÚ¾òºÍͼ¼ÆË㡣Ϊ±£Ö¤ÈÝ´í£¬MapReduce²ÉÓÃÁËÎȶ¨´æ´¢£¨ÈçHDFS£©À´³ÐÔØ¹¤×÷¼¯£¬´ú¼ÛÊÇËÙ¶ÈÂý¡£HaLoop²ÉÓÃÑ»·
Ãô¸ÐµÄµ÷¶ÈÆ÷£¬±£Ö¤Ç°´Îµü´úµÄReduceÊä³öºÍ±¾´Îµü´úµÄMapÊäÈëÊý¾Ý¼¯ÔÚͬһ̨ÎïÀí»úÉÏ£¬ÕâÑù¿ÉÒÔ¼õÉÙÍøÂ翪Ïú£¬µ«ÎÞ·¨±ÜÃâ´ÅÅÌI/OµÄÆ¿¾±¡£
SparkµÄÍ»ÆÆÔÚÓÚ£¬ÔÚ±£Ö¤ÈÝ´íµÄǰÌáÏ£¬ÓÃÄÚ´æÀ´³ÐÔØ¹¤×÷¼¯¡£ÄÚ´æµÄ´æÈ¡ËÙ¶È¿ìÓÚ´ÅÅ̶à¸öÊýÁ¿¼¶£¬´Ó¶ø¿ÉÒÔ¼«´óÌáÉýÐÔÄÜ¡£¹Ø¼üÊÇʵÏÖÈÝ´í£¬´«Í³ÉÏÓÐÁ½ÖÖ·½·¨£ºÈÕ
Ö¾ºÍ¼ì²éµã¡£¿¼Âǵ½¼ì²éµãÓÐÊý¾ÝÈßÓàºÍÍøÂçͨÐŵĿªÏú£¬Spark²ÉÓÃÈÕÖ¾Êý¾Ý¸üС£Ï¸Á£¶ÈµÄÈÕÖ¾¸üв¢²»±ãÒË£¬¶øÇÒÇ°Ãæ½²¹ý£¬SparkÒ²²»É󤡣
Spark¼Ç¼µÄÊÇ´ÖÁ£¶ÈµÄRDD¸üУ¬ÕâÑù¿ªÏú¿ÉÒÔºöÂÔ²»¼Æ¡£¼øÓÚSparkµÄº¯ÊýʽÓïÒåºÍÃݵÈÌØÐÔ£¬Í¨¹ýÖØ·ÅÈÕÖ¾¸üÐÂÀ´ÈÝ´í£¬Ò²²»»áÓи±×÷Óá£
±à³ÌÄ£ÐÍ
À´¿´Ò»¶Î´úÂ룺textFileËã×Ó´ÓHDFS¶ÁÈ¡ÈÕÖ¾Îļþ£¬·µ»Ø¡°file¡±£¨RDD£©£»filterËã×Óɸ³ö´ø¡°ERROR¡±µÄÐУ¬¸³¸ø
¡°errors¡±£¨ÐÂRDD£©£»cacheËã×Ó°ÑËü»º´æÏÂÀ´ÒÔ±¸Î´À´Ê¹Óã»countËã×Ó·µ»Ø¡°errors¡±µÄÐÐÊý¡£RDD¿´ÆðÀ´ÓëScala¼¯ºÏÀàÐÍ
ûÓÐÌ«´ó²î±ð£¬µ«ËüÃǵÄÊý¾ÝºÍÔËÐÐÄ£ÐÍ´óÏàåÄÒì¡£

ͼ1¸ø³öÁËRDDÊý¾ÝÄ£ÐÍ£¬²¢½«ÉÏÀýÖÐÓõ½µÄËĸöËã×ÓÓ³Éäµ½ËÄÖÖËã×ÓÀàÐÍ¡£Spark³ÌÐò¹¤×÷ÔÚÁ½¸ö¿Õ¼äÖУºSpark
RDD¿Õ¼äºÍScalaÔÉúÊý¾Ý¿Õ¼ä¡£ÔÚÔÉúÊý¾Ý¿Õ¼äÀÊý¾Ý±íÏÖΪ±êÁ¿£¨scalar£¬¼´Scala»ù±¾ÀàÐÍ£¬ÓÃéÙɫС·½¿é±íʾ£©¡¢¼¯ºÏÀàÐÍ£¨À¶É«ÐéÏß
¿ò£©ºÍ³Ö¾Ã´æ´¢£¨ºìɫԲÖù£©¡£
ͼ1 Á½¸ö¿Õ¼äµÄÇл»£¬ËÄÀ಻ͬµÄRDDËã×Ó
ÊäÈëËã×Ó£¨éÙÉ«¼ýÍ·£©½«Scala¼¯ºÏÀàÐÍ»ò´æ´¢ÖеÄÊý¾ÝÎüÈëRDD¿Õ¼ä£¬×ªÎªRDD£¨À¶É«ÊµÏß¿ò£©¡£ÊäÈëËã×ÓµÄÊäÈë´óÖÂÓÐÁ½ÀࣺһÀàÕë¶ÔScala¼¯ºÏÀàÐÍ£¬Èçparallelize£»ÁíÒ»ÀàÕë¶Ô´æ´¢Êý¾Ý£¬ÈçÉÏÀýÖеÄtextFile¡£ÊäÈëËã×ÓµÄÊä³ö¾ÍÊÇSpark¿Õ¼äµÄRDD¡£
ÒòΪº¯ÊýÓïÒ壬RDD¾¹ý±ä»»£¨transformation£©Ëã×Ó£¨À¶É«¼ýÍ·£©Éú³ÉеÄRDD¡£±ä»»Ëã×ÓµÄÊäÈëºÍÊä³ö¶¼ÊÇRDD¡£RDD»á±»»®·Ö³ÉºÜ¶àµÄ·ÖÇø
£¨partition£©·Ö²¼µ½¼¯ÈºµÄ¶à¸ö½ÚµãÖУ¬Í¼1ÓÃÀ¶É«Ð¡·½¿é´ú±í·ÖÇø¡£×¢Ò⣬·ÖÇøÊǸöÂß¼¸ÅÄ±ä»»Ç°ºóµÄоɷÖÇøÔÚÎïÀíÉÏ¿ÉÄÜÊÇͬһ¿éÄÚ´æ»ò´æ
´¢¡£ÕâÊǺÜÖØÒªµÄÓÅ»¯£¬ÒÔ·ÀÖ¹º¯Êýʽ²»±äÐÔµ¼ÖµÄÄÚ´æÐèÇóÎÞÏÞÀ©ÕÅ¡£ÓÐЩRDDÊǼÆËãµÄÖмä½á¹û£¬Æä·ÖÇø²¢²»Ò»¶¨ÓÐÏàÓ¦µÄÄÚ´æ»ò´æ´¢ÓëÖ®¶ÔÓ¦£¬Èç¹ûÐèÒª
£¨ÈçÒÔ±¸Î´À´Ê¹Óã©£¬¿ÉÒÔµ÷Óûº´æËã×Ó£¨Àý×ÓÖеÄcacheËã×Ó£¬»ÒÉ«¼ýÍ·±íʾ£©½«·ÖÇøÎﻯ£¨materialize£©´æÏÂÀ´£¨»ÒÉ«·½¿é£©¡£
Ò»²¿·Ö±ä»»Ëã×ÓÊÓRDDµÄÔªËØÎª¼òµ¥ÔªËØ£¬·ÖΪÈçϼ¸Àࣺ
1.ÊäÈëÊä³öÒ»¶ÔÒ»£¨element-wise£©µÄËã×Ó£¬ÇÒ½á¹ûRDDµÄ·ÖÇø½á¹¹²»±ä£¬Ö÷ÒªÊÇmap¡¢flatMap£¨mapºóչƽΪһάRDD£©£»
2.ÊäÈëÊä³öÒ»¶ÔÒ»£¬µ«½á¹ûRDDµÄ·ÖÇø½á¹¹·¢ÉúÁ˱仯£¬Èçunion£¨Á½¸öRDDºÏΪһ¸ö£©¡¢coalesce£¨·ÖÇø¼õÉÙ£©£»
3.´ÓÊäÈëÖÐÑ¡Ôñ²¿·ÖÔªËØµÄËã×Ó£¬Èçfilter¡¢distinct£¨È¥³ýÈßÓàÔªËØ£©¡¢subtract£¨±¾RDDÓС¢ËüRDDÎÞµÄÔªËØÁôÏÂÀ´£©ºÍsample£¨²ÉÑù£©¡£
ÁíÒ»²¿·Ö±ä»»Ëã×ÓÕë¶ÔKey-Value¼¯ºÏ£¬ÓÖ·ÖΪ£º
1.¶Ôµ¥¸öRDD×öelement-wiseÔËË㣬ÈçmapValues£¨±£³ÖÔ´RDDµÄ·ÖÇø·½Ê½£¬ÕâÓëmap²»Í¬£©£»
2.¶Ôµ¥¸öRDDÖØÅÅ£¬Èçsort¡¢partitionBy£¨ÊµÏÖÒ»ÖÂÐԵķÖÇø»®·Ö£¬Õâ¸ö¶ÔÊý¾Ý±¾µØÐÔÓÅ»¯ºÜÖØÒª£¬ºóÃæ»á½²£©£»
3.¶Ôµ¥¸öRDD»ùÓÚkey½øÐÐÖØ×éºÍreduce£¬ÈçgroupByKey¡¢reduceByKey£»
4.¶ÔÁ½¸öRDD»ùÓÚkey½øÐÐjoinºÍÖØ×飬Èçjoin¡¢cogroup¡£
ºóÈýÀà²Ù×÷¶¼Éæ¼°ÖØÅÅ£¬³ÆÎªshuffleÀà²Ù×÷¡£
´ÓRDDµ½RDDµÄ±ä»»Ëã×ÓÐòÁУ¬Ò»Ö±ÔÚRDD¿Õ¼ä·¢Éú¡£ÕâÀïºÜÖØÒªµÄÉè¼ÆÊÇlazy evaluation£º¼ÆËã²¢²»Êµ¼Ê·¢Éú£¬Ö»ÊDz»¶ÏµØ¼Ç¼µ½ÔªÊý¾Ý¡£ÔªÊý¾ÝµÄ½á¹¹ÊÇDAG£¨ÓÐÏòÎÞ»·Í¼£©£¬ÆäÖÐÿһ¸ö¡°¶¥µã¡±ÊÇRDD£¨°üÀ¨Éú²ú¸ÃRDD
µÄËã×Ó£©£¬´Ó¸¸RDDµ½×ÓRDDÓС°±ß¡±£¬±íʾRDD¼äµÄÒÀÀµÐÔ¡£Spark¸øÔªÊý¾ÝDAGÈ¡Á˸öºÜ¿áµÄÃû×Ö£¬Lineage£¨ÊÀϵ£©¡£Õâ¸ö
LineageÒ²ÊÇÇ°ÃæÈÝ´íÉè¼ÆÖÐËù˵µÄÈÕÖ¾¸üС£
LineageÒ»Ö±Ôö³¤£¬Ö±µ½ÓöÉÏÐж¯£¨action£©Ëã×Ó£¨Í¼1ÖеÄÂÌÉ«¼ýÍ·£©£¬Õâʱ ¾ÍÒªevaluateÁË£¬°Ñ¸Õ²ÅÀÛ»ýµÄËùÓÐËã×ÓÒ»´ÎÐÔÖ´ÐС£Ðж¯Ëã×ÓµÄÊäÈëÊÇRDD£¨ÒÔ¼°¸ÃRDDÔÚLineageÉÏÒÀÀµµÄËùÓÐRDD£©£¬Êä³öÊÇÖ´ÐкóÉú
³ÉµÄÔÉúÊý¾Ý£¬¿ÉÄÜÊÇScala±êÁ¿¡¢¼¯ºÏÀàÐ͵ÄÊý¾Ý»ò´æ´¢¡£µ±Ò»¸öËã×ÓµÄÊä³öÊÇÉÏÊöÀàÐÍʱ£¬¸ÃËã×Ó±ØÈ»ÊÇÐж¯Ëã×Ó£¬ÆäЧ¹ûÔòÊÇ´ÓRDD¿Õ¼ä·µ»ØÔÉúÊý¾Ý
¿Õ¼ä¡£
Ðж¯Ëã×ÓÓÐÈçϼ¸ÀࣺÉú³É±êÁ¿£¬Èçcount£¨·µ»ØRDDÖÐÔªËØµÄ¸öÊý£©¡¢reduce¡¢fold/aggregate£¨¼û
ScalaͬÃûËã×ÓÎĵµ£©£»·µ»Ø¼¸¸ö±êÁ¿£¬Èçtake£¨·µ»ØÇ°¼¸¸öÔªËØ£©£»Éú³ÉScala¼¯ºÏÀàÐÍ£¬Èçcollect£¨°ÑRDDÖеÄËùÓÐÔªËØµ¹Èë
Scala¼¯ºÏÀàÐÍ£©¡¢lookup£¨²éÕÒ¶ÔÓ¦keyµÄËùÓÐÖµ£©£»Ð´Èë´æ´¢£¬ÈçÓëǰÎÄtextFile¶ÔÓ¦µÄsaveAsText-File¡£»¹ÓÐÒ»¸ö¼ì
²éµãËã×Ócheckpoint¡£µ±LineageÌØ±ð³¤Ê±£¨ÕâÔÚͼ¼ÆËãÖÐʱ³£·¢Éú£©£¬³ö´íÊ±ÖØÐÂÖ´ÐÐÕû¸öÐòÁÐÒªºÜ³¤Ê±¼ä£¬¿ÉÒÔÖ÷¶¯µ÷ÓÃ
checkpoint°Ñµ±Ç°Êý¾ÝдÈëÎȶ¨´æ´¢£¬×÷Ϊ¼ì²éµã¡£
ÕâÀïÓÐÁ½¸öÉè¼ÆÒªµã¡£Ê×ÏÈÊÇlazy evaluation¡£ÊìϤ±àÒëµÄ¶¼ÖªµÀ£¬±àÒëÆ÷ÄÜ¿´µ½µÄscopeÔ½´ó£¬ÓÅ»¯µÄ»ú»á¾ÍÔ½¶à¡£SparkËäȻûÓбàÒ룬µ«µ÷¶ÈÆ÷ʵ¼ÊÉ϶ÔDAG×öÁËÏßÐÔ¸´
ÔӶȵÄÓÅ»¯¡£ÓÈÆäÊǵ±SparkÉÏÃæÓжàÖÖ¼ÆË㷶ʽ»ìºÏʱ£¬µ÷¶ÈÆ÷¿ÉÒÔ´òÆÆ²»Í¬·¶Ê½´úÂëµÄ±ß½ç½øÐÐÈ«¾Öµ÷¶ÈºÍÓÅ»¯¡£ÏÂÃæµÄÀý×ÓÖаÑSharkµÄSQL´úÂë
ºÍSparkµÄ»úÆ÷ѧϰ´úÂë»ìÔÚÁËÒ»Æð¡£¸÷²¿·Ö´úÂë·Òëµ½µ×²ãRDDºó£¬ÈںϳÉÒ»¸ö´óµÄDAG£¬ÕâÑù¿ÉÒÔ»ñµÃ¸ü¶àµÄÈ«¾ÖÓÅ»¯»ú»á¡£

ÁíÒ»¸öÒªµãÊÇÒ»µ©Ðж¯Ëã×Ó²úÉúÔÉúÊý¾Ý£¬¾Í±ØÐëÍ˳öRDD¿Õ¼ä¡£ÒòΪĿǰSparkÖ»Äܹ»¸ú×ÙRDDµÄ¼ÆË㣬ÔÉúÊý¾ÝµÄ¼ÆËã¶ÔËüÀ´ËµÊDz»¿É¼ûµÄ£¨³ý·ÇÒÔºó
Spark»áÌṩÔÉúÊý¾ÝÀàÐͲÙ×÷µÄÖØÔØ¡¢wrapper»òimplicit conversion£©¡£Õⲿ·Ö²»¿É¼ûµÄ´úÂë¿ÉÄÜÒýÈëǰºóRDDÖ®¼äµÄÒÀÀµ£¬ÈçÏÂÃæµÄ´úÂ룺

µÚÈýÐÐfilter¶Ôerrors.count()µÄÒÀÀµÊÇÓÉ(cnt-1)Õâ¸öÔÉúÊý¾ÝÔËËã²úÉúµÄ£¬µ«µ÷¶ÈÆ÷¿´²»µ½Õâ¸öÔËË㣬ÄǾͻá³öÎÊÌâÁË¡£
ÓÉÓÚSpark²¢²»Ìṩ¿ØÖÆÁ÷£¬ÔÚ¼ÆËãÂß¼ÐèÒªÌõ¼þ·Ö֧ʱ£¬Ò²±ØÐë»ØÍ˵½ScalaµÄ¿Õ¼ä¡£ÓÉÓÚScalaÓïÑÔ¶Ô×Ô¶¨Òå¿ØÖÆÁ÷µÄÖ§³ÖºÜÇ¿£¬²»ÅųýδÀ´SparkÒ²»áÖ§³Ö¡£
Spark »¹ÓÐÁ½¸öºÜʵÓõŦÄÜ¡£Ò»¸öÊǹ㲥£¨broadcast£©±äÁ¿¡£ÓÐЩÊý¾Ý£¬Èçlookup±í£¬¿ÉÄÜ»áÔÚ¶à¸ö×÷Òµ¼ä·´¸´Óõ½£»ÕâЩÊý¾Ý±ÈRDDҪСµÃ¶à£¬²»
ÒËÏñRDDÄÇÑùÔÚ½ÚµãÖ®¼ä»®·Ö¡£½â¾öÖ®µÀÊÇÌṩһ¸öеÄÓïÑԽṹ¡ª¡ª¹ã²¥±äÁ¿£¬À´ÐÞÊδËÀàÊý¾Ý¡£SparkÔËÐÐʱ°Ñ¹ã²¥±äÁ¿ÐÞÊεÄÄÚÈÝ·¢µ½¸÷¸ö½Úµã£¬²¢±£
´æÏÂÀ´£¬Î´À´ÔÙÓÃʱÎÞÐèÔÙËÍ¡£Ïà±ÈHadoopµÄdistributed cache£¬¹ã²¥ÄÚÈÝ¿ÉÒÔ¿ç×÷Òµ¹²Ïí¡£SparkÌá½»ÕßMosharafʦ´ÓP2PµÄÀÏ·¨Ê¦Ion
Stoica£¬²ÉÓÃÁËBitTorrent£¨Ã»´í£¬¾ÍÊÇÏÂÔØµçÓ°µÄÄǸöBT£©µÄ¼ò»¯ÊµÏÖ¡£ÓÐÐËȤµÄ¶ÁÕß¿ÉÒԲο¼SIGCOMM'11µÄÂÛÎÄ
Orchestra¡£ÁíÒ»¸ö¹¦ÄÜÊÇAccumulator£¨Ô´ÓÚMapReduceµÄcounter£©£ºÔÊÐíSpark´úÂëÖмÓÈëһЩȫ¾Ö±äÁ¿×ö
bookkeeping£¬Èç¼Ç¼µ±Ç°µÄÔËÐÐÖ¸±ê¡£
ÔËÐк͵÷¶È
ͼ2ÏÔʾÁËSpark³ÌÐòµÄÔËÐг¡¾°¡£ËüÓɿͻ§¶ËÆô¶¯£¬·ÖÁ½¸ö½×¶Î£ºµÚÒ»½×¶Î¼Ç¼±ä»»Ëã×ÓÐòÁС¢ÔöÁ¿¹¹½¨DAGͼ£»µÚ¶þ½×¶ÎÓÉÐж¯Ëã×Ó´¥
·¢£¬DAGScheduler°ÑDAGͼת»¯Îª×÷Òµ¼°ÆäÈÎÎñ¼¯¡£SparkÖ§³Ö±¾µØµ¥½ÚµãÔËÐУ¨¿ª·¢µ÷ÊÔÓÐÓã©»ò¼¯ÈºÔËÐС£¶ÔÓÚºóÕߣ¬¿Í»§¶ËÔËÐÐÓÚ
master½ÚµãÉÏ£¬Í¨¹ýCluster manager°Ñ»®·ÖºÃ·ÖÇøµÄÈÎÎñ¼¯·¢Ë͵½¼¯ÈºµÄworker/slave½ÚµãÉÏÖ´ÐС£
ͼ2 Spark³ÌÐòÔËÐйý³Ì
Spark ´«Í³ÉÏÓëMesos¡°½¹²»ÀëÃÏ¡±£¬Ò²¿ÉÖ§³ÖAmazon EC2ºÍYARN¡£µ×²ãÈÎÎñµ÷¶ÈÆ÷µÄ»ùÀàÊǸötrait£¬ËüµÄ²»Í¬ÊµÏÖ¿ÉÒÔ»ìÈëʵ¼ÊµÄÖ´ÐС£ÀýÈ磬ÔÚMesosÉÏÓÐÁ½ÖÖµ÷¶ÈÆ÷ʵÏÖ£¬Ò»ÖÖ°Ñÿ¸ö½ÚµãµÄËùÓÐ
×ÊÔ´·Ö¸øSpark£¬ÁíÒ»ÖÖÔÊÐíSpark×÷ÒµÓëÆäËû×÷ÒµÒ»Æðµ÷¶È¡¢¹²Ïí¼¯Èº×ÊÔ´¡£worker½ÚµãÉÏÓÐÈÎÎñỊ̈߳¨task
thread£©ÕæÕýÔËÐÐDAGSchedulerÉú³ÉµÄÈÎÎñ£»»¹Óпé¹ÜÀíÆ÷£¨block manager£©¸ºÔðÓëmasterÉϵÄblock
manager masterͨÐÅ£¨ÍêÃÀʹÓÃÁËScalaµÄActorģʽ£©£¬ÎªÈÎÎñÏß³ÌÌṩÊý¾Ý¿é¡£
×îÓÐȤµÄ²¿·ÖÊÇDAGScheduler¡£ÏÂÃæÏê½âËüµÄ¹¤×÷¹ý³Ì¡£RDDµÄÊý¾Ý½á¹¹ÀïºÜÖØÒªµÄÒ»¸öÓòÊǶԸ¸RDDµÄÒÀÀµ¡£Èçͼ3Ëùʾ£¬ÓÐÁ½ÀàÒÀÀµ£ºÕ£¨Narrow£©ÒÀÀµºÍ¿í£¨Wide£©ÒÀÀµ¡£

ͼ3 ÕÒÀÀµºÍ¿íÒÀÀµ
ÕÒÀÀµÖ¸¸¸RDDµÄÿһ¸ö·ÖÇø×î¶à±»Ò»¸ö×ÓRDDµÄ·ÖÇøËùÓ㬱íÏÖΪһ¸ö¸¸RDDµÄ·ÖÇø¶ÔÓ¦ÓÚÒ»¸ö×ÓRDDµÄ·ÖÇø£¬ºÍÁ½¸ö¸¸RDDµÄ·ÖÇø¶ÔÓ¦ÓÚÒ»¸ö×ÓRDD
µÄ·ÖÇø¡£Í¼3ÖУ¬map/filterºÍunionÊôÓÚµÚÒ»À࣬¶ÔÊäÈë½øÐÐÐͬ»®·Ö£¨co-partitioned£©µÄjoinÊôÓÚµÚ¶þÀà¡£
¿íÒÀÀµÖ¸×ÓRDDµÄ·ÖÇøÒÀÀµÓÚ¸¸RDDµÄËùÓзÖÇø£¬ÕâÊÇÒòΪshuffleÀà²Ù×÷£¬Èçͼ3ÖеÄgroupByKeyºÍδ¾Ðͬ»®·ÖµÄjoin¡£
ÕÒÀÀµ¶ÔÓÅ»¯ºÜÓÐÀû¡£Âß¼ÉÏ£¬Ã¿¸öRDDµÄËã×Ó¶¼ÊÇÒ»¸öfork/join£¨´Ëjoin·ÇÉÏÎĵÄjoinËã×Ó£¬¶øÊÇָͬ²½¶à¸ö²¢ÐÐÈÎÎñµÄbarrier£©£º
°Ñ¼ÆËãforkµ½Ã¿¸ö·ÖÇø£¬ËãÍêºójoin£¬È»ºófork/joinÏÂÒ»¸öRDDµÄËã×Ó¡£Èç¹ûÖ±½Ó·Òëµ½ÎïÀíʵÏÖ£¬ÊǺܲ»¾¼ÃµÄ£ºÒ»ÊÇÿһ¸öRDD£¨¼´Ê¹
ÊÇÖмä½á¹û£©¶¼ÐèÒªÎﻯµ½ÄÚ´æ»ò´æ´¢ÖУ¬·Ñʱ·Ñ¿Õ¼ä£»¶þÊÇjoin×÷Ϊȫ¾ÖµÄbarrier£¬ÊǺܰº¹óµÄ£¬»á±»×îÂýµÄÄǸö½ÚµãÍÏËÀ¡£Èç¹û×ÓRDDµÄ·ÖÇøµ½
¸¸RDDµÄ·ÖÇøÊÇÕÒÀÀµ£¬¾Í¿ÉÒÔʵʩ¾µäµÄfusionÓÅ»¯£¬°ÑÁ½¸öfork/joinºÏΪһ¸ö£»Èç¹ûÁ¬ÐøµÄ±ä»»Ëã×ÓÐòÁж¼ÊÇÕÒÀÀµ£¬¾Í¿ÉÒ԰Ѻܶà¸ö
fork/join²¢ÎªÒ»¸ö£¬²»µ«¼õÉÙÁË´óÁ¿µÄÈ«¾Öbarrier£¬¶øÇÒÎÞÐèÎﻯºÜ¶àÖмä½á¹ûRDD£¬Õ⽫¼«´óµØÌáÉýÐÔÄÜ¡£Spark°ÑÕâ¸ö½Ð×öÁ÷Ë®Ïß
£¨pipeline£©ÓÅ»¯¡£
±ä»»Ëã×ÓÐòÁÐÒ»ÅöÉÏshuffleÀà²Ù×÷£¬¿íÒÀÀµ¾Í·¢ÉúÁË£¬Á÷Ë®ÏßÓÅ»¯ÖÕÖ¹¡£ÔÚ¾ßÌåʵÏÖ ÖУ¬DAGScheduler´Óµ±Ç°Ëã×ÓÍùǰ»ØËÝÒÀÀµÍ¼£¬Ò»Åöµ½¿íÒÀÀµ£¬¾ÍÉú³ÉÒ»¸östageÀ´ÈÝÄÉÒѱéÀúµÄËã×ÓÐòÁС£ÔÚÕâ¸östageÀ¿ÉÒÔ°²È«µØÊµ
Ê©Á÷Ë®ÏßÓÅ»¯¡£È»ºó£¬ÓÖ´ÓÄǸö¿íÒÀÀµ¿ªÊ¼¼ÌÐø»ØËÝ£¬Éú³ÉÏÂÒ»¸östage¡£
ÒªÉÁ½¸öÎÊÌ⣺һ£¬·ÖÇøÈçºÎ»®·Ö£»¶þ£¬·ÖÇø¸Ã·Åµ½¼¯ÈºÄÚÄĸö½Úµã¡£ÕâÕýºÃ¶ÔÓ¦ÓÚRDD½á¹¹ÖÐÁíÍâÁ½¸öÓò£º·ÖÇø»®·ÖÆ÷£¨partitioner£©ºÍÊ×ѡλÖã¨preferred
locations£©¡£
·ÖÇø»®·Ö¶ÔÓÚshuffleÀà²Ù×÷ºÜ¹Ø¼ü£¬Ëü¾ö¶¨Á˸òÙ×÷µÄ¸¸RDDºÍ×ÓRDDÖ®¼äµÄÒÀÀµÀàÐÍ¡£ÉÏÎÄÌáµ½£¬Í¬Ò»¸öjoinËã×Ó£¬Èç¹ûÐͬ»®·ÖµÄ»°£¬Á½¸ö¸¸
RDDÖ®¼ä¡¢¸¸RDDÓë×ÓRDDÖ®¼äÄÜÐγÉÒ»ÖµķÖÇø°²ÅÅ£¬¼´Í¬Ò»¸ökey±£Ö¤±»Ó³É䵽ͬһ¸ö·ÖÇø£¬ÕâÑù¾ÍÄÜÐγÉÕÒÀÀµ¡£·´Ö®£¬Èç¹ûûÓÐÐͬ»®·Ö£¬µ¼Ö¿í
ÒÀÀµ¡£
ËùνÐͬ»®·Ö£¬¾ÍÊÇÖ¸¶¨·ÖÇø»®·ÖÆ÷ÒÔ²úÉúǰºóÒ»ÖµķÖÇø°²ÅÅ¡£PregelºÍHaLoop°ÑÕâ¸ö×÷ΪϵͳÄÚÖõÄÒ»²¿·Ö£»¶øSpark
ĬÈÏÌṩÁ½ÖÖ»®·ÖÆ÷£ºHashPartitionerºÍRangePartitioner£¬ÔÊÐí³ÌÐòͨ¹ýpartitionByËã×ÓÖ¸¶¨¡£×¢Ò⣬HashPartitionerÄܹ»·¢»Ó×÷Óã¬ÒªÇókeyµÄhashCodeÊÇÓÐЧµÄ£¬¼´Í¬ÑùÄÚÈݵÄkey²úÉúͬÑùµÄhashCode¡£Õâ¶Ô
StringÊdzÉÁ¢µÄ£¬µ«¶ÔÊý×é¾Í²»³ÉÁ¢£¨ÒòΪÊý×éµÄhashCodeÊÇÓÉËüµÄ±êʶ£¬¶ø·ÇÄÚÈÝ£¬Éú³É£©¡£ÕâÖÖÇé¿öÏ£¬SparkÔÊÐíÓû§×Ô¶¨Òå
ArrayHashPartitioner¡£
µÚ¶þ¸öÎÊÌâÊÇ·ÖÇø·ÅÖõĽڵ㣬Õâ¹ØºõÊý¾Ý±¾µØÐÔ£º±¾µØÐԺã¬ÍøÂçͨОÍÉÙ¡£ÓÐЩRDD²úÉúʱ¾Í ÓÐÊ×ѡλÖã¬ÈçHadoopRDD·ÖÇøµÄÊ×ѡλÖþÍÊÇHDFS¿éËùÔڵĽڵ㡣ÓÐЩRDD»ò·ÖÇø±»»º´æÁË£¬ÄǼÆËã¾ÍÓ¦¸ÃË͵½»º´æ·ÖÇøËùÔÚµÄ½Úµã½øÐС£ÔÙ²»
È»£¬¾Í»ØËÝRDDµÄlineageÒ»Ö±ÕÒµ½¾ßÓÐÊ×ѡλÖÃÊôÐԵĸ¸RDD£¬²¢¾Ý´Ë¾ö¶¨×ÓRDDµÄ·ÅÖá£
¿í/ÕÒÀÀµµÄ¸ÅÄî²»Ö¹ÓÃÔÚµ÷¶ÈÖУ¬¶ÔÈÝ´íÒ²ºÜÓÐÓá£Èç¹ûÒ»¸ö½Úµãå´»úÁË£¬¶øÇÒÔËËãÊÇÕÒÀÀµ£¬ÄÇÖ»Òª°Ñ¶ªÊ§µÄ¸¸RDD·ÖÇøÖØËã¼´¿É£¬¸úÆäËû½ÚµãûÓÐÒÀÀµ¡£¶ø¿íÒÀÀµÐèÒª¸¸RDDµÄËùÓзÖÇø¶¼´æÔÚ£¬
ÖØËã¾ÍºÜ°º¹óÁË¡£ËùÒÔÈç¹ûʹÓÃcheckpointËã×ÓÀ´×ö¼ì²éµã£¬²»½öÒª¿¼ÂÇlineageÊÇ·ñ×ã¹»³¤£¬Ò²Òª¿¼ÂÇÊÇ·ñÓпíÒÀÀµ£¬¶Ô¿íÒÀÀµ¼Ó¼ì²éµãÊÇ×îÎï
ÓÐËùÖµµÄ¡£
½áÓï
ÒòΪƪ·ùËùÏÞ£¬±¾ÎÄÖ»ÄܽéÉÜSparkµÄ»ù±¾¸ÅÄîºÍÉè¼ÆË¼Ï룬ÄÚÈÝÀ´×ÔSparkµÄ¶àƪÂÛÎÄ£¨ÒÔNSDI'12
¡°Resilient Distributed Datasets: A Fault-Tolerant Abstraction
for In-Memory Cluster Computing¡±ÎªÖ÷£©£¬Ò²ÓÐÎÒºÍͬÊÂÑо¿SparkµÄÐĵã¬ÒÔ¼°¶àÄêÀ´´Óʲ¢ÐÐ/·Ö²¼Ê½ÏµÍ³Ñо¿µÄ¸ÐÎò¡£SparkºËÐijÉÔ±/SharkÖ÷´´ÕßÐÁœ›
¶Ô±¾ÎÄ×÷ÁËÉóÔĺÍÐ޸ģ¬ÌØ´ËÖÂл£¡ |