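In the big data world of 2014, Apache Spark (hereafter Spark) was without doubt the project that drew the most attention. Spark was born at the renowned UC Berkeley AMPLab and is now backed by the commercial company Databricks. Since becoming an Apache Top-Level Project (TLP) in March 2014, Spark has grown into one of the most active projects in the ASF and has won broad industry support: the Spark 1.2 release of December 2014 contained more than 1,000 commits from 172 contributors. Over the whole of 2014, Spark shipped nine major and minor releases (including the milestone 1.0 release at the end of May), which gives a sense of how active the community is. It is also worth mentioning that in November 2014 Databricks completed a Sort Benchmark run in the Daytona Gray category on AWS and set a new record for that test. This article gives a high-level summary of Spark's development in 2014.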
Spark 2014: A Single Spark Has Become a Prairie Fire
First, Spark conferences and related events. The most authoritative Spark conference in the world today is undoubtedly Spark Summit, which was held successfully in both 2013 and 2014, with engineers from around the globe sharing their Spark use cases. Given Spark's current momentum, Spark Summit will be held twice in 2015, as Spark Summit East and Spark Summit West. Within China, the first Spark technology summit (Spark Summit China) took place in Beijing in April 2014, and by the organizers' count almost every major Chinese internet company attended, so there is good reason to look forward to what this year's Spark Summit China will bring. Beyond these larger conferences, Spark Meetups are held from time to time all over the world. At the time of writing, 33 cities in 13 countries have hosted a Spark Meetup; in China, four cities have done so: Beijing, Hangzhou, Shanghai, and Shenzhen. In addition to offline events, open online classes are organized for those who cannot attend in person. Spark-related exchange was clearly very frequent in 2014, which has greatly benefited Spark's development.
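Second, in 2014 the major vendors announced partnerships with Databricks one after another. Cloudera had already announced at the end of 2013 that it would add Spark to its distribution, and more companies followed, including Datastax, MapR, Pivotal, and Hortonworks. Spark has thus been recognized by many big data companies, and these companies have indeed integrated their own products closely with Spark. Datastax, for example, integrated Cassandra with Spark so that Spark can operate on data stored in Cassandra, and ElasticSearch has been integrated with Spark as well. More on these efforts can be found in the related talks from Spark Summit 2014.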
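In addition, 2014 saw more companies put Spark into real use. Well-known examples abroad include Yahoo!, eBay, Twitter, Amazon, SAP, Tableau, and MicroStrategy. Happily, Chinese companies have kept pace in adopting Spark: Taobao, Tencent, Baidu, Xiaomi, JD.com, Vipshop, iQiyi, Sohu, Qiniu, Huawei, AsiaInfo, and other well-known companies all run it in production. This in turn has led more and more Chinese engineers to contribute code to Spark, especially to the Spark SQL component, where roughly half of the contributors are Chinese engineers. Adoption by these well-known companies has greatly raised the industry's interest and confidence in Spark, and we have reason to believe that the number of companies using Spark will grow explosively in 2015. Meanwhile, a group of startups building applications on Spark has emerged, and quite a few of them are doing well, such as Adatao and TupleJump.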
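As market demand for Spark engineers keeps growing, Databricks launched a timely Spark developer certification program. The first in-person exam was held in Barcelona, Spain, in November 2014. At the time of writing (January 2015), the certification does not yet support online exams, but an online testing platform is expected to go live soon.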
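Thanks to Spark's steadily and healthily growing ecosystem, more and more companies and organizations are building applications and extension libraries on top of Spark. As the number of these libraries grew, Databricks launched, on the eve of Christmas 2014, a pip-like site for tracking them: http://spark-packages.org. A number of libraries are already listed on Spark Packages, and several of them are quite good, for example dibbhatt/kafka-spark-consumer, spark-jobserver/spark-jobserver, and mengxr/spark-als.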
Spark 2014: The Technical Evolution Driven by Many Hands
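As Figure 1 shows, Spark covers batch processing, stream processing, graph processing, machine learning, ad hoc queries, and relational queries, which means a single framework can meet the needs of all these scenarios. In the past we might have needed a separate framework for each capability, for example Hadoop MapReduce for batch processing and Storm for stream processing. As a result, we had to write different business code for each of the two frameworks, and that code was almost impossible to reuse. Moreover, to keep the system stable we had to invest extra staff in understanding the internals of both Hadoop MapReduce and Storm, which is a large cost in people. With Spark, we only need to understand Spark. Another attraction is that the business code for Spark batch processing and stream processing can be reused almost entirely, so one piece of logic can run as both a batch job and a streaming job. Finally, Spark can work directly with data stored on HDFS, with no data migration required.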

Figure 1 Spark Stack
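The claim that one piece of business logic can serve both batch and streaming can be illustrated without Spark at all. The sketch below is plain Python, not the Spark API: `count_errors`, `micro_batches`, and the log format are hypothetical names and data chosen only for illustration. The same function is applied once to the whole data set and once per small batch of a simulated stream.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_errors(lines: Iterable[str]) -> Counter:
    """Business logic written once: count ERROR lines per service name."""
    counts: Counter = Counter()
    for line in lines:
        level, service, _message = line.split(" ", 2)
        if level == "ERROR":
            counts[service] += 1
    return counts


def micro_batches(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Cut an incoming stream into small batches of at most `size` records."""
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical log lines in the form "<LEVEL> <service> <message>".
LOG = [
    "ERROR auth token expired",
    "INFO web request served",
    "ERROR web upstream timeout",
    "ERROR auth bad password",
]

# Batch mode: apply the logic to the whole data set at once.
batch_result = count_errors(LOG)

# Streaming mode: apply the same function to each small batch and merge.
stream_result: Counter = Counter()
for chunk in micro_batches(LOG, size=2):
    stream_result += count_errors(chunk)
```

Both modes produce the same counts here because the per-batch results can be merged by simple addition; logic that needs state across batches would require more care.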
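At the same time, because existing systems have to share and exchange data through a distributed file system such as HDFS, the resulting I/O overhead greatly reduces computational efficiency; repeated serialization and deserialization is another cost that cannot be ignored. To address this, Spark introduces the RDD abstraction and defines a rich set of operators on top of it, of which MapReduce is only a very small subset. RDDs can also be cached in memory, so iterative computations can take full advantage of in-memory speed. Unlike MapReduce's process-based execution model, Spark uses a multi-threaded model, which means task scheduling latency can be kept at the sub-second level. When there are very many tasks, this greatly reduces total scheduling time, and it lays the foundation for stream processing based on micro-batches. Another distinguishing feature of Spark is DAG-based task scheduling and optimization: Spark does not need to schedule a separate job for every step as MapReduce does; instead, its rich operators express a computation naturally as a DAG. Within each stage, Spark also pipelines operations, so even without caching data in memory, Spark executes more efficiently than Hadoop. Finally, Spark achieves fault tolerance through RDD lineage. Because RDDs are immutable, Spark does not need to record intermediate state; when some partitions of an RDD are lost, Spark can use the lineage information to recompute them in parallel. When the lineage is long, however, users are still advised to checkpoint at appropriate points to shorten recovery time.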
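The idea of lineage-based recovery can be sketched in a few lines of plain Python. This is a toy model, not Spark's implementation: `ToyRDD` and all of its methods are hypothetical, and recovery here is sequential rather than parallel. Each derived data set keeps only a function describing how to compute a partition from its parent, so a lost cached partition can be rebuilt on demand.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Toy immutable data set that remembers how it was derived (its lineage)."""

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[int]],
                 parent: Optional["ToyRDD"] = None) -> None:
        self.num_partitions = num_partitions
        self._compute = compute                 # partition index -> records
        self.parent = parent                    # lineage pointer
        self._cache: Dict[int, List[int]] = {}  # cached partitions, if any

    @staticmethod
    def from_partitions(partitions: List[List[int]]) -> "ToyRDD":
        data = [list(p) for p in partitions]
        return ToyRDD(len(data), lambda i: list(data[i]))

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      parent=self)

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if pred(x)],
                      parent=self)

    def cache(self) -> "ToyRDD":
        for i in range(self.num_partitions):
            self._cache[i] = self._compute(i)
        return self

    def lose_partition(self, i: int) -> None:
        """Simulate losing one cached partition (for example, a failed node)."""
        self._cache.pop(i, None)

    def partition(self, i: int) -> List[int]:
        # Use the cached copy if present; otherwise recompute from lineage.
        if i in self._cache:
            return self._cache[i]
        return self._compute(i)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
even_squares = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

before = even_squares.collect()
even_squares.lose_partition(1)   # partition 1 is gone from the cache
after = even_squares.collect()   # rebuilt by replaying filter and map
```

Because no intermediate state is stored, recomputing a lost partition replays every step back to the source, which is why a long lineage makes checkpointing attractive.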
Below, we follow the major releases of 2014 and briefly summarize the work done on new features and stability in Spark and its components (Spark Streaming, MLlib, GraphX, and Spark SQL).
Spark 0.9.x
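In early February 2014, Databricks announced 0.9.0, the first Spark release of the year. The most immediate change it brought was the upgrade of Scala from 2.9.x to 2.10. Because Scala did not provide binary backward compatibility at the time, everyone had to recompile their business code with Scala 2.10, a small episode in its own right.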
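The biggest contribution of this release was probably the new configuration system, SparkConf. Previously, the various properties were passed in directly as Master parameters. With SparkConf, the Master no longer has to deal with them: once the parameters are set in a SparkConf, the SparkConf is simply handed to the Master, which is very useful in testing. In addition, when submitting a job, the Driver program can now run on a server inside the cluster; before, it could only run on a server outside the cluster.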
Spark Streaming finally, and "confidently", left alpha in this release and gained an HA mode, although we now know that the HA of that time could not guarantee zero data loss; we will come back to this point under 1.2. As Spark Streaming graduated from alpha, a new alpha component was added: GraphX. GraphX is a distributed graph computation framework, and this release shipped some standard algorithms such as PageRank, connected components, strongly connected components, and triangle counting, though its stability still needed work. MLlib added the widely used naive Bayes algorithm in this release, but more notably, MLlib finally began to support a Python API (which requires NumPy).
The community published two maintenance releases, 0.9.1 in April and 0.9.2 in July, which fixed some bugs and added no new features. That said, 0.9.1 was the first release after Spark became an Apache Top-Level Project.
Spark 1.0.x
"Long awaited" is no exaggeration for Spark 1.0. As a milestone release, it was handled with great care by the Spark community: after several RC versions, 1.0 was officially released at the end of May. More than 110 contributors worked on it over four months, and 1.0 was, unsurprisingly, the largest release since Spark's birth. As the first of the 1.x line, it also came with a community guarantee of API compatibility across all later 1.x releases. In addition, the Spark 1.0 Java API began to support Java 8 lambda expressions, a real convenience for users who have to write their Spark programs in Java.
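The much-anticipated Spark SQL finally made its debut in this release. Although it was only an alpha, Spark users around the world could not wait to try it, and that momentum continues today: Spark SQL is now the single most active component in Spark. Any mention of Spark SQL has to include Shark. At Spark Summit 2014, Databricks announced that Shark had fulfilled its academic mission and that its overall design depended too heavily on Hive to support long-term development, so Shark development would end and all effort would move to Spark SQL. Spark SQL supports operating on structured data with SQL, and it also supports working with data in Hive through HiveContext. The industry's very strong demand for SQL on Hadoop means that Spark SQL is bound to keep developing rapidly for a long time. It is worth noting that the Hive community has also started a Hive on Spark project, which replaces Hive's execution engine with Spark. In terms of goals, however, Hive on Spark focuses on thorough backward compatibility with Hive, whereas Spark SQL focuses on interoperability between Spark and other components and on diverse data processing.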
MLlib also made a big step forward: starting with 1.0 it finally supports sparse matrices, a feature MLlib users were delighted to see. On the algorithm side, MLlib added decision trees, SVD, and PCA, among others. The performance of both Spark Streaming and GraphX was improved in this release.
Spark also introduced a new job submission tool called spark-submit. Whether running in Standalone mode or on YARN, jobs can be submitted with this tool, so in this respect Spark unified the entry point for job submission.
Finally, the community published two maintenance releases, 1.0.1 in July and 1.0.2 in August.
Spark 1.1.x
Spark 1.1.0 arrived on schedule in September. This release added a sort-based shuffle implementation. The earlier hash-based shuffle had to open one file for every reducer, which led to heavy buffer overhead and inefficient I/O; the new sort-based shuffle handles these problems well, and its advantage is especially clear when the volume of shuffled data is very large. Note that, unlike MapReduce, which sorts by key-value pairs, the sort-based shuffle sorts by partition ID and does not sort within a partition. The default shuffle in 1.1 is still hash-based, however; sort-based only becomes the default in 1.2.
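The difference between the two shuffle strategies can be made concrete with a small plain-Python model. It is a sketch under simplifying assumptions, not Spark's code: the file counts assume one file per (map task, reducer) pair for hash-based shuffle and one data file per map task for sort-based shuffle, and `partition_of` is a hypothetical deterministic partitioner. The write step orders records by partition ID only and leaves keys inside a partition unsorted, as described above.

```python
from typing import List, Tuple

Record = Tuple[str, int]


def partition_of(key: str, num_reducers: int) -> int:
    # Deterministic toy partitioner (the built-in hash() is salted per process).
    return sum(key.encode("utf-8")) % num_reducers


def hash_shuffle_file_count(num_mappers: int, num_reducers: int) -> int:
    # Assumption: each map task keeps one open file per reducer.
    return num_mappers * num_reducers


def sort_shuffle_file_count(num_mappers: int) -> int:
    # Assumption: each map task writes a single data file (index files ignored).
    return num_mappers


def sort_based_write(records: List[Record],
                     num_reducers: int) -> Tuple[List[Record], List[int]]:
    """Lay out one map task's output ordered by partition ID only.

    Returns the record list plus offsets, where reducer r reads
    data[offsets[r]:offsets[r + 1]].
    """
    tagged = [(partition_of(k, num_reducers), (k, v)) for k, v in records]
    tagged.sort(key=lambda t: t[0])   # stable sort: keys within a partition keep arrival order
    data = [rec for _, rec in tagged]
    offsets = [0] * (num_reducers + 1)
    for pid, _ in tagged:
        offsets[pid + 1] += 1
    for r in range(num_reducers):
        offsets[r + 1] += offsets[r]
    return data, offsets


records = [("b", 1), ("a", 2), ("d", 3), ("c", 4), ("a", 5)]
data, offsets = sort_based_write(records, num_reducers=2)
reducer_1_input = data[offsets[1]:offsets[2]]
```

With 1,000 map tasks and 1,000 reducers, this model gives a million files for the hash-based layout against a thousand for the sort-based one, which is the scale at which the buffer and I/O costs become noticeable.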
Spark SQL gained quite a few new features in this release. The most notable is the JDBC Server, which means users can write nothing but JDBC code and still enjoy everything Spark SQL offers.
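MLlib introduced a statistics library for tasks such as sampling, correlation, estimation, and testing. The much-requested feature extraction tools Word2Vec and TF-IDF were also added in this release. Besides the new algorithms, MLlib's performance improved considerably. Compared with MLlib, GraphX did not change much in this release.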
Spark Streaming added Amazon Kinesis as a data source in this release, although users in China have shown little interest in it and it matters more to users abroad. This release also changed the way Spark Streaming obtains data from Flume. Previously, Flume pushed data to the executor/worker, but in that mode, once the executor/worker died, Flume could no longer push data normally. Push has therefore been changed to pull, which means that even if a receiver dies, a receiver newly started on another worker can continue receiving data normally. Another important improvement is rate limiting: Spark Streaming used to run into OOM frequently when reading topic data from Kafka, and with rate limiting in place OOM has essentially stopped occurring. The combination of Spark Streaming and MLlib is another brand-new feature that has to be mentioned: it uses the real-time nature of Streaming to train models online, although for now it is only a fairly basic implementation.
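The maintenance release 1.1.1, published at the end of November, fixed a significant problem. Previously, the external data structures (ExternalAppendOnlyMap and ExternalSorter) would produce a large number of very small intermediate files, which not only caused "too many open files" exceptions but also hurt performance badly. Version 1.1.1 fixed this.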
Spark 1.2.0
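Version 1.2 was released in mid-December; it has to be said that the Spark community does an excellent job of keeping releases on schedule. First and foremost, this release made sort-based shuffle the default shuffle strategy. In addition, for cases with very large data transfers, the connection manager was finally replaced with a Netty-based implementation. The old implementation was very slow because every transfer had to read from disk into kernel space, copy to user space, and then go back to kernel space before reaching the network card; the new one uses zero-copy and is much more efficient.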
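For Spark Streaming, this release is also something of a small milestone: it begins to support a full H/A mode. Previously, a small portion of data could be lost when the driver died. Now a WAL (Write Ahead Log) layer has been added: every time a receiver gets data, the data is stored on HDFS, so even if the driver dies, it can pick up processing where it left off after restarting. Users should also be aware of the difference between unreliable receivers and reliable receivers: zero data loss is guaranteed only with reliable receivers.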
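A minimal sketch of the write-ahead idea, in plain Python rather than Spark Streaming code: `WriteAheadLog`, `ReliableReceiver`, and `run_demo` are hypothetical names, and a local temporary file stands in for HDFS. The receiver persists each record before acknowledging it upstream, so anything that has been acknowledged can be replayed after a simulated driver failure.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple

Record = Dict[str, int]


class WriteAheadLog:
    """Append-only log on local disk (a stand-in for a log kept on HDFS)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, record: Record) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())   # make sure the record reaches the disk

    def replay(self) -> List[Record]:
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


class ReliableReceiver:
    """Acknowledges the source only after the record is durably logged."""

    def __init__(self, wal: WriteAheadLog) -> None:
        self.wal = wal
        self.acked: List[int] = []

    def receive(self, record: Record) -> None:
        self.wal.append(record)            # 1. persist first
        self.acked.append(record["id"])    # 2. then acknowledge upstream


def run_demo() -> Tuple[List[int], List[Record]]:
    with tempfile.TemporaryDirectory() as tmp:
        wal = WriteAheadLog(os.path.join(tmp, "receiver.wal"))
        receiver = ReliableReceiver(wal)
        in_memory: List[Record] = []       # what the running job holds
        for i in range(3):
            record = {"id": i, "value": i * 10}
            receiver.receive(record)
            in_memory.append(record)
        in_memory.clear()                  # simulate the driver dying
        recovered = wal.replay()           # after restart, rebuild from the log
        return receiver.acked, recovered


acked_ids, recovered_records = run_demo()
```

The ordering inside `receive` is the point of the sketch: if the acknowledgement came first, a crash between the two steps would lose a record that the source believes was delivered, which is the gap an unreliable receiver leaves open.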
The biggest change in MLlib is the new pipeline API, which makes it much easier to assemble a complete machine learning workflow; it also includes a dataset API built on Spark SQL's SchemaRDD.
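GraphX left alpha and was officially released with a stable API, which means users no longer need to worry that existing code will have to change because of API changes. In addition, the new core API aggregateMessages replaces mapReduceTriplets; users should take note of this change.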
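The message-aggregation pattern behind aggregateMessages can be sketched in plain Python. This is only an illustration of the idea and does not mirror the GraphX signature: `aggregate_messages`, `send`, and `merge` are hypothetical names, and the graph is a plain list of (source, destination) pairs. A send function emits messages along each edge, and a merge function combines the messages that arrive at the same vertex.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Edge = Tuple[int, int]        # (source vertex id, destination vertex id)
Message = Tuple[int, int]     # (target vertex id, message value)


def aggregate_messages(edges: Iterable[Edge],
                       send: Callable[[int, int], List[Message]],
                       merge: Callable[[int, int], int]) -> Dict[int, int]:
    """Send messages along every edge and merge them per receiving vertex."""
    inbox: Dict[int, int] = {}
    for src, dst in edges:
        for target, value in send(src, dst):
            inbox[target] = merge(inbox[target], value) if target in inbox else value
    return inbox


edges = [(1, 2), (1, 3), (2, 3), (3, 1)]

# In-degree: every edge sends the value 1 to its destination; messages are summed.
in_degree = aggregate_messages(edges,
                               send=lambda src, dst: [(dst, 1)],
                               merge=lambda a, b: a + b)

# Largest neighbour id: every edge informs both endpoints about the other one.
max_neighbour = aggregate_messages(edges,
                                   send=lambda src, dst: [(dst, src), (src, dst)],
                                   merge=max)
```

Vertices that receive no message simply do not appear in the result, so a caller that needs a value for every vertex has to fill in defaults itself.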
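The most important Spark SQL feature is without question the external data source API. It makes it much easier for developers to build Spark connectors to external data sources and to operate on all data sources uniformly through SQL. It also allows predicates to be pushed down to the data source. For example, to fetch data from HBase and then filter it, we would normally have to pull all the data out of HBase and filter it in the Spark engine; now the filtering step can be pushed to the data source side, so the data is filtered as it is fetched. Also worth mentioning is that cacheTable and the native cache now share the same semantics, with clear gains in performance and stability. In-memory tables support predicate pushdown and can skip whole batches of data based on statistics, and the in-memory buffers are now built in segments, so caching a large table no longer causes OOM.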
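The benefit of pushing a predicate down to the data source can be shown with a toy model in plain Python. It is a sketch, not the Spark SQL data source API: `ToySource`, `scan`, and `scan_filtered` are hypothetical names, and the only thing measured is how many rows leave the source in each case.

```python
from typing import Dict, List

Row = Dict[str, int]


class ToySource:
    """Hypothetical external store that counts how many rows it ships out."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.rows_shipped = 0

    def scan(self) -> List[Row]:
        """Full scan: every row is sent to the query engine."""
        self.rows_shipped += len(self._rows)
        return list(self._rows)

    def scan_filtered(self, column: str, min_value: int) -> List[Row]:
        """Pushed-down scan: the source applies the filter before shipping."""
        selected = [row for row in self._rows if row[column] >= min_value]
        self.rows_shipped += len(selected)
        return selected


rows = [{"id": i, "age": i * 10} for i in range(10)]

# Without pushdown: fetch everything, then filter in the engine.
plain_source = ToySource(rows)
engine_side = [row for row in plain_source.scan() if row["age"] >= 70]

# With pushdown: the source filters while reading.
pushdown_source = ToySource(rows)
source_side = pushdown_source.scan_filtered("age", 70)
```

Both paths return the same three rows, but the first ships all ten rows across the boundary and the second ships only the matching ones; the gap grows with table size and filter selectivity.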
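For reasons of space, the above is only a brief summary of the more important features in Spark's 2014 releases. One area of improvement, however, runs through all of them: YARN. Since many companies now run different computing frameworks on YARN, Spark's support for YARN is sure to keep getting better, and Spark has indeed done a great deal of work in this area.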
Conclusion
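2014 was a very important year for Spark, not only because of the milestone 1.0 release but, more importantly, because the efforts of the whole community have made Spark increasingly stable and efficient and it is being adopted by more and more companies. With the community's continued work in 2015, we believe Spark will reach new heights and play a more important role in more companies.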
¸ÐлÀ´×ÔDatabricks¹«Ë¾µÄReynold XinºÍÁ¬³Ç¸ø±¾ÎÄreview£¬²¢Ìṩ±¦¹ó½¨Òé¡£