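Spark is now widely recognized and supported in China. In 2014, Spark Summit China was held in Beijing to a packed house. In the same year, Spark Meetups took place in four cities (Beijing, Shanghai, Shenzhen, and Hangzhou), with five held in Beijing alone, covering Spark Core, Spark Streaming, Spark MLlib, Spark SQL, and many other areas. As a mobile internet big data services company that noticed and adopted Spark relatively early, TalkingData has taken an active part in the Chinese Spark community and has shared its experience with Spark at several Meetups. This article describes how TalkingData gradually introduced Spark while building its big data platform, and how it went on to build a mobile big data platform on top of Hadoop YARN and Spark.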
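First Encounter with Spark
For a startup in the mobile internet big data field, keeping a close watch on the progress of big data technology is required homework for the technical team. While we were going through the public slides from Strata 2013, a tutorial titled "An Introduction on the Berkeley Data Analytics Stack_BDAS_Featuring Spark,Spark Streaming,and Shark" caught the attention of the whole team and sparked a lot of discussion. Several things stood out: Spark's in-memory RDD model, its support for machine learning algorithms, the unified model for real-time and offline processing across the stack, and Shark. We were also watching Impala at the time, but compared with Spark, Impala can be understood as an upgrade to Hive, whereas Spark tries to build an entire big data processing ecosystem around the RDD. For a startup whose data volume was growing fast and whose business was centered on big data processing and constantly changing, the latter was clearly more worth following and studying.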
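Exploring Spark
By mid-2013, as the business grew rapidly, more and more mobile device data was being collected by our various business platforms. Beyond supplying the metrics each business needed, did this data hold more value? To mine that potential value, we decided to build our own data center, bring the data from all business platforms together, and process, analyze, and mine the data about the devices we covered. The initial data center had two main functions:
1. Android app rankings aggregated across app markets;
2. App recommendations based on user interests.
Given the technology we had mastered at the time and the functional requirements, the data center adopted the architecture shown in Figure 1.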

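Figure 1: Data center architecture based on Hadoop 2.0
The whole system was built on Hadoop 2.0 (Cloudera CDH4.3) and used the most basic big data computing architecture. A log collection program gathered logs from the different business platforms into the data center, and an ETL step formatted the data and stored it in HDFS. Both the ranking and the recommendation algorithms were implemented in MapReduce. The system did offline batch computation only, and the offline jobs were scheduled by a scheduling system based on Azkaban.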
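The first version of the data center architecture was designed essentially to meet the goal of "the most basic use of the data." As our exploration of the data's value deepened, however, more and more real-time analysis requirements came up. At the same time, more machine learning algorithms urgently needed to be added to support different data mining needs. Real-time analysis obviously cannot be handled by developing a separate MapReduce job for every analysis request, so introducing Hive was a simple and direct choice. And because the traditional MapReduce model does not support iterative computation well, we needed a better parallel computing framework for machine learning algorithms. These were exactly the areas where Spark, which we had been following closely, excels: with its friendly support for iterative computation, Spark was the natural choice. At the end of September 2013, with the release of Spark 0.8.0, we decided to evolve the original architecture. We introduced Hive as the basis for ad hoc queries, introduced Spark to support machine learning workloads, and set out to verify whether Spark, as a new computing framework, could fully replace the traditional MapReduce-based one. Figure 2 shows how the system architecture evolved.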

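Figure 2: Testing Spark within the original architecture
In this architecture, we deployed Spark 0.8.1 on YARN and used separate queues to isolate the Spark-based machine learning jobs, the daily MapReduce ranking jobs, and the Hive-based ad hoc analysis jobs.
The first step in adopting Spark was to obtain a Spark package that supported our Hadoop environment. Our Hadoop environment was Cloudera's CDH 4.3, and the default Spark release packages did not include a build for CDH 4.3, so we had to compile it ourselves. The official Spark documentation recommends building with Maven, but the build did not go as smoothly as we had hoped. For well-known reasons, various package dependencies could not be downloaded reliably from some of the central repositories. We took the simplest and most direct workaround and compiled on an AWS cloud host. Note that before compiling you must follow the documentation's advice and set: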

Otherwise, the build will run out of memory. For CDH 4.3, the mvn build parameters are:

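Once the required Spark package has been built, deploying Spark and running it in the Hadoop environment is very simple. Pack and compress the compiled Spark directory, unpack it on a machine that can run the Hadoop client, and Spark is ready to run. To verify that Spark runs properly on the target Hadoop environment, follow the official Spark documentation and run SparkPi from the examples: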

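With Spark deployed, what remained was developing Spark-based programs. Although Spark supports Java and Python, the most suitable language for Spark programs is still Scala. After a period of trial and practice, once we had grasped Scala's functional programming style, we came to appreciate the great benefits of developing Spark applications in Scala. A computation that takes several hundred lines in MapReduce can be written in a few dozen lines of Scala on Spark. At run time, the same computation runs tens of times faster on Spark than on MapReduce. For machine learning algorithms that need iteration, the advantage of Spark's RDD model over MapReduce is even more obvious, not to mention the basic support provided by MLlib. After several months of practice, our data mining work had moved entirely to Spark, and we had implemented on Spark more efficient versions of algorithms such as LR that suit our data sets.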
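To make the point about iteration concrete, here is a minimal sketch in plain Python (not Spark code, and not our production implementation), assuming "LR" refers to logistic regression trained by batch gradient descent. The data, learning rate, and iteration count are made up for illustration. What matters is the loop structure: every iteration scans the full data set again, which is the pass that MapReduce would reload from disk each time and that an RDD can keep cached in memory.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(points, iterations=200, learning_rate=0.5):
    """Batch gradient descent; `points` is a list of (features, label), label 0 or 1."""
    dim = len(points[0][0])
    weights = [0.0] * dim
    for _ in range(iterations):
        # Each iteration makes a full pass over the same data set.
        gradient = [0.0] * dim
        for features, label in points:
            score = sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            for i, x in enumerate(features):
                gradient[i] += error * x
        weights = [w - learning_rate * g / len(points)
                   for w, g in zip(weights, gradient)]
    return weights


def predict(weights, features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy data (hypothetical): the first feature is a constant bias term.
TRAINING_POINTS = [
    ([1.0, -2.0], 0), ([1.0, -1.5], 0), ([1.0, -1.0], 0),
    ([1.0, 1.0], 1), ([1.0, 1.5], 1), ([1.0, 2.0], 1),
]
WEIGHTS = train_logistic_regression(TRAINING_POINTS)
```

With 200 iterations the data set is read 200 times, so the cost of each read dominates the total run time.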
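Fully Embracing Spark
In 2014 the company's business grew substantially, and the daily data volume was several times what it had been when the data center platform was first built. The daily ranking computation took longer and longer, and Hive-based ad hoc computation could only handle queries at the scale of a single day. At the scale of a week the computation time was already hard to tolerate, and at the scale of a month the computation basically could not finish. Given the knowledge and experience we had built up with Spark, it was time to move the whole data center onto Spark.
In April 2014, Spark Summit China was held in Beijing, and our technical team attended with the aim of learning. At the event we found that many of our peers in China had already started building their own big data platforms on Spark, and that Spark had become one of the most active projects in the ASF. In addition, more and more big data products were integrating with Spark or migrating to it. Spark would clearly become a better ecosystem than Hadoop MapReduce. The conference strengthened our resolve to embrace Spark fully.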
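We began to re-architect the big data platform underlying the data center on top of YARN and Spark. The new data platform needed to support:
1. Near-real-time data collection and ETL;
2. Streaming data processing;
3. More efficient offline computation;
4. High-speed multidimensional analysis;
5. More efficient ad hoc analysis;
6. Efficient machine learning;
7. A unified data access interface;
8. A unified view of the data;
9. Flexible task scheduling.
The new architecture makes full use of YARN and Spark and incorporates some of the company's own accumulated technology. It is shown in Figure 3.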
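In the new architecture, Kafka serves as the channel for log collection. Logs from mobile devices, collected by several business systems, are written to Kafka in real time so that downstream consumers can read them easily.
Spark Streaming makes it convenient to consume and process the data in Kafka. In the overall architecture, Spark Streaming does the following work.
1. Storing raw logs. The raw logs in Kafka are saved losslessly to HDFS in JSON format.
2. Data cleaning and conversion. After cleaning and normalization, the data is converted to Parquet format and stored in HDFS for use by the various downstream computing jobs.
3. Predefined streaming computations, such as tag processing based on frequency rules. The results are stored directly in MongoDB.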
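The third item, tag processing based on frequency rules, can be illustrated with a small sketch in plain Python. This is not Spark Streaming code and not our actual rule set. The event names, tags, and thresholds are hypothetical, and each call to `process_batch` stands in for one micro-batch read from Kafka. The idea is only that running counts are kept per device and a tag is emitted the first time a count reaches its threshold.

```python
from collections import Counter, defaultdict

# Hypothetical rules: event type -> (tag, minimum number of events).
TAG_RULES = {
    "game_launch": ("gamer", 3),
    "shop_order": ("shopper", 2),
}


class FrequencyTagger:
    """Keeps running per-device event counts across micro-batches."""

    def __init__(self, rules):
        self.rules = rules
        self.counts = defaultdict(Counter)  # device_id -> Counter of event types
        self.tags = defaultdict(set)        # device_id -> tags assigned so far

    def process_batch(self, events):
        """Consume one micro-batch of (device_id, event_type) pairs.

        Returns the (device_id, tag) pairs newly assigned by this batch,
        i.e. what a streaming job would write to its result store.
        """
        new_tags = []
        for device_id, event_type in events:
            self.counts[device_id][event_type] += 1
            rule = self.rules.get(event_type)
            if rule is None:
                continue  # no tagging rule for this event type
            tag, threshold = rule
            reached = self.counts[device_id][event_type] >= threshold
            if reached and tag not in self.tags[device_id]:
                self.tags[device_id].add(tag)
                new_tags.append((device_id, tag))
        return new_tags


tagger = FrequencyTagger(TAG_RULES)
first = tagger.process_batch(
    [("d1", "game_launch"), ("d1", "game_launch"), ("d2", "shop_order")])
second = tagger.process_batch(
    [("d1", "game_launch"), ("d2", "shop_order"), ("d2", "page_view")])
```

In this example the first batch assigns no tags because no threshold has been reached yet. The second batch assigns "gamer" to d1 and "shopper" to d2.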

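Figure 3: The latest data center architecture, integrating YARN and Spark
The ranking jobs were reimplemented on Spark, benefiting from Spark's performance gains and from the efficient data access of Parquet columnar storage. With the data volume at 3 times its original size, the same job takes only 1/6 of the original time, roughly an 18-fold gain in throughput.
Beyond these performance gains, ad hoc multidimensional analysis, which had been very hard to deliver to the business, finally became possible. A multidimensional ad hoc analysis over one day of data used to take hours on Hive. On the new architecture it finishes in 2 minutes. Analysis over a week takes no more than about ten minutes. Multidimensional analysis over a month, which could not be completed on Hive at all, now produces results within two hours. The gradual maturing of Spark SQL has also made development easier.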
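Using the resource management capabilities of YARN, we also moved our in-house Bitmap engine, which is used for multidimensional analysis, onto YARN. For dimensions that are already fixed, Bitmap indexes can be built in advance. When a multidimensional analysis only needs dimensions that already have Bitmap indexes, the Bitmap engine answers it with Bitmap operations, which makes real-time multidimensional analysis possible.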
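The general bitmap-index technique can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the design of our Bitmap engine. The class, the dimension names, and the sample devices are all hypothetical, and Python integers stand in for compressed bitmaps. Each (dimension, value) pair gets one bitmap in which bit i is set when device i has that value, so a multidimensional filter becomes a bitwise AND.

```python
from collections import defaultdict


class BitmapIndex:
    """One bitmap per (dimension, value); bit i is set when device i matches."""

    def __init__(self):
        self.bitmaps = defaultdict(int)  # (dimension, value) -> int used as a bitset

    def add(self, device_index, attributes):
        """Index one device; `attributes` maps dimension name to value."""
        for dimension, value in attributes.items():
            self.bitmaps[(dimension, value)] |= 1 << device_index

    def query(self, **conditions):
        """Bitmap of devices matching every dimension=value condition.

        Returns 0 (the empty bitmap) when no condition is given.
        """
        result = None
        for dimension, value in conditions.items():
            bitmap = self.bitmaps.get((dimension, value), 0)
            result = bitmap if result is None else result & bitmap
        return 0 if result is None else result


def count_bits(bitmap):
    """Number of devices in the bitmap."""
    return bin(bitmap).count("1")


def device_indices(bitmap):
    """Positions of the set bits, lowest first."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


index = BitmapIndex()
DEVICES = [
    {"os": "android", "city": "beijing", "category": "game"},
    {"os": "android", "city": "shanghai", "category": "game"},
    {"os": "ios", "city": "beijing", "category": "game"},
    {"os": "android", "city": "beijing", "category": "finance"},
]
for i, attrs in enumerate(DEVICES):
    index.add(i, attrs)

android_in_beijing = index.query(os="android", city="beijing")
```

Because the bitmaps are built ahead of time, a query touches one bitmap per condition and never scans the underlying records.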
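To manage data more conveniently in the new architecture, we introduced a metadata management system based on HCatalog. Data definition, storage, and access all go through this system, which gives us a unified view of the data and makes data assets easier to manage.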
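YARN only provides resource scheduling. A big data platform also needs a distributed task scheduling system. In the new architecture, we developed our own distributed task scheduler with DAG support. Combined with YARN's resource scheduling, it handles scheduled jobs, ad hoc jobs, and pipelines composed of different jobs.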
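As an illustration of what DAG support means for a scheduler, the following plain-Python sketch groups the tasks of a pipeline into waves, where every task in a wave has all of its upstream tasks finished and can therefore run in parallel. It says nothing about how our scheduler is implemented. The function and the task names are hypothetical.

```python
def schedule(dependencies):
    """Group tasks into waves that can run in parallel.

    `dependencies` maps each task to the set of tasks it depends on.
    Raises ValueError if the graph contains a cycle (is not a DAG).
    """
    tasks = set(dependencies)
    for upstream in dependencies.values():
        tasks |= set(upstream)
    # Work on a copy so the caller's mapping is left untouched.
    remaining = {t: set(dependencies.get(t, ())) for t in tasks}
    waves = []
    while remaining:
        # A task is ready once all of its upstream tasks have been scheduled.
        ready = sorted(t for t, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected among: %s" % sorted(remaining))
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


# Hypothetical pipeline: task -> upstream tasks.
PIPELINE = {
    "etl_to_parquet": {"collect_logs"},
    "app_ranking": {"etl_to_parquet"},
    "tag_processing": {"etl_to_parquet"},
    "daily_report": {"app_ranking", "tag_processing"},
}
WAVES = schedule(PIPELINE)
```

Here the ranking and tag-processing tasks land in the same wave because both depend only on the ETL step. A real scheduler would then ask the resource manager for containers for each ready task.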
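Based on the new architecture built around YARN and Spark, we were able to deliver a self-service big data platform for the data business departments. They can use the platform to run multidimensional analysis on the data, extract data, and define their own tag processing. The self-service system has increased our ability to make use of the data and greatly improved the efficiency of doing so.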
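Some Pitfalls We Hit with Spark
Adopting any new technology means going from unfamiliar to familiar: from the delight of the first encounter, to being stuck and frustrated when difficulties arise, to the satisfaction of solving the problem. Spark, the rising star of big data, is no exception. Here are some of the pitfalls we ran into.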
[Pitfall 1: when running on a very large data set, you hit org.apache.spark.SparkException: Error communicating with MapOutputTracker]
This error is rather cryptic. From the error log it looks as if the Spark cluster has been partitioned, but if you watch the physical machines you will see very high disk I/O. Further analysis shows the cause: during the shuffle on a large data set, Spark generates too many temporary files, which overloads the operating system's disk I/O. Once the cause is known the fix is simple: set spark.shuffle.consolidateFiles to true. This parameter is false by default. On Linux with the ext4 file system, we recommend setting it to true. The official Spark documentation also recommends true on ext4 to improve performance.
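A rough calculation shows why the number of temporary files matters. The sketch below assumes the hash-based shuffle used by Spark releases of that period, in which each map task writes one file per reduce partition. With consolidation, map tasks that run one after another on the same core share one group of files. The job sizes are hypothetical, chosen only to show the order of magnitude.

```python
def shuffle_file_count(map_tasks, reduce_partitions, total_cores, consolidate):
    """Approximate number of shuffle files written by a hash-based shuffle."""
    if consolidate:
        # Map tasks running successively on the same core append to the same
        # group of files, so the count depends on cores, not on map tasks.
        writers = min(map_tasks, total_cores)
    else:
        # Every map task creates its own file for every reduce partition.
        writers = map_tasks
    return writers * reduce_partitions


# Hypothetical job size.
MAP_TASKS = 5000
REDUCE_PARTITIONS = 1000
TOTAL_CORES = 200

without_consolidation = shuffle_file_count(
    MAP_TASKS, REDUCE_PARTITIONS, TOTAL_CORES, consolidate=False)
with_consolidation = shuffle_file_count(
    MAP_TASKS, REDUCE_PARTITIONS, TOTAL_CORES, consolidate=True)
```

Under these assumptions the job writes 5,000,000 files without consolidation and 200,000 with it, which is consistent with the heavy disk I/O we observed.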
[Pitfall 2: Fetch failure errors at run time]
When running Spark programs on large data sets, you will often see Fetch failure errors. Because Spark is designed to be fault tolerant, most fetch failures succeed after a retry, so the Spark job as a whole completes normally. The retries, however, make the run noticeably longer. The root causes of fetch failures vary. The error itself means that a task could not read shuffle data from a remote node. To find the specific cause you need to use:

to inspect Spark's run logs and track down the root cause of the fetch failure. Most of these problems can be solved with sensible parameter settings and by optimizing the program. Chen Chao's session at Spark Summit China 2014 offers very good advice on tuning Spark performance.
Of course, we ran into other problems while using Spark as well. But since Spark is open source, most of them could be resolved by reading the source code and with help from the open source community.
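Next Steps
Spark made great strides in 2014, and the big data ecosystem around it is steadily maturing. Spark 1.3 introduces a new DataFrame API that will make data processing in Spark friendlier. Tachyon, a distributed caching system that also comes from AMPLab, is attracting attention because of its good integration with Spark. In our business scenarios, much of the base data is reused by many different Spark jobs, so our next step is to introduce Tachyon into the architecture as a caching layer. In addition, as SSDs become more common, we plan to add SSD storage to every machine in the cluster and configure Spark to write shuffle output to SSD, using the fast random read and write performance of SSDs to further improve processing efficiency.
On the machine learning side, the H2O machine learning engine now integrates well with Spark, which has produced Sparkling Water. We believe that with Sparkling Water, even a startup like ours can use the power of deep learning to mine further value from our data.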
Conclusion
In 2004, Google's MapReduce paper opened the era of big data processing, and for nearly ten years Hadoop MapReduce was synonymous with it. Matei Zaharia's 2012 paper on RDDs, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," signaled the arrival of a new era in big data processing technology. With advances in hardware, widespread demand for low-latency big data processing, and the growing use of data mining in the big data field, Spark, as a brand-new big data ecosystem, is gradually replacing traditional MapReduce and becoming the leading technology of the new generation. Our own evolution from MapReduce to Spark over the past two years is broadly representative of the path taken by a good share of practitioners in the field. We believe that as the Spark ecosystem matures, more and more companies will move their data processing to Spark. As more big data engineers become familiar with Spark, the Spark community in China will grow more active. And because Spark is an open source platform, we expect more and more Chinese developers to become contributors to Spark-related projects, making Spark more mature and more powerful.