Äú¿ÉÒÔ¾èÖú£¬Ö§³ÖÎÒÃǵĹ«ÒæÊÂÒµ¡£

1Ôª 10Ôª 50Ôª





ÈÏÖ¤Â룺  ÑéÖ¤Âë,¿´²»Çå³þ?Çëµã»÷Ë¢ÐÂÑéÖ¤Âë ±ØÌî



  ÇóÖª ÎÄÕ ÎÄ¿â Lib ÊÓÆµ iPerson ¿Î³Ì ÈÏÖ¤ ×Éѯ ¹¤¾ß ½²×ù Model Center   Code  
»áÔ±   
   
 
     
   
 ¶©ÔÄ
  ¾èÖú
Spark SQL ÔÚ×Ö½ÚÌø¶¯Êý¾Ý²Ö¿âÁìÓòµÄÓÅ»¯Êµ¼ù
 
×÷Õߣº¹ù¿¡
  2799  次浏览      28
2021-12-2
 
±à¼­ÍƼö:
±¾ÎÄÖ÷ÒªÖ÷Òª·ÖΪËĸö²¿·Ö£ºµÚÒ»¸ö²¿·ÖÊÇ Spark SQL ¼Ü¹¹¼ò½é£»µÚ¶þ²¿·Ö½éÉÜ×Ö½ÚÌø¶¯ÔÚ Spark SQL ÒýÇæÉϵÄÓÅ»¯Êµ¼ù£»µÚÈý²¿·ÖÊÇ×Ö½ÚÌø¶¯ÔÚ Spark Shuffle Îȶ¨ÐÔÌáÉýÓëÐÔÄÜÓÅ»¯£»µÚËIJ¿·ÖÊÇ Spark SQL δÀ´¹æ»®ÓëÕ¹Íû¡£Ï£Íû¶ÔÄúµÄѧϰÓÐËù°ïÖú¡£
±¾ÎÄÀ´×ÔInfoQ£¬ÓÉ»ðÁú¹ûÈí¼þLinda±à¼­¡¢ÍƼö¡£

µ¼¶Á£ºSpark SQL ÊÇ×Ö½ÚÌø¶¯ÄÚ²¿×îÖØÒªµÄ²éѯÒýÇæÖ®Ò»£¬ËüÿÌì´¦Àí°ÙÍòÒÚ¼¶Êý¾Ý£¬µ¥ÈÎÎñ Shuffle Êý¾ÝÁ¿¿É³¬¹ý 200TB¡£²»¹ýÒòΪ Spark ÓëÆäËüϵͳ»ìºÏ²¿Êð£¬Òò´ËÐÔÄÜÓëÎȶ¨ÐÔÎÊÌâ ¶¼ÊÇÐèÒªÖØµã½â¾öµÄ¡£±¾ÎÄÖ÷Òª·ÖΪËĸö²¿·Ö£ºµÚÒ»¸ö²¿·ÖÊÇ Spark SQL ¼Ü¹¹¼ò½é£»µÚ¶þ²¿·Ö½éÉÜ×Ö½ÚÌø¶¯ÔÚ Spark SQL ÒýÇæÉϵÄÓÅ»¯Êµ¼ù£»µÚÈý²¿·ÖÊÇ×Ö½ÚÌø¶¯ÔÚ Spark Shuffle Îȶ¨ÐÔÌáÉýÓëÐÔÄÜÓÅ»¯£»µÚËIJ¿·ÖÊÇ Spark SQL δÀ´¹æ»®ÓëÕ¹Íû¡£

Spark SQL ¼Ü¹¹¼ò½é

ÎÒÃÇÏȼòµ¥ÁÄһϠSpark SQL µÄ¼Ü¹¹¡£ÏÂÃæÕâÕÅͼÃèÊöÁËÒ»Ìõ SQL Ìá½»Ö®ºóÐèÒª¾­ÀúµÄ¼¸¸ö½×¶Î£¬½áºÏÕâЩ½×¶Î¾Í¿ÉÒÔ¿´µ½ÔÚÄÄЩ»·½Ú¿ÉÒÔ×öÓÅ»¯¡£

ºÜ¶àʱºò£¬×öÊý¾Ý²Ö¿â½¨Ä£µÄͬѧ¸üÇãÏòÓÚÖ±½Óд SQL ¶ø·ÇʹÓà Spark µÄ DSL¡£Ò»Ìõ SQL Ìá½»Ö®ºó»á±» Parser ½âÎö²¢×ª»¯Îª Unresolved Logical Plan¡£ËüµÄÖØµãÊÇ Logical Plan Ò²¼´Âß¼­¼Æ»®£¬ËüÃèÊöÁËÏ£Íû×öʲôÑùµÄ²éѯ¡£Unresolved ÊÇÖ¸¸Ã²éѯÏà¹ØµÄһЩÐÅϢδ֪£¬±ÈÈç²»ÖªµÀ²éѯµÄÄ¿±ê±íµÄ Schema ÒÔ¼°Êý¾ÝλÖá£

ÉÏÊöÐÅÏ¢´æÓÚ Catalog ÄÚ¡£ÔÚÉú²ú»·¾³ÖУ¬Ò»°ãÓÉ Hive Metastore Ìṩ Catalog ·þÎñ¡£Analyzer »á½áºÏ Catalog ½« Unresolved Logical Plan ת»»Îª Resolved Logical Plan¡£

µ½ÕâÀﻹ²»¹»¡£²»Í¬µÄÈËд³öÀ´µÄ SQL ²»Ò»Ñù£¬Éú³ÉµÄ Resolved Logical Plan Ò²¾Í²»Ò»Ñù£¬Ö´ÐÐЧÂÊÒ²²»Ò»Ñù¡£ÎªÁ˱£Ö¤ÎÞÂÛÓû§ÈçºÎд SQL ¶¼¿ÉÒÔ¸ßЧµÄÖ´ÐУ¬Spark SQL ÐèÒª¶Ô Resolved Logical Plan ½øÐÐÓÅ»¯£¬Õâ¸öÓÅ»¯ÓÉ Optimizer Íê³É¡£Optimizer °üº¬ÁËһϵÁйæÔò£¬¶Ô Resolved Logical Plan ½øÐеȼÛת»»£¬×îÖÕÉú³É Optimized Logical Plan¡£¸Ã Optimized Logical Plan ²»Äܱ£Ö¤ÊÇÈ«¾Ö×îÓŵ쬵«ÖÁÉÙÊǽӽü×îÓŵġ£

ÉÏÊö¹ý³ÌÖ»Óë SQL Óйأ¬Óë²éѯÓйأ¬µ«ÊÇÓë Spark Î޹أ¬Òò´ËÎÞ·¨Ö±½ÓÌá½»¸ø Spark Ö´ÐС£Query Planner ¸ºÔ𽫠Optimized Logical Plan ת»»Îª Physical Plan£¬½ø¶ø¿ÉÒÔÖ±½ÓÓÉ Spark Ö´ÐС£

ÓÉÓÚͬһÖÖÂß¼­Ëã×Ó¿ÉÒÔÓжàÖÖÎïÀíʵÏÖ¡£Èç Join ÓжàÖÖʵÏÖ£¬ShuffledHashJoin¡¢BroadcastHashJoin¡¢BroadcastNestedLoopJoin¡¢SortMergeJoin µÈ¡£Òò´Ë Optimized Logical Plan ¿É±» Query Planner ת»»Îª¶à¸ö Physical Plan¡£ÈçºÎÑ¡Ôñ×îÓÅµÄ Physical Plan ³ÉΪһ¼þ·Ç³£Ó°Ïì×îÖÕÖ´ÐÐÐÔÄܵÄÊÂÇé¡£Ò»ÖֱȽϺõķ½Ê½ÊÇ£¬¹¹½¨Ò»¸ö Cost Model£¬²¢¶ÔËùÓкòÑ¡µÄ Physical Plan Ó¦Óøà Model ²¢ÌôÑ¡ Cost ×îСµÄ Physical Plan ×÷Ϊ×îÖÕµÄ Selected Physical Plan¡£

Physical Plan ¿ÉÖ±½Óת»»³É RDD ÓÉ Spark Ö´ÐС£ÎÒÃǾ­³£Ëµ¡°¼Æ»®¸Ï²»Éϱ仯¡±£¬ÔÚÖ´Ðйý³ÌÖУ¬¿ÉÄÜ·¢ÏÖÔ­¼Æ»®²»ÊÇ×îÓŵģ¬ºóÐøÖ´Ðмƻ®Èç¹ûÄܸù¾ÝÔËÐÐʱµÄͳ¼ÆÐÅÏ¢½øÐе÷Õû¿ÉÄÜÌáÉýÕûÌåÖ´ÐÐЧÂÊ¡£Õⲿ·Ö¶¯Ì¬µ÷ÕûÓÉ Adaptive Execution Íê³É¡£

ºóÃæ½éÉÜ×Ö½ÚÌø¶¯ÔÚ Spark SQL ÉÏ×öµÄһЩÓÅ»¯£¬Ö÷ÒªÎ§ÈÆÕâÒ»½Ú½éÉܵÄÂß¼­¼Æ»®ÓÅ»¯ÓëÎïÀí¼Æ»®ÓÅ»¯Õ¹¿ª¡£

Spark SQL ÒýÇæÓÅ»¯

Bucket ¸Ä½ø

ÔÚ Spark Àʵ¼Ê²¢Ã»ÓÐ Bucket Join Ëã×Ó¡£ÕâÀï˵µÄ Bucket Join ·ºÖ¸²»ÐèÒª Shuffle µÄ SortMergeJoin¡£

ÏÂͼչʾÁË SortMergeJoin µÄ»ù±¾Ô­Àí¡£ÓÃÐéÏß¿ò´ú±íµÄ Table 1 ºÍ Table 2 ÊÇÁ½ÕÅÐèÒª°´Ä³×ֶνøÐÐ Join µÄ±í¡£ÐéÏß¿òÄÚµÄ partition 0 µ½ partition m ÊǸñíת»»³É RDD ºóµÄ Partition£¬¶ø·Ç±íµÄ·ÖÇø¡£¼ÙÉè Table 1 Óë Table 2 ת»»Îª RDD ºó·Ö±ð°üº¬ m ºÍ k ¸ö Partition¡£ÎªÁ˽øÐÐ Join£¬ÐèҪͨ¹ý Shuffle ±£Ö¤Ïàͬ Join Key µÄÊý¾ÝÔÚͬһ¸ö Partition ÄÚÇÒ Partition ÄÚ°´ Key ÅÅÐò£¬Í¬Ê±±£Ö¤ Table 1 Óë Table 2 ¾­¹ý Shuffle ºóµÄ RDD µÄ Partition ÊýÏàͬ¡£

ÈçÏÂͼËùʾ£¬¾­¹ý Shuffle ºóÖ»ÐèÒªÆô¶¯ n ¸ö Task£¬Ã¿¸ö Task ´¦Àí Table 1 Óë Table 2 ÖжÔÓ¦ Partition µÄÊý¾Ý½øÐÐ Join ¼´¿É¡£Èç Task 0 Ö»ÐèҪ˳ÐòɨÃè Shuffle ºóµÄ×óÓÒÁ½±ßµÄ partition 0 ¼´¿ÉÍê³É Join¡£

¸Ã·½·¨µÄÓÅÊÆÊÇÊÊÓó¡¾°¹ã£¬¼¸ºõ¿ÉÓÃÓÚÈÎÒâ´óСµÄÊý¾Ý¼¯¡£ÁÓÊÆÊÇÿ´Î Join ¶¼ÐèÒª¶ÔÈ«Á¿Êý¾Ý½øÐÐ Shuffle£¬¶ø Shuffle ÊÇ×îÓ°Ïì Spark SQL ÐÔÄܵĻ·½Ú¡£Èç¹ûÄܱÜÃâ Shuffle ÍùÍùÄÜ´ó·ùÌáÉý Spark SQL ÐÔÄÜ¡£

¶ÔÓÚ´óÊý¾ÝµÄ³¡¾°À´½²£¬Êý¾ÝÒ»°ãÊÇÒ»´ÎдÈë¶à´Î²éѯ¡£Èç¹û¾­³£¶ÔÁ½ÕÅ±í°´Ïàͬ»òÀàËÆµÄ·½Ê½½øÐÐ Join£¬Ã¿´Î¶¼ÐèÒª¸¶³ö Shuffle µÄ´ú¼Û¡£ÓëÆäÕâÑù£¬²»ÈçÈÃÊý¾ÝÔÚдµÄʱºò£¬¾ÍÈÃÊý¾Ý°´ÕÕÀûÓÚ Join µÄ·½Ê½·Ö²¼£¬´Ó¶øÊ¹µÃ Join ʱÎÞÐè½øÐÐ Shuffle¡£ÈçÏÂͼËùʾ£¬Table 1 Óë Table 2 ÄÚµÄÊý¾Ý°´ÕÕÏàͬµÄ Key ½øÐзÖͰÇÒͰÊý¶¼Îª n£¬Í¬Ê±Í°ÄÚ°´¸Ã Key ÅÅÐò¡£¶ÔÕâÁ½ÕÅ±í½øÐÐ Join ʱ£¬¿ÉÒÔ±ÜÃâ Shuffle£¬Ö±½ÓÆô¶¯ n ¸ö Task ½øÐÐ Join¡£

×Ö½ÚÌø¶¯¶Ô Spark SQL µÄ BucketJoin ×öÁËËÄÏî±È½Ï´óµÄ¸Ä½ø¡£

¸Ä½øÒ»£ºÖ§³ÖÓë Hive ¼æÈÝ

ÔÚ¹ýÈ¥Ò»¶Îʱ¼ä£¬×Ö½ÚÌø¶¯°Ñ´óÁ¿µÄ Hive ×÷ÒµÇ¨ÒÆµ½ÁË SparkSQL¡£¶ø Hive Óë Spark SQL µÄ Bucket ±í²»¼æÈÝ¡£¶ÔÓÚʹÓà Bucket ±íµÄ³¡¾°£¬Èç¹ûÖ±½Ó¸üмÆËãÒýÇæ£¬»áÔì³É Spark SQL дÈë Hive Bucket ±íµÄÊý¾ÝÎÞ·¨±»ÏÂÓ뵀 Hive ×÷Òµµ±³É Bucket ±í½øÐÐ Bucket Join£¬´Ó¶øÔì³É×÷ÒµÖ´ÐÐʱ¼ä±ä³¤£¬¿ÉÄÜÓ°Ïì SLA¡£

ΪÁ˽â¾öÕâ¸öÎÊÌ⣬ÎÒÃÇÈà Spark SQL Ö§³Ö Hive ¼æÈÝģʽ£¬´Ó¶ø±£Ö¤ Spark SQL дÈëµÄ Bucket ±íÓë Hive дÈëµÄ Bucket ±íЧ¹ûÒ»Ö£¬²¢ÇÒÕâÖÖ±í¿ÉÒÔ±» Hive ºÍ Spark SQL µ±³É Bucket ±í½øÐÐ Bucket Join ¶ø²»ÐèÒª Shuffle¡£Í¨¹ýÕâÖÖ·½Ê½±£Ö¤ Hive Ïò Spark SQL µÄ͸Ã÷Ç¨ÒÆ¡£

µÚÒ»¸öÐèÒª½â¾öµÄÎÊÌâÊÇ£¬Hive µÄÒ»¸ö Bucket Ò»°ãÖ»°üº¬Ò»¸öÎļþ£¬¶ø Spark SQL µÄÒ»¸ö Bucket ¿ÉÄܰüº¬¶à¸öÎļþ¡£½â¾ö°ì·¨ÊǶ¯Ì¬Ôö¼ÓÒ»´ÎÒÔ Bucket Key Ϊ Key ²¢ÇÒ²¢ÐжÈÓë Bucket ¸öÊýÏàͬµÄ Shuffle¡£

µÚ¶þ¸öÐèÒª½â¾öµÄÎÊÌâÊÇ£¬Hive 1.x µÄ¹þÏ£·½Ê½Óë Spark SQL 2.x µÄ¹þÏ£·½Ê½£¨Murmur3Hash£©²»Í¬£¬Ê¹µÃÏàͬµÄÊý¾ÝÔÚ Hive ÖÐµÄ Bucket ID Óë Spark SQL ÖÐµÄ Bucket ID ²»Í¬¶øÎÞ·¨Ö±½Ó Join¡£ÔÚ Hive ¼æÈÝģʽÏ£¬ÎÒÃÇÈÃÉÏÊö¶¯Ì¬Ôö¼ÓµÄ Shuffle ʹÓà Hive ÏàͬµÄ¹þÏ£·½Ê½£¬´Ó¶ø½â¾ö¸ÃÎÊÌâ¡£

¸Ä½ø¶þ£ºÖ§³Ö±¶Êý¹ØÏµ Bucket Join

Spark SQL ÒªÇóÖ»ÓÐ Bucket ÏàͬµÄ±í²ÅÄÜ£¨±ØÒª·Ç³ä·ÖÌõ¼þ£©½øÐÐ Bucket Join¡£¶ÔÓÚÁ½ÕÅ´óСÏà²îºÜ´óµÄ±í£¬±ÈÈ缸°Ù GB µÄά¶È±íÓ뼸ʮ TB £¨µ¥·ÖÇø£©µÄÊÂʵ±í£¬ËüÃÇµÄ Bucket ¸öÊýÍùÍù²»Í¬£¬²¢ÇÒ¸öÊýÏà²îºÜ¶à£¬Ä¬ÈÏÎÞ·¨½øÐÐ Bucket Join¡£Òò´ËÎÒÃÇͨ¹ýÁ½ÖÖ·½Ê½Ö§³ÖÁ˱¶Êý¹ØÏµµÄ Bucket Join£¬¼´µ±Á½ÕÅ Bucket ±íµÄ Bucket ÊýÊDZ¶Êý¹ØÏµÊ±Ö§³Ö Bucket Join¡£

µÚÒ»ÖÖ·½Ê½£¬Task ¸öÊýÓëС±í Bucket ¸öÊýÏàͬ¡£ÈçÏÂͼËùʾ£¬Table A °üº¬ 3 ¸ö Bucket£¬Table B °üº¬ 6 ¸ö Bucket¡£´Ëʱ Table B µÄ bucket 0 Óë bucket 3 µÄÊý¾ÝºÏ¼¯Ó¦¸ÃÓë Table A µÄ bucket 0 ½øÐÐ Join¡£ÕâÖÖÇé¿öÏ£¬¿ÉÒÔÆô¶¯ 3 ¸ö Task¡£ÆäÖÐ Task 0 ¶Ô Table A µÄ bucket 0 Óë Table B µÄ bucket 0 + bucket 3 ½øÐÐ Join¡£ÔÚÕâÀÐèÒª¶Ô Table B µÄ bucket 0 Óë bucket 3 µÄÊý¾ÝÔÙ×öÒ»´Î merge sort ´Ó¶ø±£Ö¤ºÏ¼¯ÓÐÐò¡£

Èç¹û Table A Óë Table B µÄ Bucket ¸öÊýÏà²î²»´ó£¬¿ÉÒÔʹÓÃÉÏÊö·½Ê½¡£Èç¹û Table B µÄ Bucket ¸öÊýÊÇ Bucket A Bucket ¸öÊýµÄ 10 ±¶£¬ÄÇÉÏÊö·½Ê½ËäÈ»±ÜÃâÁË Shuffle£¬µ«¿ÉÄÜÒòΪ²¢ÐжȲ»¹»·´¶ø±È°üº¬ Shuffle µÄ SortMergeJoin ËÙ¶ÈÂý¡£´Ëʱ¿ÉÒÔʹÓÃÁíÍâÒ»ÖÖ·½Ê½£¬¼´ Task ¸öÊýÓë´ó±í Bucket ¸öÊýÏàµÈ£¬ÈçÏÂͼËùʾ£º

Ôڸ÷½°¸Ï£¬¿É½« Table A µÄ 3 ¸ö Bucket ¶Á¶à´Î¡£ÔÚÉÏͼÖУ¬Ö±½Ó½« Table A Óë Table A ½øÐÐ Bucket Union £¨ÐµÄËã×Ó£¬Óë Union ÀàËÆ£¬µ«±£ÁôÁË Bucket ÌØÐÔ£©£¬½á¹ûÏ൱ÓÚ 6 ¸ö Bucket£¬Óë Table B µÄ Bucket ¸öÊýÏàͬ£¬´Ó¶ø¿ÉÒÔ½øÐÐ Bucket Join¡£

¸Ä½øÈý£ºÖ§³Ö BucketJoin ½µ¼¶

¹«Ë¾ÄÚ²¿¹ýȥʹÓà Bucket µÄ±í½ÏÉÙ£¬ÔÚÎÒÃÇ¶Ô Bucket ×öÁËһϵÁиĽøºó£¬´óÁ¿Óû§Ï£Íû½«±íת»»Îª Bucket ±í¡£×ª»»ºó£¬±íµÄÔªÐÅÏ¢ÏÔʾ¸Ã±íΪ Bucket ±í£¬¶øÀúÊ··ÖÇøÄÚµÄÊý¾Ý²¢Î´°´ Bucket ±íÒªÇó·Ö²¼£¬ÔÚ²éѯÀúÊ·Êý¾Ýʱ»á³öÏÖÎÞ·¨Ê¶±ð Bucket µÄÎÊÌâ¡£

ͬʱ£¬ÓÉÓÚÊý¾ÝÁ¿ÉÏÕǿ죬ƽ¾ù Bucket ´óСҲ¿ìËÙÔö³¤¡£Õâ»áÔì³Éµ¥ Task ÐèÒª´¦ÀíµÄÊý¾ÝÁ¿¹ý´ó½ø¶øÒýÆðʹÓà Bucket ºóµÄЧ¹û¿ÉÄܲ»ÈçÖ±½ÓʹÓûùÓÚ Shuffle µÄ Join¡£

ΪÁ˽â¾öÉÏÊöÎÊÌ⣬ÎÒÃÇʵÏÖÁËÖ§³Ö½µ¼¶µÄ Bucket ±í¡£»ù±¾Ô­ÀíÊÇ£¬Ã¿´ÎÐÞ¸Ä Bucket ÐÅÏ¢£¨°üº¬ÉÏÊöÁ½ÖÖÇé¿ö¡ª¡ª½«·Ç Bucket ±íתΪ Bucket ±í£¬ÒÔ¼°ÐÞ¸Ä Bucket ¸öÊý£©Ê±£¬¼Ç¼ÐÞ¸ÄÈÕÆÚ¡£²¢ÇÒÔÚ¾ö¶¨Ê¹ÓÃÄÄÖÖ Join ·½Ê½Ê±£¬¶ÔÓÚ Bucket ±íÏȼì²éËù²éѯµÄÊý¾ÝÊÇ·ñÖ»°üº¬¸ÃÈÕÆÚÖ®ºóµÄ·ÖÇø¡£Èç¹ûÊÇ£¬Ôòµ±³É Bucket ±í´¦Àí£¬Ö§³Ö Bucket Join£»·ñÔòµ±³ÉÆÕͨÎÞ Bucket µÄ±í¡£

¸Ä½øËÄ£ºÖ§³Ö³¬¼¯

¶ÔÓÚÒ»Õų£ÓÃ±í£¬¿ÉÄÜ»áÓëÁíÍâÒ»ÕÅ±í°´ User ×Ö¶Î×ö Join£¬Ò²¿ÉÄÜ»áÓëÁíÍâÒ»ÕÅ±í°´ User ºÍ App ×Ö¶Î×ö Join£¬ÓëÆäËü±í°´ User Óë Item ×ֶνøÐÐ Join¡£¶ø Spark SQL Ô­ÉúµÄ Bucket Join ÒªÇó Join Key Set Óë±íµÄ Bucket Key Set ÍêÈ«Ïàͬ²ÅÄܽøÐÐ Bucket Join¡£Ôڸó¡¾°ÖУ¬²»Í¬ Join µÄ Key Set ²»Í¬£¬Òò´ËÎÞ·¨Í¬Ê±Ê¹Óà Bucket Join¡£Õ⼫´óµÄÏÞÖÆÁË Bucket Join µÄÊÊÓó¡¾°¡£

Õë¶Ô´ËÎÊÌ⣬ÎÒÃÇÖ§³ÖÁ˳¬¼¯³¡¾°Ï嵀 Bucket Join¡£Ö»Òª Join Key Set °üº¬ÁË Bucket Key Set£¬¼´¿É½øÐÐ Bucket Join¡£

ÈçÏÂͼËùʾ£¬Table X Óë Table Y£¬¶¼°´×Ö¶Î A ·Ö Bucket¡£¶ø²éѯÐèÒª¶Ô Table X Óë Table Y ½øÐÐ Join£¬ÇÒ Join Key Set Ϊ A Óë B¡£´Ëʱ£¬ÓÉÓÚ A ÏàµÈµÄÊý¾Ý£¬ÔÚÁ½±íÖÐµÄ Bucket ID Ïàͬ£¬ÄÇ A Óë B ¸÷×ÔÏàµÈµÄÊý¾ÝÔÚÁ½±íÖÐµÄ Bucket ID ¿Ï¶¨Ò²Ïàͬ£¬ËùÒÔÊý¾Ý·Ö²¼ÊÇÂú×ã Join ÒªÇóµÄ£¬²»ÐèÒª Shuffle¡£Í¬Ê±£¬Bucket Join »¹ÐèÒª±£Ö¤Á½±í°´ Join Key Set ¼´ A ºÍ B ÅÅÐò£¬´ËʱֻÐèÒª¶Ô Table X Óë Table Y ½øÐзÖÇøÄÚÅÅÐò¼´¿É¡£ÓÉÓÚÁ½±ßÒѾ­°´×Ö¶Î A ÅÅÐòÁË£¬´ËʱÔÙ°´ A Óë B ÅÅÐò£¬´ú¼ÛÏà¶Ô½ÏµÍ¡£

ÎﻯÁÐ

Spark SQL ´¦ÀíǶÌ×ÀàÐÍÊý¾Ýʱ£¬´æÔÚÒÔÏÂÎÊÌ⣺

¶ÁÈ¡´óÁ¿²»±ØÒªµÄÊý¾Ý£º¶ÔÓÚ Parquet / ORC µÈÁÐʽ´æ´¢¸ñʽ£¬¿ÉÖ»¶ÁÈ¡ÐèÒªµÄ×ֶΣ¬¶øÖ±½ÓÌø¹ýÆäËü×ֶΣ¬´Ó¶ø¼«´ó½ÚÊ¡ IO¡£¶ø¶ÔÓÚǶÌ×Êý¾ÝÀàÐ͵Ä×ֶΣ¬ÈçÏÂͼÖÐµÄ Map ÀàÐ굀 people ×ֶΣ¬ÍùÍùÖ»ÐèÒª¶ÁÈ¡ÆäÖеÄ×Ó×ֶΣ¬Èç people.age¡£È´ÐèÒª½«Õû¸ö Map ÀàÐ굀 people ×Ö¶ÎÈ«²¿¶ÁÈ¡³öÀ´È»ºó³éÈ¡³ö people.age ×ֶΡ£Õâ»áÒýÈë´óÁ¿µÄÎÞÒâÒåµÄ IO ¿ªÏú¡£ÔÚÎÒÃǵij¡¾°ÖУ¬´æÔÚ²»ÉÙ Map ÀàÐ͵Ä×ֶΣ¬¶øÇҺܶà°üº¬¼¸Ê®ÖÁ¼¸°Ù¸ö Key£¬ÕâÒ²¾ÍÒâζ×Å IO ±»·Å´óÁ˼¸Ê®ÖÁ¼¸°Ù±¶£»

ÎÞ·¨½øÐÐÏòÁ¿»¯¶ÁÈ¡£º¶øÏòÁ¿»¯¶ÁÄܼ«´óµÄÌáÉýÐÔÄÜ¡£µ«½ØÖ¹µ½Ä¿Ç°£¨2019 Äê 10 Ô 26 ÈÕ£©£¬Spark ²»Ö§³Ö°üº¬Ç¶Ì×Êý¾ÝÀàÐ͵ÄÏòÁ¿»¯¶ÁÈ¡¡£Õ⼫´óµÄÓ°ÏìÁ˰üº¬Ç¶Ì×Êý¾ÝÀàÐ͵IJéѯÐÔÄÜ£»

²»Ö§³Ö Filter ÏÂÍÆ£ºÄ¿Ç°£¨2019 Äê 10 Ô 26 ÈÕ£©µÄ Spark ²»Ö§³ÖǶÌ×ÀàÐÍ×Ö¶ÎÉ쵀 Filter µÄÏÂÍÆ£»

ÖØ¸´¼ÆË㣺JSON ×ֶΣ¬ÔÚ Spark SQL ÖÐÒÔ String ÀàÐÍ´æÔÚ£¬ÑϸñÀ´Ëµ²»ËãǶÌ×Êý¾ÝÀàÐÍ¡£²»¹ýʵ¼ùÖÐÒ²³£ÓÃÓÚ±£´æ²»¹Ì¶¨µÄ¶à¸ö×ֶΣ¬ÔÚ²éѯʱͨ¹ý JSON Path ³éȡĿ±ê×Ó×ֶΣ¬¶ø´óÐÍ JSON ×Ö·û´®µÄ×ֶγéÈ¡·Ç³£ÏûºÄ CPU¡£¶ÔÓÚÈȵã±í£¬Æµ·±Öظ´³éÈ¡Ïàͬ×Ó×ֶηdz£ÀË·Ñ×ÊÔ´¡£

¶ÔÓÚÕâ¸öÎÊÌ⣬×öÊý²ÖµÄͬѧҲÏëÁËһЩ½â¾ö·½°¸¡£ÈçÏÂͼËùʾ£¬ÔÚÃûΪ base_table µÄ±íÖ®Íâ´´½¨ÁËÒ»ÕÅÃûΪ sub_table µÄ±í£¬²¢ÇÒ½«¸ßƵʹÓõÄ×Ó×Ö¶Î people.age ÉèÖÃΪһ¸ö¶îÍâµÄ Integer ÀàÐ͵Ä×ֶΡ£ÏÂÓβ»ÔÙͨ¹ý base_table ²éѯ people.age£¬¶øÊÇʹÓà sub_table É쵀 age ×ֶδúÌæ¡£Í¨¹ýÕâÖÖ·½Ê½£¬½«Ç¶Ì×ÀàÐÍ×Ö¶ÎÉϵIJéѯתΪÁË Primitive ÀàÐÍ×ֶεIJéѯ£¬Í¬Ê±½â¾öÁËÉÏÊöÎÊÌâ¡£

ÕâÖÖ·½°¸´æÔÚÃ÷ÏÔȱÏÝ£º

¶îÍâά»¤ÁËÒ»ÕÅ±í£¬ÒýÈëÁË´óÁ¿µÄ¶îÍâ´æ´¢/¼ÆË㿪Ïú£»

ÎÞ·¨ÔÚбíÉϲéѯÐÂÔö×ֶεÄÀúÊ·Êý¾Ý£¨ÈçÒªÖ§³Ö¶ÔÀúÊ·Êý¾ÝµÄ²éѯ£¬ÐèÒªÖØÅÜÀúÊ·×÷Òµ£¬¿ªÏú¹ý´ó£¬ÎÞ·¨½ÓÊÜ£©£»

±íµÄά»¤·½ÐèÒªÔÚÐ޸ıí½á¹¹ºóÐ޸IJåÈëÊý¾ÝµÄ×÷Òµ£»

ÐèÒªÏÂÓβéѯ·½Ð޸IJéѯÓï¾ä£¬Íƹã³É±¾½Ï´ó£»

ÔËÓª³É±¾¸ß£ºÈç¹û¸ßƵ×Ó×ֶα仯£¬ÐèҪɾ³ý²»ÔÙÐèÒªµÄ¶ÀÁ¢×Ó×ֶΣ¬²¢Ìí¼ÓÐÂ×Ó×Ö¶ÎΪ¶ÀÁ¢×ֶΡ£É¾³ýǰ£¬ÐèҪȷ±£ÏÂÓÎÎÞÒµÎñʹÓøÃ×ֶΡ£¶øÐÂÔö×Ö¶ÎÐèҪ֪ͨ²¢ÍƽøÏÂÓÎÒµÎñ·½Ê¹ÓÃÐÂ×ֶΡ£

Ϊ½â¾öÉÏÊöËùÓÐÎÊÌ⣬ÎÒÃÇÉè¼Æ²¢ÊµÏÖÁËÎﻯÁС£ËüµÄÔ­ÀíÊÇ£º

ÐÂÔöÒ»¸ö Primitive ÀàÐÍ×ֶΣ¬±ÈÈç Integer ÀàÐ굀 age ×ֶΣ¬²¢ÇÒÖ¸¶¨ËüÊÇ people.age µÄÎﻯ×ֶΣ»

²åÈëÊý¾Ýʱ£¬ÎªÎﻯ×Ö¶Î×Ô¶¯Éú³ÉÊý¾Ý£¬²¢ÔÚ Partition Parameter ÄÚ±£´æÎﻯ¹ØÏµ¡£Òò´Ë¶Ô²åÈëÊý¾ÝµÄ×÷ÒµÍêȫ͸Ã÷£¬±íµÄά»¤·½²»ÐèÒªÐÞ¸ÄÒÑÓÐ×÷Òµ£»

²éѯʱ£¬¼ì²éËùÐè²éѯµÄËùÓÐ Partition£¬Èç¹û¶¼°üº¬ÎﻯÐÅÏ¢£¨people.age µ½ age µÄÓ³É䣩£¬Ö±½Ó½« select people.age ×Ô¶¯ÖØÐ´Îª select age£¬´Ó¶øÊµÏÖ¶ÔÏÂÓβéѯ·½µÄÍêȫ͸Ã÷ÓÅ»¯¡£Í¬Ê±¼æÈÝÀúÊ·Êý¾Ý¡£

ÏÂͼչʾÁËÔÚijÕźËÐıíÉÏʹÓÃÎﻯÁеÄÊÕÒæ£º

ÎﻯÊÓͼ

ÔÚ OLAP ÁìÓò£¬¾­³£»á¶ÔÏàͬ±íµÄijЩ¹Ì¶¨×ֶνøÐÐ Group By ºÍ Aggregate / Join µÈºÄʱ²Ù×÷£¬Ôì³É´óÁ¿Öظ´ÐÔ¼ÆË㣬ÀË·Ñ×ÊÔ´£¬ÇÒÓ°Ïì²éѯÐÔÄÜ£¬²»ÀûÓÚÌáÉýÓû§ÌåÑé¡£

ÎÒÃÇʵÏÖÁË»ùÓÚÎﻯÊÓͼµÄÓÅ»¯¹¦ÄÜ£º

ÈçÉÏͼËùʾ£¬²éѯÀúÊ·ÏÔʾ´óÁ¿²éѯ¸ù¾Ý user ½øÐÐ group by£¬È»ºó¶Ô num ½øÐÐ sum »ò count ¼ÆËã¡£´Ëʱ¿É´´½¨Ò»ÕÅÎﻯÊÓͼ£¬ÇÒ¶Ô user ½øÐÐ gorup by£¬¶Ô num ½øÐÐ avg£¨avg »á×Ô¶¯×ª»»Îª count ºÍ sum£©¡£Óû§¶Ôԭʼ±í½øÐÐ select user, sum(num) ²éѯʱ£¬Spark SQL ×Ô¶¯½«²éÑ¯ÖØÐ´Îª¶ÔÎﻯÊÓͼµÄ select user, sum_num ²éѯ¡£

ÆäËüÓÅ»¯

ÏÂͼչʾÁËÎÒÃÇÔÚ Spark SQL ÉϽøÐÐµÄÆäËü²¿·ÖÓÅ»¯¹¤×÷£º

Spark Shuffle Îȶ¨ÐÔÌáÉýÓëÐÔÄÜÓÅ»¯

Spark Shuffle ´æÔÚµÄÎÊÌâ

Shuffle µÄÔ­Àí£¬ºÜ¶àͬѧӦ¸ÃÒѾ­ºÜÊìϤÁË¡£¼øÓÚʱ¼ä¹ØÏµ£¬ÕâÀï²»½éÉܹý¶àϸ½Ú£¬Ö»¼òµ¥½éÉÜÏ»ù±¾Ä£ÐÍ¡£

ÈçÉÏͼËùʾ£¬ÎÒÃǽ« Shuffle ÉÏÓÎ Stage ³ÆÎª Mapper Stage£¬ÆäÖÐµÄ Task ³ÆÎª Mapper¡£Shuffle ÏÂÓÎ Stage ³ÆÎª Reducer Stage£¬ÆäÖÐµÄ Task ³ÆÎª Reducer¡£

ÿ¸ö Mapper »á½«×Ô¼ºµÄÊý¾Ý·ÖΪ×î¶à N ¸ö²¿·Ö£¬N Ϊ Reducer ¸öÊý¡£Ã¿¸ö Reducer ÐèҪȥ×î¶à M £¨Mapper ¸öÊý£©¸ö Mapper »ñÈ¡ÊôÓÚ×Ô¼ºµÄÄDz¿·ÖÊý¾Ý¡£

Õâ¸ö¼Ü¹¹´æÔÚÁ½¸öÎÊÌ⣺

Ò»¡¢Îȶ¨ÐÔÎÊÌâ

Mapper µÄ Shuffle Write Êý¾Ý´æÓÚ Mapper ±¾µØ´ÅÅÌ£¬Ö»ÓÐÒ»¸ö¸±±¾¡£µ±¸Ã»úÆ÷³öÏÖ´ÅÅ̹ÊÕÏ£¬»òÕß IO ÂúÔØ£¬CPU ÂúÔØÊ±£¬Reducer ÎÞ·¨¶ÁÈ¡¸ÃÊý¾Ý£¬´Ó¶øÒýÆð FetchFailedException£¬½ø¶øµ¼Ö Stage Retry¡£Stage Retry »áÔì³É×÷ÒµÖ´ÐÐʱ¼äÔö³¤£¬Ö±½ÓÓ°Ïì SLA¡£Í¬Ê±£¬Ö´ÐÐʱ¼äÔ½³¤£¬³öÏÖ Shuffle Êý¾ÝÎÞ·¨¶ÁÈ¡µÄ¿ÉÄÜÐÔÔ½´ó£¬·´¹ýÀ´ÓÖ»áÔì³É¸ü¶à Stage Retry¡£Èç´ËÑ­»·£¬¿ÉÄܵ¼Ö´óÐÍ×÷ÒµÎÞ·¨³É¹¦Ö´ÐС£

¶þ¡¢ÐÔÄÜÎÊÌâ

ÿ¸ö Mapper µÄÊý¾Ý»á±»´óÁ¿ Reducer ¶ÁÈ¡£¬²¢ÇÒÊÇËæ»ú¶ÁÈ¡²»Í¬²¿·Ö¡£¼ÙÉè Mapper µÄ Shuffle Êä³öΪ 512MB£¬Reducer ÓÐ 10 Íò¸ö£¬ÄÇÆ½¾ùÿ¸ö Reducer ¶ÁÈ¡Êý¾Ý 512MB / 100000 = 5.24KB¡£²¢ÇÒ£¬²»Í¬ Reducer ²¢ÐжÁÈ¡Êý¾Ý¡£¶ÔÓÚ Mapper Êä³öÎļþ¶øÑÔ£¬´æÔÚ´óÁ¿µÄËæ»ú¶ÁÈ¡¡£¶ø HDD µÄËæ»ú IO ÐÔÄÜÔ¶µÍÓÚ˳Ðò IO¡£×îÖÕµÄÏÖÏóÊÇ£¬Reducer ¶ÁÈ¡ Shuffle Êý¾Ý·Ç³£Âý£¬·´Ó³µ½ Metrics ÉϾÍÊÇ Reducer Shuffle Read Blocked Time ½Ï³¤£¬ÉõÖÁÕ¼Õû¸ö Reducer Ö´ÐÐʱ¼äµÄÒ»´ó°ë£¬ÈçÏÂͼËùʾ¡£

»ùÓÚ HDFS µÄ Shuffle Îȶ¨ÐÔÌáÉý

¾­¹Û²ì£¬ÒýÆð Shuffle ʧ°ÜµÄ×î´óÒòËØ²»ÊÇ´ÅÅ̹ÊÕϵÈÓ²¼þÎÊÌ⣬¶øÊÇ CPU ÂúÔØºÍ´ÅÅÌ IO ÂúÔØ¡£

ÈçÉÏͼËùʾ£¬»úÆ÷µÄ CPU ʹÓÃÂʽӽü 100%£¬Ê¹µÃ Mapper ²àµÄ Node Manager ÄÚµÄ Spark External Shuffle Service ÎÞ·¨¼°Ê±Ìṩ Shuffle ·þÎñ¡£

ÏÂͼÖÐ Data Node Õ¼ÓÃÁËÕų̂»úÆ÷ IO ×ÊÔ´µÄ 84%£¬²¿·Ö´ÅÅÌ IO ÍêÈ«´òÂú£¬ÕâʹµÃ¶ÁÈ¡ Shuffle Êý¾Ý·Ç³£Âý£¬½ø¶øÊ¹µÃ Reducer ²àÎÞ·¨ÔÚ³¬Ê±Ê±¼äÄÚ¶ÁÈ¡Êý¾Ý£¬Ôì³É FetchFailedException¡£

ÎÞÂÛÊǺÎÖÖÔ­Òò£¬ÎÊÌâµÄÖ¢½á¶¼ÊÇ Mapper ²àµÄ Shuffle Write Êý¾ÝÖ»±£´æÔÚ±¾µØ£¬Ò»µ©¸Ã½Úµã³öÏÖÎÊÌ⣬»áÔì³É¸Ã½ÚµãÉÏËùÓÐ Shuffle Write Êý¾ÝÎÞ·¨±» Reducer ¶ÁÈ¡¡£½â¾öÕâ¸öÎÊÌâµÄÒ»¸öͨÓ÷½·¨ÊÇ£¬Í¨¹ý¶à¸±±¾±£Ö¤¿ÉÓÃÐÔ¡£

×î³õʼµÄÒ»¸ö¼òµ¥·½°¸ÊÇ£¬Mapper ²à×îÖÕÊý¾ÝÎļþÓëË÷ÒýÎļþ²»Ð´ÔÚ±¾µØ´ÅÅÌ£¬¶øÊÇÖ±½Óдµ½ HDFS¡£Reducer ²»ÔÙͨ¹ý Mapper ²àµÄ External Shuffle Service ¶ÁÈ¡ Shuffle Êý¾Ý£¬¶øÊÇÖ±½Ó´Ó HDFS ÉÏ»ñÈ¡Êý¾Ý£¬ÈçÏÂͼËùʾ¡£

¿ìËÙʵÏÖÕâ¸ö·½°¸ºó£¬ÎÒÃÇ×öÁ˼¸×é¼òµ¥µÄ²âÊÔ¡£½á¹û±íÃ÷£º

Mapper Óë Reducer ²»¶àʱ£¬Shuffle ¶ÁдÐÔÄÜÓëԭʼ·½°¸Ïà±ÈÎÞ²îÒì¡£

Mapper Óë Reducer ½Ï¶àʱ£¬Shuffle ¶Á±äµÃ·Ç³£Âý¡£

ÔÚÉÏÃæµÄʵÑé¹ý³ÌÖУ¬HDFS ·¢³öÁ˱¨¾¯ÐÅÏ¢¡£ÈçÏÂͼËùʾ£¬HDFS Name Node Proxy µÄ QPS ·åÖµ´ïµ½ 60 Íò¡££¨×¢£º×Ö½ÚÌø¶¯×ÔÑÐÁË Node Name Proxy£¬²¢ÔÚ Proxy ²ãʵÏÖÁË»º´æ£¬Òò´Ë¶Á QPS ¿ÉÒÔÖ§³Åµ½Õâ¸öÁ¿¼¶£©¡£

Ô­ÒòÔÚÓÚ£¬×ܹ² 10000 Reducer£¬ÐèÒª´Ó 10000 ¸ö Mapper ´¦¶ÁÈ¡Êý¾ÝÎļþºÍË÷ÒýÎļþ£¬×ܹ²ÐèÒª¶ÁÈ¡ HDFS 10000 * 1000 * 2 = 2 ÒڴΡ£

Èç¹ûÖ»ÊÇ Name Node µÄµ¥µãÐÔÄÜÎÊÌ⣬»¹¿ÉÒÔͨ¹ýһЩ¼òµ¥µÄ·½·¨½â¾ö¡£ÀýÈçÔÚ Spark Driver ²à±£´æËùÓÐ Mapper µÄ Block Location£¬È»ºó Driver ½«¸ÃÐÅÏ¢¹ã²¥ÖÁËùÓÐ Executor£¬Ã¿¸ö Reducer ¿ÉÒÔÖ±½Ó´Ó Executor ´¦»ñÈ¡ Block Location£¬È»ºóÎÞÐëÁ¬½Ó Name Node£¬¶øÊÇÖ±½Ó´Ó Data Node ¶ÁÈ¡Êý¾Ý¡£µ«¼øÓÚ Data Node µÄÏß³ÌÄ£ÐÍ£¬ÕâÖÖ·½°¸»á¶Ô Data Node Ôì³É½Ï´ó³å»÷¡£

×îºóÎÒÃÇÑ¡ÔñÁËÒ»ÖֱȽϼòµ¥¿ÉÐеķ½°¸£¬ÈçÏÂͼËùʾ¡£

Mapper µÄ Shuffle Êä³öÊý¾ÝÈÔÈ»°´Ô­·½°¸Ð´±¾µØ´ÅÅÌ£¬Ð´ÍêºóÉÏ´«µ½ HDFS¡£Reducer ÈÔÈ»°´Ô­Ê¼·½°¸Í¨¹ý Mapper ²àµÄ External Shuffle Service ¶ÁÈ¡ Shuffle Êý¾Ý¡£Èç¹ûʧ°ÜÁË£¬Ôò´Ó HDFS ¶ÁÈ¡¡£ÕâÖÖ·½°¸¼«´ó¼õÉÙÁË¶Ô HDFS µÄ·ÃÎÊÆµÂÊ¡£

¸Ã·½°¸ÉÏÏß½üÒ»Ä꣺

¸²¸Ç 57% ÒÔÉ쵀 Spark Shuffle Êý¾Ý£»

ʹµÃ Spark ×÷ÒµÕûÌåÐÔÄÜÌáÉý 14%£»

Ìì¼¶´ó×÷ÒµÐÔÄÜÌáÉý 18%£»

Сʱ¼¶×÷ÒµÐÔÄÜÌáÉý 12%¡£

¸Ã·½°¸Ö¼ÔÚÌáÉý Spark Shuffle Îȶ¨ÐÔ´Ó¶øÌáÉý×÷ÒµÎȶ¨ÐÔ£¬µ«×îÖÕûÓÐʹÓ÷½²îµÈÖ¸±êÀ´ºâÁ¿Îȶ¨ÐÔµÄÌáÉý¡£Ô­ÒòÔÚÓÚÿÌ켯Ⱥ¸ºÔز»Ò»Ñù£¬ÕûÌå·½²î½Ï´ó¡£Shuffle Îȶ¨ÐÔÌáÉýºó£¬Stage Retry ´ó·ù¼õÉÙ£¬ÕûÌå×÷ÒµÖ´ÐÐʱ¼ä¼õÉÙ£¬Ò²¼´ÐÔÄÜÌáÉý¡£×îÖÕͨ¹ý¶Ô±ÈʹÓø÷½°¸Ç°ºóµÄ×ܵÄ×÷ÒµÖ´ÐÐʱ¼äÀ´¶Ô±ÈÐÔÄܵÄÌáÉý£¬ÓÃÓÚºâÁ¿¸Ã·½°¸µÄЧ¹û¡£

Shuffle ÐÔÄÜÓÅ»¯Êµ¼ùÓë̽Ë÷

ÈçÉÏÎÄËù·ÖÎö£¬Shuffle ÐÔÄÜÎÊÌâµÄÔ­ÒòÔÚÓÚ£¬Shuffle Write ÓÉ Mapper Íê³É£¬È»ºó Reducer ÐèÒª´ÓËùÓÐ Mapper ´¦¶ÁÈ¡Êý¾Ý¡£ÕâÖÖÄ£ÐÍ£¬ÎÒÃdzÆÖ®ÎªÒÔ Mapper ΪÖÐÐÄµÄ Shuffle¡£ËüµÄÎÊÌâÔÚÓÚ£º

Mapper ²à»áÓÐ M ´Î˳Ðòд IO£»

Mapper ²à»áÓÐ M * N * 2 ´ÎËæ»ú¶Á IO£¨ÕâÊÇ×î´óµÄÐÔÄÜÆ¿¾±£©£»

Mapper ²àµÄ External Shuffle Service ±ØÐëÓë Mapper λÓÚͬһ̨»úÆ÷£¬ÎÞ·¨×öµ½ÓÐЧµÄ´æ´¢¼ÆËã·ÖÀ룬Shuffle ·þÎñÎÞ·¨¶ÀÁ¢À©Õ¹¡£

Õë¶ÔÉÏÊöÎÊÌ⣬ÎÒÃÇÌá³öÁËÒÔ Reducer ΪÖÐÐĵģ¬´æ´¢¼ÆËã·ÖÀëµÄ Shuffle ·½°¸£¬ÈçÏÂͼËùʾ¡£

¸Ã·½°¸µÄÔ­ÀíÊÇ£¬Mapper Ö±½Ó½«ÊôÓÚ²»Í¬ Reducer µÄÊý¾Ýдµ½²»Í¬µÄ Shuffle Service¡£ÔÚÉÏͼÖУ¬×ܹ² 2 ¸ö Mapper£¬5 ¸ö Reducer£¬5 ¸ö Shuffle Service¡£ËùÓÐ Mapper ¶¼½«ÊôÓÚ Reducer 0 µÄÊý¾ÝÔ¶³ÌÁ÷ʽ·¢Ë͸ø Shuffle Service 0£¬²¢ÓÉËü˳ÐòдÈë´ÅÅÌ¡£Reducer 0 Ö»ÐèÒª´Ó Shuffle Service 0 ˳Ðò¶ÁÈ¡ËùÓÐÊý¾Ý¼´¿É£¬ÎÞÐèÔÙ´Ó M ¸ö Mapper È¡Êý¾Ý¡£¸Ã·½°¸µÄÓÅÊÆÔÚÓÚ£º

½« M * N * 2 ´ÎËæ»ú IO ±äΪ N ´Î˳Ðò IO¡£

Shuffle Service ¿ÉÒÔ¶ÀÁ¢ÓÚ Mapper »òÕß Reducer ²¿Ê𣬴Ӷø×öµ½¶ÀÁ¢À©Õ¹£¬×öµ½´æ´¢¼ÆËã·ÖÀë¡£

Shuffle Service ¿É½«Êý¾ÝÖ±½Ó´æÓÚ HDFS µÈ¸ß¿ÉÓô洢£¬Òò´Ë¿Éͬʱ½â¾ö Shuffle Îȶ¨ÐÔÎÊÌâ¡£

Spark SQL δÀ´¹æ»®

δÀ´ÎÒÃÇ´ó¸Å»á×öÒÔϹ滮£º

Ö§³Ö¸ü¶àµÄË÷Òý£»

Ö§³ÖÎļþ¼¶¹ýÂË£»

Ö§³ÖһЩ ACID£»

¶¯Ì¬°´×é·ÖÇø£»

Runtime Filter£¨Ö§³Ö·ÖÇø¼¶¼°Ðм¶¹ýÂË£©£»

ÖÇÄܽ¨ÉèÊý¾Ý²Ö¿â£»

Filter ÖØÅÅ£»

¿ÉÀ©Õ¹µÄ Catalog Service¡£

ÎҵķÖÏí¾Íµ½ÕâÀлл´ó¼Ò¡£

ÍøÓÑÎÊÌ⼯½õ

Q£ºÎﻯÁÐÐÂÔöÒ»ÁУ¬ÊÇ·ñÐèÒªÐÞ¸ÄÀúÊ·Êý¾Ý£¿

A£ºÀúÊ·Êý¾ÝÌ«¶à£¬²»ÊʺÏÐÞ¸ÄÀúÊ·Êý¾Ý¡£

Q£ºÈç¹ûÓû§µÄÇëÇóͬʱ°üº¬ÐÂÊý¾ÝºÍÀúÊ·Êý¾Ý£¬ÈçºÎ´¦Àí£¿

A£ºÒ»°ã¶øÑÔ£¬Óû§ÐÞ¸ÄÊý¾Ý¶¼ÊÇÒÔ Partition Ϊµ¥Î»¡£ËùÒÔÎÒÃÇÔÚ Partition Parameter Éϱ£´æÁËÎﻯÁÐÏà¹ØÐÅÏ¢¡£Èç¹ûÓû§µÄ²éѯͬʱ°üº¬ÁËРPartition ÓëÀúÊ· Partition£¬ÎÒÃÇ»áÔÚРPartition ÉÏÕë¶ÔÎﻯÁнøÐÐ SQL Rewrite£¬ÀúÊ· Partition ²» Rewrite£¬È»ºó½«ÐÂÀÏ Partition ½øÐÐ Union£¬´Ó¶øÔÚ±£Ö¤Êý¾ÝÕýÈ·ÐÔµÄǰÌáϾ¡¿ÉÄܳä·ÖÀûÓÃÎﻯÁеÄÓÅÊÆ¡£

Q£ºÄãºÃ£¬ÄãÃÇÕë¶ÔÓû§µÄ³¡¾°£¬×öÁ˺ܶàͦÓмÛÖµµÄÓÅ»¯¡£ÏñÎﻯÁС¢ÎﻯÊÓͼ£¬¶¼ÐèÒª¸ù¾ÝÓû§µÄ²éѯ Pattern ½øÐÐÉèÖá£Ä¿Ç°ÄãÃÇÊÇÈ˹¤·ÖÎöÕâЩ²éѯ£¬»¹ÊÇÓÐijÖÖ»úÖÆ×Ô¶¯È¥·ÖÎö²¢ÓÅ»¯£¿

A£ºÄ¿Ç°ÎÒÃÇÖ÷ÒªÊÇͨ¹ýһЩÉó¼ÆÐÅÏ¢¸¨ÖúÈ˹¤·ÖÎö¡£Í¬Ê±ÎÒÃÇÒ²ÕýÔÚ×öÎﻯÁÐÓëÎﻯÊÓͼµÄÍÆ¼ö·þÎñ£¬×îÖÕ×öµ½ÖÇÄܽ¨ÉèÎﻯÁÐÓëÎﻯÊÓͼ¡£

Q£º¸Õ¸Õ½éÉܵĻùÓÚ HDFS µÄ Spark Shuffle Îȶ¨ÐÔÌáÉý·½°¸£¬ÊÇ·ñ¿ÉÒÔÒì²½ÉÏ´« Shuffle Êý¾ÝÖÁ HDFS£¿

A£ºÕâ¸öÏ뷨ͦºÃ£¬ÎÒÃÇ֮ǰҲ¿¼Âǹý£¬µ«»ùÓÚ¼¸µã¿¼ÂÇ£¬×îÖÕûÓÐÕâÑù×ö¡£µÚÒ»£¬µ¥ Mapper µÄ Shuffle Êä³öÊý¾ÝÁ¿Ò»°ãºÜС£¬ÉÏ´«µ½ HDFS ºÄʱÔÚ 2 ÃëÒÔÄÚ£¬Õâ¸öʱ¼ä¿ªÏú¿ÉÒÔºöÂÔ£»µÚ¶þ£¬ÎÒÃǹ㷺ʹÓà External Shuffle Service ºÍ Dynamic Allocation£¬Mapper Ö´ÐÐÍê³Éºó¿ÉÄÜ Executor ¾Í»ØÊÕÁË£¬Èç¹ûÒªÒì²½ÉÏ´«£¬¾Í±ØÐëÒÀÀµÆäËü×é¼þ£¬Õâ»áÌáÉý¸´ÔÓ¶È£¬ROI ½ÏµÍ¡£

ÒÔÉÏÄÚÈÝ·ÖÏíµÄÊÇ×Ö½ÚÌø¶¯ÔÚÌáÉý»ùÓÚ Spark SQL µÄ ETL Îȶ¨ÐÔÒÔ¼°ÓÅ»¯ ad-hoc ²éѯµÄÐÔÄÜ·½ÃæµÄʵ¼ù¡£

   
2799 ´Îä¯ÀÀ       28
Ïà¹ØÎÄÕÂ

»ùÓÚEAµÄÊý¾Ý¿â½¨Ä£
Êý¾ÝÁ÷½¨Ä££¨EAÖ¸ÄÏ£©
¡°Êý¾Ýºþ¡±£º¸ÅÄî¡¢ÌØÕ÷¡¢¼Ü¹¹Óë°¸Àý
ÔÚÏßÉ̳ÇÊý¾Ý¿âϵͳÉè¼Æ ˼·+Ч¹û
 
Ïà¹ØÎĵµ

GreenplumÊý¾Ý¿â»ù´¡Åàѵ
MySQL5.1ÐÔÄÜÓÅ»¯·½°¸
ijµçÉÌÊý¾ÝÖÐ̨¼Ü¹¹Êµ¼ù
MySQL¸ßÀ©Õ¹¼Ü¹¹Éè¼Æ
Ïà¹Ø¿Î³Ì

Êý¾ÝÖÎÀí¡¢Êý¾Ý¼Ü¹¹¼°Êý¾Ý±ê×¼
MongoDBʵս¿Î³Ì
²¢·¢¡¢´óÈÝÁ¿¡¢¸ßÐÔÄÜÊý¾Ý¿âÉè¼ÆÓëÓÅ»¯
PostgreSQLÊý¾Ý¿âʵսÅàѵ
×îл¼Æ»®
DeepSeek´óÄ£ÐÍÓ¦Óÿª·¢ 6-12[ÏÃÃÅ]
È˹¤ÖÇÄÜ.»úÆ÷ѧϰTensorFlow 6-22[Ö±²¥]
»ùÓÚ UML ºÍEA½øÐзÖÎöÉè¼Æ 6-30[±±¾©]
ǶÈëʽÈí¼þ¼Ü¹¹-¸ß¼¶Êµ¼ù 7-9[±±¾©]
Óû§ÌåÑé¡¢Ò×ÓÃÐÔ²âÊÔÓëÆÀ¹À 7-25[Î÷°²]
ͼÊý¾Ý¿âÓë֪ʶͼÆ× 8-23[±±¾©]
 
×îÐÂÎÄÕÂ
InfluxDB¸ÅÄîºÍ»ù±¾²Ù×÷
InfluxDB TSM´æ´¢ÒýÇæÖ®Êý¾ÝдÈë
Éî¶ÈÂþ̸Êý¾Ýϵͳ¼Ü¹¹¡ª¡ªLambda architecture
Lambda¼Ü¹¹Êµ¼ù
InfluxDB TSM´æ´¢ÒýÇæÖ®Êý¾Ý¶ÁÈ¡
×îпγÌ
OracleÊý¾Ý¿âÐÔÄÜÓÅ»¯¡¢¼Ü¹¹Éè¼ÆºÍÔËÐÐά»¤
²¢·¢¡¢´óÈÝÁ¿¡¢¸ßÐÔÄÜÊý¾Ý¿âÉè¼ÆÓëÓÅ»¯
NoSQLÊý¾Ý¿â£¨Ô­Àí¡¢Ó¦Óá¢×î¼Ñʵ¼ù£©
ÆóÒµ¼¶Hadoop´óÊý¾Ý´¦Àí×î¼Ñʵ¼ù
OracleÊý¾Ý¿âÐÔÄÜÓÅ»¯×î¼Ñʵ¼ù
³É¹¦°¸Àý
ij½ðÈÚ¹«Ë¾ Mysql¼¯ÈºÓëÐÔÄÜÓÅ»¯
±±¾© ²¢·¢¡¢´óÈÝÁ¿¡¢¸ßÐÔÄÜÊý¾Ý¿âÉè¼ÆÓëÓÅ»¯
ÖªÃûijÐÅϢͨÐŹ«Ë¾ NoSQL»º´æÊý¾Ý¿â¼¼Êõ
±±¾© oracleÊý¾Ý¿âSQLÓÅ»¯
ÖйúÒÆ¶¯ IaaSÔÆÆ½Ì¨-Ö÷Á÷Êý¾Ý¿â¼°´æ´¢¼¼Êõ