±à¼ÍƼö: |
±¾ÎÄÖ÷ÒªÖ÷Òª·ÖΪËĸö²¿·Ö£ºµÚÒ»¸ö²¿·ÖÊÇ
Spark SQL ¼Ü¹¹¼ò½é£»µÚ¶þ²¿·Ö½éÉÜ×Ö½ÚÌø¶¯ÔÚ Spark SQL ÒýÇæÉϵÄÓÅ»¯Êµ¼ù£»µÚÈý²¿·ÖÊÇ×Ö½ÚÌø¶¯ÔÚ
Spark Shuffle Îȶ¨ÐÔÌáÉýÓëÐÔÄÜÓÅ»¯£»µÚËIJ¿·ÖÊÇ Spark SQL
δÀ´¹æ»®ÓëÕ¹Íû¡£Ï£Íû¶ÔÄúµÄѧϰÓÐËù°ïÖú¡£
±¾ÎÄÀ´×ÔInfoQ£¬ÓÉ»ðÁú¹ûÈí¼þLinda±à¼¡¢ÍƼö¡£ |
|
µ¼¶Á£ºSpark SQL ÊÇ×Ö½ÚÌø¶¯ÄÚ²¿×îÖØÒªµÄ²éѯÒýÇæÖ®Ò»£¬ËüÿÌì´¦Àí°ÙÍòÒÚ¼¶Êý¾Ý£¬µ¥ÈÎÎñ
Shuffle Êý¾ÝÁ¿¿É³¬¹ý 200TB¡£²»¹ýÒòΪ Spark ÓëÆäËüϵͳ»ìºÏ²¿Êð£¬Òò´ËÐÔÄÜÓëÎȶ¨ÐÔÎÊÌâ
¶¼ÊÇÐèÒªÖØµã½â¾öµÄ¡£±¾ÎÄÖ÷Òª·ÖΪËĸö²¿·Ö£ºµÚÒ»¸ö²¿·ÖÊÇ Spark SQL ¼Ü¹¹¼ò½é£»µÚ¶þ²¿·Ö½éÉÜ×Ö½ÚÌø¶¯ÔÚ
Spark SQL ÒýÇæÉϵÄÓÅ»¯Êµ¼ù£»µÚÈý²¿·ÖÊÇ×Ö½ÚÌø¶¯ÔÚ Spark Shuffle Îȶ¨ÐÔÌáÉýÓëÐÔÄÜÓÅ»¯£»µÚËIJ¿·ÖÊÇ
Spark SQL δÀ´¹æ»®ÓëÕ¹Íû¡£
Spark SQL ¼Ü¹¹¼ò½é
ÎÒÃÇÏȼòµ¥ÁÄһϠSpark SQL µÄ¼Ü¹¹¡£ÏÂÃæÕâÕÅͼÃèÊöÁËÒ»Ìõ SQL Ìá½»Ö®ºóÐèÒª¾ÀúµÄ¼¸¸ö½×¶Î£¬½áºÏÕâЩ½×¶Î¾Í¿ÉÒÔ¿´µ½ÔÚÄÄЩ»·½Ú¿ÉÒÔ×öÓÅ»¯¡£

ºÜ¶àʱºò£¬×öÊý¾Ý²Ö¿â½¨Ä£µÄͬѧ¸üÇãÏòÓÚÖ±½Óд SQL ¶ø·ÇʹÓà Spark µÄ DSL¡£Ò»Ìõ SQL
Ìá½»Ö®ºó»á±» Parser ½âÎö²¢×ª»¯Îª Unresolved Logical Plan¡£ËüµÄÖØµãÊÇ
Logical Plan Ò²¼´Âß¼¼Æ»®£¬ËüÃèÊöÁËÏ£Íû×öʲôÑùµÄ²éѯ¡£Unresolved ÊÇÖ¸¸Ã²éѯÏà¹ØµÄһЩÐÅϢδ֪£¬±ÈÈç²»ÖªµÀ²éѯµÄÄ¿±ê±íµÄ
Schema ÒÔ¼°Êý¾ÝλÖá£
ÉÏÊöÐÅÏ¢´æÓÚ Catalog ÄÚ¡£ÔÚÉú²ú»·¾³ÖУ¬Ò»°ãÓÉ Hive Metastore Ìṩ Catalog
·þÎñ¡£Analyzer »á½áºÏ Catalog ½« Unresolved Logical Plan
ת»»Îª Resolved Logical Plan¡£
µ½ÕâÀﻹ²»¹»¡£²»Í¬µÄÈËд³öÀ´µÄ SQL ²»Ò»Ñù£¬Éú³ÉµÄ Resolved Logical Plan
Ò²¾Í²»Ò»Ñù£¬Ö´ÐÐЧÂÊÒ²²»Ò»Ñù¡£ÎªÁ˱£Ö¤ÎÞÂÛÓû§ÈçºÎд SQL ¶¼¿ÉÒÔ¸ßЧµÄÖ´ÐУ¬Spark SQL
ÐèÒª¶Ô Resolved Logical Plan ½øÐÐÓÅ»¯£¬Õâ¸öÓÅ»¯ÓÉ Optimizer Íê³É¡£Optimizer
°üº¬ÁËһϵÁйæÔò£¬¶Ô Resolved Logical Plan ½øÐеȼÛת»»£¬×îÖÕÉú³É Optimized
Logical Plan¡£¸Ã Optimized Logical Plan ²»Äܱ£Ö¤ÊÇÈ«¾Ö×îÓŵ쬵«ÖÁÉÙÊǽӽü×îÓŵġ£
ÉÏÊö¹ý³ÌÖ»Óë SQL Óйأ¬Óë²éѯÓйأ¬µ«ÊÇÓë Spark Î޹أ¬Òò´ËÎÞ·¨Ö±½ÓÌá½»¸ø Spark
Ö´ÐС£Query Planner ¸ºÔ𽫠Optimized Logical Plan ת»»Îª Physical
Plan£¬½ø¶ø¿ÉÒÔÖ±½ÓÓÉ Spark Ö´ÐС£
ÓÉÓÚͬһÖÖÂß¼Ëã×Ó¿ÉÒÔÓжàÖÖÎïÀíʵÏÖ¡£Èç Join ÓжàÖÖʵÏÖ£¬ShuffledHashJoin¡¢BroadcastHashJoin¡¢BroadcastNestedLoopJoin¡¢SortMergeJoin
µÈ¡£Òò´Ë Optimized Logical Plan ¿É±» Query Planner ת»»Îª¶à¸ö
Physical Plan¡£ÈçºÎÑ¡Ôñ×îÓÅµÄ Physical Plan ³ÉΪһ¼þ·Ç³£Ó°Ïì×îÖÕÖ´ÐÐÐÔÄܵÄÊÂÇé¡£Ò»ÖֱȽϺõķ½Ê½ÊÇ£¬¹¹½¨Ò»¸ö
Cost Model£¬²¢¶ÔËùÓкòÑ¡µÄ Physical Plan Ó¦Óøà Model ²¢ÌôÑ¡ Cost
×îСµÄ Physical Plan ×÷Ϊ×îÖÕµÄ Selected Physical Plan¡£
Physical Plan ¿ÉÖ±½Óת»»³É RDD ÓÉ Spark Ö´ÐС£ÎÒÃǾ³£Ëµ¡°¼Æ»®¸Ï²»Éϱ仯¡±£¬ÔÚÖ´Ðйý³ÌÖУ¬¿ÉÄÜ·¢ÏÖԼƻ®²»ÊÇ×îÓŵģ¬ºóÐøÖ´Ðмƻ®Èç¹ûÄܸù¾ÝÔËÐÐʱµÄͳ¼ÆÐÅÏ¢½øÐе÷Õû¿ÉÄÜÌáÉýÕûÌåÖ´ÐÐЧÂÊ¡£Õⲿ·Ö¶¯Ì¬µ÷ÕûÓÉ
Adaptive Execution Íê³É¡£
ºóÃæ½éÉÜ×Ö½ÚÌø¶¯ÔÚ Spark SQL ÉÏ×öµÄһЩÓÅ»¯£¬Ö÷ÒªÎ§ÈÆÕâÒ»½Ú½éÉܵÄÂß¼¼Æ»®ÓÅ»¯ÓëÎïÀí¼Æ»®ÓÅ»¯Õ¹¿ª¡£
Spark SQL ÒýÇæÓÅ»¯
Bucket ¸Ä½ø
ÔÚ Spark Àʵ¼Ê²¢Ã»ÓÐ Bucket Join Ëã×Ó¡£ÕâÀï˵µÄ Bucket Join ·ºÖ¸²»ÐèÒª
Shuffle µÄ SortMergeJoin¡£
ÏÂͼչʾÁË SortMergeJoin µÄ»ù±¾ÔÀí¡£ÓÃÐéÏß¿ò´ú±íµÄ Table 1 ºÍ Table
2 ÊÇÁ½ÕÅÐèÒª°´Ä³×ֶνøÐÐ Join µÄ±í¡£ÐéÏß¿òÄÚµÄ partition 0 µ½ partition
m ÊǸñíת»»³É RDD ºóµÄ Partition£¬¶ø·Ç±íµÄ·ÖÇø¡£¼ÙÉè Table 1 Óë Table
2 ת»»Îª RDD ºó·Ö±ð°üº¬ m ºÍ k ¸ö Partition¡£ÎªÁ˽øÐÐ Join£¬ÐèҪͨ¹ý Shuffle
±£Ö¤Ïàͬ Join Key µÄÊý¾ÝÔÚͬһ¸ö Partition ÄÚÇÒ Partition ÄÚ°´ Key
ÅÅÐò£¬Í¬Ê±±£Ö¤ Table 1 Óë Table 2 ¾¹ý Shuffle ºóµÄ RDD µÄ Partition
ÊýÏàͬ¡£
ÈçÏÂͼËùʾ£¬¾¹ý Shuffle ºóÖ»ÐèÒªÆô¶¯ n ¸ö Task£¬Ã¿¸ö Task ´¦Àí Table
1 Óë Table 2 ÖжÔÓ¦ Partition µÄÊý¾Ý½øÐÐ Join ¼´¿É¡£Èç Task 0 Ö»ÐèҪ˳ÐòɨÃè
Shuffle ºóµÄ×óÓÒÁ½±ßµÄ partition 0 ¼´¿ÉÍê³É Join¡£

¸Ã·½·¨µÄÓÅÊÆÊÇÊÊÓó¡¾°¹ã£¬¼¸ºõ¿ÉÓÃÓÚÈÎÒâ´óСµÄÊý¾Ý¼¯¡£ÁÓÊÆÊÇÿ´Î Join ¶¼ÐèÒª¶ÔÈ«Á¿Êý¾Ý½øÐÐ Shuffle£¬¶ø
Shuffle ÊÇ×îÓ°Ïì Spark SQL ÐÔÄܵĻ·½Ú¡£Èç¹ûÄܱÜÃâ Shuffle ÍùÍùÄÜ´ó·ùÌáÉý
Spark SQL ÐÔÄÜ¡£
¶ÔÓÚ´óÊý¾ÝµÄ³¡¾°À´½²£¬Êý¾ÝÒ»°ãÊÇÒ»´ÎдÈë¶à´Î²éѯ¡£Èç¹û¾³£¶ÔÁ½ÕÅ±í°´Ïàͬ»òÀàËÆµÄ·½Ê½½øÐÐ Join£¬Ã¿´Î¶¼ÐèÒª¸¶³ö
Shuffle µÄ´ú¼Û¡£ÓëÆäÕâÑù£¬²»ÈçÈÃÊý¾ÝÔÚдµÄʱºò£¬¾ÍÈÃÊý¾Ý°´ÕÕÀûÓÚ Join µÄ·½Ê½·Ö²¼£¬´Ó¶øÊ¹µÃ
Join ʱÎÞÐè½øÐÐ Shuffle¡£ÈçÏÂͼËùʾ£¬Table 1 Óë Table 2 ÄÚµÄÊý¾Ý°´ÕÕÏàͬµÄ
Key ½øÐзÖͰÇÒͰÊý¶¼Îª n£¬Í¬Ê±Í°ÄÚ°´¸Ã Key ÅÅÐò¡£¶ÔÕâÁ½ÕÅ±í½øÐÐ Join ʱ£¬¿ÉÒÔ±ÜÃâ
Shuffle£¬Ö±½ÓÆô¶¯ n ¸ö Task ½øÐÐ Join¡£

×Ö½ÚÌø¶¯¶Ô Spark SQL µÄ BucketJoin ×öÁËËÄÏî±È½Ï´óµÄ¸Ä½ø¡£
¸Ä½øÒ»£ºÖ§³ÖÓë Hive ¼æÈÝ
ÔÚ¹ýÈ¥Ò»¶Îʱ¼ä£¬×Ö½ÚÌø¶¯°Ñ´óÁ¿µÄ Hive ×÷ÒµÇ¨ÒÆµ½ÁË SparkSQL¡£¶ø Hive Óë Spark
SQL µÄ Bucket ±í²»¼æÈÝ¡£¶ÔÓÚʹÓà Bucket ±íµÄ³¡¾°£¬Èç¹ûÖ±½Ó¸üмÆËãÒýÇæ£¬»áÔì³É
Spark SQL дÈë Hive Bucket ±íµÄÊý¾ÝÎÞ·¨±»ÏÂÓ뵀 Hive ×÷Òµµ±³É Bucket
±í½øÐÐ Bucket Join£¬´Ó¶øÔì³É×÷ÒµÖ´ÐÐʱ¼ä±ä³¤£¬¿ÉÄÜÓ°Ïì SLA¡£
ΪÁ˽â¾öÕâ¸öÎÊÌ⣬ÎÒÃÇÈà Spark SQL Ö§³Ö Hive ¼æÈÝģʽ£¬´Ó¶ø±£Ö¤ Spark SQL
дÈëµÄ Bucket ±íÓë Hive дÈëµÄ Bucket ±íЧ¹ûÒ»Ö£¬²¢ÇÒÕâÖÖ±í¿ÉÒÔ±» Hive
ºÍ Spark SQL µ±³É Bucket ±í½øÐÐ Bucket Join ¶ø²»ÐèÒª Shuffle¡£Í¨¹ýÕâÖÖ·½Ê½±£Ö¤
Hive Ïò Spark SQL µÄ͸Ã÷Ç¨ÒÆ¡£
µÚÒ»¸öÐèÒª½â¾öµÄÎÊÌâÊÇ£¬Hive µÄÒ»¸ö Bucket Ò»°ãÖ»°üº¬Ò»¸öÎļþ£¬¶ø Spark SQL
µÄÒ»¸ö Bucket ¿ÉÄܰüº¬¶à¸öÎļþ¡£½â¾ö°ì·¨ÊǶ¯Ì¬Ôö¼ÓÒ»´ÎÒÔ Bucket Key Ϊ Key
²¢ÇÒ²¢ÐжÈÓë Bucket ¸öÊýÏàͬµÄ Shuffle¡£

µÚ¶þ¸öÐèÒª½â¾öµÄÎÊÌâÊÇ£¬Hive 1.x µÄ¹þÏ£·½Ê½Óë Spark SQL 2.x µÄ¹þÏ£·½Ê½£¨Murmur3Hash£©²»Í¬£¬Ê¹µÃÏàͬµÄÊý¾ÝÔÚ
Hive ÖÐµÄ Bucket ID Óë Spark SQL ÖÐµÄ Bucket ID ²»Í¬¶øÎÞ·¨Ö±½Ó
Join¡£ÔÚ Hive ¼æÈÝģʽÏ£¬ÎÒÃÇÈÃÉÏÊö¶¯Ì¬Ôö¼ÓµÄ Shuffle ʹÓà Hive ÏàͬµÄ¹þÏ£·½Ê½£¬´Ó¶ø½â¾ö¸ÃÎÊÌâ¡£
¸Ä½ø¶þ£ºÖ§³Ö±¶Êý¹ØÏµ Bucket Join
Spark SQL ÒªÇóÖ»ÓÐ Bucket ÏàͬµÄ±í²ÅÄÜ£¨±ØÒª·Ç³ä·ÖÌõ¼þ£©½øÐÐ Bucket Join¡£¶ÔÓÚÁ½ÕÅ´óСÏà²îºÜ´óµÄ±í£¬±ÈÈ缸°Ù
GB µÄά¶È±íÓ뼸ʮ TB £¨µ¥·ÖÇø£©µÄÊÂʵ±í£¬ËüÃÇµÄ Bucket ¸öÊýÍùÍù²»Í¬£¬²¢ÇÒ¸öÊýÏà²îºÜ¶à£¬Ä¬ÈÏÎÞ·¨½øÐÐ
Bucket Join¡£Òò´ËÎÒÃÇͨ¹ýÁ½ÖÖ·½Ê½Ö§³ÖÁ˱¶Êý¹ØÏµµÄ Bucket Join£¬¼´µ±Á½ÕÅ Bucket
±íµÄ Bucket ÊýÊDZ¶Êý¹ØÏµÊ±Ö§³Ö Bucket Join¡£
µÚÒ»ÖÖ·½Ê½£¬Task ¸öÊýÓëС±í Bucket ¸öÊýÏàͬ¡£ÈçÏÂͼËùʾ£¬Table A °üº¬ 3 ¸ö
Bucket£¬Table B °üº¬ 6 ¸ö Bucket¡£´Ëʱ Table B µÄ bucket 0
Óë bucket 3 µÄÊý¾ÝºÏ¼¯Ó¦¸ÃÓë Table A µÄ bucket 0 ½øÐÐ Join¡£ÕâÖÖÇé¿öÏ£¬¿ÉÒÔÆô¶¯
3 ¸ö Task¡£ÆäÖÐ Task 0 ¶Ô Table A µÄ bucket 0 Óë Table B
µÄ bucket 0 + bucket 3 ½øÐÐ Join¡£ÔÚÕâÀÐèÒª¶Ô Table B µÄ bucket
0 Óë bucket 3 µÄÊý¾ÝÔÙ×öÒ»´Î merge sort ´Ó¶ø±£Ö¤ºÏ¼¯ÓÐÐò¡£

Èç¹û Table A Óë Table B µÄ Bucket ¸öÊýÏà²î²»´ó£¬¿ÉÒÔʹÓÃÉÏÊö·½Ê½¡£Èç¹û Table
B µÄ Bucket ¸öÊýÊÇ Bucket A Bucket ¸öÊýµÄ 10 ±¶£¬ÄÇÉÏÊö·½Ê½ËäÈ»±ÜÃâÁË
Shuffle£¬µ«¿ÉÄÜÒòΪ²¢ÐжȲ»¹»·´¶ø±È°üº¬ Shuffle µÄ SortMergeJoin ËÙ¶ÈÂý¡£´Ëʱ¿ÉÒÔʹÓÃÁíÍâÒ»ÖÖ·½Ê½£¬¼´
Task ¸öÊýÓë´ó±í Bucket ¸öÊýÏàµÈ£¬ÈçÏÂͼËùʾ£º

Ôڸ÷½°¸Ï£¬¿É½« Table A µÄ 3 ¸ö Bucket ¶Á¶à´Î¡£ÔÚÉÏͼÖУ¬Ö±½Ó½« Table A
Óë Table A ½øÐÐ Bucket Union £¨ÐµÄËã×Ó£¬Óë Union ÀàËÆ£¬µ«±£ÁôÁË Bucket
ÌØÐÔ£©£¬½á¹ûÏ൱ÓÚ 6 ¸ö Bucket£¬Óë Table B µÄ Bucket ¸öÊýÏàͬ£¬´Ó¶ø¿ÉÒÔ½øÐÐ
Bucket Join¡£
¸Ä½øÈý£ºÖ§³Ö BucketJoin ½µ¼¶
¹«Ë¾ÄÚ²¿¹ýȥʹÓà Bucket µÄ±í½ÏÉÙ£¬ÔÚÎÒÃÇ¶Ô Bucket ×öÁËһϵÁиĽøºó£¬´óÁ¿Óû§Ï£Íû½«±íת»»Îª
Bucket ±í¡£×ª»»ºó£¬±íµÄÔªÐÅÏ¢ÏÔʾ¸Ã±íΪ Bucket ±í£¬¶øÀúÊ··ÖÇøÄÚµÄÊý¾Ý²¢Î´°´ Bucket
±íÒªÇó·Ö²¼£¬ÔÚ²éѯÀúÊ·Êý¾Ýʱ»á³öÏÖÎÞ·¨Ê¶±ð Bucket µÄÎÊÌâ¡£

ͬʱ£¬ÓÉÓÚÊý¾ÝÁ¿ÉÏÕǿ죬ƽ¾ù Bucket ´óСҲ¿ìËÙÔö³¤¡£Õâ»áÔì³Éµ¥ Task ÐèÒª´¦ÀíµÄÊý¾ÝÁ¿¹ý´ó½ø¶øÒýÆðʹÓÃ
Bucket ºóµÄЧ¹û¿ÉÄܲ»ÈçÖ±½ÓʹÓûùÓÚ Shuffle µÄ Join¡£
ΪÁ˽â¾öÉÏÊöÎÊÌ⣬ÎÒÃÇʵÏÖÁËÖ§³Ö½µ¼¶µÄ Bucket ±í¡£»ù±¾ÔÀíÊÇ£¬Ã¿´ÎÐÞ¸Ä Bucket ÐÅÏ¢£¨°üº¬ÉÏÊöÁ½ÖÖÇé¿ö¡ª¡ª½«·Ç
Bucket ±íתΪ Bucket ±í£¬ÒÔ¼°ÐÞ¸Ä Bucket ¸öÊý£©Ê±£¬¼Ç¼ÐÞ¸ÄÈÕÆÚ¡£²¢ÇÒÔÚ¾ö¶¨Ê¹ÓÃÄÄÖÖ
Join ·½Ê½Ê±£¬¶ÔÓÚ Bucket ±íÏȼì²éËù²éѯµÄÊý¾ÝÊÇ·ñÖ»°üº¬¸ÃÈÕÆÚÖ®ºóµÄ·ÖÇø¡£Èç¹ûÊÇ£¬Ôòµ±³É
Bucket ±í´¦Àí£¬Ö§³Ö Bucket Join£»·ñÔòµ±³ÉÆÕͨÎÞ Bucket µÄ±í¡£
¸Ä½øËÄ£ºÖ§³Ö³¬¼¯
¶ÔÓÚÒ»Õų£ÓÃ±í£¬¿ÉÄÜ»áÓëÁíÍâÒ»ÕÅ±í°´ User ×Ö¶Î×ö Join£¬Ò²¿ÉÄÜ»áÓëÁíÍâÒ»ÕÅ±í°´ User
ºÍ App ×Ö¶Î×ö Join£¬ÓëÆäËü±í°´ User Óë Item ×ֶνøÐÐ Join¡£¶ø Spark
SQL ÔÉúµÄ Bucket Join ÒªÇó Join Key Set Óë±íµÄ Bucket Key
Set ÍêÈ«Ïàͬ²ÅÄܽøÐÐ Bucket Join¡£Ôڸó¡¾°ÖУ¬²»Í¬ Join µÄ Key Set ²»Í¬£¬Òò´ËÎÞ·¨Í¬Ê±Ê¹ÓÃ
Bucket Join¡£Õ⼫´óµÄÏÞÖÆÁË Bucket Join µÄÊÊÓó¡¾°¡£
Õë¶Ô´ËÎÊÌ⣬ÎÒÃÇÖ§³ÖÁ˳¬¼¯³¡¾°Ï嵀 Bucket Join¡£Ö»Òª Join Key Set °üº¬ÁË
Bucket Key Set£¬¼´¿É½øÐÐ Bucket Join¡£
ÈçÏÂͼËùʾ£¬Table X Óë Table Y£¬¶¼°´×Ö¶Î A ·Ö Bucket¡£¶ø²éѯÐèÒª¶Ô Table
X Óë Table Y ½øÐÐ Join£¬ÇÒ Join Key Set Ϊ A Óë B¡£´Ëʱ£¬ÓÉÓÚ A
ÏàµÈµÄÊý¾Ý£¬ÔÚÁ½±íÖÐµÄ Bucket ID Ïàͬ£¬ÄÇ A Óë B ¸÷×ÔÏàµÈµÄÊý¾ÝÔÚÁ½±íÖÐµÄ Bucket
ID ¿Ï¶¨Ò²Ïàͬ£¬ËùÒÔÊý¾Ý·Ö²¼ÊÇÂú×ã Join ÒªÇóµÄ£¬²»ÐèÒª Shuffle¡£Í¬Ê±£¬Bucket
Join »¹ÐèÒª±£Ö¤Á½±í°´ Join Key Set ¼´ A ºÍ B ÅÅÐò£¬´ËʱֻÐèÒª¶Ô Table
X Óë Table Y ½øÐзÖÇøÄÚÅÅÐò¼´¿É¡£ÓÉÓÚÁ½±ßÒѾ°´×Ö¶Î A ÅÅÐòÁË£¬´ËʱÔÙ°´ A Óë B ÅÅÐò£¬´ú¼ÛÏà¶Ô½ÏµÍ¡£

ÎﻯÁÐ
Spark SQL ´¦ÀíǶÌ×ÀàÐÍÊý¾Ýʱ£¬´æÔÚÒÔÏÂÎÊÌ⣺
¶ÁÈ¡´óÁ¿²»±ØÒªµÄÊý¾Ý£º¶ÔÓÚ Parquet / ORC µÈÁÐʽ´æ´¢¸ñʽ£¬¿ÉÖ»¶ÁÈ¡ÐèÒªµÄ×ֶΣ¬¶øÖ±½ÓÌø¹ýÆäËü×ֶΣ¬´Ó¶ø¼«´ó½ÚÊ¡
IO¡£¶ø¶ÔÓÚǶÌ×Êý¾ÝÀàÐ͵Ä×ֶΣ¬ÈçÏÂͼÖÐµÄ Map ÀàÐ굀 people ×ֶΣ¬ÍùÍùÖ»ÐèÒª¶ÁÈ¡ÆäÖеÄ×Ó×ֶΣ¬Èç
people.age¡£È´ÐèÒª½«Õû¸ö Map ÀàÐ굀 people ×Ö¶ÎÈ«²¿¶ÁÈ¡³öÀ´È»ºó³éÈ¡³ö people.age
×ֶΡ£Õâ»áÒýÈë´óÁ¿µÄÎÞÒâÒåµÄ IO ¿ªÏú¡£ÔÚÎÒÃǵij¡¾°ÖУ¬´æÔÚ²»ÉÙ Map ÀàÐ͵Ä×ֶΣ¬¶øÇҺܶà°üº¬¼¸Ê®ÖÁ¼¸°Ù¸ö
Key£¬ÕâÒ²¾ÍÒâζ×Å IO ±»·Å´óÁ˼¸Ê®ÖÁ¼¸°Ù±¶£»
ÎÞ·¨½øÐÐÏòÁ¿»¯¶ÁÈ¡£º¶øÏòÁ¿»¯¶ÁÄܼ«´óµÄÌáÉýÐÔÄÜ¡£µ«½ØÖ¹µ½Ä¿Ç°£¨2019 Äê 10 Ô 26 ÈÕ£©£¬Spark
²»Ö§³Ö°üº¬Ç¶Ì×Êý¾ÝÀàÐ͵ÄÏòÁ¿»¯¶ÁÈ¡¡£Õ⼫´óµÄÓ°ÏìÁ˰üº¬Ç¶Ì×Êý¾ÝÀàÐ͵IJéѯÐÔÄÜ£»
²»Ö§³Ö Filter ÏÂÍÆ£ºÄ¿Ç°£¨2019 Äê 10 Ô 26 ÈÕ£©µÄ Spark ²»Ö§³ÖǶÌ×ÀàÐÍ×Ö¶ÎÉϵÄ
Filter µÄÏÂÍÆ£»
ÖØ¸´¼ÆË㣺JSON ×ֶΣ¬ÔÚ Spark SQL ÖÐÒÔ String ÀàÐÍ´æÔÚ£¬ÑϸñÀ´Ëµ²»ËãǶÌ×Êý¾ÝÀàÐÍ¡£²»¹ýʵ¼ùÖÐÒ²³£ÓÃÓÚ±£´æ²»¹Ì¶¨µÄ¶à¸ö×ֶΣ¬ÔÚ²éѯʱͨ¹ý
JSON Path ³éȡĿ±ê×Ó×ֶΣ¬¶ø´óÐÍ JSON ×Ö·û´®µÄ×ֶγéÈ¡·Ç³£ÏûºÄ CPU¡£¶ÔÓÚÈȵã±í£¬Æµ·±Öظ´³éÈ¡Ïàͬ×Ó×ֶηdz£ÀË·Ñ×ÊÔ´¡£

¶ÔÓÚÕâ¸öÎÊÌ⣬×öÊý²ÖµÄͬѧҲÏëÁËһЩ½â¾ö·½°¸¡£ÈçÏÂͼËùʾ£¬ÔÚÃûΪ base_table µÄ±íÖ®Íâ´´½¨ÁËÒ»ÕÅÃûΪ
sub_table µÄ±í£¬²¢ÇÒ½«¸ßƵʹÓõÄ×Ó×Ö¶Î people.age ÉèÖÃΪһ¸ö¶îÍâµÄ Integer
ÀàÐ͵Ä×ֶΡ£ÏÂÓβ»ÔÙͨ¹ý base_table ²éѯ people.age£¬¶øÊÇʹÓà sub_table
É쵀 age ×ֶδúÌæ¡£Í¨¹ýÕâÖÖ·½Ê½£¬½«Ç¶Ì×ÀàÐÍ×Ö¶ÎÉϵIJéѯתΪÁË Primitive ÀàÐÍ×ֶεIJéѯ£¬Í¬Ê±½â¾öÁËÉÏÊöÎÊÌâ¡£

ÕâÖÖ·½°¸´æÔÚÃ÷ÏÔȱÏÝ£º
¶îÍâά»¤ÁËÒ»ÕÅ±í£¬ÒýÈëÁË´óÁ¿µÄ¶îÍâ´æ´¢/¼ÆË㿪Ïú£»
ÎÞ·¨ÔÚбíÉϲéѯÐÂÔö×ֶεÄÀúÊ·Êý¾Ý£¨ÈçÒªÖ§³Ö¶ÔÀúÊ·Êý¾ÝµÄ²éѯ£¬ÐèÒªÖØÅÜÀúÊ·×÷Òµ£¬¿ªÏú¹ý´ó£¬ÎÞ·¨½ÓÊÜ£©£»
±íµÄά»¤·½ÐèÒªÔÚÐ޸ıí½á¹¹ºóÐ޸IJåÈëÊý¾ÝµÄ×÷Òµ£»
ÐèÒªÏÂÓβéѯ·½Ð޸IJéѯÓï¾ä£¬Íƹã³É±¾½Ï´ó£»
ÔËÓª³É±¾¸ß£ºÈç¹û¸ßƵ×Ó×ֶα仯£¬ÐèҪɾ³ý²»ÔÙÐèÒªµÄ¶ÀÁ¢×Ó×ֶΣ¬²¢Ìí¼ÓÐÂ×Ó×Ö¶ÎΪ¶ÀÁ¢×ֶΡ£É¾³ýǰ£¬ÐèҪȷ±£ÏÂÓÎÎÞÒµÎñʹÓøÃ×ֶΡ£¶øÐÂÔö×Ö¶ÎÐèҪ֪ͨ²¢ÍƽøÏÂÓÎÒµÎñ·½Ê¹ÓÃÐÂ×ֶΡ£
Ϊ½â¾öÉÏÊöËùÓÐÎÊÌ⣬ÎÒÃÇÉè¼Æ²¢ÊµÏÖÁËÎﻯÁС£ËüµÄÔÀíÊÇ£º
ÐÂÔöÒ»¸ö Primitive ÀàÐÍ×ֶΣ¬±ÈÈç Integer ÀàÐ굀 age ×ֶΣ¬²¢ÇÒÖ¸¶¨ËüÊÇ
people.age µÄÎﻯ×ֶΣ»
²åÈëÊý¾Ýʱ£¬ÎªÎﻯ×Ö¶Î×Ô¶¯Éú³ÉÊý¾Ý£¬²¢ÔÚ Partition Parameter ÄÚ±£´æÎﻯ¹ØÏµ¡£Òò´Ë¶Ô²åÈëÊý¾ÝµÄ×÷ÒµÍêȫ͸Ã÷£¬±íµÄά»¤·½²»ÐèÒªÐÞ¸ÄÒÑÓÐ×÷Òµ£»
²éѯʱ£¬¼ì²éËùÐè²éѯµÄËùÓÐ Partition£¬Èç¹û¶¼°üº¬ÎﻯÐÅÏ¢£¨people.age µ½ age
µÄÓ³É䣩£¬Ö±½Ó½« select people.age ×Ô¶¯ÖØÐ´Îª select age£¬´Ó¶øÊµÏÖ¶ÔÏÂÓβéѯ·½µÄÍêȫ͸Ã÷ÓÅ»¯¡£Í¬Ê±¼æÈÝÀúÊ·Êý¾Ý¡£

ÏÂͼչʾÁËÔÚijÕźËÐıíÉÏʹÓÃÎﻯÁеÄÊÕÒæ£º

ÎﻯÊÓͼ
ÔÚ OLAP ÁìÓò£¬¾³£»á¶ÔÏàͬ±íµÄijЩ¹Ì¶¨×ֶνøÐÐ Group By ºÍ Aggregate /
Join µÈºÄʱ²Ù×÷£¬Ôì³É´óÁ¿Öظ´ÐÔ¼ÆË㣬ÀË·Ñ×ÊÔ´£¬ÇÒÓ°Ïì²éѯÐÔÄÜ£¬²»ÀûÓÚÌáÉýÓû§ÌåÑé¡£
ÎÒÃÇʵÏÖÁË»ùÓÚÎﻯÊÓͼµÄÓÅ»¯¹¦ÄÜ£º

ÈçÉÏͼËùʾ£¬²éѯÀúÊ·ÏÔʾ´óÁ¿²éѯ¸ù¾Ý user ½øÐÐ group by£¬È»ºó¶Ô num ½øÐÐ sum
»ò count ¼ÆËã¡£´Ëʱ¿É´´½¨Ò»ÕÅÎﻯÊÓͼ£¬ÇÒ¶Ô user ½øÐÐ gorup by£¬¶Ô num ½øÐÐ
avg£¨avg »á×Ô¶¯×ª»»Îª count ºÍ sum£©¡£Óû§¶ÔÔʼ±í½øÐÐ select user,
sum(num) ²éѯʱ£¬Spark SQL ×Ô¶¯½«²éÑ¯ÖØÐ´Îª¶ÔÎﻯÊÓͼµÄ select user,
sum_num ²éѯ¡£
ÆäËüÓÅ»¯
ÏÂͼչʾÁËÎÒÃÇÔÚ Spark SQL ÉϽøÐÐµÄÆäËü²¿·ÖÓÅ»¯¹¤×÷£º

Spark Shuffle Îȶ¨ÐÔÌáÉýÓëÐÔÄÜÓÅ»¯
Spark Shuffle ´æÔÚµÄÎÊÌâ
Shuffle µÄÔÀí£¬ºÜ¶àͬѧӦ¸ÃÒѾºÜÊìϤÁË¡£¼øÓÚʱ¼ä¹ØÏµ£¬ÕâÀï²»½éÉܹý¶àϸ½Ú£¬Ö»¼òµ¥½éÉÜÏ»ù±¾Ä£ÐÍ¡£

ÈçÉÏͼËùʾ£¬ÎÒÃǽ« Shuffle ÉÏÓÎ Stage ³ÆÎª Mapper Stage£¬ÆäÖÐµÄ Task
³ÆÎª Mapper¡£Shuffle ÏÂÓÎ Stage ³ÆÎª Reducer Stage£¬ÆäÖÐµÄ Task
³ÆÎª Reducer¡£
ÿ¸ö Mapper »á½«×Ô¼ºµÄÊý¾Ý·ÖΪ×î¶à N ¸ö²¿·Ö£¬N Ϊ Reducer ¸öÊý¡£Ã¿¸ö Reducer
ÐèҪȥ×î¶à M £¨Mapper ¸öÊý£©¸ö Mapper »ñÈ¡ÊôÓÚ×Ô¼ºµÄÄDz¿·ÖÊý¾Ý¡£
Õâ¸ö¼Ü¹¹´æÔÚÁ½¸öÎÊÌ⣺
Ò»¡¢Îȶ¨ÐÔÎÊÌâ
Mapper µÄ Shuffle Write Êý¾Ý´æÓÚ Mapper ±¾µØ´ÅÅÌ£¬Ö»ÓÐÒ»¸ö¸±±¾¡£µ±¸Ã»úÆ÷³öÏÖ´ÅÅ̹ÊÕÏ£¬»òÕß
IO ÂúÔØ£¬CPU ÂúÔØÊ±£¬Reducer ÎÞ·¨¶ÁÈ¡¸ÃÊý¾Ý£¬´Ó¶øÒýÆð FetchFailedException£¬½ø¶øµ¼ÖÂ
Stage Retry¡£Stage Retry »áÔì³É×÷ÒµÖ´ÐÐʱ¼äÔö³¤£¬Ö±½ÓÓ°Ïì SLA¡£Í¬Ê±£¬Ö´ÐÐʱ¼äÔ½³¤£¬³öÏÖ
Shuffle Êý¾ÝÎÞ·¨¶ÁÈ¡µÄ¿ÉÄÜÐÔÔ½´ó£¬·´¹ýÀ´ÓÖ»áÔì³É¸ü¶à Stage Retry¡£Èç´ËÑ»·£¬¿ÉÄܵ¼Ö´óÐÍ×÷ÒµÎÞ·¨³É¹¦Ö´ÐС£

¶þ¡¢ÐÔÄÜÎÊÌâ
ÿ¸ö Mapper µÄÊý¾Ý»á±»´óÁ¿ Reducer ¶ÁÈ¡£¬²¢ÇÒÊÇËæ»ú¶ÁÈ¡²»Í¬²¿·Ö¡£¼ÙÉè Mapper
µÄ Shuffle Êä³öΪ 512MB£¬Reducer ÓÐ 10 Íò¸ö£¬ÄÇÆ½¾ùÿ¸ö Reducer
¶ÁÈ¡Êý¾Ý 512MB / 100000 = 5.24KB¡£²¢ÇÒ£¬²»Í¬ Reducer ²¢ÐжÁÈ¡Êý¾Ý¡£¶ÔÓÚ
Mapper Êä³öÎļþ¶øÑÔ£¬´æÔÚ´óÁ¿µÄËæ»ú¶ÁÈ¡¡£¶ø HDD µÄËæ»ú IO ÐÔÄÜÔ¶µÍÓÚ˳Ðò IO¡£×îÖÕµÄÏÖÏóÊÇ£¬Reducer
¶ÁÈ¡ Shuffle Êý¾Ý·Ç³£Âý£¬·´Ó³µ½ Metrics ÉϾÍÊÇ Reducer Shuffle Read
Blocked Time ½Ï³¤£¬ÉõÖÁÕ¼Õû¸ö Reducer Ö´ÐÐʱ¼äµÄÒ»´ó°ë£¬ÈçÏÂͼËùʾ¡£

»ùÓÚ HDFS µÄ Shuffle Îȶ¨ÐÔÌáÉý
¾¹Û²ì£¬ÒýÆð Shuffle ʧ°ÜµÄ×î´óÒòËØ²»ÊÇ´ÅÅ̹ÊÕϵÈÓ²¼þÎÊÌ⣬¶øÊÇ CPU ÂúÔØºÍ´ÅÅÌ IO
ÂúÔØ¡£

ÈçÉÏͼËùʾ£¬»úÆ÷µÄ CPU ʹÓÃÂʽӽü 100%£¬Ê¹µÃ Mapper ²àµÄ Node Manager
ÄÚµÄ Spark External Shuffle Service ÎÞ·¨¼°Ê±Ìṩ Shuffle ·þÎñ¡£
ÏÂͼÖÐ Data Node Õ¼ÓÃÁËÕų̂»úÆ÷ IO ×ÊÔ´µÄ 84%£¬²¿·Ö´ÅÅÌ IO ÍêÈ«´òÂú£¬ÕâʹµÃ¶ÁÈ¡
Shuffle Êý¾Ý·Ç³£Âý£¬½ø¶øÊ¹µÃ Reducer ²àÎÞ·¨ÔÚ³¬Ê±Ê±¼äÄÚ¶ÁÈ¡Êý¾Ý£¬Ôì³É FetchFailedException¡£

ÎÞÂÛÊǺÎÖÖÔÒò£¬ÎÊÌâµÄÖ¢½á¶¼ÊÇ Mapper ²àµÄ Shuffle Write Êý¾ÝÖ»±£´æÔÚ±¾µØ£¬Ò»µ©¸Ã½Úµã³öÏÖÎÊÌ⣬»áÔì³É¸Ã½ÚµãÉÏËùÓÐ
Shuffle Write Êý¾ÝÎÞ·¨±» Reducer ¶ÁÈ¡¡£½â¾öÕâ¸öÎÊÌâµÄÒ»¸öͨÓ÷½·¨ÊÇ£¬Í¨¹ý¶à¸±±¾±£Ö¤¿ÉÓÃÐÔ¡£
×î³õʼµÄÒ»¸ö¼òµ¥·½°¸ÊÇ£¬Mapper ²à×îÖÕÊý¾ÝÎļþÓëË÷ÒýÎļþ²»Ð´ÔÚ±¾µØ´ÅÅÌ£¬¶øÊÇÖ±½Óдµ½ HDFS¡£Reducer
²»ÔÙͨ¹ý Mapper ²àµÄ External Shuffle Service ¶ÁÈ¡ Shuffle
Êý¾Ý£¬¶øÊÇÖ±½Ó´Ó HDFS ÉÏ»ñÈ¡Êý¾Ý£¬ÈçÏÂͼËùʾ¡£

¿ìËÙʵÏÖÕâ¸ö·½°¸ºó£¬ÎÒÃÇ×öÁ˼¸×é¼òµ¥µÄ²âÊÔ¡£½á¹û±íÃ÷£º
Mapper Óë Reducer ²»¶àʱ£¬Shuffle ¶ÁдÐÔÄÜÓëÔʼ·½°¸Ïà±ÈÎÞ²îÒì¡£
Mapper Óë Reducer ½Ï¶àʱ£¬Shuffle ¶Á±äµÃ·Ç³£Âý¡£

ÔÚÉÏÃæµÄʵÑé¹ý³ÌÖУ¬HDFS ·¢³öÁ˱¨¾¯ÐÅÏ¢¡£ÈçÏÂͼËùʾ£¬HDFS Name Node Proxy
µÄ QPS ·åÖµ´ïµ½ 60 Íò¡££¨×¢£º×Ö½ÚÌø¶¯×ÔÑÐÁË Node Name Proxy£¬²¢ÔÚ Proxy
²ãʵÏÖÁË»º´æ£¬Òò´Ë¶Á QPS ¿ÉÒÔÖ§³Åµ½Õâ¸öÁ¿¼¶£©¡£

ÔÒòÔÚÓÚ£¬×ܹ² 10000 Reducer£¬ÐèÒª´Ó 10000 ¸ö Mapper ´¦¶ÁÈ¡Êý¾ÝÎļþºÍË÷ÒýÎļþ£¬×ܹ²ÐèÒª¶ÁÈ¡
HDFS 10000 * 1000 * 2 = 2 ÒڴΡ£
Èç¹ûÖ»ÊÇ Name Node µÄµ¥µãÐÔÄÜÎÊÌ⣬»¹¿ÉÒÔͨ¹ýһЩ¼òµ¥µÄ·½·¨½â¾ö¡£ÀýÈçÔÚ Spark Driver
²à±£´æËùÓÐ Mapper µÄ Block Location£¬È»ºó Driver ½«¸ÃÐÅÏ¢¹ã²¥ÖÁËùÓÐ
Executor£¬Ã¿¸ö Reducer ¿ÉÒÔÖ±½Ó´Ó Executor ´¦»ñÈ¡ Block Location£¬È»ºóÎÞÐëÁ¬½Ó
Name Node£¬¶øÊÇÖ±½Ó´Ó Data Node ¶ÁÈ¡Êý¾Ý¡£µ«¼øÓÚ Data Node µÄÏß³ÌÄ£ÐÍ£¬ÕâÖÖ·½°¸»á¶Ô
Data Node Ôì³É½Ï´ó³å»÷¡£
×îºóÎÒÃÇÑ¡ÔñÁËÒ»ÖֱȽϼòµ¥¿ÉÐеķ½°¸£¬ÈçÏÂͼËùʾ¡£

Mapper µÄ Shuffle Êä³öÊý¾ÝÈÔÈ»°´Ô·½°¸Ð´±¾µØ´ÅÅÌ£¬Ð´ÍêºóÉÏ´«µ½ HDFS¡£Reducer
ÈÔÈ»°´Ôʼ·½°¸Í¨¹ý Mapper ²àµÄ External Shuffle Service ¶ÁÈ¡ Shuffle
Êý¾Ý¡£Èç¹ûʧ°ÜÁË£¬Ôò´Ó HDFS ¶ÁÈ¡¡£ÕâÖÖ·½°¸¼«´ó¼õÉÙÁË¶Ô HDFS µÄ·ÃÎÊÆµÂÊ¡£
¸Ã·½°¸ÉÏÏß½üÒ»Ä꣺
¸²¸Ç 57% ÒÔÉ쵀 Spark Shuffle Êý¾Ý£»
ʹµÃ Spark ×÷ÒµÕûÌåÐÔÄÜÌáÉý 14%£»
Ìì¼¶´ó×÷ÒµÐÔÄÜÌáÉý 18%£»
Сʱ¼¶×÷ÒµÐÔÄÜÌáÉý 12%¡£

¸Ã·½°¸Ö¼ÔÚÌáÉý Spark Shuffle Îȶ¨ÐÔ´Ó¶øÌáÉý×÷ÒµÎȶ¨ÐÔ£¬µ«×îÖÕûÓÐʹÓ÷½²îµÈÖ¸±êÀ´ºâÁ¿Îȶ¨ÐÔµÄÌáÉý¡£ÔÒòÔÚÓÚÿÌ켯Ⱥ¸ºÔز»Ò»Ñù£¬ÕûÌå·½²î½Ï´ó¡£Shuffle
Îȶ¨ÐÔÌáÉýºó£¬Stage Retry ´ó·ù¼õÉÙ£¬ÕûÌå×÷ÒµÖ´ÐÐʱ¼ä¼õÉÙ£¬Ò²¼´ÐÔÄÜÌáÉý¡£×îÖÕͨ¹ý¶Ô±ÈʹÓø÷½°¸Ç°ºóµÄ×ܵÄ×÷ÒµÖ´ÐÐʱ¼äÀ´¶Ô±ÈÐÔÄܵÄÌáÉý£¬ÓÃÓÚºâÁ¿¸Ã·½°¸µÄЧ¹û¡£
Shuffle ÐÔÄÜÓÅ»¯Êµ¼ùÓë̽Ë÷
ÈçÉÏÎÄËù·ÖÎö£¬Shuffle ÐÔÄÜÎÊÌâµÄÔÒòÔÚÓÚ£¬Shuffle Write ÓÉ Mapper Íê³É£¬È»ºó
Reducer ÐèÒª´ÓËùÓÐ Mapper ´¦¶ÁÈ¡Êý¾Ý¡£ÕâÖÖÄ£ÐÍ£¬ÎÒÃdzÆÖ®ÎªÒÔ Mapper ΪÖÐÐĵÄ
Shuffle¡£ËüµÄÎÊÌâÔÚÓÚ£º
Mapper ²à»áÓÐ M ´Î˳Ðòд IO£»
Mapper ²à»áÓÐ M * N * 2 ´ÎËæ»ú¶Á IO£¨ÕâÊÇ×î´óµÄÐÔÄÜÆ¿¾±£©£»
Mapper ²àµÄ External Shuffle Service ±ØÐëÓë Mapper λÓÚͬһ̨»úÆ÷£¬ÎÞ·¨×öµ½ÓÐЧµÄ´æ´¢¼ÆËã·ÖÀ룬Shuffle
·þÎñÎÞ·¨¶ÀÁ¢À©Õ¹¡£
Õë¶ÔÉÏÊöÎÊÌ⣬ÎÒÃÇÌá³öÁËÒÔ Reducer ΪÖÐÐĵģ¬´æ´¢¼ÆËã·ÖÀëµÄ Shuffle ·½°¸£¬ÈçÏÂͼËùʾ¡£

¸Ã·½°¸µÄÔÀíÊÇ£¬Mapper Ö±½Ó½«ÊôÓÚ²»Í¬ Reducer µÄÊý¾Ýдµ½²»Í¬µÄ Shuffle Service¡£ÔÚÉÏͼÖУ¬×ܹ²
2 ¸ö Mapper£¬5 ¸ö Reducer£¬5 ¸ö Shuffle Service¡£ËùÓÐ Mapper
¶¼½«ÊôÓÚ Reducer 0 µÄÊý¾ÝÔ¶³ÌÁ÷ʽ·¢Ë͸ø Shuffle Service 0£¬²¢ÓÉËü˳ÐòдÈë´ÅÅÌ¡£Reducer
0 Ö»ÐèÒª´Ó Shuffle Service 0 ˳Ðò¶ÁÈ¡ËùÓÐÊý¾Ý¼´¿É£¬ÎÞÐèÔÙ´Ó M ¸ö Mapper
È¡Êý¾Ý¡£¸Ã·½°¸µÄÓÅÊÆÔÚÓÚ£º
½« M * N * 2 ´ÎËæ»ú IO ±äΪ N ´Î˳Ðò IO¡£
Shuffle Service ¿ÉÒÔ¶ÀÁ¢ÓÚ Mapper »òÕß Reducer ²¿Ê𣬴Ӷø×öµ½¶ÀÁ¢À©Õ¹£¬×öµ½´æ´¢¼ÆËã·ÖÀë¡£
Shuffle Service ¿É½«Êý¾ÝÖ±½Ó´æÓÚ HDFS µÈ¸ß¿ÉÓô洢£¬Òò´Ë¿Éͬʱ½â¾ö Shuffle
Îȶ¨ÐÔÎÊÌâ¡£
Spark SQL δÀ´¹æ»®
δÀ´ÎÒÃÇ´ó¸Å»á×öÒÔϹ滮£º
Ö§³Ö¸ü¶àµÄË÷Òý£»
Ö§³ÖÎļþ¼¶¹ýÂË£»
Ö§³ÖһЩ ACID£»
¶¯Ì¬°´×é·ÖÇø£»
Runtime Filter£¨Ö§³Ö·ÖÇø¼¶¼°Ðм¶¹ýÂË£©£»
ÖÇÄܽ¨ÉèÊý¾Ý²Ö¿â£»
Filter ÖØÅÅ£»
¿ÉÀ©Õ¹µÄ Catalog Service¡£

ÎҵķÖÏí¾Íµ½ÕâÀлл´ó¼Ò¡£
ÍøÓÑÎÊÌ⼯½õ
Q£ºÎﻯÁÐÐÂÔöÒ»ÁУ¬ÊÇ·ñÐèÒªÐÞ¸ÄÀúÊ·Êý¾Ý£¿
A£ºÀúÊ·Êý¾ÝÌ«¶à£¬²»ÊʺÏÐÞ¸ÄÀúÊ·Êý¾Ý¡£
Q£ºÈç¹ûÓû§µÄÇëÇóͬʱ°üº¬ÐÂÊý¾ÝºÍÀúÊ·Êý¾Ý£¬ÈçºÎ´¦Àí£¿
A£ºÒ»°ã¶øÑÔ£¬Óû§ÐÞ¸ÄÊý¾Ý¶¼ÊÇÒÔ Partition Ϊµ¥Î»¡£ËùÒÔÎÒÃÇÔÚ Partition Parameter
Éϱ£´æÁËÎﻯÁÐÏà¹ØÐÅÏ¢¡£Èç¹ûÓû§µÄ²éѯͬʱ°üº¬ÁËРPartition ÓëÀúÊ· Partition£¬ÎÒÃÇ»áÔÚÐÂ
Partition ÉÏÕë¶ÔÎﻯÁнøÐÐ SQL Rewrite£¬ÀúÊ· Partition ²» Rewrite£¬È»ºó½«ÐÂÀÏ
Partition ½øÐÐ Union£¬´Ó¶øÔÚ±£Ö¤Êý¾ÝÕýÈ·ÐÔµÄǰÌáϾ¡¿ÉÄܳä·ÖÀûÓÃÎﻯÁеÄÓÅÊÆ¡£
Q£ºÄãºÃ£¬ÄãÃÇÕë¶ÔÓû§µÄ³¡¾°£¬×öÁ˺ܶàͦÓмÛÖµµÄÓÅ»¯¡£ÏñÎﻯÁС¢ÎﻯÊÓͼ£¬¶¼ÐèÒª¸ù¾ÝÓû§µÄ²éѯ Pattern
½øÐÐÉèÖá£Ä¿Ç°ÄãÃÇÊÇÈ˹¤·ÖÎöÕâЩ²éѯ£¬»¹ÊÇÓÐijÖÖ»úÖÆ×Ô¶¯È¥·ÖÎö²¢ÓÅ»¯£¿
A£ºÄ¿Ç°ÎÒÃÇÖ÷ÒªÊÇͨ¹ýһЩÉó¼ÆÐÅÏ¢¸¨ÖúÈ˹¤·ÖÎö¡£Í¬Ê±ÎÒÃÇÒ²ÕýÔÚ×öÎﻯÁÐÓëÎﻯÊÓͼµÄÍÆ¼ö·þÎñ£¬×îÖÕ×öµ½ÖÇÄܽ¨ÉèÎﻯÁÐÓëÎﻯÊÓͼ¡£
Q£º¸Õ¸Õ½éÉܵĻùÓÚ HDFS µÄ Spark Shuffle Îȶ¨ÐÔÌáÉý·½°¸£¬ÊÇ·ñ¿ÉÒÔÒì²½ÉÏ´« Shuffle
Êý¾ÝÖÁ HDFS£¿
A£ºÕâ¸öÏ뷨ͦºÃ£¬ÎÒÃÇ֮ǰҲ¿¼Âǹý£¬µ«»ùÓÚ¼¸µã¿¼ÂÇ£¬×îÖÕûÓÐÕâÑù×ö¡£µÚÒ»£¬µ¥ Mapper µÄ Shuffle
Êä³öÊý¾ÝÁ¿Ò»°ãºÜС£¬ÉÏ´«µ½ HDFS ºÄʱÔÚ 2 ÃëÒÔÄÚ£¬Õâ¸öʱ¼ä¿ªÏú¿ÉÒÔºöÂÔ£»µÚ¶þ£¬ÎÒÃǹ㷺ʹÓÃ
External Shuffle Service ºÍ Dynamic Allocation£¬Mapper
Ö´ÐÐÍê³Éºó¿ÉÄÜ Executor ¾Í»ØÊÕÁË£¬Èç¹ûÒªÒì²½ÉÏ´«£¬¾Í±ØÐëÒÀÀµÆäËü×é¼þ£¬Õâ»áÌáÉý¸´ÔÓ¶È£¬ROI
½ÏµÍ¡£
ÒÔÉÏÄÚÈÝ·ÖÏíµÄÊÇ×Ö½ÚÌø¶¯ÔÚÌáÉý»ùÓÚ Spark SQL µÄ ETL
Îȶ¨ÐÔÒÔ¼°ÓÅ»¯ ad-hoc ²éѯµÄÐÔÄÜ·½ÃæµÄʵ¼ù¡£ |