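Editor's recommendation:
At work I use databases such as Hive and MySQL. Different databases suit different scenarios, and choosing the right way to store and process data requires understanding the underlying principles, which saves a lot of detours. This article records how Hive is implemented, along with some related concepts.
This article comes from Zhihu and was edited and recommended by Anna of 火龙果软件.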
Front
Before getting into Hive, it helps to understand a few background concepts that make Hive's implementation easier to follow.
MapReduce
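Google MapReduce is a distributed programming model that Google proposed based on the map and reduce operations of functional programming. The model hides the implementation details of the distributed cluster and leaves them to the underlying framework, so programmers with no knowledge of distributed parallel programming can still run their own programs on a distributed system.
Programming model
Map: converts one input key-value pair into a list of intermediate key-value pairs: (k1, v1) -> list(k2, v2)
Reduce: merges all intermediate key-value pairs that share the same key and produces the result for that key: (k2, list(v2)) -> (k2, v3)
An example
Take a very simple WordCount: given a large collection of documents, count how many times each word occurs. The pseudocode is below.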
map(String key, String value, Context context) {
    // key: document name
    // value: document contents
    String[] words = value.split(separator);
    for (String word : words) {
        // emit the intermediate pair (word, 1)
        context.write(word, 1);
    }
}

reduce(String word, Iterable<Integer> values, Context context) {
    // values: every count emitted for this word
    int sum = 0;
    for (Integer value : values) {
        sum += value;
    }
    // emit the final pair (word, total count)
    context.write(word, sum);
}
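The WordCount flow can also be tried on a single machine. The following is a minimal sketch in Python, not Hadoop code: the names `map_fn`, `reduce_fn` and `run_word_count` and the sample documents are made up for illustration, and a dictionary stands in for the shuffle step that the framework performs between Map and Reduce.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    # Emit (word, 1) for every whitespace-separated word in the document.
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    # Sum all the counts that were emitted for one word.
    return word, sum(counts)


def run_word_count(documents):
    # Shuffle: group the intermediate values by key, as the framework would.
    grouped = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, count in map_fn(doc_name, contents):
            grouped[word].append(count)
    # Reduce: call reduce_fn once per distinct key, in sorted key order.
    return dict(reduce_fn(word, counts) for word, counts in sorted(grouped.items()))


# Hypothetical input documents.
documents = {"a.txt": "hive runs on hadoop", "b.txt": "hive compiles sql"}
word_counts = run_word_count(documents)
print(word_counts)
```

In a real cluster the grouping happens across machines, but the contract of the two user functions is the same.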

More examples
Count of URL access frequency: the Map function processes web page request records from logs and emits (URL, 1). The Reduce function adds up all the values for the same URL and produces (URL, total count).
Reverse web-link graph: the Map function finds every link target in a source page and emits (target, source). The Reduce function concatenates all the sources for a given target into a list and emits (target, list(source)).
Inverted index: the Map function parses each document and emits a list of (word, document ID) pairs. The Reduce function receives all (word, document ID) pairs for a given word, sorts the document IDs, and emits (word, list(document ID)). The set of all outputs forms a simple inverted index, and the computation is easy to extend to track word positions within documents.
Term vector per host: a term vector summarizes the most important words in a document or a set of documents as a list of (word, frequency) pairs. The Map function emits (hostname, term vector) for each input document, with the hostname taken from the document's URL. The Reduce function receives all the term vectors for a given host, adds them together, discards low-frequency terms, and emits a final (hostname, term vector).
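As one concrete case, the inverted index example could look roughly like this. It is a single-process sketch with hypothetical names (`map_fn`, `reduce_fn`, `build_inverted_index`) and made-up documents. It emits each word once per document, which is a simplification.

```python
from collections import defaultdict


def map_fn(doc_id, contents):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(contents.split()):
        yield word, doc_id


def reduce_fn(word, doc_ids):
    # Sort the document IDs collected for one word.
    return word, sorted(doc_ids)


def build_inverted_index(documents):
    # Group emitted document IDs by word (the shuffle), then reduce per word.
    grouped = defaultdict(list)
    for doc_id, contents in documents.items():
        for word, emitted_id in map_fn(doc_id, contents):
            grouped[word].append(emitted_id)
    return dict(reduce_fn(word, ids) for word, ids in grouped.items())


# Hypothetical documents keyed by document ID.
documents = {1: "hive on hadoop", 2: "sql on hive", 3: "hadoop yarn"}
index = build_inverted_index(documents)
print(index["hive"])
```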
System implementation

First, through the MapReduce client, the user specifies the Map and Reduce functions along with the configuration for this MapReduce computation, including the number of partitions R for the intermediate key-value pairs and the hash function used to split the intermediate results. Once the user starts the computation, the overall flow can be summarized as follows:
The input files are divided into M splits, each typically 16 to 64 MB in size.
The whole computation therefore consists of M Map tasks and R Reduce tasks. The Master node picks idle Worker nodes and assigns Map and Reduce tasks to them.
A Worker that receives a Map task (also called a Mapper) reads its split, parses the content into input key-value pairs, and calls the user-defined Map function on each pair. The intermediate key-value pairs produced by the Map function are held temporarily in an in-memory buffer.
While the Map phase is running, Mappers periodically write the buffered intermediate results to their local disks, dividing them into R parts according to the user-specified partition function (hash(key) mod R by default). When a task finishes, the Mapper reports the locations of the intermediate results on its local disk to the Master.
The Master forwards the reported locations to the Reducers. When a Reducer receives this information, it reads the intermediate results that belong to its partition from the Mappers' local disks via RPC. After reading everything, the Reducer sorts the data so that key-value pairs with the same key sit next to each other.
The Reducer then collects the set of values associated with each key and calls the user-defined Reduce function with it. The output of the Reduce function is written to the result file for that Reduce partition.
In practice, the Master in a MapReduce cluster records the current completion state of every Map and Reduce task and the Worker it was assigned to. The Master is also responsible for forwarding the locations and sizes of the intermediate result files produced by Mappers to the Reducers.
Note that for each MapReduce job, M and R should both be much larger than the number of Workers in the cluster, so that load is balanced across the cluster.
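The default partition function, hash(key) mod R, is what guarantees that every pair with the same key reaches the same Reducer. Below is a small sketch of that step. It assumes `zlib.crc32` as the hash, chosen only because Python's built-in `hash()` for strings varies between processes. The function names and sample pairs are illustrative.

```python
import zlib


def partition(key, num_reducers):
    # hash(key) mod R, with crc32 as a deterministic stand-in hash function.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_into_partitions(pairs, num_reducers):
    # One bucket per Reduce task; a Mapper would write each bucket to local disk.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


R = 3  # number of Reduce tasks (assumed for the example)
intermediate = [("hive", 1), ("sql", 1), ("hive", 1), ("hadoop", 1), ("sql", 1)]
buckets = split_into_partitions(intermediate, R)
for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, bucket)
```

Which bucket a key lands in depends on the hash, but both ("hive", 1) pairs always end up in the same one.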
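Columnar storage
What is columnar storage
Traditional transactional databases usually use row-oriented storage. In the figure below, all the columns are laid out in sequence to form a row, data is stored row by row, and with a B+ tree as the index, the row for a given primary key can be found quickly.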

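Row-oriented storage is a natural fit for OLTP (online transaction processing): most operations work on a whole entity, that is, they insert, delete, update, or query an entire row, so storing a row's data in physically adjacent locations is clearly a good choice.
For OLAP (online analytical processing), however, a typical query scans the whole table and performs grouping, sorting, and aggregation, so the advantage of storing by row disappears. Worse, analytical SQL often touches only a few columns of interest rather than all of them, yet the irrelevant columns in each row still have to be scanned.
Columnar storage is designed for exactly this need. As the figure below shows, the values of one column are stored contiguously, one after another, so each column of the table forms one long array.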

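Clearly, columnar storage is unfriendly to OLTP, because writing one row means modifying several columns at once. For OLAP, though, it has major advantages:
When a query involves only some of the columns, only those columns need to be scanned.
All the values in a column have the same type and are more strongly correlated with each other, so column data compresses efficiently.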
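To make the difference in layout concrete, here is a toy in-memory comparison. The table contents and function names are invented for the example. It only illustrates which data a query has to touch and is not a model of any real storage engine.

```python
# A small table in row-oriented form: one dict per row.
rows = [
    {"id": 1, "city": "Beijing", "amount": 30},
    {"id": 2, "city": "Shanghai", "amount": 45},
    {"id": 3, "city": "Beijing", "amount": 25},
]


def to_columns(row_list):
    # Pivot a list of rows into one array per column.
    return {name: [row[name] for row in row_list] for name in row_list[0]}


def sum_row_store(row_list, column):
    # Every row, with all of its fields, is visited to read one field.
    return sum(row[column] for row in row_list)


def sum_column_store(columns, column):
    # Only the requested column array is touched.
    return sum(columns[column])


columns = to_columns(rows)
total_from_rows = sum_row_store(rows, "amount")
total_from_columns = sum_column_store(columns, "amount")
print(total_from_rows, total_from_columns)
```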
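Columnar storage and distributed file systems
In modern big data architectures, distributed file systems such as GFS and HDFS have become the mainstream way to store large datasets. Compared with a disk on a single machine, a distributed file system offers multi-replica high availability, large capacity, and low cost, but it also brings problems that a single-machine architecture does not have:
Reads and writes all go over the network. Throughput can match or even exceed a hard disk, but latency is much higher than a disk's and depends heavily on network conditions.
High-throughput sequential reads and writes are possible, but random access performs poorly and random writes are mostly unsupported. To amortize network overhead, writes are usually done in units of tens of MB.
These drawbacks are fatal for OLTP workloads, which rely heavily on random reads and writes. That is why many OLAP-oriented column stores give up OLTP capability so that they can be built on top of a distributed file system.
Getting the most out of a distributed file system comes down to a few techniques: reading data in blocks (shards), streaming reads, append-only writes, and so on. The popular open-source columnar storage models discussed later build these optimizations into the design of their storage formats.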
Columnar storage system case studies
Apache ORC
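Apache ORC was originally developed as a file format to support OLAP queries on Hive, and today it is widely used across the Hadoop ecosystem. ORC supports fields of many types, including common ones such as int and string as well as compound types such as struct, list, and map. The field metadata is stored at the end of the ORC file, which is what makes the format self-describing.
Data structure and indexes
Building indexes per partition is a common optimization. ORC's data structure has the following three levels, and each level carries index information to speed up queries.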

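File level: a single ORC file. The footer stores the metadata of the data plus index information for the file's data, such as each column's minimum and maximum values (its range), the distribution of NULL values, and Bloom filters. This information can be used to decide quickly whether the file contains the data being queried. Each ORC file contains multiple stripes.
Stripe level: a stripe corresponds to one range partition of the original table and holds the values of each column within that partition. Each stripe also has its own index in its footer, similar to the file-level index.
Row-group level: every 10,000 rows of a column form a row group, and each row group has its own row-level index with the same kinds of information as above.
A stripe in ORC is like a page in a traditional database: it is the basic unit of batch reads and writes for an ORC file. Because read and write latency on a distributed storage system is high, an I/O operation only pays off when it reads a sizable batch of data, which is similar in spirit to reading and writing a disk by pages.
Like many other storage formats, ORC places statistics and metadata at the end of the file and of each stripe rather than at the beginning. ORC has one more optimization for stripe reads and writes: the indexes for structures finer than a stripe (such as columns and row groups) are pulled out and placed together at the head of the stripe. Batch computation usually reads a whole stripe and processes it in bulk, so separating out these indexes reduces the I/O needed in batch scenarios (a batch read can skip that part).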
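The way min/max statistics let a reader skip data can be sketched in a few lines. This is not the ORC reader API. The dictionary layout, the function names, and the group size of 10 (instead of 10,000) are assumptions made to keep the example small.

```python
def build_row_groups(values, group_size):
    # Split a column into row groups and record min/max statistics for each.
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_equal(groups, target):
    # Return the matching values plus how many row groups were actually read.
    matches, groups_read = [], 0
    for group in groups:
        if target < group["min"] or target > group["max"]:
            continue  # the statistics prove the value cannot be in this group
        groups_read += 1
        matches.extend(v for v in group["values"] if v == target)
    return matches, groups_read


column = list(range(100))  # a sorted column with values 0..99
groups = build_row_groups(column, group_size=10)
matches, groups_read = scan_equal(groups, 42)
print(matches, groups_read, len(groups))
```

The same range check applies at the stripe and file levels, just with coarser statistics. The benefit is largest when the data is sorted or clustered on the filtered column.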
Dremel (2010) / Apache Parquet
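Dremel is a query system developed by Google for large-scale read-only data. It is used for fast ad-hoc queries and makes up for MapReduce's weakness in interactive querying. To avoid a second copy of the data, Dremel queries data in place, usually on a distributed file system such as GFS, which calls for a general-purpose file format.
Dremel's system design offers little that is new compared with most OLAP column-oriented databases, but its carefully designed storage format became popular, and Apache Parquet is its open-source counterpart. Note that Parquet, like ORC, is a storage format and not a complete system.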
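Nested data model
Google makes heavy internal use of Protobuf as a cross-platform, cross-language data serialization format. Compared with JSON it is more compact and more expressive. Protobuf lets users define not only required and optional fields but also repeated fields, meaning the field can occur 0 to N times, much like a variable-length array.
The Dremel format is designed to store Protobuf data by column. Because of repeated fields, this is harder than storing relational data by column. A straightforward idea would be to mark the end of each repetition with a terminator, but since the data can be very sparse, Dremel introduces a more compact format.
As an example, the left half of the figure below shows the data schema and two Document instances, and the right half shows the columns after serialization. Each serialized column has two extra columns, R and D, for the Repetition Level and the Definition Level. These two values are enough to deserialize the original data unambiguously.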

The Repetition Level says at which level the current value repeats. For a non-repeated field, it is enough to fill in the trivial value 0. Otherwise, whenever the field may repeat (whether the field itself is repeated or an enclosing structure is repeated), R records the level at which the current value repeats.
For example, Name.Language.Code has three non-NULL records in total.
The first is en-us, which appears in the first Code of the first Language of the first Name. None of these three elements has repeated yet; each appears for the first time. So R=0.
The second is en, which appears in the next Language, so Language is the element that repeats. Language is the second element in Name.Language.Code, so R=2.
The third is en-gb, which appears in the next Name. Name is the element that repeats and it is the first element in the path, so R=1.
Note that en-gb belongs to the third Name, not the second. To express this, a NULL with R=1 is placed between en and en-gb.
The Definition Level says at which level a NULL is defined, which also declares that the repetition at that level ends there. For a non-NULL field, it is enough to fill in the trivial value, namely the level of the data itself.
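Likewise, take the Name.Language.Country column as an example.
us is non-NULL, so it gets the level of the Country field, D=3.
The first NULL is inside record r1. It means that the next Language within the current Name has no Country field: Name and Language are defined but Country is not, so D=2.
The second NULL is also inside r1. It means that the next Name in the current Document has no Language and therefore no Country: only Name is defined, so D=1.
gb is non-NULL, so it gets the level of the Country field, D=3.
The last NULL is inside record r2 and means that this Document has no Country field at all. In the example from the Dremel paper, r2 still contains one Name (with only a Url), so Name is defined and D=1. D would be 0 only if the Document had no Name.
It can be shown that R and D together always suffice to reconstruct the original data uniquely. For efficient encoding and decoding, Dremel first builds a state machine at execution time and then uses it to process the column data. The state machine also uses the query's needs and the structure of the data to skip irrelevant data directly.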
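The R and D values above can be reproduced with a short recursive walk over the two sample records from the Dremel paper. This is a sketch written for this one schema, not Dremel's or Parquet's actual encoder: the `shred` function, the dict-based records, and the path notation are all assumptions. It follows the paper's rule that only optional and repeated fields count toward D, which is why the required Code field tops out at D=2.

```python
def shred(record, path):
    """Return (value, r, d) triples for one column of one record."""
    out = []

    def walk(node, depth, r, d):
        if depth == len(path):
            out.append((node, r, d))
            return
        name, kind = path[depth]
        # Repetition level of this field = number of repeated fields so far.
        level = sum(1 for _, k in path[:depth + 1] if k == "repeated")
        value = node.get(name)
        if kind == "repeated":
            if not value:
                out.append((None, r, d))  # nothing defined below this point
            for i, item in enumerate(value or []):
                # The first item inherits r; later items repeat at this level.
                walk(item, depth + 1, r if i == 0 else level, d + 1)
        elif kind == "optional":
            if value is None:
                out.append((None, r, d))
            else:
                walk(value, depth + 1, r, d + 1)
        else:  # required: always present, does not raise the definition level
            walk(value, depth + 1, r, d)

    walk(record, 0, 0, 0)
    return out


# The Name part of the two sample documents from the Dremel paper.
r1 = {"DocId": 10, "Name": [
    {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}], "Url": "http://A"},
    {"Url": "http://B"},
    {"Language": [{"Code": "en-gb", "Country": "gb"}]},
]}
r2 = {"DocId": 20, "Name": [{"Url": "http://C"}]}

CODE = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
COUNTRY = [("Name", "repeated"), ("Language", "repeated"), ("Country", "optional")]

code_column = shred(r1, CODE) + shred(r2, CODE)
country_column = shred(r1, COUNTRY) + shred(r2, COUNTRY)
for triple in country_column:
    print(triple)
```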
Hive
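Hive is an OLAP data warehouse built on the Hadoop ecosystem, using HDFS for storage, MapReduce as the compute engine, and YARN for resource allocation and fault tolerance. In more detail:
Hive is a SQL parsing engine. It translates SQL statements into MapReduce jobs and runs them on the Hadoop platform, which speeds up development.
Tables in Hive are purely logical: they are only table definitions, that is, table metadata. The data itself is just directories and files in Hadoop, so metadata is separated from data storage.
Hive does not store data itself; it relies entirely on HDFS and MapReduce.
Hive workloads are read-heavy and write-light, and by default Hive does not support update and delete on data.
Hive architecture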

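SQL is submitted to Hive from the external CLI, the Hive Thrift Server, or the Web UI and passed to the Driver. The Driver parses the SQL into a MapReduce execution plan, applies logical and physical optimization, and submits it to MapReduce for execution. Any data to be written goes into HDFS files, and the corresponding metadata is recorded in the Metastore.
Hive storage file formats
All Hive data is stored on HDFS in directories as files of a given format, and the characteristics of the storage depend on the file format. When creating a table, Hive lets you specify the format with STORED AS (TextFile|RCFile|SequenceFile|AVRO|ORC|Parquet).
The characteristics of each format are as follows.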
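TextFile: row-oriented storage. Each line is one record and ends with a newline (\n). Data is not compressed, so disk overhead and parsing overhead are both high. It can be combined with Gzip or Bzip2 (detected automatically by the system and decompressed automatically at query time), but when used this way Hive does not split the data, so it cannot be processed in parallel.
SequenceFile: row-oriented storage. A binary file format provided by the Hadoop API that is easy to use, splittable, and compressible. It supports three compression options: NONE, RECORD, and BLOCK. RECORD has a low compression ratio, so BLOCK compression is generally recommended.
RCFile: a hybrid of row and column storage. First, it partitions the data into blocks by row, which guarantees that a record lives in a single block and avoids reading multiple blocks to read one record. Second, within each block the data is stored by column, which helps compression and fast column access.
AVRO: an open-source project that provides data serialization and data exchange services for Hadoop. It lets you exchange data between the Hadoop ecosystem and programs written in any programming language. Avro is one of the popular file formats in Hadoop-based big data applications.
ORC: columnar storage. Using ORC files improves performance when Hive reads, writes, and processes data from large tables.
Parquet: columnar storage. A column-oriented binary file format, so the files are not directly human-readable.
For read-heavy, write-light workloads that use only Hive, prefer ORC to improve performance.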
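HQL parsing flow
Hive compiles Hive SQL into MapReduce. Before looking at how SQL is compiled into MapReduce, first look at the basic principles by which the MapReduce framework implements SQL operations.
How MapReduce implements basic SQL operations
How Join is implemented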
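select u.name, o.orderid from order o join user u on o.uid = u.uid;
In the map output value, the rows from each table are marked with a tag, and in the reduce phase the tag identifies which table a row came from. The MapReduce process is shown below (this only illustrates the most basic Join implementation; there are other ways to implement it).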
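The tagging scheme can be sketched as a reduce-side join in plain Python. The sample rows, the tag values 0 and 1, and the function names are all illustrative, and a dictionary stands in for the shuffle. Hive's generated operators are more involved than this.

```python
from collections import defaultdict

# Hypothetical contents of the two tables in the query.
orders = [{"orderid": 101, "uid": 1}, {"orderid": 102, "uid": 2}, {"orderid": 103, "uid": 1}]
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}, {"uid": 3, "name": "cat"}]


def map_orders(row):
    # Key on the join column; tag 0 marks rows that come from the order table.
    return row["uid"], (0, row["orderid"])


def map_users(row):
    # Tag 1 marks rows that come from the user table.
    return row["uid"], (1, row["name"])


def reduce_join(uid, tagged_values):
    # Separate the two sides by tag, then emit their cross product.
    order_ids = [v for tag, v in tagged_values if tag == 0]
    names = [v for tag, v in tagged_values if tag == 1]
    return [(name, orderid) for name in names for orderid in order_ids]


def run_join():
    grouped = defaultdict(list)  # shuffle: group tagged values by uid
    for key, value in [map_orders(r) for r in orders] + [map_users(r) for r in users]:
        grouped[key].append(value)
    result = []
    for uid in sorted(grouped):
        result.extend(reduce_join(uid, grouped[uid]))
    return result


joined = run_join()
print(joined)
```

The user with uid 3 has no orders, so the inner join emits nothing for that key.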

How Group By is implemented
select rank, isonline, count(*) from city group by rank, isonline;
The Group By fields are combined into the map output key, MapReduce's sorting is used, and in the reduce phase a LastKey is kept to tell different keys apart. The MapReduce process is shown below (this only illustrates the non-hash aggregation on the Reduce side).
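A sketch of this sort-based aggregation follows, using invented rows for the city table. `sorted()` stands in for the shuffle sort, and `reduce_sorted` keeps a `last_key` variable the way the text describes. The names are assumptions and not Hive's operator code.

```python
# Hypothetical rows of the city table.
city = [
    {"rank": "A", "isonline": 1}, {"rank": "B", "isonline": 0},
    {"rank": "A", "isonline": 1}, {"rank": "A", "isonline": 0},
    {"rank": "B", "isonline": 0},
]


def map_fn(row):
    # The Group By fields together form the output key; the value is a count of 1.
    return (row["rank"], row["isonline"]), 1


def reduce_sorted(sorted_pairs):
    # Stream over key-sorted pairs, flushing a result whenever the key changes.
    results, last_key, count = [], None, 0
    for key, value in sorted_pairs:
        if key != last_key and last_key is not None:
            results.append((*last_key, count))
            count = 0
        last_key = key
        count += value
    if last_key is not None:
        results.append((*last_key, count))  # flush the final group
    return results


shuffled = sorted(map_fn(row) for row in city)  # stands in for the shuffle sort
grouped_counts = reduce_sorted(shuffled)
print(grouped_counts)
```

Because the input is sorted, the reducer needs memory for only one group at a time, which is the point of the LastKey approach.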
How Distinct is implemented
select dealid, count(distinct uid) num from order group by dealid;
When there is only one distinct field, and ignoring the hash Group By in the Map phase, it is enough to combine the Group By field and the Distinct field into the map output key, rely on MapReduce's sorting, use the Group By field as the reduce key, and keep a LastKey in the reduce phase to remove duplicates.
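The same LastKey idea handles the single-distinct case. In this sketch the rows are made up and one reducer sees everything. In a real job the records would be partitioned by dealid alone, so that all uids of a deal reach the same reducer.

```python
# Hypothetical rows of the order table.
order = [
    {"dealid": 1, "uid": "u1"}, {"dealid": 1, "uid": "u2"},
    {"dealid": 1, "uid": "u1"}, {"dealid": 2, "uid": "u3"},
    {"dealid": 2, "uid": "u3"},
]


def map_fn(row):
    # Composite key: (Group By field, Distinct field).
    return (row["dealid"], row["uid"])


def reduce_sorted(sorted_keys):
    # Count a uid only when the composite key differs from the previous one.
    counts, last_key = {}, None
    for key in sorted_keys:
        if key != last_key:
            dealid = key[0]
            counts[dealid] = counts.get(dealid, 0) + 1
        last_key = key
    return counts


shuffled = sorted(map_fn(row) for row in order)  # duplicates become adjacent
distinct_counts = reduce_sorted(shuffled)
print(distinct_counts)
```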

How HQL is converted to MapReduce

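The main steps for converting HQL into MapReduce execution are as follows.
Syntax parsing: HQL is parsed into an AST (abstract syntax tree); the entry point is the ParseDriver.run() method. This step relies on Antlr3 for lexical and syntactic parsing of SQL. Antlr is not covered in detail here; it is enough to know that building a language with Antlr only requires writing a grammar file that defines the lexical and syntactic rewrite rules, and Antlr then handles lexical analysis, syntax analysis, semantic analysis, and intermediate code generation.
Semantic analysis, phase one: the AST is still very complex and not structured enough to translate directly into a MapReduce program. Converting the AST into QueryBlocks abstracts and structures the SQL further. A QueryBlock is the most basic unit of a SQL statement and has three parts: input source, computation, and output. Put simply, a QueryBlock is a subquery.
Query plan generation: the QueryBlocks are translated into an Operator tree. In the MapReduce jobs Hive finally generates, both the Map phase and the Reduce phase are made up of Operator trees. An Operator performs a single specific operation in the Map phase or the Reduce phase.
Logical optimization: the generated Operator tree is optimized and operators are merged, to reduce the number of MapReduce jobs and the amount of shuffled data.
MR task generation: the Operator tree is translated into a directed acyclic graph (DAG) of Tasks.
Physical optimization: the Task DAG is rewritten, turning some ordinary Tasks into ConditionalTasks that can choose a branch at run time.
Execution: the Task DAG is handed to the framework for execution.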
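Summary
Hive is an OLAP data warehouse. It suits offline analytical computation over large data volumes where low latency is not required, such as report analysis. For workloads with many transactional operations, use an OLTP database such as MySQL.
Hive supports several storage file formats. The default is the row-oriented TextFile, which takes more space and has poor read performance for computation. When using Hive, prefer the ORC format, which compresses well and reads fast.
Hive SQL runs on MapReduce underneath and must be submitted to YARN, which allocates resources to the MapReduce tasks. Submitting many Hive SQL statements at once can put heavy load on the master node's task scheduling. If a program needs to issue Hive SQL inserts, use batch-insert SQL or some other approach (for example, a user-defined function that reads from MySQL).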