Why use hive+python to analyze data
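An analogy:
Before databases existed, people wrote programs that worked directly against the file system. That is comparable to writing MapReduce jobs by hand to analyze data.
Once databases arrived, hardly anyone touched the file system directly any more (unless there was some special requirement). People used SQL together with a language such as PHP, Java, or Python to work with the data. That is comparable to hive + python.
hive + python covers most needs. The exception is unstructured data, in which case you are back in the stone age and have to write MapReduce yourself.
So why not hive+java, hive+c, hive+...?
Because:
Python is simply very pleasant to use: it is a scripting language, needs no compilation, has powerful machine learning libraries, and is well suited to scientific computing (which is exactly what data analysis is!)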
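Analyzing data with hive+python
Division of labor between hive and python: Hive SQL serves as the data source for python, the python output serves as the map output, and Hive's aggregate functions then act as the reduce step.
The rest of this post works through one example: counting how many of each kind of food each person ate on a given date.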
Create the table user_foods (user food table)
hive> create table user_foods (user_id string, food_type string, datetime string) partitioned by (dt string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
# partitioned by (dt string): partition the table by date
# The ROW FORMAT DELIMITED clause says that rows are separated by \n and fields by \t.
Because the statistics are computed per day, the table is partitioned by dt (date) so that each analysis reads less data.
After the Hive table is created, a folder with the same name as the table appears under /hive/ on HDFS (the Hive warehouse directory in this setup).

Import data
Create a partition
hive> ALTER TABLE user_foods ADD PARTITION(dt='2014-06-07');
After the partition is created, the HDFS directory /hive/user_foods/ contains a new directory named dt=2014-06-07.
Create test data
Create a file such as data.txt and add the test data, one record per line, with the fields separated by tabs:
user_1	food1	2014-06-07 09:00
user_1	food1	2014-06-07 09:02
user_1	food2	2014-06-07 09:00
user_2	food2	2014-06-07 09:00
user_2	food23	2014-06-07 09:00
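Tabs are easy to lose when test data is copied from a web page, so it can be safer to generate the file. The following is a minimal sketch that writes the five sample records with real tab separators. The function name write_test_data and the ROWS constant are illustrative, not part of the original setup.

```python
# The five sample records from this post: (user_id, food_type, datetime)
ROWS = [
    ("user_1", "food1", "2014-06-07 09:00"),
    ("user_1", "food1", "2014-06-07 09:02"),
    ("user_1", "food2", "2014-06-07 09:00"),
    ("user_2", "food2", "2014-06-07 09:00"),
    ("user_2", "food23", "2014-06-07 09:00"),
]


def write_test_data(path, rows=ROWS):
    """Write one record per line, with fields joined by a real tab character."""
    with open(path, "w") as f:
        for row in rows:
            f.write("\t".join(row) + "\n")
```

Calling write_test_data("data.txt") produces a file whose layout matches the FIELDS TERMINATED BY '\t' and LINES TERMINATED BY '\n' clauses of the table definition.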
Import the data
hive> LOAD DATA LOCAL INPATH '/Users/life/Desktop/data.txt'
OVERWRITE INTO TABLE user_foods PARTITION(dt='2014-06-07');
After the import succeeds, check it with select * from user_foods.
Or use
hive> select * from user_foods where user_id='user_1';
This query launches a MapReduce job.
Analysis with hive alone
"Counting how many of each kind of food each person ate on a given date" is simple enough that python is not needed:
hive> select user_id, food_type, count(*) from user_foods where dt='2014-06-07' group by user_id, food_type;
Result:
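For the five test rows, the query returns one row per (user_id, food_type) pair. As a cross-check that does not need Hive, here is a minimal Python sketch of the same grouping. The function count_foods and the hard-coded rows are illustrative stand-ins and are not part of Hive.

```python
from collections import Counter

# Sample rows as (user_id, food_type); the datetime column does not affect the count
rows = [
    ("user_1", "food1"),
    ("user_1", "food1"),
    ("user_1", "food2"),
    ("user_2", "food2"),
    ("user_2", "food23"),
]


def count_foods(rows):
    """Local stand-in for: select user_id, food_type, count(*) ... group by user_id, food_type."""
    return Counter(rows)


for (user_id, food_type), n in sorted(count_foods(rows).items()):
    print("%s\t%s\t%d" % (user_id, food_type, n))
# prints:
# user_1	food1	2
# user_1	food2	1
# user_2	food2	1
# user_2	food23	1
```

Note that food2 and food23 are still counted separately at this point.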

Combining with python
If the data needs cleaning or further processing, a custom map step becomes necessary, and python can implement it.
For example, suppose food2 and food23 should count as the same type of food. Python can do this cleaning with the following script:
(m.py)
#!/usr/bin/env python
# encoding=utf-8
import sys

if __name__ == "__main__":
    # Parse each line of input
    for line in sys.stdin:
        # Skip blank lines
        if not line or not line.strip():
            continue
        # Use try so that one malformed line does not abort the whole job
        try:
            userId, foodType, dt = line.strip().split("\t")
        except ValueError:
            continue
        # Clean the data: skip records with empty fields
        if userId == '' or foodType == '':
            continue
        # Clean the data: treat food23 as food2
        if foodType == "food23":
            foodType = "food2"
        # Emit tab-separated fields; this is the map output
        print(userId + "\t" + foodType)
Then analyze the data using HQL combined with the python script:
1. Add the python script, which works like the distributed cache
2. Run the query, using transform and using
hive> add file /Users/life/Desktop/m.py;
hive> select user_id, food_type, count(*) from (
    select transform (user_id, food_type, datetime) using 'python m.py' as (user_id, food_type)
    from user_foods where dt='2014-06-07'
) tmp group by user_id, food_type;
Result:

Suggestions for debugging the python script
1. First make sure the script has no syntax errors; running python m.py verifies this
2. Make sure the code produces no other output
3. Test the script with the test data, for example:
$> cat data.txt | python m.py
user_1	food1
user_1	food1
user_1	food2
user_2	food2
user_2	food2
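If 1, 2, and 3 all check out and hive+python still fails, possible causes include:
1. The python script does not handle the data robustly: some edge case was not considered, so python raises an exception
2. Work out the rest from your own experience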
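One way to check edge cases before running the job is to move the per-line logic into a function and feed it awkward inputs directly. The following is a minimal sketch under that assumption. clean_line and NULL_MARKER are hypothetical names, and the \N check is an addition to the original m.py, based on Hive's TRANSFORM passing NULL column values to the script as the literal string \N.

```python
# Hive's TRANSFORM hands NULL column values to the script as the literal string \N
NULL_MARKER = "\\N"


def clean_line(line):
    """Return 'user_id<TAB>food_type' for a usable record, or None to skip it."""
    fields = line.rstrip("\r\n").split("\t")
    # Wrong number of columns (including blank lines): skip instead of raising
    if len(fields) != 3:
        return None
    user_id, food_type, _when = fields
    # Empty or NULL keys cannot be counted meaningfully
    if user_id in ("", NULL_MARKER) or food_type in ("", NULL_MARKER):
        return None
    # Same cleaning rule as m.py: treat food23 as food2
    if food_type == "food23":
        food_type = "food2"
    return user_id + "\t" + food_type


if __name__ == "__main__":
    samples = [
        "user_2\tfood23\t2014-06-07 09:00\n",  # normal record, gets cleaned
        "\n",                                  # blank line
        "user_1\tfood1\n",                     # missing column
        "\\N\tfood1\t2014-06-07 09:00\n",      # NULL user_id
    ]
    for s in samples:
        print(repr(s), "->", repr(clean_line(s)))
```

Only the first sample yields output; the other three return None, which is what the loop in m.py would turn into a skipped line.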
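Other notes
In the example above the python script plays the map role. You could also write a reduce.py that aggregates the map output instead of using Hive's aggregate functions.
That only makes sense once Hive itself can no longer meet your needs.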