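Editor's note:
This article covers tracks five, six, and seven of the data scientist roadmap: natural language processing, data visualization, and big data.
It comes from the WeChat account of 秦路 (Qin Lu) and was edited and recommended by Anna of 火龙果软件.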
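Part 1 of the Data Scientist Growth Guide (《数据科学家成长指南(上)》) outlined the key points of fundamentals, statistics, programming, and machine learning. Today's update continues with tracks five, six, and seven: natural language processing, data visualization, and big data.

Ready to spend the new year learning what is billed as the sexiest job of the next five years?
------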
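Text Mining / NLP
Text mining and natural language processing form an interdisciplinary field that spans anthropology and linguistics. Chinese NLP is harder than English NLP because of the nature of the language: in English the smallest unit is the word, and words are separated by spaces, whereas in Chinese the smallest unit is the character, and characters run together with no delimiters. Chinese text therefore has to be segmented into words before any further processing, and the quality of that segmentation determines the quality of everything downstream.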
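Corpus
A corpus is a large collection of electronic text and is the foundation of natural language work. There is no fixed type of corpus: academic literature, novels, and news can all serve, depending on the purpose. A corpus should be balanced across the kinds of text it covers; a news corpus, for example, should include news on every kind of subject.
A corpus also has to be processed. A txt file downloaded at random from the web is not a corpus; it needs linguistic annotation such as part-of-speech tags, named entities, and syntactic structure. English corpora are fairly mature, while Chinese corpora are still developing.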
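NLTK-Data
The Natural Language Toolkit.
NLTK was created in 2001 and has grown into one of the best toolkits for English. It ships with a number of important modules and rich corpora, such as nltk.corpus and nltk.utilities. Python's NLTK and R's tm are the mainstream toolkits for English; they can also handle Chinese, provided the text is segmented first. There are many Chinese-specific packages as well: TextRank, Jieba, HanLP, FudanNLP, NLPIR, and others.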
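Named Entity Recognition
A named entity is a specific noun phrase such as an organization, a person, a time, or a place. Named entity recognition (NER) identifies every named entity in a text and is a basic building block of NLP.
NER involves two steps: finding the boundaries of each entity, and determining its type. Entity recognition is harder for Chinese. For example, 南京市长江大桥 can be split as 南京 | 市长 | 江大桥 ("Nanjing | mayor | Jiang Daqiao") or as 南京市 | 长江大桥 ("Nanjing City | Yangtze River Bridge"); resolving this is the job of word segmentation. Determining the type means deciding whether the entity is a place, a time, or something else. You can think of it as a data type for text.
There are two main families of NER methods: those based on rules and dictionaries, and those based on machine learning. Rule-based methods rely mainly on a dictionary to cut out entities correctly; the machine learning methods are dominated by hidden Markov models, maximum entropy models, and conditional random fields.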
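The boundary ambiguity in 南京市长江大桥 can be reproduced with a toy dictionary-based segmenter. The following is a minimal sketch of forward maximum matching, assuming two made-up dictionaries; the function name `forward_max_match` is hypothetical, and real segmenters such as Jieba use much richer models than this.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink; a single character
        # is always accepted so the loop is guaranteed to advance.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


sentence = "南京市长江大桥"
dict_a = {"南京", "市长", "江大桥"}               # toy dictionary without 南京市
dict_b = {"南京", "南京市", "市长", "长江大桥"}    # toy dictionary with 南京市 and 长江大桥

print(" | ".join(forward_max_match(sentence, dict_a)))
print(" | ".join(forward_max_match(sentence, dict_b)))
```

The same sentence yields different boundaries depending only on what the dictionary contains, which is why segmentation quality matters so much for the entity-typing step that follows.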
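Text Analysis
This is a broad interdisciplinary area. From the viewpoint of linguistics research, text analysis comprises syntactic analysis and semantic analysis, and progress on the latter is currently slow. The main goal of syntactic analysis is to build a correct parse tree out of verbs, nouns, prepositions, and so on.

Short of deep research, there are more mature applications such as text similarity, word prediction, and perplexity.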
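UIMA
UIMA is a component architecture and software framework implementation for analyzing unstructured content such as text, video, and audio. Its goal is to provide a common platform for unstructured analytics, with reusable analysis components that cut down on duplicated development.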
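Term Document Matrix
This is a two-dimensional matrix whose rows are terms and whose columns are documents. It records how often every term appears in every document, which makes it high-dimensional and sparse.

The matrix is the basis of the TF-IDF (term frequency-inverse document frequency) algorithm. TF is the frequency of a term within a document and describes how important the term is to that document. IDF is the inverse document frequency and describes how important the term is across all documents. The assumption is that a word appearing in every document must be a common word such as 的 ("of"), 你好 ("hello"), or 是不是 ("is it or not") and carries little weight, while rarer words matter more. IDF is therefore obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm.
With a term-document matrix, TF-IDF can be computed quickly using matrix operations.
Its variant is the Document Term Matrix, with rows and columns swapped.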
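To make the definitions concrete, here is a small sketch that builds a term-document matrix and derives TF-IDF from it, using IDF = log(total documents / documents containing the term) as described above. The three tiny documents and all variable names are invented for illustration; TF is taken as count divided by document length, which is one common convention among several.

```python
import math
from collections import Counter

# Three toy documents, already tokenized (hypothetical data).
docs = {
    "d1": ["data", "science", "data"],
    "d2": ["data", "visualization"],
    "d3": ["big", "data", "hadoop"],
}

vocab = sorted({w for words in docs.values() for w in words})
doc_ids = sorted(docs)
counts = {d: Counter(words) for d, words in docs.items()}

# Term-document matrix: one row per term, one column per document (raw counts).
tdm = [[counts[d][term] for d in doc_ids] for term in vocab]


def tf(term, d):
    """Frequency of the term within one document."""
    return counts[d][term] / len(docs[d])


def idf(term):
    """log(total documents / documents containing the term)."""
    df = sum(1 for d in doc_ids if counts[d][term] > 0)
    return math.log(len(doc_ids) / df)


tfidf = {term: [tf(term, d) * idf(term) for d in doc_ids] for term in vocab}

for term in vocab:
    print(term, [round(v, 3) for v in tfidf[term]])
```

Because "data" occurs in every document its IDF is log(1) = 0, so its TF-IDF is zero everywhere, while a word confined to one document, such as "hadoop", keeps a positive score.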
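Term Frequency & Weight
Term frequency is the number of times a word appears in a document, where a document can be a news article, a piece of text, or even a single conversation. Frequency indicates importance: in general, the more often a word appears, the more important we take it to be. But many frequent words are useless, such as the particles 的, 地, and 得. These are stop words and should be removed. Another group is everyday phrases such as 你好 ("hello") and 谢谢 ("thank you"), which do nothing for text analysis. To set these apart we add weights.
A weight expresses how important a word is. Words like 你好 and 谢谢 get a very low weight and are treated as nearly worthless. Weights can be assigned by hand or computed. Domain-specific vocabulary usually gets a higher weight and common words a lower one.
Combining frequency and weight lets us extract the feature words that represent a text; the classic algorithm is TF-IDF.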
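Support Vector Machines
An SVM is a binary classification model. Unlike the perceptron, it looks for the linear classifier with the maximum margin. With a kernel function it can also act as a nonlinear classifier.
SVMs can be subdivided into linearly separable SVMs, linear SVMs, and nonlinear SVMs.
Nonlinear problems are hard to solve directly, so the idea (illustrated in the original figure) is to map the nonlinear feature space into a new space where the problem becomes linear classification. Put plainly, the kernel function turns a hypersurface in the input space (Euclidean space or a discrete set, the left panel of the figure) into a hyperplane in the feature space (a Hilbert space, the right panel).

Commonly used kernels include the polynomial kernel, the Gaussian kernel, and the string kernel.
String kernels are used in text classification, information retrieval, and similar tasks. SVMs perform well on high-dimensional text classification, which is why they appear on the NLP track.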
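The mapping idea can be checked numerically. The sketch below, with hypothetical function names and arbitrary sample points, defines a degree-2 polynomial kernel and a Gaussian kernel, and shows that the polynomial kernel computed in the original 2-D space equals an ordinary inner product after an explicit mapping into a 3-D feature space. It illustrates kernels only and is not an SVM implementation.

```python
import math


def poly_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel: (x . z) ** degree."""
    return sum(a * b for a, b in zip(x, z)) ** degree


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def phi(x):
    """Explicit feature map matching the degree-2 kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 0.5)

implicit = poly_kernel(x, z)                            # computed in the input space
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))   # inner product in the mapped space

print(implicit, explicit)
print(gaussian_kernel(x, z))
```

Both numbers agree up to floating-point rounding, which is the point of a kernel: the classifier works as if the data had been mapped to the new space without ever constructing that space explicitly.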
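Association Rules
Association rules mine the information hidden behind data. The best-known example is beer and diapers, fictional though the story is. Its meaning is easy to grasp: people who buy diapers are more likely to buy beer.
A rule is an implication of the form X→Y and is one-directional: people who buy diapers are more likely to buy beer, but people who buy beer will not necessarily buy diapers. Support and confidence are introduced to quantify this. Support indicates how likely the rule is to occur in the data as a whole; if few people buy diapers and beer together, support is low. Confidence indicates how reliable the inference from X to Y is, that is, whether people who bought diapers really do go on to buy beer.
The core of association rule mining is finding frequent itemsets, and Apriori is the typical algorithm. Frequent itemsets are found by repeated passes over the data, so the time complexity is high and performance on large datasets is poor.
Like collaborative filtering, association rules are a way of finding similarity. The difference is that collaborative filtering looks for similar people and groups them to make personalized recommendations, whereas association rules have no notion of filtering: they work on the population as a whole, independent of individual preference, and the result applies to everyone.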
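Support and confidence are easy to compute by hand on a small example. The sketch below uses five invented transactions and hypothetical helper names; it only evaluates a given rule and does not implement Apriori's frequent-itemset search.

```python
# Five toy shopping baskets (hypothetical data).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"beer", "bread"},
]


def support(itemset):
    """Share of all transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y):
    """Of the transactions containing X, the share that also contain Y."""
    return support(x | y) / support(x)


print("support(diapers, beer)    =", support({"diapers", "beer"}))
print("confidence(diapers->beer) =", round(confidence({"diapers"}, {"beer"}), 3))
print("confidence(beer->diapers) =", round(confidence({"beer"}, {"diapers"}), 3))
```

The two confidence values differ even though they share the same support, which is the one-directional property of X→Y described above.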
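Market Based Analysis
This should read Market Basket Analysis; the roadmap's author made a slip of the pen.
In traditional retail, a basket is the set of goods a customer buys in one trip, and the receipt data from the checkout is recorded for analysis. More advanced basket analysis also uses infrared and radio-frequency sensing to record product placement, customer movement through the store, foot traffic, and similar data.
Association rules are the main application of market basket analysis, but it also covers the effect of promotions and discounts on sales, analysis of membership and loyalty-point schemes, and analysis of returning versus new customers.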
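Feature Extraction
This is a key concept in feature engineering. Data and features set the upper limit of what machine learning can achieve; models and algorithms merely approach that limit. Many models run into the curse of dimensionality, where there are so many dimensions that performance becomes a bottleneck. It is common in text, image, and audio work.
To deal with it we perform feature extraction, transforming the raw features into the most informative ones. Fingerprint and handwriting recognition deal with concrete things that leave traces to follow, whereas something like facial expression recognition is a more abstract notion. That is part of the challenge of feature extraction.
Extraction methods differ by modality. For text there are TF-IDF, information gain, and so on; linear feature extraction includes PCA and LDA; nonlinear feature extraction includes kernel methods.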
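Using Mahout
Mahout is the distributed machine learning framework of the Hadoop ecosystem; the name means elephant driver.
Mahout covers three themes: recommender systems, clustering, and classification, each suited to different scenarios.
Running on Hadoop with the MapReduce (MR) computing framework, Mahout makes many data mining tasks simpler to handle. In practice Mahout no longer maintains new MR implementations and has moved to Spark, where it complements MLlib.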
Using Weka
Weka is free, open-source machine learning and data mining software built on Java.
Using NLTK
Using the Natural Language Toolkit (see NLTK-Data above).
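Classify Text
Classifying a collection of texts is not fundamentally different from any other classification task. Suppose we want to classify product reviews as positive or negative. After word segmentation the text has to be turned into features. Text is inevitably high-dimensional, so we cannot use every word as a feature; instead we should pick the words that best represent the text, for example words that appear only in positive reviews such as 特别棒 ("really great"), 很好 ("very good"), and 完美 ("perfect"). Compute the chi-square statistic or the information gain of each word and use the top-ranked words as features.
The features of the reviews are then [word11, word12, ...], [word21, word22, ...], which are converted into a high-dimensional sparse matrix. After that it is a matter of choosing the most suitable algorithm.
Spam filtering, pornography detection, and article categorization are all applications of this kind.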
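The feature-selection step can be sketched with the chi-square statistic for a 2x2 table (word present or absent, review positive or negative). Everything below is illustrative: the six toy reviews, the function names, and the choice of four features are assumptions, and a real pipeline would segment Chinese text first and use a proper sparse matrix.

```python
# Toy, already-tokenized reviews (hypothetical data).
pos = [["great", "perfect", "fast"], ["great", "love"], ["perfect", "fast"]]
neg = [["broken", "slow"], ["slow", "refund", "fast"], ["broken", "refund"]]


def chi_square(word):
    """Chi-square statistic of a word against the positive/negative label."""
    a = sum(1 for doc in pos if word in doc)   # positive reviews containing the word
    b = sum(1 for doc in neg if word in doc)   # negative reviews containing the word
    c = len(pos) - a                           # positive reviews without the word
    d = len(neg) - b                           # negative reviews without the word
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def to_vector(review, features):
    """Binary bag-of-words vector restricted to the selected features."""
    return [1 if w in review else 0 for w in features]


vocab = sorted({w for doc in pos + neg for w in doc})
features = sorted(vocab, key=chi_square, reverse=True)[:4]

for word in vocab:
    print(word, round(chi_square(word), 3))
print(features, to_vector(pos[0], features))
```

Words that occur on only one side of the label score highest, while a word like "fast" that shows up in both classes scores low and is a poor feature.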
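Vocabulary Mapping
NLP has an important pair of concepts: ontology and entity. An ontology term is a class and an entity is an instance. For example, mobile phone is an ontology term, while iPhone and 小米 (Xiaomi) are entities; together they make up a knowledge base. Many words have several meanings, and many meanings have several words: 苹果 ("apple") can be a phone or a fruit, and the iPhone is variously called 水果机 ("fruit phone"), 苹果机 ("Apple phone"), iPhone 3/4/5/6/7, and more. A computer cannot make sense of such tangled meanings on its own. Vocabulary mapping unifies words with closely related meanings into a single term so that the computer's understanding matches a person's.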
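In its simplest form, vocabulary mapping is a lookup from variant spellings to one canonical term. The sketch below assumes a hand-written table; the names `CANONICAL` and `normalize` are hypothetical, and a plain lookup like this cannot resolve genuinely ambiguous words such as 苹果, which need context.

```python
# Hand-written synonym table (an assumption for illustration).
CANONICAL = {
    "水果机": "iPhone",
    "苹果机": "iPhone",
    "iphone": "iPhone",
}


def normalize(tokens):
    """Replace every known variant with its canonical term; leave other tokens alone."""
    return [CANONICAL.get(token.lower(), token) for token in tokens]


print(normalize(["水果机", "苹果机", "iPhone", "小米"]))
```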
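------
Visualization
This is the less difficult part of the roadmap. Statistics and big data keep evolving and demand lifelong learning, whereas visualization, once understood, stays useful for many years. The programming side of visualization is not covered here.
Uni, Bi & Multivariate Viz
Univariate, bivariate, and multivariate visualization.
In data visualization, different combinations of variables or dimensions produce different kinds of charts. Univariate, bivariate, and multivariate data each call for their own plotting approaches.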
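ggplot2
A classic visualization package for R.
The core logic of ggplot2 is to build a plot layer by layer, with each statement representing one layer, which keeps the individual plotting elements separate.
ggplot(...) +
geom(...) +
stat(...) +
annotate(...) +
scale(...)
The snippet above shows the typical ggplot2 calling style. ggplot is the overall chart, geom is the drawing function, stat is the statistical function, annotate is the annotation function, and scale is the scale function. The default ggplot2 look is white grid lines on a gray background.
The drawback of ggplot2 is that drawing is fairly slow, a consequence of the layered approach. The flaw does not outweigh the merits, though, and ggplot2 remains the reason many people use R.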
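Histogram & Pie(Uni)
Histograms and pie charts (univariate).
Histograms have already been covered, so the original gave only a figure here.

Pie charts are not used much. When the values differ only slightly, say 35% versus 40%, the eye cannot tell the slices apart by area.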
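Tree & Tree Map
Tree diagrams and treemaps.
A tree diagram represents a structure; the dendrogram produced by hierarchical clustering is one example.
When a dimension has many values that need to be compared, a treemap works well: area encodes the size of each value and color encodes the category.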

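Scatter Plot (Bi)
Scatter plots (bivariate).
Scatter plots are used constantly in data exploration to examine the relationship between two variables, and they are also useful when exploring regression and classification problems.

A scatter plot matrix extends the bivariate view to multiple variables.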

Line Charts (Bi)
Line charts (bivariate).
Line charts are typically used to show trends and changes, and they go hand in hand with the time dimension.

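Spatial Charts
Spatial charts, which presumably means maps.
Any data with a spatial attribute can be shown on a map. A map needs data that expresses location, either latitude and longitude or a geographic entity such as Shanghai or Beijing. Latitude and longitude data is often tied to points of interest (POI).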

Survey Plot
The exact meaning is unclear to me; roughly, exploratory plotting.
plot is the most commonly used function in R. By calling plot(x, y) with different arguments we choose which kind of chart to draw.
Timeline
When data involves time or has a natural order, a timeline is a good fit. Many visualization frameworks also support playing data back over time to show how it changes.

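Decision Tree
The decision tree here is not the algorithm but an application that builds on its good interpretability.

When the data involves yes/no decisions or choices between options, a decision tree is a sound way to visualize the logic.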
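D3.js
A well-known front-end framework for data visualization.
D3 can produce complex graphics. For charts such as histograms and scatter plots, other frameworks are a better choice because they are easier to learn than D3.
D3 is based on SVG, and its performance degrades as the data volume grows and the computation gets more complex. Canvas performs considerably better, and ECharts, developed in China, is built on it. ECharts has Chinese documentation and is a fairly friendly framework.
R has a package called d3Network and Python has d3py; you can of course also set up a D3 environment directly.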
IBM ManyEyes
Many Eyes is an online visualization tool from IBM that can visualize numbers, text, and other data. It appears to be free. (The accompanying figure was taken from the web.)

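Tableau
A well-known commercial BI product from abroad, sold as Desktop and Server: the former is the single-machine data analysis edition and the latter supports on-premises deployment. Together they cost several thousand US dollars, which is expensive. (The accompanying figure was taken from the web.)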

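------
Big Data
An increasingly hot technical concept. Hadoop had been on the rise for only a few years before the second-generation Spark overtook it.
The roadmap was written some time ago, so it does not cover much of today's newer technology. I have skipped some of the tools I am not familiar with.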
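Map Reduce Fundamentals
The MapReduce framework.
MapReduce is the core concept of Hadoop. It splits a computation into many processing units and distributes them across servers.
There is a nice way to explain it. To count the cards in a deck, the traditional approach is to have one person count them all. MapReduce instead brings in a group of people, has each count a portion, and then combines the results. Handing each person a share to count is Map; combining the partial counts is Reduce.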
Hadoop Components
Hadoop is described as an ecosystem, and it is indeed pieced together from a great many components.

The components include HDFS, MapReduce, Hive, HBase, Zookeeper, Sqoop, Pig, Mahout, Flume, and others. The core ones are HDFS and MapReduce.
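HDFS
Hadoop's distributed file system.
HDFS is designed around writing once and reading many times, a streaming data access pattern. The default block size is 64 MB (128 MB from Hadoop 2.x), and files are split in units of the block size; the block size is said to follow Moore's law, growing as hardware improves. HDFS is closely tied to MR: as a rule of thumb, the number of Map tasks equals the number of blocks. A 64 MB file gives 1 Map, while a 65 MB file (64 MB + 1 MB) spans two blocks and gives 2 Maps.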
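The block arithmetic above is a ceiling division. The sketch below follows the simplified rule from the text (one Map task per block); `num_blocks` is a hypothetical helper, and the split logic of a real Hadoop input format may differ, for example by folding a small trailing remainder into the previous split.

```python
import math


def num_blocks(file_size_mb, block_size_mb=64):
    """Number of HDFS blocks a file occupies; the last block may be only partly filled."""
    return math.ceil(file_size_mb / block_size_mb)


for size in (64, 65, 200):
    print(size, "MB ->", num_blocks(size), "block(s) at 64 MB,",
          num_blocks(size, 128), "block(s) at 128 MB")
```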
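Data Replication Principles
Data replication belongs to distributed computing in general and is not limited to databases.
Hadoop differs from a single database system in atomicity and consistency. For atomicity, every operation in the distributed system must either commit or roll back on all relevant replicas; in other words, besides the atomicity of local transactions, the atomicity of global transactions has to be controlled as well. For consistency, multiple replicas must maintain one-copy consistency.
Hadoop replicates each data block to several servers to make sure data is not lost.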
Setup Hadoop (IBM/Cloudera/HortonWorks)
Installing Hadoop.
Options include the community edition, commercial distributions, and various cloud offerings.
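Name & Data Nodes
Name nodes and data nodes.
HDFS communication involves two sides: the Client on one side, and the NameNode and DataNodes on the other.

NameNode: manages the HDFS namespace and the block mapping information, and handles client requests. The NameNode has a helper called the Secondary NameNode, which takes care of image backups and log merging, shares some of the workload, and improves fault tolerance. Accidentally deleted data can be recovered from it, although enabling trash is the better safeguard.
DataNode: the node that actually stores the data. It maintains a heartbeat with the NameNode.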
Job & Task Tracker
The JobTracker manages all jobs: it splits each job into a series of tasks and then assigns the tasks to TaskTrackers. Think of it as the manager.
A TaskTracker runs Map tasks and Reduce tasks. When it receives a task from the JobTracker it executes it and then reports the task status. Think of it as an employee, with each server being one employee.
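M/R Programming
Map/Reduce programming.
MR programming relies on the JobTracker and TaskTrackers. A TaskTracker manages two classes, Map and Reduce, which we can think of as two functions.
The MapTask engine feeds input data to the Map() function written by the programmer and writes the output to memory or disk. The ReduceTask engine merges and sorts the output of Map(), passes it as input to reduce(), and turns it into new output, which is the result.
Many examples online explain MR programming with a word count:

After the raw dataset is split, the Map function operates on the elements of each split and produces intermediate results as key-value pairs, here {"word", count}. The Reduce function then computes over those key-value pairs to produce the final result.
The core idea of Hadoop is MapReduce, and the core of MapReduce is the shuffle. What does the shuffle do in between? The word means shuffling cards; in the MR framework it stands for turning a set of unordered data into a set of data with some order, as far as possible.

As noted above, the map function writes its results to memory. With many tasks on the cluster the overhead becomes severe, and the shuffle exists to reduce it. In the word count example, map output is first held in a buffer; if the buffer runs short, it is written to disk as spill files. For performance, the system merges the values for each key, much like a merge/group operation, giving {"Bear",[1,1]}, {"Car",[1,1,1]} in the example, which are then summed. Once the Map tasks finish, the Reduce tasks read the files directly.
The shuffle is also a cause of data skew: when one key has far more records than the others, the task handling it takes too long.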
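The word count flow can be imitated in a single process to show what each phase contributes. This is only a sketch of the data flow, not Hadoop code: the three input lines are an assumed version of the usual word count illustration (chosen so that Bear and Car group as in the text), and `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the values of identical keys together and sort by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Sum the grouped counts for one key."""
    return key, sum(values)


splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = [pair for line in splits for pair in map_phase(line)]
grouped = shuffle(mapped)
result = dict(reduce_phase(key, values) for key, values in grouped.items())

print(grouped)
print(result)
```

The grouped dictionary is the intermediate form described above, and it also makes data skew easy to picture: if one key collected millions of values, the single reduce call for that key would dominate the running time.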
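Sqoop: Loading Data in HDFS
Sqoop is a tool for importing data from traditional databases into Hadoop. Hadoop supports all kinds of data, but it still needs to exchange data with external systems.
Sqoop supports relational databases and is optimized for MySQL and PostgreSQL. Connecting to other stores, NoSQL for example, requires downloading a separate connector. Pay attention to data consistency when importing.
Sqoop also supports export. SQL has many data types, though; a String may correspond to CHAR(64), VARCHAR(200), and so on, so you must confirm that the target type can be used.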
Flue, Scribe: For Unstruct Data
Two log-related systems (the first is presumably Flume) for handling unstructured data.
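SQL with Pig
Using the Pig language for SQL-style operations.
Pig is a scripting language for exploring large datasets; it describes MapReduce jobs in a script-like way. The difference from Hive is that Pig expresses MR through a scripting language while Hive expresses it through SQL.
Pig can sort, filter, sum, group (group by), and join the data it loads, and it supports user-defined functions (UDFs). Its biggest advantages over Hive are flexibility and speed. When the query logic is very complex, Hive becomes slow or the query cannot be written at all, and that is where Pig comes in.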
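DWH with Hive
Building a data warehouse with Hive.
Hive provides a query language. SQL users of traditional databases were moving to Hadoop, and expecting them to learn the low-level MR API was unrealistic, so Hive appeared to help them get their queries done.
Hive is well suited to data warehousing. It is designed for static data; record-level SQL operations such as Insert, Update, and Delete are not a good fit.
Another drawback is query latency: Hive has to launch MR jobs, which takes considerable time. A simple query that a traditional SQL database finishes within seconds may take Hive a minute. Only when the dataset is large enough does the startup cost become negligible.
Hive therefore suits jobs such as processing the day's data in the early hours of every morning. Because it is SQL-like, data analysts can use it directly, product managers can use it directly, and a college student can pick it up after a few days of training, so people become productive with it quickly.
One option is to use Hive for general-purpose queries and Pig with custom UDFs for complex analysis. Hive's syntax is closest to MySQL's.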
Scribe, Chukwa For Weblog
Scribe is a log collection system open-sourced by Facebook and already in use inside Facebook.
Chukwa is an open-source data collection system for monitoring large distributed systems.
Using Mahout
Already covered above.
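Zookeeper Avro
Zookeeper is an important Hadoop component designed to provide a coordination service. It addresses data management problems that come up often in distributed applications, such as unified naming, state synchronization, cluster management, and management of distributed configuration.
Avro is a Hadoop subproject: high-performance middleware for binary data transfer. Alternatives include Kryo and protobuf.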
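Storm: Hadoop Realtime
Storm is a more recent open-source framework for real-time processing of big data streams.
Its defining feature is the stream. However well a Hadoop query is optimized, it still runs as an MR job over HDFS. Is there a faster way? There is: watch the logs as the data is produced and compute immediately. Take page visits: someone clicks and the count goes up by 1; someone else clicks and it goes up by 1 again. That way the page's visit count is known in real time.
Hadoop excels at batch processing while Storm does stream processing: Hadoop wins on throughput and Storm wins on latency.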
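The counting idea can be sketched as an object that updates its totals on every incoming event instead of scanning stored logs later. This is plain Python with made-up names (`PageViewCounter`, `on_event`) and has nothing to do with Storm's actual API. It also separates page views (one per click) from unique visitors, since counting unique visitors additionally needs de-duplication.

```python
from collections import Counter


class PageViewCounter:
    """Keeps running totals that are updated the moment each event arrives."""

    def __init__(self):
        self.views = Counter()   # page -> number of clicks seen so far
        self.visitors = {}       # page -> set of distinct users seen so far

    def on_event(self, page, user):
        self.views[page] += 1
        self.visitors.setdefault(page, set()).add(user)

    def pv(self, page):
        return self.views[page]

    def uv(self, page):
        return len(self.visitors.get(page, ()))


counter = PageViewCounter()
for page, user in [("/home", "alice"), ("/home", "bob"), ("/home", "alice")]:
    counter.on_event(page, user)
    print(page, "PV =", counter.pv(page), "UV =", counter.uv(page))
```

A batch job would produce the same totals only after the whole log had been written and scanned, which is the latency trade-off between the two styles.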
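Rhadoop, RHipe
Two architectures that combine R with Hadoop.
RHadoop consists of three packages (rmr, rhdfs, rhbase), corresponding to MapReduce, HDFS, and HBase respectively.
Spark also has one, called SparkR.
rmr
A package in RHadoop, related to Hadoop's MapReduce.
Hadoop's delete command is also called rmr, so I am not sure whether that is what the roadmap's author had in mind.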
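Classandra
Cassandra (misspelled on the roadmap) is a popular NoSQL database.
Cassandra is often described as a column-oriented database, but that is not entirely accurate. Data is stored as a loosely structured multidimensional hash table. Loosely structured means that each row can have its own set of columns, whereas in a relational database every row of a table must have the same columns. A row in Cassandra is accessed through a unique identifier, so Cassandra is better understood as an indexed, row-oriented store.

In Cassandra you only need to define a logical container (a keyspace) to hold column families.
Cassandra suits rapid development and flexible deployment and scaling, and it supports high I/O. It competes with HBase: Cassandra + Spark versus HBase + Hadoop, with Cassandra emphasizing AP and HBase emphasizing CP in CAP terms.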
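MongoDB, Neo4j
MongoDB is a document-oriented NoSQL database.
As long as no joins are involved, MongoDB is very flexible and has a clear advantage. Take the familiar case of an e-commerce site: different product categories have different specifications, descriptions, and introductions. Electronics, diapers, snacks, SIM cards: designing a table schema for all of these in a relational database is a nightmare, but in MongoDB each document can be extended as needed.
The original closed this part with a tongue-in-cheek picture comparing MongoDB with relational databases.

Neo4j is the most popular graph database.
A graph database, as the name suggests, stores data as nodes and applies graph theory to store the relationships between entities.
The most common scenario is social relationship chains; any business logic whose relationships look like nodes and edges can use a graph database.

Compared with a relational database, the main advantage of a graph database is that it avoids the large number of join operations that graph-style computation (business logic) requires on a relational database. Suppose you are asked: who is the younger sister of the daughter of the maternal uncle of your mother's older sister? How many joins would that take? Whenever relationships are involved, joins are enormously expensive, while a graph database returns the answer quickly.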
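The kinship question can be pictured as a walk along labeled edges, one hop per relation, rather than one join per relation. The sketch below uses an in-memory dictionary with invented people and relation names; it illustrates the traversal idea only and is neither Neo4j code nor its query language.

```python
# (person, relation) -> set of related people; all names are hypothetical.
edges = {
    ("you", "mother"): {"Mei"},
    ("Mei", "sister"): {"Lan"},
    ("Lan", "uncle"): {"Bo"},
    ("Bo", "daughter"): {"Hua", "Jing"},
    ("Hua", "sister"): {"Jing"},
    ("Jing", "sister"): {"Hua"},
}


def traverse(start, path):
    """Follow a chain of relations from a starting node, one hop per relation."""
    frontier = {start}
    for relation in path:
        frontier = {
            neighbor
            for person in frontier
            for neighbor in edges.get((person, relation), set())
        }
    return frontier


answer = traverse("you", ["mother", "sister", "uncle", "daughter", "sister"])
print(sorted(answer))
```

Each hop touches only the edges leaving the current nodes, so the cost depends on how many relationships are actually followed rather than on the size of whole tables.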