Editor's recommendation:
This article comes from CSDN and introduces how to use Word2Vec on Chinese text.
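1. Brief introduction
PS: This first part mainly lays the groundwork by introducing the basics. There are many articles of this kind, and I hope you will seek out more and better introductory material on your own. This blog post focuses on how to use Word2Vec on Chinese text.
(1) Statistical language models
The general form of a statistical language model is: given a known sequence of words, find the conditional probability of the next word. The form is as follows: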
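The formula is missing at this point. A standard way to write this general form, assumed here to be what was intended, is given below together with the n-gram approximation discussed next. The symbols are illustrative notation: $w_1, \ldots, w_{t-1}$ are the known words, $w_t$ is the next word, and $n$ is the n-gram order.

```latex
P(w_t \mid w_1, w_2, \ldots, w_{t-1}) \;\approx\; P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})
```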

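The general form of a statistical language model is intuitive and accurate. The n-gram model assumes that, with the order of words in their context left unchanged, words that are close together are more strongly related, words that are farther apart are less related, and words that are far enough apart are not related at all.
However, this model does not make full use of the information in the corpus:
1) It does not consider the relationship between the current word and more distant words: words beyond the range n are ignored, even though the two may well be related.
For example, suppose "华盛顿是美国的首都" (Washington is the capital of the United States) is the current sentence, and more than n words later "北京是中国的首都" (Beijing is the capital of China) appears. In an n-gram model "华盛顿" (Washington) and "北京" (Beijing) are unrelated, yet the two sentences imply both a syntactic and a semantic relationship: "Washington" and "Beijing" are both nouns, and they are the capitals of the United States and China respectively.
2) It ignores the similarity between words; that is, the model cannot take the grammatical relationships between words into account.
For example, "鱼在水中游" (fish swim in the water) in the corpus should help us produce a sentence like "马在草原上跑" (horses run on the grassland), because "鱼" (fish) and "马" (horse), "水" (water) and "草原" (grassland), "游" (swim) and "跑" (run), and "中" (in) and "上" (on) share the same grammatical properties in the two sentences.
A neural network probabilistic language model makes full use of both kinds of information.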
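To make the first limitation concrete, here is a minimal sketch of a bigram (n = 2) model estimated by simple counting. The function names and the two-sentence corpus are illustrative assumptions, and the sentences are assumed to be already segmented. Because only adjacent pairs are counted, "马" and "游" get probability 0 even though the two sentences have the same grammatical pattern.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over pre-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# Toy corpus built from the two example sentences (assumed segmentation).
corpus = [
    ["鱼", "在", "水", "中", "游"],
    ["马", "在", "草原", "上", "跑"],
]
unigrams, bigrams = train_bigram(corpus)

print(bigram_prob("在", "水", unigrams, bigrams))  # "在" precedes "水" in one of its two occurrences
print(bigram_prob("马", "游", unigrams, bigrams))  # this pair never occurs adjacently
```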
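(2) Neural network probabilistic language models
A neural network probabilistic language model is an emerging natural language processing algorithm. The model learns word vectors and a probability density function from a training corpus. A word vector is a multi-dimensional real-valued vector that encodes the semantic and grammatical relationships of natural language. The cosine distance between two word vectors indicates how closely the words are related, and adding and subtracting word vectors amounts to the computer "choosing words and building sentences".
Neural network probabilistic language models have gone through a long period of development. The best known is NNLM (Neural Network Language Model), proposed by Bengio et al. in 2003, and later work builds on it. After more than ten years of research, these models have advanced considerably.
On the architecture side there are now the CBOW and Skip-gram models, both simpler than NNLM. On the training side there are the Hierarchical Softmax algorithm, Negative Sampling, and Subsampling, a technique introduced to reduce the effect of frequent words on accuracy and training speed.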

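The figure above shows NNLM (Neural Network Language Model), a natural language estimation model based on a three-layer neural network. NNLM computes the probability that the next word after a given context is wi, that is, P(wi = i | context); the word vectors are a by-product of its training. NNLM builds the corresponding vocabulary V from the corpus C.
For background on neural networks, see my earlier blog post: 神经网络和机器学习基础入门分享 (an introduction to the basics of neural networks and machine learning)
For NNLM, I recommend the article by Rachel-Zhang: word2vec: 高效word特征求取 (efficient word feature extraction)
In recent years neural network probabilistic language models have developed rapidly, and Word2vec brings together the latest techniques and theory.
Word2vec is a software tool for training word vectors that Google released in 2013. So before discussing word2vec, let me first introduce the concept of word vectors.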
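(3) Word vectors
Reference: licstar's NLP article Deep Learning in NLP (一)词向量和语言模型 (Part 1: word vectors and language models)
As that author puts it: Deep Learning algorithms have achieved astonishing results in the image and audio domains, but NLP has not yet seen results as exciting. One explanation is that language (words, sentences, passages and so on) consists of high-level cognitive abstractions produced by human cognition, whereas speech and images are lower-level raw input signals, so the latter two are better suited to feature learning with deep learning.
Even so, representing words as "word vectors" is arguably a core technique for bringing Deep Learning algorithms into NLP. The first step in turning a natural language understanding problem into a machine learning problem is always to find some way to express these symbols mathematically.
Word vectors have good semantic properties and are a common way to represent word features. The value in each dimension of a word vector represents a feature with some semantic and grammatical interpretation, so each dimension can be called a word feature. Word vectors use a Distributed Representation, which is a low-dimensional real-valued vector.
For comparison, the most intuitive and most common word representation in NLP is the One-hot Representation. Each word is represented by a very long vector whose dimension equals the vocabulary size; almost all entries are 0 and exactly one is 1, which identifies the current word.
"话筒" (microphone) is represented as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ...]; that is, counting from 0, "话筒" has index 3.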
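The one-hot scheme can be sketched in a few lines. The 16-word vocabulary below is hypothetical filler; only the position of "话筒" at index 3 follows the text.

```python
def one_hot(word, vocab):
    """Return a list with a single 1 at the word's index in vocab."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

# Hypothetical 16-word vocabulary; only "话筒" at index 3 comes from the text.
vocab = ["w%d" % i for i in range(16)]
vocab[3] = "话筒"

print(one_hot("话筒", vocab))
```

The vector length grows with the vocabulary, which is why this representation becomes unwieldy for real corpora.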
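However, the One-hot Representation describes words with sparse vectors, which leads to the curse of dimensionality in some tasks; low-dimensional word vectors solve this problem well. In practice, too, applying Deep Learning to high-dimensional features has an almost unacceptable complexity, which is another reason low-dimensional word vectors are so popular.
A Distributed Representation is a low-dimensional real-valued vector such as [0.792, -0.177, -0.107, 0.109, -0.542, ...]. It places similar or related words closer together in terms of distance.
In short, a Distributed Representation is a dense, low-dimensional real-valued vector. Each dimension represents a latent feature of the word, and these features capture useful syntactic and semantic properties. Its defining characteristic is that the different syntactic and semantic features of a word are distributed across all of its dimensions.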
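The difference in "distance" between the two representations can be illustrated with cosine similarity. This is a small sketch; the dense vectors are made-up numbers, not trained values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors of two different words never share a non-zero position.
one_hot_a = [0, 0, 0, 1, 0]
one_hot_b = [0, 1, 0, 0, 0]

# Made-up dense vectors for two related words (illustration only).
dense_a = [0.8, -0.2, 0.1]
dense_b = [0.7, -0.1, 0.2]

print(cosine_similarity(one_hot_a, one_hot_b))  # 0.0: one-hot vectors carry no similarity
print(cosine_similarity(dense_a, dense_b))      # close to 1 for nearby dense vectors
```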
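I recommend my earlier introductory article: Python简单实现基于VSM的余弦相似度计算 (a simple Python implementation of VSM-based cosine similarity)
(4) Word2vec
Reference: Word2vec的核心架构及其应用 (the core architecture of Word2vec and its applications) · 熊富林, 邓怡豪, 唐晓晟 · Beijing University of Posts and Telecommunications, 2015
Word2vec is a software tool for training word vectors that Google released in 2013. Given a corpus, it uses an optimized training model to express each word as a vector quickly and effectively. Its core architectures are CBOW and Skip-gram.
Before we begin, we introduce the model complexity, defined as follows: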
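The formula is missing at this point. Given the three quantities E, T and Q explained in the following paragraph, the definition is presumably the usual product form, with $O$ as an assumed symbol for the training complexity:

```latex
O = E \times T \times Q
```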

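Here E is the number of training epochs, T is the number of words in the training corpus, and Q depends on the model. The value of E is not our concern here. T depends on the training corpus, and the larger it is, the more accurate the model. Q is discussed below for each specific model.
NNLM is the base model among neural network probabilistic language models. In NNLM, the computation from the hidden layer to the output layer is the main bottleneck for training efficiency, so the CBOW and Skip-gram models drop the hidden layer. In practice the resulting word vectors may be less accurate than those of NNLM (which has a hidden layer), but this can be remedied by adding more training data.
Word2vec includes two training models, CBOW and Skip-gram (each with an input layer, a projection layer and an output layer), as shown in the figure below:

CBOW model:
It can be understood as the context determining the probability of the current word. In the CBOW model, every context word carries the same weight in its influence on the probability of the current word, hence the name CBOW (continuous bag-of-words model). It is like drawing words from a bag: it is enough to draw a sufficient number of words, and the order in which they come out does not matter.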
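The "bag" idea can be sketched as a forward pass: average the context word vectors with equal weight, then score every vocabulary word. This is a simplified illustration under stated assumptions, not the tool's actual code. It uses a full softmax for clarity, whereas word2vec trains with Hierarchical Softmax or Negative Sampling, and the tiny vocabulary and vectors are made up.

```python
import math

def cbow_probabilities(context_ids, input_vectors, output_vectors):
    """Average the context word vectors, then apply a full softmax over the vocabulary."""
    dim = len(input_vectors[0])
    # Projection layer: every context word contributes with the same weight.
    hidden = [sum(input_vectors[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    # Output layer: one score per vocabulary word.
    scores = [sum(h * w for h, w in zip(hidden, out)) for out in output_vectors]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 4-word vocabulary with 2-dimensional vectors.
input_vectors = [[0.1, 0.3], [0.4, -0.2], [-0.5, 0.2], [0.0, 0.6]]
output_vectors = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.5], [0.1, -0.1]]

p_forward = cbow_probabilities([0, 1, 3], input_vectors, output_vectors)
p_shuffled = cbow_probabilities([3, 0, 1], input_vectors, output_vectors)
print(p_forward)
print(p_shuffled)  # same distribution up to rounding: context order is ignored
```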
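Skip-gram model:
The Skip-gram model is simple and practical. Why was it proposed?
In NLP, the choice of corpus is a very important issue.
First, the corpus must be sufficient. On the one hand, the vocabulary must be large enough. On the other hand, the corpus should contain as many sentences as possible that reflect the relationships between words. For example, only when sentence patterns like "鱼在水中游" (fish swim in the water) occur often enough can the model learn the semantic and grammatical relationships in them. This is the same way humans learn natural language: after enough repetition, one learns to imitate.
Second, the corpus must be accurate: the selected corpus must correctly reflect the semantic and grammatical relationships of the language. For Chinese, the People's Daily (《人民日报》) is relatively accurate. More often, though, accuracy problems come not from the choice of corpus but from how it is processed.
Because of the limited window size, the relationship between the current word and words outside the window cannot be captured correctly by the model, and simply enlarging the window increases the training complexity. The Skip-gram model addresses these problems well.
Skip-gram means "skipping some symbols". For example, the sentence "中国足球踢得真是太烂了" (Chinese football is played really terribly) has four 3-grams: "中国足球踢得", "足球踢得真是", "踢得真是太烂" and "真是太烂了". The real meaning of the sentence is "中国足球太烂" (Chinese football is terrible), but none of the four 3-grams above conveys it.
With the Skip-gram model, some words may be skipped, so the 3-gram "中国足球太烂" can be formed. If up to 2 words may be skipped, that is, 2-Skip-gram, the 3-grams formed from the sentence above are: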

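As the table above shows, Skip-gram on the one hand reflects the real meaning of the sentence: of the 18 newly formed 3-grams, 8 correctly reflect the real meaning of the example sentence. On the other hand, it enlarges the corpus: the number of 3-grams grows from 4 to 18.
Enlarging the corpus improves training accuracy, and the resulting word vectors better reflect the real meaning of the text.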
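The counts of 4 and 18 can be reproduced with a short enumeration. This sketch rests on two assumptions: the sentence is segmented into the six tokens implied by the four 3-grams listed earlier, and "skipping up to 2 words" applies to each gap between neighbouring words of a 3-gram. Other definitions, such as limiting the total number of skipped words, give a different count. The function name is illustrative.

```python
from itertools import combinations

def skip_ngrams(tokens, n, k):
    """All n-token subsequences in which each adjacent pair skips at most k tokens."""
    result = []
    for positions in combinations(range(len(tokens)), n):
        gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
        if all(gap <= k for gap in gaps):
            result.append(tuple(tokens[i] for i in positions))
    return result

# Assumed segmentation of the example sentence into six tokens.
tokens = ["中国", "足球", "踢得", "真是", "太烂", "了"]

plain = skip_ngrams(tokens, n=3, k=0)    # ordinary 3-grams
skipped = skip_ngrams(tokens, n=3, k=2)  # 2-skip 3-grams

print(len(plain), len(skipped))
print(("中国", "足球", "太烂") in skipped)
```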
2. Downloading the source code
Download address: http://word2vec.googlecode.com/svn/trunk/
Check out the source code with SVN Checkout, as shown in the figure below.


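3. Chinese corpus
PS: The word2vec source code, the corpora from the three encyclopedias, the Tencent News corpus and the Python word segmentation code are attached at the end.
For the Chinese corpus you can refer to my articles, which use Python to download content from Baidu Baike, Hudong Baike and Wikipedia.
[python] lantern访问中文维基百科及selenium爬取维基百科语料 (accessing Chinese Wikipedia with lantern and crawling a Wikipedia corpus with selenium)
[Python爬虫] Selenium获取百度百科旅游景点的InfoBox消息盒 (using Selenium to fetch the InfoBox of tourist attractions on Baidu Baike)
The download result is shown in the figure below: 300 countries in total, 100 each from Baidu Baike, Hudong Baike and Wikipedia, numbered 0001.txt to 0100.txt in each case, with each txt file containing the information for one entity (country).

Then the Jieba word segmentation tool is used to segment the Chinese text and merge the documents.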
# encoding=utf-8
import sys
import re
import codecs
import os
import shutil
import jieba
import jieba.analyse

# Load the custom user dictionary
jieba.load_userdict("dict_all.txt")


# Read each file, segment it, and merge everything into one result file
def read_file_cut():
    # Input directory and output file
    pathBaidu = "BaiduSpiderCountry\\"
    resName = "Result_Country.txt"
    if os.path.exists(resName):
        os.remove(resName)
    result = codecs.open(resName, 'w', 'utf-8')

    num = 1
    while num <= 100:  # 5A: 200, others: 100
        name = "%04d" % num
        fileName = pathBaidu + str(name) + ".txt"
        # Assumes the input files are UTF-8 encoded
        source = codecs.open(fileName, 'r', 'utf-8')
        line = source.readline()
        while line != "":
            line = line.rstrip('\r\n')
            seglist = jieba.cut(line, cut_all=False)  # accurate mode
            output = ' '.join(list(seglist))  # join tokens with spaces
            result.write(output + ' ')  # a space replaces the line break
            line = source.readline()
        else:
            print('End file: ' + str(num))
            result.write('\r\n')  # one output line per input file
            source.close()
        num = num + 1
    else:
        print('End Baidu')
        result.close()


# Run function
if __name__ == '__main__':
    read_file_cut()
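The code above only shows the segmentation of the 100 countries from Baidu Baike; the core code is the same for the other sources. If you need to filter stop words or punctuation, you can implement that yourself.
For details on segmentation, see: [python] 使用Jieba工具中文分词及文本聚类概念 (Chinese word segmentation with Jieba and text clustering concepts)
The merged segmentation result is Result_Country.txt, equivalent to 600 lines, with each line corresponding to one segmented country.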

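4. Running the source code
I strongly recommend the following articles by experts on processing Chinese corpora with word2vec; Felven appears to be a senior schoolmate of mine.
Windows下使用Word2vec继续词向量训练 (continuing word vector training with Word2vec on Windows) - 一只鸟的天空
利用word2vec对关键词进行聚类 (clustering keywords with word2vec) - Felven
http://www.52nlp.cn/中英文维基百科语料上的word2vec实验
word2vec 词向量工具 (word2vec word vector tool) - 百度文库 (Baidu Wenku)
Because word2vec requires a Linux environment, first install a Linux environment emulator on Windows; cygwin is recommended. Then put the corpus Result_Country.txt into the word2vec directory and modify the demo-word.sh file. By default this script trains on the bundled text8 data and downloads it if it is not present. Since we want to train on our own data, comment out the download code.
The modified demo-word.sh is as follows: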
make
#if [ ! -e text8 ]; then
#  wget http://mattmahoney.net/dc/text8.zip -O text8.gz
#  gzip -d text8.gz -f
#fi
time ./word2vec -train Result_Country.txt -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin
The parameters in the figure below come from the article: Windows下使用Word2vec继续词向量训练 - 一只鸟的天空

Run the command sh demo-word.sh and wait for training to finish. Once the model is trained, you get the word vector file vectors.bin, which can be used directly.
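If you want to use vectors.bin outside the bundled tools, a reader can be sketched as follows. It assumes the layout the word2vec tool writes with -binary 1: a text header line "vocab_size dimension", then for each word its bytes, a space, and the vector as 32-bit floats. It further assumes the words are UTF-8 and the floats little-endian. The function name is illustrative, and the demonstration runs on a tiny in-memory sample with made-up values rather than a real file.

```python
import io
import struct

def read_word2vec_binary(stream):
    """Read word vectors from a binary stream in the word2vec -binary 1 layout."""
    header = stream.readline().decode("utf-8").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        # The word is stored as raw bytes terminated by a single space.
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that ends the previous record
                chars.extend(ch)
        raw = stream.read(4 * dim)  # dim 32-bit floats
        word = chars.decode("utf-8", errors="replace")
        vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors

# Round-trip check on a tiny in-memory sample with two made-up entries.
sample = io.BytesIO()
sample.write(b"2 3\n")
for word, vec in [("首都", (1.0, 0.5, -0.25)), ("国歌", (0.0, 2.0, 4.0))]:
    sample.write(word.encode("utf-8") + b" " + struct.pack("<3f", *vec) + b"\n")
sample.seek(0)

vectors = read_word2vec_binary(sample)
print(vectors["首都"])
```

For the real file, the stream would be open("vectors.bin", "rb") instead of the in-memory sample.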

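5. Results
With the trained word vectors we can carry out natural language processing tasks such as finding similar words and clustering keywords. word2vec provides the distance tool, which computes the cosine similarity between words and sorts the results. You can also set the -classes parameter at training time to specify the number of clusters and have the words clustered with k-means.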
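What the distance tool does can be approximated in a few lines: normalize the vectors, take dot products with the query word, and sort. This is only a sketch of the idea, not the tool's code; the function name and the 2-dimensional vectors are made up for illustration.

```python
import math

def nearest_words(query, vectors, top_n=3):
    """Rank every other word by cosine similarity to the query word."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = unit(vectors[query])
    scored = []
    for word, vec in vectors.items():
        if word == query:
            continue
        similarity = sum(a * b for a, b in zip(q, unit(vec)))
        scored.append((word, similarity))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Made-up 2-dimensional vectors; real ones would come from vectors.bin.
toy_vectors = {
    "阿富汗": [0.9, 0.1],
    "喀布尔": [0.8, 0.2],
    "伊拉克": [0.7, 0.4],
    "国歌": [-0.1, 0.9],
}

for word, similarity in nearest_words("阿富汗", toy_vectors):
    print(word, round(similarity, 3))
```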
cd C:/Users/dell/Desktop/word2vec
sh demo-word.sh
./distance vectors.bin
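Input 阿富汗 (Afghanistan): 喀布尔 (Kabul, the capital), 坎大哈 (Kandahar, a major city), 吉尔吉斯斯坦 (Kyrgyzstan), 伊拉克 (Iraq), and so on.

Input 国歌 (national anthem):

Input 首都 (capital):

Input GDP:

Finally, I hope this article is helpful to you; the main point is how to use the tool. Further applications are for you to explore and study on your own.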