word2vec word vector training and Chinese text similarity computation

 2017-11-23
 

Editor's note:
This article comes from CSDN and introduces how to use Word2Vec on Chinese text.

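1. Brief introduction

PS: This first part mainly lays the groundwork with background material. There are many articles of this kind, and I encourage you to seek out more and better introductions on your own. This post focuses on how to use Word2Vec with Chinese text.

(1) Statistical language models

In its general form, a statistical language model takes a known sequence of words and computes the conditional probability of the next word.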
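In standard notation this can be written as below. The symbols are assumptions introduced for illustration rather than the author's own notation: $w_1, \dots, w_T$ are the words of a sentence and $n$ is the order of the n-gram model. The first line is the exact chain-rule factorization; the second is the n-gram approximation, which keeps only the previous $n-1$ words as history.

```latex
\begin{aligned}
P(w_1, \dots, w_T) &= \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \\
                   &\approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
\end{aligned}
```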

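The general form of the statistical language model is intuitive and accurate. The n-gram model assumes that, with the order of words in the context left unchanged, words that are close together are closely related, words that are farther apart are more weakly related, and words that are far enough apart are not related at all.

However, this model does not make full use of the information in the corpus:

1) It ignores the relationship between the current word and more distant words: words outside the range n are dropped, even though the two may well be related.

For example, suppose the current sentence is “华盛顿是美国的首都” ("Washington is the capital of the United States") and, more than n words away, the sentence “北京是中国的首都” ("Beijing is the capital of China") appears. In an n-gram model, “华盛顿” (Washington) and “北京” (Beijing) are unrelated, yet the two sentences imply both a syntactic and a semantic relationship: both words are nouns, and they are the capitals of the United States and China respectively.

2) It ignores the similarity between words; that is, the model cannot take the grammatical relationships between words into account.

For example, “鱼在水中游” ("fish swim in the water") in the corpus should help us generate a sentence like “马在草原上跑” ("horses run on the grassland"), because “鱼” (fish) and “马” (horse), “水” (water) and “草原” (grassland), “游” (swim) and “跑” (run), and “中” (in) and “上” (on) have the same grammatical properties in the two sentences.

A neural probabilistic language model makes full use of both kinds of information.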

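(2) Neural probabilistic language models

The neural probabilistic language model is an emerging natural language processing algorithm. It learns word vectors and a probability density function from a training corpus. A word vector is a multi-dimensional real-valued vector that encodes the semantic and syntactic relationships of natural language: the cosine distance between two word vectors indicates how closely the words are related, and adding or subtracting word vectors is the computer's way of "choosing words and building sentences".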
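To make the cosine comparison concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration and do not come from any trained model; real word vectors typically have a few hundred dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for this example only.
vectors = {
    "北京": [0.9, 0.1, 0.8],
    "华盛顿": [0.8, 0.2, 0.9],
    "鱼": [0.1, 0.9, 0.0],
}

sim_capitals = cosine_similarity(vectors["北京"], vectors["华盛顿"])
sim_unrelated = cosine_similarity(vectors["北京"], vectors["鱼"])
print(sim_capitals, sim_unrelated)

# Vector arithmetic is plain element-wise addition and subtraction.
offset = [a - b for a, b in zip(vectors["北京"], vectors["华盛顿"])]
```

With these toy values the two capital cities score much higher against each other than either does against “鱼”, which is the behaviour the paragraph above describes.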

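Neural probabilistic language models have gone through a long period of development. The best known is the neural network language model (NNLM) proposed by Bengio et al. in 2003, which later work has taken as its reference point. After more than a decade of research, these models have advanced considerably.

On the architecture side, there are now the CBOW and Skip-gram models, both simpler than NNLM. On the training side, there are the Hierarchical Softmax algorithm, Negative Sampling, and subsampling, a technique introduced to reduce the effect of frequent words on accuracy and training speed.

The figure above shows NNLM (Neural Network Language Model), a natural language estimation model based on a three-layer neural network. NNLM computes the probability that the next word after a given context is wi, that is, P(wi = i | context); word vectors are a by-product of its training. NNLM builds the vocabulary V from the corpus C.

For background on neural networks, see my earlier post: 神经网络和机器学习基础入门分享 (an introduction to neural networks and machine learning basics).

For NNLM, I recommend Rachel-Zhang's article: word2vec——高效word特征求取 (word2vec: efficient word feature extraction).

Neural probabilistic language models have developed rapidly in recent years, and Word2vec brings together the latest techniques and theory.

Word2vec is a software tool for training word vectors that Google released as open source in 2013. So before discussing word2vec, let me first introduce the concept of a word vector.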

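(3) Word vectors

Reference: licstar's NLP article, Deep Learning in NLP （一）词向量和语言模型 (part 1: word vectors and language models).

As its author puts it: Deep Learning algorithms have achieved astonishing results in image and audio processing, but NLP has not yet seen results as exciting. One explanation is that language (words, sentences, passages and so on) is a high-level cognitive abstraction produced by human cognition, whereas speech and images are lower-level raw input signals, so the latter two are better suited to feature learning with deep learning.

Nevertheless, representing words as "word vectors" is arguably a core technique for bringing Deep Learning into NLP. The first step in turning a natural language understanding problem into a machine learning problem is always to find some way of expressing these symbols mathematically.

Word vectors have good semantic properties and are a common way to represent word features. The value in each dimension of a word vector represents a feature with some semantic or syntactic interpretation, so each dimension can be called a word feature. Word vectors use a Distributed Representation, which is a low-dimensional real-valued vector.

By contrast, the most intuitive and most common word representation in NLP is the One-hot Representation. Each word is represented by a very long vector whose dimensionality equals the vocabulary size; almost all entries are 0, and exactly one entry is 1, identifying the current word.

“话筒” (microphone) is represented as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 …]; counting from 0, “话筒” is assigned index 3.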
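The one-hot encoding can be sketched in a few lines of Python. The six-word vocabulary below is hypothetical; only the position of “话筒” (index 3, counting from 0) is taken from the text.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    if not 0 <= index < vocab_size:
        raise ValueError("index out of range")
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


# Hypothetical vocabulary, chosen so that "话筒" lands at index 3.
vocab = ["的", "是", "麦克", "话筒", "中国", "首都"]
word_to_index = {word: i for i, word in enumerate(vocab)}

mic = one_hot(word_to_index["话筒"], len(vocab))
mike = one_hot(word_to_index["麦克"], len(vocab))
print(mic)

# Two different words always have a dot product of 0, so one-hot
# vectors say nothing about how similar the words are.
dot = sum(a * b for a, b in zip(mic, mike))
print(dot)
```

Even near-synonyms such as “话筒” and “麦克” come out as orthogonal, which is one motivation for the dense representation.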

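However, the One-hot Representation encodes words as sparse vectors, which leads to the curse of dimensionality in some tasks; low-dimensional word vectors solve this problem well. In practice, too, feeding high-dimensional features into Deep Learning comes at a nearly unacceptable complexity, which is another reason low-dimensional word vectors are so popular.

A Distributed Representation is a low-dimensional real-valued vector such as [0.792, -0.177, -0.107, 0.109, -0.542, …]. It places similar or related words closer together in terms of distance.

In short, a Distributed Representation is a dense, low-dimensional real-valued vector. Each dimension represents a latent feature of the word that captures useful syntactic and semantic properties; its defining characteristic is that the word's different syntactic and semantic features are distributed across its dimensions.

See also my earlier introductory post: Python简单实现基于VSM的余弦相似度计算 (a simple Python implementation of VSM-based cosine similarity).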

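(4) Word2vec

Reference: Word2vec的核心架构及其应用 (The core architecture of Word2vec and its applications) · 熊富林, 邓怡豪, 唐晓晟 · Beijing University of Posts and Telecommunications (北邮), 2015

Word2vec is a software tool for training word vectors that Google released as open source in 2013. Given a corpus, it uses an optimized training model to turn each word into a vector quickly and effectively. Its core architectures are CBOW and Skip-gram.

Before going further, we introduce the model complexity, defined as:

O = E × T × Q

Here E is the number of training epochs, T is the number of words in the training corpus, and Q depends on the model. E is not our concern; T depends on the training corpus, and the larger it is, the more accurate the model; Q is discussed together with the specific models.

NNLM is the base model for neural probabilistic language models. In NNLM, the computation from the hidden layer to the output layer is the main factor limiting training efficiency, so the CBOW and Skip-gram models drop the hidden layer. Practice shows that the resulting word vectors may be less accurate than those from NNLM (which has a hidden layer), but this can be remedied by training on more data.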
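For reference, these are the per-example complexities Q as given by Mikolov et al. (2013), with hierarchical softmax assumed for CBOW and Skip-gram. The symbols are not defined in this article and are stated here as assumptions: N is the number of context words, D the word vector dimension, H the hidden layer size, V the vocabulary size, and C the maximum distance of the context words.

```latex
\begin{aligned}
Q_{\text{NNLM}}      &= N \times D + N \times D \times H + H \times V \\
Q_{\text{CBOW}}      &= N \times D + D \times \log_2 V \\
Q_{\text{Skip-gram}} &= C \times (D + D \times \log_2 V)
\end{aligned}
```

Neither the CBOW nor the Skip-gram expression contains a term in H, which matches the removal of the hidden layer described above.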

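Word2vec includes two training models, CBOW and Skip-gram, each made up of an input layer, a projection layer and an output layer, as shown in the figure below:

CBOW model:

In CBOW, the context determines the probability of the current word. Every word in the context carries the same weight in its influence on that probability, hence the name CBOW (continuous bag-of-words model). It is like drawing words from a bag: it is enough to draw the right number of words, and the order in which they are drawn does not matter.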
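The "bag" idea can be illustrated with a small sketch that builds (context, target) pairs and combines the context vectors. Two things here are assumptions made for illustration: the context vectors are combined by averaging, and the two-dimensional vectors are invented. This is not the word2vec training code itself.

```python
def cbow_examples(tokens, window):
    """Yield (context_words, target_word) pairs for a symmetric window."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


tokens = ["鱼", "在", "水", "中", "游"]
examples = list(cbow_examples(tokens, window=2))
context, target = examples[2]  # the example whose target is "水"

# Hypothetical 2-dimensional vectors for the context words.
toy = {"鱼": [1.0, 0.0], "在": [0.0, 1.0], "中": [0.5, 0.5], "游": [1.0, 1.0]}
forward = average([toy[w] for w in context])
backward = average([toy[w] for w in reversed(context)])
print(context, target, forward == backward)
```

Reversing the context leaves the averaged vector unchanged, which is exactly the sense in which the order of the drawn words does not matter.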

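Skip-gram model:

The Skip-gram model is simple and practical. Why was it proposed?

In NLP, the choice of corpus is a very important issue.

First, the corpus must be sufficient. On the one hand, the vocabulary must be large enough; on the other, the corpus should contain as many sentences as possible that reflect the relationships between words. Sentence patterns like “鱼在水中游” ("fish swim in the water") need to appear as often as possible for the model to learn the semantic and syntactic relationships in them. The same is true of humans learning a natural language: with enough repetition, one learns to imitate it.

Second, the corpus must be accurate: the selected text should correctly reflect the semantic and syntactic relationships of the language. For Chinese, 《人民日报》 (People's Daily) is fairly accurate. More often, though, accuracy problems stem not from the choice of corpus but from how it is processed.

Because the window size is limited, the relationship between the current word and words outside the window cannot be correctly reflected in the model, and simply enlarging the window increases training complexity. The Skip-gram model addresses these problems well.

Skip-gram means "skipping some symbols". For example, the sentence “中国足球踢得真是太烂了” ("Chinese football is played really terribly") contains four 3-grams: “中国足球踢得”, “足球踢得真是”, “踢得真是太烂” and “真是太烂了”. The sentence really means “中国足球太烂” ("Chinese football is terrible"), but none of the four 3-grams reflects this.

Skip-gram allows some words to be skipped, so the 3-gram “中国足球太烂” can be formed. If up to 2 words may be skipped, that is, 2-Skip-gram, the sentence yields the following 3-grams:

From the table above, Skip-gram on the one hand reflects the true meaning of the sentence: of the 18 3-grams formed, 8 correctly reflect the real meaning of the example sentence. On the other hand, it enlarges the corpus: the number of 3-grams grows from the original 4 to 18.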
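The counts of 4 and 18 can be reproduced with a short enumeration. The sketch assumes the sentence is segmented into the six tokens implied by the four 3-grams above, and it reads "skipping 2 words" as "at most 2 skipped tokens between any two adjacent chosen words". That reading is an assumption; it is the one under which the total comes out to 18.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """All n-grams in sentence order where each gap skips at most k tokens."""
    grams = []
    for positions in combinations(range(len(tokens)), n):
        gaps = (b - a - 1 for a, b in zip(positions, positions[1:]))
        if all(gap <= k for gap in gaps):
            grams.append(tuple(tokens[p] for p in positions))
    return grams


tokens = ["中国", "足球", "踢得", "真是", "太烂", "了"]
plain = skip_grams(tokens, n=3, k=0)    # ordinary 3-grams
skipped = skip_grams(tokens, n=3, k=2)  # 2-skip 3-grams

print(len(plain), len(skipped))  # prints: 4 18
print(("中国", "足球", "太烂") in skipped)
```

The 3-gram “中国足球太烂” appears in the 2-skip result because only two tokens (“踢得” and “真是”) are skipped between “足球” and “太烂”.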

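Enlarging the corpus improves training accuracy, and the resulting word vectors better reflect the true meaning of the text.

2. Downloading the source code

Download address: http://word2vec.googlecode.com/svn/trunk/

Use SVN Checkout to fetch the source code, as shown in the figure below.

3. Chinese corpus

PS: The word2vec source code, the corpora from the three encyclopedias, a Tencent News corpus and the Python word segmentation code are attached at the end.

For the Chinese corpus, see my earlier articles, which use Python to download content from Baidu Baike, Hudong Baike and Wikipedia.

[python] lantern访问中文维基百科及selenium爬取维基百科语料 (accessing Chinese Wikipedia via lantern and crawling a Wikipedia corpus with selenium)

[Python爬虫] Selenium获取百度百科旅游景点的InfoBox消息盒 (using Selenium to fetch the InfoBox of tourist attractions on Baidu Baike)

The download result is shown in the figure below: 300 country entries in total, 100 each from Baidu Baike, Hudong Baike and Wikipedia, numbered 0001.txt to 0100.txt in each case. Each txt file contains the information for one entity (a country).

The Jieba word segmentation tool is then used to segment the Chinese text and merge the documents.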

# encoding=utf-8
import os
import jieba

# Load the custom dictionary
jieba.load_userdict("dict_all.txt")


# Read each file, segment it, and merge the results into one file
def read_file_cut():
    path_baidu = "BaiduSpiderCountry"
    res_name = "Result_Country.txt"

    # Mode 'w' overwrites any previous result file.
    # newline='\r\n' keeps the '\r\n' line endings of the original script.
    with open(res_name, 'w', encoding='utf-8', newline='\r\n') as result:
        for num in range(1, 101):  # 100 files here (author's note: 200 for the 5A set, 100 for the others)
            file_name = os.path.join(path_baidu, "%04d.txt" % num)
            # The source files are assumed to be UTF-8 encoded
            with open(file_name, 'r', encoding='utf-8') as source:
                for line in source:
                    line = line.rstrip('\n')
                    seglist = jieba.cut(line, cut_all=False)  # accurate mode
                    output = ' '.join(seglist)  # join tokens with spaces
                    result.write(output + ' ')  # a space replaces the line break
            print('End file: ' + str(num))
            result.write('\n')  # one output line per source file
    print('End Baidu')


# Run function
if __name__ == '__main__':
    read_file_cut()

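The code above only shows the segmentation of the 100 Baidu Baike country entries, but the core code is the same for the other sources. If you need stop-word or punctuation filtering, you can implement it yourself.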
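One possible way to implement such filtering is sketched below. It works on a list of tokens, such as the output of `jieba.cut`, so the sketch itself does not need Jieba. The stop-word set is a hypothetical stand-in; in practice it would be loaded from a stop-word file of your choosing.

```python
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"的", "是", "在", "了"}


def is_punctuation(token):
    """True if every character is Unicode punctuation or a symbol."""
    return all(unicodedata.category(ch)[0] in ("P", "S") for ch in token)


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens, whitespace, punctuation and stop words."""
    kept = []
    for token in tokens:
        token = token.strip()
        if not token or is_punctuation(token) or token in stop_words:
            continue
        kept.append(token)
    return kept


tokens = ["华盛顿", "是", "美国", "的", "首都", "。", " ", "，"]
print(' '.join(filter_tokens(tokens)))
```

In the segmentation script, a call like this would sit between `jieba.cut(...)` and the `' '.join(...)` step.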

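For details on segmentation, see: [python] 使用Jieba工具中文分词及文本聚类概念 (Chinese word segmentation with Jieba and text clustering concepts)

The merged segmentation result is Result_Country.txt, which amounts to 600 lines, each corresponding to one segmented country entry.

4. Running the source code

I strongly recommend the following articles on processing Chinese corpora with word2vec; Felven, I believe, is a senior fellow student of mine.

Windows下使用Word2vec继续词向量训练 - 一只鸟的天空 (continuing word vector training with Word2vec on Windows)

利用word2vec对关键词进行聚类 - Felven (clustering keywords with word2vec)

http://www.52nlp.cn/中英文维基百科语料上的word2vec实验

word2vec 词向量工具 - 百度文库 (word2vec word vector tool, Baidu Wenku)

Because word2vec needs a Linux environment, first install a Linux environment emulator on Windows; cygwin is recommended. Then put the corpus Result_Country.txt into the word2vec directory and modify the demo-word.sh file. By default this script trains on the bundled text8 data and downloads it if it is not present. Since we want to train on our own data, comment out the download code.

The modified demo-word.sh file is as follows: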

make
#if [ ! -e text8 ]; then
# wget http://mattmahoney.net/dc/text8.zip -O text8.gz
# gzip -d text8.gz -f
#fi
time ./word2vec -train Result_Country.txt -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin

The parameters shown in the figure below come from the article: Windows下使用Word2vec继续词向量训练 - 一只鸟的天空
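As a quick reference, the flags of the `./word2vec` call above are annotated below as a Python argument list. The comments paraphrase my reading of the usage text printed by word2vec.c, so treat them as a summary rather than authoritative documentation.

```python
# Flags of the ./word2vec call in demo-word.sh, one per line.
args = [
    ("-train", "Result_Country.txt"),  # segmented training corpus
    ("-output", "vectors.bin"),        # file the word vectors are saved to
    ("-cbow", "1"),       # 1 = CBOW model, 0 = Skip-gram model
    ("-size", "200"),     # dimensionality of the word vectors
    ("-window", "8"),     # maximum skip length between words (context window)
    ("-negative", "25"),  # number of negative samples (0 = not used)
    ("-hs", "0"),         # 0 = do not use Hierarchical Softmax
    ("-sample", "1e-4"),  # subsampling threshold for frequent words
    ("-threads", "20"),   # number of training threads
    ("-binary", "1"),     # 1 = save the vectors in binary format
    ("-iter", "15"),      # number of training iterations
]

command = ["./word2vec"] + [part for pair in args for part in pair]
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` to start training from Python instead of from the shell script.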

Run the command sh demo-word.sh and wait for training to finish. When training completes, you get the word vector file vectors.bin, which can be used directly.
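If you want to inspect vectors.bin from Python, the sketch below reads the binary layout that word2vec.c writes with `-binary 1`: a text header `vocab_size dim`, then for each word its bytes followed by a space, `dim` 4-byte floats, and a newline. It assumes little-endian floats and a UTF-8 encoded corpus. The function name `load_word2vec_binary` is my own and is not part of any library.

```python
import os
import struct
import tempfile


def load_word2vec_binary(path, encoding="utf-8"):
    """Read a word2vec binary file into a {word: [float, ...]} dict."""
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            # The word is stored as raw bytes terminated by one space.
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # newline left over from the previous entry
                    chars.append(ch)
            word = b"".join(chars).decode(encoding, errors="replace")
            # Then come `dim` 4-byte floats (little-endian assumed).
            vectors[word] = list(struct.unpack("<%df" % dim, f.read(4 * dim)))
    return vectors


# Round-trip check on a tiny hand-made file with the same layout.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"2 2\n")
    tmp.write("中国 ".encode("utf-8") + struct.pack("<2f", 1.0, 0.5) + b"\n")
    tmp.write("首都 ".encode("utf-8") + struct.pack("<2f", -2.0, 0.25) + b"\n")
demo = load_word2vec_binary(tmp.name)
os.remove(tmp.name)
print(demo)
```

For the real file, the call would be `load_word2vec_binary("vectors.bin")`; a large vocabulary will take a while to load with this pure-Python reader.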

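5. Results

With the trained word vectors we can carry out natural language processing tasks such as finding similar words and clustering keywords. word2vec provides the distance tool, which computes the cosine similarity between words and sorts the results. You can also set the -classes parameter during training to specify the number of clusters and cluster the words with k-means.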
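The idea behind the distance tool, ranking all other words by cosine similarity to a query word, can be sketched as follows. This is an illustration in plain Python, not the tool's actual code, and the two-dimensional vectors are invented, so the ranking it prints is not a real training result.

```python
import math


def most_similar(query, vectors, topn=5):
    """Rank every other word by cosine similarity to `query`, highest first."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = unit(vectors[query])
    scored = []
    for word, vec in vectors.items():
        if word == query:
            continue
        # The dot product of two unit vectors is their cosine similarity.
        score = sum(a * b for a, b in zip(q, unit(vec)))
        scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors, for illustration only.
vectors = {
    "阿富汗": [0.9, 0.1],
    "喀布尔": [0.8, 0.2],
    "伊拉克": [0.7, 0.4],
    "GDP": [0.1, 0.9],
}
ranking = most_similar("阿富汗", vectors, topn=2)
for word, score in ranking:
    print(word, round(score, 3))
```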

cd C:/Users/dell/Desktop/word2vec
sh demo-word.sh
./distance vectors.bin

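Input 阿富汗 (Afghanistan): 喀布尔 (Kabul, the capital), 坎大哈 (Kandahar, a major city), 吉尔吉斯斯坦 (Kyrgyzstan), 伊拉克 (Iraq) and so on.

Input 国歌 (national anthem):

Input 首都 (capital):

Input GDP:

Finally, I hope this article is helpful to you, chiefly as a guide to usage. Further applications are left for you to explore and study on your own.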

   