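Editor's note:
This article comes from Jianshu. It shows you how to use Spacy, an easy-to-learn, industrial-strength Python package for natural language processing, to perform part-of-speech analysis, named entity recognition, and dependency parsing on natural language text, and to compute and visualize word embedding vectors.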

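Blind Dimensions
I love to repeat a line that Munger is fond of:
To the one with a hammer, everything looks like a nail.
What does this mean?
It means you cannot get by with only a handful of methods and tools.
Otherwise your thinking will be boxed in by your own abilities. You will not just have blind spots; you will have whole "blind dimensions".
You will try to solve problems with unsuitable methods (and even pride yourself on having one trick that works everywhere), while overlooking the tools that actually fit.
The outcome is easy to imagine.
So you need to keep more weapons in your toolbox.
Recently I have been repeating Munger's line to my own students again.
When they start on real research tasks and run into natural language processing (NLP), all they can think of is word clouds, sentiment analysis, and LDA topic modeling.
Why?
Because in the natural language processing part of my column and my WeChat official account, those are the only topics I have written about.
If you think that is all NLP can do, you are badly mistaken.
Watch this video and you will get a feel for how far the frontier of natural language processing has advanced.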

Of course, the tools and data you have at hand cannot yet reproduce the black-magic effects that Google demonstrated.
But the existing tools are more than enough to let you do much richer processing of natural language text.
Technology is advancing at a rapid pace.
Besides Jieba, SnowNLP, and TextBlob, which earlier articles introduced, there are many other Python-based natural language processing tools, such as NLTK and gensim.
I cannot walk you through every natural language processing tool you might end up using.
But we can make a start by introducing a Python package called Spacy.
For the rest, you can extrapolate on your own.
Tool
Spacy's slogan goes like this:
Industrial-Strength Natural Language Processing.

Does that sound a little arrogant?
But they do let the data speak.
The data come from peer-reviewed academic papers:

After looking at the data above, we have a rough idea of Spacy's performance.
But I chose it not only for its "industrial-strength" performance, but even more because it offers a convenient API and rich, detailed documentation.
Here is just one example.

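The figure above shows the first page of Spacy's getting-started tutorial.
As you can see, there is a concise tree-style navigation bar on the left, detailed documentation in the middle, and key tips on the right.
For installation alone, you can click to choose your operating system, Python package manager, Python version, virtual environment, and language support. The page then generates the installation commands for you dynamically.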

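That kind of design is very helpful for newcomers, isn't it?
Spacy has many features.
They range from simple part-of-speech analysis to advanced neural network models.
Because space is limited, this article only shows you the following:
1. Part-of-speech analysis
2. Named entity recognition
3. Dependency parsing
4. Similarity computation with word embedding vectors
5. Dimensionality reduction and visualization of words
After finishing this tutorial, you can follow the same path and use Spacy's detailed documentation to teach yourself its other natural language processing features.
Let's get started.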
Environment
Please click this link (http://t.cn/R35fElv) to go straight into our experiment environment.
Yes, you read that right.
You do not need to install any packages on your local computer. A modern browser (Google Chrome, Firefox, Safari, Microsoft Edge, and so on) is all you need. I have already prepared all the dependencies for you.
After opening the link, you will see this page.

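Unlike the Jupyter Notebook we used before, this interface comes from Jupyter Lab.
You can think of it as an enhanced Jupyter Notebook with the following features:
Code cells can be dragged around directly with the mouse;
A single browser tab can open several Notebooks, each with its own Kernel;
A Markdown editor with live rendering;
A full file browser;
Quick preview of CSV data files;
...
The left pane in the figure lists all the files in the working directory.
Opened on the right is the ipynb file we are going to use.
Following the explanations here, please run the cells one by one and observe the results.
Let's talk about where the sample text comes from.
If you have read my other natural language processing tutorials, you should remember this TV series.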

Yes, it is "Yes, Minister".
Out of fondness for this 1980s British comedy, I am again using the introduction to "Yes, Minister" on Wikipedia as the sample text.

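Now we formally begin, executing the code step by step.
I suggest you first follow the tutorial exactly and run it through to get the results.
If everything works, then replace the data with content you are interested in.
After that, try opening a blank ipynb file, type the code yourself with the tutorial and documentation as a guide, and try making adjustments.
This will help you understand the workflow and how to use the tools.
Practice
We take some sentences from the first paragraph of the Wikipedia page and put them into the variable text.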
text = ("The sequel, Yes, Prime Minister, ran from 1986 to 1988. "
        "In total there were 38 episodes, of which all but one lasted half an hour. "
        "Almost all episodes ended with a variation of the title of the series "
        "spoken as the answer to a question posed by the same character, Jim Hacker. "
        "Several episodes were adapted for BBC Radio, and a stage play was produced "
        "in 2010, the latter leading to a new television series on UKTV Gold in 2013.")
Display it to check that it was stored correctly.
'The sequel, Yes, Prime Minister, ran from 1986 to 1988. In total there were 38 episodes, of which all but one lasted half an hour. Almost all episodes ended with a variation of the title of the series spoken as the answer to a question posed by the same character, Jim Hacker. Several episodes were adapted for BBC Radio, and a stage play was produced in 2010, the latter leading to a new television series on UKTV Gold in 2013.'
No problem.
Next we import the Spacy package.
We tell Spacy to use its English model and store the model in the variable nlp.
Then we use the nlp model to analyze our paragraph and name the result doc.
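The notebook cells for these three steps are not reproduced in the text, so here is a minimal sketch of what they might look like. It assumes Spacy is installed together with an English model; the model name `en_core_web_sm` and the helper name `analyze` are assumptions made for illustration and are not taken from the original notebook.

```python
def analyze(text, model_name="en_core_web_sm"):
    """Load an English pipeline and run it over `text`.

    The model name is an assumption; substitute whichever English
    model is installed in your environment.
    """
    import spacy  # imported lazily, so defining the function needs no install

    nlp = spacy.load(model_name)  # the language model, called nlp in this tutorial
    doc = nlp(text)               # the analysed document, called doc in this tutorial
    return nlp, doc
```

In the notebook you would then write `nlp, doc = analyze(text)` and carry on with `nlp` and `doc` as in the rest of the tutorial.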
Let's look at the contents of doc.
The sequel, Yes, Prime Minister, ran from 1986 to 1988. In total there were 38 episodes, of which all but one lasted half an hour. Almost all episodes ended with a variation of the title of the series spoken as the answer to a question posed by the same character, Jim Hacker. Several episodes were adapted for BBC Radio, and a stage play was produced in 2010, the latter leading to a new television series on UKTV Gold in 2013.
It looks no different from text, doesn't it? Isn't this still the same passage?
Don't worry. Spacy only prints the text to keep things easy on the eyes.
In fact, behind the scenes it has already analyzed this passage at many levels.
Don't believe it?
Let's try having Spacy list all the tokens that appear in this passage.
for token in doc:
    print('"' + token.text + '"')
You will see that Spacy prints a long list.
"The"
"sequel"
","
"Yes"
","
"Prime"
"Minister"
","
"ran"
"from"
"1986"
"to"
"1988"
"."
"In"
"total"
"there"
"were"
"38"
"episodes"
","
"of"
"which"
"all"
"but"
"one"
"lasted"
"half"
"an"
"hour"
"."
"Almost"
"all"
"episodes"
"ended"
"with"
"a"
"variation"
"of"
"the"
"title"
"of"
"the"
"series"
"spoken"
"as"
"the"
"answer"
"to"
"a"
"question"
"posed"
"by"
"the"
"same"
"character"
","
"Jim"
"Hacker"
"."
"Several"
"episodes"
"were"
"adapted"
"for"
"BBC"
"Radio"
","
"and"
"a"
"stage"
"play"
"was"
"produced"
"in"
"2010"
","
"the"
"latter"
"leading"
"to"
"a"
"new"
"television"
"series"
"on"
"UKTV"
"Gold"
"in"
"2013"
"."
You may not be impressed. What is so great about that?
English is separated by spaces anyway! I could write a little program myself that splits on spaces and prints these pieces one by one!
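The whitespace idea is easy to try in plain Python, and doing so shows where it falls short. The sketch below uses only the first sentence of the sample text; the variable names are made up for this illustration.

```python
sentence = "The sequel, Yes, Prime Minister, ran from 1986 to 1988."

# Naive approach: cut the sentence wherever there is whitespace.
naive_tokens = sentence.split()
print(naive_tokens)

# Punctuation stays glued to the neighbouring word, so pieces such as
# "sequel," and "1988." come out as one item instead of word + punctuation.
glued = [piece for piece in naive_tokens if not piece.isalnum()]
print(glued)
```

The split yields 10 pieces for this sentence, whereas the token list printed above has 14 entries for it, because the commas and the final period are tokens in their own right.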
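Not so fast. Besides the token text itself, Spacy has also worked out a number of attributes for every token.
Below, for the first 10 tokens only, we print the following:
Text
Index (the character offset in the original text)
Lemma
Whether it is punctuation
Whether it is whitespace
Word shape
Part of speech (coarse-grained)
Tag (fine-grained part of speech)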
for token in doc[:10]:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))
The result is:
The	0	the	False	False	Xxx	DET	DT
sequel	4	sequel	False	False	xxxx	NOUN	NN
,	10	,	True	False	,	PUNCT	,
Yes	12	yes	False	False	Xxx	INTJ	UH
,	15	,	True	False	,	PUNCT	,
Prime	17	prime	False	False	Xxxxx	PROPN	NNP
Minister	23	minister	False	False	Xxxxx	PROPN	NNP
,	31	,	True	False	,	PUNCT	,
ran	33	run	False	False	xxx	VERB	VBD
from	37	from	False	False	xxxx	ADP	IN
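The sixth column (Xxx, xxxx, Xxxxx) is the least self-explanatory one. The sketch below is an approximation of how such a shape string could be derived, written to match the rows above; it is not Spacy's own implementation, and the function name `word_shape` and the `max_run` parameter are assumptions for this illustration.

```python
def word_shape(text, max_run=4):
    """Map letters to X/x and digits to d, keeping at most max_run repeats."""
    shape = []
    last, run = "", 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch                      # punctuation is kept as-is
        run = run + 1 if sym == last else 1
        last = sym
        if run <= max_run:                # long runs of one symbol are truncated
            shape.append(sym)
    return "".join(shape)


for word in ["The", "sequel", ",", "Minister"]:
    print(word, word_shape(word))
```

This is why "sequel" (six letters) and "from" (four letters) share the shape xxxx: the shape records the pattern of capitals, lowercase letters, and digits, not the exact length.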
See how much work Spacy has quietly done for us in the background?
Next, we stop looking at every part of speech and focus only on the entities that appear in the text.
for ent in doc.ents:
    print(ent.text, ent.label_)
1986 to 1988 DATE
38 CARDINAL
one CARDINAL
half an hour TIME
Jim Hacker PERSON
BBC Radio ORG
2010 DATE
UKTV Gold ORG
2013 DATE
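In this passage, the entities include dates, times, cardinal numbers, and more. Spacy not only automatically recognized Jim Hacker as a person's name, it also correctly identified BBC Radio and UKTV Gold as organizations.
If your day-to-day work involves sifting potential competing products or competitors out of a huge volume of reviews, does this give you any ideas?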
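One way to act on that idea is to group the recognized entities by label and keep only the categories you care about. The sketch below works on the (text, label) pairs from the output above, typed in by hand so that it runs without Spacy; `group_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity output above.
entities = [
    ("1986 to 1988", "DATE"),
    ("38", "CARDINAL"),
    ("one", "CARDINAL"),
    ("half an hour", "TIME"),
    ("Jim Hacker", "PERSON"),
    ("BBC Radio", "ORG"),
    ("2010", "DATE"),
    ("UKTV Gold", "ORG"),
    ("2013", "DATE"),
]


def group_by_label(pairs):
    """Collect entity texts under their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(entities)
print(by_label["ORG"])
```

With a live document, the pairs could presumably be built as `[(ent.text, ent.label_) for ent in doc.ents]`, using the same two attributes the loop above prints.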
Run the following code and see what happens:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

As the figure above shows, Spacy gives us an intuitive visualization of the entity recognition results. Entities of different categories are even shown in different colors.
Splitting a passage into sentences is also a piece of cake for Spacy.
for sent in doc.sents:
    print(sent)
The sequel, Yes, Prime Minister, ran from 1986 to 1988.
In total there were 38 episodes, of which all but one lasted half an hour.
Almost all episodes ended with a variation of the title of the series spoken as the answer to a question posed by the same character, Jim Hacker.
Several episodes were adapted for BBC Radio, and a stage play was produced in 2010, the latter leading to a new television series on UKTV Gold in 2013.
Note that doc.sents is not a list.
<generator at 0x116e95e18>
So if we need to pick out a particular sentence, we first have to convert it into a list.
[The sequel, Yes, Prime Minister, ran from 1986 to 1988.,
 In total there were 38 episodes, of which all but one lasted half an hour.,
 Almost all episodes ended with a variation of the title of the series spoken as the answer to a question posed by the same character, Jim Hacker.,
 Several episodes were adapted for BBC Radio, and a stage play was produced in 2010, the latter leading to a new television series on UKTV Gold in 2013.]
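The difference between a generator and a list can be seen without Spacy. The sketch below uses a deliberately crude sentence splitter of its own (the function `sentences` is made up for this illustration and is no substitute for Spacy's sentence segmentation) to show why indexing needs the conversion.

```python
def sentences(text):
    """Yield sentences one at a time, splitting crudely on '. '."""
    for part in text.split(". "):
        yield part if part.endswith(".") else part + "."


gen = sentences("In total there were 38 episodes. "
                "Several episodes were adapted for BBC Radio.")

try:
    gen[0]                       # generators do not support indexing
except TypeError as err:
    print("cannot index:", err)

as_list = list(gen)              # materialise the generator once
print(as_list[0])                # now positions can be addressed
```

A generator can also be consumed only once, which is another reason to store the list in a variable if you need the sentences more than one time.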
The features shown next are limited to the first sentence.
We extract it, process it again with the nlp model, and store the result in a new variable, newdoc.
newdoc = nlp(list(doc.sents)[0].text)
For this sentence, we want to work out the dependency relations between its tokens.
for token in newdoc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
The/DT <--det-- sequel/NN
sequel/NN <--nsubj-- ran/VBD
,/, <--punct-- sequel/NN
Yes/UH <--intj-- sequel/NN
,/, <--punct-- sequel/NN
Prime/NNP <--compound-- Minister/NNP
Minister/NNP <--appos-- sequel/NN
,/, <--punct-- sequel/NN
ran/VBD <--ROOT-- ran/VBD
from/IN <--prep-- ran/VBD
1986/CD <--pobj-- from/IN
to/IN <--prep-- from/IN
1988/CD <--pobj-- to/IN
./. <--punct-- ran/VBD
Very clear, but a list is not very intuitive.
So let Spacy visualize it for us.
displacy.render(newdoc, style='dep', jupyter=True, options={'distance': 90})
The result is as follows:

What do the labels on these dependency arcs mean?
If you know some linguistics, you should be able to read them.
If not, look them up.
Compare against a grammar book and see whether Spacy's analysis is accurate.
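The head and dependent pairs printed earlier in this section also lend themselves to plain-Python inspection. The sketch below copies those pairs by hand and collects, for each head word, the tokens that attach to it. Keying by the head's text only works here because every head word in this sentence is unique; that shortcut is an assumption of this illustration, and real code would more safely use token positions.

```python
# (token, relation, head) triples copied from the printed output above.
triples = [
    ("The", "det", "sequel"),
    ("sequel", "nsubj", "ran"),
    (",", "punct", "sequel"),
    ("Yes", "intj", "sequel"),
    (",", "punct", "sequel"),
    ("Prime", "compound", "Minister"),
    ("Minister", "appos", "sequel"),
    (",", "punct", "sequel"),
    ("ran", "ROOT", "ran"),
    ("from", "prep", "ran"),
    ("1986", "pobj", "from"),
    ("to", "prep", "from"),
    ("1988", "pobj", "to"),
    (".", "punct", "ran"),
]

# The root is the one token whose relation is ROOT (it points at itself).
root = next(tok for tok, dep, head in triples if dep == "ROOT")

# Map each head word to the (token, relation) pairs that depend on it.
children = {}
for tok, dep, head in triples:
    if dep != "ROOT":
        children.setdefault(head, []).append((tok, dep))

print(root)
print(children[root])
```

Reading the result, the verb "ran" sits at the top of the sentence, with its subject "sequel", the preposition "from", and the final period hanging directly off it.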
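What we have analyzed so far is at the level of syntax.
Next, let's look at semantics.
The tool we use is called a word embedding model.
In an earlier article, "How to Extract Topics from Massive Text with Python?", we discussed how to express text as data a computer can understand.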

ÎÄÖд¦ÀíµÄÿһ¸öµ¥´Ê£¬¶¼½ö½ö¶ÔÓ¦×ŴʵäÀïÃæµÄÒ»¸ö±àºÅ¶øÒÑ¡£Äã¿ÉÒÔ°ÑËü¿´³ÉÄãÈ¥ÓªÒµÌü°ìÀíÒµÎñʱÁìÈ¡µÄºÅÂë¡£
ËüÖ»ÌṩÁËÏÈÀ´ºóµ½µÄ˳ÐòÐÅÏ¢£¬¸úÄãµÄÖ°Òµ¡¢Ñ§Àú¡¢ÐÔ±ðͳͳûÓйØÏµ¡£
ÎÒÃǽ«ÕâÑù¹ýÓÚ¼ò»¯µÄÐÅÏ¢ÊäÈ룬¼ÆËã»ú¶ÔÓÚ´ÊÒåµÄÁ˽⣬Ҳ±ØÈ»ÉٵÿÉÁ¯¡£
ÀýÈç¸øÄãÏÂÃæÕâ¸öʽ×Ó£º
Ö»ÒªÄãѧ¹ýÓ¢Ó¾Í²»ÄѲµ½ÕâÀï´ó¸ÅÂÊÓ¦¸ÃÌîд¡°man¡±¡£
µ«ÊÇ£¬Èç¹ûÄãÖ»ÊÇÓÃÁËËæ»úµÄÐòºÅÀ´´ú±í´Ê»ã£¬ÓÖÈçºÎÄܹ»²Âµ½ÕâÀïÕýÈ·µÄÌî´Ê½á¹ûÄØ£¿
ÐҺã¬ÔÚÉî¶ÈѧϰÁìÓò£¬ÎÒÃÇ¿ÉÒÔʹÓøüΪ˳Êֵĵ¥´ÊÏòÁ¿»¯¹¤¾ß¡ª¡ª´ÊǶÈ루word embeddings
£©¡£

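As in the simplified example in the figure above, word embeddings turn words into vectors in a multi-dimensional space.
That way, words are no longer cold dictionary IDs; they carry meaning.
To use a word embedding model, we need Spacy to load a new file.
nlp = spacy.load('en_core_web_lg')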
To check that it loaded, we have Spacy print the vector for the word "minister".
print(nlp.vocab['minister'].vector)

As you can see, each word is represented by a vector of 300 floating-point numbers.
By the way, the vectors in the model Spacy loads here were pretrained on a massive corpus with a word2vec-style method.
Let's see how well Spacy can now judge semantic similarity.
Here we assign four variables, one for each word's entry in the vocabulary, which carries its vector.
dog = nlp.vocab["dog"]
cat = nlp.vocab["cat"]
apple = nlp.vocab["apple"]
orange = nlp.vocab["orange"]
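Let's look at the similarity between "dog" and "cat", using the built-in similarity() function:
Well, both are pets, so a high similarity is reasonable.
Next, "dog" and "apple".
One is an animal and the other a fruit, so the similarity drops sharply.
What about "dog" and "orange"?
The similarity is not high either.
And between "apple" and "orange"?
The similarity between two fruits is far higher than between a fruit and an animal.
Test passed.
It seems that with the word embedding model, Spacy has gained some understanding of semantics.
Now, just for fun, let's quiz it.
Here we need to work with vectors that may not exist in the vocabulary, so Spacy's built-in similarity() function is no longer enough.
From scipy we take the cosine function needed for the similarity computation.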
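from scipy.spatial.distance import cosine
For comparison, we plug the "dog" and "cat" vectors in directly. Note that scipy's cosine() returns the cosine distance, so we subtract it from 1 to get the similarity.
1 - cosine(dog.vector, cat.vector)
Apart from the number of digits retained, the result is identical to that of Spacy's built-in similarity().
We wrap it in a small function dedicated to vector inputs.
def vector_similarity(x, y):
    return 1 - cosine(x, y)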
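For reference, the quantity that vector_similarity returns can be written out by hand: it is the dot product of the two vectors divided by the product of their lengths. The sketch below uses plain Python lists and made-up toy vectors rather than real word vectors; the name `cosine_similarity` is chosen for this illustration only.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, perpendicular, opposite)
```

Only the direction of the vectors matters, not their length: vectors pointing the same way score close to 1, perpendicular ones score 0, and opposite ones score -1.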
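Using our own similarity function, let's test "dog" and "apple".
vector_similarity(dog.vector, apple.vector)
Compared with the earlier result, it is also consistent.
What we want to express is this equation:
? - woman = king - queen
We call the question mark guess_word.
So:
guess_word = king - queen + woman
We generalize the three words on the right-hand side as words, and write the following function to compute guess_word.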
def make_guess_word(words):
    first, second, third = words
    # first - second + third, e.g. king - queen + woman
    return (nlp.vocab[first].vector - nlp.vocab[second].vector
            + nlp.vocab[third].vector)
The next function is rather brute-force. It takes the guess_word value we computed and checks its similarity against every word in the vocabulary, one by one, then prints the 10 most similar candidates.
def get_similar_word(words, scope=nlp.vocab):
    guess_word = make_guess_word(words)
    similarities = []
    for word in scope:
        # skip vocabulary entries that have no vector
        if not word.has_vector:
            continue
        similarity = vector_similarity(guess_word, word.vector)
        similarities.append((word, similarity))
    # sort by similarity, highest first
    similarities = sorted(similarities, key=lambda item: -item[1])
    print([word[0].text for word in similarities[:10]])
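To see the mechanics without loading a large model, the same procedure can be run on a handful of hand-made vectors. Everything in the sketch below is invented for illustration: the two dimensions (royalty, maleness), the numbers, and the function names. It only demonstrates the arithmetic and the ranking, not how real embeddings behave.

```python
import math

# Hand-made 2-d vectors: [royalty, maleness]. Purely illustrative numbers.
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def guess(words, vectors):
    """Compute first - second + third, component by component."""
    first, second, third = (vectors[w] for w in words)
    return [a - b + c for a, b, c in zip(first, second, third)]


def most_similar(words, vectors, top=3):
    """Rank every known word by cosine similarity to the guessed vector."""
    target = guess(words, vectors)
    scored = [(cosine_similarity(target, vec), word)
              for word, vec in vectors.items()]
    scored.sort(reverse=True)
    return [word for score, word in scored[:top]]


ranking = most_similar(["king", "queen", "woman"], toy_vectors)
print(ranking)
```

Here king - queen + woman works out to [0.0, 1.0], which is exactly the toy vector for "man", so "man" comes first in the ranking.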
All right, game time.
Let's start with:
? - woman = king - queen
That is:
guess_word = king - queen + woman
Enter the word sequence from the right-hand side:
words = ["king", "queen", "woman"]
Then run the comparison function, get_similar_word(words).
This function takes a while to run. Please be patient.
When it finishes, you will see the following result:
['MAN', 'Man', 'mAn', 'MAn', 'MaN', 'man', 'mAN', 'WOMAN', 'womAn', 'WOman']
So the vocabulary contains this many case variants of the word "man".
But this example is too much of a classic. Let's try something fresher:
? - England = Paris - London
That is:
guess_word = Paris - London + England
For you this is definitely an easy question. Countries on the left, capitals on the right; matching them up, the answer is naturally France, the country where Paris is.
The question is, can Spacy guess it?
We enter these words.
words = ["Paris", "London", "England"]
Let Spacy guess:
['france', 'FRANCE', 'France', 'Paris', 'paris', 'PARIS', 'EUROPE', 'EUrope', 'europe', 'Europe']
The result is exciting: the top three are all "France".
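Next, let's do something even more interesting: compress the 300-dimensional word vectors down onto a sheet of paper (two dimensions) and look at the relative positions of the words.
First we need to import the numpy package (as np, the name used in the code below).
We start with an empty word embedding matrix, embedding, and fill it in gradually.
The list of words to display, word_list, also starts out empty.
We have Spacy go through the passage taken from the "Yes, Minister" Wikipedia page once more and add its words to the list. Note that this time we apply two checks:
If it is punctuation, discard it;
If the word is already in the word list, discard it.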
for token in doc:
    if not token.is_punct and token.text not in word_list:
        word_list.append(token.text)
Look at the result:
['The',
'sequel',
'Yes',
'Prime',
'Minister',
'ran',
'from',
'1986',
'to',
'1988',
'In',
'total',
'there',
'were',
'38',
'episodes',
'of',
'which',
'all',
'but',
'one',
'lasted',
'half',
'an',
'hour',
'Almost',
'ended',
'with',
'a',
'variation',
'the',
'title',
'series',
'spoken',
'as',
'answer',
'question',
'posed',
'by',
'same',
'character',
'Jim',
'Hacker',
'Several',
'adapted',
'for',
'BBC',
'Radio',
'and',
'stage',
'play',
'was',
'produced',
'in',
'2010',
'latter',
'leading',
'new',
'television',
'on',
'UKTV',
'Gold',
 '2013']
I checked: the long list of 63 words contains no punctuation. All good.
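The filtering rule used to build this list (skip punctuation, skip repeats, keep the original order) can be sketched without Spacy. The version below tracks the words already seen in a set; the punctuation test based on `string.punctuation`, the function name, and the short token list are all stand-ins invented for this illustration.

```python
import string


def unique_words(tokens):
    """Keep the first occurrence of each non-punctuation token, in order."""
    seen = set()
    result = []
    for tok in tokens:
        if all(ch in string.punctuation for ch in tok):
            continue                  # drop punctuation-only tokens
        if tok in seen:
            continue                  # drop words we have already kept
        seen.add(tok)
        result.append(tok)
    return result


tokens = ["The", "sequel", ",", "Yes", ",", "Prime", "Minister", ",",
          "ran", "the", "sequel", "."]
kept = unique_words(tokens)
print(kept)
```

The comparison is case-sensitive, just as in the loop above, so "The" and "the" both survive. That is why the 63-word list contains pairs such as 'The' and 'the' or 'In' and 'in'.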
Next, we append the vector for each word to the word embedding matrix.
for word in word_list:
    embedding = np.append(embedding, nlp.vocab[word].vector)
Let's check the shape of the word embedding matrix at this point.
As you can see, all the vector values have been placed in one long run. That is clearly not what we want, so we split the vectors of the different words onto separate rows.
embedding = embedding.reshape(len(word_list), -1)
Now check the shape of the embedding matrix again after the transformation.
63 words, each of length 300. That's right.
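Why everything ended up in one long run, and what the reshape does about it, can be mimicked with plain lists. When np.append is called without an axis argument it flattens its inputs, so 63 vectors of length 300 become a single sequence of 63 × 300 = 18,900 numbers. The two helper functions below are hypothetical stand-ins written for this sketch, not numpy functions.

```python
def append_flat(flat, vector):
    """Mimic np.append without an axis: everything lands in one flat sequence."""
    return flat + list(vector)


def reshape_rows(flat, n_rows):
    """Mimic reshape(n_rows, -1): cut the flat sequence into n_rows equal rows."""
    if n_rows <= 0 or len(flat) % n_rows != 0:
        raise ValueError("cannot split the values evenly into rows")
    width = len(flat) // n_rows       # the -1 means: work out the row width
    return [flat[i * width:(i + 1) * width] for i in range(n_rows)]


vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # two toy 3-d "word vectors"

flat = []
for vec in vectors:
    flat = append_flat(flat, vec)
print(len(flat))                                 # one long run of numbers

rows = reshape_rows(flat, len(vectors))
print(len(rows), len(rows[0]))                   # back to one row per word
```

The reshape recovers one row per word only because every vector has the same length and the values were appended in word order.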
Next we import the TSNE module from the scikit-learn package.
from sklearn.manifold import TSNE
We create an object with the same name in lowercase, tsne, to make the calls on.
The job of tsne is to compress the high-dimensional word vectors (300 dimensions) onto a two-dimensional plane. We run this transformation:
low_dim_embedding = tsne.fit_transform(embedding)
Now the low_dim_embedding we have in hand is the two-dimensional vector representation of the 63 words.
ÎÒÃǶÁÈë»æÍ¼¹¤¾ß°ü¡£
import matplotlib.pyplot
as plt
%pylab inline |
ÏÂÃæÕâ¸öº¯Êý£¬ÓÃÀ´°Ñ¶þάÏòÁ¿µÄ¼¯ºÏ£¬»æÖƳöÀ´¡£
Èç¹ûÄã¶Ô¸Ãº¯ÊýÄÚÈÝϸ½Ú²»Àí½â£¬Ã»¹ØÏµ¡£ÒòΪÎÒ»¹Ã»ÓиøÄãϵͳ½éÉܹýPythonÏµĻæÍ¼¹¦ÄÜ¡£
ºÃÔÚÕâÀïÎÒÃÇÖ»Òª»áµ÷ÓÃËü£¬¾Í¿ÉÒÔÁË¡£
def plot_with_labels(low_dim_embs, labels, filename='tsne.pdf'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename)
At last we can visualize the reduced word vectors.
Please run the following statement:
plot_with_labels(low_dim_embedding, word_list)
You will see a figure like this.

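Please look closely at these parts of the figure:
Years
Upper- and lower-case forms of the same word
Radio and television
a and an
Can you spot any patterns?
I noticed something interesting: every time tsne runs, the resulting two-dimensional plot is different!
But this is normal, because not every word in this passage has a pretrained vector.
Spacy handles such words with randomization and similar treatment.
So each time the high-dimensional vectors are generated, the result differs. Different high-dimensional vectors, compressed into two dimensions, naturally give different results too.
So here is the question: if I want the result to be the same on every run, what should I do?
I leave this as a take-home exercise for you to answer on your own.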
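As a hint rather than a full answer: any procedure that draws on random numbers becomes repeatable once its random seed is fixed. The sketch below shows the idea with a toy stand-in for a randomized layout step; the function `noisy_layout` is invented for this illustration and has nothing to do with how t-SNE actually computes positions.

```python
import random


def noisy_layout(n_points, seed=None):
    """Stand-in for a randomized layout step: returns random 2-d points."""
    rng = random.Random(seed)         # a private generator with its own seed
    return [(rng.random(), rng.random()) for _ in range(n_points)]


same_seed_matches = noisy_layout(3, seed=42) == noisy_layout(3, seed=42)
different_seed_matches = noisy_layout(3, seed=1) == noisy_layout(3, seed=2)
print(same_seed_matches, different_seed_matches)
```

scikit-learn's TSNE constructor accepts a random_state argument for this kind of control. Whether fixing it is enough on its own to make this notebook's plot repeatable is something to verify when you try the exercise.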
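If you are observant, you may have noticed that after running the last statement, a new pdf file appeared in the file list in the left sidebar.

This pdf is the visualization you just generated. You can double-click the file name to view it in a new tab.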

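See, Jupyter Lab can even display pdf files correctly.
Now it is practice time.
Please replace the text in the ipynb file with paragraphs and words you are interested in, and try running it again.
Source Code
Having executed all the code, swapped in the text you need to analyze, and run it successfully, don't you feel a sense of achievement?
You may want to dig further into Spacy's features and reproduce the environment and results locally.
No problem. Please use this link (http://t.cn/R35MIKh) to download an archive containing all the source code used in this article and the environment configuration file (Pipenv).
If you know how to use github, you are also welcome to use this link (http://t.cn/R35MEqk) to visit the corresponding github repo and clone or fork it.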

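Summary
Using the Python natural language processing package Spacy, this article gave you a very brief demonstration of the following NLP features:
1. Part-of-speech analysis
2. Named entity recognition
3. Dependency parsing
4. Similarity computation with word embedding vectors
5. Dimensionality reduction and visualization of words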