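Editor's note:
This article explains how to compute TF-IDF, what TF-IDF is used for, and how to extract keywords and a summary from a text. It comes from the WeChat account 数据科学与人工智能 (Data Science and Artificial Intelligence) and was edited and recommended by Anna of 火龙果软件.

Introduction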
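Suppose I have a very long article and want a computer to extract its keywords (automatic keyphrase extraction) with no human intervention at all. How can this be done properly?
This problem touches on data mining, text processing, information retrieval, and many other frontier areas of computer science. Surprisingly, though, there is a very simple classic algorithm that gives quite satisfactory results. It is simple enough to need no advanced mathematics, and an ordinary reader can understand it in about 10 minutes. This is the TF-IDF algorithm that I want to introduce today.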
Let us start with an example. Suppose we have a long article titled "Bee Farming in China" (《中国的蜜蜂养殖》) and want a computer to extract its keywords.
An obvious idea is to find the words that occur most often. If a word is important, it should appear many times in the article. So we count the "term frequency" (TF).
As you have surely guessed, the most frequent words turn out to be the commonest words of all, such as "的" (of), "是" (is), and "在" (in). These are called "stop words" ( http://baike.baidu.com/view/3784680.htm ): words that are no help in finding results and must be filtered out.
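Suppose we filter them all out and consider only the remaining, meaningful words. We then run into another problem: we may find that "China" (中国), "bee" (蜜蜂), and "farming" (养殖) occur equally often. Does this mean that, as keywords, they are equally important?
Clearly not. "China" is a very common word, while "bee" and "farming" are comparatively rare. If these three words occur equally often in one article, it is reasonable to consider "bee" and "farming" more important than "China". In other words, "bee" and "farming" should rank ahead of "China" in the keyword list.
So we need an importance adjustment factor that measures whether a word is common. If a word is relatively rare in general but occurs many times in this article, it very likely reflects what is specific to the article, and it is exactly the kind of keyword we are looking for.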
In statistical terms, on top of the term frequency we assign each word an "importance" weight. The commonest words ("的", "是", "在") get the smallest weight, fairly common words ("China") get a small weight, and rarer words ("bee", "farming") get a larger weight. This weight is called the "inverse document frequency" (IDF), and it decreases as a word becomes more common.
Once we know the term frequency (TF) and the inverse document frequency (IDF), multiplying the two gives the word's TF-IDF value. The more important a word is to the article, the larger its TF-IDF value. The top-ranked words are therefore the keywords of the article.
The details of the algorithm follow.
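Step 1: compute the term frequency.

$$\mathrm{TF} = \text{number of times the word appears in the article}$$

Articles differ in length, so to make different articles comparable the term frequency is normalized:

$$\mathrm{TF} = \frac{\text{number of times the word appears in the article}}{\text{total number of words in the article}}$$

or

$$\mathrm{TF} = \frac{\text{number of times the word appears in the article}}{\text{number of times the most frequent word appears in the article}}$$
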
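Step 2: compute the inverse document frequency.
This requires a corpus, which models the environment in which the language is used.

$$\mathrm{IDF} = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}\right)$$

The more common a word is, the larger the denominator and the smaller the IDF, which then approaches 0. The 1 is added to the denominator to avoid division by zero (the case where no document contains the word). log means taking the logarithm of the resulting value.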
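Step 3: compute TF-IDF.

$$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$$

As you can see, TF-IDF grows with the number of times a word appears in the document and shrinks as the word becomes more common across the language as a whole. The algorithm for automatic keyword extraction is therefore clear: compute the TF-IDF value of every word in the document, sort in descending order, and take the top few words.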
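The three steps can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `tf_idf_keywords`, the toy corpus, and the choice of a base-10 logarithm are assumptions made for illustration, and documents are assumed to be already tokenized and non-empty.

```python
import math
from collections import Counter


def tf_idf_keywords(doc_tokens, corpus, top_n=3):
    """Rank the words of one tokenized document by TF-IDF (hypothetical helper)."""
    counts = Counter(doc_tokens)
    total_words = len(doc_tokens)
    scores = {}
    for word, count in counts.items():
        # Step 1: normalized term frequency
        tf = count / total_words
        # Step 2: inverse document frequency, with +1 in the denominator
        docs_with_word = sum(1 for doc in corpus if word in doc)
        idf = math.log10(len(corpus) / (docs_with_word + 1))
        # Step 3: TF-IDF
        scores[word] = tf * idf
    # Sort in descending order of TF-IDF and keep the top few words
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy corpus: each document is represented by the set of words it contains.
corpus = [
    {"china", "bee", "farming"},
    {"china", "economy"},
    {"china", "history"},
    {"china", "travel"},
    {"china", "farming", "rice"},
    {"stock", "market"},
    {"football", "league"},
    {"weather", "rain"},
]
document = ["china", "bee", "farming", "china", "bee", "farming"]
keywords = tf_idf_keywords(document, corpus)
print(keywords)
```

All three words have the same term frequency here, so the ranking is decided entirely by how rare each word is in the corpus. Note that with a very small corpus the +1 in the denominator can push the IDF of a ubiquitous word to zero or below, which simply places it at the bottom of the ranking.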

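Take "Bee Farming in China" again. Suppose the article is 1,000 words long and "China", "bee", and "farming" each occur 20 times, so the term frequency (TF) of all three is 0.02. A Google search then shows 25 billion web pages containing the character "的"; assume this is the total number of Chinese web pages. There are 6.23 billion pages containing "China", 48.4 million containing "bee", and 97.3 million containing "farming". Their inverse document frequency (IDF, using a base-10 logarithm) and TF-IDF are then as follows:

Word             Pages containing the word    IDF      TF-IDF
China (中国)      6.23 billion                 0.603    0.0121
bee (蜜蜂)        48.4 million                 2.713    0.0543
farming (养殖)    97.3 million                 2.410    0.0482

As the table shows, "bee" has the highest TF-IDF value, "farming" comes second, and "China" is lowest. (If we also computed the TF-IDF of "的", it would be a value extremely close to 0.) So if we choose only one word, "bee" is the keyword of this article.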
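The arithmetic in this example is easy to check. The sketch below assumes a base-10 logarithm, which is what the IDF values imply, and writes the page counts out as plain numbers; the variable names are illustrative.

```python
import math

TOTAL_PAGES = 25_000_000_000  # pages containing "的", taken as the total
pages_containing = {
    "China": 6_230_000_000,
    "bee": 48_400_000,
    "farming": 97_300_000,
}
tf = 20 / 1000  # each word occurs 20 times in a 1,000-word article

results = {}
for word, pages in pages_containing.items():
    idf = math.log10(TOTAL_PAGES / (pages + 1))
    results[word] = (round(idf, 3), round(tf * idf, 4))
    print(word, results[word])
```

At this scale the +1 in the denominator has no visible effect on the rounded values.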
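Besides automatic keyword extraction, the TF-IDF algorithm has many other uses. In information retrieval, for example, we can compute for each document the TF-IDF of each search term ("China", "bee", "farming") and add them up to get a TF-IDF score for the whole document. The document with the highest score is the one most relevant to the search terms.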
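A minimal sketch of this retrieval idea follows. The function name `rank_documents`, the document names, and the toy data are hypothetical; documents are assumed to be tokenized and non-empty, and a base-10 logarithm is assumed.

```python
import math
from collections import Counter


def rank_documents(query_terms, documents):
    """Score each document by the sum of the TF-IDF values of the query terms."""
    n_docs = len(documents)
    # Number of documents in which each word appears at least once
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = counts[term] / len(tokens)
            idf = math.log10(n_docs / (doc_freq[term] + 1))
            score += tf * idf
        scores[name] = score
    # Highest total first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


documents = {
    "bees": ["china", "bee", "farming", "bee", "honey"],
    "economy": ["china", "economy", "growth"],
    "rice": ["rice", "farming", "china"],
    "soccer": ["football", "league", "match"],
    "weather": ["rain", "wind", "cloud"],
    "music": ["piano", "violin", "concert"],
}
ranking = rank_documents(["china", "bee", "farming"], documents)
print(ranking[0][0])
```

Documents that contain none of the search terms score 0 and end up at the bottom of the ranking.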
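The advantages of TF-IDF are that it is simple and fast, and that its results match reality fairly well. Its weakness is that measuring a word's importance purely by term frequency is not comprehensive enough: sometimes an important word does not occur very often. The algorithm also ignores word position. Words that appear early and words that appear late are treated as equally important, which is not correct. (One remedy is to give greater weight to the first paragraph of the text and to the first sentence of each paragraph.)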
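Finding similar articles
Let us now look at a related problem. Sometimes, besides finding keywords, we also want to find other articles similar to the original one. For example, Google News lists several similar news items below each main story.

To find similar articles we need "cosine similarity". The example below shows what cosine similarity is.
For simplicity, we start with sentences.
Sentence A: 我喜欢看电视，不喜欢看电影。 (I like watching TV and do not like watching movies.)
Sentence B: 我不喜欢看电视，也不喜欢看电影。 (I do not like watching TV, and I do not like watching movies either.)
How can we compute how similar these two sentences are?
The basic idea is that the more similar the wording of two sentences, the more similar their content should be. We can therefore start from term frequencies and compute their similarity.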
Step 1: word segmentation.
Sentence A: 我/喜欢/看/电视，不/喜欢/看/电影。
Sentence B: 我/不/喜欢/看/电视，也/不/喜欢/看/电影。
Step 2: list all the words.
我 (I), 喜欢 (like), 看 (watch), 电视 (TV), 电影 (movie), 不 (not), 也 (also).
Step 3: count the term frequencies.
Sentence A: 我 1, 喜欢 2, 看 2, 电视 1, 电影 1, 不 1, 也 0.
Sentence B: 我 1, 喜欢 2, 看 2, 电视 1, 电影 1, 不 2, 也 1.
Step 4: write out the term-frequency vectors.
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
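Steps 2 to 4 can be sketched as follows, assuming the sentences have already been segmented into words as in step 1. The helper name `frequency_vector` is illustrative.

```python
from collections import Counter

# Step 1 output: the two sentences after word segmentation
sentence_a = ["我", "喜欢", "看", "电视", "不", "喜欢", "看", "电影"]
sentence_b = ["我", "不", "喜欢", "看", "电视", "也", "不", "喜欢", "看", "电影"]

# Step 2: the combined word list, in the order used in the text
vocabulary = ["我", "喜欢", "看", "电视", "电影", "不", "也"]


def frequency_vector(tokens, vocabulary):
    """Steps 3 and 4: count each vocabulary word and return the counts as a vector."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vector_a = frequency_vector(sentence_a, vocabulary)
vector_b = frequency_vector(sentence_b, vocabulary)
print(vector_a)  # [1, 2, 2, 1, 1, 1, 0]
print(vector_b)  # [1, 2, 2, 1, 1, 2, 1]
```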
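At this point the problem becomes how to compute the similarity of these two vectors.
We can picture them as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, the directions are the same and the segments coincide. If the angle is 90 degrees, they form a right angle and the directions are completely dissimilar. If the angle is 180 degrees, the directions are exactly opposite. We can therefore judge how similar two vectors are by the size of the angle between them: the smaller the angle, the more similar they are.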

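Take two dimensions as an example. Let a and b be two vectors and θ the angle between them that we want to compute. Writing a and b for the lengths of the two vectors and c for the length of the side joining their endpoints, the law of cosines gives:

$$\cos\theta = \frac{a^2 + b^2 - c^2}{2ab}$$

If vector a is [x1, y1] and vector b is [x2, y2], the law of cosines can be rewritten in the following form:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$

Mathematicians have shown that this way of computing the cosine also holds for n-dimensional vectors. If A and B are two n-dimensional vectors, with A = [A1, A2, ..., An] and B = [B1, B2, ..., Bn], then the cosine of the angle θ between A and B is:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

Using this formula, we can obtain the cosine of the angle between sentence A and sentence B:

$$\cos\theta = \frac{1 \cdot 1 + 2 \cdot 2 + 2 \cdot 2 + 1 \cdot 1 + 1 \cdot 1 + 1 \cdot 2 + 0 \cdot 1}{\sqrt{1^2 + 2^2 + 2^2 + 1^2 + 1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 2^2 + 2^2 + 1^2 + 1^2 + 2^2 + 1^2}} = \frac{13}{\sqrt{12}\,\sqrt{16}} \approx 0.938$$

The closer the cosine is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are. This is what "cosine similarity" means. Sentence A and sentence B above are therefore very similar; in fact the angle between them is about 20.2 degrees.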
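The n-dimensional formula translates directly into code. The sketch below is one straightforward way to write it; the function name `cosine_similarity` is illustrative, and both vectors are assumed to be non-zero and of equal length.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vector_a = [1, 2, 2, 1, 1, 1, 0]  # sentence A
vector_b = [1, 2, 2, 1, 1, 2, 1]  # sentence B

similarity = cosine_similarity(vector_a, vector_b)
angle_degrees = math.degrees(math.acos(similarity))
print(round(similarity, 3))     # 0.938
print(round(angle_degrees, 1))
```

A zero vector would cause a division by zero here, so real code would need to handle texts that share none of the chosen words with the vocabulary.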
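This gives us one algorithm for finding similar articles:
(1) Use the TF-IDF algorithm to find the keywords of the two articles.
(2) Take a number of keywords from each article (say 20), merge them into one set, and compute each article's term frequency for the words in this set (relative term frequency can be used to offset differences in article length).
(3) Build the term-frequency vector of each article.
(4) Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles.
Cosine similarity is a very useful measure. It can be applied whenever you need to compute how similar two vectors are.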
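Automatic summarization
Sometimes a very simple mathematical method can accomplish a very complex task.
The previous two parts are good examples. Merely by counting word frequencies we can find keywords and similar articles. These may not be the most effective methods, but they are certainly the simplest and easiest to apply.
Next we discuss how to use term frequency to summarize an article automatically (automatic summarization).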
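If a 3,000-character article can be condensed into a 150-character summary, readers save a great deal of reading time. A summary written by a person is a "manual summary", and one produced by a machine is an "automatic summary". Many websites need this, including paper repositories, news sites, and search engines. In 2007 the paper "A Survey on Automatic Text Summarization" (Dipanjan Das, Andre F.T. Martins, 2007), by researchers in the United States, surveyed the automatic summarization algorithms of the time. One of the most important of them is based on word-frequency statistics.
This approach goes back to the 1958 paper "The Automatic Creation of Literature Abstracts" by the IBM scientist H.P. Luhn.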
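Dr. Luhn held that the information in an article is contained in its sentences, with some sentences carrying more information and some less. Automatic summarization means finding the sentences that carry the most information.
The information content of a sentence is measured by its keywords: the more keywords a sentence contains, the more important it is. Luhn proposed using "clusters" to represent groupings of keywords. A cluster is a fragment of a sentence that contains several keywords.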

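Luhn's original paper illustrates this with a figure in which the boxed part of a sentence is one cluster. As long as the distance between keywords is below a threshold, they are considered to belong to the same cluster. The threshold Luhn suggested is 4 or 5. In other words, if more than 5 other words lie between two keywords, the two keywords can be placed in separate clusters.
Next, an importance score is computed for each cluster.

$$\text{cluster score} = \frac{(\text{number of keywords in the cluster})^2}{\text{length of the cluster}}$$

In that figure, the cluster contains 7 words in total, 4 of which are keywords. Its importance score is therefore (4 x 4) / 7 ≈ 2.3.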
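The cluster scoring can be sketched as below. This is an illustration of the idea, not Luhn's original procedure verbatim: the function name `best_cluster_score`, the parameter `max_gap`, and the sample sentence are assumptions, and two keywords are taken to belong to the same cluster when at most `max_gap` other words separate them.

```python
def best_cluster_score(words, keywords, max_gap=5):
    """Return the highest cluster score found in one tokenized sentence."""
    positions = [i for i, word in enumerate(words) if word in keywords]
    if not positions:
        return 0.0

    # Group keyword positions into clusters
    clusters = []
    current = [positions[0]]
    for pos in positions[1:]:
        gap = pos - current[-1] - 1  # other words between two keywords
        if gap <= max_gap:
            current.append(pos)
        else:
            clusters.append(current)
            current = [pos]
    clusters.append(current)

    # Score = (keywords in cluster)^2 / (cluster length in words)
    best = 0.0
    for cluster in clusters:
        length = cluster[-1] - cluster[0] + 1
        best = max(best, len(cluster) ** 2 / length)
    return best


# A cluster of 7 words containing 4 keywords, as in the example above
words = ["the", "bee", "is", "honey", "hive", "of", "the", "farming", "today"]
keywords = {"bee", "honey", "hive", "farming"}
score = best_cluster_score(words, keywords)
print(round(score, 1))  # 2.3
```

A cluster runs from its first keyword to its last, so the leading "the" and trailing "today" in the sample sentence do not count toward its length.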
Then we pick the sentences that contain the highest-scoring clusters (say 5 sentences) and put them together to form the automatic summary of the article. For a concrete implementation, see Chapter 8 of "Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites" (O'Reilly, 2011); the Python code is on GitHub.
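Luhn's algorithm was later simplified so that it no longer distinguishes clusters and only considers the keywords a sentence contains. Below is an example (in pseudocode) that only considers the sentence in which each keyword first appears.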
Summarizer(originalText, maxSummarySize):
    // Count word frequencies in the original text, producing an array
    // such as [(10,'the'), (3,'language'), (8,'code')...]
    wordFrequences = getWordCounts(originalText)
    // Filter out stop words; the array becomes [(3,'language'), (8,'code')...]
    contentWordFrequences = filtStopWords(wordFrequences)
    // Sort by frequency, so the array becomes ['code', 'language'...]
    contentWordsSortbyFreq = sortByFreqThenDropFreq(contentWordFrequences)
    // Split the article into sentences
    sentences = getSentences(originalText)
    // Select the sentence in which each keyword first appears
    setSummarySentences = {}
    foreach word in contentWordsSortbyFreq:
        firstMatchingSentence = search(sentences, word)
        setSummarySentences.add(firstMatchingSentence)
        if setSummarySentences.size() == maxSummarySize:
            break
    // Join the selected sentences in their original order to form the summary
    summary = ""
    foreach sentence in sentences:
        if sentence in setSummarySentences:
            summary = summary + " " + sentence
    return summary
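For readers who want something executable, here is one possible Python rendering of the pseudocode. It is a sketch under several assumptions that are not in the original: English text, a regular-expression tokenizer and sentence splitter, and a tiny illustrative stop-word list (`STOP_WORDS`). The function name `summarize` is also hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "that"}


def tokenize(text):
    """Lowercase the text and return its words."""
    return re.findall(r"[a-z']+", text.lower())


def summarize(original_text, max_summary_size):
    # Split the article into sentences at ., ! or ? followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", original_text)
    sentences = [part.strip() for part in parts if part.strip()]
    # Count content words (stop words removed)
    counts = Counter(w for w in tokenize(original_text) if w not in STOP_WORDS)
    # For each word, most frequent first, select the first sentence containing it
    chosen = set()
    for word, _ in counts.most_common():
        if len(chosen) >= max_summary_size:
            break
        for index, sentence in enumerate(sentences):
            if word in tokenize(sentence):
                chosen.add(index)
                break
    # Join the selected sentences in their original order
    return " ".join(sentences[i] for i in sorted(chosen))


text = ("Bees make honey. Honey is sweet. "
        "Farmers keep bees in hives. The weather was nice.")
print(summarize(text, 2))
```

Because several frequent words can first appear in the same sentence, the loop keeps going until enough distinct sentences have been collected or the word list runs out.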
Similar algorithms have already been packaged as tools, for example the SimpleSummariser module of the Java-based Classifier4J library, the C-based OTS library, and C# and Python implementations based on classifier4J.