How Solr Full-Text Search Works

Author: Fusheng (福生)
 2019-11-6
 
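Editor's note:
This article explains that full-text search consists of two broad processes, index creation (Indexing) and index search (Search), and walks through each of them in detail.
The article comes from cnblogs and was edited and recommended by Luca of 火龙果软件.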

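Solr is a standalone, enterprise-grade search server that exposes an API similar to a web service. Users can submit XML files in a prescribed format to the search server over HTTP to build the index, and can issue search requests via HTTP GET and receive results in XML or JSON. Solr is developed in Java 5 and built on Lucene.

Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search engine toolkit: not a complete full-text search engine, but an architecture for one. It provides a complete query engine and indexing engine, plus a partial text analysis engine (for two Western languages, English and German).

The basic principles of Lucene's full-text search match the techniques taught in well-known web search courses: tokenization, semantic and syntactic analysis, the vector space model, and so on.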

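I. Overview

According to the definition at http://lucene.apache.org/core/index.html:

Lucene is an efficient, Java-based full-text search library.

So before studying Lucene, it is worth spending some effort on understanding full-text search itself.

What, then, is full-text search? The answer starts with the data in our everyday lives.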

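The data in our lives falls broadly into two kinds: structured data and unstructured data.

Structured data: data with a fixed format or a limited length, such as databases and metadata.

Unstructured data: data of variable length or with no fixed format, such as emails and Word documents.

Some sources also mention a third kind, semi-structured data, such as XML and HTML. Depending on the need, it can be handled as structured data, or its plain text can be extracted and handled as unstructured data.

Unstructured data is also called full-text data.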

Corresponding to these kinds of data, search also falls into two kinds:

Searching structured data: for example, searching a database with SQL statements, or searching metadata, such as using Windows search on file name, type, and modification time.

Searching unstructured data: for example, Windows search can also search file contents, as can the grep command on Linux; Google and Baidu likewise search huge amounts of content.

There are two main methods for searching unstructured data, that is, full-text data:

One is serial scanning (Serial Scanning). To find files whose content contains a given string, you look at the documents one by one, reading each from beginning to end. If a document contains the string, it is one of the files you want; then you move on to the next file, until every file has been scanned. Windows search can search file contents this way, but it is quite slow: finding a file that contains some string on an 80 GB hard disk could easily take several hours. The grep command on Linux works the same way. The method may seem primitive, but for small amounts of data it is still the most direct and convenient one. For large numbers of files, however, it is very slow.
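As a rough illustration of serial scanning, the following Python sketch checks every document on every query. The function name `serial_scan` and the in-memory `documents` dictionary are assumptions made for this example (the sentences are the two sample files used in the indexing section of this article); a real scan would read files from disk.

```python
def serial_scan(documents, needle):
    """Return the IDs of all documents whose text contains `needle`.

    Every document is examined on every call, so the cost of a single
    query grows with the total size of the collection.
    """
    matches = []
    for doc_id, text in documents.items():
        if needle in text:  # substring search over this document's text
            matches.append(doc_id)
    return matches


# Hypothetical in-memory collection standing in for files on disk.
documents = {
    1: "Students should be allowed to go out with their friends, "
       "but not allowed to drink beer.",
    2: "My friend Jerry went to school to see his students "
       "but found them drunk which is not allowed.",
}

hits = serial_scan(documents, "beer")
print(hits)
```

Nothing is remembered between calls: a second query repeats the whole scan, which is the cost an index is meant to avoid.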

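Some may say: serial scanning of unstructured data is slow, while searching structured data is relatively fast (because the structure allows search algorithms that speed things up). So why not find a way to give our unstructured data some structure?

This idea is natural, and it is the basic idea behind full-text search: extract part of the information from the unstructured data, reorganize it so that it has some structure, and then search that structured data, which makes the search relatively fast.

The information extracted from the unstructured data and then reorganized is what we call the index.

That sounds abstract, but an example makes it clear. Take a dictionary of Chinese characters: its pinyin table and radical lookup table are the dictionary's index. The explanation of each character is unstructured; without the syllable table and the radical table, the only way to find a character would be to scan the whole dictionary in order. But some information about a character can be extracted and structured. Pronunciation, for instance, is fairly structured: it splits into initials and finals, each of which has only a limited number of values that can be listed one by one. So the pronunciations are pulled out and arranged in a fixed order, and each one points to the page number of the character's detailed explanation. When searching, we look up the pronunciation in the structured pinyin table and then follow the page number it points to, which leads us to the unstructured data, namely the explanation of the character.

This process of first building an index and then searching the index is called full-text search (Full-text Search).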

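The figure below comes from Lucene in Action, yet it describes not just Lucene's retrieval process but the general process of full-text search.

[Figure: the general process of full-text search]

Full-text search consists of two broad processes: index creation (Indexing) and index search (Search).

Index creation: the process of extracting information from all the structured and unstructured data in the real world and building an index.

Index search: the process of receiving the user's query, searching the index that was built, and returning the results.

Full-text search therefore raises three important questions:

1. What exactly is stored in the index? (Index)

2. How is the index created? (Indexing)

3. How is the index searched? (Search)

We examine each question in turn.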

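II. What Exactly Is Stored in the Index

What does the index actually need to store?

First, let us see why serial scanning is slow:

It is slow because the information we want to search for does not match the way information is stored in unstructured data.

Unstructured data stores which strings each file contains: given a file, finding its strings is relatively easy. This is a mapping from files to strings. What we want to search for is which files contain a given string: given a string, find the files. This is a mapping from strings to files. The two are exact opposites. So if the index always stores the mapping from strings to files, search speed improves dramatically.

Because the mapping from strings to files is the reverse of the mapping from files to strings, an index that stores this information is called an inverted index.

An inverted index generally stores the following information: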

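Suppose my document collection contains 100 documents. For convenience, we number the documents from 1 to 100 and obtain the following structure:

[Figure: inverted index structure, with a list of strings on the left, each pointing to a list of document numbers]

The left side stores a series of strings, called the dictionary.

Each string points to a linked list of the documents (Document) that contain it. This document list is called the posting list (Posting List).

With an index, the stored information matches the information being searched for, which greatly speeds up search.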

For example, to find the documents that contain both the string "lucene" and the string "solr", we need only the following steps:

1. Fetch the document list for the string "lucene".

2. Fetch the document list for the string "solr".

3. Merge the two lists to find the files that contain both "lucene" and "solr".
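The three steps above can be sketched in Python as follows. This is a minimal sketch assuming each posting list is a sorted list of document numbers; the `index` contents and the function name `intersect` are invented for illustration and are not taken from Lucene or Solr.

```python
def intersect(postings_a, postings_b):
    """Merge two sorted posting lists, keeping only IDs present in both."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the list with the smaller document number
        else:
            j += 1
    return result


# Hypothetical inverted index: dictionary string -> sorted document numbers.
index = {
    "lucene": [1, 5, 9, 23, 42],
    "solr": [2, 5, 23, 77],
}

both = intersect(index["lucene"], index["solr"])  # steps 1, 2 and 3
print(both)
```

Because both lists are sorted, the merge visits each entry at most once, so its cost depends on the lengths of the two posting lists rather than on the size of the whole collection.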

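At this point, some may object that full-text search does speed up searching, but it adds the indexing process, and the two together are not necessarily much faster than serial scanning. Indeed, once indexing is counted, full-text search is not necessarily faster than serial scanning, especially when the amount of data is small. Building an index over a very large amount of data is also a slow process.

There is a difference, however. Serial scanning has to scan every time, whereas the index needs to be created only once. After that, each search skips index creation and simply searches the index that already exists.

This is one of the advantages of full-text search over serial scanning: index once, use many times.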

III. How to Create an Index

Index creation in full-text search generally takes the following steps:

Step 1: Start with some original documents (Document) to be indexed.

To illustrate the index creation process, we use two files as an example:

File 1: Students should be allowed to go out with their friends, but not allowed to drink beer.

File 2: My friend Jerry went to school to see his students but found them drunk which is not allowed.

Step 2: Pass the original documents to the tokenizer component (Tokenizer).

The tokenizer (Tokenizer) does the following (this process is called Tokenize):

1. Split the document into individual words.

2. Remove punctuation.

3. Remove stop words (Stop word).

Stop words (Stop word) are the most common words in a language. Because they carry no special meaning, in most cases they cannot serve as search keywords, so they are dropped when the index is created, which reduces the size of the index.

English stop words (Stop word) include "the", "a", and "this".

The tokenizer (Tokenizer) for each language has its own set of stop words (stop word).

The results produced by the tokenizer (Tokenizer) are called tokens (Token).

In our example, we get the following tokens (Token):

"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "students", "found", "them", "drunk", "allowed".
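The tokenizing step can be sketched in Python as below. The stop word set is an assumption of this sketch, chosen so that the output matches the token list above; the article does not say which stop word list it used, and real analyzers ship their own lists.

```python
import re

# Assumed stop word set for illustration only.
STOP_WORDS = {
    "the", "a", "this", "should", "be", "to", "out",
    "with", "but", "not", "which", "is",
}


def tokenize(text):
    """Split text into words, drop punctuation, and remove stop words."""
    words = re.findall(r"[A-Za-z]+", text)  # keeps letters only, so punctuation is dropped
    return [w for w in words if w.lower() not in STOP_WORDS]


file_1 = ("Students should be allowed to go out with their friends, "
          "but not allowed to drink beer.")
file_2 = ("My friend Jerry went to school to see his students "
          "but found them drunk which is not allowed.")

tokens_1 = tokenize(file_1)
tokens_2 = tokenize(file_2)
print(tokens_1)
print(tokens_2)
```

Case is deliberately left untouched here, since lowercasing belongs to the linguistic processing step that follows.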

Step 3: Pass the resulting tokens (Token) to the linguistic processing component (Linguistic Processor).

The linguistic processor (linguistic processor) applies language-specific processing to the tokens (Token).

For English, the linguistic processor (Linguistic Processor) generally does the following:

1. Convert to lowercase (Lowercase).

2. Reduce words to their root form, such as "cars" to "car". This operation is called stemming.

3. Convert words to their root form, such as "drove" to "drive". This operation is called lemmatization.

Similarities and differences between stemming and lemmatization:

What they share: both stemming and lemmatization turn a word into its root form.

They differ in approach:

Stemming works by "reduction": "cars" to "car", "driving" to "drive".

Lemmatization works by "conversion": "drove" to "drive", "driving" to "drive".

They differ in algorithm:

Stemming mainly applies a fixed algorithm to perform the reduction, such as removing "s", removing "ing" and adding "e", changing "ational" to "ate", and changing "tional" to "tion".

Lemmatization mainly relies on a stored dictionary to perform the conversion. For example, the dictionary contains mappings from "driving" to "drive", from "drove" to "drive", and from "am, is, are" to "be"; converting a word is just a dictionary lookup.

Stemming and lemmatization are not mutually exclusive. They overlap: for some words, both approaches produce the same result.

The output of the linguistic processor (linguistic processor) is called a term (Term).

In our example, linguistic processing yields the following terms (Term):

"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".

It is precisely because of the linguistic processing step that a search for "drove" can also find "drive".
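A toy version of the linguistic processor is sketched below in Python. The lemma dictionary and the two suffix rules are assumptions that cover only this article's sample tokens; they are not a real stemmer (such as the Porter algorithm) or a real lemmatizer.

```python
# Assumed lemma dictionary: irregular forms mapped to their root (lemmatization).
LEMMAS = {"went": "go", "found": "find", "drunk": "drink"}


def to_term(token):
    """Lowercase a token, then apply a dictionary lookup or a crude suffix rule."""
    word = token.lower()
    if word in LEMMAS:                       # lemmatization: look it up
        return LEMMAS[word]
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]                     # stemming: "allowed" -> "allow"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                     # stemming: "students" -> "student"
    return word


tokens = ["Students", "allowed", "go", "their", "friends", "allowed",
          "drink", "beer", "My", "friend", "Jerry", "went", "school",
          "see", "his", "students", "found", "them", "drunk", "allowed"]

terms = [to_term(t) for t in tokens]
print(terms)
```

The length checks keep short words such as "his" intact. Rules this crude would mangle many other English words, which is why practical systems rely on established stemming algorithms or dictionaries.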

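Step 4: Pass the resulting terms (Term) to the indexing component (Indexer).

The indexer (Indexer) mainly does the following:

1. Use the terms (Term) to create a dictionary.

In our example the dictionary is as follows:

[Table: dictionary of terms, each paired with the ID of the document it came from]

2. Sort the dictionary alphabetically.

3. Merge identical terms (Term) into a posting list (Posting List) of documents.

[Table: each term with its document frequency and its posting list of document IDs and frequencies]

Several definitions apply to this table:

Document Frequency: the number of documents that contain the term (Term).

Frequency: the term frequency, that is, how many times the term (Term) occurs in a given document.

So for the term (Term) "allow", two documents in total contain it, and the document list following the term (Term) therefore has two entries. The first entry is the first document containing "allow", document 1, in which "allow" occurs 2 times. The second entry is the second document containing "allow", document 2, in which "allow" occurs 1 time.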
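The indexer's three tasks can be sketched in Python as follows. The data layout (a dictionary mapping each term to a list of `(document ID, frequency)` pairs) and the name `build_index` are assumptions of this sketch, not Lucene's actual on-disk format; the input terms are the ones listed for the two sample files.

```python
from collections import Counter


def build_index(docs_terms):
    """Build {term: [(doc_id, frequency), ...]} with terms in alphabetical order."""
    index = {}
    for doc_id in sorted(docs_terms):
        # Count how many times each term occurs in this document.
        for term, freq in Counter(docs_terms[doc_id]).items():
            index.setdefault(term, []).append((doc_id, freq))
    return dict(sorted(index.items()))  # sort the dictionary alphabetically


docs_terms = {
    1: ["student", "allow", "go", "their", "friend", "allow", "drink", "beer"],
    2: ["my", "friend", "jerry", "go", "school", "see", "his", "student",
        "find", "them", "drink", "allow"],
}

index = build_index(docs_terms)
document_frequency = {term: len(postings) for term, postings in index.items()}

print(index["allow"], document_frequency["allow"])
```

For "allow" this yields a document frequency of 2 and the postings (1, 2) and (2, 1), matching the description above.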

At this point the index has been created, and we can use it to find the documents we want very quickly.

What is more, we are pleasantly surprised to find that a search for "drive" also finds documents containing "driving", "drove", and "driven". In our index, "driving", "drove", and "driven" are all turned into "drive" by linguistic processing. At search time, if you enter "driving", the query goes through the same steps one to three described here and becomes a query for "drive", so the desired documents are found.

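IV. How to Search the Index

At this point it seems we can declare: "we have found the documents we want."

But the story is not over. Finding documents is only one aspect of full-text search. If only one or ten documents contain the string we query, then we have indeed found what we want. But what if there are a thousand results, or tens of thousands? Which one is the file you want most?

Open Google. Say you want to find a job at Microsoft, so you enter "Microsoft job", only to discover that 22600000 results are returned. That is a huge number. Suddenly, not finding anything is a problem, and finding too much is also a problem. Among so many results, how do we put the most relevant ones first?

Google, of course, does this well: you find jobs at Microsoft right away. Imagine how dreadful it would be if the first several results were all "Microsoft does a good job at software industry…".

How can we, like Google, find the results most relevant to the query among thousands of search results?

How do we judge the relevance between a retrieved document and the query?

This brings us back to our third question: how is the index searched?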

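Searching consists mainly of the following steps:

Step 1: The user enters a query.

A query, like ordinary language, has a grammar.

Different query languages have different grammars; SQL statements, for example, follow a specific grammar.

The grammar of a query depends on the implementation of the full-text search system. The most basic operators include AND, OR, and NOT.

For example, the user enters: lucene AND learned NOT hadoop.

This means the user wants documents that contain lucene and learned but do not contain hadoop.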

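Step 2: Perform lexical analysis, syntax analysis, and linguistic processing on the query.

Because the query has a grammar, it must also go through lexical analysis, syntax analysis, and linguistic processing.

1. Lexical analysis identifies words and keywords.

In the example above, lexical analysis yields the words lucene, learned, and hadoop, and the keywords AND and NOT.

A misspelled keyword is not recognized as a keyword. In lucene AMD learned, for instance, AND is misspelled, so AMD takes part in the query as an ordinary word.

2. Syntax analysis builds a syntax tree according to the grammar rules of the query.

If the query does not satisfy the grammar rules, an error is reported. For example, lucene NOT AND learned produces an error.

In the example above, lucene AND learned NOT hadoop forms the following syntax tree:

[Figure: syntax tree for lucene AND learned NOT hadoop]

3. Linguistic processing is almost identical to the linguistic processing performed during indexing.

For example, learned becomes learn.

After step 2, we have a syntax tree that has been through linguistic processing.

[Figure: the syntax tree after linguistic processing, with learned replaced by learn]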
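The three sub-steps can be sketched in Python as below. The grammar here is an assumption made for illustration: operators are applied left to right, NOT is treated as a binary "and not", and two adjacent words are joined by an implicit OR. The toy `normalize` function stands in for real linguistic processing. None of this is Lucene's actual query parser.

```python
KEYWORDS = {"AND", "OR", "NOT"}
DEFAULT_OP = "OR"  # assumed operator for two adjacent words


def lex(query):
    """Lexical analysis: label each item as a keyword or an ordinary word."""
    return [("KEYWORD" if w in KEYWORDS else "WORD", w) for w in query.split()]


def normalize(word):
    """Toy linguistic processing: lowercase plus a one-entry lemma table."""
    word = word.lower()
    return {"learned": "learn"}.get(word, word)


def parse(query):
    """Syntax analysis: build a left-associative tree of nested tuples."""
    tokens = lex(query)
    if not tokens or tokens[0][0] != "WORD":
        raise ValueError("query must start with a word")
    tree = normalize(tokens[0][1])
    pos = 1
    while pos < len(tokens):
        kind, text = tokens[pos]
        if kind == "WORD":
            op, operand, pos = DEFAULT_OP, text, pos + 1
        else:
            if pos + 1 >= len(tokens) or tokens[pos + 1][0] != "WORD":
                raise ValueError(f"keyword {text} must be followed by a word")
            op, operand, pos = text, tokens[pos + 1][1], pos + 2
        tree = (op, tree, normalize(operand))
    return tree


tree = parse("lucene AND learned NOT hadoop")
print(tree)
```

With these rules, "lucene AMD learned" parses without error because AMD is just another word, while "lucene NOT AND learned" raises a `ValueError` because a keyword is followed by another keyword.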

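Step 3: Search the index to obtain the documents that satisfy the syntax tree.

This step consists of several smaller steps:

First, look up the inverted index to find the document lists for lucene, learn, and hadoop.

Next, intersect the lists for lucene and learn to obtain the list of documents that contain both lucene and learn.

Then take the difference between this list and the document list for hadoop, removing the documents that contain hadoop. This gives the list of documents that contain both lucene and learn but not hadoop.

This document list is the set of documents we are looking for.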
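These sub-steps can be sketched in Python by evaluating a small query tree against an inverted index. The nested-tuple tree format and the posting lists below are assumptions invented for this example; the sketch uses Python sets for brevity rather than merging sorted lists.

```python
def evaluate(tree, index):
    """Return the sorted document numbers that satisfy a query tree."""
    if isinstance(tree, str):                 # leaf: look up the posting list
        return list(index.get(tree, []))
    op, left, right = tree
    left_ids = evaluate(left, index)
    right_ids = set(evaluate(right, index))
    if op == "AND":                           # intersection
        return [d for d in left_ids if d in right_ids]
    if op == "NOT":                           # difference
        return [d for d in left_ids if d not in right_ids]
    if op == "OR":                            # union
        return sorted(set(left_ids) | right_ids)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical inverted index: term -> sorted document numbers.
index = {
    "lucene": [1, 2, 4, 7, 9],
    "learn": [2, 4, 5, 9],
    "hadoop": [4, 8],
}

# Tree for: lucene AND learn NOT hadoop
query_tree = ("NOT", ("AND", "lucene", "learn"), "hadoop")
result = evaluate(query_tree, index)
print(result)
```

The intersection of the lucene and learn lists is [2, 4, 9], and removing the documents that contain hadoop leaves [2, 9].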

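Step 4: Sort the results by the relevance between each document and the query.

Although the previous step gave us the documents we want, the results should be sorted by their relevance to the query, with the most relevant first.

How do we compute the relevance between a document and the query?

One way is to treat the query as a short document and score (scoring) the relevance (relevance) between documents. A higher score means better relevance and a higher position in the ranking.

How, then, do we score the relationship between documents?

This is not easy. Let us first look at how we judge the relationship between people.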

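First, a person has many attributes, such as personality, beliefs, hobbies, clothing, height, and build.

Second, for relationships between people, different attributes matter to different degrees. Personality, beliefs, and hobbies may matter more; clothing, height, and build may matter less. People with the same or similar personality, beliefs, and hobbies become good friends more easily, yet people who differ in clothing, height, and build can also be good friends.

So to judge the relationship between two people, we first find out which attributes matter most to the relationship, such as personality, beliefs, and hobbies. Then we compare these attributes between the two people. Say one person is cheerful and the other outgoing; one is a Buddhist and the other believes in God; one likes playing basketball and the other likes playing football. We find that both are positive in personality, kind in their beliefs, and fond of sports, so the two should get along well.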

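Now let us look at the relationship between companies.

First, a company is made up of many people, such as the general manager, managers, the chief technology officer, ordinary employees, security guards, and doormen.

Second, for relationships between companies, different people matter to different degrees. The general manager, managers, and chief technology officer may matter more; ordinary employees, security guards, and doormen may matter less. So if the general managers, managers, and chief technology officers of two companies get along well, the two companies tend to have a good relationship. But even if an ordinary employee of one company has a bitter feud with an ordinary employee of the other, it is unlikely to affect the relationship between the two companies.

So to judge the relationship between two companies, we first find out which people matter most to it, such as the general manager, managers, and chief technology officer. Then we examine the relationships among these people. Say the two general managers were classmates, the managers come from the same hometown, and the chief technology officers were once startup partners. We find that the general managers, managers, and chief technology officers of the two companies all get along well, so the two companies should have a good relationship.

Having analyzed these two kinds of relationship, let us now see how to judge the relationship between documents.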

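First, a document is made up of many terms (Term), such as search, lucene, full-text, this, a, and what.

Second, for the relationship between documents, different terms matter to different degrees. For this article, search, Lucene, and full-text are relatively important, while this, a, and what are relatively unimportant. So if two documents both contain search, Lucene, and full-text, they are more relevant to each other. But if one document contains this, a, and what and the other does not, that has no effect on the relevance of the two documents.

So to judge the relationship between documents, we first find out which terms (Term) matter most to it, such as search, Lucene, and full-text. Then we examine the relationships among these terms (Term).

The process of determining how important a term (Term) is to a document is called computing the term weight (Term weight).

Computing a term weight (term weight) takes two parameters: the first is the term (Term), the second is the document (Document).

The term weight (Term weight) expresses how important the term (Term) is in the document. The more important a term (Term), the larger its weight (Term weight), and the larger the role it plays in computing the relevance between documents.

The process of judging the relationships among terms (Term) to obtain document relevance uses an algorithm called the vector space model (Vector Space Model).

Let us analyze these two processes in detail: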

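1. Computing the term weight (Term weight).

Two main factors determine how important a term (Term) is in a document:

Term Frequency (tf): how many times the term occurs in this document. The larger the tf, the more important the term.

Document Frequency (df): how many documents contain the term. The larger the df, the less important the term.

Is that easy to understand? The more often a term (Term) occurs in a document, the more important it is to that document. The word "search", for example, occurs many times in this article, which shows that the article is mainly about this subject. But in an English document, "this" occurs even more often; does that make it more important? No. That is where the second factor comes in as a correction: the more documents contain a term (Term), the more ordinary the term is and the less it can distinguish these documents from one another, so its importance is lower.

The same holds for the skills we programmers learn. For an individual programmer, the deeper the mastery of a technology the better (deeper mastery means more time spent studying it, so a larger tf), and the more competitive that programmer is when looking for a job. Across all programmers, however, the fewer people who know the technology the better (fewer people means a smaller df), and the more competitive one is when looking for a job. A person's value lies in being irreplaceable, and this is the reason why.

Now that the reasoning is clear, let us look at the formula: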
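A commonly used form of the term weight that is consistent with the two factors above is the following. The symbols are the notation of this sketch: $w_{t,d}$ is the weight of term $t$ in document $d$, $tf_{t,d}$ is the number of times $t$ occurs in $d$, $n$ is the total number of documents, and $df_t$ is the number of documents that contain $t$.

```latex
w_{t,d} = tf_{t,d} \times \log\left(\frac{n}{df_t}\right)
```

The weight grows with $tf_{t,d}$ and shrinks as $df_t$ approaches $n$; a term that occurs in every document gets a weight of zero.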

This is only a simple, typical form of the term weight formula. People who implement full-text search systems use their own variants, and Lucene's differs slightly from it.
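As a minimal Python sketch, assume the textbook weight tf × log(n / df), where n is the total number of documents. This is an assumed typical form, not Lucene's own scoring. The input terms are those of the two sample files from the indexing section.

```python
import math
from collections import Counter


def term_weights(docs_terms):
    """Return {doc_id: {term: tf * log(n / df)}} for every term of every document."""
    n = len(docs_terms)
    df = Counter()
    for terms in docs_terms.values():
        df.update(set(terms))           # count each document once per term
    weights = {}
    for doc_id, terms in docs_terms.items():
        tf = Counter(terms)             # occurrences of each term in this document
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights


docs_terms = {
    1: ["student", "allow", "go", "their", "friend", "allow", "drink", "beer"],
    2: ["my", "friend", "jerry", "go", "school", "see", "his", "student",
        "find", "them", "drink", "allow"],
}

weights = term_weights(docs_terms)
print(weights[1]["allow"], weights[1]["beer"])
```

In this tiny collection "allow" occurs in both documents, so its weight is zero even though its tf in document 1 is 2, while "beer", which occurs only in document 1, receives a positive weight.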

2. Judging the relationships among terms to obtain document relevance, that is, the vector space model algorithm (VSM).

We treat a document as a series of terms (Term), each with a weight (Term weight). Different terms (Term) influence the relevance score of the document according to their weights in it.

So we treat the weights (term weight) of all the terms (term) in a document as a vector.

Document = {term1, term2, ..., term N}

Document Vector = {weight1, weight2, ..., weight N}

Likewise, we treat the query as a simple document and represent it as a vector too.

Query = {term1, term2, ..., term N}

Query Vector = {weight1, weight2, ..., weight N}

We place all the retrieved document vectors and the query vector in an N-dimensional space, in which each term (term) is one dimension.

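As shown in the figure:

[Figure: document vectors and the query vector in an N-dimensional term space]

We consider that the smaller the angle between two vectors, the greater the relevance.

So we use the cosine of the angle as the relevance score: the smaller the angle, the larger the cosine, the higher the score, and the greater the relevance.

Some may ask: a query is usually short and contains few terms (Term), so the query vector has few dimensions, whereas a document is long and contains many terms (Term), so the document vector has many dimensions. Why do both have N dimensions in the figure?

Since both must be placed in the same vector space, they naturally have the same number of dimensions. When their terms differ, we take the union of the two, and a term (Term) that one of them does not contain gets a weight (Term Weight) of 0.

The relevance scoring formula is as follows: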
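The cosine of the angle between the query vector and a document vector is conventionally written as below. The symbols are the notation of this sketch: $w_{i,q}$ and $w_{i,d}$ are the weights of the $i$-th term in the query $q$ and in the document $d$, and $N$ is the number of dimensions.

```latex
\mathrm{score}(q, d) = \cos\theta
  = \frac{\vec{V}_q \cdot \vec{V}_d}{\lvert \vec{V}_q \rvert \, \lvert \vec{V}_d \rvert}
  = \frac{\sum_{i=1}^{N} w_{i,q}\, w_{i,d}}
         {\sqrt{\sum_{i=1}^{N} w_{i,q}^{2}} \; \sqrt{\sum_{i=1}^{N} w_{i,d}^{2}}}
```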

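For example, suppose the query has 11 terms and three documents are retrieved. Their respective weights (Term weight) are given in the following table.

[Table: term weights of the query and the three documents]

The relevance scores of the three documents with respect to the query are then computed:

[Relevance scores of the three documents]

Document 2 has the highest relevance and is returned first, followed by document 1, and finally document 3.

At this point, we can find the documents we want most.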
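The scoring and ranking step can be sketched in Python as follows. The weights are made-up numbers for illustration and are not the values of the article's table. Each vector is stored as a dictionary from term to weight, so a missing term counts as weight 0, which corresponds to taking the union of the terms.

```python
import math


def cosine(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors."""
    terms = set(query_vec) | set(doc_vec)          # union of the dimensions
    dot = sum(query_vec.get(t, 0.0) * doc_vec.get(t, 0.0) for t in terms)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0                                  # an empty vector matches nothing
    return dot / (norm_q * norm_d)


# Hypothetical term weights for a query and three retrieved documents.
query = {"lucene": 1.0, "learn": 1.0}
docs = {
    "doc1": {"lucene": 0.5, "learn": 0.5, "search": 0.7},
    "doc2": {"lucene": 0.9, "learn": 0.8},
    "doc3": {"lucene": 0.1, "index": 0.9, "search": 0.4},
}

scores = {name: cosine(query, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these made-up weights, the document whose weight is concentrated on the query's terms ranks first, and the one that barely shares a term with the query ranks last.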

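After all this, we still have not reached Lucene itself; everything so far is only the basic theory of information retrieval (Information retrieval). Once we look at Lucene, however, we will find that Lucene is a basic implementation of this theory. Later articles that analyze Lucene will therefore often show how the theory above is applied in Lucene.

Before moving on to Lucene, here is a summary of the index creation and search processes described above, as shown in the figure:

[Figure: summary of the index creation and search processes]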

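1. Indexing process:

1) There is a series of files to be indexed.

2) The files to be indexed go through lexical analysis and linguistic processing to form a series of terms (Term).

3) Index creation forms the dictionary and the inverted index table.

4) Index storage writes the index to disk.

2. Search process:

a) The user enters a query.

b) The query goes through lexical analysis and linguistic processing to yield a series of terms (Term).

c) Syntax analysis yields a query tree.

d) Index storage reads the index into memory.

e) The query tree is used to search the index, obtaining the document list for each term (Term); the document lists are combined by intersection, difference, and union to obtain the result documents.

f) The result documents are sorted by their relevance to the query.

g) The query results are returned to the user.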

 
   