Editor's note:
This article explains that full-text search consists broadly of two processes, index creation (Indexing) and index search (Search), and describes both in detail.
The article comes from cnblogs and was edited and recommended by Luca of 火龙果软件.
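Solr is a standalone, enterprise-grade search server that exposes an API similar to a web service. Users can submit XML files in a prescribed format to the search server over HTTP to build an index, or issue search requests with HTTP GET and receive results in XML or JSON. Solr was developed in Java 5 and is built on Lucene.
Lucene began as a subproject of the Apache Software Foundation's Jakarta project. It is an open-source toolkit for building full-text search engines: not a complete search engine in itself, but an architecture for one, providing a complete query engine and indexing engine plus part of a text analysis engine (for two Western languages, English and German).
The basic principles behind Lucene's full-text search match the techniques taught in well-known web search courses: tokenization, syntactic and semantic analysis, the vector space model, and so on.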
I. Overview
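According to the definition at http://lucene.apache.org/core/index.html:
Lucene is a high-performance full-text search library written in Java.
So before looking at Lucene itself, it is worth spending some effort on understanding full-text search.
What is full-text search? The answer starts with the data we meet in everyday life.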
The data around us falls into two broad categories: structured data and unstructured data.
Structured data: data with a fixed format or bounded length, such as databases and metadata.
Unstructured data: data of variable length or without a fixed format, such as email and Word documents.
Some sources add a third category, semi-structured data, such as XML and HTML, which can be handled as structured data when needed, or reduced to plain text and handled as unstructured data.
Unstructured data is also called full-text data.
Search divides the same way:
Searching structured data: for example, querying a database with SQL, or searching metadata, as when Windows search looks up files by name, type, or modification time.
Searching unstructured data: for example, Windows search over file contents, the grep command on Linux, or Google and Baidu searching huge volumes of content.
There are two main ways to search unstructured, that is, full-text, data:
The first is serial scanning (Serial Scanning). To find the files whose content contains a given string, you look at the documents one by one, reading each from beginning to end; if a document contains the string, it is one of the files you want, and you move on to the next file until every file has been scanned. Windows search can search file contents this way, only rather slowly: on an 80 GB disk, finding a file that contains a given string will likely take several hours. The grep command on Linux works the same way. The method may seem primitive, but for small amounts of data it is still the most direct and convenient. For large numbers of files, it is very slow.
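Serial scanning as described above can be sketched in a few lines. This is only an illustration: the function name `serial_scan` and the three sample documents are assumptions made for the example, not part of the original article.

```python
def serial_scan(documents, needle):
    """Return the ids of all documents whose text contains `needle`."""
    matches = []
    for doc_id, text in documents.items():
        # Every document is read from start to end on every single search.
        if needle in text:
            matches.append(doc_id)
    return matches


# Hypothetical sample documents.
docs = {
    1: "Lucene is a full-text search library.",
    2: "grep scans files line by line.",
    3: "Solr is built on Lucene.",
}
print(serial_scan(docs, "Lucene"))
```

The cost of each search grows with the total size of all documents, which is why the approach stops being practical for large collections.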
One might object: serial scanning of unstructured data is slow, while searching structured data is comparatively fast (because the structure lets a search algorithm speed things up), so why not find a way to give our unstructured data some structure?
The idea is a natural one, and it is the basic idea of full-text search: extract part of the information in the unstructured data, reorganize it so that it has some structure, and then search that structured data, which makes the search comparatively fast.
This information, extracted from unstructured data and then reorganized, is what we call the index.
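That description is rather abstract, but an example makes it clear. Take a dictionary: its pinyin table and radical lookup table are the dictionary's index. The explanation of each character is unstructured, and without those tables the only way to find a character in the vast sea of entries would be a serial scan. Some information about a character, however, can be extracted and treated as structured. Pronunciation is fairly structured: it splits into an initial and a final, each drawn from a small set that can be listed exhaustively. So the pronunciations are pulled out and arranged in a fixed order, and each one points to the page where the character is explained in detail. To search, we look up the pronunciation in the structured pinyin table and follow the page number it points to, which leads to our unstructured data, namely the explanation of the character.
This process of first building an index and then searching the index is called full-text search (Full-text Search).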
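The figure below comes from Lucene in Action, yet it describes not only Lucene's search process but the general process of full-text search.
Full-text search consists broadly of two processes: index creation (Indexing) and index search (Search).
Index creation: extracting information from all the structured and unstructured data in the real world and building an index from it.
Index search: taking the user's query, searching the index that was built, and returning the results.
Full-text search therefore raises three important questions:
1. What exactly is stored in the index? (Index)
2. How is the index created? (Indexing)
3. How is the index searched? (Search)
We examine each question in turn below.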
II. What Is Stored in the Index
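What exactly does the index need to store?
First, consider why serial scanning is slow:
It is slow because the information we want to search for does not match the way information is stored in unstructured data.
Unstructured data stores which strings each file contains: given a file, finding its strings is easy, so it is a mapping from files to strings. What we want to search for is which files contain a given string: given a string, find the files, which is a mapping from strings to files. The two are exact opposites. So if the index always stores the mapping from strings to files, search becomes much faster.
Because the mapping from strings to files is the reverse of the mapping from files to strings, an index that stores this information is called an inverted index.
An inverted index generally stores the following information:
Suppose my collection holds 100 documents. For convenience we number the documents from 1 to 100, which gives the structure below.
On the left is a series of strings, called the dictionary.
Each string points to a linked list of the documents (Document) that contain it; this list is called the posting list (Posting List).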
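The dictionary and posting lists described above can be sketched with an ordinary Python dict. The function name `build_inverted_index`, the whitespace-based splitting, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each string to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # The keys form the dictionary; each value is a posting list.
    return {word: sorted(ids) for word, ids in sorted(index.items())}


# Hypothetical sample documents.
docs = {
    1: "lucene indexes text",
    2: "solr uses lucene",
    3: "hadoop stores files",
}
index = build_inverted_index(docs)
print(index["lucene"])
```

A lookup now goes straight from a string to the documents that contain it, without reading any document text.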
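With the index, the stored information matches the information being searched for, which greatly speeds up search.
For example, to find the documents that contain both the string "lucene" and the string "solr", we need only the following steps:
1. Fetch the list of documents containing the string "lucene".
2. Fetch the list of documents containing the string "solr".
3. Merge the two lists to find the files that contain both "lucene" and "solr".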
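The merge in step 3 can be sketched as a two-pointer walk over two sorted posting lists. The function name `intersect` and the document ids below are hypothetical and chosen only for the example.

```python
def intersect(postings_a, postings_b):
    """Merge two sorted posting lists, keeping ids present in both."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the list with the smaller id
        else:
            j += 1
    return result


# Hypothetical posting lists, not taken from the article.
lucene_docs = [3, 7, 15, 30, 67]
solr_docs = [7, 9, 15, 88]
print(intersect(lucene_docs, solr_docs))
```

Because both lists are sorted, each is traversed at most once, so the merge takes time proportional to the combined length of the two lists.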
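At this point one might say: full-text search does speed up searching, but it adds the indexing step, and the two together are not necessarily much faster than a serial scan. True: counting the indexing step, full-text search is not necessarily faster than serial scanning, especially on small amounts of data, and building an index over a very large amount of data is itself slow.
There is a difference, though. A serial scan has to scan every time, whereas the index is built only once. After that, each search skips index creation and just searches the existing index.
This is one of the advantages of full-text search over serial scanning: index once, use many times.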
III. How to Create the Index
Index creation in full-text search generally involves the following steps:
Step 1: Start with some original documents (Document) to be indexed.
To illustrate index creation, we use two files as an example:
File 1: Students should be allowed to go out with their friends, but not allowed to drink beer.
File 2: My friend Jerry went to school to see his students but found them drunk which is not allowed.
Step 2: Pass the original documents to the tokenizer (Tokenizer).
The tokenizer does the following (a process called tokenizing):
1. Splits the document into individual words.
2. Removes punctuation.
3. Removes stop words (Stop word).
Stop words are the most common words in a language. They carry no particular meaning and so can rarely serve as search keywords, which is why they are dropped when the index is built, reducing the size of the index.
English stop words include "the", "a", and "this".
The tokenizer for each language comes with its own set of stop words.
The output of the tokenizer is called tokens (Token).
In our example, we get the following tokens:
"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "students", "found", "them", "drunk", "allowed".
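The tokenizing step can be sketched as follows. The stop-word set is an assumption: it contains exactly the words that disappear between the two example files and the token list above, and a real tokenizer would ship a much larger set. The function name `tokenize` is also hypothetical.

```python
import re

# Assumed stop-word set: just the words that are dropped in the example above.
STOP_WORDS = {"should", "be", "to", "out", "with", "but", "not", "which", "is"}


def tokenize(text):
    """Split text into words, drop punctuation, and remove stop words."""
    words = re.findall(r"[A-Za-z]+", text)  # punctuation never matches
    return [w for w in words if w.lower() not in STOP_WORDS]


doc1 = ("Students should be allowed to go out with their friends, "
        "but not allowed to drink beer.")
doc2 = ("My friend Jerry went to school to see his students "
        "but found them drunk which is not allowed.")
print(tokenize(doc1))
print(tokenize(doc2))
```

With this stop-word set the two calls reproduce the token list given above, with the original capitalization still intact.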
Step 3: Pass the tokens (Token) to the linguistic processor (Linguistic Processor).
The linguistic processor applies language-specific processing to the tokens.
For English, the linguistic processor generally does the following:
1. Converts to lowercase (Lowercase).
2. Reduces words to their root form, such as "cars" to "car". This operation is called stemming.
3. Transforms words into their root form, such as "drove" to "drive". This operation is called lemmatization.
Similarities and differences between stemming and lemmatization:
What they share: both stemming and lemmatization turn a word into its root form.
They differ in approach:
Stemming works by "reduction": "cars" to "car", "driving" to "drive".
Lemmatization works by "transformation": "drove" to "drive", "driving" to "drive".
They differ in algorithm:
Stemming mainly applies fixed rules to do the reduction, such as removing "s", removing "ing" and adding "e", changing "ational" to "ate", and changing "tional" to "tion".
Lemmatization mainly relies on a stored dictionary to do the transformation. For example, the dictionary holds mappings from "driving" to "drive", "drove" to "drive", and "am, is, are" to "be"; transforming a word is just a dictionary lookup.
Stemming and lemmatization are not mutually exclusive. They overlap: for some words, both approaches produce the same result.
The output of the linguistic processor is called terms (Term).
In our example, language processing yields the following terms:
"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".
It is precisely because of this language-processing step that a search for "drove" can also find "drive".
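A toy version of the linguistic processor might look like this. It is a sketch under stated assumptions: the lemma dictionary `LEMMAS` covers only the irregular forms mentioned in the text, and the two suffix rules in `stem` are far cruder than a real stemming algorithm. All names are hypothetical.

```python
# Assumed lemma dictionary covering only the irregular forms in the example.
LEMMAS = {"went": "go", "found": "find", "drunk": "drink", "drove": "drive"}


def stem(word):
    """Apply two fixed suffix rules (a toy stand-in for a real stemmer)."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def to_term(token):
    """Lowercase a token, then lemmatize by lookup or fall back to stemming."""
    word = token.lower()
    return LEMMAS.get(word, stem(word))


tokens = ["Students", "allowed", "go", "their", "friends", "allowed",
          "drink", "beer"]
print([to_term(t) for t in tokens])
```

The dictionary lookup plays the role of lemmatization and the suffix rules play the role of stemming, which mirrors the rule-based versus dictionary-based distinction drawn above.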
Step 4: Pass the terms (Term) to the indexer (Indexer).
The indexer does the following:
1. Builds a dictionary from the terms.
In our example, the dictionary is as follows:
2. Sorts the dictionary alphabetically.
3. Merges identical terms into posting lists (Posting List).
A few definitions apply to this table:
Document Frequency: the number of files that contain this term.
Frequency: the number of times this term occurs in a given file.
So for the term "allow", two documents contain it, and the document list that follows the term has two entries. The first entry stands for the first document containing "allow", document 1, in which "allow" occurs 2 times; the second entry stands for the second document containing "allow", document 2, in which "allow" occurs 1 time.
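The three indexer steps can be sketched with the terms from the example. The function name `build_index` and the shape of the result, a tuple of document frequency and posting list, are assumptions made for illustration and not Lucene's actual data layout.

```python
from collections import Counter

# Terms per document, as listed in the language-processing step above.
doc_terms = {
    1: ["student", "allow", "go", "their", "friend", "allow", "drink", "beer"],
    2: ["my", "friend", "jerry", "go", "school", "see", "his", "student",
        "find", "them", "drink", "allow"],
}


def build_index(doc_terms):
    """Return {term: (document frequency, [(doc id, frequency), ...])}."""
    # 1. Build the dictionary as (term, doc id) pairs.
    pairs = [(term, doc_id)
             for doc_id, terms in doc_terms.items() for term in terms]
    # 2. Sort alphabetically (ties broken by doc id).
    pairs.sort()
    # 3. Merge identical terms into posting lists with per-document counts.
    counts = Counter(pairs)
    index = {}
    for (term, doc_id), freq in sorted(counts.items()):
        index.setdefault(term, []).append((doc_id, freq))
    return {term: (len(postings), postings) for term, postings in index.items()}


index = build_index(doc_terms)
print(index["allow"])
```

For "allow" this gives a document frequency of 2 and the entries (1, 2) and (2, 1), matching the description above.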
At this point the index has been built, and we can use it to find the documents we want quickly.
Along the way we are pleased to discover that searches for "drive", "driving", "drove", and "driven" all succeed too. In our index, "driving", "drove", and "driven" are all turned into "drive" by language processing; at search time, if you enter "driving", the query goes through the same steps one to three described here and becomes a query for "drive", so it finds the documents you want.
IV. How to Search the Index
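At this point it seems we can declare "we have found the documents we want".
But the story is not over: finding documents is only one aspect of full-text search. If only one or ten documents contain the string we query, then we have indeed found what we want. But what if there are a thousand results, or tens of thousands? Which one is the file you want most?
Open Google. Say you want a job at Microsoft, so you type "Microsoft job", and you find that 22600000 results come back. That is a huge number: suddenly, not finding anything is a problem, and finding too much is a problem too. With so many results, how do we put the most relevant ones first?
Google, of course, does this well, and you find jobs at Microsoft right away. Imagine how dreadful it would be if the top results were all "Microsoft does a good job at software industry...".
How can we, like Google, find the results most relevant to the query among thousands of matches?
How do we judge the relevance between a retrieved document and the query?
This brings us back to our third question: how is the index searched?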
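Searching consists of the following steps:
Step 1: The user enters a query.
A query, like ordinary language, has a grammar.
Different query languages have different grammars; SQL, for instance, has its own.
The grammar of a query depends on the implementation of the full-text search system. The most basic operators include AND, OR, and NOT.
For example, the user enters: lucene AND learned NOT hadoop.
This says the user wants documents that contain lucene and learned but not hadoop.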
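Step 2: Perform lexical analysis, syntactic analysis, and language processing on the query.
Because a query has a grammar, it too must go through lexical analysis, syntactic analysis, and language processing.
1. Lexical analysis identifies words and keywords.
In the example above, lexical analysis yields the words lucene, learned, and hadoop, and the keywords AND and NOT.
A misspelled keyword goes wrong at this stage. In lucene AMD learned, for example, AND is misspelled, so AMD takes part in the query as an ordinary word.
2. Syntactic analysis builds a syntax tree according to the grammar rules of the query language.
If the query does not satisfy the grammar rules, an error is reported. lucene NOT AND learned, for example, fails.
In the example above, lucene AND learned NOT hadoop forms the following syntax tree:
3. Language processing is almost identical to the language processing done during indexing.
For example, learned becomes learn.
After step 2, we have a syntax tree that has been through language processing.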
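The three parts of step 2 can be sketched together. Everything here is an assumption made for illustration: the function names, the nested-tuple tree shape, the left-to-right grouping of operators, and the one-rule stemmer. It is not Lucene's query parser.

```python
KEYWORDS = {"AND", "OR", "NOT"}


def lex(query):
    """Lexical analysis: tag each token as a keyword or an ordinary word."""
    return [("KEYWORD" if tok in KEYWORDS else "WORD", tok)
            for tok in query.split()]


def process(word):
    """Language processing; a one-rule toy stemmer is assumed here."""
    word = word.lower()
    return word[:-2] if word.endswith("ed") else word


def parse(query):
    """Syntactic analysis: build a left-leaning tree (operator, left, right)."""
    tokens = lex(query)
    if not tokens or tokens[0][0] != "WORD":
        raise SyntaxError("query must start with a word")
    tree = process(tokens[0][1])
    rest = tokens[1:]
    if len(rest) % 2 != 0:
        raise SyntaxError("operator without a right-hand word")
    for (op_kind, op), (word_kind, word) in zip(rest[0::2], rest[1::2]):
        if op_kind != "KEYWORD" or word_kind != "WORD":
            raise SyntaxError(f"unexpected token near {op!r} {word!r}")
        tree = (op, tree, process(word))
    return tree


print(parse("lucene AND learned NOT hadoop"))
```

The malformed query lucene NOT AND learned raises a SyntaxError in this sketch. Note one simplification: the lexer tags the misspelled AMD as an ordinary word, as described above, but this parser then rejects two adjacent words instead of joining them implicitly.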
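Step 3: Search the index for the documents that match the syntax tree.
This step breaks down into several smaller steps:
First, in the inverted index, find the document lists for lucene, learn, and hadoop.
Next, merge the lists for lucene and learn to get the list of documents that contain both lucene and learn.
Then take the difference between this list and the document list for hadoop, removing the documents that contain hadoop, which leaves the list of documents that contain both lucene and learn but not hadoop.
This document list is the set of documents we are looking for.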
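These sub-steps amount to an intersection followed by a difference. The sketch below uses Python sets for brevity; the index contents and the `search` function with its `must` and `must_not` parameters are hypothetical.

```python
# Hypothetical posting lists; the document ids are made up for illustration.
index = {
    "lucene": [1, 2, 4, 7],
    "learn": [2, 4, 5, 7],
    "hadoop": [4, 9],
}


def search(index, must, must_not):
    """Intersect the `must` posting lists, then subtract the `must_not` ones."""
    result = set(index.get(must[0], []))
    for term in must[1:]:
        result &= set(index.get(term, []))  # intersection (AND)
    for term in must_not:
        result -= set(index.get(term, []))  # difference (NOT)
    return sorted(result)


print(search(index, ["lucene", "learn"], ["hadoop"]))
```

Documents 2, 4, and 7 contain both lucene and learn; document 4 also contains hadoop and is removed, leaving documents 2 and 7.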
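Step 4: Sort the results by their relevance to the query.
Although the previous step gave us the documents we wanted, the results should be sorted by relevance to the query, with the most relevant first.
How do we compute the relevance between a document and the query?
We can treat the query as a short document and score (scoring) the relevance (relevance) between documents: a higher score means higher relevance, so the document should rank nearer the top.
So how do we score the relationship between documents?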
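This is no easy matter. Let us first look at how we judge the relationship between people.
First, a person has many attributes: personality, beliefs, hobbies, dress, height, build, and so on.
Second, these attributes matter to different degrees in a relationship between people. Personality, beliefs, and hobbies probably matter more; dress, height, and build probably matter less. So people with the same or similar personality, beliefs, and hobbies tend to become good friends, yet people who differ in dress, height, and build can become good friends too.
So to judge the relationship between two people, we first find which attributes matter most to it, such as personality, beliefs, and hobbies, and then judge how the two people's attributes relate. Say one person is cheerful and the other outgoing; one believes in Buddhism and the other in God; one likes basketball and the other likes football. We find that both have a positive personality, both hold benevolent beliefs, and both love sports, so the two should get along well.
Now consider the relationship between companies.
First, a company is made up of many people: the general manager, managers, the chief technology officer, ordinary employees, security guards, doormen, and so on.
Second, these people matter to different degrees in a relationship between companies. The general manager, managers, and chief technology officer probably matter more; ordinary employees, security guards, and doormen probably matter less. So if the general managers, managers, and chief technology officers of two companies are on good terms, the two companies tend to have a good relationship, whereas even a blood feud between an ordinary employee of one company and an ordinary employee of the other is unlikely to affect the relationship between the companies.
So to judge the relationship between two companies, we first find which people matter most to it, such as the general manager, managers, and chief technology officer, and then judge the relationships between those people. Say the two general managers were classmates, the managers come from the same hometown, and the chief technology officers were once startup partners. We find that the general managers, managers, and chief technology officers all get along well, so the two companies should get along well.
Having analysed these two kinds of relationship, let us see how to judge the relationship between documents.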
First, a document is made up of many terms (Term), such as search, lucene, full-text, this, a, and what.
Second, different terms matter to different degrees in the relationship between documents. For this document, search, Lucene, and full-text matter more, while this, a, and what matter less. So if two documents both contain search, Lucene, and full-text, they are more relevant to each other, whereas one document containing this, a, and what and the other not containing them does not affect their relevance.
So to judge the relationship between documents, we first find which terms matter most to it, such as search, Lucene, and full-text, and then judge the relationship between those terms.
Finding how important a term is to a document is called computing the term weight (Term weight).
Computing a term weight takes two parameters: the term (Term) and the document (Document).
The term weight expresses how important the term is within the document. The more important the term, the larger its weight, and the bigger the part it plays in computing relevance between documents.
Judging the relationship between terms to obtain document relevance uses an algorithm called the vector space model (Vector Space Model).
Let us analyse these two processes in detail:
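1. Computing the term weight (Term weight).
Two main factors affect how important a term is in a document:
Term Frequency (tf): how many times the term occurs in this document. The larger the tf, the more important the term.
Document Frequency (df): how many documents contain the term. The larger the df, the less important the term.
Is that easy to follow? The more often a term occurs in a document, the more important it is to that document. The word "search", for example, occurs many times in this document, which shows that the document is mainly about search. But in an English document, does "this" occurring even more often make it more important? No, and the second factor corrects for this: the more documents contain a term, the more ordinary the term is and the less it can distinguish between those documents, so the less important it is.
The same goes for the skills programmers learn. For an individual programmer, the deeper the mastery of a skill the better (deeper mastery means more time spent on it, so tf is larger), and the more competitive the programmer is when job hunting. Across all programmers, however, the fewer people who know the skill the better (few people knowing it means df is small), and again the more competitive the programmer is. That a person's value lies in being irreplaceable is the same principle.
With the reasoning clear, let us look at the formula: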
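A commonly used form of such a weight is the tf-idf product below. It is given here as an assumption about the kind of formula meant, and the symbol n (the total number of documents in the collection) is introduced for the example.

```latex
w_{t,d} = tf_{t,d} \times \log\left(\frac{n}{df_t}\right)
```

Here w_{t,d} is the weight of term t in document d, tf_{t,d} is the number of times t occurs in d, and df_t is the number of documents containing t. The weight grows with tf and shrinks as df approaches n, which matches the two factors described above.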
This is only a simple, typical form of the term weight formula. Anyone implementing a full-text search system will have their own variant, and Lucene's differs slightly from this one.
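2. Judging the relationship between terms to obtain document relevance, that is, the vector space model (VSM) algorithm.
We treat a document as a series of terms (Term), each with a weight (Term weight); different terms influence the document's relevance score according to their weight in the document.
So we treat the weights of all the terms in a document as a vector.
Document = {term1, term2, ..., term N}
Document Vector = {weight1, weight2, ..., weight N}
Likewise, we treat the query as a simple document and represent it as a vector too.
Query = {term1, term2, ..., term N}
Query Vector = {weight1, weight2, ..., weight N}
We place all the retrieved document vectors and the query vector in an N-dimensional space, where each term is one dimension.
As shown in the figure:
We take it that the smaller the angle between two vectors, the greater the relevance.
So we use the cosine of the angle as the relevance score: the smaller the angle, the larger the cosine, the higher the score, and the greater the relevance.
One might ask: a query is usually short and contains few terms, so the query vector has few dimensions, while a document is long and contains many terms, so the document vector has many dimensions. Why do both have N dimensions in the figure?
Since both are placed in the same vector space, they naturally have the same number of dimensions. Where their terms differ, we take the union of the two, and a term that one of them does not contain gets a weight (Term Weight) of 0.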
The relevance score is computed as follows:
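The standard cosine of the angle between the query vector and a document vector, assumed here to be the formula meant, can be written as below. The symbols w_{i,q} and w_{i,d}, the weights of term i in the query q and the document d, are notation introduced for the example.

```latex
\mathrm{score}(q, d)
  = \frac{\vec{V}_q \cdot \vec{V}_d}{\lVert \vec{V}_q \rVert \, \lVert \vec{V}_d \rVert}
  = \frac{\sum_{i=1}^{N} w_{i,q} \, w_{i,d}}
         {\sqrt{\sum_{i=1}^{N} w_{i,q}^{2}} \; \sqrt{\sum_{i=1}^{N} w_{i,d}^{2}}}
```

The numerator rewards terms that carry weight in both vectors, and the denominator normalizes away the length of each vector.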
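For example, suppose the query has 11 terms and three documents are retrieved. Their respective weights (Term weight) are given in the table below.
Computing from these, the relevance scores of the three documents against the query are:
So document 2 has the highest relevance and is returned first, followed by document 1, and finally document 3.
At this point, we can find the documents we want most.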
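The scoring and ranking just described can be sketched with sparse vectors stored as dicts. The weights below are hypothetical and are not the values from the article's table; the function name `cosine_score` is likewise an assumption.

```python
import math


def cosine_score(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    # Use the union of terms as dimensions; a missing term has weight 0.
    terms = set(query_weights) | set(doc_weights)
    dot = sum(query_weights.get(t, 0.0) * doc_weights.get(t, 0.0)
              for t in terms)
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


# Hypothetical weights, not the values from the article's table.
query = {"lucene": 1.0, "search": 1.0}
docs = {
    "doc1": {"lucene": 0.5, "hadoop": 0.5},
    "doc2": {"lucene": 0.8, "search": 0.6},
    "doc3": {"hadoop": 1.0},
}
ranked = sorted(docs, key=lambda d: cosine_score(query, docs[d]), reverse=True)
print(ranked)
```

Storing only the non-zero weights and taking the union of terms at scoring time is one way to handle query and document vectors that mention different terms.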
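After all this, we still have not reached Lucene itself; this is only the basic theory of information retrieval (Information retrieval). But once we have looked at Lucene, we will find that it is a basic implementation of this basic theory. So later articles analysing Lucene will often show how the theory above is applied in Lucene.
Before moving on to Lucene, here is a summary of the index creation and search processes described above, as shown in the figure: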
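1. Indexing process:
1) There is a set of files to be indexed.
2) The files go through syntactic analysis and language processing to form a series of terms (Term).
3) Index creation produces the dictionary and the inverted index.
4) Index storage writes the index to disk.
2. Search process:
a) The user enters a query.
b) The query goes through syntactic analysis and language processing to yield a series of terms (Term).
c) Syntactic analysis yields a query tree.
d) Index storage reads the index into memory.
e) The query tree is used to search the index, obtaining the document list for each term; the document lists are intersected, subtracted, and unioned to obtain the result documents.
f) The result documents are sorted by their relevance to the query.
g) The query results are returned to the user.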