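Full-Text Search Engine Solr Series: Getting Started
Solr is an open-source enterprise platform for full-text indexing and search, built around the Lucene search library. It exposes REST-style HTTP/XML and JSON APIs. If you are new to Solr, follow along and get started with me. This tutorial uses Solr 4.8 as the test environment, which requires JDK 1.7 or later.
Preparation
This article assumes at least a beginner-to-intermediate level of Java, so it does not cover setting up the Java environment. Download and unpack Solr; the example directory contains the file start.jar. Start the server from that directory with java -jar start.jar.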
Indexing data
Right after the server starts, the admin UI shows no data. You can add (update) and delete documents by POSTing commands to Solr. The exampledocs directory contains some sample files; run this command:
java -jar post.jar solr.xml monitor.xml
The command adds two documents to Solr. Open the two files to see what they contain. The content of solr.xml is:
<add>
<doc>
  <field name="id">SOLR1000</field>
  <field name="name">Solr, the Enterprise Search Server</field>
  <field name="manu">Apache Software Foundation</field>
  <field name="cat">software</field>
  <field name="cat">search</field>
  <field name="features">Advanced Full-Text Search Capabilities using Lucene</field>
  <field name="features">Optimized for High Volume Web Traffic</field>
  <field name="features">Standards Based Open Interfaces - XML and HTTP</field>
  <field name="features">Comprehensive HTML Administration Interfaces</field>
  <field name="features">Scalability - Efficient Replication to other Solr Search Servers</field>
  <field name="features">Flexible and Adaptable with XML configuration and Schema</field>
  <field name="features">Good unicode support: héllo (hello with an accent over the e)</field>
  <field name="price">0</field>
  <field name="popularity">10</field>
  <field name="inStock">true</field>
  <field name="incubationdate_dt">2006-01-17T00:00:00.000Z</field>
</doc>
</add>
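This adds one document to the index; documents are the data source that searches run against. You can now search for the keyword "solr" from the query page of the admin UI.

After you click the Execute Query button at the bottom of the page, the query result appears on the right. It is the solr.xml document you just imported, shown in JSON format. Solr supports a rich query syntax. For example, to search for the keyword "Search" in the name field, use name:search. Searching for name:xxx returns nothing, because no document contains that text.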
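Importing data
There are many ways to import data into Solr:
Use DIH (DataImportHandler) to import data from a database
Import CSV files, which also makes Excel data easy to bring in
Import documents in JSON format
Import binary documents such as Word and PDF
Write a custom import programmatically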
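Updating data
What happens if the same document, solr.xml, is imported more than once? Solr uses the id field to identify each document uniquely. If an imported document has an id that already exists in Solr, the existing document is automatically replaced by the newly imported one with the same id. Try it yourself and watch how these values in the admin UI change before and after the replacement: Num Docs, Max Doc, Deleted Docs.
numDocs: the number of documents currently in the index. It can be larger than the number of XML files, because one XML file may contain several <doc> elements.
maxDoc: can be larger than numDocs. For example, maxDoc increases after you post the same file again.
deletedDocs: a re-posted file replaces the old document, and deletedDocs increases by 1. This is only a logical deletion; the old document has not yet been physically removed from the index.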
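The way these three counters move can be mimicked with a toy model. The sketch below is not Solr's implementation: the TinyIndex class and its attribute names are hypothetical, and it assumes that maxDoc equals numDocs plus deletedDocs until the logically deleted documents are physically purged.

```python
class TinyIndex:
    """Toy model of add-by-id semantics; not how Solr is actually implemented."""

    def __init__(self):
        self.live = {}     # id -> document currently visible to searches
        self.deleted = 0   # documents marked deleted but not yet purged

    def add(self, doc):
        # Re-adding an existing id marks the old copy as deleted
        # and stores the new copy in its place.
        if doc["id"] in self.live:
            self.deleted += 1
        self.live[doc["id"]] = doc

    @property
    def num_docs(self):
        return len(self.live)

    @property
    def max_doc(self):
        # Assumption: live documents plus logically deleted ones.
        return self.num_docs + self.deleted


index = TinyIndex()
doc = {"id": "SOLR1000", "name": "Solr, the Enterprise Search Server"}
index.add(doc)
index.add(doc)  # posting the same document a second time
print(index.num_docs, index.max_doc, index.deleted)
```

After the second add, the model reports one live document, a maxDoc of two, and one deleted document, which is the pattern described above.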
Deleting data
Delete a specific document by id, or delete all documents that match a query:
java -Ddata=args -jar post.jar "<delete><id>SOLR1000</id></delete>"
java -Ddata=args -jar post.jar "<delete><query>name:DDR</query></delete>"
The solr.xml document is now removed from the index, and searching for "solr" no longer returns it. Solr also has a notion of transactions, as databases do. The delete commands above commit automatically, so the document disappears from the index immediately. You can also set commit to false and commit manually:
java -Ddata=args -Dcommit=false -jar post.jar "<delete><id>3007WFP</id></delete>"
After this command the document is not really deleted yet and still shows up in searches. Once you send a commit (for example, the XML update message <commit/>), the document is removed for good. Now re-import the files you just deleted so that we can continue.
Delete all data:
http://localhost:8983/solr/collection1/update?stream.body=<delete><query>*:*</query></delete>&commit=true
Delete specific data:
http://localhost:8983/solr/collection1/update?stream.body=<delete><query>title:abc</query></delete>&commit=true
Delete with multiple conditions:
http://localhost:8983/solr/collection1/update?stream.body=<delete><query>title:abc AND name:zhang</query></delete>&commit=true
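The URLs above contain characters such as <, > and spaces that should be percent-encoded when sent from a program rather than pasted into a browser. The sketch below only builds the encoded URL and sends nothing; the helper name delete_by_query_url is hypothetical, and the host and collection name are taken from the examples above.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

BASE = "http://localhost:8983/solr/collection1/update"


def delete_by_query_url(query, commit=True):
    """Return an update URL whose stream.body deletes documents matching `query`."""
    # escape() protects &, < and > inside the query from breaking the XML.
    body = "<delete><query>{}</query></delete>".format(escape(query))
    params = {"stream.body": body, "commit": "true" if commit else "false"}
    # urlencode() percent-encodes the XML so it is safe inside a URL.
    return BASE + "?" + urlencode(params)


url = delete_by_query_url("title:abc AND name:zhang")
print(url)
```

Passing commit=False produces the manual-commit variant described earlier.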
Querying data
Queries are sent as HTTP GET requests. The search keywords go in the q parameter, and many optional parameters control what is returned. For example, fl selects the fields to return: with fl=name, the response contains only the name field.
http://localhost:8983/solr/collection1/select?q=solr&fl=name&wt=json&indent=true
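Sorting
Solr sorts results according to the sort parameter, which supports ascending order, descending order, and sorting on several fields:
q=video&sort=price desc
q=video&sort=price asc
q=video&sort=inStock asc, price desc
By default, Solr sorts by score in descending order. The score is a number computed for each matching document from its relevance to the query.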
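The query parameters described so far (q, fl, sort, wt, indent) can be assembled programmatically. This is a minimal sketch that only builds the request URL; select_url is a hypothetical helper, and the base URL is the one used in this tutorial.

```python
from urllib.parse import urlencode

SELECT = "http://localhost:8983/solr/collection1/select"


def select_url(q, fl=None, sort=None):
    """Build a select URL; fl and sort are added only when given."""
    params = {"q": q, "wt": "json", "indent": "true"}
    if fl:
        params["fl"] = fl      # restrict the returned fields
    if sort:
        params["sort"] = sort  # e.g. "price desc"
    return SELECT + "?" + urlencode(params)


print(select_url("solr", fl="name"))
print(select_url("video", sort="inStock asc, price desc"))
```

Spaces and commas in the sort clause are encoded by urlencode, so multi-field sorts can be passed as ordinary strings.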
Highlighting
Web search pages often highlight the matched keywords to make results stand out. Solr supports this well; you only need to specify these parameters:
hl=true  # turn highlighting on
hl.fl=name  # the field(s) to highlight
http://localhost:8983/solr/collection1/select?q=Search&wt=json&indent=true&hl=true&hl.fl=features
The response contains:
"highlighting":{
  "SOLR1000":{
    "features":["Advanced Full-Text <em>Search</em> Capabilities using Lucene"]
  }
}
Text analysis
Text fields are indexed by splitting the text into words and applying various transformations (such as lowercasing, plural removal, and stemming). The schema.xml file defines the fields in the index and the kind of analysis applied to each of them.
By default, a search for "power-shot" does not match "powershot". Edit schema.xml (in the solr/example/solr/collection1/conf directory) and change the features and text fields to the type "text_en_splitting" so that it does:
<field name="features" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
...
<field name="text" type="text_en_splitting" indexed="true" stored="false" multiValued="true"/>
After the change, restart Solr and re-import the documents.
Now these searches match:
power-shot -> Powershot
features:recharging -> Rechargeable
1 gigabyte -> 1G
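To get a feel for why "power-shot" can match "Powershot" after this change, here is a deliberately simplified sketch of splitting on punctuation and also indexing the joined form. It is not Solr's text_en_splitting analysis chain; the analyze and matches functions are hypothetical and ignore stemming and synonyms.

```python
import re


def analyze(text):
    """Return lowercased word parts plus the concatenation of split parts."""
    tokens = set()
    for word in text.lower().split():
        parts = [p for p in re.split(r"[^a-z0-9]+", word) if p]
        tokens.update(parts)
        if len(parts) > 1:
            tokens.add("".join(parts))  # "power-shot" also yields "powershot"
    return tokens


def matches(query, indexed_text):
    """Toy rule: a match means the two token sets share at least one token."""
    return bool(analyze(query) & analyze(indexed_text))


print(matches("power-shot", "Powershot"))
print(matches("power-shot", "PowerBook"))
```

The first call finds the shared token "powershot"; the second finds no shared token.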
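Summary
As an introductory article, this one deliberately introduces few concepts. From installation to deployment and document updates, you should now have a first, hands-on feel for Solr. The next article covers the basic principles of full-text search.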
Full-Text Search Engine Solr Series: Basic Principles of Full-Text Search
Scenario: we all used the Xinhua Dictionary as children. Suppose your mother asks you to open page 38 and find where "坑爹" appears. How would you look? No doubt your eyes would scan page 38 from the first character to the last until you found the two characters "坑爹". This method is called sequential scanning, and it is good enough for small amounts of data. But if your mother asks which page the character "坑" is on, scanning character by character from page one would be a real trap. This is where you need an index. The index records which page "坑" is on: find "坑" in the index, read off the page number, and you have the answer. Looking up "坑" in the index is fast, because you know its radical and can jump straight to the character.
So how is the table of contents (the index) of the Xinhua Dictionary compiled? Without its index, the dictionary is just a pile of unstructured data. But people are good at noticing patterns: every character corresponds to a page number, for example "坑" is on page 38 and "爹" is on page 90. So they extract this information and build structured data from it, similar to a database table:
word    page_no
---------------
坑      38
爹      90
...     ...
This forms a complete table of contents (an index), which makes lookups very convenient. Full-text search works on a similar principle and comes down to two processes: 1. index creation (Indexing) and 2. searching the index (Search). So how is an index actually created? What is stored in it? And how is the index consulted during a search? Keep these questions in mind as you read on.

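Index
Solr/Lucene uses an inverted index. An inverted index is a mapping from keywords to documents; an index that stores this mapping is called an inverted index.

On the left, it stores a sequence of strings (the keywords).
On the right, for each string, it stores a linked list of the IDs of the documents (Document) that contain it, called the posting list (Posting List).
The string list and the document ID lists together form a dictionary. If we search for "lucene", the index tells us directly that the documents containing "lucene" are 2, 3, 10, 35, 92, with no need to check every document in the collection one by one. To find documents that contain both "lucene" and "solr", take the intersection of the two corresponding posting lists: 3, 10, 35, 92.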
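Intersecting two posting lists is cheap when both are sorted by document ID. The sketch below uses a standard two-pointer merge. The posting list for "lucene" is the one given above; the list for "solr" is hypothetical, chosen only so that the intersection reproduces the result stated in the text.

```python
def intersect(a, b):
    """Intersect two sorted posting lists with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


index = {
    "lucene": [2, 3, 10, 35, 92],
    "solr": [1, 3, 10, 35, 92, 104],  # hypothetical posting list
}
print(intersect(index["lucene"], index["solr"]))  # [3, 10, 35, 92]
```

Each list is walked only once, so the cost grows with the combined length of the two lists rather than with the size of the whole document collection.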
Index creation
Suppose we have the following two original documents:
Document 1: Students should be allowed to go out with their friends, but not allowed to drink beer.
Document 2: My friend Jerry went to school to see his students but found them drunk which is not allowed.
Index creation roughly consists of the following steps:

Step 1: Pass the original documents to the tokenizer (Tokenizer)
The tokenizer does the following things (the process is called Tokenize), and its output is a list of tokens (Token):
Split the document into individual words
Remove punctuation
Remove stop words (stop word)
A stop word is a word that carries no specific meaning in a language and is therefore rarely used as a search keyword. Dropping stop words reduces the size of the index. English stop words include "the", "a", and "this"; Chinese stop words include "的" and "得".
Tokenizers for different languages each have their own set of stop words.
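The three tokenizer tasks above can be sketched in a few lines. This is an illustration rather than Lucene's tokenizer: the STOP_WORDS set is an assumption, chosen so that the two example documents produce the tokens used in the rest of this walkthrough.

```python
import re

# Assumed stop-word list, picked to fit the two example documents.
STOP_WORDS = {"should", "be", "to", "out", "with", "but", "not", "which", "is"}


def tokenize(text):
    """Split into words, drop punctuation, and remove stop words."""
    words = re.findall(r"[A-Za-z]+", text)  # keeps letters only
    return [w for w in words if w.lower() not in STOP_WORDS]


doc1 = ("Students should be allowed to go out with their friends, "
        "but not allowed to drink beer.")
doc2 = ("My friend Jerry went to school to see his students "
        "but found them drunk which is not allowed.")

print(tokenize(doc1))
print(tokenize(doc2))
```

Note that the tokens keep their original case at this stage; lowercasing happens in the language processing step.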
The output of the tokenizer is called tokens (Token). For the example above, we get the following tokens:
"Students", "allowed", "go", "their", "friends", "allowed",
"drink", "beer", "My", "friend", "Jerry", "went", "school",
"see", "his", "students", "found", "them", "drunk", "allowed"
Step 2: Pass the tokens (Token) to the linguistic processor (Linguistic Processor)
The linguistic processor applies language-specific processing to the tokens it receives.
For English, the linguistic processor usually does the following:
Convert to lowercase (Lowercase).
Reduce words to their root form, for example "cars" to "car". This operation is called stemming.
Transform words into their root form, for example "drove" to "drive". This operation is called lemmatization.
The output of the linguistic processor is called terms (Term). In the example, the terms obtained after linguistic processing are:
"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend",
"jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".
After linguistic processing, a search for drive can also find drove. Similarities and differences between stemming and lemmatization:
What they have in common:
Both stemming and lemmatization turn a word into its root form.
How they differ in approach:
Stemming works by "reduction": "cars" to "car", "driving" to "drive".
Lemmatization works by "transformation": "drove" to "drive", "driving" to "drive".
How they differ in algorithm:
Stemming mainly applies a fixed algorithm to do the reduction, such as removing "s", removing "ing" and adding "e", turning "ational" into "ate", and turning "tional" into "tion".
Lemmatization mainly relies on a dictionary stored in some agreed format.
For example, the dictionary contains the mappings "driving" to "drive", "drove" to "drive", and "am, is, are" to "be"; to transform a word, just look it up and convert it as the dictionary prescribes.
Stemming and lemmatization are not mutually exclusive; they overlap, and for some words both approaches produce the same result.
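The contrast between the two approaches is easy to see in code: stemming is a set of suffix rules, while lemmatization is a dictionary lookup. The sketch below is a toy, not the Porter stemmer or any real lemmatizer; the LEMMAS dictionary and the two suffix rules are assumptions that cover just the example tokens.

```python
# Dictionary-based lemmatization: irregular forms are looked up directly.
LEMMAS = {"went": "go", "found": "find", "drunk": "drink",
          "drove": "drive", "driven": "drive"}


def stem(word):
    """Tiny rule-based stemmer: strips 'ed' and a plural 's' only."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def to_term(token):
    """Lowercase, then lemmatize if the word is in the dictionary, else stem."""
    word = token.lower()
    return LEMMAS.get(word, stem(word))


tokens = ["Students", "allowed", "go", "their", "friends", "allowed",
          "drink", "beer", "My", "friend", "Jerry", "went", "school",
          "see", "his", "students", "found", "them", "drunk", "allowed"]
terms = [to_term(t) for t in tokens]
print(terms)
```

Applied to the tokens from step 1, this reproduces the term list shown above. A real stemmer needs many more rules and exceptions than these two.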
Step 3: Pass the terms (Term) to the indexer (Indexer)
Use the terms to build a dictionary:
Term      Document ID
student   1
allow     1
go        1
their     1
friend    1
allow     1
drink     1
beer      1
my        2
friend    2
jerry     2
go        2
school    2
see       2
his       2
student   2
find      2
them      2
drink     2
allow     2
Sort the dictionary alphabetically:
Term      Document ID
allow     1
allow     1
allow     2
beer      1
drink     1
drink     2
find      2
friend    1
friend    2
go        1
go        2
his       2
jerry     2
my        2
school    2
see       2
student   1
student   2
their     1
them      2
Merge identical terms into posting lists (Posting List), one linked list of documents per term. Each list records two numbers:

Document Frequency: the number of documents in which the term appears
Frequency: the term frequency, that is, how many times the term appears in a given document
Take the term "allow": two documents contain it in total, so its document list has two entries. The first entry is the first document containing "allow", document 1, in which "allow" appears 2 times. The second entry is the second document containing "allow", document 2, in which "allow" appears 1 time.
Index creation is now complete. A search for "drive" will also find "driving", "drove", and "driven", because in the index all three are turned into "drive" by linguistic processing. At search time, if you type "driving", the query goes through the same tokenizer and linguistic processor steps and becomes a query for "drive", so the documents you want are found.
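The sort-and-merge procedure of step 3 can be condensed into a short sketch. It starts from the terms of the two example documents and builds, for each term, a mapping from document ID to term frequency; the document frequency is then the number of entries in that mapping. The build_index function and the dict-based layout are illustrative assumptions, not Lucene's on-disk format.

```python
from collections import Counter, defaultdict

docs = {
    1: ["student", "allow", "go", "their", "friend", "allow", "drink", "beer"],
    2: ["my", "friend", "jerry", "go", "school", "see", "his", "student",
        "find", "them", "drink", "allow"],
}


def build_index(docs):
    """Return {term: {doc_id: term_frequency}} with terms in alphabetical order."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, freq in Counter(terms).items():
            index[term][doc_id] = freq
    return dict(sorted(index.items()))


index = build_index(docs)
for term, postings in index.items():
    # document frequency = number of documents in the posting list
    print(term, len(postings), postings)
```

For "allow" this yields a document frequency of 2, with a frequency of 2 in document 1 and 1 in document 2, matching the description above.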
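Search steps
Someone who searches for "microsoft job" hopes to find a job at Microsoft. If the result is "Microsoft does a good job at software industry…", that is far from what the user expected. How can we search sensibly and effectively and return the results users want most? Searching mainly consists of the following steps:
Step 1: Perform lexical analysis, syntax analysis, and language processing on the query
Lexical analysis: distinguish ordinary words from keywords in the query. In english and japan, for example, "and" is a keyword, while "english" and "japan" are ordinary words.
Syntax analysis: build a tree according to the grammar rules of the query syntax.

Language processing: handled the same way as at index time, for example leaned -> lean, driven -> drive.
Step 2: Search the index to get the set of documents that satisfy the syntax tree
Step 3: Sort the results by the relevance between the query and the documents
We treat the query as a document too and score (scoring) the relevance (relevance) between one document and another. The higher the score, the more relevant the document and the higher it ranks. Scores can also be influenced manually; Baidu search results, for example, are not necessarily ranked purely by relevance.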
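How do we judge the relevance between documents? A document consists of one or more terms (Term), for example "solr" and "tutorial". Different terms can differ in importance; here solr matters more than tutorial. If one document contains tutorial 10 times but solr only once, while another contains solr 4 times and tutorial once, the latter is very likely the result we want. This leads to the concept of term weight (Term weight).
The weight expresses how important a term is within a document. The more important the term, the higher its weight and the more it influences the relevance computation. The process of deriving document relevance from term weights is called the Vector Space Model algorithm (Vector Space Model).
Two factors mainly determine how important a term is in a document:
Term Frequency (tf): how often the term appears in this document. The larger tf is, the more important the term.
Document Frequency (df): how many documents contain the term. The larger df is, the less important the term.
What is rare is precious: something everyone has is not worth much, and only what few others have is truly valuable. The weight formula therefore rises with tf and falls as df grows.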
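One widely used weight of this kind is tf-idf, w(t, d) = tf(t, d) * log(n / df(t)), where n is the total number of documents. Treat this specific form as an assumption: it is a common textbook variant, not necessarily the exact formula Lucene applies. The sketch below uses it to check the solr/tutorial example; the collection size and document frequencies are hypothetical.

```python
import math


def term_weight(tf, df, n_docs):
    """tf-idf weight: grows with tf, shrinks toward 0 as df approaches n_docs."""
    return tf * math.log(n_docs / df)


N_DOCS = 100                       # hypothetical collection size
DF = {"solr": 5, "tutorial": 50}   # hypothetical document frequencies


def score(term_counts):
    """Sum of the weights of the query terms found in one document."""
    return sum(term_weight(tf, DF[t], N_DOCS) for t, tf in term_counts.items())


doc_a = {"tutorial": 10, "solr": 1}   # tutorial 10 times, solr once
doc_b = {"solr": 4, "tutorial": 1}    # solr 4 times, tutorial once
print(score(doc_a) < score(doc_b))    # True
```

With these numbers, the document that mentions the rarer term "solr" more often scores higher, even though the other document contains more query-term occurrences overall.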

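Vector space model
The weights of the terms in a document are treated as a vector:
Document = {term1, term2, ..., term N}
Document Vector = {weight1, weight2, ..., weight N}
The query is treated as a simple document and is also represented as a vector:
Query = {term1, term2, ..., term N}
Query Vector = {weight1, weight2, ..., weight N}
Place the vectors of the retrieved documents and the query vector in an N-dimensional space, with one dimension per term:

The smaller the angle between two vectors, the more similar they are and the higher the relevance.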
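The angle between a query vector and a document vector is usually compared through its cosine: a cosine close to 1 means a small angle. The sketch below computes cosine similarity for plain Python lists; the three-dimensional weight vectors are hypothetical values chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


query = [1.0, 1.0, 0.0]   # hypothetical weights, one dimension per term
doc1 = [2.0, 2.0, 0.0]    # same direction as the query
doc2 = [0.0, 1.0, 3.0]    # mostly about a term the query lacks

ranked = sorted([("doc1", doc1), ("doc2", doc2)],
                key=lambda item: cosine(query, item[1]), reverse=True)
print([name for name, _ in ranked])   # ['doc1', 'doc2']
```

Because cosine similarity ignores vector length, doc1 ranks first even though its weights are simply a scaled copy of the query's.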