Full-Text Search Engine Solr Series: Solr Core Concepts and Configuration Files
Document
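A Document is the most basic unit that Solr indexes (as a verb, indexing) and searches. It is similar to a row in a relational database table and can contain one or more fields (Field), each of which has a name and a text value. A field can be stored in the index at the same time as it is indexed, so that its value can be returned at search time. A document should normally include an id field that uniquely identifies it. For example: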
<doc>
  <field name="id">company123</field>
  <field name="companycity">Atlanta</field>
  <field name="companystate">Georgia</field>
  <field name="companyname">Code Monkeys R Us, LLC</field>
  <field name="companydescription">we write lots of code</field>
  <field name="lastmodified">2013-06-01T15:26:37Z</field>
</doc>
Schema
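The Schema in Solr is similar to a table structure in a relational database. It lives as a text file, schema.xml, in the conf directory, and a Schema must be specified when documents are added to the index. The schema file has three main parts: fields (Field), field types (FieldType), and the unique key (uniqueKey).
Field type (FieldType): defines the type of the fields (Field) in the XML documents added to the index, such as int, string, and date.
Field (Field): the name of a field as it is added to the index.
Unique key (uniqueKey): the field (Field) that uniquely identifies each document; it is used when updating and deleting documents.
For example: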
<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <uniqueKey>id</uniqueKey>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</schema>
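As a rough illustration of how the schema above is used, the following update messages add a document and then delete it by its unique key. This is only a sketch: the id and title values are made up for the example. Because title is declared multiValued="true", the field may appear more than once in a single document. Because id is the uniqueKey, posting another document with the same id replaces the existing one rather than creating a duplicate.

```xml
<!-- Hypothetical document matching the example schema -->
<add>
  <doc>
    <field name="id">doc-001</field>
    <!-- title is multiValued, so it may be repeated -->
    <field name="title">Solr core concepts</field>
    <field name="title">Solr configuration files</field>
  </doc>
</add>

<!-- Sent as a separate update request: delete by uniqueKey -->
<delete>
  <id>doc-001</id>
</delete>
```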
Field
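In Solr, a field (Field) is the basic unit that makes up a Document, and it corresponds to a column in a database table. A field definition is metadata: a name, a type, and instructions for how the field's value should be handled. For example: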
<field name="name" type="text_general"
indexed="true" stored="true"/>
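indexed: when indexed="true", Solr processes the field and adds it to the index. Only indexed fields can be searched.
stored: when stored="true", a copy of the original value is kept in the index and can be returned by search components. For performance reasons, long text is not well suited to being stored in the index.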
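The two attributes are independent, so they can be combined according to how a field is actually used. A minimal sketch, assuming the string and text_general types from the schema example; the field names are hypothetical and not part of the original schema:

```xml
<!-- Searchable, but the original text is not kept in the index; suits long bodies of text -->
<field name="article_body" type="text_general" indexed="true" stored="false"/>

<!-- Returned with search results, but not searchable itself -->
<field name="cover_image_url" type="string" indexed="false" stored="true"/>
```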
Field Type
Every field in Solr has a corresponding field type, such as float, long, double, date, or text. Solr ships with a rich set of field types, and we can also define custom types to suit our own data. For example:
<!-- IK analyzer -->
<fieldType name="text_cn_stopword" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="true"/>
  </analyzer>
</fieldType>
<!-- IK analyzer -->
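Solrconfig
If the Schema is Solr's Model, then Solrconfig (solrconfig.xml) is Solr's Configuration. It defines how Solr handles requests such as indexing, highlighting, and searching, and it also specifies the caching strategy. The most commonly used elements include:
Specifying the index data path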
<!-- Used to specify an alternate directory to hold all index data
     other than the default ./data under the Solr home. If
     replication is in use, this should match the replication
     configuration. -->
<dataDir>${solr.data.dir:./solr/data}</dataDir>
Cache parameters
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>

<!-- queryResultCache caches results of searches - ordered lists of
     document ids (DocList) based on a query, a sort, and the range
     of documents requested. -->
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

<!-- documentCache caches Lucene Document objects (the stored fields
     for each document). Since Lucene internal document ids are
     transient, this cache will not be autowarmed. -->
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
Request handlers
A request handler receives an HTTP request, runs the search, and returns the response. For example, the handler for query requests:
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
  </lst>
</requestHandler>
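Each request handler has a set of configurable search parameters, such as wt (the response format), indent (pretty-printed output), and df (the default search field).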
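To illustrate how such defaults are declared, here is a sketch of a second handler. The handler name /companies and the chosen values are assumptions made for this example. rows and fl are standard Solr query parameters (the number of results to return and the list of fields to return), and the field names come from the company document shown earlier.

```xml
<!-- Hypothetical handler: the name and default values are illustrative only -->
<requestHandler name="/companies" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="indent">true</str>
    <!-- search companyname when the query names no field -->
    <str name="df">companyname</str>
    <!-- return 20 results unless the request says otherwise -->
    <int name="rows">20</int>
    <!-- only return these stored fields -->
    <str name="fl">id,companyname,companycity</str>
  </lst>
</requestHandler>
```

Values under defaults apply only when the request does not supply the parameter itself, so a request such as /companies?q=code&rows=5 would still be limited to five results.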
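Full-Text Search Engine Solr Series: Integrating the Chinese Word Segmentation Component mmseg4j
The tokenizers that Solr provides by default handle Chinese poorly. For example, when the sentence "VIM比作是编辑器之神" ("VIM is likened to the god of editors") is indexed using the FieldType "text_general", the tokenization result is: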

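It splits the sentence into individual characters. You can imagine how poor the search experience would be if a whole article were tokenized this way. Many Chinese word segmentation components can be integrated with Solr, such as mmseg4j, IKAnalyzer, and ICTCLAS, each with its own strengths. This article describes how to integrate Solr with mmseg4j. The latest version of mmseg4j at the time of writing is 1.9.1. Download and unpack it, then take three files from it: mmseg4j-analysis-1.9.1.jar, mmseg4j-core-1.9.1.jar, and mmseg4j-solr-1.9.1.jar. Copy them into the directory E:\solr-4.8.0\example\solr-webapp\webapp\WEB-INF\lib, then edit the configuration file schema.xml and add the following two snippets:
fieldType: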
<!-- mmseg4j-->
<fieldType name="text_mmseg4j_complex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
  </analyzer>
</fieldType>
<fieldType name="text_mmseg4j_maxword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
  </analyzer>
</fieldType>
<fieldType name="text_mmseg4j_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!--
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="n:/OpenSource/apache-solr-1.3.0/example/solr/my_dic"/>
    -->
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="dic"/>
  </analyzer>
</fieldType>
<!-- mmseg4j-->
The fields that correspond to these fieldTypes:
<!-- mmseg4j -->
<field name="mmseg4j_complex_name" type="text_mmseg4j_complex" indexed="true" stored="true"/>
<field name="mmseg4j_maxword_name" type="text_mmseg4j_maxword" indexed="true" stored="true"/>
<field name="mmseg4j_simple_name" type="text_mmseg4j_simple" indexed="true" stored="true"/>
<!-- mmseg4j -->
That completes the configuration. Restart the service with java -jar start.jar, then see how mmseg4j tokenizes: open the Solr admin UI and click the Analysis page on the left.

Compared with the earlier result, the tokenization is much improved and now roughly follows the normal meaning of the sentence. While running the analysis you may hit this problem:
TokenStream contract violation: reset()/close() call
missing, reset() called multiple times, or subclass
does not call super.reset(). Please see Javadocs of
TokenStream class for more information about the correct
consuming workflow.
This is a bug in mmseg4j under Solr 4.8, caused by mmseg4j-analysis-1.9.1.jar. Fixing it requires a change to the source code. Find the file mmseg4j-1.9.1\mmseg4j-analysis\src\main\java\com\chenlb\mmseg4j\analysis\MMSegTokenizer.java and add super.reset():
public void reset() throws IOException {
    // lucene 4.0
    // org.apache.lucene.analysis.Tokenizer.setReader(Reader)
    // setReader is called automatically, and input is set automatically.
    super.reset();  // add this line
    mmSeg.reset(input);
}
After making the change, rebuild with Maven: mvn clean package -DskipTests. Replace the original file with the newly built mmseg4j-1.9.1\mmseg4j-analysis\target\mmseg4j-analysis-1.9.2-SNAPSHOT.jar and restart the service; the error should then be gone.
In mmseg4j 1.9.1 the dictionaries are all packaged inside the jar file, so there is no need to specify the dictionary files (chars.dic, units.dic, words.dic) separately. You can still override them: just put the replacement files in WEB-INF\data\.
Now add two Chinese documents to the index to see how well mmseg4j works:
<add>
  <doc>
    <field name="id">0001</field>
    <field name="mmseg4j_complex_name">把Emacs比作是神的编辑器，VIM比作是编辑器之神，
2012年开始接触VIM，一直沿用至今，也曾今总结过VIM的相关知识，
文章都整理在以前的ITeye博客和GitHub，这款古而不老的编辑器至今仍受众多程序员追捧，
当然我也是忠实的VIM用户，这篇文章就是用VIM编辑完成。</field>
  </doc>
  <doc>
    <field name="id">0002</field>
    <field name="mmseg4j_complex_name">用Google搜索"Python IDE"，
第一条就是stackoverflow上一个非常热门的问题：
"what IDE to use for Python"，上百种编辑器的功能对比图让人眼花缭乱。
其中有我接触过的几款编辑器（IDE）包括：Eclilpse(PyDev)、VIM、NotePad++、PyCharm。
如果你的日常开发语言是Python的话，再搜索"python vim"，大约有328万条结果，
可见用VIM做Python开发的程序员那是相当之多，我大概总结的几点原因，当然不一定正确</field>
  </doc>
</add>
Save this as a UTF-8 encoded file named mmseg4j-solr-demo-doc.xml and add it to Solr:
E:\solr-4.8.0\example\exampledocs>java -jar post.jar mmseg4j-solr-demo-doc.xml
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
POSTing file mmseg4j-solr-demo-doc.xml
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update..
Time spent: 0:00:01.055
Check the search results:
