Monday, October 14, 2019

Solr Field Attributes

Comparing querying in Solr with querying in a SQL database, the mapping is as follows.

SQL query: 
select album,title,artist
 from hellosolr
 where album in ["solr","search","engine"] order by album DESC limit 20,10;

Solr query: 
$ curl "http://localhost:8983/solr/hellosolr/select?q=solr+search+engine&fl=album,title,artist&start=20&rows=10&sort=album+desc"
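The query parameters above must be URL-encoded before being sent to Solr. A minimal sketch of building the same request URL with Python's standard library (the core name hellosolr and field names are taken from the example above):

```python
from urllib.parse import urlencode

# Parameters mirroring the SQL query above; urlencode() handles
# the URL encoding (spaces become "+") that the raw curl command needs.
params = {
    "q": "solr search engine",
    "fl": "album,title,artist",
    "start": 20,
    "rows": 10,
    "sort": "album desc",
}
url = "http://localhost:8983/solr/hellosolr/select?" + urlencode(params)
print(url)
```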


If you want to search a field for multiple tokens, you need to surround it with parentheses:
  q=title:(to kill a mockingbird)&df=album
  q=title:(buffalo OR soldier) OR artist:(bob OR marley)


Query Operators 

The following are operators supported by query parsers:

OR: Union is performed and a document will match if any of the clauses is satisfied.
AND: Intersection is performed and a document will match only if both clauses are satisfied.
NOT: Excludes documents containing the clause.
+/-: Operators to mandate the occurrence of terms. + requires that the token occur in the document, and - requires that it not occur.
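The operators above are combined inside the q parameter as plain text. A small helper (hypothetical, not part of any Solr client library) that joins clauses with a boolean operator:

```python
# Build a fielded boolean clause such as title:(buffalo OR soldier).
def boolean_query(field, terms, op="OR"):
    return f"{field}:({f' {op} '.join(terms)})"

q = (
    boolean_query("title", ["buffalo", "soldier"])
    + " OR "
    + boolean_query("artist", ["bob", "marley"])
)
print(q)  # title:(buffalo OR soldier) OR artist:(bob OR marley)
```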

Phrase Query  ""

Performs an exact phrase search:
   q="bob marley"

Proximity Query  ~ INT

A proximity query is a phrase query followed by the tilde (~) operator and a numeric distance, which specifies how far apart the terms may appear.
  q="jamaican singer"~3

Wildcard Query ? *

You can specify the wildcard character ?, which matches exactly one character, or *, which matches zero or more characters.

   q=title:(bob* OR mar?ey) OR album:(*bird)    


Range Query [] {}

q=price:[1000 TO 5000]         // 1000 <= price <= 5000
q=price:{1000 TO 5000}       // 1000 < price < 5000
q=price:[1000 TO 5000}       // 1000 <= price < 5000
q=price:[1000 TO *]             // 1000 <= price
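The bracket convention above ([ ] inclusive, { } exclusive, mixable per bound) can be captured in a small hypothetical helper:

```python
# Format a Solr range query; square brackets mark inclusive bounds,
# curly braces mark exclusive ones, and "*" is an open bound.
def range_query(field, low, high, inc_low=True, inc_high=True):
    open_b = "[" if inc_low else "{"
    close_b = "]" if inc_high else "}"
    return f"{field}:{open_b}{low} TO {high}{close_b}"

print(range_query("price", 1000, 5000))                  # price:[1000 TO 5000]
print(range_query("price", 1000, 5000, inc_high=False))  # price:[1000 TO 5000}
print(range_query("price", 1000, "*"))                   # price:[1000 TO *]
```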


Check Whether a Field Exists
For an existing field:

field:[* TO *]

For a missing field:
q=*:* -Tag_100_is:[* TO *]
 

Filter Query fq

A filter query restricts the result set before the main query is scored; here, only the documents that have language=english and genre=rock are selected.
If multiple fq parameters are specified, the query parser will select the subset of documents that matches all the fq queries.

  q=singer:(bob marley) title:(redemption song)&fq=language:english&fq=genre:rock   

q=product:hotel&fq=city:"las vegas" AND category:travel
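When building such a request programmatically, repeating the fq key yields multiple filter queries. A sketch using the standard library, with a list of (key, value) pairs so fq can appear twice:

```python
from urllib.parse import urlencode

# Two fq entries produce two independent, cacheable filter queries.
params = [
    ("q", "singer:(bob marley)"),
    ("fq", "language:english"),
    ("fq", "genre:rock"),
]
qs = urlencode(params)
print(qs)
```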

rows=   &   start= 

Use start and rows together to get paginated search results.
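The mapping from a 1-based page number to these parameters is simple arithmetic; a hypothetical helper:

```python
# start is the zero-based offset of the first result on the page.
def page_params(page, rows=10):
    return {"start": (page - 1) * rows, "rows": rows}

print(page_params(3))  # {'start': 20, 'rows': 10}
```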

sort
 A comma-separated list of fields on which the results should be sorted. Each field name should be followed by the asc or desc keyword.
    sort=score desc,popularity desc  


fl 
Specifies the comma-separated list of fields to be displayed in the response.

wt
Specifies the format in which the response should be returned, such as JSON, XML, or CSV.

debugQuery 
This Boolean parameter is invaluable for analyzing how the query is parsed and how a document got its score. Debug operations are costly and should not be enabled on live production queries. This parameter currently supports only the XML and JSON response formats.

explainOther
explainOther is quite useful for analyzing documents that are not part of a debug explanation. debugQuery explains the score of documents that are part of the result set (if you specify rows=10, debugQuery will add an explanation for only those 10 documents). If you want an explanation for additional documents, you can specify a Lucene query in the explainOther parameter for identifying those additional documents. Remember, the explainOther query will select the additional document to explain, but the explanation will be with respect to the main query.







The Solr schema is saved inside the "managed-schema" file


Sample File Content

<schema name="default-config" version="1.6">
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="_version_" type="plong" indexed="false" stored="false"/>
    <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
    <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
</schema>



Solr Field Attributes Explained

name: each field must have a name
type: each field must have a type
indexed: whether the field will be searchable or not
stored: whether the field value will be visible and returned to the user or not
multiValued: whether the field holds a single value or an array of values
default: sets a default value if the field is missing
sortMissingLast: sorts documents with a missing value at the end
sortMissingFirst: sorts documents with a missing value at the beginning
required: Setting this attribute to true specifies a field as mandatory.

docValues: true means create a forward index for the field. Note: the inverted index is not efficient for sorting, faceting, and highlighting; docValues makes these operations faster and also frees up the fieldCache.

omitNorms: Fields have norms associated with them, which hold additional information such as the index-time boost and length normalization. Specifying omitNorms="true" discards this information, saving some memory. Length normalization allows Solr to give lower weight to longer fields. If length norms are not important in your ranking algorithm (such as for metadata fields) and you are not providing an index-time boost, you can set omitNorms="true". By default, Solr disables norms for primitive fields.



The following are possible combinations of indexed and stored parameters: 
indexed="true" & stored="true": 
When you are interested in both querying and displaying the value of a field.

indexed="true" & stored="false": 
When you want to query on a field but don’t need its value to be displayed. For example, you may want to only query on the extracted metadata but display the source field from which it was extracted.

indexed="false" & stored="true": 
If you are never going to query on a field and only display its value.
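These three combinations can be sketched as field definitions (the field names here are hypothetical examples, not part of any default schema):

```xml
<!-- query and display -->
<field name="title"     type="text_general" indexed="true"  stored="true"/>
<!-- query only: searchable, but the value is never returned -->
<field name="metadata"  type="text_general" indexed="true"  stored="false"/>
<!-- display only: returned in results, but never queried -->
<field name="thumbnail" type="string"       indexed="false" stored="true"/>
```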






Define a New Solr Type That Supports Arabic

  <field name="subjects" type="text_ar_sort" indexed="true" stored="true"/>

Define the new type "text_ar_sort":

  <fieldType name="text_ar_sort" class="solr.SortableTextField" sortMissingLast="true" docValues="true" positionIncrementGap="100" multiValued="false">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt" ignoreCase="true"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt" ignoreCase="true"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>



================================================

<analyzer type="index">
<analyzer type="query">



An analyzer defines a chain of processes, each of which performs a specific operation such as splitting on whitespace, removing stop words, adding synonyms, or converting to lowercase. The output of each of these processes is called a token. Tokens that are generated by the last process in the analysis chain (which either gets indexed or is used for querying) are called terms, and only indexed terms are searchable. Tokens that are filtered out, such as by stop-word removal, have no significance in searching and are totally discarded.

Example

<fieldType name="text_analysis" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.AsciiFoldingFilterFactory"/>
 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.PorterStemFilterFactory"/>
 <filter class="solr.TrimFilterFactory"/>
 </analyzer>
</fieldType>

Description

1. WhitespaceTokenizerFactory splits the text stream on whitespace. In English, whitespace separates words, so this tokenizer fits such text well. For unstructured text containing full sentences, a tokenizer that also splits on symbols would be a better fit; otherwise punctuation stays attached, producing tokens like "Cappuccino."
2. AsciiFoldingFilterFactory removes accents, which the user query or the content might contain.
3. StopFilterFactory removes common English words that carry little significance in context and only inflate recall.
4. LowerCaseFilterFactory normalizes the tokens to lowercase, without which the query term mockingbird would not match the term in the movie name.
5. PorterStemFilterFactory converts the terms to their base form, without which the tokens kill and kills would not have matched.
6. TrimFilterFactory finally trims the tokens.
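The chain above can be illustrated with a toy simulation in plain Python (this is not Solr code; the stop-word list and the trailing-"s" stemmer are crude stand-ins for StopFilterFactory and PorterStemFilterFactory):

```python
STOPWORDS = {"to", "a", "the"}

def analyze(text):
    tokens = text.split()                                # WhitespaceTokenizer
    tokens = [t.lower() for t in tokens]                 # LowerCaseFilter
    tokens = [t for t in tokens if t not in STOPWORDS]   # StopFilter
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]  # crude stemmer
    return tokens

print(analyze("To Kill a Mockingbird"))  # ['kill', 'mockingbird']
```

Only the terms that survive the last step are indexed and searchable; the stop words to and a are discarded entirely.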





Tokenizer Implementations
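A few commonly used tokenizer factories, shown as illustrative configuration snippets:

```xml
<!-- Splits on whitespace only; punctuation stays attached to tokens -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- Unicode-aware splitting on whitespace and punctuation -->
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- Treats the entire input as a single token (useful for exact matching) -->
<tokenizer class="solr.KeywordTokenizerFactory"/>
```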



Common Solr Issues and How to Solve Them

Indexing Is Slow 
Indexing can be slow for many reasons. The following are factors that can improve indexing performance, enabling you to tune your setup accordingly:
Memory: If the memory allocated to the JVM is low, garbage collection will be called more frequently and indexing will be slow.
Indexed fields: The number of indexed fields affects the index size, memory requirements, and merge time. Index only the fields that you want to be searchable.
Merge factor: Merging segments is an expensive operation. The higher the merge factor, the faster the indexing.
Commit frequency: The less frequently you commit, the faster indexing will be.
Batch size: The more documents you index in each request, the faster indexing will be.
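The commit-frequency and memory points above map to settings in solrconfig.xml; a hedged sketch (the values are illustrative, not tuning recommendations):

```xml
<!-- Hard commit at most once a minute instead of per request -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<!-- Larger RAM buffer means fewer segment flushes during indexing -->
<ramBufferSizeMB>256</ramBufferSizeMB>
```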






