The following example compares querying in Solr with querying in an SQL
database and shows how the two map to each other.
SQL query:
select album,title,artist
from hellosolr
where album in ('solr','search','engine')
order by album DESC limit 20,10;
Solr query:
$ curl "http://localhost:8983/solr/hellosolr/select?q=solr%20search%20engine&fl=album,title,artist&start=20&rows=10&sort=album%20desc"
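The Solr request above can also be assembled programmatically, so that spaces and other special characters are URL-encoded for you. This is a minimal sketch using only Python's standard library. The host, port, and core name come from the example above; the helper name build_select_url is made up for illustration.

```python
from urllib.parse import urlencode


def build_select_url(base_url, params):
    """Return a Solr /select URL with URL-encoded query parameters."""
    return f"{base_url}/select?{urlencode(params)}"


# Same parameters as the curl example: query, field list, paging, sorting.
params = {
    "q": "solr search engine",
    "fl": "album,title,artist",
    "start": 20,
    "rows": 10,
    "sort": "album desc",
}

url = build_select_url("http://localhost:8983/solr/hellosolr", params)
print(url)
```

urlencode writes spaces as + and commas as %2C, both of which are decoded back to the original values on the server side.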
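If you want to search a field for multiple tokens, surround the tokens with parentheses; otherwise only the first token is matched against that field and the remaining tokens are matched against the default field (df):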
q=title:(to kill a mockingbird)&df=album
q=title:(buffalo OR soldier) OR artist:(bob OR marley)
Query Operators
The following are operators supported by query parsers:
OR: Union is performed, and a document matches if any of the clauses is
satisfied.
AND: Intersection is performed, and a document matches only if both clauses
are satisfied.
NOT: Excludes documents containing the clause.
+/-: Operators to mandate or prohibit the occurrence of terms. + ensures that
the token must exist in every matching document, and - ensures that the
token must not exist in any matching document.
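The operators above behave like set operations over the documents matched by each clause. The following toy sketch shows that analogy with Python sets. The document ids are invented for illustration, and this is not how Solr evaluates queries internally.

```python
# Hypothetical ids of documents whose title contains each term.
buffalo = {1, 2, 5}
soldier = {2, 3, 5}

either = buffalo | soldier      # title:(buffalo OR soldier)
both = buffalo & soldier        # title:(buffalo AND soldier)
only_buffalo = buffalo - soldier  # title:(buffalo NOT soldier)

print(sorted(either), sorted(both), sorted(only_buffalo))
```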
Phrase Query ""
Matches documents that contain the exact sequence of terms.
q="bob marley"
Proximity Query ~ INT
A proximity query requires the phrase query to be followed by the
tilde (~) operator and a numeric distance for identifying the terms in proximity.
q="jamaican singer"~3
Wildcard Query ? *
You can specify the wildcard character ?, which matches exactly one character, or *, which matches zero or more characters.
q=title:(bob* OR mar?ey) OR album:(*bird)
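The meaning of the two wildcard characters can be tried out locally with Python's fnmatch module, which uses the same ? and * conventions for these simple patterns. This only illustrates the matching rules on a made-up list of terms; it says nothing about how Solr executes wildcard queries.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed terms.
terms = ["bob", "bobby", "marley", "marey", "mockingbird"]

starts_with_bob = [t for t in terms if fnmatchcase(t, "bob*")]   # * = zero or more chars
one_char_gap = [t for t in terms if fnmatchcase(t, "mar?ey")]    # ? = exactly one char
ends_with_bird = [t for t in terms if fnmatchcase(t, "*bird")]

print(starts_with_bob, one_char_gap, ends_with_bird)
```

Note that "marey" does not match mar?ey, because ? must consume exactly one character.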
Range Query [] {}
q=price:[1000 TO 5000] // 1000 <= price <= 5000
q=price:{1000 TO 5000} // 1000 < price < 5000
q=price:[1000 TO 5000} // 1000 <= price < 5000
q=price:[1000 TO *] // 1000 <= price
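Square brackets make a bound inclusive and curly braces make it exclusive. A small helper can build the clause from two flags, which avoids mixing up the brackets by hand. The function name range_clause is hypothetical and only serves this sketch.

```python
def range_clause(field, low, high, include_low=True, include_high=True):
    """Build a Solr range clause; '[' / ']' are inclusive, '{' / '}' exclusive."""
    opening = "[" if include_low else "{"
    closing = "]" if include_high else "}"
    return f"{field}:{opening}{low} TO {high}{closing}"


print(range_clause("price", 1000, 5000))                      # both bounds inclusive
print(range_clause("price", 1000, 5000, include_high=False))  # upper bound exclusive
print(range_clause("price", 1000, "*"))                       # open-ended upper bound
```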
Check whether a field exists
q=price:[* TO *] // documents that have a value for price
q=-price:[* TO *] // documents that have no value for price
Filter Query fq
A filter query restricts the set of documents that the main query can match, without affecting their scores. In the first example below, only documents with language=english and genre=rock are considered.
If multiple fq parameters are specified, the query parser selects
the subset of documents that matches all the fq queries.
q=singer:(bob marley) title:(redemption song)&fq=language:english&fq=genre:rock
q=product:hotel&fq=city:"las vegas" AND category:travel
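Because fq can be repeated, a plain dictionary is not enough to hold the parameters in client code. One possible approach, sketched here with Python's standard library, is to pass a list of key/value pairs to urlencode. The field names are taken from the examples above.

```python
from urllib.parse import parse_qs, urlencode

# A list of pairs keeps both fq entries; a dict would keep only one.
params = [
    ("q", "title:(redemption song)"),
    ("fq", "language:english"),
    ("fq", "genre:rock"),
]

query_string = urlencode(params)
print(query_string)

# Decoding the string again shows that both filters survived.
decoded = parse_qs(query_string)
print(decoded["fq"])
```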
rows= & start=
Use start and rows together to get paginated search results: start is the offset of the first result to return, and rows is the number of results per page.
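Translating a page number into these two parameters is simple arithmetic. The sketch below assumes pages are numbered from 1; the function name page_params is invented for this example.

```python
def page_params(page, page_size):
    """Return the Solr start/rows values for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"start": (page - 1) * page_size, "rows": page_size}


print(page_params(1, 10))  # first page
print(page_params(3, 10))  # third page, i.e. start=20&rows=10
```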
sort
A comma-separated list of fields on which the result should be sorted. Each field name should
be followed by the asc or desc keyword.
sort=score desc,popularity desc
fl
Specifies the comma-separated list of fields to be displayed in the response.
wt
Specifies the format in which the response should be returned, such as JSON, XML, or CSV.
debugQuery
This Boolean parameter works wonders to analyze how the query is parsed and how a document got its
score. Debug operations are costly and should not be enabled on live production queries. This parameter
currently supports only the XML and JSON response formats.
explainOther
explainOther is quite useful for analyzing documents that are not part of a debug explanation. debugQuery
explains the score of documents that are part of the result set (if you specify rows=10, debugQuery will add
an explanation for only those 10 documents). If you want an explanation for additional documents, you
can specify a Lucene query in the explainOther parameter for identifying those additional documents.
Remember, the explainOther query will select the additional document to explain, but the explanation will
be with respect to the main query.
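Since debugging is costly, client code may want to add these parameters only on demand. The following is a rough sketch of that idea; the helper name with_debug and the document id in the explainOther query are hypothetical.

```python
from urllib.parse import urlencode


def with_debug(params, enabled, explain_other=None):
    """Return a copy of params, adding debug parameters only when enabled."""
    result = dict(params)
    if enabled:
        result["debugQuery"] = "true"
        if explain_other:
            # Lucene query selecting extra documents to explain.
            result["explainOther"] = explain_other
    return result


base = {"q": "title:(redemption song)", "rows": 10, "wt": "json"}

production = with_debug(base, enabled=False)
debugging = with_debug(base, enabled=True, explain_other="id:1234")

print(urlencode(production))
print(urlencode(debugging))
```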
The Solr schema is saved in the "managed-schema" file.
Sample file content:
<schema name="default-config" version="1.6">
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
</schema>
Solr Field Attributes
name: Each field must have a name.
type: Each field must have a type.
indexed: Whether the field is searchable.
stored: Whether the field value can be returned to the user in search results.
multiValued: Whether the field holds a single value or an array of values.
default: The default value to use if the field is missing from a document.
sortMissingLast: Sorts documents that lack the field after those that have it.
sortMissingFirst: Sorts documents that lack the field before those that have it.
required: Setting this attribute to true makes the field mandatory.
docValues: Setting this attribute to true creates a forward index for the field.
Note: an inverted index is not efficient for sorting, faceting, and highlighting;
a forward index promises to make these operations faster and also frees up the fieldCache.
omitNorms
Fields have norms associated with them, which hold additional information such as index-time boost and
length normalization. Specifying omitNorms="true" discards this information, saving some memory.
Length normalization allows Solr to give lower weight to longer fields. If a length norm is not important
in your ranking algorithm (such as metadata fields) and you are not providing an index-time boost, you can
set omitNorms="true". By default, Solr disables norms for primitive fields.
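To make the docValues attribute described above more concrete, the toy model below contrasts an inverted index (term to documents) with a forward index (document to value). It is only an illustration with invented data; Lucene's actual on-disk structures are far more elaborate.

```python
# Hypothetical single-valued "artist" field: document id -> value.
artist_by_doc = {1: "marley", 2: "dylan", 3: "marley"}

# Inverted index: term -> list of document ids. Good for searching.
inverted = {}
for doc_id, term in artist_by_doc.items():
    inverted.setdefault(term, []).append(doc_id)

# Forward index (the docValues idea): document id -> value.
# Good for sorting and faceting, which need the value of each document.
forward = dict(artist_by_doc)

matching = inverted["marley"]                 # search: which docs contain the term?
sorted_ids = sorted(forward, key=forward.get)  # sort: order docs by their value

print(matching, sorted_ids)
```

Answering the sort from the inverted index alone would require walking every term to find each document's value, which is why a separate forward structure helps.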
The following are possible combinations of indexed and stored parameters:
indexed="true" & stored="true": When you are interested in both querying and displaying the value of a field.
indexed="true" & stored="false": When you want to query on a field but don’t need its value to be displayed. For example, you may want to only query on the extracted metadata but display the source field from which it was extracted.
indexed="false" & stored="true": When you are never going to query on a field and only display its value.
Define a New Solr Field Type That Supports Arabic
<field name="subjects"
type="text_ar_sort" indexed="true" stored="true"/>
Define the new type "text_ar_sort"
<fieldType name="text_ar_sort" class="solr.SortableTextField" sortMissingLast="true" docValues="true" positionIncrementGap="100" multiValued="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt" ignoreCase="true"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt" ignoreCase="true"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
================================================
<analyzer type="index">
<analyzer type="query">
An analyzer defines a chain of processes, each of which performs a specific operation such as splitting
on whitespace, removing stop words, adding synonyms, or converting to lowercase. The output of each
of these processes is called a token. Tokens that are generated by the last process in the analysis chain
(which either get indexed or are used for querying) are called terms, and only indexed terms are searchable.
Tokens that are filtered out, such as by stop-word removal, have no significance in searching and are totally
discarded.
Example
<fieldType name="text_analysis" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>
Description
1. WhitespaceTokenizerFactory splits the text stream on whitespace. In the English language, whitespace separates words, so this tokenizer fits such text analysis well. Had the input been unstructured text containing sentences, a tokenizer that also splits on symbols would have been a better fit, for example to handle a token such as “Cappuccino.”
2. ASCIIFoldingFilterFactory removes accents, because the user query or the content might contain them.
3. StopFilterFactory removes common English words that don’t have much significance in the context and merely add to the recall.
4. LowerCaseFilterFactory normalizes the tokens to lowercase, without which the query term mockingbird would not match the term in the movie name.
5. PorterStemFilterFactory converts the terms to their base form, without which the tokens kill and kills would not have matched.
6. TrimFilterFactory finally trims the tokens.
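The chain described above can be imitated in a few lines of Python to see what each stage does to the tokens. This is only a rough sketch, not Solr's implementation: the stop-word set is an assumed sample, and naive_stem is a made-up stand-in that strips a trailing "s" instead of applying the real Porter algorithm.

```python
import unicodedata

# Assumed sample; the real stopwords.txt will contain a different list.
STOP_WORDS = {"a", "an", "the", "to", "of"}


def ascii_fold(token):
    """Drop accents by decomposing characters and removing combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Crude stand-in for Porter stemming: strip a single trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    tokens = text.split()                                        # 1. whitespace tokenizer
    tokens = [ascii_fold(t) for t in tokens]                     # 2. accent folding
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]  # 3. stop words, ignoreCase
    tokens = [t.lower() for t in tokens]                         # 4. lowercase
    tokens = [naive_stem(t) for t in tokens]                     # 5. stemming (simplified)
    return [t.strip() for t in tokens]                           # 6. trim


print(analyze("To Kill a Mockingbird"))
print(analyze("kills"))
print(analyze("Café"))
```

Because the same function is applied to both the indexed text and the query, "kills" in a query ends up as the same term as "Kill" in the movie name.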
Tokenizer Implementations
Common Solr Issues and How to Solve Them
Indexing Is Slow
Indexing can be slow for many reasons. The following factors affect indexing performance,
and you can tune your setup accordingly:
Memory: If the memory allocated to the JVM is low, garbage collection will be
called more frequently and indexing will be slow.
Indexed fields: The number of indexed fields affects the index size, memory
requirements, and merge time. Index only the fields that you want to be
searchable.
Merge factor: Merging segments is an expensive operation. The higher the merge
factor, the faster the indexing.
Commit frequency: The less frequently you commit, the faster indexing will be.
Batch size: The more documents you index in each request, the faster indexing
will be.
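The batch size and commit frequency advice can be sketched as follows: split the documents into large chunks, send each chunk in one request, and commit once at the end rather than per document. The batch size of 1000 and the sample documents are arbitrary assumptions, and post_batch is an illustrative helper that is defined but not called here, since it needs a running Solr instance.

```python
import json
from urllib.request import Request, urlopen


def batches(docs, batch_size):
    """Yield consecutive slices of docs, each with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(docs), batch_size):
        yield docs[i:i + batch_size]


def post_batch(update_url, batch):
    """Send one batch as a JSON array to a Solr update handler (no commit)."""
    request = Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return response.status


# Invented sample documents.
docs = [{"id": str(n), "title": f"song {n}"} for n in range(2500)]
chunks = list(batches(docs, 1000))
print([len(chunk) for chunk in chunks])  # [1000, 1000, 500]
```

With a live server, each chunk would be passed to post_batch with the core's update URL, followed by a single commit after the last chunk.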