Solr/Lucene Score Tutorial

In this post, we will try to understand how Solr/Lucene’s default scoring mechanism works.

Whenever we do a Solr search, we get result documents sorted in descending order of their scores. We can explicitly ask Solr to return the scores in search results by adding score to the fl parameter. For example, here is a query:
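The original query is not reproduced here, but per the discussion below it was a search for indian cricket against the name field, so the request was along these lines (host, port, and core name are placeholders):

```
http://localhost:8983/solr/mycore/select?q=name:(indian cricket)&fl=name,score
```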

and here are the top five results:

Here is the fieldType definition for name field:
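The definition would be along these lines (a sketch only: the tokenizer and filter parameters are assumptions, based on the lower-case, word-delimiter, and synonym filters mentioned later in this post):

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```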

We want to understand how Solr calculated the scores for each of the result documents. (Notice that the 2nd and the 3rd results have only one of the search tokens, cricket, present, but score higher than the 4th and the 5th results, which have both the search tokens in them!)

We can have Solr “explain” its score calculations by adding the debugQuery parameter and setting it to on. We also add wt=xml so that we get the results in XML, which is easier to use for analyzing the scores:
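The request then looks something like this (again with placeholder host and core name):

```
http://localhost:8983/solr/mycore/select?q=name:(indian cricket)&fl=name,score&debugQuery=on&wt=xml
```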

The results are hard to read as they are. To see an indented output, right click on your browser window and view page source. Then you will see an output like this (for this example I am showing only two results):

(If you want to follow along the explanation below without scrolling up and down, either print the output above, or keep it in another window.)

To understand the above calculation, we need to look into Solr’s default similarity calculation, which is done using Lucene’s DefaultSimilarity. DefaultSimilarity extends Lucene’s TFIDFSimilarity and does something extra. We will look into the “extra” part later. Let’s focus on the Lucene Practical Scoring Function mentioned in the TFIDFSimilarity doc:

  \boxed{\mathrm{s}(q,d) = \Big( \, \sum\limits_{t \, \in \, q} \, \mathrm{tf}(t \, \in \, d) \, \mathrm{idf}^2(t) \, t\mathrm{.getBoost}(\,) \, \mathrm{norm}(t,d) \, \Big) \, \mathrm{coord}(q,d) \, \mathrm{queryNorm}(q)}

  • q is the search query (indian cricket in the above example).
  • d is the document against which we are calculating the score.
  • t is a token in q (for example, indian).

Here is what the various factors mean:

\mathrm{tf}(t \, \in \, d): square root of the frequency of token t in d (in the field name that we searched). The rationale behind having this as a factor in the score is that we want to give higher scores to documents that have more occurrences of our search terms. Since the term cricket appears in the first document twice in the name field (case does not matter, since the field has a lower-case filter at index and query time), it gets:

     \begin{gather*} \mathrm{tf}(\mathrm{cricket} \, \in \, \mathrm{doc} \, 1856209) = \sqrt2 = 1.4142135. \end{gather*}
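As a quick sanity check, tf is just the square root of the raw term frequency; a couple of lines of Python (standing in for Lucene's Java here) reproduce the value above:

```python
import math

def tf(term_freq):
    # DefaultSimilarity: tf(t in d) = sqrt(frequency of t in d)
    return math.sqrt(term_freq)

print(tf(2))  # cricket appears twice in doc 1856209 -> 1.4142135...
print(tf(1))  # a term appearing once contributes 1.0
```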

\mathrm{idf}(t): inverse document frequency. Before understanding what inverse document frequency is, let us first understand what document frequency, df(t), is. It is the number of documents (out of all the documents in our Solr index) which contain token t in the field that is searched. And idf(t) is defined as:

     \begin{gather*} \mathrm{idf}(t) = 1 + \mathrm{log}_e \Big( \frac{\mathrm{numDocs}}{1 \, + \, \mathrm{df}(t)} \Big) \end{gather*}

where numDocs is the total number of documents in our index. The rationale behind having this as a factor is to give lower scores to more frequently occurring terms and higher scores to rarer terms. In our example search, the token indian occurs in 209 documents (see the explain output above where it says docFreq=209) but cricket occurs only in 57 documents, so cricket gets a higher score.

For token indian with document frequency of 209 and maxDocs being 198488, we get idf(indian) = 7.8513765.

(One strange thing I noticed, though: while the formula says numDocs, we can see above that Solr actually uses maxDocs, which also includes deleted documents that have not been purged yet!)
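We can verify both idf values from the explain output with a short Python sketch (using maxDocs = 198488 in place of numDocs, per the anomaly just noted):

```python
import math

def idf(doc_freq, num_docs):
    # TFIDFSimilarity: idf(t) = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

max_docs = 198488
print(idf(209, max_docs))  # indian  -> ~7.8513765
print(idf(57, max_docs))   # cricket -> ~9.138041
```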

t\mathrm{.getBoost}(\,): Query time boost for token t. We can give higher weights to certain search tokens by boosting them. For example, name:(indian^5 cricket) would boost token indian by 5. Since we have not specified any query time boost, the default multiplicative boost of 1 is applied.

\mathrm{norm}(t,d): This is reported as fieldNorm by Solr. It is the product of index time boosts and a length normalization factor. It is more precisely defined on the field f we are searching against (i.e. name for our example above), rather than token t:

     \begin{gather*} \mathrm{norm}(f, d) = d.\mathrm{getBoost()} \times f.\mathrm{getBoost()} \times \mathrm{lengthNorm}(f) \end{gather*}

d.\mathrm{getBoost()} is the index time boost for document d. f.\mathrm{getBoost()} is the index time boost for the field f. These are straightforward to understand. (In our example above, there are no index time boosts applied, so they are all 1, which Solr does not report.)

The third factor \mathrm{lengthNorm}(f) is defined as the reciprocal of the square root of the number of terms in field f. The rationale behind this is that we want to penalize documents with longer field values, because a search term has a better chance of occurring in them simply because they are longer; shorter documents may match our query more precisely.

We can use Solr’s analysis admin tool to find out how many terms our field gets. For example, with the way I have my index time analysis chain set up for field name, for field value Best Captain of Indian National Cricket Team (Test Cricket), I get 13 tokens. I get extra tokens because of WordDelimiterFilter and SynonymFilter.

The lengthNorm for our first result is then 1/\sqrt{13} = 0.27735, but we see that Solr is reporting a fieldNorm of 0.25 for this document (which is the same as the lengthNorm for us, since the index time boosts are all 1). This is because of the “extra” something that DefaultSimilarity does: to optimize the storage of lengthNorm, it encodes the decimal value (more precisely, the float) into a single byte at index time, and decodes that byte back into a float at query time when computing the score. If you dig into the source code (SmallFloat.floatToByte315 and its inverse), you see that the encoding keeps only a 3-bit mantissa, so it is lossy: 0.27735 encodes to a byte that decodes back to exactly 0.25.
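Here is a Python transliteration of the relevant Lucene logic (SmallFloat.floatToByte315 and byte315ToFloat, which DefaultSimilarity delegates to); it is a sketch of the Java code, not the original:

```python
import struct

def float_to_raw_int_bits(f):
    # Equivalent of Java's Float.floatToRawIntBits
    return struct.unpack('>i', struct.pack('>f', f))[0]

def int_bits_to_float(bits):
    # Equivalent of Java's Float.intBitsToFloat
    return struct.unpack('>f', struct.pack('>i', bits))[0]

def float_to_byte315(f):
    # Encode a float into one byte: 3-bit mantissa, 5-bit exponent, zero point 15
    bits = float_to_raw_int_bits(f)
    smallfloat = bits >> (24 - 3)
    if smallfloat <= ((63 - 15) << 3):
        return 0 if bits <= 0 else 1  # underflow: round up to smallest positive
    if smallfloat >= ((63 - 15) << 3) + 0x100:
        return 255                    # overflow: clamp (Java returns byte -1)
    return smallfloat - ((63 - 15) << 3)

def byte315_to_float(b):
    # Decode the byte back into a (less precise) float
    if b == 0:
        return 0.0
    bits = (b & 0xFF) << (24 - 3)
    bits += (63 - 15) << 24
    return int_bits_to_float(bits)

length_norm = 1 / 13 ** 0.5           # 0.27735...
encoded = float_to_byte315(length_norm)
print(byte315_to_float(encoded))      # -> 0.25, the fieldNorm Solr reports
```

Only 256 distinct norm values can be represented, so nearby lengthNorms collapse to the same fieldNorm; every value in [0.25, 0.3125) comes back as 0.25.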

\mathrm{coord}(q,d): the number of tokens in q that are found in doc d divided by the number of tokens in q. This factor gives a higher score to documents that contain more of the search terms.

Since both search terms are found in the first document, coord(indian cricket, doc 1856209) is simply 1 (since coord is a multiplier, Solr does not report the multiplier of 1). For the second document, we get 0.5 since there is only one matching term (cricket) and Solr reports the coord in this case.

\mathrm{queryNorm}(q): The Lucene documentation on this is clear: “queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity produces a Euclidean norm.” The formula is:

     \begin{gather*} \mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}} \end{gather*}

For a Boolean query (like our example), sumOfSquaredWeights is defined as:

     \begin{gather*} \mathrm{sumOfSquaredWeights} = \big( q.\mathrm{getBoost}(\,) \big)^2 \, \sum\limits_{t \, \in \, q} \big( \mathrm{idf}(t) \times t.\mathrm{getBoost}(\,) \big)^2 \end{gather*}

In our case, all boosts are 1, so sumOfSquaredWeights simply becomes

     \begin{align*}  \mathrm{sumOfSquaredWeights} &= \sum\limits_{t \, \in \, q} \big( \mathrm{idf}(t) \big)^2 \\  &= (7.8513765^2 + 9.138041^2) \\  &= 145.1479 \end{align*}

which gives the queryNorm of 1/\sqrt{145.1479} = 0.08300316 which is what Solr reports.

This is arguably the least interesting factor of all, because it is the same for all the documents that match the search.
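To tie everything together, here is a Python sketch (not Solr code) that reproduces the queryNorm above and then assembles the full score of the first document from the factors we have computed; the tf of indian is assumed to be 1, since it appears once in the field value:

```python
import math

idf_indian, idf_cricket = 7.8513765, 9.138041  # from the explain output
field_norm = 0.25                              # fieldNorm for doc 1856209
coord = 1.0                                    # both query terms match

sum_sq = idf_indian ** 2 + idf_cricket ** 2    # all boosts are 1
query_norm = 1 / math.sqrt(sum_sq)
print(query_norm)                              # ~0.083003, matching Solr

# score = coord * queryNorm * sum over terms of tf * idf^2 * norm
score = coord * query_norm * (
    math.sqrt(1) * idf_indian ** 2 * field_norm     # indian: tf = 1 (assumed)
    + math.sqrt(2) * idf_cricket ** 2 * field_norm  # cricket: tf = sqrt(2)
)
print(score)
```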

Now we come back to the anomaly we noticed in the search results above. Why do the second and third documents, with only one matching search term, get a higher score than the fourth and fifth documents, which contain both the search terms? The answer has to do with how fieldNorm (equivalently, lengthNorm in our case) and coord offset each other. In our example, the second and third documents get a coord of 0.5 each but a lengthNorm of 1 (since there is only one token in their name fields), whereas the fourth and fifth documents get a coord of 1 but a lengthNorm of only 0.25, which is why they get a lower score. If this is something you do not desire, you can instruct Solr to omit the norms in the score calculation. This can be done in your schema.xml by specifying omitNorms=true in your field definition. (See the Solr documentation on schema.xml.)
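The change to schema.xml would look something like this (the type and other attributes here are illustrative, not from the original schema):

```xml
<field name="name" type="text_general" indexed="true" stored="true" omitNorms="true"/>
```

With omitNorms="true", Lucene does not store a norm for the field and uses 1.0 as the fieldNorm, so field length (and any index time boost) no longer affects the score.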

Sometimes you may not want to include idf or tf in your score calculations. For example, we have a Solr core on which we search for specific tags with query-time boosts, but we do not want to use idf for this search at all. This can be done by writing our own Similarity class which extends DefaultSimilarity and using that in our fieldType definition. Writing a custom similarity class is a topic for a separate post, which I will be writing shortly.
