Monthly Archives: December 2014

Solr/Lucene Score Tutorial

In this post, we will try to understand how Solr/Lucene’s default scoring mechanism works.

Whenever we do a Solr search, we get result documents sorted in descending order of their scores. We can explicitly ask Solr to return the scores in search results by adding score to the fl parameter. For example, here is a query:
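The query itself is not reproduced here; for the name field searched below it would have this shape (host, port, and core path are placeholders):

```
http://localhost:8983/solr/select?q=name:(indian+cricket)&fl=name,score
```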

and here are the top five results:

Here is the fieldType definition for name field:
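The original definition is not shown here; based on the analysis behavior described below (lower-casing, WordDelimiterFilter, SynonymFilter), it would be along these lines (type name, filter order, and parameters are assumptions):

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```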

We want to understand how Solr calculated the scores for each of the result documents. (Notice that the 2nd and the 3rd results have only one of the search tokens cricket present, but score higher than the 4th and the 5th results which have both the search tokens in them!)

We can have Solr “explain” its score calculations by adding param debugQuery and setting it to on. We also add wt=xml so we get results in XML which is easier to use for analyzing the scores:
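With the debug parameters added, the query URL looks like this (host and core path are placeholders):

```
http://localhost:8983/solr/select?q=name:(indian+cricket)&fl=name,score&debugQuery=on&wt=xml
```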

The results are hard to read as they are. To see an indented output, right click on your browser window and view page source. Then you will see an output like this (for this example I am showing only two results):

(If you want to follow along the explanation below without scrolling up and down, either print the output above, or keep it in another window.)

To understand the above calculation, we need to look into Solr’s default similarity calculation, which is done using Lucene’s DefaultSimilarity. DefaultSimilarity extends Lucene’s TFIDFSimilarity and does something extra. We will look into the “extra” part later. Let’s focus on the Lucene Practical Scoring Function mentioned in the TFIDFSimilarity doc:

  \boxed{\mathrm{s}(q,d) = \Big( \, \sum\limits_{t \, \in \, q} \, \mathrm{tf}(t \, \in \, d) \, \mathrm{idf}^2(t) \, t\mathrm{.getBoost}(\,) \, \mathrm{norm}(t,d) \, \Big) \, \mathrm{coord}(q,d) \, \mathrm{queryNorm}(q)}

  • q is the search query (indian cricket in the above example)
  • d is the document against which we are calculating the score
  • t is a token in q (for example, indian).

Here are what the various factors mean:

\mathrm{tf}(t \, \in \, d): square root of the frequency of token t in d (in the field name that we searched). The rationale behind having this as a factor in the score is that we want to give higher scores to documents that have more occurrences of our search terms. Since the term cricket appears twice in the name field of the first document (case does not matter, since the field has a lower-case filter at both index and query time), it gets:

     \begin{gather*} \mathrm{tf}(\mathrm{cricket} \, \in \, \mathrm{doc} \, 1856209) = \sqrt2 = 1.4142135. \end{gather*}

\mathrm{idf}(t): inverse document frequency. Before understanding what inverse document frequency is, let us first understand what document frequency, df(t), is. It is the number of documents (out of all the documents in our Solr index) which contain token t in the field that is searched. And idf(t) is defined as:

     \begin{gather*} \mathrm{idf}(t) = 1 + \mathrm{log}_e \Big( \frac{\mathrm{numDocs}}{1 \, + \, \mathrm{df}(t)} \Big) \end{gather*}

where numDocs is the total number of documents in our index. The rationale behind having this as a factor is to give lower scores to more frequently occurring terms and higher scores to rarer terms. In our example search, the token indian occurs in 209 documents (see the explain output above where it says docFreq=209) but cricket occurs only in 57 documents, so cricket gets a higher score.

For token indian with document frequency of 209 and maxDocs being 198488, we get idf(indian) = 7.8513765.
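This is easy to check by hand. A small sketch of the idf computation (the formula above, with maxDocs plugged in for numDocs as discussed next; docFreq values come from the explain output):

```java
// Sketch of DefaultSimilarity's idf: 1 + ln(numDocs / (docFreq + 1)).
// 198488 is the maxDocs value and 209/57 are the docFreq values
// reported in the explain output above.
public class IdfDemo {
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        System.out.println(idf(209, 198488)); // indian  -> ~7.8513765
        System.out.println(idf(57, 198488));  // cricket -> ~9.138041
    }
}
```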

(One strange thing I noticed: though the formula says numDocs, we can see above that Solr actually uses maxDocs, which also includes deleted documents that have not been purged yet!)

t\mathrm{.getBoost}(\,): Query time boost for token t. We can give higher weights to certain search tokens by boosting them. For example, name:(indian^5 cricket) would boost token indian by 5. Since we have not specified any query time boost, the default multiplicative boost of 1 is applied.

\mathrm{norm}(t,d): This is reported as fieldNorm by Solr. It is the product of index time boosts and a length normalization factor. It is more precisely defined on the field f we are searching against (i.e. name for our example above), rather than token t:

     \begin{gather*} \mathrm{norm}(f, d) = d.\mathrm{getBoost()} \times f.\mathrm{getBoost()} \times \mathrm{lengthNorm}(f) \end{gather*}

d.\mathrm{getBoost()} is the index time boost for document d. f.\mathrm{getBoost()} is the index time boost for the field f. These are straightforward to understand. (In our example above, there are no index time boosts applied, so they are all 1, which Solr does not report.)

The third factor \mathrm{lengthNorm}(f) is defined as the reciprocal of the square root of the number of terms in field f. The rationale behind this is that we want to penalize documents with longer field values, because there is a better probability for the search term to occur in them simply because they are longer. Shorter documents may more precisely match our query.

We can use Solr’s analysis admin tool to find out how many terms our field gets. For example, with the way I have my index time analysis chain set up for field name, for field value Best Captain of Indian National Cricket Team (Test Cricket), I get 13 tokens. I get extra tokens because of WordDelimiterFilter and SynonymFilter.

The lengthNorm for our first result is then 1/\sqrt{13} = 0.27735, but we see that Solr is reporting a fieldNorm of 0.25 for this document (which is the same as lengthNorm for us, since index time boosts are all 1). This is because of the “extra” that DefaultSimilarity does: it optimizes the storage of lengthNorm by converting the float to a single byte. If you dig into the source code, you see that it encodes the float at index time as follows:
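A sketch of that method, paraphrased from Lucene 4.x's SmallFloat.floatToByte315, which DefaultSimilarity delegates to (the constants pack the float into a 5-bit exponent and 3-bit mantissa):

```java
// Paraphrased sketch of Lucene's SmallFloat.floatToByte315.
public class NormEncode {
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);      // keep exponent + top mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {   // too small: clamp to 0 or 1
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) { // too large: clamp
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    public static void main(String[] args) {
        // Our lengthNorm 1/sqrt(13) squeezed into a single byte.
        System.out.println(floatToByte315(0.27735f)); // prints: 116
    }
}
```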

If we put that method in a test file and run it on our lengthNorm value of 0.27735, we get the byte 116.

This byte value is then decoded during query time for computing the score using this method:
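Again a sketch, paraphrased from SmallFloat.byte315ToFloat in the Lucene 4.x source; the precision lost in the byte encoding is why 0.27735 comes back as a coarser value:

```java
// Paraphrased sketch of Lucene's SmallFloat.byte315ToFloat.
public class NormDecode {
    public static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);  // restore the mantissa position
        bits += (63 - 15) << 24;            // restore the exponent bias
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(byte315ToFloat((byte) 116)); // prints: 0.25
    }
}
```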

so we get back 0.25, which is exactly the fieldNorm Solr reports.

\mathrm{coord}(q,d): the number of tokens in q that are found in doc d, divided by the total number of tokens in q. This factor gives a higher score to documents that contain more of the search terms.

Since both search terms are found in the first document, coord(indian cricket, doc 1856209) is simply 1 (since coord is a multiplier, Solr does not report the multiplier of 1). For the second document, we get 0.5 since there is only one matching term (cricket) and Solr reports the coord in this case.

\mathrm{queryNorm}(q): The Lucene documentation on this is clear: “queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity produces a Euclidean norm.” The formula is:

     \begin{gather*} \mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}} \end{gather*}

For a Boolean query (like our example), sumOfSquaredWeights is defined as:

     \begin{gather*} \mathrm{sumOfSquaredWeights} = \big( q.\mathrm{getBoost}(\,) \big)^2 \, \sum\limits_{t \, \in \, q} \big( \mathrm{idf}(t) \times t.\mathrm{getBoost}(\,) \big)^2 \end{gather*}

In our case, all boosts are 1, so sumOfSquaredWeights simply becomes

     \begin{align*}  \mathrm{sumOfSquaredWeights} &= \sum\limits_{t \, \in \, q} \big( \mathrm{idf}(t) \big)^2 \\  &= (7.8513765^2 + 9.138041^2) \\  &= 145.1479 \end{align*}

which gives the queryNorm of 1/\sqrt{145.1479} = 0.08300316 which is what Solr reports.

This is arguably the least interesting factor of all, because it is the same for all the documents that match the search.
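With all the factors in hand, we can reproduce the first document's score by hand. This is only a sketch: tf(indian) = 1 is an assumption (the name field contains Indian once); the idf values, fieldNorm, and coord are the ones from the explain output discussed above.

```java
// Worked example of the practical scoring function for the first document.
public class ScoreDemo {
    public static double score() {
        double idfIndian  = 7.8513765, idfCricket = 9.138041;
        double tfIndian   = 1.0;            // assumption: "Indian" occurs once
        double tfCricket  = Math.sqrt(2);   // "Cricket" occurs twice
        double fieldNorm  = 0.25;           // decoded norm for the 13-term field
        double coord      = 1.0;            // both query terms matched
        double queryNorm  = 1.0 / Math.sqrt(idfIndian * idfIndian
                                          + idfCricket * idfCricket);
        double sum = tfIndian  * idfIndian  * idfIndian  * fieldNorm
                   + tfCricket * idfCricket * idfCricket * fieldNorm;
        return sum * coord * queryNorm;     // ~3.73
    }

    public static void main(String[] args) {
        System.out.println(score());
    }
}
```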

Now we come back to the anomaly we noticed in the search results above. Why do the second and third documents, with only one search term, get a higher score than the fourth and fifth documents, which contain both the search terms? The answer has to do with how fieldNorm (equivalently, lengthNorm in our case) and coord offset each other. In our example, the second and third documents get a coord of 0.5 each but a lengthNorm of 1 (since their name fields have only one token), whereas the fourth and fifth documents get a coord of 1 but a lengthNorm of only 0.25, which is why they get a lower score. If this behavior is not desired, you can instruct Solr to omit the norms from the score calculation by specifying omitNorms=true in your field definition in schema.xml. (See the Solr documentation on schema.xml.)
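For reference, omitting norms is a per-field switch in schema.xml; a hypothetical field definition (the type name and other attributes are placeholders) looks like:

```xml
<field name="name" type="text_general" indexed="true" stored="true" omitNorms="true"/>
```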

Sometimes you may want to not include idf or tf in your score calculations. For example, we have a Solr core on which we search for specific tags with query-time boosts, but we do not want to use idf for this search at all. This can be done by writing our own Similarity class which extends DefaultSimilarity and using that in our fieldType definition. Writing a custom similarity class is a topic for a separate post which I will be writing shortly.

A PHP script for log file analysis

Just want to share a PHP script I wrote for analyzing multiple log files in our log directories. (Disclaimer: this is a very simple script and may not be robust and secure, so use it only on servers which are used internally.)

The motivation for this small project is we have about a hundred batch processes (built with Spring Batch framework) that run via cron jobs on our production servers. We use Log4j (with daily log rotation) and all the batch processes log to a folder like

The log file name is the same (batch.log) under all these individual sub-directories. With log rotation, the previous day’s files are automatically moved at midnight to files like batch.log.$year-$month-$day.

Frequently we need to either tail a log file to see when it last ran (or if it is still running), or look for errors, or count how many items got processed by a certain batch process on a certain day. Someone needed to log in to the server via VPN and SSH, change the working directory to the appropriate batch process directory, and analyze the log files with shell commands. This was getting tedious. We wanted a simple web interface with which we could easily tail the log files or look at certain lines with pattern matching (for example, only ERROR level log messages), and also be able to specify the date of the log file.

Since our servers already had Apache HTTP server and PHP installed, we went for a PHP script (like I mentioned in my post on scripting languages). The PHP script simply acts as a nice front-end to execute a bunch of shell commands and displays the output on a web interface.

This script serves two purposes:

  1. list all the sub directories of the individual processes available at BASEDIR/logs/
  2. help tail and analyze the individual process log files

Invoked without any params, it simply lists the names of all the sub-directories. Each of these names is linked to the same script, but these links have an additional param proc, which is the name of the specific batch process sub-directory. The script also takes an additional param lines, which is the number of lines we want from the tail of the log file. By default, the links on the main page to the individual processes simply have lines=50 so that the last 50 lines of the log file are displayed. This is implemented using the shell tail command and capturing its output with backticks.

A date can also be specified to the script via the dt param. If not specified, it simply works on the current day’s log file.

We also wanted to do pattern matching to find lines containing specific patterns. Naturally grep is the shell command for this. We can also specify a Perl regular expression to grep via the -P option, which can be specified to the script below using param pattern. The script also takes two additional params before and after, which correspond to options -B and -A of the grep command.
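Under the hood this is just an ordinary grep invocation. Here is a self-contained demo of the command the script builds, run against a throwaway log file (the path and log content are hypothetical; -P requires GNU grep):

```shell
# Create a tiny fake log file and grep it with context lines,
# the way the script's pattern/before/after params translate to grep flags.
printf 'starting job\nERROR could not connect\nretrying\n' > /tmp/batch.log
grep -P -B 1 -A 1 'ERROR|Exception' /tmp/batch.log
```

With the match on the middle line and one line of context on each side, all three lines are printed.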

Here is the script:
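The original script is not reproduced here; a minimal sketch with the same parameters (proc, lines, dt, pattern, before, after) might look like the following. The base directory, HTML output, and validation are all simplified; as the disclaimer says, use something like this only on internal servers.

```php
<?php
// Minimal sketch of the log viewer (not the original script).
$basedir = '/BASEDIR/logs';   // placeholder: adjust to your layout

$proc    = isset($_GET['proc'])    ? basename($_GET['proc']) : null;
$lines   = isset($_GET['lines'])   ? (int) $_GET['lines']    : 50;
$dt      = isset($_GET['dt'])      ? $_GET['dt']             : null;
$pattern = isset($_GET['pattern']) ? $_GET['pattern']        : null;
$before  = isset($_GET['before'])  ? (int) $_GET['before']   : 0;
$after   = isset($_GET['after'])   ? (int) $_GET['after']    : 0;

header('Content-Type: text/plain');

if ($proc === null) {
    // No params: list the per-process sub-directories.
    // (The real script renders each as a link with ?proc=$name&lines=50.)
    foreach (glob("$basedir/*", GLOB_ONLYDIR) as $dir) {
        echo basename($dir) . "\n";
    }
    exit;
}

// Current day's file is batch.log; older days are batch.log.YYYY-MM-DD.
$file = "$basedir/$proc/batch.log" . ($dt ? ".$dt" : "");

if ($pattern !== null) {
    $cmd = sprintf('grep -P -B %d -A %d %s %s',
                   $before, $after,
                   escapeshellarg($pattern), escapeshellarg($file));
} else {
    $cmd = sprintf('tail -n %d %s', $lines, escapeshellarg($file));
}
echo `$cmd`;   // backticks capture the shell command's output
```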

Here are the PHP functions the script uses with links to their documentation:

Here is an example URL with all the bells and whistles, assuming the script is called track.php and kept under the logs dir of your www folder. We are getting all the lines from the process new-user-signup which match the pattern ERROR (log4j log level) or Exception, plus 3 lines before and after such matches, on the date 2014-12-01:
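Such a URL would have this shape (the hostname is a placeholder):

```
http://yourserver/logs/track.php?proc=new-user-signup&pattern=ERROR%7CException&before=3&after=3&dt=2014-12-01
```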

Notice that we have to URL encode the params. Here we have URL encoded ERROR|Exception to ERROR%7CException.

So that’s the poor man’s script for a simple web interface for your log files with some powerful functionality :-).

Find sizes of all collections in a MongoDB database

Here is a simple script for the mongo shell that prints the sizes of all the collections in a database, in descending order of their total sizes (sum of storage and index sizes). We just combine db.getCollectionNames(), each collection's stats() command, and a bit of JavaScript to achieve this.

Let’s assume we are using database named mydb.
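A sketch of such a script (not necessarily the original; the field names storageSize and totalIndexSize come from the collStats output, and are in bytes):

```javascript
// Returns {name, total} for every collection, largest first.
function collectionTotals(db) {
    return db.getCollectionNames().map(function (name) {
        var s = db.getCollection(name).stats();
        return { name: name, total: s.storageSize + s.totalIndexSize };
    }).sort(function (a, b) { return b.total - a.total; });
}

if (typeof db !== "undefined") {  // running inside the mongo shell
    collectionTotals(db).forEach(function (c) {
        print(c.name + ": " + (c.total / (1024 * 1024)).toFixed(1) + " MB");
    });
}
```

Save it as, say, sizes.js and run it with `mongo mydb sizes.js`, or paste the function into an interactive shell.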

That gives a nice output with one line per collection, largest first.

Of course, this script is very easy to adapt so that only collections that occupy more than X GB are returned.