Monthly Archives: July 2013

Solr PatternReplaceCharFilterFactory vs. PatternReplaceFilterFactory

I was in for a bit of a surprise when I used Solr’s PatternReplaceFilterFactory.

I read the Solr wiki about when to use a CharFilter vs. a TokenFilter and decided to go with the TokenFilter rather than the CharFilter. My requirement was to strip out all 1-to-3-digit numbers from the text. (4-digit numbers in our case were usually years, so we wanted to keep them.) So I used the regex \b\d{1,3}\b (where \b means word boundary and \d means a digit) and formed the filter as follows:

<filter class="solr.PatternReplaceFilterFactory" 
        pattern="\bd{1,3}\b" 
        replacement="" 
        replace="all"/>

My full analyzer chain had other filters and looked like this:

<fieldType
    name="text_testing"
    class="solr.TextField"
    positionIncrementGap="100"
    omitNorms="true">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter
            class="solr.WordDelimiterFilterFactory"
            preserveOriginal="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="0"
            catenateAll="1"
            splitOnCaseChange="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter
            class="solr.PatternReplaceFilterFactory"
            pattern="\bd{1,3}\b"
            replacement=""
            replace="all" />
        <filter
            class="solr.StopFilterFactory"
            words="stopwords.txt"
            ignoreCase="true" />
        <filter
            class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt"
            ignoreCase="true"
            expand="true" />
        <filter class="solr.PorterStemFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
    </analyzer>

    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter
            class="solr.PatternReplaceFilterFactory"
            pattern="\bd{1,3}\b"
            replacement=""
            replace="all" />
        <filter
            class="solr.StopFilterFactory"
            words="stopwords.txt"
            ignoreCase="true" />
        <filter class="solr.PorterStemFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
    </analyzer>
</fieldType>

Then I used the analysis UI to analyze some text. Let us say we are indexing the top 45 news items of 2009. Here is what the analysis tool shows us:

[Screenshot: PatternReplaceFilterFactory index-time analysis]

What is surprising here is that the 2-digit number 45 appears to be replaced by the Greek character Φ, which is actually an empty token. (If you click on ‘Verbose Output’, you can see this more clearly.)

This gives strange matches when the search text contains some other 1-, 2- or 3-digit number, as you can see here:

[Screenshot: PatternReplaceFilterFactory query-time matching]

Here 32 in the search text also gets converted to the empty token, and it matches the empty token produced by 45 at index time! (The match highlighting does happen, but since it is on an empty token, it shows up as a grey block.)

So we have to use PatternReplaceCharFilterFactory in this case. I went ahead and defined another field type, using the same pattern as follows:

<fieldType
    name="text_no_idf"
    class="solr.TextField"
    positionIncrementGap="100"
    omitNorms="true">
    <analyzer type="index">
        <charFilter
            class="solr.PatternReplaceCharFilterFactory"
            pattern="\b\d{1,3}\b"
            replacement="" />
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter
            class="solr.WordDelimiterFilterFactory"
            preserveOriginal="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="0"
            catenateAll="1"
            splitOnCaseChange="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter
            class="solr.StopFilterFactory"
            words="stopwords.txt"
            ignoreCase="true" />
        <filter
            class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt"
            ignoreCase="true"
            expand="true" />
        <filter class="solr.PorterStemFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
    </analyzer>

    <analyzer type="query">
        <charFilter
            class="solr.PatternReplaceCharFilterFactory"
            pattern="\b\d{1,3}\b"
            replacement="" />
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter
            class="solr.WordDelimiterFilterFactory"
            preserveOriginal="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter
            class="solr.StopFilterFactory"
            words="stopwords.txt"
            ignoreCase="true" />
        <filter class="solr.PorterStemFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
    </analyzer>
</fieldType>

and the matching goes as follows:

[Screenshot: PatternReplaceCharFilterFactory analysis, with no empty tokens]

No more empty tokens and no more false matches :).


Java List remove int vs. remove Integer Object

I have made this mistake a few times, so I thought posting this might help me in the future, and maybe some other unfortunate soul like me :).

Can you spot the bug in the following piece of code, without actually running it?
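A minimal sketch (the list values are illustrative):

import java.util.ArrayList;
import java.util.List;

public class ListRemove {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<Integer>();
        list.add(10);
        list.add(20);
        list.add(30);
        list.add(40);
        // Intent: remove the value 40 from the list
        list.remove(40);
        System.out.println(list);
    }
}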

If we run that code, we get this exception:
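(the exact trace varies by JDK version)

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 40, Size: 4
	at java.util.ArrayList.rangeCheck(...)
	at java.util.ArrayList.remove(...)
	at ListRemove.main(...)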

The issue is that remove(int i) removes the element at index i, not the element with value i. (I wish this method were really named removeElementAtIndex(int i)!)

To remove the first occurrence of 40 from that list, we should replace
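list.remove(40);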

with
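list.remove(Integer.valueOf(40));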

which indeed removes the first occurrence of 40 from the list, since the argument is now an Integer object and not a primitive int, and that makes all the difference in this case!

Logging Perl Script with Log::Log4perl

To enable logging in a Perl script, we can use Log::Log4perl.

Here is a simple Perl script:
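A minimal sketch with an inline configuration (the layout and messages are illustrative; the log file name matches what we inspect below):

#!/usr/bin/perl
use strict;
use warnings;
use Log::Log4perl;

# Inline Log4perl configuration: log DEBUG and above to test.log
my $conf = q(
    log4perl.rootLogger                                = DEBUG, LOGFILE
    log4perl.appender.LOGFILE                          = Log::Log4perl::Appender::File
    log4perl.appender.LOGFILE.filename                 = test.log
    log4perl.appender.LOGFILE.mode                     = append
    log4perl.appender.LOGFILE.layout                   = Log::Log4perl::Layout::PatternLayout
    log4perl.appender.LOGFILE.layout.ConversionPattern = %d %p %m%n
);
Log::Log4perl->init(\$conf);

my $logger = Log::Log4perl->get_logger();
$logger->debug("This is a debug message");
$logger->info("Script started");
$logger->error("Something went wrong");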

Of course, if we do not have the required Perl module installed, we will get this error when we try to run the script:
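(assuming the script is saved as test.pl; the @INC paths vary by system)

Can't locate Log/Log4perl.pm in @INC (@INC contains: ...) at ./test.pl line 4.
BEGIN failed--compilation aborted at ./test.pl line 4.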

To install the module on Ubuntu, use:
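sudo apt-get install liblog-log4perl-perl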

Once we run the script successfully, we can see that a log file called test.log has been created, with contents like:
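(timestamps illustrative)

2013/07/15 10:23:01 DEBUG This is a debug message
2013/07/15 10:23:01 INFO Script started
2013/07/15 10:23:01 ERROR Something went wrong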

Tracking the Progress of a MongoDB Map-Reduce Operation

If you are running long map-reduce operations in MongoDB, you will definitely want to track their progress. Here is a simple way to do it with the mongo command-line client.

Following this question and its answer on Stack Overflow, we can check the status of all the currently running operations in mongo using:
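db.currentOp()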

If you have a lot of running operations, the output could be huge.

Now we take a hint from the db.currentOp() documentation page about how to get the details for only the operation we are interested in.

We can first fetch the opids and namespaces of the running operations, using:
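db.currentOp().inprog.forEach(function(op) {
    // print the operation id and the namespace it is working on
    print(op.opid + "\t" + op.ns);
})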

which gives an output like this:
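(the opids and the second namespace here are illustrative)

12345	mdb.vote.log
12389	mdb.users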

I am interested only in the operation on mdb.vote.log, so to fetch the status of just that operation, we can use the following:
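db.currentOp().inprog.forEach(function(op) {
    // print the full status document only for the mdb.vote.log operation
    if (op.ns == "mdb.vote.log") {
        printjson(op);
    }
})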

This prints:
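(trimmed to the interesting fields; the numbers are illustrative. The msg and progress fields show the map-reduce phase and how far along it is.)

{
	"opid" : 12345,
	"active" : true,
	"secs_running" : 87,
	"op" : "query",
	"ns" : "mdb.vote.log",
	"msg" : "m/r: (1/3) emit phase 132972/431342 30%",
	"progress" : {
		"done" : 132972,
		"total" : 431342
	},
	...
}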

Solr External File Fields

Solr has a very nice feature – “external file fields” (EFFs). If you google it today (Jun 30, 2013), you find these pages:
Lucene External File Field Page
Lucidworks External File Field Page

While both these pages have valuable information, not all of it is in one place, so here is an attempt to consolidate my knowledge and understanding of how to work with EFFs.

We use Solr to serve our company’s browse pages. Our browse pages are similar to how a typical Stackoverflow tag page looks. That “browse” page has the question title (which links to the actual page that contains the question, comments and replies), view count, snippet of the question text, questioner’s profile info, tags and time information. One thing that can change quite frequently on such a page is the view count. I believe Stackoverflow uses Redis to keep track of the view counts, but we have to currently manage this in Solr, since Solr is our only datastore to serve these browse pages.

The problem before Solr 4.0 was that you could not update a single field in a document. You had to form the entire document first (either by querying Solr or by using an alternate data source that contains all the info), update the view count, and then post the entire document to Solr. With Solr 4+, you can do an atomic update of a single field; the Solr server internally handles fetching the entire document, updating the field and updating its index. But atomic updates come with some caveats:

  • you must store all your Solr fields (other than copyFields), which can increase your storage space
  • you must enable updateLog, which can slow down Solr start-up

For this specific problem of updating a field more frequently than the rest of the document, external file fields (EFFs) can come in quite handy. They have one main restriction though: they cannot be used directly in the q parameter. But we will see how to circumvent this problem, at least partially, using function query hacks.

EFF set-up
First, the location for the external file field data file is the dataDir specified in your solrconfig.xml. I like to name external file fields with an eff_ prefix in my Solr schema.xml. Here is the part of my schema.xml relevant to EFFs. Under the types element I have:
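(a sketch; the type name eff_float is my choice)

<fieldType name="eff_float"
    class="solr.ExternalFileField"
    keyField="id"
    defVal="0"
    valType="float"/>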

where id is my uniqueKey, and under the fields element:
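<field name="eff_views" type="eff_float"/>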

Since the name of my field is eff_views, Solr will look for a file named external_eff_views (extension does not matter) in my dataDir. The data in this file should be key=value pairs. So if I have documents with ids 1, 2 and 3, with 2341, 34 and 991 views respectively, the file should contain:
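1=2341
2=34
3=991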

Now we can check if Solr loads the above data. Start up the Solr server and issue the following query:
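(host, port and core name are the example defaults; adjust for your set-up)

http://localhost:8983/solr/select?q={!func}eff_views&fl=id,score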

Here we are indirectly making Solr return the value of the EFF with a function query. We are setting the score in this case to be the value of the EFF, so we should see that the returned documents have the correct views:
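(response trimmed to the relevant part)

<result name="response" numFound="3" start="0" maxScore="2341.0">
  <doc>
    <str name="id">1</str>
    <float name="score">2341.0</float>
  </doc>
  <doc>
    <str name="id">3</str>
    <float name="score">991.0</float>
  </doc>
  <doc>
    <str name="id">2</str>
    <float name="score">34.0</float>
  </doc>
</result>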

Reloading EFFs
Next we want to see how to reload these fields. If we only edit the file dataDir/external_eff_views, Solr will not use the new values automatically. There are two ways in which we can have Solr start using the new values:
– reloadCache
– listener (Solr 4.1+)

If you are an unfortunate user of a Solr version prior to 4.1, you are stuck with reloadCache! To use it, we need to add a request handler in solrconfig.xml like:
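(the handler name /reloadCache is a choice; the class is the reload handler that ships with Solr)

<requestHandler name="/reloadCache"
    class="org.apache.solr.search.function.FileFloatSource$ReloadCacheRequestHandler" />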

and then need to hit the URL
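(again assuming the default host, port and core)

http://localhost:8983/solr/reloadCache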

to have Solr start using the new EFF values. But be very careful that no other process is adding documents to Solr when you hit reloadCache; otherwise, Solr will do a partial commit of that process’s documents and open a new searcher.

If you are using Solr 4.1+, you have the better option of having a listener reload the EFFs after a commit happens. For this, add the following to the <query> element in solrconfig.xml:
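<listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>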

You can either call commit after reloading the values in the file, like:
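(assuming the default host and port)

curl 'http://localhost:8983/solr/update?commit=true'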

or wait till another indexing process you run frequently commits. (In our case, we run a batch process to fetch documents modified in the last 10 min, post them to Solr and do a commit, whether we find any modified document or not. So we do not explicitly commit after reloading the EFFs.)

Retrieve the value of EFF
Next is how to retrieve the value of the external file fields. You cannot simply add them to your fl parameter like

fl=id,eff_views

Solr will not return the value. You need to use this function query hack:

fl=id,field(eff_views)

field(x) basically returns the value of the function query x. But this returns documents like:
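(one doc shown, in JSON response format, with values from our example file)

{
  "id": "1",
  "field(eff_views)": 2341.0
}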

We can alias the field to views by making a small change to the fl param:

fl=id,views:field(eff_views)

and now we will get
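{
  "id": "1",
  "views": 2341.0
}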

Sorting results by the value of EFF
Next, let us see how to sort the documents by EFF values. To sort in descending order, use:

q={!func}eff_views&fl=id,views:field(eff_views)

Here the score is simply the value of eff_views, so Solr will sort the results by views.

To sort in ascending order, you can use:

q={!func}div(1,sum(1,eff_views))&fl=id,views:field(eff_views)

Here the score is 1/(1+views), so the maximum score a doc can get is 1, which is for the docs with no views.

Other Tricks

How do you find docs that have between 100 and 500 views? Again, use the function query hack:

q={!func}map(map(eff_views,0,99,0),501,1000000,0)&fl=id,views:field(eff_views)

and ignore any result that has a score of 0. Here we have used the map function to map any views in the intervals [0, 99] and [501, 1000000] to 0. (This assumes 1M is more than the maximum views any document has.)

Replicating EFFs
If you are using Solr Master/Slave Replication, for Solr to replicate the EFFs, you need to specify the file in the confFiles directive of the replication handler in solrconfig.xml, like:
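(a sketch; the relative path assumes the data directory sits next to the conf directory)

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,../data/external_eff_views</str>
  </lst>
</requestHandler>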

Either the full path of the EFF file or its path relative to solrconfig.xml has to be given. Note that no wild-cards are allowed, so you cannot replicate multiple EFF files with one pattern; each file has to be specified separately. Since the slave commits after replication, the values will be reloaded automatically if you set up the listeners as mentioned above.

ACKNOWLEDGMENTS: thanks to the amazing solr-user mailing list, especially Erick Erickson, Jack Krupansky and Yonik Seeley, who have responded to my various questions on this topic.
