Solr External File Fields

Solr has a very nice feature – “external file fields” (EFFs). If you google it today (Jun 30, 2013), you find these pages:
Lucene External File Field Page
Lucidworks External File Field Page

While both these pages have valuable information, the information is not all in one place, so here is an attempt to consolidate my knowledge and understanding of how to work with EFFs.

We use Solr to serve our company’s browse pages. Our browse pages are similar to a typical Stack Overflow tag page. Such a “browse” page has the question title (which links to the actual page that contains the question, comments and replies), view count, a snippet of the question text, the questioner’s profile info, tags and time information. One thing that can change quite frequently on such a page is the view count. I believe Stack Overflow uses Redis to keep track of the view counts, but we currently have to manage this in Solr, since Solr is our only datastore for serving these browse pages.

The problem before Solr 4.0 was that you could not update a single field in a document. You had to form the entire document first (either by querying Solr or by using an alternate data source that contains all the info), update the view count and then post the entire document to Solr. With Solr 4+, you can do an atomic update of a single field – the Solr server internally handles fetching the entire document, updating the field and updating its index. But atomic updates come with some caveats:

  • you must store all your Solr fields (other than copyFields), which can increase your storage space
  • you must enable updateLog, which can slow down Solr start-up.

For this specific problem of updating one field more frequently than the rest of the document, external file fields (EFFs) can come in quite handy. They have one main restriction though: you cannot search on them directly, i.e. they cannot be used in the q parameter directly. But we will see how we can circumvent this problem, at least partially, using function query hacks.

EFF set-up
First, the file backing an external file field must be in the dataDir specified in your solrconfig.xml. I like to name external file fields with an eff_ prefix in my Solr schema.xml. Here is the part of my schema.xml relevant to EFFs. Under the types element I have:

where id is my uniqueKey, and under the fields element:

Since the name of my field is eff_views, Solr will look for a file named external_eff_views (the extension does not matter) in my dataDir. The data in this file should be key=value pairs, one per line. So if I have documents with ids 1, 2 and 3 with 2341, 34 and 991 views respectively, this file should contain:
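As a minimal sketch of the key=value layout described above, the three example documents could be serialized as shown here. Python is used only for illustration, and the helper name format_eff_lines is hypothetical; it is not part of Solr.

```python
def format_eff_lines(views_by_id):
    """Render {doc_id: value} as the key=value lines of an external file."""
    # One "key=value" pair per line; keys are uniqueKey values (id here).
    return "".join(f"{doc_id}={value}\n" for doc_id, value in views_by_id.items())


# The three example documents from the text: ids 1, 2 and 3.
example = {"1": 2341, "2": 34, "3": 991}
print(format_eff_lines(example), end="")
```

The printed text is what would be saved as dataDir/external_eff_views.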

Now we can check if Solr loads the above data. Start up the Solr server and issue the following query:
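A query of this kind could be built as in the sketch below. The host, port and handler path are assumptions for a local install, and the helper name is hypothetical; the q value follows the {!func} usage shown later in this post, with fl asking for the id and the score.

```python
from urllib.parse import urlencode

# Assumed URL of a local Solr select handler; adjust host, port and core.
SOLR_SELECT = "http://localhost:8983/solr/select"


def eff_score_query_url(eff_field):
    """Build a select URL whose score is the value of the given EFF."""
    params = {
        "q": "{!func}" + eff_field,  # function query: score = EFF value
        "fl": "id,score",            # return only the id and the score
    }
    return SOLR_SELECT + "?" + urlencode(params)


print(eff_score_query_url("eff_views"))
```

urlencode takes care of escaping the braces, the exclamation mark and the comma, which is easy to get wrong when pasting such a query into a script by hand.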

Here we are indirectly making Solr return the value of the EFF with a function query. We are setting the score in this case to be the value of the EFF, so we should see that the returned documents have the correct views:

Reloading EFFs
Next we want to see how to reload these fields. If we only edit the file dataDir/external_eff_views, Solr will not use the new values automatically. There are two ways in which we can have Solr start using the new values:
– reloadCache
– listener (Solr 4.1+)

If you are an unfortunate user of a Solr version prior to 4.1, you are stuck with reloadCache! To use this, we need to add a request handler in solrconfig.xml like:

and then need to hit the URL

to have Solr start using the new EFF values. But be very careful that no other process is adding documents to Solr when you hit reloadCache. Otherwise, Solr will do a partial commit of that process and open a new searcher.

If you are using Solr 4.1+, you have the better option of having a listener reload the EFFs after a commit happens. For this, add the following to solrconfig.xml:

You can either call commit after reloading the values in the file, like:

or wait until another indexing process that you run frequently commits. (In our case, we run a batch process to fetch documents modified in the last 10 minutes, post them to Solr and do a commit, whether or not we find any modified documents. So we do not explicitly commit after reloading the EFFs.)
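The update-then-commit step could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the base URL assumes a local Solr install, and writing to a temporary file before renaming is my own precaution so that a half-written file is never visible, not something Solr requires.

```python
import os
import tempfile


def replace_eff_file(data_dir, field_name, views_by_id):
    """Atomically replace dataDir/external_<field_name> with new values."""
    target = os.path.join(data_dir, "external_" + field_name)
    # Write to a temporary file in the same directory, then rename it over
    # the old file so a reader never sees a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=".eff_tmp_")
    with os.fdopen(fd, "w") as tmp:
        for doc_id, value in views_by_id.items():
            tmp.write(f"{doc_id}={value}\n")
    os.replace(tmp_path, target)
    return target


def commit_url(solr_base="http://localhost:8983/solr"):
    """URL that asks Solr to commit; the base URL is an assumption."""
    return solr_base + "/update?commit=true"


# Demonstration against a throwaway directory instead of a real dataDir.
with tempfile.TemporaryDirectory() as demo_dir:
    path = replace_eff_file(demo_dir, "eff_views", {"1": 2342, "2": 34, "3": 991})
    with open(path) as f:
        print(f.read(), end="")
print(commit_url())
```

If another process already commits regularly, as in our set-up, only the file replacement is needed and the commit URL can be left out.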

Retrieve the value of EFF
Next is how to retrieve the value of the external file fields. You cannot simply add them to your fl parameter like

fl=id,eff_views

Solr will not return the value. You need to use this function query hack:

fl=id,field(eff_views)

field(x) is a function query that returns the value of the field x. But this returns documents like:

We can alias the field to views by making a small change to the fl param:

fl=id,views:field(eff_views)

and now we will get
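To illustrate what the alias buys you on the client side, here is a sketch that reads the views out of a response. The JSON below is a hypothetical response (assuming wt=json) built from the three example documents in this post, not captured Solr output, and the helper name is made up for this example.

```python
import json

# Hypothetical JSON response for fl=id,views:field(eff_views), using the
# three example documents from this post. Not real Solr output.
raw = """
{"response": {"numFound": 3, "start": 0, "docs": [
  {"id": "1", "views": 2341.0},
  {"id": "2", "views": 34.0},
  {"id": "3", "views": 991.0}
]}}
"""


def views_by_id(response_text, key="views"):
    """Map each document id to the value stored under the aliased key."""
    docs = json.loads(response_text)["response"]["docs"]
    return {doc["id"]: doc[key] for doc in docs}


print(views_by_id(raw))
```

Without the alias, the same helper would have to be called with key="field(eff_views)".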

Sorting results by the value of EFF
Next, let us see how to sort the documents by EFF values. To sort in descending order, use:

q={!func}eff_views&fl=id,views:field(eff_views)

Here the score is simply the value of eff_views, so Solr will sort the results by views.

To sort in ascending order, you can use:

q={!func}div(1,sum(1,eff_views))&fl=id,views:field(eff_views)

Here the score is 1/(1+views), so the maximum score a doc can get is 1, which is for the docs with no views.
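The effect of this score can be checked offline with a small sketch. It only mirrors the arithmetic of div(1,sum(1,eff_views)) on the three example documents; nothing here talks to Solr.

```python
def ascending_score(views):
    """Mirror div(1,sum(1,eff_views)): the score shrinks as views grow."""
    return 1.0 / (1.0 + views)


views_by_id = {"1": 2341, "2": 34, "3": 991}
# Solr returns the highest score first, so sort by score in descending order.
ranked = sorted(views_by_id, key=lambda d: ascending_score(views_by_id[d]), reverse=True)
print(ranked)  # ['2', '3', '1']
```

The document with the fewest views (id 2) comes first, which is the ascending order we wanted.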

Other Tricks

How do you find docs that have between 100 and 500 views? Again, use the function query hack:

q={!func}map(map(eff_views,0,99,0),501,1000000,0)&fl=id,views:field(eff_views)

and ignore any result that has a score of 0. Here we have used the map function to map any views in the intervals [0, 99] and [501, 1000000] to 0 (assuming 1M is more than the maximum number of views any document has).
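The nested map calls can be emulated in a few lines to see which view counts survive. This is a sketch of the arithmetic only; solr_map is a hypothetical stand-in for Solr's map(x,min,max,target), which replaces values in the inclusive range [min, max] with target.

```python
def solr_map(x, lo, hi, target):
    """Emulate map(x,min,max,target): values in [lo, hi] become target."""
    return target if lo <= x <= hi else x


def range_score(views):
    """Emulate map(map(eff_views,0,99,0),501,1000000,0)."""
    return solr_map(solr_map(views, 0, 99, 0), 501, 1000000, 0)


sample = [0, 50, 100, 300, 500, 501, 2341]
print([v for v in sample if range_score(v) != 0])  # [100, 300, 500]
```

This assumes view counts are whole numbers; a value such as 99.5 would fall between the two mapped intervals and keep a non-zero score.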

Replicating EFFs
If you are using Solr master/slave replication and want Solr to replicate the EFFs, you need to specify the file in the confFiles directive of the replication handler in solrconfig.xml, like:

Either the full path of the external file or its path relative to solrconfig.xml has to be given. Note that wildcards are not allowed, so to replicate multiple EFF files each file has to be specified separately. Since the slave will commit after replication, the values will be reloaded automatically if you set up the listeners as mentioned above.

ACKNOWLEDGMENTS – thanks to the amazing solr-user mailing list, especially Erick Erickson, Jack Krupansky and Yonik Seeley, who have responded to my various questions on this topic.


15 thoughts on “Solr External File Fields”

  1. This is a great consolidation of EFF usage. I like the map trick for finding documents within a value range, but wonder why it doesn’t work for a filter query: fq={!func}map(map(eff_views,0,99,0),501,1000000,0)&fl=id,views:field(eff_views).
    Any ideas?

    1. No, you cannot search for them directly using the q param. But you can use the hacks specified above and search. Adapt the query above which gets docs with views between 100 and 500 for a specific value.

  2. Great post! The Solr docs need more consolidated knowledge on a specific feature with examples like these. Thank you for your time in writing this, as you’ve definitely saved some of mine today. Much appreciated.

    1. Yes.

      If you want your score to depend entirely on the external file field (EFF) and not be influenced by your query, then you can add your query via a filter query. So let’s say we want to sort our documents by the views, which is an EFF (eff_views), but our search is for documents with id less than 100. Then we can use:

      q={!func}eff_views&fq=id:[* TO 99].

      One caveat with filter queries is that they get cached. To keep the results of your filter queries from being cached, you can specify {!cache=false}, so your query becomes q={!func}eff_views&fq={!cache=false}id:[* TO 99]

  3. Thanks for this article.
    I’m trying to replicate the eff files, but it seems that the following is true:

    1) it’s not possible to replicate files *outside* of the core conf/ folder
    2) even if I place my files in the conf/ folder (they still won’t be picked up, but it doesn’t matter for now) solr doesn’t replicate them. I’m using 4.0.0 here.

    I’m trying to find out examples of eff files replication but apparently it’s not very common out there.

    Do you have any suggestions?

  4. Nice Article.
    Can someone tell how are they handling writes to external file? I mean if external file field is used to track popularity/rank then should that file be updated anytime a rank value needs to added for a document?
    Does Solr support an api to add a record to ExternalFile or should that be handled outside of Solr ?

    Thanks
    Kanth
