Monthly Archives: September 2013

Solr Regex Query Tutorial

Solr has supported regular expression (regex) queries since version 4.0. Its regex syntax comes from Lucene and differs somewhat from Java and Perl/Python regexes.

First, we need to decide whether the regex should match against the individual tokens in a field or against the entire field value. Recall that in Solr a string field is indexed verbatim (essentially as a single token), whereas text fields are split into tokens. Regex matching happens against each token for text fields and against the full content for string fields. So once a piece of text is tokenized, a regex query cannot match across word boundaries.

For example, this piece of text

Hello World

gets split into two tokens {Hello, World} when stored in a text field that only has the WhitespaceTokenizer, but is stored “as is” in a string field.

Let us say we have a Solr document like:

where content_str is a string field and content_ws is a whitespace-tokenized field.
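
As a rough illustration, here is a minimal Python sketch of what such a document might look like and which terms a regex would be tested against in each field. The `id` value and the `terms_for` helper are assumptions made for this example, and `str.split()` only approximates what the WhitespaceTokenizer does.

```python
# Hypothetical document; only the two field names come from the text above.
doc = {
    "id": "1",  # assumed id, not taken from the original post
    "content_str": "Hello World",
    "content_ws": "Hello World",
}


def terms_for(field, value):
    """Return the terms a regex query would be compared against (approximation)."""
    if field.endswith("_ws"):
        # Whitespace-tokenized field: one term per whitespace-separated word.
        return value.split()
    # String field: the whole value is a single term.
    return [value]


for field in ("content_str", "content_ws"):
    print(field, terms_for(field, doc[field]))
```

The string field yields one term containing the space, while the tokenized field yields two separate terms, which is why no regex can span both words there.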

Let us perform some regex searches. We need to enclose our regex between forward slashes as shown below. The following string regex searches will match the above document:

(same as the regular search q=content_str:"Hello World".)

(matches strings beginning with He and ending with ld.)

(matches strings that have lo Wo somewhere.)

The following are token-level matches and match the above doc:

(matches the exact token Hello)

(matches tokens that have ell somewhere in them.)

Notice that, unlike Perl, Solr has no anchoring syntax (^ and $). Instead, every Solr regex is implicitly anchored at both ends, as if it began with ^ and ended with $. If you do not want your search to be anchored, you need to add .* on both sides of your regex.

The following searches do not match the document.

(searches for exact contents being ello Worl)

(tokenized fields cannot be searched across word boundaries)
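
The matches and non-matches described above can be approximated in Python, with `re.fullmatch` standing in for Solr's implicit anchoring. This is only a sketch: Python's `re` is not the Lucene regex engine, the `solr_like_match` helper is hypothetical, and the patterns are reconstructions from the descriptions in the text, not necessarily the exact queries originally used.

```python
import re


def solr_like_match(pattern, value, tokenized):
    """Approximate a Solr regex query: the pattern must match a whole term."""
    # Tokenized fields are tested term by term; string fields as one term.
    terms = value.split() if tokenized else [value]
    return any(re.fullmatch(pattern, term) for term in terms)


text = "Hello World"

# (pattern, tokenized field?, expected result)
# In Solr the pattern goes between slashes, e.g. content_str:/He.*ld/
cases = [
    ("Hello World", False, True),   # exact content of the string field
    ("He.*ld", False, True),        # begins with He and ends with ld
    (".*lo Wo.*", False, True),     # contains "lo Wo" somewhere
    ("Hello", True, True),          # exact token
    (".*ell.*", True, True),        # token containing "ell"
    ("ello Worl", False, False),    # anchored, so a substring is not enough
    (".*lo Wo.*", True, False),     # no single token spans the space
]

for pattern, tokenized, expected in cases:
    result = solr_like_match(pattern, text, tokenized)
    assert result == expected
    print(f"/{pattern}/ tokenized={tokenized} -> {result}")
```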


Let’s look at one more document for doing digit matching.

To match a digit, use [0-9]. (Neither \d nor \\d works.)

The query

matches any string which has 201 followed by a digit anywhere in it, so matches document 2.
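
When such a query is sent over HTTP, the slashes and brackets need URL encoding. The sketch below builds a request URL for a pattern matching 201 followed by a digit. The host, core name (`collection1`), field name, and the `regex_query_url` helper are all assumptions for illustration, and the pattern is a reconstruction from the description above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def regex_query_url(base_url, field, pattern):
    """Build a select URL whose q parameter is a regex query on one field."""
    # Solr regex queries are written as field:/pattern/
    query = f"{field}:/{pattern}/"
    # urlencode percent-encodes the slashes, colon, and brackets.
    return f"{base_url}?{urlencode({'q': query})}"


# Assumed local Solr instance and core name.
url = regex_query_url(
    "http://localhost:8983/solr/collection1/select",
    "content_str",
    ".*201[0-9].*",
)
print(url)

# Decoding the URL again recovers the original query string.
print(parse_qs(urlsplit(url).query)["q"][0])
```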

Here are some more examples:

The ? pattern for 0 or 1 occurrences of a character works. So

will match strings that end either with 201 or with 20.

matches 2 followed by either 2 or 3 zeroes.

matches both 1970 and 1971.

matches 20 followed by a lower-case letter, as in 20s, 20th, etc.

matches 20 not immediately followed by another digit.

matches both Hello and hello.
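
For readers who want to experiment, the behaviours listed above can be tried out in Python. The patterns here are reconstructions that fit each description, not necessarily the exact queries originally used, and Python's `re.fullmatch` only approximates Solr's anchored, Lucene-based matching. The `matches` helper and the sample strings are assumptions.

```python
import re


def matches(pattern, text):
    """Whole-string match, mimicking Solr's implicit anchoring."""
    return re.fullmatch(pattern, text) is not None


# (pattern, strings that should match, strings that should not)
examples = [
    (".*201?", ["year 201", "year 20"], ["year 2013"]),   # ends with 201 or 20
    ("20{2,3}", ["200", "2000"], ["20", "20000"]),        # 2 plus 2 or 3 zeroes
    ("197[01]", ["1970", "1971"], ["1972"]),              # either year
    ("20[a-z]+", ["20s", "20th"], ["2013"]),              # 20 plus lower-case letters
    (".*20[^0-9].*", ["20th", "20 years"], ["2013"]),     # 20 not followed by a digit
    ("[Hh]ello", ["Hello", "hello"], ["HELLO"]),          # either capitalization
]

for pattern, should_match, should_not_match in examples:
    for text in should_match:
        assert matches(pattern, text)
    for text in should_not_match:
        assert not matches(pattern, text)
    print(f"/{pattern}/ behaves as described")
```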

WARNING: Note that regex matching does not work against int, float, long and double fields.



Speeding up MySQL data restore

A task we routinely do is to take a dump of our production databases and load it on the MySQL server running on our local machines. Each of our databases mostly uses a single storage engine: for write-heavy, transactional storage we use InnoDB tables in one database, and for read-only tables we use MyISAM tables in another. The InnoDB database dump is about 20 GB and the MyISAM one about 30 GB. We run mysqldump on one of our production slave boxes to get the dumps and make them available for developers to download.

Once the data dumps are downloaded, data restore on our local boxes is done with the following command:
mysql -uUSER -pPASSWORD DATABASE < DUMPFILE.sql

We can tweak the following variables in MySQL’s my.cnf for data restore speed-ups:

MyISAM – key_buffer_size (max. allowed = 4G on 32-bit platforms; 64-bit builds accept larger values)
Set this to 30% of the memory on your box. I have an 8GB box, so I set it to 2400M.

InnoDB – innodb_flush_log_at_trx_commit
Temporarily set this to 2. After the import is done, revert it to 1.
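
The two settings above can be turned into a my.cnf fragment with a little arithmetic. The following Python sketch applies the 30% rule and the 4G ceiling mentioned in this post. The `restore_tuning` helper is hypothetical, and treating an 8GB box as 8000 MB is an assumption made to reproduce the 2400M figure.

```python
def restore_tuning(total_ram_mb, percent=30, cap_mb=4096):
    """Return my.cnf lines for a faster restore, based on the rules above."""
    # 30% of RAM for the MyISAM key buffer, capped at the 4G ceiling cited above.
    key_buffer_mb = min(total_ram_mb * percent // 100, cap_mb)
    return "\n".join([
        "[mysqld]",
        f"key_buffer_size = {key_buffer_mb}M",
        # Temporary setting for the import; revert to 1 afterwards.
        "innodb_flush_log_at_trx_commit = 2",
    ])


print(restore_tuning(8000))
```

Remember to restart MySQL after editing my.cnf, and to set innodb_flush_log_at_trx_commit back to 1 once the import has finished.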


PDF Printing Problem on Ubuntu Solved

For some time now, I have been having problems printing PDF files on Ubuntu 12.04 with a Brother MFC (440CN) printer. I can view the PDF and it looks fine in the document viewer, but when I print it, the printer produces either a blank page or only a few lines at the top.

Here is one fix that worked for me.

  1. Convert the PDF file to a PS (PostScript) file with the pdf2ps command. By default the output is written to myfile.ps:
    pdf2ps myfile.pdf
  2. Then use the lpr command to print the PS file:
    lpr myfile.ps