Monthly Archives: November 2012

Unix/Perl script to monitor process memory usage

We have been running into some problems with Solr/Lucene’s fieldCache lately at work. We have a lot of dynamic fields on which we do sorting. When you sort on a field, Solr populates the underlying, unconfigurable Lucene field cache with an entry. The number of entries made is directly proportional to the number of documents in your index. In our case, we have 14 million documents, so a sort on one field populates the field cache with 14 million integers, plus more depending on how many distinct values that field takes. Since we have about 250 of these dynamic fields and we can sort on any of them, the field cache builds up pretty quickly. Within about 6 to 7 hours of operation, tomcat was consuming about 90% of the memory on the 4 GB box it runs on, and we started getting OutOfMemory exceptions because the heap space was full.

The only temporary work-around, until we rework our Solr schema, seemed to be to restart tomcat automatically whenever its total memory consumption goes over 80% or so. In our production environment, this happens every 6 to 7 hours. We did not want to set up a cron job that simply restarts tomcat every 6 or 7 hours. Instead, it should monitor memory usage and restart tomcat only if 80% or more of memory is consumed by it. Of course, this assumes no other memory-intensive process is running on that machine.

While trying the Unix free command, we observed a problem: it was not reflecting what the top command was showing. For example, top said 76.7% of memory was being used by tomcat, but
free -m
gave the following output:

             total       used       free     shared    buffers     cached
Mem:          4013       3993         19          0         10       1149
-/+ buffers/cache:       2833       1179
Swap:         1953         12       1940

So how do we get the 76.7% from this output? None of the row ratios gives it: 3993/4013 is about 99.5% and 2833/4013 is about 70.6%. The underlying problem is that free reports system-wide memory usage, so it cannot tell us how much memory tomcat alone is consuming.

Instead, look at the output of Unix ps aux: its 4th column (%MEM) gives the percentage of physical memory consumed by each process. The output of
ps aux | grep tomcat
is:

binuser    953  0.2 76.7 22136372 3152708 ?    Sl   Nov15   2:19 /usr/lib/jvm/java-7-oracle/bin/java -Djava.util.logging.config.file=/var/lib/tomcat6/conf/logging.properties -Dsolr.home=/var/solr -Xms1g -Xmx3600m -XX:+UseConcMarkSweepGC -Dcom.sun.management.jmxremote.port=7009 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.rmi.server.hostname=XXX.YY.ZZZ.WWW -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/usr/share/tomcat6/endorsed -classpath /usr/share/tomcat6/bin/bootstrap.jar -Dcatalina.base=/var/lib/tomcat6 -Dcatalina.home=/usr/share/tomcat6 -Djava.io.tmpdir=/tmp/tomcat6-tmp org.apache.catalina.startup.Bootstrap start
joe    17191  0.0  0.0   6156   656 pts/0    R+   16:25   0:00 grep --color=auto tomcat

OK, so that has two lines of output, and the first line is the one we want. (The second is the grep we just ran.) Look at the 4th column of the first line and you see 76.7, which is the percentage of physical memory this process is taking.

So here is the Perl script to restart tomcat when the memory consumed by tomcat exceeds 80%:
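A minimal sketch of such a script, assuming tomcat is managed via /etc/init.d/tomcat6 and that the string org.apache.catalina uniquely identifies the tomcat process (adjust both for your setup):

#!/usr/bin/perl
# Restart tomcat when its %MEM (the 4th column of ps aux) exceeds the threshold.
use strict;
use warnings;

my $threshold = 80.0;

# 'grep -v grep' drops our own grep process from the listing.
my @lines = `ps aux | grep org.apache.catalina | grep -v grep`;

for my $line (@lines) {
    my @cols = split /\s+/, $line;
    my $mem  = $cols[3];    # 4th column: percentage of physical memory
    if ($mem > $threshold) {
        print scalar(localtime), ": tomcat at $mem% memory, restarting\n";
        system('/etc/init.d/tomcat6', 'restart');
        last;
    }
}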

Put that in root’s crontab (since you are restarting a service) to run every minute, and log the output to a file so you can tell when tomcat was last restarted. For example (the script and log paths here are illustrative):
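* * * * * /root/bin/restart_tomcat.pl >> /var/log/tomcat_restart.log 2>&1

So now I can go to sleep peacefully :-).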

Easy Solr Tutorial – Part 3

Prev: Easy Solr Tutorial Part 1, Easy Solr Tutorial Part 2

In this part, let us try to add some content and search for it using Solr.

Again, open up myfirstdoc.xml file and enter the following:
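A sketch of such a document, built around the fields we will search on in this part (the actual values are illustrative):

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="sku">1234A</field>
    <field name="name">Apple MacBook Pro</field>
    <field name="manu">Apple Inc.</field>
    <field name="cat">electronics</field>
    <field name="cat">computer</field>
    <field name="features">Retina display</field>
    <field name="features">Backlit keyboard</field>
    <field name="price">1999.99</field>
  </doc>
</add>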

Recall that we have already posted one document with id = doc1. If we post another document with the same id, it will overwrite the old document.

Post this document to Solr as before using:
java -jar post.jar myfirstdoc.xml

Then hit the search URL
http://localhost:8983/solr/collection1/select?q=id:doc1.
Your output should be:
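With a document like the one above, the response looks roughly like this (your QTime and _version_ values will differ):

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">id:doc1</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">doc1</str>
      <str name="sku">1234A</str>
      <str name="name">Apple MacBook Pro</str>
      <str name="manu">Apple Inc.</str>
      <arr name="cat">
        <str>electronics</str>
        <str>computer</str>
      </arr>
      <arr name="features">
        <str>Retina display</str>
        <str>Backlit keyboard</str>
      </arr>
      <float name="price">1999.99</float>
      <str name="price_c">1999.99,USD</str>
      <long name="_version_">1418212345678901248</long>
    </doc>
  </result>
</response>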

You can see that fields marked multiValued="true" in schema.xml (cat and features here) are returned as arr elements, i.e., as arrays. Notice that both string fields and text fields are returned in str elements.

You can see that there is an extra field named price_c, which we did not have in myfirstdoc.xml. It is a Solr copyField. If you look at schema.xml, you can see this:
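In the Solr 4.0 example schema, the line is:

<copyField source="price" dest="price_c"/>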

So whatever we enter in field price gets copied into price_c. But what’s the deal with the ,USD at the end of price_c? For that scroll a little bit above in your schema.xml and you will see this:
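Namely, this dynamic field declaration:

<dynamicField name="*_c" type="currency" indexed="true" stored="true"/>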

That introduces us to dynamicField. You can think of dynamic fields as fields with wild-card names. Any field that is not explicitly declared in schema.xml but ends in _c will be captured by this declaration. price_c matches it and so takes its type and other attributes.

While the type of price is float, the type of price_c is currency. This field type appends a comma followed by the currency code; USD is the default (defaultCurrency="USD" in the fieldType declaration). (This is an advanced field type and I read about it only now :-). If you are interested, the Solr reference page for CurrencyField is here: https://wiki.apache.org/solr/CurrencyField.)

copyFields are used as above for indexing the same value with different types. You can also use copyField to copy multiple field values into one catch-all field. You can see that there is a field in schema.xml named text (it could have been named catchall; text is a very confusing name in this context), which is of type text_general:
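<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>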

(This field is not stored, which is why you do not see it in the output. But you can search on it since it is indexed. We will see how shortly.)

And schema.xml also has the following:
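A few of the relevant copyField directives (the full list is longer, as noted below):

<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
<copyField source="includes" dest="text"/>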

This means all those source fields are copied into the text field. You can search schema.xml for dest="text" to see all the fields that are copied. The analysis done for the text field is that of text_general.

OK, so much for indexing and the schema. Let us do some searching now. Say we want to search by sku. Hit:
http://localhost:8983/solr/collection1/select?q=sku:1234A
Remember the SQL equivalent of this would be:
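-- treating the Solr core as a table, purely as an analogy
SELECT * FROM collection1 WHERE sku = '1234A';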

If we want to search by name, hit
http://localhost:8983/solr/collection1/select?q=name:macbook

Notice that the name field is of type text_general, so lower-casing happens both at index time and at query time. So we can also search like:

http://localhost:8983/solr/collection1/select?q=name:MaCbOoK

and our doc will be returned.

Let’s try the catch-all field text now.

http://localhost:8983/solr/collection1/select?q=text:MaCbOoK

and that too returns our document as expected.

Let us also look into the manu_exact field. In schema.xml, this is defined as
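In the Solr 4.0 example schema, it is declared as a non-stored string field, populated from manu via a copyField:

<field name="manu_exact" type="string" indexed="true" stored="false"/>
<copyField source="manu" dest="manu_exact"/>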

The comment makes it clear that we won’t typically be using this field for searching. Instead, it is useful for sorting and grouping (which we will see in a later part). In any case, to search using manu_exact, you will need to search for the exact name of the manufacturer, i.e. “Apple Inc.”
http://localhost:8983/solr/collection1/select?q=manu_exact:"Apple Inc."
If you search for manu_exact:Apple or manu_exact:apple, you won’t get any results.


Easy Solr Tutorial – Part 2

Prev: Easy Solr Tutorial Part 1

Let us look at schema.xml again and check out the other fields. There are a lot of simple fields like the following:
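For example, near the top of the fields section you will find declarations like these:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>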

Let us look at the third field, which is our first example of a text field:
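It is the name field:

<field name="name" type="text_general" indexed="true" stored="true"/>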

This field’s type is text_general, which is one kind of text field. If you look below in schema.xml you will see its definition:
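From the Solr 4.0 example schema (slightly abbreviated):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>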

There are two analyzer elements, one for indexing and another for querying. The first element inside an analyzer is always a tokenizer, which is followed by one or more filters. These are basically a chain of processors that will process the text.

The reference page for analyzers, tokenizers and filters is https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

A tokenizer typically splits a given piece of text into tokens. As a simple example, the text
It is a beautiful day
has five tokens (which are the same as the words in this case).

The best way to learn what a tokenizer is doing is by using the Solr analysis admin tool at http://localhost:8983/solr/#/collection1/analysis. (You can also get to this page by clicking on collection1 from your main Solr admin page and then clicking on Analysis.)

From the drop-down named “Analyse Fieldname / FieldType:”, choose name, which is the field we are looking at. (Alternatively, you can choose the field type text_general itself, which is further down in the drop-down.) Let us first see what happens at index time on this field. Under “Field Value (Index)”, enter the following text:

The quick ("brown") I.B.M. fox can jump 4.9 ft, huh?!?

Uncheck “Verbose Output” for now. Then click “Analyze values” on the far right. The first line of output is:

ST The quick brown I.B.M fox can jump 4.9 ft huh

ST stands for StandardTokenizer. The StandardTokenizer has nicely stripped off all the special characters and output only the words. It is also smart enough to leave “I.B.M” and “4.9” as single tokens.

By now, you would have guessed that the 2nd and the 3rd lines of output beginning with SF and LCF are from the StopFilter and the LowerCaseFilter. You are right!

The StopFilter strips out tokens that are listed in the stopwords.txt file located at SOLR_HOME/example/solr/collection1/conf/stopwords.txt, where SOLR_HOME is the directory to which you extracted Solr. It is empty as of now, but you can add words like a, the, and so on, to filter them out.

The LowerCaseFilterFactory simply lower-cases whatever is input to it. You can see that I.B.M has become i.b.m now, which may not be the desired result. In such cases, you can use one of the other text types or define your own type.

You can play around with feeding different inputs to different field types and get a clear understanding of what is happening to the text you index in Solr. Similar to seeing what happens at index time, you can see what the query-time analyzer chain does by feeding input to the text box named “Field Value (Query)”.

In the next part we will add some meaningful content to Solr.

Prev: Easy Solr Tutorial Part 1
Next: Easy Solr Tutorial Part 3


Easy Solr Tutorial – Part 1

Solr is arguably the very best search platform today. When I read the official Solr 4.0 tutorial, I thought it was too much for a tutorial. Here is my attempt at a much simplified Solr tutorial.

We will use the latest Solr version, 4.0. (It needs Java 1.6 or greater.) I will assume Java is installed on your computer. If not, go to the Oracle website and install Java 1.6 or 1.7. (You don’t need to know Java to use Solr.)

Next download Solr 4.0 and extract the contents to a directory like /home/joe/apache-solr-4.0.0. Let’s run Solr:

$ cd /home/joe/apache-solr-4.0.0/example
$ java -jar start.jar

It should print a bunch of lines and end with something like:

...
INFO: SolrDispatchFilter.init() done
2012-11-06 21:04:56.940:INFO:oejs.AbstractConnector:Started SocketConnector@0.0.0.0:8983

That means success!

Now you can go to http://localhost:8983/solr in a browser and see a page like this:

[Screenshot: Solr admin page]

Now we will deviate from the main Solr tutorial on the Solr site.

You can see in the screenshot that there is collection1 on the left. Technically, collection1 is what is called a Solr core. If you are coming from the database world, you can think of a core as a table. Just like a database table holds rows, a Solr core holds documents. And just like a database table has a schema, each Solr core has a corresponding schema, defined in schema.xml. If you click on collection1, you will get a drop-down which includes a link to schema.xml. The file is very well documented with comments. The main elements in there are fields and types. Continuing with our database analogy, fields are like columns and types are like column data types (INT, CHAR, TEXT, etc.). The beauty is that in Solr you can customize the types per your needs.

Let us begin with the first field we find in the schema:
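In the Solr 4.0 example schema, this is the id field:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />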

  • It is of string type, meaning it is indexed AS IS (as opposed to text field types which may process the value before indexing).
  • It is indexed, which means Solr will index this field and we will be able to search on it. Contrary to the database world, in Solr you cannot search or sort on a field that is not indexed! Also, indexing in Solr is at the field level (indexed is either true or false) and there are no compound indexes like in databases.
  • It is stored, which means we can retrieve the value in our queries.
  • It is required, meaning every document must have it. (If you inspect the schema, you will see that this is the only required field. In fact, id is also the primary key for our documents. You can see below in schema.xml that it is declared as the unique key: <uniqueKey>id</uniqueKey>.)
  • This field is not multiValued. Solr supports multiValued fields, or in other words, arrays! In the database world, we would usually store arrays as comma-separated strings, split them in code, and then search them. (If interested, you can check our tutorial on how to work with comma-separated values in MySQL.) In Solr, you can directly search for the array values!

Since this is the only required field, let us try adding our first document. First do:

$ cd /home/joe/apache-solr-4.0.0/example/exampledocs

Create your first document in this folder using your favorite text editor and save it as myfirstdoc.xml. Our document will only have the id field in it.
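Since it needs nothing but the id, the file is essentially this:

<add>
  <doc>
    <field name="id">doc1</field>
  </doc>
</add>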

We will need to POST this document to Solr for it to get indexed. You can post the doc using the following command:

$ java -jar post.jar myfirstdoc.xml

It should print:
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
POSTing file myfirstdoc.xml
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update..

Great! Now let’s search for this document in Solr. Hit the URL: http://localhost:8983/solr/collection1/select?q=id:doc1. This is the equivalent of something like
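-- treating the Solr core as a table, purely as an analogy
SELECT * FROM collection1 WHERE id = 'doc1';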

in database world.

You should see output like the following:
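It will be along these lines (your QTime and _version_ values will differ):

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="q">id:doc1</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">doc1</str>
      <long name="_version_">1417992345678905344</long>
    </doc>
  </result>
</response>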

  • A status of 0 means all is well.
  • QTime of 2 means Solr took 2 milliseconds to finish the search.
  • params is simply showing you what your query is.
  • The result element is what has the search results.
  • It has the attribute numFound = 1, which is the number of documents that matched the query.
  • start is the offset at which the result list begins.
  • Of course, we got back the id field our document had.
  • And Solr adds _version_ by default.

Next: Easy Solr Tutorial Part 2
