Solr/Lucene Score Tutorial

In this post, we will try to understand how Solr/Lucene’s default scoring mechanism works.

Whenever we do a Solr search, we get result documents sorted in descending order of their scores. We can explicitly ask Solr to return the scores in search results by adding score to the fl parameter. For example, here is a query:

and here are the top five results:

Here is the fieldType definition for name field:
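
A minimal sketch of such a definition (not the exact schema; it assumes the standard tokenizer plus the word-delimiter, synonym and lower-case filters that come up later in this post):

    <fieldType name="text_name" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>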

We want to understand how Solr calculated the scores for each of the result documents. (Notice that the 2nd and 3rd results have only one of the search tokens, cricket, present, yet they score higher than the 4th and 5th results, which have both search tokens in them!)

We can have Solr “explain” its score calculations by adding param debugQuery and setting it to on. We also add wt=xml so we get results in XML which is easier to use for analyzing the scores:
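
For reference, a query along those lines (the host and core name are placeholders) would be:

    http://localhost:8983/solr/mycore/select?q=name:(indian+cricket)&fl=*,score&debugQuery=on&wt=xml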

The results are hard to read as they are. To see an indented output, right click on your browser window and view page source. Then you will see an output like this (for this example I am showing only two results):

(If you want to follow along the explanation below without scrolling up and down, either print the output above, or keep it in another window.)

To understand the above calculation, we need to look into Solr’s default similarity calculation, which is done using Lucene’s DefaultSimilarity. DefaultSimilarity extends Lucene’s TFIDFSimilarity and does something extra. We will look into the “extra” part later. Let’s focus on the Lucene Practical Scoring Function mentioned in the TFIDFSimilarity doc:

  \boxed{\mathrm{s}(q,d) = \Big( \, \sum\limits_{t \, \in \, q} \, \mathrm{tf}(t \, \in \, d) \, \mathrm{idf}^2(t) \, t\mathrm{.getBoost}(\,) \, \mathrm{norm}(t,d) \, \Big) \, \mathrm{coord}(q,d) \, \mathrm{queryNorm}(q)}
where

  • q is the search query (indian cricket in the above example)
  • d is the document against which we are calculating the score
  • t is a token in q (for example, indian).

Here are what the various factors mean:


\mathrm{tf}(t \, \in \, d): square root of the frequency of token t in d (in the field name that we searched). The rationale behind having this as a factor in the score is that we want to give higher scores to documents that have more occurrences of our search terms. Since the term cricket appears twice in the name field of the first document (case does not matter, since the field has a lower-case filter at both index and query time), it gets:

     \begin{gather*} \mathrm{tf}(\mathrm{cricket} \, \in \, \mathrm{doc} \, 1856209) = \sqrt2 = 1.4142135. \end{gather*}


\mathrm{idf}(t): inverse document frequency. Before understanding what inverse document frequency is, let us first understand what document frequency, df(t), is. It is the number of documents (out of all the documents in our Solr index) which contain token t in the field that is searched. And idf(t) is defined as:

     \begin{gather*} \mathrm{idf}(t) = 1 + \mathrm{log}_e \Big( \frac{\mathrm{numDocs}}{1 \, + \, \mathrm{df}(t)} \Big) \end{gather*}

where numDocs is the total number of documents in our index. The rationale behind having this as a factor is to give lower scores to more frequently occurring terms and higher scores to rarer terms. In our example search, the token indian occurs in 209 documents (see the explain output above where it says docFreq=209) but cricket occurs only in 57 documents, so cricket gets a higher score.

For token indian, with a document frequency of 209 and maxDocs being 198488, we get idf(indian) = 7.8513765.
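
Plugging these numbers into the idf formula:

     \begin{gather*} \mathrm{idf}(\mathrm{indian}) = 1 + \mathrm{log}_e \Big( \frac{198488}{1 \, + \, 209} \Big) = 1 + \mathrm{log}_e(945.18\ldots) \approx 7.8513765 \end{gather*}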

(One strange thing I noticed: though the formula says numDocs, we can see above that Solr actually uses maxDocs, which also includes deleted documents that have not been purged yet!)


t\mathrm{.getBoost}(\,): Query time boost for token t. We can give higher weights to certain search tokens by boosting them. For example, name:(indian^5 cricket) would boost token indian by 5. Since we have not specified any query time boost, the default multiplicative boost of 1 is applied.


\mathrm{norm}(t,d): This is reported as fieldNorm by Solr. It is the product of index time boosts and a length normalization factor. It is more precisely defined on the field f we are searching against (i.e. name for our example above), rather than token t:

     \begin{gather*} \mathrm{norm}(f, d) = d.\mathrm{getBoost()} \times f.\mathrm{getBoost()} \times \mathrm{lengthNorm}(f) \end{gather*}

d.\mathrm{getBoost()} is the index time boost for document d. f.\mathrm{getBoost()} is the index time boost for the field f. These are straightforward to understand. (In our example above, there are no index time boosts applied, so they are all 1, which Solr does not report.)

The third factor, \mathrm{lengthNorm}(f), is defined as the reciprocal of the square root of the number of terms in field f. The rationale behind this is that we want to penalize documents with longer field values, because a search term has a better chance of occurring in them simply because they are longer; shorter field values that match are likely to be more precise matches for our query.
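
In symbols, writing numTerms(f) for the number of terms in field f of the document:

     \begin{gather*} \mathrm{lengthNorm}(f) = \frac{1}{\sqrt{\mathrm{numTerms}(f)}} \end{gather*}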

We can use Solr’s analysis admin tool to find out how many terms our field gets. For example, with the way I have my index time analysis chain set up for field name, for field value Best Captain of Indian National Cricket Team (Test Cricket), I get 13 tokens. I get extra tokens because of WordDelimiterFilter and SynonymFilter.

The lengthNorm for our first result is then 1/\sqrt{13} = 0.27735, but we see that Solr is reporting a fieldNorm of 0.25 for this document (which is the same as lengthNorm for us, since the index time boosts are all 1). This is because of that something “extra” that DefaultSimilarity does. DefaultSimilarity optimizes the storage of lengthNorm by converting the decimal value (more precisely, the float) to a single byte. If you dig into the source code, you see that it encodes the float at index time as follows:

If we put that method in a test file and run it, we can see:

This byte value is then decoded during query time for computing the score using this method:

so we get back 0.25, the fieldNorm that Solr reports.
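
If you just want to see this lossy round trip in isolation, here is a minimal sketch; it assumes lucene-core 4.x on the classpath and uses SmallFloat, the helper class that DefaultSimilarity delegates to for encoding and decoding norms:

    import org.apache.lucene.util.SmallFloat;

    public class NormEncodingDemo {
        public static void main(String[] args) {
            float lengthNorm = (float) (1.0 / Math.sqrt(13));      // 0.27735 for our 13-term field
            byte encoded = SmallFloat.floatToByte315(lengthNorm);  // the single byte stored in the index
            float decoded = SmallFloat.byte315ToFloat(encoded);    // what the scorer sees at query time
            System.out.println(lengthNorm + " -> " + decoded);     // prints 0.2773501 -> 0.25
        }
    }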


\mathrm{coord}(q,d): number of tokens in q that are found in doc d divided by number of tokens in q. This factor gives higher score to documents that have more of the search terms.

Since both search terms are found in the first document, coord(indian cricket, doc 1856209) is simply 1 (since coord is a multiplier, Solr does not report the multiplier of 1). For the second document, we get 0.5 since there is only one matching term (cricket) and Solr reports the coord in this case.


\mathrm{queryNorm}(q): The Lucene documentation on this is clear: “queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity produces a Euclidean norm.” The formula is:

     \begin{gather*} \mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}} \end{gather*}

For a Boolean query (like our example), sumOfSquaredWeights is defined as:

     \begin{gather*} \mathrm{sumOfSquaredWeights} = \big( q.\mathrm{getBoost}(\,) \big)^2 \, \sum\limits_{t \, \in \, q} \big( \mathrm{idf}(t) \times t.\mathrm{getBoost}(\,) \big)^2 \end{gather*}

In our case, all boosts are 1, so sumOfSquaredWeights simply becomes

     \begin{align*}  \mathrm{sumOfSquaredWeights} &= \sum\limits_{t \, \in \, q} \big( \mathrm{idf}(t) \big)^2 \\  &= (7.8513765^2 + 9.138041^2) \\  &= 145.1479 \end{align*}

which gives the queryNorm of 1/\sqrt{145.1479} = 0.08300316 which is what Solr reports.

This is arguably the least interesting factor of all, because it is the same for all the documents that match the search.


Now we come back to the anomaly we noticed in the search results above: why do the second and third documents, with only one search term, get a higher score than the fourth and fifth documents, which contain both search terms? The answer has to do with how fieldNorm (i.e. lengthNorm in our case) and coord offset each other. In our example, the second and third documents get a coord of 0.5 each but a lengthNorm of 1 (since their name field has only one token), whereas the fourth and fifth documents get a coord of 1 but a lengthNorm of only 0.25, which is why they end up with a lower score. If this is something you do not desire, you can instruct Solr to omit the norms in the score calculation. This can be done in your schema.xml by specifying omitNorms=true in your field definition. (See Solr documentation on schema.xml.)
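
For example (the field name is from this post; the type name is a placeholder):

    <field name="name" type="text_name" indexed="true" stored="true" omitNorms="true"/>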

Sometimes you may want to not include idf or tf in your score calculations. For example, we have a Solr core on which we search for specific tags with query-time boosts, but we do not want to use idf for this search at all. This can be done by writing our own Similarity class which extends DefaultSimilarity and using that in our fieldType definition. Writing a custom similarity class is a topic for a separate post which I will be writing shortly.

A PHP script for log file analysis

Just want to share a PHP script I wrote for analyzing multiple log files in our log directories. (Disclaimer: this is a very simple script and may not be robust and secure, so use it only on servers which are used internally.)

The motivation for this small project is that we have about a hundred batch processes (built with the Spring Batch framework) that run via cron jobs on our production servers. We use Log4j (with daily log rotation) and all the batch processes log to a folder like

The log file name is the same (batch.log) under all these individual sub-directories. With log rotation, the previous day’s files are automatically moved at midnight to files like batch.log.$year-$month-$day.

Frequently we need to either tail a log file to see when it last ran (or if it is still running), or look for errors, or count how many items got processed by a certain batch process on a certain day. Someone needed to log in to the server via VPN and SSH, change the working directory to the appropriate batch process directory, and analyze the log files with shell commands. This was getting tedious. We wanted a simple web interface with which we could easily tail the log files or look at certain lines with pattern matching (for example, only ERROR level log messages) and also be able to specify the date of the log file.

Since our servers already had Apache HTTP server and PHP installed, we went for a PHP script (like I mentioned in my post on scripting languages). The PHP script simply acts as a nice front-end to execute a bunch of shell commands and displays the output on a web interface.

This script serves two purposes:

  1. list all the sub directories of the individual processes available at BASEDIR/logs/
  2. help tail and analyze the individual process log files

Invoked without any params, it simply lists the names of all the sub-directories. Each of these names is linked to the same script, but these links have an additional param proc, which is the name of the specific batch process sub directory. The script also takes an additional param lines, which is the number of lines we want from the tail of the log file. By default, the links on the main page to the individual processes simply have lines=50 so that the last 50 lines from the log file are displayed. This is implemented using the shell tail command and capturing its output with back ticks.

A date can also be specified to the script via the dt param. If not specified, it simply works on the current day’s log file.

We also wanted to do pattern matching to find lines containing specific patterns. Naturally grep is the shell command for this. We can also specify a Perl regular expression to grep via the -P option, which can be specified to the script below using param pattern. The script also takes two additional params before and after, which correspond to options -B and -A of the grep command.

Here is the script:
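
A minimal sketch of such a script, assuming the directory layout and params described above (harden it before using it anywhere that is not internal):

    <?php
    // Rough sketch: list batch process directories, or tail/grep one process's log.
    $basedir = '/opt/batch/logs';                                   // assumption: BASEDIR/logs

    $proc    = isset($_GET['proc'])    ? basename($_GET['proc']) : '';
    $lines   = isset($_GET['lines'])   ? (int) $_GET['lines']    : 50;
    $dt      = isset($_GET['dt'])      ? $_GET['dt']             : '';
    $pattern = isset($_GET['pattern']) ? $_GET['pattern']        : '';
    $before  = isset($_GET['before'])  ? (int) $_GET['before']   : 0;
    $after   = isset($_GET['after'])   ? (int) $_GET['after']    : 0;

    if (strlen($proc) == 0) {
        // Purpose 1: list the sub-directories, each linked back to this script.
        foreach (scandir($basedir) as $dir) {
            if ($dir == '.' || $dir == '..' || !is_dir("$basedir/$dir")) {
                continue;
            }
            echo "<a href=\"?proc=$dir&lines=50\">$dir</a><br/>\n";
        }
        exit;
    }

    // Purpose 2: today's file is batch.log; older ones are batch.log.YYYY-MM-DD.
    $logfile = "$basedir/$proc/batch.log" . (strlen($dt) ? ".$dt" : '');
    if (!file_exists($logfile)) {
        die("No such log file: " . htmlspecialchars($logfile));
    }

    if (strlen($pattern)) {
        // grep -P for Perl regexes, -B/-A for context lines before/after each match
        $cmd = "grep -P " . escapeshellarg($pattern)
             . " -B $before -A $after " . escapeshellarg($logfile);
    } else {
        $cmd = "tail -n $lines " . escapeshellarg($logfile);        // last N lines
    }

    $output = `$cmd`;                            // back ticks capture the command's output
    echo nl2br(htmlspecialchars($output));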

Here are the PHP functions the script uses with links to their documentation:
scandir
strlen
file_exists
nl2br

Here is an example URL with all the bells and whistles, assuming the script is called track.php and kept under the logs dir of your www folder. We are getting all the lines from the process new-user-signup which match the pattern ERROR (the log4j log level) or Exception, plus 3 lines before and after such matches, for the date 2014-12-01:

http://batch.mycompany.com/logs/track.php?proc=new-user-signup&dt=2014-12-01&pattern=ERROR%7CException&before=3&after=3

Notice that we have to URL encode the params. Here we have URL encoded ERROR|Exception to ERROR%7CException.

So that’s the poor man’s script for a simple web interface for your log files with some powerful functionality :-).

Find sizes of all collections in a MongoDB database

Here is a simple script for the mongo shell that prints the sizes of all the collections in a database in descending order of their total sizes (sum of storage and index sizes). We just combine db.getCollectionNames(), the collection stats command and a bit of JavaScript to achieve this.

Let’s assume we are using database named mydb.
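
Something along these lines works in the mongo shell (getCollectionNames() and the stats() fields are standard; the GB formatting is just one choice):

    db = db.getSiblingDB("mydb");
    var colls = db.getCollectionNames().map(function (name) {
        var s = db.getCollection(name).stats();
        // total size = storage + indexes
        return { name: name, totalSize: s.storageSize + s.totalIndexSize };
    });
    colls.sort(function (a, b) { return b.totalSize - a.totalSize; });
    colls.forEach(function (c) {
        print(c.name + "\t" + (c.totalSize / (1024 * 1024 * 1024)).toFixed(2) + " GB");
    });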

That gives a nice output like this:

Of course, this script is very easy to adapt so that only collections that occupy more than X GB are returned.

Python vs. Perl vs. PHP vs. Bash

There are a lot of posts about this topic! So why one more? This post is just to share my personal experiences with the few scripting languages I have used in the past few years. Contrary to the hot debates about which is better, I consider each of the languages I use, i.e. Python, Perl, PHP and Bash (or shell), as simply a blade in my Swiss Army knife. When a specific task can be accomplished easily with one, I just use it.

Perl

I begin with Perl because that’s the first scripting language I learnt. I have scripted a lot in Perl. Here are the parts I enjoy in Perl:

  1. I owe my love for regular expressions (regex) to Perl. If you want to learn regular expressions, learn them in Perl (or Python or PHP), and definitely not in Java :-). In my job (where we mainly use Java), when people need help with regexes, they go to people with scripting experience. I believe Java developers should first learn regexes in a scripting language and then transition to Java's regex API.
  2. Perl really shines when you need to parse a large text/log file line by line and do a bunch of regex matches.
  3. Perl makes accessing *nix command outputs very easy with back ticks. You can just put back ticks around a long system command and capture its output in a variable (see the sketch after this list).
  4. Perl’s string interpolation of variable values inside double quoted strings is a lovely feature (also shown in the sketch below).
  5. Perl has a lot of one liners. It takes some time to get used to them, but once you learn them, they can be very useful.
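
A tiny sketch of points 3 and 4 (the command and variable names are made up):

    my $uptime = `uptime`;                            # back ticks run the command and capture its output
    my $name   = "world";
    print "Hello, $name! Load right now: $uptime";    # variables interpolate inside double-quoted strings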

And here are a few tasks for which I found Perl to be unnatural or difficult to use:

  1. If I need to work with classes and objects, I do not use Perl. Somehow I feel OOP support in Perl is not straightforward to learn and use.
  2. While Perl hashes have a lot of power, the sigils (especially in multi-dimensional hashes) made my head spin.
  3. The main reason I moved away from Perl was when I started working with MongoDB (from 2012). I am glad that I had learnt some Python by then, so I could compare the two languages for working with mongo. Python’s pymongo was a mature client library, while Mongo’s creator 10gen (now MongoDB, Inc.) did not seem to care much about the Perl driver. The Perl driver documentation was also not clear, and even as of today the Perl driver tutorial page does not load! Also, MongoDB’s JSON documents map seamlessly to Python dictionaries, which IMO are much easier to work with than Perl’s multi-dimensional hashes.

Python

I was using Perl happily until a few years ago, but then started seeing a lot of people using Python, and there were hot debates about which language is better. So I started learning Python from online tutorials. Python has some of the best documentation of any programming language I have ever read. Later, I read The Quick Python Book, Second Edition. And today I am an avid user (and learner) of Python! There are a lot of very nice things about Python and I am not going to list them all here. Just the top points:

  1. I love the Python IDLE. It makes development a lot of fun with Python. Usually I test some small function in IDLE and then copy it over to my Python script in Eclipse-PyDev (which IMO is a great Python code editor).
  2. Like I mentioned before, if I need to work with MongoDB, I use python and pymongo.
  3. For any decent sized project which we intend to keep for some time (especially in cron jobs), Python is my language of choice.
  4. No more curly braces and semi-colons, making code shorter and cleaner.

While Python is my main scripting language of choice these days, I do miss two of Perl’s features: string interpolation and ease of command line output capture. It is possible to do these things in Python, but just not as easily as in Perl.
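
For what it is worth, rough Python equivalents of those two features (standard library only; the command is a placeholder):

    import subprocess

    name = "world"
    print("Hello, {}!".format(name))        # explicit formatting instead of interpolation

    # no back ticks: capture a command's output via subprocess (Python 2.7+)
    out = subprocess.check_output(["uptime"])
    print(out)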

PHP

PHP is my language of choice when I need to build a quick dynamic web page. On most of our servers, PHP and Apache HTTP server are already installed, so building a web page in PHP is very easy. PHP provides a lot of built-in functions. Here are a few I used recently:

scandir – List files and directories inside the specified path
file_exists – Checks whether a file or directory exists
nl2br – Inserts HTML line breaks before all newlines in a string

Since I don’t usually build UI tools, my overall experience with PHP is limited. But if someone asks me to build a UI tool and provide them a link, I will most likely use PHP.

Bash/Shell Script

I love the *nix command line. I also use pipelines to do quick analysis (especially on log files). But if I need to put my code in a script, I stay away from bash and shell scripts. Somehow I find it difficult to just scan a bash script and understand what it does. The code has to be cluttered with comments to explain what it is doing. And if you have experience maintaining code for a while, you know people tend to forget to update comments when they update code. It is much easier for me to read a Python script and understand what it does quickly (without the need for comments).


So that’s the summary of when I use what :-).



BIGINT UNSIGNED value is out of range

Ran into a strange issue with a MySQL query.

Here is a table I got:

The main point to notice is that there are two UNSIGNED columns: num_up_votes and num_down_votes.

I tried to run this query:

and MySQL gave me this error:

ERROR 1690 (22003): BIGINT UNSIGNED value is out of range in '('0' - '18')'

Could not understand why MySQL threw an error for this. Of course, the result is negative, but I am not trying to store anything into an UNSIGNED column.

A Google search led me to the out-of-range and overflow handling page of the MySQL manual, where it states:

Subtraction between integer values, where one is of type UNSIGNED, produces an unsigned result by default.

This is a really strange behavior!

For now, there are two solutions:

  1. Use CAST with SIGNED in the SQL itself, as in the first SELECT in the sketch after this list.

  2. If you are using an interactive client, you can set sql_mode as in the second part of the sketch below, which then allows subtraction between unsigned column values when the difference is negative.
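
A sketch of both fixes; the table and column definitions are made up to match the description above:

    CREATE TABLE votes (
        id             INT UNSIGNED NOT NULL PRIMARY KEY,
        num_up_votes   BIGINT UNSIGNED NOT NULL DEFAULT 0,
        num_down_votes BIGINT UNSIGNED NOT NULL DEFAULT 0
    );

    -- Option 1: cast to SIGNED so the difference is allowed to go negative
    SELECT CAST(num_up_votes AS SIGNED) - CAST(num_down_votes AS SIGNED) AS score
    FROM votes;

    -- Option 2: relax the unsigned subtraction rule for the session
    SET sql_mode = 'NO_UNSIGNED_SUBTRACTION';
    SELECT num_up_votes - num_down_votes AS score FROM votes;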

Redirect output line by line

I often run into situations where I have a (Perl or Python or Shell) script print a bunch of lines to STDOUT. And this is a one-time throw-away script, so additional effort to configure logging is not really worth it. I will run the script and redirect the output to a log file. Typically something like:

But then the lines will not reach that log file until the output buffer is full! So I will need to wait for some time. This is especially painful if my script has a sleep of a few seconds in a loop (like when doing bulk writes to a production database, to which our app is also writing live data).

A useful trick I googled and found out was to use stdbuf.

So I run the script with stdbuf and line buffering these days:

Option o is for STDOUT and e is for STDERR. Setting them both to L, we output line by line. If we want to have no buffering at all, we could set them to 0, but this would not be performant!
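
Putting it together (the script name and log path are placeholders):

    # plain redirection: lines show up only when the output buffer fills
    python myscript.py > /tmp/myscript.log 2>&1

    # line-buffered STDOUT and STDERR via stdbuf
    stdbuf -oL -eL python myscript.py > /tmp/myscript.log 2>&1

    # in another terminal, watch the lines arrive as they are printed
    tail -f /tmp/myscript.log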

Voila! Now I can use

and the log file gets populated one line at a time!

MongoDB – adding a read/write user to a database

The commands are different in different versions of MongoDB for authenticating a user against a database.

Let us say we have to authenticate user joe with password secret as a read/write user for database mydb.

Mongo 2.2:
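
A sketch of the 2.2-era shell helper (verify against the docs for your exact version):

    use mydb
    db.addUser("joe", "secret")    // the optional third argument readOnly defaults to false, i.e. read/write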

Mongo 2.4:
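
A sketch of the 2.4 document form of addUser:

    use mydb
    db.addUser({ user: "joe", pwd: "secret", roles: [ "readWrite" ] })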

Mongo 2.6:
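
A sketch using the 2.6 createUser helper:

    use mydb
    db.createUser({ user: "joe", pwd: "secret", roles: [ { role: "readWrite", db: "mydb" } ] })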

dbAdmin: If you also want to make the readWrite user an admin for the database, in Mongo 2.4 and 2.6 you should add dbAdmin to roles. (I don’t know if an “admin user” is applicable for Mongo 2.2, and if yes, what the command is.)

MongoDB – Some Useful Admin Commands

All the below can be executed from the mongo shell client.

Find queries running for more than a few seconds:
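
A sketch against the 2.x currentOp() output (the 5-second threshold is arbitrary):

    db.currentOp().inprog.forEach(function (op) {
        if (op.secs_running > 5) {
            printjson(op);
        }
    });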

Reduce the amount of output in db.currentOp():
db.currentOp() prints a lot of details. To restrict the amount of output, use the following approach. For example, to print only seconds running, client and db/collection info:
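
For example (field names as reported by 2.x currentOp(); not every operation has all of them):

    db.currentOp().inprog.forEach(function (op) {
        if (op.secs_running !== undefined) {
            print(op.secs_running + "s\t" + op.client + "\t" + op.ns);
        }
    });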

Find queries from a specific client IP:
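
A sketch (10.1.2.3 is a placeholder; client is reported as "ip:port"):

    db.currentOp().inprog.forEach(function (op) {
        if (op.client && op.client.indexOf("10.1.2.3:") == 0) {
            printjson(op);
        }
    });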

Kill queries from a specific client:
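
The same filter, but passing each opid to killOp (again a placeholder IP; be careful with this one):

    db.currentOp().inprog.forEach(function (op) {
        if (op.client && op.client.indexOf("10.1.2.3:") == 0) {
            db.killOp(op.opid);
        }
    });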

Find queries waiting for lock:
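
A sketch using the waitingForLock flag (passing true to currentOp() also includes idle and system operations):

    db.currentOp(true).inprog.forEach(function (op) {
        if (op.waitingForLock) {
            printjson(op);
        }
    });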

Working with Comma Separated Values in MySQL

To begin with, a word of caution on comma-separated strings in MySQL.

Let us say we want to store a list of related IDs for a given ID. (Think similar products or documents.)

Rather than creating a table like this:

with the intent of storing comma-separated values in related_ids column, it is much better to create a table like this:
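
For illustration, the two designs might look like this (table names are made up):

    CREATE TABLE related_csv (
        id          INT UNSIGNED NOT NULL PRIMARY KEY,
        related_ids VARCHAR(1000) NOT NULL      -- comma-separated, e.g. '12,345,678'
    );

    CREATE TABLE related (
        id         INT UNSIGNED NOT NULL,
        related_id INT UNSIGNED NOT NULL,
        UNIQUE KEY uk_related_id_id (related_id, id)
    );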

This second table facilitates:

  1. more queries
  2. more indexes
  3. data consistency

For example, it helps us keep a unique index on (related_id, id) pairs to avoid duplicate related_ids for an id. The index is also used for look-ups by related_id. (Notice that we could have also kept a unique index on (id, related_id) pair to enforce data consistency and a separate index on related_id for queries. But this is inefficient. The unique index on (id, related_id) will only help us with data consistency and not with look-ups by related_id, for which we need the additional index on just related_id.)

However, people are forced to store comma-separated values for scaling reasons. For example, in my company, we have a table where we store for a given ID, fifty related IDs. The number of IDs in our system is about 500K. If we went with the second table, we would be storing 25M rows. While reads are much more frequent than writes, queries do get really slow with 25M rows. So we had to use the first table.


MySQL does have some nice functions to work with comma-separated values which we will see here.

Below, I am using @x, @y, @css (a comma-separated string), etc., for SQL variables. You can set them with SET statements, as in the sketches below, to play around with the code.

PROBLEM: Given a comma-separated string, find the number of values in it.
SOLUTION: We just need to count the number of commas in the string and add one to the result to get the number of values. Can be done as:

REPLACE( @css, ',', '' ) simply replaces the commas with empty strings. So if @css is a,b,c,d then REPLACE( @css, ',', '' ) is simply abcd, so subtracting its length from the original string gives us the number of commas and we add 1 to that to get the number of values.
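
A concrete run (the value of @css is arbitrary):

    SET @css = 'a,b,c,d';
    SELECT LENGTH(@css) - LENGTH(REPLACE(@css, ',', '')) + 1 AS num_values;   -- 4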

PROBLEM: Get the list of all IDs that have a given ID @x as a related ID.
SOLUTION: Use FIND_IN_SET. FIND_IN_SET(@x, @css) gives the index of @x in @css. Note that the index begins with 1 (and 0 means not found). So for example, FIND_IN_SET('c', 'a,b,c,d') gives the result 3.

So a query that checks FIND_IN_SET(@x, related_ids) > 0 in its WHERE clause will give us all the IDs that have @x as a related ID. (Notice however that this query has to perform a full table scan and cannot use indexes.)
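
As a sketch against the hypothetical comma-separated table from above:

    SELECT id
    FROM related_csv
    WHERE FIND_IN_SET(@x, related_ids) > 0;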

PROBLEM: Find all those IDs that got @x as the @nth related ID.
SOLUTION:
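
Since FIND_IN_SET returns the 1-based position, we can compare it with @n directly (same hypothetical table):

    SELECT id
    FROM related_csv
    WHERE FIND_IN_SET(@x, related_ids) = @n;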

PROBLEM: Find the @n‘th related ID for a given ID @x.
SOLUTION: We need to learn about another function: SUBSTRING_INDEX.

SUBSTRING_INDEX(@css, @delimiter, @n) will give the substring of @css up to the @nth occurrence of @delimiter. For example, SUBSTRING_INDEX('a,b,c,d,e,f,g', ',', 4) gives us the sub-string up to the 4th comma, i.e. it outputs a,b,c,d.

A very nice thing about SUBSTRING_INDEX is that negative indices work as well: they give us the sub-string from the right side up to the @nth delimiter from the end. For example, SUBSTRING_INDEX('a,b,c,d,e,f,g', ',', -4) outputs d,e,f,g.

So to get the 5th value in a comma-separated string, we nest the two calls: with @css set to 'a,b,c,d,e,f,g', SUBSTRING_INDEX(SUBSTRING_INDEX(@css, ',', 5), ',', -1) will output ‘e’.

So our problem of finding @nth related ID of a given ID @x can be solved like this:
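
Here it is as runnable statements (related_csv is the hypothetical table from above):

    SET @css = 'a,b,c,d,e,f,g';
    SELECT SUBSTRING_INDEX(@css, ',', 4);                            -- a,b,c,d
    SELECT SUBSTRING_INDEX(@css, ',', -4);                           -- d,e,f,g
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(@css, ',', 5), ',', -1);  -- e

    -- the @n'th related ID for a given ID @x
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(related_ids, ',', @n), ',', -1)
    FROM related_csv
    WHERE id = @x;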

WARNING:
One caveat with SUBSTRING_INDEX is that if the number of values in our comma-separated string is less than @n, it will output the entire string. So for example, SUBSTRING_INDEX('a,b,c,d', ',', 5) will output a,b,c,d since a 5th comma never occurs.

So SUBSTRING_INDEX(SUBSTRING_INDEX('a,b,c,d', ',', 5), ',', -1) will give us d, which is wrong if we were after the 5th value.

If we are unsure whether our comma-separated string has at least @n values, then we can first check if the number of values is greater than or equal to @n and then use the above, else return the empty string:
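
One way to express that guard (same variables as above):

    SELECT IF(
        LENGTH(@css) - LENGTH(REPLACE(@css, ',', '')) + 1 >= @n,
        SUBSTRING_INDEX(SUBSTRING_INDEX(@css, ',', @n), ',', -1),
        ''
    ) AS nth_value;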

Simple Tutorial on Java equals and hashCode

For a long time, I wanted to write this up. I looked around for an easy way to understand Java equals and hashCode methods, but found that while the facts are stated in many places, a simple example is missing. So here is a humble attempt.

You probably know this already, but for the sake of completeness, here is the question:
How do you check if two variables in Java are equal?
Before we can answer that question, we should ask what the different types of variables in Java are. For the purposes of testing equality, there are two: primitives and objects. Primitives can be checked for equality with ==, so your code will look like:
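
Something like this (variable names are arbitrary):

    int x = 5;
    int y = 5;
    System.out.println(x == y);   // true: primitives are compared by value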

(Strictly speaking, you would not be comparing floats and doubles this way, even though they are primitives, since they do not store exact values.)

How do we compare objects? Let’s say we have this Book class:
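
A minimal Book class along those lines (the field names are an assumption; we will keep adding to this class below):

    public class Book {
        private String author;
        private String title;

        public Book(String author, String title) {
            this.author = author;
            this.title = title;
        }
    }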

And we have a test drive class like this:
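
And a sketch of the test drive (the author/title values are made up):

    public class BooksDemo {
        public static void main(String[] args) {
            Book b1 = new Book("Joshua Bloch", "Effective Java");
            Book b2 = new Book("Joshua Bloch", "Effective Java");
            System.out.println("b1 == b2 ? " + (b1 == b2));
            System.out.println("b1.equals(b2) ? " + b1.equals(b2));
        }
    }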

When we want to compare object references, we will use == and when we want to check for logical equality, we will use equals. So we would expect (b1 == b2) to be false (since b1 and b2 are different objects) and b1.equals(b2) to be true (since they both are in fact the same book).

But what does BooksDemo print when run?
b1 == b2 ? false
b1.equals(b2) ? false

Why does b1.equals(b2) return false? Are we not doing value comparison? As it turns out, we are still doing reference comparison! Since we did not override the equals method in Book class, Java simply called the super-class’s equals method. What is the super class of Book? Book extends Object, which is the parent of all Java classes. The Object class equals method is simply this:
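
From the JDK source, it is just a reference comparison:

    public boolean equals(Object obj) {
        return (this == obj);
    }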

Ok, let us add equals method to Book class. If the authors and the titles are the same for two Books, we will say they are logically equal.
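
A first attempt might look like this (note the Book parameter, which will matter in a moment):

    public boolean equals(Book other) {
        return this.author.equals(other.author)
            && this.title.equals(other.title);
    }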

Now if we run BooksDemo it prints:
b1 == b2 ? false
b1.equals(b2) ? true

Great!

While this works, there are some issues. The subtle mistake in our equals method above is that it does not really override the Object class equals method, since the signature of the Object class method is equals(Object obj), but our Book method has the signature equals(Book other).
What we did was an overload and not an override. This is one of the reasons to always use the @Override annotation for overridden methods, to make sure we are indeed overriding and not overloading. That annotation would have pointed out the error here.

In practice, our incorrect overloaded method may never cause any problem at all. We may always be invoking its equals method only with Book objects. But if someone compares another object like String to our Book object, then the equals method of Object class will be invoked and not our overloaded equals in Book class.

So let us fix this:
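
One common way to write it (null and type handling kept minimal):

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof Book)) {
            return false;
        }
        Book other = (Book) obj;
        return this.author.equals(other.author)
            && this.title.equals(other.title);
    }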

We have a correct equals method now. But we also need to fix another issue:
Two equal objects must return the same hashCode.

So whenever we override equals, we must override hashCode and make sure that equal objects return the same hashCode.

If we do not override the hashCode method in our class, then we will use the Object class hashCode method. As mentioned here, Object hashCode is typically implemented by converting the internal address of the object into an integer. The internal addresses of two different objects, even if they are logically equal (i.e. b1.equals(b2) is true), may be different, so we may get different hashCodes, which violates this requirement.

Not overriding hashCode creates problems for HashSet, Hashtable and HashMap. Let's see an example. Recall that Sets are useful for keeping unique elements, so we would expect logically equal elements to not be duplicated. Also recall that when we invoke the add method on a set, it will return false if the item is a duplicate.
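
A sketch of BooksDemo extended with a set (book values made up, as before):

    import java.util.HashSet;
    import java.util.Set;

    public class BooksDemo {
        public static void main(String[] args) {
            Set<Book> books = new HashSet<Book>();
            Book b1 = new Book("Joshua Bloch", "Effective Java");
            Book b2 = new Book("Joshua Bloch", "Effective Java");
            System.out.println("Item added ? " + books.add(b1));
            System.out.println("Item added ? " + books.add(b2));
            System.out.println("# of unique books = " + books.size());
        }
    }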

This code prints:

Item added ? true
Item added ? true
# of unique books = 2

So why did the second item get added, though it is equal to the first? Isn’t Set supposed to prevent duplicates?

Let us put this hashCode method in Book class now and see what is happening:
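
Something along these lines, matching the output that follows (it just reports what Object would have returned):

    @Override
    public int hashCode() {
        int h = super.hashCode();   // Object's identity-based hashCode
        System.out.println("in hashcode. I am simply returning Object hashCode=" + h);
        return h;
    }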

Now run BooksDemo again and we see:

in hashcode. I am simply returning Object hashCode=18426253
Item added ? true
in hashcode. I am simply returning Object hashCode=16197143
Item added ? true
# of unique books = 2

So we see that add first gets the hashCode of the object we are trying to add. If there is no item in the set with the same hashCode, it will decide that the new item is not a duplicate. (Recall again, ‘equal implies same hashCode’ which means ‘unequal hashCode implies unequal objects’.) If there is another object in the set with the same hashCode, then it will invoke equals method to check if there is an item in the set equal to the new item we are trying to add. In our case above, we reused Object’s hashCode, so our new Book’s hashCode is different from the one in the set.

Let us make sure we write a correct hashCode method now:
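
One simple way is to derive it from the same fields used in equals; the print statements mirror the output below (the exact number will of course depend on your strings):

    @Override
    public int hashCode() {
        System.out.println("in hashcode");
        int result = 31 * author.hashCode() + title.hashCode();
        System.out.println("hashCode=" + result);
        return result;
    }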

This method returns the same hashCode for equal objects. For unequal objects, it will return different hashCodes almost all the time (unless there are hash collisions between the author and title strings, which will be very rare).

Let us add a print statement to our equals method also for verification:
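
And the equals method with the extra print (otherwise unchanged):

    @Override
    public boolean equals(Object obj) {
        System.out.println("in equals");
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof Book)) {
            return false;
        }
        Book other = (Book) obj;
        return this.author.equals(other.author)
            && this.title.equals(other.title);
    }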

Now let’s run BooksDemo and we get the following output:

in hashcode
hashCode=1869578864
Item added ? true
in hashcode
hashCode=1869578864
in equals
Item added ? false
# of unique books = 1

The first item gets added, but when we add the second item, the hashCode of the new object is the same as the hashCode of the item already in the set, so equals is invoked. Since the two objects are equal, add decides this is a duplicate object and does not add it to the set, so we get the expected behavior.

Another important note about hashCode is this: unequal objects need not have different hashCodes. However as noted here, the programmer should be aware that producing distinct hashCodes for unequal objects may improve the performance of hash tables. This has to do with the bucket sizes of hashes and as much as possible we should avoid putting more than one object in a bucket. In general, hash collisions should be rare so that performance of lookups is fast.