Monthly Archives: March 2014

Working with Comma Separated Values in MySQL

To begin with, a word of caution on comma-separated strings in MySQL.

Let us say we want to store a list of related IDs for a given ID. (Think similar products or documents.)

Rather than creating a table like this:

with the intent of storing comma-separated values in related_ids column, it is much better to create a table like this:

This second table facilitates:

  1. more queries
  2. more indexes
  3. data consistency

For example, it helps us keep a unique index on (related_id, id) pairs to avoid duplicate related_ids for an id. The index is also used for look-ups by related_id. (Notice that we could have also kept a unique index on (id, related_id) pair to enforce data consistency and a separate index on related_id for queries. But this is inefficient. The unique index on (id, related_id) will only help us with data consistency and not with look-ups by related_id, for which we need the additional index on just related_id.)

However, people are forced to store comma-separated values for scaling reasons. For example, in my company, we have a table where we store for a given ID, fifty related IDs. The number of IDs in our system is about 500K. If we went with the second table, we would be storing 25M rows. While reads are much more frequent than writes, queries do get really slow with 25M rows. So we had to use the first table.


MySQL does have some nice functions to work with comma-separated values which we will see here.

Below, I am using @x, @y, @css (comma-separated string), etc., for SQL variables. You can set them like

to play around with the code.

PROBLEM: Given a comma-separated string, find the number of values in it.
SOLUTION: We just need to count the number of commas in the string and add one to the result to get the number of values. Can be done as:

REPLACE( @css, ',', '' ) simply replaces the commas with empty strings. So if @css is a,b,c,d then REPLACE( @css, ',', '' ) is simply abcd, so subtracting its length from the original string gives us the number of commas and we add 1 to that to get the number of values.

PROBLEM: Get the list of all IDs that have a given ID @x as a related ID.
SOLUTION: Use FIND_IN_SET. FIND_IN_SET(@x, @css) gives the index of @x in @css. Note that the index begins with 1. So for example,

gives the result 3.

So using this SQL:

will give us all the IDs that have @x as a related ID. (Notice however that this query has to perform a full table scan and cannot use indexes.)

PROBLEM: Find all those IDs that got @x as the @nth related ID.
SOLUTION:

PROBLEM: Find the @n‘th related ID for a given ID @x.
SOLUTION: We need to learn about another function: SUBSTRING_INDEX.

SUBSTRING_INDEX(@css, @delimiter, @n) will give the substring of @css upto the @nth @delimiter. For example:

will give us the sub-string of a,b,c,d,e,f,g upto the 4th comma i.e. it will output:

A very nice thing about SUBSTRING_INDEX is that negative indices work as well. They just give us the sub-string from the right side upto the @nth delimiter. For example:

will output

So to get the 5th value in a comma-separated string, we will use:

which will output ‘e’.

So our problem of finding @nth related ID of a given ID @x can be solved like this:

WARNING:
One caveat with SUBSTRING_INDEX is that if the number of values in our comma separated string is less than @n, then it will output the entire string. So for example:

will output a,b,c,d since the 5th occurrence of comma never occurred.

So doing:

will give us d which is wrong.

If we are unsure whether our comma-separated string has at least @n values, then we can first check if the number of values is greater than or equal to @n and then use the above, else return the empty string: