
In this post, I would like to present some SQL problems along with ways to solve them. The following examples use PostgreSQL, but most of them will also work in other databases.

 

To begin, let's take a table with users' comments:


SELECT id, user_id, content, created_at
FROM comments;


 id  | user_id |                     content                       |      created_at
-----+---------+---------------------------------------------------+---------------------
   6 |      16 | Me mundare. Et extendens Herodis manum.           | 2018-07-24 06:50:56
  62 |      16 | Habentes curavit Ut adimpleretur quod dictum.     | 2018-07-25 09:38:40
  68 |      16 | In vestimentis ovium, intrinsecus autem sunt.     | 2018-07-25 15:21:37
  79 |       1 | Venturus est, fortior me est, cujus.              | 2018-07-27 14:17:08
  85 |      17 | Bethlehem Juda in diebus Herodis regis.           | 2018-08-03 10:47:52
  86 |       5 | Dimitte nobis debita nostra, sicut et.            | 2018-08-03 10:53:28
  92 |      16 | Vade, et vadit et alii Veni.                      | 2018-08-17 07:55:45
 196 |      16 | Tunc Herodes clam vocatis magis diligenter.       | 2018-09-21 12:29:29
 206 |       1 | Te, abscide eam, et projice abs.                  | 2018-09-26 14:13:32
 216 |       1 | Tunc Confitebor illis numquam novi your.          | 2018-10-03 13:07:52
 
 

Goal: find each user's most recent comment and its date.

We could try grouping the data by user_id and applying the MAX function to created_at, but then content and id would not be available in the result.

The solution to this problem is to use DISTINCT ON together with ORDER BY:


SELECT DISTINCT ON (user_id) id, user_id, content, created_at
FROM comments
ORDER BY user_id, created_at DESC;


id  | user_id |                   content                   |     created_at
-----+---------+---------------------------------------------+---------------------
 216 |       1 | Tunc Confitebor illis numquam novi your.    | 2018-10-03 13:07:52
  86 |       5 | Dimitte nobis debita nostra, sicut et.      | 2018-08-03 10:53:28
 196 |      16 | Tunc Herodes clam vocatis magis diligenter. | 2018-09-21 12:29:29
  85 |      17 | Bethlehem Juda in diebus Herodis regis.     | 2018-08-03 10:47:52

We received one row per user, picked according to the specified order. Please note that the columns used in the DISTINCT ON clause must also appear at the beginning of the ORDER BY clause.
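DISTINCT ON is a PostgreSQL extension. If you need a portable alternative, the same result can be obtained with the ROW_NUMBER window function; a minimal sketch:

SELECT id, user_id, content, created_at
FROM (
  SELECT id, user_id, content, created_at,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
  FROM comments
) AS ranked
WHERE rn = 1;

Here ROW_NUMBER numbers each user's comments from newest to oldest, and the outer query keeps only the newest one.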

 

In the next example, we have a table of comments, each of which has a unique number:


SELECT number, content, created_at
FROM comments
ORDER BY created_at;


  number  |                     content                     |     created_at
----------+-------------------------------------------------+---------------------
 192/2018 | Porro homines mirati sunt, dicentes Qualis.     | 2018-09-21 12:29:28
 193/2018 | Agite appropinquavit enim regnum calorum. Hic.  | 2018-09-21 12:29:28
 194/2018 | Sunt enim qui quarebant animam pueri.           | 2018-09-21 12:29:28
 195/2018 | Saturabuntur. Beati misericordes quoniam.       | 2018-09-21 12:29:29
 196/2018 | Tunc Herodes clam vocatis magis diligenter.     | 2018-09-21 12:29:29
 202/2018 | Et tunc veniens offeres munus tuum.             | 2018-09-26 11:48:33
 203/2018 | Congregans omnes principes sacerdotum.          | 2018-09-26 11:48:37
 204/2018 | Fornicationis causa, facit eam machari et.      | 2018-09-26 14:12:33

As we can see, the numbers from 197/2018 to 201/2018 are missing.

Goal: find all the missing numbers.

The GENERATE_SERIES function will be helpful here; let's look at an example:

SELECT * FROM GENERATE_SERIES(1, 5);

generate_series
-----------------
               1
               2
               3
               4
               5

This function generates a series of values for the given range. Let's join it with our comments table, choosing a suitable join condition.


SELECT x, number, content, created_at
FROM comments
RIGHT JOIN GENERATE_SERIES(1, 300) AS x ON (x || '/2018') = comments.number;


  x  |  number  |                  content                        |     created_at
-----+----------+-------------------------------------------------+---------------------
 195 | 195/2018 | Saturabuntur. Beati misericordes quoniam.       | 2018-09-21 12:29:29
 196 | 196/2018 | Tunc Herodes clam vocatis magis diligenter.     | 2018-09-21 12:29:29
 197 |          |                                                 |
 198 |          |                                                 |
 199 |          |                                                 |
 200 |          |                                                 |
 201 |          |                                                 |
 202 | 202/2018 | Et tunc veniens offeres munus tuum.             | 2018-09-26 11:48:33
 203 | 203/2018 | Congregans omnes principes sacerdotum.          | 2018-09-26 11:48:37

We are on the right track; to find the missing numbers, we only need to check where the number is NULL.


SELECT (x || '/2018') AS number
FROM comments
RIGHT JOIN GENERATE_SERIES(1, 300) AS x ON (x || '/2018') = comments.number
WHERE comments.number IS NULL;


  number
----------
 197/2018
 198/2018
 199/2018
 200/2018
 201/2018

We have found the missing numbers!
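As a side note, the same missing numbers can also be found with EXCEPT instead of an outer join — a sketch, assuming all numbers carry the '/2018' suffix:

SELECT (x || '/2018') AS number
FROM GENERATE_SERIES(1, 300) AS x
EXCEPT
SELECT number FROM comments
ORDER BY number;

EXCEPT keeps every generated number that does not appear in the comments table.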

 

We'll use the comments table again to illustrate another problem.

Goal: find how many comments we received in each hour of a certain day.

We will probably want to group the data by the creation date, truncated to the hour: date_trunc('hour', created_at)


SELECT date_trunc('hour', created_at) AS created_at, count(1)
FROM comments
WHERE date_trunc('day', created_at) = timestamp '2018-07-24 00:00:00'
GROUP BY date_trunc('hour', created_at)
ORDER BY created_at;


     created_at      | count
---------------------+-------
 2018-07-24 06:00:00 |     8
 2018-07-24 08:00:00 |    16
 2018-07-24 10:00:00 |     6
 2018-07-24 11:00:00 |    30

This is the right approach, but we are missing the hours in which the number of entries is 0. We can use GENERATE_SERIES again, this time to generate a series of timestamps:

SELECT * FROM GENERATE_SERIES('2018-07-24 00:00:00'::TIMESTAMP, '2018-07-24 23:00:00', '1 hour') AS hours;

        hours
---------------------
 2018-07-24 00:00:00
 2018-07-24 01:00:00
 2018-07-24 02:00:00
 2018-07-24 03:00:00
 2018-07-24 04:00:00
 2018-07-24 05:00:00
 ...

We are getting close to our goal; we only need to combine the two previous queries.


SELECT hours AS created_at, count(comments.id)
FROM GENERATE_SERIES('2018-07-24 00:00:00'::TIMESTAMP, '2018-07-24 23:00:00', '1 hour') AS hours
LEFT JOIN comments ON date_trunc('hour', created_at) = hours
GROUP BY hours
ORDER BY created_at;


     created_at      | count
---------------------+-------
 ...
 2018-07-24 05:00:00 |     0
 2018-07-24 06:00:00 |     8
 2018-07-24 07:00:00 |     0
 2018-07-24 08:00:00 |    16
 2018-07-24 09:00:00 |     0
 2018-07-24 10:00:00 |     6
 2018-07-24 11:00:00 |    30
 2018-07-24 12:00:00 |     0
 ...

Success!

 

Sometimes we decide on a custom solution that stores certain data in arrays. In this case, I will use a simple implementation of surveys:

SELECT id, question, type FROM surveys;

 id |                  question                 |      type
----+-------------------------------------------+-----------------
 4  | Esto consentiens adversario tuo cito dum? | multiple_choice

As we can see, this is a multiple-choice survey; the survey_options table looks like this:

SELECT id, survey_id, content FROM survey_options;

 id | survey_id |                 content
----+-----------+-----------------------------------------
 11 |         4 | Solvere legem aut prophetas non veni.
 12 |         4 | Et alteram et ei, qui vult.
 13 |         4 | Enim decet nos implere omnem justitiam.
 14 |         4 | Enim erat super petram. And that

The responses sent by users are stored in the survey_responses table, which looks like this:

SELECT id, user_id, survey_options_ids FROM survey_responses;

 id | user_id | survey_options_ids
----+---------+--------------------
 66 |       1 | {11,13}
 72 |      16 | {11,12,13}
 78 |      17 | {11,12}
 84 |      27 | {11,14}

Goal: count how many times each answer has been chosen.

We will use the UNNEST function, which expands an array into a set of rows:


SELECT id, user_id, UNNEST(survey_options_ids) AS survey_option_id
FROM survey_responses;


 id | user_id | survey_option_id
----+---------+------------------
 66 |       1 |               11
 66 |       1 |               13
 72 |      16 |               11
 72 |      16 |               12
 72 |      16 |               13
 78 |      17 |               11
 78 |      17 |               12
 84 |      27 |               11
 84 |      27 |               14

Let's add grouping by the response:


SELECT COUNT(1) AS count, UNNEST(survey_options_ids) AS response
FROM survey_responses
GROUP BY response
ORDER BY count DESC;


 count | response
-------+----------
     4 |       11
     2 |       13
     2 |       12
     1 |       14

The counts are ready; we can now add a join with survey_options to get each answer's content.


SELECT responses.count, survey_options.content
FROM (
  SELECT COUNT(1) AS count, UNNEST(survey_options_ids) AS option_id
  FROM survey_responses
  GROUP BY option_id
) AS responses
LEFT JOIN survey_options ON responses.option_id = survey_options.id
ORDER BY count DESC;


 count |                 content
-------+-----------------------------------------
     4 | Solvere legem aut prophetas non veni.
     2 | Enim decet nos implere omnem justitiam.
     2 | Et alteram et ei, qui vult.
     1 | Enim erat super petram. And that.

The report is ready!

 

Consider now the following table containing users with basic information such as first name, last name, and age.


SELECT id, first_name, last_name, age FROM users;


 id | first_name  | last_name  | age
----+-------------+------------+-----
  9 | Sebastian   | Kihn       |  16
  8 | Colby       | Ankunding  |  19
 11 | Saul        | Frami      |  15
 13 | Emmy        | Haley      |  42
  3 | Houston     | Hilll      |  64
  6 | Ramona      | Feeney     |  57
  4 | Bee         | Haley      |  51
 ...

 

Goal: We have to count how many users are in specific age ranges, e.g., under 25, between 25 and 50, and over 50.

This is not a problem if we use a separate query for each range, but what if we want to do it in a single query? We can place an expression inside the aggregate function: we check the condition and cast it to int, so true gives us the value 1 and false gives 0. All we then need to do is apply the SUM function to these values.


SELECT SUM((age < 25)::int) AS below_25,
       SUM((age >= 25 AND age < 50)::int) AS between_25_and_50,
       SUM((age >= 50)::int) AS above_50,
       COUNT(1) AS total_count
FROM users;


 below_25 | between_25_and_50 | above_50 | total_count
----------+-------------------+----------+-------------
        9 |                 4 |       12 |          25

It is done!
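As a side note, since PostgreSQL 9.4 the same counts can be expressed more readably with the FILTER clause on aggregate functions:

SELECT COUNT(*) FILTER (WHERE age < 25) AS below_25,
       COUNT(*) FILTER (WHERE age >= 25 AND age < 50) AS between_25_and_50,
       COUNT(*) FILTER (WHERE age >= 50) AS above_50,
       COUNT(*) AS total_count
FROM users;

Each aggregate counts only the rows that satisfy its FILTER condition, so no casting trick is needed.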

 

In the next example, we'll use a similar users table; this time each user has a certain number of points, forming a kind of ranking.


SELECT id, first_name, last_name, points
FROM users
ORDER BY points DESC;


id | first_name  | last_name  | points
----+-------------+------------+--------
 13 | Emmy        | Haley      |    196
 11 | Saul        | Frami      |    172
 29 | Danyel      | Hodkiewicz |    154
  8 | Colby       | Ankunding  |    152
 32 | Micki       | Jenkins    |    148
  3 | Houston     | Hilll      |    142
 14 | Cherrie     | Feeney     |    131

Goal: find how many points each user is behind the leader.

We will use a window function to solve this problem. Let's add a column containing the highest score, using the MAX function over the whole result set.


SELECT id, first_name, last_name, points,
       max(points) OVER () AS leader_points
FROM users
ORDER BY points DESC;


 id | first_name  | last_name  | points | leader_points
----+-------------+------------+--------+------------
 13 | Emmy        | Haley      |    196 |        196
 11 | Saul        | Frami      |    172 |        196
 29 | Danyel      | Hodkiewicz |    154 |        196
  8 | Colby       | Ankunding  |    152 |        196
 32 | Micki       | Jenkins    |    148 |        196
  3 | Houston     | Hilll      |    142 |        196
 14 | Cherrie     | Feeney     |    131 |        196

All rows in the leader_points column have the same value: the number of points the leader got. All we need to do now is calculate the difference, and the solution is ready:


SELECT id, first_name, last_name, points,
       points - max(points) OVER () AS difference
FROM users
ORDER BY points DESC;


 id | first_name  | last_name  | points | difference
----+-------------+------------+--------+------------
 13 | Emmy        | Haley      |    196 |          0
 11 | Saul        | Frami      |    172 |        -24
 29 | Danyel      | Hodkiewicz |    154 |        -42
  8 | Colby       | Ankunding  |    152 |        -44
 32 | Micki       | Jenkins    |    148 |        -48
  3 | Houston     | Hilll      |    142 |        -54
 14 | Cherrie     | Feeney     |    131 |        -65

However, we must be aware of certain problems that can come up when we add conditions. Window functions operate on the result set of the current query, not on the whole table. Suppose we want to check the difference from the leader for users under 20:


SELECT id, first_name, last_name, points,
       points - max(points) OVER () AS leader_difference
FROM users
WHERE age < 20
ORDER BY points DESC;


 id | first_name | last_name | points | leader_difference
----+------------+-----------+--------+-------------------
 11 | Saul       | Frami     |    172 |                 0
  8 | Colby      | Ankunding |    152 |               -20
 30 | Tegan      | Jaskolski |    109 |               -63
 12 | Julian     | Hermann   |     57 |              -115

Saul is the leader now, not Emmy. To get the correct result, the window function must be applied in a subquery:


SELECT * FROM (
  SELECT id, first_name, last_name, points, age,
       points - max(points) OVER () AS leader_difference
  FROM users) AS u
WHERE age < 20
ORDER BY points DESC;


id | first_name | last_name | points | age | leader_difference
----+------------+-----------+--------+-----+-------------------
 11 | Saul       | Frami     |    172 |  15 |               -24
  8 | Colby      | Ankunding |    152 |  19 |               -44
 30 | Tegan      | Jaskolski |    109 |  17 |               -87
 12 | Julian     | Hermann   |     57 |  15 |              -139

 

Goal 2: let's change the rules; now we have to check how many points each user is behind their predecessor in the ranking.

Here we can use window functions too, this time the LAG function, which refers to the previous row in the result. Let's look at an example:


SELECT id, first_name, last_name, points,
       LAG(points) OVER (ORDER BY points DESC) AS previous_points
FROM users;


 id | first_name  | last_name  | points | previous_points
----+-------------+------------+--------+-----------------
 13 | Emmy        | Haley      |    196 |
 11 | Saul        | Frami      |    172 |             196
 29 | Danyel      | Hodkiewicz |    154 |             172
  8 | Colby       | Ankunding  |    152 |             154
 32 | Micki       | Jenkins    |    148 |             152
  3 | Houston     | Hilll      |    142 |             148
 14 | Cherrie     | Feeney     |    131 |             142

In previous_points we can see the score of the predecessor. Simply subtract the values in the two columns and we have the result. We must remember that for the leader LAG returns NULL, so we need to use COALESCE:


SELECT id, first_name, last_name, points,
       points - COALESCE(LAG(points) OVER (ORDER BY points DESC), points) AS previous_difference
FROM users;


 id | first_name  | last_name  | points | previous_difference
----+-------------+------------+--------+---------------------
 13 | Emmy        | Haley      |    196 |                   0
 11 | Saul        | Frami      |    172 |                 -24
 29 | Danyel      | Hodkiewicz |    154 |                 -18
  8 | Colby       | Ankunding  |    152 |                  -2
 32 | Micki       | Jenkins    |    148 |                  -4
  3 | Houston     | Hilll      |    142 |                  -6
 14 | Cherrie     | Feeney     |    131 |                 -11
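LAG has a counterpart, LEAD, which refers to the following row instead. For example, to see how many points separate each user from their closest pursuer (a sketch on the same users table):

SELECT id, first_name, last_name, points,
       points - COALESCE(LEAD(points) OVER (ORDER BY points DESC), points) AS pursuer_difference
FROM users;

For the last user there is no following row, so COALESCE again substitutes the user's own points, yielding 0.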

 

In the last example, I would like to show a transfer of funds between two clients (functionality similar to a banking system). We have two users:


 id | first_name  | last_name  | balance
----+-------------+------------+---------
 11 | Saul        | Frami      |     300
 14 | Emmy        | Haley      |      60

Goal: We want to send $100 from Saul to Emmy.

Before the transfer, we must make sure the state has not changed, i.e., that Saul still has at least $100. Online you can find various solutions to this problem; they usually use transactions and locking. The following method performs an atomic update using only a single query:


UPDATE users
   SET balance = CASE WHEN id = 14 THEN balance + 100
                      WHEN id = 11 THEN balance - 100
                 END
WHERE id IN (11, 14) AND
      EXISTS (SELECT id FROM users WHERE id = 14) AND
      EXISTS (SELECT id FROM users WHERE id = 11 AND balance >= 100);

In the WHERE clause, we restrict the update to the two users, then check that both users exist and that the balance is sufficient. In the CASE expression, we handle the two cases: the balance of the user with id 14 must be increased, whereas the balance of the user with id 11 must be decreased.
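For comparison, a classic lock-based version of the same transfer might look like this (a sketch; checking the returned balance and rolling back on insufficient funds are left to the application):

BEGIN;
-- lock both rows, in a consistent order to avoid deadlocks
SELECT id, balance FROM users WHERE id IN (11, 14) ORDER BY id FOR UPDATE;
-- the application verifies here that Saul's balance is at least 100
UPDATE users SET balance = balance - 100 WHERE id = 11;
UPDATE users SET balance = balance + 100 WHERE id = 14;
COMMIT;

The single-query version above avoids holding locks across a round trip to the application, which is why it can run without an explicit transaction block.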

Thank you for reading; I hope these examples will be useful to you.
