Sets, Multisets and… SQL

Those of you who paid attention at your ‘Introduction to Databases’ class will probably remember that SQL is based on Relational Algebra. Maybe you also remember that Relational Algebra operates on relations, which are a special category of sets. As such, they do not contain duplicates. An SQL table, however, can easily contain duplicate rows and therefore it’s a multiset (i.e. set with repetitions) rather than a set. So far so good.

The trouble (for me at least) started when I tried to use SQL’s set operations: UNION, INTERSECTION and EXCEPT.

Can a union of two multisets contain duplicates? (Checking Wikipedia here…) Yes, it can. So {1,1,2} \/ {1,2,3} = {1,1,1,2,2,3}. But what if we execute SQL’s UNION on two tables with these values? No, not a table with three 1s, two 2s and one 3. We’ll get {1,2,3}. Huh? So now we do remove duplicates all of a sudden?! After some googling, I quickly found out that I should use UNION ALL to achieve what I expected from UNION. Eh… OK.

So now I try EXCEPT. In some DB systems it’s actually called MINUS (I’m talking to you, Oracle). In others it doesn’t exist at all (talking to you MySQL). So what does EXCEPT do? After my experiences with UNION, I don’t trust my intuition anymore and I google around. And yes, EXCEPT also removes duplicates. Is there a version that doesn’t? Theoretically, yes: EXCEPT ALL, but many DB systems didn’t feel like implementing it (talking to you SQL Server, and to you MySQL, and to you Oracle). So how do I implement multiset difference? I want to implement it properly, so that {1,1,2} \ {1,2,3} = {1} and not {} as SQL standard tries to convince me…

So far, the only solution I came up with is this:

SELECT A1.<all columns except nr>
FROM
  (SELECT <all columns>, ROW_NUMBER()
    OVER (PARTITION BY <all columns> ORDER BY <all columns>) AS nr
   FROM A) AS A1
LEFT OUTER JOIN
  (SELECT <all columns>, ROW_NUMBER()
    OVER (PARTITION BY <all columns> ORDER BY <all columns>) AS nr
   FROM B) AS B1
ON <all columns have to be equal> AND A1.nr = B1.nr
WHERE B1.nr IS NULL

Since we partition by all columns, the ROW_NUMBER() analytical function will return 1 for each row, except the duplicate rows. The duplicate rows will be numbered in both A and B which will allow removing only as many duplicates from A as are found in B. Nice. But we need analytical functions to do that. I still need to find a solution for MySQL, SQLite and such. I’ll keep you posted.