CockroachDB is the new kid on the block and is growing fast. It’s Beta software under rapid development and it’s SQL layer is maturing fast.
While SQL is a standard, it has gone through many revisions and contains optional functionality that various vendors have chosen to implement, or not. The standard is also incomplete and ambiguous in many places, leading to different interpretations.
One of the design objectives with CockroachDB SQL is to make it as close to the PostgreSQL flavor of SQL as possible. CockroachDB has implemented the PostgreSQL wire protocol which enables any PostgreSQL client to operate with the database. In terms of overall compatibility against the SQL standard, PostgreSQL is one of the more compliant implementations.
This post looks at how CockroachDB handles NULL values and how it compares with the SQL standard and other SQL implementations.
The good people over at SQLite have a script for testing NULLs and their functionality. I’ve used the contents of the script for my investigation into CockroachDB. The tests were carried out on Linux using version beta-20160512 of CockroachDB.
Let’s create a table with some test data.
CREATE TABLE t1(
a INT,
b INT,
c INT
);
INSERT INTO t1 VALUES(1, 0, 0);
INSERT INTO t1 VALUES(2, 0, 1);
INSERT INTO t1 VALUES(3, 1, 0);
INSERT INTO t1 VALUES(4, 1, 1);
INSERT INTO t1 VALUES(5, NULL, 0);
INSERT INTO t1 VALUES(6, NULL, 1);
INSERT INTO t1 VALUES(7, NULL, NULL);
SELECT * FROM t1;
+---+------+------+
| a | b | c |
+---+------+------+
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 1 | 1 |
| 5 | NULL | 0 |
| 6 | NULL | 1 |
| 7 | NULL | NULL |
+---+------+------+
Ok, so that’s the first observation. The CockroachDB built-in client shows NULLs using the word NULL and not just an empty field, which is good because distinguishing between a NULL and a STRING column that contains an empty string (“”) would be difficult.
What happens when a NULL value is used in a logic comparison? Like in the WHERE
clause of a SQL statement.
SELECT * FROM t1 WHERE b < 10;
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 1 | 1 |
+---+---+---+
SELECT * FROM t1 WHERE NOT b > 10;
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 1 | 1 |
+---+---+---+
SELECT * FROM t1 WHERE b < 10 OR c = 1;
+---+------+---+
| a | b | c |
+---+------+---+
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 1 | 1 |
| 6 | NULL | 1 |
+---+------+---+
SELECT * FROM t1 WHERE b < 10 AND c = 1;
+---+---+---+
| a | b | c |
+---+---+---+
| 2 | 0 | 1 |
| 4 | 1 | 1 |
+---+---+---+
SELECT * FROM t1 WHERE NOT (b < 10 AND c = 1);
+---+------+---+
| a | b | c |
+---+------+---+
| 1 | 0 | 0 |
| 3 | 1 | 0 |
| 5 | NULL | 0 |
+---+------+---+
SELECT * FROM t1 WHERE NOT (c = 1 AND b < 10);
+---+------+---+
| a | b | c |
+---+------+---+
| 1 | 0 | 0 |
| 3 | 1 | 0 |
| 5 | NULL | 0 |
+---+------+---+
In summary, anything compared to a NULL is NULL (and therefore not TRUE). This behavior is consistent with PostgresSQL as well as all other major RDBMS’s.
But how do you use NULL in a logic comparison? CockroachDB offers the standard IS NULL
and IS NOT NULL
that can be used when comparing NULL values.
SELECT * FROM t1 WHERE b IS NULL AND c IS NOT NULL;
+---+------+---+
| a | b | c |
+---+------+---+
| 5 | NULL | 0 |
| 6 | NULL | 1 |
+---+------+---+
What sort of results do you get when a NULLs is part of an arithmetic expression?
SELECT a, b, c, b*0, b*c, b+c FROM t1;
+---+------+------+-------+-------+-------+
| a | b | c | b * 0 | b * c | b + c |
+---+------+------+-------+-------+-------+
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 | 0 | 1 |
| 4 | 1 | 1 | 0 | 1 | 2 |
| 5 | NULL | 0 | NULL | NULL | NULL |
| 6 | NULL | 1 | NULL | NULL | NULL |
| 7 | NULL | NULL | NULL | NULL | NULL |
+---+------+------+-------+-------+-------+
In summary, any arithmetic operation involving a NULL will yield a NULL result.
Aggregate functions are those that operate on a set of rows and return a single value. I’ve repeated the contents of t1 here to make it easier to understand the results.
SELECT * FROM t1;
+---+------+------+
| a | b | c |
+---+------+------+
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 1 | 1 |
| 5 | NULL | 0 |
| 6 | NULL | 1 |
| 7 | NULL | NULL |
+---+------+------+
SELECT COUNT(*), COUNT(b), SUM(b), AVG(b), MIN(b), MAX(b) FROM t1;
+----------+----------+--------+--------------------+--------+--------+
| COUNT(*) | COUNT(b) | SUM(b) | AVG(b) | MIN(b) | MAX(b) |
+----------+----------+--------+--------------------+--------+--------+
| 7 | 4 | 2 | 0.5000000000000000 | 0 | 1 |
+----------+----------+--------+--------------------+--------+--------+
Points to note:
COUNT()
of a column that contains NULLs.MIN()
or MAX()
.AVG()
is defined as SUM()/COUNT()
which for column b is 2/4 not 2/7, but see the NULLs as Other Values section below.Are NULL values considered distinct from other values?
SELECT DISTINCT b FROM t1;
+------+
| b |
+------+
| 0 |
| 1 |
| NULL |
+------+
The answer is yes. NULL values are included in the list of distinct values from a column containing NULLs.
However, counting the number of distinct values excludes them, which is consistent with the COUNT()
function.
SELECT COUNT(DISTINCT b) FROM t1;
+-------------------+
| count(DISTINCT b) |
+-------------------+
| 2 |
+-------------------+
Sometimes you may want NULL values to be included in calculations like arithmetic or aggregate functions. For example, you may want to calculate the average value of column b as being the SUM()
of all numbers in b divided by the total number of rows, regardless of whether b’s value is NULL. We can include NULL values in the calculation by substituting a value for the NULL during the calculation. In this case, a value of zero (0) is appropriate. CockroachDB provides a function called IFNULL(arg1, arg2)
which returns arg2 if arg1 is NULL.
Let’s try the aggregate query again, but include the IFNULL()
function in the AVG()
calculation.
SELECT COUNT(*), COUNT(b), SUM(b), AVG(b), AVG(IFNULL(b, 0)), MIN(b), MAX(b) FROM t1;
+----------+----------+--------+--------------------+--------------------+--------+--------+
| COUNT(*) | COUNT(b) | SUM(b) | AVG(b) | AVG(IFNULL(b, 0)) | MIN(b) | MAX(b) |
+----------+----------+--------+--------------------+--------------------+--------+--------+
| 7 | 4 | 2 | 0.5000000000000000 | 0.2857142857142857 | 0 | 1 |
+----------+----------+--------+--------------------+--------------------+--------+--------+
Including all the rows from t1 by replacing the NULL values with zero for the purposes of the AVG()
calculation has made a big difference to the calculation. I wonder how many applications out there are returning AVG(b)
when they really wanted AVG(IFNULL(b, 0))
?
How NULLs behave in a UNION query.
SELECT b FROM t1 UNION SELECT b FROM t1;
+------+
| b |
+------+
| 0 |
| 1 |
| NULL |
+------+
NULL values are considered as part of a UNION
set operation.
Where do NULLs sit when sorting by a column containing NULL values? The core SQL standard doesn’t explicitly define a sort order for NULLs but an optional extension to the SQL:2003 standard provided the NULLS FIRST
and NULLS LAST
addition to the ORDER BY
clause.
By default, CockroachDB orders NULLs lower than the first non-NULL value, which is the same as using the ORDER BY ... ASC
option. Which, by the way, is the same as MySQL and SQL Server.
CockroachDB hasn’t implemented this optional extension so you can’t change the position of the NULLs in the sorting order. I suspect this will be added to CockroachDB SQL in future as PostgreSQL provides the NULLS FIRST
and NULLS LAST
clauses.
As an aside, PostgreSQL orders NULLs higher than the last non-NULL value, which is the same as using the ORDER BY ... ASC NULLS LAST
or ORDER BY ... DESC NULLS FIRST
options. Which, by the way, is the same as Oracle but different from CockroachDB.
SELECT * FROM t1 ORDER BY b;
+---+------+------+
| a | b | c |
+---+------+------+
| 6 | NULL | 1 |
| 5 | NULL | 0 |
| 7 | NULL | NULL |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 4 | 1 | 1 |
| 3 | 1 | 0 |
+---+------+------+
SELECT * FROM t1 ORDER BY b DESC;
+---+------+------+
| a | b | c |
+---+------+------+
| 4 | 1 | 1 |
| 3 | 1 | 0 |
| 2 | 0 | 1 |
| 1 | 0 | 0 |
| 7 | NULL | NULL |
| 6 | NULL | 1 |
| 5 | NULL | 0 |
+---+------+------+
Let’s find out if NULL values are considered unique.
CREATE TABLE t2(a INT, b INT UNIQUE);
INSERT INTO t2 VALUES(1, 1);
INSERT INTO t2 VALUES(2, NULL);
INSERT INTO t2 VALUES(3, NULL);
SELECT * FROM t2;
+---+------+
| a | b |
+---+------+
| 1 | 1 |
| 2 | NULL |
| 3 | NULL |
+---+------+
This shows that NULLs are not considered unique as the UNIQUE
constraint on column b was not violated when two rows with NULL values were inserted. Be aware that if a table has a UNIQUE
constraint on column(s) that are optional (nullable), it is still possible to insert duplicate rows that appear to violate the constraint if they contain a NULL value in at least one of the columns. This is because NULLs are never considered equal and hence don’t violate the uniqueness constraint.
Unfortunately, how your version of SQL handles NULL values is only part of the puzzle. How your application code (and often more importantly, what it’s written in) has a part to play in processing NULL values. That’s a whole different story but one that you should know if you’re pulling data from any SQL database.
Paul is an independent consultant and company director. He lives in Auckland, New Zealand and travels extensively.