Skip to content

PRIMARY KEY isn't always used in left joins #9520

@kennethklee

Description

@kennethklee

TLDR; slow left join where primary key can be used as index, but isn't. benchmark size 30k rows, slow query takes 0.30 sec, optimal query takes 0.00 sec.

there's a case where the primary key isn't used in a left join, when it should be.

here's the query plan for two identical queries on two tables, test and test2. test uses primary key. test2 doesn't have a primary key, and uses an index on col, id.

Image

here's the reproduction:

# setup
SQL_CMD="dolt sql" # change to `mysql` to test difference in behavior
$SQL_CMD <<< "CREATE DATABASE test"

$SQL_CMD <<< "DROP TABLE IF EXISTS test"
$SQL_CMD <<< "CREATE TABLE test (id VARCHAR(255), parent VARCHAR(255), primary key (id))"

$SQL_CMD <<< "DROP TABLE IF EXISTS test2"
$SQL_CMD <<< "CREATE TABLE test2 (id VARCHAR(255), parent VARCHAR(255))"
$SQL_CMD <<< "CREATE UNIQUE INDEX idx_test2_id ON test2 (id)"

# data generation (not really needed for query plan, but here for performance benchmarks)
TABLE="test" # change this test2 to generate for 
# 10,000 grandparents
for j in {0..9}; do
  sql="INSERT INTO $TABLE (id, parent) VALUES ('test_$(( j * 1000 + 1 ))', NULL)"
  for ((i = j * 1000 + 2; i <= (j + 1) * 1000; i++)); do
    sql+=",('test_$i', NULL)"
  done
  $SQL_CMD <<< "$sql"
done

# 10,000 parents
for j in {0..9}; do
  sql="INSERT INTO $TABLE (id, parent) VALUES ('test_$(( j * 1000 + 10001 ))', 'test_$(( j * 1000 ))')"
  for ((i = j * 1000 + 10002; i <= (j + 1) * 1000 + 10000; i++)); do
    sql+=",('test_$i', 'test_$(( i - 10000 ))')"
  done
  $SQL_CMD <<< "$sql"
done

# 10,000 children
for j in {0..9}; do
  sql="INSERT INTO $TABLE (id, parent) VALUES ('test_$(( j * 1000 + 20001 ))', 'test_$(( j * 1000 + 10000 ))')"
  for ((i = j * 1000 + 20002; i <= (j + 1) * 1000 + 20000; i++)); do
    sql+=",('test_$i', 'test_$(( i - 10000 ))')"
  done
  $SQL_CMD <<< "$sql"
done

now for the query:

select *
from test child
  left join test parent on parent.id = child.parent
  left join test grandparent on grandparent.id = parent.parent
where child.id in ('test_0', 'test_5000', 'test_10000', 'test_15000', 'test_20000', 'test_25000')

Change the table from test to test2 for the other side.

performance different on my machine:

test/main*> select *
         -> from test child
         ->   left join test parent on parent.id = child.parent
         ->   left join test grandparent on grandparent.id = parent.parent
         -> where child.id in ('test_0', 'test_5000', 'test_10000', 'test_15000', 'test_20000', 'test_25000');
+------------+------------+------------+-----------+-----------+--------+
| id         | parent     | id         | parent    | id        | parent |
+------------+------------+------------+-----------+-----------+--------+
| test_10000 | NULL       | NULL       | NULL      | NULL      | NULL   |
| test_15000 | test_5000  | test_5000  | NULL      | NULL      | NULL   |
| test_20000 | test_10000 | test_10000 | NULL      | NULL      | NULL   |
| test_25000 | test_15000 | test_15000 | test_5000 | test_5000 | NULL   |
| test_5000  | NULL       | NULL       | NULL      | NULL      | NULL   |
+------------+------------+------------+-----------+-----------+--------+
5 rows in set (0.30 sec) 


test/main*> select *
         -> from test2 child
 on parent.id = child.parent
         ->   left join test2 parent on parent.id = child.parent
         ->   left join test2 grandparent on grandparent.id = parent.parent
         -> where child.id in ('test_0', 'test_5000', 'test_10000', 'test_15000', 'test_20000', 'test_25000');
+------------+------------+------------+-----------+-----------+--------+
| id         | parent     | id         | parent    | id        | parent |
+------------+------------+------------+-----------+-----------+--------+
| test_10000 | NULL       | NULL       | NULL      | NULL      | NULL   |
| test_15000 | test_5000  | test_5000  | NULL      | NULL      | NULL   |
| test_20000 | test_10000 | test_10000 | NULL      | NULL      | NULL   |
| test_25000 | test_15000 | test_15000 | test_5000 | test_5000 | NULL   |
| test_5000  | NULL       | NULL       | NULL      | NULL      | NULL   |
+------------+------------+------------+-----------+-----------+--------+
5 rows in set (0.00 sec) 

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions