You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The crdb_internal.system_jobs is a virtual table that joins
information from the jobs table and the jobs_info table.
For the previous query,
SELECT id, payload, status FROM "".crdb_internal.system_jobs
WHERE (created < $1) AND (id > $2)
ORDER BY id
LIMIT $3
this is a little suboptimal because:
- We don't make use of the progress column so any read of that is
useless.
- While the crdb_internal.virtual table has a virtual index on job id,
and EXPLAIN will even claim that it will be used:
• limit
│ count: 100
│
└── • filter
│ filter: created < '2023-07-20 07:29:01.17001'
│
└── • virtual table
table: system_jobs@system_jobs_id_idx
spans: [/101 - ]
This is actually a lie. A virtual index can only handle single-key
spans. As a result the unconstrained query is used:
WITH
latestpayload AS (SELECT job_id, value FROM system.job_info AS payload WHERE info_key = 'legacy_payload' ORDER BY written DESC),
latestprogress AS (SELECT job_id, value FROM system.job_info AS progress WHERE info_key = 'legacy_progress' ORDER BY written DESC)
SELECT
DISTINCT(id), status, created, payload.value AS payload, progress.value AS progress,
created_by_type, created_by_id, claim_session_id, claim_instance_id, num_runs, last_run, job_type
FROM system.jobs AS j
INNER JOIN latestpayload AS payload ON j.id = payload.job_id
LEFT JOIN latestprogress AS progress ON j.id = progress.job_id
which has a full scan of the jobs table and 2 full scans of the info
table:
• distinct
│ distinct on: id, value, value
│
└── • merge join
│ equality: (job_id) = (id)
│
├── • render
│ │
│ └── • filter
│ │ estimated row count: 7,318
│ │ filter: info_key = 'legacy_payload'
│ │
│ └── • scan
│ estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
│ table: job_info@primary
│ spans: FULL SCAN
│
└── • merge join (right outer)
│ equality: (job_id) = (id)
│ right cols are key
│
├── • render
│ │
│ └── • filter
│ │ estimated row count: 7,317
│ │ filter: info_key = 'legacy_progress'
│ │
│ └── • scan
│ estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
│ table: job_info@primary
│ spans: FULL SCAN
│
└── • scan
missing stats
table: jobs@primary
spans: FULL SCAN
Because of the limit, I don't think this ends up being as bad as it
looks. But it still isn't great.
In this PR, we replace crdb_internal.jobs with a query that removes
the join on the unused progress field and also constrains the query of
the job_info table.
• distinct
│ distinct on: id, value
│
└── • merge join
│ equality: (job_id) = (id)
│ right cols are key
│
├── • render
│ │
│ └── • filter
│ │ estimated row count: 7,318
│ │ filter: info_key = 'legacy_payload'
│ │
│ └── • scan
│ estimated row count: 14,646 (100% of the table; stats collected 45 minutes ago; using stats forecast for 2 hours in the future)
│ table: job_info@primary
│ spans: [/101/'legacy_payload' - ]
│
└── • render
│
└── • limit
│ count: 100
│
└── • filter
│ filter: created < '2023-07-20 07:29:01.17001'
│
└── • scan
missing stats
table: jobs@primary
spans: [/101 - ]
In a local example, this does seem faster:
> SELECT id, payload, status, created
> FROM "".crdb_internal.system_jobs
> WHERE (created < '2023-07-20 07:29:01.17001') AND (id > 100) ORDER BY id LIMIT 100;
id | payload | status | created
-----+---------+--------+----------
(0 rows)
Time: 183ms total (execution 183ms / network 0ms)
> WITH
> latestpayload AS (
> SELECT job_id, value
> FROM system.job_info AS payload
> WHERE job_id > 100 AND info_key = 'legacy_payload'
> ORDER BY written desc
> ),
> jobpage AS (
> SELECT id, status, created
> FROM system.jobs
> WHERE (created < '2023-07-20 07:29:01.17001') and (id > 100)
> ORDER BY id
> LIMIT 100
> )
> SELECT distinct (id), latestpayload.value AS payload, status
> FROM jobpage AS j
> INNER JOIN latestpayload ON j.id = latestpayload.job_id;
id | payload | status
-----+---------+---------
(0 rows)
Time: 43ms total (execution 42ms / network 0ms)
Release note: None
Epic: none
0 commit comments