|
3 | 3 | { |
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | | - "source": "# Refresh RAS Works Counts\n\nRebuilds the `affiliation_strings_lookup_with_counts` table with fresh works counts\nfrom `work_authorships` and institution IDs from the MV (which includes curations).\n\nThis keeps the affiliations dashboard in sync with actual works data.\n\n**Runs after**: Guardrails (needs finalized works data)\n**Feeds**: `sync_affiliation_strings_to_elastic_v2` (ES sync for dashboard)" |
| 6 | + "source": "# Refresh RAS Works Counts\n\nRebuilds the `affiliation_strings_lookup_with_counts` table with fresh works counts\nfrom `OpenAlex_works` and institution IDs from the MV (which includes curations).\n\nUses a MERGE with content hashing to detect changes — only rows with changed data\nget a new `refreshed_at` timestamp, enabling incremental ES sync downstream.\n\n**Runs after**: Guardrails (needs finalized works data)\n**Feeds**: `sync_affiliation_strings_to_elastic_v2` (ES sync for dashboard)" |
7 | 7 | }, |
8 | 8 | { |
9 | 9 | "cell_type": "markdown", |
|
15 | 15 | { |
16 | 16 | "cell_type": "code", |
17 | 17 | "metadata": {}, |
18 | | - "source": "-- Rebuild works counts by exploding authorships from work_authorships.\n-- Uses work_authorships instead of OpenAlex_works for a much faster scan (narrow table).\n-- This replaces the entire table with fresh counts.\nCREATE OR REPLACE TABLE openalex.institutions.affiliation_string_works_counts AS\nSELECT \n raw_aff_string,\n COUNT(DISTINCT w.work_id) as works_count\nFROM openalex.works.work_authorships w\nLATERAL VIEW EXPLODE(authorships) AS authorship\nLATERAL VIEW EXPLODE(authorship.raw_affiliation_strings) AS raw_aff_string\nGROUP BY raw_aff_string", |
| 18 | + "source": "-- Rebuild works counts by exploding authorships from OpenAlex_works.\n-- This replaces the entire counts table with fresh data.\nCREATE OR REPLACE TABLE openalex.institutions.affiliation_string_works_counts AS\nSELECT\n raw_aff_string,\n COUNT(DISTINCT w.id) as works_count\nFROM openalex.works.OpenAlex_works w\nLATERAL VIEW EXPLODE(authorships) AS authorship\nLATERAL VIEW EXPLODE(authorship.raw_affiliation_strings) AS raw_aff_string\nGROUP BY raw_aff_string", |
19 | 19 | "outputs": [], |
20 | 20 | "execution_count": null |
21 | 21 | }, |
|
37 | 37 | { |
38 | 38 | "cell_type": "markdown", |
39 | 39 | "metadata": {}, |
40 | | - "source": [ |
41 | | - "## Step 2: Rebuild lookup with counts\n", |
42 | | - "\n", |
43 | | - "Joins the MV (which has curations applied via 3-layer priority) with fresh counts.\n", |
44 | | - "Only keeps RAS that appear in at least one work." |
45 | | - ] |
| 40 | + "source": "## Step 2: MERGE lookup with counts (hash-based change detection)\n\nBuilds a staging table with a `content_hash` of key fields, then MERGEs into the\ntarget. Only rows where the hash changed get `refreshed_at` updated, enabling\nincremental ES sync. New rows are inserted, and rows no longer in the source are deleted." |
46 | 41 | }, |
47 | 42 | { |
48 | 43 | "cell_type": "code", |
| 44 | + "source": "-- Enable schema auto-merge so the MERGE can add content_hash and refreshed_at\n-- columns to the existing table on first run (they'll start as NULLs).\nSET spark.databricks.delta.schema.autoMerge.enabled = true", |
49 | 45 | "metadata": {}, |
50 | | - "source": [ |
51 | | - "%sql\n", |
52 | | - "CREATE OR REPLACE TABLE openalex.institutions.affiliation_strings_lookup_with_counts AS\n", |
53 | | - "SELECT \n", |
54 | | - " mv.raw_affiliation_string,\n", |
55 | | - " mv.institution_ids AS institution_ids_final,\n", |
56 | | - " mv.model_institution_ids AS institution_ids_from_model,\n", |
57 | | - " mv.institution_ids_override,\n", |
58 | | - " mv.countries,\n", |
59 | | - " mv.source,\n", |
60 | | - " mv.created_datetime,\n", |
61 | | - " mv.updated_datetime,\n", |
62 | | - " c.works_count\n", |
63 | | - "FROM openalex.institutions.raw_affiliation_strings_institutions_mv mv\n", |
64 | | - "INNER JOIN openalex.institutions.affiliation_string_works_counts c\n", |
65 | | - " ON mv.raw_affiliation_string = c.raw_aff_string" |
66 | | - ], |
| 46 | + "execution_count": null, |
67 | 47 | "outputs": [] |
68 | 48 | }, |
69 | 49 | { |
70 | 50 | "cell_type": "code", |
71 | 51 | "metadata": {}, |
72 | | - "source": [ |
73 | | - "%sql\n", |
74 | | - "-- Verify rebuild\n", |
75 | | - "SELECT\n", |
76 | | - " COUNT(*) AS total_rows,\n", |
77 | | - " COUNT(CASE WHEN SIZE(institution_ids_final) > 0 THEN 1 END) AS rows_with_institutions,\n", |
78 | | - " ROUND(COUNT(CASE WHEN SIZE(institution_ids_final) > 0 THEN 1 END) * 100.0 / COUNT(*), 1) AS pct_with_institutions\n", |
79 | | - "FROM openalex.institutions.affiliation_strings_lookup_with_counts" |
80 | | - ], |
| 52 | + "source": "-- Build staging table with content hash for change detection\nCREATE OR REPLACE TABLE openalex.institutions._ras_lookup_staging AS\nSELECT\n mv.raw_affiliation_string,\n mv.institution_ids AS institution_ids_final,\n mv.model_institution_ids AS institution_ids_from_model,\n mv.institution_ids_override,\n mv.countries,\n mv.source,\n mv.created_datetime,\n mv.updated_datetime,\n c.works_count,\n SHA2(TO_JSON(NAMED_STRUCT(\n 'iif', mv.institution_ids,\n 'iim', mv.model_institution_ids,\n 'iio', mv.institution_ids_override,\n 'c', mv.countries,\n 'wc', c.works_count\n )), 256) AS content_hash\nFROM openalex.institutions.raw_affiliation_strings_institutions_mv mv\nINNER JOIN openalex.institutions.affiliation_string_works_counts c\n ON mv.raw_affiliation_string = c.raw_aff_string", |
| 53 | + "outputs": [], |
| 54 | + "execution_count": null |
| 55 | + }, |
| 56 | + { |
| 57 | + "cell_type": "code", |
| 58 | + "source": "-- MERGE with hash-based change detection.\n-- Only updates rows where content actually changed (new refreshed_at).\n-- Inserts new rows, deletes rows no longer in source.\n-- On first run, COALESCE(target.content_hash, '') handles NULLs from schema migration.\nMERGE INTO openalex.institutions.affiliation_strings_lookup_with_counts AS target\nUSING openalex.institutions._ras_lookup_staging AS source\nON target.raw_affiliation_string = source.raw_affiliation_string\nWHEN MATCHED AND COALESCE(target.content_hash, '') <> source.content_hash THEN\n UPDATE SET\n institution_ids_final = source.institution_ids_final,\n institution_ids_from_model = source.institution_ids_from_model,\n institution_ids_override = source.institution_ids_override,\n countries = source.countries,\n source = source.source,\n created_datetime = source.created_datetime,\n updated_datetime = source.updated_datetime,\n works_count = source.works_count,\n content_hash = source.content_hash,\n refreshed_at = CURRENT_TIMESTAMP()\nWHEN NOT MATCHED THEN\n INSERT (raw_affiliation_string, institution_ids_final, institution_ids_from_model,\n institution_ids_override, countries, source, created_datetime, updated_datetime,\n works_count, content_hash, refreshed_at)\n VALUES (source.raw_affiliation_string, source.institution_ids_final, source.institution_ids_from_model,\n source.institution_ids_override, source.countries, source.source, source.created_datetime,\n source.updated_datetime, source.works_count, source.content_hash, CURRENT_TIMESTAMP())\nWHEN NOT MATCHED BY SOURCE THEN DELETE", |
| 59 | + "metadata": {}, |
| 60 | + "execution_count": null, |
| 61 | + "outputs": [] |
| 62 | + }, |
| 63 | + { |
| 64 | + "cell_type": "code", |
| 65 | + "source": "DROP TABLE IF EXISTS openalex.institutions._ras_lookup_staging", |
| 66 | + "metadata": {}, |
| 67 | + "execution_count": null, |
| 68 | + "outputs": [] |
| 69 | + }, |
| 70 | + { |
| 71 | + "cell_type": "code", |
| 72 | + "source": "-- Verify rebuild + change detection stats\nSELECT\n COUNT(*) AS total_rows,\n COUNT(CASE WHEN SIZE(institution_ids_final) > 0 THEN 1 END) AS rows_with_institutions,\n ROUND(COUNT(CASE WHEN SIZE(institution_ids_final) > 0 THEN 1 END) * 100.0 / COUNT(*), 1) AS pct_with_institutions,\n COUNT(CASE WHEN refreshed_at >= CURRENT_DATE() THEN 1 END) AS rows_refreshed_today,\n MIN(refreshed_at) AS oldest_refresh,\n MAX(refreshed_at) AS newest_refresh\nFROM openalex.institutions.affiliation_strings_lookup_with_counts", |
| 73 | + "metadata": {}, |
| 74 | + "execution_count": null, |
81 | 75 | "outputs": [] |
82 | 76 | } |
83 | 77 | ], |
|
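To make the MERGE semantics in Step 2 easier to review, here is a minimal sketch of the same hash-based change detection in plain Python. It is an illustration under stated assumptions, not the notebook's implementation: the helper names (`content_hash`, `merge`, `row`) and the sample affiliation strings are hypothetical, tables are modelled as dicts keyed by `raw_affiliation_string`, and the JSON produced here is not expected to match Spark's `TO_JSON` output byte for byte.

```python
import hashlib
import json
from datetime import datetime, timezone


def content_hash(row):
    """SHA-256 over the same five fields the staging query hashes."""
    payload = {
        "iif": row["institution_ids_final"],
        "iim": row["institution_ids_from_model"],
        "iio": row["institution_ids_override"],
        "c": row["countries"],
        "wc": row["works_count"],
    }
    encoded = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def merge(target, staging, now):
    """Apply staging to target in place and return per-action counts."""
    stats = {"inserted": 0, "updated": 0, "deleted": 0, "unchanged": 0}
    for key, src in staging.items():
        new_hash = content_hash(src)
        existing = target.get(key)
        if existing is None:
            # WHEN NOT MATCHED THEN INSERT
            target[key] = {**src, "content_hash": new_hash, "refreshed_at": now}
            stats["inserted"] += 1
        elif (existing.get("content_hash") or "") != new_hash:
            # WHEN MATCHED AND COALESCE(target.content_hash, '') <> source.content_hash
            target[key] = {**src, "content_hash": new_hash, "refreshed_at": now}
            stats["updated"] += 1
        else:
            # Matched with an identical hash: row and refreshed_at stay untouched.
            stats["unchanged"] += 1
    # WHEN NOT MATCHED BY SOURCE THEN DELETE
    for key in [k for k in target if k not in staging]:
        del target[key]
        stats["deleted"] += 1
    return stats


def row(ids, count):
    """Hypothetical helper that builds a staging row for the example."""
    return {
        "institution_ids_final": ids,
        "institution_ids_from_model": ids,
        "institution_ids_override": [],
        "countries": ["US"],
        "works_count": count,
    }


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 2, tzinfo=timezone.utc)

target = {}
first = merge(target, {"Univ A": row([1], 10), "Univ B": row([2], 5), "Old Lab": row([], 1)}, t0)
second = merge(target, {"Univ A": row([1], 10), "Univ B": row([2], 6), "New Lab": row([3], 2)}, t1)

print(first)
print(second)
print(target["Univ A"]["refreshed_at"], target["Univ B"]["refreshed_at"])
```

In the second run only `Univ B` (changed `works_count`) and `New Lab` (new row) receive the later timestamp, which is what lets a downstream sync filter on `refreshed_at`. The sketch also makes one property of the PR visible: `source`, `created_datetime`, and `updated_datetime` are outside the hash, so a change to those fields alone would not trigger an update.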