Commit d1cdee6

caseydm and claude committed
Add rescrape queue usage docs to taxicab notebook
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 9677ce0 commit d1cdee6

1 file changed (+5, -0 lines changed)

notebooks/scraping/taxicab.ipynb

Lines changed: 5 additions & 0 deletions
@@ -89,6 +89,11 @@
    "execution_count": null,
    "outputs": []
   },
+  {
+   "cell_type": "markdown",
+   "source": "#### Rescrape Queue Usage\n\nTo rescrape records, insert into `openalex.taxicab.rescrape_queue` and trigger the **TaxiCab_Rescrape** job.\n\n**`native_id` format by namespace:**\n- `doi` — bare DOI, not a URL (e.g. `10.1038/s41586-023-06600-9`, not `https://doi.org/10.1038/...`). The URL is constructed automatically as `https://doi.org/{native_id}`.\n- `pmh` — PMH identifier (e.g. `oai:arXiv.org:2301.00001`). URL is looked up from `taxicab_results`.\n- Other namespaces — the identifier as stored in `taxicab_results`. URL is looked up from `taxicab_results`.\n\n**Examples:**\n```sql\n-- Single DOI\nINSERT INTO openalex.taxicab.rescrape_queue (native_id, native_id_namespace)\nVALUES ('10.1038/s41586-023-06600-9', 'doi');\n\n-- Bulk: all DOIs that resolved to linkinghub\nINSERT INTO openalex.taxicab.rescrape_queue (native_id, native_id_namespace)\nSELECT native_id, native_id_namespace\nFROM openalex.taxicab.taxicab_results\nWHERE resolved_url LIKE '%linkinghub%';\n\n-- Bulk: PMH records with errors\nINSERT INTO openalex.taxicab.rescrape_queue (native_id, native_id_namespace)\nSELECT native_id, native_id_namespace\nFROM openalex.taxicab.taxicab_results\nWHERE native_id_namespace = 'pmh' AND error IS NOT NULL;\n```",
+   "metadata": {}
+  },
   {
    "cell_type": "code",
    "source": "# result schema\n\nresults_schema = T.StructType([\n T.StructField(\"taxicab_id\", T.StringType(), True),\n T.StructField(\"url\", T.StringType(), True),\n T.StructField(\"resolved_url\", T.StringType(), True),\n T.StructField(\"status_code\", T.IntegerType(), True),\n T.StructField(\"content_type\", T.StringType(), True),\n T.StructField(\"native_id\", T.StringType(), True),\n T.StructField(\"native_id_namespace\", T.StringType(), True),\n T.StructField(\"s3_path\", T.StringType(), True),\n T.StructField(\"is_soft_block\", T.BooleanType(), True),\n T.StructField(\"created_date\", T.TimestampType(), True),\n T.StructField(\"processed_date\", T.TimestampType(), True),\n T.StructField(\"error\", T.StringType(), True)\n])",

0 commit comments
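
The markdown cell added in this commit requires a bare DOI as `native_id` for the `doi` namespace and says the scrape URL is built as `https://doi.org/{native_id}`. A minimal Python sketch of that rule might look like the following. It is an illustration only, not code from the commit: `to_bare_doi` and `doi_url` are hypothetical helper names, and the set of prefixes stripped (`doi.org`, `dx.doi.org`, `doi:`) is an assumption about what incoming values might look like.

```python
# Hypothetical helpers illustrating the `doi` namespace rule from the
# added markdown cell; they are not part of the taxicab notebook.

# Prefixes we assume may appear on incoming values (an assumption).
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def to_bare_doi(value: str) -> str:
    """Strip a resolver URL or `doi:` prefix and return the bare DOI."""
    text = value.strip()
    lowered = text.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):].strip()
            break
    # DOIs start with the "10." directory indicator.
    if not text.startswith("10."):
        raise ValueError(f"not a DOI: {value!r}")
    return text


def doi_url(native_id: str) -> str:
    """Build the URL the way the cell describes: https://doi.org/{native_id}."""
    return f"https://doi.org/{native_id}"


if __name__ == "__main__":
    bare = to_bare_doi("https://doi.org/10.1038/s41586-023-06600-9")
    print(bare)
    print(doi_url(bare))
```

Normalizing before the `INSERT` keeps URL-form values out of `rescrape_queue`, where, per the cell, they would otherwise be prefixed with `https://doi.org/` a second time.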
