|
2 | 2 | "cells": [ |
3 | 3 | { |
4 | 4 | "cell_type": "markdown", |
5 | | - "id": "c4cd74db", |
| 5 | + "id": "7a981ad9", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | | - "# SearchQuery: Short demo and explanation" |
| 8 | + "# Using `search-query` for Literature Searches" |
| 9 | + ] |
| 10 | + }, |
| 11 | + { |
| 12 | + "cell_type": "markdown", |
| 13 | + "id": "4cd37bc8", |
| 14 | + "metadata": {}, |
| 15 | + "source": [ |
| 16 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 17 | + " This notebook demonstrates how the <code>search-query</code> Python package supports reproducible and programmable academic search strategies by organizing the search process around a query object that can be created programmatically or parsed from an existing string or JSON file. Once created, a query can be <i>linted</i> to identify quality defects, such as syntactic errors, <i>translated</i> to adapt the query string to different database syntaxes (e.g., PubMed vs. Web of Science), <i>improved</i> to iteratively refine and strengthen the search formulation, and <i>automated</i> to run API searches within scripts, command-line workflows, or other environments. Throughout, queries can be saved to and loaded from JSON files, supporting versioning, reuse, and collaborative development of search strategies.\n", |
| 18 | + "</p>\n", |
| 19 | + "\n", |
| 20 | + "```mermaid\n", |
| 21 | + "flowchart TD\n", |
| 22 | + " %% External artifact\n", |
| 23 | + " J[(JSON query file)]\n", |
| 24 | + "\n", |
| 25 | + " C[Create a query object]\n", |
| 26 | + " C .-> Q\n", |
| 27 | + " %% Query object as a subgraph\n", |
| 28 | + " subgraph Q[Query object]\n", |
| 29 | + " direction LR\n", |
| 30 | + "\n", |
| 31 | + " %% Any combination, starting right after create\n", |
| 32 | + " Auto[Automate]\n", |
| 33 | + " Trans[Translate]\n", |
| 34 | + " Imp[Improve]\n", |
| 35 | + " Lint[Lint]\n", |
| 36 | + "\n", |
| 37 | + " Imp <--> Lint\n", |
| 38 | + " Trans <--> Auto\n", |
| 39 | + " Imp <--> Auto\n", |
| 40 | + " Lint <--> Trans\n", |
| 41 | + "\n", |
| 42 | + "\n", |
| 43 | + " end\n", |
| 44 | + "\n", |
| 45 | + " %% Interfacing with file via annotated dotted lines (no Save/Load boxes)\n", |
| 46 | + " Q -. \"save\" .-> J\n", |
| 47 | + " J -. \"load\" .-> Q\n", |
| 48 | + "\n", |
| 49 | + " %% ===== Styling to resemble the example =====\n", |
| 50 | + " style J fill:#ffffff,stroke:#333,stroke-width:2px\n", |
| 51 | + " style C fill:#ffffff,stroke:#333,stroke-width:2px\n", |
| 52 | + " style Auto fill:#ffffff,stroke:#333,stroke-width:2px\n", |
| 53 | + " style Trans fill:#ffffff,stroke:#333,stroke-width:2px\n", |
| 54 | + " style Lint fill:#ffffff,stroke:#333,stroke-width:2px\n", |
| 55 | + " style Imp fill:#ffffff,stroke:#333,stroke-width:2px\n", |
| 56 | + " style Q fill:#f5f5f5,stroke:#666,stroke-width:1px\n", |
| 57 | + "```" |
| 58 | + ] |
| 59 | + }, |
| 60 | + { |
| 61 | + "cell_type": "markdown", |
| 62 | + "id": "5903f273", |
| 63 | + "metadata": {}, |
| 64 | + "source": [ |
| 65 | + "## Installation (if needed)\n", |
| 66 | + "\n", |
| 67 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 68 | + "The <code>search-query</code> package should be installed automatically in Binder.\n", |
| 69 | + "If you run this notebook locally and do not have <code>search-query</code> installed, uncomment and run the next cell.\n", |
| 70 | + "</p>\n" |
9 | 71 | ] |
10 | 72 | }, |
11 | 73 | { |
12 | 74 | "cell_type": "code", |
13 | 75 | "execution_count": null, |
14 | | - "id": "7bb802b1", |
| 76 | + "id": "cc760da2", |
15 | 77 | "metadata": {}, |
16 | 78 | "outputs": [], |
17 | 79 | "source": [ |
18 | | - "!pip install git+https://github.com/CoLRev-Environment/search-query.git" |
| 80 | + "# !pip install search-query" |
| 81 | + ] |
| 82 | + }, |
| 83 | + { |
| 84 | + "cell_type": "markdown", |
| 85 | + "id": "c2aef83a", |
| 86 | + "metadata": {}, |
| 87 | + "source": [ |
| 88 | + "## Create a query object\n", |
| 89 | + "\n", |
| 90 | + "To create a query object, there are two options: a) create a query programmatically, or b) parse a query from a string:" |
19 | 91 | ] |
20 | 92 | }, |
21 | 93 | { |
22 | 94 | "cell_type": "markdown", |
23 | | - "id": "fb887d30", |
| 95 | + "id": "226d2ba5", |
24 | 96 | "metadata": {}, |
25 | 97 | "source": [ |
26 | | - "Text..." |
| 98 | + "### a) Programmatically\n", |
| 99 | + "\n" |
27 | 100 | ] |
28 | 101 | }, |
29 | 102 | { |
30 | 103 | "cell_type": "code", |
31 | | - "execution_count": 4, |
32 | | - "id": "227d628f", |
33 | | - "metadata": {}, |
34 | | - "outputs": [ |
35 | | - { |
36 | | - "data": { |
37 | | - "text/plain": [ |
38 | | - "'AND[OR[digital[Abstract], virtual[Abstract], online[Abstract]], OR[work[Abstract], labor[Abstract], service[Abstract]]]'" |
39 | | - ] |
40 | | - }, |
41 | | - "execution_count": 4, |
42 | | - "metadata": {}, |
43 | | - "output_type": "execute_result" |
44 | | - } |
45 | | - ], |
| 104 | + "execution_count": null, |
| 105 | + "id": "cfebed7f", |
| 106 | + "metadata": {}, |
| 107 | + "outputs": [], |
46 | 108 | "source": [ |
47 | 109 | "from search_query import OrQuery, AndQuery\n", |
48 | 110 | "\n", |
49 | | - "# Typical building-blocks approach\n", |
50 | | - "digital_synonyms = OrQuery([\"digital\", \"virtual\", \"online\"], field=\"Abstract\")\n", |
51 | | - "work_synonyms = OrQuery([\"work\", \"labor\", \"service\"], field=\"Abstract\")\n", |
52 | | - "query = AndQuery([digital_synonyms, work_synonyms], field=\"Author Keywords\")\n", |
53 | | - "query.to_string()" |
| 111 | + "digital_synonyms = OrQuery([\"digital\", \"virtual\", \"online\"], field=\"abstract\")\n", |
| 112 | + "work_synonyms = OrQuery([\"work\", \"labor\", \"service\"], field=\"abstract\")\n", |
| 113 | + "\n", |
| 114 | + "query = AndQuery([digital_synonyms, work_synonyms])\n", |
| 115 | + "\n", |
| 116 | + "print(query.to_string())" |
| 117 | + ] |
| 118 | + }, |
| 119 | + { |
| 120 | + "cell_type": "markdown", |
| 121 | + "id": "bf7a0806", |
| 122 | + "metadata": {}, |
| 123 | + "source": [ |
| 124 | + "When building queries programmatically, use **canonical generic field tokens** (e.g., `abstract`, `title`, `keywords`)." |
| 125 | + ] |
| 126 | + }, |
| 127 | + { |
| 128 | + "cell_type": "markdown", |
| 129 | + "id": "f45e0c86", |
| 130 | + "metadata": {}, |
| 131 | + "source": [ |
| 132 | + "### b) Parse from a string\n", |
| 133 | + "\n", |
| 134 | + "`parse()` reads a query string written in a platform's syntax and returns a query object.\n", |
| 135 | + "\n", |
| 136 | + "Example PubMed query:\n" |
| 137 | + ] |
| 138 | + }, |
| 139 | + { |
| 140 | + "cell_type": "code", |
| 141 | + "execution_count": null, |
| 142 | + "id": "5410ef0b", |
| 143 | + "metadata": {}, |
| 144 | + "outputs": [], |
| 145 | + "source": [ |
| 146 | + "from search_query.parser import parse\n", |
| 147 | + "\n", |
| 148 | + "query_string = '(\"digital health\"[Title/Abstract]) AND (\"privacy\"[Title/Abstract])'\n", |
| 149 | + "pubmed_query = parse(query_string, platform=\"pubmed\")\n", |
| 150 | + "\n", |
| 151 | + "# `pubmed_query` is now a Query object that can be translated or rendered.\n", |
| 152 | + "print(pubmed_query.to_string())" |
| 153 | + ] |
| 154 | + }, |
| 155 | + { |
| 156 | + "cell_type": "markdown", |
| 157 | + "id": "ab989817", |
| 158 | + "metadata": {}, |
| 159 | + "source": [ |
| 160 | + "## Lint a query\n", |
| 161 | + "\n", |
| 162 | + "\n", |
| 163 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 164 | + "Search queries are prone to subtle but impactful errors, ranging from unbalanced parentheses to unsupported fields or database-specific constraints. The <code>search-query</code> linters help detect such issues early and provide precise, actionable feedback—covering parsing errors, structural problems, term and field issues, as well as platform-specific constraints (e.g., PubMed or Web of Science).\n", |
| 165 | + "</p>\n", |
| 166 | + "\n", |
| 167 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 168 | + "By limiting fatal errors during exploratory workflows, you can surface these diagnostics without interrupting the overall analysis. This makes it easier to iteratively refine queries, compare variants, and understand quality defects before running searches in external databases.\n", |
| 169 | + "</p>\n", |
| 170 | + "\n", |
| 171 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 172 | + "In this example, we intentionally parse a malformed query (missing a closing parenthesis) to illustrate how the parser reports a fatal parsing error with a clear explanation and location hint. For a full overview of supported lint categories and best-practice checks, including parsing errors, query structure errors, term and field errors, database-specific constraints, and quality warnings, see the <a href=\"https://colrev-environment.github.io/search-query/lint/index.html\">Lint documentation</a>.\n", |
| 173 | + "</p>\n" |
| 174 | + ] |
| 175 | + }, |
| 176 | + { |
| 177 | + "cell_type": "code", |
| 178 | + "execution_count": null, |
| 179 | + "id": "b2a42b4e", |
| 180 | + "metadata": {}, |
| 181 | + "outputs": [], |
| 182 | + "source": [ |
| 183 | + "bad_query = '(\"digital health\"[Title/Abstract]) AND (\"privacy\"[Title/Abstract]'\n", |
| 184 | + "\n", |
| 185 | + "try:\n", |
| 186 | + " parse(bad_query, platform=\"pubmed\")\n", |
| 187 | + " print(\"❌ Unexpected: bad query parsed without a fatal error.\")\n", |
| 188 | + "except Exception as exc:\n", |
| 189 | + " print(f\"\\n✅ Linter demo: parse() raised an error (expected): {type(exc).__name__}: {exc}\")" |
| 190 | + ] |
| 191 | + }, |
| 192 | + { |
| 193 | + "cell_type": "markdown", |
| 194 | + "id": "c6f2e553", |
| 195 | + "metadata": {}, |
| 196 | + "source": [ |
| 197 | + "## Translate a query\n", |
| 198 | + "\n", |
| 199 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 200 | + "Systematic literature searches typically involve multiple databases (e.g., PubMed, Web of Science, EBSCOhost). Because each platform uses its own query syntax and field conventions, search strategies need to be translated and adapted accordingly to ensure comparable retrieval across sources.\n", |
| 201 | + "</p>\n", |
| 202 | + "\n", |
| 203 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 204 | + "Here, we translate a parsed PubMed query to Web of Science syntax. Depending on semantics, some fields may expand during translation—for example, PubMed <code>[Title/Abstract]</code> can map to <code>TI=</code> OR <code>AB=</code> in Web of Science.\n", |
| 205 | + "</p>" |
| 206 | + ] |
| 207 | + }, |
| 208 | + { |
| 209 | + "cell_type": "code", |
| 210 | + "execution_count": null, |
| 211 | + "id": "2d6cb657", |
| 212 | + "metadata": {}, |
| 213 | + "outputs": [], |
| 214 | + "source": [ |
| 215 | + "query_string = '(\"digital health\"[Title/Abstract]) AND (\"privacy\"[Title/Abstract])'\n", |
| 216 | + "pubmed_query = parse(query_string, platform=\"pubmed\")\n", |
| 217 | + "\n", |
| 218 | + "wos_query = pubmed_query.translate(target_syntax=\"wos\")\n", |
| 219 | + "\n", |
| 220 | + "print(wos_query.to_string())" |
| 221 | + ] |
| 222 | + }, |
| 223 | + { |
| 224 | + "cell_type": "markdown", |
| 225 | + "id": "ac1479b0", |
| 226 | + "metadata": {}, |
| 227 | + "source": [ |
| 228 | + "## Improve and automate a query\n", |
| 229 | + "\n", |
| 230 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 231 | + "Programmatic access to search queries enables a wide range of use cases related to both <strong>query improvement</strong> and <strong>automation</strong>.\n", |
| 232 | + "</p>\n", |
| 233 | + "\n", |
| 234 | + "<div style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 235 | + "<ul>\n", |
| 236 | + "<li><strong>Query improvement</strong> typically focuses on <i>local</i> and exploratory workflows. It may involve systematic modifications—such as query expansion or structural simplification—followed by evaluating query performance on pre-classified datasets. This makes it possible to iteratively refine search queries and assess how different formulations affect recall and precision.</li>\n", |
| 237 | + "<li><strong>Automation</strong>, in contrast, usually targets <i>online</i> workflows and external systems. Typical use cases include retrieving records from APIs (e.g., Crossref) or running multiple query variants against live databases to compare yields across research scopes, date restrictions, field specifications, or keyword combinations. Such experiments help understand, justify, and operationalize search strategies.</li>\n", |
| 238 | + "</ul>\n", |
| 239 | + "</div>\n", |
| 240 | + "\n", |
| 241 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 242 | + "To support these workflows, researchers can write Python code that programmatically interacts with the <code>search-query</code> package and query objects.\n", |
| 243 | + "The documentation provides practical examples for both <a href=\"https://colrev-environment.github.io/search-query/improve.html\">query improvement</a> and <a href=\"https://colrev-environment.github.io/search-query/automate.html\">automation</a>.\n", |
| 244 | + "</p>\n" |
| 245 | + ] |
| 246 | + }, |
| 247 | + { |
| 248 | + "cell_type": "markdown", |
| 249 | + "id": "93630bd1", |
| 250 | + "metadata": {}, |
| 251 | + "source": [ |
| 252 | + "## Save a query JSON file\n", |
| 253 | + "\n", |
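| 254 | + "Saving a query to a JSON search file records the exact search string together with its platform, version, and authors, which supports reproducible workflows and sharing exact search strategies.\n" |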
| 255 | + ] |
| 256 | + }, |
| 257 | + { |
| 258 | + "cell_type": "code", |
| 259 | + "execution_count": null, |
| 260 | + "id": "4dec911f", |
| 261 | + "metadata": {}, |
| 262 | + "outputs": [], |
| 263 | + "source": [ |
| 264 | + "from search_query.parser import parse\n", |
| 265 | + "from search_query import SearchFile\n", |
| 266 | + "\n", |
| 267 | + "query_string = '(\"digital health\"[Title]) AND (\"privacy\"[Title])'\n", |
| 268 | + "pubmed_query = parse(query_string, platform=\"pubmed\")\n", |
| 269 | + "\n", |
| 270 | + "search_file = SearchFile(\n", |
| 271 | + " search_string=pubmed_query.to_string(),\n", |
| 272 | + " platform=\"pubmed\",\n", |
| 273 | + " version=\"1\",\n", |
| 274 | + " authors=[{\"name\": \"Gerit Wagner\"}],\n", |
| 275 | + " record_info={},\n", |
| 276 | + " date={}\n", |
| 277 | + ")\n", |
| 278 | + "\n", |
| 279 | + "out_path = \"pubmed-search-file.json\"\n", |
| 280 | + "search_file.save(out_path)\n", |
| 281 | + "print(f\"✅ Saved: {out_path}\")" |
| 282 | + ] |
| 283 | + }, |
| 284 | + { |
| 285 | + "cell_type": "markdown", |
| 286 | + "id": "62948864", |
| 287 | + "metadata": {}, |
| 288 | + "source": [ |
| 289 | + "## Load a query JSON file\n", |
| 290 | + "\n", |
| 291 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 292 | + "This closes the loop: saving and loading queries enables iterative refinement over time—whether you update a search strategy, version it in a Git repository, or share exact queries with collaborators. The following example shows how to load a previously saved query file.\n", |
| 293 | + "</p>\n", |
| 294 | + "\n" |
| 295 | + ] |
| 296 | + }, |
| 297 | + { |
| 298 | + "cell_type": "code", |
| 299 | + "execution_count": null, |
| 300 | + "id": "5d9ee9c7", |
| 301 | + "metadata": {}, |
| 302 | + "outputs": [], |
| 303 | + "source": [ |
| 304 | + "from search_query.search_file import load_search_file\n", |
| 305 | + "from search_query.parser import parse\n", |
| 306 | + "\n", |
| 307 | + "search = load_search_file(\"pubmed-search-file.json\")\n", |
| 308 | + "query = parse(search.search_string, platform=search.platform)\n", |
| 309 | + "\n", |
| 310 | + "print(\"Loaded platform:\", search.platform)\n", |
| 311 | + "print(query.to_string())" |
| 312 | + ] |
| 313 | + }, |
| 314 | + { |
| 315 | + "cell_type": "markdown", |
| 316 | + "id": "d0b9f3a5", |
| 317 | + "metadata": {}, |
| 318 | + "source": [ |
| 319 | + "---\n", |
| 320 | + "\n", |
| 321 | + "## ✅ Completed — What we learned\n", |
| 322 | + "\n", |
| 323 | + "🎉🎈 You have completed the `search-query` demo notebook — good work! 🎈🎉\n", |
| 324 | + "\n", |
| 325 | + "In this notebook, we walked through the full lifecycle of search queries:\n", |
| 326 | + "\n", |
| 327 | + "- Create queries programmatically or parse them from strings / JSON files \n", |
| 328 | + "- Lint queries to detect quality defects early \n", |
| 329 | + "- Translate queries across platforms (e.g., PubMed ↔ Web of Science) \n", |
| 330 | + "- Save and reload queries as reusable JSON search files \n", |
| 331 | + "\n", |
| 332 | + "<p style=\"max-width: 90ch; line-height: 1.5;\">\n", |
| 333 | + "Together, these steps show how search queries can be treated as first-class, versionable research artifacts—supporting reproducible and transparent literature searches.\n", |
| 334 | + "</p>\n" |
54 | 335 | ] |
55 | 336 | } |
56 | 337 | ], |
57 | 338 | "metadata": { |
58 | 339 | "kernelspec": { |
59 | | - "display_name": "Python 3 (ipykernel)", |
| 340 | + "display_name": "Python 3", |
60 | 341 | "language": "python", |
61 | 342 | "name": "python3" |
62 | 343 | }, |
|
70 | 351 | "name": "python", |
71 | 352 | "nbconvert_exporter": "python", |
72 | 353 | "pygments_lexer": "ipython3", |
73 | | - "version": "3.8.10" |
| 354 | + "version": "3.12.3" |
74 | 355 | } |
75 | 356 | }, |
76 | 357 | "nbformat": 4, |
|