Gold Digger is a command-line tool for snowballing literature search, starting from a few initial papers. It uses the Semantic Scholar API to find citations and references, and provides a flexible filtering mechanism to narrow down the results.
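As a rough illustration of what snowballing means here, the sketch below expands breadth-first from the initial papers and only keeps expanding from papers that pass the filters. It is not the tool's actual code: `snowball`, `neighbors`, `keep`, and `max_papers` are hypothetical names, and `neighbors` stands in for whatever fetches citations and references from Semantic Scholar.

```python
from collections import deque
from typing import Callable, Iterable


def snowball(
    seeds: Iterable[str],
    neighbors: Callable[[str], Iterable[str]],
    keep: Callable[[str], bool],
    max_papers: int = 100,
) -> list[str]:
    """Breadth-first expansion from seed papers (hypothetical helper).

    Assumption: only papers accepted by `keep` are expanded further.
    """
    queue = deque(seeds)
    seen = set(queue)
    accepted: list[str] = []
    while queue:
        paper_id = queue.popleft()
        for other in neighbors(paper_id):
            if other in seen:
                continue  # never visit the same paper twice
            seen.add(other)
            if keep(other):
                accepted.append(other)
                queue.append(other)
                if len(accepted) >= max_papers:
                    return accepted
    return accepted
```

Passing the fetcher in as a callable keeps the loop testable without network access.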
Clone the repository:
git clone https://github.com/your-username/gold-digger.git
cd gold-digger
Install the required Python packages:
pip install -r requirements.txt
The tool can be used without a Semantic Scholar API key, but it is recommended to use one to get higher rate limits. You can get a key from the Semantic Scholar website.
Once you have a key, you can set it as an environment variable:
export SEMANTIC_SCHOLAR_API_KEY="YOUR_API_KEY"
The llm filter can use the Gemini API for filtering. To use it, you need a Gemini API key. You can get one from Google AI Studio.
Set the API key as an environment variable:
export GEMINI_API_KEY="YOUR_API_KEY"
python main.py --initial-papers <paper_id_1> <paper_id_2> ... --filter <filter_1> --filter <filter_2> ...
Note: If you interrupt the process with Ctrl-C, the papers that have been processed so far will be saved to the output file.
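The Ctrl-C behavior described in the note can be sketched as follows. This is a minimal illustration under assumptions, not the tool's implementation: `process_and_save` and `process` are hypothetical names, and the results are assumed to be JSON-serializable.

```python
import json
from typing import Any, Callable, Iterable


def process_and_save(
    papers: Iterable[Any],
    process: Callable[[Any], Any],
    output_file: str,
) -> list[Any]:
    """Process papers one by one and always write what was finished so far."""
    results: list[Any] = []
    try:
        for paper in papers:
            results.append(process(paper))
    except KeyboardInterrupt:
        # Ctrl-C raises KeyboardInterrupt in the main thread.
        print(f"Interrupted; saving {len(results)} processed papers.")
    finally:
        # Runs on normal completion, on Ctrl-C, and on other errors.
        with open(output_file, "w", encoding="utf-8") as fh:
            json.dump(results, fh, indent=2)
    return results
```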
--initial-papers: A list of initial paper IDs to start the snowballing from. These can be Semantic Scholar Paper IDs, DOIs, or arXiv IDs. (Required, unless using --convert-to-bibtex)
--output-file: The file to save the results to. (Default: snowball_results.json)
--output-format: The format to save the results in. (Choices: json, bibtex; Default: json)
--cache-file: The file to use for caching Semantic Scholar API responses. (Default: semantic_scholar_cache.json)
--batch-size: The maximum number of papers to fetch in a single batch from Semantic Scholar. (Default: 10)
--llm-batch-size: The maximum number of papers to process in a single batch with the LLM filter. (Default: 5)
--retry-on-400: The number of times to retry on HTTP 400 errors. (Default: 0)
--llm-provider: The LLM provider to use for the llm filter. (Choices: gemini-api, gemini-cli; Default: gemini-api)
--gemini-cli-path: The path to the gemini-cli executable. (Default: gemini)
--filter: A filter to apply to the papers. This argument can be used multiple times. (Required, unless using --convert-to-bibtex)
--convert-to-bibtex: Convert an existing JSON results file to BibTeX format. When this option is used, no snowballing is performed.
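One way a repeatable, multi-token option such as --filter can be parsed is with argparse's `action="append"` combined with `nargs="+"`. The sketch below covers only a few of the options listed above and is an assumption about how such a parser could look; `build_parser` and the `filters` destination name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering a subset of the documented options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--initial-papers", nargs="+", default=[])
    parser.add_argument("--output-file", default="snowball_results.json")
    parser.add_argument("--output-format", choices=["json", "bibtex"], default="json")
    parser.add_argument("--batch-size", type=int, default=10)
    parser.add_argument("--llm-batch-size", type=int, default=5)
    # Each --filter occurrence collects its own tokens into one list,
    # and all occurrences are appended in command-line order.
    parser.add_argument("--filter", dest="filters", action="append", nargs="+", default=[])
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.filters)
```

With this setup, `--filter field title "machine learning" --filter year gt 2020` yields two token lists, one per filter.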
Filters are used to narrow down the results of the literature search.
Filters papers by a keyword in a specific field.
Syntax: --filter field <field_name> <keyword>
<field_name>: The field to search in (e.g., title, abstract).
<keyword>: The keyword to search for.
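A field filter of this kind could be as small as the sketch below. It assumes each paper is a dict keyed by field name and that matching is a case-insensitive substring test; both are assumptions, and `field_filter` is a hypothetical name.

```python
def field_filter(paper: dict, field_name: str, keyword: str) -> bool:
    """Return True if `keyword` occurs in the given field (assumed case-insensitive)."""
    value = paper.get(field_name) or ""  # a missing or null field never matches
    return keyword.lower() in str(value).lower()
```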
Filters papers by publication year.
Syntax: --filter year <operator> <year>
<operator>: One of lt (less than), gt (greater than), eq (equal to), le (less than or equal to), ge (greater than or equal to).
<year>: The year to compare against.
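The five operator names map directly onto functions in Python's standard `operator` module. The sketch below shows one possible mapping; `year_filter`, the `year` key, and the choice to reject papers without a year are assumptions.

```python
import operator

# Operator names from the syntax above, mapped to comparison functions.
YEAR_OPERATORS = {
    "lt": operator.lt,
    "gt": operator.gt,
    "eq": operator.eq,
    "le": operator.le,
    "ge": operator.ge,
}


def year_filter(paper: dict, op_name: str, year: str) -> bool:
    """Compare the paper's year against `year` using the named operator."""
    paper_year = paper.get("year")
    if paper_year is None:
        return False  # assumption: papers with unknown year are dropped
    return YEAR_OPERATORS[op_name](int(paper_year), int(year))
```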
Filters papers by author name.
Syntax: --filter author <author_name>
<author_name>: The name of the author to search for.
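An author filter differs from the field filter in that a paper has several authors. The sketch below assumes authors are stored as a list of dicts with a `name` key and that a partial, case-insensitive match is enough; `author_filter` is a hypothetical name.

```python
def author_filter(paper: dict, author_name: str) -> bool:
    """Return True if any author's name contains `author_name` (assumed case-insensitive)."""
    needle = author_name.casefold()
    authors = paper.get("authors") or []
    return any(needle in (author.get("name") or "").casefold() for author in authors)
```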
Filters papers using a custom criterion with an LLM. The papers are processed in batches, which can be controlled with the --llm-batch-size argument.
Syntax: --filter llm <criterion>
<criterion>: The criterion to use for filtering. This will be sent to the LLM.
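The batching controlled by --llm-batch-size can be sketched as below. The LLM call itself is deliberately left out: `judge_batch` is a hypothetical callable that receives the criterion and a batch of papers and returns one boolean per paper. How the tool actually prompts the model is not shown here.

```python
from typing import Any, Callable, Iterator, Sequence


def batched(items: Sequence[Any], size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def llm_filter(
    papers: Sequence[Any],
    criterion: str,
    judge_batch: Callable[[str, Sequence[Any]], Sequence[bool]],
    batch_size: int = 5,
) -> list[Any]:
    """Keep the papers that the (hypothetical) judge accepts, batch by batch."""
    kept: list[Any] = []
    for batch in batched(papers, batch_size):
        verdicts = judge_batch(criterion, batch)
        kept.extend(paper for paper, ok in zip(batch, verdicts) if ok)
    return kept
```

Larger batches mean fewer LLM calls but longer prompts per call.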
Filters papers using a criterion from a file. The papers are processed in batches, which can be controlled with the --llm-batch-size argument.
Syntax: --filter llm_from_file <file_path>
<file_path>: The path to the file containing the criterion.
You can group filters with OR logic using or_start and or_end.
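One plausible reading of the grouping is that filters inside an or_start/or_end pair are combined with OR, while everything at the top level is combined with AND. The sketch below implements that reading; it is an assumption about the semantics, and `group_filters`, `matches`, and `check` are hypothetical names.

```python
from typing import Any, Callable


def group_filters(specs: list[list[str]]) -> list[list[list[str]]]:
    """Split a flat list of filter specs into OR-groups.

    A filter outside or_start/or_end becomes a group of its own.
    """
    groups: list[list[list[str]]] = []
    current = None  # filters collected since the last or_start, if any
    for spec in specs:
        if spec == ["or_start"]:
            if current is not None:
                raise ValueError("nested or_start is not supported in this sketch")
            current = []
        elif spec == ["or_end"]:
            if current is None:
                raise ValueError("or_end without or_start")
            groups.append(current)
            current = None
        elif current is not None:
            current.append(spec)
        else:
            groups.append([spec])
    if current is not None:
        raise ValueError("or_start without or_end")
    return groups


def matches(paper: Any, groups: list, check: Callable[[Any, list], bool]) -> bool:
    """AND across groups, OR within each group."""
    return all(any(check(paper, spec) for spec in group) for group in groups)
```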
Syntax:
--filter or_start \
--filter <filter_1> \
--filter <filter_2> \
... \
--filter or_end
You can convert an existing JSON results file to BibTeX format using the --convert-to-bibtex argument.
python main.py --convert-to-bibtex snowball_results.json
This will create a snowball_results.bib file in the same directory.
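The output name follows from swapping the file extension, which `pathlib` does directly. The entry formatter below is only a sketch: the JSON field names (`paperId`, `title`, `authors`, `year`) and the `@misc` entry type are assumptions about the results file, and no BibTeX escaping is performed.

```python
from pathlib import Path


def bibtex_path(json_path: str) -> Path:
    """Same directory and stem as the JSON file, with a .bib extension."""
    return Path(json_path).with_suffix(".bib")


def to_bibtex_entry(paper: dict) -> str:
    """Format one paper as a BibTeX entry (assumed field names, no escaping)."""
    key = paper.get("paperId", "unknown")
    authors = " and ".join(a["name"] for a in paper.get("authors", []))
    fields = {"title": paper.get("title"), "author": authors, "year": paper.get("year")}
    # Skip empty fields so the entry stays valid.
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items() if value)
    return f"@misc{{{key},\n{body}\n}}"
```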
Find papers with "machine learning" in the title, published after 2020.
python main.py \
--initial-papers 10.1109/CVPR.2016.90 \
--filter field title "machine learning" \
--filter year gt 2020
Find papers with "attention" in the title OR "transformer" in the abstract.
python main.py \
--initial-papers 10.1109/CVPR.2016.90 \
--filter or_start \
--filter field title "attention" \
--filter field abstract "transformer" \
--filter or_end
Find papers that are relevant to "explainable AI".
python main.py \
--initial-papers 10.1109/CVPR.2016.90 \
--filter llm "papers relevant to explainable AI"
The same search, using the gemini-cli provider instead of the Gemini API:
python main.py \
--initial-papers 10.1109/CVPR.2016.90 \
--llm-provider gemini-cli \
--filter llm "papers relevant to explainable AI"