Webpage summarizer with ollama and CLI

This projects generates summaries from documents loaded from URL.

This is largely based on https://github.com/rishabkumar7/langchain-ollama (thanks! 🎩).

Changes I've made:

update ollama invoking to fit current langchain documentation
add passing arguments to the script from the command line
add appending detailed output to a .jsonl (jsonlines) file
add parsing out deepseek's chain of thought (delimited by <think></think> markers)

Advantages of access through CLI: you can write an external bash script and call the summarizing module as many times as you like.

Requirements

Install Ollama and download some open-source models (I tested this with llama3.2, deepseek-r1:14b, deepseek-r1:8b, deepseek-r1:1.5b).

Set up a virtual environment and install the dependencies from requirements.txt.

Example usage

python main.py -i 'https://en.wikipedia.org/wiki/Interlocutor_(linguistics)' -o example_output.jsonl -l deepseek-r1:1.5b

Here's a pretty-printed result of the above command:

{
  "url": "https://en.wikipedia.org/wiki/Interlocutor_(linguistics)",
  "model_used": "deepseek-r1:1.5b",
  "chain_of_thought": "<think>\nOkay, I need to summarize this text concisely. The reference materials discuss \"interlocutor\" and \"routelers,\" which are related to language use and quotation techniques. Here's how I can approach it:\n\n1. Start with the main topic.\n2. Mention key terms like interlocutor and routelers.\n3. Note their role in communication, particularly quotation.\n\nSo, a possible summary would be: The reference materials discuss concepts like \"interlocutor\" and \"routelers,\" examining their roles in language use and quoting techniques.\n</think>",
  "summary": "\n\n**Summary:** The references discuss key terms such as \"interlocutor\" and \"routelers,\" focusing on their roles in language use and quotation techniques."
}

The result of a few runs with different models is saved in example_output.jsonl in this repo.

TODOs

1.

Add a switch to save the date when the summary was retrieved.
Add a switch so you can turn on and off the filtering of deepseek's chain of thought

2.

The script works despite printing the following errors:

USER_AGENT environment variable not set, consider setting it to identify your requests. (always)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used. (sometimes)

Token indices sequence length is longer than the specified maximum sequence length for this model (1480 > 1024). Running this sequence through the model will result in indexing error (with some models)

I haven't tested Rashib's original repo, but I think it would have the same issues.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
example_output.jsonl		example_output.jsonl
main.py		main.py
readme.md		readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Webpage summarizer with ollama and CLI

Requirements

Example usage

TODOs

1.

2.

About

Uh oh!

Releases

Packages

Uh oh!

Languages

adam-sierakowski/llm_summarizer_with_ollama

Folders and files

Latest commit

History

Repository files navigation

Webpage summarizer with ollama and CLI

Requirements

Example usage

TODOs

1.

2.

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages