---
title: ScrapGraphAI Roadmap
markmap:
  colorFreezeLevel: 2
  maxWidth: 500
---

# **ScrapGraphAI Roadmap**

## **Short-Term Goals**

- Integration with more LLM APIs

- Test proxy rotation implementation

- Add more search engines inside the SearchInternetNode

- Improve the documentation (ReadTheDocs)
  - [Issue #102](https://github.com/VinciGit00/Scrapegraph-ai/issues/102)

- Create tutorials for the library

## **Medium-Term Goals**

- Node for handling API requests

- Improve SearchGraph to look into the first 5 results of the search engine

- Make scraping more deterministic
  - Create a DOM tree of the website
  - HTML tag text embeddings with tag metadata
  - Study tree forks from the root node
  - How do we use the tag parameters?

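The DOM-tree step above could be sketched with Python's standard-library parser. This is a minimal illustration, not the library's implementation: `DOMTreeBuilder` and `flatten` are hypothetical names, the embedding step itself is left out, and void elements (`<br>`, `<img>`) are not handled.

```python
from html.parser import HTMLParser

class DOMTreeBuilder(HTMLParser):
    """Builds a nested dict tree of tags, attributes and text,
    ready to feed tag text plus metadata into an embedder."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "attrs": {}, "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1]["text"] += data.strip()

def flatten(node, depth=0):
    """Yield (depth, tag, attrs, text) rows: the records one could embed."""
    yield depth, node["tag"], node["attrs"], node["text"]
    for child in node["children"]:
        yield from flatten(child, depth + 1)

builder = DOMTreeBuilder()
builder.feed("<div id='a'><p>hello</p><p class='x'>world</p></div>")
rows = list(flatten(builder.root))
```

Each row pairs a tag's text with its metadata (depth, attributes), which is exactly what a text-embedding-with-tag-metadata index would consume.
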
- Create scraping folder with report
  - Folder contains .scrape files, DOM tree files and the report
  - The report could be an HTML page with scraping speed, costs, LLM info, scraped content and a DOM tree visualization
  - We can use pyecharts with R-markdown

- Scrape multiple pages of the same website
  - Create a new node that instantiates multiple graphs at the same time
  - Make the graphs run in parallel
  - Scrape only the URLs relevant to the user prompt
  - Use the multi-dimensional DOM tree of the website for retrieval
  - [Issue #112](https://github.com/VinciGit00/Scrapegraph-ai/issues/112)

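The "instantiate multiple graphs and run them in parallel" idea could look like the following `asyncio` sketch. `run_graph` is a stand-in, not a real ScrapGraphAI API: a real node would build and execute one per-URL graph instance inside it.

```python
import asyncio

async def run_graph(url: str) -> dict:
    # Stand-in for a per-URL graph run; simulates I/O-bound scraping work.
    await asyncio.sleep(0)
    return {"url": url, "content": f"scraped {url}"}

async def scrape_many(urls: list[str]) -> list[dict]:
    # One graph per URL, all awaited concurrently.
    return await asyncio.gather(*(run_graph(u) for u in urls))

results = asyncio.run(scrape_many(["https://a.example", "https://b.example"]))
```

Because scraping is dominated by network waits, concurrent graphs can overlap their I/O even in a single thread.
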
- Crawler graph
  - Scrape all the URLs with the same domain in all the pages
  - Build many DOM trees and link them together
  - Save the multi-dimensional tree in a file

- Compare two DOM trees to assess their similarity
  - Save the DOM tree of the scraped website in a file, as a cache to compare against the site's future structure
  - Create similarity metrics over multiple DOM trees (the overall tree? only the relevant tag structure?)

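One possible similarity metric for the cached-tree comparison above: Jaccard similarity over the multisets of root-to-node tag paths. This is an assumption, not the roadmap's chosen metric; the "relevant tags only" variant would simply filter the paths first. Trees here are minimal `(tag, children)` tuples.

```python
from collections import Counter

def tag_paths(node, prefix=()):
    """Collect the multiset of root-to-node tag paths; node = (tag, children)."""
    tag, children = node
    path = prefix + (tag,)
    paths = Counter([path])
    for child in children:
        paths += tag_paths(child, path)
    return paths

def similarity(tree_a, tree_b):
    """Jaccard similarity over tag-path multisets: 1.0 means identical structure."""
    a, b = tag_paths(tree_a), tag_paths(tree_b)
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 1.0

# A page before and after one <div> was removed.
page_v1 = ("html", [("body", [("div", [("p", [])]), ("div", [])])])
page_v2 = ("html", [("body", [("div", [("p", [])])])])
score = similarity(page_v1, page_v2)
```

A score close to 1.0 suggests the cached scraping logic is still valid; a low score flags a site redesign.
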
- Nodes for handling authentication
  - Use Selenium or Playwright to handle authentication
  - Pass the cookies to the other nodes

- Node that attaches to an open browser
  - Use Selenium or Playwright to attach to an open browser
  - Navigate inside the browser and scrape the content

- Nodes for taking screenshots and understanding the page layout
  - Use Selenium or Playwright to take screenshots
  - Use an LLM to assess whether it is a block-like page, paragraph-like page, etc.
  - [Issue #88](https://github.com/VinciGit00/Scrapegraph-ai/issues/88)

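The layout-assessment step could be framed as a small classification wrapper around any vision-capable LLM. Everything here is hypothetical: the label set mirrors the roadmap wording, `llm` is just a `callable(prompt, image) -> str`, and the screenshot capture (Selenium/Playwright) is omitted.

```python
def classify_layout(screenshot_png: bytes, llm) -> str:
    """Ask a vision-capable LLM to label a page screenshot's layout.
    Falls back to 'mixed' when the model answers outside the label set."""
    prompt = (
        "Classify this page screenshot's layout as one of: "
        "block-like, paragraph-like, table-like, mixed. "
        "Answer with the label only."
    )
    answer = llm(prompt, screenshot_png).strip().lower()
    labels = {"block-like", "paragraph-like", "table-like", "mixed"}
    return answer if answer in labels else "mixed"

# Stub LLM for illustration only; a real node would wrap a vision model here.
fake_llm = lambda prompt, image: " Block-like "
label = classify_layout(b"\x89PNG...", fake_llm)
```

Constraining the model to a closed label set (and normalizing its answer) keeps the downstream graph logic deterministic even though the model itself is not.
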
## **Long-Term Goals**

- Automatic generation of scraping pipelines from a given prompt

- Create an API for the library

- Fine-tune an LLM for HTML content