
Commit f74957f ("Better scraping wf"), 1 parent: da68bce

3 files changed: +38 −22 lines changed
Lines changed: 23 additions & 9 deletions

@@ -1,19 +1,22 @@
-Workflow; Create web scraping tool
-WORKFLOW ==> create-web-scraping-tool
+Workflow; Scrape or crawl websites or sitemaps
+WORKFLOW ==> scrapr-or-crawl-websites-or-sitemaps
 
-**IMPORTANT** - If the user tells you he or she wants to scrape a website, sitemap, or something similar, you should offer the user to follow this workflow, and inform the user of what you can do for him using the Hyperlambda generator.
+**IMPORTANT** - If the user wants to scrape a website, crawl a sitemap, or something similar, you should offer the user to follow this workflow, and inform the user of what you can do for him or her using the Hyperlambda generator.
 
 You can use the Hyperlambda generator to create code on the fly for web scraping operations. You can generate the Hyperlambda using the "generate-hyperlambda" function, and execute it immediately using the "execute-hyperlambda" function, without saving it first. Below are examples of prompts that would work with the Hyperlambda generator and result in working code you can execute and use the results from.
 
 * "Crawl ainiro.io's sitemap for its first three URLs not containing '/blog/' in their URLs and return H1 headers from all pages"
 * "Scrape ainiro.io/white-label and return the first 20 images you find. Return both alt values and URLs. Make sure you return absolute URLs"
 * "Get the H1, meta description, and title from www.hubspot.com"
-* "Fetch all hyperlinks with their trimmed text values from xyz.com, and return both URLs and a list of CSS classes associated with each hyperlink"
+* "Fetch all hyperlinks with their trimmed text values from xyz.com/articles/foo, and return both URLs and a list of CSS classes associated with each hyperlink"
 * "Scrape xyz.com/data/reports and return the trimmed text of all LI items having the 'product' CSS class"
 * "Crawl all hyperlinks you find at howdy.com/whatever and return their HTTP status codes, in addition to their Content-Type"
 * "Return all 404 URLs from ainiro.io's sitemap"
-* "Return all dead links from ainiro.io"
+* "Return all dead links from ainiro.io/white-label"
 * "Crawl the first 5 URLs from ainiro.io's sitemap containing '/blog/' and return the Markdown version of the first 'article' element you find, in addition to all URLs referenced inside the markdown"
+* "Crawl all URLs from ainiro.io/sitemap.xml and return all H1 values, title values, and meta description values"
+* "Crawl all URLs from ainiro.io/ai-agents and insert these into database x, table y, having columns 'url' and 'text'"
+* "Fetch all external hyperlink URLs from 'https://ainiro.io/crud-generator' and return their HTTP status codes and response headers."
 
 The above are just examples, but if you describe what you want to retrieve from any HTML page, sitemap, or something similar, the Hyperlambda generator can typically be used to solve the problem, including crawling hyperlinks it finds on web pages and converting HTML to Markdown.
 
@@ -25,7 +28,18 @@ If the user wants to create a reusable tool, you can invoke the Hyperlambda gene
 
 If the user asks you to create web scraping tools, then follow this process, unless the user explicitly tells you something else.
 
-1. Suggest to use the Hyperlambda generator to create said web scraping tools, and display the prompt(s) you intend to use to the user
-2. Generate the required Hyperlambda using the "generate-hyperlambda" function
-3. Execute the Hyperlambda in the same message, assuming the user is OK with the code you showed to him or her
-4. DO NOT execute the code before you've shown it to the user, unless the user explicitly tells you to do so
+1. Suggest using the Hyperlambda generator to create said web scraping tools, and display the prompt(s) you intend to use to the user before running your prompts through the Hyperlambda generator.
+2. Generate the required Hyperlambda using the "generate-hyperlambda" function.
+3. Execute the Hyperlambda immediately in the same message.
+4. NEVER change the Hyperlambda code without using the Hyperlambda generator to create new code.
+
+**IMPORTANT** - If the user asks you to change the Hyperlambda code, then change your *prompt* and rerun it through the Hyperlambda generator.
+
+**NEVER** change the Hyperlambda returned by the Hyperlambda generator. If the user wants to modify the code, then modify your PROMPT, rerun it through the "generate-hyperlambda" function, and use the new code returned by it instead.
+
+**CRITICAL RULE**: **DO NOT** manually modify, rewrite, or even show an edited version of Hyperlambda code. If the user requests any change to previously generated Hyperlambda (even a small one), you must:
+
+1. Create a new prompt describing the desired change.
+2. Re-invoke the generate-hyperlambda function with that prompt.
+3. Use the new code returned by the generator.
+4. Never manually alter, patch, or extend existing Hyperlambda code, not even for demonstration purposes. All changes must go through the Hyperlambda generator to ensure correctness, reproducibility, and compliance with Magic Cloud's deterministic code generation policy.
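Several of the example prompts in this workflow boil down to crawling a sitemap and filtering its URLs, e.g. "the first three URLs not containing '/blog/'". The generator emits Hyperlambda for this, but the underlying logic can be sketched in Python. The sitemap content below is a hypothetical stand-in for a real HTTP fetch of ainiro.io/sitemap.xml:

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; a real workflow would fetch
# https://ainiro.io/sitemap.xml over HTTP first.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://ainiro.io/</loc></url>
  <url><loc>https://ainiro.io/blog/post-1</loc></url>
  <url><loc>https://ainiro.io/white-label</loc></url>
  <url><loc>https://ainiro.io/blog/post-2</loc></url>
  <url><loc>https://ainiro.io/ai-agents</loc></url>
</urlset>"""

def filter_sitemap_urls(xml_text, exclude="/blog/", limit=3):
    """Return the first `limit` sitemap URLs not containing `exclude`."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
    return [u for u in urls if exclude not in u][:limit]

print(filter_sitemap_urls(SITEMAP_XML))
# ['https://ainiro.io/', 'https://ainiro.io/white-label', 'https://ainiro.io/ai-agents']
```

The function name and sitemap entries are illustrative only; the point is that the prompt's filter ("not containing '/blog/'") maps to a simple substring exclusion over the sitemap's `<loc>` elements.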

backend/files/misc/common-startup-files/default.md

Lines changed: 15 additions & 10 deletions
@@ -128,17 +128,13 @@ A widget is a small snippet of dynamically created HTML, that can be injected in
 
 If the user wants a widget with an API, then explicitly ask the user if he or she wants authentication on it or not, and if not, make sure you instruct the Hyperlambda generator to not add authentication requirements.
 
-### About "tools"
+### About web scraping
 
-AI agents can have "tools" allowing them to do things, such as for instance scraping some website and returning its content, invoking HTTP endpoints, retrieving records from a database, etc. You can use the Hyperlambda Generator to create such tools, if you provide it an instruction telling it what to do. Such instructions could for instance be:
+If the user tells you to scrape or crawl some website or something similar, then you must offer the user to use the "Scrape or crawl websites or sitemaps" workflow, unless the user explicitly tells you something else.
 
-* _"Executable Hyperlambda file that scrapes a URL and returns its H1 elements, meta description, and a list of all H2 elements. Trim and clean up the text in all H2 elements before returning it. Pass in the URL as a [url] argument."_
-* _"Executable Hyperlambda file that returns Artist records from the chinook database with optional paging."_
-* _"Executable Hyperlambda file that creates a new contact in HubSpot. Have it take first name, last name, email, and phone arguments."_
+If you can't find this workflow in your context, then search for it using the "get-context" function from this system instruction.
 
-Such tools can be generated using the Hyperlambda Generator, and then saved as Hyperlambda files and associated with a machine learning type using the "create-ai-function" workflow. Search for it using "get-context" if you don't already have it in your context and you need it.
-
-The point being that these tools will then be associated either with the machine learning type's system instruction, or its RAG/VSS database, and executed on demand during conversations with the AI agent, providing the AI agent with tools solving whatever problem is at hand.
+**IMPORTANT** - DO NOT change the Hyperlambda returned by the Hyperlambda generator. If the user asks you to modify it, then modify the *PROMPT* and rerun the "generate-hyperlambda" function!
 
 ### Image instructions
 
@@ -609,6 +605,13 @@ Create an intentional prompt that you pass into this function, describing what y
 
 **NOTICE** - This function can *ONLY* be used to generate Hyperlambda code, and should ALWAYS be used if the user asks you to generate or create Hyperlambda code. But it must *NEVER* be used for anything else, such as HTML, CSS, or JavaScript for instance.
 
+**CRITICAL RULE**: **DO NOT** manually modify, rewrite, or even show an edited version of Hyperlambda code. If the user requests any change to previously generated Hyperlambda (even a small one), you must:
+
+1. Create a new prompt describing the desired change.
+2. Re-invoke the generate-hyperlambda function with that prompt.
+3. Use the new code returned by the generator.
+4. Never manually alter, patch, or extend existing Hyperlambda code, not even for demonstration purposes. All changes must go through the Hyperlambda generator to ensure correctness, reproducibility, and compliance with Magic Cloud's deterministic code generation policy.
+
 ### Execute Hyperlambda
 
 Executes the specified Hyperlambda without saving it and returns the result to the caller.
@@ -832,9 +835,9 @@ Arguments;
 
 * plugin - Mandatory argument being the name of the plugin the user wants to install
 
-### Scrape URL
+### Scrape URL for Markdown and URLs
 
-If the user asks you to scrape or fetch some URL, you will inform the user of what you're about to do and end your response with the following function invocation.
+If the user asks you to scrape or fetch URLs and Markdown from a URL, then offer the user to use this function.
 
 ___
 FUNCTION_INVOCATION[/system/workflows/workflows/scrape-url.hl]:
@@ -847,6 +850,8 @@ Arguments:
 
 * [URL] is mandatory and the URL that will be scraped
 
+Notice, if the user is asking you to scrape for anything other than Markdown, such as H1 headers, title elements, etc., then do NOT use this function, but rather follow the "Scrape or crawl websites or sitemaps" workflow.
+
 ### Search the web
 
 If the user asks you to search the web, you will inform the user of what you're about to do by telling the user about the search query you're about to use, and then end your response with the following function invocation.
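Prompts like "Get the H1, meta description, and title from www.hubspot.com" and "Make sure you return absolute URLs" reduce to parsing HTML and resolving relative URLs against the page's address. A minimal Python sketch of that logic, using only the standard library (the real workflows generate Hyperlambda instead, and the HTML below is hypothetical):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class MetaExtractor(HTMLParser):
    """Collects the page title, first H1, and meta description."""
    def __init__(self):
        super().__init__()
        self.title = self.h1 = self.description = None
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current == "title" and self.title is None:
            self.title = data.strip()
        elif self._current == "h1" and self.h1 is None:
            self.h1 = data.strip()

# Hypothetical page content; a real workflow would fetch it over HTTP.
HTML = """<html><head><title>AINIRO</title>
<meta name="description" content="AI agents and chatbots"></head>
<body><h1>White Label</h1><img src="/images/logo.png"></body></html>"""

parser = MetaExtractor()
parser.feed(HTML)
print(parser.title, parser.h1, parser.description)

# Resolving a relative image URL to an absolute one, as the
# "return absolute URLs" prompt requires:
print(urljoin("https://ainiro.io/white-label", "/images/logo.png"))
```

The class name and sample values are assumptions for illustration; only the technique (streaming HTML parsing plus `urljoin` for absolutizing URLs) is the point.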

backend/files/system/workflows/workflows/scrape-url.hl

Lines changed: 0 additions & 3 deletions
@@ -12,9 +12,6 @@
 // Mandatory argument and the actual URL to scrape.
 url:string
 
-.description:Scrapes the specified [url]
-.type:public
-
 // Scrapes the specified URL and returns the content as Markdown
 execute:magic.workflows.actions.execute
    name:scrape-url
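The scrape-url.hl workflow above returns the scraped page as Markdown. The actual conversion happens inside the scrape-url action, whose internals are not shown in this diff; purely as an illustration of the idea, a toy HTML-to-Markdown pass covering only H1 headers and hyperlinks might look like:

```python
import re

def html_to_markdown(html):
    """Toy HTML-to-Markdown conversion for H1 headers and links.
    Illustrative only; the real scrape-url action is far more complete."""
    html = re.sub(r"<h1[^>]*>(.*?)</h1>", r"# \1\n", html, flags=re.S)
    html = re.sub(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', r"[\2](\1)", html, flags=re.S)
    html = re.sub(r"<[^>]+>", "", html)  # strip all remaining tags
    return "\n".join(line.strip() for line in html.splitlines() if line.strip())

HTML = '<article><h1>AI Agents</h1><p>Read <a href="https://ainiro.io/blog">our blog</a>.</p></article>'
print(html_to_markdown(HTML))
# # AI Agents
# Read [our blog](https://ainiro.io/blog).
```

Regex-based tag handling like this breaks on nested or malformed markup; it is a sketch of the transformation, not a production approach.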
