SimpleChatTC:WebTools:UrlText:HtmlParser: tag drops - refine

hanishkvc · hanishkvc · commit babfb961db45 · 2025-11-06T15:32:19.000+05:30
Update the initial skeleton wrt the tag drops logic

* had forgotten to convert the object to a json string at the
  client end
* had confused js with python and tried accessing the dict
  elements using . notation rather than [] notation in python.
* once the id filtered tag to be dropped is found, from then on
  track all other tags of the same type (independent of id),
  so that start and end tags can be matched. because the end tag
  callback doesn't get the attributes, all other tags of the same
  type need to be tracked, for proper winding and unwinding to
  find the matching end tag
* remember to reset the tracked drop tag type to None once the
  matching end tag at the same depth is found. this should avoid
  some unnecessary unwinding.
* set/fix the type wrt tagDrops explicitly to the needed depth and
  ensure that both the dummy one and any explicitly received one
  are of the right type.
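
The winding and unwinding described above can be sketched roughly as
below. This is a simplified illustration and not the code in this commit:
DropSketchParser, extract_text and the handle_data logic are hypothetical
names/logic added for the example, only tagDrops, droptagType and
droptagCount mirror the names used in webmagic.py.

```python
import html.parser
from typing import Any, Optional


class DropSketchParser(html.parser.HTMLParser):
    """Hypothetical, cut down stand-in for TextHtmlParser; only the drop logic."""

    def __init__(self, tagDrops: list[dict[str, Any]]):
        super().__init__()
        self.tagDrops = tagDrops
        self.droptagType: Optional[str] = None
        self.droptagCount = 0
        self.text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if self.droptagCount > 0:
            # Already inside a dropped block: wind up on every tag of the
            # same type, whatever its id, so the end tags can be matched.
            if tag == self.droptagType:
                self.droptagCount += 1
            return
        for tagMeta in self.tagDrops:
            if tag != tagMeta['tag']:
                continue
            if ('id', tagMeta['id']) in attrs:
                self.droptagType = tag
                self.droptagCount = 1
                break

    def handle_endtag(self, tag):
        # End tags carry no attributes, so only the type and depth are known.
        if self.droptagType and tag == self.droptagType:
            self.droptagCount -= 1
            if self.droptagCount == 0:
                # Matching end tag found at the same depth: stop tracking.
                self.droptagType = None

    def handle_data(self, data):
        # Assumed for the sketch: keep text only while outside a dropped block.
        if self.droptagCount == 0 and data.strip():
            self.text.append(data.strip())


def extract_text(htmlText: str, tagDrops: list[dict[str, Any]]) -> list[str]:
    parser = DropSketchParser(tagDrops)
    parser.feed(htmlText)
    parser.close()
    return parser.text
```

For example, with a div that has id "header" and contains a nested div,
followed by a second div with some other id, only the text of the second
div should survive when [{'tag': 'div', 'id': 'header'}] is passed as
tagDrops.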

Tested with the duckduckgo search engine; the unneeded div based
header is now dropped from the returned search result.
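
The first two fixes (json string at the client end, [] access in python)
amount to the round trip sketched below, shown fully in python for
illustration. The searchDrops value and the parse_tag_drops helper are
hypothetical; the actual client side is js and the actual default drop
list is not shown in this commit.

```python
import json
from typing import Any, Optional, cast

# Hypothetical value, standing in for what the client keeps in tools.searchDrops
searchDrops = [{'tag': 'div', 'id': 'header'}]

# Equivalent of JSON.stringify(...) on the client: a http header value has to
# be a string, so the list of dicts is serialised before being sent.
headerValue = json.dumps(searchDrops)


def parse_tag_drops(headerValue: Optional[str]) -> list[dict[str, Any]]:
    """Decode the 'urltext-tag-drops' header value; empty list if not given."""
    if not headerValue:
        return []
    return cast(list[dict[str, Any]], json.loads(headerValue))


tagDrops = parse_tag_drops(headerValue)
# json.loads returns plain dicts, so use [] and not . notation
firstTag = tagDrops[0]['tag']
```

Note that cast only informs the type checker; it does not validate the
decoded json at runtime.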
diff --git a/tools/server/public_simplechat/local.tools/webmagic.py b/tools/server/public_simplechat/local.tools/webmagic.py
@@ -9,7 +9,7 @@
 import debug
 import filemagic as mFile
 import json
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any, cast
 
 if TYPE_CHECKING:
     from simpleproxy import ProxyHandler
@@ -93,12 +93,21 @@ class TextHtmlParser(html.parser.HTMLParser):
     html content, that logic wont be triggered, so also such client side dynamic content wont be
     got.
 
+    Supports one to specify a list of tags and their corresponding id attributes, so that contents
+    within such specified blocks will be dropped.
+
+    * this works properly only if the html being processed has proper opening and ending tags
+    around the area of interest.
+    * remember to specify non overlapping tag blocks, if more than one specified for dropping.
+        * this path not tested, but should logically work
+
     This helps return a relatively clean textual representation of the html file/content being parsed.
     """
 
-    def __init__(self, tagDrops: dict):
+    def __init__(self, tagDrops: list[dict[str, Any]]):
         super().__init__()
         self.tagDrops = tagDrops
+        print(f"DBUG:TextHtmlParser:{self.tagDrops}")
         self.inside = {
             'body': False,
             'script': False,
@@ -126,20 +135,27 @@ def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]):
         if tag in self.monitored:
             self.inside[tag] = True
         for tagMeta in self.tagDrops:
-            if tag != tagMeta.tag:
+            if tag != tagMeta['tag']:
+                continue
+            if (self.droptagCount > 0) and (self.droptagType == tag):
+                self.droptagCount += 1
                 continue
             for attr in attrs:
                 if attr[0] != 'id':
                     continue
-                if attr[1] == tagMeta.id:
+                if attr[1] == tagMeta['id']:
                     self.droptagCount += 1
                     self.droptagType = tag
+                    print(f"DBUG:THP:Start:Tag found [{tag}:{attr[1]}]...")
 
     def handle_endtag(self, tag: str):
         if tag in self.monitored:
             self.inside[tag] = False
-        if tag == self.droptagType:
+        if self.droptagType and (tag == self.droptagType):
             self.droptagCount -= 1
+            if self.droptagCount == 0:
+                self.droptagType = None
+                print("DBUG:THP:End:Tag found...")
             if self.droptagCount < 0:
                 self.droptagCount = 0
 
@@ -186,9 +202,9 @@ def handle_urltext(ph: 'ProxyHandler', pr: urllib.parse.ParseResult):
         # Extract Text
         tagDrops = ph.headers.get('urltext-tag-drops')
         if not tagDrops:
-            tagDrops = {}
+            tagDrops = []
         else:
-            tagDrops = json.loads(tagDrops)
+            tagDrops = cast(list[dict[str,Any]], json.loads(tagDrops))
         textHtml = TextHtmlParser(tagDrops)
         textHtml.feed(got.contentData)
         # Send back to client
diff --git a/tools/server/public_simplechat/readme.md b/tools/server/public_simplechat/readme.md
@@ -590,6 +590,13 @@ SimpleProxy updates
 * Helpers to fetch file from local file system or the web, transparently
 * Help check for needed modules before a particular service path is acknowledged as available
   through /aum service path
+* urltext and related - logic to drop contents of specified tag with a given id
+  * allow its use for the web search tool flow
+    * setup wrt default duckduckgo search result urltext plain text cleanup and found working.
+  * this works properly only if the html being processed has proper opening and ending tags
+    around the area of interest.
+  * remember to specify non overlapping tag blocks, if more than one specified for dropping.
+    * this path not tested, but should logically work
 
 Settings/Config default changes
 
diff --git a/tools/server/public_simplechat/toolweb.mjs b/tools/server/public_simplechat/toolweb.mjs
@@ -259,7 +259,7 @@ function searchwebtext_run(chatid, toolcallid, toolname, obj) {
         searchUrl = searchUrl.replace("SEARCHWORDS", encodeURIComponent(obj.words));
         delete(obj.words)
         obj['url'] = searchUrl
-        let headers = { 'urltext-tag-drops': get_gme().tools.searchDrops }
+        let headers = { 'urltext-tag-drops': JSON.stringify(get_gme().tools.searchDrops) }
         return proxyserver_get_anyargs(chatid, toolcallid, toolname, obj, 'urltext', headers);
     }
 }
