@@ -94,13 +94,17 @@ remember to
9494
9595 * cd tools/server/public_simplechat/local.tools; python3 ./simpleproxy.py --config simpleproxy.json
9696
97- * remember that this is a relatively minimal dumb proxy logic along with optional stripping of non textual
98- content like head, scripts, styles, headers, footers, ... Be careful when accessing web through this and
99- use it only with known safe sites.
97+ * remember that this is a relatively minimal, dumb proxy logic which can fetch html or pdf content and
98+ in turn optionally provide a plain text version of the content by stripping off non-textual / non-core content.
99+ Be careful when accessing the web through this and use it only with known safe sites.
100100
101101 * look into local.tools/simpleproxy.json for specifying
102102
103+ * the white list of allowed.schemes
104+ * you may want to use this to disable local file access and/or disable http access,
105+ and in turn retain only https based urls or so.
103106 * the white list of allowed.domains
107+ * review and update this to match your needs.
104108 * the shared bearer token between server and client ui
105109
106110* other builtin tool / function calls like calculator, javascript runner, DataStore don't require the
@@ -389,15 +393,15 @@ like
389393sessions by getting it to also create and execute mathematical expressions or code to verify
390394such stuff and so.
391395
392- * access content from internet and augment the ai model's context with additional data as
393- needed to help generate better responses. this can also be used for
396+ * access content (including html, pdf, text based...) from local file system or the internet
397+ and augment the ai model's context with additional data as needed to help generate better
398+ responses. This can also be used for
394399 * generating the latest news summary by fetching from news aggregator sites and collating,
395400 organising and summarising the same
396- * searching for specific topics and summarising the results
401+ * searching for specific topics and summarising the search results and/or fetching and
402+ analysing the found data to generate a summary, or to explore / answer queries around that data ...
397403 * or so
398404
399- * one could also augment additional data / info by accessing text content from pdf files
400-
401405* save collated data or generated analysis or more to the provided data store and retrieve
402406them later to augment the analysis / generation then. Also could be used to summarise chat
403407session till a given point and in turn save the summary into the data store and later retrieve
@@ -444,16 +448,18 @@ Either way always remember to cross check the tool requests and generated respon
444448* search_web_text - search for the specified words using the configured search engine and return the
445449plain textual content from the search result page.
446450
451+ * pdf2text - fetch/read specified pdf file and extract its textual content
452+ * this depends on the pypdf python based open source library
453+
447454The above set of web related tool calls works by handshaking with a bundled simple local web proxy
448455(/caching in future) server logic. This helps bypass the CORS restrictions applied when trying to
449456fetch directly from the browser js runtime environment.
450457
451- * pdf2text - fetch/read specified pdf file and extract its textual content
452-
453- * local file access is enabled for this feature, so be careful as to where and under which user id
454- the simple proxy will be run.
458+ Local file access is also enabled for web fetch and pdf tool calls, if one uses the file:/// scheme
459+ in the url, so be careful as to where and under which user id the simple proxy will be run.
455460
456- * this depends on the pypdf python based open source library
461+ * one can always disable local file access by removing 'file' from the list of allowed.schemes in
462+ the simpleproxy.json config file.
457463
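As a rough illustration of how the allowed.schemes and allowed.domains white lists could gate a request, here is a minimal Python sketch. The `ALLOWED` dict layout, the `is_url_allowed` name, and the subdomain matching rule are all assumptions made for this example; the actual key layout in simpleproxy.json and the matching logic in the bundled proxy may differ.

```python
from urllib.parse import urlparse

# Hypothetical in-memory form of the white lists; the real layout of
# simpleproxy.json may differ.
ALLOWED = {
    "schemes": ["https"],        # 'file' and 'http' deliberately left out
    "domains": ["example.com"],  # placeholder domain
}


def is_url_allowed(url: str, allowed: dict = ALLOWED) -> bool:
    """Return True only if the url's scheme and host are both white listed."""
    parts = urlparse(url)
    if parts.scheme not in allowed["schemes"]:
        return False
    if parts.scheme == "file":
        # Local files have no host to compare against the domain list.
        return True
    host = (parts.hostname or "").lower()
    # Assumed rule: accept an exact match or a subdomain of a listed domain.
    return any(host == d or host.endswith("." + d) for d in allowed["domains"])
```

With 'file' absent from the scheme list, a url such as `file:///etc/passwd` is rejected before any file access is attempted.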
458464Implementing some of the tool calls through the simpleproxy.py server and not directly in the browser
459465js env, allows one to isolate the core of these logic within a discardable VM or so, by running the
@@ -463,7 +469,7 @@ Depending on the path specified wrt the proxy server, it executes the correspond
463469urltext path is used (and not urlraw), the logic, in addition to fetching content from the given url, also
464470tries to convert html content into equivalent plain text content to some extent in a simple minded
465471manner by dropping head block as well as all scripts/styles/footers/headers/nav blocks and in turn
466- dropping the html tags.
472+ also dropping the html tags. Similarly for pdf2text.
467473
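The simple minded html to text conversion described above can be sketched with Python's standard `html.parser` module. This is only an illustration of the idea (drop head/script/style/header/footer/nav blocks, then drop the tags); the `TextOnly` class and `html_to_text` function are hypothetical names and not necessarily how simpleproxy.py implements it.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text, skipping anything nested inside the dropped blocks."""

    DROP = {"head", "script", "style", "header", "footer", "nav"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside any dropped block
        self.chunks = []  # text fragments that survive

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.DROP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non empty text only when outside every dropped block.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Tags themselves never reach `handle_data`, so they are dropped implicitly; only the text between them is kept.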
468474The client ui logic does a simple check to see if the bundled simpleproxy is running at specified
469475proxyUrl before enabling these web and related tool calls.
@@ -475,7 +481,8 @@ The bundled simple proxy
475481
476482* it provides for a basic white list of allowed domains to access, to be specified by the end user.
477483 This should help limit web access to a safe set of sites determined by the end user. There is also
478- a provision for shared bearer token to be specified by the end user.
484+ a provision for a shared bearer token to be specified by the end user. One can also control which
485+ url schemes are allowed.
479486
480487* it tries to mimic the client/browser making the request to it by propagating header entries like
481488 user-agent, accept and accept-language from the got request to the generated request during proxying
@@ -572,13 +579,15 @@ users) own data or data of ai model.
572579
573580Trap http response errors and inform the user of the specific error returned by the ai server.
574581
575- Initial go at a pdf2text tool call. For now it allows local pdf files to be read and their text content
576- extracted and passed to ai model for further processing, as decided by ai and end user.
582+ Initial go at a pdf2text tool call. It allows web / local pdf files to be read and their text content
583+ extracted and passed to ai model for further processing, as decided by ai and end user. One could
584+ either work with the full pdf or a subset of adjacent pages.
577585
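To make the "full pdf or a subset of adjacent pages" idea concrete, here is a small Python sketch built on pypdf's `PdfReader`. The function names and the 1-based `start` / `end` parameters are assumptions for illustration only; the parameters actually exposed by the pdf2text tool call may be named and interpreted differently.

```python
def page_span(total: int, start: int = 1, end: int = -1) -> range:
    """Map a 1-based inclusive page span onto 0-based indexes, clamped to the pdf.

    An end below 1 (the default) or beyond the last page means "till the last page".
    """
    if end < 1 or end > total:
        end = total
    start = max(1, start)
    return range(start - 1, end)


def pdf_to_text(source, start: int = 1, end: int = -1) -> str:
    """Extract text from the selected pages of a pdf (path or binary stream)."""
    from pypdf import PdfReader  # third party dependency, imported lazily

    reader = PdfReader(source)
    pages = reader.pages
    return "\n".join(
        (pages[i].extract_text() or "") for i in page_span(len(pages), start, end)
    )
```

Calling `pdf_to_text(path)` would cover the whole document, while `pdf_to_text(path, 2, 3)` would restrict extraction to the second and third pages.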
578586SimpleProxy
579587* Convert from a single monolithic file into a collection of modules.
580588* UrlValidator to cross check scheme and domain of requested urls,
581589 the whitelist in turn picked from the config json
590+ * Helpers to fetch file from local file system or the web, transparently
582591
583592#### ToDo
584593
@@ -594,8 +603,6 @@ same when saved chat is loaded.
594603
595604MAYBE make the settings in general chat session specific, rather than the current global config flow.
596605
597- Provide tool to allow for specified pdf files to be converted to equivalent plain text form, so that ai
598- can be used to work with the content in those PDFs.
599606
600607### Debugging the handshake and beyond
601608