Commit 27913f9

SimpleChatTC:SimpleProxy:Pdf2Text update /cleanup readme
1 parent d2ba5ae commit 27913f9

1 file changed: +26 -19 lines changed

tools/server/public_simplechat/readme.md

Lines changed: 26 additions & 19 deletions
@@ -94,13 +94,17 @@ remember to
 
 * cd tools/server/public_simplechat/local.tools; python3 ./simpleproxy.py --config simpleproxy.json
 
-* remember that this is a relatively minimal dumb proxy logic along with optional stripping of non textual
-content like head, scripts, styles, headers, footers, ... Be careful when accessing web through this and
-use it only with known safe sites.
+* remember that this is a relatively minimal dumb proxy logic which can fetch html or pdf content and
+inturn optionally provide plain text version of the content by stripping off non textual/core contents.
+Be careful when accessing web through this and use it only with known safe sites.
 
 * look into local.tools/simpleproxy.json for specifying
 
+* the white list of allowed.schemes
+* you may want to use this to disable local file access and or disable http access,
+and inturn retaining only https based urls or so.
 * the white list of allowed.domains
+* review and update this to match your needs.
 * the shared bearer token between server and client ui
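
As a rough illustration of what the allowed.schemes and allowed.domains white lists are meant to gate, here is a minimal sketch of such a check. The two config keys mirror the names used in the readme; the function name `url_allowed`, the dict layout and the exact-match host comparison are assumptions made for this example and are not taken from simpleproxy.py.

```python
from urllib.parse import urlsplit

# Hypothetical config shape; the real simpleproxy.json layout may differ.
CONFIG = {
    "allowed.schemes": ["https"],       # 'file' and 'http' left out on purpose
    "allowed.domains": ["example.com"],
}


def url_allowed(url: str, config: dict) -> bool:
    """Return True only if the url's scheme and host are both white listed."""
    parts = urlsplit(url)
    if parts.scheme not in config["allowed.schemes"]:
        return False
    if parts.scheme == "file":
        # Local files have no host to check, so the scheme white list decides.
        return True
    # hostname is None for urls without a host, which never matches the list.
    return parts.hostname in config["allowed.domains"]
```

With this config, `url_allowed("file:///etc/passwd", CONFIG)` is rejected because 'file' is not among the allowed schemes, which is the effect the readme describes for disabling local file access.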
 
 * other builtin tool / function calls like calculator, javascript runner, DataStore dont require the
@@ -389,15 +393,15 @@ like
 sessions by getting it to also create and execute mathematical expressions or code to verify
 such stuff and so.
 
-* access content from internet and augment the ai model's context with additional data as
-needed to help generate better responses. this can also be used for
+* access content (including html, pdf, text based...) from local file system or the internet
+and augment the ai model's context with additional data as needed to help generate better
+responses. This can also be used for
 * generating the latest news summary by fetching from news aggregator sites and collating
 organising and summarising the same
-* searching for specific topics and summarising the results
+* searching for specific topics and summarising the search results and or fetching and
+analysing found data to generate summary or to explore / answer queries around that data ...
 * or so
 
-* one could also augment additional data / info by accessing text content from pdf files
-
 * save collated data or generated analysis or more to the provided data store and retrieve
 them later to augment the analysis / generation then. Also could be used to summarise chat
 session till a given point and inturn save the summary into data store and later retrieve
@@ -444,16 +448,18 @@ Either way always remember to cross check the tool requests and generated respon
 * search_web_text - search for the specified words using the configured search engine and return the
 plain textual content from the search result page.
 
+* pdf2text - fetch/read specified pdf file and extract its textual content
+* this depends on the pypdf python based open source library
+
 the above set of web related tool calls work by handshaking with a bundled simple local web proxy
 (/caching in future) server logic, this helps bypass the CORS restrictions applied if trying to
 directly fetch from the browser js runtime environment.
 
-* pdf2text - fetch/read specified pdf file and extract its textual content
-
-* local file access is enabled for this feature, so be careful as to where and under which user id
-the simple proxy will be run.
+Local file access is also enabled for web fetch and pdf tool calls, if one uses the file:/// scheme
+in the url, so be careful as to where and under which user id the simple proxy will be run.
 
-* this depends on the pypdf python based open source library
+* one can always disable local file access by removing 'file' from the list of allowed.schemes in
+simpleproxy.json config file.
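
To give a feel for what a pypdf based pdf2text step could look like, here is a minimal sketch. It is not the code in simpleproxy.py: the function names `page_span` and `pdf_to_text`, the 1-based `start` / `end` parameters and the "-1 means last page" convention are all assumptions made for this example. The page window reflects the note elsewhere in this commit that one could work with the full pdf or a subset of adjacent pages.

```python
def page_span(total: int, start: int = 1, end: int = -1) -> range:
    """Clamp a 1-based inclusive page window to the document.

    end=-1 (or any end past the last page) means "up to the last page".
    Returns 0-based page indices; the range is empty if start is past end.
    """
    if end < 0 or end > total:
        end = total
    start = max(1, start)
    return range(start - 1, end)


def pdf_to_text(source, start: int = 1, end: int = -1) -> str:
    """Extract plain text from a pdf given as a path or a binary stream."""
    # Third-party dependency named in the readme: pip install pypdf
    from pypdf import PdfReader

    reader = PdfReader(source)
    pages = reader.pages
    # extract_text() can come back empty for image-only pages.
    return "\n".join(
        pages[i].extract_text() or "" for i in page_span(len(pages), start, end)
    )
```

For instance `pdf_to_text("paper.pdf", 3, 5)` would return the text of pages 3 to 5 only, where paper.pdf is a placeholder file name.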
 
 Implementing some of the tool calls through the simpleproxy.py server and not directly in the browser
 js env, allows one to isolate the core of these logic within a discardable VM or so, by running the
@@ -463,7 +469,7 @@ Depending on the path specified wrt the proxy server, it executes the correspond
 urltext path is used (and not urlraw), the logic in addition to fetching content from given url, it
 tries to convert html content into equivalent plain text content to some extent in a simple minded
 manner by dropping head block as well as all scripts/styles/footers/headers/nav blocks and inturn
-dropping the html tags.
+also dropping the html tags. Similarly for pdf2text.
 
 The client ui logic does a simple check to see if the bundled simpleproxy is running at specified
 proxyUrl before enabling these web and related tool calls.
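
The urltext style conversion described in this hunk (drop the head block and the script/style/footer/header/nav blocks, then drop the tags) can be sketched with the standard library html.parser module. This is only an illustration of the idea; the class name `TextOnly`, the helper `html_to_text` and the choice to join text chunks with newlines are assumptions, not the readme's actual implementation.

```python
from html.parser import HTMLParser

# Blocks whose whole content gets dropped, as listed in the readme.
SKIP_TAGS = {"head", "script", "style", "header", "footer", "nav"}


class TextOnly(HTMLParser):
    """Collect text that sits outside of any skipped block."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside at least one skipped block
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Tags never reach this method, so keeping only data drops them.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Being this simple minded, the sketch relies on the skipped blocks being properly closed; an unclosed nav or header would swallow the rest of the page.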
@@ -475,7 +481,8 @@ The bundled simple proxy
 
 * it provides for a basic white list of allowed domains to access, to be specified by the end user.
 This should help limit web access to a safe set of sites determined by the end user. There is also
-a provision for shared bearer token to be specified by the end user.
+a provision for shared bearer token to be specified by the end user. One could even control what
+schemes are supported wrt the urls.
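
The readme does not spell out how the shared bearer token is checked, so the following is only a sketch of one common way to do it. It assumes the client sends the token in an `Authorization: Bearer <token>` header; that header layout and the function name `bearer_ok` are assumptions for this example.

```python
import hmac


def bearer_ok(headers: dict, shared_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header against the shared token."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(token.strip().encode(), shared_token.encode())
```

A request without the header, or with a different token, fails the check and can be refused before any url is fetched.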
 
 * it tries to mimic the client/browser making the request to it by propogating header entries like
 user-agent, accept and accept-language from the got request to the generated request during proxying
@@ -572,13 +579,15 @@ users) own data or data of ai model.
 
 Trap http response errors and inform user the specific error returned by ai server.
 
-Initial go at a pdf2text tool call. For now it allows local pdf files to be read and their text content
-extracted and passed to ai model for further processing, as decided by ai and end user.
+Initial go at a pdf2text tool call. It allows web / local pdf files to be read and their text content
+extracted and passed to ai model for further processing, as decided by ai and end user. One could
+either work with the full pdf or a subset of adjacent pages.
 
 SimpleProxy
 * Convert from a single monolithic file into a collection of modules.
 * UrlValidator to cross check scheme and domain of requested urls,
 the whitelist inturn picked from config json
+* Helpers to fetch file from local file system or the web, transparently
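
One way such a transparent fetch helper could work is to lean on urllib.request.urlopen from the standard library, which handles file:// urls as well as http:// and https:// ones through a single call. The sketch below is an assumption about the general shape, not the helper added in this commit; the name `fetch_bytes` and the returned tuple are made up for the example.

```python
import urllib.request


def fetch_bytes(url: str, timeout: float = 10.0):
    """Fetch raw bytes plus the reported content type from a file or web url.

    The url is assumed to have already passed the scheme / domain white
    list check before it gets here.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
        content_type = resp.headers.get_content_type()
    return data, content_type
```

The content type could then be used to decide between the html to text and pdf to text paths, while the caller does not need to care whether the bytes came from disk or from the network.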
 
 #### ToDo
 
@@ -594,8 +603,6 @@ same when saved chat is loaded.
 
 MAYBE make the settings in general chat session specific, rather than the current global config flow.
 
-Provide tool to allow for specified pdf files to be converted to equivalent plain text form, so that ai
-can be used to work with the content in those PDFs.
 
 ### Debuging the handshake and beyond
 