diff --git a/.env-example b/.env-example
new file mode 100644
index 0000000..6345fb2
--- /dev/null
+++ b/.env-example
@@ -0,0 +1,11 @@
+# slack app token (must be set)
+export SLACK_APP_TOKEN="xapp-..."
+# slack bot token (must be set)
+export SLACK_BOT_TOKEN="xoxb-..."
+# openai api key (set if using openai api)
+export OPENAI_API_KEY="sk-..."
+
+# slack bot user id (used when scanning for messages)
+export SLACK_BOT_USERID="U01234567ABC"
+# channel name on slack for scanning
+export SLACK_CHANNEL_NAME="bot-channel-name"
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5665ad0
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,3 @@
+venv/*
+pdfs/*
+.env
\ No newline at end of file
diff --git a/.justfile b/.justfile
new file mode 100644
index 0000000..8889c1d
--- /dev/null
+++ b/.justfile
@@ -0,0 +1,18 @@
+run *FLAGS: activate
+    source ./.env && ipython {{FLAGS}} virtualpi.py pdfs
+
+scan *FLAGS: activate
+    source ./.env && ipython {{FLAGS}} scan_messages.py
+
+activate:
+    source ./venv/bin/activate
+
+clean:
+    rm -f ./pdfs/docs.pkl
+
+setup: clean
+    rm -rf ./venv
+    python -m venv venv
+    just activate
+    pip install -r requirements.txt
+    mkdir -p pdfs
\ No newline at end of file
diff --git a/Dockerfile b/Dockerfile
new file mode 100644
index 0000000..e7ebae9
--- /dev/null
+++ b/Dockerfile
@@ -0,0 +1,6 @@
+FROM python:3.10
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install -r requirements.txt
+COPY virtualpi.py .
+CMD ["ipython","virtualpi.py","./pdfs"]
diff --git a/README.md b/README.md
index bca6329..7651ff1 100644
--- a/README.md
+++ b/README.md
@@ -9,12 +9,19 @@ Why the name? When your Principal Investigator goes on holidays, you need a *Vir
 This work was first inspired by a conversation with the authors of [Galactic ChitChat: Using Large Language Models to Converse with Astronomy Literature](https://arxiv.org/abs/2304.05406), who implemented a similar tool, using a similar software stack.
Virtual PI was first implemented and used for querying documentation for an astronomical instrument, [MAVIS](https://mavis-ao.org/).
 ## Configuration
+#### API keys
+```bash
+# create your .env file:
+cp .env-example .env
+# set your environment variables:
+vim .env
+```
+#### Launching the bot
 To run the script, you require:
- * A directory with the PDFs you wish the expert system to ingest;
+ * A directory with the PDFs you wish the expert system to ingest (e.g., `./pdfs/*.pdf`)
  * A working Python3 environment with the following packages available:
-   * `pip3 install slack_bolt paper-qa==1.2`
-   * NB: At the time of writing the default pip version of paper-qa and its langchain dependency are out of sync, hence requesting version 1.2.
+   * `pip3 install -r requirements.txt`
  * An OpenAI [API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key).
  * You can [Create a new Slack app](https://api.slack.com/tutorials/tracks/responding-to-app-mentions) that is preconfigured with the neccessary permissions by pressing the green 'Create App' button on that link.
  * You can change the name of your app/bot (you'll use this to interact with it on Slack, by editing the 'manifest' file when the option is presented.
@@ -22,17 +29,48 @@ To run the script, you require:
 The three API tokens you have generated should be exported to your shell environment at runtime:
-```
+```bash
 export OPENAI_API_KEY="sk-M...M"
 export SLACK_APP_TOKEN="xapp-1...d"
 export SLACK_BOT_TOKEN="xoxb-2...C"
 ```
+e.g., by `source`ing the `.env` file after modifying it.
 Then you can start the app as follows.
-`python3 virtualpi.py /path/to/your/PDF/directory/`
+```bash
+python3 virtualpi.py /path/to/your/PDF/directory/
+```
+
+#### Recording Reactions
+In some cases, you may wish to gather the reactions to bot messages (e.g., for further optimisation of the bot) by scanning a channel.
+Assuming the `.env` is set up correctly, you can save this data to disk as `bot_messages.json` by running the `scan_messages.py` script:
+```bash
+python scan_messages.py
+```
+
+To get the bot's user id (required in `.env`), find the bot's profile on your slack channel, and copy the id shown (starting with `U...`), e.g.:
+
+
+
-After you run it the first time (when it embeds all of the documents), the script will exit and ask you to restart it (to avoid what appears to be a timeout issue in the Slack libraries).
+### Using [just](https://github.com/casey/just):
+`just` abstracts a few of these setup tasks; see the full set of tasks in the `.justfile`.
+
+After setting API keys (as above), you can create a virtual environment, install dependencies, and create a `./pdfs/` directory, by running:
+```bash
+just setup
+```
+
+Then (after adding your PDFs to `./pdfs/`) you can start the slackbot using:
+```bash
+just run
+```
+
+To record the reactions by scanning a slack channel, set the appropriate `.env` variables and run:
+```bash
+just scan
+```
 ## Saving State
@@ -40,7 +78,13 @@ When the script starts it will check if a pickled version of the dense vector co
 NB: If you add/remove PDFs you will need to remove the state file!
-`rm /path/to/your/PDF/directory/docs.pkl`
+```bash
+rm /path/to/your/PDF/directory/docs.pkl
+```
+or
+```bash
+just clean
+```
 ## Add to Slack Workspace
@@ -55,3 +99,13 @@ By now your app should be happily running. The final step is to actually add it
 An example interaction is shown below:
 ![alt text](images/MAVIS-IMBH.png "Example Slack interaction")
+
+## Docker
+Running with Docker is probably the easiest all-round solution, but can make debugging a bit more tedious. To run with Docker, use:
+```bash
+docker build -t virtualpi:latest .
+docker run --restart=unless-stopped -d -v ./pdfs:/app/pdfs --env-file=./.env virtualpi
+```
+This has the benefit of allowing multiple bots to run on different sets of PDFs.
You can build the image once, then spin up a new container for each bot (changing the `./pdfs` directory and probably `.env`).
+
+Note that for now, the `.env` format is not compatible between `just run` and `docker run`. For Docker, remove the `export` and quotation marks from the `.env` file. TODO: fix this.
diff --git a/images/vpiuid.png b/images/vpiuid.png
new file mode 100644
index 0000000..cfff331
Binary files /dev/null and b/images/vpiuid.png differ
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..8874df8
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,36 @@
+annotated-types==0.6.0
+anyio==4.2.0
+certifi==2024.2.2
+charset-normalizer==3.3.2
+distro==1.9.0
+h11==0.14.0
+html2text==2020.1.16
+httpcore==1.0.2
+httpx==0.26.0
+idna==3.6
+jsonpatch==1.33
+jsonpointer==2.4
+langchain-core==0.1.18
+langchain-openai==0.0.5
+langsmith==0.0.86
+numpy==1.26.4
+openai==1.11.1
+packaging==23.2
+paper-qa==4.0.0rc7
+pycryptodome==3.20.0
+pydantic==2.6.1
+pydantic_core==2.16.2
+pypdf==4.0.1
+PyYAML==6.0.1
+regex==2023.12.25
+requests==2.31.0
+slack-bolt==1.18.1
+slack_sdk==3.26.2
+sniffio==1.3.0
+tenacity==8.2.3
+tiktoken==0.5.2
+tqdm==4.66.1
+typing_extensions==4.9.0
+urllib3==2.2.0
+ipython
+tqdm
\ No newline at end of file
diff --git a/scan_messages.py b/scan_messages.py
new file mode 100644
index 0000000..3524389
--- /dev/null
+++ b/scan_messages.py
@@ -0,0 +1,62 @@
+#!/usr/bin/python3
+
+import os
+from slack_bolt import App
+import json
+
+#Create handle to Slack
+app = App(token=os.environ["SLACK_BOT_TOKEN"])
+bot_userid = os.environ["SLACK_BOT_USERID"]
+channel_name = os.environ["SLACK_CHANNEL_NAME"]
+
+save_to_file = "./bot_messages.json"
+
+channel_id = None
+# Call the conversations.list method using the WebClient
+for result in app.client.conversations_list():
+    if channel_id is not None:
+        break
+    for channel in result["channels"]:
+        if channel["name"] == channel_name:
+            channel_id = channel["id"]
+            #Print result
+            print(f"Found
conversation ID: {channel_id}")
+            break
+
+if channel_id is None:
+    raise ValueError(f"Unable to find channel named: {channel_name:s}")
+
+# Store conversation history
+conversation_history = []
+
+# Call the conversations.history method using the WebClient
+# conversations.history returns the first 100 messages by default
+# These results are paginated, see: https://api.slack.com/methods/conversations.history#pagination
+result = app.client.conversations_history(channel=channel_id)
+while True:
+    conversation_history += result["messages"]
+    if not result.data["has_more"]:
+        break
+    cursor = result.data["response_metadata"]["next_cursor"]
+    result = app.client.conversations_history(channel=channel_id,cursor=cursor)
+
+# Print results
+print(f"{len(conversation_history):d} messages found in {channel_id:s}")
+
+bot_messages = []
+for message in conversation_history:
+    #Skip anything not posted by the bot (some messages carry no "user" key)
+    if message.get("user")!=bot_userid:
+        continue
+    if "subtype" in message and message["subtype"] == "channel_join":
+        continue
+    #Drop bulky fields if present
+    message.pop("blocks",None)
+    message.pop("bot_profile",None)
+    bot_messages.append(message)
+
+# Print results
+print(f"{len(bot_messages):d} bot messages found in {channel_id:s}")
+
+with open(save_to_file,"w") as f:
+    json.dump(bot_messages,f,indent=4)
+
+print(f"finished successfully, saved to {save_to_file:s}")
\ No newline at end of file
diff --git a/virtualpi.py b/virtualpi.py
index 59bbe11..38422c6 100644
--- a/virtualpi.py
+++ b/virtualpi.py
@@ -13,6 +13,9 @@ from paperqa import Docs
 from slack_bolt import App
 from slack_bolt.adapter.socket_mode import SocketModeHandler
+from openai import AsyncOpenAI
+from tqdm import tqdm
+chat = AsyncOpenAI()
 #Create handle to Slack
 app = App(token=os.environ["SLACK_BOT_TOKEN"])
@@ -21,6 +24,7 @@
 #This function is called when a Slack user mentions the bot
 @app.event("app_mention")
 def event_test(say, body):
+    print("received question, working on answer.")
     try:
         #This gets the question text from the user
         user_question=body["event"]["blocks"][0]["elements"][0]["elements"][1]["text"]
@@ -29,11 +33,9 @@ def event_test(say, body):
         answer = docs.query(user_question, k=30, max_sources=10)
         #Print some stuff locally
         print(answer.formatted_answer)
-        for p in answer.passages:
-            print("* %s: %s\n"%(p, answer.passages[p]))
         print("\n\n\n")
-        #Send the answer to Slack
-        say(answer.formatted_answer)
+        #Send the (minimal) answer to Slack
+        say(answer.answer)
     except Exception as e:
         print("Error: %s"%e)
@@ -52,10 +54,14 @@ def event_test(say, body):
 try:
     #Load the pre-pickled document vector if it exists
     with open("%s/docs.pkl"%PAPERDIR, "rb") as f:
-        docs = pickle.load(f)
+        docs = pickle.loads(pickle.load(f))
+        docs.set_client(chat)
     print("Loaded previous state from %s/docs.pkl"%PAPERDIR)
     print(" - remove this file if you change the set of PDFs\n")
-except:
+except FileNotFoundError:
+    docs = None
+
+if docs is None:
     #Couldn't load a pre-picked version
     papers=[]
     filesfound=glob.glob("%s/*"%PAPERDIR)
@@ -70,34 +76,42 @@ def event_test(say, body):
     print("Found %d PDFs in %s"%(len(papers),PAPERDIR))
     #Add each paper in turn to paper-qa/FAISS/OpenAI embedding
-    docs = Docs(llm='gpt-3.5-turbo', summary_llm="davinci")
-    for p in papers:
+    docs = Docs(llm="gpt-3.5-turbo",client=chat)
+    print("Embedding documents")
+    pbar = tqdm(papers,leave=True,desc="")
+    for p in pbar:
         try:
             #Get the base file name to use as the citation
             citation=os.path.split(p)[-1]
-            #Strip off the ".pdf" or ".PDF"
             citation=citation[0:citation.rfind(".")]
             #Embed this doc
-            print("Embedding %s"%citation)
-            docs.add(p, citation=citation, key=citation)
+            pbar.set_description(f"doc={citation:s}")
+            docs.add(p,docname=citation,citation=citation)
         except Exception as e:
             print("Error processing %s: %s"%(p,e))
     try:
         with open("%s/docs.pkl"%PAPERDIR, "wb") as f:
             #Save this state for next time
             print("\nSaving state to file %s/docs.pkl - this may take some time."%PAPERDIR)
-            pickle.dump(docs, f)
+            pickle.dump(pickle.dumps(docs), f)
     except Exception as e:
         print("Couldn't save state into %s - is it writeable?"%PAPERDIR)
         print("Error was: %s"%e)
-        sys.exit(1)
-    finally:
-        #This is only necessary as the Slack handle created above seems to break
-        #during the long delay of embedding and pickling. Some kind of bug?
-        print("State saved okay - please restart program.")
-        sys.exit(1)
+        sys.exit(2)
+
+docs.prompts.qa = ("Write an answer ({answer_length}) "
+    "for the question below based on the provided context. "
+    "If the context provides insufficient information, "
+    'reply "I cannot answer". '
+    "For each part of your answer, indicate which sources most support it "
+    "via valid citation markers at the end of sentences, like (Example2012). "
+    "Answer in an unbiased, comprehensive, and scholarly tone. "
+    "If the question is subjective, provide an opinionated answer in the concluding 1-2 sentences. "
+    "Use Markdown for formatting code or text, and try to use direct quotes to support arguments.\n\n"
+    "{context}\n"
+    "Question: {question}\n"
+    "Answer: ")
 #Set up the Slack interface to start servicing requests
 print("Starting Slack handler - bot is ready to answer your questions!")
 SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
-
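
Note on the README's `TODO: fix this` for `.env` compatibility: the README says that Docker needs the `export` keyword and the quotation marks removed from `.env`. The following is a minimal sketch of how that conversion could be automated; it is not part of this change. The function name `shell_env_to_docker` and the output file name `.env.docker` are hypothetical, and the sketch assumes every assignment fits on one line with no shell expansions.

```python
def shell_env_to_docker(text: str) -> str:
    """Convert shell-style `export KEY="value"` lines to plain `KEY=value` lines.

    Hypothetical helper: assumes one assignment per line and no shell
    expansions (no $VAR, no command substitution, no multi-line values).
    """
    converted = []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and comments pass through unchanged.
        if not stripped or stripped.startswith("#"):
            converted.append(stripped)
            continue
        # Drop a leading `export` keyword if present.
        if stripped.startswith("export "):
            stripped = stripped[len("export "):].strip()
        key, sep, value = stripped.partition("=")
        if not sep:
            # Not an assignment; skip it rather than guess.
            continue
        key = key.strip()
        value = value.strip()
        # Remove one pair of matching surrounding quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        converted.append(f"{key}={value}")
    return "\n".join(converted)


if __name__ == "__main__":
    # Read the shell-style file and write a Docker-style copy next to it.
    with open(".env") as src:
        docker_env = shell_env_to_docker(src.read())
    with open(".env.docker", "w") as dst:
        dst.write(docker_env + "\n")
```

With a file produced this way, the `docker run` command from the README could point `--env-file` at `.env.docker`, while `just run` keeps sourcing the original `.env`.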