
Welcome to the AlwaysSaved Extractor Service, the backend service that powers the transcription step of your private, searchable knowledge base for long-form media. It downloads uploaded audio and video files from s3, transcribes them, and hands the transcripts off to the rest of the ML/AI Pipeline.

This is the repository for the Extractor Service - Step 3 of the App Flow and the real start of the AlwaysSaved ML/AI Pipeline.

For more information about What is AlwaysSaved and its Key Features, refer to the AlwaysSaved Frontend README.


Table of Contents (TOC)

  • 3rd Party Services Needed
  • Environment and AWS Systems Manager Parameter Store Variables
  • Starting the App
  • File Structure and Service Flow
  • AlwaysSaved System Design / App Flow
  • Created By


3rd Party Services Needed

As a friendly reminder from the AlwaysSaved Frontend, the following AWS Resources should have already been set up for this Extractor Service to work properly:


An Amazon Simple Queue Service Extractor Queue was already created when you spun up the Frontend. You will now need to:

  • Create a second SQS queue, which we call the Embedding Queue, that the Extractor Service uses to send payloads to the Embedding Service so the next part of the ML/AI Pipeline can continue (see Steps 4-5 of the System Design Diagram).

Your newly created Embedding Queue URL will be saved in the AWS Parameter Store (see next section).
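
If you prefer to script it, here is a minimal boto3 sketch for creating the Embedding Queue (the queue name below is hypothetical; any name works as long as the resulting URL ends up in the Parameter Store):

# Minimal sketch (not part of this repo): create the Embedding Queue with boto3
# and print its URL so you can store it in the AWS Parameter Store.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # use your own AWS_REGION

response = sqs.create_queue(QueueName="alwayssaved-embedding-queue")  # hypothetical name
print(response["QueueUrl"])  # save this under /alwayssaved/EMBEDDING_PUSH_QUEUE_URL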


Back to TOC


Environment and AWS Systems Manager Parameter Store Variables

You'll need to create a .env file at the root of this repo. There's only one variable you have to prefill: the region where your AWS s3 Bucket and SQS Queues are located.

AWS_REGION=
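
At startup the service presumably reads this value from the environment; as a rough illustration (assuming the python-dotenv package, which may differ from how the repo actually loads it):

# Hypothetical sketch: load AWS_REGION from the .env file at the repo root.
import os
from dotenv import load_dotenv

load_dotenv()                          # reads .env from the current working directory
AWS_REGION = os.environ["AWS_REGION"]  # e.g. "us-east-1"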

For both development and production, there are many variables that can't be stored in the .env file, so they must be saved ahead of time in the AWS Systems Manager Parameter Store for the app to function.

The following variable keys have their values stored in the Parameter Store:

/alwayssaved/AWS_BUCKET

/alwayssaved/AWS_BUCKET_BASE_URL


/alwayssaved/EXTRACTOR_PUSH_QUEUE_URL

/alwayssaved/EMBEDDING_PUSH_QUEUE_URL


/alwayssaved/MONGO_DB_USER

/alwayssaved/MONGO_DB_PASSWORD

/alwayssaved/MONGO_DB_BASE_URI

/alwayssaved/MONGO_DB_NAME

/alwayssaved/MONGO_DB_CLUSTER_NAME

If you already set up your MongoDB Cluster and s3 Bucket while setting up the AlwaysSaved Frontend, adding those values to the AWS Parameter Store should be easy.

Make sure that the Extractor Service SQS URL gets saved in the Parameter Store under /alwayssaved/EXTRACTOR_PUSH_QUEUE_URL and the newly created Embedding Queue URL gets saved under /alwayssaved/EMBEDDING_PUSH_QUEUE_URL.
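
As an illustration, here is a minimal boto3 sketch (not the repo's own ssm.py) for writing one of these keys and reading it back; the queue URL value is a made-up placeholder:

import boto3

ssm = boto3.client("ssm", region_name="us-east-1")  # use your own AWS_REGION

# Store the Embedding Queue URL under the key the Extractor Service expects.
ssm.put_parameter(
    Name="/alwayssaved/EMBEDDING_PUSH_QUEUE_URL",
    Value="https://sqs.us-east-1.amazonaws.com/123456789012/alwayssaved-embedding-queue",
    Type="String",
    Overwrite=True,
)

# Read it back the way the service would at runtime.
param = ssm.get_parameter(Name="/alwayssaved/EMBEDDING_PUSH_QUEUE_URL")
print(param["Parameter"]["Value"])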

For clarification, your AWS_BUCKET_BASE_URL is simply the URL that points to your bucket in AWS, like so:

https://<AWS_BUCKET_NAME>.s3.amazonaws.com


Back to TOC


Starting the App

We need to use a virtual environment (we use the Pipenv virtualenv management tool) to run the app.

Navigate to the root of the project folder on your computer, open a terminal window there, and run the following commands:

Create the virtual environment:

$ pipenv --python 3.11

Enter the virtual environment:

$ pipenv shell

Install the dependencies in the Pipfile:

$ pipenv install

Start the Extractor Service from the root service.py file:

$ python3 service.py

Back to TOC


File Structure and Service Flow

/
|
|___/services
|    |
|    |__/audio_extractor
|    |
|    |__/audio_transcription
|    |
|    |__/aws
|    | |
|    | |__s3.py
|    | |
|    | |__sqs.py
|    | |
|    | |__ssm.py
|    |
|    |
|    |
|    |__/utils
|      |
|      |__/mongodb
|      |
|      |__/types
|
|
|
|__service.py


The AlwaysSaved ML/AI Pipeline starts here in the Extractor Service, working in concert with the Extractor Queue (see Steps 2-3 of the System Design Diagram).

But first, a brief explanation of the Data Entities of AlwaysSaved and how incoming requests to the Extractor Service arrive from the Frontend.

On the Frontend for v1, users can upload one or more .mp4 video or .mp3 audio files to s3. For each uploaded file, the Frontend creates a File document in MongoDB.

Those newly created File documents are grouped under a newly created MongoDB "Note" document.

When the file uploads to s3 finish, the Frontend sends an SQS Message payload to the Extractor Queue, which the Extractor Service processes. The incoming SQS Message has the following shape:

[
  {
    user_id: string;
    media_uploads: [
      {
       note_id: ObjectID;
       user_id: ObjectID;
       s3_key: string;
      }
    ]
  }
]
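
In Python terms, that payload could be modeled roughly like this (hypothetical type names; the repo's actual definitions live under /services/utils/types):

from typing import List, TypedDict

class MediaUpload(TypedDict):
    note_id: str  # MongoDB ObjectIDs are assumed to arrive serialized as strings
    user_id: str
    s3_key: str

class ExtractorMessage(TypedDict):
    user_id: str
    media_uploads: List[MediaUpload]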


Then, for each media_upload in each incoming SQS Message payload, the Extractor Service will (a rough end-to-end sketch follows this list):

  • Download the media File from s3;
    • If the media download is an .mp4 file, extract the .mp3 audio file from the video;
  • Use the Whisper Model to transcribe the audio and create a .txt file of the transcript;
  • Upload the .txt transcript to s3; and
  • Send an outgoing SQS Message to the Embedding Queue (see Step 4 of System Design Diagram) with the following shape:
  {
      note_id: string;
      file_id: string;
      user_id: string;
      transcript_s3_key: string;
  }
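
Putting those steps together, here is a compressed, hypothetical sketch of the per-upload flow (assuming boto3, ffmpeg on the PATH, and the openai-whisper package; helper names and the s3 key layout are made up, and the real implementation lives in /services/audio_extractor, /services/audio_transcription, and /services/aws):

import json
import subprocess

import boto3
import whisper

s3 = boto3.client("s3", region_name="us-east-1")   # use your own AWS_REGION
sqs = boto3.client("sqs", region_name="us-east-1")
model = whisper.load_model("base")                  # model size is a stand-in

def process_media_upload(upload, bucket, embedding_queue_url, file_id):
    # file_id is assumed to come from the File MongoDB document looked up by s3_key.

    # 1. Download the media file from s3.
    local_path = upload["s3_key"].split("/")[-1]
    s3.download_file(bucket, upload["s3_key"], local_path)

    # 2. If it's an .mp4 video, extract the .mp3 audio track with ffmpeg.
    if local_path.endswith(".mp4"):
        audio_path = local_path.replace(".mp4", ".mp3")
        subprocess.run(["ffmpeg", "-i", local_path, "-vn", audio_path], check=True)
    else:
        audio_path = local_path

    # 3. Transcribe the audio with the Whisper model and write a .txt transcript.
    result = model.transcribe(audio_path)
    transcript_path = audio_path.rsplit(".", 1)[0] + ".txt"
    with open(transcript_path, "w") as f:
        f.write(result["text"])

    # 4. Upload the .txt transcript to s3.
    transcript_s3_key = f"{upload['user_id']}/{transcript_path}"  # key layout is a guess
    s3.upload_file(transcript_path, bucket, transcript_s3_key)

    # 5. Send the outgoing message to the Embedding Queue.
    sqs.send_message(
        QueueUrl=embedding_queue_url,
        MessageBody=json.dumps({
            "note_id": upload["note_id"],
            "file_id": file_id,
            "user_id": upload["user_id"],
            "transcript_s3_key": transcript_s3_key,
        }),
    )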

The next part of the ML/AI Pipeline then moves on to the Embedding Queue and Embedding Service (see Steps 4-5 of System Design Diagram).


Back to TOC


AlwaysSaved System Design / App Flow

Screenshot of AlwaysSaved System Design and App Flow

Above 👆🏽 you will see the entire System Design and App Flow for AlwaysSaved.

If you need a better view of the entire screenshot, feel free to download the Excalidraw File and view the System Design document in Excalidraw.


Back to TOC


Created By

Jaime Mendoza https://github.com/jaimemendozadev
