Audio Transcription and Vectorization System

A comprehensive Python system for processing audio files through Amazon Transcribe and vectorizing the resulting text for semantic search using Zilliz Cloud.

Overview

This system provides an end-to-end solution for:

Monitoring AWS SQS for audio file notifications
Automatically triggering Amazon Transcribe jobs
Extracting text from transcription results stored in S3
Converting text into semantic vectors for similarity search
Storing and querying vectorized content in Zilliz Cloud

Features

🎵 Audio Processing

Automatic Transcription: Monitors SQS queue for new audio files
AWS Integration: Seamless integration with Amazon Transcribe
Format Support: Supports MP4 and other audio formats
Japanese Language: Optimized for Japanese audio transcription

📄 Text Extraction

S3 JSON Reader: Extracts text from JSON files stored in S3
Smart Detection: Automatically detects Amazon Transcribe result format
Generic JSON Support: Can extract text from any JSON structure
Batch Processing: Process multiple files simultaneously

🔍 Vector Search

Semantic Search: Advanced similarity search using sentence transformers
Japanese Text Processing: Optimized for Japanese text vectorization
Chunking: Intelligent text chunking for better search granularity
Zilliz Cloud Integration: Scalable vector database storage

Architecture

[Audio Files] → [SQS Queue] → [Amazon Transcribe] → [S3 JSON Results]
                                                           ↓
[Zilliz Cloud] ← [Vector Database] ← [Text Chunking] ← [Text Extraction]

Installation

Prerequisites

Python 3.8+
AWS Account with appropriate permissions
Zilliz Cloud account

Setup

Clone the repository

git clone <repository-url>
cd Transcribe

Create virtual environment

python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

Install dependencies

pip install -r requirements.txt

Configure environment variables Create a .env file in the src/ directory:

# AWS Configuration
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=ap-northeast-1
SQS_QUEUE_URL=https://sqs.region.amazonaws.com/account/queue-name
TRANSCRIBE_OUTPUT_BUCKET=your-output-bucket

# S3 Configuration
S3_BUCKET_NAME=your-transcribe-results-bucket

# Zilliz Cloud Configuration
ZILLIZ_URI=https://your-zilliz-endpoint
ZILLIZ_TOKEN=your_zilliz_token

Usage

Local Development

1. Audio Transcription Service

Start the transcription service to monitor SQS for new audio files:

cd src
python AmazonTranscribe.py

This service will:

Monitor the configured SQS queue
Automatically start transcription jobs for new audio files
Process messages continuously

2. Text Extraction from S3

Extract text from JSON files stored in S3:

from extract_text_fromS3 import S3JsonTextExtractor

# Initialize extractor
extractor = S3JsonTextExtractor()

# Extract from single file
result = extractor.extract_text_from_s3_json(
    bucket_name="your-bucket",
    object_key="path/to/file.json"
)

# Batch processing
results = extractor.batch_extract_texts(
    bucket_name="your-bucket",
    prefix="transcribe-output/"
)

3. Text Vectorization and Search

Process extracted text and enable semantic search:

from conversation_vectorization import ConversationVectorizer
import os

# Initialize vectorizer
vectorizer = ConversationVectorizer(
    zilliz_uri=os.getenv("ZILLIZ_URI"),
    zilliz_token=os.getenv("ZILLIZ_TOKEN")
)

# Process text
chunks = vectorizer.process_monologue(extracted_text)

# Search similar content
results = vectorizer.search_similar("営業について", limit=5)
for result in results:
    print(f"Text: {result['text'][:100]}...")
    print(f"Score: {result['score']:.3f}\n")

4. AI Chat Server

Start the intelligent chat server that combines Zilliz search with OpenAI:

cd src
python chat_server.py

The chat server provides:

Web Interface: Access at http://localhost:5000
REST API: /api/chat and /api/search endpoints
WebSocket: Real-time chat functionality
RAG System: Retrieval-Augmented Generation using past conversations

Chat Server Features

🔍 Semantic Search: Find relevant conversations using vector similarity
🤖 AI Responses: Generate contextual answers with OpenAI
💬 Real-time Chat: WebSocket-based chat interface
📚 Source Citations: Show which conversations informed the answer
🌐 Multi-interface: Web UI, REST API, and WebSocket support

API Usage Examples

Search for conversations:

curl -X POST http://localhost:5000/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "営業について", "limit": 5}'

Chat with AI:

curl -X POST http://localhost:5000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "AIの活用方法を教えて"}'

Test the Chat Server

# Run comprehensive tests
python src/test_chat_server.py

# Test specific functionality
python -c "
import requests
response = requests.post('http://localhost:5000/api/search',
    json={'query': 'test', 'limit': 3})
print(response.json())
"

5. Complete Pipeline

Run the complete pipeline for processing transcription results:

cd src
python conversation_vectorization.py

Project Structure

Transcribe/
├── src/
│   ├── .env                           # Environment configuration
│   ├── AmazonTranscribe.py           # SQS monitoring and transcription
│   ├── extract_text_fromS3.py       # S3 JSON text extraction
│   └── conversation_vectorization.py # Text vectorization and search
├── requirements.txt                   # Python dependencies
└── README.md                         # This file

Key Components

AmazonTranscribe.py

Purpose: Monitors SQS queue for audio file notifications
Features: Automatic transcription job creation, error handling
Output: Transcribed results stored in specified S3 bucket

extract_text_fromS3.py

Purpose: Extracts text content from JSON files in S3
Features: Auto-detection of Transcribe format, generic JSON support
Methods: Single file and batch processing capabilities

conversation_vectorization.py

Purpose: Converts text to vectors and enables semantic search
Features: Text chunking, Japanese language optimization, similarity search
Integration: Zilliz Cloud for scalable vector storage

Configuration

AWS Permissions Required

Your AWS user/role needs the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "transcribe:StartTranscriptionJob",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": "*"
    }
  ]
}

Zilliz Cloud Setup

Create a Zilliz Cloud account
Create a new cluster
Obtain the connection URI and token
Configure in your .env file

Monitoring and Logging

The system includes comprehensive logging:

INFO: Processing status and progress
ERROR: Failed operations and exceptions
DEBUG: Detailed operation information

Logs are output to console with timestamps and severity levels.

Troubleshooting

Common Issues

SQS Connection Issues
- Verify AWS credentials and region
- Check SQS queue URL format
- Ensure proper IAM permissions
Transcribe Job Failures
- Verify audio file format (MP4 supported)
- Check S3 bucket permissions
- Ensure file is accessible to Transcribe service
Zilliz Connection Issues
- Verify URI and token format
- Check network connectivity
- Ensure cluster is running
Text Extraction Issues
- Verify S3 bucket access permissions
- Check JSON file format
- Ensure proper file encoding (UTF-8)

Performance Considerations

Batch Processing: Use batch operations for multiple files
Chunk Size: Adjust chunk size based on your search requirements
Vector Dimensions: Consider model dimensions for storage optimization
Connection Pooling: Reuse AWS clients when possible

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For issues and questions:

Check the troubleshooting section
Review logs for error details
Create an issue in the repository

Dependencies

Key dependencies include:

boto3: AWS SDK for Python
sentence-transformers: Text embedding models
pymilvus: Zilliz/Milvus vector database client
langchain: Text processing utilities
python-dotenv: Environment variable management

See requirements.txt for complete dependency list.

Docker and ECS Deployment

Docker Build and Run

Build Docker Image

# Build the Docker image
docker build -t transcribe-service:latest .

# Run locally with Docker
docker run -d \
  --name transcribe-service \
  --env-file src/.env \
  transcribe-service:latest

# Run with docker-compose
docker-compose up -d

Docker Commands

# View logs
docker logs transcribe-service -f

# Stop container
docker stop transcribe-service

# Remove container
docker rm transcribe-service

ECS Deployment

Prerequisites

AWS CLI configured with appropriate permissions
Docker installed
ECR repository created

Infrastructure Setup

Deploy infrastructure using CloudFormation:

aws cloudformation deploy \
  --template-file cloudformation-infrastructure.yaml \
  --stack-name transcribe-infrastructure \
  --parameter-overrides \
    Environment=production \
    S3BucketName=your-transcribe-bucket \
    SQSQueueName=your-audio-queue \
  --capabilities CAPABILITY_NAMED_IAM

Store secrets in AWS Systems Manager Parameter Store:

# Store AWS credentials
aws ssm put-parameter \
  --name "/transcribe/aws_access_key_id" \
  --value "YOUR_ACCESS_KEY" \
  --type "SecureString"

aws ssm put-parameter \
  --name "/transcribe/aws_secret_access_key" \
  --value "YOUR_SECRET_KEY" \
  --type "SecureString"

# Store other configuration
aws ssm put-parameter \
  --name "/transcribe/sqs_queue_url" \
  --value "https://sqs.region.amazonaws.com/account/queue-name" \
  --type "String"

aws ssm put-parameter \
  --name "/transcribe/zilliz_uri" \
  --value "https://your-zilliz-endpoint" \
  --type "String"

aws ssm put-parameter \
  --name "/transcribe/zilliz_token" \
  --value "YOUR_ZILLIZ_TOKEN" \
  --type "SecureString"

Deploy to ECS

Make the deployment script executable:

chmod +x deploy-to-ecs.sh

Run the deployment:

./deploy-to-ecs.sh production ap-northeast-1

ECS Service Configuration

CPU: 1024 (1 vCPU)
Memory: 2048 MB (2 GB)
Network: Fargate with public subnets
Auto Scaling: Can be configured based on CPU/memory usage
Health Checks: Built-in application health checks

Monitoring

# View service status
aws ecs describe-services \
  --cluster transcribe-cluster \
  --services transcribe-service

# View logs
aws logs tail /ecs/transcribe-service --follow

# View task details
aws ecs describe-tasks \
  --cluster transcribe-cluster \
  --tasks TASK_ARN

Container Resource Requirements

Component	CPU	Memory	Description
Transcribe Service	1 vCPU	2 GB	SQS monitoring and job creation
Vector Service	2 vCPU	4 GB	Text processing and vectorization

Environment Variables for ECS

The following environment variables are configured via AWS Systems Manager Parameter Store:

Variable	Description	Type
`AWS_ACCESS_KEY_ID`	AWS Access Key	SecureString
`AWS_SECRET_ACCESS_KEY`	AWS Secret Key	SecureString
`AWS_REGION`	AWS Region	String
`SQS_QUEUE_URL`	SQS Queue URL	String
`TRANSCRIBE_OUTPUT_BUCKET`	S3 Output Bucket	String
`S3_BUCKET_NAME`	S3 Bucket for Results	String
`ZILLIZ_URI`	Zilliz Cloud URI	String
`ZILLIZ_TOKEN`	Zilliz Cloud Token	SecureString

AWS CodeBuild CI/CD Pipeline

Overview

This project includes a complete CI/CD pipeline using AWS CodeBuild that automatically builds Docker images and deploys them to ECS when code is pushed to the repository.

Pipeline Features

Automatic Builds: Triggered on git push to specified branch
Multi-stage Docker Builds: Optimized for production
ECR Integration: Automatic image pushing to Amazon ECR
ECS Deployment: Seamless updates to ECS services
Security Scanning: Built-in container vulnerability scanning
Artifact Management: Automated cleanup of old images

Setup CodeBuild Pipeline

1. Quick Setup (Automated)

# Make the setup script executable
chmod +x setup-codebuild.sh

# Run the setup script
./setup-codebuild.sh

2. Manual Setup

Deploy CodeBuild Infrastructure:

aws cloudformation deploy \
  --template-file codebuild-infrastructure.yaml \
  --stack-name transcribe-codebuild \
  --parameter-overrides \
    ProjectName=transcribe-service \
    Environment=production \
    GitHubRepo=https://github.com/your-username/transcribe.git \
    GitHubBranch=main \
  --capabilities CAPABILITY_NAMED_IAM

Create ECR Repository:

aws ecr create-repository \
  --repository-name transcribe-service \
  --image-scanning-configuration scanOnPush=true

Set up Parameters in Parameter Store:

# Core AWS settings
aws ssm put-parameter \
  --name "/transcribe/aws_access_key_id" \
  --value "YOUR_ACCESS_KEY" \
  --type "SecureString"

aws ssm put-parameter \
  --name "/transcribe/aws_secret_access_key" \
  --value "YOUR_SECRET_KEY" \
  --type "SecureString"

# Application settings
aws ssm put-parameter \
  --name "/transcribe/sqs_queue_url" \
  --value "https://sqs.region.amazonaws.com/account/queue" \
  --type "String"

aws ssm put-parameter \
  --name "/transcribe/zilliz_uri" \
  --value "https://your-zilliz-endpoint" \
  --type "String"

aws ssm put-parameter \
  --name "/transcribe/zilliz_token" \
  --value "YOUR_ZILLIZ_TOKEN" \
  --type "SecureString"

Build Specifications

The project includes two buildspec files:

`buildspec.yml` (Basic)

Simple Docker build and push
Suitable for basic CI/CD needs
Minimal configuration required

`buildspec-advanced.yml` (Production)

Comprehensive build with security scanning
Multi-tag strategy (latest, commit hash, build number, timestamp)
Advanced error handling and logging
Production-ready configuration

Build Process

Pre-build Phase
- ECR login and repository validation
- Build environment setup
- Variable configuration
Build Phase
- Docker image building with caching
- Security vulnerability scanning
- Image testing and validation
Post-build Phase
- Multi-tag image pushing to ECR
- ECS task definition updates
- Deployment artifact creation

Monitoring and Management

View Build Status

# List recent builds
aws codebuild list-builds-for-project \
  --project-name transcribe-service

# Get build details
aws codebuild batch-get-builds \
  --ids BUILD_ID

Monitor Build Logs

# Tail build logs in real-time
aws logs tail /aws/codebuild/transcribe-service --follow

# View specific log stream
aws logs get-log-events \
  --log-group-name /aws/codebuild/transcribe-service \
  --log-stream-name LOG_STREAM_NAME

Trigger Manual Builds

# Start a build from the main branch
aws codebuild start-build \
  --project-name transcribe-service

# Start a build from a specific branch
aws codebuild start-build \
  --project-name transcribe-service \
  --source-version feature-branch

Build Artifacts

Each successful build produces:

imagedefinitions.json: ECS container image definitions
ecs-task-definition-final.json: Updated ECS task definition
deployment-summary.json: Build and deployment metadata

Environment Variables

Variable	Source	Description
`AWS_DEFAULT_REGION`	Build Environment	AWS region for deployment
`AWS_ACCOUNT_ID`	Parameter Store	AWS account ID
`IMAGE_REPO_NAME`	Build Environment	ECR repository name
`ECS_CLUSTER_NAME`	Parameter Store	Target ECS cluster
`ECS_SERVICE_NAME`	Parameter Store	Target ECS service

Security Features

Container Scanning: Trivy security scanner integration
IAM Least Privilege: Minimal required permissions
Secrets Management: Parameter Store for sensitive data
Image Signing: Optional container image signing
Vulnerability Reports: Security scan results in CodeBuild reports

Cost Optimization

Build Caching: Docker layer and pip package caching
Lifecycle Policies: Automatic cleanup of old ECR images
Compute Optimization: Right-sized build instances
Artifact Retention: 30-day artifact lifecycle

Troubleshooting

Common Issues

Build Fails with ECR Login Error

# Check IAM permissions for ECR
aws ecr get-login-password --region us-east-1

Parameter Store Access Denied

# Verify parameter exists and permissions
aws ssm get-parameter --name "/transcribe/parameter-name"

Docker Build Out of Space
- Enable Docker layer caching in buildspec
- Use multi-stage builds to reduce image size
ECS Deployment Fails
- Check ECS service and task definition compatibility
- Verify IAM roles for ECS tasks

Debug Commands

# Check CodeBuild project configuration
aws codebuild batch-get-projects --names transcribe-service

# View parameter store values
aws ssm get-parameters-by-path --path "/transcribe" --recursive

# Check ECR repository
aws ecr describe-repositories --repository-names transcribe-service

# View ECS service status
aws ecs describe-services --cluster transcribe-cluster --services transcribe-service

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
.serena		.serena
lambda		lambda
scripts		scripts
src		src
typescript		typescript
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
CHAT_SERVER_README.md		CHAT_SERVER_README.md
DOCKER_README.md		DOCKER_README.md
Dockerfile		Dockerfile
Dockerfile.chat		Dockerfile.chat
Dockerfile.chat_server		Dockerfile.chat_server
Dockerfile.dev		Dockerfile.dev
Dockerfile.production		Dockerfile.production
Jenkinsfile		Jenkinsfile
MEMORY_SYSTEM.md		MEMORY_SYSTEM.md
README.md		README.md
README_HUGGINGFACE.md		README_HUGGINGFACE.md
buildspec-advanced.yml		buildspec-advanced.yml
buildspec.yml		buildspec.yml
cloudformation-infrastructure.yaml		cloudformation-infrastructure.yaml
codebuild-infrastructure.yaml		codebuild-infrastructure.yaml
deploy-to-ecs.sh		deploy-to-ecs.sh
docker-build.ps1		docker-build.ps1
docker-build.sh		docker-build.sh
docker-compose.chat.yml		docker-compose.chat.yml
docker-compose.yml		docker-compose.yml
ecs-task-definition.json		ecs-task-definition.json
env_template.txt		env_template.txt
requirements.txt		requirements.txt
run_chat_server.py		run_chat_server.py
run_youtube_api_server.py		run_youtube_api_server.py
setup-codebuild.sh		setup-codebuild.sh
test_conversation_vectorizer.py		test_conversation_vectorizer.py
test_imports.py		test_imports.py
test_tfidf_integration.py		test_tfidf_integration.py
test_transcribe_update.py		test_transcribe_update.py

tthogho1/transcribe

Folders and files

Latest commit

History

Repository files navigation

Audio Transcription and Vectorization System

Overview

Features

🎵 Audio Processing

📄 Text Extraction

🔍 Vector Search

Architecture

Installation

Prerequisites

Setup

Usage

Local Development

1. Audio Transcription Service

2. Text Extraction from S3

3. Text Vectorization and Search

4. AI Chat Server

Chat Server Features

API Usage Examples

Test the Chat Server

5. Complete Pipeline

Project Structure

Key Components

AmazonTranscribe.py

extract_text_fromS3.py

conversation_vectorization.py

Configuration

AWS Permissions Required

Zilliz Cloud Setup

Monitoring and Logging

Troubleshooting

Common Issues

Performance Considerations

Contributing

License

Support

Dependencies

Docker and ECS Deployment

Docker Build and Run

Build Docker Image

Docker Commands

ECS Deployment

Prerequisites

Infrastructure Setup

Deploy to ECS

ECS Service Configuration

Monitoring

Container Resource Requirements

Environment Variables for ECS

AWS CodeBuild CI/CD Pipeline

Overview

Pipeline Features

Setup CodeBuild Pipeline

1. Quick Setup (Automated)

2. Manual Setup

Build Specifications

buildspec.yml (Basic)

buildspec-advanced.yml (Production)

Build Process

Monitoring and Management

View Build Status

Monitor Build Logs

Trigger Manual Builds

Build Artifacts

Environment Variables

Security Features

Cost Optimization

Troubleshooting

Common Issues

Debug Commands

About

Resources

Uh oh!

Stars

Watchers

Forks

`buildspec.yml` (Basic)

`buildspec-advanced.yml` (Production)

Packages