A comprehensive Python system for processing audio files through Amazon Transcribe and vectorizing the resulting text for semantic search using Zilliz Cloud.
This system provides an end-to-end solution for:
- Monitoring AWS SQS for audio file notifications
- Automatically triggering Amazon Transcribe jobs
- Extracting text from transcription results stored in S3
- Converting text into semantic vectors for similarity search
- Storing and querying vectorized content in Zilliz Cloud
- Automatic Transcription: Monitors SQS queue for new audio files
- AWS Integration: Seamless integration with Amazon Transcribe
- Format Support: Supports MP4 and other audio formats
- Japanese Language: Optimized for Japanese audio transcription
- S3 JSON Reader: Extracts text from JSON files stored in S3
- Smart Detection: Automatically detects Amazon Transcribe result format
- Generic JSON Support: Can extract text from any JSON structure
- Batch Processing: Process multiple files simultaneously
- Semantic Search: Advanced similarity search using sentence transformers
- Japanese Text Processing: Optimized for Japanese text vectorization
- Chunking: Intelligent text chunking for better search granularity
- Zilliz Cloud Integration: Scalable vector database storage
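The chunking step can be illustrated with a minimal sketch (a simplified stand-in; the actual splitter in `conversation_vectorization.py` may use different sizes or respect sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into overlapping chunks so search hits keep surrounding context."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than chunk_size so neighbors share `overlap` characters
        start += chunk_size - overlap
    return chunks
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.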
```
[Audio Files] → [SQS Queue] → [Amazon Transcribe] → [S3 JSON Results]
                                                           ↓
[Zilliz Cloud] ← [Vector Database] ← [Text Chunking] ← [Text Extraction]
```
- Python 3.8+
- AWS Account with appropriate permissions
- Zilliz Cloud account
- Clone the repository

```bash
git clone <repository-url>
cd Transcribe
```

- Create a virtual environment

```bash
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Configure environment variables

Create a `.env` file in the `src/` directory:
```
# AWS Configuration
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=ap-northeast-1
SQS_QUEUE_URL=https://sqs.region.amazonaws.com/account/queue-name
TRANSCRIBE_OUTPUT_BUCKET=your-output-bucket

# S3 Configuration
S3_BUCKET_NAME=your-transcribe-results-bucket

# Zilliz Cloud Configuration
ZILLIZ_URI=https://your-zilliz-endpoint
ZILLIZ_TOKEN=your_zilliz_token
```

Start the transcription service to monitor SQS for new audio files:
```bash
cd src
python AmazonTranscribe.py
```

This service will:
- Monitor the configured SQS queue
- Automatically start transcription jobs for new audio files
- Process messages continuously
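The SQS messages are assumed to be standard S3 event notifications; a minimal sketch of pulling out the bucket and object key (the helper name is illustrative, not the actual code in AmazonTranscribe.py):

```python
import json
from urllib.parse import unquote_plus

def parse_s3_event(message_body: str) -> list:
    """Return (bucket, key) pairs from an S3 event notification delivered via SQS."""
    event = json.loads(message_body)
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Object keys in S3 events are URL-encoded (e.g. spaces arrive as '+')
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key:
            pairs.append((bucket, key))
    return pairs
```

Each pair can then be passed to `transcribe.start_transcription_job` with the S3 URI `s3://{bucket}/{key}`.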
Extract text from JSON files stored in S3:

```python
from extract_text_fromS3 import S3JsonTextExtractor

# Initialize extractor
extractor = S3JsonTextExtractor()

# Extract from a single file
result = extractor.extract_text_from_s3_json(
    bucket_name="your-bucket",
    object_key="path/to/file.json"
)

# Batch processing
results = extractor.batch_extract_texts(
    bucket_name="your-bucket",
    prefix="transcribe-output/"
)
```

Process extracted text and enable semantic search:
```python
from conversation_vectorization import ConversationVectorizer
import os

# Initialize vectorizer
vectorizer = ConversationVectorizer(
    zilliz_uri=os.getenv("ZILLIZ_URI"),
    zilliz_token=os.getenv("ZILLIZ_TOKEN")
)

# Process text
chunks = vectorizer.process_monologue(extracted_text)

# Search similar content
results = vectorizer.search_similar("営業について", limit=5)
for result in results:
    print(f"Text: {result['text'][:100]}...")
    print(f"Score: {result['score']:.3f}\n")
```

Start the intelligent chat server that combines Zilliz search with OpenAI:
```bash
cd src
python chat_server.py
```

The chat server provides:

- Web Interface: Access at `http://localhost:5000`
- REST API: `/api/chat` and `/api/search` endpoints
- WebSocket: Real-time chat functionality
- RAG System: Retrieval-Augmented Generation using past conversations
- 🔍 Semantic Search: Find relevant conversations using vector similarity
- 🤖 AI Responses: Generate contextual answers with OpenAI
- 💬 Real-time Chat: WebSocket-based chat interface
- 📚 Source Citations: Show which conversations informed the answer
- 🌐 Multi-interface: Web UI, REST API, and WebSocket support
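The RAG flow above boils down to retrieval followed by prompt assembly. A hedged sketch of the assembly step (`build_rag_prompt` is a hypothetical helper, not the server's actual code):

```python
def build_rag_prompt(query: str, hits: list, max_context_chars: int = 2000) -> str:
    """Assemble retrieved conversation chunks into a grounded prompt for the LLM."""
    context_parts, used = [], 0
    for i, hit in enumerate(hits, 1):
        snippet = hit["text"]
        # Stop adding context once the budget is exhausted
        if used + len(snippet) > max_context_chars:
            break
        context_parts.append(f"[{i}] {snippet}")
        used += len(snippet)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

The numbered `[i]` markers are what make the source citations in the answer traceable back to specific conversations.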
Search for conversations:

```bash
curl -X POST http://localhost:5000/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "営業について", "limit": 5}'
```

Chat with AI:

```bash
curl -X POST http://localhost:5000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "AIの活用方法を教えて"}'
```

```bash
# Run comprehensive tests
python src/test_chat_server.py

# Test specific functionality
python -c "
import requests
response = requests.post('http://localhost:5000/api/search',
    json={'query': 'test', 'limit': 3})
print(response.json())
"
```

Run the complete pipeline for processing transcription results:
```bash
cd src
python conversation_vectorization.py
```

```
Transcribe/
├── src/
│   ├── .env                          # Environment configuration
│   ├── AmazonTranscribe.py           # SQS monitoring and transcription
│   ├── extract_text_fromS3.py        # S3 JSON text extraction
│   └── conversation_vectorization.py # Text vectorization and search
├── requirements.txt                  # Python dependencies
└── README.md                         # This file
```
AmazonTranscribe.py
- Purpose: Monitors the SQS queue for audio file notifications
- Features: Automatic transcription job creation, error handling
- Output: Transcribed results stored in the specified S3 bucket
extract_text_fromS3.py
- Purpose: Extracts text content from JSON files in S3
- Features: Auto-detection of the Transcribe result format, generic JSON support
- Methods: Single-file and batch processing capabilities
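The format auto-detection can be sketched as follows (a simplified illustration: it assumes the standard Amazon Transcribe output shape `results.transcripts[*].transcript`, and the generic fallback walk is hypothetical, not the module's actual code):

```python
def extract_transcript(data: dict) -> str:
    """Return transcript text from an Amazon Transcribe result,
    or concatenate all string values for generic JSON."""
    transcripts = data.get("results", {}).get("transcripts")
    if isinstance(transcripts, list):
        # Standard Transcribe output: results.transcripts[*].transcript
        return " ".join(t.get("transcript", "") for t in transcripts)

    # Generic JSON fallback: collect every string value in the structure
    texts = []

    def walk(node):
        if isinstance(node, str):
            texts.append(node)
        elif isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(data)
    return " ".join(texts)
```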
conversation_vectorization.py
- Purpose: Converts text to vectors and enables semantic search
- Features: Text chunking, Japanese language optimization, similarity search
- Integration: Zilliz Cloud for scalable vector storage
Your AWS user/role needs the following permissions:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "transcribe:StartTranscriptionJob",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": "*"
        }
    ]
}
```

- Create a Zilliz Cloud account
- Create a new cluster
- Obtain the connection URI and token
- Configure in your `.env` file
The system includes comprehensive logging:
- INFO: Processing status and progress
- ERROR: Failed operations and exceptions
- DEBUG: Detailed operation information
Logs are output to console with timestamps and severity levels.
1. SQS Connection Issues
   - Verify AWS credentials and region
   - Check the SQS queue URL format
   - Ensure proper IAM permissions

2. Transcribe Job Failures
   - Verify the audio file format (MP4 supported)
   - Check S3 bucket permissions
   - Ensure the file is accessible to the Transcribe service

3. Zilliz Connection Issues
   - Verify the URI and token format
   - Check network connectivity
   - Ensure the cluster is running

4. Text Extraction Issues
   - Verify S3 bucket access permissions
   - Check the JSON file format
   - Ensure proper file encoding (UTF-8)
- Batch Processing: Use batch operations for multiple files
- Chunk Size: Adjust chunk size based on your search requirements
- Vector Dimensions: Consider model dimensions for storage optimization
- Connection Pooling: Reuse AWS clients when possible
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the troubleshooting section
- Review logs for error details
- Create an issue in the repository
Key dependencies include:
- boto3: AWS SDK for Python
- sentence-transformers: Text embedding models
- pymilvus: Zilliz/Milvus vector database client
- langchain: Text processing utilities
- python-dotenv: Environment variable management
See requirements.txt for complete dependency list.
```bash
# Build the Docker image
docker build -t transcribe-service:latest .

# Run locally with Docker
docker run -d \
  --name transcribe-service \
  --env-file src/.env \
  transcribe-service:latest

# Run with docker-compose
docker-compose up -d
```

```bash
# View logs
docker logs transcribe-service -f

# Stop container
docker stop transcribe-service

# Remove container
docker rm transcribe-service
```

- AWS CLI configured with appropriate permissions
- Docker installed
- ECR repository created
- Deploy infrastructure using CloudFormation:

```bash
aws cloudformation deploy \
  --template-file cloudformation-infrastructure.yaml \
  --stack-name transcribe-infrastructure \
  --parameter-overrides \
    Environment=production \
    S3BucketName=your-transcribe-bucket \
    SQSQueueName=your-audio-queue \
  --capabilities CAPABILITY_NAMED_IAM
```

- Store secrets in AWS Systems Manager Parameter Store:
```bash
# Store AWS credentials
aws ssm put-parameter \
  --name "/transcribe/aws_access_key_id" \
  --value "YOUR_ACCESS_KEY" \
  --type "SecureString"

aws ssm put-parameter \
  --name "/transcribe/aws_secret_access_key" \
  --value "YOUR_SECRET_KEY" \
  --type "SecureString"

# Store other configuration
aws ssm put-parameter \
  --name "/transcribe/sqs_queue_url" \
  --value "https://sqs.region.amazonaws.com/account/queue-name" \
  --type "String"

aws ssm put-parameter \
  --name "/transcribe/zilliz_uri" \
  --value "https://your-zilliz-endpoint" \
  --type "String"

aws ssm put-parameter \
  --name "/transcribe/zilliz_token" \
  --value "YOUR_ZILLIZ_TOKEN" \
  --type "SecureString"
```

- Make the deployment script executable:

```bash
chmod +x deploy-to-ecs.sh
```

- Run the deployment:

```bash
./deploy-to-ecs.sh production ap-northeast-1
```

- CPU: 1024 (1 vCPU)
- Memory: 2048 MB (2 GB)
- Network: Fargate with public subnets
- Auto Scaling: Can be configured based on CPU/memory usage
- Health Checks: Built-in application health checks
```bash
# View service status
aws ecs describe-services \
  --cluster transcribe-cluster \
  --services transcribe-service

# View logs
aws logs tail /ecs/transcribe-service --follow

# View task details
aws ecs describe-tasks \
  --cluster transcribe-cluster \
  --tasks TASK_ARN
```

| Component | CPU | Memory | Description |
|---|---|---|---|
| Transcribe Service | 1 vCPU | 2 GB | SQS monitoring and job creation |
| Vector Service | 2 vCPU | 4 GB | Text processing and vectorization |
The following environment variables are configured via AWS Systems Manager Parameter Store:
| Variable | Description | Type |
|---|---|---|
| `AWS_ACCESS_KEY_ID` | AWS access key | SecureString |
| `AWS_SECRET_ACCESS_KEY` | AWS secret key | SecureString |
| `AWS_REGION` | AWS region | String |
| `SQS_QUEUE_URL` | SQS queue URL | String |
| `TRANSCRIBE_OUTPUT_BUCKET` | S3 output bucket | String |
| `S3_BUCKET_NAME` | S3 bucket for results | String |
| `ZILLIZ_URI` | Zilliz Cloud URI | String |
| `ZILLIZ_TOKEN` | Zilliz Cloud token | SecureString |
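Inside the container, these parameters have to surface as the environment variables above. A sketch of the name mapping (the `/transcribe/` prefix follows the commands in this README; the helper itself is hypothetical):

```python
def ssm_name_to_env_var(parameter_name: str, prefix: str = "/transcribe/") -> str:
    """Map an SSM name like /transcribe/sqs_queue_url to SQS_QUEUE_URL."""
    if not parameter_name.startswith(prefix):
        raise ValueError(f"unexpected parameter name: {parameter_name}")
    # Drop the path prefix and upper-case the remainder
    return parameter_name[len(prefix):].upper()
```

At runtime, boto3's `ssm.get_parameters_by_path(Path="/transcribe", WithDecryption=True)` can fetch the values, which are then exported under the mapped names.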
This project includes a complete CI/CD pipeline using AWS CodeBuild that automatically builds Docker images and deploys them to ECS when code is pushed to the repository.
- Automatic Builds: Triggered on git push to specified branch
- Multi-stage Docker Builds: Optimized for production
- ECR Integration: Automatic image pushing to Amazon ECR
- ECS Deployment: Seamless updates to ECS services
- Security Scanning: Built-in container vulnerability scanning
- Artifact Management: Automated cleanup of old images
```bash
# Make the setup script executable
chmod +x setup-codebuild.sh

# Run the setup script
./setup-codebuild.sh
```

Deploy CodeBuild infrastructure:

```bash
aws cloudformation deploy \
  --template-file codebuild-infrastructure.yaml \
  --stack-name transcribe-codebuild \
  --parameter-overrides \
    ProjectName=transcribe-service \
    Environment=production \
    GitHubRepo=https://github.com/your-username/transcribe.git \
    GitHubBranch=main \
  --capabilities CAPABILITY_NAMED_IAM
```

Create the ECR repository:

```bash
aws ecr create-repository \
  --repository-name transcribe-service \
  --image-scanning-configuration scanOnPush=true
```

Set up parameters in Parameter Store:
```bash
# Core AWS settings
aws ssm put-parameter \
  --name "/transcribe/aws_access_key_id" \
  --value "YOUR_ACCESS_KEY" \
  --type "SecureString"

aws ssm put-parameter \
  --name "/transcribe/aws_secret_access_key" \
  --value "YOUR_SECRET_KEY" \
  --type "SecureString"

# Application settings
aws ssm put-parameter \
  --name "/transcribe/sqs_queue_url" \
  --value "https://sqs.region.amazonaws.com/account/queue" \
  --type "String"

aws ssm put-parameter \
  --name "/transcribe/zilliz_uri" \
  --value "https://your-zilliz-endpoint" \
  --type "String"

aws ssm put-parameter \
  --name "/transcribe/zilliz_token" \
  --value "YOUR_ZILLIZ_TOKEN" \
  --type "SecureString"
```

The project includes two buildspec files:
- Simple Docker build and push
- Suitable for basic CI/CD needs
- Minimal configuration required
- Comprehensive build with security scanning
- Multi-tag strategy (latest, commit hash, build number, timestamp)
- Advanced error handling and logging
- Production-ready configuration
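The multi-tag strategy can be sketched as a small tag generator (illustrative only; the real buildspec computes these values in shell):

```python
from datetime import datetime, timezone
from typing import List, Optional

def image_tags(commit_hash: str, build_number: str,
               now: Optional[datetime] = None) -> List[str]:
    """Tags pushed per image: latest, short commit hash, build number, timestamp."""
    now = now or datetime.now(timezone.utc)
    return [
        "latest",
        commit_hash[:7],             # short commit hash, like `git rev-parse --short`
        f"build-{build_number}",
        now.strftime("%Y%m%d-%H%M%S"),
    ]
```

Pushing the same image under several tags lets `latest` track the newest build while the hash and timestamp tags keep every build addressable for rollbacks.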
1. Pre-build Phase
   - ECR login and repository validation
   - Build environment setup
   - Variable configuration

2. Build Phase
   - Docker image building with caching
   - Security vulnerability scanning
   - Image testing and validation

3. Post-build Phase
   - Multi-tag image pushing to ECR
   - ECS task definition updates
   - Deployment artifact creation
```bash
# List recent builds
aws codebuild list-builds-for-project \
  --project-name transcribe-service

# Get build details
aws codebuild batch-get-builds \
  --ids BUILD_ID
```

```bash
# Tail build logs in real time
aws logs tail /aws/codebuild/transcribe-service --follow

# View a specific log stream
aws logs get-log-events \
  --log-group-name /aws/codebuild/transcribe-service \
  --log-stream-name LOG_STREAM_NAME
```

```bash
# Start a build from the main branch
aws codebuild start-build \
  --project-name transcribe-service

# Start a build from a specific branch
aws codebuild start-build \
  --project-name transcribe-service \
  --source-version feature-branch
```

Each successful build produces:
- `imagedefinitions.json`: ECS container image definitions
- `ecs-task-definition-final.json`: Updated ECS task definition
- `deployment-summary.json`: Build and deployment metadata
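For reference, `imagedefinitions.json` follows the standard format ECS deployments expect: a JSON array pairing each container name with its image URI (the names below are illustrative):

```json
[
  {
    "name": "transcribe-service",
    "imageUri": "123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/transcribe-service:latest"
  }
]
```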
| Variable | Source | Description |
|---|---|---|
| `AWS_DEFAULT_REGION` | Build environment | AWS region for deployment |
| `AWS_ACCOUNT_ID` | Parameter Store | AWS account ID |
| `IMAGE_REPO_NAME` | Build environment | ECR repository name |
| `ECS_CLUSTER_NAME` | Parameter Store | Target ECS cluster |
| `ECS_SERVICE_NAME` | Parameter Store | Target ECS service |
- Container Scanning: Trivy security scanner integration
- IAM Least Privilege: Minimal required permissions
- Secrets Management: Parameter Store for sensitive data
- Image Signing: Optional container image signing
- Vulnerability Reports: Security scan results in CodeBuild reports
- Build Caching: Docker layer and pip package caching
- Lifecycle Policies: Automatic cleanup of old ECR images
- Compute Optimization: Right-sized build instances
- Artifact Retention: 30-day artifact lifecycle
1. Build Fails with ECR Login Error

```bash
# Check IAM permissions for ECR
aws ecr get-login-password --region us-east-1
```

2. Parameter Store Access Denied

```bash
# Verify the parameter exists and permissions are correct
aws ssm get-parameter --name "/transcribe/parameter-name"
```

3. Docker Build Out of Space
   - Enable Docker layer caching in the buildspec
   - Use multi-stage builds to reduce image size

4. ECS Deployment Fails
   - Check ECS service and task definition compatibility
   - Verify IAM roles for ECS tasks
```bash
# Check CodeBuild project configuration
aws codebuild batch-get-projects --names transcribe-service

# View Parameter Store values
aws ssm get-parameters-by-path --path "/transcribe" --recursive

# Check the ECR repository
aws ecr describe-repositories --repository-names transcribe-service

# View ECS service status
aws ecs describe-services --cluster transcribe-cluster --services transcribe-service
```