A FastAPI-based semantic search system for company and product data with AI-powered embeddings and vector similarity search.
This project provides a comprehensive semantic search API that processes company/product data from Excel/CSV files, generates embeddings using OpenAI's text-embedding-3-small model, and enables intelligent search through vector similarity matching. The system is designed for production deployment with PostgreSQL + pgvector for vector storage.
- FastAPI Application: Main web framework with async support
- PostgreSQL + pgvector: Vector database for embeddings storage
- OpenAI API: Text embedding generation
- Pandas: Data processing and analysis
- SQLAlchemy: ORM for database operations
- VectorDB: Stores industry-categorized embeddings and metadata
- SearchQuery: Tracks search history and parameters
- SearchResult: Stores individual search results with scores
- Feedback: User feedback on search results
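For orientation, here is a minimal sketch of what the VectorDB model could look like with SQLAlchemy and pgvector; the column names and types are assumptions for illustration only, the actual schema lives in database/database.py.

```python
# Minimal sketch of a VectorDB model, assuming SQLAlchemy 2.x and the pgvector
# extension; column names here are illustrative, not the project's real schema.
import uuid
from datetime import datetime

from pgvector.sqlalchemy import Vector
from sqlalchemy import JSON, Column, DateTime, Float, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class VectorDB(Base):
    __tablename__ = "vector_db"

    id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    industry_category = Column(String, index=True)   # e.g. "扣件"
    embedding = Column(Vector(1536))                  # text-embedding-3-small dimension
    extra_metadata = Column(JSON)                     # product_ids, product_metrics, ...
    record_count = Column(Integer, default=0)
    average_quality_score = Column(Float, default=0.0)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
```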
semantic_search/
├── api/ # API endpoints
│ ├── upload.py # File upload and data processing
│ ├── search.py # Semantic search functionality
│ └── feedback.py # User feedback system
├── database/ # Database models and configuration
│ ├── database.py # Database setup and models
│ └── schemas.py # Pydantic schemas
├── app.py # Main FastAPI application
├── start.py # Production startup script
├── requirements.txt # Python dependencies
├── gunicorn.conf.py # Gunicorn configuration
├── render.yaml # Render.com deployment config
└── Procfile # Process file for deployment
- Python 3.8+
- PostgreSQL with pgvector extension
- OpenAI API key
DATABASE_URL=postgresql://user:password@host:port/database
OPENAI_API_KEY=your_openai_api_key
DISABLE_DOCS=false  # Optional: disable API docs

# Clone repository
git clone <repository-url>
cd semantic_search
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the application
python app.py

The project is configured for Render.com deployment:
# Deploy using Render CLI or web interface
# Ensure DATABASE_URL and OPENAI_API_KEY are set in the environment

# Development server
uvicorn app:app --reload --host 0.0.0.0 --port 8000
# Production server
python start.py

- GET / - Basic health check
- GET /health - Detailed health status
- POST /api/upload
  - Description: Upload Excel/CSV files for data processing (example request below)
  - Content-Type: multipart/form-data
  - Parameters:
    - file: Excel (.xlsx, .xls) or CSV file
  - Requirements:
    - Must contain an industry category column (e.g., '產業別', 'industry_category')
    - Data is grouped by industry and scored for quality
    - Recommended product/code and company columns (any that match):
      - Product: 問卷編號, 產品編號, 產品代號, product_id, product, sku
      - Company: 客戶名稱, 公司名稱, company, company_name
  - Response: { "message": "Successfully processed X industry groups with embeddings", "file_id": "uuid", "groups_processed": 5, "data_type": "product" }
  - Metadata stored per industry group:
    - product_ids: list of detected product IDs
    - product_to_company: mapping of product_id → company
    - product_metrics: array of per-product objects: product_id, company, quantity, quality_score, tags, fields
    - numeric_fields: list of detected numeric column names
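A usage sketch for the upload endpoint with the requests library; the host, port, and file name are placeholders.

```python
# Example upload request (illustrative; assumes the server runs on localhost:8000).
import requests

with open("test_data.xlsx", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/upload",
        files={"file": ("test_data.xlsx", f,
                        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")},
    )

resp.raise_for_status()
print(resp.json())  # {"message": "...", "file_id": "...", "groups_processed": ..., "data_type": "..."}
```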
- POST /api/vectordb/refresh
  - Description: Regenerate all embeddings in VectorDB
  - Response: { "message": "Successfully refreshed X VectorDB entries", "entries_updated": 10 }
- GET /api/vectordb/stats
  - Description: Get comprehensive statistics about VectorDB
  - Response: { "total_vector_entries": 15, "unique_industry_categories": 8, "industry_categories": ["扣件", "食品", "電子"], "total_records_represented": 1500, "average_quality_score": 0.85, "last_updated_entries": [...] }
- DELETE /api/vectordb/industry/{industry_category}
  - Description: Remove a specific industry category from VectorDB
  - Response: { "message": "Successfully deleted VectorDB entry for industry: 扣件", "deleted_industry": "扣件" }
- POST /api/search
  - Description: Perform AI-powered semantic search with keyword extraction, product-code matching, and metric-aware selection (example request below)
  - Request Body: { "query_text": "I need Q02 highest quantity product", "filters": "扣件", "top_k": 5 } (filters also accepts combined forms such as "industry:扣件,電子;country:TW")
  - Response: { "query_id": "uuid", "top_k": 5, "returned": 5, "results": [ { "company": "ABC扣件公司", "product": "Q2024002", "completeness_score": 95, "semantic_score": 0.87, "doc_status": "有效", "total_score": 91 } ] }
  - Query preprocessing features:
    - Keyword extraction with basic stopwords
    - Product code detection with normalization (e.g., Q02 ≈ Q002, hyphens ignored)
    - Simple synonyms (e.g., two → 2)
    - Metric intent detection: "highest/lowest <metric>" maps to numeric fields (e.g., quantity, quality_score, or matching uploaded numeric columns via fields)
  - Ranking behavior:
    - Score = 0.6 × completeness + 0.4 × semantic similarity
    - Boost vectors that contain an exact product code match
    - Honor metric intent by selecting the best product within top groups
    - Expand group results into product-level items up to top_k
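A usage sketch for the search endpoint, again with the requests library; the host and payload values are illustrative only.

```python
# Example search request (illustrative; field names follow the request body shown above).
import requests

payload = {
    "query_text": "I need Q02 highest quantity product",
    "filters": "扣件",          # or a combined form such as "industry:扣件,電子;country:TW"
    "top_k": 5,
}
resp = requests.post("http://localhost:8000/api/search", json=payload)
resp.raise_for_status()

for item in resp.json()["results"]:
    print(item["company"], item["product"], item["total_score"])
```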
- GET /api/search/history?limit=10
  - Description: Get recent search queries
  - Response: { "queries": [ { "id": "query-uuid", "query_text": "尋找高品質的扣件供應商", "filters": "扣件", "top_k": 5, "created_at": "2024-01-15T10:30:00" } ] }
- GET /api/search/results/{query_id}
  - Description: Retrieve results for a specific search query
  - Response: { "query": {...}, "results": [ { "id": "result-uuid", "company": "ABC扣件公司", "product": "Q2024001", "completeness_score": 95, "semantic_score": 0.87, "doc_status": "有效", "total_score": 91, "rank": 1, "vector_id": "vector-uuid" } ] }
- POST /api/feedback
  - Description: Submit user feedback on search results (example request below)
  - Request Body: { "query_id": "query-uuid", "result_id": "result-uuid", "action_type": "keep" } (action_type is one of "keep", "reject", "compare")
  - Response: { "status": "success", "message": "Feedback submitted successfully for action: keep", "feedback_id": "feedback-uuid" }
- File Upload: User uploads an Excel/CSV file via /api/upload
- Data Validation: System validates file format and required columns
- Industry Grouping: Data is grouped by industry category (產業別)
- Quality Scoring: Each record is scored based on (see the scoring sketch below):
  - Completeness (penalty for empty fields)
  - Date validity (expire_date, issue_date)
- Embedding Generation: OpenAI API generates embeddings for each industry group
- Vector Storage: Embeddings stored in PostgreSQL with pgvector
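A minimal sketch of the quality-scoring step, assuming a simple completeness-plus-date-validity formula; the actual weights and field names in the upload pipeline may differ.

```python
# Illustrative quality scoring: completeness penalty plus date validity.
# The weights and column names are assumptions, not the exact production formula.
from datetime import date
from typing import List

def quality_score(record: dict, expected_fields: List[str]) -> float:
    # Completeness: fraction of expected fields that are non-empty.
    filled = sum(1 for f in expected_fields if str(record.get(f, "")).strip())
    completeness = filled / len(expected_fields) if expected_fields else 0.0

    # Date validity: a record whose expire_date has passed is penalized.
    expire = record.get("expire_date")
    valid_dates = 1.0 if (expire is None or expire >= date.today()) else 0.0

    return round(100 * (0.8 * completeness + 0.2 * valid_dates), 1)

# Example call with a partially filled record.
print(quality_score(
    {"company": "ABC", "product_id": "Q2024001", "expire_date": date(2030, 1, 1)},
    ["company", "product_id", "quantity"],
))
```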
- Query Processing: User submits natural language query
- Filter Application: System applies industry/country filters
- Embedding Generation: Query converted to embedding vector
- Similarity Search: Cosine similarity calculated against filtered vectors
- Multi-factor Scoring: Results are ranked by (see the ranking sketch below):
  - Completeness Score (60%)
  - Semantic Similarity (40%)
- Result Formatting: Results formatted and returned to user
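A sketch of the combined ranking score, using the 60/40 weighting stated above; cosine similarity is shown with NumPy, and the vectors are random placeholders for real embeddings.

```python
# Illustrative ranking: cosine similarity combined with a completeness score
# using the 60/40 weighting described above.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def total_score(completeness_score: float, semantic_score: float) -> float:
    # completeness_score is on a 0-100 scale, semantic_score on 0-1.
    return round(0.6 * completeness_score + 0.4 * semantic_score * 100, 1)

query_vec = np.random.rand(1536)   # placeholder for the query embedding
group_vec = np.random.rand(1536)   # placeholder for a stored industry embedding
sim = cosine_similarity(query_vec, group_vec)
print(total_score(completeness_score=95, semantic_score=sim))
```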
- Document Status: Automatic validation of expire dates
- Score Calculation: Comprehensive scoring system for data quality
- Feedback Loop: User feedback system for continuous improvement
- Natural Language Processing: Understands Chinese and English queries
- Semantic Similarity: AI-powered meaning-based search
- Multi-factor Ranking: Combines completeness and semantic scores
- Flexible Filtering: Industry and country-based filtering
- Automatic Scoring: Quality assessment based on data completeness
- Date Validation: Expire date and issue date validation
- Industry Categorization: Automatic grouping by industry type
- Async Operations: Full async/await support for high performance
- Connection Pooling: Optimized database connections
- Error Handling: Comprehensive error handling and logging
- Scalable Architecture: Designed for horizontal scaling
- PostgreSQL: Primary database with pgvector extension
- Connection Pooling: 5 base connections, 10 max overflow
- SSL Support: Required for production deployments
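A sketch of an engine configured with the pool sizes above, assuming a synchronous SQLAlchemy engine and psycopg2-style SSL arguments; the real setup in database/database.py may differ.

```python
# Illustrative engine setup matching the pool sizes above; the real configuration
# in database/database.py may differ.
import os
from sqlalchemy import create_engine

engine = create_engine(
    os.environ["DATABASE_URL"],
    pool_size=5,          # base connections
    max_overflow=10,      # extra connections under load
    pool_pre_ping=True,   # drop dead connections before use
    pool_recycle=1800,    # recycle connections periodically
    connect_args={"sslmode": "require"},  # SSL for production deployments
)
```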
- Model: text-embedding-3-small (1536 dimensions)
- Rate Limiting: Built-in error handling for API limits
- Cost Optimization: Efficient embedding generation
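A minimal sketch of calling text-embedding-3-small with simple backoff on rate limits, assuming the openai>=1.0 Python client; the project's actual retry logic may be more involved.

```python
# Illustrative embedding call with basic rate-limit handling,
# assuming the openai>=1.0 Python client.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str, retries: int = 3) -> list:
    for attempt in range(retries):
        try:
            resp = client.embeddings.create(model="text-embedding-3-small", input=text)
            return resp.data[0].embedding  # 1536-dimensional vector
        except RateLimitError:
            time.sleep(2 ** attempt)       # simple exponential backoff
    raise RuntimeError("Embedding request failed after retries")
```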
- Gunicorn Workers: 4 workers for production
- Timeout Settings: 30-second request timeout
- Memory Management: Connection recycling and cleanup
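An illustrative gunicorn.conf.py reflecting the settings above; the repository's actual file may set different or additional options.

```python
# Illustrative gunicorn.conf.py matching the settings above; the real file may differ.
bind = "0.0.0.0:8000"
workers = 4                                     # production worker count
worker_class = "uvicorn.workers.UvicornWorker"  # async FastAPI workers
timeout = 30                                    # request timeout in seconds
max_requests = 1000                             # recycle workers to limit memory growth
max_requests_jitter = 50
```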
- VectorDB entry counts and categories
- Search query history and patterns
- Data quality metrics and trends
- User feedback analysis
- Database connection status
- API response times
- Error rates and logging
- Create a new router in the api/ directory
- Add corresponding database models in database/
- Update schemas in database/schemas.py
- Include the router in app.py (see the sketch below)
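A minimal sketch of those four steps; the module name api/example.py, the route, and the prefix are placeholders.

```python
# api/example.py -- illustrative new router; module and route names are placeholders.
from fastapi import APIRouter

router = APIRouter(prefix="/api/example", tags=["example"])

@router.get("/")
async def list_examples():
    return {"items": []}

# app.py -- include the new router alongside the existing ones.
# from api.example import router as example_router
# app.include_router(example_router)
```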
# Run with test data
python -m pytest tests/
# Manual API testing
curl -X POST "http://localhost:8000/api/upload" \
-F "file=@test_data.xlsx"This project is proprietary software. All rights reserved.
For technical support or questions, please contact the development team.
Version: 1.0.0
Last Updated: 2024
Status: Production Ready