
Commit c99feee

Updated LocalLab v0.3.1 and Updated Docs

1 parent ccd826a commit c99feee

File tree

6 files changed: +452 −178 lines changed

CHANGELOG.md

Lines changed: 17 additions & 137 deletions
@@ -2,6 +2,22 @@

 All notable changes to LocalLab will be documented in this file.

+## [0.3.1] - 2025-03-06
+
+### Fixed
+
+- Fixed NameError in config.py by properly defining the CUSTOM_MODEL variable
+- Enhanced error handling for environment variable loading
+- Improved model registry initialization for custom models
+- Fixed key mismatch in custom model requirements dictionary
+- Standardized environment variable naming with LOCALLAB_ prefix
+- Removed duplicate configuration settings for optimization parameters
+- Added safe model registry access with fallback support
+- Added better error handling for missing environment variables
+- Added consistent version constraints for dependencies in setup.py
+- Improved error messages for configuration issues
+- Updated environment variable documentation with more details
+
 ## [0.3.0] - 2025-03-05

 ### Added
@@ -108,140 +124,4 @@ All notable changes to LocalLab will be documented in this file.

 ### Added

-- Added new `get_gpu_info()` function for detailed GPU monitoring
-- Added improved system resource endpoint with detailed GPU metrics
-- Added robust environment variable handling for optimization settings
-
-### Changed
-
-- Made optimization flags more robust by checking for empty string values
-- Improved fallback handling for missing torch packages
-- Enhanced server startup logs with better optimization information
-
-## [0.2.3] - 2025-03-02
-
-### Fixed
-
-- Fixed critical server startup error in the Google Colab environment with uvicorn callback configuration
-- Resolved "'list' object is not callable" error by properly implementing callback_notify as an async function
-- Enhanced server startup sequence for better compatibility with both local and Colab environments
-- Improved custom server implementation to handle callbacks more robustly
-
-## [0.2.2] - 2025-03-02
-
-### Fixed
-
-- Fixed circular import issue between core/app.py and routes/system.py by updating system.py to use get_request_count from the logger module directly
-- Made the Flash Attention warning less alarming by changing it from a warning to an info message with a better explanation
-- Enhanced the get_system_info endpoint with cleaner code and better organization
-- Fixed potential issues with GPU info retrieval through better error handling
-
-## [0.2.0] - 2025-03-02
-
-### Added
-
-- Comprehensive environment check system that validates:
-  - Python version compatibility
-  - CUDA/GPU availability and configuration
-  - Ngrok token presence when running in Google Colab
-- Improved error handling with detailed error messages and suggestions
-- Clear instructions for setting up the ngrok authentication token
-
-### Changed
-
-- Complete removal of the deprecated monolithic `main.py` file
-- Enhanced ngrok setup process with better authentication handling:
-  - Automatic detection of the auth token from environment variables
-  - Clear error messages when the auth token is missing
-  - Improved token validation and connection process
-- Parameter renamed from `ngrok` to `use_ngrok` for clarity
-- More readable ASCII art for the initializing banner
-- Improved documentation about ngrok requirements for Google Colab
-
-### Fixed
-
-- Fixed circular import issues between core/app.py and routes modules
-- Fixed the ngrok authentication flow to properly use the auth token from environment variables
-- Fixed an error caused by a missing torch import in server.py
-- Added graceful handling of a missing torch module to prevent startup failures
-- Improved error messages when the server fails to start
-- Better exception handling throughout the codebase
-
-## [0.1.9] - 2025-03-01
-
-### Added
-
-- Clear ASCII art status indicators ("INITIALIZING" and "RUNNING") showing server state
-- Warning messages that prevent users from making API requests before the server is ready
-- Callback mechanism to display the "RUNNING" banner only when the server is fully operational
-- New dedicated logger module with comprehensive features:
-  - Colorized console output for different log levels
-  - Server status tracking (initializing, running, error, shutting_down)
-  - Request tracking with detailed metrics
-  - Model loading/unloading metrics
-  - Performance monitoring for slow requests
-- API documentation for the logger module with usage examples
-
-### Changed
-
-- Completely refactored the codebase into a more modular structure:
-  - Split main.py into smaller, focused modules
-  - Created separate directories for routes, UI components, utilities, and core functionality
-  - Improved import structure to prevent circular dependencies
-  - Better organization of server startup and API functionality
-- Enhanced model loading process with proper timing and status updates
-- Improved error handling throughout the application
-- Better request metrics in response headers
-- Removed old logger.py in favor of the new dedicated logger module
-
-### Fixed
-
-- Complete removal of health checks and validation when setting up ngrok tunnels
-- Fixed issue where logs did not appear correctly because the server started in a separate process
-- Simplified the ngrok setup process to run without validation, preventing connection errors during startup
-- Improved server startup flow to be more direct, without background health checks or API validation
-- Reorganized startup sequence to work properly with ngrok, enhancing compatibility with Colab
-
-## [0.1.7] - 2025-03-01
-
-### Changed
-
-- Removed the background-process workflow for server startup. The server now runs directly in the main process, ensuring that all logs (banner, model details, system resources, etc.) are displayed properly.
-- Simplified the startup process by directly calling uvicorn.run(), with optional ngrok setup if the server is run in Google Colab.
-
-## [0.1.6] - 2025-02-25
-
-### Added
-
-- Added utility function `is_port_in_use(port: int) -> bool` to check if a port is already in use.
-- Added async utility function `load_model_in_background(model_id: str)` to load the model asynchronously in the background while managing the global loading flag.
-- Updated server startup functions to incorporate these utilities, ensuring proper port management and asynchronous model loading.
-
-## [0.1.5] - 2025-02-25
-
-### Changed
-
-- Extended the initial wait time in start_server from 5 to 15 seconds to give the server ample time to initialize, especially in Google Colab environments.
-- Increased the health check timeout to 120 seconds for ngrok mode and 60 seconds for local mode to accommodate slower startups.
-- Added detailed logging during health checks to aid in debugging startup issues.
-
-## [0.1.4] - 2025-02-25
-
-### Changed
-
-- Improved logging across startup: the banner, model details, configuration, system resources, API documentation, quick start guide, and footer are now fully logged and printed.
-- Updated the start_server function to extend the health check timeout to 60 seconds in Google Colab (when using ngrok) and to set an environment variable to trigger the Colab branch in run_server_proc.
-- Modified startup_event to load the model in the background, ensuring that the server's /health endpoint becomes available in time and that logging output is complete.
-
-## [0.1.3] - 2025-02-25
-
-### Changed
-
-- Updated the GitHub Actions workflow to install the LocalLab package along with its runtime dependencies in CI, ensuring that all required packages are available for proper testing.
-
-### Fixed
-
-- Refactored `run_server_proc` in the spawned process to initialize a dedicated logger ("locallab.spawn") to avoid inheriting SemLock objects from a fork context.
-- Ensured that the log queue is created using the multiprocessing spawn context, preventing runtime errors in Google Colab.
-- Updated Mermaid diagrams in `README.md` and `docs/colab/README.md` to enclose node labels in double quotes, resolving parse errors in GitHub rendering.
-- Removed duplicate architecture diagrams from the root `
+- Added new `

docs/colab/setup_guide.md

Lines changed: 188 additions & 0 deletions
# Running LocalLab in Google Colab

This guide shows you how to set up and run LocalLab in a Google Colab environment. Google Colab provides free GPU resources that can significantly speed up model inference.

## Quick Start

Copy and paste this code into a Colab notebook cell:

```python
# Install LocalLab package
!pip install locallab

# Set up and run the server
import os
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Replace with your actual ngrok token
os.environ["NGROK_AUTH_TOKEN"] = "your_ngrok_token"

# Model selection - use any HuggingFace model
os.environ["HUGGINGFACE_MODEL"] = "microsoft/phi-2"  # Light model that works well on Colab

# Memory optimization settings
os.environ["LOCALLAB_ENABLE_QUANTIZATION"] = "true"
os.environ["LOCALLAB_QUANTIZATION_TYPE"] = "int8"
os.environ["LOCALLAB_ENABLE_ATTENTION_SLICING"] = "true"

# Speed optimization settings
os.environ["LOCALLAB_ENABLE_FLASH_ATTENTION"] = "true"
os.environ["LOCALLAB_ENABLE_BETTERTRANSFORMER"] = "true"

# Import and start the server
from locallab import start_server

# Start with the explicit ngrok flag
start_server(use_ngrok=True)
```
## Step-by-Step Setup

### 1. Create a New Notebook

Create a new notebook in Google Colab:

- Go to [Google Colab](https://colab.research.google.com/)
- Click "New Notebook"
### 2. Enable GPU Runtime

Change the runtime to use GPU:

- Click "Runtime" in the menu
- Select "Change runtime type"
- Set "Hardware accelerator" to "GPU"
- Click "Save"
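After the runtime switches over, you can confirm that a GPU is actually attached. A quick check using PyTorch, which Colab preinstalls:

```python
import torch

# Prints True plus the GPU name (e.g. "Tesla T4") when the runtime has a GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```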
### 3. Install the LocalLab Package

```python
!pip install locallab
```
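If you want the exact release documented in this commit (0.3.1), you can pin the version, assuming that release is published on PyPI:

```python
# Pin to the release covered by this commit's changelog (assumes it is on PyPI)
!pip install locallab==0.3.1
```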
### 4. Get an ngrok Auth Token

LocalLab uses ngrok to make your local server accessible over the internet:

1. Sign up or log in at [ngrok.com](https://ngrok.com)
2. Go to the [Auth page](https://dashboard.ngrok.com/get-started/your-authtoken)
3. Copy your authtoken
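Rather than pasting the token into the notebook itself, you can prompt for it at runtime. A small sketch using Python's standard-library `getpass`:

```python
import os
from getpass import getpass

# Prompt for the token so it never appears in the saved notebook
os.environ["NGROK_AUTH_TOKEN"] = getpass("Paste your ngrok authtoken: ")
```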
### 5. Configure and Run the Server

```python
import os
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Set your ngrok token
os.environ["NGROK_AUTH_TOKEN"] = "your_ngrok_token_here"  # Replace with your actual token

# Set environment variables
os.environ["HUGGINGFACE_MODEL"] = "microsoft/phi-2"  # Good default for Colab
os.environ["LOCALLAB_ENABLE_QUANTIZATION"] = "true"
os.environ["LOCALLAB_QUANTIZATION_TYPE"] = "int8"
os.environ["LOCALLAB_ENABLE_ATTENTION_SLICING"] = "true"

# Import and start the server
from locallab import start_server

# Start with the explicit ngrok flag
start_server(use_ngrok=True)
```
### 6. Access the Server

After running the cell, you'll see output like:

```
✅ Server running at: https://1a2b3c4d5e.ngrok.io
```

Visit this URL to access the LocalLab UI and API.
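To verify from another machine that the server is reachable, you can poll its /health endpoint (the endpoint is mentioned in this commit's changelog; the exact response format may vary by version). A minimal sketch:

```python
import requests

# Replace with the ngrok URL printed at startup
BASE_URL = "https://1a2b3c4d5e.ngrok.io"

# The /health endpoint becomes available once startup completes
resp = requests.get(f"{BASE_URL}/health", timeout=30)
print(resp.status_code, resp.text)
```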
## Recommended Models for Colab

Different Colab tiers support different model sizes:

| Colab Tier | GPU Type          | Recommended Models                                                                       |
| ---------- | ----------------- | ---------------------------------------------------------------------------------------- |
| Free       | T4 (~16GB VRAM)   | microsoft/phi-2, TinyLlama/TinyLlama-1.1B-Chat-v1.0                                      |
| Pro        | T4/V100           | meta-llama/Llama-2-7b-chat-hf, mistralai/Mistral-7B-Instruct-v0.1                        |
| Pro+       | A100 (~40GB VRAM) | meta-llama/Llama-2-13b-chat-hf, mistralai/Mixtral-8x7B-Instruct-v0.1 (with quantization) |
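If you want the notebook to choose from this table automatically, a rough sketch follows; the VRAM thresholds are illustrative assumptions, not LocalLab defaults:

```python
import os
import torch

def pick_model() -> str:
    """Map detected VRAM to the table above (thresholds are illustrative)."""
    if not torch.cuda.is_available():
        return "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # small enough without a GPU
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 40:   # A100 class (Pro+)
        return "meta-llama/Llama-2-13b-chat-hf"
    if vram_gb >= 24:   # V100 class (Pro)
        return "meta-llama/Llama-2-7b-chat-hf"
    return "microsoft/phi-2"  # free-tier T4

os.environ["HUGGINGFACE_MODEL"] = pick_model()
```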
## Memory Optimization Configurations

### Configuration for Smaller Models

```python
os.environ["HUGGINGFACE_MODEL"] = "microsoft/phi-2"
os.environ["LOCALLAB_ENABLE_QUANTIZATION"] = "true"
os.environ["LOCALLAB_QUANTIZATION_TYPE"] = "int8"
os.environ["LOCALLAB_ENABLE_FLASH_ATTENTION"] = "true"
```
### Configuration for Medium Models (7B)

```python
os.environ["HUGGINGFACE_MODEL"] = "meta-llama/Llama-2-7b-chat-hf"
os.environ["LOCALLAB_ENABLE_QUANTIZATION"] = "true"
os.environ["LOCALLAB_QUANTIZATION_TYPE"] = "int8"
os.environ["LOCALLAB_ENABLE_ATTENTION_SLICING"] = "true"
os.environ["LOCALLAB_ENABLE_FLASH_ATTENTION"] = "true"
```
### Configuration for Large Models (13B+)

```python
os.environ["HUGGINGFACE_MODEL"] = "meta-llama/Llama-2-13b-chat-hf"
os.environ["LOCALLAB_ENABLE_QUANTIZATION"] = "true"
os.environ["LOCALLAB_QUANTIZATION_TYPE"] = "int4"  # More aggressive quantization
os.environ["LOCALLAB_ENABLE_ATTENTION_SLICING"] = "true"
os.environ["LOCALLAB_ENABLE_CPU_OFFLOADING"] = "true"  # Offload some layers to CPU
```
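Since these profiles differ only in their environment variables, you may prefer a small helper that applies a whole profile at once. This is a convenience sketch, not a LocalLab API:

```python
import os

# Hypothetical helper: apply a profile of LocalLab environment variables in one call
def apply_profile(profile: dict) -> None:
    for key, value in profile.items():
        os.environ[key] = value

LARGE_MODEL_PROFILE = {
    "HUGGINGFACE_MODEL": "meta-llama/Llama-2-13b-chat-hf",
    "LOCALLAB_ENABLE_QUANTIZATION": "true",
    "LOCALLAB_QUANTIZATION_TYPE": "int4",
    "LOCALLAB_ENABLE_ATTENTION_SLICING": "true",
    "LOCALLAB_ENABLE_CPU_OFFLOADING": "true",
}
apply_profile(LARGE_MODEL_PROFILE)
```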
## Troubleshooting

### Out of Memory Errors

If you encounter GPU out-of-memory errors:

1. Use a smaller model
2. Enable more aggressive memory optimizations:
   ```python
   os.environ["LOCALLAB_QUANTIZATION_TYPE"] = "int4"
   os.environ["LOCALLAB_ENABLE_CPU_OFFLOADING"] = "true"
   ```
3. Reduce the max length:
   ```python
   os.environ["LOCALLAB_MAX_LENGTH"] = "1024"  # Default is 2048
   ```
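To judge how close you are to the limit before picking an optimization, you can check free GPU memory directly with standard PyTorch:

```python
import torch

# Report free vs. total GPU memory to guide which optimizations to enable
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"Free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
```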
### Connection Issues

If you cannot connect to the ngrok URL:

1. Ensure your ngrok token is correct
2. Check that Colab hasn't timed out (Colab sessions expire after idle periods)
3. Try a different region:
   ```python
   os.environ["LOCALLAB_NGROK_REGION"] = "eu"  # Options: us, eu, ap, au, sa, jp, in
   ```
## See Also

- [Environment Variables Documentation](../guides/environment_variables.md)
- [Performance Guide](../features/performance.md)
- [Memory Monitoring](../features/memory.md)
