Fixes

makseq · makseq · commit 80e7b2a41088 · 2025-06-09T00:59:20.000+01:00
diff --git a/.rules/new_models_best_practice.mdc b/.rules/new_models_best_practice.mdc
@@ -33,7 +33,7 @@ Each example should contain the following files:
 
 - Reference the main repository README to help users understand how to install and run the ML backend.
 - Include labeling configuration examples in the example README so users can quickly reproduce training and inference.
-- Provide troubleshooting tips or links to Label Studio documentation such as [Writing your own ML backend](https://labelstud.io/guide/ml_create).
+- Provide troubleshooting tips or links to Label Studio documentation such as [Writing your own ML backend](mdc:https:/labelstud.io/guide/ml_create).
 
 ## 3.1. Security Best Practices
 
@@ -106,8 +106,6 @@ def get_model(self, model_path):
 Implement proper model versioning for production systems:
 
 - **Version Tracking**: Include model version in predictions and logs
-- **Backwards Compatibility**: Handle multiple model versions gracefully
-- **Migration Strategies**: Provide clear upgrade paths for model updates
 - **Rollback Support**: Maintain ability to revert to previous model versions
 
 Example versioning pattern:
@@ -148,6 +146,7 @@ Implement robust data processing for different scenarios:
 - **Type Safety**: Use proper type conversion and validation for different data types
 - **Streaming Data**: Support large files that don't fit in memory using streaming approaches
 - **Data Caching**: Cache preprocessed data when appropriate to improve performance
+- **LabelStudioMLBackend::preload_task_data(path, task)**: Downloads all URLs from a task and stores them locally. It uses `get_local_path()` (`from label_studio_sdk._extensions.label_studio_tools.core.utils.io import get_local_path`) and requires `LABEL_STUDIO_API_KEY` and `LABEL_STUDIO_URL` to download files through the Label Studio instance.
 
 Example robust data loading:
 ```python
diff --git a/label_studio_ml/examples/timeseries_segmenter/README.md b/label_studio_ml/examples/timeseries_segmenter/README.md
@@ -46,57 +46,114 @@ columns.
 
 ## Training
 
-Training starts automatically when annotations are created or updated. The model
-collects all labeled segments, extracts sensor values inside each segment and
-fits an LSTM classifier. Model artifacts are stored in the
-`MODEL_DIR` (defaults to the current directory).
+Training starts automatically when annotations are created or updated. The model is a PyTorch LSTM network that learns time series patterns from sliding windows of consecutive timesteps.
 
-Steps performed by `fit()`:
+### Training Process
 
-1. Fetch all labeled tasks from Label Studio.
-2. Convert labeled ranges to per-row training samples.
-3. Fit a small LSTM network.
-4. Save the trained model to disk.
+The model follows these steps during training:
+
+1. **Data Collection**: Fetches all labeled tasks from your Label Studio project
+2. **Sample Generation**: Converts labeled time ranges into training samples:
+   - **Background Class**: Unlabeled time periods are treated as "background" (class 0)
+   - **Event Classes**: Your labeled segments (e.g., "Run", "Walk") become classes 1, 2, etc.
+   - **Ground Truth Priority**: If multiple annotations exist for a task, ground truth annotations take precedence
+3. **Model Training**: Fits a multi-layer LSTM network with:
+   - Configurable sequence windows (default: 50 timesteps)
+   - Dropout regularization for better generalization
+   - Background class support for realistic time series modeling
+4. **Model Persistence**: Saves trained model artifacts to `MODEL_DIR`
+
+### Training Configuration
+
+You can customize training behavior with these environment variables:
+
+- `START_TRAINING_EACH_N_UPDATES`: How often to retrain (default: 1, trains on every annotation)
+- `TRAIN_EPOCHS`: Number of training epochs (default: 1000)
+- `SEQUENCE_SIZE`: Sliding window size for temporal context (default: 50)
+- `HIDDEN_SIZE`: LSTM hidden layer size (default: 64)
+
+### Ground Truth Handling
+
+When multiple annotations exist for the same task, ground truth annotations take priority:
+- Annotations are applied in the order they appear in the task, so later labels overwrite earlier ones on overlapping rows
+- Once a ground truth annotation has been applied, processing stops for that task and any remaining annotations are ignored
+- This favors ground truth labels wherever annotations overlap
 
 ## Prediction
 
-For each task, the backend loads the CSV, applies the trained classifier to each
-row and groups consecutive predictions into labeled segments. Prediction scores
-are averaged per segment and returned to Label Studio.
+The model processes new time series data by applying the trained LSTM classifier with sliding window temporal context. Only meaningful event segments are returned to Label Studio, filtering out background periods automatically.
+
+### Prediction Process
 
-The `predict()` method:
+For each task, the model performs these steps:
 
-1. Loads the stored model.
-2. Reads the task CSV and builds a feature matrix.
-3. Predicts a label for each row.
-4. Merges consecutive rows with the same label into a segment.
-5. Returns the segments in Label Studio JSON format.
+1. **Model Loading**: Loads the trained PyTorch model from disk
+2. **Data Processing**: Reads the task CSV and creates feature vectors from sensor channels
+3. **Temporal Prediction**: Applies LSTM with sliding windows for temporal context:
+   - Uses overlapping windows with 50% overlap for smoother predictions
+   - Averages predictions across overlapping windows
+   - Maintains temporal dependencies between timesteps
+4. **Segment Extraction**: Groups consecutive predictions into meaningful segments:
+   - **Background Filtering**: Automatically filters out background (unlabeled) periods
+   - **Event Segmentation**: Only returns segments with actual event labels
+   - **Score Calculation**: Averages prediction confidence per segment
+5. **Result Formatting**: Returns segments in Label Studio JSON format
+
+### Prediction Quality
+
+The model provides several quality indicators:
+
+- **Per-segment Confidence**: Average prediction probability for each returned segment
+- **Temporal Consistency**: Sliding window approach reduces prediction noise
+- **Background Suppression**: Only returns segments whose predicted class is an event rather than background
+
+This approach ensures that predictions focus on actual events rather than forcing labels on every timestep.
 
 ## How it works
 
-### Training pipeline
+### Training Pipeline
 
 ```mermaid
 flowchart TD
-  A[Webhook event] --> B{Enough tasks?}
-  B -- no --> C[Skip]
-  B -- yes --> D[Load labeled tasks]
-  D --> E[Collect per-row samples]
-  E --> F[Fit LSTM]
-  F --> G[Save model]
+  A[Annotation Event] --> B{Training Trigger?}
+  B -- no --> C[Skip Training]
+  B -- yes --> D[Fetch Labeled Tasks]
+  D --> E[Process Annotations]
+  E --> F{Ground Truth?}
+  F -- yes --> G[Priority Processing]
+  F -- no --> H[Standard Processing]
+  G --> I[Generate Samples]
+  H --> I
+  I --> J[Background + Event Labels]
+  J --> K[PyTorch LSTM Training]
+  K --> L[Model Validation]
+  L --> M[Save Model]
+  M --> N[Cache in Memory]
 ```
 
-### Prediction pipeline
+### Prediction Pipeline
 
 ```mermaid
 flowchart TD
-  T[Predict request] --> U[Load model]
-  U --> V[Read task CSV]
-  V --> W[Predict label per row]
-  W --> X[Group consecutive labels]
-  X --> Y[Return segments]
+  T[Prediction Request] --> U[Load PyTorch Model]
+  U --> V[Read Task CSV]
+  V --> W[Extract Features]
+  W --> X[Sliding Window LSTM]
+  X --> Y[Overlap Averaging]
+  Y --> Z[Filter Background]
+  Z --> AA[Group Event Segments]
+  AA --> BB[Calculate Confidence]
+  BB --> CC[Return Segments]
 ```
 
+### Key Technical Features
+
+- **PyTorch-based LSTM**: The classifier is implemented as a multi-layer LSTM in PyTorch
+- **Temporal Modeling**: Sliding windows capture time dependencies (default 50 timesteps)
+- **Background Class**: Realistic modeling where unlabeled periods are explicit background
+- **Ground Truth Priority**: Ensures highest quality annotations are used for training
+- **Overlap Averaging**: Smoother predictions through overlapping window consensus
+
 ## Customize
 
 Edit `docker-compose.yml` to set environment variables such as `LABEL_STUDIO_HOST`
diff --git a/label_studio_ml/examples/timeseries_segmenter/model.py b/label_studio_ml/examples/timeseries_segmenter/model.py
@@ -236,6 +236,75 @@ def _group_rows(self, df: pd.DataFrame, time_col: str) -> List[Dict]:
         logger.debug(f"Grouped into {len(segments)} segments")
         return segments
 
+    def _process_task_annotations(
+        self, task: Dict, df: pd.DataFrame, params: Dict, label2idx: Dict[str, int]
+    ) -> Tuple[np.ndarray, int]:
+        """Process annotations for a single task and return row labels.
+        
+        Args:
+            task: Label Studio task dictionary
+            df: DataFrame with time series data
+            params: Labeling parameters from label config
+            label2idx: Mapping from label names to indices
+            
+        Returns:
+            Tuple of (row_labels array, number of labeled rows)
+        """
+        task_id = task.get("id", "unknown")
+        
+        # Initialize all rows as background (index 0)
+        row_labels = np.zeros(len(df), dtype=np.int64)  # 0 = background
+        
+        annotations = [a for a in task["annotations"] if a.get("result")]
+        logger.debug(f"Task {task_id}: Found {len(annotations)} annotations")
+        
+        # Mark labeled regions
+        labeled_rows = 0
+        for ann in annotations:
+            for r in ann["result"]:
+                if r["from_name"] != params["from_name"]:
+                    continue
+                start = r["value"]["start"]
+                end = r["value"]["end"]
+                label = r["value"]["timeserieslabels"][0]
+                
+                # Convert start/end to same type as time column for comparison
+                time_dtype = df[params["time_col"]].dtype
+                logger.debug(f"Task {task_id}: Converting time range [{start}, {end}] to match column dtype {time_dtype}")
+                try:
+                    if 'int' in str(time_dtype):
+                        start = int(float(start))
+                        end = int(float(end))
+                    elif 'float' in str(time_dtype):
+                        start = float(start)
+                        end = float(end)
+                    # For string/datetime, keep as is
+                    logger.debug(f"Task {task_id}: Converted to [{start}, {end}]")
+                except (ValueError, TypeError) as e:
+                    logger.warning(f"Could not convert start={start}, end={end} to {time_dtype}: {e}, using original values")
+                
+                # Find rows in this time range
+                try:
+                    mask = (df[params["time_col"]] >= start) & (
+                        df[params["time_col"]] <= end
+                    )
+                except TypeError as e:
+                    logger.error(f"Task {task_id}: Type error comparing times - start={start} ({type(start)}), end={end} ({type(end)}), time_col dtype={time_dtype}: {e}")
+                    # Skip this annotation if we can't compare
+                    continue
+                
+                # Set the appropriate label index
+                label_idx = label2idx[label]
+                row_labels[mask] = label_idx
+                labeled_rows += mask.sum()
+                logger.debug(f"Task {task_id}: Labeled {mask.sum()} rows with '{label}' (index {label_idx})")
+
+            if ann.get('ground_truth', False):
+                logger.info(f"Task {task_id}: Ground truth annotation found: {ann['ground_truth']}")
+                break
+                
+        return row_labels, labeled_rows
+
     def _collect_samples(
         self, tasks: List[Dict], params: Dict, label2idx: Dict[str, int]
     ) -> Tuple[np.ndarray, np.ndarray]:
@@ -253,52 +322,8 @@ def _collect_samples(
                 logger.warning(f"Task {task_id}: Empty dataframe, skipping")
                 continue
             
-            # Initialize all rows as background (index 0)
-            row_labels = np.zeros(len(df), dtype=np.int64)  # 0 = background
-            
-            annotations = [a for a in task["annotations"] if a.get("result")]
-            logger.debug(f"Task {task_id}: Found {len(annotations)} annotations")
-            
-            # Mark labeled regions
-            labeled_rows = 0
-            for ann in annotations:
-                for r in ann["result"]:
-                    if r["from_name"] != params["from_name"]:
-                        continue
-                    start = r["value"]["start"]
-                    end = r["value"]["end"]
-                    label = r["value"]["timeserieslabels"][0]
-                    
-                    # Convert start/end to same type as time column for comparison
-                    time_dtype = df[params["time_col"]].dtype
-                    logger.debug(f"Task {task_id}: Converting time range [{start}, {end}] to match column dtype {time_dtype}")
-                    try:
-                        if 'int' in str(time_dtype):
-                            start = int(float(start))
-                            end = int(float(end))
-                        elif 'float' in str(time_dtype):
-                            start = float(start)
-                            end = float(end)
-                        # For string/datetime, keep as is
-                        logger.debug(f"Task {task_id}: Converted to [{start}, {end}]")
-                    except (ValueError, TypeError) as e:
-                        logger.warning(f"Could not convert start={start}, end={end} to {time_dtype}: {e}, using original values")
-                    
-                    # Find rows in this time range
-                    try:
-                        mask = (df[params["time_col"]] >= start) & (
-                            df[params["time_col"]] <= end
-                        )
-                    except TypeError as e:
-                        logger.error(f"Task {task_id}: Type error comparing times - start={start} ({type(start)}), end={end} ({type(end)}), time_col dtype={time_dtype}: {e}")
-                        # Skip this annotation if we can't compare
-                        continue
-                    
-                    # Set the appropriate label index
-                    label_idx = label2idx[label]
-                    row_labels[mask] = label_idx
-                    labeled_rows += mask.sum()
-                    logger.debug(f"Task {task_id}: Labeled {mask.sum()} rows with '{label}' (index {label_idx})")
+            # Process annotations for this task
+            row_labels, labeled_rows = self._process_task_annotations(task, df, params, label2idx)
             
             # Add ALL rows to training data
             X_list.append(df[params["channels"]].values.astype(np.float32))
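
Note on the extracted helper: the labeling logic that `_process_task_annotations` now owns can be illustrated without pandas or a running backend. The block below is a minimal sketch, not the code from `model.py`: the function name `label_rows` and its list-based arguments are hypothetical, the `from_name` filter and dtype conversion are omitted, and plain Python lists stand in for the DataFrame time column. It mirrors three behaviors visible in the diff: every row starts as background (0), later annotations overwrite earlier ones, and processing stops after the first ground truth annotation.

```python
from typing import Dict, List, Tuple


def label_rows(
    times: List[float], annotations: List[Dict], label2idx: Dict[str, int]
) -> Tuple[List[int], int]:
    """Hypothetical stand-in for the row-labeling loop shown in the diff."""
    row_labels = [0] * len(times)  # 0 = background
    labeled_rows = 0
    for ann in annotations:
        for r in ann.get("result", []):
            start = r["value"]["start"]
            end = r["value"]["end"]
            idx = label2idx[r["value"]["timeserieslabels"][0]]
            # Inclusive range check, like the >= / <= mask in the diff
            for i, t in enumerate(times):
                if start <= t <= end:
                    row_labels[i] = idx
                    labeled_rows += 1  # overlapping rows are counted once per region
        # Stop after the first ground truth annotation has been applied
        if ann.get("ground_truth", False):
            break
    return row_labels, labeled_rows


times = [0, 1, 2, 3, 4, 5]
label2idx = {"Walk": 1, "Run": 2}
annotations = [
    {"result": [{"value": {"start": 0, "end": 2, "timeserieslabels": ["Walk"]}}]},
    {
        "ground_truth": True,
        "result": [{"value": {"start": 2, "end": 3, "timeserieslabels": ["Run"]}}],
    },
    # Never reached: it comes after the ground truth annotation
    {"result": [{"value": {"start": 4, "end": 5, "timeserieslabels": ["Walk"]}}]},
]

labels, count = label_rows(times, annotations, label2idx)
print(labels, count)
```

In this sketch the row at time 2 ends up labeled "Run" because the ground truth annotation is applied after the first one, and the rows at times 4 and 5 stay background because the third annotation is skipped. The count is 5 rather than 4 because the overlapping row is counted for both regions, which matches the `labeled_rows += mask.sum()` accumulation in the diff.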