hinxcode · Hbeilinson · Oct 11, 2025 · Oct 11, 2025 · Oct 17, 2025 · Dec 8, 2025
diff --git a/.gitignore b/.gitignore
@@ -51,6 +51,9 @@ data/thumbnails/*
 !data/processed/.gitkeep
 !data/embeddings/.gitkeep
 !data/thumbnails/.gitkeep
+data--aip/*
+data--SephardicStudies/*
+
 # Model files
 *.pth
 *.pt

diff --git a/PullRequestInformation.md b/PullRequestInformation.md
@@ -0,0 +1,130 @@
+## Description
+
+I made three changes, all specifically to the photograph part of the app. These were:
+
+1) Allowing the app to run on photographs stored in S3, without having to locally store all of the raw images.
+2) Adding a date search filter.
+3) Adding an option to filter photographs by file path before running the embedding search.
+
+## Motivation and Context
+
+The first of these changes allows the app to scale to larger datasets of photographs. For use cases where there are over a million photos, it will be helpful to be able to run the app without having to store all of the photos locally.
+
+The next two enable more specific photograph searches. This is particularly useful when a user knows about a specific photo but not where to find it. By filtering on date or file path, they can narrow down the candidates and then layer the embedding search on top of that.
+
+## Type of Change
+
+<!-- Mark the relevant option with an "x" -->
+
+- [ ] Bug fix (non-breaking change that fixes an issue)
+- [x] New feature (non-breaking change that adds functionality)
+- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
+- [x] Documentation update
+- [ ] Code refactoring (no functional changes)
+- [ ] Performance improvement
+- [ ] Research contribution (new models, evaluation methods, etc.)
+- [ ] Other (please describe):
+
+## Component(s) Affected
+
+<!-- Mark all that apply -->
+
+- [x] Backend (Python/FastAPI)
+- [x] Frontend - Photographs
+- [ ] Frontend - Maps
+- [ ] Frontend - Documents
+- [ ] CLIP/ML models
+- [x] Configuration
+- [x] Documentation
+- [ ] Tests
+- [x] Build/deployment
+
+## Changes Made
+
+<!-- List the main changes in bullet points -->
+
+- Updated the generate_embeddings script to be able to download files from S3.
+- Updated the generate_embeddings script to store the original date of each photograph in the metadata file.
+- Updated the backend to fetch full photographs from S3 when they are not stored locally.
+- Updated the backend to provide an API for date search.
+- Updated the backend to enable a file path filter on text search.
+- Updated the photograph frontend to add a date search option.
+- Updated the photograph frontend to include a filter bar below the text search, currently only including the file path filter.
+
+## Testing
+
+### How Has This Been Tested?
+
+<!-- Describe the tests you ran and how to reproduce them -->
+
+I ran manual tests on each aspect that I described above.
+
+## Screenshots (if applicable)
+
+<!-- Add screenshots to demonstrate UI changes -->
+
+| Before | After |
+|--------|-------|
+| ![Previous text search](image.png) | ![Updated text search](image-1.png) |
+| N/A | ![New date search](image-2.png) |
+
+## Checklist
+
+<!-- Mark completed items with an "x" -->
+
+### Code Quality
+
+- [x] My code follows the project's coding standards
+- [x] I have run `black .` and `isort .` on Python code
+- [x] I have run `npm run lint` on frontend code (if applicable)
+- [x] I have performed a self-review of my own code
+- [x] I have commented my code, particularly in hard-to-understand areas
+- [x] My changes generate no new warnings or errors
+
+### Testing
+
+- [ ] I have added tests that prove my fix is effective or that my feature works
+  - I didn't see unit tests.
+- [ ] New and existing unit tests pass locally with my changes
+  - I didn't see unit tests.
+- [x] I have tested this locally with actual data
+
+### Documentation
+
+- [x] I have updated the documentation accordingly
+- [x] I have updated the README if needed
+- [x] I have added docstrings to new functions/classes
+- [x] I have updated `config.json` documentation if config changes were made
+
+### Dependencies
+
+- [x] I have updated `requirements.txt` (if Python dependencies changed)
+- [x] I have updated `package.json` (if Node dependencies changed)
+- [x] I have documented any new configuration options
+
+### Research (if applicable)
+
+- [ ] I have included references to relevant papers or research
+- [ ] I have shared evaluation results or benchmarks
+- [ ] I have included information about datasets used
+- [ ] I have documented model training procedures
+
+## Breaking Changes
+
+<!-- If this PR contains breaking changes, describe them here -->
+<!-- Include migration instructions for users -->
+
+None
+
+## Additional Notes
+
+<!-- Any additional information that reviewers should know -->
+
+## Reviewers Checklist (for maintainers)
+
+- [ ] Code quality and style compliance
+- [ ] Test coverage adequate
+- [ ] Documentation complete
+- [ ] No security concerns
+- [ ] Performance implications acceptable
+- [ ] Breaking changes documented
diff --git a/README.md b/README.md
@@ -12,9 +12,11 @@ This project describes out Digital Collections Explorer, available at: [https://
 
 We present Digital Collections Explorer, a web-based, open-source exploratory search platform that leverages CLIP (Contrastive Language-Image Pre-training) for enhanced visual discovery of digital collections. Our Digital Collections Explorer can be installed locally and configured to run on a visual collection of interest on disk in just a few steps. Building upon recent advances in multimodal search techniques, our interface enables natural language queries and reverse image searches over digital collections with visual features. An overview of our system can be seen in the image above.
 
+We are adding capabilities that are currently available only for photograph collections: a configuration for running on collections stored in AWS S3 buckets, an option to limit the natural language search to sub-directories of the collection, and a search by the original date of the photographs.
+
 ## Features
 
-- Multimodal search capabilities using both text and image inputs
+- Multimodal search capabilities using text, image, and metadata inputs (for photographs)
 - Support for various digital collection types:
   - Historical maps
   - Photographs
@@ -75,6 +77,12 @@ python -m src.models.clip.generate_embeddings
 
 This will process all images found in `raw_data_dir` and create embeddings in `embeddings_dir` (both set in `config.json`).
 
+If your data is stored in an S3 bucket instead of locally, ensure your default AWS profile has read and list access to your bucket, then run the above command with the following arguments:
+
+```bash
+python -m src.models.clip.generate_embeddings --use-remote --bucket <BUCKETNAME> --prefix <PREFIX>
+```
+
 ### Step 5: Start the Backend Server
 
 ```bash
@@ -83,6 +91,8 @@ python -m src.backend.main
 
 The API server will start at http://localhost:8000
 
+If your data is stored in S3, set the `REMOTE_FILES` flag in `src/backend/main.py` to `True`.
+
 ### Customizing the Frontend
 
 #### Development Mode

diff --git a/image-1.png b/image-1.png
diff --git a/image-2.png b/image-2.png
diff --git a/image.png b/image.png
diff --git a/package-lock.json b/package-lock.json
diff --git a/src/backend/api/routes/images.py b/src/backend/api/routes/images.py
@@ -3,7 +3,11 @@
 from fastapi import APIRouter, HTTPException, Query
 from fastapi.responses import FileResponse
 
+import boto3
+import os
+
 from src.backend.services.embedding_service import embedding_service
+import src.backend.utils.helpers as helpers
 
 router = APIRouter(tags=["images"])
 
@@ -37,6 +41,13 @@ async def get_image_by_id(
         and "processed" in doc["metadata"]["paths"]
     ):
         path_str = doc["metadata"]["paths"]["processed"]
+        if doc["metadata"].get("remote", False):  # key may be absent in older metadata
+            s3_client = boto3.session.Session().client("s3")
+            local_path = f"{doc['metadata']['processed_dir']}/{path_str}"
+            helpers.download_file(
+                s3_client, doc["metadata"]["bucket"], path_str, local_path
+            )
+            path_str = local_path
     else:
         raise HTTPException(
             status_code=404, detail="Image path not found in document metadata"
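Reviewer note: `helpers.download_file` is called above but its body is not part of this diff. A minimal sketch of what such a helper might look like, assuming it takes its arguments in the order used above. The skip-if-cached check and the directory creation are assumptions for illustration, not confirmed behavior of the PR:

```python
import os
from typing import Any


def download_file(s3_client: Any, bucket: str, key: str, local_path: str) -> str:
    """Download s3://bucket/key to local_path unless a cached copy exists.

    Hypothetical sketch: the real helper in src/backend/utils/helpers.py
    may behave differently.
    """
    if os.path.exists(local_path):
        return local_path  # reuse the copy downloaded by an earlier request

    parent = os.path.dirname(local_path)
    if parent:
        # S3 keys often contain "/" so the local parent folders may not exist yet.
        os.makedirs(parent, exist_ok=True)

    # boto3's S3 client exposes download_file(Bucket, Key, Filename).
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

With a check like this, repeated requests for the same image would hit S3 only once per server run, since the cache is cleared on shutdown.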

diff --git a/src/backend/api/routes/search.py b/src/backend/api/routes/search.py
@@ -1,3 +1,4 @@
+import datetime
 import logging
 from io import BytesIO
 
@@ -7,6 +8,7 @@
 from ...models.schemas import SearchResponse, SearchResult
 from ...services.clip_service import clip_service
 from ...services.embedding_service import embedding_service
+from ...services.metadata_search_service import metadata_search_service
 
 logger = logging.getLogger(__name__)
 router = APIRouter(prefix="/api/search", tags=["search"])
@@ -17,6 +19,7 @@ async def search_by_text(
     query: str,
     limit: int = Query(30, description="Number of results per page"),
     page: int = Query(1, description="Page number for pagination"),
+    filepath_search_term: str = Query("", description="Substring to filter file paths"),
 ):
     """Search for similar content using text query."""
     offset = (page - 1) * limit
@@ -28,7 +31,11 @@ async def search_by_text(
         text_embedding = clip_service.encode_text(query)
         logit_scale = clip_service.model.logit_scale.exp().item()
         raw_results = embedding_service.search(
-            text_embedding, logit_scale=logit_scale, limit=limit, offset=offset
+            text_embedding,
+            logit_scale=logit_scale,
+            limit=limit,
+            offset=offset,
+            filepath_search_term=filepath_search_term,
         )
 
         search_results = [
@@ -70,3 +77,30 @@ async def search_by_image(
     except Exception as e:
         logger.error(f"Error in image search: {str(e)}")
         return SearchResponse(results=[])
+
+
+@router.get("/date", response_model=SearchResponse)
+async def search_by_date(
+    query: datetime.date,
+    limit: int = Query(30, description="Number of results per page"),
+    page: int = Query(1, description="Page number for pagination"),
+    searchNearDate: bool = Query(
+        False, description="Whether to search for dates near the target date"
+    ),
+):
+    """Search for content by date, optionally including nearby dates."""
+    offset = (page - 1) * limit
+
+    try:
+        raw_results = metadata_search_service.date_search(
+            query, limit=limit, offset=offset, search_near_date=searchNearDate
+        )
+
+        search_results = [
+            SearchResult(id=result["id"], score=1, metadata=result["metadata"])
+            for result in raw_results
+        ]
+        return SearchResponse(results=search_results)
+    except Exception as e:
+        logger.error(f"Error in date search: {str(e)}")
+        return SearchResponse(results=[])
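Reviewer note: for anyone trying the new endpoint by hand, the sketch below builds a request URL from the route and query parameters declared above. The base address is the default from the README; `date_search_url` is a hypothetical helper name, and the lowercase `true`/`false` encoding assumes FastAPI's usual boolean parsing:

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # default backend address from the README


def date_search_url(
    target: date, limit: int = 30, page: int = 1, near: bool = False
) -> str:
    """Build the GET URL for /api/search/date (hypothetical helper)."""
    params = {
        "query": target.isoformat(),  # datetime.date is sent as YYYY-MM-DD
        "limit": limit,
        "page": page,
        "searchNearDate": str(near).lower(),  # "true" or "false"
    }
    return f"{BASE_URL}/api/search/date?{urlencode(params)}"


print(date_search_url(date(1925, 6, 1), near=True))
```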
diff --git a/src/backend/main.py b/src/backend/main.py
@@ -11,12 +11,16 @@
 from .core.config import settings
 from .services.embedding_service import embedding_service
 
+import os
+
 logging.basicConfig(
     level=logging.INFO,
     format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
 )
 logger = logging.getLogger(__name__)
 
+REMOTE_FILES = False
+
 
 @asynccontextmanager
 async def lifespan(app):
@@ -29,6 +33,10 @@ async def lifespan(app):
 
     yield
 
+    # Clean up cached files downloaded from S3
+    if REMOTE_FILES:
+        os.system(f"rm -rf {str(settings.processed_data_dir)}/*")
+
 
 app = FastAPI(
     title=settings.api_title,
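Reviewer note: the `os.system("rm -rf ...")` cleanup above depends on a POSIX shell and on the path containing no spaces or shell metacharacters. A possible alternative using only the standard library is sketched below. `clear_directory` is a hypothetical name; skipping dot-files is an assumption chosen to mirror how the shell glob `*` leaves `.gitkeep` in place:

```python
import shutil
from pathlib import Path


def clear_directory(directory: Path) -> int:
    """Delete the visible contents of `directory`, keeping the directory itself.

    Returns the number of top-level entries removed. Entries whose names
    start with "." (for example .gitkeep) are left alone.
    """
    removed = 0
    if not directory.is_dir():
        return removed
    for entry in directory.iterdir():
        if entry.name.startswith("."):
            continue  # keep hidden files such as .gitkeep
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a sub-directory and everything in it
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed
```

In the lifespan hook this could be called as `clear_directory(Path(settings.processed_data_dir))`, assuming `processed_data_dir` is a path-like setting.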

diff --git a/src/backend/services/embedding_service.py b/src/backend/services/embedding_service.py
@@ -5,6 +5,7 @@
 from typing import Any, Dict, List, Optional
 
 import torch
+import numpy as np
 
 from ..core.config import settings
 
@@ -99,16 +100,45 @@ def get_document_by_id(self, doc_id: str) -> Optional[Dict[str, Any]]:
 
         return None
 
+    def filepath_filter(self, filepath_substring: str) -> torch.Tensor:
+        """Return indices of items whose original file path contains the substring."""
+        metadata_arr = np.array(
+            [
+                self.metadata[item_id].get("paths", {}).get("original", "")
+                for item_id in self.item_ids
+            ]
+        )
+        matching_indices = np.where(
+            np.char.find(metadata_arr.astype(str), filepath_substring) != -1
+        )[0]
+        return torch.tensor(matching_indices, dtype=torch.long)
+
     def search(
         self,
         query_embedding: torch.Tensor,
         logit_scale: Optional[float] = None,
         limit: int = 20,
         offset: int = 0,
+        filepath_search_term: str = "",
     ) -> List[Dict[str, Any]]:
         """Search for similar items using query embedding with pagination"""
         try:
-            similarities = torch.matmul(self.embeddings, query_embedding.t()).squeeze()
+            if filepath_search_term != "":
+                valid_indices = self.filepath_filter(filepath_search_term)
+
+                if len(valid_indices) == 0:
+                    return []  # No items match the filter
+
+                # Filter embeddings to only valid ones
+                filtered_embeddings = self.embeddings[valid_indices]
+                similarities = torch.matmul(
+                    filtered_embeddings, query_embedding.t()
+                ).reshape(-1)  # stays 1-D even when only one item matches
+            else:
+                valid_indices = None
+                similarities = torch.matmul(
+                    self.embeddings, query_embedding.t()
+                ).squeeze()
 
             if logit_scale is not None:
                 similarities = similarities * logit_scale
@@ -127,7 +157,14 @@ def search(
             for idx, score in zip(
                 paginated_indices.tolist(), paginated_scores.tolist()
             ):
-                idx_int = int(idx)
+                # Map back to original index if we filtered
+                if valid_indices is not None:
+                    original_idx = valid_indices[idx].item()
+                else:
+                    original_idx = idx
+
+                idx_int = int(original_idx)
+
                 if idx_int >= len(self.item_ids):
                     logger.warning(
                         f"Index {idx_int} out of range for item_ids of length {len(self.item_ids)}"
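Reviewer note: the filter, rank, and map-back steps in `search` can be easy to get wrong, because positions in the filtered tensor differ from positions in `item_ids`. The pure-Python sketch below illustrates the same idea without torch. `filtered_search` is a hypothetical name, and the precomputed `scores` list stands in for the similarity matmul:

```python
from typing import List, Sequence, Tuple


def filtered_search(
    paths: Sequence[str],
    scores: Sequence[float],
    substring: str,
    limit: int = 20,
    offset: int = 0,
) -> List[Tuple[int, float]]:
    """Return (original_index, score) pairs for matching paths, best first."""
    # Step 1: indices (into the full collection) whose path contains the substring.
    valid = [i for i, path in enumerate(paths) if substring in path]

    # Step 2: rank positions within the filtered subset by descending score.
    ranked = sorted(range(len(valid)), key=lambda j: scores[valid[j]], reverse=True)

    # Step 3: paginate, then map each filtered position back to its original index.
    page = ranked[offset : offset + limit]
    return [(valid[j], scores[valid[j]]) for j in page]


paths = ["aip/1.jpg", "other/2.jpg", "aip/3.jpg", "aip/4.jpg"]
scores = [0.1, 0.9, 0.5, 0.3]
print(filtered_search(paths, scores, "aip/", limit=2))  # [(2, 0.5), (3, 0.3)]
```

The highest-scoring item overall (`other/2.jpg`) never appears because it is excluded before ranking, which matches the filter-first behavior in the diff.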