3 changes: 3 additions & 0 deletions .gitignore
@@ -51,6 +51,9 @@ data/thumbnails/*
!data/processed/.gitkeep
!data/embeddings/.gitkeep
!data/thumbnails/.gitkeep
data--aip/*
data--SephardicStudies/*

# Model files
*.pth
*.pt
130 changes: 130 additions & 0 deletions PullRequestInformation.md
@@ -0,0 +1,130 @@
## Description

I made three changes, all to the photograph part of the app:

1) Allow the app to run on photographs stored in S3 without storing all of the raw images locally.
2) Add a date search option.
3) Add an option to filter photographs by file path before running the embedding search.

## Motivation and Context

The first change lets the app scale to larger photograph collections. For collections of more than a million photos, it helps to run the app without storing every photo locally.

The other two changes enable more targeted photograph searches. This is particularly useful when a user knows about a specific photo but not where to find it. They can search by date, or filter by file path and then layer the embedding search on top of that filter, to get closer to the photo they want.

## Type of Change

<!-- Mark the relevant option with an "x" -->

- [ ] Bug fix (non-breaking change that fixes an issue)
- [x] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] Documentation update
- [ ] Code refactoring (no functional changes)
- [ ] Performance improvement
- [ ] Research contribution (new models, evaluation methods, etc.)
- [ ] Other (please describe):

## Component(s) Affected

<!-- Mark all that apply -->

- [x] Backend (Python/FastAPI)
- [x] Frontend - Photographs
- [ ] Frontend - Maps
- [ ] Frontend - Documents
- [ ] CLIP/ML models
- [x] Configuration
- [x] Documentation
- [ ] Tests
- [x] Build/deployment

## Changes Made

<!-- List the main changes in bullet points -->

- Updated the `generate_embeddings` script so it can download files from S3.
- Updated the `generate_embeddings` script to store the origin date of each photograph in the metadata file.
- Updated the backend to fetch full photographs from S3 when they are not stored locally.
- Updated the backend to provide an API for date search.
- Updated the backend to support a file path filter on text search.
- Updated the photograph frontend to add a date search option.
- Updated the photograph frontend to include a filter bar below the text search, currently only including the file path filter.
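
As a rough illustration of the new date search API, the sketch below builds a request URL for it. The route `/api/search/date` and the `query`, `limit`, `page`, and `searchNearDate` parameters come from `search.py` in this PR; the helper name `build_date_search_url` is hypothetical, and the base URL is the default from the README.

```python
import datetime
from urllib.parse import urlencode


def build_date_search_url(
    base_url: str,
    date: datetime.date,
    limit: int = 30,
    page: int = 1,
    search_near_date: bool = False,
) -> str:
    """Build the GET URL for the date-search route (hypothetical helper)."""
    params = {
        "query": date.isoformat(),  # ISO date string, e.g. 1950-06-01
        "limit": limit,
        "page": page,
        # Send the boolean in lowercase so it reads naturally in the URL
        "searchNearDate": str(search_near_date).lower(),
    }
    return f"{base_url.rstrip('/')}/api/search/date?{urlencode(params)}"


url = build_date_search_url(
    "http://localhost:8000", datetime.date(1950, 6, 1), search_near_date=True
)
print(url)
```

Any HTTP client can then issue a GET request to the printed URL once the backend is running.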

## Testing

### How Has This Been Tested?

<!-- Describe the tests you ran and how to reproduce them -->

I ran manual tests on each aspect that I described above.

## Screenshots (if applicable)

<!-- Add screenshots to demonstrate UI changes -->

| Before | After |
|--------|-------|
| ![Previous text search](image.png) | ![Updated text search](image-1.png) |
| N/A | ![New date search](image-2.png) |

## Checklist

<!-- Mark completed items with an "x" -->

### Code Quality

- [x] My code follows the project's coding standards
- [x] I have run `black .` and `isort .` on Python code
- [x] I have run `npm run lint` on frontend code (if applicable)
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] My changes generate no new warnings or errors

### Testing

- [ ] I have added tests that prove my fix is effective or that my feature works
I didn't see unit tests.
- [ ] New and existing unit tests pass locally with my changes
I didn't see unit tests.
- [x] I have tested this locally with actual data

### Documentation

- [x] I have updated the documentation accordingly
- [x] I have updated the README if needed
- [x] I have added docstrings to new functions/classes
- [x] I have updated `config.json` documentation if config changes were made

### Dependencies

- [x] I have updated `requirements.txt` (if Python dependencies changed)
- [x] I have updated `package.json` (if Node dependencies changed)
- [x] I have documented any new configuration options

### Research (if applicable)

- [ ] I have included references to relevant papers or research
- [ ] I have shared evaluation results or benchmarks
- [ ] I have included information about datasets used
- [ ] I have documented model training procedures

## Breaking Changes

<!-- If this PR contains breaking changes, describe them here -->
<!-- Include migration instructions for users -->

None.

## Additional Notes

<!-- Any additional information that reviewers should know -->

## Reviewers Checklist (for maintainers)

- [ ] Code quality and style compliance
- [ ] Test coverage adequate
- [ ] Documentation complete
- [ ] No security concerns
- [ ] Performance implications acceptable
- [ ] Breaking changes documented
12 changes: 11 additions & 1 deletion README.md
@@ -12,9 +12,11 @@ This project describes out Digital Collections Explorer, available at: [https://

We present Digital Collections Explorer, a web-based, open-source exploratory search platform that leverages CLIP (Contrastive Language-Image Pre-training) for enhanced visual discovery of digital collections. Our Digital Collections Explorer can be installed locally and configured to run on a visual collection of interest on disk in just a few steps. Building upon recent advances in multimodal search techniques, our interface enables natural language queries and reverse image searches over digital collections with visual features. An overview of our system can be seen in the image above.

We are adding capabilities that are currently available only for photograph collections. These include a configuration for running on collections stored in AWS S3 buckets, the option to limit the natural language search to sub-directories of the collection, and the option to search by the original date of the photographs.

## Features

- Multimodal search capabilities using both text and image inputs
- Multimodal search capabilities using text, image, and metadata inputs (for photographs)
- Support for various digital collection types:
- Historical maps
- Photographs
@@ -75,6 +77,12 @@ python -m src.models.clip.generate_embeddings

This will process all images found in `raw_data_dir` and create embeddings in `embeddings_dir` (both set in `config.json`).

If your data is stored in an S3 bucket instead of locally, ensure your default AWS profile has read and list access to your bucket, then run the above command with the following arguments:

```bash
python -m src.models.clip.generate_embeddings --use-remote --bucket <BUCKETNAME> --prefix <PREFIX>
```
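
Before running the command against S3, it may help to confirm that the credentials really have list and read access. The following is a minimal sketch, assuming a boto3-style S3 client is passed in; the function name `check_bucket_access` is hypothetical and not part of this repository.

```python
from typing import Any


def check_bucket_access(s3_client: Any, bucket: str, prefix: str = "") -> bool:
    """Return True if the client can list the prefix and read one object from it.

    `s3_client` is assumed to behave like a boto3 S3 client.
    """
    try:
        # List access: ask for at most one key under the prefix
        listing = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        contents = listing.get("Contents", [])
        if not contents:
            return True  # Listing worked, but there is nothing to read
        # Read access: fetch only the object's metadata, not its body
        s3_client.head_object(Bucket=bucket, Key=contents[0]["Key"])
        return True
    except Exception as exc:  # boto3 clients raise on denied or missing access
        print(f"Access check failed: {exc}")
        return False
```

With boto3 installed, this could be called as `check_bucket_access(boto3.session.Session().client("s3"), "<BUCKETNAME>", "<PREFIX>")`, which uses the default AWS profile.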

### Step 5: Start the Backend Server

```bash
@@ -83,6 +91,8 @@ python -m src.backend.main

The API server will start at http://localhost:8000

If your data is stored in S3, set the `REMOTE_FILES` flag in `src/backend/main.py` to `True` before starting the server.

### Customizing the Frontend

#### Development Mode
Binary file added image-1.png
Binary file added image-2.png
Binary file added image.png
4 changes: 2 additions & 2 deletions package-lock.json

11 changes: 11 additions & 0 deletions src/backend/api/routes/images.py
@@ -3,7 +3,11 @@
from fastapi import APIRouter, HTTPException, Query
from fastapi.responses import FileResponse

import boto3
import os

from src.backend.services.embedding_service import embedding_service
import src.backend.utils.helpers as helpers

router = APIRouter(tags=["images"])

@@ -37,6 +41,13 @@ async def get_image_by_id(
and "processed" in doc["metadata"]["paths"]
):
path_str = doc["metadata"]["paths"]["processed"]
if doc["metadata"].get("remote"):  # metadata written before this change has no "remote" key
s3_client = boto3.session.Session().client("s3")
local_dir = f"{doc['metadata']['processed_dir']}/{path_str}"
helpers.download_file(
s3_client, doc["metadata"]["bucket"], path_str, local_dir
)
path_str = local_dir
else:
raise HTTPException(
status_code=404, detail="Image path not found in document metadata"
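
The body of `helpers.download_file` is not shown in this diff. As a rough sketch of what a helper with that call signature might look like, the version below skips the download when the file is already cached locally, so repeated requests for the same image do not hit S3 again. The caching behavior is an assumption, not something the diff confirms; `s3_client` is assumed to be a boto3 S3 client, whose `download_file(Bucket, Key, Filename)` method is used.

```python
import os
from typing import Any


def download_file(s3_client: Any, bucket: str, key: str, local_path: str) -> str:
    """Download `key` from `bucket` to `local_path` unless it is already cached."""
    if os.path.exists(local_path):
        return local_path  # Cached by an earlier request; skip the download
    # S3 keys often contain "/", so make sure the parent directory exists
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

Creating the S3 client once at startup and reusing it across requests would also avoid building a new session on every image fetch.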
36 changes: 35 additions & 1 deletion src/backend/api/routes/search.py
@@ -1,3 +1,4 @@
import datetime
import logging
from io import BytesIO

@@ -7,6 +8,7 @@
from ...models.schemas import SearchResponse, SearchResult
from ...services.clip_service import clip_service
from ...services.embedding_service import embedding_service
from ...services.metadata_search_service import metadata_search_service

logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/search", tags=["search"])
@@ -17,6 +19,7 @@ async def search_by_text(
query: str,
limit: int = Query(30, description="Number of results per page"),
page: int = Query(1, description="Page number for pagination"),
filepath_search_term: str = Query("", description="Substring to filter file paths"),
):
"""Search for similar content using text query."""
offset = (page - 1) * limit
@@ -28,7 +31,11 @@ async def search_by_text(
text_embedding = clip_service.encode_text(query)
logit_scale = clip_service.model.logit_scale.exp().item()
raw_results = embedding_service.search(
text_embedding, logit_scale=logit_scale, limit=limit, offset=offset
text_embedding,
logit_scale=logit_scale,
limit=limit,
offset=offset,
filepath_search_term=filepath_search_term,
)

search_results = [
@@ -70,3 +77,30 @@ async def search_by_image(
except Exception as e:
logger.error(f"Error in image search: {str(e)}")
return SearchResponse(results=[])


@router.get("/date", response_model=SearchResponse)
async def search_by_date(
query: datetime.date,
limit: int = Query(30, description="Number of results per page"),
page: int = Query(1, description="Page number for pagination"),
searchNearDate: bool = Query(
False, description="Whether to search for dates near the target date"
),
):
"""Search for similar content using date query."""
offset = (page - 1) * limit

try:
raw_results = metadata_search_service.date_search(
query, limit=limit, offset=offset, search_near_date=searchNearDate
)

search_results = [
SearchResult(id=result["id"], score=1, metadata=result["metadata"])
for result in raw_results
]
return SearchResponse(results=search_results)
except Exception as e:
logger.error(f"Error in date search: {str(e)}")
return SearchResponse(results=[])
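
The `metadata_search_service.date_search` implementation is not part of the visible diff. The sketch below shows one way such a search could work, purely as an illustration: exact matches by default, and matches within a tolerance window when `search_near_date` is set, ordered by distance from the target date. The `"date"` metadata key, the ISO string format, and the `window_days` parameter are all assumptions.

```python
import datetime
from typing import Any, Dict, List


def date_search(
    metadata: Dict[str, Dict[str, Any]],
    target: datetime.date,
    limit: int = 30,
    offset: int = 0,
    search_near_date: bool = False,
    window_days: int = 30,  # assumed tolerance for "near" matches
) -> List[Dict[str, Any]]:
    """Return items dated on `target`, or within `window_days` of it."""
    scored = []
    for item_id, meta in metadata.items():
        raw = meta.get("date")  # assumed key holding an ISO date string
        if not raw:
            continue  # Items without a date can never match
        distance = abs((datetime.date.fromisoformat(raw) - target).days)
        if distance == 0 or (search_near_date and distance <= window_days):
            scored.append((distance, item_id))
    scored.sort()  # Closest dates first, ties broken by id
    page = scored[offset : offset + limit]
    return [{"id": item_id, "metadata": metadata[item_id]} for _, item_id in page]
```

The returned dictionaries carry the `id` and `metadata` keys that the route above reads when it builds each `SearchResult`.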
8 changes: 8 additions & 0 deletions src/backend/main.py
@@ -11,12 +11,16 @@
from .core.config import settings
from .services.embedding_service import embedding_service

import os

logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

REMOTE_FILES = False


@asynccontextmanager
async def lifespan(app):
@@ -29,6 +33,10 @@ async def lifespan(app):

yield

# Clean up cached files downloaded from S3
if REMOTE_FILES:
os.system(f"rm -rf {str(settings.processed_data_dir)}/*")


app = FastAPI(
title=settings.api_title,
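
The cleanup step above shells out to `rm -rf`, which depends on a POSIX shell and on the path containing no spaces or shell metacharacters. A possible alternative, sketched here under the assumption that the goal is to empty the cache directory while keeping the directory itself, uses `pathlib` and `shutil`. The function name `clear_directory` is hypothetical. Like the shell glob `*`, it leaves dotfiles such as `.gitkeep` in place.

```python
import shutil
from pathlib import Path


def clear_directory(directory: Path) -> int:
    """Delete the visible contents of `directory`; return how many entries were removed."""
    removed = 0
    if not directory.is_dir():
        return removed
    for entry in directory.iterdir():
        if entry.name.startswith("."):
            continue  # Mirror the shell glob: dotfiles such as .gitkeep survive
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # Remove a real subdirectory and its contents
        else:
            entry.unlink()  # Remove a file or a symlink
        removed += 1
    return removed
```

In the lifespan handler this could be called as `clear_directory(Path(settings.processed_data_dir))` when `REMOTE_FILES` is set.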
41 changes: 39 additions & 2 deletions src/backend/services/embedding_service.py
@@ -5,6 +5,7 @@
from typing import Any, Dict, List, Optional

import torch
import numpy as np

from ..core.config import settings

@@ -99,16 +100,45 @@ def get_document_by_id(self, doc_id: str) -> Optional[Dict[str, Any]]:

return None

def filepath_filter(self, filepath_substring: str) -> torch.Tensor:
"""Create a metadata filter for file path substring matching"""
metadata_arr = np.array(
[
self.metadata[item_id].get("paths", {}).get("original", "")
for item_id in self.item_ids
]
)
matching_indices = np.where(
np.char.find(metadata_arr.astype(str), filepath_substring) != -1
)[0]
return torch.tensor(matching_indices, dtype=torch.long)

def search(
self,
query_embedding: torch.Tensor,
logit_scale: Optional[float] = None,
limit: int = 20,
offset: int = 0,
filepath_search_term: str = "",
) -> List[Dict[str, Any]]:
"""Search for similar items using query embedding with pagination"""
try:
similarities = torch.matmul(self.embeddings, query_embedding.t()).squeeze()
if filepath_search_term != "":
valid_indices = self.filepath_filter(filepath_search_term)

if len(valid_indices) == 0:
return [] # No items match the filter

# Filter embeddings to only valid ones
filtered_embeddings = self.embeddings[valid_indices]
similarities = torch.matmul(
filtered_embeddings, query_embedding.t()
).squeeze()
else:
valid_indices = None
similarities = torch.matmul(
self.embeddings, query_embedding.t()
).squeeze()

if logit_scale is not None:
similarities = similarities * logit_scale
@@ -127,7 +157,14 @@ def search(
for idx, score in zip(
paginated_indices.tolist(), paginated_scores.tolist()
):
idx_int = int(idx)
# Map back to original index if we filtered
if valid_indices is not None:
original_idx = valid_indices[idx].item()
else:
original_idx = idx

idx_int = int(original_idx)

if idx_int >= len(self.item_ids):
logger.warning(
f"Index {idx_int} out of range for item_ids of length {len(self.item_ids)}"
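
The filter-then-search flow in this file can be sketched without torch or NumPy. The version below is a simplified illustration, not the service's code: it filters rows by path substring, scores only the surviving rows with a dot product, and returns results keyed by their position in the original, unfiltered list. That position is what the `valid_indices` remapping in the diff recovers. The function name `filtered_search` and the plain-list inputs are assumptions made to keep the example dependency-free.

```python
from typing import List, Sequence, Tuple


def filtered_search(
    embeddings: Sequence[Sequence[float]],
    paths: Sequence[str],
    query: Sequence[float],
    filepath_search_term: str = "",
    limit: int = 20,
    offset: int = 0,
) -> List[Tuple[int, float]]:
    """Return (original_index, score) pairs, best match first."""
    # Keep only rows whose path contains the substring ("" keeps every row)
    valid_indices = [i for i, p in enumerate(paths) if filepath_search_term in p]
    # Dot-product similarity, computed on the surviving rows only
    scored = [
        (i, sum(a * b for a, b in zip(embeddings[i], query)))
        for i in valid_indices
    ]
    # Sort by score, descending; each pair already carries its original index
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[offset : offset + limit]


embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
paths = ["a/x.jpg", "b/y.jpg", "a/z.jpg"]
results = filtered_search(embeddings, paths, [1.0, 0.0], filepath_search_term="a/")
print(results)
```

Because index 1 is filtered out, the second result sits at position 1 of the filtered scores but refers to item 2 of the original list, which is why the mapping back matters.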