@Hbeilinson
Description

I made three changes, all to the photograph part of the app:

  1. Allowing the app to run on photographs stored in S3, without having to locally store all of the raw images.
  2. Adding a date search filter.
  3. Adding an option to filter photographs by file path before running the embedding search.

Motivation and Context

The first of these changes allows the app to scale to larger datasets of photographs. For collections of more than a million photos, it helps to be able to run the app without storing every photo locally.

The next two are to enable more specific photograph searching. This is particularly useful for contexts where a user might know about a specific photo they're looking for, but not know where to find it. By filtering based on date or file name they can get closer to finding the photo they want, and then layer the embedding search on top of that.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Research contribution (new models, evaluation methods, etc.)
  • Other (please describe):

Component(s) Affected

  • Backend (Python/FastAPI)
  • Frontend - Photographs
  • Frontend - Maps
  • Frontend - Documents
  • CLIP/ML models
  • Configuration
  • Documentation
  • Tests
  • Build/deployment

Changes Made

  • Updated the generate_embeddings script so it can download files from S3.
  • Updated the generate_embeddings script to store the origin date of each photograph in the metadata file.
  • Updated the backend to fetch full photographs from S3 when they are not stored locally.
  • Updated the backend to provide an API for date search.
  • Updated the backend to enable file name filter on text search.
  • Updated the photograph frontend to add a date search option.
  • Updated the photograph frontend to include a filter bar below the text search, currently only including the file path filter.
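
The "fetch from S3 when not stored locally" behaviour listed above can be illustrated with a minimal sketch. This is not the code in this PR: `resolve_photo` and `fetch_remote` are hypothetical names, and the remote download is passed in as a callable so the example stays independent of any particular S3 client.

```python
from pathlib import Path
from typing import Callable


def resolve_photo(
    relative_path: str,
    raw_data_dir: Path,
    fetch_remote: Callable[[str, Path], None],
) -> Path:
    """Return a local path for a photo, downloading it only if it is missing.

    `fetch_remote(key, destination)` is a hypothetical hook that is expected
    to write the remote object identified by `key` to `destination`.
    """
    local_path = raw_data_dir / relative_path
    if not local_path.exists():
        # Create the parent folders first so the downloader can write the file.
        local_path.parent.mkdir(parents=True, exist_ok=True)
        fetch_remote(relative_path, local_path)
    return local_path
```

With boto3, `fetch_remote` could be a small wrapper around the S3 client's `download_file(bucket, key, filename)` call; the bucket name and key layout would be assumptions specific to each deployment.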

Testing

How Has This Been Tested?

I manually tested each of the changes described above.

Screenshots (if applicable)

| Before | After |
| --- | --- |
| Previous text search | Updated text search |
| N/A | New date search |

Checklist

Code Quality

  • My code follows the project's coding standards
  • I have run black . and isort . on Python code
  • I have run npm run lint on frontend code (if applicable)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors

Testing

  • I have added tests that prove my fix is effective or that my feature works
    I didn't see unit tests.
  • New and existing unit tests pass locally with my changes
    I didn't see unit tests.
  • I have tested this locally with actual data

Documentation

  • I have updated the documentation accordingly
  • I have updated the README if needed
  • I have added docstrings to new functions/classes
  • I have updated config.json documentation if config changes were made

Dependencies

  • I have updated requirements.txt (if Python dependencies changed)
  • I have updated package.json (if Node dependencies changed)
  • I have documented any new configuration options

Research (if applicable)

  • I have included references to relevant papers or research
  • I have shared evaluation results or benchmarks
  • I have included information about datasets used
  • I have documented model training procedures

Breaking Changes

None / (describe breaking changes)

Additional Notes

Reviewers Checklist (for maintainers)

  • Code quality and style compliance
  • Test coverage adequate
  • Documentation complete
  • No security concerns
  • Performance implications acceptable
  • Breaking changes documented

Owner

@hinxcode left a comment

Thanks for this great work! The S3 support and date filtering are solid additions. I had a few thoughts on the implementation:

S3 file handling

I took a look at process_remote_files and noticed it’s still downloading files to local disk and then cleaning them up afterward. One alternative I’d prefer is to keep generate_embeddings.py focused on embedding generation, and provide a new helper script (can be written in any language) that downloads the collection from S3 into raw_data_dir up front. That way:
(1) We avoid adding more parameters/configuration for bucket names, S3 credentials, etc.
(2) We can rely on proven tooling like s5cmd for fast, reliable syncs, with optional cleanup handled by the helper script.
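
As a rough illustration of such a helper, here is a minimal Python sketch that shells out to s5cmd. The function names and the example bucket prefix are hypothetical, and the sketch assumes s5cmd is installed and picks up credentials from the standard AWS environment or config files, so no S3 settings need to reach generate_embeddings.py.

```python
import shutil
import subprocess
from pathlib import Path


def build_sync_command(s3_prefix: str, raw_data_dir: Path) -> list[str]:
    """Build an `s5cmd sync` invocation that mirrors a prefix into raw_data_dir."""
    # s5cmd expects a wildcard on the source side to copy everything under the prefix.
    source = s3_prefix.rstrip("/") + "/*"
    return ["s5cmd", "sync", source, str(raw_data_dir) + "/"]


def sync_collection(s3_prefix: str, raw_data_dir: Path) -> None:
    """Download the collection up front; embedding generation then runs on local files."""
    if shutil.which("s5cmd") is None:
        raise RuntimeError("s5cmd was not found on PATH")
    raw_data_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the sync fails.
    subprocess.run(build_sync_command(s3_prefix, raw_data_dir), check=True)
```

For example, `sync_collection("s3://example-bucket/photos", Path("raw_data"))` would populate `raw_data/` before generate_embeddings.py is run; optional cleanup afterwards could live in the same helper.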

Date filtering UX

Instead of making date search its own tab, could we make date a filter that works alongside text and image search? In practice, I can imagine researchers and practitioners wanting workflows like searching with results limited to a certain time range in both modes. We may also need to keep the codebase flexible, leaving more room for future contributors to extend the same pattern and add additional filters.
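
One possible shape for this, sketched under assumptions: each filter is a plain predicate over photo metadata, and the search endpoint narrows the candidate set before ranking embeddings, whether the query is text or an image. `PhotoMeta`, `date_range`, `path_contains`, and `candidate_indices` are hypothetical names, not part of the existing codebase.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class PhotoMeta:
    """Hypothetical metadata record; the real metadata file may differ."""
    path: str
    taken_on: Optional[date]


# A filter is any function that decides whether a photo stays in the candidate set.
PhotoFilter = Callable[[PhotoMeta], bool]


def date_range(start: date, end: date) -> PhotoFilter:
    """Keep photos whose origin date falls within [start, end]; undated photos are dropped."""
    def keep(photo: PhotoMeta) -> bool:
        return photo.taken_on is not None and start <= photo.taken_on <= end
    return keep


def path_contains(fragment: str) -> PhotoFilter:
    """Keep photos whose file path contains the fragment, ignoring case."""
    needle = fragment.lower()
    return lambda photo: needle in photo.path.lower()


def candidate_indices(photos: list[PhotoMeta], filters: Iterable[PhotoFilter]) -> list[int]:
    """Return the indices of photos that pass every active filter."""
    active = list(filters)
    return [i for i, photo in enumerate(photos) if all(f(photo) for f in active)]
```

The embedding search would then rank only the rows at the returned indices. Adding a new filter means writing one more function that returns a `PhotoFilter`, without touching the search code itself.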
