Commit 008060b
Exact deduplication api (#965)
* New API Spec with Ray Backend (#726)
* Create package + reorganize (#2)
* fc
Signed-off-by: Praateek <praateekm@gmail.com>
* remove per file ignore
Signed-off-by: Praateek <praateekm@gmail.com>
* sc
Signed-off-by: Praateek <praateekm@gmail.com>
* ruff
Signed-off-by: Praateek <praateekm@gmail.com>
* use curator_id_str
Signed-off-by: Praateek <praateekm@gmail.com>
---------
Signed-off-by: Praateek <praateekm@gmail.com>
* fc
Signed-off-by: Praateek <praateekm@gmail.com>
* kmeans works
Signed-off-by: Praateek <praateekm@gmail.com>
* Fuzzy dedup fixes (#11)
* high level method for each step
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Fixes/changes after testing
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Updates to existing fuzzy_dedup modules
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add high level fuzzy dedup api and e2e example
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add e2e example
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add config
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---------
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* fc
Signed-off-by: Praateek <praateekm@gmail.com>
* fc
Signed-off-by: Praateek <praateekm@gmail.com>
* removal works
Signed-off-by: Praateek <praateekm@gmail.com>
* bug fix
Signed-off-by: Praateek <praateekm@gmail.com>
* working streaming embedding with id generator
Signed-off-by: Praateek <praateekm@gmail.com>
* Dump high level skeleton
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* update xenna executor
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* More changes
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* working example
Signed-off-by: Praateek <praateekm@gmail.com>
* Revert "working example"
This reverts commit 7b3e65173dd1df92b0de9431fcfebdbc0b93d6c9.
* [WIP] Add reader + utf modifier (#31)
* Dump high level skeleton
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* update xenna executor
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* More changes
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Updates for utfModifier+ high level updates
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Remove old examples and add new modifier and stages
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add modify stage
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* More updates
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---------
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Revert "[WIP] Add reader + utf modifier (#31)" (#32)
This reverts commit ef25e3eff6502cb9bfc4a57ba48f0939284fd49b.
* rebase
Signed-off-by: Praateek <praateekm@gmail.com>
* rebase continue
Signed-off-by: Praateek <praateekm@gmail.com>
* Remove older file versions
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* Final changes as per the meeting
* refactor
Signed-off-by: Praateek <praateekm@gmail.com>
* example works
Signed-off-by: Praateek <praateekm@gmail.com>
* add base classes
Signed-off-by: Praateek <praateekm@gmail.com>
* example works
Signed-off-by: Praateek <praateekm@gmail.com>
* ..
Signed-off-by: Praateek <praateekm@gmail.com>
* more google style
Signed-off-by: Praateek <praateekm@gmail.com>
* add init for backends
Signed-off-by: Praateek <praateekm@gmail.com>
* Update example script
* add impl
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add suggestions
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add another check
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* Move changes one level deeper in ray-curator, add pyproject toml
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Update dependencies to include cosmos-xenna and pyarrow explicitly
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Update python upper bound
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add a simple contributing file with instructions
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Remove pyarrow check since it's an explicit dependency
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Remove unused file utils
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---------
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Co-authored-by: Praateek <praateekm@gmail.com>
Co-authored-by: Sarah Yurick <sarahyurick@gmail.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* [Ray] Allow loguru to be serialized (#729)
* [Ray] Add Jsonl / Parquet Writer Stage (#730)
* Update CI testing workflow for ray branch (#739)
* Update ci workflow to build ray-curator package instead
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Split out CPU and GPU modules
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Update pytest command
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* update crossfit dep to use pinned version (avoiding absl dep issues)
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Explicitly add absl-py dependency to avoid python 3.10 errors
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Update paths for codecov
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---------
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Initial API design doc (#737)
* Initial API design doc
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray-curator/api-design.md
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Refine map-style execution description in API design document to clarify task transformation and mapping flexibility.
* Remove redundant sections on Tasks, Stages, and Pipelines from the API design document to streamline content and improve clarity.
* Add quickstart example and update API design documentation
- Introduced a new quickstart example in `ray_curator/examples/quickstart.py` demonstrating a sentiment analysis pipeline with three stages: TaskCreationStage, WordCountStage, and SentimentStage.
- Updated `api-design.md` to include a new section for examples, linking to the quickstart for user reference.
- Clarified resource requirements in `resources.py` documentation for GPU usage and constraints.
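The three-stage quickstart described above can be pictured with a minimal sketch in plain Python. Only the stage names come from this log; the `SentenceTask` type, the `process` method, the keyword-based sentiment rule, and `run_pipeline` are assumptions made for illustration and are not the ray_curator API.

```python
from dataclasses import dataclass, field


@dataclass
class SentenceTask:
    """Hypothetical task type: a batch of sentences plus per-sentence annotations."""

    sentences: list
    word_counts: list = field(default_factory=list)
    sentiments: list = field(default_factory=list)


class TaskCreationStage:
    """Turns raw strings into a task, dropping blank entries."""

    def process(self, raw):
        return SentenceTask(sentences=[s.strip() for s in raw if s.strip()])


class WordCountStage:
    """Annotates each sentence with its whitespace-separated word count."""

    def process(self, task):
        task.word_counts = [len(s.split()) for s in task.sentences]
        return task


class SentimentStage:
    """Toy keyword-based sentiment labeller (an assumption, not a real model)."""

    POSITIVE = {"good", "great", "love"}
    NEGATIVE = {"bad", "awful", "hate"}

    def process(self, task):
        labels = []
        for sentence in task.sentences:
            words = {w.strip(".,!?").lower() for w in sentence.split()}
            score = len(words & self.POSITIVE) - len(words & self.NEGATIVE)
            if score > 0:
                labels.append("positive")
            elif score < 0:
                labels.append("negative")
            else:
                labels.append("neutral")
        task.sentiments = labels
        return task


def run_pipeline(raw, stages):
    """Feeds the output of each stage into the next one, in order."""
    result = raw
    for stage in stages:
        result = stage.process(result)
    return result


result = run_pipeline(
    ["I love this library!", "The docs are bad.", "   "],
    [TaskCreationStage(), WordCountStage(), SentimentStage()],
)
```

Each stage takes a task and returns a task, which is what lets an executor chain stages without knowing what any one of them does.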
* Ruff related changes
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* PR changes
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update DocumentTask to DocumentBatch in API design for improved type flexibility
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Add fault tolerance requirements to API design documentation
- Introduced a new section outlining the necessity for fault tolerance and retry safety in all stages.
- Highlighted critical aspects such as task preemption and handling of partial operations to ensure robustness during execution.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
---------
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Refactor XennaExecutor by removing the cluster initialization function and deleting the associated ray_cluster_init.py file. This streamlines the execution process by eliminating unnecessary setup code. (#768)
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* [Ray] Add Ray Data as an experimental backend (#740)
* [Ray] Add integration test to test backends for a specified pipeline (#770)
* Adding with_ for options in ProcessingStage and CompositeStage (#764)
* [Ray] `DocumentFilter` and `Filter`/`Score`/`ScoreFilter` (#746)
* add documentfilter implementation
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* fix nits and ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add additional logic for setup, setup_on_node, and process_batch
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add pytests
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add dep
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* more dep edits
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* another dep
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add fasttext dep
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add jieba and mecab
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add default None params for setup_on_node and setup functions
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add praateek's suggestions
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* organize imports
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* remove process_batch
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add _metadata to result
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add praateek's suggestions
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* ruff and post init for _name
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* modify test
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
---------
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* [Ray] Add Download Extract Base Class + Common Crawl Stage (#738)
* [Ray] Use Ray Actors where viable (#792)
* Extract and download for Wikipedia (#795)
* copy over
Signed-off-by: Praateek <praateekm@gmail.com>
* copy over
Signed-off-by: Praateek <praateekm@gmail.com>
* add init to download
Signed-off-by: Praateek <praateekm@gmail.com>
* move justext
Signed-off-by: Praateek <praateekm@gmail.com>
* move resiliparse
Signed-off-by: Praateek <praateekm@gmail.com>
* move trafilatura
Signed-off-by: Praateek <praateekm@gmail.com>
* move get_stop_list_dict
Signed-off-by: Praateek <praateekm@gmail.com>
* move download_utils.py to utils/download_utils.py
Signed-off-by: Praateek <praateekm@gmail.com>
* move out to download.py
Signed-off-by: Praateek <praateekm@gmail.com>
* move WarcIterator to warc_reader.py
Signed-off-by: Praateek <praateekm@gmail.com>
* move CommonCrawlWARCExtractor to html_extractor
Signed-off-by: Praateek <praateekm@gmail.com>
* remove commoncrawl.py
Signed-off-by: Praateek <praateekm@gmail.com>
* create url_generation.py from download_utils
Signed-off-by: Praateek <praateekm@gmail.com>
* tests dir
Signed-off-by: Praateek <praateekm@gmail.com>
* copy over test_download.py as test_common_crawl.py
Signed-off-by: Praateek <praateekm@gmail.com>
* add html_extractors/__init__
Signed-off-by: Praateek <praateekm@gmail.com>
* move html_extractor to ProcessingStage
Signed-off-by: Praateek <praateekm@gmail.com>
* update WarcReader to use ProcessingStage
Signed-off-by: Praateek <praateekm@gmail.com>
* move to classes for url generation
Signed-off-by: Praateek <praateekm@gmail.com>
* typo in name
Signed-off-by: Praateek <praateekm@gmail.com>
* bug fixes in justext; rename resiliparse func; utils modular
Signed-off-by: Praateek <praateekm@gmail.com>
* init file in for download/text
Signed-off-by: Praateek <praateekm@gmail.com>
* justext minor change
Signed-off-by: Praateek <praateekm@gmail.com>
* support str in htmlextractor
Signed-off-by: Praateek <praateekm@gmail.com>
* add a working example
Signed-off-by: Praateek <praateekm@gmail.com>
* set source_files so that write can be hashed
Signed-off-by: Praateek <praateekm@gmail.com>
* use pprint in example
Signed-off-by: Praateek <praateekm@gmail.com>
* update comment
Signed-off-by: Praateek <praateekm@gmail.com>
* all tests migrated + work
Signed-off-by: Praateek <praateekm@gmail.com>
* update defaults in example; comments in stage
Signed-off-by: Praateek <praateekm@gmail.com>
* add tests for url generation + PR review
Signed-off-by: Praateek <praateekm@gmail.com>
* update download for aws
Signed-off-by: Praateek <praateekm@gmail.com>
* rename aws to use_aws_to_download
Signed-off-by: Praateek <praateekm@gmail.com>
* update resources
Signed-off-by: Praateek <praateekm@gmail.com>
* change url generation to have ray-stage-spec
Signed-off-by: Praateek <praateekm@gmail.com>
* make download fault tolerant
Signed-off-by: Praateek <praateekm@gmail.com>
* refactor as per pr reviews; with tests
Signed-off-by: Praateek <praateekm@gmail.com>
* add readme
Signed-off-by: Praateek <praateekm@gmail.com>
* bug fix; update tests
Signed-off-by: Praateek <praateekm@gmail.com>
* update record limit to None
Signed-off-by: Praateek <praateekm@gmail.com>
* bug fixes
Signed-off-by: Praateek <praateekm@gmail.com>
* pr comments
Signed-off-by: Praateek <praateekm@gmail.com>
* add back test html extractor implementations
Signed-off-by: Praateek <praateekm@gmail.com>
* remove cc example
Signed-off-by: Praateek <praateekm@gmail.com>
* add column utils
Signed-off-by: Praateek <praateekm@gmail.com>
* add todos
Signed-off-by: Praateek <praateekm@gmail.com>
* Add Wikipedia download and extract stage
This commit introduces a comprehensive pipeline for downloading and processing Wikipedia dump files within the ray-curator framework. Key components include:
- **WikipediaUrlGenerator**: Generates URLs for Wikipedia dump files.
- **WikipediaDownloader**: Downloads .bz2 dump files using wget.
- **WikipediaIterator**: Parses Wikipedia XML dumps and extracts article content.
- **WikipediaExtractor**: Cleans Wikipedia markup and extracts meaningful text.
Additionally, an example script demonstrating the usage of the new stage is included, along with tests for each component to ensure functionality and reliability.
Documentation for the new stage is also provided to guide users in implementation and usage.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* merge from main
Signed-off-by: Praateek <praateekm@gmail.com>
* move deps to text
Signed-off-by: Praateek <praateekm@gmail.com>
* update dev
Signed-off-by: Praateek <praateekm@gmail.com>
* update pyproject and test.yml
Signed-off-by: Praateek <praateekm@gmail.com>
* remove cugraph extra pyproject
Signed-off-by: Praateek <praateekm@gmail.com>
* move text to optional deps
Signed-off-by: Praateek <praateekm@gmail.com>
* Refactor pyproject.toml: Remove unused dependencies and clean up dev section
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Remove unused Wikipedia example and related README documentation from the download text stages.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Add method to fetch JSON dump data for Wikipedia and refactor dump date retrieval logic
- Introduced `_get_data_for_dump` method to handle fetching and parsing JSON dump data.
- Refactored logic in `_get_wikipedia_urls` to iterate through available dumps and check their status.
- Improved error handling for cases where dump data cannot be loaded or is not finished.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Add README for custom download pipelines and remove Wikipedia stage documentation
- Introduced a new README.md file detailing the structure and implementation of custom download pipelines.
- Removed the outdated README.md for the Wikipedia download and extract stage to streamline documentation.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Add num_workers_per_node method to DocumentDownloader and WikipediaDownloader
- Implemented num_workers_per_node method in DocumentDownloader to define the number of workers per node for downloading tasks.
- Overrode num_workers_per_node in WikipediaDownloader to return a fixed value of 1.
- Updated xenna_stage_spec method in DocumentDownloadStage to include the number of workers per node.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update WikipediaDownloader to use 2 workers and change logging level in WikipediaIterator
- Modified num_workers_per_node in WikipediaDownloader to return 2, allowing for increased parallelism during downloads.
- Changed logging from info to debug level in WikipediaIterator for extracted articles to reduce log verbosity.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
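The per-node worker override described in the two commits above follows a common pattern: a base class exposes a hook, a subclass pins a value, and the stage spec picks it up. The sketch below is a simplified stand-in. The `Sketch*` class names, the `None` default, and the dictionary-shaped spec are assumptions and not the actual ray_curator or Xenna types; only the method name `num_workers_per_node` and the value 2 come from this log.

```python
from __future__ import annotations


class SketchDownloader:
    """Stand-in base class; None is assumed to mean 'let the executor decide'."""

    def num_workers_per_node(self) -> int | None:
        return None


class SketchWikipediaDownloader(SketchDownloader):
    """Pins the worker count per node, mirroring the override described above."""

    def num_workers_per_node(self) -> int | None:
        return 2


def stage_spec(downloader: SketchDownloader) -> dict:
    """Builds a hypothetical spec dict, adding the worker count only when set."""
    spec: dict = {"name": type(downloader).__name__}
    workers = downloader.num_workers_per_node()
    if workers is not None:
        spec["num_workers_per_node"] = workers
    return spec


base_spec = stage_spec(SketchDownloader())
wiki_spec = stage_spec(SketchWikipediaDownloader())
```

Keeping the limit on the downloader rather than on the stage means each data source can cap its own concurrency without the stage knowing about source-specific constraints.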
---------
Signed-off-by: Praateek <praateekm@gmail.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Praateek <praateekm@gmail.com>
* Fixing tests (#827)
* Refactor Wikipedia extraction and URL generation logic
- Removed redundant return statement in `WikipediaExtractor` class.
- Simplified status check for dump data in `WikipediaUrlGenerator` by directly accessing the dictionary keys.
- Updated logging level in tests to ensure accurate assertions on log calls.
- Enhanced test cases for URL generation to cover various dump statuses.
These changes improve code clarity and maintainability while ensuring robust error handling in the Wikipedia download and extraction process.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Add mwparserfromhell dependency to pyproject.toml
- Included `mwparserfromhell==0.6.5` in the text dependencies section of `pyproject.toml` to support parsing Wikipedia markup.
This addition enhances the functionality of the project by ensuring the necessary tools for processing Wikipedia data are available.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
---------
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Update ray version to 2.48 (#839)
* Re-enable CI/CD for Ray API branch (#840)
* CI/CD for Ray API branch
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add text dependencies
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* only run cpu tests
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* comment instead of delete
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
---------
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* Ray Video Pipeline : Video Reader (#775)
* Add video io reader
* Add test
* Add VideoReaderStage to video reading pipeline and update VideoDownloadStage to accept VideoTask. Enhance video reading capabilities with new tests for VideoReaderStage.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Update VideoDownloadStage to support verbose logging and modify video_read_example to include verbose argument.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Update outputs for VideoDownloadStage and VideoReaderStage to include additional metadata fields.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Update CI workflow to include video dependencies for testing
Signed-off-by: Ao Tang <aot@nvidia.com>
* Add tests for video tasks module
- Introduced a new test package for tasks with an initial test suite for the video tasks module, including tests for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Implemented various test cases to validate initialization, property calculations, metadata extraction, and size calculations.
This enhances the testing coverage for video-related functionalities in the ray-curator project.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Enhance video tasks module with additional test cases
- Expanded the test suite for the video tasks module by adding new test cases for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Improved coverage for various functionalities including initialization, property calculations, and metadata extraction.
This update strengthens the reliability of video-related features in the ray-curator project.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Update pyproject.toml to include a trailing comma for pynvml dependency
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor video processing stages to introduce a composite VideoReaderDownloadStage
- Replaced separate VideoReaderStage and VideoDownloadStage with a new VideoReaderDownloadStage that combines both functionalities.
- Updated the video_read_example to utilize the new composite stage.
- Adjusted inputs and outputs for VideoDownloadStage to reflect changes in the pipeline.
- Added tests for the new VideoReaderDownloadStage to ensure proper functionality and integration.
This refactor simplifies the video reading and downloading process within the ray-curator framework.
Signed-off-by: Ao Tang <aot@nvidia.com>
---------
Signed-off-by: Ao Tang <aot@nvidia.com>
* chore: Add new trustees and vetters to the copy-pr-bot configuration (#841) (#842)
* chore: Add new trustees and vetters to the copy-pr-bot configuration
* chore: Remove empty line in copy-pr-bot configuration
* chore: Remove ryantwolf from additional trustees and vetters in copy-pr-bot configuration
---------
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Ao Tang <mike.tang96@gmail.com>
* ci: Add community-bot (#846) (#849)
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
* Ray Video Reader Enhancement (#848)
* Refactor video reading stages: Rename VideoReaderStage to VideoListStage and update VideoReaderDownloadStage to use the new class. Adjust tests accordingly to reflect the changes in stage names and functionality.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Rename test_video_reader to test_video_list
Signed-off-by: Ao Tang <aot@nvidia.com>
* Update VideoListStage name and corresponding tests to reflect new naming convention
- Changed the internal name of VideoListStage from "video_reader" to "video_list".
- Updated assertions in the test for VideoListStage to match the new name.
- Adjusted configuration in the VideoReaderDownloadStage to use "video_list" instead of "video_reader".
This ensures consistency across the codebase following the recent refactor.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Update test assertions in VideoReaderDownloadStage to use "video_list" instead of "video_reader"
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor video processing stages: Replace VideoDownloadStage with VideoReaderStage in VideoReaderDownloadStage. Update related tests to reflect the new structure and ensure consistency across the codebase.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Enhance VideoListStage and VideoReaderStage documentation
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor video reading pipeline: Introduce VideoLoadingStage as a composite stage that combines VideoListStage and VideoReaderStage.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove SplitPipeTask from video module and update imports accordingly.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor video task imports: Update import statements in video_list, video_loading, video_reader, and related test files to use the new video module structure.
Signed-off-by: Ao Tang <aot@nvidia.com>
* ruff fix
Signed-off-by: Ao Tang <aot@nvidia.com>
* Implement FilePartitioningStage: Introduce a new stage for partitioning files into groups based on specified criteria, including a limit on the number of groups. Update VideoLoadingStage to utilize FilePartitioningStage instead of the deprecated VideoListStage. Refactor VideoReaderStage to accept FileGroupTask as input and adjust related tests to ensure functionality and correctness.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor video reading stages: Replace VideoLoadingStage with VideoReader as a composite stage that combines FilePartitioningStage and VideoReaderStage. Update related tests to ensure functionality and correctness. Remove deprecated VideoLoadingStage and its associated tests.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Update video_limit type in VideoReader to support None: Changed the type of video_limit from int to int | None to allow for more flexible configuration. This enhances the usability of the VideoReader class.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor file partitioning limit check
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove redundant tests from TestVideoReader: Deleted tests for video limit values, verbose flag, file extensions, and files per partition configuration to streamline the test suite and focus on essential functionality.
Signed-off-by: Ao Tang <aot@nvidia.com>
---------
Signed-off-by: Ao Tang <aot@nvidia.com>
* Enhance FilePartitioningStage to enforce task limit check earlier in the process. (#867)
Signed-off-by: Ao Tang <aot@nvidia.com>
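The file-partitioning behaviour described above (grouping files and stopping once a limit on the number of groups is reached) can be sketched as a small standalone function. The function name, parameter names, and the choice to check the limit before building each group are assumptions for illustration. This is not the FilePartitioningStage implementation.

```python
from __future__ import annotations


def partition_files(
    files: list[str],
    files_per_partition: int,
    limit: int | None = None,
) -> list[list[str]]:
    """Splits files into consecutive groups, emitting at most `limit` groups."""
    if files_per_partition <= 0:
        raise ValueError("files_per_partition must be positive")

    groups: list[list[str]] = []
    for start in range(0, len(files), files_per_partition):
        # Check the limit before building the next group so no extra work is done.
        if limit is not None and len(groups) >= limit:
            break
        groups.append(files[start : start + files_per_partition])
    return groups


videos = ["a.mp4", "b.mp4", "c.mp4", "d.mp4", "e.mp4"]
all_groups = partition_files(videos, files_per_partition=2)
limited_groups = partition_files(videos, files_per_partition=2, limit=2)
```

Checking the limit at the top of the loop is one way to read "enforce task limit check earlier": the function never builds a group that would then be thrown away.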
* Initialize and shutdown ray session in each executor (#844)
* Remove pynvml dependency from pyproject.toml (#872)
* docs: refactor all the things (#826) (#859)
* docs: refactor all the things
* remove auto api docs
* api docs to gitignore
* updated readme
* python linting fixes batch 1
* batch 2
* batch 3
* update
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: L.B. <llane@nvidia.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* ci(fix): Use GITHUB_TOKEN for community bot (#853) (#854)
* ci(fix): Use GITHUB_TOKEN for community bot
* f
---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
* update LLM PII redaction file - fix issue 828 (#868) (#871)
* update LLM PII redaction file - fix 828
* Fix ruff check LLM PII redaction file - fix 828
* update LLM PII redaction Enron-file - fix 828
* update LLM-PII redaction README - fix 828
* updated LLM PII redaction Enron-file - fix 828
* updated LLM PII redaction file - fix 828
* Update tutorials/curator-llm-pii/README.md
* removed typo from README file - fix 828
* updated LLM redaction tutorial - fix 828
* updated LLM redaction-Enron file - fix 828
* updated LLM redaction-Enron file - fix 828
* Update tutorials/curator-llm-pii/PII-LLM-modification-Enron.ipynb
* Update tutorials/curator-llm-pii/PII-LLM-modification-Enron.ipynb
---------
Signed-off-by: Adeola Adesoba <aadesoba@nvidia.com>
Signed-off-by: aadesoba-nv <aadesoba@nvidia.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: aadesoba-nv <aadesoba@nvidia.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* [Tutorials] Lazy import GPU modules in the Llama Nemotron tutorial (#831) (#875)
Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* docs: changelog update (#860) (#887)
* docs: changelog update
* formatting
* remove item
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: L.B. <llane@nvidia.com>
* linkfixes (#865) (#882)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: L.B. <llane@nvidia.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
* docs: Fixing version switcher issues (#885) (#886)
Signed-off-by: Andrew Schilling <aschilling@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* [Ray] Download and extract ArXiv (#805)
* remove dask arxiv
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* first pass for entire arxiv implementation
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* fix circular import
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* working module
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add downloader tests
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* remove unused noqa
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add test_iterator
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add extractor tests
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* fix failing download tests
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add test_stage
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* sort
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add url generator tests
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* remove noqa
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* remove nemo_curator/download, outdated scripts, outdated examples
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
---------
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* [Ray] Classifiers (#753)
* [Ray] Classifiers
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* fix ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add utils file
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* commit quality classifier benchmark helpers
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* use basictokenizer as cpu tokenizer, add crossfit config
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* some ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* merge upstream
Signed-off-by: Praateek <praateekm@gmail.com>
* use _name, remove gpu resources from labeler
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* consolidate praateek's work with distributeddataclassifier for quality classifier
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add content type, domain, multilingual domain, and filter_by support
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* support for fineweb, fineweb mixtral, and fineweb nemotron classifiers
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add prompt task complexity support
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* remove noqa
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* padding_size does not need to be exposed to user
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* max_seq_length does not need to be exposed to the user, set default micro_batch_sizes
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add max_chars, edit docstring
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* aegis functionality, start working on instruction data guard
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* nit fixes
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add working pytests for all classifiers
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* remove existing pytest file
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add more comments to tests
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* address review, add mem conversation, add README
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* move redundant test code
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* model_inference_batch_size and format_name_with_suffix
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add missing hf_token usage, remove test file, restructure dirs and files
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* delete old examples and scripts
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
---------
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Praateek <praateekm@gmail.com>
Co-authored-by: Praateek <praateekm@gmail.com>
* [RAY] Add ID Module (#876)
* Add id initial working IMP
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
* working add_id
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
* Add ID
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
* Update ray-curator/ray_curator/tasks/tasks.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Add prefix feature, overwrite, warnings
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
* rename id_prefix to user_prefix
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
* Add in test for tasks and fix task id
Signed-off-by: VibhuJawa <vibhujawa@gmail.com>
---------
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: VibhuJawa <vibhujawa@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Add video splitting pipeline with fixed stride extraction and transcoding Stage (#783)
* Add video splitting pipeline with fixed stride extraction and transcoding stages
- Introduced `video_split_clip_example.py` to demonstrate video splitting functionality.
- Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips.
- Implemented command-line arguments for configuring video processing parameters.
- Created utility functions for grouping iterables in `grouping.py`.
- Added unit tests for the new stages in `test_clip_transcoding_stage.py` and `test_fixed_stride_extractor_stage.py`.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor video splitting pipeline to remove debug mode and enhance stage integration
Signed-off-by: Ao Tang <aot@nvidia.com>
* Add video limit argument to video split clip example
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor video processing stages to enhance resource management and integrate new functionalities
- Replaced separate VideoReaderStage and VideoDownloadStage with a composite VideoReaderDownloadStage, streamlining the video reading and downloading process.
- Updated ClipTranscodingStage to improve GPU resource allocation and added detailed arguments for better configurability.
- Adjusted tests to reflect changes in resource management, ensuring accurate assertions on GPU usage.
These changes improve the clarity and efficiency of video processing within the ray-curator framework.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Add mock GPU classes and enhance ClipTranscodingStage tests
- Introduced MockGpuInfo and MockGpuResources classes to simulate GPU information and resources for testing.
- Updated test_resources_gpu_encoder and test_resources_hwaccel_enabled methods to utilize mocks, ensuring accurate resource assertions without dependency on actual GPU hardware.
- Enhanced test_different_encoder_configurations to validate resource requirements for various encoder configurations, including GPU settings.
These changes improve the robustness of the ClipTranscodingStage tests by isolating them from hardware dependencies, facilitating easier testing and validation.
Signed-off-by: [Your Name] <your.email@example.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove deprecated GPU resource tests from ClipTranscodingStage
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove unused test for processing in debug mode from ClipTranscodingStage tests
Signed-off-by: Ao Tang <aot@nvidia.com>
* Add unit tests for grouping utilities in the ray_curator.utils module
Signed-off-by: Ao Tang <aot@nvidia.com>
* Enhance video processing stages with ray stage specifications
- Added `ray_stage_spec` method to `ClipTranscodingStage`, `VideoDownloadStage`, and `VideoReaderStage` to define stage characteristics for Ray integration.
- Updated input and output methods in `ClipTranscodingStage` to include additional input parameters.
- Modified `SplitPipeTask` to return properties from `data` instead of `video`, ensuring consistency in task data handling.
- Added unit tests to verify the correctness of the new `ray_stage_spec` implementations.
These changes improve the integration of video processing stages with Ray's architecture and enhance test coverage for the new functionalities.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor video processing imports and update pipeline stages
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove unused `IS_ACTOR_STAGE` key from `ray_stage_spec` in `ClipTranscodingStage` and clean up commented-out code. This simplifies the stage specification and prepares for future enhancements.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove redundant check for video source bytes in ClipTranscodingStage. This simplifies the process method by eliminating unnecessary error handling when source bytes are not available.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor ClipTranscodingStage to use a class variable for the stage name and implement post-initialization resource setup. Added error handling for None source bytes in the process method. Updated tests to remove redundant checks and ensure proper functionality.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove unnecessary error handling for None source bytes in ClipTranscodingStage's process method.
Signed-off-by: Ao Tang <aot@nvidia.com>
* remove redundant test
Signed-off-by: Ao Tang <aot@nvidia.com>
* precommit fix
Signed-off-by: Ao Tang <aot@nvidia.com>
---------
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: [Your Name] <your.email@example.com>
* docs: ray curator api autodoc updates (#896)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Move all text stages to `stages/text/` (#891)
* first pass
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* ruff
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* fix tests
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* fix after merge
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
---------
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* Add Ray Actor Pool Executor (#893)
* Initial Minhash implementation on Ray (#837)
* Initial minhash logic without Stage API
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* update args and support passing in pre-batched files
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Remove old minhash impl
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add Class to do GPU IO for dedup
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add ID Generator class
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Move MinHashActor to a GPUMinHash class and create a GPUMinHash Processing stage
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Remove minhash method in favor of minhashProcessingStage
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add mkdir logic to the writer
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add file partitioning stage to __init__.py
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Update cuda12x extra to deduplication. Bump pynvml to avoid conflicts
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Update stage name
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add initial minhash tests
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add rmm pool arg to MinhashStage, default to false in the parent actor
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Move IO and ID generator logic to the Stage rather than the parent GPUMinHash class
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Update GPUMinHash Tests
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Standardize Id generator actor name
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add GPUMinHashStage tests
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Rename GPUMinHashStage to MinHashStage
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add marker for GPU tests
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* update cpu ci workflow to skip GPU tests
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Skip tests if imports fail
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* move cudf import checks before stage imports
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Use storage options from read_kwargs directly
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---------
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
* docs: curate text load data content updates for ray (#895)
* docs: load text data article updates
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* remove "ray-curator" for curator
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* simplify naming
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* imports
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* imports
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* imports
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* linkfix
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* read through
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* simplification
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* remove placeholder concept details
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* pipeline verbiage
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* initial feedback round
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* reduce admonition noise
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* minor updates
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* minor updates
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* feedback
Signed-off-by: Lawrence Lane <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Adding function decorator for very simple functions to be converted into stages (#835)
* Revert 'Add utility decorators for ProcessingStage creation' (empty cherry-pick)
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Add utility decorators for ProcessingStage creation
This commit introduces a new module containing the `processing_stage` decorator, which allows users to easily convert plain Python functions into `ProcessingStage` instances. The decorator supports configuration options such as stage name, resource allocation, and batch size. Additionally, unit tests have been added to validate the functionality of the decorator and ensure proper handling of task processing.
Signed-off-by: [Your Name] <your.email@example.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* test commit
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* add test_stage_registry, other nits
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* overwrite stage registry
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* ruff
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* propagate _metadata and _stage_perf
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* accept resources dict
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* reformat
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* add process_batch tests
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* ruff
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* remove todo
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* add pipeline example
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
---------
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: [Your Name] <your.email@example.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* Add Text Embedding Model (#899)
* Add Ray curator dockerfile and enable testing (#879)
* Add Ray curator dockerfile and enable testing
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Fix indentation issues
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update dockerfile and add cuda12x
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update coverage paths
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update gpu tests runner
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Add gpu testing scripts and update
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Cd into ray-curator for coverage
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Create dev layer and install dev packages
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update coverage paths
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Install opencv
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Address syntax error
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update cv2 ubuntu dependencies
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Fix typo
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Add cudf placeholder test
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Space after import
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Add gpu_only_import
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Remove import utils for now
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Fix spacing
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Skip gpu tests for cpu
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update unit test coverage path
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Skip gpu coverage report for now
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Use pixi
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Fix dockerfile syntax
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Try ffmpeg only
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Add extra index url for pixi
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Address typos
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Install git
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update entrypoint
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Fix typo
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Use env var for dev install
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Resolve syntax error
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Fix env var and verbose install
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update pixi entrypoint and pyproject install
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Trigger entrypoint before tests
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update test entrypoint
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Source entrypoint
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update list of dev install pixi
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Add back cuda12x and index-strategy
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Turn off verbose install
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Skip gpu coverage for now
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Support arm
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Set timeout for dockerbuild and update pyproject
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Remove retry github config
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
---------
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* ci: Install ray-curator module (#905)
* Add ray curator as pypi dependency
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Add package info and test import
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update pyproject.toml
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Copy src for pixi install
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Update test import
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* Revert temp test
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
---------
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
* [REVIEW] Add modifiers to ray curator (#898)
* Initial WIP modifier workflows
Signed-off-by: VibhuJawa <vibhujawa@gmail.com>
* Moved tests and also moved modifiers to text sub module
Signed-off-by: VibhuJawa <vibhujawa@gmail.com>
* Add tests for the meta class and modifier and improve docstring
Signed-off-by: VibhuJawa <vibhujawa@gmail.com>
* Update ray-curator/ray_curator/stages/text/modifiers/slicer.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Update ray-curator/ray_curator/stages/text/modifiers/line_remover.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
* Delete files from dask dir and remove optional download fields
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
* Add pytest as requested
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
---------
Signed-off-by: VibhuJawa <vibhujawa@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Allow users to fuse multiple `DocumentFilter` objects into a single `ScoreFilter` stage (#850)
* Allow users to fuse multiple `DocumentFilter` objects into a single `ScoreFilter` stage
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* remove old example and scripts file
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add suggestions
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add init
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* fix csv path
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* clearer error messages
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
---------
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* Fix exception when blocksize is set (#892) (#904)
If blocksize is set instead of files_per_partition, this line raised an exception.
Signed-off-by: Yurii Paniv <mr.robinhad@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Yurii Paniv <mr.robinhad@gmail.com>
* docs: curate text - process data - language dir (#900)
* docs: curate text - process data - language dir
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* remove extra content
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* another pass
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* remove pool
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* formatting
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* feedback
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* clarification and alternative as pipeline stage. removed extra section
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/curate-text/process-data/language-management/language.md
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: L.B. <llane@nvidia.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: L.B. <llane@nvidia.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* docs: add README for experimental scripts directory (#910)
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Add IdGenerator to JsonlReader + IdGenerator tests / write_to_disk / from_disk (#907)
* Initial buckets to edges stage (#909)
* Initial buckets to edges stage
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* re-add file utils from lsh pr
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Handle directory cleanup/creation logic in the stage
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add tests for buckets to edgelist
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Rename doc_id_column to doc_id_field, update storage_options to read/write_kwargs instead
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Fix indentation
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Fix kwargs args
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add copyright headers
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* remove previous curator impl
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---------
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* [SemDedup] Add KMeans (#912)
* S3 Client (#903)
* WIP
Signed-off-by: Ao Tang <aot@nvidia.com>
* WIP
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor S3 client configuration and enhance video reading logging
- Updated S3_PROFILE_PATH to use an environment variable for better flexibility in specifying the S3 credentials file location.
- Improved logging in VideoReaderStage to provide more informative messages about video byte downloads, including the size of the downloaded video.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Enhance VideoReader functionality with S3 support and improve validation checks
- Updated VideoReader to conditionally use ClientPartitioningStage for S3 paths and FilePartitioningStage for local paths, improving flexibility in handling video sources.
- Enhanced validation in VideoTask to check for the existence of input videos when provided as pathlib.Path, ensuring better error handling.
- Removed unused methods from S3Client to streamline the codebase.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove redundant exception raising in VideoReaderStage to improve error handling during video reading. This change prevents unnecessary propagation of exceptions while still logging errors effectively.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor ClientPartitioningStage and enhance S3 client configuration
- Rearranged import statements for better organization and readability in `client_partitioning.py` and `video_reader.py`.
- Updated `S3ClientConfig` and `BaseClientConfig` to use `@dataclass` for improved data handling.
- Added comprehensive unit tests for `ClientPartitioningStage`, covering initialization, setup, and processing methods with various scenarios.
- Improved error handling and validation in the `_read_list_json` function.
This refactor enhances the maintainability and test coverage of the codebase, ensuring better functionality and reliability in handling client partitioning tasks.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove SPDX license comments from S3 client, storage client, and storage utilities files to streamline code readability. This change simplifies the file headers while retaining essential module documentation.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Use Fsspec instead of boto3
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor file handling and enhance video reading capabilities
- Introduced a new `FSPath` class in `client_utils.py` for improved file operations with fsspec.
- Updated `ClientPartitioningStage` and `VideoReaderStage` to utilize the new `FSPath` class for better handling of file paths.
- Removed unused imports and streamlined code in `client_partitioning.py` and `video_reader.py`.
- Enhanced error handling in `VideoReaderStage` to support various input types for video sources.
This refactor improves the maintainability and flexibility of file handling in the video processing pipeline.
Signed-off-by: Ao Tang <aot@nvidia.com>
* move client_partitioning.py
Signed-off-by: Ao Tang <aot@nvidia.com>
* ruff check
Signed-off-by: Ao Tang <aot@nvidia.com>
* Fix broken tests
Signed-off-by: Ao Tang <aot@nvidia.com>
* Add `as_posix` method to `FSPath` class and implement comprehensive test suite
- Introduced `as_posix` method in the `FSPath` class to convert filesystem paths to POSIX format, accommodating various protocols.
- Created a new test suite for `FSPath` in `test_client_utils.py`, covering initialization, string representation, file operations, and edge cases.
- Enhanced tests for `get_bytes_cat_ranges` to handle different file sizes and error scenarios.
This update improves the functionality and test coverage of the `FSPath` class, ensuring robust file handling across different filesystems.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove logging of downloaded video size in VideoReaderStage to streamline error handling and reduce unnecessary output.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor video reading and splitting pipeline examples for improved readability
- Reformatted the `create_video_reading_pipeline` and `create_video_splitting_pipeline` functions to enhance code clarity by aligning parameters and removing unnecessary line breaks.
- Updated the `VideoReader` and `ClipTranscodingStage` instantiation to follow a consistent style.
- Made minor adjustments in the `ClientPartitioningStage` to ensure consistent formatting and improved readability.
These changes contribute to a cleaner and more maintainable codebase for video processing pipelines.
Signed-off-by: Ao Tang <aot@nvidia.com>
---------
Signed-off-by: Ao Tang <aot@nvidia.com>
* Add ClipWriterStage to video splitting pipeline Clean (#897)
* WIP
Signed-off-by: Ao Tang <aot@nvidia.com>
* WIP
Signed-off-by: Ao Tang <aot@nvidia.com>
* Update ClipWriterStage to clarify local storage usage
Signed-off-by: [Your Name] <your.email@example.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
* Enhance video clip processing with new GenericClipWriterStage and required output path argument
- Introduced a new GenericClipWriterStage for writing video clips and their metadata, consolidating the writing process and improving resource management.
- Updated the video_split_clip_example to require an output clip path, ensuring that users specify where to save the generated clips.
- The new stage supports parallel writing of clips and metadata, enhancing performance and flexibility in video processing workflows.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Enhance ClipWriterStage with additional metadata handling
- Improved `ClipWriterStage` to support writing additional metadata during video processing.
- Updated related utility functions to accommodate new metadata fields.
- Refined unit tests to cover the new functionality and ensure reliability.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Add ClipWriterStage to video splitting pipeline
- Introduced `ClipWriterStage` for writing clips and metadata during video processing.
- Updated `video_split_clip_example.py` to include the new stage, allowing for clip writing functionality.
- Enhanced command-line argument parsing for output clip path.
- Added utility functions for managing storage paths and writing data in various formats.
- Implemented unit tests for `ClipWriterStage` to ensure functionality and reliability.
Signed-off-by: Ao Tang <aot@nvidia.com>
* ruff fix
Signed-off-by: Ao Tang <aot@nvidia.com>
* ruff format
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor S3 client configuration and enhance video reading logging
- Updated S3_PROFILE_PATH to use an environment variable for better flexibility in specifying the S3 credentials file location.
- Improved logging in VideoReaderStage to provide more informative messages about video byte downloads, including the size of the downloaded video.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Enhance VideoReader functionality with S3 support and improve validation checks
- Updated VideoReader to conditionally use ClientPartitioningStage for S3 paths and FilePartitioningStage for local paths, improving flexibility in handling video sources.
- Enhanced validation in VideoTask to check for the existence of input videos when provided as pathlib.Path, ensuring better error handling.
- Removed unused methods from S3Client to streamline the codebase.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove redundant exception raising in VideoReaderStage to improve error handling during video reading. This change prevents unnecessary propagation of exceptions while still logging errors effectively.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor ClientPartitioningStage and enhance S3 client configuration
- Rearranged import statements for better organization and readability in `client_partitioning.py` and `video_reader.py`.
- Updated `S3ClientConfig` and `BaseClientConfig` to use `@dataclass` for improved data handling.
- Added comprehensive unit tests for `ClientPartitioningStage`, covering initialization, setup, and processing methods with various scenarios.
- Improved error handling and validation in the `_read_list_json` function.
This refactor enhances the maintainability and test coverage of the codebase, ensuring better functionality and reliability in handling client partitioning tasks.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove SPDX license comments from S3 client, storage client, and storage utilities files to streamline code readability. This change simplifies the file headers while retaining essential module documentation.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Use Fsspec instead of boto3
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor file handling and enhance video reading capabilities
- Introduced a new `FSPath` class in `client_utils.py` for improved file operations with fsspec.
- Updated `ClientPartitioningStage` and `VideoReaderStage` to utilize the new `FSPath` class for better handling of file paths.
- Removed unused imports and streamlined code in `client_partitioning.py` and `video_reader.py`.
- Enhanced error handling in `VideoReaderStage` to support various input types for video sources.
This refactor improves the maintainability and flexibility of file handling in the video processing pipeline.
Signed-off-by: Ao Tang <aot@nvidia.com>
* move client_partitioning.py
Signed-off-by: Ao Tang <aot@nvidia.com>
* ruff check
Signed-off-by: Ao Tang <aot@nvidia.com>
* Fix broken tests
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove unused `generic_clip_writer.py`, `storage_client.py`, and related utility files; refactor `writer_utils.py` to eliminate storage client dependencies and streamline file writing functions. Update tests to reflect these changes and ensure compatibility with the new structure.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove test file `test_client_utils.py` for the `FSPath` class, cleaning up unused test cases and ensuring the test suite reflects the current codebase structure.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Refactor ClipWriterStage to remove storage client dependencies and streamline file writing methods. Updated method signatures to eliminate storage client parameters, enhancing code clarity and maintainability.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove unused import of ClipWriterStage in video_split_clip_example.py to streamline the code and improve clarity.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Remove unused `input_s3_profile_name` attribute from `VideoReader` class to streamline the code and improve clarity.
Signed-off-by: Ao Tang <aot@nvidia.com>
* Remove unused `input_s3_profile_name` attribute from `VideoReader` class to streamline the code and improve clarity.
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
* Refactor video metadata writing in ClipWriterStage by removing an unnecessary blank line for improved code clarity. Update get_full_path function signature for consistency in type hinting. Enhance test case formatting in TestVideoReaderStage to improve readability and maintainability.
Signed-off-by: Ao Tang <aot@nvidia.com>
---------
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: [Your Name] <your.email@example.com>
Signed-off-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
* Add motion filtering stages to video splitting pipeline (#797)
* Add video io reader
* Add test
* Add video splitting pipeline with fixed stride extraction and transcoding stages
- Introduced `video_split_clip_example.py` to demonstrate video splitting functionality.
- Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips.
- Implemented command-line arguments for configuring …

1 parent f1067d4 commit 008060b
File tree
8 files changed: +828 -5 lines changed
- nemo_curator
- backends/experimental/ray_actor_pool
- stages/deduplication/exact
- tests/stages/deduplication/exact

Lines changed: 1 addition & 1 deletion
Lines changed: 6 additions & 4 deletions
Lines changed: 207 additions & 0 deletions