-
Notifications
You must be signed in to change notification settings - Fork 168
Add WebFsIndexHelper for web and filesystem crawling #2959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add WebFsIndexHelper for web and filesystem crawling #2959
Conversation
Major improvements: - Extracted common filter processing logic into reusable methods (processFilters, addFailureExclusionFilters) - Extracted crawler setup logic into dedicated methods (setupWebCrawler, setupFileCrawler) - Extracted crawler execution logic into executeCrawlers method - Reduced code duplication by ~250 lines Code quality improvements: - Fixed incorrect break statements (should be continue) - Removed unused code (ComponentUtil.getFessConfig() call) - Simplified AtomicBoolean usage in filter processing - Improved error messages with config names - Fixed issue where session IDs were added even for invalid configs The doCrawl method is now much more readable and maintainable, going from ~380 lines to ~65 lines while preserving all functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the WebFsIndexHelper class by extracting the lengthy doCrawl method into smaller, focused methods. The main changes improve code maintainability by eliminating ~250 lines of duplicated logic between web and file crawling, while also fixing several bugs in control flow and error handling.
Key changes:
- Extracted filter processing logic into reusable
processFiltersandaddFailureExclusionFiltersmethods - Created dedicated setup methods (
setupWebCrawler,setupFileCrawler) and a commonconfigureCrawlermethod - Fixed control flow bugs (incorrect
breakstatements changed tocontinuevia early returns) - Improved error messages to include config names for better debugging
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Added 28 new test methods covering the refactored functionality: New method test coverage: - processFilters: 10 tests covering null/empty strings, include/exclude filters, comments, #DISABLE_URL_ENCODE directive, special characters - addFailureExclusionFilters: 4 tests covering null/empty lists, URLs, duplicates, and blank URLs - configureCrawler: 3 tests covering parameter settings, unlimited depth, default values, and custom priority - handleCleanupConfig: 4 tests covering cleanup modes (all, url filters, none, empty map) - setupWebCrawler/setupFileCrawler: 2 tests for blank URL/path validation Test scenarios include: - Edge cases (null, empty, whitespace-only inputs) - Valid inputs with various configurations - URL encoding behavior with special directive - Comment line filtering - Duplicate handling - Special character handling - Multiple filter types (URL and Path) Total test methods: 58 (increased from 30) Total test file size: 804 lines (increased from 437)
Added mockito-core 4.11.0 as a test dependency to support the comprehensive unit tests added for WebFsIndexHelper. The tests use Mockito for mocking crawler components, system helpers, and other dependencies to enable isolated unit testing of the refactored methods.
Replaced Mockito mocks with anonymous inner classes for CrawlingConfigHelper registration, following the pattern used in DataIndexHelperTest. The Lasta Di container requires components to be registered as actual class instances, not Mockito mocks. This fixes the ComponentNotFound errors in: - test_addFailureExclusionFilters_withNullList - test_addFailureExclusionFilters_withEmptyList - test_addFailureExclusionFilters_withUrls - test_addFailureExclusionFilters_withDuplicates - test_addFailureExclusionFilters_withBlankUrls
Created a test stub implementation of CrawlingConfigHelper that can be properly registered in the Lasta Di container and configured for different test scenarios. Changes: - Created TestCrawlingConfigHelper.java with configurable behavior for testing - Registered TestCrawlingConfigHelper in test_app.xml as a singleton component - Updated test methods to get the component from DI container and configure it per test instead of trying to register new instances This fixes the ComponentNotFound errors in: - test_addFailureExclusionFilters_withNullList - test_addFailureExclusionFilters_withEmptyList - test_addFailureExclusionFilters_withUrls - test_addFailureExclusionFilters_withDuplicates - test_addFailureExclusionFilters_withBlankUrls The DI container now properly resolves the crawlingConfigHelper component when ComponentUtil.getCrawlingConfigHelper() is called.
Major improvements:
Code quality improvements:
The doCrawl method is now much more readable and maintainable, going from ~380 lines to ~65 lines while preserving all functionality.