Version 0.24 (Alpha version)
Pre-release
Major Enhancements and New Features:
- Comprehensive Configuration Validation: Implemented a robust two-stage configuration validation system. Stage one uses the `schema` library for initial YAML structure validation (ensuring correct keys, data types, and relationships); stage two uses Pydantic for runtime validation and type coercion. This significantly improves the reliability and user-friendliness of the script by catching configuration errors early and providing informative error messages. The `config_validation.py` module encapsulates this logic, and Pydantic models are used extensively throughout to ensure type safety and data integrity.
- Advanced Keyword Extraction and Filtering:
  - Fuzzy Matching Integration: Integrated `rapidfuzz` for fuzzy matching of keywords against a whitelist (or the expanded set of skills). This tolerates variations in spelling and phrasing, improving recall. Configuration options include the matching algorithm (`ratio`, `partial_ratio`, `token_sort_ratio`, `token_set_ratio`, `WRatio`), the minimum similarity score, and the allowed POS tags.
  - Configurable Processing Order: Added the `fuzzy_before_semantic` option (`text_processing` section in `config.yaml`). This allows users to choose whether fuzzy matching is applied before or after semantic validation, providing greater flexibility in the keyword extraction pipeline.
  - Phrase-Level Synonym Handling: Introduced support for phrase-level synonyms (e.g., "product management" synonyms: ["product leadership", "product ownership"]). Synonyms can be loaded from a static JSON file (`phrase_synonyms_path`) or fetched from an API (`api_endpoint`, `api_key`). This significantly expands the ability to capture relevant skills expressed in different ways. The `SynonymEntry` Pydantic model enforces data integrity for static synonyms.
  - Improved Contextual Validation: Enhanced semantic validation using a configurable context window (`context_window_size`). The script now considers the surrounding sentences (respecting paragraph breaks) to determine whether a keyword is used in a relevant context, reducing false positives. The sentence splitting logic now handles bullet points and numbered lists more robustly.
  - POS Tag Filtering: Added more granular control over POS tag filtering with the `pos_filter` and `allowed_pos` options. These allow users to specify which parts of speech are considered for keyword extraction and fuzzy matching.
  - Trigram Optimization: Implemented a `TrigramOptimizer` to improve the efficiency of n-gram generation and candidate selection. It uses an LRU cache to store frequently used trigrams, reducing redundant computation.
  - Dynamic N-gram Generation: The `_generate_ngrams` function is now cached and handles edge cases more robustly (e.g., invalid input `n`).
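The whitelist fuzzy matching described above can be sketched as follows. This is a minimal stand-in using the stdlib `difflib` (so it runs without `rapidfuzz`); the real release uses `rapidfuzz` scorers such as `token_sort_ratio`, and names like `min_score` are illustrative, not the exact configuration keys:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """0-100 similarity score, roughly analogous to a fuzz ratio."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(candidate, whitelist, min_score=80.0):
    """Return the best whitelist term scoring at least min_score, else None."""
    best_term, best_score = None, 0.0
    for term in whitelist:
        score = similarity(candidate, term)
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= min_score else None

skills = ["product management", "machine learning", "data analysis"]
print(fuzzy_match("Product Managment", skills))  # tolerates the misspelling
print(fuzzy_match("basket weaving", skills))     # below min_score -> None
```

Raising `min_score` trades recall for precision, which is the same knob the release exposes in its fuzzy-matching configuration.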
- Adaptive Chunking and Parameter Tuning:
  - Smart Chunker: Introduced a `SmartChunker` class that uses a Q-learning algorithm to dynamically adjust the chunk size based on dataset statistics (average job description length, number of texts) and system resource usage (memory). This helps optimize performance and prevent out-of-memory errors.
  - Auto Tuner: Added an `AutoTuner` class that automatically adjusts parameters (e.g., `chunk_size`, `pos_processing`) based on metrics (recall, memory usage, processing time) and the trigram cache hit rate, allowing the script to adapt to different datasets and hardware configurations.
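The Q-learning idea behind the `SmartChunker` can be sketched minimally. The states, actions, and reward function below are illustrative assumptions for the sake of a runnable example, not the shipped implementation:

```python
import random

ACTIONS = [50, 100, 200, 400]   # candidate chunk sizes
q_table = {}                    # (state, action) -> learned value
ALPHA = 0.5                     # learning rate
EPSILON = 0.1                   # exploration rate

def choose_chunk_size(state, explore=True):
    """Pick a chunk size: mostly greedy on Q-values, occasionally exploratory."""
    if explore and random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

def update(state, action, reward):
    """One-step Q-value update toward the observed reward."""
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + ALPHA * (reward - old)

# Toy training loop: pretend chunks above 100 exceed the memory budget
# in the "low_memory" state, so they earn negative rewards.
random.seed(0)
for _ in range(200):
    action = choose_chunk_size("low_memory")
    reward = 1.0 if action <= 100 else -1.0
    update("low_memory", action, reward)

print(choose_chunk_size("low_memory", explore=False))
```

After training, the greedy policy settles on a chunk size that fits the (simulated) memory budget; the real class derives its state from dataset statistics and live memory usage.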
- Intermediate Result Saving and Checkpointing:
  - Configurable Intermediate Saving: Implemented robust intermediate saving of results (summary and detailed scores) to disk, allowing processing to resume after interruptions and preventing data loss on errors. The `intermediate_save` section in `config.yaml` controls the format (`feather`, `jsonl`, `json`), save interval, working directory, and cleanup behavior.
  - Data Integrity Checks: Added checksum verification (using `xxhash`) for intermediate files. A checksum manifest file (`checksums.jsonl`) is created and used to verify the integrity of the saved data.
  - Streaming Data Aggregation: Implemented a streaming approach for combining intermediate results, allowing the script to handle datasets too large to fit in memory. The `_aggregate_results` function handles both lists and generators of DataFrames.
  - Schema Validation and Appending: The code now validates the schema of intermediate files (especially for `feather` and `jsonl`) and can append new chunks to existing files.
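The checksum-manifest idea can be sketched roughly as follows. The release uses `xxhash`; `hashlib` is used as a fallback here so the sketch runs without the extra dependency, and the file layout and field names are illustrative, not the release's exact format:

```python
import hashlib
import json
import os
import tempfile

try:
    import xxhash  # preferred, as in the release
    def digest(data):
        return xxhash.xxh64(data).hexdigest()
except ImportError:
    def digest(data):  # stdlib fallback so the sketch runs anywhere
        return hashlib.sha256(data).hexdigest()

def save_chunk(path, data, manifest):
    """Write a chunk and append its checksum record to the manifest."""
    with open(path, "wb") as f:
        f.write(data)
    record = {"file": os.path.basename(path), "checksum": digest(data)}
    with open(manifest, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def verify_chunk(path, manifest):
    """Recompute the chunk's checksum and compare against the manifest."""
    records = {}
    with open(manifest, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            records[rec["file"]] = rec["checksum"]
    with open(path, "rb") as f:
        return digest(f.read()) == records.get(os.path.basename(path))

with tempfile.TemporaryDirectory() as tmp:
    chunk = os.path.join(tmp, "chunk_0000.jsonl")
    manifest = os.path.join(tmp, "checksums.jsonl")
    save_chunk(chunk, b'{"keyword": "python", "score": 0.91}\n', manifest)
    print(verify_chunk(chunk, manifest))  # True for an intact file
```

Appending one JSON record per line keeps the manifest itself streamable, which matches the `checksums.jsonl` naming.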
- Enhanced Error Handling and Logging:
  - Custom Exceptions: Defined custom exceptions (`ConfigError`, `InputValidationError`, `CriticalFailureError`, `AggregationError`, `DataIntegrityError`) for more specific error handling and reporting.
  - Comprehensive Error Handling: Added extensive error handling throughout the script, including checks for invalid input, file I/O errors, API errors, memory errors, and data integrity issues.
  - Improved Logging: Enhanced logging to provide more informative messages about the script's progress, warnings, and errors, including configuration parameters, dataset statistics, processing times, memory usage, and cache hit rates.
  - Strict Mode: Added a `strict_mode` option (in `config.yaml`) that, when enabled, causes the script to raise exceptions on certain errors (e.g., invalid input, empty descriptions) instead of logging warnings and continuing.
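The `strict_mode` toggle follows a common raise-or-warn pattern, sketched below. Only the exception name comes from the list above; the validation function and its wiring are assumptions for illustration:

```python
import logging

logger = logging.getLogger("keywords4cv")

class InputValidationError(Exception):
    """Raised for invalid input when strict_mode is enabled."""

def validate_description(text, strict_mode):
    """Return True if usable; raise (strict) or warn-and-skip on empty input."""
    if text and text.strip():
        return True
    if strict_mode:
        raise InputValidationError("empty job description")
    logger.warning("Skipping empty job description")
    return False

print(validate_description("Senior Python developer, 5+ years...", strict_mode=True))  # True
```

In non-strict mode the same failure is downgraded to a warning so a single bad record does not abort a long batch run.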
- Code Refactoring and Optimization:
  - Modular Design: Refactored the code into smaller, more manageable classes and functions (e.g., `ParallelProcessor`, `TrigramOptimizer`, `SmartChunker`, `AutoTuner`).
  - Type Hinting: Added type hints throughout the code to improve readability and maintainability.
  - Memory Management: Implemented various memory management techniques, including explicit garbage collection (`gc.collect()`), releasing spaCy `Doc` objects after processing, and using generators for streaming data processing.
  - Caching: Used `lru_cache` and `LRUCache` to cache frequently used computations (e.g., term vectorization, n-gram generation, fuzzy matching).
  - Parallel Processing: Leveraged `concurrent.futures.ProcessPoolExecutor` for parallel processing of job descriptions, significantly improving performance on multi-core systems.
  - Dynamic Batch Size: The batch size for spaCy processing is now dynamically calculated from available memory and the configured `memory_scaling_factor`.
  - GPU Memory Check: Added an optional check for available GPU memory (when `use_gpu` and `check_gpu_memory` are enabled). If GPU memory is low, the script can either disable GPU usage or reduce the number of workers.
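The dynamic batch-size calculation might look roughly like this. The formula, the per-document cost constant, and the clamping bounds are assumptions for illustration; only `memory_scaling_factor` is named in the notes:

```python
def dynamic_batch_size(
    available_memory_bytes,
    memory_scaling_factor=0.3,
    bytes_per_doc=2_000_000,   # rough per-Doc footprint, assumed
    min_batch=8,
    max_batch=256,
):
    """Scale free memory by the configured factor, divide by per-doc cost,
    and clamp to sane bounds for nlp.pipe."""
    budget = available_memory_bytes * memory_scaling_factor
    batch = int(budget // bytes_per_doc)
    return max(min_batch, min(batch, max_batch))

print(dynamic_batch_size(4 * 1024**3))   # plenty of memory: hits the ceiling
print(dynamic_batch_size(32 * 1024**2))  # tight memory: clamps to the floor
```

Clamping both ends keeps a huge machine from producing wastefully large batches and a constrained one from degenerating to single-document batches.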
- Refactored TF-IDF Matrix Creation: The TF-IDF matrix creation is now more efficient and robust. The vectorizer is fitted only once (with optional sampling for large datasets), and keyword sets are pre-validated.
- Consistent Hashing: The caching system now uses a `cache_salt` to ensure that cache keys are unique across different runs and configurations. The salt can be set via the `K4CV_CACHE_SALT` environment variable or in the `config.yaml` file.
- Improved Keyword Categorization: Keyword categorization logic is enhanced, and a configurable `default_category` is used for terms that cannot be categorized. The `categorization_cache_size` option controls the cache size for term categorization.
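The salted cache-key idea can be sketched as follows. `sha256` stands in for the project's hashing, and the key layout is illustrative; only `cache_salt` and `K4CV_CACHE_SALT` come from the notes above:

```python
import hashlib
import json
import os

def cache_key(term, config, default_salt="dev-salt"):
    """Derive a cache key from the term, the relevant config, and a salt,
    so keys cannot collide across runs or differing configurations."""
    salt = os.environ.get("K4CV_CACHE_SALT", default_salt)
    payload = json.dumps({"term": term, "config": config}, sort_keys=True)
    return hashlib.sha256((salt + payload).encode("utf-8")).hexdigest()

cfg_a = {"fuzzy_matching": True}
cfg_b = {"fuzzy_matching": False}
print(cache_key("python", cfg_a) != cache_key("python", cfg_b))  # True
```

Serializing the config with `sort_keys=True` makes the key deterministic regardless of dictionary ordering, which is what makes cached results safe to reuse only under the same configuration.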
Bug Fixes:
- Fixed several issues related to data loading, validation, and processing.
- Improved error handling and logging in various parts of the script.
- Addressed potential memory leaks and improved overall memory management.
- Corrected issues with chunk size calculation and Q-table updates.
- Fixed inconsistencies in the application of the whitelist boost.
- Resolved issues with intermediate file saving and loading.
- Addressed errors during vectorization and score calculations.
Known Issues:
- NOTE: As of this release, the script does not yet work end to end. This release introduces critical architectural changes.
Dependencies:
- nltk
- pandas
- spacy (>=3.0.0 recommended)
- scikit-learn
- pyyaml
- psutil
- requests
- rapidfuzz
- srsly
- xxhash (replaces the standard-library hashlib for checksum calculation)
- cachetools
- pydantic (>=2.0 recommended, but v1 is supported)
- schema
- pyarrow
- numpy
- itertools (standard library)
Future Improvements:
- Explore the use of Dask for distributed processing.
- Continue to refine the reinforcement learning algorithms for adaptive parameter tuning.
- Add more comprehensive unit tests.
- Improve documentation and user guide.
- Consider adding support for other input formats (e.g., CSV, text files).
- Explore the use of more advanced NLP techniques (e.g., transformer-based models).
How to Upgrade:
- Backup your existing `config.yaml` and `synonyms.json` files.
- Replace the old script files (`keywords4cv_*.py.txt`, `exceptions.py.txt`, `config_validation.py.txt`) with the new versions.
- Carefully review the updated `config.yaml.truncated.txt` file. There are many new configuration options and changes to existing ones, and you will need to merge your existing configuration with the new template. Pay close attention to the following sections: `validation`; `text_processing` (especially `phrase_synonym_source`, `phrase_synonyms_path`, `api_endpoint`, `api_key`, `fuzzy_before_semantic`); `whitelist` (especially `fuzzy_matching`); `hardware_limits`; `optimization`; `caching` (especially `cache_salt`); `intermediate_save`; `advanced`.
- If you are using a static synonym file, update its format to match the `SynonymEntry` model (see documentation).
- Install any new dependencies: `pip install rapidfuzz srsly xxhash cachetools pydantic schema pyarrow`.
Breaking Changes:
- The configuration file format has changed significantly. You will need to update your `config.yaml` file.
- The `SynonymEntry` format in `synonyms.json` is now enforced using Pydantic.
- The `hashlib` library has been replaced with `xxhash` for checksum calculation.
- The intermediate file format and naming conventions have changed.
- The `max_workers` parameter is now also used within the `nlp.pipe` function.
- The `analyzer._load_all_intermediate` function now returns a generator.
- The `_create_tfidf_matrix` function's parameters have changed.
- The `_calculate_scores` function now yields results instead of returning a list.
What's Changed
- Update LICENSE by @DavidOsipov in #21
- Delete test_keywords4cv.py by @DavidOsipov in #22
- Delete ats_optimizer.log by @DavidOsipov in #23
- Add files via upload by @DavidOsipov in #24
Full Changelog: 0.09...0.24