Career Launch is an NLP-based resume-job matching system that quantifies candidate-role alignment using vector-space similarity and structured skill extraction. The system focuses on explainability, showing why a resume matches (or doesn't) and what skills to improve.
Career Launch converts resumes and job descriptions into high-dimensional vector representations using TF-IDF and computes semantic similarity via cosine distance. It provides interpretable match scores and actionable skill-gap insights rather than opaque pass/fail decisions.
- Video Demo: https://youtu.be/jdLr-34M-_A
- Live Application: https://career-launch.streamlit.app/
- Portfolio: https://harshlad.vercel.app/
Most hiring pipelines rely on keyword-based filtering or opaque scoring systems, leading to false rejections and poor feedback loops. Candidates rarely understand why they were rejected or which skills are missing.
Career Launch addresses this by treating resume screening as a measurable NLP similarity problem and exposing intermediate signals instead of hiding them behind black-box decisions.
- Resumes and job descriptions uploaded as PDFs or plain text
- Robust parsing using
pdfplumberwithPyPDF2as fallback - Multi-engine extraction handles diverse PDF formats
- Text normalization (lowercasing, punctuation handling)
- URL and email removal while preserving content
- Stop-word removal using NLTK's English corpus
- N-gram extraction (unigrams to trigrams) for phrase detection
Hybrid Approach:
- Taxonomy-based matching: 100+ curated skills across 9 domains
- TF-IDF scoring: Statistical importance of terms
- Confidence weighting: Combines both methods with 0.7/0.3 split
Skill Categories:
- Programming Languages
- Web Frontend/Backend
- Databases
- Cloud & DevOps
- ML & AI
- Data Science
- Tools & Platforms
- Soft Skills
- TF-IDF vectorization applied to full text (max 500 features)
- N-gram range: (1, 3) to capture multi-word technical terms
- Separate skill vectors maintained for granular analysis
Weighted Scoring Algorithm:
Overall Score = (Skill Match × 0.6) + (Text Similarity × 0.4)
Where:
- Skill Match = |Resume ∩ Job| / |Job|
- Text Similarity = cosine_similarity(TF-IDF_resume, TF-IDF_job)
Rationale:
- Skills weighted higher (60%) as they're explicit requirements
- Content similarity (40%) captures contextual alignment
- Produces interpretable 0-100% match score
Skills classified into three sets:
- Matched: Resume ∩ Job Description
- Missing: Job Description - Resume
- Extra: Resume - Job Description
Missing skills prioritized by:
- Category importance (Programming > Soft Skills)
- Existing skill overlap in same category
- Role criticality (High/Medium/Low)
- Vector-space model ensures reproducible results
- No black-box neural networks or opaque classifiers
- Full transparency in score computation
- Explicit identification of matched vs missing skills
- Skills grouped by technical domain
- Contextual recommendations based on existing strengths
- Primary:
pdfplumber(handles complex layouts, tables) - Fallback:
PyPDF2(broader format compatibility) - Validation layer ensures minimum text quality
- Maps variations to canonical forms (e.g., "js" → "javascript")
- Handles common abbreviations (ML, NLP, K8s)
- Reduces false negatives from terminology differences
Career Launch does not claim "accuracy" in the traditional ML sense, as resume matching is inherently subjective and context-dependent.
System Evaluation Criteria:
- Stability: Consistent scores across minor text variations
- Alignment: Correlation with human judgment on relevance
- Interpretability: Users can understand and act on outputs
- Robustness: Handles noisy, real-world PDF data
Design Philosophy:
- Prioritize explainability and trust over aggressive automation
- Expose intermediate signals rather than final binary decisions
- Enable candidates to improve rather than just filter them out
Known Limitations:
- TF-IDF captures term frequency but not deep semantics
- No contextual understanding (e.g., "Python" the language vs snake)
- Performance depends on PDF text extraction quality
- Skill taxonomy requires manual curation and updates
- Python 3.9+
- Scikit-learn (v1.3.2+) - TF-IDF vectorization, cosine similarity
- NumPy (v1.26.2+) - Numerical computations
- Pandas (v2.1.3+) - Data manipulation
- pdfplumber (v0.10.3+) - Primary PDF extraction
- PyPDF2 (v3.0.1+) - Fallback extraction
- Pydantic (v2.1.0+) - Settings management with type validation
- Streamlit (v1.28.2+) - Lightweight, stateless UI for rapid iteration
- Plotly (v5.18.0+) - Interactive visualizations
Career AI/
├── app/
│ └── streamlit_app.py # UI layer (Streamlit-based interface)
│
├── src/
│ ├── parsers/
│ │ └── resume_parser.py # Multi-engine PDF extraction
│ │
│ ├── ml/
│ │ ├── text_processor.py # Text cleaning & normalization
│ │ ├── skill_extractor.py # Hybrid skill extraction
│ │ └── matcher.py # Similarity computation
│ │
│ ├── analyzers/
│ │ └── gap_analyzer.py # Skill gap classification
│ │
│ └── core/
│ ├── config.py # ML hyperparameters
│ └── skill_taxonomy.py # Curated skill database
│
├── requirements.txt
└── README.md
- Python 3.9 or higher
- pip package manager
git clone https://github.com/ladHarsh/Career-Launch.git
cd Career-Launch
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
pip install -r requirements.txtstreamlit run app/streamlit_app.pyApplication will be available at http://localhost:8501
# TF-IDF Configuration
TFIDF_MAX_FEATURES = 500 # Maximum vocabulary size
TFIDF_NGRAM_RANGE = (1, 3) # Unigrams to trigrams
MIN_SKILL_CONFIDENCE = 0.3 # Minimum TF-IDF score threshold
# Skill Extraction
SKILL_MATCH_THRESHOLD = 0.85 # Fuzzy matching threshold
MAX_SKILLS_EXTRACT = 50 # Maximum skills per document
# Scoring Weights
SKILL_WEIGHT = 0.6 # Weight for skill-based matching
CONTENT_WEIGHT = 0.4 # Weight for text similarity9 predefined categories with 100+ skills:
- Programming Languages (14 skills)
- Web Frontend (13 skills)
- Web Backend (10 skills)
- Databases (10 skills)
- Cloud & DevOps (11 skills)
- ML & AI (10 skills)
- Data Science (10 skills)
- Tools & Platforms (12 skills)
- Soft Skills (8 skills)
- Translating unstructured documents into comparable vector spaces
- Balancing vocabulary size vs computational efficiency
- Handling sparse high-dimensional representations
- Designing systems that favor interpretability over opaque scoring
- Exposing intermediate signals for user trust
- Communicating ML outputs in actionable ways
- Handling noisy real-world PDF data with multiple extraction strategies
- Validating text quality before processing
- Graceful degradation when extraction fails
- Combining rule-based (taxonomy) and statistical (TF-IDF) approaches
- Weighting different signal sources appropriately
- Normalizing heterogeneous skill representations
- Curating and maintaining skill taxonomies
- Mapping synonyms and abbreviations
- Categorizing skills by technical domain
- Semantic Understanding: TF-IDF does not capture deep semantic meaning like embeddings
- Context Blindness: Cannot distinguish "Python" (language) from "Python" (snake)
- Extraction Dependency: Performance depends on PDF text extraction quality
- Static Taxonomy: Skill database requires manual updates for new technologies
- No Temporal Awareness: Doesn't account for skill recency or proficiency levels
- Contextual language models (BERT, GPT)
- Resume section parsing (Experience, Education, Skills)
- Temporal skill analysis (years of experience)
- Multi-document comparison (ranking multiple candidates)
- Active learning from user feedback
- Transition to embedding-based similarity (sentence transformers)
- Hybrid scoring combining keyword + embedding signals
- Resume section detection using layout analysis
- Dataset-driven evaluation with human-labeled relevance pairs
- Active learning pipeline to refine skill taxonomy
- Support for DOCX and other document formats
- Fine-tuned transformer models for domain-specific matching
- Multi-modal analysis (skills + experience + education)
- Temporal skill trend analysis
- Integration with job board APIs
- Understand resume-job fit before applying
- Identify specific skills to learn
- Prioritize upskilling based on role requirements
- Pre-screen candidates with transparent scoring
- Provide constructive feedback to applicants
- Reduce bias from keyword-only filtering
- Guide students on skill development paths
- Quantify readiness for specific roles
- Track skill gap closure over time
MIT License
Copyright (c) 2024 Harsh Lad
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Harsh Lad
Applied AI Engineer
- Email: harshlad.dev@gmail.com
- GitHub: https://github.com/ladHarsh
- Portfolio: https://harshlad.vercel.app/
Built with a focus on explainability, transparency, and practical ML design.