A comprehensive web application for automating the collection and analysis of ASB alumni career data from public LinkedIn profiles, featuring AI-powered summarization and an intuitive dashboard interface.
This system helps the ASB Alumni Office maintain up-to-date alumni records by:
- Automatically scraping public LinkedIn profile data
- Using AI to generate career summaries
- Providing a searchable, filterable dashboard
- Exporting structured data for further analysis
- Drag & drop CSV file upload
- Automatic validation and parsing
- Support for alumni name and LinkedIn URL columns
- Respects rate limits and LinkedIn's terms
- Only accesses public profile data
- No authentication or login required
- Built-in delays to prevent IP blocking
- Gemini AI Integration: Real AI summarization when an API key is configured
- Automatic summarization of About sections
- Structured extraction of career history
- Fallback to mock AI service for demo purposes
- Real-time search and filtering
- Sortable columns
- Expandable detail views
- Company and location filters
- Export to CSV/JSON formats
- Public data only
- No stored credentials
- Rate-limited requests
- Transparent error handling
- Node.js 18+
- npm or yarn
- Gemini API Key (optional, for real AI processing)
- Clone the repository
    git clone <repository-url>
    cd asb-alumni-scraper
- Install dependencies
    npm install
- Configure Environment Variables
  Copy the example environment file:
    cp .env.example .env
  Edit .env and add your Gemini API key:
    GEMINI_API_KEY=your_actual_gemini_api_key_here
- Start the development servers
  Terminal 1 (Frontend):
    npm run dev
  Terminal 2 (Backend):
    npm run server
- Access the application
  - Frontend: http://localhost:5173
  - Backend API: http://localhost:3001
- Visit Google AI Studio
  - Go to https://makersuite.google.com/app/apikey
  - Sign in with your Google account
- Create API Key
  - Click "Create API Key"
  - Choose "Create API key in new project" or select an existing project
  - Copy the generated API key
- Add to Environment
  - Open your .env file
  - Replace your_gemini_api_key_here with your actual API key
  - Restart the server to apply changes
- Verify Integration
  - Check server logs for the "Gemini AI (Real)" message
  - Visit http://localhost:3001/api/health to confirm configuration
├── src/                      # React frontend
│   ├── components/           # UI components
│   │   ├── UploadSection.tsx
│   │   ├── Dashboard.tsx
│   │   └── ScrapingProgress.tsx
│   ├── types/                # TypeScript definitions
│   └── App.tsx               # Main application
├── server/                   # Node.js backend
│   ├── index.js              # Express server
│   ├── scraper.js            # Puppeteer scraping logic
│   └── ai.js                 # AI summarization service
├── .env                      # Environment variables
└── docs/                     # Project documentation
Your CSV file should contain these columns:
name,linkedin_url
Jane Doe,https://www.linkedin.com/in/janedoe/
John Smith,https://www.linkedin.com/in/johnsmith/
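The upload validation can be sketched roughly as follows. This is a minimal illustration, not the project's actual parser: `parseAlumniCsv`, `AlumniRow`, and the URL pattern are hypothetical, and the sketch assumes a plain comma-separated file with no quoted fields.

```typescript
// Hypothetical row shape; the two fields mirror the required CSV columns.
interface AlumniRow {
  name: string;
  linkedinUrl: string;
}

interface ParseResult {
  rows: AlumniRow[];
  errors: string[];
}

// Assumption: a public profile URL looks like https://www.linkedin.com/in/<handle>/
const PROFILE_URL = /^https:\/\/(www\.)?linkedin\.com\/in\/[^\/\s]+\/?$/;

// Minimal parser for illustration: it does not handle quoted fields or embedded commas.
function parseAlumniCsv(text: string): ParseResult {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  const rows: AlumniRow[] = [];
  const errors: string[] = [];

  if (lines.length === 0) {
    return { rows, errors: ["File is empty"] };
  }

  // Locate the required columns by header name, so column order does not matter.
  const header = (lines[0] ?? "").split(",").map((h) => h.trim().toLowerCase());
  const nameIdx = header.indexOf("name");
  const urlIdx = header.indexOf("linkedin_url");
  if (nameIdx === -1 || urlIdx === -1) {
    return { rows, errors: ["Header must include name and linkedin_url"] };
  }

  lines.slice(1).forEach((line, i) => {
    const cells = line.split(",").map((c) => c.trim());
    const name = cells[nameIdx] ?? "";
    const linkedinUrl = cells[urlIdx] ?? "";
    const lineNo = i + 2; // position among non-blank lines; the header is line 1
    if (name === "") {
      errors.push(`Line ${lineNo}: missing name`);
    } else if (!PROFILE_URL.test(linkedinUrl)) {
      errors.push(`Line ${lineNo}: invalid LinkedIn URL`);
    } else {
      rows.push({ name, linkedinUrl });
    }
  });

  return { rows, errors };
}
```

Collecting errors per line instead of rejecting the whole file lets valid rows proceed while the invalid ones are reported back to the user.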
# Required for real AI summarization (the mock AI service is used when unset)
GEMINI_API_KEY=your_gemini_api_key_here
# Optional configurations
PORT=3001
SCRAPING_DELAY_MS=3000
MAX_CONCURRENT_BROWSERS=1
BROWSER_TIMEOUT_MS=30000
CORS_ORIGIN=http://localhost:5173
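One way these variables might be read on the server is sketched below. It assumes the values listed above are also the defaults, and that the unedited placeholder counts as "no key"; neither is confirmed by the project. `loadConfig` and `ScraperConfig` are hypothetical names.

```typescript
// Hypothetical configuration shape, one field per variable listed above.
interface ScraperConfig {
  geminiApiKey: string | null;
  port: number;
  scrapingDelayMs: number;
  maxConcurrentBrowsers: number;
  browserTimeoutMs: number;
  corsOrigin: string;
}

type Env = Record<string, string | undefined>;

// Parse a positive integer; a missing, non-numeric, or non-positive value falls back
// to the default. parseInt accepts leading digits, so "5000ms" is read as 5000.
function intOrDefault(raw: string | undefined, fallback: number): number {
  const parsed = parseInt(raw ?? "", 10);
  return isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function loadConfig(env: Env): ScraperConfig {
  const key = (env["GEMINI_API_KEY"] ?? "").trim();
  return {
    // Assumption: an empty value or the unedited placeholder means "not configured".
    geminiApiKey: key === "" || key === "your_gemini_api_key_here" ? null : key,
    port: intOrDefault(env["PORT"], 3001),
    scrapingDelayMs: intOrDefault(env["SCRAPING_DELAY_MS"], 3000),
    maxConcurrentBrowsers: intOrDefault(env["MAX_CONCURRENT_BROWSERS"], 1),
    browserTimeoutMs: intOrDefault(env["BROWSER_TIMEOUT_MS"], 30000),
    corsOrigin: env["CORS_ORIGIN"] ?? "http://localhost:5173",
  };
}
```

On the server this would typically be called once at startup as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test.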
The system automatically detects if a Gemini API key is configured:
With API Key:
- Real AI-powered summarization
- Advanced text analysis
- Structured role extraction
Without API Key:
- Mock AI service for demo
- Basic text processing
- Still functional for testing
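The fallback behaviour described above could be structured along these lines. This is a sketch only: the `Summarizer` interface, `mockSummarizer`, and `selectSummarizer` are hypothetical, and the real Gemini client is passed in as a factory rather than shown, since its exact calls are not part of this document.

```typescript
// Hypothetical common interface for the real and mock AI services.
interface Summarizer {
  readonly label: string;
  summarize(about: string): Promise<string>;
}

// Mock service (assumption): returns the first sentence of the About text,
// or the whole trimmed text when no sentence terminator is found.
const mockSummarizer: Summarizer = {
  label: "mock",
  async summarize(about: string): Promise<string> {
    const trimmed = about.trim();
    const match = trimmed.match(/^[^.!?]*[.!?]/);
    return match?.[0] ?? trimmed;
  },
};

// Choose the real service when a key is present, otherwise fall back to the mock.
// `createReal` stands in for whatever builds the Gemini-backed summarizer.
function selectSummarizer(
  apiKey: string | null,
  createReal: (key: string) => Summarizer,
): Summarizer {
  return apiKey ? createReal(apiKey) : mockSummarizer;
}
```

Because both services share one interface, the rest of the pipeline does not need to know which one is active.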
The AI service generates:
- Professional career summaries (1-2 sentences, <150 characters)
- Key expertise identification
- Industry and skill extraction
{
"id": "alumni_1234567890_0",
"name": "Jane Doe",
"title": "Product Manager",
"company": "Google",
"location": "Singapore",
"education": ["MBA, ASB"],
"summary": "Experienced PM with background in fintech...",
"linkedinUrl": "https://www.linkedin.com/in/janedoe/",
"pastRoles": [
{
"title": "Senior Analyst",
"company": "AirAsia",
"years": "2019-2022"
}
],
"scrapedAt": "2025-01-27T10:30:00Z",
"status": "success"
}
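The record above maps naturally onto a TypeScript type. The following is a sketch inferred from the sample only. The names `AlumniRecord`, `PastRole`, and `isAlumniRecord` are hypothetical, and `status` is typed as a plain string because "success" is the only value the sample shows.

```typescript
interface PastRole {
  title: string;
  company: string;
  years: string;
}

interface AlumniRecord {
  id: string;
  name: string;
  title: string;
  company: string;
  location: string;
  education: string[];
  summary: string;
  linkedinUrl: string;
  pastRoles: PastRole[];
  scrapedAt: string; // ISO 8601 timestamp in the sample
  status: string; // only "success" appears in the sample
}

function isObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isPastRole(v: unknown): v is PastRole {
  return (
    isObject(v) &&
    typeof v["title"] === "string" &&
    typeof v["company"] === "string" &&
    typeof v["years"] === "string"
  );
}

const STRING_FIELDS = [
  "id", "name", "title", "company", "location",
  "summary", "linkedinUrl", "scrapedAt", "status",
];

// Runtime check that an unknown value (e.g. parsed JSON) has the sample's shape.
function isAlumniRecord(v: unknown): v is AlumniRecord {
  if (!isObject(v)) return false;
  const obj: Record<string, unknown> = v;
  const education = obj["education"];
  const pastRoles = obj["pastRoles"];
  return (
    STRING_FIELDS.every((field) => typeof obj[field] === "string") &&
    Array.isArray(education) &&
    education.every((e) => typeof e === "string") &&
    Array.isArray(pastRoles) &&
    pastRoles.every(isPastRole)
  );
}
```

A guard like this is useful at the boundary between the backend response and the dashboard, where the data arrives as untyped JSON.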
The scraper implements several protective measures:
- 3-second delays between profile requests (configurable)
- Headless browser with realistic user agent
- Graceful error handling
- Respect for robots.txt (public profiles only)
- No authentication bypass attempts
- Sequential Processing: Profiles are processed one at a time to avoid rate limiting
- Memory Management: Large datasets are streamed rather than loaded entirely into memory
- Error Recovery: Failed profiles don't stop the entire job
- Progress Tracking: Real-time updates on scraping progress
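The sequential processing, delay, error recovery, and progress tracking described above can be combined in one loop. The sketch below is illustrative rather than the project's scraper: `scrapeSequentially`, `Outcome`, and the `scrapeOne` callback (standing in for the Puppeteer logic) are hypothetical names.

```typescript
// Result for one profile: either the scraped data or the error that occurred.
type Outcome<T> =
  | { url: string; status: "success"; data: T }
  | { url: string; status: "failed"; error: string };

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => { setTimeout(resolve, ms); });

async function scrapeSequentially<T>(
  urls: string[],
  scrapeOne: (url: string) => Promise<T>, // hypothetical per-profile scraper
  delayMs: number = 3000,
  onProgress: (done: number, total: number) => void = () => {},
  wait: (ms: number) => Promise<void> = sleep, // injectable so tests need not wait
): Promise<Outcome<T>[]> {
  const results: Outcome<T>[] = [];
  let done = 0;
  for (const url of urls) {
    // Pause between requests, but not before the first one.
    if (done > 0) {
      await wait(delayMs);
    }
    try {
      const data = await scrapeOne(url);
      results.push({ url, status: "success", data });
    } catch (err) {
      // A failed profile is recorded and the job continues with the next URL.
      const error = err instanceof Error ? err.message : String(err);
      results.push({ url, status: "failed", error });
    }
    done += 1;
    onProgress(done, urls.length);
  }
  return results;
}
```

Awaiting each profile inside the loop is what makes the processing sequential. Starting all scrapes at once with `Promise.all` would defeat the delay.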
- "Using mock AI service" message
  - Add your Gemini API key to the .env file
  - Restart the server
  - Check the /api/health endpoint
- Puppeteer Installation Issues
    npm install puppeteer --unsafe-perm=true
- LinkedIn Blocking
  - Reduce scraping frequency in .env (raise SCRAPING_DELAY_MS)
  - Check if profiles are truly public
  - Verify user agent settings
- Memory Issues
  - Limit concurrent browser instances (MAX_CONCURRENT_BROWSERS)
  - Increase Node.js memory limit:
    node --max-old-space-size=4096 server/index.js
GET /api/health
Response: { "status": "ok", "geminiConfigured": true, "message": "..." }
POST /api/upload
Content-Type: multipart/form-data
Body: file (CSV)
POST /api/scrape
Content-Type: application/json
Body: { "jobId": "job_123..." }
GET /api/scrape/status/:jobId
POST /api/export/csv
POST /api/export/json
Content-Type: application/json
Body: { "data": [...] }
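For the CSV export, the main detail to get right is field escaping. The sketch below shows one plausible approach and is not taken from the project: `toCsv` and `escapeCsvField` are hypothetical helpers, and joining array values with "; " is an assumption about how list fields such as education might be flattened.

```typescript
type Cell = string | number | boolean | null | undefined | string[];

// Quote a field when it contains a comma, double quote, or line break,
// doubling any embedded quotes (the convention described in RFC 4180).
function escapeCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? '"' + value.replace(/"/g, '""') + '"' : value;
}

// Serialize rows to CSV using the given column order; missing values become empty fields.
function toCsv(rows: Array<Record<string, Cell>>, columns: string[]): string {
  const header = columns.map(escapeCsvField).join(",");
  const body = rows.map((row) =>
    columns
      .map((col) => {
        const cell = row[col];
        // Assumption: list-valued fields are flattened with "; " as the separator.
        const text = Array.isArray(cell) ? cell.join("; ") : cell == null ? "" : String(cell);
        return escapeCsvField(text);
      })
      .join(","),
  );
  return [header].concat(body).join("\r\n");
}
```

Escaping matters here because values like "MBA, ASB" contain commas and would otherwise shift every following column.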
- Free Tier: 15 requests per minute, 1,500 requests per day
- Paid Tier: $0.00025 per 1K characters (input), $0.0005 per 1K characters (output)
- Typical Cost: ~$0.01-0.02 per alumni profile summary
For 100 alumni profiles:
- Estimated cost: $1-2 USD
- Processing time: ~10-15 minutes (with rate limiting)
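The figures above can be explored with a small calculator. This sketch only applies the per-character rates and limits listed in this section. The function names are hypothetical, and it assumes one Gemini request per profile, which may not match how the service actually batches calls.

```typescript
// Rates as listed above, in USD per 1,000 characters.
const INPUT_USD_PER_1K_CHARS = 0.00025;
const OUTPUT_USD_PER_1K_CHARS = 0.0005;

// Cost of a single request given its input and output sizes in characters.
function estimateCostUsd(inputChars: number, outputChars: number): number {
  return (
    (inputChars / 1000) * INPUT_USD_PER_1K_CHARS +
    (outputChars / 1000) * OUTPUT_USD_PER_1K_CHARS
  );
}

// Lower bound on job duration in minutes. Both the fixed delay between profiles and
// the requests-per-minute cap apply, so the larger of the two dominates.
// Page load and AI response time are not included.
function estimateMinMinutes(
  profiles: number,
  delayMs: number,
  requestsPerMinute: number,
): number {
  const delayMinutes = (Math.max(profiles - 1, 0) * delayMs) / 60000;
  const rateLimitMinutes = profiles / requestsPerMinute; // assumes one request per profile
  return Math.max(delayMinutes, rateLimitMinutes);
}
```

Actual cost depends on how much text each profile contributes and how many requests each profile needs, so the character counts passed in are inputs to measure rather than constants.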
Create a demo video showing:
- CSV file upload
- Real-time scraping progress
- Dashboard navigation and filtering
- Data export functionality
- AI summary generation
- Error handling examples
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is for educational and demonstration purposes. Please ensure compliance with LinkedIn's Terms of Service and applicable data protection regulations when using in production.
For issues and questions:
- Check the troubleshooting section
- Verify your Gemini API key configuration
- Review the GitHub issues
- Contact the development team
Built for ASB Alumni Office | Hackathon Project 2025