NASDAQ-100 Scraper

A Python tool for retrieving and storing current NASDAQ-100 constituents from Wikipedia.

Description

This project scrapes the list of NASDAQ-100 companies from the Wikipedia page and saves the data in CSV and JSON formats. The tool uses multiple fallback strategies to ensure reliable data extraction.

Motivation

I built this tool because I needed the NASDAQ-100 index composition for another project. Rather than keeping it private, I made it public so that others can benefit from automated access to current NASDAQ-100 constituent data.

The goal is to provide a reliable, automated way to access this financial data that updates regularly and can be easily integrated into other projects, research, or analysis workflows.

Features

Multiple extraction methods: Tries pandas.read_html() first and falls back to BeautifulSoup
Robust error handling: Automatic retries on failure
Data validation: Checks completeness and correctness of extracted data
Multiple output formats: Saves data as both CSV and JSON
Logging: Detailed logging of all operations
Data cleaning: Automatic cleaning of whitespace and formatting

Installation

Clone the repository:

git clone https://github.com/Gary-Strauss/NASDAQ100_Constituents
cd NASDAQ100_Constituents

Install dependencies:

pip install -r requirements.txt

Usage

Local Usage

Run the script directly:

python nasdaq100_scraper.py

The tool will automatically:

Retrieve NASDAQ-100 data from Wikipedia
Validate and clean the data
Save results to data/nasdaq100_constituents.csv and data/nasdaq100_constituents.json
Display a summary of the first 5 entries

Automated Updates via GitHub Actions

This repository automatically updates the NASDAQ-100 data monthly using GitHub Actions:

Schedule: 1st of every month at 10:00 UTC
Manual trigger: Available via GitHub Actions tab
Automatic releases: Creates tagged releases when data changes

Access Current Data

You can directly access the latest data from GitHub:

CSV Format:

https://raw.githubusercontent.com/Gary-Strauss/nasdaq100-scraper/main/data/nasdaq100_constituents.csv

JSON Format:

https://raw.githubusercontent.com/Gary-Strauss/nasdaq100-scraper/main/data/nasdaq100_constituents.json

Programmatic Usage

import pandas as pd
import requests

# Load latest CSV data directly from GitHub
csv_url = "https://raw.githubusercontent.com/Gary-Strauss/nasdaq100-scraper/main/data/nasdaq100_constituents.csv"
df = pd.read_csv(csv_url)

# Or load JSON data
json_url = "https://raw.githubusercontent.com/Gary-Strauss/nasdaq100-scraper/main/data/nasdaq100_constituents.json"
response = requests.get(json_url, timeout=30)  # avoid hanging on a stalled connection
response.raise_for_status()  # raise on HTTP errors instead of parsing an error page
data = response.json()

Output Files

CSV format (data/nasdaq100_constituents.csv): Tabular representation for Excel/spreadsheet programs
JSON format (data/nasdaq100_constituents.json): Structured data for programmatic use
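
As a rough illustration of how one set of records can be written to both formats, here is a minimal sketch using only the standard library. The save_outputs helper and the list-of-objects JSON layout are assumptions made for this example; the actual script may structure its JSON differently.

```python
import csv
import json
from pathlib import Path

# Column names as documented in this README.
COLUMNS = ["Ticker", "Company", "GICS_Sector", "GICS_Sub_Industry"]


def save_outputs(records, out_dir):
    """Write the same records as CSV and JSON and return both paths.

    Hypothetical helper: the JSON layout (a list of objects) is an assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "nasdaq100_constituents.csv"
    json_path = out_dir / "nasdaq100_constituents.json"

    # newline="" lets the csv module control line endings itself.
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(records)

    with json_path.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2, ensure_ascii=False)

    return csv_path, json_path
```

Calling save_outputs(records, "data") with a list of dictionaries keyed by the four column names would produce the two files named above.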

Data Structure

The extracted data contains the following columns:

Ticker: Company stock symbol
Company: Full company name
GICS_Sector: Global Industry Classification Standard sector
GICS_Sub_Industry: GICS sub-industry

Sample Data

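The tool currently extracts 101 constituents (the index can list more than 100 tickers because some companies are represented by more than one share class), including: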

Apple Inc. (AAPL) - Information Technology
Microsoft (MSFT) - Information Technology
Amazon (AMZN) - Consumer Discretionary
Nvidia (NVDA) - Information Technology
Meta Platforms (META) - Communication Services
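
To show how records of this shape might be used, the sketch below counts constituents per sector and filters by sector. It uses only the five sample rows listed above; the GICS_Sub_Industry field is left out because the sample list does not show it, and the helper names are hypothetical.

```python
from collections import Counter

# The five sample rows from this README (sub-industry omitted).
SAMPLE = [
    {"Ticker": "AAPL", "Company": "Apple Inc.", "GICS_Sector": "Information Technology"},
    {"Ticker": "MSFT", "Company": "Microsoft", "GICS_Sector": "Information Technology"},
    {"Ticker": "AMZN", "Company": "Amazon", "GICS_Sector": "Consumer Discretionary"},
    {"Ticker": "NVDA", "Company": "Nvidia", "GICS_Sector": "Information Technology"},
    {"Ticker": "META", "Company": "Meta Platforms", "GICS_Sector": "Communication Services"},
]


def count_by_sector(records):
    """Return a Counter mapping each sector to its number of constituents."""
    return Counter(row["GICS_Sector"] for row in records)


def tickers_in_sector(records, sector):
    """Return the sorted tickers that belong to the given sector."""
    return sorted(row["Ticker"] for row in records if row["GICS_Sector"] == sector)


sector_counts = count_by_sector(SAMPLE)
it_tickers = tickers_in_sector(SAMPLE, "Information Technology")
print(sector_counts["Information Technology"])  # 3
print(it_tickers)  # ['AAPL', 'MSFT', 'NVDA']
```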

Technical Details

Extraction Methods

Pandas method: First attempts pandas.read_html() for fast table extraction
BeautifulSoup fallback: Uses BeautifulSoup when pandas method fails
Intelligent column detection: Automatic identification of relevant table columns
Retry mechanism: Up to 3 attempts on network errors
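
The interplay of primary method, fallback, and retries described above can be sketched in a library-independent way. This is not the project's code: extract_with_fallback, retrieve, and the strategies argument are hypothetical names, and in real use the two strategies would wrap pandas.read_html() and a BeautifulSoup parser.

```python
import logging
import time

logger = logging.getLogger(__name__)


def extract_with_fallback(html, strategies):
    """Try each (name, parser) pair in order; return the first non-empty result."""
    for name, parser in strategies:
        try:
            rows = parser(html)
        except Exception as exc:  # any parser failure triggers the next strategy
            logger.warning("%s failed: %s", name, exc)
            continue
        if rows:
            return name, rows
        logger.warning("%s found no suitable table", name)
    raise ValueError("no extraction strategy produced data")


def retrieve(fetch, strategies, max_attempts=3, delay=1.0):
    """Fetch and parse, repeating the whole cycle up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        logger.info("Attempt %d of %d", attempt, max_attempts)
        try:
            return extract_with_fallback(fetch(), strategies)
        except (OSError, ValueError) as exc:
            # OSError covers connection problems (requests' exceptions derive
            # from it); ValueError means every parser came back empty.
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay)
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

Keeping the strategies in an ordered list means a third parser could be added later without touching the retry loop.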

Data Validation

Checks for at least 90 companies (typically ~100-101)
Validates all required columns
Cleans whitespace and formatting errors
Validates ticker format (1-5 uppercase letters)
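
The checks listed above could look roughly like the following. This is a sketch under assumptions, not the script's actual validation code: the function names are hypothetical, and records are assumed to be dictionaries keyed by the four documented column names.

```python
import re

REQUIRED_COLUMNS = ("Ticker", "Company", "GICS_Sector", "GICS_Sub_Industry")
TICKER_RE = re.compile(r"[A-Z]{1,5}")  # 1-5 uppercase letters
MIN_COMPANIES = 90


def clean_record(record):
    """Trim every string value and collapse internal runs of whitespace."""
    return {
        key: " ".join(value.split()) if isinstance(value, str) else value
        for key, value in record.items()
    }


def validate(records, min_companies=MIN_COMPANIES):
    """Return a list of problems; an empty list means the data passed."""
    problems = []
    if len(records) < min_companies:
        problems.append(f"expected at least {min_companies} rows, got {len(records)}")
    for index, record in enumerate(records):
        missing = [col for col in REQUIRED_COLUMNS if not record.get(col)]
        if missing:
            problems.append(f"row {index}: missing {', '.join(missing)}")
        ticker = record.get("Ticker") or ""
        if ticker and not TICKER_RE.fullmatch(ticker):
            problems.append(f"row {index}: invalid ticker {ticker!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that is wrong with a scrape in a single pass.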

Dependencies

pandas>=1.3.0: Data manipulation and CSV export
requests>=2.25.0: HTTP requests
beautifulsoup4>=4.9.0: HTML parsing as fallback
lxml>=4.6.0: XML/HTML parser for pandas
html5lib>=1.1: Additional HTML parser

License and Data Sources

Data Sources

The data is retrieved from the Wikipedia "NASDAQ-100" page:

Primary Source: Wikipedia - NASDAQ-100
Original Data Source: Wikipedia references the official NASDAQ composition from NASDAQ NDX Index (as of 2025-06-22)
License: Wikipedia text is available under the Creative Commons Attribution-ShareAlike 4.0 License (CC BY-SA 4.0); Wikipedia used CC BY-SA 3.0 until June 2023

Usage Notes for Wikipedia Data

Data originates from Wikipedia and is subject to the CC BY-SA 4.0 license
When redistributing, Wikipedia must be credited as the source
Derivative works must be published under the same license
Data is provided "as is" without warranty for completeness or accuracy
For financial decisions, please consult official sources

Data Chain

The data flow is: NASDAQ Official → Wikipedia → This Tool

NASDAQ maintains the official index composition at nasdaq.com
Wikipedia editors update their page based on official NASDAQ data
This tool extracts the data from Wikipedia for programmatic use

Troubleshooting

Common Issues

Network errors: The tool automatically retries on temporary connection problems
Table structure changed: If the layout of the Wikipedia page changes, the column detection logic may need adjustment
Missing dependencies: Ensure all packages from requirements.txt are installed
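
For readers who need to adjust column detection after a table change, here is one way keyword-based detection might work. It is only a sketch: the keyword lists are illustrative guesses rather than values taken from the project source, and detect_columns is a hypothetical name.

```python
# Illustrative keywords per target column (assumptions, not project values).
# Sub-industry is listed before sector so the more specific match is claimed first.
COLUMN_KEYWORDS = {
    "Ticker": ("ticker", "symbol"),
    "Company": ("company", "security", "name"),
    "GICS_Sub_Industry": ("sub-industry", "sub industry", "subindustry"),
    "GICS_Sector": ("sector",),
}


def detect_columns(headers):
    """Map each target column to the index of the first matching table header."""
    mapping = {}
    used = set()
    for target, keywords in COLUMN_KEYWORDS.items():
        for index, header in enumerate(headers):
            text = header.strip().lower()
            if index not in used and any(word in text for word in keywords):
                mapping[target] = index
                used.add(index)
                break
    missing = sorted(set(COLUMN_KEYWORDS) - set(mapping))
    if missing:
        raise ValueError(f"could not locate columns: {', '.join(missing)}")
    return mapping
```

With a hypothetical header row such as ["Company", "Ticker", "GICS Sector", "GICS Sub-Industry"], the function returns the position of each target column regardless of the order in which the table presents them. If a header is renamed on the page, extending the matching keyword tuple is usually the smallest fix.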

Debug Information

The tool logs all steps in detail. For issues, check console output for specific error messages.

Typical Output

2025-06-22 08:27:58,468 - INFO - Attempt 1 of 3
2025-06-22 08:27:58,468 - INFO - Trying to retrieve data with pandas.read_html()...
2025-06-22 08:27:58,750 - WARNING - No suitable Components table found with pandas
2025-06-22 08:27:58,750 - INFO - Falling back to BeautifulSoup...
2025-06-22 08:27:59,078 - INFO - DataFrame validation successful
2025-06-22 08:27:59,078 - INFO - Successfully retrieved 101 components with BeautifulSoup
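
The log lines above are consistent with a standard-library logging setup along the following lines. The format string is inferred from the sample output and is an assumption; build_logger is a hypothetical helper, not a function from the script.

```python
import io
import logging

# Assumed format string, inferred from the sample log lines in this README.
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"


def build_logger(name="nasdaq100_scraper", stream=None, level=logging.INFO):
    """Return a logger that writes timestamped lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    return logger


# Capture output in memory to show what a line looks like.
buffer = io.StringIO()
log = build_logger("demo", stream=buffer)
log.info("Attempt %d of %d", 1, 3)
log.debug("not shown at INFO level")
print(buffer.getvalue(), end="")
```

Passing level=logging.DEBUG would also surface the debug message, which can help when diagnosing a failed extraction.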

Project Structure

nasdaq100-scraper/
├── .github/
│   └── workflows/
│       └── update-nasdaq100.yml  # GitHub Actions workflow
├── nasdaq100_scraper.py          # Main script
├── requirements.txt              # Python dependencies
├── README.md                     # This file
└── data/                         # Output directory
    ├── nasdaq100_constituents.csv
    └── nasdaq100_constituents.json

Contributing

Improvements and bug fixes are welcome! Please create a pull request or open an issue.

Disclaimer

This tool is for informational purposes only. The data comes from Wikipedia and may be incomplete or outdated. For investment decisions, please consult official financial sources such as NASDAQ or Bloomberg.
