This project scrapes historical Olympic Games data from Olympics.com, including basic event information, sports disciplines, and detailed competition results. It supports data extraction in both Chinese and English.
Example:
China's medal tree at the 2024 Paris Olympics (Flourish)
2024 Paris Olympics World Medal Tree (Flourish)
- GetOlympics_Name_Year.py
  - Purpose: Extracts Olympic Games names and years from predefined URLs and generates olympic_games.csv.
  - Output: Olympics_event/olympic_games.csv
- GetSport.py
  - Purpose: Uses Selenium to scrape the list of sports disciplines for each Olympic Games and saves it as [Olympic-Game-Name]_events.csv.
  - Output: Olympics_event/[Olympic-Game-Name]_events.csv
- GetOlympics.py
  - Purpose: Scrapes detailed competition results (medals, athletes, countries) based on the sports list. Supports bilingual data.
  - Output Directories:
    - Chinese: Olympics-result-zh/
    - English: Olympics-result-en/
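As an illustration of the first step, the sketch below derives a name and year from each URL and writes them to a CSV file. It is not the code in GetOlympics_Name_Year.py: the URL pattern (a slug such as `paris-2024`), the `GAME_URLS` list, the helper names, and the `name`/`year` column headers are all assumptions made for this example.

```python
import csv
import re
from pathlib import Path

# Hypothetical URL list; the script's real predefined URLs are not shown here.
GAME_URLS = [
    "https://olympics.com/en/olympic-games/paris-2024",
    "https://olympics.com/en/olympic-games/tokyo-2020",
]


def parse_game(url):
    """Split a trailing slug such as 'paris-2024' into ('Paris', 2024)."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    match = re.fullmatch(r"(?P<city>[a-z-]+)-(?P<year>\d{4})", slug)
    if match is None:
        raise ValueError(f"Unexpected URL slug: {slug!r}")
    city = match["city"].replace("-", " ").title()
    return city, int(match["year"])


def write_games_csv(urls, path):
    """Write one (name, year) row per URL; column names are assumed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["name", "year"])
        for url in urls:
            writer.writerow(parse_game(url))
```

Calling `write_games_csv(GAME_URLS, "Olympics_event/olympic_games.csv")` would produce a file in the location listed above, though the real file's columns may differ.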
- Python 3.8+
- Required Libraries:
pip install pandas beautifulsoup4 requests selenium openpyxl tqdm
- Browser Driver: Microsoft Edge WebDriver. The driver version must match your installed Edge browser version, and the driver's location must be on the system PATH.
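Before running the scripts, it can help to confirm that the requirements above are in place. The following is a minimal sketch using only the standard library; the helper names are made up for this example, and `msedgedriver` is assumed to be the driver's executable name.

```python
import importlib.util
import shutil

# Import names for the pip packages listed above
# (the beautifulsoup4 package is imported as "bs4").
REQUIRED_MODULES = ["pandas", "bs4", "requests", "selenium", "openpyxl", "tqdm"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names of modules that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def find_edge_driver():
    """Return the full path of the driver found on PATH, or None.

    Assumes the executable is called "msedgedriver".
    """
    return shutil.which("msedgedriver")


if __name__ == "__main__":
    print("Missing modules:", missing_modules() or "none")
    print("Edge driver:", find_edge_driver() or "not found on PATH")
```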
Run the script to generate basic Olympic Games info:
python GetOlympics_Name_Year.py
Run the script to fetch sports disciplines for each Olympic Games:
python GetSport.py
Note: The first run requires you to log in and accept cookies manually in the browser. Subsequent runs load the saved user profile automatically.
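The profile reuse described in the note relies on pointing Edge at a persistent user data directory. The sketch below shows one way this could be set up with Selenium 4; `build_edge_arguments` and `launch_edge` are hypothetical helper names, and the actual setup in GetSport.py may differ.

```python
def build_edge_arguments(profile_dir):
    """Return the browser arguments that select a persistent profile directory."""
    return [f"user-data-dir={profile_dir}"]


def launch_edge(profile_dir):
    """Start Edge with the given profile so logins and cookies persist between runs."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.EdgeOptions()
    for argument in build_edge_arguments(profile_dir):
        options.add_argument(argument)
    return webdriver.Edge(options=options)
```

With this arrangement, the login and cookie choices made during the first run are stored in the profile directory and picked up by later runs that pass the same path.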
- For Chinese data:
  python GetOlympics.py
- For English data, uncomment main_en() in GetOlympics.py:
  # In GetOlympics.py, uncomment:
  # main_en()
  Then run:
  python GetOlympics.py
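As an alternative to editing the source to switch languages, the language could be chosen with a command-line flag. This is only a sketch of that idea, not how GetOlympics.py currently works: the `--lang` flag, `parse_lang`, `run`, and the `main_zh` name are all hypothetical, and both entry points are replaced by stubs here.

```python
import argparse


def main_zh():
    # Hypothetical stand-in for the script's Chinese entry point.
    return "zh"


def main_en():
    # Stand-in for main_en(), the English entry point named above.
    return "en"


def parse_lang(argv=None):
    """Read the --lang flag; defaults to Chinese, matching the default run."""
    parser = argparse.ArgumentParser(description="Scrape Olympic results.")
    parser.add_argument("--lang", choices=["zh", "en"], default="zh")
    return parser.parse_args(argv).lang


def run(argv=None):
    """Dispatch to the entry point for the requested language."""
    entry_points = {"zh": main_zh, "en": main_en}
    return entry_points[parse_lang(argv)]()
```

For example, `run(["--lang", "en"])` calls the English entry point, while `run([])` falls back to Chinese.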
├── Olympics_event/                      # Olympic Games metadata
│   ├── olympic_games.csv                # All Olympic Games names & years
│   └── [Olympic-Game-Name]_events.csv   # Sports disciplines per edition
│
├── Olympics-result-zh/                  # Chinese results (by edition)
│   └── [Olympic-Game-Name]/
│       └── [Sport-Name].xlsx
│
├── Olympics-result-en/                  # English results (same structure)
│
├── GetOlympics_Name_Year.py             # Script 1
├── GetSport.py                          # Script 2
└── GetOlympics.py                       # Script 3
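The layout above maps directly onto file paths. The helpers below are a small sketch of that mapping using `pathlib`; the function names are made up for this example, and the exact form of the game and sport names (spaces, hyphens, capitalization) is an assumption.

```python
from pathlib import Path

# Result directory per language, as shown in the tree above.
RESULT_DIRS = {"zh": "Olympics-result-zh", "en": "Olympics-result-en"}


def events_csv_path(game_name, root="."):
    """Path of the sports discipline list for one Olympic Games edition."""
    return Path(root) / "Olympics_event" / f"{game_name}_events.csv"


def result_xlsx_path(lang, game_name, sport_name, root="."):
    """Path of the results workbook for one sport in one edition."""
    return Path(root) / RESULT_DIRS[lang] / game_name / f"{sport_name}.xlsx"
```

For instance, `result_xlsx_path("en", "Paris-2024", "Swimming")` points at `Olympics-result-en/Paris-2024/Swimming.xlsx`, where `Paris-2024` is a placeholder for whatever name the scripts actually generate.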
- Selenium Configuration
  - Install Microsoft Edge and download the matching EdgeDriver version.
  - To modify the browser profile path, update this line in GetSport.py:
    options.add_argument("user-data-dir=/Your/Profile/Path")
- Network Stability: Some pages load slowly, so running the scripts on a low-latency connection is recommended.
- Anti-Scraping Measures: If you are blocked frequently, adjust the scroll parameters in GetSport.py:
    scroll_pause_time = 2  # Wait time after scrolling (seconds)
    total_scrolls = 5      # Number of scrolls
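To show how the two scroll parameters typically interact, here is a minimal sketch of a scroll loop. It assumes the page loads more content as it is scrolled and that `driver` is a Selenium WebDriver (or any object with an `execute_script` method); `scroll_page` is a hypothetical helper, not necessarily the loop used in GetSport.py.

```python
import time


def scroll_page(driver, scroll_pause_time=2, total_scrolls=5):
    """Scroll to the bottom of the page repeatedly, pausing after each scroll."""
    for _ in range(total_scrolls):
        # Jump to the current bottom of the document.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to load further content before scrolling again.
        time.sleep(scroll_pause_time)
```

In a loop like this, `total_scrolls` bounds how much of the page is loaded, and `scroll_pause_time` controls how quickly requests follow one another.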
| Sport | Event | Medal | Athlete Link | Athlete Name | NOC | Country |
|---|---|---|---|---|---|---|
| Swimming | Men's 100m Free | Gold | /athletes/... | John Smith | USA | United States |
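Rows in this format are straightforward to aggregate. The sketch below counts medals per NOC from a list of row dictionaries keyed by the column names above. The first sample row is the one shown in the table; the other two are made-up placeholders added only to make the tally non-trivial, and `medal_tally` is a hypothetical helper.

```python
from collections import Counter

SAMPLE_ROWS = [
    # Row taken from the example table above.
    {"Sport": "Swimming", "Event": "Men's 100m Free", "Medal": "Gold",
     "Athlete Link": "/athletes/...", "Athlete Name": "John Smith",
     "NOC": "USA", "Country": "United States"},
    # Placeholder rows (not real results), for illustration only.
    {"Sport": "Swimming", "Event": "Men's 100m Free", "Medal": "Silver",
     "Athlete Link": "/athletes/...", "Athlete Name": "Placeholder A",
     "NOC": "AAA", "Country": "Placeholder Country"},
    {"Sport": "Swimming", "Event": "Men's 100m Free", "Medal": "Bronze",
     "Athlete Link": "/athletes/...", "Athlete Name": "Placeholder B",
     "NOC": "USA", "Country": "United States"},
]


def medal_tally(rows):
    """Return {NOC: Counter({medal: count})} for the given result rows."""
    tally = {}
    for row in rows:
        tally.setdefault(row["NOC"], Counter())[row["Medal"]] += 1
    return tally


if __name__ == "__main__":
    for noc, medals in medal_tally(SAMPLE_ROWS).items():
        print(noc, dict(medals))
```

Rows for a real workbook could be obtained with `pandas.read_excel(path).to_dict("records")`, assuming the file uses these column names.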
Apache License 2.0