Telegram Scraper

This repository contains a powerful and flexible Telegram scraper designed to collect message data and media files from Telegram channels or groups. Built with Python and the Telethon library, this scraper is ideal for data analysts, researchers, and developers looking to gather and analyze data from Telegram for various purposes, such as social network analysis, market research, or content monitoring.

Features

Asynchronous Scraping: Utilizes Python's asyncio and Telethon library for efficient and non-blocking data collection.
Data and Media Download: Fetches messages, along with any attached media (images, videos, etc.), and saves them locally.
Error Management: Includes mechanisms to handle rate limits (FloodWaitError) and other common exceptions gracefully.
Customizable: Allows users to define scraping parameters such as group/channel name, message limits, and scraping intervals.
Data Consolidation: Merges individual CSV files containing messages and metadata into a single, comprehensive dataset.

Requirements

Python 3.7 or later
Telethon
nest_asyncio for enabling nested event loops

Installation

Clone the Repository

git clone https://github.com/Amirwpi/Telegram_Scraper.git
cd Telegram_Scraper

Install Dependencies

Install the required Python libraries using pip:
```
pip install -r requirements.txt
```
Note: Ensure you have Telethon and nest_asyncio in your requirements.txt.

Usage

1. Set Up Your API Credentials

Obtain your API ID and API hash from Telegram (my.telegram.org, under "API development tools").
Update the placeholders in the script (api_id, api_hash, and phone_number) with your actual credentials.
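
As an alternative to editing the placeholders directly, the credentials can be read from environment variables so they never end up in version control. This is a minimal sketch, not code from the repository: the variable names TELEGRAM_API_ID, TELEGRAM_API_HASH, and TELEGRAM_PHONE and the helper load_credentials are assumptions made for illustration.

```python
import os
from typing import Mapping, NamedTuple

# Hypothetical environment variable names; pick whatever suits your setup.
REQUIRED_VARS = ("TELEGRAM_API_ID", "TELEGRAM_API_HASH", "TELEGRAM_PHONE")


class Credentials(NamedTuple):
    api_id: int
    api_hash: str
    phone_number: str


def load_credentials(env: Mapping[str, str] = os.environ) -> Credentials:
    """Read api_id, api_hash and phone_number from a mapping of variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Credentials(
        api_id=int(env["TELEGRAM_API_ID"]),  # the API ID is an integer
        api_hash=env["TELEGRAM_API_HASH"],
        phone_number=env["TELEGRAM_PHONE"],
    )
```

Calling load_credentials() with no arguments reads from the real environment; passing a plain dict makes the helper easy to try out.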

2. Run "01-Telegram Session Generator.py"

Execute the following command to authenticate your Telegram session (the quotes are needed because the file name contains spaces):

python "01-Telegram Session Generator.py"

Follow the on-screen instructions to complete authentication and retrieve your session string.
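
For orientation, producing a session string with Telethon can look roughly like the sketch below. It is not the contents of the repository's script; the function name create_session_string is hypothetical, and the example assumes Telethon's StringSession is the session type being used.

```python
async def create_session_string(api_id: int, api_hash: str, phone_number: str) -> str:
    """Log in interactively and return a reusable session string."""
    # Imported lazily so this module can be loaded without Telethon installed.
    from telethon import TelegramClient
    from telethon.sessions import StringSession

    client = TelegramClient(StringSession(), api_id, api_hash)
    # start() prompts for the login code and, if enabled, the 2FA password.
    await client.start(phone=phone_number)
    try:
        # For a StringSession, save() returns the session as a string.
        return client.session.save()
    finally:
        await client.disconnect()


# Example call (placeholder values, replace with your own credentials):
# import asyncio
# print(asyncio.run(create_session_string(123456, "your_api_hash", "+10000000000")))
```

Treat the resulting string like a password: anyone who has it can act as your Telegram account.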

3. Run the Main Scraper "02-Telegram Scraper.py"

After obtaining the session string, run the main scraper to start collecting data:

python "02-Telegram Scraper.py"

4. Merge the Data

Once the scraping is complete, use the merging script to consolidate all CSV files into a single file:

python 03-merge_output_files.py

Note: The merging script has not been uploaded yet; it will be added to the repository later.

Configuration

Before running the scraper, customize the following settings in "02-Telegram Scraper.py":

api_id and api_hash: Your Telegram API credentials.
Session: The session string produced by "01-Telegram Session Generator.py".
group_title: The Telegram group or channel from which to scrape data.
limit_msg: Maximum number of messages to fetch per request.
Repeat_number: Number of iterations for repeated scraping.
datetime_before: The initial timestamp to begin scraping messages from.

Important Note: Adjust the Repeat_number, datetime_before, and other parameters based on your specific data and channel requirements. It's essential to monitor the scraping process to ensure that it is functioning correctly. If an error occurs, use the last file's timestamp to continue scraping from where it stopped.
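
To resume after an interruption, the timestamp can be read back from the last output file. The sketch below is illustrative only: it assumes the CSV has a column named date holding ISO 8601 timestamps and that scraping proceeds from newer to older messages, so the oldest timestamp in the last file is the point to continue from. Neither assumption is confirmed by the repository, and oldest_timestamp is a hypothetical helper.

```python
import csv
from datetime import datetime
from pathlib import Path


def oldest_timestamp(csv_path: Path, date_column: str = "date") -> datetime:
    """Return the earliest timestamp found in one output CSV.

    Assumes `date_column` holds ISO 8601 strings; adjust to the real schema.
    """
    with csv_path.open(newline="", encoding="utf-8") as handle:
        dates = [datetime.fromisoformat(row[date_column]) for row in csv.DictReader(handle)]
    if not dates:
        raise ValueError(f"{csv_path} contains no rows")
    return min(dates)
```

The returned value could then be used as the new datetime_before for the next run.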

How It Works

1. Session Management and Login Authentication

The first part of the scraper handles authentication with the Telegram API. It checks for authorization, manages two-factor authentication, and provides a session string for subsequent scraping tasks.

2. Main Scraper: Data and Media Collection

The main scraping script uses Telethon's GetHistoryRequest to iteratively fetch messages and media from the specified group or channel. It handles Telegram's rate limits by catching FloodWaitError exceptions and waiting before retrying requests.
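
The paging idea can be sketched as follows. This is an approximation under stated assumptions rather than the script's actual loop: fetch_history and next_offset are hypothetical names, one request is assumed per iteration of Repeat_number, and group_title is assumed to be something Telethon's get_entity can resolve (for example a public username).

```python
def next_offset(batch):
    """Return (id, date) of the oldest message in a batch, used as the next offset."""
    oldest = min(batch, key=lambda message: message.id)
    return oldest.id, oldest.date


async def fetch_history(client, group_title, limit_msg, repeat_number, datetime_before):
    """Page backward through a chat, issuing at most `repeat_number` requests."""
    # Imported lazily so next_offset() can be used without Telethon installed.
    from telethon.tl.functions.messages import GetHistoryRequest

    entity = await client.get_entity(group_title)
    collected = []
    offset_id, offset_date = 0, datetime_before
    for _ in range(repeat_number):
        history = await client(GetHistoryRequest(
            peer=entity,
            offset_id=offset_id,      # 0 on the first request
            offset_date=offset_date,  # only messages before this date
            add_offset=0,
            limit=limit_msg,
            max_id=0,
            min_id=0,
            hash=0,
        ))
        if not history.messages:
            break  # nothing older is left
        collected.extend(history.messages)
        offset_id, offset_date = next_offset(history.messages)
    return collected
```

Each request starts where the previous one ended, so the loop walks from datetime_before toward older messages.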

3. Data Merging

After data collection, the scraper saves the messages and media information in multiple CSV files. The merging script consolidates these files into a single dataset for easier analysis.
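
Since the merging script is not yet part of the repository, the following is only a sketch of what such a step could look like using the standard library. The function name merge_csv_files is hypothetical, and the code assumes every output file shares the same header row.

```python
import csv
from pathlib import Path


def merge_csv_files(input_dir: Path, output_file: Path, pattern: str = "*.csv") -> int:
    """Concatenate CSV files that share a header; return the data rows written."""
    # Collect sources first so the output file is never merged into itself.
    sources = sorted(
        path for path in input_dir.glob(pattern)
        if path.resolve() != output_file.resolve()
    )
    header = None
    rows_written = 0
    with output_file.open("w", newline="", encoding="utf-8") as out_handle:
        writer = csv.writer(out_handle)
        for source in sources:
            with source.open(newline="", encoding="utf-8") as in_handle:
                reader = csv.reader(in_handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f"{source} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Files are merged in sorted name order, which keeps the result deterministic across runs.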

Error Handling

FloodWaitError: Automatically waits for the required time if Telegram's rate limit is hit.
SessionPasswordNeededError: Handles two-factor authentication if enabled on the account.
General exceptions are caught and logged to ensure the scraper continues running smoothly.
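
The wait-and-retry behaviour described above can be illustrated with a small, library-independent sketch. RateLimited is a stand-in defined here for demonstration; with Telethon, the exception to catch would be FloodWaitError from telethon.errors, which likewise exposes the required wait as a seconds attribute. The helper call_with_retry and its parameters are assumptions, not the repository's code.

```python
import asyncio


class RateLimited(Exception):
    """Stand-in for a rate-limit error that carries the wait time in `seconds`."""

    def __init__(self, seconds: int):
        super().__init__(f"wait {seconds}s")
        self.seconds = seconds


async def call_with_retry(make_call, retry_on=(RateLimited,), max_attempts=5, sleep=asyncio.sleep):
    """Await make_call(); on a rate-limit error, sleep for `seconds` and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on as error:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            await sleep(error.seconds)
```

Passing the exception types and the sleep function as parameters keeps the helper easy to exercise without contacting Telegram.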

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue to suggest improvements or report bugs.

Fork the repository.
Create a new branch (git checkout -b feature/your-feature-name).
Commit your changes (git commit -am 'Add your feature').
Push to the branch (git push origin feature/your-feature-name).
Open a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details. This code and documentation were written by Amir Jamali.
