semantic-systems/sems-social-media-retriever

Social Media Retriever

Social Media Retriever is a tool for retrieving posts from social media platforms and RSS feeds. It currently supports the following sources:

  • Bluesky
  • Mastodon
  • Reddit
  • YouTube
  • RSS Feeds

Installation

  1. Clone the repository: git clone https://github.com/semantic-systems/sems-social-media-retriever.git
  2. Install python dependencies: pip install -r requirements.txt
  3. Rename keys.env.example to keys.env and add your credentials

Usage

Run python src/main.py with the following arguments:

Required Arguments:

-q, --query: The search query.

Optional Arguments:

--since: The start date in the format YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS (Optional)
--until: The end date in the format YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS (Optional)
--limit: The maximum number of results to return, ranging from 1 to 50 (Default: 10, Optional)
--verbose: Enables verbose output (Optional)
--subreddits: Custom list of subreddits to search (Default: hamburg, de)
--platforms: A list of platforms to query. Supported platforms: youtube, mastodon, reddit, bluesky, rss. (Defaults to all platforms)
--mandatory_keywords: A list of keywords that must be included in the result (Optional)
--optional_keywords: A list of keywords of which at least --n_keywords must be included in the result (Optional)
--n_keywords: The minimum number of optional keywords that a post must contain (Default: 2)
--w_regex: A whitelist regex string. Only posts matching this pattern will be included (Optional)
--b_regex: A blacklist regex string. Posts matching this pattern will be excluded (Optional)
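
The keyword and regex options above combine into a single per-post filter. The following is a minimal sketch of one way to read those rules, not the project's actual implementation. The function name post_matches, the case-insensitive substring matching, and the use of re.search are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def post_matches(
    text: str,
    mandatory_keywords: Sequence[str] = (),
    optional_keywords: Sequence[str] = (),
    n_keywords: int = 2,
    w_regex: Optional[str] = None,
    b_regex: Optional[str] = None,
) -> bool:
    """Return True if a post's text passes all filters (hypothetical helper)."""
    lowered = text.lower()

    # Every mandatory keyword must appear (assumed: case-insensitive substring).
    if any(kw.lower() not in lowered for kw in mandatory_keywords):
        return False

    # At least n_keywords of the optional keywords must appear.
    if optional_keywords:
        hits = sum(1 for kw in optional_keywords if kw.lower() in lowered)
        if hits < n_keywords:
            return False

    flags = re.IGNORECASE | re.DOTALL

    # Whitelist: the post is kept only if the pattern matches.
    if w_regex is not None and re.search(w_regex, text, flags) is None:
        return False

    # Blacklist: the post is dropped if the pattern matches.
    if b_regex is not None and re.search(b_regex, text, flags) is not None:
        return False

    return True
```

Under these assumptions, a post such as "Sturm und Regen in Hamburg" passes with --mandatory_keywords hamburg, --n_keywords 1, and the optional keywords sturm and regen, while any post that also mentions Berlin is rejected by --b_regex '.*(berlin).*'.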

Example:

python src/main.py -q 'hamburg storm' --since 2023-10-01 --until 2023-12-28 --limit 20 --verbose --subreddits hamburg de --platforms youtube mastodon reddit bluesky rss --mandatory_keywords hamburg --optional_keywords sturm storm flut flood unwetter regen rain --n_keywords 1 --w_regex '.*(hamburg).*' --b_regex '.*(berlin).*'

This will create a file in the base directory (output.json) that contains all the posts that match the search query.
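
When scripting several queries, the same command line can be assembled from Python. The sketch below only builds the argument list from flags documented above. The helper name build_command is hypothetical, and it covers only a subset of the options.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    query: str,
    since: Optional[str] = None,
    until: Optional[str] = None,
    limit: Optional[int] = None,
    verbose: bool = False,
    subreddits: Sequence[str] = (),
    platforms: Sequence[str] = (),
) -> List[str]:
    """Build the argument list for src/main.py (hypothetical helper)."""
    cmd = ["python", "src/main.py", "-q", query]
    if since is not None:
        cmd += ["--since", since]
    if until is not None:
        cmd += ["--until", until]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if verbose:
        cmd.append("--verbose")
    # List-valued flags take space-separated values, as in the example above.
    if subreddits:
        cmd += ["--subreddits", *subreddits]
    if platforms:
        cmd += ["--platforms", *platforms]
    return cmd


command = build_command(
    "hamburg storm",
    since="2023-10-01",
    until="2023-12-28",
    limit=20,
    platforms=["reddit", "rss"],
)
# Print a shell-quoted form of the command for inspection.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the repository root, after which output.json should be available in the base directory.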

Parameters

Not all platforms offer filtering by the above parameters. The following parameters are supported by each platform:

Platform query since until limit subreddits
Bluesky
Mastodon
Reddit
YouTube
RSS

API

This project can be used to host an API endpoint for the search() function. To start the API, run python src/api.py or use the provided Dockerfile. This will start a FastAPI server on port 5000 with an endpoint /search that accepts the parameters listed in the Usage section. The API returns a JSON response with the posts that match the search query.
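
A client for the endpoint might look like the sketch below. Several details are assumptions rather than documented behavior: that /search accepts GET requests with query-string parameters, that the parameter names match the long CLI flags (query, since, until, limit, platforms), and that the server is reachable at http://localhost:5000. Check src/api.py for the actual request shape.

```python
import json
from typing import Any
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url: str = "http://localhost:5000", **params: Any) -> str:
    """Build a /search URL; parameter names are assumed to mirror the CLI flags."""
    filtered = {key: value for key, value in params.items() if value is not None}
    # doseq=True repeats the key for list values, e.g. platforms=a&platforms=b.
    return f"{base_url.rstrip('/')}/search?{urlencode(filtered, doseq=True)}"


def search(base_url: str = "http://localhost:5000", **params: Any) -> Any:
    """Call the endpoint and decode the JSON response (assumes a GET endpoint)."""
    with urlopen(build_search_url(base_url, **params), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_search_url(query="hamburg storm", limit=20, platforms=["reddit", "rss"])
print(url)
```

With the server running, search(query="hamburg storm", limit=20) would return the decoded JSON response. Its exact structure is not specified here.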
