# Statistics and History
Welcome to URS - a comprehensive Reddit scraping command-line tool written in Python.
This page merely serves as a place to display repository statistics and an archive for all iterations of URS. It exists so I can see the evolution of my programming skills, and for anyone who is curious about how this repository has evolved since its inception.
I found this dope statistics tool called Star Chart and wanted to display it somewhere in this repository. It plots the repository's stars over time, which is such a cool feature and definitely something I am very interested in seeing.
Additionally, I found another statistics tool called Spark, which displays the GitHub stars velocity of this repo over its entire lifetime.
I will also display the hit count. Maybe one day this repository will blow up again because of Reddit events such as the r/wallstreetbets fiasco that occurred in late January 2021.
I would love to revisit these statistics if something like that happens again, so consider the media above as future-proofing this wiki.
This is a table displaying the differences among the major iterations of URS.
| | v1.0.0 | v2.0.0 | v3.0.0 |
|---|---|---|---|
| CLI? | No | Yes | Yes |
| What Does It Scrape? | Subreddits Only | Subreddits Only | Subreddits, Redditors, Submission Comments |
| Export Options | CSV | CSV | CSV and JSON |
| READMEs | README | README | README |
| Scraper | `reddit_scraper.py` | `scraper.py` | `scraper.py` |
| Requirements Text File | N/A | `requirements.txt` | `requirements.txt` |
Here I am listing additional changes that were built on top of v3.0.0. This is basically a modified version of the Releases document.
- Structured comments export has been upgraded to include comments of all levels.
  - Structured comments are now the default export format. Exporting to raw format requires including the `--raw` flag.
- Tons of metadata has been added to all scrapers. See the Full Changelog section for a full list of attributes that have been added.
- `Credentials.py` has been deprecated in favor of `.env` to avoid hard-coding API credentials.
- Added more terminal eye candy - Halo has been implemented to spice up the output.
- User interface
  - Added Halo to spice up the output while maintaining minimalism.
- Source code
  - Created a comment `Forest` and accompanying `CommentNode`.
    - The `Forest` contains methods for inserting `CommentNode`s, including a depth-first search algorithm to do so.
  - `Subreddit.py` has been refactored and submission metadata has been added to scrape files: `"author"`, `"created_utc"`, `"distinguished"`, `"edited"`, `"id"`, `"is_original_content"`, `"is_self"`, `"link_flair_text"`, `"locked"`, `"name"`, `"num_comments"`, `"nsfw"`, `"permalink"`, `"score"`, `"selftext"`, `"spoiler"`, `"stickied"`, `"title"`, `"upvote_ratio"`, `"url"`
  - `Comments.py` has been refactored and submission comments now include the following metadata: `"author"`, `"body"`, `"body_html"`, `"created_utc"`, `"distinguished"`, `"edited"`, `"id"`, `"is_submitter"`, `"link_id"`, `"parent_id"`, `"score"`, `"stickied"`
  - Major refactor for `Redditor.py` on top of adding additional metadata.
    - Additional Redditor information has been added to scrape files: `"has_verified_email"`, `"icon_img"`, `"subreddit"`, `"trophies"`
    - Additional Redditor comment, submission, and multireddit metadata has been added to scrape files:
      - `subreddit` objects are nested within `comment` and `submission` objects and contain the following metadata: `"can_assign_link_flair"`, `"can_assign_user_flair"`, `"created_utc"`, `"description"`, `"description_html"`, `"display_name"`, `"id"`, `"name"`, `"nsfw"`, `"public_description"`, `"spoilers_enabled"`, `"subscribers"`, `"user_is_banned"`, `"user_is_moderator"`, `"user_is_subscriber"`
      - `comment` objects will contain the following metadata: `"type"`, `"body"`, `"body_html"`, `"created_utc"`, `"distinguished"`, `"edited"`, `"id"`, `"is_submitter"`, `"link_id"`, `"parent_id"`, `"score"`, `"stickied"`, `"submission"` - contains additional metadata, `"subreddit_id"`
      - `submission` objects will contain the following metadata: `"type"`, `"author"`, `"created_utc"`, `"distinguished"`, `"edited"`, `"id"`, `"is_original_content"`, `"is_self"`, `"link_flair_text"`, `"locked"`, `"name"`, `"num_comments"`, `"nsfw"`, `"permalink"`, `"score"`, `"selftext"`, `"spoiler"`, `"stickied"`, `"subreddit"` - contains additional metadata, `"title"`, `"upvote_ratio"`, `"url"`
      - `multireddit` objects will contain the following metadata: `"can_edit"`, `"copied_from"`, `"created_utc"`, `"description_html"`, `"description_md"`, `"display_name"`, `"name"`, `"nsfw"`, `"subreddits"`, `"visibility"`
    - `interactions` are now sorted in alphabetical order.
- CLI
  - Flags
    - `--raw` - Export comments in raw format instead (structured format is the default)
- Created a new `.env` file to store API credentials.
- `README`
  - Added new bullet point for The Forest Markdown file.
- Tests
  - Added a new test for the `Status` class in `Global.py`.
- Repository documents
  - Added "The Forest".
    - This Markdown file is just a place where I describe how I implemented the `Forest`.
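The comment `Forest` and `CommentNode` added above can be sketched roughly as follows. This is a minimal illustration only — the `seed` method, the search helper, and the attribute names are assumptions, not URS's actual API:

```python
class CommentNode:
    """One comment's metadata plus its direct replies."""

    def __init__(self, metadata):
        self.metadata = metadata
        self.replies = []


class Forest:
    """A forest of top-level comments; deeper replies nest inside each node."""

    def __init__(self, submission_id):
        self.submission_id = submission_id
        self.comments = []

    def _find_parent(self, node, parent_id):
        # Depth-first search for the node whose id matches parent_id.
        if node.metadata["id"] == parent_id:
            return node
        for reply in node.replies:
            found = self._find_parent(reply, parent_id)
            if found:
                return found
        return None

    def seed(self, comment_node):
        """Insert a CommentNode under its parent, at any depth."""
        parent_id = comment_node.metadata["parent_id"]
        if parent_id == self.submission_id:
            self.comments.append(comment_node)  # top-level comment
            return
        for root in self.comments:
            parent = self._find_parent(root, parent_id)
            if parent:
                parent.replies.append(comment_node)
                return
```

Because the search recurses through `replies`, comments of any depth end up nested under the correct parent, which is what enables the all-levels structured export.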
- User interface
  - Submission comments scraping parameters have changed due to the improvements made in this pull request.
    - Structured comments is now the default format.
      - Users will have to include the new `--raw` flag to export to raw format.
    - Both structured and raw formats can now scrape all comments from a submission.
- Source code
  - The submission comments JSON file's structure has been modified to fit the new `submission_metadata` dictionary. `"data"` is now a dictionary that contains the submission metadata dictionary and scraped comments list. Comments are now stored in the `"comments"` field within `"data"`.
  - Exporting Redditor or submission comments to CSV is now forbidden.
    - URS will ignore the `--csv` flag if it is present while trying to use either scraper.
  - The `created_utc` field for each Subreddit rule is now converted to readable time.
  - `requirements.txt` has been updated.
    - As of v1.20.0, `numpy` has dropped support for Python 3.6, which means Python 3.7+ is required for URS.
      - `.travis.yml` has been modified to exclude Python 3.6. Added Python 3.9 to the test configuration.
      - Note: Older versions of Python can still be used by downgrading to `numpy<=1.19.5`.
  - Reddit object validation block has been refactored.
    - A new reusable module has been defined at the bottom of `Validation.py`.
  - `Urs.py` no longer pulls API credentials from `Credentials.py` as it is now deprecated.
    - Credentials are now read from the `.env` file.
  - Minor refactoring within `Validation.py` to ensure an extra Halo line is not rendered on failed credential validation.
- `README`
  - Updated the Comments section to reflect new changes to the comments scraper UI.
- Repository documents
  - Updated `How to Get PRAW Credentials.md` to reflect new changes.
- Tests
  - Updated CLI usage and examples tests.
  - Updated the `c_fname()` test because submission comments scrapes now follow a different naming convention.
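The move from `Credentials.py` to a `.env` file boils down to reading key-value pairs from a file instead of hard-coding them. As a rough stdlib-only illustration — the key names below are hypothetical, not necessarily the ones URS expects — a minimal loader might look like:

```python
import os


def load_env(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ."""
    with open(path) as env_file:
        for line in env_file:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            # setdefault keeps any value already set in the real environment
            os.environ.setdefault(key.strip(), value.strip())


# Hypothetical usage with invented key names:
# load_env()
# client_id = os.environ["CLIENT_ID"]
```

In practice a library such as python-dotenv does the same job; the point of the change is simply that credentials live in an untracked file rather than in source code.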
- User interface
  - Specifying `0` comments no longer only exports all comments to raw format. It now defaults to structured format.
- Source code
  - Deprecated many global variables defined in `Global.py`: `eo`, `options`, `s_t`, `analytical_tools`
  - `Credentials.py` has been replaced with the `.env` file.
  - The `LogError.log_login` decorator has been deprecated due to the refactor within `Validation.py`.
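Putting the comments-file changes above together — `"data"` as a dictionary holding the `submission_metadata` dictionary and a `"comments"` list, with replies nested at every level — the exported JSON looks roughly like this. All field values are invented for illustration, and the metadata key set is abbreviated:

```python
# A sketch of the structured comments export, built as a Python dict.
comments_file = {
    "data": {
        "submission_metadata": {
            # Abbreviated; real files contain every submission field listed above.
            "author": "someuser",
            "title": "An example submission",
            "num_comments": 2,
        },
        "comments": [
            {
                "author": "replier_one",
                "body": "A top-level comment",
                "replies": [
                    {
                        "author": "replier_two",
                        "body": "A nested reply",
                        "replies": [],  # nesting continues for all levels
                    }
                ],
            }
        ],
    }
}
```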
- Added analytical tools
- Significantly improved JSON structure
- Deprecated `--json`
- Added many new flags
- Improved error handling and logging
- Bug fixes
- New method comment style (Numpy/Scipy docstrings)
- Code refactor
- User interface
  - Analytical tools
    - Word frequencies generator.
    - Wordcloud generator.
- Source code
  - CLI
    - Flags
      - `-e` - Display additional example usage.
      - `--check` - Runs a quick check for PRAW credentials and displays the rate limit table after validation.
      - `--rules` - Include the Subreddit's rules in the scrape data (for JSON only). This data is included in the `subreddit_rules` field.
      - `-f` - Word frequencies generator.
      - `-wc` - Wordcloud generator.
      - `--nosave` - Only display the wordcloud; do not save to file.
    - Added additional verbose feedback if invalid arguments are given.
  - Log decorators
    - Added new decorator to log individual argument errors.
    - Added new decorator to log when no Reddit objects are left to scrape after failing validation check.
    - Added new decorator to log when an invalid file is passed into the analytical tools.
    - Added new decorator to log when the `scrapes` directory is missing, which would cause the new `make_analytics_directory()` method in `DirInit.py` to fail.
      - This decorator is also defined in the same file to avoid a circular import error.
  - ASCII art
    - Added new art for the word frequencies and wordcloud generators.
    - Added new error art displayed when a problem arises while exporting data.
    - Added new error art displayed when Reddit object validation is completed and there are no objects left to scrape.
    - Added new error art displayed when an invalid file is passed into the analytical tools.
- `README`
  - Added new Contact section and moved contact badges into it.
    - Apparently it was not obvious enough in previous versions, since users did not send emails to the address specifically created for URS-related inquiries.
  - Added new sections for the analytical tools.
  - Updated demo GIFs.
    - Moved all GIFs to a separate branch to avoid unnecessary clones.
    - Hosting static images on Imgur.
- Tests
  - Added additional tests for analytical tools.
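At its core, a word frequencies generator like the one added here can be sketched in a few lines. This is a simplified illustration, not URS's implementation — the tokenizing regex is an assumption:

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count case-insensitive word occurrences, most common first."""
    words = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(words).most_common())
```

A wordcloud generator typically consumes exactly this kind of `word: frequency` mapping, which is why the two tools pair naturally.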
- User interface
  - JSON is now the default export option. The `--csv` flag is required to export to CSV instead.
  - Improved JSON structure.
    - PRAW scraping export structure:
      - Scrape details are now included at the top of each exported file in the `scrape_details` field.
        - Subreddit scrapes - Includes `subreddit`, `category`, `n_results_or_keywords`, and `time_filter`.
        - Redditor scrapes - Includes `redditor` and `n_results`.
        - Submission comments scrapes - Includes `submission_title`, `n_results`, and `submission_url`.
      - Scrape data is now stored in the `data` field.
        - Subreddit scrapes - `data` is a list containing submission objects.
        - Redditor scrapes - `data` is an object containing additional nested dictionaries:
          - `information` - a dictionary denoting Redditor metadata.
          - `interactions` - a dictionary denoting Redditor interactions (submissions and/or comments). Each interaction follows the Subreddit scrapes structure.
        - Submission comments scrapes - `data` is a list containing additional nested dictionaries.
          - Raw comments contain dictionaries of `comment_id: SUBMISSION_METADATA`.
          - Structured comments follow the structure seen in raw comments, but include an extra `replies` field in the submission metadata, holding a list of additional nested dictionaries of `comment_id: SUBMISSION_METADATA`. This pattern repeats down to third-level replies.
    - Word frequencies export structure:
      - The original scrape data filepath is included in the `raw_file` field.
      - `data` is a dictionary containing `word: frequency`.
  - Log:
    - `scrapes.log` is now named `urs.log`.
    - Validation of Reddit objects is now included - invalid Reddit objects will be logged as a warning.
    - Rate limit information is now included in the log.
- Source code
  - Moved PRAW scrapers into their own package.
  - Scrape settings for the basic Subreddit scraper are now cleaned within `Basic.py`, further streamlining conditionals in `Subreddit.py` and `Export.py`.
  - Returning the final scrape settings dictionary from all scrapers after execution for logging purposes, further streamlining the `LogPRAWScraper` class in `Logger.py`.
  - Passing the submission URL instead of the exception into the `not_found` list for submission comments scraping.
    - This is a part of a bug fix that is listed in the Fixed section.
  - ASCII art:
    - Modified the args error art to display specific feedback when invalid arguments are passed.
  - Upgraded from relative to absolute imports.
  - Replaced old header comments with a docstring comment block.
  - Upgraded method comments to Numpy/Scipy docstring format.
- `README`
  - Moved Releases section into its own document.
  - Deleted all media from master branch.
- Tests
  - Updated absolute imports to match new directory structure.
  - Updated a few tests to match new changes made in the source code.
- Community documents
  - Updated `PULL_REQUEST_TEMPLATE`:
    - Updated section for listing changes that have been made to match new Releases syntax.
    - Wrapped New Dependencies in a code block.
  - Updated `STYLE_GUIDE`:
    - Created new rules for method comments.
  - Added `Releases`:
    - Moved Releases section from main `README` to a separate document.
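The export structure described above, for a Subreddit scrape and a word frequencies run, looks roughly like this. All values are invented for illustration, and the submission objects are abbreviated:

```python
# Sketch of the PRAW scraping export structure for a Subreddit scrape.
subreddit_scrape = {
    "scrape_details": {
        "subreddit": "askreddit",
        "category": "top",
        "n_results_or_keywords": "10",
        "time_filter": "all",
    },
    "data": [
        # One object per scraped submission; fields abbreviated here.
        {"title": "An example submission", "score": 1234},
    ],
}

# Sketch of the word frequencies export structure (path is invented).
word_frequencies_export = {
    "raw_file": "scrapes/06-01-2021/subreddits/example.json",
    "data": {"example": 3, "word": 2, "frequency": 1},
}
```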
- Source code
  - PRAW scraper settings
    - Bug: Invalid Reddit objects (Subreddits, Redditors, or submissions) and their respective scrape settings would be added to the scrape settings dictionary even after failing validation.
    - Behavior: URS would try to scrape invalid Reddit objects, then throw an error mid-scrape because it is unable to pull data via PRAW.
    - Fix: Returning the invalid objects list from each scraper into `GetPRAWScrapeSettings.get_settings()` to circumvent this issue.
  - Basic Subreddit scraper
    - Bug: The time filter `all` would be applied to categories that do not support time filter use, resulting in errors while scraping.
    - Behavior: URS would throw an error when trying to export the file, resulting in a failed run.
    - Fix: Added a conditional to check if the category allows for a time filter, applying either the `all` time filter or `None` accordingly.
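The time filter fix can be sketched like this — a hypothetical helper, with the set of filterable categories assumed from the Subreddit categories mentioned elsewhere on this page (Controversial, Top, Search):

```python
# Categories that support Reddit's time filter (assumed set for illustration).
TIME_FILTER_CATEGORIES = {"controversial", "top", "search"}


def resolve_time_filter(category, requested="all"):
    """Apply a time filter only when the category supports one."""
    if category.lower() in TIME_FILTER_CATEGORIES:
        return requested
    return None
```

Categories such as Hot or New then receive `None` instead of an unsupported `all` filter, avoiding the failed run described above.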
- User interface
  - Removed the `--json` flag since it is now the default export option.
- Added more directories within the `date` directory
- User interface
  - Scrapes will now be exported to sub-folders within the date directory.
    - `comments`, `redditors`, and `subreddits` directories are now created for you when you run each scraper. Scrape results will now be stored within these directories.
- `README`
  - Added new Derivative Projects section.
- Source code
  - Minor code reformatting and refactoring.
  - The forbidden access message that may appear when running the Redditor scraper is now yellow to avoid confusion.
- Updated `README` and `STYLE_GUIDE`.
  - Uploaded new demo GIFs.
  - Made a minor change to the PRAW credentials guide.
- Added time filters for Subreddit categories
- User interface
  - Added time filters for Subreddit categories (Controversial, Top, Search).
- Source code
  - Changed how arguments are processed in the CLI.
  - Performed DRY code review.
- `README`
  - Updated `README` to reflect new changes.
- Community documents
  - Updated `STYLE_GUIDE`.
    - Made minor formatting changes to scripts to reflect new rules.
- Major code refactor
- Introduced logging capabilities
- Introduced the `scrapes` directory
- Aesthetic changes
  - New ASCII art
  - Added color to terminal output
- User interface
  - Scrapes will now be exported to the `scrapes/` directory within a subdirectory corresponding to the date of the scrape. These directories are automatically created for you when you run URS.
- Source code
  - Major code refactor. Applied OOP concepts to existing code and rewrote methods in an attempt to improve readability, maintenance, and scalability.
  - Added log decorators that record what is happening during each scrape, which scrapes were run, and any errors that might arise during runtime in the log file `scrapes.log`. The log is stored in the same subdirectory corresponding to the date of the scrape.
  - Added color to terminal output.
  - Integrated Travis CI and Codecov.
- Source code
  - Replaced bulky titles with minimalist titles for a cleaner look.
  - Improved naming convention for scripts.
- Community documents
  - Updated the following documents:
    - `BUG_REPORT`
    - `CONTRIBUTING`
    - `FEATURE_REQUEST`
    - `PULL_REQUEST_TEMPLATE`
    - `STYLE_GUIDE`
- `README`
  - Numerous changes; the most significant change was splitting and storing walkthroughs in `docs/`.