The Text Analysis Toolkit is a Python-based tool for analyzing text extracted from articles. It performs sentiment analysis, calculates readability metrics, and extracts key linguistic features from each article.
- Data Extraction: Extracts article text from the provided URLs and saves each article to its own text file.
- Sentiment Analysis: Determines the sentiment of articles (positive, negative, or neutral) and calculates sentiment scores.
- Readability Metrics: Computes readability metrics such as average sentence length, percentage of complex words, and Fog Index.
- Output Data: Prepares an output CSV file containing calculated metrics for further analysis.
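
The readability metrics listed above can be sketched roughly as follows. This is a minimal illustration, not the toolkit's actual code: it assumes the standard Gunning Fog formula (0.4 × (average sentence length + percentage of complex words)), treats a word as complex when it has more than two syllables, and uses a crude vowel-group syllable count. The function names `count_syllables` and `readability_metrics` are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Assumption: treat a trailing "es" or "ed" as silent (a common simplification).
    if word.endswith(("es", "ed")) and count > 1:
        count -= 1
    return max(count, 1)


def readability_metrics(text: str) -> dict:
    """Return average sentence length, percentage of complex words, and Fog Index."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "pct_complex_words": 0.0, "fog_index": 0.0}

    # Assumption: a "complex" word has more than two syllables.
    complex_count = sum(1 for w in words if count_syllables(w) > 2)
    avg_sentence_length = len(words) / len(sentences)
    pct_complex_words = 100 * complex_count / len(words)
    fog_index = 0.4 * (avg_sentence_length + pct_complex_words)
    return {
        "avg_sentence_length": avg_sentence_length,
        "pct_complex_words": pct_complex_words,
        "fog_index": fog_index,
    }


if __name__ == "__main__":
    print(readability_metrics("The cat sat. The enormous elephant ran."))
```

The regex-based sentence and word splitting keeps the sketch dependency-free; a tokenizer from NLTK could be swapped in for better handling of abbreviations and punctuation.
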
- Setup: Install required Python packages using `pip install -r requirements.txt`.
- Data Extraction: Provide URLs of articles in an input Excel file (`Input.xlsx`) and run `data_extraction.py` to extract text.
- Sentiment Analysis: Run `sentiment_analysis.py` to perform sentiment analysis on the extracted text.
- Readability Metrics: Run `readability_metrics.py` to calculate readability metrics.
- Output: The calculated metrics are saved to `Output_Data.csv` for further analysis.
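
As a rough illustration of the final output step, the sketch below writes one row of metrics per article with Python's standard `csv` module. The column names in `FIELDNAMES` and the helper `write_metrics` are hypothetical; the actual layout of the output file is not specified here.

```python
import csv
import io

# Hypothetical column names; the real output layout may differ.
FIELDNAMES = ["url", "positive_score", "negative_score", "fog_index"]


def write_metrics(rows, handle):
    """Write a header plus one CSV row per article to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


if __name__ == "__main__":
    # Demo writes to an in-memory buffer instead of touching the file system.
    buffer = io.StringIO()
    write_metrics(
        [{"url": "https://example.com/a", "positive_score": 3,
          "negative_score": 1, "fog_index": 12.5}],
        buffer,
    )
    print(buffer.getvalue())
```

To write a real file, the same function could be called with a handle from `open("Output_Data.csv", "w", newline="", encoding="utf-8")`; passing `newline=""` is what the `csv` module documentation recommends to avoid blank lines on some platforms.
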
- Articles: Contains extracted text files from articles.
- StopWords: Includes stop words lists for filtering out common words.
- MasterDictionary: Contains dictionaries of positive and negative words.
- Input.xlsx: Input file with URLs of articles.
- Output_Data.csv: Output file with calculated metrics.
- data_extraction.py: Script for extracting text from URLs.
- sentiment_analysis.py: Script for performing sentiment analysis.
- requirements.txt: List of required Python packages.
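
To show how the stop word lists and the positive/negative dictionaries might fit together, here is a minimal dictionary-based scoring sketch. It is an assumption about the approach, not the contents of `sentiment_analysis.py`: the helper names `load_words` and `sentiment_scores` are hypothetical, and the polarity formula `(pos - neg) / (pos + neg + 1e-6)` is one common choice rather than a confirmed detail of this toolkit.

```python
import re


def load_words(lines):
    """Build a lowercase word set from an iterable of lines, skipping blanks."""
    return {line.strip().lower() for line in lines if line.strip()}


def sentiment_scores(text, positive, negative, stop_words):
    """Count positive and negative words after removing stop words."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stop_words]
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    # Assumed formula: polarity lies in [-1, 1]; the small constant avoids division by zero.
    polarity = (pos - neg) / (pos + neg + 1e-6)
    if polarity > 0:
        label = "positive"
    elif polarity < 0:
        label = "negative"
    else:
        label = "neutral"
    return {"positive_score": pos, "negative_score": neg,
            "polarity": polarity, "label": label}


if __name__ == "__main__":
    # In-memory word lists stand in for the files in StopWords and MasterDictionary.
    positive = load_words(["good\n", "great\n"])
    negative = load_words(["bad\n"])
    stop_words = load_words(["the\n", "is\n", "and\n"])
    print(sentiment_scores("The food is good and great, the service is bad.",
                           positive, negative, stop_words))
```

Because `load_words` accepts any iterable of lines, it can be given an open file handle for each word list as well as the in-memory lists used in the demo.
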
- Python
- BeautifulSoup
- NLTK