Features
This page presents the detailed features of the APRIL project. Below, you will find a description of each functionality of the product and the corresponding features.

Data Scraping
Description:
This process involves automatically extracting relevant data from web pages and PDF files to gather information for analysis. The scraping focuses on retrieving content and metadata, which are then stored in a MongoDB database for efficient management.
Key Features:
- Metadata Scraping from URLs (HTML): Extracts metadata such as title, description, author, and published date from HTML pages.
  - Implementation: The meta_scraping function collects metadata using the BeautifulSoup library.
  - Code Reference: functions/scraping.py
- PDF and HTML Content Extraction: Retrieves text content from both PDF files and HTML pages for comprehensive data coverage.
  - Implementation: The pdf_to_text and pdf_meta_scraping functions handle PDF content extraction.
  - Code Reference: functions/scraping.py
- Keyword-Based Web Scraping: Automates searches and data extraction based on user-defined keywords to target relevant content.
  - Implementation: The scrape_webpages_to_db function uses Google search results and filters content based on keywords.
  - Code Reference: functions/scraping.py
- Error Logging: Implements robust logging mechanisms to capture successful operations and handle errors effectively.
  - Implementation: Custom loggers record errors in data/url_errors.log and general logs in pipeline.log.
  - Code Reference: functions/scraping.py
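The metadata-scraping step can be sketched as follows. This is a minimal illustration using BeautifulSoup, assuming the page exposes standard `<meta>` tags; the function and field names are illustrative, not the project's actual meta_scraping implementation.

```python
from bs4 import BeautifulSoup

def extract_metadata(html: str) -> dict:
    """Pull common metadata fields from an HTML page (sketch of a
    meta_scraping-style function; field names are illustrative)."""
    soup = BeautifulSoup(html, "html.parser")

    def meta(name: str):
        tag = soup.find("meta", attrs={"name": name})
        return tag["content"] if tag and tag.has_attr("content") else None

    return {
        "title": soup.title.string if soup.title else None,
        "description": meta("description"),
        "author": meta("author"),
        "published_date": meta("date"),
    }

html = """<html><head><title>Coastal Report</title>
<meta name="description" content="Erosion study">
<meta name="author" content="APRIL"></head><body></body></html>"""
print(extract_metadata(html)["title"])  # Coastal Report
```

Fields that are absent from the page simply come back as `None`, which keeps downstream storage uniform.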
Libraries Used:
- BeautifulSoup: For parsing HTML content.
- Requests: For handling HTTP requests.
- PyMuPDF (fitz): For extracting text and metadata from PDFs.
- MongoDB: For structured data storage.
Benefits:
- Automated Data Collection: Reduces manual effort in gathering data from diverse web sources.
- Scalable and Efficient: Handles large-scale data scraping and storage.
- Error Tracking: Ensures traceability and debugging through detailed logging.
Data Cleaning
Description:
This stage focuses on refining the scraped data by eliminating irrelevant, duplicate, or low-quality content, ensuring high data integrity for analysis.
Key Features:
- Duplicate Removal: Prevents redundant data storage by checking URLs and content before insertion into the database.
  - Implementation: The scrape_webpages_to_db function verifies if URLs already exist in the MongoDB collection before processing.
  - Code Reference: functions/scraping.py
- Keyword Filtering: Filters content based on user-defined keywords to ensure relevance.
  - Implementation: The contains_keywords function checks if key terms are present in the content.
  - Code Reference: functions/scraping.py
- Error Handling and Logging: Captures failed scraping attempts and logs errors for future analysis.
  - Implementation: Errors are logged through a dedicated error logger (errorLogger).
  - Code Reference: functions/scraping.py
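The keyword filter and duplicate check can be sketched as below. This is a minimal stand-in for illustration only: the real contains_keywords and URL lookup live in functions/scraping.py, and a Python set stands in for the MongoDB collection the project actually queries.

```python
def contains_keywords(text: str, keywords: list[str]) -> bool:
    """Return True if any keyword appears in the text (case-insensitive).
    Sketch of a contains_keywords-style filter; the matching rule is assumed."""
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)

def should_insert(url: str, text: str, keywords: list[str], seen_urls: set[str]) -> bool:
    """Skip URLs already stored (the project checks MongoDB; a set stands in
    here) and drop pages that match none of the user-defined keywords."""
    return url not in seen_urls and contains_keywords(text, keywords)

seen = {"https://example.org/old"}
print(should_insert("https://example.org/new", "Coastal erosion report", ["erosion"], seen))  # True
```

Checking the URL before fetching and parsing avoids re-processing pages the pipeline has already stored.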
Libraries Used:
- Pandas: For data manipulation and filtering.
- Logging: For tracking operations and errors.
Benefits:
- Enhanced Data Quality: Filters and cleans data to ensure relevance and consistency.
- Efficient Data Management: Prevents duplication and minimizes irrelevant data storage.
- Robust Error Management: Detailed logging facilitates debugging and process optimization.
Data Enrichment (NLP)
Description:
Enhances documents through advanced Natural Language Processing (NLP) techniques, improving their structure and content. This allows for richer insights and prepares data for deeper analysis.
Key Features:
- Keyword Extraction: Identifies important terms or keywords from documents, enabling a more focused analysis of the content. This is achieved by extracting relevant words from predefined vocabularies (research and analysis-specific keywords) using a custom word-tracking function.
  - Implementation: The word_tracking function counts occurrences of target words (from a CSV file) in the text.
  - Code Reference: word_tracking function
- Named Entity Recognition (NER): Detects named entities such as organizations, locations, and people from the text. This is crucial for identifying important concepts within documents.
  - Implementation: Using the spaCy library, the ner function extracts and categorizes entities from the processed text.
  - Code Reference: ner function
- Pertinence Scoring: Measures the relevance of a document based on how closely the content aligns with user-defined search and analysis keywords. This score helps in filtering documents based on their importance to a specific research question.
  - Implementation: The pertinence_sementic function computes the relevance score by calculating the semantic proximity between the document's content and the search/analysis terms using cosine similarity.
  - Code Reference: pertinence_sementic function
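The keyword-extraction step can be sketched as a simple token counter. This is an illustrative stand-in for a word_tracking-style function, not the project's implementation; in the pipeline the target words would be loaded from the CSV vocabulary file rather than passed as a list.

```python
import re
from collections import Counter

def word_tracking(text: str, target_words: list[str]) -> dict[str, int]:
    """Count occurrences of each target word in the text (sketch of a
    word_tracking-style counter; tokenization rules are assumed)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word.lower()] for word in target_words}

text = "Erosion threatens the coast; coastal erosion accelerates."
print(word_tracking(text, ["erosion", "coast", "storm"]))
# {'erosion': 2, 'coast': 1, 'storm': 0}
```

Counting whole tokens rather than substrings keeps "coast" from also matching "coastal", which matters when vocabularies contain overlapping terms.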
Libraries & Model Used:
- spaCy (model fr_core_news_lg): A powerful NLP library for text processing, used for named entity recognition (NER) and text cleaning (e.g., removing stop words and punctuation).
  - Reference: spaCy Documentation
- Sentence Transformers (SBERT): A library for generating embeddings (vector representations) of sentences or words, which are used for semantic similarity calculation. The lightweight all-MiniLM-L6-v2 model is employed for relevance scoring.
  - Reference: Sentence Transformers Documentation
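The pertinence score ultimately reduces to cosine similarity between embedding vectors. A minimal pure-Python version of that calculation might look like the sketch below; in the pipeline the vectors would come from the all-MiniLM-L6-v2 SBERT model, while the hard-coded vectors here are purely hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors:
    dot(a, b) / (|a| * |b|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc_vec = [0.2, 0.8, 0.1]    # hypothetical document embedding
query_vec = [0.2, 0.8, 0.1]  # hypothetical keyword embedding
print(round(cosine_similarity(doc_vec, query_vec), 3))  # 1.0
```

Identical vectors score 1.0 and orthogonal ones score 0.0, which is why the score works as a relevance threshold for filtering documents.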
Benefits:
- Improved Document Context: NLP techniques provide deeper meaning to raw data by identifying important keywords and entities, which allows for a better understanding of document content.
- Advanced Search and Filtering: Pertinence scoring helps prioritize documents based on how relevant they are to the user's research goals, enabling more focused analysis.
- Scalable Enrichment: The enrichment process can be easily scaled to handle large volumes of documents by automating keyword extraction, NER, and relevance scoring.
Example Usage:
- Input: Cleaned documents (free from ads, duplicates, and irrelevant content).
- Output: Enriched files with metadata, such as:
  - Keywords identified from the research and analysis vocabularies.
  - Named Entities (e.g., organizations, locations).
  - Pertinence Scores that measure the relevance of the document to the user-defined keywords.
User Interface
Description:
A user-friendly interface for searching, sorting, and exploring datasets.
Key Features:
- Keyword-based search and dynamic filtering.
- Sorting by relevance, date, location, or title.
- Interactive visualizations (charts and tables).
- Semantic Scoring: Measures document relevance based on the keyword used to search the corpus.
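The search-and-sort behaviour can be sketched as below; this is a minimal illustration of the logic, and the document fields (`title`, `content`, `score`) are assumed names, not the project's actual schema.

```python
def search_and_sort(docs, keyword, sort_key="score"):
    """Filter documents whose title or content mentions the keyword, then
    sort by the chosen field, highest first (field names are illustrative)."""
    kw = keyword.lower()
    hits = [d for d in docs if kw in d["title"].lower() or kw in d["content"].lower()]
    return sorted(hits, key=lambda d: d[sort_key], reverse=True)

docs = [
    {"title": "Coastal vulnerability atlas", "content": "...", "score": 0.9},
    {"title": "Inland flooding",             "content": "...", "score": 0.7},
    {"title": "Erosion survey",              "content": "coastal cliffs", "score": 0.4},
]
for d in search_and_sort(docs, "coastal"):
    print(d["title"])
# Coastal vulnerability atlas
# Erosion survey
```

Swapping `sort_key` for a date, location, or title field gives the other sort orders listed above.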
Benefits:
- Simplifies navigation through large datasets.
- Offers an intuitive platform for data exploration.
Example Usage:
- Input: Search for "coastal vulnerability."
- Output: Sorted and filtered results with interactive displays.
Data Storage
Description:
Scalable storage of processed data using MongoDB for efficient retrieval.
Key Features:
- Indexed storage for quick searches.
- Categorization by location, keyword, and relevance.
- Backup and archival capabilities.
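The index and query shapes such a MongoDB collection might use can be sketched as plain dictionaries. The collection layout and field names below are assumptions for illustration; with pymongo, the index specs would be passed to `create_index` and the query documents to `find`.

```python
# Index specifications (with pymongo: collection.create_index(spec)).
text_index = [("title", "text"), ("content", "text")]  # full-text search
lookup_index = [("url", 1)]                            # fast duplicate checks

def build_query(location=None, min_score=None):
    """Combine optional filters into a single MongoDB query document
    (field names here are assumed, not the project's actual schema)."""
    query = {}
    if location is not None:
        query["location"] = location
    if min_score is not None:
        query["pertinence_score"] = {"$gte": min_score}
    return query

print(build_query(location="Brittany", min_score=0.5))
```

Building the filter incrementally lets the UI combine location, keyword, and relevance criteria into one query document.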
Benefits:
- Ensures long-term data accessibility.
- Provides scalable and organized data management.
Example Usage:
- Input: Enriched documents.
- Output: Structured, searchable datasets stored in the database.
Failed URL Logging
Description:
In addition to automated extraction, the scraper now logs URLs where data extraction fails.
Key Features:
- Failed Scraping Log: Automatically logs URLs when a scraping attempt does not succeed due to issues like network errors, invalid content format, or other scraping challenges.
- Reason for Failure: Logs the reason for failure (e.g., 404 error, invalid content structure).
- Visualization of Failed URLs: Failed URLs are displayed in the UI for user review and potential manual follow-up.
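A failure logger along these lines can be built with the standard logging module. This is a sketch only: the project writes to data/url_errors.log via its errorLogger, while this stand-in logs to the console, and the entry format is an assumption.

```python
import logging

# Dedicated error logger for failed scrapes (sketch; the project's
# errorLogger writes to data/url_errors.log instead of the console).
error_logger = logging.getLogger("errorLogger")
error_logger.setLevel(logging.ERROR)
if not error_logger.handlers:
    error_logger.addHandler(logging.StreamHandler())

def log_failed_url(url: str, reason: str) -> str:
    """Record a failed scraping attempt with its reason, and return the
    formatted entry so a UI layer could display it for manual follow-up."""
    entry = f"FAILED {url} reason={reason}"
    error_logger.error(entry)
    return entry

log_failed_url("https://example.org/report.pdf", "404 Not Found")
```

Returning the formatted entry as well as logging it is what makes the UI listing of failed URLs straightforward.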
Benefits:
- Improved Scraping Transparency: The feature provides users with insight into the scraping process by listing URLs that did not work, making troubleshooting easier.
- Easy Recovery: Users can manually investigate or retry failed URLs through the UI, improving the robustness of the scraping pipeline.
- Comprehensive Data Collection: Ensures no data source is missed by highlighting scraping failures.
Example Usage:
- Input: Keyword "coastal erosion" with 100 URLs for scraping.
- Output: Relevant documents from successful scraping and a list of failed URLs with reasons.
Home | Contributors | Report an Issue | Licence
© 2024 APRIL. | Version 1.0 | Last updated on 2025-01-14