Features
This page presents the detailed features of the APRIL project. Below, you will find a description of each functionality of the product and the corresponding features.

Data Scraping
Description:
This process involves automatically extracting relevant data from web pages and PDF files to gather information for analysis. The scraping focuses on retrieving content and metadata, which are then stored in a MongoDB database for efficient management.
Key Features:
- Metadata Scraping from URLs (HTML): Extracts metadata such as title, description, author, and published date from HTML pages.
  - Implementation: The meta_scraping function collects metadata using the BeautifulSoup library.
  - Code Reference: functions/scraping.py
- PDF and HTML Content Extraction: Retrieves text content from both PDF files and HTML pages for comprehensive data coverage.
  - Implementation: The pdf_to_text and pdf_meta_scraping functions handle PDF content extraction.
  - Code Reference: functions/scraping.py
- Keyword-Based Web Scraping: Automates searches and data extraction based on user-defined keywords to target relevant content.
  - Implementation: The scrape_webpages_to_db function uses Google search results and filters content based on keywords.
  - Code Reference: functions/scraping.py
- Error Logging: Implements robust logging mechanisms to capture successful operations and handle errors effectively.
  - Implementation: Custom loggers record errors in data/url_errors.log and general logs in pipeline.log.
  - Code Reference: functions/scraping.py
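The metadata-scraping step can be sketched as follows. This is a minimal illustration using BeautifulSoup, assuming the page exposes standard `<meta>` tags; the function and field names are illustrative, not the project's actual meta_scraping implementation.

```python
from bs4 import BeautifulSoup

def extract_metadata(html: str) -> dict:
    """Pull common metadata fields from an HTML page (sketch of a
    meta_scraping-style function; field names are illustrative)."""
    soup = BeautifulSoup(html, "html.parser")

    def meta(name: str):
        tag = soup.find("meta", attrs={"name": name})
        return tag["content"] if tag and tag.has_attr("content") else None

    return {
        "title": soup.title.string if soup.title else None,
        "description": meta("description"),
        "author": meta("author"),
        "published_date": meta("date"),
    }

html = """<html><head><title>Coastal Report</title>
<meta name="description" content="Erosion study">
<meta name="author" content="APRIL"></head><body></body></html>"""
print(extract_metadata(html)["title"])  # Coastal Report
```

Fields that are absent from the page simply come back as `None`, which keeps downstream storage uniform.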
Libraries Used:
- BeautifulSoup: For parsing HTML content.
- Requests: For handling HTTP requests.
- PyMuPDF (fitz): For extracting text and metadata from PDFs.
- MongoDB: For structured data storage.
Benefits:
- Automated Data Collection: Reduces manual effort in gathering data from diverse web sources.
- Scalable and Efficient: Handles large-scale data scraping and storage.
- Error Tracking: Ensures traceability and debugging through detailed logging.
Data Cleaning
Description:
This stage focuses on refining the scraped data by eliminating irrelevant, duplicate, or low-quality content, ensuring high data integrity for analysis.
Key Features:
- Duplicate Removal: Prevents redundant data storage by checking URLs and content before insertion into the database.
  - Implementation: The scrape_webpages_to_db function verifies if URLs already exist in the MongoDB collection before processing.
  - Code Reference: functions/scraping.py
- Keyword Filtering: Filters content based on user-defined keywords to ensure relevance.
  - Implementation: The contains_keywords function checks if key terms are present in the content.
  - Code Reference: functions/scraping.py
- Error Handling and Logging: Captures failed scraping attempts and logs errors for future analysis.
  - Implementation: Errors are logged through a dedicated error logger (errorLogger).
  - Code Reference: functions/scraping.py
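The keyword filter and duplicate check can be sketched as below. This is a minimal stand-in for illustration only: the real contains_keywords and URL lookup live in functions/scraping.py, and a Python set stands in for the MongoDB collection the project actually queries.

```python
def contains_keywords(text: str, keywords: list[str]) -> bool:
    """Return True if any keyword appears in the text (case-insensitive).
    Sketch of a contains_keywords-style filter; the matching rule is assumed."""
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)

def should_insert(url: str, text: str, keywords: list[str], seen_urls: set[str]) -> bool:
    """Skip URLs already stored (the project checks MongoDB; a set stands in
    here) and drop pages that match none of the user-defined keywords."""
    return url not in seen_urls and contains_keywords(text, keywords)

seen = {"https://example.org/old"}
print(should_insert("https://example.org/new", "Coastal erosion report", ["erosion"], seen))  # True
```

Checking the URL before fetching and parsing avoids re-processing pages the pipeline has already stored.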
Libraries Used:
- Pandas: For data manipulation and filtering.
- Logging: For tracking operations and errors.
Benefits:
- Enhanced Data Quality: Filters and cleans data to ensure relevance and consistency.
- Efficient Data Management: Prevents duplication and minimizes irrelevant data storage.
- Robust Error Management: Detailed logging facilitates debugging and process optimization.
Data Enrichment (NLP)
Description:
Enhances documents through advanced Natural Language Processing (NLP) techniques, improving their structure and content. This allows for richer insights and prepares data for deeper analysis.
Key Features:
- Keyword Extraction: Identifies important terms or keywords from documents, enabling a more focused analysis of the content. This is achieved by extracting relevant words from predefined vocabularies (research and analysis-specific keywords) using a custom word-tracking function.
  - Implementation: The word_tracking function counts occurrences of target words (from a CSV file) in the text.
  - Code Reference: word_tracking function
- Named Entity Recognition (NER): Detects named entities such as organizations, locations, and people from the text. This is crucial for identifying important concepts within documents.
  - Implementation: Using the spaCy library, the ner function extracts and categorizes entities from the processed text.
  - Code Reference: ner function
- Pertinence Scoring: Measures the relevance of a document based on how closely the content aligns with user-defined search and analysis keywords. This score helps in filtering documents based on their importance to a specific research question.
  - Implementation: The pertinence_sementic function computes the relevance score by calculating the semantic proximity between the document's content and the search/analysis terms using cosine similarity.
  - Code Reference: pertinence_sementic function
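The keyword-extraction step can be sketched as a simple token counter. This is an illustrative stand-in for a word_tracking-style function, not the project's implementation; in the pipeline the target words would be loaded from the CSV vocabulary file rather than passed as a list.

```python
import re
from collections import Counter

def word_tracking(text: str, target_words: list[str]) -> dict[str, int]:
    """Count occurrences of each target word in the text (sketch of a
    word_tracking-style counter; tokenization rules are assumed)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word.lower()] for word in target_words}

text = "Erosion threatens the coast; coastal erosion accelerates."
print(word_tracking(text, ["erosion", "coast", "storm"]))
# {'erosion': 2, 'coast': 1, 'storm': 0}
```

Counting whole tokens rather than substrings keeps "coast" from also matching "coastal", which matters when vocabularies contain overlapping terms.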
Libraries & Model Used:
- spaCy (model fr_core_news_lg): A powerful NLP library for text processing, used for named entity recognition (NER) and text cleaning (e.g., removing stop words and punctuation).
  - Reference: spaCy Documentation
- Sentence Transformers (SBERT): A library for generating embeddings (vector representations) of sentences or words, which are used for semantic similarity calculation. The lightweight all-MiniLM-L6-v2 model is employed for relevance scoring.
  - Reference: Sentence Transformers Documentation
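The pertinence score ultimately reduces to cosine similarity between embedding vectors. A minimal pure-Python version of that calculation might look like the sketch below; in the pipeline the vectors would come from the all-MiniLM-L6-v2 SBERT model, while the hard-coded vectors here are purely hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors:
    dot(a, b) / (|a| * |b|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc_vec = [0.2, 0.8, 0.1]    # hypothetical document embedding
query_vec = [0.2, 0.8, 0.1]  # hypothetical keyword embedding
print(round(cosine_similarity(doc_vec, query_vec), 3))  # 1.0
```

Identical vectors score 1.0 and orthogonal ones score 0.0, which is why the score works as a relevance threshold for filtering documents.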
Benefits:
- Improved Document Context: NLP techniques provide deeper meaning to raw data by identifying important keywords and entities, which allows for a better understanding of document content.
- Advanced Search and Filtering: Pertinence scoring helps prioritize documents based on how relevant they are to the user's research goals, enabling more focused analysis.
- Scalable Enrichment: The enrichment process can be easily scaled to handle large volumes of documents by automating keyword extraction, NER, and relevance scoring.
Example Usage:
- Input: Cleaned documents (free from ads, duplicates, and irrelevant content).
- Output: Enriched files with metadata, such as:
  - Keywords identified from the research and analysis vocabularies.
  - Named Entities (e.g., organizations, locations).
  - Pertinence Scores that measure the relevance of the document to the user-defined keywords.
User Interface
Description:
A user-friendly interface for searching, sorting, and exploring datasets.
Key Features:
- Keyword-based search and dynamic filtering.
- Sorting by relevance, date, location, or title.
- Interactive visualizations (charts and tables).
- Semantic Scoring: Measures document relevance based on the keyword used to search the corpus.
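The search-and-sort behaviour can be sketched as below; this is a minimal illustration of the logic, and the document fields (`title`, `content`, `score`) are assumed names, not the project's actual schema.

```python
def search_and_sort(docs, keyword, sort_key="score"):
    """Filter documents whose title or content mentions the keyword, then
    sort by the chosen field, highest first (field names are illustrative)."""
    kw = keyword.lower()
    hits = [d for d in docs if kw in d["title"].lower() or kw in d["content"].lower()]
    return sorted(hits, key=lambda d: d[sort_key], reverse=True)

docs = [
    {"title": "Coastal vulnerability atlas", "content": "...", "score": 0.9},
    {"title": "Inland flooding",             "content": "...", "score": 0.7},
    {"title": "Erosion survey",              "content": "coastal cliffs", "score": 0.4},
]
for d in search_and_sort(docs, "coastal"):
    print(d["title"])
# Coastal vulnerability atlas
# Erosion survey
```

Swapping `sort_key` for a date, location, or title field gives the other sort orders listed above.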
Benefits:
- Simplifies navigation through large datasets.
- Offers an intuitive platform for data exploration.
Example Usage:
- Input: Search for "coastal vulnerability."
- Output: Sorted and filtered results with interactive displays.
Data Storage
Description:
Scalable storage of processed data using MongoDB for efficient retrieval.
Key Features:
- Indexed storage for quick searches.
- Categorization by location, keyword, and relevance.
- Backup and archival capabilities.
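The index and query shapes such a MongoDB collection might use can be sketched as plain dictionaries. The collection layout and field names below are assumptions for illustration; with pymongo, the index specs would be passed to `create_index` and the query documents to `find`.

```python
# Index specifications (with pymongo: collection.create_index(spec)).
text_index = [("title", "text"), ("content", "text")]  # full-text search
lookup_index = [("url", 1)]                            # fast duplicate checks

def build_query(location=None, min_score=None):
    """Combine optional filters into a single MongoDB query document
    (field names here are assumed, not the project's actual schema)."""
    query = {}
    if location is not None:
        query["location"] = location
    if min_score is not None:
        query["pertinence_score"] = {"$gte": min_score}
    return query

print(build_query(location="Brittany", min_score=0.5))
```

Building the filter incrementally lets the UI combine location, keyword, and relevance criteria into one query document.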
Benefits:
- Ensures long-term data accessibility.
- Provides scalable and organized data management.
Example Usage:
- Input: Enriched documents.
- Output: Structured, searchable datasets stored in the database.
Failed URL Logging
Description:
In addition to automated extraction, the scraper now logs URLs where data extraction fails.
Key Features:
- Failed Scraping Log: Automatically logs URLs when a scraping attempt does not succeed due to issues like network errors, invalid content format, or other scraping challenges.
- Reason for Failure: Logs the reason for failure (e.g., 404 error, invalid content structure).
- Visualization of Failed URLs: Failed URLs are displayed in the UI for user review and potential manual follow-up.
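A failure logger along these lines can be built with the standard logging module. This is a sketch only: the project writes to data/url_errors.log via its errorLogger, while this stand-in logs to the console, and the entry format is an assumption.

```python
import logging

# Dedicated error logger for failed scrapes (sketch; the project's
# errorLogger writes to data/url_errors.log instead of the console).
error_logger = logging.getLogger("errorLogger")
error_logger.setLevel(logging.ERROR)
if not error_logger.handlers:
    error_logger.addHandler(logging.StreamHandler())

def log_failed_url(url: str, reason: str) -> str:
    """Record a failed scraping attempt with its reason, and return the
    formatted entry so a UI layer could display it for manual follow-up."""
    entry = f"FAILED {url} reason={reason}"
    error_logger.error(entry)
    return entry

log_failed_url("https://example.org/report.pdf", "404 Not Found")
```

Returning the formatted entry as well as logging it is what makes the UI listing of failed URLs straightforward.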
Benefits:
- Improved Scraping Transparency: The feature provides users with insight into the scraping process by listing URLs that did not work, making troubleshooting easier.
- Easy Recovery: Users can manually investigate or retry failed URLs through the UI, improving the robustness of the scraping pipeline.
- Comprehensive Data Collection: Ensures no data source is missed by highlighting scraping failures.
Example Usage:
- Input: Keyword "coastal erosion" with 100 URLs for scraping.
- Output: Relevant documents from successful scraping and a list of failed URLs with reasons.
Home | Contributors | Report an Issue | Licence
© 2024 APRIL. | Version 1.0 | Last updated on 2025-01-14