Reusability
The service is designed to be flexible and adaptable for different topics or customized configurations. Below are detailed instructions on how to customize and reuse the system effectively.
The core scraping functionality is handled by the following function:
Core Function:
- Path: functions/scraping.py
- Function: scrape_webpages_to_db(keywords_list: list, collection)
Default Behavior:
- This function queries the Google API to fetch the first 3 results for each keyword.
- It checks the existing database to prevent storing duplicate documents.
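The default behavior above (three results per keyword, skip documents already stored) can be sketched as follows. This is a hypothetical illustration, not the actual implementation in functions/scraping.py: `fetch_urls` stands in for the Google search call, and `FakeCollection` mimics a MongoDB-style collection with `find_one`/`insert_one`.

```python
def fetch_urls(keyword, num_results=3):
    # Stand-in for the Google API query; the real code uses the
    # `search` helper with num_results=3.
    return [f"https://example.com/{keyword}/{i}" for i in range(num_results)]

class FakeCollection:
    """In-memory stand-in for a pymongo-like collection."""
    def __init__(self):
        self.docs = []
    def find_one(self, query):
        return next((d for d in self.docs if d["url"] == query["url"]), None)
    def insert_one(self, doc):
        self.docs.append(doc)

def scrape_webpages_to_db(keywords_list, collection):
    for keyword in keywords_list:
        for url in fetch_urls(keyword, num_results=3):  # first 3 results per keyword
            if collection.find_one({"url": url}) is None:  # skip duplicates
                collection.insert_one({"url": url, "keyword": keyword})

col = FakeCollection()
scrape_webpages_to_db(["heatwave"], col)
scrape_webpages_to_db(["heatwave"], col)  # rerunning adds no duplicates
print(len(col.docs))  # 3
```

Because duplicates are filtered on insert, the pipeline can be rerun safely after expanding the search.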
How to Expand the Search: To increase the number of search results (useful for broader searches or when rerunning the pipeline):
1. Stop the Docker Container:
   - Run the following command to stop the container:
     docker-compose down
2. Modify the Scraping Behavior:
   - Open functions/scraping.py.
   - Locate the num_results parameter (around line 181):
     for url in search(combined, num_results=3):  # Limited to 3 results
   - Change its value from 3 to a higher number (e.g., 10) to expand the search results.
3. Restart the Docker Container:
   - After saving the changes, restart the container to apply the updates:
     docker-compose up -d
4. Rerun the Pipelines:
   - Go to [base_url]/launch-pipeline and restart the data collection and NLP processing pipelines.
===============
The system can easily be adapted to different topics by updating the vocabulary file:
Path: data/Vocabulaire_Expert_CSV.csv
File Structure: The file must retain the same name and follow the exact format with three columns:
- Vocabulaire de recherche (Search Vocabulary)
- Localisation de recherche (Search Location)
- Vocabulaire d'analyse (Analysis Vocabulary)
Usage:
- The first two columns are combined to generate search queries.
- All three columns are used for word tracking, document filtering, and powering the NLP tasks.
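As an illustration, the snippet below combines the first two columns into search queries using Python's standard csv module. The sample rows and the combination format (search term followed by location) are assumptions for this sketch, not taken from the real file; the service would read data/Vocabulaire_Expert_CSV.csv instead of the inline string.

```python
import csv
import io

# Inline sample standing in for data/Vocabulaire_Expert_CSV.csv
sample = (
    "Vocabulaire de recherche,Localisation de recherche,Vocabulaire d'analyse\n"
    "canicule,Paris,température\n"
    "inondation,Lyon,crue\n"
)

queries = []
with io.StringIO(sample) as f:  # in the service: open("data/Vocabulaire_Expert_CSV.csv")
    for row in csv.DictReader(f):
        # Combine the first two columns into one search query
        queries.append(f"{row['Vocabulaire de recherche']} {row['Localisation de recherche']}")

print(queries)  # ['canicule Paris', 'inondation Lyon']
```

Because the reader keys rows by the header names, keeping the exact column headers (and file name) is what lets the rest of the pipeline run unmodified on a new topic.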
By updating this CSV file with topic-specific terms while maintaining its structure, the service can be repurposed for new domains without additional modifications.
© 2024 APRIL. | Version 1.0 | Last updated on 2025-01-14