
Reusability

thibaut-dst edited this page Jan 17, 2025 · 4 revisions

The service is designed to be flexible: it can be adapted to new topics or run with customized configurations. Below are detailed instructions on how to customize and reuse the system effectively.

1. Customizing the Scraping Behavior

The core scraping functionality is handled by the following function:

Core Function:

  • Path: functions/scraping.py
  • Function: scrape_webpages_to_db(keywords_list: list, collection)

Default Behavior:

  • This function queries the Google API to fetch the first 3 results for each keyword.
  • It checks the existing database to prevent storing duplicate documents.
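The two behaviors above can be sketched as follows. This is a hypothetical simplification, not the actual body of `scrape_webpages_to_db`: the real function in `functions/scraping.py` performs Google searches and stores full documents in the MongoDB collection, while here the search and the collection are stubbed so the duplicate-check logic stands alone.

```python
def scrape_keywords(keywords_list, collection, search_fn, num_results=3):
    """For each keyword, fetch up to num_results URLs and store only new ones.

    `collection` is a plain list standing in for a MongoDB collection, and
    `search_fn` stands in for the Google search call (both hypothetical).
    """
    stored = []
    for keyword in keywords_list:
        for url in search_fn(keyword, num_results=num_results):
            # Skip documents whose URL is already in the collection,
            # mirroring the service's duplicate check.
            if any(doc["url"] == url for doc in collection):
                continue
            doc = {"url": url, "keyword": keyword}
            collection.append(doc)  # stands in for collection.insert_one(doc)
            stored.append(doc)
    return stored
```

Rerunning the function with the same keywords stores nothing new, which is why rerunning the pipeline only adds documents when the search is expanded.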

How to Expand the Search: To increase the number of search results (useful for broader searches or when rerunning the pipeline):

  1. Stop the Docker Container:

    • Run the following command to stop the container:
      docker-compose down
  2. Modify the Scraping Behavior:

    • Open functions/scraping.py.
    • Locate the num_results parameter (around line 181):
    for url in search(combined, num_results=3): # Limited to 3 results
    • Change its value from 3 to a higher number (e.g., 10) to expand the search results.
  3. Restart the Docker Container:

    • After saving the changes, restart the container to apply the updates:
      docker-compose up -d
  4. Rerun the Pipelines:

    • Go to [base_url]/launch-pipeline and restart the data collection and NLP processing pipelines.
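If you expand the search often, editing the hard-coded literal each time gets tedious. One possible refactor (a sketch only; `SCRAPER_NUM_RESULTS` and `fetch_urls` are hypothetical names, not part of the project) is to read the count from an environment variable, with a default of 3 matching the current behavior:

```python
import os

# Default of 3 matches the current hard-coded behavior in functions/scraping.py.
NUM_RESULTS = int(os.environ.get("SCRAPER_NUM_RESULTS", "3"))

def fetch_urls(combined, search_fn):
    """Return up to NUM_RESULTS URLs for the combined query string.

    `search_fn` stands in for the search call used around line 181.
    """
    return list(search_fn(combined, num_results=NUM_RESULTS))
```

With this approach, the container only needs to be restarted with a different environment value rather than a code change.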

2. Adapting the System for Different Topics

The system can be easily adapted for different topics by updating the vocabulary file.

  • Path: data/Vocabulaire_Expert_CSV.csv

File Structure

The CSV file must keep the same name and format, with three columns:

  1. Vocabulaire de recherche (Search Vocabulary): Keywords used for search queries.
  2. Localisation de recherche (Search Location): Specific locations to target during the search. These terms will be combined with all search keywords to generate queries for Google searches, refining the results to focus on the specified areas.
  3. Vocabulaire d'analyse (Analysis Vocabulary): Terms used for deeper analysis and filtering of the scraped content.

Example CSV Format:

Vocabulaire de recherche,Localisation de recherche,Vocabulaire d'analyse
climate change,environment,carbon footprint
renewable energy,technology,solar panels
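The way the first two columns drive query generation can be sketched as below. The column names match the wiki, but the exact query format the service builds may differ; this only illustrates the "combined with all search keywords" behavior described above.

```python
import csv
import io
import itertools

# Inline copy of the example CSV (normally read from
# data/Vocabulaire_Expert_CSV.csv).
CSV_TEXT = """Vocabulaire de recherche,Localisation de recherche,Vocabulaire d'analyse
climate change,environment,carbon footprint
renewable energy,technology,solar panels
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
keywords = [r["Vocabulaire de recherche"] for r in rows if r["Vocabulaire de recherche"]]
locations = [r["Localisation de recherche"] for r in rows if r["Localisation de recherche"]]

# Every keyword is combined with every location to build the search queries.
queries = [f"{kw} {loc}" for kw, loc in itertools.product(keywords, locations)]
```

With the two example rows, this yields four queries, such as "climate change environment" and "renewable energy technology".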

Steps

  1. Stop the Docker Container:

    • Run the following command to stop the container:
      docker-compose down
  2. Modify the CSV File:

    • Open data/Vocabulaire_Expert_CSV.csv in a text editor or spreadsheet software.
    • Replace or add topic-specific terms while maintaining the file structure.
  3. Restart the Docker Container:

    • After saving the changes, restart the container to apply the updates:
      docker-compose up -d
  4. Rerun the Pipelines:

    • Go to [base_url]/launch-pipeline and restart the data collection and NLP processing pipelines.
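Because the pipeline depends on the CSV keeping its exact header, a quick sanity check after editing the file (and before step 3) can save a failed run. This helper is a suggestion, not part of the project:

```python
import csv

# The three column names the service expects, as documented above.
REQUIRED_COLUMNS = [
    "Vocabulaire de recherche",
    "Localisation de recherche",
    "Vocabulaire d'analyse",
]

def check_vocabulary_file(path="data/Vocabulaire_Expert_CSV.csv"):
    """Raise ValueError if any required column is missing; return True otherwise."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing columns in {path}: {missing}")
    return True
```

Run it once after editing the CSV; if it raises, fix the header before restarting the container.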

By following these steps, the service can be repurposed for new topics or domains without additional code modifications.
