139 changes: 139 additions & 0 deletions .codeboarding/HTML_Data_Extractor.md
```mermaid

graph LR

CLI_Handler["CLI Handler"]

Web_Scraper["Web Scraper"]

HTML_Data_Extractor["HTML Data Extractor"]

Data_Processor["Data Processor"]

Output_Handler["Output Handler"]

CLI_Handler -- "invokes" --> Data_Processor

Data_Processor -- "calls" --> Web_Scraper

Web_Scraper -- "provides" --> HTML_Data_Extractor

HTML_Data_Extractor -- "provides" --> Data_Processor

Data_Processor -- "sends" --> Output_Handler

click HTML_Data_Extractor href "https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/.codeboarding//HTML_Data_Extractor.md" "Details"

```



[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org)



## Details



This graph captures the end-to-end flow of the `extract_occurrences.py` utility. The CLI Handler parses the search term, date range, and optional output file, then invokes the Data Processor, which steps through the requested years. For each year, the Data Processor calls the Web Scraper to fetch the Google Scholar results page and the HTML Data Extractor to parse the result count out of the raw HTML, and finally sends the `year,results` pairs to the Output Handler for CSV and console output.



### CLI Handler

This component is responsible for parsing command-line arguments (search term, start date, end date, and optional output file), validating the input, and orchestrating the overall execution flow of the script. It serves as the primary entry point for the utility.
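
As a rough illustration of the entry point described above (lines 72:92 of `extract_occurrences.py`), an `argparse`-based equivalent might look like the sketch below. The argument names and the use of `argparse` are assumptions for illustration, not taken from the repository.

```python
import argparse


def parse_args():
    # Hypothetical argparse-based equivalent of the script's entry point;
    # the real script may read sys.argv directly.
    parser = argparse.ArgumentParser(
        description="Count yearly Google Scholar results for a search term.")
    parser.add_argument("term", help="search term to query")
    parser.add_argument("start_year", type=int, help="first year of the range")
    parser.add_argument("end_year", type=int, help="last year of the range")
    parser.add_argument("output_file", nargs="?", default="out.csv",
                        help="optional CSV file for the results")
    args = parser.parse_args()
    if args.start_year > args.end_year:
        parser.error("start_year must not be after end_year")
    return args
```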





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L72-L92" target="_blank" rel="noopener noreferrer">`extract_occurrences.py` (72:92)</a>





### Web Scraper

This component handles the fetching of web page content from Google Scholar. It constructs the appropriate URL with search parameters, sets user-agent headers, manages cookies for session persistence, and executes the HTTP request to retrieve the raw HTML response.
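
A simplified sketch of the kind of fetch described here, using the standard library's `urllib` with a cookie-aware opener. The Scholar query parameters (`as_ylo`, `as_yhi`) and the User-Agent string are assumptions for illustration; in the repository, `get_num_results` combines fetching and parsing in a single function.

```python
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def fetch_scholar_page(term, year, cookie_jar):
    # Build a Google Scholar query restricted to a single year; the parameter
    # names are illustrative, not copied from the repository.
    query = urlencode({"q": term, "as_ylo": year, "as_yhi": year, "hl": "en"})
    url = f"https://scholar.google.com/scholar?{query}"
    # A browser-like User-Agent and a shared cookie jar keep the session
    # looking consistent across requests.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    opener = build_opener(HTTPCookieProcessor(cookie_jar))
    with opener.open(request) as response:
        return response.read().decode("utf-8", errors="replace")
```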





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L16-L46" target="_blank" rel="noopener noreferrer">`extract_occurrences.py:get_num_results` (16:46)</a>





### HTML Data Extractor [[Expand]](./HTML_Data_Extractor.md)

This component is dedicated to parsing the raw HTML content received from the `Web Scraper`. It utilizes BeautifulSoup to navigate the HTML structure, specifically locating the `div` element containing the search result count (`id="gs_ab_md"`). It then employs regular expressions to accurately extract and format the numerical count of search results, handling cases where no results are found.
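
A minimal sketch of the parsing step described above: BeautifulSoup locates the `id="gs_ab_md"` banner and a regular expression pulls out the count. The exact regular expression and the zero fallback for the no-results case are assumptions; only the `gs_ab_md` lookup comes from the description.

```python
import re

from bs4 import BeautifulSoup


def extract_result_count(html):
    # Locate the Scholar result-count banner, e.g. "About 1,230 results",
    # and return the number as an int. Returns 0 when no results are reported.
    soup = BeautifulSoup(html, "html.parser")
    banner = soup.find("div", id="gs_ab_md")
    if banner is None:
        return 0
    match = re.search(r"[\d.,]+", banner.get_text())
    if match is None:
        return 0
    return int(re.sub(r"[.,]", "", match.group(0)))
```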





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L16-L46" target="_blank" rel="noopener noreferrer">`extract_occurrences.py:get_num_results` (16:46)</a>





### Data Processor

This component manages the core logic of iterating through the specified date range (year by year). For each year, it calls the `Web Scraper` and `HTML Data Extractor` (via `get_num_results`) to obtain the search result count. It also incorporates a delay (`time.sleep`) to mitigate potential rate-limiting by Google Scholar and prepares the `year,results` data for output.
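
A rough sketch of the year-by-year loop described here. The generator shape, the `delay_seconds` default, and the `get_num_results` call signature are assumptions; in the repository, `get_range` handles output inside its own loop (see Output Handler below).

```python
import time


def iterate_years(get_num_results, term, start_year, end_year, delay_seconds=1.0):
    # Walk the inclusive year range, fetching the result count for each year.
    # `get_num_results` is the scraping-plus-parsing callable (Web Scraper and
    # HTML Data Extractor); passing it in keeps this sketch self-contained.
    for year in range(start_year, end_year + 1):
        count = get_num_results(term, year)
        yield year, count
        # A fixed pause between requests reduces the chance of Google Scholar
        # rate-limiting or CAPTCHA-blocking the session.
        time.sleep(delay_seconds)
```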





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L49-L69" target="_blank" rel="noopener noreferrer">`extract_occurrences.py:get_range` (49:69)</a>





### Output Handler

This component is responsible for formatting and presenting the final results of the keyword analysis. It writes the `year,results` data to a specified CSV file and simultaneously prints the same information to the console, ensuring both persistence and immediate feedback to the user. It also handles file opening and closing.
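
A hedged sketch of this output step, writing `year,results` rows to a CSV file while echoing them to the console. The use of the `csv` module and the header row are assumptions; the script may format and write the lines directly.

```python
import csv


def write_results(rows, output_path):
    # Write (year, count) pairs to a CSV file and echo each row to the console.
    # The context manager guarantees the file is closed even on errors.
    with open(output_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["year", "results"])
        for year, count in rows:
            writer.writerow([year, count])
            print(f"{year},{count}")
```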





**Related Classes/Methods**:



- <a href="https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/extract_occurrences.py#L49-L69" target="_blank" rel="noopener noreferrer">`extract_occurrences.py:get_range` (49:69)</a>










### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
167 changes: 167 additions & 0 deletions .codeboarding/on_boarding.md
```mermaid

graph LR

Application_Controller["Application Controller"]

Cookie_Manager["Cookie Manager"]

Data_Retrieval_Loop["Data Retrieval Loop"]

Web_Scraper["Web Scraper"]

HTML_Data_Extractor["HTML Data Extractor"]

Output_Writer["Output Writer"]

Application_Controller -- "initiates" --> Data_Retrieval_Loop

Application_Controller -- "instructs to save cookies" --> Cookie_Manager

Cookie_Manager -- "provides session cookies to" --> Web_Scraper

Data_Retrieval_Loop -- "calls" --> Web_Scraper

Data_Retrieval_Loop -- "passes extracted data to" --> Output_Writer

Web_Scraper -- "uses for requests" --> Cookie_Manager

Web_Scraper -- "provides raw HTML to" --> HTML_Data_Extractor

HTML_Data_Extractor -- "returns extracted results to" --> Data_Retrieval_Loop

click HTML_Data_Extractor href "https://github.com/recursionpharma/academic-keyword-occurrence/blob/trunk/.codeboarding//HTML_Data_Extractor.md" "Details"

```



[![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%20contact@codeboarding.org-lightgrey?style=flat-square)](mailto:contact@codeboarding.org)



## Details



The `academic-keyword-occurrence` project is a command-line utility for web scraping and data analysis, implemented as a single monolithic script with clear functional separation. Based on the control-flow and source analysis, its high-level data flow can be described through six core components, each with distinct responsibilities and well-defined interactions.



### Application Controller

The central orchestrator of the application. It parses command-line arguments, initializes the data extraction process, and ensures session cookies are saved upon completion.





**Related Classes/Methods**:



- `Application Controller` (72:92)





### Cookie Manager

Responsible for loading and persisting HTTP cookies, crucial for maintaining session state with Google Scholar and bypassing CAPTCHA challenges.
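
A minimal illustration of cookie persistence with the standard library's `http.cookiejar`. The `MozillaCookieJar` class, the `cookies.txt` file name, and the load/save flags are assumptions for illustration, not taken from the repository.

```python
from http.cookiejar import MozillaCookieJar


def load_cookie_jar(path="cookies.txt"):
    # Reuse cookies from an earlier (possibly CAPTCHA-cleared) session if the
    # file exists; otherwise start with an empty jar that can be saved later.
    jar = MozillaCookieJar(path)
    try:
        jar.load(ignore_discard=True, ignore_expires=True)
    except FileNotFoundError:
        pass
    return jar


def save_cookie_jar(jar):
    # Persist the session cookies so a later run can reuse the same session.
    jar.save(ignore_discard=True, ignore_expires=True)
```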





**Related Classes/Methods**:



- `Cookie Manager` (11:15)

- `Cookie Manager` (92:92)





### Data Retrieval Loop

Manages the iterative process of fetching data for each year within the specified range. It coordinates calls to the Web Scraper and HTML Data Extractor, and passes the processed results to the Output Writer.





**Related Classes/Methods**:



- `Data Retrieval Loop` (50:70)





### Web Scraper

Constructs Google Scholar query URLs, sends HTTP requests, and retrieves the raw HTML content, utilizing managed cookies for session continuity.





**Related Classes/Methods**:



- `Web Scraper` (17:29)





### HTML Data Extractor [[Expand]](./HTML_Data_Extractor.md)

Parses the raw HTML content received from the Web Scraper using BeautifulSoup and regular expressions to accurately extract the numerical count of search results.





**Related Classes/Methods**:



- `HTML Data Extractor` (31:47)





### Output Writer

Formats the extracted year and result count into a CSV string and writes this data to both the specified output file and the console.





**Related Classes/Methods**:



- `Output Writer` (53:57)

- `Output Writer` (65:67)









### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)