Conversation
thepsalmist
left a comment
Initial thoughts LGTM.
- Consistency in logging: in one or more files you've set up logging, yet in multiple instances you're using print() for logging/debugging (see the sketch after this comment).
- Interesting choice to use asyncio; from my earlier understanding, a synchronous approach would have been more straightforward. It would be helpful to add docs explaining the process flow and the motivation.
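A minimal sketch of the consistent-logging pattern being suggested, assuming a module-level logger; the helper function and message below are made up for illustration and are not taken from the PR:

```python
import logging

# Assumed module setup; the files in this PR may configure logging differently.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)


def check_site(url: str) -> None:
    # Hypothetical helper: route every message through the configured logger
    # instead of mixing logging calls with bare print() statements.
    logger.info("Checking robots.txt for %s", url)
```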
- Replace direct status updates with new site_checks table for historical tracking
- Add SiteCheck dataclass and related database methods
- Update all status and robots operations to use new system
- Fix typo in get_combined_data method name
- Improve database schema and error handling
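The commit message above mentions a SiteCheck dataclass and a site_checks table; a rough sketch of that shape, assuming SQLite storage — all field and column names here are guesses, not the PR's actual code:

```python
from dataclasses import dataclass
from typing import Optional
import sqlite3


@dataclass
class SiteCheck:
    # Illustrative fields only; the PR's actual dataclass may differ.
    airtable_id: str
    url: str
    checked_at: str                  # assumed ISO-8601 timestamp of the check
    blocks_ai_crawlers: bool
    robots_content: Optional[str] = None


def insert_site_check(conn: sqlite3.Connection, check: SiteCheck) -> None:
    # Appends a new row per check instead of overwriting a status column,
    # which is what enables historical tracking.
    conn.execute(
        "INSERT INTO site_checks (airtable_id, url, checked_at, blocks_ai_crawlers, robots_content) "
        "VALUES (?, ?, ?, ?, ?)",
        (check.airtable_id, check.url, check.checked_at,
         int(check.blocks_ai_crawlers), check.robots_content),
    )
    conn.commit()
```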
@codex Review this PR
Codex Review: Here are some suggestions.
About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".
process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {
        'pipeline.ArchivedURLsDatabasePipeline': 2
    },
}, install_root_handler=False)
process.crawl(ArchivedURLsSpider, urls=urls, target_date=target_date)
process.start()
[P0] Avoid restarting Scrapy reactor for each crawl
fetch_internet_archive_snapshots creates a fresh CrawlerProcess and calls start() even though fetch_robots already started and stopped a reactor earlier in the run. Twisted’s reactor is not restartable, so this second start() (and the third in fetch_archived_robots) will raise ReactorNotRestartable and stop the workflow after the first spider. Reuse one process/runner and run the reactor once to avoid the crash.
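A minimal sketch of the suggested fix: schedule every spider on a single CrawlerProcess so the Twisted reactor is started exactly once. RobotsSpider and the first pipeline name are assumptions based on the review comment, not the PR's actual code:

```python
from scrapy.crawler import CrawlerProcess

# One process for the whole run: crawl() only schedules work,
# and start() spins up the reactor a single time.
process = CrawlerProcess(
    settings={
        "ITEM_PIPELINES": {
            # Assumed pipelines; only ArchivedURLsDatabasePipeline appears in the diff above.
            "pipeline.RobotsDatabasePipeline": 1,
            "pipeline.ArchivedURLsDatabasePipeline": 2,
        },
    },
    install_root_handler=False,
)
process.crawl(RobotsSpider, urls=urls)  # assumed spider used by fetch_robots
process.crawl(ArchivedURLsSpider, urls=urls, target_date=target_date)
process.start()  # blocks until all scheduled crawls finish; no restart needed
```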
from utils import validate_url, clean_url
import os
import logging
import re
from environs import Env

env = Env()
dotenv_path = os.path.join(os.path.dirname(__file__), '..', '.env')
env.read_env(dotenv_path)

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

api_key = os.getenv('AIRTABLE_API_KEY')
base_id = os.getenv('AIRTABLE_BASE_ID')
organisations_table = os.getenv('AIRTABLE_ORGANISATION_TABLE')
content_table = os.getenv('AIRTABLE_CONTENT_TABLE')

if not api_key or not base_id or not organisations_table or not content_table:
    raise ValueError('API key, base ID and Organisation table are required')

at = Api(api_key)
[P0] Import pyairtable Api before use
The module instantiates Api(api_key) without importing the class, so importing airtable.py raises NameError: Api is not defined before any functionality can run. Add from pyairtable import Api (or the appropriate module) near the other imports.
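A short sketch of the suggested fix, assuming pyairtable 2.x-style access; the variable names follow the snippet above:

```python
from pyairtable import Api  # the import the comment says is missing

at = Api(api_key)
# Typical pyairtable usage once the Api object exists:
orgs = at.table(base_id, organisations_table)
```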
| "cohere-ai" | ||
| "DataForSeoBot", | ||
| "Diffbot", |
[P1] Fix missing commas in AI crawler list
The entries for cohere-ai and PerplexityBot are missing trailing commas, so Python concatenates them with the following literals, producing strings like "cohere-aiDataForSeoBot". As a result, those crawler names can never be detected in diff_robot_content, skewing the report about blocked AI crawlers.
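A sketch of the corrected literals with explicit commas, so each crawler stays its own list element instead of being joined by Python's implicit string concatenation; the list name AI_CRAWLERS is assumed, not taken from the PR:

```python
AI_CRAWLERS = [
    "cohere-ai",
    "DataForSeoBot",
    "Diffbot",
    "PerplexityBot",
]
```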
CREATE TABLE IF NOT EXISTS internet_archive_snapshots(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    airtable_id TEXT NOT NULL,
    url TEXT NOT NULL,
    archive_date TEXT NOT NULL UNIQUE,
    archive_robots_url TEXT,
[P2] Allow identical archive timestamps for different sites
The internet_archive_snapshots table declares archive_date as globally UNIQUE. If two media sites happen to have snapshots at the same timestamp (only second-level precision), inserting the second record will fail with a unique-constraint error and the snapshot is lost. Making the uniqueness per (airtable_id, archive_date) or removing the uniqueness constraint avoids rejected inserts.
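A sketch of the suggested schema change, assuming SQLite and a hypothetical database path, with uniqueness scoped to (airtable_id, archive_date) so two sites can share a timestamp:

```python
import sqlite3

conn = sqlite3.connect("mediadata.db")  # hypothetical database path
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS internet_archive_snapshots(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        airtable_id TEXT NOT NULL,
        url TEXT NOT NULL,
        archive_date TEXT NOT NULL,
        archive_robots_url TEXT,
        -- uniqueness per site, so identical timestamps across sites are allowed
        UNIQUE(airtable_id, archive_date)
    )
    """
)
```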
row.update({
    "Archive URL": closest_snapshot.get("url"),
    "Archive Date": format_db_date(closest_snapshot.get("archive_date")),
    "Archive Robots URL": closest_snapshot.get("archive_robots_url"),
    "Archive Robot Content": (
        "''" if closest_snapshot.get("archive_robots_url") == "" else closest_snapshot.get("archive_robots_url")
    ),
[P2] Report shows robots URL where content is expected
When building the report row, the "Archive Robot Content" field is populated from archive_robots_url instead of archived_content. The generated spreadsheet therefore shows the URL twice and never includes the archived robots.txt content. Use archived_content for this column.
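A sketch of the suggested column fix; the archived_content key comes from the review comment and may not match the PR's actual snapshot keys:

```python
# Populate the content column from the archived robots.txt body, not its URL.
archived_content = closest_snapshot.get("archived_content")
row.update({
    "Archive URL": closest_snapshot.get("url"),
    "Archive Date": format_db_date(closest_snapshot.get("archive_date")),
    "Archive Robots URL": closest_snapshot.get("archive_robots_url"),
    "Archive Robot Content": "''" if archived_content == "" else archived_content,
})
```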
Description
This PR introduces a new app to check which media houses in the MediaData database block AI crawlers.
Type of change
Checklist: