Conversation
thepsalmist
left a comment
Initial thoughts LGTM.
- Consistency in logging: in one or more files you've set up logging, yet in multiple instances you're using print() for logging/debugging (see the sketch after this comment).
- Interesting choice to use asyncio; from my earlier understanding, a synchronous approach would have been more straightforward. It would be helpful to add docs explaining the process flow and the motivation.
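A minimal sketch of the consistent-logging pattern being suggested, assuming a module-level logger; the helper function and message below are made up for illustration and are not taken from the PR:

```python
import logging

# Assumed module setup; the files in this PR may configure logging differently.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)


def check_site(url: str) -> None:
    # Hypothetical helper: route every message through the configured logger
    # instead of mixing logging calls with bare print() statements.
    logger.info("Checking robots.txt for %s", url)
```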
- Replace direct status updates with new site_checks table for historical tracking
- Add SiteCheck dataclass and related database methods
- Update all status and robots operations to use new system
- Fix typo in get_combined_data method name
- Improve database schema and error handling
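The commit message above mentions a SiteCheck dataclass and a site_checks table; a rough sketch of that shape, assuming SQLite storage — all field and column names here are guesses, not the PR's actual code:

```python
from dataclasses import dataclass
from typing import Optional
import sqlite3


@dataclass
class SiteCheck:
    # Illustrative fields only; the PR's actual dataclass may differ.
    airtable_id: str
    url: str
    checked_at: str                  # assumed ISO-8601 timestamp of the check
    blocks_ai_crawlers: bool
    robots_content: Optional[str] = None


def insert_site_check(conn: sqlite3.Connection, check: SiteCheck) -> None:
    # Appends a new row per check instead of overwriting a status column,
    # which is what enables historical tracking.
    conn.execute(
        "INSERT INTO site_checks (airtable_id, url, checked_at, blocks_ai_crawlers, robots_content) "
        "VALUES (?, ?, ?, ?, ?)",
        (check.airtable_id, check.url, check.checked_at,
         int(check.blocks_ai_crawlers), check.robots_content),
    )
    conn.commit()
```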
@codex Review this PR
Codex Review: Here are some suggestions.
About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".
process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {
        'pipeline.ArchivedURLsDatabasePipeline': 2
    },
}, install_root_handler=False)
process.crawl(ArchivedURLsSpider, urls=urls, target_date=target_date)
process.start()
[P0] Avoid restarting Scrapy reactor for each crawl
fetch_internet_archive_snapshots creates a fresh CrawlerProcess and calls start() even though fetch_robots already started and stopped a reactor earlier in the run. Twisted’s reactor is not restartable, so this second start() (and the third in fetch_archived_robots) will raise ReactorNotRestartable and stop the workflow after the first spider. Reuse one process/runner and run the reactor once to avoid the crash.
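A minimal sketch of the suggested fix: schedule every spider on a single CrawlerProcess so the Twisted reactor is started exactly once. RobotsSpider and the first pipeline name are assumptions based on the review comment, not the PR's actual code:

```python
from scrapy.crawler import CrawlerProcess

# One process for the whole run: crawl() only schedules work,
# and start() spins up the reactor a single time.
process = CrawlerProcess(
    settings={
        "ITEM_PIPELINES": {
            # Assumed pipelines; only ArchivedURLsDatabasePipeline appears in the diff above.
            "pipeline.RobotsDatabasePipeline": 1,
            "pipeline.ArchivedURLsDatabasePipeline": 2,
        },
    },
    install_root_handler=False,
)
process.crawl(RobotsSpider, urls=urls)  # assumed spider used by fetch_robots
process.crawl(ArchivedURLsSpider, urls=urls, target_date=target_date)
process.start()  # blocks until all scheduled crawls finish; no restart needed
```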
from utils import validate_url, clean_url
import os
import logging
import re
from environs import Env

env = Env()
dotenv_path = os.path.join(os.path.dirname(__file__), '..', '.env')
env.read_env(dotenv_path)

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

api_key = os.getenv('AIRTABLE_API_KEY')
base_id = os.getenv('AIRTABLE_BASE_ID')
organisations_table = os.getenv('AIRTABLE_ORGANISATION_TABLE')
content_table = os.getenv('AIRTABLE_CONTENT_TABLE')

if not api_key or not base_id or not organisations_table or not content_table:
    raise ValueError('API key, base ID and Organisation table are required')

at = Api(api_key)
[P0] Import pyairtable Api before use
The module instantiates Api(api_key) without importing the class, so importing airtable.py raises NameError: Api is not defined before any functionality can run. Add from pyairtable import Api (or the appropriate module) near the other imports.
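A short sketch of the suggested fix, assuming pyairtable 2.x-style access; the variable names follow the snippet above:

```python
from pyairtable import Api  # the import the comment says is missing

at = Api(api_key)
# Typical pyairtable usage once the Api object exists:
orgs = at.table(base_id, organisations_table)
```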
| "cohere-ai" | ||
| "DataForSeoBot", | ||
| "Diffbot", |
[P1] Fix missing commas in AI crawler list
The entries for cohere-ai and PerplexityBot are missing trailing commas, so Python concatenates them with the following literals, producing strings like "cohere-aiDataForSeoBot". As a result, those crawler names can never be detected in diff_robot_content, skewing the report about blocked AI crawlers.
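A sketch of the corrected literals with explicit commas, so each crawler stays its own list element instead of being joined by Python's implicit string concatenation; the list name AI_CRAWLERS is assumed, not taken from the PR:

```python
AI_CRAWLERS = [
    "cohere-ai",
    "DataForSeoBot",
    "Diffbot",
    "PerplexityBot",
]
```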
CREATE TABLE IF NOT EXISTS internet_archive_snapshots(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    airtable_id TEXT NOT NULL,
    url TEXT NOT NULL,
    archive_date TEXT NOT NULL UNIQUE,
    archive_robots_url TEXT,
[P2] Allow identical archive timestamps for different sites
The internet_archive_snapshots table declares archive_date as globally UNIQUE. If two media sites happen to have snapshots at the same timestamp (only second-level precision), inserting the second record will fail with a unique-constraint error and the snapshot is lost. Making the uniqueness per (airtable_id, archive_date) or removing the uniqueness constraint avoids rejected inserts.
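A sketch of the suggested schema change, assuming SQLite and a hypothetical database path, with uniqueness scoped to (airtable_id, archive_date) so two sites can share a timestamp:

```python
import sqlite3

conn = sqlite3.connect("mediadata.db")  # hypothetical database path
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS internet_archive_snapshots(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        airtable_id TEXT NOT NULL,
        url TEXT NOT NULL,
        archive_date TEXT NOT NULL,
        archive_robots_url TEXT,
        -- uniqueness per site, so identical timestamps across sites are allowed
        UNIQUE(airtable_id, archive_date)
    )
    """
)
```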
row.update({
    "Archive URL": closest_snapshot.get("url"),
    "Archive Date": format_db_date(closest_snapshot.get("archive_date")),
    "Archive Robots URL": closest_snapshot.get("archive_robots_url"),
    "Archive Robot Content": (
        "''" if closest_snapshot.get("archive_robots_url") == "" else closest_snapshot.get("archive_robots_url")
    ),
[P2] Report shows robots URL where content is expected
When building the report row, the "Archive Robot Content" field is populated from archive_robots_url instead of archived_content. The generated spreadsheet therefore shows the URL twice and never includes the archived robots.txt content. Use archived_content for this column.
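A sketch of the suggested column fix; the archived_content key comes from the review comment and may not match the PR's actual snapshot keys:

```python
# Populate the content column from the archived robots.txt body, not its URL.
archived_content = closest_snapshot.get("archived_content")
row.update({
    "Archive URL": closest_snapshot.get("url"),
    "Archive Date": format_db_date(closest_snapshot.get("archive_date")),
    "Archive Robots URL": closest_snapshot.get("archive_robots_url"),
    "Archive Robot Content": "''" if archived_content == "" else archived_content,
})
```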
Description
This PR introduces a new app to check which media houses in the MediaData database block AI crawlers.
Type of change
Checklist: