See #40 and [these closed PRs](https://github.com/ai-robots-txt/ai.robots.txt/pulls?q=is%3Apr+facebookexternalhit+is%3Aclosed) (most recently #154) for the history of FacebookExternalHit on this site. The main difficulty is that Meta does not appear to be entirely honest in its description of the purposes of FacebookExternalHit. Please note I couldn't find FacebookExternalHit/facebookexternalhit in the Mastodon source code. If anyone else reading this can reproduce the behaviour with Mastodon, it would be worth asking around on Mastodon to see if anyone has an idea of what's going on.
-
I run a website (The Steampunk Explorer) that provides news coverage and other resources related to steampunk. I recently added the user agents listed in ai.robots.txt/robots.txt to my own robots.txt file, as I don't want AI bots crawling my site.
I am most appreciative of the work done on this project. However, I discovered that adding one of the listed user agents, FacebookExternalHit, had unintended consequences. It does not appear to have any relation to Meta's AI initiatives, and websites that block it will likely find that this compromises their ability to manually share their content on Facebook and other platforms.
When I post articles on the site, the header includes metadata about the article's content, including OpenGraph (OG) metadata used by Facebook and other platforms to identify the title, description, and a representative image. When content is shared to Facebook (and other platforms), the platforms use that OG metadata to determine what to show. It appears that FacebookExternalHit somehow enables this to happen.
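For anyone unfamiliar with it, OG metadata is just a set of `<meta>` tags in the page's `<head>`. The sketch below is illustrative only; the title, description, and URLs are placeholders, not values from my site:

```html
<!-- Illustrative Open Graph tags; all values here are placeholders -->
<meta property="og:type" content="article" />
<meta property="og:title" content="Example Article Title" />
<meta property="og:description" content="A one-sentence summary shown in the link preview." />
<meta property="og:image" content="https://example.com/images/article-preview.jpg" />
<meta property="og:url" content="https://example.com/articles/example-article" />
```

Crawlers like FacebookExternalHit fetch the page, read these tags, and use them to build the link preview when someone shares the URL.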
Shortly after adding the AI crawlers to robots.txt, I found that three social media platforms (Facebook, Bluesky, and Mastodon) were unable to read the OG metadata. I checked the URL in the Facebook debugger tool (https://developers.facebook.com/tools/debug/), and it generated an error message stating that the inclusion of FacebookExternalHit in robots.txt was preventing it from scraping the article.
When I removed FacebookExternalHit from robots.txt, I found that Facebook, Bluesky, and Mastodon could once again read the OG data.
I'm not sure why the inclusion of FacebookExternalHit would affect non-Meta platforms, but that appears to be the case. It seems that Bluesky and Mastodon both rely on Open Graph metadata, and blocking FacebookExternalHit somehow prevents them from reading it.
Meta's developer documentation (https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/) lists the company's web crawlers and what they do. It appears that some of them are indeed involved in AI training. But if websites want to maximize their exposure on social media, it seems they should allow FacebookExternalHit.
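For illustration only: assuming (per that documentation and this project's list) that Meta's AI-related crawlers use user-agent strings such as meta-externalagent and FacebookBot, a robots.txt along these lines would block them while leaving link previews working. The exact agent names should be taken from Meta's documentation rather than from this sketch:

```
# Illustrative robots.txt sketch: block Meta's AI-related crawlers
# (agent names assumed from Meta's crawler documentation)
User-agent: meta-externalagent
Disallow: /

User-agent: FacebookBot
Disallow: /

# facebookexternalhit is deliberately NOT listed here, so it can still
# fetch Open Graph metadata when articles are shared.
```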