Conversation

@Sesquipedalian (Member) commented Dec 30, 2024

Provides SMF with the ability to add rules to robots.txt.

  • Adds a new setting, Config::$modSettings['robots_txt'], which stores the path to the robots.txt file.
  • To assist admins, the relevant ACP page offers to populate the setting with the correct path.
  • When a valid and writable path is saved in this setting, SMF will find the file and add rules to it as necessary.
  • If the path is not writable, the admin is warned and SMF does not attempt to update the file.
  • If the path is writable, SMF adds rules telling all spiders to ignore URLs that match the pattern /path/to/index.php?msg= (where /path/to/index.php is set to the appropriate value for the individual forum instance), as well as URLs containing PHPSESSID or ;topicseen. A sketch of the resulting rules appears after this list.
  • Redundancy checks ensure that a rule already present in the file is not added again.
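
For illustration, the rules SMF would ensure are present look roughly like the sketch below. This is a hedged approximation rather than the literal output: the /path/to prefix stands in for the forum's actual path, and the exact wildcard patterns used for the PHPSESSID and ;topicseen rules are assumptions.

```
User-agent: *
Disallow: /path/to/index.php?msg=
Disallow: /*PHPSESSID
Disallow: /*;topicseen
```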

Fixes #8367 ([2.1], [3.0]: Google/Crawler Inefficiencies)

@Sesquipedalian changed the title from "Automatically populate robots.txt" to "[3.0] Automatically populate robots.txt" on Dec 30, 2024
@Sesquipedalian force-pushed the 3.0/robots.txt branch 3 times, most recently from 3f6e419 to 916b153, on December 30, 2024 at 19:54
@Sesquipedalian (Member, Author) commented

@sbulen and @Oldiesmann, I'd be glad to have your thoughts on this too.

@Sesquipedalian merged commit 61b5b52 into SimpleMachines:release-3.0 on Jan 1, 2025
6 checks passed
@Sesquipedalian deleted the 3.0/robots.txt branch on January 1, 2025 at 22:51
@sbulen (Contributor) commented Jan 8, 2025

I came back to provide the requested feedback & see it's been merged already...

I strongly prefer the other solutions we discussed, which would eliminate the redirects (though not all the msg links), or the proposal to hide message-specific links from bots.

The problem with a robots.txt solution is that robots.txt is actually kinda complex, and can have different rules for different bots. I find I need to make use of that, as some crawlers honor some directives & some don't. And this small set of rules is only a tiny portion of what is needed. And I'm not sure we want to automate providing everything that is needed, because different forums are configured differently. E.g., some allow guests to view attachments & some don't; some allow guests to view the calendar & some don't.
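
To illustrate that point, a hand-maintained robots.txt often ends up with separate sections for different crawlers. The sketch below is invented for the example (the bot name and action paths are hypothetical, not rules SMF generates):

```
# Well-behaved crawlers: block only what this particular forum hides from guests.
User-agent: *
Disallow: /index.php?action=dlattach    # if guests cannot view attachments
Disallow: /index.php?action=calendar    # if guests cannot view the calendar

# A crawler that ignores the directives above can be given a stricter section of its own.
User-agent: ExampleBot
Disallow: /
```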

A clarification on the canonical: https://myforum.net/index.php?topic=333333.msg22222 does not match the canonical URL. For topics & messages, only https://myforum.net/index.php?topic=333333.999 matches the canonical.

For this reason I have added a ".msg" disallow to my robots.txt as well.
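
As a sketch, that disallow might look like the following, using the * wildcard extension that major crawlers support (the exact pattern may differ):

```
User-agent: *
Disallow: /*.msg
```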

As far as I know, there is no message-specific canonical. And I don't think there should be.
