@Grotax Grotax commented Jul 23, 2025

Imported: alexdebril#434

Feeds may, for whatever reason, contain older articles, which disproportionately shift the nextUpdateTime whenever it is based on the average. The average is used when the last update time + median does not yield a date in the future.

For example, take a feed with 20 items, where 15 items are from the last 2 hours and 5 are from a month ago. The median will then be very low compared to the average. To prevent this, outliers are removed using the interquartile range.
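As a rough illustration of the idea (not the actual PHP code in UpdateStats.php), the following Python sketch drops intervals outside the usual 1.5 x IQR fences before averaging. The function name `remove_outliers_iqr`, the factor `k=1.5`, the quartile method, and the sample intervals are all assumptions made for this example.

```python
from statistics import mean, quantiles


def remove_outliers_iqr(intervals, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR].

    `k=1.5` is the conventional fence factor; the factor used by the
    patch itself is not stated in this conversation (assumption).
    """
    if len(intervals) < 4:
        # Too few points for meaningful quartiles; keep everything.
        return list(intervals)
    q1, _, q3 = quantiles(intervals, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in intervals if low <= x <= high]


# Hypothetical intervals in seconds: 16 short gaps between recent items
# and 3 very long gaps caused by old items lingering in the feed.
short_gaps = list(range(300, 620, 20))
long_gaps = [2_000_000, 2_600_000, 5_000_000]
intervals = short_gaps + long_gaps

filtered = remove_outliers_iqr(intervals)
print("average with outliers:   ", mean(intervals))
print("average without outliers:", mean(filtered))
```

Note that the fences only help while the old items make up less than roughly a quarter of the intervals; beyond that, the third quartile itself is pulled up by the long gaps.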

Another problem is feeds that are generated at download time, so their last-modified time always corresponds to the download. These feeds are not recognized as sleepy, and their next update time is therefore calculated incorrectly.
It is therefore better to use the time of the most recent article as the basis.
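A minimal sketch of that choice, assuming the fallback order described above (median first, then average) and ignoring sleepy-feed handling and any minimum delay the library may apply. The function `next_update` and its parameters are hypothetical names, not the feed-io API.

```python
from datetime import datetime, timedelta, timezone


def next_update(newest_item, median, average, now):
    """Estimate when the next item may appear.

    The base is the newest item's date rather than the feed's
    last-modified time, so feeds generated at download time do not
    look permanently fresh.
    """
    candidate = newest_item + median
    if candidate > now:
        return candidate
    # The median did not reach the future; fall back to the average.
    return newest_item + average


# Hypothetical values for illustration only.
newest = datetime(2025, 1, 20, 20, 0, tzinfo=timezone.utc)
median = timedelta(minutes=6)
average = timedelta(minutes=7)

early = next_update(newest, median, average, newest + timedelta(minutes=3))
late = next_update(newest, median, average, newest + timedelta(minutes=6, seconds=30))
print(early.isoformat())
print(late.isoformat())
```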

Here are two examples for the first problem:

This feed is very active during the day and has 20 items that were mostly written in the last 1.5-2 hours. Towards the evening the intervals increase, and from time to time there are items that are months old. It can therefore happen that the next update time is postponed by a week.

bin/feedio read https://newsfeed.kicker.de/news/aktuell

kicker.xml.txt

Currently:

Next time a new item may be published : 2025-01-27T02:45:31+00:00
Minimum interval between items : 0 days, 0 hours, 5 minutes, 57 seconds
Median interval : 0 days, 0 hours, 5 minutes, 57 seconds
Average interval : 8 days, 11 hours, 46 minutes, 8 seconds
Maximum interval : 161 days, 5 hours, 21 minutes, 7 seconds

Patched version:

Next time a new item may be published : 2025-01-20T20:09:18+00:00
Minimum interval between items : 0 days, 0 hours, 0 minutes, 4 seconds
Median interval : 0 days, 0 hours, 5 minutes, 57 seconds
Average interval : 0 days, 0 hours, 7 minutes, 8 seconds
Maximum interval : 161 days, 5 hours, 21 minutes, 7 seconds

The same applies to this feed, which always contains old items.

sportschau.xml.txt

bin/feedio read "https://www.sportschau.de/fussball/index~rss2.xml"

Currently:

Next time a new item may be published : 2025-01-22T19:45:04+00:00
Minimum interval between items : 0 days, 0 hours, 55 minutes, 53 seconds
Median interval : 0 days, 0 hours, 55 minutes, 53 seconds
Average interval : 2 days, 16 hours, 49 minutes, 14 seconds
Maximum interval : 48 days, 2 hours, 46 minutes, 50 seconds

Patched version:

Next time a new item may be published : 2025-01-20T20:13:59+00:00
Minimum interval between items : 0 days, 0 hours, 1 minutes, 28 seconds
Median interval : 0 days, 0 hours, 55 minutes, 53 seconds
Average interval : 0 days, 2 hours, 22 minutes, 10 seconds
Maximum interval : 48 days, 2 hours, 46 minutes, 50 seconds

@Grotax Grotax requested review from SMillerDev and Copilot July 23, 2025 12:17

Copilot AI left a comment


Pull Request Overview

This PR improves the reliability of next update time calculations for feeds by addressing two key issues: feeds containing old articles that skew average calculations, and feeds generated during download time rather than reflecting actual publication patterns.

  • Implements outlier removal using interquartile range (IQR) to filter out old articles that disproportionately affect average intervals
  • Changes the base timestamp from feed's last modified time to the newest item's publication date for more accurate scheduling
  • Improves median calculation to handle even-numbered datasets correctly
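The even-count median mentioned in the last bullet can be sketched as follows. This is an illustrative Python version, not the PHP code under review, and `median_interval` is a hypothetical name.

```python
def median_interval(intervals):
    """Median that averages the two middle values for even counts."""
    if not intervals:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(intervals)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: mean of the two values around the midpoint.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median_interval([30, 10, 20])
even_median = median_interval([40, 10, 30, 20])
print(odd_median, even_median)
```

Picking only one of the two middle values for an even count biases the median towards whichever side is chosen, which matters when short and long intervals are split evenly.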

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/FeedIo/Reader/Result/UpdateStats.php | Core logic changes including outlier removal, timestamp base switching, and median calculation improvements |
| tests/FeedIo/Reader/Result/UpdateStatsTest.php | Updated test expectations to reflect the new calculation methodology |


wofferl commented Aug 13, 2025

Since I can't make any changes here, I gave the Copilot suggestions a thumbs up :)


Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.




Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.



wofferl and others added 6 commits November 13, 2025 19:04
- Use IQR method for outlier detection in average calculation
- Calculate median using middle two values for even counts
- Use newest item date for sleep detection and next update
- Prevent future dates from affecting calculations
- Add comprehensive edge case tests

Signed-off-by: Wolfgang <[email protected]>
@Grotax Grotax merged commit 2ce072d into main Nov 13, 2025
5 checks passed
@Grotax Grotax deleted the import-pr-434 branch November 13, 2025 18:44