Skip to content

Conversation

Grotax
Copy link

@Grotax Grotax commented Jul 23, 2025

Imported: alexdebril#434

Feeds may, for whatever reason, contain older articles, which disproportionately shifts the nextUpdateTime based on the average. This happens when the last update time + median does not yield a date in the future, and then the average is taken.

For example you have a feed with 20 items, where 15 items are from the last 2 hours and 5 are month ago. The median will than very low compared to the average. To prevent this, outliers are removed using the interquartile range.

Another problem are feeds that are generated during download and therefore the last modified time corresponds to the download. These feeds are not recognized as sleepy and are therefore calculated incorrectly.
It is therefore better to use the time of the most recent article as the basis.

Here are two examples for the first problem:

This feed is very active during the day, has 20 items that were mostly written in the last 1.5-2 hours. Towards the evening the intervals increases and from time to time there are items that are months old. It can therefore happen that the next update time is postponed by a week.

bin/feedio read https://newsfeed.kicker.de/news/aktuell

kicker.xml.txt

Currently:

Next time a new item may be published : 2025-01-27T02:45:31+00:00
Minimum interval between items : 0 days, 0 hours, 5 minutes, 57 seconds
Median interval : 0 days, 0 hours, 5 minutes, 57 seconds
Average interval : 8 days, 11 hours, 46 minutes, 8 seconds
Maximum interval : 161 days, 5 hours, 21 minutes, 7 seconds

Patched version:

Next time a new item may be published : 2025-01-20T20:09:18+00:00
Minimum interval between items : 0 days, 0 hours, 0 minutes, 4 seconds
Median interval : 0 days, 0 hours, 5 minutes, 57 seconds
Average interval : 0 days, 0 hours, 7 minutes, 8 seconds
Maximum interval : 161 days, 5 hours, 21 minutes, 7 seconds

Same for this feed, here there are always old items.

sportschau.xml.txt

bin/feedio read "https://www.sportschau.de/fussball/index~rss2.xml"

Currently:

Next time a new item may be published : 2025-01-22T19:45:04+00:00
Minimum interval between items : 0 days, 0 hours, 55 minutes, 53 seconds
Median interval : 0 days, 0 hours, 55 minutes, 53 seconds
Average interval : 2 days, 16 hours, 49 minutes, 14 seconds
Maximum interval : 48 days, 2 hours, 46 minutes, 50 seconds

Patched version:

Next time a new item may be published : 2025-01-20T20:13:59+00:00
Minimum interval between items : 0 days, 0 hours, 1 minutes, 28 seconds
Median interval : 0 days, 0 hours, 55 minutes, 53 seconds
Average interval : 0 days, 2 hours, 22 minutes, 10 seconds
Maximum interval : 48 days, 2 hours, 46 minutes, 50 seconds

@Grotax Grotax requested review from SMillerDev and Copilot July 23, 2025 12:17
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the reliability of next update time calculations for feeds by addressing two key issues: feeds containing old articles that skew average calculations, and feeds generated during download time rather than reflecting actual publication patterns.

  • Implements outlier removal using interquartile range (IQR) to filter out old articles that disproportionately affect average intervals
  • Changes the base timestamp from feed's last modified time to the newest item's publication date for more accurate scheduling
  • Improves median calculation to handle even-numbered datasets correctly

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.

File Description
src/FeedIo/Reader/Result/UpdateStats.php Core logic changes including outlier removal, timestamp base switching, and median calculation improvements
tests/FeedIo/Reader/Result/UpdateStatsTest.php Updated test expectations to reflect the new calculation methodology


$total = array_sum($result);

return count($this->intervals) ? intval(floor($total / count($this->intervals))) : 0;
Copy link

Copilot AI Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The average calculation uses count($this->intervals) as divisor but $total is calculated from the filtered $result array. This will produce incorrect averages since the divisor should be count($result) to match the filtered data.

Suggested change
return count($this->intervals) ? intval(floor($total / count($this->intervals))) : 0;
return count($result) ? intval(floor($total / count($result))) : 0;

Copilot uses AI. Check for mistakes.

Comment on lines +144 to +145
$q1 = $this->intervals[floor($count * 0.25)];
$q3 = $this->intervals[floor($count * 0.75)];
Copy link

Copilot AI Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using floor() for quartile calculation can cause index out of bounds when the array is small. For example, with count=1, floor(10.25)=0 is valid, but with count=2, floor(20.75)=1 accesses a valid index, but the logic doesn't handle edge cases consistently.

Suggested change
$q1 = $this->intervals[floor($count * 0.25)];
$q3 = $this->intervals[floor($count * 0.75)];
$q1_index = max(0, intdiv($count - 1, 4)); // Calculate Q1 index
$q3_index = min($count - 1, intdiv(3 * ($count - 1), 4)); // Calculate Q3 index
$q1 = $this->intervals[$q1_index];
$q3 = $this->intervals[$q3_index];

Copilot uses AI. Check for mistakes.

Comment on lines +144 to +145
$q1 = $this->intervals[floor($count * 0.25)];
$q3 = $this->intervals[floor($count * 0.75)];
Copy link

Copilot AI Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same quartile calculation issue as Q1 - using floor() without bounds checking can lead to incorrect quartile values or potential array access issues in edge cases.

Suggested change
$q1 = $this->intervals[floor($count * 0.25)];
$q3 = $this->intervals[floor($count * 0.75)];
$q1_index = max(0, min($count - 1, intval(floor($count * 0.25))));
$q3_index = max(0, min($count - 1, intval(floor($count * 0.75))));
$q1 = $this->intervals[$q1_index];
$q3 = $this->intervals[$q3_index];

Copilot uses AI. Check for mistakes.

$this->intervals = $this->computeIntervals($this->extractDates($feed));
$dates = $this->extractDates($feed);
if (count($dates) > 0) {
$this->newestItemDate = min(max($dates), time());
Copy link

Copilot AI Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The expression min(max($dates), time()) is unclear. Consider adding a comment explaining why the newest date is capped at the current time, or extract this logic to a descriptively named method.

Suggested change
$this->newestItemDate = min(max($dates), time());
$this->newestItemDate = $this->getCappedNewestDate($dates);

Copilot uses AI. Check for mistakes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible comment here could be "get the most recent item date that is not in the future"

@wofferl
Copy link

wofferl commented Aug 13, 2025

Since I can't make any changes here, I thumbed up the copilot changes :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants