-
-
Notifications
You must be signed in to change notification settings - Fork 1
Make nextupdatetime more reliable #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Wolfgang <[email protected]>
…s even Signed-off-by: Wolfgang <[email protected]>
Signed-off-by: Wolfgang <[email protected]>
…d to calculate the expected next update Signed-off-by: Wolfgang <[email protected]>
Signed-off-by: Wolfgang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the reliability of next update time calculations for feeds by addressing two key issues: feeds containing old articles that skew average calculations, and feeds generated during download time rather than reflecting actual publication patterns.
- Implements outlier removal using interquartile range (IQR) to filter out old articles that disproportionately affect average intervals
- Changes the base timestamp from feed's last modified time to the newest item's publication date for more accurate scheduling
- Improves median calculation to handle even-numbered datasets correctly
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
src/FeedIo/Reader/Result/UpdateStats.php | Core logic changes including outlier removal, timestamp base switching, and median calculation improvements |
tests/FeedIo/Reader/Result/UpdateStatsTest.php | Updated test expectations to reflect the new calculation methodology |
|
||
$total = array_sum($result); | ||
|
||
return count($this->intervals) ? intval(floor($total / count($this->intervals))) : 0; |
Copilot
AI
Jul 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The average calculation uses count($this->intervals) as divisor but $total is calculated from the filtered $result array. This will produce incorrect averages since the divisor should be count($result) to match the filtered data.
return count($this->intervals) ? intval(floor($total / count($this->intervals))) : 0; | |
return count($result) ? intval(floor($total / count($result))) : 0; |
Copilot uses AI. Check for mistakes.
$q1 = $this->intervals[floor($count * 0.25)]; | ||
$q3 = $this->intervals[floor($count * 0.75)]; |
Copilot
AI
Jul 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using floor() for quartile calculation can cause index out of bounds when the array is small. For example, with count=1, floor(10.25)=0 is valid, but with count=2, floor(20.75)=1 accesses a valid index, but the logic doesn't handle edge cases consistently.
$q1 = $this->intervals[floor($count * 0.25)]; | |
$q3 = $this->intervals[floor($count * 0.75)]; | |
$q1_index = max(0, intdiv($count - 1, 4)); // Calculate Q1 index | |
$q3_index = min($count - 1, intdiv(3 * ($count - 1), 4)); // Calculate Q3 index | |
$q1 = $this->intervals[$q1_index]; | |
$q3 = $this->intervals[$q3_index]; |
Copilot uses AI. Check for mistakes.
$q1 = $this->intervals[floor($count * 0.25)]; | ||
$q3 = $this->intervals[floor($count * 0.75)]; |
Copilot
AI
Jul 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same quartile calculation issue as Q1 - using floor() without bounds checking can lead to incorrect quartile values or potential array access issues in edge cases.
$q1 = $this->intervals[floor($count * 0.25)]; | |
$q3 = $this->intervals[floor($count * 0.75)]; | |
$q1_index = max(0, min($count - 1, intval(floor($count * 0.25)))); | |
$q3_index = max(0, min($count - 1, intval(floor($count * 0.75)))); | |
$q1 = $this->intervals[$q1_index]; | |
$q3 = $this->intervals[$q3_index]; |
Copilot uses AI. Check for mistakes.
$this->intervals = $this->computeIntervals($this->extractDates($feed)); | ||
$dates = $this->extractDates($feed); | ||
if (count($dates) > 0) { | ||
$this->newestItemDate = min(max($dates), time()); |
Copilot
AI
Jul 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The expression min(max($dates), time()) is unclear. Consider adding a comment explaining why the newest date is capped at the current time, or extract this logic to a descriptively named method.
$this->newestItemDate = min(max($dates), time()); | |
$this->newestItemDate = $this->getCappedNewestDate($dates); |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible comment here could be "get the most recent item date that is not in the future"
Since I can't make any changes here, I thumbed up the copilot changes :) |
Imported: alexdebril#434
Feeds may, for whatever reason, contain older articles, which disproportionately shifts the nextUpdateTime based on the average. This happens when the last update time + median does not yield a date in the future, and then the average is taken.
For example you have a feed with 20 items, where 15 items are from the last 2 hours and 5 are month ago. The median will than very low compared to the average. To prevent this, outliers are removed using the interquartile range.
Another problem are feeds that are generated during download and therefore the last modified time corresponds to the download. These feeds are not recognized as sleepy and are therefore calculated incorrectly.
It is therefore better to use the time of the most recent article as the basis.
Here are two examples for the first problem:
This feed is very active during the day, has 20 items that were mostly written in the last 1.5-2 hours. Towards the evening the intervals increases and from time to time there are items that are months old. It can therefore happen that the next update time is postponed by a week.
bin/feedio read https://newsfeed.kicker.de/news/aktuell
kicker.xml.txt
Currently:
Patched version:
Same for this feed, here there are always old items.
sportschau.xml.txt
bin/feedio read "https://www.sportschau.de/fussball/index~rss2.xml"
Currently:
Patched version: