Skip to content

Conversation

@byroot
Copy link
Member

@byroot byroot commented Jan 29, 2025

THe initial snapshot was from a few years back, the application changed a lot since then.

I had to run a bunch of migrations and had to fiddle with various sqlite3/msyql incompatibilities. Also some route changes, but with a bit of effort I managed to make it work again.

What is interesting is it seems lobste.rs became almost 3x as slow in these couple years, I haven't yet profiled to know why.

This branch:

itr:   time
 #1: 1532ms
 #2: 1013ms
 #3:  910ms
 #4:  852ms
 #5:  851ms
 #6:  901ms
 #7:  857ms
 #8:  886ms
 #9:  875ms

main:

itr:   time
 #1:  884ms
 #2:  482ms
 #3:  344ms
 #4:  358ms
 #5:  300ms
 #6:  316ms
 #7:  315ms
 #8:  328ms
 #9:  309ms

cc @rwstauner

Also FYI @pushcx, I'll try to let you know if I find the source of this massive slow-down. But perhaps it's just a red herring.

@byroot
Copy link
Member Author

byroot commented Jan 29, 2025

Profile after the upgrade: https://share.firefox.dev/4hrxsdY
Profile before: https://share.firefox.dev/3PTZyCG

Observations:

  • Sqlite3#step (queries) went from 19% to 25%.
  • There is a brand new 6.7% spent in Kernel#caller, and that's coming from Prosopite.subscribe (this is very fishy).

Nothing else very obvious just yet, after that it's all < 2%, so I'll have to play seven differences on the flamegraph.

@byroot
Copy link
Member Author

byroot commented Jan 29, 2025

  • Kernel#caller, and that's coming from Prosopite.subscribe (this is very fishy).

Holy cow, it's digesting the entire backtrace to make a unique key for queries: https://github.com/charkost/prosopite/blob/36f7f5a4fa800fcfa1f9bef5cae94c9739b128fc/lib/prosopite.rb#L263-L272

@pushcx
Copy link

pushcx commented Jan 29, 2025

Thanks for pinging me, I'm very curious to hear what you find. Though we don't have APM set up, I think we would've noticed a prod slowdown that big, so Prosopite or another dev/test change seems plausible.

I should mention two in-progress projects, neither of which has merged to master:

  • We've started work migrating away from MariaDB. That issue only talks about postgres, but we've had some sqlite advocacy this week and the existence of this benchmark's partial port is encouraging. So either it's a lot of churn for you (postgres) or a lot less work for you (sqlite).
  • Tomorrow @joeldrapper and I are experimenting on stream with porting some partials to Phlex. I'm still uncertain about adopting a young gem as a dependency, but its design addresses a lot of pain points I've had. If we entirely migrate from Erb to Phlex that might reduce the codebase's value to you as a Rails benchmark.

(EDIT: I have been scratching my head for two days trying to remember where I first learned of Phlex, and it just clicked that it's this thread that also includes some criticism from you. Small world!)

@byroot
Copy link
Member Author

byroot commented Jan 29, 2025

so Prosopite or another dev/test change seems plausible.

Note that the benchmark runs in production environment. AFAICT this Prosopite overhead isn't dev/test only.

the existence of this benchmark's partial port is encouraging.

Not really, we're only testing a handful of endpoint, and I had to get rid of some queries. It's probably a tons of work to migrate, but that's your call, doesn't affect us much.

If we entirely migrate from Erb to Phlex that might reduce the codebase's value to you as a Rails benchmark.

Probably yes. We can always chose to stop following lobsters and just freeze the codebase forever, just to Rails upgrades once in a while.

@maximecb
Copy link
Contributor

maximecb commented Jan 29, 2025

Worth profiling and maybe looking at how much code YJIT generates before/after since the change is so massive, also yjit insn executed count, percentage in YJIT.

@byroot
Copy link
Member Author

byroot commented Jan 29, 2025

how much code YJIT generates before/after

Before:

inline_code_size:    6,619,284
outlined_code_size:  2,340,100
code_region_size:    16,924,672
yjit_alloc_size:     25,453,141
yjit_compile_time:   1090.47ms
RSS: 372.3MiB
MAXRSS: 375.6MiB

After:

inline_code_size:    7,340,768
outlined_code_size:  2,462,532
code_region_size:    18,661,376
yjit_alloc_size:     27,652,073
yjit_compile_time:   1288.00ms
RSS: 330.3MiB
MAXRSS: 330.7MiB

@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

Alright, so a major piece of that was my mistake. I thought the benchmark had caching entirely disabled, but actually no.

If I re-enable the cache like it was before, it's already much closer:

itr:   time
 #1: 1101ms
 #2:  655ms
 #3:  483ms
 #4:  481ms
 #5:  485ms
 #6:  484ms
 #7:  460ms
 #8:  531ms
 #9:  477ms
#10:  472ms
#11:  466ms
#12:  458ms
#13:  459ms
#14:  472ms
#15:  465ms
#16:  460ms
#17:  470ms
#18:  473ms
#19:  456ms
#20:  475ms
#21:  486ms
#22:  469ms
#23:  453ms
#24:  471ms
#25:  467ms

Now it's only about 50% slower. I'll see if I can figure out where the remaining difference comes from.

byroot added a commit to rails/mission_control-jobs that referenced this pull request Jan 30, 2025
While upgrading the lobsters benchmark in yjit-bench I noticed
2.8% of the overall time spent in `Dir[]`, all coming from
FileUpdateChecker, which shouldn't be a think in production.

After tracking it down, it comes from mission-control.

Profile: https://share.firefox.dev/4hBdygq
Ref: ruby/ruby-bench#358
@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

Found a 3% regression caused by mission-control: rails/mission_control-jobs#241

I'll keep digging.

@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

Found a regression in psych: ruby/psych#713

@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

Seems like a big regression is the /rss route, the new stories.rss.builder is much slower than the old home/rss.erb, but I guess that's expected, can't beat ERB with Ruby DSLs. lobsters/lobsters@629df02

byroot added a commit to rails/mission_control-jobs that referenced this pull request Jan 30, 2025
While upgrading the lobsters benchmark in yjit-bench I noticed
2.8% of the overall time spent in `Dir[]`, all coming from
FileUpdateChecker, which shouldn't be a think in production.

After tracking it down, it comes from mission-control.

Profile: https://share.firefox.dev/4hBdygq
Ref: ruby/ruby-bench#358
rosa pushed a commit to rails/mission_control-jobs that referenced this pull request Jan 30, 2025
While upgrading the lobsters benchmark in yjit-bench I noticed
2.8% of the overall time spent in `Dir[]`, all coming from
FileUpdateChecker, which shouldn't be a think in production.

After tracking it down, it comes from mission-control.

Profile: https://share.firefox.dev/4hBdygq
Ref: ruby/ruby-bench#358
@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

Another important change seems to be:

get "/threads" => "comments#threads"

That became:

get "/threads" => "comments#user_threads"

And that new action a at least one order of magnitude more costly. But in a way that is legit, because the old action would simply redirect back to /active, so I think it was an oversight in the benchmark, as it didn't make much sense to benchmark that.

@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

Alright, I went over all routes one by one, the final big difference comes from stories#show.

Somehow the stories#load_user_votes is way slower, enough to make that particular action 3x slower.

Noah had left a comment in the route generator that seems related:

    { num: 15, routes: ["/s/:story_id"] },  # not including upvote/downvote because that can change story visibility dynamically

Which makes me wonder if perhaps he had disabled some part of the action before, I haven't yet identified what.

@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

Something interesting, but not necessarily a huge deal. The newer version of lobsters has way more handcrafted SQL, which causes Active Record to attempt to parse some of it to get some metadata.

E.g.:

    Comment
      .joins(<<~SQL
        inner join (
          with recursive discussion as (
          select
            c.id,
            0 as depth,
            (select count(*) from comments where parent_comment_id = c.id) as reply_count,
            cast(confidence_order as char(#{Comment::COP_LENGTH})) as confidence_order_path
            from comments c
            where
              thread_id in (#{thread_ids.join(", ")}) and
              parent_comment_id is null
          union all
          select
            c.id,
            discussion.depth + 1,
            (select count(*) from comments where parent_comment_id = c.id),
            cast(concat(
              discussion.confidence_order_path, 3 * (depth + 1),
              c.confidence_order
            ) as char(#{Comment::COP_LENGTH}))
          from comments c join discussion on c.parent_comment_id = discussion.id
          )
          select * from discussion as comments
        ) as comments_recursive on comments.id = comments_recursive.id
      SQL
            )
      .order("comments.thread_id desc, comments_recursive.confidence_order_path")
      .select('
        comments.*,
        comments_recursive.depth as depth,
        comments_recursive.reply_count as reply_count
      ')

Here since the joins is raw SQL, Active Record tries to parse that string to extract table names with a Regexp: https://github.com/rails/rails/blob/e03f459163a333642a62bf4569beae7e426c1a28/activerecord/lib/active_record/relation.rb#L1486-L1508

And since the string is pretty big, that end up accounting for 2% of the runtime on /s/:story_id.

THe initial snapshot was from a few years back, the application changed
a lot since then.

I had to run a bunch of migrations and had to fiddle with various sqlite3/msyql
incompatibilities. Also some route changes, but with a bit of effort I managed
to make it work again.

What is interesting is it seems lobste.rs became 2-3-x as slow in these
couple years, I haven't yet profiled to know why.

This branch:

```
itr:   time
 ruby#1: 1532ms
 ruby#2: 1013ms
 ruby#3:  910ms
 ruby#4:  852ms
 ruby#5:  851ms
 ruby#6:  901ms
 ruby#7:  857ms
 ruby#8:  886ms
 ruby#9:  875ms
```

main:

```
itr:   time
 ruby#1:  884ms
 ruby#2:  482ms
 ruby#3:  344ms
 ruby#4:  358ms
 ruby#5:  300ms
 ruby#6:  316ms
 ruby#7:  315ms
 ruby#8:  328ms
 ruby#9:  309ms
```
@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

Since I don't see any more obvious illegitimate slowdowns, here's main vs this branch:

main:

interp: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin23]
yjit: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]

--------  -----------  ----------  ---------  ----------  ------------  -----------
bench     interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
lobsters  459.7        2.4         296.9      3.0         0.558         1.548      
--------  -----------  ----------  ---------  ----------  ------------  -----------

updated-lobsters:

interp: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin23]
yjit: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]

--------  -----------  ----------  ---------  ----------  ------------  -----------
bench     interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
lobsters  600.6        1.9         453.1      2.2         0.543         1.326      
--------  -----------  ----------  ---------  ----------  ------------  -----------

Based on my profiling, most of the slowdown comes from logic that has moved from Ruby to SQL queries.

Now there's the question of whether we wish to upgrade or not, given that arguably this newer version of lobsters has much less Ruby code in it. On the other hand, the displayed speedup is more in line with what Rails apps seem to experience in production, so perhaps it's actually a good thing? Dunno.

@maximecb
Copy link
Contributor

Thanks for the investigation Jean 🙏

How does the percentage in YJIT compare before/after?

Now there's the question of whether we wish to upgrade or not, given that arguably this newer version of lobsters has much less Ruby code in it. On the other hand, the displayed speedup is more in line with what Rails apps seem to experience in production, so perhaps it's actually a good thing? Dunno.

Yeah I'm torn. On the one hand it's nice if the benchmark is bigger. On the other hand we kind of need as much Ruby code as possible to focus our optimization efforts I feel. When YJIT team members try to implement small optimizations, we look at performance differences on lobsters a lot (because it's the biggest benchmark). If the difference is below 1%, it's hard for us to make the call.

@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

How does the percentage in YJIT compare before/after?

Good question, I need to figure out how to extract that one.

@byroot
Copy link
Member Author

byroot commented Jan 30, 2025

Ok, found it. Full YJIT stats in https://gist.github.com/byroot/e0d86a2ab174b140968f3b9b8e3d439d

Before:

ratio_in_yjit:                 99.0%
avg_len_in_yjit:                63.5

After:

ratio_in_yjit:                 98.9%
avg_len_in_yjit:                65.3

@maximecb
Copy link
Contributor

Tagging @XrXr to have his opinion.

If the slowdown is indeed more logic moving to SQL, then we may want to either:

  • Only update rails without changing the bench
  • Or, try to make the SQL faster? We should be using an in-memory database like sqlite if we aren't already?

@byroot
Copy link
Member Author

byroot commented Feb 10, 2025

  • We should be using an in-memory database like sqlite if we aren't already?

We are already.

Copy link
Member

@XrXr XrXr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it's the mostly the same amount of ruby work, but there is now more non-ruby workload in the mix to make it slower overall. I share the concern about having a weaker signal for small changes, but a 30% speedup stills seems like a good chunk of the workload. Tracking new changes to keep benchmark relevant is probably important (Rails 8 instead of 7, for example).

I vote that we upgrade.

@@ -1,75 +1,82 @@
source "https://rubygems.org"

# Everything except Action Cable. It's unused and it installs native gems.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we upgrade, I'd like to get rid of nio4r, still, since the bench doesn't use Action Cable. We can do that in a separate PR, though.

@maximecb
Copy link
Contributor

I vote that we upgrade.

Jean suggested we could just upgrade Rails while keeping the benchmark largely the same and it shouldn't be too much trouble? We're agreeing on going that path?

I'd like to get rid of nio4r, still, since the bench doesn't use Action Cable. We can do that in a separate PR, though.

👍

@XrXr
Copy link
Member

XrXr commented Feb 11, 2025

Jean suggested we could just upgrade Rails while keeping the benchmark largely the same and it shouldn't be too much trouble? We're agreeing on going that path?

That sounds good to me too!

byroot added a commit to byroot/yjit-bench that referenced this pull request Feb 11, 2025
Fix: ruby#358

Before:
```
inline_code_size:    6,752,176
outlined_code_size:  2,067,600
code_region_size:    17,268,736
yjit_alloc_size:     25,934,245
yjit_compile_time:    989.69ms
RSS: 393.7MiB
MAXRSS: 393.9MiB
Average of last 15, non-warmup iters: 299ms
Total time spent benchmarking: 27s

interp: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin23]
yjit: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]

--------  -----------  ----------  ---------  ----------  ------------  -----------
bench     interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
lobsters  463.9        1.3         299.8      2.6         0.563         1.547
--------  -----------  ----------  ---------  ----------  ------------  -----------
```

After:
```
inline_code_size:    6,808,680
outlined_code_size:  2,178,968
code_region_size:    17,448,960
yjit_alloc_size:     26,006,453
yjit_compile_time:    978.48ms
RSS: 380.1MiB
MAXRSS: 380.3MiB
Average of last 16, non-warmup iters: 296ms
Total time spent benchmarking: 27s

interp: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin23]
yjit: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]

--------  -----------  ----------  ---------  ----------  ------------  -----------
bench     interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
lobsters  463.1        1.8         296.3      2.2         0.596         1.563
--------  -----------  ----------  ---------  ----------  ------------  -----------
```
byroot added a commit to byroot/yjit-bench that referenced this pull request Feb 11, 2025
Fix: ruby#358

Before:
```
inline_code_size:    6,752,176
outlined_code_size:  2,067,600
code_region_size:    17,268,736
yjit_alloc_size:     25,934,245
yjit_compile_time:    989.69ms
RSS: 393.7MiB
MAXRSS: 393.9MiB
Average of last 15, non-warmup iters: 299ms
Total time spent benchmarking: 27s

interp: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin23]
yjit: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]

--------  -----------  ----------  ---------  ----------  ------------  -----------
bench     interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
lobsters  463.9        1.3         299.8      2.6         0.563         1.547
--------  -----------  ----------  ---------  ----------  ------------  -----------
```

After:
```
inline_code_size:    6,808,680
outlined_code_size:  2,178,968
code_region_size:    17,448,960
yjit_alloc_size:     26,006,453
yjit_compile_time:    978.48ms
RSS: 380.1MiB
MAXRSS: 380.3MiB
Average of last 16, non-warmup iters: 296ms
Total time spent benchmarking: 27s

interp: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin23]
yjit: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]

--------  -----------  ----------  ---------  ----------  ------------  -----------
bench     interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
lobsters  463.1        1.8         296.3      2.2         0.596         1.563
--------  -----------  ----------  ---------  ----------  ------------  -----------
```
maximecb pushed a commit that referenced this pull request Feb 11, 2025
Fix: #358

Before:
```
inline_code_size:    6,752,176
outlined_code_size:  2,067,600
code_region_size:    17,268,736
yjit_alloc_size:     25,934,245
yjit_compile_time:    989.69ms
RSS: 393.7MiB
MAXRSS: 393.9MiB
Average of last 15, non-warmup iters: 299ms
Total time spent benchmarking: 27s

interp: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin23]
yjit: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]

--------  -----------  ----------  ---------  ----------  ------------  -----------
bench     interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
lobsters  463.9        1.3         299.8      2.6         0.563         1.547
--------  -----------  ----------  ---------  ----------  ------------  -----------
```

After:
```
inline_code_size:    6,808,680
outlined_code_size:  2,178,968
code_region_size:    17,448,960
yjit_alloc_size:     26,006,453
yjit_compile_time:    978.48ms
RSS: 380.1MiB
MAXRSS: 380.3MiB
Average of last 16, non-warmup iters: 296ms
Total time spent benchmarking: 27s

interp: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin23]
yjit: ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]

--------  -----------  ----------  ---------  ----------  ------------  -----------
bench     interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr  interp/yjit
lobsters  463.1        1.8         296.3      2.2         0.596         1.563
--------  -----------  ----------  ---------  ----------  ------------  -----------
```
@pushcx
Copy link

pushcx commented May 6, 2025

I found this again while cleaning out some old tabs. I hadn't wanted to distract from your PR work.

Seems like a big regression is the /rss route, the new stories.rss.builder is much slower than the old home/rss.erb, but I guess that's expected, can't beat ERB with Ruby DSLs.

I rewrote this because the ERB produced invalid RSS. It's the same reason I'm looking to replace ERB at the view layer, we're constantly firefighting very basic HTML bugs like missing closing tags. Lobsters didn't get slower because it replaced ERB with a builder, it got correct.

ERB reminds me very strongly of 1990s C++ where a tool supposedly produced high-performance code, but it achieved it by completely lacking any kind of safety. The C++ culture reinforced it by blaming the developer for bugs that tools could have prevented. If ERB can't help this program produce correct output, it's going to get replaced with one that can. I don't have large amounts of developer time to throw after the unending stream of bugs it produces, and I definitely have too much professional self-respect to accept that correctness should be, or even can be handled by blaming individuals instead of developing reliable systems and processes.

I saw the newly-announced herb tooling and I'll give that a few months to see if it produces tools like to address ERB's fundamental shortcomings. I'm not optimistic for staying. To echo your comment: I guess that's expected, can't beat a Ruby DSL with string soup.

@byroot
Copy link
Member Author

byroot commented May 9, 2025

I saw the newly-announced herb tooling and I'll give that a few months to see if it produces tools like to address ERB's fundamental shortcomings.

herb look nice, but short term https://github.com/Shopify/erb_lint has existed for a while.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants