Skip to content

Conversation

@karlseguin
Copy link
Collaborator

@karlseguin karlseguin commented Aug 5, 2025

To get this working, it should only be necessary to re-run: make install-submodule. After this zig build test/run will automatically rebuild libcurl and its dependencies.

Missing

  • Supporting the headers set via Network.setExtraHTTPHeaders (cdp)
  • Supporting the proxyServer parameter of the Target.createBrowserContext (cdp). This will be tricky to implement.
  • Dispatching the http_request_start, http_request_fail and http_request_complete (only used by cdp)
  • Content-Type request header is weird / wrong (this is partially inherited from main). I think @sjorsdonkers said he was doing stuff with headers, so will wait until that's done to fix this.

Developer Notes

  1. In http/Http.zig, setting ENABLE_DEBUG to true will cause libcurl to print information to stdout, similar to how curl -v behaves.

  2. In the _wait function of page.zig, you'll find:

// defer self.printWaitAnalysis();

If you uncomment this line, it'll dump information about the http client and scheduler which might help indicating why wait is waiting so long to finish. Note: I generally think there's a lot of tweaking we can do here to hit a good balance between waiting for things to finish and not blocking unnecessarily.

  1. There's a pseudo_connect branch to the implementation proxy that'll act as a MITM connect proxy. It's rough around the edges. Browser should be started with:
  --http_proxy http://127.0.0.1:3000/ \
  --insecure_disable_tls_host_verification

Future

What allowed this to be completed today was to load both static and dynamic imported modules in a [mostly] blocking manner. There's still a lot of room for performance improvement here, especially for dynamically imported modules. With debug logs, you you can identify these requests via the blocking = true argument.

@karlseguin karlseguin force-pushed the nonblocking_libcurl branch from 7463c9e to 7f70e94 Compare August 6, 2025 11:45
@sjorsdonkers sjorsdonkers mentioned this pull request Aug 7, 2025
8 tasks
self.static_scripts_done and // and we've finished parsing the HTML to queue all <scripts>
self.scripts.first == null and // and there are no more <script src=> to wait for
self.deferreds.first == null; // and there are no more <script defer src=> to wait for
}
Copy link
Contributor

@sjorsdonkers sjorsdonkers Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since a script that is not async or defered is still executed non_blocking we currently do not wait for them, as they are not added to self.asyncs
Line 224:

    self.getList(&pending_script.script).append(&pending_script.node);

    try self.client.request(.{

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This problem reveals itself when we try to intercept these script requests,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand. I don't think there's any difference between defer, async and normal with respect to how they're fetched. They all just directly go to client.request. It's only the order in which the fully loaded body is executed that changes.

Copy link
Contributor

@sjorsdonkers sjorsdonkers Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that for async script the asynct script list is not empty until the async request is completed.
As a result as long as any async script is still being processed isDone is false. However, if anon-async process has not been completed yet isDone can return true as the not async scripts are not put in that list.
This results in a race condition where the main page may check isDone, and call self.documentIsComplete() while actually a script is still in flight.
In practice the script seems to be returning in time usually, however if a non-async script is waiting for an intercept, documentIsComplete will be called and the page closed before the intercept has received a response.

Adding a basic intercept to Playwright/connect.js:

await page.route('**/*', async (route, request) => {
  console.log(`Request ROUTE: ${request.method()} ${request.url()}`);
  route.continue();
});

should reproduce it.

< {"method":"Network.requestWillBeSent","params":{"requestId":"REQ-2","frameId":"TID-1","loaderId":"LID-1","documentUrl":"http://127.0.0.1:1234/campfire-commerce/","request":{"url":"http://127.0.0.1:1234/campfire-commerce/json/product.json","urlFragment":"","method":"GET","hasPostData":false,"headers":{"User-Agent":"Lightpanda/1.0"}}},"sessionId":"SID-1"}
< {"method":"Fetch.requestPaused","params":{"requestId":"INTERCEPT-2","request":{"url":"http://127.0.0.1:1234/campfire-commerce/json/product.json","urlFragment":"","method":"GET","hasPostData":false,"headers":{"User-Agent":"Lightpanda/1.0"}},"frameId":"TID-1","resourceType":"Document","networkId":"REQ-2"},"sessionId":"SID-1"}
< {"method":"Network.requestWillBeSent","params":{"requestId":"REQ-3","frameId":"TID-1","loaderId":"LID-1","documentUrl":"http://127.0.0.1:1234/campfire-commerce/","request":{"url":"http://127.0.0.1:1234/campfire-commerce/json/reviews.json","urlFragment":"","method":"GET","hasPostData":false,"headers":{"User-Agent":"Lightpanda/1.0"}}},"sessionId":"SID-1"}
< {"method":"Fetch.requestPaused","params":{"requestId":"INTERCEPT-3","request":{"url":"http://127.0.0.1:1234/campfire-commerce/json/reviews.json","urlFragment":"","method":"GET","hasPostData":false,"headers":{"User-Agent":"Lightpanda/1.0"}},"frameId":"TID-1","resourceType":"Document","networkId":"REQ-3"},"sessionId":"SID-1"}
< {"method":"Page.frameNavigated","params":{"type":"Navigation","frame":{"id":"TID-1","loaderId":"LID-1","url":"http://127.0.0.1:1234/campfire-commerce/","domainAndRegistry":"","securityOrigin":"://","mimeType":"text/html","adFrameStatus":{"adFrameType":"none"},"secureContextType":"InsecureScheme","crossOriginIsolatedContextType":"NotIsolated","gatedAPIFeatures":[]}},"sessionId":"SID-1"}     <<-- Target will close as it is not waiting for the sync request to complete
< {"method":"DOM.documentUpdated","params":{},"sessionId":"SID-1"}
< {"method":"Page.domContentEventFired","params":{"timestamp":55348},"sessionId":"SID-1"}
< {"method":"Page.lifecycleEvent","params":{"frameId":"TID-1","loaderId":"LID-1","name":"DOMContentLoaded","timestamp":55348},"sessionId":"SID-1"}
< {"method":"Page.loadEventFired","params":{"timestamp":55348},"sessionId":"SID-1"}
< {"method":"Page.lifecycleEvent","params":{"frameId":"TID-1","loaderId":"LID-1","name":"load","timestamp":55348},"sessionId":"SID-1"}
< {"method":"Page.frameStoppedLoading","params":{"frameId":"TID-1"},"sessionId":"SID-1"}
> {"id":39,"method":"Target.closeTarget","params":{"targetId":"TID-1"}}     <<-- Target closed as it is not waiting for the sync request to complete
> {"id":40,"method":"Fetch.continueRequest","params":{"requestId":"INTERCEPT-2"},"sessionId":"SID-1"}
> {"id":41,"method":"Fetch.continueRequest","params":{"requestId":"INTERCEPT-3"},"sessionId":"SID-1"}
< {"id":39,"result":{"success":true}}
< {"method":"Inspector.detached","params":{"reason":"Render process gone."},"sessionId":"SID-1"}
< {"method":"Target.detachedFromTarget","params":{"targetId":"TID-1","sessionId":"SID-1","reason":"Render process gone."}}
< {"id":40,"code":-32001,"message":"Unknown sessionId"}
< {"id":41,"code":-32001,"message":"Unknown sessionId"}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are XHR requests. The issue is that Page.wait is seeing that http.active == 0 and exiting. It's impossible for active == 0 while there are active or pending requests (in nonblocking_libcurl)

But with request interception, you can have active == 0, but requests held-up by CDP.

I did a commit, then reverted it, which did self.active += 1 earlier (1). This way you could return from fn request(..) for interception and page wouldn't see active == 0. However, request interception calls request again, so now active would be incremented twice for a single request.

I think we should add a field to Page which is "pending_intercept: bool. Page.waitcan now wait foractive == 0 and pending_intercept == false. CDP can set this based on whether intercept_state.waiting.count() > 0`.

(1) 19c9080

@sjorsdonkers
Copy link
Contributor

sjorsdonkers commented Aug 11, 2025

Somewhat related to the isDone discussion.
I wonder whether the code would be easier to follow if we have 1 central callback that handles xhr,scripts,document with if statements instead of heaving configurable callback function pointers for each.

So instead of having a pageDoneCallback, httpDoneCallback(xhr), doneCallback(script), There would just be 1 requestDoneCallback, that handles all 3 cases.

For example now the pageDoneCallback and the script's doneCallback both separately handle the done scenario:

if (self.isDone()) {
    self.page.documentIsComplete();
}

The code is very far apart, but needs to stay aligned. So in the case where there is a single callback function the logic would be in 1 place.

get puppeteer/cdp.js working again

make test are all passing
Start work on supporting navigation events (clicks, form submission).
Better error logs on http callback error

Fix wait timing
Allow page.wait to transition page mode.

Optimize initial page load. No point running scheduler until the initial
page is loaded.

Support ISO-8859-1 charset
Improve cleanup/shutdown (*cough* memory leaks *cough*)
Set parent current script
Set SUPPRESS_CONNECT_HEADERS option.
This is necessary because of CloudFront which will send gzip content even if
we don't ask for it.

Properly handle scripts that are both async and defer.

Add a helper to print state of page wait. This can be helpful in identifying
what's causing the page to hang on page.wait.
Improve wait analysis dump.

De-prioritize secondary schedules.

Don't log warning for application/json scripts

Change pretty log timer to display time from start.
http_timeout_ms
http_connect_timeout_ms
http_max_host_open
http_max_concurrent
@karlseguin karlseguin force-pushed the nonblocking_libcurl branch from 64f79f2 to 079ce5e Compare August 11, 2025 13:38
mookums and others added 20 commits August 11, 2025 12:09
This ensures that page.wait won't unblock too early. As-is, this isn't an issue
since active can only be 0 if there are no active OR pending requests. However,
with request interception (#930)
it's possible to have no active requests and no pending requests - from the
http client's point of view - but still have pending-on-intercept requests.

An alternative to this would be to undo these changes, and instead change
Page.wait to be intercept-aware. That is, Page.wait would continue to block on
http activity and scheduled tasks, as well as intercepted requests. However,
since the Page doesn't know anything about CDP right now, and it does know
about the http client, maybe doing this in the client is fine.
Avoid infinite the loop of loading non-HTML documents with CDP.
finalize document loading with non-HTML pages
intercept continue and abort

feedback

First version of headers, no cookies yet
On client.request(req) we now immediately wrap the request into a Transfer. This
results in less copying of the Request object. It also makes the transfer.uri
available, so CDP no longer needs to std.Uri(request.url) anymore.

The main advantage is that it's easier to manage resources. There was a use-
after free before due to the sensitive nature of the tranfer's lifetime. There
were also corner cases where some resources might not be freed. This is
hopefully fixed with the lifetime of Transfer being extended.
fix integer overflow for sleeping delay
Add a header iterator to the transfer. This removes the need for NetworkState,
duping header name/values, and the http_header_received event.
@krichprollsch krichprollsch marked this pull request as ready for review August 14, 2025 09:20
@krichprollsch krichprollsch merged commit a707e10 into main Aug 14, 2025
10 checks passed
@krichprollsch krichprollsch deleted the nonblocking_libcurl branch August 14, 2025 09:43
@github-actions github-actions bot locked and limited conversation to collaborators Aug 14, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants