-
Notifications
You must be signed in to change notification settings - Fork 288
Nonblocking libcurl #922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nonblocking libcurl #922
Conversation
7463c9e to
7f70e94
Compare
| self.static_scripts_done and // and we've finished parsing the HTML to queue all <scripts> | ||
| self.scripts.first == null and // and there are no more <script src=> to wait for | ||
| self.deferreds.first == null; // and there are no more <script defer src=> to wait for | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since a script that is not async or defered is still executed non_blocking we currently do not wait for them, as they are not added to self.asyncs
Line 224:
self.getList(&pending_script.script).append(&pending_script.node);
try self.client.request(.{There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This problem reveals itself when we try to intercept these script requests,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't understand. I don't think there's any difference between defer, async and normal with respect to how they're fetched. They all just directly go to client.request. It's only the order in which the fully loaded body is executed that changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that for async script the asynct script list is not empty until the async request is completed.
As a result as long as any async script is still being processed isDone is false. However, if anon-async process has not been completed yet isDone can return true as the not async scripts are not put in that list.
This results in a race condition where the main page may check isDone, and call self.documentIsComplete() while actually a script is still in flight.
In practice the script seems to be returning in time usually, however if a non-async script is waiting for an intercept, documentIsComplete will be called and the page closed before the intercept has received a response.
Adding a basic intercept to Playwright/connect.js:
await page.route('**/*', async (route, request) => {
console.log(`Request ROUTE: ${request.method()} ${request.url()}`);
route.continue();
});should reproduce it.
< {"method":"Network.requestWillBeSent","params":{"requestId":"REQ-2","frameId":"TID-1","loaderId":"LID-1","documentUrl":"http://127.0.0.1:1234/campfire-commerce/","request":{"url":"http://127.0.0.1:1234/campfire-commerce/json/product.json","urlFragment":"","method":"GET","hasPostData":false,"headers":{"User-Agent":"Lightpanda/1.0"}}},"sessionId":"SID-1"}
< {"method":"Fetch.requestPaused","params":{"requestId":"INTERCEPT-2","request":{"url":"http://127.0.0.1:1234/campfire-commerce/json/product.json","urlFragment":"","method":"GET","hasPostData":false,"headers":{"User-Agent":"Lightpanda/1.0"}},"frameId":"TID-1","resourceType":"Document","networkId":"REQ-2"},"sessionId":"SID-1"}
< {"method":"Network.requestWillBeSent","params":{"requestId":"REQ-3","frameId":"TID-1","loaderId":"LID-1","documentUrl":"http://127.0.0.1:1234/campfire-commerce/","request":{"url":"http://127.0.0.1:1234/campfire-commerce/json/reviews.json","urlFragment":"","method":"GET","hasPostData":false,"headers":{"User-Agent":"Lightpanda/1.0"}}},"sessionId":"SID-1"}
< {"method":"Fetch.requestPaused","params":{"requestId":"INTERCEPT-3","request":{"url":"http://127.0.0.1:1234/campfire-commerce/json/reviews.json","urlFragment":"","method":"GET","hasPostData":false,"headers":{"User-Agent":"Lightpanda/1.0"}},"frameId":"TID-1","resourceType":"Document","networkId":"REQ-3"},"sessionId":"SID-1"}
< {"method":"Page.frameNavigated","params":{"type":"Navigation","frame":{"id":"TID-1","loaderId":"LID-1","url":"http://127.0.0.1:1234/campfire-commerce/","domainAndRegistry":"","securityOrigin":"://","mimeType":"text/html","adFrameStatus":{"adFrameType":"none"},"secureContextType":"InsecureScheme","crossOriginIsolatedContextType":"NotIsolated","gatedAPIFeatures":[]}},"sessionId":"SID-1"} <<-- Target will close as it is not waiting for the sync request to complete
< {"method":"DOM.documentUpdated","params":{},"sessionId":"SID-1"}
< {"method":"Page.domContentEventFired","params":{"timestamp":55348},"sessionId":"SID-1"}
< {"method":"Page.lifecycleEvent","params":{"frameId":"TID-1","loaderId":"LID-1","name":"DOMContentLoaded","timestamp":55348},"sessionId":"SID-1"}
< {"method":"Page.loadEventFired","params":{"timestamp":55348},"sessionId":"SID-1"}
< {"method":"Page.lifecycleEvent","params":{"frameId":"TID-1","loaderId":"LID-1","name":"load","timestamp":55348},"sessionId":"SID-1"}
< {"method":"Page.frameStoppedLoading","params":{"frameId":"TID-1"},"sessionId":"SID-1"}
> {"id":39,"method":"Target.closeTarget","params":{"targetId":"TID-1"}} <<-- Target closed as it is not waiting for the sync request to complete
> {"id":40,"method":"Fetch.continueRequest","params":{"requestId":"INTERCEPT-2"},"sessionId":"SID-1"}
> {"id":41,"method":"Fetch.continueRequest","params":{"requestId":"INTERCEPT-3"},"sessionId":"SID-1"}
< {"id":39,"result":{"success":true}}
< {"method":"Inspector.detached","params":{"reason":"Render process gone."},"sessionId":"SID-1"}
< {"method":"Target.detachedFromTarget","params":{"targetId":"TID-1","sessionId":"SID-1","reason":"Render process gone."}}
< {"id":40,"code":-32001,"message":"Unknown sessionId"}
< {"id":41,"code":-32001,"message":"Unknown sessionId"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are XHR requests. The issue is that Page.wait is seeing that http.active == 0 and exiting. It's impossible for active == 0 while there are active or pending requests (in nonblocking_libcurl)
But with request interception, you can have active == 0, but requests held-up by CDP.
I did a commit, then reverted it, which did self.active += 1 earlier (1). This way you could return from fn request(..) for interception and page wouldn't see active == 0. However, request interception calls request again, so now active would be incremented twice for a single request.
I think we should add a field to Page which is "pending_intercept: bool. Page.waitcan now wait foractive == 0 and pending_intercept == false. CDP can set this based on whether intercept_state.waiting.count() > 0`.
(1) 19c9080
|
Somewhat related to the isDone discussion. So instead of having a pageDoneCallback, httpDoneCallback(xhr), doneCallback(script), There would just be 1 requestDoneCallback, that handles all 3 cases. For example now the pageDoneCallback and the script's doneCallback both separately handle the done scenario: if (self.isDone()) {
self.page.documentIsComplete();
}The code is very far apart, but needs to stay aligned. So in the case where there is a single callback function the logic would be in 1 place. |
get puppeteer/cdp.js working again make test are all passing
Start work on supporting navigation events (clicks, form submission).
Better error logs on http callback error Fix wait timing
Allow page.wait to transition page mode. Optimize initial page load. No point running scheduler until the initial page is loaded. Support ISO-8859-1 charset
Improve cleanup/shutdown (*cough* memory leaks *cough*)
Set parent current script
Set SUPPRESS_CONNECT_HEADERS option.
This is necessary because of CloudFront which will send gzip content even if we don't ask for it. Properly handle scripts that are both async and defer. Add a helper to print state of page wait. This can be helpful in identifying what's causing the page to hang on page.wait.
Improve wait analysis dump. De-prioritize secondary schedules. Don't log warning for application/json scripts Change pretty log timer to display time from start.
http_timeout_ms http_connect_timeout_ms http_max_host_open http_max_concurrent
64f79f2 to
079ce5e
Compare
This ensures that page.wait won't unblock too early. As-is, this isn't an issue since active can only be 0 if there are no active OR pending requests. However, with request interception (#930) it's possible to have no active requests and no pending requests - from the http client's point of view - but still have pending-on-intercept requests. An alternative to this would be to undo these changes, and instead change Page.wait to be intercept-aware. That is, Page.wait would continue to block on http activity and scheduled tasks, as well as intercepted requests. However, since the Page doesn't know anything about CDP right now, and it does know about the http client, maybe doing this in the client is fine.
This reverts commit 19c9080.
Avoid infinite the loop of loading non-HTML documents with CDP.
finalize document loading with non-HTML pages
intercept continue and abort feedback First version of headers, no cookies yet
request interception
On client.request(req) we now immediately wrap the request into a Transfer. This results in less copying of the Request object. It also makes the transfer.uri available, so CDP no longer needs to std.Uri(request.url) anymore. The main advantage is that it's easier to manage resources. There was a use- after free before due to the sensitive nature of the tranfer's lifetime. There were also corner cases where some resources might not be freed. This is hopefully fixed with the lifetime of Transfer being extended.
fix integer overflow for sleeping delay
Add a header iterator to the transfer. This removes the need for NetworkState, duping header name/values, and the http_header_received event.
To get this working, it should only be necessary to re-run:
make install-submodule. After thiszig build test/runwill automatically rebuild libcurl and its dependencies.Missing
Network.setExtraHTTPHeaders(cdp)proxyServerparameter of theTarget.createBrowserContext(cdp). This will be tricky to implement.http_request_start,http_request_failandhttp_request_complete(only used by cdp)Developer Notes
In
http/Http.zig, settingENABLE_DEBUGtotruewill cause libcurl to print information to stdout, similar to howcurl -vbehaves.In the
_waitfunction ofpage.zig, you'll find:// defer self.printWaitAnalysis();If you uncomment this line, it'll dump information about the http client and scheduler which might help indicating why
waitis waiting so long to finish. Note: I generally think there's a lot of tweaking we can do here to hit a good balance between waiting for things to finish and not blocking unnecessarily.Future
What allowed this to be completed today was to load both static and dynamic imported modules in a [mostly] blocking manner. There's still a lot of room for performance improvement here, especially for dynamically imported modules. With debug logs, you you can identify these requests via the
blocking = trueargument.