Description
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/http (HttpCrawler)
Issue description
Good day everyone!
I am running a web crawler inside a Docker container on a powerful machine. The crawler starts fast, but its performance degrades significantly over time, sometimes to as slow as 1-2 requests per hour. The same setup works without issues on my MacBook (outside Docker).
Environment
- Docker version: v 4.21.1
- Host Machine: Windows 11 Pro, 32GB RAM, AMD Ryzen 7 5700X 8-core, 12GB VRAM
- Docker Image: apify/actor-node:16
Dockerfile
FROM apify/actor-node:16
WORKDIR /app
COPY ./ /app
RUN npm ci
CMD npm start --silent
I tried tweaking different settings and searching through the docs, issues, and code to figure it out myself, but have been unsuccessful so far. Here is my crawler setup:
const config = Configuration.getGlobalConfig();
config.set("availableMemoryRatio", 0.9);
config.set("memoryMbytes", 22000); // setting enough memory
config.set("persistStorage", false);
const proxyUrls = await listProxies(); // returns a list of 100 fast proxies
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls,
});
const router = createBasicRouter();
export const queue = await RequestQueue.open();
queue.timeoutSecs = 10;
const crawler = new HttpCrawler({
    useSessionPool: true,
    maxConcurrency: 30,
    sessionPoolOptions: { maxPoolSize: 100 }, // to equal the number of proxies
    autoscaledPoolOptions: {
        snapshotterOptions: {
            eventLoopSnapshotIntervalSecs: 2,
            maxBlockedMillis: 100,
        },
        systemStatusOptions: {
            maxEventLoopOverloadedRatio: 1.9, // tried different values, no effect, even when it's not specified
        },
    },
    requestHandlerTimeoutSecs: 60 * 2,
    requestQueue: queue,
    keepAlive: true,
    proxyConfiguration,
    requestHandler: router,
    failedRequestHandler({ request, proxyInfo }) {
        log.debug(`Request ${request.url} failed.`);
        log.debug("Proxy info: ", proxyInfo);
    },
    postNavigationHooks: [
        async (context: BasicCrawlingContext) => {
            const { url } = context.request;
            await markUrlProcessed(url); // logic I need for my app
        },
    ],
});
await crawler.run();
In the autoscaled pool logs I noticed that the event loop ratio is often at 1, which is why I tried increasing maxEventLoopOverloadedRatio:
DEBUG HttpCrawler:AutoscaledPool: scaling up {
    "oldConcurrency": 28,
    "newConcurrency": 30,
    "systemStatus": {
        "isSystemIdle": true,
        "memInfo": {
            "isOverloaded": false,
            "limitRatio": 0.2,
            "actualRatio": 0
        },
        "eventLoopInfo": {
            "isOverloaded": false,
            "limitRatio": 1.9,
            "actualRatio": 1
        },
        "cpuInfo": {
            "isOverloaded": false,
            "limitRatio": 0.4,
            "actualRatio": 0
        },
        "clientInfo": {
            "isOverloaded": false,
            "limitRatio": 0.3,
            "actualRatio": 0
        }
    }
}
Questions
Is there any misconfiguration in my setup causing the slowdown?
Are there any known issues regarding the performance of HttpCrawler when running in a Docker environment?
Are there platform-specific considerations I might have overlooked?
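To make the eventLoopInfo numbers in the log concrete, here is a minimal sketch of how I read them, assuming actualRatio is the share of recent event loop snapshots whose measured delay exceeded maxBlockedMillis. This is a simplification for illustration, not Crawlee's actual code; the names toSample and overloadedRatio and the delay values are hypothetical.

```typescript
// Simplified illustration only; NOT Crawlee's implementation.
// Assumption: a snapshot counts as "overloaded" when the measured
// event loop delay is larger than maxBlockedMillis.
interface EventLoopSample {
    createdAt: number;      // timestamp in ms (hypothetical field)
    exceededMillis: number; // how far the delay overshot the limit, 0 if within it
}

function toSample(createdAt: number, delayMillis: number, maxBlockedMillis: number): EventLoopSample {
    return { createdAt, exceededMillis: Math.max(0, delayMillis - maxBlockedMillis) };
}

// Share of samples that overshot the limit, always between 0 and 1.
function overloadedRatio(samples: EventLoopSample[]): number {
    if (samples.length === 0) return 0;
    const overloaded = samples.filter((s) => s.exceededMillis > 0).length;
    return overloaded / samples.length;
}

const maxBlockedMillis = 100;           // same value as in my config
const delays = [250, 400, 180, 900];    // made-up delays in ms, one every 2 s
const samples = delays.map((d, i) => toSample(i * 2000, d, maxBlockedMillis));

const actualRatio = overloadedRatio(samples);
const limitRatio = 1.9;                 // my maxEventLoopOverloadedRatio
const isOverloaded = actualRatio > limitRatio;

console.log({ actualRatio, limitRatio, isOverloaded });
```

Under this assumption, an actualRatio of 1 would mean every sampled snapshot was blocked for longer than 100 ms, and a limit of 1.9 could never be exceeded, which would explain why isOverloaded stays false in the log even though the event loop looks saturated.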
Code sample
No response
Package version
3.4.0
Node.js version
16.20.2
Operating system
No response
Apify platform
- Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response