
Very slow crawling inside docker #2053

@wicked-network

Description


Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

Good day everyone!

I am running a web crawler inside a Docker container on a robust machine. The crawler's performance starts fast but degrades significantly over time, sometimes to as slow as 1-2 requests per hour. The setup works without issues on my MacBook (not in docker).

Environment

  • Docker Desktop version: 4.21.1
  • Host Machine: Windows 11 Pro, 32GB RAM, AMD Ryzen 7 5700X 8-core, 12GB VRAM
  • Docker Image: apify/actor-node:16

Dockerfile

FROM apify/actor-node:16
WORKDIR /app

COPY ./ /app

RUN npm ci

CMD npm start --silent

I have tried tweaking various settings and searching the docs, issues, and code to figure this out myself, but without success so far. Here is my crawler setup:

import {
  Configuration,
  HttpCrawler,
  ProxyConfiguration,
  RequestQueue,
  createBasicRouter,
  log,
  type BasicCrawlingContext,
} from "crawlee";

const config = Configuration.getGlobalConfig();
config.set("availableMemoryRatio", 0.9);
config.set("memoryMbytes", 22000); // setting enough memory
config.set("persistStorage", false);

const proxyUrls = await listProxies(); // returns a list of 100 fast proxies
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls,
});

const router = createBasicRouter();

export const queue = await RequestQueue.open();
queue.timeoutSecs = 10;

const crawler = new HttpCrawler({
  useSessionPool: true,
  maxConcurrency: 30,
  sessionPoolOptions: { maxPoolSize: 100 }, // to match the number of proxies
  autoscaledPoolOptions: {
    snapshotterOptions: {
      eventLoopSnapshotIntervalSecs: 2,
      maxBlockedMillis: 100,
    },
    systemStatusOptions: {
      maxEventLoopOverloadedRatio: 1.9, // tried different options, no effect, even when it's not specified
    },
  },
  requestHandlerTimeoutSecs: 60 * 2,
  requestQueue: queue,
  keepAlive: true,

  proxyConfiguration,
  requestHandler: router,

  failedRequestHandler({ request, proxyInfo }) {
    log.debug(`Request ${request.url} failed.`);
    log.debug("Proxy info: ", proxyInfo);
  },

  postNavigationHooks: [
    async (context: BasicCrawlingContext) => {
      const { url } = context.request;

      await markUrlProcessed(url); // logic I need for my app
    },
  ],
});

await crawler.run();

In the autoscaled pool logs I noticed that the event loop is often at a ratio of 1, which is why I tried increasing the limit:

DEBUG HttpCrawler:AutoscaledPool: scaling up {
  "oldConcurrency": 28,
  "newConcurrency": 30,
  "systemStatus": {
    "isSystemIdle": true,
    "memInfo": {
      "isOverloaded": false,
      "limitRatio": 0.2,
      "actualRatio": 0
    },
    "eventLoopInfo": {
      "isOverloaded": false,
      "limitRatio": 1.9,
      "actualRatio": 1
    },
    "cpuInfo": {
      "isOverloaded": false,
      "limitRatio": 0.4,
      "actualRatio": 0
    },
    "clientInfo": {
      "isOverloaded": false,
      "limitRatio": 0.3,
      "actualRatio": 0
    }
  }
}

Questions
Is there any misconfiguration in my setup causing the slowdown?
Are there any known issues regarding the performance of HttpCrawler when running in a Docker environment?
Are there platform-specific considerations I might have overlooked?

Code sample

No response

Package version

3.4.0

Node.js version

16.20.2

Operating system

No response

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working.)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
