
bbarclay

Running the gpt-crawler from an External Script

This example demonstrates how to use the core functionality of the gpt-crawler package outside of its CLI by importing the module's functions directly from a Node.js script. Because gpt-crawler is published as an ES module, a CommonJS script cannot `require()` it; instead, it must be loaded with a dynamic `import()`, as shown below.

```js
// test-direct-call.js (using dynamic import in CommonJS)
(async () => {
    try {
        // Dynamically import the ES module
        const { crawl, write } = await import('./node_modules/@builder.io/gpt-crawler/dist/src/core.js');

        // Define your custom configuration for the crawl
        const config = {
            url: "https://example.com",
            match: "/articles/",
            selector: "h1",
            maxPagesToCrawl: 10,
            outputFileName: "output.json",
            maxTokens: 5000,   // Optional token limit per output file
            maxFileSize: 5,    // Maximum file size in MB
        };

        // Call the crawl function directly from core.js
        console.log("Starting crawl...");
        await crawl(config);
        console.log("Crawl complete.");

        // Call the write function to store the results
        console.log("Writing output...");
        await write(config);
        console.log("Output written to:", config.outputFileName);

    } catch (error) {
        console.error("An error occurred:", error.message);
    }
})();
```
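The CommonJS-to-ESM bridge used above is independent of gpt-crawler itself. As a minimal sketch, the same `await import()` pattern can be exercised with Node's built-in `node:path` module standing in for the crawler (the filename here is hypothetical, chosen just for illustration):

```javascript
// demo-dynamic-import.js (hypothetical filename) — the same
// CommonJS-to-ESM bridge, using a built-in module so the
// pattern can be run without installing gpt-crawler.
(async () => {
    try {
        // import() returns a promise resolving to the module's namespace object
        const { posix } = await import("node:path");
        console.log(posix.join("articles", "output.json")); // articles/output.json
    } catch (error) {
        console.error("An error occurred:", error.message);
    }
})();
```

The same `await import(...)` call works for any ES-module path, including the gpt-crawler `dist` file used in the full script.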

@bbarclay bbarclay closed this by deleting the head repository Apr 5, 2025