Commit 2994c39

Jina tool (#185)
feat(tools): add jina-url-to-markdown tool integration

- Updated package.json to include jina-url-to-markdown with ESM and CJS entry points.
- Modified rollup.config.mjs to include jina-url-to-markdown in the toolFolders array.
- Exported jina-url-to-markdown from the main index.js file for tools.

This integration adds a tool to the tools package that converts the content of a URL to Markdown through Jina.
2 parents bec568f + f292ed1 commit 2994c39

8 files changed: +380 -0 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ yarn-error.log*
 .DS_Store
 *.pem
 _todo.md
+.vscode
 
 *storybook.log
 _todo.md

packages/tools/package.json

Lines changed: 4 additions & 0 deletions
@@ -53,6 +53,10 @@
     "./make-webhook": {
       "import": "./dist/make-webhook/index.esm.js",
       "require": "./dist/make-webhook/index.cjs.js"
+    },
+    "./jina-url-to-markdown": {
+      "import": "./dist/jina-url-to-markdown/index.esm.js",
+      "require": "./dist/jina-url-to-markdown/index.cjs.js"
     }
   },
   "files": [

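The new `exports` entry gives the tool its own subpath with separate targets for ESM (`import`) and CommonJS (`require`) consumers. The sketch below illustrates how such a conditional map selects a file. `resolveExport` is a hypothetical, heavily simplified helper written for this note; it is not Node.js's actual resolution algorithm, which also handles nested conditions, patterns, and fallbacks.

```javascript
// Subset of the "exports" map from packages/tools/package.json after this commit.
const exportsMap = {
  './make-webhook': {
    import: './dist/make-webhook/index.esm.js',
    require: './dist/make-webhook/index.cjs.js',
  },
  './jina-url-to-markdown': {
    import: './dist/jina-url-to-markdown/index.esm.js',
    require: './dist/jina-url-to-markdown/index.cjs.js',
  },
};

// Hypothetical helper: pick the target file for a subpath and a condition
// ("import" for ESM consumers, "require" for CommonJS consumers).
function resolveExport(map, subpath, condition) {
  const entry = map[subpath];
  if (!entry) {
    throw new Error(`Subpath "${subpath}" is not exported`);
  }
  const target = entry[condition];
  if (!target) {
    throw new Error(`No "${condition}" target for "${subpath}"`);
  }
  return target;
}

const esmEntry = resolveExport(exportsMap, './jina-url-to-markdown', 'import');
const cjsEntry = resolveExport(exportsMap, './jina-url-to-markdown', 'require');
console.log(esmEntry);
console.log(cjsEntry);
```

Because both targets are listed, the same subpath should work from either module system, provided the build actually emits both files.
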
packages/tools/rollup.config.mjs

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ const toolFolders = [
   'textfile-search',
   'zapier-webhook',
   'make-webhook',
+  'jina-url-to-markdown',
 ]; // Add more folder names as needed
 
 const toolConfigs = toolFolders.map((tool) => {
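The body of the `toolFolders.map(...)` callback is not visible in this hunk. As a rough sketch of what a per-tool mapping could look like, the following assumes each tool lives at `src/<tool>/index.js` and is emitted as the two bundles named in package.json. The `buildToolConfig` name and the exact input path are assumptions for illustration, not the repository's actual configuration.

```javascript
// Shortened folder list for the example; the real array contains more entries.
const toolFolders = ['make-webhook', 'jina-url-to-markdown'];

// Hypothetical per-tool Rollup config: one input, two outputs (ESM and CJS).
// The output paths match the "exports" targets added to package.json.
function buildToolConfig(tool) {
  return {
    input: `src/${tool}/index.js`, // assumed source layout
    output: [
      { file: `dist/${tool}/index.esm.js`, format: 'es' },
      { file: `dist/${tool}/index.cjs.js`, format: 'cjs' },
    ],
  };
}

const toolConfigs = toolFolders.map(buildToolConfig);
console.log(JSON.stringify(toolConfigs[1], null, 2));
```

Under these assumptions, adding a folder name to the array is all that is needed to produce both bundles for a new tool.
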

packages/tools/src/index.js

Lines changed: 1 addition & 0 deletions
@@ -10,3 +10,4 @@ export * from './pdf-search/index.js';
 export * from './textfile-search/index.js';
 export * from './zapier-webhook/index.js';
 export * from './make-webhook/index.js';
+export * from './jina-url-to-markdown/index.js';
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+# Jina URL to Markdown
+
+This tool integrates with Jina (https://jina.ai/), a web scraping and crawling service designed to turn websites into LLM-ready data. It enables the extraction of clean, well-formatted content from websites, making it ideal for AI applications, particularly those using Large Language Models (LLMs).
+
+## Components
+
+The tool uses the following components:
+
+- A Jina API client instance
+- An API key for authentication
+- A custom HTTP client (ky) for making API requests
+- Input validation using a Zod schema
+- Configurable output format
+
+## Key Features
+
+- Scrapes and crawls websites, even those with dynamic content
+- Converts web content into clean, LLM-ready markdown
+- Handles complex web scraping challenges:
+  - Rate limits
+  - JavaScript rendering
+  - Anti-bot mechanisms
+- Multiple output format options
+- Clean, structured data extraction
+- Support for dynamic content
+- Automatic content cleaning and formatting
+
+## Input
+
+The input should be a JSON object with a "url" field containing the URL to scrape and retrieve content from.
+
+## Output
+
+The output is the scraped content from the specified URL, formatted according to the configured format (default: markdown).
+
+## Configuration Options
+
+- `apiKey`: Your Jina API key (optional)
+- `options`: Options for the Jina API request (optional)
+
+## Example
+
+```javascript
+const tool = new JinaUrlToMarkdown();
+
+const result = await tool._call({
+  url: 'https://example.com',
+});
+```
+
+## Advanced Example with Custom Options and Error Handling
+
+```javascript
+const tool = new JinaUrlToMarkdown({
+  apiKey: process.env.JINA_API_KEY,
+  options: {
+    targetSelector: ['body', '.class', '#id'],
+    retainImages: 'none',
+  },
+});
+
+try {
+  const result = await tool._call({
+    url: 'https://example.com/blog/article',
+  });
+
+  // Process the scraped content
+  console.log('Markdown content:', result);
+
+  // Use the content with an LLM or other processing
+  // ...
+} catch (error) {
+  console.error('Error scraping website:', error);
+}
+```
+
+For more information about Jina, visit: https://jina.ai/, https://r.jina.ai/docs
+
+### Disclaimer
+
+Ensure you have proper API credentials and respect Jina's usage terms and rate limits. The service offers flexible pricing plans, including a free tier for small-scale use. When scraping websites, make sure to comply with the target website's terms of service and robots.txt directives.
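
The two configuration options in the README map onto the outgoing request in a simple way: `apiKey` becomes a bearer token header and `options` is merged into the JSON body next to `url`. The sketch below mirrors what the tool source in this commit does. `buildJinaRequest` is a hypothetical helper name used only for illustration, and no network call is made.

```javascript
// Hypothetical helper: assemble the pieces of the POST request that the tool
// sends to https://r.jina.ai/, without performing the request.
function buildJinaRequest(url, { apiKey, options = {} } = {}) {
  const headers = { 'Content-Type': 'application/json' };
  if (apiKey) {
    // Only authenticated calls carry an Authorization header.
    headers.Authorization = `Bearer ${apiKey}`;
  }
  return {
    endpoint: 'https://r.jina.ai/',
    headers,
    // `options` is spread after `url`, so a `url` key inside `options`
    // would override the caller's URL.
    body: { url, ...options },
  };
}

const anonymous = buildJinaRequest('https://example.com');
const authenticated = buildJinaRequest('https://example.com/blog/article', {
  apiKey: 'your-api-key',
  options: { targetSelector: ['body', '.class', '#id'], retainImages: 'none' },
});
console.log(anonymous);
console.log(authenticated);
```

Since the options are passed through without validation, any unsupported key would reach the API unchanged.
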
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+/**
+ * Jina URL to Markdown
+ *
+ * This tool integrates with Jina (https://jina.ai/), a web scraping
+ * and crawling service designed to turn websites into LLM-ready data.
+ *
+ * Jina allows you to extract clean, well-formatted markdown or structured data
+ * from websites, making it ideal for AI applications, particularly those using
+ * Large Language Models (LLMs).
+ *
+ * Key features of Jina:
+ * - Scrapes and crawls websites, even those with dynamic content
+ * - Converts web content into clean, LLM-ready markdown
+ * - Handles challenges like rate limits, JavaScript rendering, and anti-bot mechanisms
+ * - Offers flexible pricing plans, including a free tier for small-scale use
+ *
+ * Usage:
+ * const tool = new JinaUrlToMarkdown();
+ * const result = await tool._call({ url: 'https://example.com' });
+ * or
+ * const tool = new JinaUrlToMarkdown({ apiKey: 'your-api-key', options: { 'targetSelector': ['body', '.class', '#id'], 'retainImages': 'none' } });
+ * const result = await tool._call({ url: 'https://example.com' });
+ *
+ * For more information about Jina, visit: https://jina.ai/, https://r.jina.ai/docs
+ */
+
+import { Tool } from '@langchain/core/tools';
+import { z } from 'zod';
+import ky, { HTTPError } from 'ky';
+
+export class JinaUrlToMarkdown extends Tool {
+  // Default `fields` to {} so that `new JinaUrlToMarkdown()` works as documented
+  constructor(fields = {}) {
+    super(fields);
+    this.name = 'jina-url-to-markdown';
+    this.apiKey = fields.apiKey;
+    this.options = fields.options || {};
+    this.description = `Fetches web content from a specified URL and returns it in Markdown format. Input should be a JSON object with a "url".`;
+
+    this.headers = { 'Content-Type': 'application/json' };
+
+    if (this.apiKey) {
+      this.headers.Authorization = `Bearer ${this.apiKey}`;
+    }
+    // Define the input schema using Zod
+    this.schema = z.object({
+      url: z.string().describe('The URL to scrape and retrieve content from.'),
+    });
+
+    this.httpClient = ky;
+  }
+
+  async _call(input) {
+    try {
+      const response = await this.httpClient
+        .post(`https://r.jina.ai/`, {
+          json: {
+            url: input.url,
+            ...this.options,
+          },
+          headers: this.headers,
+        })
+        .json();
+
+      return response?.data || 'The API returned an empty response.';
+    } catch (error) {
+      if (error instanceof HTTPError) {
+        const statusCode = error.response.status;
+        let errorType = 'Unknown';
+        if (statusCode >= 400 && statusCode < 500) {
+          errorType = 'Client Error';
+        } else if (statusCode >= 500) {
+          errorType = 'Server Error';
+        }
+        return `API request failed: ${errorType} (${statusCode})`;
+      } else {
+        return `An unexpected error occurred: ${error.message}`;
+      }
+    }
+  }
+}
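
The `catch` branch in `_call` turns failures into descriptive strings instead of rethrowing them. The sketch below isolates that mapping so it can be exercised without a network call. `describeFailure` is a hypothetical helper, and it detects HTTP failures by looking for `error.response.status` rather than `instanceof HTTPError`, which is an assumption made to keep the example free of dependencies.

```javascript
// Hypothetical helper reproducing the status-code classification used in _call.
function describeFailure(error) {
  const statusCode =
    error && error.response ? error.response.status : undefined;

  if (typeof statusCode === 'number') {
    let errorType = 'Unknown';
    if (statusCode >= 400 && statusCode < 500) {
      errorType = 'Client Error';
    } else if (statusCode >= 500) {
      errorType = 'Server Error';
    }
    return `API request failed: ${errorType} (${statusCode})`;
  }
  // Anything without an HTTP status (network failure, parse error, ...)
  return `An unexpected error occurred: ${error.message}`;
}

// Stand-ins for errors an HTTP client might raise.
const notFound = describeFailure({ response: { status: 404 } });
const unavailable = describeFailure({ response: { status: 503 } });
const network = describeFailure(new Error('fetch failed'));
console.log(notFound);
console.log(unavailable);
console.log(network);
```

One consequence of this design is that request failures reach the caller as ordinary return values, so a surrounding `try`/`catch` like the one in the README's advanced example would not fire for them. Callers who need to distinguish success from failure would have to inspect the returned string.
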
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+import { ToolPreviewer } from '../_utils/ToolPreviewer.jsx';
+import { AgentWithToolPreviewer } from '../_utils/AgentWithToolPreviewer.jsx';
+import { JinaUrlToMarkdown } from './index.js';
+import { Agent, Task, Team } from '../../../../src/index';
+import React from 'react';
+
+// More on how to set up stories at: https://storybook.js.org/docs/writing-stories#default-export
+export default {
+  title: 'Tools/Jina URL to Markdown',
+  parameters: {
+    // Optional parameter to center the component in the Canvas. More info: https://storybook.js.org/docs/configure/story-layout
+    layout: 'centered',
+  },
+  // This component will have an automatically generated Autodocs entry: https://storybook.js.org/docs/writing-docs/autodocs
+  tags: ['autodocs'],
+  // More on argTypes: https://storybook.js.org/docs/api/argtypes
+  argTypes: {
+    // backgroundColor: { control: 'color' },
+    // url: { control: 'text' },
+    // apiKey: { control: 'text' },
+    // format: { control: 'select', options: ['markdown', 'json']},
+    // initializationCode: { table: { disable: true } },
+    // executionCode: { table: { disable: true } }
+  },
+};
+
+const jinaUrlToMarkdownTool = new JinaUrlToMarkdown({
+  // apiKey: import.meta.env.VITE_JINA_API_KEY,
+  // options: {
+  //   targetSelector: ['body', '.class', '#id'],
+  //   retainImages: 'none',
+  // },
+});
+
+// More on writing stories with args: https://storybook.js.org/docs/writing-stories/args
+export const Default = {
+  render: (args) => <ToolPreviewer {...args} />,
+  args: {
+    toolInstance: jinaUrlToMarkdownTool,
+    callParams: {
+      url: 'https://www.kaibanjs.com',
+    },
+  },
+};
+
+// Create an agent with the Jina URL to Markdown tool
+const webResearcher = new Agent({
+  name: 'Web Researcher',
+  role: 'Web Content Analyzer',
+  goal: 'Extract and analyze content from specified websites',
+  tools: [jinaUrlToMarkdownTool],
+});
+
+// Create a research task
+const webAnalysisTask = new Task({
+  description:
+    'Fetches web content from the following URL: {url} and provides a structured summary',
+  agent: webResearcher,
+  expectedOutput: 'A well-formatted analysis of the website content',
+});
+
+// Create the team
+const team = new Team({
+  name: 'Web Analysis Unit',
+  description: 'Specialized team for web content extraction and analysis',
+  agents: [webResearcher],
+  tasks: [webAnalysisTask],
+  inputs: {
+    url: 'https://www.kaibanjs.com',
+  },
+  env: {
+    OPENAI_API_KEY: import.meta.env.VITE_OPENAI_API_KEY,
+  },
+});
+
+// More on writing stories with args: https://storybook.js.org/docs/writing-stories/args
+export const withAgent = {
+  render: (args) => <AgentWithToolPreviewer {...args} />,
+  args: {
+    team: team,
+  },
+};
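
In the story, the task description contains a `{url}` placeholder and the team supplies a matching `inputs.url` value. How the framework performs the substitution is not shown in this commit. The sketch below only illustrates the general idea with a hypothetical `interpolate` helper and should not be read as the library's implementation.

```javascript
// Hypothetical helper: replace {key} placeholders with values from `inputs`.
// Placeholders without a matching key are left untouched.
function interpolate(template, inputs) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(inputs, key)
      ? String(inputs[key])
      : match
  );
}

const description = interpolate(
  'Fetches web content from the following URL: {url} and provides a structured summary',
  { url: 'https://www.kaibanjs.com' }
);
console.log(description);
```
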
