Commit 7a3e4ce

Added Python MarkItDown docs
1 parent 32ff569 commit 7a3e4ce

3 files changed: +276 -0 lines changed

docs/docs.json

Lines changed: 1 addition & 0 deletions
@@ -312,6 +312,7 @@
       "group": "Python guides",
       "pages": [
         "guides/python/python-image-processing",
+        "guides/python/python-doc-to-markdown",
         "guides/python/python-crawl4ai",
         "guides/python/python-pdf-form-extractor"
       ]

docs/guides/introduction.mdx

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ Get set up fast using our detailed walk-through guides.
 | [Cursor rules](/guides/cursor-rules) | Use Cursor rules to help write Trigger.dev tasks |
 | [Prisma](/guides/frameworks/prisma) | How to setup Prisma with Trigger.dev |
 | [Python image processing](/guides/python/python-image-processing) | Use Python and Pillow to process images |
+| [Python document to markdown](/guides/python/python-doc-to-markdown) | Use Python and MarkItDown to convert documents to markdown |
 | [Python PDF form extractor](/guides/python/python-pdf-form-extractor) | Use Python, PyMuPDF and Trigger.dev to extract data from a PDF form |
 | [Python web crawler](/guides/python/python-crawl4ai) | Use Python, Crawl4AI and Playwright to create a headless web crawler |
 | [Sequin database triggers](/guides/frameworks/sequin) | Trigger tasks from database changes using Sequin |

Lines changed: 274 additions & 0 deletions
@@ -0,0 +1,274 @@
---
title: "Convert documents to markdown using Python and MarkItDown"
sidebarTitle: "Python document to markdown"
description: "Learn how to use Trigger.dev with Python to convert documents to markdown using MarkItDown."
---

import PythonLearnMore from "/snippets/python-learn-more.mdx";

## Overview

Convert documents to Markdown using Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library. This is especially useful for preparing documents in a structured format for AI applications.

## Prerequisites

- A project with [Trigger.dev initialized](/quick-start)
- [Python](https://www.python.org/) installed on your local machine. _This example requires Python 3.10 or higher._

## Features

- A Trigger.dev task that downloads a document from a URL and runs a Python script to convert it to Markdown
- A Python script that converts documents to Markdown using Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library
- Uses our [Python build extension](/config/extensions/pythonExtension) to install dependencies and run Python scripts

## GitHub repo

<Card
  title="View the project on GitHub"
  icon="GitHub"
  href="https://github.com/triggerdotdev/examples/tree/main/python-doc-to-markdown-converter"
>
  Click here to view the full code for this project in our examples repository on GitHub. You can
  fork it and use it as a starting point for your own project.
</Card>

## The code

### Build configuration

After you've initialized your project with Trigger.dev, add these build settings to your `trigger.config.ts` file:

```ts trigger.config.ts
import { pythonExtension } from "@trigger.dev/python/extension";
import { defineConfig } from "@trigger.dev/sdk/v3";

export default defineConfig({
  runtime: "node",
  project: "<your-project-ref>",
  // Your other config settings...
  build: {
    extensions: [
      pythonExtension({
        // The path to your requirements.txt file
        requirementsFile: "./requirements.txt",
        // The path to the Python binary used in local development
        devPythonBinaryPath: `venv/bin/python`,
        // The paths to your Python scripts to run
        scripts: ["src/python/**/*.py"],
      }),
    ],
  },
});
```

<Info>
  Learn more about running Python scripts in your Trigger.dev project in the
  [Python build extension docs](/config/extensions/pythonExtension).
</Info>

### Task code

This task downloads the document to a temporary file, then uses the `python.runScript` method to run the `markdown-converter.py` script with the path to that file passed as a JSON argument.

```ts src/trigger/convertToMarkdown.ts
import { task } from "@trigger.dev/sdk/v3";
import { python } from "@trigger.dev/python";
import { z } from "zod";
import * as fs from "fs";
import * as path from "path";
import * as os from "os";
import * as https from "https";
import * as http from "http";

export const convertToMarkdown = task({
  id: "convert-to-markdown",
  run: async (payload: { url: string }) => {
    try {
      // Validate the URL up front; a failure throws the ZodError handled below
      const url = z.string().url().parse(payload.url);

      // STEP 1: Create temporary file with unique name
      const tempDir = os.tmpdir();
      const fileName = `doc-${Date.now()}-${Math.random().toString(36).substring(2, 7)}`;
      const urlPath = new URL(url).pathname;
      // Detect file extension from URL or default to .docx
      const extension = path.extname(urlPath) || ".docx";
      const tempFilePath = path.join(tempDir, `${fileName}${extension}`);

      // STEP 2: Download file from URL
      await new Promise<void>((resolve, reject) => {
        const protocol = url.startsWith("https") ? https : http;
        const file = fs.createWriteStream(tempFilePath);

        protocol
          .get(url, (response) => {
            if (response.statusCode !== 200) {
              // Discard the response body and remove the unused temp file
              response.resume();
              file.close(() => fs.unlink(tempFilePath, () => {}));
              reject(new Error(`Download failed with status ${response.statusCode}`));
              return;
            }

            response.pipe(file);
            file.on("finish", () => {
              file.close();
              resolve();
            });
          })
          .on("error", (err) => {
            // Clean up on error
            fs.unlink(tempFilePath, () => {});
            reject(err);
          });
      });

      // STEP 3: Run Python script to convert document to markdown
      const pythonResult = await python.runScript("./src/python/markdown-converter.py", [
        JSON.stringify({ file_path: tempFilePath }),
      ]);

      // STEP 4: Clean up temporary file
      fs.unlink(tempFilePath, () => {});

      // STEP 5: Process result - handle possible warnings
      // Only treat stderr as error if we don't have stdout data
      // This handles cases where non-critical warnings appear in stderr
      if (
        pythonResult.stderr &&
        !pythonResult.stderr.includes("Couldn't find ffmpeg") &&
        !pythonResult.stdout
      ) {
        throw new Error(`Python error: ${pythonResult.stderr}`);
      }

      // If we got valid stdout data, parse and use it regardless of stderr warnings
      // This ensures harmless warnings don't break the conversion
      if (pythonResult.stdout) {
        const result = JSON.parse(pythonResult.stdout);

        return {
          url,
          markdown: result.status === "success" ? result.markdown : null,
          error: result.status === "error" ? result.error : null,
          success: result.status === "success",
        };
      }

      return {
        url,
        markdown: null,
        error: "No output from Python script",
        success: false,
      };
    } catch (error) {
      if (error instanceof z.ZodError) {
        return {
          url: payload.url,
          markdown: null,
          error: "Invalid URL format: " + error.issues[0].message,
          success: false,
        };
      }

      return {
        url: payload.url,
        markdown: null,
        error: error instanceof Error ? error.message : String(error),
        success: false,
      };
    }
  },
});
```
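
The extension rule in STEP 1 decides what suffix the temporary file gets: whatever the URL path ends with, or `.docx` when the path has no extension. As a rough illustration only, here is the same rule sketched in Python (chosen to match the converter script; `temp_suffix_for` is a hypothetical helper and is not part of the example project):

```python
import os
from urllib.parse import urlparse


def temp_suffix_for(url: str, default: str = ".docx") -> str:
    """Return the file extension found in the URL path, or the default."""
    # Only the path component is inspected, so query strings and
    # fragments (e.g. "?download=1") do not affect the result.
    url_path = urlparse(url).path
    return os.path.splitext(url_path)[1] or default


if __name__ == "__main__":
    for example in (
        "https://example.com/files/report.pdf?download=1",
        "https://example.com/download",
    ):
        print(example, "->", temp_suffix_for(example))
```

One consequence worth keeping in mind: a URL such as `https://example.com/download` that actually serves a PDF would be saved with a `.docx` suffix under this rule, so URLs without a file extension may need extra handling.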

### Add a requirements.txt file

Add the following to your `requirements.txt` file. The Python build extension and the `pip install` step in the testing instructions both install dependencies from this file.

```txt requirements.txt
markitdown[all]
```

### The Python script

The Python script uses MarkItDown to convert documents to Markdown format. It reads a JSON argument containing `file_path`, converts that file, and prints a JSON result to stdout.

```python src/python/markdown-converter.py
import json
import sys
import os
from markitdown import MarkItDown

def convert_to_markdown(file_path):
    """Convert a file to markdown format using MarkItDown"""
    # Check if file exists
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    # Initialize MarkItDown
    md = MarkItDown()

    # Convert the file
    try:
        result = md.convert(file_path)
        return result.text_content
    except Exception as e:
        raise Exception(f"Error converting file: {str(e)}")

def process_trigger_task(file_path):
    """Process a file and convert to markdown"""
    try:
        markdown_result = convert_to_markdown(file_path)
        return {
            "status": "success",
            "markdown": markdown_result
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e)
        }

if __name__ == "__main__":
    # Get the file path from command line arguments
    if len(sys.argv) < 2:
        print(json.dumps({"status": "error", "error": "No file path provided"}))
        sys.exit(1)

    try:
        config = json.loads(sys.argv[1])
        file_path = config.get("file_path")

        if not file_path:
            print(json.dumps({"status": "error", "error": "No file path specified in config"}))
            sys.exit(1)

        result = process_trigger_task(file_path)
        print(json.dumps(result))
    except Exception as e:
        print(json.dumps({"status": "error", "error": str(e)}))
        sys.exit(1)
```
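
Before wiring the script into a task, it can help to exercise its command-line contract directly: one JSON argument in, one JSON object on stdout. The sketch below assumes only that contract; `run_converter` and `summarize` are hypothetical helper names, and it assumes nothing other than the JSON result is written to stdout.

```python
import json
import subprocess
import sys


def run_converter(script_path: str, file_path: str) -> dict:
    """Run a converter script with a JSON argument and parse its JSON stdout."""
    completed = subprocess.run(
        [sys.executable, script_path, json.dumps({"file_path": file_path})],
        capture_output=True,
        text=True,
    )
    # The script can report failures as JSON on stdout while exiting with
    # code 1, so stdout is parsed even when the return code is non-zero.
    if not completed.stdout.strip():
        return {"status": "error", "error": completed.stderr.strip() or "No output"}
    return json.loads(completed.stdout)


def summarize(result: dict) -> str:
    """Turn the script's result object into a one-line summary."""
    if result.get("status") == "success":
        return f"ok: {len(result.get('markdown') or '')} characters of markdown"
    return f"failed: {result.get('error')}"
```

With the virtual environment active, a call such as `summarize(run_converter("src/python/markdown-converter.py", "./sample.docx"))` (where `./sample.docx` stands in for any local document) should return either an `ok: ...` or a `failed: ...` line.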

## Testing your task

1. Create a virtual environment: `python -m venv venv`
2. Activate the virtual environment. On Mac/Linux: `source venv/bin/activate`. On Windows: `venv\Scripts\activate`
3. Install the Python dependencies: `pip install -r requirements.txt`. _Make sure you have Python 3.10 or higher installed._
4. Copy the project ref from your [Trigger.dev dashboard](https://cloud.trigger.dev) and add it to the `trigger.config.ts` file.
5. Run the Trigger.dev CLI `dev` command (it may ask you to authorize the CLI if you haven't already).
6. Test the task in the dashboard by providing a valid document URL.
7. Deploy the task to production using the Trigger.dev CLI `deploy` command.

## MarkItDown conversion capabilities

- Convert various file formats to Markdown:
  - Office formats (Word, PowerPoint, Excel)
  - PDFs
  - Images (with optional LLM-generated descriptions)
  - HTML, CSV, JSON, XML
  - Audio files (with optional transcription)
  - ZIP archives
  - And more
- Preserve document structure (headings, lists, tables, etc.)
- Handle multiple input methods (file paths, URLs, base64 data)
- Optional Azure Document Intelligence integration for better PDF and image conversion

<PythonLearnMore />
