| 1 | +--- |
| 2 | +title: Extracting blog post content as markdown using the markdown endpoint |
| 3 | +sidebar: |
| 4 | + order: 4 |
| 5 | +--- |
| 6 | + |
| 7 | +This guide shows you how to capture the complete JSON output from Cloudflare's [`/markdown` API endpoint](/browser-rendering/rest-api/markdown-endpoint/) and extract the Markdown content from it. |
| 8 | + |
| 9 | +The example extracts the content of a Cloudflare Blog post: [Introducing AutoRAG on Cloudflare](https://blog.cloudflare.com/introducing-autorag-on-cloudflare/). |
| 10 | + |
| 11 | +## Prerequisites |
| 12 | + |
| 13 | +1. Cloudflare Account and API Token. |
| 14 | + |
| 15 | + - [Create a token](/fundamentals/api/get-started/create-token/) with **Browser Rendering: Edit** permissions. |
| 16 | + - You can do this under **My Profile → API Tokens → Create Token** on your [Cloudflare dashboard](https://dash.cloudflare.com/). |
| 17 | + - Note your **Account ID** (from the dashboard homepage) and **API Token**. |
| 18 | + |
| 19 | +2. Command-line tools installed. |
| 20 | + |
| 21 | + - cURL: a command-line tool for sending HTTP requests. |
| 22 | + - macOS/Linux: usually preinstalled. |
| 23 | + - Windows: available via WSL, Git Bash, or native Windows builds. |
| 24 | + |
| 25 | +## 1: Configure your environment variables |
| 26 | + |
| 27 | +Store your credentials in environment variables to avoid hardcoding them. |
| 28 | + |
| 29 | +```bash |
| 30 | +export CF_ACCOUNT_ID="your-cloudflare-account-id" |
| 31 | +export CF_API_TOKEN="your-api-token-with-edit-permissions" |
| 32 | +``` |
| 33 | + |
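If you later call the API from Python rather than from the shell, the same two variables can be read from the environment. The following is a minimal sketch; `require_env` is a hypothetical helper name introduced here for illustration, not part of any Cloudflare tooling.

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the value of `name`, raising if it is unset or blank.

    `environ` defaults to the real process environment; passing a plain
    dict instead makes the helper easy to test.
    """
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value
```

For example, `require_env("CF_ACCOUNT_ID")` returns the account ID exported above, and raises a `RuntimeError` if the variable was never exported in the current shell session.
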
| 34 | +## 2: Make the API request and save the raw JSON |
| 35 | + |
| 36 | +Run this command to fetch the Markdown representation of the AutoRAG blog post and save it to a local JSON file: |
| 37 | + |
| 38 | +```bash |
| 39 | +curl -s -X POST \ |
| 40 | + "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/browser-rendering/markdown" \ |
| 41 | + -H "Content-Type: application/json" \ |
| 42 | + -H "Authorization: Bearer ${CF_API_TOKEN}" \ |
| 43 | + -d '{ |
| 44 | + "url": "https://blog.cloudflare.com/introducing-autorag-on-cloudflare/" |
| 45 | + }' \ |
| 46 | +> autorag-full-response.json |
| 47 | +``` |
| 48 | + |
| 49 | +The `>` operator is shell redirection: it writes the command's output to a file (`autorag-full-response.json`) instead of printing it to the terminal. |
| 50 | + |
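The same request can be sketched in Python using only the standard library. The endpoint path, headers, and body mirror the `curl` command above. The function names `build_markdown_request` and `save_response` are illustrative assumptions, not part of a Cloudflare SDK.

```python
import json
import urllib.request
from pathlib import Path

API_BASE = "https://api.cloudflare.com/client/v4"


def build_markdown_request(account_id: str, api_token: str, page_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl command."""
    endpoint = f"{API_BASE}/accounts/{account_id}/browser-rendering/markdown"
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


def save_response(request: urllib.request.Request, output_path: str) -> None:
    """Send the request and write the raw response body to a file.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request) as response:
        Path(output_path).write_bytes(response.read())
```

Calling `save_response(build_markdown_request(account_id, api_token, url), "autorag-full-response.json")` would produce the same file as the shell redirection, assuming valid credentials.
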
| 51 | +## 3: Inspect the saved JSON |
| 52 | + |
| 53 | +You can check the start of the saved JSON file to ensure it looks right: |
| 54 | + |
| 55 | +```bash |
| 56 | +head -n 20 autorag-full-response.json |
| 57 | +``` |
| 58 | + |
| 59 | +```json output |
| 60 | +{ |
| 61 | + "success": true, |
| 62 | + "errors": [], |
| 63 | + "messages": [], |
| 64 | + "result": "# [Get Started Free](https://dash.cloudflare.com/sign-up)|[Contact Sales](https://www.cloudflare.com/plans/enterprise/contact/)\n\n[ ..." |
| 65 | +} |
| 66 | +``` |
| 67 | + |
| 68 | +## 4: Skip unwanted resources |
| 69 | + |
| 70 | +To skip unnecessary assets such as CSS, JavaScript, or images when fetching the page, add the `rejectRequestPattern` parameter: |
| 71 | + |
| 72 | +```bash |
| 73 | +curl -s -X POST \ |
| 74 | + "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/browser-rendering/markdown" \ |
| 75 | + -H "Content-Type: application/json" \ |
| 76 | + -H "Authorization: Bearer ${CF_API_TOKEN}" \ |
| 77 | + -d '{ |
| 78 | + "url": "https://blog.cloudflare.com/introducing-autorag-on-cloudflare/", |
| 79 | + "rejectRequestPattern": [ |
| 80 | + "/^.*\\.(css|js|png|svg)$/" |
| 81 | + ] |
| 82 | + }' \ |
| 83 | +> autorag-no-assets.json |
| 84 | +``` |
| 85 | + |
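To get a feel for which requests a pattern like this would cover, you can try the expression locally against a few sample URLs. This sketch assumes the pattern is applied to the full request URL and that the JavaScript-style literal `/.../` corresponds to the expression between the slashes. Both are assumptions about the service's behavior, and the sample URLs are made up.

```python
import re

# The expression between the slashes of the pattern in the request body.
# In the JSON payload the backslash is doubled ("\\."); once decoded it is "\.".
ASSET_PATTERN = re.compile(r"^.*\.(css|js|png|svg)$")

# Hypothetical URLs, chosen only to illustrate matching behavior.
SAMPLE_URLS = [
    "https://example.com/styles/main.css",
    "https://example.com/app.js",
    "https://example.com/logo.svg",
    "https://example.com/app.js?v=2",
    "https://example.com/photo.jpg",
    "https://example.com/blog/post/",
]


def is_rejected(url: str) -> bool:
    """Return True if the URL matches the asset pattern."""
    return ASSET_PATTERN.search(url) is not None


for url in SAMPLE_URLS:
    print(f"{'reject' if is_rejected(url) else 'allow ':6}  {url}")
```

Because the pattern is anchored with `$`, a URL with a query string such as `app.js?v=2` does not match, and neither does any extension missing from the list (for example `.jpg`). Extend the alternation if you need to cover more file types.
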
| 86 | +## 5: Extract and save the Markdown from the JSON file |
| 87 | + |
| 88 | +After saving the full response, you can extract just the Markdown with a short Python script. |
| 89 | + |
| 90 | +The script does the following: |
| 91 | + |
| 92 | +1. Reads the full JSON response from `autorag-full-response.json` |
| 93 | +2. Extracts the Markdown string from the `"result"` field |
| 94 | +3. Writes that Markdown to `autorag-blog.md` |
| 95 | + |
| 96 | +```py |
| 97 | +#!/usr/bin/env python3 |
| 98 | +""" |
| 99 | +extract_markdown.py |
| 100 | + |
| 101 | +Reads the full JSON response from Cloudflare's Markdown endpoint |
| 102 | +and writes the 'result' field (the converted Markdown) to a .md file. |
| 103 | +""" |
| 104 | + |
| 105 | +import json |
| 106 | +import sys |
| 107 | +from pathlib import Path |
| 108 | + |
| 109 | +# Input and output file paths |
| 110 | +INPUT_JSON = Path("autorag-full-response.json") |
| 111 | +OUTPUT_MD = Path("autorag-blog.md") |
| 112 | + |
| 113 | +def main(): |
| 114 | + # Check that the input file exists |
| 115 | + if not INPUT_JSON.is_file(): |
| 116 | + print(f"Error: Input file '{INPUT_JSON}' not found.", file=sys.stderr) |
| 117 | + sys.exit(1) |
| 118 | + |
| 119 | + # Load the JSON response |
| 120 | + try: |
| 121 | + with INPUT_JSON.open("r", encoding="utf-8") as f: |
| 122 | + data = json.load(f) |
| 123 | + except json.JSONDecodeError as e: |
| 124 | + print(f"Error: Failed to parse JSON in '{INPUT_JSON}': {e}", file=sys.stderr) |
| 125 | + sys.exit(1) |
| 126 | + |
| 127 | + # Validate structure |
| 128 | + if not data.get("success", False): |
| 129 | + print("Error: API reported failure.", file=sys.stderr) |
| 130 | + errors = data.get("errors") or data.get("messages") |
| 131 | + if errors: |
| 132 | + print("Details:", errors, file=sys.stderr) |
| 133 | + sys.exit(1) |
| 134 | + |
| 135 | + if "result" not in data: |
| 136 | + print("Error: 'result' field not found in JSON.", file=sys.stderr) |
| 137 | + sys.exit(1) |
| 138 | + |
| 139 | + # Extract and write the Markdown |
| 140 | + markdown_content = data["result"] |
| 141 | + try: |
| 142 | + with OUTPUT_MD.open("w", encoding="utf-8") as md_file: |
| 143 | + md_file.write(markdown_content) |
| 144 | + except IOError as e: |
| 145 | + print(f"Error: Could not write to '{OUTPUT_MD}': {e}", file=sys.stderr) |
| 146 | + sys.exit(1) |
| 147 | + |
| 148 | + print(f"Success: Markdown content written to '{OUTPUT_MD}'.") |
| 149 | + |
| 150 | +if __name__ == "__main__": |
| 151 | + main() |
| 152 | +``` |
| 153 | + |
| 154 | +### Usage |
| 155 | + |
| 156 | +1. Ensure you have run the `curl` command to produce `autorag-full-response.json`. |
| 157 | + |
| 158 | +2. Place `extract_markdown.py` in the same directory. |
| 159 | + |
| 160 | +3. Run: |
| 161 | + |
| 162 | +```bash |
| 163 | +python3 extract_markdown.py |
| 164 | +``` |
| 165 | + |
| 166 | +After execution, `autorag-blog.md` will contain the extracted Markdown. |
| 167 | + |
| 168 | +## Final folder structure |
| 169 | + |
| 170 | +After following these steps, your working folder will look like: |
| 171 | + |
| 172 | +``` |
| 173 | +. |
| 174 | +├── autorag-full-response.json # Full API response |
| 175 | +├── autorag-no-assets.json # Full API response without extra assets (optional) |
| 176 | +├── autorag-blog.md # Extracted Markdown content |
| 177 | +└── extract_markdown.py # Python extraction script (optional) |
| 178 | +``` |