|
1 | 1 | defmodule Html2Markdown do |
2 | 2 | @moduledoc """ |
3 | | - A library for converting HTML to Markdown syntax in Elixir |
| 3 | + Convert HTML documents to clean, readable Markdown. |
4 | 4 |
|
5 | | - ## Configuration Options |
| 5 | + Html2Markdown extracts content from HTML while filtering out |
| 6 | + navigation, advertisements, and other non-content elements. It's designed |
| 7 | + for web scraping, content migration, and any scenario where you need to |
| 8 | + convert HTML to Markdown. |
6 | 9 |
|
7 | | - The library supports configuration options via `convert/2`: |
| 10 | + ## Basic Usage |
8 | 11 |
|
9 | | - - `:navigation_classes` - Customize which CSS classes identify navigation elements to remove |
10 | | - - `:non_content_tags` - Customize which HTML tags to filter out during conversion |
11 | | - - `:markdown_flavor` - Currently only `:basic` is supported (future enhancement) |
12 | | - - `:normalize_whitespace` - Normalize whitespace in text content |
| 12 | + iex> Html2Markdown.convert("<h1>Hello</h1><p>World</p>") |
| 13 | + "\\n# Hello\\n\\n\\n\\nWorld\\n" |
13 | 14 |
|
14 | | - Note: HTML entity decoding is performed automatically by Floki for all content. |
15 | | - Common entities like `&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`, and numeric entities |
16 | | - are decoded to their corresponding characters. |
| 15 | + ## Configuration |
| 16 | +
|
| 17 | + Pass a map of options as the second argument to `convert/2`: |
| 18 | +
|
| 19 | + Html2Markdown.convert(html, %{ |
| 20 | + navigation_classes: ["nav", "menu", "sidebar"], |
| 21 | + non_content_tags: ["script", "style", "iframe"], |
| 22 | + markdown_flavor: :basic, |
| 23 | + normalize_whitespace: true |
| 24 | + }) |
| 25 | +
|
| 26 | + ## Features |
| 27 | +
|
| 28 | + - **Smart filtering** - Automatically removes common non-content elements |
| 29 | + - **HTML5 support** - Handles modern semantic elements |
| 30 | + - **Table conversion** - Converts HTML tables to Markdown tables |
| 31 | + - **Entity decoding** - Automatically handled by Floki |
| 32 | + - **Whitespace normalization** - Optional cleanup of excessive whitespace |
| 33 | + - **Configurable** - Customize filtering behavior to your needs |
| 34 | +
|
| 35 | + ## Examples |
| 36 | +
|
| 37 | + ### Web Scraping |
| 38 | +
|
| 39 | + # Extract article content from a web page |
| 40 | + {:ok, %{body: html}} = HTTPoison.get("https://example.com/article") |
| 41 | + |
| 42 | + content = Html2Markdown.convert(html, %{ |
| 43 | + navigation_classes: ["header", "footer", "nav", "sidebar"], |
| 44 | + normalize_whitespace: true |
| 45 | + }) |
| 46 | +
|
| 47 | + ### Content Migration |
| 48 | +
|
| 49 | + # Convert WordPress posts to Markdown |
| 50 | + post_html |
| 51 | + |> Html2Markdown.convert() |
| 52 | + |> File.write!("post.md") |
| 53 | +
|
| 54 | + ### Email Processing |
| 55 | +
|
| 56 | + # Clean up HTML emails |
| 57 | + email_body |
| 58 | + |> Html2Markdown.convert(%{ |
| 59 | + non_content_tags: ["style", "meta", "link"], |
| 60 | + navigation_classes: ["unsubscribe", "footer"] |
| 61 | + }) |
| 62 | +
|
| 63 | + ## Supported HTML Elements |
| 64 | +
|
| 65 | + ### Text Elements |
| 66 | + - Headings: `<h1>` through `<h6>` |
| 67 | + - Paragraphs: `<p>` |
| 68 | + - Emphasis: `<em>`, `<i>` → `*italic*` |
| 69 | + - Strong: `<strong>`, `<b>` → `**bold**` |
| 70 | + - Strikethrough: `<del>` → `~~strikethrough~~` |
| 71 | + - Code: `<code>` → `` `code` `` |
| 72 | + - Preformatted: `<pre>` → ``` code blocks ``` |
| 73 | +
|
| 74 | + ### Lists |
| 75 | + - Unordered lists: `<ul>`, `<li>` → `- item` |
| 76 | + - Ordered lists: `<ol>`, `<li>` → `1. item` |
| 77 | + - Definition lists: `<dl>`, `<dt>`, `<dd>` |
| 78 | +
|
| 79 | + ### Links and Media |
| 80 | + - Links: `<a href="...">` → `[text](url)` |
| 81 | + - Images: `<img>` → `![alt](src)` |
| 82 | + - Picture: `<picture>` with fallback to `<img>` |
| 83 | +
|
| 84 | + ### Tables |
| 85 | + Full support for HTML tables with automatic header detection: |
| 86 | +
|
| 87 | + <table> |
| 88 | + <tr><th>Name</th><th>Value</th></tr> |
| 89 | + <tr><td>Elixir</td><td>1.15</td></tr> |
| 90 | + </table> |
| 91 | +
|
| 92 | + Converts to: |
| 93 | +
|
| 94 | + | Name | Value | |
| 95 | + | --- | --- | |
| 96 | + | Elixir | 1.15 | |
| 97 | +
|
| 98 | + ### HTML5 Elements |
| 99 | + - `<details>` / `<summary>` - Collapsible sections |
| 100 | + - `<mark>` - Highlighted text (GFM: `==marked==`) |
| 101 | + - `<abbr title="...">` - Abbreviations with expansion |
| 102 | + - `<cite>` - Citations in italics |
| 103 | + - `<q cite="...">` - Inline quotes with optional citation |
| 104 | + - `<time datetime="...">` - Time with preserved datetime |
| 105 | + - `<video>` - Converted to markdown link |
| 106 | +
|
| 107 | + ## Entity Handling |
| 108 | +
|
| 109 | + HTML entities are automatically decoded by Floki: |
| 110 | + - `&amp;` → `&` |
| 111 | + - `&lt;` → `<` |
| 112 | + - `&gt;` → `>` |
| 113 | + - `&nbsp;` → non-breaking space |
| 114 | + - `&#123;` → `{` |
| 115 | + - `&laquo;` → `«` |
17 | 116 | """ |
18 | 117 |
|
19 | 118 | alias Html2Markdown.{Options, Parser, Converter} |
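
Review note on the Tables section of the new moduledoc: the documented conversion can be illustrated with a small standalone sketch. `TableSketch` and `to_markdown/1` are hypothetical names used only for illustration and are not part of Html2Markdown. The sketch assumes the rows have already been extracted from the HTML as lists of cell strings, with the first row acting as the header, and it does not escape `|` characters inside cells.

```elixir
defmodule TableSketch do
  # Hypothetical helper for illustration; not Html2Markdown's implementation.
  # Takes a list of rows (each a list of cell strings), header row first.
  def to_markdown([header | body]) do
    # One "---" cell per header column forms the separator row.
    separator = Enum.map(header, fn _ -> "---" end)

    [header, separator | body]
    |> Enum.map_join("\n", &row/1)
  end

  # Renders a single row as "| a | b |".
  defp row(cells), do: "| " <> Enum.join(cells, " | ") <> " |"
end

[["Name", "Value"], ["Elixir", "1.15"]]
|> TableSketch.to_markdown()
|> IO.puts()
```

For the two-row input above, this prints the same three lines the moduledoc shows under "Converts to:". How the library itself detects header rows is not visible in this diff.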
|