Commit a92cafb

chore: bump version to 0.2.0 and enhance documentation
Major release with significant improvements:
- Module refactoring for better maintainability
- Performance optimizations with MapSet and IOLists
- HTML5 element support (details, mark, abbr, cite, q, time, video)
- Type specifications for all public functions
- Comprehensive documentation and examples
1 parent 4677384 commit a92cafb

8 files changed, 288 additions and 17 deletions
LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2024 Chase Pursley
+Copyright (c) 2025 Chase Pursley

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md

Lines changed: 92 additions & 5 deletions
@@ -1,6 +1,10 @@
 # Html2Markdown

-Extract the content from an HTML document to Markdown (removing non-content sections and tags)
+[![Hex.pm](https://img.shields.io/hexpm/v/html2markdown.svg)](https://hex.pm/packages/html2markdown)
+[![Hex Docs](https://img.shields.io/badge/hex-docs-purple.svg)](https://hexdocs.pm/html2markdown)
+[![License](https://img.shields.io/hexpm/l/html2markdown.svg)](https://github.com/cpursley/html2markdown/blob/main/LICENSE)
+
+Convert HTML to clean, readable Markdown. Designed for content extraction, this library intelligently handles common HTML patterns while filtering out non-content elements like navigation and scripts.

 ## Installation

@@ -9,16 +13,99 @@ Add `html2markdown` to your list of dependencies in `mix.exs`:
 ```elixir
 def deps do
   [
-    {:html2markdown, "~> 0.1.6"}
+    {:html2markdown, "~> 0.2.0"}
   ]
 end
 ```

-## Usage
+## Quick Start
+
+```elixir
+# Basic conversion
+Html2Markdown.convert("<h1>Hello World</h1><p>Welcome to <strong>Elixir</strong>!</p>")
+# => "\n# Hello World\n\n\n\nWelcome to **Elixir**!\n"
+
+# With custom options
+Html2Markdown.convert(html, %{
+  navigation_classes: ["nav", "menu", "custom-nav"],
+  normalize_whitespace: true
+})
+```
+
+## Features
+
+- **Smart Content Extraction**: Automatically removes navigation, ads, and other non-content elements
+- **HTML5 Support**: Handles modern semantic elements like `<details>`, `<mark>`, `<time>`
+- **Table Conversion**: Converts HTML tables to clean Markdown tables
+- **Entity Handling**: Properly decodes HTML entities (`&amp;`, `&lt;`, `&nbsp;`, etc.)
+- **Configurable**: Customize filtering and processing behavior
+
+## Configuration Options
+
+```elixir
+Html2Markdown.convert(html, %{
+  # CSS classes that identify navigation elements to remove
+  navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],
+
+  # HTML tags to filter out during conversion
+  non_content_tags: ["script", "style", "form", "nav", ...],
+
+  # Markdown flavor (currently :basic, future: :gfm, :commonmark)
+  markdown_flavor: :basic,
+
+  # Normalize whitespace (collapses multiple spaces, trims)
+  normalize_whitespace: true
+})
+```
+
+## Common Use Cases
+
+### Web Scraping
+Extract readable content from web pages:
+
+```elixir
+{:ok, %{body: html}} = HTTPoison.get(url)
+markdown = Html2Markdown.convert(html)
+```
+
+### Content Migration
+Convert existing HTML content to Markdown:

 ```elixir
-Html2Markdown.convert(html)
+# Convert blog posts from HTML to Markdown
+html_content
+|> Html2Markdown.convert(%{normalize_whitespace: true})
+|> save_as_markdown()
 ```

-Docs can be found at <https://hexdocs.pm/html2markdown>.
+### Email Processing
+Clean up HTML emails for plain text storage:
+
+```elixir
+email_html
+|> Html2Markdown.convert(%{
+  non_content_tags: ["style", "script", "meta"],
+  navigation_classes: ["unsubscribe", "footer"]
+})
+```
+
+## Supported Elements
+
+- **Headings**: `<h1>` through `<h6>`
+- **Text**: Paragraphs, emphasis (`<em>`, `<i>`), strong (`<strong>`, `<b>`)
+- **Lists**: Ordered and unordered lists with nesting
+- **Links**: `<a>` tags with proper URL handling
+- **Images**: `<img>` and `<picture>` elements
+- **Code**: Both inline `<code>` and block `<pre>` elements
+- **Tables**: Full table support with headers
+- **Quotes**: `<blockquote>` and `<q>` elements
+- **HTML5**: `<details>`, `<summary>`, `<mark>`, `<abbr>`, `<cite>`, `<time>`, `<video>`
+
+## Documentation
+
+Full documentation is available at [https://hexdocs.pm/html2markdown](https://hexdocs.pm/html2markdown).
+
+## License
+
+MIT License - see [LICENSE](LICENSE) file for details.
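The README describes `normalize_whitespace` as collapsing multiple spaces and trimming. A minimal sketch of that behavior, assuming a plain regex-based approach; the module `WhitespaceSketch` is hypothetical and is not the library's implementation, which may treat newlines and code blocks differently:

```elixir
defmodule WhitespaceSketch do
  # Hypothetical helper for illustration only; not part of Html2Markdown.
  # Collapses runs of spaces and tabs into one space, then trims both ends.
  def normalize(text) when is_binary(text) do
    text
    |> String.replace(~r/[ \t]+/, " ")
    |> String.trim()
  end
end

IO.inspect(WhitespaceSketch.normalize("  Hello    world \t "))
# prints "Hello world"
```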

lib/html2markdown.ex

Lines changed: 109 additions & 10 deletions
@@ -1,19 +1,118 @@
 defmodule Html2Markdown do
   @moduledoc """
-  A library for converting HTML to Markdown syntax in Elixir
+  Convert HTML documents to clean, readable Markdown.

-  ## Configuration Options
+  Html2Markdown intelligently extracts content from HTML while filtering out
+  navigation, advertisements, and other non-content elements. It's designed
+  for web scraping, content migration, and any scenario where you need to
+  convert HTML to Markdown.

-  The library supports configuration options via `convert/2`:
+  ## Basic Usage

-  - `:navigation_classes` - Customize which CSS classes identify navigation elements to remove
-  - `:non_content_tags` - Customize which HTML tags to filter out during conversion
-  - `:markdown_flavor` - Currently only `:basic` is supported (future enhancement)
-  - `:normalize_whitespace` - Normalize whitespace in text content
+      iex> Html2Markdown.convert("<h1>Hello</h1><p>World</p>")
+      "\\n# Hello\\n\\n\\n\\nWorld\\n"

-  Note: HTML entity decoding is performed automatically by Floki for all content.
-  Common entities like &amp;, &lt;, &gt;, &quot;, &#39;, &nbsp; and numeric entities
-  are decoded to their corresponding characters.
+  ## Configuration
+
+  The library supports extensive configuration through the second parameter:
+
+      Html2Markdown.convert(html, %{
+        navigation_classes: ["nav", "menu", "sidebar"],
+        non_content_tags: ["script", "style", "iframe"],
+        markdown_flavor: :basic,
+        normalize_whitespace: true
+      })
+
+  ## Features
+
+  - **Smart filtering** - Automatically removes common non-content elements
+  - **HTML5 support** - Handles modern semantic elements
+  - **Table conversion** - Converts HTML tables to Markdown tables
+  - **Entity decoding** - Automatically handled by Floki
+  - **Whitespace normalization** - Optional cleanup of excessive whitespace
+  - **Configurable** - Customize filtering behavior to your needs
+
+  ## Examples
+
+  ### Web Scraping
+
+      # Extract article content from a web page
+      {:ok, %{body: html}} = HTTPoison.get("https://example.com/article")
+
+      content = Html2Markdown.convert(html, %{
+        navigation_classes: ["header", "footer", "nav", "sidebar"],
+        normalize_whitespace: true
+      })
+
+  ### Content Migration
+
+      # Convert WordPress posts to Markdown
+      # (File.write!/2 takes the path first, so the piped Markdown goes in via then/2)
+      post_html
+      |> Html2Markdown.convert()
+      |> then(&File.write!("post.md", &1))
+
+  ### Email Processing
+
+      # Clean up HTML emails
+      email_body
+      |> Html2Markdown.convert(%{
+        non_content_tags: ["style", "meta", "link"],
+        navigation_classes: ["unsubscribe", "footer"]
+      })
+
+  ## Supported HTML Elements
+
+  ### Text Elements
+  - Headings: `<h1>` through `<h6>`
+  - Paragraphs: `<p>`
+  - Emphasis: `<em>`, `<i>` → `*italic*`
+  - Strong: `<strong>`, `<b>` → `**bold**`
+  - Strikethrough: `<del>` → `~~strikethrough~~`
+  - Code: `<code>` → `` `code` ``
+  - Preformatted: `<pre>` → ``` code blocks ```
+
+  ### Lists
+  - Unordered lists: `<ul>`, `<li>` → `- item`
+  - Ordered lists: `<ol>`, `<li>` → `1. item`
+  - Definition lists: `<dl>`, `<dt>`, `<dd>`
+
+  ### Links and Media
+  - Links: `<a href="...">` → `[text](url)`
+  - Images: `<img>` → `![alt](src)`
+  - Picture: `<picture>` with fallback to `<img>`
+
+  ### Tables
+  Full support for HTML tables with automatic header detection:
+
+      <table>
+        <tr><th>Name</th><th>Value</th></tr>
+        <tr><td>Elixir</td><td>1.15</td></tr>
+      </table>
+
+  Converts to:
+
+      | Name | Value |
+      | --- | --- |
+      | Elixir | 1.15 |
+
+  ### HTML5 Elements
+  - `<details>` / `<summary>` - Collapsible sections
+  - `<mark>` - Highlighted text (GFM: `==marked==`)
+  - `<abbr title="...">` - Abbreviations with expansion
+  - `<cite>` - Citations in italics
+  - `<q cite="...">` - Inline quotes with optional citation
+  - `<time datetime="...">` - Time with preserved datetime
+  - `<video>` - Converted to markdown link
+
+  ## Entity Handling
+
+  HTML entities are automatically decoded by Floki:
+  - `&amp;` → `&`
+  - `&lt;` → `<`
+  - `&gt;` → `>`
+  - `&nbsp;` → non-breaking space
+  - `&#123;` → `{`
+  - `&#xAB;` → `«`
   """

   alias Html2Markdown.{Options, Parser, Converter}

lib/html2markdown/converter.ex

Lines changed: 20 additions & 0 deletions
@@ -1,6 +1,26 @@
 defmodule Html2Markdown.Converter do
   @moduledoc """
   Handles the conversion of HTML nodes to Markdown format.
+
+  This module is responsible for transforming parsed HTML nodes into their
+  Markdown equivalents. It uses an efficient IOList-based approach for
+  building the output string.
+
+  ## Implementation Details
+
+  The converter uses pattern matching to handle different HTML elements:
+  - Headers (`h1`-`h6`) → Markdown headers with appropriate `#` prefixes
+  - Text formatting (`strong`, `em`, `del`) → Markdown emphasis markers
+  - Lists (`ul`, `ol`) → Markdown list syntax with proper nesting
+  - Tables → Delegated to `Html2Markdown.TableConverter`
+  - Links and images → Markdown link syntax
+  - Code blocks → Fenced code blocks with language detection
+
+  ## Performance Optimizations
+
+  - Uses IOList building instead of string concatenation
+  - Processes nodes in a single pass
+  - Preserves whitespace in code blocks while normalizing elsewhere
   """

   alias Html2Markdown.{TableConverter, Options}
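The pattern-matching and IOList approach described in this moduledoc can be illustrated with a small sketch. It assumes Floki-style `{tag, attrs, children}` tuples as input and covers only a few tags; `IOListSketch` is a hypothetical module and does not reproduce the actual `Converter` code or its exact output format:

```elixir
defmodule IOListSketch do
  # Hypothetical module for illustration; not the library's Converter.
  # Each clause returns nested iodata rather than a concatenated string.
  def to_iodata(text) when is_binary(text), do: text
  def to_iodata(nodes) when is_list(nodes), do: Enum.map(nodes, &to_iodata/1)
  def to_iodata({"h1", _attrs, children}), do: ["# ", to_iodata(children), "\n\n"]
  def to_iodata({"strong", _attrs, children}), do: ["**", to_iodata(children), "**"]
  def to_iodata({"em", _attrs, children}), do: ["*", to_iodata(children), "*"]
  def to_iodata({"p", _attrs, children}), do: [to_iodata(children), "\n\n"]
  # Unknown tags contribute only their children.
  def to_iodata({_tag, _attrs, children}), do: to_iodata(children)
end

tree = [
  {"h1", [], ["Hello"]},
  {"p", [], ["Welcome to ", {"strong", [], ["Elixir"]}, "!"]}
]

# The nested list is flattened into a binary only once, at the end.
tree
|> IOListSketch.to_iodata()
|> IO.iodata_to_binary()
|> IO.puts()
```

Building nested lists and flattening once avoids copying the growing binary at every node, which is the usual motivation for iodata in Elixir.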

lib/html2markdown/options.ex

Lines changed: 25 additions & 0 deletions
@@ -1,6 +1,31 @@
 defmodule Html2Markdown.Options do
   @moduledoc """
   Handles configuration options for HTML to Markdown conversion.
+
+  ## Available Options
+
+  - `:navigation_classes` - List of CSS classes that identify navigation elements to remove.
+    Default: `["footer", "menu", "nav", "sidebar", "aside"]`
+
+  - `:non_content_tags` - List of HTML tags to filter out completely.
+    Default includes tags like `script`, `style`, `iframe`, etc.
+
+  - `:markdown_flavor` - The markdown variant to generate.
+    Currently only `:basic` is supported. Future versions may support `:gfm` and `:commonmark`.
+
+  - `:normalize_whitespace` - Whether to collapse multiple spaces and trim whitespace.
+    Default: `true`. Code blocks always preserve whitespace regardless of this setting.
+
+  ## Examples
+
+      # Use all defaults
+      options = Options.defaults()
+
+      # Merge custom options with defaults
+      custom = Options.merge(%{
+        navigation_classes: ["custom-nav", "advertisement"],
+        normalize_whitespace: false
+      })
   """

   @type t :: %{
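As a rough illustration of the defaults-plus-overrides behavior documented above, the sketch below uses `Map.merge/2`. The module name `OptionsSketch` is hypothetical, the `non_content_tags` list is a shortened stand-in for the real default, and the actual `Options.merge/1` may validate or transform values in ways not shown here:

```elixir
defmodule OptionsSketch do
  # Hypothetical module for illustration; not Html2Markdown.Options itself.
  @defaults %{
    navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],
    # Shortened stand-in; the real default list is longer.
    non_content_tags: ["script", "style", "iframe"],
    markdown_flavor: :basic,
    normalize_whitespace: true
  }

  def defaults, do: @defaults

  # Keys present in `custom` override the defaults; all others are kept.
  def merge(custom) when is_map(custom), do: Map.merge(@defaults, custom)
end

opts = OptionsSketch.merge(%{normalize_whitespace: false})
IO.inspect(opts.normalize_whitespace)
IO.inspect(opts.markdown_flavor)
```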

lib/html2markdown/parser.ex

Lines changed: 15 additions & 0 deletions
@@ -1,6 +1,21 @@
 defmodule Html2Markdown.Parser do
   @moduledoc """
   Handles HTML preprocessing and parsing operations.
+
+  This module is responsible for:
+  1. Parsing HTML content using Floki
+  2. Filtering out non-content elements
+  3. Preparing the document tree for conversion
+
+  ## Filtering Strategy
+
+  The parser removes elements in two ways:
+  - **Tag-based filtering**: Removes elements like `<script>`, `<style>`, `<nav>`
+  - **Class-based filtering**: Removes elements with navigation classes like "footer", "sidebar"
+
+  ## Performance
+
+  Uses MapSet for O(1) lookup performance when checking tags and classes.
   """

   alias Html2Markdown.Options
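The two filtering strategies described above can be sketched as follows, assuming Floki-style `{tag, attrs, children}` tuples. `FilterSketch` and its function names are hypothetical and are not the library's `Parser`:

```elixir
defmodule FilterSketch do
  # Hypothetical module for illustration; not the library's Parser.
  def filter(nodes, tags, classes) do
    # Build the sets once so each membership check avoids a list scan.
    do_filter(nodes, MapSet.new(tags), MapSet.new(classes))
  end

  defp do_filter(nodes, tag_set, class_set) when is_list(nodes) do
    Enum.flat_map(nodes, fn
      text when is_binary(text) ->
        [text]

      {tag, attrs, children} ->
        if MapSet.member?(tag_set, tag) or has_class?(attrs, class_set) do
          # Drop the element together with its whole subtree.
          []
        else
          [{tag, attrs, do_filter(children, tag_set, class_set)}]
        end

      _other ->
        # Comments and other node types are dropped in this sketch.
        []
    end)
  end

  defp has_class?(attrs, class_set) do
    case List.keyfind(attrs, "class", 0) do
      {"class", value} ->
        value |> String.split() |> Enum.any?(&MapSet.member?(class_set, &1))

      nil ->
        false
    end
  end
end

tree = [
  {"script", [], ["alert(1)"]},
  {"div", [{"class", "sidebar wide"}], ["links"]},
  {"p", [], ["Body text"]}
]

IO.inspect(FilterSketch.filter(tree, ["script", "style"], ["sidebar", "footer"]))
```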

lib/html2markdown/table_converter.ex

Lines changed: 25 additions & 0 deletions
@@ -1,6 +1,31 @@
 defmodule Html2Markdown.TableConverter do
   @moduledoc """
   Handles conversion of HTML tables to Markdown format.
+
+  Converts HTML tables to GitHub Flavored Markdown tables with support for:
+  - Header detection (from `<th>` elements or `<thead>`)
+  - Complex table structures with `<thead>` and `<tbody>`
+  - Colspan handling (content repeated across columns)
+  - Empty cells and malformed tables
+
+  ## Examples
+
+      # Simple table
+      <table>
+        <tr><th>Name</th><th>Age</th></tr>
+        <tr><td>Alice</td><td>30</td></tr>
+      </table>
+
+      # Converts to:
+      | Name | Age |
+      | --- | --- |
+      | Alice | 30 |
+
+  ## Implementation Notes
+
+  - Tables without headers still generate valid Markdown tables
+  - Empty cells are preserved as empty columns
+  - Malformed HTML is handled gracefully
   """

   alias Html2Markdown.{Converter, Options}
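To make the table format in the moduledoc concrete, here is a minimal sketch that renders already-extracted cell text as a pipe table. It assumes the first row is the header and ignores colspan, missing headers, and malformed input; `TableSketch` is a hypothetical module and not the `TableConverter` implementation:

```elixir
defmodule TableSketch do
  # Hypothetical module for illustration; not the library's TableConverter.
  # Takes rows of cell strings, with the first row treated as the header.
  def to_markdown([header | rows]) do
    separator = Enum.map(header, fn _cell -> "---" end)

    [header, separator | rows]
    |> Enum.map(&row/1)
    |> Enum.intersperse("\n")
    |> IO.iodata_to_binary()
  end

  defp row(cells), do: ["| ", Enum.intersperse(cells, " | "), " |"]
end

IO.puts(TableSketch.to_markdown([["Name", "Age"], ["Alice", "30"]]))
# prints:
# | Name | Age |
# | --- | --- |
# | Alice | 30 |
```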

mix.exs

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ defmodule Html2Markdown.MixProject do
   def project do
     [
       app: :html2markdown,
-      version: "0.1.6",
+      version: "0.2.0",
       elixir: "~> 1.15",
       start_permanent: Mix.env() == :prod,
       description: description(),
