chore: bump version to 0.2.0 and enhance documentation

cpursley · cpursley · commit a92cafbfb71c · 2025-07-16T16:22:21.000-04:00
Major release with significant improvements:
- Module refactoring for better maintainability
- Performance optimizations with MapSet and IOLists
- HTML5 element support (details, mark, abbr, cite, q, time, video)
- Type specifications for all public functions
- Comprehensive documentation and examples
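
To illustrate the two optimizations named above, here is a minimal sketch of MapSet-based tag filtering combined with IOList output building. It is not the library's code: the `IoListSketch` module, its `render/1` function, and the simplified `{tag, text}` tuples are hypothetical stand-ins for the real Floki node tree.

```elixir
# Hypothetical sketch; names and node shape are assumptions, not html2markdown internals.
defmodule IoListSketch do
  # A MapSet gives fast membership checks compared with scanning a list.
  @non_content_tags MapSet.new(["script", "style", "nav"])

  # Drops non-content nodes, builds nested iodata, and flattens it once at the end.
  def render(nodes) do
    nodes
    |> Enum.reject(fn {tag, _text} -> MapSet.member?(@non_content_tags, tag) end)
    |> Enum.map(&to_iodata/1)
    |> IO.iodata_to_binary()
  end

  # Each clause returns a list of fragments instead of concatenating strings.
  defp to_iodata({"h1", text}), do: ["# ", text, "\n\n"]
  defp to_iodata({"p", text}), do: [text, "\n"]
  defp to_iodata({_tag, text}), do: text
end

IoListSketch.render([{"h1", "Hello"}, {"script", "alert(1)"}, {"p", "World"}])
# => "# Hello\n\nWorld\n"
```

The point of the IOList approach is that intermediate fragments are never copied into new binaries; only the final `IO.iodata_to_binary/1` call allocates the full string.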
diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2024 Chase Pursley
+Copyright (c) 2025 Chase Pursley
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/README.md b/README.md
@@ -1,6 +1,10 @@
 # Html2Markdown
 
-Extract the content from an HTML document to Markdown (removing non-content sections and tags)
+[![Hex.pm](https://img.shields.io/hexpm/v/html2markdown.svg)](https://hex.pm/packages/html2markdown)
+[![Hex Docs](https://img.shields.io/badge/hex-docs-purple.svg)](https://hexdocs.pm/html2markdown)
+[![License](https://img.shields.io/hexpm/l/html2markdown.svg)](https://github.com/cpursley/html2markdown/blob/main/LICENSE)
+
+Convert HTML to clean, readable Markdown. Designed for content extraction, this library intelligently handles common HTML patterns while filtering out non-content elements like navigation and scripts.
 
 ## Installation
 
@@ -9,16 +13,99 @@ Add `html2markdown` to your list of dependencies in `mix.exs`:
 ```elixir
 def deps do
   [
-    {:html2markdown, "~> 0.1.6"}
+    {:html2markdown, "~> 0.2.0"}
   ]
 end
 ```
 
-## Usage
+## Quick Start
+
+```elixir
+# Basic conversion
+Html2Markdown.convert("<h1>Hello World</h1><p>Welcome to <strong>Elixir</strong>!</p>")
+# => "\n# Hello World\n\n\n\nWelcome to **Elixir**!\n"
+
+# With custom options
+Html2Markdown.convert(html, %{
+  navigation_classes: ["nav", "menu", "custom-nav"],
+  normalize_whitespace: true
+})
+```
+
+## Features
+
+- **Smart Content Extraction**: Automatically removes navigation, ads, and other non-content elements
+- **HTML5 Support**: Handles modern semantic elements like `<details>`, `<mark>`, `<time>`
+- **Table Conversion**: Converts HTML tables to clean Markdown tables
+- **Entity Handling**: Properly decodes HTML entities (`&amp;`, `&lt;`, `&nbsp;`, etc.)
+- **Configurable**: Customize filtering and processing behavior
+
+## Configuration Options
+
+```elixir
+Html2Markdown.convert(html, %{
+  # CSS classes that identify navigation elements to remove
+  navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],
+  
+  # HTML tags to filter out during conversion
+  non_content_tags: ["script", "style", "form", "nav", ...],
+  
+  # Markdown flavor (currently :basic, future: :gfm, :commonmark)
+  markdown_flavor: :basic,
+  
+  # Normalize whitespace (collapses multiple spaces, trims)
+  normalize_whitespace: true
+})
+```
+
+## Common Use Cases
+
+### Web Scraping
+Extract readable content from web pages:
+
+```elixir
+{:ok, %{body: html}} = HTTPoison.get(url)
+markdown = Html2Markdown.convert(html)
+```
+
+### Content Migration
+Convert existing HTML content to Markdown:
 
 ```elixir
-Html2Markdown.convert(html)
+# Convert blog posts from HTML to Markdown
+html_content
+|> Html2Markdown.convert(%{normalize_whitespace: true})
+|> save_as_markdown()
 ```
 
-Docs can be found at <https://hexdocs.pm/html2markdown>.
+### Email Processing
+Clean up HTML emails for plain text storage:
+
+```elixir
+email_html
+|> Html2Markdown.convert(%{
+  non_content_tags: ["style", "script", "meta"],
+  navigation_classes: ["unsubscribe", "footer"]
+})
+```
+
+## Supported Elements
+
+- **Headings**: `<h1>` through `<h6>`
+- **Text**: Paragraphs, emphasis (`<em>`, `<i>`), strong (`<strong>`, `<b>`)
+- **Lists**: Ordered and unordered lists with nesting
+- **Links**: `<a>` tags with proper URL handling
+- **Images**: `<img>` and `<picture>` elements
+- **Code**: Both inline `<code>` and block `<pre>` elements
+- **Tables**: Full table support with headers
+- **Quotes**: `<blockquote>` and `<q>` elements
+- **HTML5**: `<details>`, `<summary>`, `<mark>`, `<abbr>`, `<cite>`, `<time>`, `<video>`
+
+## Documentation
+
+Full documentation is available at [https://hexdocs.pm/html2markdown](https://hexdocs.pm/html2markdown).
+
+## License
+
+MIT License - see the [LICENSE](LICENSE) file for details.
 
diff --git a/lib/html2markdown.ex b/lib/html2markdown.ex
@@ -1,19 +1,118 @@
 defmodule Html2Markdown do
   @moduledoc """
-  A library for converting HTML to Markdown syntax in Elixir
+  Convert HTML documents to clean, readable Markdown.
 
-  ## Configuration Options
+  Html2Markdown intelligently extracts content from HTML while filtering out
+  navigation, advertisements, and other non-content elements. It's designed
+  for web scraping, content migration, and any scenario where you need to
+  convert HTML to Markdown.
 
-  The library supports configuration options via `convert/2`:
+  ## Basic Usage
 
-  - `:navigation_classes` - Customize which CSS classes identify navigation elements to remove
-  - `:non_content_tags` - Customize which HTML tags to filter out during conversion
-  - `:markdown_flavor` - Currently only `:basic` is supported (future enhancement)
-  - `:normalize_whitespace` - Normalize whitespace in text content
+      iex> Html2Markdown.convert("<h1>Hello</h1><p>World</p>")
+      "\\n# Hello\\n\\n\\n\\nWorld\\n"
 
-  Note: HTML entity decoding is performed automatically by Floki for all content.
-  Common entities like &amp;, &lt;, &gt;, &quot;, &#39;, &nbsp; and numeric entities
-  are decoded to their corresponding characters.
+  ## Configuration
+
+  Pass an options map as the second argument to `convert/2` to customize the conversion:
+
+      Html2Markdown.convert(html, %{
+        navigation_classes: ["nav", "menu", "sidebar"],
+        non_content_tags: ["script", "style", "iframe"],
+        markdown_flavor: :basic,
+        normalize_whitespace: true
+      })
+
+  ## Features
+
+  - **Smart filtering** - Automatically removes common non-content elements
+  - **HTML5 support** - Handles modern semantic elements
+  - **Table conversion** - Converts HTML tables to Markdown tables
+  - **Entity decoding** - Automatically handled by Floki
+  - **Whitespace normalization** - Optional cleanup of excessive whitespace
+  - **Configurable** - Customize filtering behavior to your needs
+
+  ## Examples
+
+  ### Web Scraping
+
+      # Extract article content from a web page
+      {:ok, %{body: html}} = HTTPoison.get("https://example.com/article")
+      
+      content = Html2Markdown.convert(html, %{
+        navigation_classes: ["header", "footer", "nav", "sidebar"],
+        normalize_whitespace: true
+      })
+
+  ### Content Migration
+
+      # Convert WordPress posts to Markdown
+      post_html
+      |> Html2Markdown.convert()
+      |> then(&File.write!("post.md", &1))
+
+  ### Email Processing
+
+      # Clean up HTML emails
+      email_body
+      |> Html2Markdown.convert(%{
+        non_content_tags: ["style", "meta", "link"],
+        navigation_classes: ["unsubscribe", "footer"]
+      })
+
+  ## Supported HTML Elements
+
+  ### Text Elements
+  - Headings: `<h1>` through `<h6>`
+  - Paragraphs: `<p>`
+  - Emphasis: `<em>`, `<i>` → `*italic*`
+  - Strong: `<strong>`, `<b>` → `**bold**`
+  - Strikethrough: `<del>` → `~~strikethrough~~`
+  - Code: `<code>` → `` `code` ``
+  - Preformatted: `<pre>` → ``` code blocks ```
+
+  ### Lists
+  - Unordered lists: `<ul>`, `<li>` → `- item`
+  - Ordered lists: `<ol>`, `<li>` → `1. item`
+  - Definition lists: `<dl>`, `<dt>`, `<dd>`
+
+  ### Links and Media
+  - Links: `<a href="...">` → `[text](url)`
+  - Images: `<img>` → `![alt](src)`
+  - Picture: `<picture>` with fallback to `<img>`
+
+  ### Tables
+  Full support for HTML tables with automatic header detection:
+
+      <table>
+        <tr><th>Name</th><th>Value</th></tr>
+        <tr><td>Elixir</td><td>1.15</td></tr>
+      </table>
+
+  Converts to:
+
+      | Name | Value |
+      | --- | --- |
+      | Elixir | 1.15 |
+
+  ### HTML5 Elements
+  - `<details>` / `<summary>` - Collapsible sections
+  - `<mark>` - Highlighted text (GFM: `==marked==`)
+  - `<abbr title="...">` - Abbreviations with expansion
+  - `<cite>` - Citations in italics
+  - `<q cite="...">` - Inline quotes with optional citation
+  - `<time datetime="...">` - Time with preserved datetime
+  - `<video>` - Converted to markdown link
+
+  ## Entity Handling
+
+  HTML entities are automatically decoded by Floki:
+  - `&amp;` → `&`
+  - `&lt;` → `<`
+  - `&gt;` → `>`
+  - `&nbsp;` → non-breaking space
+  - `&#123;` → `{`
+  - `&#xAB;` → `«`
   """
 
   alias Html2Markdown.{Options, Parser, Converter}
diff --git a/lib/html2markdown/converter.ex b/lib/html2markdown/converter.ex
@@ -1,6 +1,26 @@
 defmodule Html2Markdown.Converter do
   @moduledoc """
   Handles the conversion of HTML nodes to Markdown format.
+
+  This module is responsible for transforming parsed HTML nodes into their
+  Markdown equivalents. It uses an efficient IOList-based approach for
+  building the output string.
+
+  ## Implementation Details
+
+  The converter uses pattern matching to handle different HTML elements:
+  - Headers (`h1`-`h6`) → Markdown headers with appropriate `#` prefixes
+  - Text formatting (`strong`, `em`, `del`) → Markdown emphasis markers
+  - Lists (`ul`, `ol`) → Markdown list syntax with proper nesting
+  - Tables → Delegated to `Html2Markdown.TableConverter`
+  - Links and images → Markdown link syntax
+  - Code blocks → Fenced code blocks with language detection
+
+  ## Performance Optimizations
+
+  - Uses IOList building instead of string concatenation
+  - Processes nodes in a single pass
+  - Preserves whitespace in code blocks while normalizing elsewhere
   """
 
   alias Html2Markdown.{TableConverter, Options}
diff --git a/lib/html2markdown/options.ex b/lib/html2markdown/options.ex
@@ -1,6 +1,31 @@
 defmodule Html2Markdown.Options do
   @moduledoc """
   Handles configuration options for HTML to Markdown conversion.
+
+  ## Available Options
+
+  - `:navigation_classes` - List of CSS classes that identify navigation elements to remove.
+    Default: `["footer", "menu", "nav", "sidebar", "aside"]`
+
+  - `:non_content_tags` - List of HTML tags to filter out completely.
+    Default includes tags like `script`, `style`, `iframe`, etc.
+
+  - `:markdown_flavor` - The markdown variant to generate.
+    Currently only `:basic` is supported. Future versions may support `:gfm` and `:commonmark`.
+
+  - `:normalize_whitespace` - Whether to collapse multiple spaces and trim whitespace.
+    Default: `true`. Code blocks always preserve whitespace regardless of this setting.
+
+  ## Examples
+
+      # Use all defaults
+      options = Options.defaults()
+
+      # Merge custom options with defaults
+      custom = Options.merge(%{
+        navigation_classes: ["custom-nav", "advertisement"],
+        normalize_whitespace: false
+      })
   """
 
   @type t :: %{
diff --git a/lib/html2markdown/parser.ex b/lib/html2markdown/parser.ex
@@ -1,6 +1,21 @@
 defmodule Html2Markdown.Parser do
   @moduledoc """
   Handles HTML preprocessing and parsing operations.
+
+  This module is responsible for:
+  1. Parsing HTML content using Floki
+  2. Filtering out non-content elements
+  3. Preparing the document tree for conversion
+
+  ## Filtering Strategy
+
+  The parser removes elements in two ways:
+  - **Tag-based filtering**: Removes elements like `<script>`, `<style>`, `<nav>`
+  - **Class-based filtering**: Removes elements with navigation classes like "footer", "sidebar"
+
+  ## Performance
+
+  Uses MapSet for O(1) lookup performance when checking tags and classes.
   """
 
   alias Html2Markdown.Options
diff --git a/lib/html2markdown/table_converter.ex b/lib/html2markdown/table_converter.ex
@@ -1,6 +1,31 @@
 defmodule Html2Markdown.TableConverter do
   @moduledoc """
   Handles conversion of HTML tables to Markdown format.
+
+  Converts HTML tables to GitHub Flavored Markdown tables with support for:
+  - Header detection (from `<th>` elements or `<thead>`)
+  - Complex table structures with `<thead>` and `<tbody>`
+  - Colspan handling (content repeated across columns)
+  - Empty cells and malformed tables
+
+  ## Examples
+
+      # Simple table
+      <table>
+        <tr><th>Name</th><th>Age</th></tr>
+        <tr><td>Alice</td><td>30</td></tr>
+      </table>
+
+      # Converts to:
+      | Name | Age |
+      | --- | --- |
+      | Alice | 30 |
+
+  ## Implementation Notes
+
+  - Tables without headers still generate valid Markdown tables
+  - Empty cells are preserved as empty columns
+  - Malformed HTML is handled gracefully
   """
 
   alias Html2Markdown.{Converter, Options}
diff --git a/mix.exs b/mix.exs
@@ -4,7 +4,7 @@ defmodule Html2Markdown.MixProject do
   def project do
     [
       app: :html2markdown,
-      version: "0.1.6",
+      version: "0.2.0",
       elixir: "~> 1.15",
       start_permanent: Mix.env() == :prod,
       description: description(),
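
As a quick check of what the `~> 0.2.0` requirement in the README accepts after this bump, the sketch below uses Elixir's standard `Version.match?/2`. The candidate version numbers other than `0.1.6` and `0.2.0` are made up for illustration.

```elixir
# "~> 0.2.0" is equivalent to ">= 0.2.0 and < 0.3.0".
requirement = "~> 0.2.0"

# The new release and later patch releases satisfy the requirement.
true = Version.match?("0.2.0", requirement)
true = Version.match?("0.2.7", requirement)

# The previous release and the next minor release do not.
false = Version.match?("0.1.6", requirement)
false = Version.match?("0.3.0", requirement)
```

Projects still pinned to `~> 0.1.6` will therefore not pick up this release until their requirement is updated.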
