Commit 8ec74c2

Merge pull request #21 from managedcode/codex/add-ichatclient-support-to-parsing-pipeline-3libiz
address review feedback for docx metadata and pdf rendering
2 parents 427c332 + 04aabb7 commit 8ec74c2

21 files changed, 2094 additions and 313 deletions

README.md

Lines changed: 49 additions & 6 deletions
@@ -56,10 +56,12 @@ This is a high-fidelity C# port of Microsoft's original [MarkItDown Python libra
 
 **Modern .NET** - Targets .NET 9.0 with up-to-date language features
 📦 **NuGet Package** - Drop-in dependency for libraries and automation pipelines
-🔄 **Async/Await** - Fully asynchronous pipeline for responsive apps
-🧠 **LLM-Optimized** - Markdown tailored for AI ingestion and summarisation
-🔧 **Extensible** - Register custom converters or plug additional caption/transcription services
-🧭 **Smart Detection** - Automatic MIME, charset, and file-type guessing (including data/file URIs)
+🔄 **Async/Await** - Fully asynchronous pipeline for responsive apps
+🧠 **LLM-Optimized** - Markdown tailored for AI ingestion and summarisation
+🔧 **Extensible** - Register custom converters or plug additional caption/transcription services
+🧩 **Conversion middleware** - Compose post-processing steps with `IConversionMiddleware` (AI enrichment ready)
+📂 **Raw artifacts API** - Inspect text blocks, tables, and images via `DocumentConverterResult.Artifacts`
+🧭 **Smart Detection** - Automatic MIME, charset, and file-type guessing (including data/file URIs)
 **High Performance** - Stream-friendly, minimal allocations, zero temp files
 
 ## 📋 Format Support
@@ -102,11 +104,12 @@ This is a high-fidelity C# port of Microsoft's original [MarkItDown Python libra
 - Header detection based on formatting
 - List item recognition
 - Title extraction from document content
+- Page snapshot artifacts ensure every page can be sent through AI enrichment (OCR, diagram-to-Mermaid, chart narration) even when the PDF exposes selectable text
 
 ### Office Documents (DOCX/XLSX/PPTX)
-- **Word (.docx)**: Headers, paragraphs, tables, bold/italic formatting
+- **Word (.docx)**: Headers, paragraphs, tables, bold/italic formatting, and embedded images captured for AI enrichment (OCR, Mermaid-ready diagrams)
 - **Excel (.xlsx)**: Spreadsheet data as Markdown tables with sheet organization
-- **PowerPoint (.pptx)**: Slide-by-slide content with title recognition
+- **PowerPoint (.pptx)**: Slide-by-slide content with title recognition plus image artifacts primed for detailed AI captions and diagrams
 
 ### CSV Conversion Features
 - Automatic table formatting with headers
@@ -1056,6 +1059,17 @@ var result = await markItDown.ConvertAsync("document.pdf");
 Console.WriteLine(result.Markdown);
 ```
 
+### .NET SDK Setup
+
+MarkItDown targets .NET 9.0. If your environment does not have the required SDK, run the helper script once:
+
+```bash
+./eng/install-dotnet.sh
+```
+
+The script installs the SDK into `~/.dotnet` (or `$DOTNET_ROOT` when that variable is set) using the official `dotnet-install` bootstrapper and prints the environment
+variables to add to your shell profile so the `dotnet` CLI is available in subsequent sessions.
+
 ### Building from Source
 
 ```bash
@@ -1084,6 +1098,10 @@ The command emits standard test results plus a Cobertura coverage report at
 [ReportGenerator](https://github.com/danielpalme/ReportGenerator) can turn this into
 HTML or Markdown dashboards.
 
+> ✅ The regression suite now exercises DOCX and PPTX conversions with embedded imagery, ensuring conversion middleware runs and enriched descriptions remain attached to the composed Markdown.
+>
+> ✅ Additional image-placement regressions verify that AI-generated captions are injected immediately after each source placeholder for DOCX, PPTX, and PDF outputs.
+
 ### Project Structure
 
 ```
@@ -1218,6 +1236,31 @@ var options = new MarkItDownOptions
 var markItDown = new MarkItDown(options);
 ```
 
+### Conversion Middleware & Raw Artifacts
+
+Every conversion now exposes the raw extraction artifacts that feed the Markdown composer. Use `DocumentConverterResult.Artifacts` to inspect page text, tables, or embedded images before they are flattened into Markdown. You can plug in additional processing by registering `IConversionMiddleware` instances through `MarkItDownOptions.ConversionMiddleware`. Middleware executes after extraction and can mutate segments, enrich metadata, or call external AI services. When an `IChatClient` is supplied and `EnableAiImageEnrichment` remains `true` (default), MarkItDown automatically adds the built-in `AiImageEnrichmentMiddleware` to describe charts, diagrams, and other visuals. The middleware keeps enriched prose anchored to the exact Markdown placeholder emitted during extraction, ensuring captions, Mermaid diagrams, and OCR text land beside the original image instead of drifting to the end of the section.
+
+```csharp
+var options = new MarkItDownOptions
+{
+    AiModels = new StaticAiModelProvider(chatClient: myChatClient, speechToTextClient: null),
+    ConversionMiddleware = new IConversionMiddleware[]
+    {
+        new MyDomainSpecificMiddleware()
+    }
+};
+
+var markItDown = new MarkItDown(options);
+var result = await markItDown.ConvertAsync("docs/diagram.docx");
+
+foreach (var image in result.Artifacts.Images)
+{
+    Console.WriteLine($"Image {image.Label}: {image.DetailedDescription}");
+}
+```
+
+Set `EnableAiImageEnrichment` to `false` when you need a completely custom pipeline with no default AI step.
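
Review note: a custom middleware could look roughly like the sketch below. This commit only shows that the pipeline calls `InvokeAsync(context, cancellationToken)` on each `IConversionMiddleware`; the members of the real context type are not visible here. `SketchImage`, `SketchContext`, `ISketchMiddleware`, and `LabelImagesMiddleware` are therefore hypothetical stand-ins, not MarkItDown types.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the two ImageArtifact properties used below.
public sealed class SketchImage
{
    public string? Label { get; set; }
    public int? PageNumber { get; set; }
}

// Hypothetical stand-in for the pipeline context; the real shape is an assumption.
public sealed class SketchContext
{
    public IList<SketchImage> Images { get; } = new List<SketchImage>();
}

// Mirrors the InvokeAsync(context, cancellationToken) call made by the pipeline.
public interface ISketchMiddleware
{
    Task InvokeAsync(SketchContext context, CancellationToken cancellationToken);
}

// Gives every unlabeled image a predictable label so later steps can refer to it.
public sealed class LabelImagesMiddleware : ISketchMiddleware
{
    public Task InvokeAsync(SketchContext context, CancellationToken cancellationToken)
    {
        for (var i = 0; i < context.Images.Count; i++)
        {
            cancellationToken.ThrowIfCancellationRequested();
            var image = context.Images[i];
            if (string.IsNullOrWhiteSpace(image.Label))
            {
                image.Label = image.PageNumber is int page
                    ? $"page-{page}-image-{i + 1}"
                    : $"image-{i + 1}";
            }
        }

        return Task.CompletedTask;
    }
}

public static class MiddlewareSketch
{
    public static async Task Main()
    {
        var context = new SketchContext();
        context.Images.Add(new SketchImage { PageNumber = 2 });
        context.Images.Add(new SketchImage { Label = "Figure A" });

        await new LabelImagesMiddleware().InvokeAsync(context, CancellationToken.None);

        foreach (var image in context.Images)
        {
            Console.WriteLine(image.Label);
        }
    }
}
```

Existing labels are left untouched, so the step is safe to run after converters that already name their images.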
+
 ### Production Configuration with Error Handling
 
 ```csharp

eng/install-dotnet.sh

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+CHANNEL="9.0"
+INSTALL_DIR="${DOTNET_ROOT:-$HOME/.dotnet}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+RESOLVED_INSTALL_DIR="${INSTALL_DIR}"
+TEMP_SCRIPT="${SCRIPT_DIR}/dotnet-install.sh"
+
+cleanup() {
+  rm -f "${TEMP_SCRIPT}"
+}
+trap cleanup EXIT
+
+if ! command -v wget >/dev/null 2>&1 && ! command -v curl >/dev/null 2>&1; then
+  echo "Either wget or curl is required to download dotnet-install.sh" >&2
+  exit 1
+fi
+
+DOWNLOAD_TOOL="wget"
+DOWNLOAD_ARGS=("-q" "-O")
+URL="https://dot.net/v1/dotnet-install.sh"
+
+if command -v curl >/dev/null 2>&1; then
+  DOWNLOAD_TOOL="curl"
+  DOWNLOAD_ARGS=("-sSL" "-o")
+fi
+
+${DOWNLOAD_TOOL} "${DOWNLOAD_ARGS[@]}" "${TEMP_SCRIPT}" "${URL}"
+chmod +x "${TEMP_SCRIPT}"
+
+"${TEMP_SCRIPT}" --channel "${CHANNEL}" --install-dir "${INSTALL_DIR}" --no-path
+
+cat <<EON
+Add the following to your shell profile to use the installed .NET SDK:
+
+export DOTNET_ROOT="${RESOLVED_INSTALL_DIR}"
+export PATH="\$DOTNET_ROOT:\$PATH"
+
+EON
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
+using System.Collections.ObjectModel;
+
+namespace MarkItDown;
+
+/// <summary>
+/// Represents the raw artifacts extracted during conversion prior to Markdown composition.
+/// </summary>
+public sealed class ConversionArtifacts
+{
+    /// <summary>
+    /// Initializes a new instance of the <see cref="ConversionArtifacts"/> class.
+    /// </summary>
+    public ConversionArtifacts()
+    {
+        TextBlocks = new List<TextArtifact>();
+        Tables = new List<TableArtifact>();
+        Images = new List<ImageArtifact>();
+        Metadata = new Dictionary<string, string>();
+    }
+
+    private ConversionArtifacts(bool _)
+    {
+        TextBlocks = EmptyTextBlocks;
+        Tables = EmptyTables;
+        Images = EmptyImages;
+        Metadata = EmptyMetadata;
+    }
+
+    private static readonly IList<TextArtifact> EmptyTextBlocks = new ReadOnlyCollection<TextArtifact>(Array.Empty<TextArtifact>());
+    private static readonly IList<TableArtifact> EmptyTables = new ReadOnlyCollection<TableArtifact>(Array.Empty<TableArtifact>());
+    private static readonly IList<ImageArtifact> EmptyImages = new ReadOnlyCollection<ImageArtifact>(Array.Empty<ImageArtifact>());
+    private static readonly IDictionary<string, string> EmptyMetadata = new ReadOnlyDictionary<string, string>(new Dictionary<string, string>());
+
+    /// <summary>
+    /// Gets a reusable empty instance. Declared after the empty collections because static initializers run in textual order.
+    /// </summary>
+    public static ConversionArtifacts Empty { get; } = new(true);
+
+    /// <summary>
+    /// Gets the raw text artifacts captured from the source.
+    /// </summary>
+    public IList<TextArtifact> TextBlocks { get; }
+
+    /// <summary>
+    /// Gets the tabular artifacts captured from the source.
+    /// </summary>
+    public IList<TableArtifact> Tables { get; }
+
+    /// <summary>
+    /// Gets the image artifacts captured from the source.
+    /// </summary>
+    public IList<ImageArtifact> Images { get; }
+
+    /// <summary>
+    /// Gets conversion-level metadata surfaced by the converter.
+    /// </summary>
+    public IDictionary<string, string> Metadata { get; }
+}
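
Review note: a shared instance such as `Empty` leans on two C# behaviours. Static initializers run in textual order, so the backing collections have to be declared before the property that consumes them. `ReadOnlyCollection<T>` also rejects writes with `NotSupportedException`. The sketch below demonstrates both with `ArtifactsSketch`, a hypothetical, trimmed-down stand-in that tracks text blocks only.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

// Hypothetical stand-in for ConversionArtifacts (text blocks only).
public sealed class ArtifactsSketch
{
    // Declared BEFORE Empty: static initializers run top to bottom, so moving
    // this field below Empty would leave Empty.TextBlocks null.
    private static readonly IList<string> EmptyTextBlocks =
        new ReadOnlyCollection<string>(Array.Empty<string>());

    public static ArtifactsSketch Empty { get; } = new ArtifactsSketch(shared: true);

    public ArtifactsSketch() => TextBlocks = new List<string>();

    // The flag only selects this overload, as in the original private constructor.
    private ArtifactsSketch(bool shared) => TextBlocks = EmptyTextBlocks;

    public IList<string> TextBlocks { get; }
}

public static class EmptyInstanceSketch
{
    // Returns true when the collection rejects writes.
    public static bool RejectsWrites(IList<string> list)
    {
        try
        {
            list.Add("probe");
            list.RemoveAt(list.Count - 1);
            return false;
        }
        catch (NotSupportedException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ArtifactsSketch.Empty.TextBlocks is not null);
        Console.WriteLine(RejectsWrites(ArtifactsSketch.Empty.TextBlocks));
        Console.WriteLine(RejectsWrites(new ArtifactsSketch().TextBlocks));
    }
}
```

Callers that receive the shared instance should treat it as immutable and create a fresh object when they need to add artifacts.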
+
+/// <summary>
+/// Represents a block of text extracted from the source document.
+/// </summary>
+public sealed class TextArtifact
+{
+    public TextArtifact(string text, int? pageNumber = null, string? source = null, string? label = null)
+    {
+        Text = text ?? string.Empty;
+        PageNumber = pageNumber;
+        Source = source;
+        Label = label;
+    }
+
+    public string Text { get; set; }
+
+    public int? PageNumber { get; set; }
+
+    public string? Source { get; set; }
+
+    public string? Label { get; set; }
+}
+
+/// <summary>
+/// Represents tabular content extracted from the source document.
+/// </summary>
+public sealed class TableArtifact
+{
+    public TableArtifact(IList<IList<string>> rows, int? pageNumber = null, string? source = null, string? label = null)
+    {
+        Rows = rows ?? throw new ArgumentNullException(nameof(rows));
+        PageNumber = pageNumber;
+        Source = source;
+        Label = label;
+    }
+
+    public IList<IList<string>> Rows { get; }
+
+    public int? PageNumber { get; set; }
+
+    public string? Source { get; set; }
+
+    public string? Label { get; set; }
+}
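
Review note: `TableArtifact.Rows` is a plain `IList<IList<string>>`, so consumers can render it however they like. The sketch below is one possible rendering into a Markdown pipe table, not the composer used by MarkItDown. It assumes the first row is the header, pads short rows with empty cells, and escapes pipe characters; `TableSketch` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class TableSketch
{
    // Renders rows as a Markdown pipe table. Assumption: the first row is the header.
    public static string ToMarkdown(IList<IList<string>> rows)
    {
        if (rows is null) throw new ArgumentNullException(nameof(rows));
        if (rows.Count == 0) return string.Empty;

        var columns = rows.Max(r => r.Count);
        if (columns == 0) return string.Empty;

        var builder = new StringBuilder();
        for (var i = 0; i < rows.Count; i++)
        {
            AppendRow(builder, rows[i], columns);
            if (i == 0)
            {
                // Separator line that marks the first row as the header.
                builder.Append('|');
                for (var c = 0; c < columns; c++) builder.Append(" --- |");
                builder.Append('\n');
            }
        }

        return builder.ToString();
    }

    private static void AppendRow(StringBuilder builder, IList<string> row, int columns)
    {
        builder.Append('|');
        for (var c = 0; c < columns; c++)
        {
            // Pad short rows with empty cells; escape pipes and flatten line breaks.
            var cell = c < row.Count ? row[c] ?? string.Empty : string.Empty;
            cell = cell.Replace("|", "\\|").Replace("\r", string.Empty).Replace("\n", " ");
            builder.Append(' ').Append(cell).Append(" |");
        }
        builder.Append('\n');
    }

    public static void Main()
    {
        var rows = new List<IList<string>>
        {
            new List<string> { "Name", "Qty" },
            new List<string> { "A|B", "2" },
            new List<string> { "C" },
        };
        Console.Write(ToMarkdown(rows));
    }
}
```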
+
+/// <summary>
+/// Represents an image extracted from the source document.
+/// </summary>
+public sealed class ImageArtifact
+{
+    public ImageArtifact(byte[] data, string? contentType = null, int? pageNumber = null, string? source = null, string? label = null)
+    {
+        Data = data ?? throw new ArgumentNullException(nameof(data));
+        ContentType = contentType;
+        PageNumber = pageNumber;
+        Source = source;
+        Label = label;
+        Metadata = new Dictionary<string, string>();
+    }
+
+    /// <summary>
+    /// Gets the raw binary data for the image.
+    /// </summary>
+    public byte[] Data { get; }
+
+    /// <summary>
+    /// Gets or sets the content type associated with the image.
+    /// </summary>
+    public string? ContentType { get; set; }
+
+    /// <summary>
+    /// Gets or sets the page number that owns the image, when applicable.
+    /// </summary>
+    public int? PageNumber { get; set; }
+
+    /// <summary>
+    /// Gets or sets the logical source identifier for the image.
+    /// </summary>
+    public string? Source { get; set; }
+
+    /// <summary>
+    /// Gets or sets the friendly label for the image.
+    /// </summary>
+    public string? Label { get; set; }
+
+    /// <summary>
+    /// Gets or sets the enriched description generated for the image.
+    /// </summary>
+    public string? DetailedDescription { get; set; }
+
+    /// <summary>
+    /// Gets or sets a Mermaid diagram representation when the image depicts structured data.
+    /// </summary>
+    public string? MermaidDiagram { get; set; }
+
+    /// <summary>
+    /// Gets or sets additional textual extraction (such as OCR output).
+    /// </summary>
+    public string? RawText { get; set; }
+
+    /// <summary>
+    /// Gets metadata describing the image artifact.
+    /// </summary>
+    public IDictionary<string, string> Metadata { get; }
+
+    /// <summary>
+    /// Gets or sets the segment index that references this artifact within the composed output.
+    /// </summary>
+    public int? SegmentIndex { get; set; }
+
+    /// <summary>
+    /// Gets or sets the Markdown placeholder that was emitted during extraction for this image.
+    /// </summary>
+    public string? PlaceholderMarkdown { get; set; }
+}
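
Review note: `PlaceholderMarkdown` and `DetailedDescription` together make it possible to keep a caption next to its image. The sketch below shows one way that could work on a flat Markdown string. It is an illustration only: the middleware in this commit is not shown here, and the `SegmentIndex` property suggests it may work per segment rather than on a single string. `ImageNote` and `CaptionSketch` are hypothetical names.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in carrying the two ImageArtifact properties the sketch needs.
public sealed record ImageNote(string? PlaceholderMarkdown, string? DetailedDescription);

public static class CaptionSketch
{
    // Inserts each description as its own paragraph right after the first
    // occurrence of the matching placeholder; images without a match are skipped.
    public static string InjectCaptions(string markdown, IEnumerable<ImageNote> images)
    {
        foreach (var image in images)
        {
            if (string.IsNullOrEmpty(image.PlaceholderMarkdown) ||
                string.IsNullOrWhiteSpace(image.DetailedDescription))
            {
                continue;
            }

            var index = markdown.IndexOf(image.PlaceholderMarkdown, StringComparison.Ordinal);
            if (index < 0)
            {
                continue;
            }

            var insertAt = index + image.PlaceholderMarkdown.Length;
            markdown = markdown.Insert(insertAt, "\n\n" + image.DetailedDescription.Trim());
        }

        return markdown;
    }

    public static void Main()
    {
        var markdown = "Intro\n\n![chart](image1.png)\n\nOutro";
        var images = new[]
        {
            new ImageNote("![chart](image1.png)", "Bar chart of sales by region."),
        };
        Console.WriteLine(InjectCaptions(markdown, images));
    }
}
```

Matching on the exact placeholder string, rather than appending to the end of the document, is what keeps the caption beside the image it describes.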
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Threading;
+using System.Threading.Tasks;
+using MarkItDown.Intelligence;
+using Microsoft.Extensions.Logging;
+
+namespace MarkItDown;
+
+/// <summary>
+/// Sequential middleware pipeline that executes configured <see cref="IConversionMiddleware"/> components.
+/// </summary>
+public sealed class ConversionPipeline : IConversionPipeline
+{
+    private readonly IReadOnlyList<IConversionMiddleware> middlewares;
+    private readonly IAiModelProvider aiModels;
+    private readonly ILogger? logger;
+
+    public static IConversionPipeline Empty { get; } = new ConversionPipeline(Array.Empty<IConversionMiddleware>(), NullAiModelProvider.Instance, logger: null);
+
+    public ConversionPipeline(IEnumerable<IConversionMiddleware> middlewares, IAiModelProvider aiModels, ILogger? logger)
+    {
+        this.middlewares = (middlewares ?? throw new ArgumentNullException(nameof(middlewares))).ToArray();
+        this.aiModels = aiModels ?? NullAiModelProvider.Instance;
+        this.logger = logger;
+    }
+
+    public async Task ExecuteAsync(StreamInfo streamInfo, ConversionArtifacts artifacts, IList<DocumentSegment> segments, CancellationToken cancellationToken)
+    {
+        if (middlewares.Count == 0)
+        {
+            return;
+        }
+
+        var context = new ConversionPipelineContext(streamInfo, artifacts, segments, aiModels, logger);
+        foreach (var middleware in middlewares)
+        {
+            cancellationToken.ThrowIfCancellationRequested();
+
+            try
+            {
+                await middleware.InvokeAsync(context, cancellationToken).ConfigureAwait(false);
+            }
+            catch (Exception ex) when (ex is not OperationCanceledException || !cancellationToken.IsCancellationRequested)
+            {
+                logger?.LogWarning(ex, "Conversion middleware {Middleware} failed", middleware.GetType().Name);
+            }
+        }
+    }
+}
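
Review note: the pipeline runs middleware strictly in order and reports failures instead of rethrowing them, so one broken step does not block the steps after it. The sketch below reproduces that behaviour in isolation. `PipelineSketch` and its delegate-based steps are hypothetical stand-ins for `ConversionPipeline` and `IConversionMiddleware`, and the failure counter is added purely to make the outcome observable.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class PipelineSketch
{
    // Runs each step in order. A failing step is reported and skipped so that
    // later steps still execute; returns the number of failed steps.
    public static async Task<int> RunAsync(
        IReadOnlyList<Func<IList<string>, CancellationToken, Task>> steps,
        IList<string> segments,
        CancellationToken cancellationToken)
    {
        var failures = 0;
        foreach (var step in steps)
        {
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                await step(segments, cancellationToken).ConfigureAwait(false);
            }
            catch (Exception ex)
            {
                failures++;
                Console.Error.WriteLine($"Step failed: {ex.Message}");
            }
        }

        return failures;
    }

    public static async Task Main()
    {
        var segments = new List<string> { "hello" };
        var steps = new Func<IList<string>, CancellationToken, Task>[]
        {
            (s, _) => { s[0] = s[0].ToUpperInvariant(); return Task.CompletedTask; },
            (_, _) => throw new InvalidOperationException("enrichment service unavailable"),
            (s, _) => { s.Add("world"); return Task.CompletedTask; },
        };

        var failures = await RunAsync(steps, segments, CancellationToken.None);
        Console.WriteLine($"{string.Join(" ", segments)} ({failures} failed)");
    }
}
```

In this sketch a cancellation exception raised inside a step is handled like any other failure; a `when` filter on the catch clause is one way to let it propagate to the caller.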
