Commit bbdf4a3

YouTube feature
1 parent 9339d2b commit bbdf4a3

38 files changed, +1714 -644 lines changed

AGENTS.md

Lines changed: 3 additions & 0 deletions
@@ -122,6 +122,9 @@ Extraction guidelines:
 - Multi-page tables must emit continuation comments and populate `table.pageStart`, `table.pageEnd`, `table.pageRange` metadata.
 - PDF converters must honor `SegmentOptions.Pdf.TreatPagesAsImages` by rendering pages to PNG, running OCR/vision enrichment, and composing image+recognized-text segments.
 - Persist conversion workspaces through `ManagedCode.Storage` with sanitized per-document folders and store extracted artifacts + final markdown there.
+- Media upload integrations must use `ManagedCode.Storage.Core.IStorage` directly (factory/options flow); do not introduce custom URL-upload provider abstractions when `IStorage` is available.
+- After feature/refactor work, delete orphaned/unused code files and stale abstractions immediately; do not leave dead code in the repository.
+- In storage-related tests, use real `ManagedCode.Storage` implementations (for example `LocalStorage`) instead of custom storage stubs when feasible.
 - Root path configurability: `MarkItDownPathResolver` must support configurable root via `MarkItDownOptions.RootPath` or `MarkItDownServiceBuilder.UseRootPath()`, with lock-guarded atomic initialization and conflict exceptions.

 ### Autonomy

Directory.Packages.props

Lines changed: 14 additions & 14 deletions
@@ -1,16 +1,16 @@
 <Project>
 <ItemGroup>
 <PackageVersion Include="AngleSharp" Version="1.4.0" />
-<PackageVersion Include="AWSSDK.Rekognition" Version="4.0.3.10" />
-<PackageVersion Include="AWSSDK.S3" Version="4.0.17.1" />
-<PackageVersion Include="AWSSDK.Textract" Version="4.0.3.10" />
-<PackageVersion Include="AWSSDK.TranscribeService" Version="4.0.5" />
+<PackageVersion Include="AWSSDK.Rekognition" Version="4.0.3.13" />
+<PackageVersion Include="AWSSDK.S3" Version="4.0.18.6" />
+<PackageVersion Include="AWSSDK.Textract" Version="4.0.3.13" />
+<PackageVersion Include="AWSSDK.TranscribeService" Version="4.0.5.3" />
 <PackageVersion Include="Azure.AI.FormRecognizer" Version="4.1.0" />
 <PackageVersion Include="Azure.AI.OpenAI" Version="2.1.0" />
 <PackageVersion Include="Azure.AI.Vision.ImageAnalysis" Version="1.0.0" />
 <PackageVersion Include="Azure.Storage.Blobs" Version="12.27.0" />
 <PackageVersion Include="Azure.Identity" Version="1.17.1" />
-<PackageVersion Include="coverlet.collector" Version="6.0.4" />
+<PackageVersion Include="coverlet.collector" Version="8.0.0" />
 <PackageVersion Include="DocumentFormat.OpenXml" Version="3.4.1" />
 <PackageVersion Include="DotNet.ReproducibleBuilds" Version="1.2.25" />
 <PackageVersion Include="Google.Cloud.DocumentAI.V1" Version="3.23.0" />
@@ -22,22 +22,22 @@
 <PackageVersion Include="ManagedCode.Storage.Core" Version="10.0.2" />
 <PackageVersion Include="ManagedCode.Storage.FileSystem" Version="10.0.2" />
 <PackageVersion Include="ManagedCode.Storage.Gcp" Version="10.0.2" />
-<PackageVersion Include="Microsoft.Extensions.AI" Version="10.1.1" />
-<PackageVersion Include="Microsoft.Extensions.AI.OpenAI" Version="9.9.1-preview.1.25474.6" />
-<PackageVersion Include="Microsoft.Extensions.DependencyInjection.Abstractions" Version="10.0.1" />
-<PackageVersion Include="Microsoft.Extensions.Logging.Abstractions" Version="10.0.1" />
-<PackageVersion Include="Microsoft.Extensions.Options" Version="10.0.1" />
+<PackageVersion Include="Microsoft.Extensions.AI" Version="10.3.0" />
+<PackageVersion Include="Microsoft.Extensions.AI.OpenAI" Version="10.3.0" />
+<PackageVersion Include="Microsoft.Extensions.DependencyInjection.Abstractions" Version="10.0.3" />
+<PackageVersion Include="Microsoft.Extensions.Logging.Abstractions" Version="10.0.3" />
+<PackageVersion Include="Microsoft.Extensions.Options" Version="10.0.3" />
 <PackageVersion Include="Microsoft.NET.Test.Sdk" Version="18.0.1" />
-<PackageVersion Include="MimeKit" Version="4.14.0" />
+<PackageVersion Include="MimeKit" Version="4.15.0" />
 <PackageVersion Include="Moq" Version="4.20.72" />
 <PackageVersion Include="PdfPig" Version="0.1.13" />
 <PackageVersion Include="PDFtoImage" Version="5.2.0" />
-<PackageVersion Include="Sep" Version="0.12.1" />
+<PackageVersion Include="Sep" Version="0.12.2" />
 <PackageVersion Include="Shouldly" Version="4.3.0" />
 <PackageVersion Include="Testcontainers.Azurite" Version="4.10.0" />
-<PackageVersion Include="SkiaSharp" Version="3.119.1" />
+<PackageVersion Include="SkiaSharp" Version="3.119.2" />
 <PackageVersion Include="Spectre.Console" Version="0.54.0" />
-<PackageVersion Include="YoutubeExplode" Version="6.5.6" />
+<PackageVersion Include="YoutubeExplode" Version="6.5.7" />
 <PackageVersion Include="xunit" Version="2.9.3" />
 <PackageVersion Include="xunit.runner.visualstudio" Version="3.1.5" />
 </ItemGroup>

docs/DocumentProcessingPipeline.md

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ The table below maps every converter family to its scope and special notes. "Inl
 | Web & Feeds | `HtmlConverter`, `WikipediaConverter`, `BingSerpConverter`, `RssFeedConverter` | HTML pages, SERP snapshots, RSS/Atom feeds | No | Optional link normalisation middleware | Callers must pre-download content; converters assume a file path containing the payload. |
 | Markup & Diagrams | `AsciiDocConverter`, `DjotConverter`, `GraphvizConverter`, `LatexConverter`, `OrgConverter`, `PlantUmlConverter`, `RstConverter`, `TextileConverter`, `TikzConverter`, `TypstConverter`, `WikiMarkupConverter`, `MermaidConverter` | Text-based markup, diagrams | No | Diagram converters may rely on rendering middleware | `MermaidConverter` should never render in-memory; run CLI tools against temp files if needed. |
 | Data & Metadata | `CsvConverter`, `MetaMdConverter`, `JsonConverter`, `XmlConverter`, `BibTexConverter`, `RisConverter`, `CslJsonConverter` | Tabular or metadata files | No | Optional schema validation middleware | `MetaMdConverter` extracts Markdown metadata blocks and attaches them as segments/artifacts. |
-| Media | `ImageConverter`, `AudioConverter`, `YouTubeUrlConverter` | Images, audio files, YouTube URLs | Image & Audio converters call providers inline; YouTube uses metadata APIs | `AiImageEnrichmentMiddleware` for remaining images | Audio transcription must stream from disk; no byte-array mirrors. |
+| Media | `ImageConverter`, `AudioConverter`, `YouTubeUrlConverter` | Images, audio files, YouTube + supported video-platform URLs | Image & Audio converters call providers inline; URL converter resolves/downloads media then delegates to `VideoConverter` | `AiImageEnrichmentMiddleware` for remaining images | Audio/video transcription must stream from disk; no byte-array mirrors. Azure media upload route is controlled per request via `MediaTranscriptionRequest.UploadRoute` (`Auto`/`Stream`/`SourceUrl`/`StorageUrl`). |
 | Archives & Packaging | `ZipConverter`, future TAR/7z | Aggregated content | Depends on entries | Might run child pipelines per entry | Each entry is persisted to its own temp file before being handed to inner converters. |

 ---

docs/Features/format-detection-and-converter-routing.md

Lines changed: 2 additions & 1 deletion
@@ -56,6 +56,7 @@ Select the correct converter for each input by combining explicit metadata (mime
 - If all converters fail, user receives `UnsupportedFormatException` with converter failure details, except authorization/authentication failures which surface as `FileConversionException`.
 - When a converter is selected for execution, emit an `Information` log with converter name and source.
 - Media uploads (`audio/*`, `video/*`) must stay on media/file converter paths and must not be routed through YouTube URL converter logic.
+- YouTube and supported video-platform URLs must resolve/download real media and run through `VideoConverter` media transcription flow.
 - `video/*` files route through `VideoConverter` first, then use configured media transcription providers (for example Azure Video Indexer) via the media converter flow.

 ---
@@ -73,7 +74,7 @@ Select the correct converter for each input by combining explicit metadata (mime
 2. Convert URL input
 - Actor: library caller
 - Trigger: URL conversion API call
-- Steps: download -> create URL-aware stream info -> same routing pipeline
+- Steps: download/resolve URL media when applicable -> create URL-aware stream info -> same routing pipeline
 - Result: web/media/url converter path selected by input type.

 ### Edge cases
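
Review note: the routing rules in this file can be read as a small precedence check. The sketch below is a language-neutral illustration in Python, not the repository's C# code. It assumes routing is decided from a MIME type and an optional URL; the function name, the returned path labels, and the host list are hypothetical and are not taken from the commit.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical host list for illustration only; the commit does not say how
# "supported video-platform URLs" are recognized.
VIDEO_PLATFORM_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}


def select_media_path(mime_type: Optional[str], url: Optional[str]) -> str:
    """Return an illustrative label for the path an input should take."""
    # Rule 1: uploaded audio/video stays on the media/file converter path,
    # even when URL metadata is attached to the stream.
    if mime_type and mime_type.split("/", 1)[0].lower() in ("audio", "video"):
        return "media-file-path"

    # Rule 2: a recognized platform URL is resolved/downloaded to real media
    # and then handed to the video transcription flow.
    if url:
        host = (urlparse(url).hostname or "").lower()
        if host in VIDEO_PLATFORM_HOSTS:
            return "url-resolver-path"

    # Anything else goes through the normal routing pipeline.
    return "default-routing"
```

The ordering is the point of the sketch: checking the MIME type first is what keeps an uploaded `video/*` file with a platform URL in its metadata from being intercepted by URL converter logic.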

docs/Features/media-and-image-intelligence-enrichment.md

Lines changed: 8 additions & 3 deletions
@@ -55,9 +55,12 @@ Enable richer media conversion by combining baseline metadata extraction with op
 - Missing required media provider configurations must surface explicit failures instead of silent fallback.
 - Video Indexer processing must wait for `Processed` and fail with a clear timeout/config error if processing does not complete in the configured window.
 - For Azure Video Indexer uploads, prefer `videoUrl` when `StreamInfo.Url` is a valid `http/https` source (for example read-only SAS URL); use multipart stream upload only when no valid source URL is available.
+- `MediaTranscriptionRequest.SourceUrl` must override `StreamInfo.Url` for Azure Video Indexer upload source selection, and invalid (non-`http/https`) override values must fail fast.
+- `MediaTranscriptionRequest.UploadRoute` controls provider upload routing: `Auto`, `Stream`, `SourceUrl`, `StorageUrl`.
+- `UploadRoute=StorageUrl` requires `AzureMediaIntelligenceOptions.UploadStorageFactory`; missing storage factory, missing local file path, or missing public HTTP/S URI in upload metadata must fail fast.
 - Image enrichment must reject missing MIME metadata and preserve one canonical image placeholder/description path in final markdown.
 - If AI image enrichment returns no insight, treat it as soft failure (log and continue).
-- Media routing must avoid YouTube converter path for uploaded `audio/*` or `video/*` media.
+- Media routing must avoid URL-converter interception for uploaded `audio/*` or `video/*` media.
 - Video transcript output must include rich context (timing/speaker/sentiment/topics/keywords plus Video Indexer state/index/progress metadata when available).

 ---
@@ -69,7 +72,7 @@ Enable richer media conversion by combining baseline metadata extraction with op
 1. Convert an audio/video file with media transcription enabled
 - Actor: library caller
 - Trigger: media conversion request with provider options
-- Steps: route `video/*` to `VideoConverter`/`AudioConverter` -> extract metadata -> upload via `videoUrl` (when source URL exists) or multipart fallback -> poll provider -> build transcript + analysis segments -> compose markdown
+- Steps: route `video/*` to `VideoConverter`/`AudioConverter` -> extract metadata -> choose upload route (`UploadRoute`) and source URL (`SourceUrl` override, URL metadata, or storage-generated public URL) -> upload via `videoUrl` or multipart stream -> poll provider -> build transcript + analysis segments -> compose markdown
 - Result: markdown containing metadata and transcript segments.

 2. Convert an image with AI enrichment enabled
@@ -153,9 +156,11 @@ flowchart LR

 | ID | Description | Level (Unit / Int / API / UI) | Expected result | Data / Notes |
 | --- | --- | --- | --- | --- |
-| EDGE-001 | Video media input with URL metadata | Integration | Routed through media path, not YouTube metadata path | `tests/MarkItDown.Tests/ConverterAcceptanceTests.cs` |
+| EDGE-001 | Video media input with URL metadata | Integration | Routed through media path, not URL video-platform resolver path | `tests/MarkItDown.Tests/ConverterAcceptanceTests.cs` |
 | EDGE-002 | Live Azure media transcription path | Integration (live) | Provider transcript segments marked as azure video indexer with transcript + analysis sections | `tests/MarkItDown.Tests/Intelligence/Integration/AzureIntelligenceIntegrationTests.cs` |
 | EDGE-003 | Azure Video Indexer upload with HTTP/S source URL | Unit/Integration | Upload request uses `videoUrl` query without multipart body | `tests/MarkItDown.Tests/Intelligence/VideoIndexerClientTests.cs` |
+| EDGE-004 | Azure upload source override via request | Unit/Integration | `MediaTranscriptionRequest.SourceUrl` wins over `StreamInfo.Url`; invalid override fails fast | `tests/MarkItDown.Tests/Intelligence/AzureMediaTranscriptionProviderTests.cs` |
+| EDGE-005 | Azure upload route selection | Unit/Integration | `UploadRoute.Stream` forces multipart; `UploadRoute.StorageUrl` requires configured uploader and emits `videoUrl` | `tests/MarkItDown.Tests/Intelligence/AzureMediaTranscriptionProviderTests.cs` |

 ### Test mapping
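
Review note: the upload rules added to this file (`SourceUrl` override, `UploadRoute` values, fail-fast conditions) combine into one decision. The sketch below is a language-neutral illustration in Python, not the repository's C# code. The `UploadPlan` type, the `plan_upload` function, and its parameters are hypothetical names. Two behaviors are assumptions rather than statements from the commit: that an invalid override is rejected on every route, and that `SourceUrl` routing fails when no valid URL is available.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
from urllib.parse import urlparse


class UploadRoute(Enum):
    AUTO = "Auto"
    STREAM = "Stream"
    SOURCE_URL = "SourceUrl"
    STORAGE_URL = "StorageUrl"


@dataclass
class UploadPlan:
    mode: str                  # "videoUrl" or "multipart"
    url: Optional[str] = None  # set only for "videoUrl"


def is_http_url(value: Optional[str]) -> bool:
    """True only for absolute http/https URLs."""
    if not value:
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def plan_upload(route: UploadRoute,
                request_source_url: Optional[str] = None,
                stream_info_url: Optional[str] = None,
                has_storage_factory: bool = False,
                local_path: Optional[str] = None,
                storage_public_url: Optional[str] = None) -> UploadPlan:
    # An explicit override must be valid; fail fast instead of falling back.
    # (Assumption: this check applies regardless of the chosen route.)
    if request_source_url is not None and not is_http_url(request_source_url):
        raise ValueError("SourceUrl override must be an absolute http/https URL")

    # The request-level override wins over URL metadata on the stream.
    fallback = stream_info_url if is_http_url(stream_info_url) else None
    source = request_source_url or fallback

    if route is UploadRoute.STREAM:
        return UploadPlan("multipart")

    if route is UploadRoute.SOURCE_URL:
        if source is None:  # assumption: no silent multipart fallback here
            raise ValueError("SourceUrl route needs a valid http/https URL")
        return UploadPlan("videoUrl", source)

    if route is UploadRoute.STORAGE_URL:
        if not has_storage_factory:
            raise ValueError("StorageUrl route needs an upload storage factory")
        if not local_path:
            raise ValueError("StorageUrl route needs a local file path")
        if not is_http_url(storage_public_url):
            raise ValueError("Upload metadata has no public http/https URI")
        return UploadPlan("videoUrl", storage_public_url)

    # Auto: prefer videoUrl when a valid source exists, else multipart.
    return UploadPlan("videoUrl", source) if source else UploadPlan("multipart")
```

In this reading, EDGE-004 corresponds to the override check and precedence at the top of `plan_upload`, and EDGE-005 to the `STREAM` and `STORAGE_URL` branches.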
