Commit 3097645

bridgetmcg and vagenas authored
feat: add code chunking functionality (#398)
* initial code chunking for docling-core

* DCO Remediation Commit for Bridget McGinn <[email protected]>

  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 334811a

  Signed-off-by: Bridget McGinn <[email protected]>

* include language detections, add code chunking into hierarchical chunker

* add serializer, internal marking of chunkers, typing

* Update pyproject.toml

  Co-authored-by: Panos Vagenas <[email protected]>
  Signed-off-by: Bridget <[email protected]>

* Update docling_core/transforms/chunker/hierarchical_chunker.py

  Co-authored-by: Panos Vagenas <[email protected]>
  Signed-off-by: Bridget <[email protected]>

* run all pre-commit less pytest

* update test files for code ID

* DCO Remediation Commit for Bridget McGinn <[email protected]>

  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 46bb88a
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 10e9ed8
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: d9827c7
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 814dc61

  Signed-off-by: Bridget McGinn <[email protected]>

* update uv.lock

  Signed-off-by: Bridget McGinn <[email protected]>

* revert to stricter treesitter versioning due to compatibility

  Signed-off-by: Bridget McGinn <[email protected]>

* DCO Remediation Commit for Bridget McGinn <[email protected]>

  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: a4a21e9
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 0266c63
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 336dd6a
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 68890e9
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 3c65eef

  Signed-off-by: Bridget McGinn <[email protected]>

* remove language detection (to be run by client, i.e. docling)

  Signed-off-by: Panos Vagenas <[email protected]>

* align new dependency specs

  Signed-off-by: Panos Vagenas <[email protected]>

* address backticks, ABC, and supported languages feedback

  Signed-off-by: Bridget McGinn <[email protected]>

* remove Language class and reuse CodeLanguageLabel

  Signed-off-by: Bridget McGinn <[email protected]>

* DCO Remediation Commit for Bridget McGinn <[email protected]>

  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 63c7739
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 431d357
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: f3175c2
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 1a01de8
  I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 025aea3

  Signed-off-by: Bridget McGinn <[email protected]>

* refactoring and improvements

  - encapsulated code chunking specifics to separate package
  - clearly separated public vs internal API via module and method naming conventions
  - simplified or removed parts not strictly necessary for public API (e.g. lang support querying, noopstrategy)
  - split chunk data model to separate modules to prevent circular dependencies
  - renamed DefaultCodeChunkingStrategy to Standard... for clarity as it need not be the default strategy
  - fixed some issues (e.g. gen flag in test)

  Signed-off-by: Panos Vagenas <[email protected]>

---------

Signed-off-by: Bridget McGinn <[email protected]>
Signed-off-by: Bridget <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]>
1 parent a54f6f0 commit 3097645

74 files changed: +10987 -304 lines changed

docling_core/transforms/chunker/__init__.py

Lines changed: 14 additions & 4 deletions
@@ -6,9 +6,19 @@
 """Define the chunker types."""
 
 from docling_core.transforms.chunker.base import BaseChunk, BaseChunker, BaseMeta
-from docling_core.transforms.chunker.hierarchical_chunker import (
-    DocChunk,
-    DocMeta,
-    HierarchicalChunker,
+from docling_core.transforms.chunker.code_chunking.base_code_chunking_strategy import (
+    BaseCodeChunkingStrategy,
 )
+from docling_core.transforms.chunker.code_chunking.code_chunk import (
+    CodeChunk,
+    CodeChunkType,
+    CodeDocMeta,
+)
+from docling_core.transforms.chunker.code_chunking.standard_code_chunking_strategy import (
+    StandardCodeChunkingStrategy,
+)
+from docling_core.transforms.chunker.doc_chunk import DocChunk, DocMeta
+from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker
+from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
 from docling_core.transforms.chunker.page_chunker import PageChunker
+from docling_core.types.doc.labels import CodeLanguageLabel
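
The exports above pair an abstract base strategy with a standard implementation. As a rough illustration of that pattern only, here is a minimal stdlib sketch. Every name in it (`SketchChunk`, `SketchStrategy`, `TopLevelDefStrategy`, the `chunk` method) is hypothetical and is not docling-core's API. The splitting logic is a naive line-prefix heuristic swapped in for illustration; it does not reflect how `StandardCodeChunkingStrategy` actually works.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass
class SketchChunk:
    """Hypothetical stand-in for a code chunk (not docling-core's CodeChunk)."""

    text: str
    kind: str  # "preamble", "function", or "class"


class SketchStrategy(ABC):
    """Hypothetical base strategy: subclasses decide how code is split."""

    @abstractmethod
    def chunk(self, code: str) -> Iterator[SketchChunk]:
        """Yield chunks for the given source text."""


class TopLevelDefStrategy(SketchStrategy):
    """Naive strategy: start a new chunk at each top-level `def` or `class`."""

    def chunk(self, code: str) -> Iterator[SketchChunk]:
        current = []  # lines collected for the chunk in progress
        kind = "preamble"  # anything before the first definition
        for line in code.splitlines():
            starts_def = line.startswith(("def ", "class "))
            if starts_def:
                # Close the chunk in progress before opening a new one.
                text = "\n".join(current).strip()
                if text:
                    yield SketchChunk(text, kind)
                current = []
                kind = "class" if line.startswith("class ") else "function"
            current.append(line)
        text = "\n".join(current).strip()
        if text:
            yield SketchChunk(text, kind)


if __name__ == "__main__":
    sample = "import os\n\ndef a():\n    return 1\n\nclass B:\n    pass\n"
    for piece in TopLevelDefStrategy().chunk(sample):
        print(piece.kind, repr(piece.text))
```

Keeping the base class abstract means a caller can accept any strategy object, which fits the commit message's note that the standard strategy need not be the default one.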

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"""Code chunking package."""
