
Commit 5d40600

feat: add HTML serializer (#232)
* initial attempt at HTML serializer Signed-off-by: Peter Staar <[email protected]>
* first version, to be tested thoroughly Signed-off-by: Peter Staar <[email protected]>
* added the new test Signed-off-by: Peter Staar <[email protected]>
* rewrote carefully the export-to-html into new framework Signed-off-by: Peter Staar <[email protected]>
* fixed the inline list-items Signed-off-by: Peter Staar <[email protected]>
* added the inline code Signed-off-by: Peter Staar <[email protected]>
* migrated the table html code Signed-off-by: Peter Staar <[email protected]>
* updated the picture HTML serializer Signed-off-by: Peter Staar <[email protected]>
* updated the html for Form and KeyValue Signed-off-by: Peter Staar <[email protected]>
* first version of KeyValue serialisation Signed-off-by: Peter Staar <[email protected]>
* fixed the key-value and form-region and added the GraphData serializer Signed-off-by: Peter Staar <[email protected]>
* need to do some mypy work now Signed-off-by: Peter Staar <[email protected]>
* passed the mypy Signed-off-by: Peter Staar <[email protected]>
* added the get_excluded_refs function to obtain proper serialization Signed-off-by: Peter Staar <[email protected]>
* enabled the captions Signed-off-by: Peter Staar <[email protected]>
* removed empty lists Signed-off-by: Peter Staar <[email protected]>
* added initial split view and customised styles Signed-off-by: Peter Staar <[email protected]>
* cleaned up, now waiting for page-indices propagation Signed-off-by: Peter Staar <[email protected]>
* updated the styles and parameters with split_page Signed-off-by: Peter Staar <[email protected]>
* propagated the parameter split_page_view Signed-off-by: Peter Staar <[email protected]>
* first fully working version Signed-off-by: Peter Staar <[email protected]>
* removed the prints Signed-off-by: Peter Staar <[email protected]>
* fixed the tests Signed-off-by: Peter Staar <[email protected]>
* fixed the tests Signed-off-by: Peter Staar <[email protected]>
* fixed the tests for html export Signed-off-by: Peter Staar <[email protected]>
* removed dead code Signed-off-by: Peter Staar <[email protected]>
* updated the test output Signed-off-by: Peter Staar <[email protected]>
* fixed the tests Signed-off-by: Peter Staar <[email protected]>
* reformatted the code Signed-off-by: Peter Staar <[email protected]>
* rename parameter tag to class_name Signed-off-by: Peter Staar <[email protected]>
* added serializers to table and picture Signed-off-by: Peter Staar <[email protected]>
* removed dead code Signed-off-by: Peter Staar <[email protected]>
* various HTML serialization improvements (#242) Signed-off-by: Panos Vagenas <[email protected]>
* added enum for different output styles Signed-off-by: Peter Staar <[email protected]>
---------
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]>
1 parent 23036e1 commit 5d40600


12 files changed: +2163 -836 lines changed


docling_core/experimental/serializer/html.py

Lines changed: 931 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
"""HTML styles for different export modes."""


def _get_css_with_no_styling() -> str:
    """Return default CSS styles for the HTML document."""
    return "<style></style>"


def _get_css_for_split_page() -> str:
    """Return default CSS styles for the HTML document."""
    return """<style>
    html {
        background-color: #e1e1e1;
        font-family: Arial, sans-serif;
        line-height: 1.6;
    }
    img {
        min-width: 500px;
        max-width: 100%;
    }
    table {
        border-collapse: collapse;
        border: 0px solid #fff;
        width: 100%;
    }
    td {
        vertical-align: top;
    }
    .page {
        background-color: white;
        margin-top:15px;
        padding: 30px;
        border: 1px solid black;
        width:100%;
        max-width:1000px;
        box-shadow: 0 0 10px rgba(0,0,0,0.5);
    }
    .page figure {
        text-align: center;
    }
    .page img {
        max-width: 900px;
        min-width: auto;
    }
    .page table {
        border-collapse: collapse;
        margin: 1em 0;
        width: 100%;
    }
    .page table td {
        border: 1px solid #ddd;
        padding: 8px;
        text-align: left;
    }
    .page table th {
        border: 1px solid #ddd;
        padding: 8px;
        text-align: left;
        background-color: #f2f2f2;
        font-weight: bold;
    }
    .page table caption {
        color: #666;
        font-style: italic;
        margin-top: 0.5em;
        padding: 8px;
        margin-top: 5px;
        margin-bottom: 5px;
    }
    .page figcaption {
        color: #666;
        font-style: italic;
        margin-top: 0.5em;
        padding: 8px;
        margin-top: 5px;
        margin-bottom: 5px;
    }
    code {
        background-color: rgb(228, 228, 228);
        border: 1px solid darkgray;
        padding: 10px;
        display: inline-block;
        font-family: monospace;
        max-width:980px;
        word-wrap: normal;
        white-space: pre-wrap;
        word-wrap: break-word;
        /*overflow-wrap: break-word;*/
    }
    </style>
    """


def _get_css_for_single_column() -> str:
    """Return CSS styles for the single-column HTML document."""
    return """<style>
    html {
        background-color: #f5f5f5;
        font-family: Arial, sans-serif;
        line-height: 1.6;
    }
    body {
        max-width: 800px;
        margin: 0 auto;
        padding: 2rem;
        background-color: white;
        box-shadow: 0 0 10px rgba(0,0,0,0.1);
    }
    h1, h2, h3, h4, h5, h6 {
        color: #333;
        margin-top: 1.5em;
        margin-bottom: 0.5em;
    }
    h1 {
        font-size: 2em;
        border-bottom: 1px solid #eee;
        padding-bottom: 0.3em;
    }
    table {
        border-collapse: collapse;
        margin: 1em 0;
        width: 100%;
    }
    th, td {
        border: 1px solid #ddd;
        padding: 8px;
        text-align: left;
    }
    th {
        background-color: #f2f2f2;
        font-weight: bold;
    }
    figure {
        margin: 1.5em 0;
        text-align: center;
    }
    figcaption {
        color: #666;
        font-style: italic;
        margin-top: 0.5em;
    }
    img {
        max-width: 100%;
        height: auto;
    }
    pre {
        background-color: #f6f8fa;
        border-radius: 3px;
        padding: 1em;
        overflow: auto;
    }
    code {
        font-family: monospace;
        background-color: #f6f8fa;
        padding: 0.2em 0.4em;
        border-radius: 3px;
    }
    pre code {
        background-color: transparent;
        padding: 0;
    }
    .formula {
        text-align: center;
        padding: 0.5em;
        margin: 1em 0;
        background-color: #f9f9f9;
    }
    .formula-not-decoded {
        text-align: center;
        padding: 0.5em;
        margin: 1em 0;
        background: repeating-linear-gradient(
            45deg,
            #f0f0f0,
            #f0f0f0 10px,
            #f9f9f9 10px,
            #f9f9f9 20px
        );
    }
    .page-break {
        page-break-after: always;
        border-top: 1px dashed #ccc;
        margin: 2em 0;
    }
    .key-value-region {
        background-color: #f9f9f9;
        padding: 1em;
        border-radius: 4px;
        margin: 1em 0;
    }
    .key-value-region dt {
        font-weight: bold;
    }
    .key-value-region dd {
        margin-left: 1em;
        margin-bottom: 0.5em;
    }
    .form-container {
        border: 1px solid #ddd;
        padding: 1em;
        border-radius: 4px;
        margin: 1em 0;
    }
    .form-item {
        margin-bottom: 0.5em;
    }
    .image-classification {
        font-size: 0.9em;
        color: #666;
        margin-top: 0.5em;
    }
    </style>"""
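
The commit message mentions an enum for the different output styles, but the code that uses it sits in the html.py diff, which is not rendered here. As a rough sketch only, the three style functions in this file could be selected and embedded as follows. The names `OutputStyle`, `css_for`, and `wrap_document` are hypothetical and not taken from the commit, and the style functions are replaced by shortened stand-ins so that the block runs on its own.

```python
from enum import Enum
from typing import Optional


# Stand-ins for the three functions in this diff (CSS bodies shortened).
def _get_css_with_no_styling() -> str:
    return "<style></style>"


def _get_css_for_split_page() -> str:
    return "<style>.page { background-color: white; }</style>"


def _get_css_for_single_column() -> str:
    return "<style>body { max-width: 800px; }</style>"


class OutputStyle(str, Enum):
    """Hypothetical enum; the one added in the commit may differ."""

    SINGLE_COLUMN = "single_column"
    SPLIT_PAGE = "split_page"


def css_for(style: Optional[OutputStyle]) -> str:
    """Pick a <style> block; None is treated as 'no styling'."""
    if style is None:
        return _get_css_with_no_styling()
    if style is OutputStyle.SPLIT_PAGE:
        return _get_css_for_split_page()
    return _get_css_for_single_column()


def wrap_document(body_html: str, style: Optional[OutputStyle] = None) -> str:
    """Place the chosen <style> block in <head> and the content in <body>."""
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"{css_for(style)}\n"
        "</head>\n<body>\n"
        f"{body_html}\n"
        "</body>\n</html>"
    )
```

Because each function returns a complete `<style>...</style>` element, a caller can drop the result into `<head>` without further wrapping, which is what `wrap_document` assumes.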
