Error on python 1-extraction.py #17

@neurosurgeon

I get the following output at the end of the run. I was wondering if the issue is a dead URL:


Affiliation

teria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources in DocBank, are often only distinguishable by discriminating on 3 https://arxiv.org/ Figure 4: Table 1 from the DocLayNet paper in the original PDF (A), as rendered Markdown (B) and in JSON representation (C). Spanning table cells, such as the multi-column header 'triple interannotator mAP@0.5-0.95 (%)', is repeated for each column in the Markdown representation (B), which guarantees that every data point can be traced back to row and column headings only by its grid coordinates in the table. In the JSON representation, the span information is reflected in the fields of each table cell (C).

and

, as seen

Phase 1: Data selection and preparation.

Our inclusion cri-
Traceback (most recent call last):
  File "/Users/user1/projects/ai-cookbook/knowledge/docling/1-extraction.py", line 22, in <module>
    result = converter.convert("https://ds4sd.github.io/docling/")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user1/opt/anaconda3/envs/py312/lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 39, in wrapper_function
    return wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user1/opt/anaconda3/envs/py312/lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 136, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user1/opt/anaconda3/envs/py312/lib/python3.12/site-packages/docling/document_converter.py", line 241, in convert
    return next(all_res)
           ^^^^^^^^^^^^^
  File "/Users/user1/opt/anaconda3/envs/py312/lib/python3.12/site-packages/docling/document_converter.py", line 264, in convert_all
    for conv_res in conv_res_iter:
                    ^^^^^^^^^^^^^
  File "/Users/user1/opt/anaconda3/envs/py312/lib/python3.12/site-packages/docling/document_converter.py", line 314, in _convert
    for input_batch in chunkify(
                       ^^^^^^^^^
  File "/Users/user1/opt/anaconda3/envs/py312/lib/python3.12/site-packages/docling/utils/utils.py", line 15, in chunkify
    for first in iterator:  # Take the first element from the iterator
                 ^^^^^^^^
  File "/Users/user1/opt/anaconda3/envs/py312/lib/python3.12/site-packages/docling/datamodel/document.py", line 247, in docs
    resolve_source_to_stream(item, self.headers)
  File "/Users/user1/opt/anaconda3/envs/py312/lib/python3.12/site-packages/docling_core/utils/file.py", line 108, in resolve_source_to_stream
    res.raise_for_status()
  File "/Users/user1/opt/anaconda3/envs/py312/lib/python3.12/site-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://ds4sd.github.io/docling/
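
The last frame shows the 404 comes from fetching the URL itself, before any conversion starts, so a dead link would explain it. As a possible workaround, here is a minimal sketch that tries several sources in order and reports which ones failed. The helper name `convert_first_available` is hypothetical and not part of docling. It relies on `requests.exceptions.HTTPError` being a subclass of `OSError`, and the replacement URL in the usage note is only a placeholder.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def convert_first_available(
    convert: Callable[[str], T], sources: Iterable[str]
) -> Tuple[str, T]:
    """Try each source in order; return (source, result) for the first success.

    Hypothetical helper: `convert` is any callable taking a source string,
    for example the `converter.convert` method from the script.
    """
    errors = []
    for source in sources:
        try:
            return source, convert(source)
        except OSError as exc:
            # requests.exceptions.HTTPError (the 404 in the traceback) derives
            # from OSError, so fetch failures are caught here.
            errors.append(f"{source}: {exc}")
    raise RuntimeError("no source could be fetched:\n" + "\n".join(errors))
```

In `1-extraction.py` this could be called as `source, result = convert_first_available(converter.convert, ["https://ds4sd.github.io/docling/", "https://example.com/replacement-page"])`, where the second URL stands in for whatever page currently hosts the document.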
