Hi! I recently got it into my head to play around with offline copies of Wikipedia. The Wikimedia Foundation very helpfully provides downloadable dumps of the full content of Wikipedia. The dump itself is a 20 GB file that unzips to a single .xml file measuring 78 GB. You read that right: one XML file, 84,602,863,258 bytes. You can probably see where this is going...
Alas, I am many gigabytes short of fitting this entire mountain of a document in memory at once, let alone twice plus overhead (once as a string and again parsed). If I have any hope of consuming this thing with precision (as opposed to regex, shudder), I believe a streaming parser and query engine will be necessary; however, I did not see a streaming interface in the sxd_document::parser docs or in sxd_xpath. Is that a correct assessment? Have you considered building a streaming interface to handle such cases? (In my experience, a streaming solution can also turn out to be the fastest implementation even when memory pressure is not a concern, so such a use case may be valuable for speed alone.)
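
For what it's worth, this is roughly the shape of the loop I have in mind. It's a minimal sketch using the quick-xml crate rather than sxd (roughly its 0.30-era pull API), and the dump filename and the choice to count `<title>` elements are just placeholders:

```rust
use quick_xml::events::Event;
use quick_xml::reader::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path to the decompressed dump; memory use stays bounded
    // because we only ever hold one event's worth of data at a time.
    let mut reader = Reader::from_file("enwiki-latest-pages-articles.xml")?;
    let mut buf = Vec::new();
    let mut in_title = false;
    let mut count: u64 = 0;

    loop {
        match reader.read_event_into(&mut buf)? {
            Event::Start(e) if e.name().as_ref() == b"title" => in_title = true,
            Event::Text(e) if in_title => {
                // Do something per page title; here we just count them.
                let _title = e.unescape()?;
                count += 1;
            }
            Event::End(e) if e.name().as_ref() == b"title" => in_title = false,
            Event::Eof => break,
            _ => {}
        }
        buf.clear(); // reuse the buffer between events
    }

    println!("saw {count} <title> elements");
    Ok(())
}
```

What I'd love is to be able to run XPath-style queries over something like this event stream instead of hand-rolling the state machine for every element I care about.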
Thoughts?