Hi! I recently got it into my head to play around with offline copies of Wikipedia. The Wikimedia Foundation very helpfully provides downloadable dumps of the full content of Wikipedia. The dump itself is a 20 GB file that unzips to a single .xml file measuring 78 GB. You read that right: one XML file, 84,602,863,258 bytes. You can probably see where this is going...
Alas, I am many gigabytes short of fitting this entire mountain of a document in memory at once, let alone twice plus overhead (once as a string and again parsed). If I have any hope of consuming this thing with precision (as opposed to regex, shudder), I believe a streaming parser and query engine will be necessary; however, I did not see a streaming interface in the sxd_document::parser docs or in sxd_xpath. Is that a correct assessment? Have you considered building a streaming interface to handle such cases? (In my experience, a streaming solution can also turn out to be the fastest implementation even when memory pressure is not a concern, so such a use case may be valuable for speed alone.)
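
For what it's worth, this is roughly the shape of the loop I have in mind. It's a minimal sketch using the quick-xml crate rather than sxd (roughly its 0.30-era pull API), and the dump filename and the choice to count `<title>` elements are just placeholders:

```rust
use quick_xml::events::Event;
use quick_xml::reader::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path to the decompressed dump; memory use stays bounded
    // because we only ever hold one event's worth of data at a time.
    let mut reader = Reader::from_file("enwiki-latest-pages-articles.xml")?;
    let mut buf = Vec::new();
    let mut in_title = false;
    let mut count: u64 = 0;

    loop {
        match reader.read_event_into(&mut buf)? {
            Event::Start(e) if e.name().as_ref() == b"title" => in_title = true,
            Event::Text(e) if in_title => {
                // Do something per page title; here we just count them.
                let _title = e.unescape()?;
                count += 1;
            }
            Event::End(e) if e.name().as_ref() == b"title" => in_title = false,
            Event::Eof => break,
            _ => {}
        }
        buf.clear(); // reuse the buffer between events
    }

    println!("saw {count} <title> elements");
    Ok(())
}
```

What I'd love is to be able to run XPath-style queries over something like this event stream instead of hand-rolling the state machine for every element I care about.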
Thoughts?