Best practice to save and load data from/to files #660

manuschneider · 2025-09-05T08:04:57Z

manuschneider
Sep 5, 2025
Collaborator

I would like to discuss file IO, and see how this can be integrated in Cytnx projects.

In a typical project, I need some file IO - I usually run the same algorithm many times (usually in parallel), with different options and parameters. Three file types are common:

1) Input file: usually in text format, defining my parameters; read at the beginning of my code.
Then the algorithm runs and creates:
2) Output file: a few numbers that are measured in every iteration and saved to a file. This would be something like [iteration->15, Energy->-3.149, magnetization->0.957593, ... ], [iteration->16, Energy->-3.17, magnetization->0.98375, ... ], ... Additionally, it would make sense to save some values initially, like the parameters the algorithm ran with, initial values for several variables, etc.
3) Data file(s): tensors and large data that need to be stored.

Often, I need to read and write files in several languages (for example, bash for manipulation of input files, C++/Python to run the algorithm, Matlab to do the data analysis). I could write project-specific file IO commands in each language, but it would be nicer to have a standard format.

My wish would be:
- Text files (?) in a standard format for 1) and 2), that can be written and read in many languages. Maybe as tuples (variable name -> value), but it also makes sense to combine all values with the same iteration number. What is a good practice for this?
- Binary files for 3). Here too, it would be great if the data could be read and written in a standard format, and if things that belong together could be written to one file. This could, for example, be all parameters, the iteration number, and a TN that consists of many tensors. It would be good if such a file could be opened in this standard format, and it would tell me the parameter-value pairs and say that there are 100 objects of type 'UniTensor' (which can then be saved in a Cytnx-specific format). That way, a user who does not know my project-specific binary IO format can still understand what is saved in this file.

It would be great if Cytnx supported such a format, since currently each tensor needs to be written to a new file, which seems quite impractical for big networks (or am I missing something here?)

IvanaGyro · 2025-09-05T08:14:36Z

IvanaGyro
Sep 5, 2025
Collaborator

For 1 and 2, my suggestion is to save your configuration files and output files in a format that is convenient to save and load in most languages. INI, JSON, YAML, and TOML are possible candidates. I am not familiar with Matlab, so I am not sure which formats Matlab supports.
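
As one possible reading of this suggestion, the sketch below uses only Python's standard `json` module: the input file is a JSON object of parameters, and the output file is written as JSON Lines (one JSON object per line), so that all values belonging to one iteration stay together. The file names, the parameter names (`bond_dim`, `J`, `n_sweeps`) and the `type` field are assumptions made for illustration; the measurement values are the placeholder numbers from the original post, not real results.

```python
import json
import tempfile
from pathlib import Path


def load_params(path):
    """Read run parameters from a JSON input file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def append_record(path, record):
    """Append one record to the output file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_records(path):
    """Read all records back from a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


workdir = Path(tempfile.mkdtemp())
input_path = workdir / "input.json"
output_path = workdir / "output.jsonl"

# Hypothetical parameters, written here only to keep the example self-contained.
input_path.write_text(json.dumps({"bond_dim": 32, "J": 1.0, "n_sweeps": 2}))

params = load_params(input_path)

# First line of the output: the parameters the run started with.
append_record(output_path, {"type": "params", **params})

# One line per iteration, keeping all values of that iteration together.
measurements = [(15, -3.149, 0.957593), (16, -3.17, 0.98375)]
for iteration, energy, magnetization in measurements:
    append_record(
        output_path,
        {
            "type": "measurement",
            "iteration": iteration,
            "energy": energy,
            "magnetization": magnetization,
        },
    )

records = read_records(output_path)
print(len(records), "records written")
```

Appending one line per iteration means a run that is interrupted still leaves a readable file, and line-oriented tools in bash can filter it without a JSON-aware parser for the whole file.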

For 3, is there anything that saving the data in the Cytnx format does not cover?

manuschneider · 2025-09-05T08:20:57Z

manuschneider
Sep 5, 2025
Collaborator Author

Thanks a lot, I will have a look into these formats.

For 3: can I combine parameters (name-value pairs) and several tensors in one file (and if so, how)? Also, one typically needs to know the exact file format of the project. So I was wondering if there is a standard format to pack the tensors into a container together with metadata such as parameters, etc. which can be read even without knowing my project details. This makes it easier to share my results with others who do not know my specific binary format.

1 reply

IvanaGyro Sep 5, 2025
Collaborator

If you want to save the key-value strings in a human-readable format and the data (tensors) in the same file, the most portable way I can think of is to encode the binary data (the binary files output by Cytnx) with base64 and save it in a JSON-like format. The drawback of this method is that the file will be much larger, because the information density of base64 is considerably lower than that of the binary format, and encoding and decoding cost extra resources.

Saving the file path of the binary data in the JSON-like configuration file may be a better method than the above.
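
The two options above can be compared in a short sketch using only the Python standard library. Random bytes stand in for a binary tensor file, since no Cytnx call is assumed here; the file names, the `params` and `tensors` keys, and the parameter values are hypothetical.

```python
import base64
import json
import os
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# Stand-in for a binary tensor file; random bytes replace real Cytnx output.
tensor_path = workdir / "tensor_0.bin"
tensor_bytes = os.urandom(3000)
tensor_path.write_bytes(tensor_bytes)

params = {"bond_dim": 32, "J": 1.0}  # hypothetical parameters

# Option A: embed the binary payload as base64 text inside the JSON document.
embedded = {
    "params": params,
    "tensors": {"tensor_0": base64.b64encode(tensor_bytes).decode("ascii")},
}
embedded_path = workdir / "run_embedded.json"
embedded_path.write_text(json.dumps(embedded))

# Option B: keep the binary file separate and store only its relative path.
referenced = {"params": params, "tensors": {"tensor_0": tensor_path.name}}
referenced_path = workdir / "run_referenced.json"
referenced_path.write_text(json.dumps(referenced))

# Reading back: A decodes the base64 string, B follows the stored path.
loaded_a = json.loads(embedded_path.read_text())
decoded = base64.b64decode(loaded_a["tensors"]["tensor_0"])

loaded_b = json.loads(referenced_path.read_text())
from_path = (workdir / loaded_b["tensors"]["tensor_0"]).read_bytes()

encoded_len = len(loaded_a["tensors"]["tensor_0"])
print(f"raw: {len(tensor_bytes)} bytes, base64: {encoded_len} characters")
```

Base64 maps every 3 bytes to 4 characters, so the embedded payload is about a third larger than the raw data. Option B avoids that overhead but keeps one file per tensor, and the stored path breaks if the files are moved apart.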

manuschneider · 2025-09-05T08:55:24Z

manuschneider
Sep 5, 2025
Collaborator Author

My workaround so far has been to write all relevant parameters into the file name. But this does not seem like a very clean solution to me, and what if I want to add a parameter later?

Writing the tensors in text format seems inefficient, and Cytnx provides no way of doing so either (which makes my encoding format project-specific again).

How about HDF5? I think it can efficiently combine human-readable key-value pairs with large data structures.

For now: can I at least store several tensors in one file? Like combining them in a vector and saving that vector to binary?
For example, I currently have a project with some 100 MPSs per configuration, several hundred tensors per MPS, and some hundreds of configurations. If I save every tensor in a separate file (about 10 million files!), I even run into problems with the number of inodes on the old filesystem of the cluster I use. And synchronization between machines with so many files is neither convenient nor efficient. So this is not really a solution for big projects.
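
As a stopgap for the file-count problem described above, and not a Cytnx feature, many per-tensor binary files can be packed into a single archive together with a JSON manifest. The sketch below uses Python's standard `zipfile` module. The manifest layout, the member names under `tensors/`, and the parameter names are assumptions for illustration, and random bytes stand in for the contents of binary files that Cytnx would write.

```python
import json
import os
import tempfile
import zipfile
from pathlib import Path


def write_bundle(bundle_path, params, blobs):
    """Pack a JSON manifest and many binary blobs into one uncompressed zip file."""
    manifest = {
        "params": params,
        "objects": [{"name": name, "type": "UniTensor"} for name in blobs],
    }
    with zipfile.ZipFile(bundle_path, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
        for name, data in blobs.items():
            zf.writestr(f"tensors/{name}.bin", data)


def read_manifest(bundle_path):
    """Return the parameters and object list without touching the tensor data."""
    with zipfile.ZipFile(bundle_path) as zf:
        return json.loads(zf.read("manifest.json"))


def read_blob(bundle_path, name):
    """Return the raw bytes of one stored object."""
    with zipfile.ZipFile(bundle_path) as zf:
        return zf.read(f"tensors/{name}.bin")


workdir = Path(tempfile.mkdtemp())

# Placeholder payloads: in practice these would be the bytes of saved tensor files.
blobs = {f"site_{i:03d}": os.urandom(64) for i in range(100)}

bundle = workdir / "mps_config_0.zip"
write_bundle(bundle, {"bond_dim": 32, "J": 1.0}, blobs)

manifest = read_manifest(bundle)
print(len(manifest["objects"]), "objects of type", manifest["objects"][0]["type"])
```

Anyone with a zip tool can open such a bundle and read `manifest.json` without knowing the project, which covers the "100 objects of type 'UniTensor'" use case at the level of metadata. To load a tensor, a member would have to be extracted to a temporary file first; whether Cytnx can read from an in-memory buffer is not assumed here.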

ianmccul · 2025-09-05T10:17:08Z

ianmccul
Sep 5, 2025

HDF5 might be a good option here. I've never used it myself, but I know other people use it (including ITensor, I think). There are some efforts to find common formats, e.g. https://tensor.sciencesconf.org/ and https://github.com/TAPPorg/tensor-interfaces. Although I can't see anything there about file formats, I believe that has been discussed (but no visible progress...). An interchange format is of course a different problem from saving data during a calculation (or for follow-up calculations).

https://zarr.dev/ might be a good option, but I only just came across it and know nothing about it.

The scheme the Matrix Product Toolkit uses works very well, but is very specialized. It stores history along with the data, so e.g. mp-history <psi> gives the complete log of all commands that modified that wavefunction. This kind of metadata is really useful. If using HDF5 or Zarr, something similar could be done by adding a history array as JSON or similar.

yingjerkao · 2025-09-05T12:06:56Z

yingjerkao
Sep 5, 2025
Maintainer

I vote for HDF5. Actually, there is some discussion among library builders about a potential common HDF5 schema for sharing tensors in the future. It depends on what kind of metadata is required, as you need to store enough information so that people can easily tell how the tensors were produced, under what kind of symmetries, etc.
