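Variable-length (vlen) strings are stored in HDF5's global heap, which is not compressed: dataset filters apply only to the references that point into the heap, not to the string data itself. Fixed-length strings are compressed, but they are a poor fit for datasets of large, irregularly sized texts, because every entry occupies the full fixed length regardless of how short it is. A good way around this is a string meta-dataset, which stores the text as one dataset of fixed-length character rows plus a second dataset of start/stop indices. This layout allows the string data to be compressed and still lets individual strings be read by index.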
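
The layout described above can be sketched with plain NumPy arrays. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `pack_strings` and `read_string`, the UTF-8 encoding, the zero padding of the last row, and the default `row_width` are all hypothetical choices made for the example.

```python
import numpy as np


def pack_strings(strings, row_width=64):
    """Pack strings into fixed-width byte rows plus a (start, stop) index.

    Hypothetical helper: returns `rows`, a 2-D uint8 array of shape
    (n_rows, row_width), and `index`, an int64 array of shape (n_strings, 2)
    holding byte offsets into the flattened rows.
    """
    encoded = [s.encode("utf-8") for s in strings]
    lengths = np.array([len(b) for b in encoded], dtype=np.int64)
    stops = np.cumsum(lengths)
    starts = stops - lengths
    index = np.stack([starts, stops], axis=1)

    flat = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    n_rows = max(1, -(-flat.size // row_width))  # ceiling division
    rows = np.zeros((n_rows, row_width), dtype=np.uint8)  # last row is zero padded
    rows.reshape(-1)[: flat.size] = flat
    return rows, index


def read_string(rows, index, i):
    """Read string i, touching only the rows that contain its bytes."""
    start, stop = index[i]
    row_width = rows.shape[1]
    first_row = start // row_width
    last_row = -(-stop // row_width)  # exclusive upper bound
    chunk = rows[first_row:last_row].reshape(-1)
    offset = start - first_row * row_width
    return chunk[offset : offset + (stop - start)].tobytes().decode("utf-8")


texts = ["short", "", "a much longer, irregularly sized text " * 3, "naïve café"]
rows, index = pack_strings(texts, row_width=16)
restored = [read_string(rows, index, i) for i in range(len(texts))]
```

Because both `rows` and `index` are ordinary fixed-size arrays, they could be written as regular chunked datasets, for example with h5py's `create_dataset(..., compression="gzip")`, so the filter pipeline applies to the text itself. Storing byte offsets rather than character offsets keeps the index valid for multi-byte UTF-8 text.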