Streaming gigabyte medical images from S3 without downloading them
Many of these scientific file formats (HDF5, netCDF, TIFF/COG, FITS, GRIB, JPEG and more) are essentially just contiguous multidimensional array(/"tensor") chunks embedded alongside metadata about what's in the chunks. Efficiently fetching these from object storage comes down to fetching the metadata up front, so you know where the chunks you want are[1].
The data model of Zarr[2] generalizes this pattern pretty well, so that when backed by Icechunk[3], you can store a "datacube" of "virtual chunk references" that point at chunks anywhere inside the original files on S3.
This allows you to stream data out as fast as the S3 network connection allows[4], and then you're free to pull that directly, or build tile servers on top of it[5].
In the Pangeo project and at Earthmover we do all this for weather and climate science data. But the underlying OSS stack is domain-agnostic, so it works for all sorts of multidimensional array data, and VirtualiZarr[0] has a plugin system for parsing different scientific file formats.
I would love to see if someone could create a virtual Zarr store pointing at this WSI data!
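For concreteness, here's roughly what creating one looks like. This is a sketch only: the paths are made up, it assumes a VirtualiZarr parser exists for the file's internal layout, and the exact signatures shift between versions.

    # Sketch, not tested: paths are hypothetical and APIs vary by version.
    import icechunk
    from virtualizarr import open_virtual_dataset

    # Scan only the file's metadata, producing "virtual chunk references"
    # (byte ranges into the original object) instead of copying any data.
    vds = open_virtual_dataset("s3://some-bucket/slide.tiff")

    # Commit those references transactionally to an Icechunk repo.
    storage = icechunk.local_filesystem_storage("./wsi-repo")
    repo = icechunk.Repository.create(storage)
    session = repo.writable_session("main")
    vds.virtualize.to_icechunk(session.store)
    session.commit("add virtual refs to original WSI file")
    # (A real setup also needs the repo configured with a virtual chunk
    # container and credentials for the source bucket, omitted here.)

After that, Zarr/xarray clients open the repo and every chunk read becomes a ranged GET against the original file.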
[0] https://virtualizarr.readthedocs.io/en/stable/
[1] https://earthmover.io/blog/fundamentals-what-is-cloud-optimi...
[2] https://earthmover.io/blog/what-is-zarr
[3] https://earthmover.io/blog/icechunk-1-0-production-grade-clo...
[4] https://earthmover.io/blog/i-o-maxing-tensors-in-the-cloud
I wonder what exactly the big multi-model AI companies are doing to optimize model cold-start latency, and how much it just looks like Zarr on top of on-prem object storage.
It's definitely one of many fields that see convergent evolution towards something that just looks like Zarr. In fact you can use VirtualiZarr to parse HuggingFace's "SafeTensors" format[0].
[0] https://github.com/zarr-developers/VirtualiZarr/pull/555
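The format makes the resemblance obvious: a safetensors file is an 8-byte little-endian header length, a JSON header, then raw tensor bytes, so every tensor is just a (byte offset, length) into the file, which is exactly what a virtual Zarr store records. A minimal sketch of recovering those ranges by hand (file name hypothetical; the linked PR does this properly):

    import json
    import struct

    def safetensors_byte_ranges(path):
        # Layout: u64 little-endian header length, JSON header, raw data.
        with open(path, "rb") as f:
            (header_len,) = struct.unpack("<Q", f.read(8))
            header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        ranges = {}
        for name, info in header.items():
            if name == "__metadata__":  # optional free-form metadata key
                continue
            begin, end = info["data_offsets"]  # relative to the data section
            ranges[name] = (info["dtype"], info["shape"],
                            data_start + begin, data_start + end)
        return ranges

    print(safetensors_byte_ranges("model.safetensors"))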
I feel that we no longer really need TIFF etc.: for scientific use cases in the cloud, Zarr is all that's needed going forward. The other file formats become just archival blobs that are either converted to Zarr or pointed at by virtual Zarr stores.
> Many of these scientific file formats (HDF5, netCDF, TIFF/COG, FITS, GRIB, JPEG and more) are essentially just contiguous multidimensional array(/"tensor") chunks
Yeah, a recurring thought is that these should condense into Apache Arrow queried by DuckDB, but there must be some reason that hasn't already happened.
Existing solutions are all complicated and clunky. I put something together with S3 and a bastardised Cloud-Optimized GeoTIFF (COG) and got instant views of any part of the image.
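For the curious, the core of this kind of setup is plain HTTP range reads: once the header tells you where a tile lives, one ranged GET fetches just that tile. A minimal boto3 sketch (bucket, key, and offsets are made up):

    import boto3

    s3 = boto3.client("s3")

    def read_bytes(bucket, key, start, length):
        # Ranged GET: only `length` bytes ever cross the network.
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range=f"bytes={start}-{start + length - 1}")
        return resp["Body"].read()

    # Offsets would come from the TIFF tile index, read the same way.
    tile = read_bytes("my-slides", "case-001.tif", 4096, 262_144)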
Wish I knew how to commercialise it…
You've already done the "building v1" part, and have started to do the "talking about it" part.
Next step is to write up how one could use it, how it is better than the alternatives, and put it up on a website.
I'm happy to chat about it if you like. My email is in my profile.
Once you have real users, they will pull the v2 out of you, and that will be what you'll sell.
What I've written above sounds like a business proposition, but I want to clarify that I'm just offering to share what I know for free :-)
JPEG-LL refers to the lossless mode of the original JPEG standard (ISO/IEC 10918-1 / ITU-T T.81), also known as JPEG Lossless. It is not to be confused with JPEG-LS (ISO/IEC 14495-1; DICOM transfer syntax 1.2.840.10008.1.2.4.80), which offers better compression ratios and speed via the LOCO-I algorithm. JPEG-LL is older and less efficient, yet more widely implemented in legacy systems.
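If you're working with DICOM, the transfer syntax UID tells you which flavour a given file actually uses; a quick check with pydicom (file name hypothetical):

    import pydicom

    LOSSLESS_JPEG_UIDS = {
        "1.2.840.10008.1.2.4.57": "JPEG Lossless (Process 14)",
        "1.2.840.10008.1.2.4.70": "JPEG Lossless (Process 14, SV1), i.e. JPEG-LL",
        "1.2.840.10008.1.2.4.80": "JPEG-LS Lossless",
    }

    # stop_before_pixels avoids reading the (large) pixel data itself.
    ds = pydicom.dcmread("slide.dcm", stop_before_pixels=True)
    uid = str(ds.file_meta.TransferSyntaxUID)
    print(uid, "->", LOSSLESS_JPEG_UIDS.get(uid, "something else"))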
The lossless mode in JPEG-XL is superior to all of those.
The main problem is that most of them support only a subset of the more advanced S3 features, and often not a very big one. But if you just want to dump some backups in the cloud, Backblaze and other alternatives are cheaper.
There are choices that speak the S3 data plane API (GetObject, ListBucket, etc).
There are no alternatives that support most of the AWS S3 functionality, such as replication or event notifications.
That being said, I plan to support more cloud platforms in the future, starting with GCP.
Interesting guide to the Whole Slide Image (WSI) format. The surprising thing for me is that compression is used, and they note it does not affect use in diagnostics.
Back in the day we used TIFF for a similar application (X-ray detector images).
WSIStreamer is relevant for storage systems without a filesystem. In that case OpenSlide cannot work, since it needs to open the file and seek within it.
Edit: Looks like this is a slight discrepancy between the HN title and the GitHub description.
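To make the constraint concrete: OpenSlide's API is path-based, so there is no hook for ranged reads against an object store; you hand it a local file and it seeks inside (sketch, path hypothetical):

    import openslide

    slide = openslide.OpenSlide("/data/case-001.svs")  # a path, not a stream
    print(slide.dimensions, slide.level_count)

    # Read a 512x512 patch at level 0, from (x, y) in level-0 coordinates.
    region = slide.read_region((10_000, 10_000), 0, (512, 512))
    region.convert("RGB").save("patch.png")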
Was there a requirement to work with these formats directly without converting?
Sometimes we re-write the image as a pyramidal TIFF (this has happened to me a few times, when NDPI images had only the highest-resolution level and no pyramid), in which case COGs could work; see the sketch below.
As for digital pathology, the field is very much tied to scanner-vendor proprietary formats (SVS, NDPI, MRXS, etc).
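That rewrite is fairly painless with libvips; a rough sketch (paths hypothetical, and it assumes libvips was built with OpenSlide support so it can read NDPI):

    import pyvips

    image = pyvips.Image.new_from_file("slide.ndpi")  # highest-res level
    image.tiffsave(
        "slide_pyramidal.tif",
        tile=True,          # tiled layout for random access
        tile_width=256,
        tile_height=256,
        pyramid=True,       # regenerate the missing lower-res levels
        compression="jpeg",
        Q=90,
        bigtiff=True,       # WSIs routinely exceed the 4 GB TIFF limit
    )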