Deep Lake, a Lakehouse for Deep Learning: Related Work

cover
5 Jun 2024

Authors:

(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;

(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;

(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;

(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;.

(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;

(6) David Isayan, Activeloop, Mountain View, CA, USA;

(7) Mark McQuade, Activeloop, Mountain View, CA, USA;

(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;

(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;

(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;

(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.

Multiple projects have tried to improve upon or create new formats for storing unstructured datasets including TFRecord extending Protobuf [5], Petastorm [18] extending Parquet [79], Feather [7] extending arrow [13], Squirrel using MessagePack [75], Beton in FFCV [39]. Designing a universal dataset format that solves all use cases is very challenging. Our approach was mostly inspired by CloudVolume [11], a 4-D chunked NumPy storage for storing large volumetric biomedical data. There are other similar chunked NumPy array storage formats such as Zarr [52], TensorStore [23], TileDB [57]. Deep Lake introduced a typing system, dynamically shaped tensors, integration with fast deep learning streaming data loaders, queries on tensors and in-browser visualization support. An alternative approach to store large-scale datasets is to use HPC distributed file system such as Lustre [69], extending with PyTorch cache [45] or performant storage layer such as AIStore [26]. Deep Lake datasets can be stored on top of POSIX or REST API-compatible distributed storage systems by leveraging their benefits. Other comparable approaches evolve in vector databases [80, 8, 80] for storing embeddings, feature stores [73, 16] or data version control systems such as DVC [46], or LakeFS [21]. In contrast, Deep Lake version control is in-built into the format without an external dependency, including Git. Tensor Query Language, similar to TQP [41] and Velox [59] approaches, runs n-dimensional numeric operations on tensor storage by truly leveraging the full capabilities of deep learning frameworks. Overall, Deep Lake takes parallels from data lakes such as Hudi, Iceberg, Delta [27, 15, 10] and complements systems such as Databarick’s Lakehouse [28] for Deep Learning applications.

This paper is available on arxiv under CC 4.0 license.