JZFS: Datasets Management in the Era of AI

Product Thinking

Data sources can be classified into three categories: structured, semi-structured, and unstructured. Unstructured data are commonly estimated to account for approximately 80% of all data worldwide, whereas structured data account for only about 20%.

As models have become more sophisticated and easier to use off the shelf, AI teams have realized that iterating on data is at least as important as iterating on models for developing and deploying high-accuracy models successfully and efficiently. ML models have grown increasingly complex and opaque in recent years and require significantly larger amounts of training data. In addition, data have become a useful interface for working with subject matter experts and turning their expertise into software. Finally, data-centric AI enables a higher level of model accuracy than was previously feasible with model-centric techniques alone.

Datasets are dynamic. New files and new versions of existing files enter a dataset at the ingestion stage. Extractors can also evolve over time and generate new versions of raw data. As a result, dataset versioning is a cross-cutting concern across all stages of a dataset's life cycle. Vanilla distributed file systems are not adequate for versioning-related operations. For example, simply storing a full copy of every version may be too costly for large datasets, and without a good version manager, tracking versions by filename alone is error-prone. Because a dataset usually has many users, it is even more important to track clearly which versions each user is working with and how those versions evolve. Furthermore, as the number of versions grows, storing and retrieving versions efficiently and cost-effectively becomes an important feature of a successful dataset management system.
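To make the storage-cost and filename-tracking points concrete, here is a minimal sketch of content-addressed versioning, the general idea behind Git-like tools. It is an illustration only and does not reflect how JZFS is implemented: the `VersionStore` class, its method names, and the sample file names are all hypothetical. Each file is stored once under the hash of its content, and a version is just a mapping from paths to hashes, so unchanged files cost nothing extra and versions are identified by number rather than by filename.

```python
import hashlib


class VersionStore:
    """Toy in-memory, content-addressed version store (hypothetical, not JZFS)."""

    def __init__(self):
        self.blobs = {}    # content hash -> file bytes, each distinct content stored once
        self.commits = []  # one snapshot per version: {path: content hash}

    def commit(self, files):
        """Record a snapshot of {path: bytes} and return its version number."""
        snapshot = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)  # unchanged content is not stored again
            snapshot[path] = digest
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def read(self, version, path):
        """Return the bytes of `path` as they were at `version`."""
        return self.blobs[self.commits[version][path]]

    def changed_paths(self, old, new):
        """List paths added, removed, or modified between two versions."""
        a, b = self.commits[old], self.commits[new]
        return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


store = VersionStore()
v0 = store.commit({"train.csv": b"id,label\n1,cat\n", "labels.json": b"{}"})
v1 = store.commit({"train.csv": b"id,label\n1,cat\n2,dog\n", "labels.json": b"{}"})

print(store.changed_paths(v0, v1))  # ['train.csv']
print(len(store.blobs))             # 3 stored blobs for 4 file entries
```

Both versions remain readable through `store.read(version, path)`, while the unchanged `labels.json` is stored only once. A real system would add persistence, branching, and access control on top of this idea.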

JZFS was built to address the problems above:
a Git-like version control file system for data lineage and data collaboration.
https://github.com/GitDataAI/jzfs
