Try searching for anything on https://open.quiltdata.com/ and let us know how search works for you. We kind of surprised ourselves with a Google-like experience that returns primary data instead of links to web pages. We’ve got over 1M Jupyter notebooks, 100M Amazon reviews, and many more public S3 objects on over a dozen topics indexed in ElasticSearch.
The best example, so far, of “S3 bucket as data repo” is from the Allen Institute for Cell Science: https://open.quiltdata.com/b/allencell/tree/
Kevin and I met in grad school. We started with the belief that if data could be “managed like code,” data would be easier to access, more accurate, and could serve as the foundation for smarter decisions. While we loved databases and systems, we found that technical and cost barriers kept data out of the hands of the people who needed it the most: NGOs, citizens, and non-technical users. That led to three distinct iterations of Quilt over as many years and has now culminated in open.quiltdata.com, where we’ve made a few petabytes of public data in S3 easy to search, browse, visualize, and summarize.
In earlier versions of Quilt, we focused on writing new software to version and package data. We also attempted to host private user data in our own cloud. For reasons that we would soon realize, these were mistakes:
* Few users were willing to copy data — especially sensitive and large data — into Quilt
* It was difficult to gather a critical mass of interesting and useful data that would keep users coming back
* Data are consumed in teams that include a variety of non-technical users
* Even in 2019, it’s unnecessarily difficult and expensive to host and share large files. (GitHub, Dropbox, and Google Drive all have quotas and performance limitations, and none of them can serve as a distributed backend for an application.)
* It’s difficult for a small team to build both “git for data” (core tech) and “GitHub for data” (website network effect) at the same time
On the plus side, our users confirmed that “immutable data dependencies” (something Quilt still does) went a long way towards making analysis reproducible and traceable.
Putting all of the above together, we realized that if we viewed S3 as “git for data,” it would solve a lot of problems at once: S3 supports object versioning, a huge chunk of public and customer data are already there (no copying), and it keeps users in direct control of their own data. Looking forward, the S3 interface is general enough (especially with tools like min.io) to abstract away any storage layer. We also want to bring Quilt to other clouds, and even to on-prem volumes. We repurposed our “immutable dataset abstraction” (Quilt packages) to solve a problem that S3 object versioning doesn’t: the ability to take an immutable snapshot of an entire directory, bucket, or collection of buckets.
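To make the snapshot idea concrete, here is a minimal sketch in Python. It is not Quilt's actual manifest format: the field names (`key`, `version_id`, `size`, `sha256`) and the `build_manifest` helper are assumptions for illustration. The point is that pinning each key to a specific S3 object version, then hashing the whole listing, gives a single identifier for the state of a directory or bucket.

```python
import hashlib
import json


def build_manifest(objects):
    """Return (jsonl_text, top_hash) for a list of object records.

    Each record is a dict with hypothetical fields:
    key, version_id, size, sha256. No S3 calls are made here;
    the records are assumed to come from a prior bucket listing.
    """
    # Sort by key so the same set of objects always yields the same manifest.
    entries = sorted(objects, key=lambda o: o["key"])
    # One JSON document per line (JSONL), with stable field ordering.
    lines = [json.dumps(entry, sort_keys=True) for entry in entries]
    jsonl_text = "\n".join(lines) + "\n"
    # The hash of the manifest text identifies the snapshot as a whole.
    top_hash = hashlib.sha256(jsonl_text.encode("utf-8")).hexdigest()
    return jsonl_text, top_hash


if __name__ == "__main__":
    records = [
        {"key": "images/cell_002.tiff", "version_id": "v7", "size": 2048, "sha256": "bb"},
        {"key": "images/cell_001.tiff", "version_id": "v1", "size": 1024, "sha256": "aa"},
    ]
    manifest, snapshot_id = build_manifest(records)
    print(manifest, end="")
    print("snapshot:", snapshot_id)
```

Because each entry records a version ID, a later overwrite of a key in the bucket does not change what the snapshot refers to; it only produces a different manifest, with a different hash, the next time a snapshot is taken.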
We believe that public data should be free and open to all — with no competing interests from advertisers — that private data should be secure, and that all data should remain under the direct control of its creators. We feel that a “federated network of S3 buckets” offers the foundations on which to achieve such a vision.
All of that said, wow do we have a long way to go. We ran into all kinds of challenges scaling and sharding ElasticSearch to accommodate the 10 billion objects on open.quiltdata.com, and we are still researching the best way to fork and merge datasets. (The Quilt package manifests are JSONL, so our leading theory is to check these into git so that diffs and merges can be accomplished over S3 key metadata, without the need to diff or even touch primary data in S3, which are too large to fit into git anyway.)
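As a rough illustration of why diffing manifests is cheap, the sketch below compares two JSONL manifests purely by their per-key metadata. It is an assumption-laden toy, not Quilt's implementation: the `key` field and the `parse_manifest` and `diff_manifests` helpers are hypothetical names, and real manifests may carry different fields.

```python
import json


def parse_manifest(jsonl_text):
    """Map each logical key to its metadata entry (one JSON object per line)."""
    entries = {}
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines
            entry = json.loads(line)
            entries[entry["key"]] = entry
    return entries


def diff_manifests(old_text, new_text):
    """Compare two manifests using metadata only; no object data is read."""
    old = parse_manifest(old_text)
    new = parse_manifest(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # A key counts as modified if any recorded metadata differs,
        # for example its version ID or content hash.
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


if __name__ == "__main__":
    old = '{"key": "a.csv", "version_id": "v1"}\n{"key": "b.csv", "version_id": "v1"}\n'
    new = '{"key": "b.csv", "version_id": "v2"}\n{"key": "c.csv", "version_id": "v1"}\n'
    print(diff_manifests(old, new))
```

The work here scales with the number of keys, not with the size of the underlying objects, which is what would make a git-based workflow over manifests plausible even when the primary data are far too large for git.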
Your comments, design suggestions, and open source contributions to any of the above topics are welcome.