Launch HN: Quilt - A versioned data portal for S3, Hacker News

We’re Aneesh and Kevin of Quilt (https://open.quiltdata.com/). Quilt is a versioned data portal for S3 that makes it easier to share, discover, model, and decide based on data at scale. It consists of a Python client, web catalog, and lambda functions (all open source), plus a suite of backend containers and CloudFormation templates for businesses to run their own stacks. Public data are free. Private stacks are available for a flat monthly licensing fee.

Try searching for anything on https://open.quiltdata.com/ and let us know how search works for you. We kind of surprised ourselves with a Google-like experience that returns primary data instead of links to web pages. We’ve got over 1M Jupyter notebooks, 100M Amazon reviews, and many more public S3 objects on over a dozen topics indexed in Elasticsearch.

The best example, so far, of “S3 bucket as data repo” is from the Allen Institute for Cell Science: https://open.quiltdata.com/b/allencell/tree/.

Kevin and I met in grad school. We started with the belief that if data could be “managed like code,” data would be easier to access, more accurate, and could serve as the foundation for smarter decisions. While we loved databases and systems, we found that technical and cost barriers kept data out of the hands of people that needed it the most: NGOs, citizens, and non-technical users. That led to three distinct iterations of Quilt over as many years and has now culminated in open.quiltdata.com, where we’ve made a few petabytes of public data in S3 easy to search, browse, visualize, and summarize.

In earlier versions of Quilt, we focused on writing new software to version and package data. We also attempted to host private user data in our own cloud. For reasons that we would soon realize, these were mistakes:

* Few users were willing to copy data — especially sensitive and large data — into Quilt

* It was difficult to gather a critical mass of interesting and useful data that would keep users coming back

* Data are consumed in teams that include a variety of non-technical users

* Even in 2019, it’s unnecessarily difficult and expensive to host and share large files. (GitHub, Dropbox, and Google Drive all have quotas and performance limitations, and none of them can serve as a distributed backend for an application.)

* It’s difficult for a small team to build both “git for data” (core tech) and “GitHub for data” (website network effect) at the same time

On the plus side, our users confirmed that “immutable data dependencies” (something Quilt still does) went a long way towards making analysis reproducible and traceable.

Put all of the above together, and we realized that if we viewed S3 as “git for data”, it would solve a lot of problems at once: S3 supports object versioning, a huge chunk of public and customer data are already there (no copying), and it keeps users in direct control of their own data. Looking forward, the S3 interface is general enough (especially with tools like min.io) to abstract away any storage layer, and we want to bring Quilt to other clouds, and even to on-prem volumes. We repurposed our “immutable dataset abstraction” (Quilt packages) and used it to solve a problem that S3 object versioning doesn’t: the ability to take an immutable snapshot of an entire directory, bucket, or collection of buckets.
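To make the snapshot idea concrete, here is a minimal sketch in Python. It is not Quilt's actual manifest format or client API: the field names (`logical_key`, `physical_key`, `size`), the function names, and the example bucket and version IDs are all assumptions chosen for illustration. The point is only that pinning each object to a specific S3 version ID, and then hashing the resulting list, yields a single identifier for the state of a whole directory or bucket.

```python
import hashlib
import json


def make_manifest(entries):
    """Build a JSONL snapshot manifest (hypothetical format).

    `entries` is an iterable of
    (logical_key, bucket, key, version_id, size) tuples.
    Each output line pins one object to an exact S3 object version.
    """
    lines = []
    # Sort so the manifest does not depend on listing order.
    for logical_key, bucket, key, version_id, size in sorted(entries):
        record = {
            "logical_key": logical_key,
            "physical_key": f"s3://{bucket}/{key}?versionId={version_id}",
            "size": size,
        }
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)


def manifest_hash(manifest_text):
    """Return a SHA-256 hex digest that identifies the whole snapshot."""
    return hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()


# Example with made-up bucket names and version IDs.
snapshot = make_manifest([
    ("data/b.csv", "example-bucket", "raw/b.csv", "v1", 2048),
    ("data/a.csv", "example-bucket", "raw/a.csv", "v7", 1024),
])
snapshot_id = manifest_hash(snapshot)
```

Because only keys, version IDs, and sizes are recorded, taking a snapshot in this sketch never copies or rewrites the underlying objects.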

We believe that public data should be free and open to all — with no competing interests from advertisers — that private data should be secure, and that all data should remain under the direct control of its creators. We feel that a “federated network of S3 buckets” offers the foundations on which to achieve such a vision.

All of that said, wow, do we have a long way to go. We ran into all kinds of challenges scaling and sharding Elasticsearch to accommodate the 10 billion objects on open.quiltdata.com, and we are still researching the best way to fork and merge datasets. (The Quilt package manifests are JSONL, so our leading theory is to check these into git so that diffs and merges can be accomplished over S3 key metadata, without the need to diff or even touch primary data in S3, which are too large to fit into git anyway.)
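As a rough sketch of what a metadata-only diff could look like, the Python below compares two JSONL manifests by key without reading any primary data. The record layout (`logical_key`, `physical_key`) and the function names are assumptions for illustration, not the real Quilt manifest schema, and this is only one possible approach to the open fork-and-merge question.

```python
import json


def load_manifest(text):
    """Parse JSONL manifest text into a dict keyed by logical key."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            entries[record["logical_key"]] = record
    return entries


def diff_manifests(old_text, new_text):
    """Compare two manifests using only their metadata records."""
    old, new = load_manifest(old_text), load_manifest(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same logical key, but a different version, location, or size.
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Example with made-up keys and version IDs.
old_manifest = "\n".join([
    json.dumps({"logical_key": "a.csv", "physical_key": "s3://example-bucket/a.csv?versionId=v1"}),
    json.dumps({"logical_key": "b.csv", "physical_key": "s3://example-bucket/b.csv?versionId=v1"}),
])
new_manifest = "\n".join([
    json.dumps({"logical_key": "a.csv", "physical_key": "s3://example-bucket/a.csv?versionId=v2"}),
    json.dumps({"logical_key": "c.csv", "physical_key": "s3://example-bucket/c.csv?versionId=v1"}),
])
changes = diff_manifests(old_manifest, new_manifest)
```

Since each manifest line is a small, self-contained JSON record, a line-oriented tool such as git can also diff and merge these files directly, which is what makes the one-record-per-line layout attractive here.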

Your comments, design suggestions, and open source contributions to any of the above topics are welcome.
