Bigtable is a storage system designed for petabyte-scale structured data. Although it is generally seen as a database, the authors of the paper are careful to label it a “storage system.” It introduces the concept of a Sorted String Table (SSTable): an immutable file of sorted key-value pairs.
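A minimal in-memory sketch of the SSTable idea (a real SSTable is an on-disk file with a block index, but the essential properties are immutability and sorted keys, which make point lookups and range scans cheap):

```python
from bisect import bisect_left

class SSTable:
    """Toy in-memory SSTable: an immutable, sorted list of (key, value) pairs.

    Illustrative only; a real SSTable is an immutable on-disk file whose
    block index is loaded into memory for lookups.
    """

    def __init__(self, items):
        # Sort once at construction; the table is never mutated afterwards.
        self._items = sorted(items)
        self._keys = [k for k, _ in self._items]

    def get(self, key):
        # Binary search over the sorted keys.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

    def scan(self, start, end):
        # Range scans [start, end) are cheap because keys are sorted.
        i = bisect_left(self._keys, start)
        j = bisect_left(self._keys, end)
        return self._items[i:j]

t = SSTable([("b", 2), ("a", 1), ("c", 3)])
t.get("b")        # → 2
t.scan("a", "c")  # → [("a", 1), ("b", 2)]
```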
Bigtable’s query model is a lookup keyed by a row key, a column key, and a timestamp. It is built on top of GFS; hence, replication and availability are delegated to the underlying file system. Its biggest strength is its ability to scale to arbitrary size with near-linear (or better) throughput scaling, made possible by keeping rows sorted by key.
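Conceptually, the whole table is one big map from (row key, column key, timestamp) to a value. A toy sketch of that map, using the webtable example from the paper (the specific keys are illustrative):

```python
# Toy model of Bigtable's data model:
#   (row_key, column_key, timestamp) -> value
# Column keys use the family:qualifier form; keys below are illustrative.
cells = {
    ("com.cnn.www", "contents:html", 3): "<html>v3</html>",
    ("com.cnn.www", "contents:html", 5): "<html>v5</html>",
    ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
}

def lookup(row_key, column_key, timestamp):
    # Exact-match lookup on all three dimensions of the map.
    return cells.get((row_key, column_key, timestamp))

lookup("com.cnn.www", "contents:html", 5)  # → "<html>v5</html>"
```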
Row keys are arbitrary strings. Writes under a single row key are atomic; i.e., all such changes are guaranteed to be applied as a single operation. Data for a row is kept together, and the row space is dynamically partitioned into row ranges called “tablets,” the unit of distribution across servers.
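A sketch of how a sorted row space splits into contiguous row ranges. This partitions by row count for simplicity; real Bigtable splits tablets by size and load, not by a fixed number of rows:

```python
def split_into_tablets(row_keys, max_rows):
    """Toy tablet assignment: partition a sorted row space into
    contiguous row ranges ("tablets") of at most max_rows rows each.

    Illustrative only; Bigtable splits tablets by data size and load.
    """
    keys = sorted(row_keys)  # row keys are kept in sorted order
    return [keys[i:i + max_rows] for i in range(0, len(keys), max_rows)]

split_into_tablets(["e", "a", "c", "b", "d"], 2)
# → [["a", "b"], ["c", "d"], ["e"]]
```

Because the split is along sorted contiguous ranges, a scan over nearby keys touches only one or a few tablets.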
Columns are grouped into column families, which are the unit of both compression and access control: all data within a family is compressed together, and the entire family is permissioned as a unit. Specific columns are accessed as if in a namespace (family:column).
Each cell in Bigtable is versioned using a timestamp. The API permits querying for arbitrary timestamps, not just ones that exactly match a stored version; in that case, it returns the newest version that is no newer than the specified time.
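That lookup semantic ("newest version at or before time t") can be sketched with a binary search over a cell's versions; this is an illustration of the behavior, not Bigtable's actual code:

```python
from bisect import bisect_right

def latest_at_or_before(versions, ts):
    """Given a cell's versions as (timestamp, value) pairs sorted by
    timestamp, return the value of the newest version with timestamp <= ts,
    or None if no version is that old. (Sketch of the semantics only.)
    """
    stamps = [t for t, _ in versions]
    i = bisect_right(stamps, ts)  # first index with timestamp > ts
    return versions[i - 1][1] if i > 0 else None

v = [(3, "v3"), (5, "v5"), (9, "v9")]
latest_at_or_before(v, 7)  # → "v5" (newest version not newer than 7)
latest_at_or_before(v, 2)  # → None (no version exists at or before 2)
```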
Feature stores for machine learning are a specific example of this pattern: they commonly depend on Bigtable-like key-value storage for low-latency online feature serving.
Apache HBase is essentially a FOSS reimplementation of Bigtable, built on HDFS rather than GFS.