Today we will explore a certain type of storage called object storage.

You might have heard of offerings such as Amazon S3, Google Cloud Storage, or Azure Blob Storage, which are managed offerings of this type.

What is an Object really?

An object can be anything you want it to be! An audio file, a video file, an Excel spreadsheet, a 3-D model, or maybe just a text document.

It's just the smallest abstraction in this form of storage.

The anatomy of an Object

An object has the following four parts:

A unique identifier: To be honest, this is not that useful to us directly; it is mostly used under the hood to identify the object.
Data: This is the actual data residing inside this abstraction.
Metadata: This is probably one of the most important aspects of this type of storage. Not only does it store information about the data (think time of creation, owner, file type, version), but its fields are also used to partition the data. This means that when configuring this type of storage, you should choose the partition fields with care.
Attributes: These are like metadata, but they are not about the data itself. They hold things like policies that tell the software handling the objects which privileges to grant. For example, you might want one object to be encrypted differently from the others.

All the data you want to save is broken down into this flat structure, where each unit is an object.
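To make the four parts concrete, here is a minimal Python sketch of what one object could look like. The class name, field names, and sample values are all made up for illustration; real object stores define their own layout.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredObject:
    """Illustrative model of one object; every field name is an assumption."""
    data: bytes                                     # the actual payload
    metadata: dict = field(default_factory=dict)    # information about the data
    attributes: dict = field(default_factory=dict)  # policies for the storage software
    # Unique identifier, generated under the hood rather than chosen by the user.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


report = StoredObject(
    data=b"quarterly numbers",
    metadata={"owner": "alice", "content_type": "text/plain", "version": 1},
    attributes={"encryption": "aes-256"},
)
```

Note that the metadata and attributes live next to the data but are separate from it, which is what lets the storage layer reason about an object without opening the payload.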

What is a bucket?

This is the next level of abstraction.

These are essentially scalable containers that hold objects within them. They can scale practically without limit, holding billions of objects at a time (putting that many in one bucket might not be a good idea, but it's good to know it's possible).

And these buckets are ultimately mapped to physical servers.

The objects that you create can be replicated across multiple buckets on different physical servers for reasons like:

Data protection: You wouldn't want to lose the data if a server crashes.
Lower latency: It might be a good idea to keep the data a client wants to access close to them. This can be done by placing different buckets in geographically different locations and then routing each request to the geographically closest bucket.
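The "closest bucket" idea can be sketched in a few lines of Python. This is only an illustration: the replica names and coordinates are hypothetical, and it uses straight-line (great-circle) distance as a stand-in for whatever routing a real service does, which may well be based on network latency rather than geography.

```python
import math

# Hypothetical bucket replicas: name -> (latitude, longitude) in degrees.
REPLICAS = {
    "bucket-us-east": (39.0, -77.5),
    "bucket-eu-west": (53.3, -6.3),
    "bucket-ap-south": (19.1, 72.9),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def closest_replica(client_location, replicas=REPLICAS):
    """Return the name of the replica geographically closest to the client."""
    return min(replicas, key=lambda name: haversine_km(client_location, replicas[name]))


# A client near Paris gets routed to the European replica.
paris_choice = closest_replica((48.9, 2.35))
```

The same replicas also cover the data protection point: if one server goes away, the remaining entries in the table can still serve the object.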

How to access the data?

This type of storage typically uses HTTP for accessing the data.
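As a rough sketch, talking to an object store over HTTP tends to look like plain PUT, GET, and DELETE requests against a URL built from the bucket and the object key. The endpoint, bucket, and key below are hypothetical, and real services add authentication and their own headers on top of this, so the requests are only built here and never sent.

```python
import urllib.request
from urllib.parse import quote

# Hypothetical endpoint; not a real service.
ENDPOINT = "https://storage.example.com"


def object_url(bucket, key):
    """Build the URL for an object; the bucket/key URL layout is an assumption."""
    return f"{ENDPOINT}/{quote(bucket)}/{quote(key)}"


def build_put(bucket, key, data, content_type="application/octet-stream"):
    """PUT creates (or fully replaces) the object stored at bucket/key."""
    return urllib.request.Request(
        object_url(bucket, key),
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def build_get(bucket, key):
    """GET fetches the whole object."""
    return urllib.request.Request(object_url(bucket, key), method="GET")


def build_delete(bucket, key):
    """DELETE removes the object."""
    return urllib.request.Request(object_url(bucket, key), method="DELETE")


put_request = build_put("videos", "cats/intro.mp4", b"fake video bytes", "video/mp4")
# urllib.request.urlopen(put_request) would send it; skipped because the endpoint is made up.
```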

When to use this type of storage?

Well, to understand this, let's look at what this kind of storage is good for and what kind of things it horribly fails at.

The good stuff

It scales incredibly well, since you can keep adding objects to buckets and keep spreading those buckets across more physical servers without much overhead. Because everything is in a flat structure, the cost of fetching an object does not blow up as the store grows.
It doesn't care whether the data is structured or not. Since you choose your metadata separately from the data, the performance of reading this data really only depends on your metadata. This means it can handle unstructured workloads nicely.
Reduced latency for global access. Since you can distribute the data globally, you can expect low latency when accessing it over the internet.
It's easy to access the data since it uses HTTP.
Cost-effective since you are only paying for what you use.
Since it can support a large amount of metadata, you can simplify your data architecture.
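Here is a tiny sketch of what the flat structure from the first point means in practice. A plain dictionary stands in for a bucket, and the keys are made up. Every object sits at the same level and is fetched by its full key; anything that looks like a folder is just a shared key prefix.

```python
# A dictionary standing in for one bucket: full key -> object data.
flat_store = {
    "photos/2023/beach.jpg": b"jpeg bytes",
    "photos/2024/snow.jpg": b"jpeg bytes",
    "logs/app.log": b"log bytes",
}


def get_object(key):
    """Fetch by full key: a single lookup, with no directory tree to walk."""
    return flat_store[key]


def list_prefix(prefix):
    """'Folders' are only a naming convention: filter keys by a shared prefix."""
    return sorted(k for k in flat_store if k.startswith(prefix))


photo_keys = list_prefix("photos/")
```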

The bad stuff

Very slow updates. Since your data is replicated across multiple buckets in multiple locations, you can imagine the overhead you will incur if you keep updating it. So you can rule out this kind of storage if you plan to run a traditional database on top of it.
Writes have performance implications. The object is the smallest abstraction here, so there is no way to update the data inside an object in place. You will have to delete the object and create a new one. For example, you can't save a log file and keep appending to it. For every new line, you will have to create a new object.
Most operating systems don't have native support. You can't just mount a bucket and browse through it as you can with file storage.
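To see why appending hurts, here is a toy in-memory store (entirely hypothetical, not any real service's API) whose only write operation replaces a whole object. Emulating an append then means reading the full object, adding the line, and writing everything back.

```python
class ImmutableStore:
    """Toy store: objects can only be written whole, never edited in place."""

    def __init__(self):
        self._objects = {}
        self.bytes_written = 0  # total bytes sent to the store across all writes

    def put(self, key, data):
        self._objects[key] = bytes(data)
        self.bytes_written += len(data)

    def get(self, key):
        return self._objects[key]


def append_line(store, key, line):
    """Emulate an append: read the whole object, add a line, rewrite the whole object."""
    try:
        current = store.get(key)
    except KeyError:
        current = b""  # first write: nothing to read back yet
    store.put(key, current + line + b"\n")


log_store = ImmutableStore()
for i in range(3):
    append_line(log_store, "logs/app.log", b"line %d" % i)
```

After three appends the object holds 21 bytes, but 42 bytes were written to get there, and the gap keeps widening as the file grows. Writing one small object per line avoids the rewrite, at the cost of ending up with a very large number of objects.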

Examples

1) Hosting the data for an application like YouTube is a good example here: you don't need to update the data, latency matters a lot, and it scales really well.

2) Along the same lines, web hosting would also work out.

3) Something like Google Docs could work, although you might think it's a bad example at first since updates are poor for this type of storage. You can add a version field to the metadata to get version control built into the storage. Then each update won't really be an issue, since you just create a new object and the old one is still useful as an earlier version.
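The version-field idea from the third example could be sketched like this. It is a toy in-memory model under my own assumptions (the class and its methods are hypothetical): every save creates a brand-new object with the next version number in its metadata, and older versions stay readable.

```python
class VersionedStore:
    """Toy store where an update never overwrites: it adds a new version."""

    def __init__(self):
        self._objects = {}  # (key, version) -> {"data": ..., "metadata": ...}
        self._latest = {}   # key -> highest version number written so far

    def put(self, key, data):
        """Store data as a new object and return its version number."""
        version = self._latest.get(key, 0) + 1
        self._objects[(key, version)] = {"data": data, "metadata": {"version": version}}
        self._latest[key] = version
        return version

    def get(self, key, version=None):
        """Return the requested version, or the latest one if none is given."""
        if version is None:
            version = self._latest[key]
        return self._objects[(key, version)]["data"]


docs = VersionedStore()
docs.put("notes.txt", b"first draft")
docs.put("notes.txt", b"second draft")
latest = docs.get("notes.txt")
original = docs.get("notes.txt", version=1)
```

The trade-off is storage: nothing is ever overwritten, so old versions pile up until something cleans them out.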

Can't think of any more examples off the top of my head. Please share in the comments if you thought of a good one.

Have a good day!

Utkarsh Bindal's Blog