Understanding the Google File System

While reading up on how the internet works, I came across a research paper about something called the Google File System (GFS). It cleared up my doubts about how a company as large as Google stores such a sheer volume of information while still fetching it for us quickly and efficiently.

Introduction

In its early days, Google faced the challenge of organizing and managing its rapidly growing data, so it built its own distributed file system and described it in a research paper published in 2003. The design was efficient enough to underpin Google's large-scale services for years, until a successor system (Colossus) took over. In this blog post, we'll dive into the inner workings of the Google File System, exploring its key components, features, and the problems it was built to solve.

The Structure

At its core, a GFS cluster has two kinds of servers: a single master server and multiple chunkservers. Client applications talk to both.

The master server is the central node of the entire GFS. It maintains the metadata (data about the data) for the whole system: which files exist, who may access them, which chunks make up each file, and where those chunks are stored. It also handles system-wide housekeeping, such as garbage-collecting chunks that are no longer needed and migrating chunks between chunkservers.
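To make "metadata" a bit more concrete, here is a minimal Python sketch of the kind of tables a master could keep. The class and field names (MasterMetadata, file_to_chunks, chunk_locations, add_chunk) are my own illustrative assumptions, not names from the paper or from Google's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MasterMetadata:
    """Toy model of master-side metadata; it holds no file contents at all."""

    # file path -> ordered list of chunk handles (IDs) that make up the file
    file_to_chunks: Dict[str, List[int]] = field(default_factory=dict)
    # chunk handle -> names of the chunkservers holding a replica
    chunk_locations: Dict[int, List[str]] = field(default_factory=dict)
    next_handle: int = 0

    def add_chunk(self, path: str, servers: List[str]) -> int:
        """Record a new chunk for `path`, replicated on `servers`."""
        handle = self.next_handle
        self.next_handle += 1
        self.file_to_chunks.setdefault(path, []).append(handle)
        self.chunk_locations[handle] = list(servers)
        return handle


meta = MasterMetadata()
handle = meta.add_chunk("/logs/web.log", ["cs1", "cs2", "cs3"])
print(meta.file_to_chunks)
print(meta.chunk_locations[handle])
```

Notice that nothing here stores the bytes of a file. The master only knows which chunks belong to which file and where to find them.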

The master holds the metadata, but the real data (web pages, files, cloud storage) lives in the building block of the system, called a chunk. A chunk is the fixed-size unit of data in GFS. For reliability, each chunk is replicated on multiple servers called chunkservers (three replicas by default).

Chunkservers are distributed storage servers that hold data in large fixed-size chunks (typically 64 MB each). They are responsible for storing and serving data chunks when requested.
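Because every chunk has the same size, finding the chunk that holds a given byte of a file is simple arithmetic. The sketch below assumes the 64 MB chunk size mentioned above; the function names locate and chunks_needed are hypothetical and exist only for this example.

```python
from typing import Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size mentioned above


def locate(offset: int) -> Tuple[int, int]:
    """Return (chunk index, offset inside that chunk) for a byte offset in a file."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, CHUNK_SIZE)


def chunks_needed(file_size: int) -> int:
    """Number of fixed-size chunks needed to hold file_size bytes."""
    return -(-file_size // CHUNK_SIZE)  # ceiling division


MB = 1024 * 1024
print(locate(200 * MB))         # byte 200 MB falls in chunk 3, 8 MB into it
print(chunks_needed(200 * MB))  # a 200 MB file needs 4 chunks
```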

The Working

GFS runs on commodity Linux machines, each running a user-level server process. Why Linux, you ask? Because it is reliable and stable. A chunkserver and a client can even run on the same machine, as long as the machine's resources permit.

When a client application wants to read a file:

  • It first contacts the master server, which stores the metadata.

  • The master replies with the identity of the chunk that holds the requested data and the locations of the chunkservers storing its replicas.

  • The client then contacts one of those chunkservers directly, and the chunkserver returns the requested data. The file contents never pass through the master.
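The read path described above can be sketched as a small in-memory simulation. This is a toy under stated assumptions, not the real protocol: the classes Master and ChunkServer, the function client_read, and the server names are all made up for illustration, and the read is assumed to stay within a single chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB


class Master:
    """Holds only metadata: which chunk handles form a file, and where replicas live."""

    def __init__(self):
        self.file_to_chunks = {}   # path -> [chunk handle, ...]
        self.chunk_locations = {}  # chunk handle -> [chunkserver name, ...]

    def lookup(self, path, chunk_index):
        handle = self.file_to_chunks[path][chunk_index]
        return handle, self.chunk_locations[handle]


class ChunkServer:
    """Holds the actual bytes, keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]


def client_read(master, servers, path, offset, length):
    """Read `length` bytes at `offset`, assuming they lie within one chunk."""
    # Translate the file offset into a chunk index and an offset in that chunk.
    chunk_index, start = divmod(offset, CHUNK_SIZE)
    # Ask the master which chunk this is and which chunkservers hold it.
    handle, locations = master.lookup(path, chunk_index)
    # Fetch the bytes directly from one replica; the master is not involved.
    replica = servers[locations[0]]
    return replica.read(handle, start, length)


# Tiny demo: one file made of one chunk, stored on two replicas.
master = Master()
servers = {"cs1": ChunkServer(), "cs2": ChunkServer()}
master.file_to_chunks["/notes.txt"] = [42]
master.chunk_locations[42] = ["cs1", "cs2"]
for server in servers.values():
    server.chunks[42] = b"hello, gfs"

data = client_read(master, servers, "/notes.txt", offset=7, length=3)
print(data)  # b'gfs'
```

The point of the design is visible in client_read: the master answers one small metadata question, and the heavy lifting of moving data happens between the client and a chunkserver.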

GFS keeps multiple replicas of each chunk for reliability, and it uses checksums to verify the integrity of the data and detect corruption.
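Here is a minimal sketch of how checksums can catch corruption. The paper describes splitting each chunk into 64 KB blocks with a 32-bit checksum per block; the choice of CRC32 (via Python's zlib module) and the function names below are my assumptions for the example, not details taken from GFS itself.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks


def block_checksums(data: bytes) -> list:
    """Compute one CRC32 checksum per 64 KB block of `data`."""
    return [
        zlib.crc32(data[i:i + BLOCK_SIZE])
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def find_corrupt_blocks(data: bytes, expected: list) -> list:
    """Return the indexes of blocks whose checksum no longer matches."""
    actual = block_checksums(data)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(200 * 1024)       # 200 KB of zeros: 4 blocks, the last one partial
sums = block_checksums(original)   # stored when the data was written

damaged = bytearray(original)
damaged[70 * 1024] ^= 0xFF         # flip bits in a byte that lies in block 1

print(find_corrupt_blocks(original, sums))        # []
print(find_corrupt_blocks(bytes(damaged), sums))  # [1]
```

When a mismatch like this shows up, the data can be served from another replica instead, which is why replication and checksums work well together.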

The Google File System has evolved considerably since its initial release, and it has served as inspiration for numerous other distributed file systems.

If you want a deeper dive, you can read the research paper about GFS here

If you have any questions or suggestions, you can message me on Twitter.

Thanks. Have a nice day, developers.