Instagram grew from 0 to 14 million users in just over a year, from October 2010 to December 2011.

They achieved this with only 3 engineers.

They achieved this by following 3 key principles.

The principles of Instagram

  • Keep things simple.
  • Don't reinvent the wheel.
  • Use proven, robust technologies.

The stack explained in a simple way

Instagram's initial infrastructure ran on AWS, using EC2 with Ubuntu Linux.

For reference: EC2 is Amazon's service that allows developers to rent virtual computers.

To keep things simple, let's walk through a user session from a software engineer's perspective.

Each step of that session is called out at the start of the section that covers it.

Frontend

Session: A user opens the Instagram app.

Instagram was initially launched as an iOS app in 2010.

Since Swift was not released until 2014, we can assume the app was written in Objective-C, together with frameworks such as UIKit.

[Figure: Frontend stack]

Load balancing

Session: After the app opens, a request is sent to the backend to fetch the photos for the main feed. Instagram has a load balancer.

Instagram used Amazon's "Elastic Load Balancer" service.

Behind it sat 3 NGINX instances that could be swapped in and out as needed.

Each request went through the load balancer before being routed to the "application server".

Backend

Session: The load balancer forwards the request to the application server, which holds the logic to process it correctly.

Instagram's "application server" used Django, written in Python, with Gunicorn as its WSGI server.

As a reminder:

  • WSGI stands for "Web Server Gateway Interface".
  • It is the standard interface through which a web server passes requests to a Python web application.
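To make the WSGI contract concrete, here is a minimal sketch of a WSGI callable, plus a small helper that plays the role of the server. It is not Instagram's code: the names `application` and `call_app` and the response body are illustrative assumptions. Only the calling convention (an `environ` dict and a `start_response` callback) comes from the WSGI standard.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Minimal WSGI callable: the server invokes this once per request."""
    path = environ.get("PATH_INFO", "/")
    body = f"Hello from {path}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]  # an iterable of byte strings


def call_app(path):
    """Stand in for the server: build an environ, call the app, collect the reply."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = response_headers

    body = b"".join(application(environ, start_response))
    return captured["status"], body
```

A server such as Gunicorn is pointed at a callable like this with a `module:callable` argument, for example `gunicorn myapp:application`, where `myapp` is a hypothetical module name.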

Instagram used Fabric to run commands in parallel on many instances at once. This allowed deployment in seconds.

They had 25 "High-CPU Extra-Large" machines provided by Amazon.

Because the application servers were stateless, Instagram could simply add more machines whenever it needed to handle more requests (scaling out).

[Figure: Backend processing requests]

Data storage

Session: The application server sees that the request needs data for the main feed.

For this, let's say it needs:

  • IDs of the most recent relevant photos.
  • The actual photos that match those photo IDs.
  • User data for those photos.

Database: Postgres

Session: The application server fetches the IDs of the most recent relevant photos from Postgres.

The application server gets data from PostgreSQL, which stored most of Instagram's data, such as users and photo metadata.

Connections between Django and Postgres were pooled using PgBouncer.

Instagram sharded its data because of the high traffic it was receiving (more than 25 photos and 90 likes per second). Application code mapped several thousand "logical" shards onto a few physical shards.
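The mapping from logical fragments (shards) to physical ones can be sketched as follows. This is a minimal illustration under stated assumptions, not Instagram's implementation: the shard count of 4096, the server names, the modulo rule, and the contiguous-range assignment are all hypothetical.

```python
NUM_LOGICAL_SHARDS = 4096  # assumption: stands in for "several thousand"
PHYSICAL_SERVERS = ["db-a", "db-b", "db-c", "db-d"]  # hypothetical host names


def logical_shard_for(user_id):
    """Pick a logical shard from the user ID (assumed rule: simple modulo)."""
    return user_id % NUM_LOGICAL_SHARDS


def build_shard_map(num_logical, servers):
    """Assign contiguous ranges of logical shards to physical servers."""
    per_server = -(-num_logical // len(servers))  # ceiling division
    return {shard: servers[shard // per_server] for shard in range(num_logical)}


SHARD_MAP = build_shard_map(NUM_LOGICAL_SHARDS, PHYSICAL_SERVERS)


def server_for(user_id):
    """Resolve a user ID to the physical server that holds its logical shard."""
    return SHARD_MAP[logical_shard_for(user_id)]
```

The point of the indirection is that a logical shard can be moved to new hardware by updating its entry in the map, without changing which logical shard a given user belongs to.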

An interesting challenge that Instagram faced and solved was generating IDs that could be sorted by time.

These time-sortable IDs consisted of:

  • 41 bits for time in milliseconds since a custom epoch (Instagram cited 41 years of IDs; 2^41 ms is roughly 69 years).
  • 13 bits that represent the logical shard ID.
  • 10 bits representing an auto-incrementing sequence, modulo 1024. This means that 1024 IDs can be generated per shard, per millisecond.
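The 41 + 13 + 10 bit layout above can be sketched with a few shifts and masks. The epoch value and the function names here are hypothetical, chosen only for illustration; the bit widths are the ones listed above.

```python
CUSTOM_EPOCH_MS = 1_293_840_000_000  # hypothetical epoch: 2011-01-01T00:00:00Z
TIME_BITS, SHARD_BITS, SEQ_BITS = 41, 13, 10  # 64 bits in total


def make_id(now_ms, shard_id, sequence):
    """Pack a timestamp, logical shard ID, and sequence number into one integer."""
    elapsed = now_ms - CUSTOM_EPOCH_MS
    if not 0 <= elapsed < (1 << TIME_BITS):
        raise ValueError("timestamp outside the 41-bit range")
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard ID outside the 13-bit range")
    seq = sequence % (1 << SEQ_BITS)  # sequence wraps modulo 1024
    return (elapsed << (SHARD_BITS + SEQ_BITS)) | (shard_id << SEQ_BITS) | seq


def split_id(photo_id):
    """Recover (timestamp in ms, shard ID, sequence) from a packed ID."""
    seq = photo_id & ((1 << SEQ_BITS) - 1)
    shard_id = (photo_id >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    elapsed = photo_id >> (SHARD_BITS + SEQ_BITS)
    return elapsed + CUSTOM_EPOCH_MS, shard_id, seq
```

Because the timestamp occupies the most significant bits, sorting the IDs numerically also sorts them by creation time, and the shard that holds a row can be read straight out of its ID.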

Thanks to the time-sortable IDs in Postgres, the application server successfully received the IDs of the most recent relevant photos.

Photo storage: S3 and CloudFront

Session: The application server then fetches the actual photos that match those photo IDs, using CDN links so they load quickly for the user.

Several terabytes of photos were stored in Amazon S3.

These photos were quickly served to users using Amazon CloudFront.

Caching: Redis and Memcached

Session: To get the user data from Postgres, the application server (Django) maps photo IDs to user IDs using Redis.

Instagram used Redis to store a mapping from approximately 300 million photos to the ID of the user who created each one, in order to know which shard to query when fetching photos for the main feed, activity feed, etc.

All Redis data was held in memory to reduce latency, and it was split across multiple shards.

With smart hashing, Instagram was able to store 300 million key mappings in less than 5 GB.
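The text does not spell out what the "smart hashing" was. One plausible reading, in line with the bucketing approach Instagram's engineers described publicly, is to group many photo IDs into a single Redis hash rather than storing one top-level key per photo, since small hashes are encoded compactly. The sketch below uses an in-memory stand-in for Redis so it runs without a server. The bucket size of 1000, the `mediabucket:` key prefix, and the class and function names are assumptions.

```python
from collections import defaultdict

BUCKET_SIZE = 1000  # assumption: 1000 photo IDs share one hash


class FakeRedisHashes:
    """In-memory stand-in for Redis hash commands (HSET / HGET), for illustration."""

    def __init__(self):
        self._hashes = defaultdict(dict)

    def hset(self, key, field, value):
        self._hashes[key][field] = value

    def hget(self, key, field):
        return self._hashes.get(key, {}).get(field)

    def key_count(self):
        return len(self._hashes)


def bucket_key(photo_id):
    """All photo IDs in the same block of 1000 land in the same hash."""
    return f"mediabucket:{photo_id // BUCKET_SIZE}"


def set_owner(store, photo_id, user_id):
    store.hset(bucket_key(photo_id), str(photo_id), str(user_id))


def get_owner(store, photo_id):
    value = store.hget(bucket_key(photo_id), str(photo_id))
    return None if value is None else int(value)
```

With this layout, 300 million mappings need only about 300,000 top-level keys, which is where the memory saving would come from under this reading.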

This key-value mapping of photo ID to user ID was what told the application server which Postgres shard to query.

Session: Thanks to efficient caching with Memcached, fetching user data from Postgres was fast, since recent responses were read from the cache.

For general caching, Instagram used Memcached.

  • They had 6 instances of Memcached at the time.
  • Memcached is relatively simple to layer on top of Django.
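A common way to use a cache like Memcached in front of a database is the cache-aside pattern: check the cache first, and on a miss load from the database and store the result. The source does not say which pattern Instagram used, so treat this as a generic sketch. `TTLCache` is a tiny in-memory stand-in for a Memcached client, and `get_user` and `load_from_db` are hypothetical names.

```python
import time


class TTLCache:
    """In-memory stand-in for a Memcached client: get/set with an expiry time."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)


def get_user(cache, user_id, load_from_db, ttl_seconds=60):
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:
        user = load_from_db(user_id)  # the slow path, e.g. a Postgres query
        cache.set(key, user, ttl_seconds)
    return user
```

Repeated reads of the same user within the expiry window never reach the database, which is what keeps hot data cheap to serve.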

Interesting fact: 2 years later, in 2013, Facebook published a paper on how it scaled Memcached to handle billions of requests per second.

Session: The user now sees the home feed, populated with the latest photos from the people they follow.

[Figure: Redis and Memcached]

Master-replica configuration

Both Postgres and Redis ran in a master-replica configuration and used Amazon EBS (Elastic Block Store) snapshots for frequent system backups.

Push notifications and asynchronous tasks

Session: Now suppose the user closes the app, but later receives a push notification that a friend has posted a photo.

This push notification would be sent using pyapns, just like the more than a billion push notifications Instagram had already sent.

pyapns is an open-source project that simplifies integrating with the Apple Push Notification Service (APNS).

Session: The user loved this photo! So they decided to share it on Twitter.

On the backend, the task was sent to Gearman, a task queue that distributed the work to better-suited machines.

Instagram had around 200 workers written in Python, consuming tasks from the Gearman queue.

Gearman was used for multiple asynchronous tasks, such as pushing an activity (for example, a newly posted photo) out to all of a user's followers, which is known as fanout.
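Fanout through a task queue can be sketched with the standard library alone. This is not Gearman's API: a `queue.Queue` and a couple of threads stand in for the Gearman server and its Python workers, and the follower graph, the in-memory feeds, and the function names are all hypothetical.

```python
import queue
import threading
from collections import defaultdict

FOLLOWERS = {"alice": ["bob", "carol", "dave"]}  # hypothetical follower graph
feeds = defaultdict(list)  # follower name -> list of photo IDs
feeds_lock = threading.Lock()
tasks = queue.Queue()  # stands in for the task queue


def worker():
    """Consume tasks until a None sentinel arrives."""
    while True:
        task = tasks.get()
        try:
            if task is None:
                return
            follower, photo_id = task
            with feeds_lock:
                feeds[follower].append(photo_id)
        finally:
            tasks.task_done()


def fan_out(author, photo_id, num_workers=2):
    """Enqueue one task per follower and wait for the workers to drain the queue."""
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for follower in FOLLOWERS.get(author, []):
        tasks.put((follower, photo_id))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
```

The request that posts the photo only has to enqueue the work. The per-follower writes happen in the background, so the cost of a large follower list is not paid while the user waits.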

[Figure: Push notifications and task queues]

Monitoring

Session: Oh no! The Instagram app crashed because something failed on the server and it sent back a bad response. Instagram's three engineers are alerted instantly.

Instagram used Sentry, an open-source Django application, to monitor Python errors in real time.

Munin was used to graph system-level metrics and alert on anomalies. Instagram had a number of custom plugins to monitor application-level metrics (such as photos posted per second).
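A Munin plugin is a small executable: called with the argument `config` it prints the graph metadata, and called with no argument it prints the current value. Here is a minimal sketch of what a "photos posted per second" plugin could output. The field name `photos`, the category `instagram`, and the way the rate is computed are assumptions, not Instagram's actual plugin.

```python
def photos_per_second(photo_count, window_seconds):
    """Average posting rate over a measurement window."""
    return photo_count / window_seconds


def plugin_output(args, rate):
    """Return the lines a Munin plugin would print for the given arguments."""
    if args[:1] == ["config"]:
        # Metadata describing the graph and its single data series.
        return [
            "graph_title Photos posted per second",
            "graph_vlabel photos / second",
            "graph_category instagram",
            "photos.label photos posted",
        ]
    # Normal run: report the current value of the series.
    return [f"photos.value {rate:.2f}"]
```

A real plugin would print these lines to standard output, passing in `sys.argv[1:]` and a rate read from the application.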

Pingdom was used for external service monitoring, and PagerDuty for handling incidents and notifications.