GitHub: Designing Data-Intensive Applications
Use DDIA as your map, but use GitHub as your training ground.
Second, and more radically, GitHub implemented sharding (horizontal partitioning), first with custom tooling and later with Vitess. Its gh-ost tool, often mentioned alongside this work, handles online schema migrations for MySQL rather than sharding itself. They split the massive issues and pull_requests tables by repository ID, so the data for a single repository always lives on one shard. This is a deliberate choice: most queries (e.g., “list all issues in this repo”) are naturally local to a shard, avoiding costly distributed joins. The downside, as Kleppmann warns, is the loss of cross-shard transactional guarantees. For example, moving an issue from one repository to another can span two shards and can no longer be a single local transaction, something GitHub handles with asynchronous workflows and idempotent retries.
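To make the routing idea concrete, here is a minimal sketch of sharding by repository ID. It is an illustration under stated assumptions, not GitHub's actual implementation: the shard count, the hash-based placement, and the names shard_for_repo, route_issue_query, and is_cross_shard_move are all hypothetical.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, chosen only for illustration


def shard_for_repo(repo_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a repository ID to a shard index with a stable hash.

    A cryptographic hash is used only because it is deterministic across
    processes (unlike Python's built-in hash() for strings).
    """
    digest = hashlib.sha256(str(repo_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route_issue_query(repo_id: int) -> str:
    """Return the (hypothetical) shard name that holds all issues of a repo."""
    return f"issues_shard_{shard_for_repo(repo_id)}"


def is_cross_shard_move(src_repo_id: int, dst_repo_id: int) -> bool:
    """True if moving an issue between these repos touches two shards,
    in which case a single local transaction is not enough."""
    return shard_for_repo(src_repo_id) != shard_for_repo(dst_repo_id)
```

Because the shard is derived only from the repository ID, a query such as “list all issues in this repo” touches exactly one shard, while is_cross_shard_move flags the cases that need an asynchronous, retryable workflow. Plain modulo placement makes adding shards expensive, which is one reason real systems tend to use lookup tables or range-based schemes instead.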
The new data-intensive architecture brought significant improvements to GitHub's platform.
Data-intensive applications are everywhere, from social media platforms to e-commerce websites, and from financial systems to IoT sensor networks. These applications are characterized by their ability to handle large amounts of data, often in real time, and to provide insights and value to users. Their importance follows directly from that ability.
Look at the storage layer of SQLite to see the gold standard of B-Tree implementations.
4. Distributed Data: Transactions and Consensus
GitHub’s architecture reflects this through idempotency and reconciliation. Consider the git push operation. Network requests can time out, and clients will retry. If GitHub processes the same push twice, it must not duplicate commits or corrupt the repository. Because Git is immutable and content-addressed (the same data always yields the same hash), pushes are naturally idempotent. Metadata operations are harder. When a webhook delivers a “push” event to an integration, the integration might fail. GitHub therefore follows an approach in the spirit of the outbox pattern: the event is written to a persistent queue (like Kafka or their internal Resque system) before being sent. If delivery fails, the queue retries with exponential backoff, which gives at-least-once delivery. The consumer, in turn, must be written to handle duplicates gracefully.
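The retry-and-deduplicate flow described above can be sketched as follows. This is a minimal in-memory illustration, not GitHub's webhook code: deliver_with_retries, IdempotentConsumer, the event shape with an "id" field, and the delay values are all assumptions made for the example. A real consumer would keep its seen-ID set in durable storage.

```python
import time
from typing import Any, Callable, Dict, List, Set

Event = Dict[str, Any]


def deliver_with_retries(
    event: Event,
    send: Callable[[Event], None],
    max_attempts: int = 5,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send an event, backing off exponentially between failures.

    Returns True once send() succeeds, False if every attempt failed.
    Since a failed attempt may still have reached the receiver, this
    gives at-least-once (not exactly-once) delivery.
    """
    for attempt in range(max_attempts):
        try:
            send(event)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...
    return False


class IdempotentConsumer:
    """Processes each event ID at most once, ignoring redelivered copies."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()  # hypothetical: in-memory only
        self.processed: List[Event] = []

    def handle(self, event: Event) -> bool:
        """Return True if the event was processed, False if a duplicate."""
        event_id = event["id"]
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.processed.append(event)
        return True
```

The two halves are designed to be used together: the sender may deliver the same event more than once (for example, when the receiver processed it but the acknowledgement was lost), and the consumer's ID check turns those duplicates into no-ops.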
This draft provides a comprehensive overview of designing data-intensive applications, covering key concepts, principles, and best practices. It is inspired by Martin Kleppmann's book "Designing Data-Intensive Applications" and provides a detailed guide for software engineers and architects building scalable and fault-tolerant systems.