Alexandre Lim

Release It! By Michael T. Nygard

Not an easy read, but I strongly recommend Release It! It will give you a good overview of what it takes to design and deploy production-ready software. It will help you serve your clients better, and it will help you and your teams avoid many sleepless nights.


Testing—even agile, pragmatic, automated testing—is not enough to prove that software is ready for the real world. You will need to accept the fact that despite your best-laid plans, bad things will still happen.

Building software fast that is cheap to build, good for users, and cheap to operate demands continually improving architecture and design techniques.

During the rush of a development project, you can easily make decisions that optimize development costs at the expense of operational costs. Systems spend much more of their life in operation than in development. Design and architecture decisions are also financial decisions.

The beginning is when your team is most ignorant of the eventual structure of the software, yet that’s when some of the most irrevocable decisions must be made.

An architect who doesn’t bother to listen to coders on the team doesn’t bother listening to users either. Avoid an ivory-tower architect. The ivory-tower architect most enjoys an end-state vision of ringing crystal perfection, but the pragmatic architect constantly thinks about the dynamics of change.

In any incident, restoring service takes precedence over the investigation. The trick to restoring service is figuring out what to target.

Managing perception after a major incident can be as important as managing the incident itself.

Bugs will happen. They cannot be eliminated, so they must be survived instead. A good question to ask is “How do we prevent bugs in one system from affecting everything else?”

A robust system keeps processing transactions, even when transient impulses, persistent stresses, or component failures disrupt normal processing. The user can still get work done.

The major dangers to your system’s longevity are memory leaks and data growth. Both are rarely caught during testing. The trouble is that applications never run long enough in a development environment to reveal their longevity bugs.

Once you accept that failures will happen, you have the ability to design your system’s reaction to specific failures.

The more tightly coupled the architecture, the greater the chance a coding error can propagate. A failure in one point or layer actually increases the probability of other failures.

Triggering a fault opens the crack. Faults become errors, and errors provoke failures. That’s how the cracks propagate. Assume the worst. Faults will happen. We need to examine what happens after the fault creeps in.

Integration points are the number-one killer of systems. Just as integration points are the number-one source of cracks, cascading failures are the number-one crack accelerator. The most effective patterns to combat cascading failures are Circuit Breaker and Timeouts.
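
To make the Circuit Breaker idea concrete, here is a minimal sketch in Python. It is not code from the book: the class name, the thresholds, and the injectable `clock` parameter are all assumptions chosen for illustration. After a configurable number of consecutive failures the breaker "opens" and rejects calls immediately; once the reset timeout has passed it lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    # Hypothetical defaults; tune them for the integration point you protect.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound call, for example `breaker.call(fetch_inventory, sku)`, and treat `CircuitOpenError` as a signal to fall back or degrade gracefully. (`fetch_inventory` is a made-up name.)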

Every architecture diagram ever drawn has boxes and arrows. A new architect will focus on the boxes; an experienced one is more interested in the arrows.

Not every problem can be solved at the level of abstraction where it manifests.

Horizontal scaling means we add capacity by adding more servers. The alternative, vertical scaling, means building bigger and bigger servers.

The best thing you can do about expensive users is test aggressively.

Blocked threads can happen anytime you check resources out of a connection pool, deal with caches or object registries, or make calls to external systems.

Finding blocked-thread bugs during development is very hard. The problem has four parts:

  • Error conditions and exceptions create too many permutations to test exhaustively.
  • Unexpected interactions can introduce problems in previously safe code.
  • Timing is crucial. The probability that the app will hang goes up with the number of concurrent requests.
  • Developers never hit their application with 10,000 concurrent requests.

If you find yourself synchronizing methods on your domain objects, you should probably rethink the design. Find a way that each thread can get its own copy of the object in question. One elegant way to avoid synchronization on domain objects is to make them immutable. Look into Command Query Responsibility Segregation (CQRS). It avoids a large number of concurrency issues.
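
As a small illustration of immutable domain objects, here is a sketch using a frozen dataclass. The `Price` type and its fields are hypothetical. "Changing" a value returns a new object, so threads can share instances without locks.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Price:
    # Hypothetical domain object; frozen=True forbids attribute assignment.
    amount_cents: int
    currency: str

    def with_discount(self, percent: int) -> "Price":
        # Return a new Price instead of mutating this one.
        discounted = self.amount_cents * (100 - percent) // 100
        return replace(self, amount_cents=discounted)
```

Any attempt to assign to a field, such as `price.amount_cents = 1`, raises `FrozenInstanceError`.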

A method without side effects in a base class should also be free of side effects in derived classes. A method that throws the exception E in base classes should throw only exceptions of type E (or subtypes of E) in derived classes. Otherwise, there is a violation of the Liskov substitution principle.

Use caching carefully. The maximum memory usage of all application-level caches should be configurable. You need to monitor hit rates for the cached items. Caches should be built using weak references. Every cache should have an invalidation strategy to manage stale data.
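
Here is a rough sketch of a cache with a configurable size limit, hit-rate counters, and explicit invalidation. Note the substitution: it bounds the cache by entry count with least-recently-used eviction rather than using weak references, and it caps entries, not bytes. The class and method names are assumptions for illustration.

```python
from collections import OrderedDict


class BoundedCache:
    def __init__(self, max_entries=1024):
        self.max_entries = max_entries  # configurable upper bound
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return default

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

    def invalidate(self, key):
        # Drop a stale entry; a missing key is not an error.
        self._data.pop(key, None)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Exporting `hit_rate()` to your monitoring system shows whether the cache is earning the memory it uses.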

Infrastructure management tools can make very large impacts quickly. Build limiters and safeguards into them so they won’t destroy your whole system at once.

Design with skepticism, and you will achieve resilience.

The Timeouts pattern is useful when you need to protect your system from someone else’s failure. Fail Fast is useful when you need to report why you won’t be able to process some transaction. Fail Fast applies to incoming requests, whereas the Timeouts pattern applies primarily to outbound requests.
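
The two patterns can be sketched side by side. This is only an illustration: `handle_order`, `RejectedRequest`, and `call_with_timeout` are made-up names. The thread-pool approach is just one way to bound a wait in Python, and it stops waiting without cancelling the underlying work.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class RejectedRequest(Exception):
    """Fail Fast: tell the caller why the request will not be processed."""


def handle_order(order, db_available):
    # Inbound request: check preconditions before doing any expensive work.
    if not db_available:
        raise RejectedRequest("database unavailable")
    if not order.get("sku") or order.get("quantity", 0) <= 0:
        raise RejectedRequest("order needs a sku and a positive quantity")
    return f"accepted {order['quantity']} x {order['sku']}"


def call_with_timeout(func, timeout_s, *args):
    # Outbound call: wait at most timeout_s seconds for the result.
    # The worker thread keeps running after a timeout; we only stop waiting.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)  # raises FutureTimeout if too slow
    finally:
        pool.shutdown(wait=False)
```

Where a client library has its own timeout parameter, prefer that over a wrapper like this one.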

It’s best to keep people off production systems to the greatest extent possible. The system should be able to run at least one release cycle without human intervention. The Steady State pattern says that for every mechanism that accumulates a resource, some other mechanisms must recycle that resource.

Designing asynchronous processes is inherently harder. The move from synchronous request/reply to asynchronous communication necessitates very different design. That makes the switching cost something to consider.

Every performance problem starts with a queue backing up somewhere. If a queue is unbounded, it can consume all available memory.
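
A bounded queue is easy to sketch with Python's standard library. The `try_enqueue` helper is a hypothetical name. What to do when the queue is full (reject, shed load, or apply back pressure) is a design decision this example leaves to the caller.

```python
import queue


def try_enqueue(q, item):
    """Add item without blocking; return False if the queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


# maxsize bounds memory use; the value 2 is only for illustration.
work_queue = queue.Queue(maxsize=2)
```

Here the third call to `try_enqueue(work_queue, ...)` returns `False` instead of letting the backlog grow without limit.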

We should use automation for things humans are bad at: repetitive tasks and fast responses. We should use humans for what automation is bad at: perceiving the whole situation at a higher level.

The whole point of a governor is to slow things down enough for humans to get involved.

The ability to restart components, instead of entire servers, is a key concept of recovery-oriented computing (ROC).

A machine uses its own hostname to identify the whole machine, while a DNS name identifies an IP address.

When designing web applications to run in VMs, you must make sure that they’re not sensitive to the loss or slowdown of any one host.

Don’t trust the OS clock. If external, human time is important, use an external source like a local NTP server.

Developers should not do production builds from their own machines. Only make production builds on a CI server, and have it put the binary into a safe repository that nobody else can write into.

Immutable infrastructure: machines don’t change once they’ve been deployed.

When making technical or architectural changes, you are totally dependent on data collected from the existing infrastructure. Good data enables good decision-making. A system without transparency cannot survive long in production. Transparency arises from deliberate design and architecture.

Load balancing plays a part in availability, resilience, and scaling. Health checks are a vital part of load balancer configuration.

The world can crush our systems at any time. There are two basic strategies to protect ourselves: either refuse work or scale-out.

There’s a relationship between the number of sockets available and the number of requests per second your service can handle. That relationship depends on the duration of the requests.
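
One idealized way to express that relationship is throughput = sockets / request duration. This is a back-of-the-envelope model and an assumption, not a formula quoted from the book: it treats each request as holding one socket for its full duration and ignores effects such as sockets lingering after close.

```python
def max_requests_per_second(sockets, request_duration_s):
    """Idealized upper bound: each socket serves one request at a time."""
    if request_duration_s <= 0:
        raise ValueError("request_duration_s must be positive")
    return sockets / request_duration_s
```

With 1,000 sockets and requests lasting 0.25 seconds, the bound is 4,000 requests per second. If requests slow to 1 second, the same sockets support only 1,000.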

If production user data passes through it, it’s production software. If its main job is to manage other software, it’s the control plane.

The more sophisticated your control plane becomes, the more it costs to implement and operate. Always keep the operating cost in mind.

Every postmortem review has three important jobs to do:

  • Explain what happened.
  • Apologize.
  • Commit to improvement.

A monitoring team doesn’t do the monitoring. It provides the ability for others to do their own monitoring.

The best way to tell if users are receiving a good experience is to measure it directly. This is known as real-user monitoring (RUM).

See monitoring, log collection, alerting, and dashboarding as being about economic value more than technical availability.

An admin API over HTTP needs to be on a different port than ordinary traffic. It should not be available to the general public!

“Injection” is an attack on a parser or interpreter that relies on user-supplied input.
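
SQL injection is the classic case, and parameterized queries are the usual defense. The sketch below uses an in-memory SQLite database; the `users` table and `find_user` function are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")


def find_user(conn, name):
    # The ? placeholder sends `name` as data, so the SQL parser
    # never interprets user input as part of the statement.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

A hostile input such as `alice' OR '1'='1` is treated as a literal name and matches nothing.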

If your session IDs are generated by any kind of predictable process, then your service may also be vulnerable to a “session prediction” attack.
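
In Python, one way to avoid predictable IDs is the standard `secrets` module, which draws from the operating system's cryptographically secure random source. The function name and the 32-byte length are choices made for this sketch.

```python
import secrets


def new_session_id():
    # 32 random bytes, encoded as URL-safe text.
    return secrets.token_urlsafe(32)
```

Avoid building session IDs from counters, timestamps, or the `random` module, all of which an attacker may be able to guess.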

Don’t trust calls based on their originating IP addresses, because those can be faked.

Cross-site scripting (XSS) happens when a service renders a user’s input directly into HTML without applying input escaping. Never trust input. Don’t build structured data by smashing strings together.
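
As a minimal sketch of escaping, Python's standard `html.escape` converts the characters that HTML treats as markup. The `render_comment` function is hypothetical; in practice a templating engine with auto-escaping is a safer default than escaping by hand.

```python
import html


def render_comment(user_text):
    # Escape <, >, &, and quotes so the browser shows the text
    # instead of executing it.
    return "<p>" + html.escape(user_text) + "</p>"
```

A comment containing `<script>alert(1)</script>` comes out as `<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>`, which displays as text and does not run.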

If a caller is not authorized to see the contents of a resource, it should be as if the resource doesn’t even exist.
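
In HTTP terms this often means answering with 404 instead of 403. Here is a toy sketch, where the `DOCUMENTS` store and the ownership rule are invented for the example:

```python
# Hypothetical in-memory store: document id -> owner and body.
DOCUMENTS = {"doc-1": {"owner": "alice", "body": "quarterly plan"}}


def get_document(doc_id, caller):
    doc = DOCUMENTS.get(doc_id)
    # Same response whether the document is missing or merely forbidden,
    # so an unauthorized caller cannot probe for which IDs exist.
    if doc is None or doc["owner"] != caller:
        return 404, None
    return 200, doc["body"]
```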

Never allow a default password on production. Reduce the surface of possible attacks. Make sure every administrator uses a personal account, not a group account.

The principle of “least privilege” mandates that a process should have the lowest level of privilege needed to accomplish its task. Anything application services need to do, they should do as non-admin users.

Immutable infrastructure is for cattle, convergence is for pets.

The idea of continuous deployment is to minimize the liability of undeployed code. Run the full build pipeline on every commit.

Don’t forget to test on a realistic sample of data, ideally copies of real production data.

Static assets should always have far-future cache expiration headers.

We can only add constraints after the rollout. That’s because the old application version wouldn’t know how to satisfy them.

We must design our software to be deployable, just as we design software for production. Zero downtime is the objective.

Once a service is public, a new version cannot reject requests that would’ve been accepted before. Anything else is a breaking change.

Exercise your own API with inbound testing, and the APIs you consume with outbound testing. Inbound testing exercises your API to make sure it does what you think it does. Outbound testing exercises your dependencies to make sure they act the way you think they do.

Contract testing approach: testing how well our code adheres to the contract.

Counting concurrent users is a misleading way of judging the capacity of the system.

Keep accelerating and you’ll soon be able to run your decision loop faster than your competitors. That’s when you force them to react to you. That’s when you’ve gotten “inside their decision loop”.

The platform team must not be held accountable for application availability. That must be on the application teams. Instead, the platform team must be measured on the availability of the platform itself. It needs a customer-focused orientation where customers are the application developers.

Having a group called the DevOps team is an antipattern.

The two-pizza team is about reducing external dependencies. It’s about having a small group that can be self-sufficient and pushes things all the way through to production. Getting down to this team size requires a lot of tooling and infrastructure support.

Efficiency can come at the cost of flexibility. More efficiency means more specialization in today’s tasks. That can make it harder to change for the future.

Microservices are a technology solution to an organizational problem.

When you’re embedded in a paradigm, it’s hard to see its limits. Each paradigm defines what you can and cannot express. None of them are the whole reality, but each of them can represent some knowledge about reality.

Your job in building systems is to decide what facets of reality matter to your system, how you are going to represent those, and how that representation can survive over time.

Think of using URL dualism to break a lot of dependencies that otherwise seem impossible. Encrypt URLs that you send out to users to verify that whatever you receive back is something you generated.
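
One way to check that a returned URL is one you generated is to attach a keyed signature. Note that the sketch below signs the URL with HMAC instead of encrypting it, which is a substitution: it detects tampering but does not hide the URL's contents. The key, the `sig` parameter name, and both function names are assumptions.

```python
import hashlib
import hmac

# Placeholder key for illustration; load a real secret from configuration.
SECRET_KEY = b"replace-with-a-real-secret"


def _signature(url, key):
    return hmac.new(key, url.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_url(url, key=SECRET_KEY):
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}sig={_signature(url, key)}"


def verify_url(signed_url, key=SECRET_KEY):
    base, found, sig = signed_url.rpartition("sig=")
    if not found or not base or base[-1] not in "?&":
        return False  # no signature parameter at the end of the URL
    url = base[:-1]  # strip the separator that preceded "sig="
    expected = _signature(url, key)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8"))
```

If any part of a signed URL is altered, `verify_url` returns `False`.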

Last Updated

July 14th, 2022