In order to know when to scale up or down, one needs information about the current demand on the cloud. In other words, one needs to measure things like CPU, memory, and network bandwidth usage to make sure cloud consumers never run out of those resources. The types of resources to measure depend in part on the types of services that the cloud system offers.
Service Level Agreements
One of the advantages of cloud computing is that the consumer no longer has the burden of making sure capacity is adequate for fulfilling demand. The downside is that the consumer runs the risk that the provider doesn’t have enough capacity when the consumer needs it. To mitigate that risk, consumers sign up for Service Level Agreements (SLAs) that guarantee them enough capacity. An SLA should contain:
- The list of services the provider will deliver and a complete definition of each service
- Metrics to determine whether the provider is delivering the service as promised and an auditing mechanism to monitor the service. The metrics measure so-called Quality of Service (QoS) attributes, like response time. Service Level Objectives (SLOs) define acceptable values for the QoS attributes
- Responsibilities of the provider and the consumer and remedies available to both if the terms of the SLA are not met
- A description of how the SLA will change over time
To prove that certain QoS attributes are met, it may be necessary to keep an audit trail of performed operations. An audit trail logs each operation, along with information like who performed it and when, how long it took, etc.
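As a minimal sketch of such an audit trail, the Python below wraps an operation so that each call records who performed it, when, how long it took, and whether it succeeded. The names `AuditRecord`, `audited`, `store_object`, and `slo_met` are hypothetical and chosen only for illustration; a real system would write to durable, tamper-evident storage rather than an in-memory list.

```python
import time
from dataclasses import dataclass
from functools import wraps


@dataclass
class AuditRecord:
    operation: str      # name of the operation performed
    user: str           # who performed it
    started_at: float   # when it started (seconds since the epoch)
    duration_s: float   # how long it took
    succeeded: bool     # whether it completed without raising


# In-memory trail for illustration only (assumption: a real trail is persisted).
AUDIT_TRAIL = []


def audited(func):
    """Append an AuditRecord for every call; the first argument is the user."""
    @wraps(func)
    def wrapper(user, *args, **kwargs):
        started_at = time.time()
        t0 = time.perf_counter()
        succeeded = False
        try:
            result = func(user, *args, **kwargs)
            succeeded = True
            return result
        finally:
            AUDIT_TRAIL.append(AuditRecord(
                func.__name__, user, started_at,
                time.perf_counter() - t0, succeeded))
    return wrapper


@audited
def store_object(user, key, value):
    # Hypothetical cloud operation; it only builds an identifier here.
    return f"{user}:{key}"


def slo_met(records, max_duration_s):
    """Check a response-time SLO: every operation finished within the limit."""
    return all(r.duration_s <= max_duration_s for r in records)
```

Calling `store_object("alice", "report", b"...")` adds one record to `AUDIT_TRAIL`, and `slo_met(AUDIT_TRAIL, max_duration_s=0.5)` then checks the recorded durations against an assumed 500 ms objective.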
The metrics that are automatically gathered to prove that the SLA is met can also be used for billing.
One of the most important things to settle in an SLA is availability. This is usually expressed in a number of nines, e.g. five nines stands for 99.999% uptime. Although hardware is usually very robust, everything breaks at scale. And software has always been more prone to failure than hardware. Therefore, failure of components has to be incorporated into the design. Some clouds, like Azure and Atmos, automatically maintain three (or more) copies of data to ensure that at least one copy is available to satisfy a request. This is called (data) replication.
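To make the "number of nines" concrete, the sketch below converts it into an availability fraction and an allowed downtime per year. It also estimates the chance that at least one of several copies is available. That last function assumes copies fail independently, which is a simplification; the function names and the 365-day year are likewise assumptions for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def availability_from_nines(nines):
    """Five nines -> 0.99999, three nines -> 0.999, and so on."""
    return 1 - 10 ** (-nines)


def allowed_downtime_seconds_per_year(nines):
    """Downtime budget per year implied by the given number of nines."""
    return SECONDS_PER_YEAR * 10 ** (-nines)


def at_least_one_copy_available(copy_availability, copies):
    """Probability that at least one replica is up.

    Assumption: replicas fail independently of each other.
    """
    return 1 - (1 - copy_availability) ** copies
```

Under these assumptions, five nines leaves a budget of roughly 315 seconds (a little over five minutes) of downtime per year, and three independent copies that are each available 99% of the time give about 99.9999% availability for the data.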
In replication, a logical variable x that can be read and written actually consists of a set of physical variables x0, …, xn and an associated protocol that makes sure that reads and writes to the replicas are indistinguishable from reads and writes to the original variable. Problems can arise, however, when multiple updates are performed concurrently (an edit conflict), or when a node fails during an update.
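The following sketch illustrates the idea with a deliberately naive protocol: write to all replicas in order, read from any one. The class `ReplicatedVariable` and the `crash_after` parameter are hypothetical; `crash_after` simulates the writer failing partway through an update, which leaves the replicas out of step.

```python
class ReplicatedVariable:
    """Logical variable x backed by physical replicas x0..xn.

    Naive write-all / read-any protocol, for illustration only.
    """

    def __init__(self, n, initial=None):
        self.replicas = [initial] * n

    def write(self, value, crash_after=None):
        """Update replicas in order.

        crash_after (hypothetical knob) simulates a failure after that many
        replicas were updated. Returns True only if all were updated.
        """
        for i in range(len(self.replicas)):
            if crash_after is not None and i >= crash_after:
                return False  # failed mid-update: replicas now diverge
            self.replicas[i] = value
        return True

    def read(self, index=0):
        """Read from a single replica, as a client would."""
        return self.replicas[index]

    def is_consistent(self):
        """True if every replica holds the same value."""
        return all(v == self.replicas[0] for v in self.replicas)
```

After `x = ReplicatedVariable(3, initial=0)` and `x.write(2, crash_after=1)`, a read from replica 0 and a read from replica 2 return different values, so the replicas no longer behave like a single variable. Real protocols add ordering, acknowledgements, or versioning to rule this out.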
There are three major types of data replication protocols:
- Transactional replication maintains replication within the boundaries of a single transaction. Replicas are persistently stored on disk. This is the most common form for distributed databases, but it has serious drawbacks. It does, however, guarantee consistency, one of the ACID properties. This form of replication is also called active or pessimistic replication.
- Virtual synchrony is an inter-process message passing technology that guarantees that messages are delivered to all nodes in the order they were sent. Sometimes the order is violated to improve performance, but only if that has no impact on nodes. The messages are used to update the replicas, which are assumed to be in memory. Virtual synchrony performs the best of the three models, at the expense of weaker fault-tolerance. Virtual synchrony has other uses than just replication: it also supports event notification, locking, fault-tolerance, and other peer-to-peer mechanisms. Closely related techniques are used in, among others, Chubby and ZooKeeper.
- State machine consensus / Paxos is a way of achieving consensus among a group of distributed servers that guarantees fault-tolerance. State machine consensus lies between transactional replication and virtual synchrony: variables are assumed to be persisted to disk, but full ACID properties are not guaranteed. Virtually Synchronous Paxos is a hybrid between virtual synchrony and Paxos that makes the virtual synchrony approach more fault-tolerant.
The last two approaches are sometimes called passive, lazy, or optimistic replication. They don’t guarantee consistency, so they can’t be applied in a system that must have the ACID properties. However, the system should exhibit eventual consistency, which means that the update is guaranteed to reach all replicas eventually. Since that can take a while, conflicts may arise that should be resolved in one of the following ways:
- Read repair
The correction is done when a read finds an inconsistency. This slows down the read operation.
- Write repair
The correction is done during a write operation if an inconsistency is found, which slows down the write operation.
- Asynchronous repair
The correction is not part of a read or write operation.
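Read repair, the first of the three strategies above, can be sketched as follows. The sketch assumes each value carries a version number and that the highest version wins; that rule, the class name `VersionedStore`, and the `reachable` parameter are assumptions for illustration, not something the strategies themselves prescribe.

```python
class VersionedStore:
    """Key-value store with n replicas; each replica maps key -> (version, value)."""

    def __init__(self, n):
        self.replicas = [{} for _ in range(n)]

    def write(self, key, value, version, reachable):
        """Write only to the replicas listed in `reachable` (indexes).

        Leaving a replica out simulates an update that has not reached it yet.
        """
        for i in reachable:
            self.replicas[i][key] = (version, value)

    def read_with_repair(self, key):
        """Return the newest value and fix stale replicas along the way."""
        copies = [r[key] for r in self.replicas if key in r]
        if not copies:
            raise KeyError(key)
        # Assumption: the copy with the highest version number is correct.
        newest = max(copies, key=lambda copy: copy[0])
        for replica in self.replicas:
            if replica.get(key) != newest:
                replica[key] = newest  # the repair: extra work on the read path
        return newest[1]
```

The extra comparison and write-back inside `read_with_repair` is the cost mentioned above: the read does more work so that later reads see consistent replicas. An asynchronous repair would run the same comparison in a background task instead.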
Another distinction in replication is between primary-backup and multi-primary schemes. In a primary-backup architecture, one of the replicas is designated as the primary, and all updates go through this replica. The other replicas improve read access only. If any replica can be updated and its changes spread to the others, we have a multi-primary scheme. This has obvious performance benefits for updates, but requires a mechanism for preventing and/or resolving update conflicts. Synchronous solutions usually employ conflict prevention, while asynchronous systems must use conflict resolution. In the synchronous approach, an update only completes when both the local and remote sites are updated. If the remote site goes down (or even just the connection between the local and remote sites), no local updates are possible. At Internet scale, where components are expected to fail, this is usually not an acceptable strategy.
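The difference between the synchronous and asynchronous approaches can be sketched with a single primary and a single backup. All names here (`Primary`, `Backup`, `reachable`, `flush`) are hypothetical, and the sketch ignores conflict handling entirely; it only shows why a synchronous update fails when the remote site is unreachable.

```python
class Backup:
    """Remote replica; `reachable` simulates the site or link being up."""

    def __init__(self):
        self.data = {}
        self.reachable = True

    def apply(self, key, value):
        if not self.reachable:
            raise ConnectionError("backup unreachable")
        self.data[key] = value


class Primary:
    """Local replica through which all updates go."""

    def __init__(self, backup, synchronous):
        self.data = {}
        self.backup = backup
        self.synchronous = synchronous
        self.pending = []  # updates not yet shipped (asynchronous mode only)

    def update(self, key, value):
        if self.synchronous:
            # Completes only when both sites are updated; raises if the
            # backup is down, so no local update happens either.
            self.backup.apply(key, value)
            self.data[key] = value
        else:
            # Completes locally right away; the backup catches up later.
            self.data[key] = value
            self.pending.append((key, value))

    def flush(self):
        """Ship queued updates to the backup while it is reachable."""
        while self.pending and self.backup.reachable:
            key, value = self.pending.pop(0)
            self.backup.apply(key, value)
```

With `synchronous=True` and an unreachable backup, `update` raises `ConnectionError` and the primary's data stays unchanged. With `synchronous=False`, the update succeeds locally and is delivered by a later `flush`, at the cost of the backup being temporarily stale.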