Following a recommendation to move traffic distribution to a purpose-built appliance for performance and stability gains, we made a number of observations when testing the implementation. Below are some of those observations, together with points to consider in future implementations.
- Using TCP least connections per host. If a set of hosts runs a mix of different services, a failure of one service can cause an overload of a different service. Suppose each host runs services A, B and C, and there are multiple instances of that host (1, 2, 3, 4). If a service goes down, e.g. service A on host 2, TCP connections are rebalanced between the nodes: because host 2 now holds fewer connections, services B and C on host 2 start receiving more traffic. Depending on the number of instances, B and C on node 2 could end up receiving all of the new traffic. This may not be a problem, but it should be considered when testing B and C. One option is to balance on least connections per service rather than per host (see the first sketch after this list).
- Using DNS for site failover. If a set of services, A, B, C, all use the same DNS name, then a site outage for one service may swing traffic for all services over to the second site. Either split out a DNS name per service, or allow cross-site traffic in failure scenarios (second sketch below).
- Using a single service to detect site availability. If that service goes down it can impact traffic for all other services, and in the case of a total outage of that detection service it can stop traffic for all services, even if they are up and running (third sketch below).
- Custom header mapping. Make sure headers are set and passed correctly to downstream services. There is potential for malformed or manipulated headers to bypass security controls if client-supplied values are trusted (fourth sketch below).
- Failure detection. The time taken to detect and act on a failure varies with the method of detection and the action taken; factors such as retry policies and DNS TTL can lead to longer than expected times before action is taken. It is worth walking through sequenced scenarios, such as service failure, node failure, site failure and total failure, followed by their resumption, to determine whether the time to detect is acceptable, and to repeat those failure tests under different load scenarios. Consumer retry policies may also be a consideration (a worked timing example is included below).
- Client certificates. Similar to header mapping, make sure that extraction, validation or pass-through of the certificate DN cannot be bypassed or manipulated (final sketch below).
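
On the least-connections point, the following minimal sketch (all numbers, hosts and service names are illustrative, not taken from our environment) shows why balancing on total connections per host pushes new traffic for every service towards the host where one service has failed:

```python
from collections import defaultdict

HOSTS = [1, 2, 3, 4]
SERVICES = ["A", "B", "C"]

# Start from an even spread: 100 connections per service per host.
conns = {(h, s): 100 for h in HOSTS for s in SERVICES}

# Service A fails on host 2 and its connections drop away.
conns[(2, "A")] = 0

def host_total(h):
    return sum(conns[(h, s)] for s in SERVICES)

# Route 300 new connections for services B and C using least connections
# per *host*: each new connection goes to the host with the fewest in total.
new_per_host = defaultdict(int)
for i in range(300):
    service = "B" if i % 2 == 0 else "C"
    target = min(HOSTS, key=host_total)
    conns[(target, service)] += 1
    new_per_host[target] += 1

print("new B/C connections per host:", dict(new_per_host))
# Host 2 absorbs roughly the first 100 new connections on its own, because
# losing service A made it look the least loaded. Balancing per service
# pool instead would spread B and C traffic evenly across all four hosts.
```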
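
For the DNS site-failover point, this sketch (hypothetical service and site names) contrasts one shared DNS name with a name per service; with the shared name, a single failed service swings every service to the second site:

```python
service_health = {
    ("site1", "A"): False,  # only service A has an outage at site1
    ("site1", "B"): True,
    ("site1", "C"): True,
    ("site2", "A"): True,
    ("site2", "B"): True,
    ("site2", "C"): True,
}

def resolve_shared():
    # One DNS name for everything: fail over if *any* service is down at site1.
    site = "site1" if all(service_health[("site1", s)] for s in "ABC") else "site2"
    return f"services.example.internal -> {site}"

def resolve_per_service(service):
    # One DNS name per service: only the failed service moves.
    site = "site1" if service_health[("site1", service)] else "site2"
    return f"{service.lower()}.example.internal -> {site}"

print(resolve_shared())            # everything swings to site2
for s in "ABC":
    print(resolve_per_service(s))  # only A moves; B and C stay on site1
```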
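
For the single availability-detection service, a small sketch (hypothetical names) of the difference between gating a whole site on one sentinel service and gating each service on its own health:

```python
# One "sentinel" probe service is down, but A, B and C are all still serving.
site_services = {"sentinel": False, "A": True, "B": True, "C": True}

def site_up_via_sentinel(services):
    # The whole site is declared down because the probe service is down.
    return services["sentinel"]

def site_up_per_service(services, service):
    # Each service's traffic follows its own health instead.
    return services[service]

print("sentinel view of site:", site_up_via_sentinel(site_services))
print("per-service view:", {s: site_up_per_service(site_services, s) for s in "ABC"})
```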
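
For custom header mapping, a minimal sketch (hypothetical header names, not the appliance's actual configuration syntax) of the usual mitigation: strip any client-supplied copy of a trusted header before setting it from data verified on this hop:

```python
TRUSTED_HEADERS = {"x-authenticated-user", "x-client-dn"}

def forward_headers(inbound_headers, verified_identity):
    """Build the header set passed to the downstream service."""
    # Drop any inbound value for headers the appliance is responsible for,
    # regardless of case tricks like "X-AUTHENTICATED-USER".
    cleaned = {k: v for k, v in inbound_headers.items()
               if k.lower() not in TRUSTED_HEADERS}
    # Then set them only from data verified on this hop.
    cleaned["X-Authenticated-User"] = verified_identity
    return cleaned

spoofed = {"Accept": "application/json", "x-authenticated-user": "admin"}
print(forward_headers(spoofed, verified_identity="alice"))
# {'Accept': 'application/json', 'X-Authenticated-User': 'alice'}
```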
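
For failure detection, a worked example (all durations are illustrative assumptions) of adding up the contributions to the time between a failure occurring and traffic actually moving:

```python
probe_interval_s  = 10   # how often the appliance health-checks the service
failure_threshold = 3    # consecutive failed probes before marking it down
dns_ttl_s         = 60   # clients may cache the old DNS answer for this long
client_retry_s    = 30   # consumer retry/back-off before re-resolving

detect_s     = probe_interval_s * failure_threshold
worst_case_s = detect_s + dns_ttl_s + client_retry_s
print(f"detection takes ~{detect_s}s; worst case before traffic moves ~{worst_case_s}s")
```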
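
And for client certificates, a sketch (hypothetical names) of the same strip-then-set principle applied to the certificate DN, so the value forwarded downstream can only come from the certificate the appliance itself verified:

```python
def build_dn_header(inbound_headers, verified_peer_cert):
    """Return the X-Client-DN value to forward downstream, or None to reject."""
    # Ignore any DN the caller tried to inject in a header.
    inbound_headers.pop("X-Client-DN", None)
    if verified_peer_cert is None:
        return None  # no (or invalid) client certificate: do not forward a DN
    # Use only the subject of the certificate the TLS layer actually validated.
    return verified_peer_cert["subject_dn"]

cert = {"subject_dn": "CN=order-service,OU=apps,O=example"}
print(build_dn_header({"X-Client-DN": "CN=admin"}, cert))  # the verified DN
print(build_dn_header({"X-Client-DN": "CN=admin"}, None))  # None -> reject upstream
```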
While these particular points may not apply to your own implementation, we hope our past experience provides some food for thought.