Microservice Sequential Processing

Users of a new browser-based data management service were experiencing response times of over 20 seconds for the initial landing page, with some requests timing out after 60 seconds. We worked with the team to analyse the problem quickly and provided the information needed to fix the issue.

The service was accessed through a browser and the issue was replicated quickly and easily with a single user session. The network output from the browser developer tools showed that most of the time was spent waiting for the first buffer of the response to a particular XHR API call. We were given access to the node hosting the API and set up monitoring for the JVM running the microservice. Neither resource metrics nor application logs highlighted any immediate issues with the overall health of the service, so we profiled the activity of the service in more detail.

System calls made by the API while servicing an inbound request were captured using perf trace. The summary output from perf trace showed a high number of poll system calls per request, suggesting that time was being spent waiting for events on file descriptors.

$ perf trace -s -p <pid>

  syscall            calls errors  total       min       avg       max   stddev
                                   (msec)    (msec)    (msec)    (msec)    (%)
  --------------- -------- ------ -------- --------- --------- --------- ------
  futex                278     5 16061.328     0.000    57.775 15883.006 98.89%
  poll                1650     0  9115.090     0.002     5.524   317.277  4.42%
  read                6805  1594  1638.737     0.002     0.241    46.596  8.22%
  write               3332     0   143.566     0.007     0.043     1.775  2.13%
  ioctl               2092     0    17.152     0.002     0.008     0.267  2.33%

Using strace we were able to break down the poll system calls in more detail. The data showed that they were network I/O to another microservice. Requests were processed in two stages: the first stage made a single call to an external API for a large block of data, and the second stage made a separate call to the same external API for each individual item contained in that block. The problem was quickly identified: each individual item was being requested sequentially.
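
To make the pattern concrete, the sketch below mirrors the shape of what strace revealed. The class name, endpoints and parsing are hypothetical stand-ins rather than the service's actual code; the point is simply that stage one fetches the block of items and stage two then issues one blocking call per item.

  // Illustrative sketch of the access pattern we observed. The class, endpoint
  // URLs and JSON handling are hypothetical stand-ins, not the service's code.
  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.util.ArrayList;
  import java.util.List;

  public class SequentialFetch {

      private static final HttpClient CLIENT = HttpClient.newHttpClient();

      static String get(String url) throws Exception {
          HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
          return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
      }

      public static void main(String[] args) throws Exception {
          // Stage one: a single call returns the block of ~1600 items (~400 ms).
          String block = get("http://external-api/items");
          List<String> itemIds = parseIds(block);

          // Stage two: one call per item, issued one after the other. Each call
          // pays the full ~12 ms round trip before the next starts, so 1600
          // items take roughly 1600 x 12 ms = 19 s.
          List<String> details = new ArrayList<>();
          for (String id : itemIds) {
              details.add(get("http://external-api/items/" + id));
          }
          System.out.println("Fetched " + details.size() + " item details");
      }

      // Placeholder: real code would parse the item ids out of the stage-one payload.
      static List<String> parseIds(String block) {
          return List.of(block.split(","));
      }
  }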

The first request responded in 400 ms with a set of 1,600 items. The second stage then requested data for each item sequentially. With an API-to-API round-trip time of 12 ms, this added up to roughly 19 seconds (1,600 x 12 ms). Adding data transformation and network transfer time on top resulted in a 20-second response time for the user. Furthermore, during the second stage the service issued requests to the external API at around 84 per second (1,600 requests over 19 seconds). If the data management service landing page was requested several times at once, this quickly generated a very high request rate on the external API, consuming its available capacity and resulting in the 60-second request timeouts.

With the root cause identified, the development team quickly changed the data retrieval mechanism, replacing the sequential per-item calls with a single fetch. This reduced the response time for the whole page to 500 ms, well within the tolerance for this application.
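
A minimal sketch of the revised pattern is shown below, assuming the external API can return the per-item details in a single batched response; the endpoint and query parameter are hypothetical.

  // Minimal sketch of the revised retrieval, assuming the external API can
  // return the per-item details in one batched response; the endpoint and
  // query parameter are hypothetical.
  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;

  public class BatchedFetch {

      private static final HttpClient CLIENT = HttpClient.newHttpClient();

      public static void main(String[] args) throws Exception {
          // A single round trip returns the block and the per-item detail
          // together, so the page load is dominated by one ~400 ms call instead
          // of 1600 sequential ~12 ms calls.
          HttpRequest request = HttpRequest
                  .newBuilder(URI.create("http://external-api/items?include=details"))
                  .GET()
                  .build();
          String payload = CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();

          // Real code would then transform 'payload' for the landing page.
          System.out.println("Received " + payload.length() + " bytes");
      }
  }

Collapsing the 1,600 serialised round trips into one request also removed the burst of roughly 84 requests per second on the external API, which was what had been exhausting its capacity.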