Data for a new microservice showed that throughput was constrained and that the service was showing signs of an off-CPU bottleneck. This post summarises the approach taken to investigate the problem from a Linux perspective and shows how system calls can be analysed to identify where time is being spent.
Strategic observability and tracing toolsets had not yet been implemented, so we were given direct access to the hosts to allow local observability tools to be installed. The service was running on an AWS EKS node (c6g.large) as a single container (500m CPU, 500Mi memory) with the JVM max heap configured to 85% of the container memory allocation. A single load generator rated at up to 9000 req/sec was used for the testing, connected via a private link to a VPC endpoint service. All data transfer and packet rates were well within the limits of the AWS stress testing policy.
Throughput
The table below summarises some of the metrics produced by a series of load tests against two endpoints provided by the microservice. Tests were executed against a GET endpoint, followed by tests against a POST endpoint, with a mean client pacing of 1.5 ms. The results show that the GET endpoint is constrained at 287 req/sec with the CPU allocation 99% utilised. The POST endpoint is constrained at 90 req/sec with CPU utilisation much lower at 37%, suggesting the POST endpoint bottleneck is off-CPU.
Metric GET Test POST Test
-------------------------------------- --------- ---------
Active Clients 10 10
Throughput OK (req/sec) 287 90
Throughput KO (req/sec) 0 0
Response Time Min. (msec) 4 22
Response Time Mean (msec) 32 109
Container Count 1 1
Container CPU (% allocation) 99 37
Container Memory RSS (% allocation) 68 69
Container Network RX (KB/s) 297 279
Container Network TX (KB/s) 444 234
Logged Errors 0 0
Threads
Running ps on the EKS host, we can find the process ID of the Java application to investigate. An alternative way of doing this using ctr can be found here; a minimal example is also sketched after the output below.
$ ps -afe | grep java
185 4976 2966 0 Jan10 ? 00:47:41 java -Djava.util.logging.manager=org.jboss.logmanager.LogManager -jar -quarkus-run.jar
root 4165123 4164704 0 07:56 pts/0 00:00:00 grep --color=auto java
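As a minimal alternative sketch, assuming the node runs containerd with the k8s.io namespace (the default on recent EKS AMIs), ctr can list the container tasks directly; the PID column gives the host process ID of each container task.
$ ctr -n k8s.io tasks ls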
Using top we can see details about the threads the process is running and their CPU time. The -d flag sets the delay between screen updates, the number of updates is set with -n, and the -b flag runs top in batch mode so that the output can be redirected to a file for offline analysis.
$ top -H -p 4976 -d 300 -n 3 -b > top-output.txt
The output below shows a 5 minute summary while the system was under maximum load during a GET test. %CPU shows each task's share of CPU time since the last update. The java process is using almost all of the CPU allocation for the container (the CPU limit is 500 millicores, 25% of the host CPU) and the most active threads are the executor-thread, VM Thread and vert.x-eventloop threads. The executor and event-loop threads are processing application requests and the VM Thread is most likely performing garbage collection. There are 11 executor threads running, approximately one per active client. NOTE: only threads with %CPU greater than 0 are listed.
top - 11:05:12 up 18:00, 0 users, load average: 1.85, 1.81, 1.49
Threads: 34 total, 10 running, 24 sleeping, 0 stopped, 0 zombie
%Cpu(s): 24.1 us, 3.2 sy, 0.0 ni, 71.6 id, 0.0 wa, 0.0 hi, 1.1 si, 0.0 st
KiB Mem : 3908232 total, 170604 free, 1408528 used, 2329100 buff/cache
KiB Swap: 0 total, 0 free, 0 used, 2322292 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
341304 185 20 0 2779060 249384 19832 R 4.0 6.4 0:26.13 executor-thread
331916 185 20 0 2779060 249384 19832 R 3.9 6.4 0:50.30 executor-thread
342982 185 20 0 2779060 249384 19832 S 3.9 6.4 0:21.12 executor-thread
331645 185 20 0 2779060 249384 19832 S 3.8 6.4 0:49.09 executor-thread
342254 185 20 0 2779060 249384 19832 R 3.8 6.4 0:23.33 executor-thread
345262 185 20 0 2779060 249384 19832 S 3.8 6.4 0:14.70 executor-thread
342255 185 20 0 2779060 249384 19832 R 3.7 6.4 0:23.00 executor-thread
338001 185 20 0 2779060 249384 19832 R 3.6 6.4 0:33.54 executor-thread
331222 185 20 0 2779060 249384 19832 R 3.3 6.4 0:46.87 executor-thread
330952 185 20 0 2779060 249384 19832 R 2.9 6.4 0:47.67 executor-thread
4683 185 20 0 2779060 249384 19832 R 2.8 6.4 0:38.15 VM Thread
347967 185 20 0 2779060 249384 19832 R 2.2 6.4 0:06.50 executor-thread
4773 185 20 0 2779060 249384 19832 R 1.7 6.4 0:34.86 vert.x-eventloo
4772 185 20 0 2779060 249384 19832 R 1.7 6.4 0:34.09 vert.x-eventloo
349398 185 20 0 2779060 249384 19832 R 1.1 6.4 0:03.31 executor-thread
4692 185 20 0 2779060 249384 19832 R 0.3 6.4 0:04.12 Notification Th
The output below shows a 5 minute summary while the system was under maximum load during a POST test. The snapshot shows a different profile from the GET test. As before, the event-loop, executor and VM threads are the most active of the entire set; however, the executor threads are on CPU for significantly less time and one of the vert.x-eventloop threads is on CPU for significantly more time. The total thread count is similar, but the thread IDs of all the executor threads are different, indicating they are created on demand. When the snapshot was taken all threads were in an interruptible sleep state.
top - 12:15:37 up 19:10, 0 users, load average: 0.19, 0.30, 0.68
Threads: 36 total, 0 running, 36 sleeping, 0 stopped, 0 zombie
%Cpu(s): 9.9 us, 3.2 sy, 0.0 ni, 85.2 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
KiB Mem : 3908232 total, 273392 free, 1275780 used, 2359060 buff/cache
KiB Swap: 0 total, 0 free, 0 used, 2455464 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4772 185 20 0 2783732 303640 19832 S 3.6 7.8 1:34.86 vert.x-eventloo
4773 185 20 0 2783732 303640 19832 S 1.7 7.8 1:34.09 vert.x-eventloo
401353 185 20 0 2783732 303640 19832 S 1.3 7.8 0:21.12 executor-thread
401762 185 20 0 2783732 303640 19832 S 1.3 7.8 0:49.09 executor-thread
404179 185 20 0 2783732 303640 19832 S 1.3 7.8 0:23.33 executor-thread
402481 185 20 0 2783732 303640 19832 S 1.2 7.8 0:14.70 executor-thread
403638 185 20 0 2783732 303640 19832 S 1.2 7.8 0:23.00 executor-thread
405708 185 20 0 2783732 303640 19832 S 1.1 7.8 0:33.54 executor-thread
403048 185 20 0 2783732 303640 19832 S 1.1 7.8 0:46.87 executor-thread
401343 185 20 0 2783732 303640 19832 S 1.1 7.8 0:47.67 executor-thread
406398 185 20 0 2783732 303640 19832 S 1.0 7.8 0:38.15 executor-thread
402916 185 20 0 2783732 303640 19832 S 0.8 7.8 0:06.50 executor-thread
411472 185 20 0 2783732 303640 19832 S 0.4 7.8 0:34.86 executor-thread
4683 185 20 0 2783732 303640 19832 S 0.3 7.8 1:34.09 VM Thread
4689 185 20 0 2783732 303640 19832 S 0.1 7.8 0:03.31 C2 CompilerThre
The output from top shows that the process is running a set of threads for the Vert.x reactive application toolkit, which provides a set of modules for building reactive applications and uses Netty for the NIO event loop. Vert.x uses an event-based architecture that runs event-loop threads for non-blocking event dispatching and executor (worker) threads for longer-running blocking requests. It also runs an acceptor thread for handling client connections on a socket and a blocked thread checker for monitoring the event-loop and executor threads. The event-loop thread count defaults to 2 * cores; with the 500 millicore limit the JVM most likely sees a single available processor, hence the two event-loop threads for the above process. The blocked thread checker warns when an event-loop thread executes for more than 2 seconds, or an executor thread for more than 60 seconds. See VertxOptions for more details. top shows that executor (worker) threads have been created, which suggests they are handling long-running or blocking tasks. See the Quarkus dispatching strategy for further information.
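To confirm what a particular thread is doing, the thread ID reported by top can also be cross-referenced against a JVM thread dump; the nid field in jstack output is the native thread ID in hex. A sketch, assuming a JDK with jstack is available and can attach to the containerised JVM (it may need to be run inside the container as the JVM user), using the executor thread 331916 from the GET test output above:
$ printf '%x\n' 331916
5108c
$ jstack 4976 | grep -A 20 'nid=0x5108c'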
System Calls
Output from top shows that the executor threads are on CPU a lot less in the POST test. When a thread goes off-CPU it can be waiting for events such as network I/O, disk I/O, scheduling, lock acquisition, or work to process. These types of events involve system calls to the kernel. Using perf trace we can obtain a summary of the system calls the executor and event-loop threads are making to find out more about their activity. The command below was run for 10 seconds.
$ perf trace -s -p 4976 -o perf-trace-output.txt
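If a narrower view is needed, perf trace can also be attached to specific thread IDs with -t. A sketch restricted to the two event-loop threads reported in the summaries further below (the capture is stopped with Ctrl-C once the sample window has elapsed, at which point the summary is written):
$ perf trace -s -t 5055,5056 -o perf-trace-eventloop.txt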
The output below shows the system calls made by one of the executor threads during a POST endpoint test at maximum load. The executor thread spends most of its time waiting for I/O on file descriptors using ppoll, with significant differences between the average and maximum values reported. Errors are also reported for the read and futex calls. The summary shown is representative of the other executor threads.
Summary of events:
executor-thread (496150), 19974 events, 7.3%
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
ppoll 1465 0 20360.553 0.000 13.898 389.397 7.48%
futex 1768 45 2294.950 0.002 1.298 23.692 4.45%
write 2316 0 55.680 0.003 0.024 0.158 1.04%
read 4437 1465 43.308 0.002 0.010 5.931 13.73%
gettid 1 0 0.010 0.010 0.010 0.010 0.00%
The output below shows the top five system calls for the event-loop threads during the same test and sample window. The threads spend most of their time waiting for I/O on file descriptors using epoll_pwait, and the event-loop threads are also writing to sockets. There are some errors for the futex calls, as well as significant differences between the average and maximum values reported for epoll_pwait. One of the event-loop threads processed more events than the other and logged 1027 futex errors.
Summary of events:
vert.x-eventloo (5055), 40318 events, 14.8%
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_pwait 5609 0 20214.774 0.000 3.604 97.388 2.75%
futex 5373 145 95.882 0.002 0.018 4.787 12.99%
read 5623 0 51.278 0.002 0.009 1.082 2.62%
write 2408 0 41.232 0.003 0.017 0.298 1.63%
writev 835 0 24.772 0.011 0.030 0.251 1.50%
vert.x-eventloo (5056), 54052 events, 19.8%
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_pwait 7658 0 19652.526 0.000 2.566 85.186 2.06%
futex 7310 1027 146.670 0.002 0.020 6.561 6.29%
read 7760 0 75.407 0.002 0.010 0.394 1.28%
write 2825 0 60.900 0.003 0.022 2.496 4.25%
writev 1008 0 31.529 0.012 0.031 0.395 1.49%
Whilst some of the system calls made by each thread type have high maximum values, none of them shows an average timing that equates to the 109 ms round-trip time measured in the test. They do, however, indicate that a lot of time is spent in file descriptor I/O.
System Trace
Using strace we can identify the sequence of system calls, and the arguments being passed, when processing a request. POST requests were executed sequentially and traced with no other activity on the system. In this scenario we are using strace because, by default, it shows the request data being passed, which makes it easier to follow the requests. Similar information can be obtained using perf, which is less intrusive; caution should be used when running strace in a live environment. When strace was run under maximum load a 25% reduction in throughput was observed.
$ strace -rttTf -p 3923004 -o strace-output.txt
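The overhead can also be reduced at capture time by restricting which system calls strace records, for example excluding futex or limiting the trace to descriptor I/O. A sketch against the same process as above:
$ strace -rttTf -e 'trace=!futex' -p 3923004 -o strace-output.txt
$ strace -rttTf -e trace=%desc -p 3923004 -o strace-fd-output.txt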
The output of strace can be filtered to find the event loop thread and executor thread that processed the client request. To focus on file descriptor I/O, the futex calls have been omitted and for ease of reading some of the read and write data has been replaced with “<data>”. An ongoing system call is marked as unfinished and resumed by strace if another system call is made from a different process/thread.
5069 11:49:11.129865 (+ 0.004709) read(17, "POST /servicea HTTP/1."..., 288) = 274 <0.000026>
5069 11:49:11.130072 (+ 0.000013) epoll_pwait(156, <unfinished...>
1613992 11:49:11.130144 (+ 0.000061) write(157, "\1\0\0\0\0\0\0\0", 8) = 8 <0.000052>
5069 11:49:11.130218 (+ 0.000076) <... epoll_pwait resumed> [{EPOLLIN, {u32=157, u64=281470681743517}}], 1024, 613, NULL, 8) = 1 <0.000139>
5069 11:49:11.130247 (+ 0.000012) read(157, "\1\0\0\0\0\0\0\0", 16) = 8 <0.000038>
5069 11:49:11.130408 (+ 0.000009) epoll_pwait(156, <unfinished...>
1613992 11:49:11.130476 (+ 0.000024) write(157, "\1\0\0\0\0\0\0\0", 8) = 8 <0.000025>
5069 11:49:11.130569 (+ 0.000025) <... epoll_pwait resumed> [{EPOLLIN, {u32=157, u64=281470681743517}}], 1024, 612, NULL, 8) = 1 <0.000153>
5069 11:49:11.130599 (+ 0.000028) read(157, "\1\0\0\0\0\0\0\0", 16) = 8 <0.000024>
1613992 11:49:11.130944 (+ 0.000197) write(157, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
5069 11:49:11.130971 (+ 0.000023) epoll_pwait(156, <unfinished...>
1613992 11:49:11.131005 (+ 0.000035) <... write resumed> ) = 8 <0.000043>
5069 11:49:11.131019 (+ 0.000011) <... epoll_pwait resumed> [{EPOLLIN, {u32=157, u64=281470681743517}}], 1024, 612, NULL, 8) = 1 <0.000034>
5069 11:49:11.131053 (+ 0.000012) read(157, "\1\0\0\0\0\0\0\0", 16) = 8 <0.000014>
5069 11:49:11.131123 (+ 0.000072) write(20, "GET /serviceb/123"..., 232) = 232 <0.000056>
5069 11:49:11.131229 (+ 0.000107) epoll_pwait(156, [{EPOLLIN, {u32=20, u64=281470681743380}}], 1024, 612, NULL, 8) = 1 <0.003947>
5069 11:49:11.135213 (+ 0.003983) read(20, "HTTP/1.1 200 \r\nVary: Origin\r\n"..., 16384) = 2240 <0.000030>
1613992 11:49:11.135756 (+ 0.000164) write(27, "<data>"..., 144 <unfinished ...>
5069 11:49:11.135789 (+ 0.000029) epoll_pwait(156, <unfinished...>
1613992 11:49:11.135819 (+ 0.000030) <... write resumed> ) = 144 <0.000039>
1613992 11:49:11.135840 (+ 0.000020) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000027>
1613992 11:49:11.135899 (+ 0.000062) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001360>
1613992 11:49:11.137309 (+ 0.001408) read(27, "<data>", 5) = 5 <0.000023>
1613992 11:49:11.137363 (+ 0.000054) read(27, "<data>"..., 208) = 208 <0.000030>
1613992 11:49:11.137529 (+ 0.000167) write(27, "<data>"..., 136) = 136 <0.000050>
1613992 11:49:11.137615 (+ 0.000083) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000029>
1613992 11:49:11.137674 (+ 0.000058) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001264>
1613992 11:49:11.138981 (+ 0.001307) read(27, "<data>", 5) = 5 <0.000022>
1613992 11:49:11.139030 (+ 0.000049) read(27, "<data>"..., 49) = 49 <0.000028>
1613992 11:49:11.139130 (+ 0.000101) write(27, "<data>"..., 140) = 140 <0.000059>
1613992 11:49:11.139223 (+ 0.000091) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000027>
1613992 11:49:11.139280 (+ 0.000056) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001193>
1613992 11:49:11.140523 (+ 0.001243) read(27, "<data>", 5) = 5 <0.000023>
1613992 11:49:11.140576 (+ 0.000054) read(27, "<data>"..., 481) = 481 <0.000031>
1613992 11:49:11.140755 (+ 0.000180) write(27, "<data>"..., 148) = 148 <0.000049>
1613992 11:49:11.140839 (+ 0.000081) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000029>
1613992 11:49:11.140898 (+ 0.000058) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.002336>
1613992 11:49:11.143277 (+ 0.002379) read(27, "<data>", 5) = 5 <0.000023>
1613992 11:49:11.143327 (+ 0.000049) read(27, "<data>"..., 49) = 49 <0.000029>
1613992 11:49:11.143436 (+ 0.000110) write(27, "<data>"..., 62) = 62 <0.000053>
1613992 11:49:11.143537 (+ 0.000100) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000023>
1613992 11:49:11.143588 (+ 0.000051) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001193>
1613992 11:49:11.144825 (+ 0.001236) read(27, "<data>", 5) = 5 <0.000023>
1613992 11:49:11.144878 (+ 0.000053) read(27, "<data>"..., 68) = 68 <0.000028>
1613992 11:49:11.145084 (+ 0.000207) write(27, "<data>"..., 405) = 405 <0.000050>
1613992 11:49:11.145171 (+ 0.000084) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000030>
1613992 11:49:11.145231 (+ 0.000059) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001333>
1613992 11:49:11.146607 (+ 0.001376) read(27, "<data>", 5) = 5 <0.000022>
1613992 11:49:11.146657 (+ 0.000049) read(27, "<data>"..., 51) = 51 <0.000028>
1613992 11:49:11.146806 (+ 0.000150) write(27, "<data>"..., 60) = 60 <0.000054>
1613992 11:49:11.146894 (+ 0.000086) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000031>
1613992 11:49:11.146954 (+ 0.000060) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.003337>
1613992 11:49:11.150342 (+ 0.003388) read(27, "<data>", 5) = 5 <0.000025>
1613992 11:49:11.150400 (+ 0.000058) read(27, "<data>"..., 47) = 47 <0.000030>
1613992 11:49:11.150573 (+ 0.000172) write(1, "<data>"..., 196) = 196 <0.000073>
1613992 11:49:11.150887 (+ 0.000313) write(157, "\1\0\0\0\0\0\0\0"..., 8) = 8 <0.000048>
5069 11:49:11.150965 (+ 0.000075) <... epoll_pwait resumed> [{EPOLLIN, {u32=157, u64=281470681743517}}], 1024, 607, NULL, 8) = 1 <0.015166>
5069 11:49:11.150996 (+ 0.000012) read(157, "\1\0\0\0\0\0\0\0", 16) = 8 <0.000020>
5069 11:49:11.151083 (+ 0.000094) writev(17, [{iov_base="HTTP/1.1 201 Created\r\nContent-Ty"..., iov_len=254}, {iov_base="<data>"
5069 11:49:11.151217 (+ 0.006079) epoll_pwait(156, [{EPOLLIN, {u32=17, u64=281470681743377}}], 1024, 592, NULL, 8) = 1 <0.006079>
File Descriptors
The trace above shows that the event-loop thread and the executor thread are reading and writing to multiple file descriptors. Using lsof we can find out what type of descriptor each one is. The event-loop thread is using eventpoll descriptor 156 for I/O notification on multiple descriptors, eventfd 157 for application wait events, and descriptors 17 and 20 for network I/O.
$ lsof | grep -e 5069 | grep -e " 17[urw] " -e " 20[urw] " -e " 156[urw] " -e " 157[urw] "
vert.x-ev 4976 5069 185 17u sock 0,8 0t0 9005252 protocol: TCPv6
vert.x-ev 4976 5069 185 20u sock 0,8 0t0 9004984 protocol: TCPv6
vert.x-ev 4976 5069 185 156u a_inode 0,13 0 11657 [eventpoll]
vert.x-ev 4976 5069 185 157u a_inode 0,13 0 11657 [eventfd]
The executor thread is using eventfd 157 for application notification events, descriptor 1 for stdout logging and descriptor 27 for network I/O.
$ lsof | grep -e 1613992 | grep -e " 1[urw] " -e " 27[urw] " -e " 157[urw] "
executor- 4976 1613992 185 1w FIFO 0,12 0t0 34511 pipe
executor- 4976 1613992 185 27u sock 0,8 0t0 8978747 protocol: TCPv6
executor- 4976 1613992 185 157u a_inode 0,13 0 11657 [eventfd]
Using nsenter to run ss in the network namespace of the process, we can find out what the socket connections are. Descriptor 17 is the connection with the requesting client, descriptor 20 is a connection to an external service on port 8090, and descriptor 27 is a connection to a Postgres database.
$ nsenter -t 4976 -n ss -i -p | grep -e "fd=17" -e "fd=20" -e "fd=27"
tcp ESTAB 0 0 [::ffff:xx.x.xxx.xxx]:37636 [::ffff:xx.x.xxx.xxx]:postgres users:(("java",pid=4976,fd=27))
tcp ESTAB 0 1578 [::ffff:xx.x.xxx.xxx]:webcache [::ffff:xx.x.xxx.xxx]:16828 users:(("java",pid=4976,fd=17))
tcp ESTAB 0 0 [::ffff:xx.x.xxx.xxx]:60508 [::ffff:xx.x.xxx.xxx]:8090 users:(("java",pid=4976,fd=20))
Call Sequence
Combining the trace output with the file descriptor information, we can see that the event loop receives the client request on a TCP socket (fd=17) and returns the final response to the client on the same socket. The entire request took 21.2 ms to process in an uncontended scenario. The event-loop thread makes a call to service B (fd=20), which took 4.1 ms, and when the response is received the executor thread makes 7 calls to a Postgres database (fd=27) totalling 15.2 ms before the response is returned to the client. Each initial database read fails with EAGAIN as no data has yet been returned to read, which is the source of the executor thread read errors in the perf summary above. ppoll is then used to wait for data to read, with the timeout specified as NULL, which blocks indefinitely. The event-loop thread uses epoll for I/O notification and monitors the eventpoll descriptor (fd=156) for socket I/O from the client request (fd=17) and the service B request (fd=20), as well as event notifications from the executor thread on the eventfd (fd=157). The writev system call is used to return the response back to the client (fd=17).
Database Latency
In the trace above a single request made 7 round trips to the database. Referring back to the summary report from perf trace, the ratio of ppoll system calls to writev system calls across all threads is 6.9, which approximately aligns with the single request trace (each request is completed by one writev back to the client). With an average ppoll latency of 14.51 ms, each client request spent around 101.6 ms (7 x 14.51 ms) waiting for a response from the database; the worked figures are shown after the listing below. This accounts for 93% of the 109 ms response time observed under load. The remaining 7% comprises the call to service B (4 ms) as well as latency between the test tools and the microservice.
executor-thread
ppoll 1465 0 20360.553 0.000 13.898 389.397 7.48%
ppoll 1460 0 20091.517 0.000 13.761 336.062 6.69%
ppoll 1440 0 20553.106 0.000 14.273 302.391 6.73%
ppoll 1462 0 20236.413 0.000 13.842 384.526 7.00%
ppoll 1341 0 20813.370 0.000 15.521 316.937 6.91%
ppoll 1366 0 20612.547 0.000 15.090 444.624 7.41%
ppoll 1441 0 20386.041 0.000 14.147 383.300 6.99%
ppoll 1421 0 20578.373 0.000 14.482 412.241 7.25%
ppoll 1315 0 20852.107 0.000 15.857 396.901 7.34%
vert.x-eventloop
writev 835 0 24.772 0.011 0.030 0.251 1.50%
writev 1008 0 31.529 0.012 0.031 0.395 1.49%
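As a cross-check, the figures above can be combined into the per-request estimate. A minimal sketch of the arithmetic, assuming 7 database round trips per request as observed in the single request trace (184484 ms and 12711 calls are the ppoll totals summed across the executor threads listed above):
$ awk 'BEGIN { avg = 184484.0 / 12711; wait = 7 * avg; printf "avg ppoll %.2f ms, per-request db wait %.1f ms (%.0f%% of 109 ms)\n", avg, wait, wait / 109 * 100 }'
avg ppoll 14.51 ms, per-request db wait 101.6 ms (93% of 109 ms)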
Blocking Calls
The output from strace shows that the call to service B was made from the event-loop thread. Although blocked thread warnings were not reported by Vert.x during testing (the round-trip time is well below the 2000 ms threshold), we recommended that this be reviewed. Event-loop threads process their events sequentially and the model is designed to run a small, fixed number of threads to minimise context-switching overhead; for this model to work, blocking calls must not be made from event-loop threads (see the Vert.x golden rule). The development team subsequently confirmed that the REST client getById method is being used, which Quarkus recommends should not be invoked from the event-loop thread; the asynchronous variant of the call should be used instead.
Conclusion
Using the tools above we have been able to identify that:
- Threads are waiting off-CPU because requests to the database are slowing down under load. The amount of time spent waiting for a response from the database begins to increase when the database reaches a limit. This is compounded by the fact that there are 7 calls per API request.
- Blocking calls made from the event-loop threads present a further potential bottleneck within Vert.x, impacting both GET and POST requests if service B slows down. Following on from this, if calls to service B are made non-blocking, service B may then become a bottleneck on POST request throughput.
This gives the following questions for further investigation:
- Why are the database calls slowing down under load?
- What diagnostic information can we get from the database?
- Can we monitor network latency to the database? (A starting point is sketched after this list.)
- What level of throughput does the POST endpoint need to support?
- What type of database workload does the POST endpoint generate? Is this a write-heavy workload?
- Are there any options to reduce the number of database round trips per request?
- Can the blocking call made to service B be replaced with a non-blocking asynchronous version?
- What happens to service B if throughput is increased? Does service B become a bottleneck?
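On the question of monitoring network latency to the database, one low-overhead starting point, reusing the nsenter and ss approach from earlier, is to read the kernel's smoothed round-trip time estimate for the established connection to the Postgres port (5432 here); the rtt field in the output reports the smoothed RTT and its variance in milliseconds. A sketch:
$ nsenter -t 4976 -n ss -ti dst :5432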
Other points to note were:
- Some of the tools that were used create a snapshot of system activity at a particular point in time. It was useful to combine these into a script so that output from multiple tools is captured at the same point and the sampling can be repeated consistently; a minimal sketch is shown after this list.
- The summary report produced by perf trace gives statistics for each system call. If some of the system calls perform I/O on different file descriptor types, or show a lot of tail latency, it can be useful to view the latency distribution using funclatency from the BCC tools. Other tools in that repository, such as offwaketime, are also useful for tracking off-CPU time.
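A minimal sketch of such a wrapper script, capturing thread activity, a system call summary and socket state for the same process over the same window (the PID, duration and file names are illustrative):
#!/bin/bash
# Capture top, perf trace and socket state for one sample window.
PID=4976          # illustrative: process under investigation
DURATION=300      # illustrative: sample window in seconds
TS=$(date +%Y%m%d-%H%M%S)

top -H -b -p "$PID" -d "$DURATION" -n 2 > "top-$TS.txt" &
perf trace -s -p "$PID" -o "perf-trace-$TS.txt" &
PERF_PID=$!
nsenter -t "$PID" -n ss -i -p > "ss-$TS.txt"

sleep "$DURATION"
kill -INT "$PERF_PID"   # SIGINT stops perf trace and writes the summary
wait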