Playing with Multi-threading bug (2)
When QA was doing test for the previous bug, a new problem was found.
On startup, when PC is under load, our customer premise system shows no interface (the link to server) available and it can not process any request.
Our customer premise system is a multi process architecture (simplified):
An process accepts and handles requests (these requests come from another process developed by business partner, but let's ignore it here), implement most of the business logics, let's call it App process;
Another process is responsible establishing connections to server, forwards the messages from App process to server and vice versa, let's call it Network process.
App process <---TCP--> Network process <---TCP--> Server
To solve the issue, first we need to narrow down the problem. The first part that I looked at was the connection between the Network process and server.
I examined the logs generated by the Network process. The connection setup between the Network process and server is fine, because the log shows that the links are closed as idle after connected (I mentioned in previous blog that some third party server does not allow persistent connections):
Line 103: 02:58:22.234375 W ...> ** - IDLE CLOSE (**)
Line 108: 02:58:22.234375 W ...> ** - IDLE CLOSE (*)
Line 115: 02:58:22.234375 W ...> ** - IDLE CLOSE (*)
So the problem resides on App process, or the interaction between App process and Network process.
Dived into the code of App process to do a quick investigation of the status update mechanism of server interface.
The status is initialized as false(0). So when App process connected to Network process, the status is 0. But after connected, App process will send a status query message to Network process to get link status every 2 seconds . And calls a setting method update the status (only if the status is OK) when response message returns.
Added debug log to that status setting method in App process, and QA reproduced it. But the added log could not be seen, so the server status in App process was never updated actually.
However, from the point of view of App process, the TCP connection with Network process should have been connected successfully, otherwise it would attempt to reconnect.
The log of Network process showed that no accept event occurred, which meant from Network process’s point of view there is no TCP connection between App process and Network process at all.
Contradictory, so we need to look at it at a lower level to see what really happens.
Used RawCap (right, Wireshark can not sniff the loopback traffic) to capture the loopback traffic between App process and Network process. Analyse the captured packets with Wireshark, filtered with expression “tcp.dstport == xx or tcp.srcport == xx” (Network process listen on port xx):
155 02:12:58.077133 127.0.0.1 127.0.0.1 TCP 52 61169 > xx [SYN] Seq=0 Win=8192 Len=0 MSS=65495 WS=256 SACK_PERM=1
156 02:12:58.077133 127.0.0.1 127.0.0.1 TCP 52 xx > 61169 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=65495 WS=256 SACK_PERM=1
157 02:12:58.077133 127.0.0.1 127.0.0.1 TCP 40 61169 > xx [ACK] Seq=1 Ack=1 Win=8192 Len=0
158 02:12:58.088133 127.0.0.1 127.0.0.1 TCP 84 61169 > xx [PSH, ACK] Seq=1 Ack=1 Win=8192 Len=44
159 02:12:58.088133 127.0.0.1 127.0.0.1 TCP 40 xx > 61169 [ACK] Seq=1 Ack=45 Win=7936 Len=0
173 02:12:59.436133 127.0.0.1 127.0.0.1 TCP 178 61169 > xx [PSH, ACK] Seq=45 Ack=1 Win=8192 Len=138
174 02:12:59.436133 127.0.0.1 127.0.0.1 TCP 40 xx > 61169 [ACK] Seq=1 Ack=183 Win=7936 Len=0
175 02:12:59.463133 127.0.0.1 127.0.0.1 TCP 219 61169 > xx [PSH, ACK] Seq=183 Ack=1 Win=8192 Len=179
176 02:12:59.463133 127.0.0.1 127.0.0.1 TCP 40 xx > 61169 [ACK] Seq=1 Ack=362 Win=7680 Len=0
188 02:13:00.188133 127.0.0.1 127.0.0.1 TCP 84 61169 > xx [PSH, ACK] Seq=362 Ack=1 Win=8192 Len=44
189 02:13:00.188133 127.0.0.1 127.0.0.1 TCP 40 xx > 61169 [ACK] Seq=1 Ack=406 Win=7680 Len=0
227 02:13:02.291133 127.0.0.1 127.0.0.1 TCP 84 61169 > xx [PSH, ACK] Seq=406 Ack=1 Win=8192 Len=44
228 02:13:02.291133 127.0.0.1 127.0.0.1 TCP 40 xx > 61169 [ACK] Seq=1 Ack=450 Win=7680 Len=0
259 02:13:04.394133 127.0.0.1 127.0.0.1 TCP 84 61169 > xx [PSH, ACK] Seq=450 Ack=1 Win=8192 Len=44
260 02:13:04.394133 127.0.0.1 127.0.0.1 TCP 40 xx > 61169 [ACK] Seq=1 Ack=494 Win=7680 Len=0
293 02:13:06.497133 127.0.0.1 127.0.0.1 TCP 84 61169 > xx [PSH, ACK] Seq=494 Ack=1 Win=8192 Len=44
294 02:13:06.497133 127.0.0.1 127.0.0.1 TCP 40xx > 61169 [ACK] Seq=1 Ack=538 Win=7424 Len=0
A TCP connection did exist between App process and someone’s port xx. Handshakes succeeded. And App process always sent 44 byte data to the opponent every 2 seconds, but the opponent did not reply any data, just ack with sequence number 1. At this point, I can only imagine several possibilities: a) App process connected with process other than Network process, that process of course did not know how to respond App process’ get status msg. b) The connection somehow terminated at the TCP level, instead of going thru the process.
I did an experiment with the accept code blocked in Network process, to see what would happen if Network process does not accept the connect request from App process. Got the same TCP sequence:
69 05:58:57.548361 127.0.0.1 127.0.0.1 TCP 52 61533 > xx [SYN] Seq=0 Win=8192 Len=0 MSS=65495 WS=256 SACK_PERM=1
70 05:58:57.548361 127.0.0.1 127.0.0.1 TCP 40 xx > 61533 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
75 05:58:58.042916 127.0.0.1 127.0.0.1 TCP 52 61533 > xx [SYN] Seq=0 Win=8192 Len=0 MSS=65495 WS=256 SACK_PERM=1
76 05:58:58.042916 127.0.0.1 127.0.0.1 TCP 40 xx > 61533 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
85 05:58:58.542466 127.0.0.1 127.0.0.1 TCP 48 61533 > xx [SYN] Seq=0 Win=8192 Len=0 MSS=65495 SACK_PERM=1
86 05:58:58.542466 127.0.0.1 127.0.0.1 TCP 40 xx > 61533 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
189 05:59:08.648362 127.0.0.1 127.0.0.1 TCP 52 61550 > xx [SYN] Seq=0 Win=8192 Len=0 MSS=65495 WS=256 SACK_PERM=1
190 05:59:08.648362 127.0.0.1 127.0.0.1 TCP 52 xx > 61550 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=65495 WS=256 SACK_PERM=1
191 05:59:08.648362 127.0.0.1 127.0.0.1 TCP 40 61550 > xx [ACK] Seq=1 Ack=1 Win=8192 Len=0
198 05:59:09.549550 127.0.0.1 127.0.0.1 TCP 84 61550 > xx [PSH, ACK] Seq=1 Ack=1 Win=8192 Len=44
199 05:59:09.549550 127.0.0.1 127.0.0.1 TCP 40 xx > 61550 [ACK] Seq=1 Ack=45 Win=7936 Len=0
222 05:59:11.650658 127.0.0.1 127.0.0.1 TCP 84 61550 > xx [PSH, ACK] Seq=45 Ack=1 Win=8192 Len=44
223 05:59:11.650658 127.0.0.1 127.0.0.1 TCP 40 xx > 61550 [ACK] Seq=1 Ack=89 Win=7936 Len=0
244 05:59:13.751765 127.0.0.1 127.0.0.1 TCP 84 61550 > xx [PSH, ACK] Seq=89 Ack=1 Win=8192 Len=44
245 05:59:13.751765 127.0.0.1 127.0.0.1 TCP 40 xx > 61550 [ACK] Seq=1 Ack=133 Win=7936 Len=0
266 05:59:15.852872 127.0.0.1 127.0.0.1 TCP 84 61550 > xx [PSH, ACK] Seq=133 Ack=1 Win=8192 Len=44
267 05:59:15.852872 127.0.0.1 127.0.0.1 TCP 40 xx > 61550 [ACK] Seq=1 Ack=177 Win=7936 Len=0
So basically I can assume it’s Network process not accepting the connect request causing the problem at this point.
More debug log added (and QA reproduced that, cause I could not reproduce it on my machine), and found that Network process could receive the readable event of the listen socket, but the event was somehow ignored.
Line 90: 03:22:08.906250 W 00161788> Socket 932 created
Line 96: 03:22:09.031250 W 00161788> Listening on Socket 932
Line 250: 03:22:17.671875 W 00168450> readable socket list:932,
More debug log added, found some readable socket did not have corresponding Socket entry in the internal socket table:
21:34:01.181500 W 00176268> Socket 972 not matched ...
The internal socket table was in a mess, some slot was mistakenly erased, and this is highly possible with multiple threads writing to it.
With more debug log added, this round of reproduce finally reveal the root cause:
01:47:01.796875 Socket 968 created
01:47:01.796875 Attach: i=0.
01:47:01.796875 Socket 952 created
01:47:01.796875 Attach: i=0.
01:47:01.796875 Socket 964 created
01:47:01.796875 Attach: i=0.
01:47:01.796875 Socket 960 created
01:47:01.796875 Attach: i=0.
01:47:01.796875 Socket 956 created
01:47:01.796875 Attach: i=0.
Multiple sockets stored to the same slot (i=0)of the internal table, So only the last one got stored, all previous ones were leaked.
Each connection create/setup was execute in distinct thread, but the code to find the available slot has no protection, so multiple threads see the same slot (i=0) as empty..
Reverse the investigation process, the ins and outs are very clear:
a) The main thread that create the listen socket (on port xx), stored the socket pointer in internal socket table’s slot 0 Socket[0].
b) Some IO thread that create socket to connect server saw slot 0 was still available, so overwritten that slot.
c) App process attempted to setup connection with Network process port xx,
d) OS’s TCP layer finished the handshake of the TCP connection.
e) Network process polled the listen socket’s FD and get the readable event.
f) Looked it up in the socket table but cannot find a match entry. So Network process did not accept the connection.
g) Removed the listen socket’s FD from the read FD set, and so the listen socket would not be polled (by calling select()) again.
h) So, although App process sent get status request every 2s, but the msg was just terminated in TCP layer, Network process never knows that, so App process got no real reply.
i) Without any update, App process’ link status is kept as “link failure”.
(note. for confidential purpose, some of the details are hidden. I tried to explain the technical aspect in a general approach )