I just wanted to add that my network problems seem to have “disappeared” - I’ve had no problems for 4 days with my backups over the non-tailscale IPs and for the last 2 days I switched back to sending the backups over the tailscale IPs with no problems whatsoever.
I guess someone between my 2 servers noticed the problems and fixed their network.
P.S. At least now I know way more about network debugging than I ever thought I’d need.
Good to hear. We’ll keep looking at the pcap. So far we’ve run it through tcptrace to produce a time-sequence graph of when packets were transferred, and used xplot.org to poke around at it. Here is the zoomed-out view of the transfer:
The purple S means Selective Acknowledgement: the receiving side of the TCP connection is signalling that it has received later packets and is still waiting for retransmission of the earlier ones.
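For anyone following along, here is a rough sketch of what a selective acknowledgement conveys. It is a simplified model, not how any real stack implements it: segments are numbered 0, 1, 2, ... instead of using byte sequence numbers, and the function name sack_view is made up for this illustration.

```python
def sack_view(received):
    """Simplified model of what a receiver reports back to the sender.

    `received` holds the segment numbers that have arrived so far.
    Returns (next_expected, sack_blocks): the first segment still
    missing, plus the contiguous ranges received beyond that hole,
    each as a (start, end_exclusive) pair.
    """
    received = set(received)

    # Cumulative ACK: everything below this number has arrived.
    next_expected = 0
    while next_expected in received:
        next_expected += 1

    # SACK blocks: runs of segments that arrived after the first hole.
    blocks = []
    for seg in sorted(s for s in received if s > next_expected):
        if blocks and blocks[-1][1] == seg:
            blocks[-1] = (blocks[-1][0], seg + 1)  # extend the current run
        else:
            blocks.append((seg, seg + 1))          # start a new run
    return next_expected, blocks


# Segments 3, 4 and 7 were lost; 5, 6 and 8 arrived out of order.
print(sack_view([0, 1, 2, 5, 6, 8]))  # (3, [(5, 7), (8, 9)])
```

In this model the sender learns that it only needs to resend 3, 4 and 7 rather than everything after segment 2.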
It is strange that it isn’t just the occasional packet lost. The transfer is proceeding fine until suddenly 10-15 consecutive packets are lost. Just the occasional packet here and there could be recovered from more readily, but losing so many packets at once makes TCP slow way down in response.
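To make the difference between scattered losses and bursts concrete, here is a minimal sketch that groups consecutive losses into runs. It assumes you already have a per-packet delivered/lost flag in send order; pulling that out of a pcap is not shown, and the name loss_bursts is just for this example.

```python
def loss_bursts(delivered):
    """Return the length of each run of consecutive lost packets.

    `delivered` is an iterable of booleans in send order:
    True means the packet arrived, False means it was lost.
    """
    bursts = []
    run = 0
    for ok in delivered:
        if ok:
            if run:
                bursts.append(run)  # a loss run just ended
            run = 0
        else:
            run += 1
    if run:
        bursts.append(run)          # trace ended in the middle of a loss run
    return bursts


# Made-up trace: one burst of 12 losses, then a single isolated loss.
trace = [True] * 20 + [False] * 12 + [True] * 30 + [False] + [True] * 5
print(loss_bursts(trace))       # [12, 1]
print(max(loss_bursts(trace)))  # 12
```

A trace with many runs of length 1 points at ordinary random loss, while runs of 10-15 like the ones in this pcap are the pattern we are trying to explain.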
The thing the packet trace doesn’t tell us is why the packets are lost:
is it something we’re doing?
is it a buffer in a router somewhere that is overflowing and tail-dropping packets?
is it a bad fiber somewhere? Fibers which have been bent too sharply or which have microfractures in the glass tend to result in bursts of losses, due to thermal effects as the cracks expand and contract.
In case the problem does come back: It turns out it would be really helpful to have a copy of the iperf traces over the non-tailscale network as well, for comparison. It would be nice to see if the behaviour of dropping 10-15 packets in a row is happening at the physical network level, or is something wrong with wireguard or tailscale. Unfortunately either one is quite possible.
Awesome, thanks so much for looking into it. I’ll be sure to update you if anything pops up again as a “regular problem”.
Things seem to be going smoothly; at least the network performance is constant and reproducible.