Dramatic decline in performance in direct connect

I just wanted to add that my network problems seem to have “disappeared”: I had no problems for 4 days with my backups over the non-tailscale IPs, and for the last 2 days I’ve switched back to sending the backups over the tailscale IPs with no problems whatsoever.

I guess someone between my 2 servers noticed the problems and fixed their network :)

P.S. At least now I know way more about network debugging than I ever thought I’d need :)

Good to hear. We’ll keep looking at the pcap. So far we’ve run it through tcptrace to produce a time-sequence graph of when packets were transferred, and used xplot.org to poke around at it. Here is the zoomed out view of the transfer:

The purplish mess in the middle is a burst of packet loss. TCP congestion control slows down in response, and never fully recovers.
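In case you want to poke at the capture the same way, here is a rough sketch of that workflow driven from Python. It assumes tcptrace and xplot (from xplot.org) are installed and on your PATH; the pcap filename is just a placeholder.

```python
# Rough sketch of the tcptrace + xplot workflow, driven from Python.
# Assumes tcptrace and xplot are installed; depending on your distro the
# viewer binary may be called "xplot" or "xplot.org".
import subprocess
from pathlib import Path

def time_sequence_graphs(pcap: str) -> list[Path]:
    """Ask tcptrace for time-sequence graphs (-S) and return the .xpl files."""
    subprocess.run(["tcptrace", "-S", pcap], check=True)
    return sorted(Path(".").glob("*_tsg.xpl"))  # one graph per direction per connection

if __name__ == "__main__":
    for xpl in time_sequence_graphs("backup-transfer.pcap"):  # placeholder filename
        subprocess.run(["xplot", str(xpl)])  # interactive zoom/pan viewer
```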

Zooming in, there are several distinct instances of packet loss, where a run of packets is lost and has to be retransmitted.

Looking at one of them:

The purple S marks a Selective Acknowledgement (SACK): TCP signalling that it has received subsequent packets and is still waiting for the earlier ones to be retransmitted.
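For reference, here is a minimal sketch of pulling those SACK blocks straight out of a capture, assuming scapy is installed; the filename is a placeholder, and the flat tuple of edges is how scapy happens to represent the option value.

```python
# Minimal sketch: print the SACK blocks seen in a pcap. Each block is a
# (left edge, right edge) pair of sequence numbers the receiver already has;
# the gap below the left edge is what it is still waiting to get retransmitted.
from scapy.all import IP, TCP, rdpcap

def print_sack_blocks(pcap_path):
    for pkt in rdpcap(pcap_path):
        if IP not in pkt or TCP not in pkt:
            continue
        for name, value in pkt[TCP].options:
            if name == "SAck":
                # scapy unpacks the option data as a flat tuple of edges:
                # (left1, right1, left2, right2, ...)
                blocks = list(zip(value[0::2], value[1::2]))
                print(f"{pkt[IP].src} -> {pkt[IP].dst}: SACK blocks {blocks}")

if __name__ == "__main__":
    print_sack_blocks("backup-transfer.pcap")  # placeholder filename
```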

What is strange is that it isn’t just the occasional packet being lost. The transfer proceeds fine until suddenly 10-15 consecutive packets are lost. The occasional packet here and there could be recovered from fairly readily, but losing so many packets at once makes TCP slow way down in response.
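Here is a rough way to quantify that from the pcap itself rather than eyeballing the graph. It is only a sketch, assuming scapy: it treats any data segment whose sequence range has already been seen as a likely retransmission, groups consecutive ones into bursts, and ignores sequence-number wraparound. The filename is a placeholder.

```python
# Sketch: count likely-retransmitted data segments per flow and group them
# into bursts, to see whether losses really come in runs of 10-15.
from collections import defaultdict
from scapy.all import IP, TCP, rdpcap

def retransmission_bursts(pcap_path):
    highest_seq = {}            # flow -> highest sequence number seen so far
    bursts = defaultdict(list)  # flow -> list of burst lengths
    run = defaultdict(int)      # flow -> length of the current retransmission run

    for pkt in rdpcap(pcap_path):
        if IP not in pkt or TCP not in pkt:
            continue
        ip, tcp = pkt[IP], pkt[TCP]
        payload = len(tcp.payload)
        if payload == 0:        # skip pure ACKs
            continue
        flow = (ip.src, tcp.sport, ip.dst, tcp.dport)
        end = tcp.seq + payload

        if flow in highest_seq and end <= highest_seq[flow]:
            run[flow] += 1      # data we've already seen: likely retransmission (or reordering)
        else:
            if run[flow]:       # a new segment closes any open burst
                bursts[flow].append(run[flow])
                run[flow] = 0
            highest_seq[flow] = max(highest_seq.get(flow, 0), end)

    for flow, r in run.items():  # close any run still open at end of capture
        if r:
            bursts[flow].append(r)
    return dict(bursts)

if __name__ == "__main__":
    for flow, sizes in retransmission_bursts("backup-transfer.pcap").items():
        print(flow, "retransmission bursts:", sizes)
```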

The thing the packet trace doesn’t tell us is why the packets are lost:

  • is it something we’re doing?
  • is it a buffer in a router somewhere that is overflowing and tail-dropping packets?
  • is it a bad fiber somewhere? Fibers which have been bent too sharply or which have microfractures in the glass tend to produce bursts of losses, due to thermal effects as the cracks expand and contract.

In case the problem does come back: it turns out it would be really helpful to have a copy of the iperf traces over the non-tailscale network as well, for comparison. It would be nice to see whether the behaviour of dropping 10-15 packets in a row is happening at the physical network level, or whether something is going wrong in WireGuard or Tailscale. Unfortunately either one is quite possible.
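For next time, something along these lines would capture what we’d want to compare: run the same iperf3 test while tcpdump records the traffic, then keep the resulting pcap. It is only a sketch; the interface name, server address and filenames are placeholders, and tcpdump will generally need root.

```python
# Sketch: capture the non-tailscale transfer with tcpdump while iperf3 runs.
import subprocess
import time

IFACE = "eth0"            # physical interface (placeholder)
SERVER = "203.0.113.10"   # non-tailscale IP of the other server (placeholder)
PCAP = "iperf-non-tailscale.pcap"

# Start tcpdump in the background, limited to traffic to/from the server.
tcpdump = subprocess.Popen(["tcpdump", "-i", IFACE, "-w", PCAP, "host", SERVER])
time.sleep(1)             # give tcpdump a moment to start capturing

# Run the iperf3 test (client mode, 30 seconds).
subprocess.run(["iperf3", "-c", SERVER, "-t", "30"], check=True)

tcpdump.terminate()       # stop the capture; tcpdump flushes the file on SIGTERM
tcpdump.wait()
```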

Awesome, thanks so much for looking into it. I’ll be sure to update you if anything pops up again as a “regular problem”.
Things seem to be going smoothly; at least the network performance is constant and reproducible.

Have a nice day
Ovidiu