Intermittent Timeouts on Windows Desktop

tailscale version 1.34.2
Windows 11 Pro 22H2 22621.1105

I have a tailscale network with a variety of devices. In my home I have some iOS devices, a NAS, and a Windows Desktop. In the cloud I have a few Linux VPCs.

In terms of the fundamentals, everything seems to be working correctly. The devices can all connect to each other. The addresses and DNS are all correct. IPv4 and IPv6 both work. I’m honestly impressed by tailscale. My only issues are the well-known iOS battery usage problems and the following problem.

I have a very annoying and persistent issue that occurs only for connections between my Windows desktop and devices that are outside my home network. Here is what I have discovered.

I first encountered the issue when I was working on a cloud VPC using SSH. I’m not using tailscale SSH. I just have OpenSSH server running on the VPC with port 22 open on the tailscale interface, and I use the standard OpenSSH client. Every few seconds, maybe once or twice a minute, the connection hangs. I type and nothing appears. Then after a few seconds of delay all the things I typed appear instantly. It’s as if suddenly a bunch of buffered up packets got through. You can imagine this makes it infuriatingly difficult to get any work done when the problem reliably happens several times per minute.

I performed several tests and discovered that this issue is not specific to SSH. I can reproduce it reliably using the tailscale ping command. Here are the results:

  • When using SSH between my iPad and a VPC in the cloud, no issues.
  • When using tailscale ping between any two devices on my home network, no issues.
  • When using tailscale ping between any two different VPCs in the cloud, no issues.
  • When using tailscale ping between a VPC and a device on my home network other than my Windows desktop, in either direction, no issues.
  • When using tailscale ping between my Windows Desktop and a cloud VPC, in either direction, recurring timeouts happen every single time. Even if I do a fresh restart of my Windows desktop, the problem is there.
  • When using regular ping, outside of tailscale, between my Windows Desktop and the cloud VPC, there are no problems with either IPv4 or IPv6.

I still have to get another Windows device on my home network to test with to determine if it’s a Windows problem, or a problem with just my desktop specifically.

Here is what the pings look like. In this example I am pinging from a cloud VPC to my desktop over tailscale. If I attempt the pings in the reverse direction, from desktop to VPC, the exact same phenomenon occurs. And it happens every single time.

In this example 100.100.100.100 is the tailscale IP of my desktop.
66.66.66.66 is the public IPv4 address of my home network’s router.
2600:2600:2600:2600::2600 is the public IPv6 address of my home network’s router.

apreche@myvpc:~$ tailscale ping -c 100 --until-direct=false --verbose mydesktop
2023/01/23 16:38:34 lookup "mydesktop" => "100.100.100.100"
ping "100.100.100.100" timed out
pong from mydesktop (100.100.100.100) via 66.66.66.66:1115 in 12ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 7ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 21ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 9ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 8ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 6ms
ping "100.100.100.100" timed out
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 7ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 7ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 9ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 7ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 8ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 6ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 6ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 7ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 10ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 6ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 7ms
ping "100.100.100.100" timed out
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 9ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 8ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 9ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 8ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 10ms
ping "100.100.100.100" timed out
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 12ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 6ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 9ms
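For anyone who wants to put numbers on output like the above, here is a minimal Python sketch that counts timeouts and the paths the pongs took. It assumes the line format is exactly what appears in this post. The function name `summarize`, the `SAMPLE` text, and the `DERP(` prefix for relayed replies are assumptions for illustration; no relayed line appears in the capture above.

```python
import re
from collections import Counter

# Patterns match the two kinds of result lines seen in the capture above.
TIMEOUT_RE = re.compile(r'^ping "(?P<target>[^"]+)" timed out$')
PONG_RE = re.compile(
    r'^pong from (?P<host>\S+) \((?P<ip>[^)]+)\) '
    r'via (?P<via>\S+) in (?P<ms>\d+(?:\.\d+)?)ms$'
)


def summarize(output):
    """Count timeouts and pongs, and group pongs by the path they took."""
    paths = Counter()
    latencies = []
    timeouts = 0
    for raw in output.splitlines():
        line = raw.strip()
        if TIMEOUT_RE.match(line):
            timeouts += 1
            continue
        match = PONG_RE.match(line)
        if not match:
            continue  # lookup line, shell prompt, blank line, etc.
        via = match.group("via")
        if via.startswith("["):
            paths["ipv6"] += 1   # bracketed endpoint, e.g. [2600:...]:41641
        elif via.startswith("DERP("):
            paths["relay"] += 1  # assumption: relay label, not shown in the post
        else:
            paths["ipv4"] += 1   # e.g. 66.66.66.66:1115
        latencies.append(float(match.group("ms")))
    total = timeouts + len(latencies)
    return {
        "timeouts": timeouts,
        "pongs": len(latencies),
        "loss_pct": round(100 * timeouts / total, 1) if total else 0.0,
        "paths": dict(paths),
        "max_ms": max(latencies) if latencies else None,
    }


# A few lines copied from the capture above, used as sample input.
SAMPLE = """\
2023/01/23 16:38:34 lookup "mydesktop" => "100.100.100.100"
ping "100.100.100.100" timed out
pong from mydesktop (100.100.100.100) via 66.66.66.66:1115 in 12ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 7ms
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 21ms
ping "100.100.100.100" timed out
pong from mydesktop (100.100.100.100) via [2600:2600:2600:2600::2600]:41641 in 7ms
"""

print(summarize(SAMPLE))
```

Saving the full ping output to a file and passing its contents to `summarize` makes it easy to compare runs, for example with IPv6 enabled versus disabled.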

Does anyone have any ideas on how I can further diagnose this problem? I’ve narrowed it down a whole bunch, but now I have no clues as to what the cause could be. Some Windows firewall or network configuration perhaps?

Thanks.

It’s been a few months, and I am still experiencing this problem, with no solution in sight. Unless it is solved, it makes tailscale absolutely intolerable to use for SSH.

When I run tailscale ping in a loop, it pings perfectly for a while, then times out, then starts working perfectly again.

I did some additional testing and ran tailscale status in a loop next to tailscale ping. I can confirm that when tailscale ping times out, tailscale is indeed reverting back to the relay. Then it re-establishes the direct connection and works perfectly again. If the direct connection is working, what is causing it to revert to the relay every 10 seconds or so?
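The side-by-side comparison can also be done in one place. The following is only a rough Python sketch, not anything Tailscale ships: it pings repeatedly and prints a timestamped line whenever the result changes between direct, relay, and timeout. The helper names (`classify`, `transitions`, `watch`) are made up for illustration, and the `DERP(` prefix for relayed replies is an assumption, since no relayed line appears in the output posted here. The `-c` and `--until-direct=false` flags are the ones used earlier in this thread.

```python
import subprocess
import time


def classify(line):
    """Return 'timeout', 'relay', 'direct', or None for one output line."""
    line = line.strip()
    if line.endswith("timed out"):
        return "timeout"
    if line.startswith("pong from ") and " via " in line:
        via = line.split(" via ", 1)[1].split(" ", 1)[0]
        # Assumption: relayed replies are labelled DERP(<region>).
        return "relay" if via.startswith("DERP(") else "direct"
    return None  # lookup line or anything else


def transitions(states):
    """Return (index, old, new) for every change between known states."""
    changes = []
    prev = None
    for i, state in enumerate(states):
        if state is None:
            continue
        if prev is not None and state != prev:
            changes.append((i, prev, state))
        prev = state
    return changes


def watch(host, count=100, interval=1.0):
    """Ping `host` one probe at a time and log every state change."""
    prev = None
    for _ in range(count):
        proc = subprocess.run(
            ["tailscale", "ping", "-c", "1", "--until-direct=false", host],
            capture_output=True,
            text=True,
        )
        for line in (proc.stdout + proc.stderr).splitlines():
            state = classify(line)
            if state is None:
                continue
            if state != prev:
                stamp = time.strftime("%H:%M:%S")
                print(stamp, prev, "->", state, "|", line.strip())
            prev = state
        time.sleep(interval)


# Usage, on a machine with the tailscale CLI installed:
#     watch("mydesktop")
```

Logging only the changes keeps the output short, and the timestamps show whether the fallbacks really recur on a fixed interval such as every 10 seconds.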

I checked the firewall rules and logs on both ends, local and remote. Both of them are permitting all connections on the tailscale interface in both directions using all protocols and all ports.

Does anyone have even a guess as to what is causing this?

I have done some experimenting, and I learned something.
This problem does not seem to occur when IPv6 is disabled entirely on my Windows desktop.
This gives me a new hypothesis.

There is a well-known problem relating to IPv6 TCP checksum offloading with certain Intel NICs. I know for a fact that my desktop has this problem. I have checksum offload disabled on the Intel adapter, and that solves the problem for regular traffic.

However, I’m wondering if somehow Windows is not disabling the checksum offload for Tailscale. The Tailscale adapter in Windows does not have this setting. I know that in principle it shouldn’t matter since the Tailscale adapter isn’t a hardware adapter, but the behavior is suspiciously similar, and it only affects IPv6.
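One way to compare what Windows reports for each adapter is the `Get-NetAdapterChecksumOffload` PowerShell cmdlet. Below is a minimal Python sketch that wraps it. The adapter names "Ethernet" and "Tailscale" and the helper names `offload_query` and `show_offload` are assumptions, so substitute whatever names `Get-NetAdapter` lists on your machine. The cmdlet may well return nothing or an error for a virtual adapter, which would itself be a useful data point.

```python
import subprocess


def offload_query(adapter_name):
    """Build the argv that asks PowerShell for one adapter's offload state."""
    # Quote the name for PowerShell; embedded single quotes are doubled.
    quoted = adapter_name.replace("'", "''")
    script = f"Get-NetAdapterChecksumOffload -Name '{quoted}' | Format-List"
    return ["powershell", "-NoProfile", "-Command", script]


def show_offload(adapter_name):
    """Run the query (Windows only) and return whatever PowerShell printed."""
    proc = subprocess.run(
        offload_query(adapter_name), capture_output=True, text=True
    )
    return proc.stdout.strip() or proc.stderr.strip() or "(no output)"


# Usage on the Windows desktop; both adapter names are assumptions:
#     for name in ("Ethernet", "Tailscale"):
#         print(name)
#         print(show_offload(name))
```

This only reads the settings and changes nothing, so it is safe to run before and after toggling offload on the hardware adapter.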

I guess the way to test this hypothesis is to get another NIC and stick it in a PCI express slot?

I went and got a new PCI express NIC to test my hypothesis.

It turns out that I was correct.

I know for a fact that my Intel NIC and my ISP combined had this problem with IPv6.
I know that if I disable the checksum offloading for IPv6 on the Intel network adapter in Windows, the problem goes away for all regular network traffic.
I know that when using Tailscale to connect over IPv6 there were intermittent timeouts, as evidenced in my previous post. This only happened when connecting via IPv6.
When I completely disabled all IPv6 on the entire computer, forcing tailscale to use IPv4 only, the problem disappeared.
When I put in a new NIC in the very same computer, with checksum offloading enabled and IPv6 enabled, all networking, including tailscale, worked perfectly.

Therefore, I am 99% confident that one or both of the following two things is true:

  1. When checksum offloading is disabled in Windows on a real hardware network adapter, Windows is not honoring that setting for other virtual network adapters that are using that hardware adapter.
  2. When checksum offloading is disabled in Windows on the real hardware network adapter that Tailscale is tunneling through, Tailscale does not honor that setting, and attempts to offload anyway.

I think it will take someone more expert than me to do some deep packet inspections and Windows network internals debugging to find whether Tailscale or the OS is at fault. I will not be investigating any further as the new NIC has solved all of my problems.

I’m just glad I have the answer so I could make this post to help anyone in the future who comes across this issue. I don’t even know if this is worth fixing as eventually all the affected NICs will fade out of use. They are already quite old. No new hardware will have this problem.


I take back everything I said in the previous post.

The problem has returned exactly as before.
This is a fresh installation of Windows, so it’s not some weird software issue.
It’s a new NIC that doesn’t have the checksum offload problem.

Yet, the intermittent timeouts that occur only with tailscale are back.

My next plan is to get a different Windows 11 device to test with both on my home network and also from other networks. I already know that non-Windows devices do not have this problem.

I’m also finally frustrated enough to bust out wireshark to see what I can see.
