Not relaying to the closest location

Tailscale version - 1.8.7
Your operating system & version - Windows 10 21H1

Tailscale has DERP relays in several places, but my server, which is located in the US, is relaying through Sydney for an unknown reason, and this is making my RDP session unusable. I used to have 160ms of latency; now I get 350ms to 500ms.

netcheck from the host I’m trying to connect to

> tailscale netcheck

Report:
        * UDP: false
        * IPv4: (no addr found)
        * IPv6: no
        * MappingVariesByDestIP:
        * HairPinning:
        * PortMapping:
        * Nearest DERP: New York City
        * DERP latency:
                - nyc: 11.7ms  (New York City)
                - dfw: 33.4ms  (Dallas)
                - sea: 67.3ms  (Seattle)
                - sfo: 72ms    (San Francisco)
                - lhr: 80.2ms  (London)
                - fra: 93.1ms  (Frankfurt)
                - sao: 127.8ms (São Paulo)
                - tok: 174ms   (Tokyo)
                - sin: 233ms   (Singapore)
                - syd: 235.3ms (Sydney)
                - blr: 261.9ms (Bangalore)

tailscale status from the host I’m trying to connect from

100.79.254.82   rdp-machine           redacted@    windows active; relay "syd", tx 2030008 rx 4477400

Why is this happening, and how can I avoid it? It worked just fine until yesterday.

Hello,

Can you please run the bug report command the next time you see this happening?

"c:\Program Files (x86)\Tailscale IPN\tailscale.exe" bugreport

This is still happening.

BUG-7ea46ec8ff844353b13b40dff3e8b6ac382e8368d04c70854f59ecdc243a0cc7-20210611192818Z-7e8cdf774e1d9fee

Today it decided it’s not Sydney anymore, but it’s still somewhere very far from where it is supposed to be.
It says “blr” now.

BUG-7ea46ec8ff844353b13b40dff3e8b6ac382e8368d04c70854f59ecdc243a0cc7-20210614170739Z-881983f4d93155a8

Hello Ouji

“tailscale status” is the command that shows you, for each peer connection, whether it’s direct or going through a DERP server.

The path between peers is chosen based on latency, with lower latency getting priority. A connection through a DERP server will normally have higher latency than a direct connection, unless a firewall/ACL or some other network configuration makes the direct connection slow.

I checked the given node and found that UDP traffic is blocked for it, possibly by strict firewall rules or by something else blocking outgoing UDP traffic. Tailscale needs UDP to be allowed.

2021-06-14 15:54:19.6247912 -0300 -0300: 2021/06/14 15:54:19 netcheck: report: udp=false v4=false …

I appreciate the reply. I think there has been a misunderstanding about what my issue is: this client will be relaying 100% of the time, since, as you mentioned, it is indeed behind a strict firewall and can’t use UDP traffic.
The actual issue is that this machine is located in the US, but it was relaying through Sydney and is now on BLR, which I’m assuming is Bangalore.
It shouldn’t be relaying so far away from the actual server; I had no problems until Thursday last week. The nearest relay is NYC, at about 9-12ms of latency.

Hi Ouji,

I checked how DERP is selected when UDP is blocked. We determine which DERP relay is best by measuring how long it takes to receive a STUN reply. If UDP is blocked, we have no latency information, so it picks whichever DERP is most popular among the other devices it can see. Once a DERP is selected, it tries to stay on the same one unless latency information becomes available.

In your case there are ACLs blocking most devices from being visible to that one, so it’s going to be based on whichever of those are online at the time that it starts up.

Does your firewall restrict outgoing tcp/443 connections? It looks like it might be trying to connect to a new DERP server and if that fails, it may just be picking one of the remaining options at random.

My recommendation would be to open up UDP traffic to enable direct connections, but if you can’t do that, then things are going to work on a best-effort basis.

My guess about what changed recently is that you have some devices that migrated to the new DERP server and it changed the best guess, perhaps combined with bad timing during a reboot.
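To make that fallback order concrete, here is a rough Python sketch of the selection behavior described above. This is purely illustrative (the function name, signature, and structure are made up; it is not Tailscale’s actual code): lowest measured latency wins, the choice is sticky when there is no data, and with no data at all it guesses the most popular relay among visible peers.

```python
# Illustrative sketch of the DERP fallback order described above.
# Not Tailscale's implementation; names and structure are invented.
from collections import Counter
from typing import Optional

def pick_derp(latencies: dict[str, float],
              peer_derps: list[str],
              current: Optional[str]) -> Optional[str]:
    """Pick a DERP region code.

    latencies:  region -> measured STUN round-trip in ms (empty if UDP is blocked)
    peer_derps: DERP regions of the peers this node can see
    current:    the region currently in use, if any
    """
    if latencies:
        # Normal case: the relay with the lowest measured latency wins.
        return min(latencies, key=latencies.get)
    if current is not None:
        # Sticky: with no latency data, keep whatever was already picked.
        return current
    if peer_derps:
        # Bootstrap: no data at all, so guess the most popular region
        # among the peers this node can see.
        return Counter(peer_derps).most_common(1)[0][0]
    return None

# With latency data, the nearest relay wins:
pick_derp({"nyc": 11.7, "syd": 235.3}, ["syd", "syd"], None)  # -> "nyc"
# With UDP blocked (no latencies) and no prior choice, the guess
# follows whichever relay the visible peers are using:
pick_derp({}, ["syd", "syd", "nyc"], None)  # -> "syd"
```

This is why a node behind a UDP-blocking firewall can end up on a distant relay: the "most popular among visible peers" guess has nothing to do with geography.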

Thank you for the reply. When I run tailscale netcheck, it can see that NYC is the closest relay; shouldn’t it be using that info to select the NYC DERP server?

I don’t think the firewall restricts outgoing 443/tcp connections; otherwise half of the internet would be unavailable to me, no? If I’m wrong about that, how can I test?

This DERP migration you mentioned would be from my devices? I have no devices that would relay so far away, all my devices that relay are relaying either in North America or South America.

We recently added a relay in South America (I was being vague in my response to avoid revealing too much about your devices on a public forum). If they were previously mostly connecting to NYC, then now maybe the majority are connecting to SAO.

I asked about 443/tcp because if you had blocked everything but allowed the specific IPs of the existing relays, then a newly added relay would be unreachable, leading to another layer of fallback (trying relays unrelated to location until one works). I see log messages indicating that connections to derp11 (SAO) on 443/tcp are timing out.
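For anyone who wants to verify this kind of thing themselves, a generic TCP connectivity probe is enough. Here is a small Python sketch (the function name is made up; pass in whichever relay host and port you want to test):

```python
# Generic outbound-TCP reachability check. Useful for testing whether a
# firewall allows connections to a given host:port (e.g. a relay on 443).
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder host, not a real relay address):
# can_connect("relay.example.com", 443)
```

If this returns False for a relay that should be reachable, the firewall is dropping or rejecting the connection.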

The UDP probes used to collect latency information in tailscale netcheck originate from a different port number than the normal probes used by tailscaled (it probes from the same port used for direct connections in order for STUN to provide accurate port number information). It will also attempt to determine latency via HTTPS. I will have to dig deeper to have a full understanding of how it would end up in this state. Regardless, opening up 3478/udp (outgoing, will usually originate from 41641/udp but not always) will dramatically improve DERP selection. Opening up UDP further will enable direct connections.
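For reference, the latency probe itself is just a standard STUN Binding Request as defined in RFC 5389. A minimal sketch of what one looks like on the wire (illustrative Python, not Tailscale’s code; actually sending it to a relay’s 3478/udp endpoint is omitted):

```python
# Build a minimal STUN Binding Request per RFC 5389. A latency probe
# sends this 20-byte packet to the relay's STUN port (3478/udp) and
# times how long the Binding Response takes to come back.
import os
import struct

STUN_MAGIC_COOKIE = 0x2112A442  # fixed value from RFC 5389
BINDING_REQUEST = 0x0001        # message type: Binding Request

def stun_binding_request() -> bytes:
    """Build a STUN Binding Request with an empty attribute body."""
    txid = os.urandom(12)  # random 96-bit transaction ID
    # Header: type (2 bytes), message length (2 bytes),
    #         magic cookie (4 bytes), transaction ID (12 bytes)
    return struct.pack("!HHI", BINDING_REQUEST, 0, STUN_MAGIC_COOKIE) + txid
```

The relay’s Binding Response also carries the source address and port it saw, which is why probing from the same port used for direct connections matters for accurate NAT mapping information.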

Unfortunately I can’t control the firewall behavior on this device.
There were no relay changes on any of my other devices besides this one. And for some reason today it changed again, to somewhere in Asia instead of back to NYC. So I guess I have no option other than trying another tool where I can actually control which relay the devices use?

What is really confusing me is why the behavior changed if nothing else changed. Maybe in the future we should be able to manually tell Tailscale our preferred DERP, so as to avoid this?

It’s unfortunate that you can’t adjust the firewall but I can try digging a little more. Is there any additional information about the firewall that you can provide? Feel free to send me a private message if you prefer.

It is strange that this behavior changed recently. I will see if I can find some more details about what may be going on; the logs do not seem to provide obvious answers, and the relay selection code is somewhat complex.

Does restarting the Tailscale service (or rebooting the device) help? That should completely reset the relay selection process. You could also try downgrading to 1.6.0 (https://pkgs.tailscale.com/stable/tailscale-ipn-setup-1.6.0.exe) if that was working better previously.

There is no extra information I can provide, since I don’t have control over the firewall; this is a corporate machine. I have just installed version 1.6.0, and it changed the relay to “tok” now, which is still in Asia hahahaha

Maybe if I keep reconnecting it might eventually move to the right relay?

Every time I try to restart the service I get an error, but the service restarts successfully anyway and I can confirm the relay changes. I was able to get all the way to Seattle (now it’s in London); I will keep trying until I reach NYC.

The error I get


Tailscale Error

Failed to connect to Tailscale service.

Version 1.6.0 (tddc975fcb-g995460c32)

LogID: f58e11d9b96dc7b5813eb4e920ba18fe09073b1246cd8e617dae98f7d6535a8f
(Ctrl-C to copy this text)


OK

This error is from Tailscale 1.6, while the DERP issue you reported was on 1.8.7.

Is that machine still showing this error on 1.6?

I downgraded to 1.6.0 as @adrian suggested, to see if the relay selection would change. This error shows up when restarting the service, but as I mentioned, it works anyway.