Strange keepalive and peer-update behavior

Here is the setup:
Machine A is in a cloud VPC on a 172.17.x.x segment, with a public IP and all ports wide open.
Machine B is behind a home router that has a public IP.

Step 1: bring both Machine A and Machine B up. They can ping each other via their Tailscale 100.x.x.x addresses.
Step 2: cut off both machines' connections to the control server and the DERP servers.
Step 3: leave it for a while.

Now comes the interesting part:
Pinging from A to B doesn't go through. tcpdump on Machine B's router shows the UDP WireGuard packets arriving, but Machine B never receives them. Maybe Machine B doesn't send keepalives, so the router's NAT mapping has expired?

Now if I ping from Machine B to A with a single packet, it goes through. But then another weird thing happens: if I ping from A to B, only one packet gets through. Running `tailscale status --json` on Machine A, the peer's CurAddr field has been updated to Machine B's internal address, 192.168.x.x:port, so of course pings from A to B stop going through. If I send one more ping from B to A, CurAddr on Machine A is updated back to Machine B's public router IP, and this time it sticks even when I ping from A to B again.
If I leave it alone for a while, the whole cycle repeats.
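
For what it's worth, here's the little watcher I used to catch the CurAddr flip. It's just a sketch that shells out to the CLI and decodes only the fields it needs (field names as they appear in the status JSON; it assumes the tailscale binary is on PATH):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
	"time"
)

// Minimal view of `tailscale status --json` output; we only decode
// the fields needed to watch endpoint selection.
type status struct {
	Peer map[string]struct {
		HostName string
		CurAddr  string // the ip:port magicsock is currently sending to
	}
}

func main() {
	for {
		out, err := exec.Command("tailscale", "status", "--json").Output()
		if err != nil {
			log.Fatal(err)
		}
		var st status
		if err := json.Unmarshal(out, &st); err != nil {
			log.Fatal(err)
		}
		for _, p := range st.Peer {
			fmt.Printf("%s -> CurAddr=%q\n", p.HostName, p.CurAddr)
		}
		time.Sleep(5 * time.Second)
	}
}
```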

My Tailscale version is 20210104 on both machines.

From Machine A's tailscaled log I see two lines:

```
magicsock: rx [lNyi8] from 192.168.1.52:44640 (12/18), replaces old priority 111.201.77.202:44640
magicsock: rx [lNyi8] from low-pri 111.201.77.202:44640 (1), keeping current 192.168.1.52:44640 (12)
```
I understand that the internal address 192.168.x.x has higher priority, but it is not reachable from Machine A, so why does it replace a reachable address?
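
My reading of what those lines imply, as a caricature in code (this is not the real magicsock implementation; the priority function and numbers are made up to illustrate the behavior the log suggests):

```go
package main

import (
	"fmt"
	"net/netip"
)

// Illustrative only: a caricature of pre-disco endpoint selection.
// Private LAN addresses rank above public ones, and selection is
// driven purely by received packets, with no reachability check back.
func priority(ap netip.AddrPort) int {
	if ap.Addr().IsPrivate() {
		return 2 // e.g. 192.168.x.x looks "closer", so it ranks higher
	}
	return 1
}

// onRx models "rx from X replaces old priority Y": the moment a packet
// arrives from a higher-priority source, it becomes the send target,
// even if we can't actually reach that address ourselves.
func onRx(cur, src netip.AddrPort) netip.AddrPort {
	if priority(src) > priority(cur) {
		fmt.Printf("rx from %v, replaces old priority %v\n", src, cur)
		return src
	}
	fmt.Printf("rx from low-pri %v, keeping current %v\n", src, cur)
	return cur
}

func main() {
	cur := netip.MustParseAddrPort("111.201.77.202:44640") // B's public addr
	// A packet shows up bearing B's LAN source address, so it wins:
	cur = onRx(cur, netip.MustParseAddrPort("192.168.1.52:44640"))
	// Later packets from the public address lose on priority:
	onRx(cur, netip.MustParseAddrPort("111.201.77.202:44640"))
}
```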

If you’re seeing those “replaces old priority” and “keeping current” log messages, it means one of your nodes is running an ancient version of Tailscale, predating 1.0 or even 0.100.

But you say your “version is 20210104 on both machines”, which is disconcerting: it shouldn’t be going down that legacy path.

I wasn’t able to look up node [lNyi8]… seems like you might’ve since deleted it?

What are the Tailscale IPs in question?

I forgot to mention that I was playing with headscale + derper from the official repository. Maybe that’s why it falls back to the legacy path.

That’s almost certainly why it falls back to a legacy path. I’m not sure whether headscale has kept up with the various protocol changes.

I guess not; it doesn’t know DiscoKey exists at all. I made a simple change, copying the DiscoKey from the MapRequest into the MapResponse, and the problem disappeared. I’m not quite clear on how DiscoKey works, though. I’d really appreciate it if you could explain a bit.
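
The change was roughly the following, as a sketch against that era's tailcfg types. The machine registry and handler here are hypothetical stand-ins, not headscale's actual structure:

```go
// Sketch of the fix, not headscale's actual code: persist the DiscoKey
// each client sends in its MapRequest, and echo it back on that client's
// Node entry in every peer's MapResponse.
package control

import "tailscale.com/tailcfg"

// machine is a stand-in for however the control server tracks a node.
type machine struct {
	node     *tailcfg.Node
	discoKey tailcfg.DiscoKey
}

var machines = map[tailcfg.NodeKey]*machine{} // hypothetical registry

func handleMapRequest(req *tailcfg.MapRequest) *tailcfg.MapResponse {
	if m, ok := machines[req.NodeKey]; ok {
		m.discoKey = req.DiscoKey // the missing copy
	}

	resp := &tailcfg.MapResponse{}
	for _, peer := range machines {
		n := *peer.node
		n.DiscoKey = peer.discoKey // without this, clients fall back to the legacy path
		resp.Peers = append(resp.Peers, &n)
	}
	return resp
}
```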

Is the legacy path going away? If not, it might still be worth investigating why this happens.

The protocol docs are in tailcfg/tailcfg.go for the most part. For disco, see the disco/ directory.

I don’t have time to explain it more at the moment. Part of the reason we haven’t open-sourced the control server yet is that we weren’t totally happy with the protocol and wanted to keep it easy to iterate on quickly until we were happy with it. It’s cool that headscale exists, but we don’t have time to add it to our test matrix or keep it updated yet.

Understood.

BTW, the product is really cool. I’ve been toying with WireGuard for almost two years, and then I found Tailscale, which made me really excited.

@AndySong I have updated Headscale with the latest protocol changes (hopefully). Could you please check again?