Lost Connectivity after months of no problems

Hi. I’ve just recently lost access to my cloud VM via Tailscale. I re-added a TCP/22 rule and was able to log in just fine through the public IP.

This has not resolved itself, despite the status looking fine from the CLI; as you can see below, I can’t SSH properly.

from xeep:

cole@xeep ~ 45m 3s
❯ tailscale status
100.72.11.62    xeep                 cole.mickens@ linux   -
100.103.122.117 azdev                cole.mickens@ linux   idle, tx 3281144 rx 5683896
100.101.102.103 ("hello-ipn-dev")    services@    linux   -
100.103.91.27   jeffhyper            cole.mickens@ linux   -
100.89.55.100   pinebook             cole.mickens@ linux   -
100.89.237.128  pixel-3-1            cole.mickens@ android -
100.68.13.41    porty                cole.mickens@ linux   -
100.96.145.20   redsly               cole.mickens@ windows idle, tx 80056 rx 71416
100.111.5.113   rpifour1             cole.mickens@ linux   -

cole@xeep ~
❯ ping azdev.ts.r10e.tech
PING azdev.ts.r10e.tech (100.103.122.117) 56(84) bytes of data.

cole@xeep ~
❯ ssh cole@azdev.ts.r10e.tech
ssh: connect to host azdev.ts.r10e.tech port 22: No route to host

from azdev:

cole@azdev ~ 1h 17m 11s
❯ tailscale status
100.103.122.117 azdev                cole.mickens@ linux   -
100.101.102.103 ("hello-ipn-dev")    services@    linux   -
100.103.91.27   jeffhyper            cole.mickens@ linux   -
100.89.55.100   pinebook             cole.mickens@ linux   -
100.89.237.128  pixel-3-1            cole.mickens@ android -
100.68.13.41    porty                cole.mickens@ linux   -
100.96.145.20   redsly               cole.mickens@ windows -
100.111.5.113   rpifour1             cole.mickens@ linux   -
100.72.11.62    xeep                 cole.mickens@ linux   active; relay "sea", tx 1184 rx 0

cole@azdev ~
❯ ssh cole@100.72.11.62 # aka 'xeep'
# actually just sort of hangs... maybe it will timeout

Background:

  • xeep is my laptop
  • azdev is my cloud VM
  • redsly is my desktop machine where I’m connecting to these machines

I can’t ssh to azdev.ts.r10e.tech from xeep (though I can again from redsly; that was also broken last night). And as you can see, xeep thinks it has an established, idle tunnel to azdev.

Seems related:

cole@xeep ~
❯ sudo tailscale down
2021/04/29 15:02:42 was in state "Running"
2021/04/29 15:02:42 now in state "Stopped"

cole@xeep ~
❯ sudo tailscale up

cole@xeep ~
❯ tailscale status
100.72.11.62    xeep.cole-mickens.gmail.com.beta.tailscale.net userid:c126b52d00467c linux   -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -

cole@xeep ~
❯ tailscale status
100.72.11.62    xeep.cole-mickens.gmail.com.beta.tailscale.net userid:c126b52d00467c linux   -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -
                ("")                 -                    -

EDIT: Checking the usual suspects: the system date/time is correct.

A random snippet of tailscaled logs:

Apr 29 15:07:16 xeep tailscaled[2588]: magicsock: [0xc00034e000] derp.Recv(derp-10): derphttp.Client.Recv connect to region 10 (sea): dial tcp6 [2001:19f0:8001:2d9:5400:2ff:feef:bbb1]:443: connect: network is unreachable
Apr 29 15:07:16 xeep tailscaled[2588]: derp-10: backoff: 6644 msec
Apr 29 15:07:22 xeep tailscaled[2588]: derphttp.Client.Recv: connecting to derp-10 (sea)
Apr 29 15:07:22 xeep tailscaled[2588]: magicsock: [0xc00034e000] derp.Recv(derp-10): derphttp.Client.Recv connect to region 10 (sea): dial tcp6 [2001:19f0:8001:2d9:5400:2ff:feef:bbb1]:443: connect: network is unreachable
Apr 29 15:07:22 xeep tailscaled[2588]: derp-10: backoff: 6884 msec
Apr 29 15:07:24 xeep tailscaled[2588]: logtail: dial "log.tailscale.io:443" failed: dial tcp [2600:1f14:436:d603:342:4c0d:2df9:191b]:443: connect: network is unreachable (in 1ms)
Apr 29 15:07:24 xeep tailscaled[2588]: logtail: upload: log upload of 309 bytes compressed failed: Post "https://log.tailscale.io/c/tailnode.log.tailscale.io/80feebb9472c8f609d11ffe0df461e4e7c5d1282541ced0d4a1fc0c4bb03a85d": dial tcp [2600:1f14:436:d603:342:4c0d:2df9:191b]:443: connect: network is unreachable
Apr 29 15:07:24 xeep tailscaled[2588]: logtail: backoff: 35776 msec
Apr 29 15:07:29 xeep tailscaled[2588]: derphttp.Client.Recv: connecting to derp-10 (sea)
Apr 29 15:07:29 xeep tailscaled[2588]: magicsock: [0xc00034e000] derp.Recv(derp-10): derphttp.Client.Recv connect to region 10 (sea): dial tcp6 [2001:19f0:8001:2d9:5400:2ff:feef:bbb1]:443: connect: network is unreachable

(BTW, I’m a free user with an easy workaround – I’m rapidly posting just to provide info, not out of any sense of urgency or expectation. Thank you for Tailscale!)

Is there a chance your key has expired after 6 months? Nowadays you can renew it via the admin panel.
https://tailscale.com/kb/1028/key-expiry/

I don’t think so? (See the screenshot: key expiry is still 3 months out.) I also figure the CLI would produce a more relevant message in that case.

Oof, I wonder if IPv6 isn’t working on this machine, and that’s why the Tailscale client can’t reach the Tailscale servers? (I can’t imagine why that would’ve happened suddenly last night, though…)

Yeah, it’s going to be hard to say - I had to reboot this machine for other reasons and of course it’s happily reconnected now.

Not sure. If anyone has notes on what data to collect next time, I’d welcome them; otherwise we can just let this be. Thanks again for Tailscale!

Hmm, if tailscale up doesn’t ask for reauthentication then key expiry isn’t it.

In your logs, the “network unreachable” problems dialing logtail and DERP are a bad sign. Is it possible your firewall has suddenly started blocking outgoing https? We need this in order to negotiate connections.

There’s also quite a bit of IPv6 noise in there. I wonder if tailscale has accidentally latched itself onto using IPv6 for everything, and then only the IPv6 part of your link has gone down.

Is it possible your firewall has suddenly started blocking outgoing https? We need this in order to negotiate connections.

Zero chance of this, but it is possible that I had pushed a NixOS change that altered a different part of my config (removed an ethernet bridge that shouldn’t have been related at all to the device tailscale would’ve been using to reach the Internet). I do sort of feel like maybe that dislodged something, somewhere, that caused IPv6 to fail for tailscale.

Or, maybe tailscale had latched on to using the bridge, and then I’d torn it down? Not sure that even makes sense though? (I even think I’d restarted tailscaled.)

I really wish I’d have checked if general ipv6 was working for other programs. Now everything seems fine, of course.
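For next time, a few quick host-level checks would show whether IPv6 is broken for programs in general. This is just a sketch: the resolver address (Cloudflare’s IPv6 DNS) and the Tailscale URL are my own picks for connectivity probes, not anything tailscaled specifically dials.

```shell
# Quick next-time IPv6 checklist; each check degrades gracefully
# if the tool is missing or the network call fails.
check() {
  desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok:   $desc"
  else
    echo "FAIL: $desc"
  fi
}

check "IPv6 default route present"  sh -c 'ip -6 route show default | grep -q .'
check "ICMPv6 to a public resolver" ping -6 -c 1 -W 2 2606:4700:4700::1111
check "HTTPS over IPv6"             curl -6 -s -m 5 -o /dev/null https://login.tailscale.com
```

If the first check fails while IPv4 still works, that would match tailscaled’s “dial tcp6 … network is unreachable” errors above.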

I have a similar problem from time to time that requires restarting sshd after a system reboot. Any chance you have tried that?

Until I do that, I can ssh via a local LAN IP, but no go for the Tailscale IP. Might be something to do with the order the daemons start up at boot.
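If it is a startup-ordering problem, one way to express the dependency is a systemd drop-in (a sketch; the drop-in path and filename are hypothetical, and the unit is ssh.service on Debian/Ubuntu, sshd.service on Fedora):

```ini
# /etc/systemd/system/ssh.service.d/wait-for-tailscale.conf  (hypothetical drop-in)
[Unit]
After=tailscaled.service
Wants=tailscaled.service
```

Run systemctl daemon-reload afterwards. Note that tailscaled having started is not the same as the tailscale0 interface already having its address, so this alone may not be enough.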

It’s looking like: Tailscale loses control plane + DERP connectivity when a node loses IPv6 internet connectivity · Issue #1726 · tailscale/tailscale · GitHub


I had a similar problem: everything was working fine with a computer in the lab (running Ubuntu bionic), and then things stopped working soon after a power outage. The Tailscale admin console showed it as connected, but I could not ssh in. tailscale status showed a “-” for every peer, meaning it was not communicating with any other devices. It went on like this for days; I could not figure out why it was connected yet not working, and rebooting had no effect.

Disabling IPv6 on the LAN network interface and rebooting solved the problem.

Device was connected to an eero router in case any internal eero shenanigans unknowingly contributed to this.

Also been having issues with Tailscale disconnecting at random times… and I can’t connect to my services. Restarting tailscale or rebooting the server usually brings it back up… but it’s still been frustrating.

What’s the best place to check Tailscale logs on Debian and Ubuntu?

See Logging, auditing, and streaming · Tailscale for where to find logs on the different platforms.
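On Debian and Ubuntu specifically, tailscaled runs as a systemd unit, so the journal is the usual place to look (this assumes the standard unit name, tailscaled):

```shell
# Recent tailscaled logs from the systemd journal; --no-pager keeps it scriptable.
unit=tailscaled
journalctl -u "$unit" -n 100 --no-pager 2>/dev/null \
  || echo "no journal access (not root, or journalctl unavailable)"
# To follow live: journalctl -u tailscaled -f
```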


I’ve had this happen to me as well on various Ubuntu systems that I use as exit nodes.[1]

I like to let Ubuntu patch automatically and even reboot automatically when required (i.e. unattended-upgrades with “[reboot required]” handling). The problem I ran into was that even after the patched host rebooted, I would have to use console or out-of-band access to restart sshd.[2]

The reason is that sshd would fail to start because the Tailscale address wasn’t ready for sshd to bind to; I’d have to manually restart sshd, and then access was restored as expected.

My workaround has been to allow sshd to bind to a nonlocal address in the event that tailscale isn’t established before sshd tries to bind to the tailscale address.[3]

$ grep -v \# /etc/sysctl.d/99-sysctl.conf 
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
net.ipv4.ip_nonlocal_bind = 1

[1] Enable IP forwarding on Linux · Tailscale
[2] see above for /kb/1009/protect-ssh-servers/ (this forum only allows two URLs for new users)
[3] ip_nonlocal_bind | sysctl-explorer.net
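To pick up the sysctl change without a reboot, reload the sysctl.d files as root and then verify the knob; a small sketch (reading /proc directly, which works the same on any Linux):

```shell
# After editing /etc/sysctl.d/99-sysctl.conf, reload as root with: sysctl --system
# Then verify the knob took effect by reading /proc directly:
val=$(cat /proc/sys/net/ipv4/ip_nonlocal_bind 2>/dev/null || echo unknown)
echo "ip_nonlocal_bind=$val"  # 1 lets sshd bind the tailscale IP before it exists
```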


I have experienced the same, or at least a similar, issue where I am able to ping a node but unable to ssh to the same node, receiving a “… port 22: No route to host” from the ssh client. The destination is a Cloud VPS running Fedora 35 Server. I have not experienced this issue on any of my physical destination nodes.

I eventually resolved the issue by configuring a firewalld zone that accepts port 22/tcp on the tailscale0 interface.

<?xml version="1.0" encoding="utf-8"?>
<zone target="DROP">
  <short>Tailscale</short>
  <interface name="tailscale0"/>
  <port port="22" protocol="tcp"/>
  <icmp-block-inversion/>
  <icmp-block name="echo-reply"/>
  <icmp-block name="echo-request"/>
  <icmp-block name="time-exceeded"/>
  <icmp-block name="port-unreachable"/>
  <icmp-block name="fragmentation-needed"/>
  <icmp-block name="packet-too-big"/>
</zone>
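The same zone can also be built with firewall-cmd instead of writing the XML by hand. A sketch, assuming firewalld is running and using the same zone name, target, interface, and port as the XML above (the icmp-block entries are left out for brevity):

```shell
# Recreate the zone above with firewall-cmd (run as root, then reload).
firewall-cmd --permanent --new-zone=tailscale
firewall-cmd --permanent --zone=tailscale --set-target=DROP
firewall-cmd --permanent --zone=tailscale --add-interface=tailscale0
firewall-cmd --permanent --zone=tailscale --add-port=22/tcp
firewall-cmd --reload
```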

Sorry for necro-ing the thread, but I felt it would be useful to post and underline that jay’s idea of the ip_nonlocal_bind setting in sysctl is a great solution to this problem.

I too had the same problem of being locked out of the server after a reboot, since sshd fails to start because the tailscale interface takes some seconds to bring up its IP address. This probably applies to any service that tries to bind to the Tailscale IP address too early.

As an aside, the sshd systemd unit does specify restarts, but it does not actually restart in this case: in the specific failure mode of being unable to bind to an IP, sshd exits with code 255, which the unit excludes from restarts, so systemd never retries the service. More info in the Debian bug report [1].

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=923508
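Given that bug, another option is a drop-in that clears the exit-status exclusion so systemd does retry. This is a sketch, not tested against every release: the file path is hypothetical, the unit is ssh.service on Debian/Ubuntu, and it assumes the stock unit sets RestartPreventExitStatus=255 (the empty assignment resets it).

```ini
# /etc/systemd/system/ssh.service.d/retry-bind-failure.conf  (hypothetical path)
[Service]
# The stock unit excludes exit status 255 from restarts; clearing the list
# makes the bind failure (exit 255) retry like any other failure.
RestartPreventExitStatus=
Restart=on-failure
RestartSec=5
```

Follow up with systemctl daemon-reload and a restart of the ssh unit.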

P.S. This issue is also discussed in another thread on these forums [2], but that one did not reach a solution. The supportbot’s suggestion of modifying the systemd configuration of every service on the machine is very brittle and a pain to do in practice.

[2] Ubuntu's boot order for Tailscale service - #3 by dkam