Author Topic: The computing stuff rant thread  (Read 406319 times)

Kim

Re: The computing stuff rant thread
« Reply #3275 on: 30 April, 2024, 12:40:18 pm »
Absolutely top notch case of gremlins last night.  Noticed the smart lamp socket blinking its notwork error blinkenlight as I went to bed, and it all went downhill from there.

Several hours of trial and error later, and I'd determined:

- Proxmox hypervisor's ethernet timing out and getting bounced was almost certainly a symptom of something else.  Connectivity to all the VMs (including useful things like DNS and syslog) flapping didn't make for harmonious troubleshooting.
- DHCP only seeming to work on the primary VLAN with the servers and trusted Linux boxen was a big part of the issue.  (For historical reasons, this one is connected to its own non-VLAN-aware switch port, which may be a Clue.)
- On further investigation, it appeared that the wireless-stuff and internet-of-shit VLAN interfaces on pfsense were actually unreachable, even if IPs were configured statically.
- Random VLAN-related nonsense makes me suspect the ageing Procurve switch, which gets power-cycled to no effect.  Did the downstairs one for good measure.
- Backup ageing Procurve swapped in and power-cycled until it boots.  No improvement.
- Turning my attention back to pfsense, it becomes evident through a lot of trial and error that the unreachable interfaces are briefly reachable on bootup, until the DHCP server starts processing requests.
- ifdown/ifup-ing the physical interface brings them back, and with the DHCP server disabled, they appear to stay up.  Statically configure a few things and note that the internet-of-shit stuff seems okay, but the wireless stuff is cursed.
- Establish to a reasonable degree of confidence that the Proxmox connectivity issue doesn't occur while in this state.  (I have a second Proxmox node on frankenputer hardware, and by migrating VMs around and experiencing the same problem on both nodes am reasonably confident this isn't a hardware issue.)
- Decide that WiFi problems might be complicated by the Unifi access point(s) lacking connectivity due to DHCP failure, so leave that to one side.
- Hack away at the interfaces config in pfsense, putting the three main VLANs on their own physical interfaces.
- Re-enable DHCP, note that some of the internet-of-shit stuff is managing to request an address without the interface becoming unreachable.
- Power-cycle the access points, note wireless-VLAN stuff is now successfully DHCPing, and a fuckload of wireless internet-of-shit devices suddenly reconnect as well.
- Decide that, since barakta needs her $ork laptop to give a training session over Teams in a few hours, it's time to quit while we have a mostly functional network.
- Bring assorted VMs (on various VLANs) back up on the primary Proxmox machine to see what happens.  They DHCP successfully and everything just works.  Do some ping -f and fling some large files around to generate network traffic, not a timeout in sight.
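
For anyone wanting to catch the "reachable on bootup, then gone once DHCP starts" moment with timestamps rather than by eyeball, something along these lines might help.  It's only a rough sketch: the gateway addresses and VLAN names are made up, `watch` and `ping_once` are names invented for the example, and the ping flags assume Linux iputils (on FreeBSD, `-W` is in milliseconds).  It only logs when an interface changes state, so the output can be lined up against the DHCP server's log.

```python
import subprocess
import time
from datetime import datetime

# Hypothetical per-VLAN gateway addresses; substitute the real pfsense interface IPs.
GATEWAYS = {
    "servers": "192.168.1.1",
    "wireless": "192.168.20.1",
    "iot": "192.168.30.1",
}


def ping_once(host, timeout_s=1):
    """Return True if one ICMP echo gets a reply (assumes Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(gateways, probe=ping_once, rounds=None, interval_s=2.0, log=print):
    """Poll each gateway and log a timestamped line only when its state changes.

    rounds=None polls forever; a number stops after that many passes.
    Returns the list of (name, is_up) transitions that were seen.
    """
    state = {}
    transitions = []
    done = 0
    while rounds is None or done < rounds:
        for name, host in gateways.items():
            up = probe(host)
            if state.get(name) != up:
                state[name] = up
                transitions.append((name, up))
                stamp = datetime.now().isoformat(timespec="seconds")
                log(f"{stamp} {name} ({host}) {'UP' if up else 'DOWN'}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_s)
    return transitions
```

Running `watch(GATEWAYS)` from a statically configured box on a trunk port, then rebooting pfsense, should show whether the VLAN interfaces drop at the same moment the DHCP server comes up.  The `probe` argument is there so the loop can be exercised without a real network.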

So yeah.  Fuck knows.

Working theory with the Proxmox side of things is that enough containers/VMs trying and failing to do DHCP on the same bridge at the same time will somehow hang the interface, perhaps by filling up a buffer or something?

Prime suspect is currently the pfsense box, on general principle.  It all going weirdly wrong late at night makes me think it could be an intermittent hardware issue (it's one of those apu2 boards, so not much scope for swapping hardware, though I suppose it could run in a VM).

Though I'm also aware that what actually seemed to clear the problem was power-cycling the access point.  Maybe it had got wedged somehow?

Once barakta doesn't need the network, I'm going to try restoring the pfsense and switch config and see what happens.


This is all deeply suboptimal, as I'm supposed to be off for a week of cycle touring tomorrow...

Re: The computing stuff rant thread
« Reply #3276 on: 30 April, 2024, 12:52:17 pm »
Yesterday a colleague had a failed laptop battery and had to use one of the spare desktops in the office. It then generated a "User Profile Service service timed out, unable to upload the user profile" error. So I tried and got the same error. I raised a helpdesk ticket thinking it would need a prod from the local IT resource but instead got a call from India.

Turn the PC off, reboot it and problem solved. I should have tried that old Windows fix much earlier.  :facepalm: