How to Fix an Azure NixOS VM That Won't Boot

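At noon on Saturday I tried to start my VPS to watch a movie and found it hung in the start phase: I couldn't stop it, couldn't restart it, and of course couldn't connect over ssh either.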

It turns out the boot log can be viewed under Help - Serial Console in the Azure VPS panel; an Azure VPS is still a virtual machine, after all.

The only log line shown was

EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path

That is not an error at all, not even a warning; it is just an info message from the boot phase.

Searching the community, I found that on NixOS you need to add the boot parameters console=tty1 console=ttyS0 to get the full log.

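In the Serial Console, click the power icon and choose Reset VM (hard) to force a reboot.
While it reboots, press the up/down arrow keys and the systemd-boot menu will appear (the screen does not seem to do a full refresh here, so you have to press a key to get the boot menu rendered).
On the NixOS entry, press e to edit the boot parameters and append console=tty1 console=ttyS0 at the end.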
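Parameters typed into the systemd-boot editor only apply to that one boot. Once the machine is reachable again, the same console parameters could be made permanent through the NixOS option boot.kernelParams. This is a minimal sketch, assuming the system is configured through a regular configuration.nix; it is not something the original setup is known to contain.

```nix
# Hypothetical fragment to merge into configuration.nix.
# Assumption: the same two parameters that were entered by hand
# in the boot menu are wanted on every boot.
{ ... }:
{
  boot.kernelParams = [
    "console=tty1"
    "console=ttyS0" # serial port shown in the Azure Serial Console
  ];
}
```

After a nixos-rebuild switch, newly generated boot entries should carry these parameters, so the full log shows up in the Serial Console without editing the entry each time.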

Now the actual error is visible:

EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path

<<< NixOS Stage 1 >>>

loading module hv_balloon...
loading module hv_netvsc...
loading module hv_storvsc...
loading module hv_utils...
loading module hv_vmbus...
loading module dm_mod...
running udev...
Starting systemd-udevd version 257.3
kbd_mode: KDSKBMODE: Inappropriate ioctl for device
starting device mapper and LVM...
File descriptor 8 (/dev/console) leaked on lvm invocation. Parent PID 1: /nix/store/w2s431ikyi9j2mynz2y2s0phgczcvhwz-extra-utils/bin/ash
File descriptor 9 (/dev/console) leaked on lvm invocation. Parent PID 1: /nix/store/w2s431ikyi9j2mynz2y2s0phgczcvhwz-extra-utils/bin/ash
checking /dev/disk/by-uuid/f222513b-ded1-49fa-b591-20ce86a2fe7f...
fsck (busybox 1.36.1)
[fsck.ext4 (1) -- /mnt-root/] fsck.ext4 -a /dev/disk/by-uuid/f222513b-ded1-49fa-b591-20ce86a2fe7f
nixos: recovering journal
nixos: clean, 764799/33022080 files, 27304054/134155259 blocks
mounting /dev/disk/by-uuid/f222513b-ded1-49fa-b591-20ce86a2fe7f on /...

<<< NixOS Stage 2 >>>

running activation script...
setting up /etc...
starting systemd...

Welcome to NixOS 25.05 (Warbler)!

[  OK  ] Created slice Slice /hyperv.
[  OK  ] Created slice Slice /system/getty.
[  OK  ] Created slice Slice /system/modprobe.
[  OK  ] Created slice Slice /system/serial-getty.
[  OK  ] Created slice Slice /system/systemd-fsck.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Started Dispatch Password Requests to Console Directory Watch.
[  OK  ] Started Forward Password Requests to Wall Directory Watch.
         Expecting device /dev/disk/by-uuid/12CE-A600...
         Expecting device /dev/disk/by-uuid…1160-6af7-48b4-8bb8-e0051649b500...
         Expecting device /dev/ttyS0...
[  OK  ] Reached target Local Encrypted Volumes.
[  OK  ] Reached target Containers.
[  OK  ] Reached target Path Units.
[  OK  ] Reached target Remote File Systems.
[  OK  ] Reached target Slice Units.
[  OK  ] Reached target Swaps.
[  OK  ] Listening on Process Core Dump Socket.
[  OK  ] Listening on Credential Encryption/Decryption.
[  OK  ] Listening on Journal Audit Socket.
[  OK  ] Listening on Journal Socket (/dev/log).
[  OK  ] Listening on Journal Sockets.
[  OK  ] Listening on Userspace Out-Of-Memory (OOM) Killer Socket.
[  OK  ] Listening on udev Control Socket.
[  OK  ] Listening on udev Kernel Socket.
         Mounting Huge Pages File System...
         Mounting POSIX Message Queue File System...
         Mounting Kernel Debug File System...
         Mounting Kernel Trace File System...
         Starting Create List of Static Device Nodes...
         Starting Load Kernel Module configfs...
         Starting Load Kernel Module drm...
         Starting Load Kernel Module efi_pstore...
         Starting Load Kernel Module fuse...
         Starting mount-pstore.service...
         Starting Journal Service...
         Starting Load Kernel Modules...
         Starting Userspace Out-Of-Memory (OOM) Killer...
         Starting Remount Root and Kernel File Systems...
         Starting Coldplug All udev Devices...
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Kernel Debug File System.
[  OK  ] Mounted Kernel Trace File System.
[  OK  ] Finished Create List of Static Device Nodes.
[  OK  ] Finished Load Kernel Module configfs.
[  OK  ] Finished Load Kernel Module drm.
[  OK  ] Finished Load Kernel Module efi_pstore.
[  OK  ] Finished Load Kernel Module fuse.
         Mounting FUSE Control File System...
         Mounting Kernel Configuration File System...
         Starting Create Static Device Nodes in /dev gracefully...
[  OK  ] Mounted Kernel Configuration File System.
[  OK  ] Mounted FUSE Control File System.
[  OK  ] Finished mount-pstore.service.
[  OK  ] Finished Remount Root and Kernel File Systems.
         Starting Load/Save OS Random Seed...
         Starting Network Time Synchronization...
[  OK  ] Started Journal Service.
[  OK  ] Finished Create Static Device Nodes in /dev gracefully.
         Starting Flush Journal to Persistent Storage...
         Starting Create Static Device Nodes in /dev...
[  OK  ] Started Userspace Out-Of-Memory (OOM) Killer.
[  OK  ] Finished Load Kernel Modules.
         Starting Firewall...
         Starting Apply Kernel Variables...
[  OK  ] Finished Load/Save OS Random Seed.
[  OK  ] Finished Create Static Device Nodes in /dev.
[  OK  ] Reached target Preparation for Local File Systems.
         Starting Rule-based Manager for Device Events and Files...
[  OK  ] Finished Coldplug All udev Devices.
[  OK  ] Finished Apply Kernel Variables.
[  OK  ] Finished Flush Journal to Persistent Storage.
[  OK  ] Started Rule-based Manager for Device Events and Files.
[  OK  ] Started Network Time Synchronization.
[  OK  ] Found device /dev/ttyS0.
         Starting Virtual Console Setup...
[  OK  ] Found device Virtual_Disk ESP.
         Starting File System Check on /dev/disk/by-uuid/12CE-A600...
[  OK  ] Finished Firewall.
[  OK  ] Finished Virtual Console Setup.
[  OK  ] Finished File System Check on /dev/disk/by-uuid/12CE-A600.
         Mounting /boot...
         Mounting /run/wrappers...
[  OK  ] Mounted /run/wrappers.
         Starting Create SUID/SGID Wrappers...
[  OK  ] Mounted /boot.
[  OK  ] Finished Create SUID/SGID Wrappers.
[ TIME ] Timed out waiting for device /dev/…711160-6af7-48b4-8bb8-e0051649b500.
[DEPEND] Dependency failed for /mnt/resource.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for File System …711160-6af7-48b4-8bb8-e0051649b500.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Stopped Forward Password Requests to Wall Directory Watch.
[  OK  ] Reached target Timer Units.
[  OK  ] Listening on Boot Entries Service Socket.
[  OK  ] Reached target Hyper-V Daemons.
[  OK  ] Reached target Preparation for Network.
         Starting Networking Setup...
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Socket Units.
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
         Starting Update Boot Loader Random Seed...
         Starting Create System Files and Directories...
[  OK  ] Finished Update Boot Loader Random Seed.
[  OK  ] Finished Create System Files and Directories.
         Starting Record System Boot/Shutdown in UTMP...
[  OK  ] Finished Networking Setup.
[  OK  ] Reached target Network.
[  OK  ] Finished Record System Boot/Shutdown in UTMP.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, or "exit"
to continue bootup.

Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.

Press Enter to continue.

Note these lines:

[DEPEND] Dependency failed for /mnt/resource.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for File System …711160-6af7-48b4-8bb8-e0051649b500.

Why would mounting the /mnt/resource directory fail?

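While playing with NixOS I had skimmed through the links in the common directories. The resource here is a disk allocated by Azure, and a DATALOSS.txt warning sits right at its top level. Most likely the disk is gone but is still declared in /etc/fstab, so the mount keeps failing.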

What is certain is that removing the corresponding entry from /etc/fstab should fix it.
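On NixOS, /etc/fstab is generated from the fileSystems options in the system configuration, so a lasting fix would normally go there rather than into the file itself. As an alternative to deleting the entry, the mount can be marked optional so that a missing disk no longer blocks Local File Systems. This is only a sketch: it assumes the mount is declared as fileSystems."/mnt/resource" somewhere in the configuration (for example in hardware-configuration.nix), which the log does not confirm, and the 10 second timeout is an arbitrary illustrative value.

```nix
# Hypothetical fragment to merge into configuration.nix.
# Assumption: /mnt/resource is already declared via the fileSystems
# option elsewhere; this fragment only adds mount options to it.
{ ... }:
{
  fileSystems."/mnt/resource".options = [
    "nofail"                       # boot continues if the device is absent
    "x-systemd.device-timeout=10s" # stop waiting for the device after 10s
  ];
}
```

If the entry is not wanted at all, deleting its fileSystems definition and rebuilding has the same effect as removing the line from /etc/fstab.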

On a physical machine you could boot a Live CD from a USB stick and edit the file directly, but does that work for a VPS too?

It does: we have emergency mode.

On the Help - Serial Console page mentioned above, edit the boot parameters again and add console=tty1 console=ttyS0 emergency

And what did I get?

Welcome to NixOS 25.05 (Warbler)!

[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, or "exit"
to continue bootup.

Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.

Press Enter to continue.

WTF, why is root locked?

More searching showed that NixOS needs further boot parameters: rescue systemd.setenv=SYSTEMD_SULOGIN_FORCE=1

So I added the full set of parameters, console=tty1 console=ttyS0 rescue systemd.setenv=SYSTEMD_SULOGIN_FORCE=1, and could finally do whatever I wanted as root:

You are in rescue mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, or "exit"
to continue bootup.
Press Enter for maintenance
(or press Control-D to continue):
❯➜  ~

Apart from getting a taste of being on call over the weekend, what else did I take away?

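NixOS is really great. The only reason it counts as a Linux distro is that it stays compatible with the traditional Linux API: since "everything is a file" on traditional Linux, it achieves that compatibility with the traditional filesystem layout through links.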

Beyond that, NixOS is a completely different thing, and a lot of Linux experience simply stops applying.


What if you can't find any information at all?

Don't give up! The experience of booting a Live CD from a USB stick still carries over!

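The experience of fixing computers and installing systems for the boys and girls at school keeps paying off for the rest of your life, doesn't it?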