If a disk fails, the zpool on the primary server will show a DEGRADED state. To identify the faulty disk, run the command:
# zpool status
pool: NETSTOR
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub repaired 0 in 0h0m with 0 errors on Tue Dec 6 15:10:59 2016
config:
NAME                    STATE     READ WRITE CKSUM
NETSTOR                 DEGRADED     0     0     0
  mirror-0              DEGRADED     0     0     0
    SW3-NETSTOR-SRV1-1  ONLINE       0     0     0
    SW3-NETSTOR-SRV2-1  FAULTED      3     0     0  too many errors
errors: No known data errors
Look for any disks marked as FAULTED in the output. The server name embedded in the disk label indicates which host holds the damaged disk. For example:
SW3-NETSTOR-SRV2-1 FAULTED
Here, SRV2 indicates that the damaged disk is on Server-2. Once the faulty disk has been identified and confirmed to be on a secondary (not primary) server, you can proceed with the replacement or recovery steps. If the damaged disk is located on the primary server (SRV1), a manual takeover must be performed to switch control to the secondary server before proceeding.
To do this, SSH into the secondary server and run the following command:
# killall -SIGUSR1 sysmonit
This command triggers the manual takeover, allowing the secondary server to assume control while the primary server's disk is being addressed.
If zpool status shows a disk as FAULTED (e.g., SW3-NETSTOR-SRV2-1), it indicates the disk is corrupted and needs to be replaced. Physically remove the damaged disk, install a new one, and verify that the replacement appears in /dev/disk/by-id/ before adding it back to the zpool mirror.
# ls -lah /dev/disk/by-id
total 0
drwxr-xr-x 2 root root 480 Jul 27 08:57 .
drwxr-xr-x 7 root root 140 Jul 27 08:13 ..
lrwxrwxrwx 1 root root 9 Jul 27 08:13 ata-INTEL_SSDSC2CW060A3_CVCV308402M3060AGN -> ../../sde
lrwxrwxrwx 1 root root 10 Jul 27 08:13 ata-INTEL_SSDSC2CW060A3_CVCV308402M3060AGN-part1 -> ../../sde1
lrwxrwxrwx 1 root root 10 Jul 27 08:13 ata-INTEL_SSDSC2CW060A3_CVCV308402M3060AGN-part2 -> ../../sde2
lrwxrwxrwx 1 root root 10 Jul 27 08:13 ata-INTEL_SSDSC2CW060A3_CVCV308402M3060AGN-part9 -> ../../sde9
lrwxrwxrwx 1 root root 9 Jul 27 08:13 ata-ST31000520AS_5VX0BZN0 -> ../../sda
lrwxrwxrwx 1 root root 10 Jul 27 08:13 ata-ST31000520AS_5VX0BZN0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 9 Jul 27 08:13 ata-WDC_WD10JFCX-68N6GN0_WD-WX61A465TH1Y -> ../../sdc
lrwxrwxrwx 1 root root 10 Jul 27 08:13 ata-WDC_WD10JFCX-68N6GN0_WD-WX61A465TH1Y-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 9 Jul 27 08:13 ata-WDC_WD10JFCX-68N6GN0_WD-WX81EC512Y4H -> ../../sdd
lrwxrwxrwx 1 root root 10 Jul 27 08:13 ata-WDC_WD10JFCX-68N6GN0_WD-WX81EC512Y4H-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 9 Jul 27 08:57 ata-WDC_WD10JFCX-68N6GN0_WD-WXK1E6458WKX -> ../../sdb
lrwxrwxrwx 1 root root 9 Jul 27 08:13 wwn-0x10076999618641940481x -> ../../sdd
lrwxrwxrwx 1 root root 10 Jul 27 08:13 wwn-0x10076999618641940481x-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 9 Jul 27 08:13 wwn-0x11689569317835657217x -> ../../sdc
lrwxrwxrwx 1 root root 10 Jul 27 08:13 wwn-0x11689569317835657217x-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 9 Jul 27 08:57 wwn-0x11769037186453098497x -> ../../sdb
lrwxrwxrwx 1 root root 9 Jul 27 08:13 wwn-0x12757853320186451405x -> ../../sde
lrwxrwxrwx 1 root root 10 Jul 27 08:13 wwn-0x12757853320186451405x-part1 -> ../../sde1
lrwxrwxrwx 1 root root 10 Jul 27 08:13 wwn-0x12757853320186451405x-part2 -> ../../sde2
lrwxrwxrwx 1 root root 10 Jul 27 08:13 wwn-0x12757853320186451405x-part9 -> ../../sde9
lrwxrwxrwx 1 root root 9 Jul 27 08:13 wwn-0x7847552951345238016x -> ../../sda
lrwxrwxrwx 1 root root 10 Jul 27 08:13 wwn-0x7847552951345238016x-part1 -> ../../sda1
Newly added disks can also be identified from kernel messages, which include details about the newly detected device. In the listing above, the entries pointing to sdb carry the newest timestamp (08:57), marking the freshly installed disk. To display the most recent kernel log entries, run:
# dmesg | tail -n 20
Once the block device name of the new disk is identified, create a partition table and prepare the drive for use. Use parted to create a GPT partition table:
# parted /dev/<device> --script -- mktable gpt
Important: The partition label must follow the format SW3-NETSTOR-SRVx-y, where x is the server number and y is the disk number (e.g., SW3-NETSTOR-SRV2-1 is the virtual disk on Server 2, disk one).
Next, create a partition with a name matching the faulty partition from the previous setup. For the example above, the command would be:
# parted /dev/<device> --script -- mkpart "SW3-NETSTOR-SRV2-1" 1 -1
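Assuming the new disk was enumerated as /dev/sdb (the device with the newest timestamp in the listing above), the full sequence would look like this:
# parted /dev/sdb --script -- mktable gpt
# parted /dev/sdb --script -- mkpart "SW3-NETSTOR-SRV2-1" 1 -1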
At this point, the new partition has been created and labeled correctly, ready to be added to the zpool mirror.
Use the sw-nvme command to replace the old drive with the new one:
# sw-nvme replace-disk --old /dev/disk/by-id/<old_disk_id> --new /dev/disk/by-id/<new_disk_id>
To find the old disk ID, run:
# sw-nvme show
{
  "config": "/sys/kernel/config/nvmet",
  "hosts": [
    "3cc5c2aa47825e608570a938971bcd7c"
  ],
  "subsystems": {
    "sw-mirror": {
      "acl": [
        "3cc5c2aa47825e608570a938971bcd7c"
      ],
      "namespaces": [
        {
          "id": 1,
          "device": "/dev/disk/by-id/ata-KINGSTON_SA400S37120G_50026B73804B902A"
        }
      ],
      "allow_any_host": false
    }
  },
  "ports": {
    "1": {
      "address": "1.1.1.31",
      "port": 4420,
      "address_family": "ipv4",
      "trtype": "tcp",
      "subsystems": "sw-mirror"
    }
  }
}
Search for device information similar to: "device": "/dev/disk/by-id/ata-KINGSTON_SA400S37120G_50026B73804B902A".
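As a convenience, assuming sw-nvme show writes its JSON to stdout as in the example above, you can filter for that entry directly:
# sw-nvme show | grep '"device"'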
Once both the old and new disk IDs are known, the replacement command will look like this:
# sw-nvme replace-disk --old /dev/disk/by-id/ata-KINGSTON_SA400S37120G_50026B73804B902A --new /dev/disk/by-id/ata-WDC_WD10JFCX-68N6GN0_WD-WXK1E6458WKX
This completes the replacement procedure on the secondary server.
On the Primary Server
Next, update the ZFS pool on the primary server with the newly created virtual disk. First, rescan the partition tables so the kernel detects the new partition:
# partprobe
The next step is to find the GUID of the faulted device so that zpool replace can identify which pool member to swap out. The GUID can be obtained with the zdb command.
The key part of the zdb output is the entry that provides the GUID and path information, for example:
path: '/dev/disk/by-partlabel/SW3-NETSTOR-SRV2-1'
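For illustration only (the GUID below is the value used in the example replace command that follows; your pool will report a different number), the relevant zdb entry pairs the GUID with the path:
guid: 12365645279327980714
path: '/dev/disk/by-partlabel/SW3-NETSTOR-SRV2-1'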
After retrieving the GUID, run zpool replace to substitute the new partition for the faulted device in the pool. Example:
# zpool replace NETSTOR 12365645279327980714 /dev/disk/by-partlabel/SW3-NETSTOR-SRV2-1 -f
After the command executes, monitor the zpool status and wait for resilvering to complete. Once resilvering finishes, the disk replacement procedure is complete.
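To monitor progress, re-run zpool status against the pool, for example:
# zpool status NETSTOR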
| Command | Description |
|---|---|
| sw-nvme list | Lists all connected devices with /dev/nvme-fabrics |
| sw-nvme discover | Discover all devices exported on the remote host with given IP and port |
| sw-nvme connect | Import remote device from given IP, port and NQN |
| sw-nvme disconnect | Remove the imported device from the host |
| sw-nvme disconnect-all | Remove all imported devices from the host |
| sw-nvme import | Import remote devices from a given JSON file |
| sw-nvme reload-import | Import remote devices from JSON file after disconnecting all current imports |
| sw-nvme enable-modules | Enable necessary kernel modules for NVMe/TCP |
| sw-nvme enable-namespace | Enable namespace with given ID |
| sw-nvme disable-namespace | Disable namespace with given ID |
| sw-nvme load | Export remote devices from a given JSON file |
| sw-nvme store | Save system configuration in JSON format if devices are exported manually |
| sw-nvme clear | Remove exported device from system configuration; with 'all' removes all configurations |
| sw-nvme export | Export device on port with given NQN |
| sw-nvme export-stop | Remove device being exported on port with given ID |
| sw-nvme reload-configuration | Export remote devices from JSON file after removing all current exports |
| sw-nvme replace-disk | Combine 'clear all' and 'reload-configuration' for easier disk replacement on SERVERware |
| sw-nvme expand-pool | Update export configuration and add new namespace into sw-mirror subsystem for SERVERware |