Understanding C9800 High Availability Troubleshooting
The Cisco Catalyst 9800 wireless controller supports two distinct high availability architectures: Stateful Switchover (SSO) for subsecond failover with full configuration synchronization, and N+1 Redundancy for distributed deployments where the standby unit takes over as a separate entity. When one of these systems fails, your wireless network—and the APs, client devices, and policies it carries—faces immediate disruption. Troubleshooting HA issues requires systematic understanding of how SSO manages state, how redundancy ports maintain health, and how sync mechanisms preserve your configuration and client sessions across chassis boundaries.
This article walks you through the core HA troubleshooting methodology for the C9800, covering SSO failover diagnostics, redundancy port issues, configuration synchronization failures, and practical CLI commands that reveal exactly where your HA pair has broken down. Whether you're diagnosing a failed failover, state mismatch between active and standby units, or split-brain scenarios, the techniques and output examples here will help you isolate and resolve the issue faster.
SSO vs. N+1 Redundancy: Quick Reference
Before diving into troubleshooting, clarify which HA model you're running. Each has different failure modes, recovery mechanisms, and diagnostic expectations.
| Aspect | SSO (Stateful Switchover) | N+1 Redundancy |
|---|---|---|
| Failover Time | Sub-second; clients remain connected | 45–60 seconds; clients reconnect to the backup |
| Configuration Sync | Full sync in real-time; all config on standby | No config sync; standby is independent |
| Operational State Sync | Full client and AP state replicated | Minimal state sync; APs re-join from scratch |
| Use Case | Campus/enterprise requiring near-zero downtime | Distributed/branch where recovery time is flexible |
| Redundancy Port (RMI) | Dedicated RP/RMI link; continuous heartbeat | Not used; controllers operate independently |
| Hardware Match Requirement | Identical chassis, line cards, and licenses | Can be different hardware SKUs |
SSO Failover Issues: Diagnosis and Resolution
When SSO failover fails or takes longer than expected, the root cause typically lies in redundancy port connectivity, configuration synchronization lag, or state mismatch. Start by checking if failover occurs at all, then work backward to find the blockage.
Step 1: Verify Chassis Redundancy State
Use show redundancy to reveal the current HA state of both the active and standby units.
C9800-1# show redundancy
Redundancy information for chassis 1:
My Slot Number = 1
Redundancy Mode = Stateful Switchover
Redundancy State = active
Standby provided by Slot Number = 2
Redundancy State Transition reason = Manual config
Redundancy information for chassis 2:
My Slot Number = 2
Redundancy Mode = Stateful Switchover
Redundancy State = standby hot
Active provided by Slot Number = 1
Redundancy State Transition reason = Manual config

What you're looking for: Both units report a matching redundancy mode (both SSO or both N+1). The active unit shows active state and identifies the standby slot. The standby unit shows standby hot state. If the standby reports standby cold or disabled, sync is not occurring.
Common mismatches: One unit reports SSO while the other reports N+1 (configuration inconsistency). The standby reports not present or not eligible (hardware or IOS mismatch). Redundancy state is initialization in progress for more than a few minutes (sync is stuck).
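These state checks are easy to script once you collect `show redundancy` from both units (via SSH or console logging). A minimal sketch in Python, assuming the output format shown above; the classification labels are illustrative, not Cisco terminology:

```python
import re

def redundancy_state(show_redundancy_output: str) -> str:
    """Extract the 'Redundancy State' value from `show redundancy` output."""
    m = re.search(r"Redundancy State\s*=\s*(.+)", show_redundancy_output)
    if not m:
        raise ValueError("no 'Redundancy State' line found")
    return m.group(1).strip().lower()

def classify_pair(unit1_out: str, unit2_out: str) -> str:
    """Classify an HA pair from both units' `show redundancy` output."""
    states = {redundancy_state(unit1_out), redundancy_state(unit2_out)}
    if states == {"active"}:
        return "split-brain"          # both units claim the active role
    if "active" in states and "standby hot" in states:
        return "healthy"
    if "standby cold" in states or "disabled" in states:
        return "sync-not-running"
    return "degraded"
```

Run it against captures from both chassis; anything other than "healthy" tells you which section of this article to jump to.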
Step 2: Check Redundancy Port and RMI Connectivity
The Redundancy Management Interface (RMI) is a dedicated port used for heartbeat and state replication. If this port is down, the standby chassis cannot sync. Check both the physical port status and the RMI-specific state.
C9800-1# show interfaces | include RMI
GigabitEthernet0/0/0 is up, line protocol is up
Hardware is CSR GigabitEthernet
Description: RMI Port to Standby
MTU 1500 bytes, BW 1000000 Kbit/sec
Encapsulation ARPA, loopback not set
Last input 00:00:01, output 00:00:01
C9800-1# show redundancy interconnect
RMI Status = Connected
RMI link = GigabitEthernet0/0/0
RMI MTU = 1500
RMI Heartbeat interval = 1000 ms
RMI Replication enabled = true
RMI bandwidth = 1000 Mbps

What you're looking for: RMI Status = Connected. The RMI link interface is up. Heartbeat interval is 1000 ms or less (subsecond SSO response requires a fast heartbeat). RMI replication is true.
If RMI Status is Disconnected or the RMI link is down, the standby chassis cannot receive state updates and failover will fail. Check physical cable, verify VLAN configuration on the RMI port, and confirm MTU is consistent (1500 bytes). If the RMI port is shutdown due to spanning tree or port security, disable those features on the RMI link.
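The heartbeat interval bounds how quickly a dead peer is detected: roughly the interval multiplied by the number of consecutive misses the platform tolerates before declaring the peer down. A small illustration (the 3-miss threshold here is an assumption for the arithmetic, not a documented C9800 constant):

```python
def detection_time_ms(heartbeat_interval_ms: int, missed_beats: int) -> int:
    """Worst-case time before a peer is declared down: consecutive misses
    times the heartbeat interval. The miss threshold is platform-dependent;
    3 is assumed here purely for illustration."""
    return heartbeat_interval_ms * missed_beats

# With the 1000 ms interval shown above and an assumed 3-miss threshold,
# the pair would tolerate up to 3 seconds of RMI silence before reacting.
```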
Step 3: Validate Configuration Synchronization
SSO requires complete configuration sync from active to standby. A mismatch here causes the standby to reject takeover or to boot with stale config. Check sync status and the last sync timestamp.
C9800-1# show redundancy status
Redundancy Status:
Redundancy State = active
Hardware: Catalyst 9800-80-3x40GE, Processor = Intel(R) Xeon(R)
Uptime = 45 days, 3 hours, 22 minutes
Standby Information:
Standby State = hot standby
Standby Uptime = 45 days, 2 hours, 55 minutes
Configuration sync = up-to-date
Operational sync = in-sync
Last config sync = 00:00:03 (3 seconds ago)
Checkpoint sync = enabled
RMI = connected

What you're looking for: Configuration sync = up-to-date. The last config sync timestamp should be recent (within seconds if SSO is stable). Checkpoint sync = enabled indicates that configuration checkpoints are being replicated. Operational sync = in-sync means AP state, client state, and license state are synchronized.
If sync shows out-of-sync or initializing, the standby is not receiving updates. Check RMI bandwidth (it should be at least 100 Mbps for timely sync). If the config is very large (several megabytes of running-config), sync may lag; trim unused configuration or increase the sync timeout. If checkpoint sync is disabled, enable it in the redundancy mode configuration.
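You can sanity-check whether a sync lag is plausible from the link math alone: transfer time is roughly config size over usable bandwidth. A rough estimator, where the 50% efficiency factor is an assumption covering protocol and checkpoint-processing overhead:

```python
def sync_seconds(config_bytes: int, link_mbps: float, efficiency: float = 0.5) -> float:
    """Rough lower bound on config-sync transfer time.
    `efficiency` discounts protocol and checkpoint overhead
    (0.5 is an assumption, not a measured value)."""
    usable_bits_per_sec = link_mbps * 1_000_000 * efficiency
    return (config_bytes * 8) / usable_bits_per_sec

# A 4 MB config over a 100 Mbps RMI at 50% efficiency:
# sync_seconds(4_000_000, 100) -> 0.64 seconds
```

If the observed sync time is orders of magnitude above this estimate, the bottleneck is not raw bandwidth; look at RMI errors or a stuck sync process instead.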
Step 4: Monitor CAPWAP Heartbeat and AP Connectivity During Failover
SSO is designed to failover so quickly that APs do not detect the transition. However, if CAPWAP heartbeat timers are misconfigured, APs may timeout and disconnect during a brief active-to-standby switch. Verify CAPWAP timer settings on the active unit.
C9800-1# show wireless global config | include capwap
CAPWAP Echo Request timeout = 30 seconds
CAPWAP Echo Response timeout = 60 seconds
CAPWAP DTL Timeout = 30 seconds
CAPWAP Backoff Timer = 30 seconds
Maximum CAPWAP Retransmissions = 5

For SSO to work without AP disconnection, keep CAPWAP heartbeat timeouts generous (30 seconds or higher) so the active unit does not drop APs over a brief heartbeat loss. If APs reconnect every time an SSO failover occurs, increase the echo timeout or retransmission threshold so APs tolerate the brief interruption instead of declaring the controller dead.
Redundancy Port Issues and RMI Failures
The redundancy port is the single point of contact between active and standby. If this port fails, the pair becomes split-brain: both units believe they are active, or one unit is isolated and cannot sync.
Identifying Split-Brain Scenarios
A split-brain occurs when the RMI link is down and both units assume active role. This is catastrophic because configurations diverge, and when the link comes back, one unit must forcibly be demoted to standby (risking config loss).
C9800-1# show redundancy
Redundancy Mode = Stateful Switchover
Redundancy State = active
Standby provided by Slot Number = 2
C9800-2# show redundancy
Redundancy Mode = Stateful Switchover
Redundancy State = active
Standby provided by Slot Number = 1

Both units report active state. This is a split-brain. Immediately troubleshoot the RMI link. Run ping from one unit to the other over the RMI port to confirm connectivity. Check the physical cable, verify the port is not shutdown, and confirm VLAN tagging (if used) matches on both sides.
RMI Port Health Check
Use hardware diagnostics and interface counters to confirm the RMI port is truly healthy.
C9800-1# show interfaces GigabitEthernet0/0/0 counters
GigabitEthernet0/0/0
RX Packets = 4521840, TX Packets = 4521810
RX Bytes = 1829504821, TX Bytes = 1829501201
RX Errors = 0, TX Errors = 0
RX Discards = 0, TX Discards = 0
CRC/FCS Errors = 0
Giants = 0, Runts = 0
C9800-1# show interfaces GigabitEthernet0/0/0 | include errors
RX packets with errors = 0
TX packets with errors = 0

What you're looking for: Zero (or nearly zero) RX and TX errors. No CRC/FCS errors. No discards. High packet counts indicate active replication. If you see rising error counters, the physical link has issues (bad cable, transceiver, or port). Replace the cable or move the RMI to a different port and reconfigure.
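If you capture this counter output periodically, a small parser can flag a degrading link before failover is affected. A sketch, with field names matching the sample output above:

```python
import re

def link_error_ratio(counter_output: str) -> float:
    """Ratio of errored frames to total frames, parsed from the counter
    output shown above (field names follow that sample)."""
    def grab(name: str) -> int:
        m = re.search(rf"{name}\s*=\s*(\d+)", counter_output)
        return int(m.group(1)) if m else 0
    total = grab("RX Packets") + grab("TX Packets")
    errors = grab("RX Errors") + grab("TX Errors") + grab("CRC/FCS Errors")
    return errors / total if total else 0.0
```

Anything persistently above zero on a point-to-point RMI link is worth a cable or transceiver swap.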
RMI Bandwidth and Latency Constraints
RMI replication speed depends on link bandwidth. If the RMI port is running at 100 Mbps (legacy configs), sync may lag. Verify actual port speed and consider upgrading to 1 Gbps or higher.
C9800-1# show interfaces GigabitEthernet0/0/0 | include speed
Encapsulation ARPA
MTU 1500 bytes, BW 1000000 Kbit/sec
Keepalive set (10 sec)

The BW value shows 1000000 Kbit/sec = 1 Gbps, which is acceptable. If it shows 100000 Kbit/sec, upgrade the port or use an aggregated link (EtherChannel). Also verify latency: the RMI port should have <5 ms latency. If latency exceeds 50 ms, sync delays will cascade and failover recovery may take longer than expected.
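The IOS `BW` field is reported in Kbit/sec, which makes quick comparisons error-prone when scripting these checks; a one-line conversion avoids the mental arithmetic:

```python
def bw_kbit_to_gbps(bw_kbit: int) -> float:
    """Convert the interface 'BW' field (Kbit/sec) to Gbps."""
    return bw_kbit / 1_000_000

# bw_kbit_to_gbps(1000000) -> 1.0  (the 1 Gbps link shown above)
# bw_kbit_to_gbps(100000)  -> 0.1  (a legacy 100 Mbps link)
```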
Configuration Synchronization Failures
When the standby unit is out of sync with the active, it cannot take over cleanly. Troubleshoot sync failures by checking the running-config size, checkpoint status, and sync initiation logs.
Validating Checkpoint Sync
Checkpoints are configuration snapshots sent to the standby unit. If checkpoint sync fails, the standby does not have a consistent config baseline.
C9800-1# show redundancy ckpt
Checkpoint replication Status:
Total Checkpoints = 1247
Last Checkpoint Replicated = 1247
Checkpoint Age = 0 seconds
Checkpoint Consistency = matched
Missing Checkpoints on Standby = 0
Checkpoint Timeout = 10000 ms

What you're looking for: Missing Checkpoints on Standby = 0 (all checkpoints replicated). Checkpoint Consistency = matched (active and standby agree on the last checkpoint). Checkpoint Age is fresh (seconds, not minutes).
If checkpoints are missing, the standby will not have the full config. Check RMI bandwidth and reload the standby with redundancy reload peer (from the active) to force a full resync. If the checkpoint age is very old, the sync process may be stuck; review system logs with show log for errors.
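Checkpoint health reduces to one number: how many checkpoints the standby is behind. A parser sketch, assuming the field names from the sample output above:

```python
import re

def checkpoint_gap(ckpt_output: str) -> int:
    """Checkpoints the standby is behind, parsed from `show redundancy ckpt`
    style output (field names follow the sample above)."""
    total = int(re.search(r"Total Checkpoints\s*=\s*(\d+)", ckpt_output).group(1))
    last = int(re.search(r"Last Checkpoint Replicated\s*=\s*(\d+)", ckpt_output).group(1))
    return total - last

# A gap of 0 is healthy; a persistent or growing gap means replication is stuck.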
Running-Config Size and Sync Timeout
Large configurations take time to sync. If your running-config runs to several megabytes, increase the checkpoint timeout to allow more time for replication.
C9800-1# show running-config | wc
500
50000
8234567
C9800-1# show redundancy state peer
Standby Statistics:
Config Replication Time = 8234 ms
Last Replication Status = OK

Config replication time of 8.2 seconds is acceptable for a large config. If replication time exceeds 30 seconds, reduce the config size or upgrade the RMI link bandwidth. If replication status shows FAILED, increase the timeout value in the redundancy configuration stanza.
State Mismatch and Operational Sync Issues
Even if configuration sync is working, operational state (AP-to-tag mappings, client sessions, license entitlements) may drift between active and standby. This causes APs to disconnect or lose their tag assignment during failover.
Checking AP-to-Tag Mapping Sync
When SSO fails over, the standby unit must have the same AP-to-tag mappings as the active. Verify this by comparing AP summary output from both units before and after failover.
C9800-1# show ap summary
Number of APs: 128
AP Hostname Model Location Group Tag State
ap-1 9130AXE Building-1-Floor-2 1 tagA Connected
ap-2 9130AXE Building-1-Floor-3 1 tagB Connected
ap-3 9115AX Building-2-Floor-1 2 tagC Connected
...

C9800-2# show ap summary
Number of APs: 128
AP Hostname Model Location Group Tag State
ap-1 9130AXE Building-1-Floor-2 1 tagA Connected
ap-2 9130AXE Building-1-Floor-3 1 tagB Connected
ap-3 9115AX Building-2-Floor-1 2 tagC Connected
...

Both units should report identical AP counts and tag assignments. If the standby has fewer APs or incorrect tag assignments, the sync is lagging. Check the RMI and restart sync with redundancy resync standby.
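Eyeballing 128 APs across two outputs is error-prone; diffing the AP-to-tag mappings programmatically is more reliable. A sketch assuming the column layout of the sample above (the `ap-` hostname prefix filter is an assumption about your naming scheme):

```python
def parse_ap_tags(summary: str) -> dict:
    """Map AP hostname -> tag from `show ap summary` style rows.
    Data rows in the sample layout: hostname model location group tag state."""
    mappings = {}
    for line in summary.splitlines():
        cols = line.split()
        # The "ap-" prefix filter skips headers; adjust it to your AP names.
        if len(cols) >= 6 and cols[0].startswith("ap-"):
            mappings[cols[0]] = cols[4]   # Tag column in the sample layout
    return mappings

def tag_mismatches(active: str, standby: str) -> dict:
    """APs whose tag differs (or is missing) on the standby."""
    a, s = parse_ap_tags(active), parse_ap_tags(standby)
    return {ap: (tag, s.get(ap)) for ap, tag in a.items() if s.get(ap) != tag}
```

An empty result means the standby's tag state matches; any entry names the AP and the differing active/standby tags.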
Client Session State Preservation
SSO preserves client session state so wireless clients do not experience disconnect during failover. To verify this is working, check the client table on both units.
C9800-1# show wireless client summary
Total Clients: 1456
ClientMac Hostname SSID State WLAN AP Channel
aa:bb:cc:dd:ee:01 host-1 corporate Associated 1 ap-1 36
aa:bb:cc:dd:ee:02 host-2 corporate Associated 1 ap-2 48
...

Compare the active and standby client counts. They should match (or be within a few clients of each other). If the standby has significantly fewer clients, client state is not syncing. Increase RMI bandwidth or reduce AP density to lower state replication load.
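Because clients roam and churn constantly, compare counts with a tolerance rather than demanding exact equality. A sketch (the tolerance of 10 is an assumption; tune it to your client churn):

```python
import re

def client_count(summary: str) -> int:
    """Pull 'Total Clients' from `show wireless client summary` output."""
    return int(re.search(r"Total Clients:\s*(\d+)", summary).group(1))

def client_sync_ok(active_out: str, standby_out: str, tolerance: int = 10) -> bool:
    """True when the standby's client table is within `tolerance` entries
    of the active's; a large gap means client state is not replicating."""
    return abs(client_count(active_out) - client_count(standby_out)) <= tolerance
```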
License and DNA Smart License Sync
AP MAC address entitlements and license counts must sync to the standby. If licenses are not synced, the standby will reject APs upon takeover.
C9800-1# show license summary
Smart License Status: Active
Total Licenses Consumed: 128
AP Licenses Available: 256
AP MAC Entitlements: 128
RMI Sync Status: In-Sync

The standby should report identical license counts and entitlements. If the standby shows RMI Sync Status: Out-of-Sync, restart the license sync daemon or restart the redundancy process.
RMI Port Troubleshooting Checklist
The RMI port is critical to all HA operations. Use this table to quickly narrow down RMI issues.
| Symptom | Likely Cause | Troubleshooting Steps |
|---|---|---|
| RMI Status = Disconnected | Physical cable down, port shutdown, VLAN mismatch | Check cable, verify port is not shutdown, confirm VLAN on both sides, test with ping over RMI |
| RMI Heartbeat = No Response | Standby unit is offline or unresponsive | Power-cycle standby unit, check console for boot errors, verify both units are running same IOS version |
| Configuration Sync = Stuck at 50% | RMI bandwidth too low, running-config too large | Upgrade RMI link to 1 Gbps, reduce running-config size, increase checkpoint timeout |
| Redundancy State = Initialization in Progress (>10 min) | Sync timeout or RMI error recovery | Check RMI port errors, review system logs, manually reload the standby with redundancy reload peer |
| Both units report Active state | Split-brain: RMI was down, both units promoted | Fix RMI immediately, shut down standby, verify RMI, then power-cycle standby to resync from scratch |
Systematic Troubleshooting: A Practical Workflow
When HA is broken, follow this sequence to isolate the problem:
Phase 1: Confirm Current State (2 minutes)
C9800-1# show redundancy
C9800-2# show redundancy

Both units should agree on who is active and standby. If they disagree, you have a split-brain; fix the RMI immediately.
Phase 2: Check RMI Health (3 minutes)
C9800-1# show redundancy interconnect
C9800-1# show interfaces description | include RMI
C9800-1# ping <standby-rmi-ip>

RMI must be connected. If disconnected, physically inspect the cable and port, then verify the VLAN.
Phase 3: Verify Sync Status (3 minutes)
C9800-1# show redundancy status
C9800-1# show redundancy ckpt

Configuration and checkpoint sync should be up-to-date and consistent. If lagging, check RMI bandwidth and running-config size.
Phase 4: Test Operational State (2 minutes)
C9800-1# show ap summary
C9800-1# show wireless client summary
C9800-2# show ap summary
C9800-2# show wireless client summary

AP and client counts should match between active and standby. If the standby has fewer APs or clients, sync is incomplete.
Phase 5: Validate IOS and Hardware Match (1 minute)
C9800-1# show version
C9800-2# show version

Both units must run the same IOS version (or compatible patch releases). Hardware models and line card SKUs must match for SSO. If they differ, SSO will fail; use N+1 redundancy instead.
Phase 6: Review System Logs for Error Context (2 minutes)
C9800-1# show log | include REDUNDANCY
C9800-1# show log | include RMI
C9800-1# show log | include CKPT

Logs often reveal the exact moment sync broke, the RMI failed, or checkpoint replication timed out. Look for timestamps and error codes, then cross-reference with Cisco TAC documentation.
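When a captured log is long, filtering it down to HA-related lines while preserving their order keeps the failure timeline readable. A minimal filter over a saved `show log` capture:

```python
import re

# Keywords from the filters above; extend the pattern for your platform's
# message mnemonics as needed.
HA_PATTERN = re.compile(r"REDUNDANCY|RMI|CKPT")

def ha_log_lines(log_text: str) -> list:
    """Return HA-related log lines in their original order, so the
    sequence of events (RMI down, then sync failure, ...) is preserved."""
    return [ln for ln in log_text.splitlines() if HA_PATTERN.search(ln)]
```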
Recovery Procedures for Common Failures
Forcing a Controlled Switchover (Planned Maintenance)
If you need to take the active unit down for maintenance, force a graceful switchover to the standby first.
C9800-1# redundancy force-switchover
System is going down for Software switchover
Warning: The system is shutting down
C9800-2# show redundancy
Redundancy Mode = Stateful Switchover
Redundancy State = active
Standby provided by Slot Number = 1

The active unit notifies the standby to take over, and all config and state are already in sync. The switchover completes in seconds with zero client impact.
Recovering from Sync Timeout
If sync is stuck, restart the redundancy process on the standby.
C9800-1# redundancy resync standby
Restarting Standby synchronization
[Wait 5-10 minutes for full sync to complete]
C9800-1# show redundancy status | include Configuration
Configuration sync = up-to-date

Recovering from Split-Brain
If both units report active (split-brain detected), you must cleanly shut down one unit and restart it to break the tie.
- Identify which unit should remain active (typically the one carrying live traffic).
- Shut down the other unit with shutdown (in config mode) or power-cycle it.
- Allow the active unit to stabilize (give it 2-3 minutes).
- Power on or restart the standby unit.
- Monitor the standby boot process to ensure it syncs cleanly from the active.
- Verify show redundancy on both units reports the correct roles.
Monitoring and Proactive Health Checks
To prevent HA failures, run these commands weekly and establish baseline values.
C9800-1# show redundancy status
C9800-1# show redundancy interconnect
C9800-1# show redundancy ckpt
C9800-1# show interfaces description | include RMI

Log these outputs and compare them week-to-week. Gradual drift in checkpoint age, sync time, or RMI throughput may indicate an impending failure. Also monitor the RMI port error counters; rising CRC or discard rates warn of cable degradation.
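If you store these weekly samples, a simple drift check can turn the baseline into an alert. A sketch (the 20% threshold is an assumption; tune it per metric):

```python
def drifting(samples: list, threshold_pct: float = 20.0) -> bool:
    """Flag a weekly metric series (e.g. checkpoint age in seconds, or
    config-sync time in ms) whose latest sample exceeds the mean of the
    earlier baseline samples by more than `threshold_pct` percent."""
    baseline, latest = samples[:-1], samples[-1]
    mean = sum(baseline) / len(baseline)
    return latest > mean * (1 + threshold_pct / 100)

# Example: three weeks of config-sync times around 100 ms, then 130 ms
# in week four -> 30% over baseline, flagged for investigation.
```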
Key Takeaways
- Verify redundancy state first. Both units must agree on active/standby roles. Split-brain is a critical failure and requires immediate RMI repair.
- RMI port is the lifeline. If RMI is down, HA is broken. Check physical connectivity, VLAN, MTU, and bandwidth (1 Gbps recommended).
- Configuration sync must be up-to-date. Lagging sync causes the standby to boot with stale config. Increase checkpoint timeout if running-config is large.
- Operational state (APs, clients, licenses) must also sync. Compare AP and client counts between active and standby. Mismatches indicate incomplete sync.
- IOS version and hardware must match. SSO requires identical chassis, line cards, and IOS versions. Mismatches force fallback to N+1 or disable HA.
- CAPWAP timers must be generous for SSO. APs should not timeout during subsecond failover. Set CAPWAP echo timeout to 30 seconds or higher.
- Follow the systematic workflow. Confirm state → check RMI → verify sync → test operational state → validate hardware/IOS → review logs. This sequence isolates 90% of HA issues.
- Document baseline values. Weekly health checks of redundancy status, sync times, and RMI counters help predict failures before they occur.