Understanding C9800 High Availability Troubleshooting
The Cisco Catalyst 9800 wireless controller supports two distinct high availability architectures: Stateful Switchover (SSO) for subsecond failover with full configuration synchronization, and N+1 Redundancy for distributed deployments where the standby unit takes over as a separate entity. When one of these systems fails, your wireless network—and the APs, client devices, and policies it carries—faces immediate disruption. Troubleshooting HA issues requires systematic understanding of how SSO manages state, how redundancy ports maintain health, and how sync mechanisms preserve your configuration and client sessions across chassis boundaries.
This article walks you through the core HA troubleshooting methodology for the C9800, covering SSO failover diagnostics, redundancy port issues, configuration synchronization failures, and practical CLI commands that reveal exactly where your HA pair has broken down. Whether you're diagnosing a failed failover, state mismatch between active and standby units, or split-brain scenarios, the techniques and output examples here will help you isolate and resolve the issue faster.
SSO vs. N+1 Redundancy: Quick Reference
Before diving into troubleshooting, clarify which HA model you're running. Each has different failure modes, recovery mechanisms, and diagnostic expectations.
| Aspect | SSO (Stateful Switchover) | N+1 Redundancy |
|---|---|---|
| Failover Time | Sub-second; clients remain connected | 45–60 seconds; clients reconnect to the backup |
| Configuration Sync | Full sync in real-time; all config on standby | No config sync; standby is independent |
| Operational State Sync | Full client and AP state replicated | Minimal state sync; APs re-join from scratch |
| Use Case | Campus/enterprise requiring near-zero downtime | Distributed/branch where recovery time is flexible |
| Redundancy Port (RMI) | Dedicated RP/RMI link; continuous heartbeat | Not used; controllers operate independently |
| Hardware Match Requirement | Identical chassis, line cards, and licenses | Can be different hardware SKUs |
SSO Failover Issues: Diagnosis and Resolution
When SSO failover fails or takes longer than expected, the root cause typically lies in redundancy port connectivity, configuration synchronization lag, or state mismatch. Start by checking if failover occurs at all, then work backward to find the blockage.
Step 1: Verify Chassis Redundancy State
Use show redundancy to reveal the current HA state of both the active and standby units.
C9800-1# show redundancy
Redundancy information for chassis 1:
My Slot Number = 1
Redundancy Mode = Stateful Switchover
Redundancy State = active
Standby provided by Slot Number = 2
Redundancy State Transition reason = Manual config
Redundancy information for chassis 2:
My Slot Number = 2
Redundancy Mode = Stateful Switchover
Redundancy State = standby hot
Active provided by Slot Number = 1
Redundancy State Transition reason = Manual config

What you're looking for: Both units report a matching redundancy mode (both SSO or both N+1). The active unit shows active state and identifies the standby slot. The standby unit shows standby hot state. If the standby reports standby cold or disabled, sync is not occurring.
Common mismatches: One unit reports SSO while the other reports N+1 (configuration inconsistency). The standby reports not present or not eligible (hardware or IOS mismatch). Redundancy state is initialization in progress for more than a few minutes (sync is stuck).
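These state checks are easy to script once you collect `show redundancy` from both units (via SSH or console logging). A minimal sketch in Python, assuming the output format shown above; the classification labels are illustrative, not Cisco terminology:

```python
import re

def redundancy_state(show_redundancy_output: str) -> str:
    """Extract the 'Redundancy State' value from `show redundancy` output."""
    m = re.search(r"Redundancy State\s*=\s*(.+)", show_redundancy_output)
    if not m:
        raise ValueError("no 'Redundancy State' line found")
    return m.group(1).strip().lower()

def classify_pair(unit1_out: str, unit2_out: str) -> str:
    """Classify an HA pair from both units' `show redundancy` output."""
    states = {redundancy_state(unit1_out), redundancy_state(unit2_out)}
    if states == {"active"}:
        return "split-brain"          # both units claim the active role
    if "active" in states and "standby hot" in states:
        return "healthy"
    if "standby cold" in states or "disabled" in states:
        return "sync-not-running"
    return "degraded"
```

Run it against captures from both chassis; anything other than "healthy" tells you which section of this article to jump to.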
Step 2: Check Redundancy Port and RMI Connectivity
The Redundancy Management Interface (RMI) is a dedicated port used for heartbeat and state replication. If this port is down, the standby chassis cannot sync. Check both the physical port status and the RMI-specific state.
C9800-1# show interfaces | include RMI
GigabitEthernet0/0/0 is up, line protocol is up
Hardware is CSR GigabitEthernet
Description: RMI Port to Standby
MTU 1500 bytes, BW 1000000 Kbit/sec
Encapsulation ARPA, loopback not set
Last input 00:00:01, output 00:00:01
C9800-1# show redundancy interconnect
RMI Status = Connected
RMI link = GigabitEthernet0/0/0
RMI MTU = 1500
RMI Heartbeat interval = 1000 ms
RMI Replication enabled = true
RMI bandwidth = 1000 Mbps

What you're looking for: RMI Status = Connected. The RMI link interface is up. Heartbeat interval is 1000 ms or less (subsecond SSO response requires a fast heartbeat). RMI replication is true.
If RMI Status is Disconnected or the RMI link is down, the standby chassis cannot receive state updates and failover will fail. Check physical cable, verify VLAN configuration on the RMI port, and confirm MTU is consistent (1500 bytes). If the RMI port is shutdown due to spanning tree or port security, disable those features on the RMI link.
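The heartbeat interval bounds how quickly a dead peer is detected: roughly the interval multiplied by the number of consecutive misses the platform tolerates before declaring the peer down. A small illustration (the 3-miss threshold here is an assumption for the arithmetic, not a documented C9800 constant):

```python
def detection_time_ms(heartbeat_interval_ms: int, missed_beats: int) -> int:
    """Worst-case time before a peer is declared down: consecutive misses
    times the heartbeat interval. The miss threshold is platform-dependent;
    3 is assumed here purely for illustration."""
    return heartbeat_interval_ms * missed_beats

# With the 1000 ms interval shown above and an assumed 3-miss threshold,
# the pair would tolerate up to 3 seconds of RMI silence before reacting.
```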
Step 3: Validate Configuration Synchronization
SSO requires complete configuration sync from active to standby. A mismatch here causes the standby to reject takeover or to boot with stale config. Check sync status and the last sync timestamp.
C9800-1# show redundancy status
Redundancy Status:
Redundancy State = active
Hardware: Catalyst 9800-80-3x40GE, Processor = Intel(R) Xeon(R)
Uptime = 45 days, 3 hours, 22 minutes
Standby Information:
Standby State = hot standby
Standby Uptime = 45 days, 2 hours, 55 minutes
Configuration sync = up-to-date
Operational sync = in-sync
Last config sync = 00:00:03 (3 seconds ago)
Checkpoint sync = enabled
RMI = connected

What you're looking for: Configuration sync = up-to-date. The last config sync timestamp should be recent (within seconds if SSO is stable). Checkpoint sync = enabled indicates that configuration checkpoints are being replicated. Operational sync = in-sync means AP state, client state, and license state are synchronized.
If sync shows out-of-sync or initializing, the standby is not receiving updates. Check RMI bandwidth (it should be at least 100 Mbps for timely sync). If the config is very large (several megabytes of running-config), sync may lag; trim unused configuration or increase the sync timeout. If checkpoint sync is disabled, enable it in the redundancy mode configuration.
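You can sanity-check whether a sync lag is plausible from the link math alone: transfer time is roughly config size over usable bandwidth. A rough estimator, where the 50% efficiency factor is an assumption covering protocol and checkpoint-processing overhead:

```python
def sync_seconds(config_bytes: int, link_mbps: float, efficiency: float = 0.5) -> float:
    """Rough lower bound on config-sync transfer time.
    `efficiency` discounts protocol and checkpoint overhead
    (0.5 is an assumption, not a measured value)."""
    usable_bits_per_sec = link_mbps * 1_000_000 * efficiency
    return (config_bytes * 8) / usable_bits_per_sec

# A 4 MB config over a 100 Mbps RMI at 50% efficiency:
# sync_seconds(4_000_000, 100) -> 0.64 seconds
```

If the observed sync time is orders of magnitude above this estimate, the bottleneck is not raw bandwidth; look at RMI errors or a stuck sync process instead.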
Step 4: Monitor CAPWAP Heartbeat and AP Connectivity During Failover
SSO is designed to failover so quickly that APs do not detect the transition. However, if CAPWAP heartbeat timers are misconfigured, APs may timeout and disconnect during a brief active-to-standby switch. Verify CAPWAP timer settings on the active unit.
C9800-1# show wireless global config | include capwap
CAPWAP Echo Request timeout = 30 seconds
CAPWAP Echo Response timeout = 60 seconds
CAPWAP DTL Timeout = 30 seconds
CAPWAP Backoff Timer = 30 seconds
Maximum CAPWAP Retransmissions = 5

For SSO to work without AP disconnection, keep CAPWAP heartbeat timeouts generous (30 seconds or higher) so the active unit does not drop APs over a brief heartbeat loss. If APs reconnect every time an SSO failover occurs, increase the echo timeout or retransmission threshold so APs tolerate the brief interruption instead of declaring the controller dead.
Redundancy Port Issues and RMI Failures
The redundancy port is the single point of contact between active and standby. If this port fails, the pair becomes split-brain: both units believe they are active, or one unit is isolated and cannot sync.
Identifying Split-Brain Scenarios
A split-brain occurs when the RMI link is down and both units assume active role. This is catastrophic because configurations diverge, and when the link comes back, one unit must forcibly be demoted to standby (risking config loss).
C9800-1# show redundancy
Redundancy Mode = Stateful Switchover
Redundancy State = active
Standby provided by Slot Number = 2
C9800-2# show redundancy
Redundancy Mode = Stateful Switchover
Redundancy State = active
Standby provided by Slot Number = 1

Both units report active state. This is a split-brain. Immediately troubleshoot the RMI link. Run ping from one unit to the other over the RMI port to confirm connectivity. Check the physical cable, verify the port is not shutdown, and confirm VLAN tagging (if used) matches on both sides.
RMI Port Health Check
Use hardware diagnostics and interface counters to confirm the RMI port is truly healthy.
C9800-1# show interfaces GigabitEthernet0/0/0 counters
GigabitEthernet0/0/0
RX Packets = 4521840, TX Packets = 4521810
RX Bytes = 1829504821, TX Bytes = 1829501201
RX Errors = 0, TX Errors = 0
RX Discards = 0, TX Discards = 0
CRC/FCS Errors = 0
Giants = 0, Runts = 0
C9800-1# show interfaces GigabitEthernet0/0/0 | include errors
RX packets with errors = 0
TX packets with errors = 0

What you're looking for: Zero (or nearly zero) RX and TX errors. No CRC/FCS errors. No discards. High packet counts indicate active replication. If you see rising error counters, the physical link has issues (bad cable, transceiver, or port). Replace the cable or move the RMI to a different port and reconfigure.
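If you capture this counter output periodically, a small parser can flag a degrading link before failover is affected. A sketch, with field names matching the sample output above:

```python
import re

def link_error_ratio(counter_output: str) -> float:
    """Ratio of errored frames to total frames, parsed from the counter
    output shown above (field names follow that sample)."""
    def grab(name: str) -> int:
        m = re.search(rf"{name}\s*=\s*(\d+)", counter_output)
        return int(m.group(1)) if m else 0
    total = grab("RX Packets") + grab("TX Packets")
    errors = grab("RX Errors") + grab("TX Errors") + grab("CRC/FCS Errors")
    return errors / total if total else 0.0
```

Anything persistently above zero on a point-to-point RMI link is worth a cable or transceiver swap.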
RMI Bandwidth and Latency Constraints
RMI replication speed depends on link bandwidth. If the RMI port is running at 100 Mbps (legacy configs), sync may lag. Verify actual port speed and consider upgrading to 1 Gbps or higher.
C9800-1# show interfaces GigabitEthernet0/0/0 | include speed
Encapsulation ARPA
MTU 1500 bytes, BW 1000000 Kbit/sec
Keepalive set (10 sec)

The BW value shows 1000000 Kbit/sec = 1 Gbps, which is acceptable. If it shows 100000 Kbit/sec, upgrade the port or use an aggregated link (EtherChannel). Also verify latency: the RMI port should have <5 ms latency. If latency exceeds 50 ms, sync delays will cascade and failover recovery may take longer than expected.
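The IOS `BW` field is reported in Kbit/sec, which makes quick comparisons error-prone when scripting these checks; a one-line conversion avoids the mental arithmetic:

```python
def bw_kbit_to_gbps(bw_kbit: int) -> float:
    """Convert the interface 'BW' field (Kbit/sec) to Gbps."""
    return bw_kbit / 1_000_000

# bw_kbit_to_gbps(1000000) -> 1.0  (the 1 Gbps link shown above)
# bw_kbit_to_gbps(100000)  -> 0.1  (a legacy 100 Mbps link)
```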
Configuration Synchronization Failures
When the standby unit is out of sync with the active, it cannot take over cleanly. Troubleshoot sync failures by checking the running-config size, checkpoint status, and sync initiation logs.
Validating Checkpoint Sync
Checkpoints are configuration snapshots sent to the standby unit. If checkpoint sync fails, the standby does not have a consistent config baseline.
C9800-1# show redundancy ckpt
Checkpoint replication Status:
Total Checkpoints = 1247
Last Checkpoint Replicated = 1247
Checkpoint Age = 0 seconds
Checkpoint Consistency = matched
Missing Checkpoints on Standby = 0
Checkpoint Timeout = 10000 ms

What you're looking for: Missing Checkpoints on Standby = 0 (all checkpoints replicated). Checkpoint Consistency = matched (active and standby agree on the last checkpoint). Checkpoint Age is fresh (seconds, not minutes).
If checkpoints are missing, the standby will not have the full config. Check RMI bandwidth and reload the standby with redundancy reload peer (from the active) to force a full resync. If the checkpoint age is very old, the sync process may be stuck; review system logs with show log for errors.
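Checkpoint health reduces to one number: how many checkpoints the standby is behind. A parser sketch, assuming the field names from the sample output above:

```python
import re

def checkpoint_gap(ckpt_output: str) -> int:
    """Checkpoints the standby is behind, parsed from `show redundancy ckpt`
    style output (field names follow the sample above)."""
    total = int(re.search(r"Total Checkpoints\s*=\s*(\d+)", ckpt_output).group(1))
    last = int(re.search(r"Last Checkpoint Replicated\s*=\s*(\d+)", ckpt_output).group(1))
    return total - last

# A gap of 0 is healthy; a persistent or growing gap means replication is stuck.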
Running-Config Size and Sync Timeout
Large configurations take time to sync. If your running-config runs to several megabytes, increase the checkpoint timeout to allow more time for replication.
C9800-1# show running-config | wc
500
50000
8234567
C9800-1# show redundancy state peer
Standby Statistics:
Config Replication Time = 8234 ms
Last Replication Status = OK

Config replication time of 8.2 seconds is acceptable for a large config. If replication time exceeds 30 seconds, reduce the config size or upgrade the RMI link bandwidth. If replication status shows FAILED, increase the timeout value in the redundancy configuration stanza.
State Mismatch and Operational Sync Issues
Even if configuration sync is working, operational state (AP-to-tag mappings, client sessions, license entitlements) may drift between active and standby. This causes APs to disconnect or lose their tag assignment during failover.
Checking AP-to-Tag Mapping Sync
When SSO fails over, the standby unit must have the same AP-to-tag mappings as the active. Verify this by comparing AP summary output from both units before and after failover.
C9800-1# show ap summary
Number of APs: 128
AP Hostname Model Location Group Tag State
ap-1 9130AXE Building-1-Floor-2 1 tagA Connected
ap-2 9130AXE Building-1-Floor-3 1 tagB Connected
ap-3 9115AX Building-2-Floor-1 2 tagC Connected
...

C9800-2# show ap summary
Number of APs: 128
AP Hostname Model Location Group Tag State
ap-1 9130AXE Building-1-Floor-2 1 tagA Connected
ap-2 9130AXE Building-1-Floor-3 1 tagB Connected
ap-3 9115AX Building-2-Floor-1 2 tagC Connected
...

Both units should report identical AP counts and tag assignments. If the standby has fewer APs or incorrect tag assignments, the sync is lagging. Check the RMI and restart sync with redundancy resync standby.
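Eyeballing 128 APs across two outputs is error-prone; diffing the AP-to-tag mappings programmatically is more reliable. A sketch assuming the column layout of the sample above (the `ap-` hostname prefix filter is an assumption about your naming scheme):

```python
def parse_ap_tags(summary: str) -> dict:
    """Map AP hostname -> tag from `show ap summary` style rows.
    Data rows in the sample layout: hostname model location group tag state."""
    mappings = {}
    for line in summary.splitlines():
        cols = line.split()
        # The "ap-" prefix filter skips headers; adjust it to your AP names.
        if len(cols) >= 6 and cols[0].startswith("ap-"):
            mappings[cols[0]] = cols[4]   # Tag column in the sample layout
    return mappings

def tag_mismatches(active: str, standby: str) -> dict:
    """APs whose tag differs (or is missing) on the standby."""
    a, s = parse_ap_tags(active), parse_ap_tags(standby)
    return {ap: (tag, s.get(ap)) for ap, tag in a.items() if s.get(ap) != tag}
```

An empty result means the standby's tag state matches; any entry names the AP and the differing active/standby tags.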
Client Session State Preservation
SSO preserves client session state so wireless clients do not experience disconnect during failover. To verify this is working, check the client table on both units.
C9800-1# show wireless client summary
Total Clients: 1456
ClientMac Hostname SSID State WLAN AP Channel
aa:bb:cc:dd:ee:01 host-1 corporate Associated 1 ap-1 36
aa:bb:cc:dd:ee:02 host-2 corporate Associated 1 ap-2 48
...

Compare the active and standby client counts. They should match (or be within a few clients of each other). If the standby has significantly fewer clients, client state is not syncing. Increase RMI bandwidth or reduce AP density to lower state replication load.
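Because clients roam and churn constantly, compare counts with a tolerance rather than demanding exact equality. A sketch (the tolerance of 10 is an assumption; tune it to your client churn):

```python
import re

def client_count(summary: str) -> int:
    """Pull 'Total Clients' from `show wireless client summary` output."""
    return int(re.search(r"Total Clients:\s*(\d+)", summary).group(1))

def client_sync_ok(active_out: str, standby_out: str, tolerance: int = 10) -> bool:
    """True when the standby's client table is within `tolerance` entries
    of the active's; a large gap means client state is not replicating."""
    return abs(client_count(active_out) - client_count(standby_out)) <= tolerance
```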
License and DNA Smart License Sync
AP MAC address entitlements and license counts must sync to the standby. If licenses are not synced, the standby will reject APs upon takeover.
C9800-1# show license summary
Smart License Status: Active
Total Licenses Consumed: 128
AP Licenses Available: 256
AP MAC Entitlements: 128
RMI Sync Status: In-Sync

The standby should report identical license counts and entitlements. If the standby shows RMI Sync Status: Out-of-Sync, restart the license sync daemon or restart the redundancy process.
RMI Port Troubleshooting Checklist
The RMI port is critical to all HA operations. Use this table to quickly narrow down RMI issues.
| Symptom | Likely Cause | Troubleshooting Steps |
|---|---|---|
| RMI Status = Disconnected | Physical cable down, port shutdown, VLAN mismatch | Check cable, verify port is not shutdown, confirm VLAN on both sides, test with ping over RMI |
| RMI Heartbeat = No Response | Standby unit is offline or unresponsive | Power-cycle standby unit, check console for boot errors, verify both units are running same IOS version |
| Configuration Sync = Stuck at 50% | RMI bandwidth too low, running-config too large | Upgrade RMI link to 1 Gbps, reduce running-config size, increase checkpoint timeout |
| Redundancy State = Initialization in Progress (>10 min) | Sync timeout or RMI error recovery | Check RMI port errors, review system logs, manually reload the standby with redundancy reload peer |
| Both units report Active state | Split-brain: RMI was down, both units promoted | Fix RMI immediately, shut down standby, verify RMI, then power-cycle standby to resync from scratch |
Systematic Troubleshooting: A Practical Workflow
When HA is broken, follow this sequence to isolate the problem:
Phase 1: Confirm Current State (2 minutes)
C9800-1# show redundancy
C9800-2# show redundancy

Both units should agree on who is active and standby. If they disagree, you have a split-brain; fix the RMI immediately.
Phase 2: Check RMI Health (3 minutes)
C9800-1# show redundancy interconnect
C9800-1# show interfaces description | include RMI
C9800-1# ping <standby-rmi-ip>

RMI must be connected. If disconnected, physically inspect the cable and port, then verify the VLAN.
Phase 3: Verify Sync Status (3 minutes)
C9800-1# show redundancy status
C9800-1# show redundancy ckpt

Configuration and checkpoint sync should be up-to-date and consistent. If lagging, check RMI bandwidth and running-config size.
Phase 4: Test Operational State (2 minutes)
C9800-1# show ap summary
C9800-1# show wireless client summary
C9800-2# show ap summary
C9800-2# show wireless client summary

AP and client counts should match between active and standby. If the standby has fewer APs or clients, sync is incomplete.
Phase 5: Validate IOS and Hardware Match (1 minute)
C9800-1# show version
C9800-2# show version

Both units must run the same IOS version (or compatible patch releases). Hardware models and line card SKUs must match for SSO. If they differ, SSO will fail; use N+1 redundancy instead.
Phase 6: Review System Logs for Error Context (2 minutes)
C9800-1# show log | include REDUNDANCY
C9800-1# show log | include RMI
C9800-1# show log | include CKPT

Logs often reveal the exact moment sync broke, the RMI failed, or checkpoint replication timed out. Look for timestamps and error codes, then cross-reference with Cisco TAC documentation.
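When a captured log is long, filtering it down to HA-related lines while preserving their order keeps the failure timeline readable. A minimal filter over a saved `show log` capture:

```python
import re

# Keywords from the filters above; extend the pattern for your platform's
# message mnemonics as needed.
HA_PATTERN = re.compile(r"REDUNDANCY|RMI|CKPT")

def ha_log_lines(log_text: str) -> list:
    """Return HA-related log lines in their original order, so the
    sequence of events (RMI down, then sync failure, ...) is preserved."""
    return [ln for ln in log_text.splitlines() if HA_PATTERN.search(ln)]
```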
Recovery Procedures for Common Failures
Forcing a Controlled Switchover (Planned Maintenance)
If you need to take the active unit down for maintenance, force a graceful switchover to the standby first.
C9800-1# redundancy force-switchover
System is going down for Software switchover
Warning: The system is shutting down
C9800-2# show redundancy
Redundancy Mode = Stateful Switchover
Redundancy State = active
Standby provided by Slot Number = 1

The active unit notifies the standby to take over, and all config and state are already in sync. The switchover completes in seconds with zero client impact.
Recovering from Sync Timeout
If sync is stuck, restart the redundancy process on the standby.
C9800-1# redundancy resync standby
Restarting Standby synchronization
[Wait 5-10 minutes for full sync to complete]
C9800-1# show redundancy status | include Configuration
Configuration sync = up-to-date

Recovering from Split-Brain
If both units report active (split-brain detected), you must cleanly shut down one unit and restart it to break the tie.
- Identify which unit should remain active (typically the one carrying live traffic).
- Shut down the other unit with shutdown (in config mode) or power-cycle it.
- Allow the active unit to stabilize (give it 2-3 minutes).
- Power on or restart the standby unit.
- Monitor the standby boot process to ensure it syncs cleanly from the active.
- Verify show redundancy on both units reports the correct roles.
Monitoring and Proactive Health Checks
To prevent HA failures, run these commands weekly and establish baseline values.
C9800-1# show redundancy status
C9800-1# show redundancy interconnect
C9800-1# show redundancy ckpt
C9800-1# show interfaces description | include RMI

Log these outputs and compare them week-to-week. Gradual drift in checkpoint age, sync time, or RMI throughput may indicate an impending failure. Also monitor the RMI port error counters; rising CRC or discard rates warn of cable degradation.
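If you store these weekly samples, a simple drift check can turn the baseline into an alert. A sketch (the 20% threshold is an assumption; tune it per metric):

```python
def drifting(samples: list, threshold_pct: float = 20.0) -> bool:
    """Flag a weekly metric series (e.g. checkpoint age in seconds, or
    config-sync time in ms) whose latest sample exceeds the mean of the
    earlier baseline samples by more than `threshold_pct` percent."""
    baseline, latest = samples[:-1], samples[-1]
    mean = sum(baseline) / len(baseline)
    return latest > mean * (1 + threshold_pct / 100)

# Example: three weeks of config-sync times around 100 ms, then 130 ms
# in week four -> 30% over baseline, flagged for investigation.
```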
Key Takeaways
- Verify redundancy state first. Both units must agree on active/standby roles. Split-brain is a critical failure and requires immediate RMI repair.
- RMI port is the lifeline. If RMI is down, HA is broken. Check physical connectivity, VLAN, MTU, and bandwidth (1 Gbps recommended).
- Configuration sync must be up-to-date. Lagging sync causes the standby to boot with stale config. Increase checkpoint timeout if running-config is large.
- Operational state (APs, clients, licenses) must also sync. Compare AP and client counts between active and standby. Mismatches indicate incomplete sync.
- IOS version and hardware must match. SSO requires identical chassis, line cards, and IOS versions. Mismatches force fallback to N+1 or disable HA.
- CAPWAP timers must be generous for SSO. APs should not timeout during subsecond failover. Set CAPWAP echo timeout to 30 seconds or higher.
- Follow the systematic workflow. Confirm state → check RMI → verify sync → test operational state → validate hardware/IOS → review logs. This sequence isolates 90% of HA issues.
- Document baseline values. Weekly health checks of redundancy status, sync times, and RMI counters help predict failures before they occur.