C9800 N+1 Redundancy Configuration and Best Practices

When designing a wireless network, you face a critical choice about how much downtime your business can tolerate. Do you need seamless, hitless failover that never drops a single client, or is a brief reconnection window acceptable? That distinction shapes everything about your Catalyst 9800 redundancy strategy. In this article, you'll learn how to implement N+1 redundancy, a cost-effective middle ground that keeps your network running when a controller fails. It accepts a brief client disruption in exchange for significant savings.

N+1 Redundancy vs. Stateful Switchover (SSO)

Cisco offers two primary high-availability architectures for the C9800. Understanding the tradeoffs between them helps you choose the right design for your environment.

Feature	N+1 Redundancy	SSO (Active-Standby)
Architecture	Primary + shared backup controller	Active-standby pair with continuous sync
Client Impact on Failure	Brief disruption (20-30 sec); APs rejoin	Seamless; zero client disruption
Failover Time	20-30 seconds (AP heartbeat timeout + rejoin)	Subsecond (no data path interruption)
AP Redundancy Setup	Primary, secondary, tertiary controller IPs	Single management IP shared by the active-standby pair
Network Capacity	APs share primary until failover	Active half of pair always ready
Cost Efficiency	High; backup is idle until needed	Low; requires two fully licensed controllers
Complexity	Lower; simple configuration, manual fallback	Higher; state sync, timing requirements
Scalability	Limited by backup capacity; assume 40-60% peak load	Full scale from day one
Best For	Branch offices, environments that can tolerate brief outages	Enterprise campuses, mission-critical WLANs

The key insight: N+1 is a "shared spare" model where the backup controller sits idle under normal conditions, handling traffic only when the primary fails. SSO pairs an active controller with a hot standby and keeps their state synchronized at all times, so the standby can take over without APs or clients rejoining. N+1 saves money; SSO saves downtime.

Understanding N+1 Architecture and AP Failover Behavior

In an N+1 redundancy setup, you designate one controller as primary (handling all APs) and one or more as secondary and tertiary (backups). Every AP is configured with a list of controllers in order of preference. When an AP powers on or loses connection to its primary controller, it attempts to join the secondary, then the tertiary if needed.

Here's how AP discovery and failover work in practice:

Initial Join: An AP sends CAPWAP discovery requests to the primary controller. If it receives a discovery response within the timeout window, it joins the primary.
Heartbeat Monitoring: Once joined, the AP sends keep-alive (heartbeat) packets to the primary at regular intervals (default: every 30 seconds).
Failure Detection: If the AP misses a configured number of consecutive heartbeats (default: 3 missed), it declares the primary dead and begins discovery of the secondary controller.
Secondary Fallback: The AP sends CAPWAP discovery requests to the secondary controller. If the secondary responds, the AP re-authenticates and rejoins, and its clients reconnect through the secondary.
Fallback to Primary: If the primary recovers and comes back online, the AP does not automatically fall back to the primary by default (to avoid constant flapping). Instead, it remains on the secondary until the administrator enables and configures fallback behavior.

This behavior is crucial to understand: in this design, fallback is not automatic unless you enable it. If you want APs to return to the primary after it recovers, you must explicitly configure fallback with appropriate timers.
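The ordered-preference behavior described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not how an AP is implemented: the `select_controller` function and the `is_reachable` callback are hypothetical names, and the callback stands in for a real CAPWAP discovery exchange. The addresses are the ones used in this article's examples.

```python
from typing import Callable, Optional, Sequence


def select_controller(
    preference: Sequence[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first controller, in preference order, that answers discovery.

    `is_reachable` is a placeholder for a CAPWAP discovery request/response;
    a real AP performs that exchange internally.
    """
    for controller_ip in preference:
        if is_reachable(controller_ip):
            return controller_ip
    return None  # no configured controller answered


# Primary, secondary, tertiary (addresses from this article's examples).
PREFERENCE = ["10.50.1.10", "10.50.1.20", "10.50.1.30"]

if __name__ == "__main__":
    failed = {"10.50.1.10"}  # pretend the primary is down
    print(select_controller(PREFERENCE, lambda ip: ip not in failed))
```

With the primary marked as failed, the sketch returns the secondary; if nothing answers, it returns `None`, which corresponds to an AP that keeps retrying discovery.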

AP High Availability Configuration via CLI

Let's walk through configuring an AP join profile with primary, secondary, and tertiary controller assignments. This is the foundation of N+1 redundancy.

Device# configure terminal
Device(config)# ap join-profile MyProfile
Device(config-ap-join)# capwap discovery timeout 20
Device(config-ap-join)# capwap discovery primary ip 10.50.1.10
Device(config-ap-join)# capwap discovery secondary ip 10.50.1.20
Device(config-ap-join)# capwap discovery tertiary ip 10.50.1.30
Device(config-ap-join)# capwap discovery ap-ip-addr-request timeout 5

In this example, APs joining the network use the profile "MyProfile" and will attempt to join controllers in this order: 10.50.1.10 (primary), 10.50.1.20 (secondary), 10.50.1.30 (tertiary). The discovery timeout is set to 20 seconds, giving the AP time to wait for a response before moving to the next controller.

Now, configure heartbeat and fallback behavior:

Device(config-ap-join)# capwap client keep-alive retries 3
Device(config-ap-join)# capwap client keep-alive timeout 30
Device(config-ap-join)# capwap client fallback enable
Device(config-ap-join)# capwap client fallback timeout 300
Device(config-ap-join)# exit
Device(config)# exit
Device# copy running-config startup-config

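These settings tell the system to send a keep-alive every 30 seconds and to begin failover after 3 consecutive keep-alives are missed (3 × 30 = 90 seconds of no response). Detection alone can therefore take up to 90 seconds with these values, so a 20-30 second failover target requires shorter keep-alive settings. If fallback is enabled and the primary comes back online, the system waits 300 seconds before allowing APs to migrate back to the primary.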
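The timer arithmetic is easy to check before you commit to values. The sketch below is a rough model under stated assumptions: detection time is taken as interval times retries, and `rejoin_s` is an illustrative guess for discovery and rejoin, not a measured or documented figure. The function names are hypothetical.

```python
def detection_seconds(keepalive_interval_s: int, retries: int) -> int:
    """Approximate time for an AP to declare its controller dead."""
    return keepalive_interval_s * retries


def failover_seconds(keepalive_interval_s: int, retries: int, rejoin_s: int) -> int:
    """Detection time plus an assumed discovery-and-rejoin time."""
    return detection_seconds(keepalive_interval_s, retries) + rejoin_s


if __name__ == "__main__":
    # Values from the example above: 30-second keep-alive, 3 retries.
    print(detection_seconds(30, 3))
    # rejoin_s=20 is an assumption for illustration only.
    print(failover_seconds(30, 3, rejoin_s=20))
```

Plugging in candidate values this way makes the tradeoff visible: shorter intervals detect failures sooner but make APs more sensitive to transient packet loss.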

AP High Availability Configuration via GUI

Many administrators prefer the web UI for profile configuration. Navigate to Configuration > Tags & Profiles > AP Join Profiles, select or create a profile, then click the CAPWAP tab.

In the CAPWAP settings section, you'll find fields for:

Primary Controller IP: The main controller for this AP group
Secondary Controller IP: First fallback if primary is unreachable
Tertiary Controller IP: Second fallback if secondary is also unreachable
Discovery Timeout: How long an AP waits for a discovery response (default 30 sec)
Keep Alive Retries: Number of missed heartbeats before failover (default 3)
Keep Alive Timeout: Interval between keep-alive messages (default 30 sec)
Fallback Enabled: Enable or disable automatic return to primary (default disabled)
Fallback Timeout: How long to wait after primary recovers before returning APs (default 300 sec)

Click Update & Apply to Device to push the profile to your controller. The profile is then available for assignment to AP tags.

N+1 Hitless Rolling AP Upgrade

One of the powerful applications of N+1 redundancy is performing firmware upgrades on access points with zero downtime. The system automatically moves APs to the backup controller during the upgrade window, then returns them to the primary once the primary is back online.

Prerequisites:

N+1 redundancy configured with primary and secondary controllers
APs must be able to reach both controllers
Both controllers must be running identical or compatible software versions
Enough capacity on the backup controller to temporarily host all APs
AP images pre-downloaded to all controllers involved

High-Level Workflow:

Prepare AP image files on all controllers (predownload via ap image predownload)
Trigger rolling AP upgrade on the primary controller
System groups APs into batches (default 15% per iteration)
For each batch, the system:
- Steers clients from that batch of APs to other APs
- Forces those APs to rejoin the secondary controller
- Upgrades the AP software
- Allows APs to rejoin the primary once upgrade is complete
Monitor progress via show ap upgrade
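To get a feel for how many iterations a rolling upgrade will need, the batch sizing can be approximated as follows. This is a minimal sketch of the arithmetic only: it assumes each iteration takes a fixed percentage of the total AP count, rounded up, and it does not model how the controller actually chooses which APs go into a batch. `plan_batches` is a hypothetical helper.

```python
import math
from typing import List


def plan_batches(ap_names: List[str], percent_per_iteration: int = 15) -> List[List[str]]:
    """Split APs into batches of roughly `percent_per_iteration` percent each."""
    if not 1 <= percent_per_iteration <= 100:
        raise ValueError("percent_per_iteration must be between 1 and 100")
    if not ap_names:
        return []
    # Round up so that a small deployment still moves at least one AP per batch.
    batch_size = max(1, math.ceil(len(ap_names) * percent_per_iteration / 100))
    return [ap_names[i:i + batch_size] for i in range(0, len(ap_names), batch_size)]


if __name__ == "__main__":
    aps = [f"AP-{n:03d}" for n in range(100)]
    batches = plan_batches(aps, percent_per_iteration=15)
    print(len(batches), [len(b) for b in batches])
```

For 100 APs at 15% per iteration, this yields seven batches: six of 15 APs and a final batch of 10.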

Configuration and Execution:

First, ensure your AP upgrade configuration is set:

Device(config)# ap upgrade-configuration
Device(config-ap-upgrade)# upgrade priority high
Device(config-ap-upgrade)# upgrade per-iteration 15
Device(config-ap-upgrade)# exit

Then trigger the rolling upgrade:

Device# ap image upgrade test

Monitor the process with:

Device# show ap upgrade
AP upgrade is in progress
From version: 16.9.1.6
To version: 16.9.1.30
Started at: 03/09/2018 21:33:37 IST
Percentage complete: 0
Expected time of completion: 03/09/2018 22:33:37 IST

Iterations
Iteration  Start time               End time                 AP count
0          03/09/2018 21:33:37 IST  03/09/2018 21:33:37 IST  0
1          03/09/2018 21:33:37 IST  ONGOING                  0

Upgraded
AP Name  Ethernet MAC  Iteration  Status
(none)

In Progress
AP Name           Ethernet MAC
APF07Z.06a5.d78c  f07C.06cf.b910

Remaining
AP Name           Ethernet MAC
APCC16.7ED8.cFA6  0081.c458.ab30
AP38ED.18CA.2ED0  38ed.18cb.25a0
AP881d.fce7.5ee4  d46d.50ee.33d0

The upgrade completes when all APs have cycled through the upgrade iteration. The system automatically returns APs to the primary controller once the primary has also been upgraded (if needed).

AP Fallback Behavior and Configuration

By default, when an AP fails over to a secondary controller, it remains there even if the primary recovers. This is a deliberate design choice to avoid constant churn and prevent flapping between controllers.

If you want APs to automatically return to the primary after a configurable wait period, enable and tune fallback:

Device(config)# ap join-profile MyProfile
Device(config-ap-join)# capwap client fallback enable
Device(config-ap-join)# capwap client fallback timeout 600
Device(config-ap-join)# exit

With these settings, if the primary comes back online, the system will wait 600 seconds (10 minutes) before allowing APs to migrate back. This timeout gives the primary time to stabilize and ensures you don't trigger another flap if the primary fails again shortly after recovery.
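The hold-down idea behind the fallback timeout can be modeled as a small state machine: the clock starts when the primary comes back, resets if it fails again, and fallback is allowed only once the primary has stayed up for the full timeout. The class below is a conceptual sketch with hypothetical names, not a description of the controller's internal implementation.

```python
from typing import Optional


class FallbackTimer:
    """Hold-down timer: allow fallback only after the primary stays up long enough."""

    def __init__(self, timeout_s: float = 600, enabled: bool = True) -> None:
        self.timeout_s = timeout_s
        self.enabled = enabled
        self._up_since: Optional[float] = None  # None while the primary is down

    def primary_up(self, now: float) -> None:
        # Record only the first "up" report; repeats must not restart the clock.
        if self._up_since is None:
            self._up_since = now

    def primary_down(self) -> None:
        # A new failure discards any accumulated uptime.
        self._up_since = None

    def may_fall_back(self, now: float) -> bool:
        if not self.enabled or self._up_since is None:
            return False
        return now - self._up_since >= self.timeout_s


if __name__ == "__main__":
    timer = FallbackTimer(timeout_s=600)
    timer.primary_up(now=0)
    print(timer.may_fall_back(now=300))
    print(timer.may_fall_back(now=600))
```

Resetting on every failure is what prevents flapping: a primary that recovers and fails again within the window never accumulates enough uptime to pull APs back.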

To disable fallback and keep APs on the secondary indefinitely (until manually moved or primary is declared permanently failed):

Device(config)# ap join-profile MyProfile
Device(config-ap-join)# capwap client fallback disable
Device(config-ap-join)# exit

Mobility Group Integration with N+1

For environments that span multiple sites or subnets, combining N+1 redundancy with mobility groups enables seamless client roaming. A mobility group is a collection of controllers that share client context and state, allowing a client to roam from one controller's AP to another controller's AP without losing session continuity.

When you configure a mobility group and N+1 redundancy together, clients can roam between APs on the primary controller, and if the primary fails, those same clients can reconnect through APs that have rejoined the secondary (which is also a member of the mobility group) once the brief failover window has passed.

Example mobility group configuration:

Device(config)# wireless mobility group name MyMobilityGroup
Device(config-wireless-mobility)# mobility group mac-address 00:1A:B2:DC:5A:FF
Device(config-wireless-mobility)# mobility mac-address 00:1A:B2:DC:5A:FF
Device(config-wireless-mobility)# wireless mobility group member mac-address 00:1A:B2:DC:5A:01 ip 10.50.1.10
Device(config-wireless-mobility)# wireless mobility group member mac-address 00:1A:B2:DC:5A:02 ip 10.50.1.20
Device(config-wireless-mobility)# exit

Both the primary (10.50.1.10) and secondary (10.50.1.20) controllers are members of MyMobilityGroup. When an AP fails over during a controller failure, the secondary is already a mobility peer of the primary, so it can recognize returning clients and may be able to serve them without a full re-authentication.

Scaling and Capacity Planning for N+1

A critical question in N+1 design: how many APs can the backup controller handle if the primary fails?

In most deployments, the backup controller should be sized to handle 40-60% of peak AP load. For example, if your primary controller manages 500 APs at peak, your backup should be capable of supporting 200-300 APs. This is a cost optimization: you don't need a second full-capacity controller sitting idle.

However, you must ensure sufficient headroom to avoid overwhelming the backup during failover. If your primary goes down during peak hours and all 500 APs attempt to rejoin a secondary sized for 200-300, the APs beyond its limit cannot join and performance degrades for the rest. To mitigate this:

Tune AP batching: Configure rolling AP upgrade to move APs in smaller batches, distributing the load over time instead of moving everything at once.
Use AP-to-controller affinity: Some deployments use load balancing scripts to steer certain AP groups to the secondary even under normal conditions, keeping both controllers warm.
Monitor backup capacity: Track how many APs the backup is servicing at any time, and be prepared to upgrade the backup if the primary fails and load exceeds capacity.
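A quick calculation shows how much of the AP population a given backup can actually absorb. The helper below is a back-of-the-envelope sketch: `failover_shortfall` is a hypothetical function, and it treats capacity as a simple AP count, ignoring client load and throughput limits.

```python
def failover_shortfall(
    primary_ap_count: int,
    backup_capacity: int,
    backup_current_aps: int = 0,
) -> int:
    """Return how many APs would be left without a controller after failover."""
    free_slots = max(0, backup_capacity - backup_current_aps)
    return max(0, primary_ap_count - free_slots)


if __name__ == "__main__":
    # The 500-AP example from this section, with backups sized at 40%, 60%, and 100%.
    for capacity in (200, 300, 500):
        print(capacity, failover_shortfall(500, capacity))
```

For the 500-AP example, a 300-AP backup leaves 200 APs unserved, which is why the sizing percentage should reflect which sites must stay online rather than cost alone.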

License Pooling: One advantage of N+1 is that you can often use a smaller, less expensive controller as the backup. However, both controllers must have valid licenses for the AP models they're expected to support. Cisco license pooling (available on some platforms) allows you to share licenses between controllers, reducing the overall license cost of a two-controller setup.

Configuration Synchronization Between Primary and Backup

Unlike SSO, which maintains real-time configuration sync, N+1 requires manual synchronization or scripted replication. When you change a policy, WLAN, or security setting on the primary, the secondary does not automatically receive that change.

Best practices for keeping controllers in sync:

Use centralized management: Deploy Cisco Catalyst Center (formerly Cisco DNA Center) to manage both controllers from one place, so that configuration changes are pushed to both.
Export and import configurations: Periodically export the running config from the primary and import it to the secondary. Automate this with scripts if possible.
Script-driven updates: Use network automation (Ansible, Python, etc.) to apply policy changes to both controllers in a controlled sequence.
Verify after changes: After updating the primary, always verify that critical settings (SSIDs, policies, mobility group membership) are replicated to the backup before relying on it for failover.

Failure to synchronize configuration is one of the most common N+1 deployment mistakes. If a failover occurs and the secondary has an outdated config, clients may encounter authentication failures, incorrect VLAN assignments, or blocked traffic policies.
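One way to catch drift before a failover exposes it is to diff exported configurations from both controllers. The sketch below uses Python's standard `difflib` and assumes you already have each configuration as text; how you export it is up to you. The ignored prefixes are assumptions about lines that legitimately differ between controllers, and the sample configuration lines are illustrative only.

```python
import difflib
from itertools import islice
from typing import List, Sequence

# Lines expected to differ per controller; adjust for your environment (assumption).
IGNORE_PREFIXES = ("hostname ", "!", "Building configuration", "Current configuration")


def normalize(config_text: str, ignore_prefixes: Sequence[str] = IGNORE_PREFIXES) -> List[str]:
    """Drop blank lines and lines that are expected to differ."""
    kept = []
    for raw in config_text.splitlines():
        line = raw.rstrip()
        if not line.strip() or line.lstrip().startswith(tuple(ignore_prefixes)):
            continue
        kept.append(line)
    return kept


def config_drift(primary_text: str, secondary_text: str) -> List[str]:
    """Return lines removed (-) from or added (+) to the secondary, relative to the primary."""
    diff = difflib.unified_diff(
        normalize(primary_text),
        normalize(secondary_text),
        fromfile="primary",
        tofile="secondary",
        lineterm="",
        n=0,
    )
    # Skip the two file-header lines, then keep only changed lines (not hunk markers).
    return [line for line in islice(diff, 2, None) if line.startswith(("+", "-"))]


if __name__ == "__main__":
    primary = "hostname wlc-1\nwlan corp 1 corp\n vlan 10\n"
    secondary = "hostname wlc-2\nwlan corp 1 corp\n vlan 20\n"
    for change in config_drift(primary, secondary):
        print(change)
```

An empty result means the two exports match after normalization. Any output is a candidate for review before you rely on the secondary for failover.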

Verification and Troubleshooting

Use these commands to verify your N+1 setup and diagnose failover issues:

Device# show ap summary
AP Name MAC Address IP Address Status Mode
AP-Floor1-1 00:1A:B2:DC:5A:01 10.40.1.50 Joined Local
AP-Floor1-2 00:1A:B2:DC:5A:02 10.40.1.51 Joined Local
AP-Floor2-1 00:1A:B2:DC:5A:03 10.40.1.52 Joined Local

This shows all APs currently joined to the primary. During normal operation, all should show "Joined" status.

Device# show wireless loadbalance ap affinity status
AP Affinity Status:
AP Name Affinity Status
AP-Floor1-1 Primary
AP-Floor1-2 Primary
AP-Floor2-1 Primary

This confirms APs are associated with the primary (affinity). If a failover occurred, these would show "Secondary" affinity.

Device# show ap join-profile name MyProfile detailed
AP Join Profile: MyProfile
CAPWAP Discovery Settings:
Primary IP: 10.50.1.10
Secondary IP: 10.50.1.20
Tertiary IP: 10.50.1.30
Discovery Timeout: 20 seconds
Keep Alive Timeout: 30 seconds
Keep Alive Retries: 3
Fallback Enabled: Yes
Fallback Timeout: 300 seconds

This shows your join profile is correctly configured with all three controllers and appropriate timers.

If APs fail to join the secondary during a failure, check:

Network connectivity: Can APs reach the secondary controller's IP address? Check routing, firewalls, and network ACLs.
AP firmware: Ensure APs support joining a secondary controller. (Most modern APs do, but legacy models may not.)
Controller capacity: Is the secondary controller at capacity? If it has reached its AP limit, new APs will be rejected.
Certificate and trust: Both controllers should have compatible certificates. CAPWAP discovery may fail if there's a certificate mismatch.

Key Takeaways

N+1 redundancy offers a practical balance between cost and availability for many organizations:

Cost-effective: The backup controller can be smaller and less expensive than the primary, and it sits idle until needed.
Simple to configure: Unlike SSO, N+1 does not require complex state sync. Configuration is straightforward and can be done entirely via CLI or GUI.
Accepts brief downtime: Failover takes 20-30 seconds during which APs disconnect and rejoin. This is acceptable for many environments but not for mission-critical ones.
Requires manual fallback: By default, APs do not automatically return to the primary after recovery. You must enable and configure fallback behavior, or manage this manually.
Demands configuration sync: Keep the primary and secondary controllers in sync by exporting/importing configs, using centralized management, or scripting configuration updates. A stale secondary will cause problems during failover.
Works well with mobility groups: Combining N+1 with mobility groups allows clients to roam seamlessly between controllers, improving the failover experience.
Suitable for most deployments: Branch offices, campus networks with acceptable failover windows, and cost-conscious organizations should evaluate N+1 as their primary redundancy strategy.

By following the configuration examples and best practices in this guide, you'll have a resilient wireless infrastructure that keeps your network running even when a controller fails, with minimal impact to your users and at a reasonable cost.
