The Scale Problem
A lab topology with one switch and one ISE node reveals authentication logic. A production enterprise with 300 access switches, 12,000 endpoints, and four data centers reveals scalability limits. The design decisions that work at small scale — a single ISE PSN, manual interface configs, no re-authentication timers — break down or create operational debt at enterprise scale.
This article addresses the scale dimension across three layers:
- ISE infrastructure scale — how to size, deploy, and operate multiple ISE nodes
- Switch infrastructure scale — stack design, template-based config, and port-level HA
- Operational scale — monitoring, automation, and change management across hundreds of switches
The configurations in this article use the same lab references throughout this series (Catalyst 9300, IOS XE 17.9.x, ISE 3.2), but the design patterns apply to any deployment that has outgrown a single-node setup.
ISE Node Architecture at Scale
Cisco ISE 3.2 supports several node persona types in a distributed deployment. Understanding which persona handles which function determines how you scale.
Node Personas
Primary Administration Node (PAN): All configuration changes flow through the PAN. It replicates configuration to all other nodes. In large deployments, the PAN is a dedicated appliance — it does not process RADIUS authentication. Running authentication through the PAN creates a bottleneck at exactly the node that also handles all policy changes.
Secondary Administration Node (secondary PAN): Mirrors the primary PAN's configuration. If the primary PAN fails, promote the secondary manually, or configure automatic promotion (PAN auto-failover, which requires designating a health-check node). Until promotion completes, PSNs continue authenticating endpoints against their replicated policy, but no configuration changes are accepted.
Policy Service Nodes (PSNs): Handle all live RADIUS authentication. This is where 802.1X authentication, MAB, CoA, and guest flows are processed. PSNs are the scaling dimension — add PSNs to increase RADIUS throughput.
Monitoring and Troubleshooting Node (MnT): Receives and stores RADIUS accounting data, generates reports, and provides the live log data visible in Operations > RADIUS > Live Logs. Separate MnT nodes from PSNs in large deployments — a PSN that also runs MnT workloads experiences degraded RADIUS performance during heavy reporting queries.
pxGrid Node: Handles inter-platform context sharing (TrustSec, ISE APIs, third-party integrations). In large deployments, dedicate pxGrid to its own node.
Reference Architecture for 5,000-25,000 Endpoints
| Node Role | Count | Notes |
|---|---|---|
| Primary PAN | 1 | Admin + configuration |
| Secondary PAN | 1 | Standby administration; promoted if the primary PAN fails |
| PSN (Data Center 1) | 2 | RADIUS authentication for DC1 and campus buildings 1-5 |
| PSN (Data Center 2) | 2 | RADIUS authentication for campus buildings 6-10 and remote sites |
| Primary MnT | 1 | Accounting, reporting, live logs |
| Secondary MnT | 1 | MnT failover |
This deployment uses 8 ISE nodes total. All nodes must run ISE 3.2 with matching patch levels — mixed versions within a deployment cause replication failures. Navigate to Administration > System > Deployment to verify all nodes are running the same version.
PSN Sizing
ISE 3.2 PSN throughput guidelines (approximate, varies by EAP method and hardware):
| Authentication Method | Auths per PSN per Second |
|---|---|
| MAB | 1,000+ |
| PEAP-MSCHAPv2 | 200-400 |
| EAP-TLS (without OCSP) | 100-200 |
| EAP-TLS (with OCSP) | 50-100 |
For a campus with 10,000 endpoints and a peak morning authentication surge (assume 30% of endpoints authenticate in a 15-minute window): 3,000 authentications in 900 seconds ≈ 3.3 auths/second. Two PSNs running EAP-TLS with OCSP can handle this comfortably. Factor in a 2x headroom buffer — size for 2x peak, not average.
Re-authentication timers affect steady-state load. If authentication timer reauthenticate 3600 is set globally, 10,000 endpoints re-authenticate approximately every hour — averaging 2.8 auths/second continuously. In a deployment with mixed EAP methods, calculate the weighted average per-method cost.
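The arithmetic above condenses into a quick sizing model. This is a sketch using the worked example's numbers; the surge fraction, window, and 2x headroom rule come from the text, and everything else is illustrative:

```python
# Quick PSN sizing model using the worked example's numbers (illustrative).
ENDPOINTS = 10_000
SURGE_FRACTION = 0.30       # 30% of endpoints authenticate in the window
SURGE_WINDOW_S = 15 * 60    # 15-minute morning surge
REAUTH_INTERVAL_S = 3_600   # authentication timer reauthenticate 3600
HEADROOM = 2.0              # size for 2x peak, not average

surge_rate = ENDPOINTS * SURGE_FRACTION / SURGE_WINDOW_S   # peak auths/s
steady_rate = ENDPOINTS / REAUTH_INTERVAL_S                # re-auth auths/s
design_rate = max(surge_rate, steady_rate) * HEADROOM

print(f"peak surge:   {surge_rate:.1f} auths/s")   # 3.3
print(f"steady state: {steady_rate:.1f} auths/s")  # 2.8
print(f"design for:   {design_rate:.1f} auths/s")  # 6.7
```

Compare the design rate against the per-method PSN throughput table to pick a node count; with EAP-TLS plus OCSP at 50-100 auths/s per PSN, even one PSN clears 6.7 auths/s, so redundancy, not throughput, drives the two-PSN-per-site count here.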
RADIUS Anycast for Multi-Site Deployments
For a large enterprise with multiple data centers and dozens of remote sites, configuring two PSN IPs on every switch becomes a management burden and a failover problem. RADIUS anycast solves this cleanly.
Anycast Design
Assign a single /32 IP address (e.g., 10.255.0.1/32) to a loopback interface on every PSN. Advertise this /32 into your routing protocol (OSPF or BGP) from each PSN's data center. Access switches send RADIUS to 10.255.0.1 — the routing layer delivers the packet to the nearest PSN via ECMP.
When a PSN fails, the routing protocol withdraws its advertisement of 10.255.0.1 from that data center. Traffic automatically routes to the next-closest PSN. The switch configuration requires no changes — it always sends to 10.255.0.1.
! On each ISE PSN (Linux loopback — example only, not IOS XE)
! ip addr add 10.255.0.1/32 dev lo
! Route is advertised into OSPF from the PSN's uplink router
! On every access switch — single RADIUS server entry pointing to anycast IP
radius server ISE-ANYCAST
address ipv4 10.255.0.1 auth-port 1812 acct-port 1813
key ISEsecret123
timeout 3
retransmit 1
aaa group server radius ISE_SERVERS
server name ISE-ANYCAST
deadtime 5
aaa authentication dot1x default group ISE_SERVERS
aaa authorization network default group ISE_SERVERS
aaa accounting dot1x default start-stop group ISE_SERVERS
Anycast and RADIUS Session Affinity
RADIUS carries state in the Access-Request / Access-Challenge exchange. A multi-packet EAP exchange (EAP-TLS, for example, needs several round trips to complete certificate exchange) must stay on the same PSN for the duration of the exchange. With anycast and ECMP, if different packets in the same EAP conversation take different paths to different PSNs, the exchange fails: the second PSN has no context for the EAP conversation started on the first.
Mitigation: Configure ECMP with RADIUS source-port-based hashing. Most enterprise routers support hashing based on source IP + source UDP port. Since RADIUS uses a fixed destination port (1812) but a random source port per session, source-port-based hashing achieves per-session affinity rather than per-packet random distribution.
Alternatively, use a load balancer with source IP persistence (sticky sessions). The load balancer sends all packets from a given switch's source IP to the same PSN, providing strong session affinity. The tradeoff is that a switch failure scenario (switch sends from a new source IP after failover) may land on a different PSN.
For most deployments, hashing on the RADIUS source IP alone at the router level is sufficient. All sessions from a given switch then hash to the same PSN, because the switch management IP (the RADIUS source IP) is stable; source-port hashing adds per-session granularity but is not required for affinity.
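The affinity argument can be sketched in a few lines. This is an illustration only: the PSN names are hypothetical, and real routers use vendor-specific hash functions rather than SHA-256, but the property (same source IP, same next hop) is the same:

```python
import hashlib

# Sketch of per-switch session affinity under source-IP ECMP hashing.
# Assumption: the router picks one of N equal-cost next hops (one per
# PSN path) by hashing the RADIUS source address.
PSNS = ["psn-dc1-a", "psn-dc1-b", "psn-dc2-a", "psn-dc2-b"]  # hypothetical

def pick_psn(radius_source_ip: str) -> str:
    """Deterministically map a switch's RADIUS source IP to one PSN path."""
    digest = hashlib.sha256(radius_source_ip.encode()).digest()
    return PSNS[int.from_bytes(digest[:4], "big") % len(PSNS)]

# Every packet of a multi-round-trip EAP-TLS exchange comes from the same
# switch management IP, so every packet lands on the same PSN.
exchange = [pick_psn("10.10.1.2") for _ in range(10)]
print(exchange[0], "handles all", len(exchange), "round trips")
assert len(set(exchange)) == 1
```

The corollary is also visible here: if the source IP changes mid-exchange (the failover scenario mentioned above), the hash, and therefore the PSN, can change with it.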
Switch Stack Design for HA
A single Cisco Catalyst 9300 is a single point of failure for the 48 access ports it serves. In high-availability environments, use Catalyst 9300 stacks (up to 8 switches in a stack via StackWise-480) to eliminate the single-switch failure scenario.
StackWise Configuration for 802.1X
In a 9300 stack, the stack acts as a single logical switch. The 802.1X configuration applies at the logical switch level, not per-stack member. The StackWise ring provides sub-second failover when a stack member fails — active sessions on surviving members continue uninterrupted. Only ports on the failed member lose connectivity.
! Verify stack member status
SW1# show switch
Switch/Stack Mac Address : a4b1.c2d3.e4f5 - Local Mac Address
Mac persistency wait time: Indefinite
                                            H/W   Current
Switch#  Role     Mac Address      Priority  Version  State
-------------------------------------------------------------
*1       Active   a4b1.c2d3.e4f5   15        V01      Ready
 2       Standby  001e.7a3b.9c12   14        V01      Ready
 3       Member   0050.56b3.11aa   1         V01      Ready
The Active stack member processes all control plane functions, including RADIUS exchanges and 802.1X state machine operations. The Standby member takes over automatically if the Active fails, with all RADIUS server configurations and authentication state preserved.
SW1# show authentication sessions summary
Interface MAC Address Method Domain Status
...
Total number of sessions 144
After an Active-to-Standby switchover, run show authentication sessions to verify that sessions are still active. In most cases, existing authenticated sessions survive the switchover without re-authentication (NSF/SSO-aware 802.1X). New authentications may briefly queue during the switchover period (typically under 3 seconds on a 9300 stack).
Dual Uplinks for Access-to-Distribution Redundancy
Each access switch (or stack) should have two uplinks to distribution switches — preferably to different distribution switches to survive a distribution failure.
! Port-channel to dual distribution
interface Port-channel1
description UPLINK-TO-DIST-PAIR
switchport mode trunk
switchport trunk allowed vlan 10,20,30,40,50,99
interface TenGigabitEthernet1/1/1
description UPLINK-DIST-SW1
channel-group 1 mode active
interface TenGigabitEthernet1/1/2
description UPLINK-DIST-SW2
channel-group 1 mode active
With an LACP port-channel spanning both distribution switches, the access switch maintains a single logical uplink. Note that a cross-chassis EtherChannel requires the distribution pair to operate as one logical switch (StackWise Virtual or equivalent); a standard EtherChannel cannot terminate on two independent chassis. If one distribution switch fails, the port-channel narrows to the remaining uplink. The 802.1X state machine is not affected: RADIUS traffic traverses whatever uplink path is available, and the RADIUS server remains reachable as long as at least one uplink is active.
Template-Based Configuration at Scale
Manually configuring 802.1X on 300 switches with 48 ports each means touching 14,400 interface configurations. Errors in manual configuration create inconsistent security posture. Use a configuration template approach.
IOS XE Configuration Template with Cisco DNA Center
Cisco DNA Center provides 802.1X template deployment across the entire switch inventory. The workflow:
- Create a configuration template in DNA Center (Design > Network Profiles > Templates)
- Define variables for site-specific values (VLAN IDs, RADIUS server IPs, interface ranges)
- Assign the template to network devices by site
- Deploy the template — DNA Center pushes config to each switch via NETCONF/RESTCONF or SSH
The template for Monitor Mode deployment:
! -- Template: 802.1X-Monitor-Mode-v1 --
aaa new-model
aaa authentication dot1x default group ISE_SERVERS
aaa authorization network default group ISE_SERVERS
aaa accounting dot1x default start-stop group ISE_SERVERS
radius server ISE-PSN-1
address ipv4 $ISE_PSN1_IP auth-port 1812 acct-port 1813
key $RADIUS_SECRET
timeout 5
retransmit 2
automate-tester username radiustest probe-on
aaa group server radius ISE_SERVERS
server name ISE-PSN-1
deadtime 15
load-balance method least-outstanding
dot1x system-auth-control
interface range $ACCESS_PORT_RANGE
switchport mode access
switchport access vlan $DATA_VLAN
switchport voice vlan $VOICE_VLAN
authentication open
authentication port-control auto
authentication host-mode multi-domain
authentication order dot1x mab
authentication priority dot1x mab
dot1x pae authenticator
dot1x timeout tx-period 10
spanning-tree portfast
Variables ($ISE_PSN1_IP, $DATA_VLAN, etc.) are substituted per-device or per-site during deployment. This ensures every switch receives exactly the same logical configuration, with only the site-specific values differing.
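The substitution step can be illustrated with Python's string.Template, which happens to use the same $VARIABLE syntax as the template above. The site values here are hypothetical, and DNA Center's actual rendering engine (Velocity or Jinja2) is richer than this sketch:

```python
from string import Template

# Sketch of per-site variable substitution for an interface-range stanza.
# Variable names match the 802.1X template above; values are hypothetical.
interface_template = Template(
    "interface range $ACCESS_PORT_RANGE\n"
    " switchport access vlan $DATA_VLAN\n"
    " switchport voice vlan $VOICE_VLAN\n"
)

site_hq_floor1 = {                      # hypothetical per-site bindings
    "ACCESS_PORT_RANGE": "GigabitEthernet1/0/1 - 48",
    "DATA_VLAN": "10",
    "VOICE_VLAN": "20",
}

rendered = interface_template.substitute(site_hq_floor1)
print(rendered)
# substitute() raises KeyError if any $VARIABLE is left unbound, which
# catches a missing site value before anything reaches a switch.
```

The strict-substitution behavior is the useful design point: a template with an unbound variable fails loudly at render time instead of pushing a literal `$VOICE_VLAN` into a switch config.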
NETCONF/YANG for Programmatic Configuration
For environments with a network automation platform (Ansible, NSO, or custom Python tooling), use NETCONF with YANG models to configure 802.1X parameters programmatically. IOS XE 17.9.x supports NETCONF on port 830.
SW1# show netconf-yang status
NETCONF YANG : Enabled
NETCONF YANG ssh port: 830
NETCONF YANG candidate-datastore: Disabled
The YANG model for 802.1X port configuration is under Cisco-IOS-XE-dot1x. Automating 802.1X at scale via NETCONF eliminates the SSH/CLI dependency and enables idempotent configuration management — pushing the same template twice has no effect if the config is already correct.
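The idempotency property can be sketched in pure Python: compute the delta between intended and running config, and push only what is missing. This illustrates the workflow's key property rather than acting as a NETCONF client; real tooling compares structured YANG data, and this line-set comparison ignores command negation and ordering:

```python
# Pure-Python sketch of idempotent config management (simplified).
def config_delta(running: str, intended: str) -> list[str]:
    """Return intended lines not already present in the running config."""
    present = {line.strip() for line in running.splitlines() if line.strip()}
    return [line.strip() for line in intended.splitlines()
            if line.strip() and line.strip() not in present]

intended = """
 authentication open
 authentication port-control auto
 dot1x pae authenticator
"""
running = """
 authentication open
 dot1x pae authenticator
"""

print(config_delta(running, intended))   # only the missing command
print(config_delta(intended, intended))  # pushing twice: nothing to do
```

An empty delta on the second push is what "no effect if the config is already correct" means in practice: the tool reports zero changes instead of blindly rewriting 14,400 interfaces.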
Operational Monitoring at Scale
ISE Dashboards for 802.1X Health
Navigate to Operations > RADIUS > Live Logs for real-time authentication monitoring. In scale environments, this view alone is insufficient because it shows only the most recent authentication events. Use:
Operations > Reports > Reports > Endpoint and Users > Authentication Summary: Aggregate view by time period, identity group, NAS IP, or failure reason. Export to CSV for trending analysis.
Operations > Reports > Reports > Device Administration > RADIUS Accounting: Shows all RADIUS accounting sessions — useful for detecting unusually long sessions or endpoints that are failing to properly terminate sessions.
Administration > System > Health Summary: Node-by-node health status for all ISE nodes. Watch for PSN CPU above 70% sustained, MnT disk utilization above 80%, and replication lag between nodes.
SNMP Monitoring for Switch-Side 802.1X
Configure SNMP traps on all access switches for 802.1X state changes:
! Lab example: use a unique read-only community string in production
snmp-server community public ro
snmp-server host 10.0.0.30 version 2c public
snmp-server enable traps dot1x
snmp-server enable traps radius
Key SNMP traps for 802.1X monitoring:
- `dot1xAuthFail` — an authentication failure on a port
- `dot1xAuthSuccess` — an authentication success
- `ciscoRadiusNoServerAvailable` — all RADIUS servers dead
Send these traps to your SIEM or network management system. Configure alerting for:
- Sustained `dot1xAuthFail` rate above threshold (indicates a systematic failure, not individual endpoint issues)
- `ciscoRadiusNoServerAvailable` from any switch (critical — ISE is unreachable from that switch)
Syslog at Scale
With 300 switches each generating 802.1X syslog messages, a centralized syslog platform is essential. Configure all switches to send syslog to a central collector:
logging host 10.0.0.30 transport udp port 514
logging trap informational
logging facility local6
Key log messages to alert on:
| Syslog Message | Severity | Meaning |
|---|---|---|
| `%RADIUS-3-ALLDEADSERVER` | Error | All RADIUS servers dead — Critical VLAN activating |
| `%RADIUS-4-RADIUS_DEAD` | Warning | A PSN is unreachable |
| `%DOT1X-5-FAIL` | Notice | Individual authentication failure |
| `%DOT1X-5-SUCCESS` | Notice | Authentication success (noisy — filter unless needed) |
| `%AUTHMGR-5-VLANASSIGN` | Notice | VLAN assignment on a port |
Re-Authentication Timer Design at Scale
Re-authentication (authentication timer reauthenticate) is often set to 3600 seconds (1 hour) as a default. At scale, this creates a predictable load spike: if 10,000 endpoints all authenticate at 09:00 on a Monday morning, the re-authentication surge recurs every hour on the hour. At 10:00, all 10,000 endpoints attempt re-authentication simultaneously.
Timer Jitter
Distribute re-authentication timers to prevent synchronized spikes. IOS XE does not natively support per-port timer randomization, but you can achieve effective jitter by deploying re-authentication timers in bands during the initial rollout:
- Floor 1 ports: `authentication timer reauthenticate 3600`
- Floor 2 ports: `authentication timer reauthenticate 3900`
- Floor 3 ports: `authentication timer reauthenticate 4200`
- Floor 4 ports: `authentication timer reauthenticate 3300`
The 300-second offsets spread re-authentication across multiple 5-minute windows per hour, reducing peak RADIUS load by approximately 75% compared to a synchronized timer.
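The roughly-75% figure can be checked with a small model: assume 10,000 endpoints split evenly across the four floors (a hypothetical even split), and count re-authentications per one-minute bucket over a four-hour horizon:

```python
from collections import Counter

# Model the hourly re-auth spike with and without banded timers.
# Assumption: 10,000 endpoints, 2,500 per floor, using the per-floor
# timer values from the list above.
BANDS = {3600: 2500, 3900: 2500, 4200: 2500, 3300: 2500}
HORIZON_S = 4 * 3600

def reauth_events(timers: dict[int, int]) -> Counter:
    """Count re-authentications per 60-second bucket over the horizon."""
    buckets = Counter()
    for interval, endpoints in timers.items():
        t = interval
        while t < HORIZON_S:
            buckets[t // 60] += endpoints
            t += interval
    return buckets

synchronized = reauth_events({3600: 10_000})
banded = reauth_events(BANDS)
print("synchronized peak:", max(synchronized.values()))  # 10000 per minute
print("banded peak:      ", max(banded.values()))        # 2500 per minute
```

With four equal bands, the worst one-minute bucket drops from 10,000 to 2,500 re-authentications, the 75% peak reduction cited above; more bands (or per-session Session-Timeout jitter from ISE) flatten the curve further.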
Alternatively, ISE can control session re-authentication via the RADIUS Session-Timeout attribute returned in the Access-Accept. Setting different Session-Timeout values in different ISE authorization profiles achieves the same jitter without switch-by-switch timer configuration.
In ISE, navigate to Policy > Policy Elements > Results > Authorization > Authorization Profiles > [Profile Name] > Common Tasks > Session Settings. Set Session Timeout to the desired seconds. ISE returns this as Session-Timeout [27] in the Access-Accept, and the switch uses it as the re-authentication timer for that session (configure authentication timer reauthenticate server on the interface so the switch honors the RADIUS-supplied value instead of its local timer).
Bringing It All Together: Enterprise Reference Architecture
A complete enterprise 802.1X design integrates every component from this series:
Access Layer (Catalyst 9300 Stacks):
- 802.1X with MAB fallback (Articles 8 and 12)
- Multi-domain host mode for voice/data (Articles 13 and 17)
- Closed Mode enforcement after phased rollout (Articles 26 and 27)
- Critical VLAN for RADIUS failover (Articles 15 and 28)
Distribution/Core Layer:
- TrustSec SGACL enforcement (Article 29)
- Port-channel uplinks from access stacks
- RADIUS anycast routing
ISE Infrastructure:
- Multi-node deployment (PAN pair, PSN cluster, MnT pair)
- RADIUS anycast via PSN loopback advertisements
- Policy set hierarchy covering all authentication scenarios (Articles 9 and 14)
- dACL library for post-auth enforcement (Article 16)
- CoA integration for policy updates (Article 19)
Operations:
- SNMP traps and syslog to centralized monitoring
- DNA Center template deployment
- ISE reporting for compliance and audit
- Troubleshooting runbooks based on Articles 20-25
Final Verification: Steady-State Health Check
Run these commands weekly on a sample of access switches to validate ongoing 802.1X health:
SW1# show authentication sessions summary
SW1# show radius server-group all
SW1# show radius statistics
SW1# show dot1x all
SW1# show cts credentials
And in ISE:
- Operations > RADIUS > Live Logs: Check for sustained failure patterns
- Administration > System > Health Summary: Verify all nodes healthy
- Work Centers > TrustSec > Reports > SGACL Drop Summary: Verify no unexpected SGACL drops on production traffic flows
Troubleshooting
Symptom: Authentication surge after a site-wide power event causes ISE PSNs to reach CPU saturation
Cause: All endpoints at a site authenticate simultaneously when power is restored. In a campus with 2,000 endpoints, this can mean 2,000 concurrent authentication requests hitting ISE within 30 seconds — exceeding PSN capacity and causing RADIUS timeouts. Endpoints then retry, compounding the load.
Fix: Configure dot1x timeout quiet-period 60 on access switch interfaces. The quiet period introduces a delay before a failed endpoint retries authentication. After a simultaneous power restoration event, the quiet period staggers retries and reduces the synchronization of the surge. Additionally, ensure Critical VLAN is configured — during the peak surge, some endpoints will time out and land in VLAN 50, which keeps them on the network while ISE processes the backlog.
Symptom: RADIUS accounting records are incomplete — some sessions show no accounting stop records, making session tracking unreliable
Cause: The MnT node is dropping accounting records due to disk space or processing overload, or accounting is not configured on all switches consistently. Also occurs when switches reset without cleanly terminating sessions (power failure, stack switchover) — the RADIUS accounting stop is never sent.
Fix: Verify aaa accounting dot1x default start-stop group ISE_SERVERS is configured on all switches. In ISE, check the MnT node disk utilization (Administration > System > Health Summary). If disk is above 80%, configure ISE to purge older logs (Administration > System > Maintenance > Data Purge). For missing accounting stops, ISE handles this via session timeout logic — sessions without a stop record are aged out after the configured idle timeout.
Symptom: DNA Center template deployment pushes 802.1X config to switches, but some switches show incorrect VLAN IDs for voice or auth-fail VLANs
Cause: Template variable substitution is using incorrect per-site values. The template variable $VOICE_VLAN is mapped to the wrong value in the DNA Center site hierarchy, or the site hierarchy was not populated before template deployment.
Fix: In DNA Center, navigate to Design > Network Hierarchy and verify that the network settings (VLAN IDs, RADIUS server IPs) are correctly populated for each site and building. Navigate to Design > Network Profiles and review the variable bindings for the 802.1X template at each site level. Re-deploy the template after correcting the variable values. Verify the deployed config on affected switches with show running-config interface [range] and spot-check VLAN assignments.
Closing
This article and the 29 that preceded it cover the complete 802.1X technology stack from first principles through enterprise-scale design. The series started with what 802.1X is and ended with how to run it across thousands of ports with full redundancy, TrustSec integration, and automated deployment. The underlying technology has not changed — every authentication still follows the same EAPOL → RADIUS → ISE → Access-Accept flow described in Article 7: 802.1X Authentication Flow Step by Step. What changes at scale is the operational discipline required to keep that flow working reliably across a complex, heterogeneous environment.
What's Next: You've reached the end of the 802.1X series. Return to the 802.1X Series Index to review any article or share the series with your team.