The Scale Problem
A lab topology with one switch and one ISE node reveals authentication logic. A production enterprise with 300 access switches, 12,000 endpoints, and four data centers reveals scalability limits. The design decisions that work at small scale — a single ISE PSN, manual interface configs, no re-authentication timers — break down or create operational debt at enterprise scale.
This article addresses the scale dimension across three layers:
- ISE infrastructure scale — how to size, deploy, and operate multiple ISE nodes
- Switch infrastructure scale — stack design, template-based config, and port-level HA
- Operational scale — monitoring, automation, and change management across hundreds of switches
The configurations in this article use the same lab references throughout this series (Catalyst 9300, IOS XE 17.9.x, ISE 3.2), but the design patterns apply to any deployment that has outgrown a single-node setup.
ISE Node Architecture at Scale
Cisco ISE 3.2 supports several node persona types in a distributed deployment. Understanding which persona handles which function determines how you scale.
Node Personas
Primary Administration Node (PAN): All configuration changes flow through the PAN. It replicates configuration to all other nodes. In large deployments, the PAN is a dedicated appliance — it does not process RADIUS authentication. Running authentication through the PAN creates a bottleneck at exactly the node that also handles all policy changes.
Secondary Administration Node (secondary PAN): Mirrors the primary PAN's configuration. If the primary PAN fails, promote the secondary manually, or configure automatic promotion (PAN auto-failover, which requires designating a health-check node). Until promotion completes, PSNs continue authenticating endpoints against their replicated policy, but no configuration changes are accepted.
Policy Service Nodes (PSNs): Handle all live RADIUS authentication. This is where 802.1X authentication, MAB, CoA, and guest flows are processed. PSNs are the scaling dimension — add PSNs to increase RADIUS throughput.
Monitoring and Troubleshooting Node (MnT): Receives and stores RADIUS accounting data, generates reports, and provides the live log data visible in Operations > RADIUS > Live Logs. Separate MnT nodes from PSNs in large deployments — a PSN that also runs MnT workloads experiences degraded RADIUS performance during heavy reporting queries.
pxGrid Node: Handles inter-platform context sharing (TrustSec, ISE APIs, third-party integrations). In large deployments, dedicate pxGrid to its own node.
Reference Architecture for 5,000-25,000 Endpoints
| Node Role | Count | Notes |
|---|---|---|
| Primary PAN | 1 | Admin + configuration |
| Secondary PAN | 1 | Standby administration; promoted if the primary PAN fails |
| PSN (Data Center 1) | 2 | RADIUS authentication for DC1 and campus buildings 1-5 |
| PSN (Data Center 2) | 2 | RADIUS authentication for campus buildings 6-10 and remote sites |
| Primary MnT | 1 | Accounting, reporting, live logs |
| Secondary MnT | 1 | MnT failover |
This deployment uses 8 ISE nodes total. All nodes must run ISE 3.2 with matching patch levels — mixed versions within a deployment cause replication failures. Navigate to Administration > System > Deployment to verify all nodes are running the same version.
PSN Sizing
ISE 3.2 PSN throughput guidelines (approximate, varies by EAP method and hardware):
| Authentication Method | Auths per PSN per Second |
|---|---|
| MAB | 1,000+ |
| PEAP-MSCHAPv2 | 200-400 |
| EAP-TLS (without OCSP) | 100-200 |
| EAP-TLS (with OCSP) | 50-100 |
For a campus with 10,000 endpoints and a peak morning authentication surge (assume 30% of endpoints authenticate in a 15-minute window): 3,000 authentications in 900 seconds ≈ 3.3 auths/second. Two PSNs running EAP-TLS with OCSP can handle this comfortably. Factor in a 2x headroom buffer — size for 2x peak, not average.
Re-authentication timers affect steady-state load. If authentication timer reauthenticate 3600 is set globally, 10,000 endpoints re-authenticate approximately every hour — averaging 2.8 auths/second continuously. In a deployment with mixed EAP methods, calculate the weighted average per-method cost.
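The arithmetic above condenses into a quick sizing model. This is a sketch using the worked example's numbers; the surge fraction, window, and 2x headroom rule come from the text, and everything else is illustrative:

```python
# Quick PSN sizing model using the worked example's numbers (illustrative).
ENDPOINTS = 10_000
SURGE_FRACTION = 0.30       # 30% of endpoints authenticate in the window
SURGE_WINDOW_S = 15 * 60    # 15-minute morning surge
REAUTH_INTERVAL_S = 3_600   # authentication timer reauthenticate 3600
HEADROOM = 2.0              # size for 2x peak, not average

surge_rate = ENDPOINTS * SURGE_FRACTION / SURGE_WINDOW_S   # peak auths/s
steady_rate = ENDPOINTS / REAUTH_INTERVAL_S                # re-auth auths/s
design_rate = max(surge_rate, steady_rate) * HEADROOM

print(f"peak surge:   {surge_rate:.1f} auths/s")   # 3.3
print(f"steady state: {steady_rate:.1f} auths/s")  # 2.8
print(f"design for:   {design_rate:.1f} auths/s")  # 6.7
```

Compare the design rate against the per-method PSN throughput table to pick a node count; with EAP-TLS plus OCSP at 50-100 auths/s per PSN, even one PSN clears 6.7 auths/s, so redundancy, not throughput, drives the two-PSN-per-site count here.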
RADIUS Anycast for Multi-Site Deployments
For a large enterprise with multiple data centers and dozens of remote sites, configuring two PSN IPs on every switch becomes a management burden and a failover problem. RADIUS anycast solves this cleanly.
Anycast Design
Assign a single /32 IP address (e.g., 10.255.0.1/32) to a loopback interface on every PSN. Advertise this /32 into your routing protocol (OSPF or BGP) from each PSN's data center. Access switches send RADIUS to 10.255.0.1 — the routing layer delivers the packet to the nearest PSN via ECMP.
When a PSN fails, the routing protocol withdraws its advertisement of 10.255.0.1 from that data center. Traffic automatically routes to the next-closest PSN. The switch configuration requires no changes — it always sends to 10.255.0.1.
! On each ISE PSN (Linux loopback — example only, not IOS XE)
! ip addr add 10.255.0.1/32 dev lo
! Route is advertised into OSPF from the PSN's uplink router
! On every access switch — single RADIUS server entry pointing to anycast IP
radius server ISE-ANYCAST
address ipv4 10.255.0.1 auth-port 1812 acct-port 1813
key ISEsecret123
timeout 3
retransmit 1
aaa group server radius ISE_SERVERS
server name ISE-ANYCAST
deadtime 5
aaa authentication dot1x default group ISE_SERVERS
aaa authorization network default group ISE_SERVERS
aaa accounting dot1x default start-stop group ISE_SERVERS
Anycast and RADIUS Session Affinity
RADIUS carries state in the Access-Request / Access-Challenge exchange. A multi-packet EAP exchange (EAP-TLS, for example, needs several round trips to complete certificate exchange) must stay on the same PSN for the duration of the exchange. With anycast and ECMP, if different packets in the same EAP conversation take different paths to different PSNs, the exchange fails: the second PSN has no context for the EAP conversation started on the first.
Mitigation: Configure ECMP with RADIUS source-port-based hashing. Most enterprise routers support hashing based on source IP + source UDP port. Since RADIUS uses a fixed destination port (1812) but a random source port per session, source-port-based hashing achieves per-session affinity rather than per-packet random distribution.
Alternatively, use a load balancer with source IP persistence (sticky sessions). The load balancer sends all packets from a given switch's source IP to the same PSN, providing strong session affinity. The tradeoff is that a switch failure scenario (switch sends from a new source IP after failover) may land on a different PSN.
For most deployments, hashing on the RADIUS source IP alone at the router level is sufficient. All sessions from a given switch then hash to the same PSN, because the switch management IP (the RADIUS source IP) is stable; source-port hashing adds per-session granularity but is not required for affinity.
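The affinity argument can be sketched in a few lines. This is an illustration only: the PSN names are hypothetical, and real routers use vendor-specific hash functions rather than SHA-256, but the property (same source IP, same next hop) is the same:

```python
import hashlib

# Sketch of per-switch session affinity under source-IP ECMP hashing.
# Assumption: the router picks one of N equal-cost next hops (one per
# PSN path) by hashing the RADIUS source address.
PSNS = ["psn-dc1-a", "psn-dc1-b", "psn-dc2-a", "psn-dc2-b"]  # hypothetical

def pick_psn(radius_source_ip: str) -> str:
    """Deterministically map a switch's RADIUS source IP to one PSN path."""
    digest = hashlib.sha256(radius_source_ip.encode()).digest()
    return PSNS[int.from_bytes(digest[:4], "big") % len(PSNS)]

# Every packet of a multi-round-trip EAP-TLS exchange comes from the same
# switch management IP, so every packet lands on the same PSN.
exchange = [pick_psn("10.10.1.2") for _ in range(10)]
print(exchange[0], "handles all", len(exchange), "round trips")
assert len(set(exchange)) == 1
```

The corollary is also visible here: if the source IP changes mid-exchange (the failover scenario mentioned above), the hash, and therefore the PSN, can change with it.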
Switch Stack Design for HA
A single Cisco Catalyst 9300 is a single point of failure for the 48 access ports it serves. In high-availability environments, use Catalyst 9300 stacks (up to 8 switches in a stack via StackWise-480) to eliminate the single-switch failure scenario.
StackWise Configuration for 802.1X
In a 9300 stack, the stack acts as a single logical switch. The 802.1X configuration applies at the logical switch level, not per-stack member. The StackWise ring provides sub-second failover when a stack member fails — active sessions on surviving members continue uninterrupted. Only ports on the failed member lose connectivity.
! Verify stack member status
SW1# show switch
Switch/Stack Mac Address : a4b1.c2d3.e4f5 - Local Mac Address
Mac persistency wait time: Indefinite
                                            H/W   Current
Switch#  Role     Mac Address      Priority  Version  State
-------------------------------------------------------------
*1       Active   a4b1.c2d3.e4f5   15        V01      Ready
 2       Standby  001e.7a3b.9c12   14        V01      Ready
 3       Member   0050.56b3.11aa   1         V01      Ready
The Active stack member processes all control plane functions, including RADIUS exchanges and 802.1X state machine operations. The Standby member takes over automatically if the Active fails, with all RADIUS server configurations and authentication state preserved.
SW1# show authentication sessions summary
Interface MAC Address Method Domain Status
...
Total number of sessions 144
After an Active-to-Standby switchover, run show authentication sessions to verify that sessions are still active. In most cases, existing authenticated sessions survive the switchover without re-authentication (NSF/SSO-aware 802.1X). New authentications may briefly queue during the switchover period (typically under 3 seconds on a 9300 stack).
Dual Uplinks for Access-to-Distribution Redundancy
Each access switch (or stack) should have two uplinks to distribution switches — preferably to different distribution switches to survive a distribution failure.
! Port-channel to dual distribution
interface Port-channel1
description UPLINK-TO-DIST-PAIR
switchport mode trunk
switchport trunk allowed vlan 10,20,30,40,50,99
interface TenGigabitEthernet1/1/1
description UPLINK-DIST-SW1
channel-group 1 mode active
interface TenGigabitEthernet1/1/2
description UPLINK-DIST-SW2
channel-group 1 mode active
With an LACP port-channel spanning both distribution switches, the access switch maintains a single logical uplink. Note that a cross-chassis EtherChannel requires the distribution pair to operate as one logical switch (StackWise Virtual or equivalent); a standard EtherChannel cannot terminate on two independent chassis. If one distribution switch fails, the port-channel narrows to the remaining uplink. The 802.1X state machine is not affected: RADIUS traffic traverses whatever uplink path is available, and the RADIUS server remains reachable as long as at least one uplink is active.
Template-Based Configuration at Scale
Manually configuring 802.1X on 300 switches with 48 ports each means touching 14,400 interface configurations. Errors in manual configuration create inconsistent security posture. Use a configuration template approach.
IOS XE Configuration Template with Cisco DNA Center
Cisco DNA Center provides 802.1X template deployment across the entire switch inventory. The workflow:
- Create a configuration template in DNA Center (Design > Network Profiles > Templates)
- Define variables for site-specific values (VLAN IDs, RADIUS server IPs, interface ranges)
- Assign the template to network devices by site
- Deploy the template — DNA Center pushes config to each switch via NETCONF/RESTCONF or SSH
The template for Monitor Mode deployment:
! -- Template: 802.1X-Monitor-Mode-v1 --
aaa new-model
aaa authentication dot1x default group ISE_SERVERS
aaa authorization network default group ISE_SERVERS
aaa accounting dot1x default start-stop group ISE_SERVERS
radius server ISE-PSN-1
address ipv4 $ISE_PSN1_IP auth-port 1812 acct-port 1813
key $RADIUS_SECRET
timeout 5
retransmit 2
automate-tester username radiustest probe-on
aaa group server radius ISE_SERVERS
server name ISE-PSN-1
deadtime 15
load-balance method least-outstanding
dot1x system-auth-control
interface range $ACCESS_PORT_RANGE
switchport mode access
switchport access vlan $DATA_VLAN
switchport voice vlan $VOICE_VLAN
authentication open
authentication port-control auto
authentication host-mode multi-domain
authentication order dot1x mab
authentication priority dot1x mab
dot1x pae authenticator
dot1x timeout tx-period 10
spanning-tree portfast
Variables ($ISE_PSN1_IP, $DATA_VLAN, etc.) are substituted per-device or per-site during deployment. This ensures every switch receives exactly the same logical configuration, with only the site-specific values differing.
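The substitution step can be illustrated with Python's string.Template, which happens to use the same $VARIABLE syntax as the template above. The site values here are hypothetical, and DNA Center's actual rendering engine (Velocity or Jinja2) is richer than this sketch:

```python
from string import Template

# Sketch of per-site variable substitution for an interface-range stanza.
# Variable names match the 802.1X template above; values are hypothetical.
interface_template = Template(
    "interface range $ACCESS_PORT_RANGE\n"
    " switchport access vlan $DATA_VLAN\n"
    " switchport voice vlan $VOICE_VLAN\n"
)

site_hq_floor1 = {                      # hypothetical per-site bindings
    "ACCESS_PORT_RANGE": "GigabitEthernet1/0/1 - 48",
    "DATA_VLAN": "10",
    "VOICE_VLAN": "20",
}

rendered = interface_template.substitute(site_hq_floor1)
print(rendered)
# substitute() raises KeyError if any $VARIABLE is left unbound, which
# catches a missing site value before anything reaches a switch.
```

The strict-substitution behavior is the useful design point: a template with an unbound variable fails loudly at render time instead of pushing a literal `$VOICE_VLAN` into a switch config.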
NETCONF/YANG for Programmatic Configuration
For environments with a network automation platform (Ansible, NSO, or custom Python tooling), use NETCONF with YANG models to configure 802.1X parameters programmatically. IOS XE 17.9.x supports NETCONF on port 830.
SW1# show netconf-yang status
NETCONF YANG : Enabled
NETCONF YANG ssh port: 830
NETCONF YANG candidate-datastore: Disabled
The YANG model for 802.1X port configuration is under Cisco-IOS-XE-dot1x. Automating 802.1X at scale via NETCONF eliminates the SSH/CLI dependency and enables idempotent configuration management — pushing the same template twice has no effect if the config is already correct.
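The idempotency property can be sketched in pure Python: compute the delta between intended and running config, and push only what is missing. This illustrates the workflow's key property rather than acting as a NETCONF client; real tooling compares structured YANG data, and this line-set comparison ignores command negation and ordering:

```python
# Pure-Python sketch of idempotent config management (simplified).
def config_delta(running: str, intended: str) -> list[str]:
    """Return intended lines not already present in the running config."""
    present = {line.strip() for line in running.splitlines() if line.strip()}
    return [line.strip() for line in intended.splitlines()
            if line.strip() and line.strip() not in present]

intended = """
 authentication open
 authentication port-control auto
 dot1x pae authenticator
"""
running = """
 authentication open
 dot1x pae authenticator
"""

print(config_delta(running, intended))   # only the missing command
print(config_delta(intended, intended))  # pushing twice: nothing to do
```

An empty delta on the second push is what "no effect if the config is already correct" means in practice: the tool reports zero changes instead of blindly rewriting 14,400 interfaces.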
Operational Monitoring at Scale
ISE Dashboards for 802.1X Health
Navigate to Operations > RADIUS > Live Logs for real-time authentication monitoring. In scale environments, this view alone is insufficient because it shows only the most recent authentication events. Use:
Operations > Reports > Reports > Endpoint and Users > Authentication Summary: Aggregate view by time period, identity group, NAS IP, or failure reason. Export to CSV for trending analysis.
Operations > Reports > Reports > Device Administration > RADIUS Accounting: Shows all RADIUS accounting sessions — useful for detecting unusually long sessions or endpoints that are failing to properly terminate sessions.
Administration > System > Health Summary: Node-by-node health status for all ISE nodes. Watch for PSN CPU above 70% sustained, MnT disk utilization above 80%, and replication lag between nodes.
SNMP Monitoring for Switch-Side 802.1X
Configure SNMP traps on all access switches for 802.1X state changes:
! Lab example: use a unique read-only community string in production
snmp-server community public ro
snmp-server host 10.0.0.30 version 2c public
snmp-server enable traps dot1x
snmp-server enable traps radius
Key SNMP traps for 802.1X monitoring:
- `dot1xAuthFail` — an authentication failure on a port
- `dot1xAuthSuccess` — an authentication success
- `ciscoRadiusNoServerAvailable` — all RADIUS servers dead
Send these traps to your SIEM or network management system. Configure alerting for:
- Sustained `dot1xAuthFail` rate above threshold (indicates a systematic failure, not individual endpoint issues)
- `ciscoRadiusNoServerAvailable` from any switch (critical — ISE is unreachable from that switch)
Syslog at Scale
With 300 switches each generating 802.1X syslog messages, a centralized syslog platform is essential. Configure all switches to send syslog to a central collector:
logging host 10.0.0.30 transport udp port 514
logging trap informational
logging facility local6
Key log messages to alert on:
| Syslog Message | Severity | Meaning |
|---|---|---|
| `%RADIUS-3-ALLDEADSERVER` | Error | All RADIUS servers dead — Critical VLAN activating |
| `%RADIUS-4-RADIUS_DEAD` | Warning | A PSN is unreachable |
| `%DOT1X-5-FAIL` | Notice | Individual authentication failure |
| `%DOT1X-5-SUCCESS` | Notice | Authentication success (noisy — filter unless needed) |
| `%AUTHMGR-5-VLANASSIGN` | Notice | VLAN assignment on a port |
Re-Authentication Timer Design at Scale
Re-authentication (authentication timer reauthenticate) is often set to 3600 seconds (1 hour) as a default. At scale, this creates a predictable load spike: if 10,000 endpoints all authenticate at 09:00 on a Monday morning, the re-authentication surge recurs every hour on the hour. At 10:00, all 10,000 endpoints attempt re-authentication simultaneously.
Timer Jitter
Distribute re-authentication timers to prevent synchronized spikes. IOS XE does not natively support per-port timer randomization, but you can achieve effective jitter by deploying re-authentication timers in bands during the initial rollout:
- Floor 1 ports: `authentication timer reauthenticate 3600`
- Floor 2 ports: `authentication timer reauthenticate 3900`
- Floor 3 ports: `authentication timer reauthenticate 4200`
- Floor 4 ports: `authentication timer reauthenticate 3300`
The 300-second offsets spread re-authentication across multiple 5-minute windows per hour, reducing peak RADIUS load by approximately 75% compared to a synchronized timer.
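The roughly-75% figure can be checked with a small model: assume 10,000 endpoints split evenly across the four floors (a hypothetical even split), and count re-authentications per one-minute bucket over a four-hour horizon:

```python
from collections import Counter

# Model the hourly re-auth spike with and without banded timers.
# Assumption: 10,000 endpoints, 2,500 per floor, using the per-floor
# timer values from the list above.
BANDS = {3600: 2500, 3900: 2500, 4200: 2500, 3300: 2500}
HORIZON_S = 4 * 3600

def reauth_events(timers: dict[int, int]) -> Counter:
    """Count re-authentications per 60-second bucket over the horizon."""
    buckets = Counter()
    for interval, endpoints in timers.items():
        t = interval
        while t < HORIZON_S:
            buckets[t // 60] += endpoints
            t += interval
    return buckets

synchronized = reauth_events({3600: 10_000})
banded = reauth_events(BANDS)
print("synchronized peak:", max(synchronized.values()))  # 10000 per minute
print("banded peak:      ", max(banded.values()))        # 2500 per minute
```

With four equal bands, the worst one-minute bucket drops from 10,000 to 2,500 re-authentications, the 75% peak reduction cited above; more bands (or per-session Session-Timeout jitter from ISE) flatten the curve further.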
Alternatively, ISE can control session re-authentication via the RADIUS Session-Timeout attribute returned in the Access-Accept. Setting different Session-Timeout values in different ISE authorization profiles achieves the same jitter without switch-by-switch timer configuration.
In ISE, navigate to Policy > Policy Elements > Results > Authorization > Authorization Profiles > [Profile Name] > Common Tasks > Session Settings. Set Session Timeout to the desired seconds. ISE returns this as Session-Timeout [27] in the Access-Accept, and the switch uses it as the re-authentication timer for that session (configure authentication timer reauthenticate server on the interface so the switch honors the RADIUS-supplied value instead of its local timer).
Bringing It All Together: Enterprise Reference Architecture
A complete enterprise 802.1X design integrates every component from this series:
Access Layer (Catalyst 9300 Stacks):
- 802.1X with MAB fallback (Articles 8 and 12)
- Multi-domain host mode for voice/data (Articles 13 and 17)
- Closed Mode enforcement after phased rollout (Articles 26 and 27)
- Critical VLAN for RADIUS failover (Articles 15 and 28)
Distribution/Core Layer:
- TrustSec SGACL enforcement (Article 29)
- Port-channel uplinks from access stacks
- RADIUS anycast routing
ISE Infrastructure:
- Multi-node deployment (PAN pair, PSN cluster, MnT pair)
- RADIUS anycast via PSN loopback advertisements
- Policy set hierarchy covering all authentication scenarios (Articles 9 and 14)
- dACL library for post-auth enforcement (Article 16)
- CoA integration for policy updates (Article 19)
Operations:
- SNMP traps and syslog to centralized monitoring
- DNA Center template deployment
- ISE reporting for compliance and audit
- Troubleshooting runbooks based on Articles 20-25
Final Verification: Steady-State Health Check
Run these commands weekly on a sample of access switches to validate ongoing 802.1X health:
SW1# show authentication sessions summary
SW1# show radius server-group all
SW1# show radius statistics
SW1# show dot1x all
SW1# show cts credentials
And in ISE:
- Operations > RADIUS > Live Logs: Check for sustained failure patterns
- Administration > System > Health Summary: Verify all nodes healthy
- Work Centers > TrustSec > Reports > SGACL Drop Summary: Verify no unexpected SGACL drops on production traffic flows
Troubleshooting
Symptom: Authentication surge after a site-wide power event causes ISE PSNs to reach CPU saturation
Cause: All endpoints at a site authenticate simultaneously when power is restored. In a campus with 2,000 endpoints, this can mean 2,000 concurrent authentication requests hitting ISE within 30 seconds — exceeding PSN capacity and causing RADIUS timeouts. Endpoints then retry, compounding the load.
Fix: Configure dot1x timeout quiet-period 60 on access switch interfaces. The quiet period introduces a delay before a failed endpoint retries authentication. After a simultaneous power restoration event, the quiet period staggers retries and reduces the synchronization of the surge. Additionally, ensure Critical VLAN is configured — during the peak surge, some endpoints will time out and land in VLAN 50, which keeps them on the network while ISE processes the backlog.
Symptom: RADIUS accounting records are incomplete — some sessions show no accounting stop records, making session tracking unreliable
Cause: The MnT node is dropping accounting records due to disk space or processing overload, or accounting is not configured on all switches consistently. Also occurs when switches reset without cleanly terminating sessions (power failure, stack switchover) — the RADIUS accounting stop is never sent.
Fix: Verify aaa accounting dot1x default start-stop group ISE_SERVERS is configured on all switches. In ISE, check the MnT node disk utilization (Administration > System > Health Summary). If disk is above 80%, configure ISE to purge older logs (Administration > System > Maintenance > Data Purge). For missing accounting stops, ISE handles this via session timeout logic — sessions without a stop record are aged out after the configured idle timeout.
Symptom: DNA Center template deployment pushes 802.1X config to switches, but some switches show incorrect VLAN IDs for voice or auth-fail VLANs
Cause: Template variable substitution is using incorrect per-site values. The template variable $VOICE_VLAN is mapped to the wrong value in the DNA Center site hierarchy, or the site hierarchy was not populated before template deployment.
Fix: In DNA Center, navigate to Design > Network Hierarchy and verify that the network settings (VLAN IDs, RADIUS server IPs) are correctly populated for each site and building. Navigate to Design > Network Profiles and review the variable bindings for the 802.1X template at each site level. Re-deploy the template after correcting the variable values. Verify the deployed config on affected switches with show running-config interface [range] and spot-check VLAN assignments.
Closing
This article and the 29 that preceded it cover the complete 802.1X technology stack from first principles through enterprise-scale design. The series started with what 802.1X is and ended with how to run it across thousands of ports with full redundancy, TrustSec integration, and automated deployment. The underlying technology has not changed — every authentication still follows the same EAPOL → RADIUS → ISE → Access-Accept flow described in Article 7: 802.1X Authentication Flow Step by Step. What changes at scale is the operational discipline required to keep that flow working reliably across a complex, heterogeneous environment.
What's Next: You've reached the end of the 802.1X series. Return to the 802.1X Series Index to review any article or share the series with your team.