RADIUS Redundancy and Failover in 802.1X Deployments

RADIUS redundancy is not an optional enhancement to 802.1X — it is a prerequisite for running Closed Mode in any environment where network downtime has a business cost. The failure mode of an unreachable RADIUS server in a correctly configured Closed Mode deployment is that unauthenticated endpoints cannot get on the network, and endpoints with re-authentication timers that fire during the outage will be moved to the Critical VLAN or blocked, depending on your configuration.

This article covers the complete redundancy design from the IOS XE switch configuration layer through ISE PSN deployment, with realistic verification output and the specific failure scenarios you need to plan for.


The Failure Modes of a Single RADIUS Server

Before configuring redundancy, understand exactly what breaks and when.

Scenario 1: ISE PSN goes down, no existing sessions affected, but new auth fails.
If re-authentication is disabled (authentication periodic not set) and no existing sessions time out, existing authorized ports stay up. Only new connections (new endpoints, rebooted machines, newly plugged-in cables) attempt authentication and fail. This is a partial outage — visible but not catastrophic.

Scenario 2: ISE PSN goes down, re-authentication is enabled with a short timer.
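If authentication timer reauthenticate is set to 1800 seconds and the ISE outage lasts 35 minutes, the outage is longer than the re-authentication interval, so every session that was active when the outage began reaches its timer during the outage and fails re-authentication. Shorter outages are still damaging: if timers are spread evenly across the interval, a 15-minute outage catches roughly half of all sessions. This generates a wave of authentication failures that can lock out 50% or more of a building in under an hour.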
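The size of that wave can be estimated with a rough model. The sketch below assumes session timers are spread uniformly across the re-authentication interval, which is a simplifying assumption and not something the switch guarantees; the function name is illustrative.

```python
def fraction_reauth_during_outage(reauth_timer_s: float, outage_s: float) -> float:
    """Fraction of existing sessions whose re-auth timer fires during the outage.

    Assumes timer phases are uniformly distributed across the interval.
    """
    if reauth_timer_s <= 0:
        raise ValueError("reauth_timer_s must be positive")
    return min(1.0, max(0.0, outage_s) / reauth_timer_s)


if __name__ == "__main__":
    # 1800-second timer from Scenario 2, with outages of different lengths.
    for outage_min in (5, 15, 35):
        share = fraction_reauth_during_outage(1800, outage_min * 60)
        print(f"{outage_min:>2} min outage: {share:.0%} of sessions hit re-auth")
```

Under this model, any outage longer than the timer affects every existing session, so lengthening the timer directly reduces the blast radius of a short outage.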

Scenario 3: ISE PSN is reachable but the shared secret is wrong on one switch.
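This is not a RADIUS unreachable event: the packets reach ISE, but ISE cannot validate them and silently drops them, or the switch discards a reply whose authenticator does not verify. From that switch's perspective, show radius statistics shows increasing timeouts, and the switch eventually marks ISE as dead after its retransmit attempts are exhausted. Other switches keep authenticating normally, so the fault only shows up in that one switch's counters. This is why monitoring per-switch RADIUS statistics matters.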


IOS XE RADIUS Server Group Configuration

The foundation of RADIUS redundancy on IOS XE is the AAA server group with multiple RADIUS servers configured.

! Define two ISE PSNs
radius server ISE-PSN-1
 address ipv4 10.0.0.10 auth-port 1812 acct-port 1813
 key ISEsecret123
 timeout 5
 retransmit 2
 automate-tester username radiustest probe-on

radius server ISE-PSN-2
 address ipv4 10.0.0.11 auth-port 1812 acct-port 1813
 key ISEsecret123
 timeout 5
 retransmit 2
 automate-tester username radiustest probe-on

! Server group with both PSNs
aaa group server radius ISE_SERVERS
 server name ISE-PSN-1
 server name ISE-PSN-2
 deadtime 15
 load-balance method least-outstanding

! Apply to 802.1X authentication and authorization
aaa authentication dot1x default group ISE_SERVERS
aaa authorization network default group ISE_SERVERS
aaa accounting dot1x default start-stop group ISE_SERVERS

Key Parameters Explained

timeout 5 — The switch waits 5 seconds for a RADIUS response before considering the attempt a timeout. After retransmit 2 failed attempts (total of 3 attempts × 5 seconds = 15 seconds per server), the switch marks the server dead and moves to the next.

retransmit 2 — The switch retransmits the RADIUS request 2 times after the first timeout, for a total of 3 attempts. Combined with a 5-second timeout, each server gets 15 seconds before being marked dead. With two servers, the total failover time is 15 seconds (PSN1 exhausted) before PSN2 is tried.

deadtime 15 — After a server is marked dead, the switch does not attempt to use it for 15 minutes. During this window, all authentication requests go directly to PSN2. Without deadtime, the switch would try PSN1 on every request, adding 15 seconds of timeout delay to every authentication attempt.
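The timing arithmetic behind timeout, retransmit, and deadtime can be written down as a small calculation. This is a sketch of the reasoning above, not of how IOS XE implements it; the function names are hypothetical.

```python
def seconds_per_dead_server(timeout_s: int, retransmit: int) -> int:
    """Seconds spent on one unresponsive server: the first attempt plus each retransmit."""
    return timeout_s * (1 + retransmit)


def auth_delay(timeout_s: int, retransmit: int, dead_servers_tried: int) -> int:
    """Extra delay for one authentication that tries dead servers before a live one."""
    return dead_servers_tried * seconds_per_dead_server(timeout_s, retransmit)


if __name__ == "__main__":
    # Values from the configuration above: timeout 5, retransmit 2.
    print("PSN1 exhausted after", seconds_per_dead_server(5, 2), "s")
    # Without deadtime, each request would try the dead PSN1 first.
    print("Delay per auth without deadtime:", auth_delay(5, 2, dead_servers_tried=1), "s")
    # With deadtime active, PSN1 is skipped and requests go straight to PSN2.
    print("Delay per auth while PSN1 is skipped:", auth_delay(5, 2, dead_servers_tried=0), "s")
```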

automate-tester username radiustest probe-on — The switch sends periodic test packets to dead RADIUS servers to detect recovery. When ISE PSN1 comes back online and responds to the test probe, the switch removes it from the dead state and begins sending live authentication requests again. The radiustest username must exist in ISE as a local user (or be excluded from authentication via ISE policy) — if the probe generates an authentication attempt that logs as a failure, it creates noise in ISE live logs. Create a dedicated probe account in ISE under Administration > Identity Management > Identities > Users.

load-balance method least-outstanding — With two PSNs active, the switch distributes authentication requests to the server with the fewest outstanding (unacknowledged) requests. This is preferable to round-robin for 802.1X because authentication requests are not uniform in processing time — EAP-TLS certificate validation takes longer than MAB. Least-outstanding naturally balances load based on actual server capacity.
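As a conceptual illustration of the selection rule only, the sketch below picks the server with the fewest in-flight requests. It is a toy model with hypothetical function names; the real IOS XE implementation has additional details (for example, it assigns requests in batches) that are ignored here.

```python
from typing import Dict


def pick_least_outstanding(outstanding: Dict[str, int]) -> str:
    """Return the server with the fewest in-flight requests (ties go to the first listed)."""
    return min(outstanding, key=outstanding.get)


def dispatch(outstanding: Dict[str, int]) -> str:
    """Send one request to the least-loaded server and record it as in flight."""
    server = pick_least_outstanding(outstanding)
    outstanding[server] += 1
    return server


if __name__ == "__main__":
    # PSN1 is busy with slow EAP-TLS validations; PSN2 is nearly idle.
    in_flight = {"ISE-PSN-1": 4, "ISE-PSN-2": 1}
    print([dispatch(in_flight) for _ in range(4)])
```

New requests flow to ISE-PSN-2 until its backlog matches ISE-PSN-1, which is the behavior that makes this method a better fit than round-robin for mixed EAP-TLS and MAB traffic.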

Verification

SW1# show radius server-group all
Server group ISE_SERVERS
    Sharecount = 1  sg_unconfigured = FALSE
    Type       = standard
    Deadtime(s)= 900
    Load-balance: least-outstanding

    Server(10.0.0.10:1812,1813)
       State=ACTIVE  Dead time remaining(s)=0
       Quarantined=No
       Platform State=ACTIVE
       Authen: req 4821, timeouts 3, response 4818
       Author: req 4821, timeouts 0, response 4821

    Server(10.0.0.11:1812,1813)
       State=ACTIVE  Dead time remaining(s)=0
       Quarantined=No
       Platform State=ACTIVE
       Authen: req 4819, timeouts 2, response 4817
       Author: req 4819, timeouts 0, response 4819

Both PSNs show State=ACTIVE with request counts roughly balanced. The small number of timeouts (3 and 2 across thousands of requests) is normal — occasional packet loss or momentary ISE processing delays. A server showing State=DEAD with Dead time remaining(s)=720 means the server failed and is in the deadtime window. Authentication requests are bypassing it and going exclusively to the active server.

SW1# show radius statistics
Auth. Rqst Sent:       9640
Auth. Rqst Timeouts:   5
Auth. Rqst Retransmit: 5
Auth. Rsp Received:    9635

Global RADIUS statistics aggregate across all servers. Cross-reference with per-server stats from show radius server-group all to identify whether timeouts are concentrated on one server.
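That cross-referencing can be scripted. The sketch below parses a trimmed copy of the show radius server-group all output shown earlier and computes a per-server timeout rate. The regular expressions are written against that sample only; output formatting may differ between IOS XE releases, so treat the patterns as assumptions to verify.

```python
import re

# Trimmed copy of the sample output above.
SAMPLE = """\
    Server(10.0.0.10:1812,1813)
       State=ACTIVE  Dead time remaining(s)=0
       Platform State=ACTIVE
       Authen: req 4821, timeouts 3, response 4818
    Server(10.0.0.11:1812,1813)
       State=ACTIVE  Dead time remaining(s)=0
       Platform State=ACTIVE
       Authen: req 4819, timeouts 2, response 4817
"""

SERVER_RE = re.compile(r"Server\((?P<ip>[\d.]+):")
AUTHEN_RE = re.compile(r"Authen: req (?P<req>\d+), timeouts (?P<timeouts>\d+)")


def parse_server_stats(text):
    """Return {server_ip: {"state", "req", "timeouts"}} from the CLI output."""
    stats, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        server = SERVER_RE.match(line)
        if server:
            current = server.group("ip")
            stats[current] = {"state": None, "req": 0, "timeouts": 0}
            continue
        if current is None:
            continue
        if line.startswith("State="):  # skips the "Platform State=" line
            stats[current]["state"] = line.split()[0].split("=", 1)[1]
        authen = AUTHEN_RE.match(line)
        if authen:
            stats[current]["req"] = int(authen.group("req"))
            stats[current]["timeouts"] = int(authen.group("timeouts"))
    return stats


def timeout_rate(entry):
    """Timeouts as a fraction of authentication requests (0.0 when idle)."""
    return entry["timeouts"] / entry["req"] if entry["req"] else 0.0


if __name__ == "__main__":
    for ip, entry in parse_server_stats(SAMPLE).items():
        print(ip, entry["state"], f"{timeout_rate(entry):.3%}")
```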


Critical VLAN Configuration

The Critical VLAN (VLAN 50) handles the scenario where ISE is completely unreachable — all servers in the group are dead. Without a Critical VLAN, Closed Mode ports with no active session (e.g., a laptop that rebooted during the ISE outage) have no network access.

interface GigabitEthernet1/0/1
 switchport mode access
 switchport access vlan 10
 switchport voice vlan 20
 authentication event fail action authorize vlan 40
 authentication event no-response action authorize vlan 40
 authentication event server dead action authorize vlan 50
 authentication event server alive action reinitialize
 authentication port-control auto
 authentication host-mode multi-domain
 authentication order dot1x mab
 authentication priority dot1x mab
 dot1x pae authenticator
 dot1x timeout tx-period 10
 spanning-tree portfast

authentication event server dead action authorize vlan 50 — When all RADIUS servers in the group are marked dead, the switch moves this port to VLAN 50. Endpoints get DHCP from the VLAN 50 scope and limited network access (whatever the VLAN 50 routing and ACL policy allows).

authentication event server alive action reinitialize — When a RADIUS server comes back (detected via automate-tester probe), the switch reinitializes all sessions in the Critical VLAN and attempts re-authentication. This is the automatic recovery mechanism — endpoints move from VLAN 50 back to their proper VLAN after ISE recovers without requiring manual intervention or a port bounce.

Critical VLAN Design Considerations

VLAN 50 should provide connectivity to essential services only:

  • DNS (to resolve domain names for re-authentication)
  • NTP (critical for certificate validation)
  • Possibly a remediation server or basic intranet access

VLAN 50 should not have full internet access or access to sensitive systems. It is a safety net, not a bypass. Size the DHCP pool for VLAN 50 to accommodate your maximum simultaneous client count: a /25 (126 usable addresses) is enough for a small branch, while a campus of 1,000 ports needs at least a /22 (1,022 usable addresses). A /24 provides only 254 usable addresses.
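Pool sizing is simple arithmetic on the prefix length. The helper below is a minimal sketch with a hypothetical name; it counts usable IPv4 host addresses (total minus network and broadcast) and lets you subtract addresses reserved for gateways or other static hosts.

```python
def smallest_prefix_for(clients: int, reserved: int = 0) -> int:
    """Longest IPv4 prefix (smallest subnet) with enough usable addresses.

    `reserved` covers addresses that DHCP cannot hand out, such as the gateway.
    """
    if clients < 1:
        raise ValueError("clients must be at least 1")
    for prefix in range(30, 0, -1):
        usable = 2 ** (32 - prefix) - 2 - reserved
        if usable >= clients:
            return prefix
    raise ValueError("client count does not fit in a single IPv4 subnet")


if __name__ == "__main__":
    for clients in (100, 254, 1000):
        print(f"{clients} clients -> /{smallest_prefix_for(clients)}")
```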

Verifying Critical VLAN Behavior

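Simulate an ISE outage by temporarily disabling every ISE PSN in the server group and waiting for the switch to mark them dead. All servers must be dead for the Critical VLAN to trigger. You do not need to wait for the deadtime to expire; deadtime only controls how long a dead server is skipped.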

SW1# show authentication sessions
Interface    MAC Address      Method   Domain   Status         Session ID
Gi1/0/1      a4b1.c2d3.e4f5   N/A      DATA     Auth           0A00630A00000001
Gi1/0/2      001e.7a3b.9c12   N/A      DATA     Auth           0A00630A00000002

In the Critical VLAN state, show authentication sessions shows Method: N/A because authentication was not completed: the port was authorized by the server-dead event, not by a RADIUS Access-Accept. The session ID changes when the port is moved.

SW1# show authentication sessions interface GigabitEthernet1/0/1 details
            Interface:  GigabitEthernet1/0/1
          MAC Address:  a4b1.c2d3.e4f5
              Status:  Auth
              Domain:  DATA
      Security Policy:  Restrict
      Security Status:  Unsecured
               Vlan:  50

Method status list:
       Method           State
       dot1x            Stopped
       mab              Stopped

Vlan: 50 confirms the Critical VLAN is active. Method State: Stopped confirms that authentication was not completed — the server-dead trigger bypassed the normal auth flow.


ISE PSN High Availability Design

The switch-side configuration handles failover between PSNs, but the PSNs themselves must be designed for high availability at the ISE infrastructure level.

ISE Deployment Topology for 802.1X

A production 802.1X deployment requires at minimum:

  • 1 Primary Administration Node (PAN) — configuration, policy, reporting
  • 1 Secondary Administration Node (secondary PAN) — hot standby for PAN failure
  • 2+ Policy Service Nodes (PSNs) — handle live RADIUS authentication

The PAN handles configuration and monitoring. PSNs handle all RADIUS traffic from switches. Switches should never have the PAN IP as a RADIUS server — the PAN does not process RADIUS authentication requests.

For a campus with 1,500 ports and a peak authentication rate of approximately 200 auths/minute (morning arrival surge), two PSNs with the default ISE hardware specifications are sufficient. Scale PSN count based on:

  • Peak simultaneous authentication rate
  • EAP method complexity (EAP-TLS with OCSP checking is CPU-intensive per auth)
  • Number of endpoints with short re-authentication timers

Navigate to Administration > System > Deployment to view the current node health. Each PSN shows its RADIUS request processing rate. If any PSN is consistently above 80% CPU during peak periods, add a PSN.

PSN Load Balancing Options

Option 1: IOS XE Server Group (Active/Active)
The load-balance method least-outstanding configuration described above distributes load between PSNs at the switch level. Each switch independently load-balances between the two PSNs. This approach is simple and does not require an external load balancer.

Option 2: External Load Balancer (Active/Active with persistence)
A load balancer (F5 BIG-IP or similar) can front-end multiple PSNs, presenting a single virtual IP to all switches. The switch configuration has only one RADIUS server IP: the VIP. The load balancer handles distribution and health checking.

This approach simplifies switch configuration (one server IP) and enables more sophisticated health checking (the load balancer can detect a PSN that is responding slowly but not fully down). The tradeoff is that RADIUS over a load balancer requires careful attention to source IP — ISE expects to see RADIUS requests from the switch management IP, not the load balancer IP. Configure the load balancer to preserve source IP (DNAT destination only, not source NAT).

Option 3: ISE PSN Anycast (Advanced)
A static anycast IP is assigned to multiple PSNs. Routers use ECMP to forward RADIUS packets to the nearest active PSN. When a PSN fails, the routing protocol removes the route and traffic automatically shifts to the remaining PSNs. This is the most scalable approach for large multi-site deployments and is covered in Article 30: 802.1X Scalability and High Availability Design for Large Enterprise Networks.


Per-Server Timeout Tuning

The timeout of 5 seconds and retransmit of 2 configured earlier are starting points. Tuning these values affects the user experience during failover.

Tight timers (timeout 2, retransmit 1):

  • Failover time: 4 seconds per server (2 seconds × 2 attempts)
  • With two servers: 4 seconds before moving to PSN2
  • Risk: False positives — a momentarily slow ISE response causes unnecessary server dead events

Loose timers (timeout 10, retransmit 3):

  • Failover time: 40 seconds per server (10 seconds × 4 attempts)
  • With two servers: 40 seconds of delay before PSN2 takes over
  • Risk: Extended authentication delays during failover events

For a campus access layer where 10-15 second authentication delays are acceptable, the settings used earlier (timeout 5, retransmit 2) are appropriate. For high-density environments like conference centers or exam halls where many endpoints authenticate simultaneously, tighter timers reduce the cascading effect of a slow PSN.

radius server ISE-PSN-1
 timeout 3
 retransmit 1

With these values, PSN1 is considered dead after 6 seconds (3 seconds × 2 attempts), and PSN2 is tried. Total failover time is under 10 seconds.
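One way to choose between timer settings is to compare each candidate against the failover delay you are willing to accept. The sketch below uses the same timeout × (retransmit + 1) arithmetic as this section; the profile names and the budget values are illustrative assumptions.

```python
from typing import List, NamedTuple


class TimerProfile(NamedTuple):
    name: str
    timeout_s: int
    retransmit: int

    @property
    def failover_s(self) -> int:
        """Seconds before an unresponsive server is given up on."""
        return self.timeout_s * (self.retransmit + 1)


# Timer pairs discussed in this section; the names are labels for this example only.
PROFILES = [
    TimerProfile("tight", timeout_s=2, retransmit=1),
    TimerProfile("tuned", timeout_s=3, retransmit=1),
    TimerProfile("baseline", timeout_s=5, retransmit=2),
]


def within_budget(profiles: List[TimerProfile], budget_s: int) -> List[str]:
    """Names of the profiles whose per-server failover time fits the budget."""
    return [p.name for p in profiles if p.failover_s <= budget_s]


if __name__ == "__main__":
    for budget in (5, 10, 15):
        print(f"{budget:>2} s budget:", within_budget(PROFILES, budget))
```

A profile that fits the budget is not automatically the right choice: the shorter the timers, the higher the risk of marking a merely slow PSN dead.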


Monitoring RADIUS Health in Production

Do not wait for users to report authentication failures to discover a RADIUS issue. Configure SNMP traps or syslog alerts for RADIUS server state changes.

! Syslog to central server
logging host 10.0.0.20
logging trap informational
logging facility local6

! RADIUS state change messages are logged at severity 3 (error) or 4 (warning)
! Example syslog message when server goes dead:
! %RADIUS-3-ALLDEADSERVER: Group ISE_SERVERS: No active radius servers found

The syslog message %RADIUS-3-ALLDEADSERVER is the most critical alert — it means all RADIUS servers are dead and the Critical VLAN is now active across the switch. Configure your SIEM or monitoring platform to generate a P1 alert on this message.

For individual server dead events:

! %RADIUS-4-RADIUS_DEAD: RADIUS server 10.0.0.10:1812 is not responding.

This message indicates PSN1 is dead but PSN2 is still active. Authentication continues with a slight delay. Still worth alerting on, but lower urgency than ALLDEADSERVER.
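If your monitoring platform accepts custom parsing logic, the two alert levels can be expressed as a small classifier. The sketch below matches only the two mnemonics quoted in this section; the P1 label comes from the text above, while P3 is an arbitrary stand-in for "lower urgency".

```python
import re
from typing import Optional

# Mnemonics from the syslog messages quoted above; priority labels are illustrative.
PRIORITY_BY_MNEMONIC = {
    "ALLDEADSERVER": "P1",  # every server in the group is dead
    "RADIUS_DEAD": "P3",    # one server is dead, others may still answer
}

MESSAGE_RE = re.compile(r"%RADIUS-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+):")


def classify(line: str) -> Optional[str]:
    """Return an alert priority for a syslog line, or None if it is not of interest."""
    match = MESSAGE_RE.search(line)
    if match is None:
        return None
    return PRIORITY_BY_MNEMONIC.get(match.group("mnemonic"))


if __name__ == "__main__":
    samples = [
        "%RADIUS-3-ALLDEADSERVER: Group ISE_SERVERS: No active radius servers found",
        "%RADIUS-4-RADIUS_DEAD: RADIUS server 10.0.0.10:1812 is not responding.",
    ]
    for line in samples:
        print(classify(line), "<-", line)
```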


CoA Redundancy

Change of Authorization (CoA) adds a complication to RADIUS redundancy. CoA packets are sent from ISE to the switch, not the other way around. If the switch is configured to accept CoA only from ISE-PSN-1 (10.0.0.10) and authentication has failed over to ISE-PSN-2 (10.0.0.11), CoA packets from PSN2 will be rejected by the switch.

aaa server radius dynamic-author
 client 10.0.0.10 server-key ISEsecret123
 client 10.0.0.11 server-key ISEsecret123
 port 1700
 auth-type all

Both PSN IPs must be listed as authorized CoA clients. The switch accepts CoA from either PSN regardless of which PSN processed the original authentication. For the full CoA configuration context, see Article 19: Change of Authorization (CoA) in 802.1X.
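A mismatch between the RADIUS server list and the CoA client list is easy to miss in review. The sketch below checks a configuration snippet for server addresses that have no matching dynamic-author client line. It is a plain text comparison under the assumption that the config uses the same line formats as the examples in this article, and the sample deliberately omits one client to show a finding.

```python
import re
from typing import List

# Hypothetical excerpt of a running config; the second CoA client is missing on purpose.
CONFIG = """\
radius server ISE-PSN-1
 address ipv4 10.0.0.10 auth-port 1812 acct-port 1813
radius server ISE-PSN-2
 address ipv4 10.0.0.11 auth-port 1812 acct-port 1813
aaa server radius dynamic-author
 client 10.0.0.10 server-key ISEsecret123
"""


def missing_coa_clients(config: str) -> List[str]:
    """RADIUS server IPs that are not listed as dynamic-author (CoA) clients."""
    servers = set(re.findall(r"^ +address ipv4 (\S+)", config, re.MULTILINE))
    clients = set(re.findall(r"^ +client (\S+)", config, re.MULTILINE))
    return sorted(servers - clients)


if __name__ == "__main__":
    print("PSNs missing from the CoA client list:", missing_coa_clients(CONFIG))
```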


Troubleshooting

Symptom: show radius server-group all shows PSN1 as DEAD with deadtime remaining, but ISE is actually running normally

Cause: The switch is marking ISE-PSN-1 as dead due to timeout, but ISE is running. This is usually a network path issue — an intermediate firewall, ACL, or routing change is blocking UDP 1812 from the switch to that specific PSN IP. The ISE process is running, but UDP 1812 packets are not arriving.

Fix: Run ping 10.0.0.10 source vlan 99 from the switch to test basic ICMP reachability. If ping succeeds but RADIUS is still timing out, the issue is specific to UDP 1812. Check any firewall rules between the management VLAN (10.0.99.0/24) and the ISE network. In ISE, navigate to Operations > RADIUS > Live Logs and filter by NAS-IP to confirm whether any packets are arriving from the switch. Zero entries mean the packets are not reaching ISE at all.


Symptom: After PSN2 takes over (PSN1 dead), some endpoints cannot re-authenticate even though PSN2 is active and accepting requests

Cause: PSN2 may not have the same policy configuration as PSN1. ISE PSNs replicate policy from the PAN, but replication can lag — especially if the deployment was recently modified. Another cause: some endpoints have sessions established on PSN1 that included RADIUS session tracking, and PSN2 does not have those sessions in its context for CoA purposes.

Fix: In ISE, navigate to Administration > System > Deployment and check the replication status for PSN2. If there are pending replication items, wait for sync to complete. For CoA issues after failover, the sessions initiated on PSN1 cannot be managed via CoA from PSN2; this is a known architectural limitation. The practical fix is to clear the affected sessions from the switch side so that they authenticate again against PSN2: clear authentication sessions interface [interface].


Symptom: After ISE recovery, endpoints stuck in Critical VLAN (VLAN 50) do not re-authenticate automatically

Cause: The authentication event server alive action reinitialize command is missing from the interface configuration, or the automate-tester probe is not configured, so the switch does not detect ISE recovery.

Fix: Verify the interface config includes both authentication event server dead action authorize vlan 50 and authentication event server alive action reinitialize. Verify the automate-tester username radiustest probe-on is configured under the RADIUS server stanza and that the radiustest username exists in ISE. If the probe user does not exist, the probe generates a failed authentication response, and the switch may interpret this as ISE still being unavailable depending on the response type. After fixing the config, manually clear the dead server status with clear aaa counters servers radius [ip] and verify with show radius server-group all.
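The interface check can be automated across a saved configuration. The sketch below flags 802.1X-enabled interfaces that lack either the server dead or the server alive command. The sample config and function names are hypothetical, and the parser assumes the usual one-space indentation of interface sub-commands.

```python
from typing import Dict, List

REQUIRED = (
    "authentication event server dead action authorize vlan",
    "authentication event server alive action reinitialize",
)

# Hypothetical config excerpt: Gi1/0/2 is missing the server alive action.
SAMPLE = """\
interface GigabitEthernet1/0/1
 authentication event server dead action authorize vlan 50
 authentication event server alive action reinitialize
 authentication port-control auto
!
interface GigabitEthernet1/0/2
 authentication event server dead action authorize vlan 50
 authentication port-control auto
!
"""


def split_interfaces(config: str) -> Dict[str, List[str]]:
    """Map each interface name to its list of sub-commands."""
    stanzas, current = {}, None
    for line in config.splitlines():
        if line.startswith("interface "):
            current = line.split(None, 1)[1]
            stanzas[current] = []
        elif line.startswith(" ") and current is not None:
            stanzas[current].append(line.strip())
        else:
            current = None  # "!" or any top-level line ends the stanza
    return stanzas


def missing_commands(config: str) -> Dict[str, List[str]]:
    """Required commands absent from each 802.1X-enabled interface."""
    report = {}
    for name, lines in split_interfaces(config).items():
        if "authentication port-control auto" not in lines:
            continue
        missing = [cmd for cmd in REQUIRED
                   if not any(entry.startswith(cmd) for entry in lines)]
        if missing:
            report[name] = missing
    return report


if __name__ == "__main__":
    print(missing_commands(SAMPLE))
```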


What's Next: Article 29: Cisco TrustSec and SGTs: How They Integrate with 802.1X — 802.1X controls who gets on the network. TrustSec controls what traffic can flow between authenticated endpoints after they are on the network. Article 29 explains the Security Group Tag (SGT) model, how SGTs are assigned at the 802.1X authentication point, and how to configure TrustSec policy on IOS XE with ISE as the policy server.

© 2025 Ping Labz. All rights reserved.