ASA: Redundancy, Failover Triggers, and Health Monitoring

Taisuke Nakamura · 09-19-2017

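Introduction
How ASA Redundancy Works
Failover Triggers
Health Monitoring
1. Unit Health Monitoring
2. Interface Health Monitoring
Failure Cases and How to Verify Them
1. Test topology and configuration
2. Power failure on the Primary (Active) unit
3. Uplink failure on the Primary (Active) unit (hang)
4. Uplink failure on the Primary (Active) unit (shutdown)
Detailed Verification with Debugs
Best Practices
1. Cabling the failover link
2. Tuning polltime and holdtime
3. Tuning monitored interfaces
4. Tuning monitored service modules
5. Using virtual MAC addresses on data interfaces (when using NAT)
6. Quickly checking each unit's priority and state
7. Enabling syslog output on the Standby unit
8. Port settings on devices connected to the ASA (when using switches)
References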

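Introduction

In an ASA redundant configuration, each unit monitors the other, and when a problem is detected, events such as an Active/Standby switchover (failover) occur.

This document explains the ASA redundancy feature, the triggers for failover, and the health monitoring implementation that drives those decisions, using real examples. It also presents best practices for operating a redundant pair stably.

This document was verified and written using ASA version 9.1(6)10.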

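How ASA Redundancy Works

Two chassis (units) form one redundant pair. One unit is designated as the Primary unit and the other as the Secondary unit. When the two units have synchronized correctly, one dynamically becomes the Active unit and the other the Standby unit.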

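Priority / State	Selection	Description
Primary	Manual configuration	Explicitly designate one unit as Primary; the Primary unit preferentially tries to become Active
Secondary	Manual configuration	Explicitly designate one unit as Secondary; the Secondary unit preferentially tries to become Standby
Active	Automatic election	Processes traffic; uses the Active IP addresses; when the configuration changes, synchronizes the change to the Standby unit
Standby	Automatic election	Waits and continuously synchronizes connection information from the Active unit; uses the Standby IP addresses; is promoted from Standby to Active when the Active unit fails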

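The Active unit processes traffic while the Standby unit waits. The Standby unit continuously synchronizes and monitors the Active unit's connection state so that it is ready for an Active unit failure.

Management access (SSH or ASDM) is possible to both the Active unit and the Standby unit.

Configuration changes can be made on the Active unit, and the commands entered are replicated promptly from the Active unit to the Standby unit. When the Standby unit is rebooted, it synchronizes the entire configuration from the Active unit as it boots and then resynchronizes session state.

The difference between the Primary and Secondary units is that the Primary unit preferentially tries to become Active. The following are examples of Active/Standby election in different cases.

If both units boot at the same time (for example, after a power failure), the Primary unit becomes Active.
If the Secondary unit boots first and is already Active (for example, after a power failure) when the Primary unit boots, the Primary unit becomes Standby.
If the units cannot see each other and both are Active, then when the redundant pair recovers, the Secondary unit changes to Standby and, as the Standby unit, resynchronizes its configuration.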
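The election cases above can be summarized as a small decision function. This is only a minimal sketch: the function name `negotiate` and the state labels "booting", "active", and "standby" are hypothetical names chosen for illustration and do not correspond to ASA internals.

```python
from typing import Tuple


def negotiate(primary: str, secondary: str) -> Tuple[str, str]:
    """Return (primary_state, secondary_state) once the units can see each other.

    Each input is the unit's state at that moment: "booting" or "active".
    These labels are illustrative assumptions, not ASA internals.
    """
    if primary == "booting" and secondary == "active":
        # Secondary came up first and is already Active: Primary joins as Standby.
        return ("standby", "active")
    # Both booting together, both Active after losing sight of each other,
    # or Primary already Active: the Primary unit ends up Active.
    return ("active", "standby")


if __name__ == "__main__":
    print(negotiate("booting", "booting"))  # simultaneous boot
    print(negotiate("booting", "active"))   # Secondary was already Active
    print(negotiate("active", "active"))    # recovery from both-Active
```

The only case in which the Primary unit does not end up Active is when the Secondary unit is already Active before the Primary unit boots.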

　　　　
　
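When a problem occurs on the Active unit, the Standby unit takes over (swaps) the Active unit's IP and MAC addresses and is promoted to Active. At switchover it sends gratuitous ARP (GARP) to notify neighboring devices, which allows traffic paths to switch quickly and communication to continue.

The units use dedicated links between them to check and control each other's state and to synchronize connections and other data. There are two links with different functions, described below, and they can share a single physical link or logical link (LACP).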

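Failover Link

Unit state information (Active or Standby)
Hello messages (keepalives)
Network link status
Replication and synchronization of the configuration from the Active unit to the Standby unit, and so on

Stateful Failover Link

TCP/UDP connection information
NAT translation information
ARP table
ISAKMP and IPsec SA information
Dynamic routing information, and so on

In the actual redundant configuration example, the Failover Link and the Stateful Failover Link share the physical link Gi0/2.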

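Failover Triggers

Failover can occur when any of the following events happens.

Hardware failure (including modules) or power loss
Software failure
Failure of a monitored interface
Manual switchover with a failover command (e.g., running the failover active command on the target unit to make it Active)

The following table shows whether failover occurs in each failure case.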

Failure Event	Policy	Active Action	Standby Action
Active unit failure (power or hardware)	Failover	-	Becomes Active
Formerly Active unit recovers	No failover	Becomes Standby	-
Standby unit failure (power or hardware)	No failover	Marks the Standby unit as Failed	-
Failover link failure during operation	No failover	Marks the failover link as Failed	Marks the failover link as Failed
Failover link failure at startup	No failover	Marks the failover link as Failed	Becomes Active (= both units Active)
Stateful failover link failure	No failover	-	-
Interface failure on the Active unit	Failover	Marks itself as Failed	Becomes Active
Interface failure on the Standby unit	No failover	-	Marks itself as Failed

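Health Monitoring

The ASA performs the following two health checks to monitor the units and their interfaces.

1. Unit Health Monitoring

The units exchange keepalives at a fixed interval over the failover interface. If the peer does not respond, failover is triggered.

2. Interface Health Monitoring

The units exchange keepalives at a fixed interval over each data interface (2.1). If the peer does not respond, the ASA runs the interface tests (2.2 to 2.4) and makes the failover decision (2.5). Steps 2.1 to 2.5 are carried out within the hold time.

2.1. Interface monitoring by the failover health monitoring thread
- The units exchange keepalives at a fixed interval (5 seconds by default); if three go unanswered, the interface tests (2.2 to 2.4) are run.

2.2. LINK and TRAFFIC test
- After a quick check that its own link is up, the unit checks whether it can receive any packets. (Note that by default multicast traffic is not counted for this check.) If the peer can receive traffic and this unit cannot, the unit marks its own interface as Failed.

2.3. ARP test
- The unit sends ARP requests to devices in recently learned ARP entries and checks whether they resolve. If the peer receives replies and this unit does not, the unit marks its own interface as Failed.

2.4. PING test
- The unit sends a broadcast ping and checks for replies. If the peer receives replies and this unit does not, the unit marks its own interface as Failed.

2.5. Failover decision by the failover health monitoring thread
- The unit evaluates the results of interface tests 2.2 to 2.4 and compares them with the peer's results; if the Active unit has more failed interfaces, failover is triggered.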
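Steps 2.2 to 2.5 can be sketched roughly as follows. This is a simplified model under stated assumptions, not the ASA implementation: the function names, the dictionary keys, and the "unknown" result returned when neither unit receives anything in any test are hypothetical choices made for this illustration.

```python
from typing import Dict

# Order of the tests that follow the link check (2.2 to 2.4).
TESTS = ("traffic", "arp", "ping")


def interface_test(own: Dict[str, bool], mate: Dict[str, bool]) -> str:
    """Model tests 2.2-2.4 for one interface.

    own / mate map "link", "traffic", "arp", "ping" to True when that unit's
    link is up or it received something during that test.
    """
    if not own["link"]:
        return "failed"          # own link is down
    for name in TESTS:
        if own[name]:
            return "passed"      # this unit received something
        if mate[name]:
            return "failed"      # the peer received, this unit did not
    return "unknown"             # assumption: neither unit could decide


def should_failover(active_failed: int, standby_failed: int) -> bool:
    """Step 2.5: fail over when the Active unit has more failed interfaces."""
    return active_failed > standby_failed


if __name__ == "__main__":
    # Active unit receives nothing; the peer gets an ARP reply.
    active = {"link": True, "traffic": False, "arp": False, "ping": False}
    standby = {"link": True, "traffic": False, "arp": True, "ping": False}
    print(interface_test(active, standby), should_failover(1, 0))
```

In the usage example, the Active unit fails at the ARP stage because only the peer receives a reply, and one failed interface against zero on the peer is enough for the failover decision.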

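Even after an interface has been marked Failed, if it receives any traffic it is considered recovered and returns to the operational state.

When an interface link fails (goes down), the switchover is fast: failover is triggered quickly and usually converges within a few seconds.

The configured values for each health monitoring function can be checked with the show failover command.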

ciscoasa/pri/act# show failover
Failover On
Failover unit Primary
Failover LAN Interface: fover GigabitEthernet0/3 (up)
Unit Poll frequency 1 seconds, holdtime 15 seconds <--- Unit Health Monitoring
Interface Poll frequency 5 seconds, holdtime 25 seconds <--- Interface Health Monitoring
Interface Policy 1
Monitored Interfaces 1 of 160 maximum
Version: Ours 9.1(6)10, Mate 9.1(6)10
Group 1 last failover at: 12:07:54 JST Jan 18 2016
Group 2 last failover at: 12:07:54 JST Jan 18 2016
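When checking many units, the two timer lines can be pulled out of saved show failover output with a short script. The following is a minimal sketch that assumes the exact line wording shown above with values in seconds; the function name `parse_timers` is hypothetical, and timers displayed in other units are not handled.

```python
import re
from typing import Dict, Tuple

# Matches the two "Poll frequency" lines as they appear in the output above.
TIMER_RE = re.compile(
    r"^(Unit|Interface) Poll frequency (\d+) seconds, holdtime (\d+) seconds",
    re.MULTILINE,
)


def parse_timers(show_failover: str) -> Dict[str, Tuple[int, int]]:
    """Return {"unit": (poll, hold), "interface": (poll, hold)} in seconds."""
    return {
        kind.lower(): (int(poll), int(hold))
        for kind, poll, hold in TIMER_RE.findall(show_failover)
    }


SAMPLE = """\
Failover On
Failover unit Primary
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
"""

if __name__ == "__main__":
    print(parse_timers(SAMPLE))
```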

Failure Cases and How to Verify Them

1. Test topology and configuration

[Configuration]

ciscoasa/pri/act# show failover
Failover On
Failover unit Primary
Failover LAN Interface: fover GigabitEthernet0/3 (up)
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 1 of 160 maximum
Version: Ours 9.1(6)10, Mate 9.1(6)10
Group 1 last failover at: 14:51:43 JST Jan 18 2016
Group 2 last failover at: 14:51:43 JST Jan 18 2016

This host: Primary
 Group 1 State: Active
 Active time: 9666 (sec)
 Group 2 State: Active
 Active time: 67 (sec)

slot 0: ASA5520 hw/sw rev (2.0/9.1(6)10) status (Up Sys)
 fg02 Interface inside (10.10.10.254): Normal (Monitored)
 slot 1: empty

Other host: Secondary
 Group 1 State: Standby Ready
 Active time: 1018 (sec)
 Group 2 State: Standby Ready
 Active time: 9278 (sec)

slot 0: ASA5520 hw/sw rev (2.0/9.1(6)10) status (Up Sys)
 fg02 Interface inside (10.10.10.253): Normal (Monitored)
 slot 1: empty

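2. Power failure on the Primary (Active) unit

The following is the log of the Secondary (Standby) unit when the Primary (Active) unit fails due to power loss. When the Primary (Active) unit loses power, the switchover occurs within the hold time.

[Secondary (Standby) unit log]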

ciscoasa/sec/stby> 
Jan 18 2016 11:50:43: %ASA-1-105043: (Secondary) Failover interface failed    <--- Interface down detected
Jan 18 2016 11:50:43: %ASA-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=401,op=0,my=Standby Ready,peer=Active.
Jan 18 2016 11:50:43: %ASA-6-720024: (VPN-Secondary) HA status callback: Control channel is down.
Jan 18 2016 11:50:43: %ASA-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=402,op=0,my=Standby Ready,peer=Active.
Jan 18 2016 11:50:43: %ASA-6-720025: (VPN-Secondary) HA status callback: Data channel is down.
Jan 18 2016 11:50:56: %ASA-1-103001: (Secondary) No response from other firewall (reason code = 4).
    --- snip ---
Jan 18 2016 11:50:56: %ASA-1-104001: (Secondary) Switching to ACTIVE - HELLO not heard from mate. <---- Failover occurs

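3. Uplink failure on the Primary (Active) unit (hang)

The following are the logs of the Primary (Active) and Secondary (Standby) units when the upstream switch of the Primary (Active) unit hung (at around 10:20:54) and could no longer process packets while its links stayed up. The state changes in the following order.

1. Keepalives between the interfaces of the Primary and Secondary units fail three times (10:21:04)

2. During the interface tests, the Primary (Active) unit fails the ARP test while the Secondary (Standby) unit passes it, so the interface on the Primary (Active) unit is marked Failed (10:21:07)

3. Because the Standby unit is healthier than the Active unit, the Primary (Active) unit finally determines that it has failed and fails over to the Secondary unit, which becomes Active (10:21:11)

[Primary (Active) unit log] (debug fover fail + inside interface capture enabled)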

ciscoasa/pri/act/fg02# 
Jan 18 2016 10:21:04: %ASA-1-105005: (Primary_group_2) Lost Failover communications with mate on interface inside
Jan 18 2016 10:21:04: %ASA-1-105008: (Primary) Testing Interface inside
Jan 18 2016 10:21:04: %ASA-0-711001: fover_fail_check: ifc_monitor(20002) hcnt(3) exceeded threshold <--- Three keepalives at 5-second intervals failed
Jan 18 2016 10:21:07: %ASA-1-105009: (Primary_group_2) Testing on interface inside Failed <--- ARP test failed; marked Failed
Jan 18 2016 10:21:11: %ASA-1-104002: (Primary_group_2) Switching to STANDBY - Interface check <--- Failover occurs

ciscoasa/pri/stby/fg02# show capture IN
 　　　--- snip ---
 9: 10:20:49.156119 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 48
 10: 10:20:49.390589 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48
 11: 10:20:54.154807 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 48
 12: 10:20:54.390604 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48 
 13: 10:20:59.390772 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48 
 14: 10:21:04.390620 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48 <--- Three keepalives at 5-second intervals failed
 15: 10:21:06.190129 802.1Q vlan#991 P0 arp who-has 10.10.10.1 tell 10.10.10.254 <--- ARP test; fails because there is no reply
 16: 10:21:09.390604 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48
 17: 10:21:11.641400 802.1Q vlan#991 P0 arp who-has 10.10.10.253 tell 10.10.10.253 <--- Failover occurs
 18: 10:21:11.641553 802.1Q vlan#991 P0 arp who-has 10.10.10.253 tell 10.10.10.253
 19: 10:21:11.641675 802.1Q vlan#991 P0 arp who-has 10.10.10.253 tell 10.10.10.253
 20: 10:21:14.390574 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 48 <--- IP addresses swapped by the failover
 21: 10:21:19.390482 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 48
 22: 10:21:24.390497 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 48
22 packets shown

ciscoasa/pri/stby/fg02# show failover
Failover On
Last Failover at: 10:21:11 JST Jan 18 2016
 This context: Failed <--- Own state changed to Failed
 Active time: 269 (sec)
 Interface inside (10.10.10.253): Failed (Waiting) <--- Interface changed to Failed
 Peer context: Active
 Active time: 101 (sec)
 Interface inside (10.10.10.254): Normal (Waiting)

[Secondary (Standby) unit log] (debug fover fail + inside interface capture enabled)

ciscoasa/sec/stby/fg02# 
Jan 18 2016 10:21:11: %ASA-1-104001: (Secondary_group_2) Switching to ACTIVE - Other unit wants me Active. Primary unit switch reason: Interface check. <--- Failover occurs

ciscoasa/sec/act/fg02# show capture IN
　　　 --- snip ---
 9: 10:20:53.980372 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 48
 10: 10:20:54.216343 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48
 11: 10:20:58.980357 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 48 
 12: 10:21:03.980524 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 48
 13: 10:21:04.318617 802.1Q vlan#991 P0 arp who-has 10.10.10.1 tell 10.10.10.253 <--- ARP Test
 14: 10:21:04.318663 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 44
 15: 10:21:04.319014 802.1Q vlan#991 P6 arp reply 10.10.10.1 is-at 0:15:c7:23:f1:80 <--- ARP test succeeds
 16: 10:21:04.409951 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 44
 17: 10:21:04.509983 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 44
 　　　--- snip ---
 33: 10:21:06.019087 802.1Q vlan#991 P0 arp who-has 10.10.10.1 tell 10.10.10.253 <--- ARP Test
 34: 10:21:06.019118 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 44
 35: 10:21:06.019484 802.1Q vlan#991 P6 arp reply 10.10.10.1 is-at 0:15:c7:23:f1:80 <--- ARP test succeeds
 　　　--- snip ---
 52: 10:21:07.709924 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 44
 53: 10:21:07.809940 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 44
 54: 10:21:07.909926 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 44
 55: 10:21:08.980357 802.1Q vlan#991 P0 10.10.10.253 > 10.10.10.254: ip-proto-105, length 48
 56: 10:21:11.473852 802.1Q vlan#991 P0 arp who-has 10.10.10.254 tell 10.10.10.254
 57: 10:21:11.473989 802.1Q vlan#991 P0 arp who-has 10.10.10.254 tell 10.10.10.254
 58: 10:21:11.474111 802.1Q vlan#991 P0 arp who-has 10.10.10.254 tell 10.10.10.254
 59: 10:21:13.980570 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48 <--- IP addresses swapped by the failover
 60: 10:21:18.980494 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48
 61: 10:21:23.980479 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48
 62: 10:21:28.980631 802.1Q vlan#991 P0 10.10.10.254 > 10.10.10.253: ip-proto-105, length 48
62 packets shown

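4. Uplink failure on the Primary (Active) unit (shutdown)

The following are the logs of the Primary (Active) and Secondary (Standby) units when the interface on the upstream switch of the Primary (Active) unit was administratively shut down. Failover is triggered quickly and usually converges within a few seconds.

[Primary (Active) unit log]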

ciscoasa/pri/act# 
Jan 18 2016 12:13:14: %ASA-1-105007: (Primary) Link status 'Down' on interface inside <--- Interface down detected
Jan 18 2016 12:13:15: %ASA-4-411002: Line protocol on Interface GigabitEthernet0/1, changed state to down
Jan 18 2016 12:13:15: %ASA-4-411002: Line protocol on Interface inside, changed state to down
Jan 18 2016 12:13:16: %ASA-1-104002: (Primary_group_2) Switching to STANDBY - Interface check <--- Failover within a few seconds

[Secondary (Standby) unit log]

ciscoasa/sec/stby/fg02# 
Jan 18 2016 12:13:17: %ASA-1-104001: (Secondary_group_2) Switching to ACTIVE - Other unit wants me Active. Primary unit switch reason: Interface check.
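When reviewing logs like the ones in cases 2 to 4, it can help to extract only the switchover messages (syslog IDs 104001 and 104002) from a saved log. The following is a minimal sketch that assumes the timestamp and message layout shown in the logs above; the function name `switch_events` and the returned field names are hypothetical.

```python
import re
from typing import Dict, List

# Timestamp, syslog ID, unit label, new state, and reason, as laid out in the
# log lines shown in this document.
SWITCH_RE = re.compile(
    r"^(?P<time>\w{3} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2}): "
    r"%ASA-1-(?P<id>10400[12]): \((?P<unit>[^)]+)\) "
    r"Switching to (?P<state>ACTIVE|STANDBY) - (?P<reason>.+?)\s*$"
)


def switch_events(lines: List[str]) -> List[Dict[str, str]]:
    """Return one dict per 104001/104002 switchover message found in lines."""
    events = []
    for line in lines:
        match = SWITCH_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


LOG = [
    "Jan 18 2016 12:13:14: %ASA-1-105007: (Primary) Link status 'Down' on interface inside",
    "Jan 18 2016 12:13:16: %ASA-1-104002: (Primary_group_2) Switching to STANDBY - Interface check",
    "Jan 18 2016 12:13:17: %ASA-1-104001: (Secondary_group_2) Switching to ACTIVE - "
    "Other unit wants me Active. Primary unit switch reason: Interface check.",
]

if __name__ == "__main__":
    for event in switch_events(LOG):
        print(event["time"], event["unit"], event["state"], "-", event["reason"])
```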

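Detailed Verification with Debugs

Enabling the following debugs makes it possible to investigate the progress of the interface tests and the reason a failover occurred in detail.

debug fover ifc            # Detailed output of the interface tests
debug fover switch         # Detailed output related to switchover
debug fover fail           # Displays fail checks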

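For debug timestamp settings and logging destination settings, refer to the following document.

ASA: Troubleshooting with logging debug-trace
https://supportforums.cisco.com/ja/document/12271691

When using multiple context mode, configure the logging settings and enable the debugs in the admin context.

The following is the output when debugs and logging (level=error) were enabled on the console in a test equivalent to the "Uplink failure on the Primary (Active) unit (hang)" case. (The date and syslog IDs have been removed for readability.)

# Upstream switch failure occurred at Jan 18 2016 20:28:35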
ciscoasa/pri/act#
20:28:34: fover_health_monitoring_thread: ifc_check() group: 1, - time = 30563070
20:28:34: fover_health_monitoring_thread: ifc_check() group: 2, - time = 30563070
20:28:35: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30563320 
20:28:36: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30564320
20:28:37: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30565320
20:28:37: fover_health_monitoring_thread: ifc_check() group: 1, - time = 30565570
20:28:37: fover_health_monitoring_thread: ifc_check() group: 2, - time = 30565570
20:28:38: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30566320
20:28:39: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30567320
20:28:39: fover_health_monitoring_thread: ifc_check() group: 1, - time = 30568070
20:28:39: fover_health_monitoring_thread: ifc_check() group: 2, - time = 30568070
20:28:40: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30568320
20:28:41: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30569320
20:28:42: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30570320
20:28:42: fover_health_monitoring_thread: ifc_check() group: 1, - time = 30570570
20:28:42: fover_health_monitoring_thread: ifc_check() group: 2, - time = 30570570
20:28:43: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30571320
20:28:43: fover_parse: send_arp(50002) - 10.10.10.1
20:28:44: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30572320
20:28:44: fover_health_monitoring_thread: ifc_check() group: 1, - time = 30573070
20:28:44: fover_health_monitoring_thread: ifc_check() group: 2, - time = 30573070
20:28:45: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30573320
20:28:45: (Primary_group_2) Lost Failover communications with mate on interface inside <---
20:28:45: (Primary) Testing Interface inside <--- Interface test on inside starts
20:28:45: fover_fail_check: ifc_monitor(50002) hcnt(3) exceeded threshold <--- Keepalives failed
20:28:45: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:45: fover_ifc_test: ifc_test(50002) - LINKTEST  <--- LINK test
20:28:45: fover_parse: send_arp(50002) - 10.10.10.1
20:28:45: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:45: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560  <--- TRAFFIC test
20:28:45: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:45: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:45: fover_parse: parse_thread_helper() - mate ifc 327682 link status change from IFC_TESTING to IFC_PASSED
20:28:45: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:45: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:45: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:45: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30574320
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:46: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:46: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:47: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30575320
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002) - TRAFFICTEST, starttime 1d28560
20:28:47: fover_health_monitoring_thread: ifc_check() group: 1, - time = 30575570
20:28:47: fover_health_monitoring_thread: ifc_check() group: 2, - time = 30575570
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002,0) - ARPTEST, starttime 0 <--- ARP test starts
20:28:47: fover_ifc_test: send_arp(50002) - 10.10.10.1 <--- ARP request sent
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:47: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:47: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:48: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30576320
20:28:48: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:48: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:48: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:48: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:48: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:48: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:48: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:48: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:52: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:52: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:52: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:52: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:52: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:52: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:52: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:52: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:52: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:52: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:52: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:52: fover_ifc_test: ifc_test(50002) - WRCNT, starttime 1d28c04
20:28:52: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30577320
20:28:52: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:52: fover_ifc_test: ifc_test(50002) - WRESULT, starttime 1d292a8
20:28:52: (Primary_group_2) Testing on interface inside Failed <--- No ARP reply seen; Failed
20:28:52: fover_ifc_test: ifc_test(50002) - test wait time = 1562
20:28:52: fover_ifc_test: ifc_test(50002) - GOTRESULT, starttime 0
20:28:52: fover_ifc_test: ifc_test(50002) - ENDTEST 
20:28:52: fover_ifc_test: ifc_test(50002) completed: test state ENDTEST, ifc status IFC_FAILED <--- Test ends
20:28:52: fover_health_monitoring_thread: ifc_check() group: 1, - time = 30578070
20:28:52: fover_health_monitoring_thread: ifc_check() group: 2, - time = 30578070
20:28:52: fover_health_monitoring_thread: ifc_check() - group 2 HW failed 1 (mate 0) <--- Compares itself with the peer
20:28:52: fover_health_monitoring_thread: Skip switch to failed group: 2, time = 30578070
20:28:52: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30578320
20:28:52: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30579320
20:28:52: fover_health_monitoring_thread: poll_count_check: pcnt = 0, at 30580320
20:28:52: (Primary_group_2) Switching to STANDBY - Interface check <--- Failover starts
20:28:52: fover_health_monitoring_thread: ifc_check() group: 1, - time = 30580570
20:28:52: fover_health_monitoring_thread: ifc_check() group: 2, - time = 30580570
20:28:52: fover_health_monitoring_thread: ifc_check() - group 2 HW failed 1 (mate 0)
20:28:52: fover_FSM_thread: HA FSM: group 2, state - Active, event MOVE_TO_STATE, operand 20, 24
20:28:52: fover_FSM_thread: HA FSM: start switching from Active to Failed 
20:28:52: fover_FSM_thread: HA FSM: switched from Active to Failed (Interface check)
20:28:52: fover_FSM_thread: HA FSM: group 2, state - Failed, event SELF_STATE_CHANGED, operand 24, 24

The debugs and logs above show that keepalives on the inside interface failed, the ARP test failed, and failover occurred because the state of this unit (Primary/Active) was worse than that of the peer (mate).

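Best Practices

1. Cabling the failover link

The failover link is a critical link for monitoring between the units. We recommend connecting the ASA units directly with a LAN cable. We also recommend making the physical link redundant, for example with a logical link (LACP).

Avoid placing a switch in between on the failover link. If the intermediate switch hangs or stops, the ASAs can no longer check each other's state or synchronize the configuration between Primary and Secondary. Problems on an intermediate switch are by nature hard to notice, so investigation and resolution tend to take longer.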

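2. Tuning polltime and holdtime

The timers for unit health monitoring and interface monitoring can be tuned, but in most cases the default settings are fine.

If you do tune them, configure sufficiently long times. In particular, the heavier the traffic and load in the environment, the less you should tune polltime and holdtime to extremely short values.

With a short polling interval or holdtime, the pair becomes more susceptible to events such as a temporary run of dropped keepalives or stalled failover processing when the ASA or surrounding devices are under very heavy load, which raises the risk of unexpected failover.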
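One rough way to reason about the margin is to count how many polling intervals fit into one holdtime: the smaller the number, the fewer consecutive keepalives can be lost before the timer expires. The following is a minimal arithmetic sketch, not an ASA formula; the function name is hypothetical, and the 200 ms / 800 ms pair is a hypothetical aggressive setting used only for comparison with the values shown by show failover earlier in this document.

```python
def polls_per_holdtime(poll_ms: int, hold_ms: int) -> int:
    """Number of whole polling intervals that fit into one holdtime."""
    if poll_ms <= 0 or hold_ms < poll_ms:
        raise ValueError("expected 0 < poll_ms <= hold_ms")
    return hold_ms // poll_ms


if __name__ == "__main__":
    # Values shown by "show failover" in this document.
    print("unit      1 s / 15 s :", polls_per_holdtime(1000, 15000))
    print("interface 5 s / 25 s :", polls_per_holdtime(5000, 25000))
    # Hypothetical aggressive tuning, for comparison only.
    print("unit 200 ms / 800 ms :", polls_per_holdtime(200, 800))
```

With the aggressive pair, only a handful of keepalives dropped within well under one second is enough to reach the holdtime, which is the situation the paragraph above warns about.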

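3. Tuning monitored interfaces

To prevent unexpected interface tests and failover, consider disabling monitoring on interfaces that do not need it with the "no monitor-interface [if name]" command.

For example, for an interface shared among contexts in multiple context mode, enabling monitoring of the shared interface in just one context is sufficient to detect the directly connected interface going down, or a device or interface on the path going down.

As another example, disabling monitoring of the management interface prevents failover caused by the management interface or its path going down. As a result, maintenance such as reconfiguring the management segment becomes easier.

Monitoring only the important interfaces also reduces the load on the ASA failover monitoring process.

Example: disabling monitoring of a specific interface (management)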

no monitor-interface management

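4. Tuning monitored service modules

When a service module (such as Firepower or ASA-CX) is used, its health monitoring can be disabled with the following command. Disabling monitoring of a service module that is not in use for some reason reduces the risk of failover while managing the module (including re-setup) or when it has trouble. The command is supported starting with ASA version 9.3(1).

Example: disabling service module monitoring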

no monitor-interface service-module

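5. Using virtual MAC addresses on data interfaces (when using NAT)

Especially in environments that use NAT, we recommend using arbitrary virtual MAC addresses on data interfaces instead of the default burned-in MAC addresses.

In a redundant pair, by default the burned-in MAC address of the Primary unit's interface is associated with that interface's Active IP address. If a failure occurs on the Primary (Active) unit and the Secondary unit takes over as Active, the Secondary (Active) unit inherits and uses the Primary unit's MAC address. This makes traffic recover smoothly at switchover.

With this default behavior, there is a risk of traffic impact when NAT is in use and, for example, the Primary unit is replaced.

When the Primary unit is replaced for maintenance, its MAC address changes, so the pair's "Active IP : MAC address" mapping changes (old Primary → new Primary). The ASA promptly tries to announce the change (GARP), but until the ARP table on the neighboring L3 device has been updated, that device sends packets to the MAC address of the old Primary unit. During this time there is a risk of a very brief traffic impact.

Note that the GARP announcement covers only the IP addresses of interfaces the ASA itself owns; it does not include other IP addresses defined as NAT mapped IPs (= IPs resolved through proxy ARP). In other words, in the neighboring L3 device's ARP table, only the ASA's real interface IP switches to "Active IP : MAC address (new Primary)", while the other IPs used for NAT remain as "Active IP : MAC address (old Primary)". As a result, the neighboring L3 device keeps sending traffic destined for the NAT IPs to the old Primary unit's MAC address, and the impact lasts a long time. Recovery requires either waiting for the relevant ARP entries on that L3 device to time out, or logging in to that device to clear them manually or reboot it (= forced ARP table clear).

To avoid this risk, configure virtual MAC addresses, especially on data interfaces that use NAT. This prevents the "Active IP : Primary MAC" mapping from changing at hardware replacement or similar events, and the trouble that comes with it.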
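The stale-entry problem described above can be illustrated with a toy model of the neighboring L3 device's ARP table. This is a sketch under assumptions only: the function names, both burned-in MAC values, and the NAT mapped IP 192.168.10.100 are hypothetical; only the interface IP 192.168.10.254 and the virtual MAC 1234.1234.0001 come from the configuration example in this section.

```python
from typing import Dict, List, Set


def apply_garp(arp_table: Dict[str, str], garp_ips: Set[str], new_mac: str) -> Dict[str, str]:
    """Neighbor's ARP table after receiving GARP for garp_ips only."""
    return {ip: (new_mac if ip in garp_ips else mac) for ip, mac in arp_table.items()}


def stale_ips(arp_table: Dict[str, str], old_mac: str) -> List[str]:
    """IPs that still point at the MAC address of the replaced unit."""
    return sorted(ip for ip, mac in arp_table.items() if mac == old_mac)


OLD_MAC = "aaaa.aaaa.0001"      # hypothetical burned-in MAC of the old Primary
NEW_MAC = "bbbb.bbbb.0001"      # hypothetical burned-in MAC of the new Primary
VIRTUAL_MAC = "1234.1234.0001"  # virtual MAC from the example in this section
INTERFACE_IP = "192.168.10.254"
NAT_IP = "192.168.10.100"       # hypothetical NAT mapped IP (proxy ARP)

if __name__ == "__main__":
    # Default behavior: GARP covers only the interface IP.
    before = {INTERFACE_IP: OLD_MAC, NAT_IP: OLD_MAC}
    after = apply_garp(before, {INTERFACE_IP}, NEW_MAC)
    print("stale after replacement:", stale_ips(after, OLD_MAC))

    # With a virtual MAC the mapping does not change when the unit is replaced.
    virtual = {INTERFACE_IP: VIRTUAL_MAC, NAT_IP: VIRTUAL_MAC}
    print("unchanged:", apply_garp(virtual, {INTERFACE_IP}, VIRTUAL_MAC) == virtual)
```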

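A fixed virtual MAC address can be set on a given interface with either the "failover mac address" command or the "mac-address" command (in interface configuration mode), as shown below. We recommend using only one of the two, whichever you prefer; avoid mixing the two commands on the same device.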

asa/pri/act(config)# failover mac addr gi0/0 1234.1234.0001 1234.1234.0002
asa/pri/act(config)# sh interface gi0/0 | in IP|MAC
 MAC address 1234.1234.0001, MTU 1500     <---- Confirm the changed MAC address
 IP address 192.168.10.254, subnet mask 255.255.255.0

                         or

asa/pri/act(config)# interface gi 0/0
asa/pri/act(config-if)# mac-addr 1234.1234.0001 standby 1234.1234.0002

For details, also refer to the following configuration guide.

ASA 9.1: Configuring Active/Standby Failover - Configuring Virtual MAC Addresses
http://www.cisco.com/cisco/web/support/JP/docs/SEC/Firewall/ASA5500NextGenerationFire/CG/001/ha_active_standby.html?bid=0900e4b183271d68#34672

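6. Quickly checking each unit's priority and state

The "prompt hostname priority state" command is very handy for quickly checking which unit is Primary/Secondary and Active/Standby. For details, see the article "Checking Active/Standby unit state from the command-line prompt in an ASA redundant configuration" (ASA 冗長構成で、コマンドラインのプロンプトで Active機 Standby機状態を確認する).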

asa(config)# prompt hostname priority state
asa/pri/act(config)# <--- Shows that this unit is pri (Primary) and act (Active)

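7. Enabling syslog output on the Standby unit

Logging output on the Standby unit is disabled by default. If you also want to see the Standby unit's syslog output, for example during troubleshooting, consider enabling the "logging standby" command on the Active unit.

Note that when "logging standby" is enabled, the Standby unit also starts emitting syslog messages, which increases the syslog load on the Standby unit and, when logs are stored on an external syslog server, the volume of logs stored. Also, if the goal is to monitor logs for traffic passing through the ASA, the Active unit's logs alone are sufficient. Consider carefully whether "logging standby" needs to be enabled permanently.

The following is an example of running the command and verifying it.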

ciscoasa/pri/act(config)# logging ?

configure mode commands/options:
  asdm                  Set logging level or list for ASDM
  asdm-buffer-size      Specify ASDM logging buffer size
  buffer-size           Specify logging memory buffer size
  buffered              Set buffer logging level or list
          --- snip ---
  standby               Enable logging on standby unit with failover
                        enabled, warning: this option causes twice as much
                        traffic on the syslog server
  timestamp             Enable logging timestamp on syslog messages
  trap                  Set logging level or list for syslog server

exec mode commands/options:
 savelog Save logging buffer to flash
ciscoasa/pri/act(config)# 
ciscoasa/pri/act(config)# logging standby     <---- Enable syslog on the Standby unit
ciscoasa/pri/act(config)#
ciscoasa/pri/act(config)# show logging
Syslog logging: enabled
    Facility: 20
    Timestamp logging: enabled
    Hide Username logging: enabled
    Standby logging: enabled   <---- Confirm that it is enabled
    Debug-trace logging: disabled
    Console logging: disabled
    Monitor logging: disabled
    Buffer logging: level informational, 176 messages logged
    Trap logging: disabled

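8. Port settings on devices connected to the ASA (when using switches)

When deploying an ASA redundant pair, the ASA data interfaces are normally connected to a switch or hub. Through the switch or hub, the ASAs periodically monitor each other's data interfaces (interface health monitoring).

Because of this periodic monitoring between interfaces, when connecting the ASA to a switch, consider enabling PortFast (or an equivalent feature) on the switch ports or disabling STP on the VLANs in use. This lets a port resume forwarding quickly after an unexpected down/up, and lets the ASA's interface monitoring recover quickly.

If a data interface port is up but, for some reason (*), stays unable to forward packets, the ASA cannot confirm that monitoring between the interfaces has recovered, which causes interface tests to run. (* This includes the STP blocking state.)

For the same reason, if the Failover Link or Stateful Failover Link, which the ASAs use to manage and exchange synchronization data, is connected through a switch (although connecting the ASAs directly is recommended in principle), consider enabling PortFast (or an equivalent feature) or disabling STP on the VLANs in use.

The exact settings to apply on the switch depend on your topology, devices, and operational policy, so we recommend also consulting whoever is responsible for the switches.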

References

PIX/ASA Active/Standby Failover Configuration Example
http://www.cisco.com/c/en/us/support/docs/security/pix-500-series-security-appliances/77809-pixfailover.html

ASA9.1: Configuring Failover - Failover Health Monitoring
http://www.cisco.com/c/en/us/td/docs/security/asa/asa91/configuration/general/asa_91_general_config/ha_failover.html#pgfId-1079010

Firewall troubleshooting
https://supportforums.cisco.com/ja/document/12725841#hdr-1