# Runbook: ServiceEntry `resolution: DNS` behavior and diagnostics (STRICT_DNS)

**Date:** 2026-06-07
**Host:** ubuntu (home server) / cluster: homelab
**Domain:** Kubernetes / Istio (egress, Envoy DNS)
**Related:** `scenarios/20-egress/`, `docs/test-reports/2026-06-07_ingress-egress.md`

---

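## 1. Overview

When the egress scenario registers an external domain (`httpbin.org`) as a `ServiceEntry` with `resolution: DNS`,
istiod translates it into an Envoy **STRICT_DNS cluster**. This document covers the DNS resolution mechanism,
how a "dead IP" is handled, and the commands used for diagnosis.

Key point in one line: **a DNS refresh is not a health check.** DNS only supplies the list of addresses; it says nothing about whether each IP is alive.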

---

## 2. Translation structure

```
ServiceEntry(resolution: DNS, hosts: httpbin.org)
        |  translated by istiod
        v
Envoy cluster "outbound|443||httpbin.org"
   type: STRICT_DNS          <- Envoy issues the DNS queries itself
   respect_dns_ttl: true     <- refresh interval = TTL of the DNS response
   dns_refresh_rate: 60s     <- fallback, used only when the TTL is 0 or absent
   endpoints: [8 IPs]        <- every A record is expanded into an endpoint
```
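
For orientation, a `ServiceEntry` that would produce a cluster like the one in the diagram might look as follows. This is a minimal sketch, not the manifest used in `scenarios/20-egress/`: the resource name, namespace, and port naming are assumptions, and only the host and `resolution: DNS` come from this runbook.

```yaml
# Hypothetical manifest; metadata and port details are illustrative assumptions.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: httpbin-ext          # assumed name
  namespace: istio-system    # assumed namespace
spec:
  hosts:
    - httpbin.org
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS          # TLS passthrough, matching the L4 setup described in §8
  resolution: DNS            # istiod emits an Envoy STRICT_DNS cluster for this host
```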

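> Note on the routed path: traffic flows `sleep → egress gateway → httpbin.org`, so
> **the component that actually queries external DNS is the egress gateway's Envoy**. The sleep proxy shows the same cluster,
> but its route points at the egress gateway, so that cluster is not the one actually in use. Run diagnostics against the egress gateway.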

---

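## 3. How the refresh interval is determined

| Field | Current value | Meaning |
|---|---|---|
| `respect_dns_ttl` | `true` | Use the DNS response TTL as the refresh interval |
| `dns_refresh_rate` | `60s` | Fallback, used only when the TTL is 0 or absent |
| `httpbin.org` actual TTL | `10s` (dig) | → **effective refresh ≈ 10 s** |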

`dns_refresh_rate: 60s` is a misleading value: with `respect_dns_ttl: true` it is almost never used (only when TTL=0).
The mesh-wide default is set through `dnsRefreshRate` in the `istio` configmap.
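
As an illustration of the mesh-wide setting mentioned above, the fragment below sketches how `dnsRefreshRate` could appear in mesh configuration. The value `30s` is an arbitrary example. Whether it is set through an `IstioOperator`, Helm values, or by editing the `istio` configmap directly depends on how the mesh was installed, so that part is an assumption.

```yaml
# Sketch of a meshConfig fragment; 30s is an arbitrary illustrative value.
meshConfig:
  # Surfaces as dns_refresh_rate on the generated DNS-type clusters.
  # With respect_dns_ttl: true it still only takes effect when the TTL is 0 or absent.
  dnsRefreshRate: 30s
```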

---

## 4. Handling a "dead IP": the most important part

A DNS refresh is not a liveness check, so if an IP dies before the TTL expires, Envoy keeps it
as HEALTHY, and the connection fails whenever the load balancer picks it.

```
[before refresh, IP dead]
client --> egress GW --> [dead IP selected] --X TCP connect fails
                          +-- with retry     --> retried on another IP --> success
                          +-- without retry  --> request fails (503/connect error)
```

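Three independent ways an IP leaves the list:

| Mechanism | Trigger | Configured by default |
|---|---|---|
| DNS refresh | TTL expires → the re-query no longer returns the IP | Yes (passive) |
| **Outlier detection** (passive HC) | Consecutive 5xx/connect failures → temporary ejection | **No (no DestinationRule set)** |
| Active health check | Envoy probes periodically on its own | No |

→ **Production takeaway**: do not rely on the DNS TTL to avoid dead IPs; configure
`DestinationRule.trafficPolicy.outlierDetection` and `VirtualService.http.retries` explicitly (`http.retries` applies only to L7 HTTP routes; see §8).
(For a configuration example, see the updated `scenarios/20-egress/destinationrule-egress.yaml`.)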

---

## 5. Diagnostic commands (reproducible)

```bash
CTX=homelab
GW=deploy/istio-egressgateway.istio-system
CL="outbound|443||httpbin.org"

# (1) Current endpoint IP list + outlier status
istioctl --context "$CTX" proxy-config endpoints "$GW" --cluster "$CL"

# (2) Cluster DNS settings (type / dns_refresh_rate / respect_dns_ttl)
kubectl --context "$CTX" -n istio-system exec deploy/istio-egressgateway -c istio-proxy -- \
  curl -s "localhost:15000/config_dump?resource=dynamic_active_clusters" \
  | grep -A10 '"'"$CL"'"'

# (3) Runtime endpoint state (success_rate, health flags)
kubectl --context "$CTX" -n istio-system exec deploy/istio-egressgateway -c istio-proxy -- \
  curl -s "localhost:15000/clusters" | grep "httpbin.org"

# (4) DNS source data (IP + TTL)
dig +noall +answer httpbin.org           # TTL = second column (seconds), right after the name

# (5) To see DNS refresh counters, relax the proxy stats filter (not exposed by default):
#   add ".*httpbin\.org.*" to proxyStatsMatcher.inclusionRegexps in the pod annotation proxy.istio.io/config
```
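
For step (5), the annotation could be added to the gateway's pod template roughly as shown here. This is a sketch under assumptions: the surrounding Deployment structure is abbreviated, and the counter named afterwards is one of Envoy's general cluster-membership update stats, which is where DNS refresh activity would be expected to show up.

```yaml
# Pod-template fragment (assumed to be applied to the egress gateway Deployment).
spec:
  template:
    metadata:
      annotations:
        # The regex is single-quoted so the backslash stays literal in YAML.
        proxy.istio.io/config: |
          proxyStatsMatcher:
            inclusionRegexps:
              - '.*httpbin\.org.*'
```

Once the pod has restarted with this annotation, counters such as `cluster.outbound|443||httpbin.org.update_attempt` should become visible at `localhost:15000/stats`.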

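**Verification experiment**: capture the `dig` IP set and `proxy-config endpoints` twice each, about 10 s apart;
if the endpoints change whenever the IPs change, that confirms "DNS refresh interval = endpoint refresh interval".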

---

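## 6. Takeaways

- STRICT_DNS expands every IP in the A records into an endpoint (multiple connection pools, load balanced across them).
  LOGICAL_DNS uses only the first IP and favors connection reuse (suited to hosts behind a CDN or a large LB).
- DNS refresh ≠ health check. Liveness is the job of outlier detection / active HC.
- When traffic goes through an egress gateway, the gateway performs the DNS lookup, not the client sidecar.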

---

## 7. STRICT_DNS vs LOGICAL_DNS: ServiceEntry resolution (measured)

The `ServiceEntry.spec.resolution` field determines the Envoy cluster type.

| `resolution` | Envoy type | Endpoints loaded | Suited for |
|---|---|---|---|
| `DNS` | **STRICT_DNS** | **Every IP** in the A records | When Envoy should load balance across the individual IPs itself |
| `DNS_ROUND_ROBIN` | **LOGICAL_DNS** | **Only the first IP**, with connection reuse | A "logical single endpoint" behind a CDN/GSLB/large LB |

Measured (example.com registered as `DNS_ROUND_ROBIN` via `scenarios/20-egress/serviceentry-example-logicaldns.yaml`):
```
SERVICE FQDN    TYPE          endpoints(proxy-config)
httpbin.org     STRICT_DNS    many (all A records)
example.com     LOGICAL_DNS   1
```
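→ The endpoint count from `istioctl proxy-config endpoints <pod> --cluster "outbound|443||<host>"`
splits into **STRICT = many / LOGICAL = 1**, which is the decisive evidence that Envoy picked up the setting.

The point of LOGICAL_DNS is that existing connections are kept even when DNS records change frequently (no connection pool draining or connection cycling), which is why the official documentation recommends it for "large web scale services that must be accessed via DNS".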
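
The LOGICAL_DNS side of the comparison corresponds to a `ServiceEntry` along these lines. This is a sketch only: the actual `serviceentry-example-logicaldns.yaml` is not reproduced in this runbook, so the resource name and port settings below are assumptions.

```yaml
# Hypothetical manifest; only the host and the resolution mode are taken from the text.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: example-logicaldns      # assumed name
spec:
  hosts:
    - example.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS_ROUND_ROBIN   # istiod emits an Envoy LOGICAL_DNS cluster for this host
```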

---

## 8. Avoiding dead IPs: outlier detection + retry

A DNS refresh knows nothing about liveness (§4), so dead-IP avoidance is configured explicitly as follows.

**outlier detection** (`DestinationRule.trafficPolicy.outlierDetection`,
`scenarios/20-egress/destinationrule-httpbin-outlier.yaml`; confirmed in Envoy after applying):
```yaml
trafficPolicy:
  connectionPool: { tcp: { connectTimeout: 2s } }  # declare connect failures quickly
  outlierDetection:
    consecutiveLocalOriginFailures: 3   # ★ 3 consecutive connect failures (L4) → eject (key for passthrough)
    splitExternalLocalOriginErrors: true
    consecutive5xxErrors: 5             # 5 consecutive 5xx (L7) → eject
    interval: 10s
    baseEjectionTime: 30s
    maxEjectionPercent: 50              # never eject every endpoint
    minHealthPercent: 40
```

**retry** (`VirtualService.http.retries`): works **only on L7 HTTP routes**:
```yaml
http:
  - route: [...]
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: connect-failure,refused-stream,5xx,gateway-error
```

> ⚠️ **Pitfall**: the current egress setup is **TLS PASSTHROUGH (L4)**, so the egress gateway cannot see HTTP →
> `http.retries` is **silently ignored**. To use retries, switch to a **TLS origination (L7)** setup.
> - passthrough (L4): `outlierDetection` (`consecutiveLocalOriginFailures`) plus a short `connectTimeout` is the only automatic avoidance
> - TLS origination (L7): the above plus `http.retries` (`retryOn: connect-failure`)

---

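## 9. Recommendation for GSLB environments: LOGICAL_DNS

For hosts behind a GSLB (a model where DNS picks the best IP based on client location and health), **LOGICAL_DNS** is recommended.

Rationale:
1. **Respects the GSLB decision**: STRICT_DNS expands all A records and lets Envoy load balance on its own → IPs other than the one the GSLB preferred, including remote regions, get mixed in, which **defeats the GSLB's intent**. LOGICAL_DNS uses only the first IP that DNS returned and so follows the decision.
2. **Friendly to dynamic changes and short TTLs**: with LOGICAL_DNS, each new connection uses the first IP from the most recent resolution, so it picks up the current GSLB decision. STRICT_DNS keeps connection pools for every IP → the burden of stale and remote connections.
3. **Avoids wasted resources**: a host behind a GSLB is effectively "one logical endpoint" → maintaining connections to every IP is wasteful.

**Trade-off**: LOGICAL_DNS has a single endpoint, so Envoy's outlier detection and fine-grained LB lose most of their effect → failure detection is effectively delegated to the GSLB. If the GSLB's health checking is lax, STRICT_DNS + outlier detection may be the better choice. **At bottom this is a choice of where LB authority lives: in DNS (GSLB) or in Envoy.**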

---

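## 10. "ambient DNS": where the domain queries go (measured)

In the official documentation's phrase "querying the ambient DNS", **ambient does not mean Istio ambient mesh**;
it is the ordinary English word, meaning the default resolver present in the proxy's environment (`/etc/resolv.conf`). This is the same for STRICT and LOGICAL.

Measured query path: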
```
Envoy (c-ares resolver; no dns_resolver address set on the cluster)
  -> /etc/resolv.conf  nameserver 169.254.25.10   (NodeLocal DNSCache, link-local)
  -> CoreDNS (cluster DNS)
  -> httpbin.org is not a cluster domain -> forwarded to upstream DNS for recursive resolution
  -> authoritative answer (A records) -> returned back along the same path -> loaded as Envoy endpoints
```

Confirmed facts:
- The querier is **Envoy's c-ares** (`typed_dns_resolver_config: envoy.network.dns_resolver.cares`).
  No resolver address is specified → c-ares reads `resolv.conf` as-is.
- nameserver in the sleep and egress gateway pods = `169.254.25.10` (nodelocaldns) → CoreDNS → external upstream.
- **Istio DNS capture (`ISTIO_META_DNS_CAPTURE`) is disabled** (injected through neither meshConfig nor pod env)
  → queries bypass the istio-agent DNS proxy and Envoy sends them directly.

> With DNS capture enabled: istio-agent intercepts port 53 (DNS proxy), answers mesh-internal hosts locally from NDS,
> and forwards only external names upstream → the "query location" moves to istio-agent.

Diagnostic commands:
```bash
# Check the resolver (= ambient DNS)
kubectl -n mesh-test exec deploy/sleep -c istio-proxy -- cat /etc/resolv.conf
# Which resolver type Envoy uses
kubectl -n istio-system exec deploy/istio-egressgateway -c istio-proxy -- \
  curl -s "localhost:15000/config_dump?resource=dynamic_active_clusters" \
  | grep -A60 '"outbound|443||httpbin.org"' | grep -i dns_resolver
# Whether DNS capture is enabled (no output means it is not set in meshConfig)
kubectl -n istio-system get cm istio -o jsonpath='{.data.mesh}' | grep -i DNS_CAPTURE
```
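
If DNS capture were to be enabled for comparison, the switch is a proxy metadata entry. The fragment below is a sketch of the mesh-wide form. It was not applied in this environment, and its placement under `meshConfig.defaultConfig` assumes an `IstioOperator` or Helm-values style install.

```yaml
# Not applied on this cluster; shown only to illustrate the setting discussed above.
meshConfig:
  defaultConfig:
    proxyMetadata:
      # Enables the istio-agent DNS proxy for injected workloads.
      ISTIO_META_DNS_CAPTURE: "true"
```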

---

## 11. References
- `docs/test-reports/2026-06-07_ingress-egress.md`
- Demo manifests: `scenarios/20-egress/{serviceentry-example-logicaldns,destinationrule-httpbin-outlier}.yaml`
- `~/istio-md` archive: `gw__note-eastwest-gateway-sni`, `gw__note-circuit-breaking-mechanisms`