---
type: src
tags: [istio, envoy, response-flags, access-log, telemetry, troubleshooting]
created: 2026-06-07
---
# Envoy Response Flags Operations Reference (source: Istio 1.30 official documentation, Envoy access log response flags)

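> [!abstract] What this document covers
> **The status code tells you only the outcome; the response flag tells you at which stage of Envoy's processing pipeline that outcome was produced.** The same 503 belongs to a completely different owning team depending on the flag (UF/UH/UO/NR), so when you see a 503, read the flag, not just the status.
> The document proceeds in this order: ① why the status alone is not enough (background) → ② how response flags map onto Envoy pipeline stages (mechanism and mnemonic) → ③ a reference table of 28 flags plus quick 503/504 triage → ④ the configuration that exposes long flags and diagnostic fields in the access log, with a real JSON log entry.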

---

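## 01. Background: why a status code alone cannot locate the cause

When a 5xx shows up in production, the first question is "where is it broken?" The HTTP status code does not answer that. For a single 503, the scenarios that are possible from Envoy's point of view include:

- Envoy could not establish the TCP/TLS connection to the upstream at all.
- The cluster exists, but it has zero healthy endpoints.
- The circuit breaker (connection pool limit) blocked the request on purpose.
- No route matches the request.
- Or **the app itself built a 503 body and returned it as a normal response** (Envoy is fine).

Each of these five goes to a different team (network / platform / resource limits / routing configuration / application), yet the client receives the same `503` in every case. **The status code says what happened (the outcome), not where in processing it happened (the location).** The response flag is the one token that fills in that location.

The prerequisite for this document is the broad shape of how Envoy handles a request: it flows through `listener → filter chain → route → cluster → endpoint (upstream connection)`. The key point about response flags is that **a failure at each stage of this flow stamps a different flag**, and section 02 works through that mapping as a mechanism. For the canonical stage-by-stage diagnosis, see [xDS layers and diagnosis](xds__src-xds-layers-and-diagnosis.html).

The intended readers are SREs and platform engineers who triage 5xx errors from access logs. The scope is interpreting HTTP/TCP response flags as of Istio 1.30 (Envoy) and the configuration that exposes them.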

---

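## 02. Core idea: a response flag is a projection of an Envoy pipeline stage

> [!key] One-sentence mental model (this picture is all you need to keep in your head)
> A response flag is **a coordinate that points to the stage of Envoy's processing pipeline where the request died**. Rather than memorizing flags, you place the request on the stages: did it reach the route stage, did it find a cluster, did it connect to an endpoint, did it get a response? That is why the same status with a different flag means a different place of death.

A request passes several checkpoints on its way through Envoy. Each checkpoint is a gate that either passes the request or fails it there; on failure it leaves that gate's own flag and ends the response. With this mapping drawn as a diagram, the table rows for these gates can be derived instead of memorized.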

```mermaid
flowchart TD
  REQ[Downstream request arrives] --> L{"listener / filter chain<br/>matched?"}
  L -- no match --> NR[NR NoRouteFound]
  L -- ok --> RT{route matched?}
  RT -- no match --> NR
  RT -- ok --> CL{cluster exists?}
  CL -- missing --> NC[NC NoClusterFound]
  CL -- ok --> EP{healthy endpoint available?}
  EP -- zero --> UH[UH NoHealthyUpstream]
  EP -- yes --> POOL{"connection pool /<br/>circuit breaker has headroom?"}
  POOL -- limit exceeded --> UO[UO UpstreamOverflow]
  POOL -- ok --> CONN{upstream connect succeeded?}
  CONN -- failed --> UF[UF UpstreamConnectionFailure]
  CONN -- ok --> RESP{response received in time?}
  RESP -- timeout --> UT[UT UpstreamRequestTimeout]
  RESP -- reset/term --> URUC[UR / UC upstream reset or termination]
  RESP -- ok --> OK[normal: flag '-']
  REQ -. client disconnects .-> DC[DC DownstreamConnectionTermination]
```

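Reading along this pipeline shows why the naming rule is **direction + event**. A flag is not an opaque abbreviation to memorize; it combines "who (direction)" with "what (event)", so even an unfamiliar flag can usually be inferred.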

```text
Direction prefix
  N = No            (absent: NoRoute, NoCluster, NoHealthy)
  U = Upstream      (server side / the target we send to)
  D = Downstream    (client side / whoever sent to us)
  L = Local         (raised by Envoy itself)
Event
  R = Route / Retry / Reset      O = Overflow / Overload
  T = Timeout                    F = Failure
  H = Healthy                    C = Cluster / Connection
```

Combining them unpacks the meaning directly, and each combination also shows which gate of the pipeline above it belongs to.

```text
NR  = No + Route                   → match failed at the route/filter chain stage
NC  = No + Cluster                 → cluster lookup stage failed
UH  = (No) + Upstream + Healthy    → endpoint selection stage, zero healthy
UO  = Upstream + Overflow          → blocked at the pool/circuit breaker gate
UF  = Upstream + Failure           → connect stage failed
UT  = Upstream + Timeout           → timed out while waiting for the response
DC  = Downstream + Connection      → client closed the connection before the response
LR  = Local + Reset                → Envoy itself reset (timeout/overload/filter)
URX = Upstream + Retry + eXceeded  → retry limit exceeded
```

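> [!tip] Key point
> Read the direction prefix (U/D/L/N) and the event (F/H/O/T/C/R) separately and most flags can be interpreted without memorization. Add "which gate of the pipeline is this?" and a single flag translates directly into what to inspect (which resource to look at).

This stage model decides what to check first. NR points to routing configuration (VirtualService/Gateway), UH to endpoint supply (Service/EndpointSlice/readiness), UO to limit settings (DestinationRule), and UF to transport (port/firewall/mTLS): the stage a flag points at is the resource that owns the problem. The full Istio request path (iptables → listener → route → cluster → endpoint) is covered in more detail in [Cluster anatomy](xds__src-cluster-anatomy.html) and [xDS layers and diagnosis](xds__src-xds-layers-and-diagnosis.html).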
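The flag-to-gate-to-resource mapping described above can be written down as a small lookup. This is only an illustrative sketch: the table contents repeat what this section states, while the names `GATES` and `first_check` are hypothetical and not part of Istio or Envoy.

```python
# Illustrative lookup built from the gate model in this section.
# GATES and first_check are hypothetical names, not an Istio/Envoy API.
GATES = {
    "NR": ("route / filter chain", "VirtualService, Gateway"),
    "NC": ("cluster lookup", "DestinationRule subset, Service/ServiceEntry"),
    "UH": ("endpoint selection", "Service, EndpointSlice, readiness"),
    "UO": ("connection pool / circuit breaker", "DestinationRule connectionPool"),
    "UF": ("upstream connect", "port, firewall, mTLS"),
    "UT": ("response wait", "VirtualService timeout, app latency"),
}


def first_check(flag):
    """Translate a short response flag into its gate and the resource to inspect first."""
    if flag == "-":
        return "no Envoy-level error flag: look at the upstream app's own response"
    gate, resource = GATES.get(flag, ("unknown gate", "the full reference table"))
    return f"{flag}: failed at {gate} -> check {resource}"


print(first_check("UH"))
```

Keeping the mapping as data rather than as branching logic makes it easy to extend with the remaining flags from the reference table.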

Finally, the access log offers two output forms. The operational recommendation is to log not only the short form but also the long form together with `response_code_details` (configured in section 05).

```text
%RESPONSE_FLAGS%       → UF                          (short abbreviation)
%RESPONSE_FLAGS_LONG%  → UpstreamConnectionFailure   (PascalCase long name)
```
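A single request can carry more than one flag, and to my knowledge Envoy then joins the short flags with commas (for example `UF,URX`). Any tooling that compares the field with one exact flag would miss those lines, so it may help to split the value first. A minimal sketch, where `split_flags` is a hypothetical helper name:

```python
def split_flags(raw):
    """Split a %RESPONSE_FLAGS% value into individual short flags.

    Assumes multiple flags are comma-separated and that '-' means "no flag".
    """
    raw = (raw or "").strip()
    if raw in ("", "-"):
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


print(split_flags("UF,URX"))
print(split_flags("-"))
```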

---

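## 03. Full flag reference table

These are the 28 core response flags you will meet in operations. "Check first" is the resource that owns the pipeline stage the flag points at, that is, where to start inspecting.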

| Short | Long | Meaning | Check first |
|---|---|---|---|
| `-` | `-` | No specific Envoy error flag | The upstream app responded normally, or at least this is not an Envoy-level error |
| `NR` | `NoRouteFound` | No route, or no matching filter chain | VirtualService host/gateway/match, Gateway server, SNI, port protocol, Sidecar scope |
| `NC` | `NoClusterFound` | The cluster the route points to does not exist | DestinationRule subset typo, missing Service/ServiceEntry, config scope, Envoy warming |
| `UH` | `NoHealthyUpstream` | The cluster exists but has no healthy endpoint | Service selector, EndpointSlice, Pod readiness, all hosts ejected by outlier detection, locality failover |
| `UF` | `UpstreamConnectionFailure` | Upstream connection failed | App not listening on the port, NetworkPolicy/firewall, mTLS mismatch, TLS SAN problem, wrong endpoint port |
| `UO` | `UpstreamOverflow` | Circuit breaker / connection pool overflow | DestinationRule connectionPool, pending requests, retry storm, LLM/GPU queue saturation |
| `UT` | `UpstreamRequestTimeout` | Upstream response timeout | VirtualService timeout, app latency, DB/GPU queue, streaming timeout |
| `URX` | `UpstreamRetryLimitExceeded` | Retry limit or TCP connect attempts exceeded | retry policy, maxRetries, unnecessary retries, unstable upstream |
| `UC` | `UpstreamConnectionTermination` | Upstream connection terminated midway | upstream keepalive/idle timeout, app crash, server sent FIN/RST |
| `UR` | `UpstreamRemoteReset` | Upstream reset the stream | HTTP/2 reset, gRPC reset, app/proxy reset the stream |
| `LR` | `LocalReset` | Envoy local reset | local filter, timeout, overload, reset triggered by a policy/filter |
| `DC` | `DownstreamConnectionTermination` | Downstream client disconnected | client timeout, browser cancel, LB idle timeout, gRPC client cancel |
| `DR` | `DownstreamRemoteReset` | Downstream remote reset | client-side HTTP/2 reset/refuse |
| `SI` | `StreamIdleTimeout` | Stream idle timeout | data gap in streaming/gRPC/SSE, idleTimeout too short |
| `DT` | `DurationTimeout` | Max connection/request duration exceeded | maxConnectionDuration, max downstream duration |
| `UMSDR` | `UpstreamMaxStreamDurationReached` | Upstream max stream duration reached | long-running stream, gRPC streaming limit |
| `DF` | `DnsResolutionFailed` | DNS resolution failed | ServiceEntry DNS, CoreDNS, external DNS, egress gateway DNS |
| `RL` | `RateLimited` | Local HTTP rate limit | local rate limit filter, EnvoyFilter/Wasm/extension configuration |
| `RLSE` | `RateLimitServiceError` | Error in the rate limit service itself | external/global rate limit service outage |
| `UAEX` | `UnauthorizedExternalService` | Denied by the external authorization service | ext_authz policy, OPA/custom authz, auth service response |
| `DI` | `DelayInjected` | Fault injection delay | VirtualService fault delay |
| `FI` | `FaultInjected` | Fault injection abort | VirtualService fault abort |
| `IH` | `InvalidEnvoyRequestHeaders` | Envoy rejected the request for an invalid header | malformed header, strict header validation |
| `DPE` | `DownstreamProtocolError` | Downstream HTTP protocol error | client sent invalid HTTP, h2 framing problem |
| `UPE` | `UpstreamProtocolError` | Upstream HTTP protocol error | upstream violated the protocol, h2/h1 mismatch |
| `OM` | `OverloadManagerTerminated` | Terminated by the Envoy overload manager | Envoy memory/CPU pressure, overload action |
| `DO` | `DropOverLoad` | Overload drop | Envoy overload/drop_overloads |
| `UDO` | `UnconditionalDropOverload` | 100% drop overload | aggressive overload drop configuration |
| `NFCF` | `NoFilterConfigFound` | Filter config not received within the warming deadline | xDS warming, extension config, EnvoyFilter/Wasm related |

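> [!warning] Pitfall
> Do not conclude "healthy" just because `-` appears. It only means there is no Envoy-level error flag; when the upstream app itself returns a 5xx as a normal response (for example, the app deliberately returns 500), the log shows `-`. So `503 -` means "the app built and sent the 503 itself", while `503 UF` means "Envoy never even reached the app": completely different meanings. In the pipeline diagram, `-` passed through to the end (RESP ok), whereas `UF` died at the connect gate.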

---

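## 04. Quick interpretation of common 503/504 combinations

Most real incidents converge on five combinations (`503 UF` / `503 UH` / `503 UO` / `404·503 NR` / `504 UT`). Remember them as status+flag pairs rather than by status alone; each pair is one gate of the section 02 pipeline.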

- **503 UF** (connect gate): Envoy could not establish the TCP/TLS connection to the upstream at all. Typical causes are an mTLS mismatch (the client sends plaintext while the server enforces `PeerAuthentication STRICT`), an app not listening on the port, a NetworkPolicy/firewall block, or a wrong endpoint port.
- **503 UH** (endpoint selection gate): the cluster exists but has zero healthy endpoints. Causes include a Service selector typo, unmet Pod readiness, or outlier detection ejecting even healthy endpoints (the hidden cause). For outlier detection behavior and tuning, see the canonical [Cluster anatomy](xds__src-cluster-anatomy.html).
- **503 UO** (pool/circuit breaker gate): the upstream is fine, but Envoy deliberately does not send. The DestinationRule `connectionPool`/circuit breaker limits (concurrency, pending requests, retries) were exceeded. This is not "the upstream died" but "Envoy stopped sending because its configured limit was exceeded".
- **404·503 NR** (route/filter chain gate): VirtualService `hosts` does not match the actual `:authority`; the `gateways` field keeps the rule from applying to mesh sidecars; the Service port is not recognized as HTTP so no L7 route is generated at all; or there is an SNI/port mismatch.
- **504 UT** (response wait gate): the upstream did not respond in time. The VirtualService timeout is too short, the app latency/DB/GPU queue is long, or a streaming call was given a short timeout.

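> [!warning] Pitfall
> In LLMOps/MLOps, `URX` (retry limit exceeded) and `UO` are especially dangerous. Inference requests are expensive and long, and may be streaming or non-idempotent, so blanket retries push more work into the GPU queue and worsen tail latency. Resiliency settings that look like sensible defaults end up amplifying the incident. For inference POSTs, the rule is no retries, or tightly limited ones.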

### 503 triage decision

```mermaid
flowchart TD
  A[503 occurs] --> B[check response_flag]
  B --> UF[UF: connection failure]
  B --> UH[UH: zero healthy endpoints]
  B --> UO[UO: connectionPool/CB]
  B --> NR[NR: no route/filter chain]
  B --> UT[UT: upstream timeout, usually 504]

  UF --> UFq{transport_failure_reason?}
  UFq -->|TLS error| UFtls[mTLS mode mismatch]
  UFq -->|empty| UFnet[app port, firewall, NetworkPolicy]

  UH --> UHq{endpoints exist?}
  UHq -->|zero| UHready[Service selector, Pod readiness]
  UHq -->|exist but ejected| UHout[outlier detection over-ejection]
```
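The decision tree above can also be sketched as a function over one parsed access log record. This is a rough illustration, not an official algorithm: the function name `triage_503`, the `endpoints_exist` argument (which you would fill in yourself, for example after looking at the Service's endpoints), and the "TLS" substring check are all assumptions, and the field names follow the JSON format used in section 05.

```python
def triage_503(record, endpoints_exist=None):
    """Return a first suspect for a 503/504 access log record (illustrative only)."""
    raw = record.get("response_flags") or "-"
    flags = [f for f in raw.split(",") if f and f != "-"]
    reason = record.get("upstream_transport_failure_reason") or ""
    if reason == "-":  # treat the "no value" placeholder as empty
        reason = ""

    if not flags:
        return "no Envoy flag: the app produced this status itself"
    if "UF" in flags:
        if "tls" in reason.lower():
            return "UF: suspect mTLS mode mismatch"
        if not reason:
            return "UF: check app port, firewall, NetworkPolicy"
        return "UF: transport failure, read the reason field: " + reason
    if "UH" in flags:
        if endpoints_exist is None:
            return "UH: first check whether the Service has endpoints"
        if endpoints_exist:
            return "UH: endpoints exist, suspect outlier detection over-ejection"
        return "UH: check Service selector and Pod readiness"
    if "UO" in flags:
        return "UO: check DestinationRule connectionPool / circuit breaker limits"
    if "NR" in flags:
        return "NR: check VirtualService hosts/gateways, port protocol, SNI"
    if "UT" in flags:
        return "UT: check VirtualService timeout and upstream latency"
    return "not covered here: see the reference table"


sample = {
    "response_code": "503",
    "response_flags": "UF",
    "upstream_transport_failure_reason": "TLS_error:...",
}
print(triage_503(sample))
```

The function only names a first suspect; the `response_code_details` field is still worth reading before acting on the result.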

---

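## 05. Example: exposing RESPONSE_FLAGS_LONG and diagnostic fields in the access log, and reading them

Up to here the question was how to interpret a flag; this section is about how to actually get hold of that flag in real logs. The default access log format often carries only the short flag, which slows diagnosis. The recommended approach is to **enable access logging with the Telemetry API and switch the format to JSON in MeshConfig**.

### 05.1 Enable access logging mesh-wide (Telemetry API)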

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system   # root configuration namespace
spec:
  accessLogging:
  - providers:
    - name: envoy
```

Telemetry resources form a hierarchy in the order workload-specific → namespace-specific → root namespace. To apply mesh-wide, place the resource in the root config namespace (usually `istio-system`) without a selector.

### 05.2 Add short/long flags and diagnostic fields to the JSON format (MeshConfig)

`accessLogFile`, `accessLogEncoding`, and `accessLogFormat` are set in MeshConfig. An empty `accessLogFile` turns access logging off, and `accessLogEncoding` is either `TEXT` or `JSON`. The full JSON format follows.

> ⚠️ **In this environment (Helm install)** the `IstioOperator` CR is not reconciled (no operator installed). Put the `meshConfig:` block below into `meshConfig:` in `install/helm/values-istiod.yaml` as-is and apply it with `helm upgrade`. The manifest below is **only a reference for the shape of the meshConfig fields**.
>
> - 📎 [values-istiod.yaml](attachment/install/helm/values-istiod.yaml): this environment's control-plane Helm values (currently `accessLogEncoding: TEXT`, long flag not exposed → apply the JSON format below here)

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout
    accessLogEncoding: JSON
    accessLogFormat: |
      {
        "start_time": "%START_TIME%",
        "method": "%REQ(:METHOD)%",
        "path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
        "authority": "%REQ(:AUTHORITY)%",
        "protocol": "%PROTOCOL%",
        "upstream_protocol": "%UPSTREAM_PROTOCOL%",
        "response_code": "%RESPONSE_CODE%",
        "response_flags": "%RESPONSE_FLAGS%",
        "response_flags_long": "%RESPONSE_FLAGS_LONG%",
        "response_code_details": "%RESPONSE_CODE_DETAILS%",
        "connection_termination_details": "%CONNECTION_TERMINATION_DETAILS%",
        "upstream_transport_failure_reason": "%UPSTREAM_TRANSPORT_FAILURE_REASON%",
        "bytes_received": "%BYTES_RECEIVED%",
        "bytes_sent": "%BYTES_SENT%",
        "duration_ms": "%DURATION%",
        "upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
        "upstream_cluster": "%UPSTREAM_CLUSTER_RAW%",
        "upstream_host": "%UPSTREAM_HOST%",
        "upstream_hosts_attempted": "%UPSTREAM_HOSTS_ATTEMPTED%",
        "downstream_remote_address": "%DOWNSTREAM_REMOTE_ADDRESS%",
        "downstream_local_address": "%DOWNSTREAM_LOCAL_ADDRESS%",
        "requested_server_name": "%REQUESTED_SERVER_NAME%",
        "route_name": "%ROUTE_NAME%",
        "x_request_id": "%REQ(X-REQUEST-ID)%"
      }
```

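The short flag only tells you at which stage the request died. To finish the analysis in one pass you also need the three diagnostic fields that narrow down why within that stage.

- `response_flags_long`: the short flag spelled out in PascalCase so a human can read it directly.
- `response_code_details`: the specific reason Envoy produced that status (for example `upstream_reset_before_response_started{connection_failure}`).
- `upstream_transport_failure_reason`: the actual transport-level reason for a TLS/connection failure (for example a TLS handshake error message). The decisive clue in mTLS incident analysis.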

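In addition, two operators help with endpoint-level triage.

- `%UPSTREAM_CLUSTER_RAW%`: the original value before `%UPSTREAM_CLUSTER%` is simplified to the observability name (of the form `direction|port|subset|fqdn`). It shows the actual subset/port the request was routed to, unprocessed, so you can check DestinationRule subset matching directly when analyzing `NC`/`UH`.
- `%UPSTREAM_HOSTS_ATTEMPTED%`: the upstream hosts (endpoints) Envoy attempted for this request. Useful for triaging `UH` (nothing attempted means the pool itself was empty) and `UF` (several endpoints attempted and all failed), and for judging outlier ejection.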
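Because the raw cluster name has the `direction|port|subset|fqdn` shape described above, it can be split into named parts to check the subset at a glance. A minimal sketch; `ClusterName` and `parse_cluster_raw` are hypothetical names, and the sample value is made up for illustration.

```python
from typing import NamedTuple, Optional


class ClusterName(NamedTuple):
    direction: str
    port: str
    subset: str  # empty string when the route did not select a subset
    fqdn: str


def parse_cluster_raw(raw: str) -> Optional[ClusterName]:
    """Split a `direction|port|subset|fqdn` cluster name into its four parts.

    Returns None for names that do not follow the four-part shape
    (for example special clusters such as PassthroughCluster).
    """
    parts = raw.split("|")
    if len(parts) != 4:
        return None
    return ClusterName(*parts)


# Hypothetical sample value, not taken from a real log.
name = parse_cluster_raw("outbound|8080|v1|reviews.default.svc.cluster.local")
print(name.subset, name.port, name.fqdn)
```

Comparing `name.subset` with the subsets defined in the DestinationRule is usually the quickest way to confirm or rule out a subset typo behind `NC`.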

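### 05.3 Verify after applying

```bash
# istio-proxy also emits non-JSON lines (startup/Envoy logs), so parse each line defensively
kubectl logs -n <ns> <pod> -c istio-proxy | jq -R 'fromjson? | objects'
# Show only lines that actually carry a flag (an unflagged request logs "-", not "")
kubectl logs -n <ns> <pod> -c istio-proxy \
  | jq -R 'fromjson? | objects | select(.response_flags_long != null and .response_flags_long != "-")'
```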
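Once the JSON format is live, a quick tally of status+flag pairs shows which gate dominates an incident. The following is a small sketch under the assumption that each access log line is one JSON object with the `response_code` and `response_flags` fields configured in section 05.2; `tally_pairs` is a hypothetical helper, and the sample lines are invented for illustration.

```python
import json
from collections import Counter


def tally_pairs(lines):
    """Count (response_code, response_flags) pairs, skipping non-JSON lines."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text proxy log line, not an access log entry
        if not isinstance(record, dict):
            continue
        code = str(record.get("response_code"))
        flags = record.get("response_flags") or "-"
        counts[(code, flags)] += 1
    return counts


# Invented sample input; in practice the lines would come from `kubectl logs`.
sample_lines = [
    'info  Envoy proxy is ready',
    '{"response_code": "503", "response_flags": "UF"}',
    '{"response_code": "503", "response_flags": "UF"}',
    '{"response_code": "200", "response_flags": "-"}',
]
for (code, flags), n in tally_pairs(sample_lines).most_common():
    print(n, code, flags)
```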

> [!note] Scope of effect
> A MeshConfig `accessLogFormat` change takes effect **on proxies that start or restart afterwards**. To apply it to existing pods immediately, restart the sidecars with `kubectl rollout restart deploy/<name>` (or the gateway deployment).

### 05.4 Result: from one real JSON log line to the cause (503 UF case)

```json
{
  "response_code": "503",
  "response_flags": "UF",
  "response_flags_long": "UpstreamConnectionFailure",
  "response_code_details": "upstream_reset_before_response_started{connection_failure}",
  "upstream_transport_failure_reason": "TLS_error:..."
}
```

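Read on top of the section 02 pipeline, this one line narrows the conclusion almost automatically: `UF` means **the request died at the connect gate** (route, cluster, and endpoint all passed) → this is not the app's response; Envoy could not even connect to the upstream → `upstream_transport_failure_reason` carries a TLS error, so the prime suspect is an **mTLS mode mismatch**, not port/firewall. A trace that a status code alone would never finish ends with a single log line.

> [!warning] Pitfall
> When extending the custom access log format, restrict the allowed fields explicitly in your operating standard so that prompt bodies, Authorization headers, tenant identifiers, and PII do not leak. Fields for response flag diagnosis (`response_flags`, `response_code_details`, `upstream_transport_failure_reason`, and so on) are not sensitive and are safe, but adding `%REQ(...)%` fields indiscriminately puts secrets into the logs.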

---

## Key takeaways

Mental model in one line: **a response flag is a coordinate marking the gate of the Envoy pipeline (route → cluster → endpoint → connect → response) where the request died. That gate is the owning resource.**

- **For a 503, look at the flag, not the status.** The same 503 died at a completely different gate (= owning team) depending on UF/UH/UO/NR.
- **Derive flags instead of memorizing them.** Read direction (N/U/D/L) and event (F/H/O/T/C/R) separately, place the result on a pipeline gate, and what to inspect follows directly.
- **Start with the five common ones:** `503 UF` = connection failure (connect gate: mTLS/port/firewall), `503 UH` = zero healthy endpoints (endpoint gate: readiness/outlier), `503 UO` = limit exceeded (pool gate: connectionPool/CB), `404·503 NR` = no route (route gate: VS/Gateway/SNI/port), `504 UT` = upstream timeout (response gate).
- **Do not log only the short flag in production:** expose `response_flags_long` + `response_code_details` + `upstream_transport_failure_reason` in the JSON access log. Narrow down in two steps: gate (flag) → reason (code_details/transport_reason).
- **Activation path:** Telemetry API (`mesh-default`, root ns) + MeshConfig `accessLogFormat` (JSON). Apply to existing pods with `rollout restart`.

---

## What you might be missing

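- **The `-` flag means "no Envoy error", not "no incident".** A 5xx built by the app itself (`503 -`, passed through the whole pipeline) and a 5xx where Envoy could not reach the app (`503 UF`, died at the connect gate) go to different teams. Suspecting infrastructure or the network from the status alone wastes time.
- **DENY rule incidents may not leave a clean flag in the access log.** A request blocked by an AuthorizationPolicy DENY is cut off in the RBAC filter, and the RBAC-related reason lands in `response_code_details`. Look at code_details as well as the flag to tell a "policy block" from an "outage".
- **Outlier detection is often the hidden culprit behind `UH`.** If endpoints are healthy yet `UH` appears, it may be a self-inflicted incident in which `consecutive5xxErrors`/`maxEjectionPercent` settings pulled healthy endpoints out of the pool, rather than a Pod failure. It happens especially often on services with few Pods.
- **If `upstream_transport_failure_reason` is empty, mTLS is probably not the problem.** Conversely, if this field carries a TLS error, suspect an mTLS mode mismatch (DestinationRule TLS mode vs PeerAuthentication STRICT) before port/firewall as the cause of `UF`. This single field cuts triage time considerably.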
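The policy-block point above can be turned into a cheap check on `response_code_details`. This sketch assumes the RBAC filter writes a reason that begins with `rbac_access_denied`; that prefix is an assumption to verify against your own logs, and `looks_like_policy_denial` is a hypothetical helper name.

```python
# Assumed prefix of the RBAC denial reason; verify it against real log entries.
RBAC_DENIAL_MARKER = "rbac_access_denied"


def looks_like_policy_denial(record):
    """Return True when response_code_details suggests an RBAC (AuthorizationPolicy) denial."""
    details = str(record.get("response_code_details") or "")
    return RBAC_DENIAL_MARKER in details


# Invented sample records for illustration.
denied = {"response_code": "403", "response_flags": "-",
          "response_code_details": "rbac_access_denied_matched_policy[none]"}
failed = {"response_code": "503", "response_flags": "UF",
          "response_code_details": "upstream_reset_before_response_started{connection_failure}"}
print(looks_like_policy_denial(denied), looks_like_policy_denial(failed))
```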

---

## Related files and references

- 📎 [values-istiod.yaml](attachment/install/helm/values-istiod.yaml): this environment's control-plane Helm values (currently `accessLogEncoding: TEXT`, long flag not exposed → apply the JSON format from §05.2 to this file's `meshConfig:` and run `helm upgrade`)
- [Cluster anatomy](xds__src-cluster-anatomy.html): canonical reference for cluster/subset matching when analyzing `NC`/`UH`, and for outlier detection behavior and tuning
- [xDS layers and diagnosis](xds__src-xds-layers-and-diagnosis.html): tracing the failure location stage by stage, listener → route → cluster → endpoint