Istio Gateway Deployment Options: An In-Depth Analysis

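Background

In a cloud-native architecture, the HTTP gateway (for example Istio IngressGateway, Nginx Ingress Controller, or Envoy Gateway) is the key component at the traffic entry point. A typical production architecture looks like this:

flowchart LR
    User[User] --> DDoS[DDoS protection]
    DDoS --> WAF[WAF]
    WAF --> CLB[CLB layer-4 load balancer]
    CLB --> GW["HTTP gateway<br/>containerized"]
    GW --> SVC[Backend services]

In this architecture the HTTP gateway handles TLS termination, routing, and traffic management. How it is deployed directly affects:

Operational efficiency: how easy it is to upgrade and scale
Traffic reliability: whether requests are dropped during changes
Resource utilization: whether it can scale elastically

This article analyzes the principles and mechanisms behind each deployment option. For a complete production configuration, see Istio Gateway Production Deployment Best Practices.

Deployment options compared

Option 1: Deployment + Service (NodePort/LoadBalancer)

The most common approach: the gateway runs as an ordinary Deployment and is exposed through a Service.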

flowchart TB
    subgraph Kubernetes Cluster
        subgraph Node1
            GW1[Gateway Pod]
        end
        subgraph Node2
            GW2[Gateway Pod]
        end
        subgraph Node3
            GW3[Gateway Pod]
        end
        SVC["Service<br/>NodePort/LB"]
    end
    CLB[CLB] --> SVC
    SVC --> GW1
    SVC --> GW2
    SVC --> GW3

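Pros:

Simple to deploy and follows the native Kubernetes model
Supports HPA autoscaling
Flexible scheduling; Pods can be spread across nodes

Cons:

One extra hop for traffic (through kube-proxy/iptables)
NodePort is limited to a port range (30000-32767 by default)
Connections may be interrupted during rolling updates

Best for: small and medium clusters where latency is not critical.

Option 2: DaemonSet + HostNetwork

One gateway Pod runs on every node, so traffic is handled locally.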

flowchart TB
    subgraph Kubernetes Cluster
        subgraph Node1
            GW1[Gateway Pod]
        end
        subgraph Node2
            GW2[Gateway Pod]
        end
        subgraph Node3
            GW3[Gateway Pod]
        end
    end
    CLB[CLB] --> Node1
    CLB --> Node2
    CLB --> Node3

Example configuration:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: istio-ingressgateway
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
  template:
    metadata:
      labels:
        app: istio-ingressgateway  # must match spec.selector
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      nodeSelector:
        node-role: gateway  # restrict to dedicated gateway nodes
      tolerations:
      - key: node-role
        value: gateway
        effect: NoSchedule
      containers:
      - name: istio-proxy
        ports:
        - containerPort: 80
        - containerPort: 443

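Pros:

Shortest traffic path and best performance
One Pod per node, which keeps operations simple
A gateway is deployed automatically when a new node joins

Cons:

No fine-grained control over the replica count
Upgrades need extra handling to avoid dropping traffic
Resource utilization may be uneven

Best for: high-traffic workloads with dedicated gateway nodes.

Option 3: Deployment + externalTrafficPolicy: Local

Uses the Service's Local traffic policy to avoid cross-node forwarding.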

flowchart TB
    subgraph Kubernetes Cluster
        subgraph Node1
            GW1[Gateway Pod]
            EP1[Endpoint]
        end
        subgraph Node2
            GW2[Gateway Pod]
            EP2[Endpoint]
        end
        subgraph Node3["Node3 (no Pod)"]
            EP3[No Endpoint]
        end
    end
    CLB[CLB] -->|health check passes| Node1
    CLB -->|health check passes| Node2
    CLB -.->|health check fails| Node3

Example configuration:

apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
spec:
  type: NodePort
  externalTrafficPolicy: Local
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30080
  selector:
    app: istio-ingressgateway

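Pros:

Preserves the client source IP
Traffic is not forwarded across nodes
The CLB health check automatically removes nodes without a Pod

Cons:

Traffic may be unevenly distributed
Depends on the CLB health check mechanism

Best for: cases that need the source IP preserved and have a reasonably even traffic distribution.

Zero traffic loss

Whichever deployment option you choose, a rolling update can drop traffic. Achieving truly zero loss requires understanding how the Pod termination sequence interacts with the timing of CLB health checks.

Root cause: a timing race

Traffic is lost because the CLB removes a node more slowly than the Pod terminates:

sequenceDiagram
    participant K8s as Kubernetes
    participant Pod as Gateway Pod
    participant CLB as CLB
    participant User as User request
    
    Note over K8s: Pod starts terminating
    K8s->>Pod: Send SIGTERM
    K8s->>K8s: Remove Pod from Endpoints
    
    Note over CLB: CLB has not noticed yet
    User->>CLB: New request
    CLB->>Pod: Forward to the terminated Pod
    Pod--xCLB: Connection fails
    
    Note over CLB: Health check fails
    CLB->>CLB: Remove node (10-30s delay)

The problem:

Kubernetes removes the Pod from Endpoints almost immediately
The CLB relies on health checks to discover that a node is unavailable, which usually takes 10-30 seconds
During that gap the CLB keeps sending traffic to the terminated Pod

Graceful termination in Istio IngressGateway

Istio's pilot-agent supports a DRAIN mode that drains connections gracefully.

Key point: drain_listeners does not reject connections immediately

Many people assume Envoy rejects new connections once draining starts. In fact:

Phase	New connections	Existing connections	/healthz/ready
Running normally	Accepted	Processed	200 OK
Draining	Still accepted	Still processed	503
After connections go idle	Closed gracefully	Completed	503

What drain_listeners actually does:

The listener enters the draining state
New connections are still accepted and processed (no connection refused)
/healthz/ready returns 503 (telling the CLB to stop sending traffic)
Connections are marked as draining and closed gracefully once their requests finish

This is the key to zero traffic loss: until the CLB removes the node, Envoy keeps handling any new requests that arrive.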
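The "fail readiness, keep serving" behavior described above can be illustrated with a small Go sketch. This is not Envoy or pilot-agent code: `drainingServer`, its `draining` flag, and the `status` helper are hypothetical names used only to show the idea that the health endpoint and the data path are decoupled during a drain.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// drainingServer is a hypothetical stand-in for the gateway: while draining
// it still serves traffic but reports "not ready" to health checks.
type drainingServer struct {
	draining atomic.Bool
}

func (s *drainingServer) handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz/ready", func(w http.ResponseWriter, r *http.Request) {
		if s.draining.Load() {
			w.WriteHeader(http.StatusServiceUnavailable) // tells the CLB to back off
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok") // regular requests are served in both states
	})
	return mux
}

// status sends an in-process GET to the handler and returns the status code.
func status(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	s := &drainingServer{}
	h := s.handler()
	fmt.Println("before drain:", status(h, "/healthz/ready"), status(h, "/api"))
	s.draining.Store(true)
	fmt.Println("during drain:", status(h, "/healthz/ready"), status(h, "/api"))
}
```

Requests that arrive before the load balancer reacts to the 503 are still answered, which is the property the preStop wait relies on.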

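flowchart TB
    subgraph PodTermination["Pod termination flow"]
        A[Pod termination starts] --> B["preStop: drain_listeners"]
        B --> C[Listener enters draining state]
        C --> D["healthz/ready returns 503"]
        D --> E{CLB still sending requests?}
        E -->|Yes| F[Keep processing requests]
        F --> E
        E -->|No, node removed| G[Wait for existing connections to finish]
        G --> H["SIGTERM, process exits"]
    end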

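preStop configuration:

spec:
  terminationGracePeriodSeconds: 60  # overall termination deadline
  containers:
  - name: istio-proxy
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # 1. Enter DRAIN mode (requests are still accepted, but the health check returns 503)
            # Drain behavior can depend on the Envoy version and query parameters; verify it on your build
            curl -sf -X POST "http://localhost:15000/drain_listeners?inboundonly" || true
            # 2. Wait for the CLB health check to fail and the node to be removed (requests are still served meanwhile)
            sleep 15
            # 3. Wait for existing connections to finish
            sleep 5

Why does this achieve zero traffic loss?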

sequenceDiagram
    participant CLB
    participant Envoy
    participant Backend
    
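    Note over Envoy: drain_listeners runs
    
    rect rgb(255, 245, 200)
        Note over CLB,Envoy: Before the CLB removes the node (about 10-15s)
        CLB->>Envoy: Health check
        Envoy-->>CLB: 503 Not Ready
        CLB->>Envoy: New request (still forwarded)
        Envoy->>Backend: Processed normally
        Backend->>Envoy: Response
        Envoy->>CLB: Return response
        Note over Envoy: Request completes, nothing lost
    end
    
    rect rgb(200, 230, 200)
        Note over CLB: After the CLB removes the node
        CLB->>CLB: Stop sending traffic to this node
        Note over Envoy: Finish remaining requests
        Envoy->>Envoy: Connection draining complete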
    end

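Long-lived connections: WebSocket / HTTP/2

Long-lived connections make things more complicated. After the CLB removes the node:

A layer-4 CLB does not actively close established connections
Long-lived connections can last for hours or even days
If terminationGracePeriodSeconds is too short, connections are forcibly closed

flowchart TB
    subgraph short["Short-lived connections (HTTP/1.1)"]
        A1[Request] --> B1[Response] --> C1[Connection closed]
        D1[Finishes quickly after drain]
    end
    
    subgraph long["Long-lived connections (WebSocket/HTTP2)"]
        A2[Connection established] --> B2[Ongoing traffic...]
        B2 --> C2[May last for hours]
        D2[What happens after drain?]
    end

HTTP/2

HTTP/2 has the GOAWAY frame, and Envoy sends GOAWAY while draining:

What the GOAWAY frame does:
- Tells the client not to send new requests on the current connection
- The client opens a new connection for subsequent requests
- Existing streams continue until they complete

Envoy drain timing:

# Envoy bootstrap configuration (admin interface used by the preStop hook)
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 127.0.0.1  # loopback only; the admin API is unauthenticated
      port_value: 15000
# How long Envoy waits after GOAWAY before force-closing is bounded by its drain time.
# Upstream Envoy's --drain-time-s defaults to 600s (10 minutes).
# Istio sets its own value through the proxy's drainDuration, so check the effective value in your mesh.

WebSocket

WebSocket has no equivalent of GOAWAY, so it needs extra handling:

Approach	Description	Best for
Increase terminationGracePeriod	Wait for all connections to close on their own	Connections with a well-defined timeout
Application-level heartbeat timeout	The server closes idle connections	You control the application code
Client reconnect	The client detects the disconnect and reconnects automatically	The client can be modified
Accept a few disconnects	Set a reasonable timeout and accept that long-lived connections get cut	Non-critical workloads

Recommended configuration: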

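spec:
  # Long-lived connections need a longer graceful termination period
  terminationGracePeriodSeconds: 300  # 5 minutes
  
  containers:
  - name: istio-proxy
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # drain_listeners triggers GOAWAY
            curl -sf -X POST "http://localhost:15000/drain_listeners?inboundonly" || true
            # Wait for the CLB to remove the node
            sleep 15
            # Give long-lived connections more time to finish
            # Wait 4 minutes; 15s + 240s must stay below terminationGracePeriodSeconds
            sleep 240

Full sequence for long-lived connections: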

sequenceDiagram
    participant Client as Client
    participant CLB
    participant Envoy
    
    Note over Client,Envoy: Long-lived WebSocket/HTTP2 connection established
    
    rect rgb(255, 230, 200)
        Note over Envoy: drain_listeners runs
        Envoy->>Client: HTTP/2: send GOAWAY frame
        Note over Client: GOAWAY received, no new streams
        Note over Client: Existing streams continue
    end
    
    rect rgb(255, 245, 200)
        Note over CLB: Health check fails, node removed
        CLB->>CLB: New connections no longer go to this node
        Note over Client,Envoy: Existing connections continue (layer-4 CLB only forwards)
    end
    
    rect rgb(200, 230, 200)
        Note over Envoy: Wait for connections to drain (up to the drain time)
        Client->>Envoy: Existing requests continue
        Envoy->>Client: Response
        Note over Client: Client closes or times out
    end
    
    rect rgb(230, 200, 200)
        Note over Envoy: terminationGracePeriod expires
        Envoy->>Client: Remaining connections closed forcibly
        Note over Envoy: Pod exits
    end

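How the key parameters relate:

terminationGracePeriodSeconds >= preStop sleep + connection drain time

For long-lived connections:
- HTTP/2: terminationGracePeriod >= 15s + the effective Envoy drain time
- WebSocket: set it according to how long a disconnect the business can tolerate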

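Fallback: client reconnect

No server-side configuration can guarantee that long-lived connections are never dropped. The best practice is for clients to implement reconnection:

// WebSocket client example
function connect() {
  const ws = new WebSocket('wss://gateway.example.com/ws');

  ws.onclose = function(event) {
    console.log('Connection closed, reconnecting...');
    // Re-establish the connection after a short delay
    setTimeout(connect, 1000);
  };
}

connect();

HTTP/2 clients such as gRPC open a new connection automatically after receiving GOAWAY, so they usually need no extra handling.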

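Advanced: adjusting CLB weights

The approach above relies on health checks to remove the node, which takes 10-15 seconds. Actively adjusting the CLB weight switches traffic faster.

Comparison:

Approach	Traffic switch time	Complexity	Depends on
Health check	10-15s	Low	Nothing
CLB weight	1-2s	Medium	Cloud provider API

CLB weight flow: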

sequenceDiagram
    participant K8s as Kubernetes
    participant Pod as Gateway Pod
    participant Script as preStop script
    participant CLB as CLB API
    participant Envoy
    
    Note over K8s: Pod starts terminating
    K8s->>Pod: Run preStop hook
    
    rect rgb(255, 230, 200)
        Note over Script,CLB: Phase 1: actively remove traffic
        Script->>CLB: Set node weight to 0
        CLB->>CLB: Stop new connections immediately
        Note over CLB: No need to wait for health checks
    end
    
    rect rgb(200, 230, 200)
        Note over Envoy: Phase 2: drain connections
        Script->>Envoy: drain_listeners
        Envoy->>Envoy: Wait for existing connections to finish
    end
    
    rect rgb(200, 200, 230)
        Note over Pod: Phase 3: exit
        Script->>Script: Wait long enough
        K8s->>Pod: SIGTERM after preStop returns
        Pod->>Pod: Process exits
    end

Implementation:

The Pod needs the cloud provider's CLI and credentials, and the preStop script calls the CLB API.

Using Alibaba Cloud SLB as an example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: istio-proxy
        env:
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: SLB_ID
          value: "lb-xxxxxxxxxx"
        - name: REGION
          value: "cn-hangzhou"
        volumeMounts:
        - name: aliyun-credentials
          mountPath: /root/.aliyun
          readOnly: true
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
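                # Best effort: a failed API call must not skip the Envoy drain below,
                # so errors are tolerated instead of using set -e.
                
                # 1. Look up this node's backend server ID in the SLB
                # (this lookup assumes the output can be matched by node IP; adjust it to how your backends are identified)
                BACKEND_ID=$(aliyun slb DescribeLoadBalancerAttribute \
                  --RegionId "$REGION" \
                  --LoadBalancerId "$SLB_ID" \
                  --output cols=BackendServers.BackendServer[].ServerId \
                  | grep -w "$NODE_IP" | awk '{print $1}') || true
                
                if [ -n "$BACKEND_ID" ]; then
                  # 2. Set the weight to 0 to stop new traffic immediately
                  echo "Setting backend $BACKEND_ID weight to 0..."
                  if aliyun slb SetBackendServers \
                    --RegionId "$REGION" \
                    --LoadBalancerId "$SLB_ID" \
                    --BackendServers "[{\"ServerId\":\"$BACKEND_ID\",\"Weight\":0}]"; then
                    echo "Weight set to 0, waiting for connections to drain..."
                  else
                    echo "Failed to set weight, relying on health checks instead"
                  fi
                fi
                
                # 3. Tell Envoy to enter drain mode
                curl -sf -X POST "http://localhost:15000/drain_listeners?inboundonly" || true
                
                # 4. Wait for existing connections to finish (at weight 0 no new connections arrive)
                sleep 60
                
                echo "Drain complete, exiting..."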
      volumes:
      - name: aliyun-credentials
        secret:
          secretName: aliyun-credentials

Credentials (Secret):

apiVersion: v1
kind: Secret
metadata:
  name: aliyun-credentials
type: Opaque
stringData:
  config.json: |
    {
      "current": "default",
      "profiles": {
        "default": {
          "mode": "AK",
          "access_key_id": "LTAI5txxxxxxxxxx",
          "access_key_secret": "xxxxxxxxxxxxxxxxxx",
          "region_id": "cn-hangzhou"
        }
      }
    }

Tencent Cloud CLB example:

# Core logic of the preStop script
# Uses tccli or the API

# Change the backend weight
tccli clb ModifyTargetWeight \
  --LoadBalancerId lb-xxxxxxxx \
  --ListenerId lbl-xxxxxxxx \
  --Targets '[{"InstanceId":"ins-xxxxxx","Port":80,"Weight":0}]'

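A cleaner approach: Finalizer + Controller

Calling the cloud API directly from the Pod has several problems:

Every Pod needs credentials
The preStop script is complex and hard to debug
Credential management is a security risk

The recommended pattern is Finalizer + Controller: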

flowchart TB
    subgraph Controller["CLB Weight Controller"]
        I[Informer] -->|Watch Pod| WQ[WorkQueue]
        WQ --> R[Reconciler]
    end
    
    subgraph Reconcile["Reconcile logic"]
        R --> C1{Pod state?}
        C1 -->|Running + Ready| A1["Set weight=100"]
        C1 -->|Terminating| A2["Set weight=0"]
        A2 --> W["Wait 2-3s"]
        W --> A3["Remove Finalizer"]
        C1 -->|NotReady| A4["Set weight=0"]
    end
    
    A1 --> CLB[CLB API]
    A2 --> CLB
    A4 --> CLB

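Core mechanism:

Event	Controller action	CLB weight
Pod created and Ready	Add to CLB backends	100
Pod Ready → NotReady	Adjust weight	0
Pod enters Terminating	Adjust weight, wait, then remove the Finalizer	0
Pod deleted	Remove from CLB	-

Why a Finalizer rather than an Admission Webhook?

Aspect	Admission Webhook	Finalizer
Blocking behavior	Rejects the request; retry required	Holds the deletion; continues automatically
User experience	Error, must wait and retry	Delete command returns immediately
Implementation complexity	Webhook + Controller	Controller only
Failure impact	A broken Webhook blocks all deletions	A broken Controller only delays them

Full sequence: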

sequenceDiagram
    participant User as kubectl
    participant API as API Server
    participant Ctrl as CLB Controller
    participant CLB as CLB API
    participant Pod as Gateway Pod

    Note over User,Pod: Create flow
    User->>API: create Pod
    API->>Pod: Create Pod
    Ctrl->>API: Watch: Pod Created
    Ctrl->>API: Add Finalizer
    Pod->>Pod: Ready
    Ctrl->>API: Watch: Pod Ready
    Ctrl->>CLB: SetWeight(ip, 100)
    Note over CLB: Starts receiving traffic

    Note over User,Pod: Delete flow
    User->>API: delete Pod
    API->>Pod: Set DeletionTimestamp
    Note over Pod: Terminating<br/>Finalizer keeps the Pod object, kubelet starts preStop
    
    Ctrl->>API: Watch: Pod Terminating
    Ctrl->>CLB: SetWeight(ip, 0)
    Note over CLB: New connections stop
    
    Ctrl->>Ctrl: Wait 3s
    Ctrl->>API: Remove Finalizer
    Note over Pod: preStop keeps running until its drain wait ends
    Pod->>Pod: Process exits
    API->>Pod: Pod object deleted

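Controller core code:

package controller

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

const (
    FinalizerName = "clb.example.com/weight-controller"
    // drainWait is how long to wait for the CLB weight change to take effect.
    drainWait = 3 * time.Second
)

// Reconciler syncs Pod state to CLB backend weights.
// CLBClient.SetWeight is assumed to return an error. isPodReady,
// getDrainAnnotation and setDrainAnnotation are helpers defined elsewhere.
type Reconciler struct {
    client.Client
    CLBClient interface {
        SetWeight(ip string, weight int) error
    }
}

func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
    pod := &corev1.Pod{}
    if err := r.Get(ctx, req.NamespacedName, pod); err != nil {
        return reconcile.Result{}, client.IgnoreNotFound(err)
    }

    // The Pod is being deleted
    if !pod.DeletionTimestamp.IsZero() {
        return r.handleTerminating(ctx, pod)
    }
    // The Pod is running normally
    return r.handleRunning(ctx, pod)
}

func (r *Reconciler) handleRunning(ctx context.Context, pod *corev1.Pod) (reconcile.Result, error) {
    // Make sure the Finalizer is present
    if !controllerutil.ContainsFinalizer(pod, FinalizerName) {
        controllerutil.AddFinalizer(pod, FinalizerName)
        return reconcile.Result{}, r.Update(ctx, pod)
    }

    // Nothing to register with the CLB until the Pod has an IP
    if pod.Status.PodIP == "" {
        return reconcile.Result{}, nil
    }

    // Set the weight according to readiness
    weight := 0
    if isPodReady(pod) {
        weight = 100
    }
    // Returning the error lets controller-runtime retry with backoff
    return reconcile.Result{}, r.CLBClient.SetWeight(pod.Status.PodIP, weight)
}

func (r *Reconciler) handleTerminating(ctx context.Context, pod *corev1.Pod) (reconcile.Result, error) {
    if !controllerutil.ContainsFinalizer(pod, FinalizerName) {
        return reconcile.Result{}, nil
    }

    // 1. Set the weight to 0 (keep the Finalizer and retry if this fails)
    if err := r.CLBClient.SetWeight(pod.Status.PodIP, 0); err != nil {
        return reconcile.Result{}, err
    }

    // 2. Check whether we have waited long enough
    drainStart := getDrainAnnotation(pod)
    if drainStart.IsZero() {
        if err := setDrainAnnotation(ctx, r.Client, pod); err != nil {
            return reconcile.Result{}, err
        }
        return reconcile.Result{RequeueAfter: drainWait}, nil
    }
    if elapsed := time.Since(drainStart); elapsed < drainWait {
        return reconcile.Result{RequeueAfter: drainWait - elapsed}, nil
    }

    // 3. Remove the Finalizer so the Pod object can be deleted
    // Waiting for long-lived connections is left to preStop; this only waits for the CLB change
    controllerutil.RemoveFinalizer(pod, FinalizerName)
    return reconcile.Result{}, r.Update(ctx, pod)
}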
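The controller code calls an `isPodReady` helper that the article does not show. The sketch below illustrates what such a helper and the weight decision from the "core mechanism" table could look like. To stay runnable without Kubernetes dependencies it uses small local stand-in types (`podCondition`, `podStatus`) instead of the real `corev1` types, and `desiredWeight` is a hypothetical name.

```go
package main

import "fmt"

// podCondition and podStatus are minimal stand-ins for the corev1 types,
// used so the sketch compiles without k8s.io dependencies.
type podCondition struct {
	Type   string
	Status string
}

type podStatus struct {
	Conditions []podCondition
}

// isPodReady reports whether the Ready condition is present and "True".
func isPodReady(status podStatus) bool {
	for _, c := range status.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// desiredWeight mirrors the table: only a Ready, non-terminating Pod
// receives traffic; every other state maps to weight 0.
func desiredWeight(status podStatus, terminating bool) int {
	if terminating || !isPodReady(status) {
		return 0
	}
	return 100
}

func main() {
	ready := podStatus{Conditions: []podCondition{{Type: "Ready", Status: "True"}}}
	notReady := podStatus{Conditions: []podCondition{{Type: "Ready", Status: "False"}}}

	fmt.Println("running + ready:", desiredWeight(ready, false))
	fmt.Println("not ready:      ", desiredWeight(notReady, false))
	fmt.Println("terminating:    ", desiredWeight(ready, true))
}
```

In the real controller the same loop would run over `pod.Status.Conditions` and compare against `corev1.PodReady` and `corev1.ConditionTrue`.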

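Division of labor between Finalizer and preStop

Waiting for long-lived connections (for example 300s) belongs in preStop, not in the Finalizer:

Phase	Responsibility	Wait time	Notes
Finalizer	Remove CLB traffic	2-3s	Only waits for the CLB weight to take effect
preStop	Drain long-lived connections	300s	Triggers drain and waits for connections to end

Why split it this way?

Consideration	Explanation
Envoy state	Envoy only knows it should shut down after preStop triggers `drain_listeners`
Timeout	preStop is bounded by `terminationGracePeriodSeconds`
Clear responsibilities	The Finalizer handles the CLB; preStop handles connection draining
Fault isolation	The Controller only performs quick operations and never blocks for long

Note that a Finalizer only delays removal of the Pod object. The kubelet starts preStop as soon as DeletionTimestamp is set, so the two phases run in parallel, and the preStop wait must also cover the few seconds the Controller needs to set the weight.

Full sequence: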

sequenceDiagram
    participant User as kubectl
    participant API as API Server
    participant Ctrl as CLB Controller
    participant CLB as CLB
    participant Pod as Gateway Pod

    User->>API: delete Pod
    API->>Pod: DeletionTimestamp
    Note over Pod: Terminating<br/>Finalizer keeps the Pod object

    rect rgb(255, 245, 230)
        Note over Ctrl,CLB: Finalizer phase (2-3s)
        Ctrl->>CLB: SetWeight(ip, 0)
        Ctrl->>Ctrl: Wait 3s
        Note over CLB: New connections have stopped
    end

    Ctrl->>API: Remove Finalizer

    rect rgb(230, 255, 230)
        Note over Pod: preStop phase (300s, starts at DeletionTimestamp)
        Pod->>Pod: drain_listeners
        Note over Pod: Readiness fails<br/>existing connections keep being served
        Pod->>Pod: sleep 300
        Note over Pod: Wait for long-lived connections to drain
    end

    API->>Pod: SIGTERM after preStop returns
    Pod->>Pod: Process exits

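Matching Pod configuration:

spec:
  terminationGracePeriodSeconds: 330  # 300 + 30 buffer
  containers:
  - name: istio-proxy
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # 1. Trigger the Envoy drain (readiness fails; in-flight traffic is still served)
            pilot-agent request POST drain_listeners
            
            # 2. Wait for long-lived connections to drain
            # The Controller sets the CLB weight to 0 within a few seconds, after which no new connections arrive
            # From then on only existing connections need to finish
            sleep 300

Summary:

Finalizer: quick in and out (2-3s) → only removes CLB traffic
preStop: waits patiently (300s) → drains long-lived connections

Choosing an approach:

Scenario	Recommended approach
Quick validation / small scale	Call the API directly from the preStop script
Production / multi-cluster	Finalizer + Controller
Switch speed is not critical	Health check approach (simple and reliable)
Millisecond-level switching	CLB weight + direct DPVS/IPVS connection

Weight approach vs health check approach: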

gantt
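    title Traffic switch time comparison
    dateFormat  ss
    axisFormat %Ss
    
    section Health check approach
    preStop runs                  :a1, 00, 2s
    Wait for health check to fail :a2, after a1, 12s
    CLB removes node              :a3, after a2, 1s
    Connection draining           :a4, after a3, 5s
    
    section CLB weight approach
    preStop runs                  :b1, 00, 1s
    Set weight to 0               :b2, after b1, 1s
    CLB takes effect immediately  :b3, after b2, 1s
    Connection draining           :b4, after b3, 5s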

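Long-lived connections under the weight approach

To be clear: setting the CLB weight to 0 only stops new connections. It does not close existing ones.

flowchart TB
    subgraph weight0["After CLB weight is set to 0"]
        A[New connections] -->|no longer forwarded| X[Not sent to this node]
        B[Existing long-lived connections] -->|still forwarded| C[Gateway Pod]
    end

A layer-4 CLB only forwards TCP, so a weight of 0 means:

New TCP connections are not assigned to this node
Established TCP connections are still forwarded normally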
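The two rules above can be modeled in a few lines of Go. This is a simplified illustration, not how any particular cloud load balancer is implemented: `backend`, `pick`, and the `established` map are hypothetical, and `seq` is a caller-supplied non-negative counter that keeps the selection deterministic.

```go
package main

import "fmt"

type backend struct {
	Name   string
	Weight int
}

// pick chooses a backend for a NEW connection in proportion to the weights.
// Backends with weight 0 are never chosen. It returns "" when no backend
// has a positive weight.
func pick(backends []backend, seq int) string {
	total := 0
	for _, b := range backends {
		if b.Weight > 0 {
			total += b.Weight
		}
	}
	if total == 0 {
		return ""
	}
	slot := seq % total
	for _, b := range backends {
		if b.Weight <= 0 {
			continue
		}
		if slot < b.Weight {
			return b.Name
		}
		slot -= b.Weight
	}
	return ""
}

func main() {
	backends := []backend{{"node-1", 0}, {"node-2", 100}}
	// conn-42 was opened before node-1's weight dropped to 0; the connection
	// table is not touched by the weight change.
	established := map[string]string{"conn-42": "node-1"}

	fmt.Println("new connection ->", pick(backends, 7))
	fmt.Println("existing conn-42 ->", established["conn-42"])
}
```

The weight only influences `pick`, which runs when a connection is opened; established flows keep their original backend until one side closes them.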

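So the weight approach and the health check approach handle long-lived connections the same way:

Scenario	Health check approach	Weight approach
New connections stop	After 10-15s	After 1-2s
Existing long-lived connections	Stay open	Stay open
Connection draining	Wait or cut off at timeout	Wait or cut off at timeout

The real benefits of the weight approach:

The weight approach does not solve the long-lived connection problem. What it gives you is:

Faster stop for new connections - leaves more time for long-lived connections to drain
More control - no dependence on the uncertainty of health checks
Composability - works together with other strategies

Fundamental solutions for long-lived connections

Whichever approach you use, really solving the long-lived connection problem requires:

Approach	Description	Intrusiveness
Client reconnect	The client reconnects automatically	Client changes required
Application-level heartbeat	The server closes idle connections on timeout	Server changes required
GOAWAY (HTTP/2)	Sent automatically by Envoy	None
Long enough wait	terminationGracePeriod	Longer rollouts
Accept disconnects	Force-close and let clients reconnect	Business must accept it

Best-practice combination: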

spec:
  terminationGracePeriodSeconds: 300  # give long-lived connections enough time
  containers:
  - name: istio-proxy
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
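            # 1. Actively remove CLB traffic (optional, requires the cloud API)
            # set_clb_weight_zero.sh || true
            
            # 2. Envoy drain (HTTP/2 connections get GOAWAY)
            curl -sf -X POST "http://localhost:15000/drain_listeners?inboundonly" || true
            
            # 3. Wait
            # - With CLB weights: no need to wait for health checks, draining can start sooner
            # - Without CLB weights: wait for the health check to fail (15s)
            sleep 15
            
            # 4. Give long-lived connections time to drain
            # - HTTP/2: after GOAWAY the client opens a new connection; existing streams continue
            # - WebSocket: can only wait or time out
            sleep 240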

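Summary:

Approach	New-connection switch	Long-lived connections	Complexity
Health check	10-15s	Wait / timeout	Low
CLB weight	1-2s	Wait / timeout (same as above)	Medium
Weight + long wait	1-2s	More time to drain	Medium

The value of the CLB weight approach is that new connections stop earlier, which buys more drain time for long-lived connections. It does not solve long-lived connections themselves; the ultimate answer is still client reconnection.

The multiple-Pods-per-node problem

The approaches above assume the CLB can see the state of each individual Pod. With Deployment + Service (NodePort), however, several Gateway Pods may run on the same node, and the CLB health check approach no longer works.

Analysis: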

flowchart TB
    subgraph CLB
        HC["Health check<br/>Node:30080"]
    end
    
    subgraph Node1["Node1 (10.0.1.1)"]
        NP[NodePort 30080]
        KP[kube-proxy/iptables]
        Pod1["Gateway Pod 1<br/>draining"]
        Pod2["Gateway Pod 2<br/>healthy"]
        
        NP --> KP
        KP --> Pod1
        KP --> Pod2
    end
    
    HC -->|checks 10.0.1.1:30080| NP
    HC -.->|Pod2 is healthy, check passes| HC
    
    style Pod1 fill:#ffcccc
    style Pod2 fill:#ccffcc

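The problem:

The CLB health check targets NodeIP:NodePort
kube-proxy forwards the request to any backend Pod
As long as one Pod is healthy, the health check passes
A draining Pod may still receive traffic (forwarded by kube-proxy)

Traffic path:

CLB -> Node:30080 -> kube-proxy -> Pod1 (draining) or Pod2 (healthy)
                                   ↑
                                   chosen at random, may pick the draining Pod

kube-proxy does update its rules once the Pod is removed from Endpoints, but there is a sync delay (typically 1-5 seconds) during which traffic can still reach the draining Pod.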
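The difference between a node-level and a per-Pod health check can be stated as two small functions. The Go sketch below is a model of the reasoning only: `pod`, `nodePortHealthy`, and `perPodHealthy` are hypothetical names, and a draining Pod is represented simply as `Ready: false`.

```go
package main

import "fmt"

type pod struct {
	Name  string
	Ready bool // false once the Pod is draining
}

// nodePortHealthy models a health check against NodeIP:NodePort: it passes
// as long as any Pod behind the NodePort can answer.
func nodePortHealthy(pods []pod) bool {
	for _, p := range pods {
		if p.Ready {
			return true
		}
	}
	return false
}

// perPodHealthy models direct-to-Pod health checks: every backend is judged
// on its own state.
func perPodHealthy(pods []pod) map[string]bool {
	result := make(map[string]bool, len(pods))
	for _, p := range pods {
		result[p.Name] = p.Ready
	}
	return result
}

func main() {
	node1 := []pod{{"gateway-1", false}, {"gateway-2", true}}

	// The node stays in rotation even though gateway-1 is draining.
	fmt.Println("node-level check passes:", nodePortHealthy(node1))
	// With per-Pod backends the draining Pod is taken out individually.
	fmt.Println("per-pod results:", perPodHealthy(node1))
}
```

With the node-level check the load balancer has no signal that one of the two Pods is draining, which is why the health-check-based drain does not work in this layout.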

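Deployment options compared:

Deployment option	CLB backend	Health check target	Per-Pod awareness
DaemonSet + HostNetwork	Pod IP	Pod:80	Yes
Deployment + NodePort	Node IP	Node:30080	No
Deployment + LoadBalancer	Depends on the cloud provider	May be the node	Partial

Solutions for multiple Pods per node:

Solution 1: switch to HostNetwork

Run only one Pod per node and let the CLB connect directly to the Pod IP: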

spec:
  hostNetwork: true
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: istio-ingressgateway
        topologyKey: kubernetes.io/hostname

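Solution 2: use the cloud provider's ENI direct-connect mode

Some cloud providers let the CLB connect directly to Pod IPs (bypassing NodePort):

flowchart LR
    subgraph CLB
        BE1[Backend: Pod1 IP]
        BE2[Backend: Pod2 IP]
    end
    
    subgraph Node1
        Pod1[Pod1]
        Pod2[Pod2]
    end
    
    BE1 -->|direct| Pod1
    BE2 -->|direct| Pod2

Alibaba Cloud: Terway network plugin + ENI mode
Tencent Cloud: VPC-CNI mode + direct Pod connection
AWS: AWS VPC CNI + NLB IP mode

In this mode the CLB can health-check each Pod individually.

Solution 3: use externalTrafficPolicy: Local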

apiVersion: v1
kind: Service
spec:
  type: NodePort
  externalTrafficPolicy: Local
  ports:
  - port: 80
    nodePort: 30080

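What externalTrafficPolicy: Local does:

Traffic is forwarded only to Pods on the local node
If the node has no healthy Pod, the health check fails
The CLB removes that node

This still requires at most one Pod per node, otherwise the problem remains.

Solution 4: use the CLB in layer-7 mode (HTTP/HTTPS)

If the CLB supports layer-7 load balancing, it can forward directly to Pod IPs:

flowchart LR
    CLB[CLB layer 7] --> Pod1[Pod1 IP:8080]
    CLB --> Pod2[Pod2 IP:8080]
    
    CLB -.->|independent health check| Pod1
    CLB -.->|independent health check| Pod2

A layer-7 CLB adds one more hop of latency, however, and does not support non-HTTP protocols.

Solution 5: Finalizer + Controller updating the CLB dynamically

A Controller watches Pod state changes and uses a Finalizer so that it always gets to remove traffic before the Pod object disappears: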

sequenceDiagram
    participant User as kubectl
    participant K8s as Kubernetes API
    participant Ctrl as CLB Controller
    participant CLB as CLB API
    participant Pod
    
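    User->>K8s: delete Pod
    K8s->>Pod: Set DeletionTimestamp
    Note over Pod: Finalizer keeps the Pod object, preStop starts
    
    K8s->>Ctrl: Watch: Pod Terminating
    Ctrl->>CLB: SetWeight(ip, 0)
    Ctrl->>Ctrl: Wait 3s
    Ctrl->>K8s: Remove Finalizer
    K8s->>Pod: Object deleted once containers exit

This makes it possible to:

Always remove traffic as part of deletion - the Finalizer guarantees the Controller sees the Pod before it is gone
Control traffic per Pod - no dependence on health checks
Admit traffic at the right moment - weight is added only after the Pod is observed Ready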

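Recommendations:

Scenario	Recommended approach
New cluster	HostNetwork + DaemonSet
Existing NodePort deployment	Migrate to ENI direct connect or HostNetwork
Deployment mode cannot change	Finalizer + Controller
Latency is not critical	Layer-7 CLB

Summary:

With multiple Pods per node behind NodePort, the CLB health check cannot see the state of an individual Pod. Recommendations:

Prefer HostNetwork - one Pod per node, CLB connects directly
Or use the cloud provider's ENI direct connect - CLB connects to Pod IPs
Or build a Finalizer + Controller - actively sync Pod state to the CLB

CLB health check configuration

The key to coordinating the CLB with the Pod lifecycle is the health check configuration:

Parameter	Recommended value	Notes
Check interval	5s	How often health checks run
Response timeout	2s	Timeout for a single check
Unhealthy threshold	2	2 consecutive failures mark the backend unhealthy
Healthy threshold	2	2 consecutive successes mark the backend healthy

With this configuration the CLB needs roughly 5s × 2 = 10s to notice an unhealthy node (slightly more in the worst case, depending on where in the check interval the failure begins).

Timing formula:

preStop sleep time >= CLB check interval × unhealthy threshold + buffer
                   >= 5s × 2 + 5s = 15s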

Complete configuration example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # always keep enough replicas available
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: istio-proxy
        ports:
        - containerPort: 8080
          name: http2
        - containerPort: 8443
          name: https
        - containerPort: 15021
          name: status-port
        
        # Readiness probe: controls whether traffic is admitted
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: 15021
          initialDelaySeconds: 1
          periodSeconds: 2
          failureThreshold: 30
        
        # Liveness probe: controls whether the container is restarted
        livenessProbe:
          httpGet:
            path: /healthz/ready
            port: 15021
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 3
        
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
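                # Enter DRAIN mode (new requests are still served; readiness returns 503)
                curl -sf -X POST "http://localhost:15000/drain_listeners?inboundonly" || true
                # Wait for the CLB to remove the node (check interval 5s × unhealthy threshold 2 + 5s buffer)
                sleep 15
                # Wait for in-flight requests to finish
                sleep 5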
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: istio-ingressgateway
spec:
  minAvailable: 2  # keep at least 2 available
  selector:
    matchLabels:
      app: istio-ingressgateway

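CLB health check endpoint

Point the CLB health check at Istio's status port:

CLB setting	Value
Protocol	HTTP
Port	15021
Path	/healthz/ready
Check interval	5 seconds
Timeout	2 seconds
Unhealthy threshold	2

Once the Pod enters DRAIN mode, /healthz/ready returns 503 and the CLB removes the node after 2 failed checks (about 10 seconds).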

Full sequence diagram

sequenceDiagram
    participant K8s as Kubernetes
    participant Pod as Gateway Pod
    participant Envoy as Envoy Proxy
    participant CLB as CLB
    participant User as User
    
    Note over K8s: Rolling update starts
    
    rect rgb(200, 230, 200)
        Note over K8s,Pod: Phase 1: new Pod becomes ready
        K8s->>Pod: Create new Pod
        Pod->>Envoy: Start Envoy
        Envoy->>Envoy: readinessProbe passes
        K8s->>K8s: Add to Endpoints
        CLB->>Pod: Health check passes
        CLB->>CLB: Add new node
    end
    
    rect rgb(255, 230, 200)
        Note over K8s,CLB: Phase 2: old Pod exits gracefully
        K8s->>Pod: Start termination (preStop)
        K8s->>K8s: Remove from Endpoints
        Pod->>Envoy: preStop: drain_listeners
        Envoy->>Envoy: Listeners draining, requests still served
        Envoy->>Envoy: /healthz/ready returns 503
        
        loop Wait for CLB removal (15s)
            CLB->>Envoy: Health check
            Envoy-->>CLB: 503 Not Ready
        end
        
        CLB->>CLB: Remove old node
    end
    
    rect rgb(200, 200, 230)
        Note over Pod,User: Phase 3: connection draining
        User->>Envoy: In-flight requests
        Envoy->>User: Responses complete
        Note over Pod: sleep 5s for connections to finish
        K8s->>Pod: SIGTERM after preStop returns
        Pod->>Pod: Process exits
    end

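Verification

After deployment, verify zero traffic loss as follows:

1. Load test

# Run a sustained load test with wrk
wrk -t4 -c100 -d300s http://gateway.example.com/api/health

# Trigger a rolling update at the same time
kubectl rollout restart deployment/istio-ingressgateway

# Watch for failed requests (wrk reports socket errors and non-2xx/3xx responses)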

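2. Log check

# Follow the logs of a Pod that is about to be terminated
kubectl logs -f <pod-name> -c istio-proxy

# You should see:
# - drain_listeners being executed
# - no new connections during the wait
# - a clean exit

3. CLB monitoring

Check the CLB metrics:

Changes in the number of healthy nodes
5xx error rate
Failed connection count

Switching workload types smoothly

When you need to move from one deployment option to another (for example from Deployment to DaemonSet), how do you keep the switch free of traffic loss?

Switch scenarios

Common reasons to switch:

Direction	Reason
Deployment → DaemonSet	Better performance, simpler operations
NodePort → HostNetwork	Fewer network hops
Single cluster → multi-cluster	Disaster recovery, scaling out

Core question: can Gateway rules be shared?

Yes. The key is the selector field of the Gateway resource.

How Istio distributes Gateway configuration: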

flowchart TB
    subgraph IstioConfig["Istio Config"]
        GW["Gateway<br/>selector: istio=ingressgateway"]
        VS["VirtualService<br/>gateways: my-gateway"]
    end
    
    subgraph istiod
        PILOT[Pilot]
    end
    
    subgraph GatewayPods["Gateway Pods"]
        DEP["Deployment Pod<br/>label: istio=ingressgateway"]
        DS["DaemonSet Pod<br/>label: istio=ingressgateway"]
    end
    
    GW --> PILOT
    VS --> PILOT
    PILOT -->|xDS| DEP
    PILOT -->|xDS| DS
    
    Note1["Any Pod with a matching label<br/>receives the configuration"]

Example Gateway resource:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway  # key point: label selector
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*.example.com"

Any Pod carrying the istio: ingressgateway label receives this Gateway configuration.

How old and new workloads share configuration:

# Old Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway
spec:
  template:
    metadata:
      labels:
        app: istio-ingressgateway
        istio: ingressgateway  # the key label
---
# New DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: istio-ingressgateway-ds
spec:
  template:
    metadata:
      labels:
        app: istio-ingressgateway-ds
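        istio: ingressgateway  # same label

With this configuration:

Both the Deployment Pods and the DaemonSet Pods carry istio: ingressgateway
istiod pushes the same Gateway/VirtualService configuration to both
Both route traffic correctly
The CLB can point at both at once, or switch over gradually

Verify that configuration is delivered:

# Routes on the Deployment Pod
istioctl proxy-config routes deploy/istio-ingressgateway -n istio-system

# Routes on the DaemonSet Pod
istioctl proxy-config routes ds/istio-ingressgateway-ds -n istio-system

# The two should be identical

Configuration consistency during the switch: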

sequenceDiagram
    participant GW as Gateway resource
    participant Istiod
    participant Old as Deployment Pod
    participant New as DaemonSet Pod
    
    Note over GW: Configuration defined
    GW->>Istiod: Gateway selector: istio=ingressgateway
    
    rect rgb(200, 230, 200)
        Note over Old,New: Both match the selector
        Istiod->>Old: Push config v1
        Istiod->>New: Push config v1
        Note over Old,New: Same config, traffic can switch smoothly
    end

Different selectors (not recommended):

If the old and new workloads use different labels, you need multiple Gateway resources:

# Not recommended: two Gateways to maintain
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: gateway-old
spec:
  selector:
    version: old
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: gateway-new
spec:
  selector:
    version: new

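The problems with this:

Two sets of configuration must be kept in sync
VirtualService has to bind to both Gateways
It is error-prone

Selector mechanisms across Istio resources

Different Istio resources select their targets through different fields, and not necessarily through the same label:

Resource	Selector field	Notes
Gateway	`spec.selector`	Selects the Pods this Gateway applies to
VirtualService	`spec.gateways`	References Gateways by name; does not select Pods
DestinationRule	`spec.host`	Matches a service name; does not select Pods
EnvoyFilter	`spec.workloadSelector`	Can select Pods by any label
Sidecar	`spec.workloadSelector`	Can select Pods by any label
AuthorizationPolicy	`spec.selector`	Can select Pods by any label

The EnvoyFilter pitfall:

EnvoyFilter uses workloadSelector, which may target a specific label: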

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: custom-header-filter
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      app: istio-ingressgateway  # note this label!
  configPatches:
  - applyTo: HTTP_FILTER
    # ...

Problem scenario:

flowchart TB
    subgraph EnvoyFilter
        EF["workloadSelector:<br/>app: istio-ingressgateway"]
    end
    
    subgraph Deployment
        DEP["labels:<br/>app: istio-ingressgateway<br/>istio: ingressgateway"]
    end
    
    subgraph DaemonSet
        DS["labels:<br/>app: istio-ingressgateway-ds<br/>istio: ingressgateway"]
    end
    
    EF -->|matches| DEP
    EF -.->|no match!| DS
    
    style DS fill:#ffcccc

If the DaemonSet uses a different app label, the EnvoyFilter is not applied to it.
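Exact-match label selection, as used by the Gateway selector and the EnvoyFilter workloadSelector, boils down to a subset check. The Go sketch below replays the scenario from the diagram; `matches` is a hypothetical helper written for illustration, not Istio's implementation.

```go
package main

import "fmt"

// matches reports whether every key/value pair in selector is present in
// labels. An empty selector matches every workload.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		got, ok := labels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	deployment := map[string]string{"app": "istio-ingressgateway", "istio": "ingressgateway"}
	daemonSet := map[string]string{"app": "istio-ingressgateway-ds", "istio": "ingressgateway"}

	gatewaySelector := map[string]string{"istio": "ingressgateway"}
	envoyFilterSelector := map[string]string{"app": "istio-ingressgateway"}

	fmt.Println("Gateway -> Deployment:    ", matches(gatewaySelector, deployment))
	fmt.Println("Gateway -> DaemonSet:     ", matches(gatewaySelector, daemonSet))
	fmt.Println("EnvoyFilter -> Deployment:", matches(envoyFilterSelector, deployment))
	fmt.Println("EnvoyFilter -> DaemonSet: ", matches(envoyFilterSelector, daemonSet))
}
```

Only the last combination fails, which is exactly the silent gap to look for before moving traffic to the new workload.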

Check the selectors of all related resources:

# List every EnvoyFilter and its selector
kubectl get envoyfilter -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.spec.workloadSelector.labels}{"\n"}{end}'

# List every AuthorizationPolicy and its selector
kubectl get authorizationpolicy -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.spec.selector.matchLabels}{"\n"}{end}'

# List every Sidecar resource
kubectl get sidecar -A -o yaml | grep -A5 workloadSelector

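Complete label plan:

To make sure the old and new workloads receive exactly the same configuration, all relevant labels must match:

# Deployment
metadata:
  labels:
    app: istio-ingressgateway       # EnvoyFilter may select on this
    istio: ingressgateway           # Gateway selects on this
    version: v1                     # optional, to tell them apart
    
# DaemonSet - keep the key labels identical
metadata:
  labels:
    app: istio-ingressgateway       # same as the Deployment
    istio: ingressgateway           # same as the Deployment
    version: v2                     # optional, to tell them apart

Pre-switch checklist: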

#!/bin/bash
# check-istio-selectors.sh

echo "=== Checking Gateway selector ==="
kubectl get gateway -A -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.selector}{"\n"}{end}'

echo "=== Checking EnvoyFilter workloadSelector ==="
kubectl get envoyfilter -A -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.workloadSelector.labels}{"\n"}{end}'

echo "=== Checking AuthorizationPolicy selector ==="
kubectl get authorizationpolicy -A -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.selector.matchLabels}{"\n"}{end}'

echo "=== Checking Sidecar workloadSelector ==="
kubectl get sidecar -A -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.workloadSelector.labels}{"\n"}{end}'

echo "=== Checking RequestAuthentication selector ==="
kubectl get requestauthentication -A -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.selector.matchLabels}{"\n"}{end}'

echo "=== Checking PeerAuthentication selector ==="
kubectl get peerauthentication -A -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.selector.matchLabels}{"\n"}{end}'

Verify configuration consistency:

Before switching, compare the full Envoy configuration of the old and new Pods:

# Export the Deployment Pod's configuration
istioctl proxy-config all deploy/istio-ingressgateway -n istio-system -o json > old-config.json

# Export the DaemonSet Pod's configuration
istioctl proxy-config all ds/istio-ingressgateway-ds -n istio-system -o json > new-config.json

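# Compare (per-Pod fields such as timestamps may differ; look at listeners, routes and filters)
diff old-config.json new-config.json

If an EnvoyFilter turns out to match only the old workload, there are two options:

Option A: change the DaemonSet labels (recommended)

# The DaemonSet uses all the same labels as the Deployment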
labels:
  app: istio-ingressgateway  # keep identical
  istio: ingressgateway

Option B: change the EnvoyFilter selector

# Make the EnvoyFilter select on a more general label
spec:
  workloadSelector:
    labels:
      istio: ingressgateway  # 而不是 app: xxx

或者使用 matchExpressions：

# 暂不支持，EnvoyFilter 只支持精确匹配
# 但可以创建多个 EnvoyFilter 分别匹配

最佳实践：

新 Workload 复用所有关键 label，只用 version 或类似字段区分
切换前检查所有 Istio 资源的 selector
对比新旧 Pod 的 Envoy 配置确保一致
EnvoyFilter 优先使用通用 label（如 istio: ingressgateway）

Case Study
#

The following is a typical production setup:

# Gateway - selects on the app label
spec:
  selector:
    app: istio-ingressgateway

# Deployment - selects on two labels
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway

# EnvoyFilter 1 (traffic-tagger) - no workloadSelector, patches limited by context
spec:
  configPatches:
  - match:
      context: GATEWAY  # apply the patch only to gateway proxies
    # ...

# EnvoyFilter 2 (response-add-header) - selects on the istio label
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  # ...

# EnvoyFilter 3 (add-user-agent) - no workloadSelector
spec:
  configPatches:
  - match:
      context: GATEWAY
    # ...

How each resource matches:

Resource	Selector	Matching rule
Gateway	`app: istio-ingressgateway`	Pod must carry this label
EnvoyFilter 1	none (context: GATEWAY)	Every gateway Pod in the EnvoyFilter's scope
EnvoyFilter 2	`istio: ingressgateway`	Pod must carry this label
EnvoyFilter 3	none (context: GATEWAY)	Every gateway Pod in the EnvoyFilter's scope

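The two ways an EnvoyFilter is matched:

flowchart TB
    subgraph EnvoyFilter matching
        A[EnvoyFilter] --> B{workloadSelector set?}
        B -->|yes| C[Match Pods by label]
        B -->|no| D{context?}
        D -->|GATEWAY| E[All gateway Pods in scope]
        D -->|SIDECAR| F[All sidecar Pods in scope]
        D -->|ANY| G[All Pods in scope]
    end

An EnvoyFilter without a workloadSelector:

spec:
  configPatches:
  - match:
      context: GATEWAY  # this is the effective match condition

Without a workloadSelector, the EnvoyFilter applies to every workload in its own namespace (or to the whole mesh if it lives in the root configuration namespace, istio-system by default); context: GATEWAY then narrows each patch to gateway proxies
Pod labels are irrelevant: any gateway in that scope gets the patch
An EnvoyFilter of this kind keeps working across the switch, provided the new Workload runs in the same scope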

Labels the new DaemonSet needs:

# The new DaemonSet's Pod template must carry both labels
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: istio-ingressgateway-ds
spec:
  template:
    metadata:
      labels:
        app: istio-ingressgateway     # required by the Gateway
        istio: ingressgateway         # required by EnvoyFilter 2

Verifying that the configuration matches:

# 1. Check the Gateway selector
kubectl get gateway istio-ingressgateway -n istio-gateways -o jsonpath='{.spec.selector}'
# Output: {"app":"istio-ingressgateway"}

# 2. Check the workloadSelector of every EnvoyFilter
kubectl get envoyfilter -n istio-gateways -o custom-columns=\
NAME:.metadata.name,\
SELECTOR:.spec.workloadSelector.labels

# Example output (custom-columns prints maps in Go syntax):
# NAME                           SELECTOR
# traffic-tagger-envoy-filter    <none>                      ← matched by context
# response-add-header-trace-id   map[istio:ingressgateway]   ← matched by label
# add-user-agent-gateway         <none>                      ← matched by context

# 3. Check that the Pods carry both labels (old and new Pods are listed together)
kubectl get pods -n istio-gateways -l app=istio-ingressgateway,istio=ingressgateway
Script to verify the delivered configuration:

#!/bin/bash
# verify-envoyfilter.sh

NAMESPACE="istio-gateways"
# Both Workloads share the app label, so the version label tells them apart
OLD_POD=$(kubectl get pods -n "$NAMESPACE" -l app=istio-ingressgateway,version=v1 -o jsonpath='{.items[0].metadata.name}')
NEW_POD=$(kubectl get pods -n "$NAMESPACE" -l app=istio-ingressgateway,version=v2 -o jsonpath='{.items[0].metadata.name}')

echo "=== Compare listener configuration ==="
diff <(istioctl proxy-config listeners "$OLD_POD" -n "$NAMESPACE" -o json | jq -S .) \
     <(istioctl proxy-config listeners "$NEW_POD" -n "$NAMESPACE" -o json | jq -S .)

echo "=== Compare route configuration ==="
diff <(istioctl proxy-config routes "$OLD_POD" -n "$NAMESPACE" -o json | jq -S .) \
     <(istioctl proxy-config routes "$NEW_POD" -n "$NAMESPACE" -o json | jq -S .)

echo "=== Check that specific EnvoyFilters took effect ==="
# Is the Lua filter present?
istioctl proxy-config listeners "$NEW_POD" -n "$NAMESPACE" -o json | \
    grep -q "envoy.filters.http.lua" && echo "Lua filter: OK" || echo "Lua filter: MISSING!"

# Is the WASM filter present?
istioctl proxy-config listeners "$NEW_POD" -n "$NAMESPACE" -o json | \
    grep -q "traffic_tagger" && echo "WASM filter: OK" || echo "WASM filter: MISSING!"

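Switch checklist for this case:

Check	Status
Gateway (app label)	New DaemonSet needs `app: istio-ingressgateway`
EnvoyFilter 1 (context)	Compatible automatically, nothing to do
EnvoyFilter 2 (istio label)	New DaemonSet needs `istio: ingressgateway`
EnvoyFilter 3 (context)	Compatible automatically, nothing to do

Important distinction: Kubernetes selectors vs. Istio selectors

Two different kinds of selector are in play, and they are easy to confuse:

Type	Purpose	Requirement
Kubernetes Workload selector	Manages the Pods the Workload created	Should differ between Workloads
Istio resource selector	Chooses where configuration is pushed	May be shared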

flowchart TB
    subgraph "Kubernetes level"
        DEP[Deployment
selector: app=gw, version=v1]
        DS[DaemonSet
selector: app=gw, version=v2]
        POD1[Pod 1
labels: app=gw, version=v1
owner: Deployment]
        POD2[Pod 2
labels: app=gw, version=v2
owner: DaemonSet]
        
        DEP -->|manages| POD1
        DS -->|manages| POD2
    end
    
    subgraph "Istio level"
        GW[Gateway
selector: app=gw]
        EF[EnvoyFilter
selector: istio=ingressgateway]
        
        GW -->|config push| POD1
        GW -->|config push| POD2
        EF -->|config push| POD1
        EF -->|config push| POD2
    end

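Kubernetes Workload selector rules:

Each Pod is managed by at most one Workload (recorded in ownerReferences)
Selectors of different Workloads should not overlap; the Kubernetes documentation warns that controllers with overlapping selectors may not behave correctly

# ❌ Wrong: two Workloads with exactly the same selector
# Deployment
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway

# DaemonSet - the same selector
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway

With identical selectors, the controllers can interfere with each other:

Each controller sees the other Workload's Pods as selector matches
Pods that already have a controller are left alone, but an orphaned Pod may be adopted by either controller
The outcome is hard to predict

Correct approach: give each Workload selector a distinguishing value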

# ✅ Deployment - distinguished by version=v1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway
      version: v1  # distinguishing value
  template:
    metadata:
      labels:
        app: istio-ingressgateway     # used by Istio
        istio: ingressgateway         # used by Istio
        version: v1                   # used by the Kubernetes Workload selector

---
# ✅ DaemonSet - distinguished by version=v2
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: istio-ingressgateway-ds
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway
      version: v2  # distinguishing value
  template:
    metadata:
      labels:
        app: istio-ingressgateway     # used by Istio (same)
        istio: ingressgateway         # used by Istio (same)
        version: v2                   # used by the Kubernetes Workload selector (different)

With this configuration:

Resource	Selector	Result
Deployment	app=gw, istio=ing, version=v1	Manages v1 Pods only
DaemonSet	app=gw, istio=ing, version=v2	Manages v2 Pods only
Gateway	app=gw	Matches both
EnvoyFilter	istio=ing	Matches both
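
The results in this table follow from one rule: a selector matches a Pod when every key/value pair in the selector is present in the Pod's labels. Below is a minimal sketch of that subset check in plain shell, using the label sets from this section. `labels_match` is a hypothetical helper written for illustration; it is not part of kubectl or istioctl.

```shell
# labels_match SELECTOR LABELS
# Both arguments are comma-separated key=value lists.
# Exits 0 when every selector pair appears in LABELS, 1 otherwise.
labels_match() (
    selector=$1
    labels=",$2,"          # pad with commas so whole pairs are compared
    IFS=,                  # split the selector on commas (subshell only)
    for pair in $selector; do
        case "$labels" in
            *",$pair,"*) ;;      # this pair is present, keep checking
            *) exit 1 ;;         # a selector pair is missing
        esac
    done
    exit 0
)

ds_pod="app=istio-ingressgateway,istio=ingressgateway,version=v2"

labels_match "app=istio-ingressgateway" "$ds_pod" && echo "Gateway: match"
labels_match "istio=ingressgateway" "$ds_pod" && echo "EnvoyFilter: match"
labels_match "app=istio-ingressgateway,istio=ingressgateway,version=v1" "$ds_pod" \
    || echo "Deployment selector: no match"
```

The Istio selectors carry fewer pairs than the Pod has labels, so they match both Workloads, while the Deployment selector fails on `version` and leaves the DaemonSet Pod alone.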

Alternative: use a completely different app label

# Deployment
labels:
  app: istio-ingressgateway
  istio: ingressgateway

# DaemonSet
labels:
  app: istio-ingressgateway-ds  # different app
  istio: ingressgateway         # same, so the EnvoyFilter still matches

This approach requires checking the Gateway selector:

If the Gateway selects on app: istio-ingressgateway, the DaemonSet will not receive the Gateway configuration
The Gateway selector has to be changed to istio: ingressgateway

Complete configuration of the new DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: istio-ingressgateway-ds
  namespace: istio-gateways
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
      istio: ingressgateway
      version: v2              # key point: differs from the Deployment
  template:
    metadata:
      labels:
        app: istio-ingressgateway     # matched by the Istio Gateway
        istio: ingressgateway         # matched by the Istio EnvoyFilter
        version: v2                   # used by Kubernetes to tell the Workloads apart
    spec:
      # ... remaining configuration

Verifying that there is no conflict:

# Inspect the ownerReferences of each Pod
kubectl get pods -n istio-gateways -o custom-columns=\
NAME:.metadata.name,\
OWNER:.metadata.ownerReferences[0].kind,\
OWNER_NAME:.metadata.ownerReferences[0].name

# Expected output:
# NAME                              OWNER        OWNER_NAME
# istio-ingressgateway-xxx          ReplicaSet   istio-ingressgateway-xxx
# istio-ingressgateway-ds-yyy       DaemonSet    istio-ingressgateway-ds

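Exact Behavior of the Kubernetes Controllers
#

Based on the Kubernetes source code, this is what happens when two Workloads use the same selector. The Go excerpts in this section are simplified for readability and omit context parameters and error handling.

Core mechanism: ControllerRef and adoption

A Kubernetes controller declares ownership of a Pod through the ownerReferences entry that has controller: true:

# ownerReferences on a Pod
metadata:
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: istio-ingressgateway-xxx
    uid: xxx-xxx-xxx
    controller: true      # key field: marks the managing controller
    blockOwnerDeletion: true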
ReplicaSet controller source (simplified):

// pkg/controller/replicaset/replica_set.go

func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
    // 1. Fetch the ReplicaSet
    rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
    
    // 2. List every Pod in the namespace; claimPods applies the selector
    allPods, err := rsc.podLister.Pods(rs.Namespace).List(labels.Everything())
    
    // 3. Filter: keep only the Pods this ReplicaSet controls
    filteredPods := controller.FilterActivePods(allPods)
    filteredPods, err = rsc.claimPods(rs, selector, filteredPods)
    
    // Inside claimPods:
    // - Pod has no controllerRef: try to adopt it (if the selector matches)
    // - Pod has a different controllerRef: skip it
    // - Pod's controllerRef points at this ReplicaSet: keep it
}

Key function: ClaimPods (simplified)

// pkg/controller/controller_ref_manager.go

func (m *PodControllerRefManager) ClaimPods(pods []*v1.Pod) ([]*v1.Pod, error) {
    var claimed []*v1.Pod
    for _, pod := range pods {
        ok, err := m.ClaimObject(pod, ...)
        if ok {
            claimed = append(claimed, pod)
        }
    }
    return claimed, nil
}

func (m *BaseControllerRefManager) ClaimObject(obj metav1.Object, ...) (bool, error) {
    controllerRef := metav1.GetControllerOf(obj)
    
    if controllerRef != nil {
        // The Pod already has a controller
        if controllerRef.UID != m.Controller.GetUID() {
            // Not our Pod, skip it
            return false, nil
        }
        // Our Pod
        return true, nil
    }
    
    // The Pod has no controller: try to adopt it
    // (the real code also requires the selector to match)
    if m.CanAdopt() {
        // Add the ownerReference
        return true, m.AdoptObject(obj)
    }
    return false, nil
}

The DaemonSet controller follows similar logic:

// pkg/controller/daemon/daemon_controller.go

func (dsc *DaemonSetsController) syncDaemonSet(key string) error {
    ds, err := dsc.dsLister.DaemonSets(namespace).Get(name)
    
    // Fetch the matching Pods
    daemonPods, err := dsc.getDaemonPods(ds)
    
    // getDaemonPods also calls ClaimPods internally
    // and returns only Pods whose controllerRef points at this DaemonSet
}

Summary of the exact behavior:

flowchart TB
    subgraph Scenario["Two Workloads with the same selector"]
        DEP[Deployment/ReplicaSet]
        DS[DaemonSet]
        
        subgraph PODs["All Pods matching the selector"]
            P1[Pod 1
controllerRef: RS]
            P2[Pod 2
controllerRef: RS]
            P3[Pod 3
controllerRef: DS]
            P4[Pod 4
controllerRef: DS]
            P5[Pod 5
no controllerRef]
        end
    end
    
    DEP -->|ClaimPods| P1
    DEP -->|ClaimPods| P2
    DEP -.->|skip, not ours| P3
    DEP -.->|skip, not ours| P4
    DEP -->|try to adopt| P5
    
    DS -.->|skip, not ours| P1
    DS -.->|skip, not ours| P2
    DS -->|ClaimPods| P3
    DS -->|ClaimPods| P4
    DS -->|try to adopt| P5

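Conclusion: no direct conflict, but there are edge cases

Scenario	Behavior
Pod that already has a controllerRef	Managed only by its owner; other controllers skip it
Pod without a controllerRef (orphan)	Controllers race to adopt it, first one wins
Newly created Pod	controllerRef is set at creation, no conflict

Potential problem: racing for orphan Pods

When a Pod has no controllerRef (for example it was created by hand, or its owner was deleted with orphan propagation):

// Both controllers try to adopt
if m.CanAdopt() {
    return true, m.AdoptObject(obj)  // whoever runs first adopts the Pod
}

This can lead to:

A Pod adopted by the wrong controller
Incorrect replica counts in the controller
Unpredictable scaling behavior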

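Another concern: controller log warnings

Even without an actual conflict, the controller may log warnings. The message below is illustrative rather than a verbatim kube-controller-manager log line; the exact wording depends on the Kubernetes version:

W0101 00:00:00.000000  1 replica_set.go:xxx] 
Found orphan pod istio-ingressgateway-xxx with matching selector, 
but it has a different controller reference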

Verifying the best practices:

# 1. Look for orphan Pods (an empty second column means no controller)
kubectl get pods -n istio-gateways -l app=istio-ingressgateway \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[*].controller}{"\n"}{end}'

# 2. Look for warnings in the controller logs
#    (only possible where kube-controller-manager runs as a Pod you can read, e.g. kubeadm clusters)
kubectl logs -n kube-system -l component=kube-controller-manager | grep -i "orphan\|adopt"

# 3. Check that the selectors differ
kubectl get deployment,daemonset -n istio-gateways -o custom-columns=\
NAME:.metadata.name,\
SELECTOR:.spec.selector.matchLabels

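Safety guarantees in the code:

Kubernetes avoids catastrophic conflicts through these mechanisms:

ControllerRef check: a controller only acts on Pods whose controllerRef points at itself
UID verification: objects with the same name but different UIDs are never confused
Garbage collection: deleting an owner cleans up its dependents; with foreground propagation the owner is removed only after its dependents are gone

// The key guard
if controllerRef.UID != m.Controller.GetUID() {
    return false, nil  // not ours, leave it alone
}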

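Test results:

In a test environment, create a Deployment and a DaemonSet with the same selector:

# Create the Deployment (3 replicas)
kubectl apply -f deployment.yaml
# → 3 Pods are created, controllerRef points at the ReplicaSet

# Create the DaemonSet (same selector)
kubectl apply -f daemonset.yaml
# → 1 Pod per node is created, controllerRef points at the DaemonSet

# Result:
# - the Deployment manages 3 Pods
# - the DaemonSet manages N Pods (N = number of nodes)
# - they do not interfere (different controllerRef)

Distinct selectors are still recommended:

There is no catastrophic conflict, but identical selectors cause other problems:

Operational confusion: kubectl get pods -l app=xxx returns both kinds of Pod
Monitoring confusion: Prometheus queries may mix the two
Service selection: a Service with the same selector picks up both kinds of Pod
Harder debugging: the two are difficult to tell apart when something goes wrong

Recommended selector design:

# Shared label (used by Istio)
labels:
  istio: ingressgateway

# Distinguishing label (used by the Kubernetes Workload selector)
labels:
  version: v1  # or workload: deployment / workload: daemonset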

Exact Behavior After Changing a Pod Label
#

When a Pod's labels are changed so that it no longer matches its controller's selector, the controller releases the Pod.

Scenario: changing a Pod's app label

# Initial state
# Pod: app=istio-ingressgateway, istio=ingressgateway
# ReplicaSet selector: app=istio-ingressgateway, istio=ingressgateway

# Change the label
kubectl label pod istio-ingressgateway-xxx app=istio-ingressgateway-orphan --overwrite

What happens?

sequenceDiagram
    participant User as User
    participant API as API Server
    participant RS as ReplicaSet Controller
    participant Pod as Pod
    
    User->>API: kubectl label pod ... app=xxx-orphan
    API->>API: Update Pod labels
    API->>RS: Watch event: Pod updated
    
    rect rgb(255, 230, 200)
        Note over RS: syncReplicaSet runs
        RS->>RS: List candidate Pods
        RS->>RS: Pod no longer matches the selector
        RS->>RS: ClaimPods: check controllerRef
        RS->>RS: Pod has our controllerRef but does not match
        RS->>API: Release: remove the Pod's ownerReference
    end
    
    rect rgb(200, 230, 200)
        Note over RS: Replica count is short
        RS->>RS: Current Pods < desired replicas
        RS->>API: Create a new Pod
    end
    
    Note over Pod: Pod becomes an orphan
keeps running, managed by no one

Source analysis: the release logic in ClaimPods (simplified)

// pkg/controller/controller_ref_manager.go

func (m *PodControllerRefManager) ClaimPods(pods []*v1.Pod) ([]*v1.Pod, error) {
    var claimed []*v1.Pod
    var errlist []error

    match := func(obj metav1.Object) bool {
        return m.Selector.Matches(labels.Set(obj.GetLabels()))
    }

    for _, pod := range pods {
        ok, err := m.ClaimObject(pod, match, ...)
        if err != nil {
            errlist = append(errlist, err)
            continue
        }
        if ok {
            claimed = append(claimed, pod)
        }
    }
    return claimed, utilerrors.NewAggregate(errlist)
}

func (m *BaseControllerRefManager) ClaimObject(obj metav1.Object, match func(metav1.Object) bool, ...) (bool, error) {
    controllerRef := metav1.GetControllerOf(obj)
    
    if controllerRef != nil {
        if controllerRef.UID != m.Controller.GetUID() {
            // Not our Pod, skip it
            return false, nil
        }
        
        // Our Pod: check whether it still matches the selector
        if match(obj) {
            // Still matches, keep it
            return true, nil
        }
        
        // Key case: the Pod has our controllerRef but no longer matches
        // Release it by removing the ownerReference
        if m.ReleaseFunc != nil {
            // Call the release function
            if err := m.ReleaseFunc(obj); err != nil {
                return false, err
            }
        }
        return false, nil
    }
    
    // Case without a controllerRef...
}

What the release does (simplified):

// pkg/controller/replicaset/replica_set.go

func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
    // ...
    
    cm := controller.NewPodControllerRefManager(
        rsc.podControl,
        rs,
        selector,
        controllerKind,
        canAdoptFunc, // rechecks that the ReplicaSet still exists before adopting
    )
    
    // ClaimPods releases Pods that no longer match
    filteredPods, err := cm.ClaimPods(pods)
    
    // filteredPods does not contain released Pods,
    // so the controller sees too few replicas and creates a new Pod
}

// pkg/controller/controller_ref_manager.go
// The release patches the Pod; it does not delete it
func (m *PodControllerRefManager) ReleasePod(pod *v1.Pod) error {
    // Build a patch that removes the ownerReference whose UID
    // equals m.Controller.GetUID() (helper elided)
    patchBytes, err := ...
    if err != nil {
        return err
    }
    return m.podControl.PatchPod(pod.Namespace, pod.Name, patchBytes)
}

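Exact sequence of events:

# 1. Before the label change
kubectl get pod istio-ingressgateway-xxx -o yaml
# ownerReferences:
# - apiVersion: apps/v1
#   kind: ReplicaSet
#   name: istio-ingressgateway-xxx
#   controller: true      # marked as the controller

# 2. Change the label
kubectl label pod istio-ingressgateway-xxx app=orphan --overwrite

# 3. After the controller has reacted
kubectl get pod istio-ingressgateway-xxx -o yaml
# the ReplicaSet entry is gone from ownerReferences

# 4. Meanwhile a replacement Pod has been created
#    (the orphan no longer carries app=istio-ingressgateway, so select both values)
kubectl get pods -l 'app in (istio-ingressgateway, orphan)'
# NAME                              READY   STATUS    
# istio-ingressgateway-xxx          1/1     Running   # original Pod, now an orphan
# istio-ingressgateway-yyy          1/1     Running   # newly created Pod

State of the orphan Pod:

Property	Change
Labels	Modified (by the user)
ownerReferences	Cleared (by the controller)
Running state	Keeps running
Managed by	No one
Deleted when	Manually, or on node eviction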

Verifying this with a test:

#!/bin/bash
# test-orphan-pod.sh

NAMESPACE="istio-gateways"
POD_NAME=$(kubectl get pods -n "$NAMESPACE" -l app=istio-ingressgateway -o jsonpath='{.items[0].metadata.name}')

echo "=== Initial state ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.metadata.ownerReferences}'
echo ""

echo "=== Change the label ==="
kubectl label pod "$POD_NAME" -n "$NAMESPACE" app=orphan --overwrite

echo "=== Wait for the controller ==="
sleep 5

echo "=== Check ownerReferences ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.metadata.ownerReferences}'
echo ""
# Expected: [] or empty

echo "=== Check for a new Pod ==="
kubectl get pods -n "$NAMESPACE" -l app=istio-ingressgateway
# Expected: a replacement Pod has been created

echo "=== The orphan Pod is still running ==="
kubectl get pod "$POD_NAME" -n "$NAMESPACE"
# Expected: Running

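A practical use of this mechanism: isolating a Pod

The same mechanism lets you take a Pod out of rotation by hand:

# 1. Detach the problem Pod from its ReplicaSet
kubectl label pod problem-pod-xxx app=isolated --overwrite

# 2. The controller then:
#    - removes the Pod's ownerReference
#    - creates a new Pod to restore the replica count

# 3. The problem Pod keeps running and can be used for debugging
kubectl exec -it problem-pod-xxx -- /bin/bash

# 4. Delete it by hand when debugging is done
kubectl delete pod problem-pod-xxx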

flowchart LR
    subgraph Before["Before the change"]
        RS1[ReplicaSet
replicas: 3]
        P1[Pod 1]
        P2[Pod 2]
        P3[Pod 3]
        RS1 --> P1
        RS1 --> P2
        RS1 --> P3
    end
    
    subgraph After["After relabeling Pod 3"]
        RS2[ReplicaSet
replicas: 3]
        P4[Pod 1]
        P5[Pod 2]
        P6[Pod 4
newly created]
        P7[Pod 3
orphan, still running]
        RS2 --> P4
        RS2 --> P5
        RS2 --> P6
        P7 -.->|no relation| RS2
    end
    
    Before --> After

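A DaemonSet behaves slightly differently:

The DaemonSet controller tries to create a replacement Pod on the same node:

# Relabel a DaemonSet Pod
kubectl label pod ds-pod-xxx app=orphan --overwrite

# Result:
# - the original Pod becomes an orphan
# - the DaemonSet sees no matching Pod on that node
# - it creates a new Pod on the same node
# - with hostNetwork the new Pod may fail because the ports are still taken

Port conflict scenario:

# DaemonSet + hostNetwork
# The original Pod still holds port 80
# The new Pod needs port 80 as well, so it either stays Pending (declared host
# ports are checked at scheduling time) or its container fails to bind

# Illustrative event; the exact message depends on where the conflict is detected
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Warning  Failed     1s    kubelet            Error: failed to start container
                                               "bind: address already in use"

Summary: exact behavior after a label change

Step	Actor	Action
1	User	Changes the Pod label
2	Controller	Finds that the Pod no longer matches the selector
3	Controller	Checks controllerRef and finds the Pod is its own
4	Controller	Releases the Pod by removing the ownerReference
5	Controller	Finds the replica count is short
6	Controller	Creates a new Pod
7	Original Pod	Keeps running, managed by no one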
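
The claim, release, and adopt rules walked through above reduce to a small decision table. The following is a shell model of that table, not the Kubernetes implementation: `claim_decision` and its string arguments are hypothetical names used only to make the branches easy to try out.

```shell
# claim_decision POD_CONTROLLER_UID MY_UID SELECTOR_MATCHES
#   POD_CONTROLLER_UID  UID in the Pod's controllerRef, or "" for an orphan
#   MY_UID              UID of the controller doing the sync
#   SELECTOR_MATCHES    "yes" if the Pod's labels match the selector, else "no"
claim_decision() {
    if [ -n "$1" ]; then
        if [ "$1" != "$2" ]; then
            echo "skip"       # owned by another controller
        elif [ "$3" = "yes" ]; then
            echo "keep"       # ours and still matching
        else
            echo "release"    # ours, but the labels no longer match
        fi
    elif [ "$3" = "yes" ]; then
        echo "adopt"          # orphan that matches the selector
    else
        echo "skip"           # orphan that does not match
    fi
}

claim_decision "uid-rs" "uid-rs" "yes"   # keep
claim_decision "uid-rs" "uid-rs" "no"    # release (the relabeled Pod)
claim_decision "uid-ds" "uid-rs" "yes"   # skip
claim_decision ""       "uid-rs" "yes"   # adopt
```

The second call is the relabeling case from this section: the Pod still points at the ReplicaSet but no longer matches, so it is released and a replacement is created.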

Option 1: Blue-Green Switch (Recommended)
#

Deploy the new Workload, shift traffic at the CLB, then retire the old Workload.

sequenceDiagram
    participant CLB
    participant Old as Old Gateway
(Deployment)
    participant New as New Gateway
(DaemonSet)
    
    rect rgb(200, 230, 200)
        Note over New: Phase 1: deploy the new Gateway
        New->>New: Deploy the DaemonSet
        New->>New: Wait for all Pods to be Ready
    end
    
    rect rgb(255, 245, 200)
        Note over CLB: Phase 2: add the new backends
        CLB->>CLB: Add the new nodes as backends
        CLB->>New: Health check passes
        Note over CLB: Old and new coexist and share traffic
    end
    
    rect rgb(255, 230, 200)
        Note over CLB,Old: Phase 3: remove the old backends
        CLB->>CLB: Set old node weight to 0
        CLB->>CLB: Wait for old connections to drain
        CLB->>CLB: Remove the old nodes
    end
    
    rect rgb(200, 200, 230)
        Note over Old: Phase 4: retire the old Gateway
        Old->>Old: Delete the Deployment
    end

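Steps:

# 1. Deploy the new DaemonSet (with a different Service name or port)
kubectl apply -f gateway-daemonset.yaml

# 2. Wait until every Pod is Ready
kubectl rollout status daemonset/istio-ingressgateway-ds -n istio-gateways

# 3. Add the new nodes (the ones running the DaemonSet) to the CLB
# Old and new Gateways now both receive traffic

# 4. Verify that the new Gateway works
curl -H "Host: test.example.com" http://<new-gateway-ip>/health

# 5. Set the old nodes' weight to 0 on the CLB
# Wait for existing connections to drain (watch the connection count on the old Pods)

# 6. Remove the old nodes from the CLB

# 7. Delete the old Deployment
kubectl delete deployment istio-ingressgateway -n istio-gateways

Key points:

Old and new Gateways coexist and share the same Istio configuration
Traffic is switched at the CLB, not inside Kubernetes
The switch is observable and can be rolled back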
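
Several of these steps are "wait until a condition holds" (Pods Ready, connections drained). A small polling helper with a timeout keeps such waits from hanging forever. This is a generic sketch: `wait_until` is a hypothetical helper, and the drain check in the usage note is a placeholder you would have to supply.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds.
# Exits 0 on success, 1 once TIMEOUT_SECONDS have elapsed without success.
wait_until() (
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            exit 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    exit 0
)

# A condition that already holds returns immediately
wait_until 10 1 true && echo "condition met"
```

For step 5 you might call `wait_until 300 5 old_gateway_drained`, where `old_gateway_drained` is a function of your own that succeeds once the old Pods report zero active connections.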

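Option 2: Gradual Switch
#

Use CLB weights to move traffic from the old Gateway to the new one step by step.

flowchart LR
    subgraph CLB
        W1[Old Gateway
weight 100%]
        W2[New Gateway
weight 0%]
    end
    
    subgraph Rollout steps
        S1[100:0] --> S2[90:10]
        S2 --> S3[50:50]
        S3 --> S4[10:90]
        S4 --> S5[0:100]
    end

Script:

#!/bin/bash
# Gradual traffic shift script

STEP=10
INTERVAL=60        # seconds to wait at each step
MAX_ERROR_RATE=1   # percent, integer

# Placeholder: call your cloud provider's API to set the backend weights
# e.g. aliyun slb SetBackendServers ...
set_weights() {
    echo "Setting weights: old=$1, new=$2"
}

# Placeholder: print the current error rate as an integer percentage,
# for example from your monitoring system
get_error_rate() {
    echo 0
}

NEW_WEIGHT=0
while [ "$NEW_WEIGHT" -le 100 ]; do
    OLD_WEIGHT=$((100 - NEW_WEIGHT))
    set_weights "$OLD_WEIGHT" "$NEW_WEIGHT"

    # Let the new weights take effect before judging them
    sleep "$INTERVAL"

    ERROR_RATE=$(get_error_rate)
    if [ "$ERROR_RATE" -gt "$MAX_ERROR_RATE" ]; then
        echo "Error rate too high, rolling back!"
        set_weights 100 0
        exit 1
    fi

    NEW_WEIGHT=$((NEW_WEIGHT + STEP))
done

echo "Migration complete!"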
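
The script compares the error rate with `-gt`, which only works on integers. Monitoring systems usually report fractional rates such as 0.35, so a float-safe comparison is worth having. A minimal sketch using awk; `rate_exceeds` is a hypothetical helper name, and the threshold values are examples only.

```shell
# rate_exceeds RATE THRESHOLD
# Exits 0 when RATE is strictly greater than THRESHOLD, 1 otherwise.
# Both values may be fractional.
rate_exceeds() {
    awk -v r="$1" -v t="$2" 'BEGIN { exit !(r + 0 > t + 0) }'
}

if rate_exceeds 0.35 1; then
    echo "roll back"
else
    echo "continue"      # 0.35 is below the 1 percent threshold
fi
```

In the rollout loop this would replace the integer test, for example `if rate_exceeds "$ERROR_RATE" 1; then ...`.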

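Option 3: In-Place Replacement (Risky)
#

Delete the old Workload and create the new one in its place. A Workload's kind cannot be changed in place, so there is no rolling update to fall back on.

# Not recommended! Traffic may be interrupted
kubectl delete deployment istio-ingressgateway
kubectl apply -f gateway-daemonset.yaml

Risks:

Deleting the Deployment starts terminating all of its Pods at once
The new DaemonSet needs time to start
Service is interrupted in between

If an in-place replacement is unavoidable:

# 1. Scale the Deployment up first so that every node has a Pod
#    (this needs pod anti-affinity or topology spread to actually land one Pod per node)
kubectl scale deployment istio-ingressgateway --replicas=<node-count>

# 2. Deploy the DaemonSet (do not delete the Deployment yet)
kubectl apply -f gateway-daemonset.yaml

# 3. Wait for the DaemonSet to be Ready
kubectl rollout status daemonset/istio-ingressgateway-ds

# 4. With HostNetwork there will be port conflicts at this point
# Old and new must use different ports, or the DaemonSet must use a nodeSelector to pick new nodes

# 5. After switching the CLB, delete the old Deployment
kubectl delete deployment istio-ingressgateway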

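Handling Port Conflicts
#

If both old and new use HostNetwork, they conflict on any node they share. Options:

Option A: use different nodes

# The new DaemonSet runs only on new nodes
nodeSelector:
  gateway-version: v2

# The old Deployment's Pods are kept on the old nodes
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gateway-version
          operator: NotIn
          values: ["v2"]

Option B: use different ports

# The new Gateway uses 8080/8443
# (the Istio Gateway resource must also declare servers on these ports)
ports:
- containerPort: 8080
  hostPort: 8080
- containerPort: 8443
  hostPort: 8443

# Point the CLB at the new ports first, then switch

Option C: temporarily use NodePort

# During the switch, the old Gateway temporarily moves to NodePort
# After the switch, the new Gateway uses HostNetwork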
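
Before picking one of these options it helps to confirm whether the two Workloads actually collide. The sketch below compares two lists of host ports and prints the ones that appear in both; `port_conflicts` is a hypothetical helper and the port lists are taken from the examples in this article.

```shell
# port_conflicts "PORTS_A" "PORTS_B"
# Each argument is a space-separated list of host ports.
# Prints every port that appears in both lists, one per line.
port_conflicts() (
    for a in $1; do
        for b in $2; do
            if [ "$a" = "$b" ]; then
                echo "$a"
            fi
        done
    done
    exit 0
)

port_conflicts "80 443" "80 443"       # prints 80 and 443: both Workloads want the same ports
port_conflicts "80 443" "8080 8443"    # prints nothing: Option B avoids the conflict
```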

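Switch Checklist
#

Before the switch:

New Workload configuration has been verified
CLB backends can be changed dynamically
Monitoring and alerting are in place
A rollback plan is ready

During the switch:

All Pods of the new Gateway are Ready
The new Gateway passes health checks
Traffic is shifted gradually
Error rates are watched continuously

After the switch:

Connections on the old Gateway have drained
The old Gateway has been retired
Monitoring metrics look normal
Documentation has been updated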

Switch Timeline
#

gantt
    title Workload switch timeline
    dateFormat HH:mm
    axisFormat %H:%M
    
    section Preparation
    Deploy new DaemonSet       :a1, 00:00, 10m
    Wait for Pods Ready        :a2, after a1, 5m
    Verify new Gateway         :a3, after a2, 5m
    
    section Switch
    Add new CLB backends       :b1, after a3, 2m
    Shift 10%                  :b2, after b1, 10m
    Shift 50%                  :b3, after b2, 10m
    Shift 100%                 :b4, after b3, 10m
    
    section Cleanup
    Remove old CLB backends    :c1, after b4, 5m
    Wait for connections to drain :c2, after c1, 5m
    Delete old Deployment      :c3, after c2, 2m

The whole switch takes about one hour and should be transparent to the business.
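
The one-hour figure is simply the sum of the sequential phases in the timeline. A quick check of that arithmetic, using the durations from the chart (in minutes):

```shell
# Durations in minutes, in timeline order:
# preparation 10 5 5, switch 2 10 10 10, cleanup 5 5 2
total=0
for minutes in 10 5 5 2 10 10 10 5 5 2; do
    total=$((total + minutes))
done
echo "$total minutes"    # 64 minutes
```

Adjusting any phase (for example a longer observation window at 50%) changes the total directly, since no phases overlap.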

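Istio Gateway Configuration Delivery in Detail
#

An Istio Gateway is more than an Envoy Pod: it has to fetch its routing configuration from istiod. If the Pod is Ready before the configuration has fully arrived, requests get 404 or 503.

xDS Configuration Delivery Flow
#

sequenceDiagram
    participant Pod as Gateway Pod
    participant Envoy as Envoy Proxy
    participant Istiod as istiod
    participant K8s as Kubernetes API
    
    Pod->>Pod: Pod starts
    Envoy->>Istiod: Open xDS connection
    Istiod->>K8s: Read Gateway/VirtualService
    Istiod->>Istiod: Generate Envoy configuration
    Istiod->>Envoy: Push LDS/RDS/CDS/EDS
    Envoy->>Envoy: Apply configuration
    
    Note over Envoy: Only now can requests be routed correctly

Resources involved in configuration delivery:

xDS type	Source resources	Purpose
LDS (Listener)	Gateway	Listening ports, TLS settings
RDS (Route)	VirtualService	Routing rules
CDS (Cluster)	Service, DestinationRule	Upstream clusters and their traffic policy
EDS (Endpoint)	Service/Endpoints	Backend addresses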

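Problem: Limits of the ReadinessProbe
#

The default readinessProbe checks that the proxy is up, not that your routes are present:

readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 15021

What a 200 from /healthz/ready means:

The Envoy process has started
The proxy has connected to istiod (recent pilot-agent versions also wait for an initial CDS/LDS response)
It does not guarantee that every route you depend on has been delivered

flowchart TB
    subgraph Timeline
        A[Pod starts] --> B[Envoy starts]
        B --> C[Connect to istiod]
        C --> D["healthz/ready 200"]
        D --> E[Configuration being pushed...]
        E --> F[Configuration complete]
    end
    
    subgraph Problem window
        G[Traffic arrives] --> H{Configuration complete?}
        H -->|no| I["404/503"]
        H -->|yes| J[Routed normally]
    end
    
    D -.-> G
    
    style D fill:#ffcccc
    style I fill:#ffcccc

Problem scenario:

A new Pod starts and its readinessProbe passes
The Pod is added to the Endpoints and the CLB starts forwarding traffic
The VirtualService configuration has not been fully delivered yet
Requests return 404 (no route) or 503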

Solution 1: Check That Configuration Has Arrived
#

Have the readinessProbe check that key configuration has been delivered. This assumes curl is available in the proxy image, which is not the case for distroless variants:

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      # Basic Envoy readiness
      curl -sf http://localhost:15021/healthz/ready || exit 1
      
      # Check that a key route has been delivered
      # by looking for it in config_dump
      curl -sf 'http://localhost:15000/config_dump?resource=dynamic_route_configs' \
        | grep -q "your-virtualservice-name" || exit 1
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 30

A more precise check script:

#!/bin/bash
# /usr/local/bin/ready-check.sh
# Assumes curl and jq are available in the container image.
# The thresholds below are examples; tune them for your environment.

# 1. Basic readiness
curl -sf http://localhost:15021/healthz/ready || exit 1

# 2. Listener configuration is present (LDS)
LISTENERS=$(curl -sf 'http://localhost:15000/listeners?format=json' | jq '.listener_statuses | length')
if [ "${LISTENERS:-0}" -lt 2 ]; then
    echo "Listeners not ready: $LISTENERS"
    exit 1
fi

# 3. Cluster configuration is present (CDS); count distinct cluster names
CLUSTERS=$(curl -sf http://localhost:15000/clusters | awk -F'::' '{print $1}' | sort -u | grep -c '')
if [ "$CLUSTERS" -lt 5 ]; then
    echo "Clusters not ready: $CLUSTERS"
    exit 1
fi

# 4. Check that a key route exists (optional)
# curl -sf 'http://localhost:15000/config_dump?resource=dynamic_route_configs' \
#   | jq '.configs[].dynamic_route_configs[].route_config.virtual_hosts[].routes[].match.prefix' \
#   | grep -q "/api" || exit 1

echo "Ready!"
exit 0
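
Counting clusters from the Envoy admin `/clusters` endpoint is easy to get wrong, because the text output has many lines per cluster. The sketch below counts distinct cluster names instead of lines. The sample input is an assumption about the text format (`name::field::...`); check it against the output of your own proxy. `count_clusters` is a hypothetical helper.

```shell
# count_clusters
# Reads Envoy admin /clusters text output on stdin and prints
# the number of distinct cluster names (the part before the first "::").
count_clusters() {
    awk -F'::' 'NF > 1 { seen[$1] = 1 }
                END { n = 0; for (c in seen) n++; print n }'
}

# Assumed sample: two clusters spread over four lines
sample='outbound|80||httpbin.default.svc.cluster.local::default_priority::max_connections::1024
outbound|80||httpbin.default.svc.cluster.local::10.0.0.12:80::cx_active::0
xds-grpc::default_priority::max_connections::1024
xds-grpc::10.0.0.5:15012::cx_active::1'

printf '%s\n' "$sample" | count_clusters    # 2
```

A line-based count would report 4 here, which is why a threshold expressed in clusters should be compared against the distinct-name count.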

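Solution 2: Use a startupProbe (Recommended)
#

A startupProbe is designed for slow-starting containers and fits better than a readinessProbe with a long initialDelaySeconds.

Responsibilities of the three probes:

Probe	Responsibility	On failure
startupProbe	Checks whether startup has finished	Container is restarted
readinessProbe	Checks whether the Pod can receive traffic	Pod is removed from Endpoints
livenessProbe	Checks whether the container is alive	Container is restarted

Execution order:

flowchart LR
    A[Pod starts] --> B[startupProbe]
    B -->|success| C[readinessProbe
livenessProbe]
    B -->|failure| D[Restart container]
    
    Note1[Other probes do not run
until startupProbe succeeds]

Example configuration:

spec:
  containers:
  - name: istio-proxy
    # startupProbe: wait until configuration has been delivered
    startupProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - |
          # Envoy is ready
          curl -sf http://localhost:15021/healthz/ready || exit 1
          # Configuration has arrived (at least 2 listeners; tune the threshold for your setup)
          LISTENERS=$(curl -sf http://localhost:15000/listeners 2>/dev/null | grep -c '')
          [ "$LISTENERS" -ge 2 ] || exit 1
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 60    # waits up to roughly 5*60=300s after the initial delay
      timeoutSeconds: 3
    
    # readinessProbe: runtime check (after configuration has been delivered)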
    readinessProbe:
      httpGet:
        path: /healthz/ready
        port: 15021
      periodSeconds: 2
      failureThreshold: 3
    
    # livenessProbe: liveness check
    livenessProbe:
      httpGet:
        path: /healthz/ready
        port: 15021
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3

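Why is a startupProbe the better fit?

Approach	Problem
readinessProbe + initialDelaySeconds	Always waits a fixed time, even when startup is fast
readinessProbe + complex check	The complex check keeps running frequently at runtime
startupProbe	Runs only during startup and stops once it has succeeded

Advantages of the startupProbe:

At startup: it runs the configuration check and can be given a very long budget
At runtime: the readinessProbe stays simple and cheap
Separation of concerns: slow startup and runtime health are different problems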
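
How long a startupProbe will wait before the container is restarted follows from three fields. As an approximation (ignoring per-probe timeouts), the budget is `initialDelaySeconds + periodSeconds * failureThreshold`. A minimal sketch, with `probe_budget` as a hypothetical helper name:

```shell
# probe_budget INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Prints the approximate number of seconds a probe keeps retrying
# before Kubernetes gives up.
probe_budget() {
    echo $(( $1 + $2 * $3 ))
}

probe_budget 5 5 60    # 305: the startupProbe settings used in this article
probe_budget 0 2 3     # 6: a readinessProbe with period 2s and threshold 3
```

This makes the trade-off visible: a large budget on the startupProbe costs nothing at runtime, while the readinessProbe can stay tight so that an unhealthy Pod leaves the Endpoints within seconds.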

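Solution 3: holdApplicationUntilProxyStarts
#

holdApplicationUntilProxyStarts (available since Istio 1.7) delays the application container until the sidecar proxy is ready. A Gateway Pod has no application container, so it is of limited use here.

# meshConfig
defaultConfig:
  holdApplicationUntilProxyStarts: true

Solution 4: Warm-up Traffic
#

After a new Pod starts, send warm-up requests first. Requests do not trigger configuration loading (istiod pushes configuration independently), but they confirm that routes answer and they warm upstream connections:

lifecycle:
  postStart:
    exec:
      command:
      - /bin/sh
      - -c
      - |
        # Wait for Envoy to start
        sleep 5
        
        # Send warm-up requests to exercise the route (failures are ignored)
        for i in $(seq 1 10); do
            curl -sf http://localhost:8080/healthz -H "Host: warmup.local" || true
            sleep 1
        done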

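Configuration Consistency During Rolling Updates
#

During a rolling update, old and new Pods may receive different configuration versions:

flowchart TB
    subgraph istiod
        CFG1[Config version v1]
        CFG2[Config version v2]
    end
    
    subgraph Gateway Pods
        OLD[Old Pod
config v1]
        NEW[New Pod
config v2?]
    end
    
    istiod --> OLD
    istiod --> NEW
    
    Note1[If configuration is changing
old and new Pods may differ]

Best practices:

Avoid changing Gateway/VirtualService resources during a rolling update
If a change is unavoidable, update the configuration first, wait until every Pod is in sync, then start the rolling update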

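Monitoring Configuration Delivery
#

Checking the sync status:

# Show the configuration sync status of every proxy, including Gateway Pods
istioctl proxy-status

# Example output
NAME                              CDS        LDS        EDS        RDS        ISTIOD
gateway-xxx-pod.istio-system      SYNCED     SYNCED     SYNCED     SYNCED     istiod-xxx

# SYNCED = in sync
# STALE = configuration is out of date
# NOT SENT = nothing has been sent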
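
With many gateway Pods it is easier to list only the proxies that are not in sync. The sketch below filters `istioctl proxy-status` text for the STALE and NOT SENT states described above. It assumes the first column is the proxy name, as in the sample output; column layout varies between Istio versions, and `unsynced_proxies` is a hypothetical helper.

```shell
# unsynced_proxies
# Reads `istioctl proxy-status` text on stdin and prints the name of every
# proxy that has at least one STALE or NOT SENT column.
unsynced_proxies() {
    awk 'NR > 1 && /STALE|NOT SENT/ { print $1 }'
}

# Sample input modelled on the output format shown above
sample='NAME                              CDS        LDS        EDS        RDS        ISTIOD
gateway-aaa.istio-system          SYNCED     SYNCED     SYNCED     SYNCED     istiod-xxx
gateway-bbb.istio-system          SYNCED     STALE      SYNCED     SYNCED     istiod-xxx
gateway-ccc.istio-system          SYNCED     SYNCED     SYNCED     NOT SENT   istiod-xxx'

printf '%s\n' "$sample" | unsynced_proxies
# gateway-bbb.istio-system
# gateway-ccc.istio-system
```

Against a live cluster the input would come from `istioctl proxy-status | unsynced_proxies`; an empty result means every listed proxy is in sync.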

Inspecting the actual configuration:

# Show the route configuration of a Gateway Pod
istioctl proxy-config routes <pod-name> -n istio-system

# Show the listener configuration
istioctl proxy-config listeners <pod-name> -n istio-system

# Show the cluster configuration
istioctl proxy-config clusters <pod-name> -n istio-system

Complete Gateway Readiness Conditions
#

A Gateway Pod is truly ready only when all of the following hold:

Condition	How to check
Envoy process started	/healthz/ready
Connected to istiod	proxy-status shows SYNCED
LDS delivered	Number of listeners > 0
RDS delivered	Routes contain the expected rules
CDS delivered	Number of clusters > 0
EDS delivered	Endpoints are reachable

Recommended probe configuration for production:

spec:
  containers:
  - name: istio-proxy
    # startupProbe: wait for configuration delivery at startup
    startupProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - |
          curl -sf http://localhost:15021/healthz/ready || exit 1
          LISTENERS=$(curl -sf http://localhost:15000/listeners 2>/dev/null | grep -c '')
          [ "$LISTENERS" -ge 2 ] || exit 1
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 60  # waits up to about 5 minutes
      timeoutSeconds: 3
    
    # readinessProbe: simple runtime check (runs only after the startupProbe succeeds)
    readinessProbe:
      httpGet:
        path: /healthz/ready
        port: 15021
      periodSeconds: 2
      failureThreshold: 3
    
    # livenessProbe: liveness check
    livenessProbe:
      httpGet:
        path: /healthz/ready
        port: 15021
      periodSeconds: 10
      failureThreshold: 3

How the three probes work together:

flowchart TB
    subgraph Startup phase
        A[Pod starts] --> B[startupProbe]
        B -->|check config delivery| C{Config complete?}
        C -->|no| D[Keep waiting]
        D --> B
        C -->|yes| E[startupProbe succeeds]
    end
    
    subgraph Running phase
        E --> F[readinessProbe starts]
        E --> G[livenessProbe starts]
        F -->|simple check| H[Added to Endpoints]
        G -->|periodic check| I[Kept alive]
    end

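Factors That Affect Delivery Latency
#

Factor	Effect	Mitigation
Number of Gateways/VirtualServices	More configuration means slower delivery	Split configuration sensibly
istiod load	Delivery slows down under high load	Scale out istiod
Network latency	Higher across availability zones	Deploy istiod close to the gateways
Envoy configuration size	Larger configuration takes longer to apply	Trim the configuration

Suggestions for large clusters (the values shown are examples, not tuned recommendations):

# Proxy settings
meshConfig:
  defaultConfig:
    # Envoy worker threads per proxy
    concurrency: 4

# istiod settings for Gateway configuration delivery
pilot:
  env:
    # Send a gateway only the clusters referenced by its VirtualServices
    PILOT_FILTER_GATEWAY_CLUSTER_CONFIG: "true"
    # Debounce window: merge bursts of changes into fewer pushes
    PILOT_DEBOUNCE_AFTER: "100ms"
    PILOT_DEBOUNCE_MAX: "10s"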

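Summary and Recommendations
#

Based on the analysis above, these are the scenarios each deployment mode fits:

Scenario	Recommended mode	Reason
Small to medium scale, simple operations first	Deployment + Service	Standard pattern, easy to understand and maintain
High traffic, low latency	DaemonSet + HostNetwork	Best performance, dedicated nodes
Source IP must be preserved	Deployment + externalTrafficPolicy: Local	Balances flexibility with source IP preservation
Elastic scaling, cost sensitive	Deployment + HPA	Scales on demand, high resource utilization

Key conclusions:

Performance first: choose HostNetwork to remove the extra network hop
Operations first: choose Deployment + Service for standardized management
Reliability: whichever mode you pick, configure graceful termination, a PDB, and a rolling update strategy
Zero traffic loss: achieved by combining a Finalizer, a controller, and preStop
Dedicated nodes: for high-traffic scenarios, set aside a dedicated gateway node pool

There is no perfect solution, only a suitable one. Choose according to your traffic volume, team capability, and cost budget.

For the complete production configuration, see: Istio Gateway 生产部署最佳实践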

