Istio Timeouts and Retries Explained

Istio in Practice - this article is part of a series.

§ 16: This article

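In a microservice architecture, timeouts and retries are key mechanisms for keeping the system stable. Istio offers a rich set of timeout and retry settings, but its default behavior and the way those settings are configured are often confusing. This article analyzes how timeouts and retries work in Istio for both gateway and in-mesh traffic.

Core concepts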
#

flowchart LR
    subgraph timeouts [Timeout types]
        A[Connect timeout] --> B[Request timeout]
        B --> C[Idle timeout]
    end
    
    subgraph retries [Retry types]
        D[Connection retry] --> E[Request retry]
    end

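Concept	Description	Default
Connect timeout	Time allowed to establish a TCP connection	10s
Request timeout	Time allowed for the whole request, including retries	Disabled (0s) for both mesh and gateway routes
Idle timeout	How long a connection may stay idle before it is closed	1h
Retry attempts	Number of retries after a failure	2 (3 requests in total)
Retry conditions	Conditions that trigger a retry	connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes (503)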

Istio default behavior
#

Gateway vs. mesh: the differences
#

                         ┌─────────────┐
    External request ───►│   Gateway   │ timeout: 0s (disabled)
                         │   (Envoy)   │ retries: none
                         └──────┬──────┘
                                │
                         ┌──────▼──────┐
                         │  Sidecar A  │ timeout: 0s (disabled)
                         │   (Envoy)   │ retries: 2
                         └──────┬──────┘
                                │
                         ┌──────▼──────┐
                         │  Sidecar B  │
                         │   (Envoy)   │
                         └─────────────┘

Defaults in detail
#

1. Request timeout
#

Istio default: disabled (0s)

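# Route configuration that Istio generates by default
route:
  timeout: 0s  # 0 disables the timeout; Envoy waits indefinitely for the upstream response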

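Note: Istio disables the route timeout by default (Envoy's own default, when the field is left unset, is 15s). In production, requests are still bounded by other limits:

Server-side timeouts in the upstream application itself
Timeout settings on cloud load balancers
The client's own timeout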

2. Retries
#

Default inside the mesh: 2 retries

# Default retry policy that Istio generates (the exact retry_on list varies by Istio version)
retry_policy:
  num_retries: 2
  retry_on: "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes"
  retriable_status_codes: [503]

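Gateway default: no retries

The ingress gateway does not enable retries by default, because:

External requests may be non-idempotent
Retries can cause duplicate operations
Retry decisions belong to the business layer

Default policies have changed between Istio releases, so confirm what your gateway actually received with istioctl proxy-config routes.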

3. Connection timeout
#

# Configured in a DestinationRule
trafficPolicy:
  connectionPool:
    tcp:
      connectTimeout: 10s  # default: 10s

4. Idle timeout
#

# Idle timeout for HTTP connections
trafficPolicy:
  connectionPool:
    http:
      idleTimeout: 1h  # default: 1 hour

Configuring timeouts and retries in a VirtualService
#

Basic configuration
#

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
  namespace: production
spec:
  hosts:
  - my-service.production.svc.cluster.local
  http:
  - route:
    - destination:
        host: my-service.production.svc.cluster.local
    
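    # Timeout
    timeout: 30s
    
    # Retries
    retries:
      attempts: 3          # number of retries (not counting the initial request)
      perTryTimeout: 10s   # timeout for each try
      retryOn: "5xx,reset,connect-failure,retriable-4xx"
      retryRemoteLocalities: true  # allow retries to endpoints in other localities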

Timeout settings in detail
#

timeout vs perTryTimeout
#

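Total timeout (timeout): 30s
├── Request 1: perTryTimeout 10s
├── Request 2: perTryTimeout 10s  (retry 1)
└── Request 3: perTryTimeout 10s  (retry 2)

Key rules:

timeout is the total budget for the whole request, including every retry
perTryTimeout is the limit for a single try
If perTryTimeout * (attempts + 1) > timeout, fewer retries run than configured. In the basic configuration above, attempts: 3 would allow a fourth request, but three tries of 10s each already use up the 30s budget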

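# Example: a misconfiguration that defeats the retry policy
retries:
  attempts: 5
  perTryTimeout: 10s
timeout: 20s  # the total budget is only 20s, so at most 2 tries can run to their per-try limit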
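
One way to repair the misconfigured example above is to size the total timeout from the per-try value. The fragment below is a minimal sketch with illustrative numbers that are not from the original article: three tries of 6s each fit inside a 20s budget and leave a little room for Envoy's back-off between retries.

```yaml
# Fragment of an HTTP route in a VirtualService (illustrative values).
# 3 tries x 6s = 18s, which fits inside the 20s total budget.
timeout: 20s
retries:
  attempts: 2          # 2 retries + 1 initial request = 3 tries
  perTryTimeout: 6s
  retryOn: "connect-failure,refused-stream,5xx"
```

The exact numbers matter less than the relationship: every configured try should be able to run to its per-try limit before the total timeout expires.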

retryOn conditions
#

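Condition	Description	Typical use
`5xx`	Upstream returned a 5xx response, or did not respond at all (disconnect, reset, read timeout)	General
`gateway-error`	502, 503, or 504 responses	Gateway scenarios
`reset`	The upstream connection was reset	Network problems
`connect-failure`	The connection could not be established	Network problems
`retriable-4xx`	Retriable 4xx responses (currently only 409)	Conflict retries
`refused-stream`	The upstream refused the HTTP/2 stream	HTTP/2, gRPC
`cancelled`	gRPC status CANCELLED	gRPC
`deadline-exceeded`	gRPC status DEADLINE_EXCEEDED	gRPC
`resource-exhausted`	gRPC status RESOURCE_EXHAUSTED	gRPC
`unavailable`	gRPC status UNAVAILABLE	gRPC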
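
Conditions can be combined in one comma-separated list. The fragment below is a minimal sketch with illustrative values (not from the original article). It relies on the behavior described in the Istio API reference, where a valid HTTP status code listed in retryOn is added to Envoy's retriable status codes.

```yaml
# Fragment of an HTTP route in a VirtualService (illustrative values).
retries:
  attempts: 2
  perTryTimeout: 2s
  # Two named conditions plus a literal status code:
  # 503 responses are retried in addition to connection-level failures.
  retryOn: "connect-failure,refused-stream,503"
```

Listing a single status code like this is narrower than `5xx` or `gateway-error`, which can help when only one error class is known to be safe to retry.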

Complete example
#

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
  namespace: production
spec:
  hosts:
  - payment-service.production.svc.cluster.local
  http:
  # Idempotent endpoint: query orders
  - match:
    - uri:
        prefix: "/api/orders/"
      method:
        exact: "GET"
    route:
    - destination:
        host: payment-service.production.svc.cluster.local
    timeout: 10s
    retries:
      attempts: 3
      perTryTimeout: 3s
      retryOn: "5xx,reset,connect-failure"
  
  # Non-idempotent endpoint: create an order (retry with care)
  - match:
    - uri:
        prefix: "/api/orders"
      method:
        exact: "POST"
    route:
    - destination:
        host: payment-service.production.svc.cluster.local
    timeout: 30s
    retries:
      attempts: 1  # retry only once
      perTryTimeout: 15s
      retryOn: "connect-failure,refused-stream"  # retry only on connection-level failures
  
  # Default route
  - route:
    - destination:
        host: payment-service.production.svc.cluster.local
    timeout: 15s

DestinationRule connection pool settings
#

A DestinationRule controls connection-level timeouts and the connection pool:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
  namespace: production
spec:
  host: my-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
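        maxConnections: 100       # maximum number of connections
        connectTimeout: 10s       # TCP connect timeout
        tcpKeepalive:
          time: 7200s             # idle time before keepalive probes start
          interval: 75s           # interval between keepalive probes
      http:
        h2UpgradePolicy: UPGRADE  # HTTP/2 upgrade policy
        http1MaxPendingRequests: 100   # maximum requests queued while waiting for a connection
        http2MaxRequests: 1000         # maximum concurrent requests
        maxRequestsPerConnection: 100  # maximum requests per connection
        maxRetries: 3                  # maximum concurrent retries
        idleTimeout: 300s              # idle timeout
    
    # Outlier detection (circuit breaking)
    outlierDetection:
      consecutive5xxErrors: 5     # consecutive 5xx errors before ejection
      interval: 10s               # analysis interval
      baseEjectionTime: 30s       # base ejection time
      maxEjectionPercent: 50      # maximum percentage of hosts that can be ejected
      minHealthPercent: 30        # minimum healthy percentage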

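Connection pool parameters
#

TCP connection pool
#

Parameter	Description	Default	Recommendation
`maxConnections`	Maximum number of TCP connections to the destination	2^32-1	Set according to backend capacity
`connectTimeout`	TCP connection establishment timeout	10s	1-5s inside the same network, 5-10s across regions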

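HTTP connection pool
#

Parameter	Description	Default	Recommendation
`http1MaxPendingRequests`	Maximum number of requests queued while waiting for a connection	2^32-1	Set according to QPS
`http2MaxRequests`	Maximum number of concurrent requests to the destination (applies to HTTP/1.1 as well as HTTP/2, despite the name)	2^32-1	Set according to backend capacity
`maxRequestsPerConnection`	Maximum number of requests per connection	0 (unlimited)	Setting it forces connections to be rotated
`idleTimeout`	Idle connection timeout	1h	Align with the backend's keepalive setting
`maxRetries`	Maximum number of concurrent retries	2^32-1	Limit it to contain retry storms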

Gateway timeout configuration
#

Ingress gateway timeouts
#

The Gateway resource has no timeout setting of its own; timeouts are set on the VirtualService bound to it:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: frontend
  namespace: production
spec:
  hosts:
  - "www.example.com"
  gateways:
  - istio-system/istio-ingressgateway
  http:
  - match:
    - uri:
        prefix: "/api/"
    route:
    - destination:
        host: api-service.production.svc.cluster.local
    timeout: 60s  # API timeout: 60s
    retries:
      attempts: 2
      perTryTimeout: 20s
      retryOn: "gateway-error,connect-failure"
  
  - match:
    - uri:
        prefix: "/upload/"
    route:
    - destination:
        host: upload-service.production.svc.cluster.local
    timeout: 300s  # upload timeout: 5 minutes
    retries:
      attempts: 0  # no retries for uploads

Setting gateway-wide timeouts with an EnvoyFilter
#

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: gateway-timeout
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  # Patch the HTTP connection manager
  - applyTo: NETWORK_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          # Timeout for reading the request headers
          request_headers_timeout: 10s
          # Stream idle timeout
          stream_idle_timeout: 300s
          # Request timeout (0 disables it)
          request_timeout: 0s

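Common timeout parameters
#

Envoy parameter	Description	VirtualService equivalent
`route.timeout`	Route timeout	`timeout`
`route.retry_policy.per_try_timeout`	Timeout for each try	`retries.perTryTimeout`
`stream_idle_timeout`	Stream idle timeout	None; use an EnvoyFilter
`request_timeout`	Time allowed to receive the entire request (headers and body)	None
`request_headers_timeout`	Time allowed to read the request headers	None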

Common scenarios
#

Scenario 1: gRPC services
#

gRPC runs over HTTP/2 and needs its own configuration:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: grpc-service
spec:
  hosts:
  - grpc-service.production.svc.cluster.local
  http:
  - route:
    - destination:
        host: grpc-service.production.svc.cluster.local
    timeout: 30s
    retries:
      attempts: 3
      perTryTimeout: 10s
      # gRPC-specific retry conditions
      retryOn: "cancelled,deadline-exceeded,resource-exhausted,unavailable"
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-service
spec:
  host: grpc-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        http2MaxRequests: 1000

Scenario 2: long-lived connections (WebSocket)
#

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: websocket-service
spec:
  hosts:
  - ws.example.com
  gateways:
  - istio-system/istio-ingressgateway
  http:
  - match:
    - uri:
        prefix: "/ws"
      headers:
        upgrade:
          exact: "websocket"
    route:
    - destination:
        host: websocket-service.production.svc.cluster.local
    timeout: 0s  # no timeout for WebSocket
    retries:
      attempts: 0  # no retries for WebSocket
---
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: websocket-idle-timeout
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stream_idle_timeout: 3600s  # allow WebSocket streams to stay idle for 1 hour
          upgrade_configs:
          - upgrade_type: websocket

Scenario 3: file upload and download
#

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: file-service
spec:
  hosts:
  - file-service.production.svc.cluster.local
  http:
  # Upload endpoint
  - match:
    - uri:
        prefix: "/upload"
    route:
    - destination:
        host: file-service.production.svc.cluster.local
    timeout: 600s  # upload timeout: 10 minutes
    retries:
      attempts: 0  # no retries
  
  # Download endpoint
  - match:
    - uri:
        prefix: "/download"
    route:
    - destination:
        host: file-service.production.svc.cluster.local
    timeout: 300s  # download timeout: 5 minutes
    retries:
      attempts: 1
      perTryTimeout: 150s
      retryOn: "connect-failure,reset"

Scenario 4: cross-region services
#

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: cross-region-service
spec:
  host: remote-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
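        connectTimeout: 5s  # cross-region connect timeout; more generous than a typical in-region value
      http:
        idleTimeout: 60s    # shorter idle timeout to reduce stale connections
    
    # Outlier detection, which locality failover depends on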
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 5s
      baseEjectionTime: 30s
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: cross-region-service
spec:
  hosts:
  - remote-service.production.svc.cluster.local
  http:
  - route:
    - destination:
        host: remote-service.production.svc.cluster.local
    timeout: 30s
    retries:
      attempts: 2
      perTryTimeout: 10s
      retryOn: "5xx,connect-failure,reset"
      retryRemoteLocalities: true  # allow retries to other localities

How timeouts and retries relate to circuit breaking
#

flowchart TB
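    subgraph flow [Request flow]
        A[Request arrives] --> B{Circuit breaker state}
        B -->|Closed| C[Send request]
        B -->|Open| D[Fail fast with 503]
        B -->|Half-open| E[Allow probe request]
        
        C --> F{Request result}
        F -->|Success| G[Return response]
        F -->|Failure| H{Retry check}
        
        H -->|Retry condition met| I{Retries remaining}
        H -->|Not met| J[Return error]
        
        I -->|Yes| C
        I -->|No| J
        
        E --> F
    end
    
    subgraph breaker [Circuit breaker update]
        F -->|Success| K[Reset error count]
        F -->|Failure| L[Increment error count]
        L --> M{Threshold reached}
        M -->|Yes| N[Open circuit breaker]
        M -->|No| O[Stay closed]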
    end

Making the settings work together
#

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: service-resilience
spec:
  host: my-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
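        maxRetries: 10  # cap concurrent retries to prevent retry storms
    
    # Circuit breaker settings
    outlierDetection:
      consecutive5xxErrors: 5      # eject after 5 consecutive 5xx errors
      consecutiveGatewayErrors: 3  # eject after 3 consecutive gateway errors
      interval: 10s                # analysis interval
      baseEjectionTime: 30s        # ejection duration
      maxEjectionPercent: 50       # eject at most 50% of the instances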
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-resilience
spec:
  hosts:
  - my-service.production.svc.cluster.local
  http:
  - route:
    - destination:
        host: my-service.production.svc.cluster.local
    timeout: 15s
    retries:
      attempts: 2
      perTryTimeout: 5s
      retryOn: "5xx,reset,connect-failure"

Debugging and troubleshooting
#

Inspecting the Envoy configuration
#

# Show the route timeout configuration
istioctl proxy-config routes <pod-name> -o json | jq '.[] | select(.name=="80") | .virtualHosts[].routes[].route'

# Show the retry configuration
istioctl proxy-config routes <pod-name> -o json | jq '.. | .retryPolicy? | select(.)'

# Show the connection pool configuration
istioctl proxy-config clusters <pod-name> -o json | jq '.[] | select(.name | contains("my-service")) | .circuitBreakers'

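Common problems
#

1. Timeout has no effect
#

Problem: a timeout is configured, but requests still wait for a long time

Troubleshooting:

# Check whether the VirtualService is valid and applied
istioctl analyze -n production

# Check the Envoy configuration
istioctl proxy-config routes <pod-name> -o json | grep -A5 timeout

Common causes:

The hosts in the VirtualService do not match the request
The request does not pass through a sidecar
perTryTimeout is set too large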
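
A related cause worth checking is the gateways field: a VirtualService that lists only a gateway is not applied to sidecar-to-sidecar traffic unless the reserved name mesh is also listed. The sketch below assumes a hypothetical api-service that is reachable both through the ingress gateway and from inside the mesh; the resource name, host names, and the 10s value are illustrative.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service            # hypothetical name
  namespace: production
spec:
  hosts:
  - "www.example.com"                          # host seen at the gateway
  - api-service.production.svc.cluster.local   # host used inside the mesh
  gateways:
  - istio-system/istio-ingressgateway   # traffic entering through the gateway
  - mesh                                # sidecar-to-sidecar traffic
  http:
  - route:
    - destination:
        host: api-service.production.svc.cluster.local
    timeout: 10s
```

When gateways is omitted entirely, the rule applies to mesh traffic only, which is the opposite mismatch.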

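2. Retries cause duplicate requests
#

Problem: a POST request is retried and the data is written twice

Solution:

retries:
  attempts: 1
  retryOn: "connect-failure,refused-stream"  # retry only on connection-level failures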

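3. Working out the effective timeout
#

Problem: it is unclear what the final timeout will be

Formula (an upper bound that ignores the back-off between retries):

Effective timeout = min(timeout, perTryTimeout × (attempts + 1))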
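
As a worked example, plugging in the misconfigured values shown earlier (timeout: 20s, perTryTimeout: 10s, attempts: 5) gives the following. The symbols are shorthand introduced here, not Istio field names.

```latex
\[
T_{\text{effective}} = \min\bigl(T_{\text{timeout}},\; T_{\text{perTry}} \times (N_{\text{attempts}} + 1)\bigr)
\]
\[
\min\bigl(20\,\text{s},\; 10\,\text{s} \times (5 + 1)\bigr) = \min(20\,\text{s},\; 60\,\text{s}) = 20\,\text{s}
\]
```

The request therefore fails after 20s at the latest, and only two of the six configured tries can run to their 10s limit.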

Monitoring metrics
#

# Prometheus queries

# Request timeout rate
sum(rate(istio_requests_total{response_code="504"}[5m])) 
/ 
sum(rate(istio_requests_total[5m]))

# Retry rate
sum(rate(envoy_cluster_upstream_rq_retry[5m])) by (cluster_name)

# Circuit breaker ejection rate
sum(rate(envoy_cluster_outlier_detection_ejections_total[5m])) by (cluster_name)
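
These queries can also drive alerts. The following is a minimal sketch of a Prometheus rule file that reuses the timeout-rate query above; the group name, alert name, 1% threshold, and severity label are assumptions chosen for illustration.

```yaml
# Prometheus alerting rule (hypothetical names and threshold).
groups:
- name: istio-timeouts
  rules:
  - alert: IstioHighTimeoutRate
    # Fires when more than 1% of requests end in a 504.
    expr: |
      sum(rate(istio_requests_total{response_code="504"}[5m]))
      /
      sum(rate(istio_requests_total[5m])) > 0.01
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "More than 1% of requests returned 504 over the last 5 minutes"
```

In practice the ratio is usually broken down by destination service so that one noisy workload does not hide behind the mesh-wide average.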

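Best practices
#

1. Timeout principles
#

# Timeouts shrink at each hop
Gateway:     timeout: 60s
├── Service A: timeout: 30s
│   └── Service B: timeout: 10s
└── Service C: timeout: 20s

A caller's timeout should be longer than the timeout of the service it calls. Otherwise the caller gives up while the callee is still working on the request.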
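
As a rough illustration of this layering, the two VirtualServices below give a caller a larger budget than the service it depends on. The service names are hypothetical and assume that service-a calls service-b.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-a              # hypothetical caller
spec:
  hosts:
  - service-a.production.svc.cluster.local
  http:
  - route:
    - destination:
        host: service-a.production.svc.cluster.local
    timeout: 30s   # budget for requests sent to service-a
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-b              # hypothetical callee
spec:
  hosts:
  - service-b.production.svc.cluster.local
  http:
  - route:
    - destination:
        host: service-b.production.svc.cluster.local
    timeout: 10s   # shorter, so service-a hears back before its own deadline
```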

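2. Retry principles
#

Request type	Retry strategy	Reason
GET/HEAD	Retry aggressively	Idempotent
PUT/DELETE	Retry with care	Usually idempotent
POST/PATCH	Retry only on connection failures	May not be idempotent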
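
The middle row could be expressed as a route like the one below. This is a sketch only: the host and path are hypothetical, and the single retry assumes the PUT and DELETE handlers really are idempotent.

```yaml
# Fragment of spec.http in a VirtualService (hypothetical host and path).
- match:
  - uri:
      prefix: "/api/orders/"
    method:
      regex: "PUT|DELETE"
  route:
  - destination:
      host: order-service.production.svc.cluster.local
  timeout: 10s
  retries:
    attempts: 1          # one retry only
    perTryTimeout: 4s
    retryOn: "connect-failure,refused-stream,reset"
```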

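3. Timeout and retry checklist
#

□ A reasonable timeout is configured at the gateway entry point
□ Retry conditions are restricted for non-idempotent endpoints
□ perTryTimeout < timeout / (attempts + 1)
□ Circuit breaker thresholds are set to reasonable values
□ Timeout and retry metrics are monitored
□ Cross-region services account for network latency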

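Summary
#

Setting	Where	Default	Recommendation
Request timeout	VirtualService	0s (disabled)	Set according to your SLA
Retry attempts	VirtualService	2 (mesh) / 0 (gateway)	2-3 for idempotent endpoints
Per-try timeout	VirtualService	None (falls back to the request timeout)	timeout/(attempts+1)
Connect timeout	DestinationRule	10s	1-5s inside the same network
Idle timeout	DestinationRule	1h	Set according to your workload
Circuit breaker threshold	DestinationRule	5 (consecutive5xxErrors)	Set according to your error rate

Configuring timeouts and retries sensibly can markedly improve service resilience, but take care to avoid retry storms and the repeated execution of non-idempotent operations.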
