2022-07-05 23:05:03




We only recommend using the Pushgateway in certain limited cases. There are several pitfalls when blindly using the Pushgateway instead of Prometheus’s usual pull model for general metrics collection:

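  • When multiple instances push through a single Pushgateway, the Pushgateway becomes a single point of failure and a potential bottleneck.
  • There is no generic up metric to show whether an instance is alive.
  • The Pushgateway remembers every metric pushed to it and keeps exposing it to Prometheus for scraping until the metric is deleted through the Pushgateway API.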
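
Because pushed series stay until they are removed, a job usually needs a cleanup step. The sketch below builds the URL of one metric group and sends an HTTP DELETE to it, using only the standard library. The job name and labels are borrowed from the db_backup example in this post; the helper names groupURL and deleteGroup are hypothetical, and a local test server stands in for a real Pushgateway so the sketch runs on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// groupURL builds the Pushgateway URL of one metric group:
// <base>/metrics/job/<job>/<label>/<value>/...
// labels is a flat list of name, value pairs. Values containing "/" need the
// Pushgateway's base64 path encoding, which this sketch does not handle.
func groupURL(base, job string, labels ...string) (string, error) {
	if len(labels)%2 != 0 {
		return "", fmt.Errorf("labels must be name/value pairs, got %d items", len(labels))
	}
	var b strings.Builder
	b.WriteString(strings.TrimRight(base, "/"))
	b.WriteString("/metrics/job/" + url.PathEscape(job))
	for i := 0; i < len(labels); i += 2 {
		b.WriteString("/" + url.PathEscape(labels[i]) + "/" + url.PathEscape(labels[i+1]))
	}
	return b.String(), nil
}

// deleteGroup sends DELETE for the group so its series are no longer exposed.
func deleteGroup(client *http.Client, target string) error {
	req, err := http.NewRequest(http.MethodDelete, target, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Local stand-in for the Pushgateway (assumption: the real one would be
	// reachable at something like http://pushgateway:9091).
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println(r.Method, r.URL.Path)
		w.WriteHeader(http.StatusAccepted)
	}))
	defer fake.Close()

	u, err := groupURL(fake.URL, "db_backup", "db", "customers")
	if err != nil {
		fmt.Println("bad group:", err)
		return
	}
	if err := deleteGroup(fake.Client(), u); err != nil {
		fmt.Println("delete failed:", err)
	}
}
```

With the client_golang library used later in this post, the same effect should be available through the Delete method of the pusher instead of a hand-built request.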




For batch jobs that are related to a machine (such as automatic security update cronjobs or configuration management client runs), expose the resulting metrics using the Node Exporter’s textfile collector instead of the Pushgateway.
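
As a rough illustration of the textfile approach, the sketch below writes a metrics file into a directory that the Node Exporter would be pointed at with --collector.textfile.directory. The file is written under a temporary name and then renamed, so the collector never reads a half-written file. The metric name apt_upgrade_last_success_timestamp_seconds, the file name, and the helpers writeTextfile and demo are all hypothetical, and a throwaway temp directory stands in for the real collector directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeTextfile writes content to <dir>/<name>.prom. It writes to a temp file
// in the same directory first and renames it into place, which is atomic on
// the same filesystem. The temp name does not end in .prom, so the textfile
// collector ignores it.
func writeTextfile(dir, name, content string) error {
	tmp, err := os.CreateTemp(dir, name+".*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	if _, err := tmp.WriteString(content); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// CreateTemp uses mode 0600; the exporter may run as a different user.
	if err := os.Chmod(tmp.Name(), 0o644); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), filepath.Join(dir, name+".prom"))
}

// demo writes one sample gauge and reads the resulting file back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "textfile") // stand-in for the collector directory
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	content := "# TYPE apt_upgrade_last_success_timestamp_seconds gauge\n" +
		"apt_upgrade_last_success_timestamp_seconds 1700000000\n"
	if err := writeTextfile(dir, "apt_upgrade", content); err != nil {
		return "", err
	}
	got, err := os.ReadFile(filepath.Join(dir, "apt_upgrade.prom"))
	return string(got), err
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Print(out)
}
```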

Batch jobs

There is a fuzzy line between offline-processing and batch jobs, as offline processing may be done in batch jobs. Batch jobs are distinguished by the fact that they do not run continuously, which makes scraping them difficult.

The key metric of a batch job is the last time it succeeded. It is also useful to track how long each major stage of the job took, the overall runtime, and the last time the job completed (whether it succeeded or failed). These are all gauges, and should be pushed to a Pushgateway. There are generally also some overall job-specific statistics that would be useful to track, such as the total number of records processed.
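
One possible way to collect those gauges inside a job is sketched below: each stage is timed as it runs, and the result is rendered in the Prometheus text exposition format. The types runReport and stage, the method names, and the metric name suffixes are assumptions made for illustration; only the idea of tracking stage durations, overall runtime, last completion, and last success comes from the paragraph above.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// stage records how long one major step of the job took.
type stage struct {
	Name    string
	Seconds float64
}

// runReport collects the gauges for a single run of a batch job.
type runReport struct {
	Stages    []stage // in execution order
	Finished  int64   // Unix seconds when the run ended
	Succeeded bool
}

// timeStage runs fn and records its wall-clock duration under name.
func (r *runReport) timeStage(name string, fn func() error) error {
	start := time.Now()
	err := fn()
	r.Stages = append(r.Stages, stage{name, time.Since(start).Seconds()})
	return err
}

// render returns the gauges as text exposition lines, prefixed with job.
// The success timestamp is only emitted when the run succeeded.
func (r runReport) render(job string) string {
	var b strings.Builder
	total := 0.0
	for _, s := range r.Stages {
		fmt.Fprintf(&b, "%s_stage_duration_seconds{stage=%q} %g\n", job, s.Name, s.Seconds)
		total += s.Seconds
	}
	fmt.Fprintf(&b, "%s_duration_seconds %g\n", job, total)
	fmt.Fprintf(&b, "%s_last_completion_timestamp_seconds %d\n", job, r.Finished)
	if r.Succeeded {
		fmt.Fprintf(&b, "%s_last_success_timestamp_seconds %d\n", job, r.Finished)
	}
	return b.String()
}

func main() {
	var r runReport
	// Placeholder stage; a real job would do its work inside the closure.
	err := r.timeStage("extract", func() error { return nil })
	r.Finished = time.Now().Unix()
	r.Succeeded = err == nil
	fmt.Print(r.render("db_backup"))
}
```

Note that a failed run omits the success gauge here. If the whole group is replaced on every push, that would drop the previous success timestamp, so a push that only replaces metrics of the same name is probably the safer choice for this layout.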

For batch jobs that take more than a few minutes to run, it is useful to also scrape them using pull-based monitoring. This lets you track the same metrics over time as for other types of jobs, such as resource usage and latency when talking to other systems. This can aid debugging if the job starts to get slow.

For batch jobs that run very often (say, more often than every 15 minutes), you should consider converting them into daemons and handling them as offline-processing jobs.


package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "db_backup_last_completion_timestamp_seconds",
		Help: "The timestamp of the last successful completion of a DB backup.",
	})
	completionTime.SetToCurrentTime()
	if err := push.New("http://pushgateway:9091", "db_backup").
		Collector(completionTime).
		Grouping("db", "customers").
		Push(); err != nil {
		fmt.Println("Could not push completion time to Pushgateway:", err)
	}
}



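  • With the default round-robin (rr) load balancing, pushes land on arbitrary Pushgateway nodes, so the scraped metrics become inconsistent.
  • With IP-hash load balancing, the source job may move between hosts unpredictably, so its metrics still end up scattered across nodes.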


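  • Service discovery combined with consistent hashing addresses the problems above as far as possible.
  • The project uses Consul as the service registry; this could be replaced by watching Kubernetes Service resources.
  • The target Pushgateway node is selected with an HTTP redirect, which has compatibility problems with the Python SDK. Consider implementing a simple gateway that forwards the traffic directly, which avoids the extra 307 redirect.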
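
The consistent-hashing part of this design can be sketched roughly as follows. This is not the project's actual code: the ring type, the FNV-1a hash, the number of virtual nodes, and the node addresses are all assumptions. The key is taken to be the push path (job plus grouping labels), so that every push for one group reaches the same Pushgateway node no matter which host the job runs on.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a consistent-hash ring with virtual nodes.
type ring struct {
	points []uint32          // sorted hash positions on the ring
	owner  map[uint32]string // position -> Pushgateway address
}

// hashKey maps a string to a position on the ring using FNV-1a.
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places each node on the ring `replicas` times. If two virtual nodes
// hash to the same position, the first one placed keeps it.
func newRing(nodes []string, replicas int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, i))
			if _, taken := r.owner[p]; taken {
				continue
			}
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the node owning the first ring position at or after the key's
// hash, wrapping around to the start. It returns "" for an empty ring.
func (r *ring) pick(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	// In the design above this node list would come from service discovery.
	r := newRing([]string{"pgw-0:9091", "pgw-1:9091", "pgw-2:9091"}, 100)
	fmt.Println(r.pick("/metrics/job/db_backup/db/customers"))
}
```

When a node joins or leaves, only the groups that hash next to its ring positions move, which keeps most series on the node they were already pushed to. A forwarding gateway could call pick on the request path and proxy the request to the chosen node, for example with httputil.ReverseProxy from the standard library, instead of answering with a 307.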

