Job-type applications sometimes need to report metrics too, but since they are not long-running, the only option is to push metrics actively. The Pushgateway is exactly that component: it exposes an API for jobs to push metrics to, and an endpoint for Prometheus to scrape them from.
When to use the Pushgateway
We only recommend using the Pushgateway in certain limited cases. There are several pitfalls when blindly using the Pushgateway instead of Prometheus's usual pull model for general metrics collection:
- When a single Pushgateway fronts many instances, it becomes both a single point of failure and a potential bottleneck.
- You lose Prometheus's automatic up metric, so there is no generic instance-health indicator.
- The Pushgateway remembers every series pushed to it and keeps exposing it to Prometheus indefinitely, unless the series is explicitly deleted via the Pushgateway API.
The one-job-many-instances scenario is particularly tricky: the lifecycle of a job instance is independent of the lifecycle of its metrics inside the Pushgateway, so whenever job instances change, the associated metrics have to be cleaned up manually.
Usually, the Pushgateway should only be used for batch-job-type applications.
Alternatives
For batch jobs that are related to a machine (such as automatic security update cronjobs or configuration management client runs), expose the resulting metrics using the Node Exporter’s textfile collector instead of the Pushgateway.
Batch jobs
There is a fuzzy line between offline-processing and batch jobs, as offline processing may be done in batch jobs. Batch jobs are distinguished by the fact that they do not run continuously, which makes scraping them difficult.
The key metric of a batch job is the last time it succeeded. It is also useful to track how long each major stage of the job took, the overall runtime and the last time the job completed (successful or failed). These are all gauges, and should be pushed to a PushGateway. There are generally also some overall job-specific statistics that would be useful to track, such as the total number of records processed.
For batch jobs that take more than a few minutes to run, it is useful to also scrape them using pull-based monitoring. This lets you track the same metrics over time as for other types of jobs, such as resource usage and latency when talking to other systems. This can aid debugging if the job starts to get slow.
For batch jobs that run very often (say, more often than every 15 minutes), you should consider converting them into daemons and handling them as offline-processing jobs.
SDK example
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "db_backup_last_completion_timestamp_seconds",
		Help: "The timestamp of the last successful completion of a DB backup.",
	})
	completionTime.SetToCurrentTime()
	if err := push.New("http://pushgateway:9091", "db_backup").
		Collector(completionTime).
		Grouping("db", "customers").
		Push(); err != nil {
		fmt.Println("Could not push completion time to Pushgateway:", err)
	}
}
Problems
When the Pushgateway has to be deployed with multiple replicas, the following problems arise:
- With the default round-robin load balancing, each push lands on a random replica, so scrapes see the metrics scattered inconsistently.
- With IP-hash balancing, the source job may drift between nodes at any time, so the metrics still end up scattered across replicas.
I happened to come across this project: https://github.com/ning1875/dynamic-sharding
- It uses service discovery plus consistent hashing to mitigate the problems above as far as possible.
- The project uses Consul as its registry; this could be replaced with watching Kubernetes Service resources.
- It picks the target Pushgateway node via an HTTP redirect, which has compatibility issues with the Python SDK. A simple gateway that forwards the traffic directly would avoid the extra 307 redirect.