
Prometheus is borgmon. Which is a bit funny as it's used by k8s now. It's also funny as the author left Google, wrote it in 2 years for SoundCloud, then went back to Google.

https://prometheus.io



Why is the part that k8s uses it funny? That would have been one of the goals, right? K8s is like a lot of these projects, except that it was created as an open source version of Borg from within Google.


It’s funny because inside Google, Borgmon has been deprecated for new projects for at least nine years. Of course, Google has stretched “deprecated” to its limits.


I keep seeing that Borgmon is deprecated but no references to what the replacement is like.

Can someone please share even a high level summary?

Is the replacement a high cardinality event store like Facebook’s Scuba [0]?

[0] https://research.fb.com/wp-content/uploads/2016/11/scuba-div...


The replacement (Monarch) is similar to borgmon except:

* All metrics have an associated type. Eg. Response time (milliseconds). That's great because units for derived metrics can be dynamically computed. Eg. Bytes/second.

* The query language can fairly efficiently compute metrics at query time rather than needing everything precomputed (eg. 95 percent latency across 1000 tasks can be calculated in real-time).

* The config system is a mess and nobody likes it. Borgmon uses a DSL which is obscure, but almost identical to Prometheus. Monarch has various different config frontends (mostly around the idea of running code to produce an expanded protobuffer config) which all suck. Luckily because there isn't a strong requirement for rules to aggregate data, you don't need much config for most services - just say "scrape everything and keep it for a year".

* There are "levels" of storage at different speeds. In memory, on disk, etc. You have to configure where to put what data. You can also downsample (eg. Change scrape interval to 5 mins after a week).

* Metric names follow a directory-like hierarchy. Since tasks can easily have 10k exported metrics, that's pretty important. No need to scrape the ones that aren't relevant.

* It has a shinier UI.

* It has support for exemplars. So to answer the question "Give me an example of a request which saw this high request latency". With not much added code to the monitored service, a small number of exemplars are captured and aggregated in a way that median and outlier exemplars are available. They're super useful for finding out the cause of random slow performance.

* It is run as a service. Rather than code that every team has to run, the new thing is a single instance for all teams in Google. That in turn means it can be more complex, have more dependencies, etc, without being a burden on the user.
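To make the query-time point in the list above concrete, here is a minimal sketch of computing a 95th percentile latency across many tasks without a precomputed rule. It assumes each task exports a latency histogram with shared bucket bounds; the bounds, function names, and data are hypothetical and are not Monarch's or Borgmon's actual API.

```python
# Hypothetical bucket upper bounds in milliseconds (an assumption for illustration).
BOUNDS_MS = [10, 50, 100, 500, 1000]

def merge_histograms(per_task_counts):
    """Sum bucket counts element-wise across all tasks."""
    merged = [0] * len(BOUNDS_MS)
    for counts in per_task_counts:
        for i, c in enumerate(counts):
            merged[i] += c
    return merged

def percentile_upper_bound(counts, q):
    """Return the upper bound of the bucket that contains quantile q (0 < q <= 1)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples")
    target = q * total
    running = 0
    for bound, c in zip(BOUNDS_MS, counts):
        running += c
        if running >= target:
            return bound
    return BOUNDS_MS[-1]

# Two made-up tasks: one fast, one slow.
tasks = [[8, 1, 1, 0, 0], [0, 0, 0, 5, 5]]
merged = merge_histograms(tasks)
p95 = percentile_upper_bound(merged, 0.95)
```

Merging the histograms first and taking the percentile afterwards is the important part: percentiles computed per task cannot simply be averaged into a fleet-wide percentile.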


> The query language can fairly efficiently compute metrics at query time rather than needing everything precomputed

I'm having a hard time figuring out how a query language might affect execution time. Granted, I've never seen Borgmon in action, so the first question should probably be: how different is its query language from what Prometheus offers? If it's more on the imperative side than the declarative one, then your point might actually make sense to me as it is.

> You can also downsample (eg. Change scrape interval to 5 mins after a week).

Does it allow different aggregation methods (avg/sum/max/whatever) for different metrics, like Carbon does?


I think what he is saying is that it is easier to do something ad hoc in Monarch since it is a global service with unified storage and retrieval. A Googler can join the data of Gmail and Ads easily if they want, whereas combining the borgmon data of those services is either not possible or it involves downloading CSVs and performing the join externally.

> Does it allow different aggregation methods (avg/sum/max/whatever) for different metrics, like Carbon does?

All of these methods are available[1]. Stackdriver Monitoring is the public face of Monarch.

1: https://cloud.google.com/monitoring/api/ref_v3/rest/v3/proje...
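For readers unfamiliar with the idea, here is a rough sketch of downsampling with a per-metric aggregation method. It is a generic illustration, not the Stackdriver, Monarch, or Carbon implementation; the `downsample` function and the `AGGREGATORS` table are hypothetical names.

```python
from statistics import mean

# Hypothetical mapping from method name to aggregation function.
AGGREGATORS = {"avg": mean, "sum": sum, "max": max, "min": min}

def downsample(points, window_s, method):
    """Group (unix_ts, value) points into fixed windows and aggregate each window.

    Returns a list of (window_start_ts, aggregated_value), sorted by time.
    """
    agg = AGGREGATORS[method]
    buckets = {}
    for ts, value in points:
        start = ts - ts % window_s  # align the timestamp to its window start
        buckets.setdefault(start, []).append(value)
    return [(start, agg(values)) for start, values in sorted(buckets.items())]

# One-minute samples reduced to five-minute windows.
points = [(0, 1.0), (60, 3.0), (300, 10.0), (360, 20.0)]
peaks = downsample(points, 300, "max")
totals = downsample(points, 300, "sum")
```

The method matters per metric: a request counter usually wants `sum`, while a latency or queue-depth gauge usually wants `avg` or `max`.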


No, I think it's strictly about where aggregations and other computations occur. You hint at that when you mention CSV files. Borgmon keeps X hours of data in memory and looks up anything older from TSDB. I never understood how to aggregate data from the same service, after the fact, in a reliable fashion. Or if it was possible at all. I believe it was only working when looking at data in memory. That's why people had rules to aggregate data at multiple levels as it was being ingested, then persisting that into TSDB.
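A toy sketch of what "rules to aggregate data at multiple levels as it was being ingested" could look like, assuming samples carry label sets. This is not Borgmon's rule language; `rollup_on_ingest` and the label names are made up for illustration.

```python
from collections import defaultdict

def rollup_on_ingest(samples, keep_labels):
    """Sum sample values grouped by the labels in keep_labels, dropping all other labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep_labels)
        totals[key] += value
    return dict(totals)

# Per-task samples for one hypothetical job running in two datacenters.
samples = [
    ({"job": "web", "dc": "a", "task": "0"}, 5.0),
    ({"job": "web", "dc": "a", "task": "1"}, 7.0),
    ({"job": "web", "dc": "b", "task": "0"}, 1.0),
]
per_dc = rollup_on_ingest(samples, ("job", "dc"))  # first aggregation level
per_job = rollup_on_ingest(samples, ("job",))      # second aggregation level
```

Only the rolled-up series would then be persisted, which is why a question nobody wrote a rule for ahead of time is hard to answer later.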

I don't remember the latter being much more sophisticated than a bunch of files in GFS with the timeseries and some metadata, either. Monarch, I think, is based on Bigtable. It has a richer API. It might even use BT coprocessors to perform computations closer to where the raw data is stored, rather than in the Monarch frontends. (I haven't watched the public talk, so I'm just going from vague recollection.)


Borgmon/TSDB was also a bigtable, an enormous one known as 'dumptruck'.


I am pretty sure that was developed for Monarch. Perhaps later it was adopted by BM to consolidate storage, but still through a dumb API? I know that originally it had a bunch of files in GFS, with a layer of indirection, because there was a tool to fix them.


Importantly, Monarch is push-based and centralized. Previously, product teams would have to run their own borgmen, and those in turn would get scraped by the upstream borgmen of their org for aggregation, archiving, etc. Monarch is more of an As A Service offering.


The fact that Monarch configs can be written in Python instead of Borgmon is a huge win for our team; being able to write and debug our own alerting rather than have to bug SREs every time has been worth the switch alone.


Monarch, like Borg, is configured by RPC. You can use python if that suits you but you can also use C++ or Java or Go or any language capable of putting an encoded protobuf on the wire. This is a mistake people also make about Borg: there is borgcfg but borgcfg is not Borg's API. You can use the Borg RPC interface from any language and borgcfg is optional.

Compare to borgmon where use of the DSL is obligatory.


We're specifically using Gmon (GMon/Viceroy for dashboards, GMon Monarch for alerting), which is all Python built on top of Monarch.


A close source told me, only half joking, that at Google, when you start any new service, you’ve got the choice between about a dozen backing services, of which half are deprecated and the other half are not officially supported yet...


The actual proverb, by an engineer (Paul C), is "there are two ways of doing things at Google: the deprecated one, and the one that doesn't work yet". Usually the deprecation occurred when a service outlived the original requirements and was no longer a great fit or not very usable. Good examples were Babysitter and GWQ, which were eventually obsoleted by Borg.


I don’t think he was joking at all


As long as you don't have to write Prometheus configs in the horrendous Borgmon language....


Fwiw, I like Borgmon, but I also like APL derivatives :). My sadness is that Borgmon’s type system is quite poor, but like many “product of necessity” systems it actually worked incredibly well.


What are they supposed to use instead of borgmon?



I feel like much of the following conversation violates NDAs signed as a condition of employment there. Is this not the case?



