
I keep seeing that Borgmon is deprecated but no references to what the replacement is like.

Can someone please share even a high level summary?

Is the replacement a high cardinality event store like Facebook’s Scuba [0]?

[0] https://research.fb.com/wp-content/uploads/2016/11/scuba-div...



The replacement (Monarch) is similar to Borgmon except:

* All metrics have an associated type, e.g. response time (milliseconds). That's great because units for derived metrics can be dynamically computed, e.g. bytes/second.

* The query language can fairly efficiently compute metrics at query time rather than needing everything precomputed (e.g. 95th-percentile latency across 1000 tasks can be calculated in real time).

* The config system is a mess and nobody likes it. Borgmon uses a DSL which is obscure, but almost identical to Prometheus's. Monarch has various different config frontends (mostly built around the idea of running code to produce an expanded protobuf config), all of which suck. Luckily, because there isn't a strong requirement for rules to aggregate data, you don't need much config for most services - just say "scrape everything and keep it for a year".

* There are "levels" of storage at different speeds: in memory, on disk, etc. You have to configure where to put what data. You can also downsample (e.g. change the scrape interval to 5 minutes after a week).

* Metric names follow a directory-like hierarchy. Since tasks can easily have 10k exported metrics, that's pretty important - there's no need to scrape the ones that aren't relevant.

* It has a shinier UI.

* It has support for exemplars, to answer questions like "give me an example of a request that saw this high request latency". With not much added code in the monitored service, a small number of exemplars are captured and aggregated such that median and outlier exemplars are available. They're super useful for finding the cause of random slow performance.

* It is run as a service. Rather than code that every team has to run, the new thing is a single instance for all teams at Google. That in turn means it can be more complex, have more dependencies, etc., without being a burden on the user.
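To illustrate the typed-metrics point above, here's a minimal sketch (names and structure are invented for illustration, not Monarch's actual API) of how units for derived metrics can be computed automatically once every metric carries a unit:

```python
# Hypothetical sketch: metrics carry units, so dividing a bytes counter by a
# time interval yields a bytes/second metric without any user declaration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    name: str  # e.g. "By" (bytes), "s" (seconds)

    def per(self, other: "Unit") -> "Unit":
        return Unit(f"{self.name}/{other.name}")

@dataclass
class TypedMetric:
    value: float
    unit: Unit

    def rate(self, interval: "TypedMetric") -> "TypedMetric":
        # The derived unit is computed from the operand units.
        return TypedMetric(self.value / interval.value,
                           self.unit.per(interval.unit))

bytes_sent = TypedMetric(5_000_000, Unit("By"))
elapsed = TypedMetric(10, Unit("s"))
throughput = bytes_sent.rate(elapsed)
print(throughput.unit.name, throughput.value)  # By/s 500000.0
```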
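On the query-time percentile point: one common way this works (a sketch of the general technique, not necessarily Monarch's internals) is that each task exports a bucketed histogram, and the query layer merges bucket counts across tasks before estimating the percentile - so only ~a dozen counters per task cross the wire, not raw samples:

```python
# Sketch: p95 latency across 1000 tasks computed at query time by merging
# per-task histogram bucket counts. Bucket bounds are illustrative.
from bisect import bisect_left
import random

BOUNDS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]  # bucket upper bounds (ms)

def to_histogram(samples):
    counts = [0] * (len(BOUNDS) + 1)  # final slot is the overflow bucket
    for s in samples:
        counts[bisect_left(BOUNDS, s)] += 1
    return counts

def merge(histograms):
    # Element-wise sum of bucket counts across all tasks.
    return [sum(col) for col in zip(*histograms)]

def percentile(counts, p):
    target = p * sum(counts)
    running = 0
    for i, c in enumerate(counts):
        running += c
        if running >= target:
            return BOUNDS[i] if i < len(BOUNDS) else float("inf")
    return float("inf")

# 1000 tasks, each exporting only its 11 bucket counts.
random.seed(0)
per_task = [to_histogram(random.choices(range(1, 300), k=100))
            for _ in range(1000)]
print(percentile(merge(per_task), 0.95))  # 500 (upper bound of the p95 bucket)
```

The estimate is only as precise as the bucket bounds, which is the usual trade-off for query-time aggregation.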
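The exemplar mechanism described above can be sketched roughly like this (all names invented; the retention policy is a simple stand-in, not Monarch's actual scheme): alongside each histogram bucket, keep one sampled trace reference, so "show me a request that was this slow" can be answered from aggregated data.

```python
# Illustrative sketch of exemplar capture in a latency histogram.
import random

BUCKET_BOUNDS = [10, 100, 1000]  # ms

class HistogramWithExemplars:
    def __init__(self):
        self.counts = [0] * (len(BUCKET_BOUNDS) + 1)
        self.exemplars = [None] * (len(BUCKET_BOUNDS) + 1)

    def record(self, latency_ms, trace_id):
        i = next((j for j, b in enumerate(BUCKET_BOUNDS) if latency_ms <= b),
                 len(BUCKET_BOUNDS))
        self.counts[i] += 1
        # Reservoir-of-one per bucket: each sample has 1/count odds of being
        # kept, so both median buckets and outlier buckets retain an example.
        if random.randrange(self.counts[i]) == 0:
            self.exemplars[i] = (latency_ms, trace_id)

h = HistogramWithExemplars()
h.record(7, "trace-aa")
h.record(2500, "trace-bb")  # an outlier request lands in the overflow bucket
print(h.exemplars[-1])      # (2500, 'trace-bb')
```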


> The query language can fairly efficiently compute metrics at query time rather than needing everything precomputed

I'm having a hard time figuring out how the query language might affect execution time. Granted, I've never seen Borgmon in action, so the first question should probably be: how different is its query language from what Prometheus offers? If it's more on the imperative side than the declarative one, then your point might actually make sense to me as is.

> You can also downsample (eg. Change scrape interval to 5 mins after a week).

Does it allow different aggregation methods (avg/sum/max/whatever) for different metrics, like Carbon does?


I think what he is saying is that it is easier to do something ad hoc in Monarch since it is a global service with unified storage and retrieval. A Googler can join the data of Gmail and Ads easily if they want, whereas combining the borgmon data of those services is either not possible or it involves downloading CSVs and performing the join externally.

> Does it allow different aggregation methods (avg/sum/max/whatever) for different metrics, like Carbon does?

All of these methods are available[1]. Stackdriver Monitoring is the public face of Monarch.

1: https://cloud.google.com/monitoring/api/ref_v3/rest/v3/proje...
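As a sketch of what per-metric aggregation during downsampling looks like (the metric names and the window arithmetic here are illustrative, not the Stackdriver API), the key idea is that each metric is paired with the reducer appropriate to its kind - counters summed, gauges averaged or maxed:

```python
# Sketch: downsample per-minute points into 5-minute windows, with a
# different aggregation method per metric.
from statistics import mean

AGG = {"latency_ms": mean, "requests": sum, "queue_depth": max}

def downsample(points, window, method):
    """Collapse (timestamp, value) points into one aggregated point per window."""
    buckets = {}
    for ts, v in points:
        buckets.setdefault(ts - ts % window, []).append(v)
    return [(ts, method(vs)) for ts, vs in sorted(buckets.items())]

raw = [(0, 4), (60, 8), (120, 5), (180, 7)]    # one point per minute
print(downsample(raw, 300, AGG["requests"]))    # [(0, 24)] counters are summed
print(downsample(raw, 300, AGG["latency_ms"]))  # [(0, 6)]  gauges are averaged
```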


No, I think it's strictly about where aggregations and other computations occur. You hint at that when you mention CSV files. Borgmon keeps X hours of data in memory and looks up anything older from TSDB. I never understood how to aggregate data from the same service after the fact in a reliable fashion, or whether it was possible at all. I believe it only worked when looking at data in memory. That's why people had rules to aggregate data at multiple levels as it was being ingested, then persisted that into TSDB.

I don't remember the latter being much more sophisticated than a bunch of files in GFS with the timeseries and some metadata, either. Monarch, I think, is based on Bigtable. It has a richer API. It might even use BT coprocessors to perform computations closer to where the raw data is stored, rather than in the Monarch frontends. (I haven't watched the public talk, so I'm just going from vague recollection.)
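The ingestion-time aggregation rules mentioned above can be sketched as follows (a rough stand-in, not the Borgmon rule language): as each scrape arrives, task-level series are rolled up - e.g. summed per job - and only the aggregate is persisted:

```python
# Sketch of ingestion-time aggregation: roll task-level values up to the job
# level at scrape time, so only the aggregate needs long-term storage.
from collections import defaultdict

def apply_rules(scrape, group_by):
    """scrape: {(job, task): value}. Returns sums grouped by the kept labels."""
    out = defaultdict(float)
    for labels, value in scrape.items():
        out[tuple(labels[i] for i in group_by)] += value
    return dict(out)

scrape = {("web", "task0"): 3.0, ("web", "task1"): 5.0, ("api", "task0"): 2.0}
persisted = apply_rules(scrape, group_by=[0])  # keep only the job label
print(persisted)  # {('web',): 8.0, ('api',): 2.0}
```

The cost of this approach is the one discussed in the thread: anything not precomputed at ingestion is hard to reconstruct after the fact.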


Borgmon/TSDB was also backed by a Bigtable, an enormous one known as 'dumptruck'.


I am pretty sure that was developed for Monarch. Perhaps later it was adopted by BM to consolidate storage, but still through a dumb API? I know that originally it had a bunch of files in GFS, with a layer of indirection, because there was a tool to fix them.


Importantly, Monarch is push-based and centralized. Previously, product teams would have to run their own borgmen, and those in turn would get scraped by the upstream borgmen of their org for aggregation, archiving, etc. Monarch is more of an as-a-service offering.


The fact that Monarch configs can be written in Python instead of the Borgmon DSL is a huge win for our team; being able to write and debug our own alerting rather than having to bug SREs every time has been worth the switch alone.


Monarch, like Borg, is configured by RPC. You can use Python if that suits you, but you can also use C++ or Java or Go or any language capable of putting an encoded protobuf on the wire. This is a mistake people also make about Borg: there is borgcfg, but borgcfg is not Borg's API. You can use the Borg RPC interface from any language; borgcfg is optional.

Compare to Borgmon, where use of the DSL is obligatory.
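To make "configured by RPC" concrete, here's a minimal sketch of the idea: the client just builds a config message in its own language and ships the encoded bytes. All field names below are invented, and JSON stands in for the real encoded protobuf - this is the shape of the pattern, not Monarch's actual schema.

```python
# Sketch: config-as-message. Any language can construct this and put the
# encoded bytes on the wire; no DSL is involved.
import json

def build_alert_config(metric, threshold_ms, notify):
    # Hypothetical config structure, standing in for a protobuf message.
    return {"metric": metric,
            "condition": {"above_ms": threshold_ms},
            "notify": notify}

def encode(config):
    # In reality this would be protobuf serialization; JSON is a stand-in.
    return json.dumps(config).encode("utf-8")

payload = encode(build_alert_config("serving/latency", 500,
                                    ["oncall@example.com"]))
# `payload` is what an RPC client would send as the request body.
print(len(payload) > 0)  # True
```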


We're specifically using Gmon (GMon/Viceroy for dashboards, GMon Monarch for alerting), which is all Python built on top of Monarch.



