Synchronous logging strikes again! We ran into this at work with GLOG (Google's logging library), which can, e.g., block on disk IO if stdout is redirected to a file. GLOG was behind 90-99% of the cases where our service stalled for over 100ms.
I have this discussion with cow-orkers about logging:
"We have Best-Effort and Guaranteed-Delivery APIs"
"I want Guaranteed Delivery!!!"
"If the GD logging interface is offline or slow, you'll take downtime; is that okay?"
"NO NO Must not take downtime!"
"If you need it logged, and can't log it, what do you do?"
These days I just point to the CAP theorem and suggest that logging is the same as any other distributed system. Because there's a Wikipedia article with a triangle and the word "theorem", people seem to accept that.
[edit: added "GD" to clarify that I was referring to the guaranteed delivery logging api, not the best effort logging API]
Interesting; I'd think logging is one of the clearest situations where you want best effort. Logging is, almost by definition, not the "core" of your application, so failure to log properly should not prevent the core of the program from working. Killing the whole program because the logging server is down is clearly throwing the baby out with the bathwater.
What people probably mean is "logging is important, let's avoid losing log messages if possible", which is what "best" in "best effort" stands for. For example it's often a good idea to have a local log queue, to avoid data loss in case of a temporary log server downtime.
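A minimal sketch of that local-queue idea in Python; the names (`BufferedLogger`, `send_to_log_server`) are illustrative, not any particular library's API. The caller never blocks, and when the queue is full we count and drop instead of stalling:

```python
import queue

class BufferedLogger:
    """Best-effort logging: never block the application on the log backend."""

    def __init__(self, send_to_log_server, maxsize=10_000):
        self._send = send_to_log_server          # stand-in for your transport
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0                         # count what we couldn't keep

    def log(self, message):
        try:
            self._q.put_nowait(message)          # never block the caller
        except queue.Full:
            self.dropped += 1                    # shed load instead of stalling

    def drain_once(self):
        # In a real system a background thread calls this on a timer;
        # it's pulled out here so the behaviour is easy to test.
        while True:
            try:
                msg = self._q.get_nowait()
            except queue.Empty:
                return
            try:
                self._send(msg)
            except OSError:
                self._q.put_nowait(msg)          # backend down: keep for retry
                return
```

The important property is that `log()` is O(1) and cannot stall, so a slow or dead log server costs you (bounded) memory and possibly some messages, never application downtime.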
In some systems the logs are journaled records for the business, or discoverable artifacts for compliance. In highly secure environments logs are not only made durable; measures are also taken to fingerprint them and their ordering (like ratchet hashing) so that their integrity is verifiable.
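A toy sketch of that hash-chaining ("ratchet") idea, assuming SHA-256 as the fingerprint: each record's digest covers the previous digest, so altering or reordering any entry invalidates every later fingerprint.

```python
import hashlib

SEED = b"\x00" * 32  # agreed-upon starting value for the chain

def chain_logs(entries, seed=SEED):
    """Fingerprint each entry so it commits to everything before it."""
    prev, chained = seed, []
    for entry in entries:
        digest = hashlib.sha256(prev + entry.encode()).hexdigest()
        chained.append((entry, digest))
        prev = bytes.fromhex(digest)
    return chained

def verify_chain(chained, seed=SEED):
    """Recompute the chain; any edit or reorder breaks verification."""
    prev = seed
    for entry, digest in chained:
        if digest != hashlib.sha256(prev + entry.encode()).hexdigest():
            return False
        prev = bytes.fromhex(digest)
    return True
```

In real deployments the chain head is typically anchored somewhere the attacker can't rewrite (an HSM, a remote notary, a WORM store), since otherwise they could recompute the whole chain after tampering.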
I would note that disk-based logging is generally harmful in these situations, IMO. Network-based logging is less likely to cause blocking at the OS level, or other jitter that's harder to mask. Typically I build logging as an in-memory thing that offloads to a remote service over the network. The durability of the memory store can be an issue in highly sensitive workloads, and in that case you'll want synchronous disk IO to ensure durability and consistent time budgets, but for almost all applications diskless logging is preferable.
If you're not waiting for the remote log server to write the messages to its disk before proceeding, then it seems like that's not guaranteed to me? And if you are, then you suffer all the problems of local disk logging but also all the extra failure modes introduced by the network, too
> If you're not waiting for the remote log server to write the messages to its disk before proceeding, then it seems like that's not guaranteed to me?
Depends on your failure model. I'd consider e.g. "received in memory by at least 3/5 remote servers in separate datacenters" to be safer than "committed to local disk".
For logs, which are immutable time series journals, any copy is entirely sufficient. The first write is a quorum. Also from a systems POV reads are not a feature of logs.
Consistency is a synonym for "guaranteed", and means "written to 2 remote, reliable, append-only storage endpoints" (for any reasonable definition of reliability)
So, for a single system collecting a log event: it is not reliable (guaranteed) if written just to some device on that system. Instances can be de-provisioned (and logs lost), filesystems or databases can be scrambled, bad guys can encrypt your data, etc.
In this context, a "network partition" prevents consistency (data not written to reliable media) or prevents availability (won't accept new requests until their activity can be logged reliably).
If you define "reliably" differently, you may have a different interpretation of log consistency.
I'm not sure I understand the way you're using the vocabulary. Consistency is a read-operation concept, not a write one. There are no online reads for logs.
Availability is achieved if at least one writer acknowledges a write. A partition matters when multiple parts of the system can disagree about write contents because the network is split. But because logs are immutable and write-only, that never happens. The only situation where it might is if you're maintaining a distributed ratchet with in-order delivery semantics rather than eventually consistent temporal semantics, in which case you will never have CAP anyway. But that's an insanely rare edge case.
Note CAP doesn't ensure perfect durability. I feel like you're confusing consistency with durability. Consistency means that after I've durably written something, all nodes agree on read that it's been written. Since logs don't support reads on the online data plane, this is trivially a non-issue. Any write acknowledgment is sufficient.
> Consistency is a synonym for "guaranteed", and means "written to 2 remote, reliable, append-only storage endpoints" (for any reasonable definition of reliability)
No it doesn't. Read your own wiki link.
> In this context, a "network partition" prevents consistency (data not written to reliable media) or prevents availability (won't accept new requests until their activity can be logged reliably).
A network partition doesn't matter for a log system because there is no way to have consistency issues with logs. Even a single partitioned-off instance can accept writes without causing any problem.
Of course if you cannot connect to any instance of your log service then you cannot write logs. But that's got nothing to do with the CAP theorem.
The difference is that network IO can be masked by the operating system more easily than block-device IO. When you offload your logging to another thread the story isn't over, because disk logging can still interfere at a system level. Network IO isn't as noisy. If durability is important you might still need to wait for an ACK before freeing the message buffer, which can lead to more overall memory use, but all the operations play nicely with a preemptible scheduler.
Also, the failure modes of systems are very tied to the durable storage devices attached to them, and very rarely to network devices. By reducing the number of things that need a disk (ideally to zero) you can remove disks from the system and from its availability story. Once you get to fully diskless systems, the failure modes are almost nothing. But even with disks attached, reducing how often you touch the disk (especially for chatty things like logs!) reduces the likelihood that the entire system fails due to a disk issue.
People use logging (appropriately or inappropriately; not my bucket of monkeys) for a variety of things including audit and billing records, which are likely a good case for a guaranteed delivery API.
People often don't think precisely about what they say or want, and also often don't think through corner cases such as "what if XYZ breaks or gets slow?"
And don't get me started on "log" messages that are 300MB events. Per log. Sigh.
But if your service has downtime because the logs could not be written, that seems strictly inferior. As someone else wrote upthread, you only want guaranteed delivery for logs if they're required under a strict audit regime and the cost of noncompliance is higher than the cost of a service outage.
That's unavoidable if the logging service is down when your server crashes.
Having a local queue doesn't mean logging to the service is delayed, it can be sent immediately. All the local queue does is give you some resiliency, by being able to retry if the first logging attempt fails.
If your logging service is down all bets are off. But by buffering logs you're now accepting that problems not related to the logging service will also cause you to drop logs - as I mentioned, your service crashing, or being OOM'd, would be one example.
What's more likely? An intermittent network issue, the logging service being momentarily down, or a local crash that only affects your buffering queue?
If an OOM happens, all bets are off anyway, since it has as much likelihood of taking out your application as it does your buffering code. The local buffering code might very well be part of the application in the first place, so the fate of the buffering code is the same as the application anyway.
It seems you're trying very hard to contrive a situation where doing nothing is better than taking reasonable steps to counter occasional network hiccups.
> It seems you're trying very hard to contrive a situation where doing nothing is better than taking reasonable steps to counter occasional network hiccups.
I think you've completely misunderstood me then. I haven't taken a stance at all on what should be done. I'm only trying to agree with the grandparent poster about logging ultimately reflecting CAP Theorem.
Logging can be essential to security (to auditing). It's your record of what happened. If an attacker can cause logging to fail, they can cover their tracks more easily.
To me audit logs aren't "logs" (in the normal sense), despite the name. They tend to have different requirements; e.g., in my industry, they must be retained, by law, and for far longer than our normal logs.
To me, those different requirements imply that they should be treated differently by the code, probably even under distinct flows: synchronously, and ideally to somewhere that I can later compress like hell and store in some very cheap long term storage.
Whereas the debug logs that I use for debugging? Rotate out after 30 to 90d, … and yeah, best effort is fine.
(The audit logs might also end up in one's normal logs too, for convenience.)
I have some wishful-thinking ideas on this, but it should be possible to have both, at least in an imaginary, theoretical scenario.
You can have both guaranteed delivery and no downtime if your whole system is so deterministic that anything that normally would result in blocking just will not, cannot happen. In other words it should be a hard real-time system that is formally verified top to bottom, down to the last transistor. Does anyone actually do that? Verify the program and the hardware to prove that it will never run out of memory for logs and such?
Continuing this thought: logs are probably generated endlessly, so either whoever wants them has to also guarantee that they are processed and disposed of right after being logged... or there is a finite amount of log messages that can be stored (an arbitrary number like 10,000), but the user (of logs) has to guarantee that they will take the "mail" out of the box before it overfills (at some predictable, deterministic rate). So really that means even if OUR system is mathematically perfect, we're just making the downtime someone else's problem, namely the consumer of the infinite logs.
That, or we guarantee that the final resources of our self-contained, verified system will last longer than the finite shelf life of the system as a whole (like maybe 5 years for another arbitrary number)
From a hardware point of view, this system is unlikely to exist, because you need a system with components that never have any reliability issues ever to have a totally deterministic system.
From a software point of view, this system is unlikely to exist as it doesn't matter that the cause of your downtime is "something else that isn't our system". As a result, you're gonna end up requiring infinite reliable storage to upkeep your promises.
The better way to do this is to write the logs to a file or an in-memory ring buffer and have a separate thread/process push logs from the file/ring-buffer to the logging service, allowing for retries if the logging service is down or slow (for moderately short values of down/slow).
Promtail[1] can do this if you're using Loki for logging.
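The ring-buffer-plus-shipper pattern can be sketched in Python; the `push_batch` callback stands in for the HTTP client you'd actually use, so this is an illustration of the idea, not Promtail's implementation:

```python
from collections import deque

class LogShipper:
    """Ring buffer drained by a background shipper; oldest entries are
    evicted when the buffer is full, and failed pushes are retried."""

    def __init__(self, push_batch, capacity=10_000, batch_size=100):
        self._push = push_batch
        self._buf = deque(maxlen=capacity)   # oldest lines fall off when full
        self._batch_size = batch_size

    def append(self, line):
        self._buf.append(line)               # O(1), never blocks the app

    def ship(self):
        # In a real system a separate thread/process runs this on a timer.
        while self._buf:
            n = min(self._batch_size, len(self._buf))
            batch = [self._buf.popleft() for _ in range(n)]
            try:
                self._push(batch)
            except ConnectionError:
                # Log service down or slow: put the batch back and retry
                # on the next tick instead of blocking the application.
                self._buf.extendleft(reversed(batch))
                return
```

Because the buffer is bounded, a prolonged outage costs you the oldest log lines rather than unbounded memory, which matches the "moderately short values of down/slow" caveat above.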
Yeah, what the "best effort" actually means in practice is usually a result of how much resources you want to throw at the problem. Those give you runway on how much of a problem you can withstand and perhaps recover from without any loss of data (logs), but in the end you're usually still just buying time. That's usually enough though.
You are thinking too much in terms of the stated requirements instead of what people actually want: good uptime and good debuggability. Falling back to local logging means a blip in logging availability doesn't turn into all hands on deck, everything is on fire. And it means that logs will very likely be available for any failures.
Good enough is literally "best effort delivery", you're just agreeing with them that this is ultimately a distributed systems problem and you either choose CP or AP.
We did something like this at Weebly for stats. The app sent the stats to a local service via UDP, fire and forget. That service aggregated for 1s and then sent the batch off-server.
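A rough sketch of that fire-and-forget stats path, with illustrative names (not Weebly's actual code): the app emits one datagram per stat with no ack, and a local agent counts them before forwarding a window's worth off-host.

```python
import socket
from collections import Counter

def open_agent():
    """The local aggregating agent: bind a UDP socket on an ephemeral port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))
    return sock, sock.getsockname()[1]

def emit(port, metric):
    """The app side: one datagram, no ack, no blocking on the agent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(metric.encode(), ("127.0.0.1", port))

def aggregate(sock, expected):
    """The agent side: a real agent loops on a ~1s timer; here we just
    read a known number of datagrams and count them."""
    counts = Counter()
    for _ in range(expected):
        data, _addr = sock.recvfrom(1500)
        counts[data.decode()] += 1
    return dict(counts)
```

The trade-off is explicit: a dropped datagram means a slightly undercounted stat, never a blocked application thread.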
Ack is a TCP thing to acknowledge that the packet has been received over the unreliable network, it’s not relevant to Unix sockets which are for communicating between processes on the same host.
We had prod halt once when the syslog server hung. Logs were pushed over TCP, which propagated the blocking to the whole of prod. We've used UDP transport since: better to lose some logs than the whole of prod.
TCP vs. UDP and async best-effort vs. synchronous are completely orthogonal…
E.g., a service I wrote sent logs to an ELK setup; we logged over TCP. But the logging was async: we didn't wait for logs to make it to ELK, and if the logging services went down, we just queued up logs locally. (To a point; eventually the buffer fills up and logs get discarded. The process would note locally if that happened.)
> TCP vs. UDP and async best-effort vs. synchronous are completely orthogonal…
I agree, when stuff is properly written. I don't remember the exact details, but at least with UDP the asyncness is built in: there is no backpressure whatsoever. So poorly written software can just send UDP to its heart's content.
Especially if some system is unhappy enough to log enough volume to blow up the local log disk... you'll usually have enough clues in the bazillion other messages that did get logged.
Ah, the classic GLOG-induced service stall - brings back memories! I've seen similar scenarios where logging, meant to be a safety net, turns into a trap. Your 90-99% figure resonates with my experience. It's like opening a small window for fresh air and having a storm barrel in. We eventually had to balance between logging verbosity and system performance, kind of like a tightrope walk over a sea of unpredictable IO delays. Makes one appreciate the delicate art of designing logging systems that don't end up hogging the spotlight (and resources) themselves, doesn't it?
Oh, we had this type of issue ("logging lib breaks everything") with a $MSFT logging library. Imagine having 100 threads, each with their own 300MB logging buffer. Needless to say it annihilated our memory and our server crashed, even on the most expensive SKU of Azure App Service.
Reminds me a little of the old-timers' trick of adding a sleep(1000) somewhere so they could come back later and free up some resources, or grab a quick win with the client.
Now cloud companies are using malloc(300000000) to fake resource usage. /s
I would posit that if your product's availability hinges on +/- 100ms, you are doing something deeply wrong, and it's not your logging library's fault. Users are not going to care if a button press takes 100 more ms to complete.
100ms for something like, say, API authorization on a high-volume data-plane service would be unacceptable. Exceeding latencies like that can degrade bandwidth and cause workers to exhaust connection counts. Likewise, even in human-response space, 100ms is an enormous part of a responsiveness budget. Taking authorization again: if you spend 100ms, you're exhausting the perceptible threshold for a human's sense of responsiveness on something that has no practical value but is entirely necessary. Your UI developers will be literally camped outside your Zoom room with virtual pitchforks night and day.
Our service is expected to respond to small reads at under 1ms at the 99th percentile. >100ms stalls (which can go into many seconds) are absolutely unacceptable.
Core libraries at, say, Google, are supposed to be reliable to several nines. If they go down for long enough for a human to notice, they’re failing SLA.