While this provides perfect ordering information, it does so at the cost of bouncing a cacheline (the head of the queue) around the entire machine. That's fine if perfect ordering really, really matters, but if minimizing probe effect is more important, try the DTrace approach: each thread writes into its own private buffer, tagging each record with a best-effort timestamp (say, the CPU's TSC). A drain then merge-sorts all the threads' buffers by timestamp.
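To make the shape of that concrete, here is a rough sketch in Python, not anything lifted from DTrace itself. The names `ThreadLog`, `record`, and `drain` are made up for illustration, and `time.perf_counter_ns()` is only a stand-in for a cheap hardware timestamp like the TSC. The point is that the write path touches nothing shared, and all the ordering work is deferred to the drain.

```python
import heapq
import threading
import time


class ThreadLog:
    """Hypothetical private event buffer, written by exactly one thread."""

    def __init__(self):
        self.events = []  # list of (timestamp, thread_name, message)

    def record(self, message):
        # Stand-in for a best-effort timestamp such as the CPU's TSC.
        stamp = time.perf_counter_ns()
        # No lock and no shared queue head: only the owning thread appends.
        self.events.append((stamp, threading.current_thread().name, message))


def drain(logs):
    """Merge every thread's buffer into one list ordered by timestamp."""
    # Each buffer is already in timestamp order (a monotonic clock, one
    # appender), so a k-way merge is enough; no full re-sort is needed.
    return list(heapq.merge(*(log.events for log in logs), key=lambda e: e[0]))


def worker(log, count):
    for i in range(count):
        log.record(f"event {i}")


logs = [ThreadLog() for _ in range(4)]
threads = [
    threading.Thread(target=worker, args=(log, 1000), name=f"t{n}")
    for n, log in enumerate(logs)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

merged = drain(logs)
print(len(merged), "events; first:", merged[0], "last:", merged[-1])
```

The merged order is only as good as the timestamps. If the clocks on different CPUs are slightly out of sync, events that happened close together on different threads can come out in the wrong order, which is exactly the tradeoff against the shared queue.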