
I'm curious why they would have chosen to make their UDP messaging so inflexible. Is it just because it was written when bandwidth was so scarce they couldn't afford to reserve a couple bytes for length and start/termination?


Games programmers in the 90s had this belief that UDP was faster than TCP. And in some cases it was. I don't think there was a standard way to change your stack's behavior for startup and congestion back then (we have CUBIC, BIC, FAST START and all sorts of options now.)

The commonly held belief was that if you want speed, you have to use UDP. But then the first thing people do is they re-implement TCP poorly at the application layer to get fragmentation, reliability, windowing or large datagrams.

OGP and "Assets Over HTTP" were introduced later and used TCP to carry some of the application protocol, but by the time they came around, UDP was wedged very far up in the application code and it was difficult to get it all out.

I haven't looked at the SL Viewer code in over a decade. Maybe it's better now.


It's still true that UDP is necessary for realtime communication. I recently worked on an MMO project in which a sub-team attempted to use TCP as the main layer, but ran into huge issues, all of which were cured by switching to UDP. TCP is good at what it's good at, but nearly every realtime audio and video protocol is gonna keep using UDP. You have to be able to ignore dropped packets instead of blocking the entire channel waiting for a full RTT; otherwise it very quickly spirals out of control.
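The drop-and-move-on discipline looks roughly like this (a toy sketch over loopback, not SL's actual code): the receiver keeps only the newest sequence number it has seen, so a lost packet costs nothing, whereas a TCP stream would stall on the gap until retransmission.

```python
# Toy sketch: a "newest wins" UDP receiver that never stalls on loss.
import socket

recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))          # OS picks a free port
addr = recv.getsockname()

send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for seq in (1, 2, 4):                # pretend packet 3 was lost in flight
    send.sendto(seq.to_bytes(4, "big"), addr)

latest = 0
recv.settimeout(0.5)
try:
    while True:
        data, _ = recv.recvfrom(64)
        seq = int.from_bytes(data, "big")
        if seq > latest:             # stale or duplicate packets are ignored
            latest = seq             # missing packet 3 costs us nothing
except socket.timeout:
    pass                             # no more traffic; carry on with the frame
send.close()
recv.close()
print(latest)
```

A TCP socket in the same situation would hand the application nothing until the gap was filled, which is exactly the head-of-line stall the comment describes.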

Of course, TCP is a much better protocol for infrequent and request/response type interactions, so while I was at Linden we converted as much traffic as possible to TCP/XML. But a lot of the "realtime" stuff like object updates had to remain UDP, for empirical reasons.

Second Life didn't reimplement fragmentation -- messages just got truncated if they went over the MTU, lol.


UDP is not required for real-time communication. I have 5k+ concurrent channels of RTP over TLS (not DTLS) going most of the day. If you solved a throughput problem by switching from TCP to UDP, it means you had a middle-box doing bad things.

TCP is not slow unless you encounter congestion or packet loss. But you have to deal with those problems if you're sending data over UDP.

When I was at Linden, we activated fast start and bic on ADITI when testing OGP and Assets over HTTP and things worked MUCH BETTER.


OTOH, limited congestion and packet loss do not affect things much if you have UDP-based event/object updates. With TCP there is no way to avoid latency due to head-of-line blocking, but with UDP you just drop a message and move on.

I think TCP works reasonably well today because the network is better and there is often very little packet loss, but as soon as you get some, TCP will optimise for throughput, as designed, at the expense of latency.


You should read about BIC.


actually... we DID do fragmentation in one of the messages. but i think that one didn't do windowing or something. this is the problem with using UDP: everyone thinks they're Van Jacobson and re-invents TCP poorly. In the case of the viewer, we had at least three implementations of "random transport over UDP," each talking with a different part of the back-end. Each used the same algorithm to deal with packet loss, however: dump core and wait five minutes. (I jest. I think we mostly fixed that, but it did happen occasionally.)


> I haven't looked at the SL Viewer code in over a decade. Maybe it's better now.

I have. It's a little better. Bulk file transfers over UDP are finally gone, moved to HTTP.

> The commonly held belief was that if you want speed, you have to use UDP. But then the first thing people do is they re-implement TCP poorly at the application layer to get fragmentation, reliability, windowing or large datagrams.

Yes. The protocol Second Life uses has both unreliable and reliable UDP datagrams. "Reliable" means they have retransmits, on a fixed timer with a fixed retry count. Some get lost on overload, because the viewer's one-thread implementation discards if too many packets arrive in a frame time. In-order delivery, reliability, no head-of-line blocking: pick two. You can't have all three. TCP picks the first two. The Second Life protocol picks the second two. This results in occasional trouble where there really is an in-order requirement imposed by the way the state of the system changes.
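The fixed-timer/fixed-retry bookkeeping described above can be sketched like this (names and constants are mine, not Second Life's):

```python
# Hypothetical sketch of "reliable UDP" with a fixed retransmit timer
# and a fixed retry count, as described in the comment above.
RETRY_INTERVAL = 0.25   # seconds between retransmits (fixed timer)
MAX_RETRIES = 3         # give up after this many resends

class ReliableSender:
    def __init__(self, transmit):
        self.transmit = transmit          # callable(seq, payload)
        self.pending = {}                 # seq -> (payload, last_sent, tries)

    def send(self, seq, payload, now):
        self.transmit(seq, payload)
        self.pending[seq] = (payload, now, 0)

    def on_ack(self, seq):
        self.pending.pop(seq, None)       # acked: stop retransmitting

    def tick(self, now):
        """Retransmit anything unacked past the timer; drop after MAX_RETRIES."""
        for seq, (payload, last, tries) in list(self.pending.items()):
            if now - last >= RETRY_INTERVAL:
                if tries >= MAX_RETRIES:
                    del self.pending[seq]     # message is lost for good
                else:
                    self.transmit(seq, payload)
                    self.pending[seq] = (payload, now, tries + 1)

wire = []                                  # record every transmit for the demo
s = ReliableSender(lambda seq, p: wire.append(seq))
s.send(1, b"update", now=0.0)
s.tick(now=0.3)                            # no ack yet -> retransmit
s.on_ack(1)
s.tick(now=0.6)                            # acked -> silence
print(wire)                                # two transmits of seq 1, then nothing
```

Note there is no ordering here at all: each message retries independently, which is exactly how you get reliability without head-of-line blocking, and also exactly how in-order delivery is lost.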

Unreliable messages are common in game protocols. "Unreliable" means that lost messages should be superseded by the next one of the same type. The intent is to be consistent-eventually. Second Life is almost consistent-eventually, but not quite, which results in occasional viewer out of sync errors which make some objects in the world look wrong.
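The supersede rule is simple to state in code (field names are illustrative, not the actual SL message layout): for each message type, only the newest sequence number wins, and older arrivals are discarded rather than retransmitted.

```python
# Sketch of "unreliable but superseded" handling: newest update per
# message type wins; stale arrivals are dropped, never retransmitted.
def apply_updates(state, messages):
    """state: type -> (seq, value); messages arrive in any order, with losses."""
    for msg_type, seq, value in messages:
        have = state.get(msg_type)
        if have is None or seq > have[0]:   # newer than what we hold?
            state[msg_type] = (seq, value)  # supersede the old value
    return state

# Position update 7 arrives late, after update 9 - it must not regress us.
state = apply_updates({}, [
    ("position", 9, (10, 20)),
    ("position", 7, (1, 2)),    # stale: dropped
    ("health", 3, 95),
])
print(state["position"])
```

The "almost consistent-eventually" failure mode is what happens when the last update for some type is lost and nothing newer ever supersedes it: the viewer holds stale state until something forces a refresh.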

(Personally, I'd do this with one UDP path of unreliable messages, plus two TCP connections, one for high priority, low-traffic stuff, and one for lower-priority bigger stuff. Get rid of "reliable UDP". Only a very few message types, such as user mouse activity and movement updates, should be unreliable, because those can tolerate a lost message.)


It was basically "throwaway code" which was written in a hurry for a prototype that was not expected to have a long life. This kind of parsing was (is?) very common in the video game industry, and quite easy to author in C++, so it would have been the first tool the developers reached for to implement a real-time network protocol.

The evolution was hardcoded -> message template -> "Liberación" which allows the protocol to tolerate unknown fields, and thus be future-compatible, much like Protobuf does.
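The mechanism that buys that tolerance is a self-describing field encoding. A rough sketch (my naming and layout, not the real Liberación wire format): put a tag and a length in front of every field, so an old parser can skip fields it doesn't recognize instead of choking on them, which is also how Protobuf-style forward compatibility works.

```python
# Illustrative tag+length field encoding that tolerates unknown fields.
import struct

def parse(buf, known_tags):
    fields, i = {}, 0
    while i < len(buf):
        tag, length = struct.unpack_from("!BH", buf, i)  # 1-byte tag, 2-byte len
        i += 3
        if tag in known_tags:
            fields[tag] = buf[i:i + length]
        # unknown tag: the length prefix lets us skip its payload safely
        i += length
    return fields

msg = (struct.pack("!BH", 1, 2) + b"hi" +
       struct.pack("!BH", 99, 4) + b"\x00" * 4 +   # field 99 added by a newer server
       struct.pack("!BH", 2, 1) + b"!")
decoded = parse(msg, known_tags={1, 2})
print(decoded)
```

A hardcoded struct-cast parser, by contrast, bakes the exact byte layout into the binary, so any new field breaks every deployed client at once.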

Source: worked on the Liberación project.


Yeah. Linden was this weird mixture of Games Programmers and Enterprise Software Architects. Games programmers all know the absolute most important thing in the world is ship date, and since you never re-use your code, who cares if it's easy to maintain, you're never going to see it again! Enterprise Software Architects know performance takes a back seat to correctness and ease of extension. Since the enterprise never throws any code away, it's important to make sure everything can be updated by the next group of contractors you hire, even if that means you take a performance hit.

Turns out both communities didn't have a perfect idea of what they should be doing.



