Why scaling and parallelism remain hard even with new tools and languages

rdtsc · on Aug 10, 2016

That's why I like Erlang and Elixir. They were built to handle concurrency down to the core. Currently only languages on BEAM VM provide a set of mature, built-in fault tolerance features (code reloading, isolated concurrency units -- processes, immutable data, sending messages between processes instead of acquiring lock).

I often see some frameworks which claim implement Erlang in "language $x" by adding a queue to a thread. But that is still very much behind what Erlang does, because it is missing these other components, namely fault tolerance.

Sure you can spawn OS processes, or spin up multiple machines/containers. But that is not built-in, so have to manage the additional stack for that. Java has code reloading, but it is not quite the same and so on.

And this is not just talk in generalities, these properties translate directly to benefits and profits -- faster development time, less ops overhead (some parts of the backend can crash and restart, without having to wake everyone up), less code to maintain and dependencies to manage.

Just the other day had a typical distributed systems problem -- didn't add backpressure and so messages were piling up in receiver mailbox. The simplest solution I tested was just switch a gen_server:cast to a gen_server:call. It was a 2-3 line change. Hotpatched on a test cluster and problem was fixed in a few minutes. Ultimately I did something else, but the point is using a language which was built in for concurrency is that it was just a couple of lines of code. Had that been a custom RPC solution, with some serialization and some socket code, it would have taken a lot more to write, test and deploy. All that adds up quickly and can make or break the project.

heavenlyhash · on Aug 10, 2016

The biggest question I have about this: Is it really a good idea to make local actors and supervisor trees indistinguishable from remote ones?

Reusability is good.

Abstractions that are so "good" that they obscure potentially order-of-mag performance differences -- I'm not sold. Whether an actor is in-memory with me or across a TCP socket is order-of-mag, and may make the difference between whether my program satisfies is service goals, or fails them abjectly. Especially if that randomized latency happens a number of times, or in unpredictable places.

Error handling paths that have to account for the Two-Generals problem being indistinguishable from local errors I can be confident I'll hear about -- I'm not sold on that, either.

I'd be interested to hear rebuttals to those from folks who have worked deeply with OTP. Is it just... not that bad, in practice? Or are there parts of the abstraction that specifically help counter these concerns?

pyotrgalois · on Aug 10, 2016

In some way this article by one of Erlang creators is related to your comment: https://armstrongonsoftware.blogspot.com.ar/2008/05/road-we-...

> The fundamental problem with taking a remote operation and wrapping it up so that it looks like a local operation is that the failure modes of local and remote operations are completely different.

derefr · on Aug 10, 2016

I think of Erlang less as making remote operations masquerade as local ones, and more as forcing you to assume that all local code has the potential failure-modes of remote code. Which, often (for weird and roundabout reasons) it does.

im_down_w_otp · on Aug 11, 2016

This.

jeffdavis · on Aug 11, 2016

Very insightful!

qznc · on Aug 10, 2016

Armstrong clearly sees the problem. I understand his solution as: Erlang makes local and remote actors indistinguishable, but you should build your own custom wrapper for remote calls.

This seems fishy to me. You build low-level boilerplate over a high-level foundation? To clarify, I consider implementing the timeout as "low-level" and a mechanism like "link(remote_process)" as "high-level".

pyotrgalois · on Aug 11, 2016

Could you please explain why do you think a timeout would be low level?

abecedarius · on Aug 10, 2016

You want two kinds of abstractions: ones that can only run locally, and ones that may be remote. For the remote kind it should be easy and transparent to run them locally.

In Erlang these are function calls vs. message sending. Some other languages like E also make that kind of distinction. I believe the bad rep of network transparency came from experience trying to cross the line in the other direction: trying to pretend remote services are local.

toast0 · on Aug 11, 2016

> The biggest question I have about this: Is it really a good idea to make local actors and supervisor trees indistinguishable from remote ones?

The erlang code to send a message to a remote pid and a local pid is the same, but that doesn't make them indistinguishable; you can check if the pid is local or not in code. In my experience, I have some expectations of where the thing I'm messaging is going to be when I write the code, and I can act accordingly: better error handling when I expect to talk to a remote node since I can't be sure I'll be able to reach it; whereas if I expect to reach a local pid, I'm more likely to match against success and let it crash if something weird happened.

catnaroek · on Aug 11, 2016

> They were built to handle concurrency down to the core.

And they weren't built to handle data parallelism at all.

rdtsc · on Aug 11, 2016

Not sure I know what you mean by "data parallelism". I wrote code not too long ago which had multiple processes reference the same binary blob. I have enough CPUs to ensure some processes were actually accessing it in parallel. In other case, used an ETS table to have multiple processes read and write data to and from it.

Or maybe to put it another way, if you think I am not already doing data parallelism, sell me on why I need data parallelism.

catnaroek · on Aug 11, 2016

> reference the same binary blob

Is there some way to share data whose structure is actually understood and checked (statically or dynamically) by the language?

> to have multiple processes read and write data to and from it.

How did you coordinate those reads and writes? A writer not having exclusive access to what it's writing to, at the time of the writing... sounds... very bad.

> sell me on why I need data parallelism.

Just to name one example: machine learning. If you're feeding real-time data to a ML algorithm, chances are you'll have to perform multiple independent calculations that use the same data as input. To speed up those calculations, you need to perform them in parallel.

di4na · on Aug 11, 2016

Nothing is. There was an interesting discussion between Simon Peyton Jones and Joe Armstrong on that. https://www.infoq.com/interviews/armstrong-peyton-jones-erla...

In general, the thing is... parallelism is hard. If not impossible to do right from automated tools.

catnaroek · on Aug 11, 2016

First we have to make it safe: make it inexpressible to do parallelism wrong. Only then can we discuss automating it. There is no value whatsoever in automating the wrong thing.

di4na · on Aug 12, 2016

What do you mean by that. Because doing parallelism wrong is easy : do it. Parrallelism mean doing the same thing at the same time. The only way to not do it wrong is to not do it.

catnaroek · on Aug 12, 2016

> Parrallelism mean doing the same thing at the same time.

Yeah, but how to do it without making the program state go kaboom?

> What do you mean by that.

“Make it impossible to write to a data structure without (at least temporary) exclusive access to it.” At the risk of oversimplifying, there are basically three ways to do it:

(0) Use data structures that are immutable after they are first written to. Pros: very easy to understand and use. Cons: limited to pure parallelism, without concurrency.

(1) Clone all data structures when you need to share them. Pros: very easy to understand and use. Cons: not all data structures can be meaningfully cloned, e.g., how do you clone a file descriptor?

(2) Use a static analysis (such as type checking) to enforce the precondition that writers must have exclusive access at the time of the writing. Pros: can handle both parallelism and concurrency. Cons: requires the ability to write code that a static analysis tool can understand.

Since the last one is by far the most rare approach, let me give an example of it. Quoting the Rust book:

> Here’s the rules about borrowing in Rust:

> First, any borrow must last for a scope no greater than that of the owner. Second, you may have one or the other of these two kinds of borrows, but not both at the same time:

> - one or more references (&T) to a resource,

> - exactly one mutable reference (&mut T).

https://doc.rust-lang.org/stable/book/references-and-borrowi...

di4na · on Aug 17, 2016

Sure but 1 and 2 end up being the same. Because when you reach the limits you clone.

The thing with parallelism is not reading. It is writing. So the best way ... is to have isolated computation then reconnect them at the end... and yes it is hard..

But parallelism has few interest in my view. Concurrency is far more interested.

catnaroek · on Aug 18, 2016

> Sure but 1 and 2 end up being the same. Because when you reach the limits you clone.

You mean (0) and (1)? Yes, deep inside, they're the one and the same. (2) is qualitatively different from the other two.

> So the best way ... is to have isolated computation then reconnect them at the end...

What I'm saying is that Erlang-style complete process isolation is too heavy-handed and imposes unacceptable overhead in many cases. It's better to have fine-grained control over what you isolate and what you share, as in Rust.

> and yes it is hard..

It isn't easy, but rejecting large classes of errors automatically helps.

> But parallelism has few interest in my view. Concurrency is far more interested.

Besides writing correct programs in the face of shared resource usage, another of the main points to concurrency control is to safely parallelize the parts of your program that don't compete with each other for resources.

cryptica · on Aug 10, 2016

Highly scalable systems have to be designed in a particular way; it's an architectural concern - Your choice of language might make it somewhat easier to implement a highly parallel architecture, but the language itself will never make it 'easy' - Languages will always give you just enough rope to hang yourself (that is the cost of flexibility).

You could design a framework which is extendable and scalable in such a way that developers who want to write code on top of that framework don't need to think (much) about parallelism (see https://github.com/socketcluster/socketcluster#introducing-s...).

Unfortunately, you cannot build a highly scalable system without enforcing some rigid constraints. General purpose programming languages do not enforce restrictions on what design patterns you can or cannot use use; that is the role of a framework.

That said, frameworks can never fully hide the complexity of parallel systems (not for all use cases) but at least they can guide you to the best approach when solving specific problems.

Retric · on Aug 10, 2016

One of the problems is how far you can get with just one cheap server and then one not so cheap server. 0 to 100,000 users and you can just ignore this stuff. Then one day you wake up, the world is on fire and there are no simple solutions.

At the same time trying to hit 100,000 users on a distributed platform is much harder.

chii · on Aug 11, 2016

When you do a projection of user growth, you'd be able to anticipate the curve and prep ahead. Those caught off guard are either such a runaway success that it doesn't matter, or don't do the growth projection and deserved the fail.

dorfsmay · on Aug 11, 2016

Projections are hard.

Sometimes one successful marketing campaign after 10 or 100 failures will me you go from 10s to several 100,000s users in A very short period of time. Not enough time to rearchitecture.

RonanTheGrey · on Aug 10, 2016

A big part of the issue is that distributed computing is hard to think about, for the same reasons that multithreading are hard to think about: the single threaded model (either on a local machine or amongst a set of machines via something like RPC) is much easier to think about most of the time, even if it's the wrong overall solution. Conway's Law only kicks in when organizational issues FORCE distribution, and people are forced to think in a distributed way.

It's an entirely different way of thinking about a problem, based entirely on giving up control.

IgorPartola · on Aug 10, 2016

Because it is not a tool or a language problem but a network reliability problem?

coldtea · on Aug 10, 2016

Nope, that's not it.

For one, scaling and parallelism are hard even for programs not depending on the network (in the sense you mean, e.g. being web/internet services) at all (number crunching, simulations, video processing, etc).

Second, algorithms can be inherently hard (or even mathematically) impossible to parallelize by themselves -- no network or even hardware required.

stcredzero · on Aug 10, 2016

Also, our current hardware has no good mechanism for low overhead/latency communication between threads running on different cores. Instead, HFT programmers have to hack something together like Disruptor and insert unused arrays of longint to prevent false sharing.

zzzcpan · on Aug 10, 2016

I think the hardware is ok, it's just that most programmers are not that good at writing software on that level. Running an event loop per core and amortizing synchronization is not a rocket science.

stcredzero · on Aug 10, 2016

I think the hardware is ok, it's just that most programmers are not that good at writing software on that level.

If it becomes the case that the hardware has changed in such a way that most programmers now have a hard time writing code that compiles efficiently to it, the hardware has changed in the wrong way! Well, that is the case over the past few decades! Hardware has changed in the wrong way.

di4na · on Aug 11, 2016

Or the compilers we use did not evolved... which mean language have not evolved...

stcredzero · on Aug 11, 2016

I read it as closer to "640k ought to be enough for anyone" than simply having features in a language/compiler. C'mon, if you're inserting unused stuff into structures to get stuff on separate cache lines, something is a little too hacky here!

CyberDildonics · on Aug 10, 2016

> Also, our current hardware has no good mechanism for low overhead/latency communication between threads running on different cores

What is wrong with shared memory? Also if HFT programmers can do it, wouldn't that imply that the hardware can do it?

millstone · on Aug 10, 2016

Here's an illustration; this may be somewhat dated (not sure how this works with e.g. shared L3 caches) but it should give you a flavor of why shared memory sucks for messaging.

Consider the simplest example of a busy-loop, which is likely close to the fastest you can get. Core1 sends a message to Core2 by writing to an address that Core2 is repeatedly reading from.

Core1 first has to gain exclusive access to the address, by broadcasting an invalidate on the bus. Core2 thus discards its cache line. Next Core1 writes to the address by modifying the data in its L1 cache (or store buffer or worse). Core2 then reads from the address. Since Core1 holds the modified cache line, it is required to snoop the read. Core1 tells Core2 to wait and then retry. This happens repeatedly until Core1's write lands in main memory. Now Core2 can read the memory.

So:

1. Messaging requires a round trip to memory, a bunch of cache line invalidation, and other nonsense. Messaging involves multiple transitions to and from shared state.

2. Your CPUs already have a very fast message bus to implement MESI, but it is unavailable to you as a client programmer.

It should be obvious from this that we could build a much faster CPU messaging architecture if we cared to.

pcwalton · on Aug 11, 2016

x86 CPUs use MESIF or MOESI, not MESI. Those protocols don't require round trips to main memory.

infinite8s · on Aug 10, 2016

They can do it only by understanding the limitations of the underlying hardware - ie slots as big as a cache line to avoid false sharing, power of 2 ring buffer size to for fast slot calculation, etc.

pcwalton · on Aug 10, 2016

> Also, our current hardware has no good mechanism for low overhead/latency communication between threads running on different cores.

Shared memory is extremely fast on modern x86.

hyperpape · on Aug 10, 2016

Is millstone wrong, or are y'all just using different definitions? https://news.ycombinator.com/item?id=12264017

Jweb_Guru · on Aug 11, 2016

Millstone is wrong (and pcwalton explained why).

marcosdumay · on Aug 10, 2016

It is also full of synchronization problems. Both for the layer above and bellow it.

Hardware message interfaces would go a good deal into improving the situation.

jandrese · on Aug 10, 2016

It comes down to cache invalidation, which is one of the Hard Problems of computer science.

marcosdumay · on Aug 10, 2016

But it's much better if you only have to deal with it in software, only when it's strictly needed.

Currently, we just turn any random problem into a cache invalidation one, and hope it makes our life easier.

njharman · on Aug 10, 2016

Is it really that hard? Tons of companies do it. LAMP stack has been scalable for couple decades or so. There are platforms and services that you can rent/buy if you don't wanna role your own. I can get a free databricks account and go to town with pyspark there's even free courses on it.

Threading is hard and I'm sure will always be hard. But threading is the wrong way to scale or parallelise.

pcwalton · on Aug 10, 2016

> But threading is the wrong way to scale or parallelise.

Depends on your domain. Multicore shared-memory parallelism is still king for many use cases, with no realistic alternative in sight.

You heavily depended on threading to post this comment (because your GPU used it, along with vector instructions, to render the page).

hbogert · on Aug 10, 2016

LAMP scalable for a couple of decades? PHP is hardly usuable for two decades, let alone scalable since that very beginning. Same for Mysql. Did you actually ever try to scale Mysql? Not trivial. Did you every try to scale webserver with PHP where sessions are being used? Not trivial.

njharman · on Aug 12, 2016

P is Perl PHP Python. The other letters have various incarnations. Point being horizontal scale is known, and not difficult solution for many, many domains. Starting a long time ago.

Thaxll · on Aug 10, 2016

Yes it's called sharding for MySQL.

prodigal_erik · on Aug 10, 2016

The query engine doesn't support sharding. You're starting with a straightforward relational database and ending up with a bunch of fragments of relations that must only be used with great caution because most joins quietly give incorrect results.

duaneb · on Aug 10, 2016

If you dislike joins and love data inconsistencies, go for it.

kod · on Aug 10, 2016

Joins are still possible in sharded relational models. I wouldn't want to roll my own on mysql, but there are plenty of data stores that use this model (vertica, citus, etc)

duaneb · on Aug 10, 2016

> Joins are still possible in sharded relational models.

For some definition of joins, this is true. For a SQL-oriented version, I don't think it's possible, let alone easy, to run arbitrary joins across sharded data. Can you write subqueries that cross shard boundaries? How about aggregate functions? I would be (very pleasantly) surprised if this were possible outside of directly calling the aggregation/join.

> I wouldn't want to roll my own on mysql, but there are plenty of data stores that use this model (vertica, citus, etc)

Yes, and none of them have succeed at scaling mysql without sacrificing heavily in functionality. I don't believe foreign keys are supported cross-shard anywhere. I'm also pretty sure the shard schemes heavily dictate which operations are efficient and even possible, and which are not, so you really need to design around scaling to begin with anyway. So—yes, you can scale MySQL, but avoid using any of the features that make MySQL worth using, except intrashard.

kod · on Aug 10, 2016

None of the stores I mentioned use mysql. I'm not talking about scaling mysql, I'm talking about scaling relational databases, specifically the comment about joins.

Joins arent a big deal, just join on (at least) the shard key, or for smaller hash joins ship the small side of the join.

If you have multiple kinds of joins keep multiple copies with different shard keys.

Yes, (monoidal) aggregations are possible with this, yes subqueries are possible with this.

Thaxll · on Aug 10, 2016

It's the method that largest companies are using, it's not as bad as it sounds.

duaneb · on Aug 10, 2016

> It's the method that largest companies are using

....in addition to other data stores. I don't know of any company exclusively using mysql; I hear that billing was migrated onto a sharded mysql cluster much more frequently due to the inherently transactional nature of billing operations and queries.

Jtsummers · on Aug 10, 2016

This depends on your definition of scaling.

You're creating (relatively) heavyweight servers to handle multiple instances and hiding them behind a reverse proxy.

And this is for a very specific use-case. Not everything is HTTP, and not everything HTTP is handled (well) by a LAMP stack.

Certainly, as pcwalton points out, a LAMP stack will not handle scalability of your graphics (while hitting desired performance characteristics, at least). A LAMP stack is unlikely to be appropriate to the sort of scaling that map-reduce can perform. A LAMP stack won't handle, in the case of my office, handling an arbitrary number of connecting radios.

  ===EDIT===

> Threading is hard and I'm sure will always be hard. But threading is the wrong way to scale or parallelise.

Threading is an implementation of concurrency/parallelism. Spawning multiple servers behind a reverse proxy, like you seem to be suggesting, is one method for handling scaling issues.

But the models used to make effective use of threaded concurrency (actor model, CSP, even shared memory done well), also work in designing larger systems that need to scale beyond a single computer/CPU.

Jweb_Guru · on Aug 10, 2016

On the contrary, scaling and parallelism are much easier on a multicore machine (possibly with a NIC queue per socket) because most nontrivial workloads are ultimately communication bound. Single load balancers can handle most real sites' traffic and direct them to different cores, and for those concerned about (for example) DDOS attempts, services like Cloudflare will allow you to combine IP anycast with a reverse proxy to avoid this problem far from your machine.

For a large percentage of use cases, the only reason to have multiple machines at all is for fault tolerance, and for most of the others, it's just not being able to fit all your data in memory (increasingly rare). That said, modern networking has gotten good enough that in a really high-end datacenter you can sometimes still get increased performance out of the network (using RDMA, etc.), but you have to really work hard for it (usually bypassing the OS networking stack).