
It searches fewer positions because it decides where to search using 4 TPUs, which are 180 teraflops each according to Google.


That's not quite right: each (second-generation) TPU is ~45 TFLOPS of loosely specified FP16; a single board consists of 4 TPUs for 180 TFLOPS total. That's similar to the dual-P100 NVLinked Quadro, which is an absolutely killer HPC/DL card. I believe they have a similar Volta option, but that kind of HW is above my pay grade these days.

Further, they used 5,000 first-generation TPUs at 90 INT8 TOPS each (page 4) to run the network during MCTS, and 64 second-generation TPUs to train the thing, according to the methods section. That's a nice mix: INT8 for inference and FP16-ish for training, IMO.
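For a sense of aggregate scale, here's the back-of-the-envelope math those figures imply (a rough sketch; the per-chip numbers are just the ones quoted above, and I'm assuming "64 TPUs" means individual second-gen chips rather than 4-chip boards):

    # Rough aggregate throughput implied by the figures above (assumed per-chip specs).
    inference_tpus = 5000          # first-gen TPUs running the net during MCTS
    tops_per_gen1_tpu = 90         # INT8 TOPS per first-gen TPU
    training_tpus = 64             # second-gen TPUs used for training (assumed: chips, not boards)
    tflops_per_gen2_tpu = 45       # FP16-ish TFLOPS per second-gen TPU chip

    inference_tops = inference_tpus * tops_per_gen1_tpu      # ~450,000 INT8 TOPS
    training_tflops = training_tpus * tflops_per_gen2_tpu    # ~2,880 FP16 TFLOPS

    print(f"Self-play inference: ~{inference_tops:,} INT8 TOPS")
    print(f"Training: ~{training_tflops:,} FP16 TFLOPS")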

In contrast, I personally own 8 GTX Titan XP class GPUs and 8 more GTX Titan XM GPUs across 4 desktops in my home network. I'd love to experiment with algorithms like this, but I suspect I'd get just about nowhere due to insufficient sampling. These algorithms are insanely inefficient at sampling at the beginning. So I guess I will seed the network with expert training data to see if that speeds things up.

That said, more brilliant work from David Silver's group! But not all of us have 5,000 TPUs/GPUs just sitting around, so there's still a lot more work/research needed to make this accessible to less sexy problems.


It's definitely worth a shot reproducing the results.

On the other hand, Google will make a shit load of money when they make TPUs available on gcloud. Papers like this are great marketing for them.


I wonder how much the hardware would cost to rent for a researcher not working at Google?


So 3 second-generation TPUs ~= 1 Volta-class GPU ~= $3 per hour on-demand on AWS (https://aws.amazon.com/ec2/pricing/on-demand/), or ~$1 per hour in spot (75 cents per GPU at the moment with a p3.8xlarge and its 4 GPUs, https://aws.amazon.com/ec2/spot/pricing/) if you take the time to build a robust framework.

And to keep things simple, let's do it all in FP16, because INT8 on Volta ~= 1/2 of a first-generation TPU, while FP16 on Volta ~= 3 first-generation TPUs at INT8 (sad, right?), an accident that occurred because P100 didn't support INT8 but the consumer variants did.

So, 5,064 TPUs (5,000 for inference + 64 for training) / 3 = 1,688 Volta GPUs ~= $5,000 per hour on-demand, probably half that reserved, and a quarter of that in spot.

Say you need a week to train this, so $200K-$800K...

You can buy DGX-1Vs off-label for about $75K. Say one costs $20K annually to host and you use it for 3 years, so the TCO is ~$135K, which comes out to roughly $0.64 per GPU-hour across its 8 V100s.

Conclusion: p3.8xl spot instances are currently a steal! But I don't have ~$200K burning a hole in my pocket, so I guess I'm out of luck.
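A quick sketch of how that arithmetic works out, in case anyone wants to sanity-check it (prices are the snapshots quoted above and will drift; the per-GPU spot price is an assumption taken from the p3.8xlarge figure):

    # Back-of-the-envelope reproduction of the numbers above (prices are snapshots, will drift).
    tpus = 5000 + 64                 # inference + training TPUs from the paper
    gpus = tpus / 3                  # assuming ~3 gen-2 TPUs per Volta-class GPU -> ~1,688

    on_demand = 3.00                 # $/GPU-hour on-demand (p3-class, assumed)
    spot = 0.75                      # $/GPU-hour spot at the time of writing (assumed)

    week_hours = 7 * 24
    print(f"One week on-demand: ~${gpus * on_demand * week_hours:,.0f}")   # ~$850K
    print(f"One week spot:      ~${gpus * spot * week_hours:,.0f}")        # ~$213K

    # Owned DGX-1V TCO: $75K purchase + $20K/year hosting over 3 years, 8 V100s per box.
    tco = 75_000 + 3 * 20_000
    gpu_hours = 3 * 365 * 24 * 8
    print(f"DGX-1V: ~${tco / gpu_hours:.2f} per GPU-hour")                 # ~$0.64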


Google provides 1,000 TPUs free of charge to researchers: https://www.tensorflow.org/tfrc/


I don't think that the specific numbers are relevant for what deepnotderp and I were saying: that Giraffe already demonstrated the potential, and all that was missing was a boatload of compute.


P100 is ~20 FP16 TFLOPS and V100 is ~30, so 4 gen-2 TPUs (180 TFLOPS) ~= 9 P100s or 6 V100s.
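A quick sanity check on those ratios (a sketch using the rough TFLOPS figures above, not exact specs):

    # Ratio check using the rough TFLOPS figures quoted above.
    tpu_board_tflops = 180      # 4 gen-2 TPUs per board
    p100_fp16_tflops = 20
    v100_fp16_tflops = 30

    print(tpu_board_tflops / p100_fp16_tflops)   # 9.0 -> ~9 P100s
    print(tpu_board_tflops / v100_fp16_tflops)   # 6.0 -> ~6 V100s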


TPUs aren't "cheating" though, as they can be used for generalized machine learning models, and not just Go.

Computer graphics is still an impressive achievement even when done on a GPU instead of a CPU.


I think his point is that if you devote X FLOPS to something, then a fair comparison would be to give the competitor X FLOPS as well. The specifics of how an algorithm works don't matter as much as the total resources used and the outcome.


A more fair comparison would be to cap the hardware used at a certain cost. That's much more reflective of the real world. There are plenty of tasks that perhaps you could do more efficiently on a CPU for a given number of operations, e.g. maybe some graphics operations, but in practice it's completely irrelevant because a GPU gives so much more performance for the given cost. There's nothing special about an operation, but dollars do matter.


Only if you're buying hardware based on the algorithm used. Useful chess programs need to actually run on people's phones, where performance on a cluster of ASICs is mostly meaningless.


I think we need to start capping total electricity and total $$$. I'd love to see AlphaZero 20W pitted against that other 20W supercomputer. When humans fall to that, be afraid(tm).

I'll even be charitable in order to simulate the existence of school/teachers/books: training from the start gets 2KW. But gameplay still gets capped to 20W.


Electricity isn't free though; why can't it simply be rolled into cost? Just assign it a standard cost per kW-hr and charge accordingly. This more accurately reflects economic incentives driving hardware development.
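For example, folding electricity into an hourly figure is a one-liner (the wattage and $/kWh rate below are purely illustrative assumptions):

    # Folding electricity into an hourly cost (wattage and rate are illustrative assumptions).
    power_kw = 0.3          # e.g. a ~300 W accelerator under load (assumed)
    rate_per_kwh = 0.12     # assumed standard $/kWh

    energy_cost_per_hour = power_kw * rate_per_kwh
    print(f"~${energy_cost_per_hour:.3f}/hour of electricity")   # ~$0.036/hour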


Sure, why not, but then how do we compare to a human?


I don't think you can, and such a comparison is not really needed here anyway. People are not chattel slaves and cannot be racked into data centers to solve boring problems.

Of course, you can hire people, and that has a well-defined cost, so it does all come down to money again.


Sure, but the whole point of the above idea is to compare our 20W computers to what we can build that eats 20W. And don't give Silicon Valley ideas about disrupting the lucrative Mechanical Turk ecosystem by scaling it up with ideas borrowed from growing veal because some VC sociopath will take it seriously. Just sayin'...


And I'm saying that this 20W limitation isn't particularly meaningful, as many organizations have way more power at their disposal to throw at a problem than that. The economics of a given solution, on the other hand, is applicable at all scales.


Meaningful in the sense that when an AI plays against humans, the question is whether it's smarter at the same energy efficiency as a human.

We are comparing machine intelligence vs human intelligence.

It can be said that with more computational power, you can raise intelligence. Human brains consume more power relative to body size than any other animal's.


This argument is about state-of-the-art chess, not chess as a mobile phone game. Humans are so bad at chess compared to the best programs now that even a smartphone app can't be defeated by people.

Also, mobile phones have Internet access, so there's no reason the algorithm has to run on the phone itself. It could run on TPUs in the cloud. It's common for many games to have server-side components. Though this isn't even necessary except maybe if Magnus Carlsen wants to play it.


I think you misunderstood. Sure, if you are willing to deal with the increased costs and lowered reliability you could write a chess program that required massive server resources.

But I don't think a lot of people would pay for that vs. having a program that just runs on their phone and still beats them. So, in practice, without a significant subscription fee you are going to be limited to cellphone hardware.

PS: In practice, most games take about as much computing power from a server as a chat app, because companies need to pay for that hardware. Remember, 1,000,000+ times X gets big unless you keep X very low.


Again, this entire article and discussion is about state-of-the-art chess. As in, literally working to "solve" the game and develop optimal strategy. I don't understand what relevance casual mobile chess games have. Computer chess is already very far beyond human capabilities, and it can't be pressed further just using mobile phone hardware (nor is that a reasonable restriction).

It'd be like in a discussion about SpaceX's BFR designs to colonize Mars, someone comes in and questions why they're using retropropulsion since the requisite control systems are infeasibly expensive for amateur model rockets. It's a completely different discussion.


That's not why this is relevant. Given equivalent hardware, it's still a worse solution for chess. The value is that you can get results of similar quality, using vastly more compute power, even without 1,000+ years of human analysis.

Otherwise the only takeaway is this failed to improve the state of the art.


"Equivalent hardware" is only relevant if we're talking about cost. When measured by that metric, the TPUs are indeed superior. Raw operations is an irrelevant metric given the existence of economic purpose-specific hardware that can perform a lot more of the operations required for matrix multiplication than for general computation. GPUs work exactly the same.


Again, cost is relative to the hardware you have. If you already own a supercomputer and you want to run chess on it for whatever reason, what matters is the performance you get from each algorithm on that hardware. If you're going to buy new hardware, its design depends on performance across every algorithm you expect to use.

So, the only case where chess performance per $ matters is if you are only ever going to use that hardware to run chess. In every other case, which is the vast majority of the time, you care about different metrics.


Why? Just do the computation in the cloud.


Several phones already have neural net acceleration hardware in them today, including the latest iPhones.


Some iPhones are manufactured with this, but again if you have paid for the hardware you care about performance on that hardware. If you have yet to buy anything then theoretical performance per $ becomes the meaningful metric.


Same with the Pixel 2. But the Pixel 2 appears to be a bit more powerful than the iPhone's neural chip. The PVC (Pixel Visual Core) can do 3 TOPS, but we really need to know the supported instructions and word sizes to truly compare.


But better evaluation gives you asymptotic speedups. You can give Stockfish several times its computation (which is already a lot, I mean, 64 threads, come on) and it doesn't make good use of it since it just runs into the search wall. If you gave Stockfish the equivalent in CPU power (and I'm not sure this is a fair hypothetical since part of the appeal of NNs is that they have such efficient hardware implementations, so it seems unfair to then grant a less efficient algorithm equivalent computing power by fiat), I'm not sure it would be restored to parity or superiority.


Absolutely. This required an exorbitant amount of compute, but DeepMind had to do novel, nontrivial research to make use of those resources.


Edit: DeepMind's victory over Stockfish didn't need novel research. Giraffe already demonstrated that the asymptotic speedup was possible; it just needed more compute.


The number of positions evaluated is the number evaluated. Speed doesn’t change that.

Speed probably made the initial self play training quicker though.


Compute absolutely matters. With tree search, there's a tradeoff between scoring cost and positions evaluated. AlphaZero can evaluate fewer positions because it uses a huge amount of compute to accurately score each position.

It's not just training. Training used 5,000 TPUs.
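To make the tradeoff concrete, here's a toy calculation (the node rates are illustrative figures in the rough ballpark reported for the two engines, not exact numbers from the paper):

    # Toy illustration of the scoring-cost vs. positions-evaluated tradeoff.
    # Node rates are illustrative, roughly the ballpark reported for the two engines.
    time_budget_s = 60.0                 # thinking time per move (assumed)

    stockfish_nodes_per_s = 70_000_000   # cheap handcrafted eval, enormous search
    alphazero_nodes_per_s = 80_000       # expensive NN eval, tiny but well-guided search

    print(f"Stockfish: ~{stockfish_nodes_per_s * time_budget_s:,.0f} positions/move")
    print(f"AlphaZero: ~{alphazero_nodes_per_s * time_budget_s:,.0f} positions/move")
    # The point: evaluating ~1000x fewer positions only works because each
    # evaluation is far more accurate (and far more expensive to compute).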


Can't those teraflops be applied to evaluating more positions instead of deciding which positions to evaluate?

It seems that the metric should be compute time, not positions evaluated.


Presumably they both had equal clock time - that's a standard chess rule, so it would be surprising if it were different.


Wall clock != CPU clock.

I can do more in the same wall time with faster CPUs; I can afford inefficiencies that the opponent cannot and still accomplish just as much.



