That's not clear: each (second-generation) TPU is about 45 TFLOPS of loosely specified FP16-ish compute. A single board consists of 4 TPUs, 180 TFLOPS total. That's similar to the dual-P100 NVLinked Quadro, which is an absolutely killer HPC/DL card. I believe they have a similar Volta option, but that kind of HW is above my pay grade these days.
Further, they used 5,000 (first-generation) TPUs at ~90 INT8 TOPS each (page 4) to run the network during MCTS, and 64 (second-generation) TPUs to train the thing, according to the methods. That's a nice mix of INT8 for inference and FP16-ish for training, IMO.
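For a sense of scale, here are the totals implied by those figures (they inherit whatever slop the per-chip numbers above have):

    # Back-of-envelope totals from the figures quoted above (approximate).
    TPU_V1_INT8_TOPS = 90       # first-gen TPU, inference
    TPU_V2_FP16_TFLOPS = 45     # second-gen TPU chip, training

    inference_fleet = 5_000 * TPU_V1_INT8_TOPS    # 450,000 INT8 TOPS for MCTS
    training_fleet = 64 * TPU_V2_FP16_TFLOPS      # 2,880 FP16-ish TFLOPS for training

    print(f"inference: ~{inference_fleet:,} INT8 TOPS, training: ~{training_fleet:,} TFLOPS")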
In contrast, I personally own 8 Titan Xp-class GPUs and 8 more Titan X (Maxwell) GPUs across 4 desktops in my home network. I'd love to experiment with algorithms like this, but I suspect I'd get just about nowhere due to insufficient sampling; these algorithms are insanely sample-inefficient at the beginning. So I guess I'll seed the network with expert training data to see if that speeds things up.
That said, more brilliant work from David Silver's group! But not all of us have 5,000 TPUs/GPUs just sitting around, so there's still a lot more work/research needed to make this accessible to less sexy problems.
And to keep things simple, let's do it all in FP16, because Volta INT8 is only about half a first-generation TPU, while Volta FP16 is roughly three first-generation TPUs' worth of INT8 (sad, right?), an accident that happened because P100 didn't support INT8 but the consumer variants did.
So 5,064 / 3 = 1,688 Volta GPUs, roughly $5,000 per hour on-demand, probably half that reserved, and a quarter of that on spot.
Say you need a week to train this, so $200K-$800K...
You can buy DGX-1Vs off-label for about $75K. Say one costs $20K annually to host, and say you use it for 3 years, so total TCO is ~$135K, which works out to about $0.64 per GPU-hour (8 V100s per box).
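For anyone who wants to poke at the assumptions, here's the same arithmetic in a few lines of Python (every price is the rough figure quoted above, not a current market rate):

    # Cloud route: replace 5,064 first-gen TPUs with Volta GPUs at ~3 TPUs per GPU.
    volta_gpus = 5_064 / 3                    # ~1,688 GPUs
    fleet_per_hour = 5_000                    # $/hour for the whole fleet, on-demand
    hours = 24 * 7                            # one week of training
    weekly = {
        "on-demand": fleet_per_hour * hours,        # ~$840K
        "reserved":  fleet_per_hour / 2 * hours,    # ~$420K
        "spot":      fleet_per_hour / 4 * hours,    # ~$210K
    }

    # Own-hardware route: DGX-1V at ~$75K plus ~$20K/year hosting, used for 3 years.
    tco = 75_000 + 3 * 20_000                 # ~$135K per box
    gpu_hours = 3 * 365 * 24 * 8              # 8 V100s per box
    print(round(volta_gpus), weekly, round(tco / gpu_hours, 2))   # ... 0.64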
Conclusion: p3.8xl spot instances are currently a steal! But I don't have ~$200K burning a hole in my pocket, so I guess I'm out of luck.
I don't think that the specific numbers are relevant for what deepnotderp and I were saying: that Giraffe already demonstrated the potential, and all that was missing was a boatload of compute.
I think his point is that if you devote X FLOPS to something, then a fair comparison would also give X FLOPS to the competitor. The specifics of how an algorithm works don't matter as much as the total resources used and the outcome.
A fairer comparison would be to cap the hardware used at a certain cost. That's much more reflective of the real world. There are plenty of tasks that perhaps you could do more efficiently on a CPU for a given number of operations, e.g. maybe some graphics operations, but in practice it's completely irrelevant because a GPU gives so much more performance for the given cost. There's nothing special about an operation, but dollars do matter.
Only if you're buying hardware based on the algorithm used. Useful chess programs need to actually run on people's phones, where performance on a cluster of ASICs is mostly meaningless.
I think we need to start capping total electricity and total $$$. I'd love to see a 20 W AlphaZero pitted against that other 20 W supercomputer, the human brain. When humans fall to that, be afraid(tm).
I'll even be charitable, to simulate the existence of schools/teachers/books: training from the start gets 2 kW. But gameplay still gets capped at 20 W.
Electricity isn't free though; why can't it simply be rolled into cost? Just assign it a standard cost per kWh and charge accordingly. This more accurately reflects the economic incentives driving hardware development.
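A minimal sketch of what "roll it into cost" looks like; the $/kWh rate is just an assumed placeholder:

    # Rolling electricity into a single dollar figure, as suggested above.
    def energy_cost_usd(power_watts: float, hours: float, usd_per_kwh: float = 0.10) -> float:
        """Dollar cost of running a load of power_watts for hours, at an assumed rate."""
        return power_watts / 1000.0 * hours * usd_per_kwh

    # A 20 W "brain budget" vs a hypothetical 2 kW training budget, for one week:
    print(energy_cost_usd(20, 24 * 7))     # ~$0.34
    print(energy_cost_usd(2000, 24 * 7))   # ~$33.60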
I don't think you can, and such a comparison is not really needed here anyway. People are not chattel slaves and cannot be racked into data centers to solve boring problems.
Of course, you can hire people, and that has a well-defined cost, so it does all come down to money again.
Sure, but the whole point of the above idea is to compare our 20W computers to what we can build that eats 20W. And don't give Silicon Valley ideas about disrupting the lucrative Mechanical Turk ecosystem by scaling it up with ideas borrowed from growing veal because some VC sociopath will take it seriously. Just sayin'...
And I'm saying that this 20W limitation isn't particularly meaningful, as many organizations have way more power at their disposal to throw at a problem than that. The economics of a given solution, on the other hand, is applicable at all scales.
Meaningful in the sense that, if an AI plays against humans, is it smarter at the same energy budget as a human?
We are comparing machine intelligence vs human intelligence.
It can be said that with more computational power you can raise intelligence. The human brain consumes more power relative to body size than that of any other animal.
This argument is about state-of-the-art chess, not chess as a mobile-phone game. Humans are now so bad at chess compared to the best programs that people can't even beat a smartphone app.
Also, mobile phones have Internet access, so there's no reason the algorithm has to run on the phone itself. It could run on TPUs in the cloud. It's common for many games to have server-side components. Though this isn't even necessary except maybe if Magnus Carlsen wants to play it.
I think you misunderstood. Sure, if you are willing to deal with the increased costs and lowered reliability you could write a chess program that required massive server resources.
But I don't think a lot of people would pay for that versus a program that just runs on their phone and still beats them. So in practice, without a significant subscription fee, you are going to be limited to cellphone hardware.
PS: In practice, most games demand about as little server-side computing power as a chat app, because companies have to pay for that hardware. Remember, 1,000,000+ users × X gets big unless you keep X very low.
Again, this entire article and discussion is about state-of-the-art chess: literally working to "solve" the game and develop optimal strategy. I don't understand what relevance casual mobile chess games have. Computer chess is already far beyond human capability, and it can't be pushed further using only mobile-phone hardware (nor is that a reasonable restriction).
It'd be like in a discussion about SpaceX's BFR designs to colonize Mars, someone comes in and questions why they're using retropropulsion since the requisite control systems are infeasibly expensive for amateur model rockets. It's a completely different discussion.
That's not why this is relevant. Given equivalent hardware, it's still a worse solution for chess. The value is that, by throwing vastly more compute at the problem, you can get results of similar quality without 1,000+ years of accumulated analysis.
Otherwise the only takeaway is that this failed to improve the state of the art.
"Equivalent hardware" is only relevant if we're talking about cost. When measured by that metric, the TPUs are indeed superior. Raw operations is an irrelevant metric given the existence of economic purpose-specific hardware that can perform a lot more of the operations required for matrix multiplication than for general computation. GPUs work exactly the same.
Again, cost is relative to the hardware you have. If you already own a supercomputer and you want to run chess on it for whatever reason, what matters is the performance you get from each algorithm on that hardware. If you're going to buy new hardware, its design depends on performance across every algorithm you expect to run.
So the only case where chess performance per $ matters is if you are only ever going to use that hardware to run chess. In every other case, which is the vast majority of the time, you care about different metrics.
Some iPhones are manufactured with this, but again, if you have already paid for the hardware, you care about performance on that hardware. If you have yet to buy anything, then theoretical performance per $ becomes the meaningful metric.
Same with the Pixel 2, though the Pixel 2 appears to be a bit more powerful than the iPhone's neural chip. The Pixel Visual Core (PVC) can do about 3 TOPS, but we really need to know the supported instructions and word size to truly compare.
But better evaluation gives you asymptotic speedups. You can give Stockfish several times its computation (which is already a lot; I mean, 64 threads, come on) and it doesn't make good use of it, since it just runs into the search wall. And if you gave Stockfish the equivalent in CPU power (I'm not sure that's even a fair hypothetical, since part of the appeal of NNs is that they have such efficient hardware implementations, so it seems unfair to then grant a less efficient algorithm equivalent computing power by fiat), I'm not sure it would be restored to parity or superiority.
Edit: DeepMind's victory over Stockfish didn't need novel research. Giraffe already demonstrated that the asymptotic speedup was possible; it just needed more compute.
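To make the "search wall" point concrete, here's a toy model of how little extra depth more compute buys a fixed searcher; the effective branching factor is an assumed illustrative value, not a measured one:

    # Rough model: nodes searched ~ b_eff ** depth, so reachable depth grows only
    # logarithmically with compute. b_eff = 2 is an illustrative guess for a
    # heavily pruning engine, not a measured figure.
    import math

    def extra_depth(compute_multiplier: float, b_eff: float = 2.0) -> float:
        """Extra plies reachable when total nodes searched grow by compute_multiplier."""
        return math.log(compute_multiplier) / math.log(b_eff)

    for k in (2, 4, 8, 64):
        print(f"{k}x compute -> ~{extra_depth(k):.1f} extra plies")
    # Doubling compute buys about one ply; even 64x buys only ~6. That is the
    # "search wall": returns diminish logarithmically.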
Compute absolutely matters. With tree search, there's a tradeoff between scoring cost and positions evaluated. AlphaZero can evaluate fewer positions because it uses a huge amount of compute to accurately score each position.
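A minimal sketch of that tradeoff; the budget and per-evaluation costs below are made-up placeholders chosen only to show the shape of the curve, not figures from the paper:

    # With a fixed compute budget, positions searched per second is inversely
    # proportional to the cost of scoring a single position.
    BUDGET_OPS_PER_SEC = 1e14   # pretend total ops/s available to the engine

    engines = {
        # name: ops spent scoring one position (hypothetical)
        "handcrafted eval (Stockfish-style)": 1e6,
        "deep-NN eval (AlphaZero-style)": 1e9,
    }

    for name, ops_per_eval in engines.items():
        positions_per_sec = BUDGET_OPS_PER_SEC / ops_per_eval
        print(f"{name}: ~{positions_per_sec:,.0f} positions/s")
    # The NN engine looks at ~1000x fewer positions per second, so it only wins
    # if each far more expensive evaluation is correspondingly more informative.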