Sometimes it's a huge advantage. I wrote a network search engine. On a single 1T...

tambourine_man · on May 4, 2018

That’s fascinating. I guess you probably can’t open source it, but I’d love to read a blog post about it.

MichaelGG · on May 8, 2018

Here's a version https://github.com/michaelgg/cidb -- Just some of the raw integer k-v storage part. It assumes you already have the hashed entries (you truncate them and the compression takes it from there). It is really what you should expect more from a college course IR project but since I never went to school... oh well.

I used this same library to encode telephone porting (LNP) instructions. That is a database of about 600M entries, mapping one phone number to another. With a bit of manipulation when creating the file, you go from 12GB+ naive encoding as strings (one client was using nearly 50GB after expanding it to a hashtable) to under a GB. Still better than any RMDBS can do and small enough to easily toss this in-RAM on every routing box.

Some day I'd like to write it in Rust and implement vectorized encoding and more compression schemes. Like an optimized SSTable just for integers.

CyberDildonics · on May 4, 2018

I'm going to go out on a limb and guess that it would have been cheaper to upgrade the hardware.

MichaelGG · on May 4, 2018

Depends on scale. At higher end, it was near impossible to scale when you're e.g. inserting a MySQL row per packet. But maybe good enough for a viable business. I would probably try to take it as far as possible on Elastic if I were to write it today.

Same thing if you read the Dremel paper. Worrying about bits helps when scaling.

fizx · on May 4, 2018

Because Lucene wasn't good at near-realtime in 2009 or so, Twitter's original (acquired via summize) search was written in mysql. It might have even been a row for every token, not quite sure.

IIRC, when we moved to a highly-customized-Lucene-based system in 2011, we dropped the server count on the cluster from around 400 nodes to around 15.

teraflop · on May 4, 2018

You can only upgrade hardware so much. If by doing a lot of low-level optimizations, you can remove (or delay) the need to build a complex distributed system, then the optimizations end up paying big dividends beyond just the cost of the machines.

mmt · on May 4, 2018

I think it's also important to know when this occurred. I've found there's a general tendency among software engineers to (surprise!) believe that it's easier/cheaper to solve the problem of scale in software rather than hardware, and it's often fueled by the misconception that the alternative to doing so is a complex, distributed system.

This is a false dichotomy.

Maybe during the days of the dot-com boom, it was was true enough because scaling a single server "vertically" became cost prohibitive very quickly, especially since truly large machines came only from brand-name vendors. That was, however, a very long time ago.

A naive interpretation of Moore's law implies CPU performance today is in the high hundreds of times as fast as back then. Even I/O throughput has improved something like a multiple of mid-10s, IIRC. More importantly, cost has come down, too.

The purchase price premium for getting the highest-performance CPUs (and mobo they need) in a server over the lowest cost per performance option is about 3x. Considering that this is, necessarily, a single [1] server, the base for that premium isn't exactly tremendous. The total cost would seem to be on the same order of magnitude as a team of programmers.

Of course, in the instant example, the database was particularly specialized, including what strikes me as a unique feature, a lossy index. I'd expect data integrity to be one of the huge challenges of databases, which, if relaxed, makes writing a custom one a more reasonable proposition.

[1] Or a modest number, on the order of a dozen, for something like read slaves, rather than the multiple dozens if not hundreds the distributed system.

Retric · on May 5, 2018

It's not an either or situation. Often there is a ~10-1,000x performance gains to be had in software from the initial production version to the optimized version. Similarly you can often get ~10-1,000x speed bump from better hardware.

But, the gains become more expensive as you move up the scale. So, at least starting down the software path is often very cheap with many large gains to be had. Similarly, it's at least looking at the software before you scale to the next level of hardware tends to be a great investment.

It's not about always looking at software, it's a question of regularly going back to software as it's better to regularly do so rather than as a one time push.

mmt · on May 6, 2018

> It's not an either or situation.

I'm a bit confused.. are you agreeing or disagreeing? My point was to call out a false dichotomy and offer a third option.

> It's not about always looking at software

Yet that's exactly what happens. Software engineers completely dominate the field, including management, so they always look at software and only software.

Shikadi · on May 5, 2018

Moore's law is about transistor count, not performance

mmt · on May 6, 2018

I'm well aware, which is why I specified a naive interpretation. Still, are you actually saying that transistor count increase in the range of a multiple of 1024 hasn't been matched by comparable CPU performance improvements during that time?