I came across a similar solution[1] while creating a packet analyzer. At speeds like 50Kpps, and needing 10 or so indexes per packet, you end up needing to do 500K index writes per second. If you plan on making it work on a single small server and disk, you need to get creative. I used the original PageRank paper as inspiration, but it turns out to be a rather old and straightforward problem.
Isn't the key insight just using a couple layers of ISAM? Create ISAM index, save. Repeat for as many levels as you find useful. Top-level pages are almost always in RAM, so almost all index lookups require only a single seek. Toss in a nice compression library and you've got a fairly tight system.
Not to say it isn't nice to have a standard format and library to implement such things. It just seems that everything is a variation on things developed in the 70s.
1: I made each index contain a pair of 64-bit ints, allowing me to reasonably index a wide variety of values and point to the likely chunk file. As the keys and values are ordered, I could get by with easy-to-implement delta compression only. The indexes only point to a chunk, so many packets shared a single entry. In production, it came out to about 2.8 bits/packet (down from a few 8-byte hash values).
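Roughly like this, as a minimal sketch (not the actual code from my analyzer; the varint layout here is just one obvious way to do it):

```python
def delta_encode(pairs):
    """Delta-encode a sorted list of (key, value) pairs of 64-bit ints.

    Because both columns are ordered, consecutive deltas are small, and a
    varint encoding packs them into far fewer bytes than raw 8-byte ints.
    Assumes non-negative deltas (i.e. both sequences are non-decreasing).
    """
    out = bytearray()
    prev_key = prev_val = 0
    for key, val in pairs:
        for delta in (key - prev_key, val - prev_val):
            while True:                       # LEB128-style varint
                byte = delta & 0x7F
                delta >>= 7
                out.append(byte | (0x80 if delta else 0))
                if not delta:
                    break
        prev_key, prev_val = key, val
    return bytes(out)

# e.g. 3 pairs = 48 raw bytes, but only a handful of bytes once delta-encoded
print(len(delta_encode([(1000, 1), (1003, 1), (1010, 2)])))
```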
Blaze is intended to be a much more general solution than a pure key-value store (although it will be able to tackle this use case too). But yes, the idea behind BLZ is close to your project, namely, leveraging the available resources in your computer in the most useful way.
But sorry, I strongly disagree that everything is a variation on things from the 70s: back in those good old years, the memory hierarchy was far simpler than it is nowadays, and having to deal with it introduces a great amount of complexity in libraries designed to get the most out of modern computers.
The memory hierarchy can be pretty complicated, but the general design, the part that's making the rest possible, seems to me to be stacking ISAM. It might not be a simple stacking; you need to tune the parameters to deal with the various cache levels. But it's the same principle, isn't it?
I agree the hardware is quite different, vastly more complex.
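To make the stacking concrete, here's a toy sketch of a two-level lookup in Python (the layout and page size are made up; a real system would tune them per cache level):

```python
import bisect
import struct

class TwoLevelIndex:
    """Toy two-level ISAM-style index.

    The top level (the first key of every on-disk page) is small enough
    to keep in RAM, so a lookup is one in-memory binary search plus a
    single seek+read of a leaf page.
    """

    PAGE_SIZE = 4096
    RECORD = struct.Struct("<QQ")   # fixed (key, value) pairs of 64-bit ints

    def __init__(self, page_first_keys, data_file):
        self.page_first_keys = page_first_keys   # sorted list, lives in RAM
        self.data_file = data_file               # open binary file of packed pages

    def lookup(self, key):
        # pick the page whose first key is <= key
        i = bisect.bisect_right(self.page_first_keys, key) - 1
        if i < 0:
            return None
        # the single disk seek
        self.data_file.seek(i * self.PAGE_SIZE)
        page = self.data_file.read(self.PAGE_SIZE)
        usable = len(page) - len(page) % self.RECORD.size
        for k, v in self.RECORD.iter_unpack(page[:usable]):
            if k == key:
                return v
        return None
```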
I use HDF5 as the file format for my project. I'm not at all an HDF5 expert, but I know it supports partial reads/writes, extending datasets, and transparent compression. It's a nice file format and it's very easy to use from Python (and MATLAB, C++, R, and others).
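For instance, appending to and reading back a compressed, extendable dataset with h5py looks roughly like this (dataset name, chunking and compression settings are arbitrary):

```python
import h5py
import numpy as np

with h5py.File("ticks.h5", "w") as f:
    # chunked, gzip-compressed, extendable along the first axis
    dset = f.create_dataset("prices", shape=(0, 4), maxshape=(None, 4),
                            dtype="f8", chunks=(1024, 4), compression="gzip")

    # extend the dataset, then do a partial write of the new block
    block = np.random.rand(1024, 4)
    dset.resize(dset.shape[0] + len(block), axis=0)
    dset[-len(block):] = block

    # partial read: only the chunks covering this slice are read/decompressed
    tail = dset[-10:]
```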
I read the article, and BLZ looks interesting. What is it aiming to provide that is missing from HDF5? (Speed?)
HDF5 is a very nice format indeed, and in fact, BLZ is borrowing a lot of good ideas from it. However, HDF5 has its own drawbacks, like not being able to compress variable length datasets, the lack of a query/computational kernel or its flaky resiliency during updates. Also, its approach for distributing data among nodes diverges from our goals.
Finally, you are right, speed is pretty important for us, and we think that our approach can make a better use of current computer architectures.
I use HDF5 for storage/analytics of tick data; my experience has been that storing large sparse matrices is expensive in both storage space and speed.
I know that a lot of my peers in HFT have an illicit love for column stores, but a lot of work requires converting to 'wide' format, which can quickly take a 10-million-row matrix with a few columns to one with a few thousand columns. (And thus stuff like KX, FastBit, etc. becomes somewhat suboptimal.)
The need for massive 'last-of' information for time series leads to basically abandoning Python/pandas/NumPy, using C primitives, and doing a lot more 'online' than you'd typically like; really, a lot of this could happen behind the scenes with intelligent out-of-memory ops.
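Roughly the kind of reshaping I mean, as a toy pandas sketch (symbols, prices and column names are made up):

```python
import pandas as pd

# long/tick format: one row per (timestamp, symbol) update
ticks = pd.DataFrame({
    "ts":     pd.to_datetime(["09:30:00.1", "09:30:00.2", "09:30:00.2", "09:30:00.3"]),
    "symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "price":  [601.10, 29.90, 601.20, 29.95],
})

# 'wide' format: one column per symbol; a handful of columns can explode
# into thousands when there are thousands of symbols
wide = ticks.pivot_table(index="ts", columns="symbol", values="price", aggfunc="last")

# 'last-of': carry the most recent observation forward so every timestamp
# has a complete snapshot across all columns
last_of = wide.ffill()
print(last_of)
```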
So...I'm pretty excited for innovation in data stores -- I look forward to seeing more!
Sounds really interesting for some of the things I'm working on. I would love to see some very basic stats on performance before trying it out though.
A fairly trivial example would be to store 8 billion random integers, then step through them in order and XOR them. Then you can have a baseline to compare monolithic file performance to blaze.
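The monolithic-file side of that baseline could look something like this (a scaled-down sketch; sizes and the file path are placeholders):

```python
import numpy as np

N = 10**8        # scaled down from 8e9 so it finishes quickly; same access pattern
CHUNK = 2**20    # integers per chunk

# write ~N random 64-bit integers to a plain binary file (the monolithic baseline)
rng = np.random.default_rng(0)
with open("baseline.bin", "wb") as f:
    for _ in range(N // CHUNK):
        np.frombuffer(rng.bytes(CHUNK * 8), dtype=np.uint64).tofile(f)

# step through the file in order and XOR everything together
acc = np.uint64(0)
with open("baseline.bin", "rb") as f:
    while True:
        chunk = np.fromfile(f, dtype=np.uint64, count=CHUNK)
        if chunk.size == 0:
            break
        acc ^= np.bitwise_xor.reduce(chunk)
print(hex(int(acc)))
```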
This would be a worst-case benchmark. Since random integers aren't compressible, you'll end up with the overhead of the extra logic for compression, but none of the benefits since it won't actually reduce the size of your data.
That's right. But even in this case Blosc, the internal compressor used in Blaze, can detect whether the data is compressible pretty early in the compression pipeline and decide to stop compressing and just copy (how early that decision is taken depends on the compression level).
The good news is that Blosc can still use threads in parallel for doing the copy, and this normally gives significantly better speed than a non-threaded memcpy() (the default for many modern systems).
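You can see this effect from Python with the python-blosc bindings, along these lines (array sizes, compression level and thread count are arbitrary):

```python
import numpy as np
import blosc

blosc.set_nthreads(4)            # let Blosc split the (de)compression/copy work

rng = np.random.default_rng(0)
incompressible = np.frombuffer(rng.bytes(2**25), dtype=np.uint64)   # pure noise
compressible   = np.arange(2**22, dtype=np.uint64)                  # smooth data

for name, arr in [("random", incompressible), ("arange", compressible)]:
    raw = arr.tobytes()
    packed = blosc.compress(raw, typesize=8, clevel=5)
    print(name, len(raw), "->", len(packed), "bytes")
# the random buffer comes back essentially uncompressed (plus a small header),
# while the regular one shrinks dramatically
```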
> significantly better speed than a non-threaded memcpy()
Really? I always thought that a single core could always saturate the available memory bandwidth (unless you have some weird architecture like NUMA). If you're seeing a multithreaded memcpy with better performance, maybe you're just stealing memory bandwidth from other processes (since AFAIK memory bandwidth allocation is probably done on a per-thread basis), or maybe you're getting more CPU cache allocated to you because you're running on multiple cores?
I'm not sure why, but yes, I'm seeing these speedups. You can see them too at http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks by paying attention to the compression ratio of 1 (compression disabled). If you find any good reason why this is happening, I'm all ears.
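If anyone wants to poke at it on their own box, a quick sketch like this compares a plain numpy copy against the same copy split across threads (assuming numpy releases the GIL for large contiguous copies, which I believe it does; if it doesn't on your build, the threaded number won't mean much):

```python
import time
from concurrent.futures import ThreadPoolExecutor
import numpy as np

N = 1 << 27                      # 2**27 float64s = 1 GiB
src = np.ones(N)
dst = np.empty_like(src)

def gbps(fn):
    t0 = time.perf_counter()
    fn()
    return (N * 8 / 2**30) / (time.perf_counter() - t0)

def single_copy():
    np.copyto(dst, src)

def threaded_copy(nthreads=4):
    step = N // nthreads
    with ThreadPoolExecutor(nthreads) as ex:
        # each thread copies its own contiguous slice
        list(ex.map(lambda start: np.copyto(dst[start:start + step],
                                            src[start:start + step]),
                    range(0, N, step)))

print("single-threaded copy:", round(gbps(single_copy), 2), "GiB/s")
print("4-thread copy       :", round(gbps(threaded_copy), 2), "GiB/s")
```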
No doubt that much longer :) But the important points to take away are:
1) You can store more data in the same storage capacity by compressing it.
2) If the data can be compressed, less I/O effort is needed.
3) If the compressor is fast enough, you may end up saving I/O time.
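A back-of-the-envelope for point 3, with purely illustrative numbers and ignoring any overlap of reads with decompression:

```python
disk_bw   = 500     # MB/s raw disk read speed (illustrative number)
decomp_bw = 2000    # MB/s decompression throughput of a fast codec (illustrative)
ratio     = 3.0     # compression ratio of the stored data

data_mb = 1000      # reading 1 GB of logical data
uncompressed_time = data_mb / disk_bw                                   # 2.0 s
compressed_time   = (data_mb / ratio) / disk_bw + data_mb / decomp_bw   # ~1.17 s
print(uncompressed_time, compressed_time)
```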
DyND is being developed as a low-level support library for Blaze.
Blaze is focused on supporting many things like distributed storage, distributed computation, out of core computation, and high level representations of computation that can move code to where the data lies.
DyND is focused on data that is purely in memory, and defines a dynamic multi-dimensional data structure that Blaze can use as a support library. The development of DyND is also strictly partitioned into a C++ library, which can be used by other systems that do not want to depend on a Python runtime, and a Python module which exposes its functionality to Python.
Here's my admittedly poor KV implementation: https://github.com/MichaelGG/cidb/tree/master/CiDB