Redis persistence demystified (antirez.com)
330 points by antirez on March 26, 2012 | hide | past | favorite | 24 comments


I love your blog posts so much. In 20 minutes I just learned so much about databases and internals.

Thanks for writing these!


Thank you for reading it. It is a long post; it takes some patience and interest to read.


Are there any good posts on transactions? I've found articles around but nothing very concise that explains the exact behavior of MULTI/EXEC when things go bad.


Basically, MULTI/EXEC is always handled correctly: either everything or nothing is committed to the database memory, RDB file, AOF file, slaves, and so on.


I had read that MULTI/EXEC would not "rollback" in case a command fails (e.g. you do a first SET, a second SET, and the third SET fails; the two initial SETs would still be applied). I guess this doesn't have so much to do with persistence problems specifically (system crash, process crash, etc.) but more with how transactions work in general.

Is this not true or at least no longer true?


  redis> multi
  OK
  redis> set hi 'hello'
  QUEUED
  redis> rpop hi
  QUEUED
  redis> set okay 'done'
  QUEUED
  redis> exec
  1. OK
  2. (error) ERR Operation against a key holding the wrong kind of value
  3. OK
  redis> get okay
  "'done'"
  redis> get hi
  "'hello'"
  redis>
Note that it keeps on processing even though the second command fails. This also has nothing to do with persistence: the exhibited behavior will be the same in a fully-alive system, when reloading from disk, and everywhere else, because Redis maintains a perfect total ordering of operations in its log format and sync behavior.
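The no-rollback behavior above can be sketched in-process (a toy Python model, not the Redis implementation): commands are queued at MULTI time and executed in order at EXEC, and a runtime error in one command is recorded in the reply array but does not undo the others.

```python
# Toy model of Redis MULTI/EXEC semantics (not real Redis code):
# commands are queued, then executed in order at EXEC time, and a
# runtime error in one command does not roll back the others.

class MiniRedis:
    def __init__(self):
        self.data = {}
        self.queue = None

    def multi(self):
        self.queue = []
        return "OK"

    def queue_cmd(self, fn, *args):
        self.queue.append((fn, args))
        return "QUEUED"

    def set(self, key, value):
        self.data[key] = value
        return "OK"

    def rpop(self, key):
        val = self.data.get(key)
        if val is not None and not isinstance(val, list):
            raise TypeError("WRONGTYPE Operation against a key "
                            "holding the wrong kind of value")
        return val.pop() if val else None

    def exec(self):
        results = []
        for fn, args in self.queue:
            try:
                results.append(fn(*args))
            except TypeError as e:
                results.append(f"(error) {e}")  # record error, keep going
        self.queue = None
        return results

r = MiniRedis()
r.multi()
r.queue_cmd(r.set, "hi", "hello")
r.queue_cmd(r.rpop, "hi")           # wrong type: string, not list
r.queue_cmd(r.set, "okay", "done")
results = r.exec()
# The first and third commands succeeded; the second failed, and
# nothing was rolled back -- same shape as the CLI session above.
```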


http://redis.io/commands/set

> Status code reply: always OK since SET can't fail.

So, what kind of situation are you envisioning?


Maybe SET wasn't a good example... but let's say you are writing to the wrong type, or something like that. I know it's unlikely something will fail like that in production, since you should catch it during dev, but bugs can happen.


For high-value production data, if you think your data is safe after it's (finally) written to disk, you're doing it wrong. Your data is not safe until a) it has left the physical database machine, and b) it has left the physical datacenter and is stored in multiple datacenters around the world (e.g. S3).

Thus ideally, the (horribly slow) disk doesn't even come into play, especially for in-memory DBs. You buffer the data in memory before it is sent out of the machine/datacenter, but you make sure to mirror this buffer on multiple separate physical machines (which your database cluster should support), in case one goes down. Once the data is committed to a replicated store, you can clear that buffer. Fast and reliable.

This is not to say that there aren't a zillion cases where the hard drive is still the ideal persistence device. After all, it's very hard to destroy a hard drive in a way that makes the data unrecoverable (of course, I'm talking about cases where RAID failed or wasn't present). In reality, though, data from broken hard drives is seldom recovered, mainly, I'd guess, due to the price and relatively long service wait times.


If you read the article carefully there are multiple mentions of this; specifically, I wrote that RDB persistence is just perfect for this: a single-file compact representation of the data to send far away :)


Hmm, I was thinking more of sending chunks of the AOF out to a separate, distributed storage (and the RDB snapshot file occasionally). Loading data over the network could then be slower or faster than from a local disk, depending on network speed and the number of machines to read from.


And because I just know the whole VM deprecated thing will come up, here's a pretty awesomely informative recent status:

https://github.com/antirez/redis/issues/254


Antirez mentions that a limited set of on-disk data structures could work well but that it will take one or two years to even reach the drawing board. Fair enough -- Redis is first and foremost an in-memory database.

If there were time, though, I'd love to see LevelDB[1] bridge the gap between in-memory and on-disk. Inspired by Google's BigTable, all keys are kept sorted on disk (see: SSTable[2]). Keys, lists, sets, sorted sets and hash tables could be encoded in sorted key-value form and would be reasonably (though not tremendously) efficient to retrieve, especially if the query can be converted to a range query (i.e. list retrieval or set intersection). Keep hot data in memory; the rest ends up securely on disk.
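As a rough illustration of the idea (a hypothetical encoding, not what LevelDB, BigTable, or any real Redis-on-LevelDB project actually uses), a sorted set can be flattened into lexicographically ordered keys so that retrieval by score becomes a prefix scan over an SSTable-like store:

```python
import struct

# Hypothetical encoding: flatten a Redis sorted set into ordered
# key-value pairs so a range scan over a sorted on-disk store
# (LevelDB-style) returns members in score order.

def zset_key(name: str, score: float, member: str) -> bytes:
    # Big-endian IEEE 754 bytes keep lexicographic order aligned with
    # numeric order for non-negative scores.
    return name.encode() + b"\x00" + struct.pack(">d", score) + member.encode()

store = {}  # in-memory stand-in for the sorted on-disk store
for member, score in [("a", 3.0), ("b", 1.0), ("c", 2.0)]:
    store[zset_key("myzset", score, member)] = b""

# A "range query" is just iterating keys with the right prefix in
# sorted order; the member sits after the 8 score bytes.
prefix = b"myzset\x00"
members_by_score = [k[len(prefix) + 8:].decode()
                    for k in sorted(store) if k.startswith(prefix)]
# members_by_score is ["b", "c", "a"], i.e. ordered by score 1, 2, 3
```

A real implementation would also need to handle negative scores (the sign bit breaks the byte-order trick) and a secondary member-to-score index, but the range-scan shape is the point.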

Yes, Cassandra is based on the same lineage, but it's not simple or clean to operate. Redis is pain-free to set up and features a simple API, but has no transparent way to overflow cold data to disk. LevelDB is an optional backend for Riak, but I must admit I've not explored Riak heavily... Have I missed a contender from another database crowd?

[1]: http://code.google.com/p/leveldb/ [2]: http://www.igvita.com/2012/02/06/sstable-and-log-structured-...


Speaking of LevelDB and Redis, did you know about Edis[1]? It is a protocol compatible implementation of Redis that uses LevelDB as its data store. I haven't had a chance (nor the need) to try it, but it sounds interesting.

[1]: http://inaka.github.com/edis/


Indeed I hadn't -- Edis doesn't rank highly for searches involving "LevelDB + Redis", which could be how it avoided me for so long. This sounds like what I envisioned, so I'll be looking at it with keen interest =] The real question is how they implemented the encoding of the Redis data structures into the LevelDB SSTable format, and the implications that will have on performance. If the GitHub issues are any indication, it's an interesting proof of concept but hasn't been used or tested widely yet[1]. Along those lines, leveldb-server[2] also looks interesting as a simple (API-wise) and easy-to-install LevelDB-backed DB.

[1]: https://github.com/inaka/edis/issues/2 [2]: https://github.com/srinikom/leveldb-server


For anybody interested in this topic, this SQLite doc page is excellent:

http://sqlite.org/atomiccommit.html

Check out in particular the sections "Hardware Assumptions" and "Things That Can Go Wrong"


Durability through replication should probably be mentioned as well; either to address performance requirements, or to provide stronger durability against hardware failure.


Redis can only perform asynchronous replication because it uses a single thread. It cannot block the main thread waiting on the network and still have acceptable performance. This makes replication as good as "appendfsync no": you have no guarantees as to whether the network write happened.

The upside of the design is that it makes things simple (e.g., transactions, append only file).

(This is my understanding, please correct me if I'm wrong!)


Yes, it is correct that asynchronous replication is the only way Redis handles replication. However, replication and durability are still on topic: if the master burns in a fire, the slave will still contain your data ;)

However, some people turn Redis async replication into synchronous replication with a trick. They perform:

    MULTI
    SET foo bar
    PUBLISH foo:ack 1
    EXEC
Because PUBLISH is propagated to the slaves, clients listening on another connection to the right channel will get the ACK once the write has reached the slave. Not always practical, but it's an interesting trick.
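The trick can be sketched with an in-process simulation (queues standing in for the replication link and the SUBSCRIBE connection; this models the propagation order, not real Redis client code):

```python
import queue
import threading

# Simulation of the PUBLISH-based ack trick: the master applies a
# MULTI block locally, then ships it down the replication stream; a
# subscriber attached to the slave sees the PUBLISH only after the
# whole transaction (including the SET) has reached the slave.

replication_stream = queue.Queue()   # stands in for the async replication link
ack_channel = queue.Queue()          # stands in for SUBSCRIBE foo:ack on the slave

master_data = {}

def master_exec(transaction):
    # Apply the transaction on the master, then propagate asynchronously.
    for cmd, *args in transaction:
        if cmd == "SET":
            master_data[args[0]] = args[1]
    replication_stream.put(transaction)

def slave_loop():
    slave_data = {}
    transaction = replication_stream.get()
    for cmd, *args in transaction:
        if cmd == "SET":
            slave_data[args[0]] = args[1]
        elif cmd == "PUBLISH":
            # By the time this fires, the SET above already ran on the
            # slave -- that ordering is the whole point of the trick.
            ack_channel.put(args[1])

threading.Thread(target=slave_loop, daemon=True).start()
master_exec([("SET", "foo", "bar"), ("PUBLISH", "foo:ack", "1")])
ack = ack_channel.get(timeout=5)     # block until the write reached the slave
```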


As a long time Redis user, source code admirer and spectator of its evolution, I have to say that I learned quite a lot about open source project management.


One of the additional benefits of RDB is the fact that, for a given database size, the number of I/Os on the system is bounded, whatever the activity on the database is. This is a property that most traditional database systems (and the other Redis persistence option, the AOF) do not have.

Can you expand on this? Specifically:

-Do you mean 'bound' as in 'limited by' or 'known'?

-Why are RDB snapshots I/O bound when other systems are not?

-Why is this an advantage?


If you write the DB to disk sequentially every 5 minutes, the I/O you perform is a fixed amount regardless of the number of writes against the dataset. For instance, using pipelining Redis can easily peak at 400k operations per second, and you can have a few instances in the same box. In this setup 5 minutes of data loss may be acceptable even if you are writing 2 million records per second, and RDB makes this possible.

The I/O performed will always be proportional to the number of keys; it is not proportional to the operations per second the instance is receiving. With the Redis AOF, and generally with most other databases, it is unlikely that you have an operational mode where the I/O is simply proportional to the size of the dataset and not to the amount of reads/writes.


What about a hybrid RDB/AOF option, where AOF is not written immediately but every N seconds, using the latest delta?
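A minimal sketch of that hybrid (hypothetical; note that Redis's existing "appendfsync everysec" mode already defers the fsync, though not the write itself): buffer AOF entries in memory and flush the accumulated delta every N seconds, paying one write and one fsync per window instead of per command.

```python
import os
import time

# Sketch of the suggested hybrid (not a Redis feature): accumulate AOF
# entries in memory and flush the delta to disk every N seconds.

class BufferedAOF:
    def __init__(self, path, flush_every=2.0):
        self.path = path
        self.flush_every = flush_every
        self.buffer = []
        self.last_flush = time.monotonic()

    def append(self, entry: bytes):
        self.buffer.append(entry)
        if time.monotonic() - self.last_flush >= self.flush_every:
            self.flush()

    def flush(self):
        if self.buffer:
            with open(self.path, "ab") as f:
                f.write(b"".join(self.buffer))
                f.flush()
                os.fsync(f.fileno())  # one fsync per window, not per write
            self.buffer.clear()
        self.last_flush = time.monotonic()
```

The trade-off is the same as with RDB snapshots, just at a finer grain: up to N seconds of writes can be lost on a crash, in exchange for bounded disk activity.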


antirez, thanks for the objective look at Redis's internals. As a young engineer two years removed from college, I feel these articles serve us young'uns with lots of knowledge gaps the most.



