That's probably the best way of looking at it. PostgreSQL is generally going to be the best solution for 90% of your data, short of having a subset that's write heavy to the point of streaming.


What is a good space to look at when you have write-heavy data to the point of streaming?


Depends on your application. In most cases your "write heavy" load is going to be isolated to one or two tables' worth of data (or equivalent), and for those cases some type of NoSQL solution can be a really good option. Especially since you can use a PG foreign data wrapper to allow PG to run queries against that information too.
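Roughly, a minimal sketch of the foreign data wrapper idea, driven from psycopg2. It assumes the mongo_fdw extension is installed and that a local "users" table exists; the server/table names and wrapper options here are illustrative and vary by FDW and version:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # connection string is illustrative
    with conn, conn.cursor() as cur:
        # Expose a MongoDB collection as a foreign table inside Postgres.
        # (Option names depend on the specific FDW you're using.)
        cur.execute("CREATE EXTENSION IF NOT EXISTS mongo_fdw")
        cur.execute("""
            CREATE SERVER mongo_srv
                FOREIGN DATA WRAPPER mongo_fdw
                OPTIONS (address '127.0.0.1', port '27017')
        """)
        cur.execute("CREATE USER MAPPING FOR CURRENT_USER SERVER mongo_srv")
        cur.execute("""
            CREATE FOREIGN TABLE events (
                user_id int,
                payload text
            ) SERVER mongo_srv OPTIONS (database 'app', collection 'events')
        """)

        # Postgres can now join its own tables against the NoSQL data.
        cur.execute("""
            SELECT u.name, count(*)
            FROM users u JOIN events e ON e.user_id = u.id
            GROUP BY u.name
        """)
        print(cur.fetchall())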

If it's system wide and you need PostgreSQL itself to handle it, using PG's async features is one potential option, and setting up and managing a Postgres-XC cluster would be the next step; XC allows scale-out for writes. If you're at a company with a budget for that type of thing, I think I remember reading that EnterpriseDB (the PG company) is offering first-class support for PG XC.
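One concrete example of the "async features" angle is plain asynchronous commit (this is per-session Postgres behavior, not Postgres-XC). A minimal sketch, assuming psycopg2 and an illustrative "events" table:

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # connection string is illustrative
    cur = conn.cursor()

    # Asynchronous commit: COMMIT returns before the WAL is flushed to disk,
    # trading a small window of potential data loss for higher write
    # throughput. It's per-session, so other sessions stay fully durable.
    cur.execute("SET synchronous_commit TO OFF")

    cur.execute(
        "INSERT INTO events (user_id, payload) VALUES (%s, %s)",
        (42, '{"action": "click"}'),
    )
    conn.commit()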

In most cases though, I find that the write heavy parts of a system are so isolated that diverting them to a simple NoSQL solution tends to be easiest (Mongo, Couchbase, DynamoDB from AWS, etc).


Depending on your exact needs, Cassandra's not a bad idea for this space. O(n) scalability for write capacity (+), and even on an individual node it's pretty write-friendly: its sstable data structures stream to disk well, keeping spinning disks happy while avoiding write amplification issues on SSDs. It does help if the data is nicely shardable, of course (see the write sketch below).

That's the one I'm familiar with, anyway.

(+) Write capacity is O(n) as you add machines but an individual write's time is pretty constant and cluster-wide maintenance operations do start taking longer as you add machines and they gossip to each other. It's not magic, obviously :)
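The write sketch mentioned above, using the DataStax Python driver (cassandra-driver); the keyspace, table, and data source are illustrative assumptions:

    import time
    from cassandra.cluster import Cluster

    def incoming_batch():
        # Stand-in for the real stream of (sensor_id, timestamp, value) rows.
        now = int(time.time() * 1000)
        return [("sensor-1", now + i, float(i)) for i in range(100)]

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("metrics")  # keyspace name is illustrative

    # Prepared statements avoid re-parsing the CQL on every write.
    insert = session.prepare(
        "INSERT INTO events (sensor_id, ts, value) VALUES (?, ?, ?)"
    )

    # execute_async lets the driver pipeline many writes at once; each write
    # still lands on the replicas that own that partition key.
    futures = [session.execute_async(insert, row) for row in incoming_batch()]
    for f in futures:
        f.result()  # surfaces any write errors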


The filesystem.

Ok, that's not completely serious, but almost.


Could work: dump the incoming stream to disk in some sensible text format and have an asynchronous queue process the data into the database.
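A minimal sketch of that, assuming newline-delimited JSON as the "sensible text format", psycopg2 on the consumer side, and an illustrative "events" table (a real version would also rotate the spool file after draining it):

    import json
    import os

    import psycopg2
    from psycopg2.extras import execute_values

    SPOOL = "events.ndjson"  # illustrative path

    def append_event(event):
        # Producer side: one JSON object per line, fsync'd so the write
        # survives a crash even if the database is behind.
        with open(SPOOL, "a") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def drain_into_postgres():
        # Consumer side: load the spooled lines in one big batch so the
        # database sees a few large transactions instead of many small ones.
        with open(SPOOL) as f:
            rows = [(e["user_id"], e["payload"]) for e in map(json.loads, f)]
        if not rows:
            return
        conn = psycopg2.connect("dbname=app")  # connection string is illustrative
        with conn, conn.cursor() as cur:
            execute_values(
                cur, "INSERT INTO events (user_id, payload) VALUES %s", rows
            )
        conn.close()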


At that point maybe something like Kafka starts looking attractive.
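A rough sketch of that shape using the kafka-python package; the topic, broker address, table, and the row-at-a-time loader are illustrative assumptions (a real loader would batch the inserts):

    import json

    import psycopg2
    from kafka import KafkaProducer, KafkaConsumer  # kafka-python package

    # Producer side: the write-heavy ingest just appends to a Kafka topic.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda e: json.dumps(e).encode("utf-8"),
    )
    producer.send("events", {"user_id": 42, "payload": "click"})
    producer.flush()

    # Consumer side (a separate process): drain the topic into Postgres at
    # its own pace, so bursts queue up in Kafka instead of hitting the DB.
    conn = psycopg2.connect("dbname=app")  # connection string is illustrative
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id="pg-loader",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO events (user_id, payload) VALUES (%s, %s)",
                (message.value["user_id"], message.value["payload"]),
            )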


It depends on what your needs are. If you just have a lot of data that is coming in quickly but you aren't doing constant analysis on it, you can still use Postgres. Switch to large batch writes for getting the data into the database to reduce the transaction overhead, and look at using a master-slave replica setup. The 9.4/9.5 log replication features worked really well the last time I used them for handling streaming data. We had a write master and a read slave and optimized each accordingly. It worked pretty well once we got the log replication tuned.
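A minimal sketch of the batch-write half of that with psycopg2; the host names for the master and the standby, and the "events" table, are illustrative assumptions:

    import csv
    import io

    import psycopg2

    # Writes go to the master, reads go to the streaming-replication standby.
    write_conn = psycopg2.connect("host=pg-master dbname=app")
    read_conn = psycopg2.connect("host=pg-replica dbname=app")

    def bulk_load(rows):
        # COPY is the cheapest way to get a large batch into Postgres:
        # one command, one transaction, no per-row round trips.
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        buf.seek(0)
        with write_conn, write_conn.cursor() as cur:
            cur.copy_expert(
                "COPY events (user_id, payload) FROM STDIN WITH (FORMAT csv)",
                buf,
            )

    bulk_load([(1, "a"), (2, "b")])

    # Reporting queries hit the replica and never block ingestion
    # (they may lag slightly behind the master).
    with read_conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM events")
        print(cur.fetchone()[0])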


The kind of setup described by http://c2.com/cgi/wiki?PrevalenceLayer works quite well
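For anyone unfamiliar with the pattern at that link, a toy sketch of a prevalence layer: all state lives in memory, every change is appended to a journal first, and a restart replays the journal (the names are made up, and the periodic snapshotting you'd want in practice is left out):

    import json
    import os

    class PrevalentStore:
        def __init__(self, journal_path="journal.ndjson"):
            self.journal_path = journal_path
            self.state = {}
            # Recover in-memory state by replaying the journal.
            if os.path.exists(journal_path):
                with open(journal_path) as f:
                    for line in f:
                        self._apply(json.loads(line))

        def _apply(self, command):
            self.state[command["key"]] = command["value"]

        def set(self, key, value):
            command = {"key": key, "value": value}
            # Durability comes from the journal, not the in-memory dict.
            with open(self.journal_path, "a") as f:
                f.write(json.dumps(command) + "\n")
                f.flush()
                os.fsync(f.fileno())
            self._apply(command)

    store = PrevalentStore()
    store.set("user:42", {"name": "alice"})
    print(store.state["user:42"])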



