Well, as I said, there is nothing that can't be expressed in an RDBMS - at least ...

staticshock · on March 27, 2008

you either have to modify your schema, or you have created a database that is highly flexible via totally generalized mapping tables, but is not optimized for these kinds of structures

A generalizable mapping schema with tables for edges may not be optimal, but your comparison seems to be a bit of a bait-and-switch. Why compare the optimality of such a schema to a rigid schema instead of comparing it to the optimality of an alternative "graph-based" data store?

Granted, an extensible schema will be slow to query/etc. What makes you say that you can achieve better efficiency using a non-RDBMS approach? (Not that you can't, but I didn't see your argument to that effect. I'd say that without such an argument, the optimality/speed point is insignificant.)

wheels · on March 27, 2008

You can. Definitely. In fact, I've implemented this a few times (most recently last week); some for specific problems, some for more generic graph support.

In a nutshell:

The problem with RDBMS approaches is that the good ones assume you can pack your complex logic into a monster query or stored procedure and let the query optimizer do its thing. But if you're implementing an attribute-value system or graph traversal on top of an SQL database, you end up generating a ginormous number of queries just to do some basic traversal. You could potentially wrap those into a stored procedure that was doing selects into a temporary table, but that's not really the sort of thing that most query optimizers go to town on.
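To make the "one query per hop" problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `edges` table, its columns and the sample data are made up for illustration; they are not from any system mentioned in this thread. The point is only that a breadth-first traversal driven from application code issues one SELECT per visited node.

```python
import sqlite3
from collections import deque


def build_demo_db():
    # Hypothetical edge table: one row per directed edge.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    conn.execute("CREATE INDEX edges_src ON edges (src)")
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?)",
        [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)],
    )
    return conn


def reachable(conn, start):
    """Breadth-first traversal that asks the database for each node's neighbours."""
    seen = {start}
    queue = deque([start])
    query_count = 0
    while queue:
        node = queue.popleft()
        # One round trip per visited node; this is the part that adds up.
        rows = conn.execute(
            "SELECT dst FROM edges WHERE src = ?", (node,)
        ).fetchall()
        query_count += 1
        for (dst,) in rows:
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen, query_count


conn = build_demo_db()
nodes, queries = reachable(conn, 1)
```

The number of queries grows with the number of nodes visited, and the optimizer only ever sees each tiny SELECT in isolation.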

On the other hand, there are a number of systems out there that attempt to be full object-oriented databases, object-relational mappings, or RDF-based stores, but the current off-the-shelf ones tend to perform poorly since they're not very mature (and, I get the feeling, are more focused on being able to conveniently store stuff than on actually hitting it very hard).

When I first started looking at the sort of problems that Hank's addressing (in a series of talks I did in 2004 titled "Beyond Hierarchical Interfaces") I naïvely thought that you could do everything with an SQL backend; I tried and failed. I could blab on about the sort of indexing that you need for this sort of storage, but I'll duck out for now.

Edit: Just one example of where I've done this, if anyone cares, was replacing the old SQL backend with a dynamic (schema-less) attribute-value system and basic query language, for my current job: http://grunge-nouveau.net/Kore.mp4

staticshock · on March 27, 2008

Now, I may be pretty naive here, but if you're doing full-on graph traversal, why not just extract the full graph from the database and traverse it in memory on your own terms instead of leaving it to large unoptimized traversal queries?
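A rough sketch of what that could look like, again with Python's sqlite3 and a hypothetical `edges (src, dst)` table invented for the example: pull every edge out with a single query, build an adjacency map, and do the traversal without touching the database again.

```python
import sqlite3
from collections import defaultdict, deque


def load_adjacency(conn):
    """Read the whole edge table once and build an in-memory adjacency map."""
    adjacency = defaultdict(list)
    for src, dst in conn.execute("SELECT src, dst FROM edges"):
        adjacency[src].append(dst)
    return adjacency


def reachable_in_memory(adjacency, start):
    """Breadth-first traversal over the in-memory map; no further queries."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # .get avoids creating empty entries for nodes with no outgoing edges.
        for dst in adjacency.get(node, ()):
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


# Hypothetical sample data, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (6, 7)],
)

graph = load_adjacency(conn)
found = reachable_in_memory(graph, 1)
```

The obvious cost is that the whole edge set, plus whatever metadata the traversal needs, has to fit in memory.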

wheels · on March 27, 2008

For the latest data set that I'm working on there are 5 million nodes and 50 million edges, and each one has some meta-data associated with it. :-)

wanorris · on March 27, 2008

Good points. I've run into this problem a lot, and generally handle it in one of 2 ways:

1. Make it easy for semi-technical project managers who are not coders to extend the schema. This is the 95% solution for us, and the core of our system.

2. Use n-to-n lookup tables or lookup fields that use a second field to determine what you reference. We don't do this a lot, but we do it in a few places where there can be a more or less unbounded set of things that can be referenced. These indeed have problems, so we try to avoid them, especially in high-volume situations.
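For anyone unfamiliar with the second approach, here is a minimal sketch of a lookup table where a second field says which table the id points into. It uses Python's sqlite3; all table and column names (`people`, `documents`, `links`, `target_type`, `target_id`) are hypothetical and not taken from our framework.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT);
    -- target_type decides which table target_id refers to, so the
    -- database cannot enforce this reference with an ordinary foreign key.
    CREATE TABLE links (
        owner_id    INTEGER,
        target_type TEXT,
        target_id   INTEGER
    );
    INSERT INTO people VALUES (1, 'alice');
    INSERT INTO documents VALUES (1, 'spec.txt');
    INSERT INTO links VALUES (10, 'person', 1), (10, 'document', 1);
""")

# Whitelist mapping type tags to tables; table names cannot be bound as
# query parameters, so they must never come straight from user input.
TARGET_TABLES = {"person": "people", "document": "documents"}


def resolve_links(conn, owner_id):
    """Follow each link for owner_id, issuing one extra query per link."""
    links = conn.execute(
        "SELECT target_type, target_id FROM links "
        "WHERE owner_id = ? ORDER BY rowid",
        (owner_id,),
    ).fetchall()
    results = []
    for target_type, target_id in links:
        table = TARGET_TABLES[target_type]
        row = conn.execute(
            f"SELECT name FROM {table} WHERE id = ?", (target_id,)
        ).fetchone()
        results.append((target_type, row[0] if row else None))
    return results


resolved = resolve_links(conn, 10)
```

Resolving a set of links takes one query per link (or one per target type if you batch them), and referential integrity is left to the application.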

Then again, note that this solution (a) requires using our framework to be effective and (b) has RDBMS purists seeing red. So maybe you're right.

Edit: section redacted.