Relational databases aren’t dinosaurs, they’re sharks (simplethread.com)
202 points by e2e4 on Oct 8, 2021 | 130 comments


Great article, and this particularly brought back memories for me:

> [...] in some instances you might work with vast quantities of data, or deal with transactional systems that just don’t easily fit the operational limitations of relational databases. And in those cases, you should consider moving some, or all, of your data into a non-relational database.

I've worked with a big application that utilised this approach with good results.

The overall problem was that we had an application tailored for a specific country, but needed to expand to other countries. One of the specific problems was that we needed to store addresses and related data differently for each of the countries' users, and this didn't fit well with the current database schema.

Our solution was to move the typing of this information to the application layer, and just store the arbitrary data in a NoSQL solution. This worked perfectly, and to my knowledge it's still working without a glitch.

This was before RDBMS solutions supported JSON, and if I were to do this again, I'd probably just continue to use MySQL, PostgreSQL or whatever, and store it as JSON in the database.
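If it helps to picture it, here is a minimal sketch of what that might look like today in PostgreSQL; the table, column, and key names are invented for illustration, not taken from the original system:

    -- Hypothetical sketch: country-specific address data kept as JSONB
    -- alongside the columns that are common to every country.
    CREATE TABLE user_addresses (
        id       BIGSERIAL PRIMARY KEY,
        user_id  BIGINT NOT NULL,
        country  CHAR(2) NOT NULL,   -- ISO 3166-1 alpha-2
        address  JSONB NOT NULL      -- free-form, per-country layout
    );

    -- The JSON payload can differ per country without schema changes:
    INSERT INTO user_addresses (user_id, country, address)
    VALUES (42, 'DE', '{"strasse": "Hauptstr. 5", "plz": "10115", "ort": "Berlin"}');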

So, yes, you can get the best of both worlds, but I would _never_ use a NoSQL solution to store all the application's data, independent of the type of application.


> Our solution was to move the typing of this information to the application layer, and just store the arbitrary data in a NoSQL solution. This worked perfectly, and to my knowledge it's still working without a glitch.

> This was before RDBMS solutions supported JSON, and if I were to do this again, I'd probably just continue to use MySQL, PostgreSQL or whatever, and store it as JSON in the database.

Both solutions sound like anti-patterns to me. I was under the impression that the correct solution in an RDBMS would be the introduction of a custom type for the address that would cover all of your use cases, and all the associated operations, indices, etc. (in PostgreSQL, maybe even with PostGIS support for spatial operations and spatial indices). That not only keeps your logic, data integrity checks etc. on the server but also doesn't need text munging for every operation.


I work with locations and addresses on an international system. They are not anywhere near standardized enough to allow this. And at scale, you'll get things that seem like they shouldn't be addresses but are.

One of our addresses that caused trouble is literally: "The yellow sign across the street from the Seven-Eleven at <reasonable address>".

We have one address that's legally in two countries at the same time.

One address is just a whole city. Like the entirety of the city, but also it still needs to be considered a separate place from the city.

It of course depends on your use cases and etc. But I find addresses can be like storing names in a lot of contexts. i.e. Just take the bytes the user gives you and alert them if some service downstream complains, but don't require they change it to meet your requirements.

There are systems to help offer standardized addresses and you can display them as suggestions to the user. But sometimes you get a multi billion dollar company telling you "Maybe the address is legally X, but the bus stops at the yellow sign across the street, and we get 30 customers calling for refunds every week because they didn't get on the bus. So either accommodate this change or we'll need to find a new partner."


I've come around to this view, too. Also like names, the best solution is to avoid using them for analysis. Run them through an address geocoding service and store the coordinates next to the address. Use the original address for sending mail or filling out forms, and coordinates for analysis.

Unless cleaning, parsing, and geocoding addresses is one of your core business values, let somebody else do it. It's a lot of work that's never really finished. Find a good service, hand them your garbage addresses, and feel confident they'll do better than you could.
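For what it's worth, a minimal sketch of that layout, assuming PostgreSQL (the table and column names are invented for illustration):

    -- Keep the address exactly as the user gave it, and store the
    -- geocoder's output next to it, to be used for analysis only.
    CREATE TABLE delivery_points (
        id            BIGSERIAL PRIMARY KEY,
        address_text  TEXT NOT NULL,        -- used verbatim for mail/forms
        geocoded_lat  DOUBLE PRECISION,     -- NULL if geocoding failed
        geocoded_lon  DOUBLE PRECISION,
        geocoded_at   TIMESTAMPTZ
    );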


This is why two successive Deliveroo drivers went to the wrong street this week. The first 'tried to deliver' and gave up, and I had to run after the replacement one.

Despite the address and postcode being both correct and unambiguous the driver followed a pin to a badly geocoded coordinate, and didn't even look at the street name.

I'd encourage anyone (especially Deliveroo) to make more effort to model addresses in the correct locale (UK in my case) rather than taking these shortcuts.


I agree. We’ve built automation that ingests geocoding data from multiple services and keeps the system up to date. The models for an “Address” get complex; a single entity can have many different types of addresses.


My idea is not in contradiction with your idea. In fact your idea is a part of my idea.


Given that the person started their comment with:

> I've come around to this view, too.

I don't think they were disagreeing with you.


Concurring with a disagreeing view is disagreeing, isn't it?


No.


That was a rhetorical question. Of course it is.


I understand these issues. That's precisely why a custom type seems appropriate -- it should be able to cover all the alternatives while not burdening you with the problem of storing a discriminated union as a set of disparate relations. Or, at the very least, if this approach still has some issues remaining, it should still have fewer issues than any other approach I can think of, because there's nothing you can do with a JSON blob that you can't do with a custom type. It's just that the custom type is more likely to be much more efficient for the task, and it also keeps integrity checks as part of the type's implementation. And considering how often one needs to manipulate addresses in business settings, it seems a bit of a no-brainer to me that there should be some kind of first-class support for addresses, just like there's for example first-class support for datetimes with time zones these days.


> I understand these issues.

I seriously doubt you do. The closest you can get to “standard address format” is:

Address Line 1

Address Line 2

Address Line n

Postal code (which can be blank)

Postal Area (which can be blank)

Country

There is no way to build “first-class” support for addresses, because there's no such thing as a valid or invalid address, only whether or not someone can find the correct location by reading the address. Of course, that person should be a local and intimately familiar with local address conventions. Conventions that will change how you describe flat numbers, street names, and address line ordering.

> And considering how often one needs to manipulate addresses in business settings,

Anyone who’s ever had to deal with real addresses would know this is the one thing you avoid doing with addresses. It’s pretty much impossible to correctly “manipulate” an address, because again, there’s no standard, it’s entirely dictated by local conventions, which can change street-to-street, city-to-city.

The best you can hope for when you’re forced to mutilate an address is to make the mutilation simple and obvious enough that the human actually delivering the post can un-mutilate it when they read it.


  Address Line 1
  Address Line 2
Is the only way to do this, btw. Any other option is going to be broken in other locations. I’m thinking specifically of apartment/unit number, e.g.

  Street address 
  Apartment number
This seems equivalent, but due to country-specific differences, street and apartment is impossible to do correctly, or at least way harder than using line 1, line 2. This is because sometimes the apartment number should come first, and other times the street needs to come first, so the user has to second-guess your system in order to get mail delivered, while numbered lines are (hopefully) less ambiguous.


Yeah this is one of my big frustrations with many address systems.

  Street address 
  Apartment number
Is an (I think) American convention. Here in the UK you would have

   Flat Number, Building Name
   Street Name (or possibly Street Number Street Name)
Having addresses printed with the lines in the opposite order looks wrong to me. Additionally the American convention seems to assume all addresses are pretty simple, with there being an XOR relationship between street number and building name.


> Additionally the American convention seems to assume all addresses are pretty simple, with there being an XOR relationship between street number and building name.

It doesn't assume anything of the sort.

American addresses are a reference tag to direct the postal service where to send your mail. That's it.


I don't see how this changes anything. You can accommodate any address format, or any finite union of multiple address formats, including any computed or materialized views of the address with a custom type (AND including an "I give up" default for when everything else fails).

> there’s no standard, it’s entirely dictated by local conventions, which can change street-to-street, city-to-city

Well, that may be an international issue. In my country's case, it's quite clearly defined by law. A type for an international case might by necessity be a union of unions.


> In my country's case, it's quite clearly defined by law.

In theory the same is true in my country. But clearly someone forgot to inform the populace that not using the official standard is criminal, because I’ve seen plenty of “valid” addresses that don’t follow the standard. These addresses are clearly encoding local conventions, which makes decoding them using the official standard not only impossible but nonsensical, because manipulations that assume the standard will produce unusable addresses.

Best part is, I know this because I was responsible for the system that then had to munge these addresses into the “official” format for tax reasons. That code is 99% edge cases, 0.09% “well shit, just shove it all in the last line and hope for the best” and 0.01% “official” standard.


The type should always include provisions for manual fixes. IMO such an address type should even include a provision for manual geocoding for cases where automated geocoding fails ("This is how I write it, and this is where it's located").


In that case, what's the value of the system?

99% of the time an address is just a “unique” opaque identifier. You ingest the address, then you print it on the parcel. You avoid manipulating or trying to interpret it too much.

A system that attempts to codify addresses to a standard that can’t express all addresses (including their nuance) isn't useful. You can’t trust any analytics created from it, because by its nature, addresses that can’t be interpreted will appear in geographic clusters, and thus skew all your stats.

You can however collect addresses as opaque strings, and optionally request extra data of a known format (like a zipcode or postal code) which is generally considered part of the address. You can then produce stats only on those well-known identifiers, and ignore the rest.

But doing that doesn’t require a complicated address type, or supporting address manipulations or any other crap like that. It just requires a free text box, and a separate postal code box.

The most important thing to recognise is that any arbitrary address will fit many different address conventions, but each of those conventions will result in a different location. It's practically impossible to definitively interpret an arbitrary address correctly without significant amounts of additional local context. So it's best not to bother, and let the postal workers figure it out using their local knowledge.


Well, in my case, analytics for planning/scheduling of operations. I need to figure out which service points to cluster together. I need to point out that I specifically don't deal with parcels. Occasional individual outliers (some of these points DO specifically need GPS coordinates, because occasionally there's an item like "the side of a shed on a parking lot", where the parking lot doesn't have a postal address) can be dealt with, but having, say, 99.5-99.9% ("best effort") systematization is very useful, especially when looking at how to migrate former units of work (which might involve fighting some organizational structures in a large national company). Without this it's impossible for me to, for example, estimate the objective function difference between a system that routes operations completely arbitrarily with optimal route length and a system that uses somewhat sub-optimal routes but with vastly less "human complexity" (for example, with routes spanning a small set of roads) that doesn't need that much automation (a paper list of locations in order - the status quo of the former system).

> The most important thing to recognise is that any arbitrary address will fit many different address conventions, but each of those conventions will result in a different location.

In my case it definitely should not do that, even with different conventions. Maybe that's one of those international things.


What are the last 0.9%? ;-P


That might work if you have infinite resources and can create special handling for individual cities in the world or maybe even individual neighborhoods.

Mailing works in a very decentralized way and has a lot of local variations when you go outside of places that have put a lot of effort into standardizing addresses. Most post offices won't be looking at your whole address. They just care about understanding enough to forward it to a post office one step closer to the final destination. Understanding the final local address might not even use written data. It might just be tacit knowledge that's shared between a few local postal workers.


IMO that just makes it a very interesting problem to work on.

> Understanding the final local address might not even use written data. It might just be tacit knowledge that's shared between a few local postal workers.

To me there seems to be a contradiction in those two statements. By definition, an address is written. You can't decide where to deliver an item in any other way. If one and the same written text of an address could imply two or more "final local addresses", and somehow the delivery worker decided where the item is actually supposed to arrive, how would the sender indicate the alternatives if not by including it in the text of the address? Or did you mean something different by this?


> You can't decide where to deliver an item in any other way.

Only if you’re a robot, which postal workers aren't. They can use local context, such as no one lives at address X so they must have meant address Y. Or even, person at address X has a birthday this week, so this envelope that looks like a birthday card, and has their misspelled name on it, is obviously for address X not address Y.

I’ve had our friendly postal worker deliver post correctly to me, despite having a thoroughly munged and incorrect address, because she recognised my name, and knew someone with a similar name didn’t live at the more obvious interpretation of the incorrect address.

So address parsing and mail delivery is an extremely human and imprecise process. Full of nuance and edge cases that can’t even be observed, unless you actually follow the humans making deliveries and see what they’re doing.


> Or even, person at address X has a birthday this week, so this envelope that looks like a birthday card, and has their misspelled name on it, is obviously for address X not address Y.

That seems awfully contextual and ad-hoc. Surely this mechanism won't work in many instances unless you only receive mail on your birthday. It's a nice thing if it sometimes succeeds even when it shouldn't, but that's not something you can rely on. And should you get a different mail worker who doesn't know you, poof, your mail is gone.

> I’ve had our friendly postal worker deliver post correctly to me, despite having a thoroughly munged and incorrect address, because she recognised my name, and knew someone with a similar name didn’t live at the more obvious interpretation of the incorrect address.

Considering that this was presumably a problem with an address written on a physical item as a linear text, that's not quite in the purview of the problems that I'm trying to solve for my own application which needs to process physical addresses of objects (sometimes not even involving people in any way). So I can't comment on mail delivery specifically, sadly.


> That seems awfully contextual and ad-hoc.

Yup, but that’s never stopped someone from relying on a method in the past. The vast majority of addresses aren't written by engineers. If you sent a letter using an address once, and it worked, then most people will just assume it'll always work. How would they know any better?

> It's a nice thing if it sometimes succeeds even when it shouldn't, but that's not something you can rely on.

Have you seen the internet? Or even just HTML? The entire world relies on things working when they shouldn’t. We can talk all day about the merits of that approach, but it won't change reality.

> Considering that this was presumably a problem with an address written on a physical item as a linear text

The text was printed perfectly, if that's what you're saying. It was just wrong. Some system somewhere had attempted to manipulate it, ended up misinterpreting the original address, and produced something completely wrong as a result.

> I'm trying to solve for my own application which needs to process physical addresses of objects

That’s slightly different, and presumably you own far more of the process that's producing and interpreting these addresses. I’ve worked on systems that had to deal with addresses created by normal people, and let me tell you, normal people have a very diverse view on how to write addresses.


> That seems awfully contextual and ad-hoc. Surely this mechanism won't work in many instances unless you only receive mail on your birthday.

No, you're only more likely to receive stuff that looks like birthday cards on or around your birthday.

But, forget the birthday -- your earlier statement:

>>> To me there seems to be a contradiction in those two statements. By definition, an address is written

...is already contradicted if the postal worker just recognizes your name, and knows that you live at your actual address and not the one the written one more resembles. Your name isn't written in the address itself.


I know what to write on my envelopes so they get to me. You do not. Your job is to make it possible for things you send to get to me, not the other 3 units in the apartment.

Good luck with the Oakland addresses in Berkeley.


Yes, I said that the complete type would have to have components developed by others as well, since I can't provide input for other countries. Doesn't mean that a minimum complexity formalization is impossible (in fact, mathematically, by enumeration one such formalization must exist). An "as-given" component is obviously always going to work for you.


Addresses are first and foremost a social convention. Simple mathematics will not help you understand something with so much implicit complexity and contradiction.


What's the point vs repeating what people give?


This sounds to me like a "just draw the rest of the owl" solution.

You're right in that there's nothing you can do with JSONB that you can't do with a custom type, but the larger problem is "do I even know how to make a custom type that might fit all of the various inconsistencies that I might face with this data?"

The answer, for me, is no. I have no idea, and I'm not going to pretend. I can throw it all into a giant chunk of JSON and leave an explicit note that address handling isn't solved, and at least that way no one actually thinks that it is, which is probably safer for everyone involved.


In my case, I actually do know that; there's even a legal definition of what an address is and there's a national registry of all addresses. The problem is that I can't solve this for every country myself since I only have national knowledge (which is sufficient for my needs, fortunately).


Sounds like, for your purposes, you can define an address type that you can map to columns. But I think you're the outlier.

We have fairly regular addresses in the UK. You can enforce a country and a postcode. You can enforce at least two lines of the local address. That's where it ends.

I live right now in a house that's described as being in one village everywhere except by the local council, who address us by the other village. We're in a lane between the two. But the lane is impassable in the middle except with an off-road vehicle. Google Maps doesn't know that. Visitors get lost even with GPS and precise co-ordinates!

Before moving out here, I've frequently had to just enter "London" twice due to aggressive validation. I've even entered "London, London, Greater London". It works, but I know there's a senior engineer somewhere who's obstinate and wrong. Hell, I lived somewhere where flats B, C and D for our house number were a different door to A, because properties can be converted.

What is the value of structuring this data? Store a country, a postal code, and a multiline local field, and don't try to validate that field except maybe to clean up surplus whitespace.
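As a rough sketch of that "don't over-structure it" approach (PostgreSQL syntax, names invented for illustration):

    CREATE TABLE addresses (
        id           BIGSERIAL PRIMARY KEY,
        country      CHAR(2) NOT NULL,   -- the only hard requirement
        postal_code  TEXT,               -- validated loosely, may be blank
        local_lines  TEXT NOT NULL       -- multiline, stored as entered
    );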


When validation and parsing fails, there should be room for local fixes. There's an interesting question as to what extent should corner cases be code-driven and to what extent they should be data-driven. Your case definitely sounds like one of those things that would need to be fixed by hand and then associatively recalled (so that they'd need to be fixed only once).

> What is the value of structuring this data?

Well, for example I definitely need it for analytical purposes, so I have to try.


> Well, for example I definitely need it for analytical purposes, so I have to try.

Try accepting the inherent limitations of trying to analyze something so unruly and take your analysis with a larger grain of salt instead of forcing order on something so inherently ad hoc.


You have highlighted the problem I described in my original reply to the post, and although you have been pretty active in this thread trying to come up with solutions, you seem to summarise with "I don't know the answer" in this reply.

Try imagining your application(s) that requires addresses moving into new markets, like Norway, Sweden, Ghana or whatever.


This is one of those things that sounds good in theory and the reality gets in the way and is way too messy.


Like for so many questions, the answer is: it depends. For some applications an address is just an opaque string that needs to be spit back out at an appropriate point. If you have such an application it’s probably not worth over engineering some elaborate solution because someday someone might want to do something more elaborate with that data.

YAGNI


> I was under the impression that the correct solution in an RDBMS would be the introduction of a custom type for the address [...]

We were using MySQL, and MySQL didn't (doesn't?) have user defined types, so it was really not an option.

(This was back in 2013, btw.)


> [...] in some instances you might work with vast quantities of data, or deal with transactional systems that just don’t easily fit the operational limitations of relational databases. And in those cases, you should consider moving some, or all, of your data into a non-relational database.

This is WRONG. With capital letters.

This confuses limitations of (the codebase/the enterprise/the specific RDBMS) with the MODEL:

> And in those cases, you should consider moving some, or all, of your data into a non-relational database.

Making data "non-relational" is not a cure for the above. Is more often than not, a regression (like some on this thread: Using JSON instead of regular tables!).

---

Now, where we must focus instead is on making RDBMSs more powerful (or, more exactly, less constrained by the things that take away their power).

RDBMSs are hurt more by their interface (SQL) and by design decisions made 40 years ago, not because the "relational model" is the reason performance, scalability or anything like that suffers.

P.S.: It's also incredible how many people truly think the "relational model" is against dynamic data. Even SQL, which is a subset of it, is HIGHLY dynamic! What is a problem is that current RDBMSs lack an easy way to embed relations in relations, forcing a detour into sub-optimal, ad-hoc, bug-ridden hacks.


> Making data "non-relational" is not a cure for the above.

ACID imposes a limitation on scalability, there is no way to escape that. If you want to claim that approximately nobody reaches that scale, that's true, and if you want to claim that you don't need to ditch your data schema, that's also true. But relations aren't worth much without global coherence, and global coherence requires ACID, and will only work up to some size (that fits almost everyone, but not literally everyone).


Yeah, that is right.

But that is tangential to being relational. You can have relational data in a causal way. Now, I see that some think "relations" are PKs/FKs and that GLOBAL coherence is the only way to make relations work.

If we can take a JSON document and make it valid and correct only on this replica at this moment in time, relational data can too.


Honestly, I think you’ve misunderstood the previous post.


No. I was pointing directly at the quote. Because it is a common narrative: "if you have problems in an RDBMS, get out of the RDBMS and use NoSQL".

The previous post kinda says the same, but it's weird because it calls the quote great?


> No.

Yes.

You replied to my reply, where I used addresses as an example. I totally agree with you in general, i.e. use RDBMS as much as possible, but when it's no longer possible - what would you do in the case of addresses?


The big problem with addresses and all that is the same whether you use JSON or not (in the DB or in the app logic). I concede that using JSON in the DB and moving the logic to the app is OK, in part because SQL is an incomplete language.

But that is orthogonal to moving out completely to a NoSQL database. That is going too far (in this case).


This is exactly what I do - store location components as jsonb in Postgres


IMHO Blobs usually turn into a mess though. Some VP will ask "what weekday do users comment the most". With blobs this is drama. You end up parsing all these blobs and creating relational tables anyway. Sure you could do metrics separately -- but good luck predicting what stats you will want in the future. Need an admin portal? With blobs you are coding all these custom admin pages.

With MySQL, I love when new people join and they can see and play with actual tables - slap in a good SQL viewer and BAM, you now have an admin portal. VP wants stats? Give them MySQL access and Slack them any SQL they want: done.


> Some VP will ask "what weekday do users comment the most".

This is why people invented read models. Because the data model that works for the app and makes it easy to work with there might not necessarily work for analytics.

Generate analytics data from your app model so that both worlds can move independently.

> Need an admin portal? With blobs you are coding all these custom admin pages.

Admin portal with direct access to database columns? No thanks! IMHO, an admin portal should hook into your business logic so that all the constraints are validated. Unless your app is just CRUD.


Or the logic is in the DB, with validation triggers and constraints, but that is not always the case.


Nah, the database can and should check data consistency but not the business logic (otherwise your whole app is just SQL triggers and stuff like that). Also, business logic tends to change over time, but existing items do not.


With Postgres’s JSONB type you can query and index the unstructured JSON data just like regular columns. You generally want to avoid storing arrays of data in JSONB as much as possible, but unstructured objects are just as easy to query as regular columns with only a small performance hit.
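As a rough illustration (the table and key names here are hypothetical), querying and indexing JSONB in PostgreSQL looks something like:

    -- Assuming a table with a JSONB column, e.g.:
    CREATE TABLE user_addresses (id BIGSERIAL PRIMARY KEY, address JSONB NOT NULL);

    -- Look up rows by a key inside the JSON:
    SELECT id FROM user_addresses WHERE address ->> 'postal_code' = '10115';

    -- A GIN index speeds up containment queries (@>) on the whole document:
    CREATE INDEX user_addresses_address_gin ON user_addresses USING GIN (address);

    -- ...or index one extracted key with an ordinary b-tree expression index:
    CREATE INDEX user_addresses_postal_code ON user_addresses ((address ->> 'postal_code'));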


jsonb isn't a blob, JSON attributes are queryable and indexable like any column (in Postgres at least). And as somebody else says, read models exist for analytics use-cases. I wouldn't (at some scale) use the same DB for analytics as for production use anyway, it would require crazy indexes on the production DB and you'd have to take analytics into account for any schema change. Not to mention that anyway, quite often the prod DB schema requires domain knowledge to query and interpret correctly: read models can be built to abstract this domain knowledge.


I haven't used mysql this way but we do lots of reporting on jsonb data using the provided json functions of postgres. It is very easy to extract a (nested) key, or extract a set of key/values to a table value that you join to just as if it were a separate table.


I often put json blobs into MySQL tables. Modern MySQL has support for accessing the fields in the json from normal sql. You can also create computed columns that deref the json, and even index it.

Other rdbms often have even better json support.
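A rough sketch of what that looks like in MySQL 5.7+, with invented table and key names:

    -- JSON column plus a generated column extracted from it, which can
    -- then be indexed like any other column.
    CREATE TABLE events (
        id      BIGINT AUTO_INCREMENT PRIMARY KEY,
        payload JSON NOT NULL,
        city    VARCHAR(100)
            GENERATED ALWAYS AS (payload->>'$.address.city') STORED,
        INDEX idx_events_city (city)
    );

    SELECT id FROM events WHERE city = 'Berlin';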


Those don't sound like "blobs" to me -- that stands for "Binary Large OBject"; JSON is text, not binary, isn't it?


Why can't you make a proper, first-class supported custom type in your database instead?


I tried at first, google location data is somewhat inconsistent. I do pull out some of the items and stick them in a varchar column. And I also pull out and store long/lat as a postgis type (which is an amazing bit of tech btw).


I think your solution here was to move the typing to the application layer, which kind of makes sense because that's where you know the locale. But why also move the data to NoSQL? I don't see what that would add. If you had already removed the typing from the relational DB, I think that would have worked as well, or am I missing something?


The main problem was that we were unable to store the different types of addresses because of their different "layouts" in a "relational way."


I understand it might not be possible due to other constraints, but in smaller projects with international addresses I just used a text/memo field for the address and a separate field for the country.

Addresses now can be entered the way the user prefers.


This works fine for _small_ systems. But with large data sets, where you actually want to act upon the data, it's impossible to have it stored as text.

One lazy solution would of course be to index that data into Elasticsearch (or similar solutions), but you will end up with database records that have missing or invalid indexed data at some point.

A NoSQL solution is far from perfect, but for us it was _a lot_ better than just storing it as text somewhere.


I don't quite understand "was before RDBMS solutions supported JSON "... did (for example) MySQL not support storing JSONs as binary blobs or text before?


You could always store JSON as text, but you'd miss out on the ability to query over JSON efficiently, among some other things, see [1]. Maybe that's what made it useful/viable to GP.

[1] https://dev.mysql.com/doc/refman/5.7/en/json.html


Correct.

If only storing/retrieving was the goal, a text field would have been more than enough.

But we also needed to retrieve data based on their address, and to generate reports on a daily basis, which is where being able to do "JSON queries" would have made things so much easier.

So we put that logic in the application layer instead, something like:

    User
      - lookupByAddress(data)

    GermanUser inherits from User
      - overrides lookupByAddress(data)
...etc., so that we could be very specific about our queries depending on the country, or even _region_ in some countries.


Tradeoffs, and also circumstances where the trade off is made. It is all very well to say "X is better than Y" but in practice there needs to be an "at task Z" qualifier.

The relational data model is the best model for arbitrary data - by definition we know nearly nothing about the data and that gets us schema control, guidelines for how to normalise data and joins mostly for free.

In practice, unless you completely understand your data before starting work (i.e., nearly never), the first attempt should be to model it relationally. Then, if that proves unsuitable - or the scale is so large that even the relational model is too demanding - only then is the response to fall back to something else.


The choice of technology by developers and their managers is guided in large part by trends rather than by wholly rational decision process.

The type of database should be dictated by the application requirements.

Relational databases provide some exceptional guarantees while also being able to run quite large systems.

This means, when you have:

1. Relational data,

2. Queries that are not known beforehand,

3. Data that can fit one server or can be sharded to fit,

that RDBMS is probably the best choice for you.

You may not like SQL as a language, but at least there is a large body of knowledge on how to use SQL effectively for your problem, how different choices affect performance, etc. And a lot of very good tools to help you with that.

I have seen, time and time again, small teams "revolt" against SQL databases, choosing something like Cassandra or MongoDB. The effect is that the team now spends years learning the new database, complicates their application to provide the same functionality they got from SQL for free, and contorts the data to the new paradigm.

My team chose, years ago, before I came, to use MongoDB for what is a very relational problem. This resulted in huge duplication, performance issues and complexity on the application side.

No, the team no longer has SQL problems. Instead we have other problems that consume a large part of our focus, rather than using it to make the product better.


I default to choosing either SQLite or PostgreSQL until there's sufficient reason to think about using something else. I think it's pretty rare to build an application where you know you will have scalability issues with a relational model, but I can tell you with confidence that if you ever do need to make changes to how your data is stored, you will be happy if the source is a datastore that has enforced integrity guarantees. It's much easier to migrate data from one system to another when you don't first have to check that it's all valid in the first place.
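As a small illustrative sketch (PostgreSQL syntax, names invented), this is the kind of enforced integrity guarantee meant here:

    CREATE TABLE users (
        id    BIGSERIAL PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );

    CREATE TABLE orders (
        id       BIGSERIAL PRIMARY KEY,
        user_id  BIGINT NOT NULL REFERENCES users (id),  -- referential integrity
        quantity INTEGER NOT NULL CHECK (quantity > 0),  -- domain check
        status   TEXT NOT NULL
            CHECK (status IN ('pending', 'shipped', 'cancelled'))
    );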


True, I've also seen enterprises where central architects decide what framework/solution the whole company should go with.

The answer to questions like 'what is the best DB' online is always 'it depends on your use case', but in enterprises it's usually 'what upstairs decided'.


And the reason for this is there is usually "a guy" that mistakes his current love for X with X being better than every other competing product.

Sometimes what happens is that somebody buys X, X is expensive, and so now everybody must use X for everything (even if it is not strictly needed). Usually because it looks silly when X is being paid for on an ongoing basis but not being used for anything important.


For some reason, Java developers didn’t like writing SQL, so we introduced Hibernate which “does SQL for you”.

Hibernate creates appallingly bad SQL, so “databases are slow”. Particularly when using a getter on a lazy-loaded relationship. A query might end up taking 1ms per record instead of 10ms for 10k records.

You can rewrite all you want in Hibernate and greatly improve performance, but you often need to introduce Projections, lazy/non-lazy flags; in the end you program Hibernate more than you would have written basic SQL. Ah, also you’re writing JPQL not SQL, so you need to learn “how it’s written in JPQL”. But every Java developer is happy, because it’s Java. Phew, at least you didn’t write SQL!

- Any storage, even file or memory storage, can perform better on production than Hibernate.

- Devspeed is much faster without Hibernate. Source: I’m a founder, initiated a few apps, one is on prod making money after 2 weeks, the other one is still losing money after 18 months, guess which one uses React-Spring-Hibernate and which one used jQuery-Freemarker-Dropwizard.

- But if you want competent developers, React-Spring-Hibernate makes you look young and cool.

It’s sad, because frameworks are as difficult as the maximum difficulty our developers can handle, and if they’re not, they will add a layer. Conclusion: People have no love for databases because they’ve put too many layers before them. But they are not the problem.


Hibernate isn't at all meant to avoid understanding or writing SQL; rather, it is meant to map result sets of queries to managed object graphs, track mutations to the loaded graphs, and synchronize those changes efficiently back to the database.

The SQL generated by Hibernate is as good or as bad as you tell it to be; something like an inappropriate config causing the n+1 SELECTs you describe will instantly show up in any decent dev workflow. You're not bound to JPQL, you can very well use SQL directly too if that benefits your situation.

As any powerful tool, Hibernate ORM requires good knowledge of it in order to use it efficiently and effectively. Use it if you have the problems which it solves; don't use it, if you don't have those problems.

Disclaimer: former Hibernate core team member


C# person here and big fan of Entity Framework (but I have used NHibernate in the past).

I find the main benefit of ORMs is type safety. Provided you're using the ORM to also manage your schema (which you really should if you're using an ORM), the compiler can guard you from a whole class of mistakes. You're also able to change table and column names with trivial effort and without fear of missing an update to some SQL query string somewhere and only finding out at runtime that something is wrong.

While it's possible to write LINQ which results in bad SQL, if you didn't have an ORM it would still be possible for developers to write bad SQL anyway. Either way your developers should understand SQL at least a bit and be able to use something like SQL Profiler to ensure that their queries are performant. There is no substitute for competent developers.

There is a performance overhead to Entity Framework in the way it tracks changes to objects in contexts. If you're needing to work with a lot of objects from/to a database, then for these particular use cases you can opt out of some of these convenience features to avoid the performance costs.

ORMs are an incredibly useful tool if the trade-offs make sense for your project and you know how to use them correctly. But as with any tool, if you use it inappropriately, you're going to have a bad time.


I've been using Hibernate for 17 years and never had any performance issue with it.

Want to know my secret? It's very simple. I just read the manual in 2004.

After using it successfully for so long, I wrote a best-seller about how to get the most out of it:

https://www.amazon.com/High-Performance-Java-Persistence-Vla...

And, just because afterward I even worked on Hibernate to add all sorts of performance improvements to it, it doesn't mean I don't love SQL. In fact, I happen to run a High-Performance SQL training too:

https://vladmihalcea.com/trainings/high-performance-sql/

Your comment is based on a very common misconception that Hibernate aims to replace SQL. If that were true, then why do you think Hibernate has been offering the createSQLQuery (a.k.a. createNativeQuery) since 2002?


What ORMs have taught me: Just Learn SQL

https://wozniak.ca/blog/2014/08/03/1/index.html


I like how the very first paragraph walks back the title.


Building SQL queries by gluing strings together is tedious and error prone. So you write a lot of helper functions to build queries for you and pretty soon you've invented your own crappy ORM.

Why not save a lot of time and bugs and use a battle tested ORM and drill down to SQL for the queries that really matter instead?


Or you rewrite your queries so that you can use prepared statements. You’re right that we should avoid gluing strings together, but in most cases we can use prepared statements.
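In SQL terms (PostgreSQL syntax, with made-up table and column names), the prepared-statement form looks roughly like this; in practice you'd usually get the same effect through your driver's parameter binding:

    -- Prepare once, with a placeholder instead of a glued-in value:
    PREPARE find_user_by_city (text) AS
        SELECT id, name FROM users WHERE city = $1;

    -- Execute with the actual value; no string concatenation involved:
    EXECUTE find_user_by_city('Berlin');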

There’s also a middle ground where developers learn to use the ORM better. I’ve seen people get terrible performance using the Django ORM, but after a rewrite, redesigning the queries and using the more advanced features, performance would improve massively. The problem is that you need to be able to write the SQL and then mentally backport it to the ORM's syntax.

We’ve had customers complain about poor database performance. When we find the horrible queries generated by their ORM, it’s frequently easier for the developer to just request more hardware or ask if we can “performance tune” the database.


Prepared statements are a good tool to have in your arsenal but for the majority of the queries you need an ORM will be fine. It doesn't absolve you of the need to understand SQL but it does save a huge amount of time and generally results in easier to follow code.


We’ve had the opposite experience. We had a mix of ORM code and raw SQL, and we’ve found that the SQL is much more readable. New code is all being written in SQL. The exception is insert and upsert queries, which only ever write to one table and involve a lot of boilerplate.


There's a difference between an ORM and a query builder. Full-blown ORMs often allow you to pretend that all you have are "regular" objects that just happen to do database queries when you access them; and that's the kind of ORM that often leads to inefficient database access and fragile code when the program throws an exception when you just wanted to iterate over a list because that iteration actually caused a fetch from a database that happened to have a failover event.

With a query builder, you can still use the database in a type-safe way and without gluing strings together, but accessing the database is also more explicit, so mistakes are harder to make.

It's fine to represent the result of a database query as a sequence, but its type and interface should be different from that of e.g. a list data structure, because it can fail. Similarly, if you have "User" objects representing users, it needs to be explicit which methods fetch data from the database and which use only in-memory properties that code may freely access. Preferably, have different types for "UserInDB" and "User" altogether, with some glue to go between them.


In Java, you don't build SQL queries by gluing strings together. You use https://www.jooq.org, of course.


ORMs are made primarily for OLTP workloads, and they are the primary means of converting SQL results into objects.

There is nothing inherently slow about Hibernate (other than people not knowing how to use a tool and blaming it), and frankly, providing a unified SQL dialect that ports across databases is a plus.


I've worked on a real-life product built using MyBatis, which you'd think is supposed to be best of both worlds - non connected DTOs, seamlessly filled out from SQL statements you write in the mapping configuration, etc etc.

In practice the developers go overboard with SQL - the native-first SQL actually makes for a quite confusing data model. There is value in the clarity that the limitations of the Hibernate model bring or promote.

With MyBatis approach the underlying DB can contain incredibly bizarre joins, crazy FKs, you find aggregation functions in queries trip you up, massive views used under the hood, triggers make appearance to confuse you, the list just goes on.

(The reason this happens is that, as the years tick by, extra requirements get fitted into the SQL model by doing these "clever hacks" and avoiding re-architecting. The kludges pile up, it's done because it's possible to do so, and every little decision seems like a fair tradeoff when made in isolation. End result = massive pile of confusion.)

You think staying close to SQL is the salvation, but really it's just another way to hang yourself. I will not argue that it's impossible to do excellent work, but in no way is it guaranteed.


Not to be rude but your comment can be summarized as „one can mess up with SQL as well“, which is kind of „duh“. Absolutely every technology can be criticized this way. The more important point is that the layers on top of SQL always lead to horrible performance as soon as something non-trivial is done. And the trivial parts would’ve been trivial in SQL as well. People just really hate learning SQL to the extent that they implement their own, worse query language instead of wielding SQL with skill.


The term NoSQL is meaningless. It only means that a database is not SQL (d'oh) but people (such as the author of the article) use it as if it meant anything beyond that.

Talking about "NoSQL tradeoffs" implies all non SQL databases share similar features, operational models, use cases, etc, which is simply not true. For example, DynamoDB, Mongo, and Fauna have absolutely nothing in common.


At this point we all generally understand that NoSQL means a system that lacks one or more of: the SQL language, a relational model or ACID guarantees. It’s a useful shorthand for all of that.


If that's the case, it wouldn't make sense to talk about NoSQL tradeoffs as if all NoSQL dbs were actually similar in any way.

Eg:

Fauna is considered a NoSQL database and doesn't have any of the drawbacks the article mentions. It has ACID guarantees, a relational model, and strong consistency.

Mongo and Dynamo also offer transactions with ACID guarantees these days.

Etc.


What's the benefit then? Just use a CSV file. It's usually faster than any of the "NoSQL" systems out there, and used to be on par in terms of consistency and data-safety with mongodb. Actually better if you only append data.


Your comment made me read the article, and I find the article’s language sufficiently abstract, as it says things like “many NoSQL databases”, etc.

Perhaps your comment was meant to say that in general talking about tradeoffs can fall into that trap, but the article here looks like a good discussion


> I find the article’s language sufficiently abstract, as it says things like “many NoSQL databases”, etc.

I'd say it's more vague than abstract.

For example, what dbs is the author referring to when saying things like "NoSQL databases generally make tradeoffs around these guarantees." when referring to ACID?

This seems to be an outdated view. These days, all major NoSQL databases offer transactions with ACID guarantees.


Naming in tech is shit anyway. It's loaded with marketing mumbo jumbo.

NoSQL should mean: No-SQL, No SQL query language, and therefore no required implementations of the SQL standard (tables, relations, transactions etc).

A lot of "NoSQL" databases actually include an SQL layer.

Heck. CSV is NoSQL too.

The transaction model is one of the things which would be nice to be able to select (eventual-consistent, non-consistent, consistent). In case you have different performance requirements. Similar to UDP vs TCP. But again, it has nothing to do with no-sql.

There was a term - object database. But it was old, so it couldn't be used. Then it was document store / database.

Caching and naming things are the most difficult parts. But instead of selecting a name that makes sense and reflects a system/architecture/whatever, we let some marketing people (read - advocates) promote a new, SEO-clean name.


> The term NoSQL is meaningless. It only means that a database is not SQL (d'oh) but people (such as the author of the article) use it as if it meant anything beyond that. […]

Except that it does not. It does mean „Not Only SQL“, and not „no SQL“.


I think you're wrong. I distinctly recall NoSQL being hailed as a move away from SQL, as in you'd never need SQL again.

This turned out to be misleading, so the Not Only SQL was suggested as an alternative name.

See e.g. https://hostingdata.co.uk/nosql-database/


I think that is a more recent backronym and was not the original, so it is wrong to say that what the above post said is incorrect.


'backronym' - like it, thank you.


@chrisandchris is correct. NoSQL means "Not Only SQL"

Furthermore, the article is uninformed and writes as if "NoSQL" is an alternative paradigm to SQL. In fact, NoSQL covers a whole range of paradigms and approaches, from key-value, to document, to graph, to more exotic flavors. Some of which can even be queried with SQL

ACID can be a feature of other database paradigms as well, if necessary. With MongoDB Atlas, for instance, an engineer can ensure that data consistency is high priority across clusters. Or not, if that's not important.

On top of all that, table-based database management systems are designed to prioritize saving hard drive space over cpu cycles. As cpu cycles have become more expensive relative to "hard drive space", the need for this kind of database has declined.


> NoSQL covers a whole range of paradigms and approaches

Exactly, and that's my whole point.


> > The term NoSQL is meaningless. It only means that a database is not SQL (d'oh) but people (such as the author of the article) use it as if it meant anything beyond that. […]

> Except that it does not. It does mean „Not Only SQL“, and not „no SQL“.

That's a (silly, IMO) retronym. "No SQL" means literally no SQL in English, and that's all the original "NoSQL" DB evangelism meant, too. Later, as they found their paradigm more or less sucks, the proponents retrofitted -- more or less hastily, frantically, or desperately -- SQL to it.

The original dBase and Paradox formats were also databases, and didn't have SQL: They were the canonical "NoSQL" databases. Are you claiming they're now somehow "Not Only SQL"?



Let's just rewrite history and state that some events were different from what they actually were.

You're tech-gaslighting


A far more nuanced article than the clickbait (not a problem to me) title implies. Also the clearest definition of ACID I can recall.


Relational databases may be ACID, but no, they do not "give us" ACID.

Take LMDB as an example. LMDB is a fast low-level KV storage engine. NoSQL here, but with full ACID semantics & usable as a backend for whichever DB flavor you so wish to implement. LumoSQL and the older sqlightning are sqlite implementations backed by LMDB.


I feel like relational databases improved much since MongoDB came around.

Since then there have been tons of small projects, but none handled atomicity and solidness as well as SQL databases.


I don’t understand why we still need this kind of article..

After all these years, I thought the advantages and trade-offs of different databases would be well understood. But the fact is, there is still a lot of misinformation floating around.

It looks like the lessons we have learnt are not communicated to broader groups of engineers.


I'll provide my own anecdote - I was a part of a team that ended up choosing a document-store for a small service that services ~500 users (total, concurrent figure is far lower), where writes are uncommon, and where we ended up building ourselves all the tooling that common SQL tools offer for free. Why? Because people think RDBMS are dead. No matter how much I tried explaining it was the wrong choice - we still went for it.


I've made this comment a few times on HN, one very recently - so apologies if someone reads it twice now - but your comment really wants to make me do it again.

When I talked to my dad about RDBMS he was like "weeeelll, sure, there's things like DB2 UDB that do relational but performance wise, nothing beats reading the data straight by key in exactly the format you need.".

DB2 UDB: "Initial release: 1987; 34 years ago". I.e. what he would rather use and is sort of the NoSQL equivalent is _even older than that_. DBM (Ken Thompson - released by AT&T in 1979) comes to mind, tho I don't remember exactly what it was he was using/referring to, which would have been something that would rather run on S/360 and S/370 systems. It's been a while.

Background: he started off with 360 assembler and worked all his working life on IBM Mainframes and the various technologies in and around it. They had it all and they had it before it came to "us". We're just re-inventing most of these things on much cheaper and more open hardware and software.

All that to say: We still need these kinds of articles, because people "tend to forget". Or not even check "prior art". There was a recent article and comments around even research papers essentially being re-done and presented as novel research. And on some level that is even correct, because the authors genuinely came up with the same ideas and research as the original authors did. But 20+ years after the fact.


That would make MUMPS implementations Coelacanths? Or maybe cephalopods?

https://en.wikipedia.org/wiki/MUMPS


A database being relational is (mostly) orthogonal to it being ACID. ACID is generally a property of the underlying storage engine (a key-value store, generally).


SQL is just a label for a whole bunch of query language dialects.

Schema, stability, performance, fault tolerance, and the full laundry list of things is available in either form.


But MongoDB is web scale.


> Relational DBs are Sharks

> OOP is the Roman numerals of paradigms

> C is a PDP assembler that thinks it’s a compiler

Yup, everything popular is evil and bad. Run away to your ivory towers.


You might have missed the fact that sharks are awesome.


Facts. It's also true they kill nearly six times fewer people than vending machines do.

Sharks are healthier than vending machines.


[flagged]


We've banned this account for repeatedly posting unsubstantive comments and ignoring our request to stop.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.


It doesn’t seem like I’m banned


Comments by banned users still go through, but they're only visible to you and any other users who have "showdead" turned on in their profile.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.


Another consideration is that, at scale, NoSQL is way cheaper. I run a service with approx. 900k daily users, each generating about 210 object writes and reads that need to execute within 50ms, and I am running this on Firestore for about $350 a month, incl. ELB, WAF, regionally replicated compute, managed NLP and Translate. I sync the NoSQL stuff to BigQuery for analytical usage. Cheap, and it scales without any problems as long as one observes the recommended patterns. And all of that scaled from a 5-user POC without a single adjustment in the infra setup or architecture.

Not doing relational again.


> Another consideration is that, at scale, no sql is way cheaper.

Define "scale" and "cheaper".

I have some SQLite databases that are chewing through over 1K writes per second in production (WAL=on). Many of our transactions resolve within 10-100 microseconds because the application process and the database process are the same thing.

SQLite seems to be one of the cheapest options available, being totally free and of negligible footprint. It's also exceedingly cheap in terms of engineering complexity and scary unknowns. Something about being literally the most deployed piece of software on earth seems to help.
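For anyone curious, the pragmas typically involved in a setup like this look roughly as follows (a sketch of a common WAL configuration, not necessarily this exact deployment):

    PRAGMA journal_mode = WAL;      -- readers don't block the single writer
    PRAGMA synchronous  = NORMAL;   -- fewer fsyncs; generally safe with WAL
    PRAGMA busy_timeout = 5000;     -- wait up to 5s instead of failing on a lock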

NoSQL reads like a religion to me these days. Starting with something that is actively trying to tear itself apart due to the inherent complexity seems like a horrible way to build a 5 user POC.


> I have some SQLite databases that are chewing through over 1K writes per second in production (WAL=on).

Is that with synchronous full(2) or normal(1)?

What's your SQLITE_THREADSAFE mode?


> each generating about 210 object writes

Would it be 210 writes in a relational DB though? Our product at work started on Firestore, and one of the main problems was that it was causing us to do far more operations than we would otherwise have needed. E.g. in a relational DB you can do UPDATE WHERE or DELETE WHERE in a single operation. In Firestore that’s a get followed by an update/delete for each record!

I guess it depends on your workload, but we’re the opposite: never doing Firestore again!


That's more the benefits of cloud rather than nosql though.

200 writes/second is not even scale, a single instance would handle that fine, with beefy hardware you could do 100x that.

You could even run it in cloud, for about $50/month, and as a bonus use the same instance to perform the analytics which will be real time.

Not to mention enjoying much better data integrity.


Which cloud lets you rent a beefy SQL server for $50/mo? This is actually really useful info to me, as we're looking at needing to migrate to a new provider soon.


I mean you're going to have to be more specific than that, some clouds have high cpu but expensive disk, others the reverse. $50/mo will not get you both, but it will easily get you 200tx/s.


I was thinking low disk costs. There doesn't seem to be much need for a CPU other than input validation. Although I guess currently our analysis is mostly done offline with a downloaded partial data set. It would be nice to do it in the cloud.


I've seen some NoSQL object/schemaless database turn into intractable balls of mud over time. Things that would be trivial in a relational DB with joins can turn into major engineering efforts and data corruption can be very hard to clean up.


And what do you think that would cost with an RDBMS? I see approx. 2000 writes and reads (presumably combined) a second. Depending on how big those objects were, it seems doable in a single computer running SQL.

Also, and this is neither here nor there, I thought Firebase hit a hard user limit of 1,000,000 daily users.


You’re comparing a technology with a product. Would it be $350 a month if you had to run all that by yourself?

You’re enjoying vast economies of scale and for practical terms it’s fine. For now all those bits with cute names aren’t dinosaurs but for damn sure they aren’t sharks.



