Warning: most cloud providers (Google, Amazon, Microsoft) require you to accept unlimited liability to use their services.
If you're running a business and you have lawyers, then fair enough — just play the game. But for individuals, it seems crazy that so many of us accept this sort of thing. Good luck contesting the charge with your credit card company when you already agreed to a contract that said Google could bill you thousands of dollars and then you used thousands of dollars worth of their service.
Big cloud providers are not your friend. They do not care if they destroy the lives of you and your family, unless it's happening so often that it's making mainstream news.
My advice is to go and delete your cloud accounts, and only use services that offer hard spending caps, and ideally prepaid accounts.
Maybe this doesn't leave many options. Oh well. Maybe if you can't afford big lawyers then you also can't afford the risks of using big cloud.
This is just a single data point but I had a surprise bill with Google. I talked to the support and got it waived off.
I used Amazon EC2 instances for years and I always felt in control. There were never any surprises. I knew even in the worst case situation I would be okay because I had faith in the Amazon support. With Google I felt insecure. I never played with any of Google cloud services since then.
Amazon's customer first policy is really true. They try their absolute best to make sure there are no surprises to a great extent. Even the UI is very intuitive.
Same here - incidentally, it was also one of the weirdest interactions with customer support I've ever had. I suspect the first point of contact was some sort of LLM/chatbot that desperately wanted to make sure I was feeling fine and that there was nothing to worry about. When I was forwarded to the billing support team the interaction went back to normal: a couple of messages back and forth, some homework to set the real budget limit (the quota is just for alarms), and they waived the charge.
Same here. GCP waived a surprise bill of $4,500 when I accidentally left a TPUv1 running for a month many years ago on a personal project. (I was just toying around with the new TPU for an hour or so in my own free time, and didn't realize that unlike a GPU, the TPU has to be shut off separately from the CPU/VM or else it keeps charging by the hour.)
Amazon definitely also has its share of billing issues.
A personal example would be that we reserved an instance based on information given by our AWS account manager.
Said instance turned out to have exactly the issue my original question to the account manager was about; he had answered incorrectly.
The reserved instance team then refused to refund us, but also refused to say how much they would prorate if we were to upgrade instead.
I simply don’t accept this argument, primarily because the way AWS handles NAT gateway fees is really only explainable as something that is designed to be predatory.
Yeah, I have spent much more than $14k to date and would have spent much more over time, losing my business isn't rational. I think it's just another "Google can't do customer support to literally save their life" example.
All of the cloud services I have are set up only with privacy.com cards. I have each individual card limited to just above the expected monthly spend. Even if there's a (reasonable) spike I can see it, and I have to take manual action before the charge will go through.
That's not what privacy.com does or is for. They advertise it, but I've had transactions blow right through the façade. Specifically, the New York Times: after my trial subscription ended, I watched the stupendously expensive charges bounce, but they kept trying, eventually tried a different way, and it went through.
I emailed support, and here's what I got back:
> Hi, $firstname. I've been reviewing your dispute and wanted to touch base with you to explain what happened.
> It appears that the disputed charge is a "force post" by the merchant. This happens when a merchant cannot collect funds for a transaction after repeated attempts and completes the transaction without an authorization — it's literally an unauthorized transaction that's against payment card network rules. It's a pretty sneaky move used by some merchants, and unfortunately, it's not something Privacy can block.
What's interesting is that they seem to be glossing over the truth. It's not unauthorized, per se, it's using a prior authorization code. And it's intended for processing offline transactions. It seems like 'force' is an industry term and a bit hyperbolic when used in lay discussion.
>It's literally an unauthorized transaction that's against payment card network rules. It's a pretty sneaky move used by some merchants, and unfortunately, it's not something Privacy can block
Have you found a site that does "block" this? Did you communicate with your credit card company about this? I am wondering
Use a prepaid card that you bought at a grocery store a few cities away from your hometown with cash while wearing a mask and not bringing any phones with you or driving a car that logs its location or beacons any identifying signals.
I think that might finally allow you to pay for the New York Times on your own terms and not worry about their hounds sniffing you out.
Having talked to credit card issuers about this, what they told me was to close the account. They said they had no way to ever stop the charges from coming in.
In my case, even closing an account wasn't sufficient. A charge posted to a credit card I'd closed more than a year prior, and the card issuer was legally obligated to process the charge because of the renewal contract I had apparently signed with the merchant. This led to a single late payment, which in turn caused my credit score to tank by ~90 points just as I was applying for mortgages. I try not to think about what that, plus the year of waiting while mortgage rates climbed to nearly 6%, will have cost me if I'm lucky enough to outlast my thirty-year fixed mortgage.
Edit: and the dark Lord surely reserves a particularly unpleasant circle of hell for loan officers who encourage borrowers to consider a 5/1 adjustable rate because "we know rates will fall next year."
Doesn't stop them from trying to collect after the transaction is declined. It's not a prepaid service, you're agreeing to pay the charges _after_ you've used the service.
Will they pursue? Do they have enough info to pursue? Who knows, but they can if they want to.
This is very much not what privacy.com is for, and it won't protect you from $14k in BigQuery bills. There is no clause in the GCP contract (or any other contract, for that matter) which says "if your payment method is invalid when we go to collect what you owe us, we forfeit all right to be paid."
For small charges they might just give up because it's not worth it, but when dealing with a $14k bill you should assume that they will at the very least hand the debt off to a collections agency if you try to just ignore it.
You're still liable to Google/whoever for the full amount, so it is only a temporary reprieve. Which can be useful, but does not solve the main problem.
IANAL, but if this happened to me I would be gathering as many examples as I could of this having happened to other people. The angle being: Google knows this is a huge issue. Effectively, they know that they have (presumably accidentally) created a really dangerous trap for small players, and have chosen to do nothing about it.
In some jurisdictions I think that reduces the legitimacy of their claim that you actually owe them money.
EDIT: Even better, focus on the examples where Google "forgave" the debt; you could argue that those examples prove that Google knows it's at least partly their fault.
I think we (the developer community) need to start pushing back against this abuse; it's getting out of control.
The thing that bothers me the most is I caught this $14k charge b/c I'm a small fry and that money matters to me. How many big accounts just wouldn't notice that? I can't help but think a very non-trivial % of all cloud revenue is just obscure fees that nobody notices - engineers doing the engineering, accounts receivable pays the bills, and the cloud providers get fat.
I honestly think it would be better if they didn't have the option to "forgive the debt" — at least without following up by eliminating the trap that created said debt.
How often is one of these accidental debts created? How often do customers just pay up because it's small enough that it's not worth fighting? How often does AWS (or Google or whoever) decide whether to forgive the debt based on PR damage control rather than the legitimacy of the debt? Jeez I hope someone leaks those numbers one day.
It reminds me of all those horror stories of hospital visits in the USA, where the first bill you receive is just a test to see if they can squeeze that much out of you, but if you know what you're doing or just can't pay then the actual bill is way lower. It's all just yucky.
If big cloud providers couldn't selectively choose which of these debts to enforce, I bet there would be a media shitstorm and then they would suddenly discover that it's not all thaaaaat hard to implement real time billing and hard caps after all.
Well, the "trap" is the lack of hard limits which, if implemented, would enable some companies to blow up their businesses. Which arguably is a better outcome than people who can't afford it getting big bills. But it is a tradeoff even aside from the providers arguably collecting some money people didn't intend to give them.
To be honest, even the official guide [1] for BQ doesn't include any information about query costs, budgets, or service-limit mechanisms [2].
I think the HTTP Archive team could add something in that regard.
PS: When I was an instructor for some AWS cloud training, the first 2 hours were spent solely on setting up billing and budgets to avoid exactly this kind of situation. No one would start the training without all those locks in place.
Yeah, I'm basically just having to write this off, so it sucks for me (a lot: I'm bootstrapping a startup), but I'm more worried about other people (especially students) getting caught up in what feels like a scam, given that the language on the website doesn't, ya know, mention the risk of being charged $14k.
The getting started guide linked by the website states:
> Note: The size of the tables you query are important because BigQuery is billed based on the number of processed data. There is 1TB of processed data included in the free tier, so running a full scan query on one of the larger tables can easily eat up your quota. This is where it becomes important to design queries that process only the data you wish to explore
Could this be a bigger warning? Sure.
Is something a scam just because they don't explain the general implications of entering your payment information to a usage-billed product? Not really.
There's "scam" in the sense of "it didn't do what they said"/"charged me more than they said", and there's a more colloquial "scam" where the UI is designed to obscure the cost of a task (quintessential dark pattern stuff). I don't think the reporter is saying "they lied about big query", they're saying the UI is set up to make extremely expensive mistakes very easy, and it's set up to hide the actual cost of the query.
Estimating the total cost of a query is obviously fraught, but from the UI and other comments it sounds like BigQuery knows up front how much data a query will require, and there's at least a minimum cost per TB, so the UI could just say "this will cost at least $X". Instead it shows a very basic "this will process X PB of data" text. So they're charging by TB but showing the usage in PB, which is a) a 1000x smaller number, and b) visually similar to "TB".
It's very hard to see that as anything other than "designed to obscure cost": there's no reason not to say "this will cost $X" when the cost is per TB; the pricing is per TB but they're showing PB; the checkbox and the textual description are smaller than the other text on the page; and there's no ability to specify a cost cap.
I understand the argument against hard circuit-breakers ("yeah, seems like a good idea, but then I had a good traffic spike and went down"). But it makes even me cautious about scenarios where I could just fat-finger something. There are some controls, but there are no guarantees in most cases.
This website makes it seem like this “public” dataset is for the community to use, but it is instead a for-profit money maker for Google Cloud and you can lose tens of thousands of dollars.
Last week I ran a script on BigQuery for historical HTTP Archive data and was billed $14,000 by Google Cloud with zero warning whatsoever, and they won’t remove the fee.
This official website should be updated to warn people Google is apparently now hosting this dataset to make money. I don’t think that was the original mission, but that’s what it is today, there’s basically zero customer support, and you can lose $14k in the blink of an eye.
Academics, especially grad students, need to be aware of this before they give a credit card number to Google. In fact, I’d caution against using this dataset whatsoever with this new business model attached.
The real issue here is that you didn't quite understand what BigQuery was when you pressed the button.
What it is, roughly, is a publicly-accessible data supercomputer. If you lost $14k in the blink of an eye, then I would think you consumed at least $4k of Google's actual resources -- maybe $7k. Maybe more. That thing can move some serious data, and you apparently moved around over 2PB.
Google bears some significant responsibility for not making the cost transparent to you, it's true. But on the other hand, don't they deserve some significant credit for making such awesome power available to a lowly peon with a credit card?
This happens because Google hides the query cost behind its abstracted "TBs scanned" (for their data format, not even open-source so it's hard to estimate in advance) or even worse "slots" mechanism. Only a fraction of people try to understand how much these slots cost and most of them are the people who got an unexpected bill after using BigQuery and became more aware of how the product works.
If GCP returned the query cost in the API and showed it directly in the console when you run a query, it would be much easier for their users. Unfortunately that's not in Google's interest, for obvious reasons.
Exactly. Even after seeing the issue I can't make heads or tails of what the hell "TBs scanned" means relative to row counts, etc. Likewise, it assumes a lot of knowledge about what the tables include; on a dataset you didn't build yourself, how can you know the tables are optimized to lower your costs? Hell, how can you even know what the costs are?
"TBs scanned" is the number of tebibytes of stored data that the system had to scan to serve your query. This is how BQ is billed, in the on-demand model.
The console shows you this number (in very small letters) after you have entered the query but before you press go. In the on-demand billing model, which is what you were using, you can multiply this number by $6.25 to understand your query cost, exactly.
It's a design that's hostile to new customers, I agree. But it is comprehensible.
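To put numbers on it, a quick sketch of the arithmetic (Python; uses the $6.25/TiB on-demand rate above, and incidentally squares with the "over 2PB" estimate elsewhere in the thread):

    RATE_PER_TIB = 6.25  # USD per TiB scanned, on-demand model

    def query_cost_usd(bytes_scanned: int) -> float:
        return bytes_scanned / 2**40 * RATE_PER_TIB

    # ~2,240 TiB (about 2.2 PiB) scanned comes to exactly $14,000:
    print(query_cost_usd(2240 * 2**40))  # -> 14000.0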
There should be a cost estimate displayed prominently by default, and an option to turn it off for power users who know what they're doing (but keep the current less-prominently displayed amount of data estimate).
If the latter... I'm not sure that it's explicitly against the rules, but coopting a name of something as your handle just to complain about it is in poor taste and probably should be.
> The worst part is you posting this to hackernews under the username ‘httparchive’ to make it look like it was the httparchive posting this themselves.
This was the last comment in TFA, so it seems like they just used it because it was the topic...
BigQuery provides various methods to estimate cost:
Use the query dry run option to estimate costs before running a query using the on-demand pricing model.
Calculate the number of bytes processed by various types of query.
Get the monthly cost based on projected usage by using the Google Cloud Pricing Calculator.
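For what it's worth, the dry run is scriptable; a minimal sketch with the google-cloud-bigquery Python client (the table name is a placeholder):

    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    # Dry runs are free: nothing executes, but the job reports exactly
    # how many bytes the query would scan.
    job = client.query("SELECT url FROM `some.dataset.pages`", job_config=config)
    print(f"Would scan {job.total_bytes_processed / 2**40:.2f} TiB")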
When I use the BQ interface, it estimates the bytes for each query in real time before I run it, does that turn off if the query is too big? I guess that isn't directly a cost estimate, but if I saw hundreds of TB I'd think twice before hitting "Run"...
Well, sure. But it is convenient to have lots of sample data. Also you get the first TiB per month free in BQ.
Also note that anyone can make a dataset available for public use, where they pay the storage and the consumer pays the compute. The official Google datasets are just curated and maintained by Google itself.
If you're going to make a throwaway account to criticize a website, you shouldn't use that website name as your username. That makes you look like a troll even if you have legitimate complaints.
I used this data when I was a grad student, back when there wasn't a fee for it, so I'm mostly concerned students will get hit with charges that will make it so they can't buy groceries.
The website has the Internet Archive logo on it, and it looks like a public resource for researchers, and it used to be free to use.
The point of this is for the HTTP Archive to make it clear this is a paid product from Google Cloud, not a "public service".
There are multiple notes about cost. In particular, this one stands out.
> Note: The size of the tables you query are important because BigQuery is billed based on the number of processed data. There is 1TB of processed data included in the free tier, so running a full scan query on one of the larger tables can easily eat up your quota. This is where it becomes important to design queries that process only the data you wish to explore
You have to give them a credit card in order to use the free tier, and they refuse to implement any features that would let you add safeguards (beyond setting an alert so you can find out after you've already spent the money).
Edit: I apologize; they did in fact add something beyond alerts: https://cloud.google.com/billing/docs/how-to/notify#cap_disa... ...which is less them implementing a feature and more telling you how to badly implement it yourself. I don't believe this changes the gist of my comment, but it is worth pointing out in the interest of precision.
Edit 2: Per https://news.ycombinator.com/item?id=39447499 , GCP actually does have a way to cap some resources. It still strikes me as the most "how can we technically claim to be supporting that feature request while still making it as easy as possible to spend more money than you intended to" but there it is.
There are countless companies who specialise in managing cloud costs because of how difficult it is to know when and for what you are going to be charged. Especially for things like data transfer.
And by default they don't have a daily spending limit so it's very easy to see a major cost over-run at the beginning.
the data is a public service. the platform allowing you to query it is not.
you can print at a public library. each page costs a small amount. the printer in that case is the service. and if you print out millions of pages, you may owe hundreds of thousands of dollars.
slow down a bit, lest you blow off your other foot.
I frequently see these kinds of surprising billing anecdotes across many cloud providers. Why don't they provide a way to set a hard budget limit applied to the entire account? I tried to see what can be done for GCP, and it seems pretty daunting.
The reasons are probably quite complicated; some are bound by hard technical limits on how quickly a system can react, and thus on whether a hard limit can actually be hard. But realistically that's largely solvable by making it a softer hard limit (e.g. you set a limit of $1000 and the terms say you pay that plus whatever is used before the limit kicks in: more than $1000, but way less than $14,000).
All of those technical reasons aside though, the commercial reason is obvious - people's mistakes and overages are a great source of revenue and profit. Companies refund the times where it'd be enough to lose the customer, or when it hits HN, but they make more money every time someone pays up. They have no incentive to fix it. It's part of the business model.
There is also the fact that if a company has critical systems go down because GCP hit some hard budget limit, it will be reported in the press as "Netflix down globally due to issue with Google Cloud".
Google doesn't want the bad press. Most real companies would prefer to have a big bill when their product surges in popularity than have unexpected downtime at the worst time.
oh there are - billing systems at scale almost exclusively work on logs. Logs can take minutes or hours to aggregate and transmit to a central place.
Ever notice how your "1GB" data plan sometimes lets you use 5GB if you happen to be roaming in another country and downloading something fast over 5G...? Same reason.
They are also constantly checking your account, and just as easily as you can lock an account, they can soft-lock it via a flag when the billing system says "enough". The few cents in between they swallow easily (as with so many other inaccuracies).
Because then we'd see articles about how the next start up missed their opportunity whenever their site unexpectedly got discussed on the latest Rogan episode and subsequently was taken offline by the limits being tripped.
There's no "right" answer. In one case, it's checked the wrong box and got a $14K bill. In the other case, it's I checked the wrong box and my startup missed its one window. There are in-between levels of alerting etc. for both populations but they're probably unsatisfactory for the extreme conditions.
To be clear: I'd be very in favor of the major cloud providers having a "DO NOT! DO NOT! use this for production; your content could be deleted at any time if you screw up" mode. But I suspect most people wouldn't use it.
I don't see the problem. Don't set a budget limit if you don't want your app to go offline. Lots of people wouldn't mind if their app went offline for a bit; they'd prefer not to suddenly get a $10,000 bill.
Google App Engine used to have that but, presumably in the interest of additional profit, they removed it. Now I have to make do with an alert that warns me long after I could hypothetically have been bankrupted, which can happen in seconds.
The OP is probably a good person with strong interest in data science and building projects.
If it were "oh, here's your $500 charge, upgrade your quota for more", then fair enough, I made a mistake. But $14k without an explicit quota upgrade is not OK.
Unfortunately, if the customer has written their applications in such a way that they're effectively locked to the platform... they won't have much choice until they can dis-entangle themselves.
tbh, I have worked with AWS for at least 10 years, and recently their field support has been quite proactive about helping avoid those scenarios (e.g. they helped save hundreds of thousands on a single-digit-millions account).
This was one of the main selling points for all portfolio companies of the group to adopt AWS in their digital transformation projects.
Of limited use to a nobody who wants to run <$100/year of cloud spend and doesn't have an account manager.
I would love to kick the tires on some AWS stuff, but the threat of unlimited ruin is not worth it. Sure, maybe the gods would take pity on me and wipe the debt, but far easier to just run with someone who caps costs. My toy project can gladly go down if the alternative is a huge unexpected bill.
My cynical self sees it as how cloud providers aim to make the most money: by making billing opaque and waiting for buzzword-happy project leads to mandate stuff be put on their service without understanding what the end billing will be.
I can't say that's for certain what it is. I just know that the hallmark of any business with otherwise-incomprehensible recurring charges is that they can hit you with the charge after the fact, leaving you little recourse to avoid paying it without a ton of work for yourself or your team.
Notice that their "solution" is to tell you how, if you want, you can spin up what is effectively your own custom service to watch spend and, if it goes over some threshold, delete the entire project[0] after some delay. This is the malicious-compliance version of letting you set a limit.
[0] At least, that's how I interpret "This example removes Cloud Billing from your project, shutting down all resources. Resources might not shut down gracefully, and might be irretrievably deleted. There is no graceful recovery if you disable Cloud Billing.
You can re-enable Cloud Billing, but there is no guarantee of service recovery and manual configuration is required."
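For completeness, Google's documented approach boils down to a Pub/Sub-triggered Cloud Function along these lines (a sketch modeled on their docs; the project ID is a placeholder, and note that it detaches billing outright rather than capping spend):

    import base64
    import json

    from googleapiclient import discovery

    PROJECT_ID = "my-project"  # placeholder
    PROJECT_NAME = f"projects/{PROJECT_ID}"

    def stop_billing(event, context):
        # Budget alerts arrive as Pub/Sub messages carrying the current
        # spend and the configured budget amount.
        data = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        if data["costAmount"] <= data["budgetAmount"]:
            return  # still under budget
        billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
        # Detaching the billing account shuts down all billable resources,
        # possibly unrecoverably -- hence "malicious compliance".
        billing.projects().updateBillingInfo(
            name=PROJECT_NAME, body={"billingAccountName": ""}
        ).execute()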
> Custom quota is proactive, so you can't run an 11 TB query if you have a 10 TB quota. Creating a custom quota on query data lets you control costs at the project level or at the user level.
Oh, good catch! Yes, that does look like something that can be coerced into limiting it. Having actually tried to click through, it is very much not as simple as "don't spend more than $X"; the doc points to https://console.cloud.google.com/iam-admin/quotas and you have to find and set the right quota, but yes that can probably help.
Fits in well with everyone's natural first query (SELECT * FROM everything), so people can see the type of data it's returning in order to narrow it down.
Not specifically because of BigQuery, but I have taken to adding " LIMIT 10" to that for my default query because of accidentally locking up 10TB databases too many times.
So now you say on your webpage that your pricing is $1/TB or something. Great. But there is a caveat: the amount you pay depends on some complex factors, such as the size of the table or the duration of your code. If the factors are so simple that no more than grade-school arithmetic is required to calculate my costs, then fine. But what if it gets a little more complex than that, such as "table size is 1PB and cost per 1TB is $1"? Did you know 1PB = 1000TB rather than 0.001TB? What about "you need another $10 query to figure out the size of the table"? Or "the cost depends on the number of function calls your code makes, and if you accidentally recursed too many times you can't limit it"? Or "the server is $5/mo but the IP is $1/h and outbound traffic is $10/GB, and if someone downloads 1TB from your server you will pay $10,000 within 2 hours"?
At some point the factors behind billing become non-trivial, and when every sentence in a 10-page document could have 100x'd your costs, what makes this service different from a scam? You could just let me set a billing cap so I never pay anything beyond $10, so that "$10" is the only thing I have to care about, couldn't you?
> At some point the factors related to billing is going to become non-trivial
First response on the OG link covers this with a screenshot: the size of the query is previewed beforehand, and you have to check a checkbox to acknowledge it. (I dare say listing it in PB instead of $$$ is still a scummy move, etc. - but they do resolve about half of your concerns right there)
Not the same thing, but: some pre-Web Usenet programs would have warnings before "expensive" operations:
> Version 4.3 patch 30 of rn’s Pnews.SH (September 5, 1986, published to support the new top-level groups) introduced the “thousands of machines” message:
> > This program posts news to thousands of machines throughout the entire civilized world. Your message will cost the net hundreds if not thousands of dollars to send everywhere. Please be sure you know what you are doing.
The dataset IS free to download, but running a query against it on Google Cloud is what costs $$$. BigQuery is basically renting servers to scan through the data; that's the fee.
The complaint says there should be a warning that processing fees can be high. Go to the front page and check out the links: nothing about cost. Someone follows that path and $14k is gone without a word about it. That's the path people are sent down from the website, and it explicitly talks about using BQ for analysis.
A simple "running queries over the whole dataset can cause significant costs due to the size of the dataset" should be enough. And I think that's a valid and fair point.
The whole part of accusing Google should just be ignored.
BQ charges you based on the volume of data being scanned. I think this is a situation which involves scanning the whole dataset again and again without fully understanding how it works. I’ve worked with much larger datasets on BQ (petabyte scale) and managed to not spend more than $1000 in an hour. Also, BQ tells you how much data will be processed BEFORE you run the query, which makes it easier to understand the cost implications.
Again, you could fit the whole dataset in memory in an EC2 instance and do your thing.
> Last week I ran a script on BigQuery for historical HTTP Archive data and was billed $14,000 by Google Cloud with zero warning whatsoever,
This comment kind of suggests that you do not understand how BigQuery bills. The archive pays for the storage, but you have to pay for the queries. You would also have had to attach a billing account to run those queries. Running BigQuery searches is not free.
Expensive lesson, but on the surface this one appears to be your error.
It seems excessive to allow a USD 14k spend on a newly created account, or an account with no prior big spend. If I were Google, I wouldn't allow it without the customer explicitly raising limits or increasing quotas. Otherwise there's a big chance the customer can't pay and Google just lost those resources; unless, that is, the resources don't really cost you anything and your pricing is predatory.
Yes and no, I ran the script before and the fee wasn't that high (they jacked it up last summer). Usually I have to jump through a ton of hoops just to add more CPU cores to my VMs so I "trusted" that GCP would warn me if I ever made an error.
One of the bigger issues is they charged my card before I literally had any notice what the bill was - it wasn't even in the dashboard yet. I would have terminated the script ASAP had I gotten *any* warning.
I am sorry, but this seems to be more of a "too long; didn't read" situation. The HTTP Archive clearly mentions that the data is available for offline processing or for querying online on BQ. And in the "Getting started" section of the instructions, it is mentioned multiple times how BQ will charge you. Even if it weren't mentioned anywhere, it's a little presumptuous to assume a tool that processes TBs of data again and again will not charge you money for doing so.
> Note: BigQuery has a free tier that you can use to get started without enabling billing. At the time of this writing, the free tier allows 10GB of storage and 1TB of data processing per month. Google also provides a $300 credit for new accounts.
> Note: The size of the tables you query are important because BigQuery is billed based on the number of processed data. There is 1TB of processed data included in the free tier, so running a full scan query on one of the larger tables can easily eat up your quota. This is where it becomes important to design queries that process only the data you wish to explore
> When we look at the results of this, you can see how much data was processed during this query. Writing efficient queries limits the number of bytes processed - which is helpful since that's how BigQuery is billed. Note: There is 1TB free per month
This comment reminds me of unsafe pedestrian crosswalks in car-centric cities.
Sure, a crosswalk may have an extensive system to warn drivers of pedestrians, but that doesn't change the fact a driver hits a pedestrian there at least once a month. It only has to happen once to ruin someone's life.
For cloud providers, the obvious solution is hard budget limits. Ask people to set a hard budget limit before they get the opportunity to drown themselves in debt. Free up some workload off of the support team in the process.
Hard budget limits change the process to avoid these charges almost entirely. Warnings merely tell people that the provider knows the process lands users in debt, and ask them to please use the broken process correctly to avoid the severe financial consequences.
Yes, sure, there's stuff I could have done better, like staying up all night reading the fine print. But that's not the point: this is a *warning* to other people who see the Internet Archive logo, the word "public", and for some dumb reason also trust Google. I'm hoping this doesn't happen to others; I learned a costly lesson.
I'm on OP's side - even if I knew I'd be paying to run some queries against this dataset, I never would have thought it could reach 5 figures in such a short time. And you can't argue that the billing is straightforward. The "Getting Started" guide for the HTTP Archive doesn't even describe what indexes are available/commonly used for limiting the scanned rows.
If Google provides a credit limited to $300 for new accounts, then it has the ability to limit spend.
It should make this available.
To be fair: I'm sure they don't withhold this limit to make money from this rare case, but to avoid the far more common case of an established business going offline because someone forgot to update a limit.
My view on such matters is that it's best to pick a solution with a fallback option, which usually means open-source software. That is to say, if you choose a cloud service, it's preferable that it's built on some open-source software; if the costs become uncontrollable, you can still fall back to the open-source version. For instance, CelerData has built its cloud service on StarRocks, and it's said that many users have used it to replace Snowflake and BigQuery. Of course, you could also opt for Elasticsearch's cloud service, and if problems arise, replace it with OpenSearch.
It's interesting that in the post there's a maintainer pointing out that there's a very tiny little checkbox that says "this will process X PB of data". Given that Google knows how much your queries cost per TB, it does seem like a "dark UI" design to not just say how much the query will actually cost.
Similarly that checkbox being a tiny part of the UI, and not allowing people to set up cost limits on a query (or not having them at the account level), does seem very much like an "encourage people to overspend" UX. I'm sure "overspend to the level of a $14k bill to an individual" is not intended, but that's a reasonably predictable occasional outcome for this design.
So on the one hand, yes they did click a checkbox saying they were aware of the amount of data being processed, but OTOH the UI seems to be specifically designed to encourage this kind of mistake.
This is a matter of a user not having read some fine print, which doesn't mean they're necessarily at fault. The only way to know which of the user, httparchive.org, or Google BQ is most responsible is to know how often similar situations arise in this specific context (i.e. using BQ by way of httparchive.org).
Can you please pick a different username that we can rename your account to? Some users are complaining that your current username is misleading (e.g. here: https://news.ycombinator.com/item?id=39447421)
Edit: since I didn't hear back from you, I've consed a 'not' onto the username 'httparchive'. If you prefer a different name, feel free to contact us at hn@ycombinator.com.
> This website makes it seem like this “public” dataset is for the community to use, but it is instead a for-profit money maker for Google Cloud and you can lose tens of thousands of dollars.
you didn't understand what you were doing. HA's datasets are public and free. It is not a "for-profit money maker for Google Cloud". Sorry, sucks for you, but blaming the restaurant because you bit off more steak than you could chew is not how this works.
Wow. No guard rails whatsoever on queries like this?
Their UI clearly has all the info needed in order to put guard rails in place (aka big scary warning dialog in red), as it's already giving a non-obvious warning about the expected data usage.
Blaming users for this seems like a bastard act. Talk about causing further reputational damage... :( :( :(
I don't disagree with the premise that Google should be responsible, and I explicitly acknowledge that the average computer-interested person trying out BigQuery has no clue how sharp a knife it is; they really do need to be protected from themselves. I was in this boat only a few months ago. One thing I will say, though, is that the documentation is actually quite comprehensive. Personally, after taking the time to RTFM and actually understand things like columnar storage, partitioned and clustered tables, etc., I was able to optimize costs quite a bit for our use case, and I'm quite pleased with the product overall. It just takes time to learn; it's a (necessarily, imo) intricate machine.
Could you explain the steps you went through that led to you using BigQuery? The reason I ask is most of us probably use GCP and only ever interact with BigQuery via GCP. But it seems your entry point was a bit different to most (e.g. seems you might have clicked on a link to GCP from HTTP Archive, or perhaps something else?).
FWIW I use BigQuery a lot and as a rough guide I assume about 1c per GB scanned. So if I query a dataset that's 1TB, that's about $10. If the same data were stored on a relational db, the same query would take about a day (or at least a good part of a day). Because BigQuery returns a result so quickly (e.g. <1 minute) it can be easy to miss the insane amount of work it did to get there. So I could see someone accidentally putting that ~1min (but 1TB!) query into a loop or something, and boom, there's your $15k bill. Accidents happen.
Also FWIW, I've found although the big 3 cloud's pricing is tricky (since there are so many services), I find them much better than the PaaS built on top of the big 3 clouds. My suspicion is that the PaaS's have a strong incentive to obscure their pricing because customers can typically see what their costs are (e.g. if they buy some compute from AWS at $0.16/hr and sell it for $1.40/hr, that can be seen as a bit of a rip, hence they try to obscure it). But I think the big 3 are not too bad at this practice. It really bugs me when anyone deliberately obscures their prices, and it's often an indicator of more shady practices to come.
I was doing historical evaluation for a few sites, so I was running a query for each month going back to 2016 for each site. I've done this before with no real issues, and if I knew the charges were rapidly exploding I'd have halted the script immediately - but instead it ran for 2 hours and the first notice I got was the CC charge.
My guess is you were querying all the data each time.
If you instead filter out the rows you are interested in (e.g. the particular "few sites" by their URL) and put that in a new table, querying the resulting, tiny table will be very cheap.
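A sketch of that pattern (the table and column names are made up; the real HTTP Archive schema will differ):

    from google.cloud import bigquery

    client = bigquery.Client()
    # One full scan up front to pull out just the sites of interest...
    client.query("""
        CREATE TABLE my_dataset.my_sites AS
        SELECT *
        FROM `httparchive.some_dataset.some_table`  -- placeholder name
        WHERE page IN ('https://a.example/', 'https://b.example/')
    """).result()
    # ...after which each per-month query in the loop hits the tiny copy.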
I haven't looked at the exact schema for this dataset but for this type of query pattern to be efficient the data would need to be partitioned by date.^[1] I'm guessing that it's not partitioned this way and therefore each of these queries that was looking at "one month" of data was doing a full table scan, so if you queried N months you did N table scans even though the exact same query results could have been achieved even without partitioning by doing one table scan with some kind of aggregation (e.g. GROUP BY) clause.
I wouldn’t expect either of those filters to utilize a partition key if one exists. So yeah, you probably did a full table scan every time. Is the partitioning documented somewhere?
Yeah, 'LIKE' ops usually give you a full table scan, which is brutal. If it was my own data I'd chop the fields up and index them properly - which is the issue here, it's not your data, so you don't get a say in the indexes, but you do have to pay per row scanned even if you can't apply an index of your own.
Seems like an ideal case for pre-processing. You still have to do one full scan but you only have to do one scan.
I’m not familiar with your use case or BigQuery but in Redshift I’d just do a COPY to a local table from S3 then do a CREATE TABLE AS SELECT with some logic to split those URLs for your purpose.
You might even be able to do it all in one step with Spectrum.
Have you tried getting in touch with GCP to see if they would refund the charge? I've heard plenty of stories of cloud services refunding large one-off accidental spends like this one.
"Last week I ran a script on BigQuery for historical HTTP Archive data and was billed $14,000 by Google Cloud with zero warning whatsoever, *and they won’t remove the fee.*"
BigQuery is an amazing product and there are good reasons to use it.
One place I worked at had a table with 100 billion rows. And some other tables as well. If a manager asked for an ad-hoc query, it was 5 minutes of writing a SQL query including JOINs (which didn't need to worry about which fields were indexed etc. e.g. you could write WHERE then a regex), and $15 and 5 minutes later I'd have the answer. Apparently 100s of VMs were started and stopped to answer that query, but it all happened automatically, at very low cost.
The person responding to the complaint was quick to point out that the size of data displayed in the UI relates directly to estimated cost. I see no reason the estimated cost shouldn't be shown in the UI as well.
The cloud isn't something I'd ever use my private credit card on, there are just too many ways to screw it up if you're not very careful and know what you're doing. I don't think I would have hit this particular issue, but that is mainly because I've read a bunch of stories of this kind and BigQuery is one of the things I associate with "can get very expensive very quickly" based on those.
I know the explanations and justifications for it, but for personal use a service where I can't put a hard limit on usage is simply not acceptable for me. It's just not worth the risk.
IANAL, but this can still be risky in the US: if you aren't careful to demonstrate a clear separation between your business funds and your personal funds, those pursuing you for money owed may be able to pierce the veil, costing you a huge benefit of the LLC.
Is there a guide or someone I should talk to about how to do this?
I've long wondered what I can do with an LLC to protect me from debts like this, but I don't know how to get more information about it. Particularly as I'd be the sole owner, I don't really understand what the LLC does/doesn't do.
If you had just $1,000 (and made a few hundred a year), is it worth doing?
Maybe, maybe not. It depends on the risk you're trying to contain, not just the routine income.
The short, short version is: you have to have a reason for the LLC that isn't just "contain some risks". Something like "this is a legal entity for my side project bilombinaboloa.com, which I'm hoping will one day become a company and make me rich" will work; "I pay my expenses via this and take my income directly" will not.
Read your sibling comment... basically doing this solely for the purpose of trying to avoid debt won't work unless the creditor is just too lazy to pursue it.
Of course. If you have a viable business and the debt is related to that, that's exactly what the corporate veil is for. If you just want to hedge your bets on your personal GCP bill for hobby stuff, not so much.
There is a really easy fix to this problem: setting billing limits. This can be done with almost all cloud providers and it takes almost no time. These incidents just show a lack of professionalism on the part of the person incurring the costs. I personally did this on the first day I set up a cloud computing account, when I was still doing my BS in college. It is not that hard, folks. Set the billing limits.
The main reason I'd use a personal account for one of the big cloud providers would be to learn stuff. At that point a lack of professionalism is kinda expected, because learning stuff is the whole point.
And my understanding is that almost none of the ways of setting limits are actual hard limits, only alerts and some hacked-together emergency abort scripts. Correct me if I'm wrong, but can you actually, robustly limit the cost of services that can spend that much money in an hour or so? It doesn't help much if I get an email about it and read it two hours later.
I understand the downvotes, but I would still say that being aware of the rough estimated cost of each service you use is an integral part of an engineer's job. After all, we care a lot about CPU cycles, and those are measured in femtodollars.
That's not sufficient, you also must not make mistakes.
I have very limited cloud experience, but I did make a mistake that led to a rather slow but constant cost. The amount was small enough not to matter in a professional context, but the memorable part was that I couldn't easily pinpoint the source with the AWS tools and my limited understanding of them. The categories and labels were too broad, and it took a while until I figured out what went wrong. There are certainly better tools to investigate this, but I didn't know them. In the end it was simply luck that the mistake fell into an area of insignificant amounts of money; it could easily have been significantly more if a few parameters of the same mistake had been different.
You can hold someone to an 'engineer's' responsibilities when they are being paid like an engineer, in a setting that provides them with the protections of an engineer.
Until that point, they are just an individual who got screwed by disguised billing practices.
> It is not that hard folks. Set the billing limits.
Excellent idea. Please describe how to create an account on AWS or GCP that is not allowed to spend more than $100/mo. Since it is "a really easy fix" and "takes almost no time" it should be easy to explain, right?
That's probably enough for 99% of people, and if you're highly motivated, you could make that trigger an SNS notification that trips a circuit breaker.
No, that's really not good enough. I don't want to need to be "highly motivated" in order to set a limit, I want to say this thing cannot use more than this many dollars each month, no conditions no exceptions no questions. If I make a fun little side project and it hits the front page of HN, I don't want to quibble about whether I cut it off in time or some hacked together little script turns things off correctly, I want it capped.
There are limitations to what you can get with spending limit accounts, but Azure has (always?) had more options for people looking for hard billing caps than the other two big providers.
While you can footgun yourself with hard limits I tend to think that learners/hobbyists should, in general, be able to access at least many services with an ironclad guarantee that they can't be billed for over a certain monthly amount or a total number.
I'm much more inclined to shrug if a startup screws themselves over with a hard spending limit than if a student screws themselves over because of a lack of one.
So honestly if that's true I might have to try Azure, thanks. However, when the claim was "This can be done with almost all cloud providers" I feel comfortable wanting an answer for the other two of the big three.
True, these giants make their own lives easier and don't build many billing controls into the infrastructure. It is your money, so it is your responsibility to protect it. Use billing alerts, hacks, and research things carefully before jumping in with both feet.
There are plenty of APIs on the internet that are free that query a database for information. If queries are too expensive it's not viable to run for free.
Or, GCP could implement cost/resource/use limits, which would allow them to give away whatever they wanted for free without any concern about people over using it, while also allowing people to avoid shooting their own feet off.
I don’t disagree but how does that work exactly? When you hit the quota the query gets cancelled? That’s definitely already a feature of Redshift Spectrum with WLM. Does BigQuery offer something similar?
My first choice would be something like "this query will cost $13953, which exceeds your default cap of $100; please click the confirm button if you really want to run it". (The dollars could be CPU-minutes or whatever if you want to use resource based limits, which might play nicer with a free tier)
Edit: rereading, I think this is actually for non-interactive scripts, in which case yes it should just cancel the query
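Nothing stops a client-side wrapper from doing this today; a rough sketch (assuming the $6.25/TiB on-demand rate and a made-up default cap):

    from google.cloud import bigquery

    def run_with_cap(client: bigquery.Client, sql: str, cap_usd: float = 100.0):
        # Dry-run first to learn exactly how many bytes would be scanned.
        dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
        cost = dry.total_bytes_processed / 2**40 * 6.25
        if cost > cap_usd:
            # Interactive confirmation; a non-interactive script should
            # just raise here and cancel, as noted in the edit above.
            if input(f"This query will cost ~${cost:,.2f}. Run? [y/N] ") != "y":
                raise RuntimeError("query aborted: over cost cap")
        return client.query(sql).result()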
You can set the size limit for individual queries. Plus the custom quotas and everything.
Part of the problem is that the OP wrote a script with a loop. So say you set the limit to 50 GiB per query, but then write a script that runs a 49 GiB query 1000 times...
That type of batch process should be designed much more carefully to consider costs.
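For reference, the per-query size limit is `maximum_bytes_billed`: the job fails up front, with no charge, if it would scan more than the cap. A sketch (placeholder query) that also shows why it doesn't stop the loop case:

    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(maximum_bytes_billed=50 * 2**30)  # 50 GiB
    # This job errors out before billing anything if it would scan more
    # than 50 GiB -- but a loop of 49 GiB queries sails right through.
    client.query("SELECT url FROM `some.dataset.pages`", job_config=config).result()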
The article doesn't say anything about a loop, and the estimated usage by the Google responder makes it seem like the cost is from a single "SELECT *".
> I was doing historical evaluation for a few sites, so I was running a query for each month going back to 2016 for each site. I've done this before with no real issues, and if I knew the charges were rapidly exploding I'd have halted the script immediately - but instead it ran for 2 hours and the first notice I got was the CC charge.
So it looks like a loop of ((6 * 12) + 2) * #sites iterations with a full table scan every time.
I've forgotten more SQL than most people ever learn. Time is also valuable, and I make trade-offs. Should I spend hours (i.e. $$$) to optimize, or run a non-optimized query in the background for a different cost? I didn't think the time/benefit/cost equation favored tuning; if I had known, I'd have spent time on tuning. If you offer something for "free", then change the cost, and have no alerting mechanism for inefficient queries, it's impossible to evaluate the trade-offs.
It's rarely interesting logic that makes a query expensive, because the per-query charge is based not on compute cycles but on the amount of data scanned. This is sufficient:
    SELECT * FROM super_wide_table_with_lots_of_text
    WHERE NOT filter_on_partitions_or_clusters
SELECT * is dangerous because it's a column store: you really need to look at the schema and select only the columns you want. And when exploring the data it's important to use sane limits and pull from a single partition.
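By way of contrast, a sketch of the cheap version of the same exploration (assumes an ingestion-time-partitioned table; column and table names are illustrative):

    from google.cloud import bigquery

    client = bigquery.Client()
    client.query("""
        SELECT url, status                        -- only the columns you need
        FROM `project.dataset.super_wide_table`
        WHERE _PARTITIONDATE = DATE '2024-01-01'  -- one partition, not a full scan
        LIMIT 100                                 -- trims output, NOT bytes billed
    """).result()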