I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking: how much you need to invest to circumvent Cloudflare and similar services just to get access to any website. Bandwidth and storage are the smallest cost factors.
Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% success rate crawling a website — with a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries. Without those, I would get a 403 immediately with no HTML being served.
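Of those measures, user-agent rotation is the cheapest to illustrate. A minimal sketch of the idea using only the standard library (the user-agent strings and `build_request` helper are my own illustrative choices, not anything from SEOJuice's actual stack):

```python
import itertools
import urllib.request

# A small pool of desktop user-agent strings (illustrative, not exhaustive;
# a real crawler would use a much larger, regularly refreshed pool).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]
_ua_cycle = itertools.cycle(USER_AGENTS)

def build_request(url: str) -> urllib.request.Request:
    """Build a request whose User-Agent header rotates through the pool."""
    return urllib.request.Request(url, headers={"User-Agent": next(_ua_cycle)})
```

In practice this alone won't get past Cloudflare — it only helps against naive user-agent-based blocking — which is why the proxies and stealth browser binaries end up in the mix.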
It's a pretty easy game to win as the blocker. If you receive too many 404s against pages that don't exist, just ban the IP for a month. I actually got the idea from a Hacker News comment. I also think that if you crawl too many pages, you should get banned as well.
There's no point in playing tug of war against unethical actors, just ban them and be done with it.
I don't think it's an uncommon opinion to behave this way, nor are these crawlers users I want to help in any capacity.
They're mostly non-technical/marketing people, but yes that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.
Would it make sense to advertise to the more technically minded a discount if they set up an IP whitelist with a tutorial you could provide? A discount in exchange for reduced costs to you?
> the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries
I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.
Please elaborate — why exactly is it antisocial? Because Cloudflare decides who can or can't access a user's website, when that user specifically signed up for my service?
It intentionally circumvents the explicit desires of those who own the websites being exploited. It is nonconsensual. It says “fuck you, yes” to a clearly-communicated “please no”.
I've been working on the same tool since 2024. I figured it was a good time to build a tool for all the people who are building their own tools — eventually they will need to market them.
So I built an SEO/GEO Automation Tool for small to mid-size businesses that don't have a full-time team for that. [0]
The goal is to provide teams visibility across all the channels — Search and AI — and give them the tools needed to outrank their competition. So far so good: the fully bootstrapped venture has grown over the last year, and I've built quite a few big features — a sophisticated audit system, AI Responses Monitoring, Crawler Analytics, Competitors Monitoring, etc.
Adding a bit of context as well: this started out as an internal linking tool but grew into something more based on customer feedback — the database has now reached about 10TB of data on keywords, pages, AI responses, etc., where I know who was ranking where and why.
And I'm trying to offer this "data advantage" to website owners so they can grow — and it's something that will be hard to replicate (at least quickly) with AI.
I’m working on SEOJuice [1], an automated tool for internal linking and on-page SEO optimizations. It's designed to make life a little easier for indie founders and small business owners who don’t have time to dig deep into SEO.
So far, I’ve managed to scale it to $3,000 MRR, and recently made the move from the cloud to Hetzner, which has been a game-changer for cost efficiency. We’re running across multiple servers now, and handling everything from link analysis to on-page updates with a bit more control.
The journey’s been a mix of hands-on coding (and a lot of coffee) and constant optimization. It’s been challenging but incredibly fun to see how much can be automated without compromising on quality.
Happy to chat more about the tech stack or any of the growth pains if anyone’s interested!
Oh wow, my package on the front page again. Glad that it's still being used.
This was written 10 years ago, when I was struggling with pulling and installing projects that didn't have any requirements.txt. It was frustrating and time-consuming to get everything up and running, so I decided to fix it — and apparently many other developers had the same issue.
[Update]: Though I do think the package is already at a level where it does one thing and does it well. I'm still looking for maintainers to improve it and move it forward.
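For anyone curious how a tool like this works: the core of generating a requirements.txt from source is just walking the AST and collecting top-level import names (the real package does much more — version resolution, filtering out stdlib and local modules — but this sketch of mine shows the first step):

```python
import ast

def top_level_imports(source: str) -> set:
    """Collect top-level package names imported by a Python source string.

    `import numpy.linalg` yields "numpy"; relative imports are skipped
    since they refer to local modules, not installable packages.
    """
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.level == 0:
                names.add(node.module.split(".")[0])
    return names
```

From there it's a matter of diffing against the standard library list and looking up installed versions to emit `package==version` lines.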
Hey everyone, I know the HN community is very polarizing, and the discussions here are always great to read through, as both sides are always eager to prove the other wrong. I think we need more of that in the community — people not being afraid to disagree.
I'm really curious to hear your thoughts and experiences.
I have a bit of a niggle about the use of the word "polarizing". That word implies things that I think are harmful overall, such as being unwilling to work with people you disagree with.
That said, I agree that it's important to express your opinions and stand by things you think are right. It's equally important to listen to those who disagree with you and take what they say as additional data that may (or may not) lead you to modify your opinion. At the very least, openly and honestly listening to others will inform you as to why they have a differing opinion. "Everyone seems crazy if you don't understand their point of view."
Also, "compromise" isn't a dirty word. It's how we get anything done.
Why not just retrain the Python team in another language? I mean, software engineers are not really language-specific; they can learn other languages if needed.
All SWEs at the same level at Google are making the same compensation (with some exceptions for high-flying AI researchers). The Python SWEs certainly weren't making more than anyone else.
That's not true at all. Excluding location, the compensation of a SWE depends not only on level, but also on tenure, on performance rating (and rating history), and on stock market fluctuations (whether the stock price was low or high when the stock was granted).
One of the rumors is that the better compensated you are at your level, the more likely you are to be targeted for layoff, because that cuts engineering costs the most.
That's the thing, it's not clear that the Python core engineers are more talented than other Google SWEs on average. You have all sorts of talented engineers working on all sorts of random projects within Google.
Mostly I help developers grow — I share my thoughts as a CTO about building digital products, growing teams, scaling development and in general being a good technical founder.