I feel like there are probably some ad-based search engines which are privacy- and service-oriented, but in general even for those there remains a misalignment problem. Hence, if I don’t want to be a product now or in the future, what good search engines are there that I can pay for?


I once read that running a search crawler costs upwards of a billion dollars a year. Anyone other than Microsoft or Google running their own search index is either not getting a wide spread of the internet or using their own index to supplement Google or Bing results.
That’s like saying that it’s impossible to run a car manufacturing company without 100 billion because that’s how much Ford spends on their car manufacturing processes. It makes no sense.
Yes, making an original search engine is hard, just like making trucks is. But that doesn’t mean that running either one requires billions of dollars to do.
Common Crawl is a nonprofit that regularly shares free copies of every internet page with metadata, and it damn well doesn’t take billions to do it either. https://commoncrawl.org/
That website claims they add 3-5 billion pages a month. Google is doing that in a day or three, as recency of information is very important in search. Plus, that site claims 100 billion pages to Google’s 400 billion. It’s still an impressive project.
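For a rough sense of that gap, here is a back-of-envelope sketch in Python. It uses only the figures quoted in this thread (the same 3-5 billion pages taking Common Crawl a month and Google one to three days, and 100 vs 400 billion pages in total). None of those inputs are verified, and the 30-day month and all names in the code are my own assumptions.

```python
# Back-of-envelope comparison using figures quoted in the thread (unverified).
DAYS_PER_MONTH = 30  # assumption: treat "a month" as 30 days

COMMON_CRAWL_INDEX = 100_000_000_000  # pages, as claimed on the Common Crawl site
GOOGLE_INDEX = 400_000_000_000        # pages, as quoted in the thread


def crawl_speedup(slow_days: float, fast_days: float) -> float:
    """How many times faster the same page volume is crawled in fast_days than in slow_days."""
    return slow_days / fast_days


# Both crawlers are assumed to cover the same 3-5 billion pages,
# so the speedup depends only on how long each one takes.
speedup_low = crawl_speedup(DAYS_PER_MONTH, 3)   # Google needs three days
speedup_high = crawl_speedup(DAYS_PER_MONTH, 1)  # Google needs one day
index_ratio = GOOGLE_INDEX / COMMON_CRAWL_INDEX

print(f"Crawl rate gap: {speedup_low:.0f}x to {speedup_high:.0f}x")
print(f"Index size gap: {index_ratio:.0f}x")
```

So on these numbers the freshness gap (10x to 30x) is a lot bigger than the size gap (4x), which is why I brought up recency.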
Size isn’t everything, so the real question is: what search site uses only the Common Crawl index and has results on par with Bing or Google?
None of them. At least, none that I’m aware of. I just don’t think that direct expenses are the reason that there are only two major web search tools. I also don’t think Google and Bing are good examples to point at when estimating the cost of running a complete search engine.
If you read all of your article, the author notes that while Google has an index of about 400 billion pages, the Internet Archive’s index is actually bigger, at around 865 billion.
The Internet Archive has an operating cost of about 33m/year. I think that is a much more reasonable example to point to and say “running a complete search engine would cost about as much as that”.
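To put the two price tags side by side, here is a quick Python sketch that divides each yearly cost quoted in this thread by the matching index size. It assumes the 33m figure is in US dollars and takes “upwards of a billion dollars” as exactly one billion. Both are my assumptions, and the function name is made up for illustration.

```python
# Rough cost per indexed page per year, from figures quoted in the thread (unverified).

def cost_per_page(annual_cost: float, pages: float) -> float:
    """Yearly operating cost divided by the number of pages in the index."""
    return annual_cost / pages


# Internet Archive: about 33m/year (assumed USD) for about 865 billion pages.
archive_cost = cost_per_page(33e6, 865e9)

# The "billion dollars a year" claim, spread over Google's roughly 400 billion pages.
claimed_cost = cost_per_page(1e9, 400e9)

ratio = claimed_cost / archive_cost

print(f"Internet Archive: ${archive_cost:.6f} per page per year")
print(f"Billion-dollar claim: ${claimed_cost:.6f} per page per year")
print(f"The claim is about {ratio:.0f}x more expensive per page")
```

That is a gap of about 65 times per page. It lumps storage, crawling and serving into one number, so treat it as an order-of-magnitude comparison and nothing more.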
Also, very neat article btw. I would never have guessed that Google’s search index has been shrinking for the past little while. Or that Google actively culls results from its database that it thinks people won’t ever want to see.
I’m not disputing that you might be right, but the Internet Archive runs a very different service. The main difference is that Google needs to continuously prune its 400 billion page index because of link rot, while the Internet Archive has the opposite aim: preserving sites that no longer exist.
I’m also not sure they even crawl. Do sites get added on user request? When looking at a medium popularity page, you see it only has a couple of scrapes a year.
I would suggest direct expenses are the barrier, but perhaps crawling is not the main expense. I would be interested to hear your speculation about what, other than expenses, creates a barrier.
When I said ‘direct expenses’ I mostly meant the cost of owning and running a database of internet pages and metadata comprehensive enough to be considered part of a ‘fully featured search engine’. There’s also the other half: the compute required to obtain the pages and create that metadata. At most, I would guess that compute costs about as much as the storage for a database of all the internet pages (scaling up after that based on how many users you need to support). In short, a scaled-down search engine that had access to every page on the internet that people would want to find could cost as little as $100,000 for the initial hardware purchase.
The Internet Archive does in fact have its own web crawler. They also add sites on request; I’ve had my personal website on there for almost two decades now, specifically at my request.
They also have a full-featured search function available for anyone on their website at archive.org. This is why I say they’re a reasonable price comparison for a full-featured search engine. They may spend more on storage and less on metadata compute than a theoretical smaller search engine, but at the end of the day, that’s just a re-balancing of the cost, not a completely new and more excessive cost.
I think direct expenses (the cost of owning and maintaining an internet index database) are definitely significant: the completely free access that Google gives to anyone who wants it is way more than any single private entity or company could support just because they want to have it. I don’t think it would be anywhere even close to a billion dollars though.
I think the hardest part of having an internet index database would be the knowledge required to create and maintain it, especially under the hostile forces that are the 75 billion dollar SEO industry. If a self-hosted search engine became big enough that the SEO industry started trying to break it, I don’t think that company would survive for very long at all.
Google is losing that battle, like, almost completely. What hope would a small startup-style company have of battling it and staying financially solvent, especially if they’re trying to be different from Google and Bing and actually showing results without the pressure of advertisers breathing down their necks?
I think the hardware side of a search engine is solvable with a Silicon Valley startup level of funding. I think it’s impossible for anyone in the current day and age to make that sort of project solvent while keeping the user (instead of the advertiser) as the main customer. Anyone else who can’t get those funds, or doesn’t actually want to build a results-oriented search engine, can just mooch off of Google and Bing for free.
I think you’d be right that the direct cost of running the crawler and index would not be the issue. But fighting SEO to keep your results decent is probably a cost that dwarfs the basic technical cost of running the crawler and index.
And you’d need a technical security team on top of things, as link farms aren’t your only risk. I’m sure there are countless ways to manipulate the algorithm to put your site on top, and Google probably has multiple teams working full time on fighting them.
Many of these things would likely not be a problem for a startup, though. No one is paying SEO firms big money to get into a search index no one has heard of and hardly anyone uses, so these costs probably grow exponentially over time as you become more well known.
Yeah, and on the smaller / earlier side of a theoretical search engine company, Google offers their API for free. I think this is actually another one of the biggest contributors to why nobody has tried to make a new search engine with their own index. Why waste hundreds of thousands of dollars on hardware, and even more on personnel costs, when you can just have Google do it for you instead?
Yes, offering everything for free to prevent competition has been a surprisingly effective strategy for Google.