
I was hoping for Yahoo, Amazon, or Microsoft to throw a lot of resources at this about 5~8 years ago. Since then, Google kind of ran away with the game in crawling. They were far ahead of everyone else back then, but one could conceive of a rag-tag group of companies, institutions, and individuals pooling their resources and getting a crawl about 10% as good. These days, judging by the externally visible evidence, Google is probably several orders of magnitude better than everybody else on the planet combined.

Take crawl freshness. If I publish a new blog post, it gets crawled and added to the Google index in seconds. Other crawling efforts take weeks between refreshes.



Hi, I work at Common Crawl. We spent 2011 improving our algorithms, and hopefully this effort will start to show real results (with respect to crawl frequency and relevancy) in 2012. But you are right: it is pretty unlikely that our crawl will be fully competitive with the likes of Google etc., multi-billion dollar corporations who dedicate huge amounts of engineering and hardware resources to staying competitive in this field.


It is not "Google etc., multi-billion dollar corporations"; it is just Google.


> I was hoping for Yahoo, Amazon, or Microsoft to throw a lot of resources at this about 5~8 years ago. Since then, Google kind of ran away with the game in crawling.

Around 2004, Yahoo was crawling about the same number of pages as Google. (In some months, more.)

> If I publish a new blog post, it gets crawled and added to the Google index in seconds. Other crawling efforts take weeks between refreshes.

Time from crawl to appearing in search results is a different issue.



