When I worked as a patent examiner, it seemed to me that the bottleneck in searching is usually the searcher, not the specific search technology. Many technical people (for example, HN folks) tend to severely overestimate their own search ability. I certainly did.
With the right search query, you can frequently find exactly what you're looking for very quickly. In some sense, having a dumber search engine than Google is an advantage here: you won't become dependent on the "magic" of the search engine and will have to craft a good search strategy.
The most valuable technical feature of the internal search tools (PE2E Search or EAST) was speed, not anything fancy. I imagine this was the motivation: If the results can't easily be ranked (and they couldn't in my experience), make handling a ton of results as easy as possible. That's what the USPTO did.
You could "flip" through documents quickly using only the keyboard, and if what you were looking for was easily visible in a drawing then this usually was the best approach. For text they had a good way to show what you were looking for in context. I'd love to see a similar setup for web search but I don't think it would appeal to most, so it probably won't happen.
AI/ML-based search tools were interesting but usually not helpful. I'd always try at least some of them. I think the main limitation for these in my technology area (mostly water heaters and ventilation) was that they didn't look at the drawings at all, just the text and citations. That's missing a lot. (When they were helpful, they did save a lot of time.)
I think the big difference between such a dataset and the web is that the web is polluted with useless stuff like spam. How do you decide what is relevant and what is not, if not with some statistical/ML methods? It seems like only a whitelisting approach would work then, severely limiting the scope of such a system.
> How do you decide what is relevant and what is not, if not with some statistical/ML methods?
To be clear, you're referring to software determining relevance. I can determine relevance on my own, though it may be time-consuming. Making manual relevance determination as quick as possible worked okay in my experience at the USPTO.
Right now there probably are reliable signals about the relevance of a document/webpage/etc. But Goodhart's law suggests that any signal used for ranking would be gamed and become unreliable in the long run. Without AI on par with or better than a human, I think the equilibrium would tend to be that search results can't be ranked well.
If ranking doesn't work, then each result is roughly as plausibly useful as the next. Given that, figuring out how to handle a lot of results manually and efficiently is a reasonable strategy, one that worked in my experience at the USPTO. It's not for everyone, mind you, but search software for serious searchers should consider this approach.
> I think the big difference between such a dataset and the web is that the web is polluted with useless stuff like spam.
While patent attorneys aren't actively SEOing their patent applications, they do tend to write legal/technical gibberish that's basically just as useful as spam. (I wish they did some mild SEO, like adding relevant keywords, as it would make examining patents easier...)