When I worked as a patent examiner, it seemed to me that the bottleneck in searching is usually the searcher, not the specific search technology. Many technical people (for example, HN folks) tend to severely overestimate their own search ability. I certainly did.
With the right search query, you can frequently find exactly what you're looking for very quickly. In some sense, having a dumber search engine than Google is an advantage here: you won't become dependent on the "magic" of the search engine and will have to craft a good search strategy.
The most valuable technical feature of the internal search tools (PE2E Search or EAST) was speed, not anything fancy. I imagine this was the motivation: If the results can't easily be ranked (and they couldn't in my experience), make handling a ton of results as easy as possible. That's what the USPTO did.
You could "flip" through documents quickly using only the keyboard, and if what you were looking for was easily visible in a drawing then this usually was the best approach. For text they had a good way to show what you were looking for in context. I'd love to see a similar setup for web search but I don't think it would appeal to most, so it probably won't happen.
AI/ML-based search tools were interesting but usually not helpful. I'd always try at least some of them. I think the main limitation for these in my technology area (mostly water heaters and ventilation) was that they didn't look at the drawings at all, just the text and citations. That's missing a lot. (When they were helpful, they did save a lot of time.)
I think the big difference between such a dataset and the web is that the web is polluted with useless stuff like spam. How do you decide what is relevant and what is not, if not with some statistical/ML methods? It seems like only a whitelisting approach would work then, severely limiting the scope of such a system.
> How do you decide what is relevant and what is not, if not with some statistical/ML methods?
To be clear, you're referring to software determining relevance. I can determine relevance on my own, though it may be time-consuming. Making manual relevance determination as quick as possible worked okay in my experience at the USPTO.
Right now there probably are reliable signals about the relevance of a document/webpage/etc. But Goodhart's law suggests that any signal used for ranking would be gamed and become unreliable in the long run. Without AI on par with or better than a human, I think the equilibrium would tend to be that search results can't be ranked well.
If ranking doesn't work, then each result is roughly as plausibly useful as the next. Given that, figuring out how to handle a lot of results manually and efficiently is a reasonable strategy, one that worked in my experience at the USPTO. It's not for everyone, mind you, but search software for serious searchers should consider this approach.
> I think the big difference between such a dataset and the web is that the web is polluted with useless stuff like spam.
While patent attorneys aren't actively SEOing their patent applications, they do tend to write legal/technical gibberish that's basically just as useful as spam. (I wish they did some mild SEO, like adding relevant keywords, as it would make examining patents easier...)