> demanding new --yes-destroy-my-computer-dangerously-and-step-on-my-face-daddy flags and crashing my automated scripts from last year.
Literally, my case. I recently had to compile an abandoned six-year-old scientific package written in C with Python bindings. I wasn't aware that modern versions of pip handle builds differently than they did six years ago: specifically, pip now builds wheels inside an isolated environment. I was surprised to see a message saying that %package_name% was not installed, even though I could import it just fine. By the second day, I had finally discovered pip's --no-build-isolation option.
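For anyone who hits the same wall, the workaround looks roughly like this (an assumption here: the build dependencies the old setup.py needs, e.g. setuptools and numpy, are already installed in the active environment, because disabling isolation means pip will use them directly):

    pip install --no-build-isolation .

Run it from the package's source directory; the same flag also works with pip wheel.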
> GPT-5 is proving useful as a literature review assistant
No, it isn't. It only produces a highly convincing counterfeit. I am honestly happy for people who are satisfied with its output: life is way easier for them than for me.
Obviously, the machine discriminates against me personally. When I have spent hours in the library looking for some engineering-related math from the '70s-'80s, as a last-resort measure I can try gambling with the chat, hoping for any tiny clue toward answering my question. And then I spend the following hours trying to understand what is wrong with the chat's output. Most often, I experience the "it simply can't be" feeling, and I know I am not the only one having it.
In my experience doing literature super-deep-dives, it hallucinates sources about 50% of the time. (For higher-level literature surveys, it's maybe 5%.)
Of the other 50% that are real, they're often ~evenly split between sources I'm familiar with and sources I'm not.
So it's hugely useful in surfacing papers that I may very well never have found otherwise using e.g. Google Scholar. It's particularly useful in finding relevant work in parallel subfields -- e.g. if you work in physics but it turns out there are math results, or you work in political science and it turns out there are relevant findings from anthropology. And also just obscure stuff -- a random thesis that never got published or cited, but the PDF is online and turns out to be relevant.
It doesn't matter if 75% of the results are not useful to me or are hallucinated. Those only cost me minutes. The other 25% more than make up for it -- they're things I simply might never find otherwise.
Pretty much, though Google got bad at these things well before LLMs really came on the scene. We can all debate which project manager was responsible and the month and year things took a downward turn, but the obvious catalyst, IMO, was that "Barely Good Enough" search creates more ad impressions, especially when virtually all of the bad results you serve are links to sites that also run Google-managed ads.
The main reason Google doesn't find good search results anymore is that there are no good search results anymore, because there are no websites anymore. You can't do much better than that.
Right, Google definitely isn't helping themselves IMO, but the reason search got hard is that it became profitable to be the "winner" of a search query. It's a hostile market that actively works to undermine you.
AI will absolutely have the same problem if it "takes over", except the websites that win and get your views will not look like blogspam; they will look like (and be) the result of adversarial machine learning.
There was a very clear turning point: when Amit Singhal was kicked out for sexual harassment in the MeToo era. He was the heart of search quality, but he went too far when he was drinking.
Nope. I'm talking about the stuff keywords are no good at, and which Google Scholar doesn't tend to surface because it's just not cited much or it's from a different niche.
The fact that LLMs understand your question semantically, not just via keyword matching, is huge.
Another win for big tech: Google has been enshittified to such a point that you can now spin up a machine that consumes 1000x the power to give you a result with coin-toss odds of being totally made up.
A search query probably uses about 10x more electricity than a matching LLM query. There's enough wiggle-room depending on the assumptions that they might be about even. There is no way search uses 1/1000th of an LLM.
I use all of the current versions of ChatGPT, Gemini, and Claude.
The hallucination rates are about the same as far as I can tell. It depends mostly on how niche the area is, not which model. They do seem to train on somewhat different sets of academic sources, so it's good to use them all.
I'm not talking about deep research or advanced thinking modes -- those are great for some tasks but don't really add anything when you're just looking for all the sources on a subject, as opposed to a research report.
OpenAI has published a great deal of information about hallucination rates, as have the other major LLM providers.
You can't just give one single global hallucination rate, since the rates depend on the use case. And despite the abundant information available on how to pick the appropriate tool for a given task, it seems very few people take the time to first recognize that these LLMs are tools, and that you do need to learn how to use them in order to be productive with them.
Saying it isn't useful is a bit of an overstatement. It can search, churn through 500k words in a few minutes, and come back with summaries, answers, and sources for each point.
Should you blindly trust the summary? No. Should you verify key claims by clicking through to the source? Yes. Is it still incredibly useful as a search tool and productivity booster? Absolutely.
I gave it a PDF recently and asked it to help me generate some tables based on the information therein. I thought I'd be saving myself time. I spent easily twice as long as I would have if I had done it myself. It kept making trivial mistakes, misunderstanding what was in the PDF, hallucinating, etc.
Last summer I used one of the models to help translate a few German wikipedia pages to English, hoping it would make things easier by keeping all the formatting etc. that I'd lose if I copy-pasted mere content via Google Translate.
I did check the translations were correct as part of this — while my German isn't great, it was sufficient for this — and it was fine up until reaching a long table about the timeline of events relevant to the subject, at which point it couldn't help but make stuff up.
Still useful, but when you find the limits of their competence, there's no point attempting to cajole them to go further. They'll save you whatever % of the task in effort; now you have to do all the rest yourself. It's a waste of effort to think either carrot or stick will get them to succeed if they can't do it in the first few tries.
It is excellent when just finding something is enough. Most often in my practice, I am dealing with questions that have no written-down answers, meaning the probability of finding a book/article that provides one is negligible. Instead, I am looking for indirect answers or proofs before I make a final engineering decision. Yet another problem is that the language itself changes over time. For instance, at the beginning of the 20th century, integers were called integral numbers. IMHO, LLMs handle such cases poorly when used as a substitute for search engines.
For full-text vector search, I am using https://www.recoll.org/ - a real time saver for me, especially for desktop search.
Because this is precisely what the word counterfeit means: an imitation that deceives you. The functionality of a counterfeit can be anywhere from 0% to 100%, depending on your luck. If you accidentally buy a fake iPhone, it can still make calls. Chat output is something that looks like a literature review, a collection of summaries of relevant papers. But the problem is that a review is not just text compression. It is also about rejecting low-quality research, considering the historical context, identifying contradictions, etc. No machine can do that analysis for you, yet. I am quite confident of that.
If we're talking about literature search here, the workflow is
1. Get list of sources and their summaries from an LLM.
2. Read through, find a paper whose title and summary seem interesting to you.
3. Follow the LLM's link, usually to an arXiv posting.
4. Read the title and abstract on arXiv. You can now judge the accuracy of the LLM's summary.
It's really easy to tell if the LLM is accurate when it is linking to something which has its own title and summary, which is almost always the case in literature search.
There's this principle, I forget the name, about how everyone reading the newspaper, when they get to a story on a subject they're familiar with, will instantly spot all the holes, all the errors. And they will ask themselves: how was this even published in the first place?
But then they flip to the next page and they read a story on a subject they're not an expert on and they just accept all of it without question.
I think people might have a similar relationship with ChatGPT.
I've thought of the same analogy, but you know, I've never actually seen someone go "It's wrong about the stuff I understand, but I'll trust it anyway on everything I know nothing about". It's either:
(1) people getting caught using it to do their own jobs for them (i.e. they don't even realise it's wrong about the stuff they do understand);
(2) people who see the problems and therefore don't trust them anywhere at all (i.e. no amnesia, quite sensible reaction);
(3) people who see the problems and therefore limit their use to domains where the answers can be verified (I do this).
--
As an aside, I'm a little worried that I keep spotting turns of phrase that I associate with LLMs, for example where you write "you're absolutely right": I have no idea if that's all just us monkeys copying what we see around us (something we absolutely do), or if you're using that phrase deliberately because of the associations.
The only thing I'm confident of is that you're not karma-farming with an LLM, but that's based on your other comments not sounding at all like LLMs, so why would you (oh, how surprising it was when I was first accused of being an LLM). But eh, dead internet theory feels more and more real…
> The Gell-Mann Amnesia effect. And you're absolutely right, it's extremely pronounced in LLM users.
And I guess a lot of LLM-hype critics are much less capable of "flipping to the next page and reading a story on a subject they're not an expert on and just accepting all of it without question".
Because this is an unusual personality trait, these LLM-hype critics get reprimanded all the time by the "mob" for not seeing the great opportunities that LLMs could bring, even if the LLMs aren't perfect.
I wonder whether, for a lot of the search and literature-review-type use cases where people are trying to use GPT-5 and the like, we'd honestly be much better off with a really powerful semantic search engine. Any time you ask a chatbot to summarize the literature for you or answer your question, there's a risk it will hallucinate and give you an unreliable answer. Using LLM-generated document embeddings to retrieve the nearest matches, by contrast, doesn't run any risk of hallucination and might be a powerful way to retrieve things that Google / Bing etc. wouldn't be able to find using their current algorithms.
I don't know if something like this already exists and I'm just not aware of it to be fair.
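For what it's worth, the retrieval core of such a thing is small. Here's a minimal sketch in Python (the model name, the toy corpus, and the function names are all illustrative assumptions; a real system would index millions of abstracts). The point is that this step only ranks existing documents and never generates text, so there is nothing to hallucinate:

    # Embedding-based semantic retrieval: rank documents, don't generate text.
    # Assumes: pip install sentence-transformers numpy; model choice is illustrative.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose embedder

    corpus = [  # stand-ins for paper titles/abstracts
        "Harmonic distortion in loaded telephone lines due to ferromagnetic cores",
        "Volume holographic gratings for narrow-band solar spectropolarimetry",
        "Soft error correction coding for networks-on-chip",
    ]
    corpus_emb = model.encode(corpus, normalize_embeddings=True)

    def search(query, k=2):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = corpus_emb @ q                 # cosine similarity (vectors are unit-length)
        top = np.argsort(-scores)[:k]
        return [(corpus[i], float(scores[i])) for i in top]

    print(search("nonlinear distortion caused by ferromagnets in communication lines"))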
I think you have a very good point here: semantic search would be the best option for this kind of query. The items would have unique identifiers, so language variations could be avoided. But unfortunately, I am not aware of any publicly available projects of this kind that analyze scientific reports at scale, except DBpedia and some biology-oriented ontologies.
Currently, I am applying RDF/OWL to describe factual information and contradictions in the scientific literature, on an amateur level, so I do it mostly manually. The GPT discourse brings up not only human perception problems, such as cognitive biases, but also truly philosophical questions of epistemology that should be resolved beforehand. LLM developers cannot solve this, because it is not under their control; they can only choose what to learn from. For instance, a scientific text is not an absolute truth but rather a carefully verified and reviewed opinion, based on previous authorized opinions and subject to change in the future. So the same author may hold different opinions over time, and more recent opinions are not necessarily more "truthful" ones. Now imagine an RDF triple (subject-predicate-object tuple) that describes all of that. A pretty heavy thing, and no NLTK can decide for us what is true and what is not.
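To make that concrete, here is a rough sketch of the kind of statement I mean, written with Python's rdflib (the namespace, the property names, and the claim itself are all invented for illustration). The claim is modelled as its own resource, so the author and the date of the opinion hang off it as ordinary triples, and two conflicting opinions by the same author can coexist:

    # One opinion as a first-class resource with provenance attached (illustrative names).
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/")
    g = Graph()

    claim = EX.claim42                      # the opinion itself, not "the truth"
    g.add((claim, RDF.type, EX.Claim))
    g.add((claim, EX.statement, Literal("Integers are to be called 'integral numbers'.")))
    g.add((claim, EX.assertedBy, EX.someAuthor))
    g.add((claim, EX.assertedIn, Literal("1905", datatype=XSD.gYear)))

    print(g.serialize(format="turtle"))

A later, contradicting opinion by the same author would just be another such resource with a later date; nothing in the graph says which one is "true".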
Since you were specifically wondering whether something like this exists, I feel okay mentioning my own tool https://keenious.com since I think it might fit your needs.
Basically we are trying to combine the benefits of chat with normal academic search results using semantic search and keyword search. That way you get the benefit of LLMs but you’re actually engaging with sources like a normal search.
If your issue is "incorrect answers" (hallucination) then an embedding search will naturally also produce "incorrect answers" because an embedding is a lossy compression of the source.
I think its scope is narrower than a lit review assistant. I use it mainly for finding papers that I or my RAs might have missed in our lit review.
I have a recent example where it helped me locate a highly relevant paper for my research. It was from an obscure journal and wouldn't show up in the first few pages of Google Scholar search. The paper was real and recently published.
However, using LLMs for doing lit review has been fraught with peril. LLMs often misinterpret the research findings or extrapolate them to make incorrect inferences.
If you’re interested in a literature review tool, I built a public one for some friends in grad school that uses hierarchical mixture models to organize bulk searches and citation networks.
Thank you for sharing! I like your dendrogram-like circular graphs! They are way more intuitive. That could be a nice companion to the bibliometrix/biblioshiny library for bibliometric analysis https://www.bibliometrix.org/. I tried "Deep Dive" with my own query, and ... it unfortunately stops at the end of "Organizing results". Maybe I should try again later.
Haha that’s embarrassing! The progress bars are an estimate. If a paper has a lot of citations, it may take a bit longer than the duration of the bars but it will hopefully finish relatively soon!
Edit:
Got home and checked the error logs. There was a very long search query with no results. Bug on my end to not return an error in that case.
If you were hoping to use the citation network, it needs the URL as input rather than the title.
Today I tried it with an old (1935) fundamental Bell Labs paper on harmonic distortion caused by ferromagnets in communication lines. It has surely been cited more times than Google Scholar reports (25). DOI:10.1002/j.1538-7305.1935.tb00418.x
Here is the list of what the system proposed to take a look at:
1. Vascular at‐risk genotypes and disease severity in Lebanese sickle cell disease patients
2. Narrow band filter for solar spectropolarimetry based on Volume Holographic Gratings
3. Communication from Space: Radio and Optical by S. Weinreb
4. An Adaptive and High Coding Rate Soft Error Correction Method in Network-on-Chips
5. Plasma hemostasis in patients with coronavirus infection caus...
I would expect more papers like the 3rd one, since the topic is communication systems. Unfortunately, ref. 3 says nothing about distortions.
Maybe it is indeed a linguistic curse of this topic...
Struggling to understand this one. Is it that (1) it's lopsided toward reference materials found on the modern internet and not as useful for reviewing literature from the Before Times or (2) it's offering specific solutions but you're skeptical of them?
LLM-powered deep research tools are very useful at finding whether someone has already published on a research question. You could spend days performing keyword searches and traversing citation graphs but still miss relevant pockets of the literature. Semantic search is a genuine improvement here.
As to not trusting the generated text, you’re totally right. That’s why I use it as a search tool but mostly ignore the content of what the LLM has to say and go to the source.
I was reminded how terribly ChatGPT hallucinates this morning when I used it to look up the 7 point measurement locations for skin fold calipers to estimate body fat.
It correctly described the locations in text, then it offered to provide a diagram.
I said “sure”, and it generated an image saying the chest location is on the neck, and a bunch of other clearly incorrect locations for the other measurement sites.
I love Arch, but suggesting that even an above average Windows 10 user migrate to Arch without prior Linux experience is just irresponsible. They may be left with an unusable machine if they format their "C:" drive but don't manage to properly install Arch and a DE. Mint or even plain Debian seem far better for this, and updating such systems is usually more predictable.
> We don’t need to worry about AI itself, we need to be concerned about what “humans + AI” will do. Humans will do what they’ve always tried to do—gain power, enslave, kill, control, exploit, cheat, or just be lazy and avoid the hard work—but now with new abilities that we couldn’t have dreamed of.
Starting with the AI itself: selling LLMs as AI is the great deception. Text generation using Markov chains is not particularly intelligent, even when it is looped back through itself a thousand times and comes to resemble an intelligent conversation. What is actually being sold is an enormous matrix trained on terabytes of human-written, high-quality text, obviously in violation of all imaginable copyright laws.
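For readers who have never seen one, this is roughly what bare Markov-chain text generation looks like: a toy word-level sketch, with the corpus and names invented for illustration.

    # Toy word-level Markov chain text generator (bigram transition table).
    import random
    from collections import defaultdict

    corpus = "the cat sat on the mat and the dog slept on the mat".split()

    table = defaultdict(list)            # word -> list of observed next words
    for a, b in zip(corpus, corpus[1:]):
        table[a].append(b)

    def generate(start, length=8):
        word, out = start, [start]
        for _ in range(length - 1):
            options = table.get(word)
            if not options:
                break
            word = random.choice(options)
            out.append(word)
        return " ".join(out)

    print(generate("the"))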
Here is a gedankenexperiment to test whether an AI has any intelligence: until the machine starts to detect and resolve contradictions in its own outputs without human help, one can sleep tight. Human language is a fuzzy thing, not quite suitable for a non-contradictory description of the world. Building such a machine would require resolving, in a unified way, all the contradictions humanity has ever faced. Before that happens, humanity will be drowned in low-quality generated LLM output.
I doubt ultrasound would trigger apoptosis in cancer cells; one of the reasons they're cancerous is that they refuse to commit suicide when they should.
Your article strongly resonated with the ideas I often try to communicate to those around me. I wish there were more open discussion about biases related to PhD qualifications and the growing influence of venture-capital-style practices in science. Many researchers dedicate themselves to exploring new and uncertain areas of knowledge, yet their work is sometimes undervalued by hiring managers and industry professionals. I personally know HRs who consider the "PhD" tag in a CV a red flag.
There is also a tendency to overlook the fact that pursuing a PhD is a form of professional work, comparable in commitment and responsibility to other careers. Academia can indeed be a challenging environment, and not everyone contributes in the same way: some may prioritize credentials over substance, and cases of research misconduct do exist. However, these examples definitely do not represent the academic community as a whole.
Regarding the venture-capital approach, in applied sciences, researchers are increasingly expected to present their ideas in short, pitch-style formats. At some point, this process itself becomes the goal of the work. I can imagine this encouraging concise communication. Still, it also shifts part of the management responsibilities—such as market analysis and outreach—onto scientists, which does not align with their core expertise or professional goals.
> I personally know HRs who consider the "PhD" tag in a CV a red flag.
Which should only tell us what a cancer HR departments are. Filtering someone out of talking to the hiring department, for the crime of doing a PhD, and telling others about it: to me this sounds more like someone getting a kick out of wielding the bit of power they have over others than like a useful contribution to society.
Fwiw, for everyone I knew in undergrad who ended up in HR, that was basically their only option. They got slammed in STEM classes in HS, or even in their first year or two of college, so that was a nonstarter. They weren't creative people, so arts or humanities were out. They didn't grasp business concepts, so finance and other avenues in the business school were out. They landed an HR internship in their junior year, and the rest is history.
It is bewildering to me that HR is responsible for hiring anyone outside the HR department.
As many gripes as there are about the competitive grant process, at least in the US it was formerly adjudicated by scientific experts. (Yes, subject to groupthink and overweighting en vogue topics, but still by experts)
You'll first need to frame Towers of Hanoi as a supervised learning problem. I suspect the answer to your question will differ depending on what you pick as the input-output pairs to train the model on.
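For example, one possible framing (an assumption on my part, certainly not the only choice) is to train on (board state, optimal next move) pairs generated from the classic recursive solution; a rough Python sketch with invented names:

    # Generate (state, optimal-move) supervision pairs for n-disk Towers of Hanoi.
    def hanoi_moves(n, src=0, dst=2, aux=1):
        """Yield the optimal move sequence as (disk, from_peg, to_peg)."""
        if n == 0:
            return
        yield from hanoi_moves(n - 1, src, aux, dst)
        yield (n, src, dst)
        yield from hanoi_moves(n - 1, aux, dst, src)

    def training_pairs(n):
        """Yield (state, move): state is a tuple of peg contents, move is (from, to)."""
        pegs = [list(range(n, 0, -1)), [], []]      # all disks start on peg 0
        for _disk, src, dst in hanoi_moves(n):
            state = tuple(tuple(p) for p in pegs)   # model input: full board state
            yield state, (src, dst)                 # model target: which move to make
            pegs[dst].append(pegs[src].pop())

    for state, move in training_pairs(3):
        print(state, "->", move)

Other framings (predicting the whole move sequence from n, or next-token prediction over a move transcript) would probably give quite different answers.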
YC remains a great source of creative inspiration for me. However, I tend to skip most AI-related content on HN, as the topic does more harm than good to me in my daily life. Some people around me delegate more and more decisions to the chat, and that frightens me, especially if you are somehow dependent on them or your work gets evaluated by some creepy AI-driven bossware. We should admit that AI, particularly LLMs, is not just eating the world: it is destroying society, human communication, the education system, and the scientific community. And that list covers only the sides I have personally encountered.
I work with some undergrads and see this delegation increasing year over year. Unfortunately it's also happening at the expense of reading books, using library search tools to find proper sources, and information gathering in general.
"An LLM might be able to explain something to you, but it can never understand it for you."
I was recently in a work AI training where we were encouraged to have AI review all our vendor and contractor budgets, find holes, and create rebuttals to line items. I was wondering: what if the vendor has AI review our rebuttals and create counter-points to our AI-created arguments? At some point it will just be AI talking against itself to another AI chatbot.
> At some point it will just be AI talking against itself to another AI chatbot.
And then something like this [0] will happen, creating a weird wasteful meta-game about the model(s) used by each company.
It makes intuitive sense: If you outsourced "generate awesome assertions" to a contractor, then someone else hired the same contractor to "judge the awesomeness of these assertions", they are more likely to get lots of "awesome" results—whether they're warranted or not. The difference might even come down to quirks of word-choice and formatting which a careful human inspector would judge irrelevant.
I anticipate a time when two negotiating parties will give their negotiation points and red lines to their agents, and the negotiations will wholly take place between their representative agents.
I visibly cringe when I hear people talking about using AI for anything. With the help of the social disease we call 'social media', AI is destroying thought on all levels.
Not sure about destroying everything. Definitely benefitting a few at the great, great expense of the many. But trickle-down benefits are a thing. And I see no way to avoid this future; the genie's out of the bottle.