
I don't disagree completely with this, but just want to point out that it's kind of a bad smell to have computational biologists who are - as someone in the article puts it - computationally illiterate. I have met lots of these types over the years, and usually their methods are kind of a gong show. If you can't properly sanitize your data inputs on your column headers, why should I trust that you've treated the rest of your data properly?


I have a strong feeling that, if people really put an effort into reading and replicating more papers, we would find that a lot of what's being published is simply meaningless.

In grad school I had a subletting roommate for a while who was writing code to match some experimental data with a model. He showed me his model. It was quite literally making random combinations of various trigonometric functions, absolute value, logarithms, polynomials, exponents, etc. into equations that were like a whole page long and just wiggling them around. He was convinced that he was on a path to a revolution in understanding the functional form of his (biological) data, and I believe his research PI was onboard.

I guess "overfitted" never made it into the curriculum.


> It was quite literally making random combinations of various trigonometric functions, absolute value, logarithms, polynomials, exponents, etc. into equations that were like a whole page long and just wiggling them around.

Technically, we call that a "neural network". Or "AI".


It reminds me more of a genetic algorithm


Every time I think I'm getting somewhere with a GA, I eventually realise I've just created a guided Monte Carlo simulation.


This answers the question "will we ever have AI as smart as a human?"

Yes. It just turns out it's a particular human, whose analysis is very very dumb.


I work in computational materials science (where ML brings funding) and a funny paper of this kind is here: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.11... - they are literally trying out hundreds of thousands of possible combinations by brute force to build a "physical model".

Then they go to conferences and brag about it, because they have to (otr they know it's bs). Datasets are so-so (you can have a look at QM9...) and for more specialized things, people generally don't bother trying to benchmark or compare their results against a common reference. It's just something new...

And on top of all that: even setting aside the fancy statistical methods applied without much understanding of them, your theoretical computations might not make that much sense (at least not in the sheer volume that gets pumped out and published)...


> (otr they know it's bs).

Well, that's a new acronym for me. I wonder where it came from. Apparently it's "on the real". Sounds like AAVE?


I thought it was off the record like the old Pidgin plugin or https://en.wikipedia.org/wiki/Off-the-Record_Messaging.


It means off the record.


off-the-record (not written, not cited, but available in personal discussions).


Oh, I thought "on the real" fit the context better, meaning they knew in their heart of hearts it was bullshit, but "off the record" is about the same.


> Well, that's a new acronym for me. I wonder where it came from. Apparently it's "on the real". Sounds like AAVE?

> AAVE

OK.


typo for 'or'?


> I have a strong feeling that, if people really put an effort into reading and replicating more papers, we would find that a lot of what's being published is simply meaningless.

People figured that out long ago [1] (I know the author of that paper has lately become somewhat controversial, but that doesn't change his findings). It's not very widely known among the general public. But if you understand some basic issues like p-hacking and publication bias and combine that with the knowledge that most scientific fields don't do anything about these issues, there can hardly be any doubt that a lot of research is rubbish.

[1] https://journals.plos.org/plosmedicine/article?id=10.1371/jo...


isn’t the saying, 80% of everything is garbage?


Sturgeon's Law https://en.wikipedia.org/wiki/Sturgeon%27s_law "ninety percent of everything is crap."


Yeah, but one would hope that science has a higher standard. 80% garbage results in science sounds catastrophic to our understanding of the world, and in particular when it comes to making policies based on that science.


There's the saying "science advances one funeral at a time."

'‘A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.’ This principle was famously laid out by German theoretical physicist Max Planck in 1950 and it turns out that he was right, according to a new study.'

https://www.chemistryworld.com/news/science-really-does-adva...

Also the story of Ignaz Semmelweis who discovered that if doctors washed their hands it reduced deaths during childbirth - but for a variety of reasons his findings were resisted.

https://www.npr.org/sections/health-shots/2015/01/12/3756639...

Point being, as awesome as science is, it's still a human enterprise, and humans are still, well, human.


You do realize we've been at this for a few thousand years, right? Catastrophic is putting it mildly.


Unfortunately, like bs, the power law is fractal: take the good 10%, and 90% of that is crap too.


We'd better work on making more stuff.


Better stuff?


If you have more stuff, and 10% of it is good, then there's more good stuff in absolute terms.


Not my area at all but isn't this genetic algorithm type stuff?


In grad school I had a friend who was doing olfactory (smell) research on rats with tetrode drives (wires in their brain). He was looking at the neuronal response to smells that they gave the rats and had a few signals to match up. There was the signal from the Arduino running the scent gates, the amps that reported the weak neuronal currents, the nose lasers that gated the Arduino, etc. He was having a hard time getting through all the data in his MatLab code and I offered to help for some beer.

After the 11th nested 'if' statement, I upped the request to a case of beer. I'm not certain he ever got the code working.

To the larger point, scientists are not programmers. They got into their programs to do research. What keeps them going is not the joy of programming, but the thrill of discovery. Programming is nothing but a means to an end. One they will do the bare minimum to get working. Asking hyper-stressed-out grad students to also become expert coders isn't reasonable.

And yes, that means that the code is suspect at best. If you load the code on to another computer, make sure you can defenestrate that computer with ease, do not use your home device.


I keep seeing this sentiment when it comes to those in the natural sciences, but it makes no sense.

I could replace "programming" in your above little bit with "mathematics" and it would be just as weird.

Our modern world runs on computers and programs, just as our modern world and modern science built itself on mathematics and required many to use it. So too the new world of science may require everyone to know how to program just as they know about the chemical composition of smells, or the particulars of differential equations, etc.

And I know your argument isn't "they shouldn't learn programming", but honestly since I keep seeing this same line of reasoning, I can't help but feel that is ultimately the real reasoning being espoused.

Science is getting harder, and the bar for competently "finding the exciting things" rises each time. I don't see this as a bad thing. To the contrary, it means we are getting to more and more interesting and in-depth discoveries that require more than one discipline and specialty, which ultimately means more cross-functional science that has larger and deeper impacts.


Most scientists are not great at math either.

Again: these are tools that are means to an end. They only need to work well enough to get the researcher to that end.

A lot of what are considered essential practices by expert programmers are conventions centered around long-term productivity in programming. You can get a right answer out of a computer without following those conventions. Lots of people did back in the day before these conventions were created.

That's not to say that everybody with horrible code is getting the right answers out of it. I'm sure many people are screwing up! My point is just that ugly code does not automatically produce wrong answers just because it is ugly.

By analogy, I'm sure any carpenter would be horrified at how I built my kayak rack. But it's been holding up kayaks for 10 years and really, that's all it needs to do.

I will add that in general, statistical analysis of data is not by itself adequate for scientific theory--no matter how sophisticated the software is. You need explanatory causal mechanisms as well, which are discovered by humans through experimentation and analysis.

And you can do science very well with just the latter. Every grand scientific theory we have available to us today was created without good programming ability, or really the use of computers at all. Many were created using minimal math, for example evolution by natural selection, or plate tectonics. Even in physics, Einstein came up with relativity first, and only then went and learned the math to describe it.


Your point is maybe a little obtuse to me, because it sounds like you are arguing for "computers are tools that should be learned, but really no one does and who can blame them, they just want to science" and simultaneously arguing, "tools aren't science, and science can be done without them".

I feel like the latter is obvious: of course the tools aren't science, but if you want to do real work and real science, your tools are going to be crucial for establishing measurements, repeatability, and sharing how one models their hypothesis onto real world mechanics.

Likewise, the former is just the same commonly repeated thing I just argued against and my reply is the same: so what? You building a kayak is not science and is irrelevant.

Scientists can't reach a meaningful conclusion without proper use of tools. All they can do is hypothesize, which is certainly a portion of science (and many fields are in fact stuck in this exact stage, unable to get further and come to grounded conclusions), but it is not the be-all and end-all of science, and getting to the end in modern-day science means knowing how to program.

Of course there are exceptions and limitations and "good enough". No one is arguing that. The argument I am refuting is those who think "tools are just tools, who cares, I just want my science". That is the poor attitude that makes no sense to me.


> Scientists can't reach a meaningful conclusion without proper use of tools.

I'm just trying to make the point that "proper" is subjective. Software developers evaluate the quality of code according to how well it adheres to well-established coding practices, but those practices were established to address long-term issues like maintainability and security, not whether the software produces the right answer.

You can get the right answer out of software even if the code is ugly and hacky, and for a lot of scientific research, the answer is all that matters.


The usual reason programmers object to ugly, hacky code is that it's a lot harder to be justifiably confident that such code actually does produce the right answer -- "garbage in, garbage out" is just as true in function position as it is in argument position.


Tbh I think it's a case for multidisciplinary research. You wouldn't only hire one skill set to run a company, even a tech one, so why should research be any different? That's probably where the deep insights are.


People who are just decent programmers can make at least twice (probably 3 or 4 times) as much money working in industry as in academic science. Most programmers who would work for less money because they are interested in science will be more interested in computer programming problems than basic programming to support a scientist. NSF won't give you $250k to hire a really good programmer to support your gene analysis project. More like 100k if you are lucky.

So what you end up with is that great scientists who are also decent programmers are the ones who can do the cutting-edge science at the moment.


That's a problem we should aim to solve.


Think of the flip side: Programmers are terrible biologists.

Sure, it would be great if we all had more time to learn how to code. Coding is important. But I'd say the onus should be on coders to build better tools and documentation so they are empowering people to do something other than code, rather than reduce everything to a coding exercise because making everything look like code means less boring documentation and UX work for coders.

I mean, biology is in fact a full on degree program and you pretty much need a PhD before you're defining an original research topic. It's not because biologists are dumber and learn slower. It's that biology is complicated and poorly understood, and it takes years to learn.

Contrast this to coding... you don't even need to go to college to launch a successful software product, and the average person can become proficient after a few years of dedicated study. However, this is a few years that biologists don't have, as their PhDs are already some of the longest, time-wise, to finish.

The decision to rename genes is totally consistent with the biologists' MO: if a cell won't grow in a given set of conditions, change the conditions. Sure, we can CRISPR-edit a cell's genes so that it grows in a given set of conditions, but it's usually far easier to just change the temperature or growth media than to edit the cell's DNA.

My takeaway is that this is more a failure of programmers and/or a failure of their managers to guide the programmers to make tools for biologists, than of biologists to learn programming. Sure, coders get paid more, but they aren't going to cure cancer or make a vaccine for COVID-19 without a biologist somewhere in the equation. And I'm glad the biologists developing vaccines today are doing biology, and not held up in their degree programs learning how to code!


MatLab has taken over bio specifically because it has great documentation and examples. If Python is pseudo-code that compiles, then MatLab is just speaking English. Even still, the spaghetti that researchers get into is just insane.


> To the larger point, scientists are not programmers. They got into their programs to do research.

I would say most research, to an ever growing degree, is so heavily dependent on software that it's tough to make that claim anymore. It makes no sense to me. It's like saying Zillow doesn't need software engineers because they are in the Real Estate business, not the software business.


I maybe misspoke. I meant that scientists do not go into science to program, they go into it to discover and do research (among many many other things). Sure, some do find joy in good programming, but that's not why they are there to begin with. Becoming a better programmer isn't their passion, and those skills remain underdeveloped as a result.


> To the larger point, scientists are not programmers.

I mean, sort of. Some research is essentially just programming; other research can get by with nothing but Excel. Regardless, it's unreasonable to ask most scientists to be expert programmers -- most aren't building libraries that need to be maintained for years. If they do code, they're usually just writing one-shot programs to solve a single problem, and nobody else is likely to look at that code anyway.


It's not usually computational biologists who are using Excel.

What if you want to share data with a wet-lab biologist who wants to explore their favorite list of genes on their own?


There are lots of great computational biologists, but being a computational biologist doesn't necessitate being good with computers. Plenty of PIs rely pretty much exclusively on grad students and post-docs to run all their analyses.

Not that I'm saying using Excel is bad either. I use Excel plenty to look at data. But scientists need to know how to use the tools that they have.


If people are just looking at the spreadsheets then wouldn’t the cells interpreted as dates not be a problem? It seems like it would only be a problem if you’re doing computation on the cells.


Excel changes the displayed text when it interprets the text as a date. You store "MARCH1", it displays "1-Mar".
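
To make that concrete, here's a rough screen (my own sketch; the exact set of symbols Excel converts depends on version and locale) for gene names that a default Excel import is likely to reinterpret as dates:

    # Hypothetical helper: flag gene symbols that look like dates to Excel.
    import re

    # Longest month spellings first so e.g. "MARCH1" isn't cut off at "MAR".
    MONTHS = r"(JANUARY|JAN|FEBRUARY|FEB|MARCH|MAR|APRIL|APR|MAY|JUNE|JUN|JULY|JUL|AUGUST|AUG|SEPTEMBER|SEPT|SEP|OCTOBER|OCT|NOVEMBER|NOV|DECEMBER|DEC)"
    DATE_LIKE = re.compile(rf"^{MONTHS}[-_ ]?\d{{1,2}}$", re.IGNORECASE)

    def excel_date_risk(symbols):
        """Return the gene symbols Excel is likely to silently turn into dates."""
        return [s for s in symbols if DATE_LIKE.match(s)]

    print(excel_date_risk(["MARCH1", "SEPT2", "DEC1", "TP53", "BRCA1"]))
    # -> ['MARCH1', 'SEPT2', 'DEC1']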


It's also my experience of research in biological sciences that it is a widespread belief/fact that in order to get published in a top journal, the analysis methods must be "fancy", for example involving sophisticated statistical techniques. I worked on computational statistical methods so I'm not against that per se, but the problem is that if you have the training to contribute at the research front of an area of biology you rarely have the training to understand the statistics. Some would say that the collaborative publication model is the solution to that, but in many cases the end result isn't what one would hope for. I do think that less emphasis on "fancy" statistics, and more emphasis on simple data visualizations and common sense analyses would be a good thing.


Agreed. And to add to that: the fancier the statistics have to be, the less robust the results.


I'm an ex-computational biologist who did most of his work in Python but periodically had to interop with Excel.

The basic assumption I have is that when I input data into a system, it will not translate things, especially according to ad-hoc rules from another domain, unless I explicitly ask it to do so.

It's not clear what data input sanitization would mean in this case; date support like this in Excel is deeply embedded in the product and nobody reads the documentation of Excel to learn how it works.


it would be nice if everyone was expert at everything, but they can't be. it would be nice if they hired experts but money doesn't grow on trees. we often insist on a degree of excellence we refuse to pay for


It's not about being an expert at everything or hiring more people. These aren't particularly hard problems; it's not difficult to find biologists who are incredibly adept at using Python, R or C. It's about thinking about how science gets funded and how it gets implemented. I've written here before about the difference between "grant work" and "grunt work", and how too much computer touching tends to get looked down upon at a certain level.

If you're deciding who gets a large-scale computational biology grant, and you're choosing between a senior researcher with 5000 publications and a broad scope, and a more junior researcher with 500 publications and a more computationally focused scope, most committees choose the senior researcher. However, the senior researcher might not know anything about computers, or they may have been trained in the '70s or '80s, when the problems of computing were fundamentally different.

So you get someone leading a multi-million dollar project who fundamentally knows nothing about the methods of that project. They don't know how to scope things, how to get past roadblocks, who to hire, etc.


What's your source on it not being difficult to find biologists who are adept at using Python, R, or C? Most biologists operating in private industry or academia have many years of training in their fields, and many have learned their computational tools as they've gone along, meaning they've never received proper training. It seems dubious to claim that there's this never-ending supply of well-trained biologists who are also adept at programming.


I would say the number of biologists who actually understand programming is extremely small. I've been programming for fun for ~15 years, and I'm about to finish a PhD in chemical biology (i.e. I started programming in C long before I started learning biology).

You might occasionally run into someone who is passable - at best - with R or Python. But most of the code they might write is going to be extremely linear, and I doubt they understand software architecture or control flow at all.

I don't know any biologists who program for fun like me (currently writing a compiler in Rust).


To be fair, linear code is often totally sufficient for most types of data analysis. Biologists don't really need to understand design patterns or polymorphism, they just need to not make computational mistakes when transforming the data.


Absolutely. My point was more that you can't expect comp. biologists to actually be "good" programmers when compared to SWEs or even web devs.

Most of the code I write to do biological data analysis is fairly linear. However, I also generally use a static type system and modularity to help ensure correctness.

I've perused a lot of code written by scientists, and they could certainly learn to use functions, name variables descriptively, use type systems and just aspire to write better code. I just saw that a paper published in Science had to issue a revision because they found a bug in their analysis code after publication that changed most of their downstream analysis.
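
As a small illustration of the kind of structure being described (this is my own sketch, not code from that paper): a named, type-annotated transformation with a sanity check, instead of an inline one-liner buried in a script.

    # Hypothetical example of a small, typed, testable analysis helper.
    from typing import Sequence
    import math

    def log2_fold_change(treated: Sequence[float], control: Sequence[float]) -> list[float]:
        """Per-gene log2 ratio of treated vs. control counts, with a pseudocount of 1."""
        if len(treated) != len(control):
            raise ValueError("treated and control must have the same length")
        return [math.log2((t + 1) / (c + 1)) for t, c in zip(treated, control)]

    print(log2_fold_change([7.0, 0.0], [3.0, 0.0]))   # [1.0, 0.0]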


It does get rather problematic when you have large quantities of...stuff. You can't run linear stuff in parallel so now you're bound to whatever single CPU core you have lying around.

I'd say that getting some basic data-science computing skills should be more important than the silly SPSS courses they hand out. Once you have at least baseline Jupyter (or Databricks) skills you suddenly have the possibility to do actual high-performance work instead of grinding through grunt work. But at that point the question becomes: do the people involved even want that?


I write 'one-off' programs all the time. Most of what I write I throw away, and I program for a living. Those are usually fairly linear, which is fine. If I am writing something that will be re-used in 6 different ways and by a 5-person team, that is when you get out the programming methodologies. It is usually fairly obvious to me when you need to do it. Someone who does not do it all the time may not know to say 'hey, stop, you have a code mess'.

It's one of the reasons why people end up with spreadsheets. Most of their data is giant tables, and Excel does very well at that. It has a built-in programming language that is not great but not totally terrible either. Sometimes all you need is a graph of a particular type: paste the data in, highlight what you want, use the built-in graph tools. No real coding needed. It is also a tool that is easy to mismanage if you do not know the quirks of its math.


It doesn't take being an expert at Excel to understand how Excel autoformats. It takes a few days of actually working with data or an introductory class that's today taught in American primary schools.


Sorry for asking, but are you familiar with how MS Excel aggressively converts data to dates? There's no way to "sanitize" it (without resorting to hacky solutions like including extra characters), and even if you fix the data, it will re-convert it to dates the next time you open the file.


You're simply incorrect. If you set the column format to Text it will never convert data to dates, including when you open the file.


Great, how do you set a custom column format in a CSV file?


I'm only familiar with LibreOffice and not Excel myself, but: if you want to be sure a column is treated as text in a CSV file, you have LO quote text fields on save, and have it treat quoted fields as text on import. I assume Excel must have similar options.


LibreOffice, on opening a CSV file, always pops up an import dialog where you can do this. To keep the column format permanently, save as .ods.
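
For what it's worth, here's a minimal sketch of that quoting workflow from the producing side (file and column names are made up): write the CSV with every field quoted, so that an import option like LibreOffice's "treat quoted fields as text" can keep symbols such as MARCH1 verbatim. Quoting alone does not stop Excel's default conversion, though.

    # Hypothetical example: emit a fully quoted CSV for "quoted fields as text" imports.
    import csv

    rows = [
        {"gene": "MARCH1", "log2_fold_change": "2.3"},
        {"gene": "SEPT2",  "log2_fold_change": "-0.7"},
    ]

    with open("results.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["gene", "log2_fold_change"],
                                quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(rows)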


For the most part we aren't talking about computational biologists but experimentalists using Excel. People at the bench need to collect their data somehow, and using Excel for tabular data and Word for text data is just what they know. Typically they then pass these files over to computational biologists for analysis. Yes, it would be nice if they would use more appropriate tools, but I know from experience that the typical result of trying to teach them better tools is the experimentalists just rolling their eyes and saying that they don't have time to learn some nerdy program because they have experiments to run.



Excel is a wonderful tool and a type of programming that is very accessible to many people. I use it all the time.


Considering how Perl was chosen as the computational biologists' lingua franca in the 1990s and 2000s because it was good at text manipulation (since genes are represented as text), I would say they don't have a history of making good choices.


Why?




