A few things we might want to exclude from training:
- articles with mistakes such as incorrect product names, facts, dates, or references
- fraudulent and non-reproducible research findings (see John Ioannidis, among others)
- outdated and incorrect scientific concepts like phlogiston and Lamarckian evolution
- junk content such as 4chan comment threads
- flat-earther "science" and other such nonsense
- debatable stuff: do we want material that attributes human behavior to astrological signs? And when, if ever, should a response refer to it?
- prank stuff, like script kiddies repeatedly prompting "2+2=5" until an AI system "remembers" it
- intentional poisoning of a training set with disinformation
- suicidal and homicidal suggestions and ideation
- etc.
Even if we go with the notion that AGI is coming, there is no reason its training should include the worst in us.