This is mostly another fix of bad design that was added in Reddit's early years,...

notatoad · on Dec 7, 2016

Complaints like this are the reason companies are reluctant to release public datasets. They don't have any obligation to release data, but they do. It's a gift. If they have to consider "how will this change affect the consumers of our public data releases?" every time they make a change, they're going to stop releasing public datasets.

minimaxir · on Dec 7, 2016

For clarity, the Reddit datasets are not released by Reddit itself, but scraped through the API. (More context/examples of what I do with the data: http://minimaxir.com/2015/10/reddit-bigquery/ )

notatoad · on Dec 7, 2016

Okay, but Reddit does also release some public data sets: https://github.com/reddit/public-data-sets

minimaxir · on Dec 7, 2016

Ah, right, forgot about those. (Although those are traffic aggregates and wouldn't be affected by changes in the score ranking.)

vosper · on Dec 7, 2016

Does this mean you (or I, if I want to do some analytics on Reddit data) will need to completely rescrape the site after scores are recomputed?

minimaxir · on Dec 7, 2016

If you wanted to compare raw scores for submissions from before the change to those from after it, yes.

Otherwise, it shouldn't matter.

erikpukinskis · on Dec 7, 2016

Using an API is not scraping. Scraping is taking a loosely structured document with no agreed-upon interface and extracting data from it.

A relevant comparison would be Craigslist, which will use legal force to prevent you from using data you scrape off the site.