Hacker News

This is mostly another fix for a bad design choice made in Reddit's early years; people have started to notice the aggressive vote fuzzing because it was calibrated for a much smaller userbase.

At the least, this will break statistical analysis of Reddit data for a while, since the public datasets will not have their scores updated, which I in particular am not happy about. :p



Complaints like this are the reason companies are reluctant to release public datasets. They don't have any obligation to release data, but they do. It's a gift. If they have to consider "how will this change affect the consumers of our public data releases?" every time they make a change, they're going to stop releasing public datasets.


For clarity, the Reddit datasets are not released by Reddit itself, but scraped through the API. (More context/examples of what I do with the data: http://minimaxir.com/2015/10/reddit-bigquery/ )


Okay, but reddit does also release some public data sets: https://github.com/reddit/public-data-sets


Ah, right, forgot about those. (Although those are traffic aggregates and wouldn't be affected by changes in the score ranking.)


Does this mean you (or I if I want to do some analytics on Reddit data) will need to completely rescrape the site after scores are recomputed?


If you wanted to compare raw scores for submissions from before the change to those from after the change, yes.

Otherwise, it shouldn't matter.


Using an API is not scraping. Scraping is taking a loosely structured document with no agreed-upon interface and extracting data from it.

A relevant comparison would be Craigslist, which will use legal force to prevent you from using data you scrape off the site.
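
To make the distinction concrete, here is a rough sketch in Python. Both payloads and every name in them (the `score` field, the `score` CSS class, the "42 points" wording) are made up for illustration and are not real Reddit output. The first function reads a field the interface promises; the second has to guess at markup that can change without notice.

```python
import json
from html.parser import HTMLParser
from typing import Optional

# Hypothetical payloads for illustration only; neither is real Reddit output.
API_RESPONSE = '{"title": "Example post", "score": 42}'
HTML_PAGE = (
    '<div class="post"><span class="score">42 points</span>'
    '<a class="title">Example post</a></div>'
)


def score_from_api(payload: str) -> int:
    """Read the score from a structured response with an agreed-upon field."""
    return int(json.loads(payload)["score"])


class ScoreScraper(HTMLParser):
    """Pull the score out of markup by relying on an assumed class name."""

    def __init__(self) -> None:
        super().__init__()
        self._in_score = False
        self.score: Optional[int] = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "score") in attrs:
            self._in_score = True

    def handle_data(self, data):
        if self._in_score:
            # Assumes the text starts with the number, e.g. "42 points".
            self.score = int(data.split()[0])
            self._in_score = False


def score_from_html(page: str) -> Optional[int]:
    """Scrape the score from a page; returns None if the markup is not found."""
    scraper = ScoreScraper()
    scraper.feed(page)
    return scraper.score


if __name__ == "__main__":
    print(score_from_api(API_RESPONSE))
    print(score_from_html(HTML_PAGE))
```

The API version only breaks if the documented field changes. The scraping version breaks whenever the site renames a class or rewords the label.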



