
I wonder if that could be crawled to save money.


Accuracy. Every solution I've seen that relies on automatic crawling will eventually hit a parsing error when someone changes the sentence structure of a press release.

It's not so obvious when you're looking at the breaking releases for a few stocks or companies, but historical records have at least 1 error per stock per year.


So split your stream:

    1. Data matching expectations (you do have a definition of correct, right?)

    2. Log for manual review -> manual inserts or correction and placed into queue for (1)

Monitor (2). When inserts start trending up, it may be time to update your processing logic.
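
Roughly what I mean, as a minimal Python sketch. Everything named here is made up for illustration: `SplitStream`, `looks_like_dividend`, and the record fields are assumptions, and the validity rule stands in for whatever your own definition of correct is.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SplitStream:
    # is_valid is the caller's definition of "correct" (hypothetical hook).
    is_valid: Callable[[dict], bool]
    accepted: list = field(default_factory=list)      # stream (1)
    review_queue: list = field(default_factory=list)  # stream (2)
    parsed: int = 0    # records seen from the parser
    flagged: int = 0   # parsed records that failed the check

    def ingest(self, record: dict) -> bool:
        """Route a freshly parsed record to (1) or (2)."""
        self.parsed += 1
        if self.is_valid(record):
            self.accepted.append(record)
            return True
        self.flagged += 1
        self.review_queue.append(record)
        return False

    def resubmit(self, corrected: dict) -> bool:
        """Send a manually corrected record back through the check for (1)."""
        if self.is_valid(corrected):
            self.accepted.append(corrected)
            return True
        self.review_queue.append(corrected)
        return False

    def flag_rate(self) -> float:
        """Share of parsed records that needed manual review."""
        return self.flagged / self.parsed if self.parsed else 0.0


def looks_like_dividend(record: dict) -> bool:
    # Hypothetical rule: a ticker plus a positive numeric amount.
    amount = record.get("amount")
    return (
        bool(record.get("ticker"))
        and isinstance(amount, (int, float))
        and amount > 0
    )


stream = SplitStream(is_valid=looks_like_dividend)
stream.ingest({"ticker": "ABC", "amount": 0.25})
stream.ingest({"ticker": "XYZ", "amount": None})  # parser missed the amount
```

Track `flag_rate()` over time (per day, per source, whatever fits). A sustained climb is the signal that the parsing logic has drifted from what the releases actually look like.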


I came up with a similar idea for a company several years ago where we had a team of people doing data entry from faxed documents. I wanted to build something that would do all the OCR it could and then display it to users to verify, which should have been a 10 times efficiency increase, not to mention speed and accuracy.

The idea was rejected; they wanted either a perfect solution or nothing. I don't know why, but for some reason the idea of computers replacing humans was acceptable to management, while computers augmenting humans wasn't.


right, it depends on what the reward for high precision vs high recall is


Would writing and maintaining the crawler be less expensive than paying the small army of third world employees to manually check pages all day?


The short answer is that it would depend on the value of any precision loss that occurred and whether or not the shift would disrupt any other systems, be they social or technical, within the company. But with that in mind, there's also the issue of whether or not the money saved would be worth it. It's very possible that S&P simply has enough money to play with that any savings from swapping to an automated solution for media monitoring aren't worth the potential disruption to the company workflow.


or just set Google Alerts.


financial companies usually use the worst formats to extract info from.


Woah there cowboy



