
OK, I think the bookmarklet/text scraper needs a little tweaking. I just tried Pocket's bookmarklet vs. Instapaper's on the "How Microsoft Fought True Open Standards" article that is also on the Hacker News front page. Pocket misidentified the headline, grabbing the name of the blog instead. In fact, the title of the article/blog post is not anywhere in Pocket's scrape.

Link to the article: http://blogs.computerworlduk.com/open-enterprise/2012/04/how...

Edit: It also didn't do well with Bruce Schneier's Crypto-Gram Newsletter at http://www.schneier.com/crypto-gram-1204.html

In both of these instances they simply needed to get the content of the <title> tag in the HTML. I would think (careful, here be dragons!) that more often than not the <title> tag should be a reliable way of getting the title of the story.
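For anyone curious what "just get the <title> tag" looks like in practice, here is a minimal sketch using only Python's standard-library html.parser. The names TitleParser and extract_title are made up for illustration, and this is not how Pocket or Instapaper actually do it.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; ignore any later ones.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace; return None if nothing was found.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html):
    """Return the page's <title> text, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Calling extract_title(page_source) returns the whitespace-normalized title string, or None when the page has no <title> at all. It does nothing about titles that contain only the site name, so a real extractor would still need a fallback.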



The good thing is that Instapaper's extraction process can be modified by anyone who has an account:

http://www.instapaper.com/bodytext (Login required)

So if something is not working properly, you can improve Instapaper on a site-by-site basis.


I had no idea you could do that! That's cooler than the other side of the pillow!


I just tried adding that Microsoft article via the Pocket Chrome extension (https://chrome.google.com/webstore/detail/niloccemoadcdkdjli...) and had no problems with it identifying the content correctly.


I haven't tried the Chrome (or Firefox) extension(s), just the JavaScript bookmarklet that Pocket provides here: http://getpocket.com/welcome?b=Bookmarklet

I'll have to give the extensions a try when I get a chance.


Some web devs (like me! :D) are lazy and make the <title> tag always just the website's name.


As someone who writes an HTML content extractor, you have no idea how much I hate you.


Oh I hate myself too. I can't distinguish the tabs.


The first H1 tag might be better (and would also work in this case).


The first H1 tag in the computerworlduk article says "Blogs", the second has the title.
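To illustrate why "take the first H1" can miss, here is a rough sketch that collects every H1 on a page instead of stopping at the first one. It uses Python's standard-library html.parser; H1Collector and h1_candidates are hypothetical names, the sample markup is invented (it is not the real computerworlduk page source), and picking the longest heading is only a guess at a tie-breaker, not a known-good rule.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Gathers the text of every <h1> element, in document order."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self._buf = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            text = " ".join("".join(self._buf).split())
            if text:  # skip empty headings
                self.headings.append(text)

    def handle_data(self, data):
        # Text inside nested tags (e.g. a link in the heading) is kept too.
        if self._in_h1:
            self._buf.append(data)


def h1_candidates(html):
    """Return the text of all <h1> elements found in the HTML string."""
    collector = H1Collector()
    collector.feed(html)
    collector.close()
    return collector.headings


# Invented markup mimicking the situation described above.
sample = (
    '<h1><a href="/blogs">Blogs</a></h1>'
    "<h1>How Microsoft Fought True Open Standards</h1>"
)
candidates = h1_candidates(sample)
first = candidates[0]              # the section name, not the headline
longest = max(candidates, key=len)  # assumed tie-breaker: longest heading
```

With this sample, the first candidate is the section heading and the longest one is the article headline. A site with a long section name and a short headline would defeat the longest-heading guess just as easily.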



