
Isn't it just automatically extracting sentences that can be broken into 5, 7, 5 groups of syllables? I don't think anyone has made any attempt to summarise the article.

Very interesting though.
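For anyone curious, the extraction asked about here (keep a sentence only if its words break cleanly into lines of 5, 7, and 5 syllables) might look roughly like the sketch below. This is an illustration, not the NYT code: the function name `split_into_haiku` and the tiny hand-written `SYLLABLES` table are made up for the example, and a real system would look counts up in a pronunciation dictionary.

```python
# Hypothetical syllable table for the demo only; not taken from any real dictionary.
SYLLABLES = {
    "an": 1, "old": 1, "silent": 2, "pond": 1, "a": 1, "frog": 1,
    "jumps": 1, "into": 2, "the": 1, "splash": 1, "silence": 2, "again": 2,
}


def split_into_haiku(words, syllables, pattern=(5, 7, 5)):
    """Return the lines if `words` split exactly into `pattern`, else None."""
    lines = []
    i = 0
    for target in pattern:
        line, total = [], 0
        # Greedily take words until this line reaches its syllable target.
        while i < len(words) and total < target:
            n = syllables.get(words[i].lower())
            if n is None:
                return None  # unknown word: cannot score the sentence
            line.append(words[i])
            total += n
            i += 1
        if total != target:
            return None  # a word straddles the line break, or words ran out
        lines.append(" ".join(line))
    # Reject sentences that still have words left after the last line.
    return lines if i == len(words) else None


sentence = "An old silent pond a frog jumps into the pond splash silence again"
print(split_into_haiku(sentence.split(), SYLLABLES))
```

Anything that does not land exactly on each line boundary is rejected, which fits the point that no summarising is involved: the sentence is either already shaped like a haiku or it is skipped.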



The process is described in the About section (http://haiku.nytimes.com/about). An algorithm extracts the haikus, but one only makes it to the website when a journalist thinks it is good:

"The algorithm discards some potential poems if they are awkwardly constructed and it does not scan articles covering sensitive topics. Furthermore, the machine has no aesthetic sense. It can't distinguish between an elegant verse and a plodding one. But, when it does stumble across something beautiful or funny or just a gem of a haiku, human journalists select it and post it on this blog."


Ah, very good! The other interesting point is that they use a dictionary with syllable counts, which they have augmented with words like "Rihanna".

Part of me wishes this page had been submitted instead of the top-level one.


Yes, to clarify: I started with the base CMUdict for syllable counts, but I had the program keep track of any terms it failed to find, so that I could augment its vocabulary. That also helped me find some tokenization bugs and try out some rules for dealing with compound words like "unsportsmanlike".
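A rough sketch of that lookup-and-log-misses approach, for illustration only and not the actual code. It relies on the CMUdict convention that every vowel phoneme ends in a stress digit (0, 1, or 2), so counting those gives the syllable count. The names `load_syllable_counts` and `SyllableCounter` are made up, and `SAMPLE_ENTRIES` is a handful of lines written in the CMUdict style rather than the real file.

```python
from collections import Counter

# A few entries in the CMUdict plain-text style (illustrative sample, not the real file).
SAMPLE_ENTRIES = """\
HELLO  HH AH0 L OW1
WORLD  W ER1 L D
POEM  P OW1 AH0 M
"""


def load_syllable_counts(text):
    """Map each word to its syllable count by counting stress-marked phonemes."""
    counts = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):  # skip blanks and comments
            continue
        word, *phones = line.split()
        # Vowel phonemes end in a stress digit; consonants do not.
        counts.setdefault(word.lower(), sum(p[-1].isdigit() for p in phones))
    return counts


class SyllableCounter:
    """Dictionary lookup that remembers every word it could not find."""

    def __init__(self, counts):
        self.counts = counts
        self.misses = Counter()

    def count(self, word):
        key = word.lower()
        n = self.counts.get(key)
        if n is None:
            self.misses[key] += 1  # record the miss for later review
        return n

    def add(self, word, n):
        """Augment the vocabulary with a hand-checked count."""
        self.counts[word.lower()] = n


counter = SyllableCounter(load_syllable_counts(SAMPLE_ENTRIES))
for token in "Hello world Rihanna Rihanna".split():
    counter.count(token)
print(counter.misses.most_common())  # most frequent misses first
counter.add("Rihanna", 3)
```

Sorting the misses by frequency puts the words most worth adding by hand at the top of the list.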


One approximate hack that works pretty well is to count the number of blocks of vowels separated by consonants. It breaks on some words, but was close enough to use for something I was working on. (Datamining rhymes from lyrics.)
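A minimal sketch of that vowel-block heuristic, assuming "y" is treated as a vowel (that choice is an assumption here, as is the function name `estimate_syllables`):

```python
import re

# A "block" is a maximal run of vowel letters; counting "y" as a vowel is an assumption.
VOWEL_RUN = re.compile(r"[aeiouy]+")


def estimate_syllables(word):
    """Approximate the syllable count as the number of vowel blocks in the word."""
    return len(VOWEL_RUN.findall(word.lower()))


for w in ["haiku", "beautiful", "journalist", "machine", "unsportsmanlike"]:
    print(w, estimate_syllables(w))
```

It gets "haiku" (2), "beautiful" (3), and "journalist" (3) right, but a silent final "e" inflates the count: "machine" comes out as 3 and "unsportsmanlike" as 5. That is the kind of breakage mentioned above.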



