Hacker News

The illustration has 2 dimensions, x and y.

If we have x, y, z, i, j, and n then we'd have six dimensions.

If each word represents a dimension, any blog title containing that word gets a 1 on that word's dimension, or a 0 if it does not.
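A quick sketch of that encoding (the titles and variable names here are made up for illustration, not from the post): build a vocabulary of distinct words, then give each title a 0/1 vector over it.

```python
# Hypothetical blog titles; each distinct word becomes one dimension.
titles = [
    "intro to k-means",
    "k-means clustering explained",
    "intro to rust",
]

# Vocabulary: one dimension per distinct word, in a fixed order.
vocab = sorted({word for title in titles for word in title.split()})

# Each title becomes a binary vector: 1 if the title contains the word, else 0.
vectors = [
    [1 if word in title.split() else 0 for word in vocab]
    for title in titles
]
```

Every vector has `len(vocab)` entries, all 0 or 1, which is exactly what makes the distance calculation below so constrained.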

You can then calculate the distance between these points across all of the dimensions, although clearly it's going to be a little limited with binary inputs, since each squared difference is always 1 or 0. For 4 dimensions, sqrt( ( 0 - 1 )^2 + ( 0 - 0 )^2 + ( 0 - 0 )^2 + ( 1 - 0 )^2 ) always simplifies to something like sqrt( 1 + 0 + 0 + 1 ), which is a little boring: the Euclidean distance is just the square root of the number of dimensions where the two titles differ.
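That collapse into counting differing dimensions can be sketched directly (a minimal illustration, using the 4-dimensional example above):

```python
import math

def euclidean(a, b):
    # On 0/1 inputs each squared difference is 0 or 1, so this is just
    # the square root of the count of dimensions where a and b differ.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# The 4-dimensional example from the comment:
d = euclidean([0, 0, 0, 1], [1, 0, 0, 0])  # sqrt(1 + 0 + 0 + 1) = sqrt(2)
```

So for binary vectors, Euclidean distance carries no more information than the Hamming distance between them.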

The binary values cause your data points to all sit stacked on the corners of a hypercube, which leads me to believe that binary inputs are a less than ideal application for k-means. Just look at the 2d case, where every data point is one of [0,0], [0,1], [1,0], or [1,1]. Not very hard to determine the clustering there... you're basically just evaluating an overly complex boolean expression.
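The 2d degeneracy is easy to see in code (a toy illustration with made-up points, not data from the post): however many points you have, they occupy at most 2^d distinct positions.

```python
from collections import Counter

# Hypothetical 2d binary data points: many points, few possible locations.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]

# Group by exact position; with binary inputs there are at most 2**d
# distinct corners, so "clustering" collapses to grouping duplicates.
corners = Counter(points)
num_distinct = len(corners)  # at most 4 positions in 2 dimensions
```

With k=4 in this setting, k-means can do nothing smarter than put one centroid on each occupied corner.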


