Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I bet it's way less than you think - orders of magnitude.

What could happen versus what does happen are entirely different.

Doing some algebraic permutations computation here would be like claiming a 50,000 letter English document has 27^50,000 possibilities.

I mean no, there's words, and they only go in certain orders and there's all these rules.

Here's another approach: humans are pretty lousy when remembering large amounts of anything so let's say there's in practice, only been 100,000 unique games played over and over. Without the help of a computer or careful tabulation, I'm pretty sure no human would realize it because no human can remember 100,000 unique games.

Anyways, it's worth digging into the data to see what the variation really is. I bet the 90th percentile is embarrassingly small with a long tail that's far shorter then most think

edit: so I actually took 7.7 million games from https://www.ficsgames.org/download.html and did some basic processing on them. These are people ostensibly with ELOs over 2000 which is pretty decent just to see if I'll eat crow on this one.

Going in, I was expecting a uniqueness level to be something like 50-70%. Actual percentage of unique games over 7.7 million? 98.7%.

Alright fine.

Although I could try to do 1 billion games, I expected the distribution to be readily visible around 7 million.

Now as an artifact of the data, I made the games as compact as possible, potentially leading to ambiguity maybe. So a game might look like so

    a3a5b3b6Bb2Bb7c4e6Nc3Nf6d4Be7Nf3O-Oe3c5d5exd5Nxd5d6Nxe7+Qxe7Bd3d5O-Odxc4Bxc4Nc6Qe2Rad8Rad1Ne4h3Nd6Ba6Bxa6Qxa6Qb7Qd3f6Qc3Rfe8Rd3Ne4Qc4+Kh8Rxd8Nxd8Rd1Qe7Qd5Nf7b4axb4axb4cxb4Qc6h6Rd7Qf8Qxb6Rd8Rxd8Qxd8Qxb4Qe8Bd4Kg8Qc4Nd6Qd5Kh8Nh4Kh7Qb3Qe4f3Qe6Qd3+f5e4g6exf5Nxf5Nxf5Qxf5Qxf5gxf5Bf2Kg6g4fxg4hxg4h5Kg2hxg4fxg4Ne5Kg3Nf7Kh4Ne5Be3Nxg41/2-1/2
Given this we can just run uniq with incrementing numbers and find out how things increase. I'm doing this on a pretty old laptop (3rd gen intel) so excuse me for cutting things a bit short

number of characters / unique entries / percentage duplicates

         99 7470034 0.039416039621657628
         98 7462496 0.040385363441781119
         97 7454437 0.041421683508957363
         96 7446241 0.042475620631500677
         95 7437517 0.043597454142611958
         94 7428583 0.044746291899176449
         93 7419379 0.045929849399894973
         92 7409325 0.047222709798876217
         91 7399016 0.048548361067336399
         90 7388091 0.049953224789125783
         89 7376939 0.051387278814333581
         88 7364995 0.052923177422393386
         87 7352785 0.054493281408027117
         86 7340118 0.056122151775432672
         85 7326671 0.057851323625950024
         84 7312421 0.059683754567414482
         83 7297272 0.061631789397747494
         82 7281670 0.063638076243272224
         80 7247260 0.068062914748240111
         79 7228770 0.07044057426456829
         78 7209233 0.072952869233227302
         77 7187947 0.075690070989017588
         76 7166719 0.07841981442939705
         75 7144539 0.081271977115830896
         74 7119854 0.084446261873027284
         73 7094951 0.087648579608837096
         72 7068178 0.091091363720824936
         71 7039566 0.094770627867995505
         70 7009607 0.098623104961001351
         69 6978335 0.10264442288391196
         68 6944221 0.10703119826195528
         67 6909408 0.11150785919986417
         66 6871309 0.1164070722832925
         65 6831848 0.12148142718723132
         64 6790088 0.12685141428305979
         63 6744878 0.13266504255419009
         62 6698251 0.13866088518630681
         61 6648086 0.14511168505848671
         60 6594883 0.15195314634822232
         59 6540926 0.15889156573829932
         58 6481469 0.16653723917595897
         57 6419070 0.17456122923325301
         56 6355249 0.18276807660975847
         55 6282195 0.19216221064468775
         54 6209896 0.20145925798763076
         53 6133584 0.21127234360201919
         52 6048154 0.22225792783565479
         51 5963420 0.23315401228435984
         50 5870519 0.24510030469790289
         49 5770633 0.25794480975187595
         48 5668750 0.27104611232094422
         47 5558836 0.28518013439112821
         46 5443035 0.30007117547551587
         45 5323607 0.31542861845637304
         44 5192507 0.33228698311784588
         43 5064129 0.34879532132158775
         42 4925937 0.36656565792950735
         41 4777302 0.38567887708631909
         40 4626169 0.40511331817237839
         39 4465105 0.42582480288508218
         38 4301802 0.44682420429097458
         37 4126391 0.46938059333470927
         36 3948981 0.49219403707682896
         35 3770310 0.5151696348833128
         34 3581328 0.5394711411415466
         33 3387960 0.5643366503548165
         32 3202702 0.58815928132701434
         31 3008299 0.61315788289287476
         30 2810171 0.63863548833641626
         29 2617450 0.66341779875536144
         28 2413377 0.68965988152851743
         26 2044415 0.73710531205655982
         25 1849362 0.76218749819167997
         24 1674682 0.78464988674290859
         23 1499682 0.80715342462054207
         22 1324053 0.82973784664289008
         21 1167329 0.84989124361622848
         20 1005747 0.87066933880105002
         19 862832 0.88904701374837569
         18 733254 0.90570966192613567
         17 599187 0.9229495579983682
         16 489782 0.93701812692123954
         15 395141 0.94918816879710877
         14 300280 0.9613865008348812
         13 233128 0.97002168698093183
         12 167915 0.97840753392729818
         11 115118 0.98519678700915769
         10 78474 0.98990889924908909
         9 46588 0.9940091724420389
         8 26581 0.99658190548385495
         7 14316 0.99815908200996462
         6 5390 0.99930689103336889
         5 2659 0.99965807481590496
         4 408 0.9999475346088339
         3 180 0.99997685350389731
         2 20 0.99999742816709969
         1 9 0.9999988426751949
It would be interesting to see later, when I crunch larger numbers on a more capable machine, if these distributions generally hold. Of course it won't, it's not possible. But I'm wondering if it's greater than what the Shannon limit would predict.

An ancillary analysis would be to compare it to the possible legal permutations at a character count although this would of course require a board and rule model.

I would expect those percentages to decrease as the length increases and perhaps such a function can give more predictive heft to the actual "language" of chess in practice



It's also worth noting that unique string != board state.

Proof: Both black and white could move the left rook pawn as their first move and right rook pawn as the second move.

Now reset the board and do right rook first and left rook second. Same board state, different game string.

In practice unique board states is a strict subset of number of moves but given how far off I was on my first assessment ... I wonder if we're talking about another < 2% hit.

All of this is dependent on an actual engine that can process the notation. There's apparently lots of options for pgn.

I'd also like to develop a heatmap based on statistical analysis. I'd imagine this would not only be way less than equally distributed but there'd be no way to slice the data to make it appear equally distributed




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: