I bet it's way less than you think - orders of magnitude.
What could happen versus what does happen are entirely different.
Doing some algebraic permutation count here would be like claiming a 50,000-letter English document has 27^50,000 possibilities.
I mean no, there's words, and they only go in certain orders and there's all these rules.
Here's another approach: humans are pretty lousy at remembering large amounts of anything, so let's say that in practice there have only been 100,000 unique games played over and over. Without the help of a computer or careful tabulation, I'm pretty sure no human would realize it, because no human can remember 100,000 unique games.
Anyways, it's worth digging into the data to see what the variation really is. I bet the 90th percentile is embarrassingly small, with a long tail that's far shorter than most think.
edit: so I actually took 7.7 million games from https://www.ficsgames.org/download.html and did some basic processing on them, just to see if I'll eat crow on this one. These are people ostensibly with Elo ratings over 2000, which is pretty decent.
Going in, I was expecting the uniqueness level to be something like 50-70%. Actual percentage of unique games over 7.7 million? 98.7%.
Alright fine.
Although I could try to do 1 billion games, I expected the distribution to be readily visible around 7 million.
Now, as an artifact of the data, I made the games as compact as possible, which potentially leads to some ambiguity. So a game might look like so
Given this we can just run uniq on the first N characters of each game, incrementing N, and see how the number of unique entries grows. I'm doing this on a pretty old laptop (3rd gen Intel), so excuse me for cutting things a bit short.
number of characters / unique entries / percentage duplicates
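For anyone who wants to reproduce that kind of count, here is a minimal Python sketch of the idea: truncate every game to its first N characters and count how many distinct strings remain. This is not the exact pipeline I ran; the function name `prefix_uniqueness` and the sample games are made up for illustration, and games shorter than N are simply kept whole.

```python
from typing import Iterable, List, Tuple


def prefix_uniqueness(
    games: Iterable[str], lengths: Iterable[int]
) -> List[Tuple[int, int, float]]:
    """For each prefix length, return (length, unique entries, percent duplicates)."""
    games = list(games)
    rows = []
    for n in lengths:
        # Truncate each compact game string to its first n characters.
        prefixes = [g[:n] for g in games]
        unique = len(set(prefixes))
        # Duplicates are entries beyond the first occurrence of each prefix.
        dup_pct = 100.0 * (len(prefixes) - unique) / len(prefixes)
        rows.append((n, unique, dup_pct))
    return rows


# Hypothetical compact game strings, not taken from the FICS data.
SAMPLE_GAMES = [
    "e4e5Nf3Nc6",
    "e4e5Nf3Nf6",
    "e4c5Nf3d6",
    "d4d5c4e6",
    "d4d5c4e6",
]

if __name__ == "__main__":
    for n, unique, dup_pct in prefix_uniqueness(SAMPLE_GAMES, [2, 4, 10]):
        print(f"{n} / {unique} / {dup_pct:.1f}%")
```

On a real dump you would feed it one game string per line from the processed file instead of the sample list.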
It would be interesting to see later, when I crunch larger numbers on a more capable machine, if these distributions generally hold. Of course they won't, it's not possible. But I'm wondering if it's greater than what the Shannon limit would predict.
An ancillary analysis would be to compare it to the possible legal permutations at each character count, although this would of course require a board and rule model.
I would expect those percentages to decrease as the length increases, and perhaps such a function can give more predictive heft to the actual "language" of chess in practice.
It's also worth noting that unique string != board state.
Proof: Both black and white could move the left rook pawn as their first move and the right rook pawn as their second move.
Now reset the board and do the right rook pawn first and the left rook pawn second. Same board state, different game string.
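That transposition can be checked with a toy model. The sketch below is only an illustration under loud assumptions: it tracks just the four rook pawns, reads "left" as the a-file and "right" as the h-file for both sides, writes moves as from-square plus to-square, and does no legality checking at all. A real comparison would need an actual chess engine or rules library.

```python
from typing import Dict, Iterable, Tuple

Board = Dict[str, str]  # square name -> piece label


def start_pawns() -> Board:
    """Only the four rook pawns, which is all this example needs."""
    return {"a2": "P", "h2": "P", "a7": "p", "h7": "p"}


def play(moves: Iterable[str]) -> Tuple[Tuple[str, str], ...]:
    """Apply from/to moves such as 'a2a3' and return a comparable board state."""
    board = start_pawns()
    for move in moves:
        src, dst = move[:2], move[2:]
        board[dst] = board.pop(src)  # no legality checks in this toy model
    return tuple(sorted(board.items()))


# a-file pawns first, then h-file pawns (White and Black alternate).
game_a = ["a2a3", "a7a6", "h2h3", "h7h6"]
# h-file pawns first, then a-file pawns.
game_b = ["h2h3", "h7h6", "a2a3", "a7a6"]

if __name__ == "__main__":
    print("same game string:", "".join(game_a) == "".join(game_b))
    print("same board state:", play(game_a) == play(game_b))
```

The two move lists differ as strings but end on the same squares, which is exactly why counting unique strings overstates the number of unique positions.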
In practice the number of unique board states is strictly smaller than the number of unique move strings, but given how far off I was on my first assessment ... I wonder if we're talking about another < 2% hit.
All of this is dependent on an actual engine that can process the notation. There are apparently lots of options for PGN.
I'd also like to develop a heatmap based on statistical analysis. I'd imagine the result would not only be far from equally distributed, but there'd be no way to slice the data to make it appear equally distributed.