Calling it an algorithm may be giving it too much credit.
The packing program reads in a file, sorts the passwords into clusters based on their length, maintaining a popularity count as it goes. Then it prunes any passwords that occur less than some number of times in the input file ("4" by default, so it doesn't spend a lot of time churning on unpopular passwords). Next, starting with the longest passwords, it works its way down the list, merging the popularity count of shorter passwords into the most popular longer passwords which contain the entire shorter password. e.g., "password" gets merged into "password1234".
Finally it starts writing the output. Starting with the most popular password, it searches up to N passwords ahead in the list (40 by default), looking for any passwords that begin with the same characters that the current password ends with. So it finds "123456", and writes "password123456" to the output file.
I think the only thing I can do to optimize it further is to re-cluster the passwords by their starting character during the output stage, to improve its chances of always finding two passwords that can be glued together.
It's a crude brute-force approach but it works okay.
I just looked at the code, it's not that bad, I really will release it with the next update.
What I'd do is sort the popular passwords by length, start with the longest. Then as you're adding new passwords to the end, check if it's already in there. Then you'll go from phrases to words with a lot of them being duplicates.
At least if you wanted to keep it as a long string. I'd still probably approach this as a bloom filter as others suggested since it'd let you have a larger number of entries for similar memory footprint.
The packing program reads in a file, sorts the passwords into clusters based on their length, maintaining a popularity count as it goes. Then it prunes any passwords that occur less than some number of times in the input file ("4" by default, so it doesn't spend a lot of time churning on unpopular passwords). Next, starting with the longest passwords, it works its way down the list, merging the popularity count of shorter passwords into the most popular longer passwords which contain the entire shorter password. e.g., "password" gets merged into "password1234".
Finally it starts writing the output. Starting with the most popular password, it searches up to N passwords ahead in the list (40 by default), looking for any passwords that begin with the same characters that the current password ends with. So it finds "123456", and writes "password123456" to the output file.
I think the only thing I can do to optimize it further is to re-cluster the passwords by their starting character during the output stage, to improve its chances of always finding two passwords that can be glued together.
It's a crude brute-force approach but it works okay.
I just looked at the code, it's not that bad, I really will release it with the next update.