I have just realized what is now an obvious fact:

To represent a set S of size N that is not a power of 2, there is no equal mapping from S to a set S' of size 2^b, where equal means that every member of S corresponds to the same number of members of S'.

This is because any number that is not a power of 2 has an odd prime in its factorization, and since that prime can never divide a power of 2, N can never divide 2^b evenly.
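As a quick sanity check of that divisibility argument, here is a minimal sketch in Python, using N = 26 purely as an example:

```python
# 26 = 2 * 13, and 13 is odd, so 2**b is never a multiple of 26:
# the 2**b codewords can never be split evenly across 26 symbols.
for b in range(1, 21):
    per_symbol, leftover = divmod(2 ** b, 26)
    print(f"b={b:2d}: {per_symbol:6d} codewords per symbol, {leftover:2d} left over")
```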

I was thinking about this because isn't it a shame that we have to track a set's size rather than just its bit count (which, now that I say it, carry exactly the same information; the bit count just gives you the nice log2 factor of savings).

I think this plays a big role in how we store information on computers, because it means we have to come up with encodings that shove multiple sets together in an attempt to save those would-be wasted bits. In an alternate universe, if we had computers that could use compounds of base-p digits (for prime p), we could store the 26 English lowercase letters with no waste. And then another 26 for uppercase. I think that is a better tactic than shoving those two together, adding punctuation and lots of other things, and calling it ASCII.
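To make that alternate universe concrete, here is a minimal sketch (Python standing in for the imaginary prime-digit hardware): since 26 = 2 × 13, one binary digit plus one base-13 digit addresses every lowercase letter with zero wasted codes. The symbol ordering is just an assumption for illustration.

```python
import string

LOWER = string.ascii_lowercase  # assumed ordering of the 26-letter set

def encode(letter: str) -> tuple[int, int]:
    """Split a letter's index into (binary digit, base-13 digit)."""
    return divmod(LOWER.index(letter), 13)  # index = bit * 13 + digit

def decode(bit: int, digit: int) -> str:
    return LOWER[bit * 13 + digit]

# Round-trips exactly, with no unused codes in either digit.
assert all(decode(*encode(c)) == c for c in LOWER)
print(encode("a"), encode("n"), encode("z"))  # (0, 0) (1, 0) (1, 12)
```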

The funny thing about decomposing these things into smaller sets is that you then need an extra log2 (or multi-prime-base log, if you're in my alternate universe) number of bits to record which set you're talking about, so you would end up losing against ASCII, because each change of set would require specifying which set you are now using.
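Here is a rough tally of that trade-off, under an assumed scheme where each change of set costs a few extra bits to announce; whether any savings survive depends entirely on how often the message hops between sets.

```python
from math import ceil, log2

SETS = {"lower": 26, "upper": 26, "other": 33}  # assumed set sizes

def classify(ch: str) -> str:
    if ch.islower():
        return "lower"
    if ch.isupper():
        return "upper"
    return "other"

def cost_with_switching(text: str) -> int:
    set_id_bits = ceil(log2(len(SETS)))  # bits spent naming a set
    total, current = 0, None
    for ch in text:
        s = classify(ch)
        if s != current:                 # announce the set we switched into
            total += set_id_bits
            current = s
        total += ceil(log2(SETS[s]))     # bits for the member itself
    return total

for msg in ["all lowercase, rarely switching", "aBcDeFgHiJ"]:
    print(f"{msg!r}: {cost_with_switching(msg)} bits vs {7 * len(msg)} bits of ASCII")
```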

Perhaps, then, a method to overcome this would be a standard header that specifies that the following bytes contain members of the combination of sets S1, S2, ..., SN. We would all agree on a mapping from words of length ceil(log2(|S1| + |S2| + ... + |SN|)) back to the respective sets, so that each word in the code can be decoded into a set and a member of that set. The problem then is that combining a very large set with a very small set gives an inefficient encoding for the small set's members. To resolve that, you could introduce a way to announce a new header mid-message. And for messages that densely use one set, then another, then the first, then the second, and so on, you would probably want to do this with a dictionary of headers and two messages: 1) define a new header as header i, 2) use header i.
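Here is a minimal sketch of that idea, with invented names and the header itself reduced to a Python dict listing the sets (serializing the header and the dictionary-of-headers machinery is left out):

```python
from math import ceil, log2
import string

def make_codec(sets: dict[str, str]):
    """Agree on a word width and a word -> (set, member) table for the header's sets."""
    flat = [(name, m) for name, members in sets.items() for m in members]
    word_bits = ceil(log2(len(flat)))           # ceil(log2(|S1| + ... + |SN|))
    to_word = {pair: i for i, pair in enumerate(flat)}
    return word_bits, to_word, flat

def encode(message, sets) -> str:
    word_bits, to_word, _ = make_codec(sets)
    return "".join(format(to_word[pair], f"0{word_bits}b") for pair in message)

def decode(bits: str, sets):
    word_bits, _, flat = make_codec(sets)
    return [flat[int(bits[i:i + word_bits], 2)]
            for i in range(0, len(bits), word_bits)]

header_sets = {"lower": string.ascii_lowercase, "upper": string.ascii_uppercase}
message = [("upper", "H"), ("lower", "i")]
assert decode(encode(message, header_sets), header_sets) == message
```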

This is of course closely related to compression algorithms which do this after examining the bytes at hand; I am thinking about pushing this further up the chain so that we always encode messages in a way that reflects the sets we are dealing with.

Of course there's never a right answer about what granularity of set to use, because you might define vowels, consonants, and then the English alphabet as vowels + consonants. There are a lot of ways to decompose a set, and they all depend on your task and perspective.

Any way you slice it, I think we should be pushing more of our information into a standardized format built around sets, and be able to communicate which set we're talking about, how sets relate to each other (subset, superset, disjoint, etc.), and mappings between sets (uppercase maps from English-alphabet to English-alphabet-upper).
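As a rough sketch of what such a format might carry (every name here is invented for illustration): the sets themselves, a way to ask how two sets relate, and named mappings between them.

```python
from dataclasses import dataclass, field
import string

@dataclass
class SetRegistry:
    members: dict[str, frozenset] = field(default_factory=dict)   # set name -> members
    mappings: dict[str, dict] = field(default_factory=dict)       # mapping name -> member map

    def relation(self, a: str, b: str) -> str:
        A, B = self.members[a], self.members[b]
        if A == B:
            return "equal"
        if A < B:
            return "subset"
        if A > B:
            return "superset"
        return "disjoint" if A.isdisjoint(B) else "overlapping"

reg = SetRegistry()
reg.members["english-lower"] = frozenset(string.ascii_lowercase)
reg.members["english-upper"] = frozenset(string.ascii_uppercase)
reg.members["english-alphabet"] = reg.members["english-lower"] | reg.members["english-upper"]
reg.mappings["uppercase"] = {c: c.upper() for c in string.ascii_lowercase}

print(reg.relation("english-lower", "english-alphabet"))  # subset
print(reg.relation("english-lower", "english-upper"))     # disjoint
```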