If we work from Unicode, we could in theory design a variable-length encoding in base 5 rather than simply dropping UTF-8 into the block encoding. That could make text storage more efficient while still covering the entire Unicode character set.
Any ideas on how to spec this out?
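To give the discussion something concrete, here is a rough sketch of one possible layout. Everything in it is an assumption rather than a settled spec: base-5 digits 0-3 carry the code point in base 4, and digit 4 is reserved as an end-of-character marker, which makes the stream self-delimiting. The names `TERMINATOR`, `encode`, and `decode` are hypothetical, and how the digits get packed into the block encoding is left out entirely.

```python
# Sketch only: a terminator-delimited base-5 digit stream.
# Assumption: digits 0-3 are payload (code point in base 4), digit 4 ends a character.
TERMINATOR = 4
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def encode(text: str) -> list[int]:
    """Turn a string into a flat list of base-5 digits."""
    digits: list[int] = []
    for ch in text:
        cp = ord(ch)
        group = []
        # Collect base-4 digits, least significant first; emit at least one digit.
        while True:
            group.append(cp % 4)
            cp //= 4
            if cp == 0:
                break
        digits.extend(reversed(group))  # most significant digit first
        digits.append(TERMINATOR)
    return digits


def decode(digits: list[int]) -> str:
    """Inverse of encode; rejects malformed or truncated input."""
    chars: list[str] = []
    cp = 0
    seen_payload = False
    for d in digits:
        if d == TERMINATOR:
            if not seen_payload:
                raise ValueError("terminator without payload digits")
            chars.append(chr(cp))
            cp = 0
            seen_payload = False
        elif 0 <= d <= 3:
            cp = cp * 4 + d
            if cp > MAX_CODE_POINT:
                raise ValueError("value exceeds the Unicode range")
            seen_payload = True
        else:
            raise ValueError(f"not a base-5 digit: {d}")
    if seen_payload:
        raise ValueError("input ends in the middle of a character")
    return "".join(chars)
```

A quick check of the cost: "A" (code point 65) becomes the five digits 1 0 0 1 4. At log2(5) ≈ 2.32 bits per digit that is roughly 11.6 bits, more than the 8 bits UTF-8 needs, so this particular layout does not beat UTF-8 for ASCII. A real spec would need a smarter digit allocation, for example short codes for the most frequent characters.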