Contents:
Switching on strings (SOS) means that switch statements can test constant strings, not only constant integers. This is quite a useful feature lacking in standard C. SOS can be implemented in several ways (in decreasing order of preference):
The sos utility implements the fourth method.
The simplest string to integer mapping is certainly leaving the bit content unchanged and treating it as an integer instead of a string. This can be done by type casting e.g. "*(int *)character_buffer" or by building the encoding integer part by part using bit/arithmetic operations.
There is a subtle point, the machine architecture will determine what integer we will actually get. Strings are universally stored byte by byte in increasing address order but integers are practically stored in the two directions. The most significant byte can be stored either in the highest or lowest address. This is called respectively little endian order (e.g. Vax, Alpha, PC) and big endian order (e.g. Cray, IBM, Mac, SGI, Sun).
If our string is less than 4 bytes long and gets padded, the machine "endianess" also determines if we get left or right padding.
This architecture issue isn't important in practice because the same machine will compile/run the program. The switch statement condition will tackled with the same endianity as the case label.
One of sos's encoding functions is a variant of the trivial mapping. Another one is a slight modification which maps 10-character strings assembled from the C name character set.
If it's shorter than 8 chars pad it on the left with zero bytes.
The two steps above define a mapping from all C strings to long long ints.
Call the mapping function from the switch control expression with some string pointer you choose.
Separate the duplicates with some code: either a switch controlled by a different segment of the string or an "if-else" chain.
Of course duplicates can be found by inspection, without a compiler.
Note that even if we need one "strcmp" check per case the code would still be more efficient than an if-else chain using "strcmp" conditionals. Assuming that switch statements are implemented by linear search (the worst case), it should be more efficient to have N/2 "int" comparisons and one "strcmp" call than N/2 "strcmp" calls and N/2 "zero int" comparisons.
You don't have to remember the algorithm, sos does automatically most of the above steps.
sos used with the "-f" option maps strings to integers using a variant of the trivial mapping. The full 8-bit char range (except '\0') is supported. In this mode only the first eight chars in the string are significant.
By default sos maps the characters {a-z, '_', 0-9, A-Z} to integers in the range [1-63]. The mapping is one to one and covers the entire range. To map an identifier into an integer we treat it as if it was a number in base 64, the characters serving as digits according to the said character map.
Note there is no "zero" digit and so we don't have to worry about strings with leading "zeros" (i.e. a "01" string being numerically equal to "1", "001" strings). The identifier to integer mapping is one to one but doesn't cover all numbers (e.g. 64).
Each character requires six bits of storage and since a "long long int" is guaranteed to have 64 bits we can squeeze 10 characters into a case label. Characters beyond the 10-char limit are simply ignored. By the way, since we have a "spare bit" (actually 4 of them) the encoding integer will always be a positive.
Hoffman encoding based on the relative frequency of letters is very efficient. Using this method we could encode longer strings into our integers and improve the solution.
An expected difficulty with Hoffman encoding is that the result's length wouldn't be exactly 4/8 bytes. It would be more complicated to do the strcmp checks.
A nice solution would be a special purpose preprocessor. To facilitate locating the string case labels they could be enclosed in angle brackets. The preprocessor would convert the string case labels to integers, replace the switch controlling expression with a mapping function call using the old expression as argument.
With compilers implementing switch statements especially bad the preprocessor could:
All this could be done nicely with gcc extensions like taking the address of a label and using a goto with it (and maybe nested functions).
Such a pre-processor was not written yet. C compilers accept only integer constants in case labels and only integer expressions are allowed to control the switch. The only practical method left is to find some way to map strings to integers and to apply it manually.
gperf generates perfect hash functions together with other code that makes a keyword recognition engine. gperf code may sometimes replace a switch on strings but its hash functions are not very useful for SOS.
What is a hash function? Hash functions are used to map large sets to a smaller ones. Obviously such maps can't be one to one and the assigned value can't be a good representative of the original entry. A hash function tries to alleviate this problem by mixing and destroying the original order as much as possible.
Why mixing should help? Practical input is assumed to contain regularities which unbroken would lead to "exccesive" mapping of different entries to the same value. Since hashes don't preserve the ordering of the source set to the target set they are supposed to minimize this problem. Hash functions can be thought of as increasing the "one to one" characteristic for practical input, in a statistical sense.
We have said that a hash function maps a large set to a smaller one. In practice we may want to map only relatively few members of the large set, on the order of the small set size. In the case that these members are known beforehand we can do better than just mixing, we can tailor a perfect hash function for them.
What is a perfect hash function? A hash function that assigns different values to different strings, i.e. is one to one, on some subset (usually the set used to compute it). Outside this set the function may assign the same value to as many strings it likes.
Armed with these insights on hashing we can try to compare it with our simple minded approach. How does the natural mapping compares with gperf perfect hash functions?
The "natural" mapping require such checks only for labels longer than the size of the integer used.
The one "natural" mapping serves all cases.
We don't claim the natural mapping is "better" than gperf's perfect hashes, it just better supports switching on strings. No wonder considering that gperf was never designed to support switching on strings. gperf's keyword recognition code may well be more efficient than our switch statements, especially if our compiler generates bad code.
Our approach have a disadvantage not shared by gperf. For long strings our magic function will produce many duplicate values. In such cases we will have to group the duplicate labels and separate them manually with a nested switch controlled by another segment of the string. Hopefully this occurs rarely.
Another possible advantage of gperf is that assigned integers are small, sometimes in the range [0,N-1] where N is the size of the strings set (by the way, the last condition defines a minimal hash function). This is important if we want to implement the switch by a lookup table. Does it have a practical value? Maybe.