Inverting an Order-Preserving Minimal Perfect Hash Function in Better than O(K*lg N) Running Time
07:15 20 Oct 2021

I am trying to find a more efficient solution to a combinatorics problem than the solution I have already found.

Suppose I have a set of N objects (indexed 0..N-1) and wish to consider each subset of size K (0<=K<=N). There are S=C(N,K) (i.e., "N choose K") such subsets. I wish to map (or "encode") each such subset to a unique integer in the range 0..S-1.

Using N=7 (i.e., indexes are 0..6) and K=4 (S=35) as an example, the following mapping is the goal:
0 1 2 3 --> 0
0 1 2 4 --> 1
...
2 4 5 6 --> 33
3 4 5 6 --> 34

N and K were chosen small for the purposes of illustration. However, in my actual application, C(N,K) is far too large to obtain these mappings from a lookup table. They must be computed on-the-fly.

In the code that follows, combinations_table is a pre-computed two-dimensional array for fast lookup of C(N,K) values.

All code given is compliant with the C++14 standard.

If the objects in a subset are ordered by increasing order of their indexes, the following code will compute that subset's encoding:

template <typename T>
typename T::value_type combination_encoder_t::encode(const T &indexes)
{
   auto offset{combinations_table[N1][K1] - combinations_table[N1 - indexes[0]][K1]};

   for (typename T::value_type index{1}; index < K1; ++index)
   {
      auto offset_due_to_current_index{
           combinations_table[N1 - (indexes[index-1] + 1)][K1 - index] -
           combinations_table[N1 - indexes[index]][K1 - index]
                                      };

      offset += offset_due_to_current_index;
   }

   return offset;
}

Here, template parameter T will be either a std::array<> or std::vector<> holding a collection of indexes we wish to find the encoding for.

This is essentially an "order-preserving minimal perfect hash function", as described here:
https://en.wikipedia.org/wiki/Perfect_hash_function

In my application, the objects in a subset are already naturally ordered at the time of encoding, so I do not incur the added running time of a sort operation. Therefore, my total running time for encoding is that of the algorithm presented above, which has O(K) running time (i.e., linear in K and not dependent on N).

The code above works fine. The interesting part is trying to invert this function (i.e., to "decode" an encoded value back into the object indexes that produced it).

For decoding, I could not come up with a linear-time solution.

Instead of directly calculating the indexes that correspond to an encoded value (which would be O(K)), I ended up implementing a binary search of the index space to find them. This results in a running time no worse than O(K*lg N), which is what I'll call it. The code to do this is as follows:

template <typename T>
void combination_encoder_t::decode(const typename T::value_type encoded_value, T &indexes)
{
   typename T::value_type offset{0};
   typename T::value_type previous_index_selection{0};

   for (typename T::value_type index{0}; index < K1; ++index)
   {
      auto lowest_possible{index > 0 ? previous_index_selection + 1 : 0};
      auto highest_possible{N1 - K1 + index};

      // Find the *highest* ith index value whose offset increase gives a
      // total offset less than or equal to the value we're decoding.
      while (true)
      {
         auto candidate{(highest_possible + lowest_possible) / 2};

         auto offset_increase_due_to_candidate{
                   index > 0 ?
                      combinations_table[N1 - (indexes[index-1] + 1)][K1 - index] -
                      combinations_table[N1 - candidate][K1 - index]
                             :
                      combinations_table[N1][K1] -
                      combinations_table[N1 - candidate][K1]
                                              };

         if ((offset + offset_increase_due_to_candidate) > encoded_value)
         {
            // candidate is *not* the solution
            highest_possible = candidate - 1;
            continue;
         }

         // candidate *could* be the solution. Check if it is by checking if candidate + 1
         // could be the solution. That would rule out candidate being the solution.
         auto next_candidate{candidate + 1};

         auto offset_increase_due_to_next_candidate{
                   index > 0 ?
                      combinations_table[N1 - (indexes[index-1] + 1)][K1 - index] -
                      combinations_table[N1 - next_candidate][K1 - index]
                             :
                      combinations_table[N1][K1] -
                      combinations_table[N1 - next_candidate][K1]
                                                   };

         if ((offset + offset_increase_due_to_next_candidate) <= encoded_value)
         {
            // candidate is *not* the solution
            lowest_possible = next_candidate;
            continue;
         }

         // candidate *is* the solution
         offset += offset_increase_due_to_candidate;
         indexes[index] = candidate;
         previous_index_selection = candidate;
         break;
      }
   }
}

Can this be improved on? I'm looking for two categories of improvements:

  1. Algorithmic improvements that yield better than the O(K*lg N) running time of the code given; ideally, direct calculation would be possible, giving the same O(K) running time the encoding process has
  2. Code improvements that execute the given algorithm faster (i.e., that lower any constant factor hidden within the O(K*lg N) running time)
Tags: algorithm, performance, combinations