On Packing Data

[updated 16 April 2010]

The number of bits needed to store an integer, a, a > 0, is:

nbits_a

Using this formula, 13 bits are needed for the value 7333. In binary, 7333 is 1110010100101, which is indeed 13 bits.

The number of bits needed to store two values, a and b, independently would be:

nbits_a_b

Using the identity:

identity

The number of bits needed to store the product of a and b is:

nbits_product

Therefore, we can save one bit by multiplying a and b when

fraction

As a trivial example, the values 5 (101) and 6 (110) each require three bits so storing them independently would take 6 bits. However, 6*5 is 30, which takes 5 bits.

Just for curiosity, this graph shows the fractional part of

log2x

using the equation

frac_log2_x

It's an interesting "sawtooth" pattern, with zero values at 2n. The larger version of the graph also shows lines of constant slope between the endpoints of each tooth (in blue), and the point where the tooth equals 0.5 (√2*2n - in red).

Click on the graph for a larger version.

sawtooth

Let's consider a practical application. The start and end of Daylight Savings Time can be specified as "the nth day of the week in month at hour and minute". For example, in 2010, DST in the US begins on the second Sunday in March at 2AM. Let:

d = day of week, Sunday..Saturday = 0..6
r = day rank, first..fifth = 0..4
m = month, January..December = 0..11
h = hour, 0..59
μ = minute, 0..23

Naively, this can be stored in 3+3+4+6+5 = 21 bits.

When multiplying 6 * 5 to save one bit, one of the numbers is needed in order to determine the other. Since we don't a priori know what values were multiplied together, we can't unpack the result. So while the possibility of saving a bit by multiplication made for an interesting analysis, it doesn't help here. There's no point in storing the product of n numbers if you have to know n-1 numbers to recover the nth number.

If we're going to conserve space, it will have to be through application of the first equation. Given two integers, a and b, a requires fewer bits than b when:

minimize_numbers

So we want to minimize the maximum value of the number being stored.

For the packing algorithm, construct a set, ω, of an upper limit for each item.

Then the largest number that can be encoded is:
max_encoding
To pack the values vi which correspond to the upper limit values ωi, evaluate:

pack

To unpack rn into values vi which correspond to the upper limit values ωi, evaluate:

unpack

Consider the decimal number 12,345. For base 10 encoding, ωi would be (10, 10, 10, 10, 10), and the largest number that could be encoded would be 99,999. But suppose we knew that the largest possible value of the first digit was one. Then let ωi be (2, 10, 10, 10, 10). The largest value that could be encoded would be 19,999. Since 19,999 < 99,999, it requires fewer bits.

For the daylight savings time values, construct a set, ω, of the maximum value of each item + 1. For (d, r, m, h, μ) this would be (7, 5, 12, 60, 24).

Packing the start times would requlre 20 bits:

nbits_dst

Packing both the start and end times would use 39 bits, for a savings of 3 bits.

nbits_begin_end

Thanks to Lee for his comments on this article.

blog comments powered by Disqus