Zipf

Last Revised September 22, 2001
Consider any large, randomly chosen sample of text (taken from some existing written material). We assume that the text consists wholly of words, each of which is taken from a finite dictionary of words. Let n be the total number of words in the text, and let t be the number of distinct words in the text. If the text includes any repeated words, then n > t.

Tiny Text Example:

Early to bed and early to rise makes a man healthy, wealthy, and wise.

n = 14 total words.
Three words (to, and, early) are repeated, each once.
t = 11 distinct words.

Make a list of all distinct words in the text, beginning with the most frequently occurring word and going down to the least frequent. Assign a rank index r to each distinct word: let r = 1 for the most frequently occurring word, r = 2 for the second-most frequent word, etc., down to r = t for the least frequently occurring word. (Words that occur equally often may be ranked in either order.) Now define the relative frequency of any word as fr = nr / n, where nr is the number of occurrences of the word in the text and r is its rank index (chosen according to the word's frequency as described above).

Tiny Text Example continued:

r = 1 for the word early. f1 = 2/14.
r = 2 for the word to. f2 = 2/14.
r = 3 for the word and. f3 = 2/14.
r = 4 for the word bed. f4 = 1/14.
etc.
r = 11 for the word wise. f11 = 1/14.
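
The ranking procedure above is mechanical enough to automate. Here is a minimal Python sketch (mine, not part of the original discussion) that reproduces the table; it assumes punctuation has been stripped and case folded, and it breaks ties among equally frequent words arbitrarily.

from collections import Counter

text = "early to bed and early to rise makes a man healthy wealthy and wise"
words = text.split()

n = len(words)                          # total words: n = 14
counts = Counter(words)
t = len(counts)                         # distinct words: t = 11

# Rank words from most to least frequent; ties are broken arbitrarily.
for r, (word, n_r) in enumerate(counts.most_common(), start=1):
    print(f"r={r:2d}  {word:8s}  f_r = {n_r}/{n}")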

Clearly, fj >= fi if j < i. In 1935, George K. Zipf articulated an empirical law, based on samples of text he examined (in English, Chinese, and Latin), which further detailed and quantified this inequality. It is an approximation which seems to hold for many large samples of "typical" text.

Zipf's First Law of Natural Language: fr = k / r

where k is a proportionality constant,
r is a word's rank index (r = 1 for the most frequently occurring word),
and fr is the relative frequency of that word in the text.
Zipf's Law predicts an inverse relationship between a word's relative frequency of occurrence and its rank index r. On log-log paper, the plot of fr (or nr) versus r will be a straight line with slope = -1. Our tiny text example is too small for us to expect Zipf's First Law to hold even approximately.
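
To see the slope claim concretely, the following Python sketch (an illustration of the ideal law, not real text data; t and k are hypothetical values) generates fr = k/r and fits a least-squares line to the log-log points.

import math

t, k = 1000, 0.1                        # hypothetical values for illustration
xs = [math.log(r) for r in range(1, t + 1)]
ys = [math.log(k / r) for r in range(1, t + 1)]

# Ordinary least-squares slope: cov(x, y) / var(x).
mx, my = sum(xs) / t, sum(ys) / t
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
print(slope)                            # -1.0, up to floating-point error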

Calculating the Proportionality Constant:
It is not unreasonable to assume that the least frequently occurring word #t occurs exactly once. Applying Zipf's First Law, and letting r = t, we have

ft = nt / n = 1 / n = k / t
from which we can conclude that k = t / n. Substituting this value for the constant k into Zipf's First Law, we get
Zipf's First Law Extended: fr = t / (n r)
This is an interesting result, because it provides a means of predicting the relative frequency of occurrence of a word in a text sample, based only on the word's index r and the constants t and n for the text sample. On the other hand, Zipf's Law tends to be least accurate for low nr values, and so k = t / n may not be the best estimate of the value of the proportionality constant.

Numerical Example Assuming k = t / n:

Suppose a text sample contains a total of n = 7485 words, of which t = 1000 are distinct words. Then the most frequent word in the sample occurs n1 = f1 n = (t / n) n = t = 1000 times. The second-most frequent word occurs 500 times, the third-most frequent word occurs 333 times, etc. (inverse proportionality).
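
A few lines of Python confirm the arithmetic (the rounding of t/r to whole occurrences is my choice of presentation):

n, t = 7485, 1000
for r in range(1, 4):
    n_r = (t / n) * n / r               # n_r = f_r * n = t / r
    print(r, round(n_r))                # prints 1000, 500, 333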

An even more interesting result occurs if you consider that n = sum( nr ) = k n sum( 1/r ) = t sum( 1/r ), where r is summed from 1 to t. The last sum is approximated, for large t, by 0.5772 + ln(t), where 0.5772 is (approximately) Euler's constant. Therefore there is a relationship between the total number of words n and the number of distinct words t. It is as if one were to conclude that:

There is only so much one can say about anything, using just t distinct words.
To continue our example, if we have 1000 distinct words, then the total text length can be approximated by
n = 1000 (0.5772 + ln(1000)) = 7485 words
which is the number used in our previous calculation.
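
The approximation can be checked directly; this sketch compares the exact harmonic sum with the 0.5772 + ln(t) approximation for t = 1000:

import math

t = 1000
exact = t * sum(1 / r for r in range(1, t + 1))   # n = t * sum(1/r)
approx = t * (0.5772 + math.log(t))
print(round(exact), round(approx))                # 7485 7485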

Modifications To Zipf's Law

Suggestions have been made to modify Zipf's equation somewhat to account for some experimentally observed anomalies. A generalized equation is
fr = C / (r + A)^B
where constants A, B, and C are chosen to tweak the equation. If A=0, B=1, and C=n1/n, then the ideal Zipf equation (given before) results. The value of B affects the slope of the plot on log-log paper. See the web reference at the end of this paper.
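
As a sketch, the generalized equation (this three-constant form is often called the Zipf-Mandelbrot law) can be written as a small Python function; the defaults below reproduce the ideal Zipf case.

def f(r, A=0.0, B=1.0, C=1.0):
    """Relative frequency of the rank-r word: C / (r + A)**B."""
    return C / (r + A) ** B

# With A = 0, B = 1, and C = n1/n, this reduces to the ideal Zipf
# equation f_r = k / r, using the earlier example's n1 = 1000, n = 7485.
n1, n = 1000, 7485
print(f(1, C=n1 / n), f(2, C=n1 / n))   # about 0.1336 and 0.0668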

Application To Computer Programs

A length of text can be viewed as a collection of words, arranged according to various rules of syntax and style, to form a coherent whole. Presumably a more detailed treatise on any subject will use more distinct words than a less detailed treatise, and so it is not unreasonable to suppose that the length of the treatise is a function of the number of distinct words used. This is in keeping with the previous numerical calculation.

A computer program can be viewed as a collection of functions, procedures, operators, and variables, arranged according to various rules of programming language syntax and style, to form a coherent and useful whole. Perhaps the length of a computer program can be estimated, knowing the data it must manipulate. This could be a valuable tool in estimating the size of a software project early in the specification and design process.

Halstead

Halstead proposed a "Software Physics" in which he represented a program as an arrangement of operands and operations on those operands. If
m1 = number of operator types, and
m2 = number of operand types, and
N = program length (total operators plus operands), then
N = m1 log2 m1 + m2 log2 m2

This equation form is reminiscent of the length formula coming from Zipf's Law. In fact, if m1 << m2 or if m1 >> m2, then Halstead's result is approximately

N = m log2 (m)
where m = max(m1, m2). If we replace Zipf's t by Halstead's m, and if ln(m) >> 0.5772, then Zipf's result is
N = m ln(m).
These two equations produce results which differ only by about 30%: ln(m) = ln(2) log2(m), and ln(2) is about 0.693, so the Zipf estimate falls about 30% below Halstead's.
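
A quick check in Python (my own, using the simplified forms above):

import math

for m in (100, 1000, 10000):
    halstead = m * math.log2(m)         # N = m log2(m)
    zipf = m * math.log(m)              # N = m ln(m), 0.5772 neglected
    print(m, round(halstead), round(zipf), round(1 - zipf / halstead, 3))
# The final column is 0.307 in every row: about a 30% difference.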

References:

Shooman, Software Engineering, McGraw-Hill (1983), Chapter 3.

Zipf's Law References on the World Wide Web: http://linkage.rockefeller.edu/wli/zipf/

Hamlet & Maybee, The Engineering of Software, Addison Wesley Longman (2001), pages 267-277.

Addenda - September 21, 2001

Shooman (p. 175, op. cit.) suggests how Zipf's Law can play an important role in Software Engineering when used for estimation of program complexity early in the design process. One method of initially estimating program length (number of tokens, excluding non-executable code, comments, declares, etc.) is to estimate the number of types, t. Assuming that the analyst initially has a complete description of the problem (requirements) and that a partial analysis and choice of key algorithms has been made (initial specifications and preliminary design), one would:
  1. Estimate the number of operator types in the programming language which will be used by the programmers.
  2. Estimate the number of distinct operands by counting the input variables, output variables, intermediate variables, and constants which will be needed.
  3. Sum the estimates of steps 1 and 2 to obtain the value of t and substitute into the equation Length = t (0.5772 + ln(t)), as in the sketch below.
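
A minimal Python sketch of steps 1 through 3 (the function name and sample figures are mine, for illustration only):

import math

def estimate_length(operator_types, operand_types):
    """Token-length estimate via Length = t * (0.5772 + ln t)."""
    t = operator_types + operand_types
    return t * (0.5772 + math.log(t))

# E.g., an analyst who foresees 15 operator types and 30 operand types:
print(round(estimate_length(15, 30)))   # t = 45, about 197 tokens
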
Shooman reports the results of this estimation experiment carried out on five different programs, written in BASIC, FORTRAN, and PL/1. The programs had estimated token lengths ranging from 90 to 275, and the estimates agreed with the actual program lengths to within -33% to +26%.

Example (Shooman page 173 - Fibonacci Program in PL/1)

FIBONACCI:
PROCEDURE;
N = 2;
GET LIST(PREV_F, LAST_F, LIMIT);
PUT LIST (PREV_F, LAST_F);
REPEAT:
TEMP = LAST_F;
LAST_F = LAST_F + PREV_F;
PREV_F = TEMP;
PUT LIST(LAST_F);
N = N + 1;
IF N < LIMIT THEN GOTO REPEAT;
END FIBONACCI;

Operators: total operator count 31.

Operands: total operand count 24.

Shooman combines the operators and operands (call each of them a token) and plots the relative frequency of each token versus its rank, 1 through 21, on log-log coordinates. He suggests a best fit of fr = 0.35/r in the high-rank region (large r). In the low-rank region (small r), where the plotted points fall somewhat below the 0.35/r line, he suggests fr = 0.35/(r+1).

The Fibonacci program as listed has 11 statement lines. Taking 21 as the value of t, the length formula gives t(0.5772 + ln(t)) = 76. It is not clear what the units of this length are. Do 7 length units equal 1 statement line? At best, we can expect the length formula to give us only a proportionality relationship. The exact value of the proportionality constant must be determined from experience with other similar programs, and probably depends on a number of factors such as the programming language used, coding style, the kind of application, platform, environment, etc.

Estimation Example

An existing program which uses 1000 distinct tokens is 1000 lines long. The program is to be enhanced, and it is determined that the new program version will use 1200 distinct tokens. Estimate the length of the new program.

Solution: Since 1000(0.5772 + ln(1000)) = 7485, and 1200(0.5772 + ln(1200)) = 9201, the new program length should be about 1000 x (9201/7485) = 1229 lines.
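
In Python, the same calculation reads (a sketch of the arithmetic above):

import math

def zipf_length(t):
    """Predicted total token count for t distinct tokens."""
    return t * (0.5772 + math.log(t))

old_lines, old_t, new_t = 1000, 1000, 1200
print(round(old_lines * zipf_length(new_t) / zipf_length(old_t)))   # 1229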

Critique

Zipf's estimates are only approximations, and the above example gives an estimate (1229 lines) which is very nearly the simple proportional estimate (1200 lines, since 1200 tokens / 1000 tokens x 1000 lines = 1200 lines). When does the Zipf estimate differ significantly from a simple proportionality estimate? For a 20% increase in the number of distinct tokens, the Zipf formula predicts an increase in program size by the factor
[1.2 t (0.5772 + ln(t) + ln(1.2))] / [t (0.5772 + ln(t))]
= [1.2 (0.5772 + ln(t) + ln(1.2))] / [0.5772 + ln(t)]
This is significantly different from proportionality only if ln(1.2) is significant compared with 0.5772 + ln(t).

For large programs (t >> 1) and modest fractional increases x in distinct token count (x << 1), the ratio

[(1+x) t (0.5772 + ln(t) + ln(1+x))] / [t (0.5772 + ln(t))]
= [(1+x) (0.5772 + ln(t) + ln(1+x))] / [0.5772 + ln(t)]
is likely to be nearly equal to (1+x). It may be that the Zipf model is better at initial estimation than at incremental (i.e., upgrading) estimation.
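
The following sketch tabulates the Zipf growth factor against simple proportionality for a few values of t and x; it shows the two agreeing closely for large t and small x, as argued above.

import math

def zipf_factor(t, x):
    """Zipf-predicted growth factor for a fractional increase x in t."""
    base = 0.5772 + math.log(t)
    return (1 + x) * (base + math.log(1 + x)) / base

for t in (100, 1000, 10000):
    for x in (0.2, 1.0):
        print(t, x, round(zipf_factor(t, x), 3), "vs proportional", 1 + x)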