Last Revised September 22, 2001
Consider any large, randomly chosen sample of text
(taken from some existing body of written material).
We assume that the text consists wholly of words,
each of which is taken from a finite dictionary of words.
Let n be the total number of words in the text,
and let t be the number of distinct words in the text.
If the text includes some repeated words, then n > t.
Tiny Text Example:
Early to bed and early to rise makes a man healthy, wealthy, and wise.
n = 14 total words.
Three words (to, and, early) each occur twice.
t = 11 distinct words.
Make a list of all distinct words in the text,
beginning with the most frequently occurring word in the text,
and going down to the least frequent word in the text.
Assign a rank index r to each distinct word in the text.
Let r=1 for the most frequently occurring word, r=2 for the
second-most frequent word, and so on, down to r=t for the least
frequently occurring word in the text.
Now define the relative frequency of any word as
fr = nr / n
where nr is the number of occurrences of the word
in the text, and r is its index
(chosen according to the word's frequency, as described above).
Tiny Text Example continued:
r = 1 for the word early. f1 = 2/14.
r = 2 for the word to. f2 = 2/14.
r = 3 for the word and. f3 = 2/14.
r = 4 for the word bed. f4 = 1/14.
r = 11 for the word wise. f11 = 1/14.
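The rank-and-frequency bookkeeping above can be sketched in a few lines of Python (the tiny text is repeated with punctuation removed; tie-breaking among equally frequent words is arbitrary, just as in the hand-worked list):

```python
from collections import Counter

text = "early to bed and early to rise makes a man healthy wealthy and wise"
words = text.split()

n = len(words)           # total number of words: 14
counts = Counter(words)  # occurrences of each distinct word
t = len(counts)          # number of distinct words: 11

# Rank the distinct words from most to least frequent (r = 1 is the
# most frequent word).
ranked = counts.most_common()
for r, (word, n_r) in enumerate(ranked, start=1):
    f_r = n_r / n        # relative frequency f_r = n_r / n
    print(r, word, round(f_r, 3))
```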
Clearly, fj >= fi if j < i.
In 1935, George K.
Zipf articulated an empirical law, based on samples of text he examined
(in English, Chinese, and Latin),
which further detailed and quantified this inequality.
It is an approximation which seems to hold
for many large samples of "typical" text.
Zipf's First Law of Natural Language:
fr = k / r
where k is a proportionality constant,
r is a word's rank index (r = 1 for the most frequently occurring word),
and fr is the relative frequency of that word in the text.
Zipf's Law predicts that the relative frequency of
occurrence varies inversely with the rank index r.
On log-log paper, the plot of fr or nr
versus r will be a straight line with slope = -1.
Our tiny text example is too small for us to expect that
Zipf's First Law can be even approximately satisfied in it.
Calculating the Proportionality Constant: If we suppose that the least
frequent word (rank r = t) occurs exactly once, then ft = nt / n = 1/n = k / t,
from which we can conclude that k = t / n.
Substituting this value for the constant k into Zipf's First Law,
Zipf's First Law Extended: fr = t / (n r)
This is an interesting result, because it provides a means of predicting the
relative frequency of occurrence of a word in a text sample, based only
on the word's index r and the constants t and n for
the text sample.
On the other hand, Zipf's Law tends to be least accurate for low
nr values, and so k = t / n
may not be the best estimate of the value of the proportionality constant.
It is not unreasonable to assume that the least frequently occurring
word #t occurs exactly once, i.e., that nt = 1.
Applying Zipf's First Law with r = t, we have ft = 1/n = k/t,
which again gives k = t/n.
Numerical Example Assuming k = t / n:
Suppose a text sample contains a total of n = 7485 words,
of which t = 1000 are
distinct words. Then the most frequent word in the sample occurs
n1 = f1 n = (t / n) n = t = 1000 times.
The second-most frequent word occurs 500 times, the third-most frequent
word occurs 333 times, etc. (inverse proportionality).
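A minimal sketch of this calculation: under the k = t / n assumption, a word of rank r is predicted to occur n_r = f_r n = (t / (n r)) n = t / r times.

```python
t = 1000   # distinct words
n = 7485   # total words
k = t / n  # proportionality constant, assuming the rarest word occurs once

# Zipf's First Law predicts n_r = (k / r) * n = t / r occurrences.
predicted = {r: round(t / r) for r in (1, 2, 3)}
print(predicted)  # {1: 1000, 2: 500, 3: 333}
```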
An even more interesting result occurs if you consider that
n = sum( nr ) = k n sum( 1/r ) = t sum( 1/r )
where r is summed from 1 to t.
The last sum is approximated, for large t, by 0.5772 + ln(t).
Therefore there is a relationship between the total number of words
n and the number of distinct words t.
It is as if one were to conclude that:
There is only so much one can say about anything,
using just t distinct words.
To continue our example, if we have 1000 distinct words, then the total
text length can be approximated by
n = 1000 (0.5772 + ln(1000)) = 7485 words
which is the number used in our previous calculation.
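The harmonic-sum approximation, and the length prediction that follows from it, can be checked numerically. This sketch compares the exact sum against 0.5772 + ln(t) (0.5772 being Euler's constant):

```python
import math

t = 1000

# Exact harmonic sum H_t = sum of 1/r for r = 1..t, versus the
# approximation 0.5772 + ln(t) used in the text.
harmonic = sum(1 / r for r in range(1, t + 1))
approx = 0.5772 + math.log(t)

n = t * approx  # predicted total text length
print(round(harmonic, 4), round(approx, 4), round(n))
```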
Modifications To Zipf's Law
Suggestions have been made to modify Zipf's equation somewhat
to account for some
experimentally observed anomalies. A generalized equation is
fr = C / (r + A)^B
where constants A, B, and C are chosen to tweak the equation.
If A=0, B=1, and C=n1/n, then the ideal Zipf equation
(given before) results. The value of B affects the slope of the plot
on log-log paper.
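A small illustration of the generalized equation (the function name general_zipf is invented for this sketch):

```python
def general_zipf(r, A=0.0, B=1.0, C=1.0):
    # Generalized Zipf equation: f_r = C / (r + A)^B
    return C / (r + A) ** B

# With A = 0, B = 1, and C = k, this reduces to the ideal law f_r = k / r.
k = 1000 / 7485
print(general_zipf(3, A=0.0, B=1.0, C=k), k / 3)
```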
See the web reference at the end of this paper.
Application To Computer Programs
A length of text can be viewed as a collection of words, arranged
according to various rules of syntax and style, to form a coherent whole.
Presumably a more detailed treatise on any subject will use more distinct
words than a less detailed treatise, and so it is not unreasonable to
suppose that the length of the treatise is a function of the number
of distinct words used. This is in keeping with the previous numerical
example.
A computer program can be viewed as a collection of functions, procedures,
operators, and variables, arranged according to various rules of
programming language syntax and style, to form a coherent and useful whole.
Perhaps the length of a computer program can be estimated,
knowing the data it must manipulate.
This could be a valuable tool in estimating the size of a software project
early in the specification and design process.
Halstead proposed a "Software Physics" in which he represented
a program as an arrangement of operands and operations on those operands. If
m1 = number of operator types,
m2 = number of operand types, and
N = program length (total operators plus operands), then
N = m1 log2(m1) + m2 log2(m2)
This equation form is reminiscent of the length formula coming from
Zipf's Law. In fact, if
m1 << m2 or if
m1 >> m2, then Halstead's result is approximately
N = m log2 (m)
where m = max(m1, m2).
If we replace Zipf's t by Halstead's m,
and if ln(m) >> 0.5772, then Zipf's result is
N = m ln(m).
These two equations differ only by a constant factor, since
ln(m) = 0.693 log2(m); the Zipf estimate is about 30% smaller than Halstead's.
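The comparison can be checked directly; the ratio of the two estimates is ln(2), independent of m:

```python
import math

m = 1000
halstead = m * math.log2(m)  # N = m log2(m)
zipf = m * math.log(m)       # N = m ln(m)

# The two estimates differ by the constant factor ln(2), about 0.693,
# regardless of the value of m.
ratio = zipf / halstead
print(round(ratio, 3))  # 0.693
```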
References
- Shooman, Software Engineering, McGraw-Hill (1983), Chapter 3.
- Hamlet & Maybee, The Engineering of Software,
Addison Wesley Longman (2001), pages 267-277.
- Zipf's Law references on the World Wide Web.
Addenda - September 21, 2001
Shooman (p. 175, op. cit.) suggests how Zipf's Law can play an important role
in Software Engineering when used for estimation of program complexity
early in the design process. One method of initially
estimating program length (number of tokens, excluding non-executable code,
comments, declares, etc.) is to estimate the number of types, t.
Assuming that the analyst initially has a complete description of the
problem (requirements) and that a partial analysis and choice of key
algorithms has been made (initial specifications and preliminary design),
Shooman reports results of this estimation experiment carried out on five
different programs, in BASIC, FORTRAN, and PL/1 languages.
The programs had estimated token lengths ranging from 90 to 275,
and agreement with actual program lengths from -33% to +26%.
The estimation procedure is:
1. Estimate the number of operator types in the programming language
which will be used by the programmers.
2. Estimate the number of distinct operands by counting input variables,
output variables, intermediate variables, and constants which will be needed.
3. Sum the estimates of steps 1 and 2 to obtain the value of t,
and substitute into the equation
Length = t (0.5772 + ln(t)).
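The three steps can be wrapped in a short Python helper (estimate_length is an invented name for this sketch):

```python
import math

def estimate_length(operator_types, operand_types):
    # Steps 1 and 2 supply the two counts; step 3 sums them and applies
    # the Zipf length formula t (0.5772 + ln t).
    t = operator_types + operand_types
    return t * (0.5772 + math.log(t))

# With 12 operator types and 9 operand types (t = 21), as in Shooman's
# Fibonacci example:
print(round(estimate_length(12, 9)))  # 76
```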
Example (Shooman page 173 - Fibonacci Program in PL/1)
FIBONACCI: PROCEDURE;
N = 2;
GET LIST (PREV_F, LAST_F, LIMIT);
PUT LIST (PREV_F, LAST_F);
REPEAT: TEMP = LAST_F;
LAST_F = LAST_F + PREV_F;
PREV_F = TEMP;
PUT LIST (LAST_F);
N = N + 1;
IF N < LIMIT THEN GOTO REPEAT;
END FIBONACCI;
Total Operator Count: 31.
- Symbol ;. Rank 1. Count 11.
- Symbol =. Rank 2. Count 5.
- Symbol ,. Rank 3. Count 3.
- Symbol :. Rank 4. Count 2.
- Symbol PUTLIST. Rank 5. Count 2.
- Symbol +. Rank 6. Count 2.
- Symbol PROCEDURE. Rank 7. Count 1.
- Symbol GETLIST. Rank 8. Count 1.
- Symbol IF THEN. Rank 9. Count 1.
- Symbol <. Rank 10. Count 1.
- Symbol GOTO. Rank 11. Count 1.
- Symbol END. Rank 12. Count 1.
Total Operand Count: 24.
- Symbol LAST_F. Rank 1. Count 6.
- Symbol N. Rank 2. Count 4.
- Symbol PREV_F. Rank 3. Count 4.
- Symbol FIBONACCI. Rank 4. Count 2.
- Symbol LIMIT. Rank 5. Count 2.
- Symbol REPEAT. Rank 6. Count 2.
- Symbol TEMP. Rank 7. Count 2.
- Symbol 2. Rank 8. Count 1.
- Symbol 1. Rank 9. Count 1.
Shooman combines the operators and operands (call each of them a
token) and plots the relative frequency of each token
versus its rank 1 through 21 on log-log coordinates.
He suggests a best fit of
fr = 0.35/r in the region of larger rank indexes r.
For the smallest values of r, where the plotted points fall somewhat below
the 0.35/r line, he suggests fr = 0.35/(r+1).
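The fit can be examined by tabulating the observed relative frequencies against both suggested curves; the counts below are transcribed from the operator and operand tables above:

```python
# Token counts from Shooman's operator and operand tables.
operator_counts = [11, 5, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1]
operand_counts = [6, 4, 4, 2, 2, 2, 2, 1, 1]

counts = sorted(operator_counts + operand_counts, reverse=True)
total = sum(counts)  # 55 tokens in all, ranks 1 through 21

for r, c in enumerate(counts, start=1):
    f_r = c / total         # observed relative frequency
    fit = 0.35 / r          # suggested best fit
    alt = 0.35 / (r + 1)    # alternative fit for the smallest ranks
    print(r, round(f_r, 3), round(fit, 3), round(alt, 3))
```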
The Fibonacci program as listed has 11 statement lines.
Taking 21 as the value of t, the length formula gives
t(0.5772 + ln(t)) = 76. It is not clear what the units of this length are.
Do 7 length units = 1 statement line?
At best, we can expect only that the length formula can give us
a proportionality relationship. The exact value of the
proportionality constant must be determined from experience with other
similar programs, and probably depends on a number of factors such as
the programming language used, its style, the kind of application,
the platform, and so on.
Example: An existing program which uses 1000 distinct tokens is 1000 lines long.
The program is to be enhanced, and it is determined that
the new program version will use 1200 distinct tokens.
Estimate the length of the new program.
Since 1000(0.5772 + ln(1000)) = 7485, and 1200(0.5772 + ln(1200)) = 9201,
the new program length should be about 1000 x (9201/7485) = 1229 lines.
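The arithmetic can be reproduced in a couple of lines (zipf_length is an invented helper name):

```python
import math

def zipf_length(t):
    # Zipf length estimate: t (0.5772 + ln t)
    return t * (0.5772 + math.log(t))

old_lines = 1000
new_lines = old_lines * zipf_length(1200) / zipf_length(1000)
print(round(zipf_length(1000)), round(zipf_length(1200)), round(new_lines))
# 7485 9201 1229
```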
Zipf's estimates are only approximations, and the above example gives an
estimate (1229 lines) which is very nearly a simple proportional estimate
(1200 lines, since 1200 tokens / 1000 tokens = 1.2, and
1.2 x 1000 lines = 1200 lines).
When does the Zipf estimate differ significantly from
a simple proportionality estimate?
For a 20% increase in number of distinct tokens, the Zipf formula
predicts an increase in program size by the factor
[1.2 t (0.5772 + ln(t) + ln(1.2))] / [t (0.5772 + ln(t))]
= [1.2 (0.5772 + ln(t) + ln(1.2))] / [0.5772 + ln(t)]
This is significantly different from proportionality only if ln(1.2)
is significant compared with 0.5772 + ln(t).
For large programs (t >> 1) and modest fractional increases x in the
distinct token count (x << 1), the ratio
[(1+x) t (0.5772 + ln(t) + ln(1+x))] / [t (0.5772 + ln(t))]
= [(1+x) (0.5772 + ln(t) + ln(1+x))] / [0.5772 + ln(t)]
is likely to be nearly equal to (1+x).
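This ratio is easy to evaluate numerically; for example, with t = 1000 and x = 0.2 (the scenario above), it is only slightly above simple proportionality:

```python
import math

def growth_ratio(t, x):
    # Zipf-predicted size ratio when the distinct token count grows
    # from t to (1 + x) t.
    return ((1 + x) * (0.5772 + math.log(t) + math.log(1 + x))) / (0.5772 + math.log(t))

print(round(growth_ratio(1000, 0.2), 3))  # 1.229, versus 1.2 for proportionality
```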
It may be that the Zipf model is better at initial estimation than at
incremental (i.e., upgrading) estimation.