n = 14 total words.

Three words (to, and, early) are repeated, each once.

t = 11 distinct words.

Make a list of all distinct words in the text,
beginning with the most frequently occurring word in the text,
and going down to the least frequent word in the text.
Assign a rank index r to each distinct word in the text.
Let r=1 for the most frequently occurring word, r=2 for the
2nd-most frequent words, etc., down to r=t for the least frequently
occurring word in the text.
Now define the *relative frequency* of any word as
** f _{r} = n_{r} / n **
where n

r = 2 for the word

r = 3 for the word

r = 4 for the word

etc.

r = 11 for the word

Clearly, f_{j} >= f_{i} if j < i.
In 1935, George K.
Zipf articulated an empirical law, based on samples of text he examined
(in English, Chinese, and Latin),
which further detailed and quantified this inequality.
It is an approximation which seems to hold
for many large samples of "typical" text.

where k is a proportionality constant,

r is a word's rank index (r = 1 for the most frequently occurring word),

and f

**Calculating the Proportionality Constant:**

It is not unreasonable to assume that the least frequently occurring
word **#t** occurs exactly once.
Applying Zipf's First Law, and letting r = t, we have

An even more interesting result occurs if you consider that
**n = sum( n _{r} ) = k n sum( 1/r ) = t sum( 1/r )**
where r is summed from 1 to t.
The last sum is approximated, for large t, by 0.5772 + ln(t).

A computer program can be viewed as a collection of functions, procedures,
operators, and variables, arranged according to various rules of
programming language syntax and style, to form a coherent and useful whole.
Perhaps the length of a computer program can be estimated,
knowing the data it must manipulate.
**This could be a valuable tool in estimating the size of a software project
early in the specification and design process.**

m

m

N = program length (total operators plus operands), then

This equation form is reminiscent of the length formula coming from
Zipf's Law. In fact, if
m_{1} << m_{2} or if
m_{1} >> m_{2}, then Halstead's result is

Shooman, *Software Engineering*, McGraw-Hill (1983), Chapter 3.

http://linkage.rockefeller.edu/wli/zipf/ Zipf's Law References on the World Wide Web.

Hamlet & Maybee, *The Engineering of Software*,
Addison Wesley Longman (2001), pages 267-277.

- Estimate the number of operator types in the programming language which will be used by the programmers.
- Estimate the number of distinct operands by counting input varaibles, output variables, intermediate variables, and constants which will be needed.
- Sum the estimates of steps 1 and 2 to obtain the value of
**t**and substitute into the equation Length = t (0.5772 + ln(t)) .

PROCEDURE;

N = 2;

GET LIST(PREV_F, LAST_F, LIMIT);

PUT LIST (PREV_F, LAST_F);

TEMP = LAST_F;

LAST_F = LAST_F + PREV_F;

PREV_F = TEMP;

PUTLIST(LAST_F);

N = N + 1;

IF N < LIMIT THEN GOTO REPEAT;

END FIBONACCI;

**Operators:**

- Symbol
**;**. Rank 1. Count 11. - Symbol
**=**. Rank 2. Count 5. - Symbol
**,**. Rank 3. Count 3. - Symbol
**:**. Rank 4. Count 2. - Symbol
**PUTLIST**. Rank 5. Count 2. - Symbol
**+**. Rank 6. Count 2. - Symbol
**PROCEDURE**. Rank 7. Count 1. - Symbol
**GETLIST**. Rank 8. Count 1. - Symbol
**IF THEN**. Rank 9. Count 1. - Symbol
**<**. Rank 10. Count 1. - Symbol
**GOTO**. Rank 11. Count 1. - Symbol
**END**. Rank 12. Count 1.

**Operands:**

- Symbol
**LAST_F**. Rank 1. Count 6. - Symbol
**N**. Rank 2. Count 4. - Symbol
**PREV_F**. Rank 3. Count 4. - Symbol
**FIBONACCI**. Rank 4. Count 2. - Symbol
**LIMIT**. Rank 5. Count 2. - Symbol
**REPEAT**. Rank 6. Count 2. - Symbol
**TEMP**. Rank 7. Count 2. - Symbol
**2**. Rank 8. Count 1. - Symbol
**1**. Rank 9. Count 1.

Shooman combines the operators and operands (call each of them a
** token**) and plots the relative frequency of each token
versus its rank 1 through 21 on log-log coordinates.
He suggests a best fit of
f

The Fibonacci program as listed has 11 statement lines.
Taking 21 as the value of **t**, the length formula gives
t(0.5772 + ln(t)) = 76. It is not clear what the units of this length are.
Do 7 length units = 1 statement line?
At best, we can expect only that the length formula can give us
a *proportionality* relationship. The exact value of the
proportionality constant must be determined from experience with other
similar programs, and probably depends on a number of factors such as
the programming language used, its style, the kind of application, platform,
environment, etc.

**Solution:**
Since 1000(0.5772 + ln(1000)) = 7485, and 1200(0.5772 + ln(1200)) = 9201,
the new program length should be about 1000 x (9201/7485) = 1229 lines.

= [1.2 (0.5772 + ln(t) + ln(1.2))] / [0.5772 + ln(t)]

For large programs (t >> 1) and modest increases in distinct token count (x << 1), the ratio

= [(1+x) (0.5772 + ln(t) + ln(1+x))] / [0.5772 + ln(t)]