This page is prepared by Wentian Li of the North Shore LIJ Research Institute.


Zipf's law, named after the Harvard linguistics professor George Kingsley Zipf
(1902-1950), is the observation that the frequency of occurrence of some event
(*P*), as a function of its rank (*i*), where the rank is determined by
that frequency of occurrence, is a power-law function P_{i} ~
1/i^{a} with the exponent *a* close to unity.

The most famous example of Zipf's law is the frequency of English words. A count of the top 50 words in 423 TIME magazine articles (245,412 word occurrences in total; the count is also available as a PDF file of the class note) has "the" as number one (15,861 occurrences), "of" as number two (7,239 occurrences), and "to" as number three (6,331 occurrences). When the number of occurrences is plotted as a function of the rank (1, 2, 3, etc.), the functional form is a power-law function with exponent close to 1.
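As a rough illustration of what "exponent close to 1" means, the sketch below fits a straight line to log(count) against log(rank) using only the three counts quoted above. The helper name `fit_exponent` is made up for this example, and three points are far too few for a serious estimate; the full top-50 table would be the natural input.

```python
import math

# Counts quoted above for the three most frequent words in the TIME sample.
top_counts = {"the": 15861, "of": 7239, "to": 6331}

def fit_exponent(counts):
    """Least-squares slope of log(count) vs log(rank); returns a in P_i ~ 1/i^a."""
    ranked = sorted(counts, reverse=True)          # rank 1 = largest count
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx                              # slope is -a, so flip the sign

a = fit_exponent(top_counts.values())
print(f"estimated exponent a = {a:.2f}")
```

For an exponent of exactly 1, rank times count would be constant. For these three words the products are 15,861, 14,478 and 18,993, so the fit comes out somewhat below 1.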

If you want to download English texts and analyze them yourself, get texts from Project Gutenberg (National Clearinghouse for Machine Readable Texts); one mirror site is at UIUC.
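For readers who do download a text, a minimal sketch of the counting step might look like the following. The function name `rank_frequency` and the very crude tokenization (lowercase letters and apostrophes only) are assumptions made for this example, not the procedure used for the TIME count.

```python
import re
from collections import Counter

def rank_frequency(text, top=50):
    """Return the `top` most frequent words as (word, count) pairs, rank 1 first."""
    # Crude tokenization (an assumption): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)

# Tiny stand-in text; for a real analysis, pass the contents of a downloaded
# plain-text file instead, e.g. open(path, encoding="utf-8").read().
sample = "the cat and the dog and the bird"
for rank, (word, count) in enumerate(rank_frequency(sample), start=1):
    print(rank, word, count)
```

The resulting (rank, count) pairs are exactly what gets plotted on log-log axes to look for a power law.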

The second example Zipf showed in his book was the population of cities (or of communities). The population of a city, plotted as a function of its rank (the most populous city is ranked number one, etc.), is a power-law function with exponent close to 1.

The income or revenue of a company as a function of its rank is also an example of Zipf's law (also in Zipf's book). This should also be called Pareto's law, because Pareto observed it at the end of the 19th century.

(new on sept-15-1999)

Does Zipf's law describe common events or rare events? Well, both! It depends on the quantity used in ordering the events. If an event is number 1 because it is the most frequent, Zipf's plot describes the common events (e.g. the use of English words). On the other hand, if an event is number 1 because it is unusual (biggest, highest, largest...), then it describes the rare events (e.g. city population).

Actually, in Miller's preface to Zipf's book, he distinguished Zipf's "first law" and "second law", one for rare events and the other for common events. We don't make such a distinction here (it's hard to remember which is the first law and which is the second law!).

(new on dec-02-2002)

I have yet to find a more complete list, so let me just start compiling papers that question whether a seemingly power-law function is really a power-law function:

- **J Aitchison, JAC Brown** (1954), "On criteria for descriptions of income distribution", Metroeconomica, 6:88-98.
- **Colin Martindale, Andrzej K Konopka** (1996), "Oligonucleotide frequencies in DNA follow a Yule distribution", Computers & Chemistry, 20(1):35-38. (Yule distribution?)
- **Richard Perline** (1996), "Zipf's law, the central limit theorem, and the random division of the unit interval", Physical Review E, 54(1):220-223. (Log-normal distribution?)
- **Jean Laherrere, D Sornette** (1998), "Stretched exponential distributions in Nature and Economy: 'Fat tails' with characteristic scales", European Physical Journal B, 2:525-539. (http://xxx.lanl.gov/abs/cond-mat/9801293) (Stretched exponential distribution?)
- **Ronald Rousseau** (1999), "A weak goodness-of-fit test for rank-frequency distributions", in *Proceedings of the Seventh Conference of the International Society for Scientometrics and Informetrics*, ed. C. Macias-Chapula, Universidad de Colima (Mexico), pages 421-430.
- **Carlos M Urzua** (2000), "A simple and efficient test for Zipf's Law", Economics Letters, 66:257-260. [PDF]
- **Bill Reed** (2001), "The double Pareto-lognormal distribution - A new parametric model for size distribution", preprint. [note: this paper is on size distribution, not on rank-frequency distribution.]
- **Allen Downey** (2001), "The structural cause of file size distributions", Technical Report (Wellesley College).
- **E Limpert, WA Stahl, M Abbt** (2001), "Lognormal distributions across the sciences: keys and clues", Bioscience, 51(5):341-352. [a general discussion on the lognormal distribution] [PDF]
- **Z Bi, C Faloutsos, F Korn** (2001), "The 'DGX' distribution for mining massive, skewed data", Conference on Knowledge Discovery and Data Mining (KDD) 2001. [PDF]
- **Michael Mitzenmacher** (2002), "A brief history of generative models for power law and lognormal distributions", preprint (EECS, Harvard Univ). [PDF]
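One simple way to see why such doubts arise is to check whether the slope of the rank-frequency curve on log-log axes stays constant. The sketch below is only an illustration with synthetic data, not a method taken from any of the papers listed; the function name `doubling_exponents` is made up. It estimates a local exponent between rank r and rank 2r: a true power law gives the same value at every scale, while a curve that merely looks straight over a short range gives values that drift.

```python
import math

def doubling_exponents(freqs):
    """Local exponent between rank r and rank 2r, for r = 1, 2, 4, ...

    Assumes all frequencies are strictly positive.
    """
    ranked = sorted(freqs, reverse=True)
    exponents = []
    r = 1
    while 2 * r <= len(ranked):
        f_r, f_2r = ranked[r - 1], ranked[2 * r - 1]
        # For f_i = C / i^a this ratio is 2^a, so the result is exactly a.
        exponents.append(math.log(f_r / f_2r) / math.log(2))
        r *= 2
    return exponents

# Synthetic examples (not real data): an exact power law and an exponential decay.
power_law = [1000.0 / i for i in range(1, 65)]
exponential = [1000.0 * math.exp(-i / 10.0) for i in range(1, 65)]

pl = doubling_exponents(power_law)
ex = doubling_exponents(exponential)
print("power law:  ", [round(a, 2) for a in pl])
print("exponential:", [round(a, 2) for a in ex])
```

With real data the local exponents are noisy, so a drift has to be judged against that noise; the statistical tests in the papers above address this properly.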

- **GK Zipf**, *Selective Studies and the Principle of Relative Frequency in Language* (?, 1932).
- **GK Zipf**, *Psycho-Biology of Languages* (Houghton-Mifflin, 1935; MIT Press, 1965). [Zipf actually thought about this 10 years earlier, i.e., around 1925.]
- **GK Zipf**, *Human Behavior and the Principle of Least Effort* (Addison-Wesley, 1949).
- **GK Zipf**, *National Unity and Disunity: The Nation As a Bio-Social Organism* (Principia Press, Bloomington Indiana, 1941).

- **V Pareto**, *Cours d'economie politique* (Droz, Geneva Switzerland, 1896) (Rouge, Lausanne et Paris, 1897).
- **JB Estoup**, *Gammes Stenographiques* (Institut Stenographique de France, Paris, 1916).
- **JC Willis**, *Age and area* (Cambridge Univ Press, 1922).
- **GU Yule**, "A mathematical theory of evolution based on the conclusions of Dr. J.C. Willis, F.R.S.", Philosophical Transactions of the Royal Society of London (Series B), 213:21-87 (1925).
- **GU Yule**, *Statistical Study of Literary Vocabulary* (Cambridge Univ Press, 1944).

- **BB Mandelbrot**, "Adaptation d'un message a la ligne de transmission. I & II", Comptes Rendus (Paris), 232, 1638-1640 & 2003-2005 (1951).
- **BB Mandelbrot**, in "Contribution a la Theorie Mathematique des Jeux de communication" (Institute of Statistics, Univ of Paris, page 124, 1953).
- **BB Mandelbrot**, "An informational theory of the statistical structure of languages", in Communication Theory, ed. W. Jackson (Butterworths, 1953), pp. 486-502.
- **BB Mandelbrot**, "Simple games of strategy occurring in communication through natural languages", symposium on statistical methods in communication engineering (Berkeley, Aug 17-18, 1953), appearing in Transactions of IRE (professional groups on information theory), 3, 124-137 (1954).
- **GA Miller**, "Communication", Annual Review of Psychology, 5, 401-420 (1954). [a summary of Mandelbrot's result.]
- **BB Mandelbrot**, "Information theory and psycholinguistics", in *Scientific Psychology: Principles and Approaches*, eds. B. Wolman, E. Nagel (Basic Books, 1965), pp. 550-562.
- **BB Mandelbrot**, "Les constantes chiffrees du discours", in *Encyclopedie de la Pleiade: Linguistique*, ed. J. Martinet (Gallimard, 1968), pp. 46-56.

- **HA Simon** (1955), "On a class of skew distribution functions", Biometrika, 42:425-440. [PDF]
- **BB Mandelbrot**, "A note on a class of skew distribution functions: analysis and critique of a paper by H.A. Simon", Information and Control, 2, 90-99 (1959).
  [ABSTRACT: *This note is a discussion of H.A. Simon's model (1955) concerning the class of frequency distributions generally associated with the name of G.K. Zipf. The main purpose is to show that Simon's model is analytically circular in the case of the linguistic laws of Estoup-Zipf and Willis-Yule. Insofar as the economic law of Pareto is concerned, Simon has himself noted that his model is a particular case of that of Champernowne; this is correct, with some reservation. A simplified version of Simon's model is included.*]
- **HA Simon**, "Some further notes on a class of skew distribution functions", Information and Control, 3, 80-88 (1960).
  [ABSTRACT: *This note takes issue with a recent criticism by Dr. B. Mandelbrot of a certain stochastic model to explain word-frequency data. Dr. Mandelbrot's principal empirical and mathematical objections to the model are shown to be unfounded. A central question is whether the basic parameter of the distributions is larger or smaller than unity. The empirical data show it is almost always very close to unity, sometimes slightly larger, sometimes smaller. Simple stochastic models can be constructed for either case, and give a special status, as a limiting case, to instances where the parameter is unity. More generally, the empirical data can be explained by two types of stochastic models as well as by models assuming efficient information coding. The three types of models are briefly characterized and compared.*]
- **BB Mandelbrot**, "Final note on a class of skew distribution functions: analysis and critique of a model due to H.A. Simon", Information and Control, 4, 198-216 (1961).
  [ABSTRACT: *We shall restate in detail our 1959 objections to Simon's 1955 model for the Pareto-Yule-Zipf distribution. Our objections are valid quite irrespectively of the sign of p-1, so that most of Simon's (1960) reply was irrelevant. We shall also analyze the other points brought up in that reply.*]
- **HA Simon**, "Reply to 'final note' by Benoit Mandelbrot", Information and Control, 4, 217-223 (1961).
  [ABSTRACT: *Dr. Mandelbrot's original objections (1959) to using the Yule process to explain the phenomena of word frequencies were refuted in Simon (1960), and are now mostly abandoned. The present "reply" refutes the almost entirely new arguments introduced by Dr. Mandelbrot in his "final note", and demonstrates again the adequacy of the models in (1955).*]
- **BB Mandelbrot**, "Post scriptum to 'final note'", Information and Control, 4, 300-304 (1961).
  [ABSTRACT: *My criticism has not changed since I first had the privilege of commenting upon a draft of Simon (1955).*]
- **HA Simon**, "Reply to Dr. Mandelbrot's post scriptum", Information and Control, 4, 305-308 (1961).
  [ABSTRACT: *Dr. Mandelbrot has proposed a new set of objections to my 1955 models of the Yule distribution. Like his earlier objections, these are invalid.*]

Editorial note: Dr. Mandelbrot feels that no further comment is needed and this debate terminates herewith.

(updated on december-10-2001)

- **GA Miller, EB Newman** (1958), "Tests of a statistical explanation of the rank-frequency relation for words in written English", American Journal of Psychology, 71, 209-218.
- **GA Miller, EB Newman, EA Friedman** (1958), "Length-frequency statistics for written English", Information and Control, 1, 370-389.
- **Henry Kucera, W Nelson Francis** (1967), *Computational Analysis of Present-Day American English* (Brown Univ Press). [out of print: see Amazon]
- **Ronald E Wyllys** (1975), "Measuring scientific prose with rank-frequency ('Zipf') curves: a new use for an old phenomenon", Proceedings of the American Society for Information Science 12, 30-31. Washington, DC: American Society for Information Science.
- **H Dahl** (1979), *Word Frequencies of Spoken American* (Verbatim). [rank-frequency of spoken words. The top twenty are: I, and, the, to, that, you, it, of, a, know, was, uh, in, but, is, this, me, about, just, don't]
- **R Rousseau, Qiaoqiao Zhang** (1992), "Zipf's data on the frequency of Chinese words revisited", Scientometrics, 24(2):201-220.
- **EG Bard, RC Shillcock** (1993), "Competitor effects during lexical access: Chasing Zipf's tail", in *Cognitive Models of Speech Processing: The Second Sperlonga Meeting*, eds. GTM Altmann and RC Shillcock (Lawrence Erlbaum Associates).
- **DR Ridley, EA Gonzales** (1994), "Zipf's law extended to small samples of adult speech", Percept. Mot. Skills, 79:153-154.
- **J Cooke, S Gregor, J Luck, JL Clark, KT Lua, J McCallum**, "Analyzing the conformance of Chinese text to Zipf's law and Automatic indexing of natural language text in the UNIX environment" (transcript of slides, 1996? Univ of Central Queensland, Australia).
- **J Tuldava** (1996), "The frequency spectrum of text and vocabulary", Journal of Quantitative Linguistics, 3(1):?-?. [ABSTRACT: *The present paper deals with some problems of the analysis of the word-frequency distribution and the possibility of its analytical description*]
- **Colin Martindale, SM Gusein-Zade, Dean McKenzie, and Mark Yu. Borodovsky** (1996), "Comparison of equations describing the ranked frequency distributions of graphemes and phonemes", Journal of Quantitative Linguistics, 3(2):?-?.
- **VK Balasubrahmanyan, S Naranan** (1996), "Quantitative linguistics and complex system studies", Journal of Quantitative Linguistics, 3(3):?-?.
- **S Naranan, VK Balasubrahmanyan** (1998), "Models for power law relations in linguistics and information science", Journal of Quantitative Linguistics, 5(3):?-?.
- **W Li**, Letters to the editor, Complexity, 3:9-10 (1998).
- **B K Sen, Khong Wye Keen, Lee Soo Hoon, Lim Bee Ling, Mohd Rafae Abdullah, Ting Chang Nguan, Wee Siu Hiang** (1998), "Zipf's law and writings on LIS", Malaysian Journal of Library & Information Science, 3(2):93-98. [abstract]
- **R Rousseau** (1998), "George Kingsley Zipf: life, ideas and recent developments of his theories", preprint (talk presented at the Beijing International Seminar of Quantitative Evaluation of R&D in Universities, and Fifth All-China Annual Meeting for Scientometrics and Informatics. Dec 4-6, 1998).
- **Leo Egghe** (1999), "On the law of Zipf-Mandelbrot for multi-word phrases", Journal of the American Society for Information Science, 50:?-?.
- **Claudia Prun** (1999), "G.K. Zipf's conception of language as an early prototype of synergetic linguistics", Journal of Quantitative Linguistics, 6(1):?-?.
- **MA Nowak** (2000), "The basic reproductive ratio of a word, the maximum size of a lexicon", Journal of Theoretical Biology, 204(2):179-189.
- **Marcelo A Montemurro** (2001), "Beyond the Zipf-Mandelbrot law in quantitative linguistics", arxiv.org e-print, cond-mat/0104066. [abstract]
- **Alexander Gelbukh, Grigori Sidorov** (2001), "Zipf and Heaps laws' coefficients depend on language", Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing'2001), ed. Alexander Gelbukh, Lecture Notes in Computer Science, Vol 2004 (Springer-Verlag), pp. 332-335.
- **AB Downey** (2001), "Evidence for long-tailed distributions in the Internet", Proceedings of ACM SIGCOMM Internet Measurement Workshop 2001.

- **G Landini**, "Zipf's laws in the Voynich manuscript" (http://sun1.bham.ac.uk/G.Landini/evmt/zipf.htm)
- **CR Turner**, "Relationship between vocabulary, text length, and Zipf's law" (http://www.btinternet.com/~g.r.turner/ZipfDoc.htm)
- **Serge Heiden**, "Lexploreur" (in French) (http://diderot.lexico.ens-fcl.fr/doc/lexplorer/index.html)
- **Jean Laherrere**, "Distributions de type 'fractal parabolique' dans la Nature" (in French) (http://hubbertpeak.com/laherrere/fractal.htm)

(new on feb-05-2002, I would like to thank Dr. Gabriel Altmann for this collection)

- **JB Estoup** (1916), *Les Gammes Stenographiques*, Paris, Institut Stenographique. (in French)
- **W Skalmowski** (1961), "Polskie przeklady Hafiza w swietle prawa Zipfa-Mandelbrota", Sprawozdania Kom. Orient. PAN 125-127.
- **VM Kalinin** (1964), "O statistike literaturnogo teksta", Voprosy jazykoznanija Nr. 1, ?-?.
- **VM Kalinin** (1964), *Razvitie schemy Puassona i ee primenenie dlja statisticeskich svojstv reci*, Leningrad: Diss. (in Russian)
- **Ju A Srejder** (1967), "O vozmoznosti teoreticeskogo vyvoda statisticeskich zakonomernostej teksta (k obosnovaniju zakona Cipfa)", in *Problemy peredaci informacii*, Vol 3, 57-63. Moskva.
- **EA Kalinina** (1968), "Izucenie leksiko-statisticeskich zakonomernostej na osnove verojanotnoj modeli", in *Statistika reci i avtomaticeskij analiz teksta*, Leningrad, ?-?.
- **G Billmeier** (1969), *Worthaufigkeiten vom Zipfschen Typ, uberprüft an deutschem Textmaterial*, Hamburg: Buske. (in German)
- **Ju K Orlov** (1970), "Statisticeskaja struktura soobscenij, optymal'nych dlja celoveceskogo vosprijatija", Naucno-techniceskaja informacija, 2m(8):11-16.
- **PM Alekseev, ST Naval'na** (1971), "Pro graficnij opis zaleznosti 'rang-castota' lingvisticeskich odinic", Visnik Char'kivskogo universitetu 64, filologija, vyp. 8:?-?.
- **GG Belonogov, AP Novoselov** (1971), "Nekotorye kolicestvennye zakonomernosti v automatizirovannych informacionnych sistemach", in Avtomaticeskaja pererabotka teksta metodami prikladnoj lingvistiki. Materialy vsesojuznoj konferencii: 219-220. Kisinev.
- **BA Volosin, JK Orlov** (1972), *Obobscennyj zakon Cipfa-Mandel'brota i raspredelenie cvetovych ploscadej v proizvedenijach zivopisi*, Tbilisi, AN GSSR Institut kibernetiki.
- **LS Kozackov** (1973), *Sistemy potokov naucnoj informacii*, Kiev: Naukova dumka.
- **MV Arapov, EN Efimova** (1975), "Ponjatie leksiceskoj struktury teksta", Naucno-techniceskaja informacija, 2:3-7.
- **MV Arapov, EN Efimova, Ju A Srejder** (1975), "O smysle rangovych raspredelenij", Naucno-techniceskaja informacija, 2:9-20.
- **MV Arapov, EN Efimova, Ju A Srejder** (1975), "Rangovye raspredelenija v tekste i jazyke", Naucno-techniceskaja informacija, 2:?-?.
- **AT Micevic** (1975), "Issledovanija struktury potokov naucno-techniceskoj informacii po masinostroenii", Naucno-techniceskaja informacija, 2(5):3-16.
- **Ju K Orlov** (1976), "O svjazi mezdu raspredeleniem Pareto i obobscennym zakonom Cipfa-Mandel'brota", Bulletin of the Academy of Sciences of the Georgian SSR, 83:57-60.
- **Ju K Orlov** (1976), "Obobscennyj zakon Cipfa-Mandelbrota i castotnye struktury informacionnych edinic razlicnych urovnej", in *Vycislitel'naja lingvistika*, ed. EK Guseva, pp. 179-202. Moskva: Nauka.
- **E Schurer** (1976), *Das Zipfsche Gesetz in der fruhen Kindersprache*, Munchen: Diss. (in German)
- **MV Arapov, JA Srejder** (1977), "Klassifikacija i rangovye raspredelenija", Naucno-techniceskaja informacija, 2(1-12):15-21.
- **MV Arapov** (1977), "Dve modeli rangovogo raspredelenija", Voprosy informacionnoj teorii i praktiki, 4:3-42.
- **AI Jablonskij** (1977), "Struktura i dinamika sovremennoj nauki", in *Sistemnye issledovanija. Ezegodnik 1976*, ed. DM Gvisiani, pp. 66-90. Moskva: Nauka.
- **SV Kopejkin, VE Ostapenko** (1977), "Zakon Cipfa i sopostavitel'nyj analiz castotnych struktur anglijskogo, francuzskogo, rumynskogo i russkogo jazykov na baze matematiceskich modelej", Naucnye trudy Kujbysevskogo pedagogiceskogo instituta, 193:91-94.
- **PM Alekseev** (1978), "O nelinejnych formulirovkach zakona Cipfa", Voprosy kibernetiki 41:53-65.
- **MV Arapov, JA Srejder** (1978), "Zakon Cipfa i princip dissimetrii sistemy", Semiotika i informatika, 10:74-95.
- **LS Kozackov** (1978), "Informacionnye sistemy s ierarchiceskoj ('rangovoj') strukturoj", Naucno-techniceskaja informacija, 2(8):15-24.
- **W Marx, E Schuprer-Necker** (1978), "Uberlegungen zur Interpretation des Zipfschen Gesetzes am Beispiel der fruhen Kindersprache", Glottometrika, 1:154-167. (in German)
- **A Rouault** (1978), "Loi de Zipf et sources markoviennes", Annales de l'Institut H. Poincare, 14:169-188. (in French)
- **H Birkhahn** (1979), "Das 'Zipfsche Gesetz', das schwache Prateritum und die germanische Lautverschiebung", Sitzungsberichte der osterreichischen Akademie der Wissenschaften, philosophisch-historische Klasse 348. (in German)
- **L Hoffmann, RG Piotrowski** (1979), *Beitrage zur Sprachstatistik*, Leipzig: ?
- **C Muller** (1979), "Du nouveau sur les distributions lexicales: la formule de Waring-Herdan", in *Langue Francais et Linguistique Quantitative*, ed. C Muller, pp. 177-195. Geneve: Slatkine. (in French)
- **A Babanarov** (1980), "Castotnyj slovnik i avtomaticeskij slovar' dlja masynnogo perevoda tereckich gazetnych textov", in *Inzenernaja lingvistika i optimizacija prepodavanija inostrannych jazykov*, Leningrad, pp. ?-?.
- **MG Boroda** (1980), "Haufigkeitsstrukturen musikalischer Texte", Glottometrika, 3:36-69. (in German)
- **Ju K Orlov** (1980), "Informacionnye potoki: statisticeskij analiz i prognozirovanie", Naucno-techniceskaja informacija, 2(2):23-30.
- **Ju K Krylov** (1982), "Stacionarnaja model' porozdenija svjaznogo teksta", Acta et Commentationes Universitatis Tartuensis, 774:81-102.
- **Ju K Orlov** (1982), "Dynamik der Haufigkeitsstrukturen", in *Studies on Zipf's Law*, eds. H Guiter, MV Arapov, pp. 116-153. Bochum: Brockmeyer. (in German)
- **Ju K Orlov** (1982), "Ein Modell der Haufigkeitsstruktur des Vokabulars", in *Studies on Zipf's Law*, eds. H Guiter, MV Arapov, pp. 154-233. Bochum: Brockmeyer. (in German)
- **Ju K Orlov** (1982), "Linguostatistik: Aufstellung von Sprachnormen oder Analyse des Redeprozesses? Die Antinomie 'Sprache-Rede' in der statistischen Linguistik", in *?*, eds. Ju K Orlov, MG Boroda, IS Nadarejsvili, pp. 1-55.
- **Ju V Orlov, MG Boroda, IS Nadarejsvili** (1982), *Sprache, Text, Kunst. Quantitative Analysen*, Bochum, Brockmeyer. (in German)
- **AN Lebedev** (1983), "Zakonomernosti postroenija slov v reci", Psichologiceskij zurnal, 4/5:11-23.
- **SD Haitun** (1983), *Naukometrika. Sostojanie i perspektivy*, Moskva: Nauka.
- **Ju K Orlov, RY Chitashvili** (1983), "Generalized Z-distribution generating the well-known 'Rank-Distributions'", Bulletin of the Academy of Sciences of the Georgian SSR, 110(2):269-272.
- **VN Byckov** (1984), "K probleme obobscenija i interpretacija rangovych raspredelenij v statisticeskoj lingvistike", Ucenye zapiski TGU, 689:61-70.
- **RG Piotrowski, KB Bektaev, AA Piotrovskaja** (1985), *Mathematische Linguistik*, Bochum, Brockmeyer. (in German)
- **J Tuldava** (1985), "Castotnaja struktura teksta i zakon Cipfa", Ucenye zapiski, TGU 711, 93-116.
- **G Altmann** (1988), *Wiederholungen in Texten*, Bochum, Brockmeyer. (in German)
- **Ju K Orlov** (1988), "Unsichtbare Harmonie", Musikometrika, 1:281-315.
- **C Prun** (1995), *Die linguistischen Hypothesen von G.K. Zipf aus systemtheoretischer Sicht*, Trier: Magisterarbeit.
- **A Knuppel** (1997), *Untersuchungen zum Zipf-Mandelbrot Gesetz an deutschen Texten*, Gottingen: Staatsexamensarbeit. (in German)
- **RG Piotrovskij, KB Bektaev, AA Piotrovskaja** (1997), *Matematiceskaja lingvistika*, Moskva: Nauka.
- **J Tuldava** (1998), *Probleme und Methoden der quantitativ-systemischen Lexikologie*, Trier: WVT.
- **A Knuppel** (2001), "Untersuchungen zum Zipf-Mandelbrot-Gesetz an deutschen Texten", in *Haufigkeitsverteilungen in Texten*, ed. KH Best, pp. 248-280. Gottingen: Peust & Gutschmidt. (in German)

(updated on feb-12-2002)

- **GA Miller** (1957), "Some effects of intermittent silence", American Journal of Psychology, 70:311-314.
- **GA Miller, N Chomsky** (1963), in *Handbook of Mathematical Psychology II*, eds. R. Luce, R. Bush, E. Galanter (Wiley), pp. 419-491.
- **J Nicolis** (1991), *Chaos and Information Processing: A Heuristic Outline* (World Scientific). [out of print, see Amazon]
- **W Li** (1992), "Random texts exhibit Zipf's-law-like word frequency distribution", IEEE Transactions on Information Theory, 38(6):1842-1845.
- **W Li** (1996), Comments to "Bell curves and monkey languages" (letter to the editor), Complexity, 1(6):6.
- **Richard Perline** (1996), "Zipf's law, the central limit theorem, and the random division of the unit interval", Physical Review E, 54(1):220-223.
- **G Troll, P beim Graben** (1998), "Zipf's law is not a consequence of the central limit theorem", Physical Review E, 57(2):1347-1355.
- **Leo Egghe** (2000), "General study of the distribution of N-tuples of letters or words based on the distribution of the single letters of words", Mathematical and Computer Modelling, 31:35-41.
- **Leo Egghe** (2000), "The distribution of N-grams", Scientometrics, 47(2):237-252.
- **Ramon Ferrer, Richard V Sole** (2002), "Zipf's law and random texts", Advances in Complex Systems, to appear.

**Christer Samuelson**(1995), "Relating Turing's formula and Zipf's law", Proceedings of the 4th Workshop on Very Large Corpora, Copenhagen, Denmark, 1996. [ abstract ]

- P Harremoees, F Topsoe (2001), "Maximum entropy fundamentals", Entropy, 3:227-292.
- P Harremoees, F Topsoe (2002), "Zipf's law, hyperbolic distributions and entropy loss", IEEE International Symposium on Information Theory (ISIT) Proceedings, in press.

- **BB Mandelbrot** (1977), *The Fractal Geometry of Nature* (W.H. Freeman and Company), section 38 "scaling and power laws without geometry". [Amazon entry]
- **George A Miller** (1991), *The Science of Words* (Scientific American Library, a division of HPHLP, distributed by W.H. Freeman and Company). [Amazon entry]
- **Manfred Schroeder** (1991), *Fractals, Chaos, Power Laws* (W.H. Freeman and Company), pp. 35-38. [Amazon entry]
- **Murray Gell-Mann** (1994), *The Quark and the Jaguar* (W.H. Freeman and Company), pp. 92-97. [Amazon entry]
- **Lada A Adamic**, "Zipf, Power-laws, and Pareto - a ranking tutorial" (online tutorial: http://ginger.hpl.hp.com/shl/papers/ranking/)

(updated on jul-30-2001)

- **F Auerbach** (1913), "Das Gesetz der Bevolkerungskonzentration", Petermanns Geographische Mitteilungen, LIX:73-76.
- **Bruce M Hill** (1970), "Zipf's law and prior distributions for the composition of a population", Journal of the American Statistical Association, 65:1220-1232.
- **R Gunther, L Levitin, B Schapiro, P Wagner** (1996), "Zipf's law and the effect of ranking on probability distribution", International Journal of Theoretical Physics, 35(2):395-417.
- **Hernan A Makse, Shlomo Havlin, H Eugene Stanley** (1995), "Modelling urban growth patterns", Nature, 377:608-612.
- **P Krugman** (1996), *The Self-Organizing Economy* (Blackwell, Cambridge, MA).
- **DH Zanette and SC Manrubia** (1997), "Role of intermittency in urban development: a model of large-scale city formation", Physical Review Letters, 79:523-526. [PDF] Comments by M Marsili, S Maslov and Y-C Zhang, and reply, at Physical Review Letters, 80:4831 (1998). (note: the x-axis in the paper is city population, not rank)
- **SC Manrubia, DH Zanette** (1998), "Intermittency model for urban development", Physical Review E, 58:295-302.
- **Matteo Marsili, Yi-Cheng Zhang** (1998), "Interacting individuals leading to Zipf's law", Physical Review Letters, 80(12):2741-2744. [PDF]
- **X Gabaix** (1999), "Zipf's law for cities: an explanation", Quarterly Journal of Economics, 114:739-767.
- **Bill Reed** (2002), "On the rank-size distribution for human settlements", J Regional Science, 41:1-17. [PDF]
- **LC Malacarne, RS Mendes, EK Lenzi** (2002), "q-exponential distribution in urban agglomeration", Physical Review E, 65(1):article 017106.

(updated on mar-07-2001)

See also, Mark Crovella's publication list

Jakob Nielsen's column "Zipf curve and website popularity"

Jakob Nielsen's column "Traffic from referring sites"

Hewlett-Packard's information dynamics group

- **Steve Glassman**, "A caching relay for the world wide web", in First International World-Wide Web Conference, pages 69-76 (May 1994). (html)
- **WE Leland, MS Taqqu, W Willinger, DV Wilson** (1994), "On the self-similar nature of Ethernet traffic", IEEE/ACM Transactions on Networking, 2:1-15.
- **Carlos R Cunha, Azer Bestavros, Mark E Crovella**, "Characteristics of WWW client-based traces", Technical Report TR-95-010, Boston University Computer Science Department, June 1995.
- **Virgilio Almeida, Azer Bestavros, Mark Crovella, and Adriana de Oliveira** (1996), "Characterizing reference locality in the WWW", Boston University Computer Science Department, TR-96-11, June 1996. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems (PDIS '96), December 1996.
- **Martin F Arlitt, Carey L Williamson** (1997), "Internet web server: workload characterization and performance implications", IEEE/ACM Transactions on Networking, 5(5):631-645.
- **ME Crovella, A Bestavros** (1997), "Self-similarity in world wide web traffic: evidence and possible causes", IEEE/ACM Transactions on Networking, 5(6):835-846.
- **P Barford, ME Crovella**, "Generating representative web workloads for network and server performance evaluation", in Proceedings of Performance '98/ACM SIGMETRICS '98, 151-160, Madison WI. [Slightly expanded version appears as BUCS-TR-1997-006, November 4, 1997.]
- **ME Crovella, Murad S Taqqu, Azer Bestavros** (1998), "Heavy-tailed probability distributions in the world wide web", in *A Practical Guide To Heavy Tails*, eds. RJ Adler, RE Feldman, MS Taqqu, Chapter 1, 3-26 (Chapman & Hall).
- **N Nishikawa, T Hosokawa, Y Mori, K Yoshida, H Tsuji** (1998), "Memory-based architecture for distributed WWW caching proxy", Computer Networks and ISDN Systems, 30:205-214.
- **BA Huberman, PLT Pirolli, JE Pitkow, RM Lukose**, "Strong regularities in world wide web surfing", Science, 280:95-97 (April 3, 1998).
- **M Harchol-Balter, ME Crovella, CD Murta** (1998), "On choosing a task assignment policy for a distributed server system", in Proceedings of Performance Tools '98, Lecture Notes in Computer Science Vol 1469, pp. 231-242, 1998.
- **ME Crovella, R Frangioso, M Harchol-Balter** (1999), "Connection Scheduling in Web Servers", Boston University Computer Science Technical Report BUCS-TR-99-003.
- **ME Crovella, MS Taqqu** (1999), "Estimating the heavy tail index from scaling properties", Methodology and Computing in Applied Probability, 1(1):?-?.
- **P Barford, A Bestavros, A Bradley, and ME Crovella** (1999), "Changes in Web client access patterns: characteristics and caching implications", to appear in World Wide Web, Special Issue on Characterization and Performance Evaluation.
- **Albert-Laszlo Barabasi, Reka Albert** (1999), "Emergence of scaling in random networks", Science, 286(5439):509-512. (may be relevant, but I haven't checked) An ABC News online article on this work can be found at http://abcnews.go.com/sections/science/WhosCounting/whoscounting991201.html (Dec 1, 1999).
- **JM Carlson, J Doyle** (2000), "Highly optimized tolerance: a mechanism for power laws in designed systems", Physical Review E, 60(2):1412-1427. [PDF] (This paper describes a general theory for power laws, not just in internet traffic, but there is a section on this particular application.)
- **Lee Breslau, Pei Cao, Li Fan, Graham Phillips, Scott Shenker** (2000), "Web caching and Zipf-like distributions: evidence and implications", Proceedings of INFOCOM'99 (IEEE Press). [abstract] [PDF]
- **Sidney Resnick, Holger Rootzen** (2000), "Self-similar communication models and very heavy tails", Annals of Applied Probability, 10(3):753-778.
- **Lada A Adamic, Bernardo A Huberman** (2000), "The nature of markets in the World Wide Web", Quarterly Journal of Electronic Commerce, 1:5-12. [PDF]
- **Anders Johansen, Didier Sornette** (2000), "Download relaxation dynamics on the WWW following newspaper publication of URL", Physica A, 276:338-345.
- **AB Downey** (2001), "Evidence for long-tailed distributions in the Internet", ACM SIGCOMM Internet Measurement Workshop (November 2001).
- **AB Downey** (2001), "The structural causes of file size distributions", Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'2001).
- **Michael Mitzenmacher** (2002), "Improved models for file size distribution", preprint (EECS, Harvard Univ).

(updated on mar-07-2001)

This is similar to Zipf's law in natural language, but discussed in the context of information retrieval and library science.

Some links to conferences:

7th International Conference on Scientometrics and Informetrics (July 5-9, 1999, Mexico)

6th International Conference on Scientometrics and Informetrics (June 16-19, 1997, Israel)

- **AJ Lotka** (1926), "The frequency distribution of scientific productivity", Journal of the Washington Academy of Sciences, 16:317-323.
- **RA Fairthorne** (1969), "Empirical hyperbolic distributions (Bradford Zipf Mandelbrot) for bibliometric description and prediction", Journal of Documentation, 25:319-343.
- **Bertram Brookes** (1977), "Theory of the Bradford law", Journal of Documentation, 33:180-209.
- **Ronald E Wyllys** (1981), "Empirical and theoretical bases of Zipf's law", Library Trends, 30(1):53-64 (Summer).
- **Bertram Brookes** (1982), "Quantitative analysis in the humanities: the advantage of ranking techniques", in *Studies on Zipf's law*, ed. H Guiter, MV Arapov (Brockmeyer), pages 65-115.
- **J Fedorowicz** (1982), "A Zipfian model of an automatic bibliographic system: an application to MEDLINE", Journal of American Society of Information Science, 33:223-232.
- **Bertram Brookes** (1984), "Towards informetrics: Haitun, Laplace, Zipf, Bradford and Alvey programme", Journal of Documentation, 40:120-143.
- **Linus Ikpaahindi** (1985), "An overview of bibliometrics: its measurements, laws and their applications", Libri, 35(2):163-177.
- **Ye-Sho Chen, Ferdinand F Leimkuhler** (1986), "A relationship between Lotka's law, Bradford's law, and Zipf's law", Journal of the American Society for Information Science, 37:307-314.
- **Ye-Sho Chen, Ferdinand F Leimkuhler** (1987), "Analysis of Zipf's law: an index approach", Information Processing and Management, 23:71-182.
- **Ye-Sho Chen, Ferdinand F Leimkuhler** (1987), "Bradford's law: an index approach", Scientometrics, 11:183-198.
- **Leo Egghe** (1989), *The Duality of Informetric Systems with Applications to the Empirical Laws*, Ph.D Thesis (City University, London).
- **Michael J Nelsen** (1989), "Stochastic models for the distribution of index terms", Journal of Documentation, 45:227-237.
- **Howard White, Katherine W McCain** (1989), "Bibliometrics", Annual Review of Information Science and Technology, 24:119-186.
- **Abraham Bookstein** (1990), "Informetric distributions. Part I: unified overview", Journal of the American Society for Information Science, 41:368-375.
- **Leo Egghe** (1990), "The duality of informetric systems with applications to the empirical laws", Journal of Information Science, 16:17-27.
- **Leo Egghe**, **Ronald Rousseau** (1990), *Introduction to Informetrics: Quantitative Methods in Library, Documentation and Information Science* (Elsevier).
- **Liwen Qiu** (1990), "An empirical examination of the existing models for Bradford's law", Information Processing and Management, 26:655-672.
- **Ronald Rousseau** (1990), "Relations between continuous versions of bibliometric laws", Journal of the American Society for Information Science, 41(3):197-203.
- **Leo Egghe** (1991), "The exact place of Zipf's and Pareto's law amongst the classical informetric laws", Scientometrics, 20:93-106.
- **Ronald Rousseau**, Qiaoqiao Zhang (1992), "Zipf's data on the frequency of Chinese words revisited", Scientometrics, 24:201-220.
- **Ronald Rousseau**, **Sandra Rousseau** (1993), "Informetric distributions: a tutorial review", CJILS/RCSIB, 18(2):51-63.
- **Quoniam Luc, Balme Frederic, Rostaing Herve, Giraud Eric, Dou Jean Mari** (1997), "Bibliometric law used for information retrieval", in *Proceedings of the Sixth Conference of the International Society for Scientometrics and Informetrics*, eds. Bluma C Peritz, Leo Egghe, Hebrew Univ of Jerusalem.
- **S Redner** (1998), "How popular is your paper? An empirical study of the citation distribution", European Physical Journal B, 4:131-134. (http://xxx.lanl.gov/abs/cond-mat/9804163)
- **ZK Silagadze**'s preprint: "Citations and the Zipf-Mandelbrot's law", arxiv.org e-print, physics/9901035 [ abstract ], Complex Systems, 11:487-499 (1997).
- Another preprint, **C Tsallis, MP de Albuquerque**, "Are citations of scientific papers a case of nonextensivity?", (March 1999) (http://xxx.lanl.gov/abs/cond-mat/9903433)
- **ZK Silagadze** (2000), "Citations and the Zipf-Mandelbrot law", Complex Systems, 11(6):?-?.
- **Robert Losee** (2001), "Term dependence: a basis for Luhn and Zipf models", Journal of the American Society for Information Science and Technology, 52(12):1019-1025. [ PDF ]

(updated on sep-09-2001)

- Of course, Pareto's paper should be listed here.
- If the distribution is not plotted as the rank-frequency plot, but the number of companies in each revenue/sale/income/whatever category (this is actually the other type of Zipf's plot, see Zipf [1935]), the log-normal distribution is usually relevant (I haven't got the chance to trace the references...)
- **D Champernowne** (1953), "A model of income distribution", Economic Journal, 63:318-351.
- **BB Mandelbrot** (1963), "", Journal of Business, 36:394-?.
- **BB Mandelbrot** (1963), "New methods in statistical economics", Journal of Political Economy, 71:421-440.
- **E Fama** (1965), " ", Management Science, 11:404-419.
- **JP Bouchaud** (1995), "More Levy distributions in physics", in *Levy Flights and Related Topics in Physics*, Lecture Notes in Physics 450 (Springer), pp 239-250.
- **MHR Stanley, SV Buldyrev, S Havlin, RN Mantegna, MA Salinger, HE Stanley** (1995), "Zipf's plots and the size distribution of firms", Economics Letters, 49:453-457.
- **BB Mandelbrot** (1997), *Fractals and Scaling in Finance: Discontinuity, Concentration, Risk* (Springer-Verlag, Nov 1997) [ Amazon entry ]
- D. Sornette, D. Zajdenweber, "Economic returns of research: the Pareto law and its implications", European Physical Journal B, 8:653-664 (1998). ( abstract )
- **JP Bouchaud, D Sornette, C Walter, JP Aguilar**, "Taming large events: optimal portfolio theory for strongly fluctuating assets", International Journal of Theoretical and Applied Finance, 1:25-41 (1998).
- **N. Vandewalle and M. Ausloos**, "The n-Zipf analysis of financial data series and biased data series", Physica A, 268:170-176 (1999).
- **Greg Ip**, "Analyst discovers the order in internet stocks valuations", Wall Street Journal, Dec 27 (1999). [ http://interactive.wsj.com/articles/SB946246776318315015.htm ] [ a local copy ]
- **J J Ramsden, Gy Kiss-Haypál** (2000), "Company size distribution in different countries", Physica A, 277:220-227.
- Sorin Solomon, Peter Richmond (2000), "Stability of Pareto-Zipf law in non-stationary economics", arxiv.org e-print, cond-mat/0012479. [ abstract ]
- **H Aoyama, W Souma, Y Nagahara, M P Okazaki, H Takayasu, M Takayasu** (2000), "Pareto's law for income of individuals and debt of bankrupt companies", Fractals, 8(3):293-300.
- **A Dragulescu, VM Yakovenko** (2001), "Evidence for the exponential distribution of income in the USA", European Physical Journal B, 20:585-589.
- **Bill Reed** (2000), "The Pareto law of incomes - an explanation and an extension", submitted.
- **Bill Reed** (2001), "The Pareto, Zipf and other power laws", Economics Letters, in press. [note: the paper also contains a model for Zipf's law in general.] [ PDF ]
- **Robert L Axtell** (2001), "Zipf distribution of US firm sizes", Science, 293(5536):1818-1820. [note: it's a frequency-size plot, not the size-rank plot.] [ PDF ]

(updated on dec-02-2002)

(well, i haven't checked the original papers, so i'm not sure the papers are in the right place ...)

- BM Hill, "The rank-frequency form of Zipf's law", Journal of American Statisticians, 3, 1163-1174 (1975).
- **Juan Camacho, Richard V Sole** (2001), "Scaling in ecological size spectra", Europhysics Letters, 55:774-780.
- **WJ Reed, BD Hughes** (2002), "On the size distribution of live genera", Journal of Theoretical Biology, 217:?-?
- **WJ Reed, BD Hughes** (2002), "From gene families and genera to incomes and internet file sizes: why power-laws are so common in nature", Physical Review E, to appear.

- D Sornette, L Knopoff, YY Kagan, C Vanneste, "Rank-ordering statistics of extreme events: application to the distribution of large earthquakes", Journal of Geophysical Research, 101(B6):13883-13894 (1996). [ PDF ]

(note that i didn't use the words "zipf's law", because these are not!)

- **G Gamow, M Ycas** (1955), "Statistical correlation of protein and ribonucleic acid composition", Proceedings of National Academy of Sciences, 41(12):1011-1019 (Dec 15, 1955).
- I wouldn't list the recent papers on the so-called Zipf's law in subsequences of DNA sequences, because these rank-frequency plots do not follow the power law well, and the slope in the double-logarithmic plot is far from -1. These are rank-frequency plots, but are *not* Zipf's law!
- **E Bornberg-Bauer** (1997), "How are model protein structures distributed in sequence space?", Biophysical Journal, 73(5):2393-2403. [If I understood correctly, some protein structures correspond to many protein sequences, whereas other structures correspond to fewer sequences. So structures can be ranked...]
- **M Gerstein, H Hegyi** (1998), "Comparing genomes in terms of protein structure: surveys of a finite parts list", FEMS Microbiol Review, 22(4):277-304. [well, the words "Zipf's law" are mentioned in the abstract...]
- Vladimir A Kuznetsov (2001), "Distribution associated with stochastic processes of gene expression in a single eukaryotic cell", EURASIP Journal on Applied Signal Processing, 4:285-296. [ PDF ]
- **W Li, Y Yang** (2002), "Zipf's law in importance of genes for cancer classification using microarray data", Journal of Theoretical Biology, 219:539-551. or: arxiv.org e-print, [ physics/0104028 ]
- Vladimir A Kuznetsov (2002), "Statistics of the numbers of transcripts and protein sequences encoded in the genome", in *Computational and Statistical Approaches to Genomics* (Kluwer). [ PDF ]
- NM Luscombe, J Qian, Z Zhang, T Johnson, M Gerstein (2002), "The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties", Genome Biology, 3:research0040.
- **WJ Reed, BD Hughes** (2002), "A model explaining the size distribution of gene and protein families", submitted to Discrete and Conts. Dyn. Systems - B
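
The objection in this list to the DNA subsequence papers (the slope in the double-logarithmic plot is far from -1) can be checked roughly for any set of counts. Here is a minimal sketch, not taken from any of the papers listed: the function name `loglog_slope` and both data sets are made up for illustration, and an ordinary least-squares fit on the log-log plot is only a crude first check, not a proper test for a power law.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Hypothetical helper for illustration; rank 1 goes to the largest count.
    """
    # Zero counts cannot be log-transformed, so they are dropped.
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic counts that follow P_i ~ 1/i exactly: the slope should be -1.
zipf_like = [1000.0 / i for i in range(1, 51)]

# Synthetic counts that decay exponentially with rank: still a rank-frequency
# plot, but not a power law, and the fitted slope is far from -1.
exponential = [1000.0 * math.exp(-0.2 * i) for i in range(1, 51)]

print("Zipf-like slope:  ", round(loglog_slope(zipf_like), 3))
print("exponential slope:", round(loglog_slope(exponential), 3))
```

A slope near -1 alone is not enough either: the points should also fall close to a straight line on the double-logarithmic plot over most of the rank range.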

- BM Hill, "A simple general approach to inference about the tail of a distribution", Annals of Statistics, 3, 1163-1174 (1975).
- G.S. Lo, "Asymptotic behavior of Hill's estimate and application", Journal of Applied Probability, 23, 922-936 (1986).
- BM Hill, "Bayesian forecasting of extreme values in an exchangeable sequence", J Res National Institute of Standard Technology, 99:521-538 (1994).

(updated on jul-05-2001)

- **CJ Brackenridge** (1978), "A study of phenotypic arrays derived from seven genetic systems in an Australian population sample", Ann. Human Biology, 5:381-388.
- **P Schuster, PF Stadler** (1994), "Landscapes: complex optimization problems and biopolymer structures", Computers & Chemistry, 18(3):295-324.
- **P Schuster, W Fontana, PF Stadler, IL Hofacker** (1994), "From sequences to shapes and back: a case study in RNA secondary structures", Proceedings of Royal Society of London (B. Biological Sciences), 255:279-284.
- **P Schuster** (1995), "How to search for RNA structures. Theoretical concepts in evolutionary biotechnology", Journal of Biotechnology, 41(2-3):239-257. ["The frequency with which a structure is realized in sequence space is inversely proportional to some power c > 1 of the structure's frequency rank, thus following a (generalized) Zipf law"]
- **MS Watanabe** (1996), "Zipf's law in percolation", Physical Review E, 53(4):4187-4190.
- **JD Burgos, P Moreno-Tovar** (1996), "Zipf-scaling behavior in the immune system", Biosystems, 39(3):227-232.
- **YG Ma** (1999), "Zipf's law in the liquid gas phase transition of nuclei", European Physical Journal A, 6:367-371.
- **Piqueira JR, Monteiro LH, de Magalhaes TM, Ramos RT, Sassi RB, Cruz EG** (1999), "Zipf's law organizes a psychiatric ward", Journal of Theoretical Biology, 198:439-443. [what?]
- **J Kalda, M Sakki, M Vainu, M Laan** (Oct 2001), "Zipf's law in human heartbeat dynamics", arxiv.org e-print, physics/0110075. [ abstract ]
- **WJ Reed, BD Hughes** (2002), "On the distribution of family names", Physica A, to appear.

(new on sep-19-2001)

- **L Pietronero, E Tosatti, V Tosatti, A Vespignani** (2001), "Explaining the uneven distribution of numbers in nature: the laws of Benford and Zipf", Physica A, 293:297-304.

- S Newcomb , "Note on the frequency of use of the different digits in natural numbers", American Journal of Mathematics, 4:39-40 (1881).
- Frank Benford, "The law of anomalous numbers", Proc. American Phil Society, 78:551-572 (1938).
- RA Raimi, "The peculiar distribution of first digits", Scientific American, 221:109-119 (Dec 1969)
- J Burke, E Kincanon, "Benford's law and physical constants: the distribution of initial digits", American Journal of Physics, 14:59-63 (1991).
- Mark J Nigrini, *The Detection of Income Tax Evasion Through an Analysis of Digital Frequencies* (Ph.D Thesis, Univ Cincinnati, 1992) (currently a professor of accountancy at the Southern Methodist University, Dallas, TX)
- "He's got their number: Scholar uses math to foil financial fraud" (Wall Street Journal, July 10, 1995)
- E Ley, "On the peculiar distribution of the US stock indices digits", American Statistician, 1995
- Theodore P Hill, "A statistical derivation of the significant-digit law", Statistical Science, 10(4):354-363 (1995).
- M Nigrini, "A taxpayer compliance application of Benford's law", Journal of the American Taxation Association, 18:72-91 (1996).
- TP Hill, "The first digit phenomenon", American Scientist, 86:358-363 (1998).
- Matthews, "The power of one", New Scientist, July 10, 1999.
- Eric Weisstein's Treasure Troves of Science: http://www.treasure-troves.com/math/BenfordsLaw.html
- Alexander Bogomolny's Interactive Math Miscellany and Puzzles: http://www.cut-the-knot.com/do_you_know/zipfLaw.html
- New York Times, Aug 4, 1998, "Following Benford's Law, or Looking Out for No. 1" (a copy from http://courses.nus.edu.sg/course/mathelmr/080498sci-benford.htm)
- LM Leemis, BW Schmeiser, DL Evans (2000), "Survival distributions satisfying Benford's law", The American Statistician, 54:1-6.