Zipf's Law: a curious social and mathematical phenomenon
This law, proposed by the linguist George Zipf, describes a pattern in how often words are repeated.
We use thousands of words every day, with meanings of all kinds and belonging to very varied grammatical categories. However, not all of them are used with the same frequency. Depending on how important they are to the sentence structure, some words recur more frequently than others.
Zipf's law is a postulate that takes this phenomenon into account and specifies how likely a word is to be used based on its rank among all the words used in a language, ordered by frequency. We will go into more detail about this law below.
Zipf's law
George Kingsley Zipf (1902-1950) was an American linguist, born in Freeport, Illinois, who encountered a curious phenomenon in his studies of comparative philology. While carrying out statistical analyses, he found that the most frequently used words seemed to follow a pattern of occurrence. This was the birth of the law that bears his name.
According to Zipf's law, the vast majority of the time, if not always, the words used in a written text or in a spoken conversation follow the same pattern: the most used word, which occupies first place in the ranking, is used twice as often as the second most used word, three times as often as the third, four times as often as the fourth, and so on.
In mathematical terms, this law would be:
$$P_n \approx \frac{1}{n^{a}}$$
Where $P_n$ is the frequency of the word at rank $n$ and the exponent $a$ is approximately 1.
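To make the formula concrete, here is a minimal Python sketch of the counts it predicts. The function name `zipf_expected_counts` and the figure of 1,200 occurrences are assumptions chosen for illustration, not values from Zipf's work.

```python
def zipf_expected_counts(top_count, n_ranks, a=1.0):
    """Expected counts for ranks 1..n_ranks when count(rank) = top_count / rank**a."""
    return [top_count / rank ** a for rank in range(1, n_ranks + 1)]


# Hypothetical example: the most frequent word appears 1,200 times.
for rank, count in enumerate(zipf_expected_counts(1200, 4), start=1):
    print(f"rank {rank}: about {count:.0f} occurrences")
```

With the exponent at 1, each rank simply divides the top count by its position in the ranking; a larger exponent would make the drop-off steeper.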
It should be said that George Zipf was not the only one to observe this regularity in the frequency of the most used words of many languages, both natural and artificial. In fact, there are records of others, such as the stenographer Jean-Baptiste Estoup and the physicist Felix Auerbach.
Zipf studied this phenomenon in English texts and, apparently, it holds true. If we take the original version of Charles Darwin's On the Origin of Species (1859), we see that the most used word in the first chapter is "the", which appears about 1,050 times, while the second is "and", appearing about 400 times, and the third is "to", appearing about 300 times. Although the match is not exact, the second word appears roughly half as often as the first, and the third roughly a third as often.
In Spanish the same thing happens. If we take the original Spanish version of this article as an example, we can see that the word "de" is the most used, appearing 85 times, while the word "la", which is the second most used, appears 57 times.
Seeing that this phenomenon occurs in other languages, it becomes interesting to think about how the human brain processes language. While many cultural phenomena shape the use and meaning of words, language itself being a cultural factor, the way in which we use the most frequent words seems to be independent of culture.
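The Darwin figures can be checked with a few lines of Python. This is only a sketch using the approximate counts quoted above; the helper name `times_less_frequent` is an assumption made for this example.

```python
def times_less_frequent(counts):
    """How many times less frequent each word is than the top-ranked one."""
    top = counts[0]
    return [top / c for c in counts]


words = ["the", "and", "to"]
darwin_counts = [1050, 400, 300]  # approximate counts quoted in the article

for rank, (word, ratio) in enumerate(zip(words, times_less_frequent(darwin_counts)), start=1):
    # With an exponent of 1, Zipf's law predicts a ratio equal to the rank.
    print(f'"{word}": observed {ratio:.2f}x, predicted {rank}x')
```

The observed ratios land near, but not exactly on, the predicted values of 2 and 3, which is the sense in which the law holds only approximately here.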
Frequency of function words
Let's look at the following words, translated here from Spanish: 'that', 'of', 'not', 'a', 'the', 'is', 'and', 'in' and 'it'. What do they all have in common? They carry no meaning on their own but, ironically, they are among the 10 most used words in the Spanish language (Spanish has several forms of 'the', so it takes up more than one place in the list).
By saying that they lack meaning, we mean that a sentence containing no noun, adjective, verb or adverb conveys nothing. For example:
... and ... ... a ... ... ... ... to their ... in their ....
On the other hand, if we replace the dots with meaningful words, we can have a sentence like the following.
Miguel and Ana have a little brown table next to their bed in their house.
These frequently used words are called function words, and they are responsible for giving grammatical structure to the sentence. They are not only the 10 we have seen; in fact there are dozens of them, and all of them are among the hundred most used words in Spanish.
Although they have no meaning on their own, they are impossible to omit in any sentence to which you want to give meaning. To transmit a message efficiently, human beings need to resort to words that make up the structure of the sentence. For this reason they are, curiously enough, the most commonly used.
Research
Despite George Zipf's observations in his studies of comparative philology, until relatively recently it was not possible to test the postulates of the law empirically. This was not because it was materially impossible to analyze all conversations or texts in English, or in any other language, but because of the titanic effort the task involved.
Fortunately, thanks to modern computing and computer programs, it has been possible to investigate whether this law holds in the form in which Zipf originally proposed it or whether there are variations.
One case is the research carried out by the Center for Mathematical Research (CRM, in Catalan Centre de Recerca Matemàtica), linked to the Universitat Autònoma de Barcelona. Researchers Álvaro Corral, Isabel Moreno García and Francesc Font Clos carried out a large-scale study in which they analyzed thousands of digitized texts in English to see how well Zipf's law held.
Their work, in which an extensive corpus of nearly 30,000 volumes was analyzed, yielded a law equivalent to Zipf's, in which the most frequently used word appeared twice as often as the second most frequently used word, and so on.
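As a rough illustration of the kind of counting that computers make easy, the sketch below ranks the words of a text by frequency using Python's standard library. The function name `rank_words`, the simple tokenizing pattern and the sample sentence are all assumptions for this example; they are not taken from any of the studies mentioned here.

```python
import re
from collections import Counter


def rank_words(text, top=10):
    """Return the `top` most frequent words as (word, count) pairs, most frequent first."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)


sample = "the cat and the dog and the bird saw the fox"  # made-up sample text
for rank, (word, count) in enumerate(rank_words(sample, top=3), start=1):
    print(rank, word, count)
```

On a real corpus, the ranked counts produced this way are what would be compared against the frequencies the law predicts.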
Zipf's law in other contexts
Although Zipf's law was originally used to explain the frequency of words used in each language, comparing their rank with their actual frequency in texts and conversations, it has also been extrapolated to other situations.
A rather striking case is the number of people living in the largest U.S. cities. According to Zipf's law, the most populous U.S. city should have twice as many inhabitants as the second most populous, and three times as many as the third.
The 2010 census figures match this reasonably well. New York had a total population of 8,175,133, followed by Los Angeles with 3,792,621; the next cities in the ranking, Chicago, Houston and Philadelphia, had 2,695,598, 2,100,263 and 1,526,006 inhabitants respectively.
Something similar can be seen in the most populated cities in Spain. Zipf's law is not fulfilled exactly, but populations do correspond, to a greater or lesser extent, with each city's position in the ranking. Madrid, with a population of 3,266,126, has twice as many inhabitants as Barcelona, with 1,636,762, while Valencia, with about 800,000 inhabitants, has somewhat less than a third.
Another observable case of Zipf's law involves web pages. Cyberspace is very large, with about 15 billion web pages created. Considering that there are about 6.8 billion people in the world, in theory there would be roughly two web pages per person to visit every day if visits were spread evenly, which is not the case.
The ten most visited websites today are: Google (60.49 million monthly visits), YouTube (24.31 million), Facebook (19.98 million), Baidu (9.77 million), Wikipedia (4.69 million), Twitter (3.92 million), Yahoo (3.74 million), Pornhub (3.36 million), Instagram (3.21 million) and Xvideos (3.19 million). Looking at these numbers, you can see that Google is visited roughly twice as much as YouTube, about three times as much as Facebook, more than four times as much as Baidu, and so on.
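One simple way to check the city figures is to multiply each population by its rank: if Zipf's law held exactly with an exponent of 1, every product would equal the population of the largest city. The sketch below does this with the figures quoted above; the function name `rank_size_products` is an assumption for this example, and the Valencia figure is the rounded value given in the text.

```python
def rank_size_products(populations):
    """Multiply each population by its rank (1 for the largest city)."""
    return [rank * pop for rank, pop in enumerate(populations, start=1)]


# 2010 census figures quoted in the article, largest first.
us_cities = [8_175_133, 3_792_621, 2_695_598, 2_100_263, 1_526_006]
# Figures quoted in the article; the third value is approximate.
spanish_cities = [3_266_126, 1_636_762, 800_000]

print(rank_size_products(us_cities))
print(rank_size_products(spanish_cities))
```

The closer the products stay to the first value, the better the ranking fits the law; a product that falls well below it marks a city smaller than the law would predict for its rank.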
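The exponent in the formula can also be estimated from ranked data like this. A common rough approach, sketched below, is to fit a straight line to the logarithm of the value against the logarithm of the rank; the negated slope is the estimate of the exponent. The function name `fit_zipf_exponent` is an assumption, the visit figures are the ones listed above, and ten data points are far too few for a reliable estimate.

```python
import math


def fit_zipf_exponent(values):
    """Estimate the exponent a from values ordered by rank, largest first.

    Fits log(value) = c - a * log(rank) by ordinary least squares.
    """
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx  # the slope is negative, so flip the sign


# Monthly visits (in millions) as listed in the article, ranks 1 to 10.
visits = [60.49, 24.31, 19.98, 9.77, 4.69, 3.92, 3.74, 3.36, 3.21, 3.19]
print(f"estimated exponent: {fit_zipf_exponent(visits):.2f}")
```

For these ten figures the fitted exponent comes out somewhat above 1, meaning visits fall off a little faster than the idealized version of the law would suggest.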
Bibliographical references:
- Font-Clos, F., Boleda, G. and Corral, Á. (2013). A scaling law beyond Zipf's law and its relation to Heaps' law. New Journal of Physics, 15. doi.org/10.1088/1367-2630/15/9/093033.
- Montemurro, M. A. (2001). Beyond the Zipf–Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications, 300: 567-578.
(Updated at Apr 13 / 2024)