Zipf Distribution

📊 Zipf Distribution in Python

The Zipf distribution is a discrete probability distribution for ranked data in which the frequency of an item is inversely proportional to a power of its rank.
It is a power law, often associated with the 80/20 rule, and it appears in linguistics, social networks, and city populations.


✅ 1. Characteristics of Zipf Distribution

  • Discrete distribution (integers: 1, 2, 3, …)

  • Parameter: a > 1 → exponent characterizing the distribution

  • Probability Mass Function (PMF):

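P(X = k) = \frac{1/k^a}{\sum_{n=1}^{N} 1/n^a}, \quad k = 1, 2, \ldots, N

Here N is the number of ranks. np.random.zipf samples the unbounded case (N → ∞), where the denominator becomes the Riemann zeta function ζ(a); the sum only converges for a > 1, which is why the parameter must exceed 1.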

  • Heavy-tailed distribution: a few items have very high frequency, many items have low frequency
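The PMF above can be evaluated directly for a finite number of ranks. The following is a minimal sketch; the helper name zipf_pmf and the choice N = 10 are assumptions made for illustration, not part of NumPy.

```python
import numpy as np

def zipf_pmf(a, N):
    """Return P(X = k) for k = 1..N under the finite-N formula."""
    ranks = np.arange(1, N + 1)
    weights = 1.0 / ranks ** a        # unnormalized 1 / k^a
    return weights / weights.sum()    # divide by the sum over n = 1..N

pmf = zipf_pmf(a=2.0, N=10)

for k, p in enumerate(pmf, start=1):
    print(f"rank {k:2d}: {p:.4f}")
```

With a = 2, rank 1 is four times as likely as rank 2 and nine times as likely as rank 3, which is the heavy concentration at low ranks described above.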

Applications:

  • Word frequencies in text (most common words appear much more often)

  • City populations

  • Website traffic distribution
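For the word-frequency application, a rank-frequency table can be built with the standard library. This is only a sketch: the sample text is made up for illustration and is far too short to show a real power law, but the same steps apply to a larger corpus.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration
text = (
    "the cat and the dog and the bird saw the fox "
    "the fox saw the cat and the dog ran"
)

counts = Counter(text.split())
ranked = counts.most_common()   # list of (word, frequency), most frequent first

# Rank 1 is the most frequent word, rank 2 the next, and so on
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"{rank:2d}  {word:5s} {freq}")
```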


✅ 2. Generate Zipf Data Using NumPy

import numpy as np

# Parameters
a = 2 # exponent parameter (>1)
size = 1000 # number of samples

# Generate Zipf random numbers
data = np.random.zipf(a, size)

print(data[:10])

Output (example):

[1 2 1 3 1 1 2 1 4 1]
  • Mostly small integers (rank 1, 2, 3)

  • Few large values → heavy tail
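Because the samples are random, the output changes on every run. A seeded generator makes the result reproducible, and the share of 1s can be checked against theory. This sketch assumes NumPy's newer Generator API (np.random.default_rng); the seed value is arbitrary.

```python
import numpy as np

a = 2.0
size = 10_000

rng = np.random.default_rng(seed=42)   # arbitrary seed, for reproducibility
samples = rng.zipf(a, size=size)

# For a = 2 the normalizing constant is zeta(2) = pi**2 / 6,
# so the theoretical P(X = 1) is 6 / pi**2
expected_p1 = 6 / np.pi ** 2
observed_p1 = np.mean(samples == 1)

print(f"expected P(X=1): {expected_p1:.4f}")
print(f"observed share : {observed_p1:.4f}")
print("largest sample :", samples.max())
```

The largest sample is usually far above the typical values, which is the heavy tail in practice.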


✅ 3. Visualize Zipf Distribution

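import matplotlib.pyplot as plt
import seaborn as sns

# Zipf samples can contain rare, very large values that would squeeze
# a 50-bin histogram into a single bar, so plot only values up to 20
sns.histplot(data[data <= 20], discrete=True, color='green')
plt.title("Zipf Distribution (a=2)")
plt.xlabel("Rank")
plt.ylabel("Frequency")
plt.show()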

  • Most counts occur at low ranks

  • Long tail toward higher ranks
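A power law looks like a straight line on log-log axes, with a slope close to -a. The sketch below counts how often each value occurs and fits that slope; value_counts and plot_loglog are hypothetical helper names, and the cutoff of 10 is an assumption chosen so that every count stays reasonably large.

```python
import numpy as np

def value_counts(samples, max_value=10):
    """Count how often each value 1..max_value occurs in the samples."""
    samples = np.asarray(samples)
    kept = samples[samples <= max_value]
    counts = np.bincount(kept, minlength=max_value + 1)[1:]  # drop the slot for 0
    values = np.arange(1, max_value + 1)
    return values, counts

def plot_loglog(values, counts):
    """Draw the counts on log-log axes (matplotlib is imported only here)."""
    import matplotlib.pyplot as plt
    mask = counts > 0
    plt.loglog(values[mask], counts[mask], marker="o", linestyle="none")
    plt.xlabel("Rank")
    plt.ylabel("Frequency")
    plt.title("Zipf samples on log-log axes")
    plt.show()

rng = np.random.default_rng(seed=1)    # arbitrary seed
samples = rng.zipf(2.0, size=10_000)
values, counts = value_counts(samples)

# Fit a straight line to log(count) against log(value)
mask = counts > 0
slope, intercept = np.polyfit(np.log(values[mask]), np.log(counts[mask]), 1)
print(f"fitted log-log slope: {slope:.2f}")
```

Calling plot_loglog(values, counts) draws the points; the fitted slope should land near -2 for a = 2, though it varies with the sample.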


✅ 4. Compare Different Parameters

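data1 = np.random.zipf(a=1.5, size=1000)
data2 = np.random.zipf(a=2.5, size=1000)

# Restrict both samples to values up to 20 so the bars stay comparable;
# with a=1.5 a few samples can reach the thousands or more
sns.histplot(data1[data1 <= 20], discrete=True, color='red', label='a=1.5', alpha=0.5)
sns.histplot(data2[data2 <= 20], discrete=True, color='blue', label='a=2.5', alpha=0.5)
plt.title("Zipf Distribution Comparison")
plt.xlabel("Rank")
plt.legend()
plt.show()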

  • Smaller a → heavier tail (large values occur more often)

  • Larger a → concentrated at low ranks
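The effect of a can also be checked numerically, without a plot. This sketch compares the share of 1s, the median, and the 99th percentile for the two exponents; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed

summary = {}
for a in (1.5, 2.5):
    samples = rng.zipf(a, size=10_000)
    summary[a] = {
        "share_of_ones": float(np.mean(samples == 1)),
        "median": float(np.median(samples)),
        "p99": float(np.percentile(samples, 99)),
    }

for a, stats in summary.items():
    print(f"a={a}: {stats}")
```

The 99th percentile for a = 1.5 should come out much larger than for a = 2.5, while the share of 1s is higher for a = 2.5.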


✅ 5. Summary Table

| Item | Parameters / Meaning | Description |
| np.random.zipf() | a, size | Generates Zipf random integers |
| a | Exponent parameter | Controls heaviness of tail (>1) |
| size | Number of samples | Output array size |

🎯 Practice Exercises

  1. Generate 1000 Zipf random numbers with a=2 and plot histogram.

  2. Compare Zipf distributions with a=1.5 vs a=3.

  3. Compute the proportion of samples that fall in the top 10 ranks (values 1 to 10) relative to the total.
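As a possible starting point for exercise 3, the share of samples in the top ranks can be computed with a boolean mask and compared with theory. The helper name top_share is hypothetical, and the theoretical value uses the fact that zeta(2) = pi**2 / 6 for a = 2.

```python
import numpy as np

def top_share(samples, top=10):
    """Fraction of samples whose value is at most `top`."""
    samples = np.asarray(samples)
    return float(np.mean(samples <= top))

rng = np.random.default_rng(seed=7)   # arbitrary seed
samples = rng.zipf(2.0, size=10_000)

observed = top_share(samples, top=10)

# Theory for a = 2: sum of 1/k^2 for k = 1..10, divided by zeta(2)
ranks = np.arange(1, 11)
expected = float(np.sum(1.0 / ranks ** 2) / (np.pi ** 2 / 6))

print(f"observed share in ranks 1-10: {observed:.4f}")
print(f"expected share in ranks 1-10: {expected:.4f}")
```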

CodeCapsule

Sanjit Sinha — Web Developer | PHP • Laravel • CodeIgniter • MySQL • Bootstrap Founder, CodeCapsule — Student projects & practical coding guides. Email: info@codecapsule.in • Website: CodeCapsule.in
