Entropy in Computer Science: From Shannon's Theory to Neural Networks
1. What is entropy and where did it come from in IT
The word “entropy” came into computer science from physics — and it’s no coincidence. In 1865, Rudolf Clausius introduced the term to describe the degree of “disorder” of a thermodynamic system. The greater the entropy, the greater the disorder, the less we know about the state of each individual particle.
In 1948, Claude Shannon — an American mathematician and engineer at Bell Labs — published the paper “A Mathematical Theory of Communication”, which laid the foundations of all modern information theory. Shannon consciously borrowed the term “entropy”: he noticed that his formula is mathematically identical to Boltzmann’s formula from physics. John von Neumann even advised him to use exactly this word — “because nobody knows what entropy really is, so in a debate you will always have the advantage.”
But in computer science, entropy has acquired a specific, measurable meaning: the amount of uncertainty (or information) in a message.
2. Shannon’s Information Entropy — the heart of the theory
2.1 Intuition
Imagine a coin. If the coin is fair (50/50), the result of a flip is maximally unpredictable — entropy is maximum. If the coin always lands on heads — there is no uncertainty, entropy is zero.
Now imagine an alphabet. In English, the letter e occurs ~13% of the time, and z — ~0.07%. If you see z, it’s a more “surprising” event — it carries more information.
Shannon formalized this principle: information from an event is inversely proportional to its probability.
2.2 Formula
For a discrete source with possible states with probabilities , Shannon’s entropy is defined as:
The unit of measurement is the bit (if the logarithm is base 2). Each term is the “contribution” of event to the total uncertainty.
A few edge cases:
- If for some single event, then (absolute certainty)
- If all are equal (), then — maximum
- Convention: (limit)
For example, for 8 equally probable symbols: bits — meaning we need exactly 3 bits to encode one symbol. Move the slider — and you will see how entropy reaches its maximum with equally probable events and drops to zero when one of them becomes dominant.
3. Entropy and Data Compression
This is the first and most important practical application of Shannon’s theory.
3.1 Source Coding Theorem
Shannon proved a fundamental theorem: no lossless compression algorithm can compress data below its entropy. This is a theoretical lower bound — the information “limit of compressibility”.
If a text has an entropy of 4 bits per symbol, you will never compress it to less than 4 bits per symbol without losing information — with no algorithm, no supercomputer.
3.2 Huffman Code
The Huffman algorithm (1952) is one of the most elegant ways to approach this limit. The idea: symbols with higher probability are assigned shorter codes, rare ones — longer.
Example for a text of four symbols:
| Symbol | Probability | Huffman Code | Length |
|---|---|---|---|
| A | 0.50 | 0 | 1 bit |
| B | 0.25 | 10 | 2 bits |
| C | 0.125 | 110 | 3 bits |
| D | 0.125 | 111 | 3 bits |
Average code length: bits. Entropy: bits.
The match is exact — Huffman achieved the theoretical minimum for this distribution.
In practice, modern algorithms — Deflate (PNG, ZIP), Brotli, Zstandard — combine several methods: first they find repeating strings (LZ77/LZ78), and then apply entropy coding (Huffman or arithmetic coding). Arithmetic coding approaches the theoretical limit even more accurately, but is more complex to implement.
3.3 Entropy of Different Data Types
| Data Type | Typical Entropy | Compression Ratio |
|---|---|---|
| Random bytes | ~8 bits/byte | ~1.0× (incompressible) |
| Executable file | ~6–7 bits/byte | ~1.3–1.5× |
| English text | ~4–5 bits/symbol | ~1.5–2× |
| HTML | ~3–4 bits/symbol | ~2–3× |
| Database (JSON) | ~2–3 bits/symbol | ~3–5× |
Why does an executable file compress poorly? Because machine code is constructed so that each byte carries maximum information. It is almost a uniform distribution of bytes — high entropy.
4. Entropy in Cryptography
Cryptography and information theory are inextricably linked — Shannon personally wrote the paper “Communication Theory of Secrecy Systems” in 1949.
4.1 Key Entropy
In cryptography, entropy means key unpredictability — and this is a matter of security, not academia.
A 128-bit key generated by a true random number generator (TRNG) has 128 bits of entropy. A brute-force attack requires an average of attempts — practically unfeasible.
But if a “128-bit” key is generated with a bad PRNG that effectively yields only 20 different values — it only has bits of entropy. Such a key is cracked in microseconds.
This is why proper entropy generation is one of the most important tasks in cryptography. Used for this are:
- Hardware sources: thermal noise, photon noise, radioactive decay
- In Linux:
/dev/randomand/dev/urandomcollect entropy from interrupts, network, mouse movement - CSPRNG (Cryptographically Secure PRNG): ChaCha20, Fortuna
4.2 Perfect Cipher and One-Time Pad
Shannon proved that the only theoretically unbreakable cipher is the one-time pad: the key is as long as the message, random, and used only once. Under these conditions, the ciphertext carries no information about the plaintext — the conditional entropy .
4.3 Password Entropy
A popular formula for evaluating passwords:
where is password length, is alphabet size. But this is an “upper” bound — if the password is not a truly random sequence, the real entropy is lower. Dictionary passwords like password123 have almost zero entropy in terms of security, even though they look “long”.
5. Entropy in Machine Learning
This is arguably the most active modern application of entropy concepts.
5.1 Cross-Entropy as a Loss Function
Cross-entropy () measures the difference between two probability distributions — the “true” and the “predicted” :
In classification tasks, the true distribution is a one-hot vector (1 for the correct class, 0 for the rest), and is the vector of predicted probabilities by the model. Then the formula simplifies:
That is: we penalize the model more, the lower the probability it assigned to the correct class. This is Binary Cross-Entropy (BCE) or Categorical Cross-Entropy (CCE) — standard loss functions for neural networks.
5.2 Kullback–Leibler Divergence (KL-divergence)
KL-divergence is the “distance” from distribution to (asymmetric!). It is widely used in:
- VAE (Variational Autoencoder): the regularization term “penalizes” the latent space for deviating from the standard normal distribution
- RL with Reinforcement (RLHF): PPO and derivative algorithms bound between the new and old policy to make training stable
- Knowledge Distillation: KL between “teacher” and “student” distributions
5.3 Entropy in Decision Trees
The classic ID3 algorithm and its descendants (C4.5, CART) use entropy to select a feature for splitting.
Information Gain for feature :
where is the entropy of the current dataset, is the subset of examples where feature has value .
The algorithm greedily chooses the feature with maximum — meaning the one that most reduces uncertainty in the data. This is a direct translation of the physical idea into a learning algorithm: we want each split to bring maximum “order” to the classification. Note: when a model gives the correct class only 10% probability — the loss explodes. This is why gradient descent so painfully punishes “overconfident mistakes”.
6. Entropy and Entropy Coding in Practice
6.1 ANS (Asymmetric Numeral Systems)
The modern de facto standard is the ANS algorithm, developed by Jarek Duda in 2007–2013. It combines the speed of Huffman tables with the accuracy of arithmetic coding.
ANS is used in:
- Zstandard (zstd) — Facebook, 2016, currently a standard in the Linux kernel
- Apple LZFSE — compression in macOS/iOS
- Facebook/Meta WebP-like formats
The idea of ANS: we encode a sequence of symbols as one large integer . Each new symbol “expands” by a number of bits equal to — exactly as much information as this symbol carries. The entropy efficiency is about 0.001 bits from the theoretical minimum.
6.2 Entropy in Network Protocols
HTTP/2 HPACK and HTTP/3 QPACK use entropy coding for HTTP headers. Headers like Content-Type: application/json repeat in every request — the algorithm builds a static table of the most common headers and encodes them with a single byte instead of dozens.
7. Thermodynamic Entropy and Landauer’s Principle
This is perhaps the most philosophical part — but with very real engineering consequences.
7.1 Landauer’s Principle (1961)
Rolf Landauer of IBM proved: erasing one bit of information physically requires expending at least of energy, where is the Boltzmann constant, is the temperature in kelvins.
At room temperature (300 K):
A modern transistor consumes about J per operation — millions of times more than Landauer’s limit. But we are approaching it: according to Moore’s law, transistor scaling has slowed down precisely due to thermal limits.
This means: even a perfect computer of the future is bounded by physics. Google DeepMind estimates show that if modern data centers approached Landauer’s limit, energy consumption would be reduced by times.
7.2 Reversible Computing
From this grows the idea of reversible computing: if we never erase bits (Toffoli gate, Fredkin gate), we can theoretically compute without wasting heat. It is precisely on this principle that quantum computers are built — quantum operations are unitary (reversible).
8. Entropy in Systems and Software Architecture
8.1 “Code Entropy”
In engineering culture, the term “entropy” is used metaphorically: codebase entropy is the degree of its disorder. Over time, without refactoring, code accumulates “technical debt”: coupling increases, coherence decreases, the number of edge cases grows.
The second law of thermodynamics in programming: a “closed system” (without refactoring) always moves towards greater disorder. Hence the task of architects is to design systems so that their “entropy” grows as slowly as possible.
8.2 Entropy in Distributed Systems
In the context of the CAP theorem and Eventual Consistency: the more “distributed” a system is (more nodes, larger network) — the greater the entropy of states. Consistency problems are essentially a fight against entropy: preserving an “ordered” global state in a system where parts change independently.
Gossip protocols (used in Cassandra, Redis Cluster) are entropy-resistant information dissemination algorithms: even with failures and delays, the system ultimately reaches a consistent state.
9. Common Table of Concepts---
10. Related Concepts to Know
Mutual Information
Measures how much information about is contained in . Used in feature selection, analyzing correlations in neural networks, and — in a broader context — for analyzing the “information flow” through transformer layers.
Differential Entropy
For continuous distributions, Shannon’s standard formula is replaced by:
The highest differential entropy among all distributions with fixed variance is the normal distribution — which is why it occurs so often in nature and in ML (Central Limit Theorem). VAEs use exactly the normal distribution for the latent space.
Algorithmic Complexity and Kolmogorov Entropy
Andrey Kolmogorov proposed another measure of complexity: the algorithmic complexity of a string is the length of the shortest program that generates . It is related to Shannon’s entropy but is a property of a specific string, not a distribution. The disadvantage is that it is uncomputable (halting problem), but it is theoretically important — particularly in studies of Minimum Description Length (MDL) and in PAC learning theory.
11. Practical Summary for a Developer
Entropy is not an abstraction, but a tool. Here is where it occurs in daily practice:
When working with data, you should know: if gzip compresses your JSON by only 5% — the data is already almost entropically saturated; if by 90% — there is high redundancy that can be eliminated at the schema level.
In security: never use Math.random() to generate tokens — in the browser it is not cryptographic. Use crypto.getRandomValues() (Web Crypto API) or secrets.token_bytes() in Python.
In ML: when your model outputs cross_entropy_loss → 0, check for overfitting — the model might have “memorized” answers rather than learned.
In decision trees: Information Gain and Gini Impurity are two different ways to measure the “disorder” of a subset. The first is based on Shannon’s entropy, the second is an approximation — faster, but less theoretically precise.
Entropy is one of those concepts that is simultaneously a fundamental law of nature and a concrete number that can be measured in code. From the minimum size of a ZIP archive to the loss function of your transformer — behind everything is the same Shannon formula, written in 1948.
Comments
Please ensure Giscus is configured with your correct Repository ID and Category ID at giscus.app to enable comments.