archive/friston/The_Free_Energy_Principle_Friston_2010.md

                                                                                                                                                                REVIEWS
                                                  The free-energy principle:  
                                                  a unified brain theory?
                                                  Karl Friston
                                                  Abstract | A free-energy principle has been proposed recently that accounts for action, 
                                                  perception and learning. This Review looks at some key brain theories in the biological (for 
                                                  example, neural Darwinism) and physical (for example, information theory and optimal 
                                                  control theory) sciences from the free-energy perspective. Crucially, one key theme runs 
                                                  through each of these theories — optimization. Furthermore, if we look closely at what is 
                                                  optimized, the same quantity keeps emerging, namely value (expected reward, expected 
                                                  utility) or its complement, surprise (prediction error, expected cost). This is the quantity that 
                                                  is optimized under the free-energy principle, which suggests that several global brain 
                                                  theories might be unified within a free-energy framework.
The Wellcome Trust Centre for Neuroimaging, University College London, Queen Square, London, WC1N 3BG, UK.
e-mail: k.friston@fil.ion.ucl.ac.uk
doi:10.1038/nrn2787
Published online 13 January 2010

Free energy
An information theory measure that bounds or limits (by being greater than) the surprise on sampling some data, given a generative model.

Homeostasis
The process whereby an open or closed system regulates its internal environment to maintain its states within bounds.

Entropy
The average surprise of outcomes sampled from a probability distribution or density. A density with low entropy means that, on average, the outcome is relatively predictable. Entropy is therefore a measure of uncertainty.

Despite the wealth of empirical data in neuroscience, there are relatively few global theories about how the brain works. A recently proposed free-energy principle for adaptive systems tries to provide a unified account of action, perception and learning. Although this principle has been portrayed as a unified brain theory1, its capacity to unify different perspectives on brain function has yet to be established. This Review attempts to place some key theories within the free-energy framework, in the hope of identifying common themes. I first review the free-energy principle and then deconstruct several global brain theories to show how they all speak to the same underlying idea.

The free-energy principle
The free-energy principle (BOX 1) says that any self-organizing system that is at equilibrium with its environment must minimize its free energy2. The principle is essentially a mathematical formulation of how adaptive systems (that is, biological agents, like animals or brains) resist a natural tendency to disorder3–6. What follows is a non-mathematical treatment of the motivation and implications of the principle. We will see that although the motivation is quite straightforward, the implications are complicated and diverse. This diversity allows the principle to account for many aspects of brain structure and function and lends it the potential to unify different perspectives on how the brain works. In subsequent sections, I discuss how the principle can be applied to neuronal systems as viewed from these perspectives. This Review starts in a rather abstract and technical way but then tries to unpack the basic idea in more familiar terms.

Motivation: resisting a tendency to disorder. The defining characteristic of biological systems is that they maintain their states and form in the face of a constantly changing environment3–6. From the point of view of the brain, the environment includes both the external and the internal milieu. This maintenance of order is seen at many levels and distinguishes biological from other self-organizing systems; indeed, the physiology of biological systems can be reduced almost entirely to their homeostasis7. More precisely, the repertoire of physiological and sensory states in which an organism can be is limited, and these states define the organism’s phenotype. Mathematically, this means that the probability of these (interoceptive and exteroceptive) sensory states must have low entropy; in other words, there is a high probability that a system will be in any of a small number of states, and a low probability that it will be in the remaining states. Entropy is also the average self information or ‘surprise’8 (more formally, it is the negative log-probability of an outcome). Here, ‘a fish out of water’ would be in a surprising state (both emotionally and mathematically). A fish that frequently forsook water would have high entropy. Note that both surprise and entropy depend on the agent: what is surprising for one agent (for example, being out of water) may not be surprising for another. Biological agents must therefore minimize the long-term average of surprise to ensure that their sensory entropy remains low. In other words, biological systems somehow manage to violate the fluctuation theorem, which generalizes the second law of thermodynamics9.
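
The relation between surprise and entropy described above can be made concrete with a small numerical sketch. The two "fish" and their state probabilities below are invented purely for illustration; they are not taken from the Review. Surprise is computed as the negative log-probability of a single outcome, and entropy as the average surprise over the distribution of states.

```python
import math

def surprise(p):
    """Self-information (in nats) of an outcome that has probability p."""
    return -math.log(p)

def entropy(dist):
    """Average surprise of outcomes drawn from a discrete distribution."""
    return sum(p * surprise(p) for p in dist.values() if p > 0)

# Hypothetical state probabilities for two imaginary fish (assumed values).
homebody = {"in water": 0.999, "out of water": 0.001}
wanderer = {"in water": 0.5, "out of water": 0.5}

for name, dist in [("homebody", homebody), ("wanderer", wanderer)]:
    print(name,
          "surprise(out of water) =", round(surprise(dist["out of water"]), 3),
          "entropy =", round(entropy(dist), 3))
```

Under these assumed numbers, being out of water is far more surprising for the fish that rarely leaves it, and the fish that frequently forsakes water has the higher entropy, which is the sense in which both quantities depend on the agent.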
               NATURE REVIEWS | NEUROSCIENCE                                                                                                     VOLUME 11 | FEBRUARY 2010 | 127
                                                                        © 2010 Macmillan Publishers Limited. All rights reserved
Surprise
(Surprisal or self information.) The negative log-probability of an outcome. An improbable outcome (for example, water flowing uphill) is therefore surprising.

Fluctuation theorem
(A term from statistical mechanics.) Deals with the probability that the entropy of a system that is far from the thermodynamic equilibrium will increase or decrease over a given amount of time. It states that the probability of the entropy decreasing becomes exponentially smaller with time.

Box 1 | The free-energy principle

Part a of the figure shows the dependencies among the quantities that define free energy. These include the internal states of the brain $\mu(t)$ and quantities describing its exchange with the environment: sensory signals (and their motion) $\tilde{s}(t) = [s, s', s'', \ldots]^T$ plus action $a(t)$. The environment is described by equations of motion, which specify the trajectory of its hidden states. The causes $\vartheta \supset \{\tilde{x}, \theta, \gamma\}$ of sensory input comprise hidden states $\tilde{x}(t)$, parameters $\theta$ and precisions $\gamma$ controlling the amplitude of the random fluctuations $\tilde{z}(t)$ and $\tilde{w}(t)$. Internal brain states and action minimize free energy $F(\tilde{s}, \mu)$, which is a function of sensory input and a probabilistic representation $q(\vartheta | \mu)$ of its causes. This representation is called the recognition density and is encoded by internal states $\mu$.

The free energy depends on two probability densities: the recognition density $q(\vartheta | \mu)$ and one that generates sensory samples and their causes, $p(\tilde{s}, \vartheta | m)$. The latter represents a probabilistic generative model (denoted by $m$), the form of which is entailed by the agent or brain. Part b of the figure provides alternative expressions for the free energy to show what its minimization entails: action can reduce free energy only by increasing accuracy (that is, selectively sampling data that are predicted). Conversely, optimizing brain states makes the representation an approximate conditional density on the causes of sensory input. This enables action to avoid surprising sensory encounters. A more formal description is provided below.

[Figure, part a: Environment and Agent]
Sensations: $\tilde{s} = g(\tilde{x}, \vartheta) + \tilde{z}$
External states: $\dot{\tilde{x}} = f(\tilde{x}, a, \vartheta) + \tilde{w}$
Internal states: $\mu = \arg\min F(\tilde{s}, \mu)$
Action or control signals: $a = \arg\min F(\tilde{s}, \mu)$

[Figure, part b]
Free-energy bound on surprise: $F = -\langle \ln p(\tilde{s}, \vartheta | m) \rangle_q + \langle \ln q(\vartheta | \mu) \rangle_q$
Action minimizes prediction errors: $F = D(q(\vartheta | \mu) \,\|\, p(\vartheta)) - \langle \ln p(\tilde{s}(a) | \vartheta, m) \rangle_q$, with $a = \arg\max \text{Accuracy}$
Perception optimizes predictions: $F = D(q(\vartheta | \mu) \,\|\, p(\vartheta | \tilde{s})) - \ln p(\tilde{s} | m)$, with $\mu = \arg\max \text{Divergence}$
(Figure credit: Nature Reviews | Neuroscience)

Optimizing the sufficient statistics (representations)
Optimizing the recognition density makes it a posterior or conditional density on the causes of sensory data: this can be seen by expressing the free energy as surprise $-\ln p(\tilde{s} | m)$ plus a Kullback-Leibler divergence between the recognition and conditional densities (encoded by the ‘internal states’ in the figure). Because this difference is always positive, minimizing free energy makes the recognition density an approximate posterior probability. This means the agent implicitly infers or represents the causes of its sensory samples in a Bayes-optimal fashion. At the same time, the free energy becomes a tight bound on surprise, which is minimized through action.

Optimizing action
Acting on the environment by minimizing free energy enforces a sampling of sensory data that is consistent with the current representation. This can be seen with a second rearrangement of the free energy as a mixture of accuracy and complexity. Crucially, action can only affect accuracy (encoded by the ‘external states’ in the figure). This means that the brain will reconfigure its sensory epithelia to sample inputs that are predicted by the recognition density; in other words, to minimize prediction error.
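
The rearrangements of free energy in Box 1 can be checked numerically on a toy problem. The sketch below is only an illustration under strong simplifying assumptions: a single sensory sample, two discrete causes, and made-up prior and likelihood values, none of which come from the Review. The function name `free_energy_forms` is likewise hypothetical. It evaluates free energy as energy minus entropy, as divergence plus surprise, and as complexity minus accuracy.

```python
import math

def kl(q, p):
    """Kullback-Leibler divergence D(q || p) for discrete densities, in nats."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def free_energy_forms(q, prior, lik):
    """Compute free energy three ways for one fixed sensory sample.

    q     : recognition density over the causes
    prior : prior density over the causes, p(cause | m)
    lik   : likelihood p(s | cause, m) evaluated at the observed sample s
    """
    joint = [pr * l for pr, l in zip(prior, lik)]   # p(s, cause | m)
    evidence = sum(joint)                           # p(s | m)
    posterior = [j / evidence for j in joint]       # p(cause | s, m)
    surprise = -math.log(evidence)

    energy = -sum(qi * math.log(j) for qi, j in zip(q, joint) if qi > 0)
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
    accuracy = sum(qi * math.log(l) for qi, l in zip(q, lik) if qi > 0)

    forms = {
        "energy - entropy": energy - entropy,
        "divergence + surprise": kl(q, posterior) + surprise,
        "complexity - accuracy": kl(q, prior) - accuracy,
    }
    return forms, surprise, posterior

# Assumed toy numbers: two causes, one observed sample.
prior = [0.8, 0.2]
lik = [0.1, 0.7]

forms, surprise, posterior = free_energy_forms([0.5, 0.5], prior, lik)
print("free energy:", forms)
print("surprise:   ", surprise)

# Setting the recognition density to the posterior removes the divergence term.
tight, _, _ = free_energy_forms(posterior, prior, lik)
print("free energy at the posterior:", tight["divergence + surprise"])
```

In this discrete setting the three expressions agree up to floating-point error, none falls below the surprise, and the gap closes when the recognition density equals the posterior. The continuous, dynamic case treated in the Review is more involved; this is a sanity check on the algebra rather than a model of a brain.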
Attractor
A set to which a dynamical system evolves after a long enough time. Points that get close to the attractor remain close, even under small perturbations.

Kullback-Leibler divergence
(Or information divergence, information gain or cross entropy.) A non-commutative measure of the non-negative difference between two probability distributions.

Recognition density
(Or ‘approximating conditional density’.) An approximate probability distribution of the causes of data (for example, sensory input). It is the product of inference or inverting a generative model.

In short, the long-term (distal) imperative of maintaining states within physiological bounds translates into a short-term (proximal) avoidance of surprise. Surprise here relates not just to the current state, which cannot be changed, but also to movement from one state to another, which can change. This motion can be complicated and itinerant (wandering) provided that it revisits a small set of states, called a global random attractor10, that are compatible with survival (for example, driving a car within a small margin of error). It is this motion that the free-energy principle optimizes.

So far, all we have said is that biological agents must avoid surprises to ensure that their states remain within physiological bounds (see Supplementary information S1 (box) for a more formal argument). But how do they do this? A system cannot know whether its sensations are surprising and could not avoid them even if it did know. This is where free energy comes in: free energy is an upper bound on surprise, which means that if agents minimize free energy, they implicitly minimize surprise.

Crucially, free energy can be evaluated because it is a function of two things to which the agent has access: its sensory states and a recognition density that is encoded by its internal states (for example, neuronal activity and connection strengths). The recognition density is a probabilistic representation of what caused a particular sensation.

This (variational) free-energy construct was introduced into statistical physics to convert difficult probability-density integration problems into easier optimization problems11. It is an information theoretic quantity (like surprise), as opposed to a thermodynamic quantity. Variational free energy has been exploited in machine learning and statistics to solve many inference and learning problems12–14. In this setting, surprise is called the (negative) model evidence. This means that minimizing surprise is the same as maximizing the sensory evidence for an agent’s existence, if we regard the agent as a model of its world. In the present context, free energy provides the answer to
                                                      a fundamental question: how do self-organizing adap-                               In summary, the free energy rests on a model of how 
                                                      tive systems avoid surprising states? They can do this by                      sensory data are generated and on a recognition density 
                                                       minimizing their free energy. So what does this involve?                       on the model’s parameters (that is, sensory causes). Free 
                                                                                                                                     energy can be reduced only by changing the recognition 
                                                      Implications: action and perception. Agents can                                density to change conditional expectations about what is 
                                                      suppress free energy by changing the two things it depends                     sampled or by changing sensory samples (that is, sensory 
                Generative model                      on: they can change sensory input by acting on the world                       input) so that they conform to expectations. In what fol-
                A probabilistic model (joint          or they can change their recognition density by chang-                         lows, I consider these implications in light of some key 
                density) of the dependencies          ing their internal states. This distinction maps nicely                        theories about the brain.
                 between causes and                    onto action and perception (BOX 1). One can see what this 
                consequences (data), from             means in more detail by considering three mathematically                       The Bayesian brain hypothesis
                which samples can be 
                 generated. It is usually
                                                       equivalent formulations of free energy (see Supplementary                      The Bayesian brain hypothesis17 uses Bayesian probability 
                 specified in terms of the             information S2 (box) for a mathematical treatment).                            theory to formulate perception as a constructive process 
                likelihood of data, given their           The first formulation expresses free energy as energy                      based on internal or generative models. The underlying 
                 causes (parameters of a model)
                 and priors on the causes.             minus entropy. This formulation is important for three                         idea is that the brain has a model of the world18–22 that 
                                                       reasons. First, it connects the concept of free energy as                      it tries to optimize using sensory inputs23–28. This idea is 
                 Conditional density
                                                       used in information theory with concepts used in statistical                   related to analysis by synthesis20 and epistemological automata19. 
                 (Or posterior density.) The
                 probability distribution of           thermodynamics. Second, it shows that the free                                 In this view, the brain is an inference machine that 
                 causes or model parameters,           energy can be evaluated by an agent because the energy                         actively predicts and explains its sensations18,22,25. Central 
                given some data; that is, a           is the surprise about the joint occurrence of sensations                       to this hypothesis is a probabilistic model that can gener-
                probabilistic mapping from            and their perceived causes, whereas the entropy is sim-                        ate predictions, against which sensory samples are tested 
                observed data to causes.              ply that of the agent’s own recognition density. Third, it                     to update beliefs about their causes. This generative 
                Prior                                 shows that free energy rests on a generative model of the                      model is decomposed into a likelihood (the probability of 
                The probability distribution or       world, which is expressed in terms of the probability of a                     sensory data, given their causes) and a prior (the a priori 
                density of the causes of data         sensation and its causes occurring together. This means                        probability of those causes). Perception then becomes the 
                that encodes beliefs about            that an agent must have an implicit generative model of                        process of inverting the likelihood model (mapping from 
                those causes before observing         how causes conspire to produce sensory data. It is this                        causes to sensations) to access the posterior probability of 
                the data.                             model that defines both the nature of the agent and the                        the causes, given sensory data (mapping from sensations 
                Bayesian surprise                     quality of the free-energy bound on surprise.                                  to causes). This inversion is the same as minimizing the 
                A measure of salience based               The second formulation expresses free energy as                            difference between the recognition and posterior densi-
                on the Kullback-Leibler               surprise plus a divergence term. The (perceptual) diver-                       ties to suppress free energy. Indeed, the free-energy for-
                divergence between the                gence is just the difference between the recognition den-                      mulation was developed to finesse the difficult problem 
                recognition density (which            sity and the conditional density (or posterior density) of the                 of exact inference by converting it into an easier optimi-
                encodes posterior beliefs) and 
                the prior density. It                 causes of a sensation, given the sensory signals. This con-                    zation problem11–14. This has furnished some powerful 
                measures the information that         ditional density represents the best possible guess about                      approximation techniques for model identification and 
                can be recognized in the data.        the true causes. The difference between the two densities                      comparison (for example, variational Bayes or ensemble 
                Bayesian brain hypothesis             is always non-negative and free energy is therefore an                         learning29). There are many interesting issues that attend 
                The idea that the brain uses          upper bound on surprise. Thus, minimizing free energy                          the Bayesian brain hypothesis, which can be illuminated 
                internal probabilistic                by changing the recognition density (without changing                          by the free-energy principle; we will focus on two.
                (generative) models to update         sensory data) reduces the perceptual divergence, so that                           The first is the form of the generative model and 
                 posterior beliefs, using sensory      the recognition density becomes the conditional density                        how it manifests in the brain. One criticism of Bayesian 
                information, in an                    and the free energy becomes surprise.                                          treatments is that they ignore the question of how prior 
                (approximately) Bayes-optimal 
                fashion.                                  The third formulation expresses free energy as com-                        beliefs, which are necessary for inference, are formed27. 
                Analysis by synthesis                 plexity minus accuracy, using terms from the model                             However, this criticism dissolves with hierarchical 
                Any strategy (in speech coding)       comparison literature. Complexity is the difference                            generative models, in which the priors themselves are 
                in which the parameters of a          between the recognition density and the prior density                          optimized26,28. In hierarchical models, causes in one 
                signal coder are evaluated by         on causes; it is also known as Bayesian surprise15 and is the                  level generate subordinate causes in a lower level; 
                decoding (synthesizing) the           difference between the prior density, which encodes                            sensory data per se are generated at the lowest level (BOX 2). 
                signal and comparing it with          beliefs about the state of the world before sensory data are                   Minimizing the free energy effectively optimizes 
                the original input signal.            assimilated, and posterior beliefs, which are encoded                          empirical priors (that is, the probability of causes at one level, 
                Epistemological automata              by the recognition density. Accuracy is simply the sur-                        given those in the level above). Crucially, because empir-
                Possibly the first theory for why     prise about sensations that are expected under the recog-                      ical priors are linked hierarchically, they are informed 
                top-down influences (mediated         nition density. This formulation shows that minimizing                         by sensory data, enabling the brain to optimize its prior 
                by backward connections in            free energy by changing sensory data (without changing                         expectations online. This optimization makes every level 
                the brain) might be important         the recognition density) must increase the accuracy of                         in the hierarchy accountable to the others, furnishing an 
                in perception and cognition.          an agent’s predictions. In short, the agent will selectively                   internally consistent representation of sensory causes at 
                Empirical prior                       sample the sensory inputs that it expects. This is known                       multiple levels of description. Not only do hierarchical 
                A prior induced by hierarchical 
                models; empirical priors              as active inference16. An intuitive example of this process                    models have a key role in statistics (for example, 
                                                      (when it is raised into consciousness) would be feeling                        random effects and parametric empirical Bayes models30,31), 
                provide constraints on the            our way in darkness: we anticipate what we might touch                         they may also be used by the brain, given the hierarchical 
                recognition density in the usual 
                way but depend on the data.           next and then try to confirm those expectations.                               arrangement of cortical sensory areas32–34.
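
The formulations of free energy described above (surprise plus a divergence, and complexity minus accuracy) can be checked numerically on a toy problem. The following is a minimal sketch, assuming a discrete set of three causes and a single fixed observation; the prior, likelihood and recognition density values are arbitrary illustrative numbers, and the function names `kl` and `free_energy_terms` are hypothetical helpers rather than anything defined in this Review.

```python
import numpy as np


def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) for strictly positive discrete densities."""
    return float(np.sum(q * (np.log(q) - np.log(p))))


def free_energy_terms(q, prior, likelihood):
    """Free energy of a recognition density q for one fixed observation s.

    prior[k]      = p(cause = k)
    likelihood[k] = p(s | cause = k) for the observed s
    """
    joint = prior * likelihood              # p(s, cause)
    evidence = joint.sum()                  # p(s)
    posterior = joint / evidence            # conditional density p(cause | s)

    surprise = float(-np.log(evidence))     # -ln p(s)
    divergence = kl(q, posterior)           # recognition vs conditional density
    complexity = kl(q, prior)               # recognition vs prior density
    accuracy = float(np.sum(q * np.log(likelihood)))  # expected log-likelihood under q

    return {
        "surprise": surprise,
        "divergence": divergence,
        "F_bound": surprise + divergence,   # surprise plus divergence
        "F_model": complexity - accuracy,   # complexity minus accuracy
        "posterior": posterior,
    }


# Illustrative numbers only (assumptions, not taken from the Review).
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.1, 0.6, 0.3])
q = np.array([0.4, 0.4, 0.2])

terms = free_energy_terms(q, prior, likelihood)
at_posterior = free_energy_terms(terms["posterior"], prior, likelihood)

print("surprise:            ", terms["surprise"])
print("F (arbitrary q):     ", terms["F_bound"], terms["F_model"])
print("F (q = conditional): ", at_posterior["F_model"])
```

Under these assumptions the two expressions for free energy agree, the free energy is never smaller than the surprise, and the bound becomes tight when the recognition density equals the conditional density.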
                NATuRE REvIEWs | NeuroscieNce                                                                                                                     voluME 11 | FEBRuARy 2010 | 129
The second issue is the form of the recognition density that is encoded by physical attributes of the brain, such as synaptic activity, efficacy and gain. In general, any density is encoded by its sufficient statistics (for example, the mean and variance of a Gaussian form). The way the brain encodes these statistics places important constraints on the sorts of schemes that underlie recognition: they range from free-form schemes (for example, particle filtering26 and probabilistic population codes35–38), which use a vast number of sufficient statistics, to simpler forms, which make stronger assumptions about the shape of the recognition density, so that it can be encoded with a small number of sufficient statistics. The simplest assumed form is Gaussian, which requires only the conditional mean or expectation; this is known as the Laplace assumption39, under which the free energy is just the difference between the model's predictions and the sensations or representations that are predicted. Minimizing free energy then corresponds to explaining away prediction errors. This is known as predictive coding and has become a popular framework for understanding neuronal message passing among different levels of cortical hierarchies40. In this scheme, prediction error units compare conditional expectations with top-down predictions to elaborate a prediction error. This prediction error is passed forward to drive the units in the level above that encode conditional expectations which optimize top-down predictions to explain away (reduce) prediction error in the level below. Here, explaining away just means countering excitatory bottom-up inputs to a prediction error neuron with inhibitory synaptic inputs that are driven by top-down predictions (see BOX 2 and REFS 41,42 for detailed discussion). The reciprocal exchange of bottom-up prediction errors and top-down predictions proceeds until prediction error is minimized at all levels and conditional expectations are optimized. This scheme has been invoked to explain many features of early visual responses40,43 and provides a plausible account of repetition suppression and mismatch responses in electrophysiology44. FIGURE 1 provides an example of perceptual categorization that uses this scheme.

Message passing of this sort is consistent with functional asymmetries in real cortical hierarchies45, where forward connections (which convey prediction errors) are driving and backwards connections (which model the nonlinear generation of sensory input) have both driving and modulatory characteristics46. This asymmetrical message passing is also a characteristic feature of adaptive resonance theory47,48, which has formal similarities to predictive coding.

Box 2 | Hierarchical message passing in the brain

[Figure: schematic of message passing between lower and higher cortical areas. Sensory input $\tilde{s}(t)$ enters at the lowest level. Forward connections convey prediction error ($\xi_v^{(1)}$, $\xi_x^{(1)}$, $\xi_v^{(2)}$, $\xi_x^{(2)}$, $\xi_v^{(3)}$); backward connections convey predictions from the state units ($\mu_v^{(1)}$, $\mu_x^{(1)}$, $\mu_v^{(2)}$, $\mu_x^{(2)}$). Side panels: synaptic plasticity and synaptic gain.]

Prediction errors (upper equations in the figure):

$$
\begin{aligned}
\xi_v^{(i)} &= \Pi_v^{(i)} \varepsilon_v^{(i)} = \Pi_v^{(i)} \left( \mu_v^{(i-1)} - g(\mu^{(i)}) \right) \\
\xi_x^{(i)} &= \Pi_x^{(i)} \varepsilon_x^{(i)} = \Pi_x^{(i)} \left( D\mu_x^{(i)} - f(\mu^{(i)}) \right)
\end{aligned}
$$

Recognition dynamics (lower equations in the figure):

$$
\begin{aligned}
\dot{\mu}_v^{(i)} &= D\mu_v^{(i)} - \left( \partial_v \varepsilon^{(i)} \right)^{T} \xi^{(i)} - \xi_v^{(i+1)} \\
\dot{\mu}_x^{(i)} &= D\mu_x^{(i)} - \left( \partial_x \varepsilon^{(i)} \right)^{T} \xi^{(i)}
\end{aligned}
$$

Synaptic plasticity:

$$
\ddot{\mu}_{\theta_{ij}} = -\partial_{\theta_{ij}} \varepsilon^{T} \xi
$$

Synaptic gain:

$$
\ddot{\mu}_{\gamma_i} = \tfrac{1}{2}\,\mathrm{tr}\left( \partial_{\gamma_i} \Pi \left( \xi \xi^{T} - \Pi(\mu_\gamma) \right) \right)
$$

The figure details a neuronal architecture that optimizes the conditional expectations of causes in hierarchical models of sensory input. It shows the putative cells of origin of forward driving connections that convey prediction error (grey arrows) from a lower area (for example, the lateral geniculate nucleus) to a higher area (for example, V1), and nonlinear backward connections (black arrows) that construct predictions41. These predictions try to explain away prediction error in lower levels. In this scheme, the sources of forward and backward connections are superficial and deep pyramidal cells (upper and lower triangles), respectively, where state units are black and error units are grey. The equations represent a gradient descent on free energy using the generative model below. The two upper equations describe the formation of prediction error encoded by error units, and the two lower equations represent recognition dynamics, using a gradient descent on free energy.

Generative models in the brain

To evaluate free energy one needs a generative model of how the sensorium is caused. Such models $p(\tilde{s}, \vartheta) = p(\tilde{s} \mid \vartheta)\, p(\vartheta)$ combine the likelihood $p(\tilde{s} \mid \vartheta)$ of getting some data given their causes and the prior beliefs about these causes, $p(\vartheta)$. The brain has to explain complicated dynamics on continuous states with hierarchical or deep causal structure and may use models with the following form

$$
\begin{aligned}
\tilde{s} &= g^{(1)}(x^{(1)}, v^{(1)}, \theta^{(1)}) + z^{(1)} \\
\dot{x}^{(1)} &= f^{(1)}(x^{(1)}, v^{(1)}, \theta^{(1)}) + w^{(1)} \\
&\;\;\vdots \\
v^{(i-1)} &= g^{(i)}(x^{(i)}, v^{(i)}, \theta^{(i)}) + z^{(i)} \\
\dot{x}^{(i)} &= f^{(i)}(x^{(i)}, v^{(i)}, \theta^{(i)}) + w^{(i)} \\
&\;\;\vdots
\end{aligned}
$$

Here, $g^{(i)}$ and $f^{(i)}$ are continuous nonlinear functions of (hidden and causal) states, with parameters $\theta^{(i)}$. The random fluctuations $z(t)^{(i)}$ and $w(t)^{(i)}$ play the part of observation noise at the sensory level and state noise at higher levels. Causal states $v(t)^{(i)}$ link hierarchical levels, where the output of one level provides input to the next. Hidden states $x(t)^{(i)}$ link dynamics over time and endow the model with memory. Gaussian assumptions about the random fluctuations specify the likelihood and Gaussian assumptions about state noise furnish empirical priors in terms of predicted motion. These assumptions are encoded by their precision (or inverse variance), $\Pi^{(i)}(\gamma)$, which are functions of precision parameters $\gamma$.

Recognition dynamics and prediction error

If we assume that neuronal activity encodes the conditional expectation of states, then recognition can be formulated as a gradient descent on free energy. Under Gaussian assumptions, these recognition dynamics can be expressed compactly in terms
                                                                                                                                                                 (i)             (i)      (i)
                                          of precision-weighted prediction errors ξ  =  П (ε)  on the causal states and motion of                                                                                                                                                                                                In summary, the theme underlying the Bayesian brain 
                                          hidden states. The ensuing equations (see the figure) suggest two neuronal populations                                                                                                                                                                                       and predictive coding is that the brain is an inference 
                                          that exchange messages: causal or hidden-state units encoding expected states and                                                                                                                                                                                            engine that is trying to optimize probabilistic representa-
                                          error units encoding prediction error. Under hierarchical models, error units receive                                                                                                                                                                                        tions of what caused its sensory input. This optimization 
                                          messages from the state units in the same level and the level above, whereas state units 
                                          are driven by error units in the same level and the level below. These provide bottom-up                                                                                                                                                                                     can be finessed using a (variational free-energy) bound 
                                                                                                                                                                                  (i)
                                          messages that drive conditional expectations μ  towards better predictions, which                                                                                                                                                                                            on surprise. In short, the free-energy principle entails 
                                                                                                                                                                                                                                                                        (i)                       (i)
                                          explain away prediction error. These top-down predictions correspond to g(μ ) and f(μ ).                                                                                                                                                                                     the Bayesian brain hypothesis and can be implemented 
                                          This scheme suggests that the only connections that link levels are forward connections                                                                                                                                                                                      by the many schemes considered in this field. Almost 
                                          conveying prediction error to state units and reciprocal backward connections that                                                                                                                                                                                           invariably, these involve some form of message passing 
                                          mediate predictions. See REFS 42,130 for details. Figure is modified from REF. 42.                                                                                                                                                                                           or belief propagation among brain areas or units. This 
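As a minimal sketch of the recognition dynamics described in this box, the following Python reduces the scheme to a single level with one static cause and a linear mapping g(v) = θv. This is a simplification, not the hierarchical dynamic model of REFS 42,130, and the function names, parameter values and integration rate are illustrative assumptions. Under Gaussian assumptions, free energy is (up to constants) half the sum of precision-weighted squared prediction errors, so a gradient descent drives the expectation μ with a bottom-up sensory error and a prior error.

```python
# Hypothetical single-level example: s = theta * v + noise, with a Gaussian prior on v.
# All names and default values are assumptions made for illustration.

def recognise(s, theta=2.0, eta=0.0, pi_s=4.0, pi_v=1.0, rate=0.05, steps=500):
    """Gradient descent on F = 0.5*pi_s*(s - theta*mu)**2 + 0.5*pi_v*(mu - eta)**2."""
    mu = eta  # start the conditional expectation at the prior expectation
    for _ in range(steps):
        xi_s = pi_s * (s - theta * mu)  # precision-weighted sensory prediction error
        xi_v = pi_v * (mu - eta)        # precision-weighted prior prediction error
        # -dF/dmu: the sensory error drives mu through g'(mu) = theta,
        # while the prior error pulls it back towards eta
        mu += rate * (theta * xi_s - xi_v)
    return mu


def posterior_mean(s, theta=2.0, eta=0.0, pi_s=4.0, pi_v=1.0):
    """Exact posterior mean for the same linear-Gaussian model, for comparison."""
    return (theta * pi_s * s + pi_v * eta) / (theta ** 2 * pi_s + pi_v)


if __name__ == "__main__":
    print(recognise(3.0), posterior_mean(3.0))
```

In this linear case the fixed point of the descent coincides with the exact posterior mean, which offers a simple check that the error-driven updates perform Bayesian inference.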
allows us to connect the free-energy principle to another principled approach to sensory processing, namely information theory.

The principle of efficient coding

The principle of efficient coding suggests that the brain optimizes the mutual information (that is, the mutual predictability) between the sensorium and its internal representation, under constraints on the efficiency of those representations. This line of thinking was articulated by Barlow49 in terms of a redundancy reduction principle (or principle of efficient coding) and formalized later in terms of the infomax principle50. It has been applied in machine learning51, leading to methods like independent component analysis52, and in neurobiology, contributing to an understanding of the nature of neuronal responses53–56. This principle is extremely effective in predicting the empirical characteristics of classical receptive fields53 and provides a principled explanation for sparse coding55 and the segregation of processing streams in visual hierarchies57. It has been extended to cover dynamics and motion trajectories58,59 and even used to infer the metabolic constraints on neuronal processing60.

At its simplest, the infomax principle says that neuronal activity should encode sensory information in an efficient and parsimonious fashion. It considers the mapping between one set of variables (sensory states) and another (variables representing those states). At first glance, this seems to preclude a probabilistic representation, because this would involve mapping between sensory states and a probability density. However, the infomax principle can be applied to the sufficient statistics of a recognition density. In this context, the infomax principle becomes a special case of the free-energy principle, which arises when we ignore uncertainty in probabilistic representations (and when there is no action; see Supplementary information S3 (box) for mathematical details). This is easy to see by noting that sensory signals are generated by causes. This means that it is sufficient to represent the causes to predict these signals. More formally, the infomax principle can be understood in terms of the decomposition of free energy into complexity and accuracy: mutual information is optimized when conditional expectations maximize accuracy (or minimize prediction error), and efficiency is assured by minimizing complexity. This ensures that no excessive parameters are applied in the generative model and leads to a parsimonious representation of sensory data that conforms to prior constraints on their causes. Interestingly, advanced model-optimization techniques use free-energy optimization to eliminate redundant model parameters61, suggesting that free-energy optimization might provide a nice explanation for the synaptic pruning and homeostasis that take place in the brain during neurodevelopment62 and sleep63.

The infomax principle pertains to a forward mapping from sensory input to representations. How does this square with optimizing generative models, which map from causes to sensory inputs? These perspectives can be reconciled by noting that all recognition schemes based

[Figure 1. Panel a, Perceptual inference: vocal centre, syrinx and sonogram. Panel b, Perceptual categorization: sonograms of songs a, b and c, frequency (Hz) against time (s). Panel c: estimated causes against time (s) (left) and the conditional density over $(v_1, v_2)$ (right).]

Equations shown in panel a:

$$v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}, \qquad \dot{x} = f(x, v) = \begin{bmatrix} 18x_2 - 18x_1 \\ v_1 x_1 - 2x_3 x_1 - x_2 \\ 2x_1 x_2 - v_2 x_3 \end{bmatrix}$$

Figure 1 | Birdsongs and perceptual categorization. a | The generative model of birdsong used in this simulation comprises a Lorenz attractor with two control parameters (or causal states) $(v_1, v_2)$, which, in turn, delivers two control parameters (not shown) to a synthetic syrinx to produce ‘chirps’ that were modulated in amplitude and frequency (an example is shown as a sonogram). The chirps were then presented as a stimulus to a synthetic bird to see whether it could infer the underlying causal states and thereby categorize the song. This entails minimizing free energy by changing the internal representation $(\mu_{v_1}, \mu_{v_2})$ of the control parameters. Examples of this perceptual inference or categorization are shown below. b | Three simulated songs are shown in sonogram format. Each comprises a series of chirps, the frequency and number of which fall progressively from song a to song c, as a causal state (known as the Rayleigh number; $v_1$ in part a) is decreased. c | The graph on the left depicts the conditional expectations $(\mu_{v_1}, \mu_{v_2})$ of the causal states, shown as a function of peristimulus time for the three songs. It shows that the causes are identified after around 600 ms with high conditional precision (90% confidence intervals are shown in grey). The graph on the right shows the conditional density on the causes shortly before the end of the peristimulus time (that is, the dotted line in the left panel). The blue dots correspond to conditional expectations and the grey areas correspond to the 90% conditional confidence regions. Note that these encompass the true values (red dots) of $(v_1, v_2)$ that were used to generate the songs. These results illustrate the nature of perceptual categorization under the inference scheme in BOX 2: here, recognition corresponds to mapping from a continuously changing and chaotic sensory input to a fixed point in perceptual space. Figure is reproduced, with permission, from REF. 130 © (2009) Elsevier.
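The flow f(x, v) in Figure 1a can be integrated numerically to see how the causal states shape the hidden trajectory. The sketch below uses a fourth-order Runge-Kutta step. The integrator, step size, initial state and the values chosen for (v1, v2) are assumptions for illustration and are not taken from the original simulation; the synthetic syrinx that turns the hidden states into chirps is omitted.

```python
import math


def lorenz_flow(x, v):
    """Flow f(x, v) from Figure 1a: hidden states x, causal states v = (v1, v2)."""
    x1, x2, x3 = x
    v1, v2 = v
    return (
        18.0 * x2 - 18.0 * x1,
        v1 * x1 - 2.0 * x3 * x1 - x2,
        2.0 * x1 * x2 - v2 * x3,
    )


def rk4_step(x, v, dt):
    """One classical Runge-Kutta step (an assumed integrator, chosen for simplicity)."""
    def shifted(k, h):
        return tuple(xi + h * ki for xi, ki in zip(x, k))

    k1 = lorenz_flow(x, v)
    k2 = lorenz_flow(shifted(k1, dt / 2.0), v)
    k3 = lorenz_flow(shifted(k2, dt / 2.0), v)
    k4 = lorenz_flow(shifted(k3, dt), v)
    return tuple(
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    )


def simulate(v, x0=(1.0, 1.0, 1.0), dt=0.002, steps=5000):
    """Return the list of hidden states visited from x0 under fixed causes v."""
    path = [tuple(x0)]
    for _ in range(steps):
        path.append(rk4_step(path[-1], v, dt))
    return path


if __name__ == "__main__":
    causes = (30.0, 2.5)  # hypothetical values for (v1, v2)
    print(simulate(causes)[-1])
```

Lowering v1 in this sketch changes the character of the trajectory, which is the kind of dependence on causal states that the caption describes for songs a to c.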
                                                       on infomax can be cast as optimizing the parameters of a                         by synaptic efficacy (these are μ_θ in BOX 2) have to be 
                                                       generative model64. For example, in sparse coding models55,                      optimized. This corresponds to optimizing connection 
                                                       the implicit priors posit independent causes that                                strengths in the brain (that is, the plasticity that underlies 
                                                       are sampled from a heavy-tailed or sparse distribution42.                        learning). So what form would this learning take? It 
                                                       The fact that these models predict empirically observed                          transpires that a gradient descent on free energy (that is, 
                                                       receptive fields so well suggests that we are endowed                            changing connections to reduce free energy) is formally 
                                                       with (or acquire) prior expectations that the causes of                          identical to Hebbian plasticity28,42 (BOX 2). This is because 
                                                       our sensations are largely independent and sparse.                               the parameters of the generative model determine how 
                                                           In summary, the principle of efficient coding says                           expected states (synaptic activity) are mixed to form pre-
                                                       that the brain should optimize the mutual information                            dictions. Put simply, when the presynaptic predictions 
                                                       between its sensory signals and some parsimonious                                and postsynaptic prediction errors are highly correlated, 
                                                       neuronal representations. This is the same as optimizing                         the connection strength increases, so that predictions 
                                                       the parameters of a generative model to maximize the                             can suppress prediction errors more efficiently.
                                                       accuracy of predictions, under complexity constraints.                               In short, the formation of cell assemblies reflects the 
                                                       Both are mandated by the free-energy principle, which                            encoding of causal regularities. This is just a restate-
                                                       can be regarded as a probabilistic generalization of the                         ment of cell assembly theory in the context of a specific 
                 Sufficient statistics                 infomax principle. We now turn to more biologically                              implementation (predictive coding) of the free-energy 
                 Quantities that are sufficient to     inspired ideas about brain function that focus on neu-                           principle. It should be acknowledged that the learning 
                 parameterize a probability            ronal dynamics and plasticity. This takes us deeper into                         rule in predictive coding is really a delta rule, which 
                 density (for example, mean and        neurobiological mechanisms and the implementation of                             rests on Hebbian mechanisms; however, Hebb’s wider 
                 covariance of a Gaussian              the theoretical principles outlined above.                                       notions of cell assemblies were formulated from a non-
                 density).                                                                                                              statistical perspective. Modern reformulations suggest 
                 Laplace assumption                    The cell assembly and correlation theory                                         that both inference on states (that is, perception) and 
                                                                                                                           65
                 (Or Laplace approximation or          The cell assembly theory was proposed by Hebb  and                               inference on parameters (that is, learning) minimize 
                 method.) A saddle-point               entails Hebbian — or associative — plasticity, which is a                        free energy (that is, minimize prediction error) and 
                 approximation of the integral         cornerstone of use-dependent or experience-dependent                             serve to bound surprising exchanges with the world. so 
                 of an exponential function, that      plasticity66, the correlation theory of von de Malsburg67,68                     what about synchronization and the selective enabling 
                 uses a second-order Taylor            and other formal refinements to Hebbian plasticity                               of synapses?
                 expansion. When the function 
                                                               69
                 is a probability density, the         per se . The cell assembly theory posits that groups of 
                 implicit assumption is that           interconnected neurons are formed through a strength-                            Biased competition and attention
                 the density is approximately          ening of synaptic connections that depends on corre-                             Causal regularities encoded by synaptic efficacy  
                 Gaussian.                             lated pre- and postsynaptic activity; that is, ‘cells that fire                  control the deterministic evolution of states in the world. 
                 Predictive coding                     together wire together’. This enables the brain to distil                        However, stochastic (that is, random) fluctuations in 
                 A tool used in signal processing      statistical regularities from the sensorium. The correla-                        these states play an important part in generating sen-
                 for representing a signal using       tion theory considers the selective enabling of synaptic                         sory data. Their amplitude is usually represented as pre-
                 a linear predictive (generative)      efficacy and its plasticity (also known as metaplastic-                          cision (or inverse variance), which encodes the reliability 
                 model. It is a powerful speech            70
                 analysis technique and was            ity ) by fast synchronous activity induced by different                          of prediction errors. Precision is important, especially 
                 first considered in vision to         perceptual attributes of the same object (for example, a                         in hierarchical schemes, because it controls the relative 
                 explain lateral interactions in       red bus in motion). This resolves a putative deficiency                          influence of bottom-up prediction errors and top-down 
                 the retina.                           of classical plasticity, which cannot ascribe a presynaptic                      predictions. so how is precision encoded in the brain?  
                 Infomax                               input to a particular cause (for example, redness) in the                        In predictive coding, precision modulates the amplitude 
                 An optimization principle for         world67. The correlation theory underpins theoretical                            of prediction errors (these are μ  in BOX 2), so that pre-
                                                                                                                                                                                    γ
                 neural networks (or functions)        treatments of synchronized brain activity and its role in                        diction errors with high precision have a greater impact 
                 that map inputs to outputs. It        associating or binding attributes to specific objects or                         on units that encode conditional expectations. This 
                 says that the mapping should          causes68,71. Another important field that rests on associa-                      means that precision corresponds to the synaptic gain of 
                 maximize the Shannon mutual           tive plasticity is the use of attractor networks as models                       prediction error units. The most obvious candidates for 
                 information between the inputs 
                                                                                                      72–74
                 and outputs, subject to               of memory formation and retrieval                   . so how do corre-           controlling gain (and implicitly encoding precision) are 
                 constraints and/or noise              lations and associative plasticity figure in the free-energy                     classical neuromodulators like dopamine and acetylcho-
                 processes.                            formulation?                                                                     line, which provides a nice link to theories of attention 
                                                                                                                                                              75–77
                 Stochastic                                Hitherto, we have considered only inference on states                        and uncertainty            . Another candidate is fast synchro-
                 Governed by random effects.           of the world that cause sensory signals, whereby condi-                          nized presynaptic input that lowers effective postsynaptic  
                                                       tional expectations about states are encoded by synaptic                         membrane time constants and increases synchronous 
                 Biased competition                                                                                                           78
                                                       activity. However, the causes covered by the recognition                         gain . This fits comfortably with the correlation theory 
                 An attentional effect mediated        density are not restricted to time-varying states (for                           and speaks to recent ideas about the role of synchronous 
                 by competitive interactions                                                                                                                                               79,80
                 among neurons representing            example, the motion of an object in the visual field):                           activity in mediating attentional gain                 .
                 visual stimuli; these                 they also include time-invariant regularities that endow                             In summary, the optimization of expected precision 
                 interactions can be biased in         the world with causal structure (for example, objects                            in terms of synaptic gain links attention to synaptic gain 
                 favour of behaviourally relevant      fall with constant acceleration). These regularities are                         and synchronization. This link is central to theories of 
                 stimuli by both spatial and           parameters of the generative model and have to be                                attentional gain and biased competition80–85, particularly 
                 non-spatial and both                                                                                                                                                      86,87
                 bottom-up and top-down                inferred by the brain — in other words, the conditional                          in the context of neuromodulation                       . The theories  
                 processes.                            expectations of these parameters that may be encoded                             considered so far have dealt only with perception. 
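The claim that a gradient descent on free energy takes the form of a delta rule, with precision acting as a gain on prediction errors, can be illustrated with a deliberately simplified sketch. It assumes a one-parameter linear generative model (s = theta * v plus Gaussian noise), treats the cause v as known rather than inferred, and reduces free energy to a precision-weighted squared prediction error. The function names, learning rate and sample sizes are illustrative choices and do not come from the article or from BOX 2.

```python
import random


def free_energy(theta, v, s, precision):
    """Precision-weighted squared prediction error (constant terms dropped)."""
    error = s - theta * v
    return 0.5 * precision * error ** 2


def learn_parameter(samples, precision, theta=0.0, rate=0.05):
    """Gradient descent on free_energy with respect to theta.

    Each step adds rate * precision * error * v, i.e. the product of the
    input that forms the prediction (v) and the prediction error, scaled
    by precision. This is a delta rule with precision as a gain.
    """
    for v, s in samples:
        error = s - theta * v                  # prediction error
        theta += rate * precision * error * v  # step along -dF/dtheta
    return theta


def make_samples(true_theta=2.0, noise_sd=0.1, n=500, seed=0):
    """Draw (cause, sensation) pairs from the assumed linear model."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        v = rng.uniform(-1.0, 1.0)
        s = true_theta * v + rng.gauss(0.0, noise_sd)
        samples.append((v, s))
    return samples


def mean_free_energy(theta, samples, precision):
    """Average free energy over all samples for a fixed theta."""
    return sum(free_energy(theta, v, s, precision) for v, s in samples) / len(samples)


samples = make_samples()
theta_high = learn_parameter(samples, precision=1.0)  # errors given high gain
theta_low = learn_parameter(samples, precision=0.1)   # errors given low gain
print(theta_high, theta_low)
```

With the same data and learning rate, the high-precision run moves much closer to the generating parameter than the low-precision run, which is the sense in which prediction errors with high precision have a greater impact on learning in this toy setting.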
However, from the point of view of the free-energy principle, perception just makes free energy a good proxy for surprise. To actually reduce surprise we need to act. In the next section, we retain a focus on cell assemblies but move to the selection and reinforcement of stimulus–response links.

Neural Darwinism and value learning

In the theory of neuronal group selection88, the emergence of neuronal assemblies is considered in the light of selective pressure. The theory has four elements: epigenetic mechanisms create a primary repertoire of neuronal connections, which are refined by experience-dependent plasticity to produce a secondary repertoire of neuronal groups. These are selected and maintained through reentrant signalling among neuronal groups. As in cell assembly theory, plasticity rests on correlated pre- and postsynaptic activity, but here it is modulated by value. Value is signalled by ascending neuromodulatory transmitter systems and controls which neuronal groups are selected and which are not. The beauty of neural Darwinism is that it nests distinct selective processes within each other. In other words, it eschews a single unit of selection and exploits the notion of meta-selection (the selection of selective mechanisms; for example, see REF. 89). In this context, (neuronal) value confers evolutionary value (that is, adaptive fitness) by selecting neuronal groups that mediate adaptive stimulus–stimulus associations and stimulus–response links. The capacity of value to do this is assured by natural selection, in the sense that neuronal value systems are themselves subject to selective pressure.

This theory, particularly value-dependent learning90, has deep connections with reinforcement learning and related approaches in engineering (see below), such as dynamic programming and temporal difference models91,92. This is because neuronal value systems reinforce connections to themselves, thereby enabling the brain to label a sensory state as valuable if, and only if, it leads to another valuable state. This ensures that agents move through a succession of states that have acquired value to access states (rewards) with genetically specified innate value. In short, the brain maximizes value, which may be reflected in the discharge of value systems (for example, dopaminergic systems92–96). So how does this relate to the optimization of free energy?

The answer is simple: value is inversely proportional to surprise, in the sense that the probability of a phenotype being in a particular state increases with the value of that state. Furthermore, the evolutionary value of a phenotype is the negative surprise averaged over all the states it experiences, which is simply its negative entropy. Indeed, the whole point of minimizing free energy (and implicitly entropy) is to ensure that agents spend most of their time in a small number of valuable states. This means that free energy is the complement of value, and its long-term average is the complement of adaptive fitness (also known as free fitness in evolutionary biology97). But how do agents know what is valuable? In other words, how does one generation tell the next which states have value (that is, are unsurprising)? Value or surprise is determined by the form of an agent’s generative model and its implicit priors; these specify the value of sensory states and, crucially, are heritable through genetic and epigenetic mechanisms. This means that prior expectations (that is, the primary repertoire) can prescribe a small number of attractive states with innate value. In turn, this enables natural selection to optimize prior expectations and ensure they are consistent with the agent’s phenotype. Put simply, valuable states are just the states that the agent expects to frequent. These expectations are constrained by the form of its generative model, which is specified genetically and fulfilled behaviourally, under active inference.

It is important to appreciate that prior expectations include not just what will be sampled from the world but also how the world is sampled. This means that natural selection may equip agents with the prior expectation that they will explore their environment until states with innate value are encountered. We will look at this more closely in the next section, where priors on motion through state space are cast in terms of policies in reinforcement learning.

Both neural Darwinism and the free-energy principle try to understand somatic changes in an individual in the context of evolution: neural Darwinism appeals to selective processes, whereas the free energy formulation considers the optimization of ensemble or population dynamics in terms of entropy and surprise. The key theme that emerges here is that (heritable) prior expectations can label things as innately valuable (unsurprising); but how can simply labelling states engender adaptive behaviour? In the next section, we return to reinforcement learning and related formulations of action that try to explain adaptive behaviour purely in terms of labels or cost functions.

Reentrant signalling
Reciprocal message passing among neuronal groups.

Reinforcement learning
An area of machine learning concerned with how an agent maximizes long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to actions performed by the agent.

Optimal control theory
An optimization method (based on the calculus of variations) for deriving an optimal control law in a dynamical system. A control problem includes a cost function that is a function of state and control variables.

Bellman equation
(Or dynamic programming equation.) Named after Richard Bellman, it is a necessary condition for optimality associated with dynamic programming in optimal control theory.

Optimal control theory and game theory

Value is central to theories of brain function that are based on reinforcement learning and optimum control. The basic notion that underpins these treatments is that the brain optimizes value, which is expected reward or utility (or its complement, expected loss or cost). This is seen in behavioural psychology as reinforcement learning98, in computational neuroscience and machine learning as variants of dynamic programming such as temporal difference learning99–101, and in economics as expected utility theory102. The notion of an expected reward or cost is crucial here; this is the cost expected over future states, given a particular policy that prescribes action or choices. A policy specifies the states to which an agent will move from any given state (‘motion through state space in continuous time’). This policy has to access sparse rewarding states using a cost function, which only labels states as costly or not. The problem of how the policy is optimized is formalized in optimal control theory as the Bellman equation and its variants99 (see Supplementary information S4 (box)), which express value as a function of the optimal policy and a cost function. If one can solve the Bellman equation, one can associate each sensory state with a value and optimize the policy by ensuring that the next state
is the most valuable of the available states. In general, it is impossible to solve the Bellman equation exactly, but several approximations exist, ranging from simple Rescorla–Wagner models98 to more comprehensive formulations like Q-learning100. Cost also has a key role in Bayesian decision theory, in which optimal decisions minimize expected cost in the context of uncertainty about outcomes; this is central to optimal decision theory (game theory) and behavioural economics102–104.

Optimal decision theory
(Or game theory.) An area of applied mathematics concerned with identifying the values, uncertainties and other constraints that determine an optimal decision.

Gradient ascent
(Or method of steepest ascent.) A first-order optimization scheme that finds a maximum of a function by changing its arguments in proportion to the gradient of the function at the current value. In short, a hill-climbing scheme. The opposite scheme is a gradient descent.

So what does free energy bring to the table? If one assumes that the optimal policy performs a gradient ascent on value, then it is easy to show that value is inversely proportional to surprise (see Supplementary information S4 (box)). This means that free energy is (an upper bound on) expected cost, which makes sense as optimal control theory assumes that action minimizes expected cost, whereas the free-energy principle states that it minimizes free energy. This is important because it explains why agents must minimize expected cost. Furthermore, free energy provides a quantitative and seamless connection between the cost functions of reinforcement learning and value in evolutionary biology. Finally, the dynamical perspective provides a mechanistic insight into how policies are specified in the brain: according to the principle of optimality99 cost is the rate of change of value (see Supplementary information S4 (box)), which depends on changes in sensory states. This suggests that optimal policies can be prescribed by prior expectations about the motion of sensory states. Put simply, priors induce a fixed-point attractor, and when the states arrive at the fixed point, value will stop changing and cost will be minimized. A simple example is shown in FIG. 2, in which a cued arm movement is simulated using only prior expectations that the arm will be drawn to a fixed point (the target). This figure illustrates how computational motor control105–109 can
be formulated in terms of priors and the suppression of sensory prediction errors (K.J.F., J. Daunizeau, J. Kilner and S.J. Kiebel, unpublished observations). More generally, it shows how rewards and goals can be considered as prior expectations that an action is obliged to fulfil16 (see also REF. 110). It also suggests how natural selection could optimize behaviour through the genetic specification of inheritable or innate priors that constrain the learning of empirical priors (BOX 2) and subsequent goal-directed action.

It should be noted that just expecting to be attracted to some states may not be sufficient to attain those states. This is because one may have to approach attractors vicariously through other states (for example, to avoid obstacles) or conform to physical constraints on action. These are some of the more difficult problems of accessing distal rewards that reinforcement learning and optimum control contend with. In these circumstances, an examination of the density dynamics, on which the free-energy principle is based, suggests that it is sufficient

[Figure 2 schematic, not reproduced. Recoverable labels and symbols: Predictions; Prediction errors; ξ_v^(2), ξ_v^(1), ξ_x^(1), μ_v^(1), μ_x^(1); Motor signals; Action a, with ȧ = −∂_a ε^T ξ; s_visual = [V, J] + w_visual; s_prop = [x_1, x_2] + w_prop; Jointed arm with x_1, x_2, J_1, J_2 and origin (0, 0); J = J_1 + J_2 = (j_1, j_2); V = (v_1, v_2, v_3); Movement trajectory.]
                Figure 2 | A demonstration of cued reaching movements. The lower right part of the                               to keep moving until an a priori attractor is encountered 
                figure shows a motor plant, comprising a two-jointed arm with two hidden states, each of                         (see supplementary information s5 (box)). This entails 
                which corresponds to a particular angular position of the two joints; the current position                       destroying unexpected (costly) fixed points in the envi-
                                                                                         Nature Reviews | Neuroscience
                of the finger (red circle) is the sum of the vectors describing the location of each joint.                      ronment by making them unstable (like shifting to a new 
                Here, causal states in the world are the position and brightness of the target (green                            position when sitting uncomfortably). Mathematically, 
                circle). The arm obeys Newtonian mechanics, specified in terms of angular inertia and                            this means adopting a policy that ensures a positive 
                friction. The left part of the figure illustrates that the brain senses hidden states directly                   divergence in costly states (intuitively, this is like being 
                in terms of proprioceptive input (S          ) that signals the angular positions (x ,x ) of the 
                                                          prop                                           1  2                    pushed through a liquid with negative viscosity or  
                joints and indirectly through seeing the location of the finger in space (J ,J ). In addition, 
                                                                                                       1 2                       friction). see FIG. 3 for a solution to the classical  
                through visual input (S        ) the agent senses the target location (v ,v ) and brightness (v ). 
                                           visual                                              1  2                      3       mountain car problem using a simple prior that induces 
                Sensory prediction errors are passed to higher brain levels to optimize the conditional                          this sort of policy. This prior is on motion through state 
                expectations of hidden states (that is, the angular position of the joints) and causal (that 
                is, target) states. The ensuing predictions are sent back to suppress sensory prediction                         space (that is, changes in states) and enforces exploration  
                errors. At the same time, sensory prediction errors are also trying to suppress themselves                       until an attractive state is found. Priors of this sort may 
                by changing sensory input through action. The grey and black lines denote reciprocal                             provide a principled way to understand the exploration–
                message passing among neuronal populations that encode prediction error and                                                                 111–113
                                                                                                                                 exploitation trade-off             and related issues in evolu-
                conditional expectations; this architecture is the same as that depicted in BOX 2. The                                               114
                blue lines represent descending motor control signals from sensory prediction-error                              tionary biology        . The implicit use of priors to induce 
                units. The agent’s generative model included priors on the motion of hidden states that                          dynamical instability also provides a key connection 
                effectively engage an invisible elastic band between the finger and target (when the                             to dynamical systems theory approaches to the brain 
                target is illuminated). This induces a prior expectation that the finger will be drawn to                        that emphasize the importance of itinerant dynamics, 
                the target, when cued appropriately. The insert shows the ensuing movement trajectory                            metastability, self-organized criticality and winner-
                caused by action. The red circles indicate the initial and final positions of the finger,                        less competition115–123. These dynamical phenomena 
                which reaches the target (green circle) quickly and smoothly; the blue line is the                               have a key role in synergetic and autopoietic accounts of  
                simulated trajectory.                                                                                            adaptive behaviour5,124,125.
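The 'negative friction' intuition can be illustrated with a small simulation of the mountain car equations of motion given with FIG. 3. This is only a sketch under stated assumptions, not the active inference scheme used in the Review: the landscape slope below is hypothetical (it keeps only the valley floor at x = −0.5, the target at x = 1 and a slope of 1 at x = 0), tanh stands in for the squashing function σ, and the policy is a hand-written heuristic that always pushes in the direction of motion.

```python
import math


def slope(x):
    """Gradient of a HYPOTHETICAL landscape: valley floor at x = -0.5, hilltop at x = 1."""
    return min(2.0 * x + 1.0, 4.0 * (1.0 - x))


def simulate(policy, x=-0.5, v=0.0, dt=0.01, t_max=60.0):
    """Integrate x' = v, v' = -slope(x) - v/8 + tanh(a) until x >= 1 or time runs out.

    Returns (reached_target, elapsed_time, furthest_position).
    """
    t, furthest = 0.0, x
    while t < t_max:
        a = policy(x, v)
        # Squashed action (assumed to be tanh), so the applied force stays below 1.
        v += dt * (-slope(x) - v / 8.0 + math.tanh(a))
        x += dt * v  # semi-implicit Euler step
        t += dt
        furthest = max(furthest, x)
        if x >= 1.0:
            return True, t, furthest
    return False, t, furthest


def push_right(x, v):
    """Naive policy: always push towards the target as hard as possible."""
    return 5.0


def negative_friction(x, v):
    """Heuristic stand-in for the prior: push along the current direction of motion."""
    return 5.0 if v >= 0.0 else -5.0


direct = simulate(push_right)
pumped = simulate(negative_friction)
print("constant push reaches target:", direct[0], "(furthest x = %.2f)" % direct[2])
print("negative friction reaches target:", pumped[0], "(after %.1f s)" % pumped[1])
```

Under these assumptions the constant push stalls partway up the hill, because the bounded action cannot supply enough energy in a single run from the valley floor. Pushing along the direction of motion adds energy on every swing, so the car first backs up the left slope and then carries enough momentum to reach the target.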
[Figure 3 plots. Panel a: 'The mountain car problem' (height $\varphi(x)$ against position $x$), with the equations of motion below it:

$$f = \begin{bmatrix} \dot{x} \\ \dot{x}' \end{bmatrix} = \begin{bmatrix} x' \\ -\nabla_x \varphi - \tfrac{1}{8} x' + \sigma(a) \end{bmatrix}$$

Panel b: 'Loss functions (priors)' (force $c(x)$ against position $x$); 'Conditional expectations' (estimated states, including $\mu_x(t)$ and $-c(t)$, against time in seconds); 'Trajectories' (velocity $x'$ against position $x$); and 'Action' (control signal $a(t)$ against time in seconds).]

Figure 3 | Solving the mountain car problem with prior expectations. a | How paradoxical but adaptive behaviour (for example, moving away from a target to ensure that it is secured later) emerges from simple priors on the motion of hidden states in the world. Shown is the landscape or potential energy function (with a minimum at position x = –0.5) that exerts forces on a mountain car. The car is shown at the target position on the hill at x = 1, indicated by the red circle. The equations of motion of the car are shown below the plot. Crucially, at x = 0 the force on the car cannot be overcome by the agent, because a squashing function $-1 \le \sigma \le 1$ is applied to action to prevent it being greater than 1. This means that the agent can access the target only by starting halfway up the left hill to gain enough momentum to carry it up the other side. b | The results of active inference under priors that destabilize fixed points outside the target domain. The priors are encoded in a cost function c(x) (top left), which acts like negative friction. When ‘friction’ is negative the car expects to go faster (see Supplementary information S5 (box) for details). The inferred hidden states (upper right: position in blue, velocity in green and negative dissipation in red) show that the car explores its landscape until it encounters the target, and that friction then increases (that is, cost decreases) dramatically to prevent the car from escaping the target (by falling down the hill). The ensuing trajectory is shown in blue (bottom left). The paler lines provide exemplar trajectories from other trials, with different starting positions. In the real world, friction is constant. However, the car ‘expects’ friction to change as it changes position, thus enforcing exploration or exploitation. These expectations are fulfilled by action (lower right).

Principle of optimality
An optimal policy has the property that whatever the initial state and initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Exploration–exploitation trade-off
Involves a balance between exploration (of uncharted territory) and exploitation (of current knowledge). In reinforcement learning, it has been studied mainly through the multi-armed bandit problem.

Dynamical systems theory
An area of applied mathematics that describes the behaviour of complex (possibly chaotic) dynamical systems as described by differential or difference equations.

Synergetics
Concerns the self-organization of patterns and structures in open systems far from thermodynamic equilibrium. It rests on the order parameter concept, which was generalized by Haken to the enslaving principle: that is, the dynamics of fast-relaxing (stable) modes are completely determined by the ‘slow’ dynamics of order parameters (the amplitudes of unstable modes).

Autopoietic
Referring to the fundamental dialectic between structure and function.

Helmholtzian
Refers to a device or scheme that uses a generative model to furnish a recognition density and learns hidden structures in data by optimizing the parameters of generative models.

In summary, optimal control and decision (game) theory start with the notion of cost or utility and try to construct value functions of states, which subsequently guide action. The free-energy formulation starts with a free-energy bound on the value of states, which is specified by priors on the motion of hidden environmental states. These priors can incorporate any cost function to ensure that costly states are avoided. States with minimum cost can be set (by learning or evolution) in terms of prior expectations about motion and the attractors that ensue. In this view, the problem of finding sparse rewards in the environment is nature’s solution to the problem of how to minimize the entropy (average surprise or free energy) of an agent’s states: by ensuring they occupy a small set of attracting (that is, rewarding) states.

Conclusions and future directions
Although contrived to highlight commonalities, this Review suggests that many global theories of brain function can be united under a Helmholtzian perspective of the brain as a generative model of the world it inhabits18,20,21,25 (FIG. 4); notable examples include the integration of the Bayesian brain and computational motor control theory, the objective functions shared by predictive coding and the infomax principle, hierarchical inference and theories of attention, the embedding of perception in natural selection and the link between optimum control and more exotic phenomena in dynamical systems theory. The constant theme in all these theories is that the brain optimizes a (free-energy) bound on surprise or its complement, value. This manifests as perception (so as to change
predictions) or action (so as to change the sensations that are predicted). Crucially, these predictions depend on prior expectations (that furnish policies), which are optimized at different (somatic and evolutionary) timescales and define what is valuable.

What does the free-energy principle portend for the future? If its main contribution is to integrate established theories, then the answer is probably ‘not a lot’. Conversely, it may provide a framework in which current debates could be resolved, for example whether dopamine encodes reward prediction error or surprise126,127; this is particularly important for understanding conditions like addiction, Parkinson’s disease and schizophrenia. Indeed, the free-energy formulation has already been used to explain the positive symptoms of schizophrenia in terms of false inference128. The free-energy formulation could also provide new approaches to old problems that might call for a reappraisal of conventional notions, particularly in reinforcement learning and motor control.

If the arguments underlying the free-energy principle hold, then the real challenge is to understand how it manifests in the brain. This speaks to a greater appreciation of hierarchical message passing41, the functional role of specific neurons and microcircuits and the dynamics they support (for example, what is the relationship between predictive coding, attention and dynamic coordination in the brain?129). Beyond neuroscience, many exciting applications in engineering, robotics, embodied cognition and evolutionary biology suggest themselves; although fanciful, it is not difficult to imagine building little free-energy machines that garner and model sensory information (like our children) to maximize the evidence for their own existence.

Figure 4 | The free-energy principle and other theories. Some of the theoretical constructs considered in this Review and how they relate to the free-energy principle (centre). The variables are described in BOXES 1,2 and a full explanation of the equations can be found in the Supplementary information S1–S4 (boxes).

[Figure 4 panels, each giving a theory, its equation and a summary:]

The free-energy principle (centre): $a, \mu, m = \arg\min F(\tilde{s}, \mu \mid m)$. Minimization of the free energy of sensations and the representation of their causes.

Attention and biased competition: $\mu_\gamma = \arg\min \int dt\, F$. Optimization of synaptic gain representing the precision (salience) of predictions.

Associative plasticity: $\ddot{\mu}_{\theta_{ij}} = -\partial_{\theta_{ij}} \varepsilon^T \xi$. Optimization of synaptic efficacy.

Perceptual learning and memory: $\mu_\theta = \arg\min \int dt\, F$. Optimization of synaptic efficacy to represent causal structure in the sensorium.

Probabilistic neuronal coding: $q(\vartheta) = N(\mu, \Sigma)$. Encoding a recognition density in terms of conditional expectations and uncertainty.

Predictive coding and hierarchical inference: $\dot{\mu}_v^{(i)} = D\mu_v^{(i)} - \partial_v \varepsilon^{(i)T} \xi^{(i)} - \xi_v^{(i+1)}$. Minimization of prediction error with recurrent message passing.

The Bayesian brain hypothesis: $\mu = \arg\min D_{KL}(q(\vartheta) \,\|\, p(\vartheta \mid \tilde{s}))$. Minimizing the difference between a recognition density and the conditional density on sensory causes.

Model selection and evolution: $m = \arg\min \int dt\, F$. Optimizing the agent’s model and priors through neurodevelopment and natural selection.

Computational motor control: $\dot{a} = -\partial_a \varepsilon^T \xi$. Minimization of sensory prediction errors.

Optimal control and value learning: $a, \mu = \arg\max V(\tilde{s} \mid m)$. Optimization of a free-energy bound on surprise or value.

Infomax and the redundancy minimization principle: $\mu = \arg\max \{ I(\tilde{s}, \mu) - H(\mu) \}$. Maximization of the mutual information between sensations and representations.
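The sense in which free energy is a bound on surprise can be checked numerically for a toy discrete model. The sketch below relies on the standard variational identity $F(q) = D_{KL}(q(\vartheta) \,\|\, p(\vartheta \mid s)) - \ln p(s)$; the prior and likelihood values are hypothetical numbers chosen for illustration and do not come from the Review.

```python
import math

# Hypothetical model with three possible causes and one observed sensation s.
prior = [0.5, 0.3, 0.2]        # p(theta)
likelihood = [0.1, 0.6, 0.3]   # p(s | theta) for the observed s

joint = [p * l for p, l in zip(prior, likelihood)]   # p(s, theta)
evidence = sum(joint)                                # p(s)
posterior = [j / evidence for j in joint]            # p(theta | s)
surprise = -math.log(evidence)                       # -ln p(s)


def kl(q, p):
    """Kullback-Leibler divergence D_KL(q || p) for discrete densities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def free_energy(q):
    """Variational free energy F = sum_theta q(theta) ln[q(theta) / p(s, theta)]."""
    return sum(qi * math.log(qi / ji) for qi, ji in zip(q, joint) if qi > 0.0)


guess = [1.0 / 3.0] * 3  # an arbitrary recognition density

print("surprise            :", surprise)
print("F(arbitrary q)      :", free_energy(guess))
print("KL + surprise       :", kl(guess, posterior) + surprise)
print("F(q = posterior)    :", free_energy(posterior))
```

For any recognition density, F exceeds surprise by exactly the KL divergence from the posterior, so minimizing F with respect to q both tightens the bound and makes q approximate the conditional density on causes.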
1. Huang, G. Is this a unified theory of the brain? New Scientist 2658, 30–33 (2008).
2. Friston, K., Kilner, J. & Harrison, L. A free energy principle for the brain. J. Physiol. Paris 100, 70–87 (2006).
   An overview of the free-energy principle that describes its motivation and relationship to generative models and predictive coding. This paper focuses on perception and the neurobiological infrastructures involved.
3. Ashby, W. R. Principles of the self-organising dynamic system. J. Gen. Psychol. 37, 125–128 (1947).
4. Nicolis, G. & Prigogine, I. Self‑Organisation in Non‑Equilibrium Systems (Wiley, New York, 1977).
5. Haken, H. Synergetics: an Introduction. Non‑Equilibrium Phase Transition and Self‑Organisation in Physics, Chemistry and Biology 3rd edn (Springer, New York, 1983).
6. Kauffman, S. The Origins of Order: Self‑Organization and Selection in Evolution (Oxford Univ. Press, Oxford, 1993).
7. Bernard, C. Lectures on the Phenomena Common to Animals and Plants (Thomas, Springfield, 1974).
8. Applebaum, D. Probability and Information: an Integrated Approach (Cambridge Univ. Press, Cambridge, UK, 2008).
9. Evans, D. J. A non-equilibrium free energy theorem for deterministic systems. Mol. Physics 101, 1551–1554 (2003).
10. Crauel, H. & Flandoli, F. Attractors for random dynamical systems. Probab. Theory Relat. Fields 100, 365–393 (1994).
11. Feynman, R. P. Statistical Mechanics: a Set of Lectures (Benjamin, Reading, Massachusetts, 1972).
12. Hinton, G. E. & van Camp, D. Keeping neural networks simple by minimising the description length of weights. Proc. 6th Annu. ACM Conf. Computational Learning Theory 5–13 (1993).
13. MacKay, D. J. C. Free-energy minimisation algorithm for decoding and cryptanalysis. Electron. Lett. 31, 445–447 (1995).
14. Neal, R. M. & Hinton, G. E. in Learning in Graphical Models (ed. Jordan, M. I.) 355–368 (Kluwer Academic, Dordrecht, 1998).
15. Itti, L. & Baldi, P. Bayesian surprise attracts human attention. Vision Res. 49, 1295–1306 (2009).
16. Friston, K., Daunizeau, J. & Kiebel, S. Active inference or reinforcement learning? PLoS ONE 4, e6421 (2009).
17. Knill, D. C. & Pouget, A. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 27, 712–719 (2004).
    A nice review of Bayesian theories of perception and sensorimotor control. Its focus is on Bayes optimality in the brain and the implicit nature of neuronal representations.
18. von Helmholtz, H. in Treatise on Physiological Optics Vol. III 3rd edn (Voss, Hamburg, 1909).
19. MacKay, D. M. in Automata Studies (eds Shannon, C. E. & McCarthy, J.) 235–251 (Princeton Univ. Press, Princeton, 1956).
20. Neisser, U. Cognitive Psychology (Appleton-Century-Crofts, New York, 1967).
21. Gregory, R. L. Perceptual illusions and brain models. Proc. R. Soc. Lond. B Biol. Sci. 171, 179–196 (1968).
22. Gregory, R. L. Perceptions as hypotheses. Philos. Trans. R. Soc. Lond. B Biol. Sci. 290, 181–197 (1980).
23. Ballard, D. H., Hinton, G. E. & Sejnowski, T. J. Parallel visual computation. Nature 306, 21–26 (1983).
24. Kawato, M., Hayakawa, H. & Inui, T. A forward-inverse optics model of reciprocal connections between visual areas. Network: Computation in Neural Systems 4, 415–422 (1993).
25. Dayan, P., Hinton, G. E. & Neal, R. M. The Helmholtz machine. Neural Comput. 7, 889–904 (1995).
    This paper introduces the central role of generative models and variational approaches to hierarchical self-supervised learning and relates this to the function of bottom-up and top-down cortical processing pathways.
26. Lee, T. S. & Mumford, D. Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 20, 1434–1448 (2003).
27. Kersten, D., Mamassian, P. & Yuille, A. Object perception as Bayesian inference. Annu. Rev. Psychol. 55, 271–304 (2004).
28. Friston, K. J. A theory of cortical responses. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360, 815–836 (2005).
29. Beal, M. J. Variational Algorithms for Approximate Bayesian Inference. Thesis, University College London (2003).
30. Efron, B. & Morris, C. Stein’s estimation rule and its competitors – an empirical Bayes approach. J. Am. Stat. Assoc. 68, 117–130 (1973).
31. Kass, R. E. & Steffey, D. Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). J. Am. Stat. Assoc. 84, 717–726 (1989).
32. Zeki, S. & Shipp, S. The functional logic of cortical connections. Nature 335, 311–317 (1988).
    Describes the functional architecture of cortical hierarchies with a focus on patterns of anatomical connections in the visual cortex. It emphasizes the role of functional segregation and integration (that is, message passing among cortical areas).
33. Felleman, D. J. & Van Essen, D. C. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47 (1991).
34. Mesulam, M. M. From sensation to cognition. Brain 121, 1013–1052 (1998).
35. Sanger, T. Probability density estimation for the interpretation of neural population codes. J. Neurophysiol. 76, 2790–2793 (1996).
36. Zemel, R., Dayan, P. & Pouget, A. Probabilistic interpretation of population code. Neural Comput. 10, 403–430 (1998).
37. Paulin, M. G. Evolution of the cerebellum as a neuronal machine for Bayesian state estimation. J. Neural Eng. 2, S219–S234 (2005).
38. Ma, W. J., Beck, J. M., Latham, P. E. & Pouget, A. Bayesian inference with probabilistic population codes. Nature Neurosci. 9, 1432–1438 (2006).
39. Friston, K., Mattout, J., Trujillo-Barreto, N., Ashburner, J. & Penny, W. Variational free energy and the Laplace approximation. Neuroimage 34, 220–234 (2007).
40. Rao, R. P. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive field effects. Nature Neurosci. 2, 79–87 (1999).
    Applies predictive coding to cortical processing to provide a compelling account of extra-classical receptive fields in the visual system. It emphasizes the importance of top-down projections in providing predictions, by modelling perceptual inference.
41. Mumford, D. On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biol. Cybern. 66, 241–251 (1992).
42. Friston, K. Hierarchical models in the brain. PLoS Comput. Biol. 4, e1000211 (2008).
43. Murray, S. O., Kersten, D., Olshausen, B. A., Schrater, P. & Woods, D. L. Shape perception reduces activity in human primary visual cortex. Proc. Natl Acad. Sci. USA 99, 15164–15169 (2002).
44. Garrido, M. I., Kilner, J. M., Kiebel, S. J. & Friston, K. J. Dynamic causal modeling of the response to frequency deviants. J. Neurophysiol. 101, 2620–2631 (2009).
45. Sherman, S. M. & Guillery, R. W. On the actions that one nerve cell can have on another: distinguishing “drivers” from “modulators”. Proc. Natl Acad. Sci. USA 95, 7121–7126 (1998).
46. Angelucci, A. & Bressloff, P. C. Contribution of feedforward, lateral and feedback connections to the classical receptive field center and extra-classical receptive field surround of primate V1 neurons. Prog. Brain Res. 154, 93–120 (2006).
47. Grossberg, S. Towards a unified theory of neocortex: laminar cortical circuits for vision and cognition. Prog. Brain Res. 165, 79–104 (2007).
48. Grossberg, S. & Versace, M. Spikes, synchrony, and attentive learning by laminar thalamocortical circuits. Brain Res. 1218, 278–312 (2008).
49. Barlow, H. in Sensory Communication (ed. Rosenblith, W.) 217–234 (MIT Press, Cambridge, Massachusetts, 1961).
50. Linsker, R. Perceptual neural organisation: some approaches based on network models and information theory. Annu. Rev. Neurosci. 13, 257–281 (1990).
51. Oja, E. Neural networks, principal components, and subspaces. Int. J. Neural Syst. 1, 61–68 (1989).
52. Bell, A. J. & Sejnowski, T. J. An information maximisation approach to blind separation and blind de-convolution. Neural Comput. 7, 1129–1159 (1995).
53. Atick, J. J. & Redlich, A. N. What does the retina know about natural scenes? Neural Comput. 4, 196–210 (1992).
54. Optican, L. & Richmond, B. J. Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. J. Neurophysiol. 57, 132–146 (1987).
55. Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
56. Simoncelli, E. P. & Olshausen, B. A. Natural image statistics and neural representation. Annu. Rev. Neurosci. 24, 1193–1216 (2001).
    A nice review of information theory in visual processing. It covers natural scene statistics and empirical tests of the efficient coding hypothesis in individual neurons and populations of neurons.
57. Friston, K. J. The labile brain. III. Transients and spatio-temporal receptive fields. Philos. Trans. R. Soc. Lond. B Biol. Sci. 355, 253–265 (2000).
58. Bialek, W., Nemenman, I. & Tishby, N. Predictability, complexity, and learning. Neural Comput. 13, 2409–2463 (2001).
59. Lewen, G. D., Bialek, W. & de Ruyter van Steveninck, R. R. Neural coding of naturalistic motion stimuli. Network 12, 317–329 (2001).
60. Laughlin, S. B. Efficiency and complexity in neural coding. Novartis Found. Symp. 239, 177–187 (2001).
61. Tipping, M. E. Sparse Bayesian learning and the Relevance Vector Machine. J. Machine Learn. Res. 1, 211–244 (2001).
62. Paus, T., Keshavan, M. & Giedd, J. N. Why do many psychiatric disorders emerge during adolescence? Nature Rev. Neurosci. 9, 947–957 (2008).
63. Gilestro, G. F., Tononi, G. & Cirelli, C. Widespread changes in synaptic markers as a function of sleep and wakefulness in Drosophila. Science 324, 109–112 (2009).
64. Roweis, S. & Ghahramani, Z. A unifying review of linear Gaussian models. Neural Comput. 11, 305–345 (1999).
65. Hebb, D. O. The Organization of Behaviour (Wiley, New York, 1949).
66. Paulsen, O. & Sejnowski, T. J. Natural patterns of activity and long-term synaptic plasticity. Curr. Opin. Neurobiol. 10, 172–179 (2000).
67. von der Malsburg, C. The Correlation Theory of Brain Function. Internal Report 81–82, Dept. Neurobiology, Max-Planck-Institute for Biophysical Chemistry (1981).
68. Singer, W. & Gray, C. M. Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci. 18, 555–586 (1995).
69. Bienenstock, E. L., Cooper, L. N. & Munro, P. W. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32–48 (1982).
70. Abraham, W. C. & Bear, M. F. Metaplasticity: the plasticity of synaptic plasticity. Trends Neurosci. 19, 126–130 (1996).
71. Pareti, G. & De Palma, A. Does the brain oscillate? The dispute on neuronal synchronization. Neurol. Sci. 25, 41–47 (2004).
72. Leutgeb, S., Leutgeb, J. K., Moser, M. B. & Moser, E. I. Place cells, spatial maps and the population code for memory. Curr. Opin. Neurobiol. 15, 738–746 (2005).
73. Durstewitz, D. & Seamans, J. K. Beyond bistability: biophysics and temporal dynamics of working memory. Neuroscience 139, 119–133 (2006).
74. Anishchenko, A. & Treves, A. Autoassociative memory retrieval and spontaneous activity bumps in small-world networks of integrate-and-fire neurons. J. Physiol. Paris 100, 225–236 (2006).
75. Abbott, L. F., Varela, J. A., Sen, K. & Nelson, S. B. Synaptic depression and cortical gain control. Science 275, 220–224 (1997).
76. Yu, A. J. & Dayan, P. Uncertainty, neuromodulation and attention. Neuron 46, 681–692 (2005).
77. Doya, K. Metalearning and neuromodulation. Neural Netw. 15, 495–506 (2002).
78. Chawla, D., Lumer, E. D. & Friston, K. J. The relationship between synchronization among neuronal populations and their mean activity levels. Neural Comput. 11, 1389–1411 (1999).
79. Fries, P., Womelsdorf, T., Oostenveld, R. & Desimone, R. The effects of visual stimulation and selective visual attention on rhythmic neuronal synchronization in macaque area V4. J. Neurosci. 28, 4823–4835 (2008).
80. Womelsdorf, T. & Fries, P. Neuronal coherence during selective attentional processing and sensory-motor integration. J. Physiol. Paris 100, 182–193 (2006).
81. Desimone, R. Neural mechanisms for visual memory and their role in attention. Proc. Natl Acad. Sci. USA 93, 13494–13499 (1996).
    A nice review of mnemonic effects (such as repetition suppression) on neuronal responses and how they bias the competitive interactions between stimulus representations in the cortex. It provides a good perspective on attentional mechanisms in the visual system that is empirically grounded.
82. Treisman, A. Feature binding, attention and object perception. Philos. Trans. R. Soc. Lond. B Biol. Sci. 353, 1295–1306 (1998).
83. Maunsell, J. H. & Treue, S. Feature-based attention in visual cortex. Trends Neurosci. 29, 317–322 (2006).
84. Spratling, M. W. Predictive-coding as a model of biased competition in visual attention. Vision Res. 48, 1391–1408 (2008).
85. Reynolds, J. H. & Heeger, D. J. The normalization model of attention. Neuron 61, 168–185 (2009).
86. Schroeder, C. E., Mehta, A. D. & Foxe, J. J. Determinants and mechanisms of attentional modulation of neural processing. Front. Biosci. 6, D672–D684 (2001).
87. Hirayama, J., Yoshimoto, J. & Ishii, S. Bayesian representation learning in the cortex regulated by acetylcholine. Neural Netw. 17, 1391–1400 (2004).
88. Edelman, G. M. Neural Darwinism: selection and reentrant signaling in higher brain function. Neuron 10, 115–125 (1993).
89. Knobloch, F. Altruism and the hypothesis of meta-selection in human evolution. J. Am. Acad. Psychoanal. 29, 339–354 (2001).
90. Friston, K. J., Tononi, G., Reeke, G. N. Jr, Sporns, O. & Edelman, G. M. Value-dependent selection in the brain: simulation in a synthetic neural model. Neuroscience 59, 229–243 (1994).
91. Sutton, R. S. & Barto, A. G. Toward a modern theory of adaptive networks: expectation and prediction. Psychol. Rev. 88, 135–170 (1981).
92. Montague, P. R., Dayan, P., Person, C. & Sejnowski, T. J. Bee foraging in uncertain environments using predictive Hebbian learning. Nature 377, 725–728 (1995).
    A computational treatment of behaviour that combines ideas from optimal control theory and dynamic programming with the neurobiology of reward. This provided an early example of value learning in the brain.
93. Schultz, W. Predictive reward signal of dopamine neurons. J. Neurophysiol. 80, 1–27 (1998).
94. Daw, N. D. & Doya, K. The computational neurobiology of learning and reward. Curr. Opin. Neurobiol. 16, 199–204 (2006).
95. Redgrave, P. & Gurney, K. The short-latency dopamine signal: a role in discovering novel actions? Nature Rev. Neurosci. 7, 967–975 (2006).
96. Berridge, K. C. The debate over dopamine’s role in reward: the case for incentive salience. Psychopharmacology (Berl.) 191, 391–431 (2007).
97. Sella, G. & Hirsh, A. E. The application of statistical physics to evolutionary biology. Proc. Natl Acad. Sci. USA 102, 9541–9546 (2005).
98. Rescorla, R. A. & Wagner, A. R. in Classical Conditioning II: Current Research and Theory (eds Black, A. H. & Prokasy, W. F.) 64–99 (Appleton Century Crofts, New York, 1972).
99. Bellman, R. On the Theory of Dynamic Programming. Proc. Natl Acad. Sci. USA 38, 716–719 (1952).
100. Watkins, C. J. C. H. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992).
101. Todorov, E. in Advances in Neural Information Processing Systems (eds Scholkopf, B., Platt, J. & Hofmann, T.) 19, 1369–1376 (MIT Press, 2006).
102. Camerer, C. F. Behavioural studies of strategic thinking in games. Trends Cogn. Sci. 7, 225–231 (2003).
103. Smith, J. M. & Price, G. R. The logic of animal conflict. Nature 246, 15–18 (1973).
104. Nash, J. Equilibrium points in n-person games. Proc. Natl Acad. Sci. USA 36, 48–49 (1950).
105. Wolpert, D. M. & Miall, R. C. Forward models for physiological motor control. Neural Netw. 9, 1265–1279 (1996).
106. Todorov, E. & Jordan, M. I. Smoothness maximization along a predefined path accurately predicts the speed profiles of complex arm movements. J. Neurophysiol. 80, 696–714 (1998).
107. Tseng, Y. W., Diedrichsen, J., Krakauer, J. W., Shadmehr, R. & Bastian, A. J. Sensory prediction-errors drive cerebellum-dependent adaptation of reaching. J. Neurophysiol. 98, 54–62 (2007).
108. Bays, P. M. & Wolpert, D. M. Computational principles of sensorimotor control that minimize uncertainty and variability. J. Physiol. 578, 387–396 (2007).
     A nice overview of computational principles in motor control. Its focus is on representing uncertainty and optimal estimation when extracting the sensory information required for motor planning.
109. Shadmehr, R. & Krakauer, J. W. A computational neuroanatomy for motor control. Exp. Brain Res. 185, 359–381 (2008).
110. Verschure, P. F., Voegtlin, T. & Douglas, R. J. Environmentally mediated synergy between perception and behaviour in mobile robots. Nature 425, 620–624 (2003).
111. Cohen, J. D., McClure, S. M. & Yu, A. J. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philos. Trans. R. Soc. Lond. B Biol. Sci. 362, 933–942 (2007).
112. Ishii, S., Yoshida, W. & Yoshimoto, J. Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Netw. 15, 665–687 (2002).
113. Usher, M., Cohen, J. D., Servan-Schreiber, D., Rajkowski, J. & Aston-Jones, G. The role of locus coeruleus in the regulation of cognitive performance. Science 283, 549–554 (1999).
114. Voigt, C. A., Kauffman, S. & Wang, Z. G. Rational evolutionary design: the theory of in vitro protein evolution. Adv. Protein Chem. 55, 79–160 (2000).
115. Freeman, W. J. Characterization of state transitions in spatially distributed, chaotic, nonlinear, dynamical systems in cerebral cortex. Integr. Physiol. Behav. Sci. 29, 294–306 (1994).
116. Tsuda, I. Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behav. Brain Sci. 24, 793–810 (2001).
117. Jirsa, V. K., Friedrich, R., Haken, H. & Kelso, J. A. A theoretical model of phase transitions in the human brain. Biol. Cybern. 71, 27–35 (1994).
     This paper develops a theoretical model (based on synergetics and nonlinear oscillator theory) that reproduces observed dynamics and suggests a formulation of biophysical coupling among brain systems.
118. Breakspear, M. & Stam, C. J. Dynamics of a neural system with a multiscale architecture. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360, 1051–1074 (2005).
119. Bressler, S. L. & Tognoli, E. Operational principles of neurocognitive networks. Int. J. Psychophysiol. 60, 139–148 (2006).
120. Werner, G. Brain dynamics across levels of organization. J. Physiol. Paris 101, 273–279 (2007).
121. Pasquale, V., Massobrio, P., Bologna, L. L., Chiappalone, M. & Martinoia, S. Self-organization and neuronal avalanches in networks of dissociated cortical neurons. Neuroscience 153, 1354–1369 (2008).
122. Kitzbichler, M. G., Smith, M. L., Christensen, S. R. & Bullmore, E. Broadband criticality of human brain network synchronization. PLoS Comput. Biol. 5, e1000314 (2009).
123. Rabinovich, M., Huerta, R. & Laurent, G. Transient dynamics for neural processing. Science 321, 48–50 (2008).
124. Tschacher, W. & Haken, H. Intentionality in non-equilibrium systems? The functional aspects of self-organised pattern formation. New Ideas Psychol. 25, 1–15 (2007).
125. Maturana, H. R. & Varela, F. De máquinas y seres vivos (Editorial Universitaria, Santiago, 1972). English translation available in Maturana, H. R. & Varela, F. in Autopoiesis and Cognition (Reidel, Dordrecht, 1980).
126. Fiorillo, C. D., Tobler, P. N. & Schultz, W. Discrete coding of reward probability and uncertainty by dopamine neurons. Science 299, 1898–1902 (2003).
127. Niv, Y., Duff, M. O. & Dayan, P. Dopamine, uncertainty and TD learning. Behav. Brain Funct. 1, 6 (2005).
128. Fletcher, P. C. & Frith, C. D. Perceiving is believing: a Bayesian approach to explaining the positive symptoms of schizophrenia. Nature Rev. Neurosci. 10, 48–58 (2009).
129. Phillips, W. A. & Silverstein, S. M. Convergence of biological and psychological perspectives on cognitive coordination in schizophrenia. Behav. Brain Sci. 26, 65–82 (2003).
130. Friston, K. & Kiebel, S. Cortical circuits for perceptual inference. Neural Netw. 22, 1093–1104 (2009).

Acknowledgments
This work was funded by the Wellcome Trust. I would like to thank my colleagues at the Wellcome Trust Centre for Neuroimaging, the Institute of Cognitive Neuroscience and the Gatsby Computational Neuroscience Unit for collaborations and discussions.

Competing interests statement
The author declares no competing financial interests.

SUPPLEMENTARY INFORMATION
See online article: S1 (box) | S2 (box) | S3 (box) | S4 (box) | S5 (box)
                 138 | FEBRUARY 2010 | VOLUME 11                                                                                                                         www.nature.com/reviews/neuro
                                                                                 © 2010 Macmillan Publishers Limited. All rights reserved
             
           SUPPLEMENTARY INFORMATION                                                In format provided by Friston (FEBRUARY 2010) 
                    Supplementary information S1 (box): The entropy of sensory states and their causes 
                    This box shows that the entropy of hidden states in the environment is bounded by the
                    entropy of sensory states. This means that if the entropy of sensory signals is minimised, so
                    is the entropy of the environmental states that caused them. For any agent or model $m$, the
                    entropy of generalised sensory states $\tilde{s}(t) = [s, s', s'', \ldots]^T$ is simply their average
                    surprise $-\ln p(\tilde{s}|m)$ (with a slight abuse of notation)

                    $$H(\tilde{s}|m) := \int -p(\tilde{s}|m) \ln p(\tilde{s}|m)\, d\tilde{s} = \lim_{T \to \infty} \int_0^T -\ln p(\tilde{s}(t)|m)\, dt \qquad \text{(S1.1)}$$
                     
                    Under ergodic assumptions, this is just the long-term time or path-integral of surprise. We will 
                    assume sensory states are an analytic function of hidden environmental states plus some 
                    generalised random fluctuations 
                     
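                    $$\begin{aligned}
                    \tilde{s} &= g(\tilde{x},\theta) + \tilde{z} \\
                    \dot{\tilde{x}} &= f(\tilde{x},\theta) + \tilde{w}
                    \end{aligned} \qquad \text{(S1.2)}$$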
                     
                    Here, hidden states change according to the stochastic differential equations of motion (with
                    parameters $\theta$) in S1.2. Because $\tilde{x}$ and $\tilde{z}$ are statistically independent, we have (see Eq.
                    6.4.6 in Jones 1979, p149)

                    $$I(\tilde{s},\tilde{z}) = H(\tilde{s}|m) - H(\tilde{x}|m) - \int p(\tilde{x}|m) \ln|\partial_{\tilde{x}} g|\, d\tilde{x} \qquad \text{(S1.3)}$$
                     
                    Here, $I(\tilde{s},\tilde{z}) \geq 0$ is the mutual information between the sensory states and noise. By Gibbs'
                    inequality this cross-entropy or Kullback-Leibler divergence is non-negative (Theorem 6.5;
                    Jones 1979, p151). This means the entropy of the sensory states is greater than the entropy
                    of the sensory mapping. Here, $\partial_{\tilde{x}} g$ is the sensitivity or gradient of the sensory mapping with
                    respect to the hidden states. The integral in S1.3 reflects the fact that entropy is not invariant
               to a change of variables and assumes that the sensory mapping $g: \tilde{x} \to \tilde{s}$ is diffeomorphic
               (i.e., bijective and smooth). This requires the hidden and sensory state-spaces to have the
               same dimension, which can be assured by truncating generalised states at an appropriately
               high order. For example, if we had n hidden states in m generalised coordinates of motion,
               we would consider m sensory states in n generalised coordinates; so that
               $\dim(\tilde{x}) = \dim(\tilde{s}) = n \times m$. Finally, rearranging S1.3 gives
               
               $$H(\tilde{x}|m) \leq H(\tilde{s}|m) - \int p(\tilde{x}|m) \ln|\partial_{\tilde{x}} g|\, d\tilde{x} \qquad \text{(S1.4)}$$
               
              In conclusion, the entropy of hidden states is upper-bounded by the entropy of sensations, 
              assuming their sensitivity to hidden states is constant, over the range of states encountered. 
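
The bound in S1.4 can be checked informally in a case where every term has a closed form. The sketch below assumes a scalar linear-Gaussian mapping, s = a·x + z with x ~ N(0, var_x) and z ~ N(0, var_z); this toy model and the names `gaussian_entropy` and `entropy_bound` are illustrative assumptions, not part of the original box. With a linear mapping the sensitivity is the constant a, so the integral in S1.4 reduces to ln|a|.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy (nats) of a scalar Gaussian with this variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def entropy_bound(a, var_x, var_z):
    """Return (H(x), H(s) - E[ln|dg/dx|]) for the toy mapping s = a*x + z."""
    h_x = gaussian_entropy(var_x)
    # x and z are independent, so their variances add in s.
    h_s = gaussian_entropy(a * a * var_x + var_z)
    # For a linear mapping the sensitivity dg/dx = a is constant.
    sensitivity_term = math.log(abs(a))
    return h_x, h_s - sensitivity_term


h_x, bound = entropy_bound(a=2.0, var_x=1.0, var_z=0.5)
print(f"H(x) = {h_x:.4f}, upper bound from S1.4 = {bound:.4f}")
```

In this toy case the gap between the two numbers is the mutual information between sensations and noise in S1.3, and it closes when the noise variance is zero.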
               
              Clearly,  the  ergodic  assumption  in  S1.1  only  holds  over  certain  temporal  scales  for  real 
              organisms that are on a trajectory from birth to death. This scale can be somatic (e.g., over 
              days  or  months,  where  development  is  locally  stationary)  or  evolutionary  (e.g.,  over 
              generations, where evolution is locally stationary). 
               
               
              Reference 
               Jones, D. S. (1979). Elementary Information Theory. Oxford: Clarendon Press; New
               York: Oxford University Press.
               
                   Supplementary information S2 (box): Variational free energy 
                    Here, we derive the free-energy and show how its various formulations relate to each other.
                    We start with the quantity we want to bound; namely, the surprise or negative log-evidence
                    associated with sensory states $\tilde{s}(t)$ that have been caused by some unknown quantities
                    $\vartheta \supset \{\tilde{x},\theta\}$, which include the hidden states and parameters in box (S1)

                    $$-\ln p(\tilde{s}(t)) = -\ln \int p(\tilde{s}(t),\vartheta)\, d\vartheta \qquad \text{(S2.1)}$$
                    
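                    To create a free-energy bound on surprise $F(\tilde{s}(t),q(\vartheta))$, we simply add a non-negative
                    cross-entropy between an arbitrary (recognition) density on the causes $q(\vartheta)$ and their
                    posterior density $p(\vartheta|\tilde{s})$ (dropping the dependency on $m$ for clarity).

                    $$\begin{aligned}
                    F &= \int q(\vartheta) \ln \frac{q(\vartheta)}{p(\vartheta|\tilde{s})}\, d\vartheta - \ln p(\tilde{s}) \\
                      &= D(q(\vartheta)\,||\,p(\vartheta|\tilde{s})) - \ln p(\tilde{s})
                    \end{aligned} \qquad \text{(S2.2)}$$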
                    
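                    The cross-entropy term is non-negative by Gibbs' inequality. In short, free-energy is
                    cross-entropy plus surprise. Because surprise depends only on sensory states, we can bring it
                    inside the integral and use $p(\vartheta,\tilde{s}) = p(\vartheta|\tilde{s})p(\tilde{s})$ to show free-energy is expected energy
                    minus entropy

                    $$\begin{aligned}
                    F &= \int q(\vartheta) \ln \frac{q(\vartheta)}{p(\vartheta|\tilde{s})p(\tilde{s})}\, d\vartheta \\
                      &= \int q(\vartheta) \ln q(\vartheta)\, d\vartheta - \int q(\vartheta) \ln p(\vartheta,\tilde{s})\, d\vartheta \\
                      &= \langle -\ln p(\vartheta,\tilde{s}) \rangle_q - \langle -\ln q(\vartheta) \rangle_q
                    \end{aligned} \qquad \text{(S2.3)}$$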
                    
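                   where $-\ln p(\vartheta,\tilde{s})$ is the Gibbs energy. A final rearrangement, using $p(\vartheta,\tilde{s}) = p(\tilde{s}|\vartheta)p(\vartheta)$,
                   shows free-energy is also complexity minus accuracy, where complexity is the cross-entropy
                   between the recognition density $q(\vartheta)$ and the prior density $p(\vartheta)$

                   $$\begin{aligned}
                   F &= \int q(\vartheta) \ln \frac{q(\vartheta)}{p(\tilde{s}|\vartheta)p(\vartheta)}\, d\vartheta \\
                     &= \int q(\vartheta) \ln \frac{q(\vartheta)}{p(\vartheta)}\, d\vartheta - \int q(\vartheta) \ln p(\tilde{s}|\vartheta)\, d\vartheta \\
                     &= D(q(\vartheta)\,||\,p(\vartheta)) - \langle \ln p(\tilde{s}|\vartheta) \rangle_q
                   \end{aligned} \qquad \text{(S2.4)}$$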
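
The three formulations in S2.2 to S2.4 can be compared numerically on a small discrete example. The sketch below is illustrative only: the three-valued cause, the prior, the likelihood of the observed datum and the recognition density are all made-up numbers, and `free_energy_forms` is a hypothetical helper. It assumes every density is strictly positive so that the logarithms are defined.

```python
import math


def kl(a, b):
    """Kullback-Leibler divergence D(a || b) between discrete densities."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


def expect(weights, values):
    """Expectation of `values` under the density `weights`."""
    return sum(w * v for w, v in zip(weights, values))


def free_energy_forms(q, prior, likelihood):
    """Return surprise and free-energy computed as in S2.2, S2.3 and S2.4."""
    joint = [l * p for l, p in zip(likelihood, prior)]   # p(cause, s)
    evidence = sum(joint)                                # p(s)
    posterior = [j / evidence for j in joint]            # p(cause | s)
    surprise = -math.log(evidence)

    # S2.2: divergence from the posterior plus surprise.
    f_bound = kl(q, posterior) + surprise
    # S2.3: expected (Gibbs) energy minus entropy of q.
    energy = expect(q, [-math.log(j) for j in joint])
    entropy = expect(q, [-math.log(qi) for qi in q])
    f_energy = energy - entropy
    # S2.4: complexity minus accuracy.
    accuracy = expect(q, [math.log(l) for l in likelihood])
    f_complexity = kl(q, prior) - accuracy
    return surprise, f_bound, f_energy, f_complexity


prior = [0.5, 0.3, 0.2]        # hypothetical p(cause)
likelihood = [0.1, 0.6, 0.3]   # hypothetical p(s | cause) for the observed s
q = [0.2, 0.5, 0.3]            # arbitrary recognition density
print(free_energy_forms(q, prior, likelihood))
```

All three free-energy values coincide and sit above the surprise; setting `q` to the posterior removes the gap, which is the sense in which free-energy bounds surprise.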
                     
           
                Supplementary information S3 (box): The free-energy principle and infomax 
                 Here, we show that the free-energy principle is a probabilistic generalisation of the infomax
                 principle. The infomax principle requires the mutual information $I(\tilde{s},\mu)$ between sensory
                 data and their conditional representation $\mu(t)$ to be maximal, under prior constraints on the
                 representations; e.g., $p(\mu) = N(0,I)$. This can be stated as an optimisation of an infomax
                 criterion
                 
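                 $$\begin{aligned}
                 \mu^* &= \arg\max_{\mu} G \\
                 G &= I(\tilde{s},\mu) - H(\mu) \\
                   &= H(\tilde{s}) - H(\tilde{s}|\mu) - H(\mu)
                 \end{aligned} \qquad \text{(S3.1)}$$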
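
The second line of S3.1 uses the identity I(s, µ) = H(s) − H(s|µ). As a quick sanity check, the sketch below evaluates both expressions for the criterion on a made-up 2×2 joint density; the table and variable names are illustrative assumptions, not taken from the original.

```python
import math

# Hypothetical joint density p(s, mu): rows index s, columns index mu.
joint = [[0.30, 0.10],
         [0.05, 0.55]]

p_s = [sum(row) for row in joint]           # marginal over mu
p_mu = [sum(col) for col in zip(*joint)]    # marginal over s


def entropy(p):
    """Shannon entropy (nats) of a discrete density."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


cells = [(i, j) for i in range(2) for j in range(2)]
# Mutual information as the divergence between the joint and its marginals.
mutual_info = sum(
    joint[i][j] * math.log(joint[i][j] / (p_s[i] * p_mu[j])) for i, j in cells
)
# Conditional entropy H(s | mu).
h_s_given_mu = -sum(
    joint[i][j] * math.log(joint[i][j] / p_mu[j]) for i, j in cells
)

g_first = mutual_info - entropy(p_mu)                     # G = I(s, mu) - H(mu)
g_second = entropy(p_s) - h_s_given_mu - entropy(p_mu)    # expanded form
print(g_first, g_second)
```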
                 
                Because the representations do not change sensory data, they are only required to minimise 
                the average surprise about them, given the representations; and the average surprise about 
                the representations, given their prior constraints. These are the last two terms in (S3.1). If the 
                recognition density is a point  mass at  µ(t); i.e.,  q(ϑ) =δ(ϑ −µ), the  free-energy from 
                (S2.4) reduces to 
                 
                 $$F = -\ln p(\tilde{s}|\mu) - \ln p(\mu) \qquad \text{(S3.2)}$$
                 
                From (S1.1), the path-integral of free-energy (also known as free-action) becomes 
                 
                 $$\mathcal{A} = \int dt\, F(\tilde{s}(t),\mu(t)) \propto H(\tilde{s}|\mu) + H(\mu) \qquad \text{(S3.3)}$$
                 
                 This means optimising the conditional expectations with respect to free-energy and (by the
                 fundamental lemma of variational calculus) free-action is exactly the same as
                 optimising the infomax criterion
                 
                $$\mu^* = \arg\min_{\mu} F = \arg\min_{\mu} \mathcal{A} = \arg\max_{\mu} G \qquad \text{(S3.4)}$$
                
               In short, the infomax principle is a special case of the free-energy principle that obtains when 
               we discount uncertainty and represent sensory data with point estimates of their causes. 
               Alternatively,  the  free-energy  is  a  generalisation  of  the  infomax  principle  that  covers 
               probability densities on the unknown causes of data. In this context, high mutual information 
               is assured by maximising accuracy (e.g., minimising prediction error) and the prior constraints 
               are enforced by minimising complexity (see S2.4) 
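
To make the point-estimate case concrete, the sketch below minimises the free-energy of S3.2 for a scalar Gaussian toy model. The model is an assumption chosen for illustration: the likelihood is p(s|µ) = N(µ, var_s) (an identity sensory mapping), the prior is p(µ) = N(0, var_prior), and `free_energy` and `minimise` are hypothetical helpers. Under these assumptions the minimiser is the precision-weighted estimate s·(1/var_s)/(1/var_s + 1/var_prior), which trades accuracy against the prior constraint.

```python
import math


def free_energy(mu, s, var_s, var_prior=1.0):
    """S3.2 for the toy model: F = -ln p(s | mu) - ln p(mu)."""
    neg_log_lik = 0.5 * math.log(2 * math.pi * var_s) + (s - mu) ** 2 / (2 * var_s)
    neg_log_prior = 0.5 * math.log(2 * math.pi * var_prior) + mu ** 2 / (2 * var_prior)
    return neg_log_lik + neg_log_prior


def minimise(s, var_s, var_prior=1.0, rate=0.1, steps=500):
    """Plain gradient descent on F with respect to the point estimate mu."""
    mu = 0.0
    for _ in range(steps):
        # dF/dmu: prediction error weighted by precision, plus the prior term.
        grad = (mu - s) / var_s + mu / var_prior
        mu -= rate * grad
    return mu


s_observed = 1.5
mu_star = minimise(s_observed, var_s=0.5)
closed_form = s_observed * (1 / 0.5) / (1 / 0.5 + 1 / 1.0)
print(mu_star, closed_form)
```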
           
                Supplementary information S4 (box): Value and surprise 
                 Here, we compare and contrast optimal control and free-energy formulations of dynamics on
                 hidden or sensory states. To keep things simple, we will assume the hidden states are known
                 (as is usually assumed in control theory) and ignore random fluctuations; i.e., $\tilde{w}(t) = 0$ (see
                 box S1). In optimal control, one starts with a loss or cost-function (negative reward or utility),
                 $c(\tilde{x})$, and optimises the motion of states to maximise value or expected reward over time
                 
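                 $$\begin{aligned}
                 a^* &= \arg\max_{a} f(\tilde{x},a) \cdot \nabla V(\tilde{x}) \\
                 V(\tilde{x}(0)) &= \int_0^{\infty} -c(\tilde{x}(t))\, dt \Rightarrow \dot{V}(\tilde{x}(t)) = c(\tilde{x})
                 \end{aligned} \qquad \text{(S4.1)}$$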
                 
                 The first equality says that motion ascends the gradients of the value-function and the second
                 just defines value as reward that will be accumulated in the future. Note the equations of
                 motion $\dot{\tilde{x}} = f(\tilde{x},a)$ now include action. The value-function is the solution to the celebrated
                 Hamilton-Jacobi-Bellman equation
                 
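                 $$\begin{aligned}
                 \max_{a} \{ \dot{V}(\tilde{x}(t)) - c(\tilde{x}) \} &= 0 \Rightarrow \\
                 \max_{a} \{ f(\tilde{x},a) \cdot \nabla V(\tilde{x}) - c(\tilde{x}) \} &= 0
                 \end{aligned} \qquad \text{(S4.2)}$$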
                 
                This solution ensures that the rate of change of value is cost, as required by the definition of 
                value. In summary, (S4.1) says that action maximises value and (S4.2) means that value is 
                the reward expected under this policy. This ensures low-cost regions attract all trajectories 
                through state-space. 
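
The first equality of S4.1 can be illustrated with a one-dimensional toy problem. Everything specific in the sketch below is an assumption made for illustration: the quadratic value-function with its peak at `PEAK`, the three-element action set, and equations of motion in which the action sets the velocity directly. It shows only that choosing the action whose flow has the largest projection onto the value gradient carries the state to the value peak; it does not solve the Hamilton-Jacobi-Bellman equation.

```python
ACTIONS = (-1.0, 0.0, 1.0)   # hypothetical discrete action set
PEAK = 2.0                   # hypothetical location of maximum value


def value_gradient(x):
    """Gradient of the toy value-function V(x) = -(x - PEAK)**2 / 2."""
    return -(x - PEAK)


def flow(x, a):
    """Toy equations of motion: the action sets the velocity directly."""
    return a


def best_action(x):
    """First equality of S4.1: the action whose flow ascends value fastest."""
    return max(ACTIONS, key=lambda a: flow(x, a) * value_gradient(x))


def rollout(x, dt=0.1, steps=100):
    """Euler integration of the state under the value-ascending policy."""
    for _ in range(steps):
        x += dt * flow(x, best_action(x))
    return x


final_state = rollout(-1.0)
print(f"final state: {final_state:.2f} (value peak at {PEAK})")
```

Because the action set is discrete, the state ends up chattering within one step (`dt` times the largest action) of the peak rather than settling on it exactly.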
                 
                We now revisit value from the perspective of surprise and free-energy. If we put the random 
                fluctuations  back  and  assume  a  general  form  (the  Helmholtz  decomposition)  for  motion: 
                f =∇V +∇×W , it is fairly  easy  to  relate  value  and  surprise  (using  the  Fokker-Planck 
                equation, subject to ∇V ⋅(∇×W) = 0) 
                        
                        
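                        $$\begin{aligned}
                        V(\tilde{x}) &= \gamma \ln p(\tilde{x}|m) \\
                        c(\tilde{x}) &= f \cdot \nabla V + \gamma \nabla^2 V
                        \end{aligned} \qquad \text{(S4.3)}$$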
                        
                       Here,  γ > 0 encodes the amplitude of the random fluctuations (and is known as an inverse 
                       sensitivity  or  temperature  parameter).  The  first  equality  shows  that  value  is  inversely 
                       proportional to surprise, where free-energy is surprise because we know the true states. This 
                       means the value of a state is proportional to the log-probability of finding an agent  m  in that 
                       state. This is also the log-sojourn time or the proportion of time the state is occupied by that 
                       agent.  
                        
                        In the limit of small fluctuations $\gamma \to 0$, the ensemble density $p(\tilde{x}|m) = \exp(\gamma^{-1} V(\tilde{x}))$
                        becomes a point mass at the minimum of the cost-function. This somewhat trivial case serves
                        to connect optimal control theory to the equilibrium treatment that underpins the free-energy
                        scheme. In this limit, cost is just the rate of change of value: $c(\tilde{x}) = f \cdot \nabla V = \dot{V}(\tilde{x}(t))$, as
                        mandated by the definition of value in Equation S4.1, which is the solution to the
                        (deterministic) Hamilton-Jacobi-Bellman equation (S4.2).
                        
                       Crucially, Equation S4.3 also shows that peaks of the equilibrium density can only exist where 
                       cost is zero or less 
                        
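                        $$\left. \begin{aligned} \nabla V(x) &= 0 \\ \nabla^2 V(x) &\leq 0 \end{aligned} \right\} \Rightarrow c(x) \leq 0 \qquad \text{(S4.4)}$$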
                        
                       with c(x) = 0  in the limit γ → 0.  
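
A simple simulation can illustrate S4.3 and S4.4 together. The sketch below assumes a one-dimensional quadratic value-function V(x) = −x²/2 with no solenoidal component (W = 0), so the flow is f = ∇V = −x, and it integrates the corresponding Langevin equation dx = f dt + sqrt(2γ) dW with an Euler-Maruyama scheme. The choice of V, the step size and the function names are illustrative assumptions. Under S4.3 the equilibrium density is proportional to exp(V/γ), a Gaussian with variance γ, and the implied cost at the peak x = 0 is −γ.

```python
import math
import random


def simulate_variance(gamma, dt=0.01, steps=200_000, seed=0):
    """Sample variance of x under dx = -x dt + sqrt(2 * gamma) dW."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * gamma * dt)
    x = 0.0
    total = 0.0
    total_sq = 0.0
    for _ in range(steps):
        # Flow f = grad V = -x for the toy value-function V(x) = -x**2 / 2.
        x += -x * dt + noise_scale * rng.gauss(0.0, 1.0)
        total += x
        total_sq += x * x
    mean = total / steps
    return total_sq / steps - mean * mean


def cost(x, gamma):
    """Second line of S4.3 for the toy model: c = f . grad V + gamma * laplacian V."""
    grad_v = -x
    f = grad_v            # no solenoidal flow in this sketch
    laplacian_v = -1.0
    return f * grad_v + gamma * laplacian_v


gamma = 0.5
print("sample variance:", simulate_variance(gamma), "expected about", gamma)
print("cost at the density peak:", cost(0.0, gamma))
```

The sample variance should be close to γ (up to sampling and discretisation error), and the cost at the peak is non-positive, tending to zero as γ → 0, in line with S4.4.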
                        
                       In summary, optimal control theory starts with a cost-function and solves for a value-function 
                       that  guides  the  flow  or  policy  to  minimise  expected  cost.  Conversely,  the  equilibrium 
              perspective starts with flow and derives the implicit value and cost-functions, where value is 
              inversely proportional to surprise. In the last supplementary information box (S5), we show 
              how cost  can  define  policies,  without  solving  the  (generally  intractable)  Hamilton-Jacobi-
              Bellman equation. 
             
                    Supplementary information S5 (box): Policies and cost 
                    This box describes a scheme that ensures agents are attracted to locations in state-space,
                    using prior expectations about the motion of hidden states; $\tilde{x}(t) = [x, x']^T \in X$ comprising
                    position and velocity. This formulation of how an ensemble density can be restricted to an
                    attractive subset of state-space $A \subset X$ rests on the Fokker-Planck description (see Frank
                    2004) of how the density changes with time
                     
                    $$\dot{p}(\tilde{x}|m) = \gamma \nabla^2 p - p \nabla \cdot f - f \cdot \nabla p$$
                     
                    At equilibrium, $\dot{p}(\tilde{x}|m) = 0$ and

                    $$p(\tilde{x}|m) = \frac{\gamma \nabla^2 p - f \cdot \nabla p}{\nabla \cdot f} \qquad \text{(S5.1)}$$
                     
                    Notice that as the divergence ∇⋅ f   increases, the sojourn time (i.e., the proportion of time a 
                    state is occupied) falls. Crucially, at the peaks of the ensemble density, the gradient is zero 
                    and its curvature is negative, which means the divergence must be negative (from Equation 
                    S5.1)  
                     
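                    $$\left. \begin{aligned} p &> 0 \\ \nabla p &= 0 \\ \nabla^2 p &< 0 \end{aligned} \right\} \Rightarrow \nabla \cdot f < 0 \qquad \text{(S5.2)}$$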
                     
                    This provides a simple and general mechanism to ensure peaks of the ensemble density lie
                    in, and only in, $A \subset X$. This is assured if $\nabla \cdot f(\tilde{x}) < 0$ when $\tilde{x} \in A$ and $\nabla \cdot f(\tilde{x}) \geq 0$
                    otherwise. We can exploit this using the generic equations of motion

                    $$f = \begin{bmatrix} x' \\ c x' - \partial_x \phi \end{bmatrix} \Rightarrow \nabla \cdot f = c \qquad \text{(S5.3)}$$
                  
                 This flow describes the Newtonian motion of a unit mass in a potential energy well ϕ(x,θ) , 
                 where cost plays the role of negative dissipation or friction. Crucially, under this policy or flow, 
                 divergence is simply cost; meaning the associated ensemble density can only have maxima 
                 in regions of negative cost. This provides a means to specify attractive regions  A⊂ X  by 
                 assigning them negative cost  
                  
                  $$\begin{aligned} c(x) &\leq 0 : x \in A \\ c(x) &> 0 : x \notin A \end{aligned} \qquad \text{(S5.4)}$$
                  
                 Put simply, this scheme ensures that agents are expelled from high-cost regions of state-
                 space and get ‘stuck’ in attractive regions. 
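
The scheme in S5.3 and S5.4 can be sketched as a small simulation. The setup below is an illustrative assumption rather than the author's model: position lives on a ring of circumference `RING` (so the agent cannot escape to infinity), the potential is flat (∂φ/∂x = 0), cost is −1 inside the attractive set A = [0, `A_END`) and +1 outside, and random fluctuations with amplitude γ act on velocity only. The code first checks by finite differences that the divergence of the flow equals the cost, then estimates the proportion of time an agent spends in A when started outside it.

```python
import math
import random

RING = 4.0    # hypothetical circumference of the position ring
A_END = 3.0   # attractive set A = [0, A_END) on the ring


def cost(x):
    """S5.4: negative cost inside A, positive cost outside."""
    return -1.0 if (x % RING) < A_END else 1.0


def flow(x, v):
    """S5.3 with a flat potential: f = [x', c(x) * x']."""
    return v, cost(x) * v


def divergence(x, v, h=1e-5):
    """Central finite-difference estimate of div f at (x, v)."""
    dfx_dx = (flow(x + h, v)[0] - flow(x - h, v)[0]) / (2 * h)
    dfv_dv = (flow(x, v + h)[1] - flow(x, v - h)[1]) / (2 * h)
    return dfx_dx + dfv_dv


def fraction_in_a(gamma=0.01, dt=0.01, steps=50_000, seed=1):
    """Proportion of time steps spent in A, starting outside A at rest."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * gamma * dt)
    x, v = 3.5, 0.0
    inside = 0
    for _ in range(steps):
        dx, dv = flow(x, v)
        x = (x + dx * dt) % RING
        v += dv * dt + noise_scale * rng.gauss(0.0, 1.0)
        if x < A_END:
            inside += 1
    return inside / steps


print("div f inside A:", divergence(1.0, 0.3), "cost:", cost(1.0))
print("div f outside A:", divergence(3.5, 0.3), "cost:", cost(3.5))
print("fraction of time in A:", fraction_in_a())
```

Outside A the positive cost acts as negative friction, so any small velocity is amplified and the agent is carried away; inside A the negative cost damps velocity and the agent lingers. With these settings the agent should spend most of its time in A, although the exact fraction depends on the seed and the assumed parameters.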
                  
                 In summary, the previous supplementary information box (S4) showed that any flow can be 
                 described in terms of a scalar value-function (and vector potential W ), from which an implicit 
                 cost-function can be derived. In this box (S5), we have addressed the inverse problem of how 
                 cost can be used to constrain flow, ensuring that it leads to attractive, low-cost states. The 
                 ensuing policy or flow can be used in a generative model of flow or state-transitions to provide 
                 predictions that action fulfils, under the free-energy principle. A full discussion of these and 
                 related ideas will be presented in Friston et al (in preparation). 
                  
                  
                 Reference 
                  Frank, T. D. (2004). Nonlinear Fokker-Planck Equations: Fundamentals and Applications.