Learning Dynamics and Robustness of
Vector Quantization and Neural Gas
Aree Witoelar a,∗, Michael Biehl a, Anarta Ghosh b, Barbara Hammer c
a University of Groningen, Mathematics and Computing Science, P.O. Box 800, NL-9700 AV Groningen, the Netherlands
b WaNPRC, University of Washington, Seattle, WA-98195, USA
c Clausthal University of Technology, Institute of Computer Science, D-98678 Clausthal-Zellerfeld, Germany
Abstract
Various alternatives have been developed to improve the Winner-Takes-All (WTA) mechanism in vector quantization, including the Neural Gas (NG). However, the behavior of these algorithms, including their learning dynamics, robustness with respect to initialization, and asymptotic results, has only partially been studied in a rigorous mathematical analysis. The theory of on-line learning allows for an exact mathematical description of the training dynamics in model situations. Using a system of three competing prototypes trained from a mixture of Gaussian clusters, we demonstrate that the Neural Gas can improve convergence speed and achieve robustness to initial conditions. However, depending on the structure of the data, the Neural Gas does not always obtain the best asymptotic quantization error.
Key words: Vector quantization, Clustering, Online learning, Winner-Takes-All algorithms, Neural Gas
∗ Corresponding author.
Email address: [email protected] (Aree Witoelar).
URL: http://www.cs.rug/∼aree (Aree Witoelar).

1. Introduction

Vector quantization (VQ) is an important unsupervised learning algorithm, widely used in different areas such as data mining, medical analysis, image compression, and speech or handwriting recognition [1]. The main objective of VQ is to represent the data points by a small number of prototypes or codebook vectors. This can directly be used for compression, clustering, data mining, or (with post-labeling of the prototypes) classification [9,14].

The basic "winner-takes-all" (WTA) or batch algorithms like the popular k-means clustering directly optimize the quantization error underlying vector quantization. However, these methods can be subject to confinement in local minima of the quantization error and can produce suboptimal results.

A variety of alternatives to overcome this problem have been proposed, some of which are heuristically motivated while others are based on the minimization of a cost function related to the quantization error: the self-organizing map (SOM) [12], fuzzy k-means [2], and stochastic optimization [7], to name just a few. These algorithms have in common that each pattern influences more than one prototype at a time through a "winner-takes-most" paradigm. Neural Gas (NG), as proposed in [13], is a particularly robust variant of vector quantization based on the introduction of neighborhood relations. Unlike the self-organizing map [12], the NG system takes into account the relative distances between prototypes in the input space rather than distances on a predefined lattice. In practice, NG algorithms yield better solutions than WTA; however, the effect of this strategy on convergence speed or asymptotic behavior has hardly been rigorously investigated so far.
Methods from statistical physics and the theory of on-line learning [8] allow for an exact mathematical description of learning systems for high-dimensional data. In the limit of infinite dimensionality, such systems can be fully described in terms of a few characteristic quantities, the so-called order parameters. The evolution of these order parameters along the training procedure is characterized by a set of coupled ordinary differential equations (ODE). By integrating these ODEs, it is possible to analyse the performance of VQ algorithms in terms of stability, sensitivity to initial conditions, and achievable quantization error. This successful approach has also been reviewed in [8,16], among others.

The extension of the theoretical analysis of simple (WTA-based) vector quantization with two prototypes and two clusters introduced in an earlier work [5] is not straightforward. Additional prototypes and clusters introduce more complex interactions in the system that can result in radically different behaviors. Also, the mathematical treatment becomes more involved and requires, for instance, several numerical integrations. Here we introduce an additional prototype and a mixture of clusters. We investigate not only WTA but also the popular Neural Gas approach [13] for vector quantization. This is an important step towards the investigation of general VQ approaches based on neighborhood interaction such as self-organizing maps.

2. Winner-Takes-All and Neural Gas

Assume input data ξ ∈ ℝ^N, generated according to a given probability density function P(ξ). Vector quantization represents the input data in the same N-dimensional space by a set of prototypes W = {w_i ∈ ℝ^N}_{i=1}^S. The primary goal of VQ is to find a faithful representation by minimizing the so-called quantization or distortion error

E(W) = \frac{1}{2} \int d\xi\, P(\xi) \sum_{i=1}^{S} d(\xi, w_i) \prod_{j \neq i} \Theta_{ij} \;-\; \frac{1}{2} \int d\xi\, P(\xi)\, \xi^2    (1)

where Θ_ij ≡ Θ(d(ξ, w_j) − d(ξ, w_i)). For each input vector ξ the closest prototype w_i is singled out by the product of Heaviside functions, with Θ(x) = 0 if x < 0 and Θ(x) = 1 otherwise. Here we restrict ourselves to the quadratic Euclidean distance measure d(ξ, w_i) = (ξ − w_i)^2. The constant term (1/2) ∫ dξ P(ξ) ξ^2 is independent of the prototype positions and is subtracted for convenience.
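As an illustration (this sketch and its function name are ours and not part of the original analysis), the empirical counterpart of the first term in Eq. (1) for a finite sample is the mean winner-takes-all distortion under the quadratic Euclidean distance, with the prototype-independent constant subtracted as in the text:

    import numpy as np

    def quantization_error(X, W):
        """Empirical version of Eq. (1) for samples X (P x N) and prototypes W (S x N).
        Each sample is assigned to its closest prototype (winner-takes-all)."""
        d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # d(xi, w_i), shape (P, S)
        winners = d.argmin(axis=1)                               # index of the winning prototype
        distortion = 0.5 * d[np.arange(len(X)), winners].mean()
        # subtract the prototype-independent constant (1/2) <xi^2>, as in Eq. (1)
        return distortion - 0.5 * (X ** 2).sum(axis=1).mean()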
The input data is presented sequentially during training and one or more prototypes are updated on-line. Algorithms studied here can be interpreted as stochastic gradient descent procedures with respect to a cost function H(W) related to E(W). The generalized form reads

H(W) = \frac{1}{2} \int d\xi\, P(\xi) \sum_{i=1}^{S} d(\xi, w_i)\, f(r_i) \;-\; \frac{1}{2} \int d\xi\, P(\xi)\, \xi^2    (2)

where r_i is the rank of prototype w_i with respect to the distance d(ξ, w_i), i.e. r_i = S − \sum_{j \neq i} \Theta_{ij}. Rank r_J = 1 corresponds to the so-called winner, i.e. the prototype w_J closest to the example ξ. The rank function f(r_i) determines the update strength for the set of prototypes and satisfies the normalization \sum_{i=1}^{S} f(r_i) = 1; note that it does not depend explicitly on distances but only on the ordering of the prototypes with respect to the current example.

The corresponding stochastic gradient descent in H(W) is of the form

w_i^{\mu} = w_i^{\mu-1} + \Delta w_i^{\mu-1}   with   \Delta w_i^{\mu-1} = \frac{\eta}{N}\, f(r_i)\, (\xi^{\mu} - w_i^{\mu-1})    (3)

where η is the learning rate and ξ^μ is a single example drawn independently at time step μ of the sequential training process. We compare two different algorithms:

(i) WTA: Only one prototype, the winner, is updated for each input. The cost function directly minimizes the quantization error with H(W) = E(W). The corresponding rank function is

f_{WTA}(r_i) = \prod_{j \neq i} \Theta_{ij}    (4)

(ii) Neural Gas: The update strength decays exponentially with the rank, controlled by a parameter λ. The rank function is f(r_i) = h_λ(r_i)/C(λ), where h_λ(r_i) = \exp(-r_i/\lambda) and C(\lambda) = \sum_{r_i=1}^{S} \exp(-r_i/\lambda) is a normalization constant. The parameter λ is adjusted during training; it is frequently set large initially and decreased in the course of training. Note that for λ → 0 the NG algorithm becomes identical with WTA. We divide f(r_i) according to its ranks as
f_{NG}(r_i) = \frac{1}{C(\lambda)} \sum_{k=1}^{S} h_\lambda(k)\, g_i(k)    (5)

where g_i(k) = 1 if r_i = k and 0 otherwise, with \sum_k g_i(k) = 1. In a model with three prototypes, this can be written in terms of Heaviside functions

g_i(1) = \prod_{j \neq i} \Theta_{ij}
g_i(2) = \sum_{k \neq i} \prod_{j \neq k,i} \Theta_{ij}\, (1 - \Theta_{ik})
g_i(3) = \prod_{j \neq i} (1 - \Theta_{ij}).    (6)
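The update rule of Eq. (3) together with the rank functions of Eqs. (4) and (5) can be realized directly at finite N. The following sketch is our own illustration; the function names and the convention that the prototypes are stored as the rows of a matrix W are assumptions of this example, not of the paper:

    import numpy as np

    def rank_functions(xi, W, lam):
        """f_WTA and f_NG for a single example xi and prototypes W (S x N), Eqs. (4)-(5)."""
        d = ((W - xi) ** 2).sum(axis=1)            # quadratic Euclidean distances d(xi, w_i)
        ranks = d.argsort().argsort() + 1          # r_i = 1 for the winner, ..., S for the last
        f_wta = (ranks == 1).astype(float)         # Eq. (4): only the winner is updated
        C = np.exp(-np.arange(1, len(W) + 1) / lam).sum()
        f_ng = np.exp(-ranks / lam) / C            # Eq. (5): update strength decays with rank
        return f_wta, f_ng

    def online_step(W, xi, eta, f):
        """One stochastic gradient step of Eq. (3): w_i += (eta/N) f(r_i) (xi - w_i)."""
        N = W.shape[1]
        return W + (eta / N) * f[:, None] * (xi - W)

A training sweep simply repeats online_step over a random sequence of examples, using f_wta or f_ng as the weighting.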
3. Model

We choose the model data as a mixture of M spherical Gaussian clusters:

P(\xi) = \sum_{\sigma=1}^{M} p_\sigma P(\xi|\sigma)   with   P(\xi|\sigma) = \frac{1}{(2\pi \upsilon_\sigma)^{N/2}} \exp\!\left( -\frac{1}{2\upsilon_\sigma} \left( \xi - \ell_\sigma B_\sigma \right)^2 \right)    (7)

where p_σ are the prior probabilities of the clusters. The components of the vector ξ^μ are random numbers according to a Gaussian distribution with mean vectors ℓ_σ B_σ and variance υ_σ. The mean vectors are orthonormal, i.e. B_i · B_j = δ_ij where δ is the Kronecker delta. The parameters ℓ_σ describe the separation between the clusters. This model can be extended to mixtures of many Gaussian clusters with M < N by choosing a coordinate system in which the orthonormality conditions B_i · B_j = δ_ij are fulfilled.

Note that the Gaussian clusters strongly overlap in high dimensions. The separation between the clusters is apparent only in the ℝ^M subspace spanned by {B_1, B_2, ..., B_M}. It is therefore a non-trivial task to detect the structure of the data in N dimensions.
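Data from the mixture (7) can be generated at finite N as in the following sketch (our own code; the choice of B_σ as the first M canonical unit vectors is one convenient realization of the orthonormality condition):

    import numpy as np

    def generate_data(P, N, priors, ell, v, rng=None):
        """Draw P examples from the mixture of M spherical Gaussians of Eq. (7).
        Cluster sigma has mean ell[sigma] * B_sigma and variance v[sigma]."""
        rng = rng or np.random.default_rng(0)
        priors, ell, v = map(np.asarray, (priors, ell, v))
        M = len(priors)
        B = np.eye(M, N)                                  # orthonormal centers, B_i . B_j = delta_ij
        labels = rng.choice(M, size=P, p=priors)
        means = ell[labels, None] * B[labels]             # ell_sigma * B_sigma
        noise = rng.normal(size=(P, N)) * np.sqrt(v[labels])[:, None]
        return means + noise, labels

For instance, generate_data(1000, 200, [0.45, 0.55], [1.0, 1.0], [1.0, 1.0]) corresponds to the two-cluster setting of Fig. 1 at the dimensionality N = 200 used for the Monte Carlo comparison mentioned below.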
4. Analysis of Learning Dynamics

We give a brief description of the theoretical framework and refer to [3] for further details. Following the lines of the theory of on-line learning, e.g. [8], in the thermodynamic limit N → ∞ the system can be fully described in terms of a few characteristic quantities, the so-called order parameters. A suitable set of characteristic quantities for the considered learning model is:

R_{i\sigma}^{\mu} = w_i^{\mu} \cdot B_\sigma   and   Q_{ij}^{\mu} = w_i^{\mu} \cdot w_j^{\mu}.    (8)

Note that R_iσ are the projections of the prototype vectors w_i on the center vectors B_σ, and Q_ij^μ correspond to the self- and cross-overlaps of the prototype vectors.
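In a Monte Carlo simulation at finite N, these quantities follow directly from the prototype matrix W (S rows of dimension N) and the matrix B of center vectors (M rows); a minimal sketch in our own notation:

    def order_parameters(W, B):
        """Order parameters of Eq. (8): R[i, sigma] = w_i . B_sigma, Q[i, j] = w_i . w_j."""
        return W @ B.T, W @ W.T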
From the generic update rule defined above, Eq. (3), we can derive the following recursions in terms of the order parameters:

\frac{R_{i\sigma}^{\mu} - R_{i\sigma}^{\mu-1}}{1/N} = \eta f(r_i) \left( b_\sigma^{\mu} - R_{i\sigma}^{\mu-1} \right)
\frac{Q_{ij}^{\mu} - Q_{ij}^{\mu-1}}{1/N} = \eta \left[ f(r_j) \left( h_i^{\mu} - Q_{ij}^{\mu-1} \right) + f(r_i) \left( h_j^{\mu} - Q_{ij}^{\mu-1} \right) \right] + \eta^2 f(r_i) f(r_j) \frac{(\xi^{\mu})^2}{N} + O\!\left(\frac{1}{N}\right)    (9)

where h_i^μ and b_σ^μ are the projections of the input data vector ξ^μ:

h_i^{\mu} = w_i^{\mu-1} \cdot \xi^{\mu}   and   b_\sigma^{\mu} = B_\sigma \cdot \xi^{\mu}.    (10)

Note that the last two terms in Eq. (9) come from (ξ^μ)^2 − h_i^μ − h_j^μ + Q_ij^{μ−1}, where (ξ^μ)^2 is the only term that scales with N.

In the limit N → ∞, the O(1/N) term can be neglected and the order parameters become self-averaging [15] with respect to the random sequence of examples. This means that fluctuations of the order parameters vanish and the system dynamics can be described exactly in terms of their mean values. Also, for N → ∞ the rescaled quantity t ≡ μ/N can be conceived as a continuous time variable. Accordingly, the dynamics can be described by a set of coupled ODE [3,10] after performing an average over the sequence of input data:

\frac{dR_{i\sigma}}{dt} = \eta \left( \langle b_\sigma f(r_i) \rangle - \langle f(r_i) \rangle R_{i\sigma} \right)
\frac{dQ_{ij}}{dt} = \eta \left( \langle h_i f(r_j) \rangle - \langle f(r_j) \rangle Q_{ij} + \langle h_j f(r_i) \rangle - \langle f(r_i) \rangle Q_{ij} \right) + \eta^2 \sum_\sigma p_\sigma \upsilon_\sigma \langle f(r_i) f(r_j) \rangle_\sigma    (11)

where ⟨·⟩ denotes the average over the density P(ξ) and ⟨·⟩_σ the conditional average over P(ξ|σ).
Here we exploit the following relation in the last term of dQ_ij/dt in Eq. (11):

\lim_{N \to \infty} \frac{\langle \xi^2 \rangle}{N} = \lim_{N \to \infty} \frac{1}{N} \sum_\sigma p_\sigma \left( \upsilon_\sigma N + \ell_\sigma^2 \right) = \sum_\sigma p_\sigma \upsilon_\sigma.

Exploiting the limit N → ∞ once more, the quantities h_i^μ, b_σ^μ become correlated Gaussian quantities by means of the Central Limit Theorem. Thus the above averages reduce to integrations in S + M dimensions over the joint density P(x), where x = (h_1^μ, h_2^μ, ..., h_S^μ, b_1^μ, b_2^μ, ..., b_M^μ), which are fully specified by first and second moments [3,6]:

\langle h_i^{\mu} \rangle_\sigma = \ell_\sigma R_{i\sigma}^{\mu-1}, \qquad \langle b_\tau^{\mu} \rangle_\sigma = \ell_\sigma \delta_{\tau\sigma}
\langle h_i^{\mu} h_j^{\mu} \rangle_\sigma - \langle h_i^{\mu} \rangle_\sigma \langle h_j^{\mu} \rangle_\sigma = \upsilon_\sigma Q_{ij}^{\mu-1}
\langle b_\tau^{\mu} b_\rho^{\mu} \rangle_\sigma - \langle b_\tau^{\mu} \rangle_\sigma \langle b_\rho^{\mu} \rangle_\sigma = \upsilon_\sigma \delta_{\tau\rho}
\langle h_i^{\mu} b_\tau^{\mu} \rangle_\sigma - \langle h_i^{\mu} \rangle_\sigma \langle b_\tau^{\mu} \rangle_\sigma = \upsilon_\sigma R_{i\tau}^{\mu-1}.    (12)

Hence the joint density of h_i^μ, b_τ^μ is given in terms of the order parameters defined in Eq. (8). While most of the integrations can be performed analytically, some have to be implemented numerically. See the Appendix for the computations.
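Where the required averages are not computed analytically, they can also be estimated numerically by sampling the (S + M)-dimensional Gaussian density specified by the moments (12). The following sketch illustrates this for averages of the form ⟨b_σ f(r_i)⟩_σ; it is our own illustration and not the procedure of the Appendix (f_of_ranks is assumed to map a rank vector (r_1, ..., r_S) to the weights (f(r_1), ..., f(r_S)), e.g. one of the rank functions of Section 2):

    import numpy as np

    def conditional_averages(R, Q, ell, v, sigma, f_of_ranks, n=200000, rng=None):
        """Monte Carlo estimate of <b_sigma f(r_i)>_sigma, i = 1..S, using the fact that
        x = (h_1..h_S, b_1..b_M) is Gaussian with the moments of Eq. (12)."""
        rng = rng or np.random.default_rng(1)
        S, M = R.shape
        mean = np.concatenate([ell[sigma] * R[:, sigma], ell[sigma] * np.eye(M)[sigma]])
        cov = v[sigma] * np.block([[Q, R], [R.T, np.eye(M)]])
        x = rng.multivariate_normal(mean, cov, size=n)
        h, b = x[:, :S], x[:, S:]
        # ranks follow from the distances d(xi, w_i) = xi^2 - 2 h_i + Q_ii,
        # whose ordering for a fixed xi depends only on Q_ii - 2 h_i
        ranks = (np.diag(Q) - 2.0 * h).argsort(axis=1).argsort(axis=1) + 1
        f = np.apply_along_axis(f_of_ranks, 1, ranks)        # f(r_i) per sample, shape (n, S)
        return (b[:, [sigma]] * f).mean(axis=0)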
Given the averages for a specific rank function f(r_i), cf. Eqs. (B.7) and (B.14), we obtain a closed-form expression of the ODEs. Using the initial conditions R_iσ(0), Q_ij(0), we integrate this system for a given algorithm and obtain the evolution of the order parameters in the course of training, R_iσ(t), Q_ij(t). The behavior of the system depends on the characteristics of the data and the parameters of the learning scheme, i.e. the offsets of the clusters ℓ_σ, the variances within the clusters υ_σ, the learning rate η, and, for NG, the rank function parameter λ. As shown in [5], this method of analysis is in good agreement with large scale Monte Carlo simulations of the same learning systems for dimensionality as low as N = 200.
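Schematically, the integration can be carried out with a simple first-order (Euler) scheme. The sketch below is our own; averages(R, Q) stands for any routine that supplies ⟨b_σ f(r_i)⟩, ⟨h_i f(r_j)⟩ and ⟨f(r_i)⟩, for instance the Monte Carlo estimator above, and the η² term of Eq. (11) is omitted, as is appropriate for small learning rates:

    def integrate_ode(R, Q, averages, eta, t_max, dt=0.01):
        """Euler integration of the ODE system, Eq. (11), without the eta^2 term.
        averages(R, Q) returns <b_sigma f(r_i)> (S x M), <h_i f(r_j)> (S x S), <f(r_i)> (S,)."""
        t = 0.0
        while t < t_max:
            bf, hf, fm = averages(R, Q)
            R = R + dt * eta * (bf - fm[:, None] * R)
            Q = Q + dt * eta * (hf + hf.T - (fm[:, None] + fm[None, :]) * Q)
            t += dt
        return R, Q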
Analogously, the quantization error, Eq. (1), can be expressed in terms of order parameters:

E(W) = \frac{1}{2} \sum_{i=1}^{S} \Big\langle \prod_{j \neq i} \Theta_{ij} \Big\rangle\, Q_{ii} \;-\; \sum_{i=1}^{S} \Big\langle h_i \prod_{j \neq i} \Theta_{ij} \Big\rangle    (13)

Note that E(W) does not depend explicitly on ξ; here it is shown how the subtracted constant term described in Eq. (1) and Eq. (2) becomes useful. Plugging in the values of the order parameters computed by integrating the ODEs, {R_iσ(t), Q_ij(t)}, we can study the so-called learning curve E as a function of the training time t for a given VQ algorithm.

5. Results

5.1. Learning Dynamics

We study the performance of both WTA and NG in several cases using three prototypes and up to three clusters. Stochastic gradient descent procedures approach a (local) minimum of the objective function in the limit η → 0. We can consider this limit exactly by rescaling the learning time as t̃ = ηt. Then, the O(η²) terms in Eq. (11) can be neglected and the set of ODEs is simplified.

[Figure 1: panels (a) and (b); axes R_i1 (horizontal) vs. R_i2 (vertical).]
Fig. 1. Trajectories of prototypes on the plane spanned by B_1 and B_2. The cluster centers ℓ_σ B_σ are marked by crosses. The trajectories are marked by solid lines (WTA) and dashed lines (NG). The prototypes at initialization are marked with squares and at t̃ = 10 with circles (WTA) and dots (NG). Both algorithms converge at the triangles, where two prototypes coincide at {−0.07, 1.07}. The set of prototypes is initially set (a) near the cluster centers, and (b) far away from the cluster centers. In both figures the parameters are p_1 = 0.45, ℓ_1 = ℓ_2 = 1, υ_1 = υ_2 = 1, η → 0, λ_i = 2, λ_f = 0.01 and t̃_f = 50.
[Figure 2: panels (a) and (b); order parameters R_i2 versus learning time.]
Fig. 2. The corresponding order parameters R_i2 at learning time t̃ = ηt for the WTA (solid lines) and NG (dashed lines) algorithms in the system described in Fig. 1. The initial sets of prototypes are defined in Figs. 1(a) and (b), respectively.

[Figure 3: panels (a) and (b); quantization error E(W) versus learning time, curves for WTA and NG.]
Fig. 3. Evolution of the quantization error E(W) in Fig. 1 at learning time t̃ = ηt for the WTA (solid line) and NG (dashed line) algorithms. The prototypes are initialized (a) near the cluster centers and (b) far away from the cluster centers.

For all demonstrations, the NG algorithm is studied for decreasing λ with λ(t̃) = λ_i (λ_f/λ_i)^{t̃/t̃_f}, where t̃_f is a learning time parameter. The influence of the initial set of prototypes on the learning curves is investigated by choosing different values of {R_iσ(0), Q_ij(0)}.
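This annealing schedule is reproduced by the following one-line sketch (our own; the default values are the ones quoted in the caption of Fig. 1):

    def ng_lambda(t, lam_i=2.0, lam_f=0.01, t_f=50.0):
        """Annealing schedule lambda(t~) = lambda_i * (lambda_f / lambda_i)**(t~ / t~_f)."""
        return lam_i * (lam_f / lam_i) ** (t / t_f)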
Figure 1 presents the prototype dynamics in a system with three prototypes and two clusters. We examine two different initial sets of prototypes: close to the origin at {R_i1(0), R_i2(0)} ≈ {0, 0}, Q_ij(0) ≈ 0, ∀{i,j} in Fig. 1(a); and far away from the origin on the side of the weaker cluster, viz. p_1, at {R_i1(0), R_i2(0)} ≈ {3, −2}, Q_ij(0) = R_iσ(0) · R_jσ(0), ∀{i,j} in Fig. 1(b). While the prototypes have different trajectories in the WTA and NG algorithms, they converge to an identical configuration at large t and λ → 0. Here, the projections of two prototypes converge near the center of the stronger cluster. The advantage of NG is apparent in Fig. 1(b), where all prototypes already reach the area near the cluster centers at an intermediate learning stage t̃ = 10.

This can be illustrated with the evolution of the order parameters R_i2(t) in Fig. 2. In Fig. 2(a), the order parameters of both algorithms converge relatively fast. In Fig. 2(b), the order parameters of one prototype change rapidly compared to those of the other prototypes in the WTA algorithm. One prototype dominates as the winner and gets frequent updates towards the cluster centers, while the other prototypes are rarely updated. The NG algorithm partially solves this problem by updating all prototypes at the initial stages of learning.

The quantization error obtained from the order parameters {R_iσ(t̃), Q_ij(t̃)} is displayed in Fig. 3. We observe that the quantization error decreases faster in the WTA algorithm compared to NG methods at the initial stages of the learning. This behavior can be explained by the fact that H_NG differs from E(W) by smoothing terms, in particular in early stages of training.
[Figure 4: panels (a) through (f); axes R_i1 (horizontal) vs. R_i2 (vertical).]
Fig. 4. Trajectories of the prototypes on the plane spanned by B_1 and B_2, corresponding to the WTA (solid lines) and the NG (dashed lines) algorithms. Here, p_1 = 0.45, p_2 = 0.55, υ_1 = 1, υ_2 = 1.21, ℓ_1 = 1 and ℓ_2 = 5. The cluster centers ℓ_σ B_σ are marked by ×. The initial prototype configurations for both algorithms are marked with squares. While the asymptotic configuration of the WTA algorithm (circles) depends on the initialization, the NG algorithm (dots) always produces identical asymptotic configurations. In these cases, the NG algorithm always finds the optimal quantization error.

We observe that WTA yields the best overall quantization error for the first set of initial values in Fig. 3(a). This is mirrored by the fact that, for large t̃ and λ_f → 0, both algorithms yield the same quantization error.

For WTA training, the prototypes reach t̃ → ∞ asymptotic positions corresponding to the global minimum of E(W) for small learning rates η → 0. However, learning can slow down significantly at intermediate stages of the training process. Transient configurations may persist in the vicinity of local minima and can indeed dominate the training process. The NG is more robust with respect to the initial positions of the prototypes than WTA while achieving the best quantization error asymptotically.

5.2. Asymptotic configuration

The dynamics of the prototypes while learning on model data with a larger separation between the clusters are presented in Fig. 4. The initial configurations correspond to the following values of {R_i1(0), R_i2(0)}: (a) {−1, 2}, (b) {−0.5, 2}, (c) {−1, 1.5}, (d) {−0.5, 1.5}, (e) {−1, 1} and (f) {−0.5, 1}. In all panels, Q_ij(0) = R_iσ(0) · R_jσ(0). In this case, the optimal configuration is the one with two prototypes representing the stronger cluster, as in Figs. 4(a to c). However, the asymptotic configurations of the prototypes in the WTA algorithm are sensitive to the initial conditions. In some cases, viz. Figs. 4(d to f), this configuration is not the optimal set of prototypes. Therefore, even in this comparably simple model, prototypes in WTA can be confined in suboptimal local minima of the cost function E(W). The issue of different regions of initialization which lead to different asymptotic configurations is to be discussed in forthcoming projects.

The asymptotic configurations for the NG algorithm are independent of initial conditions, as shown in Figs. 4(a to f). During the learning process with λ > 0 the system moves towards intermediate configurations with minimum H_NG(W). Given sufficiently large λ and t̃, these configurations are identical and therefore the NG algorithm is robust with respect to initial conditions. In these cases, the asymptotic configuration is the optimal configuration and thus the NG algorithm achieves optimal performance.

We demonstrate a model where the NG algorithm does not yield optimal performance in Fig. 5. In this more complex situation, the weaker cluster (p_σ = 0.45) is divided into two Gaussian clusters with p_{1,2} = {0.25, 0.20}. This corresponds to a system of three clusters, with ℓ_σ = {1, 1, 5} and p_σ = {0.25, 0.20, 0.55}. The distance between the first two clusters is small compared to their distance to the third cluster. In comparison to the previous case, where the weaker cluster spreads out evenly in all directions, here it has a particular orientation along the vector (B_1 − B_2). Because of this structure, the best quantization error is obtained when one prototype is placed near each cluster center, as in Fig. 5(a), even though one cluster has a very

[Figure 5: panels (a) through (d); axes R_i1 (horizontal) vs. R_i3 (vertical); (a) E(W) = −13.261, (b) E(W) = −13.079, (c) E(W) = −13.079, (d) E(W) = −13.079.]
Fig. 5. (a) The optimal set of prototypes (solid dots) in a system with three clusters, projected on the plane spanned by {B_1, B_3}. The values of R_i2 are not shown here. The cluster centers ℓ_σ B_σ are marked by ×. (b,c,d) Trajectories

