\documentclass{article}
\usepackage{bm,spconf,amsmath,amssymb,graphicx,ngerman}
\selectlanguage{USenglish}

% Example definitions.
% --------------------
\def \x{{\mathbf x}}
\def \v{{\mathbf v}}
\def \w{{\mathbf w}}
\def \M{{\mathbf M}}
\def \m{{\bm \mu}}
\def \k{{\mathbf \Sigma}}
\def \nv{{\mathcal N}}
\def \L{{\cal L}}

% Title.
% ------
\title{REVISITING SEMI-CONTINUOUS HIDDEN MARKOV MODELS}

%
% Single address.
% ---------------
\name{
K.~Riedhammer$^1$,
T.~Bocklet$^1$,
A.~Ghoshal$^{2,3}$,
D.~Povey$^4$
}
\address{
$^1$Pattern Recognition Lab, University of Erlangen-Nuremberg, {\sc Germany}\\
$^2$Spoken Language Systems, Saarland University, {\sc Germany}\\
$^3$Centre for Speech Technology Research, University of Edinburgh, UK\\
$^4$Microsoft Research, Redmond, WA, {\sc USA}\\
{\small\tt korbinian.riedhammer@informatik.uni-erlangen.de}
}

\begin{document}
\ninept

\maketitle

\begin{abstract}
In the past decade, semi-continuous hidden Markov models (SC-HMMs) have not 
attracted much attention in the speech recognition community. Growing amounts 
of training data and increasing sophistication of model estimation led to the 
impression that continuous HMMs are the best choice of acoustic model.
%
However, recent work on recognition of under-resourced languages faces the same
old problem of estimating a large number of parameters from limited amounts 
of transcribed speech.
This has led to a renewed interest in methods of reducing the number of parameters 
while maintaining or extending the modeling capabilities of continuous models.
%
In this work, we compare classic and multiple-codebook semi-continuous models
using diagonal and full covariance matrices with continuous HMMs and subspace Gaussian 
mixture models.  
Experiments on the RM and WSJ corpora show that while a classical semi-continuous
system does not perform as well as a continuous one, multiple-codebook semi-continuous
systems can perform better, particularly when using full-covariance Gaussians.
\end{abstract}

\begin{keywords}
automatic speech recognition, acoustic modeling
\end{keywords}

\section{Introduction}
\label{sec:intro}
In recent years, semi-continuous hidden Markov models (SC-HMMs) \cite{huang1989shm},
in which the Gaussian means and variances are shared among all the HMM states
and only the weights differ, have not attracted much attention in the automatic 
speech recognition (ASR) community. 
Most major frameworks (e.g.~CMU {\sc Sphinx} or HTK) have more or less retired
semi-continuous models in favor of continuous models, which performed 
considerably better with the growing amount of available training data.

We were motivated by a number of factors to look again at semi-continuous models.
One is that we are interested in applications where there is a small amount
of training data (or a small amount of in-domain training data), and since
in semi-continuous models the number of state-specific parameters is relatively
small, we felt that they might have an advantage there.  We were also interested
in improving the open-source speech recognition toolkit {\sc Kaldi}~\cite{povey2011tks},
and felt that since a multiple-codebook form of semi-continuous models is, to
our knowledge, currently being used in a state-of-the-art system~\cite{prasad2004t2b},
we should implement this type of model.  We also wanted to experiment with
the style of semi-continuous system described
in \cite{schukattalamazzini1994srf,schukattalamazzini1995as}, particularly the smoothing
techniques, and see how these compare against a more conventional system.
Lastly, we felt that it would be interesting to compare semi-continous
models with Subspace Gaussian Mixture Models (SGMMs)~\cite{povey2011sgm}, since
there are obvious similarities.

In the experiments we report here we have not been able to match the performance of
a standard diagonal GMM based system using a traditional semi-continuous system;
however, using a two-level tree and multiple codebooks similar to~\cite{prasad2004t2b},
we were able to get better performance than a traditional system (although not
as good as SGMMs).  For both traditional and multiple-codebook configurations,
we saw better results with full-covariance codebook Gaussians than with diagonal.
We investigated smoothing the statistics for weight estimation using the decision
tree, and saw only modest improvements.
We note that the software used to do the experiments is part of the {\sc Kaldi} speech
recognition toolkit~\cite{povey2011tks} and is freely available for download, along with the
example scripts to reproduce these results.

In Section~\ref{sec:am} we describe the traditional and tied-mixture models,
as well as the multiple-codebook variant of the tied system, and describe some
details of our implementation, such as how we build the two-level tree.
In Section~\ref{sec:smooth} we describe two different forms of smoothing of the statistics,
which we experiment with here.  
In Section~\ref{sec:sys} we describe the systems we used (we are using
the Resource Management and Wall Street Journal databases);
in Section~\ref{sec:results} we give experimental results, and we
summarize in Section~\ref{sec:summary}.

% TODO Fix the number of hours for RM and WSJ SI-284
%In this article, we experiment with classic SC-HMMs and two-level tree based 
%multiple-codebook SC-HMMs which are, in terms of parameter sharing, somewhere in 
%between continuous and subspace Gaussian mixture models; 
%we compare the above four types of acoustic models on a small (Resource Management, RM, 5 hours)
%pand medium sized corpus (Wall Street Journal, WSJ, 17 hours) of read English while keeping acoustic 
%frontend, training and decoding unchanged. 
%The software is part of the {\sc Kaldi} speech recognition toolkit 
%\cite{povey2011tks} and is freely available for download, along with the
%example scripts to reproduce the presented experiments.


\section{Acoustic Modeling}
\label{sec:am}
In this section, we summarize the different forms of acoustic model we use: 
the continuous, semi-continuous, multiple-codebook semi-continuous, and 
subspace forms of Gaussian Mixture Models. 
We also describe the phonetic decision-tree building process, which is 
necessary background for how we build the multiple-codebook systems.

\subsection{Acoustic-Phonetic Decision Tree Building}
The acoustic-phonetic decision tree provides a mapping from
a phonetic context window (e.g. a triphone), together with an HMM-state index,
to an emission probability density function (pdf). 
%
The statistics for the tree clustering consist of the 
sufficient statistics needed to estimate a Gaussian for each seen combination 
of (phonetic context window, HMM-state).  These statistics are based on a Viterbi 
alignment of a previously built system.  The roots
of the decision trees can be configured; for the RM experiments, these correspond to
the phones in the system, and on Wall Street Journal, they
correspond to the ``real phones'', i.e. grouping different stress and
position-marked versions of the same phone together.  

The splitting procedure can ask questions not only about the context phones,
but also about the central phone and the HMM state; the questions about phones are 
derived from an automatic clustering procedure.  The splitting procedure is
greedy and optimizes the likelihood of the data given a single Gaussian in each
tree leaf; this is subject to a fixed variance floor to avoid problems caused
by singleton counts and numerical roundoff.  We typically split up to a 
specified number of leaves and then cluster the resulting leaves as long as the
likelihood loss due to combining them is less than the likelihood improvement
on the final split of the tree-building process; and we restrict this 
clustering to disallow combining leaves across different subtrees (e.g., from 
different monophones).  The post-clustering process typically reduces the number 
of leaves by around 20\%.
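
Both the splitting gain and the post-clustering loss are computed from the same
single-Gaussian likelihood, which (a standard identity, ignoring the variance floor)
has a simple closed form: for a node with occupancy $\gamma$ and maximum-likelihood
Gaussian covariance $\k$ estimated from its statistics in $d$ dimensions, the data
log-likelihood is
\[
\L = -\frac{\gamma}{2} \left( \log \det \k + d \left( 1 + \log 2\pi \right) \right) ,
\]
so the gain of splitting a node into children $l$ and $r$ is
$\L_l + \L_r - \L_{l \cup r}$ and can be evaluated from the accumulated
statistics alone.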

\subsection{Continuous Models}
For continuous models, every leaf of the tree is assigned a separate
Gaussian mixture model. Formally, the emission probability of tree leaf $j$ is 
computed as
\begin{equation}
p(\x | j) = \sum_{i=1}^{N_j} c_{ji} \nv(\x; \m_{ji}, \k_{ji}) 
\end{equation}
where $N_j$ is the number of Gaussians assigned to $j$, and the $\m_{ji}$ and
$\k_{ji}$ are the means and covariance matrices of the mixtures.

We initialize the mixtures with a single component each, and subsequently 
allocate more components by splitting components at every iteration until a 
pre-determined total number of components is reached. We allocate 
the Gaussians to the leaves in proportion to a small power (e.g.~0.2) of each state's
occupation count.
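To illustrate this allocation rule with hypothetical counts: since
$100^{0.2} \approx 2.5$ and $3200^{0.2} \approx 5.0$, a leaf with 32 times the
occupation count of another receives only about twice as many Gaussians, so the
allocation grows much more slowly than the counts themselves.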

\subsection{Semi-continuous Models}
The idea of semi-continuous models is to have a large number of Gaussians that
are shared by all tree leaves, with each leaf using individual weights; the emission
probability of tree leaf $j$ is then computed as
\begin{equation}
p(\x | j) = \sum_{i=1}^{N} c_{ji} \; \nv(\x; \m_i, \k_i) 
\end{equation}
where $c_{ji}$ is the weight of Gaussian $i$ for leaf $j$. As the means
and covariances are shared, increasing the number of states implies only
a small increase in the number of parameters, even for full-covariance Gaussians.
Furthermore, the Gaussians need to be evaluated only once for each $\x$.

Another advantage concerns the initialization and adaptation of the codebook: it can be
initialized and adapted in a fully unsupervised manner using expectation
maximization (EM), maximum a-posteriori (MAP) estimation, maximum likelihood linear
regression (MLLR) and similar algorithms on untranscribed audio data.

For better performance, we initialize the codebook using the tree statistics
collected on a prior phone alignment. For each tree leaf, we include
the respective Gaussian as a component in the mixture.
The components are merged and (if necessary) split to eliminate low count 
Gaussians and to match the desired number of components. In a final step, 
5 EM iterations are used to improve the goodness of fit to the acoustic 
training data.

\subsection{Multiple-Codebook Semi-continuous Models}
A style of system that lies between the semi-continuous and continuous types of
systems is one based on a two-level tree; this is described in~\cite{prasad2004t2b}
as a state-clustered tied-mixture (SCTM) system. The way we implement this is
to first build the decision tree used for phonetic clustering to a relatively
coarse level (e.g.~100 leaves). Each leaf of this tree corresponds to a codebook of Gaussians, i.e.~the Gaussian parameters are tied at this level. The leaves 
of this tree are then split further to give a more fine-grained clustering (e.g.~2500 leaves), where for each new leaf we remember which codebook it corresponds to. The leaves of this second tree correspond to the actual context-dependent HMM states and contain the weights.
For this type of system we do not apply any post-clustering.

The way we formalize this is that each context-dependent state $j$ has
a corresponding codebook index $k$, and the function $m(j)$
maps the leaf index $j$ to the codebook index $k$. The likelihood function is now
%
\begin{equation}
p(\x | j) = \sum_{i=1}^{N_{m(j)}} c_{ji} \, \nv\left(\x; \m_{m(j), i}, \k_{m(j), i}\right)
\end{equation}
%
where $\m_{m(j), i}$ is the $i$-th mean vector of the codebook
associated with $j$.
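As a minimal illustration (a Python sketch with hypothetical data structures,
not the {\sc Kaldi} implementation), evaluating the likelihood of a leaf amounts
to looking up its codebook via $m(j)$ and forming the weighted sum over that
codebook's components:
\begin{verbatim}
# Sketch only; the data layout is hypothetical.
from scipy.stats import multivariate_normal

def likelihood(x, j, m, codebooks, weights):
    # m[j]: codebook index of leaf j
    # codebooks[k]: list of (mean, cov) pairs
    # weights[j]: weight vector c_j of leaf j
    k = m[j]
    return sum(c * multivariate_normal.pdf(x, mu, cov)
               for c, (mu, cov)
               in zip(weights[j], codebooks[k]))
\end{verbatim}
In practice the shared Gaussians are evaluated once per frame and reused for
all leaves that map to the same codebook.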

Similar to the traditional semi-continuous codebook initialization, we start
from the tree statistics and distribute the Gaussians to the respective codebooks
using the tree mapping to obtain preliminary codebooks.
The target size $N_k$ of codebook $k$ is determined with respect to 
the occupancies of the respective leaves as
\begin{equation}
N_k = N_0 + \frac
%% power rule, maybe some other day...
%  { \left(  \sum_{l \in \{m(l) = k\}} \text{occ}(l)  \right)^q }
%  { \sum_r \left(  \sum_{t \in \{m(t) = r\}} \text{occ}(l)  \right)^q } 
  { \sum_{l \in \{m(l) = k\}} \text{occ}(l) }
  { \sum_r \sum_{t \in \{m(t) = r\}} \text{occ}(t) }
  \left( N - K \cdot N_0 \right)
\end{equation}
where $N_0$ is a minimum number of Gaussians per codebook (e.g., 3), $N$ is
the total number of Gaussians, $K$ the number of codebooks, and $\text{occ}(j)$ 
is the occupancy of tree leaf $j$. 
%
%% Korbinian: No power rule for now, seems to harm results.
%, and $q$ controls the influence of the occupancy regarding one codebook. 
%A typical value of $q=0.2$ leads to rather homogeneous codebook sizes while 
%still attributing more Gaussians to states with larger occupancies. 
%For the two-level architecture, a $q=1$ yielded best results.
%
The target sizes of the codebooks are again enforced by either splitting or
merging the components.
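As a hypothetical example, with $K = 2$ codebooks whose leaves account for
60\% and 40\% of the total occupancy, $N = 100$ and $N_0 = 3$, the rule assigns
$N_1 = 3 + 0.6 \cdot 94 \approx 59$ and $N_2 = 3 + 0.4 \cdot 94 \approx 41$
Gaussians: every codebook retains at least $N_0$ components, and the remaining
$N - K N_0$ Gaussians are distributed according to the data.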

\subsection{Subspace Gaussian Mixture Models}
The idea of subspace Gaussian mixture models (SGMM) is, similar to 
semi-continuous models, to reduce the number of parameters by selecting the
Gaussians from a subspace spanned by a universal background model (UBM)
and state-specific transformations.  The SGMM emission pdfs can
be computed as
\begin{eqnarray}
p(\x | j) & = & \sum_{i=1}^{N} c_{ji} \nv(\x; \m_{ji}, \k_i) \\
\m_{ji}    & = & \M_i \v_j \\
c_{ji}     & = & \frac{\exp \w_i^T \v_j}{\sum_{l=1}^N \exp \w_l^T \v_j}
\end{eqnarray}
where the covariance matrices $\k_i$ are shared between all leaves $j$. The
weights $c_{ji}$ and means $\m_{ji}$ are derived from $\v_j$ together with
$\M_i$ and $\w_i$. The term ``subspace'' indicates that the parameters of the
mixtures are limited to a subspace of the entire space of parameters of
the underlying codebook.
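As an illustration (a Python sketch with hypothetical array shapes, not the
{\sc Kaldi} implementation), the state-specific means and weights are obtained
from the shared projections as follows:
\begin{verbatim}
# Sketch only; array shapes are hypothetical.
import numpy as np

def sgmm_state_params(v_j, M, w):
    # v_j: (S,) state vector; M: (N, D, S) mean
    # projections; w: (N, S) weight projections
    means = np.einsum('nds,s->nd', M, v_j)   # M_i v_j
    logits = w @ v_j                         # w_i^T v_j
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax
    return means, weights
\end{verbatim}
The covariances $\k_i$ are taken directly from the shared codebook.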
%
A detailed description and derivation of the accumulation and update formulas
can be found in \cite{povey2011sgm}. Note that the experiments in this article
do not use adaptation.

\section{Smoothing Techniques for Semi-continuous Models}
\label{sec:smooth}
%
\subsection{Intra-Iteration Smoothing}
Although an increasing number of tree leaves does not imply a huge increase in
parameters, the estimation of the weights at the individual leaves suffers from data
fragmentation, especially when training models on small amounts of data.
The {\em intra-iteration} smoothing is motivated by the observation that tree leaves
that share a common root are similar; thus, if the sufficient statistics for the
weights of a leaf $j$ are very small, they should be interpolated with the
statistics of closely related leaf nodes.
To do so, we first propagate all sufficient statistics up the tree so that the
statistics of any node are the sum of its children's statistics.
Second, the statistics of each node and leaf are interpolated top-down with
their parent's statistics using an interpolation weight $\tau$:
\begin{equation} \label{eq:intra}
\hat\gamma_{ji} \leftarrow 
  \underbrace{\left(\gamma_{ji} + \frac{\tau}{\left( \sum_k \gamma_{p(j),k} \right) + \epsilon} \, \gamma_{p(j),i}\right)}_{\mathrel{\mathop{:}}= \bar\gamma_{ji}}
  \cdot 
  \underbrace{\frac{\sum_k \gamma_{jk}}{\sum_k \bar\gamma_{jk}}}_\text{normalization},
\end{equation}
where $\gamma_{ji}$ is the occupancy of the $i$-th component in tree leaf $j$
and $p(j)$ is the parent node of $j$.
%
For two-level tree based models, the propagation and interpolation cannot
be applied across different codebooks. 
%
A similar interpolation scheme without normalization was used in 
\cite{schukattalamazzini1994srf,schukattalamazzini1995as} along with a different
tree structure.
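
As a rough illustration of Eq.~(\ref{eq:intra}) (a Python sketch with a
hypothetical tree layout, not the {\sc Kaldi} implementation), the smoothing
consists of a bottom-up accumulation pass followed by a top-down interpolation
pass:
\begin{verbatim}
# Sketch only; stats[n] is the NumPy vector of
# weight occupancies of node n (zeros for interior
# nodes before propagation).
import numpy as np

def smooth_stats(stats, children, root, tau,
                 eps=1e-10):
    def up(n):                 # bottom-up propagation
        for c in children.get(n, []):
            up(c)
            stats[n] = stats[n] + stats[c]
    def down(n, parent=None):  # top-down interpolation
        if parent is not None:
            scale = tau / (stats[parent].sum() + eps)
            bar = stats[n] + scale * stats[parent]
            # rescale to keep the node's total count
            stats[n] = bar * (stats[n].sum() / bar.sum())
        for c in children.get(n, []):
            down(c, n)
    up(root)
    down(root)
    return stats

# hypothetical toy tree: root 0 with leaves 1 and 2
stats = {0: np.zeros(3),
         1: np.array([1., 0., 1.]),
         2: np.array([10., 30., 58.])}
smooth_stats(stats, {0: [1, 2]}, root=0, tau=35.0)
\end{verbatim}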

%This is similar to the {\em I} step of the {\em APIS} algorithm 
%\cite{schukattalamazzini1994srf,schukattalamazzini1995as} but smoothes the
%statistics with respect to their counts instead of replacing them.
%This modification is necessary as the tree in \cite{schukattalamazzini1995as}
%is constructed in a way, that the tree splits (nodes) represent acoustic
%models (the {\em polyphones}), i.e., the first node level would be the
%monophones, and any phone with more context would be a child node, with only
%the phones with the largest context being a leaf. For the training, any data contributing
%to a node $j$ should also contribute to its parent $p(j)$, and if a child
%has insufficient statistics, these can be interpolated with the parent's. 
%As we attach models only to the leaves, this motivation does not (necessarily)
%hold, thus requiring the above modification of the interpolation scheme.

\subsection{Inter-Iteration Smoothing}
Another typical problem that arises when training semi-continuous models is
that the weights tend to converge to a poor local optimum over the iterations. 
Our intuition is that the weights tend to overfit to the Gaussian parameters
in such a way that the model quickly converges with the
GMM parameters very close to the initialization point.
%
To try to counteract this effect, we smooth the newly estimated parameters
$\Theta^{(i)}$ with those from the previous iteration, $\Theta^{(i-1)}$, using an interpolation weight $\rho$
\begin{equation}
\hat\Theta^{(i)} \leftarrow (1 - \rho) \, \Theta^{(i)} + \rho \, \Theta^{(i-1)} % \quad .
\end{equation}
which leads to an exponentially decreasing impact of the initial parameter set.
We implemented this for all the parameters but found it only helped when
applied to just the weights,
which is consistent with our interpretation that stopping the weights from overfitting
too fast allows the Gaussian parameters to move farther away from the initialization
than they otherwise would.
We note that other practitioners have also used mechanisms that
prevent the weights from changing too fast and
have found that this improved results\footnote{Jasha Droppo, personal communication.}.

%----%<------------------------------------------------------------------------

\section{System Descriptions}
\label{sec:sys}

The experiments presented in this article used the DARPA Resource 
Management Continuous Speech Corpus (RM) \cite{price1993rm} and the Wall Street
Journal Corpus \cite{garofalo2007wsj}.

For both data sets, we first train an initial monophone continuous system
with about 1000 diagonal Gaussians (122 states) on a small data subset; with 
the final alignments, we train an initial triphone continuous system with 
about 10000 Gaussians (1600 states).  The alignments of this triphone system
are used as a basis for the actual system training.

Note that this baseline may be computed on any other (reduced) data set
and has the sole purpose of providing better-than-linear initial alignments
to speed up the convergence of the models.
%
The details of the model training can also be found in the {\sc Kaldi} recipes.

For the acoustic frontend, we extract 13 Mel-frequency cepstral coefficients (MFCCs),
apply cepstral mean normalization, and compute deltas and delta-deltas.  For
decoding, we use the language models supplied with the corpora; for RM this
is a word-pair grammar, and for WSJ we use the bigram version of the supplied
language model.

\subsection{Resource Management}
For the RM data set, we train on the speaker-independent training and development
set (about 4000 utterances) and test on the six DARPA test runs Mar and Oct '87,
Feb and Oct '89, Feb '91 and Sep '92 (about 1500 utterances total).  The parameters 
are tuned to this joint test set since there was not enough data to hold out a
development set and still get meaningful results.
We use the following system identifiers in Table~\ref{tab:res_rm}:
%
\begin{itemize}
\item
  {\em cont}: continuous triphone system using 9011 diagonal covariance
  Gaussians in 1480 tree leaves
\item
  {\em semi}: semi-continuous triphone system using 768 full covariance
  Gaussians in 2500 tree leaves, no smoothing (see the discussion below).
\item
  {\em 2lvl}: two-level tree based semi-continuous triphone system using
  3072 full covariance Gaussians in 208 codebooks ($4/14.7/141$ min/average/max
  components), 2500 tree leaves, $\tau = 35$ and $\rho = 0.2$.
\item
  {\em sgmm}: subspace Gaussian mixture model triphone system using a 400 
  full-covariance Gaussian background model and 2016 tree leaves.
\end{itemize}

\subsection{WSJ}
% TODO fix the WSJ train/dev/eval description
%% Korbinian: For now, I inserted results on si84/half, using cmvn (tri2c),
%             the si284 take just too long for the submission.
%For the WSJ data set, we trained on the SI-284 training set, tuned on dev
%and tested on eval92,93.

Due to time constraints, we trained on 
half of the SI-84 training set using settings similar to the RM experiments; 
we tested on the eval '92 and eval '93 data sets.
As the classic semi-continuous system was unsatisfactory on the RM data in terms
of both run-time and accuracy, we omitted it from the WSJ
experiments. Results corresponding to the following systems are presented in
Table~\ref{tab:res_wsj}:
%
% TODO Korbinian: I hope the SI-284 experiments finish soon enough while the 
%                 submission is still open. For now, I describe SI-84/half.
\begin{itemize}
\item
  {\em cont}: continuous triphone system using 10000 diagonal covariance
  Gaussians in 1576 tree leaves.
%% TODO Korbinian: For now, I use the same setting as in RM, to evade the
%                  tuning on the dev set. Also, no semi system, too slow
%                  and bad performance on RM.
\item
    {\em 2lvl}: same configuration as for RM (no further tuning).
%\item
%  {\em 2lvl}: two-level tree based semi-continuous triphone system using
%  4096 full covariance Gaussians in 208 codebooks ( min/average/max
%  components),  tree leaves, $\tau = $ and $\rho = $.
\item
  {\em sgmm}: subspace Gaussian mixture model triphone system using a 400 
  full-covariance Gaussian background model and 2361 tree leaves.
\end{itemize}

%----%<------------------------------------------------------------------------

\section{Results}
\label{sec:results}

%------%<----------------------------------------------------------------------
\subsection{RM}

\begin{table}%[tb]
\begin{center}
%\footnotesize
%\begin{tabular}{|l||r|r|r|r|r|r||c|}
\begin{tabular}{|l||c|c|c|c|c|c||c|}
\hline
~               & ma87 & oc87 & fe89 & oc89 & fe91 & se92 & {\em avg}  \\ \hline\hline
%{\em mono/cont} &  6.24 &  8.72 &  7.85 &  9.80 &  8.21 & 13.60 & 9.50 \\ \hline
%{\em tri1/cont} &  0.96 &  2.76 &  2.69 &  3.61 &  3.30 &  6.33 & 3.64 \\ \hline\hline
{\em cont} &  1.08 &  2.48 &  2.69 &  3.46 &  2.66 &  5.90 & 3.38 \\ \hline
{\em semi} &  1.80 &  3.19 &  4.72 &  4.62 &  4.15 &  6.88 & 4.66 \\ \hline 
{\em 2lvl} &  0.48 &  1.70 &  2.46 &  3.35 &  1.89 &  5.31 & 2.90 \\ \hline
{\em sgmm} &  0.48 &  2.20 &  2.62 &  2.50 &  1.93 &  5.12 & 2.78 \\ \hline
\end{tabular}
\end{center}
\caption{\label{tab:res_rm}
Detailed recognition results in \% WER for the six DARPA test sets using 
different acoustic models on the RM data set.
}
\end{table}

The results on the different test sets on the RM data are displayed in 
Table~\ref{tab:res_rm}. The classic semi-continuous system shows the worst 
performance of the four, which serves to confirm the conventional wisdom
about the superiority of continuous systems.
The performance of the two-level tree based multiple-codebook system lies between
those of the continuous and SGMM systems.


%------%<----------------------------------------------------------------------
% Now let's talk about the diag/full cov and interpolations

\begin{table}%[tb]
\begin{center}
\begin{tabular}{|c|c||c|c|c|c|}
\hline
covariance & Gaussians & none & inter & intra & both \\ \hline\hline
full       &      1024 & 3.70 & 3.64 & 3.79 & 3.74 \\ \hline
full       &      3072 & 3.01 & 3.01 & 3.02 & 2.90 \\ \hline\hline
diagonal   &      3072 & 4.13 & 4.15 & 4.25 & 4.35 \\ \hline
diagonal   &      9216 & 3.22 & 3.09 & 3.28 & 3.20 \\ \hline
\end{tabular}
\end{center}
\caption{\label{tab:rm_diagfull}
Average \% WER of the multiple-codebook semi-continuous model using different
numbers of diagonal and full covariance Gaussians, and different smoothing
configurations on the RM data.
Settings are 208 codebooks, 2500 context-dependent states; $\tau = 35$
and $\rho = 0.2$ if active.
}
\end{table}

Keeping the number of codebooks and leaves as well as the smoothing parameters
$\tau$ and $\rho$ of {\em 2lvl} constant, we experiment with the number of
Gaussians and type of covariance; the results are displayed in 
Table~\ref{tab:rm_diagfull}.
Interestingly, the covariance type makes a strong difference: using the same
number of Gaussians, full covariances lead to a significant improvement over diagonal ones.
On the other hand, a substantial increase in the number of diagonal Gaussians
leads to performance similar to that of the regular continuous system.
%
The effect of the two smoothing methods using $\tau$ and $\rho$ is not
clear from Table~\ref{tab:rm_diagfull}, i.e. the results are not always
consistent.  While the inter-iteration smoothing with $\rho$
helps in most cases, the intra-iteration smoothing coefficient $\tau$ 
shows an inconsistent effect.  In the future we may abandon this type of
smoothing or investigate other smoothing methods. 


\subsection{WSJ}

% TODO Korbinian: More expts are ongoing; these include the full training set
%                 and cmvn, and hopefully a good param combination for 2lvl...
\begin{table}%[tb]
\begin{center}
\begin{tabular}{|l||c|c||c|}
\hline
~                & eval '92 & eval '93 & {\em avg} \\ \hline\hline
% all experiments si-84/half, cmvn
{\em cont}       &    12.75 &    17.10 &     14.39 \\ \hline
{\em 2lvl/none}  &    12.92 &    17.85 &     14.79 \\ \hline
{\em 2lvl/intra} &    13.03 &    17.56 &     14.61 \\ \hline
{\em 2lvl/inter} &    12.80 &    17.59 &     14.61 \\ \hline
{\em 2lvl/both}  &    13.01 &    17.59 &     14.75 \\ \hline
{\em sgmm}       &    11.72 &    14.25 &     12.68 \\ \hline
\end{tabular}
\end{center}
\caption{\label{tab:res_wsj}
Detailed recognition results in \% WER for the '92 and '93 test sets using 
different acoustic models on the WSJ data; the {\em 2lvl} system was trained
with and without the different interpolations.
}
\end{table}

The results on the different test sets on the WSJ data are displayed in
Table~\ref{tab:res_wsj}.
% TODO Korbinian: The following is for the si-84/half results, as currently
%                 in the table.
Using the parameters calibrated on the RM test data, the {\em 2lvl} system
does not perform as well as the continuous or SGMM systems, but remains
in a similar range while using about the same number of leaves but only about a
third of the Gaussians.
Once the number of Gaussians and leaves, as well as the smoothing parameters,
are tuned on the development set, we expect the error rates to lie between those of
the continuous and SGMM systems, as seen on the RM data.

\section{Summary}
\label{sec:summary}

In this article, we compared continuous and SGMM models with
two types of semi-continuous hidden Markov models, one using a single codebook 
and the other using multiple codebooks based on a two-level phonetic decision tree,
similar to~\cite{prasad2004t2b}.
Using the single-codebook system we were not able to obtain competitive
results, but using a multiple-codebook system 
we were able to get better results than a traditional system.  Interestingly,
for both the single-codebook and multiple-codebook systems, we saw better
results with full rather than diagonal covariances.

Our best multiple-codebook results reported here incorporate two different forms of 
smoothing, one using the decision tree to interpolate count statistics among closely
related leaves, and one smoothing the weights across iterations to encourage 
convergence to a better local optimum.  In future work we intend to further investigate
these smoothing procedures and clarify how important they are.

If we continue to see good results from these types of systems, 
we may investigate ways to speed up the Gaussian computation for the 
full-covariance multiple-codebook tied mixture systems (e.g. using
diagonal Gaussians in a pre-pruning phase, as in SGMMs).  To get state-of-the-art
error rates we would also have to 
implement linear-transform estimation (e.g. Constrained MLLR for adaptation)
for these types of models.  We also intend to investigate the interaction of these
methods with discriminative training and system combination; and we will
consider extending the SGMM framework to a two-level architecture similar to
the one we used here.

%\section{REFERENCES}
%\label{sec:ref}

% Korbinian: We might need that ;)
\footnotesize
\bibliographystyle{IEEEbib}
\bibliography{refs-eig,refs}

\end{document}