Commit b30395ad authored by Dan Povey

Final changes to paper.tex

git-svn-id: https://svn.code.sf.net/p/kaldi/code/trunk@539 5e6a8d80-dfce-4ca6-a32a-6e07a63d50c8
parent 9c6f2609
% Note: the original submission was rev. 530...
% Template for SLT-2006 paper; to be used with:
% spconf.sty - ICASSP/ICIP LaTeX style file, and
% IEEEbib.bst - IEEE bibliography style file.
......@@ -77,7 +78,7 @@
% Title.
% ------
\title{Generating exact lattices in the WFST framework}
\makeatletter
......@@ -85,9 +86,9 @@
\makeatother
\name{ \em Daniel Povey$^1$, Mirko Hannemann$^{1,2}$, \\
\em {Gilles Boulianne}$^3$, {Luk\'{a}\v{s} Burget}$^{2,4}$, {Arnab Ghoshal}$^5$, {Milo\v{s} Janda}$^2$, {Martin Karafi\'{a}t}$^2$, {Stefan Kombrink}$^2$, \\
\em {Petr Motl\'{i}\v{c}ek}$^6$, {Yanmin Qian}$^7$, {Ngoc Thang Vu}$^8$, {Korbinian Riedhammer}$^9$, {Karel Vesel\'{y}}$^2$
\thanks{Thanks to Honza \v{C}ernock\'{y}, Renata Kohlov\'{a}, and Tom\'{a}\v{s} Ka\v{s}p\'{a}rek for their help relating to the Kaldi'11 workshop at BUT, and to Sanjeev Khudanpur for his help in preparing the paper. Researchers at BUT were partly supported by Technology Agency of the Czech Republic grant No. TA01011328, Czech Ministry of Education project No. MSM0021630528, and Grant Agency of the Czech Republic project No. 102/08/0707.}}
%%% TODO: fix thanks.
......@@ -95,12 +96,12 @@
% Programme under grant agreement no. 213850 (SCALE); BUT researchers were partially
%supported by Czech MPO project No. FR-TI1/034.}}
\address{{ $^1$ Microsoft Research, Redmond, WA, {\normalsize \tt dpovey@microsoft.com} }\\
{ $^2$ Brno University of Technology, Czech Republic, {\normalsize \tt ihannema@fit.vutbr.cz}} \\
{ $^3$ CRIM, Montreal, Canada \ \ $^4$ SRI International, Menlo Park, CA, USA} \\
{ $^5$ University of Edinburgh, U.K. \ \ $^6$ IDIAP, Martigny, Switzerland } \\
{ $^7$ Tsinghua University, Beijing, China \ \ $^8$ Karlsruhe Institute of Technology, Germany} \\
{ $^9$ Pattern Recognition Lab, University of Erlangen-Nuremberg, Germany}}
\begin{document}
......@@ -110,7 +111,7 @@
%
\maketitle
%
%\pagestyle{plain} % was set to empty in spconf.sty... this makes page number appear.
\begin{abstract}
We describe a lattice generation method that is exact, i.e. it satisfies all the
......@@ -149,13 +150,15 @@ where the Weighted Finite State Transducer (WFST) decoding graph is
\begin{equation}
\HCLG = \min(\det(H \circ C \circ L \circ G)),
\end{equation}
where $H$, $C$, $L$ and $G$ represent the HMM structure, phonetic
context-dependency, lexicon and grammar respectively,
and $\circ$ is WFST composition (note: view $\HCLG$ as a single symbol).
For concreteness we will speak of ``costs'' rather
than weights, where a cost is a floating point number that typically represents a negated
log-probability. A WFST has a set of states with one distinguished
start state;\footnote{This is the formulation that corresponds best with the toolkit we use.}
each state has a final-cost (or $\infty$ for non-final states);
and there is a set of arcs between the states, where each arc has a weight
(just think of this as a cost for now), an input label and an output
label. In $\HCLG$, the input labels are the identifiers of context-dependent
HMM states, and the output labels represent words. For both the input and output
......@@ -165,19 +168,20 @@ symbols, the special label $\epsilon$ may appear meaning ``no label is present.'
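As a rough illustration of this construction (a sketch only, using the OpenFst
library; a real recipe must also handle disambiguation symbols and HMM
self-loops, which we ignore here), one could write:
\begin{verbatim}
// Sketch: HCLG = min(det(H o C o L o G)) with OpenFst.
// Assumes H.fst, C.fst, L.fst and G.fst already exist on disk.
#include <fst/fstlib.h>

int main() {
  using namespace fst;
  StdVectorFst *h = StdVectorFst::Read("H.fst");
  StdVectorFst *c = StdVectorFst::Read("C.fst");
  StdVectorFst *l = StdVectorFst::Read("L.fst");
  StdVectorFst *g = StdVectorFst::Read("G.fst");
  StdVectorFst lg, clg, hclg_nondet, hclg;
  Compose(*l, *g, &lg);             // L o G
  Compose(*c, lg, &clg);            // C o (L o G)
  Compose(*h, clg, &hclg_nondet);   // H o (C o L o G)
  Determinize(hclg_nondet, &hclg);  // det(.)
  Minimize(&hclg);                  // min(.)
  hclg.Write("HCLG.fst");
  delete h; delete c; delete l; delete g;
  return 0;
}
\end{verbatim}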
\begin{figure}
\begin{center}
\includegraphics[height=2.1in,angle=270]{figures/acceptor_utterance.eps}
\vspace*{-0.03in}
\caption{Acceptor $U$ describing the acoustic scores of an utterance}
\vspace*{-0.15in}
\label{fig:acceptor}
\end{center}
\end{figure}
Imagine we want to ``decode'' an utterance of $T$ frames, i.e. we want to
find the most likely word sequence, and the corresponding alignment. A WFST
interpretation of the decoding problem is as follows. We construct an
acceptor, or WFSA, as in Fig.~\ref{fig:acceptor} (an acceptor is represented as a
WFST with identical input and output symbols). It has $T{+}1$ states,
with an arc for each combination of (time, context-dependent HMM state). The
costs on these arcs are the negated and scaled acoustic log-likelihoods.
Call this acceptor $U$. Define
\begin{equation}
......@@ -191,7 +195,7 @@ sentence. In practice we do not do a full search of $S$, but use beam pruning.
Let $B$ be the searched subset of $S$, containing a subset of the states and arcs
of $S$ obtained by some heuristic pruning procedure.
When we do Viterbi decoding with beam-pruning, we are finding the best path through $B$.
%
Since the beam pruning is a part of any practical search procedure and cannot
easily be avoided, we will define the desired outcome of lattice generation in terms
of the visited subset $B$ of the search graph $S$.
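As a concrete illustration (a sketch only; the function name and the
\texttt{loglikes} matrix are our own assumptions, not part of any toolkit),
the acceptor $U$ could be built with OpenFst as follows:
\begin{verbatim}
// Sketch: build the acceptor U of Fig. 1.  loglikes[t][j] holds the
// acoustic log-likelihood of context-dependent HMM state j on frame t.
#include <fst/fstlib.h>
#include <vector>

fst::StdVectorFst MakeUtteranceAcceptor(
    const std::vector<std::vector<float> > &loglikes, float acoustic_scale) {
  fst::StdVectorFst u;
  int T = loglikes.size();
  for (int t = 0; t <= T; t++) u.AddState();          // T+1 states
  u.SetStart(0);
  u.SetFinal(T, fst::TropicalWeight::One());
  for (int t = 0; t < T; t++) {
    for (size_t j = 0; j < loglikes[t].size(); j++) {
      float cost = -acoustic_scale * loglikes[t][j];  // negated, scaled
      int label = j + 1;                // label 0 is reserved for epsilon
      u.AddArc(t, fst::StdArc(label, label, cost, t + 1));
    }
  }
  return u;
}
\end{verbatim}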
......@@ -223,12 +227,12 @@ pruning beam $\alpha > 0$ (interpret this as a log likelihood difference).
\item The scores and alignments in the lattice should be accurate.
\item The lattice should not contain duplicate paths with the same word sequence.
\end{itemize}
We need to be a little more precise about what we mean by the scores
and alignments being ``accurate''. Let the lattice be $L$. The
way we would like to state this requirement is:
\begin{itemize}
\item For every path in $L$, the score and alignment corresponds to
the best-scoring path in $B$ for the corresponding word
sequence\footnote{Or one of the best-scoring paths, in case of a tie.}.
\end{itemize}
The way we actually have to state the requirement in order to get an efficient procedure is:
......@@ -244,8 +248,9 @@ to weaken the condition. This is no great loss, since regardless
of pruning, any word-sequence not within $\alpha$ of the best one could be omitted altogether,
which is the same as being assigned a cost of $\infty$).
By ``word-sequence'' we mean a sequence of whatever symbols are on the
output of $\HCLG$. In our experiments these output symbols represent words, but
silences do not appear as output symbols (they are
represented via alternative paths in $L$).
\section{Previous lattice generation methods}
\label{sec:previous}
......@@ -256,7 +261,7 @@ A common assumption underlying lattice generation methods is the {\em word-pair
of~\cite{ney_word_graph}. This is the notion that the time boundary between a pair of words
is not affected by the identity of any earlier words. In a decoder in which there is
a different copy of the lexical tree for each preceding word,
assuming the word-pair assumption holds, in order to generate an accurate lattice, it is
sufficient to store a single Viterbi back-pointer at the word level; the entire set of
such back-pointers contains enough information to generate the lattice. Authors who have used
this type of lattice generation method~\cite{ney_word_graph,odell_thesis} have generally
......@@ -277,10 +282,9 @@ lattice; project it to leave only word labels; and then remove $\epsilon$
symbols and determinize.
Note that the form of pruning referred to here is not the same as beam pruning as
it takes account of both the forward and backward parts of the cost.
The paper also reported experiments with an accurate, ``reference'' method that did not require any
phone-pair assumption; these experiments showed that the main method they were describing
had almost the same lattice oracle error rate as the reference method. However, the
experiments did not evaluate how much impact the assumption had on the accuracy of the scores,
and this information could be important in some applications.
......@@ -292,8 +296,8 @@ on the word-pair assumption, but since the copies of the lexical tree in the dec
graph do not have unique word histories, the resulting algorithm has to be quite
different. Viterbi back-pointers at the word level are used, but the algorithm keeps track
of not just a single back-pointer in each state, but the $N$ best back-pointers for the $N$ top-scoring distinct
word histories. Therefore, this algorithm has more in common with the sentence N-best
algorithm than with the Viterbi algorithm. By limiting $N$ to be quite small (e.g. $N{=}5$),
the algorithm was made efficient, but at the cost of losing word sequences
that would be within the lattice-generation beam.
......@@ -306,13 +310,13 @@ In order to explain our algorithm in the easiest way, we will first explain how
be if we did not keep the alignment information, and were storing only a single cost
(i.e. the total acoustic plus language-model cost). This is just for didactic purposes;
we have not implemented this simple version.
In this case, our algorithm would be quite similar
to~\cite{efficient_general}, except at the state level rather than the phone level.
We actually store forward rather than backward pointers: for each active state on each
frame, we create a forward link record for each active arc out of that state; this points
to the record for the destination state of the arc on the next frame (or on
the current frame, for $\epsilon$-input
arcs). As in~\cite{efficient_general}, at the end of the utterance,
we prune the resulting graph to discard any paths that are not within the beam $\alpha$ of the best cost.
Let the pruned graph be $P$, i.e.
\begin{equation}
......@@ -323,7 +327,7 @@ We project on the output labels (i.e. we keep only the word labels), then remove
arcs and determinize. In fact, we use a determinization algorithm that does $\epsilon$
removal itself.
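For the simple version, these steps (including the re-pruning with beam
$\alpha$ mentioned below) correspond roughly to the following standard OpenFst
calls; this is an illustrative sketch, not our actual implementation, which
has its own determinization code:
\begin{verbatim}
// Sketch: post-processing of the pruned state-level lattice P ("lat").
#include <fst/fstlib.h>

void SimplePostprocess(fst::StdVectorFst *lat, float alpha) {
  fst::Project(lat, fst::PROJECT_OUTPUT);  // keep only the word labels
  fst::RmEpsilon(lat);                     // remove epsilon arcs
  fst::StdVectorFst det;
  fst::Determinize(*lat, &det);    // one path per word sequence, best cost
  fst::Prune(&det, fst::TropicalWeight(alpha));  // re-prune with beam alpha
  *lat = det;
}
\end{verbatim}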
As in~\cite{efficient_general}, to save memory, we actually do the pruning
periodically rather than waiting for the end of the file (we do it every 25 frames).
Our method is equivalent to their method of linking all currently active states to a ``dummy''
final state and then pruning in the normal way. However, we implement it in such a way
......@@ -335,10 +339,10 @@ unchanged for a particular frame, the pruning algorithm can stop going backward
After the determinization phase, we prune again using the beam $\alpha$. This is needed because
the determinization process can introduce a lot of unlikely arcs. In fact, for particular
utterances, the determinization process can cause the lattice to expand enough to
exhaust memory. To deal with this, we currently just detect when determinization
has produced more than a pre-set maximum number of states, then we prune with a tighter
beam and try again. In future, we may try more sophisticated methods such as a
determinization algorithm that does pruning itself.
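The fallback just described might be sketched as follows (illustrative only:
the check here is done after the fact and the constants are arbitrary,
whereas our implementation detects the blow-up as it happens):
\begin{verbatim}
// Sketch: if determinization produces more than max_states states,
// prune with a tighter beam and try again.  'acc' is the projected,
// epsilon-free word acceptor; alpha is the lattice beam.
#include <fst/fstlib.h>

fst::StdVectorFst DeterminizeWithRetry(const fst::StdVectorFst &acc,
                                       float alpha, int max_states) {
  for (float beam = alpha; ; beam *= 0.75f) {  // 0.75 is an arbitrary factor
    fst::StdVectorFst pruned = acc;
    fst::Prune(&pruned, fst::TropicalWeight(beam));
    fst::StdVectorFst det;
    fst::Determinize(pruned, &det);
    if (det.NumStates() <= max_states || beam < 1.0f)
      return det;                              // small enough, or give up
  }
}
\end{verbatim}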
This ``simple'' version of the algorithm produces an acyclic, deterministic
......@@ -381,7 +385,7 @@ to avoid complicating the present discussion.
We will define a semiring in which symbol sequences are encoded into the weights.
Let a weight be a pair $(c, s)$, where $c$ is a cost and $s$ is a sequence of symbols.
We define the $\otimes$ operation as $(c, s) \otimes (c', s') = (c+c', (s,s'))$, where
$(s,s')$ is a concatenation of $s$ and $s'$. We define the $\oplus$ operation so that
it returns whichever pair has the smallest cost: that is, $(c,s) \oplus (c',s')$
equals $(c,s)$ if $c < c'$, and $(c',s')$ if $c > c'$. If the costs are identical,
we cannot arbitrarily return the first pair because this would not satisfy the semiring
......@@ -391,7 +395,7 @@ the lengths are the same, whichever string appears first in dictionary order.
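A minimal sketch of these weights and of the two operations follows (for
illustration only; the type and function names are ours):
\begin{verbatim}
#include <vector>

struct CostString {      // a weight: a cost plus a sequence of symbols
  float cost;
  std::vector<int> str;  // e.g. a sequence of p.d.f. labels
};

// "Times": add the costs, concatenate the symbol sequences.
CostString Times(const CostString &a, const CostString &b) {
  CostString c;
  c.cost = a.cost + b.cost;
  c.str = a.str;
  c.str.insert(c.str.end(), b.str.begin(), b.str.end());
  return c;
}

// "Plus": take the pair with the smaller cost; break exact ties by
// taking the shorter string, then the string earlier in dictionary order.
CostString Plus(const CostString &a, const CostString &b) {
  if (a.cost < b.cost) return a;
  if (b.cost < a.cost) return b;
  if (a.str.size() != b.str.size())
    return a.str.size() < b.str.size() ? a : b;
  return a.str <= b.str ? a : b;  // vectors compare lexicographically
}
\end{verbatim}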
Let $E$ be an encoding of the inverted state-level lattice $Q$
as described above, with the same number of states and arcs; $E$ is an acceptor, with its symbols
equal to the input symbol (word) on the corresponding arc of $Q$, and
the weights on the arcs of $E$ containing both the weight and the output symbol (p.d.f.),
if any, on the corresponding arcs of $Q$. Let $D = \mathrm{det}(\mathrm{rmeps}(E))$.
Determinization
will always succeed because $E$ is acyclic (as long as the original decoding
......@@ -408,7 +412,7 @@ same word-sequence on it.
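Continuing the sketch above, encoding one arc of $Q$ as an arc of $E$ might
look as follows (illustrative; the arc types are ours, and \texttt{CostString}
is the weight type sketched earlier):
\begin{verbatim}
struct QArc {      // arc of Q: input = word, output = p.d.f., plus a cost
  int word, pdf;   // pdf == 0 means "no output symbol" (epsilon)
  float cost;
  int nextstate;
};

struct EArc {      // arc of E: an acceptor over words, whose weight is a
  int word;        // (cost, symbol-sequence) pair as described above
  CostString weight;
  int nextstate;
};

EArc EncodeArc(const QArc &qarc) {
  EArc e;
  e.word = qarc.word;
  e.weight.cost = qarc.cost;
  if (qarc.pdf != 0) e.weight.str.push_back(qarc.pdf);
  e.nextstate = qarc.nextstate;
  return e;
}
\end{verbatim}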
\subsection{Summary of our algorithm}
We now summarize the lattice generation algorithm. During decoding, we create a data-structure
corresponding to a full state-level lattice. That is, for every arc of $\HCLG$ that we traverse
on every frame, we create a separate arc in the state-level lattice. These arcs
contain the acoustic and graph costs separately. We prune the state-level graph using
a beam $\alpha$; we do this periodically (every 25 frames) but this is equivalent to
......@@ -441,8 +445,7 @@ Instead we use a data structure that enables this to be done in linear
time (it involves a hash table).
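For illustration, the kind of records involved might look as follows (a
sketch only; the type and field names are ours and need not match the
toolkit's):
\begin{verbatim}
#include <unordered_map>

struct Token;        // one Token per active HCLG state per frame

struct ForwardLink { // one link per arc traversed on a frame
  Token *next_tok;   // destination Token (next frame, or same frame
                     //   for epsilon-input arcs)
  int ilabel, olabel;    // input (HMM-state) and output (word) labels
  float graph_cost;      // cost taken from HCLG
  float acoustic_cost;   // scaled, negated acoustic log-likelihood
  ForwardLink *next;     // next link leaving the same Token
};

struct Token {
  float total_cost;      // best cost from the start state to here
  ForwardLink *links;    // head of this Token's list of forward links
  Token *next;           // next Token active on the same frame
};

// Hash from HCLG state-id to that state's Token on the current frame,
// so that the Token for a given state can be found in O(1) time.
typedef std::unordered_map<int, Token*> TokenMap;
\end{verbatim}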
We will briefly describe another unique aspect of our
algorithm. Determinization algorithms involve weighted subsets of states, e.g.:
\begin{equation}
S = \{ (s_1, w_1), (s_2, w_2), \ldots \} .
\end{equation}
......@@ -513,11 +516,11 @@ We report experimental results on the Wall Street Journal database of
read speech.
Our system is a standard mixture-of-Gaussians system trained on the SI-284
training data; we test on the November 1992 evaluation data.
For these experiments, we generate lattices with the bigram language model
supplied with the WSJ database, and for rescoring experiments we
use the trigram language model. The acoustic scale was $1/16$ for first-pass
decoding and $1/15$ for LM rescoring.
For simplicity, we used a decoder that
does not support a ``maximum active states'' option, so the only variables
to consider are the beam used in the Viterbi beam search, and the separate
beam $\alpha$ used for lattice generation.
......@@ -529,6 +532,7 @@ beam $\alpha$ used for lattice generation.
\vspace*{-0.1in}
\caption{\vspace*{-0.2in} Varying lattice beam $\alpha$ (Viterbi pruning beam fixed at 15) }
\vspace{-0.08in}
\label{fig:latbeam}
\end{figure}
......@@ -536,7 +540,9 @@ Figure~\ref{fig:latbeam} shows how various quantities change as we vary
$\alpha$, with the Viterbi beam fixed at 15. Note that we get all the improvement from
LM rescoring by increasing $\alpha$ to 4. The time taken by our algorithm started to increase rapidly after about
$\alpha=8$, so a value of $\alpha$ anywhere between about 4 and 8 is sufficient for LM rescoring
and still does not slow down decoding too much. We do not display the real-time factor of the non-lattice-generating
decoder on this data (2.26) as it was actually slower than the lattice-generating decoder; this is
presumably due to the overhead of reference counting.
Out-of-vocabulary words (OOVs) provide a floor on the lattice oracle error rate:
of 333 test utterances, 87 contained at least one OOV word, yet only 93 sentences (6 more) had oracle errors
with $\alpha=10$. Lattice density is defined as the average number of arcs crossing each frame.
......@@ -549,6 +555,7 @@ Viterbi decoding beam, while leaving $\alpha$ fixed at 7.
\vspace*{-0.1in}
\caption{\vspace*{-0.2in} Varying Viterbi pruning beam (lattice beam $\alpha$ fixed at 7) }
\label{fig:viterbibeam}
\vspace{-0.08in}
\end{figure}
......@@ -682,3 +689,8 @@ rescoring WER(15): 9.87%
ins/del/sub: 119/38/400
rescore utt wrong: 209
2.259 is baseline RT factor w/ decoding beam of 15, with simple-decoder.
note: paper# is 3887
passwd CF8FECE1