# Note: these results will vary somewhat from OS to OS, because
# some algorithms call rand().
First, comparing with published results (% WER on the RM test sets):
 feb89  oct89  feb91  sep92    avg
  2.77   4.02   3.30   6.29   4.10   % from ICASSP'99 paper by Povey & Woodland on Frame Discrimination (ML baseline)
  3.20   4.10   2.86   6.06   4.06   % from decode_tri2c (which is triphone + CMN)
exp/decode_mono/wer:Average WER is 14.234421 (1784 / 12533) # Monophone system, subset
exp/decode_tri1/wer:Average WER is 4.420330 (554 / 12533) # First triphone pass
exp/decode_tri1_fmllr/wer:Average WER is 3.837868 (481 / 12533) # + fMLLR
exp/decode_tri1_regtree_fmllr/wer:Average WER is 3.789994 (475 / 12533) # + regression-tree
# results in decode_tri1_latgen with varying acoustic
# scale: it looks like this had not been tuned very well (10 is the
# default).  A sketch of the rescoring procedure follows the numbers below.
exp/decode_tri1_latgen/wer:Average WER is 4.420330 (554 / 12533)
exp/decode_tri1_latgen/wer_6:Average WER is 3.917657 (491 / 12533)
exp/decode_tri1_latgen/wer_7:Average WER is 3.813931 (478 / 12533)
exp/decode_tri1_latgen/wer_8:Average WER is 3.869784 (485 / 12533)
exp/decode_tri1_latgen/wer_9:Average WER is 4.013405 (503 / 12533)
exp/decode_tri1_latgen/wer_10:Average WER is 4.085215 (512 / 12533)
exp/decode_tri1_latgen/wer_11:Average WER is 4.188941 (525 / 12533)
exp/decode_tri1_latgen/wer_12:Average WER is 4.420330 (554 / 12533)
exp/decode_tri1_latgen/wer_13:Average WER is 4.555972 (571 / 12533)
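# The wer_N numbers above would have been produced by rescoring the same set of
# lattices at each acoustic scale.  A minimal sketch of that procedure, assuming
# the suffix N means "best path with acoustic scale 1/N" (so wer_10 uses 0.1);
# the lattice-archive name and output filenames below are hypothetical:
dir=exp/decode_tri1_latgen
for inv_acwt in 6 7 8 9 10 11 12 13; do
  acwt=$(perl -e "print 1.0/$inv_acwt;")
  # Extract the best path through the existing lattices at this acoustic scale.
  lattice-best-path --acoustic-scale=$acwt \
    "ark:gunzip -c $dir/lats.gz|" ark,t:$dir/$inv_acwt.tra
  # The word-id transcripts in $inv_acwt.tra would then be mapped back to words
  # and scored against the reference to give the corresponding wer_$inv_acwt file.
done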
# Lattice oracle error rate for exp/decode_tri1_latgen/
# when acoustic scale is set to 10 (a sketch of the computation follows the numbers below).
Beam 0.01 Average WER is 4.085215 (512 / 12533)
Beam 0.5 Average WER is 3.702226 (464 / 12533)
Beam 1 Average WER is 3.279343 (411 / 12533)
Beam 5 Average WER is 1.412272 (177 / 12533)
Beam 10 Average WER is 0.582462 ( 73 / 12533)
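# The oracle numbers above measure how much of the reference transcript remains
# reachable in the lattice after pruning at each beam.  A minimal sketch of one
# way to compute them with the standard lattice-prune / lattice-oracle tools,
# assuming "acoustic scale ... 10" above means the same 1/N scaling as the wer_N
# files (i.e. 0.1); the archive names (lats.gz, ref.int) are hypothetical:
dir=exp/decode_tri1_latgen
for beam in 0.01 0.5 1 5 10; do
  lattice-prune --acoustic-scale=0.1 --beam=$beam \
    "ark:gunzip -c $dir/lats.gz|" ark:- | \
    lattice-oracle ark:- ark:$dir/ref.int ark,t:$dir/oracle_${beam}.tra
  # lattice-oracle reports the overall oracle error rate in its log output.
done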
# Results from the second pass of triphone system building,
# in various configurations.
exp/decode_tri2a/wer:Average WER is 3.973510 (498 / 12533) # Second triphone pass
exp/decode_tri2a_fmllr/wer:Average WER is 3.590521 (450 / 12533) # + fMLLR
exp/decode_tri2a_fmllr_utt/wer:Average WER is 3.933615 (493 / 12533) # [ fMLLR per utterance ]
exp/decode_tri2a_dfmllr/wer:Average WER is 3.861805 (484 / 12533) # + diagonal fMLLR
exp/decode_tri2a_dfmllr_utt/wer:Average WER is 3.933615 (493 / 12533) # [ diagonal fMLLR per utterance]
exp/decode_tri2a_dfmllr_fmllr/wer:Average WER is 3.622437 (454 / 12533) # diagonal fMLLR, then estimate fMLLR and re-decode
exp/decode_tri2b/wer:Average WER is 3.143701 (394 / 12533) # Exponential transform
exp/decode_tri2b_fmllr/wer:Average WER is 3.055932 (383 / 12533) # +fMLLR
exp/decode_tri2b_utt/wer:Average WER is 3.295300 (413 / 12533) # [adapt per-utt]
exp/decode_tri2c/wer:Average WER is 3.957552 (496 / 12533) # Cepstral mean subtraction (per-spk)
exp/decode_tri2d/wer:Average WER is 4.316604 (541 / 12533) # MLLT (= global STC)
exp/decode_tri2e/wer:Average WER is 4.659698 (584 / 12533) # splice-9-frames + LDA features
exp/decode_tri2f/wer:Average WER is 3.885742 (487 / 12533) # splice-9-frames + LDA + MLLT
exp/decode_tri2g/wer:Average WER is 3.303279 (414 / 12533) # Linear VTLN
exp/decode_tri2g_diag/wer:Average WER is 3.135722 (393 / 12533) # Linear VTLN; diagonal adapt in test
exp/decode_tri2g_diag_fmllr/wer:Average WER is 3.063911 (384 / 12533) # as above but then est. fMLLR (another decoding pass)
exp/decode_tri2g_diag_utt/wer:Average WER is 3.399027 (426 / 12533) # [as decode_tri2g_diag, but per-utterance]
exp/decode_tri2g_vtln/wer:Average WER is 3.239448 (406 / 12533) # Use warp factors -> feature-level VTLN + offset estimation
exp/decode_tri2g_vtln_diag/wer:Average WER is 3.127743 (392 / 12533) # feature-level VTLN + diag fMLLR
exp/decode_tri2g_vtln_diag_utt/wer:Average WER is 3.407006 (427 / 12533) # as above, per utt.
exp/decode_tri2g_vtln_nofmllr/wer:Average WER is 3.694247 (463 / 12533) # feature-level VTLN but no fMLLR
exp/decode_tri2h/wer:Average WER is 4.252773 (533 / 12533) # Splice-9-frames + HLDA
exp/decode_tri2i/wer:Average WER is 3.981489 (499 / 12533) # Triple-deltas + HLDA
exp/decode_tri2j/wer:Average WER is 3.853826 (483 / 12533) # Triple-deltas + LDA + MLLT
exp/decode_tri2k/wer:Average WER is 3.071890 (385 / 12533) # LDA + exponential transform (ET)
exp/decode_tri2k_utt/wer:Average WER is 3.039974 (381 / 12533) # per-utterance adaptation
exp/decode_tri2k_fmllr/wer:Average WER is 2.641028 (331 / 12533) # fMLLR (per-spk)
exp/decode_tri2k_regtree_fmllr/wer:Average WER is 2.688901 (337 / 12533) # +regression-tree
exp/decode_tri2l/wer:Average WER is 2.704859 (339 / 12533) # Splice-9-frames + LDA + MLLT + SAT (fMLLR in test)
exp/decode_tri2l_utt/wer:Average WER is 4.930982 (618 / 12533) # [ as decode_tri2l but per-utt in test. ]
# linear-VTLN on top of LDA+MLLT features.
exp/decode_tri2m/wer:Average WER is 3.223490 (404 / 12533) # offset-only transform after VTLN part of LVTLN
exp/decode_tri2m_diag/wer:Average WER is 3.119764 (391 / 12533) # diagonal transform after VTLN part of LVTLN
exp/decode_tri2m_diag_fmllr/wer:Average WER is 2.784649 (349 / 12533) # + fMLLR
exp/decode_tri2m_diag_utt/wer:Average WER is 3.279343 (411 / 12533) # [per-utt]
exp/decode_tri2m_vtln/wer:Average WER is 4.747467 (595 / 12533) # feature-space VTLN, plus offset-only transform
# (for some reason it failed)
exp/decode_tri2m_vtln_diag/wer:Average WER is 3.087848 (387 / 12533) # + diagonal transform
exp/decode_tri2m_vtln_diag_utt/wer:Average WER is 4.340541 (544 / 12533) # [per-utterance]
exp/decode_tri2m_vtln_nofmllr/wer:Average WER is 5.784728 (725 / 12533) # feature-space VTLN, with no fMLLR
# sgmma is SGMM without speaker vectors.
exp/decode_sgmma/wer:Average WER is 3.319237 (416 / 12533)
exp/decode_sgmma_fmllr/wer:Average WER is 2.928269 (367 / 12533)
exp/decode_sgmma_fmllr_utt/wer:Average WER is 3.303279 (414 / 12533)
exp/decode_sgmma_fmllrbasis_utt/wer:Average WER is 3.191574 (400 / 12533)
# sgmmb is SGMM with speaker vectors.
exp/decode_sgmmb/wer:Average WER is 2.521344 (316 / 12533)
exp/decode_sgmmb_fmllr/wer:Average WER is 2.377723 (298 / 12533)
exp/decode_sgmmb_utt/wer:Average WER is 2.728796 (342 / 12533)
# sgmmc is like sgmmb but with gender dependency
exp/decode_sgmmc/wer:Average WER is 2.720817 (341 / 12533)
exp/decode_sgmmc_fmllr/wer:Average WER is 2.489428 (312 / 12533)
# sgmmd is like sgmmb but with LDA+MLLT features.
exp/decode_sgmmd/wer:Average WER is 2.656986 (333 / 12533)
exp/decode_sgmmd_fmllr/wer:Average WER is 2.409639 (302 / 12533)
# sgmme is like sgmmb but with LDA+ET features.
exp/decode_sgmme/wer:Average WER is 2.337828 (293 / 12533)
exp/decode_sgmme_fmllr/wer:Average WER is 2.266018 (284 / 12533)
#### Note: stuff below this line may be out of date / not computed
# with most recent version of toolkit.
# note: when changing (phn,spk) dimensions from (40,39) -> (30,30),
# WER in decode_sgmmb/ went from 2.62 to 2.92
# when changing from (40,39) -> (50,39) [40->50 on iter 3],
# WER in decode_sgmmb/ went from 2.62 to 2.66 [and test likelihood
# got worse].
# sgmmc is as sgmmb but with a gender-dependent UBM, with 250
# Gaussians per gender instead of 400.  Note: gender info is
# used in test.
exp/decode_sgmmc/wer:Average WER is 2.784649 (349 / 12533)
exp/decode_sgmmc_fmllr/wer:Average WER is 2.688901 (337 / 12533)
# notes on timing of training with ATLAS vs. MKL:
# all below are with -O0 unless otherwise noted.
# tested time taken with "time steps/train_tri2a.sh"
# [on svatava]: with "default" compile (which is 32-bit+ATLAS)
# real 14m19.458s
# user 15m38.695s
# 64-bit+ATLAS:
#
# 64-bit+MKL:
# real 12m45.664s
# user 13m53.770s
# + removed -O0 -DKALDI_PARANOID:
# [made almost no difference to training]:
# real 12m31.829s
# user 13m48.967s
# sys 0m28.146s
# 64-bit but ATLAS instead of MKL
# [and with the default options, which include -O0 -DKALDI_PARANOID]:
# real 10m50.088s
# user 12m6.914s
# sys 0m17.419s
# Did this again:
# real 10m17.891s
# user 11m28.695s
# sys 0m14.087s
# But when I tested "fmllr-diag-gmm-test" (all configurations after removing
# the options -O0 -DKALDI_PARANOID), the ordering of the timings was
# different:
# 64-bit+ATLAS was 0.361s
# 64-bit+MKL was 0.307s
# 32-bit+ATLAS was 0.571s
# Testing /homes/eva/q/qpovey/sourceforge/kaldi/trunk/src/gmm/am-diag-gmm-test:
# 64-bit+ATLAS was 0.171s
# 32-bit+ATLAS was 0.205s
# 64-bit+MKL was 0.291s
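# A rough sketch of how the timing comparisons above could be reproduced; it
# assumes the toolkit has already been configured for the math library under
# test and only needs rebuilding, and that the test binaries live where they
# did in this source tree (commands run from the appropriate directories):
(cd src && make clean && make)            # rebuild against the chosen BLAS/LAPACK, from the source tree
time steps/train_tri2a.sh                 # whole-pipeline timing, run from the recipe directory
time src/transform/fmllr-diag-gmm-test    # single-binary micro-benchmarks
time src/gmm/am-diag-gmm-test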