# These results were obtained around svn revision 23 (just prior to
# tagging kaldi-1.0).
# Note: these results will vary somewhat from OS to OS, because
# some algorithms call rand().

First, comparing with published results:

 feb89  oct89  feb91  sep92   avg
  2.77   4.02   3.30   6.29  4.10   % from my ICASSP'99 paper on Frame Discrimination (ML baseline)
  3.20   4.10   2.86   6.06  4.06   % from decode_tri2c (which is triphone + CMN)

exp/decode_mono/wer:Average WER is 14.234421 (1784 / 12533) # Monophone system, subset
exp/decode_tri1/wer:Average WER is 4.420330 (554 / 12533) # First triphone pass
exp/decode_tri1_fmllr/wer:Average WER is 3.837868 (481 / 12533) # + fMLLR
exp/decode_tri1_regtree_fmllr/wer:Average WER is 3.789994 (475 / 12533) # + regression-tree
exp/decode_tri2a/wer:Average WER is 3.973510 (498 / 12533) # Second triphone pass
exp/decode_tri2a_fmllr/wer:Average WER is 3.590521 (450 / 12533) # + fMLLR
exp/decode_tri2a_fmllr_utt/wer:Average WER is 3.933615 (493 / 12533) # [ fMLLR per utterance ]
exp/decode_tri2a_dfmllr/wer:Average WER is 3.861805 (484 / 12533) # + diagonal fMLLR
exp/decode_tri2a_dfmllr_utt/wer:Average WER is 3.933615 (493 / 12533) # [ diagonal fMLLR per utterance ]
exp/decode_tri2a_dfmllr_fmllr/wer:Average WER is 3.622437 (454 / 12533) # diagonal fMLLR, then estimate fMLLR and re-decode
exp/decode_tri2b/wer:Average WER is 3.303279 (414 / 12533) # Exponential transform
exp/decode_tri2b_fmllr/wer:Average WER is 3.047953 (382 / 12533) # + fMLLR
exp/decode_tri2b_utt/wer:Average WER is 3.335195 (418 / 12533) # [ adapt per-utt ]
exp/decode_tri2c/wer:Average WER is 3.957552 (496 / 12533) # Cepstral mean subtraction (per-spk)
exp/decode_tri2d/wer:Average WER is 4.316604 (541 / 12533) # MLLT (= global STC)
exp/decode_tri2e/wer:Average WER is 4.659698 (584 / 12533) # splice-9-frames + LDA features
exp/decode_tri2f/wer:Average WER is 3.885742 (487 / 12533) # splice-9-frames + LDA + MLLT
exp/decode_tri2g/wer:Average WER is 3.303279 (414 / 12533) # Linear VTLN
exp/decode_tri2g_diag/wer:Average WER is 3.135722 (393 / 12533) # Linear VTLN; diagonal adapt in test
exp/decode_tri2g_diag_fmllr/wer:Average WER is 3.063911 (384 / 12533) # as above but then est. fMLLR (another decoding pass)
exp/decode_tri2g_diag_utt/wer:Average WER is 3.399027 (426 / 12533)
exp/decode_tri2g_vtln/wer:Average WER is 3.239448 (406 / 12533) # Use warp factors -> feature-level VTLN + offset estimation
exp/decode_tri2g_vtln_diag/wer:Average WER is 3.127743 (392 / 12533) # feature-level VTLN + diag fMLLR
exp/decode_tri2g_vtln_diag_utt/wer:Average WER is 3.407006 (427 / 12533) # as above, per utt.
exp/decode_tri2g_vtln_nofmllr/wer:Average WER is 3.694247 (463 / 12533) # feature-level VTLN but no fMLLR
exp/decode_tri2h/wer:Average WER is 4.252773 (533 / 12533) # Splice-9-frames + HLDA
exp/decode_tri2i/wer:Average WER is 3.981489 (499 / 12533) # Triple-deltas + HLDA
exp/decode_tri2j/wer:Average WER is 3.853826 (483 / 12533) # Triple-deltas + LDA + MLLT
exp/decode_tri2k/wer:Average WER is 2.968164 (372 / 12533) # LDA + exponential transform
exp/decode_tri2k_utt/wer:Average WER is 3.175616 (398 / 12533) # per-utterance adaptation
exp/decode_tri2k_fmllr/wer:Average WER is 2.505386 (314 / 12533) # + fMLLR (per-spk)
exp/decode_tri2k_regtree_fmllr/wer:Average WER is 2.513365 (315 / 12533) # + regression tree
exp/decode_tri2l/wer:Average WER is 2.704859 (339 / 12533) # Splice-9-frames + LDA + MLLT + SAT (fMLLR in test)
exp/decode_tri2l_utt/wer:Average WER is 4.930982 (618 / 12533) # [ as decode_tri2l but per-utt in test ]
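
# (For reference: each "Average WER is P (e / n)" figure above is simply
# P = 100 * e / n, word errors over total reference words.  A minimal
# Python sketch of that arithmetic; the function name is illustrative,
# not Kaldi code:

def wer_percent(errors, total):
    """Word error rate as a percentage: 100 * errors / total."""
    return 100.0 * errors / total

# E.g. the monophone line above: wer_percent(1784, 12533) -> 14.234421...
assert abs(wer_percent(1784, 12533) - 14.234421) < 1e-4
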
# sgmma is SGMM without speaker vectors.
exp/decode_sgmma/wer:Average WER is 3.319237 (416 / 12533)
exp/decode_sgmma_fmllr/wer:Average WER is 2.934308 (289 / 9849)
exp/decode_sgmma_fmllr_utt/wer:Average WER is 3.303279 (414 / 12533)
exp/decode_sgmma_fmllrbasis_utt/wer:Average WER is 3.191574 (400 / 12533)
# sgmmb is SGMM with speaker vectors.
exp/decode_sgmmb/wer:Average WER is 2.760712 (346 / 12533)
exp/decode_sgmmb_utt/wer:Average WER is 2.808585 (352 / 12533)
exp/decode_sgmmb_fmllr/wer:Average WER is 2.553259 (320 / 12533)
# sgmmc is like sgmmb but with gender dependency [doesn't help here]
exp/decode_sgmmc/wer:Average WER is 2.776670 (348 / 12533)
exp/decode_sgmmc_fmllr/wer:Average WER is 2.601133 (326 / 12533)

# [Below is a second set of results for the same experiments, from a
# separate run; as noted at the top, results vary somewhat across
# OSes/machines.  This set also includes the tri2m experiments.]
exp/decode_tri2a/wer:Average WER is 4.476183 (561 / 12533)
exp/decode_tri2a_fmllr/wer:Average WER is 3.718184 (466 / 12533)
exp/decode_tri2a_fmllr_utt/wer:Average WER is 4.452246 (558 / 12533)
exp/decode_tri2b/wer:Average WER is 2.992101 (375 / 12533)
exp/decode_tri2b_utt/wer:Average WER is 3.247427 (407 / 12533)
exp/decode_tri2c/wer:Average WER is 3.789994 (475 / 12533)
exp/decode_tri2d/wer:Average WER is 4.188941 (525 / 12533)
exp/decode_tri2e/wer:Average WER is 4.923003 (617 / 12533)
exp/decode_tri2f/wer:Average WER is 3.782015 (474 / 12533)
exp/decode_tri2g/wer:Average WER is 3.670310 (460 / 12533)
exp/decode_tri2g_diag/wer:Average WER is 3.550626 (445 / 12533) # + change mean-only to diagonal fMLLR
exp/decode_tri2g_vtln/wer:Average WER is 3.534668 (443 / 12533) # More conventional VTLN (+ mean-only fMLLR)
exp/decode_tri2g_vtln_diag/wer:Average WER is 3.438921 (431 / 12533) # + change mean-only to diagonal fMLLR
exp/decode_tri2g_vtln_diag_utt/wer:Average WER is 3.614458 (453 / 12533) # [ per-utt ]
exp/decode_tri2g_vtln_nofmllr/wer:Average WER is 4.069257 (510 / 12533) # more conventional VTLN, no fMLLR
exp/decode_tri2h/wer:Average WER is 4.252773 (533 / 12533) # Splice-9-frames + HLDA
exp/decode_tri2i/wer:Average WER is 4.077236 (511 / 12533) # Triple-deltas + HLDA
exp/decode_tri2j/wer:Average WER is 3.694247 (463 / 12533) # Triple-deltas + LDA + MLLT
exp/decode_tri2k/wer:Average WER is 2.856459 (358 / 12533) # LDA + exponential transform
exp/decode_tri2k_utt/wer:Average WER is 3.071890 (385 / 12533) # per-utterance adaptation
exp/decode_tri2k_fmllr/wer:Average WER is 2.585175 (324 / 12533) # + fMLLR (per-spk)
exp/decode_tri2k_regtree_fmllr/wer:Average WER is 2.561238 (321 / 12533) # + regression tree
exp/decode_tri2l/wer:Average WER is 2.688901 (337 / 12533) # Splice-9-frames + LDA + MLLT + SAT (fMLLR in test)
exp/decode_tri2l_utt/wer:Average WER is 5.066624 (635 / 12533) # [ as decode_tri2l but per-utt in test ]
exp/decode_tri2m/wer:Average WER is 3.223490 (404 / 12533) # Splice + LDA + MLLT + Linear VTLN
exp/decode_tri2m_diag/wer:Average WER is 3.119764 (391 / 12533) # diagonal, not offset, CMLLR component
exp/decode_tri2m_diag_fmllr/wer:Average WER is 2.784649 (349 / 12533) # diagonal CMLLR component; then est. CMLLR and re-decode
exp/decode_tri2m_diag_utt/wer:Average WER is 3.279343 (411 / 12533) # [ per-utterance ]
exp/decode_tri2m_vtln/wer:Average WER is 4.747467 (595 / 12533) # feature-level VTLN computation
exp/decode_tri2m_vtln_diag/wer:Average WER is 3.087848 (387 / 12533) # diagonal, not offset, adapt
exp/decode_tri2m_vtln_diag_utt/wer:Average WER is 4.340541 (544 / 12533) # per-utterance, diag adapt

# sgmma is SGMM without speaker vectors.
exp/decode_sgmma/wer:Average WER is 3.151680 (395 / 12533)
exp/decode_sgmma_fmllr/wer:Average WER is 2.728796 (342 / 12533)
exp/decode_sgmma_fmllr_utt/wer:Average WER is 3.087848 (387 / 12533)
exp/decode_sgmma_fmllrbasis_utt/wer:Average WER is 2.896354 (363 / 12533)
# sgmmb is SGMM with speaker vectors.
exp/decode_sgmmb/wer:Average WER is 2.617091 (328 / 12533)
exp/decode_sgmmb_utt/wer:Average WER is 2.696880 (338 / 12533)
exp/decode_sgmmb_fmllr/wer:Average WER is 2.505386 (314 / 12533)
# Note: when changing (phn,spk) dimensions from (40,39) -> (30,30),
# WER in decode_sgmmb/ went from 2.62 to 2.92; when changing from
# (40,39) -> (50,39) [40 -> 50 on iter 3], WER in decode_sgmmb/ went
# from 2.62 to 2.66 [and test likelihood got worse].
# sgmmc is as sgmmb but with a gender-dependent UBM, with 250 Gaussians
# per gender instead of 400 Gaussians; uses gender info in test.
# [Doesn't help here.]
exp/decode_sgmmc/wer:Average WER is 2.784649 (349 / 12533)
exp/decode_sgmmc_fmllr/wer:Average WER is 2.688901 (337 / 12533)

# Notes on timing of training with ATLAS vs. MKL; all below are with -O0.
# Tested time taken with "time steps/train_tri2a.sh" [on svatava]:
#
# "Default" compile (which is 32-bit + ATLAS):
#   real  14m19.458s
#   user  15m38.695s
#
# 64-bit + ATLAS:
#
# 64-bit + MKL:
#   real  12m45.664s
#   user  13m53.770s
#
# + removed -O0 -DKALDI_PARANOID [made almost no difference to training]:
#   real  12m31.829s
#   user  13m48.967s
#   sys    0m28.146s
#
# 64-bit but ATLAS instead of MKL
# [and with default options, which include -O0 -DKALDI_PARANOID]:
#   real  10m50.088s
#   user  12m6.914s
#   sys    0m17.419s
# Did this again:
#   real  10m17.891s
#   user  11m28.695s
#   sys    0m14.087s
#
# But when I tested "fmllr-diag-gmm-test", all after removing the options
# -O0 -DKALDI_PARANOID, the ordering of the timings was different:
#   64-bit + ATLAS was 0.361s
#   64-bit + MKL   was 0.307s
#   32-bit + ATLAS was 0.571s
#
# Testing /homes/eva/q/qpovey/sourceforge/kaldi/trunk/src/gmm/am-diag-gmm-test:
#   64-bit + ATLAS was 0.171s
#   32-bit + ATLAS was 0.205s
#   64-bit + MKL   was 0.291s
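
# (To compare figures like those above across runs, it can help to parse the
# "wer:Average WER" lines programmatically.  A minimal Python sketch, assuming
# this file's line format; the regex and function name are illustrative, not
# part of Kaldi:

import re

# Matches e.g.: exp/decode_tri2b/wer:Average WER is 3.303279 (414 / 12533)
WER_LINE = re.compile(r"^(\S+)/wer:Average WER is ([\d.]+) \((\d+) / (\d+)\)")

def parse_results(path):
    """Yield (decode_dir, wer_percent, errors, total) for each WER line."""
    with open(path) as f:
        for line in f:
            m = WER_LINE.match(line.strip())
            if m:
                yield (m.group(1), float(m.group(2)),
                       int(m.group(3)), int(m.group(4)))
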