# RESULTS file (committed by Dan Povey).

# Note: these results will vary somewhat from OS to OS, because
# some algorithms call rand().

First, comparing with published results

feb89 oct89 feb91 sep92   avg
  2.77 4.02 3.30 6.29 4.10  % from ICASSP'99 paper by Povey & Woodland on Frame Discrimination (ML baseline)
  3.20 4.10 2.86 6.06 4.06  % from decode_tri2c (which is triphone + CMN)
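As a quick sanity check on the comparison above (a sketch in Python; the per-set WERs are transcribed from the two rows), each "avg" column should be the plain mean of the four test sets, up to two-decimal rounding:

```python
# Verify the quoted "avg" column is the mean of the four per-set WERs
# (feb89, oct89, feb91, sep92), allowing for two-decimal rounding.
rows = {
    "ICASSP'99 ML baseline": ([2.77, 4.02, 3.30, 6.29], 4.10),
    "decode_tri2c":          ([3.20, 4.10, 2.86, 6.06], 4.06),
}
avgs = {}
for name, (wers, quoted) in rows.items():
    avgs[name] = sum(wers) / len(wers)
    # Quoted averages are rounded to two decimals, so allow half-a-unit slack.
    assert abs(avgs[name] - quoted) <= 0.005 + 1e-9
    print(f"{name}: mean WER = {avgs[name]:.3f} (quoted {quoted:.2f})")
```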

exp/decode_mono/wer:Average WER is 14.234421 (1784 / 12533) # Monophone system, subset

exp/decode_tri1/wer:Average WER is 4.420330 (554 / 12533)    # First triphone pass
exp/decode_tri1_fmllr/wer:Average WER is 3.837868 (481 / 12533) # + fMLLR
exp/decode_tri1_regtree_fmllr/wer:Average WER is 3.789994 (475 / 12533) # + regression-tree


# results in decode_tri1_latgen with varying acoustic
# scale: looks like we had not tuned this too well (10 is the 
# default).
exp/decode_tri1_latgen/wer:Average WER is 4.420330 (554 / 12533) 
exp/decode_tri1_latgen/wer_6:Average WER is 3.917657 (491 / 12533)
exp/decode_tri1_latgen/wer_7:Average WER is 3.813931 (478 / 12533)
exp/decode_tri1_latgen/wer_8:Average WER is 3.869784 (485 / 12533)
exp/decode_tri1_latgen/wer_9:Average WER is 4.013405 (503 / 12533)
exp/decode_tri1_latgen/wer_10:Average WER is 4.085215 (512 / 12533)
exp/decode_tri1_latgen/wer_11:Average WER is 4.188941 (525 / 12533)
exp/decode_tri1_latgen/wer_12:Average WER is 4.420330 (554 / 12533)
exp/decode_tri1_latgen/wer_13:Average WER is 4.555972 (571 / 12533)
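The percentages in the sweep above can be recomputed from the raw error counts (a sketch, assuming each wer_N file reports errors out of the same 12533 reference words, and that the N suffix names the scale setting being varied, per the note above):

```python
# Recompute WER% = 100 * errors / reference_words for each setting in the
# sweep, and pick out the best-performing one.
total_words = 12533
errors = {6: 491, 7: 478, 8: 485, 9: 503, 10: 512, 11: 525, 12: 554, 13: 571}
wer = {n: 100.0 * e / total_words for n, e in errors.items()}
best = min(wer, key=wer.get)
for n in sorted(wer):
    print(f"wer_{n}: {wer[n]:.6f}%  ({errors[n]} / {total_words})")
print("lowest WER at setting", best)
```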

# Lattice oracle error rate for exp/decode_tri1_latgen/
# when acoustic scale is set to 10
Beam 0.01 Average WER is 4.085215 (512 / 12533) 
Beam 0.5  Average WER is 3.702226 (464 / 12533) 
Beam 1    Average WER is 3.279343 (411 / 12533) 
Beam 5    Average WER is 1.412272 (177 / 12533) 
Beam 10   Average WER is 0.582462 ( 73 / 12533) 

# Results on a second pass of triphone system building--
# various configurations.

exp/decode_tri2a/wer:Average WER is 3.973510 (498 / 12533)  # Second triphone pass
exp/decode_tri2a_fmllr/wer:Average WER is 3.590521 (450 / 12533) # + fMLLR
exp/decode_tri2a_fmllr_utt/wer:Average WER is 3.933615 (493 / 12533)  # [ fMLLR per utterance ]
exp/decode_tri2a_dfmllr/wer:Average WER is 3.861805 (484 / 12533)  # + diagonal fMLLR
exp/decode_tri2a_dfmllr_utt/wer:Average WER is 3.933615 (493 / 12533)  # [ diagonal fMLLR per utterance]
exp/decode_tri2a_dfmllr_fmllr/wer:Average WER is 3.622437 (454 / 12533)  # diagonal fMLLR, then estimate fMLLR and re-decode

exp/decode_tri2b/wer:Average WER is 3.143701 (394 / 12533)  # Exponential transform
exp/decode_tri2b_fmllr/wer:Average WER is 3.055932 (383 / 12533)  # +fMLLR
exp/decode_tri2b_utt/wer:Average WER is 3.295300 (413 / 12533)  # [adapt per-utt]
exp/decode_tri2c/wer:Average WER is 3.957552 (496 / 12533) # Cepstral mean subtraction (per-spk)
exp/decode_tri2d/wer:Average WER is 4.316604 (541 / 12533) # MLLT (= global STC)
exp/decode_tri2e/wer:Average WER is 4.659698 (584 / 12533) # splice-9-frames + LDA features
exp/decode_tri2f/wer:Average WER is 3.885742 (487 / 12533) # splice-9-frames + LDA + MLLT
exp/decode_tri2g/wer:Average WER is 3.303279 (414 / 12533) # Linear VTLN
exp/decode_tri2g_diag/wer:Average WER is 3.135722 (393 / 12533) # Linear VTLN; diagonal adapt in test
exp/decode_tri2g_diag_fmllr/wer:Average WER is 3.063911 (384 / 12533) # as above but then est. fMLLR (another decoding pass)
exp/decode_tri2g_diag_utt/wer:Average WER is 3.399027 (426 / 12533) 
exp/decode_tri2g_vtln/wer:Average WER is 3.239448 (406 / 12533) # Use warp factors -> feature-level VTLN + offset estimation
exp/decode_tri2g_vtln_diag/wer:Average WER is 3.127743 (392 / 12533)  # feature-level VTLN  + diag fMLLR
exp/decode_tri2g_vtln_diag_utt/wer:Average WER is 3.407006 (427 / 12533)  # as above, per utt.
exp/decode_tri2g_vtln_nofmllr/wer:Average WER is 3.694247 (463 / 12533) # feature-level VTLN but no fMLLR

exp/decode_tri2h/wer:Average WER is 4.252773 (533 / 12533) # Splice-9-frames + HLDA
exp/decode_tri2i/wer:Average WER is 3.981489 (499 / 12533) # Triple-deltas + HLDA
exp/decode_tri2j/wer:Average WER is 3.853826 (483 / 12533) # Triple-deltas + LDA + MLLT


exp/decode_tri2k/wer:Average WER is 3.071890 (385 / 12533) # LDA + exponential transform (ET)
exp/decode_tri2k_utt/wer:Average WER is 3.039974 (381 / 12533)  # per-utterance adaptation
exp/decode_tri2k_fmllr/wer:Average WER is 2.641028 (331 / 12533) # fMLLR (per-spk)
exp/decode_tri2k_regtree_fmllr/wer:Average WER is 2.688901 (337 / 12533)  # +regression-tree

exp/decode_tri2l/wer:Average WER is 2.704859 (339 / 12533) # Splice-9-frames + LDA + MLLT + SAT (fMLLR in test)
exp/decode_tri2l_utt/wer:Average WER is 4.930982 (618 / 12533) # [ as decode_tri2l but per-utt in test. ]

# linear-VTLN on top of LDA+MLLT features.
exp/decode_tri2m/wer:Average WER is 3.223490 (404 / 12533) # offset-only transform after VTLN part of LVTLN
exp/decode_tri2m_diag/wer:Average WER is 3.119764 (391 / 12533)  # diagonal transform after VTLN part of LVTLN
exp/decode_tri2m_diag_fmllr/wer:Average WER is 2.784649 (349 / 12533) # + fMLLR
exp/decode_tri2m_diag_utt/wer:Average WER is 3.279343 (411 / 12533)   # [per-utt]
exp/decode_tri2m_vtln/wer:Average WER is 4.747467 (595 / 12533)   # feature-space VTLN, plus offset-only transform
                                                                  #  (for some reason it failed)
exp/decode_tri2m_vtln_diag/wer:Average WER is 3.087848 (387 / 12533)  # + diagonal transform
exp/decode_tri2m_vtln_diag_utt/wer:Average WER is 4.340541 (544 / 12533)  # [per-utterance]
exp/decode_tri2m_vtln_nofmllr/wer:Average WER is 5.784728 (725 / 12533)  # feature-space VTLN, with no fMLLR

# sgmma is SGMM without speaker vectors.
exp/decode_sgmma/wer:Average WER is 3.319237 (416 / 12533) 
exp/decode_sgmma_fmllr/wer:Average WER is 2.928269 (367 / 12533) 
exp/decode_sgmma_fmllr_utt/wer:Average WER is 3.303279 (414 / 12533) 
exp/decode_sgmma_fmllrbasis_utt/wer:Average WER is 3.191574 (400 / 12533) 

# sgmmb is SGMM with speaker vectors.
exp/decode_sgmmb/wer:Average WER is 2.521344 (316 / 12533) 
exp/decode_sgmmb_fmllr/wer:Average WER is 2.377723 (298 / 12533) 
exp/decode_sgmmb_utt/wer:Average WER is 2.728796 (342 / 12533) 

# sgmmc is like sgmmb but with gender dependency
exp/decode_sgmmc/wer:Average WER is 2.720817 (341 / 12533) 
exp/decode_sgmmc_fmllr/wer:Average WER is 2.489428 (312 / 12533) 

# sgmmd is like sgmmb but with LDA+MLLT features.
exp/decode_sgmmd/wer:Average WER is 2.656986 (333 / 12533) 
exp/decode_sgmmd_fmllr/wer:Average WER is 2.409639 (302 / 12533) 

# sgmme is like sgmmb but with LDA+ET features.
exp/decode_sgmme/wer:Average WER is 2.337828 (293 / 12533) 
exp/decode_sgmme_fmllr/wer:Average WER is 2.266018 (284 / 12533) 


#### Note: stuff below this line may be out of date / not computed
# with most recent version of toolkit.
# note: when changing (phn,spk) dimensions from (40,39) -> (30,30),
# WER in decode_sgmmb/ went from 2.62 to 2.92
# when changing from (40,39) -> (50,39)  [40->50 on iter 3],
# WER in decode_sgmmb/ went from 2.62 to 2.66 [and test likelihood
# got worse].

# sgmmc is as sgmmb but with a gender-dependent UBM, with 250
# Gaussians per gender instead of 400 Gaussians in total.  Note:
# gender info is used in test.
exp/decode_sgmmc/wer:Average WER is 2.784649 (349 / 12533) 
exp/decode_sgmmc_fmllr/wer:Average WER is 2.688901 (337 / 12533) 


# Notes on training time with ATLAS vs. MKL:
# all timings below are with -O0.
# Measured with "time steps/train_tri2a.sh"
#  [on svatava]: with "default" compile (which is 32-bit+ATLAS)
#   real    14m19.458s
#   user    15m38.695s
# 64-bit+ATLAS:
#
# 64-bit+MKL:
# real    12m45.664s
# user    13m53.770s
# + removed -O0 -DKALDI_PARANOID:
# [made almost no difference to training]:
# real    12m31.829s
# user    13m48.967s
# sys 0m28.146s

# 64-bit but ATLAS instead of MKL
# [and with default options, which include: -O0 -DKALDI_PARANOID].
# real    10m50.088s
# user    12m6.914s
# sys     0m17.419s
# Did this again:
# real    10m17.891s
# user    11m28.695s
# sys     0m14.087s

# But when I tested "fmllr-diag-gmm-test", all after removing
# the options -O0 -DKALDI_PARANOID, the ordering of timing was
# different:
# 64-bit+ATLAS was 0.361s
# 64-bit+MKL was 0.307s
# 32-bit+ATLAS was 0.571s

# Testing /homes/eva/q/qpovey/sourceforge/kaldi/trunk/src/gmm/am-diag-gmm-test:
# 64-bit+ATLAS was 0.171s
# 32-bit+ATLAS was 0.205s
# 64-bit+MKL was 0.291s
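The full training run and the unit-test micro-benchmarks rank ATLAS and MKL in opposite orders; a small sketch (times transcribed from the notes above) makes the flip explicit:

```python
# Compare wall-clock ("real") times: full training run vs. the
# fmllr-diag-gmm-test micro-benchmark, both after removing -O0/-DKALDI_PARANOID.
train = {"64-bit+MKL": 12 * 60 + 45.664, "64-bit+ATLAS": 10 * 60 + 50.088}
fmllr_test = {"64-bit+ATLAS": 0.361, "64-bit+MKL": 0.307, "32-bit+ATLAS": 0.571}
print("train_tri2a.sh fastest:", min(train, key=train.get))
print("fmllr-diag-gmm-test fastest:", min(fmllr_test, key=fmllr_test.get))
```

So the micro-benchmark ranking does not predict which BLAS wins on end-to-end training.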