Commit e6291791 authored by Ilya Edrenkin

RNNLM-HS: fast recurrent nnet language model; WSJ example

git-svn-id: https://svn.code.sf.net/p/kaldi/code/trunk@4305 5e6a8d80-dfce-4ca6-a32a-6e07a63d50c8
parent c21eece5
......@@ -89,6 +89,21 @@ exit 0
%WER 5.74 [ 324 / 5643, 46 ins, 41 del, 237 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg/wer_19
%WER 5.90 [ 333 / 5643, 46 ins, 39 del, 248 sub ] exp/tri3b/decode_bd_tgpr_eval92_tg/wer_18
# this section demonstrates RNNLM-HS rescoring (commented out by default)
# the exact results may differ slightly, since hogwild multithreading in RNNLM-HS training introduces nondeterminism
%WER 5.92 [ 334 / 5643, 58 ins, 32 del, 244 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg/wer_14 # baseline (no rescoring)
%WER 5.42 [ 306 / 5643, 50 ins, 34 del, 222 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs100_0.3/wer_16
%WER 5.49 [ 310 / 5643, 47 ins, 36 del, 227 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs300_0.3/wer_18
%WER 5.90 [ 333 / 5643, 54 ins, 37 del, 242 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs30_0.15/wer_15
%WER 5.49 [ 310 / 5643, 45 ins, 38 del, 227 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.15/wer_18
%WER 5.49 [ 310 / 5643, 45 ins, 38 del, 227 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.15_N1000/wer_18
%WER 5.33 [ 301 / 5643, 41 ins, 41 del, 219 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3/wer_20
%WER 5.40 [ 305 / 5643, 41 ins, 41 del, 223 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3_N10/wer_20
%WER 5.33 [ 301 / 5643, 41 ins, 41 del, 219 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3_N1000/wer_20
%WER 5.26 [ 297 / 5643, 44 ins, 36 del, 217 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.4/wer_18
%WER 5.25 [ 296 / 5643, 44 ins, 36 del, 216 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.4_N1000/wer_18
%WER 5.26 [ 297 / 5643, 42 ins, 39 del, 216 sub ] exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.5_N1000/wer_20
%WER 14.17 [ 1167 / 8234, 222 ins, 123 del, 822 sub ] exp/tri3b/decode_tgpr_dev93/wer_17
%WER 19.37 [ 1595 / 8234, 315 ins, 153 del, 1127 sub ] exp/tri3b/decode_tgpr_dev93.si/wer_15
......
#!/bin/bash
. cmd.sh
# This step interpolates a small RNNLM (with weight 0.15) with the 4-gram LM.
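# (The first positional argument, 0.15 here, is the interpolation weight given
# to the RNNLM when its scores are combined with the original 4-gram LM scores
# on the N-best lists; roughly speaking, 1 - 0.15 goes to the 4-gram LM.)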
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--N 100 --cmd "$decode_cmd" --inv-acwt 17 \
0.15 data/lang_test_bd_fg data/local/rnnlm-hs.h30.voc10k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs30_0.15 \
|| exit 1;
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--N 100 --cmd "$decode_cmd" --inv-acwt 17 \
0.3 data/lang_test_bd_fg data/local/rnnlm-hs.h100.voc20k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs100_0.3 \
|| exit 1;
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--N 100 --cmd "$decode_cmd" --inv-acwt 17 \
0.3 data/lang_test_bd_fg data/local/rnnlm-hs.h300.voc30k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs300_0.3 \
|| exit 1;
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--N 100 --cmd "$decode_cmd" --inv-acwt 17 \
0.3 data/lang_test_bd_fg data/local/rnnlm-hs.h400.voc40k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3 \
|| exit 1;
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--N 1000 --cmd "$decode_cmd" --inv-acwt 17 \
0.3 data/lang_test_bd_fg data/local/rnnlm-hs.h400.voc40k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3_N1000 \
|| exit 1;
dir=exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.4_N1000
rm -rf $dir
cp -r exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3_N1000 $dir
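# Note: the runs below reuse the RNNLM scores already computed for the 0.3 run:
# we copy that decode directory and restart rnnlmrescore.sh at --stage 7, which
# skips the expensive RNNLM scoring stage and only redoes the interpolation
# with the new weight.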
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--stage 7 --N 1000 --cmd "$decode_cmd" --inv-acwt 17 \
0.4 data/lang_test_bd_fg data/local/rnnlm-hs.h400.voc40k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg $dir
dir=exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.4
rm -rf $dir
cp -r exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3 $dir
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--stage 7 --N 100 --cmd "$decode_cmd" --inv-acwt 17 \
0.4 data/lang_test_bd_fg data/local/rnnlm-hs.h400.voc40k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg $dir
dir=exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.15
rm -rf $dir
cp -r exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3 $dir
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--stage 7 --N 100 --cmd "$decode_cmd" --inv-acwt 17 \
0.15 data/lang_test_bd_fg data/local/rnnlm-hs.h400.voc40k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg $dir
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--N 10 --cmd "$decode_cmd" --inv-acwt 17 \
0.3 data/lang_test_bd_fg data/local/rnnlm-hs.h400.voc40k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3_N10 \
|| exit 1;
dir=exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.15_N1000
rm -rf $dir
cp -r exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3_N1000 $dir
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--stage 7 --N 1000 --cmd "$decode_cmd" --inv-acwt 17 \
0.15 data/lang_test_bd_fg data/local/rnnlm-hs.h400.voc40k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg $dir
dir=exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.5_N1000
rm -rf $dir
cp -r exp/tri3b/decode_bd_tgpr_eval92_fg_rnnlm-hs400_0.3_N1000 $dir
steps/rnnlmrescore.sh --rnnlm_ver rnnlm-hs-0.1b \
--stage 7 --N 1000 --cmd "$decode_cmd" --inv-acwt 17 \
0.5 data/lang_test_bd_fg data/local/rnnlm-hs.h400.voc40k data/test_eval92 \
exp/tri3b/decode_bd_tgpr_eval92_fg $dir
......@@ -16,8 +16,11 @@ cmd=run.pl
nwords=10000 # This is how many words we're putting in the vocab of the RNNLM.
hidden=30
class=200 # Num-classes... should be somewhat larger than sqrt of nwords.
direct=1000 # Probably number of megabytes to allocate for hash-table for "direct" connections.
direct=1000 # Number of weights that are used for "direct" connections, in millions.
rnnlm_ver=rnnlm-0.3e # version of RNNLM to use
threads=1 # for RNNLM-HS
bptt=2 # length of BPTT unfolding in RNNLM
bptt_block=20 # length of BPTT unfolding in RNNLM
# End configuration section.
[ -f ./path.sh ] && . ./path.sh
......@@ -42,20 +45,24 @@ export PATH=$KALDI_ROOT/tools/$rnnlm_ver:$PATH
# needed for me as I ran on a machine that had been setup
# as 64 bit by default.
cd $KALDI_ROOT/tools || exit 1;
if [ -d $rnnlm_ver ]; then
if [ -f $rnnlm_ver/rnnlm ]; then
echo Not installing the rnnlm toolkit since it is already there.
else
echo Downloading and installing the rnnlm tools
# http://www.fit.vutbr.cz/~imikolov/rnnlm/$rnnlm_ver.tgz
if [ ! -f $rnnlm_ver.tgz ]; then
wget http://www.fit.vutbr.cz/~imikolov/rnnlm/$rnnlm_ver.tgz || exit 1;
if [ $rnnlm_ver == "rnnlm-hs-0.1b" ]; then
extras/install_rnnlm_hs.sh
else
echo Downloading and installing the rnnlm tools
# http://www.fit.vutbr.cz/~imikolov/rnnlm/$rnnlm_ver.tgz
if [ ! -f $rnnlm_ver.tgz ]; then
wget http://www.fit.vutbr.cz/~imikolov/rnnlm/$rnnlm_ver.tgz || exit 1;
fi
mkdir $rnnlm_ver
cd $rnnlm_ver
tar -xvzf ../$rnnlm_ver.tgz || exit 1;
make CC=g++ || exit 1;
echo Done making the rnnlm tools
fi
fi
mkdir $rnnlm_ver
cd $rnnlm_ver
tar -xvzf ../$rnnlm_ver.tgz || exit 1;
make CC=g++ || exit 1;
echo Done making the rnnlm tools
fi
) || exit 1;
......@@ -128,15 +135,15 @@ echo "Training RNNLM (note: this uses a lot of memory! Run it on a big machine.)
# -direct-order 4 -direct 1000 -binary >& $dir/rnnlm1.log &
$cmd $dir/rnnlm.log \
$KALDI_ROOT/tools/$rnnlm_ver/rnnlm -independent -train $dir/train -valid $dir/valid \
-rnnlm $dir/rnnlm -hidden $hidden -rand-seed 1 -debug 2 -class $class -bptt 2 -bptt-block 20 \
$KALDI_ROOT/tools/$rnnlm_ver/rnnlm -threads $threads -independent -train $dir/train -valid $dir/valid \
-rnnlm $dir/rnnlm -hidden $hidden -rand-seed 1 -debug 2 -class $class -bptt $bptt -bptt-block $bptt_block \
-direct-order 4 -direct $direct -binary || exit 1;
# make it like a Kaldi table format, with fake utterance-ids.
cat $dir/valid.in | awk '{ printf("uttid-%d ", NR); print; }' > $dir/valid.with_ids
utils/rnnlm_compute_scores.sh $dir $dir/tmp.valid $dir/valid.with_ids \
utils/rnnlm_compute_scores.sh --rnnlm_ver $rnnlm_ver $dir $dir/tmp.valid $dir/valid.with_ids \
$dir/valid.scores
nw=`wc -w < $dir/valid.with_ids` # Note: valid.with_ids includes utterance-ids which
# is one per word, to account for the </s> at the end of each sentence; this is the
......
......@@ -66,6 +66,18 @@ local/wsj_format_data.sh || exit 1;
# local/wsj_train_rnnlms.sh --cmd "$decode_cmd -l mem_free=16G" \
# --hidden 300 --nwords 40000 --class 400 --direct 2000 data/local/rnnlm.h300.voc40k &
# )
# Comment out the "false &&" below to train RNNLM-HS
false && \
(
num_threads_rnnlm=8
local/wsj_train_rnnlms.sh --rnnlm_ver rnnlm-hs-0.1b --threads $num_threads_rnnlm \
--cmd "$decode_cmd -l mem_free=1G" --bptt 4 --bptt-block 10 --hidden 30 --nwords 10000 --direct 0 data/local/rnnlm-hs.h30.voc10k
local/wsj_train_rnnlms.sh --rnnlm_ver rnnlm-hs-0.1b --threads $num_threads_rnnlm \
--cmd "$decode_cmd -l mem_free=1G" --bptt 4 --bptt-block 10 --hidden 100 --nwords 20000 --direct 0 data/local/rnnlm-hs.h100.voc20k
local/wsj_train_rnnlms.sh --rnnlm_ver rnnlm-hs-0.1b --threads $num_threads_rnnlm \
--cmd "$decode_cmd -l mem_free=1G" --bptt 4 --bptt-block 10 --hidden 300 --nwords 30000 --direct 0 data/local/rnnlm-hs.h300.voc30k
local/wsj_train_rnnlms.sh --rnnlm_ver rnnlm-hs-0.1b --threads $num_threads_rnnlm \
--cmd "$decode_cmd -l mem_free=1G" --bptt 4 --bptt-block 10 --hidden 400 --nwords 40000 --direct 0 data/local/rnnlm-hs.h400.voc40k
)
) &
......@@ -246,6 +258,10 @@ steps/lmrescore.sh --cmd "$decode_cmd" data/lang_test_bd_tgpr data/lang_test_bd_
# that build the RNNLMs, so it would fail.
# local/run_rnnlms_tri3b.sh
# The command below is commented out as we commented out the steps above
# that build the RNNLMs (HS version), so it would fail.
# wait; local/run_rnnlm-hs_tri3b.sh
# The following two steps, which are a kind of side-branch, try mixing up
( # from the 3b system. This is to demonstrate that script.
steps/mixup.sh --cmd "$train_cmd" \
......
......@@ -13,8 +13,10 @@ use_phi=false # This is kind of an obscure option. If true, we'll remove the o
# difference (if any) to WER, it's more so we know we are doing the right thing.
test=false # Activate a testing option.
stage=1 # Stage of this script, for partial reruns.
rnnlm_ver=rnnlm-0.3e
# End configuration section.
echo "$0 $@" # Print the command line for logging
[ -f ./path.sh ] && . ./path.sh
......@@ -151,7 +153,7 @@ fi
if [ $stage -le 6 ]; then
echo "$0: invoking rnnlm_compute_scores.sh which calls rnnlm, to get RNN LM scores."
$cmd JOB=1:$nj $dir/log/rnnlm_compute_scores.JOB.log \
utils/rnnlm_compute_scores.sh $rnndir $adir.JOB/temp $adir.JOB/words_text $adir.JOB/lmwt.rnn \
utils/rnnlm_compute_scores.sh --rnnlm_ver $rnnlm_ver $rnndir $adir.JOB/temp $adir.JOB/words_text $adir.JOB/lmwt.rnn \
|| exit 1;
fi
if [ $stage -le 7 ]; then
......
......@@ -17,9 +17,12 @@
# words; unk.probs gives the probs for words given this class, and it
# has, on each line, "word prob".
rnnlm_ver=rnnlm-0.3e
. ./path.sh || exit 1;
. utils/parse_options.sh
rnnlm=$KALDI_ROOT/tools/rnnlm-0.3e/rnnlm
rnnlm=$KALDI_ROOT/tools/$rnnlm_ver/rnnlm
[ ! -f $rnnlm ] && echo No such program $rnnlm && exit 1;
......
#!/bin/bash
# Make sure we are in the tools/ directory.
if [ `basename $PWD` == extras ]; then
cd ..
fi
! [ `basename $PWD` == tools ] && \
echo "You must call this script from the tools/ directory" && exit 1;
echo "Installing RNNLM-HS 0.1b"
cd rnnlm-hs-0.1b
make
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
CC = gcc
# The -Ofast might not work with older versions of gcc; in that case, use -O2
CFLAGS = -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result -std=c99 -g
all: rnnlm
rnnlm : rnnlm.c
$(CC) rnnlm.c -o rnnlm $(CFLAGS)
clean:
rm -rf rnnlm
\ No newline at end of file
1) OVERVIEW
This is a tool to estimate recurrent neural network language models on large amounts of text data.
The source code is heavily based on the word2vec toolkit. Basically this is just a small extension of word2vec that makes it possible to apply it to language modeling.
https://code.google.com/p/word2vec/
The main ideas, interface and maxent extension implementation come from RNNLM written by Tomas Mikolov:
http://rnnlm.org
The differences from Mikolov's published RNNLM are hierarchical softmax and hogwild multithreading (both tricks are taken directly from word2vec). This makes it possible to train RNNLM-HS (hierarchical softmax) on large corpora, e.g. billions of words. However, on small and medium-sized corpora Tomas Mikolov's RNNLM works considerably better, both in terms of entropy and WER. RNNLM-HS is also much faster at test time, which is useful for online ASR.
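For readers unfamiliar with the trick, here is a minimal sketch of how hierarchical softmax assigns a word probability (an illustration only; the struct and variable names are hypothetical and the actual rnnlm.c layout differs): instead of normalising over the whole vocabulary, the probability of a word is the product of binary sigmoid decisions along its path in the Huffman tree, so scoring one word costs O(log V) rather than O(V).

#include <math.h>
#include <stdio.h>

/* Hypothetical per-word Huffman path (the real rnnlm.c layout differs). */
typedef struct {
  const int  *point;   /* inner tree nodes visited on the way to the word */
  const char *code;    /* branch taken at each node: 0 = left, 1 = right  */
  int         codelen; /* path length, roughly log2(vocabulary size)      */
} vocab_word;

static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* P(word | hidden) = product of binary decisions along the Huffman path;
   cost is O(codelen * hidden_size) instead of O(vocab * hidden_size).    */
static double hs_word_prob(const vocab_word *w, const double *hidden,
                           const double *tree_weights, int hidden_size) {
  double prob = 1.0;
  for (int i = 0; i < w->codelen; i++) {
    const double *node = tree_weights + (long)w->point[i] * hidden_size;
    double score = 0.0;
    for (int j = 0; j < hidden_size; j++) score += hidden[j] * node[j];
    double p = sigmoid(score);
    prob *= w->code[i] ? p : 1.0 - p;
  }
  return prob;
}

int main(void) {
  /* Toy example: a 2-node path with hidden size 3. */
  double hidden[3] = {0.1, -0.2, 0.3};
  double tree_weights[2 * 3] = {0.5, 0.1, -0.4, 0.2, 0.0, 0.3};
  int point[2] = {0, 1};
  char code[2] = {1, 0};
  vocab_word w = {point, code, 2};
  printf("P(word) = %f\n", hs_word_prob(&w, hidden, tree_weights, 3));
  return 0;
}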
Please send your ideas and proposals regarding this tool to ilia@yandex-team.com (Ilya Edrenkin, Yandex LLC). Bug reports and fixes are also of course welcome.
2) USAGE EXAMPLES
A typical example of obtaining a reasonable model on a large corpus (~4 billion words) in a couple of days on a 16-core machine:
./rnnlm -train corpus.shuf.split-train -valid corpus.shuf.split-valid -size 100 -model corpus.shuf.split-train.h100me5-1000.t16 -threads 16 -alpha 0.1 -bptt 4 -bptt-block 10 -maxent-order 5 -maxent-size 1000
Fine-tuning an existing model on a smaller in-domain corpus:
./rnnlm -train corpus.indomain.split-train -valid corpus.indomain.split-valid -model corpus.shuf.split-train.h100me5-1000.t16 -threads 1 -bptt 0 -alpha 0.01 -recompute-counts 1
Obtaining individual logprobs for a set of test sentences:
./rnnlm -model corpus.shuf.split-train.h100me5-1000.t16 -test corpus.test
Interactive sampling from an existing model:
./rnnlm -model corpus.shuf.split-train.h100me5-1000.t16 -gen -10
3) USAGE ADVICE
- you don't need to repeat the structural parameters (size, maxent-order, maxent-size) when using an existing model; they will be ignored. The vocab saved in the model will be reused.
- the vocabulary is built from the training file on the first run of the tool for a particular model. The program will ignore sentences with OOVs at training time (or report them at test time).
- set the number of threads to the number of physical CPUs or less.
- vocabulary size plays a very small role in performance (the cost is logarithmic in the vocabulary size thanks to the Huffman tree decomposition). Hidden layer size and the amount of training data are the main factors.
- the model will be written to file after a training epoch if and only if its validation entropy improved compared to the previous epoch.
- using multithreading together with unlimited BPTT (setting -bptt 0) may cause the net to diverge.
- unlike Mikolov's RNNLM, which has an -independent switch, this tool always treats sentences as independent and does not track global context.
- it is a good idea to shuffle the sentences before splitting them into training and validation sets (GNU shuf and split are one possible way to do it).
4) PARAMETERS
-train <file>
Use text data from <file> to train the model
-valid <file>
Use text data from <file> to perform validation and control learning rate
-test <file>
Use text data from <file> to compute logprobs with an existing model
Train, valid and test corpora. All distinct words found in the training file will be used for the nnet vocab; their counts determine the Huffman tree structure and remain fixed for this nnet.
If you prefer to use a limited vocabulary (say, the top 1 million words), you should map all other words to <unk> or another token of your choice. A limited vocabulary is usually a good idea if it helps you get enough training examples for each word.
-rnnlm <file>
Use <file> to save the resulting language model
Will create <file> and <file>.nnet (the vocab/counts in text form and the net itself in binary form).
If <file> and <file>.nnet already exist, the tool will attempt to load them instead of starting a new training run.
-hidden <int>
Set size of hidden layer; default is 100
Large hidden layers (300 or more) are slow; sometimes they even fail to learn well in combination with multithreading and large or unlimited BPTT depth.
-bptt <int>
Set length of BPTT unfolding; default is 3; set to 0 to disable truncation
-bptt-block <int>
Set period of BPTT unfolding; default is 10; BPTT is performed every bptt+bptt_block steps
The default BPTT parameters are reasonable in most cases.
-gen <int>
Sampling mode; number of sentences to sample, default is 0 (off); enter negative number for interactive mode
Can complete a given prefix or generate a whole sentence (which corresponds to an empty prefix). Can be fun with a well-trained model. It could probably also be used to generate text for an n-gram variational approximation of the RNNLM, though I have never tried that myself.
-threads <int>
Use <int> threads (default 1)
The performance does not scale linearly with the number of threads (it is sublinear due to cache misses, false hogwild assumptions, etc).
Testing, validation and sampling are always performed by a single thread regardless of this setting.
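To illustrate why the scaling is sublinear, here is a minimal hogwild-style sketch (a toy under my own assumptions, not the actual rnnlm.c code): each thread applies SGD-style updates to a shared weight array without any locking, accepting occasional lost or stale updates in exchange for avoiding synchronisation overhead.

#include <pthread.h>
#include <stdio.h>

#define DIM 1000000
static float weights[DIM];               /* shared parameters, no locks */

/* Each worker updates the shared array without synchronisation ("hogwild");
   colliding writes occasionally lose an update, which is one reason
   throughput grows sublinearly with the thread count. */
static void *hogwild_worker(void *arg) {
  unsigned int seed = (unsigned int)(long)arg + 1;
  const float lr = 0.1f;
  for (long step = 0; step < 1000000; step++) {
    seed = seed * 1664525u + 1013904223u;          /* cheap LCG            */
    int i = (int)(seed % DIM);
    float grad = 0.001f;                           /* stand-in gradient    */
    weights[i] -= lr * grad;                       /* unsynchronised write */
  }
  return NULL;
}

int main(void) {
  enum { NUM_THREADS = 8 };
  pthread_t tid[NUM_THREADS];
  for (long t = 0; t < NUM_THREADS; t++)
    pthread_create(&tid[t], NULL, hogwild_worker, (void *)t);
  for (int t = 0; t < NUM_THREADS; t++)
    pthread_join(tid[t], NULL);
  printf("done; weights[0] = %f\n", weights[0]);
  return 0;
}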
-min-count <int>
This will discard words that appear less than <int> times; default is 0
Inherited from word2vec; not tested thoroughly. It is better to map rare words to <unk> by hand before training.
-alpha <float>
Set the starting learning rate; default is 0.1
-maxent-alpha <float>
Set the starting learning rate for maxent; default is 0.1
Maxent is somewhat prone to overfitting. Sometimes it makes sense to make maxent-alpha less than the 'main' alpha.
-reject-threshold <float>
Reject the nnet and reload it from the previous epoch if the relative entropy improvement on the validation set is below this threshold (default 0.997)
-stop <float>
Stop training when the relative entropy improvement on the validation set is below this threshold (default 1.003); see also -retry
-retry <int>
Stop training if <int> retries with a halved learning rate have failed (default 2)
The training schedule is as follows: if the validation-set entropy improvement is less than <stop>, the learning rate (for both the nnet and maxent) starts halving with each iteration.
In this mode, if the validation-set entropy improvement is again less than <stop>, the retry counter is incremented; when it reaches <retry>, training stops.
In addition, if the validation-set entropy improvement is less than <reject-threshold>, the nnet is rejected and reloaded from the previous epoch.
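For clarity, here is a compact sketch of that schedule (my paraphrase of the description above, with made-up helper functions; the real implementation may differ in details):

#include <stdio.h>

/* Placeholder hooks: in the real tool these run one epoch of training and
   measure entropy on the validation set. Here they are trivial stand-ins. */
static double fake_entropy = 10.0;
static void train_one_epoch(double alpha) { (void)alpha; fake_entropy *= 0.999; }
static double validation_entropy(void)    { return fake_entropy; }
static void save_model(void)              { puts("model saved"); }
static void reload_previous_model(void)   { puts("epoch rejected, reloading"); }

/* Schedule described above: start halving the learning rate once improvement
   drops below <stop>, reject epochs below <reject-threshold>, and give up
   after <retry> failed retries. */
static void train_with_schedule(double alpha, double stop,
                                double reject_threshold, int max_retries) {
  double prev_entropy = 1e30;
  int halving = 0, retries = 0;
  for (;;) {
    train_one_epoch(alpha);
    double entropy = validation_entropy();
    double improvement = prev_entropy / entropy;   /* >1 means we improved  */
    if (improvement < reject_threshold) {
      reload_previous_model();
    } else {
      save_model();
      prev_entropy = entropy;
    }
    if (improvement < stop) {
      if (halving && ++retries >= max_retries) break;
      halving = 1;
    }
    if (halving) alpha *= 0.5;                     /* nnet and maxent rates */
  }
}

int main(void) {
  /* README defaults: -alpha 0.1, -stop 1.003, -reject-threshold 0.997, -retry 2 */
  train_with_schedule(0.1, 1.003, 0.997, 2);
  return 0;
}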
-debug <int>
Set the debug mode (default = 2 = more info during training)
Inherited from word2vec. Set debug to 0 if you don't want to see speed statistics.
-direct-size <int>
Set the size of the hash table for maxent parameters, in millions (default 0 = maxent off)
-direct-order <int>
Set the order of n-gram features to be used in maxent (default 3)
Maxent extension. Off by default. It speeds up convergence a lot and also improves entropy; the only drawback is the memory demand, e.g. setting -direct-size 1000 will cost you ~4 GB for the nnet file.
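As a rough illustration of where the ~4 GB figure comes from and how hashed n-gram ("direct") features are typically looked up (the hashing scheme below is hypothetical, not the one in rnnlm.c): the size is given in millions of float weights, so 1000 million weights times 4 bytes is about 4 GB.

#include <stdio.h>

/* Hypothetical lookup of one hashed n-gram (maxent/"direct") feature weight:
   the word history is hashed into a fixed-size table of float weights.      */
static float direct_weight(const float *table, unsigned long long table_size,
                           const int *history, int order) {
  unsigned long long h = 14695981039346656037ULL;   /* arbitrary FNV-style seed */
  for (int i = 0; i < order; i++)
    h = h * 1099511628211ULL + (unsigned long long)(history[i] + 1);
  return table[h % table_size];
}

int main(void) {
  /* -direct-size 1000 => 1000 million float weights ~= 4 GB, as in the README. */
  double n_weights = 1000.0 * 1000000.0;
  printf("maxent table: %.1f GB\n", n_weights * sizeof(float) / 1e9);

  /* Tiny toy table just to exercise the lookup. */
  float table[16] = {0};
  int history[3] = {42, 7, 13};
  printf("feature weight: %f\n", direct_weight(table, 16, history, 3));
  return 0;
}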
-beta1 <float>
L2 regularisation parameter for RNNLM weights (default 1e-6)
-beta2 <float>
L2 regularisation parameter for maxent weights (default 1e-6)
Maxent is somewhat prone to overfitting. Sometimes it makes sense to make beta2 larger than beta1.
-recompute-counts <int>
Recompute train words counts, useful for fine-tuning (default = 0 = use counts stored in the vocab file)
Vocabulary counts are stored in the nnet file and used to reconstruct the Huffman tree. When fine-tuning an existing model on a new corpus, use this option. The new counts will not be saved; they are used for the fine-tuning session only.
5) FUTURE PLANS
A large number of tricks that help in training RNNs have been discussed in the literature and could be applied to this tool, but they are not implemented in this release:
- Better initialization (spectral radius trick, Bayesian model selection)
- Using ReLU/softplus instead of sigmoid
- Momentum/NAG
- Second order methods, HF
- LSTM
- Adapting code for GPUs
- More efficient SGD parallelisation
If you are interested in contributing, feel free to participate.