kfold

kfold creates K-fold splits from data files and assists in training and testing (useful for cross-validation in supervised machine learning)

Command overview

help                 Display global or [command] help documentation.		
split                Split a data file into K partitions		
test                 Apply trained models on a dataset previously split using kfold		
train                Train models on a dataset previously split using kfold

Example usage

10-fold cross-validation of the standard MaltParser on a treebank named shuffled.c32.conll may be done as follows:

kfold split -f -i shuffled.c32.conll --fold -d '\n\n'
kfold train -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -m learn
kfold test -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -o %O -m parse
eval07.pl -q -g shuffled.c32.conll -s shuffled.c32.conll.output

The MaltParser does not like to put its models in a subdirectory, so rather than using the standard model files suggested by kfold (%M), we construct custom non-nested model filenames using %B.model_%N.

Command details

The following is simply the output of the built-in help commands.

Splitting data files

NAME:

  split

DESCRIPTION:

  Given the data file INPUT, the partitions are written to files named INPUT.parts/{01..K}

SYNOPSIS:

  kfold split -i INPUT [options]

EXAMPLES:

# Split the file sample.txt into 4 parts
kfold split -k4 sample.txt

# Split the double-newline-delimited file sample.conll into 10 parts
kfold split -d"\n\n" sample.conll

OPTIONS:

-i, --input FILE 
    Data file to split

-k, --parts N 
    The number of partitions desired

-d, --delimiter DELIM 
    String used to separate individual entries (newline per default)

-g, --granularity N 
    Ensure the number of entries in each partition is divisible by N (useful for block-structured data)

-f, --overwrite 
    Remove existing parts prior to executing

--fold 
    Additionally, create K folds of K-1 parts in a another folder

--parts-name STRING 
    Use the given name as suffix for the partitions folder created

--folds-name STRING 
    Use the given name as suffix for the folds folder created

Training on the folds

NAME:

  train

DESCRIPTION:

  Given training data previously split in K parts and folds, train K models on the K folds

  Certain keywords in the training command and its arguments are interpolated at runtime:

   * %N  - fold number, e.g. '01'
   * %F  - fold filename, e.g. 'brown.train/01'
   * %I  - alias for %F
   * %M  - model filename, e.g. 'brown.models/01'
   * %B  - basename (as specified on the command line), e.g. 'brown'

SYNOPSIS:

  kfold train --base NAME [options] -- CMD [--CMD-OPTIONS] [CMD-ARGS]

EXAMPLES:

# Train MaltParser for cross-validation
kfold train -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -m learn

OPTIONS:

-f, --overwrite 
    Remove existing models prior to executing

--base NAME 
    Default prefix of training folds and model files

--folds-name SUFFIX 
    Look for folds {01..K} in the folder BASE.SUFFIX

--models-name SUFFIX 
    Yield model names as BASE.SUFFIX/{01..K} as interpolation pattern %M

Testing the models on their reciprocal data file parts

NAME:

  test

DESCRIPTION:

  Process K parts of a split datafile using K previously trained models.

  Certain keywords in the testing command and its arguments are interpolated at runtime:

   * %N  - part number, e.g. '01'
   * %T  - part filename, e.g. 'brown.test/01'
   * %I  - alias for %T
   * %O  - output filename, e.g. 'brown.outputs/01'
   * %M  - model filename, e.g. 'brown.models/01'
   * %B  - basename (as specified on the command line), e.g. 'brown'

SYNOPSIS:

  kfold test --base NAME [options] -- CMD [--CMD-OPTIONS] [CMD-ARGS]

EXAMPLES:

# Apply trained MaltParser models for cross-validation
kfold test -f --base shuffled.c32.conll -- java -jar ~/Tools/malt-1.4.1/malt.jar -c %B.model_%N -i %T -o %O -m parse

OPTIONS:

-f, --overwrite 
    Remove existing test output prior to executing

--base NAME 
    Default prefix of model files and test outputs

--parts-name SUFFIX 
    Look for parts {01..K} to be processed in the folder BASE.SUFFIX

--models-name SUFFIX 
    Yield model names as BASE.SUFFIX/{01..K} as interpolation pattern %M

--outputs-name SUFFIX 
    Yield output filenames as BASE.SUFFIX/{01..K} as interpolation pattern %O

--output-name SUFFIX 
    Put the concatenated output of all models in BASE.SUFFIX