BrainScript and Python: Understanding and Extending Readers

Starting from version 1.5, CNTK is moving away from the monolithic reader design towards a more composable model that allows you to specify and compose input data of different formats.

Previously, each reader was responsible for all aspects of data reading, including but not limited to:

  • Deserialization of the data from external storage into in-memory representation
  • Randomization of the whole corpus
  • Different transformations of input sequences/samples (e.g. cropping or scaling of images)
  • Creation of minibatches for different modes (e.g. frame, sequence or truncated BPTT) with a layout that can be consumed by the GPU
  • Prefetching at the level of minibatches and IO chunks

In version 1.5, the major pieces of the above functionality were factored out and moved to core CNTK to be shared between different readers. This version also introduces two main abstractions that can be extended in order to support new data formats:

  • deserializer - responsible for deserializing input from external storage into in-memory sequences
  • transform - transforms an input sequence into an output sequence

In the next sections we discuss these abstractions in more detail.

Configuring a reader (minibatch source) in Python

This section provides several examples of how a composite reader (aka MinibatchSource) can be configured in Python.

The following example, adapted from AlexNet_ImageNet_Distributed.py, shows the Python equivalent of the AlexNet reader from the Transforms section.

from cntk.io import MinibatchSource, ImageDeserializer, StreamDef, StreamDefs

mean_file = ...
map_file = ...

# model dimensions
image_height = 227
image_width  = 227
num_channels = 3  # RGB
num_classes  = 1000

transforms = [
    ImageDeserializer.crop(crop_type='randomside',
                           side_ratio=0.88671875,
                           jitter_type='uniratio'),
    ImageDeserializer.scale(width=image_width,
                            height=image_height,
                            channels=num_channels,
                            interpolations='linear'),
    ImageDeserializer.mean(mean_file)
]

reader = MinibatchSource(
    ImageDeserializer(map_file, StreamDefs(
        # first column in map file is referred to as 'image'
        features = StreamDef(field='image', transforms=transforms),
        # and second as 'label' 
        labels   = StreamDef(field='label', shape=num_classes)))) 
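
Once the reader is constructed, it can be queried for data directly. The following sketch is not part of the original example; it assumes the reader object defined above and the CNTK 2.x Python API (input_variable, next_minibatch), and shows how reader streams are bound to network inputs and how a minibatch is fetched:

import cntk as C

# network inputs matching the stream shapes declared above (hypothetical names)
features_var = C.input_variable((num_channels, image_height, image_width))
labels_var   = C.input_variable(num_classes)

# bind reader streams to the network inputs
input_map = {
    features_var: reader.streams.features,
    labels_var:   reader.streams.labels
}

# fetch a single minibatch of 64 samples
minibatch = reader.next_minibatch(64, input_map=input_map)
print(minibatch[features_var].num_sequences)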

The following example (adapted from A2_RunCntk_py3.py) shows how several deserializers can be combined.

from cntk.io import MinibatchSource, ImageDeserializer, CTFDeserializer

n_rois = 100
n_classes = 17
rois_dim = 4 * n_rois
label_dim = n_classes * n_rois

map_file = ...
roi_file = ...
label_file = ...

# read images
scale = ImageDeserializer.scale(width=1000, 
                                height=1000, 
                                channels=3,
                                scale_mode="pad", 
                                pad_value=114, 
                                interpolations='linear')
image_source = ImageDeserializer(map_file)
image_source.ignore_labels()
image_source.map_features('features', [scale])

# read rois and labels
roi_source = CTFDeserializer(roi_file)
roi_source.map_input('rois', dim=rois_dim, format="dense")
label_source = CTFDeserializer(label_file)
label_source.map_input('roiLabels', dim=label_dim, format="dense")

# define a composite reader
reader = MinibatchSource([image_source, roi_source, label_source])

...

# define mapping from reader streams to network inputs
input_map = {
    image_input: reader.streams.features,
    roi_input: reader.streams.rois,
    label_input: reader.streams.roiLabels
}
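
The input_map can then be passed to next_minibatch so that the data coming from all three deserializers is bound to the right network inputs. Below is a minimal training-loop sketch, not part of the original example; it assumes image_input, roi_input and label_input are already-defined input variables, trainer is an existing cntk.Trainer, and max_samples is a hypothetical training budget:

minibatch_size = 64
samples_seen = 0
max_samples = 50000                      # hypothetical training budget in samples
while samples_seen < max_samples:
    data = reader.next_minibatch(minibatch_size, input_map=input_map)
    if not data:                         # no more data available
        break
    trainer.train_minibatch(data)
    samples_seen += trainer.previous_minibatch_sample_count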

BrainScript

Deserializers

Let's have a look at the following fragment of configuration for the HTKMLFReader from the end-to-end LSTM/FullUtterance test (full config here):

...
# Old reader config. For illustration only.
reader = [
    readerType = "HTKMLFReader"
    readMethod = "blockRandomize"
    nbruttsineachrecurrentiter = 32
    randomize = "auto"
    verbosity = 0

    features = [
        dim = 363
        type = "real"
        scpFile = "$DataDir$/glob_0000.scp"
    ]

    labels = [
        mlfFile = "$DataDir$/glob_0000.mlf"
        labelMappingFile = "$DataDir$/state.list"

        labelDim = 132
        labelType = "category"
    ]
]

This configuration fragment declares a reader that produces two streams of data named "features" and "labels". It takes two types of files as input:

  • a list of feature files known in HTK parlance as an scp file (“script” file)
  • a label file known as mlf file (“master label file”)

In the above configuration fragment there are no explicit entities that define how the scp or mlf formats are deserialized; everything is encapsulated in the HTKMLFReader configuration. So if you need to expose yet another input stream of a different data format alongside scp and mlf, you would have to modify the HTKMLFReader itself to add support for it.

To increase composability and reuse, the new configuration for the same input explicitly defines deserializers and the input streams that they produce:

reader = [
    verbosity = 0
    randomize = true

    # A list of deserializers the reader uses.
    deserializers = (
        [
            # Type of deserializer, in this case the one that knows
            # how to deserialize HTK feature files.
            type = "HTKFeatureDeserializer"
            # Module (.dll or .so) where this deserializer is implemented
            module = "HTKDeserializers"

            # Description of input streams the deserializer provides,
            # can be one or many, depending on a particular
            # deserializer implementation
            # For HTKFeatureDeserializer, just one stream can be described.
            input = [
                # Description of input stream to feed the Input node named "features"
                features = [
                    dim = 363
                    scpFile = "$DataDir$/glob_0000.scp"
                ]
            ]
        ]:
        [
            # Type of deserializer, in this case the one
            # that knows how to deserialize mlf files.
            type = "HTKMLFDeserializer"
            module = "HTKDeserializers"
            # Description of input streams the deserializer provides,
            # For HTKMLFDeserializer, just one stream can be described.
            input = [
                # Description of input stream to feed the Input node named "labels"
                labels = [
                    dim = 132
                    mlfFile = "$DataDir$/glob_0000.mlf"
                    labelMappingFile = "$DataDir$/state.list"
                    # whether phone boundary information should be encoded
                    # set to true in CTC-type training
                    phoneBoundaries=false
                ]
            ]
        ]
    )
]

The sequences produced by the MLF and HTK feature deserializers are combined based on their logical key (a string that uniquely identifies a speech utterance and is present in both the scp and mlf files). When you need another stream in a different format, you can simply add the corresponding deserializer to the configuration (note that the HTK feature and HTK MLF deserializers currently cannot expose more than one input stream each).
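
For comparison, the same pairing of an HTK feature deserializer and an HTK MLF deserializer can also be expressed in Python. The sketch below is an illustration only; it assumes the CNTK 2.x cntk.io API (HTKFeatureDeserializer, HTKMLFDeserializer) and placeholder file paths:

from cntk.io import (MinibatchSource, HTKFeatureDeserializer, HTKMLFDeserializer,
                     StreamDef, StreamDefs)

scp_file     = ...  # list of feature files ("script" file)
mlf_file     = ...  # master label file
mapping_file = ...  # state list with one label per line

feature_source = HTKFeatureDeserializer(StreamDefs(
    features=StreamDef(shape=363, scp=scp_file)))
label_source = HTKMLFDeserializer(mapping_file, StreamDefs(
    labels=StreamDef(shape=132, mlf=mlf_file)))

# sequences from the two deserializers are matched by their logical key
reader = MinibatchSource([feature_source, label_source])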

Note

Currently, both the old and the new reader configurations are supported. When the "deserializers" key is used in the reader configuration, the reader type is implicitly set to "CompositeDataReader". To make sure the CompositeDataReader module can be loaded on Windows, Cntk.Composite.dll should be located in the same directory as the CNTK executable. On Linux, Cntk.Composite.so should be in the lib folder that sits side by side with the bin folder containing the CNTK executable.

Currently CNTK supports the following deserializers:

| Deserializer type | Module | Description |
| --- | --- | --- |
| HTKFeatureDeserializer | HTKDeserializers | Deserializer for HTK feature files |
| HTKMLFDeserializer | HTKDeserializers | Deserializer for HTK MLF files |
| ImageDeserializer | ImageReader | Deserializer for images encoded as plain files or in a zip archive |
| Base64ImageDeserializer | ImageReader | Deserializer for images encoded as base64 strings in the mapping file |
| CNTKTextFormatDeserializer | CNTKTextFormatReader | Deserializer for CNTK text format files |
| CNTKBinaryFormatDeserializer | CNTKBinaryReader | Deserializer for CNTK binary format files |

Please refer to the tables below for the full description of the configuration parameters.

Transforms

A transform is a simple abstraction that takes a sequence as input, performs some transformation of the samples in the sequence and returns the output sequence. Typical examples are image transformations such as crop, scale or transpose. Transforms can be configured on a per-input basis.

Let's have a look at how transforms can be applied to an input (the config is taken from the Tests/EndToEndTests/Image/AlexNet test):

deserializers = ([
    type = "ImageDeserializer"
    module = "ImageReader"

    # Map file which maps images to labels
    file = "$ConfigDir$/train_map.txt"

    # Description of input streams
    input = [
            # Description of input stream to feed the Input node named "features"
            features = [
                transforms = (
                    [
                        type = "Crop"
                        # Possible values: Center, RandomSide, RandomArea, Multiview10. Default: Center
                        cropType = "RandomSide"
                        # Crop scale side ratio.
                        sideRatio = 0.875
                        # Crop scale ratio jitter type
                        jitterType = "UniRatio"
                    ]:[
                        type = "Scale"
                        width = 224
                        height = 224
                        channels = 3
                        # Interpolation to use when scaling image to width x height size.
                        interpolations = "linear"
                    ]:[
                        type = "Mean"
                        # Stores mean values for each pixel in OpenCV matrix XML format.
                        meanFile = "$ConfigDir$/ImageNet1K_mean.xml"
                    ]:[
                        # Changes the image layout from HWC to CHW
                        type = "Transpose"
                    ]
                )
            ]
            # Description of input stream to feed the Input node named "labels"
            labels = [
                labelDim = 1000
            ]
        ]
])

In this configuration, four transforms are applied to the input stream features. Initially, the image deserializer produces sequences consisting of a single image in HWC representation. After that, the ordered list of transforms is applied to the image: first the Crop transform, followed by Scale and Mean. The last transform is Transpose, which changes the image layout from HWC to CHW.

Currently the following transforms are implemented. For their detailed description please see ImageReader.

| Transform type | Module |
| --- | --- |
| Crop | ImageReader |
| Scale | ImageReader |
| Color | ImageReader |
| Mean | ImageReader |
| Transpose | ImageReader |

New Reader Configuration Format Description

A reader configuration section that composes several data deserializers looks as follows:

reader = [
    randomize = true|false
    verbosity = 0|1|2
    ...

    deserializers = (
        [<deserializerConfiguration1>]:
        [<deserializerConfiguration2>]:
        ...
        [<deserializerConfigurationN>]
    )
]

Each deserializer configuration is specified as:

[
    module = "<readerModuleName>"   # Name of the external module (.dll or .so) where this particular deserializer is implemented
    type = "<deserializerType>"     # The type of the deserializer

    # There could be more deserializer-specific options in this section

    # Data deserializer input - describes a set of streams this deserializer produces.
    # It can be one (as in HTK) or many (as in CNTKTextFormat)
    input = [
        # Replace 'InputNameN' by the name of the corresponding input node in the network.
        InputName1 = [<inputConfiguration>]
        InputName2 = [<inputConfiguration>]
        ...
    ]
]

An input configuration contains input-specific options and, optionally, an ordered list of transforms that should be applied to the input:

[
    # Per-input data deserializer-specific options

    # Optionally, a pipeline of transforms implemented by the data deserializer's reader module:
    transforms = (
       [<transformationConfiguration1>]:
       [<transformationConfiguration2>]:
       ...
       [<transformationConfigurationN>]
    )
]

A transform configuration identifies the transform type and any transform-specific options:

[
    type = "<transformName>"
    # Transform-specific options
]

Configuration Options

General Reader Configuration

| Parameter | Description |
| --- | --- |
| verbosity | Verbosity level (0, 1, 2); controls the diagnostic output of different components (Randomizer, Deserializer, Bundler, etc.). Optional, defaults to 0. |
| randomize | Specifies whether the input should be randomized (true, false). The randomization method is identical to blockRandomize of the HTKMLFReader. Optional, defaults to true. |
| randomizationSeed | Initial randomization seed value (incremented every sweep when the input data is re-randomized). Optional, defaults to 0. |
| randomizationWindow | Specifies the size (positive integer) of the randomization window (i.e., the randomization range). This parameter affects how much of the dataset needs to reside in memory at once. Optional, defaults to the size of the whole dataset (in samples or in chunks, depending on the sampleBasedRandomizationWindow value). However, if one of the deserializers is CNTKTextFormatDeserializer and sampleBasedRandomizationWindow was not explicitly set to true, randomizationWindow defaults to 128 (which is approximately 4GB of disk space worth of chunks). This parameter is ignored when randomize is false. |
| sampleBasedRandomizationWindow | If true, the size of the randomization window is interpreted as a number of samples, otherwise as a number of chunks. Optional, defaults to true if CNTKTextFormatDeserializer is not present in the list of deserializers, and to false otherwise. Similarly to randomizationWindow, this parameter is ignored when randomize is false. |
| truncationLength | Specifies the truncation length in samples for BPTT (positive integer). Required only if truncated is true, ignored otherwise. |
| multiThreadedDeserialization | Specifies whether multiple threads should be used when collecting sequences for a minibatch from the deserializers (true, false). Optional. |
| frameMode | Specifies whether data should be randomized and returned at the frame or at the sequence level. When true, input sequences are split into frames. Optional. frameMode and truncated cannot both be set to true at the same time. |
| truncated | When true, enables truncated back-propagation through time (BPTT). Optional. frameMode and truncated cannot both be set to true at the same time. |
| useNumericSequenceKeys | Sequence keys are used to correlate sequences between different deserializers. For some deserializers (e.g. HTK and MLF) the sequence keys are arbitrary strings, and storing them requires a lot of memory on a big corpus. If you are sure your sequence keys are numeric, set this parameter to true; in that case all string keys will be converted to integers, decreasing memory pressure. Optional, defaults to false. |
| hashSequenceKeys | For the memory reasons described above, string keys can also be hashed by setting this parameter to true. Please use it only for deserializers that support string sequence keys (HTK, MLF). Optional, defaults to false. |
| cacheIndex | Specifies whether the meta-data built during the pre-processing stage should be written out to disk and loaded from disk if available (true, false). Optional, defaults to false. For more details, see the section below. New in CNTK version 2.1. |

Index caching

Note

New in CNTK version 2.1.

Index caching can significantly reduce start-up times (by a factor of 2-3x), especially when working with large input files. Setting the cacheIndex flag to true signals the reader to write the indexing meta-data to disk (the same directory as the input file) if the cache file is not available or if it is stale (older than the input file). The writing is best-effort and is carried out on a separate thread in order not to affect reader performance. If the cache file is present and up-to-date, the reader will no longer skim the input file to build the index; instead, it will load the index from the cache file. Please note that certain reader configuration parameters have a direct impact on indexing (for instance, different values of frameMode could potentially result in indices with a different number of sequences). For that reason, a cache file could be ignored by a reader whose configuration differs from the one that produced the cache. To see the full benefit of caching, the configuration should not be modified on subsequent reruns.

cacheIndex has no effect on ImageDeserializer and CNTKBinaryFormatDeserializer, as the former does not index the input data and the latter has the index information embedded in the format itself.

General Deserializer Configuration

| Parameter | Description |
| --- | --- |
| module | Specifies the reader module name that implements the data deserializer. Required. |
| type | Specifies a data deserializer name exposed by the given reader module. Required. |

General Transform Configuration

| Parameter | Description |
| --- | --- |
| type | Specifies a transform name exposed by the reader module implementing the data deserializer. Required. |

HTKFeatureDeserializer options

| Parameter | Description |
| --- | --- |
| scpFile | A list of paths to SCP files to be processed. The files should be HTK-compatible files and must be specified in the "archive" format. The details of using an archive are described in HTKMLF Reader. Required. |
| dim | An integer that specifies the full feature vector dimension with the desired context window.¹ Required. |
| contextWindow | Can be specified either as a pair of positive integers or as a single positive integer (in which case it is interpreted as a pair with the same number repeated twice). Specifies the left and right size (first and second integer of the pair) of the context window in samples. Optional, defaults to 1. |
| prefixPathInSCP | A prefix string to apply to the paths specified within the SCP files. Optional. |

¹ For example, if you had 72-dimensional features (24-dimensional filterbank features plus delta and delta-delta coefficients) and the network is designed to process a context window of 11 frames, the specified dimension should be 792.
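
In other words, dim is the per-frame feature dimension multiplied by the total number of frames in the context window. A quick check of the footnote's arithmetic (the variable names are purely illustrative):

feature_dim    = 72   # 24 filterbank coefficients plus delta and delta-delta
context_frames = 11   # 5 frames of left context + current frame + 5 frames of right context
dim = feature_dim * context_frames
assert dim == 792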

HTKMLFDeserializer options

| Parameter | Description |
| --- | --- |
| mlfFile | Path to an HTK-style mlf file that contains the labels for all utterances specified in the scp file(s). Required if mlfFileList is not specified. |
| mlfFileList | Array of paths to HTK-style mlf file(s) that contain the labels for all utterances specified in the scp file(s). Required if mlfFile is not specified. |
| dim | Total cardinality of the label set (positive integer). Required. |
| labelMappingFile | Path to a file that lists all the labels seen in the mlf file, one per line. Required. |

labelDim can be used as a synonym for dim.

CNTKTextFormatDeserializer options

The same options that can be used with the CNTKTextFormatReader.
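
As a point of reference, the Python counterpart of this deserializer is CTFDeserializer (also used in the second Python example above). Below is a minimal sketch with hypothetical stream names and dimensions, assuming a text file whose samples are tagged |x (dense) and |y (sparse):

from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs

ctf_file = ...  # path to a file in CNTK text format

reader = MinibatchSource(CTFDeserializer(ctf_file, StreamDefs(
    # '|x 0.1 0.2 ...' entries: dense, 100-dimensional
    x=StreamDef(field='x', shape=100, is_sparse=False),
    # '|y 3:1' entries: sparse, 10 classes
    y=StreamDef(field='y', shape=10, is_sparse=True))))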

ImageDeserializer options

  • file: a simple text file where each line contains a tab-separated mapping between a logical sequence key, an image file (e.g. JPEG, PNG, etc.) and a 0-based label.

For more information please see ImageReader.

Base64ImageDeserializer options

This deserializer supports the same options that can be used with ImageDeserializer. The only difference is in the format of the mapping file:

  • file: a simple text file where each line contains a tab-separated mapping between a logical sequence key (optional, can be omitted), a 0-based category label and a base64-encoded image file (e.g. JPEG, PNG, etc.).

Examples of Configurations and Tests

You will find complete network definitions and the corresponding data set examples in the CNTK Repository. There you will also find Unit and End-to-End Tests that use deserializers, for example: