March 29, 2016

Input Data Format

The dataset input format for KeLP takes inspiration from the SvmLight/LibSVM formalism, extending it to handle multiple labels and multiple representations. Note that the classes mentioned below are described here.

A dataset is generally represented in a text file, where each row is an example that can take one of the following forms:

The former row refers to a SimpleExample while the latter describes an ExamplePair, where leftExample and rightExample recursively have the form of one of these two formalisms.

Each example starts with a list of labels separated by whitespace. A label can be a simple string in the case of a classification label, or can have the form propertyName:value (for instance height:10) in the case of a regression value. This formalism supports multilabel classification tasks as well as multivariate regression tasks. Note that an isolated number is treated as a classification label.

In the SimpleExample case, the label part is followed by a list of representations. In the previous example there are two representations. Each representation must be enclosed between a begin-of-representation sequence of the form |Btype:name| and an end-of-representation sequence of the form |Etype|, where type is an identifier of the representation class (e.g., V for SparseVector) and name is an identifier for that specific representation (e.g., BoW for a bag-of-words representation). If no name is specified for a representation, it is identified by its position within the sequence (e.g., the third representation is automatically named 3). The name uniquely identifies a representation within an example and is needed to support examples having multiple representations of the same class.
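The layout described above can be illustrated with a small parser. This is only a sketch in plain Java, not KeLP's own loading code: the class SimpleExampleLineSketch and its fields are hypothetical names, the syntax accepted for an unnamed representation (|Btype| without the :name part) is an assumption, and the content of each representation is kept as an opaque string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of KeLP): splits one SimpleExample row
// into classification labels, regression values and representations.
class SimpleExampleLineSketch {
    // Matches |Btype:name| content |Etype|. The ":name" part is optional;
    // the unnamed syntax |Btype| is an assumption of this sketch.
    private static final Pattern REPRESENTATION =
        Pattern.compile("\\|B([A-Za-z]+)(?::([^|\\s]+))?\\|(.*?)\\|E\\1\\|");

    final List<String> classificationLabels = new ArrayList<>();
    final Map<String, Double> regressionValues = new LinkedHashMap<>();
    // representation name -> { type identifier, raw content }
    final Map<String, String[]> representations = new LinkedHashMap<>();

    static SimpleExampleLineSketch parse(String line) {
        SimpleExampleLineSketch ex = new SimpleExampleLineSketch();
        int firstMarker = line.indexOf("|B");
        String labelPart = firstMarker < 0 ? line : line.substring(0, firstMarker);
        for (String label : labelPart.trim().split("\\s+")) {
            if (label.isEmpty()) {
                continue;
            }
            int colon = label.indexOf(':');
            if (colon > 0) {
                // propertyName:value is a regression label (value must be numeric)
                ex.regressionValues.put(label.substring(0, colon),
                        Double.parseDouble(label.substring(colon + 1)));
            } else {
                // plain strings and isolated numbers are classification labels
                ex.classificationLabels.add(label);
            }
        }
        if (firstMarker >= 0) {
            Matcher m = REPRESENTATION.matcher(line.substring(firstMarker));
            int position = 0;
            while (m.find()) {
                position++;
                // an unnamed representation is named after its position
                String name = m.group(2) != null ? m.group(2) : String.valueOf(position);
                ex.representations.put(name, new String[] { m.group(1), m.group(3).trim() });
            }
        }
        return ex;
    }

    public static void main(String[] args) {
        SimpleExampleLineSketch ex =
            parse("positive height:10 |BV:bow| kelp:1.0 svm:0.5 |EV| |BS| a plain comment |ES|");
        System.out.println(ex.classificationLabels);
        System.out.println(ex.regressionValues);
        System.out.println(ex.representations.keySet());
    }
}
```

In this sketch the second representation carries no name, so it is stored under the key 2, mirroring the positional naming rule described above.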

Each representation has its own formalism:

  • DenseVector. Its type identifier is DV and its textual description is a sequence of numbers separated by whitespace (or by commas or semicolons).
  • SparseVector. Its type identifier is V and its textual description is a sequence of featureName:featureValue pairs separated by whitespace (this is the same formalism as SVMlight and LibSVM, except that featureName is not required to be a number: it can be a generic string).
  • StringRepresentation. Its type identifier is S and its textual description is plain text.
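As an illustration of the three formats above, the following sketch writes each representation between its begin and end markers. It uses plain Java only: BasicRepresentationSketch and its methods are hypothetical helpers rather than KeLP classes, the representation names (emb, bow, comment) are invented, and the single space placed between the markers and the content is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of KeLP): renders basic representations as text.
class BasicRepresentationSketch {
    // Wraps a textual description between |Btype:name| and |Etype|.
    static String wrap(String type, String name, String content) {
        return "|B" + type + ":" + name + "| " + content + " |E" + type + "|";
    }

    // DenseVector (DV): numbers separated by whitespace.
    static String denseVector(String name, double... values) {
        StringJoiner joined = new StringJoiner(" ");
        for (double value : values) {
            joined.add(Double.toString(value));
        }
        return wrap("DV", name, joined.toString());
    }

    // SparseVector (V): featureName:featureValue pairs separated by whitespace.
    static String sparseVector(String name, Map<String, Double> features) {
        StringJoiner joined = new StringJoiner(" ");
        for (Map.Entry<String, Double> feature : features.entrySet()) {
            joined.add(feature.getKey() + ":" + feature.getValue());
        }
        return wrap("V", name, joined.toString());
    }

    // StringRepresentation (S): plain text.
    static String string(String name, String text) {
        return wrap("S", name, text);
    }

    public static void main(String[] args) {
        Map<String, Double> bow = new LinkedHashMap<>();
        bow.put("kelp", 1.0);      // feature names may be generic strings
        bow.put("learning", 2.0);
        System.out.println(denseVector("emb", 0.5, 1.0, 2.0));
        System.out.println(sparseVector("bow", bow));
        System.out.println(string("comment", "a short note"));
    }
}
```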

Structured Data Format

The structured representations have nodes whose content is a StructureElement. Its textual format is a pair type##content, where type identifies the specific implementation of the class StructureElement, while content is a text defining the parameters of the structure element. Every implementation of StructureElement has its own content formalism. For instance, we implemented some nodes to be used in NLP tasks:

  • LexicalStructureElement: its type identifier is LEX and its content has the form word::part-of-speech, as in LEX##KeLP::n;
  • PosStructureElement: its type identifier is POS and its content is a simple part-of-speech symbol, as in POS##NN;
  • SyntacticStructureElement: its type identifier is SYNT and its content is a simple syntactic symbol (e.g., a constituent, a chunk, or a syntactic dependency), as in SYNT##VP;
  • CompositionalStructureElement: its type identifier is COMP and its content has the form <head,modifier>, as in COMP##<tool,useful>;
  • UntypedStructureElement: its type identifier is NOTYPE and its content is a generic text, as in NOTYPE##KeLP. This is the default StructureElement that is instantiated when the type information is missing (and the separator ## is missing too); for instance the text KeLP is automatically instantiated as an UntypedStructureElement.
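The type##content convention, including the fallback to NOTYPE when the separator is missing, can be sketched as follows. This is plain Java for illustration: StructureElementTextSketch is a hypothetical name and the code does not reproduce KeLP's StructureElement classes, it only splits the text.

```java
// Hypothetical helper (not part of KeLP): splits the textual form of a
// StructureElement into its type identifier and its content.
class StructureElementTextSketch {
    static final String SEPARATOR = "##";
    static final String DEFAULT_TYPE = "NOTYPE";

    final String type;
    final String content;

    StructureElementTextSketch(String type, String content) {
        this.type = type;
        this.content = content;
    }

    static StructureElementTextSketch parse(String text) {
        int separatorIndex = text.indexOf(SEPARATOR);
        if (separatorIndex < 0) {
            // no type information: the whole text is an untyped element
            return new StructureElementTextSketch(DEFAULT_TYPE, text);
        }
        return new StructureElementTextSketch(
                text.substring(0, separatorIndex),
                text.substring(separatorIndex + SEPARATOR.length()));
    }

    public static void main(String[] args) {
        for (String text : new String[] { "LEX##KeLP::n", "POS##NN", "COMP##<tool,useful>", "KeLP" }) {
            StructureElementTextSketch element = parse(text);
            System.out.println(element.type + " -> " + element.content);
        }
    }
}
```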

The StructureElement formalism is employed in the formats of the following structured representations:

  • SequenceRepresentation: its type identifier is SQ and its textual description is a sequence of structured elements in round brackets.
  • TreeRepresentation: its type identifier is T and its textual description must be in the Penn Treebank notation, where each node label must respect the StructureElement formalism. Node labels may use the compact format of the UntypedStructureElement.
  • DirectedGraphRepresentation: its type identifier is G. The format depends on three string separators that can be set inside the class: NODE_EDGE_SEPARATOR (as an example here we will use “%”), NODE_SEPARATOR and EDGE_SEPARATOR (both set here as “&”).
    The format consists of a list of node representations, then the NODE_EDGE_SEPARATOR, and finally an optional list of edge representations.
    A node is composed of a numeric identifier, a whitespace character, and the node content, which is a StructureElement.
    Nodes are separated by NODE_SEPARATOR.
    An edge is composed of two node identifiers separated by a whitespace character.
    Edges are separated by EDGE_SEPARATOR.
    In the example below, a fully connected graph with three nodes labelled 9, 7 and 10 (node identifiers 1, 2 and 3, respectively) is represented:
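A minimal sketch of the graph format, written in plain Java rather than with KeLP's DirectedGraphRepresentation class, is given next. GraphTextSketch is a hypothetical helper; it assumes that no extra whitespace surrounds the separators and that node identifiers are assigned in order starting from 1. The sample lists each pair of nodes once: whether a fully connected directed graph also needs the reverse edges is not stated in the text, so that choice is an assumption too.

```java
import java.util.StringJoiner;

// Hypothetical helper (not part of KeLP): renders the textual content of a graph.
class GraphTextSketch {
    // Separator values follow the text above; in KeLP they are configurable.
    static final String NODE_EDGE_SEPARATOR = "%";
    static final String NODE_SEPARATOR = "&";
    static final String EDGE_SEPARATOR = "&";

    // nodeContents[i] is the StructureElement text of the node with identifier i + 1;
    // each edge is a pair { sourceIdentifier, targetIdentifier }.
    static String format(String[] nodeContents, int[][] edges) {
        StringJoiner nodes = new StringJoiner(NODE_SEPARATOR);
        for (int i = 0; i < nodeContents.length; i++) {
            nodes.add((i + 1) + " " + nodeContents[i]);
        }
        StringJoiner edgeList = new StringJoiner(EDGE_SEPARATOR);
        for (int[] edge : edges) {
            edgeList.add(edge[0] + " " + edge[1]);
        }
        return nodes.toString() + NODE_EDGE_SEPARATOR + edgeList.toString();
    }

    public static void main(String[] args) {
        String[] nodeContents = { "9", "7", "10" };
        int[][] edges = { { 1, 2 }, { 1, 3 }, { 2, 3 } };
        // Wrapped between the G markers with an invented representation name.
        System.out.println("|BG:graph| " + format(nodeContents, edges) + " |EG|");
    }
}
```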

 


The following line is a complete textual example containing a classification label, a regression label, a sparse vector representation and a tree representation:
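A line of this kind can be assembled as in the sketch below, which combines a classification label, a regression label, a sparse vector and a tree. The code is plain Java and illustrative only: CompleteLineSketch is a hypothetical helper, and the label values, representation names and tree content are invented for the example. The tree uses the compact untyped node format.

```java
// Hypothetical helper (not part of KeLP): assembles one complete example row.
class CompleteLineSketch {
    // Wraps a textual description between |Btype:name| and |Etype|.
    static String representation(String type, String name, String content) {
        return "|B" + type + ":" + name + "| " + content + " |E" + type + "|";
    }

    // Labels come first, separated by whitespace, followed by the representations.
    static String line(String[] labels, String... representations) {
        return String.join(" ", labels) + " " + String.join(" ", representations);
    }

    public static void main(String[] args) {
        String[] labels = { "food", "height:10" }; // classification + regression
        String vector = representation("V", "bow", "kelp:1.0 learning:2.0");
        String tree = representation("T", "tree", "(S (NP (NNP KeLP)) (VP (VBZ learns)))");
        System.out.println(line(labels, vector, tree));
    }
}
```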

A file written in the KeLP format can be loaded by calling the populate method of the Dataset class.

Alternatively, a custom DatasetReader can be defined to read other data formats. Currently, KeLP supports the CSV data format and the LibSVM/SvmLight formats.

To generate input data structures for KeLP please refer to this page.