{"id":39,"date":"2016-03-29T15:45:57","date_gmt":"2016-03-29T15:45:57","guid":{"rendered":"http:\/\/kelp.unidevel.it\/?page_id=39"},"modified":"2017-07-20T10:50:07","modified_gmt":"2017-07-20T10:50:07","slug":"javadoc-2","status":"publish","type":"page","link":"http:\/\/www.kelp-ml.org\/?page_id=39","title":{"rendered":"Input Data Format"},"content":{"rendered":"<p>The dataset input format for KeLP\u00a0takes inspiration from the SvmLight\/LibSVM formalism, extending it to handle multiple labels and multiple representations. The classes mentioned below are described <a href=\"http:\/\/www.kelp-ml.org\/?page_id=183\">here<\/a>.<\/p>\n<p>A dataset is generally stored in a text file, where each row is\u00a0an example that takes one of the following forms:<\/p>\n<pre class=\"\">label1 ... labelN |Btype1:name1|description|Etype1| |Btype2:name2|description|Etype2| ...\r\nlabel1 ... labelN |<| leftExample |,| rightExample |>|\r\n<\/pre>\n<p>The first form describes a <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/example\/SimpleExample.html\">SimpleExample<\/a>, while the second describes an <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/example\/ExamplePair.html\">ExamplePair<\/a>, where <em>leftExample<\/em> and <em>rightExample<\/em> recursively take either of these two forms.<\/p>\n<p>Each example starts with a list of labels separated by white space. A label can be a simple string, in the case of a classification label, or can have the form <em>propertyName:value<\/em> (for instance <em>height:10<\/em>), in the case of a regression value. This formalism supports multilabel classification tasks as well as multivariate regression tasks. 
Note that an isolated number will be considered a classification label.<\/p>\n<p>In the <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/example\/SimpleExample.html\">SimpleExample<\/a>\u00a0case, a list of representations follows the labels. The SimpleExample form shown above has two representations. Each representation must be enclosed between a <em>begin of representation<\/em> sequence of the form |B<em>type:name<\/em>| and an <em>end of representation<\/em> sequence of the form |E<em>type<\/em>|, where <em>type<\/em> is an identifier of the representation class (e.g., V for <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/vector\/SparseVector.html\">SparseVector<\/a>) and <em>name<\/em>\u00a0is an identifier for that specific representation (e.g.,\u00a0BoW for a bag-of-words representation). If no name is specified for a representation, it will be identified by its position within the sequence (e.g., the third representation will be automatically named 3). The name uniquely identifies a representation within an example, which is necessary to support examples having multiple representations of the same class.<\/p>\n<p>Each representation has its own formalism:<\/p>\n<ul>\n<li style=\"text-align: justify;\"><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/vector\/DenseVector.html\">DenseVector<\/a>. Its type identifier is DV and its textual description is a sequence of numbers separated by white space (or by a comma or a semicolon). For instance:\n<pre class=\"\">|BDV:lsa|10 89 0.4 -43 19 -9.3 |EDV|<\/pre>\n<\/li>\n<li style=\"text-align: justify;\"><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/vector\/SparseVector.html\">SparseVector<\/a>. 
Its type identifier is V and its textual description is a sequence of <em>featureName:featureValue<\/em> pairs separated by white space (this is the same formalism as SVMlight and LibSVM, except that <em>featureName<\/em> need not be a number: it can be a generic string). For instance:\n<pre class=\"\">|BV:bow| KeLP:0.33 is:0.33 amazing:0.33 |EV|<\/pre>\n<\/li>\n<li style=\"text-align: justify;\"><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/string\/StringRepresentation.html\">StringRepresentation<\/a>. Its type identifier is S and its textual description is plain text. For instance:\n<pre class=\"\">|BS:comment| KeLP is amazing |ES|<\/pre>\n<\/li>\n<\/ul>\n<h3>Structured Data Format<\/h3>\n<p>The structured representations have nodes whose content is a <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/StructureElement.html\">StructureElement<\/a>. Its textual format is a pair <em>type##content<\/em>, where <em>type<\/em> identifies the specific implementation of the class <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/StructureElement.html\">StructureElement<\/a>, while <em>content<\/em>\u00a0is a text defining the parameters of the structure element. Every implementation of <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/StructureElement.html\">StructureElement<\/a>\u00a0has its own <em>content<\/em> formalism. 
For instance, the following node types are implemented for use in NLP tasks:<\/p>\n<ul>\n<li><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/LexicalStructureElement.html\">LexicalStructureElement<\/a>:\u00a0its type identifier is <em>LEX<\/em>\u00a0and its content has the form <em>word::part-of-speech<\/em>, as in <em>LEX##KeLP::n<\/em>;<\/li>\n<li><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/PosStructureElement.html\">PosStructureElement<\/a><strong>:\u00a0<\/strong>its type identifier is <em>POS<\/em>\u00a0and its content is a simple part-of-speech symbol, as in <em>POS##NN<\/em>;<\/li>\n<li><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/SyntacticStructureElement.html\">SyntacticStructureElement<\/a>:\u00a0its type identifier is <em>SYNT<\/em> and its content is a simple syntactic symbol (e.g., a constituent, a chunk, or a syntactic dependency), as in <em>SYNT##VP<\/em>;<\/li>\n<li><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/CompositionalStructureElement.html\">CompositionalStructureElement<\/a>:\u00a0its type identifier is <em>COMP<\/em> and its content has the form &lt;head,modifier&gt;, as in <em>COMP##&lt;tool,useful&gt;<\/em>;<\/li>\n<li><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/UntypedStructureElement.html\">UntypedStructureElement<\/a>: its type identifier is <em>NOTYPE<\/em>\u00a0and its content is a generic text, as in <em>NOTYPE##KeLP.\u00a0<\/em>This is the default <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/StructureElement.html\">StructureElement<\/a>\u00a0that is instantiated when the <em>type<\/em> 
information is missing (and the separator <em>##<\/em> is missing too); for instance the text <em>KeLP<\/em> is automatically instantiated as an <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/UntypedStructureElement.html\">UntypedStructureElement<\/a>.<\/li>\n<\/ul>\n<p>The <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/StructureElement.html\">StructureElement<\/a>\u00a0formalism is employed in the formats of the following\u00a0structured representations:<\/p>\n<ul>\n<li><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/sequence\/SequenceRepresentation.html\">SequenceRepresentation<\/a><strong>:<\/strong>\u00a0its type identifier is SQ and its textual description is a sequence of structured elements in round brackets, as in:\n<pre class=\"\"> |BSQ:sequence| (LEX##KeLP::n) (LEX##is::v) (LEX##amazing::j) |ESQ|<\/pre>\n<\/li>\n<li style=\"text-align: justify;\"><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/tree\/TreeRepresentation.html\">TreeRepresentation<\/a>. Its type identifier is T and its textual description must be in the <a href=\"http:\/\/www.cis.upenn.edu\/~treebank\/\" target=\"_blank\">Penn Treebank<\/a> notation, where each node label must respect the <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/StructureElement.html\">StructureElement<\/a>\u00a0formalism. 
For instance (the following example uses the compact format of the <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/structure\/UntypedStructureElement.html\">UntypedStructureElement<\/a>):\n<pre class=\"\" style=\"padding-left: 60px;\">|BT:constituentTree|(ROOT (S (NP (NNP KeLP)) (VP (VBZ is) (ADJP (JJ amazing))))) |ET|<\/pre>\n<\/li>\n<li><a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/representation\/graph\/DirectedGraphRepresentation.html\">DirectedGraphRepresentation<\/a>: Its type identifier is G. The format depends on three string separators that can be set inside the class: NODE_EDGE_SEPARATOR (in this example, &#8220;%&#8221;), NODE_SEPARATOR and EDGE_SEPARATOR (both set here to &#8220;&amp;&#8221;).<br \/>\nThe format consists of a list of node representations, then the NODE_EDGE_SEPARATOR, and finally an optional list of edge representations.<br \/>\nA node is composed of a numeric identifier, a white space, and the node content, which is a StructureElement.<br \/>\nNodes are separated by NODE_SEPARATOR.<br \/>\nAn edge is composed of two node identifiers separated by a white space.<br \/>\nEdges are separated by EDGE_SEPARATOR.<br \/>\nThe example below represents a fully connected graph with three nodes, labelled 9, 7, and 10 (their node identifiers are 1, 2, and 3, respectively):<\/p>\n<pre class=\"\">|BG:graph| 1 9&2 7&3 10%1 2&1 3&2 3|EG|<\/pre>\n<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<hr \/>\n<p>The following line is a complete textual example containing a classification label, a regression label, a sparse vector representation and a tree representation:<\/p>\n<pre class=\"\">ML_tool utility:100 |BV:bow| KeLP:0.33 is:0.33 amazing:0.33 |EV| |BT:constituentTree|(ROOT (S (NP (NNP KeLP)) (VP (VBZ is) (ADJP (JJ amazing))))) |ET|\r\n<\/pre>\n<p>A file written in the KeLP format can be loaded by calling the 
<em>populate<\/em> method of the <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/dataset\/Dataset.html\">Dataset<\/a> class:<\/p>\n<pre>SimpleDataset dataset = new SimpleDataset();\r\ndataset.populate(\"datasetPath.klp\");\r\n<\/pre>\n<p>Alternatively, a different <a href=\"http:\/\/www.kelp-ml.org\/kelp-javadoc\/current-version\/it\/uniroma2\/sag\/kelp\/data\/dataset\/DatasetReader.html\">DatasetReader<\/a> can be used to read other data formats. Currently, KeLP supports the CSV format and the LibSVM\/SvmLight formats.<\/p>\n<pre>\/*\r\n * Loading data in CSV format\r\n *\/\r\nSimpleDataset dataset = new SimpleDataset();\r\nString path = \"datasetPath.csv\";\r\nString representationName = \"featureVector\";\r\nboolean skipFirstLine = true; \/\/in case of header\r\nLabelPosition position = LabelPosition.LAST_COLUMN;\r\nCsvDatasetReader csvReader = new CsvDatasetReader(path, representationName, skipFirstLine, position);\r\ndataset.populate(csvReader);\r\n<\/pre>\n<pre>\/*\r\n * Loading data in LibSVM\/SVMLight format\r\n *\/\r\nSimpleDataset dataset = new SimpleDataset();\r\nString path = \"datasetPath.libSvm\";\r\nString representationName = \"featureVector\";\r\nLibsvmDatasetReader libSvmReader = new LibsvmDatasetReader(path, representationName);\r\ndataset.populate(libSvmReader);\r\n<\/pre>\n<p>To generate input data structures for KeLP, please refer to this <a href=\"http:\/\/www.kelp-ml.org\/?page_id=1025\">page<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The dataset input format for KeLP\u00a0takes inspiration from the SvmLight\/LibSVM formalism, extending it in order to deal with multiple labels and multiple representations. Notice that the following classes are described here. 
A dataset is generally represented in a text file, where each row is\u00a0an example, that can have one of the following forms: label1 &#8230; <a href=\"http:\/\/www.kelp-ml.org\/?page_id=39\" rel=\"nofollow\"><span class=\"sr-only\">Read more about Input Data Format<\/span>[&hellip;]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":112,"menu_order":4,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"http:\/\/www.kelp-ml.org\/index.php?rest_route=\/wp\/v2\/pages\/39"}],"collection":[{"href":"http:\/\/www.kelp-ml.org\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/www.kelp-ml.org\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/www.kelp-ml.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.kelp-ml.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=39"}],"version-history":[{"count":26,"href":"http:\/\/www.kelp-ml.org\/index.php?rest_route=\/wp\/v2\/pages\/39\/revisions"}],"predecessor-version":[{"id":1039,"href":"http:\/\/www.kelp-ml.org\/index.php?rest_route=\/wp\/v2\/pages\/39\/revisions\/1039"}],"up":[{"embeddable":true,"href":"http:\/\/www.kelp-ml.org\/index.php?rest_route=\/wp\/v2\/pages\/112"}],"wp:attachment":[{"href":"http:\/\/www.kelp-ml.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=39"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}