Class: OpenTox::Dataset

Inherits:
Object
Includes:
OpenTox
Defined in:
lib/dataset.rb

Overview

Ruby wrapper for OpenTox Dataset Webservices (opentox.org/dev/apis/api-1.2/dataset).

Direct Known Subclasses

LazarPrediction

Instance Attribute Summary

Attributes included from OpenTox

#uri

Class Method Summary

Instance Method Summary

Methods included from OpenTox

sign_in, text_to_html

Constructor Details

#initialize(uri = nil, subjectid = nil) ⇒ OpenTox::Dataset

Create dataset with optional URI. Does not load data into the dataset - you will need to execute one of the load_* methods to pull data from a service or to insert it from other representations.

Examples:

Create an empty dataset

dataset = OpenTox::Dataset.new

Create an empty dataset with URI

dataset = OpenTox::Dataset.new("http://webservices.in-silico.ch/dataset/1")

Parameters:

  • uri (optional, String) (defaults to: nil)

    Dataset URI



# File 'lib/dataset.rb', line 17

def initialize(uri=nil,subjectid=nil)
  super uri
  @features = {}
  @compounds = []
  @data_entries = {}
end

Instance Attribute Details

#compounds ⇒ Object (readonly)

Returns the value of attribute compounds.



# File 'lib/dataset.rb', line 8

def compounds
  @compounds
end

#data_entries ⇒ Object (readonly)

Returns the value of attribute data_entries.



# File 'lib/dataset.rb', line 8

def data_entries
  @data_entries
end

#features ⇒ Object (readonly)

Returns the value of attribute features.



# File 'lib/dataset.rb', line 8

def features
  @features
end

#metadata ⇒ Object (readonly)

Returns the value of attribute metadata.



# File 'lib/dataset.rb', line 8

def metadata
  @metadata
end

Class Method Details

.all(uri = CONFIG[:services]["opentox-dataset"], subjectid = nil) ⇒ Array

Get all datasets from a service

Parameters:

  • uri (optional, String) (defaults to: CONFIG[:services]["opentox-dataset"])

    URI of the dataset service, defaults to service specified in configuration

Returns:

  • (Array)

    Array of dataset objects without data (use one of the load_* methods to pull data from the server)



# File 'lib/dataset.rb', line 82

def self.all(uri=CONFIG[:services]["opentox-dataset"], subjectid=nil)
  RestClientWrapper.get(uri,{:accept => "text/uri-list",:subjectid => subjectid}).to_s.each_line.collect{|u| Dataset.new(u.chomp, subjectid)}
end

.create(uri = CONFIG[:services]["opentox-dataset"], subjectid = nil) ⇒ OpenTox::Dataset

Create an empty dataset and save it at the dataset service (assigns URI to dataset)

Examples:

Create new dataset and save it to obtain a URI

dataset = OpenTox::Dataset.create

Parameters:

  • uri (optional, String) (defaults to: CONFIG[:services]["opentox-dataset"])

    Dataset URI

Returns:

  • (OpenTox::Dataset)



# File 'lib/dataset.rb', line 29

def self.create(uri=CONFIG[:services]["opentox-dataset"], subjectid=nil)
  dataset = Dataset.new(nil,subjectid)
  dataset.save(subjectid)
  dataset
end

.create_from_csv_file(file, subjectid = nil) ⇒ OpenTox::Dataset

Create dataset from CSV file (format specification: toxcreate.org/help)

  • loads data_entries, compounds, features

  • sets metadata (warnings) for parser errors

  • you will have to set remaining metadata manually

Parameters:

  • file (String)

    CSV file path

Returns:

  • (OpenTox::Dataset)



# File 'lib/dataset.rb', line 41

def self.create_from_csv_file(file, subjectid=nil) 
  dataset = Dataset.create(CONFIG[:services]["opentox-dataset"], subjectid)
  parser = Parser::Spreadsheets.new
  parser.dataset = dataset
  parser.load_csv(File.open(file).read)
  dataset.save(subjectid)
  dataset
end

.exist?(uri, subjectid = nil) ⇒ Boolean

Replaces find as an existence check; it is faster and does NOT raise an unauthorized exception.

Parameters:

  • uri (String)

    Dataset URI

Returns:

  • (Boolean)

    true if the dataset exists and the user has GET rights, false otherwise



# File 'lib/dataset.rb', line 69

def self.exist?(uri, subjectid=nil)
  return false unless uri
  dataset = Dataset.new(uri, subjectid)
  begin
    dataset.load_metadata( subjectid ).size > 0
  rescue
    false
  end
end

.find(uri, subjectid = nil) ⇒ OpenTox::Dataset

Find a dataset and load all data. This can be time-consuming; use Dataset.new together with one of the load_* methods for fine-grained control over data loading.

Parameters:

  • uri (String)

    Dataset URI

Returns:

  • (OpenTox::Dataset)



# File 'lib/dataset.rb', line 59

def self.find(uri, subjectid=nil)
  return nil unless uri
  dataset = Dataset.new(uri, subjectid)
  dataset.load_all(subjectid)
  dataset
end

.from_json(json, subjectid = nil) ⇒ Object



# File 'lib/dataset.rb', line 50

def self.from_json(json, subjectid=nil) 
  dataset = Dataset.new(nil,subjectid)
  dataset.copy_hash Yajl::Parser.parse(json)
  dataset
end

.merge(dataset1, dataset2, metadata, subjectid = nil, features1 = nil, features2 = nil, compounds1 = nil, compounds2 = nil) ⇒ Object

Merges two datasets into a new dataset (by default all compounds and features are used). Precondition: both datasets are fully loaded. Example: if you want no features from dataset2, pass an empty array as features2.

Parameters:

  • dataset1 (OpenTox::Dataset)

    to merge

  • dataset2 (OpenTox::Dataset)

    to merge

  • metadata (Hash)
  • subjectid (optional, String) (defaults to: nil)
  • features1 (optional, Array)

    if specified, only these features of dataset1 are used

  • features2 (optional, Array)

    if specified, only these features of dataset2 are used

  • compounds1 (optional, Array)

    if specified, only these compounds of dataset1 are used

  • compounds2 (optional, Array)

    if specified, only these compounds of dataset2 are used



# File 'lib/dataset.rb', line 414

def self.merge( dataset1, dataset2, metadata, subjectid=nil, features1=nil, features2=nil, compounds1=nil, compounds2=nil )
  features1 = dataset1.features.keys unless features1
  features2 = dataset2.features.keys unless features2
  compounds1 = dataset1.compounds unless compounds1
  compounds2 = dataset2.compounds unless compounds2
  data_combined = OpenTox::Dataset.create(CONFIG[:services]["opentox-dataset"],subjectid)
  LOGGER.debug("merging datasets #{dataset1.uri} and #{dataset2.uri} to #{data_combined.uri}")
  [[dataset1, features1, compounds1], [dataset2, features2, compounds2]].each do |dataset,features,compounds|
    compounds.each{|c| data_combined.add_compound(c)}
    features.each do |f|
      m = dataset.features[f]
      m[OT.hasSource] = dataset.uri unless m[OT.hasSource]
      data_combined.add_feature(f,m)
      compounds.each do |c|
        dataset.data_entries[c][f].each do |v|
          data_combined.add(c,f,v)
        end if dataset.data_entries[c] and dataset.data_entries[c][f]
      end
    end
  end
  metadata = {} unless metadata
  metadata[OT.hasSource] = "Merge from #{dataset1.uri} and #{dataset2.uri}" unless metadata[OT.hasSource]
  data_combined.add_metadata(metadata)
  data_combined.save(subjectid)
  data_combined
end

Instance Method Details

#accept_values(feature) ⇒ Array

Returns the accept_values of a feature, i.e. the classification domain (all possible feature values).

Parameters:

  • feature (String)

    the URI of the feature

Returns:

  • (Array)

    Array of strings; nil if no accept values are set (e.g. when the feature is numeric)



# File 'lib/dataset.rb', line 194

def accept_values(feature)
  accept_values = features[feature][OT.acceptValue]
  accept_values.sort! if accept_values
  accept_values
end

#add(compound, feature, value) ⇒ Object

Insert a statement (compound_uri,feature_uri,value)

Examples:

Insert a statement (compound_uri,feature_uri,value)

dataset.add "http://webservices.in-silico.ch/compound/InChI=1S/C6Cl6/c7-1-2(8)4(10)6(12)5(11)3(1)9", "http://webservices.in-silico.ch/dataset/1/feature/hamster_carcinogenicity", 1

Parameters:

  • compound (String)

    Compound URI

  • feature (String)

    Feature URI

  • value (Boolean, Float)

    Feature value



# File 'lib/dataset.rb', line 318

def add (compound,feature,value)
  @compounds << compound unless @compounds.include? compound
  @features[feature] = {}  unless @features[feature]
  @data_entries[compound] = {} unless @data_entries[compound]
  @data_entries[compound][feature] = [] unless @data_entries[compound][feature]
  @data_entries[compound][feature] << value if value!=nil
end
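
The nested structure #add maintains can be sketched in plain Ruby, independent of any dataset service (the URIs below are hypothetical):

```ruby
# Plain-Ruby sketch of the bookkeeping in #add: a compound list, a
# feature hash, and data_entries nested as compound -> feature -> [values].
compounds = []
features = {}
data_entries = {}

add = lambda do |compound, feature, value|
  compounds << compound unless compounds.include?(compound)
  features[feature] ||= {}
  data_entries[compound] ||= {}
  data_entries[compound][feature] ||= []
  data_entries[compound][feature] << value unless value.nil?
end

c = "http://example.org/compound/1" # hypothetical compound URI
f = "http://example.org/feature/hamster_carcinogenicity" # hypothetical feature URI
add.call(c, f, 1)
add.call(c, f, 0)
# data_entries[c][f] is now [1, 0]: values accumulate per statement,
# which is why feature values are arrays rather than scalars.
```

Note that repeated (compound, feature) statements append to the value array, so duplicate measurements are preserved rather than overwritten.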

#add_compound(compound) ⇒ Object

Add a new compound

Parameters:

  • compound (String)

    Compound URI



# File 'lib/dataset.rb', line 363

def add_compound (compound)
  @compounds << compound unless @compounds.include? compound
end

#add_feature(feature, metadata = {}) ⇒ Object

Add a feature

Parameters:

  • feature (String)

    Feature URI

  • metadata (Hash) (defaults to: {})

    Hash with feature metadata



# File 'lib/dataset.rb', line 337

def add_feature(feature,metadata={})
  @features[feature] = metadata
end

#add_feature_metadata(feature, metadata) ⇒ Object

Add/modify metadata for a feature

Parameters:

  • feature (String)

    Feature URI

  • metadata (Hash)

    Hash with feature metadata



# File 'lib/dataset.rb', line 357

def add_feature_metadata(feature,metadata)
  metadata.each { |k,v| @features[feature][k] = v }
end

#add_metadata(metadata) ⇒ Object

Add/modify metadata, existing entries will be overwritten

Examples:

dataset.add_metadata({DC.title => "any_title", DC.creator => "my_email"})

Parameters:

  • metadata (Hash)

    Hash mapping predicate_uris to values



# File 'lib/dataset.rb', line 330

def add_metadata(metadata)
  metadata.each { |k,v| @metadata[k] = v }
end

#complete_data_entries(compound_sizes) ⇒ Object

Complete feature values by adding zeroes

Parameters:

  • compound_sizes (Hash)

    key: compound URI, value: number of duplicates



# File 'lib/dataset.rb', line 343

def complete_data_entries(compound_sizes)
  all_features = @features.keys
  @data_entries.each { |c, e|
    (Set.new(all_features)).subtract(Set.new e.keys).to_a.each { |f|
      compound_sizes[c].times { 
        self.add(c,f,0) 
      }
    }
  }
end
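
The gap-filling step can be traced in isolation: missing features are found by set subtraction, then zero-padded via add(c, f, 0). A minimal sketch with toy data:

```ruby
require 'set'

# Toy data: compound "c1" has values for f1 only, out of three features.
all_features = ["f1", "f2", "f3"]
entries      = { "c1" => { "f1" => [1] } }

# Features with no entry for this compound:
missing = Set.new(all_features).subtract(Set.new(entries["c1"].keys)).to_a

# Zero-pad each missing feature (equivalent of one add(c, f, 0) call
# per duplicate; here a single duplicate is assumed):
missing.each { |f| entries["c1"][f] = [0] }
```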

#copy_hash(hash) ⇒ Object

Copy a hash (e.g. from JSON) into a dataset (rewrites URI)



# File 'lib/dataset.rb', line 471

def copy_hash(hash)
  @metadata = hash["metadata"]
  @data_entries = hash["data_entries"]
  @compounds = hash["compounds"]
  @features = hash["features"]
  if @uri
    self.uri = @uri 
  else
    @uri = hash["metadata"][XSD.anyURI]
  end
end

#delete(subjectid = nil) ⇒ Object

Delete dataset at the dataset service



# File 'lib/dataset.rb', line 466

def delete(subjectid=nil)
  RestClientWrapper.delete(@uri, :subjectid => subjectid)
end

#feature_name(feature) ⇒ String

Get name (DC.title) of a feature

Parameters:

  • feature (String)

    Feature URI

Returns:

  • (String)



# File 'lib/dataset.rb', line 304

def feature_name(feature)
  @features[feature][DC.title]
end

#feature_type(subjectid = nil) ⇒ String

Detect feature type (reduced to one type across all features). Classification takes precedence over regression.

DEPRECATED -- makes no sense for datasets with more than one feature, and features can have multiple types.

Replacement: see feature_types()

Returns:

  • (String)

    "classification", "regression", "mixed" or "unknown"



# File 'lib/dataset.rb', line 207

def feature_type(subjectid=nil)
  load_features(subjectid)
  feature_types = @features.collect{|f,metadata| metadata[RDF.type]}.flatten.uniq
  if feature_types.include?(OT.NominalFeature)
    "classification"
  elsif feature_types.include?(OT.NumericFeature)
    "regression"
  else
    "unknown"
  end
end

#feature_types(subjectid = nil) ⇒ Hash

Detect feature types. A feature can have multiple types. Returns types hashed by feature URI, with missing features omitted. Example (YAML):

http://toxcreate3.in-silico.ch:8082/dataset/152/feature/nHal: 
- http://www.opentox.org/api/1.1#NumericFeature
- http://www.opentox.org/api/1.1#NominalFeature
...

Returns:

  • (Hash)

    Keys: feature URIs, Values: Array of types



# File 'lib/dataset.rb', line 229

def feature_types(subjectid=nil)
  load_features(subjectid)
  @features.inject({}){ |h,(f,metadata)|
    h[f] = metadata[RDF.type] unless metadata[RDF.type][0].include? "MissingFeature"
    h
  }
end
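
The inject step that drops MissingFeature entries can be run standalone; in this sketch plain string keys and toy URIs stand in for RDF.type and the real feature metadata:

```ruby
# Toy feature hash; "type" stands in for the RDF.type key.
features = {
  "http://example.org/feature/nHal" =>
    { "type" => ["http://www.opentox.org/api/1.1#NumericFeature"] },
  "http://example.org/feature/gap"  =>
    { "type" => ["http://www.opentox.org/api/1.1#MissingFeature"] }
}

# Build the feature-URI => types hash, omitting missing features,
# exactly as feature_types does.
types = features.inject({}) do |h, (f, metadata)|
  h[f] = metadata["type"] unless metadata["type"][0].include? "MissingFeature"
  h
end
# types now contains only the nHal feature
```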

#load_all(subjectid = nil) ⇒ Object

Load all data (metadata, data_entries, compounds and features) from URI



# File 'lib/dataset.rb', line 157

def load_all(subjectid=nil)
  if (CONFIG[:json_hosts].include?(URI.parse(@uri).host))
    copy_hash Yajl::Parser.parse(RestClientWrapper.get(@uri, {:accept => "application/json", :subjectid => subjectid}))
  else
    parser = Parser::Owl::Dataset.new(@uri, subjectid)
    copy parser.load_uri(subjectid)
  end
end

#load_compounds(subjectid = nil) ⇒ Array

Load and return only compound URIs from the dataset service

Returns:

  • (Array)

    Compound URIs in the dataset



# File 'lib/dataset.rb', line 168

def load_compounds(subjectid=nil)
  # fix for datasets like http://apps.ideaconsult.net:8080/ambit2/dataset/272?max=50
  u = URI::parse(uri)
  u.path = File.join(u.path,"compounds")
  u = u.to_s
  RestClientWrapper.get(u,{:accept=> "text/uri-list", :subjectid => subjectid}).to_s.each_line do |compound_uri|
    @compounds << compound_uri.chomp
  end
  @compounds.uniq!
end

#load_csv(csv, subjectid = nil) ⇒ OpenTox::Dataset

Load CSV string (format specification: toxcreate.org/help)

  • loads data_entries, compounds, features

  • sets metadata (warnings) for parser errors

  • you will have to set remaining metadata manually

Parameters:

  • csv (String)

    CSV representation of the dataset

Returns:

  • (OpenTox::Dataset)



# File 'lib/dataset.rb', line 128

def load_csv(csv, subjectid=nil) 
  save(subjectid) unless @uri # get a uri for creating features
  parser = Parser::Spreadsheets.new
  parser.dataset = self
  parser.load_csv(csv)
end

#load_features(subjectid = nil) ⇒ Hash

Load and return only features from the dataset service

Returns:

  • (Hash)

    Features of the dataset



# File 'lib/dataset.rb', line 181

def load_features(subjectid=nil)
  if (CONFIG[:json_hosts].include?(URI.parse(@uri).host))
    @features = Yajl::Parser.parse(RestClientWrapper.get(File.join(@uri,"features"), {:accept => "application/json", :subjectid => subjectid}))
  else
    parser = Parser::Owl::Dataset.new(@uri, subjectid)
    @features = parser.load_features(subjectid)
  end
  @features
end

#load_json(json) ⇒ Object



# File 'lib/dataset.rb', line 93

def load_json(json)
  copy_hash Yajl::Parser.parse(json)
end

#load_metadata(subjectid = nil) ⇒ Hash

Load and return only metadata of a Dataset object

Returns:

  • (Hash)

    Metadata of the dataset



# File 'lib/dataset.rb', line 150

def load_metadata(subjectid=nil)
  add_metadata Parser::Owl::Dataset.new(@uri, subjectid).load_metadata(subjectid)
  self.uri = @uri if @uri # keep uri
  @metadata
end

#load_rdfxml(rdfxml, subjectid = nil) ⇒ Object



# File 'lib/dataset.rb', line 97

def load_rdfxml(rdfxml, subjectid=nil)
  raise "rdfxml data is empty" if rdfxml.to_s.size==0
  file = Tempfile.new("ot-rdfxml")
  file.puts rdfxml
  file.close
  load_rdfxml_file file, subjectid
  file.delete
end

#load_rdfxml_file(file, subjectid = nil) ⇒ OpenTox::Dataset

Load RDF/XML representation from a file

Parameters:

  • file (String)

    File with RDF/XML representation of the dataset

Returns:

  • (OpenTox::Dataset)



# File 'lib/dataset.rb', line 109

def load_rdfxml_file(file, subjectid=nil)
  parser = Parser::Owl::Dataset.new @uri, subjectid
  parser.uri = file.path
  copy parser.load_uri(subjectid)
end

#load_sdf(sdf, subjectid = nil) ⇒ Object



# File 'lib/dataset.rb', line 115

def load_sdf(sdf,subjectid=nil)
  save(subjectid) unless @uri # get a uri for creating features
  parser = Parser::Sdf.new
  parser.dataset = self
  parser.load_sdf(sdf)
end

#load_spreadsheet(book, subjectid = nil) ⇒ OpenTox::Dataset

Load a spreadsheet workbook (created with the roo gem, roo.rubyforge.org; Excel format specification: toxcreate.org/help)

  • loads data_entries, compounds, features

  • sets metadata (warnings) for parser errors

  • you will have to set remaining metadata manually

Parameters:

  • book (Excel)

    Excel workbook object (created with roo gem)

Returns:

  • (OpenTox::Dataset)



# File 'lib/dataset.rb', line 141

def load_spreadsheet(book, subjectid=nil)
  save(subjectid) unless @uri # get a uri for creating features
  parser = Parser::Spreadsheets.new
  parser.dataset = self
  parser.load_spreadsheet(book)
end

#load_yaml(yaml) ⇒ OpenTox::Dataset

Load YAML representation into the dataset

Parameters:

  • yaml (String)

    YAML representation of the dataset

Returns:

  • (OpenTox::Dataset)



# File 'lib/dataset.rb', line 89

def load_yaml(yaml)
  copy YAML.load(yaml)
end

#save(subjectid = nil) ⇒ String

Save dataset at the dataset service

  • creates a new dataset if uri is not set

  • overwrites dataset if uri exists

Returns:

  • (String)



# File 'lib/dataset.rb', line 445

def save(subjectid=nil)
  # TODO: rewrite feature URI's ??
  @compounds.uniq!
  if @uri
    if (CONFIG[:json_hosts].include?(URI.parse(@uri).host))
      #LOGGER.debug self.to_json
      RestClientWrapper.post(@uri,self.to_json,{:content_type =>  "application/json", :subjectid => subjectid})
    else
      File.open("ot-post-file.rdf","w+") { |f| f.write(self.to_rdfxml); @path = f.path }
      task_uri = RestClient.post(@uri, {:file => File.new(@path)},{:accept => "text/uri-list" , :subjectid => subjectid}).to_s.chomp
      Task.find(task_uri).wait_for_completion
      self.uri = RestClientWrapper.get(task_uri,{:accept => 'text/uri-list', :subjectid => subjectid})
    end
  else
    # create dataset if uri is empty
    self.uri = RestClientWrapper.post(CONFIG[:services]["opentox-dataset"],{:subjectid => subjectid}).to_s.chomp
  end
  @uri
end

#split(compounds, features, metadata, subjectid = nil) ⇒ OpenTox::Dataset

Creates a new dataset by splitting the current dataset, i.e. using only a subset of compounds and features

Parameters:

  • compounds (Array)

    List of compound URIs

  • features (Array)

    List of feature URIs

  • metadata (Hash)

    Hash containing the metadata for the new dataset

  • subjectid (String) (defaults to: nil)

Returns:

  • (OpenTox::Dataset)



# File 'lib/dataset.rb', line 373

def split( compounds, features, metadata, subjectid=nil)
  LOGGER.debug "split dataset using "+compounds.size.to_s+"/"+@compounds.size.to_s+" compounds"
  raise "no new compounds selected" unless compounds and compounds.size>0
  dataset = OpenTox::Dataset.create(CONFIG[:services]["opentox-dataset"],subjectid)
  if features.size==0
    compounds.each{ |c| dataset.add_compound(c) }
  else
    compounds.each do |c|
      features.each do |f|
        if @data_entries[c]==nil or @data_entries[c][f]==nil
          dataset.add(c,f,nil)
        else
          @data_entries[c][f].each do |v|
            dataset.add(c,f,v)
          end
        end
      end
    end
  end
  # set feature metadata in new dataset accordingly (including accept values)      
  features.each do |f|
    self.features[f].each do |k,v|
      dataset.features[f][k] = v
    end
  end
  dataset.add_metadata(metadata)
  dataset.save(subjectid)
  dataset
end

#title ⇒ Object



# File 'lib/dataset.rb', line 308

def title
  @metadata[DC.title]
end

#to_csv ⇒ String

Get CSV string representation (data_entries only, metadata will be discarded)

Returns:

  • (String)

    CSV representation



# File 'lib/dataset.rb', line 257

def to_csv
  Serializer::Spreadsheets.new(self).to_csv
end

#to_json ⇒ Object



# File 'lib/dataset.rb', line 239

def to_json
  Yajl::Encoder.encode({:uri => @uri, :metadata => @metadata, :data_entries => @data_entries, :compounds => @compounds, :features => @features})
end
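
The hash shape produced by #to_json is the same one #copy_hash reads back, so the two round-trip. A sketch of that shape, using the stdlib JSON module in place of Yajl and hypothetical URIs:

```ruby
require 'json' # stdlib stand-in for Yajl in this sketch

# Hash shape serialized by #to_json (keys become strings after parsing):
payload = {
  :uri          => "http://example.org/dataset/1",
  :metadata     => {},
  :data_entries => { "c1" => { "f1" => [1] } },
  :compounds    => ["c1"],
  :features     => { "f1" => {} }
}

hash = JSON.parse(JSON.generate(payload))
# hash["metadata"], hash["data_entries"], hash["compounds"] and
# hash["features"] are exactly the keys #copy_hash expects.
```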

#to_ntriples ⇒ String

Get OWL-DL in N-Triples format

Returns:

  • (String)

    N-Triples representation



# File 'lib/dataset.rb', line 263

def to_ntriples
  s = Serializer::Owl.new
  s.add_dataset(self)
  s.to_ntriples
end

#to_rdfxml ⇒ String

Get OWL-DL in RDF/XML format

Returns:

  • (String)

    RDF/XML representation



# File 'lib/dataset.rb', line 271

def to_rdfxml
  s = Serializer::Owl.new
  s.add_dataset(self)
  s.to_rdfxml
end

#to_sdf ⇒ String

Get SDF representation of compounds

Returns:

  • (String)

    SDF representation



# File 'lib/dataset.rb', line 279

def to_sdf
  sum=""
  @compounds.each{ |c|
    sum << OpenTox::Compound.new(c).to_inchi
    sum << OpenTox::Compound.new(c).to_sdf.sub(/\n\$\$\$\$/,'')
    @data_entries[c].each{ |f,v|
      sum << ">  <\"#{f}\">\n"
      sum << v.join(", ")
      sum << "\n\n"
    }
    sum << "$$$$\n"
  }
  sum
end

#to_spreadsheet ⇒ Spreadsheet::Workbook

Get Spreadsheet representation

Returns:

  • (Spreadsheet::Workbook)

    Workbook which can be written with the spreadsheet gem (data_entries only, metadata will be discarded)



# File 'lib/dataset.rb', line 245

def to_spreadsheet
  Serializer::Spreadsheets.new(self).to_spreadsheet
end

#to_urilist ⇒ Object



# File 'lib/dataset.rb', line 294

def to_urilist
  # join compound URIs into a newline-separated uri-list
  @compounds.collect{ |c| OpenTox::Compound.new(c).uri }.join("\n")
end

#to_xls ⇒ Spreadsheet::Workbook

Get Excel representation (alias for to_spreadsheet)

Returns:

  • (Spreadsheet::Workbook)

    Workbook which can be written with the spreadsheet gem (data_entries only, metadata will be discarded)



# File 'lib/dataset.rb', line 251

def to_xls
  to_spreadsheet
end

#value_map(prediction_feature_uri) ⇒ Object



# File 'lib/dataset.rb', line 483

def value_map(prediction_feature_uri)
  training_classes = accept_values(prediction_feature_uri).sort
  value_map=Hash.new 
  training_classes.each_with_index { |c,i| value_map[i+1] = c }
  value_map
end
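
The 1-based mapping can be traced with assumed accept values (the class labels below are hypothetical; real values come from accept_values on a prediction feature):

```ruby
# Assumed accept values for a classification feature:
training_classes = ["inactive", "active"].sort # sorted => ["active", "inactive"]

# Map 1-based indices to the sorted class labels, as value_map does.
value_map = Hash.new
training_classes.each_with_index { |c, i| value_map[i + 1] = c }
# value_map => {1 => "active", 2 => "inactive"}
```

Because the classes are sorted first, the index assignment is deterministic for a given set of accept values.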