Class: MiGA::Dataset

Inherits:
MiGA
  • Object
Includes:
DatasetResult
Defined in:
lib/miga/dataset.rb

Overview

Dataset representation in MiGA.

Constant Summary

@@RESULT_DIRS =
{
  # Preprocessing
  raw_reads: "01.raw_reads", trimmed_reads: "02.trimmed_reads",
  read_quality: "03.read_quality", trimmed_fasta: "04.trimmed_fasta",
  assembly: "05.assembly", cds: "06.cds",
  # Annotation
  essential_genes: "07.annotation/01.function/01.essential",
  ssu: "07.annotation/01.function/02.ssu",
  mytaxa: "07.annotation/02.taxonomy/01.mytaxa",
  mytaxa_scan: "07.annotation/03.qa/02.mytaxa_scan",
  # Distances (for single-species datasets)
  distances: "09.distances", taxonomy: "09.distances/05.taxonomy",
  # General statistics
  stats: "90.stats"
}
@@KNOWN_TYPES =
{
  genome: {description: "The genome from an isolate.", multi: false},
  scgenome: {description: "A Single-cell Genome Amplification (SGA).",
    multi: false},
  popgenome: {description: "A population genome (including " +
    "metagenomic bins).", :multi=>false},
  metagenome: {description: "A metagenome (excluding viromes).",
    multi: true},
  virome: {description: "A viral metagenome.", multi: true}
}
@@PREPROCESSING_TASKS =
[:raw_reads, :trimmed_reads, :read_quality,
:trimmed_fasta, :assembly, :cds, :essential_genes, :ssu, :mytaxa,
:mytaxa_scan, :distances, :taxonomy, :stats]
@@EXCLUDE_NOREF_TASKS =

Tasks to be excluded from query datasets.

[:mytaxa_scan, :taxonomy]
@@_EXCLUDE_NOREF_TASKS_H =
@@ONLY_NONMULTI_TASKS =

Tasks to be executed only in datasets that are not multi-organism. These tasks are ignored for multi-organism datasets or for unknown types.

[:mytaxa_scan, :distances, :taxonomy]
@@_ONLY_NONMULTI_TASKS_H =
@@ONLY_MULTI_TASKS =

Tasks to be executed only in datasets that are multi-organism. These tasks are ignored for single-organism datasets or for unknown types.

[:mytaxa]
@@_ONLY_MULTI_TASKS_H =

Constants included from MiGA

CITATION, VERSION, VERSION_DATE, VERSION_NAME

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from DatasetResult

#cleanup_distances!

Methods inherited from MiGA

CITATION, DEBUG, DEBUG_OFF, DEBUG_ON, DEBUG_TRACE_OFF, DEBUG_TRACE_ON, FULL_VERSION, LONG_VERSION, VERSION, VERSION_DATE, clean_fasta_file, initialized?, #result_files_exist?, root_path, script_path, tabulate

Constructor Details

#initialize(project, name, is_ref = true, metadata = {}) ⇒ Dataset

Create a MiGA::Dataset object in a project MiGA::Project with a uniquely identifying name. is_ref indicates if the dataset is to be treated as reference (true, default) or query (false). Pass any additional metadata as a Hash.



# File 'lib/miga/dataset.rb', line 105

def initialize(project, name, is_ref=true, metadata={})
  raise "Invalid name '#{name}', please use only alphanumerics and " +
    "underscores." unless name.miga_name?
  @project = project
  @name = name
  metadata[:ref] = is_ref
  @metadata = MiGA::Metadata.new(
    File.expand_path("metadata/#{name}.json", project.path), metadata )
end
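
As a minimal sketch of the two steps the constructor performs (name validation and metadata path resolution), the following stand-alone code may help. It assumes that `miga_name?` accepts only ASCII letters, digits, and underscores, as the error message suggests; the exact pattern is not shown in this page. The helpers `valid_dataset_name?` and `metadata_path` are hypothetical names and are not part of MiGA.

```ruby
# Hypothetical stand-in for String#miga_name?. The pattern is an assumption
# based on the error message raised by the constructor.
def valid_dataset_name?(name)
  !!(name =~ /\A[A-Za-z0-9_]+\z/)
end

# Mirrors the File.expand_path call in the constructor: the metadata file
# lives in the metadata/ folder of the project and is named after the dataset.
def metadata_path(project_path, name)
  File.expand_path("metadata/#{name}.json", project_path)
end

puts valid_dataset_name?("E_coli_K12")   # true
puts valid_dataset_name?("E. coli")      # false
puts metadata_path("/tmp/my_project", "E_coli_K12")
```

The project path and dataset names above are made up for illustration.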

Instance Attribute Details

#metadata ⇒ Object (readonly)

MiGA::Metadata with information about the dataset.



# File 'lib/miga/dataset.rb', line 98

def metadata
  @metadata
end

#name ⇒ Object (readonly)

Datasets are uniquely identified by name in a project.



# File 'lib/miga/dataset.rb', line 94

def name
  @name
end

#project ⇒ Object (readonly)

MiGA::Project that contains the dataset.



# File 'lib/miga/dataset.rb', line 90

def project
  @project
end

Class Method Details

.exist?(project, name) ⇒ Boolean

Does the project already have a dataset with that name?

Returns:

  • (Boolean)


# File 'lib/miga/dataset.rb', line 76

def self.exist?(project, name)
  File.exist? "#{project.path}/metadata/#{name}.json"
end

.INFO_FIELDS ⇒ Object

Standard fields of metadata for datasets.



# File 'lib/miga/dataset.rb', line 82

def self.INFO_FIELDS
  %w(name created updated type ref user description comments)
end

.KNOWN_TYPES ⇒ Object

Supported dataset types.



# File 'lib/miga/dataset.rb', line 38

def self.KNOWN_TYPES ; @@KNOWN_TYPES end

.PREPROCESSING_TASKS ⇒ Object

Returns an Array of tasks to be executed before project-wide tasks.



# File 'lib/miga/dataset.rb', line 52

def self.PREPROCESSING_TASKS ; @@PREPROCESSING_TASKS ; end

.RESULT_DIRS ⇒ Object

Directories containing the results from dataset-specific tasks.



# File 'lib/miga/dataset.rb', line 19

def self.RESULT_DIRS ; @@RESULT_DIRS end

Instance Method Details

#add_result(result_type, save = true, opts = {}) ⇒ Object

Look for the result with symbol key result_type and register it in the dataset. If save is false, it doesn't register the result and only returns a result that was already registered. The opts Hash controls result creation (if necessary). Supported values include:

  • is_clean: A Boolean indicating if the input files are clean.

Returns MiGA::Result or nil.



# File 'lib/miga/dataset.rb', line 194

def add_result(result_type, save=true, opts={})
  dir = @@RESULT_DIRS[result_type]
  return nil if dir.nil?
  base = File.expand_path("data/#{dir}/#{name}", project.path)
  r_pre = MiGA::Result.load("#{base}.json")
  return r_pre if (r_pre.nil? and not save) or not r_pre.nil?
  r = File.exist?("#{base}.done") ?
      self.send("add_result_#{result_type}", base, opts) : nil
  r.save unless r.nil?
  r
end
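
The decision flow in the listing above can be sketched with plain files. This is an illustration only: `add_result_outcome` is a hypothetical helper that inspects the `.json` and `.done` marker files and never builds a MiGA::Result, and it assumes that `MiGA::Result.load` returns nil when the `.json` file is absent.

```ruby
require "tmpdir"

# Returns a symbol describing what add_result would do for a given base path.
# Hypothetical helper; the outcome names are not part of MiGA.
def add_result_outcome(base, save = true)
  return :already_registered if File.exist?("#{base}.json")
  return :not_registered unless save
  File.exist?("#{base}.done") ? :register_now : :task_not_done
end

Dir.mktmpdir do |dir|
  base = File.join(dir, "my_dataset")
  puts add_result_outcome(base)          # task_not_done
  File.write("#{base}.done", "")
  puts add_result_outcome(base)          # register_now
  puts add_result_outcome(base, false)   # not_registered
  File.write("#{base}.json", "{}")
  puts add_result_outcome(base, false)   # already_registered
end
```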

#closest_relatives(how_many = 1, ref_project = false) ⇒ Object

Returns an Array of how_many duples (Arrays) sorted by AAI:

  • 0: A String with the name(s) of the reference dataset.

  • 1: A Float with the AAI.

This function is currently only supported for query datasets when ref_project is false (default), and only for reference datasets when ref_project is true. It returns nil if this analysis is not supported.



# File 'lib/miga/dataset.rb', line 281

def closest_relatives(how_many=1, ref_project=false)
  return nil if (is_ref? != ref_project) or is_multi?
  r = result(ref_project ? :taxonomy : :distances)
  return nil if r.nil?
  db = SQLite3::Database.new(r.file_path :aai_db)
  db.execute("SELECT seq2, aai FROM aai WHERE seq2 != ? " +
    "GROUP BY seq2 ORDER BY aai DESC LIMIT ?", [name, how_many])
end
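
The ranking performed by the SQL query can be approximated in plain Ruby, which may make the shape of the returned value easier to see. This sketch assumes one row per reference dataset (so the GROUP BY clause has no effect); `rank_relatives` and the sample rows are hypothetical.

```ruby
# Plain-Ruby approximation of the query: drop the dataset itself, sort by
# AAI in descending order, and keep the top how_many [name, aai] pairs.
def rank_relatives(rows, self_name, how_many = 1)
  rows.reject { |seq2, _aai| seq2 == self_name }
      .sort_by { |_seq2, aai| -aai }
      .first(how_many)
end

# Made-up AAI values for illustration.
rows = [["query_1", 100.0], ["ref_A", 72.5], ["ref_B", 91.3], ["ref_C", 85.0]]
p rank_relatives(rows, "query_1", 2)   # [["ref_B", 91.3], ["ref_C", 85.0]]
```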

#done_preprocessing?(save = false) ⇒ Boolean

Are all the dataset-specific tasks done? Passes save to #add_result.

Returns:

  • (Boolean)


# File 'lib/miga/dataset.rb', line 249

def done_preprocessing?(save=false)
  !first_preprocessing(save).nil? and next_preprocessing(save).nil?
end

#each_result(&blk) ⇒ Object

Executes the 2-ary block blk for each result, passing the key symbol and the MiGA::Result.



# File 'lib/miga/dataset.rb', line 181

def each_result(&blk)
  @@RESULT_DIRS.keys.each do |k|
    blk.call(k, result(k)) unless result(k).nil?
  end
end

#first_preprocessing(save = false) ⇒ Object

Returns the key symbol of the first registered result (sorted by the execution order). This typically corresponds to the result used as the initial input. Passes save to #add_result.



# File 'lib/miga/dataset.rb', line 215

def first_preprocessing(save=false)
  @@PREPROCESSING_TASKS.find do |t|
    not ignore_task?(t) and not add_result(t, save).nil?
  end
end

#get_result(result_type) ⇒ Object

Gets the result of type result_type for this dataset as MiGA::Result. This is equivalent to add_result(result_type, false).



# File 'lib/miga/dataset.rb', line 209

def get_result(result_type) ; add_result(result_type, false) ; end

#ignore_task?(task) ⇒ Boolean

Should I ignore task for this dataset?

Returns:

  • (Boolean)


# File 'lib/miga/dataset.rb', line 238

def ignore_task?(task)
  return !metadata["run_#{task}"] unless metadata["run_#{task}"].nil?
  return true if task==:taxonomy and project.metadata[:ref_project].nil?
  pattern = [true, false]
  ( [@@_EXCLUDE_NOREF_TASKS_H[task], is_ref?     ]==pattern or
    [@@_ONLY_MULTI_TASKS_H[task],    is_multi?   ]==pattern or
    [@@_ONLY_NONMULTI_TASKS_H[task], is_nonmulti?]==pattern )
end
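
The `[flag, predicate] == [true, false]` comparison in the listing above reads as "the task is restricted, and the dataset lacks the required property". A stand-alone sketch of that type-based part follows. It deliberately leaves out the `run_<task>` metadata override and the `ref_project` check, and it uses `include?` on the task lists in place of the lookup hashes, whose contents are not shown in this page. The method `skip_by_type?` is a hypothetical name.

```ruby
# Task lists copied from the constant summary.
EXCLUDE_NOREF_TASKS = [:mytaxa_scan, :taxonomy]
ONLY_MULTI_TASKS    = [:mytaxa]
ONLY_NONMULTI_TASKS = [:mytaxa_scan, :distances, :taxonomy]

# A task is skipped when it belongs to a restricted list (true) while the
# dataset lacks the matching property (false), i.e. the pair is [true, false].
def skip_by_type?(task, is_ref:, is_multi:, is_nonmulti:)
  pattern = [true, false]
  [[EXCLUDE_NOREF_TASKS, is_ref],
   [ONLY_MULTI_TASKS,    is_multi],
   [ONLY_NONMULTI_TASKS, is_nonmulti]].any? do |list, flag|
    [list.include?(task), flag] == pattern
  end
end

# A reference isolate genome: single-organism, so :mytaxa is skipped.
genome_ref = { is_ref: true, is_multi: false, is_nonmulti: true }
p skip_by_type?(:mytaxa, **genome_ref)      # true
p skip_by_type?(:distances, **genome_ref)   # false

# A query genome: tasks excluded from non-reference datasets are skipped.
p skip_by_type?(:mytaxa_scan, is_ref: false, is_multi: false, is_nonmulti: true)   # true
```

Note that a dataset of unknown type has both `is_multi` and `is_nonmulti` set to false, so it skips the tasks in both restricted lists.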

#info ⇒ Object

Get standard metadata values for the dataset as Array.



# File 'lib/miga/dataset.rb', line 137

def info
  MiGA::Dataset.INFO_FIELDS.map do |k|
    (k=="name") ? self.name : metadata[k.to_sym]
  end
end

#is_multi? ⇒ Boolean

Is this dataset known to be multi-organism?

Returns:

  • (Boolean)


# File 'lib/miga/dataset.rb', line 153

def is_multi?
  return false if metadata[:type].nil? or
    @@KNOWN_TYPES[type].nil?
  @@KNOWN_TYPES[type][:multi]
end

#is_nonmulti? ⇒ Boolean

Is this dataset known to be single-organism?

Returns:

  • (Boolean)


# File 'lib/miga/dataset.rb', line 161

def is_nonmulti?
  return false if metadata[:type].nil? or
    @@KNOWN_TYPES[type].nil?
  !@@KNOWN_TYPES[type][:multi]
end
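
Because both predicates return false for a missing or unrecognized type, is_multi? and is_nonmulti? are not simple negations of each other. The sketch below illustrates this three-state behavior with a plain Hash; `KNOWN_MULTI`, `multi?`, and `nonmulti?` are hypothetical names, and `:plasmid` stands in for any type absent from KNOWN_TYPES.

```ruby
# The :multi flag of each known type, taken from the constant summary.
KNOWN_MULTI = { genome: false, scgenome: false, popgenome: false,
                metagenome: true, virome: true }

# True only for known multi-organism types.
def multi?(type)
  KNOWN_MULTI[type] == true
end

# True only for known single-organism types.
def nonmulti?(type)
  KNOWN_MULTI[type] == false
end

[:genome, :metagenome, :plasmid, nil].each do |t|
  p([t, multi?(t), nonmulti?(t)])
end
# [:genome, false, true]
# [:metagenome, true, false]
# [:plasmid, false, false]
# [nil, false, false]
```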

#is_query? ⇒ Boolean

Is this dataset a query (non-reference)?

Returns:

  • (Boolean)


# File 'lib/miga/dataset.rb', line 149

def is_query? ; !metadata[:ref] ; end

#is_ref? ⇒ Boolean

Is this dataset a reference?

Returns:

  • (Boolean)


# File 'lib/miga/dataset.rb', line 145

def is_ref? ; !!metadata[:ref] ; end

#next_preprocessing(save = false) ⇒ Object

Returns the key symbol of the next task that needs to be executed. Passes save to #add_result.



# File 'lib/miga/dataset.rb', line 224

def next_preprocessing(save=false)
  after_first = false
  first = first_preprocessing(save)
  return nil if first.nil?
  @@PREPROCESSING_TASKS.each do |t|
    next if ignore_task? t
    return t if after_first and add_result(t, save).nil?
    after_first = (after_first or (t==first))
  end
  nil
end
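
The interplay between #first_preprocessing and #next_preprocessing can be sketched over plain arrays. This is a simplified model: it uses a shortened task list, represents registered results as an Array of symbols, and ignores #ignore_task?. The names `PIPELINE`, `first_done`, and `next_pending` are hypothetical.

```ruby
# Shortened task list, in execution order.
PIPELINE = [:raw_reads, :trimmed_reads, :read_quality, :trimmed_fasta,
            :assembly, :cds]

# First task (in execution order) that already has a result.
def first_done(tasks, done)
  tasks.find { |t| done.include?(t) }
end

# First task after first_done without a result; nil if nothing is pending
# or if no task has a result at all.
def next_pending(tasks, done)
  first = first_done(tasks, done)
  return nil if first.nil?
  tasks.drop(tasks.index(first) + 1).find { |t| !done.include?(t) }
end

done = [:trimmed_fasta, :assembly]   # a dataset that started from trimmed FASTA
p first_done(PIPELINE, done)     # :trimmed_fasta
p next_pending(PIPELINE, done)   # :cds
p next_pending(PIPELINE, [])     # nil
```

Tasks before the first registered result are never reported as pending, which is why a dataset can enter the pipeline at any stage.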

#profile_advance(save = false) ⇒ Object

Returns an array indicating the stage of each task (sorted by execution order). The values are integers:

  • 0 for an undefined result (a task before the initial input).

  • 1 for a registered result (a completed task).

  • 2 for a queued result (a task yet to be executed).

It passes save to #add_result.



# File 'lib/miga/dataset.rb', line 260

def profile_advance(save=false)
  first_task = first_preprocessing(save)
  return Array.new(@@PREPROCESSING_TASKS.size, 0) if first_task.nil?
  adv = []
  state = 0
  next_task = next_preprocessing(save)
  @@PREPROCESSING_TASKS.each do |task|
    state = 1 if first_task==task
    state = 2 if !next_task.nil? and next_task==task
    adv << state
  end
  adv
end
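
To make the stage codes concrete, here is a stand-alone sketch of the same state machine that takes the first and next tasks as arguments instead of computing them. The method `profile` and the shortened task list are hypothetical.

```ruby
# Stage codes: 0 = before the initial input, 1 = completed, 2 = queued.
# The state only moves forward as the task list is traversed.
def profile(tasks, first_task, next_task)
  return Array.new(tasks.size, 0) if first_task.nil?
  state = 0
  tasks.map do |task|
    state = 1 if task == first_task
    state = 2 if !next_task.nil? && task == next_task
    state
  end
end

tasks = [:raw_reads, :trimmed_reads, :read_quality, :trimmed_fasta,
         :assembly, :cds]
p profile(tasks, :trimmed_fasta, :cds)   # [0, 0, 0, 1, 1, 2]
p profile(tasks, :raw_reads, nil)        # [1, 1, 1, 1, 1, 1]
p profile(tasks, nil, nil)               # [0, 0, 0, 0, 0, 0]
```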

#remove! ⇒ Object

Deletes the dataset with all its contents (including results) and returns nil.



# File 'lib/miga/dataset.rb', line 130

def remove!
  self.results.each{ |r| r.remove! }
  self.metadata.remove!
end

#result(k) ⇒ Object

Get the result MiGA::Result in this dataset identified by the symbol k.



# File 'lib/miga/dataset.rb', line 169

def result(k)
  return nil if @@RESULT_DIRS[k.to_sym].nil?
  MiGA::Result.load(
    "#{project.path}/data/#{@@RESULT_DIRS[k.to_sym]}/#{name}.json" )
end

#results ⇒ Object

Get all the results (Array of MiGA::Result) in this dataset.



# File 'lib/miga/dataset.rb', line 177

def results ; @@RESULT_DIRS.keys.map{ |k| result k }.compact ; end

#save ⇒ Object

Save any changes you’ve made in the dataset.



# File 'lib/miga/dataset.rb', line 117

def save
  self.metadata[:type] = :metagenome if !metadata[:tax].nil? and
    !metadata[:tax][:ns].nil? and metadata[:tax][:ns]=="COMMUNITY"
  self.metadata.save
end
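
Besides writing the metadata, the listing above also overrides the dataset type when the taxonomy namespace is "COMMUNITY". A minimal sketch of that rule follows, using a plain Hash as a stand-in for both MiGA::Metadata and the taxonomy entry (an assumption, since those classes are not shown in this page). The method `inferred_type` is a hypothetical name.

```ruby
# Returns the type that would be stored: :metagenome when the taxonomy
# namespace is "COMMUNITY", otherwise the type already in the metadata.
def inferred_type(metadata)
  tax = metadata[:tax]
  return :metagenome if !tax.nil? && tax[:ns] == "COMMUNITY"
  metadata[:type]
end

p inferred_type({ type: :genome, tax: { ns: "COMMUNITY" } })   # :metagenome
p inferred_type({ type: :genome, tax: { ns: "NCBI" } })        # :genome
p inferred_type({ type: :genome })                             # :genome
```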

#type ⇒ Object

Get the type of dataset as Symbol.



# File 'lib/miga/dataset.rb', line 125

def type ; metadata[:type] ; end