Class: MiGA::Dataset

Inherits: MiGA

MiGA::Dataset < MiGA < Object

Includes: DatasetResult

Defined in: lib/miga/dataset.rb

Overview

Dataset representation in MiGA.

Constant Summary

@@RESULT_DIRS =

{
  # Preprocessing
  raw_reads: "01.raw_reads", trimmed_reads: "02.trimmed_reads",
  read_quality: "03.read_quality", trimmed_fasta: "04.trimmed_fasta",
  assembly: "05.assembly", cds: "06.cds",
  # Annotation
  essential_genes: "07.annotation/01.function/01.essential",
  ssu: "07.annotation/01.function/02.ssu",
  mytaxa: "07.annotation/02.taxonomy/01.mytaxa",
  mytaxa_scan: "07.annotation/03.qa/02.mytaxa_scan",
  # Distances (for single-species datasets)
  distances: "09.distances", taxonomy: "09.distances/05.taxonomy",
  # General statistics
  stats: "90.stats"
}

@@KNOWN_TYPES =

{
  genome: {description: "The genome from an isolate.", multi: false},
  scgenome: {description: "A Single-cell Genome Amplification (SGA).",
    multi: false},
  popgenome: {description: "A population genome (including " +
    "metagenomic bins).", multi: false},
  metagenome: {description: "A metagenome (excluding viromes).",
    multi: true},
  virome: {description: "A viral metagenome.", multi: true}
}

@@PREPROCESSING_TASKS =

[:raw_reads, :trimmed_reads, :read_quality,
:trimmed_fasta, :assembly, :cds, :essential_genes, :ssu, :mytaxa,
:mytaxa_scan, :distances, :taxonomy, :stats]

@@EXCLUDE_NOREF_TASKS = Tasks to be excluded from query datasets.

[:mytaxa_scan, :taxonomy]

@@_EXCLUDE_NOREF_TASKS_H =

@@ONLY_NONMULTI_TASKS = Tasks to be executed only in datasets that are not multi-organism. These tasks are ignored for multi-organism datasets or for unknown types.

[:mytaxa_scan, :distances, :taxonomy]

@@_ONLY_NONMULTI_TASKS_H =

@@ONLY_MULTI_TASKS = Tasks to be executed only in datasets that are multi-organism. These tasks are ignored for single-organism datasets or for unknown types.

[:mytaxa]

@@_ONLY_MULTI_TASKS_H =

Constants included from MiGA

CITATION, VERSION, VERSION_DATE, VERSION_NAME

Instance Attribute Summary

#metadata ⇒ Object readonly

MiGA::Metadata with information about the dataset.
#name ⇒ Object readonly

Datasets are uniquely identified by name in a project.
#project ⇒ Object readonly

MiGA::Project that contains the dataset.

Class Method Summary

.exist?(project, name) ⇒ Boolean

Does the project already have a dataset with that name?
.INFO_FIELDS ⇒ Object

Standard fields of metadata for datasets.
.KNOWN_TYPES ⇒ Object

Supported dataset types.
.PREPROCESSING_TASKS ⇒ Object

Returns an Array of tasks to be executed before project-wide tasks.
.RESULT_DIRS ⇒ Object

Directories containing the results from dataset-specific tasks.

Instance Method Summary

#add_result(result_type, save = true, opts = {}) ⇒ Object

Look for the result with symbol key result_type and register it in the dataset.
#closest_relatives(how_many = 1, ref_project = false) ⇒ Object

Returns an Array of how_many duples (Arrays) sorted by AAI, each holding the name(s) of the reference dataset (String) and the AAI (Float).
#done_preprocessing?(save = false) ⇒ Boolean

Are all the dataset-specific tasks done? Passes save to #add_result.
#each_result(&blk) ⇒ Object

For each result executes the 2-ary blk block: key symbol and MiGA::Result.
#first_preprocessing(save = false) ⇒ Object

Returns the key symbol of the first registered result (sorted by the execution order).
#get_result(result_type) ⇒ Object

Gets a result as MiGA::Result for the datasets with result_type.
#ignore_task?(task) ⇒ Boolean

Should I ignore task for this dataset?
#info ⇒ Object

Get standard metadata values for the dataset as Array.
#initialize(project, name, is_ref = true, metadata = {}) ⇒ Dataset constructor

Create a MiGA::Dataset object in a project MiGA::Project with a uniquely identifying name.
#is_multi? ⇒ Boolean

Is this dataset known to be multi-organism?
#is_nonmulti? ⇒ Boolean

Is this dataset known to be single-organism?
#is_query? ⇒ Boolean

Is this dataset a query (non-reference)?
#is_ref? ⇒ Boolean

Is this dataset a reference?
#next_preprocessing(save = false) ⇒ Object

Returns the key symbol of the next task that needs to be executed.
#profile_advance(save = false) ⇒ Object

Returns an array indicating the stage of each task (sorted by execution order).
#remove! ⇒ Object

Deletes the dataset with all its contents (including results) and returns nil.
#result(k) ⇒ Object

Get the result MiGA::Result in this dataset identified by the symbol k.
#results ⇒ Object

Get all the results (Array of MiGA::Result) in this dataset.
#save ⇒ Object

Save any changes you’ve made in the dataset.
#type ⇒ Object

Get the type of dataset as Symbol.

Methods included from DatasetResult

#cleanup_distances!

Methods inherited from MiGA

CITATION, DEBUG, DEBUG_OFF, DEBUG_ON, DEBUG_TRACE_OFF, DEBUG_TRACE_ON, FULL_VERSION, LONG_VERSION, VERSION, VERSION_DATE, clean_fasta_file, initialized?, #result_files_exist?, root_path, script_path, tabulate

Constructor Details

#initialize(project, name, is_ref = true, metadata = {}) ⇒ `Dataset`

Create a MiGA::Dataset object in a project MiGA::Project with a uniquely identifying name. is_ref indicates if the dataset is to be treated as reference (true, default) or query (false). Pass any additional metadata as a Hash.

# File 'lib/miga/dataset.rb', line 105

def initialize(project, name, is_ref=true, metadata={})
  raise "Invalid name '#{name}', please use only alphanumerics and " +
    "underscores." unless name.miga_name?
  @project = project
  @name = name
  metadata[:ref] = is_ref
  @metadata = MiGA::Metadata.new(
    File.expand_path("metadata/#{name}.json", project.path), metadata )
end
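
The constructor's two visible steps (name validation and metadata path resolution) can be sketched outside MiGA as follows. This is a minimal illustration, not MiGA's implementation: valid_dataset_name? is a hypothetical stand-in for String#miga_name?, and its regular expression is an assumption inferred from the error message above.

```ruby
# Hypothetical stand-in for String#miga_name?; the pattern is an assumption
# based on the "only alphanumerics and underscores" error message.
def valid_dataset_name?(name)
  name.match?(/\A[A-Za-z0-9_]+\z/)
end

# Mirrors how the constructor locates the metadata file of a dataset:
# <project path>/metadata/<name>.json
def metadata_path(project_path, name)
  unless valid_dataset_name?(name)
    raise "Invalid name '#{name}', please use only alphanumerics and " +
      "underscores."
  end
  File.expand_path("metadata/#{name}.json", project_path)
end

puts metadata_path("/tmp/my_project", "Ecoli_K12")
```

A name such as "E. coli" would raise a RuntimeError in this sketch before any path is built.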

Instance Attribute Details

#metadata ⇒ `Object` (readonly)

MiGA::Metadata with information about the dataset.




# File 'lib/miga/dataset.rb', line 98

def metadata
  @metadata
end

#name ⇒ `Object` (readonly)

Datasets are uniquely identified by name in a project.




# File 'lib/miga/dataset.rb', line 94

def name
  @name
end

#project ⇒ `Object` (readonly)

MiGA::Project that contains the dataset.




# File 'lib/miga/dataset.rb', line 90

def project
  @project
end

Class Method Details

.exist?(project, name) ⇒ `Boolean`

Does the project already have a dataset with that name?

Returns:

(Boolean)




# File 'lib/miga/dataset.rb', line 76

def self.exist?(project, name)
  File.exist? "#{project.path}/metadata/#{name}.json"
end

.INFO_FIELDS ⇒ `Object`

Standard fields of metadata for datasets.




# File 'lib/miga/dataset.rb', line 82

def self.INFO_FIELDS
  %w(name created updated type ref user description comments)
end

.KNOWN_TYPES ⇒ `Object`

Supported dataset types.

# File 'lib/miga/dataset.rb', line 38

def self.KNOWN_TYPES ; @@KNOWN_TYPES end

.PREPROCESSING_TASKS ⇒ `Object`

Returns an Array of tasks to be executed before project-wide tasks.

# File 'lib/miga/dataset.rb', line 52

def self.PREPROCESSING_TASKS ; @@PREPROCESSING_TASKS ; end

.RESULT_DIRS ⇒ `Object`

Directories containing the results from dataset-specific tasks.

# File 'lib/miga/dataset.rb', line 19

def self.RESULT_DIRS ; @@RESULT_DIRS end

Instance Method Details

#add_result(result_type, save = true, opts = {}) ⇒ `Object`

Look for the result with symbol key result_type and register it in the dataset. If save is false, it doesn’t register the result, but it still returns a result if the expected files are complete. The opts Hash controls result creation (if necessary). Supported values include:

is_clean: A Boolean indicating if the input files are clean.

Returns MiGA::Result or nil.

# File 'lib/miga/dataset.rb', line 194

def add_result(result_type, save=true, opts={})
  dir = @@RESULT_DIRS[result_type]
  return nil if dir.nil?
  base = File.expand_path("data/#{dir}/#{name}", project.path)
  r_pre = MiGA::Result.load("#{base}.json")
  return r_pre if (r_pre.nil? and not save) or not r_pre.nil?
  r = File.exist?("#{base}.done") ?
      self.send("add_result_#{result_type}", base, opts) : nil
  r.save unless r.nil?
  r
end
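
The lookup order in the listing above can be sketched without MiGA. This is an illustration under stated assumptions: lookup_result is a hypothetical helper, it returns a plain Hash rather than a MiGA::Result, and it writes a placeholder JSON file where MiGA would call the task-specific add_result_* method. It follows the listed code path, in which a call with save set to false returns only what is already registered.

```ruby
require "json"
require "tmpdir"
require "fileutils"

# Subset of the result directories, for illustration only.
RESULT_DIRS = { assembly: "05.assembly" }

# Hypothetical stand-in for the lookup flow of add_result.
def lookup_result(project_path, name, result_type, save = true)
  dir = RESULT_DIRS[result_type]
  return nil if dir.nil?                          # unknown result type
  base = File.expand_path("data/#{dir}/#{name}", project_path)
  json = "#{base}.json"
  return JSON.parse(File.read(json)) if File.exist?(json) # already registered
  return nil unless save                          # not registered, not saving
  return nil unless File.exist?("#{base}.done")   # task has not finished yet
  result = { "type" => result_type.to_s }         # placeholder result
  File.write(json, JSON.generate(result))         # register it
  result
end

Dir.mktmpdir do |project|
  FileUtils.mkdir_p(File.join(project, "data/05.assembly"))
  p lookup_result(project, "d1", :assembly)        # no .done file yet
  FileUtils.touch(File.join(project, "data/05.assembly/d1.done"))
  p lookup_result(project, "d1", :assembly, false) # complete, not registered
  p lookup_result(project, "d1", :assembly)        # registers the result
  p lookup_result(project, "d1", :assembly, false) # now found on disk
end
```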

#closest_relatives(how_many = 1, ref_project = false) ⇒ `Object`

Returns an Array of how_many duples (Arrays) sorted by AAI:

0: A String with the name(s) of the reference dataset.
1: A Float with the AAI.

This function is currently only supported for query datasets when ref_project is false (default), and only for reference datasets when ref_project is true. It returns nil if this analysis is not supported.

# File 'lib/miga/dataset.rb', line 281

def closest_relatives(how_many=1, ref_project=false)
  return nil if (is_ref? != ref_project) or is_multi?
  r = result(ref_project ? :taxonomy : :distances)
  return nil if r.nil?
  db = SQLite3::Database.new(r.file_path :aai_db)
  db.execute("SELECT seq2, aai FROM aai WHERE seq2 != ? " +
    "GROUP BY seq2 ORDER BY aai DESC LIMIT ?", [name, how_many])
end

#done_preprocessing?(save = false) ⇒ `Boolean`

Are all the dataset-specific tasks done? Passes save to #add_result.

Returns:

(Boolean)




# File 'lib/miga/dataset.rb', line 249

def done_preprocessing?(save=false)
  !first_preprocessing(save).nil? and next_preprocessing(save).nil?
end

#each_result(&blk) ⇒ `Object`

For each result executes the 2-ary blk block: key symbol and MiGA::Result.

# File 'lib/miga/dataset.rb', line 181

def each_result(&blk)
  @@RESULT_DIRS.keys.each do |k|
    blk.call(k, result(k)) unless result(k).nil?
  end
end

#first_preprocessing(save = false) ⇒ `Object`

Returns the key symbol of the first registered result (sorted by the execution order). This typically corresponds to the result used as the initial input. Passes save to #add_result.

# File 'lib/miga/dataset.rb', line 215

def first_preprocessing(save=false)
  @@PREPROCESSING_TASKS.find do |t|
    not ignore_task?(t) and not add_result(t, save).nil?
  end
end

#get_result(result_type) ⇒ `Object`

Gets a result as MiGA::Result for the datasets with result_type. This is equivalent to add_result(result_type, false).

# File 'lib/miga/dataset.rb', line 209

def get_result(result_type) ; add_result(result_type, false) ; end

#ignore_task?(task) ⇒ `Boolean`

Should I ignore task for this dataset?

Returns:

(Boolean)

# File 'lib/miga/dataset.rb', line 238

def ignore_task?(task)
  return !metadata["run_#{task}"] unless metadata["run_#{task}"].nil?
  return true if task==:taxonomy and project.metadata[:ref_project].nil?
  pattern = [true, false]
  ( [@@_EXCLUDE_NOREF_TASKS_H[task], is_ref?     ]==pattern or
    [@@_ONLY_MULTI_TASKS_H[task],    is_multi?   ]==pattern or
    [@@_ONLY_NONMULTI_TASKS_H[task], is_nonmulti?]==pattern )
end
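
The [true, false] pattern comparison above reads as "the task belongs to a restricted list, and this dataset does not meet the restriction". A standalone sketch of that final expression follows. It assumes that each *_H class variable is a Hash mapping task to true (their values are not shown on this page), and it leaves out the metadata override and the ref_project check from the first two lines of the method. The names as_lookup and ignore? are hypothetical.

```ruby
EXCLUDE_NOREF_TASKS = [:mytaxa_scan, :taxonomy]
ONLY_NONMULTI_TASKS = [:mytaxa_scan, :distances, :taxonomy]
ONLY_MULTI_TASKS    = [:mytaxa]

# Assumption: each *_H constant is a lookup Hash of the form task => true.
def as_lookup(tasks)
  tasks.map { |t| [t, true] }.to_h
end

EXCLUDE_NOREF_H = as_lookup(EXCLUDE_NOREF_TASKS)
ONLY_NONMULTI_H = as_lookup(ONLY_NONMULTI_TASKS)
ONLY_MULTI_H    = as_lookup(ONLY_MULTI_TASKS)

# A task is ignored when it is listed (true) and the dataset lacks the
# required property (false).
def ignore?(task, is_ref:, is_multi:, is_nonmulti:)
  pattern = [true, false]
  [EXCLUDE_NOREF_H[task], is_ref] == pattern ||
    [ONLY_MULTI_H[task], is_multi] == pattern ||
    [ONLY_NONMULTI_H[task], is_nonmulti] == pattern
end

ref_genome = { is_ref: true, is_multi: false, is_nonmulti: true }
p ignore?(:mytaxa, **ref_genome)     # multi-only task on a genome
p ignore?(:distances, **ref_genome)  # allowed for single-organism datasets
```

Tasks absent from all three lists yield nil in the first slot of each pair, so they never match the pattern and are never ignored by this expression.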

#info ⇒ `Object`

Get standard metadata values for the dataset as Array.

# File 'lib/miga/dataset.rb', line 137

def info
  MiGA::Dataset.INFO_FIELDS.map do |k|
    (k=="name") ? self.name : metadata[k.to_sym]
  end
end

#is_multi? ⇒ `Boolean`

Is this dataset known to be multi-organism?

Returns:

(Boolean)

# File 'lib/miga/dataset.rb', line 153

def is_multi?
  return false if metadata[:type].nil? or
    @@KNOWN_TYPES[type].nil?
  @@KNOWN_TYPES[type][:multi]
end

#is_nonmulti? ⇒ `Boolean`

Is this dataset known to be single-organism?

Returns:

(Boolean)

# File 'lib/miga/dataset.rb', line 161

def is_nonmulti?
  return false if metadata[:type].nil? or
    @@KNOWN_TYPES[type].nil?
  !@@KNOWN_TYPES[type][:multi]
end
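
Note that #is_multi? and #is_nonmulti? are not complements: both return false when the type is missing or unknown. The sketch below reproduces that logic on a reduced type table; the helper names and the :plasmid type are hypothetical and serve only as an example of an unregistered type.

```ruby
# Reduced copy of the type table, keeping only the multi flag.
KNOWN_TYPES = {
  genome:     { multi: false },
  metagenome: { multi: true }
}

def multi?(type)
  return false if type.nil? || KNOWN_TYPES[type].nil?
  KNOWN_TYPES[type][:multi]
end

def nonmulti?(type)
  return false if type.nil? || KNOWN_TYPES[type].nil?
  !KNOWN_TYPES[type][:multi]
end

# :plasmid is a made-up type that is absent from the table.
[:genome, :metagenome, :plasmid, nil].each do |t|
  puts "#{t.inspect}: multi=#{multi?(t)} nonmulti=#{nonmulti?(t)}"
end
```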

#is_query? ⇒ `Boolean`

Is this dataset a query (non-reference)?

Returns:

(Boolean)

# File 'lib/miga/dataset.rb', line 149

def is_query? ; !metadata[:ref] ; end

#is_ref? ⇒ `Boolean`

Is this dataset a reference?

Returns:

(Boolean)

# File 'lib/miga/dataset.rb', line 145

def is_ref? ; !!metadata[:ref] ; end

#next_preprocessing(save = false) ⇒ `Object`

Returns the key symbol of the next task that needs to be executed. Passes save to #add_result.

# File 'lib/miga/dataset.rb', line 224

def next_preprocessing(save=false)
  after_first = false
  first = first_preprocessing(save)
  return nil if first.nil?
  @@PREPROCESSING_TASKS.each do |t|
    next if ignore_task? t
    return t if after_first and add_result(t, save).nil?
    after_first = (after_first or (t==first))
  end
  nil
end
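
How #first_preprocessing and #next_preprocessing walk the task list can be sketched with plain Arrays. In this illustration, done stands for the tasks that already have a registered result and ignored for the tasks #ignore_task? would skip; both parameters, the helper names, and the shortened task list are assumptions made for the example.

```ruby
# Shortened task list, in execution order.
TASKS = [:raw_reads, :trimmed_reads, :read_quality,
         :trimmed_fasta, :assembly, :cds]

# First task (in execution order) with a registered result.
def first_task(done, ignored = [])
  TASKS.find { |t| !ignored.include?(t) && done.include?(t) }
end

# First task after the initial input that has no registered result yet.
def next_task(done, ignored = [])
  first = first_task(done, ignored)
  return nil if first.nil?
  after_first = false
  TASKS.each do |t|
    next if ignored.include?(t)
    return t if after_first && !done.include?(t)
    after_first ||= (t == first)
  end
  nil
end

done = [:trimmed_reads, :read_quality]
p first_task(done) # => :trimmed_reads
p next_task(done)  # => :trimmed_fasta
```

Tasks before the first registered result (here :raw_reads) are never reported as pending, which is why a dataset can start from trimmed reads or an assembly.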

#profile_advance(save = false) ⇒ `Object`

Returns an array indicating the stage of each task (sorted by execution order). The values are integers:

0 for an undefined result (a task before the initial input).
1 for a registered result (a completed task).
2 for a queued result (a task yet to be executed).

It passes save to #add_result.

# File 'lib/miga/dataset.rb', line 260

def profile_advance(save=false)
  first_task = first_preprocessing(save)
  return Array.new(@@PREPROCESSING_TASKS.size, 0) if first_task.nil?
  adv = []
  state = 0
  next_task = next_preprocessing(save)
  @@PREPROCESSING_TASKS.each do |task|
    state = 1 if first_task==task
    state = 2 if !next_task.nil? and next_task==task
    adv << state
  end
  adv
end
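
As a worked example of the state encoding, the sketch below applies the same loop to a shortened task list. The profile helper is hypothetical and takes the first and next tasks as arguments instead of computing them from the dataset.

```ruby
# Shortened task list, in execution order.
TASKS = [:raw_reads, :trimmed_reads, :read_quality,
         :trimmed_fasta, :assembly, :cds]

# first_task and next_task play the roles of the values returned by
# #first_preprocessing and #next_preprocessing.
def profile(first_task, next_task)
  return Array.new(TASKS.size, 0) if first_task.nil?
  state = 0
  TASKS.map do |task|
    state = 1 if task == first_task
    state = 2 if !next_task.nil? && task == next_task
    state
  end
end

p profile(:trimmed_reads, :trimmed_fasta) # => [0, 1, 1, 2, 2, 2]
```

Because the state only ever moves forward, every task from the next pending one onwards is reported as 2, whether or not a later result already exists.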

#remove! ⇒ `Object`

Deletes the dataset with all its contents (including results) and returns nil.

# File 'lib/miga/dataset.rb', line 130

def remove!
  self.results.each{ |r| r.remove! }
  self.metadata.remove!
end

#result(k) ⇒ `Object`

Get the result MiGA::Result in this dataset identified by the symbol k.

# File 'lib/miga/dataset.rb', line 169

def result(k)
  return nil if @@RESULT_DIRS[k.to_sym].nil?
  MiGA::Result.load(
    "#{project.path}/data/#{@@RESULT_DIRS[k.to_sym]}/#{name}.json" )
end

#results ⇒ `Object`

Get all the results (Array of MiGA::Result) in this dataset.

# File 'lib/miga/dataset.rb', line 177

def results ; @@RESULT_DIRS.keys.map{ |k| result k }.compact ; end

#save ⇒ `Object`

Save any changes you’ve made in the dataset.

# File 'lib/miga/dataset.rb', line 117

def save
  self.metadata[:type] = :metagenome if !metadata[:tax].nil? and
    !metadata[:tax][:ns].nil? and metadata[:tax][:ns]=="COMMUNITY"
  self.metadata.save
end
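
As the listing shows, #save also forces the type to :metagenome when the taxonomy namespace is "COMMUNITY". A minimal sketch of that rule on a plain Hash follows; in MiGA the :tax entry may be a richer object, so treating it as a Hash is an assumption, and coerce_type is a hypothetical helper name.

```ruby
# Applies the type rule from #save to a plain metadata Hash.
def coerce_type(metadata)
  tax = metadata[:tax]
  if !tax.nil? && tax[:ns] == "COMMUNITY"
    metadata[:type] = :metagenome
  end
  metadata
end

p coerce_type({ type: :genome, tax: { ns: "COMMUNITY" } })[:type]
p coerce_type({ type: :genome, tax: { ns: "ncbi" } })[:type]
```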

#type ⇒ `Object`

Get the type of dataset as Symbol.

# File 'lib/miga/dataset.rb', line 125

def type ; metadata[:type] ; end
