Class: MiGA::Dataset
- Includes:
- DatasetResult
- Defined in:
- lib/miga/dataset.rb
Overview
Dataset representation in MiGA.
Constant Summary
- @@RESULT_DIRS =
{
  # Preprocessing
  raw_reads: "01.raw_reads",
  trimmed_reads: "02.trimmed_reads",
  read_quality: "03.read_quality",
  trimmed_fasta: "04.trimmed_fasta",
  assembly: "05.assembly",
  cds: "06.cds",
  # Annotation
  essential_genes: "07.annotation/01.function/01.essential",
  ssu: "07.annotation/01.function/02.ssu",
  mytaxa: "07.annotation/02.taxonomy/01.mytaxa",
  mytaxa_scan: "07.annotation/03.qa/02.mytaxa_scan",
  # Distances (for single-species datasets)
  distances: "09.distances",
  taxonomy: "09.distances/05.taxonomy",
  # General statistics
  stats: "90.stats"
}
- @@KNOWN_TYPES =
{
  genome: {description: "The genome from an isolate.", multi: false},
  scgenome: {description: "A Single-cell Genome Amplification (SGA).", multi: false},
  popgenome: {description: "A population genome (including " +
    "metagenomic bins).", :multi=>false},
  metagenome: {description: "A metagenome (excluding viromes).", multi: true},
  virome: {description: "A viral metagenome.", multi: true}
}
- @@PREPROCESSING_TASKS =
[:raw_reads, :trimmed_reads, :read_quality, :trimmed_fasta, :assembly, :cds, :essential_genes, :ssu, :mytaxa, :mytaxa_scan, :distances, :taxonomy, :stats]
- @@EXCLUDE_NOREF_TASKS =
Tasks to be excluded from query datasets.
[:mytaxa_scan, :taxonomy]
- @@_EXCLUDE_NOREF_TASKS_H =
- @@ONLY_NONMULTI_TASKS =
Tasks to be executed only in datasets that are not multi-organism. These tasks are ignored for multi-organism datasets or for unknown types.
[:mytaxa_scan, :distances, :taxonomy]
- @@_ONLY_NONMULTI_TASKS_H =
- @@ONLY_MULTI_TASKS =
Tasks to be executed only in datasets that are multi-organism. These tasks are ignored for single-organism datasets or for unknown types.
[:mytaxa]
- @@_ONLY_MULTI_TASKS_H =
Constants included from MiGA
CITATION, VERSION, VERSION_DATE, VERSION_NAME
Instance Attribute Summary
-
#metadata ⇒ Object
readonly
MiGA::Metadata with information about the dataset.
-
#name ⇒ Object
readonly
Datasets are uniquely identified by name in a project.
-
#project ⇒ Object
readonly
MiGA::Project that contains the dataset.
Class Method Summary
-
.exist?(project, name) ⇒ Boolean
Does the project already have a dataset with that name?
-
.INFO_FIELDS ⇒ Object
Standard fields of metadata for datasets.
-
.KNOWN_TYPES ⇒ Object
Supported dataset types.
-
.PREPROCESSING_TASKS ⇒ Object
Returns an Array of tasks to be executed before project-wide tasks.
-
.RESULT_DIRS ⇒ Object
Directories containing the results from dataset-specific tasks.
Instance Method Summary
-
#add_result(result_type, save = true, opts = {}) ⇒ Object
Look for the result with symbol key result_type and register it in the dataset.
-
#closest_relatives(how_many = 1, ref_project = false) ⇒ Object
Returns an Array of how_many duples (Arrays) sorted by AAI; element 0 of each duple is a String with the name(s) of the reference dataset.
-
#done_preprocessing?(save = false) ⇒ Boolean
Are all the dataset-specific tasks done? Passes save to #add_result.
-
#each_result(&blk) ⇒ Object
For each result executes the 2-ary blk block: key symbol and MiGA::Result.
-
#first_preprocessing(save = false) ⇒ Object
Returns the key symbol of the first registered result (sorted by the execution order).
-
#get_result(result_type) ⇒ Object
Gets the result of type result_type for this dataset as MiGA::Result.
-
#ignore_task?(task) ⇒ Boolean
Should I ignore task for this dataset?
-
#info ⇒ Object
Get standard metadata values for the dataset as Array.
-
#initialize(project, name, is_ref = true, metadata = {}) ⇒ Dataset
constructor
Create a MiGA::Dataset object in a project MiGA::Project with a uniquely identifying name.
-
#is_multi? ⇒ Boolean
Is this dataset known to be multi-organism?
-
#is_nonmulti? ⇒ Boolean
Is this dataset known to be single-organism?
-
#is_query? ⇒ Boolean
Is this dataset a query (non-reference)?
-
#is_ref? ⇒ Boolean
Is this dataset a reference?
-
#next_preprocessing(save = false) ⇒ Object
Returns the key symbol of the next task that needs to be executed.
-
#profile_advance(save = false) ⇒ Object
Returns an array indicating the stage of each task (sorted by execution order).
-
#remove! ⇒ Object
Deletes the dataset with all its contents (including results) and returns nil.
-
#result(k) ⇒ Object
Get the result MiGA::Result in this dataset identified by the symbol k.
-
#results ⇒ Object
Get all the results (Array of MiGA::Result) in this dataset.
-
#save ⇒ Object
Save any changes you’ve made in the dataset.
-
#type ⇒ Object
Get the type of dataset as Symbol.
Methods included from DatasetResult
Methods inherited from MiGA
CITATION, DEBUG, DEBUG_OFF, DEBUG_ON, DEBUG_TRACE_OFF, DEBUG_TRACE_ON, FULL_VERSION, LONG_VERSION, VERSION, VERSION_DATE, clean_fasta_file, initialized?, #result_files_exist?, root_path, script_path, tabulate
Constructor Details
#initialize(project, name, is_ref = true, metadata = {}) ⇒ Dataset
Create a MiGA::Dataset object in a project MiGA::Project with a uniquely identifying name. is_ref indicates if the dataset is to be treated as reference (true, default) or query (false). Pass any additional metadata as a Hash.
# File 'lib/miga/dataset.rb', line 105
def initialize(project, name, is_ref=true, metadata={})
  raise "Invalid name '#{name}', please use only alphanumerics and " +
    "underscores." unless name.miga_name?
  @project = project
  @name = name
  metadata[:ref] = is_ref
  @metadata = MiGA::Metadata.new(
    File.expand_path("metadata/#{name}.json", project.path),
    metadata )
end
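The constructor rejects any name for which name.miga_name? is false. The sketch below shows that kind of check in isolation. The helper names valid_dataset_name? and check_name! are hypothetical, and the regular expression is an assumption inferred from the error message (alphanumerics and underscores only); it is not MiGA's implementation of miga_name?.

```ruby
# Hypothetical stand-in for String#miga_name?. The pattern is an assumption
# based on the wording of the constructor's error message.
def valid_dataset_name?(name)
  !!(name =~ /\A[A-Za-z0-9_]+\z/)
end

# Raises with the same message as the constructor when the name is rejected.
def check_name!(name)
  unless valid_dataset_name?(name)
    raise "Invalid name '#{name}', please use only alphanumerics and " +
      "underscores."
  end
  name
end

puts valid_dataset_name?("E_coli_K12")   # true
puts valid_dataset_name?("E. coli K12")  # false
```

Validating before any instance variable is set means an invalid name never produces a half-initialized object.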
Instance Attribute Details
#metadata ⇒ Object (readonly)
MiGA::Metadata with information about the dataset.
# File 'lib/miga/dataset.rb', line 98
def metadata
  @metadata
end
#name ⇒ Object (readonly)
Datasets are uniquely identified by name in a project.
# File 'lib/miga/dataset.rb', line 94
def name
  @name
end
#project ⇒ Object (readonly)
MiGA::Project that contains the dataset.
# File 'lib/miga/dataset.rb', line 90
def project
  @project
end
Class Method Details
.exist?(project, name) ⇒ Boolean
Does the project already have a dataset with that name?
# File 'lib/miga/dataset.rb', line 76
def self.exist?(project, name)
  File.exist? "#{project.path}/metadata/#{name}.json"
end
.INFO_FIELDS ⇒ Object
Standard fields of metadata for datasets.
# File 'lib/miga/dataset.rb', line 82
def self.INFO_FIELDS
  %w(name created updated type ref user description comments)
end
.KNOWN_TYPES ⇒ Object
Supported dataset types.
# File 'lib/miga/dataset.rb', line 38
def self.KNOWN_TYPES ; @@KNOWN_TYPES end
.PREPROCESSING_TASKS ⇒ Object
Returns an Array of tasks to be executed before project-wide tasks.
# File 'lib/miga/dataset.rb', line 52
def self.PREPROCESSING_TASKS ; @@PREPROCESSING_TASKS ; end
.RESULT_DIRS ⇒ Object
Directories containing the results from dataset-specific tasks.
# File 'lib/miga/dataset.rb', line 19
def self.RESULT_DIRS ; @@RESULT_DIRS end
Instance Method Details
#add_result(result_type, save = true, opts = {}) ⇒ Object
Look for the result with symbol key result_type and register it in the dataset. If save is false, it doesn’t register the result, but it still returns a result if the expected files are complete. The opts Hash controls result creation (if necessary). Supported values include:
-
is_clean: A Boolean indicating if the input files are clean.
Returns MiGA::Result or nil.
# File 'lib/miga/dataset.rb', line 194
def add_result(result_type, save=true, opts={})
  dir = @@RESULT_DIRS[result_type]
  return nil if dir.nil?
  base = File.expand_path("data/#{dir}/#{name}", project.path)
  r_pre = MiGA::Result.load("#{base}.json")
  return r_pre if (r_pre.nil? and not save) or not r_pre.nil?
  r = File.exist?("#{base}.done") ?
    self.send("add_result_#{result_type}", base, opts) : nil
  r.save unless r.nil?
  r
end
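The order of checks in the listing above can be sketched without MiGA itself. In this simplified stand-in, results are plain Hashes stored as JSON, and the block plays the role of the add_result_<type> builder. The name add_result_sketch and the file contents are assumptions for illustration only.

```ruby
require "json"
require "tmpdir"

# Simplified stand-in for the decision flow listed for #add_result.
# The block builds the result, standing in for add_result_<type>.
def add_result_sketch(base, save = true)
  json = "#{base}.json"
  return JSON.parse(File.read(json)) if File.exist?(json) # already registered
  return nil unless save                        # not registered, not asked to
  return nil unless File.exist?("#{base}.done") # task has not finished yet
  result = yield(base)
  File.write(json, JSON.generate(result)) unless result.nil?
  result
end

Dir.mktmpdir do |dir|
  base = File.join(dir, "my_dataset")
  builder = ->(b) { { "files" => ["#{File.basename(b)}.fasta"] } }
  p add_result_sketch(base, true, &builder)   # nil, no .done marker yet
  File.write("#{base}.done", "")
  p add_result_sketch(base, false, &builder)  # nil, save is false
  p add_result_sketch(base, true, &builder)   # builds and registers
  p add_result_sketch(base, false, &builder)  # loaded from the .json file
end
```

Checking for an already registered result first keeps repeated calls cheap, since the builder only runs once per result.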
#closest_relatives(how_many = 1, ref_project = false) ⇒ Object
Returns an Array of how_many duples (Arrays) sorted by AAI:
-
0: A String with the name(s) of the reference dataset.
-
1: A Float with the AAI.
This function is currently only supported for query datasets when ref_project is false (default), and only for reference datasets when ref_project is true. It returns nil if this analysis is not supported.
# File 'lib/miga/dataset.rb', line 281
def closest_relatives(how_many=1, ref_project=false)
  return nil if (is_ref? != ref_project) or is_multi?
  r = result(ref_project ? :taxonomy : :distances)
  return nil if r.nil?
  db = SQLite3::Database.new(r.file_path :aai_db)
  db.execute("SELECT seq2, aai FROM aai WHERE seq2 != ? " +
    "GROUP BY seq2 ORDER BY aai DESC LIMIT ?", [name, how_many])
end
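A pure-Ruby sketch of the guard and the ranking may help when reading the query above. It works on in-memory rows instead of the SQLite database and assumes one row per seq2 value, so the GROUP BY has nothing to collapse. The names supported_sketch? and closest_sketch, and all sample data, are hypothetical.

```ruby
# Mirrors the guard: the analysis runs only when the dataset's reference
# status matches ref_project and the dataset is not multi-organism.
def supported_sketch?(is_ref:, is_multi:, ref_project:)
  is_ref == ref_project && !is_multi
end

# Ranks [name, aai] rows by descending AAI, skipping the dataset itself.
# Assumes each name appears at most once in rows.
def closest_sketch(rows, name, how_many = 1)
  rows.reject { |seq2, _aai| seq2 == name }
      .sort_by { |_seq2, aai| -aai }
      .first(how_many)
end

rows = [["ref_A", 92.5], ["query_1", 100.0], ["ref_B", 97.1], ["ref_C", 88.0]]
p closest_sketch(rows, "query_1", 2)  # [["ref_B", 97.1], ["ref_A", 92.5]]
p supported_sketch?(is_ref: false, is_multi: false, ref_project: false) # true
p supported_sketch?(is_ref: true,  is_multi: false, ref_project: false) # false
```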
#done_preprocessing?(save = false) ⇒ Boolean
Are all the dataset-specific tasks done? Passes save to #add_result.
# File 'lib/miga/dataset.rb', line 249
def done_preprocessing?(save=false)
  !first_preprocessing(save).nil? and next_preprocessing(save).nil?
end
#each_result(&blk) ⇒ Object
For each result executes the 2-ary blk block: key symbol and MiGA::Result.
# File 'lib/miga/dataset.rb', line 181
def each_result(&blk)
  @@RESULT_DIRS.keys.each do |k|
    blk.call(k, result(k)) unless result(k).nil?
  end
end
#first_preprocessing(save = false) ⇒ Object
Returns the key symbol of the first registered result (sorted by the execution order). This typically corresponds to the result used as the initial input. Passes save to #add_result.
# File 'lib/miga/dataset.rb', line 215
def first_preprocessing(save=false)
  @@PREPROCESSING_TASKS.find do |t|
    not ignore_task?(t) and not add_result(t, save).nil?
  end
end
#get_result(result_type) ⇒ Object
Gets the result of type result_type for this dataset as MiGA::Result. This is equivalent to add_result(result_type, false).
# File 'lib/miga/dataset.rb', line 209
def get_result(result_type) ; add_result(result_type, false) ; end
#ignore_task?(task) ⇒ Boolean
Should I ignore task for this dataset?
# File 'lib/miga/dataset.rb', line 238
def ignore_task?(task)
  return !metadata["run_#{task}"] unless metadata["run_#{task}"].nil?
  return true if task==:taxonomy and project.metadata[:ref_project].nil?
  pattern = [true, false]
  ( [@@_EXCLUDE_NOREF_TASKS_H[task], is_ref?     ]==pattern or
    [@@_ONLY_MULTI_TASKS_H[task],    is_multi?   ]==pattern or
    [@@_ONLY_NONMULTI_TASKS_H[task], is_nonmulti?]==pattern )
end
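The three pair comparisons at the end of ignore_task? can be read as one rule: a task is ignored when it appears in a restricted list and the dataset lacks the property that list requires. The sketch below reproduces only that part. The lookup Hashes are assumed to map each listed task to true (their actual definitions are not shown on this page), and the per-dataset run_<task> override and the ref_project check are left out.

```ruby
# Assumed shape of the lookup tables: listed task => true.
EXCLUDE_NOREF = { mytaxa_scan: true, taxonomy: true }
ONLY_MULTI    = { mytaxa: true }
ONLY_NONMULTI = { mytaxa_scan: true, distances: true, taxonomy: true }

# A rule fires when the task is listed (true) and the dataset lacks the
# required property (false), that is, when the pair equals [true, false].
def ignore_task_sketch?(task, ref:, multi:, nonmulti:)
  pattern = [true, false]
  [EXCLUDE_NOREF[task], ref] == pattern ||
    [ONLY_MULTI[task], multi] == pattern ||
    [ONLY_NONMULTI[task], nonmulti] == pattern
end

# Reference genome (single-organism).
p ignore_task_sketch?(:mytaxa, ref: true, multi: false, nonmulti: true)    # true
p ignore_task_sketch?(:distances, ref: true, multi: false, nonmulti: true) # false
# Query metagenome (multi-organism).
p ignore_task_sketch?(:taxonomy, ref: false, multi: true, nonmulti: false) # true
# Unknown type: neither multi nor nonmulti, so both restricted sets are skipped.
p ignore_task_sketch?(:distances, ref: true, multi: false, nonmulti: false) # true
```

Tasks that appear in none of the lists yield nil in every pair, so they never match the pattern and are never ignored by this rule.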
#info ⇒ Object
Get standard metadata values for the dataset as Array.
# File 'lib/miga/dataset.rb', line 137
def info
  MiGA::Dataset.INFO_FIELDS.map do |k|
    (k=="name") ? self.name : metadata[k.to_sym]
  end
end
#is_multi? ⇒ Boolean
Is this dataset known to be multi-organism?
# File 'lib/miga/dataset.rb', line 153
def is_multi?
  return false if metadata[:type].nil? or @@KNOWN_TYPES[type].nil?
  @@KNOWN_TYPES[type][:multi]
end
#is_nonmulti? ⇒ Boolean
Is this dataset known to be single-organism?
# File 'lib/miga/dataset.rb', line 161
def is_nonmulti?
  return false if metadata[:type].nil? or @@KNOWN_TYPES[type].nil?
  !@@KNOWN_TYPES[type][:multi]
end
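Note that is_multi? and is_nonmulti? are not complements: both return false when the type is missing or unknown. A standalone sketch of the two predicates, using a trimmed copy of the type table and hypothetical function names, makes this visible.

```ruby
# Trimmed copy of the type table; descriptions omitted.
KNOWN_TYPES_SKETCH = {
  genome:     { multi: false },
  scgenome:   { multi: false },
  popgenome:  { multi: false },
  metagenome: { multi: true },
  virome:     { multi: true }
}

def multi_sketch?(type)
  return false if type.nil? || KNOWN_TYPES_SKETCH[type].nil?
  KNOWN_TYPES_SKETCH[type][:multi]
end

def nonmulti_sketch?(type)
  return false if type.nil? || KNOWN_TYPES_SKETCH[type].nil?
  !KNOWN_TYPES_SKETCH[type][:multi]
end

[:genome, :virome, :plasmid, nil].each do |t|
  puts "#{t.inspect}: multi=#{multi_sketch?(t)} nonmulti=#{nonmulti_sketch?(t)}"
end
```

This is why tasks restricted to either kind of dataset are skipped for unknown types: neither predicate holds.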
#is_query? ⇒ Boolean
Is this dataset a query (non-reference)?
# File 'lib/miga/dataset.rb', line 149
def is_query? ; !metadata[:ref] ; end
#is_ref? ⇒ Boolean
Is this dataset a reference?
# File 'lib/miga/dataset.rb', line 145
def is_ref? ; !!metadata[:ref] ; end
#next_preprocessing(save = false) ⇒ Object
Returns the key symbol of the next task that needs to be executed. Passes save to #add_result.
# File 'lib/miga/dataset.rb', line 224
def next_preprocessing(save=false)
  after_first = false
  first = first_preprocessing(save)
  return nil if first.nil?
  @@PREPROCESSING_TASKS.each do |t|
    next if ignore_task? t
    return t if after_first and add_result(t, save).nil?
    after_first = (after_first or (t==first))
  end
  nil
end
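The interplay between first_preprocessing and next_preprocessing is easier to follow with the result lookups replaced by a plain list. In this sketch, done stands in for "add_result returned a result" and ignored stands in for ignore_task?; both parameters and the function names are assumptions made for illustration.

```ruby
TASKS_SKETCH = [:raw_reads, :trimmed_reads, :read_quality, :trimmed_fasta,
  :assembly, :cds, :essential_genes, :ssu, :mytaxa, :mytaxa_scan,
  :distances, :taxonomy, :stats]

# First task (in execution order) that is not ignored and has a result.
def first_task_sketch(done, ignored = [])
  TASKS_SKETCH.find { |t| !ignored.include?(t) && done.include?(t) }
end

# First task after the initial input that is not ignored and has no result.
def next_task_sketch(done, ignored = [])
  first = first_task_sketch(done, ignored)
  return nil if first.nil?
  after_first = false
  TASKS_SKETCH.each do |t|
    next if ignored.include?(t)
    return t if after_first && !done.include?(t)
    after_first ||= (t == first)
  end
  nil
end

p next_task_sketch([:assembly])            # :cds
p next_task_sketch([:assembly], [:cds])    # :essential_genes
p next_task_sketch([])                     # nil
```

Tasks before the initial input are never reported as pending, so a dataset that starts from an assembly is not asked to produce raw reads.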
#profile_advance(save = false) ⇒ Object
Returns an array indicating the stage of each task (sorted by execution order). The values are integers:
-
0 for an undefined result (a task before the initial input).
-
1 for a registered result (a completed task).
-
2 for a queued result (a task yet to be executed).
It passes save to #add_result.
# File 'lib/miga/dataset.rb', line 260
def profile_advance(save=false)
  first_task = first_preprocessing(save)
  return Array.new(@@PREPROCESSING_TASKS.size, 0) if first_task.nil?
  adv = []
  state = 0
  next_task = next_preprocessing(save)
  @@PREPROCESSING_TASKS.each do |task|
    state = 1 if first_task==task
    state = 2 if !next_task.nil? and next_task==task
    adv << state
  end
  adv
end
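The state array can be reproduced with a small function that takes the first and next tasks as arguments instead of looking them up. The function name, its parameters, and the shortened task list are assumptions for illustration.

```ruby
# Walks the tasks in execution order, switching to 1 at the first
# registered task and to 2 at the next pending task.
def profile_sketch(tasks, first_task, next_task)
  return Array.new(tasks.size, 0) if first_task.nil?
  state = 0
  tasks.map do |task|
    state = 1 if task == first_task
    state = 2 if !next_task.nil? && task == next_task
    state
  end
end

tasks = [:raw_reads, :trimmed_reads, :read_quality, :trimmed_fasta,
  :assembly, :cds]
p profile_sketch(tasks, :trimmed_reads, :assembly)  # [0, 1, 1, 1, 2, 2]
p profile_sketch(tasks, nil, nil)                   # [0, 0, 0, 0, 0, 0]
```

Because the state only ever moves forward, every task after the next pending one is reported as 2, whether or not it already has a result.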
#remove! ⇒ Object
Deletes the dataset with all its contents (including results) and returns nil.
# File 'lib/miga/dataset.rb', line 130
def remove!
  self.results.each{ |r| r.remove! }
  self.metadata.remove!
end
#result(k) ⇒ Object
Get the result MiGA::Result in this dataset identified by the symbol k.
# File 'lib/miga/dataset.rb', line 169
def result(k)
  return nil if @@RESULT_DIRS[k.to_sym].nil?
  MiGA::Result.load(
    "#{project.path}/data/#{@@RESULT_DIRS[k.to_sym]}/#{name}.json" )
end
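The path that result(k) loads follows directly from the directory table. A sketch with a trimmed copy of that table shows the layout; the function name, project path, and dataset name are made up for illustration.

```ruby
# Trimmed copy of the result directory table.
RESULT_DIRS_SKETCH = {
  assembly: "05.assembly",
  cds: "06.cds",
  essential_genes: "07.annotation/01.function/01.essential"
}

# Builds the path of the JSON file describing a result, or nil when the
# key is not a known result type. Accepts Strings or Symbols, like result(k).
def result_json_path(project_path, name, key)
  dir = RESULT_DIRS_SKETCH[key.to_sym]
  return nil if dir.nil?
  "#{project_path}/data/#{dir}/#{name}.json"
end

puts result_json_path("/path/to/project", "my_dataset", "cds")
# /path/to/project/data/06.cds/my_dataset.json
p result_json_path("/path/to/project", "my_dataset", :unknown)  # nil
```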
#results ⇒ Object
Get all the results (Array of MiGA::Result) in this dataset.
# File 'lib/miga/dataset.rb', line 177
def results ; @@RESULT_DIRS.keys.map{ |k| result k }.compact ; end
#save ⇒ Object
Save any changes you’ve made in the dataset.
# File 'lib/miga/dataset.rb', line 117
def save
  self.metadata[:type] = :metagenome if !metadata[:tax].nil? and
    !metadata[:tax][:ns].nil? and metadata[:tax][:ns]=="COMMUNITY"
  self.metadata.save
end
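Before writing the metadata, save forces the type to :metagenome whenever the taxonomy namespace is "COMMUNITY". The sketch below isolates that rule using plain Hashes as stand-ins for the metadata and taxonomy objects, which is an assumption, and a hypothetical function name.

```ruby
# Returns the metadata as it would look after the type override in #save.
# Plain Hashes stand in for MiGA::Metadata and the taxonomy object.
def with_type_override(metadata)
  tax = metadata[:tax]
  if !tax.nil? && !tax[:ns].nil? && tax[:ns] == "COMMUNITY"
    metadata.merge(type: :metagenome)
  else
    metadata
  end
end

p with_type_override({ type: :genome, tax: { ns: "COMMUNITY" } })[:type] # :metagenome
p with_type_override({ type: :genome, tax: { ns: "NCBI" } })[:type]      # :genome
p with_type_override({ type: :genome })[:type]                           # :genome
```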
#type ⇒ Object
Get the type of dataset as Symbol.
# File 'lib/miga/dataset.rb', line 125
def type ; metadata[:type] ; end