forthebadge forthebadge Gem Version

This gem was last updated on the 25.09.2022 (dd.mm.yyyy notation), at 23:02:05 o'clock.

The Bioroebe Project

Bioroebe

The above pictures, more or less, represent DNA - in particular a dsDNA helix (a double-stranded DNA helix).

The very left picture no longer looks like a dsDNA helix, though, due to my attempts to 'improve' it over the years and failing hard at that. I really should re-do that image one day.

The other two images are more recent. For instance, the second picture (the one that is almost in the middle) shows a schematic for a dsDNA helix. Of course DNA does not look like this at all whatsoever (hydrogen bonds can not possibly ever look as depicted like that, aside from the spacing being incorrect), but it is a pretty picture nonetheless - so let's go with pretty for the fancy visual show effects! \o/

Minor nitpick: keep in mind that DNA in regular cells is right-handed, so if you see a DNA double helix displayed that is going in the left direction then this is technically incorrect. So the image on the very right hand side is showing two dsDNA where one is evidently "incorrect" - but we could reason about it still being correct if we assume that one of the two dsDNA molecules shown is a synthetic one for a mirror cell, similar to racemases and epimers in chemistry.

About the BioRoebe project: History and Goals

The BioRoebe project was initially created in the year 2007 - or at the least close to 2007, give or take, under another name.

I was using the project for quite some time for my own, personal use cases in regards to bioinformatics and molecular biology; just a small hobby project, for very small, minor tasks. In many ways the project is still just a hobby project really - it's not a professional suite of software, and won't be for the foreseeable time either due to time constraints alone.

In early 2013, the project was finally published on rubygems.org, which has been its new home ever since - and probably will remain its home, for a very long time to come. It is hard to predict the future accurately, though.

Nonetheless, since as of 2013 the project has grown considerably in size. That makes describing the project a bit difficult, too, since the use cases for the project have increased, changed and been adapted over the years. There is not merely 'one' use case only - the bioroebe gem is a toolset project. Different people make use of different tools. Even programming languages may vary - while most of the bioroebe project is written in ruby, there are some (smaller) parts written in Java as well and I do not rule out using Python or other programming languages to do specific tasks either. Some parts of the bioroebe project are more widely (and often) used; other parts refer to niche use cases and thus are less frequently used.

Despite the plethora of options supported by this project, the BioRoebe project has several very important goals that are still valid as of today and stand out compared to other goals of lesser importance.

The primary purpose for this project - that is the main use case, if but one has to be named - is to be able to quickly help in regards to bioinformatics-related tasks, and associated wet-lab-related use cases from within molecular biology.

For example, the project should allow its users to run it on a local computer, on a remote computer, be used on the commandline, via different GUIs or via the www or on a smartphone/mobile device in the long run.

There should be no limitation in where and how the project should or could be run, including the possibility to run it on as many different operating systems as possible (at the least if ruby is available on that operating system; or possibly Java as an additional option one day). This is why the project has to try to stay as flexible as possible - we must support different operating systems, with all their quirks and unique oddities, if this is doable.

Note that this goal also applies to programming languages, as pointed out already. While the primary focus for the bioroebe project is (and will remain) on ruby, I specifically do not exclude the possibility that languages such as Java, C/C++ or even Python may be included and used in this project. In fact: since as of 2021, Java-specific parts will be extended in the bioroebe project as well. In the long run I would like to support both ruby and Java from the get go. The primary question that is relevant here in regards to different programming languages is that of use case; usability and usefulness, and then maintainability of the code base as well, as a secondary consideration.

The bioroebe project additionally has to solve real problems, in particular from a molecular biology point of view. Most bioinformatics-related toolkits were written by experts in the field who tend to have a strong background in mathematics and informatics. While that is a perfectly fine background to have, and most definitely an asset, I simply came from the "other" side - molecular biology and molecular genetics. This makes for a difference in thinking too, because I tend to be closer towards the side of, say, synthetic biology, than on the side of (bio)mathematics or statistics, due to my own interests in the way how I think or approach a given problem set. It's different to a primary in-silico driven approach. Nonetheless, the bioroebe project attempts to remain as flexible as possible, including exploring other ways in how a given problem set can be solved.

When the bioroebe project was initially created, I wanted it to be more natural to people who may not necessarily excel at designing (or even understanding) algorithms. Not that algorithms should be neglected, mind you, for efficiency reasons alone; but the primary view for the whole BioRoebe suite of programs will be to focus on "real" biology first and foremost, not just in-silico dry runs or simulations, per se. Although most of the fields of bioinformatics is dominated by mathematicians and computer scientists, I know that there are plenty of people who come from a simpler molecular biology-specific background. So the project tries to cater primarily to the latter group, without trying to exclude anyone else. This also includes the focus on documentation - while the documentation is far from perfect, I try to polish it every now and then. The aim here is to make the documentation useful for "Average Joe" - the common user, at the least from a wet-lab focus on molecular biology.

So, how can the BioRoebe project be helpful to its users?

The BioRoebe project can be used to solve (some) problems related to biology, molecular biology, genetics and, last but not least, bioinformatics.

For example, say that you quickly wish to reverse-translate a sequence of amino acids, and select all possible codons, or the most likely codon candidate; or just random codon candidates, and display that result on the commandline or via a www interface.

This is easily possible through BioRoebe. How fancy and useful! \o/

For instance, when I do this on the commandline via bash and KDE konsole:

revseq AAT # Alanine-Alanine-Threonine

Then it will show the following DNA sequence:

GCCGCCACC # These are 9 nucleotides, corresponding to the three amino acids.

(This will only work if your alias of revseq points to the correct bin/ entry. In my case I have aliased revseq onto bin/deduce_most_likely_aminoacid_sequence. Of course you can use any other alias, or just flat out call bin/deduce_most_likely_aminoacid_sequence directly.)

Requiring BioRoebe and starting the bioshell interface

In order to require BioRoebe, do use a line in ruby code such as the following one:

require 'bioroebe'

To automatically include the main namespace upon require-time, the following line of code can be used:

require 'bioroebe/autoinclude'

Note that this will include into the Object namespace, so if you want to have more control over the include-action, you need to first require the bioroebe project, and then include it onto the target namespace that you wish to use specifically, such as a subclass or another module. If you do not need to autoinclude then I recommend to simply use the first variant how to require the bioroebe gem - that should suffice. The reason why the file called autoinclude.rb exists is mostly due to laziness, so we can type less. We can omit include Bioroebe after all.

The BioRoebe project comes with a file called bin/bioshell, which allows you to start this project from a typical shell, such as bash, by issuing this command on the command prompt:

bioshell

You can also load up the project and run it from within a .rb file or during an IRB session, by doing the following:

require 'bioroebe'

Bioroebe::BioShell[]

Or alternatively, which may be more convenient to type:

require 'bioroebe'

Bioroebe.start_shell # No need to use the [], unlike the example shown above

Usage of the BioRoebe project

Not all subcomponents within the BioRoebe project have received equal attention and thus, the quality of these subcomponents may often differ, to word this nicely.

Patches and contributions to extend functionality, improve the documentation, fix existing bugs or improve the usability and general quality of the project, are welcome. Take note that I in general tend to add new entries at the bottom of this file here (README.gen or README.md, respectively); use the navigation menu on the top right of this page to quickly jump to these entries. Sometimes headers change a bit, but by and large content is rarely removed; so if you ever found something in the past, you should be able to find it again in the future - except for when APIs are removed. These may be omitted in the documentation. That's quite rare, though.

I will move entries that have received updates with more recent releases to the ~bottom of this very page here - that way it should be a bit easier to keep up to date with what has changed within this project. It is admittedly becoming a fairly large project, which is why I try to keep things somewhat organized.

In the long run I will also, most likely, publish the documentation in a "booklet" format such as https://yaml.readthedocs.io/en/latest/. That way one may be more easily able to read individual subchapters.

Differences and Compatibility towards BioRuby

This subsection will explain some of the philosophical - and, more importantly, practical - differences between BioRoebe and BioRuby, as both projects have somewhat similar, hence shared goals.

One philosophical difference, for example, is that BioRoebe is less focused on bioinformatics as such, and more focused on molecular biology. The main point of view for this project comes from molecular biology and wet-lab rather than bioinformatics and maths.

First, in regards to this subsection, it ought to be noted down that I consider compatibility to be a good, useful property to have. So BioRobe will attempt to stay somewhat compatible to BioRuby whenever this is feasible - and whenever it makes sense.

In principle, if you work with BioRuby then you also should be able to conveniently map your existing scripts towards BioRoebe, in a very simple and straightforward manner, without requiring any big change to your project API-wise - if this can be avoided. (Not all of the API will be copied 1:1, unfortunately, mostly because I think that some of the BioRuby API does not make that much sense at all, such as class NA. That name tells me absolutely nothing, so it is pointless to add it to BioRoebe.)

The compatibility aspect is one smaller focus that the BioRoebe project has - to provide the core-functionality available through BioRuby, so that the two projects could, in principle, be used interchangably without any feature-loss. So 100% feature parity towards BioRuby is a goal BioRoebe has. Note that this is an "on-going" effort; BioRoebe as it currently is in 2021 is quite far away from that goal. Stay tuned for future changes in this regard, though. Eventually feature parity will be reached.

Cooperation is a a laudable goal.

However had, in particular larger projects may still require adaptations, as some incompatibilies may exist, such as no class NA in Bioroebe as mentioned above (I do not think any user should ever remember an abstract name such as NA). Workarounds in these cases will have to exist and explained - again, stay tuned in this regard.

Feedback about the BioRoebe-project will not be ignored, in order to make BioRoebe more useful for other people. For example, NA's functionality could still be made available, to some extent, through other classes and such - just point it to class Bioroebe::Sequence instead and let Sequence handle the behaviour as-is. It could keep track whether we have RNA, DNA or Protein (the latter being a sequence of aminoacids). See the various subclasses of Bioroebe::Sequence which allow exactly that, so Bioroebe::Sequence::RNA for instance, to deal with RNA sequences.

BioRoebe is quite opinionated as a whole, much more so than BioRuby, which is another difference between these two projects.

A different philosophy is at work in BioRoebe. For example, I actually think that there should be a suite, a collection of ready-made scripts, that solve a given set of problems as-is, without requiring any new code to be written. That is, the bioroebe project will attempt to provide "finished" scripts that should ideally be well documented and usable for people to make use of.

The BioRuby project follows more a bottom-up approach where it provides a core set of tools, and then lets people extend these, via e. g. https://biogems.info/, for example.

The BioRoebe project is similar in spirit to the bioruby project, but it deviates from it in that it attempts to provide convenient solutions for many of the problems faced by molecular biologists on a daily basis - the 'batteries included' philosophy of toolsets. BioRoebe, thus, is a "tool"-set project and one mandatory requirement is that these tools are useful, at the least to some extent. (Usually this refers to the classes that are part of this project, but it also includes documentation and examples.)

This also means that classes that are not sufficiently useful enough may be removed eventually.

Note that improving documentation and examples is an ongoing effort, so bear with me as I will improve these parts of BioRoebe in the long run.

Are there more differences between these two projects?

Yes! Take the different codon tables, for instance. These codon tables for different organisms are stored in different .yml files as part of the bioroebe gem (if they have been ported yet that is). The yaml file 1.yml is the standard codon table for the Eukaryotes, for example. That way other programming languages can easily tap into that .yml file and make it available as well. As far as I know the BioRuby project uses mostly a hardcoded, per-.rb file approach instead, via hardcoded hash tables.

Both of these approaches have their pros and cons - I personally much prefer yaml files, also due to portability (you can re-use the yaml file, but how can you re-use hardcoded tables in .rb files?), so I opted to go the yaml route. But if people want to use a hash instead, they can do so, too - see the API for codon tables lateron. Simply define your own constants and pass them to the appropriate methods.

Readline support in the BioRoebe project

The BioRoebe project will attempt to make use of Readline, if readline is available on the given host system. This will be only used for situations where user input is required, in particular for the interactive part of BioRoebe (the bioshell).

The readline-related functionality is, in general, handled by code stored in the following file:

bioroebe/readline/readline.rb

Normally the code contained in that file, respectively the code for the bioshell variant, will try to detect whether readline is available on startup of the bioshell; if readline is not available then the toplevel instance variable @readline_is_available will be set to false, and support for Readline is not available for the current invocation of the bioshell.

YAML support in the BioRoebe project

BioRoebe makes good use of different yaml files. For example, the canonical aminoacids (the default 20 aminoacids) are stored in several yaml files, such as bioroebe/yaml/aminoacids/amino_acids_long_name_to_one_letter.yml.

In the past the syck gem was used (Syck engine) for these yaml files, but since several years, the psych gem is used by most ruby projects (if they are still maintained that is).

In 2019 or so, the BioRoebe project transitioned to use the psych gem as well, by default. Legacy support for syck will be retained, though, which is why code exists that allows users to use syck rather than psych.

If you ever have a need to query which yaml engine is in use for the BioRoebe project, you can use this toplevel method:

Bioroebe.use_which_yaml_engine?

Personally I recommend that people should switch to psych and give it a try. It should work fine really. But ultimately this is up to them.

Phred quality score

If you need support for PHRED, you could use this method:

Bioroebe.phred_error_probability

For more information about the Phred quality score, have a look at wikipedia here:

https://en.wikipedia.org/wiki/Phred_quality_score

Stride

class Bioroebe::StrideParser can be used to parse files generated by the program called stride.

The homepage of stride can be found here:

http://webclu.bio.wzw.tum.de/stride/

Glycosylation and Glycosylation patterns

Glycosylation is the attachment of a sugar to a biomolecule, and the actual attachment of this sugar, onto a protein; at the least most usually done onto a protein.

There may be different motifs in use for Glycosylation patterns.

For example, the N-glycosylation motif may be written as NP[ST]P.

In Ruby, this should translate to this Regex:

/(N[^P][ST][^P])/

This is also used to solve one problem at Rosalind.

Unfortunately, this regex does not capture on areas that have already been matched, which may lead to problems such as when you encounter an amino sequence like NNTSY. If you look carefully then you can see that two N-glycosylation motifs are in that sequence.

So I came up with this regex instead:

/(?=(N[^P][ST][^P]))/

This appears to work and catch matches even when there already was a match in a nearby/close area. It also solves the Rosalind problem - see the table about Rosalind in this document.

If you work via class Protein then you can query whether that protein is glycosylated through:

.glycosylated?

The DEAD box motif

The DEAD box motif is so named after the four key aminoacids:

Asp-Glu-Ala-Asp == DEAD

 D - E - A - D

This motif is very often found in RNA helicases. These helicases make use of ATP hydrolysis for their functionality.

The BioRoebe project currently does not scan for this motif, but in the future this functionality may eventually be added.

NLS - Nuclear localization sequences

The Nuclear localization sequence (NLS) is required for protein nuclear import. These NLS usually consist of either one or two stretches of basic amino acids (such as Lysine or Arginine).

The subsection here mostly just tries to collect some of the NLS sequences that have been identified so far.

SV40 virus: P-P-K-K-K-R-K-V

If the second Lysine (K) is mutated to e. g. a Threonine, then the NLS import will no longer work for the SV40 virus.

Toll-like receptors

This listing is just for information since I needed to look it up every now and then.

TLR recognizes URL
TLR1 peptidoglycan https://en.wikipedia.org/wiki/TLR_1
TLR2 recognizes cell-wall components of gram-negative bacteria https://en.wikipedia.org/wiki/TLR2
TLR3 double-stranded RNA (dsRNA) https://en.wikipedia.org/wiki/TLR3
TLR4 binds cell-wall components of gram-negative bacteria (via their LPS) https://en.wikipedia.org/wiki/TLR4

Enzymes

This subsection will be expanded at a later time - it will be about enzymes in general.

For now, if you need a table, as memory, for the enzyme classes, here is one:

Enzyme class (EC) Name
1 Oxidoreductases
2 Transferases
3 Hydrolases
4 Lyases
5 Isomerases
6 Ligases
7 Translocases

See wikipedia for more information:

https://en.wikipedia.org/wiki/Enzyme_Commission_number#Top_level_codes

Using Bioroebe in a project

Of considerable usefulness to the end-user may be the BioShell.

The idea is that, for every instruction related to the Life Sciences that may "make sense", the BioShell will try to provide you with a meaningful result.

For instance, an instruction such as .to_protein on a DNA sequence object, should yield the corresponding aminoacid sequence. (In reality it may be more complicated, in that it also depends on the codon usage, the codon table, whether the protein sequence is modified post-translationally, the intermediate RNA step which may be subject to alternative splicing and so on - but in principle, the DNA-to-Protein translation is somewhat simply kept as a 1:1 mapping according to the genetic code at hand.)

Once you are inside a running Bioshell, you can do other commands such as this one here:

random # ← This will generate a random DNA sequence. Each nucleotide has the same chance to be added.

To assign a DNA sequence, do:

assign ATAGGGCTTTT

Note that since as of the year 2016, if you input a nucleotide sequence like the one above, without any other commands/words, then we will assume that you did mean to do an assignment as-is anyway. The "assign" part then becomes superfluous and can be omitted.

This is how this is simply done, by omitting the "assign" part of the above instruction altogether:

ATAGGGCTTTT

But if you want to be more explicit, you can use "assign" or "set_dna" and several other aliases. There is more way to do something in ruby - often, the best way is the shortest and simplest though.

If you want to get the complement sequence to that input sequence, do the following instruction:

complement

This can also be done via the commandline:

reverse_complement ATAGGGCTTTT # => "AAAAGCCCTAT"

(Do note that this is actually the reverse complement, that is, the output of that command will be 3'-AAAAGCCCTAT, which is the reverse complement of 5'-ATAGGGCTTTT.)

Certain helper submodules are available.

Note that you can also start the interactive shell with a display of a BLOSUM matrix. For example, to show BLOSUM80 and start the shell, do:

bioroebe --blosum80

You can also use bin/blosum_2D_table to show the BLOSUM data matrix as a 2D table on the commandline.

Example:

blosum_2D_table blosum50

Do note that you are not required to use the _, the following also works:

show blosum
show codon
show codon table

In general, you should be able to use commands without the _ token - if you encounter a method where this does not work, let me know and it will be added.

To show the codon usage of the main sequence at hand, the following command can be issued within the bioroebe shell:

codon_usage

Obviously, since organisms consist of multiple genes, this may be a bit complicated. The approach used by the *BioShell for now is to just use one "main" sequence, one main gene. Future modifications to the project may allow you to treat all genes in a given organism as a main sequence.

To query something for uniprot (Universal Protein Resource), do:

uniprot? A1A2

You can design a polylinker site in a given vector.

For this to work, simply type:

mcs

To show a small table of agarose concentrations that may be applicable in order to separate different DNA fragments, issue the following command in the interactiove bioroebe shell:

agarose

There are many more options here, not all of which are properly documented in this README.md file here.

Please have a look at the corresponding .cgi file that is in the www/ subdirectory for a more complete tutorial and documentation.

File location example:

bioroebe/www/bioroebe.cgi

If you need to colourize a nucleotide sequence or an aminoacid sequence, for display on the WWW, have a look at the directory colour_schemes/ that is part of this gem.

This directory has some colours that may be tweaked to your liking.

Highlighting colours

By default, subsequences that are important, will be highlighted in colour in the commandline variant of the BioRoebe::Shell.

The method that does so is called set_highlight_colour().

You can change the colour in use via that method. From the interactive shell, you can do this to get e. g. palegreen as a colour:

highlight palegreen

And then, display the sequence, such as via:

CpG

for showing all CpG islands in the sequence.

If you want to get rid of ALL highlighting, use either one of the following instructions:

clear highlight
clear highlighting

If you need a list of available HTML colours (the one word colours), you can do either of the following within Bioroebe::Shell:

available_colours?
show_html_colours

You can also query whether any sequence has been assigned yet, by calling:

assigned?

Different ways to show the main sequence

alt text

If you wish to display the main nucleotide sequence separated by a '|' character at every 3 nucleotides (codon), then you can use either one of the following commands:

show_codon_piped_sequence
as_pipe

Adding and removing nucleotides

You can add and remove nucleotides from the 3-prime end of a nucleotide sequence.

The following subsection shows how this is done from within the BioRoebe::Shell instance.

Add 3 nucleotides:

+3

Remove 3 nucleotides:

-3

Note that you can also use names such as add or chop in order to remove nucleotides. The above + and - are essentially just shortcuts.

Chop means to "chop off" nucleotides, in the context of the BioRoebe project, in particular the bioshell component.

The default behaviour is to chop from the right hand side (the 3' end), but you can also chop from the left hand side, such as via this way:

left_chop 3

This will remove the 3 leading nucleotides from our main DNA sequence.

You can also chop towards any substring. Normally you may want to chop towards the first start codon, which is usually AUG. A usage example follows:

chop_to AUG

This will get rid of all nucleotides before the first AUG found. If you want to chop towards AUG sequences afterwards, you can combine chop_to with left_chop. (Note that leftchop and chopto also work.)

You can add and remove nucleotides from the 3' end of a string in the *bioshell.

To add 3 nucleotides do:

+3

To remove 3 nucleotides do:

-3

Note that you can also use names such as add or chop in order to remove nucleotides. The above + and

  • just exist as shortcuts.'

Goals for the BioRoebe project

The BioRoebe project has several goals.

One is that the project tries to act as a "trainer" to teach people how to learn more about datasets used in the various life sciences disciplines in general.

This explains one focus of this project - a tutorial. Teaching other human beings how to use this project, too.

Internally, BioRoebe will use primarily Ruby, but C/C++ code could also be used, at a later time.

The BioRoebe project will also accept custom solutions provided that the code is documented and preferentially simple rather than complex. (But documentation is more important than how it is solved internally.)

The project is really an attempt to have a larger set of tools brought together eventually, for the whole ruby ecosystem.

The Bioroebe-project was first released in early 2013, so bear with me as I will slowly improve on it. As you can see from the releases, it has been steadily growing in size. And hopefully, it has also grown in regards to usability, quality and general usefulness.

It should eventually be sufficiently useful to solve many bioinformatics-related problems.

Documentation is also an important aspect for this project. Things should be explained so that other human beings can understand what is going on and why. This is why a tutorial is distributed with the project in the form of a .cgi file or an external link to a .pdf file. (The .pdf is auto-generated from the .cgi file)

Another minor goal is to clone the functionally that BioPHP offers, including its usage on the www. Most functionality is already available within BioRoebe itself - not all components are ready-to-be-used on the www, though.

The bioroebe-project was created due to several reasons. One was so that I could create (and modify) my own set of tools, in ruby. Another reason was to provide a learning platform, similar as to how the Linux from Scratch project teaches people how a Linux sysem functions from the get-go. Another reason was that I feel that the "bioruby" project is not moving forward at a ... really fast speed, to put it nicely, so think of the project here a bit as a "friendly competition". And another reason was that I was free to make any changes that I see fit, which helps me keep on moving the project forward - and which may also change/improve this project here.

Xorg buffer

On linux systems, xorg is usually used through the xorg-server, and that also means that the xorg-buffer may be available, such as via binaries like xset.

Sometimes it may be convenient to automatically assign the main DNA sequence in use in BioRoebe to the xorg-buffer, so that you can then simply press the middle mouse button to paste the xorg-buffer into some other application, such as another GUI.

In the subdirectory configuration/additionally_set_xorg_buffer.yml you can toggle this behaviour to true or false. If true then upon assignment to the main DNA sequence, the xorg-buffer is also automatically modified. This can be either useful, or annoying - I personally have it disabled since as of December 2017, but perhaps you may want to enable this for your particular use case.

I do not know if an easy way exists to do so on windows systems - if anyone knows how to do so on Windows, let me know and I'll integrate this into BioRoebe.

The name of the project - BioRoebe

The name, in particular the word "bio", insinuates that this project has something to do with "biology". Indeed, this is the case so - the project is somewhat similar to the older project called "bioruby", whose gem is called "bio".

Both projects also share some commonalities such as collecting a bioinformatics-related set of tools. Operations such as querying a genome sequence or predicting CpG islands and so forth - this will all eventually be possible with BioRoebe too.

Do also note that when the term "bio" is used, in this context of the BioRoebe project the whole "bio"-sciences are meant - biology, genetics, genomics, proteomics, bioinformatics, synthetic biology, system biology. Also chemistry, if we focus on analytical chemistry including Next Generation Sequencing.

Contributing to the project

There are several ways how to contribute to this project. This subsection will eventually be expanded; and eventually the project will be hosted on github - but for the time being, also because the quality of the project isn't really that high, collaboration has to go through email. (I may, however had, also enable contact via bioroebe itself and the mail gem, which may allow you to quickly send feedback to me even without having to use email.)

Most importantly - if you have a valid use case, and this use case is not well documented or does not properly work, you can suggest improvements here. See the "melting temperature curve analysis" subsection from a few years ago. I will focus more on the parts that either I use, or that other people use. After all, the project is a toolset project. Tools should be useful.

If there is missing functionality, and you want to extend the project in regards to this missing functionality, there there are some guidelines that you should ideally adhere to.

In no particular order, these may include:

  • All added .rb files should ideally have a frozen-string literal set to true. The primary reason for this is to try to maximize ruby's performance when possible. This is especially important when it includes memory-intensive processes. Speed is not the primary aim of ruby nor the bioroebe project, but when possible, we should try to maximize for speed (unless there are significant trade-offs, such as decreased readability; readability is usually more important than lightning speed).

  • All methods should be documented and, more importantly, there should be an overall documentation that briefly explains what the code does. This documentation does NOT have to be very long. It can be short and succinct but it should be correct. The reason why documentation is necessary is because other people may have to understand the code.

  • Ideally, the code should be simple whenever possible. This is not a strict requirement, but when several paths are possible, the one that is the simplest, or the shortest, is the one that should usually be preferred.

  • If your code deals with databases, I would like to encourage you to provide more extensive documentation. The reason is that, while databases are very useful, they can be a bit confusing to newcomers, and thus, the more there is in regards to documentation, the more useful this will be in the long run.

  • If you wish to provide sequences that are commonly used, and these sequences are not too large, then I will happily add them in the FASTA format. An example for this is the sequence for GFP. GFP is quite often used, so it made sense to integrate it into BioRoebe, but many other sequences could also be added.

  • Code should be "integratable", that is, added into the BioRoebe project. If you need some core functionality that you do not find available within BioRoebe itself, I am willing to add it if the overall task is something related to bioinformatics, molecular biology and so forth.

Other than that, code will be easily integrated. Tests are not necessary but I need to be able to verify that it works at the least once.

Electron microscopy

In particular for virology, but also for some applications of biotechnology, electron microscopes and cryo-electron microscopes are employed.

The subsection here is reserved to deal with these issues.

For the time being (February 2018), this is just a stub though. It will be extended slowly.

A few classes will be briefly mentioned here.

class Bioroebe::ElectronMicroscopy::SimpleStarFileGenerator can generate .star files, from .mrc files, which are used by the software Relion.

Bioroebe::PositionSpecificScoringMatrix (PSSM)

class Bioroebe::CalculateThePositionSpecificScoringMatrix can be used to calculate the PSSM. This is currently not very advanced though.

Currently the matrices and the input test-strings (such as AAA, AAT, ATG and so forth) are hardcoded into that class.

Phylogenetic Trees

This subsection may be extended at a later time.

For now, keep in mind that trees can be conveniently represented as strings.

Aliases

The Bioroebe project makes use of aliases.

For example, class Sequence is also available as Seq. While I personally prefer the longer name, others may prefer Seq, as name, since it is easier to type.

Aliases also allow us to use different APIs.

For example, in biopython, instantiating a new sequence object goes like this usually:

from Bio.Seq import Seq
seq = Seq("AGTACACTGGT")

This is also possible in Bioroebe:

require 'bioroebe/autoinclude'
seq = Seq("AGTACACTGGT")
# or: seq = Seq "AGTACACTGGT"

Or, an alternative to the above that is still very similar:

require 'bioroebe'    
Seq = Bioroebe
seq = Seq('AAAACCCGGT')

Of course the rest of the API is quite different, but this is just a demo anyway. The differences aren't that huge, in my opinion. The idea is to make jumping from another toolkit right into bioroebe as simple as possible. Personally I use Bioroebe.reverse_complement() only - helps my brain remember APIs too.

Comparing two alignments/strings and handling alignments

If you want to compare two protein sequences, via an alignment, then typically the following symbols (special characters) are used:

'|' denotes an exact match
':' denotes a conservative match
'.' denotes a non-conservative match

In the BioShell, you can simply compare two strings, such as in this way:

compare TCGCACTAT TCGCACTAG

This will then show which nucleotide positions match and which ones do not match.

Philosophical considerations

Bioinformatics as a field is, in my opinion, presently mostly influenced by mathematics and informatics. I assume this to be largely due to the origin of the field.

There is also computational biology, some of which originated from protein structure determination. Last but not least, there is the field of chemoinformatics, which originated from chemistry - including datasets from mass spectrometry and so forth.

When I first heard about the term bioinformatics, I assumed that it was to be able to generate organisms willy-nilly. That is, you put in some random sequence, and out comes a happy living thingy (if the sequence is viable). You can imagine my surprise when I then realized that the field isn't quite as much ... fun. Unless you perhaps love playing with numbers - but isn't it cooler to know that information stored somewhere is indeed biologically viable?

This impression has continued to this day for me. I think that the current way that bioinformatics is done, is too centric to mathematics, statistics and in general fields outside of molecular biology and genetic, leading to a lot more boredom than there should be. If you look at the current field of synthetic biology then this is, again in my opinion, a lot more exciting and interesting. So why not combine it with these fields?

So, yes - ideally if that would be possible, I'd love to make the project here useful for the traditional bioinformatics field BUT also for the more modern topics. The project here should be fun rather than as dull as "regular" bioinformatics.

Suffix Arrays (for Trees)

A suffix array, introduced in the year 1990, is a sorted array of all suffixes of a string. Do note that suffix arrays are an efficient alternative to suffix trees.

Let's use the string "banana" which we wish to index. The character $ denotes the end of the string.

Which index Sequence position
index 1
S[index] b

The text banana thus has the following suffixes:

Suffix index
banana$ 1
anana$ 2
nana$ 3
ana$ 4
na$ 5
a$ 6
$ 7

Hidden Markov Models (HMMs)

This subsection deals with some aspects of HMMs.

Why are HMMs useful in biology? They can be used to represent protein families, for example (via pHMMs - profile hidden markov models).

Furthermore, they can show some bias in the mutation rate that can be observed. Different genomes are known to have different hotspots where mutations are more likely to happen, for various reasons. These are examples where a HMM may be useful.

HMMs are usually based on the Shannon model where you assign different probabilities to "change" events. An example that was mentioned back in 1948 was the english alphabet - some letters, and combinations of letters, are more commonly seen. Shannon gave the example of "E" versus "W", as shown in the following graph (a finite state graph):

Here you can see that E is more likely (0.12) to occur than W (0.02). This is similar to the k-mer value used in bioinformatics. Linguistics uses a similar model, called N-grams. In bioinformatics, given a DNA sequence, a 10-mer would be equivalent to 10 base pairs.

The individual transition states are based on an assumption of "randomness", but ensuring that these are truly random is not necessarily trivial. Computers do not really 'generate' true randomness, at the least not when they are working solo, "on their own". You can even 'predict' some randomness here or there via various techniques - see vulnerabilities such as Specter or similar variants where software can read from areas of the memory that should be inaccessible to them. Some of this is based on co-predictions. For distributed computers, you may often use random noise or decay of atoms as 'a source of randomness'. For any DNA nucleotide sequence, we would assume that each base pair has a 25% chance to exist at any given position, but this is not necessarily true, again for various reasons.

An interesting thought is ... why is ATP so important? Yes, of course due to it being 'the energy currency in a cell' but .. why is this ATP, aka adenine? Why not GTP, aka guanine or any of the other two nucleotides? (GTP is used too, but why? Why not CTP and TTP?) I can not answer this question; there may be many reasons, including differential chemical storage power as well as mere random chance event in evolution, but for whatever the reason, you will not find a complete 25% percentage value for every given "slot" in DNA, depending on the organism.

From a practical point of view, how can we approach Hidden Markov Models and use them?

Let's take the following simple sequence:

ACGTACGC

From this sequence we can see that the 3-mer "ACG" is followed by either a T, or a C. Have a look at the sequence again to see if you can identify the two ACG subsequences there. You can see one at the start, and the other one following a bit later, hence why we come to the conclusion that either a T or a C will follow this 3-mer.

The probability of either T or C to occur on that position, thus, is 0.5 (50%); for A and G to follow there is 0% so the latter two can be ignored.

Thus, we could use a ruby Hash as follows that should describe these probabilities:

probabilities = {'T': 0.5, 'C': 0.5} # ignoring A and G here, but we could denote them via 0 as well 

Now we must 'weigh' (connect) these probabilities to the "ACG" substring. Example:

weighted_probability = {'ACG': {'T': 0.5, 'C': 0.5}}

The weight of some common proteins and the molecular weight of proteins

There is a little helper-method in the shell component.

It can be invoked via either of these calls:

weight_of_common_proteins?
weight_of_common_proteins

This will just show a brief overview over the weight of some common proteins, such as IgG or IgM or RUBISCO and so forth.

This list may be extended in the future, but do not expect it to ever be all-inclusive. It is merely meant to just give a quick help on the commandline.

If you need to calculate the exact weight of a protein, that is, of each of its constituent aminoacid, then you can use the following command from within the shell:

protein_weight?
protein_weight

Note that this presently does NOT deduct the weight of water molecules. I checked against most online molecular weight calculators and they also do not deduct water from their calculations. Strictly speaking, since the individual aminoacids (save for the first and last one) are connecting to two partner-aminoacids through a peptide bond, which was formed when water was lost, we would have to deduct the weight of water as well).

The partner nucleotide

The "partner" of A is T, and vice versa; and the partner of C is G.

A small bin/ file exists that shows this on the commandline too:

partner_nucleotide G
partner_nucleotide A

The reason why that was added was mostly because we often need to find out what the corresponding partner nucleotide is - so I decided to also put this into a standalone file as well.

This is not too terribly useful on its own, but I wanted to make the functionality at the least available.

Fitch's algorithm

The Fitch algorithm, devised in 1970, is also called Maximum Parsimony.

The most common application - or at the least a very common one - in bioinformatics is to use the Fitch algorithm in order to reconstruct an evolutionary (== phylogenetic) tree. And to find the most likely tree that gave rise to the sequences/organisms.

Each tree may have nodes, that is, branching to occur at that particular node. This represents the idea that, at this moment in time, the two sequences have diverged from one another.

See this picture:

http://www.cs.ubc.ca/labs/beta/Courses/CPSC536A-01/Class10/tree2.png

Tree

The parsimony score is the sum of Hamming distances along each edge.

Parsimony assumes that substitutions are rare and that back-mutations do not occur.

Displaying the main sequence with delimiter characters

From within the bioshell, you can use some alternative ways to display the main DNA sequence.

To use the default display, simply type:

show?

If you want to separate this via the ' ' character, do this:

split

If you want to separate via '-' character, do this:

spacer

The method show_sequence_in_splitted_form is taking care of this display style.

Exponential growth analysis of bacteria in controlled environments (e. g. fermentation)

This subsection is about growth "analysis" of bacteria in any controlled environment, such as during fermentation - but it also provides a very theoretical explanation of how bacteria grow.

In principle, from one bacterium two bacteria are "generated", during regular growth. This is binary fission and we can easily model this with simple mathematics.

2 ** 1 # => 2
2 ** 2 # => 4
2 ** 3 # => 8
2 ** 4 # => 16

and so forth. From one bacterium we will end up with two bacteria; from two we will end up with four; from four we will then have eight, then sixteen and so forth.

Perhaps you can see that this is an exponential growth series. From a small number of bacteria, provided that the nutrient supply is high and no harsh conditions are abound (temperature, pH value, phages etc...), bacteria will show a close-to-the-above growth rate.

So, how many individual cells may we have after 100 steps? The general formula for this is:

Nt = (N0) * (2 ** n)

or without the first parenthesis:

Nt = N0 * (2 ** n)

Where:

N0 is the initial number of cells
Nt is the total amount of individual cells, at time t
The exponent **n** is the amount of division steps

Let's look at two different specific examples next.

Starting with 10 cells and 10 generations, we will have:

Nt = (10) * (2 ** 10) = 10240

This is in ruby-code notation, so 2 ** 10 is two to the power of 10.

So, we will have 10240 cells after 10 generations, starting with only 10 cells. Quite amazing really, how so few cells can lead up to so many new cells.

If we want to find out how many cells we have had, when we reach a number of 1600 cells after 5 generations, we can use this formula:

N0 = 1600 / (2 ** 5) = 50

Both formulas are supported by the Bioroebe project, just to make your life a little easier. These two methods are described next.

First, we have a method to calculate the exponential growth of bacteria, aptly called calculate_exponential_growth():

Bioroebe.calculate_exponential_growth()

This method accepts two arguments - the first one is the amount of cells that we had at the starting time. This can be as low as 1 cell (obviously, if we have 0 cells or "less" than zero cells, we can not grow anything out of ... nothing). The second argument is the number of generations, aka how many times the cells will have divided. From a first division of a single cell, we will have two cells. If these two cells each divide, we will have four cells. If these four cells divide, we will have eight cells, and so forth.

In the example given above, we had 10 cells and 10 generations, so we use 10, 10 as arguments:

Bioroebe.calculate_exponential_growth(10, 10) # => 10240

You can also pass a hash instead, such as through this way:

Bioroebe.calculate_exponential_growth(number_of_cells: 10, n_divisions: 10) # => 10240

This allows for a little bit more extra flexibility, in particular if you can not remember whether the amount of cells comes first or second. You can also use other key-names for the hash, such as:

Bioroebe.calculate_exponential_growth(n_cells: 10, n_divisions: 10) # => 10240

You can also do the "reverse" operation, that is to determine how many cells have been there initially, to begin with. In order to do so, consider using the following method:

Bioroebe.calculate_original_amount_of_cells_of_exponential_growth()
Bioroebe.calculate_original_amount_of_cells_of_exponential_growth(1600, 5) # => 50

You can also use a simple calculator to check on these values. :)

The code for this resides in the file bioroebe/toplevel_methods/calculate_exponential_growth.rb.

I am open to adding aliases to the above method names, if you can think of additional names for these methods.

Note that a simple gtk2-widget also exists to at the least the method Bioroebe.calculate_exponential_growth(). The code for this resides under bioroebe/gui/gtk2/calculate_cell_numbers_of_bacteria.rb. (Since as of August 2021, only gtk3 code remains. Have a look at the gtk3 parts.)

C and C++ code in the BioRoebe project

There are plans to add C respectively C++ code to the Bioroebe project, perhaps in the form of mruby bindings as well. This is very limited as of now, mostly because I don't quite know which parts could really benefit from C++ code. After all not even the core of Bioroebe, written in ruby, is finished - there are so many missing parts as of yet. The future will show.

What I would, however had, like to have available within the BioRoebe project as-is, is this:

  • Clone the functionality of the Emboss tools; first into ruby, then the relevant parts into C++ and/or Java.
  • Implement all the commonly used Bioinformatics algorithms in C++, including the corresponding ruby bindings (but provide ruby bindings only for platforms that can not make use of C++). Algorithms such as binary (tree) search, Nussinov algorithm and so forth.

RGG Boxes (RGG motif)

Motifs rich in arginines and glycines were recognized to play functional roles and were termed glycine-arginine-rich (GAR) domains* or **RGG boxes. Some authors als call it the RGG/RG motif.

More than 1000 human proteins contain the RGG/RG motif. These proteins are usually involved in diverse processes such as transcription, pre-mRNA splicing or mRNA translation.

Let's have a look at some of these sequences that contain a RGG motif.

Lischwe and colleagues reported in 1985 (Lischwe et al., 1985) a sequence that contained a RGG motif. That sequence was, at the C terminus of nucleolin, this one:

RGGGFGG RGGFGDRGGRGGGRGG R_GGGFGG R_GGFGDR_GGR_GGGR_GG

That sequence contained dimethylated arginines. (In the second sequence, the _ that comes AFTER the Arginin, denotes this dimethylated arginine.)

This provided the first clue that these are substrate recognition sites for protein arginine methyltransferases (PRMTs).

Arginines nested within GRG and RGG sequences are preferred sites for certain PRMTs (Boisvert et al., 2005c).

These methylated arginine motifs are often called glycine-arginine-rich (GAR) regions.

class RGG_Scanner in Bioroebe will detect such motifs - at the least attempt to detect such motifs.

Modifying the default DNA sequence

You can modify the default DNA sequence assigned via various ways, in the bioroebe-shell instance.

Two such ways, which are ultimately almost identical, are described next:

[5,0] = GAATTC
8, 15 = ''

The above two lines are used to re-assign to the main DNA sequence in use.

The first variant means to begin at the nucleotide at ** position 5*, and then simply **insert* a new sequence there (GAATTC, which is also known as the EcoRI restriction enzyme).

So, in other words, this will add the restriction site for an restriction enzyme (restriction endonuclease) at position 5.

The second variant will delete the nucleotides 8 up to (and including)

  1. (Note that in the Bioroebe project, we start at position 1 for sequences; in Ruby, arrays start at position 0. Keep this in mind.)

If the syntax above via the = assignment operator is looking awkward to you, this is mostly just regular ruby syntax on Array and String objects, except that in this case, we do not specify a leading object to apply this method call on. In ruby this may typically be:

object[5,0] = 'GAATTC'
object.send(:=) # Or a similar variant.

so, syntax-wise, it is not a huge difference.

BioRoebe will assume that the user meant the main DNA sequence at hand.

Note that in order for this to work, you should first have assigned some sequence. If you don't want to think of a specific sequence, simply use "random" before making use of the above syntax.

Next comes a short table that specifies the rules under which enzymes, such as trypsin, cleave:

Enzyme name Rule of cleaving
Trypsin cleaves after K and R, except when followed by P.

The above entry for Trypsin thus means that Trypsin will always cleaves proteins after a lysine or an arginine.

You can simulate this via the following API:

sequence_goes_in_here= 'TACGACCGCASAGAC'
Bioroebe.cleave_with_trypsin(sequence_goes_in_here)
Bioroebe.cleave :with_trypsin, sequence_goes_in_here

Currently (July 2021) only support for Trypsin is included, but in the long run the goal is to add as many digestive (peptide-bond cleaving) enzymes here as possible.

MUMmer

This subsection has only a tiny bit of information about MUMmer.

To compare two sequences, you can issue a command such as:

mummer -mum -b -c NC_014649.1.fasta JF801956.1.fasta   

Converting the three aminoacid letter code to one aminoacid letter code and vice versa

If you wish to quickly convert the three amino acid letter code to the one amino acid letter code then you can use the following command in the bioroebe-shell:

three_to_one Thr Thr Glu Ala Val Glu Ser Thr Val Ala Thr Leu Glu Asp Ser # => TTEAVESTVATLEDS
3to1 Thr Thr Glu Ala Val Glu Ser Thr Val Ala Thr Leu Glu Asp Ser # => TTEAVESTVATLEDS

So the command is three_to_one (you can omit the _ or use 3-1 or 3to1, too), followed by the three-letter aminoacid sequence that you wish to convert into the single letter code.

I personally find the single letter code to be a lot more useful, but I know quite a few who do happen to prefer the three-letter notation, as it may be a bit easier to know which amino acid is meant (not for me, though - I prefer the one letter code all around; I think it is much simpler than the three letter code as well).

Since as of February 2021, there is also a small ruby-gtk3 widget available that allows this functionality via a GUI interface. You can start it, among other ways, via:

three_to_one --gui

This looks like so (in August 2021):

If you want to do the reverse, aka translate from the one amino acid code letter code to the three amino acid letter code, then you can use the following syntax in the bioshell:

1to3 KRKAKAGAGAUUGAUGAAGCCACA # Lys-Arg-Lys-Ala-Lys-Ala-Gly-Ala-Gly-Ala-Sec-Sec-Gly-Ala-Sec-Gly-Ala-Ala-Gly-Cys-Cys-Ala-Cys-Ala

As you can see the latter has the three amino acid code now.

The CRISPR System

This subsection contains some background information about the CRISPR system.

First, if you need to design a target for CRISPR then you can use crispor:

http://crispor.tefor.net/

All these sequences that are designed that way require a PAM - a Protospacer Adjacent Motif.

PAM matrices

PAM250 corresponds to 20% amino acid identity. This in turn represents 250 mutations per 100 residues.

PAM matrices with lower serial numbers are more suitable for aligning more closely related sequences.

Translate a DNA-strand into the complementary sequence

If you have assigned a DNA sequence in the bioshell, you can find the complementary sequence by issuing:

dna_translate
dnatranslate

For example, if you assign the main sequence to be:

ATGGGAGTCCGAAGTTGC

Then running the above command (dna_translate) from the bioshell would yield the following sequence:

TACCCTCAGGCTTCAACG

This is the complementary DNA strand to the first sequence, thus dsDNA:

ATGGGAGTCCGAAGTTGC
TACCCTCAGGCTTCAACG

All A pair with T; all T pair with A; all G pair with C; all C pair with G.

chop and restore

Chop in the context of the BioRoebe project, means to "chop off" nucleotides from a sequence. Restore, in the context of the BioRoebe project, means to "restore" the chopped-off nucleotides again - in other words, restore will revert the chop-action.

If you think about the chopping action, in a linear DNA or RNA sequence, we can chop off from either side, left or right, aka 5' or 3':

5'-ATGGGCGGTTAG-3'

The default action for the BioRoebe project is to chop from the right hand side (the 3' end), but you can also chop from the left hand side (the 5' end) if you have to.

First, let's see examples for regular chop-off action from the 3' end (the "right-hand side" end):

chop 3       # ← chop off the three last nucleotides, aka from the **3' end**

Next, show how to chop away from the leading 5' end:

lchop 3      # ← chop off the three first nucleotides, aka from the **5' end**
left_chop 3  # ← same as above
chop_first 3 # ← This would also work just fine.

This will remove the 3 leading nucleotides from our main DNA sequence.

You can also chop towards any substring. Normally you may want to chop towards the first start codon, which is usually AUG. A usage example follows:

chop_to AUG

This will get rid of all nucleotides before the first AUG found. If you want to chop towards AUG sequences afterwards, you can combine chop_to with left_chop. (Note that leftchop and chopto also work.)

After you have used a chop-operation you can also restore this operation again, via restore. This presently (December 2018) is limited in that it will only append to the right hand side (3'), so it is not a 100% restore. I may consider adding more code for this behaviour, but since most chop-operations that I am doing are on the right-hand side, I don't really need to use restore in a more sophisticated manner than that.

You can also add random nucleotides to the left side (5' P group) of a DNA sequence via:

left_add 3
ladd 9

Comparing two sequences for equality

Within the BioRoebe shell you can quickly compare two sequences for equality, such as in this way:

ATCGATCGATCG == ATCGATCG

This will print false. Of course you can do so within irb too, or in a .rb script - I only added this to the bioshell due to convenience.

Gene regulatory proteins and DNA sequences that is recognized by them

Note that the following table is not exhaustive; and it may not be accurate, so use at your own risk. It is only listed here to give a few examples about the sequences that are recognized by some proteins able to bind to dsDNA.

Name of the protein Sequence
Lac repressor coli 5'AATTGTGAGCGGATAACAATT
CAP 5'TGTGAGTTAGCTCACT
Lambda repressor 5'TATCACCGCCAGAGGT

RNA-Editing

Deaminase are the enzymes that can perform RNA Editing. For example, a sequence such as CAA may become UAA (for C → U Editing).

N50

N50 is used for genome assembly.

In any sequencing for a genome there may be reads (and contigs) of different length. Lots of overly short sequences are not extremely useful, though - that is why very short contigs may have to be removed from an assembly.

Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N.

How to calculate N50?

Take these 10 contigs of the following length (in descending order):

91 77 70 69 62 56 45 29 16 4

The sum of these contigs is 519bp, so the sum of all contigs equal to or greater than N50 must be equal to or greater than 519/2 or 259.5.

We can see by a brute force approach alone that:

91+77=168
91+77+70=238
91+77+70+69=307 (that'll do as it is larger than 259.5)

so the N50 for this assembly is 69bp.

Take another example:

1989+1934+1841+1785+1737+1649+1361+ 926+ 848+ 723

The sum is 14793, so half of that is 7396.5.

We can calculate this, first three:

1989+1934+1841      = 5764 (is smaller than 7396.5)
1989+1934+1841+1785 = 7549 (is larger than  7396.5)

So our main contig is 1785.

Another way to look at this:

at least half the nucleotides in this assembly belong
to contigs of size 69bp or longer.

The EMBL format

Here is an example for this format type:

https://www.ebi.ac.uk/ena/browser/api/embl/X81322.1?download=true

ViennaRNA

ViennaRNA is a package that excels at RNA-related bioinformatics.

You can read up more about this project at:

https://www.tbi.univie.ac.at/RNA/tutorial/

Here, this subsection refers to how the BioRoebe may interact with ViennaRNA (or may not; this section is a stub).

You can also see what "science"-related addons exist in general, which the BioRoebe project may then make use of, via:

science_addons?
addons?

Download remote dataset from the commandline

bin/bioshell can be used to download remote dataset from the commandline.

Example:

bioroebe --download=ftp://ftp.ensembl.org/pub/release-92/gtf/mus_musculus/

Disabling Opn.opn()

Opn.opn() is used in some classes in the BioRoebe project to denote which class is actively responsible for the given output on the terminal. Not everyone may want to have this functionality, so it is possible to disable this via:

bioroebe --disable-opn

YAML files distributed in the BioRoebe project - psych versus syck

Since as of January 2019, the .yml files that are distributed within the BioRoebe project are conforming to psych. Before that, syck had to be used, but in January 2019 the dependency on syck has finally been eliminated.

If, for whatever reason, you may wish to use syck, you have to first install the try_syck gem, and also compile/install syck.

Then you can start the bioshell via either of:

bioshell --try_syck
bioshell --syck

This exists mostly just for legacy reasons. Psych is the new yaml engine and is the preferred one, too, as far as the BioRoebe project is concerned.

Antibody Engineering

In order to create a scFv fragment, the linker is typically made up of the aminoacids Glycine and Serine, e. g. following this formula:

(GGGGS)3

This information may be useful if you wish to model a 3D structure.

The full amino acid sequence may then be:

MADYADAVVTQESALTTSPGETVTLTCRSSTGAVTTSNYASWVQEKPDHLTGLIGGTNN
RAPGVPARFSGSLIGDKAALTITGAQTEDEAIYFCALWYSNHWVFGGGTKLTVL
GGGGGSGGGGSGGGGSGGGGSDVQLQESGPGLVAPSQSLSITCTVSGFSLTDYGVN
WVRQSPGKGLEWLGVIWGDGITDYNSALKSRLSVTKDNSKSQVFLKMNSLQSGDSARYYC
VTGLFDYWGQGTTLTVSSASGADHHHHHH

(Note the third line being the linker, e. g. very Glycine-rich.)

For a typical scFv fragment, the order for its construction is as-follows:

VL - Linker - VH - His-Tag

So the Linker sits in the middle; the His-tag is for later purification. After separate PCR-amplication steps for VH and VL, an assembly-PCR is typically used in order to mix together both fragments.

Random residues can be added in the CDRs, via in-vitro mutagenesis, for the generation of new binding sites.

Famous sequences in molecular biology

The following table lists a few famous sequences

Name of the sequence Found in this organism Sequence
loxP P1 phage ATAACTTCGTATAATGTATGCTATACGAAGTTAT

Note that loxP is separated into three sub-sequences, called:

left inverted repeat sequence:

ATAACTTCGTATA

Spacer (5' → 3'):

ATGTATGC

Right inverted repeat sequence:

TATACGAAGTTAT

See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1479339/.

Phage receptors

This table shall only show some commonly used phage-receptors by different phages. It is not meant to be all-inclusive; only showing some as example.

Host bacterium Phage Used Receptor
Escherichia coli K3 OmpA
Escherichia coli T2 OmpF
Escherichia coli T1, T5 TonA
--------------------------- ---------- --------------------------------------------------------------------
Salmonella typhimurium PH42 36k porin
Salmonella typhimurium PH51 34k porin
Salmonella typhimurium ES18 SidK (siderophore uptake)
--------------------------- ---------- --------------------------------------------------------------------
Bacillus subtilis SP50 Teichoic acid+pepdidoglycan
--------------------------- ---------- --------------------------------------------------------------------
Streptococcus pneumoniae Dp-1 Phosphorylcholine component of wall teichoic acids
--------------------------- ---------- --------------------------------------------------------------------
Lactobacillus casei J-1 Rhamnose and GlcNAc containing extracellular polysaccharide (sugars alone can compete)

Post-Translational modification of proteins/aminoacids

15 of the 20 aminoacids are modified post-translationally in one way or another. The 5 aminoacids that are not modified are:

(1) Valine        (V)
(2) Alanine       (A)
(3) Leucine       (L)
(4) Isoleucin     (I)
(5) Phenylalanine (F)

Mnemonic: (no) V-A-L-I-F PTMs ("no valid PTMs"; valif sounds a bit similar to valid).

How many genes do different organisms have?

This subsection shows a table showing how many different genes different organisms have - as a total number.

Estimating this number accurately is difficult, for various reasons. Obviously you need to know the full genome sequence, but you also need to define what a gene is, exactly. A good definition is that a gene is any part of DNA sequence that may give rise to some kind of RNA. Traditionally, the primary definition used to be that a gene will code for a protein - but alternative splicing exists, thus giving rise to multiple proteins for, at the least, some genes.

For the following table, we will focus on protein-coding genes - this may not necessarily include all genes, per se, but it is a simple definition - so let's show the table next without further ado:

# Organism n protein-coding genes in its genome Sequence published in this year Resources / URLs
1 Mycoplasma genitalum 500 https://en.wikipedia.org/wiki/Mycoplasma_genitalium
2 Helicobacter pylori 1500 https://en.wikipedia.org/wiki/Helicobacter_pylori
3 Methanococcus jannaschii 1800 https://en.wikipedia.org/wiki/Methanocaldococcus_jannaschii
4 Escherichia coli (K-12 derivative MG1655) 4288 1997 https://en.wikipedia.org/wiki/Escherichia_coli
5 Saccharomyces cerevisiae 6275 https://en.wikipedia.org/wiki/Saccharomyces_cerevisiae

OMIM Diseases

The following table shows a few selected OMIM-registered diseases (OMIM is an abbreviation for Online Mendelian Inheritance in Man).

OMIM ID URL Description
105400 https://omim.org/clinicalSynopsis/105400 ALS (amyotrophic lateral sclerosis) is a disease that causes the death of neurons controlling voluntary muscles (thus, movement/locomotion).
143100 https://omim.org/clinicalSynopsis/143100 The huntingtin disease is caused by a trinucleotide repeat expansion (CAG)n in the huntingtin gene (HTT, 613004.0001)
219700 https://omim.org/clinicalSynopsis/219700 Cystic fibrosis is a life-limiting autosomal recessive disorder that affects 70,000 individuals worldwide.

Return the position of subsequences within a larger sequence

Consider the following DNA sequence:

ACGTACGTGACG

Say that you wish to determine where the following subsequence can be found within that main DNA sequence:

GTA

This would return an Array that has these entries:

[3, 7]

The following Bioroebe toplevel method does this:

Bioroebe.return_array_of_sequence_matches("ACGTACGTAACG", "GTA") # => [3, 7]

Convert DNA/RNA into the corresponding aminoacid sequence

You can use this API to convert the given DNA or RNA strand at hand into the corresponding aminoacid sequence:

Bioroebe.to_aminoacids
Bioroebe.to_aminoacids(pass_in_your_dna_sequence_here)

This presently (August 2018) does not make use of the codon table, but in the future it will. (The Bioshell already handles this correctly; I just have not yet synced and unified the code.)

If you are using the BioShell then you can also quickly convert the main sequence into the corresponding aminoacid sequence, via:

raw_aa

Digestions of DNA polymers and aminoacid polymers

It is possible to simulate a digestion. Currently (Mar 2018) we only simulate the digestion of DNA polymerase, but at a later time, we may also add support for digestions performed by proteins (e. g. trypsin digestion, which is used in mass spectrometry analysis; protein digestion will usually go towards the aminoacid sequence of proteins). Support for trypsin-digestion was added in September 2019.

The method that does digest on DNA is called restriction_enzyme_digest, within the Bioroebe::Shell namespace - so it is a method available on the bioroebe-shell component.

Let's simulate this.

First, fire up a new instance of the bioroebe-shell, then issue a command such as this:

random 2000
digest_at EcoRI

This will run a simulation of an EcoRI-digestion. You can either put in the name, or the target sequence - both should be supported.

If in doubt, use the sequence, though, as input. Less chance for errors that way.

On the commandline, output is generated that will show which fragments are generated.

In September 2019 a new subdirectory was added, called digestion/.

An example class was added, called:

Bioroebe::Trypsin.new

An executable exists for this too, called trypsin_digest.

Simply pass to this class the .fasta file or sequence that you wish to analyze. The result (aka the output) will be an Array that denotes the nucleotides at which trypsin will cut after (and, actually, the commandline executable will display more information too).

Example - if the file is foobar.fasta, simply do:

trypsin_digest foobar.fasta

Do note that if you input a FASTA file then the current behaviour is that only the first entry will be picked.

If you need a slightly more advanced script, usable on the www, consider using:

https://web.expasy.org/cgi-bin/peptide_mass/peptide-mass.pl

Accession numbers (at NCBI and elsewhere)

This subsection will only list a few species and sequences. Do not expect this to be all-inclusive; it will be biased to my own focus. I will, however had, eventually also add most of the model organisms.

Accession number Organism n base pairs (bp) URL
NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome 4_641_652 https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3
NC_004354.4 Drosophila melanogaster chromosome X https://www.ncbi.nlm.nih.gov/nuccore/NC_004354.4
NC_020104.1 Acanthamoeba polyphaga moumouvirus, complete genome 1_021_348 https://www.ncbi.nlm.nih.gov/nuccore/NC_020104.1
NC_0146491.1 Mimivirus (Acanthamoeba polyphaga mimivirus) 1_181_549 https://www.ncbi.nlm.nih.gov/nuccore/NC_014649.1
JF801956.1 Acanthamoeba castellanii mamavirus 1_191_693 https://www.ncbi.nlm.nih.gov/nuccore/JF801956.1
AP017645.1 Acanthamoeba castellanii mimivirus DNA, strain: Mimivirus shirakomae 1_182_849 https://www.ncbi.nlm.nih.gov/nuccore/AP017645.1
NC_016072.1 Megavirus chiliensis, complete genome 1_259_197 https://www.ncbi.nlm.nih.gov/nuccore/NC_016072.1
NC_021858.1 Pandoravirus dulcis, complete genome 1_908_524 https://www.ncbi.nlm.nih.gov/nuccore/NC_021858.1
NC_022098.1 Pandoravirus salinus, complete genome 2_473_870 https://www.ncbi.nlm.nih.gov/nuccore/NC_022098.1

Human mRNA and proteins of important or interesting genes:

Accession number Organism n base pairs (bp) FASTA sequence URL
NM_000207.3 Homo sapiens insulin (INS), transcript variant 1, mRNA 465 https://www.ncbi.nlm.nih.gov/nuccore/NM_000207.3?report=fasta https://www.ncbi.nlm.nih.gov/nuccore/NM_000207.3
NM_173080.3 Homo sapiens small proline rich protein 4 (SPRR4) 751 https://www.ncbi.nlm.nih.gov/nuccore/NM_173080.3?report=fasta https://www.ncbi.nlm.nih.gov/nuccore/NM_173080.3

Sometimes you may find additional information, such as NC_000009.12: g.114195977G > C. Here the letter g refers to a genomic reference sequence. On the other hand, NM_032888 (COL27A1): c.2089G > C, the c refers to a coding DNA reference sequence. Weird naming schematics.

Quickly showing a DNA sequence and its complementary (partner) DNA strand

Say that you have the sequence GAATTC and you wish to quickly show, on the commandline, the complementary DNA strand to this sequence.

You can use bin/show_this_dna_sequence for that, as in:

show_this_dna_sequence GAATTC

This will show the DNA sequence in colours and with assumed 5' and 3' parts, and a bit of padding and spacing.

This subsection lists some alternative toolkits in different programming languages, in a table format. This is purely for information and not a full listing by any means.

Number URL Programming Language
1 https://rubygems.org/gems/bio Ruby
2 https://biopython.org/ Python
3 https://rust-bio.github.io/ Rust
4 https://bioperl.org/ Perl
5 http://www.biophp.org/ PHP
6 https://github.com/brentp/hts-nim NIM
7 https://biojava.org/wiki/BioJava%3AGetStarted Java
8 https://github.com/BioHaskell Haskell
9 https://github.com/biojs BioJS
10 http://biocaml.org/ OCaml

Displaying a subsequence

You can display a subsequence, e. g. the four nucleotides from position 22 to (including) position 25 (22,23,24,25):

22..25

Some alternatives exist to the above syntax:

22-25
22.25
22,25

Using a - may be simplest, though. The other variants exist mostly due to convenience (and .. typically refers to a Range in ruby). The last option, for ,, has been added because some users may prefer the use of , over the use of ..

Note that this can also be used for copy-pasting remote numbers from websites into the Bioroebe shell.

Example for this:

1,500-1,750

This will also include the nucleotide at position 1500, the starting one; so the above command will display 251 nucleotides in the above scenario, e. g. ending at nucleotide position 1750 (and including that one as well).

You can also include spaces, such as if you copy/paste from the www:

22     - 33

This is all handled by a regex, which you can look at as part of the bioroebe-shell subcomponent.

Note that this currently only works via the interactive shell.

Note that you can do so on aminoacids as well, via the Bioroebe shell. You have to prefix two a, aka *aa.

Usage examples:

aa22..44
aa22 .. 44

The Hamming Distance

The hamming distance is the amount of different characters between two target sequences (strings) that have the same overall length.

Let's look at the following two sequences:

ATCG
ATCC

If you look at these two sequences then you may quickly notice that the last character is different; the other three characters are the same.

You can calculate the hamming distance between two Strings via class HammingDistance. Simply input a String to it and use a | character to split that String - the part before the | character will become the first string; and the past after the | character will become the second string.

Since as of 20th August 2018 there exists a second variant, a toplevel module-level method called Bioroebe.hamming_distance().

The code for this resides at bioroebe/toplevel_methods/hamming_distance.rb and the API of the method is simple: put two strings into this method of equal length, and an integer will be returned which tells us the hamming distance.

Via the commandline you can calculate the Hamming distance by doing:

hamming ATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTTTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAA ATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCTTATTATTATTATTATTATTATTATTATTATTATTATTTTATTATTATTATTATTATTATTATTATTATTATTATTATTATTATTA

The above uses two sequences, separated by a " " space character.

If you wish to calculate the hamming distance between two strings from within ruby, as pointed out before, you could use the following code from BioRoebe:

require 'bioroebe/toplevel_methods/hamming_distance.rb'

Bioroebe.hamming_distance('ATCG','ATCC') # => 1

Simply input your two sequences into that method and the number (as an Integer) will be returned.

Multiple Sequence Alignment and software applications

Different programs exist that can do multiple sequence alignment (MSA).

Examples include:

ClustalW
Muscle
ProbCons
TCoffee
MAFFT    # https://mafft.cbrc.jp/alignment/server/
Expresso
Jalview

The most important database for protein families is Pfam. Here is the URL to Pfam:

https://pfam.xfam.org/

ClustalW / Omega

Clustalw respectively Clustalomega can be used to align multiple sequences to one another at the same time. The subsection here just contains a little bit information in this regard.

Is anyone using Clustal Omega out there? Yes indeed! The UniProt web site uses Clustal Omega to align sequences.

The symbols used in clustalw are:

* denotes an exact match
: denotes a conservative match
. denotes a less conservative match

Finding the consensus sequence and constructing a frequency profile

If you have a text file with content like this:

ACTCC
CACCA
AGCCA
AACGC
CGCAT
CGACC
ACCGC
GGCCG
GGCGT
GGCGT

Then you can use class Bioroebe::Alignment to read in that file and output the profile for that alignment:

A: 0.4 0.2 0.1 0.1 0.2 
T: 0.0 0.0 0.1 0.0 0.3 
C: 0.3 0.2 0.8 0.5 0.4 
G: 0.3 0.6 0.0 0.4 0.1 

And then also output the degenerate consensus sequence:

AGCCC

This currently (December 2019) is not very sophisticated but more of a proof-of-concept. At a later time this may be improved.

Since as of July 2021 there is a small ruby-gtk3 widget for this functionality, stored in gui/gtk3/alignment/alignment.rb. This is currently not very pretty but eventually I will improve on this to simplify working with it.

External tutorial

In the past, up until December 2019, I provided and maintained an external tutorial for the BioRoebe project.

However had, in December 2019, I have removed this external tutorial. The reason is that the .md file here is updated a lot more frequently and will include all the information from that external tutorial in the long run. Hopefully this could one day become more like a "book" that could be read, since it is quite a large file at this moment in time.

While the external tutorial looked visually nicer than the markdown format here, due to me being able to make use of more advanced CSS rules or autogenerated JavaScript, in the long run the markdown format is just a lot easier to use, maintain and extend "as you go" - so since as of December 2019, no external tutorial will be provided anymore. All the information that may be useful will be gathered in this file here instead, eventually.

I may also add more information to this README.md here rather than the .pdf tutorial, mostly because the .md is simpler to maintain in the long run really, and more easily available as-is.

IUPAC codes

Next follows the content of two IUPAC-related files, showing the code used by IUPAC for DNA/RNA/proteins.

doc/IUPAC_nucleotide_code.md doc/IUPAC_aminoacids_code.md

IUPAC code for nucleotides may also include "all-except-for-this nucleotide" specifications, specifically:

B  All bases except A      (mnemonic: B follows A)
D  All bases except C      (mnemonic: D follows C)
H  All bases except G      (mnemonic: H follows G)
V  All bases except T or U (mnemonic: V follows U)

Transitions and Transversions

In genetics, a transition is a point mutation that changes a purine nucleotide to another purine (A ↔ G), or a pyrimidine nucleotide to another pyrimidine (C ↔ T).

A transversion changes a purine to a pyrimidine, or vice versa.

Given two different nucleotide sequences, such as two DNA strands, you can calculate the number of transitions between these two strands like so:

Bioroebe.n_transitions('ATGAAAAACA', 'ATGCTGATGG') # => 2

Likewise, Bioroebe.n_transversions() is also available, for the number of transversions between two sequences.

Scanning an input sequence for startsequences

If you wish to find all ATG/AUG start codons in a given sequence then try this method:

Bioroebe.scan_this_input_for_startcodons
Bioroebe.scan_this_input_for_startcodons 'ATGGGGACAGAATGCCGATGAGAGTCGCCCCCGATACCCCCCGGGGGGGAT'

This will start at 1, for the first nucleotide. (This is mentioned in the event you may be confused, for pure-ruby code. In that case do not forget to deduct 1, as arrays or strings in ruby would begin at 0 rather than 1. 0 does not make sense for nucleotides, though.)

Consensus sequence

If you need to obtain a consensus sequence of similar DNA sequences, you can use the following method:

Bioroebe.return_consensus_sequence

Specific usage examples:

Bioroebe.return_consensus_sequence(%w(ACTCC CACCA AGCCA AACGC CGCAT CGACC ACCGC GGCCG GGCGT GGCGT))    # => "AGCCC"
Bioroebe.return_consensus_sequence_of(%w(ACTCC CACCA AGCCA AACGC CGCAT CGACC ACCGC GGCCG GGCGT GGCGT)) # => "AGCCC"

Note that this will return the most likely DNA sequence at each position, as a String.

Since as of April 2020 it is possible to use this from the commandline as well.

Issue the following command:

consensus_sequence
consensus_sequence ACTCGATGCA ACACGCTCCA ACTTGCTCCA GCGCGGCGCA TCATGTTGCA TCGCGATCCT TCTCCATCCT

This variant will be a bit more detailed, and use colours too. The colour blue (steelblue, actually), indicates a position where different nucleotides may exist, so that script will randomly choose one among these alternatives (they will each have the same probability, in that case).

Mismatches and class Bioroebe::CheckForMismatches

If you want to compare two strings for mismatches, you can use the following API:

Bioroebe::CheckForMismatches.new('AAAAATTTTTCAAATTAA|TTTTTAAAAAATTTAAAA')

Note that the class is presently a bit chatty on the commandline.

I recommend to use it only with a short input, like 20 nucleotides or something like this. More options may be added to the class in the future if it turns out to be a useful class.

The hydrophobicity of amino acids

Aminoacids have a different hydrophobicity. For example, Isoleucin is very hydrophobic and Arginine is very hydrophilic.

This can be measured via a scoring system. A commonly used scoring system is the one defined by Kyte and Doolittle.

BioRoebe supports this in various areas. A simple one e. g. when you need to want to output which aminoacids are more hydrophobic, can be this API:

Bioroebe.sort_aminoacid_based_on_its_hydrophobicity('FDMS')

This will then output a String like the following:

D < S < M < F

This may not be hugely useful, but it was needed for one exam question in structural biology and I was too lazy to do this manually, so ruby had to solve this.

Keep in mind that the different scoring schemes may give slightly different results, so the above is really only valid for the Kyte and Doolittle (K & D) scoring scheme. From my experience, the K & D scoring system is the most widely used one, though.

The hydrophobicity values can be shown via a class too, such as in the following way:

Bioroebe::ShowHydrophobicity.new 'Ser-Arg-Phe-Thr'

The Smith Waterman algorithm

A prototype exists to solve the smith waterman algorithm for dynamic programming, in the class called Bioroebe::SmithWaterman.

Do note that the code is not really optimized as-is; in the future further changes may occur, but for now this shall suffice.

Melting temperature calculations for PCR reactions

What next follows is some theory behind melting temperature calculations. This subsection has been inspired by this old email here:

https://lists.open-bio.org/pipermail/bioruby/2014-October/005197.html

The primary class for this functionality within the BioRoebe project can be found in the following file:

bioroebe/calculate/calculate_melting_temperature.rb

So in theory, you could just require it in this way:

require 'bioroebe/calculate/calculate_melting_temperature.rb'

However had, over the years a few additional formulas have been added, so this may be simpler:

require 'bioroebe/requires/require_all_calculate_files.rb'

Or just require 'bioroebe'.

If you are curious, the BioPHP variant is available here:

http://www.biophp.org/minitools/melting_temperature/demo.php?formula=Base-Stacking

The Bioroebe project was initially inspired by the BioPHP project, in particular the web-related activities of BioPHP.

Let's first explain how to calculate the molecular weight of a nucleotide sequence.

The weight comes from the molecular mass of the four DNA nucleotides, naturally, which you can obtain from e. g. wikipedia. For example, Adenosine has the chemical formula C10H13N5O4 and a molar mass of 267.241 g/mol.

https://en.wikipedia.org/wiki/Adenosine

Here we have to keep in mind that a ssDNA strand is connected via a phospho-di-ester bond, internally.

This is essentially a PO₄ group, which has a molecular weight (molar mass) of 94.97 u (g / mol).

However had, two of these O atoms are already present in the DNA nucleotides (specifically in the deoxyribose sugar component), so we only have to add the molecular mass of PO2, which is 62.972 u (g / mol). So approximately, for Adenosine (A), we would come to a molecular weight of (251.24 + 62.972 - 2) which is 312.212. The -2 come from the two H atoms that are lost in the process. Note that this is not valid for the first and the last nucleotide, which will not have any PO4 group attached.

Wikipedia shows the internal bonds here:

https://en.wikipedia.org/wiki/Phosphodiester_bond

There are other ways to calculate the Tm (melting temperature) though. In March 2018 I investigated this a little more in detail, due to an assignment.

One popular method is called the Salt-adjusted Tm. This variant works best for the nucleotide range 18 to 25; however had, it can also be used up to about 40 or 50 nucleotides as well.

Note that the Nearest-Neighbor Tm may provide an even better algorithm in the event of a range above 25 nucleotides - but one advantage that the Salt-adjust Tm algorithm has is that it is significantly simpler to calculate than the nearest-neighbor Tm.

Since it is fairly simple, Bioroebe makes the salt-adjusted Tm available via this method:

Bioroebe.salt_adjusted_tm
Bioroebe.salt_adjusted_tm 'ATGCGGCCTTAGAGACTGG' # => 59.482

This will just return a float value, the degree in n degrees C.

If you want to output more information then you can use the following method instead:

Bioroebe.calculate_melting_temperature_for_more_than_thirteen_nucleotides 'ATGCGGCCTTAGAGACTGG'

This will be more verbose. It is the same as invoking that .rb file (calculate_melting_temperature_for_more_than_thirteen_nucleotides.rb) from the commandline, by the way. An alias exists for this, called:

Bioroebe.melting_temperature()

Presently the concentration of sodium is hardcoded towards 0.050 mMol, so 50m Mol is 0.050 Mol. (In the future we may have to add a way to modify this, but you can instantiate a new object, and then modify the concentration of sodium via .set_sodium_concentration()).

Do also note that the Bioroebe::Shell allows the usage of that class too, via the entry point salt_adjusted_tm.

Furthermore, take note that the melting temperature also depends on whether you have dsDNA - or dsDNA with short ssDNA overhangs. ssDNA on one end of a dsDNA structure will tend to increase the Tm, as compared to a dsDNA that does NOT have such an overhang. Within a given cell this is not really of great importance, but if you ever want to synthesize or otherwise use (artificial) nucleic acid, keep this in mind.

Lastly, there is also the nearest neighbor method, which is the most accurate one - but also the most complex one.

I may add code for this at a later time (statement here made in April 2018) - we'll see.

You may wonder "why so many formulas for melting temperature calculation"?

Well, the thing is that many different factors contribute to the actual melting temperature. The older formulas tend to be simpler - but they are, by and large, typically less accurate. The newer formulas are more complicated but should fit the experimental data somewhat better, since they take more factors into account.

We can quote Dr Richard Owczarzy from the IDT, when he said:

[...] "Tm is not a constant value, but is dependent on the
conditions of the experiment. Additional factors must be
considered, such as oligo concentration and the environment
in which hybridization will take place." 

So essentially, we really need to take many factors into consideration - and this is exactly what the more complex formulas attempt to do.

The oligo concentration alone can cause a Tm-change of up to +10°C or down to -10°C. In general, the DNA-strand that is in excess determines the Tm; at the least has a higher influence on the actual Tm.

A high DNA concentration favors duplex formation - hence, the Tm increases.

Note that there are also some other formulas, such as:

Tm = 81,5 + 16,6 (log10 [J+]) + 0,41 (%G + C) - (600/L) - 0,63 (% FA)

where:

L  = length of the primer
FA = formamide
[J+] = the concentration of monovalent cations

I do not know how accurate this formula will be, but it was used for a practical biotech-course and their PCR reactions, so perhaps the above formula works well in several cases. The bioroebe project currently does not support this formula, largely because I do not know how well that formula is compared to others; but who knows. In the future, Bioroebe may support this formula as well. (One day we may have to do a statistical analysis of these formulas, and compare it to real data - but I hope someone else can do this, since it takes such a long time to compile the data and analyse it ...)

Further useful links related to the melting temperature:

http://www.genbeans.org/ibe/5.1/help/org-genbeans-modules-seqcore/oligo_tm.html http://www.cmbn.no/tonjum/oligocalculator3.26/OligoCalc.html

And, my favourite one:

https://labcalculator.net/wiki/oligo-tm

SimpleStringComparer

class SimpleStringComparer can be used to compare two strings visually, similar as to how NCBI BLAST compares two sequences to one another.

By default the output of that class will be, in colours, on the commandline, but you can disable this like so:

Bioroebe::SimpleStringComparer.new(ARGV) { :disable_colours }

Let's look at another example:

Bioroebe::SimpleStringComparer.new('AAAAAAAAAAAAAATTTTTTTTTTTAAAAAAAAAAAATATA|GAAAAAAAAAAAAAAAATATTTTTTTTTTTTTTTTTTTTTT')

This may look like so:

Bioroebe::SanitizeNucleotideSequence

class Bioroebe::SanitizeNucleotideSequence can be used to sanitize a nucleotide sequence. In this context, sanitizing means that numbers and newlines will be removed, as the following example shows. Pay attention to the part after the '# => ' comment.

Example:

Bioroebe.sanitize_nucleotide_sequence "1 ATCCG\n30 TTACG\n50 AAATTTG" # => "ATCCGTTACGAAATTTG"

The main idea behind the above input is when you have a file that has entries such as:

10 AGTA
20 TTGC

Or, respectively:

1  AGTA 10
11 TTGC 20

This class has been created in particular because some formatting includes numbers and newlines (genbank file format, for instance), so it was a bit annoying to obtain the raw sequence manually. Thus this class will remove all numbers and yield a significantly simpler String without any " " characters or newlines or numbers. This functionality may be useful for FASTA format files as well.

A more specific example, in ruby+bioroebe, will follow next:

require 'bioroebe/nucleotides/sanitize_nucleotide_sequence.rb'

Bioroebe::SanitizeNucleotideSequence.new("  10 AGTA \n  20 TTGC").result?

From within the Bioroebe::Shell (bioshell) itself, you can trigger this functionality by issuing the following command:

sanitize_nucleotide_sequence "  10 AGTA \n  20 TTGC"

Furthermore, this will assign the sequence to the main DNA sequence in use, if it is used in the bioshell - mostly for reasons of convenience.

Bioroebe::ShowNucleotideSequence

class Bioroebe::ShowNucleotideSequence has been added in December 2018. It will eventually replace all other instances of displaying the nucleotide sequence within the BioRoebe Shell.

Built-in support exists for colourizing subsequences or showing a (colourized) ruler on top of the sequence.

The transition into using this class consistently is not yet complete (in the year 2020), so it may take some time before this class will replace the older code. But it will happen.

If you wish to make use of this class in your own projects, first require it:

require 'bioroebe/nucleotides/show_nucleotide_sequence.rb'

And then you can use it to display a nucleotide sequence, such as via:

Bioroebe::ShowNucleotideSequence.new('ATGATTGCACGACACACACTAGCTAGCTAGACTGTAAAAAGGGGACTAGCTAGCTAATGGACTGTAAAACTAGCTAGCTAGACGCGGCGCGCTACTAGCTAGCTAGACTGTAAAGTAAA')
Bioroebe::ShowNucleotideSequence.new('ATGATTGCACGACACACACTAGCTAGCTAGACTGTAAAAAGGGGACTAGCTAGCTAATGGACTGTAAAACTAGCTAGCTAGACGCGGCGCGCTACTAGCTAGCTAGACTGTAAAGTAAA')  {{ colourize_this_subsequence: 'CG' }}
Bioroebe::ShowNucleotideSequence.new(ARGV) {{ colourize_this_subsequence: 'CG' }}

And from the commandline:

show_nucleotide_sequence ATGGGCACATACTACGACGACTCAUAG
show_nucleotide_sequence ATGGGCACATACTACGACGACTCAUAG --ruler

This may look as follows:

Bioroebe::Protein

This class is a subclass of class Bioroebe::Sequence. The intended use is to store a protein.

However had, I am still not entirely sure (in April 2020) whether having this class is a good idea or not.

It responds to some ad-hoc methods such as .to_rna, but this functionality is also available in another method.

For now keep this in mind; at some later point I may decide whether this class is to be kept or not.

In July 2022 I noticed that the bio-gem has the following method:

p Bio::AminoAcid['A'] # => "Ala"

I liked this functionality, but class Bioroebe::Protein already has a [] method which is used to instantiate a new instance of class Bioroebe::Protein. So, a toplevel method was added instead.

Usage example:

Bioroebe::Aminoacids.one_to_three('A') # => Ala

So this is the equivalent to what the bio-gem does, more or less.

If you want to find out the name of a one-letter aminoacid you can also use this method:

Bioroebe::Protein.name('A') # => "alanine"

Decoding aminoacids

Decoding aminoacids means to take the aminoacid at hand, ideally via its one-letter abbreviation, and return a list of all codons that code for this aminoacid, in the form of an Array.

Let's illustrate this example.

First, let's work with a single aminoacid, Lysin, which has the single letter K as abbreviation.

Bioroebe.decode_this_aminoacid 'K'

So you input, as a String, the aminoacid Lysin here, via the 'K'.

This will produce the following result (as an Array):

["AAA", "AAG"]

As can be seen, these two codons code for the aminoacid Lysin. Note that by default this refers to the human genome (or more generally, mammals). This can be changed, via the codon table, for different organisms or mitochondria - but for now, let's ignore these details and continue with the explanation for decoding aminoacids.

If you need to decode a longer aminoacid sequence then another method has to be used. Let's illustrate this via an example.

Say that you wish to decode the aminoacid sequence KKKA.

Use the following API for this:

Bioroebe.decode_this_aminoacid_sequence 'KKKA'

This will return the following nested Array:

[["AAA", "AAG"], ["AAA", "AAG"], ["AAA", "AAG"], ["GCT", "GCC", "GCA", "GCG"]]

As can be seen, each entry in this Array has been replaced with the corresponding codon for that particular aminoacid. This works for different species too, if the correct codon table is used. (By default the main codon table is number 1, which is for eukaryotes, including humans.)

You can also use another method to return a random codon sequence that codes for the given aminoacid sequence (aka, the protein).

Examples:

Bioroebe.return_random_codon_sequence_for_this_aminoacid_sequence('KKKA') # => ["AAA", "AAA", "AAA", "GCG"]
Bioroebe.return_random_codon_sequence_for_this_aminoacid_sequence('KKKA') # => ["AAA", "AAA", "AAG", "GCC"]

Converting this into a String is trivial; just add .join, like you would regularly do in ruby anyway.

This conversion is purely random, though. A more sophisticated approach will take into consideration the most frequent codons for a given organism at hand, which may help when you, for example, wish to produce certain proteins in a particular organism (as is commonly the case in biotechnology).

To obtain the most likely codon sequence for a given amino acid sequence the following API can be used:

Bioroebe.return_the_most_likely_codon_sequence_for_this_aminoacid_sequence 'KKKA' # => ["AAG", "AAG", "AAG", "GCC"]

This defaults to the codon table of humans (Homo sapiens).

If you would rather like to show this in E. coli, then you can use the the API slightly differently:

Bioroebe.return_the_most_likely_codon_sequence_for_this_aminoacid_sequence 'KKKA', :ecoli # => ["AAA", "AAA", "AAA", "GCG"]

As you can see from the result above, E. coli and Homo sapiens use a slightly different preferred codons.

For yeast (Saccharomyces cerevisiae):

Bioroebe.return_the_most_likely_codon_sequence_for_this_aminoacid_sequence 'KKKA', :yeast # => ["AAA", "AAA", "AAA", "GCU"]

You can also use class Bioroebe::ParseFrequencyTable to parse the codon table from the remote .cgi page here. The code for this resides at:

bioroebe/codon_tables/frequencies/parse_frequency_table.rb

Note that at the least 35799 organisms exist in the database, so it is unlikely that BioRoebe wil distribute 35799 .yml files merely for the codon table. In the future a SQL database may be used for this data; for the time being, though, we will have to deal with hardcoded entries to individual .yml files denoting the codon frequencies. At the least for the main model organisms this will be the case - BioRoebe currently already handles the humans, E. coli (K12), yeast, the fruit fly Drosophila melanogaster and Arabidopsis thaliana. This list will be extended, eventually.

You can also automatically download the dataset and store it into a .yml format, in a local file.

Back to the topic of the most likely aminoacid sequence. You can use bin/deduce_most_likely_aminoacid_sequence to output this on the commandline.

Example:

deduce_most_likely_aminoacid_sequence MTTT # => AUGACCACCACC

This defaults to the human genome (or rather, eukaryotes). The codon table can be changed too, if you work with other organisms. (This is presently limited, though; stay tuned for changes).

How does this work?

For every codon, a percentage-likelihood is recorded. This likelihood is taken as value to deduce which aminoacid sequence would be the most likely one.

The Alu sequence

The Alu sequence has two prominent features typically:

  • a 5' A box with the consensus sequence being TGGCTCACGCC
  • a 3' B box with the consensus sequence being GWTCGAGAC

This makes use of the IUPAC nucleic acid notation.

Note that Alu I endonuclease cuts at 5' AG/CT 3'.

The PROSITE database

The PROSITE database can be found here:

https://prosite.expasy.org/

Curiously enough, the PROSITE database makes use of regular expressions (regex).

Consider the following regex here as an example:

[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P

This refers to seven amino acids.

The first one can be either glycine, serine, threonine, asparagine, or glutamate.

The second position can be glycine, serine, threonine, glutamine, cysteine or arginine.

The third position can be phenylalanine, tyrosine, or tryptophan.

The fourth position can be any amino acid except for alanine, asparagine, and tryptophan (notice the {} there).

In positions five and six, any amino acid can follow, as denoted by the x, and the number in ().

Position seven is occupied by proline.

It seems useful that the Bioroebe project supports and understands regexes like this, so code was added to support this.

First, each component can be issued to:

Bioroebe.return_random_aminoacid_from_this_prosite_pattern()
Bioroebe.return_random_aminoacid_from_this_prosite_pattern('[GSTNE]')
Bioroebe.return_random_aminoacid_from_this_prosite_pattern('{ANW}')
Bioroebe.return_random_aminoacid_from_this_prosite_pattern('x(2)')

That way you can decode the individual units. Note that this may return one, or more aminoacids, such as the example for x(2). But aside from this, the method will always try to return just one aminoacid.

Note that this method will return one aminoacid from the possibilities, so for the first input of 5 possibilities, one will be randomly selected. Keep this in mind when making use of the above method.

And if you want to parse a whole prosite-regex you can use the following method call:

Bioroebe.parse_this_prosite_pattern('[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P') # => "EGFIWQP"

Note that presently (April 2020) not all of PROSITE may be supported via this regex, but in the long run the plan is to support all of PROSITE's regex expression.

AT and GC content

alt text

You can find out the AT or GC content of a DNA sequence in the bioshell.

Do either of these:

at_content?
gc_content?

Or shorter:

at?
atcontent
gc?
gccontent

If you have a use case to quickly desire to obtain the GC content of a given sequence, then make use of the following toplevel API, if you would like to:

Bioroebe.gc_content("ATCG") # => "50.000"
Bioroebe.gc_content("ATCGAAAAAAATACGTAGCTACGA") # => "33.333"

Chromosome Table

alt text

You can print out a chromosome table by issuing this command in the bioshell:

chromosome_table

Colours in the BioRoebe project

alt text

This is a subsection about colours, both in regards to the BioShell but also for other classes in the bioroebe project.

There is a default "colourize" tag in the method rev().

By default, the bioroebe project will make use of green colour if anything has to be displayed on the commandline.

You can assign another colour in the bioshell by doing this:

pick_colour pcolour

Or another variant.

This will currently use the Konsole submodule of the Colours project, but in theory, we could also use plain and simple ansi colours for now. Make sure that the colour exists, though.

If you dislike the colours, you can disable them, by issuing this command in the bioshell:

disable colours
disable_colours
no colours

American English spelling should also work.

If you wish to re-enable them, do:

enable colours
use colours

Either will work.

You can also designate another default highlight colour to use. The highlight colour is the colour that will be used to highlight important sequences.

set_highlight_colour violet
set_highlight_colour random

Do note that in the Bioshell, you can also use "highlight" to highlight a substring. So for example:

highlight ATG
seek ATG

seek is an alias to highlight.

Would find all ATG in the sequence and highlight them.

If you wish to clear all highlighting, do either of:

clear highlight
clear highlighting'

If you want to globally enable colours, use this method:

Bioroebe.enable_colours

If you want to globally disable colours, use this method:

Bioroebe.disable_colours

Start codons

alt text

By default, the most commonly used start codon will be ATG (respectively AUG in the mRNA).

Sometimes another start codon may be used by the given organism at hand. In this case, we need to have the Bioroebe project make use of that other codon too.

You can modify the start codon in the Bioshell, by doing:

set_start_codon GUG

This would use GUG as the new start codon. (The BioShell will convert to the DNA form automatically).

You can also show both start and stop codons simultaneously, by issuing:

start_stop

Alternative, since as of January 2018, you can also use this command instead:

start_and_stop?

Load and Dump

alt text

You can store the current main DNA string like so:

dump

And you can also load this dataset again, via:

load

This allows you to save a DNA string (to a local file) and then continue to work on it at a later time again.

class Bioroebe::Sequence

This class is the base for all sequences, within the Bioroebe project.

Usage example:

require 'bioroebe'
seq = Bioroebe::Sequence.new('atgcatgcaaaa')
seq.complement  # => "TACGTACGTTTT"
seq.sequence?   # => "ATGCATGCAAAA"

To determine the complementary strand, you can also use the method called .complementary:

x = Bioroebe::Sequence.new('ATGCAA')
x.complementary # => "TACGTT"

You can also read e. g. a .fasta (FASTA) file with this class.

Examples:

x = Bioroebe::Sequence.sequence_from_file('/Depot/Temp/Bioroebe/vector_pBR322.fasta') x = Bioroebe::Sequence.sequence_from_file('/home/x/DATA/PROGRAMMING_LANGUAGES/RUBY/src/bioroebe/lib/bioroebe/data/alu_elements.fasta') x = Bioroebe::Sequence.from_this_file('/home/x/DATA/PROGRAMMING_LANGUAGES/RUBY/src/bioroebe/lib/bioroebe/data/alu_elements.fasta')

Note that class Bioroebe::Sequence has many other additional methods, such as .gc_percentage, .at_percentage and so forth.

If you wish to append onto a Sequence object then you can do this:

.append()
.append('AGGCACA')

You can add a random stop-codon as well, via this:

.append :stop

This will be totally random, though, so only use it if you don't quite care about the exact stop codon in use. It will respect the current codon table in use.

If you are unsure which codon table is in use, consider doing:

pp Bioroebe.codon_table?

A typical DNA sequence should contain only four nucleotides - A, T, C and G. However had, the IUPAC standard allows for more codes, in particular a single letter to refer to two or more nucleotides, so the bioroebe project has to support this as well.

If you need to have a regexp (regular expression) to check for that, you can use the following code:

sequence = Bioroebe::Sequence.new('ATGCYRWSKMBDHVN')
sequence.to_regexp # or .to_regex or .to_re

The last would return the following regexp:

/ATGC[TC][AG][AT][GC][TG][AC][TGC][ATG][ATC][AGC][ATGC]/

To convert the main sequence into the gen-bank format, you can try to use:

.to_genbank

On class Bioroebe::Sequence. More customizability may be added to that method in this regard, if users need this.

Obtaining a subsequence from a Bioroebe::Sequence object

Say that you have the DNA sequence ATGCATGCAAAA.

There are several ways how to obtain a subsequence from this. One variant will be shown next, by making use of the method called .subseq().

Example:

seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"

You can also randomize the sequence, via .randomize().

Example:

x = Bioroebe::Sequence.new; x.randomize

This is similar to the method in Bioruby here:

https://github.com/bioruby/bioruby/blob/master/lib/bio/sequence/common.rb#L243

The Hydropathy index

You can display the hydropathy index for aminoacids from within the bioshell.

Simply issue:

hydropathy?

The GFF file format

From within the bioshell you can analyze .gff and .gff3 files, such as by issuing the following command:

gff3? foobar.gff3

Evidently for this to work the file at hand has to exist.

The NCBI Taxonomy database (the Taxonomy submodule of the Bioroebe project)

The NCBI taxonomy database is made available to the public via FTP here:

ftp://ftp.ncbi.nih.gov/pub/taxonomy/ ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

More than 160,000 organisms are stored in that database (this current value is valid at: 18th September 2019).

The BioRoebe project makes use of this dataset, since as of August 2015. Knowing the taxonomic species ID entries helps in other regards, e. g. if you wish to download the genomic dataset from an organism based on the species ID.

You can download the remote NCBI taxonomy database and have it populate a local database, such as postgreSQL.

To download the database via ruby and the BioRoebe project, the following API may be used:

Bioroebe.download_taxonomy_database

You could also invoke the class stored in the file bioroebe/databases/download_taxonomy_database.rb.

The Taxonomy project allows us the user to mirror the NCBI Taxonomy database, and interconnect this dataset with local data. So you can, for instance, tag the "official" ID for Homo sapiens and obtain the full lineage as well.

Presently (December 2017), this does not work perfectly well due to various changes that were made since the year 2015, but the functionality used to work in 2015, so, sooner or later, it will work again. :)

Stay tuned for now.

Eventually, when this works again, one can find out how many species are registered in the NCBI Taxonomy database, by issuing the following command in the BioShell:

n_species?
nspecies

This will call the class method called Bioroebe::Taxonomy.report_n_species.

CpG islands

alt text

You can show whether there may be CpG islands from within the bioshell:

CpG
CpG?

Replay

alt text

Sometimes you may have to store your input history within the bioshell. This can be done by:

save_history

This will create a local file. You can, if this file exists, also use it to "replay" input history. This will allow you to simulate that you already did input these specific instructions into the BioShell.

The reason as to why this exists is simple - I needed a quicker way to redo the very same sequence.

replay file

A configuration option also allows us to log all inputted commands into a file. By defaut, I have set this on - if you wish to disable it, you can modify the file configuration/configuration.rb.

Colour schemes

alt text

You can test the available colour schemes by issuing something such as:

test_colour_scheme

You can also do so for the main nucleotide sequence, by issuing:

colour_scheme

You can also create a test-demo page, via:

colour_tests

Molmass

alt text

You can query the molecular mass for an aminoacid by issuing:

molmasse glycine

If you want to get an overview over all aminoacids and their respective mass, do this:

show_aminoacids_mass_table

Weight of Nucleotides

alt text

You can calculate the total weight of a nucleotide sequence.

Simply issue one of the following commands - they are equivalent to each other (== aliases):

weight?
molwt

Do not forget to first assign a DNA string, though.

Or, to randomly grab one nucleotide:

random 1; weight?

You can also show the individual weight of the four DNA nucleotides by issuing:

show_individual_weight_of_the_four_dna_nucleotides

Rosalind Challenges

alt text

Rosalind is an on-line challenge for bioinformatics-related questions and tasks. The problems available on Rosalind are somewhat contrived, but usually very simple and straightforward to solve, if given some time to think about the problem at hand.

I wanted to demonstrate that BioRoebe can solve at the least some of the problems mentioned on Rosalind. In order to do so, I looked at this old repository first - for inspiration:

https://github.com/stungeye/Rosalind-Ruby/tree/master/challenges

Next, I have created the following table to show the implementation for these entries - the Bioroebe project can solve these tasks since as of 20th August 2018 as well, and even more than that (presently 17 tasks can be solved through BioRoebe):

Number Name of the Coding Challenge Implementation Status File location URL to the specific Rosalind challenge
1 Complementing a Strand of DNA Implemented in class Bioroebe::ReverseComplement bioroebe/sequences/reverse_complement.rb https://rosalind.info/problems/revc/
2 Computing GC Content Implemented in Bioroebe.gc_content() and Bioroebe.return_fasta_entry_with_the_highest_gc_content bioroebe/calculate/calculate_gc_content.rb https://rosalind.info/problems/gc/
3 Counting DNA Nucleotides Implemented in class Bioroebe::CountAmountOfNucleotides via the --short option bioroebe/count/count_amount_of_nucleotides.rb https://rosalind.info/problems/dna/
4 Counting Point Mutations Implemented in Bioroebe.hamming_distance() bioroebe/toplevel_methods/hamming_distance.rb https://rosalind.info/problems/hamm/
5 Enumerating Gene Orders Implemented in Bioroebe.permutations() bioroebe/utility_scripts/permutations.rb https://rosalind.info/problems/perm/
6 Finding a Motif in DNA Implemented in Bioroebe.find_substring() bioroebe/toplevel_methods/searching_and_finding.rb https://rosalind.info/problems/subs/
7 Finding a Shared Motif Implemented in Bioroebe.longest_common_substring() bioroebe/toplevel_methods/longest_common_substring.rb https://rosalind.info/problems/lcsm/
8 Translating RNA into Protein (Protein Translation) Implemented in Bioroebe.to_aminoacids() bioroebe/sequences/dna.rb https://rosalind.info/problems/prot/
9 Transcribing DNA into RNA Implemented in Bioroebe.to_rna() or Bioroebe::DNA.new.to_rna bioroebe/sequences/dna.rb https://rosalind.info/problems/rna/
10 Calculating Protein Mass Implemented in Bioroebe.weight_of_these_aminoacids?() bioroebe/sequences/protein.rb https://rosalind.info/problems/prtm/
11 Variables and Some Arithmetic Implemented in Roebe.hypotenuse() roebe/toplevel_methods/misc.rb https://rosalind.info/problems/ini2/
12 Strings and Lists Implemented in Bioroebe.return_subsequence_based_on_indices() bioroebe/toplevel_methods/return_subsequence_based_on_indices.rb https://rosalind.info/problems/ini3/
13 Conditions and Loops Implemented in Bioroebe.sum_of_odd_integers() bioroebe/toplevel_methods/sum_of_odd_integers.rb https://rosalind.info/problems/ini4/
14 Working with Files Implemented in Roebe.return_even_numbered_lines_from_this_file() roebe/toplevel_methods/return_even_numbered_lines_from_this_file.rb https://rosalind.info/problems/ini5/
15 Dictionaries Implemented in Roebe.word_frequencies() roebe/toplevel_methods/word_frequencies.rb https://rosalind.info/problems/ini6/
16 RNA Splicing Implemented in Bioroebe.remove_subsequence() and Bioroebe.rna_splicing() bioroebe/toplevel_methods/remove_subsequence.rb https://rosalind.info/problems/splc/
17 Transitions and Transversions Implemented in Bioroebe.transitions_to_transversions_ratio() bioroebe/toplevel_methods/transitions_and_transversions.rb https://rosalind.info/problems/tran/
18 Finding a Protein Motif Implemented in Bioroebe.show_n_glycosylation_motifs() bioroebe/toplevel_methods/matches.rb https://rosalind.info/problems/mprt/
19 Inferring mRNA from Protein Implemented in Bioroebe.this_aminoacid_has_n_codons() - you will have to split the dataset prior on your own bioroebe/toplevel_methods/codons_for_this_aminoacid.rb https://rosalind.info/problems/mrna/
20 Enumerating k-mers Lexicographically Implemented in Bioroebe.nucleotide_permutations() bioroebe/toplevel_methods/nucleotide_permutations.rb https://rosalind.info/problems/lexf/
21 Fibonacci numbers Implemented in Roebe.fast_fibonacci_implementation() roebe/toplevel_methods/fibonacci.rb https://rosalind.info/problems/fibo/
22 Installing Python This can be easily solved on your own. :-) roebe/toplevel_methods/fibonacci.rb https://rosalind.info/problems/ini1/
23 Locating Restriction Sites Implemented in Bioroebe.every_reverse_palindrome_in_this_string() bioroebe/toplevel_methods/palindromes.rb https://rosalind.info/problems/revp/
Number Name of the Coding Challenge Implementation Status File location URL to the specific Rosalind challenge
24 Open Reading Frames Partially implemented in Bioroebe.open_reading_frames() bioroebe/toplevel_methods/open_reading_frames.rb https://rosalind.info/problems/orf/
25 A Brief Introduction to Graph Theory Not yet implemented unknown as of yet https://rosalind.info/problems/grph/
26 Mendel's First Law Not yet implemented unknown as of yet https://rosalind.info/problems/iprb/
27 Partial Permutations Not yet implemented unknown as of yet https://rosalind.info/problems/pper/
28 Rabbits and Recurrence Relations Not yet implemented unknown as of yet https://rosalind.info/problems/fib/

I may add more solutions to the above table in the long run, again inspired by Rosalind. This may take some time since it is not that important compared to the core functionality of Bioroebe, but we'll see how this will progress.

Note that you can use the code in the Bioroebe project to solve the questions that are asked at Rosalind as well, of course. Some of the code may have to be changed a little bit, but if you know ruby then this should be quite easy. I do, however had, encourage you to first try to solve it on your own; perhaps you can come up with a better solution as well (should not be too difficult, as I literally just sought a simple solution, even if ugly, and then sticked with it).

My solutions were not optimized for speed or anything other than to quickly solve the given problem domain at hand and move on.

A simpler top-level API exists in some cases, such as:

Bioroebe.count_amount_of_nucleotides()

But again - test it out before trying to fill in the answer for a question asked at Rosalind.

For the problem domain of LCSM, at:

http://rosalind.info/problems/lcsm/

called "Finding a Shared Motif", we have to find the longest matching substring between different sequences.

The way how I solved it was to make use of class ParseFasta (which you can access through Bioroebe.fasta() too); then obtain all sequences that were found in that fasta file, via the method .sequences?, and then simply feeding this Array into the top-level method Bioroebe.longest_common_substring(). The outcome was the longest substring - and Rosalind accepted that answer.

The method for this in the Bioroebe project is fairly slow, though, so I do not recommend it for very long sequences. But for short ones the current algorithm works very well.

If you have solved a problem of Rosalind, in ruby code, and have some documentation to it, feel free to drop me an email for inclusion into the Bioroebe project. I may re-arrange the code and put it somewhere where it may seem to fit within the project - and the licence has to be LGPLv2 since as of April 2019, but you will retain your copyright of course. (The reason why the code may have to be re-arranged is to fit within the Bioroebe project.)

In order to calculate the transitions to transversion ratio, you can use the module-method Bioroebe.transitions_to_transversions_ratio().

Do also note that some problems at Rosalind are not about bioinformatics but about other things, such as file handling. In these cases, the solution may be in another project, in particular in the gem called roebe, which is for general purpose activities. I have still included the solutions in the above table in these cases, as it may help people who need some help.

For the problem domain described by Rosalind's REVP challenge, it should be pointed out that most restriction enzymes target DNA in the form of reverse palindromes. So if you can identify these, you may find targets that are cleavable via restriction enzmes.

Further useful links related to Rosalind:

https://github.com/Shivi91/Rosalind-1

I should add that as of the year 2021 I am no longer really active at Rosalind. The primary reason is lack of time, but aside from this many problem descriptions are too sparse. I can't even figure out what they really want to see half of the time. Since it takes too much time to want to figure out what they want, besides the fact that many problem domains are totally contrived anyway, I decided to skip investing more time into Rosalind. Let's focus on solving real, existing problems instead - at the least as far as the Bioroebe project is concerned.

Align two different sequences

alt text

You can compare two different sequences via sscompare, from within the bioshell.

This will then show an alignment between these two different sequences.

sscompare

This is how it may look like:

alt text

You can also use the command "align". This one works a bit differently.

Consider the following example, which will show how align works:

setseq1 CAGGTCCC
setseq3 GGGGGCAC
align 1 3

Another image follows next, it may explain what happens there:

alt text

As you may be able to see, the output is a bit different. Via "align", the bioshell will show in red which ones do NOT match. Via sscompare, the bioshell will show via a red dot which ones will not match, whereas a (green) vertical tab will indicate which nucleotides will match.

Colourize Aminoacids

alt text

You can colourize some aminoacids in red colour, at the least on the commandline.

In order to do so, you first must tell the BioShell which aminoacid you want to colourize. This can be done like so:

cola K

This would colourize Lysine. K is the one amino acid letter for Lysine.

Next, you can test whether this works. A simply way is to ask for the Ubiquitin sequence. Pay attention to lysine at position 48.

From within the bioshell, you can query for the ubiquitin sequence like so:

ubi?

You can also uncolourize it again:

uncola K

Or just reset to the initial state again, via reset.

Phosphorylation Sites

alt text

You can colourize phosphorylation sites in your amino acid sequence, within the BioShell, if you issue this instruction:'

psites?
phosphorylation_sites?

This will feedback some information as to where phosphorylation sites can be found, within any given DNA sequence or aminoacid sequence.

Vertical Barrier

alt text

You can convert a String such as ATGCCG into ATG|CCG by issuing the following command:

barrier ATGCCG

This may be useful if you need to be able to more easily read a DNA sequence, for example.

compseq

alt text

You can use compseq to compare a sequence to "itself", from within the bioshell. That is, we will show how many dinucleotides can be found in the given sequence and whether this is "unexpected" or not. This is similar to the Emboss compseq tool, which can be found here:

https://www.bioinformatics.nl/cgi-bin/emboss/compseq

Remember, this may be useful for when you need to find CpG islands, which can possibly be noticed by the amount of distribution of CG dinucleotides.

Usage example:

compseq
compseq CGACGAGTCCTTGTTGGACTGGTGCCACAGCCCACGCAACGCACTATGGCCGGCCTCCGAGATTATC

The last command would report 6 CG sequences.

If you do not wish to see any colours in the output, and only wish to retain the raw sequence, try:

compseq CGACGAGTCCTTGTTGGACTGGTGCCACAGCCCACGCAACGCACTATGGCCGGCCTCCGAGATTATC --disable-colours

Note that compseq is also available via the sinatra-interface, but don't expect it to be hugely pretty. :)

sinatra

The BioRoebe project contains some code for sinatra, to allow a use of BioRoebe on the www. Code for this can be found in the bioroebe/www/ subdirectory, such as in the file called sinatra.rb there.

Consider this a work-in-progress, though. It will be extended at a semi-random times in the future.

Available components are stored in the array called ARRAY_AVAILABLE_TOPLEVEL_ACTIONS_FOR_THE_SINATRA_INTERFACE.

Implemented functionality so far:

# API Functionality Example
1 /to_aa convert a DNA or RNA sequence into the corresponding aminoacid sequence
2 /to_dna convert a RNA sequence into the corresponding DNA sequence
3 /to_chunked display the nucleotides in a chunked-manner, like in Gene Bank. This can be found in the file to_chunked.rb
4 /random_aminoacids "generate" random aminoacids; pass in a number such as 30
5 /reverse_complement provide the reverse-complement to a specific page
6 /palindromes generate a palindrome of n elements http://localhost:4568/palindromes/30
7 /n_stop_codons_in_this_sequence query how many stop codons are in a given sequence http://localhost:4568/n_stop_codons_in_this_sequence/ATGCTTTTTTAG
8 /compseq compare two different sequences http://localhost:4568/compseq/ATGCTTTAAAAA
9 /mirror ---
10 /view ---

To use this, you can directly work with the URL here, such as in this way:

http://localhost:4568/to_aa/ATGCCA

This would display the String MP, which would correspond to the aminoacids encoded by this DNA sequence (ATG, then CCA).

If you would like to start the sinatra interface from the commandline, try the following command:

bioroebe --sinatra

If you would like to start this from within ruby code, try this:

Bioroebe.start_sinatra_interface

Since as of January 2020, functionality is slowly added to display the source code as well, by appending _source to the URL.

So for example, to_dna_source would display the source code to the method at hand. This does not work 100% reliably yet, and has not been added to all available APIs above, but work is done to achieve this in the future.

http://localhost:4568/to_aa_source
http://localhost:4568/to_dna_source

Since as of October 2020 you can also show the source code to some of these methods, by appending _source.

For example:

http://localhost:4568/reverse_complement
http://localhost:4568/reverse_complement_source

This has not yet been added for all these methods, but eventually all these methods will respond to _source.

Stop Codons

To showcase all stop codons, do:

default_stop_codons?

You can also show and colourize all existing stop codons in a mRNA.

In order to do this, do:

stop?

You must have assigned a sequence prior to this of course.

Escherichia coli (E. coli)

This subsection contains just a little bit of information about E. coli, such as seen here:

http://bacteria.ensembl.org/Escherichia_coli_str_k_12_substr_mg1655/Info/Annotation/#about

E. coli has:

4,140 coding genes. 175 non-coding genes. 182 pseudogenes. 4,497 gene transcripts.

The E. coli genome could be downloaded here (or from NCBI):

https://raw.github.com/ecerami/samtools_primer/master/tutorial/genomes/NC_008253.fna

Dotplots

Support for dotplots is presently fairly ... minimal, to word this nicely.

An example may go like this:

Bioroebe::AdvancedDotplot.new('ABCD', 'ABCD') { :use_asterisk }

If you just need a simple Array with results, to then manipulate on your own, you can use this API:

require 'bioroebe/toplevel_methods/dotplots.rb'
Bioroebe.dotplot_array(ARGV[0], ARGV[1])
Bioroebe.dotplot_array('TAATGCCTGAAT', 'CTCTATGCC')

The latter would return a 2D-Matrix (array inside an array) looking like this:

0 0 0 0 0 1 1 0 0 0 0 0
1 0 0 1 0 0 0 1 0 0 0 1
0 0 0 0 0 1 1 0 0 0 0 0
1 0 0 1 0 0 0 1 0 0 0 1
0 1 1 0 0 0 0 0 0 1 1 0
1 0 0 1 0 0 0 1 0 0 0 1
0 0 0 0 1 0 0 0 1 0 0 0
0 0 0 0 0 1 1 0 0 0 0 0
0 0 0 0 0 1 1 0 0 0 0 0

Keep in mind that this is NOT memory efficient. If you wish to compare two large genomes then this is not recommended to use; for shorter sequences it should do fine.

You can also invoke this from the commandline via the bin/ script at:

advanced_dotplot CDDCDDCCCCC DDCCDDCDCCCC

But again - this is just for really simple use cases. Bioroebe does not (yet?) have support for comparing two genomes to one another and generate a visual map indicating the findings there.

class Bioroebe::MoveFileToItsCorrectLocation

This class will move a bio-file to its "correct" location, with respect to bioroebe. So a file such as foobar.pdb will be moved into e. g. $HOME/bioroebe/pdb/ whereas a .fasta file will be moved into $HOME/bioroebe/fasta/.

The class was created because I did not want to move the files manually anymore.

Ruler - showing the numbers when displaying a sequence

In the Bioshell you can display the nucleotide position on top of the sequence, via:

ruler

This will have some colours on position 1,5 and 10 (aka 0, but remember that the 0 refers to 10 or 20 or 30 and so forth)

This entry point in the bioshell accepts arguments. So if you wish to only show 40 nucleotides per line, use the following variant:

ruler 40

The following image shows how this may look on a terminal with a black background - but keep in mind that the colours may be subject to change (the picture was created in May 2020):

If you do not want these extra colours, you can use any of the following, more plain variants (these listed below are synonymous, aka aliases):

ruler-no-colours
ruler2
ruler2 25 # ← use 25 characters per line
ruler2 50 # ← use 50 characters per line

Piped

alt text

You can add the | character in display at every third position of a nucleotide. This is mostly just a visual display aid, on the commandline or the BioShell.

Invocation example:

piped

In the bioshell you can sort of "toggle" between these two modes quickly; try it for yourself:

show
piped
show
piped

On KDE konsole this may look like the following picture shows (partial screenshot):

alt text

Genomes sequenced in this year

This subsection will list some milestones when genomes of different organisms were sequenced. Do not expect this to be complete - it is just meant as a simple overview, for those who are interested in that information.

Haemophilus influenzae:           1995
S. cerevisiae:                    1996
Caenorhabditis elegans:           1998
Arabidopsis thaliana:             2000
Drosophila melanogaster:          2000
Homo sapiens:                     2003
Mus musculus (the mouse):         2002
Plasmodium falciparum:            2002
Anopheles gambiae:                2002
Rattus norvegicus:                2004
Pan troglodytes (the chimpanzee): 2005

In February 2016, about ~16800 genomes were sequenced and annotated.

The genome distribution is:

Bacteria: ~8000
Archaea:   ~520
Eukarya:  ~2750
Fungi:     1350
Protists:   350
Plants:     250
Animals:    800
Viruses:  ~5500

Genome Sizes of different organisms

The following table is not exhaustive - I just wanted to have a table that would show the genome size of different organisms. This table includes phages as well as bacteria and eukaryotes, so it is somewhat 'universal'.

Name of the organism n mbp (mega base pairs) n genes
SV40 (a virus) 0.005243
M13 (a phage) 0.006407
λ (a phage) 0.048502
Mycobacterium tuberculosis 4.411
E. coli 4.639
Yeast 15
Nematode 80 # or 97 19_000
Arabidopsis thaliana (a plant) 120
Drosophila 155 # or 122 13_600
Chicken 1_000
Human 2_800
Dog 2_500 19_000
Mouse 3_000 # or 2_500 28_000
Maize 15_000
Salamander 90_000
Lily 90_000

So, for instance, E. coli has about 4700000 base pairs in its genome. (It is easier to write it as 4_700_000, but for the above table I wanted to easily show the comparison to eukaryotes as well. A value of 90_000 means 90000.)

Why are the number of genes so different? Keep in mind that the complexity of an organism is NOT limited to a simple "how many genes that organism has". Genes in itself are not that well-defined, so they are not necessarily the primary means of complexity. Think of this more as an interactome, where RNAs play a major dynamic role as well.

class Bioroebe::DisplayOpenReadingFrames

class Bioroebe::DisplayOpenReadingFrames, created in May 2020, will eventually replace the older class Bioroebe::ShowOrf. Thus, class Bioroebe::DisplayOpenReadingFrames will have to remain quite flexible. It shall also support sixpack and showorf from the Emboss online tools. (In fact, supporting these two use cases was the original reason as to why this class has been created.)

Where does the code to this class reside?

It can be found here:

bioroebe/utility_scripts/display_open_reading_frames/
require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'

The display of this class is typically aimed for the commandline, but it is planned to use the class on the www too (via sinatra).

Take note that this class also reports how many ORFs (open reading frames) have been found. The number displayed here differs from the EMBOSS tools. I have learned that an ORF begins at a start codon, and ends at a stop codon. The start codon is typically ATG (respectively AUG). For some reason the EMBOSS tools consider everything to be an ORF, even if it does not begin with an ATG. Thus, they will report a higher number than class DisplayOpenReadingFrames does.

I have no idea as to why EMBOSS uses this different approach, but the approach taken by Bioroebe appears to be more logical - how can it be an ORF without a start codon? Or without a stop codon? That would make ANY such sequence an "open reading frame", and that is quite a rubbish assumption really.

By default, the class will display a chunk of the nucleotide sequence defaulting to 60 nucleotides per display-run.

If you want to change this, use something like this:

display_open_reading_frames ATGAAACCCGGGAAAGGGCCCAAATTTGGGAGACCATGAATGAAACCCGGGAAAGGGCCCAAATTTGGGAGACCAAGAAAACCCGGGAAAGGGCCCAAA --threshold=45

Again, keep in mind that the class is not absolutely perfect yet; I will continue to improve it over the coming weeks and months.

Here is a partial screenshot of the output of this class, via the KDE Konsole:

alt text

Codon Table

A codon table is a human-made mapping that keeps track of which particular codon will be translated into the corresponding aminoacid at the ribosome, during translation. They can be species-specific, that is, in different organisms, different codons may code for different aminoacids - or be interpreted as a stop codon.

You can designate a specific codon table to use via the following API (the two are equivalent):

Bioroebe.set_codon_table()
Bioroebe.codon_table = 

And you can query the codon table in use via:

Bioroebe.codon_table?

We will follow the NCBI convention for the codon table, so using 1 will use the eukaryote table, for instance (this is the default anyway).

Note that you can also use a symbol instead, such as :human or :humans, which should be easier to remember.

Example for this:

Bioroebe.set_codon_table(:humans)

Do also note that when you use the bioshell (Bioroebe::Shell), you can also get a simple ASCII table overview if you issue codon_table?.

You can get a list of all registered codon_tables via:

show_all_codon_tables

From the BioShell.

If you only want to show the headers, do this:

codon_tables_headers
codontablesheaders

You can also obtain the codons that code for a specific aminoacid by simply typing the name of this aminoacid, and a trailing question mark. This currently (Mai 2016) only works for aminoacids in german.

glycin?

The above will tell us some things about the aminoacid Glycin.

Note that there is also a small ruby-gtk3 GUI that can be used to show the codon table in use. In the long run it is planned to have a unified GUI base so that this all can work from the www as well, but for now (May 2021) most focus is done on the ruby-gtk3 bindings. Since as of August 2021 only ruby-gtk3 is supported. Support for ruby-gtk2 is dropped. While I do not intend to stop using ruby-gtk2 permanently so, as far as the Bioroebe project is concerned it is simpler to work solely via ruby-gtk3 past this point on. Less code to maintain as well.

Aminoacids

The BioRoebe project can handle information about aminoacids.

Let's first show some examples in the BioShell, though.

If you want to display all aminoacids, listed in a table format, then you can use the following commands in a running BioShell instance:

aa_table
aa?

To assign a random aminoacid sequence from within a running instance of BioShell, do either of the following:

random_aa
random aa

You can also query the molecular formula of an aminoacid or some chemical substances such as Glycine.

mass? arginine
mmasse Glycine

This will also show the molecular formula of arginine.

If you have a DNA sequence assigned and want to see the corresponding aminoacid sequence, then you can invoke the following command from within a running instance of the bioroebe-shell:

to_aa

This will also generate a .csv file, normally at /Depot/Bioroebe/aminoacid_composition.csv (by default).

This .csv file follows the following format:

(1) The short letter of this aminoacid
(2) The number of times this aminoacid occurs (as integer value)
(3) The percentage value that this aminoacid has in the overall
    sequence, e. g. 10% Leucine 

You can also display the codons that code for the particular aminoacids in use.

An example follows for the BioShell:

set_aa 5 # Use 5 random aminoacids
deduce_this_aminoacid_sequence # Deduce the codons.

Do also note that you can determine a most plausible degenerate primer based on this, via the Shell component. See the subsection about Degenerate Primers here in this file.

You can batch-identify all aminoacids, within the BioShell, via:

identify_aminoacid A-Z

For a picture of the twenty aminoacids, have a look at this remote image:

This image shows the 20 amino acids

The following table should show the one-letter code and the three-letter code, as comparison:

One letter code Three letter code Amino acid Possible codons
A Ala Alanine GCA, GCC, GCG, GCT
B Asx Asparagine or Aspartic acid AAC, AAT, GAC, GAT
C Cys Cysteine TGC, TGT
D Asp Aspartic acid GAC, GAT
E Glu Glutamic acid GAA, GAG
F Phe Phenylalanine TTC, TTT
G Gly Glycine GGA, GGC, GGG, GGT
H His Histidine CAC, CAT
I Ile Isoleucine ATA, ATC, ATT
K Lys Lysine AAA, AAG
L Leu Leucine CTA, CTC, CTG, CTT, TTA, TTG
M Met Methionine ATG
N Asn Asparagine AAC, AAT
P Pro Proline CCA, CCC, CCG, CCT
Q Gln Glutamine CAA, CAG
R Arg Arginine AGA, AGG, CGA, CGC, CGG, CGT
S Ser Serine AGC, AGT, TCA, TCC, TCG, TCT
T Thr Threonine ACA, ACC, ACG, ACT
V Val Valine GTA, GTC, GTG, GTT
W Trp Tryptophan TGG
X X any codon NNN
Y Tyr Tyrosine TAC, TAT
Z Glx Glutamine or Glutamic acid CAA, CAG, GAA, GAG
* * stop codon TAA, TAG, TGA

Bioinformatics mantra

This is just a little mantra of potentially useful sayings.

They are not equally important - if you feel inspired by some but not others, simply disregard those that are not to your liking.

1) Treat data as read-only

Deprecations and code-removal within the Bioroebe project

Over the years a lot of code that at one point in time has been written for the Bioroebe project, has been modified.

Content of files has been changed, new files and directories were added, renamed or removed, and so forth.

It is thus unavoidable that as a byproduct of such continuous changes, sometimes certain files are no longer available as a consequence. They may also become obsolete over time.

The subsection here is a slight reminder for me as to what has changed, when and why. I'll only add entries that are somewhat important to the project in the long run.

File:

codons_of_the_aminoacids.yml

This file has been deprecated and removed in March 2020. The reason is because it was obsoleted by the file 1.yml. That file contains all the aminoacid codons for eukaryotes anyway, so there was no more reason to have essentially the same information in yet another file.

To ease any transition, be it only of the mind, the symbol :codons_of_the_aminoacids can still be used as a mnemonic, such as in:

Bioroebe.load_the_codon_table_dataset(:codons_of_the_aminoacids)

Directory:

dbget/

DBGET has been discontinued in 2004. Bioroebe still had some old code, but it was pointless to retain code for software that has been abandoned many years ago, so during the rewrite in March 2020, support for dbget has been removed.

Classes (deprecations of classes or functionality stored in classes):

*Bioroebe::Nucleotide

Up until April 2020, class Nucleotide existed as part of the Bioroebe project, but it was not in use. Thus, it was removed.

The idea was that class Nucleotide could be used for a gene - but the more general class Sequence is doing this instead nowadays.

Bioroebe::Sequence

Up until 23.04.2020 (23rd April 2020) it was possible to specify a display_id for a Sequence object. It was removed on that day, though. The reason is that I am not sure whether a specific id is necessary for the sequence object.

If similar functionality is desired in the future then perhaps it may have to use another name, or go into another class - but for the time being, it was removed.

Constants:

FILE_BIOROEBE_TUTORIAL

In the past this constant pointed to /Depot/Temp/lighty/BIOROEBE_TUTORIAL.pdf. It made available a pdf-generated file, from a .cgi file.

However had, in December 2020 it was finally deprecated and removed. The reason for this is simple: the markdown format seems to work much better than having to generate a .pdf file from a .cgi file. In the future this markdown file will be the main documentation for the bioroebe project.

FILE_TALENS

This constant was replaced via a method call:

Bioroebe.file_talens

FILE_ALU_ELEMENTS

This constant has probably been deprecated years ago, but it was not registered; thus, in December 2020 this is now deprecated. The code still exists uncommented, but may be removed at a later time once I am completely sure that it is no longer necessary.

YAML_RESTRICTION_ENZYMES

This constant has been removed; if you need to load the dataset from the restriction-enzymes, consider making use of the following API:

Bioroebe.load_and_return_the_restriction_enzymes

AC_ELEMENT

This constant is probably not really necessary, so it was removed in December 2020.

Sequence Objects

You can build Sequence objects by doing something like this:

sequence_object = Bioroebe::Sequence.new(
  sequence: "aaaatgggggggggggccccgtt",
  alphabet: :dna
)

or

sequence_object = Bioroebe.sequence(
  sequence: "aaaatgggggggggggccccgtt",
  alphabet: :dna
)

This is similar to BioPerl and BioRuby, by the way. If you want to, you can also use Bioroebe::Seq rather than Bioroebe::Sequence. I personally prefer the longer variant but both ways are doable.

Usage examples:

Bioroebe::Sequence.new "aaaatgggggggggggccccgtt"
Bioroebe::Sequence.new "aaaatgggggggggggccccgtt", :upcase # This variant will upcase the input.

Let's look at a slightly more complicated example for creating such a sequence object.

sequence_object = Bioroebe::Sequence.new(
  seq:        "aaaatgggggggggggccccgtt",
  display_id: "#12345",
  desc:       "example 1",
  alphabet:   "dna"
)
puts sequence_object.seq
puts sequence_object.sequence?

You can also cut this sequence with a restriction enzyme.

Example:

sequence.cut_with_enzyme('EcoRI')

Note that if you wish to show the main DNA string assigned in the BioRoebe-Shell, then you can use either of the following:

seq?
seq_with_tab?

Leader and Trailer

By default, the DNA output from the BioShell will show the single stranded nucleotide sequence. It will also show a leading padding, the 5-prime leader and the 3-prime trailer.

This can be disabled, if you so want to, via:'

no_trailer
no_leader

class Bioroebe::ConsensusSequence

This class can be used to show a consensus-alignment between different nucleotides.

The API, in ruby (for bioroebe), is:

Bioroebe::ConsensusSequence.new(ARGV)

You can also use a toplevel method if you'd prefer that. Example:

Bioroebe.return_consensus_sequence_of(%w( ACTCC CACCA AGCCA AACGC CGCAT CGACC ACCGC GGCCG GGCGT GGCGT )) # => "AGCCC"

From the commandline, this example may suffice:

consensussequence ACTCGATGCA ACACGCTCCA ACTTGCTCCA GCGCGGCGCA TCATGTTGCA TCGCGATCCT TCTCCATCCT CCTTGCTGCA CCGTCGTGCA CCACCTTACA

This would, on the commandline, generate the following result:

class Bioroebe::DotAlignment

This cute little class can align two sequences. Every matching position will have a '*' denoted below these two sequences.

Usage example:

Bioroebe::DotAlignment.new('CTGCAGGAAGCGAGGCTGGAGAGCAGGAGGGGCTCTG--CGCAGAAA--------TTCTT',
'ATGAATGCATATATATGTATATGTATGTGTGTATATATACACACATATATATATATATTT')

# Commandline usage:
dot_alignment CTGCAGGAAGCGAGGCTGGAGAGCAGGAGGGGCTCTG--CGCAGAAA--------TTCTT ATGAATGCATATATATGTATATGTATGTGTGTATATATACACACATATATATATATATTT

# The Output would then be:
CTGCAGGAAGCGAGGCTGGAGAGCAGGAGGGGCTCTG--CGCAGAAA--------TTCTT # 60
ATGAATGCATATATATGTATATGTATGTGTGTATATATACACACATATATATATATATTT # 60
 ** * * *   *         * * * * *  * *   * ** * *        *  **

Alternatively you can also pass the path to a local file as the first argument to this class.

The class has been written because I received a .pdf with malformed matching ''. Copy/pasting from that .pdf did not work well, save for the core sequences, so I copy/pasted just the sequence, and then used class BioroebeDotAlignment to sanitize the placement of the '' characters.

How does the commandline output look?

It looks as the following image shows:

What sequence is this?

In the BioShell, you can try to find known sequences by issuing this command:

what_sequence_is_this?
what_sequence_is_this? ATG

Mutations

You can mutate the DNA nucleotide sequence.

Example:

mutate_position 5 C
mutate_position 4 A'

Here we will mutate the nucleotide at position 5 to C or the one at position 4 to A.

random 35; mutate_position 5 C

You can also mutate the aminoacid sequence. The following example shows that.

mutate_aminoacid_position 5

Mirror Repeats in DNA

A mirror repeat in DNA is a sequence that is repeated in both directions, as if someone would have put a "mirror" into that spot in the sequence.

So for instance, if we have a sequence called CAG then the mirror sequence would be GAC, for the total sequence being:

CAGGAC
CAG|GAC

The | denotes where the "mirror" would be.

Do note that palindromes and mirror repeats are very similar; the difference is that the mirror repeat happens on the same DNA-strand, whereas the palindrome occurs on the sister strand.

Bioroebe supports the simple "creation" of mirror repeats, through the file bioroebe/utility_scripts/mirror_repeat.rb and the method there called Bioroebe.mirror_repeat_of().

Simply pass in the sequence that you wish to mirror, such as:

Bioroebe.mirror_repeat_of('CAG') # => "CAGGAC"
Bioroebe.mirror_repeat('CAG') # => "CAGGAC"

Both variants work fine; the second option is evidently shorter, and may thus be easier to type.

This also works on the commandline, via bin/mirror_repeat.

Example:

mirror_repeat CAG
mirror_repeat T A C A C G G C A C A T --spacer
mirror_repeat GCATGGTACG --spacer

Note that it follows from simple logic that a DNA mirror repeat will have a center of symmetry on a single strand.

An example for a DNA mirror repeat is:

5'-GCATGGTACG-3'

Imperfect DNA mirror repeats (IMR) are less than 100% symmetrical. They can be found in many different biological entities; for example in the gag gene of HIV-1 (HXB2).

The BioRoebe shell prompt

The BioRoebe shell has a prompt, which usually defaults to showing the current working directory (cwd; pwd).

This can be quite useful, but may sometimes be distracting. An empty prompt may be more useful in this case, which could be set via either of the following, from within a running instance of the BioRoebe shell:

no prompt
no_prompt

If you want the old prompt back again, you could issue any of the following commands:

restore_prompt
restore_default_prompt

Showing both DNA strands in the BioRoebe shell

In order to show both DNA strands, issue the following command in a running instance of the BioRoebe shell:

both
double
dsDNA

Manipulating the Xorg Buffer on Linux while using the BioRoebe-Shell

You can set the main DNA sequence to the xorg buffer on linux.

To copy the DNA sequence into the Xorg Buffer, issue the following command:

to_buffer

This requires the program called "xsel" on linux.

If you do not have this program, do not want to install it or do not want to use it at all, you can disable this functionality from within the BioRoebe-Shell like so:

disable xsel

Reverse complement

If you ever need the reverse complement of a DNA string, you can use the following API:

Bioroebe.reverse_complement 'AAAACCCGGT' # => "ACCGGGTTTT"
Bioroebe.reverse_complement("GATATC") == "GATATC" # => true

This is also available via the commandline, through bin/reverse_complement, which I aliased to reversecomplement on my computer system:

reversecomplement CCGGATTTCAGCAGAATGGATGTTTAAGGATGA # => TCATCCTTAAACATCCATTCTGCTGAAATCCGG

Note that the display here will always be in 5'→3' direction.

What is the reverse-complement?

Say that you have a sequence such as 5'-CCGGATTTCAGCAGAATGGATGTTTAAGGATGA-3'.

Given a dsDNA strand, this ssDNA strand is typically base-paired.

Let's see how this DNA sequence aligns:

5'-CCGGATTTCAGCAGAATGGATGTTTAAGGATGA-3'
3'-GGCCTAAAGTCGTCTTACCTACAAATTCCTACT-5'

The second part is the regular complement. The regular complement is simply the one that applies to the complementary DNA strand, the one that can base-pair with this sequence as-is.

The reverse complement would be the sequence that would be shown if that complementary DNA strand would be displayed in the 5' to 3' direction. In other words, in Ruby, this would be:

"GGCCTAAAGTCGTCTTACCTACAAATTCCTACT".reverse # => "TCATCCTTAAACATCCATTCTGCTGAAATCCGG"

Why is this important to understand? Well, you may wish to design or check that both primers in PCR, forward and reverse, would be correct.

Logging and Log output

The BioRoebe project may autogenerate some files, including log files.

In order to be able to do so, the bioroebe project needs the user to be able to access a base directory, the so-called working base directory. This is where bioroebe assumes most working files to exist.

On my linux system this used to default to the directory called /Depot/Bioroebe/. On other systems, the log directory may default into the user's home directory, via a call to "#File.expand_path('~')/". (I use this presently, since 2020.)

This should work on most systems, but it may not be what you want to have or use in your own workflow. Thus, code exists that allows you to designate another log directory to use.

The API for this is simply called:

Bioroebe.set_log_dir()

The method can be found in lib/bioroebe/toplevel_methods/log_directory.rb.

If you want to do this from within the bioshell itself, try:

set_log_dir /tmp/test
setlogdir /tmp/test

If you use the interactive bioroebe-shell then you can use home? to determine where the log directory is on your system. For example, I tend to just use /root/Bioroebe/ these days, when I am the superuser.

Note that you can also define the environment variable called BIOROEBE_DEFAULT_LOG_DIRECTORY. If this is set on startup of the bioroebe-shell, then it will overrule the initial :default value that is used otherwise.

Working with Blosum

BLOSUM is used to sequence-align proteins; thus, it can be used to compare alignments of aminoacids.

You can read more about Blosum (BLOcks SUbstitution Matrix), which was developed in 1992, on wikipedia:

https://en.wikipedia.org/wiki/BLOSUM

BLOSUM62, the 62 there, indicates that the sequences selected for constructing the matrix share an average identity value of 62%.

There are several different types of BLOSUM-Matrices, which have been developed for different scenarios.

The by far most often used BLOSUM Matrix is Blosum45.

BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences, based on local alignments.

Within the Bioroebe project, these ruby files currently deal with BLOSUM:

bioroebe/calculate/calculate_blosum_score.rb
bioroebe/parsers/blosum_parser.rb

Since as of 2021 you can also output the blosum-matrices distributed within Bioroebe.

The API for this is:

Bioroebe::Blosum.show_matrix

Provide the name of the .yml file (aka the BLOSUM matrix at hand) for this. Code support for this will improve over the coming weeks and months.

There is also a small ruby-gtk3 available, at:

bioroebe/gui/gtk3/blosum_matrix_viewer/blosum_matrix_viewer.rb 

You can call this via the commandline directly (just use ruby to run that .rb file) or simply use bin/bioroebe like this:

bioroebe --blosum-GUI

This ruby-gtk3 widget currently (in August 2021) looks like so:

If you are working with the bioshell then you can show the BLOSUM matrices via:

show_blosum_matrix

Translation

Different organisms may use different "codon tables" to translate DNA → mRNA → protein.

By default, Bioroebe will use the table available from NCBI. This is also known as the standard genetic code, aka NCBI table id 1.

Let's assume that we are dealing with a mitochondrial sequence instead. We thus need to tell Bioroebe that we have to use another genetic table.

Bioroebe.set_genetic_code(to: 'Vertebrate Mitochondrial')

You can also use numbers instead:

Bioroebe.set_genetic_code(2)

Both variants listed above are synonymous. Use whichever variant you prefer.

To show the available codon tables, as overview, within the BioShell, try the following code:

codon_table?

To show the codon table used by species "Homo sapiens", try either of the following variants:

show_codon_table
show_all_codon_tables

Working with codons of different organisms and show all codons for a particular aminoacid

If you need to obtain an Array of all codons of a particular aminoacid then you can use the following:

Bioroebe.codons_for_this_aminoacid? :serine    # => ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]
Bioroebe.codons_for_this_aminoacid? :threonine # => ["ACT", "ACC", "ACA", "ACG"]

You can also use the one-letter abbreviation instead:

Bioroebe.codons_for_this_aminoacid? 'S' # => ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]
Bioroebe.codons_for_this_aminoacid? 'T' # => ["ACT", "ACC", "ACA", "ACG"]

This will default to the eukaryotic codon tables, which includes humans, the mouse and so forth.

Sometimes you may have to use another codon table dataset, such as for the yeast mitochondria.

The following code shows how this could be done:

Bioroebe.codons_for?('T') # => ["ACT", "ACC", "ACA", "ACG"]
Bioroebe.use_this_codon_table(:yeast_mitochondria) # => {"START"=>"ATA | ATG", "TAG"=>"*", ... and so forth
Bioroebe.codons_for?('T') # => ["CTT", "CTC", "CTA", "CTG", "ACT", "ACC", "ACA", "ACG"]

Notice how the same code now yields another result - this is because we now use the codon table dataset for yeast mitochondria, which is different to the human codon table.

If you need to load up a specific codon table then you could use this:

codon_table = Bioroebe::CodonTable[2]
codon_table = Bioroebe::CodonTable[:humans]

If you wish to also change the codon table at the same time considering using this variant:

Bioroebe::CodonTable.by_name("Vertebrate Mitochondrial")
Bioroebe::CodonTable.by_id(1)

Showing the available codon tables on the commandline

The Bioroebe project allows for the use of different codon tables. This works on the commandline as well, via the file bin/show_this_codon_table.

To show the codon table for eukaryotes, do:

show_this_codon_table 1

To show the codon table for eukaryotic mitochondria, do:

show_this_codon_table 2

Which codon tables (translation tables) are available right now, as of May 2020?

This is the list:

 1: Standard (Eukaryote)
 2: Vertebrate Mitochondrial
 3: Yeast mitochondorial
 4: Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
 5: Invertebrate Mitochondrial
 6: Ciliate Macronuclear and Dasycladacean
 9: Echinoderm Mitochondrial
10: Euplotid Nuclear
11: Bacteria
12: Alternative Yeast Nuclear
13: Ascidian Mitochondrial
14: Flatworm Mitochondrial
15: Blepharisma Macronuclear
16: Chlorophycean Mitochondrial
21: Trematode Mitochondrial
22: Scenedesmus obliquus mitochondrial
23: Thraustochytrium Mitochondrial Code
24: Pterobranchia Mitochondrial Code
25: Candidate Division SR1 and Gracilibacteria Code
26: Pachysolen tannophilus Nuclear Code
27: Karyorelict Nuclear Code
28: Condylostoma Nuclear Code
29: Mesodinium Nuclear Code
30: Peritrich Nuclear Code
31: Blastocrithidia Nuclear Code
33: Cephalodiscidae Mitochondrial UAA-Tyr Code

By default DNA codons are shown. You can use RNA codons instead via the --rna commandline flag. The only difference is that all U will be T, or vice versa, depending on whether you have DNA or RNA.

For example, to show the codon table for humans, as RNA codons, do this:

show_this_codon_table 1 --rna

Note that since as of January 2021, a ruby-gtk3 wrapper exists, merely for testing-purposes. It can be found at bioroebe/gui/gtk3/show_codon_table/show_codon_table.rb.

Yeast

The yeast genome can be searched for at:

https://www.yeastgenome.org/

Two examples for how to use it follow:

https://www.yeastgenome.org/search?q=cell%20cycle https://www.yeastgenome.org/search?q=mating

Viral genome entries in NCBI and RefSeq (Viral genomes)

Because I had to look up several viral genomes, I finally decided to make a list/table for this, which I intend to expand to, say, at the least 50 different viral genomes eventually - one day.

For now, here is the list:

virus NC ID n bp URL
human papillomavirus type 16 (HPV 16) NC_001526 7906 bp https://www.ncbi.nlm.nih.gov/nuccore/NC_001526
human papillomavirus type 18 (HPV 18) NC_001357 7857 bp https://www.ncbi.nlm.nih.gov/nuccore/NC_001357
human rhinovirus 1 strain ATCC VR-1559 NC_038311.1 7137 bp https://www.ncbi.nlm.nih.gov/nuccore/NC_038311.1

The T-Bacteriophages

The following table only shows a short summary for the T-phages.

name of the phage Plaque size phage-head diameter (nm) tail diameter latent period (in minutes) Burst size n genes
T1 medium 50 150 x 15 13 180
T2 small 65 x 80 120 x 20 21 120
T3 large 45 invisible 13 300
T4 small 65 x 80 120 x 20 23.5 300 300
T5 small 100 tiny 40 300
T6 small 65 x 80 120 x 20 25.5 200-300
T7 large 45 invisible 13 300

The next table will show some phage genomes.

name of the phage NCBI Reference Sequence Remote URL
Deftia phage phiW-14 NC_013697.1 https://www.ncbi.nlm.nih.gov/nuccore/NC_013697.1
T4 phage NC_000866 https://www.ncbi.nlm.nih.gov/nuccore/NC_000866

Histidine Tags

alt text

You can add a Histidine Tag, randomly, from within the bioshell, via:

add_his_tag

Perhaps at a later time this part will be extended.

Working with Aminoacids within the bioshell

alt text

You can assign some aminoacids via:

setaa LLLTLTLT

Then, you can query the positions via:

positions?

This will display the amino acids in a chunked, numbered manner.

If you want to find out how many aminoacids the target sequence has, do:

aa_size?

Translate into amino acids

alt text

You can translate any given DNA sequence into the corresponding amino acids by issuing:

translate
trans

Simply add the DNA sequence as argument to translate. If you omit it, the BioShell will attempt to use the main DNA sequence.

translate ATGCCCCGCTGA # => MPR*

Note that we will also show, for the first frame, the specific position.

You can do so too via:

frame1
frame2
frame3

Note that this will show the respective frame at hand.

Also note that you can do the reverse again, by issuing:

backtrack

This will attempt to use the most likely codon, in order to deduce the corresponding DNA sequence.

backtrack uses a codon usage table which gives the frequency of usage of each codon for each amino acid.

For each amino acid in the input sequence given, the corresponding most frequently occuring codon is used in the nucleic acid sequence that is output.

If you want to quickly find out as for which particular aminoacid a codon codes for, use this method:

codon? ATG

Internally class Bioroebe::ShowCodonUsage handles this.

A small ruby-gtk3 wrapper exists over this functionality as well. It can read in local files too; but as of February 2021, this functionality is still very limited. Consider it more as a proof-of-concept than any more complex GUI application really.

EMBOSS tools

alt text

The EMBOSS tools were made available as .cgi files at sites such as:

http://www.bioinformatics.nl/cgi-bin/emboss/

Note that one goal of BioRoebe is to clone the functionality of all these scripts available at EMBOSS. However had, that does not necessarily mean that all the names will be copied as 1:1, nor that really all of the functionality will be included - the primary focus will be on functionality that seems important and relevant.

Some functionality has already been copied, such as transeq and cgplot. But there is still a lot missing, so stay tuned for now.

The EMBOSS compseq tool has been partially copied as of January 2016.

You can invoke this by tool doing "compseq" in the interactive menu from the BioShell component. Respectively, you can also invoke the compseq.rb file from bioroebe/utility_scripts/compseq/compseq.rb.

The following table shows which EMBOSS tools have been ported already:

# EMBOSS tool Implemented in Bioroebe Main code to be found in Bioroebe API from within Bioroebe EMBOSS URL
1 compseq partially bioroebe/utility_scripts/compseq/compseq.rb https://www.bioinformatics.nl/cgi-bin/emboss/compseq
2 backtranseq fairly well bioroebe/codons/codons.rb Bioroebe.deduce_most_likely_aminoacid_sequence https://www.bioinformatics.nl/cgi-bin/emboss/backtranseq
3 sizeseq partially bioroebe/parse_fasta/misc.rb Bioroebe.sizeseq https://www.bioinformatics.nl/cgi-bin/emboss/sizeseq
4 sixpack almost complete bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb https://www.bioinformatics.nl/cgi-bin/emboss/sixpack

The file emboss.md in the doc/ subdirectory may also include that information, but it is not updated that frequently.

Next, some of these components will be described in a bit more detail:

sizeseq: sizeseq reads a set of sequences and sorts them by
         length.

API usage example for sizeseq:

Bioroebe.sizeseq "existing_fasta_file_goes_in_here.fasta"
Bioroebe.sizeseq "globins.fasta"

Or just input a String.

Note that the Bioroebe.sizeseq method is a bit different to the EMBOSS sizeseq. Bioroebe will re-arrange the sequence at hand by indicating the length, in the > comments section. The reason why this is the default is because I like reading the number instantly. (This behaviour may eventually be up to the user, so I will most likely add a way to disable this. But for the time being I will retain it as-is.)

How does the output of sixpack look like?

Say that you input the sequence ACGTACGTACTGACA to it. It will then yield something like this on the commandline:

The output may be a bit confusing at first glance, but it may be helpful to first focus on the middle where you see 1 on the left side (the starting nucleotide) and 15 on the right side (the last nucleotide). Based on this the three forward frames are shown above, and the three reverse frames are shown below - see the designate F1, F2, F3, as well as R1, R2 and R3.

You can designate different codon tables on the commandlie, such as via --codon-table=yeast. See more information about this class at Bioroebe::DisplayOpenReadingFrames.

Set the name of a gene

You can set a name of your gene, in the BioShell, via:

set_name foo

The argument will be the name of your target gene in question.

This may be useful at a later point, if you want to keep track of different target sequences.

The idea is to be able to store all these genes via annotated FASTA files, while showing them in a GUI as well in the long run.

Pubmed

This subsection will eventually contain more information as to how to interact with pubmed programmatically.

For now, the Bioshell part of BioRoebe has a pubmed entry point that can be used via:

pubmed

This will open pubmed in the browser. You can also use search terms for this such as:

pubmed virus

As said, currently this is just a fancified browser-opener, but in the future this may well become an API that people can use to programmatically search for pubmed articles.

Browser setting in the configuration file

The browser (for opening external websites) is defined here:

bioroebe/yaml/configuration/browser.yml

This defaults to palemoon. If you want to use another browser then you have to modify that yaml-file for now. I may add support for picking a browser as-is via the commandline at a later time, though, so folks don't have to change any file.

Bisulfite Treatment of DNA

For more information about bisulfite treatment of DNA, have a look at the following URL:

https://en.wikipedia.org/wiki/Bisulfite_sequencing

The subsection here will deal with bisulfite treatment of DNA from a bioinformatics point of view, as it is applicable to the Bioroebe project.

Via bisulfite treatment, and subsequent bisulfite sequencing, the methylation pattern (an epigenetic mark) can be determined.

Typically a methyl group is added to the carbon-5 position of cytosine residues, of the dinucleotide CpG.

The key idea here is that when we treat DNA with bisulfite, all Cytosine residues are converted into Uracil (C→U) - whereas 5-methylcytosine residues remain unaffected by bisulfite. In other words, the methyl-group protects against bisulfite treatments.

Let's have a look at a specific sequence, which should simplify the text above.

Take the following sequence:

5'-CCCGCAATGCATACCTCGCCG-3'

As you can see, this sequence contains several CpG islands (CG pairs), in fact, 3 CG runs.

Let's assume that no Cytosine is protected via a methyl group. To convert this sequence into the corresponding, bisulfite treated sequence, use the following toplevel-method API:

Bioroebe.bisulfite_treatment('CCCGCAATGCATACCTCGCCG') # => "UUUGUAATGUATAUUTUGUUG"

(Remember that the 5' end usually comes first, on the left hand side.)

shuffleseq

Shuffleseq will shuffle the given input sequence, while maintaining composition. So, if given ATGC, any combination of these four nucleotides can be used, such as ATCG or GCTA and so forth.

Internally, the following toplevel-method API can be used for this:

Bioroebe.shuffleseq()
Bioroebe.shuffleseq('ATGC') 

The emboss shuffleseq functionality, which inspired this functionality, can be seen here:

http://www.bioinformatics.nl/cgi-bin/emboss/shuffleseq

ShowOrf

The class ShowOrf can be shown to display, on the commandline, the open reading frames (ORF) of a given nucleotide sequence.

It may be best to just try the following examples and notice the differences:

showorf CATGACAGCAGCAGGCCCTCGGTCGTCGCTCCACATCAGCCGTCACATAGAGCTGATAG
showorf CACAGCAGCAGGCCCTCGGTCGTCGCTCCACATCAGCCGTCACATAGAGCTGATAG --all
showorf CATGACAGCAGCAGGCCCTCGGTCGTCGCTCCACATCAGCCGTCACATAGAGCTGATAG --frame2
showorf CATGACAGCAGCAGGCCCTCGGTCGTCGCTCCACATCAGCCGTCACATAGAGCTGATAG --f2
showorf CATGACAGCAGCAGGCCCTCGGTCGTCGCTCCACATCAGCCGTCACATAGAGCTGATAG --frame3
showorf CATGACAGCAGCAGGCCCTCGGTCGTCGCTCCACATCAGCCGTCACATAGAGCTGATAG --f3
showorf CATGACAGCAGCAGGCCCTCGGTCGTCGCTCCACATCAGCCGTCACATAGAGCTGATAG --frame-1_and_2

The most-inclusive of these options is -all. This option will show all 6 possible reading frames (three in the sense strand, three in the other strand).

Frame2 and frame3 refer to the second and third frame; --f2 and --f3 are abbreviations for this.

Note that the -- are mandatory for now. In the future this may change, so that the following may also work:

showorf CATGACAGCAGCAGGCCCTCGGTCGTCGCTCCACATCAGCCGTCACATAGAGCTGATAG f2

But for now, the -- are mandatory there.

Finding patterns in a sequence

Finding sub-patterns or sub-sequences in a longer sequence is evidently a hugely important task in bioinformatics.

If you wish to only find out how many sub-sequences of a given pattern exist in a sequence, such as the number of 'ATG' codons in a given sequence, you could use the following API:

Bioroebe.return_n_repeats "AGGACCTGAATGAAGGGCTCCTCAGTATACTGCAATAAGGCTATTAGCGAACAGCCATGCATGAACTGTGCGTGTCAGTCTAGACTTGATCATCTCGCGAAACCTTATTAGCTTTCGTAACGGACCTGACAACTAGAGGATCCCGTAACGTAGCCCGCCGCCTCGTGAGGTGACTATCCTACTACTGAGTTGCACCCTTAAGTGATCCCCTGGTTCTCTCTTAATACGAACAGATCAAGAAAAGTATTGCGGTCAGGTAACCCGCTACCTCCAGGATAATAATCCGTACGTTAGTCCGTATGTGCAGCTTGGGCGCCGAATTTCTCGGACGCAGACCTAATGCCGAGCTCTTTTGCAGTTGGAGGGTTGTAGGGCCTAGACAGATGCTCCTCGAAGGCACTCAAGCATAGGGTTGGTCCTCTTATATCATGGATACATCGCCAGCCTTTTGACCTGCGTAGTTTAAGATTGTTTATAGACAATCTCCACACTAATCCAGATGAACTGATTTCATCCACGAAAGTTGACTTTACGATGCTATTGGAGTCTGTATCCCCGCATGCGCCGAAAAATCCTAAAAGGCTACGCTAGTACATTTCCTCACTAGGACATACGTGCCATCGATAGACAGTTTGCTTGTCTGAAATAAGTTCTAGACCGTCTCACTTGTCTGCTGGTACCACAGTTAAAAAGTCATCCGGACAGGCAGATTTTTGCATCTCCTAGCTTTTATGCGAACCTCCGTTTAGATGTGCGTCTTTTTATAGGATCACGTAGCTCTTTAGCGAGTACCTTGAGTGTAATGTAACTTTGTAAGTAACTGTATAGGGGTTCCGGTTTCATACGAAGGGGGATTTCCTCTACCTCCTTGATAGCTCAGCTTCTGAATTGATAAGTAAGTAAGTAAGTAAG",'ATG' # => 13

This of course works with other sub-sequences as well, and really just makes use of .scan() of class String anyway.

This functionality was added in March 2021 to allow a way to scan for repeats.

The idea is to be able to specify a pattern such as:

(TAAG)n # simple_repeat
(ATAA)n # simple_repeat

And determine how often that pattern may be found in the given sequence.

Note that currently the implementation is in pure ruby; for long sequences this may make this too slow, so this may be implemented via java/jruby or C/C++ one day. For now, though, it will remain in pure ruby. (We probably have to give the user the choice to select what to use, once more than one way to use this has been added.)

Downloading via the BioRoebe project

This subsection is about downloading remote datasets through the BioRoebe project.

Let's take the phage lambda of E. coli. The refseq entry is at:

https://www.ncbi.nlm.nih.gov/nuccore/9626243

The genome is 48502 bp long (dsDNA).

The NCBI ID is: NC_001416.1

You can simply download this from the interactive bioroebe-shell via:

download lambda

Now you should have a .fasta file that holds this sequence. If you are the superuser then this may be stored in /root/Bioroebe/fasta/.

Next, we will try to identify att-sites, in particular attL.

You can see this described here:

http://ecoliwiki.net/colipedia/index.php/Cis_Element:Phage_lambda_attL

Or, if you just want the sequence, here:

https://www.ncbi.nlm.nih.gov/nuccore/M12458.1?report=fasta

So the sequence for attL is:

CCTGCTTTTTTATACTAAGTTGGCATTATAAAAAAGCATTGCTTATCAATTTGTTGCAACGAACAGGTCACTATCAGTCAAAATAAAATCATTATTTGATT

Ok, let's assign this in the BioShell:

set_search_for CCTGCTTTTTTATACTAAGTTGGCATTATAAAAAAGCATTGCTTATCAATTTGTTGCAACGAACAGGTCACTATCAGTCAAAATAAAATCATTATTTGATT

This is the full sequence but it can not be found; a subsequence can be found, though:

set_search_for AAAAAGCATTGCTTATCAAT

In regards to the above download lambda example, some more phages or viruses are available for "easy download".

Here is an incomplete list:

download P1
download P2
download T2
download T4
download T6
download T8

Note that a few more sequences are also aliased and available within the BioShell, such as via:

cytochrome c # this would refer to the human cytochrome C variant

I am not sure whether I will retain this, but for the more common proteins, when they are important like cytochrome c, then perhaps these will be added on an ad-hoc basis. We'll see in the future (May 2020).

The mmCIF format

The mmCIF format will eventually replace the older .pdb format. This subsection will include some information about that format.

Presently (May 2020) there is no support for the mmCIF format in the Bioroebe project, but this will eventually change.

Sugars and glyco-patterns

I am currently having to do an assignment related to glyco-patterns in proteins, in April 2021. As I have very little knowledge about this, I will expand this subsection here with specific information about glycosylation in proteins, and how this may relate to bioinformatics.

class Bioroebe::AnalyseGlycosylationPattern can be used to analyse potential glycosylation patterns in a sequence of aminoacids.

TaxID

You can find out the taxonomic ID correlating to a name or vice versa.

Presently, we prefer to handle the ID number itself.

Example:

taxid? 9606

This would show that 9606 corresponds to Homo sapiens.

Note that Version 1.0.15 of the Taxonomy submodule was written in April 2014. While it has been integrated into Bioroebe, it is currently somewhat broken. I will improve on this in the future.

Padding in the bioshell

By default, all strings/sequences output will be padded with 2 space characters on their left side.

You can check this by doing the following in the bioshell:

padding?

You can set another padding by doing either:

set_padding 0
setpadding 0

The number given will be how many space characters we will use.

To restore to the initial padding, either do set_padding 2 or set_padding default.

Appending a PolyA Tail in the bioshell

You can add a PolyA tail to a mRNA sequence by doing:

polyA

Note that for now, this will only append 250 A on the 3-prime end of the RNA in question. In the long run, we will try to make this more accurate. (Statement here made in June 2016)

Degenerate Primers

A sequence is called degenerate if some of its positions have several possible bases. An example will be shown in this section at a later time.

A degenerate primer is defined as:

A mix of oligonucleotide sequences in which some positions contain a number of possible bases, giving a population of primers with similar sequences that cover all possible nucleotide combinations for a given protein sequence. (Iserte 2013).

In short - degenerate primers are actually mixtures of similar, but not identical primers.

https://en.wikipedia.org/wiki/Primer_%28molecular_biology%29#Degenerate_primers

If you look at a protein sequence, the aminoacids there can usually be represented by more than one nucleotide sequence, due to the degeneracy of the genetic code. Thus, it becomes difficult to deduce which codon is used in a particular case.

Consider the following sequence as an example to illustrate this idea, where the possible alternative positions are denoted via []:

ATCGTT[GC]AAGT[AGC]ATC

This refers to a series of primers in which the seventh and twelfth nucleotides are degenerate.

The amount of degeneracy is defined by the number of different primer combinations in the mix. The above example would have a degeneracy value of six.

Using the IUPAC system for degenerate bases we can insert a single-letter code to represent our unknown region. In the above example we would use S in place of GC, and V in place of AGC.

You can try to determine the degenerate primers via the Shell component of the BioRoebe project. Issue the following instructions:

setdna 33
degenerate_primer

On a side note, keep in mind that the IUPAC system uses uppercased characters.

You can deduce the most likely mRNA sequence for any given aminoacid sequence via class Bioroebe::ConvertAminoacidToDNA.

Usage example for this class, from within ruby:

require 'bioroebe/conversions/convert_aminoacid_to_dna.rb'
Bioroebe::ConvertAminoacidToDNA.new('M-Y-T-N-R-S-S-R-STOP')

This would then yield the following mRNA sequence:

AUG-UAU-ACG-AAU-AGG-AGU-UCU-CGC-UAG

Used on the commandline this may output results that may look like this:

Do note that this class currently does NOT take into consideration the "most likely codons" for a given species. The code is simply like this right now:

array_with_our_codons = array_with_our_codons.sample

However had, it would be possible to change this if we add per-species support. Let's see if this is necessary; for the time being, we'll just use this random approach. Re-running the class thus may lead to a slightly different nucleotide sequence.

If you need a better overview, you can use the bioshell.

First start the bioshell, then try the following command:

deduce MTTAGKLIIRRSAAP

Displaying a useful aminoacid table on the commandline

To display a useful aminoacid table for a given aminoacid sequence, on the commandline, you can use the following class:

Bioroebe::DisplayAminoacidTable.new(ARGV)

From the commandline:

display_aminoacid_table AATGARNANANAAM

This class will also report the molecular weight of that protein.

The above output, via KDE konsole on black background, would look like this:

Since as of July 2022 invalid amino acids will be automatically filtered away before being assigned to the input.

Colourizing hydrophilic and hydrophobic aminoacids on the commandline

Via class Bioroebe::ColourizeHydrophilicAndHydrophobicAminoacids you can colourize aminoacids based on their hydrophobicity level.

Example output for this:

Proteases

This subsection contains some information about proteases.

Trypsin: https://en.wikipedia.org/wiki/Trypsin cuts at: Trypsin cuts peptide chains mainly at the carboxyl side of the amino acids lysine or arginine.

Chymotrypsin: https://en.wikipedia.org/wiki/Chymotrypsin cuts at: Chymotrypsin preferentially cleaves peptide amide bonds where the side chain of the amino acid N-terminal to the scissile amide bond is a large hydrophobic amino acid (specifically: tyrosine, tryptophan, and phenylalanine). Chymotrypsin will cleave proteins on the carboxyl side of aromatic or large hydrophobic amino acids.

Thrombin: https://en.wikipedia.org/wiki/Thrombin cuts at: Thrombin acts as a serine protease that converts soluble fibrinogen into insoluble strands of fibrin. It catalyzes the hydrolysis of Arg-Gly bonds in particular peptide sequences only.

Plasmin: https://en.wikipedia.org/wiki/Plasmin cuts at: Plasmin is a serine protease.

Papain: https://en.wikipedia.org/wiki/Papain cuts at: Papain prefers to cleave after an arginine or lysine preceded by a hydrophobic unit (Ala, Val, Leu, Ile, Phe, Trp, Tyr) and not followed by a valine.

factor Xa:

ER retention sequences

Some proteins may permanently reside in the lumen of the endoplasmic reticulum (ER).

Often such proteins will have a special signal sequence attached to their C-terminal part, such as KDEL (Lys-Asp-Glu-Leu).

KDEL is not the only signal that may be used, though. Some species may use different signals, such as:

aminoacids species
KDEL Vertebrates, Drosophila, Caenorhabditis elegans, plants
HDEL Saccharomyces cerevisiae, Kluyveromyces lactis, plants
DDEL Kluyveromyces lactis
ADEL Schizosaccharomyces pombe (fission yeast)
SDEL Plasmodium falciparum

If you work with the bioshell then you can simply use this method to query whether the given aminoacid sequence has a KDEL sequence:

KDEL?

Protein translocation into the ER

The paragraph above showed that there are retention signals for the ER - such proteins will remain inside of the ER.

But there are also signal sequences for import into the ER. Signal sequences (the signal peptide, SP) typically contain 7-12 hydrophobic amino acid residues.

The following shows some examples for this:

A. thaliana a-glucosidase II SP:MRSLLFVLSLICFCSQTALSGS
Tobacco chitinase            SP:MKTNLFLFLIFSLLLSLSSA
Barley a-amylase             SP:MGKNGSLCCFSLLLLLLLAGLASG
Soybean peroxidase           SP:MGSMRLLVVALLCAFAMHAGFSVSYAQ

NCBI

alt text

The NCBI provides a lot of information related to different organisms.

You can search for NCBI entries via the file ncbi.rb. This file resides at the path bioroebe/ncbi/ncbi.rb

You can also download NCBI files, that is, usually fasta files.

Let's provide some examples here.

If you know the ID, you can simply download this fasta file in the bioshell:

download_fasta 296010862
dfasta NC_000913.3

This will download the fasta sequence of the id 296010862, from NCBI.

You can batch download from instructions given in files as well. Simply store these IDs in a file, and then, you can do the following (if you aliased dfasta to download_fasta that is):

dfasta /FASTA_URLS.md

If the file exists locally, we will grab its content and batch-proceed.

You can also download from uniprot.

Example:

dfasta https://www.uniprot.org/uniprot/Q9CR68.fasta

Finding a substring

To find a substring you could use this method:

Bioroebe.find_substring 'GATATATGCATATACTT', 'ATAT' # => [2, 4, 10]

This will yield the positions, as an Array, of where the substrings can be found. The above means that the substring ATAT will be at position 2, 4 and 10. Note that this refers to the nucleotide position, and thus starts at 1, whereas ruby Arrays begin at 0 - keep that in mind when writing code.

You can also try to find a subsequence within the BioShell.

Try:

find ATG

This should discover all ATG nucleotides.

Obviously we can only 'find' something if this pattern is part of the string.

Nucleotide Table

You can display the nucleotide table via:

nucleotide_table?

You can also query a subsequence from this.

Example:

125-250

This would show the subsequence from 125 to 250.

If you have assigned a nucleotide string, you can also specifically manipulate certain positions.

Example:

658-660 = CCA

If you input only nucleotides into the BioShell, such as AGGTGCC, then hit enter, we will also assume that you wish to assign a new DNA sequence.

Example:

GTTGCAACAACCTACAGATGCGTTCAGTCTATCTATATACAAGAGGAAGAATACAA

Rcfiles::DirectoryAliases

By default the bioshell will try to make use of class Rcfiles::ExpandCdAliases for all systems that have the environment variable IS_ROEBE set to 1. This is the case for my home system, for instance.

class Rcfiles::DirectoryAliases allows us to quickly navigate to different directories. It is part of the rcfiles gem, so if you want to enable this functionality then install the rcfiles gem.

This may, however had, also slow down things considerably, so for most everyone else, this has been disabled.

If you wish to specifically enable it, nonetheless, do issue the following in the bioshell:

do_use_expand_cd_aliases

Try the following instruction in order to disable it again:

no_expand_cd_aliases

UniProt

UniProt is a freely accessible database of protein sequence and functional information. What makes it also useful for bioinformatics is that you can easily query the FASTA sequence of a protein.

Consider the protein called A2Z669. The entry to this protein can be found here:

https://www.uniprot.org/uniprot/A2Z669

And the corresponding FASTA sequence of that protein can be found here, if you append .fasta to the URL:

https://www.uniprot.org/uniprot/A2Z669.fasta

If you wish to save this file, from within Bioroebe itself, then you can use the following API:

Bioroebe.fetch_data_from_uniprot()
Bioroebe.fetch_data_from_uniprot('A2Z669')

NCBI also has entries related to UniProt.

Example:

https://www.ncbi.nlm.nih.gov/protein/P02768 https://www.ncbi.nlm.nih.gov/protein/P02768.2?report=fasta

This has the header RecName: Full=Serum albumin; Flags: PrecursorUniProtKB/Swiss-Prot: P02768.2.

From the bioroebe-shell, you can can fetch data from Uniprot, such as by issuing:

unitprot_fetch

AminoAcid composition

A small widget exists to show the amino-acid composition.

See:

The code for this resides under:

bioroebe/gui/gtk3/aminoacid_composition/aminoacid_composition.rb

This may be improved in the future, e. g. adding functionality to open a file, adjust the layout and connect it with other widgets.

CCAAT sequences

The CCAAT-Box in eukaryotes is an example for a promoter-proximal element.

From within the bioshell, you can try to find CCAAT sequences by issuing:

ccaat?

This is just a convenience feature - I was too lazy to use e. g.

find CCAAT

Promoters in general - weak and strong promoters

Bacteria tend to use different operons. For instance, the lac-operon or the trp-operon in E. coli.

Let's compare the consensus sequence between these two operons, at the two most important locations - which is the -35 area and the -10 area:

             -35    (spacer of 17 bp)  -10
lac-operon  TTTACA                    TATGAT
trp-operon  TTGACA                    TTAACT

Interestingly while the -10 area is typically called the TATA box, neither of these two operons make use of the TATA sequence there - we see TATG and TTAA.

Strong promoters are very useful for biotechnological applications, so the bioroebe gem will attempt to collect strong promoters into a .yml file. This should eventually help compare different promoters to one another.

GenBank

alt text

The Genbank format allows for the storage of information in addition to a DNA or a protein sequence. It can thus hold more information than a typical FASTA file contains.

You can parse or generate GenBank formats via the bioroebe project as well.

The main class that will generate GenBank file formats is called class GenbankFlatFileFormatGenerator. Not a very simple name that one is!

The basic use is to pass the DNA sequence into it. Then it will output a file that can be stored.

You can also "generate" this sequence from within the BioShell.

to_genbank
togenbank
togen

If you wish to find out the current GenBank version, do either of the following in the bioshell:

current_genbank_version?'
current_genbank_version'
genbank_version?'
genbank_version'

If you have a local genbank file and want to convert the sequence information into a fasta file (file conversion) then you can use the following API:

Bioroebe.genbank_to_fasta
Bioroebe.genbank_to_fasta('foobar.genbank')

Note that this will generate into the current working directory for now, so it may be better to start this in a directory that is more like a temp-directory for working files.

A bin file exists as well:

bin/genbank_to_fasta

This is also handled by the more generic method called Bioroebe.parse(), which attempts to parse any file that may be relevant in regards to bioinformatics eventually. For now (May 2021) only a few files are supported.

A small ruby-gtk3 widget exists for this as well:

Bioroebe::GUI::Gtk::FormatConverter.new

You can print the aminoacid table for a specific amino-acid sequence from within the bioshell.

Usage examples:

print_aminoacid_table
print_aa_table
patable

So, for example, consider a sequence that has 1700 aminoacids. Then this functionality would, on the commandline, yield something like this:

Bioroebe.input_as_dna()

Take this example:

Bioroebe.input_as_dna 'UUTGAGGACCA'

This would return a string such as "TGAGGACCA".

The two U characters are dropped since they are RNA rather than DNA.

class Bioroebe::DnaToAminoacidSequence

class Bioroebe::DnaToAminoacidSequence can be used to simply translate from a DNA sequence into the corresponding aminoacid sequence.

It can be called from the commandline as well, via bin/dna_to_aminoacid_sequence.

Example:

dna_to_aminoacid_sequence ATGATGCCTCCAGGG --use-this-codon-table=bacteria
dna_to_aminoacid_sequence ATGATGCCTCCAGGG --use-this-codon-table=humans

By default the codon table for humans, aka eukaryotes to be more precise, will be used.

Do note that the class responds to different reading frames, such as frame1, frame2 and frame3.

Example:

dna_to_aminoacid_sequence ATGATGCCTCCAGGG --frame1
dna_to_aminoacid_sequence ATGATGCCTCCAGGG --frame2
dna_to_aminoacid_sequence ATGATGCCTCCAGGG --frame3

You can also create a new sequence, and translate it via another codon table.

Example:

seq = Bioroebe::Sequence.new("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
seq.translate(table: "Vertebrate Mitochondrial") # => "MAIVMGRWKGAR*"

Note that this currently changes the default codon table in use.

At a later point this may be changed to carry the codon table in a local hash. But for now this is how things are there.

If you wish to stop at the first stop-codon you can use:

--stop-at-the-first-stop-codon

Example:

dna_to_aminoacid_sequence ATGATGCCTCCATGAGGGTTGTGAATGACTACGAATGA --frame1 --stop-at-the-first-stop-codon

Since as of 2021 you can also start the small ruby-gtk3 widget for this, via:

dna_to_aminoacid_sequence --gui

This widget looks something like this:

Useful databases for bioinformatics:

This subsection just contains some useful databases for bioinformatics.

→ Genbank: https://www.ncbi.nlm.nih.gov/genbank/ → RefSeq: https://www.ncbi.nlm.nih.gov/refseq/ → uniprot: https://www.uniprot.org/ → Meme (Motif-Based sequence analysis tools): http://meme-suite.org/ → NetPhorest: http://netphorest.info → uniref: https://www.uniprot.org/help/uniref → uniparc: https://www.uniprot.org/help/uniparc → trembl → swissprot → pir → owl → blastp → tblastn → phi-blast → delta-blast → interproscan → cd server → prowlescan → scanprosite → pattinprot → sirw → match box → gibbs → dialign → blockmakei → pdb → swissmodel → scop → modbase → cath → mmdd → consurf → castp → protskin → ligandprotein → predictprotein → o-glycobase → pshophobase → swissmodel → whatif → esypred3d → ebi

Oligo Frequency

You can analyze the oligo frequencies.

First, assing some sequences.

Next, do something such as this:

two
three

This will show the distribution of the oligos.

Number of chromomes in different species

Name of the organism Latin name Number of chromosomes
Zebrafish Danio rerio 16
House mouse Mus musculus 20

Working with start and stop codons

If you want to display both start and stop codons in a colourized manner, within the interactive bioroebe shell, then issue the following command:

start_and_stop?

If you work with a Sequence object, that is, an instance of class Bioroebe::Sequence (also aliased to Seq), you can query this sequence object whether it has a stop codon or whether it does not:

sequence_object.has_stop_codon?

This will return a Boolean result - either true or false, as the name of the method may indicate.

The stop codons that are used for this query are stored in a toplevel instance variable, which can be accessed in this way:

Bioroebe.stop_codons?

Do note that the method .has_stop_codon? will scan the whole sequence, without taking the frame (1, 2 or 3) into consideration.

If you wish to only look for stop codons in a particular frame then you can pass, as argument, any of the following:

:frame1
:frame2
:frame3
:in_frame1
:in_frame2
:in_frame3

Specific examples:

require 'bioroebe'; include Bioroebe; x = Seq('ATC GTC GGA ATAG'); puts x.has_stop_codon?         # => true
require 'bioroebe'; include Bioroebe; x = Seq('ATG GTC GGA ATAG'); puts x.has_stop_codon? :frame1 # => false
require 'bioroebe'; include Bioroebe; x = Seq('ATG GTC GGA ATAG'); puts x.has_stop_codon? :frame2 # => true
require 'bioroebe'; include Bioroebe; x = Seq('ATG GTC GGA ATAG'); puts x.has_stop_codon? :frame3 # => false

If you look at the above four lines then you can see that the four sequences are all the same. The first sequence has a stop codon; the other three variants will check for which frame has a stop codon, which is the line with frame

  1. If you look carefully, that sequence would check on this sequence:

    TGGTCGGAATAG TGG TCG GAA TAG

This does indeed have a stop codon (TAG, which is UAG in mRNA).

DetectMinimalCodon

You can use DetectMinimalCodon to detect the minimal codon. What does this mean?

Well, say that you have the following two (different) codons:

TTT
TTC

The last position is a bit redundant, though, in the sense that it can be written as an Y too, under the IUPAC naming scheme.

So this is what would be returned:

Bioroebe::DetectMinimalCodon[["TTT", "TTC"]] # => ["TTY"]

The Levensthein distance

The Levensthein distance - also called a 'string metric' - was formulated in the year 1965. It calculates the score between two strings, or nucleotide sequences, that are aligned to one another. Score here means keeping track of the amount of overlap or non-overlap between two different sequences.

Let's have a look at a specific example to illustrate the idea behind it.

Consider the following two nucleotide strings, aptly named a and b for the purpose of this example:

(a) CGGGGGGCTACTT
(b) CGGTGGGTGTG

Since we also want to show how to work with gaps, we will add two gaps to this sequence, denoted by the - character, in the second sequence:

(a) CGGGGGGCTACTT
(b) CGGTGGGT--GTG

As can be seen, these two sequences are somewhat similar to one another now, at the least in some areas. The first three nucleotides are the very same, then there is one mismatch, then three more nucleotides that are the same, another mismatch, two gaps, a mismatch, a match and another mismatch. It would be useful to calculate the similarities and dissimilarities via a numeric value. The higher this numeric value

  • aka the higher this number - the more different these two strings are.

For any given position in the sequence, between these two nucleotide sequences, there are three possibilities:

(1) a <b>match</b> (such as C being equal to C)
(2) a <b>mismatch</b> (such as C being NOT equal to T)
(3) a <b>gap</b> (as denoted via the <b>-</b>)

Now - the Levensthein distance will consider only a gap or a mismatch, as part of its edit distance. The edit distance denotes how many operations are needed to transform one string into the other (and vice versa of course).

In our example above, between the two strings at (a) and (b), the edit distance is 6, since there are two gaps, and four mismatches.

Next, we must also use a specific scoring scheme to denote the number of points (added, and multiplied, as a total of the cost for a match, a mismatch and a gap).

So for example, if you use the following scoring scheme:

match cost    = 10
mismatch_cost = -3
gap_cost      = -1

Then the total score for the above sequence is 56. You simply calculate the number of times a match, mismatch or gap occurs, via the corresponding scoring scheme.

This is exactly what class Bioroebe::Levensthein will do; it can be found at the following file path:

bioroebe/string_matching/levensthein.rb

You can call this class and pass the two sequences to it, as the two arguments.

Alternatively you can also call bin/levensthein which is also distributed with the bioroebe gem.

Usage example for the commandline:

levensthein CGGGGGGCTACTT CGGTGGGTGTG

If you dislike the colours, simply use:

levensthein CGGGGGGCTACTT CGGTGGGTGTG --disable-colours
levensthein CGGGGGGCTACTT CGGTGGGTGTG --disable-colors

If you want to use another scoring scheme for these three costs (for match, mismatch and gaps) then you can pass any of the following flags:

--use-different-costs
--input-costs
--costs

This will prompt you for input to these three cost values.

Note that this class has been created initially in order to answer exam-related questions pertaining to a university course. I grew tired of having to calculate this semi-manually, so I simply let ruby output the relevant information instead.

You can also read in the score for match/mismatch/gap from a local file. The format for this file should stay simple, e. g. one entry per line, like in this way:

gap:       5
mismatch: -3
match:     3

The commandline-flag to then read from such a file is --use-this-file=:

--use-this-file=/opt/bla.md
levensthein --use-this-file=/opt/bla.md

So, in other words, you set the scores there on a line-by-line basis.

You can also omit values, e. g. omit match, which will then make the class use the default (hardcoded) score for a match. It may be simpler to specify all three values, though - less need to have to think about things when they are explicit.

You can use all of this from within other ruby files, of course.

So, the above method for reading from a local file, is also made available via the following method:

.use_this_file_for_score=

Several aliases may exist to various methods. For example, to designate string1 and string2 (the two sequences that are to be compared), you can use any of the following methods:

.string1=
.string2=

.set_string1()
.set_string2()

.set_sequence1()
.set_sequence2()

Whatever suits your mental model the best should work just fine.

You can read in the two main strings from a file as well, via:

--read_sequences_from_this_file=/Depot/j/foo2.md

Furthermore, you can make use of this via the following method call:

.read_sequences_from_this_file()

The format here is simple as well: the very first line is the first string, and the very second line simply becomes the second string. Keep it simple, folks.

Since as of April 2018, you can also use this class from the commandline in a slightly simpler way. Consider the following commandline usage:

editdistance T-GT- TTGTG --match=+6 --mismatch-3 --gap=-3
levensthein  T-GT- TTGTG --match=+6 --mismatch-3 --gap=-3

You can see that we specify which match, mismatch and gap cost to use here. First and second argument are still our two target sequences that we wish to compare.

Since as of March 2018, there is now a small GUI wrapper for levensthein available, at bioroebe/gui/gtk3/levensthein_distance/levensthein_distance.rb, in ruby-gtk3. It is, however had, somewhat incomplete. I will extend it as time (and motivation) permits in the coming months. Read the section GUI Components more if you are interested in this.

If you want to start this from bin/levensthein then use this:

levensthein --gui

class Bioroebe::SimplifyFastaHeader

This class resides under bioroebe/fasta_and_fastq/simplify_fasta_header/simplify_fasta_header.rb.

It was created mostly to simplify the header of a .fasta file, in particular of protein files, so that they can more easily compared in a phylogenetic analysis.

By default it will create a new .fasta file, upon reading an existing .fasta file. If you want to overwrite the original FASTA file, use the --overwrite option.

Example:

simplify_fasta_header foobar.fasta --overwrite
simplify_header foobar.fasta --overwrite

(I aliased this towards the bin/simplify_fasta_header executable).

Entrez

Entrez: offers a collection of different databases tied together. Thus, it offers a vast amount of publications at one's disposal.

More information can be seen here:

https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html

The PRODORIC Database

PRODORIC (prokaryotic database of gene regulation) is a comprehensive database about gene regulation and gene expression in prokaryotes - in other words, a database for regulatory networks in prokaryotes.

It includes a manually curated and unique collection of transcription factor binding sites. A variety of bioinformatics tools for the prediction, analysis and visualization of regulons and gene reglulatory networks is included. Metabolic networks can be studied as well.

A useful way to make use of this database is to look for pathogenic organisms such as Pseudomonas aeruginosa.

http://prodoric.tu-bs.de

Convert DNA into RNA

You can "convert" a DNA string into the corresponding mRNA representation.

In order to do so, try any of the following in the bioshell:'

to_mrna
tomrna
to_mRNA
mrna

To demonstrate this in the bioshell, do:

random 9

# Generating some DNA of length 9.

Or setting the DNA sequence to ATGGTAGAC.

 5' - ATGGTAGAC - 3'

And then:

tomrna

5' - AUGGUAGAC - 3'

If you need an API to convert DNA into RNA, you can use this:

Bioroebe.to_rna(your_DNA_sequence_goes_in_here)

Internally this is implemented via class Bioroebe::DNA, which has a method called .to_rna().

Changing the first nucleotide in the bioshell

You can change the first nucleotide of a sequence by doing this in the bioshell:

first A

The argument should be the nucleotide you wish to see appear on the first position there.

Decoding a codon

A codon in the mRNA, such as GUG, is decoded by the ribosome and the cognate tRNA, into the corresponding aminoacid - in this case V (for valine).

The toplevel API for using this in Bioroebe is this:

Bioroebe.codon_to_aminoacid('GUG') # => "V"

Of course you can also work with a sequence object, but if you just need a simple toplevel method then this should work fine.

The default codon table is 1, aka eukaryotes such as humans. In the long run, the user will be given the possibility to select other codon tables - but for now this has to suffice.

Since as of July 2021 you can designate another codon table than the default eukaryotic one. This will be stored in the instance variable @use_the_codon_table_for - see the file bioroebe/conversions/dna_to_aminoacid_sequence.rb for this.

You can also use this functionality via a ruby-gtk3 widget: see the "DNA to Aminoacid Sequence convert" widget for this.

Concatenating Bioroebe-Sequences

You can concatenate two Bioroebe sequences like this:

x = Bioroebe::Sequence.new('ATGGATCGATGC')
y = Bioroebe::Sequence.new('TTTGATCGATGC')
z = x + y
puts z # => "ATGGATCGATGCTTTGATCGATGC"

This functionality is similar to e. g. biopython. The + symbol (a method call) shall suffice for this operation.

Do note that z will then be a String, not a sequence object anymore. (This may be subject to change in the future, but for now, aka February 2020, it is that way.)

Calculating the GC content via class Bioroebe::CalculateGCContent

Calculating the GC content of a DNA sequence is trivial. You can easily do so in ruby or python just fine. So why create a class for this?

There were mostly two reasons as to why:

  • I wanted to tie the functionality of the class into a method. That method is ** Bioroebe.gc_content()**, so it can be used as toplevel method as-is.

  • I needed to control padding, layout and colours, so it was simpler to add a class that can be customised in this regard.

Usage example of the toplevel method:

Bioroebe.gc_content('ATCG') # => 50.0

Inferring the type of a sequence (DNA, RNA or Protein)

The following is a bit experimental as of 2021. In the long run it may be improved, but for now it's not quite perfect code. At any rate.

Say that you have a sequence and you do not know whether it is DNA, RNA or Protein.

You can use the following API to determine the type of sequence:

Bioroebe.infer_type_from_this_sequence

Examples:

Bioroebe.infer_type_from_this_sequence('ATCGATCATTCGATGCAGTCA')   # => :dna
Bioroebe.infer_type_from_this_sequence('UCAGUCAGUCAGUCAGUACGUCA') # => :rna
Bioroebe.infer_type_from_this_sequence('ATGGGVVVIIILLP')          # => :protein
Bioroebe.infer_the_type_from_this_sequence('ATGGLLLLLGVVVIIILLP') # => :protein; this alias works as well

Keep in mind that this is not very sophisticated yet. You may have to include statistical analysis in order to more reliably determine whether something is RNA or Protein, for instance. Or a protein that may be short and consist of only A (Alanine), T (Threonine), C (Cysteine), and G (Glycine).

However had, for the time being, the simple code will have to suffice, so just keep this in mind when you use the API for now.

Generating Palindromes and working with palindromes in general

If you need to generate a palindrome (in dsDNA form) of a specific size, such as 6, then you can use bin/palindrome_generator on the commandline for that task, such as in:

palindrome_generator 6
palindrome_generator 10
palindrome_generator 14

Simply provide the number, as length of that palindrome.

Internally class Bioroebe::PalindromeGenerator handles palindromes; some palindrome-specific code may also be found in bioroebe/toplevel_methods/palindromes.rb.

You can also use a toplevel method for this. Example:

Bioroebe.palindrome_generator 4 # => "CTAG\nGATC"

In June 2020 code was added to "display" a 2D structure of RNA or DNA palindromes. The code for this resides in the file require 'bioroebe/palindromes/palindrome_2d_structure.rb'.

Usage example from within Ruby:

Bioroebe::Palindrome2DStructure.new('CGCGAATTCGCG')
Bioroebe::Palindrome2DStructure.new('CGCGAA|TTCGCG')

(The | can be used as visual marker; it will be ignored by the class.)

The sequence CGCGAATTCGCG, as provided in the example above, would then yield a structure such as this here:

   A=T # 2
   A=T # 2
   G≡C # 3
   C≡G # 3
   G≡C # 3
   C≡G # 3 
  /   \

Note that this has a few shortcomings; the class can not handle bulges right now (bulges are unpaired RNAs, so they look a bit looped out of the main structure). One day support for this may be added, but for now this has to suffice - I needed to just quickly display such a simple structure on the commandline in June 2020.

If you need to determine whether a given sequence is a palindrome or whether it is not, consider using the following toplevel API for this:

Bioroebe.is_palindrome?
Bioroebe.is_palindrome?(ARGV)
Bioroebe.is_palindrome?('CGCGAATTCGCG') # => true
Bioroebe.is_palindrome?('CGCGAATTCG') # => false

You can generate palindromes from within the bioshell by issuing this command:

generate_palindrome 20

You can also scan for palindromes by issuing this:

discover_all_palindromes

Showing the atomic composition of a DNA sequence

You can use the following API to show the atomic composition of a given protein sequence:

Bioroebe.atomic_composition
Bioroebe.atomic_composition 'GGGGA' # 5 aminoacids: glycine and alanine

This will report the atomic composition for this sequence on the commandline.

Note that this currently only works for proteins. At some later point in time we may add support for showing this for nucleotide sequences as well.

Open reading frames: the longest Open Reading Frame (ORF) in a given sequence

ORF is an abbreviation for Open Reading Frame.

Let's look at a simple image to show the situation at hand, in regards to ORFs:

Most ORF-related functionality can be found in the following file:

require 'bioroebe/toplevel_methods/open_reading_frames.rb'

Naturally there is some overlap with other classes and code stored in other files, as ORFs are so important in biology.

If you wish to quickly display the longest open reading frame on the commandline, make use of bin/longest_ORF.

The following shows an example for this:

longest_ORF ATGAAACCCGGGAAAGGGCCCAAATTTGGGAGACCATGAATGAAACCTAGTTAG

If you need to work within ruby then use this method:

Bioroebe.return_longest_ORF_from_this_sequence(ARGV)
Bioroebe.return_longest_ORF_from_this_sequence("ATGAAACCCGGGAAAGGGCCCAAATTTGGGAGACCATGAATGAAACCTAGTTAG")

The code used therein will check both dsDNA strands, though; and the default start and stop codons are used, so this may not always be what you desire to use. But for most basic tasks this should work quite well.

Keep in mind that as a rule of thumb that open reading frames that are not translated into a protein usually contain all sense codons at about the same frequency, whereas the reading frames that are translated into protein tend to use a subset of the codons.

Additional methods that may be useful are aggregated here as simple examples without further explanation:

Bioroebe.return_longest_ORF_from_this_sequence(ARGV)
Bioroebe.return_all_open_reading_frames_from_this_sequence(ARGV)
Bioroebe.return_n_ORFs_in_this_sequence(ARGV)
Bioroebe.open_reading_frames_to_aminoacid_sequence(ARGV)

class Bioroebe::ShowThisCodonTable

This class can be used on the commandline (or in straight ruby code as-is) to quickly show the particular codon table. The input should be a number, such as 1 or 2 or 3 and so forth.

Usage examples:

Bioroebe::ShowThisCodonTable.new(1)
Bioroebe::ShowThisCodonTable.new(2) # and so forth

This is simply a small visual cue.

A commandline example, as output, will be shown next, for codon table 9:

class Bioroebe::ShowThisDNASequence

class ShowThisDNASequence can be used to quickly display a DNA sequence, based on one strand as input. Both the input-strand as well as the complementary strand will be shown on the commandline.

Usage example:

require 'bioroebe/utility_scripts/show_this_dna_sequence.rb' # If you want to require it specifically.
Bioroebe::ShowThisDNASequence.new('GAATTCGAATTCGAATTCGAATTCGAATTC')

cutseq

You can cut out any sequence via cutseq. There are two basic ways how to do this, syntax-wise:

cutseq 5 6
cutseq 5-6

This will remove nucleotide 5 and 6. Remember that we start to count at nucleotide 1.

Example:

set_dna ATCGATCGAT
cutseq 3-5

A similar syntax is possible:

cutseq 5 +3

This is the same as above but +3 means that we add this number to 5, so it is equivalent to:

cutseq 5 8

Another way is to use the interactive variant. This will ask you for the start and end position.

cutseq

Start Codons:

A start codon is where the ribosome will start to translate the mRNA into the corresponding polypeptide chain.

Different organisms may use slightly different start codons.

The most common one, normally, is the start codon AUG (or, in the DNA, ATG).

The code ::Bioroebe.start_codon? will point to this main start codon in use. You can redefine this too.

The value currently defaults to:

::Bioroebe.start_codon?

You can also show all translated ORFs. There are several ways how to do so, but a simple one is the following:

showorf

Since as of June 2016, you can also align all ORFs of the main sequence in the *bioshell. Simply issue the following command for this:

align_ORFS

If you wish to just show the first (assumed) ORF, do:

show_first_orf
first_orf
1storf
1st_ORF

If you input ORF? then a colourized result will be shown, if colours are enabled. But not always is the whole area shown also coding area. You can designate a subset of a nucleotide sequence to be the coding area, such as by doing this:

coding_entry 51..3251

This would notify the bioshell that only nucleotides from position 51 to (including) position 3251 will be colourized, when doing another "ORF?" invocation.

Restriction Enzymes

This subsection will eventually be expanded to explain various things about restriction enzymes (restriction endonucleases) in general - and also how to make use of them from within the bioroebe project.

DNA has, as most of us already know at this time, four different bases - namely A, T, C, G.

Assuming a random distribution of these bases, we can determine how common a particular restriction enzyme will cut any particular DNA sequence, if we know the length of the DNA sequence; here assuming an equal chance for all four bases to be part of that sequence, which may not be true (A-T may be more likely to see in an intron, for example; and G-C distribution is also not uniform within any given genome either).

This frequency of cutting can be calculcated as 4 ** n, where n is the length of the recognition sequence. So, for instance, the restriction enzyme EcoRI is a six-cutter and will cut 1 in (4 ** 6 = ) 4096 base pairs. In other words, a hexanucleotide site will occur every 4096 base pairs, viewed statistically. (The real distribution of course depends on the particular genome sequence at hand. Genomes may differ widely compared to one another.)

A restriction enzyme that is a four-cutter, aka recognizing a tetranucleotide sequence, will cut 1 in 256 ( 4 ** 4 aka 4⁴). An example for a four-cutter is HaeIII.

The following simple table shows this for a two-cutter up to a 10-cutter, just to have this available as a table without you needing to calculate anything anymore:

n-cutter cutting frequency
two-cutter 1:16
three-cutter 1:64
four-cutter 1:256
five-cutter 1:1024
six-cutter 1:4096
seven-cutter 1:16384
eight-cutter 1:65536
nine-cutter 1:262144
ten-cutter 1:1048576

The logical consequence of this also is that such a four-cutter will produce shorter DNA fragments than a six-cutter.

Over 3,000 restriction enzymes have been studied in detail, and more than 600 of these are available commercially.

Note that there are 5 different types of restriction enzymes:

Type l
Type II
Type III
Type IV
Type V

What database could be used to find out which restriction enzymes are available (and described)?

Rebase could be used:

http://rebase.neb.com/rebase/rebase.html

For FTP access, consider using:

ftp://ftp.neb.com/pub/rebase/

Such as:

ftp://ftp.neb.com/pub/rebase/emboss_e.108

Wikipedia has a somewhat simpler interface though:

https://en.wikipedia.org/wiki/List_of_restriction_enzyme_cutting_sites:_A

What if we encounter a restriction enzyme such as GGTCTC(1/5)?

This indicates the point of cleavage for non-palindrome enzymes.

For instance, GGTCTC(1/5) indicates cleavage at:

5' ...GGTCTCN/...3'
3' ...CCAGAGNNNNN/...5'

Now after that theoretical introduction, let's look at as to how the bioroebe project deals with restriction enzymes.

You can digest your DNA sequence or rather simulate digestion. Provided that you have a target DNA sequence, you can 'run' a digest by doing this:

digest KpnI?
digest EcoRI

This will show the corresponding fragments that are generated.

Note that different restriction enzymes cut at different frequencies, depending on how often they are situated within the target genome.

In other words, the longer the restriction enzyme, the less frequently it will cut.

If you need some help in deciding which restriction enzyme to use, there is a small table overview on the commandline available, which you can invoke by issuing any of the following commands:

restriction_table?
restrictiontable?
restriction_table

If you wish to see the available restriction enzymes in the Bioroebe project, you can use the following toplevel API:

require 'bioroebe/enzymes/show_restriction_enzymes.rb'
Bioroebe::ShowRestrictionEnzymes.new

If you need to determine whether a specific restriction enzyme is registered, and part of the Bioroebe project, then consider using the following API:

Bioroebe.has_this_restriction_enzyme? 'MvnI'    # => true
Bioroebe.has_this_restriction_enzyme? 'EcoRI'   # => true
Bioroebe.has_this_restriction_enzyme? 'EcoRII'  # => true
Bioroebe.has_this_restriction_enzyme? 'EcoRIII' # => false

You can query the restriction enzymes from the interactive BioShell as well (if it isn't buggy).

For example, using "ecori?" as input, the following result would be shown:

Colourizing FASTA sequences

You can colourize FASTA Sequences via:

Bioroebe.colourize_this_fasta_dna_sequence

or slightly shorter:

Bioroebe.colourize_this_fasta_sequence

This currently (January 2019) only works for DNA; and uses hardcoded colours meant to be used on a black background, in the terminal.

In the future this may change and allow for more customization, but for now, this is the way it is.

Note that an executable at bin/colourize_this_fasta_sequence also exists, so you can use this if you symlinked this file, via the commandline.

Example:

colourize_this_fasta_sequence ATGTACGTACGTAGTCAGACGCA

Let's look at an image as well, to visualize this feature.

The following example will lead to this image:

puts Bioroebe.colourize_this_fasta_dna_sequence('ATGCGCATGCGCGTATTAGTATTAATGCGCGTATTAATGCGCGTATTA')

And another example here:

SUMO - Small Ubiquitine

SUMO stands short for the Small Ubiquitin-like Modifier. This is an interesting protein that is 'attachable' (can be attached to other proteins), in that it has a role in regards to folding (of some proteins) as well as suppressing protein degradation.

SUMO-modified proteins typically contain the tetrapeptide consensus motif:

Ψ-K-x-D/E # where Ψ is a hydrophobic residue, K is the lysine
          # conjugated to SUMO, x is any amino acid, D or E is
          # an acidic residue (aspartate or glutamate).

See also the following wikipedia entry here, for more information about SUMO : https://en.wikipedia.org/wiki/SUMO_protein

Using Bioroebe.cliner()

This short subsection just briefly documents the toplevel method called Bioroebe.cliner().

The primary use case for this method is to simply display a horizontal line - hence the name "line" or rather "liner". It "makes a line".

What is the "c" meaning? This refers to the old usage called colourized_liner. As that was too annoying to type, it was shortened to cliner, so the "c" stands for colours or colourized line.

You can use it as follows:

Bioroebe.cliner('*') {{ colour: 'steelblue'  }}
Bioroebe.cliner('=') {{ colour: 'dodgerblue' }}

Several commandline scripts make use of that. I found it to be useful to have a short visual separator ready.

Using Bioroebe.three_delimiter()

The method Bioroebe.three_delimiter() can be used to split a String into a String where every third position has a trailing '|' token.

So for instance:

Bioroebe.three_delimiter 'ATGGGGATGTAGGTA' # => "ATG|GGG|ATG|TAG|GTA"

The primary reason why that was added as a toplevel method has been because it may be visually simpler to identify the individual codons via your eyes that way.

Generating a random DNA sequence

You can "generate" a random DNA sequence from the commandline via:

generate_random_dna_sequence n_DNA_nucleotides_here
generate_random_dna_sequence 300

This functionality is actually stored in bin/random_dna_sequence so I just aliased it to the above. The more 'correct' invocation would be this:

bin/random_dna_sequence 300

(I use lots of aliases to simplify calling scripts.)

Count how many nucleotides are in a given DNA string

You can use class Bioroebe::CountAmountOfNucleotides to obtain information about the amount of nucleotides in a given DNA string.

You can query this information from within the Shell too, via:

dna_analyze AGCTTTTCATTCTGACTGCAACGGGCAATAT

This is somewhat similar to:

https://github.com/stungeye/Rosalind-Ruby/tree/master/challenges/Counting%20DNA%20Nucleotides

This class can also read in input from a locally existing file - just supply the path to that file as first argument to class CountAmountOfNucleotides.

There also exists a simpler toplevel method, for counting how many AT and GC nucleotides are in a given String.

Usage examples:

Bioroebe.count_GC 'ATTATTATGGCCAATATA' # => 4
Bioroebe.count_AT 'ATTATATACCGCGCCCATATAAA' # => 15

Last but not least, the two files bin/count_GC and bin/count_AT can be used to count how many AT and GC pairs can be found in the String at hand.

Removing a subsequence from a RNA or DNA string

You can remove a subsequence from the main DNA string via:

Bioroebe.remove_subsequence()
Bioroebe.remove_subsequence(["ATCGGTCGAA", "ATCGGTCGAGCGTGT"], 'ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG')

The first argument can be an Array of (sub)sequences that you may wish to remove from the larger (second) string, as the second example shows.

Nucleotide composition of a dsDNA sequence

The Chargaff rules stipulate the following for a dsDNA sequence:

purine + purine = pyrimidine + pyrimidine

Or, more specifically perhaps via:

A = T
G  C
# left side for the purines, right side for the pyrimidines
# so A + G = T + C 

In other words, in a given dsDNA molecule the content of A and G equals the content of C and T. This is quite logical if you think about it; an A must base pair with T, and vice versa; and a G must base pair with a C, and vice versa. So the "rule" is actually the only possible way for a dsDNA strand to exist in a stable manner, at the least when this dsDNA is in a "natural" state, such as in a living cell - one strand's composition simply specifies the other strand's composition, and vice versa. It is a genetic backup system after all. (Some teachers call one strand Watson and the other Crick. I always hated this labeling and refused to use it, but I wanted to mention it here - when I first heard it, it confused the hell out of me. I still think it is a misnomer; I simply call these two DNA strands as-is. After all you can find genes on both strands too. The promoter area will define the start area of transcription.)

If you need to calculate missing frequencies in such a dsDNA strand, you can do so via the following API:

Bioroebe.base_composition()
Bioroebe.base_composition('52% GC')
Bioroebe.base_composition('76% GC')

Pass in a string to this method such as '52% GC' or '52%GC', as the second example shows. It will then return a Hash with the corresponding frequencies for the four nucleotides. Ideally use an even number here, as we have to calculate G+C, and G+C is harder to achieve cleanly if the target number is odd. 25% would mean ... 13% G and 12% C? That won't work so well.

Let's view two specific examples that demonstrate how Bioroebe.base_composition() can be used:

Bioroebe.base_composition('64 % GC') # => {"A"=>18, "T"=>18, "C"=>32, "G"=>32}
Bioroebe.base_composition('80% AT')  # => {"A"=>40, "T"=>40, "C"=>10, "G"=>10}

If you need to output this on the commandline, you can prepend report_ or show_ to this method either way:

Bioroebe.report_base_composition('64 % GC')
Bioroebe.show_base_composition('64 % GC') # Works just as the ^^^ above.

Within the bioshell you can determine the composition of the current main sequence via:

composition?

Keep in mind that the more nucleotides are part of a string, the closer the output will resemble the original base-composition hash. Worded differently: if you have only 50 nucleotides, then achieving the identical base composition as specified is much harder than if you input, say, 5000 nucleotides.

Sometimes you may wish to generate a DNA sequence with a specific base composition, such as if you wish to "reflect" the codon usage of a particular organism.

The general API for this is as follows:

Bioroebe.generate_random_dna_sequence(n_elements, hash_frequencies)
Bioroebe.generate_random_dna_sequence(100, {A: 40, T: 20, C: 20, G: 20})

Note that the second argument is optional; you can omit a frequency-hash in which case 25% is assumed for each of the four main DNA nucleotides.

You can do this via the bioshell too.

Examples:

base_composition 10 20 30 40
base_composition 5 5 5 85

These values are for A, T, C and G respectively.

Readline support

alt text

Ruby itself offers readline support. BioRoebe will try to use Readline too, in particular for the BioShell component.

If readline is not available, oldschool $stdin.gets will be used instead to obtain user input.

Readline completion support is also available, since as of September 2014. That way you can use a few characters less to type some of the commands.

Since as of the year 2020, code for readline resides in the file bioroebe/readline/readline.rb.

Bioroebe::ColourScheme and the demo

You can run the file bioroebe/colours/colour_schemes/colour_scheme_demo.rb and store the output into a .html file.

Then you can look at the default colour schemes, as the following picture shows:

Bioroebe::CreateRandomAminoacids

class Bioroebe::CreateRandomAminoacids can be used to create random amino acids. See also the bin/create_random_aminoacids file for this.

I aliased this on my home system to randomAA (random aminoacids).

I then call it like this:

randomAA 35 # to create 35 aminoacids

The pure ruby variant would go like this (for creating 125 random aminoacids):

require 'bioroebe/aminoacids/create_random_aminoacids.rb'
Bioroebe::CreateRandomAminoacids.new(125) # variant 1
Bioroebe.create_random_aminoacids(125)    # variant 2

By default colours will be used. If you don't need this on the commandline, you can call it like so:

randomAA 125 --no-colours

The BioShell component of the BioRoebe project

The BioRoebe project features a shell, which for the purpose of the document here, will be called BioShell. (More informally, it is just the shell.)

The BioShell has a "help" section that can be invoked after you have started a new BioShell instance.

So to request help from a running instance of BioShell, do:

help

This can be a bit lengthy, so if you want to just look at individual subsections, pass the first character to it, such as via:

help c

This would show all subsections that begin with the letter c.

You can also specify a range of help options via the - character in between.

See the following examples for this:

help g-k
help g-m

The executable file essentially invokes the BioShell by issuing this ruby code:

require 'bioroebe'

Bioroebe::Shell.new

This of course also works from Ruby code. Simply try the above and it should work.

If you don't want to type the :: you can also use this namespace shortcut, a module-method, instead:

Bioroebe.shell

Which will effectively do the same as above but may involve a little less typing.

There are many undocumented options too.

For example, say that you have a nucleotide sequence and you want to show a number denoting the position quickly, such as:

12345678901234
ATCGACTACGAGTC

where the lower area is the nucleotide position. Then this can be done via either:

numbered?
numbered

I needed this functionality because I wanted to see which nucleotides were close to/at the central position.

By default, the displayed main sequence will be padded with leading whitespace, the 5'-leader, then comes the sequence, and then the 3'-trailer is shown. If you only want to see the raw DNA sequence, without anything more, then you can use the following to show that, within the shell:

raw?
raw

This is also easy to remember mentally. :)

If you want to view the main DNA sequence currently in use, but in reverse, simple type reverse:

reverse

If you also want to assign this to the main DNA sequence, append a trailing ! to this sequence:

reverse!

Note that you can also use the BioShell to quickly "convert" between the three-letter amino acid code to the one-letter amino acid code.

Example:

3letters Arg-Ala-Ser
3to1     ARG-ALA-SER

Output:

RAS

This has been specifically added because you may sometimes see websites that use the 3-letter code and have a sequence of aminoacids displayed, on that website. In my opinion, the one-letter code is much better in general.

If you want to find out where disulfide bonds are in a protein, you can issue the following:

disulfide?

Note that this becomes a bit difficult to see the longer the sequence.

The BioRoebe project also features an interactive REPL (read, eval, print, loop), a shell that can easily evaluate user input: the BioShell. Note that since at the least the year 2017*+, this shell interface is actually just called bioshell, or more accurately, Bioroebe::Shell. But it can still be called BioShell, of course. For simplicity reasons the documentation will refer to it most of the time as **bioshell, though.

The interactive BioShell can be used to input various different commands "live". This allows one to apply commands / instructions onto datasets that originate from the life sciences.

If the last character in the user input supplied is the '?' character, then the bioshell will assume this to be a restriction enzyme. This is mostly a convenience feature. I am not absolutely certain whether this will be retained, but for the time being it exists as it is.

Example for this:

ecori?

This will also show where exactly we will cut within this target sequence. It will show both the forward strand and the complementary strand, including the site where it cuts via a red vertical bar.

You can also get a list of available 8-cutters, that is, enzymes that leave an overhang of 8 nucleotides.

Example:

8-cutters

You can view all available metabolic pathways, such as the glycolysis, via:'

pathways?

To find out the Alu element sequence, do:

alu?

To calculate the melting temperature:

melting_temperature?

Help subsection of the BioShell

As part of the BioShell, there is a "help" menu interface.

In order to invoke this help section, do this:

help

You may also obtain help about subcommands.

Try it like so:

help a

Where a simply means to find all entries that start with the character 'a'.

You can also specify a range of help options via the - character.

See the following examples for this:

help g-k
help g-m

You can randomly insert some sequence into your target DNA.

rinsert ATGCC
random_insert ATGCCGGGEEEEEEEE
random_insert 3

The first variant will insert this 5-nucleotide sequence somewhere into your DNA string - hence the name 'random' in this context.

The last variant will pick 3 random nucleotides and insert these randomly as well. So if you use a number, it will substitute with one of A, T, C or G, respectively.

Note that this insertion will happen at ONE position consecutively, rather than at 3 different, random positions. This is a bit similar to the insertion of a transposon.

In December 2021 I changed the way how sequences were stored, as far as the bioshell is concerned.

First: we will use arrays now. So, one array for all RNA sequences, one array for all DNA sequences, and all arrays for aminoacid sequences.

Startup of the Shell (bioshell)

If you do not want or need to see the startup-notice when the bioshell is started, you can disable it on the commandline via:

bioroebe --silent-startup

You can also modify the file use_silent_startup.yml if you wish to make it persistent whether to display the info-notice on startup of the bioroebe-shell or not.

class Bioroebe::AminoacidSubstitution

class Bioroebe::AminoacidSubstitution can be used to substitute a FASTA sequence representing a protein (a run of aminoacids).

Let's first require it:

require 'bioroebe/aminoacids/aminoacid_substitution.rb'

Alright. Now, say you have a file called foobar.fasta.

You want to mutate aminoacid position 12, from Leucine (L) to Phenylalanine (F).

Invoke bin/aminoacid_substitution like this:

aminoacid_substitution foobar.fasta L12F

The second argument is the mutation you wish to add.

Note that this currently (December 2021) only works for a local file. A future change may be to have this work on a Bioroebe::Sequence object as-is, as well as any linear string rather than be limited to a file only. But for now it is limited like that.

BLAST variants

blastp: to search for similar protein sequences to a query protein sequence blastx: to search for protein sequences related to a query DNA/RNA sequence tblastn: to search for nucleotide sequences related to a query protein sequence

k-mers

BLAST makes use of k-mers.

How can we calculate the k-mers?

Well - the number n of k-mers in any sequence s is given via the following formula:

n = len(s) - k + 1

So, if your sequence is a RNA sequence containing 51 nucleotides, you can calculate that this sequence must contain 49 3-mers:

n = 51 - 3 + 1 = 49

No Changelog in the BioRoebe project

In the past two or three years, at the least from ~2012 or so, up to 2015, there used to be a changelog summary of changes done within the bioroebe project - but I feel that most people will not really care about a changelog, so this is no longer shown here.

For more important changes, I will try to keep the date shown when that change was made, but at the end of the day, the net functionality is more important than the specific date, in my opinion. So efforts are made to guarantee a certain quality of the project, even if no changelog is made available any longer.

create_these_directories_on_startup.yml

This yaml file contains which directories should be auto-created when the bioshell is first started. Not all directories have been specified there yet, but it is planned to, in the long term, specify all directories that should be started when the bioshell is started.

Downloading the human genome

You could use the following API for this:

Bioroebe.download_human_genome

This, in turn, will use an URL such as https://bioconductor.org/packages/release/data/annotation/src/contrib/BSgenome.Hsapiens.UCSC.hg38_1.4.4.tar.gz. You may have to check whether this is the latest version on your own, as the URL is currently hardcoded and may not always be updated. (I may change to inferring the most recent URL dynamically, but for now it is just a hardcoded String.)

GUI Components of the BioRoebe project

The bioroebe gem comes with several ruby-gtk widgets. These are not yet very advanced, so consider them somewhat experimental: a proof of concept, trying out new things rather than them constituting 'high-quality-professional-grade-GUI-functionality'.

In the distant future I assume that these GUIs will become more useful for the BioRoebe project (and downstream users) in general - if time and motivation permits. The main idea in this regard is to simplify using the BioRoebe project for GUI-related tasks of all kind, to make it easier for normal people to work with the project. This includes **www-related functionality* too.

Note that this also means combining functionality made available in other software, so the GUI components in the bioroebe gem should allow the user to use bioroebe as a platform for bioinformatics-related work.

Furthermore, since as of 2020, work has started to also add some java code, including GUI components in java (via swing primarily). Again, this does not have high priority, so do not expect much at this point - but in the long run, the goal is to fully bridge ruby-to-java, and then make available that functionality from within java as well, perhaps via the GraalVM. One obvious benefit would be that java is faster than ruby for actual work (but not as much fun to write in; and more verbose too).

Ideally the whole bioroebe project would offer a complete GUI environment, in particular for those users who either do not want to, or can not, use the commandline. But since the commandline is a lot easier to use, and also develop for (at the least for me), the commandline is the primary focus for this project for the time being. The secondary focus is on the www; and only the third focus will be on traditional GUI widgets.

The project is not "sold" on using gtk per se as such, mind you. Any other toolkit would serve just as well. The thing is that I primarily use ruby-gtk; if someone wants to use e. g. ruby-qt then code would have to be added here. (Ruby-qt only works well up to qt4, so it has been abandoned for the most part, unfortunately. Would be great if qt could offer something like gobject-introspection though.)

The namespace for GUI components within the bioroebe project in general is this:

Bioroebe::GUI

This is where GUI-specific ruby-code should reside, within that subdirectory.

For gtk, it would be at:

Bioroebe::GUI::Gtk

The project is structured to allow for this.

So for example, the ruby-gtk related code is at bioroebe/gui/gtk/. Since as of ~July 2018, this has been modified a bit: there is now a a gtk2/ and a gtk3/ directory. The current focus past the year 2021 is on gtk3. While the ruby-gtk2 bindings will not be abandoned completely, new code written for bioroebe will have a focus on ruby-gtk3. So do not expect too much work on the ruby-gtk2 widgets; gtk3 is better than gtk2 at this point in time.

If qt examples would have to be added to the project, then they should reside at bioroebe/gui/qt/, and so forth.

Presently, bioroebe defaults to gtk3 as the default GUI. This may change in the future, but for now this is the way things are.

You can also start some of the GUI components from the commandline.

For example, levensthein.rb has code that allows you to start its ruby-gtk GUI component, via:

--start-gui

Or, in a more generic manner:

bioroebe --levensthein-gui

Here is a screenshot of the gtk2-class for HammingDistance.

Note that you can also invoke the main-controller widget from the commandline, via:

bioroebe --controller

Unfortunately I noticed in 2022 that this no longer works, and invokes the bioshell; this has to be changed at a later moment in time. Quite some code has to be rewritten for this, so for now I postpone the rewrite.

(I have not yet tested this on windows via WSL, but I will do so eventually. Not that this currently, in November 2020, is quite noisy; I will change the code at some later point. For now it was more important to get the functionality to work properly; polishing can be done at a later moment in time)

If you need a widget that shows local .fasta files in a given directory, have a look at fasta_table_widget_module.rb. Note that this is currently not very feature rich; I only wanted to prototype it. What works right now is that the name of the .fasta file is shown, the total file size, and how many nucleotides are part of that .fasta file. I intend to expand this over the coming months, until we can have something as feature-rich as this variant:

Note that since as of February 2021, the ruby-gtk2 bindings are mildly deprecated compared to the ruby-gtk3 bindings. That means that some widgets will not work on ruby-gtk2 but will work on ruby-gtk3. I am not entirely abandoning ruby-gtk2 as such, mind you, but some classes exist only for ruby-gtk3, such as Gtk::Grid, and I'd have to replace them with something like Gtk::Table to retain backwards compatibility - and I have not yet done so.

Since I myself use ruby-gtk3 almost exclusively these days, there is little net-gain for me to invest more time into ruby-gtk2 as such. But it is a mild deprecation, not a "I will never again use ruby-gtk2"; ruby-gtk3 merely has a significantly higher priority than does ruby-gtk2.

You can also start some of the ruby-gtk3 widgets via the bioshell. Try:

gtk3
gtk4

As input.

Note that these GUIs are currently, in 2021, not very advanced. But expect them to be improved in the future. Ideally I would like to use a graphical "pipeline" for bioinformatics-analysis, similar to a swiss-army knife.

An alias that will eventually work is to preface the individual GUIs via gui_ from the bioshell.

For instance:

gui_restriction_enzymes

Would start the GUI widget for restriction enzymes.

Note that in May 2021 work has begun to add ruby-tk bindings, in addition to the main ruby-gtk3 bindings. The primary reason for this is because I want to use these bindings on windows, and getting tk to work and run on windows is significantly easier than getting gtk to run. New innovative ideas will go via ruby-gtk3 though, not via ruby-tk; the ruby-tk bindings are mostly just done to see which GUI widgets have been ported so far.

Here is a table that may eventually be expanded, comparing ruby-gtk3 bindings in Bioroebe to ruby-tk bindings:

# |ruby-gtk3 bindings | ruby-tk bindings | ruby-libui bindings ---|------------------------------------|-------------------------------|--------------------------- 1 | three_to_one | [DONE] | Status unknown 2 | aminoacid_composition | [PARTIALLY IMPLEMENTED; ~30%] | 3 | anti_sense_strand | [NOT YET IMPLEMENTED] | 4 | calculate_cell_numbers_of_bacteria | [NOT YET IMPLEMENTED] | 5 | dna_to_aminoacid_widget | [NOT YET IMPLEMENTED] | 6 | dna_to_reverse_complement_widget | [NOT YET IMPLEMENTED] | 7 | fasta_table_widget | [NOT YET IMPLEMENTED] | 8 | gene | [NOT YET IMPLEMENTED] | 9 | hamming_distance | [PARTIALLY IMPLEMENTED; ~80%] | 10 | levensthein_distance | [NOT YET IMPLEMENTED] | [ALMOST COMPLETE; ~95%] 11 | nucleotide_analyser | [NOT YET IMPLEMENTED] | 12 | parse_pdb_file | [NOT YET IMPLEMENTED] | 13 | primer_design_widget | [NOT YET IMPLEMENTED] | 14 | protein_to_DNA | [NOT YET IMPLEMENTED] | 15 | random_aminoacids | [NOT YET IMPLEMENTED] | 16 | restriction_enzymes | [NOT YET IMPLEMENTED] | 17 | show_codon_table | [NOT YET IMPLEMENTED] | 18 | show_codon_usage | [NOT YET IMPLEMENTED] | 19 | sizeseq | [NOT YET IMPLEMENTED] | 20 | three_to_one | [NOT YET IMPLEMENTED] | 21 | www_finder | [NOT YET IMPLEMENTED] | 22 | blosum_matrix_viewer | [TINY BIT IMPLEMENTED; ~5%] | 23 | random_sequence | [NOT YET IMPLEMENTED] |

Note that while I may eventually come even with the ruby-gtk3 bindings as the main reference, since as of May 2021 priority will be given on the web-related aspects of bioroebe. The classical desktop-centric bindings, while usable, are significantly more cumbersome to write for - in particular ruby-tk. (ruby-gtk3 is in a much better state, but there are simply more people using the www-stack than gtk in general.)

In December 2021 I improved the primer-design widget a bit. It is still not fully functional, several things are missing, but you can get the general idea if you look at the following partial screenshot:

In the long run the idea here is to not only be of help when it comes to selecting the best primer pairs, but also to add specific restriction enzymes to allow for directional PCR.

In *August 2022 the fasta-viewer was improved a little bit visually. It is also possible to pass the path to local .fasta files from the commandline, as well as use a small icon to load up additional .fasta files for display in that widget.

libUI support

Presently, since as of August 2021, support for libUI in the bioroebe project is rather limited. I only ported a few widgets so far - and it looks like crap.

See the current picture for the Linux variant (I actually forgot to pad, so the current variant looks a bit better than the one shown below):

In the coming weeks and months this may be improved.

The main benefit of libui, though, is that it should work on windows just fine, out of the box. Via ocra you can also generate .exe files that can be run on windows too. On Linux I don't need this really but for windows this is a nifty feature.

To generate these .exe files, you need the ocra gem and the roebe gem (the latter merely to help me bundle the code; strictly speaking you could do this without the roebe gem just fine, of course). Then you can try the following:

Bioroebe.batch_create_windows_executables

Or from the commandline:

batch_create_windows_executables

The following table shows the status of some of the porting efforts ongoing in this regard:

widget                         | file                                                                           | status
-----------------------------------------------------------------------------------------------------------------------------------------
Alignment                      | libui/alignment/alignment.rb                                                   | almost complete   (95%)
BlosumMatrixviewer             | libui/blosum_matrix_viewer/blosum_matrix_viewer.rb                             | semi-complete     (70%)
CalculateCellNumbersOfBacteria | libui/calculate_cell_numbers_of_bacteria/calculate_cell_numbers_of_bacteria.rb | almost complete   (90%)
DnaToAminoacidWidget           | libui/dna_to_aminoacid_widget/dna_to_aminoacid_widget.rb                       | mostly complete   (80%)
DnaToReverseComplementWidget   | libui/dna_to_reverse_complement_widget/dna_to_reverse_complement_widget.rb     | almost complete   (90%)
HammingDistance                | libui/hamming_distance/hamming_distance.rb                                     | mostly complete   (80%)
ProteinToDNA                   | libui/protein_to_DNA/protein_to_DNA.rb                                         | almost complete   (95%)
RandomSequence                 | libui/random_sequence/random_sequence.rb                                       | somewhat complete (80%)
ShowCodonTable                 | libui/show_codon_table/show_codon_table.rb                                     | not very complete (20%)
ShowCodonUsage                 | libui/show_codon_usage/show_codon_usage                                        | almost complete   (95%)
ThreeToOne                     | libui/three_to_one/three_to_one.rb                                             | mostly complete   (85%)
AntiSenseStrand                | unified_widgets/anti_sense_strand/anti_sense_strand.rb                         | almost complete   (95%)

In the long run I hope to be able to offer all ruby-gtk widgets as ruby-libui widgets as well, and use a unified code base - but right now this is not possible, so I maintain the code somewhat separately between ruby-gtk and ruby-libui.

Since as of late August 2021, if you want to make use of the libui-specific GUI components, you have to install the libui_paradise gem first. The reason for this is because it helps me maintain the code base in regards to different projects that I use.

In September 2021 a slightly different approach has been started, based on the gem called simple_widgets. For an example of how this works look at the file at unified_widgets/anti_sense_strand/anti_sense_strand.rb. The advantage here is that we would use the very same code base, but support multiple different toolkits at the same time, including support for the www. Note that this does not yet work completely, and presently I am not spending time to extend this - but you can now use ruby-gtk3 and libui for at the least the AntiSenseStrand widget. Simply pass "gtk3" or "libui" to the .rb file as the first argument and the correct widget set will be used. I verified this on my home setup that this actually works - and it does.

There are tons of things that could still be improved, and several things are not yet complete - no support for fxruby or ruby-tk or www or java-related bindings (SWING/AWT toolkit). I am working on the latter aspects, but this may take weeks or months or even years; it does not have highest priority. For now consider this as all merely a proof of concept and something we may support better one day. Until then ruby-gtk3 will still be the most natural choice for GUIs within the bioroebe project as of right now, and show the frontier of development in this regard.

Note that for the time being, ruby-gtk3 and ruby-libui will be mostly separately maintained, as already mentioned, at times via a shared-module. One day I would love to see a completely unified code base in regards to GUIs, but as of right now (September 2021) this is not realistic, as it requires a LOT of work (and thus time investment) to effect this. Since time is a limited resource, it is simpler and faster to write the bindings separately for now, and do a full unification at some later point in time when things have improved - both internally for this project, external gems, as well as upstream in the libui bindings.

Bioroebe::RawSequence

class Bioroebe::RawSequence was added in September 2021. It is the base class for class Bioroebe::Sequence. It was added because of two reasons:

(1) There was a formatting bug related to Bioroebe::Sequence and the bioshell. Finding that bug was hard, largely because Bioroebe::Sequence became so big and unwieldy.

(2) I needed a simpler class that can represent a sequence but without any "useless" methods attached to it.

Note that as of September 2021, RawSequence is not yet the default for when you work with the bioshell. It is planned to make RawSequence the default eventually, but this will happen at a later time.

Java Interface to the bioroebe gem

The bioroebe project has a subdirectory aptly named java/.

In the long run java code may be added into that directory, until we may reach a state where bioroebe can easily be used from java directly, ideally with the same API and different GUIs. But right now (March 2020, up to including October 2021) this is really quite minimal. Do not expect a massive increase in the amount of java code in this project - about 98% of the code is still written in ruby and this may persist for some time.

One idea would be to use jruby too, for additional speed gains, or TruffleRuby + GraalVM. But again, this is a distant goal.

Let's remain humble and build from there, in tiny steps forward.

What follows next is a "table" to see which functionality is available via the Java interface so far:

IsPalindrome.isPalindrome()                # Check whether the input String is a palindrome or not.
Esystem.esystem("ls");                     # run a system command, similar to Bioroebe.esystem()
PartnerNucleotide.partner_nucleotide("A"); # determine the corresponding nucleotide
Bioroebe.base_composition();               # Similar to the ruby metho.

GFP protein and similar proteins, including mCherry (fluorophores)

Because GFP (green fluorescent protein) is so widely used, it deserves a small subsection here.

If you want to quickly know the sequence of GFP, do:

GFP?

If you want to assign GFP as the new DNA nucleotide sequence, do:

assign :GFP

If you look for the mCherry sequence, have a look here:

https://www.ncbi.nlm.nih.gov/nuccore/55420612

It has 711 nucleotides. The resuting protein has 236 aminoacids.

Software that should be installed

The following is an incomplete (and imperfect) alphabetical listing of what I recommend people should install, in order to better leverage stuff in regards to bioinformatics. If you can not find the URLs to these programs (google may suffice, though), simply have a look at the rbt gem - it is what I use to download and install these projects. It has the URLs mapped as-is (e. g. via "rbt samtools --url?", if you want to download it manually).

  • bedtools
  • hmmer
  • htslib
  • relion
  • samtools
  • viennarna

Screenshots for the GUI components

In September 2021 I decided to eventually have a screenshot for each of the ruby-gtk3 widgets available. This is ongoing, so we haven't yet added all screenshots yet:

alignment.rb

aminoacid_composition.rb

blosum_matrix_viewer.rb

fasta_table_widget.rb

primer_design_widget.rb

show_codon_table.rb

show_codon_usage.rb

three_to_one.rb

restriction_enzymes.rb

random_sequence.rb

Deducing the DNA codons that code for a specific aminoacid sequence

Say that you have a given aminoacid sequence, such as MTTAGP (Methionin, Threonine and so forth). You now want to find out the DNA sequence coding for this amino acid sequence. This task can be handled by a class called Bioroebe::DeduceAminoacidSequence, within the bioroebe project.

Invocation example from Ruby:

Bioroebe::DeduceAminoacidSequence.new('M-T-T-A-G-P')
Bioroebe::DeduceAminoacidSequence.new('MTTAGP') # The '-' can be omitted, of course.

Internally, class DeduceAminoacidSequence will make use of class PossibleCodonsForThisAminoacid to determine the possible codons for a given aminoacid. This will also make use of the correct codon table (defaulting to eukaryotes, so you may want to change this if you work with bacteria or archaea).

class DeduceAminoacidSequence will show some useful information on the commandline, e. g. the most likely DNA sequence coding for this sequence. Do note that presently, there is no algorithm in the Bioroebe project that actually determines the most likely sequence, based on the percentage values available for a given species (in the codon table). So for now, the above class does not show which sequence is the more likely one to occur; it will only output the first one found.

This behaviour may be improved upon in the future, though, as that data is already available; it just has to be put into the bioroebe poject.

There also exists a small executable at bin/deduce_aminoacid_sequence which can be used to query this from the commandline. For example, input MTTAGP to that file and it should show you some information on the commandline.

deduce_aminoacid_sequence MTTAGP

The bioroebe-shell also has support for deducing an amino acid sequence.

The keyword for this is "deduce" or "ded", and some aliases.

Example how to use this from within a running instance of the bioroebe-shell:'

deduce MTTAGP

Lys - Ser - Pro - Ser - Leu - Asn - Ala - Ala - Lys
Lys - Val - His - His - Leu - Met - Ala - Ala - Lys

This will show the corresponding RNA codons that can possibly code for these.

In January 2022 class DeduceAminoacidSequence was partially rewritten and a new class, called DeduceMostLikelyAminoacidSequence was added. There were two reasons why this was done:

First, the original class was a bit messy, not that well-written and hard to subclass from.

Second, I wanted to have support for different codon tables as-is, and inferring the most likely codon range for a given aminoacid sequence.

The commandline flags were also expanded, in particular enabling --RNA. So, for instance, when I now do this:

deduceaminoacidsequence MTTAGP --RNA

It will deduce the possible codons for the aminoacid sequence MTTAGP, and it will display the findings in RNA - thus, all T are U on the display on the commandline.

Determining the possible codons for a given aminoacid

If you need to quickly determine all possible codons for a specific aminoacid then the following toplevel API can be used:

Bioroebe.possible_codons_for_this_aminoacid('A') # => ["GCT", "GCC", "GCA", "GCG"]
Bioroebe.possible_codons('A') # or this shorter variant

The result will be an Array of codons coding for that particular aminoacid.

If you want to work on several aminoacids then you can use a similar API:

Bioroebe.possible_codons_for_these_aminoacids('AHLLKLAS')

Quaternary encoding of DNA

Some years ago a team at the iGEM in 2010 was using modified E. coli (called "E. cryptor") to represent segments of DNA via the 'quaternary number system'. The idea was pretty interesting.

We all know that DNA has four bases: A, C, G and T.

Let's say that we have a letter such as H. This can be represented by 1020 in the quaternary number system, which in turn corresponds to TACA. Thus, T is representing 1, A is representing 0, C is representing 2 and G is representing 3. The following table shows this more succinctly:

A | 0
T | 1
C | 2
G | 3

So, what if we have a text word such as "hello"?

This would become:

10201211123012301233

Showing this more clearly:

1020|1211|1230|1230|1233

 H    e    l    l    o
1020 1211 1230 1230 1233 

Now that we have the number, we simply map from that number into the corresponding DNA:

10201211123012301233
TACATCTTTCGATCGATCGG

That's it! \o/

This is quite simple to model via ruby too, so a toplevel method was added to do this.

Provide your DNA sequence, as well as the hash having the values for the four nucleotides, and you get the resulting number representation.

Example:

Bioroebe.quaternary_encoding_DNA_to_numbers('TACATCTTTCGATCGATCGG', { A: 0, T: 1, C: 2, G: 3} ) # => "10201211123012301233"
Bioroebe.quaternary_encoding_DNA_to_numbers('TACATCTTTCGATCGATCGG') # => "10201211123012301233"

For the reverse use:

Bioroebe.quaternary_encoding_numbers_to_DNA('10201211123012301233') # => "TACATCTTTCGATCGATCGG"

For assigning four numbers to each letter of the alphabet we don't need that many variants. And we can use a different number, so this has to be scored via a Hash that the user can pass in.

Let's propose a simple scheme - and remember that this is somewhat arbitrary:

0000 A
1000 B
2000 C
3000 D
0100 E
0200 F
0300 G
0010 H
0020 I
0030 J
0001 K
0002 L
0003 M
1100 N
1200 O
1300 P
1110 Q
1120 R
1130 S
1210 T
1220 U
1230 V
1231 W 

Different combinations can be used here.

Let's query the name "Otto". The output would have to be 1200 (for O), 1210 for T and again for T, and then 1200 again.

So the result should be:

1200121012101200

Specific API usage:

Bioroebe.quaternary_encoding_translate_from_alphabet_string_into_the_corresponding_DNA_sequence('OTTO') # => "1200121012101200"

We could also distinguish between lower-cased characters and upper-cased characters.

Sequence Logos

Sequence logos were proposed by Schneider and Stephens in 1990.

They are graphical representations of the proportions of the different amino acids in the positions of multiple sequence alignments.

For each position, a stack with the 20 amino acids is shown. The height of the letters is proportional to the frequencies of these amino acids at that particular position of the alignment.

Apart from the graphical aspect, the main advantage respect consensus sequences is that sequence logos contain information not only on the most frequent amino acid, but also on others with also high frequency which could be informative. The letters can be coloured according to different criteria.

A website that allows you to create sequence logos can be found here:

https://weblogo.berkeley.edu/logo.cgi

The images that can be generated via this may look as follows:

Let's look at another example.

Say you input the following sequences there:

AGVV
AGVV
AGVV
AGVV
AGGV
AGGV
AGGV

The resulting image that is generated is:

The Kozak Sequence

The ribosome usually scans for a AUG codon. But there are many AUG in a given (long) mRNA, so against which one should the ribosome discriminate?

There is a special consensus sequence, the Kozak consensus sequence. It has the sequence GCCACCAUGG, where the AUG is embedded into (if you look at it carefully).

It need not always be a ACC before the AUG; it can also be GCC instead. AUG codes for methionine, so most proteins begin with a methionine. (This is not always the case, though; a few other codons are used by bacteria as well, such as GUG.)

Bioroebe::CompactFastaFile

Bioroebe::CompactFastaFile can be used to compact a .fasta file's sequence into one (a single) line. So, for instance, if you have a file with this content:

aug
gcu

Then class Bioroebe::CompactFastaFile can be used to turn this into a single line of either auggcu or AUGGCU. It is thus primarily a convenience method; on Linux/UNIX systems you can easily use pipes to achieve the same, but I wanted to have this functionality ready-made for the bioroebe gem as well.

Be careful when using this class, though - the existing file will be overwritten, so perhaps make a backup copy before you invoke this class.

Usage example:

require 'bioroebe'

Bioroebe::CompactFastaFile.new('foobar.fasta')

Since as of February 2022 a file exists at bin/compact_fasta_file, so you can use this functionality from the commandline as well.

Furthermore the commandline flag --upcase was added to upcase the sequence. This ensures that the resulting .fasta file will end up with upcased letters, such as GUC rather than guc.

Usage example from the commandline for this functionality:

compact_fasta_file --upcase foobar.fasta

Bioroebe::ParseFasta

This subsection will mention a few options of class Bioroebe::ParseFasta. This is currently a stub, though.

The FASTA format does not specify exactly how many characters go in a line, in a given .fasta file. Different programs may use different values here.

If you want to parse a local .fasta file but only want to know how many nucleotides are in that file, try the following commandline flag:

pfasta foobar.fasta --size
pfasta NC_001416.1_Enterobacteria_phage_lambda_complete_genome.fasta --size
pfasta *.fasta --size

This will display the number of nucleotides found in that .fasta file.

You can also limit the number of nucleotides shown on the commandline via --limit=NUMBER. Example:

pfasta NC_001416.1.fasta --limit=500 # To show only the first 500 nucleotides.

You can also add the length information into the FASTA header in a .fasta file via class $BIOROEBE/fasta_and_fastq/length_modifier/length_modifier.rb. This functionality was added in the event that we may have to do batch-processing of FASTA files one day.

The bioshell component understands the FASTA format as well.

If you wish to do multi-line fasta input as part of the Bioshell, that is input that will include newlines, then you currently have to use either of the following commands:

# Multi-fasta example in the bioshell:
multi-line
multiline
multi_line
multi_input

Pick any of these. You can terminate the input by using __.

You can also use "fasta" for reading .fasta files, from within the bioshell. Or doing single line input.

At a later time, this may change, but for now, please use these two distinct ways to read in a fasta file.

Note that multiline actually also allows you to assign multiline input from normal, regular sequences too. Simply start to paste in your sequence.

Also note that if you paste a Genbank accession number into it, one that starts with a > token at the least, then we will simply strip this away. This allows you to more easily copy/paste data into the BioShell.

As of May 2016 you can simply input an URL to a NCBI entry and the bioshell will try to download the sequence.

You can also input something such as:

NC_000866.4_Enterobacteria_phage_T4.fasta'

The bioshell will first look locally if fasta sequence exists.

Since as of August 2016, you can also input a remote fasta file into the Bioroebe::Shell (bioshell) component.

Example showing this:

https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta

This will download the remote file. Then it will parse it and output whether it has done something useful with it too.

Note that since as of April 2022 you can also quickly show the translated aminoacid sequence of a .fasta file, if you make use of class Bioroebe::ParseFasta, via the --to-protein commandline switch.

Usage Example:

pfasta insulin_mRNA.fasta  --toprotein

class Bioroebe::Protein

class Bioroebe::Protein can be used to store a protein sequence.

I am not yet sure how important that class may be, but for now it exists as part of the bioroebe project.

An alternative name for this class could be Bioroebe::Aminoacids, but I found that the name Protein may be more apt.

Specific usage example:

x = Bioroebe::Protein.new('MSKADYEK')
puts x.reverse_translate # AUGAGCAAGGCCGACUACGAGAAG
y = x.reverse_translate
puts y # => "AUGAGCAAGGCCGACUACGAGAAG"
Bioroebe.to_aa y # => "MSKADYEK"

As can be seen, this should work. You can reverse-translate a DNA sequence too. This currently makes use of Bioroebe.deduce_most_likely_aminoacid_sequence_as_string(), but in the future this may be adjusted to be more flexible and suggest a sequence optimized onto the codon table of the particular organism at hand.

In April 2022 the method .hydrophobic_amino_acids?() was added. This method will return an Array of all aminoacids that are part of this class that are hydrophobic.

The hydrophobic amino acids are:

A I L M V F W Y

This is determined via the following Array:

hydrophobic_aminoacids = %w(
  A I L M V F W Y
)

This may be subjectto change, e. g. including Glycine.

Let's look at a specific example demonstrating this behaviour:

x = Bioroebe::Protein.new('AICHIVUEHAAAAAA') # Note that Bioroebe::Aminoacids.new() works as well, since as of May 2022.
pp x.hydrophobic_amino_acids? # = [1, 2, 5, 6, 10, 11, 12, 13, 14, 15] 

And another example:

x = Bioroebe::Protein.new('GGGSSSSSSSSSSSTTTTTT')
x.hydrophobic_amino_acids? # => [1, 2, 3] # As can be seen here, only Glycine is considered hydrophobic. This was changed in May 2022, though. 

So in short: you get an Array with the aminoacids that are hydrophobic. Note that this begins at position 1, referring to the first aminoacid (which in the first case above is Alanine).

Licence

The BioRoebe project used to be under the GPLv2 licence exclusively, up until April 2019 (25.04.2019).

While the GPLv2 licence is a perfectly fine licence in itself, there exists a slight inconvenience in regards to linking in C/C++ code, which, under the GPLv2, would constitute a modified variant in a derived assembly (once linked in). This would limit the BioRoebe project a little in that only effectively ruby code could be distributed (or GPLv2 code), which would mean that the BioRoebe project would never be able to use C/C++ code that may be external (or externally provided) under a less restrictive licence model, such as BSD/MIT style licences; or a more restrictive licence that may forbid licences such as the GPL, for whatever the reason. You could distribute the code under the BSD/MIT licences, but not via GPL if you have a derivative - which will be the case if you link in object files.

Thus, since as of 25.04.2019, the BioRoebe project is now using the LGPLv2.1 licence instead. This allows the BioRoebe project some more flexibility, in particular when it comes to existing C/C++ codebases out there, or respectively when others make use of the BioRoebe project as such. In particular this has been the main reason why the project switched from GPL to LGPL.

A short disclaimer for the project follows next:

This program is free software. You can distribute/modify this program under the terms of the GNU LESSER GENERAL PUBLIC LICENSE Version 2.1.

Take note that there is no later clause for the BioRoebe project, though - only LGPLv2.1 is the valid licence for the BioRoebe project.

See the following link for the full licence terms of LGPLv2.1, but note that, as stated, the later clause does NOT apply to the BioRoebe licence (in general I fail to understand why any licence wants to use a "or later" statement; never made any sense to me):

https://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html

So, the BioRoebe project uses a slightly modified variant of LGPLv2.1 - the or later clause does not apply. That is the sole difference. (The linux kernel uses a similar model, except that it is GPLv2 - it has no later clause either.)

More general warranty disclaimer:

This library/software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

In layman terms - what the above entails to essentially is to use the project at your own risk and do not hold anyone else responsible for this use of said project / software. But that is quite common sense to assume so, anyway. Nothing is guaranteed to work and you are at your own, if you still use this software - but if it works, that's quite useful, right?

Usually the code should work fine, too; sometimes modifications may have to happen, and a few bugs may exist too.

Note that of course any discovered (and reported) bugs will eventually be fixed/addressed, if time permits; or at the least noted down somewhere.

So why isn't the MIT/BSD style licence used in the bioroebe project?

There are two main reasons as to why:

a) The LGPL and GPL places a stronger focus on obligatory code-reuse.

Linus Torvalds once summarized this for the linux kernel - as a concluding statement, he may have said something along the following lines:

"You benefit from it, without having to pay anything, so it makes sense to make your own modifications also available to others - tit for tat".

In my opinion this is the most important aspect of the GPL/LGPL licences, as opposed to the less restrictive MIT/BSD variants. (I avoid the term copyleft since I think it makes little sense to use that term to begin with. In my opinion it does not matter how courts interprete the terms copyright or copyleft - either a licence is valid in court, and can be enforced; or it is not valid. I consider both BSD/MIT and GPL to be perfectly valid licences, so the terms copyright and copyleft as a distinction are not hugely relevant. While I am fine with the GPL, I reject the associated attempt to re-"brand" words just as much as I reject the associated attempt from corporate shills to do the same re-"branding", just in an orthogonal manner. Either a licence is valid or it is not, plain and simple.)

b) I feel that it is important to use a slightly stronger licence for the BioRoebe project as-is, considering how much work went into it so far, and how much work may continue to be put into it in the future.

This was a similar concern for the Linux kernel - it is unlikely to have succeeded under a BSD/MIT style licence if you think it through.

BSD/MIT provides more options in particular for companies, but this may not necessarily translate into new opportunities for other users since the code may not be be guaranteed to be re-shared. Several big corporations do that already and I feel that it is extremely unfair when they do not allow others to (re)use their own modifications, yet benefit from the code on their own.

They can let others benefit from it, of course, and often do so - but they are not required to do so, so this puts simple users at a general disadvantage in all cases where they do NOT share the code.

How can hobbyists compete with mega-billion corporations? That's just not going to happen.

I encourage folks to think about the problem domain as a whole, from an ethical point of view, and fairness as well but also from a usability point of view - not just in regards to particular licences alone.

Do note that I do use MIT/BSD licences in other projects as well, so I don't have a problem with any of these licences at all from an "ideological" point of view. I just do not feel it would be the best fit for the BioRoebe project and neither for users of BioRoebe, if it were to focus on MIT/BSD or similar licences as a first-class citizen.

This is only meant in regards to the licence. Of course the code quality has to match, too, if code is contributed; it would not make a lot of sense to merge code that is of low quality per se.

The BioRoebe project already has quite a lot of code that is not ideal, to word this nicely. We want to improve on the quality, not lower it. :)

Last but not least, I wanted to compare difference licences of bio-related projects as well in a table:

name of the project licence
bio ruby
biopython Biopython License Agreement
bioperl Perl Artistic License or GNU GPL
biojava LGPL v2.1

Bioroebe.complementary_dna_strand()

The method Bioroebe.complementary_dna_strand() can be used to return the complementary DNA strand.

An executable at bin/complementary implements that functionality.

Usage example:

complement ATTA
TAAT

If you want to show leading 5'- and trailing -'3 then you can either use the flag --trailer:

complement ATTA --trailer
5'-TAAT-'3

Or have the input include these:

complement 5'-ATTA-'3
5'-TAAT-'3

siRNA and class Bioroebe::SiRNA

Since as of June 2022 you can use the specification by Kumiko Ui-Tei from his 2004 paper to determine whether a given sequence is a strong siRNA. The paper can be read here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC373388/

Note that class Bioroebe::SiRNA may contain a few bugs, but for the time being it is what it is. At a later point in time test cases may be added to check whether it performs correctly or whether it does not.

The other rules, also published in 2004, are the Reynolds rules. Code support was added to the Bioroebe project in June 2022, but it was not tested yet, so the implementation may be incorrect.

The Bioroebe::Shell interface

The following subsection specifically handles information pertaining to the Bioroebe::Shell interface of the bioroebe project. It is also called bioshell, to simplify spelling it.

Numbers as input in the bioshell

alt text

You can input a number in the BioShell such as 3.

This will attempt to display the first 3 nucleotides of the assigned main sequence. It will only work if you have assigned a sequence prior to that, though.

Examples:

3
33
15

transeq

alt text

You can convert a DNA sequence into an aminoacid sequence by doing this:

transeq

Shuffling the DNA/RNA string in the bioshell

alt text

Via

shuffle

you can randomly rearrange the main DNA/RNA string that is used by the Bioroebe::Shell.

This can be useful if you just wish to quickly "test" new compositions of the same nucleotide.

Permanently disabling showing the startup-introduction of the Bioshell

alt text

If you do not want to see the start-up intro, you can try any of the following:

bioshell --permanently-disable-startup-intro
bioshell --permanently-disable-startup-notice
bioshell --permanently-no-startup-intro
bioshell --permanently-no-startup-info

Longest substring

alt text

Within the Bioroebe::Shell you can determine the longest substring, including gaps, like s:'

longest_substring? ATTATTGTT | ATTATTCTT'

Note that this will make use of the diff-lcs gem, which uses the McIlroy-Hunt algorithm.

Do not create directories on startup of the shell

alt text

By default the bioshell will try to create some directories on startup. This may not always be desired by the user, though, so an option has to exist to disable this functionality.

Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell] keeps track of whether directories on startup of the shell will be created.

To disable this behaviour on startup of the bioshell, try something like this:

bioshell --do-not-create-directories-on-startup
bioshell --do-not-create-directories

Generating and assigning a random amount of nucleotides

alt text

Via:

random 555

you can "generate" 555 random nucleotides (DNA that is) and assign it to the main sequence in use by the bioshell. This is mostly a convenience feature, if you want to debug something quickly.

Determining the log directory for the Bioroebe::Shell component

alt text

Via:

bioshell_log_dir?

you can determine the log-directory output for the bioshell component. On my home system this will default to /home/Temp/bioroebe/bioshell/.

Prompt (the shell prompt of the bioshell)

alt text

You can set a custom prompt in the bioshell, via the keywords "prompt" or "set_prompt".

To display the current working directory, do:

prompt pwd

To revert to the old default again, do this:

prompt REVERT
prompt revert
prompt DEFAULT
prompt default

If you do not want to set any prompt, do:

prompt none

Random stuff - generating random DNA sequences in the bioshell

alt text

You can generate random DNA sequences in the bioshell via:

random dna 20
random dna 25
random dna 30
# or simpler
random 20
random 25
random 30

This will generate random DNA sequences, with a length of 20, 25, 30, respectively. This may not be very useful but it was important that this functionality is made available somewhere. Sometimes you may not even care about the sequence and just use the a "filler" sequence, so randomness has to be part of the Bioroebe project as well.

You can also use some toplevel-methods to generate, e. g. 20 random aminoacids. Have a look at the following toplevel API:

Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"

Note that there may exist other APIs within the Bioroebe project that do the same as well.

If you would like to use a ruby-gtk3 widget have a look at RandomSequence, under bioroebe/gtk3/random_sequence/. It works with aminoacids, DNA and RNA, and allows the user to create random sequences. (If you need weighted randomness then you currently have to use the commandline variant. Perhaps I may add support into the GUI directly for this one day.)

Deprecations within the Bioroebe::Shell

alt text

Over the years the Bioroebe::Shell changed quite a bit.

This subsection here will list a few of these changes or rather, the deprecations.

raw_sequence: removed in June 2022 completely. It is simpler to handle sequences via Bioroebe::Sequence instead.

@internal_hash[:array_sequences] was no longer in use, so it was removed in July 2022.

Chop off nucleotides within the Bioroebe::Shell

alt text

You can use the following syntax to chop away until you find a particular substring, in the bioshell:

chop_to ATG

This functionality was specifically added to find the first ATG codon.

Truncating output in the bioroebe-shell

alt text

DNA/RNA sequences can become very long and then become quite difficult to view, read and handle on the commandline.

Normally the bioroebe shell will truncate output of DNA sequences that are "too long". This is mostly done so that working with very long sequences becomes a bit more convenient.

Sometimes this can become an antifeature, though, so the user must be able to toggle this at his or her own discretion.

By default, the bioroebe-shell (bioshell) will always try to truncate output, but you can toggle this behaviour by issuing:

do not truncate

In theory, other "do not" actions are also supported, or will be supported in the future; right now (Oct 2019) this is a bit limited.

From the toplevel, you can use this method:

Bioroebe.do_not_truncate

The above instruction will toggle the truncate behaviour to not truncate, ever.

If you need to do so within the bioshell, this is the way:

no_truncate

Or simply

truncate

This will toggle, like a switch.

Working with .pdb files in the bioshell

alt text

This subsection only very briefly mentions how to work with .pdb files in the bioshell. See other parts of this document for a more extensive overview how you can work with .pdb files via the Bioroebe project.

If you input something like this, if it ends with .pdb:

1fat.pdb

And if no such file currently exists at /home/Temp/bioroebe/pdb/1fat.pdb then it will be downloaded and moved towards /home/Temp/bioroebe/pdb/.

This feature exists just to simplify using the bioshell.

Showing the stop codons in frame1, frame2 and frame2 in the bioshell

alt text

When you have a given sequence assigned to the bioshell, such as via "random 99", you can then show all stop codons in frame1, frame2 and frame3.

The corresponding input for this will be:

stop_frame1?
stop_frame2?
stop_frame3?

An image shows this next, where we first did input "random 120", before issuing the above-mentioned instructions one after the other:

Freezing the main sequence in the bioshell - and unfreezing it again

alt text

You can freeze the BioShell, meaning that it will no longer allow for the main sequence to be modified, via the following command:

freeze

To unfreeze the sequence again, issue:

unfreeze

This functionality has been added because the shell may sometimes be quite eager to change the main sequence, so we needed a way to disable any further modifications (until "unfreeze" is issued that is).

Support for other programming languages

The main programming language for the bioroebe project is ruby. Ruby, from a language design point of view, is a great programming language - not necessarily all of ruby, but the subset that I use. It is very easy to quickly prototype ideas via ruby.

However had, ruby is known to not be among the fastest programming languages about on this planet; so, it makes sense to use other languages too from this point of view. Additionally there are some software stacks in use in other programming languages, such as matplotlib and various more.

Thus, it is important to support other programming languages as well, if there are useful libraries. The bioroebe project, after all, tries to be practical: it focuses on getting things done, no matter the language.

This means that support for other programming languages can be found in this project as well, often using system() or similar functionality to tap into these other programming languages. Do not be surprised when that happens - the bioroebe project will also try to act as a practical glue towards functionality enabled via other projects. We want to get things done, no matter the programming language at hand!

Whenever possible, though, the bioroebe project will try to be flexible in this regard, so ideally the same solution should work for many different programming languages.

While Ruby is the primary language for this project, since as of 2021 I will try to officially support java, jruby and the GraalVM. This is on my TODO list, though - stay tuned for more updates in this regard. See also the subsection Support for Python.

Support for Python

In June 2022 I decided to add support for Python to bioroebe.

While people can - and should - easily use biopython instead, I simply wanted to see how much python-support I can add to bioroebe. This may lag behind some years compared to biopython, but I wanted to extend python support as well, so there you go. It is simply an additional option for the bioroebe project. Ruby will remain the primary language for the project, though, at the least for now.

Bioroebe::ProfilePattern

This class can be used to generate nucleotide sequences that are not quite "random". For example, to generate sequences that may "simulate" a TATA box.

The idea for this class is to be extended into allowing HMMs (Hidden Markov Models) one day.

Usage example:

_ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
_.generate_sequence_based_on_this_profile

Such a profile will encode the profile specifying the preferred sequence letters for each position in a section of DNA. You have to provide the Hash into the method generate_sequence_based_on_this_profile() - or you use the default Hash, which is stored in the constant called PER_POSITION_HASH.

That profile should be a Hash, with keys pointing to A, T, C, G and the values being an Array of likelihood chance there, as a number, such as 140. These values are also called scores. Each score contains a number for each position that indicates how likely it is to find the given nucleotide at that location.

You can also use this class to generate a random DNA string, similar to the method called Bioroebe.generate_random_dna_sequence(). The difference is that class ProfilePattern allows for a bit more fine-tuned control. The class will likely be extended in the future too.

Generate DNA via Bioroebe.random_dna

You can "generate" random DNA strings by making use of the following code:

x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"

As can be seen in the example above, after the #, a String will be returned representing that nucleotide sequence. In the case above it'll be 50 nucleotides in length.

The number given to .random_dna() tells the method how many nucleotides should be generated.

The method accepts a second argument, which should be a Hash. If it is a hash then the generated DNA will be based on the probabilities given to that Hash.

Let's look at specific example here:

Bioroebe.random_dna(50, { A: 10, T: 10, C: 10, G: 70}) # => "GGGGTGGGGAGGGTATGCGGAGGAAGGGCGGGAAGGGCGGGGGCTGGGCG"

As you can see, in the Hash defined above, the likelihood for incorporating a Guanine is much higher than for Adenine (70 : 10). This will be reflected in the generated DNA sequence which, as can be seen, contains many more Guanines than Adenines.

There is yet a third use case for the above. If you pass a String as the second argument rather than a Hash, then that String will be used as basis for generating the DNA string at hand.

Again, let's look at a specific example here:

Bioroebe.random_dna(10, 'ATCGATCGGG')

Here we add more G than A, T or C, so the new DNA sequence should contain these nucleotides as well.

More usage examples in this regard:

Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "TGAGGGGGGGGGTGGGAGGG"
Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "GGTAGGGGGGGGTAGGGGGG"

Note that this is similar to the .randomize() method in the bioruby project:

hash = {'a'=>1,'c'=>2,'g'=>3,'t'=>4}
puts Bio::Sequence::NA.randomize(hash) # => "ggcttgttac" (for example)

Generating a random nucleotide sequence based on frequencies

If you ever need to generate a nucleotide frequency then you can use the following method:

Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500

Parsing genbank (.gbk) files

You could use class Bioroebe::GenbankParser to parse .gbk files, at the least if you want to obtain the raw sequence, in FASTA format.

Example for this:

require 'bioroebe/genbank/genbank_parser.rb'
result = Bioroebe::GenbankParser.new('/home/Temp/bioroebe/ls_orchid.gbk')
result.dataset? # This method call will return the FASTA sequence. 

Note that this currently (July 2022) only grabs one entry. In the upcoming rewrite in the future the parser will be able to parse all entries, and then present them to the user. Stay tuned in this regard.

Parsers in general

The bioroebe project will store most parsers in the parsers/ subdirectory since as of July 2022.

Prior to that date different parsers were stored in different subdirectories, such as the parser for genbank-files being stored in the genbank/ subdirectory. As I found this situation confusing, I settled for the parsers/ subdirectory since as of July 2022.

Coomassie staining of proteins

Coomassie staining is typically done on proteins, giving them a blue or blueish colour. Coomassie staining is the most popular anionic protein dye.

This may look like this:

This picture shows five different bands. The molecular weight of the marker can be seen on the very left hand side, in kDa. The larger fragments can be seen on top, so the farther the band has moved, the smaller the fragment must be (in kDa). That means that the larger proteins can be found on top; the smaller proteins on the bottom.

Some bands are missing, and this gives information - that is that a particular protein is missing. Probably it was not synthesized in the given tissue at hand.

The staining for a Coomassie Blue stain is typically done via G-250, with a 0.5% density prepared in 50% methanol and 10% acetic acid. The staining duration is usually done for 5 minutes.

Note that the G-250 stain is the dimethyl derivative from R-250 - the R stands for red or reddish. Both dyes will bind via electrostatic interaction with protonated basic amino acids: that is lysine, arginine, and histidine. They can also bind via hydrophobic associations to aromatic residues.

Coomassie stains are in principle reversible. They are not as sensitive as silver staining, but significantly cheaper, which is one reason why they have become so popular.

Not every protein has all aminoacids, so staining may be difficult. For instance, the glycomacropeptide is the only known naturally occurring protein that contains no Phe (Phenylalanine; F).

A protein that lacks lysine, arginine, histidine or aromatic acids may be undetectable via Coomassie staining. However had, this does not seem to be a universal rule; some groups report that they even managed to stain "unstainable" proteins via Coomassie staining.

The paper at https://www.jbc.org/article/S0021-9258(17)39198-6/pdf, titled "Why Does Coomassie Brilliant Blue R Interact Differently with Different Proteins?" and published in the year 1985, tries to give some explanations to different groups yielding different results via Coomassie staining.

They specifically point out that "there is a striking correlation between intensity of response to Coomassie dyes and the basicity of a protein which depends on the number of lysine, histidine, and arginine residues, as well as the NH₂-terminal amino group" (aka the aminoterminus of the protein at hand). The concluding remark from that paper is that "Coomassie R Interacts Differently with Different Proteins".

On class Bioroebe::Protein you can determine whether a given protein can be stained via coomassie through the following method:

.can_be_stained_via_coomassie?

This isn't an ideal check, so don't rely on it. It will simply check whether the sequence has at the least one lysine, or one histidine, or one arginine, or any of the aromatic amino acids.

Codon Usage

This paragraph deals with some aspects of codon usage in different organisms.

Let us first define the term codon usage so we can base any further analysis on this definition. In order to do so, we also have to define what a codon is, so let's start with that actually.

A codon is essentially the basic code used in DNA to denote which particular aminoacid corresponds to these (three) nucleotide base pairs. A codon is thus a series of three nucleotides, also called a triplet, such as ATG.

When we use the term base pairs, we refer to double-stranded DNA, abbreviated as dsDNA. The codon is, however had, only found in a single stranded molecule, even within dsDNA. Since some parts of a dsDNA in any given genome give rise to a, more or less, complementary copy into mRNA, the codons that are actually used, are found in the corresponding mRNA as well, excluding the codon that codes for a stop signal (a so-called stop codon). (Remember that mRNA differs from DNA in that there will be Uracil rather than Thymine; otherwise it is the same, sequence-wise. Of course it uses another sugar (Ribose), but remember we are here mostly interested in the information-containing part, not the full chemical structure.)

The codon is thus found on the mRNA and since mRNA is mostly single-stranded, the codon is a component of the mRNA. The two subunits of the ribosome are assembled on a mRNA, at the least in prokaryotes (or more accurately, the smaller subunit scans along the mRNA until it detects a start codon). Mind you, this subsection will not go into all relevant details, so just keep in mind that the codon is the part that will eventually be "translated" at the ribosome into a corresponding aminoacid, excluding stop codons at the end.

Now - different organisms use different frequencies of codons. Codon usage thus describes the fact that many proteins in these different organisms make use of certain codons with a substantially higher frequency than other codons. We can use statistics to infer this on a global (proteome) level too.

Remember that the genetic code is degenerate, meaning that you have a few aminoacids that are encoded only by one codon (Tryptophan and Methionine), whereas the other aminoacids are encoded by more than one codon - thus, at the very least two codons. Note that the latter codons, if they code for the same aminoacid, are also called synonymous codons.

This means that if you have any given aminoacid chain, you can have several different sequences that would yield to the very same amino acid chain (and codons in these sequences, which ultimately means that you can have different DNA sequences code for the very same aminoacid chain).

Usually the third base of a codon has the least influence on codon meaning. This is also called wobbling - since the anticodon loop on the tRNA is in the reverse direction, and the wobble position refers to the tRNA, this means that the wobble-position is at the 5'-end of the tRNA anticodon.

Now a few words about functionality related to codons and codon usage in the Bioroebe project.

Say that you have a long DNA sequence; let's pick a sample for now, such as:

ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG

You can analyze the codons used via class ShowCodonUsage and the corresponding entry at bin/show_codon_usage:

show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG

This class can be found at bioroebe/codons/show_codon_usage.rb. It will report the top 5 codons in use and also output the frequency hash on the commandline.

On my computer at home the output it yields via the commandline, on a KDE konsole terminal, looks like this:

You can use this from within ruby code too, via the following toplevel method:

Bioroebe.codon_frequencies_of_this_sequence(ARGV)

To get the hash of the codon frequencies you can use the .hash? method:

hash = Bioroebe.codon_frequencies_of_this_sequence('ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG').hash?

If you want to look at the actual codon frequencies used by different organisms, have a look here:

http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N

This is an excellent resource.

For instance, the E. coli K strain can be found here:

https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=83333&aa=9&style=N

Determining the frequencies of aminoacids in a given aminocid (protein) sequence

If you quickly wish to determine the aminoacid composition, as a Hash, you can use bin/aminoacid_frequencies.

Example from the commandline for this:

aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST

Example from within bioroebe itself (and thus ruby):

require 'bioroebe/frequencies.rb'

Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')

The latter will return a Hash that you can then further make use for, such as:

{"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}

Determining the codon frequencies from the commandline

In April 2022 I noticed that one use case is to show the codon frequencies of a given sequence - typically a nucleotide sequence.

For aminoacids there already was an executable, at bin/aminoacid_frequencies.

So, following that logic, a new executable was added at bin/codon_frequency. This will show the Hash of the codon frequencies, as a String, on the commandline.

Usage example:

 codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT

The output of this would be the following:

AUU: 2
CGU: 2
ACG: 1
AUC: 1
GAC: 1
UGA: 1
CUG: 1
ACA: 1
GUC: 1

Showing the codon frequency via countcodon

The excellent website at https://www.kazusa.or.jp/codon/countcodon.html offers a rather useful functionality via a simple web-interface, in that you can pass in a mRNA sequence, and it will then show the codon frequency/likelihood of that sequence - all codons in that sequence, that is. This can be extended to all protein-coding genes in a given genome, and will thus be useful for a researcher who may be interested in determining the codon frequency in general, across all genes in that given genome.

You can test it with an input sequence.

For instance, the following sequence:

ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA

Would yield this result:

fields: [triplet] [frequency: per thousand] ([number])

UUU  0.0(     0)  UCU  0.0(     0)  UAU  0.0(     0)  UGU  0.0(     0)
UUC  0.0(     0)  UCC  0.0(     0)  UAC 25.6(     1)  UGC  0.0(     0)
UUA  0.0(     0)  UCA 25.6(     1)  UAA 25.6(     1)  UGA102.6(     4)
UUG  0.0(     0)  UCG 25.6(     1)  UAG  0.0(     0)  UGG  0.0(     0)

CUU  0.0(     0)  CCU  0.0(     0)  CAU 25.6(     1)  CGU 76.9(     3)
CUC  0.0(     0)  CCC  0.0(     0)  CAC  0.0(     0)  CGC  0.0(     0)
CUA  0.0(     0)  CCA  0.0(     0)  CAA  0.0(     0)  CGA 25.6(     1)
CUG102.6(     4)  CCG  0.0(     0)  CAG 25.6(     1)  CGG  0.0(     0)

AUU 76.9(     3)  ACU 25.6(     1)  AAU  0.0(     0)  AGU 51.3(     2)
AUC 76.9(     3)  ACC  0.0(     0)  AAC  0.0(     0)  AGC  0.0(     0)
AUA  0.0(     0)  ACA 76.9(     3)  AAA  0.0(     0)  AGA  0.0(     0)
AUG  0.0(     0)  ACG 76.9(     3)  AAG  0.0(     0)  AGG  0.0(     0)

GUU  0.0(     0)  GCU  0.0(     0)  GAU 25.6(     1)  GGU  0.0(     0)
GUC 51.3(     2)  GCC  0.0(     0)  GAC 76.9(     3)  GGC  0.0(     0)
GUA  0.0(     0)  GCA  0.0(     0)  GAA  0.0(     0)  GGA  0.0(     0)
GUG  0.0(     0)  GCG  0.0(     0)  GAG  0.0(     0)  GGG  0.0(     0)

At any rate, the individual functionality for that is also available within the Bioroebe project since as of April 2022.

The method that does so is:

Bioroebe.frequency_per_thousand
Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.

Sinatra-bindings exist to this functionality since as of July 2022, but they are not very well-polished. Ruby-gtk3 bindings may be added at a later time, and possibly ruby-libui bindings as well, for windows support. What is missing is support for different codon tables in different species, but that may be added at a later time as well - for now it seemed more important to offer the functionality.

Working with PDB files (.pdb)

The PDB, founded in the year 1971, holds lots of atomic structures of proteins.

For instance, in July 2016 it contained 121000 structures.

In February 2018 it contained ~124000 structures (from X-ray crystallography), and about ~12000 NMR structures. NMR is limited to about 350 amino acids maximum length, give or take.

In April 2020 the PDB contained 163141 structures.

We can see that more and more structures are available nowadays - a trend that will most likely continue or even accelerate. (Let's hope the quality also remains high.)

A typical .pdb file contains entries such as this:

RTyp  Num  Atm Res Ch  ResN  X       Y       Z      Occ  Temp   PDB   Line
ATOM    1  N   ASP L   1     4.060   7.307   5.186  1.00 51.58  1FDL  93
ATOM    2  CA  ASP L   1     4.042   7.776   6.553  1.00 48.05  1FDL  94
ATOM    3  N   VAL A  25    32.433  16.336  57.540  1.00 11.92   A1    N
ATOM    4  CA  VAL A  25    31.132  16.439  58.160  1.00 11.85   A1    C
ATOM    5  C   VAL A  25    30.447  15.105  58.363  1.00 12.34   A1    C

(Not the first line; RTyp is just an explanation for the ATOM entries below that line).

The sequence starts from the N-terminal residue for proteins; see the Atm entry at Num 1.

The meaning of these entries is as follows:

1) RTyp: Record Type
2) Num:  Serial number of the atom.  Each atom has a unique serial number.
3) Atm:  Atom name (in IUPAC format).
4) Res:  Residue name (IUPAC format).
5) Ch:   Chain to which the atom belongs (in this case, L for light chain of an antibody).
6) ResN: Residue sequence number. This will be incremental e. g. 1, 2 3, 4 and so forth.
7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
10) Occ: Occupancy factor
11) Temp: Temperature factor (atoms disordered in the crystal have high
          temperature factors; they are "wobbly" with a high factor.
          This is also called the B-factor).
12) PDB: The PDB data file unique identifier.
13) Line: Line (record) number in the data file.

Typically the entry on the most right area, the last one, specifies which atom it is. A H stands for a hydrogen atom; the other atoms are "heavy" atoms (heavier than hydrogen most definitely).

Most .pdb files will contain SEQRES entries. These entries will list the primary sequence of the polymeric molecules present in the entry. You can notice this by looking at the standard 3-character code used by SEQRES here, for the canonical amino acids. So, for instance, the amino acids that will be mentioned in a SEQRES entry are ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN, PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the method Bioroebe.three_to_one() to convert back to the one-letter chain such as follows:

Bioroebe.three_to_one('PHE') # => "F"

The data in a .pdb file need not necessarily only be a protein, with a specific aminoacid sequence. It may also include DNA. An example for such a molecule is 2dgc, which includes a protein chain and a DNA chain.

As far as the bioroebe project is concerned, you can parse .pdb files via the following class:

Bioroebe::ParsePdbFile.new
Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')

This class also allows some shortcuts for integrated .pdb files, that is files that are bundled with the bioroebe project:

Bioroebe::ParsePdbFile.new ':1fat'

This requires a String because ruby symbols may not start with a number. Note that this also works through the commandline, such as:

parse_pdb_file :1fat

A shell such as bash does not understand ruby symbols, so instead a string will be passed in, being :1fat. The ParsePdbFile will handle this correctly internally.

Note that a small bug was fixed in the file parse_pdb_file.rb; some entries were skipped due to an erroneous loop in the ruby file. This was corrected in May 2020.

In March 2021 the ability to use entries such as ':1fat' was removed again; the code remains though. The reason why this was removed was that the .pdb files are quite large, so distributing them via the bioroebe project makes no real sense. Consider simply downloading the .pdb files; you can use this from the bioshell or via something like:

pdb 5TIM

Note that you can also return the aminoacid-sequence from a .pdb file directly, since as of May 2020.

Example for this:

Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"

The first argument should be the path to the (local) .pdb file at hand. (In theory support for remote .pdb files could also be added easily, but right now this is not possible, so you have to download it first.)

The specification for .pdb files can be read at the following two remote resources:

http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM

Note that the parse_pdb_file.rb can also do some additional things, such as calculating the maximum distance between atoms in that file, via the method .try_to_determine_the_max_distance_between_the_atoms_in_this_protein().

If you wish to report the secondary structures from a given .pdb file then you can use the following class:

require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'

Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')

If you wish to obtain the FASTA sequence of a particular remote .pdb file then you can use this API:

x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"

Keep in mind that this is the FASTA sequence; the .pdb file itself has another format, and contains a lot more information, such as the various ATOM entries.

Since as of June 2020 the command fetch also works from within the Bioshell, similar to how pymol works. This allows us to quickly download a remote .pdb file.

fetch 2BTS

You can also use the following toplevel-API to download a remote .pdb file:

Bioroebe.download_this_pdb
Bioroebe.download_this_pdb '355D'
Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
Bioroebe.download_this_pdb '1fat.pdb' # Lectin Phytohemagglutinin

This will refer to a remote URL such as https://files.rcsb.org/view/1FAT.pdb.

Note that this will be automatically moved to the "correct" default position in the bioroebe-project, under the pdb/ subdirectory.

You can also invoke this script from the commandline via bin/download_this_pdb, like in this way:

download_this_pdb 355D

This works with several .pdb files in one go as well:

download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M

They would all be downloaded one after the other. Be aware that this will overwrite the old .pdb files on that position, so if you don't want this, I recommend to do a backup on the pdb/ subdirectory before invoking the above call.

You can also turn the FASTA sequence stored in a .pdb file into a .fasta file, via --create-fasta-file.

Usage examples:

parsedb 1NR6 --create-fasta-file
parsedb 2F9Q --create-fasta-file
parsedb 3TDA --create-fasta-file
parsedb 2HI4 --create-fasta-file
parsedb 2V0M --create-fasta-file

So if you have a file called 1NR6.pdb and you use the first input, a .fasta file will be created. If such a .pdb file does not exist then this will not work, so make sure to download the .pdb file before invoking this commandline-flag.

Last but not least, the following table shall document the PDB format - it is not yet complete, but it is intended to add the remaining datasets eventually:

Record Name  Describes
MODRES       Modifications to standard residues
HET          Nonstandard residues (as well as ligands, ions and water)
HETNAM       Full chemical name of the residue
HETSYM       Synonyms for the residue
FORMUL       Chemical formula of the residue
KEYWDS       specifies keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"

Determining how many stop codons existing in a given sequence

You can use bin/n_stop_codons_in_this_sequence to determine how many stop codons exist in a given sequence at hand.

Usage example from the commandline:

n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4

You can also separate these via a ' ' spacer on the commandline of course:

n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4

Internally this makes use of the method called Bioroebe.n_stop_codons_in_this_sequence? or one of its aliased names. Usage example for the method, just as in the first example shown above:

Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4

The Aliphatic Index of Globular Proteins

In a paper from 1980, Atsushi IKAI provided a formula with which one can calculate the aliphatic index of a globular protein, in a short paper titled "Thermostability and aliphatic index of globular proteins" (PMID: 7462208, see here).

Atsushi provided a statistical analysis of proteins, and determined that the aliphatic index - which is defined as the relative volume of a protein occupied by aliphatic side chains (alanine, valine, isoleucine, and leucine) - of proteins of thermophilic bacteria is significantly higher than that of ordinary proteins.

Atsushi reasoned that the index may be regarded as a positive factor for the increase of thermostability of globular proteins. The enzymes of some organisms are more stable at higher temperature than the enzymes of other organisms, in particular among thermostable proteins.

Thus, there is a good correlation between the "aliphatic index" on the one hand, and the thermostability of proteins on the other hand.

Atsushi gave the following formula for calculating this:

Aliphatic Index = XA + aXV + b (xI+XL)

The four letters A, V, I and L refer to the four aminoacids Alanine, Valine, Isoleucine and Leucine. The two coefficients a and b are the relative volumes of the side chains of Alanine. A has a value range of 2.8-3.0 and b has a value range of 3.8-4.0.

The method called .aliphatic_index() is making use of that formula. As values for a and b the two values 2.9 and 3.9 have been taken. The code in the bioroebe project for this has been inspired by: https://github.com/wwood/bioruby-aliphatic_index

It yields the following usage example for bioruby:

Bio::Sequence::AA.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333

In bioroebe, the equivalent would be:

Bioroebe::Protein.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333

The FASTA and FASTQ format - working with FASTA and FASTQ files

The FASTA format, and the FASTQ format, are sort of the default (non-binary) formats for sequence data. There is not really one standard to rule them all when it comes to FASTA-based formats, but the BioRoebe project will try to adhere to the most commonly used FASTA and FASTQ format, respectively.

FASTQ is used in particular for Next-Generation Sequencing data, and will also include a per-base pair quality score. Thus, it will contain DNA sequence data, with specific quality scores.

The FASTQ format typically begins with the character @. (Do note that @ is also allowed to appear within the body of a FASTQ entry, so be careful if you decide to want to split on every @ - they are not equal.)

A FASTA file, on the other hand, typically starts with one description entry (the identifier), and the sequence body, also called sequence data.

For example, the content of a FASTA file may look like this:

>ENSMUSG000029122

CCTCCTAGTAGAGCGACGACACCCTCCTAGTAGAGCGACGACACCCTCCTAGTAGAGCGACGACACCCTCCTAGT
GAGCGAACACCCTCCGCCTCCTAGTAGAGCGACGACACCCTCCTAGTAGAGCGACGCTAGTAACACCCTCCTAGT

The file suffix for a FASTA file is typically .fasta or .fa. I personally prefer the longer suffix (.fasta), but ultimately it does not matter; other projects tend to prefer .fa.

Since there is not one universal standard for FASTA, fuzzy matching is recommended for extracting the identifier. Some flexibility will be required in tools that parse FASTA and FASTQ files.

Recommended identifiers may look like this:

>gene_123456 length=231;type=dna

Or, perhaps separated via | too.

>gene_123456 | length=231;type=dna

The designated ID (in the example above, 123456) should ideally be unique, so it may be better to use long IDs here, as that reduces the chance of the same ID used accidentally. You could also use characters from the alphabetic in combination, such as ABC_123456.

Note that class Bioroebe::ParseFasta not only parses the sequence, but it can also ensure that every header conforms to list the length and (identified) type. If you wish to store this for a local .fasta file, you can try the following (but make a backup-copy of that .fasta file first, just in case):

require 'bioroebe'

x = Bioroebe::ParseFasta.new('foobar.fasta')
x.sanitize_the_description
x.save # ← Note that this will overwrite the old file, so be wary of that.

Let's now analyse a typical FASTQ file.

First, let's show one such example:

@DJB77SP1:248:DOMDGACXX:7:1202:12362:49613
TGCTTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAA
+
JJJJJOOJJJJJJHIHHHGHFFFFFFCEEEEED8D?DDDDDDBDDDABDDCA
@DJB775P1:248:DOMDGACXX:7:1202:12782:49716
CTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTAGATCGG
+
IIIIIIIIIIIIIIIHHHHHHFFFFFFEECCCCBCECCCCCCCCCCCCCCCC

The first line is called the description line. The second line, before the + character, is sequence data, which can be on one or many lines. The + character indicates the end of the sequence. This can be used by a parser to know when/where to stop. The line that then comes after the

  • character is the so-called quality data. This one must be exactly as long as the sequence line at hand. This can be used as an additional cue if you ever want to write your own parser, or improve some existing parser that shows this problem, by the way.

Each numeric base quality is encoded with ASCII characters using a certain scheme. The individual values are called scores.

The scores in FASTQ are ASCII characters, as stated, and they range from a value of 33 up to a value of 126. (The reason as to why the first 32 characters are not used is because some of these characters are special - the "\07", for example, makes a ding-noise which is not very useful.)

Note that the three FASTQ-scoring schemes are stored in the file fastq_quality_schemes.yml - to be exact, it can be found at bioroebe/yaml/fasta_and_fastq/fastq_quality_schemes.yml. You can output this table from within the Bioroebe::Shell like this:

show_fastq_quality_score_table

Note that these scores are also called fastq-sanger, fastq-solexa and fastq-illumina. The most frequently (and recommended) variant is fastq-sanger, by the way.

The following table shows these scoring techniques

Name ASCII character range Offset Quality score type Quality score range
Sanger, Illumina (since version 1.8) 33-126 33 PHRED 0-93
Solexa, early Illumina (before 1.3) 59-126 64 Solexa 5-62
Illumina (versions 1.3-1.7) 64-126 64 PHRED 0-62

How do we calculate the base quality schemes?

Let's take an example .fastq file that is also distributed within the BioRoebe project, called fastq_example_file.fastq, residing at the location bioroebe/fasta_and_fastq/fastq_example_file.fastq.

This file has the following nucleotide sequence:

GTTGTTCTTGATGAGCCATGAGGAAGGCATGCCAAATTAAAATACTGGTGCGAATTTAAT

and the associated quality score tied to this sequence being:

CCFFFFHHHHHJJJJJEIFJIJIJJJIJIJJJJCDGHIIGIGIGIGJIIIIJIJIJIIIH

Remember that these characters in regards to the FASTQ scheme are just numbers, specified by the ASCII code.

Let's convert these characters into numbers then via .ord():

x = "CCFFFFHHHHHJJJJJEIFJIJIJJJIJIJJJJCDGHIIGIGIGIGJIIIIJIJIJIIIH"

array = x.chars.map {|entry| entry.ord }
pp array # => [67, 67, 70, 70, 70, 70, 72, 72, 72, 72, 72, 74,
74, 74, 74, 74, 69, 73, 70, 74, 73, 74, 73, 74, 74, 74, 73,
74, 73, 74, 74, 74, 74, 67, 68, 71, 72, 73, 73, 71, 73, 71,
73, 71, 73, 71, 74, 73, 73, 73, 73, 74, 73, 74, 73, 74, 73,
73, 73, 72]

So far so good - we can see that these are just numbers (numeric values).

Note that these numbers are not yet the real numbers - we must take the offset into account as well. The offset is stored in the file bioroebe/yaml/fastq_quality_schemes.yml, as stated above.

Let's say that we use the PHRED Sanger score type, which is the most commonly used variant these day; Illumina past version 1.8 makes use of this format.

Looking into that yaml-file, the offset defined there is 33, so the above code has to be modified via this offset like in the following way:

require 'bioroebe/toplevel_methods/return_fastq_offset.rb'

x = "CCFFFFHHHHHJJJJJEIFJIJIJJJIJIJJJJCDGHIIGIGIGIGJIIIIJIJIJIIIH"

offset = Bioroebe.return_sanger_offset # Obtain our initial offset value here.

array_phred_scores = x.chars.map {|entry| entry.ord - offset } # ← We deduct the offset here from each element.

pp array_phred_scores # => [34, 34, 37, 37, 37, 37, 39, 39, 39,
39, 39, 41, 41, 41, 41, 41, 36, 40, 37, 41, 40, 41, 40, 41,
41, 41, 40, 41, 40, 41, 41, 41, 41, 34, 35, 38, 39, 40, 40,
38, 40, 38, 40, 38, 40, 38, 41, 40, 40, 40, 40, 41, 40, 41,
40, 41, 40, 40, 40, 39] # Our final result \o/

As can be seen, the numbers have been corrected via the offset in use.

Since fastq-sanger uses PHRED scores, we now have the proper PHRED quality scores. We can now use two more formulas to calculate useful information about this.

First, to convert these quality scores to the estimated probability that the base is actually correct, we can use the following formula:

P = 10 ** (-Q/10)

P stands for Probability. An example to this follows next:

Say that you have a quality score of 48; then, in ruby code, you would get an error probability of:

10 ** (-48 / 10.0) # => 1.584893192461114e-05

aka 0.000016 and so forth. (Remember: 1.6e-05 == (0.000016), in ruby)

For a quality score of 50, we would get:

10 ** (-50 / 10.0) # => 1.0e-05

aka 0.00001, which is also a shorthand notation for "1 in 100.000".

The method Bioroebe.phred_error_probability() will also output this result; simply pass in the quality score to that method:

Bioroebe.phred_error_probability(50)

See also the following wikipedia entry:

https://en.wikipedia.org/wiki/Phred_quality_score#Definition

And to go from Probability to Quality, we can use the inverse of the function:

Q = -10 ** (log10 * P) 

Let's show how to obtain the base qualities, from using the above array (array_phred_scores):

phred_quality_scores = array_phred_scores.map {|q|
  (10 ** (-q/10)).to_f
}

And the output will be (truncated):

[0.0001, 0.0001, 0.0001, 0.0001, 0.0001, ... ]

If you wish to parse a fasta file on the commandline, you can try to invoke:

bin/parse_fasta
parse_fasta

This is not ideal but may still provide some helpful information.

You can also return all FASTA subsections from a given fasta file, through this toplevel API:

Bioroebe.fasta_subsection
Bioroebe.return_fasta_subsection
Bioroebe.return_fasta_subsection_of_this_file

These are aliases; use whichever variant you prefer. The first argument should be the path to a local fasta file (which is ideally stored with a .fa or .fasta suffix).

I have an alias to this and can then use this from the commandline like so:

return_fasta_subsection_of_this_file /Depot/Bioroebe/Mouse/ncrna/ncrna.4.fa

This will parse all FASTA entries into FASTA headers and the associated FASTA bodies, stored as an Array.

FASTA files may contain the aminoacid sequence of a protein or the DNA sequence of a gene.

If you wish to colourize a FASTA sequence, e. g. for display on the commandline, you could use the following API:

Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
 'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
 'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
)

Code for this functionality resides in bioroebe/toplevel_methods/fasta_and_fastq.rb since as of December 2020. Note that these colours are typically used on a dark background, so they may not look ideal on white background. The colours are defined via the constant FILE_COLOURIZE_FASTA_SEQUENCES, which typically points at the .yml file at bioroebe/yaml/colours/colourize_fasta_sequences.yml. If you want to use different colours, you have to change that file (e. g. change A: teal or C: slateblue to some other colour; these are HTML colours, so it is recommended to use the names of these HTML colours).

In July 2022 the method Bioroebe.colourize_this_fasta_sequence was extended slightly. You can now attach a "ruler" to the output, that is a numbered series that shows the nucleotide position, on the commandline.

Example for this:

puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
 'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
 'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
) { :with_ruler }

By default this will use a white colour on black background. If you want to modify the foreground colour you can pass the colour name to the method, such as via:

puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
 'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
 'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
) { :with_ruler_steelblue_colour }

The following image shows how this can be used on the commandline:

At a later time this may be extended to allow for use in a webpage, that is to embed these strings directly into HTML or .php or .cgi.

If you wish to show a chunked display of the dataset (nucleotides normally) then you can use the following API:

Bioroebe.downcase_chunked_display 'ATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCA'

This may output something like the following:

    1 atggggcctg caatggggcc tgcaatgggg cctgcaatgg ggcctgcaat ggggcctgca
   61 atggggcctg caatggggcc tgcaatgggg cctgca

This is the gist of this functionality: to show a sequence format that is similar to the genebank format.

This also works from the commandline directly, since as of March 2020:

downcase_chunked_display ATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCAATGGGGCCTGCA

And you can use it from within the interactive bioshell:

chunked_display

(It may be easier to assign a sequence before using that method.)

If you are using the interactive bioroebe-shell, then you can also load up files from the local directory, via:

use NUMBER
use 9

So the above would try to load up file at position 9 in the local directory.

As far as the BioRoebe project is concerned, one of the most important classes for reading (and parsing) the information content of a FASTA file is class Bioroebe::ParseFasta. If you use the interactive shell of the BioRoebe project, and read from a .fasta file, then you can also access the individual entries found in that file via:

fasta1 # ← Display the first FASTA entry of that .fasta file.
fasta2 # ← Display the first FASTA entry of that .fasta file.
fasta3 # ← Display the first FASTA entry of that .fasta file.

Additionally the sequence found there will be assigned to the main sequence of the bioshell instance. The idea here is to allow you to quickly operate on the data at hand, and keep on moving as you go.

This functionality has been added on 01.01.2020.

You can also automatically rename a .fasta / .fa / .ffn file if you would like to.

The API is:

Bioroebe.automatically_rename_this_fasta_file
Bioroebe.automatically_rename_this_fasta_file('FILE_NAME_HERE')
Bioroebe.automatically_rename_this_fasta_file('foobar.fa')

You can also use this from the bin/ subdirectory via:

automatically_rename_this_fasta_file foobar.fa

In February 2021, a method was added called Bioroebe.sanitize_this_fasta_file(). What this method currently does is that it will add how many aminoacids are in a given fasta sequence. Be careful when using this method, as it will overwrite the existing .fasta file. So make a backup of that file before calling that method.

The reason why this method was added was because I needed a way to display these files via a web-interface, in raw format, and I did not want to use any other tool to tell me how many aminoacids are in that fasta-sequence. So it's just a little helper utility really.

In the future this method may make more modifications, though.

If you ever have a use case to simplify the header in a FASTA file, you could use bin/simplify_fasta. I wrote this ad-hoc code for MEGA (in April 2021), so that we can do a phylogenetic analysis, but the code isn't really very polished for now. It may have to be cleaned up at a later time.

Since as of May 2021 it is also possible to work with local files containing the NCBI IDs, making use of bin/download_fasta.

Example:

download_fasta table_ids.md

One line per protein-entry, such as:

XP_018122880.1
XP_009892820.1
XP_025234673.1
NP_001002068.1
XP_012628786.1

You can also use the bioshell to work with FASTA files.

For example, if you want to read in some fasta file, then the following command(s) can be used:

pfasta
parse_fasta

Specific examples:

pfasta foobar.fasta
pfasta foobar.fa

If you do not provide input, the bioshell will first try to find any .fasta file in the current working directory, then any .fa file. If neither exists, we will use a default input. This should allow the user to quickly test some things, hence why this feature exists in this way.

SAM and BAM files

The SAM format is a text format for storing sequence data in a series of tab delimited ASCII columns. It consists of a header and an alignment section.

The binary equivalent of a SAM file is called a Binary Alignment Map (BAM) file.

These formats are used for NGS datasets; and the samtools command is often used to tap into information from these programs.

A typical SAM file may start like this:

@SQ SN:I  LN:15072434

We can always find a leading @ character there.

Take note that the entries there, separated via a :, are just like key→value pairs of a hash - e. g. SN:I and LN:15072434, could be written as SN→I and LN→15072434.

The samtools can convert between BAM and SAM format; a tutorial can be seen here:

http://biobits.org/samtools_primer.html

Example for the commandline usage:

samtools input.sam > foo.bam

There is also a little code denoting 1, 2, 4, 8 and 16 for SAM headers.

 1 = 00000001  (paired end read)
 2 = 00000010  (mapped as proper pair)
 4 = 000000100 (unmappable read)
 8 = 00001000  (read mate unmapped)
16 = 000100000 (read mapperd and reverse strand)

BAM files can be index like this:

samtools index output.bam

The Mouse

This subsection is about the mouse, that is the organism called mouse - in particular relevant remote websites (FTP sites). The paragraph was added mostly for convenience, in the event that I forget important URLs here.

FTP Listing:

ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/ ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/ ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/cdna/ ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/ncrna/

Other online-resources:

https://www.ensembl.org/Mus_musculus/Info/Index

If you wish to download these via wget, try these variants:

wget -r -nH --cut-dirs=4 --no-parent ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/
wget -r -nH --cut-dirs=4 --no-parent ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/cdna/
wget -r -nH --cut-dirs=4 --no-parent ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/ncrna/

You can find more genetic information at the Mouse Genome Database.

Bioroebe::Colourize

This small submodule, called Bioroebe::Colourize, can be used to quickly colourize a given sequence, such as a DNA sequence in FASTA format.

This can then be used on a website, on an ad-hoc basis if you want to show a colourized result to the user - such as in a sinatra application or a Ruby-on-Rails application.

You can even generate a new .html file that way. Usage example for this:

Bioroebe::Colourize.generate_HTML_page_via_this_input
Bioroebe::Colourize.generate_HTML_page_via_this_input()
Bioroebe::Colourize.generate_HTML_page_via_this_input('foobar.fasta')
Bioroebe::Colourize.generate_HTML_page_via_this_input('GATCGTACTGACA')

All variants should work fine, but number #3 is probably the simplest, as you just provide the path to the .fasta file, as-is.

This may then look like this in the browser:

Bioroebe.two_dimensional_plot()

The toplevel method Bioroebe.two_dimensional_plot() is mostly an ad-hoc method to quickly draw a 2D plot.

The first argument should be the dataset, as String.

The second argument is a Hash that may be extended.

The general idea is to extend support for drawing charts within the Bioroebe project.

Usage example:

Bioroebe.two_dimensional_plot(
  '6 12
  8 14
  10 18
  12 23
  14 28
  16 29
  18 26
  20 23
  22 20',
  {
    width:    800,
    height:   640
  }
)

This will make use of gnuplot and generate an image like this:

This is very beta-quality as of September 2022. Expect more changes pertaining to this in the future, until this API or related APIs become more useful. Users of the bioroebe project should be able to easily draw plots and graphs, for easy publication in scientific papers - that is the end goal one day.

class Bioroebe::FastaToYaml

The class Bioroebe::FastaToYaml was created in September 2022.

The idea here is to take a FASTA file, such as one stored under .fasta, and convert it into a YAML file (.yml), containing more specific information - typically stored as a hash.

It is not yet clear whether this class will be very important or not, but I have some ideas how to describe different genes via yaml files, and for this I need to be able to turn FASTA sequences into YAML files and vice versa. Stay tuned for more changes in this regard in the future.

Commandline options and usage of the BioRoebe project

This subsection will attempt to showcase some commandline-options for the BioRoebe project.

Commandline options in this context refers to arguments that can be passed into bioroebe via a -- flag, such as --help.

One such option is to enable us to load up a local .fasta file.

This can be done in the following way:

bioroebe --fasta=Arabidopsis_thaliana_chromosome_5_sequence.fasta

In other words - simply pass the location to the local .fasta file to the --fasta= option.

Via the option --silent you can do a silent startup of the BioShell, meaning that no intro-header will be shown when the BioShell is loaded up.

Via the option --sequence you can specify a sequence to be used right on startup; and the input can also be a number such as 150 or 42, as in the following example:

bioroebe --sequence=42

This will "generate" a random DNA sequence of length 42.

For all options, use --help.

You can "generaet" random aminoacids on the commandline, for whatever reason, like this:

bioroebe --n-aminoacids=33 # => ACWLMCTSGYERNWDPVATLEYTDYLGNGAPMA

The part after the # => would be shown on the commandline.

On the www there are a myriad of links to various other external sites.

It is a bit pointless to try to list all useful resources here in BioRoebe, but nonetheless this subsection will still show some entries. You have to evaluate the usefulness on your own, though, depending on your primary interest(s) of course.

GPDE:

A biological proteomic database. https://omictools.com/gpde-tool

GLASS:

The GLASS (GPCR-Ligand Association) database is a manually curated repository for experimentally-validated GPCR-ligand interactions.

Visit GLASS at: https://zhanglab.ccmb.med.umich.edu/GLASS/

Chaos Game representation of Gene Structure:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC330698/pdf/nar00192-0217.pdf

Biogems for bioruby:

https://biogems.info/

PNAS brings in new scientific insights:

https://www.pnas.org/

DDBJ: DNA Database of Japan:

https://www.ddbj.nig.ac.jp/index-e.html

Institute for Systems Biology (ISB, Seattle):

https://isbscience.org/

PIR (Protein Information Resource):

https://proteininformationresource.org/

PDBTM: Protein Data Bank of Transmembrane Proteins

http://pdbtm.enzim.hu/

InterPro: useful if you want to discover which families a specific protein belongs to and which domains it contains.

https://www.ebi.ac.uk/interpro/

Contact information and mandatory 2FA coming up in 2022

If your creative mind has ideas and specific suggestions to make this gem more useful in general, feel free to drop me an email at any time, via:

shevy@inbox.lt

Before that email I used an email account at Google gmail, but in 2021 I decided to slowly abandon gmail for various reasons. In order to limit this explanation here, allow me to just briefly state that I do not feel as if I want to promote any Google service anymore, for various reasons.

Do keep in mind that responding to emails may take some time, depending on the amount of work I may have at that moment.

In 2022 rubygems.org decided to make 2FA mandatory for every gem owner: see https://blog.rubygems.org/2022/06/13/making-packages-more-secure.html

As I can not use 2FA, for reasons I will skip explaining here (see various github issue discussions in the past), this effectively means that Betty Li and others who run the show at rubygems.org will perma-ban me from using rubygems as a developer.

As I disagree with that decision completely, this will mean that all my gems will be removed from rubygems.org prior to that sunset date, e. g. before they permanently lock me out from the code that I used to maintain. It is pointless to want to discuss this with them anymore - they have made up their minds and decided that you can only use the code if 2FA is in place, even though the code itself works just fine. If you don't use 2FA you are effectively locked out from your own code; I consider this a malicious attack. See also how they limited discussions to people with mandatory 2FA on the ruby-bugtracker, thus banning everyone permanently without 2FA:

https://bugs.ruby-lang.org/issues/18800

Guess it may indeed be time to finally abandon ruby - not because ruby is a bad language, but there are people now in charge who really should not be part of ruby in the first place.