CLIMB 2

Andrea Telatin edited this page Oct 6, 2020 · 3 revisions

Some new commands

Using nano

nano is a simple text editor that can be used for small edits on remote files. For larger tasks, it is better to edit the file locally with a more powerful text editor, save it and transfer it to the remote server.

The keyboard shortcuts are listed at the bottom of the screen, where ^ means Ctrl. To exit, for example, we press Ctrl + X (if the file has changed, nano asks us to confirm whether we want to save it).

the nano text editor

More on ls and wildcards

The general syntax is: ls [options] [files]. Both the options and the files are optional, and "files" can be files or directories. Here are some of the options:

  • -a Also show hidden files
  • -l Long format, will show one file per line, with size, owner, date…
  • -h Used with -l, will display file sizes in human-readable format (e.g. 2.2M instead of 2298011)
  • -d Show directories as files, without listing their content

The options can be combined together, and the following two commands are identical:

ls -l -h -a
ls -lha
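To see what -a and -d change in practice, here is a minimal sketch. It works in a throwaway directory created with mktemp, and the names phage, genome.fasta and .hidden_notes are made up for this illustration (they are not part of the course data).

```shell
#!/usr/bin/env bash
# Hypothetical sandbox: a temporary directory, so nothing real is touched.
demo=$(mktemp -d)
mkdir "$demo/phage"
touch "$demo/phage/genome.fasta" "$demo/.hidden_notes"

# Without -a the hidden file is skipped; with -a it is listed
# (together with the special entries . and ..).
plain=$(ls "$demo")
all=$(ls -a "$demo")
echo "$plain"
echo "$all"

# Without -d, ls lists the content of the directory;
# with -d it lists the directory itself.
content=$(ls "$demo/phage")
itself=$(ls -d "$demo/phage")
echo "$content"
echo "$itself"

# Clean up the sandbox.
rm -r "$demo"
```

The -d option becomes handy together with wildcards, when a pattern matches directories and we do not want their content to be listed.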

If we want to list the files present at the root, we don't need to move there: we can simply tell ls which path to scan:

ls /

Here is another example, with two paths:

ls ~/learn_bash/phage/ ~/learn_bash/files/

Shell expansion: wildcards to select multiple files

As we noticed, ls can receive more than one file. Usually, though, we don't type every single item to be listed: we use wildcards instead, and the shell expands them into a list of paths before running the command. We can use wildcards, ranges and lists.

Symbol | Meaning | Example
* | Any sequence of characters (any length) | *.fasta: all files ending with “.fasta”
? | Exactly one character | A???.txt: files starting with A, followed by exactly 3 characters, ending with “.txt”
[a-z] | Range: any single lowercase letter | file1[a-c].txt: file1a.txt, file1b.txt and file1c.txt
[0-9] | Range: any single digit | reads_R[1-2].fastq: reads_R1.fastq and reads_R2.fastq
{a,b} | Comma-separated list of words | vir_{protein,assembly}*: files beginning with vir_protein or vir_assembly
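A handy way to check what a pattern expands to is to pass it to echo instead of ls. The sketch below does this in a temporary directory; all the file names are invented for the illustration, and the braces require bash (or a similar shell).

```shell
#!/usr/bin/env bash
# Hypothetical sandbox with a few empty files to match against.
start=$PWD
demo=$(mktemp -d)
cd "$demo" || exit 1
touch file1a.txt file1b.txt file1d.txt \
      reads_R1.fastq reads_R2.fastq reads_R3.fastq \
      Abcd.txt Abcde.txt \
      vir_protein.faa vir_assembly.fna vir_genomic.gff

# The shell expands each pattern before echo runs.
star=$(echo *.fastq)                    # all three fastq files
question=$(echo A???.txt)               # Abcd.txt only: A plus exactly 3 characters
range=$(echo file1[a-c].txt)            # file1a.txt and file1b.txt (file1d.txt is out of range)
digits=$(echo reads_R[1-2].fastq)       # R1 and R2, not R3
braces=$(echo vir_{protein,assembly}*)  # vir_protein.faa and vir_assembly.fna

printf '%s\n' "$star" "$question" "$range" "$digits" "$braces"

# Go back and clean up.
cd "$start" || exit 1
rm -r "$demo"
```

Note that file1c.txt does not exist in the sandbox, so it is simply absent from the expansion: wildcards only match files that are really there.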

Extract columns from "text tables"

Yesterday we had a preview of TSV files, a very common way to store tabular data when working from the command line, including an example of a bioinformatics file (the GFF file with the annotation of the lambda phage).

The GFF format to store annotations

The GFF (General Feature Format) is used to store annotations. An alternative format, called GTF, is more focused on gene annotations, while GFF is more generic. They are both TSV (tab-separated values) files: tables where adjacent cells are separated by a single tab character.

The first lines optionally specify some metadata, and they are preceded by a #.

Let's see an example:

less -S ~/learn_bash/phage/vir_genomic.gff
 
# If we want to remove the header lines:
grep -v '^#' ~/learn_bash/phage/vir_genomic.gff | less -S 
 
# If we want to increase the tabulation:
grep -v '^#' ~/learn_bash/phage/vir_genomic.gff | less -S -x 15
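To make the effect of grep -v '^#' easier to follow, here is a small sketch on a made-up file. The two header lines and the two features below are hypothetical (they only mimic the layout of a GFF file) and are written to a temporary file.

```shell
#!/usr/bin/env bash
# Hypothetical mini GFF: 2 header lines and 2 tab-separated feature lines.
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf '##sequence-region phage1 1 5000\n' >> "$gff"
printf 'phage1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1\n' >> "$gff"
printf 'phage1\tdemo\tCDS\t100\t900\t.\t+\t0\tID=cds1;product=capsid protein\n' >> "$gff"

# ^# matches lines that start with #; -c counts the matching lines.
header_lines=$(grep -c '^#' "$gff")
# -v inverts the match: count every line that does NOT start with #.
feature_lines=$(grep -vc '^#' "$gff")
echo "header: $header_lines, features: $feature_lines"

# First line left after stripping the header.
first_feature=$(grep -v '^#' "$gff" | head -n 1)
echo "$first_feature"

rm "$gff"
```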

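If we want to extract all the lines with CDSs we can use grep -w, which requires the pattern to match a whole word (that is, to be surrounded by non-alphanumeric characters or by the line boundaries). We can then pipe the result to a second grep to keep only the lines that also contain the word capsid (-i ignores the case):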

grep -w CDS ~/learn_bash/phage/vir_genomic.gff
 
grep -w CDS ~/learn_bash/phage/vir_genomic.gff | grep -i capsid
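The difference between a plain grep and grep -w can be checked on a few lines of text. The four lines below are invented for the purpose; the sketch counts how many of them match with and without -w, then chains a case-insensitive search as in the commands above.

```shell
#!/usr/bin/env bash
# Four hypothetical lines, all containing the letters "CDS".
lines=$(printf '%s\n' \
  'phage1 CDS 100 900' \
  'note=CDSs overlap' \
  'type=CDS;product=Capsid protein' \
  'preCDS region')

# Plain grep: the pattern can appear anywhere, even inside a longer word.
anywhere=$(printf '%s\n' "$lines" | grep -c CDS)

# With -w, "CDSs" and "preCDS" no longer match.
whole_word=$(printf '%s\n' "$lines" | grep -wc CDS)

# Chain a second grep: -i makes "capsid" match "Capsid" as well.
capsid=$(printf '%s\n' "$lines" | grep -w CDS | grep -i capsid)

echo "anywhere: $anywhere, whole word: $whole_word"
echo "$capsid"
```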

A useful command to extract some columns from a text file is cut:

cut -f 1,3-5 ~/learn_bash/phage/vir_genomic.gff
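As a quick check of what -f 1,3-5 selects, the sketch below runs cut on a single made-up feature line (nine tab-separated fields, hypothetical values) and on a header line that contains no tab at all.

```shell
#!/usr/bin/env bash
# One hypothetical GFF-like row with 9 tab-separated fields.
row=$(printf 'phage1\tdemo\tCDS\t100\t900\t.\t+\t0\tID=cds1')

# Keep field 1 and fields 3 to 5: sequence name, feature type, start, end.
picked=$(printf '%s\n' "$row" | cut -f 1,3-5)
echo "$picked"

# A line without any tab (like the # header lines) is printed unchanged,
# unless we add the -s option to suppress it.
header=$(printf '%s\n' '##gff-version 3' | cut -f 1,3-5)
echo "$header"
```

This is why the header lines still show up when cut is used directly on a GFF file: removing them first with grep -v '^#' gives a cleaner table.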

Other TSV files

GFF and GTF, but also SAM and VCF, are examples of tabular text files: they are all tab-separated. A smaller example will be easier to work with:

cat ~/learn_bash/files/wine.tsv

To sort a table, there is the command sort with these options:

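  • -n to sort numerically (default is alphabetic)
  • -k NUMBER to specify the column to sort by (by default the whole line is compared; -k 3 uses the text from column 3 to the end of the line, -k 3,3 only column 3)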
  • -r for reverse sorting (default: ascending)
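The effect of these options is easiest to see on a tiny table. The sketch below uses four made-up rows (name, country, alcohol content), which are an assumption for this illustration and not the content of wine.tsv. It also sets the tab as delimiter with -t, so that the columns are unambiguous.

```shell
#!/usr/bin/env bash
# Hypothetical 3-column table: name, country, alcohol content.
table=$(printf 'Barolo\tItaly\t14\nRiesling\tGermany\t9\nPort\tPortugal\t20\nRioja\tSpain\t13.5\n')
tab=$(printf '\t')

# Alphabetic sort on column 3: "9" comes last, because "9" > "1" and "2" as text.
alpha=$(printf '%s\n' "$table" | sort -t "$tab" -k 3,3 | cut -f 1 | paste -sd, -)

# Numeric sort on column 3: 9 < 13.5 < 14 < 20.
numeric=$(printf '%s\n' "$table" | sort -t "$tab" -n -k 3,3 | cut -f 1 | paste -sd, -)

# Numeric and reversed: strongest first.
reversed=$(printf '%s\n' "$table" | sort -t "$tab" -n -r -k 3,3 | cut -f 1 | paste -sd, -)

echo "alphabetic: $alpha"
echo "numeric:    $numeric"
echo "reversed:   $reversed"
```

The cut and paste steps at the end of each pipeline are only there to print the names on a single line.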

If we want to sort by the third column of the file:

sort -k 3 ~/learn_bash/files/wine.tsv

Sometimes we need to increase the space used by tabs to have a clearer view:

sort -k 3 /homes/2020/binf/data/people.tsv | less -S -x 20

Sometimes with tabular data we want to extract a set of columns. The command cut is there for us:

  • -f to specify the columns (fields), supports lists (-f 1,4,6) and ranges (-f 1-8)
  • -d to specify the character delimiting the columns. The default is tab, but it can be changed, e.g. -d ",".

Note that sort uses a different option, -t, for the delimiter: by default its columns are separated by white space; type -t$'\t' for tab-delimited files, or -t ',' for comma-separated ones.

# Get Country and Alcohol content:
cut -f 2,3 ~/learn_bash/files/wine.tsv

# and sort by alcohol:
cut -f 2-3 ~/learn_bash/files/wine.tsv  | sort -n -k 2
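The same pipeline works on comma-separated data, as long as both commands are told about the delimiter: -d for cut and -t for sort. A minimal sketch, using three made-up rows rather than the real wine.tsv:

```shell
#!/usr/bin/env bash
# Hypothetical comma-separated table: name, country, alcohol content.
csv=$(printf '%s\n' 'Barolo,Italy,14' 'Riesling,Germany,9' 'Port,Portugal,20')

# Keep columns 2 and 3, then sort numerically on what is now column 2.
result=$(printf '%s\n' "$csv" | cut -d ',' -f 2,3 | sort -t ',' -n -k 2,2)
echo "$result"
```

After cut, the alcohol content is the second column of the stream, which is why sort is asked for -k 2 and not -k 3.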

Try it yourself

➡️ A 15-minute test