MrTomRod edited this page Jul 2, 2024 · 19 revisions

Example output

a) Simple dataset from Scoary 1

How Scoary was run
# download dataset
wget --recursive -nd --no-parent -A csv,tsv https://scoary.bioinformatics.unibe.ch/datasets/scoary-1-tetracycline/
# run Scoary2
scoary2 \
    --genes Gene_presence_absence.csv \
    --gene-info gene-info.tsv \
    --traits Tetracycline_resistance.csv \
    --outdir out \
    --multiple_testing native:0.05 \
    --n-permut 10000
# the argument gene-info is optional

See output here.

b) Large metabolomics dataset (OrthoFinder)

How Scoary was run
# download dataset
wget --recursive -nd --no-parent -A tsv,txt https://scoary.bioinformatics.unibe.ch/datasets/44-propioni/
# run Scoary2
scoary2 \
    --genes N0.tsv \
    --gene-data-type 'gene-list:\t' \
    --gene-info N0_best_names.tsv \
    --traits traits_44_noraw.tsv \
    --trait-data-type 'gaussian:kmeans:\t' \
    --trait-info traits_info_44_noraw.tsv \
    --newicktree SpeciesTree_rooted.txt \
    --isolate-info isolate_info_44.tsv \
    --multiple_testing fdr_bh:0.1 \
    --n-permut 1000 \
    --n-cpus 8 \
    --random-state 42 \
    --outdir out

# After identifying significant traits, consider running Scoary2 again with '--trait-wise-correction True'
# and '--multiple_testing bonferroni:0.999' to see the significant traits in context of the wider dataset.
# This will lead to many more traits in the output, including many false positives, but it will also show 
# traits that may be related to the significant traits.

# The following are optional: gene-info, trait-info, isolate-info

See output of a metabolomics dataset here. Click here to see the output of the same dataset, but with --trait-wise-correction.

We recommend using limit_traits and a low n-permut to determine the optimal Scoary parameters before crunching the full dataset. If the dataset has a particularly strong population structure, also use worst_cutoff to remove traits that merely correlate with the phylogeny. (See manual.)

Output files

summary.tsv

Table with one row per analyzed trait, summarizing the results. Columns:

  • Trait: name of the trait
  • best_fisher_p: uncorrected p-value of Fisher's test for the "best" gene
  • best_fisher_q: multiple testing corrected p-value of Fisher's test for the "best" gene
  • best_empirical_p: p-value of the post-hoc permutation test for the "best" gene
  • best_fq*ep: product of fisher_q and empirical_p for the "best" gene
  • ...: potential additional metadata columns from trait-info.tsv

The "best" gene of a trait is the gene with the lowest fq*ep score for that trait.
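As a minimal sketch of how summary.tsv could be used downstream, the snippet below ranks traits by best_fq*ep, lowest (most significant) first. It assumes the file is tab-separated with the column names listed above; the sample rows are invented for illustration and are not real Scoary2 output, and `rank_traits` is a hypothetical helper, not part of Scoary2.

```python
import csv
import io

# Invented rows that mimic the documented summary.tsv columns (not real output).
SAMPLE_SUMMARY = (
    "Trait\tbest_fisher_p\tbest_fisher_q\tbest_empirical_p\tbest_fq*ep\n"
    "trait_A\t1e-06\t1e-04\t0.001\t1e-07\n"
    "trait_B\t0.01\t0.2\t0.3\t0.06\n"
    "trait_C\t1e-05\t5e-04\t0.01\t5e-06\n"
)


def rank_traits(handle):
    """Return (trait, best_fq*ep) pairs sorted from lowest to highest score."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row["Trait"], float(row["best_fq*ep"])) for row in reader]
    return sorted(rows, key=lambda pair: pair[1])


ranked = rank_traits(io.StringIO(SAMPLE_SUMMARY))
for trait, score in ranked:
    print(f"{trait}\t{score:g}")
```

For a real run, the same function could be called with `open("out/summary.tsv", newline="")`, where `out` is the directory passed to `--outdir`.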

See also: Understanding the p-values

overview_plot.svg

This SVG image is made interactive in overview.html. It contains:

  • Left: Dendrogram of traits.
  • Middle: negative logarithms of best_fisher_q, best_empirical_p and best_fq*ep calculated by Scoary2.
  • Right: names of the traits.
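The transformation behind the middle panel can be sketched as follows. The base of the logarithm is not stated on this page, so base 10 is an assumption here, and `neg_log10` with its `floor` guard is a hypothetical helper rather than Scoary2's own code.

```python
import math


def neg_log10(p, floor=1e-300):
    """Negative base-10 logarithm of p; `floor` avoids taking log(0)."""
    return -math.log10(max(p, floor))


# Invented values for one trait; best_fq*ep is the product of the other two.
scores = {"best_fisher_q": 1e-4, "best_empirical_p": 1e-3, "best_fq*ep": 1e-7}
for name, value in scores.items():
    print(f"{name}: {neg_log10(value):.1f}")
```

Because best_fq*ep is a product, its negative logarithm is the sum of the negative logarithms of best_fisher_q and best_empirical_p.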

overview.html

Makes overview_plot.svg interactive and links each trait to its trait.html.

See section How to use the app

traits (folder)

This folder contains a subfolder for each trait. These subfolders contain the following files:

  • results.tsv: The content of this file is similar to the main output of the original Scoary, with one row per gene. Columns:

    • Gene: Name of the gene
    • Name: Description of the gene from gene-info.tsv (optional)
    • g+t+: Number of isolates that have the gene (g+) and have the trait (t+)
    • g+t-, g-t+, g-t-: Analogous counts for the other three combinations of gene presence and trait status. Together with g+t+, these four numbers constitute the input for Fisher's test.
    • sensitivity: The sensitivity of using this gene as a diagnostic test to determine trait-positivity; more details here
    • specificity: The specificity of using this gene as a diagnostic test to determine trait-negativity; more details here
    • odds_ratio: Odds ratio (quantifies the strength of the association)
    • fisher_p, fisher_q: uncorrected and multiple-testing-corrected p-value of Fisher's test
    • empirical_p: p-value of the post-hoc permutation test
    • fq*ep: product of fisher_q and empirical_p
    • contrasting: The maximum number of pairs that contrast in both gene and trait characters that can be drawn on the phylogenetic tree without intersecting lines
    • supporting, opposing: The maximum number of contrasting pairs that support/oppose the hypothesis
    • best: p-value of picking n supporting pairs out of n contrasting pairs
    • worst: p-value of picking n opposing pairs out of n contrasting pairs
  • coverage_matrix.tsv: Table that indicates which isolates have which genes

  • meta.json: Metadata about how the trait was binarized

  • values.tsv: Table that indicates the original continuous value of each isolate and how it was classified (optional)
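To make the count-based columns of results.tsv concrete, here is a minimal sketch that derives sensitivity, specificity, the odds ratio and a two-sided Fisher's exact p-value from the four counts g+t+, g+t-, g-t+ and g-t-. It uses the textbook definitions and only the Python standard library; Scoary2's own implementation may differ in details such as scaling (fractions versus percentages) or the handling of zero cells. The function names and the example counts are hypothetical.

```python
from math import comb


def contingency_stats(gt, g_only, t_only, neither):
    """Arguments are the counts g+t+, g+t-, g-t+, g-t- (in that order)."""
    # Sensitivity: fraction of trait-positive isolates that carry the gene.
    sensitivity = gt / (gt + t_only) if gt + t_only else float("nan")
    # Specificity: fraction of trait-negative isolates that lack the gene.
    specificity = neither / (neither + g_only) if neither + g_only else float("nan")
    # Sample odds ratio; undefined (nan) for 0/0, infinite for x/0.
    numer, denom = gt * neither, g_only * t_only
    if denom:
        odds_ratio = numer / denom
    else:
        odds_ratio = float("inf") if numer else float("nan")
    return sensitivity, specificity, odds_ratio


def fisher_exact_two_sided(gt, g_only, t_only, neither):
    """Two-sided Fisher's exact test for a 2x2 table with fixed margins."""
    with_gene = gt + g_only
    with_trait = gt + t_only
    n = gt + g_only + t_only + neither

    def table_prob(x):
        # Hypergeometric probability of x isolates having both gene and trait.
        return comb(with_trait, x) * comb(n - with_trait, with_gene - x) / comb(n, with_gene)

    lowest = max(0, with_gene + with_trait - n)
    highest = min(with_gene, with_trait)
    observed = table_prob(gt)
    # Sum the probabilities of all tables at most as likely as the observed one
    # (small relative tolerance for floating-point ties).
    return sum(
        p for p in map(table_prob, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )


# Invented counts: 8 isolates with gene and trait, 1 gene only, 2 trait only, 9 neither.
counts = (8, 1, 2, 9)
print(contingency_stats(*counts))
print(fisher_exact_two_sided(*counts))
```

The fq*ep column is then simply the corrected Fisher p-value (fisher_q) multiplied by empirical_p.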

trait.html

Visualizes the data in a trait's folder and makes it interactive.

Shows a phylogenetic tree of the isolates, the tables results.tsv and coverage_matrix.tsv, a pie chart that shows how the orthogene and the trait intersect in the dataset, and a histogram of the continuous values, colored by whether each isolate has the orthogene and the trait.

See section How to use the app

binarized_traits.tsv

Binary trait matrix. Rows: isolates; columns: traits

isolate_info.tsv

Metadata about each isolate (optional)

app (folder)

Contains configuration, HTML and CSS for overview.html and trait.html

By modifying link-config, the behaviour of trait.html can be changed.

See section How to use the app

logs (folder)

Contains log files