A computational pipeline for cross-species cancer analyses
The FREYA analytic framework calculates genetic diversity within a cancer cohort, extracts progression-related patterns of expression, calls functional variants, identifies intrinsic human molecular tumor subtypes, and compares them to human breast cancer biology. It streamlines future canine mammary tumor (or other tissue/species) analyses and provides a comprehensive suite of tools that encompasses both conventional human analyses and new dog-centric approaches, seamlessly integrating the two.
FREYA is split into two major components: data preparation (DataPrep) and analysis (DataAnalysis). DataPrep prepares sequencing data for analysis, and DataAnalysis runs the statistical analyses from the manuscript. Both are designed to run with user-provided data. The pipeline requires that each patient (e.g., each dog) have at least one sample of each histology: normal, benign, and malignant.
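A minimal sanity check for this cohort requirement might look like the following bash/awk sketch. The tab-separated sample sheet "samples.tsv" and its columns (patient_id, sample_id, histology) are assumptions made for illustration, not a file format FREYA itself defines:

    # Report any patient missing one of the three required histologies.
    # Assumed samples.tsv layout: patient_id <TAB> sample_id <TAB> histology
    awk -F'\t' 'NR > 1 { seen[$1 "\t" $3] = 1; patients[$1] = 1 }
    END {
        n = split("normal benign malignant", req, " ")
        for (p in patients)
            for (i = 1; i <= n; i++)
                if (!((p "\t" req[i]) in seen))
                    printf "patient %s is missing a %s sample\n", p, req[i]
    }' samples.tsv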
Install FREYA

DataPrep
DataPrep is the first sub-pipeline of FREYA. It produces processed expression and mutation-call data from the raw data downloaded from your sequencer. DataPrep runtime depends heavily on the machine used: for example, the dataset used in the CMT manuscript takes approximately one month to process on a single CPU but can be run in less than 24 hours using the DisBatch setup.
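DisBatch parallelizes work by executing each line of a plain-text task file as an independent shell command, which is what makes the per-sample fan-out possible. A hypothetical task file might look like the sketch below; the per-sample wrapper script and its arguments are placeholders for illustration, not the actual interface of the pipeline scripts:

    # tasks.txt -- one line per sample; process_sample.sh is a hypothetical wrapper
    ./process_sample.sh sample_01
    ./process_sample.sh sample_02
    ./process_sample.sh sample_03

    # With DisBatch installed, dispatch the tasks across the available CPUs:
    disBatch tasks.txt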
This pipeline has several dependencies, all of which are captured in the Dockerfile. The central script is "cmwf_csv.sh"; please see our GitHub documentation for more information. Mutations are called using the GATK Best Practices workflow for SNP and indel calling on RNA-seq data.
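For orientation, the sketch below walks through the core steps of that workflow using real GATK4 tool names; the file names, reference genome, and filter thresholds are placeholders (base quality recalibration is omitted for brevity, and FREYA's exact parameters are set in "cmwf_csv.sh"):

    # Mark PCR/optical duplicates in the aligned RNA-seq reads
    gatk MarkDuplicates -I aligned.bam -O dedup.bam -M dup_metrics.txt
    # Split reads spanning splice junctions into exon segments
    gatk SplitNCigarReads -R genome.fasta -I dedup.bam -O split.bam
    # Call SNPs and indels
    gatk HaplotypeCaller -R genome.fasta -I split.bam -O raw.vcf.gz \
        --dont-use-soft-clipped-bases \
        --standard-min-confidence-threshold-for-calling 20
    # Apply the hard filters recommended for RNA-seq calls
    gatk VariantFiltration -R genome.fasta -V raw.vcf.gz -O filtered.vcf.gz \
        --cluster-window-size 35 --cluster-size 3 \
        --filter-name FS --filter-expression "FS > 30.0" \
        --filter-name QD --filter-expression "QD < 2.0"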
DataAnalysis
This second sub-pipeline depends on processed genomic and phenotype data (created either by you or by FREYA DataPrep). If you used DataPrep, be sure to first run the prep_data.R script to convert gene names and related identifiers. If you are providing pre-processed data, you may be able to skip this step (see below).
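The conversion is a single script invocation; the call below is a sketch, and the script's required inputs and outputs are described in the repository documentation:

    # Reformat DataPrep output (gene names, etc.) for DataAnalysis;
    # see the FREYA GitHub documentation for the expected arguments.
    Rscript prep_data.R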
FREYA DataAnalysis can be run from your browser using Docker and the Jupyter notebooks, or on the command line using the provided Makefile. Update the variables in the Jupyter notebooks (following the order in Index.ipynb) or in the Makefile to reflect your data files. We provide example simulated data in the synthetic_data folder. If you want to view or run the pipeline without using your own data, you can do so by clicking Launch Binder (also in the GitHub repository).
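A typical launch looks like the sketch below; the image name and mount point are placeholders, so substitute the ones given in the FREYA repository:

    # Browser route: start the Jupyter container (image name is hypothetical)
    docker run --rm -p 8888:8888 -v "$(pwd)":/data freya/dataanalysis
    # ...then open the printed URL and work through the notebooks in the
    # order given by Index.ipynb.

    # Command-line route: run the default target of the provided Makefile
    make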