invar / INVAR_steps_detail

Data processing - detail

  • bin/step1/INVAR1.sh - Runs mpileup, splits the output into smaller chunks, then annotates the trinucleotide context of each chunk. Requires a BED file of patient-specific loci, which is used for mpileup, and a txt file listing the paths to the files of interest, one path per line (e.g. input/file1.bam on the first line, input/file2.bam on the second, etc.). This script generates multiple smaller jobs because annotation is performed in parallel.

    • After the first script is done, there should be ~5 TSV files per sample. Feel free to cat them to check that their contents are correct :)
  • bin/step1/INVAR2.sh - Concatenates and gzips the output from INVAR1. This script has no requirements; run it as soon as all INVAR1.sh jobs have finished.

  • bin/step1/INVAR3.sh - Runs an R script to blacklist loci, filter data and split into on and off-target. Please read through the settings used in R/INVAR3.R before running INVAR3.sh. You may need to edit the locus filtering settings before running this.
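
As a minimal sketch of preparing the txt input for INVAR1.sh, the function below writes one BAM path per line and stops if the list is empty or a file is unreadable. The function name make_bam_list, the input directory and the file name bam_list.txt are assumptions for illustration only; how the list is handed to INVAR1.sh depends on the script's own settings and is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of INVAR): write every .bam under a directory
# to a list file, one path per line.
make_bam_list() {
    local bam_dir="$1" list_file="$2"
    # Sort so the list is reproducible between runs.
    find "$bam_dir" -type f -name '*.bam' | sort > "$list_file"
    # An empty list almost always means the wrong directory was given.
    if [ ! -s "$list_file" ]; then
        echo "no BAM files found under $bam_dir" >&2
        return 1
    fi
    # Confirm every listed path is readable before submitting any jobs.
    while IFS= read -r bam; do
        [ -r "$bam" ] || { echo "unreadable: $bam" >&2; return 1; }
    done < "$list_file"
}

# Example with assumed paths:
#   make_bam_list input bam_list.txt
```

Checking the list up front is cheap compared with discovering a bad path after the parallel jobs have been submitted.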

Fragment size profile - detail

  • bin/step2/INVAR_SIZE_ANN1.sh - For each of the loci with non-ref bases, obtain a BAM of 1bp at that locus from that sample. This can only be run after INVAR3.sh has completed. Written by Dineika Chandrananda.

  • bin/step2/INVAR_SIZE_ANN2.sh - Run after INVAR_SIZE_ANN1.sh has completed all of its small jobs and INVAR4.sh has been run. This script annotates the output of INVAR4.sh with the size information generated by INVAR_SIZE_ANN1.sh.

Outlier suppression

  • step3/OUTLIER_SUPRESSION.sh - Run after size annotation has finished. This script takes the output from INVAR_SIZE_ANN2.sh and applies patient-specific outlier suppression to it. Loci with significantly more signal than the median signal are removed from the analysis.

  • bin/step2/INVAR_SIZE_CHARACTERISATION.sh - Run after OUTLIER_SUPRESSION.sh has completed. This script builds a data frame that is later used in GLRT for size weighting.

    • If your cohort is small or you have only a few mutant fragments, we advise using a size data frame from another cohort for size weighting.
  • bin/utils/ANNOTATE_WITH_OS.sh - Generates a data frame that can be used afterwards for in-depth analysis of targeted loci. It annotates all loci for all patients with information on plasma sequencing, applied filters, and the tumour AF of the original reference tumour.

GLRT

  • bin/step4/GLRT.sh
    • GLRT uses the output from INVAR_SIZE_ANN3.sh (size_characterisation.rds in the output_R folder) for size weighting. Ensure that your cohort has enough mutant fragments for weighting. Speak to us for more details on this.
    • GLRT determines the ctDNA content of all given samples.
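
The "run after" statements on this page can be collected into one ordering. The sketch below feeds them to tsort as "prerequisite dependent" pairs; it is only an illustration, not a submission script. The function names invar_run_order and runs_before are hypothetical. Two pairs are assumptions rather than explicit statements: INVAR2.sh before INVAR3.sh is inferred from the listing order, and INVAR4.sh is only constrained to precede INVAR_SIZE_ANN2.sh because its own prerequisites are not given here. FINALISE.sh is the final step described in the next section.

```shell
#!/usr/bin/env bash
# Print the scripts in an order that satisfies the "run after" statements.
# Each input line is "prerequisite dependent"; tsort does the topological sort.
invar_run_order() {
    tsort <<'EOF'
INVAR1.sh INVAR2.sh
INVAR2.sh INVAR3.sh
INVAR3.sh INVAR_SIZE_ANN1.sh
INVAR_SIZE_ANN1.sh INVAR_SIZE_ANN2.sh
INVAR4.sh INVAR_SIZE_ANN2.sh
INVAR_SIZE_ANN2.sh OUTLIER_SUPRESSION.sh
OUTLIER_SUPRESSION.sh INVAR_SIZE_CHARACTERISATION.sh
INVAR_SIZE_CHARACTERISATION.sh GLRT.sh
GLRT.sh FINALISE.sh
EOF
}

# Hypothetical helper: succeed if script $1 comes before script $2.
runs_before() {
    local order first second
    order="$(invar_run_order)"
    # -x matches whole lines, -F treats the name literally, -n adds line numbers.
    first="$(printf '%s\n' "$order" | grep -nxF -- "$1" | cut -d: -f1)"
    second="$(printf '%s\n' "$order" | grep -nxF -- "$2" | cut -d: -f1)"
    [ -n "$first" ] && [ -n "$second" ] && [ "$first" -lt "$second" ]
}

# Example: runs_before INVAR3.sh INVAR_SIZE_ANN1.sh && echo "order ok"
```

Because INVAR4.sh has no stated prerequisite in this list, tsort may place it anywhere before INVAR_SIZE_ANN2.sh; check the script itself before relying on that position.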

Final step

The final step of the pipeline is to run the FINALISE script (FINALISE.sh). This script will combine the most important generated files and place them all in a folder called output_final. Output files:

  • [PREFIX]error_rates.Rdata
    • files containing error rates
    • this is split into COSMIC loci and non-COSMIC loci
  • [PREFIX].combined.os_[outlier_suppression_threshold_settings].rds
    • this is currently the old version before applying the repolishing... Will have to update this!
    • overall annotated output file containing all relevant information
    • each locus is present for each patient, annotated with error rate, tumour AF, etc.
    • also contains additional filters
  • [PREFIX]size_characterisation.rds
    • two files (normal and raw version)
    • can be used to plot size profiles
  • [PREFIX]on_target.Rdata
    • on target loci before annotating with filtering criteria and patient information
  • [PREFIX]INVAR_scores.txt
    • INVAR scores and ctDNA levels of patient and control samples
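
A quick way to confirm that FINALISE.sh produced everything listed above is to look for each file name pattern in output_final. The sketch below is an assumption-laden illustration: missing_outputs is a hypothetical helper, and the [PREFIX] and threshold-settings parts of the names are replaced by wildcards because their actual values depend on your run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of INVAR): print every expected output
# pattern that has no matching file in the given folder.
missing_outputs() {
    local dir="$1" pattern match found
    for pattern in '*error_rates.Rdata' '*.combined.os_*.rds' \
                   '*size_characterisation.rds' '*on_target.Rdata' \
                   '*INVAR_scores.txt'; do
        found=0
        # $pattern is left unquoted on purpose so the shell expands the glob.
        for match in "$dir"/$pattern; do
            # An unmatched glob stays literal, so test that the file exists.
            [ -e "$match" ] && { found=1; break; }
        done
        [ "$found" -eq 1 ] || printf '%s\n' "$pattern"
    done
}

# Example: missing_outputs output_final   # prints nothing if all are present
```

This only checks that the files exist; it does not verify their contents, for which the .rds and .Rdata files need to be loaded in R.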
