Whole Exome Pipeline
The vendor tells us that the documentation on exome sequencing can be found at:
https://sequencing.roche.com/global/en/products/group/kapa-hyperexome.html
From there we can get the white paper "How to evaluate Roche target enrichment data for germline variant research".
As for where to get the references, it says,
Ideally, you should develop a workflow appropriate for your experimental data using benchmark/control samples
such as HapMap samples obtained from Coriell. Known variants may be downloaded from the 1000 Genomes
Project (http://www.internationalgenome.org/) or in special collections such as the GATK resource bundle
(https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0) and Genome in a
Bottle (http://jimb.stanford.edu/giab/ and https://github.com/genome-in-a-bottle).
And also,
The requirement for use of an indexed reference genome in a subsequent step is designated by “ref.fa
{indexed}” in the Input(s) section. An indexed reference genome consists of the genome FASTA file and all index files present in
the same directory.
Note that there can be multiple versions of the same reference genome available, and selecting the appropriate
genome for your analyses is important. Ensure the reference genome build used to generate panel design files (such
as primary target or capture target bed files) and used in the analysis pipeline are the same. Chromosome names
should match in the two files, otherwise some 3rd party analysis tools will generate errors and fail. If the version of
the reference genome contains extra chromosomes or contigs (i.e., ALT contigs in the human reference), consider if
these are useful or necessary for your particular analysis. Some regions, such as the pseudo-autosomal regions in the
human reference genome, can be represented in different ways that may affect downstream analyses including
variant analysis.
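That chromosome-name check is easy to do up front. A minimal sketch, assuming the KAPA HyperExome capture BED and the .fai produced by samtools faidx sit in the working directory and that the BED has no header/track lines (file names simply mirror the ones used further down):

# Chromosome names used in the capture design
cut -f1 KAPA_HyperExome_hg38_capture_targets.bed | sort -u > bed_chroms.txt
# Chromosome/contig names present in the reference index
cut -f1 Homo_sapiens_assembly38.fasta.fai | sort -u > ref_chroms.txt
# Anything printed here is a design chromosome that the reference does not know about
comm -23 bed_chroms.txt ref_chroms.txt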
Basic Recommendations for Calling Variants and Analyzing Depth of Coverage
Exome variant analysis recommendations
- Trim adapter sequences using cutadapt (a minimal sketch of this step follows this list)
- Input files: FASTA, FASTQ
- Output files: same as input
- Align reads to a human genome build using BWA, post-process data using SAMtools, remove duplicate reads and convert to a BAM file using Picard
- Input files: FASTA, FASTQ
- Output files: BAM
- Call SNPs using GATK HaplotypeCaller to produce a VCF file and call germline SNPs and indels
- Input files: BAM
- Output: VCF
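The adapter-trimming step from the list above is not wired into the single-subject pipeline further down yet. A minimal sketch of it with cutadapt in paired-end mode, assuming the common Illumina TruSeq adapter prefix and made-up input/output names:

# Trim read-through adapter from both mates; adjust the adapter sequence to the kit actually used
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
    -o sample_trimmed_1.fq.gz -p sample_trimmed_2.fq.gz \
    sample_1.fq.gz sample_2.fq.gz > sample_cutadapt_report.txt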
Exome coverage analysis recommendations
- Use an existing exome BED file or use Bedtools to create intervals in the genomic regions you’d like to analyze
- Compute read depth of your genomic intervals using GATK DepthOfCoverage
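The coverage step is not in the pipeline below either. A minimal sketch of what it could look like, reusing the gatk singularity wrapper, reference and interval variables defined in the single-subject pipeline further down; the -R/-I/-L/-O flags are the basic DepthOfCoverage arguments and should be checked against the GATK4 docs before relying on them:

# Per-interval read depth over the union of capture and primary targets (output prefix assumed)
${gatk} DepthOfCoverage -R ${rdir}/${hgref} -I ${odir}/${ts}_recal.bam \
    -L ${rdir}/${union} -O ${odir}/${ts}_coverage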
Basically, what I have done is read everything I could get hold of and try to put together a pipeline with some logic to it.
References
In this case we have chosen GRCh38 as the reference, basically because everything there is a bit better organized. And I'm going to download everything from the genomics-public-data bucket. What I need is basically,
https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz
https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz
https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz
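A minimal sketch of fetching them, assuming the URLs above are saved one per line in a file called urls.txt (a name made up here); gsutil against the same bucket would work just as well:

# Download each reference file, resuming partial downloads if interrupted
xargs -n1 wget -c < urls.txt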
and then I need to create my dictionary,
[osotolongo@brick03 hg38]$ cat index_ref.sh
#!/bin/sh
ref=$1
shift
/nas/usr/local/bin/bwa index -a bwtsw ${ref}
/nas/software/samtools/bin/samtools faidx ${ref}
singularity run --cleanenv -B /nas:/nas -B /ruby:/ruby -B /the_dysk:/the_dysk /nas/usr/local/opt/gatk4.simg gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G" CreateSequenceDictionary --REFERENCE ${ref}
[osotolongo@brick03 hg38]$ ./index_ref.sh Homo_sapiens_assembly38.fasta
... blah blah blah ...
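If that finished cleanly, the FASTA should now be accompanied by the BWA index files, the faidx index and the sequence dictionary. A quick sanity check, with file names assumed from the default outputs of bwa index, samtools faidx and CreateSequenceDictionary:

# All of these should exist next to the FASTA before anything downstream is run
ls -l Homo_sapiens_assembly38.dict \
      Homo_sapiens_assembly38.fasta.fai \
      Homo_sapiens_assembly38.fasta.{amb,ann,bwt,pac,sa}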
and now I can convert the original target files into interval lists,
[osotolongo@brick03 hg38]$ cat make_bait.sh
#!/bin/sh
gatk='singularity run --cleanenv -B /nas:/nas -B /ruby:/ruby -B /the_dysk:/the_dysk /nas/usr/local/opt/gatk4.simg gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G"'
ref='/ruby/WES_reference/hg38/Homo_sapiens_assembly38'
${gatk} BedToIntervalList --INPUT KAPA_HyperExome_hg38_capture_targets.bed --OUTPUT KAPA_HyperExome_hg38_bait.interval_list --SEQUENCE_DICTIONARY ${ref}.dict
${gatk} BedToIntervalList --INPUT KAPA_HyperExome_hg38_primary_targets.bed --OUTPUT KAPA_HyperExome_hg38_target.interval_list --SEQUENCE_DICTIONARY ${ref}.dict
cat KAPA_HyperExome_hg38_capture_targets.bed KAPA_HyperExome_hg38_primary_targets.bed | sort -k1,1 -k2,2n | /nas/software/bedtools2/bin/bedtools merge -i - > KAPA_HyperExome_hg38_capture_primary_targets_union.bed
${gatk} BedToIntervalList --INPUT KAPA_HyperExome_hg38_capture_primary_targets_union.bed --OUTPUT KAPA_HyperExome_hg38_union.interval_list --SEQUENCE_DICTIONARY ${ref}.dict
[osotolongo@brick03 hg38]$ ./make_bait.sh
I still need to index the reference VCFs. Let's see,
[osotolongo@brick03 hg38]$ cat toindex.txt
Homo_sapiens_assembly38.known_indels.vcf.gz
Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
1000G_phase1.snps.high_confidence.hg38.vcf.gz
Homo_sapiens_assembly38.dbsnp138.vcf
[osotolongo@brick03 hg38]$ alias gatk='singularity run --cleanenv -B /nas:/nas -B /ruby:/ruby -B /the_dysk:/the_dysk /nas/usr/local/opt/gatk4.simg gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G"'
[osotolongo@brick03 hg38]$ while read -r vcf; do gatk IndexFeatureFile -I ${vcf}; done < toindex.txt
single guy
For each individual I have two FASTQ files
[osotolongo@brick03 Traspaso-datos-11-2022]$ ls fqdata/22D28227616/
22D28227616_1.fq.gz  22D28227616_2.fq.gz  22D28227616_md5.txt
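Before aligning anything it is worth checking the transfer against the delivered checksums. A minimal sketch, assuming the *_md5.txt file follows the standard md5sum format (hash, whitespace, file name):

# Verify the checksums for one sample
cd fqdata/22D28227616 && md5sum -c 22D28227616_md5.txt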
I'm going to follow vifehe's recommendations (she is the one who knows about this) and try to put together a GATK pipeline compatible with hg38. After reading the Roche pamphlet, the WashU pipeline, that Nature paper they sent me, and half the GATK site, the pipeline for a single individual has ended up like this,
# This is the example pipeline for ONE single study subject. Every operation that can be parallelized
# should be parallelized, and then the subjects themselves have to be run in parallel.
ddir='/ruby/Traspaso-datos-11-2022/fqdata'
rdir='/ruby/WES_reference/hg38'
odir='/ruby/WES_output/hg38'
baits='KAPA_HyperExome_hg38_bait.interval_list'
targets='KAPA_HyperExome_hg38_target.interval_list'
union='KAPA_HyperExome_hg38_union.interval_list'
tmp_dir=$(mktemp -d ${TMPDIR}/wes.XXXXXX)
ts='22D28227691'
hgref='Homo_sapiens_assembly38.fasta'
known1='Homo_sapiens_assembly38.known_indels.vcf.gz'
known2='Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
known3='Homo_sapiens_assembly38.dbsnp138.vcf'
known4='1000G_phase1.snps.high_confidence.hg38.vcf.gz'
gatk='singularity run --cleanenv -B /nas:/nas -B /ruby:/ruby -B /the_dysk:/the_dysk /nas/usr/local/opt/gatk4.simg gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G"'
# QC of the FASTQ files. Leaves a few HTML files with plots. They can be parsed to extract the info, but there is not much there.
# Almost everything relevant seems to live in the images, although one could always grep for the FAILs and look only at those.
mkdir -p ${odir}/qc
/nas/usr/local/bin/fastqc -o ${odir}/qc ${ddir}/${ts}/*.fq.gz
## Align to SAM and merge the pieces into a single file
/nas/usr/local/bin/bwa mem -R "@RG\tID:1\tDS:KAPA_TE\tPL:ILLUMINA\tLB:${ts}\tSM:${ts}" -t 4 -M ${rdir}/${hgref} ${ddir}/${ts}/22D28227691_1.fq.gz > ${odir}/${ts}_1.sam
/nas/usr/local/bin/bwa mem -R "@RG\tID:1\tDS:KAPA_TE\tPL:ILLUMINA\tLB:${ts}\tSM:${ts}" -t 4 -M ${rdir}/${hgref} ${ddir}/${ts}/22D28227691_2.fq.gz > ${odir}/${ts}_2.sam
${gatk} MergeSamFiles -I ${odir}/${ts}_1.sam -I ${odir}/${ts}_2.sam -O ${odir}/${ts}.sam
## Sort
${gatk} SortSam -I ${odir}/${ts}.sam -O ${odir}/${ts}_sorted.bam --SORT_ORDER coordinate
/nas/software/samtools/bin/samtools index ${odir}/${ts}_sorted.bam
## QC of the BAM
/nas/usr/local/bin/verifyBamID --vcf ${rdir}/hapmap_3.3.hg38.vcf.gz --bam ${odir}/${ts}_sorted.bam --chip-none --maxDepth 1000 --precise --verbose --ignoreRG --out ${odir}/${ts}_verifybam |& grep -v "Skipping marker"
## Validate and produce the final alignment file for GATK
${gatk} ValidateSamFile --IGNORE MATE_NOT_FOUND --IGNORE MISSING_READ_GROUP --IGNORE RECORD_MISSING_READ_GROUP -I ${odir}/${ts}_sorted.bam
${gatk} MarkDuplicates -I ${odir}/${ts}_sorted.bam -O ${odir}/${ts}_rmdups.bam --METRICS_FILE ${odir}/${ts}_metrics.txt --QUIET TRUE --MAX_RECORDS_IN_RAM 2000000 --ASSUME_SORTED TRUE --CREATE_INDEX TRUE
${gatk} CollectHsMetrics -BI ${rdir}/${baits} -TI ${rdir}/${targets} -I ${odir}/${ts}_rmdups.bam -O ${odir}/${ts}_hsmetrics.txt
${gatk} BaseRecalibrator -I ${odir}/${ts}_rmdups.bam -R ${rdir}/${hgref} --known-sites ${rdir}/${known1} --known-sites ${rdir}/${known2} --known-sites ${rdir}/${known3} -O ${odir}/${ts}_recal_data.table1
${gatk} ApplyBQSR -R ${rdir}/${hgref} -I ${odir}/${ts}_rmdups.bam -bqsr-recal-file ${odir}/${ts}_recal_data.table1 -O ${odir}/${ts}_recal.bam
${gatk} AnalyzeCovariates -bqsr ${odir}/${ts}_recal_data.table1 --plots ${odir}/${ts}_AnalyzeCovariates.pdf
${gatk} BaseRecalibrator -I ${odir}/${ts}_recal.bam -R ${rdir}/${hgref} --known-sites ${rdir}/${known1} --known-sites ${rdir}/${known2} --known-sites ${rdir}/${known3} -O ${odir}/${ts}_recal_data.table2
${gatk} AnalyzeCovariates -before ${odir}/${ts}_recal_data.table1 -after ${odir}/${ts}_recal_data.table2 -plots ${odir}/${ts}_before-after-plots.pdf
${gatk} CollectHsMetrics -BI ${rdir}/${baits} -TI ${rdir}/${targets} -I ${odir}/${ts}_recal.bam -O ${odir}/${ts}_recal_hsmetrics.txt
${gatk} HaplotypeCaller -R ${rdir}/${hgref} -I ${odir}/${ts}_recal.bam -ERC GVCF --dbsnp ${rdir}/${known3} -O ${odir}/${ts}_raw.snps.indels.g.vcf.gz
${gatk} VariantEval -R ${rdir}/${hgref} -L ${rdir}/${union} -D ${rdir}/${known4} -O ${odir}/${ts}_eval.gatkreport --eval ${odir}/${ts}_raw.snps.indels.g.vcf.gz
This leaves an output like this,
[osotolongo@brick03 wes]$ tree -t /ruby/WES_output/hg38/
/ruby/WES_output/hg38/
├── qc
│   ├── 22D28227691_1_fastqc.html
│   ├── 22D28227691_1_fastqc.zip
│   ├── 22D28227691_2_fastqc.html
│   └── 22D28227691_2_fastqc.zip
├── 22D28227691_1.sam
├── 22D28227691_2.sam
├── 22D28227691.sam
├── 22D28227691_sorted.bam
├── 22D28227691_sorted.bam.bai
├── 22D28227691_verifybam.depthSM
├── 22D28227691_verifybam.log
├── 22D28227691_verifybam.selfSM
├── 22D28227691_metrics.txt
├── 22D28227691_rmdups.bai
├── 22D28227691_rmdups.bam
├── 22D28227691_hsmetrics.txt
├── 22D28227691_recal_data.table1
├── 22D28227691_recal.bai
├── 22D28227691_recal.bam
├── 22D28227691_AnalyzeCovariates.pdf
├── 22D28227691_recal_data.table2
├── 22D28227691_before-after-plots.pdf
├── 22D28227691_recal_hsmetrics.txt
├── 22D28227691_raw.snps.indels.g.vcf
├── 22D28227691_raw.snps.indels.g.vcf.gz.tbi
├── 22D28227691_eval.gatkreport
└── 22D28227691_raw.snps.indels.g.ann.vcf

1 directory, 27 files
which will have to be parsed afterwards. But first, let's see how all of this has to be parallelized.
Bulk shit
The thing is, I have to do this for about 300 subjects and then merge all the results into a single VCF.
Basically, what I'm going to do is split everything up by chromosome. For example, for chromosome 1 I'll do,
gatk GenomicsDBImport --genomicsdb-workspace-path ${output}/final/wes.chr1.db --batch-size 50 --sample-name-map ${output}/subjects.list -L chr1 --reader-threads 8
gatk GenotypeGVCFs -R ${reference}/hg38/Homo_sapiens_assembly38.fasta -V gendb:///${output}/final/wes.chr1.db -O ${output}/final/wes_chr1.snps.indels.g.vcf.gz
gunzip -c ${output}/final/wes_chr1.snps.indels.g.vcf.gz | java -Xmx8g -jar /nas/software/snpEff/snpEff.jar ann -chr 1 -s snpEff_summary_chr1.html hg38 - | gzip -c > ${output}/final/wes_chr1.snps.indels.g.ann.vcf.gz
which is basically,
- I generate a DB
- I use the DB to create one huge VCF
- I annotate it with snpEff
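Repeated over every chromosome and stitched back together, it would look roughly like the sketch below. The contig list, the ${chr#chr} trick for snpEff and the final GatherVcfs step are assumptions on top of the chr1 example above (MergeVcfs or bcftools concat would do the last step just as well), not a tested script:

# Same three steps as above, for every autosome plus X and Y
for chr in chr{1..22} chrX chrY; do
  gatk GenomicsDBImport --genomicsdb-workspace-path ${output}/final/wes.${chr}.db --batch-size 50 --sample-name-map ${output}/subjects.list -L ${chr} --reader-threads 8
  gatk GenotypeGVCFs -R ${reference}/hg38/Homo_sapiens_assembly38.fasta -V gendb:///${output}/final/wes.${chr}.db -O ${output}/final/wes_${chr}.snps.indels.g.vcf.gz
  gunzip -c ${output}/final/wes_${chr}.snps.indels.g.vcf.gz | java -Xmx8g -jar /nas/software/snpEff/snpEff.jar ann -chr ${chr#chr} -s snpEff_summary_${chr}.html hg38 - | gzip -c > ${output}/final/wes_${chr}.snps.indels.g.ann.vcf.gz
done
# Glue the per-chromosome files back into a single VCF (inputs must be in reference order)
gatk GatherVcfs $(for chr in chr{1..22} chrX chrY; do echo "-I ${output}/final/wes_${chr}.snps.indels.g.ann.vcf.gz"; done) -O ${output}/final/wes_all.snps.indels.g.ann.vcf.gz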
Note: the first attempt at launching this with a 24-hour limit failed. The second attempt is running with a 72h limit.