genetica:pywgs [2020/10/10 12:44] – [Parsing] osotolongo | genetica:pywgs [2020/10/27 16:12] (current) – [Programatic tree] osotolongo |
---|
===== tl;dr ===== | ===== tl;dr ===== |
<code> | <code> |
./wgs.py -o <output_dir> [-cut <list.txt>] [-g] <input_dir> | ./wgs.py -o <output_dir> [-c <list.txt>] [-g] -s <input_dir> |
| </code> |
| or |
| <code> |
| ./wgs.py --output=<output_dir> [--cut=<list.txt>] [--debug] --source=<input_dir> |
</code> | </code> |
Options: | Options: |
* <input_dir> : (Mandatory) Directory containing all the sequences. The script will look for the subjects and their files inside this directory. | * -s <input_dir> : (Mandatory) Directory containing all the sequences. The script will look for the subjects and their files inside this directory. |
* -o <output_dir> : (Optional) Directory where the results will be written. If omitted, the results are written to the directory from which the script is launched. | * -o <output_dir> : (Optional) Directory where the results will be written. If omitted, the results are written to the directory from which the script is launched. |
* -cut <list.txt> : (Optional) Tells the script to analyze ONLY the subjects included in the supplied file <list.txt>. This file must be a simple list of the subjects to analyze. | * -c <list.txt> : (Optional) Tells the script to analyze ONLY the subjects included in the supplied file <list.txt>. This file must be a simple list of the subjects to analyze. |
* -g : Tells the script not to delete the temporary files. By default they are deleted unless this switch is given. | * -g : Tells the script not to delete the temporary files. By default they are deleted unless this switch is given. |
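The filtering done with the cut list can be sketched as follows. This is a minimal illustration only, not part of wgs.py (the helper name filter_subjects is made up here); it assumes each subject is a first-level directory under <input_dir>, which is how the script discovers subjects:

```python
import os

def filter_subjects(src_dir, cut_file=None):
    # Subjects are the first-level directories under src_dir
    subjects = next(os.walk(src_dir))[1]
    if cut_file and os.path.isfile(cut_file):
        with open(cut_file) as f:
            wanted = set(f.read().splitlines())
        # Keep ONLY the subjects listed in the cut file
        subjects = [s for s in subjects if s in wanted]
    return sorted(subjects)
```

With a file like only.txt containing one subject name per line, the run is restricted to exactly those subjects.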
| |
| === Example === |
| <code> |
| [osotolongo@brick03 wgs]$ bin/wgs.py -o /home/osotolongo/wgs/temp -c only.txt -s /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE/ |
| Submitted batch job 1708 |
| Submitted batch job 1709 |
| Submitted batch job 1710 |
| Submitted batch job 1711 |
| Submitted batch job 1715 |
| Submitted batch job 1716 |
| Submitted batch job 1717 |
| Submitted batch job 1718 |
| Submitted batch job 1722 |
| Submitted batch job 1723 |
| Submitted batch job 1724 |
| Submitted batch job 1725 |
| Submitted batch job 1729 |
| [osotolongo@brick03 wgs]$ squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 1712 fast sam_seq- osotolon PD 0:00 1 (Dependency) |
| 1713 fast verify_s osotolon PD 0:00 1 (Dependency) |
| 1714 fast validate osotolon PD 0:00 1 (Dependency) |
| 1719 fast sam_seq- osotolon PD 0:00 1 (Dependency) |
| 1720 fast verify_s osotolon PD 0:00 1 (Dependency) |
| 1721 fast validate osotolon PD 0:00 1 (Dependency) |
| 1726 fast sam_seq- osotolon PD 0:00 1 (Dependency) |
| 1727 fast verify_s osotolon PD 0:00 1 (Dependency) |
| 1728 fast validate osotolon PD 0:00 1 (Dependency) |
| 1729 fast wgs_end osotolon PD 0:00 1 (Dependency) |
| 1708 fast sam_seq- osotolon R 0:10 1 brick01 |
| 1709 fast sam_seq- osotolon R 0:10 1 brick01 |
| 1710 fast sam_seq- osotolon R 0:10 1 brick01 |
| 1711 fast sam_seq- osotolon R 0:10 1 brick01 |
| 1715 fast sam_seq- osotolon R 0:10 1 brick01 |
| 1716 fast sam_seq- osotolon R 0:07 1 brick01 |
| 1717 fast sam_seq- osotolon R 0:07 1 brick01 |
| 1718 fast sam_seq- osotolon R 0:07 1 brick01 |
| 1722 fast sam_seq- osotolon R 0:07 1 brick02 |
| 1723 fast sam_seq- osotolon R 0:07 1 brick02 |
| 1724 fast sam_seq- osotolon R 0:07 1 brick02 |
| 1725 fast sam_seq- osotolon R 0:07 1 brick02 |
| |
| </code> |
===== Pipeline ===== | ===== Pipeline ===== |
| |
++++ Almost done: now the first 4 processes can be defined for each saved subject,| | ++++ Almost done: now the first 4 processes can be defined for each saved subject,| |
| |
<code perl> | <code python> |
for (my $i=0; $i<4; $i++){ | for i in range(0,4): |
my $fqid = $finfo{$pollo}{'fq_list'}[2*$i]; | fqid = fq_list[2*i] |
my $orderfile = $outdir.'/bwa_'.$pollo.'_'.$i.'.sh'; | orderfile = outdir+'/bwa_'+pollo+'_'+str(i)+'.sh' |
open ORD, ">$orderfile" or die "Couldnt create file"; | ord_content = '#!/bin/bash\n' |
print ORD '#!/bin/bash'."\n"; | ord_content += '#SBATCH -J sam_'+pollo+'\n' |
print ORD '#SBATCH -J sam_'.$pollo."\n"; | ord_content += '#SBATCH --time=24:0:0\n' |
print ORD '#SBATCH --time=24:0:0'."\n"; | ord_content += '#SBATCH -o '+outdir+'/bwa_'+pollo+'-%j'+'\n' |
print ORD '#SBATCH -o '.$outdir.'/bwa_'.$pollo.'-%j'."\n"; | ord_content += '#SBATCH -c 8\n' |
print ORD '#SBATCH -c 8'."\n"; | ord_content += '#SBATCH --mem-per-cpu=4G\n' |
print ORD '#SBATCH --mem-per-cpu=4G'."\n"; | ord_content += '#SBATCH -p fast\n' |
print ORD '#SBATCH -p fast'."\n"; | ord_content += '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT\n' |
print ORD '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT'."\n"; #no quieres que te mande email de todo | ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | ord_content += bwa+' mem -t 4 -R "@RG\\tID:'+fq_name+'_'+fqid+'\\tSM:'+pname+'\\tPL:BGI\\tPI:380" -M '+ref_dir+'/Homo_sapiens_assembly38 '+src_dir+'/'+pollo+'/'+fq_name+'_'+fqid+'_1.fq.gz '+src_dir+'/'+pollo+'/'+fq_name+'_'+fqid+'_2.fq.gz > '+tmpdir+'/'+pname+'_'+str(i)+'.sam\n' |
print ORD $bwa.' mem -t 4 -R "@RG\tID:'.$finfo{$pollo}{'fq_name'}.'_'.$fqid.'\tSM:'.$pname.'\tPL:BGI\tPI:380" -M '.$ref_dir.'/Homo_sapiens_assembly38 '.$finfo{$pollo}{'path'}.'/'.$finfo{$pollo}{'fq_name'}.'_'.$fqid.'_1.fq.gz '.$finfo{$pollo}{'path'}.'/'.$finfo{$pollo}{'fq_name'}.'_'.$fqid.'_2.fq.gz > '.$tmpdir.'/'.$pname.'_'.$i.'.sam'; | osf = open(orderfile, 'w') |
print ORD "\n"; | osf.write(ord_content) |
close ORD; | osf.close() |
$gsconv.= 'I='.$tmpdir.'/'.$pname.'_'.$i.'.sam '; | gsconv+= 'I='+tmpdir+'/'+pname+'_'+str(i)+'.sam ' |
system("sbatch $orderfile"); | os.system('sbatch '+orderfile) |
} | |
</code> | </code> |
++++ | ++++ |
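All the order files in this pipeline follow the same write-then-submit pattern. As a side note on the design, building the sbatch command line could be isolated in a small helper; this is a hypothetical refactor for illustration only (sbatch_cmd is not part of wgs.py):

```python
def sbatch_cmd(orderfile, depends=None):
    # --parsable makes sbatch print only the job id, so no output parsing is needed
    cmd = ['sbatch', '--parsable']
    if depends:
        # Chain execution on successful completion of the given jobs
        cmd.append('--dependency=afterok:' + ','.join(map(str, depends)))
    cmd.append(orderfile)
    return ' '.join(cmd)
```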
++++ Number 5 is made to depend on these 4,| | ++++ Number 5 is made to depend on these 4,| |
| |
<code perl> | <code python> |
my $orderfile = $outdir.'/merge_'.$pollo.'.sh'; | orderfile = outdir+'/merge_'+pollo+'.sh' |
open ORD, ">$orderfile"; | ord_content = '#!/bin/bash\n' |
print ORD '#!/bin/bash'."\n"; | ord_content += '#SBATCH -J sam_'+pollo+'\n' |
print ORD '#SBATCH -J sam_'.$pollo."\n"; | ord_content += '#SBATCH --time=24:0:0\n' |
print ORD '#SBATCH -p fast'."\n"; | ord_content += '#SBATCH -o '+outdir+'/merge_'+pollo+'-%j\n' |
print ORD '#SBATCH -o '.$outdir.'/merge_'.$pollo.'-%j'."\n"; | ord_content += '#SBATCH -c 8\n' |
print ORD '#SBATCH -c 8'."\n"; | ord_content += '#SBATCH --mem-per-cpu=4G\n' |
print ORD '#SBATCH --mem-per-cpu=4G'."\n"; | ord_content += '#SBATCH -p fast\n' |
print ORD '#SBATCH -p fast'."\n"; | ord_content += '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT\n' |
print ORD '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT'."\n"; #no quieres que te mande email de todo | ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | ord_content += picard+' MergeSamFiles '+gsconv+' O='+tmpdir+'/'+pname+'.sam\n' |
print ORD $picard.' MergeSamFiles '.$gsconv.' O='.$tmpdir.'/'.$pname.'.sam'."\n"; | ord_content += picard+' SortSam I='+tmpdir+'/'+pname+'.sam O='+tmpdir+'/'+pname+'_sorted.bam SORT_ORDER=coordinate\n' |
print ORD $picard.' SortSam I='.$tmpdir.'/'.$pname.'.sam O='.$tmpdir.'/'.$pname.'_sorted.bam SORT_ORDER=coordinate'."\n"; | ord_content += samtools+' index '+tmpdir+'/'+pname+'_sorted.bam\n' |
print ORD $samtools.' index '.$tmpdir.'/'.$pname.'_sorted.bam'."\n"; | osf = open(orderfile, 'w') |
close ORD; | osf.write(ord_content) |
my $order = 'sbatch --dependency=singleton '.$orderfile; | osf.close() |
my $ujobid = `$order`; | ujobid = subprocess.getoutput('sbatch --dependency=singleton --parsable '+orderfile) |
$ujobid = ( split ' ', $ujobid )[ -1 ]; | |
</code> | </code> |
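Note how the job id is captured on each side: the Perl code splits the human-readable line "Submitted batch job NNNN" and takes the last field, while the Python code passes --parsable so that sbatch prints the bare job id. A sketch of the Perl-style parsing for comparison (last_token is a hypothetical helper):

```python
def last_token(sbatch_output):
    # Perl: ( split ' ', $ujobid )[ -1 ]  ->  take the last whitespace-separated field
    return sbatch_output.split()[-1]
```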
++++ | ++++ |
++++ Numbers 6 and 7 depend on number 5,| | ++++ Numbers 6 and 7 depend on number 5,| |
| |
<code perl> | <code python> |
$orderfile = $outdir.'/verify_'.$pollo.'.sh'; | orderfile = outdir+'/verify_'+pollo+'.sh' |
open ORD, ">$orderfile"; | ord_content = '#!/bin/bash\n' |
print ORD '#!/bin/bash'."\n"; | ord_content += '#SBATCH -J verify_'+pollo+'\n' |
print ORD '#SBATCH -J verify_'.$pollo."\n"; | ord_content += '#SBATCH --time=24:0:0\n' |
print ORD '#SBATCH -p fast'."\n"; | ord_content += '#SBATCH -o '+outdir+'/verify_'+pollo+'-%j\n' |
print ORD '#SBATCH -o '.$outdir.'/verify_'.$pollo.'-%j'."\n"; | ord_content += '#SBATCH -c 8\n' |
print ORD '#SBATCH -c 4'."\n"; | ord_content += '#SBATCH --mem-per-cpu=4G\n' |
print ORD '#SBATCH --mem-per-cpu=4G'."\n"; | ord_content += '#SBATCH -p fast\n' |
print ORD '#SBATCH -p fast'."\n"; | ord_content += '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT\n' |
print ORD '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT'."\n"; #no quieres que te mande email de todo | ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | ord_content += verifybamib+' --vcf '+ref_dir+'/hapmap_3.3.hg38.vcf.gz --bam '+tmpdir+'/'+pname+'_sorted.bam --chip-none --maxDepth 1000 --precise --verbose --ignoreRG --out '+resdir+'/'+pname+'_verifybam |& grep -v "Skipping marker"\n' |
print ORD $verifybamib. ' --vcf '.$ref_dir.'/hapmap_3.3.hg38.vcf.gz --bam '.$tmpdir.'/'.$pname.'_sorted.bam --chip-none --maxDepth 1000 --precise --verbose --ignoreRG --out '.$resdir.'/'.$pname.'_verifybam |& grep -v "Skipping marker"'."\n"; | osf = open(orderfile, 'w') |
close ORD; | osf.write(ord_content) |
$order = 'sbatch --dependency=afterok:'.$ujobid.' '.$orderfile; | osf.close() |
my $jobid = `$order`; | jobid = subprocess.getoutput('sbatch --parsable --dependency=afterok:'+ujobid+' '+orderfile) |
$jobid = ( split ' ', $jobid )[ -1 ]; | jobids.append(jobid) |
push @jobids, $jobid; | |
</code> | </code> |
| |
<code perl> | <code python> |
$orderfile = $outdir.'/validate_'.$pollo.'.sh'; | orderfile = outdir+'/validate_'+pollo+'.sh' |
open ORD, ">$orderfile"; | ord_content = '#!/bin/bash\n' |
print ORD '#!/bin/bash'."\n"; | ord_content += '#SBATCH -J validate_'+pollo+'\n' |
print ORD '#SBATCH -J validate_'.$pollo."\n"; | ord_content += '#SBATCH --time=24:0:0\n' |
print ORD '#SBATCH -p fast'."\n"; | ord_content += '#SBATCH -o '+outdir+'/validate_'+pollo+'-%j\n' |
print ORD '#SBATCH -o '.$outdir.'/validate_'.$pollo.'-%j'."\n"; | ord_content += '#SBATCH -c 8\n' |
print ORD '#SBATCH -c 12'."\n"; | ord_content += '#SBATCH --mem-per-cpu=4G\n' |
print ORD '#SBATCH --mem-per-cpu=4G'."\n"; | ord_content += '#SBATCH -p fast\n' |
print ORD '#SBATCH -p fast'."\n"; | ord_content += '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT\n' |
print ORD '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT'."\n"; #no quieres que te mande email de todo | ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | ord_content += picard+' ValidateSamFile IGNORE=MATE_NOT_FOUND IGNORE=MISSING_READ_GROUP IGNORE=RECORD_MISSING_READ_GROUP I='+tmpdir+'/'+pname+'_sorted.bam\n' |
print ORD $picard.' ValidateSamFile IGNORE=MATE_NOT_FOUND IGNORE=MISSING_READ_GROUP IGNORE=RECORD_MISSING_READ_GROUP I='.$tmpdir.'/'.$pname.'_sorted.bam'."\n"; | ord_content += picard+' MarkDuplicates I='+tmpdir+'/'+pname+'_sorted.bam O='+tmpdir+'/'+pname+'_GATKready.bam METRICS_FILE='+resdir+'/'+pname+'_metrics.txt QUIET=true MAX_RECORDS_IN_RAM=2000000 ASSUME_SORTED=TRUE CREATE_INDEX=TRUE\n' |
print ORD $picard.' MarkDuplicates I='.$tmpdir.'/'.$pname.'_sorted.bam O='.$tmpdir.'/'.$pname.'_GATKready.bam METRICS_FILE='.$resdir.'/'.$pname.'_metrics.txt QUIET=true MAX_RECORDS_IN_RAM=2000000 ASSUME_SORTED=TRUE CREATE_INDEX=TRUE'."\n"; | ord_content += gatk3+' -T DepthOfCoverage -R '+ref_dir+'/Homo_sapiens_assembly38.fasta -nt 1 -ct 10 -ct 15 -ct 20 -ct 30 --omitDepthOutputAtEachBase --omitIntervalStatistics --omitLocusTable -L '+ref_dir+'/MGI_Exome_Capture_V5_bis2.bed -I '+tmpdir+'/'+pname+'_GATKready.bam -o '+resdir+'/'+pname+'_exome_coverage\n' |
print ORD $gatk3.' -T DepthOfCoverage -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta -nt 1 -ct 10 -ct 15 -ct 20 -ct 30 --omitDepthOutputAtEachBase --omitIntervalStatistics --omitLocusTable -L '.$ref_dir.'/MGI_Exome_Capture_V5_bis2.bed -I '.$tmpdir.'/'.$pname.'_GATKready.bam -o '.$resdir.'/'.$pname.'_exome_coverage'."\n"; | ord_content += gatk4+' BaseRecalibrator -I '+tmpdir+'/'+pname+'_GATKready.bam -R '+ref_dir+'/Homo_sapiens_assembly38.fasta --known-sites '+ref_dir+'/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites '+ref_dir+'/dbsnp_146.hg38.vcf.gz --known-sites '+ref_dir+'/1000G_phase1.snps.high_confidence.hg38.vcf.gz -O '+resdir+'/'+pname+'_recal_data.table1\n' |
print ORD $gatk4.' BaseRecalibrator -I '.$tmpdir.'/'.$pname.'_GATKready.bam -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta --known-sites '.$ref_dir.'/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites '.$ref_dir.'/dbsnp_146.hg38.vcf.gz --known-sites '.$ref_dir.'/1000G_phase1.snps.high_confidence.hg38.vcf.gz -O '.$resdir.'/'.$pname.'_recal_data.table1'."\n"; | ord_content += gatk4+' ApplyBQSR -R '+ref_dir+'/Homo_sapiens_assembly38.fasta -I '+tmpdir+'/'+pname+'_GATKready.bam -bqsr-recal-file '+resdir+'/'+pname+'_recal_data.table1 -O '+resdir+'/'+pname+'_recal.bam\n' |
print ORD $gatk4.' ApplyBQSR -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta -I '.$tmpdir.'/'.$pname.'_GATKready.bam -bqsr-recal-file '.$resdir.'/'.$pname.'_recal_data.table1 -O '.$resdir.'/'.$pname.'_recal.bam'."\n"; | ord_content += gatk4+' AnalyzeCovariates -bqsr '+resdir+'/'+pname+'_recal_data.table1 --plots '+resdir+'/'+pname+'_AnalyzeCovariates.pdf\n' |
print ORD $gatk4.' AnalyzeCovariates -bqsr '.$resdir.'/'.$pname.'_recal_data.table1 --plots '.$resdir.'/'.$pname.'_AnalyzeCovariates.pdf'."\n"; | ord_content += gatk4+' BaseRecalibrator -I '+resdir+'/'+pname+'_recal.bam -R '+ref_dir+'/Homo_sapiens_assembly38.fasta --known-sites '+ref_dir+'/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites '+ref_dir+'/dbsnp_146.hg38.vcf.gz --known-sites '+ref_dir+'/1000G_phase1.snps.high_confidence.hg38.vcf.gz -O '+resdir+'/'+pname+'_recal_data.table2\n' |
print ORD $gatk4.' BaseRecalibrator -I '.$resdir.'/'.$pname.'_recal.bam -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta --known-sites '.$ref_dir.'/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites '.$ref_dir.'/dbsnp_146.hg38.vcf.gz --known-sites '.$ref_dir.'/1000G_phase1.snps.high_confidence.hg38.vcf.gz -O '.$resdir.'/'.$pname.'_recal_data.table2'."\n"; | ord_content += gatk4+' AnalyzeCovariates -before '+resdir+'/'+pname+'_recal_data.table1 -after '+resdir+'/'+pname+'_recal_data.table2 -plots '+resdir+'/'+pname+'_before-after-plots.pdf\n' |
print ORD $gatk4.' AnalyzeCovariates -before '.$resdir.'/'.$pname.'_recal_data.table1 -after '.$resdir.'/'.$pname.'_recal_data.table2 -plots '.$resdir.'/'.$pname.'_before-after-plots.pdf'."\n"; | ord_content += gatk4+' HaplotypeCaller -R '+ref_dir+'/Homo_sapiens_assembly38.fasta -I '+resdir+'/'+pname+'_recal.bam -ERC GVCF --dbsnp '+ref_dir+'/dbsnp_146.hg38.vcf.gz -O '+resdir+'/'+pname+'_raw.snps.indels.g.vcf.gz\n' |
print ORD $gatk4.' HaplotypeCaller -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta -I '.$resdir.'/'.$pname.'_recal.bam -ERC GVCF --dbsnp '.$ref_dir.'/dbsnp_146.hg38.vcf.gz -O '.$resdir.'/'.$pname.'_raw.snps.indels.g.vcf.gz'."\n"; | ord_content += gatk4_l+' VariantEval -R '+ref_dir+'/Homo_sapiens_assembly38.fasta -L '+ref_dir+'/MGI_Exome_Capture_V5_bis2.bed -D '+ref_dir+'/dbsnp_146.hg38.vcf.gz -O '+resdir+'/'+pname+'_eval.gatkreport --eval '+resdir+'/'+pname+'_raw.snps.indels.g.vcf.gz\n' |
print ORD $gatk4_l.' VariantEval -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta -L '.$ref_dir.'/MGI_Exome_Capture_V5_bis2.bed -D '.$ref_dir.'/dbsnp_146.hg38.vcf.gz -O '.$resdir.'/'.$pname.'_eval.gatkreport --eval '.$resdir.'/'.$pname.'_raw.snps.indels.g.vcf.gz'."\n"; | osf = open(orderfile, 'w') |
close ORD; | osf.write(ord_content) |
$order = 'sbatch --dependency=afterok:'.$ujobid.' '.$orderfile; | osf.close() |
$jobid = `$order`; | jobid = subprocess.getoutput('sbatch --parsable --dependency=afterok:'+ujobid+' '+orderfile) |
$jobid = ( split ' ', $jobid )[ -1 ]; | jobids.append(jobid) |
push @jobids, $jobid; | |
</code> | </code> |
++++ | ++++ |
++++ and finally, there is a process that depends on all the others finishing (ALL of them); it cleans up the temporary files and sends a completion email, | | ++++ and finally, there is a process that depends on all the others finishing (ALL of them); it cleans up the temporary files and sends a completion email, | |
| |
<code perl> | <code python> |
my $orderfile = $outdir.'/wgs_end.sh'; | orderfile = outdir+'/wgs_end.sh' |
open ORD, ">$orderfile"; | ord_content = '#!/bin/bash\n' |
print ORD '#!/bin/bash'."\n"; | ord_content += '#SBATCH -J wgs_end\n' |
print ORD '#SBATCH -J wgs_end'."\n"; | ord_content += '#SBATCH -o '+outdir+'/wgs_end-%j\n' |
print ORD '#SBATCH --mail-type=END'."\n"; #email cuando termine o falle | ord_content += '#SBATCH --mail-type=END\n' |
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
print ORD '#SBATCH -o '.$outdir.'/wgs_end-%j'."\n"; | if not debug: |
unless ($debug){ | ord_content += 'rm -rf '+wdir+'/*/tmp\n' |
print ORD "rm -rf $w_dir/*/tmp\n"; | else: |
}else{ | ord_content += ':\n' |
print ORD ":\n"; | osf = open(orderfile, 'w') |
} | osf.write(ord_content) |
close ORD; | osf.close() |
my $sjobs = join(',',@jobids); | sjobs = ','.join(map(str,jobids)) |
my $order = 'sbatch --depend=afterok:'.$sjobs.' '.$orderfile; | os.system('sbatch --depend=afterok:'+sjobs+' '+orderfile) |
print "$order\n"; | |
exec($order); | |
</code> | </code> |
++++ | ++++ |
| |
++++ The complete magic here, in under 200 lines | | ++++ The complete magic here, in under 200 lines | |
<code perl wgs.pl> | <code python wgs.py> |
#!/usr/bin/perl | #!/usr/bin/python3 |
| |
# Copyright 2020 O. Sotolongo <asqwerty@gmail.com> | """ |
# | Copyright 2020 O. Sotolongo <asqwerty@gmail.com> |
# This program is free software; you can redistribute it and/or modify | |
# it under the terms of the GNU General Public License as published by | |
# the Free Software Foundation; either version 2 of the License, or | |
# (at your option) any later version. | |
# | |
# This program is distributed in the hope that it will be useful, | |
# but WITHOUT ANY WARRANTY; without even the implied warranty of | |
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
# GNU General Public License for more details. | |
| |
use strict; use warnings; | This program is free software; you can redistribute it and/or modify |
use File::Find::Rule; | it under the terms of the GNU General Public License as published by |
use File::Slurp qw(read_file); | the Free Software Foundation; either version 2 of the License, or |
use Data::Dump qw(dump); | (at your option) any later version. |
| |
################################################################# | This program is distributed in the hope that it will be useful, |
################### rellenar los paths sistema ################## | but WITHOUT ANY WARRANTY; without even the implied warranty of |
################################################################# | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
my $ref_dir = '/the_dysk/BGI_exome/reference'; | GNU General Public License for more details. |
my $bwa ='/nas/usr/local/bin/bwa'; | """ |
#my $picard = 'java -jar /nas/usr/local/bin/picard.jar'; | |
my $picard = 'java -Djava.io.tmpdir=/nas/'.$ENV{'USER'}.'/tmp/ -Xmx8g -jar /nas/usr/local/bin/picard.jar'; | |
my $samtools = '/nas/software/samtools/bin/samtools'; | |
my $verifybamib = '/nas/usr/local/bin/verifyBamID'; | |
my $gatk3 = 'java -jar /nas/usr/local/opt/gatk3/GenomeAnalysisTK.jar'; | |
#my $gatk4 = '/nas/usr/local/opt/gatk4/gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G"'; | |
#my $gatk4_l = '/nas/usr/local/opt/gatk4/gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G -Xmx16G -XX:+UseConcMarkSweepGC"'; | |
my $gatk4 = 'singularity run --cleanenv -B /nas:/nas -B /the_dysk:/the_dysk /usr/local/bin/gatk4.simg gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G"'; | |
my $gatk4_l = 'singularity run --cleanenv -B /nas:/nas -B /the_dysk:/the_dysk /usr/local/bin/gatk4.simg gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G -XX:+UseConcMarkSweepGC"'; | |
################################################################# | |
################################################################# | |
################################################################# | |
my $cfile; | |
my $outdatadir; | |
my $debug = 0; | |
@ARGV = ("-h") unless @ARGV; | |
while (@ARGV and $ARGV[0] =~ /^-/) { | |
$_ = shift; | |
last if /^--$/; | |
if (/^-cut/) { $cfile = shift; chomp($cfile);} | |
if (/^-o/) { $outdatadir = shift;} | |
if (/^-g/) { $debug = 1;} | |
} | |
my $src_dir = shift; | |
$src_dir =~ s/\/$//; | |
my $find_rule = File::Find::Rule->new; | |
my @ppaths = $find_rule->maxdepth(1)->directory->in($src_dir); | |
@ppaths = grep {!/$src_dir$/} @ppaths; | |
my %ltpaths = map { /.*\/(.*)?$/ => $_ } @ppaths; | |
my %finfo; | |
my %lpaths; | |
| |
if($cfile && -e $cfile && -f $cfile){ | import sys |
my @cuts = read_file $cfile; | import os |
chomp @cuts; | import getopt |
foreach my $cut (@cuts){ | import re |
if(grep {/$cut/} %ltpaths){ | import subprocess |
$lpaths{$cut} = $ltpaths{$cut}; | |
} | """ |
} | Fill in the system paths here |
}else{ | """ |
%lpaths = %ltpaths; | ref_dir = '/the_dysk/BGI_exome/reference' |
} | bwa = '/nas/usr/local/bin/bwa' |
| picard = 'java -Djava.io.tmpdir=/nas/'+os.environ.get('USER')+'/tmp/ -Xmx8g -jar /nas/usr/local/bin/picard.jar' |
| samtools = '/nas/software/samtools/bin/samtools' |
| verifybamib = '/nas/usr/local/bin/verifyBamID' |
| gatk3 = 'java -jar /nas/usr/local/opt/gatk3/GenomeAnalysisTK.jar' |
| gatk4 = 'singularity run --cleanenv -B /nas:/nas -B /the_dysk:/the_dysk /usr/local/bin/gatk4.simg gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G"' |
| gatk4_l = 'singularity run --cleanenv -B /nas:/nas -B /the_dysk:/the_dysk /usr/local/bin/gatk4.simg gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G -XX:+UseConcMarkSweepGC"' |
| """ |
| Do not touch anything below this unless you know what you are doing |
| """ |
| |
| # Get CLI inputs |
| short_args = 'c:o:gs:' |
| long_args = ['cut=', 'output=', 'debug', 'source='] |
| debug = 0 |
| cfile='' |
| outdatadir='' |
| |
| try: |
| args, values = getopt.getopt(sys.argv[1:], short_args, long_args) |
| except getopt.error as err: |
| print(str(err)) |
| sys.exit(2) |
| |
| for a,v in args: |
| if a in ('--cut', '-c'): |
| cfile = v |
| elif a in ('--output', '-o'): |
| outdatadir = v |
| elif a in ('--debug', '-g'): |
| debug = 1 |
| elif a in ('--source', '-s'): |
| src_dir = v |
| |
| lpath = os.path.abspath(src_dir) |
| dir_cont = next(os.walk(src_dir))[1] |
| |
| if cfile and os.path.isfile(cfile): |
| f = open(cfile, 'r') |
| cuts = f.read().splitlines() |
| dir_cont = set(dir_cont) & set(cuts) |
| |
# Create the execution environment | # Create the execution environment |
my $tmp_shit = '/nas/'.$ENV{'USER'}.'/tmp/'; | tmp_shit = '/nas/'+os.environ.get('USER')+'/tmp/' |
mkdir $tmp_shit; | if not os.path.isdir(tmp_shit): os.mkdir(tmp_shit) |
my $wdir; | if outdatadir: |
if($outdatadir){ | os.mkdir(outdatadir) |
mkdir $outdatadir; | wdir = outdatadir |
$wdir = $outdatadir; | else: |
}else{ | wdir = os.environ.get('PWD') |
$wdir = $ENV{'PWD'}; | outdir = wdir+'/slurm' |
} | if not os.path.isdir(outdir): os.mkdir(outdir) |
my $outdir = $wdir.'/slurm'; | fq = {} |
mkdir $outdir; | for pollo in dir_cont: |
#Recopilo la informacion de los archivos de entrada | fq_list = [] |
foreach my $pollo (sort keys %lpaths){ | fq_files = next(os.walk(lpath+'/'+pollo))[2] |
opendir PD, $lpaths{$pollo} or next; | for fq in fq_files: |
my @fqs = grep { !/^\./} readdir PD; | ids = re.search(r"^V(\d*?)\_L(\d*?)\_(\d*?)\_\d\.fq\.gz$", fq) |
#$finfo{$pollo}{'name'} = $pollo; | fq_name ='V'+ids.group(1)+'_L'+ids.group(2) |
$finfo{$pollo}{'path'} = $lpaths{$pollo}; | fq_list.append(ids.group(3)) |
foreach my $fq (@fqs) { | udir = wdir+'/'+pollo |
$_=$fq; | if not os.path.isdir(udir): os.mkdir(udir) |
my @ids = /^V(\d*?)\_L(\d*?)\_(\d*?)\_\d\.fq\.gz$/; | tmpdir = udir+'/tmp' |
$finfo{$pollo}{'fq_name'} = 'V'.$ids[0].'_L'.$ids[1]; | if not os.path.isdir(tmpdir): os.mkdir(tmpdir) |
push @{ $finfo{$pollo}{'fq_list'} }, $ids[2]; | resdir = udir+'/results' |
} | if not os.path.isdir(resdir): os.mkdir(resdir) |
} | jobids = [] |
#Creo y ejecuto los procesos | gsconv = '' |
my @jobids; | pname = re.sub('-','',pollo) |
foreach my $pollo (sort keys %finfo){ | for i in range(0,4): |
my $udir = $wdir.'/'.$pollo; | fqid = fq_list[2*i] |
mkdir $udir; | orderfile = outdir+'/bwa_'+pollo+'_'+str(i)+'.sh' |
my $tmpdir = $udir.'/tmp'; | ord_content = '#!/bin/bash\n' |
mkdir $tmpdir; | ord_content += '#SBATCH -J sam_'+pollo+'\n' |
my $resdir = $udir.'/results'; | ord_content += '#SBATCH --time=24:0:0\n' |
mkdir $resdir; | ord_content += '#SBATCH -o '+outdir+'/bwa_'+pollo+'-%j'+'\n' |
my $gsconv = ""; | ord_content += '#SBATCH -c 8\n' |
my @ujobids; | ord_content += '#SBATCH --mem-per-cpu=4G\n' |
(my $pname = $pollo) =~ s/-//g; | ord_content += '#SBATCH -p fast\n' |
for (my $i=0; $i<4; $i++){ | ord_content += '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT\n' |
my $fqid = $finfo{$pollo}{'fq_list'}[2*$i]; | ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
my $orderfile = $outdir.'/bwa_'.$pollo.'_'.$i.'.sh'; | ord_content += bwa+' mem -t 4 -R "@RG\\tID:'+fq_name+'_'+fqid+'\\tSM:'+pname+'\\tPL:BGI\\tPI:380" -M '+ref_dir+'/Homo_sapiens_assembly38 '+src_dir+'/'+pollo+'/'+fq_name+'_'+fqid+'_1.fq.gz '+src_dir+'/'+pollo+'/'+fq_name+'_'+fqid+'_2.fq.gz > '+tmpdir+'/'+pname+'_'+str(i)+'.sam\n' |
open ORD, ">$orderfile" or die "Couldnt create file"; | osf = open(orderfile, 'w') |
print ORD '#!/bin/bash'."\n"; | osf.write(ord_content) |
print ORD '#SBATCH -J sam_'.$pollo."\n"; | osf.close() |
print ORD '#SBATCH --time=24:0:0'."\n"; | gsconv+= 'I='+tmpdir+'/'+pname+'_'+str(i)+'.sam ' |
print ORD '#SBATCH -o '.$outdir.'/bwa_'.$pollo.'-%j'."\n"; | os.system('sbatch '+orderfile) |
print ORD '#SBATCH -c 8'."\n"; | orderfile = outdir+'/merge_'+pollo+'.sh' |
print ORD '#SBATCH --mem-per-cpu=4G'."\n"; | ord_content = '#!/bin/bash\n' |
print ORD '#SBATCH -p fast'."\n"; | ord_content += '#SBATCH -J sam_'+pollo+'\n' |
print ORD '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT'."\n"; #no quieres que te mande email de todo | ord_content += '#SBATCH --time=24:0:0\n' |
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | ord_content += '#SBATCH -o '+outdir+'/merge_'+pollo+'-%j\n' |
print ORD $bwa.' mem -t 4 -R "@RG\tID:'.$finfo{$pollo}{'fq_name'}.'_'.$fqid.'\tSM:'.$pname.'\tPL:BGI\tPI:380" -M '.$ref_dir.'/Homo_sapiens_assembly38 '.$finfo{$pollo}{'path'}.'/'.$finfo{$pollo}{'fq_name'}.'_'.$fqid.'_1.fq.gz '.$finfo{$pollo}{'path'}.'/'.$finfo{$pollo}{'fq_name'}.'_'.$fqid.'_2.fq.gz > '.$tmpdir.'/'.$pname.'_'.$i.'.sam'; | ord_content += '#SBATCH -c 8\n' |
print ORD "\n"; | ord_content += '#SBATCH --mem-per-cpu=4G\n' |
close ORD; | ord_content += '#SBATCH -p fast\n' |
$gsconv.= 'I='.$tmpdir.'/'.$pname.'_'.$i.'.sam '; | ord_content += '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT\n' |
system("sbatch $orderfile"); | ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
} | ord_content += picard+' MergeSamFiles '+gsconv+' O='+tmpdir+'/'+pname+'.sam\n' |
my $orderfile = $outdir.'/merge_'.$pollo.'.sh'; | ord_content += picard+' SortSam I='+tmpdir+'/'+pname+'.sam O='+tmpdir+'/'+pname+'_sorted.bam SORT_ORDER=coordinate\n' |
open ORD, ">$orderfile"; | ord_content += samtools+' index '+tmpdir+'/'+pname+'_sorted.bam\n' |
print ORD '#!/bin/bash'."\n"; | osf = open(orderfile, 'w') |
print ORD '#SBATCH -J sam_'.$pollo."\n"; | osf.write(ord_content) |
print ORD '#SBATCH -p fast'."\n"; | osf.close() |
print ORD '#SBATCH -o '.$outdir.'/merge_'.$pollo.'-%j'."\n"; | ujobid = subprocess.getoutput('sbatch --dependency=singleton --parsable '+orderfile) |
print ORD '#SBATCH -c 8'."\n"; | orderfile = outdir+'/verify_'+pollo+'.sh' |
print ORD '#SBATCH --mem-per-cpu=4G'."\n"; | ord_content = '#!/bin/bash\n' |
print ORD '#SBATCH -p fast'."\n"; | ord_content += '#SBATCH -J verify_'+pollo+'\n' |
print ORD '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT'."\n"; #no quieres que te mande email de todo | ord_content += '#SBATCH --time=24:0:0\n' |
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | ord_content += '#SBATCH -o '+outdir+'/verify_'+pollo+'-%j\n' |
print ORD $picard.' MergeSamFiles '.$gsconv.' O='.$tmpdir.'/'.$pname.'.sam'."\n"; | ord_content += '#SBATCH -c 8\n' |
print ORD $picard.' SortSam I='.$tmpdir.'/'.$pname.'.sam O='.$tmpdir.'/'.$pname.'_sorted.bam SORT_ORDER=coordinate'."\n"; | ord_content += '#SBATCH --mem-per-cpu=4G\n' |
print ORD $samtools.' index '.$tmpdir.'/'.$pname.'_sorted.bam'."\n"; | ord_content += '#SBATCH -p fast\n' |
close ORD; | ord_content += '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT\n' |
my $order = 'sbatch --dependency=singleton '.$orderfile; | ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
my $ujobid = `$order`; | ord_content += verifybamib+' --vcf '+ref_dir+'/hapmap_3.3.hg38.vcf.gz --bam '+tmpdir+'/'+pname+'_sorted.bam --chip-none --maxDepth 1000 --precise --verbose --ignoreRG --out '+resdir+'/'+pname+'_verifybam |& grep -v "Skipping marker"\n' |
$ujobid = ( split ' ', $ujobid )[ -1 ]; | osf = open(orderfile, 'w') |
| osf.write(ord_content) |
| osf.close() |
| jobid = subprocess.getoutput('sbatch --parsable --dependency=afterok:'+ujobid+' '+orderfile) |
| jobids.append(jobid) |
| orderfile = outdir+'/validate_'+pollo+'.sh' |
| ord_content = '#!/bin/bash\n' |
| ord_content += '#SBATCH -J validate_'+pollo+'\n' |
| ord_content += '#SBATCH --time=24:0:0\n' |
| ord_content += '#SBATCH -o '+outdir+'/validate_'+pollo+'-%j\n' |
| ord_content += '#SBATCH -c 8\n' |
| ord_content += '#SBATCH --mem-per-cpu=4G\n' |
| ord_content += '#SBATCH -p fast\n' |
| ord_content += '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT\n' |
| ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
| ord_content += picard+' ValidateSamFile IGNORE=MATE_NOT_FOUND IGNORE=MISSING_READ_GROUP IGNORE=RECORD_MISSING_READ_GROUP I='+tmpdir+'/'+pname+'_sorted.bam\n' |
| ord_content += picard+' MarkDuplicates I='+tmpdir+'/'+pname+'_sorted.bam O='+tmpdir+'/'+pname+'_GATKready.bam METRICS_FILE='+resdir+'/'+pname+'_metrics.txt QUIET=true MAX_RECORDS_IN_RAM=2000000 ASSUME_SORTED=TRUE CREATE_INDEX=TRUE\n' |
| ord_content += gatk3+' -T DepthOfCoverage -R '+ref_dir+'/Homo_sapiens_assembly38.fasta -nt 1 -ct 10 -ct 15 -ct 20 -ct 30 --omitDepthOutputAtEachBase --omitIntervalStatistics --omitLocusTable -L '+ref_dir+'/MGI_Exome_Capture_V5_bis2.bed -I '+tmpdir+'/'+pname+'_GATKready.bam -o '+resdir+'/'+pname+'_exome_coverage\n' |
| ord_content += gatk4+' BaseRecalibrator -I '+tmpdir+'/'+pname+'_GATKready.bam -R '+ref_dir+'/Homo_sapiens_assembly38.fasta --known-sites '+ref_dir+'/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites '+ref_dir+'/dbsnp_146.hg38.vcf.gz --known-sites '+ref_dir+'/1000G_phase1.snps.high_confidence.hg38.vcf.gz -O '+resdir+'/'+pname+'_recal_data.table1\n' |
| ord_content += gatk4+' ApplyBQSR -R '+ref_dir+'/Homo_sapiens_assembly38.fasta -I '+tmpdir+'/'+pname+'_GATKready.bam -bqsr-recal-file '+resdir+'/'+pname+'_recal_data.table1 -O '+resdir+'/'+pname+'_recal.bam\n' |
| ord_content += gatk4+' AnalyzeCovariates -bqsr '+resdir+'/'+pname+'_recal_data.table1 --plots '+resdir+'/'+pname+'_AnalyzeCovariates.pdf\n' |
| ord_content += gatk4+' BaseRecalibrator -I '+resdir+'/'+pname+'_recal.bam -R '+ref_dir+'/Homo_sapiens_assembly38.fasta --known-sites '+ref_dir+'/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites '+ref_dir+'/dbsnp_146.hg38.vcf.gz --known-sites '+ref_dir+'/1000G_phase1.snps.high_confidence.hg38.vcf.gz -O '+resdir+'/'+pname+'_recal_data.table2\n' |
| ord_content += gatk4+' AnalyzeCovariates -before '+resdir+'/'+pname+'_recal_data.table1 -after '+resdir+'/'+pname+'_recal_data.table2 -plots '+resdir+'/'+pname+'_before-after-plots.pdf\n' |
| ord_content += gatk4+' HaplotypeCaller -R '+ref_dir+'/Homo_sapiens_assembly38.fasta -I '+resdir+'/'+pname+'_recal.bam -ERC GVCF --dbsnp '+ref_dir+'/dbsnp_146.hg38.vcf.gz -O '+resdir+'/'+pname+'_raw.snps.indels.g.vcf.gz\n' |
| ord_content += gatk4_l+' VariantEval -R '+ref_dir+'/Homo_sapiens_assembly38.fasta -L '+ref_dir+'/MGI_Exome_Capture_V5_bis2.bed -D '+ref_dir+'/dbsnp_146.hg38.vcf.gz -O '+resdir+'/'+pname+'_eval.gatkreport --eval '+resdir+'/'+pname+'_raw.snps.indels.g.vcf.gz\n' |
| osf = open(orderfile, 'w') |
| osf.write(ord_content) |
| osf.close() |
| jobid = subprocess.getoutput('sbatch --parsable --dependency=afterok:'+ujobid+' '+orderfile) |
| jobids.append(jobid) |
| orderfile = outdir+'/wgs_end.sh' |
| ord_content = '#!/bin/bash\n' |
| ord_content += '#SBATCH -J wgs_end\n' |
| ord_content += '#SBATCH -o '+outdir+'/wgs_end-%j\n' |
| ord_content += '#SBATCH --mail-type=END\n' |
| ord_content += '#SBATCH --mail-user='+os.environ.get('USER')+'\n' |
| if not debug: |
| ord_content += 'rm -rf '+wdir+'/*/tmp\n' |
| else: |
| ord_content += ':\n' |
| osf = open(orderfile, 'w') |
| osf.write(ord_content) |
| osf.close() |
| sjobs = ','.join(map(str,jobids)) |
| os.system('sbatch --depend=afterok:'+sjobs+' '+orderfile) |
| |
$orderfile = $outdir.'/verify_'.$pollo.'.sh'; | |
open ORD, ">$orderfile"; | |
print ORD '#!/bin/bash'."\n"; | |
print ORD '#SBATCH -J verify_'.$pollo."\n"; | |
print ORD '#SBATCH -p fast'."\n"; | |
print ORD '#SBATCH -o '.$outdir.'/verify_'.$pollo.'-%j'."\n"; | |
print ORD '#SBATCH -c 4'."\n"; | |
print ORD '#SBATCH --mem-per-cpu=4G'."\n"; | |
print ORD '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT'."\n"; # you don't want email about everything | 
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | |
print ORD $verifybamib. ' --vcf '.$ref_dir.'/hapmap_3.3.hg38.vcf.gz --bam '.$tmpdir.'/'.$pname.'_sorted.bam --chip-none --maxDepth 1000 --precise --verbose --ignoreRG --out '.$resdir.'/'.$pname.'_verifybam |& grep -v "Skipping marker"'."\n"; | |
close ORD; | |
$order = 'sbatch --dependency=afterok:'.$ujobid.' '.$orderfile; | |
my $jobid = `$order`; | |
$jobid = ( split ' ', $jobid )[ -1 ]; | |
push @jobids, $jobid; | |
$orderfile = $outdir.'/validate_'.$pollo.'.sh'; | |
open ORD, ">$orderfile"; | |
print ORD '#!/bin/bash'."\n"; | |
print ORD '#SBATCH -J validate_'.$pollo."\n"; | |
print ORD '#SBATCH -p fast'."\n"; | |
print ORD '#SBATCH -o '.$outdir.'/validate_'.$pollo.'-%j'."\n"; | |
print ORD '#SBATCH -c 12'."\n"; | |
print ORD '#SBATCH --mem-per-cpu=4G'."\n"; | |
print ORD '#SBATCH --mail-type=FAIL,TIME_LIMIT,STAGE_OUT'."\n"; # you don't want email about everything | 
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | |
print ORD $picard.' ValidateSamFile IGNORE=MATE_NOT_FOUND IGNORE=MISSING_READ_GROUP IGNORE=RECORD_MISSING_READ_GROUP I='.$tmpdir.'/'.$pname.'_sorted.bam'."\n"; | |
print ORD $picard.' MarkDuplicates I='.$tmpdir.'/'.$pname.'_sorted.bam O='.$tmpdir.'/'.$pname.'_GATKready.bam METRICS_FILE='.$resdir.'/'.$pname.'_metrics.txt QUIET=true MAX_RECORDS_IN_RAM=2000000 ASSUME_SORTED=TRUE CREATE_INDEX=TRUE'."\n"; | |
print ORD $gatk3.' -T DepthOfCoverage -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta -nt 1 -ct 10 -ct 15 -ct 20 -ct 30 --omitDepthOutputAtEachBase --omitIntervalStatistics --omitLocusTable -L '.$ref_dir.'/MGI_Exome_Capture_V5_bis2.bed -I '.$tmpdir.'/'.$pname.'_GATKready.bam -o '.$resdir.'/'.$pname.'_exome_coverage'."\n"; | |
print ORD $gatk4.' BaseRecalibrator -I '.$tmpdir.'/'.$pname.'_GATKready.bam -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta --known-sites '.$ref_dir.'/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites '.$ref_dir.'/dbsnp_146.hg38.vcf.gz --known-sites '.$ref_dir.'/1000G_phase1.snps.high_confidence.hg38.vcf.gz -O '.$resdir.'/'.$pname.'_recal_data.table1'."\n"; | |
print ORD $gatk4.' ApplyBQSR -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta -I '.$tmpdir.'/'.$pname.'_GATKready.bam -bqsr-recal-file '.$resdir.'/'.$pname.'_recal_data.table1 -O '.$resdir.'/'.$pname.'_recal.bam'."\n"; | |
print ORD $gatk4.' AnalyzeCovariates -bqsr '.$resdir.'/'.$pname.'_recal_data.table1 --plots '.$resdir.'/'.$pname.'_AnalyzeCovariates.pdf'."\n"; | |
print ORD $gatk4.' BaseRecalibrator -I '.$resdir.'/'.$pname.'_recal.bam -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta --known-sites '.$ref_dir.'/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites '.$ref_dir.'/dbsnp_146.hg38.vcf.gz --known-sites '.$ref_dir.'/1000G_phase1.snps.high_confidence.hg38.vcf.gz -O '.$resdir.'/'.$pname.'_recal_data.table2'."\n"; | |
print ORD $gatk4.' AnalyzeCovariates -before '.$resdir.'/'.$pname.'_recal_data.table1 -after '.$resdir.'/'.$pname.'_recal_data.table2 -plots '.$resdir.'/'.$pname.'_before-after-plots.pdf'."\n"; | |
print ORD $gatk4.' HaplotypeCaller -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta -I '.$resdir.'/'.$pname.'_recal.bam -ERC GVCF --dbsnp '.$ref_dir.'/dbsnp_146.hg38.vcf.gz -O '.$resdir.'/'.$pname.'_raw.snps.indels.g.vcf.gz'."\n"; | |
print ORD $gatk4_l.' VariantEval -R '.$ref_dir.'/Homo_sapiens_assembly38.fasta -L '.$ref_dir.'/MGI_Exome_Capture_V5_bis2.bed -D '.$ref_dir.'/dbsnp_146.hg38.vcf.gz -O '.$resdir.'/'.$pname.'_eval.gatkreport --eval '.$resdir.'/'.$pname.'_raw.snps.indels.g.vcf.gz'."\n"; | |
close ORD; | |
$order = 'sbatch --dependency=afterok:'.$ujobid.' '.$orderfile; | |
$jobid = `$order`; | |
$jobid = ( split ' ', $jobid )[ -1 ]; | |
push @jobids, $jobid; | |
} | |
my $orderfile = $outdir.'/wgs_end.sh'; | |
open ORD, ">$orderfile"; | |
print ORD '#!/bin/bash'."\n"; | |
print ORD '#SBATCH -J wgs_end'."\n"; | |
print ORD '#SBATCH --mail-type=END'."\n"; # email when it finishes or fails | 
print ORD '#SBATCH --mail-user='."$ENV{'USER'}\n"; | |
print ORD '#SBATCH -o '.$outdir.'/wgs_end-%j'."\n"; | |
unless ($debug){ | |
print ORD "rm -rf $w_dir/*/tmp\n"; | |
}else{ | |
print ORD ":\n"; | |
} | |
close ORD; | |
my $sjobs = join(',',@jobids); | |
my $order = 'sbatch --depend=afterok:'.$sjobs.' '.$orderfile; | |
print "$order\n"; | |
exec($order); | |
</code> | </code> |
++++ | ++++ |
{{:genetica:its_fucking_cool.gif?300| | {{:genetica:its_fucking_cool.gif?300| |
8-)}} | 8-)}} |
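The scheduling logic above boils down to one pattern: submit each batch script with ''sbatch --parsable'', collect the printed job ids, and pass them as ''afterok'' dependencies of the next stage. A minimal sketch of that pattern (the ''build_sbatch'' helper is hypothetical, not part of the script, and it only builds the command line that the script would hand to //subprocess.getoutput()//):

```python
# Sketch of the Slurm dependency-chaining pattern used by wgs.py.
# build_sbatch() is a hypothetical helper: it only builds the command
# string; the real script would run it and capture the job id.
def build_sbatch(orderfile, deps=None):
    cmd = 'sbatch --parsable'
    if deps:
        # afterok: start only when every listed job exits with code 0
        cmd += ' --dependency=afterok:' + ','.join(str(j) for j in deps)
    return cmd + ' ' + orderfile

jobids = [1701, 1702, 1703]  # ids collected from per-subject submissions
print(build_sbatch('/tmp/wgs_end.sh', jobids))
# sbatch --parsable --dependency=afterok:1701,1702,1703 /tmp/wgs_end.sh
```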
| |
| ---- |
| |
| **Note:** Here we used the //subprocess.getoutput()// function to capture the jobid of the submitted job. This takes for granted that we are using **Python 3**. If we are going to use **Python 2.7** or similar, instead of,
| |
| <code python> |
| jobid = subprocess.getoutput('sbatch --parsable --dependency=afterok:'+str(ujobid)+' '+orderfile) |
| </code> |
| |
| it has to be done something like this,
| |
| <code python> |
| order = 'sbatch --parsable --dependency=afterok:'+str(ujobid)+' '+orderfile
| jobid = int(subprocess.check_output(order, shell=True)) |
| </code> |
| |
| ---- |
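Both calls behave the same way; the difference is only in the return type. A quick way to see it, using ''echo'' as a harmless stand-in for ''sbatch --parsable'' (which also prints just the job id):

```python
import subprocess

fake_sbatch = 'echo 1234'  # stand-in for a real sbatch --parsable call

jobid_str = subprocess.getoutput(fake_sbatch)                      # Python 3 only
jobid_int = int(subprocess.check_output(fake_sbatch, shell=True))  # 2.7 and 3
print(jobid_str, jobid_int)
# 1234 1234
```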
===== Execution ===== | 
| |
| |
<code> | <code> |
$ ./wgs.pl -o /the_dysk/BGI_exome/F18FTSEUET0180/WGS /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE/ | $ ./wgs.py -o /the_dysk/BGI_exome/F18FTSEUET0180/WGS -s /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE/ |
</code> | </code> |
| |
This structure would allow running, | This structure would allow running,
<code> | <code> |
$ ./wgs.pl -o /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE | $ ./wgs.py -o /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE -s /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE
</code> | </code> |
and **provided there are no permission problems**, the results would be left in a directory inside the project subject itself. This was not made the default precisely in anticipation of those permission problems. | and **provided there are no permission problems**, the results would be left in a directory inside the project subject itself. This was not made the default precisely in anticipation of those permission problems.
| |
<code> | <code> |
$ ./wgs.pl -cut /nas/osotolongo/wgs/only.txt -o /the_dysk/BGI_exome/F18FTSEUET0180/WGS /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE | $ ./wgs.py -c /nas/osotolongo/wgs/only.txt -o /the_dysk/BGI_exome/F18FTSEUET0180/WGS -s /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE |
</code> | </code> |
| |
| |
<code> | <code> |
$ ./wgs.pl -g -o /the_dysk/BGI_exome/F18FTSEUET0180/WGS /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE | $ ./wgs.py -g -o /the_dysk/BGI_exome/F18FTSEUET0180/WGS -s /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE |
</code> | </code> |
| |
leaves a //tmp// subdirectory for each subject with its corresponding temporary files. For example, for seq-5 these files would be in // /the_dysk/BGI_exome/F18FTSEUET0180/WGS/seq-5/tmp/ //. | leaves a //tmp// subdirectory for each subject with its corresponding temporary files. For example, for seq-5 these files would be in // /the_dysk/BGI_exome/F18FTSEUET0180/WGS/seq-5/tmp/ //.
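If you later need to locate those leftover directories, a small sketch (the ''find_tmp_dirs'' helper is illustrative, not part of the script):

```python
import glob
import os

def find_tmp_dirs(outdir):
    """Return the per-subject tmp subdirectories left behind by -g."""
    return sorted(glob.glob(os.path.join(outdir, '*', 'tmp')))

# e.g. find_tmp_dirs('/the_dysk/BGI_exome/F18FTSEUET0180/WGS') would
# list .../WGS/seq-5/tmp, .../WGS/seq-6/tmp, and so on.
```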
| |
| ===== Using containers =====
| |
| I have managed to build a container with all the pipeline tools. Given the way the script is built, to make it work it would be enough to change the system paths to include the container. That is, at the very top, change the lines,
| |
| <code python> |
| ref_dir = '/the_dysk/BGI_exome/reference' |
| container = 'singularity run --cleanenv -B /nas:/nas -B /the_dysk:/the_dysk /nas/osotolongo/wgs/bin/wgs.sif' |
| bwa = container+' bwa' |
| picard = container+' java -Djava.io.tmpdir=/nas/'+os.environ.get('USER')+'/tmp/ -Xmx8g -jar /opt/picard.jar' |
| samtools = container+' samtools' |
| verifybamib = container+' verifyBamID' |
| gatk3 = container+' gatk3' |
| gatk4 = container+' gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G"' |
| gatk4_l = container+' gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx16G -XX:+UseConcMarkSweepGC"' |
| </code> |
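This works because each tool variable is just a string prefix, so every command the script builds is transparently wrapped in the container invocation. A minimal illustration (paths as in the snippet above; the ''samtools sort'' arguments are only an example):

```python
# Each "tool" is the container invocation plus the tool name, so the
# unchanged command-building code now produces containerized commands.
container = ('singularity run --cleanenv -B /nas:/nas '
             '-B /the_dysk:/the_dysk /nas/osotolongo/wgs/bin/wgs.sif')
samtools = container + ' samtools'

cmd = samtools + ' sort -o sorted.bam input.bam'
print(cmd)
```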
| |
| in principle everything else stays the same,
| |
| <code> |
| [osotolongo@brick03 wgs]$ bin/cwgs.py -o /home/osotolongo/wgs/temp -c only.txt -s /the_dysk/BGI_exome/F18FTSEUET0180/HUMehbE/ |
| Submitted batch job 76 |
| Submitted batch job 77 |
| Submitted batch job 78 |
| Submitted batch job 79 |
| Submitted batch job 83 |
| Submitted batch job 84 |
| Submitted batch job 85 |
| Submitted batch job 86 |
| Submitted batch job 90 |
| Submitted batch job 91 |
| Submitted batch job 92 |
| Submitted batch job 93 |
| Submitted batch job 97 |
| [osotolongo@brick03 wgs]$ squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 80 fast sam_seq- osotolon PD 0:00 1 (Dependency) |
| 81 fast verify_s osotolon PD 0:00 1 (Dependency) |
| 82 fast validate osotolon PD 0:00 1 (Dependency) |
| 87 fast sam_seq- osotolon PD 0:00 1 (Dependency) |
| 88 fast verify_s osotolon PD 0:00 1 (Dependency) |
| 89 fast validate osotolon PD 0:00 1 (Dependency) |
| 94 fast sam_seq- osotolon PD 0:00 1 (Dependency) |
| 95 fast verify_s osotolon PD 0:00 1 (Dependency) |
| 96 fast validate osotolon PD 0:00 1 (Dependency) |
| 97 fast wgs_end osotolon PD 0:00 1 (Dependency) |
| 76 fast sam_seq- osotolon R 0:08 1 brick01 |
| 77 fast sam_seq- osotolon R 0:08 1 brick01 |
| 78 fast sam_seq- osotolon R 0:08 1 brick01 |
| 79 fast sam_seq- osotolon R 0:07 1 brick01 |
| 83 fast sam_seq- osotolon R 0:05 1 brick01 |
| 84 fast sam_seq- osotolon R 0:05 1 brick01 |
| 85 fast sam_seq- osotolon R 0:05 1 brick01 |
| 86 fast sam_seq- osotolon R 0:05 1 brick01 |
| 90 fast sam_seq- osotolon R 0:05 1 brick02 |
| 91 fast sam_seq- osotolon R 0:05 1 brick02 |
| 92 fast sam_seq- osotolon R 0:05 1 brick02 |
| 93 fast sam_seq- osotolon R 0:05 1 brick02 |
| </code> |
| |