Impute2, Guía básica
formato de plink a impute
Antes hay que convertir a ped las DBs de plink. luego se usa gtool
$ gtool -P --ped Converted/ADMur.ped --map Converted/ADMur.map --og toImpute/ADMur.gen --os toImpute/ADMur.sample gtool -P --ped Converted/ADMur.ped --map Converted/ADMur.map --og toImpute/ADMur.gen --os toImpute/ADMur.sample MAP... Number of SNPs: 224403 Count Samples... Number of samples: 1130 memory...760726170 PED... Output...
$ wc -l toImpute/* 224403 toImpute/ADMur.gen 1132 toImpute/ADMur.sample 225535 total
Tarea1: imputar el genoma completo de una base de datos
Enla web de impute2 recomiendan utlizar pedazos de 5Mb y despues pegarlos haciendo un cat. No es necesario hacer overlaping, porque tal y como dicen:
IMPUTE2 uses an internal buffer region of 250 kb on either side of the analysis interval to prevent edge effects; this means that data outside the region bounded by -int will contribute to the inference, but only SNPs inside that region will appear in the output. In this way, you can specify non-overlapping, adjacent intervals and obtain uniformly high-quality imputation. (Note: to change the size of the internal buffer region, use the -buffer option.)
Entonces se toman los 22 chromosomas, se pican en pedazos de 5MB y se lanzan por separado. Con un script facilito:
- pimpute.pl
#!/usr/bin/perl # Copyright 2013 O. Sotolongo <osotolongo@fundacioace.com> # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # This script use impute2 to impute the whole genome for a given database use strict; use warnings; use Parallel::ForkManager; use File::Basename qw(basename); my @CHRSIZE = (250, 245, 200, 195, 185, 175, 160, 150, 145, 140, 140, 135, 120, 110, 105, 95, 85, 80, 60, 65, 50, 55); use constant CHROMOSOMES => 22; my $chunkz = 5; my $max_processes = 30; my $libdir = '/home/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/'; my $impute = '/opt/impute/impute2'; my $odir = 'impute_results'; my $debug = 1; my $db = shift; die "Must supply database\n" unless $db; my $pm = new Parallel::ForkManager($max_processes); mkdir $odir; my $logfile = "/tmp/.parallel_impute_by_db_output.log"; my $ordfile = "/tmp/.parallel_impute_by_db_orders.log"; my $failfile = ".parallel_impute_failures.log"; open ORDOUT, ">$ordfile" or die "Can't open file, $!"; open STDOUT, ">$logfile" or die "Can't redirect stdout"; open STDERR, ">&STDOUT" or die "Can't dup stdout"; open FAILS, ">$failfile" or die "Can't open file, $!"; my $bndb = basename($db); for(my $chr = 1; $chr <= CHROMOSOMES; $chr++){ my $map = $libdir.'genetic_map_chr'.$chr.'_combined_b37.txt'; my $nchunks = int($CHRSIZE[$chr-1]/$chunkz) +1; for(my $i = 0; $i<$nchunks; $i++){ my $ofile = $odir.'/'.$bndb.'_chr'.$chr.'_'.$i.'.impute'; my $s1 = ($i*$chunkz*1000000) +1; my $s2 = ($i+1)*$chunkz*1000000; my $order = $impute.' -g '.$db.' -m '.$map.' -int '.$s1.' '.$s2.' -o '.$ofile; $pm->start() and next; print ORDOUT "$order\n"; system($order); $pm->finish; } } $pm->wait_all_children; my $fcat = 'cat '; my $ifcat = 'tar xzvf '.$bndb.'.impute_info '; my $remove = 'rm -rf '; for(my $chr = 1; $chr <= CHROMOSOMES; $chr++){ my $nchunks = int($CHRSIZE[$chr-1]/$chunkz); for(my $i = 0; $i<$nchunks; $i++){ my $ofile = $odir.'/'.$bndb.'_chr'.$chr.'_'.$i.'.impute'; my $s1 = ($i*$chunkz*1000000) +1; my $s2 = ($i+1)*$chunkz*1000000; if( -e $ofile){ $fcat .= "$ofile "; $ifcat .= $ofile.'_info '; $remove .= $ofile.' '.$ofile.'_info'.$ofile.'_info_by_sample '.$ofile.'_warnings '.$ofile.'_summary '; }else{ print FAILS "$ofile not found!\nChr: $chr, $s1 - $s2\n"; } } } $fcat .= '> '.$odir.'/'.$bndb.'.impute'; print ORDOUT "$fcat\n"; system($fcat); system($ifcat); system($remove) unless $debug; close ORDOUT; close STDERR; close STDOUT; close FAILS; print "Have a nice day! ;-)\n";
Despues de esto, que demora un día fácilmente!, hay que reconvertir la DB a formato de plink. Para esto se usa de nuevo gtool