User Tools

Site Tools


Impute2, Guía básica


Antes hay que convertir a ped las DBs de plink. luego se usa gtool

$ gtool -P --ped Converted/ADMur.ped --map Converted/ --og toImpute/ADMur.gen --os toImpute/ADMur.sample
gtool -P --ped Converted/ADMur.ped --map Converted/ --og toImpute/ADMur.gen --os toImpute/ADMur.sample
Number of SNPs: 224403
Count Samples...
Number of samples: 1130
$ wc -l toImpute/*
    224403 toImpute/ADMur.gen
      1132 toImpute/ADMur.sample
    225535 total

Tarea1: imputar el genoma completo de una base de datos

Enla web de impute2 recomiendan utlizar pedazos de 5Mb y despues pegarlos haciendo un cat. No es necesario hacer overlaping, porque tal y como dicen:

IMPUTE2 uses an internal buffer region of 250 kb on either side of the analysis interval to prevent edge effects; this means that data outside the region bounded by -int will contribute to the inference, but only SNPs inside that region will appear in the output. In this way, you can specify non-overlapping, adjacent intervals and obtain uniformly high-quality imputation. (Note: to change the size of the internal buffer region, use the -buffer option.)

Entonces se toman los 22 chromosomas, se pican en pedazos de 5MB y se lanzan por separado. Con un script facilito: LOL
# Copyright 2013 O. Sotolongo <>
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# GNU General Public License for more details.
# This script use impute2 to impute the whole genome for a given database
use strict; use warnings;
use Parallel::ForkManager;
use File::Basename qw(basename);
my @CHRSIZE = (250, 245, 200, 195, 185, 175, 160, 150, 145, 140, 140, 135, 120, 110, 105, 95, 85, 80, 60, 65, 50, 55);
use constant CHROMOSOMES => 22;
my $chunkz = 5;
my $max_processes = 30;
my $libdir = '/home/data/GNlib/1000Genome/ALL_1000G_phase1integrated_v3_impute/';
my $impute = '/opt/impute/impute2';
my $odir = 'impute_results';
my $debug = 1;
my $db = shift;
die "Must supply database\n" unless $db;
my $pm = new Parallel::ForkManager($max_processes);
mkdir $odir;
my $logfile = "/tmp/.parallel_impute_by_db_output.log";
my $ordfile = "/tmp/.parallel_impute_by_db_orders.log";
my $failfile = ".parallel_impute_failures.log";
open ORDOUT, ">$ordfile" or die "Can't open file, $!";
open STDOUT, ">$logfile" or die "Can't redirect stdout";
open STDERR, ">&STDOUT" or die "Can't dup stdout";
open FAILS, ">$failfile" or die "Can't open file, $!";
my $bndb = basename($db);
for(my $chr = 1; $chr <= CHROMOSOMES; $chr++){
	my $map = $libdir.'genetic_map_chr'.$chr.'_combined_b37.txt';
	my $nchunks = int($CHRSIZE[$chr-1]/$chunkz) +1;
	for(my $i = 0; $i<$nchunks; $i++){
		my $ofile = $odir.'/'.$bndb.'_chr'.$chr.'_'.$i.'.impute';
		my $s1 = ($i*$chunkz*1000000) +1;
		my $s2 = ($i+1)*$chunkz*1000000;
		my $order = $impute.' -g '.$db.' -m '.$map.' -int '.$s1.' '.$s2.' -o '.$ofile;
		$pm->start() and next;
		print ORDOUT "$order\n";
my $fcat = 'cat ';
my $ifcat = 'tar xzvf '.$bndb.'.impute_info ';
my $remove = 'rm -rf ';
for(my $chr = 1; $chr <= CHROMOSOMES; $chr++){
	my $nchunks = int($CHRSIZE[$chr-1]/$chunkz);
	for(my $i = 0; $i<$nchunks; $i++){
		my $ofile = $odir.'/'.$bndb.'_chr'.$chr.'_'.$i.'.impute';
		my $s1 = ($i*$chunkz*1000000) +1;
		my $s2 = ($i+1)*$chunkz*1000000;		
		if( -e $ofile){
			$fcat .= "$ofile ";
			$ifcat .= $ofile.'_info ';
			$remove .= $ofile.' '.$ofile.'_info'.$ofile.'_info_by_sample '.$ofile.'_warnings '.$ofile.'_summary ';
			print FAILS "$ofile not found!\nChr: $chr, $s1 - $s2\n";
$fcat .= '> '.$odir.'/'.$bndb.'.impute';
print ORDOUT "$fcat\n";
system($remove) unless $debug;
close ORDOUT;
close STDERR;
close STDOUT;
close FAILS;
print "Have a nice day! ;-)\n";

Despues de esto, que demora un día fácilmente!, hay que reconvertir la DB a formato de plink. Para esto se usa de nuevo gtool

genetica/plink2impute.txt · Last modified: 2020/08/04 10:58 by