--- title: "Use clusterhap" author: "Sebastian Simondi,Victoria Bonnecarrere, Lucia Gutierrez, Gaston Quero" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Use clusterhap} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") library(clusterhap) ``` ## What is clusterhap? _clusterhap_ function identifies haplotypes within QTL (Quantitative Trait Loci). One haplotype is a combination of SNP (Single Nucleotide Polymorphisms) within the QTL. This function groups together all individuals of a population with the same haplotype. Each group contains individual with the same allele in each SNP, whether or not missing data. Thus, _clusterhap_ groups individuals, that to be imputed, have a non-zero probability of having the same alleles in the entire sequence of SNP's. _clusterhap_ does not impute missing data automatically as if they do packages such as this. ( _alleHap_,_hapassoc_,_hsphase_) The function return five reports: a. **haplotypes** (all haplotypes identified), b. **haplotypes_frequencies** (the frequency of each haplotype), c. **duplicates** (individuals assigned to more than one haplotype), d. **haplotypes_result** (a matrix with individuals assigned to each haplotype, and sorted haplotypes), the function plot this matrix, e. **underterminates** (not assigned individuals) _Note_: the user must decide which haplotype assign to the individuals which were assigned to more than one haplotype. Eventually, the user must manually remove from this matrix the duplicate haplotype. A decision criterion might be looking at the frequency of haplotypes and selects the most frequent haplotype. ## Data Input Format _clusterhap_ uses a simple numeric matrix of individual (row) and SNP (columns). Given a QTL, the user should transform all SNP alleles of each individual in a number, in this way: the bases A, C, G and T are changed to 1, 2, 3 and 4, respectively. A zero (0) is assigned to all SNP with missing data (NA). Therefore, the matrix coordinates are 0, 1, 2, 3 or 4. ## Theoretical Description The _clusterhap_ function first builds the set Y, composed of all the complete sequences of SNP's the QTL, whose elements call haplotypes. With this objective, the function first counts all zeros within the vector. When there are not zero, the individual have not missing data in all SNP and it is assigned as an element in set Y. Once built the set Y, _clusterhap_ function assigns to each haplotype the population`s vector which contains the same numbers in all coordinates nonzero. i.e. _clusterhap_ associates to each QTL’s SNP complete sequence, all individuals which in each locus with data, has the same allele. For example, clusterhap assigns the individual 1 defined by [0,2,0,1,1,4,2] to haplotype 1 defined by [1,2,3,1,1,4,2] since it match in all nonzero coordinates. On the other hand, clusterhap does not associate individual 2 defined by [0,2,0,1,1,4,3] to this haplotype since the last coordinate differs. Originally, haplotype 1 had a C in the last SNP and individual 2 a G. Therefore, having been imputed individual 2 never coincides with haplotype 1. An individual may be associated with more than one haplotype or none, in which case they will be labeled as indeterminate. Thus, most individuals are univocally determined and associated with a single haplotype. ## _clusterhap_ examples: ### Simple simulated data The `sim_qtl` data.frame included with the package has simulated results for 7 SNPs on 5 individuals. ind | SNP.1 | SNP.2 | SNP.3 | SNP.4 | SNP.5 | SNP.6 | SNP.7 ------|-------|-------|-------|-------|-------|-------|------ ind.1 | A | C | G | A | A | T | C ind.2 | NA | C | NA | A | A | T | C ind.3 | C | C | T | A | A | T | C ind.4 | NA | C | NA | A | A | T | G ind.5 | C | NA | NA | A | A | T | C It is transformed to: ind | SNP.1 | SNP.2 | SNP.3 | SNP.4 | SNP.5 | SNP.6 | SNP.7 ------|-------|-------|-------|-------|-------|-------|------ ind.1 | 1 | 2 | 3 | 1 | 1 | 4 | 2 ind.2 | 0 | 2 | 0 | 1 | 1 | 4 | 2 ind.3 | 2 | 2 | 4 | 1 | 1 | 4 | 2 ind.4 | 0 | 2 | 0 | 1 | 1 | 4 | 3 ind.5 | 2 | 0 | 0 | 1 | 1 | 4 | 2 Only individuals 1 and 3 have complete SNP sequence for this QTL. Therefore: ``` Y = {H.1 = [1,2,3,1,1,4,2], H.2 = [2,2,4,1,1,4,2]} ``` Individuals 1 and 2 is assigned to H.1 . The first has the complete sequence and the second in all nonzero data has the same alleles. None other individual is assigned to this haplotype since all remaining individuals differs in the identity of at least one SNP. Following the reasoning, individuals 2, 3 and 5 are assigned to H.2 . Notice that individual 2 was assigned to both haplotypes given that nonzero SNP coincide with both H.1 and H.2. On the other hand, individual 4 was not assigned to any haplotype due to SNP.7 does not match with no element of set Y. The _clusterhap_ function classifies individual 4 as indeterminate. ```{r, fig.width=8, fig.height=4, fig.show='hold'} library(clusterhap) data("sim_qtl") clusterhap(sim_qtl, Print=TRUE) ``` ### Real experimental data The `rice_qtl` data.frame included with the package The Uruguayan Rice Breeding GWAS (URiB) population is composed of 637 genotypes from the INIA’s Rice Breeding Program (IRBP). In this example we use the information in one of those populations. The population has 324 indica lines, and 2 indica cultivars (El Paso 144 and INIA Olimar). The population was genotyped by SNPs obteined by Genotyping-by sequencing (GBS).The data is a QTL for Grain Quality. ```{r, fig.width=8, fig.height=4, fig.show='hold'} data("rice_qtl") clusterhap(rice_qtl) ``` ## Workflow **1.** _clusterhap_ function calculates the amount of null components (_cq_) in each individual of the database. ``` H_data <- x # x is the data frame cq <- NULL Q <- H_data[, -1] for (i in 1:nrow(Q)) { for (j in 1:nrow(Y)) { cq <- rowSums(Q[i, ] == 0) ``` **2.** On the other hand, clusterhap function calculates vector coordinate to coordinate subtracts, between the individual and each haplotype or element of set _Y_, counting the number of zeros (_cr_). ``` y <- which(c.Q == 0) b <- Q[y, ] Y <- b[!duplicated(b), ] for (i in 1:nrow(Q)) { for (j in 1:nrow(Y)) { cq <- rowSums(Q[i, ] == 0) w <- Q[i, ] - Y[j, ] cr <- rowSums(w == 0) ``` **3.** If the sum of _cq+cr_, is equal to the amount of SNP within the QTL, then associate the individual to the haplotype. Note that the subtraction vector has zeros in coordinates where both vectors (individual and haplotype) contain the same number. When two SNP differs, the resulting vector presents no null coordinates; particularly when the individual contains zeros, due to the haplotype is a complete SNP sequence with no missing data. The only way that the sum _cq+cr_ is equal to the amount of QTL and SNP is when individual and haplotype coordinates differ only in null coordinate of individual. i.e. individual not null coordinate coincide with haplotype coordinates. Hence, of being imputed missing SNP of the individual, this SNP has non-zero probability to match the haplotypes that was assigned by the function. Those individuals assigned to non haplotype are classified as indeterminate. ``` for (i in 1:nrow(Q)) { for (j in 1:nrow(Y)) { cq <- rowSums(Q[i, ] == 0) w <- Q[i, ] - Y[j, ] cr <- rowSums(w == 0) zeros <- cq + cr if (zeros == ncol(Q)) { hp <- cbind(hp, i) hp.1 <- cbind(hp.1, j) hpl <- cbind(Q[i, ], j) hpl1 <- cbind(id[i], hpl) hplq <- rbind(hplq, hpl1) } ``` ## References Burkett, K. _et al_. (2006) hapassoc: Software for Like lihood Inference of Trait Associations with SNP Haplotypes and Other Attributes. _J Stat Soft_., **16**, 1-19. Ferdosi, M. H. _et al_ .(2014) hsphase: an R package for pedigree reconstruction, detection of recombination events, phasing and imputation of half-sib family groups. BMC _Bioinformatics_ **15**, 172. Medina-Rodriguez, Nathan and Santana, A. (2015) alleHap: Allele Imputation and Haplotype Reconstruction from Pedigree Databases, R,package version 0.9.2.