---
title: "Use clusterhap"
author: "Sebastian Simondi, Victoria Bonnecarrere, Lucia Gutierrez, Gaston Quero"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Use clusterhap}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(clusterhap)
```


## What is clusterhap?

The _clusterhap_ function identifies haplotypes within a QTL
(Quantitative Trait Locus). A haplotype is a combination of SNP
(Single Nucleotide Polymorphism) alleles within the QTL. The function
groups together all individuals of a population that share the same
haplotype. Each group contains the individuals that carry the same
allele at every SNP for which they have data, whether or not they
also have missing data. Thus, _clusterhap_ groups individuals that,
if imputed, have a non-zero probability of carrying the same alleles
across the entire sequence of SNPs.
Unlike packages such as _alleHap_, _hapassoc_ and _hsphase_, _clusterhap_ does not impute missing data automatically.

The function returns five reports:

a. **haplotypes** (all haplotypes identified),
b. **haplotypes_frequencies** (the frequency of each haplotype),
c. **duplicates** (individuals assigned to more than one haplotype),
d. **haplotypes_result** (a matrix of the individuals assigned to each haplotype, with the haplotypes sorted; the function also plots this matrix),
e. **underterminates** (individuals not assigned to any haplotype).

_Note_: the user must decide which haplotype to assign to each
individual that was assigned to more than one haplotype, and must
then manually remove the duplicate assignments from this matrix.
One possible decision criterion is to look at the haplotype
frequencies and select the most frequent haplotype.
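
As a rough illustration of that criterion, the sketch below resolves duplicates on a small hand-made table. The `assignments` data frame, its column names and the choice to count frequencies only among uniquely assigned individuals are assumptions made for this example; they do not describe the structure of the reports returned by _clusterhap_.

```r
# Hypothetical long-format assignments: one row per (individual, haplotype) pair.
assignments <- data.frame(
  id        = c("ind.1", "ind.2", "ind.2", "ind.3", "ind.5"),
  haplotype = c("H.1",   "H.1",   "H.2",   "H.2",   "H.2"),
  stringsAsFactors = FALSE
)

# Individuals that appear only once are uniquely assigned.
n_per_id  <- table(assignments$id)
is_unique <- assignments$id %in% names(n_per_id)[n_per_id == 1]

# Frequency of each haplotype among the uniquely assigned individuals.
haps <- unique(assignments$haplotype)
freq <- vapply(
  haps,
  function(h) sum(assignments$haplotype[is_unique] == h),
  numeric(1)
)

# For each individual keep the candidate haplotype with the highest frequency.
# Ties are broken in favour of the first candidate listed.
resolved <- do.call(rbind, lapply(split(assignments, assignments$id), function(d) {
  best <- which.max(unname(freq[d$haplotype]))
  d[best, , drop = FALSE]
}))
rownames(resolved) <- NULL
resolved
```

Here `ind.2` keeps H.2, which two uniquely assigned individuals carry, rather than H.1, which only one carries. Other criteria may suit a given data set better, so the result should be reviewed rather than applied blindly.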

## Data Input Format
_clusterhap_ uses a simple numeric matrix of individuals (rows) and
SNPs (columns). Given a QTL, the user should recode the SNP alleles
of each individual as numbers: the bases A, C, G and T become 1, 2,
3 and 4, respectively. A zero (0) is assigned to every SNP with
missing data (NA). Therefore, the matrix entries are 0, 1, 2, 3
or 4.
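
The recoding can be done in a few lines of base R. The sketch below is only one possible way to do it: the `alleles` data frame and the object names are invented for the example, and it assumes, as the tables later in this vignette suggest, that the first column of the final data frame holds the individual identifiers.

```r
# Hypothetical raw genotypes: one row per individual, one column per SNP.
alleles <- data.frame(
  SNP.1 = c("A", NA,  "C"),
  SNP.2 = c("C", "C", NA),
  SNP.3 = c("G", NA,  "T"),
  stringsAsFactors = FALSE
)

# Lookup table for the coding described above.
codes <- c(A = 1, C = 2, G = 3, T = 4)

# Look up every allele; NA (and any base outside A/C/G/T) yields NA.
m <- as.matrix(alleles)
coded <- matrix(codes[m], nrow = nrow(m),
                dimnames = list(NULL, colnames(m)))

# Missing data are coded as 0.
coded[is.na(coded)] <- 0

# Prepend an identifier column (assumed layout of the input).
qtl <- data.frame(ind = paste0("ind.", seq_len(nrow(coded))), coded)
qtl
```

Note that this sketch silently turns any unexpected symbol into 0, so it may be worth checking the raw data for values other than A, C, G, T and NA before recoding.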

## Theoretical Description
The _clusterhap_ function first builds the set Y, composed of all
the complete SNP sequences of the QTL, whose elements are called
haplotypes. To do so, the function counts the zeros in each
individual's vector. When there are no zeros, the individual has no
missing data at any SNP and its sequence is added as an element of
set Y.

Once set Y is built, _clusterhap_ assigns to each haplotype every
vector of the population that contains the same numbers in all of
its nonzero coordinates. That is, _clusterhap_ associates with each
complete SNP sequence of the QTL all individuals that have the same
allele at every locus with data. For example, _clusterhap_ assigns
individual 1, defined by [0,2,0,1,1,4,2], to haplotype 1, defined by
[1,2,3,1,1,4,2], since they match in all nonzero coordinates. On the
other hand, _clusterhap_ does not associate individual 2, defined by
[0,2,0,1,1,4,3], with this haplotype, since the last coordinate
differs: haplotype 1 has a C at the last SNP and individual 2 has
a G. Therefore, however individual 2 is imputed, it can never
coincide with haplotype 1. An individual may be associated with more
than one haplotype, or with none, in which case it is labeled as
indeterminate. Most individuals, however, are uniquely determined
and associated with a single haplotype.
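
The matching rule from this section can be written as a short predicate. This is a minimal sketch of the rule as described in the text, not the code used inside the package; `matches_haplotype` is a hypothetical helper name.

```r
# TRUE when `ind` agrees with `hap` at every SNP where `ind` has data
# (code != 0). Hypothetical helper, not exported by clusterhap.
matches_haplotype <- function(ind, hap) {
  observed <- ind != 0
  all(ind[observed] == hap[observed])
}

hap_1 <- c(1, 2, 3, 1, 1, 4, 2)   # haplotype 1 (complete sequence)
ind_1 <- c(0, 2, 0, 1, 1, 4, 2)   # individual 1 from the text
ind_2 <- c(0, 2, 0, 1, 1, 4, 3)   # individual 2 from the text

matches_haplotype(ind_1, hap_1)   # TRUE: all observed SNPs agree
matches_haplotype(ind_2, hap_1)   # FALSE: the last SNP differs (G vs C)
```

Under this rule an individual with no observed SNPs at all would match every haplotype, since there is nothing to contradict.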
  
## _clusterhap_ examples:

### Simple simulated data

The `sim_qtl` data.frame included with the package has simulated
results for 7 SNPs on 5 individuals.

  
  ind   | SNP.1 | SNP.2 | SNP.3 | SNP.4 | SNP.5 | SNP.6 | SNP.7 
  ------|-------|-------|-------|-------|-------|-------|------
  ind.1 |   A   |   C   |   G   |   A   |   A   |   T   |   C
  ind.2 |   NA  |   C   |   NA  |   A   |   A   |   T   |   C
  ind.3 |   C   |   C   |   T   |   A   |   A   |   T   |   C
  ind.4 |   NA  |   C   |   NA  |   A   |   A   |   T   |   G
  ind.5 |   C   |   NA  |   NA  |   A   |   A   |   T   |   C

It is transformed to:

  ind   | SNP.1 | SNP.2 | SNP.3 | SNP.4 | SNP.5 | SNP.6 | SNP.7 
  ------|-------|-------|-------|-------|-------|-------|------
  ind.1 |   1   |   2   |   3   |   1   |   1   |   4   |   2
  ind.2 |   0   |   2   |   0   |   1   |   1   |   4   |   2
  ind.3 |   2   |   2   |   4   |   1   |   1   |   4   |   2
  ind.4 |   0   |   2   |   0   |   1   |   1   |   4   |   3
  ind.5 |   2   |   0   |   0   |   1   |   1   |   4   |   2

Only individuals 1 and 3 have a complete SNP sequence for this QTL. Therefore:

```
    Y = {H.1 = [1,2,3,1,1,4,2], H.2 = [2,2,4,1,1,4,2]}
    
```

Individuals 1 and 2 are assigned to H.1. The first has the
complete sequence, and the second has the same alleles at all of
its nonzero SNPs. No other individual is assigned to this
haplotype, since each of the remaining individuals differs in
at least one SNP. Following the same reasoning, individuals 2,
3 and 5 are assigned to H.2. Notice that individual 2 is
assigned to both haplotypes, because its nonzero SNPs coincide
with both H.1 and H.2. On the other hand, individual 4 is not
assigned to any haplotype, because its SNP.7 does not match any
element of set Y. The _clusterhap_ function classifies
individual 4 as indeterminate.
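
The assignments described above can be checked by hand with base R. The sketch below retypes the coded table instead of loading `sim_qtl`, and the object names (`Q`, `Y`, `assigned`, `n_hap`) are chosen for this illustration only; it follows the reasoning of this section and is not the package's implementation.

```r
# Coded genotypes from the table above (0 = missing data).
Q <- rbind(
  ind.1 = c(1, 2, 3, 1, 1, 4, 2),
  ind.2 = c(0, 2, 0, 1, 1, 4, 2),
  ind.3 = c(2, 2, 4, 1, 1, 4, 2),
  ind.4 = c(0, 2, 0, 1, 1, 4, 3),
  ind.5 = c(2, 0, 0, 1, 1, 4, 2)
)

# Set Y: the distinct complete sequences (rows without zeros).
complete <- rowSums(Q == 0) == 0
Y <- unique(Q[complete, , drop = FALSE])
rownames(Y) <- paste0("H.", seq_len(nrow(Y)))

# assigned[i, j] is TRUE when individual i agrees with haplotype j
# at every SNP where the individual has data.
assigned <- sapply(rownames(Y), function(h) {
  apply(Q, 1, function(ind) all(ind[ind != 0] == Y[h, ind != 0]))
})
assigned

# Number of haplotypes matched by each individual.
n_hap <- rowSums(assigned)
names(n_hap)[n_hap > 1]    # assigned to more than one haplotype: "ind.2"
names(n_hap)[n_hap == 0]   # indeterminate: "ind.4"
```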

 
```{r, fig.width=8, fig.height=4, fig.show='hold'}
library(clusterhap)
data("sim_qtl")
clusterhap(sim_qtl, Print=TRUE)
```

### Real experimental data

The `rice_qtl` data.frame included with the package comes from
the Uruguayan Rice Breeding GWAS (URiB) population, which is
composed of 637 genotypes from INIA's Rice Breeding Program
(IRBP). In this example we use the information from one of its
populations, which has 324 indica lines and 2 indica cultivars
(El Paso 144 and INIA Olimar). The population was genotyped
with SNPs obtained by genotyping-by-sequencing (GBS). The data
correspond to a QTL for grain quality.

```{r, fig.width=8, fig.height=4, fig.show='hold'}
data("rice_qtl")
clusterhap(rice_qtl)
```


## Workflow  

**1.** The _clusterhap_ function calculates the number of null components (_cq_) of each individual in the data set.

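```
H_data <- x          # x is the input data frame
Q <- H_data[, -1]    # drop the first column, keeping only the SNP codes
cq <- NULL
for (i in 1:nrow(Q)) {
    # cq: number of zero (missing) coordinates of individual i.
    # In the function this line sits inside the loop over the
    # haplotypes of Y shown in steps 2 and 3.
    cq <- rowSums(Q[i, ] == 0)
}
```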

**2.** The _clusterhap_ function then subtracts each haplotype (each element of set _Y_) from the individual, coordinate by coordinate, and counts the number of zeros in the resulting vector (_cr_).

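```
c.Q <- rowSums(Q == 0)      # assumed here: number of zeros per individual
y <- which(c.Q == 0)        # individuals with no missing data
b <- Q[y, ]
Y <- b[!duplicated(b), ]    # set Y: the distinct complete sequences
for (i in 1:nrow(Q)) {
    for (j in 1:nrow(Y)) {
        cq <- rowSums(Q[i, ] == 0)
        w <- Q[i, ] - Y[j, ]     # individual minus haplotype, SNP by SNP
        cr <- rowSums(w == 0)    # cr: coordinates where both agree
    }
}
```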

**3.** If the sum _cq + cr_ equals the number of SNPs within the QTL, the individual is associated with the haplotype. Note that the difference vector has zeros at the coordinates where both vectors (individual and haplotype) contain the same number. Where the two vectors differ, the difference is nonzero; in particular, this is always the case where the individual contains a zero, because the haplotype is a complete SNP sequence with no missing data. The only way for _cq + cr_ to equal the number of SNPs in the QTL is therefore that the individual and the haplotype differ only at the null coordinates of the individual, i.e. every non-null coordinate of the individual coincides with the corresponding haplotype coordinate. Hence, if the missing SNPs of the individual were imputed, the individual would have a non-zero probability of matching the haplotypes to which the function assigned it. Individuals assigned to no haplotype are classified as indeterminate.

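```
id <- H_data[, 1]             # assumed here: identifiers from the first column
hp <- hp.1 <- hplq <- NULL    # assumed here: accumulators start empty
for (i in 1:nrow(Q)) {
    for (j in 1:nrow(Y)) {
        cq <- rowSums(Q[i, ] == 0)
        w <- Q[i, ] - Y[j, ]
        cr <- rowSums(w == 0)
        zeros <- cq + cr
        if (zeros == ncol(Q)) {
            # every observed SNP of individual i matches haplotype j
            hp <- cbind(hp, i)           # index of the assigned individual
            hp.1 <- cbind(hp.1, j)       # index of the matching haplotype
            hpl <- cbind(Q[i, ], j)      # genotype row plus haplotype index
            hpl1 <- cbind(id[i], hpl)    # prepend the identifier
            hplq <- rbind(hplq, hpl1)    # accumulate the assignments
        }
    }
}
```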
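
To make the arithmetic of step 3 concrete, the following sketch applies the _cq + cr_ test to two individuals of the simulated example against haplotype H.1. The helper `count_zeros` and the vector names are hypothetical and written for plain numeric vectors; the package itself works on rows of a data frame as shown above.

```r
hap_1 <- c(1, 2, 3, 1, 1, 4, 2)   # haplotype H.1 (complete sequence)
ind_2 <- c(0, 2, 0, 1, 1, 4, 2)   # individual 2 of the simulated data
ind_4 <- c(0, 2, 0, 1, 1, 4, 3)   # individual 4 of the simulated data

# Hypothetical helper: returns cq, cr and their sum for one pair.
count_zeros <- function(ind, hap) {
  cq <- sum(ind == 0)          # missing SNPs in the individual
  cr <- sum(ind - hap == 0)    # SNPs where individual and haplotype agree
  c(cq = cq, cr = cr, total = cq + cr)
}

count_zeros(ind_2, hap_1)   # cq = 2, cr = 5, total = 7: assigned to H.1
count_zeros(ind_4, hap_1)   # cq = 2, cr = 4, total = 6: not assigned to H.1
```

Because the haplotype contains no zeros, a coordinate can never be counted in both _cq_ and _cr_, so the total reaches the number of SNPs (7 here) only when every observed SNP agrees.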

## References

Burkett, K. _et al_. (2006) hapassoc:
  Software for Likelihood Inference of Trait Associations
  with SNP Haplotypes and Other Attributes. _J Stat Soft_,
  **16**, 1-19.

Ferdosi, M. H. _et al_. (2014) hsphase: an R package for
  pedigree reconstruction, detection of recombination events,
  phasing and imputation of half-sib family groups.
  _BMC Bioinformatics_, **15**, 172.

Medina-Rodriguez, N. and Santana, A. (2015) alleHap: Allele
  Imputation and Haplotype Reconstruction from Pedigree Databases,
  R package version 0.9.2.