We will analyze Arabidopsis recombinant inbred lines available at the Bay-0 x Shahdara project. Download phenotypic and genotypic data. The target trait in this homework assignment is Flowering time in Long Days measured as the number of days between germination and bolting (the beginning of the inflorescence growth). You can learn more about the Bay-0 x Shahdara population in Loudet et al., (2002).

```
library(readxl)
# phenotypes
bayxshaY <- read_excel("BayxSha_PublishedPheno.xls", col_names=FALSE, sheet=1, skip=4)
y0 <- bayxshaY[,6]
y0 <- y0[-c((length(bayxshaY[,6])-1):(length(bayxshaY[,6])))]
y1 <- y0[-which(is.na(y0))]
y1 <- scale(y1)
# genotypes
bayxshaG <- read_excel("BayxSha_2_Genotypes.xls", col_names=FALSE, skip=5)
bayxshaG.id <- bayxshaG[,1]
bayxshaG <- bayxshaG[,-1]
bayxshaG[bayxshaG == "A"] <- 2
bayxshaG[bayxshaG == "C"] <- 1
bayxshaG[bayxshaG == "B"] <- 0
bayxshaG[bayxshaG == "D"] <- NA
bayxshaG <- as.matrix(bayxshaG)
mode(bayxshaG) <- "numeric"
bayxshaG <- bayxshaG[-which(is.na(y0)), ]
```

What are the dimensions of `y1`

and `bayxshaG`

?

Replace missing marker genotypes with mean values. Then store marker genotypes in a matrix object `W1`

.

Apply a singular value decomposition (SVD) to `W1`

and fit a principal component regression using OLS. Show that the regression of phenotypes on marker covariates is equivalent to the regression of phenotypes on principal components (eigenvectors) computed from marker genotypes. Hint: One way to answer this question is to show that predicted genetic values \(\hat{g}\) are equal.

Fit a standard ridge regression and a SVD-based ridge regression. Use \(\lambda = 5\). Verify that predicted genetic values are equal.

Compute marker effects using SVD-based ridge regression and SVD-based OLS. Recall that the effect of ridge is to replace \(D^{-1}\) in the OLS expression with \((D^2 + \lambda I)^{-1}D\). Show that ridge estimates are shrunk toward zero because these amounts are indeed smaller in the ridge regression.

Download barley data from the Triticeae Toolbox (T3). This web portal contains barley data generated by the Barley CAP project. YouTube tutorials are available if this is your first time to access T3. (Youtube1 and Youtube2).

- Go to Select pull down menu and follow Wizard (Lines, Traits, Trials) -> NSGC Barely (NB) -> 2012 -> NSGC_2012_NormN-Irr_Aberdeen -> grain yield -> and then click Save current selection.
- Go to Download pull down menu and select Genotype and Phenotype Data
- Minimum MAF >= 5%
- Remove markers missing > 3% of data
- Remove lines missing > 5% of data

- Click create file (rrBLUP).
- The
`traits.txt`

and the`snpfile.txt`

files include grain yield phenotypes and marker genotypes, respectively.

```
pheno <- read.table(file="download_HVXF/traits.txt", header=TRUE, stringsAsFactors = FALSE, sep='\t')
geno <- read.table(file="download_HVXF/snpfile.txt", header=TRUE, stringsAsFactors = FALSE, check.names = FALSE)
pheno2 <- pheno[match(rownames(geno), pheno[,1]),]
table(pheno2[,1] %in% rownames(geno))
```

```
##
## TRUE
## 399
```

`table(pheno2[,1] == rownames(geno))`

```
##
## TRUE
## 399
```

```
y2 <- pheno2[,3]
y2 <- scale(y2)
```

What are the dimensions of `y2`

and `geno`

?

Replace missing marker genotypes with mean values. Then store marker genotypes in a matrix `W2`

.

Recall that ridge estimates can be expressed as \(\hat{\beta}^{ridge}_j = \frac{\sum^n_{i=1} x^2_{ij} }{\sum^n_{i=1} x^2_{ij} + \lambda} \times \hat{\beta}^{ols}_{j}\). Create a centered genotype matrix `W2c`

and calculate the amounts of shrinkage for all markers. Show that the greatest shrinkage is achieved for the marker with the most extreme minor allele frequency. Use \(\lambda = 1000\).

Show that we cannot fit OLS using the `solve()`

function when \(n < m\) but the ridge regression can circumvent this problem. Create a scatter plot of observed `y2`

vs. predicted `y2`

. Use \(\lambda = 1000\).