Getting started with population genetics using R

Contents

  1. Why bother with R?
  2. Import genetic data
  3. Filtering
  4. Summary statistics
  5. FST, PCA & DAPC
  6. Extras

1. Why bother with R?

There are so many programs and software out there for analysing population genetic data and generating summary statistics. At first I was quite overwhelmed and unsure which path to take. Then I started learning and using R and I’ve not looked back since. Aside from the fact that R was the first programming language I learnt, there are several reasons why I like to use R for popgen analysis:

  • A wealth of online help resources and tutorials
  • Analyses easily replicated on new data sets
  • Many options for creating publication quality figures and visualisations
  • Code can be uploaded to online repositories for other people to reproduce your analysis
  • Cross platform compatibility
  • Free!

Nowadays, popgen R has dozens of packages that often allow you to do similar things but different packages can have their own formatting requirements and R objects. I recommend choosing one type of R object to conduct all your analyses with because converting between R objects can be difficult and frustrating (I work with genind objects from the adegenet package).

In this post, I cover some ‘bread-and-butter’ analyses for typical popgen data sets and highlight some of the R packages and functions I use to analyse such data using example data sets from published studies.

Assumptions
This post assumes that you have installed R and RStudio and that you have some skills in R coding and functionality. To follow along, I recommend that you download the example data sets to a directory of your choice, create a new R script in the same directory and then set your working directory to the location of these files. To set your working directory in RStudio, for example, click Session > Set Working Directory > To Source File Location.

Download example data sets
1. European lobster SNP genotypes in data.frame format
2. Pink sea fan microsatellite genotypes in genepop format

References
Jenkins TL, Ellis CD, Triantafyllidis A, Stevens JR (2019). Single nucleotide polymorphisms reveal a genetic cline across the north‐east Atlantic and enable powerful population assignment in the European lobster. Evolutionary Applications 12, 1881–1899.
Holland LP, Jenkins TL, Stevens JR (2017). Contrasting patterns of population structure and gene flow facilitate exploration of connectivity in two widely distributed temperate octocorals. Heredity 119, 35–48.

Back to Contents

2. Import genetic data

Install and load R packages

library(adegenet)
library(poppr)
library(dplyr)
library(hierfstat)
library(reshape2)
library(ggplot2)
library(RColorBrewer)
library(scales)

Import lobster SNP genotypes

Import csv file containing SNP (single nucleotide polymorphism) genotypes.

lobster = read.csv("Lobster_SNP_Genotypes.csv")
str(lobster)
## 'data.frame':    125280 obs. of  4 variables:
##  $ Site    : chr  "Ale" "Ale" "Ale" "Ale" ...
##  $ ID      : chr  "Ale04" "Ale04" "Ale04" "Ale04" ...
##  $ Locus   : int  3441 4173 6157 7502 7892 8953 9441 11071 11183 11291 ...
##  $ Genotype: chr  "GG" NA NA NA ...

Convert data.frame from long to wide format. The wide format contains one row for each individual and one column for each locus as well as a column for the ID and site labels.

lobster_wide = reshape(lobster, idvar = c("ID","Site"), timevar = "Locus", direction = "wide", sep = "")
## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : multiple rows match for Locus=3441: first taken

# Remove "Genotype" from column names
colnames(lobster_wide) = gsub("Genotype", "", colnames(lobster_wide))

Subset genotypes and only keep SNP loci used in Jenkins et al. 2019.

# Subset genotypes
snpgeno = lobster_wide[ , 3:ncol(lobster_wide)]

# Keep only SNP loci used in Jenkins et al. 2019
snps_to_remove = c("25580","32362","41521","53889","65376","8953","21197","15531","22740","28357","33066","51507","53052","53263","21880","22323","22365")
snpgeno = snpgeno[ , !colnames(snpgeno) %in% snps_to_remove]

Create vectors of individual and site labels.

ind = as.character(lobster_wide$ID) # individual ID
site = as.character(lobster_wide$Site) # site ID

Convert data.frame to genind object. Check that the genotypes for the first five individuals and loci are as expected.

lobster_gen = df2genind(snpgeno, ploidy = 2, ind.names = ind, pop = site, sep = "")
lobster_gen$tab[1:5, 1:10]
##       3441.G 3441.A 4173.C 4173.T 6157.G 6157.C 7502.T 7502.C 7892.T 7892.A
## Ale04      2      0     NA     NA     NA     NA     NA     NA      2      0
## Ale05      1      1      2      0      1      1      2      0      1      1
## Ale06     NA     NA      2      0      2      0     NA     NA      2      0
## Ale08     NA     NA      0      2      2      0      2      0     NA     NA
## Ale13      2      0     NA     NA      2      0     NA     NA      2      0

Print basic info of the genind object.

lobster_gen
## /// GENIND OBJECT /////////
## 
##  // 1,305 individuals; 79 loci; 158 alleles; size: 945.1 Kb
## 
##  // Basic content
##    @tab:  1305 x 158 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-2)
##    @loc.fac: locus factor for the 158 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: df2genind(X = snpgeno, sep = "", ind.names = ind, pop = site, 
##     ploidy = 2)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 7-41)

popNames(lobster_gen)
##  [1] "Ale"   "Ber"   "Brd"   "Cor"   "Cro"   "Eye"   "Flo"   "Gul"   "Heb"  
## [10] "Hel"   "Hoo"   "Idr16" "Idr17" "Iom"   "Ios"   "Jer"   "Kav"   "Kil"  
## [19] "Laz"   "Loo"   "Lyn"   "Lys"   "Mul"   "Oos"   "Ork"   "Pad"   "Pem"  
## [28] "Sar13" "Sar17" "Sbs"   "She"   "Sin"   "Sky"   "Sul"   "Tar"   "The"  
## [37] "Tor"   "Tro"   "Ven"   "Vig"

Import pink sea fan microsatellite genotypes

Import genepop file and convert to genind object. Check that the genotypes at locus Ever002 for three randomly selected individuals are as expected.

seafan_gen = import2genind("Pinkseafan_13MicrosatLoci.gen", ncode = 3, quiet = TRUE)
set.seed(1)
tab(seafan_gen[loc = "Ever002"])[runif(3, 1, nInd(seafan_gen)), ]
##       Ever002.114 Ever002.117 Ever002.109 Ever002.105 Ever002.121
## Far10           2           0           0           0           0
## Han36           2           0           0           0           0
## Moh5            1           1           0           0           0

Print basic info of the genind object.

seafan_gen
## /// GENIND OBJECT /////////
## 
##  // 877 individuals; 13 loci; 114 alleles; size: 478.2 Kb
## 
##  // Basic content
##    @tab:  877 x 114 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-18)
##    @loc.fac: locus factor for the 114 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: read.genepop(file = file, ncode = 3, quiet = quiet)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 22-48)

popNames(seafan_gen)
##  [1] "ArmI_27"   "ArmII_43"  "ArmIII_41" "Bla29"     "Bov40"     "Bre43"    
##  [7] "Far44"     "Fla23"     "Han36"     "Lao40"     "Lio22"     "Lun22"    
## [13] "Men43"     "Mew44"     "Moh30"     "PorI_42"   "PorII_35"  "Rag43"    
## [19] "RosI_40"   "RosII_36"  "Sko39"     "Thu48"     "Vol24"     "Wtn43"

Update the site labels so that the site code rather than the last individual label in the sample is used.

# Use gsub to extract only letters from a vector
popNames(seafan_gen) = gsub("[^a-zA-Z]", "", popNames(seafan_gen))
popNames(seafan_gen)
##  [1] "ArmI"   "ArmII"  "ArmIII" "Bla"    "Bov"    "Bre"    "Far"    "Fla"   
##  [9] "Han"    "Lao"    "Lio"    "Lun"    "Men"    "Mew"    "Moh"    "PorI"  
## [17] "PorII"  "Rag"    "RosI"   "RosII"  "Sko"    "Thu"    "Vol"    "Wtn"

Back to Contents

3. Filtering

Missing data: loci

Calculate the percentage of complete genotypes per loci in the lobster SNP data set.

locmiss_lobster = propTyped(lobster_gen, by = "loc")
locmiss_lobster[which(locmiss_lobster < 0.80)] # print loci with < 80% complete genotypes
## named numeric(0)

# Barplot
barplot(locmiss_lobster, ylim = c(0,1), ylab = "Complete genotypes (proportion)", xlab = "Locus", las = 2, cex.names = 0.7)

Calculate the percentage of complete genotypes per loci in the pink sea fan microsatellite data set.

locmiss_seafan = propTyped(seafan_gen, by = "loc")
locmiss_seafan[which(locmiss_seafan < 0.80)] # print loci with < 80% complete genotypes
##   Ever012 
## 0.6294185

# Barplot
barplot(locmiss_seafan, ylim = c(0,1), ylab = "Complete genotypes (proportion)", xlab = "Locus", las = 2, cex.names = 0.7)

Remove microsatellite loci with > 20% missing data.

seafan_gen = missingno(seafan_gen, type = "loci", cutoff = 0.20)
## 
## Found 6873 missing values.
## 
## 1 locus contained missing values greater than 20%
## 
## Removing 1 locus: , Ever012

Missing data: individuals

Calculate the percentage of complete genotypes per individual in the lobster SNP data set.

indmiss_lobster = propTyped(lobster_gen, by = "ind")
indmiss_lobster[ which(indmiss_lobster < 0.80) ] # print individuals with < 80% complete genotypes
##     Ale04     Ale06     Ale08     Ale13     Ale15     Ale16     Ale19     Sin65 
## 0.4936709 0.5063291 0.5443038 0.5696203 0.4556962 0.5316456 0.4430380 0.7848101 
##     The24 
## 0.5696203

Remove individuals with > 20% missing genotypes.

lobster_gen = missingno(lobster_gen, type = "geno", cutoff = 0.20)
## 
## Found 2590 missing values.
## 
## 9 genotypes contained missing values greater than 20%
## 
## Removing 9 genotypes: Ale04, Ale06, Ale08, Ale13, Ale15, Ale16, Ale19,
## Sin65, The24

Calculate the percentage of complete genotypes per individual in the pink sea fan microsatellite data set.

indmiss_seafan= propTyped(seafan_gen, by = "ind")
indmiss_seafan[ which(indmiss_seafan < 0.80) ] # print individuals with < 80% complete genotypes
##  ArmIII_9 ArmIII_31      Bov1     Bov39     Lao11     Lao13     Lao16     Lao34 
##      0.75      0.75      0.75      0.75      0.75      0.75      0.75      0.75 
##     Lun19      Moh8     Moh29   PorI_25   PorI_33   PorI_40   PorII_7     Rag14 
##      0.75      0.75      0.75      0.75      0.75      0.75      0.75      0.75 
##   RosI_33  RosII_36     Vol21 
##      0.75      0.75      0.75

Remove individuals with > 20% missing genotypes.

seafan_gen = missingno(seafan_gen, type = "geno", cutoff = 0.20)
## 
## Found 5248 missing values.
## 
## 19 genotypes contained missing values greater than 20%
## 
## Removing 19 genotypes: ArmIII_9, ArmIII_31, Bov1, Bov39, Lao11, Lao13,
## Lao16, Lao34, Lun19, Moh8, Moh29, PorI_25, PorI_33, PorI_40, PorII_7,
## Rag14, RosI_33, RosII_36, Vol21

Check genotypes are unique

Check all individual genotypes are unique. Duplicated genotypes can result from unintentionally sampling the same individual twice or from sampling clones.

# Print the number of multilocus genotypes
mlg(lobster_gen)
## #############################
## # Number of Individuals:  1296 
## # Number of MLG:  1271 
## #############################
## [1] 1271

mlg(seafan_gen)
## #############################
## # Number of Individuals:  858 
## # Number of MLG:  857 
## #############################
## [1] 857

Identify duplicated genotypes.

dups_lobster = mlg.id(lobster_gen)
for (i in dups_lobster){ # for each element in the list object
  if (length(dups_lobster[i]) > 1){ # if the length is greater than 1
    print(i) # print individuals that are duplicates
  }
}
## [1] "Laz4" "Tar4"
## [1] "Eye15" "Eye16" "Eye35"
## [1] "Eye01" "Eye17"
## [1] "Laz2" "Tar2"
## [1] "Eye08" "Eye41"
## [1] "Gul101" "Gul86" 
## [1] "Eye25" "Eye29"
## [1] "Iom02" "Iom22"
## [1] "Hel07" "Hel09"
## [1] "Eye27" "Eye42"
## [1] "Eye05" "Eye06" "Eye23" "Eye40"
## [1] "Eye22" "Eye38"
## [1] "Eye11" "Eye32"
## [1] "Cro08" "Cro15"
## [1] "Laz1" "Tar1"
## [1] "Eye14" "Eye31"
## [1] "Laz3" "Tar3"
## [1] "Lyn04" "Lyn15" "Lyn34"
## [1] "Eye07" "Eye24"
## [1] "Eye02" "Eye04"
## [1] "Eye20" "Eye36"

dups_seafan = mlg.id(seafan_gen)
for (i in dups_seafan){ # for each element in the list object
  if (length(dups_seafan[i]) > 1){ # if the length is greater than 1
    print(i) # print individuals that are duplicates
  }
}
## [1] "ArmI_15" "ArmII_2"

Remove duplicated genotypes.

# Create a vector of individuals to remove
lob_dups = c("Laz4","Eye15","Eye16","Eye01","Laz2","Eye08","Gul101","Eye25","Iom02","Hel07","Eye27","Eye05","Eye06","Eye23","Eye22","Eye11","Cro08","Tar1","Eye14","Tar3","Lyn04","Lyn15","Eye07","Eye02","Eye20")
psf_dups = c("ArmI_15")
# Create a vector of individual names without the duplicates
lob_Nodups = indNames(lobster_gen)[! indNames(lobster_gen) %in% lob_dups]
psf_Nodups = indNames(seafan_gen)[! indNames(seafan_gen) %in% psf_dups]
# Create a new genind object without the duplicates
lobster_gen = lobster_gen[lob_Nodups, ]
seafan_gen = seafan_gen[psf_Nodups, ]
# Re-print the number of multilocus genotypes
mlg(lobster_gen)
## #############################
## # Number of Individuals:  1271 
## # Number of MLG:  1271 
## #############################
## [1] 1271

mlg(seafan_gen)
## #############################
## # Number of Individuals:  857 
## # Number of MLG:  857 
## #############################
## [1] 857

Check loci are still polymorphic after filtering

isPoly(lobster_gen) %>% summary
##    Mode    TRUE 
## logical      79

isPoly(seafan_gen) %>% summary
##    Mode   FALSE    TRUE 
## logical       1      11

Remove loci that are not polymorphic.

poly_loci = names(which(isPoly(seafan_gen) == TRUE))
seafan_gen = seafan_gen[loc = poly_loci]
isPoly(seafan_gen) %>% summary
##    Mode    TRUE 
## logical      11

Back to Contents

4. Summary statistics

Calculate heterozygosity per site

# Calculate basic stats using hierfstat
basic_lobster = basic.stats(lobster_gen, diploid = TRUE)
basic_seafan = basic.stats(seafan_gen, diploid = TRUE)
# Mean observed heterozygosity per site
Ho_lobster = apply(basic_lobster$Ho, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
  round(digits = 2)
Ho_lobster
##   Ale   Ber   Brd   Cor   Cro   Eye   Flo   Gul   Heb   Hel   Hoo Idr16 Idr17 
##  0.32  0.36  0.37  0.38  0.37  0.37  0.35  0.38  0.39  0.35  0.39  0.39  0.39 
##   Iom   Ios   Jer   Kav   Kil   Laz   Loo   Lyn   Lys   Mul   Oos   Ork   Pad 
##  0.39  0.39  0.38  0.37  0.38  0.38  0.39  0.40  0.34  0.37  0.32  0.36  0.37 
##   Pem Sar13 Sar17   Sbs   She   Sin   Sky   Sul   Tar   The   Tor   Tro   Ven 
##  0.38  0.32  0.34  0.38  0.37  0.35  0.33  0.36  0.42  0.33  0.33  0.33  0.39 
##   Vig 
##  0.39

Ho_seafan = apply(basic_seafan$Ho, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
  round(digits = 2)
Ho_seafan
##   ArmI  ArmII ArmIII    Bla    Bov    Bre    Far    Fla    Han    Lao    Lio 
##   0.41   0.45   0.44   0.44   0.45   0.46   0.43   0.47   0.50   0.45   0.50 
##    Lun    Men    Mew    Moh   PorI  PorII    Rag   RosI  RosII    Sko    Thu 
##   0.49   0.47   0.51   0.41   0.44   0.45   0.51   0.49   0.49   0.53   0.40 
##    Vol    Wtn 
##   0.47   0.50
# Mean expected heterozygosity per site
He_lobster = apply(basic_lobster$Hs, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
  round(digits = 2)
He_lobster
##   Ale   Ber   Brd   Cor   Cro   Eye   Flo   Gul   Heb   Hel   Hoo Idr16 Idr17 
##  0.34  0.36  0.37  0.39  0.37  0.37  0.35  0.36  0.38  0.35  0.39  0.39  0.39 
##   Iom   Ios   Jer   Kav   Kil   Laz   Loo   Lyn   Lys   Mul   Oos   Ork   Pad 
##  0.39  0.39  0.38  0.36  0.38  0.34  0.37  0.39  0.35  0.38  0.33  0.37  0.37 
##   Pem Sar13 Sar17   Sbs   She   Sin   Sky   Sul   Tar   The   Tor   Tro   Ven 
##  0.38  0.32  0.35  0.37  0.37  0.35  0.33  0.37  0.36  0.34  0.33  0.36  0.38 
##   Vig 
##  0.39

He_seafan = apply(basic_seafan$Hs, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
  round(digits = 2)
He_seafan
##   ArmI  ArmII ArmIII    Bla    Bov    Bre    Far    Fla    Han    Lao    Lio 
##   0.48   0.47   0.48   0.44   0.50   0.49   0.48   0.47   0.54   0.50   0.52 
##    Lun    Men    Mew    Moh   PorI  PorII    Rag   RosI  RosII    Sko    Thu 
##   0.52   0.50   0.53   0.49   0.51   0.50   0.51   0.50   0.53   0.53   0.43 
##    Vol    Wtn 
##   0.49   0.54

Visualise heterozygosity per site

# Create a data.frame of site names, Ho and He and then convert to long format
Het_lobster_df = data.frame(Site = names(Ho_lobster), Ho = Ho_lobster, He = He_lobster) %>%
  melt(id.vars = "Site")
Het_seafan_df = data.frame(Site = names(Ho_seafan), Ho = Ho_seafan, He = He_seafan) %>%
  melt(id.vars = "Site")
# Custom theme for ggplot2
custom_theme = theme(
  axis.text.x = element_text(size = 10, angle = 90, vjust = 0.5, face = "bold"),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 12),
  axis.title.x = element_blank(),
  axis.line.y = element_line(size = 0.5),
  legend.title = element_blank(),
  legend.text = element_text(size = 12),
  panel.grid = element_blank(),
  panel.background = element_blank(),
  plot.title = element_text(hjust = 0.5, size = 15, face="bold")
  )

# Italic label
hetlab.o = expression(italic("H")[o])
hetlab.e = expression(italic("H")[e])
# Lobster heterozygosity barplot
ggplot(data = Het_lobster_df, aes(x = Site, y = value, fill = variable))+
  geom_bar(stat = "identity", position = position_dodge(width = 0.6), colour = "black")+
  scale_y_continuous(expand = c(0,0), limits = c(0,0.50))+
  scale_fill_manual(values = c("royalblue", "#bdbdbd"), labels = c(hetlab.o, hetlab.e))+
  ylab("Heterozygosity")+
  ggtitle("European lobster")+
  custom_theme

# Pink sea fan heterozygosity barplot
ggplot(data = Het_seafan_df, aes(x = Site, y = value, fill = variable))+
  geom_bar(stat = "identity", position = "dodge", colour = "black")+
  scale_y_continuous(expand = c(0,0), limits = c(0,0.60), breaks = c(0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60))+
  scale_fill_manual(values = c("pink", "#bdbdbd"), labels = c(hetlab.o, hetlab.e))+
  ylab("Heterozygosity")+
  ggtitle("Pink sea fan")+
  custom_theme

Inbreeding coefficient (FIS)

Calculate mean FIS per site.

# European lobster
apply(basic_lobster$Fis, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
  round(digits = 3)
##    Ale    Ber    Brd    Cor    Cro    Eye    Flo    Gul    Heb    Hel    Hoo 
##  0.057  0.003  0.003  0.021 -0.006 -0.004  0.005 -0.044 -0.034  0.013 -0.016 
##  Idr16  Idr17    Iom    Ios    Jer    Kav    Kil    Laz    Loo    Lyn    Lys 
## -0.007 -0.001 -0.024 -0.007  0.004 -0.024 -0.016 -0.115 -0.043 -0.032  0.018 
##    Mul    Oos    Ork    Pad    Pem  Sar13  Sar17    Sbs    She    Sin    Sky 
##  0.040  0.023  0.017 -0.010 -0.004 -0.009  0.018 -0.017 -0.006 -0.013  0.006 
##    Sul    Tar    The    Tor    Tro    Ven    Vig 
##  0.033 -0.153  0.029  0.010  0.066 -0.024  0.013

# Pink sea fan
apply(basic_seafan$Fis, MARGIN = 2, FUN = mean, na.rm = TRUE) %>%
  round(digits = 3)
##   ArmI  ArmII ArmIII    Bla    Bov    Bre    Far    Fla    Han    Lao    Lio 
##  0.166  0.085  0.076 -0.006  0.075  0.039  0.116  0.014  0.042  0.064  0.029 
##    Lun    Men    Mew    Moh   PorI  PorII    Rag   RosI  RosII    Sko    Thu 
##  0.057  0.067  0.030  0.153  0.137  0.089  0.010  0.048  0.077  0.013  0.056 
##    Vol    Wtn 
##  0.057  0.058

Back to Contents

5. FST, PCA & DAPC

FST

Compute pairwise FST (Weir & Cockerham 1984).

# Subset data sets to reduce computation time
lobster_gen_sub = popsub(lobster_gen, sublist = c("Ale","Ber","Brd","Pad","Sar17","Vig"))
seafan_gen_sub = popsub(seafan_gen, sublist = c("Bla","Bov","Bre","Lun","PorI","Sko"))

# Compute pairwise Fsts
lobster_fst = genet.dist(lobster_gen_sub, method = "WC84") %>% round(digits = 3)
lobster_fst
##         Ale   Ber   Brd   Pad Sar17
## Ber   0.120                        
## Brd   0.131 0.007                  
## Pad   0.140 0.025 0.008            
## Sar17 0.066 0.174 0.171 0.161      
## Vig   0.117 0.064 0.038 0.018 0.112

seafan_fst = genet.dist(seafan_gen_sub, method = "WC84") %>% round(digits = 3)
seafan_fst 
##         Bla    Bov    Bre    Lun   PorI
## Bov   0.099                            
## Bre   0.105  0.005                     
## Lun   0.095 -0.002  0.012              
## PorI  0.114  0.052  0.045  0.045       
## Sko   0.094 -0.002  0.006 -0.001  0.041

Visualise pairwise FST for lobster.

# Desired order of labels
lab_order = c("Ber","Brd","Pad","Vig","Sar17","Ale")

# Change order of rows and cols
fst.mat = as.matrix(lobster_fst)
fst.mat1 = fst.mat[lab_order, ]
fst.mat2 = fst.mat1[, lab_order]

# Create a data.frame
ind = which(upper.tri(fst.mat2), arr.ind = TRUE)
fst.df = data.frame(Site1 = dimnames(fst.mat2)[[2]][ind[,2]],
                    Site2 = dimnames(fst.mat2)[[1]][ind[,1]],
                    Fst = fst.mat2[ ind ])

# Keep the order of the levels in the data.frame for plotting 
fst.df$Site1 = factor(fst.df$Site1, levels = unique(fst.df$Site1))
fst.df$Site2 = factor(fst.df$Site2, levels = unique(fst.df$Site2))

# Convert minus values to zero
fst.df$Fst[fst.df$Fst < 0] = 0

# Print data.frame summary
fst.df %>% str
## 'data.frame':    15 obs. of  3 variables:
##  $ Site1: Factor w/ 5 levels "Brd","Pad","Vig",..: 1 2 2 3 3 3 4 4 4 4 ...
##  $ Site2: Factor w/ 5 levels "Ber","Brd","Pad",..: 1 1 2 1 2 3 1 2 3 4 ...
##  $ Fst  : num  0.007 0.025 0.008 0.064 0.038 0.018 0.174 0.171 0.161 0.112 ...

# Fst italic label
fst.label = expression(italic("F")[ST])

# Extract middle Fst value for gradient argument
mid = max(fst.df$Fst) / 2

# Plot heatmap
ggplot(data = fst.df, aes(x = Site1, y = Site2, fill = Fst))+
  geom_tile(colour = "black")+
  geom_text(aes(label = Fst), color="black", size = 3)+
  scale_fill_gradient2(low = "blue", mid = "pink", high = "red", midpoint = mid, name = fst.label, limits = c(0, max(fst.df$Fst)), breaks = c(0, 0.05, 0.10, 0.15))+
  scale_x_discrete(expand = c(0,0))+
  scale_y_discrete(expand = c(0,0), position = "right")+
  theme(axis.text = element_text(colour = "black", size = 10, face = "bold"),
        axis.title = element_blank(),
        panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.position = "right",
        legend.title = element_text(size = 14, face = "bold"),
        legend.text = element_text(size = 10)
        )

Visualise pairwise FST for pink sea fan.

# Desired order of labels
lab_order = c("Sko","Lun","Bov","Bla","Bre","PorI")

# Change order of rows and cols
fst.mat = as.matrix(seafan_fst)
fst.mat1 = fst.mat[lab_order, ]
fst.mat2 = fst.mat1[, lab_order]

# Create a data.frame
ind = which(upper.tri(fst.mat2), arr.ind = TRUE)
fst.df = data.frame(Site1 = dimnames(fst.mat2)[[2]][ind[,2]],
                    Site2 = dimnames(fst.mat2)[[1]][ind[,1]],
                    Fst = fst.mat2[ ind ])

# Keep the order of the levels in the data.frame for plotting 
fst.df$Site1 = factor(fst.df$Site1, levels = unique(fst.df$Site1))
fst.df$Site2 = factor(fst.df$Site2, levels = unique(fst.df$Site2))

# Convert minus values to zero
fst.df$Fst[fst.df$Fst < 0] = 0

# Print data.frame summary
fst.df %>% str
## 'data.frame':    15 obs. of  3 variables:
##  $ Site1: Factor w/ 5 levels "Lun","Bov","Bla",..: 1 2 2 3 3 3 4 4 4 4 ...
##  $ Site2: Factor w/ 5 levels "Sko","Lun","Bov",..: 1 1 2 1 2 3 1 2 3 4 ...
##  $ Fst  : num  0 0 0 0.094 0.095 0.099 0.006 0.012 0.005 0.105 ...

# Fst italic label
fst.label = expression(italic("F")[ST])

# Extract middle Fst value for gradient argument
mid = max(fst.df$Fst) / 2

# Plot heatmap
ggplot(data = fst.df, aes(x = Site1, y = Site2, fill = Fst))+
  geom_tile(colour = "black")+
  geom_text(aes(label = Fst), color="black", size = 3)+
  scale_fill_gradient2(low = "blue", mid = "pink", high = "red", midpoint = mid, name = fst.label, limits = c(0, max(fst.df$Fst)), breaks = c(0, 0.05, 0.10))+
  scale_x_discrete(expand = c(0,0))+
  scale_y_discrete(expand = c(0,0), position = "right")+
  theme(axis.text = element_text(colour = "black", size = 10, face = "bold"),
        axis.title = element_blank(),
        panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.position = "right",
        legend.title = element_text(size = 14, face = "bold"),
        legend.text = element_text(size = 10)
        )

PCA

Perform a PCA (principle components analysis) on the lobster data set.

# Replace missing data with the mean allele frequencies
x = tab(lobster_gen_sub, NA.method = "mean")

# Perform PCA
pca1 = dudi.pca(x, scannf = FALSE, scale = FALSE, nf = 3)

# Analyse how much percent of genetic variance is explained by each axis
percent = pca1$eig/sum(pca1$eig)*100
barplot(percent, ylab = "Genetic variance explained by eigenvectors (%)", ylim = c(0,12),
        names.arg = round(percent, 1))

Visualise PCA results.

# Create a data.frame containing individual coordinates
ind_coords = as.data.frame(pca1$li)

# Rename columns of dataframe
colnames(ind_coords) = c("Axis1","Axis2","Axis3")

# Add a column containing individuals
ind_coords$Ind = indNames(lobster_gen_sub)

# Add a column with the site IDs
ind_coords$Site = lobster_gen_sub$pop

# Calculate centroid (average) position for each population
centroid = aggregate(cbind(Axis1, Axis2, Axis3) ~ Site, data = ind_coords, FUN = mean)

# Add centroid coordinates to ind_coords dataframe
ind_coords = left_join(ind_coords, centroid, by = "Site", suffix = c("",".cen"))

# Define colour palette
cols = brewer.pal(nPop(lobster_gen_sub), "Set1")

# Custom x and y labels
xlab = paste("Axis 1 (", format(round(percent[1], 1), nsmall=1)," %)", sep="")
ylab = paste("Axis 2 (", format(round(percent[2], 1), nsmall=1)," %)", sep="")

# Custom theme for ggplot2
ggtheme = theme(axis.text.y = element_text(colour="black", size=12),
                axis.text.x = element_text(colour="black", size=12),
                axis.title = element_text(colour="black", size=12),
                panel.border = element_rect(colour="black", fill=NA, size=1),
                panel.background = element_blank(),
                plot.title = element_text(hjust=0.5, size=15) 
)

# Scatter plot axis 1 vs. 2
ggplot(data = ind_coords, aes(x = Axis1, y = Axis2))+
  geom_hline(yintercept = 0)+
  geom_vline(xintercept = 0)+
  # spider segments
  geom_segment(aes(xend = Axis1.cen, yend = Axis2.cen, colour = Site), show.legend = FALSE)+
  # points
  geom_point(aes(fill = Site), shape = 21, size = 3, show.legend = FALSE)+
  # centroids
  geom_label(data = centroid, aes(label = Site, fill = Site), size = 4, show.legend = FALSE)+
  # colouring
  scale_fill_manual(values = cols)+
  scale_colour_manual(values = cols)+
  # custom labels
  labs(x = xlab, y = ylab)+
  ggtitle("Lobster PCA")+
  # custom theme
  ggtheme

# Export plot
# ggsave("Figure1.png", width = 12, height = 8, dpi = 600)

DAPC

Perform a DAPC (discriminant analysis of principal components) on the seafan data set.

# Perform cross validation to find the optimal number of PCs to retain in DAPC
set.seed(123)
x = tab(seafan_gen_sub, NA.method = "mean")
crossval = xvalDapc(x, seafan_gen_sub$pop, result = "groupMean", xval.plot = TRUE)

# Number of PCs with best stats (lower score = better)
crossval$`Root Mean Squared Error by Number of PCs of PCA`
##        10        20        30        40        50        60        70 
## 0.6252777 0.6326131 0.6380681 0.6057849 0.6587395 0.6412447 0.6113320
crossval$`Number of PCs Achieving Highest Mean Success`
## [1] "40"
crossval$`Number of PCs Achieving Lowest MSE`
## [1] "40"
numPCs = as.numeric(crossval$`Number of PCs Achieving Lowest MSE`)
# Run a DAPC using site IDs as priors
dapc1 = dapc(seafan_gen_sub, seafan_gen_sub$pop, n.pca = numPCs, n.da = 3)

# Analyse how much percent of genetic variance is explained by each axis
percent = dapc1$eig/sum(dapc1$eig)*100
barplot(percent, ylab = "Genetic variance explained by eigenvectors (%)", ylim = c(0,60),
        names.arg = round(percent, 1))

Visualise DAPC results.

# Create a data.frame containing individual coordinates
ind_coords = as.data.frame(dapc1$ind.coord)

# Rename columns of dataframe
colnames(ind_coords) = c("Axis1","Axis2","Axis3")

# Add a column containing individuals
ind_coords$Ind = indNames(seafan_gen_sub)

# Add a column with the site IDs
ind_coords$Site = seafan_gen_sub$pop

# Calculate centroid (average) position for each population
centroid = aggregate(cbind(Axis1, Axis2, Axis3) ~ Site, data = ind_coords, FUN = mean)

# Add centroid coordinates to ind_coords dataframe
ind_coords = left_join(ind_coords, centroid, by = "Site", suffix = c("",".cen"))

# Define colour palette
cols = brewer.pal(nPop(seafan_gen_sub), "Set2")

# Custom x and y labels
xlab = paste("Axis 1 (", format(round(percent[1], 1), nsmall=1)," %)", sep="")
ylab = paste("Axis 2 (", format(round(percent[2], 1), nsmall=1)," %)", sep="")

# Scatter plot axis 1 vs. 2
ggplot(data = ind_coords, aes(x = Axis1, y = Axis2))+
  geom_hline(yintercept = 0)+
  geom_vline(xintercept = 0)+
  # spider segments
  geom_segment(aes(xend = Axis1.cen, yend = Axis2.cen, colour = Site), show.legend = FALSE)+
  # points
  geom_point(aes(fill = Site), shape = 21, size = 3, show.legend = FALSE)+
  # centroids
  geom_label(data = centroid, aes(label = Site, fill = Site), size = 4, show.legend = FALSE)+
  # colouring
  scale_fill_manual(values = cols)+
  scale_colour_manual(values = cols)+
  # custom labels
  labs(x = xlab, y = ylab)+
  ggtitle("Pink sea fan DAPC")+
  # custom theme
  ggtheme

# Export plot
# ggsave("Figure2.png", width = 12, height = 8, dpi = 600)

Back to Contents

6. Extras

Import VCF file

To illustrate importing VCF files and conducting OutFLANK outlier selection tests in R, we will use an American lobster SNP data set available from the Dryad Digital Repository.

# Load vcfR package
library(vcfR)

# Import only 3,000 variants to reduce computation time
american = read.vcfR("10156-586.recode.vcf", nrows = 3000, verbose = FALSE)
american
## ***** Object of Class vcfR *****
## 586 samples
## 1 CHROMs
## 3,000 variants
## Object size: 16.6 Mb
## 0 percent missing data
## *****        *****         *****

# Convert to genind object
american = vcfR2genind(american)

# Add site IDs to genind object
american$pop = as.factor(substr(indNames(american), 1, 3))

Back to Contents

Conduct outlier tests using OutFLANK

Conduct FST differentiation-based outlier tests on genind object using OutFLANK using a wrapper script from the dartR package.

# Load packages
library(OutFLANK)
library(qvalue)
library(dartR)
# Run OutFLANK using dartR wrapper script
outflnk = gl.outflank(american, qthreshold = 0.05, plot = FALSE)
## Calculating FSTs, may take a few minutes...

# Extract OutFLANK results
outflnk.df = outflnk$outflank$results

# Remove duplicated rows for each SNP locus
rowsToRemove = seq(1, nrow(outflnk.df), by = 2)
outflnk.df = outflnk.df[-rowsToRemove, ]

# Print number of outliers (TRUE)
outflnk.df$OutlierFlag %>% summary
##    Mode   FALSE    TRUE 
## logical    2968      32
# Extract outlier IDs
outlier_indexes = which(outflnk.df$OutlierFlag == TRUE)
outlierID = locNames(american)[outlier_indexes]
outlierID
##  [1] "un-11566"  "un-69080"  "un-111790" "un-111865" "un-125908" "un-172034"
##  [7] "un-186923" "un-201848" "un-205435" "un-243757" "un-253632" "un-257077"
## [13] "un-288280" "un-288327" "un-342055" "un-395275" "un-395276" "un-424882"
## [19] "un-433799" "un-493905" "un-525474" "un-531991" "un-541424" "un-561940"
## [25] "un-581802" "un-631261" "un-631875" "un-649002" "un-676856" "un-679035"
## [31] "un-691876" "un-734068"
# Convert Fsts <0 to zero
outflnk.df$FST[outflnk.df$FST < 0] = 0 

# Italic labels
fstlab = expression(italic("F")[ST])
hetlab = expression(italic("H")[e])

# Plot He versus Fst
ggplot(data = outflnk.df)+
  geom_point(aes(x = He, y = FST, colour = OutlierFlag))+
  scale_colour_manual(values = c("black","red"), labels = c("Neutral SNP","Outlier SNP"))+
  ggtitle("OutFLANK outlier test")+
  xlab(hetlab)+
  ylab(fstlab)+
  theme(legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 15, face = "bold")
        )

Back to Contents

Export biallelic SNP genind object

Load required packages.

library(devtools)
library(miscTools)
library(stringr)

Export genind object in genepop format.

source_url("https://raw.githubusercontent.com/Tom-Jenkins/utility_scripts/master/TJ_genind2genepop_function.R")
genind2genepop(lobster_gen_sub, file = "lobster_genotypes.gen")

Export genind object in STRUCTURE format. If you want to run STRUCTURE in Linux then use unix = TRUE which exports a Unix text file (Windows text file by default).

source_url("https://raw.githubusercontent.com/Tom-Jenkins/utility_scripts/master/TJ_genind2structure_function.R")
genind2structure(lobster_gen_sub, file = "lobster_genotypes.str", pops = TRUE, markers = TRUE, unix = FALSE)

Back to Contents

Download a PDF of this post

Download PDF

R session info

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] stringr_1.4.0      miscTools_0.6-26   devtools_2.4.3     usethis_2.1.5     
##  [5] dartR_2.0.3        OutFLANK_0.2       qvalue_2.26.0      vcfR_1.12.0       
##  [9] scales_1.1.1       RColorBrewer_1.1-2 ggplot2_3.3.5      reshape2_1.4.4    
## [13] hierfstat_0.5-10   dplyr_1.0.8        poppr_2.9.3        adegenet_2.1.5    
## [17] ade4_1.7-18       
## 
## loaded via a namespace (and not attached):
##   [1] spam_2.8-0          StAMPP_1.6.3        plyr_1.8.6         
##   [4] igraph_1.3.0        sp_1.4-6            splines_4.1.2      
##   [7] digest_0.6.29       foreach_1.5.2       htmltools_0.5.2    
##  [10] viridis_0.6.2       gdata_2.18.0        fansi_1.0.2        
##  [13] magrittr_2.0.2      memoise_2.0.1       cluster_2.1.2      
##  [16] doParallel_1.0.17   PopGenReport_3.0.4  remotes_2.4.2      
##  [19] R.utils_2.11.0      prettyunits_1.1.1   colorspace_2.0-2   
##  [22] mmod_1.3.3          xfun_0.29           rgdal_1.5-29       
##  [25] callr_3.7.0         crayon_1.5.0        jsonlite_1.7.3     
##  [28] iterators_1.0.14    ape_5.6-1           glue_1.6.1         
##  [31] gtable_0.3.0        seqinr_4.2-8        polysat_1.7-6      
##  [34] pkgbuild_1.3.1      maps_3.4.0          mvtnorm_1.1-3      
##  [37] DBI_1.1.2           GGally_2.1.2        Rcpp_1.0.8         
##  [40] viridisLite_0.4.0   xtable_1.8-4        dotCall64_1.0-1    
##  [43] dismo_1.3-5         genetics_1.3.8.1.3  calibrate_1.7.7    
##  [46] ellipsis_0.3.2      pkgconfig_2.0.3     reshape_0.8.8      
##  [49] R.methodsS3_1.8.1   farver_2.1.0        sass_0.4.0         
##  [52] utf8_1.2.2          tidyselect_1.1.1    labeling_0.4.2     
##  [55] rlang_1.0.1         later_1.3.0         munsell_0.5.0      
##  [58] tools_4.1.2         cachem_1.0.6        cli_3.1.1          
##  [61] generics_0.1.2      evaluate_0.14       fastmap_1.1.0      
##  [64] yaml_2.2.2          processx_3.5.2      knitr_1.37         
##  [67] fs_1.5.2            gdsfmt_1.30.0       purrr_0.3.4        
##  [70] RgoogleMaps_1.4.5.3 nlme_3.1-153        mime_0.12          
##  [73] R.oo_1.24.0         brio_1.1.3          gap_1.2.3-1        
##  [76] compiler_4.1.2      rstudioapi_0.13     png_0.1-7          
##  [79] testthat_3.1.2      tibble_3.1.6        bslib_0.3.1        
##  [82] stringi_1.7.6       ps_1.6.0            gdistance_1.3-6    
##  [85] highr_0.9           blogdown_1.9        desc_1.4.0         
##  [88] fields_13.3         memuse_4.2-1        lattice_0.20-45    
##  [91] Matrix_1.3-4        vegan_2.5-7         permute_0.9-7      
##  [94] vctrs_0.3.8         pillar_1.7.0        lifecycle_1.0.1    
##  [97] combinat_0.0-8      jquerylib_0.1.4     data.table_1.14.2  
## [100] SNPRelate_1.28.0    raster_3.5-15       httpuv_1.6.5       
## [103] patchwork_1.1.1     R6_2.5.1            bookdown_0.25      
## [106] promises_1.2.0.1    KernSmooth_2.23-20  gridExtra_2.3      
## [109] sessioninfo_1.2.2   codetools_0.2-18    pkgload_1.2.4      
## [112] boot_1.3-28         MASS_7.3-54         gtools_3.9.2       
## [115] assertthat_0.2.1    rprojroot_2.0.2     withr_2.4.3        
## [118] pinfsc50_1.2.0      pegas_1.1           mgcv_1.8-38        
## [121] parallel_4.1.2      terra_1.5-21        grid_4.1.2         
## [124] tidyr_1.2.0         rmarkdown_2.13      shiny_1.7.1

Back to Contents

Tom Jenkins
Tom Jenkins
Bioinformatician & Software Developer