PEB advanced R workshop

Welcome to the PEB advanced R workshop!

We will take a “messy” gene expression table, and use the tidyr library to restructure it in a format that is better suited for data analysis. We will also group the data and calculate summaries using the dplyr library, and finally plot it using ggplot2.

Requirements

Let’s install all the required packages:

pkgs = c("tidyr", "dplyr", "ggplot2")
#for (pkg in pkgs) install.packages(pkg)
for (pkg in pkgs) library(pkg, character.only=T)

The tidyr package requires at least R 3.1.2. If you can’t install it, you can use reshape2 instead, but you won’t be able to follow some of the exercises.

We also need to download the following data files:

download.file('https://raw.githubusercontent.com/dalloliogm/peb_workshop/master/peb_expression.csv', dest='peb_expression.csv')
download.file('https://raw.githubusercontent.com/dalloliogm/peb_workshop/master/cancer_genes.csv', dest='cancer_genes.csv')

the peb_expression.csv file

Let’s read the peb_expression.csv file that we just downloaded, and have a look at its contents.

peb = read.table("peb_expression.csv", header=T, sep='\t')
head(peb)
##        ID_REF IDENTIFIER GSM11815.YRI GSM11832.YRI GSM12069.YRI
## 1 200000_s_at      PRPF8       4254.0       5298.2       4026.5
## 2   200001_at     CAPNS1      17996.2      12010.7      10283.5
## 3   200002_at      RPL35      41678.8      39116.9      38758.9
## 4 200003_s_at      RPL28      65390.9      34806.2      31257.2
## 5   200004_at     EIF4G2      19030.1      15813.6      16355.7
## 6   200005_at      EIF3D       8824.5       9706.2      10590.0
##   GSM12083.YRI GSM12101.YRI GSM12106.EUR GSM12274.EUR GSM12299.EUR
## 1       3498.4       3566.4       4903.1       6372.6       4829.1
## 2       2534.7      11048.4      13354.0       8563.8      17247.6
## 3      32847.7      39633.9      43511.2      46856.7      47032.4
## 4      28308.5      67447.5      56989.9      57972.5      57570.5
## 5       9579.7      14273.5      17217.0      19116.9      17487.6
## 6       6986.7       9400.4      12835.2      10299.0      12375.2
##   GSM12412.EUR GSM11810.EUR GSM11827.EUR GSM12078.EAS GSM12099.EAS
## 1       5205.8       2756.8       3932.0       3729.9       3223.4
## 2      16018.5       6077.0      15703.8      10138.5      11614.4
## 3      22152.2      26660.7      26373.6      23809.6      24749.3
## 4      29062.2      35140.9      23629.3      22100.5      21651.0
## 5      14671.6      17733.1      18022.4      17957.4      15958.0
## 6       7645.4       8661.5       7355.7       6973.4       6855.9
##   GSM12269.EAS GSM12287.EAS GSM12301.EAS GSM12448.EAS
## 1       3640.5       4886.3       4070.2       3482.1
## 2       8460.5      10282.6      11844.3       9741.6
## 3      21936.8      31462.8      22733.7      25395.5
## 4      18550.7      23496.5      21315.4      28631.4
## 5      15799.8      16685.8      18817.3      17421.1
## 6       7949.2       9486.5       7494.5       7252.1
str(peb)
## 'data.frame':    22645 obs. of  19 variables:
##  $ ID_REF      : Factor w/ 22645 levels "200000_s_at",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ IDENTIFIER  : Factor w/ 17953 levels "7A5","A1BG","A2BP1",..: 14826 9605 15529 15524 10565 10558 14391 16230 11210 11210 ...
##  $ GSM11815.YRI: num  4254 17996 41679 65391 19030 ...
##  $ GSM11832.YRI: num  5298 12011 39117 34806 15814 ...
##  $ GSM12069.YRI: num  4026 10284 38759 31257 16356 ...
##  $ GSM12083.YRI: num  3498 2535 32848 28308 9580 ...
##  $ GSM12101.YRI: num  3566 11048 39634 67448 14274 ...
##  $ GSM12106.EUR: num  4903 13354 43511 56990 17217 ...
##  $ GSM12274.EUR: num  6373 8564 46857 57972 19117 ...
##  $ GSM12299.EUR: num  4829 17248 47032 57570 17488 ...
##  $ GSM12412.EUR: num  5206 16018 22152 29062 14672 ...
##  $ GSM11810.EUR: num  2757 6077 26661 35141 17733 ...
##  $ GSM11827.EUR: num  3932 15704 26374 23629 18022 ...
##  $ GSM12078.EAS: num  3730 10138 23810 22100 17957 ...
##  $ GSM12099.EAS: num  3223 11614 24749 21651 15958 ...
##  $ GSM12269.EAS: num  3640 8460 21937 18551 15800 ...
##  $ GSM12287.EAS: num  4886 10283 31463 23496 16686 ...
##  $ GSM12301.EAS: num  4070 11844 22734 21315 18817 ...
##  $ GSM12448.EAS: num  3482 9742 25396 28631 17421 ...

The first two columns contain the probe ID and the name of the gene. The other columns contain the expression levels of each gene in an individual. Notice how the name of the columns also encode the population of each individual, e.g. YRI, EUR, or EAS.

A piping system for R

The dplyr package introduced a piping system for R, using the %>% symbol.

This works similarly to the piping system in bash, but using the %>% symbol instead of the pipe |. We first write the name of the dataframe to use, then all the operation that must be executed on it.

For example, the following is equivalent to head(peb)

peb %>% head
##        ID_REF IDENTIFIER GSM11815.YRI GSM11832.YRI GSM12069.YRI
## 1 200000_s_at      PRPF8       4254.0       5298.2       4026.5
## 2   200001_at     CAPNS1      17996.2      12010.7      10283.5
## 3   200002_at      RPL35      41678.8      39116.9      38758.9
## 4 200003_s_at      RPL28      65390.9      34806.2      31257.2
## 5   200004_at     EIF4G2      19030.1      15813.6      16355.7
## 6   200005_at      EIF3D       8824.5       9706.2      10590.0
##   GSM12083.YRI GSM12101.YRI GSM12106.EUR GSM12274.EUR GSM12299.EUR
## 1       3498.4       3566.4       4903.1       6372.6       4829.1
## 2       2534.7      11048.4      13354.0       8563.8      17247.6
## 3      32847.7      39633.9      43511.2      46856.7      47032.4
## 4      28308.5      67447.5      56989.9      57972.5      57570.5
## 5       9579.7      14273.5      17217.0      19116.9      17487.6
## 6       6986.7       9400.4      12835.2      10299.0      12375.2
##   GSM12412.EUR GSM11810.EUR GSM11827.EUR GSM12078.EAS GSM12099.EAS
## 1       5205.8       2756.8       3932.0       3729.9       3223.4
## 2      16018.5       6077.0      15703.8      10138.5      11614.4
## 3      22152.2      26660.7      26373.6      23809.6      24749.3
## 4      29062.2      35140.9      23629.3      22100.5      21651.0
## 5      14671.6      17733.1      18022.4      17957.4      15958.0
## 6       7645.4       8661.5       7355.7       6973.4       6855.9
##   GSM12269.EAS GSM12287.EAS GSM12301.EAS GSM12448.EAS
## 1       3640.5       4886.3       4070.2       3482.1
## 2       8460.5      10282.6      11844.3       9741.6
## 3      21936.8      31462.8      22733.7      25395.5
## 4      18550.7      23496.5      21315.4      28631.4
## 5      15799.8      16685.8      18817.3      17421.1
## 6       7949.2       9486.5       7494.5       7252.1

We can concantenate any number of operations on the same dataset, just as we would do in bash. We will play with this piping system in a few minutes.

Converting to a long format

A dataset can be encoded in a “wide” format (more columns and less rows), or in a “long” format (minimum number of columns and more rows). While the wide format can be more readable for the human eye, the long format is better suited for data analysis.

Our peb data frame is in a “wide” format, as it contains many columns, one for every individual. However, all the functions used in the rest of the workshop require the dataset to be in a long format. Let’s convert it using the gather() function from tidyr:

peb.long = gather(peb, sample, expression, -c(ID_REF, IDENTIFIER))
head(peb.long)
##        ID_REF IDENTIFIER       sample expression
## 1 200000_s_at      PRPF8 GSM11815.YRI     4254.0
## 2   200001_at     CAPNS1 GSM11815.YRI    17996.2
## 3   200002_at      RPL35 GSM11815.YRI    41678.8
## 4 200003_s_at      RPL28 GSM11815.YRI    65390.9
## 5   200004_at     EIF4G2 GSM11815.YRI    19030.1
## 6   200005_at      EIF3D GSM11815.YRI     8824.5

Explanation:

Note how the new format encodes exactly the same data as before, but has only four columns.

Nowadays many recent R libraries and functions are designed for datasets in the long format. A common beginner mistake is trying to apply these functions to datasets in the wide format, while the trick is to restructure the data first. If you learn how to reorganize your dataset into a long format, then you can solve most data analysis problems in R using always the same approach.

If you couldn’t install the tidyr package, you can achieve the same format using the melt function from reshape2

install.packages("reshape2")
library("reshape2")
peb.long = melt(peb, id.vars=c("ID_REF", "IDENTIFIER"), variable.name="sample")
names(peb.long)[4] = "expression"
head(peb.long, 3)

Tidying-up the peb.long dataframe

In a properly structured table each variable must contain only one type of information. If we look at our peb.long dataframe, the sample column contains both the individual ID and its population. Let’s split this column into two, using separate from tidyr:

peb.long.tmp = gather(peb, sample, expression, -ID_REF, -IDENTIFIER)
peb.long     = separate(peb.long.tmp, sample, into=c("individual", "population"), sep='\\.')
head(peb.long)
##        ID_REF IDENTIFIER individual population expression
## 1 200000_s_at      PRPF8   GSM11815        YRI     4254.0
## 2   200001_at     CAPNS1   GSM11815        YRI    17996.2
## 3   200002_at      RPL35   GSM11815        YRI    41678.8
## 4 200003_s_at      RPL28   GSM11815        YRI    65390.9
## 5   200004_at     EIF4G2   GSM11815        YRI    19030.1
## 6   200005_at      EIF3D   GSM11815        YRI     8824.5

The above code can be simplified using the %>% operator:

peb.long = peb %>% 
  gather(sample, expression, -ID_REF, -IDENTIFIER) %>%
  separate(sample, into=c("individual", "population"), sep='\\.')

To prepare the dataset for our workshop, we need to apply a few more data filtering steps, like removing all the “Control” rows, eliminating all the duplicated genes (keeping only one probe per gene), and dropping the ID_REF column.

peb.long %>% nrow
## [1] 384965
peb.long = peb %>% 
  gather(sample, expression, -ID_REF, -IDENTIFIER) %>%
  separate(sample, into=c("individual", "population"), sep='\\.') %>%
  subset(!grepl("Control", IDENTIFIER)) %>%
  subset(!duplicated(paste(IDENTIFIER, individual))) %>%
  select(-ID_REF)

peb.long %>% nrow
## [1] 305184

Group operations

Apart from the %>% operator, the dplyr library introduces three useful functions: group_by, summarise and mutate.

peb.means = peb.long %>% 
  group_by(population) %>%
  summarise(
    mean=mean(expression),
    sd=sd(expression))

peb.means %>% print
## Source: local data frame [3 x 3]
## 
##   population     mean       sd
## 1        EAS 747.4846 2696.084
## 2        EUR 742.9541 2818.128
## 3        YRI 744.1489 2856.510
peb.long = peb.long %>% 
  group_by(individual) %>%
  mutate(
    mean.expression=mean(expression) ) %>%
  mutate(
    overexpressed = ifelse(expression > mean.expression, "overexpressed", "underexpressed")
  )

peb.long %>% head %>% print.data.frame(row.names=F)
##  IDENTIFIER individual population expression mean.expression overexpressed
##       PRPF8   GSM11815        YRI     4254.0        736.6556 overexpressed
##      CAPNS1   GSM11815        YRI    17996.2        736.6556 overexpressed
##       RPL35   GSM11815        YRI    41678.8        736.6556 overexpressed
##       RPL28   GSM11815        YRI    65390.9        736.6556 overexpressed
##      EIF4G2   GSM11815        YRI    19030.1        736.6556 overexpressed
##       EIF3D   GSM11815        YRI     8824.5        736.6556 overexpressed

Plotting the dataset

We will use the ggplot2 package to plot our peb.long dataframe.

A ggplot2 plot is composed by a base ggplot() object, defining the dataset and the basic variables used, and by additional elements defining how to represent them. Let’s see an example:

ggplot(peb.long, aes(x=individual, y=expression)) +
  geom_point() + 
  ggtitle("Expression by individual")

Explanation:

Let’ try a boxplot representation:

ggplot(peb.long, aes(x=individual, y=expression)) +
  geom_boxplot()

Let’s create a nicer plot, in which we:

- sort the individuals by population (x=reorder..)
- color by the population variable   (color=population)
- set limits for the Y axis          (scale_y_continuous)
- set a nice name for X axis         (scale_x_discrete)
- use a whiter plot theme            (theme_bw)
- rotate the X axis labels           (theme)
ggplot(peb.long, aes(x=reorder(individual, desc(population)), y=expression, color=population)) +
  geom_boxplot() + 
  scale_y_continuous("Expression", limits=c(0, 1000)) + 
  scale_x_discrete('Individuals sorted by Population') +
  theme_bw() +
  theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  ggtitle('Expression by individual')
## Warning: Removed 46203 rows containing non-finite values (stat_boxplot).

Classifying cancer genes

Remember that we also downloaded a file containing the classification of genes into oncogenes and tumor suppressors, from the NCG database:

cancer = read.table("cancer_genes.csv", header=T)
cancer %>% head
##            dataset
## 1        oncogenes
## 2 tumorsuppressors
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  genes
## 1 ABI1,CDH11,AKAP9,LHFP,CDK4,CDK6,OLIG2,GPHN,FSTL3,IKZF1,TFG,NDRG1,CDX2,NCOA2,CEBPA,SLC34A2,SEPT9,MALT1,MLLT11,CLP1,CNTRL,FGFR1OP,PSIP1,WIF1,CHN1,RMI2,LRIG3,CLTC,MSI2,CANT1,COL1A1,COX6C,CREB1,CREBBP,VTI1A,FLJ27352,CTNNB1,TTL,DDIT3,DDX5,DDX6,DDX10,ZNF384,EBF1,EGFR,EIF4A2,ELF4,ELK4,ELN,EPS15,ERBB2,AKT1,ERG,AKT2,ETV1,ETV4,ETV5,ETV6,MECOM,EWSR1,ALDH2,ACSL3,FCGR2B,JAZF1,FGFR1,FGFR3,FGFR2,FHIT,DUX4,FNBP1,FOXO1,ERC1,FOXO3,ARHGAP26,FLI1,SEPT6,FLT3,CAMTA1,ACSL6,ARHGEF12,CRTC1,SF3B1,HEY1,BRD4,SUZ12,KAT6B,PATZ1,ALK,ABL1,FUS,KDSR,C15orf55,ZNF521,WWTR1,SS18L1,SETBP1,GATA1,GATA2,CHIC2,ABL2,FOXP1,AFF4,EML4,GNA11,GNAQ,GNAS,P2RY8,CD274,TFPT,TLX3,H3F3A,HIP1,MNX1,HLF,HMGA1,FOXA1,HNRNPA2B1,TLX1,HOXA9,HOXA11,HOXA13,HOXC11,HOXC13,HOXD11,HOXD13,HRAS,BIRC3,HSP90AA1,HSP90AB1,IDH1,IDH2,IL2,IL6ST,IL7R,IRF4,ITK,JAK1,JAK2,JAK3,JUN,KCNJ5,KDR,KIF5B,KIT,KLK2,KRAS,KTN1,AFF3,LASP1,LCK,LCP1,LIFR,RHOH,LMO1,LMO2,LPP,ARNT,LYL1,MAF,MDM2,MDM4,MET,CIITA,MITF,MLF1,MLL,MLLT1,AFF1,MLLT3,MLLT4,MLLT6,FOXO4,MN1,MPL,MSN,MTCP1,MUC1,MYB,MYC,MYCL1,MYCN,MYD88,MYH9,MYH11,ATF1,NACA,ATIC,ATP1A1,NFE2L2,NFIB,NFKB2,NONO,CNOT3,NOTCH1,NOTCH2,NPM1,NRAS,NTRK1,NTRK3,ATP2B3,NUMA1,NUP98,OMD,PAFAH1B2,IL21R,PAX3,PAX5,PAX7,PBX1,PCM1,NIN,NCKIPSD,PDGFB,PDGFRA,PDGFRB,TRIM33,PER1,PIK3CA,PIM1,PLAG1,BCL11A,PML,PRRX1,SEPT5,POU2AF1,POU5F1,PPARG,FEV,WHSC1L1,PPP2R1A,ZNF331,PRCC,PRKAR1A,MAP2K1,MAP2K2,CXCR7,CASC5,GOPC,MKL1,KIAA1549,RNF213,PTPN11,CCNB1IP1,RAC1,RAD51B,RAF1,RALGDS,RAP1GDS1,RARA,KDM5A,CCND1,BCL2,REL,RET,TRIM27,BCL3,BCL6,BCL7A,BCL9,TNFRSF17,ROS1,BCR,RPL22,RPN1,SDC4,PRDM16,CRLF2,SET,SFPQ,SRSF2,SRSF3,NSD1,SH3GL1,CREB3L2,RBM15,CRTC3,RANBP17,STIL,BCL11B,SMO,SOX2,FOXL2,BRAF,SSX1,SSX2,SSX4,SS18,STAT3,STAT5B,SYK,TAL1,TAL2,TCEA1,TCF3,TCF7L2,TCF12,BTG1,TERT,TFE3,TFRC,NKX2-1,TMPRSS2,TOP1,TPM3,TPM4,TPR,TSHR,FAM22A,FAM22B,U2AF1,EZR,WHSC1,XPO1,YWHAE,CNBP,ZBTB16,ZMYM2,CACNA1D,PAX8,ASPSCR1,DEK,CHCHD7,TFEB,KAT6A,NR4A3,BRD3,NUP214,MLLT10,CCDC6,C2orf44,NCOA4,TET1,PDCD1LG2,HMGA2,CALR,TCL1A,TAF15,FIP1L1,ELL,CLTCL1,HIST1H4I,TRRAP,PICALM,CARS,FCRL4,HIST1H3B,TRAF7,HOOK3,CARD11,MAML2,GAS7,SLC45A3,RUNX1,RUNX1T1,CBFA2T3,NCOA1,CBFB,CBL,TRIM24,GMPS,BCL10,CCND2,CCND3,CCNE1,USP6,CREB3L1,RABEP1,PCSK7,SNX29,SPECC1,KLF4,TRIP11,PDE4DIP,HERPUD1,CD74,CD79A,CD79B,SRGAP3,MAFB,GOLGA5,THRAP3,MED12
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         SH2B3,CDKN2A,CDKN2C,IKZF1,STAG2,CHEK2,KLF6,CREBBP,FAM123B,CYLD,DAXX,DDB2,ASXL1,DNM2,DNMT3A,ARID2,FLCN,EP300,ERCC2,ERCC3,ERCC4,ERCC5,EXT1,EXT2,EZH2,FANCA,FANCC,FANCD2,FANCE,FANCF,FANCG,FH,CIC,DICER1,CBLC,GATA3,GPC3,SETD2,MSH6,APC,ECT2L,FAS,SMAD4,MEN1,MLH1,MSH2,MUTYH,NBN,ATM,NF1,NF2,SBDS,SUFU,CDK12,PIK3R1,PMS1,PMS2,ATRX,TET2,FAM46C,BCOR,RNF43,SDHAF2,PBRM1,FBXW7,PRF1,PRKAR1A,PTCH1,PTEN,MLL3,RB1,PRDM1,SDHB,SDHC,SDHD,BLM,MAP2K4,BMPR1A,SMARCA4,SMARCB1,BRCA1,BRCA2,STK11,HNF1A,BUB1B,TNFAIP3,TP53,TSC1,TSC2,KDM6A,VHL,WAS,WRN,WT1,XPA,XPC,CDC73,PALB2,FBXO11,MLL2,ZRSR2,KDM5C,ARID1A,AXIN1,BAP1,BRIP1,CASP8,PHF6,SOCS1,CBL,CBLB,TNFRSF14,FUBP1,PHOX2B,RECQL4,CDH1

This file doesn’t follow the principles of “tidy” data, according to which each row should represent only one single observation.

In this case we can use the unnest function from tidyr to convert it to a “long” format:

cancer.long = cancer %>%
    mutate(genes=strsplit(as.character(genes), ",") ) %>%
    unnest(genes) %>%
    rename(IDENTIFIER=genes)

cancer.long %>% head
##     dataset IDENTIFIER
## 1 oncogenes       ABI1
## 2 oncogenes      CDH11
## 3 oncogenes      AKAP9
## 4 oncogenes       LHFP
## 5 oncogenes       CDK4
## 6 oncogenes       CDK6

Now we can use left_join from dplyr to add a column to our peb expression data frame:

peb.long %>% nrow
## [1] 305184
peb.long = peb.long %>%
    left_join(cancer.long, by='IDENTIFIER') %>%
    mutate(dataset = ifelse(is.na(dataset), "no-cancer", as.character(dataset)))
## Warning: joining character vector and factor, coercing into character
## vector
peb.long %>% nrow
## [1] 305235

OPTIONAL exercise

You may notice that the number of rows in the dataframe increased after the join. The reason is that three genes are classified both as Oncogenes and as Tumor Suppressors.

To find them you can do any of the following:

# Detect duplicated IDENTIFIER & individual rows
peb.long %>% 
  subset(duplicated(paste(IDENTIFIER, individual)))
## Source: local data frame [51 x 7]
## Groups: individual
## 
##    IDENTIFIER individual population expression mean.expression
## 1         CBL   GSM11815        YRI      508.8        736.6556
## 2       IKZF1   GSM11815        YRI      440.9        736.6556
## 3      CREBBP   GSM11815        YRI      279.5        736.6556
## 4         CBL   GSM11832        YRI     1213.5        739.8641
## 5       IKZF1   GSM11832        YRI      606.7        739.8641
## 6      CREBBP   GSM11832        YRI      776.0        739.8641
## 7         CBL   GSM12069        YRI     1183.7        757.4072
## 8       IKZF1   GSM12069        YRI      798.9        757.4072
## 9      CREBBP   GSM12069        YRI      899.7        757.4072
## 10        CBL   GSM12083        YRI      994.3        711.6908
## ..        ...        ...        ...        ...             ...
## Variables not shown: overexpressed (chr), dataset (chr)
# Group by IDENTIFIER and individual, and summarise
peb.long %>% 
  group_by(IDENTIFIER, individual) %>%
  mutate(dataset=paste(dataset, collapse=',')) %>% 
  subset(grepl('tumor', dataset) & grepl('oncogene', dataset))
## Source: local data frame [102 x 7]
## Groups: IDENTIFIER, individual
## 
##    IDENTIFIER individual population expression mean.expression
## 1         CBL   GSM11815        YRI      508.8        736.6556
## 2         CBL   GSM11815        YRI      508.8        736.6556
## 3       IKZF1   GSM11815        YRI      440.9        736.6556
## 4       IKZF1   GSM11815        YRI      440.9        736.6556
## 5      CREBBP   GSM11815        YRI      279.5        736.6556
## 6      CREBBP   GSM11815        YRI      279.5        736.6556
## 7         CBL   GSM11832        YRI     1213.5        739.8641
## 8         CBL   GSM11832        YRI     1213.5        739.8641
## 9       IKZF1   GSM11832        YRI      606.7        739.8641
## 10      IKZF1   GSM11832        YRI      606.7        739.8641
## ..        ...        ...        ...        ...             ...
## Variables not shown: overexpressed (chr), dataset (chr)
# Reorganize the data frame to a "wide" format 
peb.long %>%
  mutate(value=T) %>% 
  spread(dataset, value) %>% 
  subset(oncogenes==T & tumorsuppressors==T)
## Source: local data frame [51 x 9]
## 
##    IDENTIFIER individual population expression mean.expression
## 1         CBL   GSM11815        YRI      508.8        736.6556
## 2       IKZF1   GSM11815        YRI      440.9        736.6556
## 3      CREBBP   GSM11815        YRI      279.5        736.6556
## 4         CBL   GSM11832        YRI     1213.5        739.8641
## 5       IKZF1   GSM11832        YRI      606.7        739.8641
## 6      CREBBP   GSM11832        YRI      776.0        739.8641
## 7         CBL   GSM12069        YRI     1183.7        757.4072
## 8       IKZF1   GSM12069        YRI      798.9        757.4072
## 9      CREBBP   GSM12069        YRI      899.7        757.4072
## 10        CBL   GSM12083        YRI      994.3        711.6908
## ..        ...        ...        ...        ...             ...
## Variables not shown: overexpressed (chr), no-cancer (lgl), oncogenes
##   (lgl), tumorsuppressors (lgl)

End of the OPTIONAL exercise

advanced ggplot2: facetting and saving to a file

Now that we have more qualitative variables in peb.long, we can create more sophisticated plots. For example, an useful way to represent multi-dimensional data is facetting:

myplot = ggplot(peb.long, aes(x=reorder(individual, desc(population)), y=expression, color=population)) +
  geom_boxplot() + 
  scale_y_continuous("Expression", limits=c(0, 1000)) + 
  scale_x_discrete('Individuals sorted by Population') +
  theme_bw() +
  theme(axis.text.x=element_text(angle = -90, hjust = 0)) +
  ggtitle('Expression by individual')

print(myplot + facet_wrap(~population, scales="free"))
## Warning: Removed 16121 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16531 rows containing non-finite values (stat_boxplot).
## Warning: Removed 13562 rows containing non-finite values (stat_boxplot).

If we want to save this plot to a file, we can do:

ggsave('expression_by_individual.pdf')

a Barchart plot

Let’s see how many oncogenes/tumor suppressors are overexpressed in each population:

peb.long %>% 
    subset(dataset !='no-cancer') %>% 
    ggplot(aes(x=dataset, fill=overexpressed)) + 
        geom_bar(position='dodge') + 
        facet_wrap(~population) +
        theme_bw() +
        ggtitle('Number of overexpressed/underexpressed genes by population')