Welcome to the PEB advanced R workshop!
We will take a “messy” gene expression table, and use the tidyr library to restructure it in a format that is better suited for data analysis. We will also group the data and calculate summaries using the dplyr library, and finally plot it using ggplot2.
Let’s install all the required packages:
pkgs = c("tidyr", "dplyr", "ggplot2")
#for (pkg in pkgs) install.packages(pkg)
for (pkg in pkgs) library(pkg, character.only=T)
The tidyr package requires at least R 3.1.2. If you can’t install it, you can use reshape2 instead, but you won’t be able to follow some of the exercises.
We also need to download the following data files:
download.file('https://raw.githubusercontent.com/dalloliogm/peb_workshop/master/peb_expression.csv', dest='peb_expression.csv')
download.file('https://raw.githubusercontent.com/dalloliogm/peb_workshop/master/cancer_genes.csv', dest='cancer_genes.csv')
Let’s read the peb_expression.csv file that we just downloaded, and have a look at its contents.
peb = read.table("peb_expression.csv", header=T, sep='\t')
head(peb)
## ID_REF IDENTIFIER GSM11815.YRI GSM11832.YRI GSM12069.YRI
## 1 200000_s_at PRPF8 4254.0 5298.2 4026.5
## 2 200001_at CAPNS1 17996.2 12010.7 10283.5
## 3 200002_at RPL35 41678.8 39116.9 38758.9
## 4 200003_s_at RPL28 65390.9 34806.2 31257.2
## 5 200004_at EIF4G2 19030.1 15813.6 16355.7
## 6 200005_at EIF3D 8824.5 9706.2 10590.0
## GSM12083.YRI GSM12101.YRI GSM12106.EUR GSM12274.EUR GSM12299.EUR
## 1 3498.4 3566.4 4903.1 6372.6 4829.1
## 2 2534.7 11048.4 13354.0 8563.8 17247.6
## 3 32847.7 39633.9 43511.2 46856.7 47032.4
## 4 28308.5 67447.5 56989.9 57972.5 57570.5
## 5 9579.7 14273.5 17217.0 19116.9 17487.6
## 6 6986.7 9400.4 12835.2 10299.0 12375.2
## GSM12412.EUR GSM11810.EUR GSM11827.EUR GSM12078.EAS GSM12099.EAS
## 1 5205.8 2756.8 3932.0 3729.9 3223.4
## 2 16018.5 6077.0 15703.8 10138.5 11614.4
## 3 22152.2 26660.7 26373.6 23809.6 24749.3
## 4 29062.2 35140.9 23629.3 22100.5 21651.0
## 5 14671.6 17733.1 18022.4 17957.4 15958.0
## 6 7645.4 8661.5 7355.7 6973.4 6855.9
## GSM12269.EAS GSM12287.EAS GSM12301.EAS GSM12448.EAS
## 1 3640.5 4886.3 4070.2 3482.1
## 2 8460.5 10282.6 11844.3 9741.6
## 3 21936.8 31462.8 22733.7 25395.5
## 4 18550.7 23496.5 21315.4 28631.4
## 5 15799.8 16685.8 18817.3 17421.1
## 6 7949.2 9486.5 7494.5 7252.1
str(peb)
## 'data.frame': 22645 obs. of 19 variables:
## $ ID_REF : Factor w/ 22645 levels "200000_s_at",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ IDENTIFIER : Factor w/ 17953 levels "7A5","A1BG","A2BP1",..: 14826 9605 15529 15524 10565 10558 14391 16230 11210 11210 ...
## $ GSM11815.YRI: num 4254 17996 41679 65391 19030 ...
## $ GSM11832.YRI: num 5298 12011 39117 34806 15814 ...
## $ GSM12069.YRI: num 4026 10284 38759 31257 16356 ...
## $ GSM12083.YRI: num 3498 2535 32848 28308 9580 ...
## $ GSM12101.YRI: num 3566 11048 39634 67448 14274 ...
## $ GSM12106.EUR: num 4903 13354 43511 56990 17217 ...
## $ GSM12274.EUR: num 6373 8564 46857 57972 19117 ...
## $ GSM12299.EUR: num 4829 17248 47032 57570 17488 ...
## $ GSM12412.EUR: num 5206 16018 22152 29062 14672 ...
## $ GSM11810.EUR: num 2757 6077 26661 35141 17733 ...
## $ GSM11827.EUR: num 3932 15704 26374 23629 18022 ...
## $ GSM12078.EAS: num 3730 10138 23810 22100 17957 ...
## $ GSM12099.EAS: num 3223 11614 24749 21651 15958 ...
## $ GSM12269.EAS: num 3640 8460 21937 18551 15800 ...
## $ GSM12287.EAS: num 4886 10283 31463 23496 16686 ...
## $ GSM12301.EAS: num 4070 11844 22734 21315 18817 ...
## $ GSM12448.EAS: num 3482 9742 25396 28631 17421 ...
The first two columns contain the probe ID and the name of the gene. The other columns contain the expression levels of each gene in an individual. Notice how the name of the columns also encode the population of each individual, e.g. YRI, EUR, or EAS.
The dplyr package introduced a piping system for R, using the %>% symbol.
This works similarly to the piping system in bash, but using the %>% symbol instead of the pipe |. We first write the name of the dataframe to use, then all the operation that must be executed on it.
For example, the following is equivalent to head(peb)
peb %>% head
## ID_REF IDENTIFIER GSM11815.YRI GSM11832.YRI GSM12069.YRI
## 1 200000_s_at PRPF8 4254.0 5298.2 4026.5
## 2 200001_at CAPNS1 17996.2 12010.7 10283.5
## 3 200002_at RPL35 41678.8 39116.9 38758.9
## 4 200003_s_at RPL28 65390.9 34806.2 31257.2
## 5 200004_at EIF4G2 19030.1 15813.6 16355.7
## 6 200005_at EIF3D 8824.5 9706.2 10590.0
## GSM12083.YRI GSM12101.YRI GSM12106.EUR GSM12274.EUR GSM12299.EUR
## 1 3498.4 3566.4 4903.1 6372.6 4829.1
## 2 2534.7 11048.4 13354.0 8563.8 17247.6
## 3 32847.7 39633.9 43511.2 46856.7 47032.4
## 4 28308.5 67447.5 56989.9 57972.5 57570.5
## 5 9579.7 14273.5 17217.0 19116.9 17487.6
## 6 6986.7 9400.4 12835.2 10299.0 12375.2
## GSM12412.EUR GSM11810.EUR GSM11827.EUR GSM12078.EAS GSM12099.EAS
## 1 5205.8 2756.8 3932.0 3729.9 3223.4
## 2 16018.5 6077.0 15703.8 10138.5 11614.4
## 3 22152.2 26660.7 26373.6 23809.6 24749.3
## 4 29062.2 35140.9 23629.3 22100.5 21651.0
## 5 14671.6 17733.1 18022.4 17957.4 15958.0
## 6 7645.4 8661.5 7355.7 6973.4 6855.9
## GSM12269.EAS GSM12287.EAS GSM12301.EAS GSM12448.EAS
## 1 3640.5 4886.3 4070.2 3482.1
## 2 8460.5 10282.6 11844.3 9741.6
## 3 21936.8 31462.8 22733.7 25395.5
## 4 18550.7 23496.5 21315.4 28631.4
## 5 15799.8 16685.8 18817.3 17421.1
## 6 7949.2 9486.5 7494.5 7252.1
We can concantenate any number of operations on the same dataset, just as we would do in bash. We will play with this piping system in a few minutes.
A dataset can be encoded in a “wide” format (more columns and less rows), or in a “long” format (minimum number of columns and more rows). While the wide format can be more readable for the human eye, the long format is better suited for data analysis.
Our peb data frame is in a “wide” format, as it contains many columns, one for every individual. However, all the functions used in the rest of the workshop require the dataset to be in a long format. Let’s convert it using the gather() function from tidyr:
peb.long = gather(peb, sample, expression, -c(ID_REF, IDENTIFIER))
head(peb.long)
## ID_REF IDENTIFIER sample expression
## 1 200000_s_at PRPF8 GSM11815.YRI 4254.0
## 2 200001_at CAPNS1 GSM11815.YRI 17996.2
## 3 200002_at RPL35 GSM11815.YRI 41678.8
## 4 200003_s_at RPL28 GSM11815.YRI 65390.9
## 5 200004_at EIF4G2 GSM11815.YRI 19030.1
## 6 200005_at EIF3D GSM11815.YRI 8824.5
Explanation:
Note how the new format encodes exactly the same data as before, but has only four columns.
Nowadays many recent R libraries and functions are designed for datasets in the long format. A common beginner mistake is trying to apply these functions to datasets in the wide format, while the trick is to restructure the data first. If you learn how to reorganize your dataset into a long format, then you can solve most data analysis problems in R using always the same approach.
If you couldn’t install the tidyr package, you can achieve the same format using the melt function from reshape2
install.packages("reshape2")
library("reshape2")
peb.long = melt(peb, id.vars=c("ID_REF", "IDENTIFIER"), variable.name="sample")
names(peb.long)[4] = "expression"
head(peb.long, 3)
In a properly structured table each variable must contain only one type of information. If we look at our peb.long dataframe, the sample column contains both the individual ID and its population. Let’s split this column into two, using separate from tidyr:
peb.long.tmp = gather(peb, sample, expression, -ID_REF, -IDENTIFIER)
peb.long = separate(peb.long.tmp, sample, into=c("individual", "population"), sep='\\.')
head(peb.long)
## ID_REF IDENTIFIER individual population expression
## 1 200000_s_at PRPF8 GSM11815 YRI 4254.0
## 2 200001_at CAPNS1 GSM11815 YRI 17996.2
## 3 200002_at RPL35 GSM11815 YRI 41678.8
## 4 200003_s_at RPL28 GSM11815 YRI 65390.9
## 5 200004_at EIF4G2 GSM11815 YRI 19030.1
## 6 200005_at EIF3D GSM11815 YRI 8824.5
The above code can be simplified using the %>% operator:
peb.long = peb %>%
gather(sample, expression, -ID_REF, -IDENTIFIER) %>%
separate(sample, into=c("individual", "population"), sep='\\.')
To prepare the dataset for our workshop, we need to apply a few more data filtering steps, like removing all the “Control” rows, eliminating all the duplicated genes (keeping only one probe per gene), and dropping the ID_REF column.
peb.long %>% nrow
## [1] 384965
peb.long = peb %>%
gather(sample, expression, -ID_REF, -IDENTIFIER) %>%
separate(sample, into=c("individual", "population"), sep='\\.') %>%
subset(!grepl("Control", IDENTIFIER)) %>%
subset(!duplicated(paste(IDENTIFIER, individual))) %>%
select(-ID_REF)
peb.long %>% nrow
## [1] 305184
Apart from the %>% operator, the dplyr library introduces three useful functions: group_by, summarise and mutate.
group_by is used to define how the rows of a dataset must be grouped. For example, we can group the rows of peb.long by population.
summarise is used to calculate summaries of a grouped dataset - for example we can use it to calculate the mean and standard deviation of the expression for every population:
peb.means = peb.long %>%
group_by(population) %>%
summarise(
mean=mean(expression),
sd=sd(expression))
peb.means %>% print
## Source: local data frame [3 x 3]
##
## population mean sd
## 1 EAS 747.4846 2696.084
## 2 EUR 742.9541 2818.128
## 3 YRI 744.1489 2856.510
peb.long = peb.long %>%
group_by(individual) %>%
mutate(
mean.expression=mean(expression) ) %>%
mutate(
overexpressed = ifelse(expression > mean.expression, "overexpressed", "underexpressed")
)
peb.long %>% head %>% print.data.frame(row.names=F)
## IDENTIFIER individual population expression mean.expression overexpressed
## PRPF8 GSM11815 YRI 4254.0 736.6556 overexpressed
## CAPNS1 GSM11815 YRI 17996.2 736.6556 overexpressed
## RPL35 GSM11815 YRI 41678.8 736.6556 overexpressed
## RPL28 GSM11815 YRI 65390.9 736.6556 overexpressed
## EIF4G2 GSM11815 YRI 19030.1 736.6556 overexpressed
## EIF3D GSM11815 YRI 8824.5 736.6556 overexpressed
We will use the ggplot2 package to plot our peb.long dataframe.
A ggplot2 plot is composed by a base ggplot() object, defining the dataset and the basic variables used, and by additional elements defining how to represent them. Let’s see an example:
ggplot(peb.long, aes(x=individual, y=expression)) +
geom_point() +
ggtitle("Expression by individual")
Explanation:
ggplot(peb.long, aes(x=overexpressed, y=expression)) initialize the base ggplot2 object. We use the aes (for aesthetic) function to define the default x and y variables
geom_point adds a scatterplot representation of the base plot
ggtitle is another element of the plot, in this case the title.
Let’ try a boxplot representation:
ggplot(peb.long, aes(x=individual, y=expression)) +
geom_boxplot()
Let’s create a nicer plot, in which we:
- sort the individuals by population (x=reorder..)
- color by the population variable (color=population)
- set limits for the Y axis (scale_y_continuous)
- set a nice name for X axis (scale_x_discrete)
- use a whiter plot theme (theme_bw)
- rotate the X axis labels (theme)
ggplot(peb.long, aes(x=reorder(individual, desc(population)), y=expression, color=population)) +
geom_boxplot() +
scale_y_continuous("Expression", limits=c(0, 1000)) +
scale_x_discrete('Individuals sorted by Population') +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
ggtitle('Expression by individual')
## Warning: Removed 46203 rows containing non-finite values (stat_boxplot).
Remember that we also downloaded a file containing the classification of genes into oncogenes and tumor suppressors, from the NCG database:
cancer = read.table("cancer_genes.csv", header=T)
cancer %>% head
## dataset
## 1 oncogenes
## 2 tumorsuppressors
## genes
## 1 ABI1,CDH11,AKAP9,LHFP,CDK4,CDK6,OLIG2,GPHN,FSTL3,IKZF1,TFG,NDRG1,CDX2,NCOA2,CEBPA,SLC34A2,SEPT9,MALT1,MLLT11,CLP1,CNTRL,FGFR1OP,PSIP1,WIF1,CHN1,RMI2,LRIG3,CLTC,MSI2,CANT1,COL1A1,COX6C,CREB1,CREBBP,VTI1A,FLJ27352,CTNNB1,TTL,DDIT3,DDX5,DDX6,DDX10,ZNF384,EBF1,EGFR,EIF4A2,ELF4,ELK4,ELN,EPS15,ERBB2,AKT1,ERG,AKT2,ETV1,ETV4,ETV5,ETV6,MECOM,EWSR1,ALDH2,ACSL3,FCGR2B,JAZF1,FGFR1,FGFR3,FGFR2,FHIT,DUX4,FNBP1,FOXO1,ERC1,FOXO3,ARHGAP26,FLI1,SEPT6,FLT3,CAMTA1,ACSL6,ARHGEF12,CRTC1,SF3B1,HEY1,BRD4,SUZ12,KAT6B,PATZ1,ALK,ABL1,FUS,KDSR,C15orf55,ZNF521,WWTR1,SS18L1,SETBP1,GATA1,GATA2,CHIC2,ABL2,FOXP1,AFF4,EML4,GNA11,GNAQ,GNAS,P2RY8,CD274,TFPT,TLX3,H3F3A,HIP1,MNX1,HLF,HMGA1,FOXA1,HNRNPA2B1,TLX1,HOXA9,HOXA11,HOXA13,HOXC11,HOXC13,HOXD11,HOXD13,HRAS,BIRC3,HSP90AA1,HSP90AB1,IDH1,IDH2,IL2,IL6ST,IL7R,IRF4,ITK,JAK1,JAK2,JAK3,JUN,KCNJ5,KDR,KIF5B,KIT,KLK2,KRAS,KTN1,AFF3,LASP1,LCK,LCP1,LIFR,RHOH,LMO1,LMO2,LPP,ARNT,LYL1,MAF,MDM2,MDM4,MET,CIITA,MITF,MLF1,MLL,MLLT1,AFF1,MLLT3,MLLT4,MLLT6,FOXO4,MN1,MPL,MSN,MTCP1,MUC1,MYB,MYC,MYCL1,MYCN,MYD88,MYH9,MYH11,ATF1,NACA,ATIC,ATP1A1,NFE2L2,NFIB,NFKB2,NONO,CNOT3,NOTCH1,NOTCH2,NPM1,NRAS,NTRK1,NTRK3,ATP2B3,NUMA1,NUP98,OMD,PAFAH1B2,IL21R,PAX3,PAX5,PAX7,PBX1,PCM1,NIN,NCKIPSD,PDGFB,PDGFRA,PDGFRB,TRIM33,PER1,PIK3CA,PIM1,PLAG1,BCL11A,PML,PRRX1,SEPT5,POU2AF1,POU5F1,PPARG,FEV,WHSC1L1,PPP2R1A,ZNF331,PRCC,PRKAR1A,MAP2K1,MAP2K2,CXCR7,CASC5,GOPC,MKL1,KIAA1549,RNF213,PTPN11,CCNB1IP1,RAC1,RAD51B,RAF1,RALGDS,RAP1GDS1,RARA,KDM5A,CCND1,BCL2,REL,RET,TRIM27,BCL3,BCL6,BCL7A,BCL9,TNFRSF17,ROS1,BCR,RPL22,RPN1,SDC4,PRDM16,CRLF2,SET,SFPQ,SRSF2,SRSF3,NSD1,SH3GL1,CREB3L2,RBM15,CRTC3,RANBP17,STIL,BCL11B,SMO,SOX2,FOXL2,BRAF,SSX1,SSX2,SSX4,SS18,STAT3,STAT5B,SYK,TAL1,TAL2,TCEA1,TCF3,TCF7L2,TCF12,BTG1,TERT,TFE3,TFRC,NKX2-1,TMPRSS2,TOP1,TPM3,TPM4,TPR,TSHR,FAM22A,FAM22B,U2AF1,EZR,WHSC1,XPO1,YWHAE,CNBP,ZBTB16,ZMYM2,CACNA1D,PAX8,ASPSCR1,DEK,CHCHD7,TFEB,KAT6A,NR4A3,BRD3,NUP214,MLLT10,CCDC6,C2orf44,NCOA4,TET1,PDCD1LG2,HMGA2,CALR,TCL1A,TAF15,FIP1L1,ELL,CLTCL1,HIST1H4I,TRRAP,PICALM,CARS,FCRL4,HIST1H3B,TRAF7,HOOK3,CARD11,MAML2,GAS7,SLC45A3,RUNX1,RUNX1T1,CBFA2T3,NCOA1,CBFB,CBL,TRIM24,GMPS,BCL10,CCND2,CCND3,CCNE1,USP6,CREB3L1,RABEP1,PCSK7,SNX29,SPECC1,KLF4,TRIP11,PDE4DIP,HERPUD1,CD74,CD79A,CD79B,SRGAP3,MAFB,GOLGA5,THRAP3,MED12
## 2 SH2B3,CDKN2A,CDKN2C,IKZF1,STAG2,CHEK2,KLF6,CREBBP,FAM123B,CYLD,DAXX,DDB2,ASXL1,DNM2,DNMT3A,ARID2,FLCN,EP300,ERCC2,ERCC3,ERCC4,ERCC5,EXT1,EXT2,EZH2,FANCA,FANCC,FANCD2,FANCE,FANCF,FANCG,FH,CIC,DICER1,CBLC,GATA3,GPC3,SETD2,MSH6,APC,ECT2L,FAS,SMAD4,MEN1,MLH1,MSH2,MUTYH,NBN,ATM,NF1,NF2,SBDS,SUFU,CDK12,PIK3R1,PMS1,PMS2,ATRX,TET2,FAM46C,BCOR,RNF43,SDHAF2,PBRM1,FBXW7,PRF1,PRKAR1A,PTCH1,PTEN,MLL3,RB1,PRDM1,SDHB,SDHC,SDHD,BLM,MAP2K4,BMPR1A,SMARCA4,SMARCB1,BRCA1,BRCA2,STK11,HNF1A,BUB1B,TNFAIP3,TP53,TSC1,TSC2,KDM6A,VHL,WAS,WRN,WT1,XPA,XPC,CDC73,PALB2,FBXO11,MLL2,ZRSR2,KDM5C,ARID1A,AXIN1,BAP1,BRIP1,CASP8,PHF6,SOCS1,CBL,CBLB,TNFRSF14,FUBP1,PHOX2B,RECQL4,CDH1
This file doesn’t follow the principles of “tidy” data, according to which each row should represent only one single observation.
In this case we can use the unnest function from tidyr to convert it to a “long” format:
cancer.long = cancer %>%
mutate(genes=strsplit(as.character(genes), ",") ) %>%
unnest(genes) %>%
rename(IDENTIFIER=genes)
cancer.long %>% head
## dataset IDENTIFIER
## 1 oncogenes ABI1
## 2 oncogenes CDH11
## 3 oncogenes AKAP9
## 4 oncogenes LHFP
## 5 oncogenes CDK4
## 6 oncogenes CDK6
Now we can use left_join from dplyr to add a column to our peb expression data frame:
peb.long %>% nrow
## [1] 305184
peb.long = peb.long %>%
left_join(cancer.long, by='IDENTIFIER') %>%
mutate(dataset = ifelse(is.na(dataset), "no-cancer", as.character(dataset)))
## Warning: joining character vector and factor, coercing into character
## vector
peb.long %>% nrow
## [1] 305235
OPTIONAL exercise
You may notice that the number of rows in the dataframe increased after the join. The reason is that three genes are classified both as Oncogenes and as Tumor Suppressors.
To find them you can do any of the following:
# Detect duplicated IDENTIFIER & individual rows
peb.long %>%
subset(duplicated(paste(IDENTIFIER, individual)))
## Source: local data frame [51 x 7]
## Groups: individual
##
## IDENTIFIER individual population expression mean.expression
## 1 CBL GSM11815 YRI 508.8 736.6556
## 2 IKZF1 GSM11815 YRI 440.9 736.6556
## 3 CREBBP GSM11815 YRI 279.5 736.6556
## 4 CBL GSM11832 YRI 1213.5 739.8641
## 5 IKZF1 GSM11832 YRI 606.7 739.8641
## 6 CREBBP GSM11832 YRI 776.0 739.8641
## 7 CBL GSM12069 YRI 1183.7 757.4072
## 8 IKZF1 GSM12069 YRI 798.9 757.4072
## 9 CREBBP GSM12069 YRI 899.7 757.4072
## 10 CBL GSM12083 YRI 994.3 711.6908
## .. ... ... ... ... ...
## Variables not shown: overexpressed (chr), dataset (chr)
# Group by IDENTIFIER and individual, and summarise
peb.long %>%
group_by(IDENTIFIER, individual) %>%
mutate(dataset=paste(dataset, collapse=',')) %>%
subset(grepl('tumor', dataset) & grepl('oncogene', dataset))
## Source: local data frame [102 x 7]
## Groups: IDENTIFIER, individual
##
## IDENTIFIER individual population expression mean.expression
## 1 CBL GSM11815 YRI 508.8 736.6556
## 2 CBL GSM11815 YRI 508.8 736.6556
## 3 IKZF1 GSM11815 YRI 440.9 736.6556
## 4 IKZF1 GSM11815 YRI 440.9 736.6556
## 5 CREBBP GSM11815 YRI 279.5 736.6556
## 6 CREBBP GSM11815 YRI 279.5 736.6556
## 7 CBL GSM11832 YRI 1213.5 739.8641
## 8 CBL GSM11832 YRI 1213.5 739.8641
## 9 IKZF1 GSM11832 YRI 606.7 739.8641
## 10 IKZF1 GSM11832 YRI 606.7 739.8641
## .. ... ... ... ... ...
## Variables not shown: overexpressed (chr), dataset (chr)
# Reorganize the data frame to a "wide" format
peb.long %>%
mutate(value=T) %>%
spread(dataset, value) %>%
subset(oncogenes==T & tumorsuppressors==T)
## Source: local data frame [51 x 9]
##
## IDENTIFIER individual population expression mean.expression
## 1 CBL GSM11815 YRI 508.8 736.6556
## 2 IKZF1 GSM11815 YRI 440.9 736.6556
## 3 CREBBP GSM11815 YRI 279.5 736.6556
## 4 CBL GSM11832 YRI 1213.5 739.8641
## 5 IKZF1 GSM11832 YRI 606.7 739.8641
## 6 CREBBP GSM11832 YRI 776.0 739.8641
## 7 CBL GSM12069 YRI 1183.7 757.4072
## 8 IKZF1 GSM12069 YRI 798.9 757.4072
## 9 CREBBP GSM12069 YRI 899.7 757.4072
## 10 CBL GSM12083 YRI 994.3 711.6908
## .. ... ... ... ... ...
## Variables not shown: overexpressed (chr), no-cancer (lgl), oncogenes
## (lgl), tumorsuppressors (lgl)
End of the OPTIONAL exercise
Now that we have more qualitative variables in peb.long, we can create more sophisticated plots. For example, an useful way to represent multi-dimensional data is facetting:
myplot = ggplot(peb.long, aes(x=reorder(individual, desc(population)), y=expression, color=population)) +
geom_boxplot() +
scale_y_continuous("Expression", limits=c(0, 1000)) +
scale_x_discrete('Individuals sorted by Population') +
theme_bw() +
theme(axis.text.x=element_text(angle = -90, hjust = 0)) +
ggtitle('Expression by individual')
print(myplot + facet_wrap(~population, scales="free"))
## Warning: Removed 16121 rows containing non-finite values (stat_boxplot).
## Warning: Removed 16531 rows containing non-finite values (stat_boxplot).
## Warning: Removed 13562 rows containing non-finite values (stat_boxplot).
If we want to save this plot to a file, we can do:
ggsave('expression_by_individual.pdf')
Let’s see how many oncogenes/tumor suppressors are overexpressed in each population:
peb.long %>%
subset(dataset !='no-cancer') %>%
ggplot(aes(x=dataset, fill=overexpressed)) +
geom_bar(position='dodge') +
facet_wrap(~population) +
theme_bw() +
ggtitle('Number of overexpressed/underexpressed genes by population')