Here’s a very clever paper that combines bibliometrics with second-gen sequencing to define the brain’s ignorome–the set of genes that are highly and selectively expressed in the CNS yet poorly studied in the neuroscience literature [cite source=’pubmed’]24523945[/cite]. Out of 650 genes with CNS-specific expression, about 38 have one or fewer neuroscience papers. In contrast to this ‘ignorome’, the top 5% of genes account for almost 70% of publications. Even more interesting, biological properties don’t predict which genes are hot vs. ignored (it’s not number of interacting partners, number of motifs, etc.). Instead, popularity is best predicted by date of discovery.
There are a couple of interesting things to think about here. First, it seems like neuroscientists are guilty of lampposting–looking at what we know how to look at rather than truly exploring. Second, and related, we need to be more open to discovery (something I wish our NIH reviewers would recognize)–we shouldn’t be racing to mechanism when we haven’t even fully characterized all the players yet.
Thanks much, Robert. We had a lot of fun working on this paper. The methods really need to be extended to other biological systems (“memory” would work), and diseases (e.g., Riba M, et al. Scientific Reports, 2016, on asthma). I wish we had been able to get this into a somewhat higher profile journal. We now have so much more omics data suitable for this type of analysis.
In unpublished work (should be out in 2019), we have extended the methods in this ignorome paper using gene ontologies in combination with expression data as another way to rank genes more objectively. In our case we wanted to do this for bone-associated genes. The general protocol is simple:
1. Generate a list of 100 or more well-known bone-associated genes taken from the literature or from the GO terms themselves (genes already linked to terms such as “osteoblast”, “bone mineral density”, ….). This becomes a gene reference set that is used as a “true positive” set. In your case, you could develop a reference set of 100 or more “memory-associated” genes quite easily.
2. Compute the top 1000 covariates of every transcript in a bone or brain transcriptome data set (for bone, we used Farber CR et al. 2012, GSE27483). It is possible to do this using any relevant transcriptome or even proteome data sets downloaded from GeneNetwork (www.genenetwork.org), GEO, or GTEx. The data sets should ideally be large eQTL-type expression data sets of the type in GeneNetwork or GTEx.
3. Compute bone-specific GO enrichment scores for each gene using its top 1000 covariates. We collaborated with Bing Zhang and the WebGestalt team at Baylor to do this efficiently. For each gene/transcript we computed its GO enrichment score against 60 bone-associated GO terms. The distribution of GO enrichment scores for GO bone terms of the reference set serves as the positive control for “boniness”. This distribution is then compared to all other transcripts/genes to provide an objective “boniness” score for the whole genome. As expected, many unknown genes have higher GO enrichment scores for bone-associated terms than many members of the reference gene set.
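The three steps above can be sketched roughly in Python. This is only a toy illustration under stated assumptions, not the pipeline we actually ran: the gene names and expression values are made up, the top 1000 covariates are shrunk to the top 3, the 60 bone GO terms are pooled into a single annotated set (here equal to the reference set), and a plain hypergeometric test stands in for the WebGestalt enrichment analysis. All function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def top_covariates(expr, gene, k):
    """Step 2: the k genes most strongly correlated (by |r|) with `gene`."""
    others = [g for g in expr if g != gene]
    others.sort(key=lambda g: abs(pearson(expr[gene], expr[g])), reverse=True)
    return others[:k]

def hypergeom_tail(overlap, draws, annotated, universe):
    """P(X >= overlap) when `draws` genes are drawn from `universe` genes,
    `annotated` of which carry a bone term."""
    total = math.comb(universe, draws)
    upper = min(draws, annotated)
    hits = sum(math.comb(annotated, i) * math.comb(universe - annotated, draws - i)
               for i in range(overlap, upper + 1))
    return hits / total

def enrichment_score(expr, gene, annotated, k):
    """Step 3 (assumed scoring): -log10 hypergeometric p-value for bone-annotated
    genes among the top-k covariates. The gene itself is excluded throughout."""
    covs = top_covariates(expr, gene, k)
    n_annot = len(annotated - {gene})
    overlap = len(annotated.intersection(covs))
    p = hypergeom_tail(overlap, len(covs), n_annot, len(expr) - 1)
    return max(0.0, -math.log10(p))

# Toy expression data (made-up values): gene -> expression across 8 samples.
expr = {
    "ref1": [1, 2, 3, 4, 5, 6, 7, 8],
    "ref2": [2, 2, 3, 4, 5, 6, 7, 8],
    "ref3": [1, 2, 3, 5, 5, 6, 7, 8],
    "ref4": [1, 2, 3, 4, 5, 6, 8, 8],
    "unkA": [1, 3, 3, 4, 5, 6, 7, 8],  # unannotated, tracks the bone module
    "oth1": [1, 8, 2, 7, 3, 6, 4, 5],
    "oth2": [2, 8, 2, 7, 3, 6, 4, 5],
    "oth3": [1, 8, 2, 6, 3, 6, 4, 5],
    "unkB": [1, 8, 2, 7, 3, 6, 5, 5],  # unannotated, tracks the other module
}

# Step 1: the reference ("true positive") set; here it doubles as the annotated set.
reference = {"ref1", "ref2", "ref3", "ref4"}

K = 3  # stands in for the top 1000 covariates
scores = {g: enrichment_score(expr, g, reference, K) for g in expr}

def boniness(gene):
    """Fraction of reference-gene scores that this gene's score meets or exceeds."""
    ref_scores = [scores[r] for r in reference]
    return sum(scores[gene] >= s for s in ref_scores) / len(ref_scores)

for g in sorted(scores, key=scores.get, reverse=True):
    print(f"{g}: score={scores[g]:.3f} boniness={boniness(g):.2f}")
```

In this toy set, the unannotated gene that tracks the reference module gets a clearly positive score, while the one that tracks the unrelated module scores zero. With real data you would replace the pooled set with per-term annotations and use a proper enrichment tool with multiple-testing correction.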
The idea that sociology of science (time of discovery or gene “popularity”) constrains search patterns in functional genomics originated in this lovely paper in 2007: Pfeiffer T, Hoffmann R (2007) Temporal patterns of genes in scientific publications. Proc Natl Acad Sci U S A 104: 12052–12056.
Posted by rwilliams@uthsc.edu (feel free to follow up).