Want to analyze millions of scientific papers all at once? Here’s the best way to do it
By Lindsay McKenzie
With more than a million scientific papers produced each year, keeping on top of the latest research is becoming an impossible task. That’s why a growing number of scientists are having computers trawl through thousands of research papers at once for raw data and text. Now, in one of the largest text and data mining exercises ever conducted, scientists say they have identified the best way to do such searches, which could improve the hunt for everything from new drug targets to genes that have not been studied in detail.
There is a long-standing debate among text and data miners: whether sifting through full research papers, rather than much shorter and simpler research summaries, or abstracts, is worth the extra effort. Though it may seem obvious that full papers would give better results, some researchers say that much of the information they contain is redundant, and that abstracts contain all that’s needed. Given the challenges of obtaining and formatting full papers for mining, stick with abstracts, they say.
In an attempt to settle the debate, Søren Brunak, a bioinformatician at the Technical University of Denmark in Kongens Lyngby, and colleagues analyzed more than 15 million scientific articles published in English from 1823 to 2016. After creating two databases of those articles, one of full texts and one of abstracts, the researchers directly compared the results of mining each. The full texts were obtained from publishers Elsevier and Springer, as well as the open-access section of online repository PubMed Central. The abstracts from the same papers were collected from MEDLINE, a resource that, like PubMed Central, receives funding from the U.S. National Institutes of Health.
Text mining full research articles gave consistently better results than text mining abstracts, the team reports this month on the preprint server bioRxiv (which was not mined). In one example test, the authors identified far more associations between genes and a variety of diseases from the full-text articles than the abstracts—potentially creating a treasure trove of ideas for future research targets.
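To make the comparison concrete, here is a minimal sketch of one simple text-mining approach: counting gene and disease names that appear in the same sentence. It is not the method the study's authors used. The term lists, the sample text, and the function name `cooccurrences` are all hypothetical and chosen only for illustration.

```python
import re
from itertools import product

# Hypothetical toy dictionaries; real projects rely on curated lexicons.
GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "glioma"}

def cooccurrences(text):
    """Return the set of (gene, disease) pairs mentioned in the same sentence."""
    pairs = set()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        genes = {g for g in GENES if re.search(rf"\b{re.escape(g)}\b", sentence)}
        diseases = {d for d in DISEASES if d in lowered}
        pairs.update(product(genes, diseases))
    return pairs

# Invented example text: the abstract is a subset of the full paper.
abstract = "We study BRCA1 in breast cancer."
full_text = abstract + " In a secondary analysis, TP53 was altered in glioma samples."

abstract_pairs = cooccurrences(abstract)
full_text_pairs = cooccurrences(full_text)

# Associations that only the full text reveals.
extra_pairs = full_text_pairs - abstract_pairs
print(extra_pairs)
```

In this toy case the full text yields one association that the abstract alone misses, which is the kind of difference the study measured at a far larger scale.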
The paper “convincingly shows that ideally text mining studies should use full-text,” says Daniel Himmelstein, a biodata scientist at the University of Pennsylvania who was not involved in the study.
Now, many researchers are just using abstracts, says study co-author Lars Juhl Jensen, a bioinformatician at the University of Copenhagen. These summaries are typically much easier to get ahold of than full research papers, have fewer legal restrictions on their use, and are much easier for computers to read due to their simple formatting.
Given those advantages, researchers using text mining may not switch from abstracts any time soon, Himmelstein says. An additional obstacle, he notes, is that because of restrictions publishers place on many full-text articles, researchers often cannot share the databases of papers they download and prepare for text mining, making it extremely difficult for others to replicate their research.
Brunak admits that the process of negotiating permissions with publishers was challenging and took his colleagues in the library several months. But he says that arguably the most time-consuming and challenging step in the study was converting the full-text articles the publishers provided in the common PDF file format into a machine-readable text format.
“This is one of the big reasons why nobody did full-text mining at this scale before,” Jensen says. “We probably spent more computational resources teasing the text out of PDFs and beating it into shape than we spent on the actual text mining.” Jensen warns that if researchers aren’t familiar with this step, they may be “unpleasantly surprised” by how many errors they get when converting the files.
One solution, says Jensen, would be for publishers to ensure that full-text articles can be easily mined. He’s eager to see publishers work together to find “a consistent format” that could be used across the board, “rather than each journal just inventing their own.” The XML file format for sharing data used by the scholarly article repository PubMed Central could be a good model for this, Jensen notes.
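As a rough illustration of why a consistent XML format helps, the sketch below pulls the abstract and the body text out of a small structured article using only Python's standard library. The sample document and its tag names are a simplified assumption, loosely modeled on the kind of article markup PubMed Central distributes. The helper `section_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, invented article; real files carry far more metadata.
SAMPLE = """<article>
  <front><article-meta>
    <abstract><p>Short summary of the findings.</p></abstract>
  </article-meta></front>
  <body>
    <sec><title>Methods</title><p>Detailed methods text.</p></sec>
    <sec><title>Results</title><p>Detailed results text.</p></sec>
  </body>
</article>"""

def section_text(root, path):
    """Return whitespace-normalized text under the first element matching path."""
    node = root.find(path)
    if node is None:
        return ""
    return " ".join(" ".join(node.itertext()).split())

root = ET.fromstring(SAMPLE)
abstract_text = section_text(root, ".//abstract")
body_text = section_text(root, "./body")

print(abstract_text)
print(body_text)
```

Because the abstract and body sit in clearly labeled elements, no layout guessing is needed, in contrast to text recovered from a PDF.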
Finding full-text psychology journals online can be difficult, especially for students with limited access to academic libraries or online databases. There are a number of psychology, social science, and medical journals that offer free full-text articles, which may be especially useful for students living in rural areas or studying via distance education. The following journals offer access to a selection of full-text articles online.
Full-Text Psychology Journals
Addictive Behaviors offers a sample issue of the journal online. The sample issue contains full-text articles in both HTML and PDF format. A great resource for students researching addictions.
American Journal of Drug and Alcohol Abuse
Find full-text articles on the study and treatment of drug abuse and alcoholism. The American Journal of Drug and Alcohol Abuse focuses on a wide range of topics including clinical, pharmacological, administrative, and social aspects of substance abuse.
Archives of Internal Medicine
Offers free full-text articles to registered users 12 months after publication. Published by the American Medical Association, the journal covers a wide range of topics related to internal medicine. Free registration is required to access the articles.
Biology of Reproduction
Find full-text articles as well as article abstracts dating back to 1969 from the Biology of Reproduction journal.
Brain: A Journal of Neurology
Find free full-text articles on neurology as well as free editorials. A useful resource for students of neuroscience and biopsychology.
British Journal of Psychiatry
Find articles covering all topics in psychology from the British Journal of Psychiatry. The journal is focused on clinical aspects of mental health and includes issues of interest to psychiatrists, clinical psychologists, and students of psychology.
Full-text articles are available from January 2000 and articles become available one year after publication.
CogPrints features journal articles on a number of topic areas, including many in psychology. Find articles on behavioral analysis, clinical psychology, psychobiology, social psychology, and more.
Current Psychology Letters
This electronic journal offers short papers on current topics in psychology. Available papers date from 2000 to 2006.
Electronic Journal of Research in Educational Psychology
This journal is a great resource for current research in educational psychology. Find full-text journal articles in both English and Spanish.
Find full-text articles and reviews on the history, research, and theoretical work in evolutionary psychology.
Journal of Abnormal Child Psychology
Read full-text articles focused on child psychotherapy, prevention, assessment, and treatment. Research of interest covers childhood disorders such as developmental disorders, depression, and anxiety.
Journal of Abnormal Psychology
Read selected articles on topics in abnormal psychology from this journal published by the American Psychological Association.
Journal of Applied Behavior Analysis
Read current and past research on applied behavior analysis in back issues of this journal.
Journal of Experimental Psychology
This journal, published by the American Psychological Association, offers a selection of full-text journal articles on topics in experimental psychology.
Journal of General Psychology
Offers full-text articles on a variety of topics in psychology. A great reference for psychology students.
Journal of Instructional Psychology
The Journal of Instructional Psychology provides articles and essays on education, the psychology of learning, and instruction.
Journal of Neuroscience
The Journal of Neuroscience offers full-text journal articles in its archive starting in 1996.
Full-text access is available for articles 1.5 years after publication.
Learning and Memory
This journal, which focuses on the neurobiology of learning and memory, offers access to articles one year after publication.
Psychart is an online journal focused on the psychological study of the arts. Articles are primarily focused on psychoanalytic theory, literature, and film.