Abstract. Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles yielding a significant improvement over alternative methods.
Keywords: Scholarly Data, Ontology Learning, Bibliographic Data, Scholarly Ontologies, Text Mining, Topic Detection.
Paper:
Gold Standard
We built our gold standard by asking 21 domain experts to classify 70 papers in terms of topics drawn from the CSO ontology.
We queried the MAG dataset and selected the 70 most cited papers published in 2007-2017 within the fields of “Semantic Web”, “Natural Language Processing”, and “Data Mining”. We then contacted 21 researchers in these fields at various levels of seniority and asked each of them to annotate 10 of these papers. We structured the data collection so that each paper was annotated by at least three experts, using majority vote to resolve disagreements. The papers were randomly assigned to experts, while minimising the number of papers shared between each pair of experts.
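As an illustration, the majority-vote step can be read as keeping every topic selected by more than half of a paper's annotators (i.e., at least 2 of 3). The sketch below is a minimal, hypothetical reading of that rule; the function name and the example topics are only illustrative.

```python
from collections import Counter

def majority_vote(rater_topic_lists):
    """Keep the topics selected by a strict majority of annotators.

    rater_topic_lists: one list of topics per annotator
    (three annotators per paper in this gold standard).
    """
    counts = Counter(topic for topics in rater_topic_lists for topic in set(topics))
    threshold = len(rater_topic_lists) / 2
    return sorted(t for t, c in counts.items() if c > threshold)

# A topic survives only if at least 2 of the 3 raters selected it.
print(majority_vote([
    ["semantic web", "ontology", "linked data"],
    ["semantic web", "linked data"],
    ["semantic web", "information retrieval"],
]))  # -> ['linked data', 'semantic web']
```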
The Gold Standard is a JSON file containing a dictionary of 70 items (papers). Each item is keyed by a 32-character alphanumeric paper id, and its value is in turn a dictionary structured as shown in the table below (a minimal loading example follows the table).
paper
Key | Type | Info |
---|---|---|
“doi” | string | DOI of the paper |
“title” | string | title of the paper |
“abstract” | string | abstract of the paper |
“keywords” | list | author keywords |
“doc_type” | string | type of document: whether it is a conference paper, a journal article, or other |
“topics” | list | Fields of Science identified by Microsoft Academic Graph. This information is not used during the process of classification. |
“source” | string | Source field, i.e. whether the paper comes from “Semantic Web”, “Natural Language Processing”, or “Data Mining” |
“citations” | numerical | Number of citations at the time of download |
“gold_standard” | dictionary | object containing the information obtained from the experts and the generated gold standard (see the gold_standard table below) |
“cso_output” | dictionary | object containing the output of the CSO Classifier (see the cso_output table below) |
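A paper entry can be read with standard JSON tooling. The snippet below is only a sketch: it assumes the released file is named gold-standard.json (the actual file name may differ) and simply prints a few of the fields listed above.

```python
import json

# Hypothetical file name; adjust to the actual name of the released file.
with open("gold-standard.json", encoding="utf-8") as f:
    gold_standard = json.load(f)  # dict: 32-character paper id -> paper record

for paper_id, paper in gold_standard.items():
    print(paper_id, "-", paper["title"])
    print("  DOI:     ", paper["doi"])
    print("  keywords:", ", ".join(paper["keywords"]))
    print("  source:  ", paper["source"], "| citations:", paper["citations"])
```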
gold_standard
Key | Type | Info |
---|---|---|
“relevant_rater_A” | list | relevant topics selected by the first expert during the annotation process |
“relevant_rater_B” | list | relevant topics selected by the second expert during the annotation process |
“relevant_rater_C” | list | relevant topics selected by the third expert during the annotation process |
“majority_vote” | list | set of topics selected using the majority vote approach over the relevant topics chosen by the experts |
“enhancement_majority_vote” | list | set of enhanced topics of the majority vote set |
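A minimal sketch of how the last two fields relate, assuming the enhancement step simply adds the direct super topics of the agreed topics (the super_topics mapping below is a placeholder for the actual CSO hierarchy, not the real data):

```python
def enhance(topics, super_topics):
    """Union of the agreed topics and their direct super topics in CSO.

    super_topics: dict mapping a topic to the list of its direct super
    topics in the ontology (placeholder for the real CSO structure).
    """
    enhanced = set(topics)
    for topic in topics:
        enhanced.update(super_topics.get(topic, []))
    return sorted(enhanced)

# e.g. enhance(["linked data"], {"linked data": ["semantic web"]})
# -> ['linked data', 'semantic web']
```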
cso_output
Key | Type | Info |
---|---|---|
“syntactic” | list | list of topics returned by the syntactic module |
“semantic” | list | list of topics returned by the semantic module |
“enhancement” | list | list of enhanced topics from the union of the result of semantic and syntactic module |
“final” | list | final set of topics from the CSO Classifier |
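To evaluate the classifier against the gold standard, one would typically compare the “final” topics with the enhanced majority-vote set. The following sketch computes set-based precision, recall, and F-measure for a single paper under that assumption; it is not the exact evaluation script used in the paper.

```python
def precision_recall_f1(predicted, relevant):
    """Set-based precision, recall, and F-measure for one paper."""
    predicted, relevant = set(predicted), set(relevant)
    tp = len(predicted & relevant)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# For each paper in the gold standard:
# p, r, f = precision_recall_f1(paper["cso_output"]["final"],
#                               paper["gold_standard"]["enhancement_majority_vote"])
```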
Errata Corrige
In a preprint of the paper “The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles” we used and released a version of this gold standard containing some additional topics, which were included by erroneously inferring all the super topics of the topics indicated by the annotators. As a result, a portion of the topics was essentially unretrievable by any of the tested methods. The error has now been corrected in the evaluation submitted to TPDL (link), in which both the approaches and the gold standard are compared after inferring only the direct super topics. As a result, all methods yield an increase in recall of about 5%. The difference between the tested algorithms remains significant and, as before, the most recent version of the CSO Classifier (2.0) yields an improvement of about 4% in F-measure over the previous version.