Achieving the United Nations Sustainable Development Goals (SDGs) is the primary goal of the 2030 Agenda. A critical step towards that objective is determining whether scientific production is moving in this direction. Funders must make this assessment manually, which affects accuracy, scalability, and objectivity. For this reason, we propose in this work an AI-based model that automatically classifies scientific papers according to their impact on the SDGs.
The training database consists of texts manually extracted from the UN web page. After preprocessing these texts, we train three models: non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), and Top2Vec. Each model outputs the probability that a paper is associated with each SDG. We then combine their scores with a voting function to take advantage of their inherently different mathematical nature. To validate this methodology, we use the database provided by Vinuesa et al. (Nature Communications 11), which contains more than 150 papers, each labeled with at least one SDG. Using only the abstracts, we correctly identify a of the SDGs presented in a paper, while a better is obtained when fetching the complete paper information.
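As an illustration of how the per-model scores might be combined, the following is a minimal sketch of one possible voting function. The voting rule is not specified here, so the rule below is an assumption: a model "votes" for an SDG when its score reaches a threshold, and an SDG is assigned when enough models vote for it. The names `vote` and `make_scores`, the parameters `threshold` and `min_votes`, and all example scores are hypothetical.

```python
from typing import Dict, List

NUM_SDGS = 17  # the 2030 Agenda defines 17 SDGs


def make_scores(nonzero: Dict[int, float]) -> List[float]:
    """Build a 17-entry score vector from {sdg_number: score} (hypothetical helper)."""
    scores = [0.0] * NUM_SDGS
    for sdg, score in nonzero.items():
        scores[sdg - 1] = score
    return scores


def vote(
    model_scores: Dict[str, List[float]],
    threshold: float = 0.2,  # assumed cut-off, not taken from the paper
    min_votes: int = 2,      # assumed: at least two of the three models must agree
) -> List[int]:
    """Return the SDG numbers (1..17) backed by at least `min_votes` models."""
    votes = [0] * NUM_SDGS
    for name, scores in model_scores.items():
        if len(scores) != NUM_SDGS:
            raise ValueError(f"model {name!r} must provide {NUM_SDGS} scores")
        for i, score in enumerate(scores):
            if score >= threshold:
                votes[i] += 1  # this model votes for SDG i + 1
    return [i + 1 for i, count in enumerate(votes) if count >= min_votes]


# Made-up scores for a single paper, one vector per model.
example = {
    "nmf": make_scores({7: 0.6, 13: 0.3, 1: 0.1}),
    "lda": make_scores({7: 0.5, 13: 0.15, 9: 0.35}),
    "top2vec": make_scores({7: 0.7, 13: 0.3}),
}

print(vote(example))  # → [7, 13]
```

Requiring agreement between models is only one option; averaging the scores or weighting the models differently would be equally plausible choices under the same description.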
Moreover, we find that the additional SDGs identified by the models, which were not among the labels, are also related to the content of the texts. We recognize that more training files are required for the other cases, since they are based on more complex human reasoning. We open-source these databases and trained models to enable future research in this field and to allow public institutions to use this tool.