[Meta-Analysis] Mining Academic Papers from SCOPUS with Pybliometrics in Python


Image created by DALL·E 3

SCOPUS is one of the largest abstract and citation databases, providing access to a wide range of peer-reviewed literature across various disciplines. It ensures researchers have access to high-quality, up-to-date academic papers, conference proceedings, and other scholarly materials.

Pybliometrics is a Python library that streamlines the retrieval of bibliometric data from SCOPUS. It simplifies accessing and manipulating large datasets, saving researchers time and effort compared to manual data collection. Using Pybliometrics to mine academic papers from SCOPUS enables efficient data retrieval, advanced analysis, and comprehensive bibliometric studies, enhancing the quality and impact of academic research.

Today, I will introduce a method for mining academic papers from SCOPUS using Pybliometrics.


Prerequisite

1) Watch the video

Please watch this video to understand the whole picture. This post is based on the video.



2) Install Python on your PC

First, you need to install Python on your PC. You can download Python from the link below.

https://www.python.org/downloads/


3) Download Visual Studio

Second, you need to install Visual Studio on your PC. You can download Visual Studio from the link below.

https://visualstudio.microsoft.com

Then, add Python support to Visual Studio.

For mining academic papers from SCOPUS with pybliometrics, it is better to avoid the Jupyter Notebook environment, because the interactive API-key prompt does not work well there.

So, set up the environment by going to File > New File > Python File.


4) Install git

It is necessary to install Git. Please visit the link below to download Git.

https://git-scm.com/download/win

5) Install pybliometrics and gitpython

Then, let’s install two packages, pybliometrics and gitpython.

import subprocess
import sys

# Run pip through the current interpreter so the package installs into the right environment
subprocess.check_call([sys.executable, "-m", "pip", "install", "pybliometrics"])

When you run the above code, you first need to designate the folder where the Python code is saved. I have already created a folder named Python_OUTPUT, and I will choose this folder.

After installing pybliometrics, I’ll install gitpython.

import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "gitpython"])

Now, all environments are set up for mining academic papers.



[Step 1] Clone pybliometrics from github

I’ll run the following code. This process will clone the Pybliometrics repository from GitHub to your PC. The folder where you download the GitHub repository should be empty, so I created a new folder named pybliometrics inside C:\Users\kimjk\Desktop\Python_OUTPUT.

from git import Repo

# URL of the Git repository to clone
repo_url = 'https://github.com/computron/pybliometrics_ml.git'

# Local path where the repository will be cloned
path = r'C:\Users\kimjk\Desktop\Python_OUTPUT\pybliometrics' 

# Clone the repository
Repo.clone_from(repo_url, path)
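Note that Repo.clone_from raises an error if the target folder already exists and is not empty. A small guard, sketched below with only the standard library (is_valid_clone_target is a hypothetical helper, not part of gitpython), makes the step safe to re-run:

```python
import os
import tempfile

def is_valid_clone_target(path):
    # Repo.clone_from fails on an existing non-empty folder,
    # so only clone into a missing or empty one
    return not os.path.exists(path) or not os.listdir(path)

# Demonstration with a temporary folder standing in for the clone target
with tempfile.TemporaryDirectory() as tmp:
    print(is_valid_clone_target(tmp))   # empty folder: safe to clone into
    open(os.path.join(tmp, "dummy.txt"), "w").close()
    print(is_valid_clone_target(tmp))   # non-empty folder: skip the clone
```

In the real script, you would call Repo.clone_from(repo_url, path) only when the check returns True.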


[Step 2] Update the search terms according to your preferences

The pybliometrics repository will now be cloned to the path you set up on your PC. Let's check that all the files were cloned.

import subprocess

# Define the directory path
directory_path = r'C:\Users\kimjk\Desktop\Python_OUTPUT\pybliometrics'  

# Execute the dir command using subprocess
result = subprocess.run(['dir', directory_path], stdout=subprocess.PIPE, shell=True)

# Decode and print the result
print(result.stdout.decode())
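Since the dir command is Windows-only, os.listdir gives the same check portably. A minimal sketch, with the folder contents made up for the demonstration:

```python
import os
import tempfile

def list_cloned_files(directory_path):
    # Portable alternative to shelling out to `dir`
    return sorted(os.listdir(directory_path))

# Demonstration with a temporary folder standing in for the cloned repository
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "generate.py"), "w").close()
    open(os.path.join(tmp, "README.md"), "w").close()
    print(list_cloned_files(tmp))  # ['README.md', 'generate.py']
```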

Now I’ll modify generate.py. When you open the folder where the pybliometrics repository was cloned, you can find generate.py. Double-click it, and the file will open in Visual Studio.

I’d like to search for journal articles regarding wheat, yield, source, and sink. So, I modified generate.py as shown below. Don’t forget to save the modified generate.py.

if __name__ == "__main__":
    for year in range(2018, 2024):
        # make the folder to store the data for the year
        current_path = os.getcwd()
        folder_path = os.path.join(current_path, "output", str(year))
        if not os.path.exists(folder_path):
            os.makedirs(folder_path)

        # get the results
        x = ScopusSearch(
            f'TITLE-ABS-KEY ("wheat" OR "yield") AND ' \
            f'TITLE-ABS-KEY ("source" OR "sink") AND ' \
            f'TITLE ("wheat" OR "yield") AND ' \
            f'TITLE ("source" OR "sink") AND ' \
            f'DOCTYPE ("AR") ' \
            f'AND SRCTYPE(j) AND PUBYEAR = {year}', view="STANDARD")
        print(f"Year: {year}, Results count: {len(x.results)}")
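To adapt the search to other topics, it can help to build the query string from lists of terms instead of editing it by hand. Below is a minimal sketch (build_query is a hypothetical helper, not part of the repository); the field codes mirror the Scopus advanced-search syntax used in generate.py:

```python
def build_query(broad_terms, focus_terms, year):
    # Join each term list into a quoted OR group, as in generate.py
    broad = " OR ".join(f'"{t}"' for t in broad_terms)
    focus = " OR ".join(f'"{t}"' for t in focus_terms)
    return (f'TITLE-ABS-KEY ({broad}) AND '
            f'TITLE-ABS-KEY ({focus}) AND '
            f'TITLE ({broad}) AND '
            f'TITLE ({focus}) AND '
            f'DOCTYPE ("AR") AND SRCTYPE(j) AND PUBYEAR = {year}')

print(build_query(["wheat", "yield"], ["source", "sink"], 2020))
```

The resulting string can be passed to ScopusSearch in place of the hand-written query.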


[Step 3] Enter SCOPUS APIs for mining papers

First, let’s set up a folder to save the papers. I’ll create a folder named paper_mining inside C:\Users\kimjk\Desktop\Python_OUTPUT. In Visual Studio, click Open Folder, select the paper_mining folder, and then click Select Folder. This will change the terminal path, meaning the papers will be saved in this location.

Next, I’ll run generate.py using the following code.

import subprocess

# Define the path to the script
script_path = r'C:\Users\kimjk\Desktop\Python_OUTPUT\pybliometrics\generate.py'

# Run the script
subprocess.run(['python', script_path], check=True)
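One portability note: calling 'python' assumes it is on the PATH; sys.executable runs the script with the same interpreter as the current session, and check=True raises an error if the script fails. A sketch with a stand-in script (demo.py is a made-up file, not generate.py):

```python
import os
import subprocess
import sys
import tempfile

# Create a stand-in script so the call pattern can be shown end to end
with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "demo.py")
    with open(script_path, "w") as f:
        f.write('print("mining started")')

    # check=True raises CalledProcessError if the script exits with an error
    result = subprocess.run([sys.executable, script_path],
                            check=True, capture_output=True, text=True)
    print(result.stdout.strip())  # mining started
```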

When you run the above code, you will be prompted to enter your SCOPUS API key. You can create API keys at the website below.

https://dev.elsevier.com

After entering the API key, the following message pops up: "API Keys are sufficient for most users. If you have an InstToken, please enter the token now; otherwise just press Enter:"

When you press Enter, you will see that the paper mining is running.

Now, all the papers I wanted to search for have been downloaded to the folder I designated.



[Step 4] Convert JSON to RIS file

The downloaded papers are saved as JSON files. To import these papers into EndNote, I’ll convert them to an RIS file using the following code.

import os
import json

def json_to_ris(entry):
    # Every RIS record must start with a type tag (TY) and end with ER
    ris_entry = "TY  - JOUR\n"
    if 'title' in entry:
        ris_entry += f"TI  - {entry['title']}\n"
    if 'author' in entry:
        authors = entry['author'] if isinstance(entry['author'], list) else [entry['author']]
        for author in authors:
            ris_entry += f"AU  - {author}\n"
    if 'year' in entry:
        ris_entry += f"PY  - {entry['year']}\n"
    if 'journal' in entry:
        ris_entry += f"JO  - {entry['journal']}\n"
    if 'volume' in entry:
        ris_entry += f"VL  - {entry['volume']}\n"
    if 'issue' in entry:
        ris_entry += f"IS  - {entry['issue']}\n"
    if 'pages' in entry:
        ris_entry += f"SP  - {entry['pages']}\n"
    ris_entry += "ER  - \n"
    return ris_entry

def convert_folder_to_ris(folder_path, output_file):
    ris_entries = []
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.endswith(".json"):
                file_path = os.path.join(root, file)
                with open(file_path, 'r', encoding='utf-8') as f:
                    try:
                        json_data = json.load(f)
                        if isinstance(json_data, list):
                            ris_entries.extend([json_to_ris(entry) for entry in json_data])
                        elif isinstance(json_data, dict):
                            ris_entries.append(json_to_ris(json_data))
                        else:
                            print(f"Unexpected data structure in file {file_path}")
                    except json.JSONDecodeError as e:
                        print(f"Error decoding JSON from file {file_path}: {e}")
                    except Exception as e:
                        print(f"Error processing file {file_path}: {e}")
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write("\n".join(ris_entries))
    print(f"RIS file saved to {output_file}")

# Path to the folder containing JSON files
folder_path = r'C:\Users\kimjk\Desktop\Python_OUTPUT\paper_mining'
# Output RIS file path
output_file = r'C:\Users\kimjk\Desktop\Python_OUTPUT\paper_mining\output.ris'

# Convert JSON files to RIS
convert_folder_to_ris(folder_path, output_file)
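A quick sanity check of the field mapping on a single made-up record is useful before converting a whole folder. The function below is a condensed version of the converter above; the TY/TI/AU/PY/ER tags follow the RIS format, in which TY opens and ER closes each record:

```python
def json_to_ris(entry):
    # Condensed converter: TY opens and ER closes each RIS record
    ris = "TY  - JOUR\n"
    if "title" in entry:
        ris += f"TI  - {entry['title']}\n"
    for author in entry.get("author", []):
        ris += f"AU  - {author}\n"
    if "year" in entry:
        ris += f"PY  - {entry['year']}\n"
    ris += "ER  - \n"
    return ris

# A made-up record for illustration
sample = {"title": "Source-sink relations in wheat", "author": ["Kim, J."], "year": 2020}
print(json_to_ris(sample))
```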

Now, an RIS file has been created.
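To verify how many records made it into the file, one can count the ER terminators, since each RIS record ends with one. A small sketch on an inline sample (the sample text is made up):

```python
def count_ris_records(ris_text):
    # Each RIS record ends with an "ER  -" line, so counting them counts records
    return sum(1 for line in ris_text.splitlines() if line.startswith("ER  -"))

sample = (
    "TY  - JOUR\nTI  - First paper\nER  - \n"
    "\n"
    "TY  - JOUR\nTI  - Second paper\nER  - \n"
)
print(count_ris_records(sample))  # 2
```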



[Step 5] Import RIS file into EndNote

Let’s import the RIS file to EndNote. Go to File > Import > File, and choose the RIS file we created.

Then, click Import.

Now, all papers collected by Python according to our search terms have been saved to EndNote. These papers are for wheat, and I’ll move all of them to the ‘Wheat’ folder.

Now, I have 768 paper titles in the ‘Wheat’ folder. Although it only provides the titles, you can single out papers you might want to read. This method is much more efficient than searching in Google Scholar or other journal databases.

code summary: https://github.com/agronomy4future/python_code/blob/main/Mining_Academic_Papers.ipynb

Reference

https://github.com/computron/pybliometrics_ml

