[Meta-Analysis] Mining Academic Papers from SCOPUS with Pybliometrics in Python
SCOPUS is one of the largest abstract and citation databases, providing access to a wide range of peer-reviewed literature across various disciplines. It ensures researchers have access to high-quality, up-to-date academic papers, conference proceedings, and other scholarly materials.
Pybliometrics is a Python library that streamlines the retrieval of bibliometric data from SCOPUS. It simplifies accessing and manipulating large datasets, saving researchers time and effort compared to manual data collection. Using Pybliometrics to mine academic papers from SCOPUS enables efficient data retrieval, advanced analysis, and comprehensive bibliometric studies, enhancing the quality and impact of academic research.
Today, I will introduce a method for mining academic papers from SCOPUS using Pybliometrics.
■ Prerequisite
1) Watch the video
Please watch this video to understand the whole picture. This post is based on the video.
2) Install Python to your PC
First, you need to install Python on your PC. You can download Python from the link below.
https://www.python.org/downloads/
3) Download Visual Studio
Second, you need to install Visual Studio on your PC. You can download Visual Studio from the link below.
https://visualstudio.microsoft.com
After installation, add the Python development workload to Visual Studio.
For mining academic papers from SCOPUS with pybliometrics, it is better to avoid the Jupyter Notebook environment, as the interactive API key prompt does not work there. Instead, set up the environment by going to File > New File > Python File.
4) Install git
It is necessary to install Git. Please visit the link below to download Git.
https://git-scm.com/download/win
5) Install pybliometrics and gitpython
Then, let’s install two packages, pybliometrics and gitpython.
import subprocess
subprocess.check_call(["pip3", "install", "pybliometrics"])
When you run the above code, you first need to designate the folder where the Python code is saved. I have already created a folder named Python_OUTPUT, and I will choose this folder.
After installing pybliometrics, I’ll install gitpython.
import subprocess
subprocess.check_call(["pip3", "install", "gitpython"])
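As a side note, invoking pip through the running interpreter (sys.executable -m pip) guarantees the packages land in the same environment the script uses, even when several Python installations coexist. A minimal sketch of that alternative:

```python
import subprocess
import sys

def pip_install_cmd(package: str) -> list[str]:
    # Build the install command for the pip tied to this interpreter
    return [sys.executable, "-m", "pip", "install", package]

def pip_install(package: str) -> None:
    subprocess.check_call(pip_install_cmd(package))

# Show the command that would be run, without actually installing
print(pip_install_cmd("pybliometrics"))
```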
Now, all environments are set up for mining academic papers.
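If you want to confirm that both packages are importable before moving on, a quick check could look like this (note that gitpython is imported under the module name git):

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    # find_spec returns None when the module cannot be imported
    return importlib.util.find_spec(module_name) is not None

# gitpython is imported under the name "git"
for name in ("pybliometrics", "git"):
    print(name, "OK" if is_installed(name) else "MISSING")
```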
[Step 1] Clone pybliometrics from GitHub
I’ll run the following code. This process will clone the pybliometrics repository from GitHub to your PC. The folder where you download the GitHub repository should be empty, so I created a new folder named pybliometrics inside C:\Users\kimjk\Desktop\Python_OUTPUT.
from git import Repo
# URL of the Git repository to clone
repo_url = 'https://github.com/computron/pybliometrics_ml.git'
# Local path where the repository will be cloned
path = r'C:\Users\kimjk\Desktop\Python_OUTPUT\pybliometrics'
# Clone the repository
Repo.clone_from(repo_url, path)
[Step 2] Update the search terms according to your preferences
Then, the pybliometrics repository will be cloned to the path you set up on your PC. Let’s check that all the data was cloned.
import subprocess
# Define the directory path
directory_path = r'C:\Users\kimjk\Desktop\Python_OUTPUT\pybliometrics'
# Execute the dir command using subprocess
result = subprocess.run(['dir', directory_path], stdout=subprocess.PIPE, shell=True)
# Decode and print the result
print(result.stdout.decode())
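The dir command above is Windows-only. If you prefer a cross-platform check, os.listdir does the same job; the path in the commented-out call is the one from this walkthrough:

```python
import os

def list_folder(path: str) -> list[str]:
    # Sorted entries of the folder; works on Windows, macOS, and Linux
    return sorted(os.listdir(path))

# Path from this walkthrough; uncomment to inspect the cloned repository:
# print(list_folder(r'C:\Users\kimjk\Desktop\Python_OUTPUT\pybliometrics'))
```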
Now I’ll modify generate.py. When you open the folder where the pybliometrics repository was cloned, you can find generate.py. Double-click it, and the file will open in Visual Studio.
I’d like to search for journal articles regarding wheat, yield, source, and sink, so I modified generate.py as shown below. Don’t forget to save the modified generate.py.
if __name__ == "__main__":
    for year in range(2018, 2024):
        # make the folder to store the data for the year
        current_path = os.getcwd()
        folder_path = os.path.join(current_path, "output", str(year))
        if not os.path.exists(folder_path):
            os.makedirs(folder_path)

        # get the results
        x = ScopusSearch(
            'TITLE-ABS-KEY ("wheat" OR "yield") AND '
            'TITLE-ABS-KEY ("source" OR "sink") AND '
            'TITLE ("wheat" OR "yield") AND '
            'TITLE ("source" OR "sink") AND '
            'DOCTYPE ("AR") '
            f'AND SRCTYPE(j) AND PUBYEAR = {year}',
            view="STANDARD")

        print(f"Year: {year}, Results count: {len(x.results)}")
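The search string passed to ScopusSearch is plain text built from Scopus field codes (TITLE-ABS-KEY, TITLE, DOCTYPE, SRCTYPE, PUBYEAR), so you can assemble and inspect it without calling the API. A small sketch mirroring the query above:

```python
def build_query(year: int) -> str:
    # Assemble the Scopus advanced-search query used in generate.py
    return (
        'TITLE-ABS-KEY ("wheat" OR "yield") AND '
        'TITLE-ABS-KEY ("source" OR "sink") AND '
        'TITLE ("wheat" OR "yield") AND '
        'TITLE ("source" OR "sink") AND '
        'DOCTYPE ("AR") '
        f'AND SRCTYPE(j) AND PUBYEAR = {year}'
    )

# Print the query for one year to check it before spending API quota
print(build_query(2020))
```

Printing the string first is a cheap way to catch quoting or operator mistakes before spending API quota on a malformed query.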
[Step 3] Enter the SCOPUS API key for mining papers
First, let’s set up a folder to save the papers. I’ll create a folder named paper_mining inside C:\Users\kimjk\Desktop\Python_OUTPUT. In Visual Studio, click Open Folder, select the paper_mining folder, and then click Select Folder. This will change the terminal path, meaning the papers will be saved in this location.
Next, I’ll run generate.py using the following code.
import subprocess
# Define the path to the script
script_path = r'C:\Users\kimjk\Desktop\Python_OUTPUT\pybliometrics\generate.py'
# Run the script
subprocess.run(['python', script_path], check=True)
When you run the above code, you will be prompted to enter your SCOPUS API key. You can create API keys on the website below.
https://dev.elsevier.com
After entering the API key, the following message pops up: “API Keys are sufficient for most users. If you have an InstToken, please enter the token now; otherwise just press Enter:”
When you press Enter, you will see that the paper mining is running.
Now, all the papers I wanted to search for have been downloaded to the folder I designated.
[Step 4] Convert JSON to RIS file
The downloaded papers are saved as JSON files. To import them into EndNote, I’ll convert them to an RIS file using the following code.
import os
import json

def json_to_ris(entry):
    ris_entry = ""
    if 'title' in entry:
        ris_entry += f"TI - {entry['title']}\n"
    if 'author' in entry:
        authors = entry['author'] if isinstance(entry['author'], list) else [entry['author']]
        for author in authors:
            ris_entry += f"AU - {author}\n"
    if 'year' in entry:
        ris_entry += f"PY - {entry['year']}\n"
    if 'journal' in entry:
        ris_entry += f"JO - {entry['journal']}\n"
    if 'volume' in entry:
        ris_entry += f"VL - {entry['volume']}\n"
    if 'issue' in entry:
        ris_entry += f"IS - {entry['issue']}\n"
    if 'pages' in entry:
        ris_entry += f"SP - {entry['pages']}\n"
    ris_entry += "ER - \n"
    return ris_entry

def convert_folder_to_ris(folder_path, output_file):
    ris_entries = []
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.endswith(".json"):
                file_path = os.path.join(root, file)
                with open(file_path, 'r', encoding='utf-8') as f:
                    try:
                        json_data = json.load(f)
                        if isinstance(json_data, list):
                            ris_entries.extend([json_to_ris(entry) for entry in json_data])
                        elif isinstance(json_data, dict):
                            ris_entries.append(json_to_ris(json_data))
                        else:
                            print(f"Unexpected data structure in file {file_path}")
                    except json.JSONDecodeError as e:
                        print(f"Error decoding JSON from file {file_path}: {e}")
                    except Exception as e:
                        print(f"Error processing file {file_path}: {e}")
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write("\n".join(ris_entries))
    print(f"RIS file saved to {output_file}")

# Path to the folder containing JSON files
folder_path = r'C:\Users\kimjk\Desktop\Python_OUTPUT\paper_mining'
# Output RIS file path
output_file = r'C:\Users\kimjk\Desktop\Python_OUTPUT\paper_mining\output.ris'
# Convert JSON files to RIS
convert_folder_to_ris(folder_path, output_file)
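To see what the field-to-tag mapping produces, you can exercise the same logic on a single in-memory record. The values below are hypothetical, and the function here is a trimmed re-statement of json_to_ris covering just three fields:

```python
def json_to_ris_minimal(entry):
    # Same tag mapping as json_to_ris, trimmed to title/author/year
    ris = ""
    if 'title' in entry:
        ris += f"TI - {entry['title']}\n"
    if 'author' in entry:
        authors = entry['author'] if isinstance(entry['author'], list) else [entry['author']]
        for author in authors:
            ris += f"AU - {author}\n"
    if 'year' in entry:
        ris += f"PY - {entry['year']}\n"
    ris += "ER - \n"  # every RIS record ends with an ER tag
    return ris

# Hypothetical record for illustration
sample = {"title": "Source-sink relations in wheat",
          "author": ["Kim, J.", "Lee, S."],
          "year": "2021"}
print(json_to_ris_minimal(sample))
```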
Now, an RIS file has been created as shown below.
[Step 5] Import RIS file into EndNote
Let’s import the RIS file into EndNote. Go to File > Import > File, choose the RIS file we created, and then click Import.
Now, all papers collected by Python according to our search terms have been saved to EndNote. These papers are for wheat, and I’ll move all of them to the ‘Wheat’ folder.
Now, I have 768 paper titles in the ‘Wheat’ folder. Although it only provides the titles, you can single out papers you might want to read. This method is much more efficient than searching in Google Scholar or other journal databases.
code summary: https://github.com/agronomy4future/python_code/blob/main/Mining_Academic_Papers.ipynb
■ Reference
https://github.com/computron/pybliometrics_ml