Automating PubMed Article and Manuscript Data Retrieval with Python Using the NCBI API

This brief tutorial dives into automating the retrieval of PubMed data using Python: add as many criteria as you want and generate every combination of search queries. I will guide you through the entire process: setting up the Entrez API (which, in our case, just means providing an email address in our code), building search lists, fetching article data, and converting the data into a usable pandas DataFrame (and ultimately an Excel file). This video is perfect for researchers, data scientists, and anyone interested in bioinformatics who wants to streamline their data collection process.

If you are familiar with XML, this will be a breeze. If you aren't, I will show you a Python module that helps you parse the XML to get the data you want.

What you'll learn:

- Setting Up Entrez API: Learn how to configure and use the Entrez API from NCBI to access PubMed data.
- Fetching Data: I will show you how to construct queries to fetch data based on authors, topics, and date ranges. Of course, you can search by other parameters (titles, journals, etc.).
- Parsing XML Data: See how to handle and parse XML data directly from PubMed using Python libraries.
- Conversion to DataFrame: Convert the parsed data into a pandas DataFrame for easier manipulation and analysis.
- Error Handling: Learn how to effectively handle common errors such as data type mismatches and missing XML elements.
- Practical Tips: Gain practical tips for optimizing your data retrieval script to handle large datasets and avoid IP blocks.
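
As a rough illustration of the query-building step, the sketch below generates one search string per author, topic, and date-range combination. The author names, topics, and the `build_queries` helper are hypothetical and not taken from the video; the `[Author]`, `[Title/Abstract]`, and `[dp]` (publication date) field tags follow PubMed's standard search syntax.

```python
from itertools import product


def build_queries(authors, topics, date_ranges):
    """Return one PubMed query string per (author, topic, date range) combination."""
    queries = []
    for author, topic, (start, end) in product(authors, topics, date_ranges):
        queries.append(
            f"({author}[Author]) AND ({topic}[Title/Abstract]) "
            f"AND ({start}:{end}[dp])"
        )
    return queries


# Hypothetical search lists, for illustration only.
authors = ["Smith J", "Lee K"]
topics = ["diabetes", "hypertension"]
date_ranges = [("2020/01/01", "2023/12/31")]

queries = build_queries(authors, topics, date_ranges)
for query in queries:
    print(query)
```

Each string could then be passed as the `term` argument to Biopython's `Entrez.esearch(db="pubmed", term=query, retmax=...)` once `Entrez.email` has been set.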

Tools Used:

- Python: A powerful programming language for handling and analyzing data.
- BioPython's Entrez: A specialized module for interacting with NCBI databases like PubMed.
- pandas: A data manipulation toolkit in Python that allows for easy data cleaning and analysis.
- (Optional) XML ElementTree: A neat tool to break down nested XML and make it easier to extract information.
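
To give a feel for the ElementTree approach, here is a minimal sketch that pulls a few fields out of a small, hand-written XML sample shaped like a PubMed record. The sample records are invented, and the `parse_articles` and `pub_date_text` helpers are hypothetical rather than the code from the video. Missing elements fall back to default values instead of raising errors.

```python
import xml.etree.ElementTree as ET

# Invented sample that mimics the nesting of a PubMed article set.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Journal>
          <Title>Example Journal</Title>
          <JournalIssue>
            <PubDate><Year>2021</Year><Month>Mar</Month></PubDate>
          </JournalIssue>
        </Journal>
        <ArticleTitle>A made-up title</ArticleTitle>
        <Abstract><AbstractText>Made-up abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000002</PMID>
      <Article>
        <Journal>
          <Title>Another Journal</Title>
          <JournalIssue>
            <PubDate><MedlineDate>2019 Jan-Feb</MedlineDate></PubDate>
          </JournalIssue>
        </Journal>
        <ArticleTitle>No abstract here</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def pub_date_text(article):
    """Join Year/Month/Day when present, else fall back to MedlineDate."""
    pub_date = article.find(".//JournalIssue/PubDate")
    if pub_date is None:
        return "Unknown"
    parts = [pub_date.findtext(tag) for tag in ("Year", "Month", "Day")]
    joined = " ".join(part for part in parts if part)
    return joined or pub_date.findtext("MedlineDate", default="Unknown")


def parse_articles(xml_text):
    """Return one dict per PubmedArticle element in the XML text."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("PubmedArticle"):
        rows.append({
            "PMID": article.findtext(".//PMID", default=""),
            "Title": article.findtext(".//ArticleTitle", default="No title"),
            "Journal": article.findtext(".//Journal/Title", default="Unknown"),
            "PubDate": pub_date_text(article),
            "Abstract": article.findtext(".//Abstract/AbstractText", default="No abstract"),
        })
    return rows


rows = parse_articles(SAMPLE_XML)
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be handed straight to `pandas.DataFrame(rows)` to get one row per article.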

Whether you're new to programming or looking to enhance your data retrieval skills, this tutorial will provide you with the knowledge to automate and simplify your PubMed data interactions. Don't forget to subscribe for more tutorials like this and hit the like button if you find this video helpful!

TIMELINE

0:00 - Intro
0:04 - Overview
0:32 - Getting our modules
1:00 - Create a new Python file
1:22 - Our Pub Date Function
1:58 - NCBI Entrez API Email
2:59 - Author, topics, and date ranges
3:51 - Our mechanism for running queries
4:27 - Creating our dataframe
4:39 - Processing each query through a loop
5:33 - Go through each PMID, obtain article XML
5:42 - Extract data from article
7:20 - Defining data elements in our dataframe
7:40 - Time delay to prevent overwhelming the server
7:51 - Take out duplicates by PMID
8:01 - The process for exporting our results to Excel
8:15 - Run our code (and explaining the BiopythonWarning)
9:01 - Taking a look at our results
9:13 - Beef up our numbers to 50 per query
9:39 - Solution to understand XML structure
11:22 - One last glance
11:33 - Wrapping up
11:49 - Outro
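
The delay and de-duplication steps in the timeline can be sketched roughly as follows. This is not the code from the video: `fake_fetch` is a hypothetical stand-in for the real search-and-fetch calls, its rows are invented, and the 0.34-second pause is an assumption based on NCBI's guideline of at most three requests per second without an API key.

```python
import time

# Hypothetical stand-in for the real search-and-fetch step; the rows are invented.
FAKE_RESULTS = {
    "query A": [{"PMID": "101", "Title": "First"}, {"PMID": "102", "Title": "Second"}],
    "query B": [{"PMID": "102", "Title": "Second"}, {"PMID": "103", "Title": "Third"}],
}


def fake_fetch(query):
    """Return the invented rows for a query, or an empty list."""
    return FAKE_RESULTS.get(query, [])


def collect_unique(queries, fetch, delay_seconds=0.34):
    """Run each query, pause between queries, and keep the first row seen per PMID."""
    unique = {}
    for query in queries:
        for row in fetch(query):
            unique.setdefault(row["PMID"], row)
        time.sleep(delay_seconds)  # be polite to the server between queries
    return list(unique.values())


rows = collect_unique(["query A", "query B"], fake_fetch)
print([row["PMID"] for row in rows])
```

With the results in a pandas DataFrame instead, `df.drop_duplicates(subset="PMID")` does the same de-duplication, and `df.to_excel(...)` writes the Excel file.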

Comments

This is incredibly helpful to automate large searches for academic articles - thank you for posting and sharing your code!!

kelleyrivenburgh

I had a personal project idea that would need quite a few research articles relating to some particular topics and was wondering how I could collect the data-- thank you so much for posting and sharing the code!

NishaKa

This is awesome! I'd love to see a video on doing this in R too!

quinnbeltramo

Why does mine say "[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1000)"?

hahainfinity

hi, is there a way to automate downloading the pdf's for a specific keyword? as the only thing that seems of use to me here from the API is the abstract but it's not enough.
thanks!

wafflebutsad

amazing video, so the H-index is basically high impact factor? how would you filter if you wanted to look for this?

1. highest amount of citations.
2. Highest impact factor journals
3. Peer reviewed
4. High Consensus or International Alliance
5. Highest evidence of evidence based guidelines

nealdriscoll