Scraping weather data from the internet with R and the tidyverse (CC231)

R has powerful but simple tools that allow for easy scraping of the internet. In this episode, Pat shows you how to track down local weather data from the NOAA website and make it accessible in your R session in RStudio using tools from the tidyverse, including dplyr, lubridate, and more.
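For reference, the GHCN-Daily station inventory used in the episode is a fixed-width text file. A minimal sketch of pulling it into a tibble with readr might look like the following (the URL reflects the current NCEI directory layout and may move, so treat the path as an assumption):

```r
library(readr)

# GHCN-Daily station inventory (path may change; check the NCEI
# GHCN-Daily directory if this 404s)
inventory_url <- "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-inventory.txt"

# Column positions follow the GHCN-Daily readme for ghcnd-inventory.txt
inventory <- read_fwf(
  inventory_url,
  fwf_cols(station   = c(1, 11),
           latitude  = c(13, 20),
           longitude = c(22, 30),
           element   = c(32, 35),
           start     = c(37, 40),
           end       = c(42, 45))
)
```

Each row gives one station/element pair with its first and last year of record, which is the table the "find the closest station" steps filter on.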

#tidyverse #R #Rstudio #reproducibility #Rstats

You can also find complete tutorials for learning R with the tidyverse using...

0:00 Introduction
1:04 Finding weather station data at NOAA
8:06 Finding the closest weather station
17:47 Get and tidy local weather station data
Comments

Thanks a ton, Sir. I am in Germany and I was able to get the latitude and longitude for my place. This is so incredible.

MrMandarpriya

Your tutorials are great. I have a purely wet-lab biology background and your videos helped me kickstart my computational biology literacy. Thank you for openly sharing your knowledge.

davidmantilla

For everybody having a hard time with parentheses, like Pat does at 13:00:

Tools -> "Global Options" -> "Code" -> switch to the "Display" tab and tick "Rainbow parentheses".

svenr

This tutorial is very interesting in practice. I managed to run the entire code, but using my local latitude and longitude as you suggested. It did work. The variables I was interested in were TMAX and PRCP; in Rwanda we do not have SNOW. Thanks a lot.

NdengoMarcel

This is my favorite video of yours. It is so useful for what I want to do. Thanks!

erichill

Excellent! There's one station in my city!


Looks like a fun assignment to create a Shiny dashboard containing time-series plots of this data.

djangoworldwide

Great episode as always! I just finished a course on German raster data with some students :) !

svenr

I love this, use the rainbow parentheses btw!!

zjardynliera-hood

This is great! I like how you build it up and have a specific goal in mind. This is also a problem any of us can tackle since the data is readily available.

I typically write my own code for these sorts of exercises (since at least I can understand my own code); that is how I learn best. I came up with a slightly different way of finding my closest weather station. I wrote a couple of functions to do this and tested the distance on Houston-Chicago and got pretty close. Here is how I tackled the problem.

I set up two functions to run inside the tidyverse, so I used rlang (hence the enquo() and the bang-bang !!).

The first function converts to radians:

radians_func <- function(df, longitude, latitude) {
  longitude <- enquo(longitude)
  latitude <- enquo(latitude)
  mutate(df,
         long_rad = !!longitude / (180 / pi),
         lat_rad = !!latitude / (180 / pi)
  )
}

and another for distance (in meters):

dist_func <- function(df, my_long, my_lat) {
  my_long <- enquo(my_long)
  my_lat <- enquo(my_lat)
  mutate(df,
         my_long_rad = !!my_long / (180 / pi),
         my_lat_rad = !!my_lat / (180 / pi),
         dist = acos(
           sin(lat_rad) * sin(my_lat_rad) +
             cos(lat_rad) * cos(my_lat_rad) * cos(my_long_rad - long_rad)
         ) * 6371000
  )
}

These may look long but now the main code is quite simple to get the station:

weather_inventory %>%
  radians_func(longitude, latitude) %>%
  dist_func(my_long, my_lat) %>%
  filter(end - start > 100) %>%
  slice_min(dist) %>%
  distinct(station) %>%
  pull(station)

My closest station was about 500 m from my current location but has only operated for a couple of years. The filter gave me another station about 4 km away with a more extensive record. I decided to filter for stations with over a 100-year record (although it is not clear what kind of record that is).

It seems like the search should be more focused, though. What are we after? Temperature, it seems, and that appears to be the one variable most often measured.

haraldurkarlsson
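As an aside, the spherical-trigonometry math in the functions above can also be delegated to the geosphere package, which has the great-circle formulas built in. A sketch under the same assumptions (a weather_inventory table with longitude/latitude columns; my_long and my_lat are placeholder coordinates):

```r
library(dplyr)
library(geosphere)

# Your own coordinates (placeholders for illustration)
my_long <- -87.6
my_lat <- 41.9

# distHaversine() returns great-circle distances in meters
closest <- weather_inventory %>%
  mutate(dist = distHaversine(cbind(longitude, latitude),
                              c(my_long, my_lat))) %>%
  slice_min(dist) %>%
  pull(station)
```

This avoids hand-rolling the radians conversion, at the cost of one extra dependency.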

I had no trouble pulling up data for my best neighborhood station. However, my question is about the temperature: what is the unit? Kelvin?

haraldurkarlsson
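For anyone else wondering about units: the GHCN-Daily readme documents TMAX/TMIN in tenths of degrees Celsius and PRCP in tenths of millimetres, so a divide-by-ten step is usually needed. A minimal sketch (the lowercase column names are assumptions carried over from the tidying step in the tutorial):

```r
library(dplyr)

# Hypothetical tidied data with tmax/tmin/prcp columns as integers
weather_data <- tibble(tmax = c(289L, 311L),
                       tmin = c(150L, 172L),
                       prcp = c(0L, 25L))

# GHCN-Daily stores temperatures in tenths of degrees C and
# precipitation in tenths of mm; divide by 10 for natural units
weather_data <- weather_data %>%
  mutate(tmax_c = tmax / 10,
         tmin_c = tmin / 10,
         prcp_mm = prcp / 10)
```

So a raw TMAX of 289 is 28.9 °C, not an implausibly hot day.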

I'm having issues finding the same website as shown at 1:45 and beyond. Any info on how the path has changed from a year ago?

lancesnodgrass

Pat,
I used vroom to read in the file; it read fast and detected the columns. The only thing I had to do was clean the column names.

haraldurkarlsson

I might be wrong, but meh, I'm just gonna make this assumption.
Science in a nutshell 😅
Great tutorial, sir. I always enjoy your videos since I learn so much more than what I came for. (Might you elaborate on top_n? I couldn't quite grasp that one.)

djangoworldwide
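On the top_n question: top_n(df, n, wt) keeps the n rows with the largest values of wt (ties are all kept, and a negative n selects the smallest values); newer dplyr code tends to prefer slice_max()/slice_min(), which do the same job more explicitly. A small sketch with made-up data:

```r
library(dplyr)

df <- tibble(station = c("A", "B", "C"),
             dist = c(5, 2, 9))

# Row with the largest dist (station C)
top_n(df, 1, dist)

# Modern equivalent for the smallest dist (station B)
df %>% slice_min(dist, n = 1)
```

In the closest-station context, slice_min(dist) is exactly the "pick the row with the smallest distance" step.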

TMAX looks very high, is that combining rows?

kmbrahm

I must add that vroom read it in fast (lazy loading, I suspect), but I'm not so sure about the column allocations. It seems to have created new columns with mixed-type data, so be aware.

haraldurkarlsson

Pat,
Web scraping has, at least in my mind, a different meaning than what you are doing here; it usually involves rvest etc. The title might be misleading for those looking for actual web scraping.

haraldurkarlsson