When data is shared in the cloud, anyone can analyze it without having to download it or store it themselves, which lowers the cost of new product development, reduces the time to scientific discovery, and can accelerate innovation. However, staging large-scale datasets for analysis in the cloud requires consideration of how data should be prepared and organized to allow fast, efficient, and programmatic access from distributed computing systems. This workshop provides a forum for members of the community to share lessons learned as they explore ways to use the cloud to expand data access. It seeks to encourage dialog between users interested in leveraging data in the AWS Cloud for research and application development for Earth Sciences.
Workshop Format: Workshop includes 1.5 hours of presentations (Cloud Data Optimization: Emerging Best Practices I) followed by 1.5 hours of discussion on emerging best practices and identifying needs to move this space forward.
Presentations (10 minutes each)
Full abstracts can be found in the attached file.
- Title: STAC, sat-utils, and Open Data - Prioritizing Data Use (10 min)
Presenter: Dan Pilone (Element 84)
- Title: Radiant ML Hub, A cloud based commons for geospatial training datasets (10 min)
Presenter: Hamed Alemohammad (Radiant Earth Foundation)
Slides: https://doi.org/10.6084/m9.figshare.9696446
- Title: One data format pattern to rule them all (10 min)
Presenter: Grega Milcinski (Sinergise)
Slides: https://doi.org/10.6084/m9.figshare.9121991
- Title: Improved Cloud Raster Format for multidimensional raster storage and analysis (10 min)
Presenters: Hong Xu (Esri) & Sudhir Raj Shrestha (Esri)
Slides: https://doi.org/10.6084/m9.figshare.9762866
- Title: Optimization of CESM LENS on AWS S3 (10 min)
Presenter: Jeff de La Beaujardiere (NCAR)
Slides: https://doi.org/10.6084/m9.figshare.9633314
- Title: The Zarr format
Presenter: Rich Signell (USGS)
Slides: https://doi.org/10.6084/m9.figshare.9701684
- Title: NOAA’s Big Data Project - A Data Broker’s Perspective
Presenter: Otis Brown (NC State University/NCICS)
Slides: https://doi.org/10.6084/m9.figshare.9693776
- Title: HDF Data Service for the Cloud
Presenter: John Readey (The HDF Group)
Session Take-Aways
- Moving to cloud infrastructure offers a chance to reevaluate best practices. Some of these (e.g., data formats) may not be purely cloud-related, but the discussions are coming along for the ride!
- It is unclear who will own the cloud-optimized datasets, and ownership will likely differ from dataset to dataset. Until cloud-optimized formats become the norm (if they ever do), they may often be provided by other groups (or created on the fly).
- These conversations focus heavily on datasets, but we also need to focus on tooling/services and education.