The amount of data generated by public- and private-sector organizations has increased manyfold in the last decade. Consumers and providers of data alike face a growing challenge in managing the quantity and quality of the information produced. The advent of cloud technologies has been a boon for the big data era, offering a solution to information overload. While cloud technologies offer an excellent opportunity, the challenges and opportunities of using them are still being explored. The complex business and infrastructure aspects of the cloud paradigm, together with rapid technical change, have made transitions complex and at times confusing. In this session, we hope to share case studies of migrating to and using cloud technologies for data-intensive science. We hope the challenges and opportunities revealed by those case studies will inform stakeholders, collaborators, and other interested parties, and that the lessons learned will inform future work and help expedite progress in the field of Earth Science informatics.
Developing Applications Using Earth Science Data in the AWS Cloud with PODPAC
Matt Ueckermann
Observational and modeled data products from NASA encompass petabytes of scientific data available for analysis, analytics, and exploitation. Unfortunately, these data sets are highly underutilized by the scientific community due to: (1) vast computational resource requirements; (2) disparate formats, projections, and resolutions that hinder data fusion and integrated analyses across different data sets; (3) complex and disjoint data access and retrieval protocols; and (4) task-specific and non-reusable code development processes that hinder algorithm sharing and collaboration. In response, NASA EOSDIS is actively investigating migration of its vast data archives to storage on commercial cloud services such as Amazon Web Services (AWS). However, to maximize the benefit of cloud-based data storage, cloud-based data analysis and analytics are needed to process data “close” to where it is stored. Recognizing that migrating workflows to the cloud requires a high degree of cloud computing expertise, we are developing the Pipeline for Observational Data Analysis and Collaboration (PODPAC). PODPAC is a Python library designed to automatically harmonize disparate data sources, seamlessly access NASA Earth science data, and analyze data in the AWS cloud. PODPAC is built around the tools of the Python data ecosystem (NumPy, SciPy, xarray) and aims to bridge the gap between data sources, analysis, and the cloud. In this talk, we will introduce PODPAC and demonstrate on-demand cloud computation of a value-added derived product using NASA data.
Opportunities for Accelerating Science in the Cloud
Christopher Lynnes
As the data holdings of the Earth Observing System Data and Information System (EOSDIS) expand over the next several years, the typical data analysis process of downloading data to local compute resources will become increasingly inefficient. Cloud computing promises to mitigate this by allowing the user to process data close to where it is stored. These improvements will be obtained via a variety of mechanisms: (1) improving the ability of data transformation services to reduce the data prior to analysis; (2) providing cloud-native analysis capabilities for common analysis functions; and (3) providing the ability to work directly with data in Web Object Storage.
The role of data stewards in a cloud-based platform
Amanda Leon
Google Earth Engine has a growing user community as a cloud-based platform for analysis and visualization of geospatial data. This adoption is heavily driven by the ease of access Earth Engine’s Data Catalog provides to a wealth of satellite imagery and other geospatial data. As stewards of NASA EOSDIS data, Distributed Active Archive Centers (DAACs) can play a key role in supporting and maximizing the utility of Earth Engine for the scientific community. The NSIDC DAAC has been assessing various data stewardship topics to support the sustainment and expansion of NASA EOSDIS data in Google Earth Engine including: 1) data inclusion decisions based on science use cases; 2) optimized workflows for preparing
Open Source Data-Intensive Platform for the Cloud
Thomas Huang
JPL has a long history of building innovative solutions for onboard instruments, ground operations and data systems, and archive and distribution for our missions. As the rate of data generated by our missions continues to increase, and is expected to rise significantly in the near future, JPL is engaging in reusable data-intensive technologies for mission operations and to enable science. This talk discusses the open source solutions we have developed for the cloud platform to address three challenges posed by our growing collections of scientific data: interactive analysis, in situ match-up, and search relevancy, and their applications.
Developing a roadmap for cloud services
Suresh Vannan
The Physical Oceanography Distributed Active Archive Center (PO.DAAC) will be the data repository for the Surface Water and Ocean Topography (SWOT) mission. SWOT presents PO.DAAC with new challenges and opportunities: a large data volume (20 TB/day) and a new community of users (hydrologists). This presentation will show how PO.DAAC plans to address them. PO.DAAC first assessed what tools and services current and new users will need to discover, access, and use SWOT data. That analysis informed a roadmap showing which services PO.DAAC (and ESDIS) will migrate to or develop in a cloud-based environment for the user community.
Leveraging an interoperable scalable data platform to support Earth Observation Data
Sudhir Raj Shrestha (sshrestha@esri.com)
The ever-increasing wealth of scientific data produced from various sources and platforms, including Earth observations, models, and forecasts, brings exciting and challenging opportunities to turn such vast amounts of data into valuable information products. These data are widely used by government agencies such as NOAA, NASA, and USGS, and by private industry, for monitoring and analysis of measurements associated with physical, chemical, and biological phenomena across Earth’s oceans, atmosphere, and land masses. The volume, diversity, and complexity of multidimensional Earth science data have in the past posed challenges in how the data are shared with a diverse community, visualized intuitively, and integrated to answer scientific questions. With advances in geospatial science and technology, these data and analytics can now be hosted in the cloud. This will have a tremendous impact on how scientists, policy makers, and the public ingest, manage, analyze, visualize, and share complex scientific data. GIS software is evolving in step with the technology industry to help meet these challenges. In this presentation, I will briefly discuss how current technology trends are driving more scalable, interoperable, and format-agnostic capabilities. We will share how the ArcGIS platform supports this “Open Science” and present use cases in place at NOAA and NASA. We will also share recent advancements in the cloud, spatial machine learning, and geospatial data science that support various domains of science applications.
Session recording here.