Data Rods: High Speed, Time-Series Analysis of Massive Data Sets Using Pure Object Database Methods


Data Rods: High Speed, Time-Series Analysis of Massive Data Sets Using Pure Object Database Methods
David Gallaher(1), Qin Lv(2), Glenn Grant(1), Garrett Campbell(1)
(1) National Snow and Ice Data Center, University of Colorado, Boulder, Colorado, 80309, USA
(2) Department of Computer Science, University of Colorado, Boulder, Colorado, 80309, USA

The National Snow and Ice Data Center
Mission: to monitor the climate data in Earth’s icy regions, and to analyze and distribute it worldwide 24x7. Focus is mainly NASA satellite data.
Manages and distributes scientific data
Supports data users
Performs scientific research
Creates tools for
Educates the public about the cryosphere
Affiliations and Sponsorship:
University of Colorado at Boulder
Cooperative Institute for Research in Environmental Sciences
World Data Center for Glaciology (since 1976)

Data Rods - Project Basis
The “Data Rods” project proposes to create a prototype of a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive multi-modality data sets.

Objective: Remote Sensing Data Analysis
The Problem:
Data sets are becoming too large to move over the internet
Need for basic Boolean logic for time-series anomaly detection
Data downloads for long time-series analysis are especially cumbersome

Analysis Challenges
A wide variety of data formats
Ever-increasing data set sizes
Myriad analysis and visualization requirements
There will be uses and analyses of the data that cannot be anticipated (data discovery is not enough)
Lack of direct access to the data values (e.g., albedo 15%)
Our current directory trees impede data access (we really need to consider a database)

“Big Data” Considerations
Search, order and transmission of data is ending. We must develop systems where the data stay fixed and analyses are rendered against them
Rapid, scalable data access across time and space
Direct query of the data, not just the metadata (we need more than what, where, when)
Web-based spatio-temporal analysis and visualization

Database Choice
Fast and efficient storage, query and retrieval of entire data sets, not just the metadata
Ability to store colossal amounts of small files
Relational databases can't handle it: the tables grow too big (object-relational is no better)
Hadoop excels at unstructured data, but due to its batch-oriented nature it is inefficient for real-time analytics and for intra-data analysis
A “pure-object” database is seen as the best choice

The Data Rods Project
The “Data Rods” project has created a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive data sets.
We’ll cover the following:
Database design
Status on development
Basic analysis examples and performance
Planned analysis and potential applications

Database Design
Gridded data is key. For consistency, NSIDC's Equal-Area Scalable Earth Grids (EASE-Grids) tool is used.
Common resolutions between data sets (1 km, 5 km, etc.) and point data

The nesting relationship of differing resolutions in EASE-Grid
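
One way to picture the nesting: if a 5 km grid and a 25 km grid share the same origin and the coarse cell size is an exact multiple of the fine one, each coarse cell contains a 5 x 5 block of fine cells, and the parent cell follows from integer division. The sketch below assumes that idealized alignment. The function names are hypothetical and this is not the EASE-Grid tool's API.

```python
def parent_cell(row, col, fine_km=5, coarse_km=25):
    """Return (row, col) of the coarse cell that contains a fine cell.

    Assumes both grids share an origin and nest exactly.
    """
    if coarse_km % fine_km != 0:
        raise ValueError("coarse resolution must be a multiple of the fine one")
    factor = coarse_km // fine_km
    return row // factor, col // factor


def child_cells(row, col, fine_km=5, coarse_km=25):
    """Return all fine-grid (row, col) cells nested inside one coarse cell."""
    if coarse_km % fine_km != 0:
        raise ValueError("coarse resolution must be a multiple of the fine one")
    factor = coarse_km // fine_km
    return [
        (row * factor + dr, col * factor + dc)
        for dr in range(factor)
        for dc in range(factor)
    ]
```

With a mapping like this, a 5 km pixel and a 25 km pixel covering the same ground can be paired without resampling either data set.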

Data Rods Concept
[Figure: data rods concept, with axes X coordinate, Y coordinate, and Time]

Database Systems Development: Object Database Design
[Figure: system diagram. Data Input: Passive Microwave, Visual Infrared, Active Microwave, Radar, Other. Other components: EASE-Grid Processing, Pixel Grid Sampling, Object Database Loading, Data Rod Updating, Data Rod Objects (axes X coordinate, Y coordinate, Time), Basic Data Management (query & index), Object Interface. Cryospheric Change Analysis: Pattern Search (input pattern or trend), Automated Pattern Discovery, Anomaly Detection, Trend Detection, Cycle Detection. User Interface: User Input.]

Pure-Object Database
Object persistence/instantiation is directly to/from the database; no Java Spring or Hibernate needed
Not object-relational (pure-object examples include Versant, ObjectDB, db4o, Objectivity)
Not as limited by size
Fast interactions across databases
Simple, efficient schema
Next: schema design

Object Database Schema
Each image pixel is an object
Data rods are time-series collections of pixels
Each data rod can be analyzed independently
Adjacency analysis by row/col or lat/lon
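
The schema can be pictured with two small classes: a pixel object and a data rod that holds the pixels for one grid cell in time order. This is a minimal in-memory sketch, not the project's actual schema. The class names, field names, and methods are assumptions made for illustration, and only three of the per-pixel variables are shown.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Pixel:
    """One image pixel at one time step (hypothetical field names)."""
    day: date
    albedo: Optional[float] = None
    skin_temperature: Optional[float] = None
    cloudy: bool = False


@dataclass
class DataRod:
    """Time series of pixels for a single grid cell."""
    row: int
    col: int
    pixels: List[Pixel] = field(default_factory=list)

    def add(self, pixel: Pixel) -> None:
        # Keep the rod sorted by time so range queries stay simple.
        self.pixels.append(pixel)
        self.pixels.sort(key=lambda p: p.day)

    def between(self, start: date, end: date) -> List[Pixel]:
        """Pixels with start <= day <= end."""
        return [p for p in self.pixels if start <= p.day <= end]

    def is_adjacent(self, other: "DataRod") -> bool:
        """True for the 8 neighbouring grid cells (row/col adjacency)."""
        d_row = abs(self.row - other.row)
        d_col = abs(self.col - other.col)
        return max(d_row, d_col) == 1
```

Because each rod carries its own location and its own time series, any one rod can be analyzed without touching the rest of the grid.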

Database Creation
Gridded data sets
Standardized grid dimensions
Lends itself well to time-series analysis
Visualize as layers of imagery through time (days to decades)
[Figure axes: Latitude, Longitude, Time]

Status: Database Administration
5 AVHRR databases, each with 5 years of imagery (100 GB each, administratively easier)
Surface mask databases for the northern hemisphere at 5 km and 25 km
SSM/I database, 25 years of daily 25 km data at all frequencies and polarizations
Selected MODIS database at 250 m resolution
600 GB total
No upper limit to database size except disk space

AVHRR Database Creation
Initial demonstration region is Greenland
25 years of daily multi-spectral AVHRR data at 5 km resolution
9000 images
2 billion pixel objects total
Each pixel object is independently accessible for query

Database Flexibility
Data can be spread across many databases
Transparent queries across databases
Methods (routines) can be attached to the data rods to add functionality such as statistical analysis
Data fusion: analyses may span multiple data types, resolutions, time spans
Data Rods supports NetCDF output

Simple AVHRR Object Database Time Test
Built a database using AVHRR 5 km data from 1995-1999
2 visible channels, 3 IR channels, 3 references plus albedo, skin temperature and cloud mask
Database includes location class, time stamp class and metadata
213,000 data rods covering 5 years over Greenland
1 data rod contains 1825 pixels
Pixels: 388,725,000, each with 11 variables/pixel
Variables: 4.2 billion coded short integer values
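
The counts on this slide follow from simple multiplication, which the short check below reproduces. It assumes one image per day and 365-day years (which is what 1825 pixels per rod implies). The 2-byte size for a short integer is also an assumption, used only to estimate the raw payload.

```python
RODS = 213_000
DAYS_PER_YEAR = 365          # assumption: leap days ignored, one image per day
YEARS = 5
# 2 visible + 3 IR + 3 references + albedo + skin temperature + cloud mask
VARIABLES_PER_PIXEL = 2 + 3 + 3 + 1 + 1 + 1

pixels_per_rod = DAYS_PER_YEAR * YEARS              # 1,825
total_pixels = RODS * pixels_per_rod                # 388,725,000
total_values = total_pixels * VARIABLES_PER_PIXEL   # 4,275,975,000
raw_bytes = total_values * 2                        # assumption: 2-byte shorts

print(f"{pixels_per_rod:,} pixels per rod")
print(f"{total_pixels:,} pixels")
print(f"{total_values:,} values")
print(f"{raw_bytes / 1e9:.1f} GB of raw values")
```

The product comes to about 4.28 billion values, which the slide quotes as 4.2 billion.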

Example Analysis Using Object Databases
All queries run on a single processor, single thread
Example #1: Queries and plots on single database
Example #2: Queries and plots on multiple databases
Example #3: Advanced spatiotemporal analysis
We will move to multi-thread, multiprocessor once we have the design finalized (this is a research project)

Using Single AVHRR Object Database: Time Test
Single processor under load
5-year plots returned in 2-10 seconds
Cached data plots returned in ½ second
Images in 10 seconds

Multi Data Rod Selection
Seven locations selected across 5 years simultaneously
Brightness temperature and albedo output selected
Again, cached results are returned much faster

Example Analysis of Greenland & 5 Databases
Using five 5-year rods and statistics (1 min, or 5 secs cached)
AVHRR albedo statistics, May average, 1981-2005
Location              Mean    Std. dev.
Camp Century          0.801   0.077
Summit Station        0.819   0.069
Swiss Camp            0.817   0.070
GISP Ice Core Camp    0.802   0.071
Image ref: Maurer, J. 2007. Atlas of the Cryosphere. Boulder, Colorado USA: National Snow and Ice Data Center. Digital media.
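
A query of this kind can be sketched as follows: collect the samples for one grid cell from every database, keep the May values, and compute the mean and standard deviation. Here each "database" is a plain dictionary standing in for an object database. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the slide does not say which form was used.

```python
from datetime import date
from statistics import mean, stdev


def rod_across(databases, row, col):
    """Concatenate one cell's (day, albedo) samples from all databases, in time order."""
    samples = []
    for db in databases:
        samples.extend(db.get((row, col), []))
    return sorted(samples, key=lambda s: s[0])


def monthly_albedo_stats(databases, row, col, month):
    """Mean and sample standard deviation of albedo for one calendar month.

    Needs at least two valid samples, otherwise stdev() raises StatisticsError.
    """
    values = [
        albedo
        for day, albedo in rod_across(databases, row, col)
        if day.month == month and albedo is not None
    ]
    return mean(values), stdev(values)


# Two toy "databases" covering different years for the same grid cell.
db_a = {(0, 0): [(date(1995, 5, 1), 0.80), (date(1995, 6, 1), 0.50)]}
db_b = {(0, 0): [(date(2000, 5, 1), 0.82)]}
may_mean, may_std = monthly_albedo_stats([db_a, db_b], 0, 0, month=5)
```

The caller never needs to know which database holds which years, which is the sense in which the query is transparent across databases.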

Temporal Analysis of Single Rods
Descriptive statistical functions
Spatiotemporal data selection
Filtering by value
Anomaly detection
Also:
Image generation
Inter-database data fusion
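
Filtering by value and anomaly detection on a single rod can be illustrated with a few lines. The slides do not state which anomaly test the project uses, so the z-score rule below is only one simple possibility, and the function names and threshold are assumptions.

```python
from statistics import mean, pstdev


def filter_by_value(series, low, high):
    """Keep (time, value) samples whose value lies in [low, high]."""
    return [(t, v) for t, v in series if low <= v <= high]


def anomalies(series, threshold=2.0):
    """Samples more than `threshold` standard deviations from the rod's mean."""
    values = [v for _, v in series]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [(t, v) for t, v in series if abs(v - mu) / sigma > threshold]


# Nine ordinary albedo samples and one sharp drop at time step 9.
series = [(t, 0.8) for t in range(9)] + [(9, 0.2)]
flagged = anomalies(series)
bright = filter_by_value(series, 0.5, 1.0)
```

Both functions operate on one rod at a time, so they could be run independently for every grid cell.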

Broad Spatiotemporal Analysis (this took some time)
Statistical analysis repeated at every grid cell
Intersection of surface mask database and AVHRR database: only pixels on the ice sheet were processed
Bad data filtered out
Multivariate: cloud mask used to exclude cloudy pixels from albedo averages
All 2 billion objects queried and analyzed
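
The steps on this slide can be sketched as a single loop: intersect the surface mask with the AVHRR rods, drop bad and cloudy samples, and average what remains at each cell. Dictionaries and sets stand in for the two databases. The function name and the valid albedo range of 0 to 1 are assumptions.

```python
def clear_sky_mean_albedo(rods, ice_mask, valid_range=(0.0, 1.0)):
    """Mean cloud-free albedo per grid cell, restricted to ice-sheet cells.

    rods:     {(row, col): [(albedo, cloudy), ...]}  stand-in for the AVHRR database
    ice_mask: set of (row, col) on the ice sheet     stand-in for the surface mask database
    """
    low, high = valid_range
    result = {}
    # Only cells present in both "databases" are processed.
    for cell in rods.keys() & ice_mask:
        values = [
            albedo
            for albedo, cloudy in rods[cell]
            if not cloudy and albedo is not None and low <= albedo <= high
        ]
        if values:  # skip cells with no usable samples
            result[cell] = sum(values) / len(values)
    return result


rods = {
    (0, 0): [(0.8, False), (0.9, True), (5.0, False)],  # one clear, one cloudy, one bad value
    (0, 1): [(0.1, False)],                             # off the ice sheet
    (1, 1): [(0.7, True)],                              # on the ice sheet but always cloudy
}
ice_mask = {(0, 0), (1, 1)}
means = clear_sky_mean_albedo(rods, ice_mask)
```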

Analysis Example: Sea Ice Temporal Query
We would like to remove clouds from the image (clouds move faster than ice, so find the minimum albedo for open water)
Moving 8-day window (t1 to t8) through the data rod
Minimum albedo in temporal window
Pseudocode example query:
    datarod = database.getDatarod(row, col)    // datarod: time series of pixels
    albedo = datarod.getMinAlbedo(t, t + 7)
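
A runnable version of the moving-window query might look like the following. A plain list of daily albedo values stands in for the data rod, with None marking missing data. The function names are hypothetical and the window covers t through t+7 inclusive, matching the 8-day window on the slide.

```python
def min_albedo(albedo, start, end):
    """Minimum valid albedo for time indices start..end inclusive."""
    window = [a for a in albedo[start:end + 1] if a is not None]
    return min(window) if window else None


def cloud_filtered(albedo, window_days=8):
    """Slide the window along the rod and keep the minimum albedo at each step."""
    last_start = len(albedo) - window_days
    return [
        min_albedo(albedo, t, t + window_days - 1)
        for t in range(last_start + 1)
    ]


# Ten days for one cell: bright (cloudy) values mixed with darker clear-sky values.
daily = [0.5, 0.1, 0.6, 0.7, 0.6, 0.5, 0.6, 0.7, 0.6, 0.9]
composite = cloud_filtered(daily)
```

Because clouds are bright and move quickly, the window minimum tends to pick the clearest view of the surface in each 8-day span.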

Analysis Result: Sea Ice Detection
Technique for removing clouds from the image
Composite image created from Data Rods’ time series (lowest AVHRR albedo over an 8-day period)
[Figure label: One of the original images]
Remaining objective: exclude lingering clouds

Analysis Potential: Rapid Data Fusion
Loss of AMSR-E decreases sea ice detection capability
Data Rods AVHRR/SSM/I product fusion may fill the gap
Can be validated against the AMSR-E sea ice record
AVHRR 8-day: high-resolution sea ice detection, still some clouds
SSM/I: cloud-free with good sea ice detection but low resolution
Fused product: high-res sea ice extent, no clouds

Performing this lake detection analysis conventionally took 6 months (downloading, gridding, and image analysis). With Data Rods, the analysis was done in 2 days (single thread, single processor).

What’s Next: Ongoing Efforts
Newest version of ODB software has multi-threaded capability, to take advantage of multiprocessor machines and reduce query times
Investigating Data Rod performance on the Janus supercomputer with pan-Arctic extent
User interface to the Data Rod database

Creating 1000s of Databases for Use with Massively Parallel Machines
Each database is small enough to be held in memory for each CPU (uses MPI calls)
Each database covers 5° x 5° x 25 years of Data Rods
Each database is capped (fixed for minimal changes)
Changes are added to the present-year database for each 5° x 5° region
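
Splitting the globe into 5° x 5° databases implies some rule for deciding which database holds a given location. The sketch below shows one such rule. The indexing scheme, function names, and database naming pattern are all assumptions for illustration, and the MPI layer mentioned on the slide is not shown.

```python
def tile_key(lat, lon, size_deg=5):
    """Return (lat_index, lon_index) of the size_deg x size_deg tile containing a point."""
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be within [-90, 90]")
    lon = ((lon + 180) % 360) - 180              # normalize to [-180, 180)
    lat_tiles = 180 // size_deg
    # Keep the pole (lat == 90) inside the topmost tile.
    lat_index = min(int((lat + 90) // size_deg), lat_tiles - 1)
    lon_index = int((lon + 180) // size_deg)
    return lat_index, lon_index


def database_name(lat, lon, size_deg=5):
    """Hypothetical naming scheme: one database per tile."""
    lat_index, lon_index = tile_key(lat, lon, size_deg)
    return f"rods_lat{lat_index:02d}_lon{lon_index:02d}"
```

With 5° tiles this gives 36 x 72 = 2,592 possible databases, and each worker process could load only the tiles it is assigned.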

Creating 1000s of Databases for Use with Massively Parallel Machines
With this database it should be possible to perform analysis at Internet speeds
Multi-sensor analysis is relatively simple
We are starting the database loading now
100 TB database testing will occur over the summer

Summary
We can now perform high-speed time-series analysis on the server side without downloads
Scalable, massive remote sensing databases
Accelerated analysis compared to traditional “search, order and transmission” methods
Interactions across data sets (data fusion)
Developing UI and additional analysis tools
Allow users interactive access to the data

NSIDC Data Rods Project
Thank You
The Data Rods project is funded by the National Science Foundation through grant ARC 0941442.
Interested in testing Data Rods? Please contact us at: david.gallaher@nsidc.org
