Data grids implement the ability to submit, query and
retrieve the contents of a registry and repository. An example
is the integrated Rule-Oriented Data System (iRODS), available as
open source software at
The iRODS software has been under development since 2006 in
projects funded by the National Science Foundation and the
National Archives and Records Administration. It incorporates
registry and repository management functions that were first
implemented in the Storage Resource Broker that was developed
between 1996 and 2005.
The iRODS software is used to support data sharing
environments, digital libraries, archives, and repositories.
Examples include the French National Library, the Australian
Research Collaboration Service (national data grid), CyberSKA
radio astronomy data, the National Optical Astronomy Observatory
data grid, genomics data grids (Wellcome Trust Sanger Institute,
Broad Institute), satellite data (NASA Center for Climate
Simulations), Ocean Observatories Initiative sensor data, and
EUDAT data replication.
Some of the challenges faced when managing petabytes of
internationally distributed data, spread across hundreds of
millions of files, include:
- managing interactions with heterogeneous storage systems
(Windows, Mac, Unix file systems, tape archives, web sites,
databases)
- enforcing assertions about collection properties (policy
enforcement through a distributed rule engine)
- automating administrative functions (migration,
replication, integrity checking, metadata loading)
- providing efficient data transport mechanisms
- supporting the wide variety of clients requested by user
communities (web browsers, web services, load libraries, I/O
libraries, file system interfaces, workflows, Dropbox-style
synchronization, digital libraries, portals, WebDAV, grid tools,
Unix tools, etc.)
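
Two of the challenges above, hiding heterogeneous storage behind
one interface and enforcing policy through a rule engine, can be
illustrated with a minimal sketch. This is not iRODS code and
does not use the iRODS API: the names StorageDriver,
InMemoryStorage, RuleEngine, put, and the "pre_put" / "post_put"
event names are all assumptions made up for illustration.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class StorageDriver(ABC):
    """Hypothetical uniform interface over different storage systems."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class InMemoryStorage(StorageDriver):
    """Stand-in for a real back end (file system, tape archive, database)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._blobs[path] = data

    def read(self, path: str) -> bytes:
        return self._blobs[path]


# A rule receives the logical path, the data, and a mutable metadata dict.
Rule = Callable[[str, bytes, Dict[str, str]], None]


class RuleEngine:
    """Runs every rule registered for a named event (an enforcement point)."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Rule]] = {}

    def on(self, event: str, rule: Rule) -> None:
        self._rules.setdefault(event, []).append(rule)

    def fire(self, event: str, path: str, data: bytes,
             metadata: Dict[str, str]) -> None:
        for rule in self._rules.get(event, []):
            rule(path, data, metadata)


def reject_empty(path: str, data: bytes, metadata: Dict[str, str]) -> None:
    """Example policy: refuse zero-length submissions."""
    if not data:
        raise ValueError(f"policy violation: {path} is empty")


def record_checksum(path: str, data: bytes, metadata: Dict[str, str]) -> None:
    """Example policy: record a checksum for later integrity checks."""
    metadata["checksum"] = hashlib.sha256(data).hexdigest()


def put(engine: RuleEngine, storage: StorageDriver,
        path: str, data: bytes) -> Dict[str, str]:
    """Store data, firing rules before and after the write."""
    metadata: Dict[str, str] = {}
    engine.fire("pre_put", path, data, metadata)   # may raise and abort
    storage.write(path, data)
    engine.fire("post_put", path, data, metadata)
    return metadata
```

The point of the design is that the policies live outside the
storage drivers, so the same rules apply no matter which back end
holds the data.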
The capabilities supported by iRODS include:
- submission of files into a repository
- management of descriptive, system, and provenance metadata
for files, users, and storage systems
- queries on metadata and browsing of files
- registration of files from remote systems, web sites, and
archives
- data management functions such as replication, aggregation,
distribution, caching
- policy enforcement for domain-specific requirements (access
controls, derived data product generation, automated metadata
extraction, data processing, etc.)
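
As a rough illustration of how submission, metadata management,
metadata queries, and replication fit together, here is a toy
in-memory catalog. It is a sketch under stated assumptions, not
the iRODS implementation or its client API: ToyGrid, DataObject,
and every method name below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataObject:
    """One logical file: its replicas and its metadata."""
    logical_name: str
    replicas: Dict[str, bytes] = field(default_factory=dict)  # resource -> content
    metadata: Dict[str, str] = field(default_factory=dict)    # attribute -> value


class ToyGrid:
    """Hypothetical catalog mapping logical names to replicas and metadata."""

    def __init__(self, resources: List[str]) -> None:
        self.resources = set(resources)
        self.catalog: Dict[str, DataObject] = {}

    def _check_resource(self, resource: str) -> None:
        if resource not in self.resources:
            raise KeyError(f"unknown storage resource: {resource}")

    def put(self, logical_name: str, data: bytes, resource: str) -> None:
        """Submit a file to one storage resource."""
        self._check_resource(resource)
        obj = self.catalog.setdefault(logical_name, DataObject(logical_name))
        obj.replicas[resource] = data

    def add_metadata(self, logical_name: str, attribute: str, value: str) -> None:
        """Attach a descriptive attribute to an existing file."""
        self.catalog[logical_name].metadata[attribute] = value

    def query(self, attribute: str, value: str) -> List[str]:
        """Return the logical names whose metadata matches exactly."""
        return sorted(
            name for name, obj in self.catalog.items()
            if obj.metadata.get(attribute) == value
        )

    def replicate(self, logical_name: str, source: str, target: str) -> None:
        """Copy an existing replica to a second storage resource."""
        self._check_resource(target)
        obj = self.catalog[logical_name]
        obj.replicas[target] = obj.replicas[source]

    def get(self, logical_name: str) -> bytes:
        """Retrieve the content from any available replica."""
        obj = self.catalog[logical_name]
        return next(iter(obj.replicas.values()))
```

A short usage example, with made-up resource and file names:

```python
grid = ToyGrid(["disk-a", "tape-b"])
grid.put("/zone/survey/run1.dat", b"payload", "disk-a")
grid.add_metadata("/zone/survey/run1.dat", "instrument", "telescope-1")
grid.replicate("/zone/survey/run1.dat", "disk-a", "tape-b")
matches = grid.query("instrument", "telescope-1")
```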
Given a well-defined API, it is possible to port the ebXML
access mechanisms on top of the iRODS data grid. The major
concern is that the ebXML protocol covers only a constrained
subset of the operations required by the projects listed above.
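
The layering described here can be sketched as a thin facade
that exposes only submit, query, and retrieve on top of a
broader grid API. Everything in this sketch is an assumption for
illustration: GridBackend, RegistryFacade, and StubBackend are
hypothetical names, and the code reflects neither the ebXML
specification nor the iRODS API.

```python
from typing import Dict, List, Protocol


class GridBackend(Protocol):
    """Hypothetical grid API, broader than what the facade exposes."""

    def put(self, name: str, data: bytes) -> None: ...
    def get(self, name: str) -> bytes: ...
    def set_metadata(self, name: str, attribute: str, value: str) -> None: ...
    def find(self, attribute: str, value: str) -> List[str]: ...
    def replicate(self, name: str, target: str) -> None: ...


class RegistryFacade:
    """Registry-style access layer: submit, query, retrieve, nothing else."""

    def __init__(self, backend: GridBackend) -> None:
        self._backend = backend

    def submit(self, name: str, data: bytes, metadata: Dict[str, str]) -> None:
        self._backend.put(name, data)
        for attribute, value in metadata.items():
            self._backend.set_metadata(name, attribute, value)

    def query(self, attribute: str, value: str) -> List[str]:
        return self._backend.find(attribute, value)

    def retrieve(self, name: str) -> bytes:
        return self._backend.get(name)


class StubBackend:
    """In-memory stand-in so the facade can be exercised without a grid."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.meta: Dict[str, Dict[str, str]] = {}
        self.replica_targets: Dict[str, List[str]] = {}

    def put(self, name: str, data: bytes) -> None:
        self.data[name] = data
        self.meta.setdefault(name, {})

    def get(self, name: str) -> bytes:
        return self.data[name]

    def set_metadata(self, name: str, attribute: str, value: str) -> None:
        self.meta[name][attribute] = value

    def find(self, attribute: str, value: str) -> List[str]:
        return sorted(n for n, m in self.meta.items() if m.get(attribute) == value)

    def replicate(self, name: str, target: str) -> None:
        # Reachable through the backend, but not through the facade.
        self.replica_targets.setdefault(name, []).append(target)
```

Operations such as replication stay available on the backend but
have no counterpart in the facade, which is the subset concern
in miniature.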
Reagan Moore
DICE Center
UNC-CH