An increasing number of applications manage large amounts of semistructured data. Common applications that use semistructured data today include Bioinformatics sequence search and alignment, genomic data analysis, multi-resolution video storage, clinical data systems, XML databases, and more . Given that a semi-structure such as a tree provides a more intuitive way of managing large amounts of data, the trend of storing data in such formats is likely to strengthen in the future.
Current approaches to store semistructured data either map the data to an underlying relational database system, use the abstraction provided by a general-purpose object storage manager, or use a combination of flat files and indices (e.g., XALAN, XT, Galax, BLAST, Timber and Natix). Since these approaches retrofit existing storage mechanisms to work with semistructured data, their scope is restricted to the underlying mechanisms, which are predominantly optimized for sequential accesses. Consequently, these approaches may result in a mismatch between the structure and navigational primitives of semistructured data and the access characteristics of disk drives. In particular, semistructured data have a tree (or graph) structure with tree-type operations. Relational databases, on the other hand, store structured tables that are optimized for row-based access, and flat files are unstructured, optimized for sequential access. Further complicating this mismatch, the underlying storage device, i.e. disk drives, store information in circular tracks that are accessed with mechanical seek and rotational overhead. Given the growing amount of semistructured data, there is a need for re-examining the current storage and access machinery that support them.
In this project, we evaluate strategies to optimize the storage and retrieval of semistructured data on disk drives by explicitly accounting for the mismatch between the structure of the data and the disk drive storage and access characteristics. In particular, this chapter presents algorithms that given the physical characteristics of a disk drive (number of tracks, sectors per track and rotational speed.), place semistructured data on the disk drive in a way that facilitates navigation of the data by reducing access overheads. Such low-level control of data layout is made possible using information provided by standard disk profiling tools.
The proposed technique first addresses the problem of grouping nodes of semistructured data trees so that they can be mapped to disk blocks. This chapter presents the development and experimental evaluation of grouping strategies, which are compared with the Enhanced Kundu Misra (EKM) grouping strategy. Second, the proposed on-disk layout strategy for node groups optimizes common tree navigation operations such as parent-to-child and node-to-next-sibling traversals. These on-disk layout strategies make use of semi-sequential disk access technique that allows the reduction and even elimination of rotational delay overhead during disk accesses.
Given that this approach requires circumventing the prevalent logical block abstraction, applying this layout strategy to a general purpose storage system is not straightforward. The goal of these techniques is simply to expose the merits and demerits of this approach. Through experiments we show that our proposed approach is superior for a dedicated single-user storage system with standard caching and prefetching capabilities – for instance, a specialized system for analysis of biological data (suffix trees). Based on this study, we believe that our approach provides a fresh perspective on the problem of storing semistructured data that is worth the attention and research time of the community.
To evaluate the proposed native data layout techniques, we used XML as a case study. XML is becoming increasingly popular due to its ability to represent arbitrary semistructured data. It is the de facto data representation format for many modern applications, including Geographic Information Systems Markup Language (GML), Medical Markup Language (MML), Health Level HL7, Clinical Document Architecture (CDA) used to represent Electronic Health Records (EHRs), Open Document Format (ODF), and Scalable Vector Graphics (SVG) used to describe two-dimensional graphics and graphical applications. Despite the widespread use of XML, the challenge of optimizing access to XML data stores is a key challenge also identified in the latest report on the future directions on database research, published every few years by the database research community.
M. Bhadkamkar, F. Farfán, V. Hristidis, and R. Rangaswami. Storing Semi-structured Data on Disk Drives. ACM Transactions on Storage, Vol. 5, Issue # 2, May, 2009.
Medha Bhadkamkar, Fernando Farfán, Vagelis Hristidis and Raju Rangaswami. Efficient Native Storage Systems for Semistructured Data. Florida International University Technical Report TR-2006-09-01. September 2006.
Medha Bhadkamkar, Fernando Farfán, Vagelis Hristidis, and Raju Rangaswami. Storing Trees on Disk Drives. Proceedings of the File and Storage Technologies WiP, 2005.