Content-addressable storage

Content-addressable storage (CAS), also known as content-aware storage, is a disk storage technology designed to offer efficient access to fixed or archival data that should not change over time. It is one of the more recent developments in the constantly evolving field of storage networking technologies.

CAS Compared to Other Storage Options

There are several different ways to archive data, but each tactic has its limitations. For example, magnetic tape is a common archival medium, but it does not offer a short recovery time objective (RTO). Magnetic tape is also vulnerable to loss or theft, and is often susceptible to failure upon reload after a successful backup.

Disk arrays, such as RAID, are also frequently used as an archival medium, but users must often contend with multiple copies of a file. This makes it difficult to know which copy is the latest or official version. Traditional meta-data entries (e.g. file-name, creation date, modification date) include very little direct information about the file itself, making it difficult to search for files in the future.

CAS overcomes some of the challenges imposed by these alternative storage methods. By attaching a comprehensive suite of meta-data to the object, data can be indexed or searched without knowing specific file-names, dates or other traditional file designations. Quality meta-data can also include contextual information that can help a user to understand or employ the data when it is accessed in the future. For example, including a doctor's diagnosis along with an MRI record can help other doctors quickly come up to speed on a patient's condition, track changes to their condition, find other patients with similar conditions and so on.

It is important to note that content-addressable storage is not the same as content-addressable memory, although the principles are similar.

Content-Addressable versus Location-Addressable

A typical local, direct or networked storage device is referred to as location-addressable. In a location-addressable storage device, each element of data is stored onto the physical medium, and its location is recorded for later use. The storage device often keeps a list, or directory, of these locations. When a future request is made for a particular item, the request includes only the location (for example, path and file names) of the data. (Think of how files are stored on your hard drive, and how you know where to look for them when you need them.) The storage device can then use this information to locate the data on the physical medium and retrieve it. When new information is written into a location-addressed device, it is simply stored in some available free space, without regard to its content. The information at a given location can usually be altered or completely overwritten without any special action on the part of the storage device.

In contrast, when information is stored into a CAS system, the system will record a content address, which is an identifier based solely on the information itself. A request to retrieve information from a CAS system must provide the content identifier, from which the system can determine the physical location of the data and retrieve it. Because the identifiers are based on content, any change to a data element will necessarily change its content address. In some cases, a CAS device will not permit editing or deleting of information once it has been stored.
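The difference can be illustrated with a minimal sketch. It assumes SHA-256 as the hashing function (the article does not name one) and uses an in-memory dictionary as a stand-in for the physical medium. The names `content_address`, `store` and `retrieve` are hypothetical and do not belong to any particular CAS product.

```python
import hashlib

# In-memory stand-in for the physical medium: content address -> bytes.
_objects = {}

def content_address(data: bytes) -> str:
    """Derive the identifier solely from the content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()

def store(data: bytes) -> str:
    """Store the data and hand back its content address to the caller."""
    address = content_address(data)
    _objects[address] = data
    return address

def retrieve(address: str) -> bytes:
    """Look the data up by content address rather than by path or file name."""
    return _objects[address]

original = b"patient record, version 1"
edited = b"patient record, version 2"

addr_original = store(original)
addr_edited = store(edited)  # any change to the content yields a different address
```

Here the caller never supplies a location. The edited record does not overwrite the original; it simply receives a new address, and both versions remain retrievable.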

While the idea of content-addressed storage is not new, production-quality systems were not readily available until roughly 2003 when EMC released the first truly commercially available CAS system, using the Centera architecture. In mid-2004, the industry group SNIA [1] began working with a number of CAS providers to create standard behavior and interoperability guidelines for CAS systems.

Implementing CAS

How CAS Saves Space

In content-addressable storage systems, data is marked with meta-data. Data is treated as an object, which is then assigned a unique designator (i.e., a content address) and sent to a permanent location on a hard disk. This is in contrast to treating data as a file and allowing a file system to handle data storage, as is done in network attached storage (NAS). Since identical content always maps to the same content address, the system never stores multiple copies of the same file. Duplication of data is eliminated and the total storage requirement is reduced.

Data reduction, often referred to as commonality factoring, is a key attribute of CAS, saving additional cost by reducing the total storage space needed for all of a company's data. When data is stored on a CAS system, a hashing algorithm is applied to the file or to more granular file elements such as individual blocks. For a given piece of data the algorithm always produces the same value, and different data is, for practical purposes, certain to produce a different value, so the hash serves as a unique fingerprint of the content. The CAS appliance compares those values against its index of saved objects. If a hash value is new, that portion of data and its meta-data will be added to disk. If the hash value already exists, that portion of data has already been stored, so only the meta-data and a pointer to the existing portion will be saved. There is no reason to save the same data again when a reference to it is all that is needed.
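The hash-and-compare step described above can be sketched at block level as follows. This is an illustration under stated assumptions, not the behavior of any specific appliance: SHA-256, the fixed 4096-byte block size, and the `DedupStore` class with its methods are all hypothetical choices.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, in bytes

class DedupStore:
    def __init__(self):
        self.blocks = {}   # index of saved blocks: hash value -> block bytes
        self.objects = {}  # object name -> meta-data plus ordered block pointers

    def put(self, name, data, meta):
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # New hash value: this portion of data is written.
                self.blocks[digest] = block
            # Known or new, the object only keeps a pointer to the block.
            pointers.append(digest)
        self.objects[name] = {"meta": meta, "pointers": pointers}

    def get(self, name):
        """Reassemble an object by following its block pointers in order."""
        return b"".join(self.blocks[d] for d in self.objects[name]["pointers"])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())

store = DedupStore()
report_v1 = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
report_v2 = b"A" * 4096 + b"B" * 4096 + b"D" * 4096  # only the last block differs
store.put("report_v1", report_v1, {"author": "alice"})
store.put("report_v2", report_v2, {"author": "bob"})
```

The two versions share their first two blocks, so the second `put` adds only one new block plus its own meta-data and pointers. Real systems may use variable-size chunks or whole-object hashing instead of fixed blocks.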

Let's consider an example. Suppose that CAS is being used to archive e-mails, and there are 30 e-mails that have the same attachment. In a traditional backup or replication scheme, that same attachment would be saved 30 times along with the e-mails. With CAS and its data reduction techniques, the storage appliance would save the actual attachment only once, and each additional e-mail with the attachment would save only a pointer and its own meta-data. Since there are often many versions and copies of files scattered across corporate servers, the potential for storage savings can be significant.
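The e-mail example can be worked through in a few lines. The attachment contents, the recipient addresses and the use of SHA-256 are made up for illustration, and the per-message meta-data is reduced to a single field.

```python
import hashlib

# Hypothetical attachment shared by all 30 e-mails.
attachment = b"quarterly report contents " * 1000

emails = [
    {"to": f"user{i}@example.com", "attachment": attachment}
    for i in range(30)
]

blobs = {}    # content address -> attachment bytes, saved once per distinct content
records = []  # one entry per e-mail: its own meta-data plus a pointer

for email in emails:
    address = hashlib.sha256(email["attachment"]).hexdigest()
    blobs.setdefault(address, email["attachment"])  # only the first copy is saved
    records.append({"to": email["to"], "attachment_address": address})

naive_bytes = sum(len(e["attachment"]) for e in emails)  # attachment saved 30 times
cas_bytes = sum(len(b) for b in blobs.values())          # attachment saved once
```

All 30 records point at the same content address, so the attachment occupies one thirtieth of the space the traditional scheme would use, plus the small cost of the pointers and meta-data.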

CAS Provides Data Integrity

With the advent of new regulatory compliance requirements such as the Sarbanes-Oxley Act or the Gramm-Leach-Bliley Act, CAS is finding an important role in corporate governance, risk mitigation and legal liability. Since all CAS data is uniquely identified through hash algorithm results, can only be stored once, cannot be modified and cannot be destroyed outside of established retention policies, companies are increasingly evaluating CAS technologies to meet their compliance needs. The inclusion of detailed meta-data also enables superior indexing and searching, allowing relevant files to be located long after their file names have been forgotten.
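These integrity properties can be sketched as a small write-once store. Everything here is an assumption made for illustration: the `WormStore` class, the use of SHA-256, and a retention policy expressed as a simple number of days. It is not modeled on any vendor's interface.

```python
import hashlib
from datetime import datetime, timedelta, timezone

class WormStore:
    """Write-once store: no update method exists, and deletion honors retention."""

    def __init__(self):
        self._objects = {}  # content address -> (data, retain_until)

    def put(self, data, retention_days, now=None):
        now = now or datetime.now(timezone.utc)
        address = hashlib.sha256(data).hexdigest()
        # Identical content is stored only once; an existing entry is left untouched.
        self._objects.setdefault(address, (data, now + timedelta(days=retention_days)))
        return address

    def get(self, address):
        data, _ = self._objects[address]
        # Re-hash on read: data that was altered no longer matches its address.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored data does not match its content address")
        return data

    def delete(self, address, now=None):
        now = now or datetime.now(timezone.utc)
        _, retain_until = self._objects[address]
        if now < retain_until:
            raise PermissionError("retention period has not expired")
        del self._objects[address]

stored_at = datetime(2008, 1, 1, tzinfo=timezone.utc)
store = WormStore()
address = store.put(b"audit log for Q1", retention_days=365, now=stored_at)
```

Because the address is derived from the content, re-hashing on read gives an auditor a direct way to confirm that a record is the one originally stored. As a simplification, storing the same content a second time does not extend its retention period in this sketch.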

PodSnacks

<mp3>http://podcast.hill-vt.com/podsnacks/2008q2/cas.mp3%7Cdownload</mp3> | Content-addressable storage (CAS)