Scalable cloud storage architecture   



Abstract: A virtual storage module operable to run in a virtual machine monitor may include a wait-queue operable to store incoming block-level data requests from one or more virtual machines. In-memory metadata may store information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines. The data stored in local persistent storage replicates a subset of data in one or more virtual disks provided to the virtual machines. The virtual disks are mapped to remote storage accessible via a network connecting the virtual machines and the remote storage. A cache handling logic may be operable to handle the block-level data requests by obtaining the information in the in-memory metadata and making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
Agent: International Business Machines Corporation - Armonk, NY, US
Inventors: Rong N. Chang, Byung C. Tak, Chunqiang Tang
USPTO Application #: 20120179874 - Class: 711128 (USPTO) - 07/12/12 - Class 711
Related Terms: Cache   Cloud   Disks   Host Computer   Hosting   Metadata   Subset   Virtual   Virtual Machine   


The Patent Description & Claims data below is from USPTO Patent Application 20120179874, Scalable cloud storage architecture.


FIELD

The present application generally relates to computer systems and computer storage, and more particularly to virtual storage and storage architecture.

BACKGROUND

Designing a storage system is a challenging task. For instance, in Cloud Computing, a high degree of virtualization increases the demand for storage space, which requires the use of remote storage. However, uncontrolled access to the remote storage from a large number of virtual machines can easily saturate the networking infrastructure and affect every system using the network.

More particularly, for example, in IaaS (Infrastructure-as-a-Service) cloud services, storage needs of VM (Virtual Machine) instances are met through virtual disks (i.e., virtual block devices). However, it is nontrivial to provide virtual disks to VMs in an efficient and scalable way for a couple of reasons. First, a VM host may be required to provide virtual disks for a large number of VMs. It is difficult to ascertain the largest possible storage demands and physically provision them all in the host machine. On the other hand, if the storage space for virtual disks is provided through remote storage servers, aggregate network traffic due to storage accesses from VMs can easily deplete the network bandwidth and cause congestion.

BRIEF SUMMARY

A storage system and method for handling data for virtual machines, for instance, for scalable cloud storage architecture, may be provided. The system, in one aspect, may include a virtual storage module operable to run in a virtual machine monitor. The virtual storage module may include a wait-queue operable to store incoming block-level data requests from one or more virtual machines, and in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines. The data stored in local persistent storage may be replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage. A cache handling logic may be operable to handle the block-level data requests by obtaining the information in the in-memory metadata and making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.

A method for handling data storage for virtual machines, in one aspect, may include intercepting one or more incoming block-level data requests received by a virtual machine monitor from one or more virtual machines. The method may also include obtaining from in-memory metadata, information associated with data of the block-level data request. The in-memory metadata may store information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines. The data stored in local persistent storage may be replication of a subset of data in one or more virtual disks provided to the virtual machines. The virtual disks may be mapped to remote storage accessible via a network connecting the virtual machines and the remote storage. The method may further include making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.

A computer readable storage medium storing a program of instructions executable by a machine to perform one or more methods described herein also may be provided.

Further features as well as the structure and operation of various embodiments are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 shows the architecture of a scalable Cloud storage system in one embodiment of the present disclosure.

FIG. 2 shows the architecture of vStore in one embodiment of the present disclosure.

FIG. 3 illustrates structure of one cache entry in one embodiment of the present disclosure.

FIG. 4A is a flow diagram illustrating read request handling in one embodiment of the present disclosure.

FIG. 4B is a flow diagram illustrating write request handling in one embodiment of the present disclosure.

FIG. 5 shows, as an example, the Xen implementation of vStore in one embodiment of the present disclosure.

DETAILED DESCRIPTION

The present disclosure in one embodiment presents a system (referred to in this disclosure as vStore), which utilizes the host's (e.g., computer server hosting virtual machines) local disk space as a block-level cache for the remote storage (e.g., network attached storage), for example, in order to absorb network traffic from storage accesses. This allows the VMM (Virtual Machine Monitor, a.k.a. hypervisor) to serve VMs' disk input/output (I/O) requests from the host's local disks most of the time, while providing the illusion of much larger storage space for creating new virtual disks. Caching virtual disks at block-level poses special challenges in achieving high performance while maintaining virtual disk semantics. First, after a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. That is, the block-level cache should preserve the data integrity in the event of host crashes. To that end, cache handling operations in one embodiment of the present disclosure may ensure consistency between on-disk metadata and data to avoid committing incorrect data to the network attached storage (NAS) during recovery from a crash, while minimizing overheads in updating on-disk metadata. Second, as disk I/O performance is dominated by disk seek times, a virtual disk should be kept as sequential as possible in the limited cache space. Unlike memory-based caching schemes, the performance of an on-disk cache is highly sensitive to data layout. The present disclosure in one embodiment may utilize a cache placement policy that maintains a high degree of data sequentiality in the cache as in the original (i.e., remote) virtual disk. Third, the destaging operation that sends dirty pages back to the remote storage server may be self-adaptive and minimize the impact on the foreground traffic.

In another aspect, a scalable architecture is presented that provides reliable virtual disks (i.e., block devices as opposed to object stores) for virtual machines (VM) in a cloud environment.

FIG. 1 shows the architecture of a scalable Cloud storage system in one embodiment of the present disclosure. The architecture may include one or more VM-hosting machines (e.g., 102, 104, 106). A VM-hosting machine is a physical machine that hosts a large number of VMs and has limited local storage space. vStore 108 uses local storage 110 as a block-level cache and provides to VMs 112 the illusion of unlimited storage space. vStore 108 may be implemented in hypervisor 114 and provides a persistent cache. vStore 108 performs caching at the block device level rather than the file system level. The hypervisor 114 executes on one or more computer processors and provides a virtual block device to VMs 112, which implies that VMs 112 see raw block devices and are free to install any file system on top of them. Thus, hypervisor 114 receives block-level requests and redirects them to the remote storage (e.g., 116, 118).

In one embodiment, single cache space is provided per machine (e.g., 102). The cache tries to replicate the block layout of remote storage (e.g., 116, 118) in the local cache space (local disk) 110.

Storage server clusters (e.g., 116, 118) provide network attached storage to physical machines (e.g., 102, 104, 106). They (e.g., 116, 118) can be either dedicated high-performance storage servers or a cluster of servers using commodity storage devices. The interface to the hypervisors 114 can be either block-level or file-level. If it is the block-level, iSCSI type of protocol can be used between storage servers and clients (i.e., hypervisors). If it is file-level, the hypervisor mounts a remote directory structure and keeps the virtual disks as individual files. Regardless of the protocol between hypervisors and storage servers, the interface between VMs and hypervisor remains at block-level.

The directory server 120 holds the location information about the storage server clusters. When a hypervisor 114 wants to attach a virtual disk to a VM, it consults the directory server 120 to determine the address of a specific storage server (e.g., 116, 118) that currently stores the virtual disk.

The architecture also includes networking infrastructure. Usually network bandwidth within a rack is well-provisioned, but the cross-rack network is usually under-provisioned by a factor of 5 to 10 relative to the within-rack network. As a result, uncontrolled storage accesses from VMs can easily deplete the network bandwidth and cause congestion.

An example configuration may have rack-mounted servers for hosting virtual machines and remote storage servers to provide storage services to the VMs. A rack may contain more than 20 servers, with a virtual machine monitor such as the Xen-3.1.4 hypervisor installed on each of them. Servers may have processors such as two Intel® Xeon™ CPUs at 3.40 GHz and memory, e.g., 2 gigabytes (GB). They can communicate through a 1 Gbps link within the rack. Local storage for each server may be about 1 terabyte, and the servers have a network file system (NFS)-mounted shared storage space that is used to hold VM images for all Virtual Machines. Remote storage servers may have physical hard disks attached, e.g., through a Serial Advanced Technology Attachment (SATA) interface.

There may be multiple options when designing a storage system for a Cloud. One solution is to use only local storage. In a Cloud, VMs may use different amounts of storage space, depending on how much the user pays. If every host's local storage space is over-provisioned for the largest possible demand, the cost would be prohibitive. Another solution is to only use network attached storage. That is, a VM's root file system, swap area, and additional data disks are all stored on network attached storage. This solution, however, would incur a large amount of network traffic and disk I/O load on the storage servers.

Sequential disk access can achieve a data rate of 100 MB/s. Even with pure random access, it can reach 10 MB/s. Since a 1 Gbps network can sustain roughly 13 MB/s, four uplinks to the rack-level switch are not enough to handle even one single sequential access. Note that uplinks to the rack-level network switches are limited in number and cannot be easily increased in commodity systems. Even for random disk access, it can only support about five VMs' disk I/O traffic. Even with 10 Gbps networks, it still can hardly support thousands of VMs running in one rack (e.g., typical numbers are 42 hosts per rack, and 32 VMs per host, i.e., 1,344 VMs per rack).
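The arithmetic in the preceding paragraph can be checked with a short calculation. This is only a sketch that reuses the figures quoted in the text (13 MB/s per 1 Gbps link, four uplinks, 100 MB/s sequential and 10 MB/s random disk rates); the variable names are illustrative and are not part of the disclosure.

```python
# Figures quoted in the text above; the names are illustrative only.
LINK_MB_S = 13          # sustained rate the text attributes to one 1 Gbps uplink
UPLINKS = 4             # uplinks to the rack-level switch
SEQ_DISK_MB_S = 100     # sequential disk access rate
RAND_DISK_MB_S = 10     # random disk access rate
HOSTS_PER_RACK = 42
VMS_PER_HOST = 32

rack_uplink_mb_s = LINK_MB_S * UPLINKS               # aggregate uplink capacity
seq_streams = rack_uplink_mb_s / SEQ_DISK_MB_S       # sequential streams the uplinks can carry
random_vms = rack_uplink_mb_s // RAND_DISK_MB_S      # VMs doing purely random I/O
vms_per_rack = HOSTS_PER_RACK * VMS_PER_HOST

print(rack_uplink_mb_s, seq_streams, random_vms, vms_per_rack)
```

Under these figures the aggregate uplink capacity is below one sequential stream and covers about five VMs doing random I/O, against 1,344 VMs per rack.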

vStore 108 takes a hybrid approach that leverages both local storage 110 and network attached storage 116, 118. It still relies on network attached storage 116, 118 to provide sufficient storage space for VMs 112, but utilizes the local storage 110 of a host 102 to cache data and avoid accessing network attached storage 116, 118 as much as possible.

Consider the case of Amazon EC2, where a VM is given one 10 GB virtual disk to store its root file system and another 160 GB virtual disk to store data. The root disk can be stored on local storage due to its small size. The large data disk can be stored on network attached storage and accessed through the vStore cache. Data integrity and performance are two main challenges in the design of vStore. After a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. In vStore, system failures can compromise data integrity in several ways. If the host crashes while vStore is in the middle of updating either the metadata or the data and there is no mechanism for detecting the inconsistency between the metadata and the data, after the host restarts, incorrect data may remain in the cache and be written back to the network attached storage. Another case that may compromise data integrity is through violating the semantics of writes. If data is buffered in memory and not flushed to disk after reporting write completion to the VM, a system crash will cause data loss. Taking such semantics into consideration, vStore of the present disclosure in one embodiment may be designed to support data integrity.

The second challenge is to achieve high performance, which conflicts with ensuring data integrity; hence vStore may be designed to minimize performance penalties. The performance of vStore may be affected by several factors: (i) data placement within the cache, (ii) vStore metadata placement on disk, (iii) complication introduced by the vStore logic. For (i), if sequential blocks in a virtual disk are placed far apart in the cache, a sequential read of these blocks incurs a high overhead due to a long disk seek time. Therefore, in one embodiment, vStore keeps a virtual disk as sequential as possible in the limited cache space. For (ii), ideally, on-disk metadata should be small and should not require an additional disk seek to access data and metadata separately. For (iii), one potential overhead is the dependency among outstanding requests. For example, if one request is about to evict one cache entry, then all the requests on that entry must wait. All of these factors may be considered in the design of vStore.

FIG. 2 shows the architecture of vStore in one embodiment of the present disclosure. The description herein is based on para-virtualized Xen as an example. VMs 202 generate block requests in the form of (sector address, sector count). Requests arrive at the front-end device driver within the VM 202 after passing through the guest kernel. Then they are forwarded to the back-end driver in Domain-0. The back-end driver issues actual I/O requests to the device, and sends responses to the guest VM 202 along the reverse path.

In one embodiment, the vStore module 204 runs in Domain-0, and extends the function of the back-end device driver. vStore 204 intercepts requests and filters them through its cache handling logic. In FIG. 2, vStore 204 internally may include a wait queue 206 for incoming requests, a cache handling logic 208, and in-memory metadata 210. Incoming requests are first put into vStore's wait queue 206. The wait queue 206 is used in one embodiment because the cache entry that a request needs to use might be under eviction or update triggered by previous requests. After clearing such conflicts, the request is handled by the cache handling logic 208. The in-memory metadata 210 are consulted to obtain information such as block address, dirty bit, and modification time. Depending on the current cache state, actual I/O requests are made to either the cache on local storage 212 or the network attached storage 214.

I/O Unit: Guest VMs usually operate on 4 KB blocks, but vStore can perform I/Os to and from the network attached storage at a configurable larger unit. A large I/O unit reduces the size of in-memory metadata, as it reduces the number of cache entries to manage. Moreover, a large I/O unit works well with high-end storage servers, which are optimized for large I/O sizes (e.g., 256 KB or even 1 MB). Thus, reading a large unit is as efficient as reading 4 KB. This may increase the incoming network traffic, but our evaluation shows that the subsequent savings outweigh the initial cost. We use the term, block group, to refer to the I/O unit used by the vStore as opposed to the (typically 4 KB) block used by the guest VMs. That is, one block group contains one or more 4 KB blocks.
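To make the relationship between guest blocks and block groups concrete, the sketch below maps a guest sector address to its block group and counts how many cache entries a cache needs at each unit size. The 512-byte sector size, the 256 KB block group size, the 1 TiB cache size, and all function names are assumptions chosen for illustration, not values fixed by the disclosure.

```python
SECTOR_BYTES = 512            # assumption: conventional 512-byte sectors
BLOCK_BYTES = 4 * 1024        # guest block size named in the text
GROUP_BYTES = 256 * 1024      # assumption: one possible block group size

def block_group_of(sector_addr: int) -> tuple[int, int]:
    """Return (block group index, index of the 4 KB block inside that group)."""
    byte_addr = sector_addr * SECTOR_BYTES
    group = byte_addr // GROUP_BYTES
    block_in_group = (byte_addr % GROUP_BYTES) // BLOCK_BYTES
    return group, block_in_group

def cache_entries(cache_bytes: int, unit_bytes: int) -> int:
    """Number of cache entries (and metadata records) for a given I/O unit."""
    return cache_bytes // unit_bytes

cache = 1024 ** 4  # hypothetical 1 TiB local cache
print(block_group_of(1000))
print(cache_entries(cache, BLOCK_BYTES), cache_entries(cache, GROUP_BYTES))
```

With these assumed sizes, moving from 4 KB entries to 256 KB block groups cuts the number of metadata records by a factor of 64.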

Metadata: Metadata holds information about cache entries on disk. Metadata are stored on disk for data integrity and cached in memory for performance. Metadata updates are done in a write-through manner. After a host crashes and recovers, vStore visits each metadata entry on disk and recovers any dirty data that have not been flushed to network attached storage. Table 1 summarizes examples of the metadata fields in one embodiment of the present disclosure.

TABLE 1: vStore Metadata

Field            Size       Description
---------------  ---------  -----------
Virtual Disk ID  2 Bytes    ID assigned by vStore to uniquely identify a virtual disk. An ID is unique only within individual hypervisors.
Sector Address   4 Bytes    Cache entry's remote address in unit of sector.
Dirty Bit        1 Bit      Set if cache content is modified.
Valid Bit        1 Bit      Set if cache entry is being used and the corresponding data is in the cache.
Lock Bit         1 Bit      Set if under modification by a request.
Read Count       2 Bytes    How many read accesses within a time unit.
Write Count      2 Bytes    How many write accesses within a time unit.
Bit Vector       Variable   Each bit represents 4 KB within the block group. Set if corresponding 4 KB is valid. The size is (block group)/4 KB bits.
Access Time      8 Bytes    Most recently accessed time.
Total Size       <23 Bytes

Virtual Disk identifier (ID) identifies a virtual disk stored on network attached storage. When a virtual disk is detached and reconnected later, cached contents that belong to this disk are identified and reused. Bit Vector has one bit for each 4 KB block in a block group so that the states of 4 KB blocks in the same block group can be changed and tracked individually. Without Bit Vector, the states of 4 KB blocks in the same block group must always be changed together. As a result, when the VM writes to a 4 KB block, vStore must read the entire block group (including all 4 KB blocks in that block group) from network attached storage, merge it with the 4 KB of new data, and write the entire block group to cache. With Bit Vector, vStore can write the 4 KB data directly without fetching the entire block group, and then only change the affected 4 KB block's state in Bit Vector. Our experiments show that Bit Vector helps reduce network traffic when using a large cache unit size.
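The role of the Bit Vector can be sketched with a small in-memory record modeled on the fields of Table 1. The class and method names, the use of a Python integer as the bit vector, and the choice of 64 blocks per group (a 256 KB block group) are assumptions made for illustration only.

```python
from dataclasses import dataclass

BLOCKS_PER_GROUP = 64  # assumption: 256 KB block group / 4 KB blocks

@dataclass
class CacheEntryMeta:
    """Hypothetical in-memory record mirroring the fields of Table 1."""
    virtual_disk_id: int
    sector_address: int
    dirty: bool = False
    valid: bool = False
    locked: bool = False
    read_count: int = 0
    write_count: int = 0
    bit_vector: int = 0       # bit i set => i-th 4 KB block is valid in the cache
    access_time: float = 0.0

    def block_valid(self, i: int) -> bool:
        return bool((self.bit_vector >> i) & 1)

    def record_write(self, i: int, now: float) -> None:
        """Mark one 4 KB block as written without touching its neighbours."""
        self.bit_vector |= 1 << i
        self.dirty = True
        self.valid = True
        self.write_count += 1
        self.access_time = now

    def missing_blocks(self) -> list[int]:
        """Blocks that would still have to be fetched from remote storage."""
        return [i for i in range(BLOCKS_PER_GROUP) if not self.block_valid(i)]

entry = CacheEntryMeta(virtual_disk_id=1, sector_address=0)
entry.record_write(3, now=1.0)
print(entry.block_valid(3), len(entry.missing_blocks()))
```

In this sketch a 4 KB write only sets one bit, so the other 63 blocks of the group are neither fetched nor marked valid.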

Maintaining metadata on disk may compromise performance. A naive implementation may require two disk accesses to handle one write request issued by a VM—one for metadata update and one for writing actual data. In the present disclosure in one embodiment, vStore solves this problem by putting metadata and data together, and updates them in a single write. The details are described below.

In-memory Metadata: To avoid disk I/Os for reading the on-disk metadata, vStore in one embodiment maintains a complete copy of the metadata in memory and updates them in a write-through manner. One embodiment of the present disclosure uses a large block group size (e.g., 256 KB) to reduce the size of the in-memory metadata.

Cache Structure: vStore in one embodiment of the present disclosure organizes local storage as a set-associative cache with write-back policy by default. We describe the cache as a table-like structure, where a cache set is a column in the table, and a cache row is a row in the table. A cache row includes multiple block groups. A block group has contents coming from one virtual disk, but different block groups in the same cache row may have contents coming from different virtual disks. Block groups in the same cache row are laid out in logically contiguous disk blocks in one embodiment of the present disclosure.

FIG. 3 illustrates the structure of one cache entry in one embodiment of the present disclosure. A block group includes n 4-kilobyte (KB) blocks, each of which has a trailer. For instance, each 4 KB block 302 in a block group 304 has a 512-byte trailer 306 shown in FIG. 3. This trailer 306 in one embodiment includes metadata 308 and the hash value 310 of the 4 KB data block 302. On a write operation, vStore computes the hash of the 4 KB block 302, and writes the 4 KB block 302 and its 512-byte trailer 306 in a single write operation. If the host crashes during the write operation, after recovery, the hash value helps detect that the 4 KB block and the trailer are inconsistent. The 4 KB block can be safely discarded, because the completion of the write operation has not been acknowledged to the VM yet. When handling a read request, vStore also reads the 512-byte trailer 306 together with the 4 KB block 302. As a result, a sequential read of two adjacent blocks issued by the VM is also sequential in the cache. If only the 4 KB data block is read without the trailer, the sequential request would be broken into two sub-requests, spaced apart by 512 bytes.
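The block-plus-trailer scheme can be sketched as follows. The text does not name a hash function or the byte layout inside the trailer, so SHA-256, the "metadata followed by hash, zero-padded" layout, and the function names `pack_block` and `check_block` are all assumptions for illustration.

```python
import hashlib

BLOCK_BYTES = 4096
TRAILER_BYTES = 512

def pack_block(data: bytes, metadata: bytes) -> bytes:
    """Build one record: a 4 KB block followed by its 512-byte trailer."""
    if len(data) != BLOCK_BYTES:
        raise ValueError("data must be exactly 4 KB")
    digest = hashlib.sha256(data).digest()      # assumption: SHA-256 as the hash
    trailer = metadata + digest                 # assumption: metadata first, then hash
    if len(trailer) > TRAILER_BYTES:
        raise ValueError("metadata too large for the trailer")
    return data + trailer.ljust(TRAILER_BYTES, b"\0")

def check_block(record: bytes, metadata_len: int) -> bool:
    """Return True if the hash stored in the trailer matches the stored data."""
    data = record[:BLOCK_BYTES]
    trailer = record[BLOCK_BYTES:BLOCK_BYTES + TRAILER_BYTES]
    digest_len = hashlib.sha256().digest_size
    stored = trailer[metadata_len:metadata_len + digest_len]
    return hashlib.sha256(data).digest() == stored

record = pack_block(b"a" * BLOCK_BYTES, b"m" * 23)
torn = b"b" * 100 + record[100:]                # simulate a partially written block
print(len(record), check_block(record, 23), check_block(torn, 23))
```

Because data and trailer are emitted as one 4,608-byte record, a single write covers both, and a recovery scan can discard any record whose hash check fails.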

Cache Replacement

In one aspect, simple policies like least recently used (LRU) and least frequently used (LFU) may not be suitable for vStore, because they are designed primarily for memory-based cache without consideration of block sequentiality on disk. If two consecutive blocks in a virtual disk are placed at two random locations in vStore's cache, sequential I/O requests issued by the VM become random accesses on the physical disk. In one embodiment, vStore's cache replacement algorithm strives to preserve the sequentiality of a virtual disk's blocks.

Below, we describe an embodiment of vStore's cache replacement algorithm in detail. We introduce the concept of the base cache row of a virtual disk. The base cache row is the default cache row on which the first row of blocks of a virtual disk is placed. Subsequent blocks of the virtual disk are mapped to the subsequent cache rows. For example, if there are two virtual disks Disk1 and Disk2 currently attached to the vStore and the cache associativity is 5 (i.e., there are 5 cache rows), then Disk1 might be assigned 1 as a base cache row and Disk2 might be assigned 3 to keep them reasonably away from each other. If we assume one cache row is made of ten 128 KB cache groups, Disk2's block at address 1280K will be mapped to row 4, which is the next row from Disk2's base cache row.
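The default placement in this example can be written as a small calculation. The sketch below uses the example's numbers (128 KB groups, ten groups per row, five rows); numbering rows from 1 and wrapping past the last row back to the first are assumptions, since the text does not state either, and the function name is hypothetical.

```python
GROUP_KB = 128          # group size from the example
GROUPS_PER_ROW = 10     # groups per cache row from the example
NUM_ROWS = 5            # cache associativity from the example

def default_location(base_row: int, address_kb: int) -> tuple[int, int]:
    """Return (cache row, cache set) where a block lands by default.

    Rows are numbered from 1 here. Wrapping past the last row back to row 1
    is an assumption; the text does not say what happens at that boundary.
    """
    group = address_kb // GROUP_KB
    row_offset = group // GROUPS_PER_ROW      # how many rows past the base row
    cache_set = group % GROUPS_PER_ROW        # column within the row
    row = (base_row - 1 + row_offset) % NUM_ROWS + 1
    return row, cache_set

print(default_location(3, 1280))   # Disk2 (base row 3), block at address 1280K
```

For Disk2's block at 1280K this yields row 4, the row after its base cache row, which agrees with the example in the text.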

Upon arrival of a new data block, vStore in one embodiment determines the cache location in two steps. First, it looks at the state of the cache entry whose location is calculated using the base cache row and the block's address. If that entry is invalid or not dirty, the new block is immediately assigned to it. Second, if the entry is dirty, a victim entry is selected based on scores. Six criteria may be used to calculate the score in one embodiment:

Recentness: The more recently the entry was accessed, the higher the score.
Prior Sequentiality: This measures how sequential the cache entry is with respect to the adjacent cache entries. If the cache entry is already sequential, then we prefer to keep it in one embodiment.
Prior Distance: This measures how far away the cache entry is from the default base cache row. If the entry is located in cache row 2 and the default base cache row of the virtual disk is 1, then the value is 2 - 1 = 1.
Posterior Sequentiality: This measures how sequential the cache would be if the new block were cached in this entry. If it becomes sequential, then we prefer this cache entry as a victim.
Posterior Distance: This measures how far away from the default base cache row the new block would be if it were cached in this entry. If this distance is large, the entry is less preferable.
Dirtiness: If the cache entry is modified, we would like to avoid evicting this entry as much as possible.

Let x_i be each of the six criteria described above, for i = 0 to 5. A score may be computed using equation (1) as follows.

S = a_0·x_0 + a_1·x_1 + ... + a_5·x_5  (1)

Here the coefficient a_i represents the weight of criterion x_i. If every a_i is 0 except the weight of the Recentness criterion, the eviction policy becomes equivalent to LRU. Weight coefficients are adjustable according to preference. In one embodiment, this score is computed for all the cache entries within the cache set, and the entry with the lowest score is chosen for eviction.
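A minimal sketch of the weighted score and victim selection follows. The text gives no numeric weights or criterion scales, so the values below are invented, each criterion is assumed to be encoded so that a higher value argues for keeping the entry, and the names `Candidate`, `score`, and `pick_victim` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """A cache entry considered for eviction, with its six criterion values."""
    name: str
    criteria: tuple[float, ...]   # assumption: higher value = more worth keeping

def score(criteria, weights):
    """Weighted sum of the criteria, as in equation (1)."""
    if len(criteria) != len(weights):
        raise ValueError("need one weight per criterion")
    return sum(a * x for a, x in zip(weights, criteria))

def pick_victim(candidates, weights):
    """Return the candidate with the lowest score within the cache set."""
    return min(candidates, key=lambda c: score(c.criteria, weights))

weights = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0)              # hypothetical equal weights
entry_a = Candidate("A", (0.9, 1.0, 0.0, 0.0, 0.0, 1.0))
entry_b = Candidate("B", (0.2, 0.0, 0.0, 0.0, 0.0, 0.0))
victim = pick_victim([entry_a, entry_b], weights)
print(victim.name)
```

Tuning the weights shifts the policy between favoring recency, sequentiality, or avoiding dirty evictions without changing the selection code.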

Cache Handling Operations

In one embodiment of the present disclosure, there may be three cases in cache handling: cache hit, miss without flush, and miss with flush. In one embodiment, the vStore design considers both performance and data integrity in its cache handling operations. Since vStore uses disk as cache space, cache handling involves more disk accesses than when no cache is used. Excessive disk accesses may degrade the overall performance and reduce the merit of using vStore. In one embodiment of the present disclosure, disk accesses are minimized to make the performance loss tolerable. vStore may address data integrity, in one embodiment, as follows. A 512-byte trailer is added to each 4 KB block to record the block's hash. In order to minimize disk I/O in one embodiment of the present disclosure, the trailer is read and written together with its block. This increases the data size but does not increase the number of I/Os. However, for cache miss handling, additional disk I/O for data integrity may be introduced. In general, such consistency issues complicate overall cache handling, and there may be a trade-off between maintaining consistency and the performance penalty due to additional disk I/O.

FIG. 4A is a flow diagram illustrating read request handling in one embodiment of the present disclosure. FIG. 4B is a flow diagram illustrating write request handling in one embodiment of the present disclosure.

READ Handling

FIG. 4A illustrates a flow diagram for read cache handling in one embodiment of the present disclosure. At 402, a read request is received. The read request may originate from an application in a VM, for example to read data X. At 404, it is determined whether the block group which stores the data of the read request is already cached. For example, the sector address of the read data is compared with the in-memory metadata to determine whether the block group is cached already. If it is determined that the block group is cached, the flow logic proceeds to 406; otherwise the flow logic proceeds to 420.

Using a virtual disk involves multiple steps: open the virtual disk, perform reads/writes, and finally close the virtual disk. When the virtual disk is opened, vStore assigns a "Virtual Disk ID" to the virtual disk and maps it to a remote disk on a storage server (the virtual disk ID was described previously). This mapping relationship is kept in a mapping table, and stored both in memory and on disk in one embodiment. When the VM issues a read request, vStore knows the Virtual Disk ID implicitly (because the request comes from a previously opened handle) and the sector address is specified explicitly. Combining the virtual disk ID and the sector address as one search key to look up the in-memory metadata can determine whether the data is cached and, if so, which block group currently caches the data. The following shows an example data structure of the combined search key.
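The example structure from the application is not reproduced in this text. Purely as a hypothetical stand-in, and not the structure defined in the application, the sketch below shows one way a combined key could drive the lookup: the virtual disk ID is paired with the sector address rounded down to the start of its block group, and a dictionary plays the role of the in-memory metadata. The sector size, block group size, slot numbers, and all names are assumptions.

```python
SECTOR_BYTES = 512                 # assumption: 512-byte sectors
GROUP_BYTES = 256 * 1024           # assumption: 256 KB block groups
SECTORS_PER_GROUP = GROUP_BYTES // SECTOR_BYTES

def search_key(virtual_disk_id: int, sector_addr: int) -> tuple[int, int]:
    """Pair the disk ID with the sector address rounded down to its block group."""
    group_start = sector_addr - sector_addr % SECTORS_PER_GROUP
    return virtual_disk_id, group_start

# Hypothetical in-memory metadata: search key -> cache slot on local storage.
in_memory_metadata: dict[tuple[int, int], int] = {}
in_memory_metadata[search_key(7, 1024)] = 42   # the group holding sector 1024 is cached

def lookup(virtual_disk_id: int, sector_addr: int):
    """Return the cache slot holding the sector, or None on a cache miss."""
    return in_memory_metadata.get(search_key(virtual_disk_id, sector_addr))

print(lookup(7, 1500), lookup(7, 2048), lookup(8, 1024))
```

Any sector inside a cached block group resolves to the same key, so one dictionary lookup answers both whether the data is cached and where.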



Download full PDF for full patent description/claims.



