Ok. This is second part of EMC Celerra Unified, CLARiiON storage with Vmware – Deduplication and Compression Part 1. In my last post, I explained how EMC block compression works and where it will applies. In this post, I’m going to talk about another important part of EMC Celerra Unified storage technology. Celerra file level Deduplication& compression. AKA, Celerra Data deduplication.
Considering Celerra Unified storage is combination of Celerra (NAS) and CALRiiON (block storage, FC). The essential Celerra is actually focusing on NAS parts. It is not only has compression function(but it’s file level), but also has file level deduplication. Let’s talk about how it works.
The sequence of Celerra Data Deduplication is to look for cold data, compress first, then deduplicate later.
Please be aware although NAS is basically a file system. but Celerra is not exactly operating on file leve. Let’s see how it works.
If you check out latest version of Celerra docs, you may notice Celerra compression is not longer a file compression. It’s actually just labeled a “compression”.
Plus, EMC used to claim that do not use deduplication on NFS for Vmware, but recentely, EMC just released the Celerra Vmware Plug-in which allow you to compress single VM. First of all, let’s talk about “Initial compression and deduplication”
Initial and scheduled compression & deduplication
In the Celerra, the compression and deduplication is done together. Celerra periodically scan file system and compress “cold” (not recently active files) and hash them into metadatabase. If there are same hash exist in the database, this file is deduplicated.
So the process is:
Compress->hash->compare meta database->Copy to hidden space of this LUN (or not)->Update meta database.
Reading Compressed file
Compress Read is not much difference from EMC CLARiiON read. Celerra directly load compressed data into memory and extract data in the memory. Nothing is writing back to disk. Therefore, in some cases, reading compression data is even faster than uncompression data but with some CPU cycles cost.
Writing to a compressed & deduplicated file
Writing a compressed file is a long procedure because it involves writing to disk. A write to or a modification of a deduplicated file cause a copy of the requested portion of the file data to be reduplicated for this particular instance while preserving the deduplicated data for the remaining references to this file.
The entire file is not decompressed and reduplicated on the disk until the sum of the number of individual changed blocks and the number of blocks in the corresponding deduplicated file are larger than the logical file size.
Compression with Vmware environment
Celerra Vmware Plug-in just be released not long ago. It can only work on NAS file system. It has “Thin provisioning”, “Fast/Full Clone”, “Compression/Decompression” features. Just like CLARiiON system, you can use this plug-in to off load those operations from host to SAN. One of operation is compression. It allows you to compress a VM regardless it’s think or thin disk. VMs that are selected for storage optimization are compressed by using the same compression algorithm that is used in Celerra Data Deduplication. The algorithm comresses the VMDK file (virtual disk) and leaves intact other files that make up the VMs. When a compressed VMDK is read, only the data blocks containing the actual data are decompressed. Likewise, when data is written to the VMDK, instead of compressing it “on the fly”, data is written to a set-aside buffer, which then gets compressed as a background task. In most situations, the amount of data read or written is a small percentage of total amount of data stored. VM performance will typically be impacted maginally (less than 10%) when the corrsponding VMDK file is compressed.
I still think there are plenty of things EMC can do. Celerra plug – in is just a beginning. I will keep eye on it and post more later.