Archive for February, 2009

Will the real data please stand up? A look at deduplication in the online backup world

by Sekar Vembu on February 11th, 2009

Talk about data deduplication (in the backup and archiving domain) seems to have gained a fair amount of momentum in the last few years! Most enterprise backup software vendors like Symantec (Veritas), EMC (Avamar) etc. support deduplication in some form or other – some do deduplication in the source system (that is being backed up) and others do deduplication at the target (backup/storage server). There are also pure “deduplication based storage hardware vendors” like Data Domain who have gained considerable traction in the enterprise.

I am actually quite surprised by the hype around deduplication and the adoption it seems to have gained in the enterprise. The reason I am surprised is similar to the one I articulated in my previous blog post: “Synthetic Full Backup in the online backup world – Are we inviting trouble?”. The crux of my argument is that backup and archiving is about building redundancy into the data and not about eliminating redundancy in the name of efficiency of storage or network bandwidth. So it is my contention that wherever feasible we should keep as much redundancy as possible in the data (that needs backing up), and only under unavoidable circumstances should we resort to using synthetic full backup or deduplication. Actually, let me state this more strongly: “avoid falling for the synthetic full backup or deduplication hype if you can!”

But who am I to say this? I am neither an “industry expert” nor am I Steve Jobs to say “this is what is good for you; take it or leave it”. Given that we are a niche company trying to grow (and growing) in the face of industry giants, we are actually contemplating building deduplication support in our data backup software, StoreGrid. While not many of our customers/partners are asking for it, we do get the occasional prospect saying that deduplication (rather, the lack of it) is a show stopper for them!

As we started thinking about and designing the best way to support deduplication in StoreGrid, we encountered many options to consider and many complexities to be handled. But in the end, we were left with a fundamental question – whether full-fledged deduplication is indeed possible in the online backup world! Before I explain some of the options and the complexities, and why we think full-fledged deduplication may not be feasible in a pure online backup scenario, let me first give a broad overview of the two deduplication approaches…

Deduplication at the source (client) vs. at the target (backup server)
Some vendors claim they do the deduplication at the source (i.e. the client system that is being backed up), while others claim that they do it at the target (i.e. at the backup server). If deduplication is done at the source, it is easy to deduplicate data at a block level across all files within that source system. If deduplication is done at the target, it is equally easy to deduplicate data at a block level across all files from all the client systems backing up to the backup server. Quite obviously, deduplicating across all files from all clients will be much more effective than deduplicating within a single client system. It is theoretically possible to do deduplication at the source and still deduplicate across all systems backing up to the backup server. In this case, each client (source) has to continuously update itself with the meta-data of the blocks that are stored in the backup server. The meta-data in this case would simply be the checksums of the blocks; these checksums are looked up to identify identical blocks of data. I have not personally tested such a product, i.e. one that does deduplication at the source and still deduplicates across all systems backing up to the backup server. But this may not be as efficient in terms of performance as doing the deduplication at the backup server end, especially if the backup/storage server resides at a remote data center (and the meta-data needs to be downloaded each time from the remote server).
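To make the checksum lookup concrete, here is a minimal sketch of block-level deduplication at the target. It is only an illustration, not how StoreGrid or any other product does it: the fixed 4 KB block size, the choice of SHA-256 as the checksum and the `BlockStore` class are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, purely illustrative


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Hypothetical target-side store keeping one physical copy per unique block."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}  # checksum -> block contents

    def put(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        self.blocks.setdefault(digest, block)  # stored only if not seen before
        return digest

    def backup(self, data: bytes) -> list[str]:
        """Store the data and return its 'recipe': the ordered list of checksums."""
        return [self.put(block) for block in split_blocks(data)]

    def restore(self, recipe: list[str]) -> bytes:
        return b"".join(self.blocks[digest] for digest in recipe)


# Two clients whose data shares three blocks and differs in one block each.
shared = b"".join(bytes([i]) * BLOCK_SIZE for i in range(3))
client1 = shared + b"\xaa" * BLOCK_SIZE
client2 = shared + b"\xbb" * BLOCK_SIZE

store = BlockStore()
recipe1 = store.backup(client1)
recipe2 = store.backup(client2)

print(len(recipe1) + len(recipe2))  # 8 logical blocks backed up
print(len(store.blocks))            # 5 blocks physically stored
```

The three shared blocks are stored once even though they arrive from two different clients, which is the cross-client saving that deduplication at the target gives you.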

Armed with this background, let’s dive deeper into the implications of these ‘approaches’ in the online backup context…

Option 1: Deduplication at target
One of the most important requirements in the online backup domain is that the data being backed up is encrypted before it leaves the source system and is sent over the internet to the remote data center (where the data is stored). Deduplication works by finding identical blocks across all the files and physically storing only one copy of each such block in the storage system. Encryption, on the other hand, works by destroying all patterns in the data and making it look random. Because of this, trying to deduplicate a set of encrypted files will have no effect, i.e. looking for identical blocks across encrypted data will not be of much use, since blocks that were identical before encryption no longer match once each client has encrypted them with its own key. That means doing deduplication at the remote storage end, where the data from different client systems is stored encrypted, is technically not possible. Not encrypting the data that is being backed up to the remote data center is not really an option in the online backup world. Another point to note is that deduplication at the target doesn’t really help much in an online backup scenario: clients still send all the data across and hence don’t save anything on bandwidth! Of course, you save on ‘server side storage’, but optimizing this, I’d assume, comes a distant second to optimizing bandwidth utilization for online backups!
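The effect described above can be seen in a small sketch. Python’s standard library has no real cipher, so `toy_encrypt` below is a hypothetical stand-in (an XOR keystream derived from SHA-256) that is NOT secure and exists only for illustration; the client keys are made up as well.

```python
import hashlib


def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """Toy XOR stream cipher (SHA-256 in counter mode). NOT secure, illustration only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(block):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    # zip() stops at the end of the block, so the extra keystream bytes are ignored
    return bytes(a ^ b for a, b in zip(block, stream))


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# The same block of data sits on two client systems.
block = b"same payroll spreadsheet block" * 100
key_a = b"client-A-secret"  # hypothetical per-client keys
key_b = b"client-B-secret"

# Before encryption both clients produce the same checksum: a duplicate is found.
plain_checksums = {checksum(block), checksum(block)}

# After encryption with per-client keys the checksums differ: no duplicate is found.
cipher_checksums = {
    checksum(toy_encrypt(key_a, block)),
    checksum(toy_encrypt(key_b, block)),
}

print(len(plain_checksums))   # 1
print(len(cipher_checksums))  # 2
```

The server only ever sees the two encrypted blocks, and to it they look like unrelated data.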

Option 2: Deduplication at source – with a common encryption key
As I said before, it is theoretically possible to do deduplication at the source and still deduplicate across all client systems in an organization. In order to do that, either the data should not be encrypted during backup or all the client systems will have to use a common encryption key to encrypt the data. Not encrypting the data is not really an option with online backups. Using a common encryption key would mean that, for each block of data that is backed up, the checksum of the unencrypted block is also sent to the backup server, where it is stored. Every client should then look up this database of checksums on the backup server before sending a block of data to it. Though this can be done, I am not really fond of this option because of the performance penalty: in the case of online backups the backup server is at a remote location, so every lookup has to travel over the internet.
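A rough sketch of this lookup-before-send flow follows. The `RemoteServer` class, its `has`/`put` methods and the XOR placeholder for encryption are all hypothetical; each method call stands in for one round trip to the remote data center.

```python
import hashlib


class RemoteServer:
    """Hypothetical remote backup server; each method call models one network round trip."""

    def __init__(self) -> None:
        self.index: dict[str, bytes] = {}  # checksum of plaintext block -> encrypted block
        self.round_trips = 0

    def has(self, digest: str) -> bool:
        self.round_trips += 1
        return digest in self.index

    def put(self, digest: str, encrypted: bytes) -> None:
        self.round_trips += 1
        self.index[digest] = encrypted


def encrypt_with_common_key(block: bytes) -> bytes:
    """Placeholder for real encryption with the organization-wide key (toy XOR, NOT secure)."""
    return bytes(b ^ 0x5A for b in block)


def backup(server: RemoteServer, blocks: list[bytes]) -> int:
    """Send only the blocks the server does not already hold; return payload bytes uploaded."""
    sent = 0
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()  # checksum of the unencrypted block
        if not server.has(digest):                  # lookup against the remote database
            server.put(digest, encrypt_with_common_key(block))
            sent += len(block)
    return sent


server = RemoteServer()
blocks_a = [b"os" * 2048, b"office" * 700, b"alice-docs" * 400]
blocks_b = [b"os" * 2048, b"office" * 700, b"bob-docs" * 400]

sent_a = backup(server, blocks_a)
sent_b = backup(server, blocks_b)

print(sent_a)              # 12296 bytes: everything is new to the server
print(sent_b)              # 3200 bytes: only the block unique to the second client
print(server.round_trips)  # 10: one lookup per block, plus one upload per new block
```

The second client uploads far less data, but it still pays one remote lookup for every block it holds, which is the performance penalty mentioned above. A real implementation would presumably batch or cache these lookups.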

Option 3: Deduplication at local target backup server – with offsite replication
The only practical option I can think of is a deployment model where all clients in an organization back up to a local backup server, without encryption. The backed up data is deduplicated at the local backup server and then encrypted and sent to a remote backup or replication server. This deployment model ensures that deduplication is done on data from across all clients backing up to the local backup server. Depending upon a customer’s preference, the local backup server can either keep a copy of the deduplicated backed up data (for quicker restores), or the data at the local backup server can be purged (not recommended) once it has been moved to the remote backup/replication server.
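As a sketch of this deployment model, the example below has two clients send unencrypted blocks to a local server, which deduplicates them and then encrypts and ships only the unique blocks offsite. The `LocalBackupServer` class, the site key and the toy cipher are assumptions for illustration only; the cipher is NOT secure.

```python
import hashlib

SITE_KEY = b"hypothetical-site-key"  # assumed key held by the local backup server


def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """Toy XOR stream cipher (SHA-256 in counter mode). NOT secure, illustration only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(block):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(block, stream))


class LocalBackupServer:
    """Hypothetical local target: deduplicates plaintext, then encrypts for offsite copy."""

    def __init__(self) -> None:
        self.unique: dict[str, bytes] = {}  # checksum -> plaintext block
        self.replicated: set[str] = set()   # checksums already sent offsite

    def receive(self, blocks: list[bytes]) -> None:
        """Clients send unencrypted blocks over the LAN; duplicates are dropped here."""
        for block in blocks:
            self.unique.setdefault(hashlib.sha256(block).hexdigest(), block)

    def replicate(self, offsite: dict[str, bytes]) -> int:
        """Encrypt and send blocks not yet offsite; return payload bytes sent over the WAN."""
        sent = 0
        for digest, block in self.unique.items():
            if digest not in self.replicated:
                offsite[digest] = toy_encrypt(SITE_KEY, block)
                self.replicated.add(digest)
                sent += len(block)
        return sent


common = b"x" * 4096
client1 = [common, b"a" * 4096]
client2 = [common, b"b" * 4096]

local = LocalBackupServer()
local.receive(client1)
local.receive(client2)

offsite: dict[str, bytes] = {}
lan_bytes = sum(len(b) for b in client1 + client2)
wan_bytes = local.replicate(offsite)

print(lan_bytes)  # 16384 bytes travel over the local network
print(wan_bytes)  # 12288 bytes travel to the remote server
```

Only the deduplicated, encrypted blocks cross the internet, and a second call to `replicate` sends nothing until new blocks arrive.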

In summary, we prefer the last approach, viz. doing the deduplication at a target backup server which is deployed locally at the site where the client systems are. This would allow the clients to back up to the local backup server without encrypting the data, thus facilitating deduplication at the target. And for offsite storage, the data from the local backup server would be deduplicated, encrypted and sent to the remote backup or replication server. This would ensure that the bandwidth savings associated with deduplication are also achieved.

I look forward to feedback & suggestions on other ‘better’ ways of implementing deduplication in the online backup domain!

The above post was written by Sekar Vembu of Vembu Technologies. Vembu Technologies is a backup software vendor whose product, StoreGrid, powers the online backup services of a large number of service providers across the globe. Besides remote backup, StoreGrid is also used for on premise backups of workstations and servers at various companies & universities.

Carbonite (and Mozy’s) Achilles heel

by lux on February 4th, 2009

Online backup service provider Carbonite has been in the news for some time now. Recently, though, they were in the news for the wrong reasons. Hot on the heels of the Belkin fiasco, “biased reviews without disclosing affiliations” are getting a hard rap on the knuckles!

The Weakest Link for Commodity Online Backup services

The objective of this post is not to debate companies’ marketing tactics (that too from 2 years ago – I’m not sure if that’s really relevant now) but to draw your attention to something else that is indeed relevant…

Sometimes a VAR/MSP who uses StoreGrid to offer online backup services to their clients asks us how they can compete with the likes of Carbonite & Mozy. We simply tell them: “Don’t even bother trying”. The events that led to the Carbonite story are quite relevant in this context…

A customer (calling himself Bruce Goldensteinberg) signed up for Carbonite after hearing ads for it on the radio. The backing up part went well, but when his computer actually crashed, he was unable to restore it from the online backup. When he called Carbonite’s customer support, they kept him on hold for over an hour. Read those last 8 words again – that’s the crux of the problem! The customer elaborates in his post that during his ‘wait time’ on the call, he was offered the option to avail of premium service and a quicker response – by paying US$ 20 more!

This, to me, is the fundamental point of departure between commodity online backup services and VARs/MSPs offering managed online backup services. Commodity backup services simply cannot afford to service customers paying $4/month – even after you do the math on what % of users actually avail of support!

We keep saying – backups are not like Skype (who don’t offer phone support, and whose email support isn’t too great, by the way). You cannot say your call quality was bad on Wednesday and great over the weekend and move on! Your data was either backed up or not – there are no shades of grey here. Couple this with the increasing complexity and heterogeneity in the IT environment and the requirements for specialized backups like Exchange & SQL, and the value added by a VAR/MSP becomes a lot clearer.

This ‘unaccounted support overhead’ is even more evident when you read blogs describing customer experiences with Carbonite, Mozy and other commodity online backup services. Most of those online rants end with “…the guy on the line offered me a full refund / free year of service”. In effect, the call center executives are empowered to offer a refund/freebie as required! Why? Because it happens quite often. It’s simply the nature of the beast…

COMMODITY ONLINE BACKUP SERVICES CANNOT AFFORD TO PROPERLY SUPPORT THEIR CUSTOMERS. PERIOD!

So, the next time someone asks you (their IT service provider) why they should use your online backup services over a $4/month ‘all you can eat’ service, ask them how they’d like to be treated when they have an acute stomach ache – by a real doctor or by an unknown guy on the phone who might attend to them after they’ve been kept on hold for an hour? How valuable is their health? How valuable is their data?

The above post was written by Lakshmanan (Lux) Narayan of Vembu Technologies.