Inline deduplication is a technique that removes redundant components of data before writing it to a storage device. Eliminating duplicate pieces reduces the storage space requirements without compromising the safety of the data.
As data volumes grow, so do the requirements for storage space, data center floor space, cooling, and network bandwidth. The growth also adds operational complexity, administration time, and risk, which makes ensuring data security and compliance costly and challenging.
According to IDC, multiple duplicate copies of content account for about 75% of the data in storage today. As such, removing the redundancies can help organizations reduce their storage needs and costs, and this is where deduplication comes in.
Basically, deduplication (often shortened to dedupe) is the technique of eliminating duplicate components of data, whether before backing up or for primary storage.
Deduplication can be implemented in several ways; the most common distinction is between inline deduplication, which removes duplicates as the data is written, and post-process deduplication, which cleans up redundancies after the data lands on disk.
Although the outcome depends on the environment, inline deduplication is often more efficient and economical than the post-process technique for some applications. The savings it achieves depend on the type of files, the frequency of backups, the environment, and other variables, but typical solutions cut storage needs by a factor of 10 to 30, which translates to lower drive capacity and bandwidth requirements.
Generally, reducing the data footprint has benefits such as smaller data center space, and savings on hardware, software, bandwidth, and power.
How does inline deduplication work?
The technique compares new data with what is in the storage device and only writes unique parts of the content. If there are matching pieces, it does not write the data again but adds a pointer to the existing data in the storage media.
The deduplication software breaks the data set into smaller parts, whether files, blocks, or byte ranges, and then uses a hashing algorithm to compute an identifying fingerprint for each chunk. Using smaller data pieces generally delivers better reduction and storage efficiency.
When there is new data to write, the system first checks whether each chunk's hash already exists in the index and only writes the unique parts. If there is a match, it does not write the data again but adds a pointer to the existing piece on the backup drive.
For example, if a file is 100% unique, the system copies everything to the backup device. However, if an identical file already exists on the backup, the system does not copy it again; instead, it records a pointer or placeholder in a hash table.
When restoring, the system follows the pointers in the hash table to retrieve the deduplicated pieces and reassemble the content.
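To make the flow concrete, here is a minimal, hypothetical Python sketch of an inline-deduplicating store. The fixed 4 KB chunk size, SHA-256 hashing, and in-memory dictionaries are assumptions for illustration; real systems typically use variable-size chunking and purpose-built on-disk indexes.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunks; real systems often chunk variably

class DedupStore:
    """Minimal in-memory sketch of an inline-deduplicating store."""

    def __init__(self):
        self.chunks = {}  # hash -> chunk bytes (stands in for the storage device)
        self.files = {}   # file name -> ordered list of chunk hashes (the pointer table)

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Unique chunk: write it to "disk".
                self.chunks[digest] = chunk
            # Duplicate or not, the file only records a pointer to the chunk.
            pointers.append(digest)
        self.files[name] = pointers

    def restore(self, name):
        # Follow the pointers to reassemble the original content.
        return b"".join(self.chunks[d] for d in self.files[name])

store = DedupStore()
store.write("a.bin", b"x" * 8192)                # two identical chunks, stored once
store.write("b.bin", b"x" * 4096 + b"y" * 4096)  # first chunk is already on "disk"
print(len(store.chunks))                         # 2 unique chunks back 4 logical ones
assert store.restore("b.bin") == b"x" * 4096 + b"y" * 4096
```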
The removal of duplicates happens before the system writes the data to the disk, and this may slow down the backup process. However, eliminating redundant content reduces the amount of data to write, and the overall delay may be insignificant.
Benefits of inline deduplication
The benefits specific to inline processing include writing data only once, so duplicate blocks never hit the disk and drive wear is reduced; needing no extra staging capacity to hold raw data while a separate deduplication pass runs, as post-processing does; and delivering space and bandwidth savings immediately rather than after a scheduled job completes.
Hardware vs. software inline deduplication
The choice between hardware and software deduplication depends on the environment as well as the current backup software and configuration. While older storage systems need additional software and configuration, modern hardware such as flash arrays often comes with built-in inline deduplication. If your system lacks the built-in option, you can extend its capabilities by inserting an inline deduplicating appliance in front of the existing legacy storage array.
Plug-and-play hardware appliances with built-in deduplicating capabilities provide faster processing and are easy to add. However, scalability is usually a challenge, and they sometimes require complex integration with existing infrastructure.
On the other hand, powerful Intel processors now enable software-based solutions to deliver better performance without compromising speed. The software approach, such as the Altaro backup solutions and others, has lower overhead, costs less, is more flexible, scales easily to the petabyte level, and is ideal for virtual and cloud environments.
When do you use inline deduplication?
Although inline deduplication is one of the major data reduction techniques, it is not suitable for every application. For example, it delivers negligible savings on data with little redundancy, such as engineering test data, music, video, and x-ray imagery.
The technology may not be the best fit for every environment; below are some areas where it works well.
As an example, imagine your organization has about 500 virtual machines running the same operating system. In that case, each instance of the OS consists of largely identical blocks, so with inline deduplication you write each block once instead of 500 times.
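A hypothetical back-of-the-envelope simulation of that scenario (the block count and block size are invented for illustration):

```python
import hashlib

# Pretend each of 500 VMs writes the same 100 OS blocks of 4 KB.
os_blocks = [bytes([b]) * 4096 for b in range(100)]

unique_hashes = set()
logical_writes = 0
for vm in range(500):
    for block in os_blocks:
        logical_writes += 1
        unique_hashes.add(hashlib.sha256(block).hexdigest())

print(f"{logical_writes} logical writes, {len(unique_hashes)} physical writes")
# 50000 logical writes, 100 physical writes -- a 500:1 reduction on this data
```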
Another application where the technology delivers huge savings is email archiving. For example, instead of storing a copy of an attachment for every user, the technology writes only one copy to the backup storage media.
Applications in hyper-converged infrastructure (HCI) appliances and virtual desktop environments
Most HCI vendors prefer inline deduplication for optimizing internal storage. Compared to post-process deduplication, the inline approach performs better while reducing both storage capacity requirements and drive wear. HCI appliances can usually accommodate only a limited number of physical disks, so removing duplicate data helps optimize the limited storage space.
Inline deduplication is also suitable for VDI storage, which has always been a challenge. Performance is usually the priority when deploying virtual desktop environments, so providers often use expensive, high-performance storage. By reducing the data footprint, you can make efficient use of the limited capacity those expensive, high-performance drives offer without spending more on extra drives.
Deduplicating inline for primary storage
Although most organizations use inline dedupe for backup or on secondary disks, it is also applicable to primary storage. This is especially useful when you want to take advantage of fast but expensive flash memory.
Flash storage usually costs so much that purchasing larger capacities is hard to justify. Eliminating duplicate information lets you keep capacity requirements down while enjoying the high speeds, delivering a better return on investment.
In some applications, inline deduplication can level the capacity playing field between low-cost traditional storage arrays and costly, high-performance all-flash arrays. For example, at ratios of 8:1 to 10:1, a 10-terabyte all-flash array can effectively hold as much data as an 80 to 100 TB traditional array.
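The underlying arithmetic is simply effective capacity ≈ raw capacity × deduplication ratio, as this quick sketch shows:

```python
raw_tb = 10  # raw all-flash capacity in TB
for ratio in (8, 10):
    print(f"{ratio}:1 deduplication -> ~{raw_tb * ratio} TB effective capacity")
# 8:1 -> ~80 TB, 10:1 -> ~100 TB, matching the 80-100 TB range above
```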
Data volumes continue to grow faster than the price of storage falls, so organizations need ways to reduce storage costs without sacrificing the security and quality of their data.
One of the most effective techniques is inline deduplication, which removes duplicate pieces before writing data to the drive. Downstream operations such as backup, archiving, replication, and network transfers all benefit from the smaller data footprint.
Day 1 began with the general session, which was a lot different from the previous year, when the VMware executives laid out their vision for the partner community. This year's general session focused squarely on the audience in attendance.
VMware's CTO of Global Field and Industry, Chris Wolf, began the general session. Chris is responsible for shaping VMware's long-term technology vision while ensuring that research and development priorities align with customer and industry needs. With this being a technical partner conference, I felt this was the right choice for leading the general session. Last year felt more like a sales pitch and less technical; I am not sure whether the change came from feedback VMware received after last year's conference, but in my opinion it demonstrates that the conference is now correctly aligned with the audience attending the event.
VMware Empower 2019 is bigger and offers much richer content than 2018, with over 90 breakout sessions, first-time instructor-led VMware labs, VCDX experts on hand to talk with, and opportunities to take a certification exam.
Chris spoke about the nature of applications and how they are changing amid unprecedented growth. Applications are more diversified, and demand is increasing more than ever before. Application needs and requirements are driving IT initiatives within customers' businesses.
Chris continued the general session by talking about the hybrid and public cloud journey. VMware approaches cloud through consistency in infrastructure, operations, and the native developer experience, which keeps workloads and the user experience consistent across both hybrid and public cloud offerings. By bringing consistency to infrastructure and operations, customers can more easily introduce service integration for managing cloud through business KPIs across the organization.
Automation allows customers to set guardrails by line of business and manage via policies; it also brings the ability to remediate, conform to standards, follow best practices, and adhere to industry standards. Governance and security allow for reporting on compliance and fixing misconfigurations, bring compliance by team, and enable proactive monitoring of security and compliance risks. The last pillar, cost and visibility, allows IT to accurately allocate costs and find unused resources; IT can optimize costs and infrastructure, automate cost control, and continue cost optimization based on strategy.
VMware Cloud Foundation brings complete cloud integration with vSphere, vSAN, and NSX. Chris talked about how vSAN adoption is growing, with more than 38% of the market now, and how Cloud Foundation is the right choice for building a consistent cloud experience. Through this platform, IT can more easily manage and deploy automation, governance, and security, all while controlling costs.
VMware demonstrated these capabilities through CloudHealth. Acquired by VMware, CloudHealth is a multi-cloud management platform that works across AWS, Microsoft Azure, and Google Cloud Platform, giving customers a way to manage cloud cost, usage, security, and performance from a single interface.
CloudHealth manages over 80 billion workloads through the platform today and is the leader in multi-cloud management. CloudHealth offers perspectives on optimization, covering right-sizing with cost controls, downsizing, and reserved instances, and customers can build policies to flag things like low EC2 utilization.
Chris talked about the traditional network challenges businesses face today and the need to bring automation and intrinsic security into the network fabric.
SD-WAN has seen strong momentum, with 2,000+ customers in more than 70 countries. Chris demoed deploying SD-WAN into remote locations; it was very easy to deploy and took only minutes to provision in front of a live audience.
Chris spoke about VMware Cloud (VMC) on AWS and the benefits of the platform. He covered use cases such as data center evacuation, disaster recovery, and applications integrating with AWS offerings like AWS Lambda, Amazon's event-driven, serverless computing platform that runs code in response to events and automatically manages the computing resources the code requires. VMC on AWS is now available in Singapore and Canada; you can see the full roadmap and further information on VMware's VMware Cloud on AWS site.
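For context, a Lambda function is just a handler that the service invokes per event; here is a minimal, hypothetical Python example (the response payload is made up for illustration):

```python
import json

def lambda_handler(event, context):
    # AWS invokes this entry point for each event; 'event' carries the
    # triggering payload and 'context' holds runtime metadata. Lambda
    # provisions and scales the underlying compute automatically.
    return {
        "statusCode": 200,
        "body": json.dumps({"received": sorted(event.keys())}),
    }
```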
The roster of VMC on AWS provider partners keeps growing, and it is important to note that these partners are VMware Cloud Verified: when you see the VMware Cloud Verified logo, you know you can access the full set of capabilities of VMware's cloud infrastructure and get the ultimate in cloud choice through flexible, interoperable infrastructure from the data center to the cloud.
Chris spoke about simplicity and choice for customers at the edge: AWS Greengrass, Azure IoT, data analytics, and hybrid applications all running on the consistent infrastructure of vSphere, including Azure IoT Edge on vSphere.
The day ended with a demonstration of edge devices: VMware showed vSphere running on a Mac Mini and ESXi running on an Intel Compute Stick, then performed a vMotion across WiFi, which was amazing to witness.
Overall, this was one of the better events I have attended from VMware. The breakout sessions, along with the general session, were very technical. I am excited for the next few days in Atlanta.