A few years after their introduction, hyperconverged (HCI) systems are now a reality, slowly but steadily eroding market share from traditional server/array platforms. They promised scalability, ease of deployment, and operational and architectural simplification. While they have mostly delivered on those promises, HCI systems introduced some new limitations and pain points. Probably the most relevant one stems from the defining trait of HCI architecture: multiple identical nodes, each providing compute, storage and data services, are pooled together. This creates an inter-dependency between them, as VM data must be available on different nodes at the same time to guarantee resiliency in case of failure. Consequently, HCI nodes are not stateless: inter-node, east-west communication is required to enforce data resiliency policies. Unfortunately, this statefulness has other consequences too: when a node is brought down, either by a fault or for a planned maintenance task, so is the storage that comes with it, and data must be rebuilt or relocated to ensure continued operations.
Yes, one could argue that in traditional SAN-backed architectures hosts are stateless from a storage perspective, because all the data resides on the array; but as the infrastructure grows and more hosts are connected, we eventually hit a performance wall. This is because the array controller becomes the I/O bottleneck, and adding more hosts only increases competition for the fixed, shared resources available.
Therefore, although not perfect, HCI systems still represent an appealing alternative to traditional non-converged architectures.
But what if we could have the benefits of both statelessness and I/O performance scalability in a simple Rackscale form factor?
Datrium is a relatively new company founded by key members of the original Data Domain team, whose mission is to bring to market a solution that delivers on the promises of HCI without its inherent compromises. Datrium’s solution, called DVX, is a “System for Rackscale Convergence” where I/O processing is split from durable data: this allows for node statelessness and predictable I/O performance scalability, characteristics that, as explained before, are usually mutually exclusive in current HCI systems.
Datrium DVX is made of two distinct hardware components, DVX Compute Nodes and DVX Data Nodes, glued together by Datrium’s own software layer. DVX Compute Nodes are Datrium-branded commodity x86 servers running VMware ESXi and equipped with flash drives – for a maximum raw capacity of approximately 16 TB – that act as a low-latency local read cache. The beauty of Datrium’s “Open Converged” approach is that the customer is not bound to deploy only DVX Compute Nodes: any server on the VMware HCL can be part of a DVX cluster, breaking the unwritten HCI rule that “all nodes in an HCI cluster must be equal”. DVX clusters start from one compute node and can scale up to 32. The other component of DVX is the Data Node: this is where all the durable data resides. It would not be correct to call the Data Node a storage array, as it does not come with a controller. As we will soon see, each host is a controller, while the Data Node is really no more than a disk shelf containing 12x 4TB slow rotational disks, 64 GB of RAM and some fast NVRAM to acknowledge writes sent by the hosts through 10 Gbit links. With the current release, only one DVX Data Node per Rackscale System is supported, but that should change soon to allow for scaling out to multiple Data Nodes.
What really makes a difference is Datrium’s software component living in ESXi’s userspace: Datrium’s HyperDriver enables the host to use its own CPU power to directly control I/Os to and from the Data Node and to drive advanced data services such as compression, deduplication, encryption, erasure coding, snapshots and replication. To achieve this, Datrium DVX nodes reserve a minimum of two CPU cores exclusively for data-related processing. If the host has at least 10 cores, the amount of CPU reserved for data processing can be increased up to 20% of the aggregate CPU capacity of the node in “Fast Mode” and up to 40% in “Insane Mode”. This, in conjunction with the low-latency local flash read cache, is what allows for uncompromised performance, when needed, where needed.
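The sizing rule described above boils down to simple arithmetic. The sketch below is a hypothetical illustration of it (the function and mode names are mine, not Datrium’s), assuming the two-core minimum and the 20%/40% caps for Fast and Insane modes on hosts with at least 10 cores.

```python
# Hypothetical sketch of the CPU reservation rule described above.
# Function and mode names are illustrative, not Datrium's actual API.

def reserved_cores(total_cores: int, mode: str = "default") -> int:
    """Return the number of cores set aside for data processing."""
    minimum = 2  # every DVX node reserves at least two cores
    if total_cores < 10 or mode == "default":
        return minimum
    caps = {"fast": 0.20, "insane": 0.40}
    # Reserve up to the mode's share of the node's cores, never below the minimum.
    return max(minimum, int(total_cores * caps[mode]))

print(reserved_cores(8))             # small host: the minimum applies -> 2
print(reserved_cores(20, "fast"))    # 20% of 20 cores -> 4
print(reserved_cores(20, "insane"))  # 40% of 20 cores -> 8
```

On a dual-socket, 20-core host, Insane Mode would thus dedicate 8 cores to I/O processing, which is exactly the kind of trade-off that makes sense only for I/O-hungry workloads.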
Imagine now that you have a VM with a very demanding I/O profile: you can turn on “Insane Mode” on the host, or that VM can be vMotioned to a node with more available CPU capacity to make use of that extra “muscle” to handle the special I/O demand. This also happens to be a superb use case to highlight the advantages of being able to mix different hardware in DVX clusters: one might start with a few standard DVX nodes and, only when the need to address special performance requirements arises, add any 3rd party servers with different, beefier specs. Consequently, DVX is much more flexible in the range of workloads it can support than current, far more rigid HCI systems.
It is also very clear now why with DVX each host acts as a storage controller: its CPUs provide the power for I/Os and data intelligence, while the entire data set is contained in the Data Node. There is no interaction between nodes, as there is no need for one node to access data located on any of its neighbors, and every node contributes its own CPU power to the cluster’s aggregate performance capabilities. Adding nodes simply scales up performance. In addition, bringing down one node is no longer a concern, because no data rebuild or migration will happen. All it will take is for the read caches to be re-warmed on the hosts where the VMs happen to run. To summarize: with Datrium DVX, all the benefits expected from an HCI system are delivered, with the bonus of node statelessness and predictable, scalable performance. I think this is where the value of Datrium’s proposition really shines.
It is worth spending a few more words on how all of this happens in practice.
The Datrium software running inside each VMware ESXi host exports an NFS interface to the node and presents a single NFS datastore to all nodes in the cluster. This datastore represents the contents of the Data Node. The NFS interface is actually terminated on the host itself, and communication between the Compute Nodes and the Data Node happens on the 10 Gbit wire through a proprietary Datrium protocol.
As soon as any data enters the system, the Datrium Hyperdriver activates advanced data management features such as deduplication/fingerprinting, compression, encryption, checksumming and erasure coding, while data is synchronously written to the NVRAM within the Data Node and then flushed to the spinning backing disks, with zero risk of data loss during writes. It is interesting to note that each host has full control of each of the spinning disks in the Data Node, so data chunks are flushed to the right disks based on the erasure coding decisions performed on the host! Thanks to Datrium’s algorithm, which leverages deduplication and compression to make the best possible use of host-local storage, in addition to global dedupe on the Data Node, the working set is expected to be serviced to the hypervisor from the local flash read cache.
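As a rough illustration of this inline pipeline, the toy sketch below chunks incoming data, fingerprints each chunk for deduplication, and compresses only the unique chunks before they would be flushed. It uses standard-library hashing and compression and is in no way Datrium’s actual implementation (which also applies encryption, checksumming and erasure coding along the way).

```python
# Toy sketch of an inline dedupe + compression write path, loosely modeled
# on the pipeline described above. Not Datrium's actual code.
import hashlib
import zlib

CHUNK = 4096  # illustrative fixed chunk size

def ingest(data: bytes, store: dict) -> list:
    """Split data into chunks; store each unique chunk compressed, keyed by
    its fingerprint. Returns the list of fingerprints (the chunk 'recipe')."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        fp = hashlib.sha256(chunk).hexdigest()  # content fingerprint
        if fp not in store:                     # dedupe: store each chunk once
            store[fp] = zlib.compress(chunk)    # compress before flushing
        recipe.append(fp)
    return recipe

store = {}
recipe = ingest(b"A" * 8192 + b"B" * 4096, store)
# Three logical chunks written, but only two unique chunks actually stored.
print(len(recipe), len(store))  # -> 3 2
```

The key idea the sketch captures is that fingerprinting happens on the host’s own CPU, before anything crosses the wire, which is what makes host-side data reduction pay off.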
One thing that is not completely clear to me, and worth investigating further, is how exactly the network connection between Compute and Data Nodes should be configured on the ESXi hosts. Should the 2x 10 Gbit uplinks to the Data Node be dedicated, or can they be shared with other traffic? Is there such a thing as a “Datrium vmkernel interface”? How far can we push “Insane Mode” on multi-socket, multi-core hosts before hitting the network limit and throttling I/Os?
It is also worth commenting on Datrium’s “Blanket Encryption” implementation, which ensures that data is encrypted the moment it enters DVX and remains so, at rest and in flight, with no need to decrypt and re-encrypt it every time it crosses a logical or physical boundary. This is achievable because Datrium has full control of, visibility into, and integration across all the elements of the stack. Although in theory this should incur a performance hit, in practice it does not, as Datrium takes advantage of the Intel AES-NI instructions available in modern CPUs to hardware-accelerate the AES-XTS-256 algorithm. Encryption keys are stored in the built-in key manager, but Datrium plans to support external KMSs to give customers choice.
Until now, we have mostly focused on primary storage features, but Datrium has secondary storage requirements covered too, by means of their Data Cloud solution, designed to be fully integrated with DVX. The idea is to have one single workflow to deploy, protect, clone and replicate VMs and VM-related objects. VMs (or objects, more generally) can be dynamically grouped into Protection Groups, with policies applied to them so that snapshots are taken consistently for all objects in the PG at the same time and on the desired schedule. Snapshots are stored in a so-called Snapstore – which is basically a VM backup catalog – backed by the same compression, deduplication and erasure coding features discussed earlier.
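Conceptually, a Protection Group is just a dynamic membership rule plus a snapshot policy. The sketch below is a hypothetical model of that idea (class and field names are mine), assuming membership is evaluated by naming convention, which, as noted later, is how PGs are currently populated.

```python
# Hypothetical model of a Protection Group: a naming-convention rule
# plus a snapshot schedule. Illustrative only, not Datrium's API.
import fnmatch

class ProtectionGroup:
    def __init__(self, name: str, pattern: str, snapshot_every_min: int):
        self.name = name
        self.pattern = pattern                  # naming-convention match rule
        self.snapshot_every_min = snapshot_every_min

    def members(self, vms: list) -> list:
        """Membership is dynamic: evaluated against the current VM inventory."""
        return [vm for vm in vms if fnmatch.fnmatch(vm, self.pattern)]

vms = ["prod-web-01", "prod-db-01", "test-web-01"]
pg = ProtectionGroup("prod", "prod-*", snapshot_every_min=60)
print(pg.members(vms))  # -> ['prod-web-01', 'prod-db-01']
```

The appeal of dynamic grouping is that a newly deployed VM matching the pattern is protected automatically; the obvious limitation, discussed later, is that a naming convention is the only membership rule available today.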
Snapstores can be federated across different DVXs so backups can be replicated between them. As of today, this seems to be the only way to export VM backups from the Data Node where the original VMs reside, and this is clearly a concern. It would be nice if Data Cloud played nicely with 3rd party backup solutions, whether purely software-based like Veeam or hardware-based like Rubrik or Cohesity. The lack, at the time of writing, of any public API implementation (or even a roadmap for one) is, honestly, disappointing. If Datrium wants to succeed, they have to open up their platform for interoperability with other solutions, and not only in the availability area.
In addition to backups, Data Cloud also manages replication across DVXs for DR purposes: although Datrium’s Elastic Replication incorporates most of the features typically available with array-based replication (such as replication topologies, data transfer reduction, transfer monitoring and throttling), the Datrium approach is different. Arrays replicate LUNs, without regard to the actual objects stored on those LUNs; it is the VMware/DR engineer’s role to ensure that VMs are placed on the correct LUNs and that replication consistency is guaranteed. Datrium instead replicates VMs and Protection Groups, and it does so because there is no LUN construct in DVX, while Data Cloud is fully aware of “what a VM is made of” in DVX. This allows for efficient data transfer over the wire; it is also worth noting that, just as compute nodes are responsible for I/Os, they are also in charge of replication, which consequently happens in multiple, concurrent streams. The same snapshot process described above is the foundation for replication: individual VMs or VMs in a Protection Group are (consistently) snapped and the snapshots are asynchronously replicated to the destination using the optimization techniques described above. Right now the minimum supported RPO is 30 minutes, which is not particularly exciting, but keeping in mind that Datrium is still a young technology, we can only expect this to improve.
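The data transfer reduction mentioned above follows naturally from global deduplication: since both sides speak in fingerprints, only chunks the destination has never seen need to cross the wire. The sketch below is a toy model of that exchange, with names of my own invention, not Datrium’s implementation.

```python
# Toy sketch of snapshot-based async replication with data reduction,
# loosely following the flow described above. Not Datrium's code.

def replicate(snapshot: dict, destination: set) -> int:
    """Ship only chunks the destination does not already hold (global
    dedupe makes transfers efficient). Returns the chunks shipped."""
    missing = [fp for fp in snapshot["chunks"] if fp not in destination]
    destination.update(missing)  # in reality: multiple concurrent streams
    return len(missing)

dest = {"aaa", "bbb"}            # fingerprints already at the DR site
snap = {"vm": "prod-db-01", "chunks": ["aaa", "bbb", "ccc", "ddd"]}
print(replicate(snap, dest))     # only 2 new chunks cross the wire -> 2
```

Because every snapshot after the first shares most of its chunks with its predecessor, steady-state replication traffic stays small even for large VMs.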
One very interesting development in roadmap is a virtual AWS appliance to allow replication to Amazon, although details are vague now and it is not clear whether this data will be accessible by VMware hosts in AWS or in some other way.
One very important detail about DVX that has been hinted at throughout this article is that DVX is a VMware-only solution: it has been designed from the very beginning to support VMware vSphere and it is natively (and nicely!) integrated with the VMware Web Client. This does not mean that other virtualization platforms or container engines will not be supported in the future, but it is just too early to say.
I would encourage you to head to the Tech Field Day website to watch the live demos (and the deep-dive technical presentations) to see for yourself how easy it is to work with DVX from within the Web Client.
Time for my verdict, then. I was impressed by Datrium’s proposal, but not (yet) entirely convinced. I think Datrium’s approach to addressing the current pain points of HCI solutions is a valid one and definitely different from any other player’s. As I mentioned in my previous article, Datrium takes well-known ingredients, but the recipe used to assemble them is unique. I would like to see how far Datrium can go and whether they can establish themselves as respected players in the HCI arena. I am still not completely convinced because it is clear that the product is young and must mature. The lack of public APIs is simply unacceptable today, and some features must be improved (using naming conventions as the only way to create PGs is not enough; an RPO of 30 minutes is too long). The news about ex-Nutanix veteran Andre Leibovici joining them is a clear indication that Datrium wants to play this game and has the commitment to succeed. I will stay tuned to see how Datrium develops, and I am grateful I had the opportunity to be one of the delegates invited to witness their first TFD presentation.
Way to go Datrium!
Disclaimer: I was invited to Tech Field Day 14 by Gestalt IT, who paid for travel, hotel, meals and transportation. I did not receive any compensation to attend TFD and I am under no obligation whatsoever to write any content related to TFD. The contents of these blog posts represent my personal opinions about the products and solutions presented during TFD14.