Vendors that leverage kURL for embedded cluster installations alongside their application will find that our default, recommended kURL specification has recently been updated to provide OpenEBS local-pv for local storage, and Rook for distributed storage. This is a purposeful move away from the prior storage default of Longhorn.
Vendors that have been with Replicated for two years or more may remember when we originally embraced Longhorn after experiencing issues with early versions of Rook. This blog shares our experience with both Rook and Longhorn and explains our decision to re-embrace Rook, alongside OpenEBS, as our preferred storage providers in our new recommended kURL specification.
The kURL installer originally provided Rook v1.0.x for storage, with the intention of allowing a seamless transition from one node to many without the Vendor having to decide anything in advance. The goal was to have the data follow the application as it was rescheduled across nodes, so that enterprise customers could expand their cluster as needed and services could be spread to the new hardware.
In practice, this generated an unexpected support burden for Vendors and Replicated because of the added complexity of running a distributed storage system on small, single-node clusters. Furthermore, later versions of Rook (v1.3.x and later) dropped support for folder-based storage due to poor reliability. Using Rook for single-node use cases would therefore still require dedicated block devices, which was not the original design intention for kURL.
In mid-2021, having identified that Rook was regularly showing up in a high volume of support cases, Replicated changed the default storage for kURL from Rook to Longhorn and encouraged vendors to migrate. Some of the reasons this decision was made included:
In hindsight, many of the escalated support issues associated with Rook v1.0 were not actually issues with Rook itself, but rather challenges such as network failures, resource contention, or a full hard drive, all of which can cause Rook to appear unhealthy. At the time, Replicated attributed these issues to Rook itself.
Given our hesitancy about the perceived challenges with Rook at that time, and the lack of a clear upstream upgrade path to newer Rook versions, Replicated decided to invest in migrating customers to Longhorn instead of investing in a custom-built Rook upgrade path.
At that time, Replicated moved our default kURL spec to replace Rook with Longhorn. New customers, as well as customers that chose to migrate, began to use Longhorn in production across a variety of cluster use cases and different enterprise environments.
After more than a year of running and supporting Longhorn in production with customers, we’ve found Longhorn itself to be the root cause of a significant number of escalations. Unlike our prior experience with Rook, where the issues were attributable to other environmental causes, deep analysis shows these issues are due to Longhorn itself. The most commonly seen errors are drive corruption, failure to mount, and unrecoverable nodes upon reboot.
When we have engaged the upstream Longhorn open-source community on these types of issues, the level of responsiveness has not given us confidence that these issues will be resolved in an acceptable timeframe for our Replicated customers.
Additionally, we’re seeing a lag in version compatibility with Kubernetes. As an example, Kubernetes v1.25 came out on August 23, 2022, with support added to kURL by September 23. Longhorn did not support Kubernetes v1.25 until Longhorn v1.4, which was released in early January 2023. For Replicated customers leveraging Longhorn, that creates a gap of several months during which they cannot upgrade to Kubernetes v1.25.
As we look back at the high volume of issues we’ve experienced among the relatively small percentage of our customer base using Longhorn to date, we no longer believe it is the right CSI for kURL going forward.
Meanwhile, newer versions of Rook, which is now up to v1.10, have proved to be significantly more stable when running in Replicated customer environments.
While Longhorn still allows you to use folders instead of block devices, the project recommends block devices for improved production stability. Both projects recognize that distributed storage is best run on block devices for both data stability and performance. Rook (Ceph) chose to remove the folder option back in v1.3 specifically because it was generating more failures and support issues than the upstream project deemed acceptable.
Our new default recommendation is OpenEBS local-pv for local storage, and Rook for distributed storage, with Rook requiring dedicated block devices.
We anticipate that we will eventually discontinue Replicated support for the Longhorn add-on in kURL, though we will continue to support any existing customer installation that is currently leveraging it.
If you are currently using Longhorn, we strongly encourage you to update your kURL spec and migrate to either OpenEBS local-pv or Rook upon your next upgrade. This guide can help you assess which option is better for your use case. If you’d like to explore this topic in more depth, we also highly recommend watching this Replicon talk about storage considerations from our CTO, Marc Campbell.
We offer a supported migration path to make this transition easy for you. Follow this guide for detailed instructions to complete the migration.