With the continuous development of Guazi’s business, the system scale is gradually expanding. Currently, hundreds of Dubbo applications and thousands of Dubbo instances are running on Guazi’s private cloud. Each department of Guazi has evolved rapidly on its own Dubbo version, and the versions were never aligned in time. With the construction of the second data center, the need for a unified Dubbo version has become increasingly urgent. A few months ago, a production incident related to Dubbo occurred, which became the catalyst for the company’s Dubbo version upgrade.
Next, I will start from this incident to discuss the journey we took to upgrade the Dubbo version and the subsequent multi-data center solutions for Dubbo.
In the production environment, various business lines within Guazi share a Zookeeper cluster as the registration center for Dubbo. In September 2019, a switch in the data center failed, causing a few minutes of network fluctuation in the Zookeeper cluster. After the Zookeeper cluster recovered, Dubbo providers should have quickly re-registered with Zookeeper under normal circumstances, but a small number of providers did not re-register for a long time, only recovering registration after manually restarting the application.
First, we analyzed the version distribution of the Dubbo services experiencing this phenomenon and found that the issue existed across most versions, with a relatively low occurrence rate. There were no related issues found in GitHub. Therefore, we deduced that this was an unaddressed problem that sporadically occurred under network fluctuation scenarios.
Next, we compared the application logs of the problematic apps, the Zookeeper logs, and the Dubbo code logic. In the application logs, after successfully reconnecting to Zookeeper, the provider immediately attempted to re-register, after which nothing further was logged. In the Zookeeper logs, after the registration node was deleted, the node was not recreated. In the Dubbo code, this scenario is only consistent with the doRegister(url) call inside FailbackRegistry.register(url) either succeeding or its thread being suspended.
public void register(URL url) {
    super.register(url);
    failedRegistered.remove(url);
    failedUnregistered.remove(url);
    try {
        // Sending a registration request to the server side
        doRegister(url);
    } catch (Exception e) {
        Throwable t = e;
        // If the startup detection is opened, the Exception is thrown directly.
        boolean check = getUrl().getParameter(Constants.CHECK_KEY, true)
                && url.getParameter(Constants.CHECK_KEY, true)
                && !Constants.CONSUMER_PROTOCOL.equals(url.getProtocol());
        boolean skipFailback = t instanceof SkipFailbackWrapperException;
        if (check || skipFailback) {
            if (skipFailback) {
                t = t.getCause();
            }
            throw new IllegalStateException("Failed to register " + url + " to registry " + getUrl().getAddress() + ", cause: " + t.getMessage(), t);
        } else {
            logger.error("Failed to register " + url + ", waiting for retry, cause: " + t.getMessage(), t);
        }
        // Record a failed registration request to a failed list, retry regularly
        failedRegistered.add(url);
    }
}
Before continuing the investigation, let’s clarify a few concepts: Dubbo uses Curator as the Zookeeper client by default, and Curator maintains its connection with Zookeeper through sessions. When Curator reconnects to Zookeeper, if the session has not expired, it continues using the original session; if the session has expired, it creates a new session to reconnect. Ephemeral nodes are bound to the session, and when the session expires, the ephemeral nodes under that session are deleted.
Continuing to investigate the doRegister(url) code, we found the following logic in the CuratorZookeeperClient.createEphemeral(path) method: it catches NodeExistsException, so when it attempts to create an ephemeral node that already exists, it considers the creation successful. This logic looks correct at first glance and behaves normally in two common scenarios: if the session has not expired, the original node still exists and does not need to be recreated; if the session has expired and Zookeeper has already deleted the original node, the creation succeeds.
public void createEphemeral(String path) {
    try {
        client.create().withMode(CreateMode.EPHEMERAL).forPath(path);
    } catch (NodeExistsException e) {
        // The node already exists: silently treated as a successful creation
    } catch (Exception e) {
        throw new IllegalStateException(e.getMessage(), e);
    }
}
However, there is also an extreme scenario where the expiration of Zookeeper’s session and the deletion of ephemeral nodes are not atomic, meaning that when the client receives the session expiration message, the ephemeral nodes corresponding to the session may not have been deleted by Zookeeper yet. At this point, when Dubbo attempts to create ephemeral nodes, it finds that the original nodes still exist, hence it does not recreate them. Once the ephemeral nodes are deleted by Zookeeper, it will lead to a situation where Dubbo assumes a successful re-registration occurred, while in reality, it did not, which is the problem we encountered in production.
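The race described above can be replayed with a small in-memory model. This is only a sketch: ToyZookeeper is a hypothetical stand-in written for this illustration, not the real Zookeeper or Curator API, and the path and session ids are made up.

```java
import java.util.HashMap;
import java.util.Map;

final class LostRegistrationDemo {

    /** Hypothetical in-memory stand-in for a Zookeeper server (not the real API). */
    static final class ToyZookeeper {
        /** Maps each ephemeral node path to the id of the session that owns it. */
        private final Map<String, Long> ephemeralOwners = new HashMap<>();

        /** Returns false where a real client would see NodeExistsException. */
        boolean createEphemeral(String path, long sessionId) {
            return ephemeralOwners.putIfAbsent(path, sessionId) == null;
        }

        /** Server-side cleanup, which may run after the client has learned of the expiry. */
        void deleteNodesOfSession(long sessionId) {
            ephemeralOwners.values().removeIf(owner -> owner == sessionId);
        }

        boolean exists(String path) {
            return ephemeralOwners.containsKey(path);
        }
    }

    /** Mirrors the old behaviour: an already existing node counts as success. */
    static void registerIgnoringExisting(ToyZookeeper zk, String path, long sessionId) {
        zk.createEphemeral(path, sessionId); // the "node exists" result is ignored
    }

    /** Replays the race and reports whether the provider node is still present. */
    static boolean providerNodeSurvivesRace() {
        ToyZookeeper zk = new ToyZookeeper();
        String path = "/dubbo/com.example.DemoService/providers/demo-provider";
        long oldSession = 1L;
        long newSession = 2L;

        registerIgnoringExisting(zk, path, oldSession); // initial registration
        // The old session expires; the client reconnects before the cleanup runs.
        registerIgnoringExisting(zk, path, newSession); // node still exists, nothing is created
        zk.deleteNodesOfSession(oldSession);            // delayed cleanup removes the node
        return zk.exists(path);
    }

    public static void main(String[] args) {
        System.out.println("provider node present after race: " + providerNodeSurvivesRace());
    }
}
```

Running the sketch prints "provider node present after race: false": the client believes it is registered, yet no node is left on the server.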
At this point, the root cause of the issue has been identified. After locating the problem, we communicated with the Dubbo community and discovered that colleagues from Koala had also encountered the same issue, which further confirmed this cause.
After pinpointing the issue, we began to attempt to reproduce it locally. Directly simulating the scenario of Zookeeper’s session expiring while ephemeral nodes are not deleted is relatively difficult, so we modified the Zookeeper source code to add a sleep period in the logic for session expiration and deletion of ephemeral nodes, indirectly simulating this extreme scenario and reproducing the issue locally.
During the investigation, we found that older versions of Kafka also encountered similar issues when using Zookeeper. We referred to Kafka’s fix for this problem to determine Dubbo’s repair plan. When capturing a NodeExistsException during the creation of ephemeral nodes, we decided to check whether the SessionId of the ephemeral node was different from the current client’s SessionId. If they differed, we would delete and recreate the ephemeral node. After internal fixes and verifications, we submitted issues and PRs to the community.
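The repair idea can be sketched against a toy in-memory model. This is not the code of the actual Dubbo PR: ToyZookeeper, its methods, and the session ids are hypothetical. In a real Zookeeper client, the owning session of an ephemeral node is exposed through Stat.getEphemeralOwner().

```java
import java.util.HashMap;
import java.util.Map;

final class SessionAwareRegistrationDemo {

    /** Hypothetical in-memory stand-in for a Zookeeper server (not the real API). */
    static final class ToyZookeeper {
        /** Maps each ephemeral node path to the id of the session that owns it. */
        private final Map<String, Long> ephemeralOwners = new HashMap<>();

        /** Returns false where a real client would see NodeExistsException. */
        boolean createEphemeral(String path, long sessionId) {
            return ephemeralOwners.putIfAbsent(path, sessionId) == null;
        }

        /** Returns the owning session id, or null if the node does not exist. */
        Long ownerOf(String path) {
            return ephemeralOwners.get(path);
        }

        void delete(String path) {
            ephemeralOwners.remove(path);
        }

        /** Server-side cleanup, which may run after the client has learned of the expiry. */
        void deleteNodesOfSession(long sessionId) {
            ephemeralOwners.values().removeIf(owner -> owner == sessionId);
        }
    }

    /** On "node exists", recreate the node unless the current session already owns it. */
    static void registerCheckingOwner(ToyZookeeper zk, String path, long sessionId) {
        if (zk.createEphemeral(path, sessionId)) {
            return; // created normally
        }
        Long owner = zk.ownerOf(path);
        if (owner == null || owner != sessionId) {
            // The node is missing or belongs to another (possibly expired) session.
            zk.delete(path);
            zk.createEphemeral(path, sessionId);
        }
    }

    /** Replays the same race as before, this time with the owner check in place. */
    static boolean providerNodeSurvivesRace() {
        ToyZookeeper zk = new ToyZookeeper();
        String path = "/dubbo/com.example.DemoService/providers/demo-provider";
        long oldSession = 1L;
        long newSession = 2L;

        registerCheckingOwner(zk, path, oldSession); // initial registration
        registerCheckingOwner(zk, path, newSession); // stale node is replaced
        zk.deleteNodesOfSession(oldSession);         // delayed cleanup finds nothing to remove
        Long owner = zk.ownerOf(path);
        return owner != null && owner == newSession;
    }

    public static void main(String[] args) {
        System.out.println("provider node present after race: " + providerNodeSurvivesRace());
    }
}
```

With the owner check, the node that remains belongs to the new session, so the delayed cleanup of the old session no longer removes it and the sketch prints "provider node present after race: true".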
Similar Kafka issue: https://issues.apache.org/jira/browse/KAFKA-1387
Dubbo registration recovery issue: https://github.com/apache/dubbo/issues/5125
The fix plan mentioned above has been determined, but obviously, we cannot fix every version of Dubbo. After consulting the community’s recommended versions, we decided to develop an internal version based on Dubbo 2.7.3 to fix this issue and took this opportunity to start promoting a unified upgrade of Dubbo versions across the company.
The internally developed version of Dubbo, based on community Dubbo 2.7.3, is a transitional version aimed at fixing the failure of online providers to recover registration, as well as some compatibility issues in community Dubbo 2.7.3. Ultimately, Guazi’s Dubbo will need to follow the community versions rather than develop its own internal features. Therefore, all issues fixed in the internal version of Dubbo are kept synchronized with the community to ensure compatibility when upgrading to higher community versions later.
After consulting community colleagues about version upgrade experiences, we began the upgrade work for the Dubbo version in late September.
Overall, the process of promoting the upgrade to Dubbo 2.7.3 went relatively smoothly, although we encountered some compatibility issues:
Permission Denied When Creating Zookeeper Node
The Dubbo configuration file already has the Zookeeper username and password configured, but an exception KeeperErrorCode = NoAuth is thrown when creating Zookeeper nodes. This situation corresponds to two compatibility issues:
In both cases, the NoAuth problem will occur.
We fixed this in the internal version by referencing the community PR.
Curator Version Compatibility Issue
<dependency>
    <groupId>org.apache.curator</groupId>
    <artifactId>curator-framework</artifactId>
    <version>4.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.curator</groupId>
    <artifactId>curator-recipes</artifactId>
    <version>4.2.0</version>
</dependency>
OpenFeign and Dubbo Compatibility Issues
Issue: https://github.com/apache/dubbo/issues/3990
The Dubbo ServiceBean listens to the Spring ContextRefreshedEvent to expose services. OpenFeign triggers the ContextRefreshedEvent prematurely, before ServiceBean has finished initializing, which causes an exception at application startup. We fixed this issue in the internal version based on the community PR.
RpcException Compatibility Issue
Consumers on lower Dubbo versions cannot recognize the org.apache.dubbo.rpc.RpcException thrown by Dubbo 2.7 providers. Therefore, until all consumers are upgraded to 2.7, it is advised not to change the provider’s com.alibaba.dubbo.rpc.RpcException to org.apache.dubbo.rpc.RpcException.
QoS Port Occupation
By default, Dubbo 2.7.3 enables the QoS feature, which leads to port conflicts when Dubbo services deployed on physical machines are upgraded. Disabling QoS resolves this.
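One way to switch QoS off is the dubbo.application.qos.enable property, which Dubbo 2.7 documents as configurable through JVM system properties among other sources. The sketch below sets it programmatically before any Dubbo startup code runs. The class name is hypothetical, and passing -Ddubbo.application.qos.enable=false on the command line should have the same effect.

```java
final class DisableQosExample {

    /** Property key documented by Dubbo for switching the QoS server on or off. */
    static final String QOS_ENABLE_KEY = "dubbo.application.qos.enable";

    /** Intended to be called before Dubbo starts exporting services. */
    static void disableQos() {
        System.setProperty(QOS_ENABLE_KEY, "false");
    }

    public static void main(String[] args) {
        disableQos();
        // Hypothetical: start the Spring context or Dubbo bootstrap after this point.
        System.out.println(QOS_ENABLE_KEY + "=" + System.getProperty(QOS_ENABLE_KEY));
    }
}
```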
Custom Extension Compatibility Issues
Since business lines have relatively few custom Dubbo extensions, there haven’t been many difficult compatibility issues; most are related to package changes that business lines can fix themselves.
Skywalking Agent Compatibility Issues
We typically use Skywalking for tracing, but Skywalking agent 6.0 does not support Dubbo 2.7. Hence, we standardized on upgrading the Skywalking agent to 6.1.
Guazi is currently working on constructing a second data center, and multi-data center support for Dubbo is a significant topic during this building process. With the unification of Dubbo versions, we can more smoothly carry out research and development related to multi-data centers.
We consulted the Dubbo community for recommendations and, considering Guazi’s cloud platform status, initially determined a multi-data center plan for Dubbo:
The implementation of Dubbo’s same data center call priority is relatively simple, as follows:
Based on the logic above, we implemented a basic routing feature in Dubbo through environment variables and submitted a PR to the community.
Dubbo routing via environment variables PR: https://github.com/apache/dubbo/pull/5348
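As a rough illustration of same data center call priority, the sketch below filters a provider list by a data center label read from an environment variable. It is not the code of the PR: the Provider class, the DATA_CENTER variable name, and the fallback to all providers when no local provider exists are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SameDataCenterRoutingDemo {

    /** Hypothetical provider record: an address plus the data center it runs in. */
    static final class Provider {
        final String address;
        final String dataCenter;

        Provider(String address, String dataCenter) {
            this.address = address;
            this.dataCenter = dataCenter;
        }
    }

    /**
     * Prefers providers in the consumer's data center. Falls back to the full
     * list when the consumer has no label or no local provider is available
     * (the fallback is an assumption of this sketch).
     */
    static List<Provider> route(List<Provider> providers, String consumerDataCenter) {
        if (consumerDataCenter == null || consumerDataCenter.isEmpty()) {
            return providers;
        }
        List<Provider> local = new ArrayList<>();
        for (Provider provider : providers) {
            if (consumerDataCenter.equals(provider.dataCenter)) {
                local.add(provider);
            }
        }
        return local.isEmpty() ? providers : local;
    }

    public static void main(String[] args) {
        // DATA_CENTER is a hypothetical environment variable name.
        String consumerDataCenter = System.getenv("DATA_CENTER");
        List<Provider> providers = Arrays.asList(
                new Provider("10.0.1.1:20880", "dc1"),
                new Provider("10.0.2.1:20880", "dc2"));
        for (Provider provider : route(providers, consumerDataCenter)) {
            System.out.println(provider.address + " (" + provider.dataCenter + ")");
        }
    }
}
```

Falling back to the full provider list keeps calls working when a data center has no healthy provider for a service, at the cost of cross data center latency.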