Root Cause Analysis: Connectivity Failures in Tencent Cloud

Symptoms of Network-Level Failure

  • Pods or CVM instances intermittently fail to reach peer services
  • Requests time out or result in DNS resolution errors
  • Cross-zone traffic is slow or gets silently dropped
  • Security group and route table changes do not take effect immediately

Architectural Complexity

Unlike simpler flat networks, Tencent Cloud's VPC model introduces multiple layers of abstraction: route tables, ACLs, and security groups must all align. Problems arise when any of the following are misconfigured:

  • Addressing: services reached through Elastic IPs (EIPs) where internal IPs were intended, or the reverse
  • Network ACLs that block traffic by protocol or port
  • CIDR blocks that overlap between subnets or between peered VPCs
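
Overlapping CIDR blocks are the easiest of these to check mechanically before a peering connection is created. The following is a minimal sketch using Python's standard ipaddress module; the function name and the sample CIDRs are illustrative and not taken from any Tencent Cloud tooling.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks that share at least one address."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


# Hypothetical CIDRs: two VPCs to be peered plus one extra subnet.
print(find_overlaps(["10.0.0.0/16", "10.0.128.0/20", "10.1.0.0/16"]))
# → [('10.0.0.0/16', '10.0.128.0/20')]
```

An empty result means no two blocks in the list collide, which is the precondition for peering them.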

Diagnostic Techniques

Use VPC Flow Logs

Enable VPC flow logging to inspect denied or dropped packets. These logs help identify security group misconfigurations or port-level blocks.

# VPC Console - Enable Flow Logs
Select VPC > Logs > Enable > Choose Destination (CLS or COS)
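
Once logs are flowing, the rejected records are usually the ones worth summarizing first. The sketch below assumes the records have already been exported and parsed into dictionaries, and that they carry fields named srcaddr, dstaddr, dstport, protocol, and action with a REJECT value for denied traffic. Those field names are assumptions; check them against the schema of your own log topic.

```python
from collections import Counter


def rejected_flows(records):
    """Count REJECT records per (srcaddr, dstaddr, dstport, protocol)."""
    counts = Counter()
    for rec in records:
        # Field names here are assumed, not confirmed against a real schema.
        if rec.get("action") == "REJECT":
            key = (rec["srcaddr"], rec["dstaddr"], rec["dstport"], rec["protocol"])
            counts[key] += 1
    # Most frequently rejected flows first.
    return counts.most_common()
```

A flow that appears near the top of this list with a port your service depends on points to a security group or ACL rule rather than a routing problem.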

Run Cross-Zone Connectivity Tests

Use Tencent's built-in diagnostics tool, or deploy a debug pod in each AZ to measure latency and packet drops. A busybox pod covers ping and basic TCP checks; for packet captures, use an image that ships tcpdump.

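# Start a throwaway debug pod (deleted when the shell exits)
kubectl run net-debug --image=busybox --restart=Never --rm -it -- sh
# Inside the pod: check ICMP reachability, then a TCP port
ping -c 4 10.x.x.x
telnet 10.x.x.x 8080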
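
When the same check has to run repeatedly or from several zones, a scripted probe is easier to compare than interactive output. This is a minimal sketch using only the standard library; tcp_probe is a hypothetical helper name, and it measures TCP connect time only, not application-level latency.

```python
import socket
import time


def tcp_probe(host, port, timeout=3.0):
    """Try one TCP connect; return (reachable, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers timeouts, refused connections, and unreachable networks.
        return False, time.monotonic() - start, str(exc)
```

A timeout suggests packets are being dropped silently (security group, ACL, or missing route), while an immediate "connection refused" usually means the path is open but nothing is listening on that port.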

Fixing Cross-Zone and Inter-VPC Failures

1. Review and Align Security Groups

Ensure that all CVM instances or TKE nodes involved use security groups that explicitly allow traffic on the required ports and protocols (e.g., TCP 443, UDP 53).

# Example Inbound Rule
Protocol: TCP
Port: 443
Source: 10.0.0.0/8 (or peer VPC CIDR)
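
To reason about whether a given flow would pass a rule set like the one above, it can help to model the match in code. The sketch below is a deliberate simplification: it handles allow rules with a single port only and ignores rule priority and explicit deny rules, both of which real security groups have. InboundRule and is_allowed are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    protocol: str  # "TCP", "UDP", or "ALL"
    port: int      # single destination port, for simplicity
    source: str    # CIDR block allowed to connect


def is_allowed(rules, protocol, port, source_ip):
    """Return True if any allow rule matches the flow."""
    addr = ipaddress.ip_address(source_ip)
    for rule in rules:
        proto_ok = rule.protocol in ("ALL", protocol)
        port_ok = rule.port == port
        if proto_ok and port_ok and addr in ipaddress.ip_network(rule.source):
            return True
    return False  # nothing matched: treat the flow as denied


rules = [InboundRule("TCP", 443, "10.0.0.0/8")]
print(is_allowed(rules, "TCP", 443, "10.1.2.3"))     # → True
print(is_allowed(rules, "TCP", 443, "192.168.1.5"))  # → False
```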

2. Check Route Table Consistency

Verify that each subnet's route table contains an entry for every destination it must reach, with the correct next hop: a peering connection for another VPC, or a NAT gateway for outbound internet traffic.

# Route table example
Destination: 10.1.0.0/16
Next Hop Type: Peering Connection
Next Hop: pcx-xxxxx
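
When several entries could match a destination, the most specific one wins. The sketch below illustrates that longest-prefix-match selection over a list of (destination, next hop) pairs. It is a model for reasoning about a route table, not a call into any Tencent Cloud API, and the next-hop identifiers are placeholders.

```python
import ipaddress


def select_route(routes, destination_ip):
    """Pick the most specific route whose destination contains the IP."""
    addr = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(dest), next_hop)
        for dest, next_hop in routes
        if addr in ipaddress.ip_network(dest)
    ]
    if not matches:
        return None  # no route: traffic to this IP would be dropped
    # Longest prefix (largest prefixlen) is the most specific match.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Placeholder next hops; real identifiers will differ.
routes = [
    ("10.0.0.0/16", "local"),
    ("10.1.0.0/16", "pcx-xxxxx"),
    ("0.0.0.0/0", "nat-xxxxx"),
]
print(select_route(routes, "10.1.4.7"))  # → pcx-xxxxx
```

Running a few known destinations through a model like this makes it obvious when traffic meant for a peer VPC is falling through to the default route instead.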

3. Validate DNS and Service Discovery

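TKE clusters typically rely on CoreDNS for in-cluster name resolution. Ensure that the Service definitions (ClusterIP or headless) exist in every cluster and namespace the caller expects, and reference them by fully qualified domain name (FQDN), for example <service>.<namespace>.svc.cluster.local, when calling across namespaces.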

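# /etc/resolv.conf in container (example values, pod in the "default" namespace)
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5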
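
The search list explains why a short name can resolve in one namespace and fail in another. The sketch below approximates how a resolver expands a name using the search domains and the ndots option; it is a simplified model of the usual resolver behavior, and candidate_names is a hypothetical helper.

```python
def candidate_names(name, search_domains, ndots=5):
    """List the names a resolver would try, in order, for one lookup."""
    if name.endswith("."):
        return [name]  # trailing dot: already fully qualified
    expanded = [f"{name}.{domain}." for domain in search_domains]
    as_is = [f"{name}."]
    # Names with fewer than `ndots` dots go through the search list first.
    if name.count(".") < ndots:
        return expanded + as_is
    return as_is + expanded


search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("orders.shop", search):
    print(candidate)
```

Here the lookup for orders.shop from a pod in the default namespace first tries orders.shop.default.svc.cluster.local. and only reaches the intended orders.shop.svc.cluster.local. on the second attempt, which is one reason FQDNs are the safer choice.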

4. Use Cloud NAT or VPN Gateways for Hybrid Connectivity

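For hybrid or on-premises integration, validate the VPN gateway's routing and ensure that the IPsec settings on the gateway match those on your on-premises firewall or VPN device. If private subnets need outbound internet access, add a default route (0.0.0.0/0) whose next hop is the NAT gateway.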
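
A tunnel that never comes up is often a parameter mismatch between the two ends. One way to catch that is to write both configurations down as key/value pairs and compare them. The sketch below does only that comparison; the parameter names (ike_version, encryption, dh_group) are illustrative and do not reflect the exact field names of any particular gateway.

```python
def ipsec_mismatches(cloud_side, on_prem_side):
    """Return {parameter: (cloud_value, on_prem_value)} for every difference."""
    keys = sorted(set(cloud_side) | set(on_prem_side))
    return {
        key: (cloud_side.get(key), on_prem_side.get(key))
        for key in keys
        if cloud_side.get(key) != on_prem_side.get(key)
    }


# Hypothetical settings copied by hand from each side.
cloud = {"ike_version": "v2", "encryption": "AES-256", "dh_group": "14"}
on_prem = {"ike_version": "v1", "encryption": "AES-256", "dh_group": "14"}
print(ipsec_mismatches(cloud, on_prem))  # → {'ike_version': ('v2', 'v1')}
```

A parameter present on only one side shows up with None for the other, which is worth treating as a mismatch too.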

Long-Term Best Practices

  • Centralize VPC configuration via Terraform or TCCLI (the Tencent Cloud command-line interface) to ensure consistency across environments
  • Use tagging and naming conventions to trace resource relationships quickly
  • Monitor with CLS logs and integrate with Tencent Cloud Monitor for alerts on VPC traffic anomalies
  • Perform quarterly audits of security groups and route tables
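
Part of a periodic audit can be automated once rules are exported in a structured form. The sketch below flags inbound rules that are open to the whole internet on ports other than an approved list. The dictionary layout (group, port, source) and the default approved ports are assumptions for illustration, not an export format defined by Tencent Cloud.

```python
def overly_broad_rules(rules, public_ports=(80, 443)):
    """Return rules open to any source on a port outside `public_ports`."""
    any_source = ("0.0.0.0/0", "::/0")
    return [
        rule
        for rule in rules
        if rule["source"] in any_source and rule["port"] not in public_ports
    ]


# Hypothetical exported rules.
exported = [
    {"group": "sg-web", "port": 443, "source": "0.0.0.0/0"},
    {"group": "sg-db", "port": 3306, "source": "0.0.0.0/0"},
    {"group": "sg-db", "port": 3306, "source": "10.0.0.0/8"},
]
for rule in overly_broad_rules(exported):
    print(rule["group"], rule["port"])  # → sg-db 3306
```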

Conclusion

Connectivity issues in Tencent Cloud are rarely caused by physical outages. More often they stem from layered misconfigurations in routing, security groups, or DNS. As architectures grow more complex, especially with container orchestration and multi-region deployments, the chance of a subtle misalignment increases. By enforcing configuration as code, validating routes, and using diagnostic tools such as flow logs and CoreDNS introspection, teams can build more resilient and transparent networking layers on Tencent Cloud.

FAQs

1. Why do my TKE pods resolve DNS but fail to connect?

DNS resolution may succeed via CoreDNS, but the destination IP might be blocked by a security group or misrouted. Use flow logs and traceroute tools to verify reachability.

2. How does VPC peering affect routing?

VPC peering enables inter-VPC traffic but does not automatically propagate routes. You must add explicit route entries in both VPCs and open traffic in security groups.
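
Because routes are needed in both directions, a one-sided configuration is a common cause of traffic that leaves but never returns. The sketch below checks that each VPC has a peering route covering the other's CIDR. The dictionary layout (name, cidr, peering_routes) is a hypothetical representation for illustration.

```python
import ipaddress


def has_route_to(routes, target_cidr):
    """True if any route destination fully covers the target CIDR."""
    target = ipaddress.ip_network(target_cidr)
    return any(target.subnet_of(ipaddress.ip_network(dest)) for dest in routes)


def missing_peering_routes(vpc_a, vpc_b):
    """Return (vpc_name, unreachable_cidr) for each missing direction."""
    missing = []
    if not has_route_to(vpc_a["peering_routes"], vpc_b["cidr"]):
        missing.append((vpc_a["name"], vpc_b["cidr"]))
    if not has_route_to(vpc_b["peering_routes"], vpc_a["cidr"]):
        missing.append((vpc_b["name"], vpc_a["cidr"]))
    return missing


# Hypothetical VPCs: vpc-b has no return route to vpc-a.
vpc_a = {"name": "vpc-a", "cidr": "10.0.0.0/16", "peering_routes": ["10.1.0.0/16"]}
vpc_b = {"name": "vpc-b", "cidr": "10.1.0.0/16", "peering_routes": []}
print(missing_peering_routes(vpc_a, vpc_b))  # → [('vpc-b', '10.0.0.0/16')]
```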

3. Can I use public IPs to connect services internally?

Yes, but this introduces latency and egress charges. Use internal IPs or VPC peering for service-to-service communication unless external access is required.

4. Why are security group changes not applied instantly?

Tencent Cloud may take a few seconds to propagate security group changes across AZs or CVM instances. Always validate via flow logs or direct tests after updates.
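
Rather than testing once and guessing, a post-change check can poll until the rule takes effect or a deadline passes. The sketch below is a generic polling helper; the timeout and interval defaults are arbitrary choices, and check can be any zero-argument callable that returns True on success, such as a TCP connect test.

```python
import time


def wait_until(check, timeout=30.0, interval=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `check` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # change did not take effect in time
        sleep(interval)
```

If the check still fails after a generous timeout, the cause is more likely a wrong rule than slow propagation. The sleep and clock parameters exist only so the helper can be exercised without real delays.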

5. How should I secure inter-region communication?

Use encrypted VPN tunnels or Tencent Cloud's Global Application Acceleration Platform (GAAP) with TLS. Avoid exposing services via public IP unless they are properly firewalled and monitored.