Context: Solaris in Enterprise Systems
Where Complexity Meets Legacy
Solaris environments often run critical workloads on aging SPARC hardware or in hybrid x86 virtualized configurations. These setups frequently blend old and new components—ZFS over UFS compatibility layers, global zones interfacing with branded zones, and outdated patches—all of which increase operational complexity and obscure root cause identification.
Common Troubleshooting Scenarios
- High CPU usage with low actual workload
- Intermittent hangs in applications using ZFS-backed NFS shares
- Kernel panics related to ipfilter, TCP retransmits, or page scanner activity
Root Cause Analysis: Solaris Kernel Performance Bottlenecks
Diagnosing System Hangs and Slowness
One of the most difficult issues to troubleshoot in Solaris is a system that appears alive but is unresponsive. Tools like mpstat, vmstat, and prstat may show idle CPUs, yet commands stall or time out. Often, this is due to ZFS lock contention, ARC cache saturation, or paging bottlenecks.
vmstat 1 5
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 ...
 0 1 0 250000 5000 300 ... 0 0 0 0 0 ...
# High sr (scan rate) or blocked threads (b) indicate memory pressure
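To make the sr and b check repeatable, the vmstat output can be captured and scanned by a small script. The following is a minimal sketch, not a definitive tool: the sample capture, the function names, and the thresholds (`sr_limit`, `b_limit`) are all assumptions chosen for illustration, and sensible limits depend on the workload and memory size.

```python
# Hypothetical vmstat capture; the values are illustrative, not real measurements.
SAMPLE_VMSTAT = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 250000 90000   3  10  0  0  0  0   0  0  0  0  0  400  900  300  2  1 97
 0 2 0 250000  5000 300 800 40 60 90  0 950  5  0  0  0  700 1500  600  5 20 75
"""


def parse_vmstat(text):
    """Return one dict per data row with the b, free, and sr columns."""
    header = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The column-name row starts with "r" and contains "sr".
        if fields[0] == "r" and "sr" in fields:
            header = fields
            continue
        # Skip the group-title row and anything that is not a full data row.
        if header is None or len(fields) != len(header) or not fields[0].isdigit():
            continue
        samples.append(
            {name: int(fields[header.index(name)]) for name in ("b", "free", "sr")}
        )
    return samples


def flag_memory_pressure(samples, sr_limit=200, b_limit=1):
    """Return samples whose scan rate or blocked-thread count meets a limit.

    Both limits are assumed values for illustration only.
    """
    return [s for s in samples if s["sr"] >= sr_limit or s["b"] >= b_limit]


if __name__ == "__main__":
    for sample in flag_memory_pressure(parse_vmstat(SAMPLE_VMSTAT)):
        print("possible memory pressure:", sample)
```

Looking columns up by name from the header row, rather than by fixed position, keeps the sketch usable when the number of disk columns differs between hosts.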
ARC Cache and ZFS Lock Contention
ZFS's ARC can grow aggressively, starving kernel memory and triggering the page scanner. Lock contention in ZFS metadata operations can block user threads, especially in NFS-heavy or multi-zone environments.
echo "::memstat" | mdb -k
# Look for high ARC usage vs kernel heap size
Architectural Pitfalls in Solaris Deployments
Nested Zones and Resource Starvation
A global zone hosting multiple non-global zones may suffer resource contention if rcapd or FSS is not properly tuned. These conflicts are magnified when zones use ZFS datasets with heavy snapshot or dedup operations.
Networking Layer Conflicts
Legacy ipfilter rules or poorly chosen TCP tunables can cause retransmissions and degraded throughput. This often results in kernel-level TCP congestion that appears as application timeouts.
netstat -sP tcp
# Look for a high ratio of tcpRetransSegs to tcpOutDataSegs
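The ratio itself is simple arithmetic on two counters. Because these counters are cumulative, the difference between two samples is usually more informative than a single reading. A minimal sketch follows, assuming the usual `name = value` layout of the netstat output; the sample text and function names are hypothetical.

```python
import re

# Hypothetical excerpt of "netstat -sP tcp" output; the values are illustrative.
SAMPLE_NETSTAT = """\
TCP     tcpRtoAlgorithm     =     4     tcpRtoMin           =   400
        tcpOutDataSegs      =200000     tcpOutDataBytes     =91000000
        tcpRetransSegs      =  5000     tcpRetransBytes     =2400000
"""


def tcp_counters(text):
    """Collect every 'name = value' pair into a dict of integers."""
    return {
        name: int(value)
        for name, value in re.findall(r"(\w+)\s*=\s*(-?\d+)", text)
    }


def retrans_ratio(current, previous=None):
    """Return retransmitted segments divided by data segments sent.

    If an earlier sample is given, the ratio covers only the interval
    between the two samples.
    """
    previous = previous or {}
    sent = current.get("tcpOutDataSegs", 0) - previous.get("tcpOutDataSegs", 0)
    retrans = current.get("tcpRetransSegs", 0) - previous.get("tcpRetransSegs", 0)
    if sent <= 0:
        return 0.0
    return retrans / sent


if __name__ == "__main__":
    ratio = retrans_ratio(tcp_counters(SAMPLE_NETSTAT))
    print(f"retransmission ratio: {ratio:.2%}")
```

What counts as a high ratio depends on the network; the sketch only computes the figure and leaves the threshold to the administrator.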
Step-by-Step Troubleshooting and Fixes
1. Validate ZFS ARC Behavior
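echo "::arc" | mdb -k
# Review the current ARC size against its configured maximum
# Append the cap to /etc/system (never overwrite the file); it takes effect after a reboot
echo "set zfs:zfs_arc_max=1073741824" >> /etc/system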
Cap ARC to prevent it from consuming all kernel memory.
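The cap value is normally derived from installed RAM rather than picked by hand. A minimal sketch of that arithmetic follows; the 25 percent share and the 1 GiB floor are assumptions for illustration, not a vendor recommendation, and the function names are hypothetical.

```python
GIB = 1 << 30


def arc_max_bytes(total_ram_bytes, percent=25, floor_bytes=GIB):
    """Return an ARC cap as a share of RAM, never below a floor.

    The default share and floor are assumed values; choose them based on
    how much memory applications and the kernel need on the host.
    """
    return max(total_ram_bytes * percent // 100, floor_bytes)


def etc_system_line(arc_bytes):
    """Format the /etc/system entry that sets zfs_arc_max."""
    return f"set zfs:zfs_arc_max={arc_bytes}"


if __name__ == "__main__":
    # A hypothetical host with 4 GiB of RAM.
    print(etc_system_line(arc_max_bytes(4 * GIB)))
```

Integer arithmetic is used so the result is an exact byte count that can be pasted into /etc/system without rounding surprises.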
2. Analyze Page Scanner and Memory Pressure
vmstat 1 10
prstat -Z
An excessive scan rate (sr) and blocked threads (b) suggest the need for zone-level memory capping or swap tuning.
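To decide which zone is a candidate for a memory cap, the per-zone summary that prstat -Z prints can be ranked by memory share. The sketch below assumes the summary uses a `ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE` header; the sample text, the function names, and the 30 percent threshold are hypothetical.

```python
# Hypothetical zone summary from "prstat -Z"; the values are illustrative.
SAMPLE_PRSTAT_Z = """\
ZONEID    NPROC  SWAP   RSS MEMORY      TIME  CPU ZONE
     0       60  900M  700M   4.3%   1:20:11 1.2% global
     3       45 6100M 5800M    36%   9:02:47  22% appzone1
Total: 105 processes, 640 lwps, load averages: 1.10, 0.95, 0.90
"""


def parse_zone_summary(text):
    """Return zone name, memory share, and CPU share for each summary row."""
    zones = []
    in_summary = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ZONEID":
            in_summary = True
            continue
        # Data rows have eight columns and start with a numeric zone ID.
        if in_summary and len(fields) == 8 and fields[0].isdigit():
            zones.append(
                {
                    "zone": fields[7],
                    "memory_pct": float(fields[4].rstrip("%")),
                    "cpu_pct": float(fields[6].rstrip("%")),
                }
            )
    return zones


def zones_over_memory(zones, limit_pct=30.0):
    """Return names of zones whose memory share exceeds the assumed limit."""
    return [z["zone"] for z in zones if z["memory_pct"] > limit_pct]


if __name__ == "__main__":
    print(zones_over_memory(parse_zone_summary(SAMPLE_PRSTAT_Z)))
```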
3. Optimize Networking Stack
ndd /dev/tcp tcp_smallest_anon_port
ndd /dev/tcp tcp_conn_req_max_q
ndd /dev/tcp tcp_xmit_hiwat
The commands above only read the current values. Adjust TCP tuning parameters (for example with ndd -set) to reduce retransmission delays and the impact of packet loss.
4. Review rcapd and FSS Settings
svcs -l rcap
prctl -n zone.max-swap -i zone appzone1
Ensure that memory caps are applied per-zone and that FSS does not starve critical processes.
5. Patch and Firmware Consistency
showrev -p
uname -a
psrinfo -v
Ensure you're running the latest SRU (Support Repository Update) for Oracle Solaris 11 or recommended patch bundle for Solaris 10.
Best Practices for Long-Term Solaris Stability
- Cap ZFS ARC to an appropriate value based on system RAM
- Avoid dedup and frequent snapshots in production zones
- Enable predictive self-healing (FMA) and automate event monitoring with fmdump
- Use FSS over TS for better CPU sharing across zones
- Maintain regular SRU patching and firmware consistency
Conclusion
Solaris's performance challenges often stem from its powerful but complex subsystems—ZFS, zones, and the network stack. While these features offer scalability and reliability, they also require careful tuning and architectural discipline. By combining thorough diagnostics with kernel-aware best practices, administrators can ensure Solaris remains performant and resilient under enterprise demands.
FAQs
1. How do I determine if ZFS ARC is causing memory issues?
Use mdb -k with ::arc to inspect ARC usage. If arc_size approaches system memory limits, it may cause paging or lock contention.
2. Can nested zones impact system-wide performance?
Yes, especially if rcapd or FSS is misconfigured. One misbehaving zone can starve the rest of the system of memory or CPU.
3. Why do ZFS snapshots slow down my applications?
Frequent snapshots increase ZFS metadata, which can cause locking and ARC overhead, especially during deletion or replication operations.
4. What is the best way to monitor TCP retransmissions?
netstat -sP tcp provides detailed counters. A high tcpRetransSegs count relative to tcpOutDataSegs indicates packet loss or congestion that may require stack tuning.
5. Should I use deduplication in production Solaris systems?
Generally no. ZFS deduplication is memory intensive and can degrade performance unless extremely well-provisioned and justified.