Some Cisco UCS Servers May Lose Data If Power Is Cut Off
Cisco is warning customers that certain UCS servers were shipped with incorrectly configured hard drives that could result in a loss of data should the servers lose power.
Cisco, in a field notice updated earlier this week and first reported by The Register, also provided detailed instructions for customers and partners to reconfigure the hard drives to prevent that data loss from occurring.
The issue affects the Cisco UCS C220-M3, C220-M4L, C240-M3, C240-M4L, and UCSC-C3X60 servers.
The problem stems from five models of 7,200-RPM, large-form-factor SAS drives in 1-TB, 2-TB, and 4-TB capacities that were shipped in the Cisco UCS servers with their hard drive write cache enabled. Normally, they are shipped with that write cache disabled. "If drive write cache is enabled during a power loss it can result in loss of data," Cisco wrote in its field notice.
Cisco said in the field notice that the issue has been solved on its side, but that the affected drives will have to be reconfigured in the field.
"Cisco ships all of their hard drives from manufacturing with drive write cache disabled. During a quality audit, select units were found to have the drive write cache enabled. The issue has been remediated in the manufacturing process. Users of potentially affected devices are recommended to change the drive cache configuration," Cisco wrote. Cisco also included instructions on how to reconfigure the impacted hard drives to disable the write cache.
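Cisco's own procedure is in the field notice and is not reproduced here. As a rough illustration of what checking a drive's write cache setting can look like, the sketch below assumes a Linux host where the drives are visible to the operating system as SCSI disks, and reads the kernel's `cache_type` attribute under `/sys/class/scsi_disk`. Drives that sit behind a RAID controller may not appear there, so this is an assumption-laden example rather than a substitute for the vendor's steps. The function names are illustrative only.

```python
from pathlib import Path
from typing import List


def parse_cache_type(text: str) -> bool:
    """Return True if a sysfs cache_type value indicates write-back caching.

    The Linux kernel reports values such as "write through", "none",
    "write back", and "write back, no read (daft)".
    """
    return text.strip().lower().startswith("write back")


def find_write_back_disks(sysfs_root: str = "/sys/class/scsi_disk") -> List[str]:
    """List SCSI disk IDs (host:channel:target:lun) with write cache enabled.

    Assumes the disks are exposed directly to the OS; returns an empty
    list if the sysfs directory does not exist.
    """
    root = Path(sysfs_root)
    flagged: List[str] = []
    if not root.is_dir():
        return flagged
    for entry in sorted(root.iterdir()):
        try:
            text = (entry / "cache_type").read_text()
        except OSError:
            # Attribute missing or unreadable; skip this entry.
            continue
        if parse_cache_type(text):
            flagged.append(entry.name)
    return flagged


if __name__ == "__main__":
    for disk in find_write_back_disks():
        print(f"{disk}: write cache enabled")
```

A script like this only reports the setting. Changing it should follow the instructions in Cisco's field notice.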
The drives all have Cisco model numbers. However, in the workaround suggested by Cisco, there are a couple of references to Seagate. One reference, in test details, is to the Seagate ST300MM0006, a 300-GB SAS drive with a 10,000-RPM speed. Cisco also refers to Seagate in a CDETS (Cisco Defect and Enhancement Tracking System) note.
Neither Cisco nor Seagate has confirmed that Seagate made the drives that were improperly configured. Neither has yet commented on whether the improper configuration was caused by the drive manufacturer or Cisco.
A Seagate spokesperson told CRN via email that the company was looking into the issue.
"Cisco is committed to avoiding issues in our products and handling them professionally when they arise," Cisco wrote in an email to CRN. "Our top priority is making sure customers are aware of an issue when it exists and what they can do to mitigate its potential impact on their network. To ensure our products continue to operate as intended in customer environments, Cisco issues Field Notices to notify customers about potential issues that require an upgrade, workaround, or other action."
The Cisco UCS C220 and Cisco UCS C240 are very common platforms, and the drives are very common drives, said John Woodall, vice president of engineering at Integrated Archive Systems, a Palo Alto, Calif.-based solution provider and Cisco channel partner.
The drives referenced in the field notice are more focused on capacity than on performance, and the write cache might be turned on when used with applications where there are no concerns about losing data, Woodall told CRN. "But the write cache shouldn't be turned on at the factory," he said.
There is a huge installed base of those servers, so it's uncertain how much impact the improperly configured hard drives might have, Woodall said.
"It sounds like Cisco found a bug and took care of it," he said. "For someone impacted by it, they may not have a pleasant day. They may lose data and have to restore it. But things go bump in the night, and [stuff] happens. Vendors have issues like this all the time."
If this is indeed an issue that only recently happened and Cisco just caught it in an audit, the impact might be limited, but if this has been going on for some time, the impact could be much wider, Woodall said.
For customers, this is a wake-up call to the importance of ensuring their data is protected, Woodall said.
"Customers should have their data protected and ensure the level of risk from loss of data is acceptable," he said. "If not, this must be addressed by the customer. This doesn't let Cisco off the hook. If you find an issue, you test it and fix it. But if it happens again, it becomes a real issue."
For partners, it's important to proactively reach out to customers who have purchased servers that are potentially impacted by the issue, Woodall said. Most customers have competent technical personnel who can download the new code, test it, and deploy it, and can reach out to partners for help if needed, he said.
"Patches and patch management are common, and are typically handled by customers unless they are under a managed services contract," he said.