RAID Advice

Posted by Andy Neil 
RAID Advice
April 01, 2010 12:36PM
Hey Everyone,

Last year (around Sept), I set up a RAID 5 for someone. I used an ATTO ExpressSAS RAID card and a 4-drive chassis with four 2TB Hitachi 7200 RPM drives, RAIDed together for about 5.6TB.
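For anyone checking the capacity arithmetic, here is a minimal sketch. It assumes a standard RAID 5 layout (one drive's worth of space goes to parity) and that "2TB" on the label means 2 × 10^12 bytes. The function name is made up for illustration and is not part of any ATTO tool.

```python
def raid5_usable_bytes(drive_count: int, drive_bytes: int) -> int:
    """RAID 5 spends one drive's worth of space on parity: usable = (n - 1) drives."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drive_count - 1) * drive_bytes


TB = 10**12   # decimal terabyte, as printed on drive labels
TIB = 2**40   # binary tebibyte, which some tools display as "TB"

usable = raid5_usable_bytes(4, 2 * TB)
print(f"{usable / TB:.2f} TB (decimal)")
print(f"{usable / TIB:.2f} TiB (binary)")
```

Under those assumptions, four 2TB drives give 6 decimal TB of usable space, which shows up as roughly 5.46 in binary units. The exact figure a config tool or the OS reports depends on its units and on formatting overhead.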

Worked great at first; seemed very solid. Then in December, the RAID unmounted and wouldn't re-mount. Using the ATTO RAID config software I was able to determine that there didn't seem to be any real problem with the drives. They showed up in the config tool, but the group wouldn't mount. After talking with tech support, I was able to force-mount the drives, copy the media over to a backup drive and rebuild the group.

But then only a couple of months later, the RAID spontaneously unmounted again. Again I got into it with ATTO tech support. They confirmed via logs that there's nothing really wrong with the drives.

According to them, the problem lies with the nature of using desktop drives in a RAID configuration. Apparently, when a desktop drive comes across a sector it has trouble reading, it'll make a "heroic" attempt to re-read the sector, tying up access to the drive until it recovers the information. This can take as much as a couple of minutes depending on the severity, but most times is just seconds.

Their RAID controller, however, has an extremely low tolerance for a drive that doesn't respond to it, and after about 14 seconds, will unmount the volume, causing the group to go offline.

This doesn't happen with Enterprise drives because, apparently, their recovery tactics for hard-to-read sectors are different and they are never out of contact with the controller for more than 7-10 seconds.
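The mismatch described above can be put in a few lines of Python. This is only a toy model built from the figures quoted from tech support (about 14 seconds of controller patience, 7-10 seconds worst case for enterprise drives, up to a couple of minutes for a desktop drive). The names and the scenario list are illustrative assumptions, not ATTO behaviour taken from any documentation.

```python
CONTROLLER_LIMIT_S = 14.0  # approximate figure quoted above from tech support


def drive_is_dropped(recovery_seconds: float,
                     limit_seconds: float = CONTROLLER_LIMIT_S) -> bool:
    """Return True if the drive stays silent longer than the controller will wait."""
    return recovery_seconds > limit_seconds


# Hypothetical recovery times, chosen to match the ranges mentioned in the post.
scenarios = {
    "enterprise drive, worst case (10 s)": 10.0,
    "desktop drive, typical re-read (3 s)": 3.0,
    "desktop drive, 'heroic' re-read (120 s)": 120.0,
}

for label, seconds in scenarios.items():
    verdict = "group goes offline" if drive_is_dropped(seconds) else "stays online"
    print(f"{label}: {verdict}")
```

The point of the sketch is that a desktop drive is fine most of the time. It only takes one long recovery attempt past the controller's limit to knock the whole group offline, which would fit the intermittent, months-apart failures.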

But there are a LOT of RAID configs on the market that are using desktop drives in place of Enterprise drives, seemingly without the troubles I'm having.

My question (finally) is: Is there a RAID controller out there that I should replace my ExpressSAS with that is more forgiving to desktop drives? Those of you with RAIDs, what configs are working for you?

When I initially put this RAID together, I had intended to get a CalDigit HDOne or HDPro2 RAID, but they were back-ordered and there was a time element in setting up the system. It looks like CalDigit RAIDs use desktop drives, so I feel like there is a solution out there that won't require scrapping the entire set-up.

Thoughts?

Andy
Re: RAID Advice
April 01, 2010 01:06PM
Are you using an internal or external RAID setup?

Shane had his popsicle RAID running on Hitachis.

I've been working largely off Seagate Barracuda drives in external RAID configurations, and I haven't had any issues with them, aside from a FireWire chip issue, but I got that replaced. Many of us use desktop-class drives in our RAIDs, mainly because enterprise-level drives are so expensive. You may have to swap the RAID controller.

Someone like Jon Shilling may be able to share more about it.



www.strypesinpost.com
Re: RAID Advice
April 01, 2010 01:30PM
Yeah, I don't think buying Enterprise level drives is possible due to their expense. I'm using an external RAID setup.

So is your RAID a firewire RAID? I don't think I could build anything slower than a SATA RAID because this was created for a RED film. We're offlining it in 1080p ProRes, but I need it to be able to handle 2K files in the coloring stage.

I'm not familiar with the term popsicle RAID. I was going to ask what it was, but decided to google it. ROFL! That was awesome.

I'm waiting to hear back from ATTO regarding some setting changes to the card that MAY help (adjusting the amount of time the controller will retry an unresponsive drive before unmounting). But if their solution doesn't work, I need to know what I can go to after. I'd like to stay SAS if possible, but if all the RAID SAS controllers are as finicky as this one, I might need to change to an eSATA RAID. A friend of mine uses a RocketRAID and says he has no problems either.

Andy
Re: RAID Advice
April 01, 2010 01:37PM
That RAID served me well. But I am VERY glad I didn't rely on it for long. I have since taken the better route of spending the extra cash to get a more secure RAID: a CalDigit HDOne. Although now I need it to be larger than the 2TB I have.

But building that Popsicle Raid was fun...


www.shanerosseditor.com

Listen to THE EDIT BAY Podcast on iTunes
[itunes.apple.com]
Re: RAID Advice
April 01, 2010 01:40PM
Popsicle RAID:



Nobody knows how important it is to choose the right popsicle sticks. You need to make sure you get the right tension on the stick when you bite into the ice cream. Too loose and the RAID will fall apart, literally. So here's Shane testing the hardware:



[lfhd.net]



www.strypesinpost.com
Re: RAID Advice
April 01, 2010 01:50PM
>So is your RAID a firewire RAID?

The one I'm on at home is a FireWire RAID 1. I was working off a FW800 RAID 5 on a show, which got whittled down to USB RAID 5 when the FireWire chip short-circuited, but that was sorted out in the end.

I'm not sure if FW800 will be able to handle ProRes HQ at 2K, but I think it actually can. Ideally you'll be on eSATA or Fibre. I've had Seagates on FC RAID 5, as well as Hitachis on another FC RAID. No issues on them.



www.strypesinpost.com
Re: RAID Advice
April 01, 2010 04:00PM
I've been testing a HighPoint 3522 and a 4322 alongside an Areca 1680x. They all work well (if you get the correct firmware) with all of the drives I've tested from all manufacturers (never Maxtor), both desktop and enterprise class. I've not tested SSDs though.

The RAID 5 I have with Samsung Spinpoint F1 drives on the 3522 is about 2 years old and rock solid, except for the odd directory fix with DiskWarrior when I've accidentally turned off the RAID without first unmounting it and using the RAID management software to remove it.

My suggestion would be to run DiskTester from Lloyd Chambers [macperformanceguide.com] on each of the disks and thoroughly test them all. It's likely one of them has a fault causing the whole RAID to fail.

It's also worth checking all the HDDs have the same firmware installed - check which is the latest from Hitachi.

If the RAID worked well and the problem is intermittent, maybe it's a setting in the management software?

However, if you want to avoid the hassle of testing and testing, get the new HDPro2: at 800MBps over 8 HDDs it's pretty damn good, with less to worry about.



For instant answers to more than one hundred common FCP questions, check out the LAFCPUG FAQ Wiki here : [www.lafcpug.org]
Re: RAID Advice
April 01, 2010 04:23PM
UPDATE from ATTO:

Tech support suggests increasing the command timeout time for the controller to accommodate the desktop drives better. The logs suggest that nothing is wrong with the drives themselves (though I think I'll double check that with your suggestion Ben), but the controller is too sensitive to the call/response times of desktop drives versus enterprise drives.

They've even said that they're working on a firmware update that will "better accommodate a wider range of drive classes and compensate better for command timeouts and retries."

Thanks Ben for the controller info. I'll keep them in mind if these suggestions from ATTO don't pan out.

Andy
Re: RAID Advice
April 01, 2010 04:27PM
So Ben, DiskTester can be used to diagnose individual drives in a RAID with breaking the RAID?
Re: RAID Advice
April 01, 2010 04:34PM
I think eating all those popsicles is what made me fat. Finally lost most of that popsicle weight.


www.shanerosseditor.com

Listen to THE EDIT BAY Podcast on iTunes
[itunes.apple.com]
Re: RAID Advice
April 01, 2010 04:39PM
Was it Wall's? Lol. Someday I'll make my own popsicle RAID running off a Firewire chip.



www.strypesinpost.com
Re: RAID Advice
April 01, 2010 10:19PM
Quote

So Ben, DiskTester can be used to diagnose individual drives in a RAID with breaking the RAID?

I don't think so..

I test all the drives for my RAIDs as JBODs first to check each individual HDD, or I do it in an external case via eSATA or FW800, or occasionally internally in the Mac, especially if I need to update firmware using something like FreeDOS on a CD-ROM.

Once that is done I RAID them as a single volume, usually RAID 5 or 6, but if it's a critical project I might opt for RAID 10.
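To make the RAID 5 / 6 / 10 trade-off concrete, here is a rough sketch comparing usable capacity with the number of drive failures each level is guaranteed to survive. The 8 × 2TB array is a hypothetical example, and the function is an illustration of the textbook formulas rather than anything a particular controller reports.

```python
def raid_summary(level: str, drives: int, drive_tb: float) -> tuple[float, int]:
    """Return (usable TB, drive failures the array is guaranteed to survive)."""
    if level == "RAID 0":
        return drives * drive_tb, 0            # striping only, no redundancy
    if level == "RAID 5":
        return (drives - 1) * drive_tb, 1      # one drive's worth of parity
    if level == "RAID 6":
        return (drives - 2) * drive_tb, 2      # two drives' worth of parity
    if level == "RAID 10":
        if drives % 2:
            raise ValueError("RAID 10 needs an even number of drives")
        # Striped mirror pairs: only one failure is guaranteed survivable,
        # more are survivable only if they land in different pairs.
        return drives / 2 * drive_tb, 1
    raise ValueError(f"unsupported level: {level}")


for level in ("RAID 5", "RAID 6", "RAID 10"):
    usable, tolerated = raid_summary(level, drives=8, drive_tb=2.0)
    print(f"{level}: {usable:.0f} TB usable, survives at least {tolerated} failure(s)")
```

RAID 10 gives up the most capacity, but a rebuild is a straight copy from the surviving mirror rather than a parity recalculation across every drive, which is one common reason to prefer it for critical work.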

I've written an article for the SuperMag about BYO RAID, but it's not really for people who already build their own. It will hopefully be out in time for the NAB SuperMeet.



For instant answers to more than one hundred common FCP questions, check out the LAFCPUG FAQ Wiki here : [www.lafcpug.org]
Re: RAID Advice
April 02, 2010 01:40AM
Sorry for the typo, I meant "diagnose individual drives in a RAID WITHOUT breaking the RAID." Their website is not clear about this, so I will leave a question about it directly with Lloyd.
Re: RAID Advice
April 02, 2010 03:00AM
I think not, but let me know what Lloyd says as that would be useful. However, due to the way many hardware RAIDs are formed, it is highly unlikely.

You would need to break the RAID to test individual disks properly.

It was more aimed at Andy who would need to make a new RAID or reformat his old one anyway.



For instant answers to more than one hundred common FCP questions, check out the LAFCPUG FAQ Wiki here : [www.lafcpug.org]
Re: RAID Advice
April 05, 2010 11:42AM
Like Ben, I have been running an 8-drive RAID (as a RAID 5) for a couple of years and it has been rock solid. I use the HighPoint 3522 RAID controller card. You have to get a couple of miniSAS cables and a box that has a SAS controller in it. The protocol that this HighPoint card uses is about as fast as it gets short of going to a fiber link solution. The price jump to fiber is about double the cost (or more), last time I checked.
Re: RAID Advice
April 05, 2010 11:58AM
Quote

You have to get a couple of miniSAS cables and a box that has a SAS controller in it.

The ATTO R380 is an SAS controller. I've already got all the components necessary and when the RAID is up and running, I've been very happy with its speed and performance.

I'm glad to hear from another proponent of the HighPoint 3522 card. I believe I looked into that card when doing my initial due diligence. If the fixes don't work with the ATTO, I may purchase the HighPoint and swap it out.

Thanks,

Andy
Re: RAID Advice
October 17, 2010 12:50AM
Hi Andy! I dunno if you solved this, as I was just having the same "command timeout" issue from the R380 for the past 2 weeks now, with enterprise-level Hitachi Ultrastar A7K2000 drives (HUA722020ALA330) and the ATTO ExpressSAS R380 at the Sept 2010 firmware and driver. (Driving me mad.) My config may be similar to yours.

These "Command Timeouts" appear only on WRITE activity, followed by ATTO retries, then failure... on random RBAs, btw (so I can rule out a bad patch or area on the disk(s)).

Some info that might help diagnose it a bit.

My Config
  • ATTO ExpressSAS R380 at the September 2010 firmware
  • PROAVIO EB8MS enclosure (dumb): 8 bay with 2 x SFF-8088 miniSAS female connectors
  • 8 x Hitachi Ultrastar A7K2000 enterprise (HUA722020ALA330) 2TB 7200RPM (no encrypt) DDMs, new in the past 3 weeks
  • 2 x SFF-8088 interface cables (2m) from the R380 HBA to the PROAVIO
  • configured as 2 x file systems at RAID 0 (raid zero) of 8TB each (raidset01 = 4 x DDM [4 x 2TB] and raidset02 = 4 x DDM [4 x 2TB]). Note there are two (2) logical volumes of 8TB each. I did this instead of 8 x 2TB as a single file system of 16TB. (The reason is that I don't need redundancy but do need transmission speed and capacity, as I use LTO4 tape drives, and I want the separation of 2 file systems for workflows.)
  • Mac Pro 2009 model 2.93 (8 cores, 16 vcores, 12GB, blahdy blah). The ATTO ExpressSAS R380 is in slot 2 (and an H380 in slot 3 for the tape drives)
  • Spotlight disabled for those file systems
  • the two file systems driven by the R380 and the 8 x DDMs in the PROAVIO EB8MS enclosure were formatted by Disk Utility.app as Mac OS Extended only (no journaling)

History:
  • 2 weeks back I replaced 8 x perfectly good WD RE3 WDFBYS1002 1TB 7200 enterprise-level DDMs (2008 models), because I was running out of capacity in the PROAVIO EB8MS enclosure due to some work that's going on, with ...
  • elcheapo WD20EARS 2TB 5400 Caviar "Green" DDMs. Cheap because they cost me only $HK1000 (?100) each new at that time. Cheap because they looked OK and I could wear the slower transfer rate (475MB/sec with AJA System Test when the disks were empty), AND the price/TB was very cheap.
  • sadly this was a tragic HUGE HUGE HUGE mistake! (see later)
  • have since sold the 8 x WD20EARS greenie disks to some mates with a NAS box and a cheap STARDOM Tank Taiwanese enclosure, and recovered my costs (phew).
  • WDC lost me as a customer of 5 years and I jumped ship last week to Hitachi. The COW forum guys love those Hitachi Ultrastar A7K2000s, as did many at IBC, including the lads from JMR and the other zillion disk storage vendors etc.
  • recently replaced the 8 x WD20EARS with 8 x Hitachi Ultrastar A7K2000s (at $HK2200 (?220) each!), the same price as the equivalent WD RE4s etc.

Issues with WD20EARS greenie disks (this was enlightening!)
  • ATTO R380 rejection with a failed "COMMAND TIMEOUT" on one, some or all of the 8 x WD20EARS! The R380 would take them offline.
  • extremely unstable when moving large objects (600GB-3TB) TO (write action) any of the file systems backed by the WD20EARS DDMs (I had two raidsets defined as above).
  • the ATTO R380 would fail any and all of the 4 (or 8) WD20EARS disks when a large object was WRITTEN (disk write block). I assumed some or all were faulty and had them replaced with new WD20EARS disks the next day: same issue!
  • As usual I have everything on a bunch of HP LTO4 Ultrium data tapes with BRU-PE, so no headaches about losing content (yay LTO4 Ultrium TAPE!)
  • ... looked into it, and here's what I found.

The Findings (easy to find via Google)
  1. the newer elcheapo SATA2 disks that the H/W disk makers and their marketeers do not rate for ENTERPRISE use are relegated to consumer DESKTOP duty, where they work OK in NAS, USB/FireWire etc. enclosures or smart enclosures. The reason, as stated earlier, is that the elcheapo disks go hunting for a NEW SECTOR (sectors, for the RAID card) and will try to write and read back in situ loads of times (verify) before that block (or the segments of the stripe in a RAID set) is successfully written.
  2. from above (as noted earlier in this thread) this can take ages, in some cases 10's of minutes when the spindle (or spindles) is getting full, and thus a RAID HBA such as our R380 will time out the WRITE I/O and then retry it for what looks like 5 times (you can see all this in /var/log/attotech; set logging to ALL messages in the ATTO Configurator.app), then
  3. after our R380 has had 5 goes at it, it decides that the particular spindle is faulty, issues a message in the log to tell you it thinks it is faulty, fails the raidset and tells the O/S through status that it is faulty, and the O/S whinges that the volume is damaged.
  4. btw, on a single desktop system, especially some low-end systems, the expected response time on the I/Os is buried in the usage of the app. Therefore (IMO) many people put this elongated time down to the app rather than the crummy disk service time (well, in most cases) :)
  5. from the above, using RECOVERY in the ATTO Configurator.app will bring it back online again, BTW
  6. and the exact same thing happens in JBOD mode with a single spindle!
  7. the WD20EARS disks were incidentally OK after I got a good batch replaced...
  8. it seems that there are at least two major differences between what the marketing storage dudes call low-end DESKTOP disks and ENTERPRISE ones. DESKTOP flavours (a) park their heads after 8 seconds of inactivity and (b), worse, do not notify the host (or HBA) that they are spending ages looking for a place to write the data (hunting for a good sector on the spindles where the RAID is striped, or on the single spindle in the case of a JBOD).
  9. in the case of Western Digital ENTERPRISE disks they call it TLER (Time-Limited Error Recovery). This is where a disk will give up searching for a new sector and REPORT back in a very short time (300ms??) so the card/host can retry the operation and not fail the disk. Other disk vendors have something similar under other names.
  10. all these issues and their current status are well documented on the net if you look hard enough.
  11. WD have a Mr Softee Winduzz DOS tool to disable the parking and also to enable and disable TLER on the ENTERPRISE disks (WD REx's and higher)
  12. lastly, WD have apparently not included TLER on their newer elcheapo DESKTOP disks like they used to. For these elcheapo GREEN and BLACK Caviar disks made after October 2009, it is written that WD does not permit you to enable or disable TLER... so I'm hosed, I thought...

As fate would have it, I bumped into the ATTO guys at IBC. Had a chat and asked the lads on the booth in Hall 7 if they had a magic CLI config command to alter the command timeout on the R380 so it would wait around much longer for the interrupts for an I/O... and yes, there is. You need to contact them to get it, as they will have to expunge me from the earth if I say what it is (although the SONNET manual has this command documented; you can look in that PDF online).

I believe the default for the R380 was 5000 ms (5 secs). You can use this magic CLI command in the configcli to increase it to as much as 10 minutes!
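A back-of-envelope way to see why raising the timeout helps is sketched below. It is a simplified model with loud assumptions: the 5000 ms default and the roughly 5 attempts are the figures reported in this post (not documented values), and the model assumes the drive keeps working on the same stall across retries, so the controller's total patience is simply timeout × attempts. The function names are hypothetical.

```python
import math

DEFAULT_TIMEOUT_MS = 5000  # default as recalled in the post, not a documented value
ATTEMPTS = 5               # "what looks like 5 times" in the logs, also an assumption


def survives(stall_ms: int, timeout_ms: int, attempts: int = ATTEMPTS) -> bool:
    """True if the drive answers before the controller runs out of attempts.

    Assumes retries do not restart the drive's own recovery, so the
    controller's total patience is timeout_ms * attempts.
    """
    return stall_ms <= timeout_ms * attempts


def min_timeout_ms(stall_ms: int, attempts: int = ATTEMPTS) -> int:
    """Smallest per-command timeout that rides out a stall of stall_ms."""
    return math.ceil(stall_ms / attempts)


print(survives(20_000, DEFAULT_TIMEOUT_MS))    # 20 s stall against a 25 s budget
print(survives(120_000, DEFAULT_TIMEOUT_MS))   # 2 min stall against a 25 s budget
print(min_timeout_ms(120_000))                 # per-command timeout needed for 2 min
```

In this model a longer timeout only hides the stall from the controller. The application still waits the full time for its write to complete, which may be why the longer timeout gave only limited relief.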

Anyway, I had some limited success with this for a few days, but in the end the constant offlining of the raid set was too much to bear and I swapped out the drives for Hitachis (2.5 times the cost of the WD cheapies), as above.

FIXED?
  • not sure if fixed, because I had a similar issue with the Hitachi Ultrastars today on my production OSX system! Yikes!
  • I reinstalled the R380 flash and the OSX driver at the Sept 2010 level, expecting this to fix it.
  • also reset the R380 back to defaults (I had also messed with the R380 NVRAM settings, as you do when desperate!)
  • INTERESTINGLY: today I restarted on my maintenance OSX 10.6.5 system and am running some tests, just Finder stuff, and
  • lo and behold, I have not encountered any errors with the R380! Yay!
  • methinks, as you readers of this long post may concur, that some nasty component on my production Mac OS X system is to blame... maybe! (TTP5 etc)

Summary:

  • use Console.app (OSX) or look in /var/log/attotech for a load of really good info. Make sure you enable logging of warning and info messages, not just 'critical'.
  • also look in the kernel logs on OSX; there is some info in there when this occurs. Sorry, dunno about Mr Softee systems; there might be some stuff in that ATTO log.
  • don't use elcheapo disks on the R380. I know that some other lesser SAS HBA RAID cards such as ARECA and HIGHPOINT are less strict about this, and cheaper too!
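If you would rather not eyeball the logs by hand, a small script can pull out the relevant lines. This is a sketch under assumptions: it takes the /var/log/attotech directory mentioned above and simply searches for the phrase "command timeout" (case-insensitive). The actual wording and layout of ATTO's log entries is not documented in this thread, so adjust the pattern to whatever your logs really say.

```python
import re
from pathlib import Path

# Assumed phrase, based on the "Command Timeout" messages described in the post.
PATTERN = re.compile(r"command\s+timeout", re.IGNORECASE)


def find_timeouts(log_dir: str) -> list[tuple[str, int, str]]:
    """Return (file name, line number, text) for each line mentioning a command timeout."""
    hits = []
    for path in sorted(Path(log_dir).glob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if PATTERN.search(line):
                    hits.append((path.name, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    for name, number, text in find_timeouts("/var/log/attotech"):
        print(f"{name}:{number}: {text}")
```

Running it after each failure and comparing which drives are named should show whether the timeouts follow one spindle or move around, which is the distinction that matters when deciding between a bad disk and a controller or drive-class problem.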

HTH someone.

Will report back

(Hmm... still working ok on dev system.... ... )