[Rosetta] Watch out for PRODUCTION_ABINITIO

DoubleTop · 19 February 2006 12:03

These units seem to have a high error rate of “Maximum WU time exceeded”. Be careful with these; it would appear I’ve lost around 90 hours of crunch time.

DT.

MAOJC · 19 February 2006 12:09

I had a couple that got “stuck” too; you need to look at them every once in a while.

DoubleTop · 19 February 2006 12:14

Have you found that you get two units in “progress” on some rigs? I’ve had a few rigs with two units in progress, even though they only have Rosetta attached. The problem is amplified on the diskless boxes: when this happens, the “leave applications in memory” setting means that I run out of RAM and the box sits idle.

DT.

drezha · 19 February 2006 12:25

Yeah…2 in progress because of the messed up and odd deadlines…I believe…

MAOJC · 19 February 2006 12:44

Hmm, I have not noticed that, but these only happened twice, and the boxes were dual cores that always have 2 in progress.

Nightlord · 19 February 2006 13:07

I’ve had to abort several of these units in the last day or so.

I’m getting cheesed off with the apparent lack of quality control at Rosetta (or at the very least the lack of clear communication about changes). The 7-day deadlines get mixed up with 2-week and 1-month deadlines, which only eats more memory as units get pre-empted. Plus there are units like these ABINITIO ones that run and get stuck, or time out after a huge number of hours.

The TruXoft client helps a bit, but BOINC adjusts the expected completion time as a function of recent runs. So when you complete a few of these big units, your expected completion time goes way up. Then TruXoft only alerts after many hours of wasted crunching.
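The effect described above can be illustrated with a rough sketch. It assumes the estimate is a simple exponential moving average of recent runtimes and that a watchdog alerts at a fixed multiple of the estimate; both are assumptions made for illustration, and neither BOINC's real correction logic nor TruXoft's alert rule is reproduced here. The function names and the numbers are hypothetical.

```python
def update_estimate(current_estimate_h, actual_runtime_h, weight=0.5):
    """Blend the previous estimate with the latest observed runtime.

    Assumption: a plain exponential moving average, not BOINC's real formula.
    """
    return (1 - weight) * current_estimate_h + weight * actual_runtime_h


def simulate(initial_estimate_h, runtimes_h, weight=0.5):
    """Return the estimate after each completed unit, starting value first."""
    estimates = [initial_estimate_h]
    for runtime in runtimes_h:
        estimates.append(update_estimate(estimates[-1], runtime, weight))
    return estimates


def alert_threshold_h(estimate_h, multiplier=3.0):
    """Hypothetical watchdog rule: alert once a unit runs this long."""
    return multiplier * estimate_h


# Start with a 2-hour estimate, then complete three 20-hour units.
estimates = simulate(2.0, [20.0, 20.0, 20.0])
thresholds = [alert_threshold_h(e) for e in estimates]
print(estimates)   # estimate after each unit
print(thresholds)  # hours a stuck unit could run before an alert
```

Under these assumptions the alert point moves from 6 hours to over 50 hours after only three long units, which is the "many hours wasted" effect.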

To cap it off, I’m still none the wiser about the new application and the 8-hour units, despite the thread in the Rosetta forum.

…stumbles off shaking head and muttering to himself :mad:

Michael_Roberts · 19 February 2006 14:05

Quote (DoubleTop): Have you found that you get two units in “progress” on some rigs?

I had that problem on Rhea (Win98) while I was trying to update my project mix and BOINC client. I have now found that SZTAKI, SETI and Rosetta work OK together, but I could not run LHC or Einstein. This has not happened on Dione (WinXP).

Quote (Nightlord): Boinc adjusts the expected completion time as a function of recent runs. So when you complete a few of these big units, your expected completion time goes way up high.

I have noticed that sometimes BOINC preloads lots and lots of workunits. I think this happens after the expected completion time has been reduced by some short units. It hasn’t caused me a problem so far, but I have a small cache, relying on multiple projects to keep the machines active. If you have a large cache, the machine could become overcommitted as a result of this behaviour if the new units have long completion times. The user’s ability to specify the desired completion time could help avoid this problem.
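A back-of-the-envelope sketch of that overcommitment risk, assuming the client fills the cache by dividing the cache size by the current per-unit estimate. That fetch rule, the function names, and the example numbers are assumptions for illustration only, not BOINC's actual work-fetch policy.

```python
def units_fetched(cache_hours, estimated_hours_per_unit):
    """Units requested to fill the cache, given the current (possibly too low) estimate."""
    return int(cache_hours // estimated_hours_per_unit)


def is_overcommitted(n_units, actual_hours_per_unit, deadline_hours, cpus=1):
    """True if the fetched work cannot all finish before the deadline."""
    return n_units * actual_hours_per_unit / cpus > deadline_hours


DEADLINE_H = 7 * 24  # a 7-day deadline

# Large cache (3 days): the estimate has dropped to 2 h, but the new units take 10 h each.
large = units_fetched(cache_hours=72, estimated_hours_per_unit=2)
print(large, is_overcommitted(large, 10, DEADLINE_H))

# Small cache (6 hours) with the same estimates.
small = units_fetched(cache_hours=6, estimated_hours_per_unit=2)
print(small, is_overcommitted(small, 10, DEADLINE_H))
```

With these made-up numbers the large cache pulls in 36 units (360 hours of real work against a 168-hour deadline), while the small cache pulls in 3 units and stays comfortably within the deadline.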

The_Balrog · 19 February 2006 14:17

Just checked my home machine. :eek: Nothing but ABINITIO. They seem to be running OK so far though. 14 ready to report, with around 20 in the queue :xfinger:

Peige · 19 February 2006 14:30

Haven’t been watching my nodes all week… finally got the time to take a look on Friday night and 8 of them were stuck. I went through and aborted the unit that was stuck on each node, and off it all went again.
No lockup or reboot necessary.

Tom_Wilson · 19 February 2006 18:14

I (think I) saw a strange thing… Rosetta and SETI were loaded on one of my computers, and BOINC reported that I was overcommitted, and as a result SETI wasn’t running.

I thought that I could tweak that by suspending Rosetta. As soon as I did, I got a computational error on one of the Rosetta jobs.

Has anyone else seen this?

–Tom

Nightlord · 19 February 2006 18:40

I haven’t seen this myself, but I gather that Rosetta does not like BOINC switching between applications unless the WU is left in memory.

If it is not already set, try setting the preference to leave applications in memory when pre-empted.

DoubleTop · 19 February 2006 19:10

I’ve got the pre-empted setting “set”.

After clearing the bad ones this morning, I return to have a peek and there we go: four more stuck, with a completely new segmentation error that dumps the BOINC process as well. More machines sat doing nothing. Sorry, too much maintenance on Rosetta for me. I’m an install-and-leave type of fella :mad:

DT.
(note to others: it is on Linux diskless nodes that I am seeing these errors)

DoubleTop · 19 February 2006 20:25

/me shuffles back, head in hands…

A bit more playing and it all seems to be down to the BBC project causing the segmentation fault and getting BOINC killed. Now investigating why one of the nodes has a much, MUCH lower RAC than the others, yet has been on the project for the same length of time.

DT.