Folding question

Looking for some help here…

Several of my Linux boxes completed their first runs overnight, but all of them appear to have reported the same errors:

Firstly, they each report being unable to connect to the server twice. They are on the same network as the Windows boxes, with 24/7 internet access, and have experienced no problems before.

Secondly, when they succeed on the third attempt, they all report that the core used is not what the server required for that unit. This seems odd to me because, at installation, I simply installed and ran the Linux FAH 5.04 Console, which downloaded all the necessary files. I’ve checked the machine IDs for the cores on those boxes and they are different for each of the two cores.

Comparing the logs between my Windows boxes and these Linux machines also leads me to think the units have been rejected, because the logs do not state that the stats for those boxes have been incremented by 1.

Have I got something wrong? :confused:

I encountered something similar to this, and it meant I had to compile the latest beta cores so that the servers “took” the units. There are occasions where units are not assigned to the correct cores; normally this is spotted pretty quickly. Post the contents of a couple of the log files and I’ll Sherlock Holmes it for you :slight_smile:

DT.

Sorry for the double post, I’ve been reading up :slight_smile:

What could be happening is that the original assignment server is not responding (see here), and the client is then doing its level best to get more work by trying other servers. Not all servers take all projects; some are platform-specific and therefore assignment-server-specific. I believe what should happen is that once the “proper” server is back online, the one you sent the results to will transfer them to the correct server, and you will be assigned the points on the next run.

I have no way of verifying this other than guesswork based on reading up on the wiki.
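To illustrate what I mean, here’s a toy Python sketch of that retry-then-fall-back idea. It’s pure guesswork rather than the real client code; the server names, the two-attempt threshold and the messages are all made up:

[code]
# Toy sketch of the fallback behaviour described above -- NOT the real FAH client.
WORK_SERVER = "work-server.example"              # hypothetical work server
COLLECTION_SERVER = "collection-server.example"  # hypothetical collection server
MAX_WORK_SERVER_FAILURES = 2                     # assumed: give up on the work server after two tries

def try_send(server, unit):
    """Stand-in for the real HTTP upload; here only the collection server 'answers'."""
    return server == COLLECTION_SERVER

def upload_result(unit):
    failures = 0
    while failures < MAX_WORK_SERVER_FAILURES:
        if try_send(WORK_SERVER, unit):
            return "sent to work server"
        failures += 1
        print(f"- Could not connect to Work Server, keeping {unit} in queue")
    # After repeated failures, hand the result to the collection server, which
    # (as I understand it) forwards it on to the proper server once that is back up.
    if try_send(COLLECTION_SERVER, unit):
        return "sent to collection server (credit may be delayed)"
    return "still queued, will retry later"

print(upload_result("unit 01"))
[/code]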

Attach the FAHlog.txt files for the machines in question and, to assist, I’ll sort out posting on the FAH forum for you if you like, or you can visit and register yourself :slight_smile:

DT.

I know that server 171.65.103.68 is giving trouble; I’ve got a 200-pointer hanging, waiting to report. See here if anything fits.

Yup, that makes sense. Just noticed that all of the boxes with this error ran units with FahCore_a0.exe, and that one of those machines then downloaded, ran, completed and has just successfully reported a unit with FahCore_79.exe on the same CPU.

[edit]So, looks like a problem with 171.65.103.68[/edit]

Anyway, extract from one of the logs in case there is something else wrong…

[19:46:40] - Ask before connecting: No
[19:46:40] - User name: Nightlordy (Team 315)
[19:46:40] - User ID not found locally
[19:46:40] + Requesting User ID from server
[19:46:41] - Machine ID: 1
[19:46:41]
[19:46:41] Work directory not found. Creating…
[19:46:41] Could not open work queue, generating new queue…
[19:46:41] + Benchmarking …
[19:46:50] - Preparing to get new work unit…
[19:46:50] + Attempting to get work packet
[19:46:50] - Connecting to assignment server
[19:46:50] - Successful: assigned to (171.65.103.68).
[19:46:50] + News From Folding@Home: Welcome to Folding@Home
[19:46:50] Loaded queue successfully.
[19:46:59] + Closed connections
[19:46:59]
[19:46:59] + Processing work unit
[19:46:59] Core required: FahCore_a0.exe
[19:46:59] Core not found.
[19:46:59] - Core is not present or corrupted.
[19:46:59] - Attempting to download new core…
[19:46:59] + Downloading new core: FahCore_a0.exe
[19:47:00] + 10240 bytes downloaded
[19:47:00] + 20480 bytes downloaded

…[big snip]

[19:47:08] Verifying core Core_a0.fah…
[19:47:08] Signature is VALID
[19:47:08]
[19:47:08] Trying to unzip core FahCore_a0.exe
[19:47:09] Decompressed FahCore_a0.exe (3165800 bytes) successfully
[19:47:09] + Core successfully engaged
[19:47:14]
[19:47:14] + Processing work unit
[19:47:14] Core required: FahCore_a0.exe
[19:47:14] Core found.
[19:47:14] Working on Unit 01 [May 25 19:47:14]
[19:47:14] + Working …
[19:47:14]
[19:47:14] ------------------------------
[19:47:14] Folding@Home Gromacs 3.3 Core
[19:47:14] Version 1.71 (February 3, 2006)
[19:47:14]
[19:47:14] Preparing to commence simulation
[19:47:14] - Looking at optimizations…
[19:47:14] - Created dyn
[19:47:14] - Files status OK
[19:47:14] - Expanded 878431 -> 3988659 (decompressed 454.0 percent)
[19:47:14] - Starting from initial work packet
[19:47:14]
[19:47:14] Project: 1489 (Run 0, Clone 224, Gen 8)
[19:47:14]
[19:47:14] Assembly optimizations on if available.
[19:47:14] Entering M.D.
[19:47:20] Protein: p1489_Protein_POPC
[19:47:20] Writing local files
[19:47:21] Extra 3DNow boost OK.
[19:47:21] Extra SSE boost OK.
[19:47:21] Writing local files
[19:47:21] Completed 0 out of 1000000 steps (0%)

[03:29:10] Completed 1000000 out of 1000000 steps (100%)
[03:29:10] Writing final coordinates.
[03:29:10] Past main M.D. loop
[03:30:11]
[03:30:11] Finished Work Unit:
[03:30:11] - Reading up to 886416 from “work/wudata_01.arc”: Read 886416
[03:30:11] - Reading up to 28736 from “work/wudata_01.xtc”: Read 28736
[03:30:11] goefile size: 0
[03:30:11] logfile size: 29480
[03:30:11] Leaving Run
[03:30:11] - Writing 960076 bytes of core data to disk…
[03:30:11] … Done.
[03:30:12] - Shutting down core
[03:30:12]
[03:30:12] Folding@home Core Shutdown: FINISHED_UNIT
[03:30:12] CoreStatus = 64 (100)
[03:30:12] Sending work to server

[03:30:12] + Attempting to send results
[03:30:13] - Couldn’t send HTTP request to server
[03:30:13] + Could not connect to Work Server (results)
[03:30:13] (171.65.103.68:8080)
[03:30:13] - Error: Could not transmit unit 01 (completed May 27) to work server.
[03:30:13] Keeping unit 01 in queue.

[03:30:13] + Attempting to send results
[03:30:13] - Couldn’t send HTTP request to server
[03:30:13] + Could not connect to Work Server (results)
[03:30:13] (171.65.103.68:8080)
[03:30:13] - Error: Could not transmit unit 01 (completed May 27) to work server.

[03:30:13] + Attempting to send results
[03:30:45] - Core type used on unit not what server demands.
[03:30:45] Successfully sent unit 01 to Collection server.
[03:30:45] - Preparing to get new work unit…
[03:30:45] + Attempting to get work packet
[03:30:45] - Connecting to assignment server
[03:30:46] - Successful: assigned to (171.64.122.136).
[03:30:46] + News From Folding@Home: Welcome to Folding@Home
[03:30:46] Loaded queue successfully.
[03:30:47] + Could not connect to Work Server
[03:30:47] - Error: Attempt #1 to get work failed, and no other work to do.
Waiting before retry.
[03:30:59] + Attempting to get work packet
[03:30:59] - Connecting to assignment server
[03:31:00] - Successful: assigned to (171.64.122.136).
[03:31:00] + News From Folding@Home: Welcome to Folding@Home
[03:31:00] Loaded queue successfully.
[03:31:01] + Could not connect to Work Server
[03:31:01] - Error: Attempt #2 to get work failed, and no other work to do.
Waiting before retry.
[03:31:19] + Attempting to get work packet
[03:31:19] - Connecting to assignment server
[03:31:20] - Successful: assigned to (171.64.122.136).
[03:31:20] + News From Folding@Home: Welcome to Folding@Home
[03:31:21] Loaded queue successfully.
[03:31:23] + Closed connections
[03:31:23]
[03:31:23] + Processing work unit
[03:31:23] Core required: FahCore_7a.exe
[03:31:23] Core not found.
[03:31:23] - Core is not present or corrupted.
[03:31:23] - Attempting to download new core…
[03:31:23] + Downloading new core: FahCore_7a.exe
[03:31:24] + 10240 bytes downloaded
[03:31:25] + 20480 bytes downloaded

[03:31:33] + 1219181 bytes downloaded
[03:31:33] Verifying core Core_7a.fah…
[03:31:33] Signature is VALID
[03:31:33]
[03:31:33] Trying to unzip core FahCore_7a.exe
[03:31:33] Decompressed FahCore_7a.exe (3102232 bytes) successfully
[03:31:33] + Core successfully engaged
[03:31:39]
[03:31:39] + Processing work unit
[03:31:39] Core required: FahCore_7a.exe
[03:31:39] Core found.
[03:31:39] Working on Unit 02 [May 27 03:31:39]
[03:31:39] + Working …
[03:31:39]
[03:31:39] ------------------------------
[03:31:39] Folding@Home GB Gromacs Core
[03:31:39] Version 1.90 (March 8, 2006)
[03:31:39]
[03:31:39] Preparing to commence simulation
[03:31:39] - Looking at optimizations…
[03:31:39] - Created dyn
[03:31:39] - Files status OK
[03:31:39] - Expanded 132426 -> 684763 (decompressed 517.0 percent)
[03:31:39] - Starting from initial work packet
[03:31:39]
[03:31:39] Project: 2099 (Run 73, Clone 4, Gen 22)
[03:31:39]
[03:31:39] Assembly optimizations on if available.
[03:31:39] Entering M.D.
[03:31:45] Protein: p2099_abeta_4mer_agbnp
[03:31:45]
[03:31:45] Writing local files
[03:31:45] GB activated
[03:31:45] Extra SSE boost OK.
[03:31:46] Writing local files
[03:31:46] Completed 0 out of 750000 steps (0%)

If the assigned server is offline… the client should queue the result and periodically try to upload it (every 6 hrs, IIRC). I’m not sure if QD (from here) will, when run, give you a clue as to the status of the result, but it’s worth a try.
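Roughly along these lines (a quick Python sketch of what I mean, not the actual client; the six-hour interval is only what I half-remember and the deadline value is made up):

[code]
import datetime

RETRY_INTERVAL = datetime.timedelta(hours=6)   # the "every 6 hrs" I half-remember

def next_upload_attempt(last_attempt, deadline):
    """Keep the queued result and schedule another upload attempt,
    unless the unit's deadline has already passed."""
    nxt = last_attempt + RETRY_INTERVAL
    return nxt if nxt < deadline else None     # past the deadline the unit would be dropped

last = datetime.datetime(2006, 5, 27, 3, 30)   # failed send at 03:30 on May 27, as in the log
deadline = datetime.datetime(2006, 6, 10)      # made-up deadline for the example
print(next_upload_attempt(last, deadline))     # -> 2006-05-27 09:30:00
[/code]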

Curly

Excellent! Thanks guys,

Using QD, I can see the client attempted two uploads to 171.65.103.68, then sent the completed unit to a collection server at 171.65.103.100. To quote the explanation pages for qd:

[i]CS: 171.67.89.100
This is the collection server IP address, which is not always present. If it is given, it identifies an alternate server to which the results can be returned if there are too many errors attempting to upload to the actual work server. If work is returned to the collection server, credit for it will occasionally be delayed in the Stanford statistics, but this is better than having the work deleted by the client if its deadline expires before the server can be reached successfully.

upload failures: 1
This field will not be printed for the current unit, or for anything other than a unit pending upload if the number of failures is zero. The client uses this field to decide when to try sending results to the collection server instead of to the actual work server.
[/i]

It looks like the affected units are not lost; they should just show up in the stats some time over the next few days :cool:

So, thankfully, I don’t have a farm full of misconfigured Linux boxes, which is what it looked like first thing this morning! :eek: :chuckle:

:cool: I understood the info on the wiki right :smiley: Not bad for so early in the morning :lol:

DT.

The collection server is often overloaded and it sometimes takes a while to report. There are also many different cores for different work units (Gromacs, GBGromacs, DGromacs, Amber, QMD, Tinker, Gromacs33 for Linux), so downloading new cores is not a bad thing.

You only need to change your machine ID for different CPUs on the same machine. If they are separate computers you don’t need to change it, e.g.:

BOX1 : P4-HT CPU - MachineID1, MachineID2
BOX2 : P4-NO HT CPU - MachineID1
BOX3 : ATHLON 64 X2 - MachineID1, MachineID2
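
Put another way, here’s that rule as a quick toy check in Python (just an illustration; the box names mirror the list above and nothing here comes from the real client):

[code]
# Machine IDs only have to be unique within one box; different boxes can reuse them.
boxes = {
    "BOX1 (P4-HT)":        [1, 2],   # two clients on the hyper-threaded CPU
    "BOX2 (P4, no HT)":    [1],
    "BOX3 (Athlon 64 X2)": [1, 2],
}

for name, ids in boxes.items():
    ok = len(ids) == len(set(ids))   # no duplicate machine IDs on the same box
    print(f"{name}: machine IDs {ids} -> {'OK' if ok else 'CLASH, change one ID'}")
[/code]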

Thanks Fenix, that’s how I read it the other night when I was setting them up. But this morning, when I saw the failures, what worried me was that all the separate processors might have to have separate machine IDs, e.g.:

P4 with HT ID1 & ID2
AMD64 x2 ID3 & ID4
etc…

At the time, it looked kind of like the server was rejecting multiple instances of machine ID1.

I didn’t want to have to go through the pain of setting up 25 nodes all over again :eek:

Looks ok now though :smiley:

If you have the project #s and the Run/Clone/Gen numbers for each project, log on to folding-community.org, send a private message to Uncle_Fungus and ask him to look each of them up. You can also download a utility called QD that will display the contents of your queue.dat file, which can be helpful in determining what’s going on. Also, with the summer heating up (at least where I live), I have noticed that CPU temps are getting higher, and that can have a direct effect on units EUEing. Either way, machine IDs only need to be different within each physical system. Two separate computers can both use machine ID 1, 2, etc. as long as one system doesn’t have two machine ID 1s running.

Never mind… saw that you already found QD :slight_smile: