ATLAS moves data around the grid in two primary ways: DDM transfers, which simply move data between sites, and jobs, which may read or write data at one site, process it, and then register it at another site.
Over the weekend of 13th-14th February 2010, RAL experienced problems with data transfers to Tier 2 sites. Many jobs were stuck in a transferring state despite there being no obvious fault. The cause was initially thought to be a large transfer rate from CMS jobs, but the problem remained even when the CMS rate dropped significantly. It was eventually traced to the ATLAS LSF (which schedules the data transfers) and was solved by minor configuration changes and a reboot.
The following plots are taken from the panda graph generator which can be found at:
http://gridinfo.triumf.ca/panglia/graph-generator/

The plot shows the panda (i.e. production) jobs for the entire UK for the past week. The majority of these jobs will have transferred their data to RAL once finished. The binning along the bottom is slightly confusing: each day label (Fri, Sat, etc.) sits at the centre of its bin, so it marks midday. The re-processing started on Friday at around 8pm. Graeme Stewart noticed that our problems started at around midnight, and you can see that the red line indicating the number of transferring jobs rises considerably from then on. The total number of running jobs also seems to drop slightly. This drop was not uniform: some sites, such as Manchester, found they had no running jobs at one point.
Chris Kruk identified the problem over the weekend and was able to implement his fix at around 9-10pm on Sunday night. You can see that the number of transferring jobs dropped steadily after that. No new jobs were coming in because the re-processing had finished. On Tuesday morning the next stage of the ATLAS production was started, as you can see from the jump in running jobs.
The results from the UK can be compared to those of the German (top) and French (bottom) clouds for the same time period:


The French cloud was a slightly special case: a large number of its jobs had to be aborted, which shows up as the big spike in failed jobs. It was also given more data to re-process than it expected, which is why it didn't finish by Monday like the UK and Germany.
The following 3 tables show the transfer rate, the number of completed (successful) file transfers and the total number of transfer errors for the MC Disk for the UK, German and French clouds for each day over the last week. Unfortunately, no finer resolution is available for the transfer rate.
As you can see, the total number of file transfers was higher for the UK, but not by that much. By the final day the merging step of the re-processing had started, which is why the files being transferred are so much larger.
The error rates are slightly misleading. The majority of the transfer errors for the UK cloud were caused by Tier 2 sites having problems. For example, on 16/2 about 10000 errors were caused by QMUL. The error rate from the failing jobs can be better estimated from the light green line on the plots.
| UK CLOUD | Throughput (MB/s) | Completed File Transfers | Total Number Transfer Errors |
|---|---|---|---|
| 12/2/10 | 7 | 55231 | 6771 |
| 13/2/10 | 7 | 50746 | 4470 |
| 14/2/10 | 10 | 45838 | 3384 |
| 15/2/10 | 6 | 45625 | 6502 |
| 16/2/10 | 5 | 29157 | 12156 |
| 17/2/10 | 46 | 2506 | 71 |
| DE CLOUD | Throughput (MB/s) | Completed File Transfers | Total Number Transfer Errors |
|---|---|---|---|
| 12/2/10 | 10 | 47283 | 187 |
| 13/2/10 | 7 | 37613 | 75 |
| 14/2/10 | 10 | 40698 | 102 |
| 15/2/10 | 2 | 8531 | 2 |
| 16/2/10 | 6 | 24154 | 24 |
| 17/2/10 | 41 | 4638 | 1 |
| FR CLOUD | Throughput (MB/s) | Completed File Transfers | Total Number Transfer Errors |
|---|---|---|---|
| 12/2/10 | 15 | 52125 | 1579 |
| 13/2/10 | 11 | 29609 | 207 |
| 14/2/10 | 11 | 35318 | 11 |
| 15/2/10 | 7 | 38575 | 1227 |
| 16/2/10 | 6 | 26014 | 9742 |
| 17/2/10 | 93 | 7541 | 3444 |
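To put a number on "higher for the UK but not by that much", the daily figures in the three tables above can simply be summed per cloud. The snippet below is only a rough sketch: the names `TABLES` and `weekly_totals` are made up for illustration, and the data is copied by hand from the tables rather than pulled from any monitoring service.

```python
# Daily figures copied by hand from the three tables above.
# Each row is (date, throughput in MB/s, completed transfers, transfer errors).
TABLES = {
    "UK": [
        ("12/2/10", 7, 55231, 6771),
        ("13/2/10", 7, 50746, 4470),
        ("14/2/10", 10, 45838, 3384),
        ("15/2/10", 6, 45625, 6502),
        ("16/2/10", 5, 29157, 12156),
        ("17/2/10", 46, 2506, 71),
    ],
    "DE": [
        ("12/2/10", 10, 47283, 187),
        ("13/2/10", 7, 37613, 75),
        ("14/2/10", 10, 40698, 102),
        ("15/2/10", 2, 8531, 2),
        ("16/2/10", 6, 24154, 24),
        ("17/2/10", 41, 4638, 1),
    ],
    "FR": [
        ("12/2/10", 15, 52125, 1579),
        ("13/2/10", 11, 29609, 207),
        ("14/2/10", 11, 35318, 11),
        ("15/2/10", 7, 38575, 1227),
        ("16/2/10", 6, 26014, 9742),
        ("17/2/10", 93, 7541, 3444),
    ],
}


def weekly_totals(rows):
    """Return (total completed transfers, total transfer errors) for one cloud."""
    completed = sum(row[2] for row in rows)
    errors = sum(row[3] for row in rows)
    return completed, errors


for cloud, rows in TABLES.items():
    completed, errors = weekly_totals(rows)
    print(f"{cloud}: {completed} completed transfers, {errors} errors")
```

Bear in mind that the error totals include the Tier 2 site problems mentioned above, so they should not be read as a measure of the RAL problem alone.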
The final table shows the estimated average size of each file transferred. This was calculated as Throughput * 86400 / Completed File Transfers, where 86400 is the number of seconds in a day. For the final day (17/2/10), which had not yet finished, I used 16 hours instead.
It is clear that the average size of files transferred to the UK was significantly smaller than for the DE and FR clouds while we were experiencing the problem. However, as said before, the number of files transferred wasn't that much higher and was certainly not unusual in terms of what we can expect from re-processing.
| Estimated average file size (MB) | UK | DE | FR |
|---|---|---|---|
| 12/2/10 | 11.0 | 18.3 | 24.9 |
| 13/2/10 | 11.9 | 16.1 | 32.1 |
| 14/2/10 | 18.8 | 21.2 | 26.9 |
| 15/2/10 | 11.4 | 20.3 | 15.7 |
| 16/2/10 | 14.8 | 21.5 | 19.9 |
| 17/2/10 | 1060 | 509 | 710 |
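For anyone wanting to reproduce the numbers in this table, the calculation described above can be written out as a small function. This is just a sketch: the function name `average_file_size_mb` and its `hours` parameter are my own labels, and the inputs are the daily throughput and completed-transfer counts from the earlier tables.

```python
def average_file_size_mb(throughput_mb_s, completed_transfers, hours=24):
    """Estimate the average file size in MB.

    Multiplies the average throughput (MB/s) by the number of seconds
    in the period to get the total data moved, then divides by the
    number of completed file transfers.
    """
    seconds = hours * 3600  # 24 hours gives the 86400 used in the text
    return throughput_mb_s * seconds / completed_transfers


# UK on 12/2/10: a full day at 7 MB/s with 55231 completed transfers.
uk_full_day = average_file_size_mb(7, 55231)

# UK on 17/2/10: the day was incomplete, so 16 hours is used instead.
uk_partial_day = average_file_size_mb(46, 2506, hours=16)

print(f"UK 12/2/10: {uk_full_day:.1f} MB")
print(f"UK 17/2/10: {uk_partial_day:.0f} MB")
```

The values for 17/2/10 in the table are rounded to three significant figures, so the function's output for that day differs slightly from the table entry.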