FTS Upgrade

March 17th, 2010 by Alastair Dewhurst

On Wednesday 17th March the RAL FTS server was upgraded to 2.2.3.  The upgrade appears to have gone successfully and RAL has come out of its scheduled downtime.  The plot shows the traffic going through the OPN router: activity was low from 8am – 11am while the upgrade was taking place, followed by a peak when the FTS server was turned back on.  The majority of this activity appears to have come from ATLAS.

Cacti plot showing traffic on the OPN after the FTS upgrade

ATLAS memory limit changes

March 4th, 2010 by Alastair Dewhurst

A small number of ATLAS jobs are starting to require memory in excess of 3GB.  If these jobs are killed by the batch system it makes it much harder for any problems to be debugged.  As a result ATLAS requested a change to the memory limit of its batch jobs from 3GB to 4GB.

The majority of worker nodes have 8 cores with 16 GB of RAM.  The RAM is overcommitted by 50%, giving 24 GB, which allows 8 3GB jobs to run on the same worker node.  With ATLAS’ request applied, a worker node can run a maximum of 6 4GB jobs, thus blocking 2 job slots.  If the entire cluster was just running ATLAS jobs there could be a reduction in capacity of 25%.  However, as well as ATLAS jobs the worker nodes also run jobs requiring 2, 1 and 0.5 GB of RAM.  It is thus not immediately obvious what effect this change would have on the batch system.
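The slot arithmetic above can be checked with a short sketch. The function name and parameters are made up for illustration; only the figures (8 cores, 16 GB, 50% overcommit, 3GB and 4GB limits) come from the text.

```python
def max_jobs_per_node(cores, ram_gb, overcommit, job_limit_gb):
    """Number of jobs with a given memory limit that fit on one worker node.

    A job needs one core and job_limit_gb of the (overcommitted) RAM.
    """
    usable_gb = ram_gb * (1 + overcommit)       # 16 GB * 1.5 = 24 GB
    by_memory = int(usable_gb // job_limit_gb)  # jobs that fit in memory
    return min(cores, by_memory)                # never more jobs than cores


slots_3gb = max_jobs_per_node(cores=8, ram_gb=16, overcommit=0.5, job_limit_gb=3)
slots_4gb = max_jobs_per_node(cores=8, ram_gb=16, overcommit=0.5, job_limit_gb=4)

blocked_slots = slots_3gb - slots_4gb       # slots lost per node
capacity_loss = blocked_slots / slots_3gb   # worst case: node runs only 4GB jobs

print(slots_3gb, slots_4gb, blocked_slots, capacity_loss)
```

This is the worst case only. A node running a mix of 4, 2, 1 and 0.5 GB jobs may still fill all 8 cores, which is why the real effect has to be measured.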

As ATLAS are the only VO that runs 4 GB jobs this change is most likely to have the largest effect on the batch farm when ATLAS is trying to run a lot of jobs.  Over the weekend of the 20th – 21st February ATLAS have been doing a lot of re-processing.  This involves the merging of lots of small files in MCDisk.  There were also some Monte Carlo jobs running.

The plot shows the performance of the batch farm over the weekend.  For the first half of the weekend the number of blocked job slots remains fairly constant at around 30, or ~1% of running jobs.  On Sunday this started to rise and is currently at around 500 blocked job slots.  This rise can be attributed to the fact that ATLAS jobs (since Sunday) have been having problems, which has taken worker nodes offline.  The number of ATLAS jobs has remained roughly constant while the total number of job slots available has dropped, leading to more machines having to run ATLAS jobs.  The current number of blocked job slots is ~16% of the running jobs.

ATLAS data transfer problems

March 3rd, 2010 by Alastair Dewhurst

There are two primary ways in which ATLAS moves data around the grid: DDM transfers, which simply move data around the grid, and jobs, which may read or write data at one site before processing it and then registering it to another site.

Over the weekend of the 13th – 14th February 2010, RAL experienced problems with data transfers to Tier 2 sites.  Many jobs were stuck in a transferring state despite there being no obvious problems.  The problem was initially thought to be related to a large transfer rate from CMS jobs; however, it remained even when the CMS rate dropped significantly.  It was eventually traced to a problem with the ATLAS LSF (which schedules the data transfers), which was solved by minor configuration changes and a reboot.

The following plots are taken from the panda graph generator which can be found at:

http://gridinfo.triumf.ca/panglia/graph-generator/

It shows the panda (i.e. production) jobs for the entire UK for the past week.  The majority of these jobs will have transferred their data to RAL once finished.  The binning along the bottom is slightly confusing.  The word (Fri, Sat etc) is at the centre of the bin, so at midday.  The re-processing started on Friday around 8pm.  Graeme Stewart noticed that our problems started at around midnight and you can see that the red line indicating the number of transferring jobs rises considerably.  The total number of running jobs also seems to drop slightly.  This drop was not uniform: some sites, such as Manchester, found they had no running jobs at one point.

Chris Kruk identified the problem over the weekend and was able to implement his fix at around 9-10pm on Sunday night.  You can see that the number of transferring jobs dropped steadily after that.  No new jobs were coming in because the re-processing was finished.  On Tuesday morning the next stage of the ATLAS production was started, as you can see from the jump in running jobs.

The results from the UK can be compared to both the German (top) and French (bottom) Clouds for the same time period:

The French cloud was slightly special in that it had to have a whole bunch of jobs aborted which are shown as the big failed spike.  Also it was given more data than it expected to re-process which is why it didn’t finish by Monday like the UK and Germany.

The following 3 tables show the transfer rate, the number of completed (successful) file transfers and the total number of transfer errors for the MC Disk for the UK, German and French Clouds for each day over the last week.  There is unfortunately no better resolution for the transfer rate.

As you can see the total number of file transfers was higher for the UK but not by that much.  By the final day the merging step of the re-processing had started which is why the files being transferred are so much larger.

The error rates are slightly misleading.  The majority of the transfer errors for the UK cloud were caused by Tier 2 sites having problems.  For example, on the 16/2 ~10000 errors were caused by QMUL.  The error rate from the failing jobs can be better estimated from the light green line on the plots.

UK CLOUD   Throughput (MB/s)   Completed File Transfers   Total Number Transfer Errors
12/2/10     7                  55231                       6771
13/2/10     7                  50746                       4470
14/2/10    10                  45838                       3384
15/2/10     6                  45625                       6502
16/2/10     5                  29157                      12156
17/2/10    46                   2506                         71

DE CLOUD   Throughput (MB/s)   Completed File Transfers   Total Number Transfer Errors
12/2/10    10                  47283                        187
13/2/10     7                  37613                         75
14/2/10    10                  40698                        102
15/2/10     2                   8531                          2
16/2/10     6                  24154                         24
17/2/10    41                   4638                          1

FR CLOUD   Throughput (MB/s)   Completed File Transfers   Total Number Transfer Errors
12/2/10    15                  52125                       1579
13/2/10    11                  29609                        207
14/2/10    11                  35318                         11
15/2/10     7                  38575                       1227
16/2/10     6                  26014                       9742
17/2/10    93                   7541                       3444

The final table shows the estimated average size of each file transferred.  This was calculated from Throughput * 86400 / Completed File Transfers.  (86400 = number of seconds in the day; for the Thursday when the day hadn’t been completed I took 16 hours.)

It is clear that the average file size to the UK was significantly smaller than for both the DE and FR clouds when we experienced the problem.  However, as said before, the number of files transferred wasn’t that much higher and certainly not that unusual in terms of what we can expect from re-processing.

Estimated average file size (MB)   UK      DE     FR
12/2/10                            11.0    18.3   24.9
13/2/10                            11.9    16.1   32.1
14/2/10                            18.8    21.2   26.9
15/2/10                            11.4    20.3   15.7
16/2/10                            14.8    21.5   19.9
17/2/10                            1060    509    710
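As a rough check of the estimate, the formula quoted above can be applied to the UK rows of the transfer table. The function and variable names here are illustrative; the input numbers are copied from the tables, and the final day uses 16 hours as described.

```python
SECONDS_PER_DAY = 86400
SECONDS_IN_16_HOURS = 16 * 3600  # used for the incomplete final day


def average_file_size_mb(throughput_mb_s, completed_transfers,
                         seconds=SECONDS_PER_DAY):
    """Estimated average file size: throughput * seconds / completed transfers."""
    return throughput_mb_s * seconds / completed_transfers


# (throughput in MB/s, completed file transfers) for the UK cloud
uk_cloud = {
    "12/2/10": (7, 55231),
    "13/2/10": (7, 50746),
    "14/2/10": (10, 45838),
    "15/2/10": (6, 45625),
    "16/2/10": (5, 29157),
}

uk_sizes = {day: average_file_size_mb(rate, files)
            for day, (rate, files) in uk_cloud.items()}

# The final day was only 16 hours long when the numbers were taken.
uk_last_day = average_file_size_mb(46, 2506, seconds=SECONDS_IN_16_HOURS)

for day, size in uk_sizes.items():
    print(day, round(size, 1))
print("17/2/10", round(uk_last_day))
```

Because the throughput is only available as a whole number of MB/s averaged over a day, these sizes are order-of-magnitude estimates rather than precise values.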

Low efficiency LHCb jobs

March 2nd, 2010 by Alastair Dewhurst

Jobs on the batch farm are automatically killed if they exceed a certain amount of wall or CPU time.  This, however, doesn’t prevent inefficient jobs from running.  Job efficiency is defined as CPU time / Wall time.  One of the most common causes of job inefficiency is a job waiting on files.  It was noticed that LHCb seemed to have a large number of inefficient jobs compared to the other LHC VOs.  One possible reason for this is that LHCb allows users to run their analysis at the Tier 1 and their code may not be well designed.  However, it is also possible for jobs that are using the same computing resources to affect each other.  There could be a hardware bottleneck which meant that all LHCb jobs would slow down once a certain threshold was passed.

One way to test if this is happening is to plot the throughput against the number of running jobs.  The throughput is defined as the job efficiency x number of jobs running.  If the number of jobs is affecting the efficiency then the plot should reach a plateau.  The plot shows throughput against running jobs for each of the 4 LHC experiments.  As there is no plateau, it is assumed that the inefficiency is due to poor coding.
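The plateau test can be sketched in a few lines. This is only an illustration of the idea, not the analysis behind the plot: the sample numbers are invented, and comparing the busiest half of the samples against the quietest half is an assumed, deliberately crude criterion.

```python
def throughput(cpu_time, wall_time, running_jobs):
    """Throughput = job efficiency (CPU time / wall time) x running jobs."""
    return (cpu_time / wall_time) * running_jobs


def efficiency_ratio(samples):
    """Compare mean efficiency at high and low job counts.

    samples is a list of (running_jobs, throughput) pairs.  A ratio near 1
    means throughput keeps growing in step with the job count (no plateau);
    a ratio well below 1 means throughput is flattening off.
    """
    ordered = sorted(samples)
    half = len(ordered) // 2
    quiet, busy = ordered[:half], ordered[-half:]

    def mean_efficiency(part):
        return sum(tp / jobs for jobs, tp in part) / len(part)

    return mean_efficiency(busy) / mean_efficiency(quiet)


# Hypothetical samples: (running jobs, throughput)
no_plateau = [(100, 60), (200, 120), (300, 180), (400, 240)]
plateau = [(100, 60), (200, 120), (300, 125), (400, 125)]

print(efficiency_ratio(no_plateau))  # close to 1: jobs are not slowing each other
print(efficiency_ratio(plateau))     # well below 1: a shared bottleneck is likely
```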

Update on Tier1 Castor Problem and 3D Migration.

February 1st, 2010 by Gareth Smith

While carrying out the planned migration of the Oracle Castor databases we encountered unexpected stability problems. This has led to an extended outage of the RAL Tier1.

The instabilities were in the Oracle RAC / Storage Area Network / Disk array system that holds the databases for the RAL Castor systems. In particular a reboot of one of the nodes in an Oracle RAC would cause other nodes in that RAC to crash.  Many, but not all, of these issues have been resolved. However, in part this has been at the expense of some resilience – the ‘multipath’ capabilities of the SAN have been disabled.

The current unscheduled outage on Castor is due to end on Wednesday lunch-time. However, we are hopeful of bringing Castor up tomorrow (Tuesday). This will be without some of the resilience we had hoped for, as referred to above.

The LFC and FTS systems were migrated back to their original hardware last week without issue and have been running satisfactorily since then. During today the 3D databases, OGMA, LUGH, and the LHCb LFC have been largely migrated back to the same equipment. The migration was interrupted for a short while around 4pm for some system reboots and will continue during the evening.

Tier1 Updates.

January 26th, 2010 by Gareth Smith

As announced in our previous BLOG entry we are about to make a number of updates and changes to the RAL Tier1. The batch queues are now draining ahead of the significant outages scheduled for tomorrow (Wednesday 27th January) and Thursday (28th). The outages are announced via the GOC DB.

During an At Risk period for the whole site yesterday an intervention was made on the UPS with the aim of reducing the noise on the electric current supplied. An initial view shows some resultant improvement. However, we await a more detailed analysis.

The more significant of the interventions in the next couple of days are the outage of the LFC (lfc.gridpp.rl.ac.uk and lfc-atlas.gridpp.rl.ac.uk) tomorrow and of Castor for tomorrow and Thursday. During these interventions the Oracle databases behind these services will be migrated back to their original disk arrays.

The 3D databases (including the LHCb-lfc) will be migrated back on Monday 1st February. Again this is detailed in the GOC DB. This is largely done during an ‘At Risk’, although there is a short outage at the end of the afternoon for some systems to be rebooted to complete the process.

There will also be a networking intervention at the RAL site on the morning of Tuesday 9th February. Some details of this remain to be finalised and will be announced via the GOC DB as soon as known.

Gareth Smith
RAL Tier1 Production Manager

Tier1 Updates During January.

January 19th, 2010 by Gareth Smith

During January we are making a number of updates to the Tier1 in preparation for the restart of LHC during February. A significant outage has been declared for Wednesday and Thursday 27/28 January, although some work is also planned at other times.

One of the main pieces of work to be done is to migrate Oracle databases back to their original disk arrays. Users will no doubt remember the problems we had back in October when we had multiple failures of the disk arrays hosting the Oracle databases behind the Castor, LFC, FTS and 3D services. Since then we have been running using a variety of older equipment to host these databases. We are now confident that the source of this problem has been traced to noise on the electrical current supplied by the Uninterruptible Power Supply (UPS) fitted in the new computer building. A test was made on the 5th January where the UPS was bypassed. This test confirmed the UPS as the source of the noise, and also showed that we can carry out the operation of inserting and removing the UPS bypass without affecting running systems. As a result we are now in a position to migrate the databases back to the original disk arrays. These will initially be powered from an alternative clean power supply, although without UPS backup. Work is ongoing to fix the electrical noise problem, and once completed we will move the disk arrays back to UPS power.

During the intervention on 27/28 January we will migrate the Castor, LFC and FTS databases back. This has been combined with a drain of the batch system which is required for an update to the batch engine. Furthermore, while the systems are down we will ‘FSCK’ all the disk servers, apply kernel patches and carry out a number of other improvements. Different components have different lengths of downtime during this intervention. Notably the plan is to complete the migration of the LFC on Wednesday 27th and return it to service within the day. In contrast, the requirement to drain out the batch system means that no new batch work will be accepted from the evening of Sunday 24th.

We are just finalising the date for the migration of the 3D databases.

Other interventions are being announced via the GOC DB. In particular there is a planned network intervention on the RAL site during the morning of Tuesday 9th February that will lead to a short (half an hour) break in external connectivity.

Gareth Smith
RAL Tier1 Production Manager

RAL Tier1 – Plans for Christmas Holiday.

December 15th, 2009 by Gareth Smith

RAL will be closed from Friday 25th December and re-open on Monday 4th January. During this time we plan for services at the RAL Tier1 to remain up. The usual on-call cover will be in place (as per nights and weekends). This cover will be enhanced by daily checks of key systems. Some hardware interventions, such as swapping out faulty disks, will also take place over this time. However, we have relaxed our expectation that the on-call person will respond within two hours, particularly on 25/26 December and 1st January.

During the holiday we will check for tickets in the usual manner. However, only service critical issues will be dealt with.

The status of the RAL Tier1 can be seen on the dashboard at:

http://www.gridpp.rl.ac.uk/status/

Gareth Smith

QUATTOR Codeswarm

November 25th, 2009 by James Adams

We’ve been actively working with QUATTOR in production for several months now, so I took the opportunity to look back on the amount of work that we’ve been committing to our SCDB repository. This also gave me an excuse to play with an awesome repository visualisation tool I stumbled across the other week called code_swarm.

Update on Tier1 Problems of Loss of Service and Data Loss from Castor.

November 11th, 2009 by Gareth Smith

As you are no doubt aware the RAL Tier1 suffered a significant outage during October with data loss from Castor. The causes of the outage have been the subject of an earlier blog entry. In summary four disk arrays failed in a short timescale. Two of these arrays hosted the Oracle databases behind Castor, the other two the Oracle databases behind the LFC, FTS and 3D services. Each set of services had a pair of disk arrays so as to provide resilience should one fail.

The underlying causes of the failures of the disk arrays have not yet been pinpointed. It is clear that something in the electrical supply triggers the errors on the disk arrays. This has been shown by a series of tests powering the arrays in different ways. However, in the location where the arrays show problems (the room with UPS power) there is a significant amount of other equipment, none of which has any power issues. Furthermore, all measurements made so far show the power supply is within specification.

Reasons for the Data Loss.

Following the failure of the two pairs of disk arrays it was necessary to restore some of the databases from backup. The loss of data from Castor arose following a complex series of events. Where necessary the databases were restored and then migrated to an alternative disk system. The Castor configuration has two separate databases. One contains the nameserver along with the stager databases for the CMS and ‘GEN’ instances. The other contains the stager databases for the Atlas and LHCb instances. The problem occurred with the database that included the Castor ‘nameserver’. A subsequent analysis showed that following the successful restore of this database Oracle picked up the database from the ‘wrong’ disk array of the pair. This contained a version of the database dating from an earlier hardware failure. This older version was migrated to the alternative disk array and used to restart Castor.

While many checks were made during the Castor restart, the particular problem of the database containing an out-of-date nameserver was not detected. Once started, Castor carries out a series of internal validations, and references to files no longer within the nameserver were scavenged and removed from other locations (including removing the copies on disk). It subsequently became clear that a significant data loss covering a period of ten days had occurred, this being the time difference between the database picked up during the recovery operation (dating from 24th September) and the point at which Castor had failed on 4th October.

Each file stored in Castor is allocated a unique FileID. Once the cause of the data loss was understood it was realized that FileIDs within Castor were being re-used. This was a direct consequence of picking up the older version of the database containing the Castor nameserver. Following discussions with the Castor developers a mechanism was identified whereby files that made use of a recycled FileID could subsequently be deleted in error. This possibility was very unlikely: it only provided a mechanism where a file in one Castor instance could be deleted by an explicit request to do so via another instance, i.e. the usual file security could be circumvented. Although this theoretical possibility was very unlikely to appear in practice, a short intervention was made on 15th October to eliminate this risk for further files. The value of the FileID was increased beyond the value it had reached before the failures on 4th October.
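The FileID re-use and the fix can be illustrated with a toy model. This is not Castor code: the class, its methods and all the numbers are hypothetical, and it only assumes that FileIDs come from a counter that increases by one per file.

```python
class FileIdAllocator:
    """Toy stand-in for a counter that hands out increasing FileIDs."""

    def __init__(self, next_id=1):
        self.next_id = next_id

    def allocate(self):
        file_id = self.next_id
        self.next_id += 1
        return file_id

    def advance_past(self, high_water_mark):
        """Move the counter beyond the highest ID ever issued."""
        self.next_id = max(self.next_id, high_water_mark + 1)


# Before the failure: ten files are created and receive IDs 1..10.
live = FileIdAllocator()
issued_before_failure = [live.allocate() for _ in range(10)]

# An older copy of the database only knows about the first six IDs,
# so after the restore the counter starts again at 7.
restored = FileIdAllocator(next_id=7)
reused_id = restored.allocate()
print(reused_id in issued_before_failure)  # True: an ID has been recycled

# The fix: push the counter past the value reached before the failure.
restored.advance_past(max(issued_before_failure))
fresh_id = restored.allocate()
print(fresh_id in issued_before_failure)   # False: no further re-use
```

Note that advancing the counter only protects files created afterwards; IDs already recycled before the intervention stay recycled.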

Current Situation.

All four disk arrays that showed problems were removed from production before services were restarted. The databases are currently hosted on older disk systems which have a proven reliability, but do not have the resilience we theoretically had before the failures. Work has been ongoing to check and improve issues such as spares for these temporary systems and to provide an alternative disk array should there be another failure. Work has also been ongoing to get to the bottom of the hardware issue. Likewise other procedures, including the architecture of the database systems (which led to the wrong database being picked up) and procedures at Castor start-up, are being reviewed. This incident, which is overseen by the Tier1’s disaster management process, is ongoing. All actions taken during the complex sequence of events are being looked at to ensure we do not suffer another data loss in this way.