6. Data Management

A large part of a user's work on the grid consists of accessing data and managing the files that contain it. Most users will use the Replica Manager command line interface and API to perform data management tasks on the grid. The Replica Manager interacts with the Replica Location Service (RLS), the Replica Metadata Catalog (RMC), the Replica Optimization Service (ROS) and the Storage Elements (SE) to provide high-level functionality while shielding users from the tedious details of direct RLS and SE interaction. Nonetheless, some details concerning the RLS and SE help users understand how the Replica Manager performs its job.

6.1 Terminology

Jargon unfortunately permeates the descriptions of the data management middleware. The following definitions explain the most common terms:

GUID
Grid Unique Identifier. This is a unique, immutable label for a file registered in the RLS. All replicas of this file share the same GUID. GUIDs take the form:
  guid:135b7b23-4a6a-11d7-87e7-9d101f8c8b70

LFN
Logical File Name. This is a user-specified, unique label for a file, usually a more intuitive tag which gives some indication of the file's content. In contrast to a GUID, an LFN is mutable. LFNs take the form:
  lfn:HiggsMonteCarlo.dat

SURL
Storage URL. A URL which uniquely identifies a file contained in a Storage Element. SURLs typically take the form:
  srm://grid02.lal.in2p3.fr/iteam/higgsCandidate.dat

TURL
Transport URL. A temporary URL which can be used to access a particular data file contained in a Storage Element via a certain protocol. For example, a TURL for access to a file via rfio takes the form:
  rfio://grid02.lal.in2p3.fr/iteam/higgsCandidate.dat

Much of the terminology has changed since the previous release; the older terms have been replaced by the more precise ones above.
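
The four kinds of identifier can be told apart by their prefix alone. The following is a minimal sketch of that idea; classify_id is a hypothetical helper and not part of the EDG tools, and it assumes that SURLs always use the srm scheme (the text only says they typically do) and that any other scheme:// string is a TURL.

```shell
#!/bin/sh
# Hypothetical helper: classify a grid file identifier by its prefix.
# Assumption: SURLs use "srm://"; every other "scheme://" is treated as a TURL.
classify_id() {
    case "$1" in
        guid:*)  echo GUID ;;
        lfn:*)   echo LFN ;;
        srm://*) echo SURL ;;
        *://*)   echo TURL ;;
        *)       echo unknown ;;
    esac
}

# The sample identifiers from the definitions above.
classify_id guid:135b7b23-4a6a-11d7-87e7-9d101f8c8b70
classify_id lfn:HiggsMonteCarlo.dat
classify_id srm://grid02.lal.in2p3.fr/iteam/higgsCandidate.dat
classify_id rfio://grid02.lal.in2p3.fr/iteam/higgsCandidate.dat
```

The srm pattern must come before the generic scheme pattern, since case takes the first match.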

6.2 Replica Location Service

The Replica Location Service (RLS) consists of two services: the Local Replica Catalog (LRC) and the Replica Location Index (RLI). The RLI allows the RLS to be geographically distributed; however, for EDG 2.0, this is not deployed. Therefore, a single LRC instance per Virtual Organization acts as a global registry for that VO's files. (Technically, the LRC contains the GUID-to-SURL mapping as well as some metadata concerning the physical file. See Section 6.1 for an explanation of the terminology.)

A service closely related to the RLS is the Replica Metadata Catalog (RMC). The RMC contains metadata about a VO's files. (Technically, it holds the LFN-to-GUID mapping and metadata tied to the GUID.)

The Replica Optimization Service (ROS) allows the Replica Manager to choose the ``closest'' file in terms of total transfer time.

Finally, the Storage Element (SE) provides a uniform interface to data storage. It provides a web service interface for management functions and typically allows for several types of direct access to data stored on the SE. The GridFTP protocol is supported by all SEs and can be used for wide-area access to the data. Typically ``file'' (i.e. standard POSIX access) and ``rfio'' access to the data are also provided to a ``close'' Storage Element. An SE and a CE are in fact defined to be ``close'' if file access to the SE's data is possible from the CE.

6.3 Replica Manager

The Replica Manager allows one to copy files into grid storage, register files, replicate files between SEs, delete individual replicas, and delete all replicas of a particular file, among other things. All of these functions are available via the edg-replica-manager command or its abbreviated version, edg-rm. Two general options to this command are absolutely vital to correct operation: vo and insecure. The vo option takes the name of your VO as an argument.

A good first test is to execute the following on a User Interface machine:

edg-replica-manager --vo iteam --insecure printInfo
substituting the name of your VO for ``iteam''. This prints a lot of information about exactly which services the replica manager command will use; the information is pulled from R-GMA. If there are problems with the Replica Manager commands, this command is often useful for debugging.

The subcommands for the Replica Manager also have shortened forms; for example the ``printInfo'' in the above command could have been replaced with ``pi''. A full list of the abbreviations is available from the command's usage obtained with the help option. The examples in this chapter will use the long forms for clarity.

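Frequently used subcommands include copyAndRegisterFile, listReplicas, listGUID, replicateFile, deleteFile, getBestFile and getTurl. The use of all of these commands will be seen in the following examples.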

6.4 Examples

The examples show typical data management cases and highlight the commands described above.

Example 6.1 (Bringing a File onto the Grid)  

Often data files are first created in temporary scratch space or on computers outside of the grid. To make these data grid-accessible, they must be moved to a Storage Element; usually one wants to register these files in a VO's catalog as well. This example demonstrates how to do this.

First create a fresh proxy with grid-proxy-init. Although the registration is not currently secured, the file transfer is; therefore, valid credentials will be needed.

Create an empty local file to work with:

touch empty-local-file
and now perform a copyAndRegisterFile with the Replica Manager to copy this to a Storage Element and register the file.
>> edg-replica-manager --vo iteam --insecure \
     copyAndRegisterFile file:`pwd`/empty-local-file \
     --destination-file gppse05.gridpp.rl.ac.uk \
     --logical-file-name lfn:my-demo-2003-10-01-1600

guid:b793f080-f417-11d7-b584-857330072702

The GUID of the created file is returned on successful completion. If the destination-file option is not given, then the copy is made on the ``local'' SE. You can use the printInfo subcommand to see what the ``local'' SE is. If the logical-file-name is not given, then the only way to access this file is through the returned GUID.
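
Because the returned GUID may be the only handle on the file, a script that captures the command's output might check that it really looks like a GUID before storing it. The sketch below is illustrative only: is_guid is a hypothetical helper, and the 8-4-4-4-12 lowercase hexadecimal pattern is an assumption inferred from the sample GUIDs in this chapter, not a documented format.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the argument looks like a GUID.
# Assumption: GUIDs are "guid:" followed by 8-4-4-4-12 lowercase hex digits,
# as in the examples shown in this chapter.
is_guid() {
    printf '%s\n' "$1" |
        grep -Eq '^guid:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

# In a real script this value would be the captured command output.
result="guid:b793f080-f417-11d7-b584-857330072702"
if is_guid "$result"; then
    echo "registered as $result"
else
    echo "unexpected output: $result" >&2
fi
```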

To check that this file exists, use the listReplicas command:

>> edg-replica-manager --vo iteam --insecure \
     listReplicas guid:b793f080-f417-11d7-b584-857330072702

srm://gppse05.gridpp.rl.ac.uk/iteam/generated/2003/10/01/fileb16684bf...
Either the logical file name or the GUID can be given as the argument; both forms return the same SURL (truncated here) for the replica, and this replica is indeed on the specified SE.
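
To confirm from a script which SE holds a replica, the host part of the SURL can be extracted with plain shell parameter expansion. This is a minimal sketch; surl_host is a hypothetical helper, not an EDG command, and it assumes the host is everything between the scheme and the next slash, minus an optional port.

```shell
#!/bin/sh
# Hypothetical helper: print the host name contained in a SURL.
surl_host() {
    rest=${1#*://}      # drop the scheme, e.g. "srm://"
    host=${rest%%/*}    # keep everything before the first slash
    printf '%s\n' "${host%%:*}"   # drop a port number, if one is present
}

# SURL as printed by listReplicas above (truncated in the original output).
surl_host 'srm://gppse05.gridpp.rl.ac.uk/iteam/generated/2003/10/01/fileb16684bf...'
```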

To delete this file, one can use the subcommand deleteFile, specifying the SURL to be deleted.

>> edg-replica-manager --vo iteam --insecure \
     deleteFile \
     srm://gppse05.gridpp.rl.ac.uk/iteam/generated/2003/10/01/fileb16684bf...
Note that deleting the last replica of a file will also remove the GUID and LFN from the catalog. If you wish to remove all replicas, you can use the all option and specify a GUID.

Example 6.2 (Replicating Existing File)  

As the brokering system does not yet perform automatic replication of data files for jobs, it is often necessary to make several replicas of a file on different Storage Elements. To demonstrate this, repeat the previous example to bring a file ``lfn:my-second-demo-2003-10-01-1600'' onto the grid, but fill the file with the string ``Hello There''. To verify that the file exists:

>> edg-replica-manager --vo iteam --insecure \
     listGUID lfn:my-second-demo-2003-10-01-1600

guid:a3ac7647-f418-11d7-a57b-e4d5c9608efc
which lists the GUID associated with this file. One could also have used listReplicas again:
>> edg-replica-manager --vo iteam --insecure \
     listReplicas lfn:my-second-demo-2003-10-01-1600

srm://gppse05.gridpp.rl.ac.uk/iteam/generated/2003/10/01/file9de8efe6...
which shows that the file is on the gppse05.gridpp.rl.ac.uk SE.

You can use edg-rgma to find another SE. Now replicate the file to that Storage Element:

>> edg-replica-manager --vo iteam --insecure \
     replicateFile --destination se001.fzk.de \
     lfn:my-second-demo-2003-10-01-1600

srm://se001.fzk.de/iteam/generated/2003/10/01/file42a1d2b2...
which returns the SURL of the copy. Using listReplicas again shows the two distinct replicas:
>> edg-replica-manager --vo iteam --insecure \
     listReplicas lfn:my-second-demo-2003-10-01-1600

srm://gppse05.gridpp.rl.ac.uk/iteam/generated/2003/10/01/file9de8efe6...
srm://se001.fzk.de/iteam/generated/2003/10/01/file42a1d2b2...
Leave these files on the grid for the next example.
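
A script that replicates files might first check whether a replica already exists on the destination SE. The sketch below works purely on listReplicas output fed to it on standard input; has_replica_on is a hypothetical helper, and it assumes one SURL per line with no port number in the host part.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any SURL read from standard input
# lives on the SE named in $1.
has_replica_on() {
    # -F matches the host name literally, so its dots are not wildcards.
    grep -q -F "://$1/"
}

# Output of the listReplicas call above (truncated in the original).
replicas='srm://gppse05.gridpp.rl.ac.uk/iteam/generated/2003/10/01/file9de8efe6...
srm://se001.fzk.de/iteam/generated/2003/10/01/file42a1d2b2...'

if printf '%s\n' "$replicas" | has_replica_on se001.fzk.de; then
    echo "replica already present on se001.fzk.de"
fi
```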

Example 6.3 (Accessing a File from a Job)  

The previous example showed how to bring data onto the grid and move it around. This one now demonstrates how to read the data from a job using the ``file'' protocol. It uses getBestFile to get the SURL of the local copy (making a copy if necessary) and then transforms that SURL into a filename which can be opened. The script calculates the checksum of the file.
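
The last step, turning a ``file'' TURL into a name that can be opened, can be tried on its own without any grid services. The following is a minimal sketch using only basic sed expressions; turl_to_path is a hypothetical helper and the sample TURL is invented for illustration. It assumes the TURL has an empty host part (file:///...); with a host name present, the host would end up as the first path component.

```shell
#!/bin/sh
# Hypothetical helper: convert a "file" TURL into a local path.
turl_to_path() {
    # Remove the leading "file:" scheme, then squeeze runs of slashes into one.
    printf '%s\n' "$1" | sed -e 's%^file:%%' -e 's%//*%/%g'
}

# Invented sample TURL, for illustration only.
turl_to_path 'file:///flatfiles/SE00//iteam/demo.dat'
```

Anchoring the scheme pattern with ^ avoids touching a later occurrence of ``file:'' inside the path.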

Put the following JDL into a file called ``ReadData.jdl'':

Executable         = "script.sh";
InputData          = {"lfn:my-second-demo-2003-10-01-1600"};
DataAccessProtocol = {"file","gridftp","rfio"};
StdOutput          = "std.out";
StdError           = "std.err";
InputSandbox       = {"script.sh"};
OutputSandbox      = {"std.out","std.err"};
and put the following into a file script.sh:
#!/bin/sh
  
# Get SURL of local replica (making one if necessary).
surl=`edg-replica-manager --vo iteam --insecure \
        getBestFile lfn:my-second-demo-2003-10-01-1600`
echo SURL: $surl
 
# Get TURL of the local replica.
turl=`edg-replica-manager --vo iteam --insecure \
        getTurl $surl file`
echo TURL: $turl
 
# Strip off URL's scheme and fix multiple slashes.
fname=`echo $turl | sed -r 's%/+%/%g' | sed s%file:%%`
echo FILE: $fname
 
# Get the checksum of this file.
cksum $fname
Checking the matching with edg-job-list-match should return the Computing Elements at the two sites that hold replicas of this file. Submitting the job should then return the correct checksum of the file in the std.out file.
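
To know which checksum to expect in std.out, the value can be computed locally on a copy of the same content. This sketch assumes the demo file was created with echo, so that it contains ``Hello There'' followed by a newline; crc_and_size is a hypothetical helper. The job prints the file name as a third field, so only the first two fields should be compared.

```shell
#!/bin/sh
# Hypothetical helper: print "<CRC> <size in bytes>" for a file.
# Reading from standard input makes cksum omit the file name.
crc_and_size() {
    cksum < "$1"
}

# Recreate the demo content locally.
# Assumption: the uploaded file was written with echo (trailing newline).
tmp=$(mktemp)
echo "Hello There" > "$tmp"
crc_and_size "$tmp"
rm -f "$tmp"
```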

More information on the Replica Manager and the underlying services discussed in this chapter can be found in the Users' Guides for the Replica Manager (http://cern.ch/edg-wp2/replication/docu/edg-replica-manager-userguide.pdf), LRC (http://cern.ch/edg-wp2/replication/docu/edg-lrc-userguide.pdf), RMC (http://cern.ch/edg-wp2/replication/docu/edg-rmc-userguide.pdf), and ROS (http://cern.ch/edg-wp2/replication/docu/edg-ros-userguide.pdf).
