Introduction:
The Bulk Data Transfer service provides an efficient and reliable way to transfer datasets into the cloud computing environment of EOSC EU Node and to transfer them back to end-user premises and repositories. The Bulk Data Transfer service of LOT2 is an infrastructure-oriented massive data transfer service intended to support high-volume data transfers between distant sites. Using the service, data can be moved directly to the back-end storage of the EOSC EU Node infrastructure, where they can be accessed by VMs and containers in the cloud computing platform.
The Bulk Data Transfer service allows users to move data from outside EOSC EU Node to the data storage back-end of the LOT2 compute services. Once transferred, these data can be accessed by applications and services running in the OpenStack-based virtual compute infrastructure and the OKD-based container platform.
Bulk Data Transfer use-cases
Various use-cases can be implemented with the Bulk Data Transfer service, including:
- staging research or experiment data into the EOSC EU Node compute service for use in computation and data analysis; staging data to the back-end of the virtual compute service and/or container platform ensures proximity of the datasets to the compute power offered by EOSC EU Node;
- transferring the output of computation or analysis performed in EOSC EU Node compute services back to the user or research infrastructure and/or an external data repository;
- performing large-scale data migration between sites, e.g. when migrating services from an existing cloud computing platform to EOSC EU Node; such a migration may include transferring VM and container images with the application or service code as well as the more massive datasets used in computations and data analysis.
Bulk Data Transfer implementation
The current implementation of the Bulk Data Transfer service for EOSC EU Node is based on the File Transfer Service (FTS). Other implementations based on cloud-native protocols such as S3 are to be provided in the future.
What is FTS
FTS is the File Transfer Service. It is used across projects that deal with large volumes of scientific data that have to be moved around geographically distributed data storage infrastructure.
It was designed and developed at CERN (https://github.com/cern-fts). Historically, the main application of FTS was to automate the transfer of large data volumes (in the range of petabytes) within large collaborations. For that purpose FTS supports third-party transfers, using GridFTP among other protocols, as well as transfer status monitoring, transfer restarts and built-in transfer optimisation.
FTS benefits
The overall added value of the Bulk Data Transfer service is the performance and reliability of data transfers.
The FTS-based implementation uses the GridFTP protocol to enable multi-threaded data transport, which helps to overcome the negative impact of network latency on long-distance links. FTS also takes care of completing the data transfer tasks defined by users: it monitors their progress and status and can restart failed transfers if needed.
While delivering specialised functionality that is applied in large-scale data management projects, the Bulk Data Transfer service can also be integrated into generic research and business workflows that involve large data transport. FTS can be integrated with today's typical cloud computing platforms, and EOSC EU Node provides such an integration.
How to use FTS
Data transfers are organised into data transfer tasks (jobs). A job specification includes, as a minimum, the source and target locations (URLs) of the data transfer.
Managed Bulk Data Transfer jobs can be triggered and monitored using any of the FTS servers available in the EOSC EU Node installation, including the servers at PSNC and Safespring. The FTS servers configured for EOSC EU Node are listed in the table below.
Site | Component | hostname *) | Port # | URL |
PSNC | FTS3rest | | 8446 | https://fts.eu-1.datatransfer.open-science-cloud.ec.europa.eu:8446/ |
PSNC | FTSmon | | 8449 | https://fts.eu-2.datatransfer.open-science-cloud.ec.europa.eu:8449/ |
Safespring | FTS3rest | fts02.staging.eosc.safedc.services | 8446 | https://fts.eu-2.datatransfer.open-science-cloud.ec.europa.eu:8446/ |
Safespring | FTSmon | fts02.staging.eosc.safedc.services | 8449 | https://fts.eu-2.datatransfer.open-science-cloud.ec.europa.eu:8449/ |
*) NOTE: the list temporarily includes the staging installation; the final version of the documentation will include the production host names.
Users can interact with the service either through the CLI tools, which contact the API endpoint of the relevant FTS server, or directly through the REST API of the FTS servers. This user guide focuses on using the FTS CLI tools to interact with the Bulk Data Transfer service of EOSC EU Node.
The table below lists the host names and URLs of the GridFTP servers in EOSC EU Node. These URLs can be used to specify the target or the source of managed transfer jobs when transferring data into or out of EOSC EU Node, respectively.
Site | Component | hostname *) | Port # | URL to be used in the FTS CLI tools |
PSNC | GridFTP server | gridftp01.eu-1.datatransfer.open-science-cloud.ec.europa.eu | 2811 | gsiftp://gridftp01.eu-1.datatransfer.open-science-cloud.ec.europa.eu:2811 |
PSNC | GridFTP server | gridftp02.eu-1.datatransfer.open-science-cloud.ec.europa.eu | 2811 | gsiftp://gridftp02.eu-1.datatransfer.open-science-cloud.ec.europa.eu:2811 |
Safespring | GridFTP server | gridftp03.eu-2.datatransfer.open-science-cloud.ec.europa.eu | 2811 | gsiftp://gridftp03.eu-2.datatransfer.open-science-cloud.ec.europa.eu:2811 |
Safespring | GridFTP server | gridftp04.eu-2.datatransfer.open-science-cloud.ec.europa.eu | 2811 | gsiftp://gridftp04.eu-2.datatransfer.open-science-cloud.ec.europa.eu:2811 |
*) NOTE: the list temporarily includes the staging installation; the final version of the documentation will include the production host names.
Please note that a user of the FTS service typically interacts with the FTS server only and does not interact with the GridFTP servers directly. It is the FTS server that triggers and supervises the third-party transfers performed by the GridFTP servers.
EOSC EU Node includes two FTS servers, one per location, for increased reliability. GridFTP servers are also instantiated at both sites, which enables efficient data transfer to either site. For instance, transferring data to or from PSNC requires using the GridFTP servers running at PSNC. Similarly, staging data into the Safespring compute infrastructure requires using the GridFTP servers running there.
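The site-to-server mapping described above can be sketched as a small shell helper. This is only an illustration: the function names `gridftp_base_url` and `transfer_url` and the file path are hypothetical, and the sketch arbitrarily picks the first GridFTP host listed for each site in the table.

```shell
#!/bin/sh
# Hypothetical helper: map a site name to the gsiftp:// base URL of one of
# its GridFTP servers, using host names taken from the table above.
gridftp_base_url() {
    case "$1" in
        PSNC|psnc)
            echo "gsiftp://gridftp01.eu-1.datatransfer.open-science-cloud.ec.europa.eu:2811" ;;
        Safespring|safespring)
            echo "gsiftp://gridftp03.eu-2.datatransfer.open-science-cloud.ec.europa.eu:2811" ;;
        *)
            echo "unknown site: $1" >&2
            return 1 ;;
    esac
}

# Build a full transfer URL from a site name and an absolute path.
transfer_url() {
    base=$(gridftp_base_url "$1") || return 1
    printf '%s%s\n' "$base" "$2"
}

# Example: URL for staging a (hypothetical) file into the Safespring site.
transfer_url Safespring /data/input/sample.bin
```

The resulting URL can then be passed as the source or target of a managed transfer job.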
Using FTS
This section of the user guide focuses on using the FTS CLI tools to interact with the Bulk Data Transfer service of EOSC EU Node. The Web interface of FTS is not supported in EOSC EU Node.
Starting to use FTS
Prerequisites:
If the CLI client is to be used with the service, the following minimum requirements have to be met:
- the user has access to the FTS CLI tools package;
- the user owns a relevant user certificate (X.509) issued by one of the supported CAs;
- the user has a proxy certificate that is valid for the period in which the commands are executed;
- the REST API endpoint is reachable from the server, virtual machine or workstation where the user runs the FTS commands. For the list of the FTS server addresses and port numbers, refer to the table above.
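Some of these prerequisites can be checked before the first transfer is submitted. The sketch below is a rough illustration: the helper names `missing_tools` and `proxy_file` are hypothetical, and it assumes the proxy is stored either in the file named by `X509_USER_PROXY` or in the conventional `/tmp/x509up_u<uid>` location. It does not test whether the REST API endpoint is reachable.

```shell
#!/bin/sh
# Print the names of the given commands that are not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print the path where the proxy certificate is expected (assumption:
# X509_USER_PROXY if set, otherwise the conventional default location).
proxy_file() {
    echo "${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}"
}

missing=$(missing_tools fts-rest-transfer-submit fts-rest-transfer-list voms-proxy-info)
if [ -n "$missing" ]; then
    # Unquoted on purpose so that the names are printed on one line.
    echo "Missing tools:" $missing
else
    echo "All required tools found"
fi
[ -f "$(proxy_file)" ] || echo "No proxy file at $(proxy_file)"
```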
Using FTS CLI tools:
Initiating the transfer with the CLI tool:
The following CLI command is used to initiate an FTS-monitored transfer:
/bin/fts-rest-transfer-submit --verbose "$IN" "$OUT"
where $IN and $OUT are the source and target URLs on the GridFTP servers holding the data, including the path to the data to be transferred.
An example command is presented below:
/bin/fts-rest-transfer-submit --verbose gsiftp://gridftp01.staging.eosc.pcss.pl:2811/data/output/testfile_1G_2.bin gsiftp://gridftp03.staging.eosc.safedc.services:2811/data/output/testfile_1G_2.bin
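A thin wrapper around the submit command can catch malformed URLs before they reach the FTS server. The following is a minimal sketch: `is_gsiftp_url` and `submit_transfer` are hypothetical helper names, and `FTS_ENDPOINT` is an assumed environment variable. When it is set, its value is passed to the client with the `-s` option; otherwise the client falls back to its own default.

```shell
#!/bin/sh
# Succeed if the argument looks like gsiftp://host:port/path.
is_gsiftp_url() {
    case "$1" in
        gsiftp://*:*/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Validate both URLs, then hand them to the FTS client.
submit_transfer() {
    src=$1 dst=$2
    for url in "$src" "$dst"; do
        if ! is_gsiftp_url "$url"; then
            echo "not a gsiftp:// URL: $url" >&2
            return 2
        fi
    done
    # Pass -s only when FTS_ENDPOINT (an assumed variable) is non-empty.
    fts-rest-transfer-submit ${FTS_ENDPOINT:+-s "$FTS_ENDPOINT"} --verbose "$src" "$dst"
}

# Usage (not executed here):
#   FTS_ENDPOINT=https://fts.eu-1.datatransfer.open-science-cloud.ec.europa.eu:8446
#   submit_transfer "$IN" "$OUT"
```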
Monitoring the transfers:
Transfers can be monitored in several ways. The FTS CLI tool can be used to list the transfers initiated by a user along with their status information. In addition, FTSmon, the monitoring console of FTS, provides a graphical overview of the transfer jobs handled by the FTS server.
Monitoring the transfers with the CLI tool:
The following CLI command is used to monitor the transfers initiated by a user:
/bin/fts-rest-transfer-list | grep -Ei "Request ID|ACTIVE|Status"
The command should display the list of initiated transfers along with their state. A detailed explanation of the command output is provided in the tool documentation.
Example output of the command can be seen below:
/bin/fts-rest-transfer-list | grep -Ei "Request ID|ACTIVE|Status"
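For a single job, the status can also be polled in a loop until the transfer ends. The sketch below uses `fts-rest-transfer-status` from the same client package and assumes that its human-readable output contains a `Status: <STATE>` line, the layout that the grep filter above relies on. The helper names, the polling interval and the retry limit are illustrative choices, not part of the service.

```shell
#!/bin/sh
# Extract the value of the first "Status:" line from client output on stdin.
# The "Status: <STATE>" layout is an assumption about the client output.
parse_status() {
    sed -n 's/^[[:space:]]*Status:[[:space:]]*//p' | head -n 1
}

# FTS job states after which no further progress is expected.
is_terminal_state() {
    case "$1" in
        FINISHED|FINISHEDDIRTY|FAILED|CANCELED) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll one job until it ends; succeeds only if the job is FINISHED.
# Arguments: job ID, interval in seconds (default 30), attempts (default 120).
wait_for_job() {
    job_id=$1 interval=${2:-30} tries=${3:-120}
    while [ "$tries" -gt 0 ]; do
        state=$(fts-rest-transfer-status "$job_id" 2>/dev/null | parse_status)
        echo "$job_id: ${state:-unknown}"
        if is_terminal_state "$state"; then
            [ "$state" = "FINISHED" ]
            return
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    echo "gave up waiting for $job_id" >&2
    return 1
}

# Usage (not executed here): wait_for_job <job-id> 60
```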
Monitoring the transfers with FTSmon:
In addition to the FTS CLI, the FTSmon module can be used to get a graphical overview of the transfer jobs handled by the FTS servers. FTSmon monitoring consoles are available for the FTS servers running in EOSC EU Node. The URLs where these consoles can be reached are listed in the table above.
An example view of the FTSmon console is presented in the picture below.
The picture shows one active transfer between two GridFTP servers at PSNC ("Active: 1") as well as the status and statistics of other jobs that are or were running between the GridFTP server pairs.
Note that FTSmon presents only aggregated information on the transfers for particular source and destination pairs. Additional information on the transfer jobs can be examined by navigating to the detailed task status and statistics pages in the web interface. Detailed job information is also available to users through the CLI monitoring tool; see the previous subsection for details.
Troubleshooting:
Common issues:
Authorization issue:
- Problem:
- the user tried to use the CLI client (fts-rest-transfer-submit) to trigger an FTS transfer, but the CLI client returns errors indicating that the request is not authorized
- Steps to solve the problem:
- (1) check the status of the grid proxy (output of the grid-proxy-info or voms-proxy-info command);
- (2) examine the output of the command; it should contain the proxy expiration time; check whether the proxy is still valid;
- (3) if the proxy is not valid, initialise the proxy (grid-proxy-init or voms-proxy-init) and re-run the FTS client command;
- (4) refer to the FTS CLI tool documentation;
- (5) refer to the documentation of the grid-proxy-* and voms-proxy-* CLI commands.
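Steps (1) to (3) can be partly automated. The sketch below assumes the VOMS client is installed and uses `voms-proxy-info --timeleft`, which prints the remaining proxy lifetime in seconds. The helper names and the one-hour threshold are arbitrary choices made for illustration.

```shell
#!/bin/sh
# Classify a remaining proxy lifetime (seconds) against a required minimum.
proxy_state() {
    left=${1:-0} need=${2:-3600}
    if [ "$left" -le 0 ]; then
        echo expired
    elif [ "$left" -lt "$need" ]; then
        echo expiring
    else
        echo ok
    fi
}

# Print the remaining lifetime of the current proxy in seconds;
# prints 0 if there is no proxy or the tool is not installed.
proxy_timeleft() {
    t=$(voms-proxy-info --timeleft 2>/dev/null | head -n 1)
    case "$t" in
        ''|*[!0-9]*) echo 0 ;;
        *) echo "$t" ;;
    esac
}

case "$(proxy_state "$(proxy_timeleft)")" in
    ok)       echo "Proxy is valid" ;;
    expiring) echo "Proxy expires within the hour; consider renewing it" ;;
    expired)  echo "No valid proxy; run voms-proxy-init or grid-proxy-init" ;;
esac
```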
Transfer does not start:
- Problem:
- the user tried to use the CLI client to trigger an FTS transfer, but the CLI client returns errors indicating that the transfer did not start, or the transfer status checking tool returns no output or reports transfer errors
- Steps to solve the problem:
- (0) check whether the request is authorized properly; see the Authorization issue topic above;
- (1) examine the input to the FTS client command (CLI): it should include two parts: (a) the GridFTP source endpoint URL; (b) the GridFTP target endpoint URL;
- (2) if the list of endpoints is incomplete, re-issue the command using the correct syntax; refer to the relevant documentation;
- (3) if the input is complete but the transfer jobs do not start, compare the endpoints with the list of the EOSC EU Node endpoints (see the table above); at least one of the endpoints, specified as the target or the source of the data transfer, should be within EOSC EU Node; if this is not the case, correct the command's input so that the transfer is initiated between proper GridFTP transfer endpoints and through an FTS server within EOSC EU Node;
- (4) refer to the FTS CLI tool documentation.
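The endpoint comparison in step (3) can be sketched as a simple host name check. This is an illustration only: it assumes that EOSC EU Node GridFTP hosts share the domain suffix seen in the GridFTP table above, so staging hosts under other domains would not match, and all function names are hypothetical.

```shell
#!/bin/sh
# Domain suffix shared by the EOSC EU Node GridFTP hosts in the table above
# (an assumption; adjust it if other host names are in use).
EOSC_SUFFIX=".datatransfer.open-science-cloud.ec.europa.eu"

# Extract the host part of a scheme://host[:port]/path URL.
url_host() {
    rest=${1#*://}      # drop the scheme
    rest=${rest%%/*}    # drop the path
    echo "${rest%%:*}"  # drop the port
}

# Succeed if the URL points at a host with the EOSC EU Node suffix.
is_eosc_endpoint() {
    case "$(url_host "$1")" in
        *"$EOSC_SUFFIX") return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if at least one of the two URLs is an EOSC EU Node endpoint.
check_pair() {
    if is_eosc_endpoint "$1" || is_eosc_endpoint "$2"; then
        echo "ok"
    else
        echo "neither endpoint is in EOSC EU Node"
        return 1
    fi
}

# Usage (not executed here): check_pair "$IN" "$OUT"
```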
Transfer monitoring issue:
- Problem:
- the user triggered a transfer but cannot monitor it using the CLI client command (fts-rest-transfer-list)
- Remedy:
- (0) check whether the request to the FTS servers is authorized properly; see the Authorization issue topic above;
- (1) examine the input to the FTS client (CLI): it should include either no arguments or a specification of the FTS query details (transfer source, target etc.);
- (2) if the command line syntax is incorrect, refer to the documentation and re-run the command; if the syntax is correct, go to step (3);
- (3) compare the endpoints specified in the command (as the command output filter) with the list of the EOSC EU Node endpoints; at least one of the endpoints, specified as the target or the source of the transfer, should be within EOSC EU Node; if this is not the case, specify proper GridFTP transfer endpoints and re-run the command;
- (4) refer to the service documentation.
More information
Detailed instructions on using FTS and GridFTP can be found in the documentation of these products, listed below.
Product documentation:
FTS
CLI tools documentation: https://fts3-docs.web.cern.ch/fts3-docs/docs/cli.html
GridFTP
GridFTP clients documentation: https://gridcf.org/gct-docs/6.0/gridftp/user/index.html#gridftp-user-quickstart