LOFAR USE CASES (in the context of the ESCAPE project). Version: under construction (very preliminary)
Please note: this page is under construction. The material on this page includes work carried out by several teams at ASTRON and by people who work with LOFAR data at several institutes. The page creators therefore make no exclusive claim to the work described here.
This page collects information on LOFAR use cases being worked on in the context of the ESCAPE project. The main aim is first to run the Prefactor pipeline on LOFAR data stored on the data lake, and then to study its performance as well as several associated issues (such as best practices and parameter tuning).
The software is provided in the form of a self-contained Singularity image, based on Ubuntu 18.04 LTS, which contains all the required LOFAR software.
PREFACTOR PIPELINE (FOR IMAGING DATA)
This pipeline processes LOFAR data using the prefactor method. Other pipelines are also used to process LOFAR data.
The input to this pipeline generally consists of at least two datasets. One is the calibrator dataset (in simple words, data taken with the LOFAR telescope pointed at a calibrator source in the sky). The other is the target dataset (LOFAR pointed at a target source in the sky). The calibrator is generally a source whose astronomical properties are well known, which makes it very helpful for estimating the instrument parameters (e.g. antennas, ionosphere). The target is generally the source we are interested in studying, and its astronomical properties are not well known. The idea is to use the instrument properties estimated from the calibrator source to calibrate the target dataset and study the target field (sky).
prefactor is a pipeline to correct for various instrumental and ionospheric effects in both LOFAR HBA and LOFAR LBA observations. It will prepare your data so that you will be able to use any direction-dependent calibration software, like factor or killMS.
It includes:
- removal of clock offsets between core and remote stations (using clock-TEC separation)
- correction of the polarization alignment between XX and YY
- robust time-independent bandpass correction
- ionospheric RM corrections with RMextract
- removal of the element beam
- advanced flagging and interpolation of bad data
- mitigation of broad-band RFI and bad stations
- direction-independent phase correction of the target, using a global sky model from TGSS ADR or the new Global Sky Model (GSM)
- detailed diagnostics
- (optional) wide-band cleaning in Initial-Subtract and Pre-Facet-Image
The main directory contains the different parsets for the genericpipeline:
Pre-Facet-Calibrator.parset : The calibrator part of the "standard" pre-facet calibration pipeline.
Pre-Facet-Target.parset : The target part of the "standard" pre-facet calibration pipeline.
Concatenate.parset : A pipeline that concatenates single-subband target data to produce concatenated bands suitable for the initial-subtract pipeline.
Initial-Subtract.parset : A pipeline that generates full-FoV images and subtracts the sky-models from the visibilities. (Needed for facet-calibration.)
Initial-Subtract-IDG.parset : Same as Initial-Subtract-Fast.parset, but uses the image domain gridder (IDG) in WSClean.
Initial-Subtract-IDG-LowMemory.parset : Same as Initial-Subtract-Fast.parset, but uses the image domain gridder (IDG) in WSClean for high-res imaging with lower memory use.
Pre-Facet-Image.parset : A pipeline that generates a full-bandwidth, full-FoV image.
make_calibrator/target_plots.losoto_parset : Losoto parsets for making diagnostic plots from the output h5parms.
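As a sketch, one of these parsets would typically be run with the genericpipeline from inside the Singularity image. The image path and filename below are placeholders (assumptions, not taken from this page); only the command line is constructed and printed here, with the real invocation shown as a comment:

```shell
# Illustrative only: the image path is an assumption, adjust for your site.
IMG=/shared/software/lofar_prefactor.sif     # Singularity image on a shared filesystem
PARSET=Pre-Facet-Calibrator.parset           # calibrator part of the pipeline

# The real invocation would be:
#   singularity exec "$IMG" genericpipeline.py -d "$PARSET"
# Here we only build and print the command line so the sketch is self-contained:
CMD="singularity exec $IMG genericpipeline.py -d $PARSET"
echo "$CMD"
```

The same pattern applies to the target, concatenation, and imaging parsets; only `$PARSET` changes.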
General questions related to processing, with (tentative) answers
Q1). What type of hardware does the use case run on (how many CPUs, minimum amount of memory do you need or use GPUs or other accelerators,...)?
a) Minimum number of CPUs - 16 (Preferred 32 or more)
b) Minimum amount of memory - 64GB (Preferred 128GB)
c) GPUs - No, it does not use GPUs
Q2). What interaction with the system is necessary (think: batch, jupyter notebook, command line shell, ...) Also if applicable please specify what type of batch system is needed (slurm, dirac, ...)?
a) command line shell: bash...
b) batch: It may not be correct to call them batch jobs; they run over a number of known datasets, but they are not submitted to the system in advance. Rather, they are controlled via the pipeline.
Q3). What type of other resource requirements do you have (also think about HPC, HTC, ...)?
a) POSIX access to the data is needed, so all the data should be accessible in the form of mounted directories.
b) HPC/HTC - Yes, as the pipeline uses a fair amount of data (~10 TB) and compute power.
c) We need a directory which is visible from all the compute resources involved, to store the Singularity image that contains the entire pipeline software (autofs?). This is the preferred scenario; otherwise we need to ship the Singularity image to all the compute resources involved.
Q4). How long do the jobs take to run, and/or are there any other time constraints on the jobs?
- Jobs will be in 2 categories:
a) Ordinary tests to be performed: these will take about 15-20 minutes.
b) The pipeline on the full data: this is expected to take less than two days, depending on the throughput performance and the compute resources available.
c) Note: There are no time constraints.
Q5). Do you need any dependencies on the system (think things like containers, OS versions, software, shared/local file system)?
a) Singularity version > 3.2 (the image has been built with 3.5.2, but any version higher than 3.2 should work)
b) The Singularity image (700 MB) contains all the needed software (i.e. Go language?).
c) OS version: Any Linux system (preferably EL7)
d) All the data sets should be visible/accessible via POSIX, as if they were in a directory on the compute resources.
e) The output of the pipeline consists of new files (logs and other outputs) as well as modifications to the original data (new columns are added and updated). So, at least for the logs, it is preferable to have a shared file system.
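The version requirement in (a) can be checked with a small shell snippet; `sort -V` does the version comparison. The installed version string is hard-coded below so the sketch is self-contained (in real use it would come from `singularity --version`):

```shell
# Check that a Singularity version meets the > 3.2 requirement.
# version_ge A B  succeeds if version A >= version B (relies on sort -V).
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

INSTALLED="3.5.2"   # hard-coded for illustration; normally parsed from `singularity --version`
REQUIRED="3.2"

if version_ge "$INSTALLED" "$REQUIRED"; then
    echo "Singularity $INSTALLED is recent enough (needs > $REQUIRED)"
else
    echo "Singularity $INSTALLED is too old (needs > $REQUIRED)"
fi
```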
Q6). Do you have any requirements on collaborative workspaces, data sharing?
a) As part of the ESCAPE team, we are doing this work together with people from ASTRON, The Netherlands (Pandey & Yan), and from the DESY institute, Germany (Aleem & Paul). Among other aspects, we will also work together to monitor the performance parameters compared with a local run of the pipeline on a dedicated cluster. We would also like to utilize and evaluate, in a minor way, the role of XCache and dCache in particular.
b) DATA - Full access for the collaborating partners is needed. Since we are using only LOFAR test data, there are no proprietary issues involved. In addition, data deletion/corruption is also not a major issue.
Q7). What are the typical numbers of files and their typical sizes?
a) Typically there are 244-488 Measurement Sets (directories). Each directory is effectively one data set.
b) Each Measurement Set is typically 10-40 GB.
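From these numbers, the total data volume can be bounded with simple shell arithmetic (the figures come straight from (a) and (b) above):

```shell
# Rough bounds on total data volume:
# 244-488 Measurement Sets, each 10-40 GB.
MIN_TOTAL=$((244 * 10))    # GB, smallest case
MAX_TOTAL=$((488 * 40))    # GB, largest case
echo "total data volume: ${MIN_TOTAL} GB to ${MAX_TOTAL} GB"
```

This gives roughly 2.4 TB to 19.5 TB, consistent with the ~10 TB figure quoted under Q3.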
Q8. How many jobs do you expect to run in parallel?
a) It could be as many as the number of datasets (so up to 488 in principle). In practice it is a configurable parameter and can be set as low as desired.
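In the genericpipeline framework this cap is typically set in the pipeline configuration file. A sketch of the relevant fragment is shown below; the section and key names follow common genericpipeline conventions and are not documented on this page, so check your local pipeline.cfg:

```
# pipeline.cfg fragment (illustrative; verify against your genericpipeline setup)
[remote]
method = local
max_per_node = 32   # upper limit on jobs run in parallel; lower it as desired
```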