CASA Parallel Processing: Difference between revisions

From CASA Guides
Revision as of 17:27, 23 August 2011

Overview

This guide is a general introduction to testing the early parallel processing capabilities in CASA. As of this writing (23 August 2011), the tasks flagdata, applycal, and partition are parallelized. In addition, two versions of parallel imaging are available at the task level: pcont for continuum imaging and pcube for spectral line imaging.

Feedback at this stage about what works, as well as what doesn't or could use improvement, will be very helpful. Please send comments, bug reports, and questions to ...

More information may also be found in Jeff Kern's presentation, posted [here].

Setting up for parallel processing

Before you can run tasks with parallelization, you must first set up the machine on which CASA will be running to use SSH keys for password-free login. See [the SSH section of the Gold Book] for instructions.

Parallel processing in CASA is set up to take advantage of both multiple-core machines (as most standard workstations are) and shared memory access (as is available in a cluster). However, the NRAO cluster in Socorro also has a very fast connection to the Lustre filesystem, which boosts I/O performance by around two orders of magnitude over a standard desktop SATA disk. On a standard desktop, therefore, I/O-limited operations are unlikely to see much improvement with parallel processing.

Parallelized tasks

The set of tasks described here is current as of 23 August 2011.

partition

In order to perform parallel processing, the original Measurement Set must be subdivided into smaller chunks that can then be farmed out to multiple cores for processing. This new, subdivided MS (called a "multi-MS", or MMS) is created by a task called partition. In many ways, partition resembles split, in that it can also time-average or select only a subset of data. However, frequency averaging is not currently available.

Here is the current set of input parameters for partition:

#  partition :: Experimental extension of split to produce multi-MSs
vis                 =         ''        #  Name of input measurement set
outputvis           =         ''        #  Name of output measurement set
createmms           =       True        #  Should this create a multi-MS output
     separationaxis =     'scan'        #  Axis to do parallelization across
     numsubms       =         64        #  The number of SubMSs to create
calmsselection      =     'none'        #  Cal Data Selection ('none', 'auto', 'manual')
datacolumn          =     'data'        #  Which data column(s) to split out
field               =         ''        #  Select field using ID(s) or name(s)
spw                 =         ''        #  Select spectral window/channels
antenna             =         ''        #  Select data based on antenna/baseline
timebin             =       '0s'        #  Bin width for time averaging
timerange           =         ''        #  Select data by time range
scan                =         ''        #  Select data by scan numbers
scanintent          =         ''        #  Select data by scan intent
array               =         ''        #  Select (sub)array(s) by array ID number
uvrange             =         ''        #  Select data by baseline length
async               =      False        #  If true the taskname must be started using partition(...)

The parameters specific to parallel processing are 'createmms', 'separationaxis', 'numsubms', and 'calmsselection'.

It is currently recommended that 'separationaxis' be set to 'default', which should create sub-MSs optimized for parallel processing by dividing the data along both scan and spectral window boundaries. The other options, 'spw' and 'scan', force separation along only one of these axes.
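
As a rough sketch of how these parameters fit together, the snippet below assembles the inputs for partition in a Python dictionary. The helper name build_partition_args and the file names are hypothetical and used only for illustration; the parameter names and the three 'separationaxis' values are those given in this guide, and other CASA versions may accept different values. The actual task call works only inside a CASA session, so it is shown commented out.

```python
# Hypothetical helper (not part of CASA): collects partition inputs in a dict.
# The axis names are the ones this guide mentions; other CASA versions may differ.
KNOWN_AXES = ('default', 'spw', 'scan')


def build_partition_args(vis, outputvis, separationaxis='default', numsubms=64):
    """Return keyword arguments for partition, checking the separation axis."""
    if separationaxis not in KNOWN_AXES:
        raise ValueError('separationaxis must be one of %s' % (KNOWN_AXES,))
    return {
        'vis': vis,
        'outputvis': outputvis,
        'createmms': True,              # request a multi-MS as output
        'separationaxis': separationaxis,
        'numsubms': numsubms,
    }


# The file names below are placeholders.
args = build_partition_args('mydata.ms', 'mydata.mms', numsubms=16)

# Inside a CASA session, where the partition task is defined:
# partition(**args)
```

Checking the axis name before launching the task is only a convenience; it catches a typo before any data are read.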

The optimal number of sub-MSs to create depends on the processing environment, namely the number of available cores. A reasonable rule of thumb is to create twice as many sub-MSs as there are available cores.
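
The rule of thumb above can be sketched in a few lines of Python. The function name suggest_numsubms is hypothetical. Note also that multiprocessing.cpu_count() reports the logical CPUs of the local machine, which is an assumption about what "available cores" means and may not match the cores actually available on a shared machine or cluster.

```python
import multiprocessing


def suggest_numsubms(ncores=None):
    """Rule of thumb from the text: two sub-MSs per available core."""
    if ncores is None:
        # Assumption: treat every logical CPU on this machine as available.
        ncores = multiprocessing.cpu_count()
    return 2 * ncores


# Value that could be passed to partition as numsubms.
numsubms = suggest_numsubms()
```

Passing the core count explicitly, for example suggest_numsubms(8), gives 16 and avoids relying on what the local machine reports.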



flagdata

applycal

Parallel cleaning (toolkit)

pcont

pcube