2MASS Catalog Generation: Handling of Duplicate Sources in Scan Overlap Regions
I. Options
Select One Apparition
Adopted Option (12/29/98)
- Advantages
- Avoids inhomogeneity resulting from sqrt(2) SNR improvement (see note below)
- Two-scan case relatively simple to implement (with caveats)
- Easier traceability between catalog source and DB entry (e.g. time of observation)
- Better choice if dominated by systematic photometric and/or astrometric errors
- Will allow favoring side of array that has flatter cross-scan photometric response
- Disadvantages
- Discontinuity at boundaries
- Moving source may be double-counted or missed
- Geometry in multi-scan overlap regions (e.g. corners or regions near poles) more complex to implement.
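- Note: the sqrt(2) figure above is the standard error-propagation result for averaging two independent measurements, assuming purely random, uncorrelated errors. For measurements m_1, m_2 with uncertainties sigma_1, sigma_2, the inverse-variance weighted mean and its uncertainty are

      \bar{m} = \frac{m_1/\sigma_1^2 + m_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2},
      \qquad
      \sigma_{\bar{m}} = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)^{-1/2}
      = \frac{\sigma}{\sqrt{2}} \quad (\sigma_1 = \sigma_2 = \sigma),

  so the SNR improves by sqrt(2) only inside the overlap strips; this localized improvement is the inhomogeneity that selecting one apparition avoids.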
- Algorithm
- Safety border around scans: use only sources > 10 arcsec from scan edges to avoid band coverage problems (Sampler used a 5 arcsec border and still had a few missing bands)
- Match sources from adjoining scans in overlap region
- Consider only objects not identified with artifacts
- Match radius < 3.5 arcseconds (minimum resolution); final value will be set after analysis, but will likely be < 2 arcseconds
- Require source brightness to agree to within TBD mags for a valid match
- The point has been made that with purely geometric selection criteria, a brightness match is not essential; the only requirement is that the measurement of the object from one scan be reported
- For all sources detected in more than one scan:
- Select position and brightness from the scan in which the source lies closest to the scan centerline (U-scan coords) and, for declination-overlap redundancy, closest to the N/S declination midpoint
- Advantage #5 (above) cannot be realized, since full overlaps are needed for extended sources
- For sources near the corners of scans, examine the horizontal (EW) and vertical (NS) distances from the two nearest edges, dh and dv. Take the minimum of dh and dv for each scan, and select the apparition from the scan that has the larger min(dh,dv). Note that this forms somewhat complex boundaries between scans in the very corners, but it is simple to implement and document (a concrete sketch follows this list).
- See Issues below
- For sources detected in only one scan:
- Define a positional boundary between scans (e.g. U-scan or RA midpoint between adjacent scans; more complex at corners and near poles)
- A source makes the catalog if it is detected in the scan on the appropriate side of the boundary
- For "open" edges of scans (not bounded by another scan):
- Define a boundary that is set in from the edge by the same amount that the opposite-edge boundary is set in; that is, make the release boundaries symmetric for this scan. It is essential that this open-edge boundary be bookkept for subsequent releases to make certain that no area is inadvertently skipped.
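To make these rules concrete, the following is a minimal Python sketch of the matching and selection steps; it is illustrative only, not 2MASS pipeline code. The record fields (ra, dec in degrees; dh_arcsec, dv_arcsec for the distances to the nearest EW and NS scan edges), the flat-sky small-angle separation, and the 3.5 arcsec placeholder radius are assumptions made for this sketch.

    # Illustrative sketch of the duplicate-source rules above; NOT 2MASS
    # pipeline code. All field names and the flat-sky separation formula
    # are assumptions made for the example.
    import math

    MATCH_RADIUS_ARCSEC = 3.5    # upper bound from the text; final value TBD
    SAFETY_BORDER_ARCSEC = 10.0  # use only sources > 10" from scan edges

    def separation_arcsec(s1, s2):
        """Small-angle, flat-sky separation between two sources, in arcsec."""
        dra = (s1["ra"] - s2["ra"]) * math.cos(math.radians(0.5 * (s1["dec"] + s2["dec"])))
        ddec = s1["dec"] - s2["dec"]
        return 3600.0 * math.hypot(dra, ddec)

    def inside_safety_border(src):
        """Safety border: reject sources too close to any scan edge."""
        return min(src["dh_arcsec"], src["dv_arcsec"]) > SAFETY_BORDER_ARCSEC

    def is_match(s1, s2):
        """Position-only match, per the 1/5/99 consensus (no brightness test)."""
        return separation_arcsec(s1, s2) <= MATCH_RADIUS_ARCSEC

    def select_apparition(apparitions):
        """Among matched detections of one object, keep the one whose
        min(dh, dv) is largest, i.e. the detection farthest from its scan's
        nearest edge. In a simple two-scan side overlap this reduces to
        taking the source closest to the scan centerline; near corners it
        implements the min(dh, dv) rule above."""
        return max(apparitions, key=lambda s: min(s["dh_arcsec"], s["dv_arcsec"]))

    # Example: best = select_apparition([detection_scan_a, detection_scan_b])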
- Example:
- Algorithm points 1-4 were implemented and run on a set of scans from the Spring 1999 Working Release Point Source DB. This PostScript FIGURE illustrates the duplicate source rectification at the relatively complex boundary between several tiles. The boundaries of each tile are coded in a separate color; the "reference" tile is shown in black. The matched sources between the reference and surrounding tiles are color coded to show the tile from which the apparition will be selected for inclusion in the catalog (the crosses on the figure can be ignored). Zoom in on the interesting corners to see detail of the selection.
- Issues:
- Should relative scan quality be allowed to override the geometric decision of which scan to draw information from? (e.g. one scan has better sensitivity than the other)
Decision on 1/5/99 - No. Use only the geometric decision.
- Should relative source measurement quality be allowed to override the geometric decision of which scan to draw information from? (e.g. a galaxy measurement in one scan is contaminated by a bright star, but not in the other scan)
Decision on 1/5/99 - No. Use only the geometric decision.
- What to do with objects that match positionally, but not in brightness? (e.g. variable stars, marginally resolved multiple stars)
- What to do with relatively bright objects that do not match with an object in adjacent scan(s)? (e.g. variable stars, asteroids)
General consensus on 1/5/99 was to start with a simple, position-only matching scheme and a purely geometric decision for which scan to report, and to keep track of the frequency of occurrence of issues 3 and 4.
Merge Apparitions
This option will not be adopted. Possibly reconsider with full reprocessing at the end of the Survey.
- Advantages
- Improved sensitivity in overlap regions if not dominated by systematic errors
- Can replace discontinuities with smooth transitions
- Easier to implement in multi-scan overlap region
- Disadvantages
- Requires a weighted averaging algorithm to avoid discontinuities
- Inhomogeneity in sensitivity on scan spatial scales (may not be significant in light of seeing, transmission, background variations, etc)
- Not appropriate if dominated by systematic photometric and/or astrometric errors
- Potential problems matching variable, moving, or intermittently resolved sources
- Merging is especially difficult for extended sources, because they can have different-sized "pieces" in adjacent scans
- Requires considerable testing of merge/average algorithms before release
- Sources at edges of blocks of released data may change name/position/brightness when adjacent block is released
- Algorithm
- Match sources from adjoining scans in overlap region (require positional and brightness match)
- Evaluate weighted-average brightness and position values for matched groups (a sketch follows this list)
- Quote combined position and brightness in Catalogs
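Had this option been adopted, the averaging step might look like the minimal Python sketch below; it is illustrative only, not 2MASS pipeline code. Inverse-variance weighting, the record fields (ra, dec, mag and their sigmas), and the direct averaging of magnitudes (rather than fluxes) are all assumptions made for this sketch.

    # Illustrative sketch of the merge/average step for the non-adopted
    # option; NOT 2MASS pipeline code. Field names and inverse-variance
    # weighting are assumptions made for the example.
    import math

    def weighted_mean(values, sigmas):
        """Inverse-variance weighted mean and its formal uncertainty."""
        weights = [1.0 / s ** 2 for s in sigmas]
        total = sum(weights)
        mean = sum(w * v for w, v in zip(weights, values)) / total
        return mean, math.sqrt(1.0 / total)

    def merge_apparitions(apps):
        """Combine matched detections of one source into a single catalog
        entry. Positions and magnitudes are averaged independently; for two
        detections with equal errors the combined uncertainty is
        sigma/sqrt(2), per the note under "Select One Apparition" above."""
        merged = {}
        for val, sig in (("ra", "sig_ra"), ("dec", "sig_dec"), ("mag", "sig_mag")):
            merged[val], merged[sig] = weighted_mean(
                [a[val] for a in apps], [a[sig] for a in apps])
        return merged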
II. Comments
- Tom Chester (12/21/98):
Although I strongly support merging point sources, tj, sschneider and I agree that we cannot merge galaxies. After all, the main reason for the scan overlap was to ensure that we got at least one good rendition of an extended source near a scan boundary. Attempting to merge them could screw the good rendition up as often as it improves things. Besides, unlike point sources, we don't know how to combine elliptical parameters, etc.! Hence for galaxies we will always pick the one farthest from a boundary. I reluctantly conclude that because we cannot merge galaxies, and because of the complexity in the ties between the point and extended sources, for this first release we ought to do to point sources what we have done to galaxies. This means that we cannot "favor the side of the array with flatter response"; we have to use the same algorithm of picking the source farthest from an edge. BTW, note that both advantages 4 and 5 of "select one apparition" go away if you choose the source farthest from an edge, which you have to do for galaxies. I'm not at all convinced you can always obtain those advantages anyway.
- Dave Monet (12/22/98):
The real question is whether you think that the systematic errors are small or large compared to the random errors in these zones. If the random errors dominate, then it is reasonable to take the average. If the systematic errors dominate (or might dominate), then you should choose one instance and omit the rest. My suggestion for the culling algorithm is to take the one closest to the centerline of the scan. Presumably systematic errors grow rapidly near the edges of the scan, so one should not take an object very close to the edge in preference to one just inside the overlap zone. Arguments about variability, motion, or some other physical effect will always arise, and the sophisticated user will want to examine all observations of the same object. Another aspect is the trade-off between producing the "best" catalog you can (i.e., random errors dominate so one takes the mean) and the difficulty most users will have in correctly incorporating the factors of SQRT(2) or more difference between the uncertainty estimators of some catalog entries and the others. My gut feeling is that this first release should choose one observation and omit the rest. This avoids the proof that systematic errors are negligible (tough even under non-stressed timetables), and it makes the uncertainty estimators easier to describe (all entries have similar properties). Save the mean (or mean with sigma chopping, with the associated flag for potential variability or motion) for a later release when there is time to use this algorithm, test it, and see how often weird things happen.
- John Carpenter (12/28/98):
(1) I strongly agree that duplicate sources must be removed in the overlap region, i.e. each source should be reported only once in the catalog. (2) I just want to clarify one aspect of the notion of "going deeper" in the overlap region. This could mean one of two things. One, report more accurate photometry for sources in the overlap region that would normally appear in the database regardless of the fact that they are in the overlap region. Or two, report better photometry AND include additional sources that would meet the selection criteria for appearing in the catalog only after averaging the photometry. This distinction could become rather important once data near the 5 sigma detection threshold or lower start being released. Perhaps I have missed some discussion on this, but it is not clear to me what people are proposing in this regard. (3) In general, I do not favor averaging the photometry in the overlap region until I better understand the implications of the algorithm in terms of (1) the detection statistics in the overlap region, and (2) whether the photometry is actually improved by approximately sqrt(2.0) in the overlap region. (And how do we add together a quality=10 scan with a quality=6 scan, say?) Even then, I am concerned about deliberately adding additional inhomogeneity into the data on an 8' scale. I cannot pinpoint a specific quantitative argument, but it just runs counter to the notion of releasing as homogeneous a dataset as possible. (4) Finally, regardless of whether we average the photometry in the overlap region, will there be a flag in the database indicating whether the source is in the overlap region, and if so, whether it was detected in the other scan? This will be rather valuable for studies that want to empirically estimate the completeness limit in the overlap regions.
- Jay Elias (12/29/98):
A comment on the agenda that we won't re-issue sources from previous incremental releases made me realize that there is an issue concerning the borders of scans in this regard. That is, if we have scans in the current release which are overlapped by scans due for later release, what do we do? If we include sources all the way out to the edge of such scans, then we presumably don't use any of the data from the not-yet-released scans. This leads to far greater inhomogeneities than anything Martin was worrying about. The way to avoid this is presumably to not release any data that overlap with unreleased scans (with some margin for positional errors) if we merge sources, or else to include sources only out to the approximate splitting point (minus a margin) if we don't merge. In either case, though, this requires estimating what to leave out and also requires releasing what's in the safety margins in future releases. Thus, this solution, while ideal, involves extra work, and I don't know whether IPAC has taken this into account. ---- A second comment/question has to do with the statement that merging data will make the survey 40% deeper in the overlap zones. This is not quite true at the limit, since the sources have to be detected reliably in each scan; we are not co-adding the scans but rather merging the sources. Certainly the signal-to-noise of sources will improve, but there will not be added faint sources. I still favor merging sources, but people need to understand that it is not the same as co-added data.
R. Cutri - IPAC
Last Update - 9 March 1999 - 17:00 PST