GATOR Documentation:
Quick Guide to GATOR
What is GATOR?
GATOR is an interface that allows users to build queries that retrieve data from astronomical catalogs archived at the NASA/IPAC Infrared Science Archive. The catalogs are stored as two-dimensional tables in either Informix or Oracle databases. Each row represents a record of the information for an individual source, such as a point-source in a catalog of point sources. Each column holds one datum or item of information about the source, such as position, magnitude, quality flags and so on.
Information in a database is accessed with the Structured Query Language (SQL), a syntax optimized for database tasks. SQL queries are interpreted by a Relational Database Management System (RDBMS), the engine that manages and organizes information in a database. While SQL is an easy language to master, complex queries are long and tedious to debug. GATOR shields users from the need to write SQL by providing a web-based CGI form that allows users to build their query interactively. GATOR builds an SQL query based on the user's requests, launches it, and writes the retrieved data to an ASCII file for download.
GATOR's power lies in its generality, in that complex search constraints can be quickly entered and run. For instance, a query such as the following one to the 2MASS All Sky Point Source Catalog is easy to build:
"return all sources at galactic latitudes greater 30 degrees with K magnitudes between 14 and 17 and uncertainties of less than 0.05, that are not flagged as contaminated by extended sources and have a blue-red optical color of less than 1.5"
How Do I Use GATOR?
GATOR employs two interfaces, one for selecting a catalog (accessible from the "Catalogs Listing" button), and another for building a query to extract data from that catalog. Submitting the job automatically brings up the process monitor, which reports the status of the job, and which gives the link to the returned data. These data are always written to an ASCII file of size less than 2 GB that is kept on-line for 48 hours after the job is finished. (A processing summary is also sent to the user by email if the user requests it.)
Selecting a Catalog
The first interface brings up a list of the catalogs at IRSA, along with information on the size of the catalogs and links to documentation. Catalogs are submitted by hitting the "selection" button on the left hand side of the screen, then hitting any of the "select" buttons to bring up the query builder interface.
Building a Query
The query builder interface constrains which rows and columns will be retrieved from the catalog. There are three classes of constraints:
- spatial constraints - which regions of the sky over which to select rows;
- column constraints - which columns to return, and upper and lower limits on the values of the entries in these columns, and
- user-defined custom constraints, which permit more complex queries than can be specified through the spatial and column constraints alone.
Rather than asking for a tabular listing of the requested rows and columns (the default), the user can click the optional "Source Count Only" button to return a summary of the number of sources matching their constraints.
Spatial Constraints:
Specify the region of the sky over which records should be returned (the user is not, however, limited to these options; see user defined constraints below). There are three options:
-
Single Object Search:
- Cone Search: finds sources within a specified circular or elliptical area
- Box Search: finds sources within a specified rectangular area
- Polygon Search : finds sources in a defined polygon.
- Multi-object Search: returns positional matches between an input set of positions (in an upload table) and the positions recorded in the catalog. Matches within a user-defined search are reported. Only one search area is permitted for a single table. The format of the table MUST follow the rules given in the Upload Table help file.
- All-sky search: applies no spatial constraints (i.e., searches the entire sky unless the user enters spatial constraints in the "Additional Constraints" area). Be careful when attempting a query of the entire catalog -- such searches can often generate output files that exceed the 2 GB file size limit. The user will be warned that this has happened, and that the output file will not contain all matched records. First-time users who wish to perform such searches are advised to contact IRSA for advice on performing full catalog searches.
Column Selection
GATOR provides a listing of the columns in the catalog. The desired columns can be selected simply by checking the "Sel" column, third from the left. Users can also select predefined subsets of the columns, called 'Standard', 'Long Form', or 'User Form'. Ranges on any column can be specified in the 'Low Limit' and 'Upper Limit' boxes. These ranges typed in the boxes must include the '<' , '<=' , '>' , or '>=' operators.
Some columns in the catalog are indexed for fast access, and are indicated as such under the Indx header. An index is a storage structure that allows rapid access to the rows of a table. The DBMS interprets the index much the same way as a reader interprets the index of a book. The index contains data values for a given column in ascending or descending order, and pointers to the rows where those values are found. The DBMS can then quickly find the value it is seeking, then uses the pointer to access the row containing that datum. Generally speaking, IRSA only indexes a few columns in each catalog. The issue of how many columns to index is actually a complex one. While indexing a column allows rapid access to its data values, indices also increase storage space and the complexity of managing the tables. Moreover, the query optimization algorithms used by the DBMS take many factors into account in assessing the best path to take, and a large number of indices can slow down a query while the optimizer decides which one to use.
The rows in a relational database have no particular order. If the same query is run twice, the returned rows are not guaranteed to be in the same order. GATOR therefore provides an 'Order By' field; it sorts the data in ascending order based on the values of the data in the column name entered.
User-defined Constraints
This is one of the most powerful features of GATOR, for it permits custom requests that avoid the need to perform post-processing filtering of large data sets that are downloaded to a desktop. Constraints are entered in SQL-format, but users need not have detailed knowledge of SQL to enter their constraints. The examples given below show how queries can be built intuitively; these examples are given for each of the five types of search constraints that are supported; more details are provided in the user defined constraints help file:
-
Comparisons: compares the values of one expression
to another. The great majority of queries will be of this
type, which use simple mathematical operators.
declination > 40 selects rows with declinations above 40 degrees; user defined constraints then allow for broader constraints than can be entered through the spatial constraints form
j_m <= 11 and declination > 40 selects all rows having J magnitudes fainter than 11 and declinations above 40 degrees. -
Ranges: tests whether an expression falls within a
certain range.
ra between 12 and 14 selects rows having Right Ascension values between 12 and 14 hours
ra not between 12 and 14 selects rows not between 12 and 14 hours -
Set memberships: tests whether an expression
matches any one of a set of values.
Hubble_Type in ('Sa', 'Sb', 'Sc') selects rows where the Hubble Type has values of Sa, Sb or Sc. This is equivalent to Hubble_Type ='Sa' or Hubble_Type ='Sb' or Hubble_Type ='Sc'
-
Pattern Matching: tests whether the value of a
column containing string data matches a specified
pattern.
cc_flag ='000' selects rows where the quality flag has values of '000'
id_opt like ' USNOA %' selects rows where the optical counterpart has a name beginning with USNOA. -
Null values: Checks whether a column has a null
value.
Fnu(60) is not null returns only those entries whose 60 micron flux has a measured (non-null) value.
The sample query given earlier on the 2MASS All Sky Point Source Catalog:
"return all sources at galactic latitudes greater 30 degrees with K magnitudes between 14 and 17 and uncertainties of less than 0.05 mag, that are not flagged as contaminated by extended sources and have a blue-red optical color of less then 1.5"
can now be built by the user-defined constraint
(glat > 30) and (k_m between 14 and 17) and (k_msigcom < 0.05) and (gaLcontam=0) and (b_m_opt - r_m_opt) <1.5)
For users conversant with SQL, GATOR uses the user-defined constraints to set up a WHERE clause in the SQL statement submitted to the DBMS. Users wishing to learn SQL have the choice of many fine textbooks, but we recommend "The LAN Times Guide to SQL" by Groff and Weinberg (1994; Osborne McGraw- Hill) for its particularly clear exposition in Chapter 6 of the where clause.
Running the Query and Accessing the Results
Here, we describe in general terms how to run jobs, monitor their status and access the results. We also show sample outputs. The accompanying Tutorial provides outputs for specific usage examples. Hitting the 'Run Query' Button launches the job: GATOR uses the interactively entered information to build the query in SQL syntax, submit it to the database, open an output file, write data to it, and, if the user requests, send an email to the user. GATOR allows users to monitor the progress of the various stages in the execution of the job via the Process Monitor, brought up in a separate window - click here for a sample. It allows the user to:
- bookmark the page, and detach from Netscape; then return and continue to monitor a long job.
- check the status of the query through a log file -
click here for a
sample.
Note: the log file has been designed to allow users to monitor the status of their jobs, while providing IRSA sufficient information to trace the cause of any errors. This information is given in the square brackets, and contains messages returned by the DBMS. The status messages are generally self-explanatory, with the exception of 'Set PDQ 10'. This specifies the DBMS resources that will be made available to the query (controlled by the server).
- access the output file (written to a 'workspace')
- stop the job
A processing summary and pointer to the output file are also optionally available via email - click here for a sample
Tips for Fast Queries
- Query on indexed columns as far as possible.
- In user-defined queries, avoid inefficient operations such as square roots and trigonometric functions.