Data Ingestion
Data Archive
When working with local data, you must first download them from the Subaru data archive for your open-use programs, specifically from the STARS2 database. The archive provides access to science images, calibration data, and PfsConfig files.
Note
Details on data retrieval will be provided by the STARS team on the day following your observation.
In principle, you need to download a TAR file from the STARS2 website. After that, follow the steps below to extract and access the data on UNIX or macOS:
# Step 1: Create a directory for downloads
$ mkdir /download_dir
# Step 2: Copy the downloaded TAR file to the directory
$ cp S2Query.tar /download_dir
# Step 3: Navigate to the directory
$ cd /download_dir
# Step 4: Extract the TAR file
$ tar -xvf S2Query.tar
# Step 5: Run the unpacking script
$ ./zadmin/unpack.py
Platform-Specific Instructions:
-
UNIX: Use wget for downloading.
-
macOS: Use curl for downloading.
-
Windows: An FTP manager is required for data transfer.
Ingestion to butler
We can ingest data into the butler repository.
There are two kinds of data that we need to ingest: raw images and the PfsConfig files.
These are ingested using two separate commands:
# Assume that we define a database to be ingested (DATASTORE) and the directory with the input data (DATADIR):
DATASTORE="$WORKDIR/pfs/data/datastore"
DATADIR="$WORKDIR/pfs/data"
# Ingest the raw images taken on Oct. 2024
$ butler ingest-raws $DATASTORE $DATADIR/raw/2024-10-*/*/PFS[A-B]*.fits --ingest-task lsst.obs.pfs.gen3.PfsRawIngestTask --transfer link --fail-fast
# Ingest the PfsConfig files for the Oct. 2024 run
$ ingestPfsConfig.py $DATASTORE PFS PFS/raw/pfsConfig $DATADIR/raw/2024-10-*/pfsConfig/pfsConfig-*.fits --transfer link
In case of PFSF files used by the observatroy, they are the same to pfsConfig files, and the similar command works:
$ ingestPfsConfig.py $DATASTORE PFS PFS/raw/pfsConfig $DATADIR/raw/2024-10-*/pfsConfig/PFSF*.fits --transfer link
The details of ingested data can be refered to the Appendix or the datamodel.
For example, the file names have the meanings:
-
PFSA12345611.fits:
Raw
scienceexposure,visit=123456, taken on thesite=summitwithspectrograph=1using the blue arm (armNum=1) -
PFSB12345623.fits:
Raw
up-the-rampexposure,visit=123456, taken at thesite=summitwithspectrograph=2using the IR arm (armNum=3) -
pfsConfig-0xad349fe21234abcd-123456.fits:
The realisation of a
PfsDesignwithpfsDesignId=ad349fe21234abcdforvisit=123456. -
PFSF12345600.fits:
The realisation of a
PfsDesignused by the observatory forvisit=123456. In the final two digits,00means the original fullPfsConfig(PFSF) file;01-99are for the customizedPFSFwith the fibers belonging to only a specific proposal ID and calibration fibers (sky, flux, and etc.). The observatory will provide the customizedPFSFfile with01-99in the final two digits. If thePFSFfile contain only one proposal ID or calibration frame the00files will be distributed. The ingestion process is similar to that for aPfsConfigfile.
The parameters in the commands include:
--transfer: The method by which data is added to the repository, includinglink,copy, andmove, which specify whether the data is symlinked, duplicated, or physically relocated, respectively.--fail-fast: The process will immediately stop the ingestion process if an error occurs. This is useful for debugging. If this is not considered a useful feature, exclude this option.
The ingestion process places the files (referred to as “datasets” in the butler) in the repository and records them in the registry database. Each file is placed in a collection, which can be thought of as a directory in the butler (and in the case of the datastore on a traditional filesystem, it is implemented as a directory).
The raw data is placed in the collection <instrument>/raw/all, while we’ve specified above that the PfsConfig files are placed in the collection PFS/raw/pfsConfig.
There are different kinds of collections:
RUNcollection always associates the datasets.CALIBRATIONcollections associate datasets with a timespan indicating the validity range.CHAINEDcollections provide a search path through multiple collections.
Each dataset is specified by a “dataId”, which is a dictionary of key-value pairs representing the dimensions.
For example,
rawimage may have adataIdlike {'instrument': 'PFS', 'visit':123, 'arm': 'r', 'spectrograph':3}.PfsConfigfile is valid for an entire exposure, so may have adataIdlike {'instrument': 'PFS', 'visit':123}.
IMPORTANT: In general, users should treat the files in the datastore as a butler implementation detail, and use the butler commands and Python API to access the data products.
There are some kinds of datastores that do not use a traditional filesystem (e.g., the S3 datastore), and so the files may not be directly accessible.
Warning
The registry database tracks all files in the datastore. Do not delete files from the datastore without using the appropriate butler commands.
You can see what raw datasets are in the datastore with the following command:
$ butler query-datasets $DATASTORE --collections PFS/raw/all
The result looks something like this:
type run id instrument arm dither pfs_design_id spectrograph detector visit
---- ------------- ------------------------------------ ---------- --- ------ ------------- ------------ -------- --------
raw PFS/raw/all 27217522-a357-5071-a32b-af97b5b8bee6 PFS b 0.0 1 1 0 0
raw PFS/raw/all 0ce0cbea-fe7c-589e-8259-30060bf20500 PFS b 0.0 1 1 0 1
[...]
raw PFS/raw/all 570092eb-f571-5631-8d20-11acbeabc640 PFS r 0.0 3 1 1 26
raw PFS/raw/all f8e3ae71-2cdf-5e55-bc42-4a4fb913770c PFS r 0.0 4 1 1 27
Datasets can be accessed from Python using the butler API:
from lsst.daf.butler import Butler
butler = Butler.from_config($DATASTORE, collections="PFS/raw/all")
raw = butler.get("raw", instrument="PFS", visit=12, arm="r", spectrograph=1)
rawImage = raw.getImage()
The raw data returned from the butler is of type PfsRaw, which provides a common interface for both CCD and NIR detectors.
You can use butler.get("raw.exposure", ...) to get the exposure from the raw data directly.