This lesson is being piloted (Beta version)

Data Management (2025 updated for metacat/justIN/rucio)

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • What are the data management tools and software for DUNE?

Objectives
  • Learn how to access data from DUNE Data Catalog.

  • Learn a bit about the justIN workflow system for submitting batch jobs.

Session Video

The session video on December 10, 2024 was captured for your asynchronous review.

Introduction

What we need to do to produce accurate physics results

DUNE has a lot of data which is processed through a complicated chain of steps. We try to abide by FAIR (Findable, Accesible, Intepretable and Reproducible) principles in our use of data.

Our DUNE Physics Analysis Review Procedures state that:

  1. Software must be documented, and committed to a repository accessible to the collaboration.

    The preferred location is any repository managed within the official DUNE GitHub page: https://github.com/DUNE.

    There should be sufficient instructions on how to reproduce the results included with the software. In particular, a good goal is that the working group conveners are able to remake plots, in case cosmetic changes need to be made. Software repositories should adhere to licensing and copyright guidelines detailed in DocDB-27141.

  2. Data and simulation samples must come from well-documented, reproducible production campaigns. For most analyses, input samples should be official, catalogued DUNE productions.

How we do it ?

DUNE offical data samples are produced using released code, cataloged with metadata that describes the processing chain and stored so that they are accessible to collaborators.

DUNE data is stored around the world and you cannot directly log into most of the systems. For this purpose we use the metacat data catalog to describe the data and collections and the rucio file storage system to determine where replicas of files are and provide a url that allows you to process them. There is also a legacy SAM data access system that can be used for older files.

There are currently > 27M files cataloged in metacat with a total size of > 48 PB.

Key idea:

All rucio entries should have a metacat entry describing them. With a few exceptions (retired, intermediate files), metacat entries correspond to real files in rucio.

How do I use this.

If you want to access data, this module will help you find and examine it.

If you want to process data using the full power of DUNE computing, you should talk to the data management group about methods for cataloging any data files you plan to produce. This will allow you to use DUNE’s collaborative storage capabilities to preserve and share your work with others and will be required for publication of results.

How to find and access official data

What is metacat?

Metacat is a file and dataset catalog - it allows you to search for files and datasets that have particular attributes and understand their provenance, including details on all of their processing steps. It also allows for querying jointly the file catalog and the DUNE conditions database.

You can find extensive documentation on metacat at:

General metacat documentation

DUNE metacat examples

Find a file in metacat

DUNE runs multiple experiments (far detectors, protodune-sp, protodune-dp hd-protodune, vd-protodune, iceberg, coldboxes… ) and produces various kinds of data (mc/detector) and process them through different phases.

To find your data you need to specify at the minimum

and when searching for specific types of data

You probably also want to know about

Here is an example of a metacat query that gets you raw files from a recent ‘hd-protodune’ cosmics run.

Note: there are example setups that do a full setup in the extras folder:

First get metacat if you have not already done so

SL7

Make certain you have dune software set up SL7 setup

AL9

Make certain you have AL9 set up AL9 setup

then do queries to find particular sets of files

metacat query "files from dune:all where core.file_type=detector and core.run_type=hd-protodune and core.data_tier=raw and core.runs[any]=27331 limit 1"

this should give you a file:

hd-protodune:np04hd_raw_run027331_0254_dataflow0_datawriter_0_20240620T173408.hdf5

the string before the ‘:’ is the namespace and the string after is the filename.

You can find out more about your file by doing:

metacat file show -m -l hd-protodune:np04hd_raw_run027331_0254_dataflow0_datawriter_0_20240620T173408.hdf5

which gives you a lot of information:

checksums:
    adler32   : 5222f2ae
created_timestamp   :	2024-06-20 17:38:45.141418+00:00
creator             :	dunepro
fid                 :	83316551
name                :	np04hd_raw_run027331_0254_dataflow0_datawriter_0_20240620T173408.hdf5
namespace           :	hd-protodune
size                :	4238541524
updated_timestamp   :	1718905125.141418
metadata:
    core.data_stream    : physics
    core.data_tier      : raw
    core.end_time       : 1718904863.0
    core.event_count    : 35
    core.events         : [35564, 35568, 35572, 35576, 35580, 35584, 35588, 35592, 35596, 35600, 35604, 35608, 35612, 35616, 35620, 35624, 35628, 35632, 35636, 35640, 35644, 35648, 35652, 35656, 35660, 35664, 35668, 35672, 35676, 35680, 35684, 35688, 35692, 35696, 35700]
    core.file_content_status: good
    core.file_format    : hdf5
    core.file_type      : detector
    core.first_event_number: 35564
    core.last_event_number: 35700
    core.run_type       : hd-protodune
    core.runs           : [27331]
    core.runs_subruns   : [2733100001]
    core.start_time     : 1718904848.0
    dune.daq_test       : False
    retention.class     : physics
    retention.status    : active
children:
   hd-protodune-det-reco:np04hd_run27331_calibana_merged_1.root (cl7b7XQpTpS3PGm5)
   hd-protodune-det-reco:np04hd_raw_run027331_0254_dataflow0_datawriter_0_20240620T173408_reco_stage1_reco_stage2_20240926T225231_keepup_hists.root (Kf6CttcJRJu2uk58)
   hd-protodune-det-reco:np04hd_raw_run027331_0254_dataflow0_datawriter_0_20240620T173408_reco_stage1_reco_stage2_20240926T225231_keepup.root (sxh5FHKLTGym6pDp)
   hd-protodune-det-reco:np04hd_raw_run027331_0254_dataflow0_datawriter_0_20240620T173408_reco_stage1_reco_stage2_20240623T013207_keepup.root (Xl9QXEo0RouAqO0s)
   hd-protodune-det-reco:np04hd_raw_run027331_0254_dataflow0_datawriter_0_20240620T173408_reco_stage1_20240623T013207_keepup_hists.root (xWIH9tmnSWqWBJE5)

What do those fields mean?

look in the glossary to see what those fields mean.

Warning - there are multiple child files that look similar

Files can be reprocessed with different versions and can be reprocessesd twice if the batch system gets confused. Experts can tell them apart with specific queries about reconstruction versions and file status (in this case there were 2 reconstruction versions core.application.version = v09_90_02d00 and v09_91_02d01).

If you are doing real analysis please use the official datasets which experts have defined

if no official dataset exists, you need to require additional fields like: core.application.version=v09_91_02d01 and dune.config_file=standard_reco_stage2_calibration_protodunehd_keepup.fcl to make certain you are not looking at 2 versions of the same file.

find out how much raw data there is in a run using the summary option

metacat query -s "files from dune:all where core.file_type=detector \
 and core.run_type=hd-protodune and core.data_tier=raw \
 and core.data_stream=physics and core.runs[any]=27331"
Files:        4144
Total size:   17553648200600 (17.554 TB)

Official datasets

The production group make official datasets which are sets of files which share important characteristics such as experiment, data_tier, data_stream, processing version and processing configuration.

See DUNE Physics Datasets for a detailed description.

Fast web catalog queries

You can do fast string queries based on keywords embedded in the dataset name.

Go to dunecatalog and log in with your services password.

Choose your apparatus (Far Detector for example), use the category key to further refine your search and then type in keywords. Here I chose the Far Detectors tab and the FD-VD category from the pulldown menu.

Fast keyword search

If you click on a dataset you can see a sample of the files inside it.

You can find a more detailed tutorial for the dunecatalog site at: Dune Catalog Tutorial

Command line tools and advanced queries

You can also explore and find the right dataset on the command line by using metacat dataset keys:

First you need to know your namespace and then explore within it.

metacat namespace list # find likely namespaces

There are official looking ones like hd-protodune-det-reco and ones for users doing production testing like schellma. The default for general use is usertests

Let’s look for some reconstructed Monte Carlo from the VD far detector.

metacat query "datasets matching fardet-vd:*official having core.data_tier=full-reconstructed"

Lots of output … looks like there are 2 types of official ones - let’s get “v2”

metacat query "datasets matching fardet-vd:*v2_official having core.data_tier=full-reconstructed"

and there are then several different generators. Let’s explore reconstructed simulation of the vertical drift far detector.

metacat query "datasets matching fardet-vd:*v2_official having core.data_tier=full-reconstructed and dune_mc.gen_fcl_filename=prodgenie_nu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg.fcl"

Ok, found the official neutrino beam dataset:

fardet-vd:fardet-vd__full-reconstructed__v09_81_00d02__reco2_dunevd10kt_nu_1x8x6_3view_30deg_geov3__prodgenie_nu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg__out1__v2_official
metacat query "datasets matching fardet-vd:*v2_official having core.data_tier=full-reconstructed and dune_mc.gen_fcl_filename=prodgenie_anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg.fcl"

And the anti-neutrino dataset:

fardet-vd:fardet-vd__full-reconstructed__v09_81_00d02__reco2_dunevd10kt_anu_1x8x6_3view_30deg_geov3__prodgenie_anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg__out1__v2_official

you can use the web data catalog to do advanced searches

You can also do keyword/value queries like the ones above using the Other tab on the web-based Data Catalog.

Full query search

What describes a dataset?

Let’s look at the metadata describing that anti-neutrino dataset: the -j means json output

metacat dataset show -j fardet-vd:fardet-vd__full-reconstructed__v09_81_00d02__reco2_dunevd10kt_anu_1x8x6_3view_30deg_geov3__prodgenie_anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg__out1__v2_official
{
    "created_timestamp": 1722620726.662696,
    "creator": "dunepro",
    "description": "files where namespace='fardet-vd' and core.application.version=v09_81_00d02 and core.application.name=reco2 and core.data_stream=out1 and core.data_tier='full-reconstructed' and core.file_type=mc and core.run_type='fardet-vd' and dune.campaign=fd_mc_2023a_reco2 and dune.config_file=reco2_dunevd10kt_anu_1x8x6_3view_30deg_geov3.fcl and dune.requestid=ritm1780305 and dune_mc.detector_type='fardet-vd' and dune_mc.gen_fcl_filename=prodgenie_anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg.fcl and dune.output_status=confirmed and core.group=dune",
    "file_count": 20648,
    "file_meta_requirements": {},
    "frozen": true,
    "metadata": {
        "core.application.name": "reco2",
        "core.application.version": "v09_81_00d02",
        "core.data_stream": "out1",
        "core.data_tier": "full-reconstructed",
        "core.file_type": "mc",
        "core.group": "dune",
        "core.run_type": "fardet-vd",
        "datasetpar.deftag": "v2_official",
        "datasetpar.max_time": null,
        "datasetpar.min_time": null,
        "datasetpar.namespace": "fardet-vd",
        "dune.campaign": "fd_mc_2023a_reco2",
        "dune.config_file": "reco2_dunevd10kt_anu_1x8x6_3view_30deg_geov3.fcl",
        "dune.output_status": "confirmed",
        "dune.requestid": "ritm1780305",
        "dune_mc.detector_type": "fardet-vd",
        "dune_mc.gen_fcl_filename": "prodgenie_anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg.fcl"
    },
    "monotonic": false,
    "name": "fardet-vd__full-reconstructed__v09_81_00d02__reco2_dunevd10kt_anu_1x8x6_3view_30deg_geov3__prodgenie_anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg__out1__v2_official",
    "namespace": "fardet-vd",
    "updated_by": null,
    "updated_timestamp": null
}

You can use any of those keys to refine dataset searches as we did above. You probably have to ask your physics group which are interesting.

What files are in that dataset and how do I use them?

You can either click on a dataset in the web data catalog or:

metacat query  "files from  fardet-vd:fardet-vd__full-reconstructed__v09_81_00d02__reco2_dunevd10kt_anu_1x8x6_3view_30deg_geov3__prodgenie_anu_numu2nue_nue2nutau_dunevd10kt_1x8x6_3view_30deg__out1__v2_official limit 10"

will list the first 10 files in that dataset (you probably don’t want to list all 20648)

You can also use a similar query in your batch job to get the files you want.

Finding those files on disk

To find your files, you need to use Rucio directly or give the justIN batch system your query and it will locate them for you.

Getting file locations using Rucio

What is Rucio?

Rucio is the next-generation Data Replica service and is part of DUNE’s new Distributed Data Management (DDM) system that is currently in deployment. Rucio has two functions:

  1. A rule-based system to get files to Rucio Storage Elements around the world and keep them there.
  2. To return the “nearest” replica of any data file for use either in interactive or batch file use. It is expected that most DUNE users will not be regularly using direct Rucio commands, but other wrapper scripts that calls them indirectly.

As of the date of the 2025 tutorial:

You will need to authenticate to use read files

For SL7 use justin to get a token

To get a token that allows you to access files (and rucio) in SL7

setup justin
justin time # this just tells justin you want to authenticate

The first time it will ask you to open a web browser, authenticate and enter the long string it delivers to you.

To authorize this computer to run the justin command, visit this page with your
usual web browser and follow the instructions within the next 10 minutes:
https://dunejustin.fnal.gov/authorize/_W_azUJcLhYmAOqClYz9RAsnKbDgzQ6lNA

Check that the Session ID displayed on that page is -cprbbe

Once you've followed the instructions on that web page, you can run the justin
command without needing to authorize this computer again for 7 days.

That gave you authorization to use justin. Now do the command again to get an actual token.

justin get-token

You will have to do this sequence weekly as your justin access expires.

for AL9 use htgettoken to get a token

RUCIO authentication does not work on AL9 for now

Please use SL7 if you need to use rucio

normal tokens (below) for xroot access do work

Make certain you have al9 set up

Then use htgettoken to get a token so you can read the files you find.

htgettoken -i dune --vaultserver htvaultprod.fnal.gov #:8200
export BEARER_TOKEN_FILE=/run/user/`id -u`/bt_u`id -u`

The first time you do it it will ask you to authenticate using a web browser.

You should be able to read files at remote sites now. You may need to repeat the htgettoken as the interactive tokens are pretty short-lived. Batch jobs do their own tokens.

finding a file

Then use rucio to find out about a file’s locations. the –protocols flag makes certain you get the streaming root: location.

rucio replica list file fardet-vd:prodmarley_nue_es_flat_radiological_decay0_dunevd10kt_1x8x14_3view_30deg_20250217T033222Z_gen_004122_supernova_g4stage1_g4stage2_detsim_reco.root --pfns --protocols=root

returns 2 locations:

root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro//fardet-vd/hit-reconstructed/2025/mc/out1/le_mc_2024a/00/00/51/85/prodmarley_nue_es_flat_radiological_decay0_dunevd10kt_1x8x14_3view_30deg_20250217T033222Z_gen_004122_supernova_g4stage1_g4stage2_detsim_reco.root
root://meitner.tier2.hep.manchester.ac.uk:1094//cephfs/experiments/dune/RSE/fardet-vd/fd/a6/prodmarley_nue_es_flat_radiological_decay0_dunevd10kt_1x8x14_3view_30deg_20250217T033222Z_gen_004122_supernova_g4stage1_g4stage2_detsim_reco.root

which the locations of the file on disk and tape. We can use this to copy the file to our local disk or access the file via xroot.

Testing - access a file

Try to access the file at manchester using the command:

root -l root://meitner.tier2.hep.manchester.ac.uk:1094//cephfs/experiments/dune/RSE/fardet-vd/fd/a6/prodmarley_nue_es_flat_radiological_decay0_dunevd10kt_1x8x14_3view_30deg_20250217T033222Z_gen_004122_supernova_g4stage1_g4stage2_detsim_reco.root
_file0->ls

It will complain because you haven’t loaded all the information needed to read an artroot file but you should be able to read it.

NOTE if you see a path in /pnfs/usr/dune/tape_backed at Fermilab or on eos, ctapublic.cern.ch

those are on tape and not generally accessible to the user. Try to get the file from the remaining one (in this case hep.manchester.ac.uk)

More finding files by characteristics using metacat

There isn’t always an official dataset so you can also list files directly using metact.

To list raw data files for a given run:

metacat query "files where core.file_type=detector \
 and core.run_type='protodune-sp' and core.data_tier=raw \
 and core.data_stream=physics and core.runs[any] in (5141)"
protodune-sp:np04_raw_run005141_0013_dl7.root
protodune-sp:np04_raw_run005141_0005_dl3.root
protodune-sp:np04_raw_run005141_0003_dl1.root
protodune-sp:np04_raw_run005141_0004_dl7.root
...
protodune-sp:np04_raw_run005141_0009_dl7.root
protodune-sp:np04_raw_run005141_0014_dl11.root
protodune-sp:np04_raw_run005141_0007_dl6.root
protodune-sp:np04_raw_run005141_0011_dl8.root

Note the presence of both a namespace and a filename

What about some files from a reconstructed version?

metacat query "files from dune:all where core.file_type=detector \
 and core.run_type='protodune-sp' and core.data_tier=full-reconstructed  \
 and core.data_stream=physics and core.runs[any] in (5141) and dune.campaign=PDSPProd4 limit 10" 
pdsp_det_reco:np04_raw_run005141_0013_dl10_reco1_18127013_0_20210318T104043Z.root
pdsp_det_reco:np04_raw_run005141_0015_dl4_reco1_18126145_0_20210318T101646Z.root
pdsp_det_reco:np04_raw_run005141_0008_dl12_reco1_18127279_0_20210318T104635Z.root
pdsp_det_reco:np04_raw_run005141_0002_dl2_reco1_18126921_0_20210318T103516Z.root
pdsp_det_reco:np04_raw_run005141_0002_dl14_reco1_18126686_0_20210318T102955Z.root
pdsp_det_reco:np04_raw_run005141_0015_dl5_reco1_18126081_0_20210318T122619Z.root
pdsp_det_reco:np04_raw_run005141_0017_dl10_reco1_18126384_0_20210318T102231Z.root
pdsp_det_reco:np04_raw_run005141_0006_dl4_reco1_18127317_0_20210318T104702Z.root
pdsp_det_reco:np04_raw_run005141_0007_dl9_reco1_18126730_0_20210318T102939Z.root
pdsp_det_reco:np04_raw_run005141_0011_dl7_reco1_18127369_0_20210318T104844Z.root

To see the total number (and size) of files that match a certain query expression, then add the -s option to metacat query.

See the metacat documentation for more information about queries. DataCatalogDocs and check out the glossary of common fields at: MetaCatGlossary

Accessing data for use in your analysis

To access data without copying it, XRootD is the tool to use. However it will work only if the file is staged to the disk.

You can stream files worldwide if you have a DUNE VO certificate as described in the preparation part of this tutorial.

To learn more about using Rucio and Metacat to run over large data samples go here:

Full justIN/Rucio/Metacat Tutorial

The justIN tutorial and justIN/Rucio/Metacat Tutorial

Exercise 1

  • Use metacat query .... to find a file from a particular experiment/run/processing stage. Look in DataCatalogDocs for hints on constructing queries.
  • Use metacat file show -m -l namespace:filename to get metadata for this file. Note that --json gives the output in json format.

When we are analyzing large numbers of files in a group of batch jobs, we use a metacat dataset to describe the full set of files that we are going to analyze and use the JustIn system to run over that dataset. Each job will then come up and ask metacat and rucio to give it the next file in the list. It will try to find the nearest copy. For instance if you are running at CERN and analyzing this file it will automatically take it from the CERN storage space EOS.

Exercise 2 - explore in the gui

The Metacat Gui is a nice place to explore the data we have.

You need to log in with your services (not kerberos) password.

do a datasets search of all namespaces for the word official in a dataset name

you can then click on sets to see what they contain

Exercise 3 - explore a dataset

Use metacat to find information about the dataset justin-tutorial:justin-tutorial-2024 How many files are in it, what is the total size. (metacat dataset show command, and metacat dataset files command) Use rucio to find one of the files in it.

Resources:

Quiz

Question 01

What is file metadata?

  1. Information about how and when a file was made
  2. Information about what type of data the file contains
  3. Conditions such as liquid argon temperature while the file was being written
  4. Both A and B
  5. All of the above

Answer

The correct answer is D - Both A and B.

Comment here

Question 02

How do we determine a DUNE data file location?

  1. Do `ls -R` on /pnfs/dune and grep
  2. Use `rucio list-file-replicas` (namespace:filename) --pnfs --protocols=root
  3. Ask the data management group
  4. None of the Above

Answer

The correct answer is B - use rucio list-file-replicas (namespace:filename).

Comment here

Key Points

  • SAM and Rucio are data handling systems used by the DUNE collaboration to retrieve data.

  • Staging is a necessary step to make sure files are on disk in dCache (as opposed to only on tape).

  • Xrootd allows user to stream data files.