Advanced Systems & Operations

Data Handling and Analytics

The Advanced Systems & Operations group conducts research in High Performance Computing systems and data storage. Current projects give researchers better access to their data through the development of fast parallel filesystems and distributed filesystems for widely distributed computing environments.

In addition, the Advanced Systems & Operations group manages all PSC HPC resources, including security, supercomputing operations, high performance storage, and data management and file systems. 

Advanced Systems Research

SLASH2

SLASH2 is a distributed filesystem that incorporates existing storage resources into a common filesystem domain. It provides system-managed storage tasks for users who work in widely distributed environments. The SLASH2 metadata controller performs inline replication management, maintains data checksums and coordinates third-party, parallel data transfers between constituent data systems. The SLASH2 I/O service is a portable, user-space process that does not interfere with the underlying storage system’s administrative model.
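
The following is a minimal conceptual sketch in Python, not SLASH2 code or its API, of how a metadata controller of this kind can track replicas and checksums while coordinating copies directly between I/O servers; all class, file, and server names are invented for illustration.

# Conceptual model only: a toy "metadata controller" that records per-file
# checksums and replica locations and schedules third-party copies between
# I/O servers, mirroring the responsibilities described above for SLASH2.
import hashlib
from dataclasses import dataclass, field

@dataclass
class FileRecord:
    checksum: str                                # checksum kept by the controller
    replicas: set = field(default_factory=set)   # I/O servers holding valid copies

class MetadataController:
    def __init__(self):
        self.files = {}

    def register(self, path, data, io_server):
        """Record a file written through one I/O server."""
        self.files[path] = FileRecord(
            checksum=hashlib.sha256(data).hexdigest(),
            replicas={io_server},
        )

    def replicate(self, path, dst_server):
        """Schedule a copy from an existing replica directly to dst_server."""
        rec = self.files[path]
        src_server = next(iter(rec.replicas))
        # The data would move between the two I/O services themselves;
        # the controller only coordinates and verifies checksums afterwards.
        print(f"copy {path}: {src_server} -> {dst_server}")
        rec.replicas.add(dst_server)

ctl = MetadataController()
ctl.register("/archive/run42.dat", b"simulation output", "ios-local")
ctl.replicate("/archive/run42.dat", "ios-remote")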

Read more about SLASH2 at https://github.com/pscedu/slash2/wiki.

SLASH2 is licensed under the GPLv2.

Zest

Zest is a parallel storage system specifically designed to meet the ever-increasing demands of HPC application checkpointing. Zest differs from traditional parallel filesystems by using log-structured filesystem techniques on its I/O servers in combination with opportunistic data placement. With these techniques, Zest is capable of driving its disks at 90% of peak bandwidth. A patent for Zest was granted to PSC scientists in 2013.
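
As a rough illustration of the two ideas above, the Python sketch below (not Zest code; the disk count, block size, and "least busy disk" heuristic are invented for the example) appends incoming checkpoint blocks log-style to whichever disk currently has the shortest queue instead of placing them at their file offsets.

# Toy illustration of log-structured, opportunistically placed writes:
# each block is appended to the least-loaded disk's log, keeping every
# spindle streaming sequentially; the (file, offset) tag lets the data
# be reassembled into the original file later.
class LogStructuredWriter:
    def __init__(self, n_disks):
        self.logs = [[] for _ in range(n_disks)]   # one append-only log per disk

    def write_block(self, file_id, offset, block):
        # Opportunistic placement: pick the disk with the shortest queue.
        disk = min(range(len(self.logs)), key=lambda d: len(self.logs[d]))
        self.logs[disk].append((file_id, offset, block))
        return disk

writer = LogStructuredWriter(n_disks=4)
for i in range(8):
    disk = writer.write_block("checkpoint-001", i * 4096, b"\0" * 4096)
    print(f"block at offset {i * 4096} -> disk {disk}")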

Read more about Zest.

Research Collaborations

Advanced Systems & Operations staff work with the University of Pittsburgh MIDAS National Center of Excellence and the MIDAS Network Software Repository to enable research that prepares the nation to plan for, detect and respond to infectious diseases.

Lustre Projects

PSC has multiple projects built on the Lustre file system that deal with authentication, availability, security and validation.

Albedo

Albedo is a distributed wide area network file system for the TeraGrid. It is a single-namespace distributed file system built on Lustre version 1.8. Using a UID/GID remapping model designed at Indiana University, its shared metadata unites the file storage resources distributed among the TeraGrid Resource Providers into a common file system. File servers are connected via high-performance 10 Gb/s interfaces over the TeraGrid Network.
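
The sketch below illustrates the general idea of UID/GID remapping in Python; the mapping table, site names, and numeric IDs are made-up examples, not Albedo’s actual implementation.

# Illustrative only: each site keeps its own local UIDs, and the shared
# file system translates them into one global namespace so a user's files
# carry consistent ownership at every TeraGrid Resource Provider.
SITE_UID_MAP = {
    # (site, local_uid) -> global_uid   (hypothetical entries)
    ("psc", 5001): 910001,
    ("iu",  1203): 910001,   # same person, different local UID at IU
    ("psc", 5002): 910002,
}

def to_global_uid(site, local_uid):
    """Remap a site-local UID into the shared namespace, or fail loudly."""
    try:
        return SITE_UID_MAP[(site, local_uid)]
    except KeyError:
        raise ValueError(f"no mapping for uid {local_uid} at site {site}")

print(to_global_uid("psc", 5001))   # 910001
print(to_global_uid("iu", 1203))    # 910001 -- same owner on the shared FS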

ExTENCI

ExTENCI (Extending Science Through Enhanced National Cyberinfrastructure) is a secure wide area file system for the Open Science Grid (OSG) and TeraGrid. One of the ExTENCI project’s goals is the deployment of a secure, distributed Lustre resource across the WAN as a shared file system between the TeraGrid and OSG. The central infrastructure hub will be located at the University of Florida for initial deployment and testing. Integration with select scientific applications (CMS, LQCD, ATLAS) will be performed with the Fermi National Accelerator Laboratory and the University of Chicago.

Kerberized Lustre

Kerberized Lustre 2.0 is a Kerberos-secured Lustre file system established over the WAN with our partners from the TeraGrid (SDSC) and the Naval Research Laboratory (NRL). Distributed Object Storage Targets (OSTs) and OST pools are enabled, and Kerberized data transfers are performed within local and remote sites. Efforts toward Lustre cross-realm authentication with the NRL and integration with Kerberos-enabled NFSv4 expand access to the file system.
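
For a sense of what this looks like from a client, the Python sketch below wraps the standard kinit and Lustre mount commands; the principal, realm, MGS node, file system name, and mount point are placeholders rather than the actual PSC, SDSC, or NRL configuration, and the Kerberos security flavor is assumed to be configured on the server side.

# Hedged client-side sketch: obtain a Kerberos ticket, then mount the
# Kerberized Lustre file system. Requires root and a reachable KDC/MGS.
import subprocess

def mount_kerberized_lustre(principal, mgs_node, fsname, mountpoint):
    # Without a valid ticket, the Kerberized servers reject client RPCs.
    subprocess.run(["kinit", principal], check=True)
    # Standard Lustre mount; the security flavor is set per target on the
    # servers, so the client mount command itself is unchanged.
    subprocess.run(
        ["mount", "-t", "lustre", f"{mgs_node}:/{fsname}", mountpoint],
        check=True,
    )

if __name__ == "__main__":
    mount_kerberized_lustre(
        principal="researcher@EXAMPLE.ORG",   # placeholder principal/realm
        mgs_node="mgs01@tcp",                 # placeholder MGS NID
        fsname="wanfs",                       # placeholder file system name
        mountpoint="/mnt/wanfs",
    )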

KonFUSEd

KonFUSEd (Kerberos Over the Network via FUSE daemon) is a Lustre 2.0 project that provides Kerberos security to validate servers, clients, and users, but in the process adds a layer of complexity to the overall system. Batch and system services interacting with the Lustre 2.0 file system are subsequently required to present Kerberos credentials. In particular, data transfer services require server-side credential caches to function properly with Lustre 2.0 through Kerberos. KonFUSEd acts as a Kerberos credential delegator to maintain this cache, enabling data transfers to function seamlessly within the batch system.
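
The Python sketch below shows one plausible shape of such credential-cache maintenance, assuming a service keytab and a dedicated cache file; it is not KonFUSEd itself, and the principal, keytab path, and cache location are invented.

# Assumed approach: periodically re-acquire a service ticket from a keytab
# into a dedicated cache so unattended transfer jobs can keep talking to
# the Kerberized Lustre file system.
import os
import subprocess
import time

PRINCIPAL = "xfer/service@EXAMPLE.ORG"            # hypothetical service principal
KEYTAB = "/etc/security/keytabs/xfer.keytab"      # hypothetical keytab path
CCACHE = "/var/run/xfer/krb5cc_xfer"              # hypothetical cache location

def refresh_credentials():
    """Refresh the ticket cache from the keytab."""
    subprocess.run(
        ["kinit", "-k", "-t", KEYTAB, "-c", CCACHE, PRINCIPAL],
        check=True,
    )
    # Point Kerberos-aware processes at the maintained cache.
    os.environ["KRB5CCNAME"] = f"FILE:{CCACHE}"

if __name__ == "__main__":
    while True:
        refresh_credentials()
        time.sleep(4 * 3600)   # renew well before the ticket lifetime expires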

Production Lustre

AntonFS

The NIH-funded AntonFS storage cluster provides roughly half a petabyte of Lustre parallel filesystem. A primary metadata server and six dedicated storage servers present eighteen SAS-SATA RAID6 storage arrays resiliently over QDR 4x InfiniBand to simulation and analysis resources. AntonFS is the primary target of a large simulation resource and is available at high speed to multiple clusters for subsequent data analysis.

The NIH-funded Kollman cluster provides analysis capabilities to complement local simulation capabilities. Kollman currently consists of four Intel Westmere-EP based nodes, each with twelve compute cores, ninety-six gigabytes of memory, and a QDR 4x InfiniBand path to AntonFS and other nodes. Some Kollman systems also contain 448-core Nvidia “Fermi” Tesla cards for acceleration and development of key analysis packages.

Brashear

The Brashear cluster is a Lustre 1.8-based filesystem that provides 291 TB of storage to multiple production platforms. There are two dedicated metadata servers (MDS) and eight object storage servers (OSS). Each server is connected via SDR InfiniBand and 1 Gb/s Ethernet. Two DataDirect Networks (DDN) storage systems provide the backing storage.

Systems & Operations Management

Security

The Information Security team is responsible for ensuring the availability and integrity of the PSC’s high performance computing assets. The security team constantly monitors PSC resources to detect unauthorized/malicious activity and quickly responds to eliminate threats.

PSC Certificate Authority

The PSC Certificate Authority (CA) provides host and service certificates for PSC systems. PSC CA does not issue long-term individual user certificates.

To configure your Globus hosts to accept user authentication from PSC users, install the PSC CA public key and signing policy files in your host’s grid-security/certificates directory (typically /etc/grid-security/certificates). Both files are contained in https://wp-dev.psc.edu/ca/cert/PSC-CA.tar.
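
For illustration, the short Python sketch below automates that manual step by downloading the tarball and unpacking it into the certificates directory; run it with appropriate privileges and verify the download out of band, since this is only a convenience sketch, not a PSC-provided tool.

# Convenience sketch: fetch the PSC-CA tarball and unpack its contents
# (the CA public key and signing policy files) into grid-security/certificates.
import tarfile
import urllib.request

CA_TARBALL = "https://wp-dev.psc.edu/ca/cert/PSC-CA.tar"
CERT_DIR = "/etc/grid-security/certificates"   # typical location; adjust if needed

def install_psc_ca():
    local_path, _ = urllib.request.urlretrieve(CA_TARBALL)
    with tarfile.open(local_path) as tar:
        tar.extractall(path=CERT_DIR)

if __name__ == "__main__":
    install_psc_ca()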

See more information and the up-to-date Certificate Revocation List.

Workshops

Pittsburgh Supercomputing Center offers a variety of workshops, both at PSC and off-site, on subjects ranging from code optimization and parallel programming to specific scientific topics.

See more information about PSC’s education, outreach and training programs.