RAPIDS is a data science framework which bundles a collection of libraries for executing end-to-end data science pipelines completely on top of GPUs. It uses optimized NVIDIA CUDA® primitives and high-bandwidth GPU memory to accelerate data preparation and machine learning tasks. For example, it can be used for ETL and preprocessing of deep learning workflows.
This content is based on the RAPIDS website documentation.
The RAPIDS implementation on Bridges-2 uses the NVIDIA RAPIDS suite. Please see https://developer.nvidia.com/rapids for more information.
Start an interactive session on Bridges-2 with the interact command. You can use a regular or a GPU node, as needed for your code.
interact --gpu # Start a session in a node with a GPU.
Once your interactive session has begun, execute Python using the
latest RAPIDS Singularity/Apptainer container. You can see all of the
containers that PSC has created on Bridges-2 in the directory
/ocean/containers/ngc/
. The file latest.sif
is a link to the most recently updated container.
Use a command like:
singularity exec --nv /ocean/containers/ngc/rapidsai/latest.sif python3
This is a step-by-step guide of how to run the quick-start examples.
This code snippet reads in a CSV file and outputs some descriptive statistics. The file hw_25000.csv contains height and weight data for 25000 individuals Each record includes 3 values: index, height (inches), weight (pounds). There is also an initial header line.
First, download the sample dataset:
wget https://people.sc.fsu.edu/~jburkardt/data/csv/hw_25000.csv
Create a file named statistics.py with the following content:
# statistics.py
import cudf
gdf = cudf.read_csv('hw_25000.csv')
for column in gdf.columns:
print(gdf[column].mean())
Run it via the RAPIDS container:
singularity exec --nv /ocean/containers/ngc/rapidsai/latest.sif python3 statistics.py
It should return the following output:
python statistics.py
12500.5
67.9931135968
127.07942116080001
The original can be found here: https://github.com/rapidsai/cudf.
This example loads a public dataset, from a CSV file on GitHub, into a GPU memory-resident DataFrame and performs a basic calculation.
All of the CSV parsing and the operations for calculating the tip percentage and average are done on the GPU.
Create a file named cuDF.py with the folowing content:
# cuDF.py
import cudf
import io, requests
# Download the CSV file from GitHub.
url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')
# Read the CSV into memory.
tips_df = cudf.read_csv(io.StringIO(content))
tips_df['tip_percentage'] = tips_df['tip'] / tips_df['total_bill'] * 100
# Display the average tip amount by dining party size.
print(tips_df.groupby('size').tip_percentage.mean())
Run it via the RAPIDS container
singularity exec --nv /ocean/containers/ngc/rapidsai/latest.sif python3 cuDF.py
It should return the following output:
python cuDF.py
size
1 21.729202
2 16.571919
3 15.215685
4 14.594901
5 14.149549
6 15.622920
Name: tip_percentage, dtype: float64
The original can be found here: https://github.com/rapidsai/cuml.
This example loads a small sample data frame and computes DBSCAN clusters.
Create a file named cuML.py with the following content:
# cuML.py
import cudf
import cuml
# Create and populate a GPU DataFrame
df_float = cudf.DataFrame()
df_float['0'] = [1.0, 2.0, 5.0]
df_float['1'] = [4.0, 2.0, 1.0]
df_float['2'] = [4.0, 2.0, 1.0]
# Setup and fit clusters
dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(df_float)
print(dbscan_float.labels_)
Run it via the RAPIDS container
singularity exec --nv /ocean/containers/ngc/rapidsai/latest.sif python3 cuML.py
It should return the following output:
python cuML.py
0 0
1 1
2 2
dtype: int32