AWS High Performance Storage

In late 2018, AWS released Amazon FSx for Lustre. Lustre is an open-source parallel file system that meets many of the requirements of High Performance Computing (HPC): http://lustre.org/.

This is a managed offering by AWS that allows you to do the following:
– Process high-performance data sets natively or with S3 as a source
– Access S3 buckets easily
– Launch and run a file system that provides sub-millisecond access to your data
– Use existing applications without modification: Lustre is POSIX-compliant, and you mount it on your Linux instances with the Lustre client
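For example, on an Amazon Linux 2 instance the file system can be mounted roughly like this (a sketch: the DNS name is a placeholder, and the package name and mount options are assumptions that may vary by distribution):

```shell
# Install the Lustre client (Amazon Linux 2 ships it as an extras topic)
sudo amazon-linux-extras install -y lustre2.10

# Mount the FSx for Lustre file system; "fsx" is the default mount name,
# and the DNS name below is a placeholder for your file system's DNS name
sudo mkdir -p /fsx
sudo mount -t lustre -o noatime,flock \
    fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/fsx /fsx
```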

I have created a Terraform file that uses a Spot Fleet to spin up a set of instances simulating a high-performance compute cluster and attaches a Lustre file system to the instances launched. Keep in mind that FSx for Lustre does not span multiple subnets: your instances will need to be launched in the same subnet as the file system.

provider "aws" {
    region = var.region
    version = "2.25"
}

resource "aws_s3_bucket" "terraform_state_libertytech" {
    bucket = "terraform-state-libertytech"
    versioning {
        enabled = true
    }
    tags = {
        Name = "Terraform state files"
    }
}

terraform {
    backend "s3" {
        bucket = "terraform-state-libertytech"
        key = "terraform.tfstate"
        region = "us-east-1"
    }
}

resource "aws_fsx_lustre_file_system" "hpc_storage" {
    storage_capacity = 3600
    subnet_ids = [var.lustre_subnet_us-east-1b]
    import_path = "s3://nasanex/Landsat"
}

resource "aws_spot_fleet_request" "hpc-fleet" {
    iam_fleet_role = var.iam_fleet_role
    spot_price = "0.0310"
    allocation_strategy = "lowestPrice"
    target_capacity = 3
    terminate_instances_with_expiration = true
    
    launch_specification {
        instance_type = "c4.large"
        ami = var.amis[var.region]
        key_name = var.key_pair[var.region]
        subnet_id = var.subnets_us-east-1.us-east-1b

        user_data = templatefile("${path.module}/lustre.tmpl", {
                    lustre_file = aws_fsx_lustre_file_system.hpc_storage.dns_name
        })

        tags = {
            Name = "HPC C4"
        }
    }

    launch_specification {
        instance_type = "m4.large"
        ami = var.amis[var.region]
        key_name = var.key_pair[var.region]
        subnet_id = var.subnets_us-east-1.us-east-1b
       
        tags = {
            Name = "HPC M4"
        }
    }
    depends_on = [aws_fsx_lustre_file_system.hpc_storage]
}

output "lustre_dns_name" {
    value = aws_fsx_lustre_file_system.hpc_storage.dns_name
}
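The lustre.tmpl user-data template referenced by templatefile above is not shown in this post; a minimal sketch might look like the following (the package name and mount options are assumptions, and lustre_file is the template variable passed in):

```shell
#!/bin/bash
# Sketch of lustre.tmpl: install the Lustre client and mount the
# FSx for Lustre file system whose DNS name is passed in as lustre_file
amazon-linux-extras install -y lustre2.10
mkdir -p /fsx
mount -t lustre -o noatime,flock "${lustre_file}@tcp:/fsx" /fsx
```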

The repo that stores the code above is currently hosted on GitLab:
https://gitlab.com/gabrielrojasnyc/hpc_aws/tree/master

The Terraform output looks like the image below when creating the Lustre managed file system.
Creating a Lustre file system takes around six minutes.

I have also created a GitLab pipeline to deploy the code in an automated way. See the image below for the workflow process.

Deploying AWS FSx for Lustre via Terraform and GitLab

CONCLUSION
You may ask yourself how all these technologies add value to your company. When you are doing any kind of data processing or high-performance computing, one of the main constraints has been storage.

AWS FSx for Lustre allows you to process massive amounts of data fast and in a distributed manner, removing that storage constraint. It integrates easily with S3, where you will want to keep your artifacts and objects. Keep in mind that FSx for Lustre is mainly intended for short-lived workloads (less than 24 hours), and you should store any artifacts somewhere more durable, such as S3.

One risk with FSx for Lustre at the moment is that it does not support IAM roles.

To mitigate this for now, I would create a dedicated high-performance compute subnet that only launches instances that will be attached to FSx for Lustre.
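As a sketch of that mitigation (the resource names, variable, and CIDR block below are assumptions; Lustre network traffic uses TCP port 988):

```hcl
# Hypothetical security group that only allows Lustre traffic (TCP 988)
# from inside the dedicated HPC subnet's CIDR range
resource "aws_security_group" "hpc_lustre" {
    name   = "hpc-lustre"
    vpc_id = var.vpc_id  # assumed variable

    ingress {
        from_port   = 988
        to_port     = 988
        protocol    = "tcp"
        cidr_blocks = ["10.0.10.0/24"]  # example CIDR of the HPC subnet
    }

    egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
    }
}
```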
