Quickstart

This guide walks you through deploying a private EMR cluster with Spark and Trino using instance fleets. You’ll have a running cluster at the end.

Prerequisites

Before you start, make sure you have:

AWS credentials configured locally (e.g., via aws configure or environment variables)
Terraform >= 1.5.7 installed (install guide)
A VPC with at least one private subnet
The VPC and subnets tagged with for-use-with-amazon-emr-managed-policies = true
An S3 bucket for cluster logs
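
The Terraform version requirement can also be enforced in configuration rather than checked by hand. The following is a minimal sketch, assuming a file named versions.tf next to main.tf. The file name and the region value are placeholders chosen for illustration, not something this guide prescribes.

```hcl
# versions.tf (hypothetical file name)
terraform {
  # Fail fast if the installed Terraform is older than this guide expects.
  required_version = ">= 1.5.7"

  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  # Assumption: replace with the region that contains your VPC.
  region = "us-east-1"
}
```

With this in place, terraform init stops with a version error on older installations instead of failing later in the run.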

EMR’s AmazonEMRServicePolicy_v2 managed policy requires the for-use-with-amazon-emr-managed-policies = true tag on your VPC and subnets. The module tags its own resources automatically, but you must tag your VPC resources manually before applying.
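
If you would rather not tag by hand, one option is the AWS provider's aws_ec2_tag resource. The sketch below is one possible approach and is not part of the module. The IDs are the same placeholders used in the cluster configuration in this guide, and the local name emr_network_resource_ids is made up for illustration. It assumes the VPC and subnets are not already managed, with their own tags argument, elsewhere in your Terraform state, since the two ways of managing tags would conflict.

```hcl
locals {
  # Assumption: replace with your own VPC and private subnet IDs.
  emr_network_resource_ids = [
    "vpc-1234556abcdef",
    "subnet-abcde012",
    "subnet-bcde012a",
    "subnet-fghi345a",
  ]
}

# Adds the tag required by AmazonEMRServicePolicy_v2 to each resource.
resource "aws_ec2_tag" "emr_managed_policies" {
  for_each = toset(local.emr_network_resource_ids)

  resource_id = each.value
  key         = "for-use-with-amazon-emr-managed-policies"
  value       = "true"
}
```

Apply this before the cluster configuration, or keep it in a separate configuration, so the tags are in place before the cluster is created.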

Deploy a private cluster

Create your Terraform configuration

Create a file named main.tf and add the following configuration. Replace the placeholder values for subnet_ids, vpc_id, and log_uri with your own.

module "emr" {
  source = "terraform-aws-modules/emr/aws"

  name = "example-instance-fleet"

  release_label = "emr-7.9.0"
  applications  = ["spark", "trino"]
  auto_termination_policy = {
    idle_timeout = 3600
  }

  bootstrap_action = [
    {
      path = "file:/bin/echo",
      name = "Just an example",
      args = ["Hello World!"]
    }
  ]

  configurations_json = jsonencode([
    {
      "Classification" : "spark-env",
      "Configurations" : [
        {
          "Classification" : "export",
          "Properties" : {
            "JAVA_HOME" : "/usr/lib/jvm/java-1.8.0"
          }
        }
      ],
      "Properties" : {}
    }
  ])

  master_instance_fleet = {
    name                      = "master-fleet"
    target_on_demand_capacity = 1
    instance_type_configs = [
      {
        instance_type = "m5.xlarge"
      }
    ]
  }

  core_instance_fleet = {
    name                      = "core-fleet"
    target_on_demand_capacity = 2
    target_spot_capacity      = 2
    instance_type_configs = [
      {
        instance_type     = "c4.large"
        weighted_capacity = 1
      },
      {
        bid_price_as_percentage_of_on_demand_price = 100
        ebs_config = [{
          size                 = 256
          type                 = "gp3"
          volumes_per_instance = 1
        }]
        instance_type     = "c5.xlarge"
        weighted_capacity = 2
      },
      {
        bid_price_as_percentage_of_on_demand_price = 100
        instance_type                              = "c6i.xlarge"
        weighted_capacity                          = 2
      }
    ]
    launch_specifications = {
      spot_specification = {
        allocation_strategy      = "capacity-optimized"
        block_duration_minutes   = 0
        timeout_action           = "SWITCH_TO_ON_DEMAND"
        timeout_duration_minutes = 5
      }
    }
  }

  task_instance_fleet = {
    name                      = "task-fleet"
    target_on_demand_capacity = 1
    target_spot_capacity      = 2
    instance_type_configs = [
      {
        instance_type     = "c4.large"
        weighted_capacity = 1
      },
      {
        bid_price_as_percentage_of_on_demand_price = 100
        ebs_config = [{
          size                 = 256
          type                 = "gp3"
          volumes_per_instance = 1
        }]
        instance_type     = "c5.xlarge"
        weighted_capacity = 2
      }
    ]
    launch_specifications = {
      spot_specification = {
        allocation_strategy      = "capacity-optimized"
        block_duration_minutes   = 0
        timeout_action           = "SWITCH_TO_ON_DEMAND"
        timeout_duration_minutes = 5
      }
    }
  }

  ebs_root_volume_size = 64
  ec2_attributes = {
    # Subnets should be private subnets and tagged with
    # { "for-use-with-amazon-emr-managed-policies" = true }
    subnet_ids = ["subnet-abcde012", "subnet-bcde012a", "subnet-fghi345a"]
  }
  vpc_id = "vpc-1234556abcdef"

  list_steps_states = ["PENDING", "RUNNING", "FAILED", "INTERRUPTED"]
  log_uri           = "s3://my-elasticmapreduce-bucket/"

  scale_down_behavior    = "TERMINATE_AT_TASK_COMPLETION"
  step_concurrency_level = 3
  termination_protection = false
  visible_to_all_users   = true

  tags = {
    Terraform   = "true"
    Environment = "dev"
  }
}
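
Instead of editing the placeholders in place, the three site-specific values could be supplied as input variables. This is a sketch only: the file name and variable names below are chosen for illustration and are not defined by the module.

```hcl
# variables.tf (hypothetical file name and variable names)
variable "vpc_id" {
  description = "ID of the VPC that hosts the cluster"
  type        = string
}

variable "subnet_ids" {
  description = "Private subnet IDs, tagged for EMR managed policies"
  type        = list(string)
}

variable "log_uri" {
  description = "S3 URI for cluster logs, e.g. s3://my-elasticmapreduce-bucket/"
  type        = string
}
```

In main.tf, the module arguments would then read vpc_id = var.vpc_id, subnet_ids = var.subnet_ids (inside ec2_attributes), and log_uri = var.log_uri. The values can be supplied through a terraform.tfvars file or -var flags.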

Initialise Terraform

Run the following command to download the module and its dependencies:

terraform init

Review the plan

Preview the resources Terraform will create before applying:

terraform plan

Expect to see resources for the EMR cluster, security groups (master, slave, service), and IAM roles.

Apply the configuration

Create the cluster:

terraform apply

Cluster provisioning typically takes 5–10 minutes. When the apply completes, Terraform prints any outputs declared in your root configuration.

Key variables

Variable	Description
`name`	Name of the EMR job flow. Used as a prefix for all created resources.
`release_label`	The EMR release version (e.g., `emr-7.9.0`). Determines which application versions are available.
`applications`	Case-insensitive list of applications to install, such as `spark` or `trino`.
`ec2_attributes`	EC2 instance configuration including `subnet_ids` (instance fleets) or `subnet_id` (instance groups) and optional security group overrides.
`vpc_id`	The VPC where the module creates managed security groups.
`is_private_cluster`	Set to `true` (default) for private subnets. Set to `false` for public subnets; in that case, `ec2_attributes.subnet_ids` must also point to public subnets.
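
As an illustration of how these variables interact, switching the cluster to public subnets changes two of them together. The fragment below shows only the arguments that differ from main.tf, and it is a sketch rather than a complete configuration. The subnet IDs are hypothetical placeholders.

```hcl
module "emr" {
  source = "terraform-aws-modules/emr/aws"

  # ... all other arguments unchanged from main.tf ...

  # Public-subnet variant: override the private-cluster default.
  is_private_cluster = false

  ec2_attributes = {
    # Assumption: these are public subnet IDs in the VPC below.
    subnet_ids = ["subnet-0aaa111", "subnet-0bbb222"]
  }
  vpc_id = "vpc-1234556abcdef"
}
```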

Expected outputs

After a successful terraform apply, the module exposes the following outputs:

Output	Description
`cluster_id`	The ID of the EMR cluster.
`cluster_arn`	The ARN of the EMR cluster.
`cluster_master_public_dns`	The DNS name of the master node. Returns the private DNS name for clusters in private subnets.

Access outputs in your shell:

terraform output cluster_id
terraform output cluster_master_public_dns
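
Note that terraform output reads only outputs declared in the root configuration, and the main.tf shown in this guide declares none. A minimal sketch that re-exports the three module outputs, assuming the module block is named emr as in this guide (the file name outputs.tf is a convention, not a requirement):

```hcl
# outputs.tf (hypothetical file name)
output "cluster_id" {
  description = "The ID of the EMR cluster"
  value       = module.emr.cluster_id
}

output "cluster_arn" {
  description = "The ARN of the EMR cluster"
  value       = module.emr.cluster_arn
}

output "cluster_master_public_dns" {
  description = "DNS name of the master node (private DNS name in private subnets)"
  value       = module.emr.cluster_master_public_dns
}
```

After adding these blocks, run terraform apply once more so the output values are recorded in state. Passing -raw (for example, terraform output -raw cluster_id) prints the bare value without quotes, which is convenient in scripts.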
