This guide walks you through deploying a private EMR cluster with Spark and Trino using instance fleets. You’ll have a running cluster at the end.

Prerequisites

Before you start, make sure you have:
  • AWS credentials configured locally (e.g., via aws configure or environment variables)
  • Terraform >= 1.5.7 installed (install guide)
  • A VPC with at least one private subnet
  • The VPC and subnets tagged with for-use-with-amazon-emr-managed-policies = true
  • An S3 bucket for cluster logs
EMR’s AmazonEMRServicePolicy_v2 managed policy requires the for-use-with-amazon-emr-managed-policies = true tag on your VPC and subnets. The module tags its own resources automatically, but you must tag your VPC resources manually before applying.
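If you would rather apply the tag with Terraform than by hand, a minimal sketch using the AWS provider's aws_ec2_tag resource might look like the following. The local names emr_vpc_id and emr_subnet_ids are hypothetical, and the IDs are the same placeholders used elsewhere in this guide. It assumes the VPC and subnets are not already managed, tags included, by another resource in the same configuration; if they are, add the tag to those resources instead.

```hcl
# Hypothetical placeholders: substitute your own VPC and subnet IDs.
locals {
  emr_vpc_id     = "vpc-1234556abcdef"
  emr_subnet_ids = ["subnet-abcde012", "subnet-bcde012a", "subnet-fghi345a"]
}

# Tag the existing VPC so the EMR managed policy can act on it.
resource "aws_ec2_tag" "emr_vpc" {
  resource_id = local.emr_vpc_id
  key         = "for-use-with-amazon-emr-managed-policies"
  value       = "true"
}

# Apply the same tag to every subnet the cluster may launch into.
resource "aws_ec2_tag" "emr_subnets" {
  for_each    = toset(local.emr_subnet_ids)
  resource_id = each.value
  key         = "for-use-with-amazon-emr-managed-policies"
  value       = "true"
}
```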

Deploy a private cluster

1. Create your Terraform configuration

Create a file named main.tf and add the following configuration. Replace the placeholder values for subnet_ids, vpc_id, and log_uri with your own.
module "emr" {
  source = "terraform-aws-modules/emr/aws"

  name = "example-instance-fleet"

  release_label = "emr-7.9.0"
  applications  = ["spark", "trino"]
  auto_termination_policy = {
    idle_timeout = 3600
  }

  bootstrap_action = [
    {
      path = "file:/bin/echo",
      name = "Just an example",
      args = ["Hello World!"]
    }
  ]

  configurations_json = jsonencode([
    {
      "Classification" : "spark-env",
      "Configurations" : [
        {
          "Classification" : "export",
          "Properties" : {
            "JAVA_HOME" : "/usr/lib/jvm/java-1.8.0"
          }
        }
      ],
      "Properties" : {}
    }
  ])

  master_instance_fleet = {
    name                      = "master-fleet"
    target_on_demand_capacity = 1
    instance_type_configs = [
      {
        instance_type = "m5.xlarge"
      }
    ]
  }

  core_instance_fleet = {
    name                      = "core-fleet"
    target_on_demand_capacity = 2
    target_spot_capacity      = 2
    instance_type_configs = [
      {
        instance_type     = "c4.large"
        weighted_capacity = 1
      },
      {
        bid_price_as_percentage_of_on_demand_price = 100
        ebs_config = [{
          size                 = 256
          type                 = "gp3"
          volumes_per_instance = 1
        }]
        instance_type     = "c5.xlarge"
        weighted_capacity = 2
      },
      {
        bid_price_as_percentage_of_on_demand_price = 100
        instance_type                              = "c6i.xlarge"
        weighted_capacity                          = 2
      }
    ]
    launch_specifications = {
      spot_specification = {
        allocation_strategy      = "capacity-optimized"
        block_duration_minutes   = 0
        timeout_action           = "SWITCH_TO_ON_DEMAND"
        timeout_duration_minutes = 5
      }
    }
  }

  task_instance_fleet = {
    name                      = "task-fleet"
    target_on_demand_capacity = 1
    target_spot_capacity      = 2
    instance_type_configs = [
      {
        instance_type     = "c4.large"
        weighted_capacity = 1
      },
      {
        bid_price_as_percentage_of_on_demand_price = 100
        ebs_config = [{
          size                 = 256
          type                 = "gp3"
          volumes_per_instance = 1
        }]
        instance_type     = "c5.xlarge"
        weighted_capacity = 2
      }
    ]
    launch_specifications = {
      spot_specification = {
        allocation_strategy      = "capacity-optimized"
        block_duration_minutes   = 0
        timeout_action           = "SWITCH_TO_ON_DEMAND"
        timeout_duration_minutes = 5
      }
    }
  }

  ebs_root_volume_size = 64
  ec2_attributes = {
    # Subnets should be private subnets and tagged with
    # { "for-use-with-amazon-emr-managed-policies" = true }
    subnet_ids = ["subnet-abcde012", "subnet-bcde012a", "subnet-fghi345a"]
  }
  vpc_id = "vpc-1234556abcdef"

  list_steps_states = ["PENDING", "RUNNING", "FAILED", "INTERRUPTED"]
  log_uri           = "s3://my-elasticmapreduce-bucket/"

  scale_down_behavior    = "TERMINATE_AT_TASK_COMPLETION"
  step_concurrency_level = 3
  termination_protection = false
  visible_to_all_users   = true

  tags = {
    Terraform   = "true"
    Environment = "dev"
  }
}
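As an alternative to hard-coding subnet_ids, the tagged subnets could be looked up with the AWS provider's aws_subnets data source. This is only a sketch: the variable name vpc_id and the data source label emr are hypothetical, and the lookup returns every subnet in the VPC that carries the tag, public ones included, so you may want an additional tag to narrow the match.

```hcl
variable "vpc_id" {
  description = "ID of the existing VPC that hosts the cluster (hypothetical variable name)."
  type        = string
}

# Find all subnets in the VPC that carry the EMR managed-policy tag.
data "aws_subnets" "emr" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }

  tags = {
    "for-use-with-amazon-emr-managed-policies" = "true"
  }
}
```

With this in place, the module block could set subnet_ids = data.aws_subnets.emr.ids inside ec2_attributes and vpc_id = var.vpc_id.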
2. Initialise Terraform

Run the following command to download the module and its dependencies:
terraform init
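terraform init also installs the AWS provider, so it can help to declare the Terraform version and provider source explicitly. The sketch below assumes a hypothetical versions.tf file alongside main.tf. The region is an assumption, and no provider version is pinned because this guide does not state the module's provider requirement.

```hcl
terraform {
  # Matches the minimum Terraform version listed in the prerequisites.
  required_version = ">= 1.5.7"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # No version constraint is set here; add one that satisfies
      # the module's own provider requirement.
    }
  }
}

provider "aws" {
  # Assumption: replace with the region that hosts your VPC.
  region = "us-east-1"
}
```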
3. Review the plan

Preview the resources Terraform will create before applying:
terraform plan
Expect to see resources for the EMR cluster, the managed security groups (master, slave, and service, where slave covers the core and task nodes), and IAM roles.
4. Apply the configuration

Create the cluster:
terraform apply
Cluster provisioning typically takes 5–10 minutes. Once the apply completes, Terraform prints any outputs defined in your root configuration.

Key variables

| Variable | Description |
| --- | --- |
| name | Name of the EMR job flow. Used as a prefix for all created resources. |
| release_label | The EMR release version (e.g., emr-7.9.0). Determines which application versions are available. |
| applications | Case-insensitive list of applications to install, such as spark or trino. |
| ec2_attributes | EC2 instance configuration including subnet_ids (instance fleets) or subnet_id (instance groups) and optional security group overrides. |
| vpc_id | The VPC where the module creates managed security groups. |
| is_private_cluster | Set to true (default) for private subnets. Set to false for public subnets; this also requires updating ec2_attributes.subnet_ids. |
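To illustrate the is_private_cluster row, the fragment below sketches the two arguments that would change in the existing module block for a public-subnet deployment. It is not a second module block, and the subnet ID is a hypothetical placeholder. The managed-policy tag requirement is assumed to apply to public subnets as well.

```hcl
module "emr" {
  source = "terraform-aws-modules/emr/aws"

  # ... remaining arguments unchanged from main.tf ...

  # Launch into public subnets instead of the default private ones.
  is_private_cluster = false

  ec2_attributes = {
    # Hypothetical placeholder: replace with your tagged public subnet IDs.
    subnet_ids = ["subnet-0123456789abcdef0"]
  }
}
```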

Expected outputs

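After a successful terraform apply, the module exposes the following outputs:
| Output | Description |
| --- | --- |
| cluster_id | The ID of the EMR cluster. |
| cluster_arn | The ARN of the EMR cluster. |
| cluster_master_public_dns | The DNS name of the master node. Returns the private DNS name for clusters in private subnets. |
Terraform only prints root-level outputs, so re-export each module output you need (for example, module.emr.cluster_id) through an output block in your root configuration. You can then read the values in your shell: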
terraform output cluster_id
terraform output cluster_master_public_dns
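The terraform output commands above read outputs from the root module, so a sketch of the output blocks they rely on might look like this. It assumes the module block is named emr, as in main.tf, and that the blocks live in a hypothetical outputs.tf file next to it.

```hcl
# Re-export the module's outputs so `terraform output` can read them.
output "cluster_id" {
  description = "The ID of the EMR cluster."
  value       = module.emr.cluster_id
}

output "cluster_arn" {
  description = "The ARN of the EMR cluster."
  value       = module.emr.cluster_arn
}

output "cluster_master_public_dns" {
  description = "DNS name of the master node (private DNS name in private subnets)."
  value       = module.emr.cluster_master_public_dns
}
```

Run terraform apply again after adding these blocks so the values are recorded in state.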