This guide walks you through deploying a private EMR cluster with Spark and Trino using instance fleets. You’ll have a running cluster at the end.
Prerequisites
Before you start, make sure you have:
- AWS credentials configured locally (e.g., via aws configure or environment variables)
- Terraform >= 1.5.7 installed (install guide)
- A VPC with at least one private subnet
- The VPC and subnets tagged with for-use-with-amazon-emr-managed-policies = true
- An S3 bucket for cluster logs
EMR’s AmazonEMRServicePolicy_v2 managed policy requires the for-use-with-amazon-emr-managed-policies = true tag on your VPC and subnets. The module tags its own resources automatically, but you must tag your VPC resources manually before applying.
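If the VPC and subnets are not managed by another Terraform configuration, one way to apply the required tag is the AWS provider's aws_ec2_tag resource. The following is a minimal sketch, not part of the module: the local names are hypothetical, and the IDs are the same placeholders used later in this guide.

```hcl
# Sketch only: local names are hypothetical and the IDs are placeholders.
locals {
  emr_vpc_id     = "vpc-1234556abcdef"
  emr_subnet_ids = ["subnet-abcde012", "subnet-bcde012a", "subnet-fghi345a"]
}

# Adds the tag required by AmazonEMRServicePolicy_v2 to the VPC and each subnet.
# Avoid this resource if the same configuration also manages these resources
# through their own tags argument, since the two would conflict.
resource "aws_ec2_tag" "emr_managed_policies" {
  for_each = toset(concat([local.emr_vpc_id], local.emr_subnet_ids))

  resource_id = each.value
  key         = "for-use-with-amazon-emr-managed-policies"
  value       = "true"
}
```

If the VPC is created by Terraform elsewhere, adding the tag to that resource's own tags argument is usually the simpler choice.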
Deploy a private cluster
Create your Terraform configuration
Create a file named main.tf and add the following configuration. Replace the placeholder values for subnet_ids, vpc_id, and log_uri with your own.

module "emr" {
  source = "terraform-aws-modules/emr/aws"

  name = "example-instance-fleet"

  release_label = "emr-7.9.0"
  applications  = ["spark", "trino"]

  auto_termination_policy = {
    idle_timeout = 3600
  }

  bootstrap_action = [
    {
      path = "file:/bin/echo",
      name = "Just an example",
      args = ["Hello World!"]
    }
  ]

  configurations_json = jsonencode([
    {
      "Classification" : "spark-env",
      "Configurations" : [
        {
          "Classification" : "export",
          "Properties" : {
            "JAVA_HOME" : "/usr/lib/jvm/java-1.8.0"
          }
        }
      ],
      "Properties" : {}
    }
  ])

  master_instance_fleet = {
    name                      = "master-fleet"
    target_on_demand_capacity = 1
    instance_type_configs = [
      {
        instance_type = "m5.xlarge"
      }
    ]
  }

  core_instance_fleet = {
    name                      = "core-fleet"
    target_on_demand_capacity = 2
    target_spot_capacity      = 2
    instance_type_configs = [
      {
        instance_type     = "c4.large"
        weighted_capacity = 1
      },
      {
        bid_price_as_percentage_of_on_demand_price = 100
        ebs_config = [{
          size                 = 256
          type                 = "gp3"
          volumes_per_instance = 1
        }]
        instance_type     = "c5.xlarge"
        weighted_capacity = 2
      },
      {
        bid_price_as_percentage_of_on_demand_price = 100
        instance_type                              = "c6i.xlarge"
        weighted_capacity                          = 2
      }
    ]
    launch_specifications = {
      spot_specification = {
        allocation_strategy      = "capacity-optimized"
        block_duration_minutes   = 0
        timeout_action           = "SWITCH_TO_ON_DEMAND"
        timeout_duration_minutes = 5
      }
    }
  }

  task_instance_fleet = {
    name                      = "task-fleet"
    target_on_demand_capacity = 1
    target_spot_capacity      = 2
    instance_type_configs = [
      {
        instance_type     = "c4.large"
        weighted_capacity = 1
      },
      {
        bid_price_as_percentage_of_on_demand_price = 100
        ebs_config = [{
          size                 = 256
          type                 = "gp3"
          volumes_per_instance = 1
        }]
        instance_type     = "c5.xlarge"
        weighted_capacity = 2
      }
    ]
    launch_specifications = {
      spot_specification = {
        allocation_strategy      = "capacity-optimized"
        block_duration_minutes   = 0
        timeout_action           = "SWITCH_TO_ON_DEMAND"
        timeout_duration_minutes = 5
      }
    }
  }

  ebs_root_volume_size = 64

  ec2_attributes = {
    # Subnets should be private subnets and tagged with
    # { "for-use-with-amazon-emr-managed-policies" = true }
    subnet_ids = ["subnet-abcde012", "subnet-bcde012a", "subnet-fghi345a"]
  }
  vpc_id = "vpc-1234556abcdef"

  list_steps_states      = ["PENDING", "RUNNING", "FAILED", "INTERRUPTED"]
  log_uri                = "s3://my-elasticmapreduce-bucket/"
  scale_down_behavior    = "TERMINATE_AT_TASK_COMPLETION"
  step_concurrency_level = 3
  termination_protection = false
  visible_to_all_users   = true

  tags = {
    Terraform   = "true"
    Environment = "dev"
  }
}
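
The module block does not declare a Terraform version constraint or an AWS provider, so main.tf typically also needs something like the following. This is a sketch under assumptions: the version constraint mirrors the prerequisite above, and the region is a placeholder that should match the region of your VPC.

```hcl
terraform {
  # Mirrors the Terraform version listed under Prerequisites.
  required_version = ">= 1.5.7"

  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  # Placeholder region (an assumption); use the region that contains your VPC.
  region = "us-east-1"
}
```

Credentials are picked up from your local AWS configuration or environment variables, so none are set in the provider block.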
Initialise Terraform
Run the following command to download the module and its dependencies:
terraform init
Review the plan
Preview the resources Terraform will create before applying:
terraform plan
Expect to see resources for the EMR cluster, security groups (master, slave, service), and IAM roles.
Apply the configuration
Create the cluster:
terraform apply
Cluster provisioning typically takes 5–10 minutes. Terraform prints the outputs once the apply completes.
Key variables
| Variable | Description |
|---|---|
| name | Name of the EMR job flow. Used as a prefix for all created resources. |
| release_label | The EMR release version (e.g., emr-7.9.0). Determines which application versions are available. |
| applications | Case-insensitive list of applications to install, such as spark or trino. |
| ec2_attributes | EC2 instance configuration including subnet_ids (instance fleets) or subnet_id (instance groups) and optional security group overrides. |
| vpc_id | The VPC where the module creates managed security groups. |
| is_private_cluster | Set to true (default) for private subnets. Set to false for public subnets, which also requires updating ec2_attributes.subnet_ids. |
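
To illustrate the is_private_cluster row, here is a rough sketch of a public-subnet variant. The module label emr_public, the cluster name, and the subnet ID are hypothetical placeholders, and the fleets are cut down to the minimum for brevity. Treat it as a starting point rather than a tested configuration.

```hcl
# Sketch only: emr_public, the name, and the subnet ID are hypothetical.
module "emr_public" {
  source = "terraform-aws-modules/emr/aws"

  name          = "example-public-fleet"
  release_label = "emr-7.9.0"
  applications  = ["spark", "trino"]

  # Switch from the default (private) to a public cluster.
  is_private_cluster = false

  ec2_attributes = {
    # Must now list public subnets, tagged the same way as the private ones.
    subnet_ids = ["subnet-aaaa1111"]
  }
  vpc_id = "vpc-1234556abcdef"

  master_instance_fleet = {
    name                      = "master-fleet"
    target_on_demand_capacity = 1
    instance_type_configs = [
      {
        instance_type = "m5.xlarge"
      }
    ]
  }

  core_instance_fleet = {
    name                      = "core-fleet"
    target_on_demand_capacity = 1
    instance_type_configs = [
      {
        instance_type = "m5.xlarge"
      }
    ]
  }
}
```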
Expected outputs
After a successful terraform apply, the following outputs are available:
| Output | Description |
|---|---|
| cluster_id | The ID of the EMR cluster. |
| cluster_arn | The ARN of the EMR cluster. |
| cluster_master_public_dns | The DNS name of the master node. Returns the private DNS name for clusters in private subnets. |
Access outputs in your shell:
terraform output cluster_id
terraform output cluster_master_public_dns
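
Note that terraform output reads only outputs declared in the root module, so the commands above assume main.tf re-exports the module's values. A minimal sketch, assuming the module block is named emr as in this guide:

```hcl
# Re-export the module outputs so `terraform output <name>` can read them.
output "cluster_id" {
  description = "The ID of the EMR cluster"
  value       = module.emr.cluster_id
}

output "cluster_arn" {
  description = "The ARN of the EMR cluster"
  value       = module.emr.cluster_arn
}

output "cluster_master_public_dns" {
  description = "The DNS name of the master node"
  value       = module.emr.cluster_master_public_dns
}
```

After adding these blocks, run terraform apply again so the values are recorded in state before querying them.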