The setup
What I tested and why
Most DevOps engineers using Claude for Terraform work never stop to think about which skill is active. They install one, get reasonable results, and move on. I got curious about what the choice actually costs: in tokens, output quality, and the amount of correction work that ends up happening downstream.
I ran five representative infrastructure tasks across four configurations via the Anthropic API, recording input tokens, output tokens, and dollar cost per session. The four configurations were three Claude skills (terraform-skill by Anton Babenko, TerraShark by Lukas Niessen, and HashiCorp's official style guide skill) plus bare Claude with no skill active. That fourth configuration was the most important one: it answers whether any skill actually earns its overhead.
Tasks covered a range of real-world work, including a greenfield VPC module, an IAM role with least-privilege S3 access, an RDS PostgreSQL module, adding infracost support to an existing module, and refactoring flat inline code into a reusable module structure.
The configurations
Three different philosophies, plus a baseline
What became clear during testing is that these aren't three versions of the same thing; they represent genuinely different approaches to guiding Claude's Terraform output. All three are available through the Claude Code skill marketplace; terraform-skill and TerraShark are also on GitHub at antonbabenko/terraform-skill and LukasNiessen/terrashark.
terraform-skill: The most opinionated of the three. Enforces module hierarchy, naming conventions, and variable ordering. Brings structural guardrails; you don't prompt for them.
TerraShark: Focused on failure modes: identity churn, secret exposure, blast radius, CI drift, compliance gaps. A lighter footprint with a different philosophical lens: review over generation.
hashicorp-style-guide: HashiCorp's official conventions for formatting, naming, and file organisation. No strong opinions on security or module architecture: a style enforcer, not an architect.
No skill: No system prompt, no activation overhead. The honest answer to whether adding any skill is worth it.
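To make the style-guide skill's territory concrete, here is a hedged sketch of the kind of convention it enforces: a variable block with an explicit type, description, and validation, formatted the way HashiCorp's style guide recommends. The variable name and allowed values are invented for this illustration, not taken from any skill's output.

```hcl
# Illustrative only: the shape of variable block the style guide favours.
variable "environment" {
  type        = string
  description = "Deployment environment name, used in resource naming and tags."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}
```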
The cost
What each configuration costs per task
All sessions ran against Claude Sonnet 4.6 via the Anthropic API at $3.00 per million input tokens and $15.00 per million output tokens. These are averages across all five tasks.
The numbers in the cards above look close together, and they are, but the reason why is worth understanding. Skills add tokens at the start of every session, before you've typed a single prompt. You might expect a skill with 1,680 activation tokens to cost significantly more than bare Claude at 60. In practice it doesn't, because what drives cost is how much Claude writes back, not how much you put in. Output tokens are five times more expensive than input tokens, so a detailed response quickly swamps any activation overhead. That's also why TerraShark ends up the most expensive despite having the lightest activation footprint: its failure-mode workflow generates longer responses, with prose analysis and risk commentary alongside the HCL.
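The trade-off is easy to sanity-check with the pricing above. A quick sketch in HCL locals; the 1,680-token activation count comes from this evaluation, while the 3,000-token response length is an assumed round number for illustration:

```hcl
# Back-of-envelope session cost at the Sonnet pricing quoted above.
# The 3,000-token response length is an assumption, not a measured figure.
locals {
  input_rate  = 3.00 / 1000000  # dollars per input token
  output_rate = 15.00 / 1000000 # dollars per output token

  activation_cost = 1680 * local.input_rate  # ~$0.005 of skill overhead
  response_cost   = 3000 * local.output_rate # $0.045, roughly 9x the overhead
}
```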
The practical implication depends on how you use Claude. On a Pro subscription, you're not paying per token, but you are spending from a session allowance that resets on a timer. A configuration that uses more tokens per task means fewer tasks before you hit that limit. If you do heavy Terraform work (multiple modules, back-to-back sessions), the choice of skill affects how far your allowance stretches across the day. Run this in a CI pipeline or trigger it on every PR and the cost figures above become line items in your cloud spend; small per-task differences compound fast when you're doing hundreds of runs a day.
Results
Task by task
| Task | terraform-skill | TerraShark | hashicorp | No skill |
|---|---|---|---|---|
| 1 · VPC greenfield | FAIL¹ $0.066 | PASS $0.063 | PASS $0.049 | PASS $0.062 |
| 2 · IAM role | PASS $0.034 | PASS $0.063 | FAIL² $0.029 | PASS $0.049 |
| 3 · RDS greenfield | PASS $0.066 | PASS $0.063 | PASS $0.064 | PASS $0.062 |
| 4 · infracost | PASS $0.038 | PASS $0.041 | PASS $0.032 | PASS $0.032 |
| 5 · Refactor | PASS $0.034 | PASS $0.055 | PASS $0.027 | PASS $0.028 |

PASS = terraform validate succeeded · FAIL = genuine HCL error

¹ terraform-skill, Task 1: ternary operator spanning lines incorrectly in route table tags block · ² hashicorp-style-guide, Task 2: reference to undeclared data.aws_ami and module.iam
The findings
What the data actually showed
On the IAM task, TerraShark's output included force_detach_policies on the role, and a post-apply check block to verify the role exists after apply. None of the other three configurations produced these controls unprompted. This is TerraShark doing exactly what it was designed for: failure-mode awareness surfacing as concrete security controls in generated code.
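For readers who haven't seen these controls, here is a hedged reconstruction of roughly what they look like. The resource names and assume-role policy are invented; only the two controls themselves, force_detach_policies and the post-apply check block, come from the evaluation.

```hcl
resource "aws_iam_role" "this" {
  name                  = "app-role" # illustrative name
  force_detach_policies = true       # allow destroy even if policies are still attached

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# Post-apply assertion (Terraform 1.5+ check block): warns if the role
# cannot be read back after apply.
check "role_exists" {
  data "aws_iam_role" "verify" {
    name = aws_iam_role.this.name
  }

  assert {
    condition     = data.aws_iam_role.verify.arn != ""
    error_message = "IAM role was not found after apply."
  }
}
```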
Task 5's starting code contained a deliberate flaw: s3:* on Resource = "*". Three configurations carried it forward silently: no-skill, terraform-skill, and hashicorp-style-guide all refactored the code without flagging the overly permissive policy. TerraShark was the only configuration that caught it; it surfaced the wildcard as an explicit variable with a BLAST RADIUS WARNING comment and instructed the caller to override it for production. It didn't silently fix it, which is arguably more honest: the caller needs to make that decision consciously.
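A hedged sketch of that pattern, with invented variable names: the wildcard becomes an overridable input with defaults that preserve the original behaviour, rather than being silently narrowed.

```hcl
# BLAST RADIUS WARNING: the defaults preserve the original s3:* on all
# resources. Override both variables for production with the minimum needed.
variable "s3_actions" {
  type        = list(string)
  description = "S3 actions granted to the role. Default keeps the original wildcard."
  default     = ["s3:*"]
}

variable "s3_resources" {
  type        = list(string)
  description = "Resource ARNs the actions apply to. Default keeps the original wildcard."
  default     = ["*"]
}
```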
The call
What I'd actually do
Decision framework
The takeaway
A framework, not a ranking
I started this because I was curious about cost. Which skill was the cheapest to run? Was the token overhead worth it? Those felt like the right questions going in.
They turned out to be the least interesting findings. The cost differences between configurations are small enough that they only matter at scale or in automation. What I didn't expect was how clearly the evaluation would reveal the character of each skill, not just whether it produced valid HCL, but how it thinks about infrastructure problems.
TerraShark caught a deliberate security flaw that three other configurations missed entirely. terraform-skill consistently generated well-structured, opinionated modules. hashicorp-style-guide produced the cleanest greenfield VPC on the first attempt. Bare Claude delivered production-quality output at the lowest cost across every task.
None of those is a reason to always use one and never use the others. They're tools with different strengths, and the engineers who'll get the most out of them are the ones who pick deliberately rather than defaulting to whatever was installed first.
If I had to reduce it to one sentence: use no-skill as your default, reach for terraform-skill when structure matters, and load TerraShark when you're working with code you didn't write.
The prompts
What I asked
Same prompt, four configurations, fresh session each time. Tasks 4 and 5 included a starting module pasted at the end of the prompt.
Task 1 — VPC greenfield
Create a Terraform module for a VPC on AWS with public and private subnets across two availability zones, a NAT gateway, and appropriate route tables. Use variables for the VPC CIDR, environment name, and region. Follow current best practices.
Task 2 — IAM role
Create a Terraform resource for an IAM role that allows an EC2 instance to read from a specific S3 bucket. The bucket name should be a variable. Apply least-privilege principles — no wildcards on actions or resources.
Task 3 — RDS greenfield
Create a Terraform module for an RDS PostgreSQL instance in a private subnet. Include a subnet group, parameter group, and security group that allows access only from a specified CIDR. Use variables for instance class, storage, database name, and environment.
Task 4 — infracost augmentation
Update this Terraform module to support infracost cost estimation. Add any required Terraform configuration changes only — do not generate CI/CD pipelines, GitHub Actions workflows, or shell scripts. Here is the existing module:
Starting module
```hcl
provider "aws" {
  region = "us-east-1"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  tags       = { Name = "main-vpc" }
}

resource "aws_subnet" "public" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
  tags              = { Name = "public-subnet" }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "main-igw" }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}
```
Task 5 — refactor
Refactor this Terraform code into a reusable module with a clean variable interface. Do not change what the infrastructure does — only the structure. Here is the existing code:
Starting module — note the deliberate s3:* wildcard in the IAM role policy
```hcl
provider "aws" {
  region = "us-west-2"
}

resource "aws_s3_bucket" "data" {
  bucket = "my-company-data-bucket"
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_iam_role" "app" {
  name = "app-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "app" {
  name = "app-policy"
  role = aws_iam_role.app.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "s3:*" # the deliberate wildcard
      Resource = "*"
    }]
  })
}

resource "aws_instance" "app" {
  ami                  = "ami-0c55b159cbfafe1f0"
  instance_type        = "t3.medium"
  iam_instance_profile = aws_iam_instance_profile.app.name
  tags                 = { Name = "app-server" }
}

resource "aws_iam_instance_profile" "app" {
  name = "app-profile"
  role = aws_iam_role.app.id
}
```
One caveat
What this evaluation can and can't tell you
I ran this via the Anthropic API against Claude Sonnet 4.6 in May 2026. The cost figures and token counts are accurate for that model and pricing; both will drift as Anthropic updates pricing and as skill authors update their content.
What should hold longer than any specific number: the character of each skill. A structure-first tool, a failure-mode tool, and a style tool have different instincts that don't change with a version bump. The wildcard IAM finding and the IAM security depth findings reflect those instincts directly, and those are the findings worth carrying forward.
Questions about this evaluation or how I approached it? Get in touch.