I’ve spent over eight years in FinOps, and if there’s one lesson that keeps reinforcing itself, it’s this: the visibility you need rarely comes out of the box. You have to build it.

Wiv.ai is a FinOps automation platform built by practitioners.

When we ran into a cost attribution challenge inside our own product, it wasn't about proving a point; it was about necessity. There was no off-the-shelf solution that could handle the complexity of attributing costs in a multi-tenant, serverless environment. So we did what automation-first teams do: we built exactly what we needed.

This post is a look at that journey: how we engineered end-to-end attribution, what we learned, and why tailored automation is often the only path to real FinOps maturity.

The Challenge: When Aggregate Costs Tell You Nothing

Our platform runs hundreds of thousands of workflows every month for multiple customers. Each workflow is orchestrated through AWS Step Functions, triggering Lambda functions for lightweight tasks and Fargate containers for heavy compute. Some workflows spawn thousands of parallel executions through Distributed Map states; all told, we're talking tens of millions of individual step executions monthly.

AWS happily bills us for all of this. Step Functions charges per state transition. Lambda charges per GB-second of compute. Fargate charges per vCPU and memory hour. At the end of the month, we see the totals in Cost Explorer. But here’s what AWS doesn’t tell us: which customer drove those costs? Which workflow types are expensive? Why did our ECS costs grow last month, and which tenant was responsible?

This isn’t a tagging problem. You can tag Step Functions state machines, but you can’t tag individual executions. When multiple customers share the same infrastructure, which they do in any multi-tenant architecture, tags become meaningless for attribution.

Organizations in the early stages of their FinOps journey (what the FinOps Foundation calls the “Crawl” phase) often stop here. They see aggregate costs, maybe broken down by service or tag, and assume that’s the best they can do. But if you’re operating a platform where understanding unit economics matters, where you need to know the true cost of serving each customer, that level of visibility isn’t enough.

The Approach: Embedding Attribution at the Source

The insight that unlocked our solution was simple: if AWS won't attribute costs for us, we need to capture the attribution data ourselves, at the moment of execution, not after the fact.

Every workflow execution in our system now carries attribution metadata. This data travels through the entire execution chain, passed as input to Lambda functions and injected as environment variables into Fargate tasks:

{
  "tenant_id": "customer-abc-123",
  "workflow_id": "cost-optimization-pipeline",
  "workflow_execution_id": "exec-789xyz"
}

The key insight here is that we’re not changing how AWS bills us. We’re creating a data layer that lets us attribute those bills accurately. The metadata exists at execution time; we just need a system to collect it and connect it to actual costs.
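
To make the flow concrete, here is a minimal sketch of how metadata like this can be attached when an execution is kicked off. It assumes boto3, and the function and field names are illustrative rather than our production code:

import json
import boto3

sfn = boto3.client("stepfunctions")

def start_attributed_workflow(state_machine_arn, tenant_id, workflow_id, payload):
    # The attribution metadata rides along in the execution input, so every
    # Lambda task receives it in its event, and ECS/Fargate task states can
    # map it into container environment variables.
    attribution = {
        "tenant_id": tenant_id,
        "workflow_id": workflow_id,
    }
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"attribution": attribution, "payload": payload}),
    )
    return response["executionArn"]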

Building the Collection System

With metadata in place, we built a real-time cost collection pipeline. When a workflow execution completes, a post-processing step sends the execution data to SNS, which triggers our data collection script. This happens in real time: every execution gets processed as it finishes.

The collection script queries the Step Functions Execution History API to get the complete event timeline for that execution. This includes every Lambda invocation, every Fargate task submission, every state transition. The execution history is where the cost data actually lives; it tells us exactly what happened during the workflow run.
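
A simplified sketch of that entry point and the history fetch, assuming boto3; the SNS message field name is illustrative:

import json
import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # SNS-triggered entry point: one message per completed workflow execution.
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        events = fetch_execution_history(message["execution_arn"])  # illustrative field name
        # ...the cost calculation described below runs over `events`...

def fetch_execution_history(execution_arn):
    # Page through the complete event timeline for a single execution.
    events, token = [], None
    while True:
        kwargs = {"executionArn": execution_arn, "maxResults": 1000}
        if token:
            kwargs["nextToken"] = token
        page = sfn.get_execution_history(**kwargs)
        events.extend(page["events"])
        token = page.get("nextToken")
        if not token:
            return events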

For each event, we calculate the actual cost using AWS's pricing. Lambda costs come from execution duration multiplied by the function's memory allocation (GB-seconds), plus per-invocation fees. Fargate costs come from vCPU-hours and memory GB-hours, with a minimum billing period of 60 seconds. Step Functions costs come from counting state transitions.
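
The rate math itself is simple once you have the right inputs. The sketch below uses example us-east-1 on-demand prices as constants; they are illustrative, not a pricing reference, and in practice you would pull current rates rather than hard-code them:

# Example us-east-1 on-demand rates; illustrative only, check current pricing.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
LAMBDA_PRICE_PER_REQUEST = 0.0000002       # $0.20 per 1M invocations
FARGATE_PRICE_PER_VCPU_HOUR = 0.04048
FARGATE_PRICE_PER_GB_HOUR = 0.004445
SFN_PRICE_PER_TRANSITION = 0.000025        # $25 per 1M state transitions

def lambda_cost(duration_ms, memory_mb, invocations=1):
    gb_seconds = (duration_ms / 1000.0) * (memory_mb / 1024.0)
    return gb_seconds * LAMBDA_PRICE_PER_GB_SECOND + invocations * LAMBDA_PRICE_PER_REQUEST

def fargate_cost(runtime_seconds, vcpu, memory_gb):
    billable = max(runtime_seconds, 60)    # Fargate bills a 60-second minimum
    hours = billable / 3600.0
    return vcpu * hours * FARGATE_PRICE_PER_VCPU_HOUR + memory_gb * hours * FARGATE_PRICE_PER_GB_HOUR

def step_functions_cost(state_transitions):
    return state_transitions * SFN_PRICE_PER_TRANSITION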

The attributed metrics are then stored, ready for analysis and reporting.

We also maintain DynamoDB as our source of truth for execution data; it contains all the information we need to understand what's running across our platform. This becomes essential for our backfill capability, which I'll explain shortly.

The Complexity We Discovered

What seemed like a straightforward calculation (add up the compute time and apply AWS rates) turned out to be anything but. Our first version showed costs far lower than what AWS was actually billing us. The culprit? Distributed Maps spawn thousands of child executions, each running its own compute, and we were only looking at the parent. We added recursive processing to follow the full chain.
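
In rough form, that fix is a recursive walk like the one below. It assumes boto3, pagination is elided for brevity, and cost_of_history stands in for the per-event rate math sketched earlier:

import boto3

sfn = boto3.client("stepfunctions")

def cost_of_history(events):
    # Placeholder for the per-event rate math sketched earlier.
    return 0.0

def execution_cost(execution_arn):
    # Cost of one execution plus all of its Distributed Map children, recursively.
    history = sfn.get_execution_history(executionArn=execution_arn, maxResults=1000)
    total = cost_of_history(history["events"])

    # Each Distributed Map state produces a "map run" whose child executions
    # carry their own compute; follow every one of them.
    for map_run in sfn.list_map_runs(executionArn=execution_arn)["mapRuns"]:
        children = sfn.list_executions(mapRunArn=map_run["mapRunArn"])
        for child in children["executions"]:
            total += execution_cost(child["executionArn"])
    return total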

Then we found our Fargate numbers were 28% too high. Step Functions tells you when it submitted a task and when it got a result back, but that’s not when the container actually ran. We had to dig into the task output to find the real timestamps.
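
Roughly, the correction means pulling container-level timestamps out of the task's output in the execution history rather than using the state's enter and exit times. The field names and units below reflect what we observed in the ECS integration output; treat them as assumptions and verify against your own payloads:

import json

def fargate_runtime_seconds(task_state_exited_event):
    output = json.loads(task_state_exited_event["stateExitedEventDetails"]["output"])
    # StartedAt / StoppedAt are container-level timestamps; confirm their unit
    # (seconds vs. milliseconds) in your own payloads before trusting the math.
    started, stopped = output["StartedAt"], output["StoppedAt"]
    return max(stopped - started, 60)   # assuming seconds, with Fargate's 60-second minimum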

Along the way, we discovered an entire category of compute we'd overlooked: our heaviest workloads, running on 8x the resources of everything else we were tracking. And Lambda costs depend on memory allocation, a value that isn't even in the execution history, so we added lookups to capture what each function actually uses.
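
The memory lookup can be as small as this sketch, assuming boto3; a cache keeps us from hammering the Lambda API on every event:

import functools
import boto3

lambda_client = boto3.client("lambda")

@functools.lru_cache(maxsize=None)
def function_memory_mb(function_arn):
    # Memory allocation isn't in the execution history, so look it up once
    # per function and reuse it for the GB-second calculation.
    config = lambda_client.get_function_configuration(FunctionName=function_arn)
    return config["MemorySize"]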

Each fix felt like peeling back another layer. But getting this foundation right was essential: without accurate execution costs, tenant-level attribution would be meaningless.

Finally, as an automation platform, we had to continuously check ourselves. We created a coverage report to make sure we aren't introducing breaking changes to our collection logic and that our attributed costs account for at least 90% of what AWS actually bills us.
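
A simplified version of that check, assuming Cost Explorer access via boto3; the service filter and the 90% threshold are illustrative:

import boto3

ce = boto3.client("ce")

def coverage_ratio(attributed_total, start, end, services):
    # Compare what our pipeline attributed with what AWS actually billed
    # for the same services over the same period (dates as "YYYY-MM-DD").
    billed = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {"Key": "SERVICE", "Values": services}},
    )
    billed_total = sum(
        float(period["Total"]["UnblendedCost"]["Amount"])
        for period in billed["ResultsByTime"]
    )
    return attributed_total / billed_total if billed_total else 0.0  # alert below 0.9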

The Backfill Layer

Beyond real-time processing, we built a separate backfill capability. This script can process historical executions for any date range, pulling executions from DynamoDB (our source of truth) and recalculating costs for each execution.
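
Conceptually, the backfill is a loop like the one below; the table name and key schema are hypothetical stand-ins, not our actual DynamoDB layout:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("workflow-executions")   # hypothetical table name

def recalculate_costs(execution_arn):
    # Placeholder for the same collection logic the real-time pipeline runs.
    pass

def backfill(tenant_id, start_date, end_date):
    # Re-run cost attribution for every stored execution in a date range.
    response = table.query(
        KeyConditionExpression=Key("tenant_id").eq(tenant_id)
        & Key("started_at").between(start_date, end_date)   # hypothetical key schema
    )
    for item in response["Items"]:
        recalculate_costs(item["execution_arn"])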

The backfill serves multiple purposes. It lets us fill gaps if the real-time pipeline has issues. It enables debugging when we suspect cost calculation errors. And it gives us a governance layer: we can re-process historical data when we improve our calculation logic, ensuring our attribution data stays accurate over time.

What This Visibility Enables

With granular cost attribution in place, we moved from asking “how much did Step Functions cost?” to asking much more useful questions.

We can now see unit economics at the execution level. Some workflows cost a few cents per execution; others cost significantly more. Understanding why, whether it's the number of steps, the compute intensity, or the parallelism level, lets us make informed decisions about pricing, architecture, and optimization.

We identified inefficient workflows that were spawning tens of thousands of child executions when they didn’t need to. That kind of insight doesn’t appear in aggregate billing. It focused our optimization efforts where they’d have the biggest impact, at the step level within specific workflows.

We can look at our most expensive workflows grouped by tenant to understand how our platform operates across small, medium, and enterprise customers. We can see how usage patterns evolve over time.

Most importantly, we moved from trying to understand why our ECS costs grew, to knowing exactly which tenant drove that increase. Investigating cost changes became a straightforward query rather than a FinOps mystery.

From Attribution to Anomaly Detection

Once you have cost attribution at the tenant level, detecting anomalies becomes possible. We built a system on Wiv.ai that tracks each customer’s cost patterns over time, establishes baselines, and alerts when something changes unexpectedly.

The anomaly report surfaces patterns like spikes or consecutive cost increases sustained over time, with total dollar impact quantified. It detects sudden changes in executions, identifies new tenants appearing in our cost data, and ranks everything by severity.
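
As a deliberately simplified illustration of the idea (the production logic is richer), a per-tenant baseline check can be as small as this:

from statistics import mean, stdev

def is_anomalous(daily_costs, threshold=3.0):
    # daily_costs: a tenant's daily attributed costs, oldest first.
    history, latest = daily_costs[:-1], daily_costs[-1]
    if len(history) < 7:
        return False                      # not enough baseline yet
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return latest > baseline * 1.5    # flat baseline: flag a clear jump
    return (latest - baseline) / spread > threshold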

This isn’t possible without attribution. You can’t detect that a specific customer’s costs are anomalous if you don’t know what their costs are in the first place.

Why This Matters for FinOps Maturity

The FinOps Foundation describes maturity across three phases, Crawl, Walk, and Run, applied to each capability dimension: Inform, Optimize, and Operate. Most organizations I've worked with are comfortable in Crawl: they have dashboards, they see aggregate costs, they can answer basic questions about spending trends.

Moving to Walk and Run requires going deeper. It means having granular enough data to understand unit economics, to attribute costs to specific activities, to detect anomalies before they become problems. That level of visibility often doesn’t come from tools you can buy. It comes from engineering work that connects your specific architecture to your specific cost drivers.

Mature FinOps means building the visibility you need. True cost attribution requires getting your hands dirty with the data.

This isn’t about replacing FinOps tools,  they’re valuable for aggregate analysis and broad optimization. It’s about recognizing that complex architectures have complex cost drivers, and surfacing those drivers often requires custom engineering.

If you’re running a multi-tenant platform, a serverless architecture with nested executions, or any system where the relationship between activity and cost isn’t obvious from your cloud bill,  this kind of work is worth the investment. Without it, optimization decisions are guesses. Pricing decisions are based on averages. Capacity plans are built on assumptions.

The time you invest in building true cost visibility pays dividends in every decision that follows.