BOYD and the CUR(s) - Cloud Archaeologist

BOYD supports all AWS Cost and Usage Report versions currently available, but which should you choose?

This question matters a lot if you are building data pipelines to process CUR data, but, technically speaking, BOYD doesn’t really care. Some nuances in each version’s format may slow down data collection slightly, but think of it as shoveling loosened mud as opposed to harder earth. So let’s briefly walk through the options and challenges.

Legacy CUR/CUR V1 vs CUR V2

The primary issue with CUR V1 is schema-related. Columns in the dataset might not be present in all billing periods. For example, if you add and remove cost allocation tags, each removed tag will still exist as a column in the billing periods when it was active and be missing from the rest. Lots of data processing things do not like this, preferring a standard schema across the dataset.

CUR V2 fixed this by pushing the pain further downstream. Instead of inconsistent schemas and individual columns for potentially evolving fields, CUR V2 uses nested structures to group multiple columns of data. What does that mean? Instead of a column for each tag, you now get a single column called “resource_tags” that contains all the tag key and value pairs. The same is true for “product” and a few other columns (like “discount” and “cost_category”). Now the data pipeline-y things are happier, but the parsing and extraction of those nested columns becomes someone else’s pain.

Why does this matter to BOYD? A large part of what makes BOYD possible is the Parquet file format, where only the columns queried are actually scanned. To do something similar with a CSV file, the entire file would need to be scanned and pre-loaded, adding intensive CPU and memory usage and time to both collecting and analyzing the results, even when you only need a fragment of the data. With Parquet, only the chosen columns are scanned and returned, reducing the data retrieved and the processing overhead. Processing a single tag column for millions of rows is much more efficient than parsing millions of nested values just to see if the tag is present. This is as essential for BOYD retrieving results from S3 as it is for local processing and analysis.

BOYD’s utility is in tracking and analyzing resources, which equates to building filters to apply context. When dealing with nested columns (which we meet again in the FOCUS version), BOYD will take an additional step during the collection process to extract the nested columns. So instead of a “resource_tags” column with all the things, the user will be able to work with each column independently and without additional overhead each time they leverage a tag column for filtering or analysis.
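To make the extraction step concrete, here is a minimal sketch in plain Python of turning one nested tag column into a flat column per tag. It is an illustration of the idea only, not BOYD's actual implementation: the function name flatten_tags, the "tag_" prefix, and the sample rows and tag keys are all made up for the example.

```python
def flatten_tags(rows, nested_column="resource_tags", prefix="tag_"):
    """Expand a nested key/value column into one flat column per key.

    Hypothetical helper for illustration; not part of BOYD.
    """
    # First pass: collect every key seen, so all rows end up with the same columns.
    keys = sorted({k for row in rows for k in (row.get(nested_column) or {})})

    flat_rows = []
    for row in rows:
        nested = row.get(nested_column) or {}
        # Copy every column except the nested one.
        flat = {name: value for name, value in row.items() if name != nested_column}
        # Add one column per tag key; rows without that tag get None.
        for key in keys:
            flat[prefix + key] = nested.get(key)
        flat_rows.append(flat)
    return flat_rows


# Made-up sample rows shaped like a CUR V2 extract with a nested tag column.
rows = [
    {"line_item_resource_id": "i-0abc", "resource_tags": {"user_team": "data", "user_env": "prod"}},
    {"line_item_resource_id": "vol-0def", "resource_tags": {"user_env": "dev"}},
    {"line_item_resource_id": "nat-0123", "resource_tags": None},
]

flat = flatten_tags(rows)

# Filtering is now a plain column comparison, with no per-row parsing of nested data.
prod_ids = [r["line_item_resource_id"] for r in flat if r["tag_user_env"] == "prod"]
print(prod_ids)  # ['i-0abc']
```

The cost of walking the nested values is paid once, up front, which is the trade-off described above: a little more work during collection in exchange for cheap filtering afterwards.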

Since BOYD can already handle column inconsistency month to month, CUR V1 and V2 work almost interchangeably with BOYD, meaning you can start with V1 and change to V2 later if needed without reprocessing data or losing history.
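As a rough picture of what handling month-to-month column inconsistency can look like, the sketch below takes rows from two billing periods with different tag columns and aligns them to one shared schema, filling the gaps with None. How BOYD does this internally isn't covered in this post, so treat the approach, the align_periods name, and the sample data as assumptions made for illustration.

```python
def align_periods(periods):
    """Give every billing period the same columns, filling gaps with None.

    Hypothetical helper for illustration; not part of BOYD.
    """
    # Build the union of column names, keeping first-seen order.
    columns = []
    for rows in periods.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)

    # Rewrite each row against the shared column list.
    aligned = []
    for period, rows in periods.items():
        for row in rows:
            record = {"billing_period": period}
            record.update({name: row.get(name) for name in columns})
            aligned.append(record)
    return columns, aligned


# Made-up data: a tag was removed and another added between the two months.
periods = {
    "2024-01": [{"resource_id": "i-0abc", "cost": 1.25, "tag_team": "data"}],
    "2024-02": [{"resource_id": "i-0abc", "cost": 1.10, "tag_project": "boyd"}],
}

columns, aligned = align_periods(periods)
print(columns)  # ['resource_id', 'cost', 'tag_team', 'tag_project']
for record in aligned:
    print(record)
```

Once nested V2 columns have been flattened, V2 data fits the same pattern as V1, which is one way to see why the two versions can sit side by side in the same history.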

FOCUS

The FOCUS version is a bit different. Because it aligns to a standard promoted by the FinOps Foundation, users gain the benefit of the “EffectiveCost” column and a column naming convention that provides consistency with other providers who adopt the format. What we lose is consistency with V1 and V2. Additionally, FOCUS reports are currently only available as hourly reports. Depending on how much data you’re trying to analyze with BOYD, hourly granularity might produce too much data for the task, but for smaller shops or users wanting to explore the FOCUS results it is more than viable.

How does BOYD rank in terms of handling each? CUR V1 is the most efficient for both collection and local analysis. V2 is slightly less efficient during collection if you choose a nested column like tags: where V1 only needs to scan the specific tag columns selected, V2 requires every nested value to be extracted into columns. FOCUS is the least efficient due to its hourly granularity and nested columns.

In our next post, we’ll explore key columns and ways to get the best performance out of BOYD, whichever version you’re using.