Ignored By Dinosaurs 🦕

I've been hacking on a side project lately to try and open source some of the bones of a FinOps visibility tool. You can find the FinOpsPod episode I recorded on the topic recently here. Well, now that that's out, I've been properly motivated to ship, and while AWS is done enough for now, I spent the weekend wrangling the Azure side of things. This is what I learned in the last 72 hours.

Azure Blob Storage download progress with TQDM

I searched the internet high and low for how to handle this. AWS makes it fairly easy with the Callback argument you can pass when downloading an object from S3. I guess Azure's version is more recent and it goes like this:

# at the top of the module:
import os

from tqdm import tqdm

def download_object(self, blob_name, destination_path):
    blob_client = self.container_client.get_blob_client(blob_name)
    filesize = blob_client.get_blob_properties().size
    # just in case the destination path doesn't exist
    os.makedirs(os.path.dirname(destination_path), exist_ok=True)
    with open(destination_path, "wb") as file, tqdm(
        total=filesize,
        unit="B",
        unit_scale=True,
        desc="whatever helpful file description you wish",
        colour="green",
    ) as t:
        bytes_read = 0

        # the progress_hook is called with 2 args: the bytes downloaded so far
        # and the total bytes of the object.  t.update wants the bytes read in
        # this iteration, so we keep track of the running total as of the
        # previous call and pass the difference.
        def update_progress(bytes_amount, *args):
            nonlocal bytes_read
            t.update(bytes_amount - bytes_read)
            bytes_read = bytes_amount

        blob_data = blob_client.download_blob(progress_hook=update_progress)
        file.write(blob_data.readall())
        # no explicit t.close() needed; the with block handles it

In the API docs for the download_blob function you can find the progress_hook kwarg. It isn't called as often as its AWS counterpart, so the progress bar isn't nearly as fine-grained, but it's better than nothing in my opinion. The whole thing generally requires more wrangling than the AWS version, but I learned quite a lot in the process.

DuckDB, the ultimate dataeng swiss army knife?

One helpful thing that AWS does in their billing data export is to include a metadata file with each night's export that tells us facts about the export in general. Things like

  • the timestamp that the export was generated
  • where you can find the associated billing data files
  • a unique ID for that particular version of the export, and most helpfully
  • a list of columns and their datatypes in that export.
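Consuming a metadata file like that is a few lines of stdlib JSON. Note that the key names and layout below are invented for illustration; the real AWS manifest's schema is different:

```python
import json

# a stand-in for the export's metadata file; the real key names in AWS's
# manifest differ, this is just the shape of the idea
manifest = json.loads("""
{
  "exportTimestamp": "2024-06-01T02:15:00Z",
  "executionId": "abc-123",
  "dataFiles": ["exports/2024-06/part-0001.csv.gz"],
  "columns": [
    {"name": "line_item_usage_amount", "type": "Decimal"},
    {"name": "line_item_usage_start_date", "type": "DateTime"}
  ]
}
""")

# the column list is the most useful bit: it can drive CREATE TABLE
# statements in the warehouse without hand-maintaining a schema
schema = {col["name"]: col["type"] for col in manifest["columns"]}
```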

For this side project I'm using Clickhouse as the backend warehouse. It's really fun to run huge queries on a huge dataset and have them come back in what feels like 100ms, so I'm a rather big fan of Clickhouse at this point though I'm only just getting to know it. There are fussy things, too. Things like its CSV importing, which is ... not super friendly. Here's an example:

Azure's billing exports generate with a Date field that tells you the date of the charge/line item. For some reason, even though my employer is a French company and our bill is in euro, all of the date fields in this bill come across in the US date format – MM/DD/YYYY. After exhaustive searching, I did find a clue in the Clickhouse documentation that it could parse US-style date strings, but I cannot find that piece of documentation again, AND it was only available after you'd gotten the data into the warehouse (presumably as a String). I want the thing stored as a date to begin with, so I started to wonder if I could grab DuckDB and use it to parse this stupid Date column for me correctly.

The answer is yes. DuckDB is also a pretty cool piece of gear, so I'm playing with both of these open-source columnar things at the moment. One thing Duck has gone out of their way to do is make data ingestion super easy: their extremely generous default CSV importer lets you specify all the little weird things that can go wrong, things like “hey, the dateformat should look like this – {strptime string}”. Super cool and works like a charm, so now I have this CSV in memory as a DuckDB table. What else can I do with it?

Well, why spit it back out as CSV when I could spit it back out as Parquet? Clickhouse will have a much easier time reading a Parquet file since it comes along with all the column names and datatypes, so that's what I'm doing. So, I have this function that downloads all the data_files for a given billing export; for the sake of brevity I'll put it here in its current, non-optimized form:

# at the top of the module:
import os
from typing import List

import duckdb

def download_datafiles(self, data_files: List[str]):
    local_files = []
    # download each of the CSV files into a local tmp directory
    for data_file in data_files:
        destination_path = f"tmp/{data_file}"
        print(f"Downloading to {destination_path}")
        self.storage_client.download_object(data_file, destination_path)
        local_files.append(destination_path)
    dirname = os.path.dirname(local_files[0])
    con = duckdb.connect()
    # Here we convert the CSVs to Parquet, because DuckDB is excellent at
    # parsing CSV and Clickhouse is a bit fussy in this regard.  The Azure
    # files come over with dates in the format MM/DD/YYYY, which DuckDB
    # can be made to deal with, but Clickhouse cannot.

    # Moreover, Duck can grab any number of CSV files in the target directory
    # and merge them all together for me.  This allows me to generate a single
    # Parquet file from all the CSV files in the directory.  Given that Azure
    # doesn't even gzip these files, this turns 500MB of CSV into 20MB of
    # Parquet.  Not bad.
    con.sql(
        f"""CREATE TABLE azure_tmp AS SELECT * FROM read_csv('{dirname}/*.csv',
            header = true,
            dateformat = '%m/%d/%Y'
        )"""
    )
    con.sql(
        "COPY (SELECT * FROM azure_tmp) TO 'tmp/azure-tmp.parquet' (FORMAT 'parquet')"
    )
    con.close()
    # yes, we do two things in this function right now, it's ok.  We'll
    # refactor and probably use the "parse the columns and datatypes out of
    # this parquet table" step all over the place.
    return "tmp/azure-tmp.parquet"

#data #data-engineering #azure #finops

Hey there, just wanted to leave a signpost for you. My use cases lately have been something like -

Computing a BIG table, with a lot of math in it, over a LOT of rows of data, and then joining in other data to enrich the primary set. Specifically, this is container usage data, which I'm attempting to blend with our AWS bill to arrive at something like “cost per container” per time period.

I don't want to have to rebuild this table every day because most of the data is static once it shows up in the warehouse. An incremental strategy would be perfect BUT, some of this data arrives late, which means that if I do the standard DBT routine of

WHERE timestamp > (SELECT MAX(timestamp) FROM {{ this }})

then I will have gaps. Indeed, I have gaps. I haven't rolled out any reporting on this table, or made any announcements because I felt a disturbance in the force, confirmed by some light analysis this morning.

I've recently discovered a new DBT hammer in the incremental_strategy parameter for incrementally built tables, and specifically the insert_overwrite option. From the DBT docs:

The insert_overwrite strategy generates a merge statement that replaces entire partitions in the destination table.

In short I can just always recompute yesterday and today, or the last 7 days, or whatever full partitions-worth of data I want. Yes, I'm recomputing more than I strictly need to, but it assures me that there will be no gaps in the results.

This operation seems pretty foolproof so far, check it out.
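For reference, a sketch of what such a model can look like. The exact config keys and date functions vary by dbt adapter, and the model and column names here are made up, so treat this as the shape of the idea rather than copy-paste:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by=['usage_date']
    )
}}

select
    usage_date,
    container_id,
    sum(cost) as cost
from {{ ref('container_usage') }}
{% if is_incremental() %}
-- recompute the last 7 days of partitions wholesale, so late-arriving
-- rows can't leave the gaps that a MAX(timestamp) filter does
where usage_date >= today() - 7
{% endif %}
group by 1, 2
```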

#analytics #DBT

Suppose a hypothetical organization that sells a product whose feature set and COGS closely follow those of a typical CSP like Amazon Web Services. That organization allows its customers to change products at will but must manually invoice a significant percentage of those sellables, so it needs a robust system to track changes to those sellables and ensure that they are properly charged at each turn of the billing cycle.

I'm picturing a reporting format that reports on 2 different types of metrics -

  • Accruing (non-temporal)
  • Static (temporal)

Accruing metrics are easy, they're things like outgoing bandwidth. These are capped monthly and overages should be trued up, therefore 2 measurements could be helpful on these -

  • Month to date sum (this will end up being your billable, since the bounds of the billing cycle are likely set at the calendar month)
  • Rolling 30 day average (typical month's usage, helps you notice a customer who is tracking above what you sold them)
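Both measurements fall out of a day-keyed usage series in a few lines; here's a sketch with made-up bandwidth numbers:

```python
from datetime import date, timedelta

# hypothetical daily outgoing bandwidth in GB, June 1 through July 15
usage = {date(2024, 6, 1) + timedelta(days=i): 10 + (i % 3) for i in range(45)}
today = date(2024, 7, 15)

# month to date sum: this becomes the billable number at cycle close
mtd = sum(
    gb for day, gb in usage.items()
    if (day.year, day.month) == (today.year, today.month) and day <= today
)

# rolling 30 day average: the customer's "typical" daily usage
window = [gb for day, gb in usage.items() if today - timedelta(days=30) <= day < today]
rolling_avg = sum(window) / len(window)
```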

Static metrics require a bit more understanding. These are things like CPU cores in a given VM that you're selling. The tin says “8 CPUs” and gives you a monthly rate for those 8 CPUs but you're allowed to upsize that 8 core machine any time you want. Those 8 cores might become 16 for a week, then back to 8. That means you're charging for neither the 8 core machine nor the 16 core machine, but a blend of both.

This is what I mean by “temporal”: you have to generate a time component, divide your 8 or 16 cores into that time component, prorate the usage by it, and ultimately arrive at a piece of usage that accrues just like the other.

Given the example of 8 cores to 16 cores and using a 30 day month (720 hours) we get something like this:

Firstly, you're actually charging for CPU-hours. If an 8 core machine is $720/month ($1/hour) and a 16 core machine is $1440 ($2/hour), then your hourly per-CPU rate is $0.125/hour. This makes it very simple to track (and bill!!) the changes to your sellables that your customers are using.
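A worked version of that proration (note that $720 for 8 cores over a 720 hour month is $1/hour, which works out to $0.125 per CPU-hour):

```python
# $720/month for 8 cores over a 720 hour month = $0.125 per CPU-hour
rate_per_cpu_hour = 720 / (8 * 720)

# the customer ran 8 cores for 23 days, then upsized to 16 for the final week
usage = [
    (8, 552),   # (cores, hours): 8 cores for 552 hours
    (16, 168),  # 16 cores for 168 hours
]

cpu_hours = sum(cores * hours for cores, hours in usage)  # 4416 + 2688 = 7104
charge = rate_per_cpu_hour * cpu_hours                    # $888 for the month
```

That $888 is neither the $720 flat rate for the 8 core machine nor the $1440 for the 16 core one, but the blend of both.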

The metrics you might want to watch on these types of sellables/COGS are almost the opposite of the accruing type:

  • Month to date average
  • Rolling 30 day average

The monthly vs. the 30 day average would tell you if they are tracking above or below recent historical averages. It would be trivial to compare the two and throw an alert if the month to date or shorter-term rolling average is trending significantly above the 30 day average.
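That comparison could be as small as a hypothetical helper like this (the 20% threshold is arbitrary):

```python
def trending_above(mtd_avg: float, rolling_30d_avg: float, threshold: float = 1.2) -> bool:
    # alert when month to date usage runs 20%+ above the recent baseline
    return mtd_avg > threshold * rolling_30d_avg

alert = trending_above(15.0, 11.0)  # True: 15 is more than 20% above 11
ok = trending_above(11.5, 11.0)     # False: within the threshold
```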

Note: I'm on vacation and just want to remember some stuff for when I get back so don't dock me on this, I'm just spitballing

#business #analytics

First day of my job at my current employer (almost 7 years ago now) I crack open the company handbook to start onboarding. I remember it saying something like “Reservations are the lifeblood of this company” and thinking “wow, what does that mean?”

I'd had a job prior to this one that had some Stuff on AWS so I was familiar with the concept – something something pay upfront, get reduced prices – but it was far from the lifeblood of the company. It was something the IT head took care of, and he didn't seem that pleased to do it either. So, six years after that I find myself in charge of FinOps here and it has a lot to do with reservations. Indeed, reservations are the main lever that you have to pull on the job. Let's talk about it…

What are reservations?

I recently heard reservations described as a “forward contract”, which Investopedia (one of the most useful resources in this leg of my career) describes thusly

A forward contract is a customized contract between two parties to buy or sell an asset at a specified price on a future date.

The promise of the cloud is that you, developer, can push a button and spin up a VM or any number of other network-connected computing resources that you didn't have to ask IT for. It's why the default pricing model for these resources is called “On Demand”. AWS invented this idea several years ago that if you committed to running that resource for a year or more, you could receive a reduced price. You could pay for the year either entirely upfront to receive the deepest discount, entirely over time as a monthly payment for a shallower discount, or somewhere in between – the “partial upfront”, which mixes an upfront payment with a monthly payment for those resources.
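To make the three payment options concrete, here's a sketch with invented numbers (the discount percentages are illustrative only, not AWS's actual rates):

```python
hours_per_year = 8760
on_demand_rate = 1.00  # $/hour, made up
on_demand_cost = on_demand_rate * hours_per_year

# all upfront: deepest discount, say 40% off, one payment on day one
all_upfront_payment = on_demand_cost * 0.60            # ~$5256 once

# no upfront: shallowest discount, say 30% off, spread over 12 months
no_upfront_monthly = on_demand_cost * 0.70 / 12        # ~$511/month

# partial upfront: in between, say 35% off, half now and half monthly
partial_total = on_demand_cost * 0.65
partial_upfront_payment = partial_total / 2            # ~$2847 once
partial_monthly = partial_total / 2 / 12               # ~$237/month
```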

At this point, you might be starting to learn concepts like “amortization” and “the time cost of money” to distinguish why you would choose one of these, but if not I suggest looking them up.

Uh, ok

There is a balancing act in place with reservations. Obviously you want to receive those lower rates wherever possible, but your resource usage might be, indeed probably is, rather variable. “Elastic”. Suppose you overpurchase a reservation, all upfront so that the entire year is paid for, but then the resource is turned off after 6 months: you've now paid for 6 months of capacity that nobody is using. So my latest analogy for what FinOps is largely about is a double-ended inventory job. You have an inventory of resources to cover with reservations, and you have an inventory of the reservations themselves. One can be created or turned off in an instant; the other lives for 1 or 3 years.

#business #finops

I've been kicking around this thought for a year or so now – to the outsider, a career in data looks like a technical path. The data practitioner learns SQL, uses it to query data stored in a database somewhere, and if you know enough SQL you can answer any question whose artifacts are stored in that database.

The reality is that SQL is the very last mile. SQL is code, and so it looks to the non-practitioner like the act of creation, like code written in any imperative language creates motion and process and a webapp or piece of automation that didn't exist before. SQL does not create. SQL encapsulates that which already exists as a business process.

SQL is a contract. SQL puts business conditions and processes into code. If the business processes are ill-defined, then the SQL that has to be written to handle all the various cases will sprawl. (Most business processes are ill-defined as it turns out, made up in a time of need by a human, and probably one who doesn't spend their day thinking about data modeling.) If the business process is well-defined, but the SQL author's understanding of it is wrong or incomplete, then you'll end up with a poorly written contract that spits out wrong or incomplete answers.

That's what makes Data the hard part, because to write that contract down always requires the author to have spent time reverse-engineering the business process. I view this as an inherent good for the business as a whole – it forces the business to reckon with itself and to better define how it operates. The road to get there is tough though and in my experience it's often the data analyst who is actually pulling the cart.

#analytics #business #databases

I had a really organized map of things in my head I'd like to tell my younger self about FinOps last night. This morning it is gone. Let this be a lesson to me – jot some notes down. It was a primer course, from the point of view of a data person who was placed in charge of a FinOps practice – how to think about FinOps, what data are you going to need, what do the terms and costs mean, etc.

So what is FinOps?

Well, it's the driest sounding topic that I've ever found incredibly interesting (so far). Essentially, the cloud has upended what used to be an agreeably distant relationship between Engineering teams and Finance teams.

If an Eng team needed to launch a new thing to the young internet in the year 1999, they went through a procurement process with their employer's Finance team. A server was purchased and placed in a rack somewhere and the interaction was largely done – Finance depreciated the hardware as they saw fit and Engineering optimized the workloads on that hardware as they saw fit. It was paid for, who cared after that?

Well, The Cloud screwed all that up. The cloud allows engineers to directly spend company money by pressing a button. Pressing buttons and launching resources without asking anybody is fun af, so Eng did it, lots. Some time later the bill comes to the old IT team or to Finance and friction entered the chat.

Finance could no longer control IT outflows. Engineering could no longer be totally ignorant of the company money they were spending. Both sides needed more information to do their jobs and make better decisions and into that dysfunctional gap grew the practice of FinOps.

How does FinOps Op?

“Financial Operations” is, I guess, what it stands for. See, cloud vendors – AWS, Google Cloud Platform aka GCP, and Azure (Microsoft's cloud) – don't make their money by making it easy for an Engineering team to understand the impact of their hardware decisions. They don't make their money by making it easy for Finance teams to surface anomalies in spending. They don't make their money by generating understandable reporting and forecasting tools. They make their money by selling Moar Cloud. And it turns out one of the easiest ways to sell Moar Cloud is by making all of the above as difficult as possible!

I'm being cheeky and slightly humorous, or so I tend to think over my morning coffee. Truth is, these are huge suites of very complex products, upon which the largest companies in the world are:

  • running their enormous, heterogeneous workloads
  • across dozens or hundreds of products within the cloud vendor's catalog and
  • asking to be able to report on any one of these workloads in a manner that fits their organization.

So what pops out of these requirements is typically a very granular bill with millions (or billions, so I hear) of line items. Those line items were generated by the various teams that built the products within the suite, so they tend to be pretty heterogeneous themselves in terms of data points and consistency.

This is where FinOps finally steps in. It's basically a heavily data-backed job of informing both sides of the equation in as close to realtime as possible about the workloads and the financial impact of the workloads.

I intend next chapter to talk about “reservations”, which is part of the bread and butter of the cost management and therefore FinOps domain.

#business #finops

I read an article earlier this week about lessons learned between $5MM and $100MM in ARR. To the layperson – this means growing a small company into a larger company, as measured by its yearly revenue.

One of the points in the article (maybe more, I don't remember) was about hiring, and it referenced the old adage

A players hire other A players. B players hire C players…

While this sounds like one of those BS businessisms that some capitalist dude came up with, I absolutely believe it to be true. The HN comments section had multiple threads with commenters asking the totally reasonable question “Who's hiring these B players anyway?”. After all, if all you have to do is only hire A players, why would anyone hire a B player in the first place?

I went for a jog yesterday and decided to imagine some of the scenarios that might lead to B player infiltration of a company…

—————————

I imagine a common scenario is known in some circles as the Peter Principle. A talented IC (individual contributor, i.e. not a manager) is promoted into management. The IC work that came naturally to them is no longer their job and they have to learn a new set of skills to be an effective manager.

These skills are, frankly, not their thing and so they don't pick them up as readily and as hungrily as the more fun thing they used to do. One of those skills is learning how to hire good people. Their responsibilities and workload are growing every week, so eventually they have to hire but due to circumstance they rush through the process and hire a less than great teammate.

The formerly A player has committed a B player mistake. Will they learn from it and grow, or will they just put their head back down and keep moving?

——————————

Sometimes B and C players actually do hire A players. B players aren't dumb, after all, they do want to hire good talent. They just don't possess the skills or the confidence or the humility to grow that talent's potential, so they set about micromanaging them into C players.

——————————

I personally think this one is very common, but I've never seen it discussed – the B player founded the company. They were born into a wealthy family, they raised their first round off of family connections or their last name. They look the part, they belong to the right social circles and at the end of the day that counts for a lot in this society.

The B player founder is never challenged to do better, indeed they are surrounded by evidence of their skill and business acumen. They hire B player after B player into the senior leadership ranks and because they are already rich, and because they are smart enough to avoid running the company into the ground, the company keeps going.

The company thus has an entire leadership culture of B players and the last thing a B player wants is to let an A player anywhere in the room. Money has its own gravity, and so these companies end up succeeding anyway. It's depressing if you think about it too much.

————————————

So the answer in all 3 scenarios above to the question “who is hiring these B players in the first place” is your leadership.

#business #management

Been reading the Harry Potter books for a few years with the family at night, and in the middle of the series we get introduced to this thing of Dumbledore's called the “Pensieve”, which is like a bowl into which Dumbledore can put his memories so he doesn't have to keep them all in his head.

I just realized I've been doing this, sort of, for the last year or so. I'm full-on manager now; all I do is phone calls for the first half of any given day. I started taking notes with pen and paper sometime last year. Lately it's a lot of thoughts I don't want to forget, or questions I want to ask without interrupting the speaker.

It started out organized: action items to follow up on, something that made me feel more on top of things. Now I just think the physical act of writing things down with a pen is really helpful, with a bonus that I have a few notes that make sense to me later when I look back on them.

#life

I'm working through some thoughts in my head about social media, as I've been doing since founding this blog well over a decade ago. Back then I thought it was going to be a savior of democracy in oppressed societies around the world, and we see how that's turned out.

Lately it's an issue closer to home. My kids are creators. At some point years back they got inspired by Captain Underpants and started making their own comic books. We have bookshelves full of 8 inch sketch pads from AC Moore (RIP) and our middle son especially made visual art and storytelling his thing. It's amazing to see how far he and they have come, especially with the storytelling part.

In recent years, they've taken to making movies with iMovie and other tools like it with an iPad. They've started making animated movie shorts as well. None of this is the stuff they want to post on YouTube, but I know it's coming.

My middle son went and signed himself up for a YouTube account. He's been posting content lately, mostly gameplay stuff from Minecraft. His brothers, of course, want their own YouTube accounts and so far I've said “no” without really understanding the why. Michelle hit on it the other day when she said to one of them “I don't want anyone telling you what you're worth” and I think that's it.

Especially when you're young, other people's opinions matter a lot. Social media is a wide open gate to put yourself out there and be judged in the form of likes and subscribes, and no matter what they think they will or won't care about, the human brain is wired to want to fit in with a community. The part that really bothers me WRT the boys is that the stuff that gets likes and subscribes is often the lowest common denominator and I don't want them molding their creativity around that. We've seen how that plays out.

#life

“Run your data team like it's a product team” was a common refrain at the DBT conference for the last two years. What does that mean? I am still figuring that out, but I have had an aha in the last few weeks about what a “data product” is, exactly.

Your standard software development process requires two main components to deliver the Things – a software developer and her wits. She takes those wits and expresses them into a text editor, and thereby makes something from truly nothing.

Data products differ in 1 key way – they require raw materials in the form of data. The process of building a data product therefore requires at least 1 additional step that standard software product development does not – refining that data into something consumable by the system that is delivering the product.

There can potentially be an additional step even before this one, which is to get the data in the first place. My current employer built an Observability suite and stack to be able to deliver metrics to our customers about their projects that they run/host here. This process took multiple quarters because the entire metrics creation and delivery pipeline had to be built from scratch. Once the data existed, it was then a process of refining the materials and building the product.

The good news is that many data products can be consumed in a standard way through some kind of BI or reporting or data visualization tool; we use Metabase. It has taken me a while to understand that the method of delivery of the products is the more standardized part, whereas the gathering and refinement of the raw materials/data is where the action is.

#analytics #business #data