I’ve been hacking on a side project lately to try and open-source some of the bones of a FinOps visibility tool. You can find the FinOpsPod episode I recorded on the topic recently here. Well, now that that’s out, I’ve been properly motivated to ship, and while AWS is done enough for now, I’ve been wrangling the Azure side of things over the weekend. This is what I learned in the last 72 hours.
Azure Blob Storage download progress with TQDM
I searched the internet high and low for how to handle this. AWS makes it fairly easy with the `Callback` argument you can pass when downloading an object from S3. I guess Azure’s version is more recent, and it goes like this: in the API docs for the `download_blob` function you’ll find the `progress_hook` kwarg. It isn’t called as often as its AWS counterpart, so the progress bar isn’t nearly as fine-grained, but it’s better than nothing in my opinion. The whole thing in general requires more wrangling than the AWS version, but I learned quite a lot in the process.
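A minimal sketch of wiring that into tqdm, with placeholder connection details. One difference worth knowing: where AWS’s `Callback` hands you the incremental bytes for each chunk, Azure’s `progress_hook` hands you the cumulative byte count plus the total, so you set the bar’s position rather than incrementing it:

```python
from azure.storage.blob import BlobClient
from tqdm import tqdm

# Placeholder connection details, for illustration only
blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="billing-exports",
    blob_name="part_0.csv",
)

size = blob.get_blob_properties().size
with tqdm(total=size, unit="B", unit_scale=True) as bar:

    def on_progress(current: int, total: int) -> None:
        # Azure reports cumulative bytes downloaded, so set the bar's
        # absolute position instead of calling bar.update()
        bar.n = current
        bar.refresh()

    with open("part_0.csv", "wb") as f:
        blob.download_blob(progress_hook=on_progress).readinto(f)
```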
DuckDB, the ultimate dataeng swiss army knife?
One helpful thing that AWS does in its billing data export is to include a metadata file with each night’s export that tells us facts about the export in general. Things like:
- the timestamp at which the export was generated
- where you can find the associated billing data files
- a unique ID for that particular version of the export, and most helpfully
- a list of columns and their datatypes in that export (see the sketch after this list)
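Reading one of those might look like the sketch below; the filename and every key here are hypothetical stand-ins rather than the exact names AWS uses:

```python
import json

# Hypothetical manifest path and keys, for illustration only
with open("manifest.json") as f:
    manifest = json.load(f)

export_id = manifest["executionId"]  # unique ID for this version of the export
data_files = manifest["dataFiles"]   # where the billing data files live
for column in manifest["columns"]:   # column names and their datatypes
    print(column["name"], column["type"])
```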
For this side project I’m using ClickHouse as the backend warehouse. It’s really fun to run huge queries on a huge dataset and have them come back in what feels like 100ms, so I’m a rather big fan of ClickHouse at this point, though I’m only just getting to know it. There are fussy things, too, like its CSV importing, which is … not super friendly. Here’s an example:
Azure’s billing exports generate with a Date field that tells you the date of the charge/line item. For some reason, even though my employer is a French company and our bill is in euros, all of the date fields in this bill come across in the US date format, MM/DD/YYYY. After exhaustive searching, I did find a clue in the ClickHouse documentation that it could parse US-style dateTime strings (likely `parseDateTimeBestEffortUS`), but I cannot find that piece of documentation again, AND it was only usable after you’d gotten the data into the warehouse (presumably as a String). I want the thing stored as a date to begin with, so I started to wonder if I could grab DuckDB and use it to parse this stupid Date column for me correctly.
The answer is yes. DuckDB is also a pretty cool piece of gear, so I’m playing with both of these open-source columnar things at the moment. One thing the DuckDB folks have gone out of their way to do is make ingesting data super easy, letting you specify all the little weird things that can go wrong in their extremely generous default CSV importer, things like “hey, the dateformat should look like this: `{strptime string}`”. Super cool and works like a charm, so now I have this CSV in memory as a DuckDB table.
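That ingest step might look something like this sketch, where the filename is a placeholder and `%m/%d/%Y` is the strptime pattern for the US-style Date column described above:

```python
import duckdb

con = duckdb.connect()

# dateformat hands the CSV reader the strptime pattern for the Date
# column, so it lands as a real DATE instead of a string
con.execute("""
    CREATE TABLE billing AS
    SELECT *
    FROM read_csv('azure_billing_export.csv',
                  header = true,
                  dateformat = '%m/%d/%Y')
""")

print(con.table("billing").limit(5))
```

What else can I do with it?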
Well, why spit it back out as CSV? How about spitting it back out as Parquet instead? ClickHouse will have a much easier time reading a Parquet file, since it comes along with all the column names and datatypes, so that’s what I’m doing. So, I have this function that downloads all the `data_files` for a given billing export, and for the sake of brevity I’ll put it here in its current, non-optimized form:
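A sketch of that shape, assuming a `ContainerClient` pointed at the export’s container and the `data_files` list pulled from the metadata file; names and paths here are illustrative:

```python
import duckdb
from azure.storage.blob import ContainerClient
from tqdm import tqdm


def download_data_files(container: ContainerClient, data_files: list[str]) -> list[str]:
    """Download each CSV in an export, then rewrite it as Parquet."""
    parquet_paths = []
    for blob_name in data_files:
        local_csv = blob_name.rsplit("/", 1)[-1]
        blob = container.get_blob_client(blob_name)
        size = blob.get_blob_properties().size

        # Stream the blob to disk, driving a tqdm bar from progress_hook
        with tqdm(total=size, unit="B", unit_scale=True, desc=local_csv) as bar:

            def on_progress(current: int, total: int, bar=bar) -> None:
                bar.n = current  # cumulative count, so set rather than increment
                bar.refresh()

            with open(local_csv, "wb") as f:
                blob.download_blob(progress_hook=on_progress).readinto(f)

        # Re-emit as Parquet via DuckDB, parsing the US-style Date column
        # on the way in; Parquet carries column names and types along,
        # which makes ClickHouse's life much easier
        parquet_path = local_csv.replace(".csv", ".parquet")
        duckdb.execute(f"""
            COPY (
                SELECT *
                FROM read_csv('{local_csv}',
                              header = true,
                              dateformat = '%m/%d/%Y')
            ) TO '{parquet_path}' (FORMAT PARQUET)
        """)
        parquet_paths.append(parquet_path)
    return parquet_paths
```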