AEMO Data Snippets

Dividing large AEMO Data CSVs into parquet partitions

This script can be run from the command line to divide a large AEMO data CSV (e.g. from the Monthly Data Archive, such as rebids in BIDPEROFFER) into Parquet partitions. Partitioned Parquet files are easier to analyse with packages such as Dask or polars than a single large CSV.

Partitions are generated based on the chunksize parameter, which specifies the number of lines per partition (default \(10^6\) lines). However, the code could be modified to partition the data another way (e.g. by date or by unit ID).

The script also assumes that the first row of the CSV is the header (i.e. the column names) of a single data table.

Requirements

Written using Python 3.11. Uses pandas and tqdm (progress bar).

Also uses the standard library pathlib module and type annotations, so it probably needs a Python version newer than 3.5.

Usage

create_parquet_partitions.py [-h] -file FILE -output_dir OUTPUT_DIR [-chunksize CHUNKSIZE]
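A command-line interface matching the usage line above could be declared with argparse along these lines. This is a sketch under assumptions: the function name build_parser, the help strings and the use of Path as the argument type are illustrative and may differ from the actual script. The option names, the two required options and the default chunksize are taken from the usage line and the description above.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options shown in the usage line."""
    parser = argparse.ArgumentParser(prog="create_parquet_partitions.py")
    # -file and -output_dir are shown without brackets in the usage line,
    # so they are declared as required options here.
    parser.add_argument(
        "-file", type=Path, required=True, help="path to the large AEMO CSV"
    )
    parser.add_argument(
        "-output_dir", type=Path, required=True,
        help="directory to write the Parquet partitions to",
    )
    # -chunksize is optional and defaults to 10^6 lines per partition.
    parser.add_argument(
        "-chunksize", type=int, default=10**6, help="lines per partition"
    )
    return parser


# Parse an explicit argument list for illustration; a real script would
# call parse_args() with no arguments to read sys.argv instead.
args = build_parser().parse_args(
    ["-file", "PUBLIC_DVD_BIDPEROFFER_202107010000.CSV", "-output_dir", "BIDPEROFFER"]
)
print(args.file, args.output_dir, args.chunksize)
```

Since -chunksize is omitted in the illustrative call, args.chunksize falls back to the default of 1000000.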

Example

python create_parquet_partitions.py -file PUBLIC_DVD_BIDPEROFFER_202107010000.CSV -output_dir BIDPEROFFER -chunksize 1000000

Script

create_parquet_partitions.py