Skip to content

Victorian Consumers Data

Database Structure & Data

This dataset includes 30-minute consumption, generation, voltage data, 30-minute subload data, and site details.

For this dataset three tables were created in the DARTH database:

  1. victorian_consumers_site_details

    • darth_site_id (character 5, NOT NULL)

    • darth_circuit_id (character 5, NOT NULL)

    • dnsp (character varying)

    • postcode (character 4, NOT NULL)

    • connection_name (character varying)

    • dc_cap_w (integer)

    • num_phase (small int, NOT NULL)

    The victorian_consumers_site_details table has 4726 rows.

  2. victorian_consumers_electricity_data

    • darth_site_id (character 5, NOT NULL)

    • datetime (timestamp without time zone, NOT NULL)

    • pv_energy_wh (real)

    • pv_reactive_energy_varh (real)

    • load_energy_wh (real)

    • load_reactive_energy_varh (real)

    • voltage_max_v (real)

    • voltage_min_v (real)

    The victorian_consumers_electricity_data table has 27,613,165 data rows which consists of the consumption, generation, and voltage data for 1146 sites. The time ranges from 2016-12-15 00:30 to 2022-05-24 00:00, varying between different sites.

  3. victorian_consumers_subload_data

    • darth_site_id (character 5, NOT NULL)

    • darth_circuit_id (character 5, NOT NULL)

    • datetime (timestamp without time zone, NOT NULL)

    • subload_wh (real)

    • subload_reactive_varh (real)

    • voltage_max_v (real)

    • voltage_min_v (real)

    The victorian_consumers_subload_data table has 25,978,057 data rows which consists of the subload data for 501 sites. The time ranges from 2016-12-15 00:30 to 2022-05-24 00:00, varying between different sites.

The total volume of this dataset is around 4 GB. If the whole table is downloaded, each of the subload and electricity data tables will be around 2 GB. The site details table is comparatively very small, around 220 KB.

Appendix A: Original Dataset

This dataset includes 30-minute consumption, generation, voltage data, 30-minute subload data, and site details. The original dataset provided by Solar Analytics included one csv file for site details, and one folder consisting of eight sub-folders each including around 200 csv files of load and subload data. The following details about the original dataset were provided by Solar Analytics.

Columns for 30-minute consumption/generation/voltage data:

  • timestamp: timestamp in local time, note this is the ending timestamp of a 30-minute period (e.g., 00:30 refers to 00:00 - 00:30)

  • pv_energy_(Wh): 30-minute PV energy for a site

  • pv_reactive_energy_(VARh): 30-minute reactive PV energy for a site

  • load_energy_(Wh): 30-minute load energy for a site

  • load_reactive_energy_(VARh): 30-minute reactive load energy for a site

  • voltage_max_(V): maximum 5-second voltage sampled over the 30-minute period

  • voltage_min_(V): minimum 5-second voltage sampled over the 30-minute period

Note as our power values are just average power values derived from energy values, they are not included here as they could be easily converted (e.g., 600 Wh = 1200 W for a given 30 min time period).

Columns for 30-minute subload data:

  • subload_(Wh): 30-minute subload energy

  • subload_reactive_(VARh): 30-minute reactive subload energy

  • c_id: circuit id

Other columns are identical to the load data as above.

Columns for site details:

  • site_id: site ID

  • c_id: circuit id

  • dnsp

  • postcode

  • connection_name: circuit type (note load_ is a subload circuit)

  • dc_cap_w: PV DC size in W

  • num_phase: number of phases

Appendix B: Data Processing

This dataset was deidentified as described below.

Creating New Site IDs and Circuit IDs

Every site id and circuit id in the were replaced by a new 5-character id that was generated by our processing program. First, the site details file was read, all the unique site ids extracted, the first one matched to S0001, second one matched to S0002, etc. This process continued until all the sites were matched with a new 5-character site id. A site IDs lookup table was generated and stored. Similarly, circuit ids in the site details file were matched to new 5-character ids (C0001, C0002, etc.) and a circuit IDs lookup table was generated and stored.

Deidentification of Site Details

Second step included deidentification of the site details file. Using the above lookup tables, new columns were added for darth_site_id and darth_circuit_id which included the corresponding 5-character id that matched site id and circuit id in each line. Then the original columns for site id and circuit id were removed. This deidentified site details table was stored as a csv file with the following columns, ready to be uploaded to DARTH:

  • darth_site_id

  • darth_circuit_id

  • dnsp

  • postcode

  • connection_name

  • dc_cap_w

  • num_phase

Deidentification of Electricity Data

Each subfolder in the original dataset contained both electricity (consumption/generation/voltage) data files and subload data files, however they were named differently: the subload data files had subload word in in the file name after the site id. We used this to differentiate between electricity data files and subload data files within the data processing program.

Each file was loaded into a DataFrame. The site id only existed in the name of the file, it was extracted and noted in a list to keep track of the processed sites, then deidentified using the id lookup table that previously generated and the deidentified site id was inserted as a new column to the DataFrame. The columns of the dataset were changed in preparation for uploading to the database: timestamp was renamed to datetime, parentheses removed, capitalized units were changed to lower case. The resulting data was stored as a csv file. Therefore, each deidentified electricity data file ready to be uploaded to the database included the following columns:

  • darth_site_id

  • datetime

  • pv_energy_wh

  • pv_reactive_energy_varh

  • load_energy_wh

  • load_reactive_energy_varh

  • voltage_max_v

  • voltage_min_v

In total there were 1146 electricity data files, initially spread between eight subfolders together with subload data. 1146 deidentified electricity data files were generated by performing the above process. After finishing the deidentification process the total number of files in the original electricity data files and deidentified electricity data files were compared together and to the number of unique site ids processed. The data and column labels of a few original data files were compared to their corresponding deidentified electricity data files to ensure the correctness of the deidentification procedure.

Deidentification of Subload Data

With an almost similar process as electricity data deidentification, the subload data files were deidentified. They had an additional column including circuit id which was deidentified using the circuits ids lookup table generated at the first step. This process was performed for each row rather than once for each file because for some of the sites multiple circuits existed in the subload data file. Each deidentified subload data file ready to be uploaded consisted of data in the following columns:

  • darth_site_id

  • darth_circuit_id

  • datetime

  • subload_wh

  • subload_reactive_varh

  • voltage_max_v

  • voltage_min_v

In total 501 subload data files originally existed and the same number of deidentified subload data files were generated and stored.