API Reference

This page provides an autogenerated summary of offsets-db-data’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

Registry Specific Functions

The following functions are specific to a given registry and are grouped under each registry’s module. We currently support the following registries:

Verra

offsets_db_data.vcs.add_vcs_compliance_projects(df: DataFrame) → DataFrame

Add details about two compliance projects to projects database.

Parameters:

df (pd.DataFrame) – A pandas DataFrame containing project data with a ‘project_id’ column.

Returns:

df – A pandas DataFrame with two additional rows, describing two projects from the mostly unused Verra compliance registry portal.

Return type:

pd.DataFrame

offsets_db_data.vcs.add_vcs_project_id(df: DataFrame) → DataFrame

Add a prefix ‘VCS’ to each project ID in the DataFrame.

Parameters:

df (pd.DataFrame) – Input DataFrame with Verra project data.

Returns:

DataFrame with updated ‘project_id’ column, containing the prefixed project IDs.

Return type:

pd.DataFrame

offsets_db_data.vcs.add_vcs_project_url(df: DataFrame) → DataFrame

Create a URL for each project based on its Verra project ID.

Parameters:

df (pd.DataFrame) – Input DataFrame with Verra project data.

Returns:

DataFrame with a new ‘project_url’ column, containing the generated URLs for each project.

Return type:

pd.DataFrame

offsets_db_data.vcs.calculate_vcs_issuances(df: DataFrame) → DataFrame

Calculate Verra issuance transactions from preprocessed transaction data.

Verra allows rolling/partial issuances. This requires inferring vintage issuance from Total Vintage Quantity.

Parameters:

df (pd.DataFrame) – Input DataFrame with preprocessed transaction data.

Returns:

DataFrame containing only issuance transactions with deduplicated and renamed columns.

Return type:

pd.DataFrame

offsets_db_data.vcs.calculate_vcs_retirements(df: DataFrame) → DataFrame

Calculate retirements and cancellations for Verra transactions. The data does not allow distinguishing between retirements and cancellations.

Parameters:

df (pd.DataFrame) – Input DataFrame with Verra transaction data.

Returns:

DataFrame containing only retirement transactions with renamed columns.

Return type:

pd.DataFrame

offsets_db_data.vcs.determine_vcs_transaction_type(df: DataFrame, *, date_column: str) → DataFrame

Determine the transaction type for Verra transactions based on a specified date column. Transactions with non-null date values are labeled as ‘retirement’, else as ‘issuance’.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with transaction data.

  • date_column (str) – Name of the column in the DataFrame used to determine the transaction type.

Returns:

DataFrame with a new ‘transaction_type’ column indicating the type of each transaction.

Return type:

pd.DataFrame
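The labeling rule described above (a non-null date marks a retirement, anything else an issuance) can be sketched in a few lines of pandas. This is only an illustration of the documented behavior, not the library's implementation; the helper name `label_transaction_type` and the column name `retired_on` are hypothetical.

```python
import pandas as pd


def label_transaction_type(df: pd.DataFrame, *, date_column: str) -> pd.DataFrame:
    """Hypothetical stand-in for the documented rule, not the library's code."""
    df = df.copy()
    has_date = df[date_column].notna()
    # Non-null date -> 'retirement'; null date -> 'issuance'.
    df['transaction_type'] = has_date.map({True: 'retirement', False: 'issuance'})
    return df


# 'retired_on' is an assumed column name used only for this example.
raw = pd.DataFrame({'retired_on': ['05/03/2021', None, '17/11/2022']})
labeled = label_transaction_type(raw, date_column='retired_on')
print(labeled['transaction_type'].tolist())
```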

offsets_db_data.vcs.generate_vcs_project_ids(df: DataFrame, *, prefix: str) → DataFrame

Generate Verra project IDs by concatenating a specified prefix with the ‘ID’ column of the DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing Verra project data.

  • prefix (str) – Prefix string to prepend to each project ID.

Returns:

DataFrame with a new ‘project_id’ column, containing the generated project IDs.

Return type:

pd.DataFrame
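A minimal sketch of the prefixing step, assuming the raw ‘ID’ column holds integers or strings. The helper name `prefix_project_ids` and the sample IDs are hypothetical; the real function may handle types and edge cases differently.

```python
import pandas as pd


def prefix_project_ids(df: pd.DataFrame, *, prefix: str) -> pd.DataFrame:
    """Hypothetical stand-in: build 'project_id' from prefix + 'ID'."""
    df = df.copy()
    # Cast to str first so integer IDs can be concatenated with the prefix.
    df['project_id'] = prefix + df['ID'].astype(str)
    return df


raw = pd.DataFrame({'ID': [75, 1234]})  # made-up IDs for illustration
with_ids = prefix_project_ids(raw, prefix='VCS')
print(with_ids['project_id'].tolist())
```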

offsets_db_data.vcs.process_vcs_credits(df: DataFrame, *, download_type: str = 'transactions', registry_name: str = 'verra', prefix: str = 'VCS', arb: DataFrame | None = None) → DataFrame

Process Verra credits data, including generation of project IDs, determination of transaction types, setting transaction dates, and various data transformations and validations.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with raw credits data.

  • download_type (str, optional) – Type of download operation performed (default is ‘transactions’).

  • registry_name (str, optional) – Name of the registry (default is ‘verra’).

  • prefix (str, optional) – Prefix for generating project IDs (default is ‘VCS’).

  • arb (pd.DataFrame | None, optional) – DataFrame for additional data merging (default is None).

Returns:

Processed DataFrame with Verra credits data.

Return type:

pd.DataFrame

offsets_db_data.vcs.process_vcs_projects(df: DataFrame, *, credits: DataFrame, registry_name: str = 'verra', download_type: str = 'projects') → DataFrame

Process Verra projects data, including renaming, adding, and validating columns, and merging with credits data.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with raw projects data.

  • credits (pd.DataFrame) – DataFrame containing credits data for merging.

  • registry_name (str, optional) – Name of the registry (default is ‘verra’).

  • download_type (str, optional) – Type of download operation performed (default is ‘projects’).

Returns:

Processed DataFrame with harmonized and validated Verra projects data.

Return type:

pd.DataFrame

offsets_db_data.vcs.set_vcs_transaction_dates(df: DataFrame, *, date_column: str, fallback_column: str) → DataFrame

Set the transaction dates in a DataFrame, using a primary date column and a fallback column.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with transaction data.

  • date_column (str) – Primary column to use for transaction dates.

  • fallback_column (str) – Column to use as fallback for transaction dates when primary column is null.

Returns:

DataFrame with a new ‘transaction_date’ column, containing the determined dates.

Return type:

pd.DataFrame
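The primary/fallback selection described above might look like the following sketch. It illustrates the idea only; the helper name `pick_transaction_date` and the column names `retired_on` and `issued_on` are assumptions made for this example.

```python
import pandas as pd


def pick_transaction_date(
    df: pd.DataFrame, *, date_column: str, fallback_column: str
) -> pd.DataFrame:
    """Hypothetical stand-in: use date_column, else fallback_column."""
    df = df.copy()
    primary = df[date_column]
    # Keep the primary value where present; otherwise take the fallback.
    df['transaction_date'] = primary.where(primary.notna(), df[fallback_column])
    return df


raw = pd.DataFrame(
    {
        'retired_on': ['2021-03-05', None],  # assumed column names
        'issued_on': ['2020-01-10', '2020-06-30'],
    }
)
dated = pick_transaction_date(raw, date_column='retired_on', fallback_column='issued_on')
print(dated['transaction_date'].tolist())
```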

offsets_db_data.vcs.set_vcs_vintage_year(df: DataFrame, *, date_column: str) → DataFrame

Set the vintage year for Verra transactions based on a date column formatted as ‘%d/%m/%Y’.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with transaction data.

  • date_column (str) – Name of the column containing date information to extract the vintage year from.

Returns:

DataFrame with a new ‘vintage’ column, containing the vintage year of each transaction.

Return type:

pd.DataFrame
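To make the ‘%d/%m/%Y’ parsing concrete, here is a small sketch that extracts the year with `pd.to_datetime`. The helper name `extract_vintage_year` and the column name `vintage_start` are hypothetical, and the real function may treat malformed dates differently.

```python
import pandas as pd


def extract_vintage_year(df: pd.DataFrame, *, date_column: str) -> pd.DataFrame:
    """Hypothetical stand-in: parse day/month/year strings and keep the year."""
    df = df.copy()
    parsed = pd.to_datetime(df[date_column], format='%d/%m/%Y')
    df['vintage'] = parsed.dt.year
    return df


raw = pd.DataFrame({'vintage_start': ['01/02/2019', '31/12/2020']})
with_vintage = extract_vintage_year(raw, date_column='vintage_start')
print(with_vintage['vintage'].tolist())
```

Passing an explicit format matters here: without it, a value such as ‘01/02/2019’ could be read month-first.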

Gold Standard

offsets_db_data.gld.add_gld_project_id(df: DataFrame, *, prefix: str) → DataFrame

Add Gold Standard project IDs to the DataFrame

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing credits data.

  • prefix (str) – Prefix string to prepend to each project ID.

Returns:

DataFrame with a new ‘project_id’ column, containing the generated project IDs.

Return type:

pd.DataFrame

offsets_db_data.gld.add_gld_project_url(df: DataFrame) → DataFrame

Add a URL for each Gold Standard project.

Gold Standard project IDs differ from the IDs used in Gold Standard URLs.

Parameters:

df (pd.DataFrame) – Input DataFrame containing Gold Standard project data.

Returns:

DataFrame with a new ‘project_url’ column, containing URLs for each project.

Return type:

pd.DataFrame

offsets_db_data.gld.determine_gld_transaction_type(df: DataFrame, *, download_type: str) → DataFrame

Assign a transaction type to each record in the DataFrame based on the download type for Gold Standard transactions.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing transaction data.

  • download_type (str) – Type of transaction (‘issuances’, ‘retirements’) to determine the transaction type.

Returns:

DataFrame with a new ‘transaction_type’ column, containing assigned transaction types based on download_type.

Return type:

pd.DataFrame

offsets_db_data.gld.process_gld_credits(df: DataFrame, *, download_type: str, registry_name: str = 'gold-standard', prefix: str = 'GLD', arb: DataFrame | None = None) → DataFrame

Process Gold Standard credits data by renaming columns, setting registry, determining transaction types, adding project IDs, converting date columns, aggregating issuances (if applicable), and validating the schema.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with raw Gold Standard credits data.

  • download_type (str) – Type of download (‘issuances’ or ‘retirements’).

  • registry_name (str, optional) – Name of the registry for setting and mapping columns (default is ‘gold-standard’).

  • prefix (str, optional) – Prefix for generating project IDs (default is ‘GLD’).

  • arb (pd.DataFrame | None, optional) – Additional DataFrame for data merging (default is None).

Returns:

Processed DataFrame with Gold Standard credits data.

Return type:

pd.DataFrame

offsets_db_data.gld.process_gld_projects(df: DataFrame, *, credits: DataFrame, registry_name: str = 'gold-standard', prefix: str = 'GLD') → DataFrame

Process Gold Standard projects data, including renaming, adding, and validating columns, harmonizing statuses, and merging with credits data.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with raw Gold Standard projects data.

  • credits (pd.DataFrame) – DataFrame containing credits data for merging.

  • registry_name (str, optional) – Name of the registry for specific processing steps (default is ‘gold-standard’).

  • prefix (str, optional) – Prefix for generating project IDs (default is ‘GLD’).

Returns:

Processed DataFrame with harmonized and validated Gold Standard projects data.

Return type:

pd.DataFrame

APX Registries

Functionality for APX registries is currently grouped under the `apx` module.

offsets_db_data.apx.add_project_url(df: DataFrame, *, registry_name: str) → DataFrame

Add a project URL to each record in the DataFrame based on the registry name and project ID.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing project data.

  • registry_name (str) – Name of the registry (‘american-carbon-registry’, ‘climate-action-reserve’, ‘art-trees’).

Returns:

DataFrame with a new ‘project_url’ column, containing URLs for each project.

Return type:

pd.DataFrame

offsets_db_data.apx.determine_transaction_type(df: DataFrame, *, download_type: str) → DataFrame

Assign a transaction type to each record in the DataFrame based on the download type.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing transaction data.

  • download_type (str) – Type of transaction (‘issuances’, ‘retirements’, ‘cancellations’) to determine the transaction type.

Returns:

DataFrame with a new ‘transaction_type’ column, containing assigned transaction types based on download_type.

Return type:

pd.DataFrame

offsets_db_data.apx.harmonize_acr_status(row: Series) → str

Derive single project status for CAR and ACR projects

Raw CAR and ACR data has two status columns – one for compliance status, one for voluntary. Handle and harmonize.

Parameters:

row (pd.Series) – A row from a pandas DataFrame

Returns:

value – The status of the project

Return type:

str

offsets_db_data.apx.process_apx_credits(df: DataFrame, *, download_type: str, registry_name: str, arb: DataFrame | None = None) → DataFrame

Process APX credits data by setting registry, determining transaction types, renaming columns, converting date columns, aggregating issuances (if applicable), and validating the schema.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with raw APX credits data.

  • download_type (str) – Type of download (‘issuances’, ‘retirements’, etc.).

  • registry_name (str) – Name of the registry for setting and mapping columns.

  • arb (pd.DataFrame | None, optional) – Additional DataFrame for data merging (default is None).

Returns:

Processed DataFrame with APX credits data.

Return type:

pd.DataFrame

offsets_db_data.apx.process_apx_projects(df: DataFrame, *, credits: DataFrame, registry_name: str) → DataFrame

Process APX projects data, including renaming, adding, and validating columns, harmonizing statuses, and merging with credits data.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with raw projects data.

  • credits (pd.DataFrame) – DataFrame containing credits data for merging.

  • registry_name (str) – Name of the registry for specific processing steps.

Returns:

Processed DataFrame with harmonized and validated APX projects data.

Return type:

pd.DataFrame

ARB Data Functions

The following functions are specific to the ARB data.

offsets_db_data.arb.process_arb(df: DataFrame) → DataFrame

Process ARB (Air Resources Board) data by renaming columns, handling nulls, interpolating vintages, and transforming the data structure for transactions.

Parameters:

df (pd.DataFrame) – Input DataFrame containing raw ARB data.

Returns:

data – Processed DataFrame with ARB data. Columns include ‘opr_id’, ‘vintage’, ‘issued_at’ (interpolated), various credit transaction types, and quantities. The DataFrame is also validated against a predefined schema for credit data.

Return type:

pd.DataFrame

Notes

  • The function renames columns for readability and standardization.

  • It interpolates missing vintage values and handles NaNs in ‘issuance’ column.

  • Retirement transactions are derived based on compliance period dates.

  • The DataFrame is melted to restructure credit data.

  • Zero retirement events are dropped as they are considered artifacts.

  • A prefix is added to ‘project_id’ to indicate the source.

  • The ‘registry’ column is derived based on the project_id prefix.

  • The ‘vintage’ column is converted to integer type.

  • Finally, the data is converted to datetime where necessary and validated against a predefined schema.

Common Functions

The following functions are common to all registries.

offsets_db_data.common.add_missing_columns(df: DataFrame, *, schema: DataFrameSchema) → DataFrame

Add any missing columns to the DataFrame and initialize them with None.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • schema (pa.DataFrameSchema) – Pandera schema to validate against.

Returns:

DataFrame with all specified columns, adding missing ones initialized to None.

Return type:

pd.DataFrame

offsets_db_data.common.clean_and_convert_numeric_columns(df: DataFrame, *, columns: list[str]) → DataFrame

Clean and convert specified columns to numeric format in the DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • columns (list[str]) – List of column names to clean and convert to numeric format.

Returns:

DataFrame with specified columns cleaned (removing commas) and converted to numeric format.

Return type:

pd.DataFrame
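The comma-stripping and numeric conversion described above can be approximated as follows. This is a sketch, not the library's code: the helper name `strip_commas_to_numeric` is hypothetical, and coercing unparseable values to NaN via `errors='coerce'` is an assumption about the desired behavior.

```python
import pandas as pd


def strip_commas_to_numeric(df: pd.DataFrame, *, columns: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in: remove thousands separators, then convert."""
    df = df.copy()
    for column in columns:
        without_commas = df[column].astype(str).str.replace(',', '', regex=False)
        # Assumption: values that still fail to parse become NaN.
        df[column] = pd.to_numeric(without_commas, errors='coerce')
    return df


raw = pd.DataFrame({'quantity': ['1,000', '25', '3,500,000']})
cleaned = strip_commas_to_numeric(raw, columns=['quantity'])
print(cleaned['quantity'].tolist())
```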

offsets_db_data.common.convert_to_datetime(df: DataFrame, *, columns: list, utc: bool = True, **kwargs: Any) → DataFrame

Convert specified columns in the DataFrame to datetime format.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • columns (list) – List of column names to convert to datetime.

  • utc (bool, optional) – Whether to convert to UTC (default is True).

  • **kwargs (Any) – Additional keyword arguments passed to pd.to_datetime.

Returns:

DataFrame with specified columns converted to datetime format.

Return type:

pd.DataFrame
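Since the extra keyword arguments are forwarded to `pd.to_datetime`, the conversion can be pictured roughly like this. The helper name `to_datetime_columns` is hypothetical and the sketch only mirrors the documented parameters; it is not the library's implementation.

```python
from typing import Any

import pandas as pd


def to_datetime_columns(
    df: pd.DataFrame, *, columns: list, utc: bool = True, **kwargs: Any
) -> pd.DataFrame:
    """Hypothetical stand-in: convert each listed column with pd.to_datetime."""
    df = df.copy()
    for column in columns:
        df[column] = pd.to_datetime(df[column], utc=utc, **kwargs)
    return df


raw = pd.DataFrame({'transaction_date': ['2021-03-05', '2022-11-17']})
# 'format' is an ordinary pd.to_datetime keyword, forwarded through **kwargs.
converted = to_datetime_columns(raw, columns=['transaction_date'], format='%Y-%m-%d')
print(converted['transaction_date'].iloc[0])
```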

offsets_db_data.common.load_column_mapping(*, registry_name: str, download_type: str, mapping_path: str) → dict

offsets_db_data.common.load_inverted_protocol_mapping() → dict

offsets_db_data.common.load_protocol_mapping(path: UPath = PosixUPath('/home/docs/checkouts/readthedocs.org/user_builds/offsets-db-data/envs/latest/lib/python3.12/site-packages/offsets_db_data/configs/all-protocol-mapping.json')) → dict

offsets_db_data.common.load_registry_project_column_mapping(*, registry_name: str, file_path: UPath = PosixUPath('/home/docs/checkouts/readthedocs.org/user_builds/offsets-db-data/envs/latest/lib/python3.12/site-packages/offsets_db_data/configs/projects-raw-columns-mapping.json')) → dict

offsets_db_data.common.set_registry(df: DataFrame, registry_name: str) → DataFrame

Set the registry name for each record in the DataFrame.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • registry_name (str) – Name of the registry to set.

Returns:

DataFrame with a new ‘registry’ column set to the specified registry name.

Return type:

pd.DataFrame

offsets_db_data.common.validate(df: DataFrame, schema: DataFrameSchema) → DataFrame

Validate the DataFrame against a given Pandera schema.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • schema (pa.DataFrameSchema) – Pandera schema to validate against.

Returns:

DataFrame with columns sorted according to the schema and validated against it.

Return type:

pd.DataFrame

offsets_db_data.credits.aggregate_issuance_transactions(df: DataFrame) → DataFrame

Aggregate issuance transactions by summing the quantity for each combination of project ID, transaction date, and vintage.

Parameters:

df (pd.DataFrame) – Input DataFrame containing issuance transaction data.

Returns:

DataFrame with aggregated issuance transactions, filtered to include only those with a positive quantity.

Return type:

pd.DataFrame
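The aggregation described above amounts to a group-by followed by a filter. The sketch below assumes the columns are named ‘project_id’, ‘transaction_date’, ‘vintage’, and ‘quantity’; the helper name `sum_issuances` and the sample rows are made up for illustration.

```python
import pandas as pd


def sum_issuances(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: sum quantity per project, date, and vintage."""
    keys = ['project_id', 'transaction_date', 'vintage']
    totals = df.groupby(keys, as_index=False)['quantity'].sum()
    # Keep only groups whose summed quantity is positive.
    return totals[totals['quantity'] > 0].reset_index(drop=True)


raw = pd.DataFrame(
    {
        'project_id': ['VCS1', 'VCS1', 'VCS2'],
        'transaction_date': ['2020-01-01', '2020-01-01', '2021-05-05'],
        'vintage': [2019, 2019, 2020],
        'quantity': [100, 50, 0],
    }
)
aggregated = sum_issuances(raw)
print(aggregated)
```

In this sample the two ‘VCS1’ rows collapse into one row with a quantity of 150, and the zero-quantity ‘VCS2’ row is dropped.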

offsets_db_data.credits.filter_and_merge_transactions(df: DataFrame, arb_data: DataFrame, project_id_column: str = 'project_id') → DataFrame

Filter transactions based on project ID intersection with ARB data and merge the filtered transactions.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with transaction data.

  • arb_data (pd.DataFrame) – DataFrame containing ARB issuance data.

  • project_id_column (str, optional) – The name of the column containing project IDs (default is ‘project_id’).

Returns:

DataFrame with transactions from the input DataFrame, excluding those present in ARB data, merged with relevant ARB transactions.

Return type:

pd.DataFrame
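One plausible reading of this filter-and-merge step is sketched below: registry rows for projects that also appear in the ARB data are dropped, and the ARB rows for those same projects are appended. This is an interpretation of the description, not the library's code; the helper name `swap_in_arb_rows` and all sample IDs and quantities are hypothetical.

```python
import pandas as pd


def swap_in_arb_rows(
    df: pd.DataFrame, arb_data: pd.DataFrame, project_id_column: str = 'project_id'
) -> pd.DataFrame:
    """Hypothetical stand-in: replace overlapping projects with ARB rows."""
    shared_ids = set(df[project_id_column]) & set(arb_data[project_id_column])
    # Registry rows for projects that ARB also covers are discarded...
    kept = df[~df[project_id_column].isin(shared_ids)]
    # ...and the ARB rows for those same projects take their place.
    arb_rows = arb_data[arb_data[project_id_column].isin(shared_ids)]
    return pd.concat([kept, arb_rows], ignore_index=True)


registry = pd.DataFrame({'project_id': ['CAR1', 'CAR1', 'VCS9'], 'quantity': [10, 20, 30]})
arb = pd.DataFrame({'project_id': ['CAR1', 'ACR5'], 'quantity': [25, 99]})
merged = swap_in_arb_rows(registry, arb)
print(merged)
```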

offsets_db_data.credits.handle_non_issuance_transactions(df: DataFrame) → DataFrame

Filter the DataFrame to include only non-issuance transactions.

Parameters:

df (pd.DataFrame) – Input DataFrame containing transaction data.

Returns:

DataFrame containing only transactions where ‘transaction_type’ is not ‘issuance’.

Return type:

pd.DataFrame

offsets_db_data.credits.merge_with_arb(credits: DataFrame, *, arb: DataFrame) → DataFrame

The ARB issuance table contains the authoritative version of all credit transactions for ARB projects. This function drops all registry crediting data for those projects and instead patches in data from the ARB issuance table.

Parameters:
  • credits (pd.DataFrame) – Pandas dataframe containing registry credit data

  • arb (pd.DataFrame) – Pandas dataframe containing ARB issuance data

Returns:

Pandas dataframe containing merged credit and ARB data

Return type:

pd.DataFrame

offsets_db_data.projects.add_category(df: DataFrame, *, protocol_mapping: dict) → DataFrame

Add a category to each record in the DataFrame based on its protocol.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing protocol data.

  • protocol_mapping (dict) – Dictionary mapping protocol strings to categories.

Returns:

DataFrame with a new ‘category’ column, derived from the protocol information.

Return type:

pd.DataFrame

offsets_db_data.projects.add_first_issuance_and_retirement_dates(projects: DataFrame, *, credits: DataFrame) → DataFrame

Add the first issuance date and the first retirement date of carbon credits to each project in the projects DataFrame.

Parameters:
  • credits (pd.DataFrame) – A pandas DataFrame containing credit issuance data with columns ‘project_id’, ‘transaction_date’, and ‘transaction_type’.

  • projects (pd.DataFrame) – A pandas DataFrame containing project data with a ‘project_id’ column.

Returns:

projects – A pandas DataFrame which is the original projects DataFrame with two additional columns ‘first_issuance_at’ representing the first issuance date of each project and ‘first_retirement_at’ representing the first retirement date of each project.

Return type:

pd.DataFrame
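A compact way to picture this step is a per-project minimum over each transaction type, joined back onto the projects table. The sketch below assumes that ‘transaction_type’ holds the exact labels ‘issuance’ and ‘retirement’; the helper name `add_first_dates` and the sample data are hypothetical, and the real function may treat other transaction types (for example cancellations) differently.

```python
import pandas as pd


def add_first_dates(projects: pd.DataFrame, *, credits: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: earliest issuance and retirement per project."""

    def first_date(kind: str, name: str) -> pd.DataFrame:
        rows = credits[credits['transaction_type'] == kind]
        return rows.groupby('project_id')['transaction_date'].min().rename(name).reset_index()

    out = projects.merge(first_date('issuance', 'first_issuance_at'), on='project_id', how='left')
    # Projects without matching transactions end up with NaT.
    return out.merge(first_date('retirement', 'first_retirement_at'), on='project_id', how='left')


projects = pd.DataFrame({'project_id': ['VCS1', 'VCS2']})
credits = pd.DataFrame(
    {
        'project_id': ['VCS1', 'VCS1', 'VCS1'],
        'transaction_type': ['issuance', 'issuance', 'retirement'],
        'transaction_date': pd.to_datetime(['2020-01-01', '2019-06-01', '2021-03-01']),
    }
)
with_dates = add_first_dates(projects, credits=credits)
print(with_dates)
```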

offsets_db_data.projects.add_is_compliance_flag(df: DataFrame) → DataFrame

Add a compliance flag to the DataFrame based on the protocol.

Parameters:

df (pd.DataFrame) – Input DataFrame containing protocol data.

Returns:

DataFrame with a new ‘is_compliance’ column, indicating if the protocol starts with ‘arb-’.

Return type:

pd.DataFrame
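The ‘arb-’ prefix check might be sketched as follows. It assumes that each ‘protocol’ cell holds a list of protocol strings (suggested by the list-returning protocol helpers on this page, but not confirmed here); the helper name `flag_compliance` and the protocol names are made up.

```python
import pandas as pd


def flag_compliance(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: True if any protocol string starts with 'arb-'."""
    df = df.copy()
    # Assumption: each cell in 'protocol' is a list of strings.
    df['is_compliance'] = df['protocol'].apply(
        lambda protocols: any(p.startswith('arb-') for p in protocols)
    )
    return df


# Protocol names below are invented for illustration only.
raw = pd.DataFrame({'protocol': [['arb-example'], ['other-protocol'], ['other', 'arb-second']]})
flagged = flag_compliance(raw)
print(flagged['is_compliance'].tolist())
```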

offsets_db_data.projects.add_retired_and_issued_totals(projects: DataFrame, *, credits: DataFrame) → DataFrame

Add total quantities of issued and retired credits to each project.

Parameters:
  • projects (pd.DataFrame) – DataFrame containing project data.

  • credits (pd.DataFrame) – DataFrame containing credit transaction data.

Returns:

DataFrame with two new columns: ‘issued’ and ‘retired’, representing the total quantities of issued and retired credits.

Return type:

pd.DataFrame
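As an illustration of how such totals could be derived, the sketch below sums credit quantities per project and transaction type, then joins them onto the projects table. It assumes the labels ‘issuance’ and ‘retirement’ and a ‘quantity’ column, and fills missing totals with zero, which may differ from the library's behavior. The helper name `add_totals` is hypothetical.

```python
import pandas as pd


def add_totals(projects: pd.DataFrame, *, credits: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: per-project issued and retired totals."""

    def total_for(kind: str, name: str) -> pd.DataFrame:
        rows = credits[credits['transaction_type'] == kind]
        return rows.groupby('project_id')['quantity'].sum().rename(name).reset_index()

    out = projects.merge(total_for('issuance', 'issued'), on='project_id', how='left')
    out = out.merge(total_for('retirement', 'retired'), on='project_id', how='left')
    # Assumption: projects with no matching transactions get a total of 0.
    out[['issued', 'retired']] = out[['issued', 'retired']].fillna(0)
    return out


projects = pd.DataFrame({'project_id': ['VCS1', 'GLD2']})
credits = pd.DataFrame(
    {
        'project_id': ['VCS1', 'VCS1', 'VCS1', 'GLD2'],
        'transaction_type': ['issuance', 'issuance', 'retirement', 'issuance'],
        'quantity': [100, 50, 30, 10],
    }
)
with_totals = add_totals(projects, credits=credits)
print(with_totals)
```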

offsets_db_data.projects.find_protocol(*, search_string: str, inverted_protocol_mapping: dict[str, list[str]]) → list[str]

Match known strings of project methodologies to internal topology.

Unmatched strings are passed through to the database until the mapping data is updated.

offsets_db_data.projects.get_protocol_category(*, protocol_strs: list[str] | str, protocol_mapping: dict) → list[str]

Get category based on protocol string

Parameters:
  • protocol_strs (str or list) – single protocol string or list of protocol strings

  • protocol_mapping (dict) – metadata about normalized protocol strings

Returns:

categories – list of category strings

Return type:

list[str]

offsets_db_data.projects.harmonize_country_names(df: DataFrame, *, country_column: str = 'country') → DataFrame

Harmonize country names in the DataFrame to standardized country names.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with country data.

  • country_column (str, optional) – The name of the column containing country names to be harmonized (default is ‘country’).

Returns:

DataFrame with harmonized country names in the specified column.

Return type:

pd.DataFrame

offsets_db_data.projects.harmonize_status_codes(df: DataFrame, *, status_column: str = 'status') → DataFrame

Harmonize project status codes across registries

Excludes ACR, as it requires special treatment across two columns

Parameters:
  • df (pd.DataFrame) – Input DataFrame with project status data.

  • status_column (str, optional) – Name of the column containing status codes to harmonize (default is ‘status’).

Returns:

DataFrame with harmonized project status codes.

Return type:

pd.DataFrame

offsets_db_data.projects.map_protocol(df: DataFrame, *, inverted_protocol_mapping: dict, original_protocol_column: str = 'original_protocol') → DataFrame

Map protocols in the DataFrame to standardized names based on an inverted protocol mapping.

Parameters:
  • df (pd.DataFrame) – Input DataFrame containing protocol data.

  • inverted_protocol_mapping (dict) – Dictionary mapping protocol strings to standardized protocol names.

  • original_protocol_column (str, optional) – Name of the column containing original protocol information (default is ‘original_protocol’).

Returns:

DataFrame with a new ‘protocol’ column, containing mapped protocol names.

Return type:

pd.DataFrame

offsets_db_data.registry.get_registry_from_project_id(project_id: str) → str

Retrieve the full registry name from a project ID using a predefined abbreviation mapping.

Parameters:

project_id (str) – The project ID whose registry needs to be identified.

Returns:

The full name of the registry corresponding to the abbreviation in the project ID.

Return type:

str

Notes

  • The function expects the first three characters of the project ID to be the abbreviation of the registry.

  • It uses a predefined mapping (REGISTRY_ABBR_MAP) to convert the abbreviation to the full registry name.

  • The project ID is converted to lowercase to ensure case-insensitive matching.

  • The function raises a KeyError if the abbreviation is not found in REGISTRY_ABBR_MAP.
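The lookup behavior in the notes above can be sketched in plain Python. The mapping below is a hypothetical two-entry stand-in built from the ‘VCS’ and ‘GLD’ prefixes and registry names that appear elsewhere on this page; the actual contents of REGISTRY_ABBR_MAP are not shown here and may differ.

```python
# Hypothetical stand-in for REGISTRY_ABBR_MAP; the real mapping may contain
# more entries and different keys.
REGISTRY_ABBR_MAP_SKETCH = {
    'vcs': 'verra',
    'gld': 'gold-standard',
}


def lookup_registry(project_id: str) -> str:
    """Hypothetical stand-in: map the 3-character prefix to a registry name."""
    abbreviation = project_id[:3].lower()  # lowercase for case-insensitive matching
    # An unknown abbreviation raises KeyError, as the notes describe.
    return REGISTRY_ABBR_MAP_SKETCH[abbreviation]


print(lookup_registry('VCS1234'))
print(lookup_registry('gld55'))
```

Callers that may encounter unknown prefixes would need to catch `KeyError` themselves.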