Data Processing#
Order of Operations#
OffsetsDB follows a typical extract-transform-load (ETL) workflow.
Extraction involves querying and downloading raw credit and project data hosted by offset registries.
Transformation involves executing the functions contained within this repo, offsets-db-data
.
Load involves uploading the resulting data to S3 and the Postgres-backend that powers the OffsetsDB database tool.
Downloading Raw Data#
We download a fresh copy of project and transaction data on a daily basis. While downloading, we make no changes to the raw data provided by the registries. We’ve fully automated downloading of registry data, with the exception of Gold Standard. Gold Standard’s terms and conditions require that downloads occur through the interfaces provided by the Gold Standard site. Those interfaces, as provided, do not allow automated downloads.
We have no plans to release the code the directly interacts with registries.
We decided to keep this part of OffsetsDB private in an effort to limit download requests to the registries.
Other, technical aspects of OffsetsDB, like the database and API that power the database tool are similarly closed.
We made this decision to ensure that the OffsetsDB database tool remains performant.
Critically, the production database represents an identical clone of the data generated by the code contained within offsets-db-data
.
No additional processing or inferences should occur outside the context of this repository.
Transforming Raw Data#
Nearly the entirety of the code contained within offsets-db-data
involves registry-specific logic for transforming raw registry data into a common, shared schema.
The logic for transforming the data of each registry is contained within a single file, with the filename denoting which registry the transformations apply to.
For example, the logic involved in transforming Verra data are contained within offsets_db_data.vcs
.
Each registry-specific file contains at least two functions: process_{registry_abbreviation}_credits
and process_{registry_abbreviation}_projects
Those functions, in turn, call a series of additional transformation functions that produce the normalized project and credit data which combine to form OffsetsDB.
These transformation functions tend to be quite small and operate on one or two properties of the raw data.
To continue with the Verra example, vcs.py
contains functions with names like offsets_db_data.vcs.set_vcs_vintage_year
and offsets_db_data.vcs.add_vcs_project_id
.
These functions contain the registry-specific logic needed to map Verra’s raw data to the common data schema of OffsetsDB.
An Example#
In practice, replicating the behavior of OffsetsDB should be simple.
Here’s an example of using offsets_db_data
to transform the raw transactions data from Verra into a normalized, analysis ready file:
import pandas as pd
pd.options.display.max_columns = 5
from offsets_db_data import vcs
archive_fname = {{ path to local copy of Verra transaction data }}
raw_credits = pd.read_csv(archive_fname)
processed_credits = vcs.process_vcs_credits(raw_credits)
Note
Running the above example requires first downloading a copy of Verra’s transaction data and changing the above code to reference the location of that data on your local machine.
Invoking single transformation functions, like offsets_db_data.vcs.set_vcs_vintage_year
is even more straightforward.
Let’s say you want to understand more about how OffsetsDB assigns Verra credits a vintage year.
You can explore the behavior of this single transformation function by calling:
raw_credits.set_vcs_vintage_year(date_column='Vintage End').head()
It’s worth noting that we’ve wrapped all transformation functions using the pandas_flavor.register_dataframe_method
decorator.
That means that after importing a registry module from offsets_db_data
, the transformation functions of that module are directly callable by any Pandas dataframe.
Initial Column Mapping#
The initial and perhaps must mundane transformation of OffsetsDB involves mapping properties in the raw data to a common schema.
This step requires constructing a map between the names of properties as they appear in the raw data to the property in OffsetsDB.
For example, the Climate Action Reserve data refers to the property, project_id
, as Project ID
.
The ART registry, however, refers to the same property as Program ID
.
These column mapping files are stored in offsets_db_data/configs
.
There is a separate mapping file for projects
data and credits
data.
Some properties either aren’t included in the raw data or inferring their value requires special processing.
In these cases, a null
value is recorded in the column mapping files and the property is populated using registry-specific logic.
Protocol Mapping#
Offset projects are developed by following a set of rules, known as a protocol. These rules specify things like when measurements must be made and what paperwork must be submitted in order for a project to receive credits. Unfortunately, there is no standardized way of referring to the exact protocol (or protocol version) used to develop an offset project. Even within the domain of a single registry, references to the exact protocol used to develop a project are often inconsistent. Take for example the Clean Development Mechanism protocol AMS-III.D., “Methane recovery in animal manure management systems”. Across all five registries included in OffsetsDB, we identified twenty-two unique ways of referring to this one protocol.
OffsetsDB addresses this problem by manually assigning every known protocol string to a common schema.
Continuing with the AMS-III.D. example, we map all twenty-two “known strings” that describe the same protocol to a single, unified reference, ams-iii-d
.
We repeat this manual unification of dissimilar strings for all protocols across all registries.
The results of the mapping are contained within offsets-db-data/configs/all-protocol-mapping.json
.
Project Type & Categorization#
In addition to unifying protocol mapping, we also assign two levels of classification to projects: category
and type
.
Categories represent broad classes of offset approaches, while types provide more specific information about the mitigation strategy.
Category Assignment#
Projects are assigned to one of these broad categories
agriculture: offsets derived from changes in the management of agricultural systems, including livestock
forest: offsets derived from the management of forests
ghg-management: offsets derived from the destruction or elimination (e.g., substitution) of greenhouse gases
renewable-energy: offsets derived from expanding renewable energy capacity
energy-efficiency: offsets derived from decreasing the amount of energy required to complete a task
fuel-switching: offsets derived from generating energy using a fuel source that produces fewer greenhouse gasses
carbon-capture: offsets derived from technologies that capture and store carbon
land-use: offsets derived from land management changes outside of forests
biochar: offsets derived from biochar production and application
Category assignment is primarily determined by project type through the mapping defined in offsets-db-data/configs/type-category-mapping.json
.
This mapping connects specific project types (like “improved forest management” or “cookstoves”) to their appropriate category.
Project Type Assignment#
Project types represent more specific offset approaches. For example, within the category of “forest”, projects might be classified as “improved forest management”, “afforestation/reforestation”, or “avoided forest conversion”.
Project types are determined through a multi-step process:
First, we attempt to infer the project type from protocol information (via
offsets_db_data.projects.infer_project_type
).We apply manual overrides from curated data sources (via
offsets_db_data.projects.override_project_types
). Currently, the Berkeley Carbon Trading Project data inoffsets-db-data/configs/berkeley-project-types.json
serves as the authoritative source for project types. The project types from the Berkeley Carbon Trading Project’s Voluntary Registry Offsets Database are licensed under a CC-BY 4.0 license.
Retirement User Harmonization#
Carbon offset credits are often retired on behalf of a specific entity or organization. However, the names of these retirement users are recorded inconsistently across registry data, making it difficult to analyze retirement patterns. The following section describes our approach for identifying and harmonizing information about the end-users (“retirement users”) of specific offset credits.
Harmonization Process#
The harmonization process attempts to identify specific “retirement users” from publicly disclosed retirement beneficiary information. We try to standardize retirement user information across registries using the following steps:
Data merging: we combine information from four sources into a single temporary field:
retirement_beneficiary
: note specifically designating the entity claiming a credit’s environmental benefitsretirement_account
: name on account from which credits were retiredretirement_note
: short-form text accompanying credit retirementretirement_reason
: short form note specifying why credits were retired (e.g., compliance purposes). Sometimes similar to a retirement note
We refer to these fields as “retirement beneficiary data.” Any one of these fields might contain information useful for relating a transaction to a retirement user.
Standardization via OpenRefine: we process this merged information through OpenRefine using a detailed set of transformation rules define in
offsets-db-data/configs/beneficiary-mappings.json
. This includes:text transformations that standardize common company names and entities
pattern matching to identify the same entities despite different formatting
Only confident matches are included in the harmonized beneficiary field, retirement_beneficiary_harmonized
.
The retirement user harmonization process runs daily, along with the rest of OffsetsDB.
However, the underyling standardization rules (implemented via OpenRefine) are only irregularly updated.
This means that there might be new retirement data that could be mapped to a known entity but, because of that mapping has not previously been described, that relationship is not reflected in OffsetsDB.
To account for this, all searches via the database tool return matches across all available retirement beneficiary fields: retirement_beneficiary
, retirement_account
, retirement_note
, retirement_reason
, and retirement_beneficiary_harmonized
.
Thus, searching for known retirement users, like Delta
, will return all records that contain the substring delta
anywhere within their retirement beneficiary data.
Users should carefully examine these unmapped transactions to determine whether or not these unmapped records are relevant to their specific search.
Implementation Details#
Retirement user harmonization is implemented in the function offsets_db_data.credits.harmonize_beneficiary_data
.
This function runs a temporary OpenRefine project using the offsets-db-data-orcli
command-line tool (which is a wrapper around orcli
, an OpenRefine’s command-line interface) to apply the transformations defined in our mapping file.
The result is a new column, retirement_beneficiary_harmonized
, that contains the standardized user names.
Examples of Standardization#
Our harmonization process unifies many common variations:
“Delta Air Lines”, “Delta Airlines” → “Delta Airlines”
“Terpel”, “Organizacion Terpel”, “Terpel S.A.” → “Terpel”
“Retired on behalf of Sydney Opera House” → “Sydney Opera House”
Why This Matters#
Without harmonization, the same entity might appear under multiple names, making it difficult to accurately analyze which entities are retiring the most credits. This harmonization allows for more accurate aggregation of retirement data by user.
Note
The harmonizaton process can be toggled on or off via the harmonize_beneficiary_info
parameter of the process_{registry_abbreviation}_credits
functions.
Registry Specific Transformations#
Some transformations involved in producing OffsetsDB require special knowledge or assumptions about the underlying data. This section highlights special cases.
Note
For additional context, consult specific function docstrings in the API reference or reach out on GitHub if something doesn’t make sense.
American Carbon Registry#
Project status: When processing ACR projects, we combine two status properties present in the raw data: Compliance Program Status (ARB or Ecology)
and Voluntary Status
.
For compliance projects, we report compliance program status.
For voluntary projects, we report voluntary status.
Gold Standard#
Planned Emission Reductions#
Some Gold Standard protocols all for the issuance of “planned emission reductions” (PERs). These credits represent anticipated climate benefits that are expected to occur in the future. PERs are issued and can be traded, but cannot be retired. OffsetsDB includes all issued PERs when reporting vintage and project level credit totals.
Verra#
Issuance#
Verra allows “rolling” credit issuance. This allows projects to complete the paperwork and verificaiton processes for credit issuance, but delay the actual issuance event. This results in ambiguities around the precise timing of credit issuance events, as credits that are eligible to be issued but have not yet been issued, are not publicly reported in the Verra crediting data. We handle this ambiguity by assuming that the first crediting event, be it an issuance, retirement, or cancellation, on a per-project, per-vintage basis results in issuance of 100 percent of credits eligible to be issued for that project-vintage.
Retirements vs. Cancellations#
Verra’s publicly available data does not distinguish between retirement events and cancellation events.
We report all Verra retirements and cancellations as retirement
.
We originally contemplated tagging every Verra retirement or cancellation as retirement/cancellation
.
This made our processed Verra data slightly incompatiable with data from other registries.
Simple queries, like “give me all the retirements”, suddenly required writing code that look like this:
credits[(credits['transaction_type'] == 'retirement') | (credits['transaction_type'] == 'retirement/cancellation')]
Warning
Because we know the majority of Verra retirement/cancellation
events are in fact retirement
, we opted for this more ergonomic representation of the data.
Any analysis involving Verra retirement data should clearly specify that Verra’s raw data does not currently distinguish between retirement and cancellation events.
Vintage Date: Verra allows for the simultaneous issuance of multiple vintages. We assign all credits from these multi-vintage issuances to the earliest reported vintage year.
California Compliance Projects#
We treat the California Air Resources Board’s issuance table as the source of truth for all credits issued and retired by any project developed under an ARB-approved protocol. When a project appears in the issuance table, we drop all crediting data reported from the project’s host registry and only report the issuance and retirement values contained within the ARB issuance table. This methodological decision introduces a small error when it comes to “Early Action” projects. These projects were developed during the very first phase of California’s offsets program, which had slightly different rules. After the early action phase, some projects transitioned into the full compliance program, while others did not. Fully accounting for how these early projects retired credits, both reported by CARB’s issuance table and the registries, likely requires more careful treatment.
Retirement Dates: Offsets retired for compliance with California’s cap-and-trade program occur on fixed dates that correspond with the program’s reporting deadlines.
These deadlines come in two forms: partial (annual) and full (triennial) compliance events.
For simplicity, the current version of OffsetsDB uses the date of the full (triennial) compliance event as the retirement date for all compliance offsets.
This means some retirement dates go unrecorded.
Specifically, compliance credits retired for voluntary purposes (i.e., not to satisfy requirements under AB32) and credits retired in linked markets (e.g., Quebec) are unknown and reported as NaT
.
Other Normalizations#
Country#
We use the Python package coutnry_convertor to harmonize country names.
Project Status#
OffsetsDB unifies and simplifies project status information reported by the registries.
OffsetsDB contains use the following status codes: listed
, registered
, and completed
.
Listed typically refers to the stage during which a project has been formally announced and is undergoing development, but has yet to receive credits.
Registered refers to projects that have received credits and are eligible to receive additional credits in the future.
Completed means a project previously received credits and is not currently able to receive additional credits in the future.
Many registries have far more detailed project status information, often reflecting the specific stages of the registry’s validation and verification process.
Future work might focus on normalizing these additional project states across the registries.
Schema#
Projects#
Project data conform to the following schema:
{
"properties": {
"category": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Category of the project",
"title": "Category"
},
"country": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"title": "Country"
},
"first_issuance_at": {
"anyOf": [
{
"format": "date",
"type": "string"
},
{
"type": "null"
}
],
"description": "Date of first issuance of credits",
"title": "First Issuance At"
},
"first_retirement_at": {
"anyOf": [
{
"format": "date",
"type": "string"
},
{
"type": "null"
}
],
"description": "Date of first retirement of credits",
"title": "First Retirement At"
},
"is_compliance": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"description": "Whether project is compliance project",
"title": "Is Compliance"
},
"issued": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Total of issued credits",
"title": "Issued"
},
"listed_at": {
"anyOf": [
{
"format": "date",
"type": "string"
},
{
"type": "null"
}
],
"description": "Date project was listed",
"title": "Listed At"
},
"name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Name of the project",
"title": "Name"
},
"project_id": {
"description": "Project id used by registry system",
"title": "Project Id",
"type": "string"
},
"project_url": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "URL to project details",
"title": "Project Url"
},
"proponent": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"title": "Proponent"
},
"protocol": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "List of protocols",
"title": "Protocol"
},
"registry": {
"description": "Name of the registry",
"title": "Registry",
"type": "string"
},
"retired": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 0,
"description": "Total of retired credits",
"title": "Retired"
},
"status": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"title": "Status"
},
"project_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Type of project",
"title": "Project Type"
},
"project_type_source": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Source of project type information",
"title": "Project Type Source"
}
},
"required": [
"project_id",
"name",
"registry",
"proponent",
"category",
"status",
"country",
"listed_at",
"is_compliance",
"first_issuance_at",
"first_retirement_at",
"project_url",
"project_type",
"project_type_source"
],
"title": "Project",
"type": "object"
}
The majority of project attributes are directly taken from the project data downloaded from each registry. Table 1 provides the mapping from the raw column names found in downloaded registry data to the OffsetsDB project schema.
verra |
climate-action-reserve |
american-carbon-registry |
gold-standard |
art-trees |
|
---|---|---|---|---|---|
project_id |
ID |
Project ID |
Project ID |
id |
Program ID |
name |
Name |
Project Name |
Project Name |
name |
Program Name |
protocol |
Methodology |
Project Type |
Project Methodology/Protocol |
methodology |
- |
category |
inferred from protocol or type |
inferred from protocol or type |
inferred from protocol |
inferred from protocol or type |
inferred from protocol or type |
proejct_type |
manually assigned |
manually assigned |
manually assigned |
manually assigned |
manually assigned |
proponent |
Proponent |
Project Owner |
Project Developer |
project_developer |
Sovereign Program Developer |
country |
Country/Area |
Project Site Country |
Project Site Country |
country |
Program Country |
status |
Status |
Status |
Derived: voluntary + compliance status |
status |
Status |
listed_at |
Project Listed Date |
- |
- |
- |
- |
Credits#
Credit data conform to the following schema:
{
"properties": {
"id": {
"default": null,
"title": "Id",
"type": "integer"
},
"project_id": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Project id used by registry system",
"title": "Project Id"
},
"quantity": {
"description": "Tons of carbon dioxide equivalent (each ton is one carbon credit",
"title": "Quantity",
"type": "integer"
},
"retirement_account": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Name on account from which credits were retired",
"title": "Retirement Account"
},
"retirement_beneficiary": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Note specifically designating the entity claiming a credit’s environmental benefits",
"title": "Retirement Beneficiary"
},
"retirement_beneficiary_harmonized": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Harmonized beneficiary of credits",
"title": "Retirement Beneficiary Harmonized"
},
"retirement_note": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Short-form text accompanying credit retirement",
"title": "Retirement Note"
},
"retirement_reason": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Short form note specifying why credits were retired (e.g., compliance purposes). Sometimes similar to a retirement note. ",
"title": "Retirement Reason"
},
"transaction_date": {
"anyOf": [
{
"format": "date",
"type": "string"
},
{
"type": "null"
}
],
"description": "Date of transaction",
"title": "Transaction Date"
},
"transaction_type": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Type of transaction",
"title": "Transaction Type"
},
"vintage": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"description": "Year when carbon avoidance/removal occurred",
"title": "Vintage"
}
},
"required": [
"quantity",
"vintage",
"transaction_date",
"transaction_type",
"retirement_account",
"retirement_beneficiary",
"retirement_reason",
"retirement_note",
"retirement_beneficiary_harmonized",
"project_id"
],
"title": "Credit",
"type": "object"
}
The majority of credit attributes are taken directly taken from the credit data downloaded from each registry.
The raw attribute names of crediting data tends to vary depending on the transaction type.
For example, ART TREES refers to retirement dates as Status Effective
, while it reports issuances as Date Approved
.
Rather than produce a table of each of these mappings here, please refer to credits-raw-columns-mappings.json.