Data Processing#
Order of Operations#
OffsetsDB follows a typical extract-transform-load (ETL) workflow.
Extraction involves querying and downloading raw credit and project data hosted by offset registries.
Transformation involves executing the functions contained within this repo, offsets-db-data
.
Load involves uploading the resulting data to S3 and the Postgres-backend that powers the OffsetsDB web tool.
Downloading raw data#
We download a fresh copy of project and transaction data on a daily basis. While downloading, we make no changes to the raw data provided by the registries. We’ve fully automated downloading of registry data, with the exception of Gold Standard. Gold Standard’s terms and conditions require that downloads occur through the interfaces provided by the Gold Standard site. Those interfaces, as provided, do not allow automated downloads.
We have no plans to release the code the directly interacts with registries.
We decided to keep this part of OffsetsDB private in an effort to limit download requests to the registries.
Other, technical aspects of OffsetsDB, like the database and API that power the database tool are similarly closed.
We made this decision to ensure that the OffsetsDB database tool remains performant.
Critically, the production database represents an identical clone of the data generated by the code contained within offsets-db-data
.
No additional processing or inferences should occur outside the context of this repository.
Transforming raw data#
Nearly the entirety of the code contained within offsets-db-data
involves registry-specific logic for transforming raw registry data into a common, shared schema.
The logic for transforming the data of each registry is contained within a single file, with the filename denoting which registry the transformations apply to.
For example, the logic involved in transforming Verra data are contained within offsets_db_data.vcs
.
Each registry-specific file contains at least two functions: process_{registry_abbreviation}_credits
and process_{registry_abbreviation}_projects
Those functions, in turn, call a series of additional transformation functions that produce the normalized project and credit data which combine to form OffsetsDB.
These transformation functions tend to be quite small and operate on one or two properties of the raw data.
To continue with the Verra example, vcs.py
contains functions with names like offsets_db_data.vcs.set_vcs_vintage_year
and offsets_db_data.vcs.add_vcs_project_id
.
These functions contain the registry-specific logic needed to map Verra’s raw data to the common data schema of OffsetsDB.
An example#
In practice, replicating the behavior of OffsetsDB should be simple.
Here’s an example of using offsets_db_data
to transform the raw transactions data from Verra into a normalized, analysis ready file:
import pandas as pd
pd.options.display.max_columns = 5
from offsets_db_data import vcs
archive_fname = {{ path to local copy of Verra transaction data }}
raw_credits = pd.read_csv(archive_fname)
processed_credits = vcs.process_vcs_credits(raw_credits)
Note
Running the above example requires first downloading a copy of Verra’s transaction data and changing the above code to reference the location of that data on your local machine.
Invoking single transformation functions, like offsets_db_data.vcs.set_vcs_vintage_year
is even more straightforward.
Let’s say you want to understand more about how OffsetsDB assigns Verra credits a vintage year.
You can explore the behavior of this single transformation function by calling:
raw_credits.set_vcs_vintage_year(date_column='Vintage End').head()
It’s worth noting that we’ve wrapped all transformation functions using the pandas_flavor.register_dataframe_method
decorator.
That means that after importing a registry module from offsets_db_data
, the transformation functions of that module are directly callable by any Pandas dataframe.
Initial Column Mapping#
The initial and perhaps must mundane transformation of OffsetsDB involves mapping properties in the raw data to a common schema.
This step requires constructing a map between the names of properties as they appear in the raw data to the property in OffsetsDB.
For example, the Climate Action Reserve data refers to the property, project_id
, as Project ID
.
The ART registry, however, refers to the same property as Program ID
.
These column mapping files are stored in offsets_db_data/configs
.
There is a separate mapping file for projects
data and credits
data.
Some properties either aren’t included in the raw data or inferring their value requires special processing.
In these cases, a null
value is recorded in the column mapping files and the property is populated using registry-specific logic.
Protocol Mapping & Categorization#
Offset projects are developed by following a set of rules, known as a protocol. These rules specify things like when measurements must be made and what paperwork must be submitted in order for a project to receive credits. Unfortunately, there is no standardized way of referring to the exact protocol (or protocol version) used to develop an offset project. Even within the domain of a single registry, references to the exact protocol used to develop a project are often inconsistent. Take for example the Clean Development Mechanism protocol AMS-III.D., “Methane recovery in animal manure management systems”. Across all five registries included in OffsetsDB, we identified twenty-two unique ways of referring to this one protocol.
OffsetsDB addresses this problem by manually assigning every known protocol string to a common schema.
Continuing with the AMS-III.D. example, we map all twenty-two “known strings” that describe the same protocol to a single, unified reference, ams-iii-d
.
We repeat this manual unification of dissimilar strings for all protocols across all registries.
The results of the mapping are contained within offsets-db-data/configs/all-protocol-mapping.json
.
We also categorize each of these unified protocol references. Those categories include:
agriculture: offsets derived from changes in the management of agricultural systems, including livestock.
forest: offsets derived from the management of forests.
ghg-management: offsets derived from the destruction or elimination (e.g., substitution) of greenhouse gases
renewable-energy: offsets derived from expanding renewable energy capacity
energy-efficiency: offsets derived from decreasing the amount of energy required to complete a task.
fuel-switching: offsets derived from generating energy using a fuel source that produces fewer greenhouse gasses.
The borders between these categories often blur. That is especially the case with energy efficiency and fuel switching protocols. Many protocols in these categories allow for projects that accomplish some mixture of the two approaches for displacing and reducing greenhouse gas emissions. Despite this blurriness, we assign each protocol to a single category. In the future, we are especially interested in adding project-level categorizations to the data. These project categories would highlight various sub-types of projects. This, for example, would provide an approach for distinguishing hydroelectric projects from wind projects, both of which fall within the renewable energy category.
Finally, several smaller categories exist within the data (e.g., biochar).
Registry specific transformations#
Some transformations involved in producing OffsetsDB require special knowledge or assumptions about the underlying data. This section highlights special cases.
Note
For additional context, consult specific function docstrings in the API reference or reach out on GitHub if something doesn’t make sense.
American Carbon Registry#
Project status: When processing ACR projects, we combine two status properties present in the raw data: Compliance Program Status (ARB or Ecology)
and Voluntary Status
.
For compliance projects, we report compliance program status.
For voluntary projects, we report voluntary status.
Gold Standard#
Planned Emission Reductions#
Some Gold Standard protocols all for the issuance of “planned emission reductions” (PERs). These credits represent anticipated climate benefits that are expected to occur in the future. PERs are issued and can be traded, but cannot be retired. OffsetsDB includes all issued PERs when reporting vintage and project level credit totals.
Verra#
Issuance#
Verra allows “rolling” credit issuance. This allows projects to complete the paperwork and verificaiton processes for credit issuance, but delay the actual issuance event. This results in ambiguities around the precise timing of credit issuance events, as credits that are eligible to be issued but have not yet been issued, are not publicly reported in the Verra crediting data. We handle this ambiguity by assuming that the first crediting event, be it an issuance, retirement, or cancellation, on a per-project, per-vintage basis results in issuance of 100 percent of credits eligible to be issued for that project-vintage.
Retirements vs. Cancellations#
Verra’s publicly available data does not distinguish between retirement events and cancellation events.
We report all Verra retirements and cancellations as retirement
.
We originally contemplated tagging every Verra retirement or cancellation as retirement/cancellation
.
This made our processed Verra data slightly incompatiable with data from other registries.
Simple queries, like “give me all the retirements”, suddenly required writing code that look like this:
credits[(credits['transaction_type'] == 'retirement') | (credits['transaction_type'] == 'retirement/cancellation')]
Because we know the majority of Verra retirement/cancellation
events are in fact retirement
, we opted for this more ergonomic representation of the data.
Any analysis involving Verra retirement data should clearly specify that Verra’s raw data does not currently distinguish between retirement and cancellation events.
Vintage Date: Verra allows for the simultaneous issuance of multiple vintages. We assign all credits from these multi-vintage issuances to the earliest reported vintage year.
California Compliance Projects#
We treat the California Air Resources Board’s issuance table as the source of truth for all credits issued and retired by any project developed under an ARB-approved protocol. When a project appears in the issuance table, we drop all crediting data reported from the project’s host registry and only report the issuance and retirement values contained within the ARB issuance table. This methodological decision introduces a small error when it comes to “Early Action” projects. These projects were developed during the very first phase of California’s offsets program, which had slightly different rules. After the early action phase, some projects transitioned into the full compliance program, while others did not. Fully accounting for how these early projects retired credits, both reported by CARB’s issuance table and the registries, likely requires more careful treatment.
Retirement Dates: Offsets retired for compliance with California’s cap-and-trade program occur on fixed dates that correspond with the program’s reporting deadlines.
These deadlines come in two forms: partial (annual) and full (triennial) compliance events.
For simplicity, the current version of OffsetsDB uses the date of the full (triennial) compliance event as the retirement date for all compliance offsets.
This means some retirement dates go unrecorded.
Specifically, compliance credits retired for voluntary purposes (i.e., not to satisfy requirements under AB32) and credits retired in linked markets (e.g., Quebec) are unknown and reported as NaT
.
Other normalizations#
Country#
We use the Python package coutnry_convertor to harmonize country names.
Project Status#
OffsetsDB unifies and simplifies project status information reported by the registries.
OffsetsDB contains use the following status codes: listed
, registered
, and completed
.
Listed typically refers to the stage during which a project has been formally announced and is undergoing development, but has yet to receive credits.
Registered refers to projects that have received credits and are eligible to receive additional credits in the future.
Completed means a project previously received credits and is not currently able to receive additional credits in the future.
Many registries have far more detailed project status information, often reflecting the specific stages of the registry’s validation and verification process.
Future work might focus on normalizing these additional project states across the registries.
Schema#
Projects#
Project data conform to the following schema:
{
"title": "Project",
"properties": {
"project_id": {
"title": "Project Id",
"description": "Project id used by registry system",
"type": "string"
},
"name": {
"title": "Name",
"description": "Name of the project",
"type": "string"
},
"registry": {
"title": "Registry",
"description": "Name of the registry",
"type": "string"
},
"proponent": { "title": "Proponent", "type": "string" },
"protocol": {
"title": "Protocol",
"description": "List of protocols",
"type": "array",
"items": { "type": "string" }
},
"category": {
"title": "Category",
"description": "List of categories",
"type": "array",
"items": { "type": "string" }
},
"status": { "title": "Status", "type": "string" },
"country": { "title": "Country", "type": "string" },
"listed_at": {
"title": "Listed At",
"description": "Date project was listed",
"type": "string",
"format": "date"
},
"is_compliance": {
"title": "Is Compliance",
"description": "Whether project is compliance project",
"type": "boolean"
},
"retired": {
"title": "Retired",
"description": "Total of retired credits",
"type": "integer"
},
"issued": {
"title": "Issued",
"description": "Total of issued credits",
"type": "integer"
},
"project_url": {
"title": "Project Url",
"description": "URL to project details",
"type": "string"
}
}
}
The majority of project attributes are directly taken from the project data downloaded from each registry. Table 1 provides the mapping from the raw column names found in downloaded registry data to the OffsetsDB project schema.
verra |
climate-action-reserve |
american-carbon-registry |
gold-standard |
art-trees |
|
---|---|---|---|---|---|
project_id |
ID |
Project ID |
Project ID |
id |
Program ID |
name |
Name |
Project Name |
Project Name |
name |
Program Name |
protocol |
Methodology |
Project Type |
Project Methodology/Protocol |
methodology |
- |
category |
inferred from protocol |
inferred from protocol |
inferred from protocol |
inferred from protocol |
inferred from protocol |
project_subcategory |
manually assigned |
manually assigned |
manually assigned |
manually assigned |
manually assigned |
proponent |
Proponent |
Project Owner |
Project Developer |
project_developer |
Sovereign Program Developer |
country |
Country/Area |
Project Site Country |
Project Site Country |
country |
Program Country |
status |
Status |
Status |
Derived: voluntary + compliance status |
status |
Status |
listed_at |
Project Listed Date |
- |
- |
- |
- |
Credits#
Credit data conform to the following schema:
{
"title": "Credit",
"properties": {
"id": {
"title": "Id",
"type": "integer"
},
"project_id": {
"title": "Project ID",
"description": "Unique project identifier, by registry",
"type": "string"
},
"quantity": {
"title": "Quantity",
"description": "Number of credits",
"type": "integer"
},
"vintage": {
"title": "Vintage",
"description": "Vintage year of credits",
"type": "integer"
},
"transaction_date": {
"title": "Transaction Date",
"description": "Date of transaction",
"type": "string",
"format": "date"
},
"transaction_type": {
"title": "Transaction Type",
"description": "Type of transaction (i.e., issuance, retirement)",
"type": "string"
}
}
}
The majority of credit attributes are taken directly taken from the credit data downloaded from each registry.
The raw attribute names of crediting data tends to vary depending on the transaction type.
For example, ART TREES refers to retirement dates as Status Effective
, while it reports issuances as Date Approved
.
Rather than produce a table of each of these mappings here, please refer to credits-raw-columns-mappings.json.