pip install pydantic-glue

Converts pydantic schemas to JSON Schema and then to an AWS Glue schema,
so in theory anything that can be converted to JSON Schema could also work.
When using AWS Kinesis Firehose in a configuration that receives JSON records and writes parquet files to S3,
one needs to define an AWS Glue table so Firehose knows what schema to use when creating the parquet files.
AWS Glue lets you define a schema using Avro or JSON Schema and then create a table from that schema,
but as of May 2022
tables created that way can't be used with Kinesis Firehose.
https://stackoverflow.com/questions/68125501/invalid-schema-error-in-aws-glue-created-via-terraform
This is also confirmed by AWS support.
What one could do is create a table and set the columns manually, but this means you now have two sources of truth to maintain.
This tool allows you to define a table in pydantic
and generate a JSON with column types that can be used with terraform to create a Glue table.
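Terraform is the consumer this README targets, but the generated file is plain JSON, so other tooling could read it as well. The sketch below is not part of pydantic-glue. It assumes the output has a top-level "columns" object that maps column names to Glue type strings, and it reshapes that into the list of Name/Type entries used for columns in the AWS Glue API. The helper name to_glue_columns is hypothetical.

```python
import json
from typing import Dict, List


def to_glue_columns(document: str) -> List[Dict[str, str]]:
    """Turn the generated JSON text into Glue-API-style column entries.

    Assumes the document looks like {"columns": {"<name>": "<glue type>"}}.
    """
    columns = json.loads(document)["columns"]
    # Column order follows the key order in the JSON document.
    return [{"Name": name, "Type": glue_type} for name, glue_type in columns.items()]


# Illustrative input in the assumed shape.
example_document = '{"columns": {"nums": "array<int>", "other": "string"}}'
glue_columns = to_glue_columns(example_document)
```

The resulting list could then be passed wherever a Glue column list is expected, for example in a storage descriptor built from Python instead of terraform.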
Take the following pydantic class:

from pydantic import BaseModel
from typing import List

class Bar(BaseModel):
    name: str
    age: int

class Foo(BaseModel):
    nums: List[int]
    bars: List[Bar]
    other: str

Running pydantic-glue

pydantic-glue -f example.py -c Foo

you get this JSON in the terminal:
{
  "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
  "columns": {
    "nums": "array<int>",
    "bars": "array<struct<name:string,age:int>>",
    "other": "string"
  }
}

and can be used in terraform like this:
locals {
  columns = jsondecode(file("${path.module}/glue_schema.json")).columns
}

resource "aws_glue_catalog_table" "table" {
  name          = "table_name"
  database_name = "db_name"

  storage_descriptor {
    dynamic "columns" {
      for_each = local.columns

      content {
        name = columns.key
        type = columns.value
      }
    }
  }
}

Alternatively, you can run the CLI with the -o flag to set the output file location:

pydantic-glue -f example.py -c Foo -o example.json -l

Wherever there is a type key in the input JSON Schema, an additional key glue_type may be
defined to override the type that is used in the AWS Glue Schema. This is, for example, useful for
a pydantic model that has a field of type int that is unix epoch time, while the column type you
would like in Glue is of type timestamp.
Additional JSON Schema keys to a pydantic model can be added by using the
Field function
with the argument json_schema_extra like so:
from pydantic import BaseModel, Field

class A(BaseModel):
    epoch_time: int = Field(
        ...,
        json_schema_extra={
            "glue_type": "timestamp",
        },
    )

The resulting JSON Schema will be:
{
  "properties": {
    "epoch_time": {
      "glue_type": "timestamp",
      "title": "Epoch Time",
      "type": "integer"
    }
  },
  "required": [
    "epoch_time"
  ],
  "title": "A",
  "type": "object"
}

And the result after processing with pydantic-glue:
{
  "//": "Generated by pydantic-glue at 2022-05-25 12:35:55.333570. DO NOT EDIT",
  "columns": {
    "epoch_time": "timestamp"
  }
}

Recursing through object properties terminates when you supply a glue_type to use. If the type is
complex, you must supply the full complex type yourself.
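To make the override rule concrete, here is a minimal sketch of a recursive mapping from JSON Schema nodes to Glue type strings. It is not the actual pydantic-glue implementation: the function names to_glue_type and to_columns are hypothetical, and the scalar table is an assumption (only string and integer appear in this README; the number and boolean entries are guesses). It shows that a glue_type key wins over type and stops the recursion, which is why a complex override has to be spelled out in full.

```python
from typing import Any, Dict

# Assumed scalar mapping. "string" and "integer" match the examples in this
# README; "number" and "boolean" are assumptions for illustration only.
SCALARS = {
    "string": "string",
    "integer": "int",
    "number": "double",
    "boolean": "boolean",
}


def to_glue_type(node: Dict[str, Any]) -> str:
    """Map one JSON Schema node to a Glue type string."""
    # An explicit override is returned verbatim and nothing below it is visited.
    if "glue_type" in node:
        return node["glue_type"]
    kind = node.get("type")
    if kind == "array":
        return "array<{}>".format(to_glue_type(node["items"]))
    if kind == "object":
        fields = ",".join(
            "{}:{}".format(name, to_glue_type(prop))
            for name, prop in node.get("properties", {}).items()
        )
        return "struct<{}>".format(fields)
    if kind in SCALARS:
        return SCALARS[kind]
    raise NotImplementedError("unsupported JSON Schema type: {!r}".format(kind))


def to_columns(schema: Dict[str, Any]) -> Dict[str, str]:
    """Map the top-level properties of an object schema to column types."""
    return {name: to_glue_type(prop) for name, prop in schema["properties"].items()}


# Hand-written schema with references already inlined. "labels" is an object
# whose override replaces the struct that recursion would otherwise produce.
schema = {
    "type": "object",
    "properties": {
        "bars": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
            },
        },
        "epoch_time": {"type": "integer", "glue_type": "timestamp"},
        "labels": {
            "type": "object",
            "glue_type": "map<string,string>",
            "properties": {"ignored": {"type": "string"}},
        },
    },
}
columns = to_columns(schema)
```

In this sketch the properties under labels are never inspected, so the override string map<string,string> is the whole column type.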
- pydantic gets converted to JSON Schema
- the JSON Schema types get mapped to Glue types recursively
- Not all types are supported. I just add types as I need them, but adding types is very easy; feel free to open an issue or send a PR if you stumble upon an unsupported use case
- the tool could be easily extended to work with JSON Schema directly
- thus, anything that can be converted to a JSON Schema should also work.
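Working from JSON Schema directly means dealing with references: schemas produced by pydantic v2 usually place nested models under "$defs" and point at them with "$ref", and hand-written schemas often do the same. The sketch below shows one way such local references could be inlined before any type mapping. Whether pydantic-glue does this internally is not stated here, so treat the function names inline_schema and _inline as hypothetical. It only handles "#/$defs/..." references and would recurse forever on a self-referencing schema.

```python
from typing import Any, Dict

PREFIX = "#/$defs/"


def _inline(node: Any, defs: Dict[str, Any]) -> Any:
    """Return a copy of node with local $ref entries replaced by their targets."""
    if isinstance(node, list):
        return [_inline(item, defs) for item in node]
    if not isinstance(node, dict):
        return node
    # Copy every key except the reference itself, inlining nested nodes.
    resolved = {key: _inline(value, defs) for key, value in node.items() if key != "$ref"}
    ref = node.get("$ref")
    if isinstance(ref, str) and ref.startswith(PREFIX):
        target = _inline(defs[ref[len(PREFIX):]], defs)
        # Keys written next to the $ref (for example glue_type) take precedence.
        resolved = {**target, **resolved}
    return resolved


def inline_schema(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Inline all local references and drop the top-level $defs section."""
    defs = schema.get("$defs", {})
    body = {key: value for key, value in schema.items() if key != "$defs"}
    return _inline(body, defs)


# Schema shaped like what pydantic v2 typically emits for a nested model.
foo_schema = {
    "$defs": {
        "Bar": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
        }
    },
    "type": "object",
    "properties": {
        "bars": {"type": "array", "items": {"$ref": "#/$defs/Bar"}},
        "other": {"type": "string"},
    },
}
flat_schema = inline_schema(foo_schema)
```

After inlining, every node carries its own type and properties, so a recursive type mapper never has to look anything up outside the node it is visiting.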