diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index a00af99..93e43f3 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -7,13 +7,13 @@ jobs:
build:
runs-on: ubuntu-latest
steps:
- - name: Set up Go 1.x
- uses: actions/setup-go@v2
+ - name: Set up Go
+ uses: actions/setup-go@v5
with:
- go-version: ^1.15
+ go-version: '1.24.x'
- name: Check out code into the Go module directory
- uses: actions/checkout@v2
+ uses: actions/checkout@v4
- name: Update pip
run: pip install --upgrade pip
@@ -31,13 +31,13 @@ jobs:
name: lint
runs-on: ubuntu-latest
steps:
- - uses: actions/setup-go@v3
+ - uses: actions/setup-go@v5
with:
- go-version: 1.21
- - uses: actions/checkout@v3
+ go-version: '1.24.x'
+ - uses: actions/checkout@v4
- name: golangci-lint
- uses: golangci/golangci-lint-action@v3
+ uses: golangci/golangci-lint-action@v7
with:
# Optional: version of golangci-lint to use in form of v1.2 or v1.2.3 or `latest` to use the latest version
- version: v1.54
+ version: v2.10.1
# args: --timeout 2m
diff --git a/.golangci.yml b/.golangci.yml
index 0bbd141..29f0c3b 100644
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -1,27 +1,30 @@
-run:
- skip-files:
- - ".*bindata.go$"
- - ".*pb.go"
- - ".*pb.gw.go"
+version: "2"
+run:
timeout: 5m
-issues:
- exclude:
- - "not declared by package utf8"
- - "unicode/utf8/utf8.go"
-
linters:
- # Disable all linters.
- # Default: false
- disable-all: true
+ default: none
# Enable specific linter
# https://golangci-lint.run/usage/linters/#enabled-by-default
enable:
- - gofmt
- - goimports
- misspell
- - typecheck
- - gosimple
- - govet
\ No newline at end of file
+ - govet
+ exclusions:
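+    # migrated from the v1 run.skip-files and issues.exclude settings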
+ paths:
+ - ".*bindata.go$"
+ - ".*pb.go"
+ - ".*pb.gw.go"
+ rules:
+ - text: "not declared by package utf8"
+ linters:
+ - govet
+ - path: "unicode/utf8/utf8.go"
+ linters:
+ - govet
+
+formatters:
+ enable:
+ - gofmt
+ - goimports
\ No newline at end of file
diff --git a/Makefile b/Makefile
index e8b4980..4e3c39c 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
-SIFTER_VERSION=0.1.5
+SIFTER_VERSION=0.2.0
#hack to get around submodule weirdness in automated docker builds
hub-build:
@@ -30,5 +30,3 @@ test: .TEST
.TEST:
go test ./test
-docs:
- @go run docschema/main.go | ./docschema/schema-to-markdown.py > Playbook.md
diff --git a/Playbook.md b/Playbook.md
deleted file mode 100644
index d21bce5..0000000
--- a/Playbook.md
+++ /dev/null
@@ -1,1216 +0,0 @@
-# Introduction
-SIFTER is an Extract Transform Load (ETL) platform that is designed to take
-a variety of standard input sources, create a message streams and run a
-set of transforms to create JSON schema validated output classes.
-SIFTER is based based on implementing a Playbook that describes top level
-Extractions, that can include downloads, file manipulation and finally reading
-the contents of the files. Every extractor is meant to produce a stream of
-MESSAGES for transformation. A message is a simple nested dictionary data structure.
-
-Example Message:
-
-```
-{
- "firstName" : "bob",
- "age" : "25"
- "friends" : [ "Max", "Alex"]
-}
-```
-
-Once a stream of messages are produced, that can be run through a TRANSFORM
-pipeline. A transform pipeline is an array of transform steps, each transform
-step can represent a different way to alter the data. The array of transforms link
-togeather into a pipe that makes multiple alterations to messages as they are
-passed along. There are a number of different transform steps types that can
-be done in a transform pipeline these include:
-
- - Projection
- - Filtering
- - Programmatic transformation
- - Table based field translation
- - Outputing the message as a JSON Schema checked object
-
-
-***
-# Example Playbook
-Our first task will be to convert a ZIP code TSV into a set of county level
-entries.
-
-The input file looks like:
-
-```
-ZIP,COUNTYNAME,STATE,STCOUNTYFP,CLASSFP
-36003,Autauga County,AL,01001,H1
-36006,Autauga County,AL,01001,H1
-36067,Autauga County,AL,01001,H1
-36066,Autauga County,AL,01001,H1
-36703,Autauga County,AL,01001,H1
-36701,Autauga County,AL,01001,H1
-36091,Autauga County,AL,01001,H1
-```
-
-First is the header of the Playbook. This declares the
-unique name of the playbook and it's output directory.
-
-```
-name: zipcode_map
-outdir: ./
-docs: Converts zipcode TSV into graph elements
-```
-
-Next the configuration is declared. In this case the only input is the zipcode TSV.
-There is a default value, so the playbook can be invoked without passing in
-any parameters. However, to apply this playbook to a new input file, the
-input parameter `zipcode` could be used to define the source file.
-
-```
-config:
- schema:
- type: Dir
- default: ../covid19_datadictionary/gdcdictionary/schemas/
- zipcode:
- type: File
- default: ../data/ZIP-COUNTY-FIPS_2017-06.csv
-```
-
-The `inputs` section declares data input sources. In this playbook, there is
-only one input, which is to run the table loader.
-```
-inputs:
- tableLoad:
- input: "{{config.zipcode}}"
- sep: ","
-```
-
-Tableload operaters of the input file that was originally passed in using the
-`inputs` stanza. SIFTER string parsing is based on mustache template system.
-To access the string passed in the template is `{{config.zipcode}}`.
-The seperator in the file input file is a `,` so that is also passed in as a
-parameter to the extractor.
-
-
-The `tableLoad` extractor opens up the TSV and generates a one message for
-every row in the file. It uses the header of the file to map the column values
-into a dictionary. The first row would produce the message:
-
-```
-{
- "ZIP" : "36003",
- "COUNTYNAME" : "Autauga County",
- "STATE" : "AL",
- "STCOUNTYFP" : "01001",
- "CLASSFP" : "H1"
-}
-```
-
-The stream of messages are then passed into the steps listed in the `transform`
-section of the tableLoad extractor.
-
-For the current tranform, we want to produce a single entry per `STCOUNTYFP`,
-however, the file has a line per `ZIP`. We need to run a `reduce` transform,
-that collects rows togeather using a field key, which in this case is `"{{row.STCOUNTYFP}}"`,
-and then runs a function `merge` that takes two messages, merges them togeather
-and produces a single output message.
-
-The two messages:
-
-```
-{ "ZIP" : "36003", "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
-{ "ZIP" : "36006", "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
-```
-
-Would be merged into the message:
-
-```
-{ "ZIP" : ["36003", "36006"], "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
-```
-
-The `reduce` transform step uses a block of python code to describe the function.
-The `method` field names the function, in this case `merge` that will be used
-as the reduce function.
-
-```
- zipReduce:
- - from: zipcode
- - reduce:
- field: STCOUNTYFP
- method: merge
- python: >
- def merge(x,y):
- a = x.get('zipcodes', []) + [x['ZIP']]
- b = y.get('zipcodes', []) + [y['ZIP']]
- x['zipcodes'] = a + b
- return x
-```
-
-The original messages produced by the loader have all of the information required
-by the `summary_location` object type as described by the JSON schema that was linked
-to in the header stanza. However, the data is all under the wrong field names.
-To remap the data, we use a `project` tranformation that uses the template engine
-to project data into new files in the message. The template engine has the current
-message data in the value `row`. So the value
-`FIPS:{{row.STCOUNTYFP}}` is mapped into the field `id`.
-
-```
- - project:
- mapping:
- id: "FIPS:{{row.STCOUNTYFP}}"
- province_state: "{{row.STATE}}"
- summary_locations: "{{row.STCOUNTYFP}}"
- county: "{{row.COUNTYNAME}}"
- submitter_id: "{{row.STCOUNTYFP}}"
- type: summary_location
- projects: []
-```
-
-Using this projection, the message:
-
-```
-{
- "ZIP" : ["36003", "36006"],
- "COUNTYNAME" : "Autauga County",
- "STATE" : "AL",
- "STCOUNTYFP" : "01001",
- "CLASSFP" : "H1"
-}
-```
-
-would become
-
-```
-{
- "id" : "FIPS:01001",
- "province_state" : "AL",
- "summary_locations" : "01001",
- "county" : "Autauga County",
- "submitter_id" : "01001",
- "type" : "summary_location"
- "projects" : [],
- "ZIP" : ["36003", "36006"],
- "COUNTYNAME" : "Autauga County",
- "STATE" : "AL",
- "STCOUNTYFP" : "01001",
- "CLASSFP" : "H1"
-}
-```
-
-Now that the data has been remapped, we pass the data into the 'objectCreate'
-transformation, which will read in the schema for `summary_location`, check the
-message to make sure it matches and then output it.
-
-```
- - objectCreate:
- class: summary_location
-```
-
-
-Outputs
-
-To create an output table, with two columns connecting
-`ZIP` values to `STCOUNTYFP` values. The `STCOUNTYFP` is a county level FIPS
-code, used by the census office. A single FIPS code my contain many ZIP codes,
-and we can use this table later for mapping ids when loading the data into a database.
-
-```
-outputs:
- zip2fips:
- tableWrite:
- from:
- output: zip2fips
- columns:
- - ZIP
- - STCOUNTYFP
-```
-
-
-***
-# File Format
-A Playbook is a YAML file, that links a schema to a series of extractors that
-in turn, can run several transforms to emit objects that are checked against
-the schema.
-
-
-***
-## Playbook
-
-The Playbook represents a single ETL pipeline that takes multiple inputs
-and turns them into multiple output streams. It can take a set of inputs
-then run a sequential set of extraction steps.
-
-
- - class
-
-> Type: *string*
-
- - name
-
-> Type: *string*
-
-: Unique name of the playbook
-
- - docs
-
-> Type: *string*
-
- - outdir
-
-> Type: *string*
-
- - config
-
-> Type: *object* of [ConfigVar](#configvar)
-
-
-: Configuration for Playbook
-
- - inputs
-
-> Type: *object* of [Extractor](#extractor)
-
-
-: Steps of the transformation
-
- - outputs
-
-> Type: *object* of [WriteConfig](#writeconfig)
-
-
- - pipelines
-
-> Type: *object*
-
-
-***
-## ConfigVar
-
- - name
-
-> Type: *string*
-
- - type
-
-> Type: *string*
-
- - default
-
-> Type: *string*
-
-
-***
-# Extraction Steps
-Every playbook consists of a series of extraction steps. An extraction step
-can be a data extractor that runs a transform pipeline.
-
-
-***
-## Extractor
-
-This object represents a single extractor step. It has a field for each possible
-extractor type, but only one is supposed to be filed in at a time.
-
-
- - description
-
-> Type: *string*
-
-: Human Readable description of step
-
- - xmlLoad
-
- of [XMLLoadStep](#xmlloadstep)
-
- - tableLoad
-
- of [TableLoadStep](#tableloadstep)
-
-: Run transform pipeline on a TSV or CSV
-
- - jsonLoad
-
- of [JSONLoadStep](#jsonloadstep)
-
-: Run a transform pipeline on a multi line json file
-
- - sqldumpLoad
-
- of [SQLDumpStep](#sqldumpstep)
-
-: Parse the content of a SQL dump to find insert and run a transform pipeline
-
- - gripperLoad
-
- of [GripperLoadStep](#gripperloadstep)
-
-: Use a GRIPPER server to get data and run a transform pipeline
-
- - avroLoad
-
- of [AvroLoadStep](#avroloadstep)
-
-: Load data from avro file
-
- - embedded
-
-> Type: *array*
-
- - glob
-
- of [GlobLoadStep](#globloadstep)
-
- - sqliteLoad
-
- of [SQLiteStep](#sqlitestep)
-
-An array of Extractors, each defining a different extraction step
-
-```
-- desc: Untar the input file
- untar:
- input: "{{inputs.tar}}"
-- desc: Loading Patient List
- tableLoad:
- input: data_clinical_patient.txt
- transform:
- ...
-- desc: Loading Sample List
- tableLoad:
- input: data_clinical_sample.txt
- transform:
- ...
-- fileGlob:
- files: [ data_RNA_Seq_expression_median.txt, data_RNA_Seq_V2_expression_median.txt ]
- steps:
- ...
-```
-
-
-***
-## SQLDumpStep
-
- - input
-
-> Type: *string*
-
-: Path to the SQL dump file
-
- - tables
-
-> Type: *array*
-
-: Array of transforms for the different tables in the SQL dump
-
-
-***
-## TableLoadStep
-
- - input
-
-> Type: *string*
-
-: TSV to be transformed
-
- - rowSkip
-
-> Type: *integer*
-
-: Number of header rows to skip
-
- - columns
-
-> Type: *array*
-
-: Manually set names of columns
-
- - extraColumns
-
-> Type: *string*
-
-: Columns beyond originally declared columns will be placed in this array
-
- - sep
-
-> Type: *string*
-
-: Separator \t for TSVs or , for CSVs
-
-
-***
-## JSONLoadStep
-
- - input
-
-> Type: *string*
-
-: Path of multiline JSON file to transform
-
- - transform
-
-> Type: *array* of [Step](#step)
-
-: Transformation Pipeline
-
- - multiline
-
-> Type: *boolean*
-
-: Load file as a single multiline JSON object
-
-```
-- desc: Convert Census File
- jsonLoad:
- input: "{{inputs.census}}"
- transform:
- ...
-```
-
-
-***
-## GripperLoadStep
-
-Use a GRIPPER server to obtain data
-
- - host
-
-> Type: *string*
-
-: GRIPPER URL
-
- - collection
-
-> Type: *string*
-
-: GRIPPER collection to target
-
-
-***
-# Transform Pipelines
-A tranform pipeline is a series of method to alter a message stream.
-
-
-***
-## ObjectCreateStep
-
-Output a JSON schema described object
-
- - class
-
-> Type: *string*
-
-: Object class, should match declared class in JSON Schema
-
- - schema
-
-> Type: *string*
-
-: Directory with JSON schema files
-
-
-***
-## MapStep
-
-Apply the sample function to every message
-
- - method
-
-> Type: *string*
-
-: Name of function to call
-
- - python
-
-> Type: *string*
-
-: Python code to be run
-
- - gpython
-
-> Type: *string*
-
-: Python code to be run using GPython
-
-The `python` section defines the code, and the `method` parameter defines
-which function from the code to call
-```
-- map:
- #fix weird formatting of zip code
- python: >
- def f(x):
- d = int(x['zipcode'])
- x['zipcode'] = "%05d" % (int(d))
- return x
- method: f
-```
-
-
-***
-## ProjectStep
-
-Project templates into fields in the message
-
- - mapping
-
-> Type: *object*
-
-: New fields to be generated from template
-
- - rename
-
-> Type: *object*
-
-: Rename field (no template engine)
-
-
-```
-- project:
- mapping:
- code: "{{row.project_id}}"
- programs: "{{row.program.name}}"
- submitter_id: "{{row.program.name}}"
- projects: "{{row.project_id}}"
- type: experiment
-```
-
-
-***
-## LookupStep
-
-Use a two column file to make values from one value to another.
-
- - replace
-
-> Type: *string*
-
- - tsv
-
- of [TSVTable](#tsvtable)
-
- - json
-
- of [JSONTable](#jsontable)
-
- - table
-
-> Type: *object*
-
- - lookup
-
-> Type: *string*
-
- - copy
-
-> Type: *object*
-
-Starting with a table that maps state names to the two character state code:
-
-```
-North Dakota ND
-Ohio OH
-Oklahoma OK
-Oregon OR
-Pennsylvania PA
-```
-
-The transform:
-
-```
- - tableReplace:
- input: "{{inputs.stateTable}}"
- field: sub_region_1
-```
-
-Would change the message:
-
-```
-{ "sub_region_1" : "Oregon" }
-```
-
-to
-
-```
-{ "sub_region_1" : "OR" }
-```
-
-
-***
-## RegexReplaceStep
-
-Use a regular expression based replacement to alter a field
-
- - field
-
-> Type: *string*
-
- - regex
-
-> Type: *string*
-
- - replace
-
-> Type: *string*
-
- - dst
-
-> Type: *string*
-
-
-```
-- regexReplace:
- col: "{{row.attributes.Parent}}"
- regex: "^transcript:"
- replace: ""
- dst: transcript_id
-```
-
-
-***
-## ReduceStep
-
- - field
-
-> Type: *string*
-
- - method
-
-> Type: *string*
-
- - python
-
-> Type: *string*
-
- - gpython
-
-> Type: *string*
-
- - init
-
-> Type: *object*
-
-```
- - reduce:
- field: "{{row.STCOUNTYFP}}"
- method: merge
- python: >
- def merge(x,y):
- a = x.get('zipcodes', []) + [x['ZIP']]
- b = y.get('zipcodes', []) + [y['ZIP']]
- x['zipcodes'] = a + b
- return x
-```
-
-
-***
-## FilterStep
-
- - field
-
-> Type: *string*
-
- - value
-
-> Type: *string*
-
- - match
-
-> Type: *string*
-
- - check
-
-> Type: *string*
-
-: How to check value, 'exists' or 'hasValue'
-
- - method
-
-> Type: *string*
-
- - python
-
-> Type: *string*
-
- - gpython
-
-> Type: *string*
-
- - steps
-
-> Type: *array* of [Step](#step)
-
-
-Match based filtering:
-
-```
- - filter:
- col: "{{row.tax_id}}"
- match: "9606"
- steps:
- - tableWrite:
-```
-
-Code based filters:
-
-```
-- filter:
- python: >
- def f(x):
- if 'FIPS' in x and len(x['FIPS']) > 0 and len(x['date']) > 0:
- return True
- return False
- method: f
- steps:
- - objectCreate:
- class: summary_report
-```
-
-
-***
-## DebugStep
-
-Print out messages
-
- - label
-
-> Type: *string*
-
-```
-- debug: {}
-```
-
-
-***
-## FieldProcessStep
-
-Table an array field from a message, split it into a series of
-messages and run on child transform pipeline. The `mapping` field
-allows you to take data from the parent message and map it into the
-child messages.
-
-
- - field
-
-> Type: *string*
-
- - mapping
-
-> Type: *object*
-
- - itemField
-
-> Type: *string*
-
-: If processing an array of non-dict elements, create a dict as {itemField:element}
-
-```
-- fieldProcess:
- col: portions
- mapping:
- samples: "{{row.id}}"
-```
-
-
-***
-## FieldParseStep
-
-Take a param style string and parse it into independent elements in the message
-
- - field
-
-> Type: *string*
-
- - sep
-
-> Type: *string*
-
- - assign
-
-> Type: *string*
-
-
-The messages
-
-```
-{ "attributes" : "ID=CDS:ENSP00000419345;Parent=transcript:ENST00000486405;protein_id=ENSP00000419345" }
-```
-
-After the transform:
-
-```
- - fieldParse:
- col: attributes
- sep: ";"
-```
-
-Becomes:
-```
-{
- "attributes" : "ID=CDS:ENSP00000419345;Parent=transcript:ENST00000486405;protein_id=ENSP00000419345",
- "ID" : "CDS:ENSP00000419345",
- "Parent" : "transcript:ENST00000486405",
- "protein_id" : "ENSP00000419345"
-}
-```
-
-
-***
-## AccumulateStep
-
- - field
-
-> Type: *string*
-
-: Field to use for group definition
-
- - dest
-
-> Type: *string*
-
-## AvroLoadStep
-
- - input
-
-> Type: *string*
-
-: Path of avro object file to transform
-
-## CleanStep
-
- - fields
-
-> Type: *array*
-
-: List of valid fields that will be left. All others will be removed
-
- - removeEmpty
-
-> Type: *boolean*
-
- - storeExtra
-
-> Type: *string*
-
-## CommandLineTemplate
-
- - template
-
-> Type: *string*
-
- - outputs
-
-> Type: *array*
-
- - inputs
-
-> Type: *array*
-
-## DistinctStep
-
- - value
-
-> Type: *string*
-
- - steps
-
-> Type: *array* of [Step](#step)
-
-## EdgeRule
-
- - prefixFilter
-
-> Type: *boolean*
-
- - blankFilter
-
-> Type: *boolean*
-
- - toPrefix
-
-> Type: *string*
-
- - sep
-
-> Type: *string*
-
- - idTemplate
-
-> Type: *string*
-
-## EmitStep
-
- - name
-
-> Type: *string*
-
-## GlobLoadStep
-
- - storeFilename
-
-> Type: *string*
-
- - input
-
-> Type: *string*
-
-: Path of avro object file to transform
-
- - xmlLoad
-
- of [XMLLoadStep](#xmlloadstep)
-
- - tableLoad
-
- of [TableLoadStep](#tableloadstep)
-
-: Run transform pipeline on a TSV or CSV
-
- - jsonLoad
-
- of [JSONLoadStep](#jsonloadstep)
-
-: Run a transform pipeline on a multi line json file
-
- - avroLoad
-
- of [AvroLoadStep](#avroloadstep)
-
-: Load data from avro file
-
-## GraphBuildStep
-
- - schema
-
-> Type: *string*
-
- - class
-
-> Type: *string*
-
- - idPrefix
-
-> Type: *string*
-
- - idTemplate
-
-> Type: *string*
-
- - idField
-
-> Type: *string*
-
- - filePrefix
-
-> Type: *string*
-
- - sep
-
-> Type: *string*
-
- - fields
-
-> Type: *object* of [EdgeRule](#edgerule)
-
-
- - flat
-
-> Type: *boolean*
-
-## HashStep
-
- - field
-
-> Type: *string*
-
- - value
-
-> Type: *string*
-
- - method
-
-> Type: *string*
-
-## JSONTable
-
- - input
-
-> Type: *string*
-
- - value
-
-> Type: *string*
-
- - key
-
-> Type: *string*
-
-## SQLiteStep
-
- - input
-
-> Type: *string*
-
-: Path to the SQLite file
-
- - query
-
-> Type: *string*
-
-: SQL select statement based input
-
-## SnakeFileWriter
-
- - from
-
-> Type: *string*
-
- - commands
-
-> Type: *array* of [CommandLineTemplate](#commandlinetemplate)
-
-## Step
-
- - from
-
-> Type: *string*
-
- - fieldParse
-
- of [FieldParseStep](#fieldparsestep)
-
-: fieldParse to run
-
- - fieldType
-
-> Type: *object*
-
-: Change type of a field (ie string -> integer)
-
- - objectCreate
-
- of [ObjectCreateStep](#objectcreatestep)
-
-: Create a JSON schema based object
-
- - emit
-
- of [EmitStep](#emitstep)
-
-: Write to unstructured JSON file
-
- - filter
-
- of [FilterStep](#filterstep)
-
- - clean
-
- of [CleanStep](#cleanstep)
-
- - debug
-
- of [DebugStep](#debugstep)
-
-: Print message contents to stdout
-
- - regexReplace
-
- of [RegexReplaceStep](#regexreplacestep)
-
- - project
-
- of [ProjectStep](#projectstep)
-
-: Run a projection mapping message
-
- - map
-
- of [MapStep](#mapstep)
-
-: Apply a single function to all records
-
- - reduce
-
- of [ReduceStep](#reducestep)
-
- - distinct
-
- of [DistinctStep](#distinctstep)
-
- - fieldProcess
-
- of [FieldProcessStep](#fieldprocessstep)
-
-: Take an array field from a message and run in child transform
-
- - lookup
-
- of [LookupStep](#lookupstep)
-
- - hash
-
- of [HashStep](#hashstep)
-
- - graphBuild
-
- of [GraphBuildStep](#graphbuildstep)
-
- - accumulate
-
- of [AccumulateStep](#accumulatestep)
-
-## TSVTable
-
- - input
-
-> Type: *string*
-
- - sep
-
-> Type: *string*
-
- - value
-
-> Type: *string*
-
- - key
-
-> Type: *string*
-
- - header
-
-> Type: *array*
-
-## TableWriter
-
- - from
-
-> Type: *string*
-
- - output
-
-> Type: *string*
-
-: Name of file to create
-
- - columns
-
-> Type: *array*
-
-: Columns to be written into table file
-
- - sep
-
-> Type: *string*
-
-## WriteConfig
-
- - tableWrite
-
- of [TableWriter](#tablewriter)
-
- - snakefile
-
- of [SnakeFileWriter](#snakefilewriter)
-
-## XMLLoadStep
-
- - input
-
-> Type: *string*
-
diff --git a/README.md b/README.md
index 794811a..1fae78f 100644
--- a/README.md
+++ b/README.md
@@ -31,7 +31,7 @@ More detailed descriptions can be found in out [Playbook manual](Playbook.md)
class: sifter
name: census_2010
-config:
+params:
census: ../data/census_2010_byzip.json
date: "2010-01-01"
schema: ../covid19_datadictionary/gdcdictionary/schemas/
@@ -39,7 +39,7 @@ config:
inputs:
censusData:
jsonLoad:
- input: "{{config.census}}"
+ input: "{{params.census}}"
pipelines:
transform:
@@ -54,13 +54,13 @@ pipelines:
method: f
- project:
mapping:
- submitter_id: "{{row.geo_id}}:{{inputs.date}}"
+ submitter_id: "{{row.geo_id}}:{{params.date}}"
type: census_report
- date: "{{config.date}}"
+ date: "{{params.date}}"
summary_location: "{{row.zipcode}}"
- objectValidate:
title: census_report
- schema: "{{config.schema}}"
+ schema: "{{params.schema}}"
```
diff --git a/cmd/graphplan/main.go b/cmd/graphplan/main.go
deleted file mode 100644
index 56b93f8..0000000
--- a/cmd/graphplan/main.go
+++ /dev/null
@@ -1,67 +0,0 @@
-package graphplan
-
-import (
- "path/filepath"
-
- "github.com/bmeg/sifter/graphplan"
- "github.com/bmeg/sifter/logger"
- "github.com/bmeg/sifter/playbook"
- "github.com/spf13/cobra"
-)
-
-var outScriptDir = ""
-var outDataDir = "./"
-var objectExclude = []string{}
-var verbose bool = false
-
-// Cmd is the declaration of the command line
-var Cmd = &cobra.Command{
- Use: "graph-plan",
- Short: "Scan directory to plan operations",
- Args: cobra.MinimumNArgs(1),
- RunE: func(cmd *cobra.Command, args []string) error {
-
- if verbose {
- logger.Init(true, false)
- }
-
- scriptPath, _ := filepath.Abs(args[0])
-
- /*
- if outScriptDir != "" {
- baseDir, _ = filepath.Abs(outScriptDir)
- } else if len(args) > 1 {
- return fmt.Errorf("for multiple input directories, based dir must be defined")
- }
-
- _ = baseDir
- */
- outScriptDir, _ = filepath.Abs(outScriptDir)
- outDataDir, _ = filepath.Abs(outDataDir)
-
- outDataDir, _ = filepath.Rel(outScriptDir, outDataDir)
-
- pb := playbook.Playbook{}
-
- if sifterErr := playbook.ParseFile(scriptPath, &pb); sifterErr == nil {
- if len(pb.Pipelines) > 0 || len(pb.Inputs) > 0 {
- err := graphplan.NewGraphBuild(
- &pb, outScriptDir, outDataDir, objectExclude,
- )
- if err != nil {
- logger.Error("Parse Error", "error", err)
- }
- }
- }
-
- return nil
- },
-}
-
-func init() {
- flags := Cmd.Flags()
- flags.BoolVarP(&verbose, "verbose", "v", verbose, "Verbose logging")
- flags.StringVarP(&outScriptDir, "dir", "C", outScriptDir, "Change Directory for script base")
- flags.StringVarP(&outDataDir, "out", "o", outDataDir, "Change output Directory")
- flags.StringArrayVarP(&objectExclude, "exclude", "x", objectExclude, "Object Exclude")
-}
diff --git a/cmd/inspect/main.go b/cmd/inspect/main.go
index 4375f45..bd2cbb8 100644
--- a/cmd/inspect/main.go
+++ b/cmd/inspect/main.go
@@ -53,23 +53,18 @@ var Cmd = &cobra.Command{
out := map[string]any{}
cf := map[string]string{}
- for _, f := range pb.GetConfigFields() {
+ for _, f := range pb.GetRequiredParams() {
cf[f.Name] = f.Name //f.Type
}
out["configFields"] = cf
- ins := pb.GetConfigFields()
+ ins := pb.GetRequiredParams()
out["config"] = ins
outputs := map[string]any{}
- sinks, _ := pb.GetOutputs(task)
- for k, v := range sinks {
- outputs[k] = v
- }
-
- emitters, _ := pb.GetEmitters(task)
- for k, v := range emitters {
+ pouts, _ := pb.GetOutputs(task)
+ for k, v := range pouts {
outputs[k] = v
}
diff --git a/cmd/root.go b/cmd/root.go
index 2e889eb..d23161c 100644
--- a/cmd/root.go
+++ b/cmd/root.go
@@ -3,10 +3,8 @@ package cmd
import (
"os"
- "github.com/bmeg/sifter/cmd/graphplan"
"github.com/bmeg/sifter/cmd/inspect"
"github.com/bmeg/sifter/cmd/run"
- "github.com/bmeg/sifter/cmd/scan"
"github.com/spf13/cobra"
)
@@ -20,8 +18,6 @@ var RootCmd = &cobra.Command{
func init() {
RootCmd.AddCommand(run.Cmd)
RootCmd.AddCommand(inspect.Cmd)
- RootCmd.AddCommand(graphplan.Cmd)
- RootCmd.AddCommand(scan.Cmd)
}
var genBashCompletionCmd = &cobra.Command{
diff --git a/cmd/run/main.go b/cmd/run/main.go
index f439127..8b3f016 100644
--- a/cmd/run/main.go
+++ b/cmd/run/main.go
@@ -11,9 +11,9 @@ import (
)
var outDir string = ""
-var inputFile string = ""
+var paramsFile string = ""
var verbose bool = false
-var cmdInputs map[string]string
+var cmdParams map[string]string
// Cmd is the declaration of the command line
var Cmd = &cobra.Command{
@@ -26,15 +26,15 @@ var Cmd = &cobra.Command{
logger.Init(true, false)
}
- inputs := map[string]string{}
- if inputFile != "" {
- if err := playbook.ParseStringFile(inputFile, &inputs); err != nil {
+ params := map[string]string{}
+ if paramsFile != "" {
+			if err := playbook.ParseStringFile(paramsFile, &params); err != nil {
logger.Error("%s", err)
return err
}
}
- for k, v := range cmdInputs {
- inputs[k] = v
+ for k, v := range cmdParams {
+ params[k] = v
logger.Info("Input Params", k, v)
}
for _, playFile := range args {
@@ -46,11 +46,11 @@ var Cmd = &cobra.Command{
}
pb := playbook.Playbook{}
playbook.ParseBytes(yaml, "./playbook.yaml", &pb)
- if err := Execute(pb, "./", "./", outDir, inputs); err != nil {
+ if err := Execute(pb, "./", "./", outDir, params); err != nil {
return err
}
} else {
- if err := ExecuteFile(playFile, "./", outDir, inputs); err != nil {
+ if err := ExecuteFile(playFile, "./", outDir, params); err != nil {
return err
}
}
@@ -63,6 +63,6 @@ var Cmd = &cobra.Command{
func init() {
flags := Cmd.Flags()
flags.BoolVarP(&verbose, "verbose", "v", verbose, "Verbose logging")
- flags.StringToStringVarP(&cmdInputs, "config", "c", cmdInputs, "Config variable")
- flags.StringVarP(&inputFile, "configFile", "f", inputFile, "Config file")
+ flags.StringToStringVarP(&cmdParams, "param", "p", cmdParams, "Parameter variable")
+	flags.StringVarP(&paramsFile, "params-file", "f", paramsFile, "Parameter file")
}
diff --git a/cmd/run/run.go b/cmd/run/run.go
index a5607a6..42651b2 100644
--- a/cmd/run/run.go
+++ b/cmd/run/run.go
@@ -22,7 +22,7 @@ func ExecuteFile(playFile string, workDir string, outDir string, inputs map[stri
return Execute(pb, baseDir, workDir, outDir, inputs)
}
-func Execute(pb playbook.Playbook, baseDir string, workDir string, outDir string, inputs map[string]string) error {
+func Execute(pb playbook.Playbook, baseDir string, workDir string, outDir string, params map[string]string) error {
if outDir == "" {
outDir = pb.GetDefaultOutDir()
@@ -32,7 +32,7 @@ func Execute(pb playbook.Playbook, baseDir string, workDir string, outDir string
os.MkdirAll(outDir, 0777)
}
- nInputs, err := pb.PrepConfig(inputs, workDir)
+ nInputs, err := pb.PrepConfig(params, workDir)
if err != nil {
return err
}
diff --git a/cmd/scan/main.go b/cmd/scan/main.go
deleted file mode 100644
index f1fbe48..0000000
--- a/cmd/scan/main.go
+++ /dev/null
@@ -1,217 +0,0 @@
-package scan
-
-import (
- "encoding/json"
- "fmt"
- "io/fs"
- "os"
- "path/filepath"
- "strings"
-
- "github.com/bmeg/sifter/playbook"
- "github.com/bmeg/sifter/task"
- "github.com/spf13/cobra"
-)
-
-var jsonOut = false
-var objectsOnly = false
-var baseDir = ""
-
-type Entry struct {
- ObjectType string `json:"objectType"`
- SifterFile string `json:"sifterFile"`
- Outfile string `json:"outFile"`
-}
-
-var ObjectCommand = &cobra.Command{
- Use: "objects",
- Short: "Scan for outputs",
- Args: cobra.MinimumNArgs(1),
- RunE: func(cmd *cobra.Command, args []string) error {
-
- scanDir := args[0]
-
- outputs := []Entry{}
-
- PathWalker(scanDir, func(pb *playbook.Playbook) {
- for pname, p := range pb.Pipelines {
- emitName := ""
- for _, s := range p {
- if s.Emit != nil {
- emitName = s.Emit.Name
- }
- }
- if emitName != "" {
- for _, s := range p {
- outdir := pb.GetDefaultOutDir()
- outname := fmt.Sprintf("%s.%s.%s.json.gz", pb.Name, pname, emitName)
- outpath := filepath.Join(outdir, outname)
- o := Entry{SifterFile: pb.GetPath(), Outfile: outpath}
- if s.ObjectValidate != nil {
- //outpath, _ = filepath.Rel(baseDir, outpath)
- //fmt.Printf("%s\t%s\n", s.ObjectValidate.Title, outpath)
- o.ObjectType = s.ObjectValidate.Title
- }
- if objectsOnly {
- if o.ObjectType != "" {
- outputs = append(outputs, o)
- }
- } else {
- outputs = append(outputs, o)
- }
- }
- }
- }
- })
-
- if jsonOut {
- j := json.NewEncoder(os.Stdout)
- j.SetIndent("", " ")
- j.Encode(outputs)
- } else {
- for _, i := range outputs {
- fmt.Printf("%s\t%s\n", i.ObjectType, i.Outfile)
- }
- }
-
- return nil
-
- },
-}
-
-type ScriptEntry struct {
- Name string `json:"name"`
- Path string `json:"path"`
- Inputs []string `json:"inputs"`
- Outputs []string `json:"outputs"`
-}
-
-func removeDuplicates(s []string) []string {
- t := map[string]bool{}
-
- for _, i := range s {
- t[i] = true
- }
- out := []string{}
- for k := range t {
- out = append(out, k)
- }
- return out
-}
-
-func relPathArray(basedir string, paths []string) []string {
- out := []string{}
- for _, i := range paths {
- if o, err := filepath.Rel(baseDir, i); err == nil {
- out = append(out, o)
- }
- }
- return out
-}
-
-var ScriptCommand = &cobra.Command{
- Use: "scripts",
- Short: "Scan for scripts",
- Args: cobra.MinimumNArgs(1),
- RunE: func(cmd *cobra.Command, args []string) error {
-
- scanDir := args[0]
-
- scripts := []ScriptEntry{}
-
- if baseDir == "" {
- baseDir, _ = os.Getwd()
- }
- baseDir, _ = filepath.Abs(baseDir)
- //fmt.Printf("basedir: %s\n", baseDir)
-
- userInputs := map[string]string{}
-
- PathWalker(scanDir, func(pb *playbook.Playbook) {
- path := pb.GetPath()
- scriptDir := filepath.Dir(path)
-
- config, _ := pb.PrepConfig(userInputs, baseDir)
-
- task := task.NewTask(pb.Name, scriptDir, baseDir, pb.GetDefaultOutDir(), config)
- sourcePath, _ := filepath.Abs(path)
-
- cmdPath, _ := filepath.Rel(baseDir, sourcePath)
-
- inputs := []string{}
- outputs := []string{}
- for _, p := range pb.GetConfigFields() {
- if p.IsDir() || p.IsFile() {
- inputs = append(inputs, config[p.Name])
- }
- }
- //inputs = append(inputs, sourcePath)
-
- sinks, _ := pb.GetOutputs(task)
- for _, v := range sinks {
- outputs = append(outputs, v...)
- }
-
- emitters, _ := pb.GetEmitters(task)
- for _, v := range emitters {
- outputs = append(outputs, v)
- }
-
- //for _, e := range pb.Inputs {
- //}
-
- s := ScriptEntry{
- Path: cmdPath,
- Name: pb.Name,
- Outputs: relPathArray(baseDir, removeDuplicates(outputs)),
- Inputs: relPathArray(baseDir, removeDuplicates(inputs)),
- }
- scripts = append(scripts, s)
- })
-
- if jsonOut {
- e := json.NewEncoder(os.Stdout)
- e.SetIndent("", " ")
- e.Encode(scripts)
- } else {
- for _, i := range scripts {
- fmt.Printf("%s\n", i)
- }
- }
-
- return nil
- },
-}
-
-// Cmd is the declaration of the command line
-var Cmd = &cobra.Command{
- Use: "scan",
- Short: "Scan for scripts or objects",
-}
-
-func init() {
- Cmd.AddCommand(ObjectCommand)
- Cmd.AddCommand(ScriptCommand)
-
- objFlags := ObjectCommand.Flags()
- objFlags.BoolVarP(&objectsOnly, "objects", "s", objectsOnly, "Objects Only")
- objFlags.BoolVarP(&jsonOut, "json", "j", jsonOut, "Output JSON")
-
- scriptFlags := ScriptCommand.Flags()
- scriptFlags.StringVarP(&baseDir, "base", "b", baseDir, "Base Dir")
- scriptFlags.BoolVarP(&jsonOut, "json", "j", jsonOut, "Output JSON")
-
-}
-
-func PathWalker(baseDir string, userFunc func(*playbook.Playbook)) {
- filepath.Walk(baseDir,
- func(path string, info fs.FileInfo, err error) error {
- if strings.HasSuffix(path, ".yaml") {
- pb := playbook.Playbook{}
- if parseErr := playbook.ParseFile(path, &pb); parseErr == nil {
- userFunc(&pb)
- }
- }
- return nil
- })
-}
diff --git a/config/config.go b/config/config.go
index f1b3ac2..5f3fc48 100644
--- a/config/config.go
+++ b/config/config.go
@@ -2,33 +2,43 @@ package config
import "strings"
-type Config map[string]*string
+type Params map[string]Param
-type Type string
-
-const (
- Unknown Type = ""
- File Type = "File"
- Dir Type = "Dir"
-)
+type Param struct {
+ Type string `json:"type"`
+ Default any `json:"default,omitempty"`
+}
-type Variable struct {
+type ParamRequest struct {
+ Type string `json:"type"`
Name string `json:"name"`
- Type Type
}
type Configurable interface {
- GetConfigFields() []Variable
+ GetRequiredParams() []ParamRequest
}
-func (in *Variable) IsFile() bool {
- return in.Type == File
+func (in *Param) IsFile() bool {
+ return strings.ToLower(in.Type) == "file"
}
-func (in *Variable) IsDir() bool {
- return in.Type == Dir
+func (in *Param) IsDir() bool {
+ t := strings.ToLower(in.Type)
+ return t == "path" || t == "dir"
}
func TrimPrefix(s string) string {
- return strings.TrimPrefix(s, "config.")
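+	// Strip the "params." prefix from template references (e.g. "params.zipcode" -> "zipcode").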
+ if strings.HasPrefix(s, "params.") {
+ return strings.TrimPrefix(s, "params.")
+ }
+ return s
+}
+
+func (in *ParamRequest) IsFile() bool {
+ return strings.ToLower(in.Type) == "file"
+}
+
+func (in *ParamRequest) IsDir() bool {
+ t := strings.ToLower(in.Type)
+ return t == "path" || t == "dir"
}
diff --git a/docs/css/darcula.css b/docs/assets/css/darcula.css
similarity index 100%
rename from docs/css/darcula.css
rename to docs/assets/css/darcula.css
diff --git a/docs/css/dark.css b/docs/assets/css/dark.css
similarity index 100%
rename from docs/css/dark.css
rename to docs/assets/css/dark.css
diff --git a/docs/css/flexboxgrid.css b/docs/assets/css/flexboxgrid.css
similarity index 100%
rename from docs/css/flexboxgrid.css
rename to docs/assets/css/flexboxgrid.css
diff --git a/docs/css/funnel.css b/docs/assets/css/funnel.css
similarity index 100%
rename from docs/css/funnel.css
rename to docs/assets/css/funnel.css
diff --git a/docs/css/highlight.min.css b/docs/assets/css/highlight.min.css
similarity index 100%
rename from docs/css/highlight.min.css
rename to docs/assets/css/highlight.min.css
diff --git a/docs/css/html5reset.css b/docs/assets/css/html5reset.css
similarity index 100%
rename from docs/css/html5reset.css
rename to docs/assets/css/html5reset.css
diff --git a/docs/css/hybrid.css b/docs/assets/css/hybrid.css
similarity index 100%
rename from docs/css/hybrid.css
rename to docs/assets/css/hybrid.css
diff --git a/docs/css/monokai-sublime.css b/docs/assets/css/monokai-sublime.css
similarity index 100%
rename from docs/css/monokai-sublime.css
rename to docs/assets/css/monokai-sublime.css
diff --git a/docs/css/poole.css b/docs/assets/css/poole.css
similarity index 100%
rename from docs/css/poole.css
rename to docs/assets/css/poole.css
diff --git a/docs/css/syntax.css b/docs/assets/css/syntax.css
similarity index 100%
rename from docs/css/syntax.css
rename to docs/assets/css/syntax.css
diff --git a/docs/css/theme.css b/docs/assets/css/theme.css
similarity index 100%
rename from docs/css/theme.css
rename to docs/assets/css/theme.css
diff --git a/docs/sifter_example.png b/docs/assets/sifter_example.png
similarity index 100%
rename from docs/sifter_example.png
rename to docs/assets/sifter_example.png
diff --git a/docs/categories/index.xml b/docs/categories/index.xml
deleted file mode 100644
index 4b26c88..0000000
--- a/docs/categories/index.xml
+++ /dev/null
@@ -1,11 +0,0 @@
-
-
-
- Categories on Sifter
- https://bmeg.github.io/sifter/categories/
- Recent content in Categories on Sifter
- Hugo -- gohugo.io
- en-us
-
-
-
diff --git a/docs/docs/.nav.yml b/docs/docs/.nav.yml
new file mode 100644
index 0000000..1d7fa65
--- /dev/null
+++ b/docs/docs/.nav.yml
@@ -0,0 +1,11 @@
+
+title: Sifter Documentation
+
+nav:
+ - index.md
+ - example.md
+ - schema.md
+ - config.md
+ - inputs
+ - transforms
+ - outputs
\ No newline at end of file
diff --git a/docs/docs/config.md b/docs/docs/config.md
new file mode 100644
index 0000000..391e21c
--- /dev/null
+++ b/docs/docs/config.md
@@ -0,0 +1,34 @@
+---
+title: Parameters
+---
+
+## Parameter Variables
+
+Playbooks can be parameterized. Parameters are defined in the `params` section of the playbook YAML file.
+
+### Configuration Syntax
+```yaml
+params:
+ variableName:
+ type: File # one of: File, Path, String, Number
+ default: "path/to/default"
+```
+
+### Supported Types
+- `File`: Represents a file path
+- `Dir` (alias `Path`): Represents a directory path
+
+### Example Configuration
+```yaml
+params:
+ inputDir:
+ type: Dir
+ default: "/data/input"
+ outputDir:
+ type: Dir
+ default: "/data/output"
+ schemaFile:
+ type: File
+ default: "/config/schema.json"
+```
+
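+### Referencing and Overriding Parameters
+
+A parameter declared under `params` is referenced elsewhere in the playbook through the
+mustache template engine as `{{params.<name>}}`. Defaults can be overridden at run time
+with `sifter run playbook.yaml -p inputDir=/other/input`, or with a key/value YAML file
+passed via `--params-file`. A minimal sketch (the input name `records` is illustrative):
+
+```yaml
+params:
+  inputDir:
+    type: Dir
+    default: "/data/input"
+
+inputs:
+  records:
+    table:
+      path: "{{params.inputDir}}/records.tsv"
+      sep: "\t"
+```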
diff --git a/docs/docs/developers/source_mapping.md b/docs/docs/developers/source_mapping.md
new file mode 100644
index 0000000..335e52f
--- /dev/null
+++ b/docs/docs/developers/source_mapping.md
@@ -0,0 +1,48 @@
+# SIFTER Project Documentation to Source Code Mapping
+
+## Inputs
+
+| Documentation File | Source Code File |
+|-------------------|------------------|
+| docs/docs/inputs/avro.md | extractors/avro_load.go |
+| docs/docs/inputs/embedded.md | extractors/embedded.go |
+| docs/docs/inputs/glob.md | extractors/glob_load.go |
+| docs/docs/inputs/json.md | extractors/json_load.go |
+| docs/docs/inputs/plugin.md | extractors/plugin_load.go |
+| docs/docs/inputs/sqldump.md | extractors/sqldump_step.go |
+| docs/docs/inputs/sqlite.md | extractors/sqlite_load.go |
+| docs/docs/inputs/table.md | extractors/tabular_load.go |
+| docs/docs/inputs/xml.md | extractors/xml_step.go |
+
+## Transforms
+
+| Documentation File | Source Code File |
+|-------------------|------------------|
+| docs/docs/transforms/accumulate.md | transform/accumulate.go |
+| docs/docs/transforms/clean.md | transform/clean.go |
+| docs/docs/transforms/debug.md | transform/debug.go |
+| docs/docs/transforms/distinct.md | transform/distinct.go |
+| docs/docs/transforms/fieldParse.md | transform/field_parse.go |
+| docs/docs/transforms/fieldProcess.md | transform/field_process.go |
+| docs/docs/transforms/fieldType.md | transform/field_type.go |
+| docs/docs/transforms/filter.md | transform/filter.go |
+| docs/docs/transforms/flatmap.md | transform/flat_map.go |
+| docs/docs/transforms/from.md | transform/from.go |
+| docs/docs/transforms/hash.md | transform/hash.go |
+| docs/docs/transforms/lookup.md | transform/lookup.go |
+| docs/docs/transforms/map.md | transform/mapping.go |
+| docs/docs/transforms/objectValidate.md | transform/object_validate.go |
+| docs/docs/transforms/plugin.md | transform/plugin.go |
+| docs/docs/transforms/project.md | transform/project.go |
+| docs/docs/transforms/reduce.md | transform/reduce.go |
+| docs/docs/transforms/regexReplace.md | transform/regex.go |
+| docs/docs/transforms/split.md | transform/split.go |
+| docs/docs/transforms/uuid.md | transform/uuid.go |
+
+## Outputs
+
+| Documentation File | Source Code File |
+|-------------------|------------------|
+| docs/docs/outputs/graphBuild.md | playbook/output_graph.go |
+| docs/docs/outputs/json.md | playbook/output_json.go |
+| docs/docs/outputs/tableWrite.md | playbook/output_table.go |
\ No newline at end of file
diff --git a/website/content/docs/example.md b/docs/docs/example.md
similarity index 82%
rename from website/content/docs/example.md
rename to docs/docs/example.md
index d1d29b0..d506f6b 100644
--- a/website/content/docs/example.md
+++ b/docs/docs/example.md
@@ -1,11 +1,3 @@
----
-title: Example
-menu:
- main:
- identifier: example
- weight: 3
----
-
# Example Pipeline
Our first task will be to convert a ZIP code TSV into a set of county level
@@ -13,7 +5,7 @@ entries.
The input file looks like:
-```
+```csv
ZIP,COUNTYNAME,STATE,STCOUNTYFP,CLASSFP
36003,Autauga County,AL,01001,H1
36006,Autauga County,AL,01001,H1
@@ -27,44 +19,50 @@ ZIP,COUNTYNAME,STATE,STCOUNTYFP,CLASSFP
First is the header of the pipeline. This declares the
 unique name of the pipeline and its output directory.
-```
+```yaml
name: zipcode_map
outdir: ./
docs: Converts zipcode TSV into graph elements
```
-Next the configuration is declared. In this case the only input is the zipcode TSV.
-There is a default value, so the pipeline can be invoked without passing in
+Next the parameters are declared. In this case the key parameter is the path to the
+zipcode TSV. There is a default value, so the pipeline can be invoked without passing in
any parameters. However, to apply this pipeline to a new input file, the
-input parameter `zipcode` could be used to define the source file.
+input parameter `zipcode` could be used to define the source file.
+`Path` and `File` parameters may be given relative to the directory that contains the playbook file.
-```
-config:
- schema: ../covid19_datadictionary/gdcdictionary/schemas/
- zipcode: ../data/ZIP-COUNTY-FIPS_2017-06.csv
+```yaml
+params:
+ schema:
+ type: path
+ default: ../covid19_datadictionary/gdcdictionary/schemas/
+ zipcode:
+ type: path
+ default: ../data/ZIP-COUNTY-FIPS_2017-06.csv
```
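+
+To run the pipeline against a different file, the default can be overridden at run
+time (a sketch; the playbook file name is illustrative):
+`sifter run zipcode_map.yaml -p zipcode=../data/other_zip_county_file.csv`.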
The `inputs` section declares data input sources. In this pipeline, there is
 only one input, which runs the table loader.
-```
+```yaml
inputs:
- tableLoad:
- input: "{{config.zipcode}}"
- sep: ","
+ zipcode:
+ table:
+ path: "{{params.zipcode}}"
+ sep: ","
```
 The table loader operates on the input file that was originally passed in using the
 `inputs` stanza. SIFTER string parsing is based on the mustache template system.
-To access the string passed in the template is `{{config.zipcode}}`.
+To access the string that was passed in, the template is `{{params.zipcode}}`.
 The separator in the input file is a `,`, so that is also passed in as a
parameter to the extractor.
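+With the defaults above, `{{params.zipcode}}` resolves to
+`../data/ZIP-COUNTY-FIPS_2017-06.csv`, which is expanded to a full path relative to the
+directory containing the playbook file.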
-The `tableLoad` extractor opens up the TSV and generates a one message for
+The `table` extractor opens the TSV and generates one message for
every row in the file. It uses the header of the file to map the column values
into a dictionary. The first row would produce the message:
-```
+```json
{
"ZIP" : "36003",
"COUNTYNAME" : "Autauga County",
@@ -85,14 +83,14 @@ and produces a single output message.
The two messages:
-```
+```json
{ "ZIP" : "36003", "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
{ "ZIP" : "36006", "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
```
Would be merged into the message:
-```
+```json
{ "ZIP" : ["36003", "36006"], "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
```
@@ -100,7 +98,7 @@ The `reduce` transform step uses a block of python code to describe the function
The `method` field names the function, in this case `merge` that will be used
as the reduce function.
-```
+```yaml
zipReduce:
- from: zipcode
- reduce:
@@ -122,7 +120,7 @@ to project data into new files in the message. The template engine has the curre
message data in the value `row`. So the value
`FIPS:{{row.STCOUNTYFP}}` is mapped into the field `id`.
-```
+```yaml
- project:
mapping:
id: "FIPS:{{row.STCOUNTYFP}}"
@@ -136,7 +134,7 @@ message data in the value `row`. So the value
Using this projection, the message:
-```
+```json
{
"ZIP" : ["36003", "36006"],
"COUNTYNAME" : "Autauga County",
@@ -148,7 +146,7 @@ Using this projection, the message:
would become
-```
+```json
{
"id" : "FIPS:01001",
"province_state" : "AL",
@@ -165,13 +163,14 @@ would become
}
```
-Now that the data has been remapped, we pass the data into the 'objectCreate'
-transformation, which will read in the schema for `summary_location`, check the
+Now that the data has been remapped, we pass it into the `objectValidate`
+step, which will open the schema directory, find the class titled `summary_location`, check the
message to make sure it matches and then output it.
-```
- - objectCreate:
- class: summary_location
+```yaml
+ - objectValidate:
+ title: summary_location
+      schema: "{{params.schema}}"
```
@@ -182,12 +181,12 @@ To create an output table, with two columns connecting
 code, used by the census office. A single FIPS code may contain many ZIP codes,
and we can use this table later for mapping ids when loading the data into a database.
-```
+```yaml
outputs:
zip2fips:
tableWrite:
- from:
- output: zip2fips
+ from: zipReduce
+ path: zip2fips.tsv
columns:
- ZIP
- STCOUNTYFP
diff --git a/docs/docs/example/index.html b/docs/docs/example/index.html
deleted file mode 100644
index af38dec..0000000
--- a/docs/docs/example/index.html
+++ /dev/null
@@ -1,507 +0,0 @@
-
-
-
-
-
-
-
-
-
-
- Example · Sifter
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Example Pipeline
-
Our first task will be to convert a ZIP code TSV into a set of county level
-entries.
-
The input file looks like:
-
ZIP,COUNTYNAME,STATE,STCOUNTYFP,CLASSFP
-36003,Autauga County,AL,01001,H1
-36006,Autauga County,AL,01001,H1
-36067,Autauga County,AL,01001,H1
-36066,Autauga County,AL,01001,H1
-36703,Autauga County,AL,01001,H1
-36701,Autauga County,AL,01001,H1
-36091,Autauga County,AL,01001,H1
-First is the header of the pipeline. This declares the
-unique name of the pipeline and it’s output directory.
-
name: zipcode_map
-outdir: ./
-docs: Converts zipcode TSV into graph elements
-Next the configuration is declared. In this case the only input is the zipcode TSV.
-There is a default value, so the pipeline can be invoked without passing in
-any parameters. However, to apply this pipeline to a new input file, the
-input parameter zipcode could be used to define the source file.
-
config:
- schema: ../covid19_datadictionary/gdcdictionary/schemas/
- zipcode: ../data/ZIP-COUNTY-FIPS_2017-06.csv
-The inputs section declares data input sources. In this pipeline, there is
-only one input, which is to run the table loader.
-
inputs:
- tableLoad:
- input: "{{config.zipcode}}"
- sep: ","
-Tableload operaters of the input file that was originally passed in using the
-inputs stanza. SIFTER string parsing is based on mustache template system.
-To access the string passed in the template is {{config.zipcode}}.
-The seperator in the file input file is a , so that is also passed in as a
-parameter to the extractor.
-
The tableLoad extractor opens up the TSV and generates a one message for
-every row in the file. It uses the header of the file to map the column values
-into a dictionary. The first row would produce the message:
-
{
- "ZIP" : "36003",
- "COUNTYNAME" : "Autauga County",
- "STATE" : "AL",
- "STCOUNTYFP" : "01001",
- "CLASSFP" : "H1"
-}
-The stream of messages are then passed into the steps listed in the transform
-section of the tableLoad extractor.
-
For the current tranform, we want to produce a single entry per STCOUNTYFP,
-however, the file has a line per ZIP. We need to run a reduce transform,
-that collects rows togeather using a field key, which in this case is "{{row.STCOUNTYFP}}",
-and then runs a function merge that takes two messages, merges them togeather
-and produces a single output message.
-
The two messages:
-
{ "ZIP" : "36003", "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
-{ "ZIP" : "36006", "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
-Would be merged into the message:
-
{ "ZIP" : ["36003", "36006"], "COUNTYNAME" : "Autauga County", "STATE" : "AL", "STCOUNTYFP" : "01001", "CLASSFP" : "H1"}
-The reduce transform step uses a block of python code to describe the function.
-The method field names the function, in this case merge that will be used
-as the reduce function.
-
zipReduce:
- - from: zipcode
- - reduce:
- field: STCOUNTYFP
- method: merge
- python: >
- def merge(x,y):
- a = x.get('zipcodes', []) + [x['ZIP']]
- b = y.get('zipcodes', []) + [y['ZIP']]
- x['zipcodes'] = a + b
- return x
-The original messages produced by the loader have all of the information required
-by the summary_location object type as described by the JSON schema that was linked
-to in the header stanza. However, the data is all under the wrong field names.
-To remap the data, we use a project tranformation that uses the template engine
-to project data into new files in the message. The template engine has the current
-message data in the value row. So the value
-FIPS:{{row.STCOUNTYFP}} is mapped into the field id.
-
- project:
- mapping:
- id: "FIPS:{{row.STCOUNTYFP}}"
- province_state: "{{row.STATE}}"
- summary_locations: "{{row.STCOUNTYFP}}"
- county: "{{row.COUNTYNAME}}"
- submitter_id: "{{row.STCOUNTYFP}}"
- type: summary_location
- projects: []
-Using this projection, the message:
-
{
- "ZIP" : ["36003", "36006"],
- "COUNTYNAME" : "Autauga County",
- "STATE" : "AL",
- "STCOUNTYFP" : "01001",
- "CLASSFP" : "H1"
-}
-would become
-
{
- "id" : "FIPS:01001",
- "province_state" : "AL",
- "summary_locations" : "01001",
- "county" : "Autauga County",
- "submitter_id" : "01001",
- "type" : "summary_location"
- "projects" : [],
- "ZIP" : ["36003", "36006"],
- "COUNTYNAME" : "Autauga County",
- "STATE" : "AL",
- "STCOUNTYFP" : "01001",
- "CLASSFP" : "H1"
-}
-Now that the data has been remapped, we pass the data into the ‘objectCreate’
-transformation, which will read in the schema for summary_location, check the
-message to make sure it matches and then output it.
-
- objectCreate:
- class: summary_location
-Outputs
-
To create an output table, with two columns connecting
-ZIP values to STCOUNTYFP values. The STCOUNTYFP is a county level FIPS
-code, used by the census office. A single FIPS code my contain many ZIP codes,
-and we can use this table later for mapping ids when loading the data into a database.
-
outputs:
- zip2fips:
- tableWrite:
- from:
- output: zip2fips
- columns:
- - ZIP
- - STCOUNTYFP
-
-
-
-
-
-
diff --git a/docs/docs/index.html b/docs/docs/index.html
deleted file mode 100644
index 6e55370..0000000
--- a/docs/docs/index.html
+++ /dev/null
@@ -1,528 +0,0 @@
-
-
-
-
-
-
-
-
-
-
- Overview · Sifter
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Sifter pipelines
-
Sifter pipelines process steams of nested JSON messages. Sifter comes with a number of
-file extractors that operate as inputs to these pipelines. The pipeline engine
-connects togeather arrays of transform steps into directed acylic graph that is processed
-in parallel.
-
Example Message:
-
{
- "firstName" : "bob" ,
- "age" : "25"
- "friends" : [ "Max" , "Alex" ]
- }
- Once a stream of messages are produced, that can be run through a transform
-pipeline. A transform pipeline is an array of transform steps, each transform
-step can represent a different way to alter the data. The array of transforms link
-togeather into a pipe that makes multiple alterations to messages as they are
-passed along. There are a number of different transform steps types that can
-be done in a transform pipeline these include:
-
-Projection: creating new fields using a templating engine driven by existing values
-Filtering: removing messages
-Programmatic transformation: alter messages using an embedded python interpreter
-Table based field translation
-Outputing the message as a JSON Schema checked object
-
-
Script structure
-
Pipeline File
-
An sifter pipeline file is in YAML format and describes an entire processing pipelines.
-If is composed of the following sections: config, inputs, pipelines, outputs. In addition,
-for tracking, the file will also include name and class entries.
-
- class : sifter
-name : <script name>
-outdir : <where output files should go, relative to this file>
-
- config :
- <config key> : <config value>
- <config key> : <config value>
- # values that are referenced in pipeline parameters for
- # files will be treated like file paths and be
- # translated to full paths
-
- inputs :
- <input name> :
- <input driver> :
- <driver config>
-
- pipelines :
- <pipeline name> :
- # all pipelines must start with a from step
- - from : <name of input or pipeline>
- - <transform name> :
- <transform parameters>
-
- outputs :
- <output name> :
- <output driver> :
- <driver config>
-
-
Each sifter file starts with a set of field to let the software know this is a sifter script, and not some random YAML file. There is also a name field for the script. This name will be used for output file creation and logging. Finally, there is an outdir that defines the directory where all output files will be placed. All paths are relative to the script file, so the outdir set to my-results will create the directory my-results in the same directory as the script file, regardless of where the sifter command is invoked.
-
class : sifter
-name : <name of script>
-outdir : <where files should be stored>
-Config and templating
-
The config section is a set of defined keys that are used throughout the rest of the script.
-
Example config:
-
config:
- sqlite: ../../source/chembl/chembl_33/chembl_33_sqlite/chembl_33.db
- uniprot2ensembl: ../../tables/uniprot2ensembl.tsv
- schema: ../../schema/
-Various fields in the script file will be be parsed using a Mustache template engine. For example, to access the various values within the config block, the template {{config.sqlite}}.
-
-
The input block defines the various data extractors that will be used to open resources and create streams of JSON messages for processing. The possible input engines include:
-
-AVRO
-JSON
-XML
-SQL-dump
-SQLite
-TSV/CSV
-GLOB
-
-
For any other file types, there is also a plugin option to allow the user to call their own code for opening files.
-
Pipeline
-
The pipelines defined a set of named processing pipelines that can be used to transform data. Each pipeline starts with a from statement that defines where data comes from. It then defines a linear set of transforms that are chained togeather to do processing. Pipelines may used emit steps to output messages to disk. The possible data transform steps include:
-
-Accumulate
-Clean
-Distinct
-DropNull
-Field Parse
-Field Process
-Field Type
-Filter
-FlatMap
-GraphBuild
-Hash
-JSON Parse
-Lookup
-Value Mapping
-Object Validation
-Project
-Reduce
-Regex
-Split
-UUID Generation
-
-
Additionally, users are able to define their one transform step types using the plugin step.
-
Example script
-
class : sifter
-
- name : go
-outdir : ../../output/go/
-
- config :
- oboFile : ../../source/go/go.obo
- schema : ../../schema
-
- inputs :
- oboData :
- plugin :
- commandLine : ../../util/obo_reader.py {{config.oboFile}}
-
- pipelines :
- transform :
- - from : oboData
- - project :
- mapping :
- submitter_id : "{{row.id[0]}}"
- case_id : "{{row.id[0]}}"
- id : "{{row.id[0]}}"
- go_id : "{{row.id[0]}}"
- project_id : "gene_onotology"
- namespace : "{{row.namespace[0]}}"
- name : "{{row.name[0]}}"
- - map :
- method : fix
- gpython : |
- def fix(row) :
- row['definition'] = row['def'][0].strip('"')
- if 'xref' not in row :
- row['xref'] = []
- if 'synonym' not in row :
- row['synonym'] = []
- return row
- - objectValidate :
- title : GeneOntologyTerm
- schema : "{{config.schema}}"
- - emit :
- name : term
-
-
-
-
-
-
diff --git a/website/content/docs.md b/docs/docs/index.md
similarity index 81%
rename from website/content/docs.md
rename to docs/docs/index.md
index d45d044..2601927 100644
--- a/website/content/docs.md
+++ b/docs/docs/index.md
@@ -1,17 +1,13 @@
-
---
-title: Overview
-menu:
- main:
- identifier: overview
- weight: 1
+title: Sifter
---
-# Sifter pipelines
-Sifter pipelines process steams of nested JSON messages. Sifter comes with a number of
+# Sifter
+
+Sifter is a stream-based processing engine. It comes with a number of
 file extractors that operate as inputs to its pipelines. The pipeline engine
-connects togeather arrays of transform steps into directed acylic graph that is processed
+connects several processing steps together into a directed acyclic graph that is processed
 in parallel.
Example Message:
@@ -43,7 +39,7 @@ be done in a transform pipeline these include:
# Pipeline File
 A sifter pipeline file is in YAML format and describes an entire processing pipeline.
-If is composed of the following sections: `config`, `inputs`, `pipelines`, `outputs`. In addition,
+It is composed of the following sections: `params`, `inputs`, `pipelines`, `outputs`. In addition,
for tracking, the file will also include `name` and `class` entries.
```yaml
@@ -52,9 +48,12 @@ class: sifter
name:
-avroLoad
-Load an AvroFile
-Parameters
-
-| name  | Description        |
-|-------|--------------------|
-| input | Path to input file |