136 changes: 136 additions & 0 deletions docs/rdf_file_loader_agent.md
@@ -0,0 +1,136 @@
# RDF File Loader Agent

## Overview

The RDF File Loader agent automatically loads RDF files into the Whyis knowledge graph as nanopublications. It monitors resources typed as `whyis:RDFFile` and loads their content.

## Features

- **Multiple Source Support:**
  - Local files from the file depot (via `whyis:hasFileID`)
  - Remote HTTP/HTTPS URLs
  - S3 URIs (requires boto3 to be installed)

- **Format Detection:**
  - Automatic format detection from file extensions and content types
  - Supports: Turtle (.ttl), RDF/XML (.rdf, .owl), JSON-LD (.jsonld), N-Triples (.nt), N3 (.n3), TriG (.trig), N-Quads (.nq)

- **Provenance Tracking:**
  - Resources are marked with `whyis:RDFFile` type before processing
  - After loading, marked as `whyis:LoadedRDFFile`
  - Activities are tracked as `whyis:RDFFileLoadingActivity`
  - Proper nanopublication structure with provenance
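
The format detection described above can be sketched as a pair of lookup tables. This is a minimal illustration, not the agent's actual code: the function name `guess_format`, the table names, and the choice to check the content type before the file extension are assumptions. The extension and media-type mappings mirror the cases exercised in the unit tests, and the fallback to Turtle matches the tested default.

```python
import os

# Hypothetical lookup tables; the agent's internal names may differ.
EXTENSION_FORMATS = {
    ".ttl": "turtle", ".turtle": "turtle",
    ".rdf": "xml", ".owl": "xml", ".xml": "xml",
    ".jsonld": "json-ld", ".json-ld": "json-ld",
    ".nt": "nt",
    ".n3": "n3",
    ".trig": "trig",
    ".nq": "nquads",
}

CONTENT_TYPE_FORMATS = {
    "text/turtle": "turtle",
    "application/rdf+xml": "xml",
    "application/ld+json": "json-ld",
    "application/n-triples": "nt",
    "text/n3": "n3",
    "application/trig": "trig",
}


def guess_format(filename=None, content_type=None, default="turtle"):
    """Return an rdflib parser name for a file name and/or content type."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type in CONTENT_TYPE_FORMATS:
            return CONTENT_TYPE_FORMATS[media_type]
    if filename:
        extension = os.path.splitext(filename)[1].lower()
        if extension in EXTENSION_FORMATS:
            return EXTENSION_FORMATS[extension]
    # Assumption taken from the tests: unknown inputs fall back to Turtle.
    return default
```

The returned strings are the parser names rdflib accepts in `Graph.parse(format=...)`, so the result can be passed straight to the parser.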

## Usage

### 1. Add the agent to your configuration

In your application's config file:

```python
from whyis import autonomic

class Config:
    INFERENCERS = {
        'RDFFileLoader': autonomic.RDFFileLoader(),
        # ... other agents
    }
```

### 2. Mark resources as RDF files

Create a nanopublication that types a resource as `whyis:RDFFile`:

```turtle
@prefix whyis: <http://vocab.rpi.edu/whyis/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://example.com/my-data-file> a whyis:RDFFile .
```

### 3. Loading from different sources

#### Local File Depot

For files already uploaded to the file depot:

```turtle
<http://example.com/my-file> a whyis:RDFFile ;
    whyis:hasFileID "file_depot_id_here" .
```

#### HTTP/HTTPS URL

Simply use the URL as the resource URI:

```turtle
<http://example.com/data/dataset.ttl> a whyis:RDFFile .
```

or

```turtle
<https://secure.example.com/rdf/ontology.owl> a whyis:RDFFile .
```

#### S3 URI

For files stored in S3 (requires boto3):

```turtle
<s3://my-bucket/path/to/data.ttl> a whyis:RDFFile .
```

**Note:** Ensure boto3 is installed and AWS credentials are configured:
```bash
pip install boto3
```

AWS credentials can be configured via:
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- AWS credentials file (`~/.aws/credentials`)
- IAM role (when running on EC2)
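
As a rough sketch of what an S3 load involves, the snippet below splits an `s3://` URI into bucket and key and fetches the object through a client passed in by the caller. The helper names `parse_s3_uri` and `fetch_s3_object` are hypothetical and are not the agent's API. The "Invalid S3 URI" message follows the unit tests, and `get_object(Bucket=..., Key=...)` is the standard boto3 S3 client call.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key), rejecting anything else."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Invalid S3 URI: {uri}")
    return parsed.netloc, key


def fetch_s3_object(uri, s3_client):
    """Return (raw bytes, content type) for the object named by an S3 URI.

    `s3_client` is assumed to be a boto3 S3 client, for example the result
    of boto3.client("s3"), which resolves credentials from the environment,
    the credentials file, or an IAM role.
    """
    bucket, key = parse_s3_uri(uri)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read(), response.get("ContentType")
```

Passing the client in as a parameter keeps the URI parsing usable and testable on machines where boto3 is not installed.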

## How It Works

1. The agent queries for resources typed as `whyis:RDFFile` that are not yet `whyis:LoadedRDFFile`
2. For each resource:
   - Checks if it has a `whyis:hasFileID` (file depot)
   - Otherwise, examines the URI scheme (http://, https://, s3://)
   - Downloads and parses the RDF content
   - Adds the loaded triples to a nanopublication
   - Marks the resource as `whyis:LoadedRDFFile`
3. The nanopublication includes provenance linking back to the source file
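
The source-selection part of step 2 can be sketched as a small dispatch function. This is an illustration under stated assumptions: the name `choose_source`, the returned labels, and the `ValueError` for unsupported schemes are hypothetical, since the description above does not say what the agent does with other schemes.

```python
from urllib.parse import urlparse


def choose_source(resource_uri, file_id=None):
    """Decide where to read a whyis:RDFFile resource from.

    A file depot ID takes priority; otherwise the URI scheme decides.
    """
    if file_id:
        return "file_depot"
    scheme = urlparse(resource_uri).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme == "s3":
        return "s3"
    # Assumption: anything else is treated as an error in this sketch.
    raise ValueError(f"Unsupported RDF file source: {resource_uri}")
```

For example, `choose_source("http://example.com/my-file", file_id="file_depot_id_here")` selects the file depot even though the URI is an HTTP URL, matching the order of the checks in step 2.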

## Retirement

When a resource is no longer typed as `whyis:RDFFile`, the agent's update mechanism will retire the associated nanopublications containing the loaded data.

## Testing

The agent ships with 26 unit tests covering:
- Basic functionality
- Format detection
- HTTP/HTTPS loading
- S3 loading (with and without boto3)
- File depot access
- Error handling

Run tests with:
```bash
pytest tests/unit/test_rdf_file_loader*.py
```

## Error Handling

- **Missing boto3:** Raises an `ImportError` that names boto3 when an S3 load is attempted without it installed
- **Invalid RDF:** Logs errors when content cannot be parsed
- **Network errors:** Propagates HTTP errors with proper logging
- **Missing files:** Reports file depot access errors
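
One way to produce the missing-boto3 behaviour is to import the library lazily and re-raise with an actionable message. The helper name `require_boto3` and the message text are assumptions. The unit tests only establish that an `ImportError` mentioning boto3 is raised.

```python
import importlib


def require_boto3():
    """Import boto3 on demand so non-S3 loads work without it installed."""
    try:
        return importlib.import_module("boto3")
    except ImportError as exc:
        # Re-raise with a message that tells the user how to fix it.
        raise ImportError(
            "boto3 is required to load RDF files from S3 URIs; "
            "install it with 'pip install boto3'"
        ) from exc
```

Deferring the import keeps boto3 an optional dependency: file depot and HTTP/HTTPS loads never touch it.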

## Example Use Cases

1. **Bulk Data Import:** Mark multiple HTTP URLs as RDFFile to automatically import external datasets
2. **S3 Data Pipeline:** Load RDF files from S3 buckets as part of a data processing pipeline
3. **File Upload Processing:** When users upload RDF files, mark them as RDFFile for automatic processing
4. **Ontology Loading:** Automatically load and update ontologies from remote URLs
160 changes: 160 additions & 0 deletions tests/unit/test_rdf_file_loader_basic.py
@@ -0,0 +1,160 @@
"""
Simple unit tests for RDFFileLoader agent that don't require full app context.

Tests basic functionality like format guessing and URI parsing.
"""

import pytest
from unittest.mock import Mock, patch
from rdflib import URIRef

from whyis.autonomic.rdf_file_loader import RDFFileLoader
from whyis.namespace import whyis


class TestRDFFileLoaderBasic:
    """Basic tests for RDFFileLoader that don't require app context."""

    def test_agent_initialization(self):
        """Test that RDFFileLoader agent can be initialized."""
        agent = RDFFileLoader()
        assert agent is not None
        assert hasattr(agent, 'activity_class')
        assert agent.activity_class == whyis.RDFFileLoadingActivity

    def test_agent_input_class(self):
        """Test that RDFFileLoader returns correct input class."""
        agent = RDFFileLoader()
        input_class = agent.getInputClass()
        assert input_class == whyis.RDFFile

    def test_agent_output_class(self):
        """Test that RDFFileLoader returns correct output class."""
        agent = RDFFileLoader()
        output_class = agent.getOutputClass()
        assert output_class == whyis.LoadedRDFFile

    def test_agent_has_query(self):
        """Test that RDFFileLoader has get_query method."""
        agent = RDFFileLoader()
        assert hasattr(agent, 'get_query')
        assert callable(agent.get_query)
        query = agent.get_query()
        assert 'RDFFile' in query
        assert 'LoadedRDFFile' in query

    def test_format_guessing_turtle(self):
        """Test RDF format guessing for Turtle files."""
        agent = RDFFileLoader()

        # Test by filename
        assert agent._guess_format('test.ttl', None) == 'turtle'
        assert agent._guess_format('test.turtle', None) == 'turtle'

        # Test by content type
        assert agent._guess_format(None, 'text/turtle') == 'turtle'
        assert agent._guess_format('file.dat', 'text/turtle') == 'turtle'

    def test_format_guessing_rdfxml(self):
        """Test RDF format guessing for RDF/XML files."""
        agent = RDFFileLoader()

        # Test by filename
        assert agent._guess_format('test.rdf', None) == 'xml'
        assert agent._guess_format('test.owl', None) == 'xml'
        assert agent._guess_format('test.xml', None) == 'xml'

        # Test by content type
        assert agent._guess_format(None, 'application/rdf+xml') == 'xml'

    def test_format_guessing_jsonld(self):
        """Test RDF format guessing for JSON-LD files."""
        agent = RDFFileLoader()

        # Test by filename
        assert agent._guess_format('test.jsonld', None) == 'json-ld'
        assert agent._guess_format('test.json-ld', None) == 'json-ld'

        # Test by content type
        assert agent._guess_format(None, 'application/ld+json') == 'json-ld'

    def test_format_guessing_ntriples(self):
        """Test RDF format guessing for N-Triples files."""
        agent = RDFFileLoader()

        # Test by filename
        assert agent._guess_format('test.nt', None) == 'nt'

        # Test by content type
        assert agent._guess_format(None, 'application/n-triples') == 'nt'

    def test_format_guessing_n3(self):
        """Test RDF format guessing for N3 files."""
        agent = RDFFileLoader()

        # Test by filename
        assert agent._guess_format('test.n3', None) == 'n3'

        # Test by content type
        assert agent._guess_format(None, 'text/n3') == 'n3'

    def test_format_guessing_trig(self):
        """Test RDF format guessing for TriG files."""
        agent = RDFFileLoader()

        # Test by filename
        assert agent._guess_format('test.trig', None) == 'trig'

        # Test by content type
        assert agent._guess_format(None, 'application/trig') == 'trig'

    def test_format_guessing_nquads(self):
        """Test RDF format guessing for N-Quads files."""
        agent = RDFFileLoader()

        # Test by filename
        assert agent._guess_format('test.nq', None) == 'nquads'

    def test_format_guessing_default(self):
        """Test that format guessing defaults to turtle."""
        agent = RDFFileLoader()

        # No filename or content type
        assert agent._guess_format(None, None) == 'turtle'

        # Unknown extension
        assert agent._guess_format('test.unknown', None) == 'turtle'

        # Unknown content type
        assert agent._guess_format(None, 'application/unknown') == 'turtle'

    def test_load_from_s3_without_boto3(self):
        """Test that loading from S3 fails gracefully when boto3 is not installed."""
        agent = RDFFileLoader()

        # Mock boto3 import to fail by patching it in the function
        with patch.dict('sys.modules', {'boto3': None}):
            with pytest.raises(ImportError) as exc_info:
                agent._load_from_s3('s3://bucket/key.ttl')

        assert 'boto3' in str(exc_info.value).lower()

    def test_load_from_s3_invalid_uri(self):
        """Test that invalid S3 URIs are rejected."""
        agent = RDFFileLoader()

        # Mock boto3 module
        mock_boto3_module = Mock()
        mock_s3_client = Mock()
        mock_boto3_module.client.return_value = mock_s3_client

        with patch.dict('sys.modules', {'boto3': mock_boto3_module}):
            # Invalid URI (no bucket/key)
            with pytest.raises(ValueError) as exc_info:
                agent._load_from_s3('s3://bucket-only')
            assert 'Invalid S3 URI' in str(exc_info.value)

            # Invalid URI (not s3://)
            with pytest.raises(ValueError) as exc_info:
                agent._load_from_s3('http://not-s3.com/file.ttl')
            assert 'Invalid S3 URI' in str(exc_info.value)