Add RDF file loader agent with file depot, HTTP/HTTPS, and S3 support #349

Draft

Copilot wants to merge 5 commits into main from copilot/add-rdf-file-loading-agent

docs/rdf_file_loader_agent.md

@@ -0,0 +1,136 @@
+    # RDF File Loader Agent
+    ## Overview
+    The RDF File Loader agent automatically loads RDF files into the Whyis knowledge graph as nanopublications. It monitors resources typed as `whyis:RDFFile` and loads their content.
+    ## Features
+    - **Multiple Source Support:**
+      - Local files from the file depot (via `whyis:hasFileID`)
+      - Remote HTTP/HTTPS URLs
+      - S3 URIs (requires boto3 to be installed)
+    - **Format Detection:**
+      - Automatic format detection from file extensions and content types
+      - Supports: Turtle (.ttl), RDF/XML (.rdf, .owl), JSON-LD (.jsonld), N-Triples (.nt), N3 (.n3), TriG (.trig), N-Quads (.nq)
+    - **Provenance Tracking:**
+      - Resources are marked with `whyis:RDFFile` type before processing
+      - After loading, marked as `whyis:LoadedRDFFile`
+      - Activities are tracked as `whyis:RDFFileLoadingActivity`
+      - Proper nanopublication structure with provenance
+    ## Usage
+    ### 1. Add the agent to your configuration
+    In your application's config file:
+    ```python
+    from whyis import autonomic
+    class Config:
+        INFERENCERS = {
+            'RDFFileLoader': autonomic.RDFFileLoader(),
+            # ... other agents
+        }
+    ```
+    ### 2. Mark resources as RDF files
+    Create a nanopublication that types a resource as `whyis:RDFFile`:
+    ```turtle
+    @prefix whyis: <http://vocab.rpi.edu/whyis/> .
+    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
+    <http://example.com/my-data-file> a whyis:RDFFile .
+    ```
+    ### 3. Loading from different sources
+    #### Local File Depot
+    For files already uploaded to the file depot:
+    ```turtle
+    <http://example.com/my-file> a whyis:RDFFile ;
+        whyis:hasFileID "file_depot_id_here" .
+    ```
+    #### HTTP/HTTPS URL
+    Simply use the URL as the resource URI:
+    ```turtle
+    <http://example.com/data/dataset.ttl> a whyis:RDFFile .
+    ```
+    or
+    ```turtle
+    <https://secure.example.com/rdf/ontology.owl> a whyis:RDFFile .
+    ```
+    #### S3 URI
+    For files stored in S3 (requires boto3):
+    ```turtle
+    <s3://my-bucket/path/to/data.ttl> a whyis:RDFFile .
+    ```
+    **Note:** Ensure boto3 is installed and AWS credentials are configured:
+    ```bash
+    pip install boto3
+    ```
+    AWS credentials can be configured via:
+    - Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
+    - AWS credentials file (~/.aws/credentials)
+    - IAM role (when running on EC2)
+    ## How It Works
+    1. The agent queries for resources typed as `whyis:RDFFile` that are not yet `whyis:LoadedRDFFile`
+    2. For each resource:
+       - Checks if it has a `whyis:hasFileID` (file depot)
+       - Otherwise, examines the URI scheme (http://, https://, s3://)
+       - Downloads and parses the RDF content
+       - Adds the loaded triples to a nanopublication
+       - Marks the resource as `whyis:LoadedRDFFile`
+    3. The nanopublication includes provenance linking back to the source file
+    ## Retirement
+    When a resource is no longer typed as `whyis:RDFFile`, the agent's update mechanism will retire the associated nanopublications containing the loaded data.
+    ## Testing
+    The agent includes 26 comprehensive unit tests covering:
+    - Basic functionality
+    - Format detection
+    - HTTP/HTTPS loading
+    - S3 loading (with and without boto3)
+    - File depot access
+    - Error handling
+    Run tests with:
+    ```bash
+    pytest tests/unit/test_rdf_file_loader*.py
+    ```
+    ## Error Handling
+    - **Missing boto3:** Raises an `ImportError` that names boto3 when an S3 load is attempted without it installed
+    - **Invalid RDF:** Logs errors when content cannot be parsed
+    - **Network errors:** Propagates HTTP errors with proper logging
+    - **Missing files:** Reports file depot access errors
+    ## Example Use Cases
+    1. **Bulk Data Import:** Mark multiple HTTP URLs as RDFFile to automatically import external datasets
+    2. **S3 Data Pipeline:** Load RDF files from S3 buckets as part of a data processing pipeline
+    3. **File Upload Processing:** When users upload RDF files, mark them as RDFFile for automatic processing
+    4. **Ontology Loading:** Automatically load and update ontologies from remote URLs
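
The source selection described under "How It Works" can be sketched in isolation. This is a minimal standalone illustration, not the code in this PR: `classify_source` and `parse_s3_uri` are hypothetical helper names, and raising `ValueError` for an unsupported scheme is an assumption (the doc does not say what the agent does in that case). The "Invalid S3 URI" message mirrors what the unit tests check for.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split an s3:// URI into (bucket, key); raise ValueError if malformed."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    # A usable S3 URI needs the s3 scheme, a bucket, and a non-empty key.
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Invalid S3 URI: {uri}")
    return parsed.netloc, key


def classify_source(uri, file_id=None):
    """Return which kind of loader the documented steps would pick."""
    if file_id is not None:
        # A whyis:hasFileID value is checked first (file depot).
        return "file_depot"
    scheme = urlparse(uri).scheme
    if scheme in ("http", "https"):
        return "http"
    if scheme == "s3":
        return "s3"
    # Assumption: anything else is rejected rather than silently skipped.
    raise ValueError(f"Unsupported URI scheme: {uri}")


if __name__ == "__main__":
    print(classify_source("https://secure.example.com/rdf/ontology.owl"))
    print(parse_s3_uri("s3://my-bucket/path/to/data.ttl"))
```

Keeping the S3 URI check separate from the download step means a malformed URI can be rejected before any client is created, which matches the order the `test_load_from_s3_invalid_uri` test appears to expect.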

tests/unit/test_rdf_file_loader_basic.py

@@ -0,0 +1,160 @@
+    """
+    Simple unit tests for RDFFileLoader agent that don't require full app context.
+    Tests basic functionality like format guessing and URI parsing.
+    """
+    import pytest
+    from unittest.mock import Mock, patch
+    from rdflib import URIRef
+    from whyis.autonomic.rdf_file_loader import RDFFileLoader
+    from whyis.namespace import whyis
+    class TestRDFFileLoaderBasic:
+        """Basic tests for RDFFileLoader that don't require app context."""
+        def test_agent_initialization(self):
+            """Test that RDFFileLoader agent can be initialized."""
+            agent = RDFFileLoader()
+            assert agent is not None
+            assert hasattr(agent, 'activity_class')
+            assert agent.activity_class == whyis.RDFFileLoadingActivity
+        def test_agent_input_class(self):
+            """Test that RDFFileLoader returns correct input class."""
+            agent = RDFFileLoader()
+            input_class = agent.getInputClass()
+            assert input_class == whyis.RDFFile
+        def test_agent_output_class(self):
+            """Test that RDFFileLoader returns correct output class."""
+            agent = RDFFileLoader()
+            output_class = agent.getOutputClass()
+            assert output_class == whyis.LoadedRDFFile
+        def test_agent_has_query(self):
+            """Test that RDFFileLoader has get_query method."""
+            agent = RDFFileLoader()
+            assert hasattr(agent, 'get_query')
+            assert callable(agent.get_query)
+            query = agent.get_query()
+            assert 'RDFFile' in query
+            assert 'LoadedRDFFile' in query
+        def test_format_guessing_turtle(self):
+            """Test RDF format guessing for Turtle files."""
+            agent = RDFFileLoader()
+            # Test by filename
+            assert agent._guess_format('test.ttl', None) == 'turtle'
+            assert agent._guess_format('test.turtle', None) == 'turtle'
+            # Test by content type
+            assert agent._guess_format(None, 'text/turtle') == 'turtle'
+            assert agent._guess_format('file.dat', 'text/turtle') == 'turtle'
+        def test_format_guessing_rdfxml(self):
+            """Test RDF format guessing for RDF/XML files."""
+            agent = RDFFileLoader()
+            # Test by filename
+            assert agent._guess_format('test.rdf', None) == 'xml'
+            assert agent._guess_format('test.owl', None) == 'xml'
+            assert agent._guess_format('test.xml', None) == 'xml'
+            # Test by content type
+            assert agent._guess_format(None, 'application/rdf+xml') == 'xml'
+        def test_format_guessing_jsonld(self):
+            """Test RDF format guessing for JSON-LD files."""
+            agent = RDFFileLoader()
+            # Test by filename
+            assert agent._guess_format('test.jsonld', None) == 'json-ld'
+            assert agent._guess_format('test.json-ld', None) == 'json-ld'
+            # Test by content type
+            assert agent._guess_format(None, 'application/ld+json') == 'json-ld'
+        def test_format_guessing_ntriples(self):
+            """Test RDF format guessing for N-Triples files."""
+            agent = RDFFileLoader()
+            # Test by filename
+            assert agent._guess_format('test.nt', None) == 'nt'
+            # Test by content type
+            assert agent._guess_format(None, 'application/n-triples') == 'nt'
+        def test_format_guessing_n3(self):
+            """Test RDF format guessing for N3 files."""
+            agent = RDFFileLoader()
+            # Test by filename
+            assert agent._guess_format('test.n3', None) == 'n3'
+            # Test by content type
+            assert agent._guess_format(None, 'text/n3') == 'n3'
+        def test_format_guessing_trig(self):
+            """Test RDF format guessing for TriG files."""
+            agent = RDFFileLoader()
+            # Test by filename
+            assert agent._guess_format('test.trig', None) == 'trig'
+            # Test by content type
+            assert agent._guess_format(None, 'application/trig') == 'trig'
+        def test_format_guessing_nquads(self):
+            """Test RDF format guessing for N-Quads files."""
+            agent = RDFFileLoader()
+            # Test by filename
+            assert agent._guess_format('test.nq', None) == 'nquads'
+        def test_format_guessing_default(self):
+            """Test that format guessing defaults to turtle."""
+            agent = RDFFileLoader()
+            # No filename or content type
+            assert agent._guess_format(None, None) == 'turtle'
+            # Unknown extension
+            assert agent._guess_format('test.unknown', None) == 'turtle'
+            # Unknown content type
+            assert agent._guess_format(None, 'application/unknown') == 'turtle'
+        def test_load_from_s3_without_boto3(self):
+            """Test that loading from S3 fails gracefully when boto3 is not installed."""
+            agent = RDFFileLoader()
+            # Force `import boto3` to fail: a None entry in sys.modules raises ImportError
+            with patch.dict('sys.modules', {'boto3': None}):
+                with pytest.raises(ImportError) as exc_info:
+                    agent._load_from_s3('s3://bucket/key.ttl')
+                assert 'boto3' in str(exc_info.value).lower()
+        def test_load_from_s3_invalid_uri(self):
+            """Test that invalid S3 URIs are rejected."""
+            agent = RDFFileLoader()
+            # Mock boto3 module
+            mock_boto3_module = Mock()
+            mock_s3_client = Mock()
+            mock_boto3_module.client.return_value = mock_s3_client
+            with patch.dict('sys.modules', {'boto3': mock_boto3_module}):
+                # Invalid URI (no bucket/key)
+                with pytest.raises(ValueError) as exc_info:
+                    agent._load_from_s3('s3://bucket-only')
+                assert 'Invalid S3 URI' in str(exc_info.value)
+                # Invalid URI (not s3://)
+                with pytest.raises(ValueError) as exc_info:
+                    agent._load_from_s3('http://not-s3.com/file.ttl')
+                assert 'Invalid S3 URI' in str(exc_info.value)
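
For reference, the behavior the format-guessing tests pin down can be reproduced with two lookup tables. This is a sketch under stated assumptions, not the `_guess_format` implementation from this PR (which is not shown here): `guess_format` is a hypothetical free function, and checking the content type before the file extension is an assumption, since the tests pass with either order.

```python
import os

# Extension and media-type mappings taken from the assertions in the tests above.
EXTENSION_FORMATS = {
    ".ttl": "turtle", ".turtle": "turtle",
    ".rdf": "xml", ".owl": "xml", ".xml": "xml",
    ".jsonld": "json-ld", ".json-ld": "json-ld",
    ".nt": "nt", ".n3": "n3", ".trig": "trig", ".nq": "nquads",
}
CONTENT_TYPE_FORMATS = {
    "text/turtle": "turtle",
    "application/rdf+xml": "xml",
    "application/ld+json": "json-ld",
    "application/n-triples": "nt",
    "text/n3": "n3",
    "application/trig": "trig",
}


def guess_format(filename, content_type, default="turtle"):
    """Return an rdflib parser format name, falling back to Turtle."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type in CONTENT_TYPE_FORMATS:
            return CONTENT_TYPE_FORMATS[media_type]
    if filename:
        extension = os.path.splitext(filename)[1].lower()
        if extension in EXTENSION_FORMATS:
            return EXTENSION_FORMATS[extension]
    return default


if __name__ == "__main__":
    print(guess_format("test.owl", None))
    print(guess_format("file.dat", "text/turtle"))
```

Table-driven lookup keeps the supported formats in one place, so adding a format is a one-line change and the Turtle default stays explicit.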
