oni-ocfl

A Node.js library for ingesting datasets into an OCFL repository as RO-Crate objects. It is used in corpus-tools at LDaCA to build and manage language data collections stored in OCFL.

Installation

This library is not published to npm. Install it as a GitHub dependency pinned to a specific ref (here 1.7.2) by adding it to your package.json:

{
  "dependencies": {
    "oni-ocfl": "github:Language-Research-Technology/oni-ocfl#1.7.2"
  }
}

Then run:

npm install

Usage

const { Collector, generateArcpId, Provenance } = require('oni-ocfl');

Modules

Collector

The main class for building an OCFL repository from a collection of RO-Crate objects. It manages the repository connection, creates CollectionObject instances, and drives the ingestion workflow.

const collector = new Collector({
  repoPath: './repo',
  namespace: 'my-collection',
  dataDir: './data',
  templateCrateDir: './template'
});

// Creates or loads the OCFL repository (call from inside an async function)
await collector.connect();

You can also use the static factory method:

const collector = await Collector.create({ repoPath: './repo', namespace: 'my-collection' });

collector.newObject(cratePath, crate?)

Creates a new CollectionObject. Pass a directory path to load the RO-Crate metadata from disk, or pass an existing ROCrate instance directly:

// Load from disk
const obj = collector.newObject('./path/to/crate-dir');

// Pass an ROCrate instance directly
// (myJson is the parsed content of an ro-crate-metadata.json file)
const { ROCrate } = require('ro-crate');
const crate = new ROCrate(myJson, { alwaysAsArray: true, resolveLinks: true });
const objFromCrate = collector.newObject('./path/to/crate-dir', crate);
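
For the second form, myJson has to be a parsed RO-Crate metadata document. As a rough sketch, assuming the RO-Crate 1.1 layout (a metadata descriptor plus a root dataset), a minimal object could be built as follows. The helper name minimalCrateJson and the dataset name are illustrative only and are not part of oni-ocfl.

```javascript
// Hypothetical helper (not part of oni-ocfl): builds a minimal
// RO-Crate 1.1 metadata object with a descriptor and a root dataset.
function minimalCrateJson(name) {
  return {
    '@context': 'https://w3id.org/ro/crate/1.1/context',
    '@graph': [
      {
        // Metadata descriptor: points at the root dataset via "about".
        '@id': 'ro-crate-metadata.json',
        '@type': 'CreativeWork',
        conformsTo: { '@id': 'https://w3id.org/ro/crate/1.1' },
        about: { '@id': './' }
      },
      {
        // Root dataset: the entity that mintArcpId later re-identifies.
        '@id': './',
        '@type': 'Dataset',
        name
      }
    ]
  };
}

const myJson = minimalCrateJson('My Dataset');
```

An object shaped like this can then be passed to the ROCrate constructor as in the snippet above.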

CollectionObject

Represents a single OCFL object being built. Key methods:

  • mintArcpId(paths, id?): Assigns an ARCP identifier to the root dataset (see generateArcpId below).
  • importFile(source, target, entityProps?): Queues a local file to be imported into the OCFL object.
  • addFile(entity, srcDir, filePath?, addToRootHasPart?): Copies a file into the temp directory and adds it to the crate.
  • addToRepo(ignoreFilesInCrate?, files?): Validates the crate and writes the object to the OCFL repository.

const obj = collector.newObject(collector.templateCrateDir);
obj.mintArcpId(['corpus', 'my-dataset']);

obj.crate.rootDataset.name = 'My Dataset';
obj.crate.rootDataset.description = 'A language corpus';
obj.crate.rootDataset.datePublished = '2024';
obj.crate.rootDataset.license = 'cc-by-4.0';

await obj.addToRepo();
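
Because addToRepo() validates the crate before writing, it can be convenient to check the root dataset yourself first. The sketch below is only an illustration: the helper missingRootFields and its field list (taken from the four properties set in the example above) are assumptions, and the library's own validation rules may differ.

```javascript
// Hypothetical pre-flight check (not part of oni-ocfl).
// The field list mirrors the example above; it is an assumption,
// not the library's actual validation rule set.
const EXAMPLE_FIELDS = ['name', 'description', 'datePublished', 'license'];

function missingRootFields(rootDataset, fields = EXAMPLE_FIELDS) {
  return fields.filter((field) => {
    const value = rootDataset[field];
    // Treat undefined, null, empty strings and empty arrays as missing.
    return value == null || value === '' ||
      (Array.isArray(value) && value.length === 0);
  });
}

// A plain object stands in for obj.crate.rootDataset here.
const missing = missingRootFields({ name: 'My Dataset', license: 'cc-by-4.0' });
console.log(missing); // [ 'description', 'datePublished' ]
```

In an ingestion script, one might call missingRootFields(obj.crate.rootDataset) and skip addToRepo() whenever the result is non-empty.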

generateArcpId(namespace, paths)

Generates an ARCP URI used as the @id of an RO-Crate root dataset. ARCP URIs provide stable, namespace-scoped identifiers for objects in a collection.

const { generateArcpId } = require('oni-ocfl');

const id = generateArcpId('my-collection', ['corpus', 'dataset-1']);
// => 'arcp://name,my-collection/corpus/dataset-1'
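
The result above follows the name-based ARCP form, arcp://name,<authority>/<path>, with the namespace as the authority and the path segments joined by slashes. A minimal sketch of that composition is shown below. The function sketchArcpId is a hypothetical stand-in written for illustration; the real generateArcpId may validate or encode its inputs differently.

```javascript
// Hypothetical re-implementation for illustration only; the real
// generateArcpId may validate or encode its inputs differently.
function sketchArcpId(namespace, paths) {
  // Accept a single string or an array of path segments.
  const segments = Array.isArray(paths) ? paths : [paths];
  return `arcp://name,${namespace}/${segments.join('/')}`;
}

console.log(sketchArcpId('my-collection', ['corpus', 'dataset-1']));
// prints: arcp://name,my-collection/corpus/dataset-1
```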

Provenance

Records software provenance in the RO-Crate by adding a SoftwareSourceCode entity and a CreateAction entity. It reads metadata from the calling package's package.json (which must have repository.url and description fields).

const { Provenance } = require('oni-ocfl');

const prov = new Provenance({
  name: 'my-corpus-tool',
  description: 'Tool for ingesting my corpus',
  repository: { url: 'https://github.com/my-org/my-corpus-tool' }
});
// prov.scriptTool  — SoftwareSourceCode entity added to the crate
// prov.createAction — CreateAction entity added to the crate

Provenance is instantiated automatically by Collector using the main process package.json. The entities are injected into every crate written by addToRepo().
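
To make the two entities more concrete, the sketch below derives a SoftwareSourceCode and a CreateAction from package.json-style metadata. It is not the library's code: the function sketchProvenanceEntities, the '#provenance' identifier, and the choice of properties (name, description, endTime, instrument) are assumptions for illustration, and the entities oni-ocfl actually writes may look different.

```javascript
// Hypothetical sketch (not oni-ocfl's code): derive two JSON-LD entities
// from package.json-style metadata. Property choices are assumptions.
function sketchProvenanceEntities(pkg, endTime = new Date().toISOString()) {
  // Mirrors the documented requirement for repository.url and description.
  if (!pkg.repository || !pkg.repository.url || !pkg.description) {
    throw new Error('package.json needs repository.url and description');
  }
  const scriptTool = {
    '@id': pkg.repository.url,
    '@type': 'SoftwareSourceCode',
    name: pkg.name,
    description: pkg.description
  };
  const createAction = {
    '@id': '#provenance', // assumed identifier
    '@type': 'CreateAction',
    name: `Created RO-Crate using ${pkg.name}`,
    endTime,
    // Link the action to the tool that performed it.
    instrument: { '@id': scriptTool['@id'] }
  };
  return { scriptTool, createAction };
}

const entities = sketchProvenanceEntities({
  name: 'my-corpus-tool',
  description: 'Tool for ingesting my corpus',
  repository: { url: 'https://github.com/my-org/my-corpus-tool' }
});
console.log(entities.createAction.instrument['@id']);
// prints: https://github.com/my-org/my-corpus-tool
```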

CLI Options

Collector integrates with Commander and accepts command-line arguments that override constructor options:

Flag                          Description
-r, --repo-path               Path to OCFL repository
-n, --repo-name               Name of OCFL repository
-s, --namespace               Namespace for ARCP IDs
-d, --data-dir                Directory of data files
-t, --template                RO-Crate directory to use as base template
-p, --temp-path               Temporary directory path
--vx, --validate-with-excel   Excel file for crate validation
--vm, --validate-with-mode    Path or URL to an RO-Crate mode file for validation
-D, --debug                   Enable debug mode
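
The override behaviour can be pictured as a merge in which values given on the command line replace the matching constructor options. The sketch below illustrates that idea for a few of the flags. It uses Node's built-in util.parseArgs as a stand-in for Commander, and the helper resolveOptions together with the flag-to-option mapping (for example --repo-path to repoPath) are assumptions rather than the library's implementation.

```javascript
const { parseArgs } = require('node:util');

// Illustration only: oni-ocfl uses Commander; Node's built-in parseArgs
// stands in here, and resolveOptions is a hypothetical helper.
function resolveOptions(constructorOptions, argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      'repo-path': { type: 'string', short: 'r' },
      namespace: { type: 'string', short: 's' },
      'data-dir': { type: 'string', short: 'd' },
      debug: { type: 'boolean', short: 'D' }
    }
  });
  // Map only the flags that were actually supplied onto option names.
  const fromCli = {};
  if (values['repo-path'] !== undefined) fromCli.repoPath = values['repo-path'];
  if (values.namespace !== undefined) fromCli.namespace = values.namespace;
  if (values['data-dir'] !== undefined) fromCli.dataDir = values['data-dir'];
  if (values.debug !== undefined) fromCli.debug = values.debug;
  // Later spreads win, so CLI values override constructor options.
  return { ...constructorOptions, ...fromCli };
}

const options = resolveOptions(
  { repoPath: './repo', namespace: 'my-collection' },
  ['-r', '/srv/ocfl', '--data-dir', './data']
);
// options.repoPath is '/srv/ocfl' (overridden by -r)
// options.namespace is 'my-collection' (constructor value kept)
// options.dataDir is './data' (supplied only on the command line)
```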

Tests

Run all tests

npm test

Run a single test file

Select a *.spec.js file in VS Code and use:

Current file run all tests

Docs

docs