io7m.com: com.io7m.entomos

entomos

A utility library for implementing sectioned binary file formats.

Features

Read and validate versioned, sectioned file formats.
Written in pure Java 21.
OSGi ready.
JPMS ready.
ISC license.
High-coverage automated test suite.

Motivation

Many io7m projects implement binary file formats. With minor variations, all of the formats have tended to converge on the following set of design rules:

Files start with a 64-bit format-specific tag for easy identification.
The file tag is followed by a 32-bit major version and a 32-bit minor version.
The rest of the file is divided into a flat array of sections.
Each section is aligned to a 16-byte boundary.
Each section starts with a 64-bit format-specific tag.
The section tag is followed by a 64-bit size value, specifying the length of the data within the section.
The section then contains size bytes of data, followed by up to 15 bytes of padding in order to ensure that any data that follows is aligned to a 16-byte boundary.
The file ends with an end section. This is a normal section with a format-specific end tag and a size of 0.
Readers are required to stop reading at the end section. Trailing data is ignored.
Readers are required to fail if there is no end section, and treat the file as corrupted and truncated.

All values are in big-endian order. The data within sections can be in any byte order as required by specific formats.
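The layout rules above can be sketched with nothing but java.nio. This is a minimal illustration and not the entomos implementation: the names SectionWalk, Section, Parsed, alignUp16, and walk are made up for this example, and the whole file is assumed to fit in a single ByteBuffer (so section sizes are effectively limited to 31 bits here).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the entomos API.
final class SectionWalk {
  /** One section: its tag, the absolute offset of its data, and the data size. */
  record Section(long tag, long dataOffset, long size) { }

  /** The file version plus every section up to and including the end section. */
  record Parsed(int major, int minor, List<Section> sections) { }

  /** Round an offset up to the next multiple of 16. */
  static long alignUp16(final long offset) {
    return (offset + 15L) & ~15L;
  }

  static Parsed walk(final ByteBuffer file, final long fileTag, final long endTag) {
    // Tags, versions, and sizes are big-endian.
    final ByteBuffer in = file.duplicate().order(ByteOrder.BIG_ENDIAN);
    final long limit = in.limit();
    if (limit < 16L || in.getLong(0) != fileTag) {
      throw new IllegalArgumentException("Missing or unexpected file tag");
    }
    final int major = in.getInt(8);
    final int minor = in.getInt(12);

    final List<Section> sections = new ArrayList<>();
    long offset = 16L;
    while (offset + 16L <= limit) {
      final long tag = in.getLong((int) offset);
      final long size = in.getLong((int) offset + 8);
      final long dataOffset = offset + 16L;
      if (size < 0L || size > limit - dataOffset) {
        throw new IllegalArgumentException("Section data extends past end of file");
      }
      sections.add(new Section(tag, dataOffset, size));
      if (tag == endTag) {
        // Readers stop at the end section; trailing data is ignored.
        return new Parsed(major, minor, List.copyOf(sections));
      }
      // Skip the data and up to 15 bytes of padding.
      offset = alignUp16(dataOffset + size);
    }
    throw new IllegalArgumentException("No end section: file is truncated or corrupted");
  }
}
```

Because the file header and every section header are 16 bytes long, and each section's data is padded to a multiple of 16, every section header starts on a 16-byte boundary without any extra bookkeeping.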

Some formats have rules on ordering: A section with a particular tag might be required to be the first one in the file, or the last one in the file (just prior to the end section). No format to date appears to have required a stricter ordering of sections, and such a requirement would appear to be of limited utility.

Some formats have rules on cardinality: A section with a particular tag might be required to appear exactly once, at least once, at most once, or any number of times.
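As a rough illustration of how ordering and cardinality rules might be checked against the list of section tags found in a file, consider the sketch below. It is not the entomos implementation. The constant names ONE, ONE_TO_N, ZERO_TO_ONE, MUST_BE_FIRST, and MUST_BE_LAST mirror those used in the usage example later in this document; ZERO_TO_N, ANY, and every type and method name here are assumptions made for the example. Sections with undeclared tags are simply ignored, which is also an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the entomos API.
final class SectionRules {
  enum Cardinality {
    ONE(1L, 1L),
    ONE_TO_N(1L, Long.MAX_VALUE),
    ZERO_TO_ONE(0L, 1L),
    ZERO_TO_N(0L, Long.MAX_VALUE);

    private final long min;
    private final long max;

    Cardinality(final long min, final long max) {
      this.min = min;
      this.max = max;
    }

    boolean allows(final long count) {
      return count >= this.min && count <= this.max;
    }
  }

  enum Ordering { ANY, MUST_BE_FIRST, MUST_BE_LAST }

  record Rule(Cardinality cardinality, Ordering ordering) { }

  /**
   * @param tags  Section tags in file order, excluding the end section
   * @param rules The declared rule for each known tag
   */
  static boolean isValid(final List<Long> tags, final Map<Long, Rule> rules) {
    final Map<Long, Long> counts = new HashMap<>();
    for (int i = 0; i < tags.size(); ++i) {
      final long tag = tags.get(i);
      counts.merge(tag, 1L, Long::sum);
      final Rule rule = rules.get(tag);
      if (rule == null) {
        continue; // Assumption: undeclared sections are ignored.
      }
      if (rule.ordering() == Ordering.MUST_BE_FIRST && i != 0) {
        return false;
      }
      if (rule.ordering() == Ordering.MUST_BE_LAST && i != tags.size() - 1) {
        return false;
      }
    }
    // Every declared tag must appear an acceptable number of times.
    for (final var entry : rules.entrySet()) {
      final long count = counts.getOrDefault(entry.getKey(), 0L);
      if (!entry.getValue().cardinality().allows(count)) {
        return false;
      }
    }
    return true;
  }
}
```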

For all formats, semantic versioning has tended to be used. The formats tend to come with versioning rules akin to the following:

The specification is versioned via a restricted subset of the Semantic Versioning specification. The specification has a major and minor version number, with major version increments denoting incompatible changes, and minor version increments denoting new functionality. There is no patch version number. A version of the specification with major version m and minor version n is denoted as specification version (m, n).
Assuming a version of the specification m, an update to the specification that yields version n such that n > m is considered to be forwards compatible if a parser that supports format version m can read files that were written using format version n.
Assuming a version of the specification m, an update to the specification that yields version n such that n > m is considered to be backwards compatible if a parser that supports format version n can read files that were written using format version m.
The specification is designed such that a correctly-written parser implementation that supports a major version m is able to support the set of versions ∀n. (m, n). This implies full forwards and backwards compatibility for parsers when the major version is unchanged.
Changes that would cause a parser supporting an older version of the specification to fail to read a file written according to a newer version of the specification MUST imply an increment in the major version of the specification.
Changes that would cause a parser supporting a newer version of the specification to fail to read a file written according to an older version of the specification MUST imply an increment in the major version of the specification.

An implication of the above rules is that new features added to the format specification must be added in a manner that allows them to be ignored by older parsers, lest the major version of the specification be incremented on every update.
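In code, the compatibility rule reduces to a check on the major version alone. The following is a small sketch under that reading; the names VersionCheck, Version, and isReadable are invented for the example and are not part of any io7m package.

```java
import java.util.Set;

// Hypothetical helper; not part of the entomos API.
final class VersionCheck {
  /** A specification version (major, minor). */
  record Version(int major, int minor) { }

  /**
   * A parser that supports major version m accepts every (m, n),
   * so the minor version never affects the decision.
   */
  static boolean isReadable(final Set<Integer> supportedMajors, final Version file) {
    return supportedMajors.contains(file.major());
  }
}
```

A parser written against version (1, 0) would therefore accept a (1, 7) file and skip anything it does not recognize, but would reject a (2, 0) file outright.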

For example, a hex dump of the first bytes of a file in this style:

00000000 89 43 4c 4e 0d 0a 1a 0a 00 00 00 01 00 00 00 00 |.CLN............|
00000010 43 4c 4e 49 49 4e 46 4f 00 00 00 00 00 00 00 a0 |CLNIINFO........|
00000020 00 00 01 00 00 00 01 00 00 00 00 06 00 00 00 0b |................|
00000030 52 38 3a 47 38 3a 42 38 3a 41 38 00 00 00 00 1f |R8:G8:B8:A8.....|
00000040 46 49 58 45 44 5f 50 4f 49 4e 54 5f 4e 4f 52 4d |FIXED_POINT_NORM|
00000050 41 4c 49 5a 45 44 5f 55 4e 53 49 47 4e 45 44 00 |ALIZED_UNSIGNED.|
00000060 00 00 00 0c 55 4e 43 4f 4d 50 52 45 53 53 45 44 |....UNCOMPRESSED|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 00 00 00 03 4c 5a 34 00 00 00 00 00 |........LZ4.....|
00000090 00 00 00 00 00 00 00 08 52 54 3a 53 52 3a 54 44 |........RT:SR:TD|
000000a0 00 00 00 04 53 52 47 42 00 00 00 00 00 00 00 0d |....SRGB........|
000000b0 4c 49 54 54 4c 45 5f 45 4e 44 49 41 4e 00 00 00 |LITTLE_ENDIAN...|
000000c0 43 4c 4e 5f 41 52 52 21 00 00 00 00 00 12 30 e0 |CLN_ARR!......0.|
000000d0 00 00 00 30 00 00 00 07 00 00 00 00 00 00 00 00 |...0............|
000000e0 00 00 06 e0 00 00 00 00 00 00 00 10 00 00 00 00 |................|

The general structure of the file formats could be seen as similar to a flattened and simplified form of the RIFF format, but using 64-bit instead of 32-bit values.

The entomos package provides a set of primitives for implementing file formats that adhere to the general design rules above. It is intended to reduce the amount of effectively duplicated code between projects, providing a simple API for parsing, validating, and extracting data from files.

No code is provided for writing files, because that code is already trivial and is typically implemented by using jbssio directly; there don't appear to be any abstractions that would reduce code duplication or make the code any simpler than it already is.

Building

$ mvn clean verify

Usage

Declare a file format by declaring its sections. The EoTags class provides helper methods for picking 64-bit tag values with useful properties. The builder classes check required invariants, such as tags being unique and ordering declarations not conflicting.

// 0x8958595A0D0A1A0A
final long fileTag =
  EoTags.pngStyle((byte) 'X', (byte) 'Y', (byte) 'Z');

// 0x4558414D_54414741
final long tagA = EoTags.ofString("EXAMTAGA");
// 0x4558414D_54414742
final long tagB = EoTags.ofString("EXAMTAGB");
// 0x4558414D_54414743
final long tagC = EoTags.ofString("EXAMTAGC");
// 0x4558414D_454E4421
final long endTag = EoTags.ofString("EXAMEND!");

// A section with tag A, that must appear exactly once, and must be first.
final var sectionA =
  EoFileSectionDescription.builder()
    .setTag(tagA)
    .setCardinality(ONE)
    .setOrdering(MUST_BE_FIRST)
    .build();

// A section with tag B, that must appear at least once.
final var sectionB =
  EoFileSectionDescription.builder()
    .setTag(tagB)
    .setCardinality(ONE_TO_N)
    .build();

// A section with tag C, that must appear at most once, and must be last.
final var sectionC =
  EoFileSectionDescription.builder()
    .setTag(tagC)
    .setCardinality(ZERO_TO_ONE)
    .setOrdering(MUST_BE_LAST)
    .build();

// Format version 1.0
final var format1 =
  EoFileDescription.builder()
    .setVersionMajor(1)
    .setVersionMinor(0)
    .setFileTag(fileTag)
    .setEndTag(endTag)
    .addSections(sectionA, sectionB, sectionC)
    .build();

// All versions of the format.
final var formatAll =
  EoFileVersionsDescription.builder()
    .addDescriptions(format1)
    .build();
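The tag values in the comments above are consistent with packing eight ASCII bytes into a big-endian 64-bit integer, and with a PNG-style signature (0x89, three identifying bytes, then CR LF, 0x1A, LF). The sketch below reproduces that arithmetic in plain Java. It is an assumption about how such values can be derived, not a description of how EoTags is implemented; TagSketch, packAscii, and pngStyleTag are names invented here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the entomos API.
final class TagSketch {
  /** Pack exactly eight ASCII characters into a big-endian 64-bit value. */
  static long packAscii(final String text) {
    final byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
    if (bytes.length != 8) {
      throw new IllegalArgumentException("Tag text must be exactly 8 characters");
    }
    long tag = 0L;
    for (final byte b : bytes) {
      // Shift earlier characters towards the most significant end.
      tag = (tag << 8) | (b & 0xFFL);
    }
    return tag;
  }

  /** Build a PNG-style tag: 0x89, three bytes, then 0x0D 0x0A 0x1A 0x0A. */
  static long pngStyleTag(final byte a, final byte b, final byte c) {
    return (0x89L << 56)
      | ((a & 0xFFL) << 48)
      | ((b & 0xFFL) << 40)
      | ((c & 0xFFL) << 32)
      | 0x0D0A1A0AL;
  }
}
```

With these definitions, packAscii("EXAMTAGC") yields 0x4558414D_54414743 and pngStyleTag((byte) 'X', (byte) 'Y', (byte) 'Z') yields 0x8958595A0D0A1A0A.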

Open a reader that can validate and read files:

final var readers = new EoFileReadersChecked();

final Path path = ...;

try (final var reader = readers.forFile(fileTag, endTag, path, formatAll)) {
  System.out.printf("Version: %s%n", reader.version());
  System.out.printf("Sections: %s%n", reader.sections());

  try (SeekableByteChannel channel =
         reader.dataChannel(reader.sections().first())) {
    // `channel` is bounded and can only read from within the data of the
    // passed-in section.
  }
}

The checked reader validates that the file sections obey the declared cardinality and ordering rules, and that the file is well-formed in other ways (such as not being truncated). It also checks that the file tag matches that of the declared format, and that the major version of the file is equal to one of the declared versions. The entomos implementation performs the minimum amount of I/O needed to validate files, and critically does not read the entire file into memory (although it will perform one seek for each section in the file in order to read section tags and sizes). No data within sections is actually read until reader.dataChannel() is used.

All interfaces produce detailed structured exceptions in the case of errors.

An unchecked reader is also provided that merely enumerates sections within the file. The checked reader is implemented on top of the unchecked reader.

Releases & Development Snapshots

Releases

You can subscribe to the atom feed to be notified of project releases.

The most recently released version of the package is 0.0.1.

0.0.1 Release (2025-05-06Z)

The compiled artifacts for the release (and all previous releases) are available on Maven Central.

Maven Modules

<dependency>
  <groupId>com.io7m.entomos</groupId>
  <artifactId>com.io7m.entomos.core</artifactId>
  <version>0.0.1</version>
</dependency>
<dependency>
  <groupId>com.io7m.entomos</groupId>
  <artifactId>com.io7m.entomos.tests</artifactId>
  <version>0.0.1</version>
</dependency>

Development Snapshots

At the time of writing, the current unstable development version of the package is 0.0.2-SNAPSHOT.

Development snapshots may be available in the Central Portal Snapshots repository. Snapshots are published to this repository every time the project is built by the project's continuous integration system, but snapshots do expire after around ninety days and so may or may not be available depending on when a build of the package was last triggered.

Manual

This project does not have any user manuals or other documentation beyond what might be present on the page above.

Sources

This project uses Git to manage source code.

Repository: https://www.github.com/io7m-com/entomos

$ git clone --recursive https://www.github.com/io7m-com/entomos

Issues

This project uses GitHub Issues to track issues.

License

Copyright © 2025 Mark Raynsford <code@io7m.com> https://www.io7m.com

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.