The
laurel application attempts to provide tools to assist with
image captioning
within the context of
machine learning.
In particular, the application is geared towards the management of smaller datasets
(in the range of thousands of
images) for use in techniques such as
LoRA
training.
The laurel application provides the following features:

* Management of images and their associated captions.
* Caption categories, with optional required-category validation checks.
* Global prefix captions that always appear at the beginning of each image's exported caption list.
* Embedded key/value metadata for recording information such as authorship and licensing.
* A complete, persistent undo/redo history for every change made to a dataset.
* Importing of directories of captioned images, and exporting of datasets to directories.
* A graphical user interface and a command-line interface.
There are several ways to install the Laurel application.
The portable application distribution is simply a zip archive consisting of a couple
of frontend shell scripts and
the Java jar files that comprise the application. This distribution is mostly platform-independent, but requires some (fairly straightforward) manual setup.
The distribution uses your locally installed Java VM. First, check that you have a
JDK 21 or higher JVM installed:
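$ java -version
openjdk version "21.0.2" 2024-01-16

The exact output will vary by JVM vendor and platform; the reported version must be 21 or newer.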
The application distribution is a zip file with a laurel directory in the root of
the zip archive.
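For example, to unpack the archive (the actual file name will include a version number):

$ unzip laurel.zip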
On UNIX-like platforms, ensure the included frontend scripts are executable:
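$ chmod +x laurel/bin/laurel laurel/bin/laurel-ui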
Set the LAUREL_HOME environment variable to the directory:
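$ export LAUREL_HOME="$(pwd)/laurel"

(This assumes a POSIX shell, and that the archive was unpacked into the current directory.)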
Now run either laurel/bin/laurel for the command-line tool, or
laurel/bin/laurel-ui
for the graphical user interface.
This section of the documentation describes how to
use the application without spending
any time explaining the underlying model the application works with, and without describing
how exactly the
application
works. The
theory of operation
section of the manual describes the inner workings of the application in a more formal
manner.
The vast majority of operations in the application can be undone. When an operation is
performed, it can typically be reverted by selecting
Undo
from the menu. Any operation that has been undone can be performed again by
selecting
Redo
from the menu.
The application is slightly atypical in that there is no "save" functionality. Instead,
every operation performed
in the application that changes the state of the dataset is persisted into the dataset
itself. This, effectively,
provides an unbounded undo stack that survives application restarts.
The current state of the undo/redo stack can be viewed in the
History
tab. Details of the undo implementation are described in the
theory of operation.
The application opens to an empty file view.
Via the menu, it's possible to create a new dataset, or open an existing one.
With a dataset loaded, the file view shows a series of tabs.
The
Images tab allows for loading
images
and assigning
captions
to images.
Click the Add Image button to load an image from the filesystem.
Once an image is loaded, it appears in the image list.
Clicking the image opens an image preview window that contains a larger copy of the image. This window continuously updates to show whichever image is currently selected.
The intended use case for
the image preview window is to be left open on a separate screen so that a large version
of the image is always visible when manually captioning images.
Click the Create Caption button to create a new caption.
When a caption is first created, it is visible in the set of
unassigned
captions for the selected image. Naturally, the set of unassigned captions is different
for each image. The Assign Caption button can be used to assign one or more selected
captions to the currently selected image.
Typically, in fine-tuning methods such as LoRA training, there will be one or more captions that should be globally applied to all images,
and
should also, when the captions are exported, always appear at the beginning of the
list of captions for each
image.
Click the Configure Global Prefix Captions button to configure
global prefix captions.
The Configure Global Prefix Captions window allows for creating, deleting, modifying,
and reordering captions.
Click the Add Category button to create a new category.
When a category is selected, the captions that are not assigned to that category will
appear in the list of
unassigned captions. Conversely, the captions that
are assigned to the category will
appear in the list of assigned captions. In a similar manner to image caption assignment, captions can be assigned to and unassigned from a category using the arrow buttons.
Categories can be marked as
required using
the buttons above the category list. When a category is
required, all images
must have at least one caption from that category assigned to pass
validation checks.
The Metadata tab allows for embedding textual metadata into the dataset. This can be
used to hold author information, license information, and so on.
Metadata values can be added using the Add Metadata button. Existing metadata values can
be modified with the Modify Metadata button, and removed with the
Remove Metadata
button.
The History tab displays the undo and
redo stack for the currently loaded dataset.
The history can be deleted using the Delete History button. Note that this operation
cannot be undone, and requires confirmation. It is recommended that the history be
deleted before datasets are
distributed.
The Validation tab allows for running validation checks on the dataset. Validation
is executed using the Validate button.
If validation succeeds, a success message is displayed.
If validation fails, the reasons for the failures are displayed.
The application supports importing directories filled with captioned images.
Importing can be accessed from the menu.
Any errors encountered during the import process are shown in the dialog.
The import process will recursively walk through a given directory hierarchy searching
for image files.
When an image file is discovered, the process will look for a caption file associated
with the image.
A caption file must have the file extension caption. For example, if the
process discovers an image file named example.png, the caption file associated
with it must be called example.caption.
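For example, a directory hierarchy such as the following (file names are purely illustrative) would result in two images being imported, each with the captions from its accompanying caption file:

photos/
  garden/
    rose0001.png
    rose0001.caption
  rose0002.png
  rose0002.caption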
The application supports exporting datasets to directories.
Exporting can be accessed from the menu. If the
Export Images
checkbox is checked, image files will be written to the output directory. For very
large datasets where captions
are being repeatedly exported during development, it can be useful to switch off image
exports in order to save
time.
Any errors encountered during the export process are shown in the dialog.
The laurel package provides a command-line interface for performing tasks such as
importing and exporting datasets. The base
laurel
command is broken into a number of subcommands which are documented over the following
sections.
All subcommands accept a --verbose parameter that may be set to one of
trace, debug, info,
warn, or error. This parameter sets the lower bound for
the severity of messages that will be logged. For example, at debug verbosity, only
messages of severity debug and above will be logged. Setting the verbosity to
trace
level effectively causes everything to be logged, and will produce large volumes of
debugging output.
The
laurel command-line tool uses
quarrel
to parse command-line arguments, and therefore supports placing command-line arguments
into a file, one argument
per line, and then referencing that file with
@. For example:
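$ cat arguments.txt
export
--verbose
debug

$ laurel @arguments.txt

Note that the export command will typically require further parameters; the above merely illustrates the @ syntax, using the --verbose parameter that all subcommands accept.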
All subcommands, unless otherwise specified, yield an exit code of 0 on success, and
a non-zero exit code on failure.
import
- Import a directory into a dataset.
The import command imports a directory of captioned images into a dataset.
export
- Export a dataset into a directory.
The export command exports a dataset into a directory.
A caption is a string that can be applied to an image to describe some element of that
image.
Captions must conform to the following format:
A caption file is a file consisting of a comma-separated list of captions. More
formally, the file conforms to the following format:
An example caption file is as follows:
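rose bush,
grass,
sky,
afternoon lighting,
outdoors,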
Note that the trailing comma on the last line is optional. All whitespace around commas
is ignored.
Categories
allow for grouping captions in a manner that allows the application to assist with
keeping image
captioning consistent.
When adding captions to images for use in training models such as
LoRAs, it is important to keep captions
consistent.
Consistent
in this case means to avoid
false positive
and
false negative
captions. To understand what these terms mean and why this is important, it is necessary
to understand how image
training processes typically work.
Let m be an existing text-to-image model that we're attempting to fine-tune. Let
generate(k, p)
be a function that, given a model
k
and a text prompt p, generates an image. For example, if the model
m
knows about the concept of
laurel trees, then we'd hope that
generate(m, "laurel tree")
would produce a picture of a laurel tree.
Let's assume that m has not been trained on pictures of rose bushes and doesn't
know what a rose bush is. If we evaluate generate(m, "rose bush"), then we'll
just get arbitrary images that likely don't contain rose bushes. We want to fine-tune
m
by producing a LoRA that introduces the concept of rose bushes. We produce a large
dataset of images of rose
bushes, and caption each image with (at the very least) the caption
rose bush.
The training process then steps through each image i in the dataset and performs, roughly, the following steps: the captions assigned to i are combined into a text prompt p; an image g = generate(m, p) is produced; the difference between g and i is measured; and the model is adjusted to reduce that difference.
In our training process, assuming that we've properly captioned the images in our
dataset, we would hope that
the only significant difference between g and
i
at each step would be that i would contain an image of a rose bush, and
g
would not. This would, slowly, cause the fine-tuning of the model to learn what constitutes
a rose bush.
Stepping through the entire dataset once and performing the above steps for each image
is known as a single
training epoch. It will take most training processes multiple
epochs
to actually learn anything significant. In practice, the model m can conceptually
be considered to be updated on each training step with the new information it has
learned. For the sake of
simplicity of discussion, we ignore this aspect of training here.
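In pseudocode, a single training epoch as conceptualized above might look like the following (a deliberately simplified sketch; real training processes involve considerably more machinery):

for each image i in the dataset:
  p ← the captions assigned to i
  g ← generate(m, p)
  d ← difference(g, i)
  adjust m to reduce d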
A false positive caption is a caption that's accidentally applied to an image when that
image does not contain the object being captioned. For example, if an image does not
contain a red sofa, and a caption "red sofa" is provided, then the
"red sofa"
caption is a false positive.
To understand why a false positive caption is a problem, consider the training process described above. Assume that our original model m knows about the concept of "red sofas". If an image i is captioned "red sofa" but contains no red sofa, then the generated image g will tend to contain a red sofa whilst i will not. The training process will attempt to reconcile this difference, effectively teaching the model that whatever i actually contains is a "red sofa", degrading the concept the model already had.
Similarly, a false negative caption is a caption that's accidentally
not
applied to an image when it really
should
have been. To understand how this might affect training, consider the training process once again: if an image i contains grass but no "grass" caption is applied, then the generated image g will tend to lack grass. The difference between g and i will then include the grass itself, and the training process will wrongly fold that grass into the concepts that are captioned, such as "rose bush".
In practice,
false negative captions happen much more frequently than
false positive
captions. The reason for this is that it is impractical to know all of the concepts
known to the model being
trained, and therefore it's impractical to know which concepts the model can tell
are
missing
from the images it inspects.
Given the above understanding of
false positive
and
false negative
captions, the following best practices can be inferred for captioning datasets:

* Always caption the primary concept being trained, in every image.
* Caption every other significant object and property present in an image, in order to avoid false negative captions.
* Never caption an object or property that is not actually present in an image, in order to avoid false positive captions.
In our example
training process above, we should use
"rose bush"
as the primary caption for each of our images, and we should caption the objects in
each image that are not rose
bushes (for example,
"grass",
"soil",
"sky",
"afternoon lighting",
"outdoors", etc.)
When a category is marked as required, then each image in the dataset
must
contain one or more captions from that category.
Unlike captions, which can share their meanings across different datasets, categories are
a tool used to help ensure consistent captioning within a single dataset. It is up
to users to pick suitable
categories for their captions in order to ensure that they caption their images in
a consistent manner. A useful
category for most datasets, for example, is
"lighting". Assign captions such as
"dramatic lighting",
"outdoor lighting", and so on,
to a required
"lighting" category. The
validation
process will then fail if a user has forgotten to caption lighting in one or more
images.
Categories must conform to the following format:
An
image is a rectangular array of pixels. The application does not do any special
processing of images beyond storing them in the dataset. Images are, in practice,
expected to be in one of the
various popular image formats such as
PNG.
Each image in the dataset may have zero or more
captions
assigned.
The application stores the complete, persistent history of every change ever made
to the dataset.
The undo and redo stacks are stored in the
file model.
Each command that is executed on the file model is invertible. That is, each command
knows how to perform an action, and how to revert that action. By storing the complete
sequence of executed
commands, it is effectively possible to take a dataset and repeatedly undo operations
until the dataset is back at
the blank starting state.
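Conceptually, each command can be thought of as implementing an interface such as the following (a sketch only; the actual interface used by the application is not specified here):

interface Command is
  execute(model)  # Perform the action, recording the data needed to revert it.
  undo(model)     # Revert the action using the recorded data.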
Metadata in the application is a simple string key/value store. Keys are unique.
It is recommended that creators annotate datasets with the standard Dublin Core metadata terms. The core element set includes, among others:

dc:title       - A name given to the dataset.
dc:creator     - The entity primarily responsible for making the dataset.
dc:subject     - The topic of the dataset.
dc:description - An account of the dataset.
dc:publisher   - The entity responsible for making the dataset available.
dc:date        - A date associated with an event in the lifecycle of the dataset.
dc:rights      - Information about rights held in and over the dataset.
dc:identifier  - An unambiguous reference to the dataset within a given context.
dc:language    - The language of the dataset.
The application stores the dataset in a structure known as the file model.
The database uses the following schema:
The schema_version table's single row MUST
contain com.io7m.laurel in the
version_application_id column.
Limitations in SQLite mean that it is, unfortunately, impractical to enforce invariants
such as
category and
caption formats at the database level.
When an undoable command is successfully executed on the file model, the parameters of the original command, and the data that was modified, are stored in the undo table. When a command is undone, that same data is moved to the redo table.
The data and parameters are serialized to
Java Properties
format, but the precise names and types of the keys are currently unspecified. This
means that, although
applications other than
Laurel can open and manipulate datasets, they will currently
need to do some mild reverse engineering to manipulate the history.
The validation process checks a number of properties of the underlying
file model.
The validation process checks to see if the category requirements are satisfied for
all images in the dataset. In
pseudocode, the process is:
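for each image i in the dataset:
  for each required category c:
    if no caption in c is assigned to i:
      validation fails for (i, c)
if no failures were recorded:
  validation succeeds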
Informally, for each image i, for each required category
c, validation succeeds if at least one caption in
c
is assigned to i.
Copyright © 2024 Mark Raynsford <code@io7m.com> https://www.io7m.com.
This book is placed into the public domain for free use by anyone for any purpose.
It may be freely used, modified,
and distributed.
In jurisdictions that do not recognise the public domain this book may be freely used,
modified, and distributed
without restriction.
This book comes with absolutely no warranty.