TCS Alignment Toolbox Version 3.1.1

Copyright (C) 2016-2018
Benjamin Paaßen
AG Theoretical Computer Science
Centre of Excellence Cognitive Interaction Technology (CITEC)
University of Bielefeld

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

Introduction

This Java Toolbox provides several edit distance algorithms between two data structures, be it lists, sets, or trees, which may contain arbitrary kinds of nodes. The toolbox also provides options to parametrize the edit distance, to compute optimal solutions via backtracing, and to compute gradients of the edit distance with respect to metric parameters. The toolbox is written in Java 1.7 and is compatible with MATLAB (R).

If you use this toolbox in academic work, please cite:

  • Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632-632). International Educational Datamining Society. (Link)

To use this software is recommended if

  • you want to use an alignment distance as basis for some relational classifier on structured data, e.g.: k-nearest-neighbour-classification of biological sequence data or syntax trees. For those cases we also provide gradients to use a learning scheme for classifier optimization.
  • you want to use edit distances on structured data <em>and</em> want to optimize your metric parameters using gradient-based techniques. For that we provide the gradient on the edit distances.
  • you want to use an edit distance in some other way, but your data is too complex for other tools, e.g. you are using multi-modal sequences, where each node of the sequence has several features.
  • you want to use a non-standard function to define the distance between elements. Our architecture supports the implementation of custom comparators that can be easily plugged in without any changes in our code base.
  • you want to compute a non-standard edit distance. Our architecture also supports the abstract definition of edit distances and provides the efficient implementation automatically using Algebraic Dynamic Programming.

This toolbox is not the right tool for you if you want to do standard bulk computation of dynamic time warping on one-dimensional double sequences or simple string edit distances. In that case there are faster tools out there.

Installation

The TCS Alignment Toolbox can be directly used in your Matlab/Java-program as a .jar file or via maven.

Binary Distribution (.jar file)

The binary distribution of the full TCS Alignment Toolbox can be found here.

You can use it in Matlab using the command

javaaddpath alignment-3.1.1-full.jar;

Maven modules

The TCS Alignment Toolbox is available on maven central. For most standard cases it should suffice to declare the following dependencies:

<dependencies>
    <dependency>
        <groupId>de.cit-ec.tcs.alignment</groupId>
        <artifactId>comparators-lib</artifactId>
        <version>3.1.1</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>de.cit-ec.tcs.alignment</groupId>
        <artifactId>algorithms-lib</artifactId>
        <version>3.1.1</version>
        <scope>compile</scope>
    </dependency>
</dependencies>

The different modules are discussed in more detail in section Modules.

Source Code

If you want to inspect the source code or compile the project yourself, you can either download the sources from maven, or check out the GIT repository using the command

git clone https://openresearch.cit-ec.de/git/tcs.tcs-alignment.git

You can also inspect the GIT repository in your browser: Link.

Quickstart Guide

If you want to use the Toolbox as fast as possible it is recommended to use one of the code examples in the 'examples' directory and develop your application code from there on.

You can get a copy of the examples by cloning the GIT repository as described above.

If you want to apply the Toolbox in an educational datamining context, we suggest to have a look at our demo with graphical user interface as presented at the International Educational Datamining Conference.

Features

The idea behind the toolbox is to enable alignments of arbitrary input sequences as well as metric learning on these alignments. We approach this problem by a clean software architecture. As input for alignments we use the Java List interface. As elements of the lists you can use arbitrary objects, as long as you provide a Comparator that works on such objects. A Comparator computes a distance between elements, which an AlignmentAlgorithm can then use to compute a distance on sequences of such elements.

For convenience, we provide several standard implementations of the Comparator interface (e.g. Euclidean distance, Manhattan-Distance, Normalized Compression Distance and Character Distances) as well as the AlignmentAlgorithm interface (edit distance and dynamic time warping), which can be found in the comparators-lib and the algorithms-lib package respectively.

The toolbox supports parallel computation of many alignments via the ParallelProcessingEngine and the SquareParallelProcessingEngine.

To enable a detailed inspection of alignments we provide the option to not only calculate the alignment distance but to also a backtrace of the alignment operations that have been used to arrive at the alignment distance. Detailed HTML visualizations of such alignments can be generated as well using the visualization module.

Metric learning is possible using gradient descent schemes on the alignment distance. This kind of metric learning has been applied successfully in several publications.

The toolbox is build upon the powerful theoretical framework of Algebraic Dynamic Programming which provides a generic approach to compute a wide variety of alignment schemes efficiently. Our implementation of ADP enables the definition of custom alignment schemes as well (refer to the adp module).

Usage

In order to use the Toolbox, you'll need to execute (some of) the following steps:

  1. Generate List objects of your input data. The primitives module provides some convenience functions in case you are dealing with primitive data types.
  2. Create a Comparator object that works on elements of your sequences. Some standard implementations are provided by the comparators-lib module.
  3. Create an AlignmentAlgorithm object of your choice and provide the Comparator in the constructor. Some standard implementations are provided by the algorithms-lib module.
  4. Optionally initialize a SquareParallelProcessingEngine with your sequences and your algorithm to use multi-thread computation and call setFull().
  5. Calculate the alignment results for your input.
  6. Optionally calculate derivatives of your alignment results.
  7. Optionally visualize your alignment as HTML.

Some code examples for Java and Matlab are listed in the 'examples' directory.

Algorithms

AlignmentAlgorithm implementations may differ in the underlying algorithm as well as the return value. In the algorithms-lib module we provide standard implementations for the classic edit distance as well as dynamic time warping. For each of those we provide a Score version which just returns the numeric score and a Full version which returns an Alignment object containing the actual Operation objects that correspond to the alignment distance. Therefore, if you want to compute the optimum Alignment according to a dynamic time warping distance, you should use the StrictDTWFullAlgorithm of the algorithms-lib module.

Oftentimes not only one Alignment corresponds to the alignment distance, but there are multiple co-optimal Alignments. Further, there might be Alignments which are not optimal, but near misses. This is captured by other data structures in the algorithms module. Co-optimal alignments can be stored in an AlignmentList, as it is returned by the StrictAlignmentAllOptimalAlgorithm. A model that entails all co-optimal alignments in a compressed fashion is the CooptimalModel. If sub-optimal alignments should be considered as well, they can be stored in a AlignmentMap, which maps scores to Alignments.

An entirely different view on co-optimal and suboptimal alignments is offered by the Soft alignment scheme, where we do not take the strict minimum of the possible distance but a soft minimum. his is a key feature to achieve robust metric learning results and is described in more detail in our papers (see Background)

To increase maintainability we do not provide standard implementations for each of these variations in the algorithms-lib module. Rather, we provide an abstract version in the adp module, which then can be applied to any algorithm of your liking.

In general, if you want to construct your own AlignmentAlgorithm implementations we recommend the adp module, which is built on Algebraic Dynamic Programming. This is explained in the adp example in the 'examples' directory and in more detail in the paper Adaptive structure metrics for automated feedback provision in intelligent tutoring systems.

Runtime

It is a general result in algebraic dynamic programming that alignment algorithms have the same asymptotic runtime, namely O(m ⋅ n) where m is the length of the first and n is the length of the second input sequence. However, constant factors differ depending on the implementation. The most important rules of thumb are:

  • Dynamic Time Warping (DTW) is faster than other alignment schemes,
  • parallel processing is much faster than iterative processing,
  • hand-tailored implementations are much faster than ADP generated ones.

Also note that not only the choice of algorithm influences the runtime but also the runtime of your Comparator.

The following figures show the runtime for one alignment in seconds versus the sequence length for some example algorithms. The sequences were random strings generated over the alphabet {a, b, c, d}. More details on the evaluation can be found in the performance evaluation script in the 'performance' directory.

Runtime versus sequence length for global alignment algorithms. Only one alignment was calculated 20 times and the median runtime is displayed here.

Runtime versus sequence length for DTW algorithms. Only one alignment was calculated 20 times and the median runtime is displayed here.

Runtime versus sequence length for affine alignment algorithms. Only one alignment was calculated 20 times and the median runtime is displayed here.

The following figure shows (for the case of the global alignment strict score algorithm) how much runtime can be saved by using parallel processing instead of iterative processing. All pairwise alignments where calculated for an ascending number of sequences. The sequence length was fixed to 50.

Runtime versus number of sequences for iterative and parallel processing. All pairwise alignments where calculated. The sequence length was 50.

The following figure show the runtime to compute the gradient of one alignment versus the sequence length. Asymptotically, gradient calculation over a strict alignment has linear complexity (as an alignment can have at most O(m + n) operations), while gradient calculation over a soft alignment has quadratic complexity (or O(m ⋅ n)). However, the gradient tends to vanish far away from the most likely alignment path, such that we obtain favourable constant factors.

Runtime versus sequence length for global strict alignment and global soft alignment. Only one alignment was calculated 20 times and the median runtime is displayed here.

Further Documentation

Further documentation on the TCS Alignment Toolbox can be found in

Background

This Toolbox has been developed in 2013-2016 in the Theoretical Computer Science for Cognitive Systems Workgroup at the Centre of Excellence 'Cognitive Interaction Technology' (CITEC), Bielefeld University. Development is part of the research project 'Learning Dynamic Feedback for Intelligent Tutoring Systems' (DynaFIT), grant number HA 2719/6-2, which is part of the DFG priority program 'Autonomous Learning'.

If you use this toolbox in academic work, please cite:

  • Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632-632). International Educational Datamining Society. (Link)

We have published several articles using this toolbox, in particular for metric learning:

  • Paassen, B. (2015). Adaptive Affine Sequence Alignment Using Algebraic Dynamic Programming, Master Thesis, Bielefeld University (Preprint)
  • Mokbel, B., Paassen, B., & Hammer, B. (2014). Adaptive Distance Measures for Sequential Data In M. Verleysen (Ed.), ESANN, 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 265-270). Bruges, Belgium: i6doc.com. (Preprint)
  • Mokbel, B., Paassen B., & Hammer B. (2014). Efficient adaptation of structure metrics in prototype-based classification In: S. Wermter, C. Weber, W. Duch, T. Honkela, P. Koprinkova-Hristova, S. Magg, G. Palm, et al. (Eds.), Lecture Notes in Computer Science: Vol. 8681. Artificial Neural Networks and Machine Learning - ICANN 2014 - 24th International Conference on Artificial Neural Networks, Hamburg, Germany, September 15-19, 2014. Proceedings (pp. 571-578). Springer. (Preprint)
  • Mokbel, B., Paaßen, B., Schleif, F.-M., & Hammer, B. (2015). Metric learning for sequences in relational LVQ. In: Neurocomputing, 169, 306-322. (Preprint)
  • Paassen, B., Mokbel, B., & Hammer, B. (2015). Adaptive structure metrics for automated feedback provision in Java programming In: ESANN, 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 307-312). Bruges, Belgium: i6doc.com. (Preprint)
  • Paaßen, B., Mokbel, B., & Hammer, B. (2016). Adaptive structure metrics for automated feedback provision in intelligent tutoring systems. Neurocomputing, 192, 3-13. (Preprint)
  • Paaßenm B., Gallicchio C., Micheli A., & Hammer B. (2018). Tree Edit Distance Learning via Adaptive Symbol Embeddings. In: Dy J., Krause A. (eds.). Proceedings of the 35th International Conference on Machine Learning (ICML 2018). Proceedings of Machine Learning Research. Vol 80. (Link)
  • Paaßen, B. (2018). Revisiting the tree edit distance and its backtracing: A tutorial. arXiv: 1805.06869 [cs.DS]

Modules

The toolbox is structured into several maven modules with different purposes. For practical applications it is ususally sufficient to declare the following dependencies:

<dependencies>
    <dependency>
        <groupId>de.cit-ec.tcs.alignment</groupId>
        <artifactId>comparators-lib</artifactId>
        <version>3.1.1</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>de.cit-ec.tcs.alignment</groupId>
        <artifactId>algorithms-lib</artifactId>
        <version>3.1.1</version>
        <scope>compile</scope>
    </dependency>
</dependencies>

The full dependency structure between modules is visualized in this UML diagram:

A uml diagram visualizing the dependencies between the different modules of the TCS Alignment Toolbox.

In more detail, the modules are:

Core Modules

  • de.cit-ec.tcs.alignment:parallel contains a wrapper for the java standard library ThreadPool to enable parallel processing with reports on the computation progress. This is a very generic implementation and might be useful for applications outside the scope of this toolbox as well.
  • de.cit-ec.tcs.alignment:comparators defines the Comparator interface as well as related interfaces and some convenience functions to define the distance between sequence elements.
  • de.cit-ec.tcs.alignment:algorithms defines the AlignmentAlgorithm interface as well as helper classes such as data structures for alignment results and the ParallelProcessingEngine.

Library Modules

  • de.cit-ec.tcs.alignment:comparators-lib contains some standard implementations of the Comparator interface, such as the Euclidean distance for vectors, a simple character frequency distance on strings or a Kronecker-delta on characters.
  • de.cit-ec.tcs.alignment:algorithms-lib contains some standard implementations of the AlignmentAlgorithm interface, in particular edit distance/global alignment and dynamic time warping.
  • de.cit-ec.tcs.alignment:sets contains algorithms to align unordered sequences (that is, sets) using the Hungarian algorithms.
  • de.cit-ec.tcs.alignment:trees contains algorithms to align trees and forests using Zhang & Shasha's (1989) tree edit distance algorithm.

Utility Modules

  • de.cit-ec.tcs.alignment:primitives contains convenience functions to transfer primitive datastructures to lists of primitives.
  • de.cit-ec.tcs.alignment:adp contains an implementation of a generic dynamic programming algorithm for alignment schemes based on Algebraic Dynamic Programming. Using these capabilities you can comfortably define your own non-standard alignment schemes without having to worry about implementation details. More information on this particular topic is contained in the adp example in the 'examples' directory and in the paper Adaptive structure metrics for automated feedback provision in intelligent tutoring systems.
  • de.cit-ec.tcs.alignment:visualization contains helper classes to produce a HTML visualization of Alignment objects.
  • de.cit-ec.tcs.alignment:learning contains support for metric learning using the Large Margin Nearest Neighbor (LMNN) approach by Weinberger, Saul et al. The details of this method are described in the master's thesis Adaptive Affine Sequence Alignment Using Algebraic Dynamic Programming.

Sequence Datastructure Modules

The Sequence datastructure was mandatory in earlier versions of the TCS Alignment Toolbox and is built for multi-dimensional and multimodal sequences. It has been made fully compatible with the new toolbox version, but the use is not mandatory anymore and it is mainly kept to achieve backwards-compatibility.

  • de.cit-ec.tcs.alignment:sequences defines the Sequence data structure as well as the possible ValueTypes and KeywordSpecifications. In essence it contains everything you need to define input for the alignment toolbox.
  • de.cit-ec.tcs.alignment:sequence-comparators contains wrapper classes for Comparators on primitive data types to be applied on Node objects.
  • de.cit-ec.tcs.alignment:wrappers contains helper classes for some standard applications of alignment, in particular string edit distance and simple time series data. Using these convenience classes it gets easier to convert existing data to Sequence objects or to set up an AlignmentSpecification.
  • de.cit-ec.tcs.alignment:sequence-visualization contains implementations of the HTMLColumn interface that make it easier to create HTML visualizations of Alignment objects resulting from the Sequence datastructure.
  • de.cit-ec.tcs.alignment:csv contains two classes for exporting and importing Sequence objects to and from a human-readable CSV file. Special thanks go to Bassam Mokbel for this implementation.

Dependencies

As general utility library we employ lombok in all modules in accordance to the MIT License.

For the normalized compression distance (NCDComparator) in the comparators-lib module we use the fast LZ4 implementation by jpountz. Our use is in accordance with the Apache License, Version 2.

For the export and import of NodeSpecifications in the csv module, we employ the JSON reference implementation of Douglas Crockford according to the JSON License.

License

The TCSAlignmentToolbox is licensed under the AGPLv3. The full license text can be found here.

This documentation is licensed under the conditions of the Creative Commons Attribution Share-Alike International License 4.0.