The University of Arizona

Inverse Parametric Optimization from Partial Examples

This library provides a way to solve the Inverse Parametric Optimization problem from Partial Examples.

Motivation

The motivation for this problem comes from pairwise protein sequence alignment in bioinformatics: you have two protein sequences A and B, and you want to align them so that identical or similar characters are placed in the same columns. An alignment is scored by assigning a score to each transformation (insertion, deletion, or substitution). There are 20 amino acids, so there are 210 substitution scores. You also have gap-open and gap-extension penalties. The total score of an alignment is the sum of the scores of its transformations. We define an optimal alignment of two sequences to be one whose total score is minimal.
Inverse Parametric Optimization is motivated by the following problem. Given a set of input protein sequences and output alignments (that we know to be correct), how do we find an alignment scoring scheme (that is, values for the substitution scores and the gap-open and gap-extension penalties) under which each given alignment is an optimal output for its input? This problem is called Inverse Alignment. Inverse Parametric Optimization is a way of solving the general problem.
Inverse Parametric Optimization from Partial Examples is also motivated by alignment. Often we have output alignments in which some regions are unspecified or unreliable. For the general case, we define a partial example to be an example where only part of the output is reliable.

Problem Description

Let $f$ be a scoring function with parameters $(p_1, p_2, \ldots, p_t)$, whose value on an output $O$ is $$f_p(O) = \sum\limits_{1 \leq i \leq t} p_i g_i(O) + g_0(O)$$ where each $g_i$ is a fixed function of the output.
The Forward Optimization Problem: Given input $(I, p)$, find legal output $O$ that minimizes $f_p(O)$.
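Since $f_p$ is linear in the parameters, its value follows directly from the feature values $g_i(O)$. The sketch below is illustrative only; the function name and vector representation are our own, not the library's API:

```cpp
#include <cstddef>
#include <vector>

// Sketch (not the library's API): evaluate the linear scoring function
//   f_p(O) = sum_{1 <= i <= t} p_i * g_i(O) + g_0(O)
// given the parameter vector p and the precomputed feature values
// g[i-1] = g_i(O), with g0 = g_0(O).
double evaluateScore(const std::vector<double>& p,
                     const std::vector<double>& g,
                     double g0)
{
    double score = g0;
    for (std::size_t i = 0; i < p.size(); ++i)
        score += p[i] * g[i];
    return score;
}
```

The Forward Optimization Problem then asks for the legal output whose feature values make this quantity smallest.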

The Inverse Optimization Problem from Complete Examples: The Inverse Optimization problem can be specified in terms of two criteria.

Inverse Optimization from Complete Examples under Absolute Error: Inverse Optimization under Absolute Error is the following problem. The input is a collection of complete outputs $O_i$ of inputs $I_i$ for $1 \leq i \leq k$. The output is the parameter vector $$\newcommand{\argmin}{\arg\!\min} x^* := \argmin_{x \in D} E_{abs}(x)$$ where $$E_{abs}(x) := \frac{1}{k} \sum_{1 \leq i \leq k} \big(f_x(O_i) - f_x^*(I_i)\big)$$ Here, for scoring function $f$ and parameter vector $p$, we write $f_p^*(I_i)$ for the score of an optimal output of input $I_i$ under $f_p$. The output vector $x^*$ minimizes the average absolute error in the scores of the examples under $f_x$.
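For a fixed parameter vector $x$, once the example scores $f_x(O_i)$ and optimal scores $f_x^*(I_i)$ are known, $E_{abs}(x)$ is just an average of score differences. A minimal sketch (the function name is ours, not the library's):

```cpp
#include <cstddef>
#include <vector>

// Sketch: average absolute error E_abs(x) for a fixed parameter vector x.
// exampleScores[i] = f_x(O_i) and optimalScores[i] = f_x*(I_i); since f_x
// is minimized over legal outputs, each difference is nonnegative.
double averageAbsoluteError(const std::vector<double>& exampleScores,
                            const std::vector<double>& optimalScores)
{
    double total = 0.0;
    for (std::size_t i = 0; i < exampleScores.size(); ++i)
        total += exampleScores[i] - optimalScores[i];
    return total / static_cast<double>(exampleScores.size());
}
```

The hard part of the inverse problem is that $f_x^*(I_i)$ itself depends on $x$, so this error cannot simply be evaluated pointwise; it must be minimized over all $x \in D$.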

Inverse Optimization from Complete Examples under the Discrepancy Criterion: Inverse Optimization under the Discrepancy Criterion is the following problem. The input is a collection of complete outputs $A_i$ of inputs $I_i$ for $1 \leq i \leq k$. The output is the parameter vector $$x^* := \argmin_{x \in D} D(x)$$ where $$D(x) := \frac{1}{k} \sum_{1 \leq i \leq k} \max_{B_i \,\text{of}\, I_i} \bigg\{ (1 - \alpha)\frac{f_x(A_i) - f_x(B_i)}{L(I_i)} + \alpha\, d(A_i, B_i) \bigg\}$$ Here the maximum is over all legal outputs $B_i$ of $I_i$; $L(I_i)$ is the length of input $I_i$; $d(A_i, B_i)$ is the error in recovering example $A_i$ by output $B_i$; and $\alpha$ is a constant that controls the trade-off between score error and recovery error.
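For a fixed $x$ and a single example $A_i$, the inner maximization in $D(x)$ can be sketched as below, assuming the candidate outputs and their scores and recovery errors have been listed explicitly. In practice the legal outputs are far too numerous to enumerate, so this loop (with names of our own choosing) is for intuition only:

```cpp
#include <cstddef>
#include <vector>

// Sketch: the per-example term of D(x), maximized over an explicit list
// of candidate outputs B.  scoreA = f_x(A_i); scoresB[j] = f_x(B_j);
// recovery[j] = d(A_i, B_j); length = L(I_i); alpha in [0,1] weights
// score error against recovery error.
double exampleDiscrepancy(double alpha, double length, double scoreA,
                          const std::vector<double>& scoresB,
                          const std::vector<double>& recovery)
{
    double best = (1.0 - alpha) * (scoreA - scoresB[0]) / length
                  + alpha * recovery[0];
    for (std::size_t j = 1; j < scoresB.size(); ++j) {
        double term = (1.0 - alpha) * (scoreA - scoresB[j]) / length
                      + alpha * recovery[j];
        if (term > best)
            best = term;
    }
    return best;
}
```

Note that the candidate $B_j$ achieving the maximum balances two pressures: scoring much better than the example $A_i$, and differing greatly from it.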

Before we can describe the Inverse Optimization problem from Partial Examples, we need to define what a partial example and a completion are. A partial example is an output $O_i$ of input $I_i$ where each element of the output is labeled as either reliable or unreliable. A completion of a partial example $O_i$ is a complete output $\overline{O_i}$ that agrees with $O_i$ on all of its reliable elements.
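In code, a partial example can be represented as an output paired with per-element reliability flags; a completion is then any complete output that preserves the reliable elements. The types and names below are our own illustration, not the library's data structures:

```cpp
#include <cstddef>
#include <vector>

// Illustrative representation (not the library's types): a partial example
// is an output whose elements each carry a reliable/unreliable flag.
struct PartialExample {
    std::vector<int>  elements;  // output elements, e.g. alignment columns
    std::vector<bool> reliable;  // reliable[i] is true if elements[i] is trusted
};

// A candidate complete output is a completion of the partial example if it
// agrees with every reliable element; unreliable ones may be filled freely.
bool isCompletion(const PartialExample& partial,
                  const std::vector<int>& complete)
{
    if (complete.size() != partial.elements.size())
        return false;
    for (std::size_t i = 0; i < partial.elements.size(); ++i)
        if (partial.reliable[i] && complete[i] != partial.elements[i])
            return false;
    return true;
}
```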

Inverse Optimization Problem from Partial Examples: Inverse Optimization from Partial Examples is the following problem. The input is a set of partial outputs $O_i$ of inputs $I_i$ for $1 \leq i \leq k$. The output is the parameter vector $$x^* := \argmin_{x \in D} \; \min_{\overline{O_1}, \ldots, \overline{O_k}} E(x)$$ where the error function $E$ is either $E_{abs}(x)$ or $D(x)$, evaluated on the completions $\overline{O_1}, \ldots, \overline{O_k}$ in place of the examples, and each $\overline{O_i}$ ranges over all possible completions of $O_i$.

Distribution

The source code is in C++, and requires the IBM ILOG CPLEX Optimizer to run. It is distributed as Unix tar and zip files. A Makefile and a detailed example using alignment are provided.

Version: 0.1
Source Code: Zip file here.
Documentation: A brief description here.
More information: A detailed description of how the library works.

For questions regarding the software, please contact inverseopt@cs.arizona.edu.

People

Faculty

Dr. John Kececioglu

Students

Lisa Peairs
Sabrina Nusrat
Shloka Desai


Papers

Learning models for aligning protein sequences with predicted secondary structure

Eagu Kim, Travis Wheeler and John Kececioglu,
Proceedings of the 13th Conference on Research in Computational Molecular Biology (RECOMB), Springer-Verlag Lecture Notes in Bioinformatics 5541, 512-531, 2009.

Learning scoring schemes for sequence alignment from partial examples

Eagu Kim and John Kececioglu,
IEEE/ACM Transactions on Computational Biology and Bioinformatics 5:4, 546-556, 2008.

Inverse sequence alignment from partial examples

Eagu Kim and John Kececioglu,
Proceedings of the 7th EATCS/ISCB Workshop on Algorithms in Bioinformatics (WABI), Springer-Verlag Lecture Notes in Bioinformatics 4645, 359-370, 2007.
Software implementing the algorithm in this paper is available at inversealign.cs.arizona.edu.

Simple and fast inverse alignment

John Kececioglu and Eagu Kim,
Proceedings of the 10th ACM Conference on Research in Computational Molecular Biology (RECOMB), 441-455, 2006.

Date last updated: 5/26/2015