Friday, February 1, 2013

MAP inference in Large Factor Graphs with Reinforcement Learning

Khashayar Rohanimanesh, Michael Wick, Sameer Singh, and Andrew
McCallum

I find this paper extremely interesting. Below, I try to summarize/extract the most
interesting ideas from my personal point of view.

Before going into this paper, a few words on factor graphs. They subsume Bayesian networks, and represent the decomposition of a global function as a product of factors, each factor being a function of some of the variables. Both variables and factors appear in the graph, as two types of nodes. An edge between a variable and a factor encodes the fact that this factor depends on this variable.
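
To make the decomposition concrete, here is a minimal toy sketch (my own example, not from the paper): three variables, two factors, and a global unnormalized score defined as the product of the factors.

```python
# Toy factor graph over variables y1, y2, y3: the global (unnormalized)
# score decomposes into a product of local factors, each touching only
# a subset of the variables.

def f1(y1, y2):          # factor linking y1 and y2: rewards agreement
    return 2.0 if y1 == y2 else 0.5

def f2(y2, y3):          # factor linking y2 and y3: rewards disagreement
    return 3.0 if y2 != y3 else 1.0

def score(y1, y2, y3):
    # global score = product of all factors
    return f1(y1, y2) * f2(y2, y3)

print(score(0, 0, 1))    # 2.0 * 3.0 = 6.0
```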

Decoding = MAP inference = finding the best Y is known to be intractable for large graphs.

First approach to solve this: SEARN/LaSO = learn a score function to predict whether a partial configuration leads to a good output configuration Y. It relies on a 0/1 loss that does not benefit from imperfect configurations.

Second approach: SampleRank = learn to rank pairs of complete configurations; it may get stuck in local maxima, somewhat similar to "slow mixing"; there is a nice example illustrating this. The problem comes from the posterior distribution, which behaves nicely in a neighbourhood of the true optimum, but badly everywhere else.

Reinforcement Learning may be helpful, because of delayed reward: it is better to look at the reward that can be achieved in the end, rather than at the shape of the posterior at the current location.


A state = an assignment of values to Y = very large state space, handled with linear function approximation

An action = change some Y, as allowed by a proposer (e.g. split/merge)

Reward = local improvement; here = F1(s')-F1(s)
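
A hedged sketch of this shaped reward (my own toy version, representing a state as a set of predicted matches): the reward for a transition is simply the change in F1.

```python
# Toy illustration of the reward r(s, s') = F1(s') - F1(s), with states
# represented as sets of predicted matches against a gold set.

def f1_score(pred, gold):
    # F1 of a set of predicted matches against the gold matches
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def reward(state, next_state, gold):
    # positive when the action improved F1, negative when it hurt it
    return f1_score(next_state, gold) - f1_score(state, gold)
```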

Reminder: SampleRank

The paper instantiates all this to train a CRF = a factor graph where each factor = exp(weights * features)

Training a CRF involves summing over all possible Y to compute the gradient.
One solution = sampling Y for each factor independently
Another solution = SampleRank:
- Each MCMC step produces Y' from Y
- Y' is accepted or not (as with Metropolis-Hastings)
- The CRF parameters are updated based on F1(Y')-F1(Y) and on the feature difference between Y' and Y
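
A hedged sketch of one SampleRank step as I read the three bullets above; `propose()`, `features()`, and `f1()` are illustrative placeholders, and features are dicts mapping feature names to values.

```python
import math
import random

def dot(w, feats):
    # linear model score under weights w
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def add_scaled(w, feats, scale):
    # in-place update w += scale * feats
    for k, v in feats.items():
        w[k] = w.get(k, 0.0) + scale * v

def samplerank_step(w, y, propose, features, f1, lr=1.0):
    y2 = propose(y)                                  # one MCMC step: Y' from Y
    s, s2 = dot(w, features(y)), dot(w, features(y2))
    # perceptron-style update on the feature difference between Y' and Y,
    # triggered when the model ranking disagrees with the F1 ranking
    if f1(y2) > f1(y) and s2 <= s:
        add_scaled(w, features(y2), lr)
        add_scaled(w, features(y), -lr)
    elif f1(y2) < f1(y) and s2 >= s:
        add_scaled(w, features(y), lr)
        add_scaled(w, features(y2), -lr)
    # Metropolis-Hastings-style acceptance on the (updated) model scores
    ratio = math.exp(min(dot(w, features(y2)) - dot(w, features(y)), 50.0))
    return y2 if random.random() < ratio else y
```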

Q-Learning

Q-Learning only works for very small state spaces; otherwise, each (state, action) pair must be
represented by a feature vector, with Q(s,a) = a linear function of these features.
The updates are then made to the weights of this linear function instead of to Q(s,a) directly.
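
This is the standard Q-learning update with linear function approximation; a minimal sketch, where `phi_sa` stands for the feature vector of the (state, action) pair (an illustrative name, not from the paper).

```python
# Q(s, a) = w . phi(s, a); the TD error updates the weights of the
# linear function rather than a table entry Q(s, a).

def q_value(w, phi):
    return sum(wi * xi for wi, xi in zip(w, phi))

def q_update(w, phi_sa, reward, best_next_q, alpha=0.1, gamma=0.9):
    # TD error: r + gamma * max_a' Q(s', a') - Q(s, a)
    td = reward + gamma * best_next_q - q_value(w, phi_sa)
    # gradient step on the linear weights
    return [wi + alpha * td * xi for wi, xi in zip(w, phi_sa)]
```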

Issue of large state space

This requires using generalization techniques from RL:
- using a proposer to limit the number of possible actions from a state
- using a reward that is informative at every transition: here, the gain in F1
The reward function can be approximated by a function learned at training time. Alternatively, at test time, the next action can be chosen greedily with respect to a fixed value function.
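
The greedy test-time option can be sketched in a couple of lines (illustrative names: `proposer` enumerates the candidate actions, `q` is the frozen, learned value function):

```python
# Greedy test-time inference: among the actions allowed by the proposer,
# pick the one with the highest value under the fixed Q function.

def greedy_step(state, proposer, q):
    return max(proposer(state), key=lambda a: q(state, a))
```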

Application: clustering for ontology matching

Goal = match concepts in ontology A to concepts in ontology B
X = one concept
factor = set of concepts (there is an exponential number of factors)
Y = binary = all concepts in set match or not

Features = per pair of concepts = TF-IDF cosine between concepts/instances, substring match, proximity in the tree, whether parents/children/siblings are mapped...
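
As an illustration, here are toy versions of two of these pairwise features (the paper uses TF-IDF; I use plain word counts here to keep the sketch short):

```python
import math
from collections import Counter

def substring_match(a, b):
    # 1.0 if one concept name is a substring of the other, else 0.0
    a, b = a.lower(), b.lower()
    return 1.0 if a in b or b in a else 0.0

def cosine(a, b):
    # bag-of-words cosine similarity between two concept descriptions
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(ca[t] * cb[t] for t in ca)
    den = (math.sqrt(sum(v * v for v in ca.values()))
           * math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0
```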

Baseline1 = greedy agglomerative = MaxEnt models trained piecewise, clusters merged greedily

Baseline2 = SampleRank with MH: CRF training with SampleRank. For testing, MH
that modifies a single variable (isn't that a kind of Gibbs sampling?) is used for inference.

Baseline3 = GLUE system

F1 scores: 94.3 (RL), 92.0 (SampleRank), 89.9 (greedy agglomerative), 80 (GLUE)

RL is more robust to local optima than SampleRank-MH

Personal comments

  • Very interesting paper. But the real question is whether RL search is really better in practice than MCMC or variational approximations.
  • In practice, the features used to define the states, and hence the Y, must be highly correlated with the way the reward is computed (?). Otherwise, learning may not happen. Shouldn't we use the same features as the ones used to compute the reward? Of course, this is not possible when the reward depends on the true labels, as in this paper; but at test time the reward is also estimated without access to the true labels, so it becomes possible then. But if we use the same features as in the reward, doesn't this become equivalent to classical Monte Carlo optimization?