Text Justification problem
I've heard of the text justification problem along time ago. But, I've never taken the time to actually try to solve it, until recently when I decided to look it up again. In this post, I'll show how to solve the problem.
Let's first describe the problem: Given a series of words and a page width (line length), left justify the words to fit them in the lines based on the page width. More formally: given S = { Wi } for i >= 0 and Pw >= 0 the page width we need to return: R = { {Wi} } i> 0 such that Sum(|Wi|) + Sum(Wi) <= Pw.
As usual, we'll turn to our inductive approach for solving the problem. Let's define Cost(i,j) - the cost of putting Wi ... Wj in a single line. The cost function is what we need to actually minimize. We'll define it more specifically later on, so for now, let's assume that we are given this cost function and that if Wi... Wj can't be put on the same line it will return infinity, otherwise, it will return a score such that 0 <= score < infinity.
Hypothesis: P(n < N) - We know how to get the minimum score for arranging words Wi ... Wn (inclusive) in multiple lines in a justified manner (e.g. with the minimal accumulative score).
Base case P(1) = Score(1,1).
How do we reduce P(N) to P(n < N)? well, let's start by fitting some K words in the last line. The score for those words is Score(WN - k + 1, WN). Once we have determined the cost of these words we are left with N - K < N words to fit in the rest of the lines. based on our hypothesis we know how to score them!
So the algorithm works as follows:
Initialization: P(-1) = INFINITY
1. For each 0 <= i <= N:
2. For each 0 <= j <= i:
3. P(i) = min(P(i),P(j - 1) + Score(j,i))
You may have already noticed that the code is arranged in a dynamic programming fashion. In fact, TeX uses this exact approach to justify text .The only thing left to be shown is how the Score(i,j) function works. The truth is, that the score function is very straightforward. It takes the difference between the page width and the sum of word length i...j (including spaces) and raises it to some power. We'll use the third power in our scoring function. Furthermore, we would like to be able to keep track of the words we put in each line. After all, the point of this exercise is to justify the text, and not just hand out scores. To do so, we'll use an additional data structure Parent(i) that will save back pointers to the previous line. As we determine the minimum j (from above) we also update Parent(i) = j - 1. this way we can retrieve the words to be arranged in each line (in reverse order). Now let's see some actual code:
Let's look on how the main function uses Justify to print a set of words in a justified manner:
As a final comment I will just note that you can take a slightly different (but equivalent) approach, and instead of filling the last line and calculate the score of the first lines, you first try to fill the first line and calculate the score for the rest of the line. I personally find dynamic programming to be easier to grasp when you look at it from end to beginning rather than the other way.