Notes:Diff algorithm

From Maths
Jump to: navigation, search

Overview

My solution to comparing "things" is to form the problem as a classic "shortest path" combinatorial optimisation problem, the problem is formulated as follows:

  • Given a finite string formed of symbols from an infinite or finite alphabet, [ilmath]A[/ilmath] and
  • another finite string of the same or different length, from the same alphabet, [ilmath]B[/ilmath]

We wish to construct a finite sequence of instructions, [ilmath]I:=i_1,i_2,i_3,\ldots,i_n[/ilmath] such that:

  • Given [ilmath]A[/ilmath] and [ilmath]I[/ilmath] we can construct [ilmath]B[/ilmath].

To do this we will start with a marker initially at the first symbol of string [ilmath]A[/ilmath] and instructions and an empty output string which we construct symbol by symbol. The instructions may be:

  1. >[ilmath]s[/ilmath] - Insert - append [ilmath]s[/ilmath] to the output string, do not move the marker forward
  2. < - Delete - deletes from the input string [ilmath]A[/ilmath] by incrementing the marker and not writing any output
  3. = - Accept - copy the symbol at the marked position of [ilmath]A[/ilmath] to the output and move the marker forward by [ilmath]1[/ilmath]
(Assuming here that >, < and = are not symbols that occur in the alphabet)

We will use the shorthands [ilmath]n[/ilmath]= for [ilmath]n[/ilmath] consecutive accepts, so 5= means ===== and [ilmath]n[/ilmath]< accordingly.

Example

Suppose we want to turn the erroneous phrase:

  • Alex's Algoritm into the correct phrase:
  • Alec's Algorithm

a "sensible" "diff" might be:

  • 3=>c<10=>h=

I use the word sensible because the diff:

  • 15<>A>l>e>c>'>s> >A>l>g>o>r>i>t>h>m is a valid diff! It just deletes everything and then enters the result letter by letter so isn't a very "good" diff.

Applying the diff

The input string is Alex's Algoritm and the diff is 3=>c<10=>h=:

  1. Alex's Algoritm - - Initially the marker is for the first character of input [ilmath]A[/ilmath] and the output is empty.
  2. Alex's Algoritm - Ale - Instruction: 3= - accept 3 characters from the input, moving the marker forward each time
  3. Alex's Algoritm - Alec - Instruction: >c - insert the letter c, do not alter the marker
  4. Alex's Algoritm - Alec - Instruction: > - delete, which means move the marker forward WITHOUT writing any output.
  5. Alex's Algoritm - Alec's Algorit - Instruction: 10= - accept 10 times from the marker (so the ' onwards 10 times)
  6. Alex's Algoritm - Alec's Algorith - Instruction: >h - insert h
  7. Alex's Algoritm - Alec's Algorithm - Instruction: = - accept once (the marker is now in the "terminal" state)

Here I present a family of algorithms that can be used to find the "best" diff.

Edit/Diff graph example

At any time where the marker isn't terminal we can delete or insert, we can never go backwards. We can also sometimes accept. We encode this information in an "edit graph" or "diff graph", in this graph we start at the top left and want to get to the bottom right, any path corresponds uniquely to a diff. Interpreted as follows:

  1. [ilmath]\rightarrow[/ilmath] - delete.
  2. [ilmath]\downarrow[/ilmath] - insert - insert the symbol for the row the head (the pointy side) of the arrow is at.
  3. [ilmath]\searrow[/ilmath] - accept - we can only do this when the input string at the current position agrees with the output.

Weight the edges as follows:

  1. [ilmath]\rightarrow[/ilmath] - [ilmath]1[/ilmath]
  2. [ilmath]\downarrow[/ilmath] - [ilmath]1[/ilmath] - inserting and deleting have the same weight (so there is no preference)
  3. [ilmath]\searrow[/ilmath] - [ilmath]0[/ilmath] - accepts (going diagonally) are "free"

A "chain" in the diffgraph is 1 or more consecutive accepts, the diffgraph below's biggest chain is a [ilmath]4[/ilmath]-chain.

Example

I shall show a diffgraph for the strings:

  1. abcabbac and
  2. abbab

Notice that I constructed this by taking abba, putting abc at the start, then putting a differing character at the end. As such we'd expect there to be a diff that inserts abc, accepts abba, deletes c and inserts b.

[ilmath] \begin{xy} \xymatrix{ \ & 0 & \mathrm{a} & \mathrm{b} & \mathrm{c} & \mathrm{a} & \mathrm{b} & \mathrm{b} & \mathrm{a} & \mathrm{c} \\ 0 & \text{S} \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d] \\ \mathrm{a} & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d]\\ \mathrm{b} & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d]\\ \mathrm{b} & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d]\\ \mathrm{a} & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d]\\ \mathrm{b} & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \times & } % the chains \ar@{->} "2,2";"3,3" \ar@{->} "3,3";"4,4" \ar@{->} "4,3";"5,4" \ar@{->} "5,2";"6,3" \ar@{->} "6,3";"7,4" % first 2 cols done \ar@{->} "5,5";"6,6" \ar@{->} "6,6";"7,7" \ar@{->} "6,7";"7,8" % bottom 2-chain and 1 chain done \ar@{->} "2,5";"3,6" \ar@{->} "3,6";"4,7" \ar@{->} "4,7";"5,8" \ar@{->} "5,8";"6,9" % 4 chain done! \ar@{->} "4,6";"5,7" \ar@{->} "2,8";"3,9" \ar@{->} "3,7";"4,8" % and the 3 that go / through the 4 chain (the bottom right already being a part of a 2 chain \end{xy} [/ilmath]
Example diffgraph between the strings abcabbac and abbab

Notice any path produces a valid (although not necessarily good) diff, for example following every [ilmath]\rightarrow[/ilmath] along the top and then following the [ilmath]\downarrow[/ilmath] along the right produces a diff where we delete the input, then write the output letter by letter. Note we could also go down first, and then right, which corresponds to the diff that inserts the symbols of the second string then deletes all the symbols from the first. Let's mark some examples.

Example diff-paths

[ilmath] \begin{xy} \xymatrix{ \ & 0 & \mathrm{a} & \mathrm{b} & \mathrm{c} & \mathrm{a} & \mathrm{b} & \mathrm{b} & \mathrm{a} & \mathrm{c} \\ 0 & \text{S} \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d] \\ \mathrm{a} & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d]\\ \mathrm{b} & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d]\\ \mathrm{b} & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d]\\ \mathrm{a} & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[r] \ar[d] & \cdot \ar[d] \ar[r] & \cdot \ar[d]\\ \mathrm{b} & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \cdot \ar[r] & \times & } % the chains \ar@{->} "2,2";"3,3" \ar@{->} "3,3";"4,4" \ar@{->} "4,3";"5,4" \ar@{->} "5,2";"6,3" \ar@{->} "6,3";"7,4" % first 2 cols done \ar@{->} "5,5";"6,6" \ar@{->} "6,6";"7,7" \ar@{->} "6,7";"7,8" % bottom 2-chain and 1 chain done \ar@{->} "2,5";"3,6" \ar@{->} "3,6";"4,7" \ar@{->} "4,7";"5,8" \ar@{->} "5,8";"6,9" % 4 chain done! \ar@{->} "4,6";"5,7" \ar@{->} "2,8";"3,9" \ar@{->} "3,7";"4,8" % and the 3 that go / through the 4 chain (the bottom right already being a part of a 2 chain %START OF RED PATH \ar@[red]@/^/@{->} "2,2";"2,3" \ar@[red]@/^/@{->} "2,3";"2,4" \ar@[red]@/^/@{->} "2,4";"2,5" \ar@[red]@/^/@{->} "2,5";"2,6" \ar@[red]@/^/@{->} "2,6";"2,7" \ar@[red]@/^/@{->} "2,7";"2,8" \ar@[red]@/^/@{->} "2,8";"2,9" \ar@[red]@/^/@{->} "2,9";"2,10" %and down \ar@[red]@/^/@{->} "2,10";"3,10" \ar@[red]@/^/@{->} "3,10";"4,10" \ar@[red]@/^/@{->} "4,10";"5,10" \ar@[red]@/^/@{->} "5,10";"6,10" \ar@[red]@/^/@{->} "6,10";"7,10" %start of green path \ar@[green]@/_/@{->} "2,2";"2,3" \ar@[green]@/_/@{->} "2,3";"2,4" \ar@[green]@/_/@{->} "2,4";"2,5" %now go down the 4-chain \ar@[green]@/_/@{->} "2,5";"3,6" \ar@[green]@/_/@{->} "3,6";"4,7" \ar@[green]@/_/@{->} "4,7";"5,8" \ar@[green]@/_/@{->} "5,8";"6,9" %now down and across \ar@[green]@/^/@{->} "6,9";"6,10" \ar@[green]@/_/@{->} "6,10";"7,10" %blue path \ar@[blue]@/^/@{->} "2,2";"3,3" \ar@[blue]@/^/@{->} "3,3";"4,4" \ar@[blue]@/^/@{->} "4,4";"4,5" \ar@[blue]@/^/@{->} "4,5";"5,5" \ar@[blue]@/^/@{->} "5,5";"6,6" \ar@[blue]@/^/@{->} "6,6";"7,7" \ar@[blue]@/^/@{->} "7,7";"7,8" \ar@[blue]@/^/@{->} "7,8";"7,9" \ar@[blue]@/^/@{->} "7,9";"7,10" %orange path \ar@[orange]@/_/@{->} "2,2";"3,3" \ar@[orange]@/_/@{->} "3,3";"4,3" \ar@[orange]@/_/@{->} "4,3";"5,4" \ar@[orange]@/_/@{->} "5,4";"5,5" \ar@[orange]@/_/@{->} "5,5";"6,6" \ar@[orange]@/_/@{->} "6,6";"6,7" \ar@[orange]@/_/@{->} "6,7";"7,8" \ar@[orange]@/_/@{->} "7,8";"7,9" \ar@[orange]@/_/@{->} "7,9";"7,10" \end{xy} [/ilmath]
Example "shortest" paths

Here:

  • The red path is a "worst case", it has weight [ilmath]13[/ilmath], the red path deletes the first string then inserts each letter of the target string
    • Note that any path consisting only of inserts and deletes (only [ilmath]\rightarrow[/ilmath] and [ilmath]\downarrow[/ilmath] has weight [ilmath]13[/ilmath]
  • The orange, green and blue paths are all optimal, they all have weight [ilmath]5[/ilmath]. Specifically:
    • Green - Objectively the best, it is "delete abc accept abba delete c, insert b"
    • Blue - Satisfactory, it is "accept ab delete c insert b accept ab delete bac"
    • orange - Unsatisfactory, it is "accept a insert b accept b delete c accept a delete b accept b delete ac"
  • Infact any path with [ilmath]4[/ilmath] [ilmath]\searrow[/ilmath] in it is "optimal" and there are more than these 3!

Diffgraph algorithm analysis=

The diffgraph I have given above (with links only to the east, south or southeast node) are incredibly simple to implement, the graph can be generated lazily and an A* pathfinding algorithm used to give (probably) great results, the problem is that the algorithm might accept "crap" diffs like the orange one above. Simplicity and speed are the obvious advantages.

One could alter the algorithm heavily to choose the path with the longest accepts (preferring 1x4-chain rather than 2x2 chains) when encountering an alternative path with equal distance to the current known distance to that node, however this will only allow it to choose the "nicest" optimal path - that is it will only choose the nicest path out of the paths of minimum length. This is a subtle distinction so I urge the reader to think about cases where the "nicest" diff (which has the most and longest chains) is different from the nicest optimum length diff. A good way to think about this is considering a page of a book and changing bits of it, the paragraphs you have left alone are long chains. A good algorithm will "lock on" to long chains and fill in what's between them, trying to think of a case where this has a greater weight than an optimum path is an interesting exercise.

Modifications


TODO: Weight chains, for example [ilmath]1/(1+\ell)[/ilmath] for an [ilmath]\ell[/ilmath]-chain