Slide 1: Today we prove more problems to be undecidable, that is, unsolvable by any computer. Slide 2: Earlier, we saw that many problems about DFAs are decidable. For example, checking whether a DFA accepts an input string w, or whether two DFAs D and D' accept the same set of inputs. Many problems about Turing machines turned out to be undecidable, such as checking whether a Turing machine accepts an input string w. The same is true for many other problems about Turing machines. What about problems concerning context-free grammars? Some of these problems are decidable, such as checking whether a context-free grammar generates an input string w. What about checking whether a context-free grammar is able to generate all inputs? Or checking whether a context-free grammar is ambiguous or not? Today, we will see that many of these problems are undecidable. Slide 3: To show this, we need to represent a computation history as a string. Let's consider the following Turing machine. This machine accepts all strings that consist of two parts separated by the pound symbol #, where the two parts are the same. When this Turing machine is run on the input string "abb#abb", it changes from the initial configuration (shown on the right) to the next, and after a sequence of changes, it arrives at an accepting configuration. Slide 4: Recall that a configuration of a Turing machine consists of all information about the machine at a particular time. It involves the current state (such as q1), the head position, and all the content on the tape. We can represent a configuration as a string. The top configuration can be represented as the string "ab q1 a". This indicates that all the symbols on the tape are "aba", the machine is currently in state q1, and the read-write head is pointing at the third symbol. In this string representation, we put the state "q1" right before the symbol under the read-write head. The bottom configuration is represented as the string "abb q_{acc}".
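The string encoding of a configuration described on Slide 4 can be sketched in a few lines of Python (a hypothetical helper, not from the slides): the state token is spliced into the tape contents just before the cell the head points at.

```python
def encode_config(tape, state, head):
    """Encode a Turing machine configuration as a string by writing
    the state name immediately before the symbol under the head.
    tape: string of tape contents; head: 0-indexed head position."""
    return tape[:head] + state + tape[head:]

# The two configurations from the slide, with the state names
# written as single tokens "q1" and "q_acc":
print(encode_config("aba", "q1", 2))    # "ab q1 a"    -> abq1a
print(encode_config("abb", "q_acc", 3)) # "abb q_acc"  -> abbq_acc
```

Note how a head position one past the last written symbol (as in the accepting configuration) simply places the state token at the end of the string.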
The content on the tape is "abb", the machine is in the accepting state "q_{acc}", and the read-write head is pointing at the fourth cell. Slide 5: A computation history is the sequence of configurations a Turing machine goes through on an input string. For the Turing machine we saw earlier, when the input string is "abb#abb", the machine is initially in the configuration represented by the string "q0 abb#abb", then it goes to the next configuration, and after a sequence of steps, it ends up at an accepting configuration, represented by "xxx#xx q_{acc} x". The whole sequence of configurations is the computation history. Slide 6: We can concatenate all the configuration strings and represent the computation history as one very long string. We separate different configurations using the special symbol pound #. Every symbol in this computation history is either a tape symbol from the tape alphabet Γ (Gamma), or a state from Q, or the special symbol # (pound). If the machine accepts the input string w, then the accepting state q_{acc} appears in the last configuration of the computation history. If the machine rejects the input string w, then the rejecting state q_{rej} appears in the last configuration of the computation history. If the machine loops forever on w, then the computation history cannot be represented by a finite string. Slide 7: We now show the following language (or problem) to be undecidable. The problem is checking whether a context-free grammar can generate all strings. We will reduce from the complement of the Turing machine acceptance problem, which is the problem of checking whether a Turing machine fails to accept an input string w. In other words, whether the Turing machine rejects or loops forever on the input string w. Slide 8: In this reduction, we convert the question "Does Turing machine M fail to accept input string w?" into the question "Does context-free grammar G generate all strings?". If M fails to accept w, that is, if M rejects or loops forever on w, then G generates all strings.
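The concatenation described on Slide 6 is a one-liner; here is a minimal Python sketch. For readability the example uses short hypothetical configurations whose tape symbols do not clash with the separator # (in the lecture's running example the tape itself also contains #, which the slides gloss over).

```python
def encode_history(configs):
    """Concatenate configuration strings into one long string,
    separated (and bracketed) by the special symbol '#'."""
    return "#" + "#".join(configs) + "#"

# A hypothetical four-step history ending in the accepting state:
history = encode_history(["q0ab", "xq1b", "xq2x", "xxq_acc"])
print(history)  # #q0ab#xq1b#xq2x#xxq_acc#
```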
If M accepts w, then G fails to generate some string. Slide 9: This reduction will make use of computation history. We will encode the acceptance problem about the Turing machine M as a problem about the context-free grammar G. If the Turing machine M accepts the input string w, then the context-free grammar G fails to generate exactly one string, namely the accepting computation history of M on input w. Since context-free grammars are equivalent to pushdown automata, we will actually construct a PDA P from the Turing machine M and w. Slide 10: So this PDA P is supposed to accept every string except the accepting computation history of M on w. If the input string to the PDA is the long string encoding the accepting computation history, then the PDA rejects. This string really is an accepting computation history, because the last state is the accepting state. The PDA accepts all other strings. We will design P by spotting an error in any string that is not the accepting computation history of M on w. There are three kinds of errors. First, the string may not begin with the symbol # (pound), or may not end with the symbol pound. The PDA accepts any string of this form. Second, the first configuration may not be the initial configuration on input string w, or the last configuration may not contain the accepting state q_{acc}. The PDA again accepts any string of this form. Finally, two consecutive configurations may not follow the transitions of the Turing machine M. The PDA also accepts any string of this form. Note that if a string does not belong to any of the above three cases, then it must be the accepting computation history of M on w. Designing a PDA to accept strings of the first two types is easy. The challenge is to design a PDA that can spot errors in transitions. In the next three slides, we will outline how to design the PDA. Slide 11: The key insight is that the changes made by a Turing machine transition are local.
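The first two kinds of errors are easy to detect. The following rough Python sketch makes them concrete (the names q0 and q_acc and the splitting convention are assumptions for illustration; the actual PDA works symbol by symbol, and this sketch also assumes # does not occur inside a configuration):

```python
def has_easy_error(s, w, q0="q0", q_acc="q_acc"):
    """Check the first two error types the PDA looks for:
    1) the string does not begin or end with '#';
    2) the first configuration is not the start configuration
       'q0 w', or the last configuration lacks the accepting
       state."""
    if not (s.startswith("#") and s.endswith("#")):
        return True
    configs = s[1:-1].split("#")
    if configs[0] != q0 + w:
        return True
    if q_acc not in configs[-1]:
        return True
    return False

print(has_easy_error("#q0ab#xq_accb#", "ab"))  # False: both checks pass
print(has_easy_error("q0ab#", "ab"))           # True: missing leading '#'
```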
All changes happen only in the tape cells near the read-write head. Consider again the Turing machine we saw earlier, and consider its computation history on the input string "ab#ab". When going from the initial configuration to the next, any change to the configuration occurs inside this box, involving symbols near the read-write head. All symbols outside the box are unchanged. The same is true when going from the second configuration to the third, and so on. Slide 12: If we look at three consecutive symbols in a configuration, and the symbols at the same position in the next configuration, they form a local "window". Only certain windows may appear in a valid transition. Some examples are shown on the left. They correspond to valid windows for the transition shown in the middle. Some invalid windows are shown on the right. For example, the first window on the right corresponds to the read-write head moving two steps to the right, which cannot happen in a single transition. The key observation is that there are only a finite number of windows. Some of them correspond to valid transitions, and some do not. If we find an illegal window, then the two consecutive configurations must contain an error, and do not follow from any transition of the Turing machine M. Slide 13: Now, to check whether two consecutive configurations w_i and w_{i+1} contain a transition error, we do the following. We remember on the stack the offset of the window, in other words, the number of symbols that appear before the window and after the pound symbol #. In the example on the top right, there are two symbols before the window. The PDA also remembers the first row of the window in its state. After reaching the pound symbol # preceding the second configuration, the PDA makes sure the second row of the window has the same offset. It can do so using its stack. Then it reads the next three symbols and remembers them in its state. The PDA accepts if the window is illegal. Slide 14: A recap of what we did.
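The window check above can be sketched in Python. This is a deterministic stand-in for the PDA (which instead guesses one window position nondeterministically), and the set of legal windows is hand-listed here for one hypothetical transition δ(q1, b) = (q2, c, R); a real construction would enumerate the legal windows from the whole transition table.

```python
def windows(c1, c2):
    """All aligned 2x3 windows of two consecutive configurations,
    each given as a list of symbols/state names of equal length."""
    return [(tuple(c1[i:i+3]), tuple(c2[i:i+3]))
            for i in range(len(c1) - 2)]

def has_transition_error(c1, c2, legal):
    """Report whether some aligned window is not in the legal set.
    The PDA accepts (spots an error) exactly in this case."""
    return any(w not in legal for w in windows(c1, c2))

# Hand-listed legal windows for the single hypothetical transition
# delta(q1, b) = (q2, c, R):
legal = {
    (('x', 'a', 'q1'), ('x', 'a', 'c')),
    (('a', 'q1', 'b'), ('a', 'c', 'q2')),
    (('q1', 'b', 'y'), ('c', 'q2', 'y')),
}

# Valid step: head writes c and moves one cell right.
print(has_transition_error(['x','a','q1','b','y'],
                           ['x','a','c','q2','y'], legal))  # False
# Invalid step: head "jumped" two cells to the right.
print(has_transition_error(['x','a','q1','b','y'],
                           ['x','a','c','y','q2'], legal))  # True
```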
To show the ALL_{CFG} language to be undecidable, we reduce from the complement of the Turing machine acceptance problem A_{TM}. Given an instance of the question "Does Turing machine M fail to accept input string w?", we convert it into the question "Does context-free grammar G generate all strings?". Here G generates all strings except the accepting computation history of M on w. We first construct a PDA that accepts all strings except the accepting computation history, and then convert it into a context-free grammar G. Slide 15: Two slides from now, we will show another problem about context-free grammars to be undecidable. But before that, we need to introduce a new problem about some puzzles. This problem was introduced by Emil Post and is called the Post Correspondence Problem. In this problem, we are given a finite set of tiles. Each tile contains a pair of strings, a top string and a bottom string. For example, on the slide we may be given these six tiles. Our task is to figure out whether we can use these six types of tiles to construct a "top and bottom match". We are allowed to use the same kind of tile more than once. In the example shown on this slide, we can use those six tiles to form a sequence that is a "top and bottom match". Note that the fourth and fifth tiles in the sequence are the same. This is a match because the top string, formed by concatenating all the top strings of the tiles, is the same as the bottom string, formed by concatenating all the bottom strings. Slide 16: If I give you those six kinds of tiles from the previous slide, you can find a top and bottom match. What if I give you another set of tiles? Say the same set but with the last tile removed? Can you construct a top and bottom match using the new set of five kinds of tiles? This is an example of the Post Correspondence Problem: given a fixed set of tiles, tell me whether you can form a top and bottom match using the tiles, possibly using a tile more than once.
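Verifying a proposed top and bottom match is straightforward; here is a short Python sketch. The three tiles below are a small standard instance, not the slide's six tiles (those are not reproduced in the transcript).

```python
def is_match(tiles, seq):
    """tiles: list of (top, bottom) string pairs; seq: sequence of
    0-based tile indices, repeats allowed. A match means the
    concatenated tops equal the concatenated bottoms."""
    top = "".join(tiles[i][0] for i in seq)
    bottom = "".join(tiles[i][1] for i in seq)
    return top == bottom

# A small classic instance (hypothetical, not the slide's tiles):
tiles = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]
print(is_match(tiles, [2, 1, 2, 0]))  # True: both sides spell "bbaabbbaa"
print(is_match(tiles, [0]))           # False: "a" vs "baa"
```

Note that verifying a given sequence is easy; the undecidable part is deciding whether any matching sequence exists at all, since there is no bound on how long a match may need to be.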
In the next lecture, we will show that the Post Correspondence Problem is undecidable. No algorithm can solve this problem. Slide 17: Today, we will assume the Post Correspondence Problem to be undecidable, and show that the ambiguity of context-free grammars is undecidable. The ambiguity problem means that, given a context-free grammar, we decide whether the grammar is ambiguous or not. In other words, whether there is a string that can be generated by the grammar with two different parse trees. We will reduce from the Post Correspondence Problem to the ambiguity problem. Slide 18: In the reduction, we turn an instance of the Post Correspondence Problem into a context-free grammar. That is, given a finite collection T of tiles, we turn it into a context-free grammar G, so that if T has a top and bottom match, then G is ambiguous, but if T has no top and bottom match, then G is unambiguous. Let's first number the tiles, as shown in the diagram. Slide 19: When we map a set T of tiles to a context-free grammar G, we need to say what the terminals and variables in G are, and what its production rules are. The terminals in G are either the symbols that appear in some string in a tile in T, such as "a", "b", "c" in this example, or a tile number, such as "1", "2", "3". The variables in G will be S, T, and B, meaning the start variable, top, and bottom, respectively. The production rules in G are as follows. There is one special rule that says S becomes T or B. For each tile in T, there are four production rules in G. One of these rules says T becomes the top string in the tile, followed by T, followed by the tile number. Another rule is similar, but for the bottom string. Then there are two more rules that are almost identical to what we just described, but without T or B on the right. If there are n tiles in the tile collection, then there will be 4n+1 production rules in G. This completes the description of the reduction.
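The rule construction on Slide 19 is mechanical, so it can be written out as a short Python sketch (the (variable, body) pair representation is an assumption for illustration):

```python
def pcp_grammar(tiles):
    """Production rules of G for tiles numbered 1..n, as on the slide:
      S -> T | B                       (one rule, two alternatives)
      T -> t_i T i   and   T -> t_i i  for each tile (t_i, b_i)
      B -> b_i B i   and   B -> b_i i
    Rules are returned as (variable, body) pairs, so the list has
    4n + 2 entries; the slide counts S -> T | B as a single rule,
    giving 4n + 1."""
    rules = [("S", "T"), ("S", "B")]
    for i, (top, bottom) in enumerate(tiles, start=1):
        rules += [("T", f"{top}T{i}"), ("T", f"{top}{i}"),
                  ("B", f"{bottom}B{i}"), ("B", f"{bottom}{i}")]
    return rules

# Hypothetical three-tile instance (not the slide's tiles):
rules = pcp_grammar([("a", "baa"), ("ab", "aa"), ("bba", "bb")])
print(len(rules))  # 14 pairs, i.e. 4*3 + 1 rules in the slide's count
```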
Slide 20: Now we need to check two things. First, if the tile collection has a top and bottom match, then the grammar we get is ambiguous. Second, if the tile collection has no match, then the grammar is unambiguous. Let's check the first claim. Take any sequence of tiles, such as the example on the slide. It corresponds to two derivations in the grammar, one for the top, and one for the bottom. The first derivation generates a string consisting of all the top strings concatenated, followed by the tile numbers in reverse. Likewise, the second derivation generates the bottom strings concatenated, followed by the tile numbers in reverse. If the sequence of tiles is a top and bottom match, this gives two parse trees for the same string. Slide 21: Now we check the second claim, that is, if the tile collection has no top and bottom match, then the grammar is unambiguous. We will actually show the contrapositive: if the grammar is ambiguous, then the tile collection has a top and bottom match. If the grammar is ambiguous, it has two different parse trees for the same string. One of these parse trees must correspond to the top strings of a tile sequence, and the other to the bottom strings of a tile sequence, for the following reason. Every string generated from the grammar must end with a sequence of tile numbers in reverse. And there is only one "top" parse tree and one "bottom" parse tree that the grammar can generate for a given tile sequence. So if there are two different parse trees for the same string, one must correspond to the top strings (and involve the T variable in the grammar), and the other to the bottom strings (and involve the B variable in the grammar). Since these two parse trees arise from the same tile sequence and generate the same string, the concatenated top strings equal the concatenated bottom strings, so this tile sequence constitutes a top and bottom match.
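The first claim can be made concrete with a short Python sketch of the strings the two derivations produce: the concatenated top (or bottom) strings followed by the tile numbers in reverse. On a matching tile sequence, the T-derivation and the B-derivation yield the very same string, which gives the two parse trees. (The three tiles are a hypothetical instance, not the slide's.)

```python
def top_string(tiles, seq):
    """String generated by the T-derivation for the 1-based tile
    sequence seq: concatenated tops, then tile numbers in reverse
    (the rule T -> t_i T i appends each number after the rest)."""
    tops = "".join(tiles[i - 1][0] for i in seq)
    return tops + "".join(str(i) for i in reversed(seq))

def bottom_string(tiles, seq):
    """Same, for the B-derivation and the bottom strings."""
    bottoms = "".join(tiles[i - 1][1] for i in seq)
    return bottoms + "".join(str(i) for i in reversed(seq))

# The sequence 3, 2, 3, 1 is a top and bottom match for this
# instance, so both derivations produce the same string:
tiles = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]
print(top_string(tiles, [3, 2, 3, 1]))     # bbaabbbaa1323
print(bottom_string(tiles, [3, 2, 3, 1]))  # bbaabbbaa1323
```

The trailing reversed tile numbers are exactly what makes the second claim work: two parse trees for the same string must agree on that suffix, hence on the tile sequence.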