6.2 Introduction to Context-Free Languages (CFL)
In this section, we will dive into Context-Free Languages (CFLs), which are a broad class of formal languages that can be generated by a specific type of grammar called Context-Free Grammar (CFG). CFLs are widely used to model the syntax of programming languages and are central to the study of compilers and language processing.
1. Context-Free Grammar (CFG)
A Context-Free Grammar (CFG) is a formal grammar in which each production rule has a single non-terminal symbol on the left-hand side. CFGs are used to define the syntax of context-free languages. A CFG consists of:
A set of non-terminal symbols (used to define the language's structure).
A set of terminal symbols (the basic symbols of the language, which appear in strings).
A set of production rules (or rewrite rules) that describe how non-terminal symbols can be replaced by other symbols.
A start symbol (the symbol from which the derivation starts).
The production rules are of the form:
For example:
This defines a language where S can be replaced by aSb
or the empty string ε
. This generates strings with balanced numbers of a
's and b
's, such as ab
, aabb
, aaabbb
, etc.
2. Derivative Trees (Top-down and Bottom-up approach, Leftmost and Rightmost Derivation)
A derivation tree (or parse tree) visually represents how a string in a language can be derived using a grammar's production rules. It provides a structured way to illustrate the step-by-step process of deriving a string from the start symbol.
Derivation trees are central to parsing in compilers, where they help in analyzing the syntax of programming languages. They can be constructed using two main approaches:
1. Top-down Approach
In this approach, the derivation starts from the start symbol of the grammar and breaks it down into the terminals step by step using production rules.
The process continues until the string (sequence of terminals) is completely derived.
Example
Consider the grammar:
S → AB
A → a
B → b
To derive the string ab:
Start with S.
Apply S → AB.
Replace A with a using A → a.
Replace B with b using B → b.
The derivation tree will look like:
2. Bottom-up Approach
In this approach, the derivation starts with the terminal string (sequence of terminals) and reduces it to the start symbol step by step.
This is the reverse of the top-down approach and is often used in shift-reduce parsing.
Example
For the same grammar, derive S from ab:
Start with ab.
Combine b into B using B → b.
Combine a into A using A → a.
Combine AB into S using S → AB.
The reduction steps:
Leftmost Derivation
At each step, the leftmost non-terminal in the current string is replaced using one of its production rules.
It is widely used in recursive-descent parsers.
Example
For the grammar:
S → AB
A → a
B → b
To derive ab using leftmost derivation:
Start with S.
Replace the leftmost S with AB: AB.
Replace the leftmost A with a: aB.
Replace the leftmost B with b: ab.
Steps:
Rightmost Derivation
At each step, the rightmost non-terminal in the current string is replaced using one of its production rules.
It is used in LR parsers (common in compiler design).
Example
Using the same grammar, derive ab with rightmost derivation:
Start with S.
Replace the rightmost S with AB: AB.
Replace the rightmost B with b: Ab.
Replace the rightmost A with a: ab.
Steps:
Key Differences Between Leftmost and Rightmost Derivations
Aspect
Leftmost Derivation
Rightmost Derivation
Replaced Non-terminal
Leftmost non-terminal first
Rightmost non-terminal first
Order of Parsing
Natural for recursive-descent parsers
Used in bottom-up parsers (like LR)
Output Order
Produces parse tree top-down
Produces parse tree bottom-up
Applications of Derivation Trees
Parsing in Compilers:
Helps validate the syntactical structure of a programming language.
Grammar Analysis:
Derivation trees illustrate ambiguities in grammars.
Understanding Language Rules:
Derivation trees are useful in formal language theory to understand the generative process of a grammar.
Compiler Design:
Used in constructing parsers, such as LL parsers (top-down) and LR parsers (bottom-up).
Syntax Trees in Programming:
Similar structures are used for abstract syntax trees in programming languages.
3. Parse Tree and Its Construction
A parse tree is a tree that represents the syntactic structure of a string according to a given grammar. It is a visual representation of the derivation process, where:
The root of the tree is the start symbol.
Internal nodes represent non-terminal symbols.
Leaves represent terminal symbols (actual characters from the string).
The parse tree is constructed by recursively applying the grammar rules starting from the start symbol, ensuring that the string is derived according to the grammar.
Example of a parse tree for the string aabb
using the grammar S → aSb | ε
:
4. Ambigous Grammar
A grammar is ambiguous if there exists a string that can be generated by the grammar in more than one way (i.e., it has more than one valid parse tree). Ambiguity in a grammar can be problematic, especially when designing compilers, as it can lead to multiple interpretations of the same string.
For example, the following grammar for arithmetic expressions:
This grammar describes how an expression E
can be built. It allows:
An expression to be the sum of two expressions (
E + E
).An expression to be the product of two expressions (
E * E
).An expression to be enclosed in parentheses (
(E)
).An expression to be a simple identifier (
id
).
The string id + id * id
has more than one valid parse tree, leading to ambiguity in how the expression should be evaluated.
Why This is Ambiguous?
The same string, id + id * id
, can be parsed in two different ways:
Addition first, followed by multiplication.
Multiplication first, followed by addition.
This leads to ambiguity because there is no way to tell from the grammar itself whether the multiplication or the addition should be performed first.
Problems of Ambiguity
Multiple Interpretations: Ambiguity causes multiple valid parse trees, leading to different interpretations of the same input string. For arithmetic expressions, the most common ambiguity arises from operator precedence—should multiplication or addition take precedence? In most programming languages, multiplication has higher precedence than addition, but the given grammar doesn't specify this, leading to ambiguity.
Compilers: When a compiler or interpreter encounters an ambiguous grammar, it may not know how to correctly parse the input. This could lead to errors or unexpected results in the output program. Compilers need unambiguous grammars to ensure they can generate the correct machine code or interpret the program as intended.
Parsing Performance: Ambiguity in a grammar makes parsing more complex, as the parser may need to explore multiple parse trees and choose the correct one based on further context or rules, increasing computational overhead.
Ambiguity in grammars is problematic in formal languages and programming languages because it leads to multiple interpretations of the same input. Resolving this ambiguity usually involves restructuring the grammar to reflect operator precedence and associativity rules. This is especially important when designing compilers or interpreters that need to generate or interpret code accurately.
5. Chomsky Normal Form (CNF)
Chomsky Normal Form (CNF) is a standardized form of context-free grammar (CFG) that simplifies certain operations, such as parsing and proving properties of context-free languages. CNF restricts the production rules of a CFG to a specific structure, making it easier to analyze and work with. The CNF is essential for certain algorithms, especially parsing algorithms like the CYK (Cocke-Younger-Kasami) algorithm.
1. Definition of Chomsky Normal Form
A grammar is said to be in Chomsky Normal Form (CNF) if every production rule adheres to one of the following three forms:
A → BC
Where A, B, and C are non-terminal symbols. This means that a non-terminal can only produce two other non-terminals. This is referred to as binary branching.
A → a
Where A is a non-terminal and a is a terminal symbol. In this case, a non-terminal produces a single terminal symbol.
A → ε
This rule can only occur for the start symbol, and only if the start symbol can derive the empty string (ε). The empty string rule is a special case and is allowed only for the start symbol in CNF.
2. Restrictions in CNF
In CNF, there are a few important restrictions on the form of production rules:
No direct terminal to non-terminal productions:
In a standard context-free grammar, a production might allow a non-terminal to be replaced by a string of terminals, such as:
where
a
is a terminal symbol, andB
is a non-terminal. In CNF, this would not be allowed. Instead, terminal symbols must be produced by a single non-terminal, i.e.,A → a
, and non-terminals must always be replaced by either two non-terminals or an empty string in the case of the start symbol.
Binary branching only:
Productions must be binary. This means that a non-terminal must either produce two non-terminals or one terminal. There can be no production where a non-terminal generates more than two symbols, like
A → BCD
, or fewer than two symbols, likeA → B
.
3. Why Use CNF?
The structure of CNF is useful for several reasons:
Efficient Parsing:
CNF is useful for parsing algorithms such as the CYK algorithm, which operates efficiently on grammars in CNF. These algorithms rely on the fact that every rule has at most two non-terminals on the right-hand side, simplifying the process of constructing parse trees.
Theoretical Simplicity:
CNF simplifies the theoretical analysis of context-free languages. It helps in proving properties like pumping lemmas and the existence of a parse tree for any derivable string.
Uniformity:
CNF restricts all production rules to a consistent and uniform form, making the grammar easier to work with in formal proofs and computational models.
4. Conversion to CNF
Converting a general context-free grammar (CFG) to CNF is a process that involves several steps. Here’s an overview of the steps involved in converting a CFG to CNF:
Remove ε-productions (Empty String Productions)
If the start symbol can derive the empty string ε, we need to create new productions that account for this possibility.
For each rule where a non-terminal can derive the empty string, we need to add alternative productions that account for the presence or absence of that non-terminal.
For example, if we have a rule like:
we may need to modify rules where A appears and generate alternative versions where A is absent.
Eliminate Unit Productions
A unit production is a rule where a non-terminal produces exactly one non-terminal:
These productions must be eliminated by substituting B's production rules into the rule for A.
Eliminate Terminal Symbols from Right-hand Sides of Productions (Other Than Single Terminal Productions)
If a rule has a terminal symbol combined with non-terminals, such as:
We need to replace
a
(the terminal) with a new non-terminalX
that producesa
(i.e.,X → a
). Then, we substitute this new non-terminal into the rule:
Convert Long Productions to Binary Form
If a production has more than two symbols on the right-hand side, such as:
We break it down into binary form. First, introduce a new non-terminal
X
:This transformation ensures that the grammar follows the binary branching rule.
Chomsky Normal Form is a strict form of context-free grammar that simplifies the process of parsing and analyzing context-free languages. By restricting production rules to binary branching or single terminal symbols, CNF makes certain algorithms more efficient and easier to implement. While many real-world grammars may not be in CNF, converting them into CNF is often necessary for certain types of parsing algorithms and theoretical analysis.
6. Greibach Normal Form (GNF)
A Greibach Normal Form (GNF) is another form of context-free grammar where every production rule has the form:
Where a
is a terminal symbol, and α
is a (possibly empty) string of non-terminals. This form ensures that every production starts with a terminal symbol.
7. Backus-Naur Form (BNF)
Backus-Naur Form (BNF) is a notation used to express the syntax of programming languages and formal grammars. It is a way of describing the rules for forming strings in a language. BNF consists of:
A set of production rules where the left-hand side is a non-terminal and the right-hand side is a string of terminals and/or non-terminals.
The start symbol, which is the initial symbol of the grammar.
For example:
8. Pushdown Automata
A Pushdown Automaton (PDA) is a type of automaton that is used to recognize context-free languages. It is an extension of a finite automaton with a stack, which allows it to store an unbounded amount of information.
The PDA can be used to accept a language by emptying its stack when the entire input string is processed.
To construct a Pushdown Automaton (PDA) for the language L = {aⁿbⁿ | n ≥ 1}, we need to design the PDA such that it accepts strings where the number of 'a's is equal to the number of 'b's, with the following conditions:
The string starts with one or more 'a's.
For every 'a' encountered, push an 'a' onto the stack.
For every 'b' encountered, pop one 'a' from the stack.
The string is accepted if, after processing all the input, the stack is empty and all symbols are consumed.
PDA Construction
We can represent the PDA as a 7-tuple (Q, Σ, Γ, δ, q₀, Z₀, F):
Q: A finite set of states.
Σ: The input alphabet (the symbols that can appear in the input string).
Γ: The stack alphabet (the symbols that can be pushed onto or popped from the stack).
δ: The transition function that describes the state transitions and stack operations.
q₀: The initial state of the PDA.
Z₀: The initial symbol on the stack.
F: The set of accept states.
Components of the PDA
Q = {q₀, q₁, q₂}
q₀
: Initial state, where we begin processing the input.q₁
: Intermediate state for processing 'a's.q₂
: Accept state, where the string is accepted after all input is consumed and the stack is empty.
Σ = {a, b}
The input alphabet is composed of 'a' and 'b'.
Γ = {a, Z₀}
The stack alphabet consists of 'a' (which we push and pop) and the initial stack symbol
Z₀
(which marks the bottom of the stack).
δ: The transition function is defined as follows:
δ(q₀, a, Z₀) → (q₀, aZ₀): If we are in state
q₀
and the input symbol is 'a' with the stack top symbol beingZ₀
, we push 'a' onto the stack and stay in stateq₀
.δ(q₀, a, a) → (q₀, aa): If we are in state
q₀
and the input symbol is 'a' with the stack top symbol being 'a', we push another 'a' onto the stack and stay in stateq₀
.δ(q₀, b, a) → (q₁, ε): If we are in state
q₀
and the input symbol is 'b' with the stack top symbol being 'a', we pop the 'a' from the stack and move to stateq₁
.δ(q₁, b, a) → (q₁, ε): If we are in state
q₁
and the input symbol is 'b' with the stack top symbol being 'a', we pop the 'a' from the stack and stay in stateq₁
.δ(q₁, ε, Z₀) → (q₂, Z₀): If we are in state
q₁
and the input is empty (i.e., we've consumed all the symbols) with the stack top symbol beingZ₀
, we move to the accept stateq₂
and the stack remains unchanged.
q₀: The initial state.
Z₀: The initial stack symbol.
F = {q₂}: The accept state.
Key Difference between Finite Automata and Push Down Automata:
Finite Automaton (FA): Only has a finite set of states and cannot store any extra information beyond the current state.
Pushdown Automaton (PDA): Has a stack, allowing it to store an unbounded amount of information. This stack allows the PDA to handle languages with nested or recursive structures (like balanced parentheses or equal numbers of
a
s andb
s).
Example:
Finite Automaton Example: Consider the language of strings containing at least one
a
(L = {w | w contains at least one 'a'}
). A finite automaton can recognize this because it just needs to check if any 'a' appears in the input. The FA transitions between two states (one for seeing 'a' and one for not seeing 'a') and does not need any additional memory.Pushdown Automaton Example: Consider the language
L = {aⁿbⁿ | n ≥ 1}
(equal numbers ofa
s andb
s). A finite automaton cannot recognize this because it cannot "count" the number ofa
s andb
s. However, a PDA can store ana
on the stack every time it encounters one and pop it when it encounters ab
. If the stack is empty at the end and the entire input string is processed, the PDA accepts the string.
9. Equivalence of Context-Free Languages and PDA
Context-free languages (CFLs) and Pushdown Automata (PDA) are equivalent in the sense that every context-free language can be recognized by a PDA.
A context-free grammar can be converted to a PDA, and vice versa. The PDA uses its stack to handle the non-context-sensitive features of the language (such as matching parentheses or balanced symbols), which is what makes it capable of recognizing CFLs.
10. Pumping Lemma for Context-Free Languages
The Pumping Lemma for Context-Free Languages is a property that helps prove whether a given language is not context-free. It states that for any context-free language, there exists a "pumping length" p
such that any string w
in the language of length at least p
can be divided into five parts: w = uvxyz
, where:
|vxy| ≤ p
(the length ofvxy
is at mostp
),|vy| > 0
(the stringvy
is non-empty),uv^n x y^n z
is in the language for alln ≥ 0
.
This property is used to show that certain languages cannot be generated by context-free grammars.
11. Properties of Context-Free Languages
Closure Properties: Context-free languages are closed under union, concatenation, and Kleene star, but not under intersection or difference.
Decidability: Many problems, such as membership (whether a string belongs to a language) and emptiness (whether a language is empty), are decidable for context-free languages.
Parsing: Parsing context-free languages is a central problem in compiler design. Techniques like LL parsing and LR parsing are used to parse context-free languages.
Important Differences Between Regular Languages and Context-Free Languages:
Regular Languages (recognized by Finite Automata) cannot handle recursion or nested structures. They are simpler and only include patterns that can be described by finite state machines (e.g.,
a*
,ab*
).Context-Free Languages (recognized by Pushdown Automata) can handle complex patterns like matching parentheses or recursive structures, requiring the use of a stack.
FA recognizes Regular Languages, which are simpler and do not involve recursion or nested patterns.
PDA recognizes Context-Free Languages (CFLs), which are more complex and include patterns like balanced parentheses, matched numbers of
a
s andb
s, palindromes, and nested expressions.
Conclusion
Context-Free Grammar (CFG) is used to define context-free languages, where each production rule has a single non-terminal on the left-hand side.
Derivation trees and parse trees are used to represent the structure of strings generated by a CFG.
A grammar is ambiguous if there is more than one parse tree for a string.
Normal forms like CNF, GNF, and BNF provide standardized ways of expressing grammars.
Pushdown Automata (PDA) recognize context-free languages, and PDAs are equivalent to context-free grammars.
The Pumping Lemma is used to prove that certain languages are not context-free.
Last updated