Some aspects of intelligence:
This test is often criticised because it only tests a limited aspect of intelligence.
Some people think that even if a machine could pass the Turing Test it may still not be intelligent.
This argument is itself criticised because it may well be possible for the complete system (i.e., the room together with the person inside) to be intelligent without the person alone being intelligent.
Some people say that passing the Turing test is sufficient to prove intelligence but it is not necessary to prove intelligence. In other words, a machine may fail the Turing Test but still be intelligent.
There are plenty of examples of computer systems that perform tasks that would require intelligence if they were performed by a human being.
Manipulate symbols and reduce the problem (usually recursively) until the answer is obvious, that is, until it can be looked up in a table.
Computers (Turing Machines) have the power for general intelligent action.
Constraints: the fox and the goose cannot be left together; the goose and the grain cannot be left together.
How to cross the river?
The problem is hard to solve directly from its English-language representation.
Try visual/graphical representation:
To solve this problem we need only follow the tree from its root node to any leaf node.
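The search just described can be sketched in Python. This is a minimal sketch assuming the standard form of the puzzle (a farmer who ferries at most one item per crossing); the state encoding, items on the left bank plus the farmer's position, is my own:

```python
from collections import deque

ITEMS = {"fox", "goose", "grain"}

def unsafe(bank, farmer_here):
    """A bank is unsafe only when the farmer is not there to supervise."""
    if farmer_here:
        return False
    return {"fox", "goose"} <= bank or {"goose", "grain"} <= bank

def solve():
    """Breadth-first search over states (items on left bank, farmer on left?)."""
    start = (frozenset(ITEMS), True)
    goal = (frozenset(), False)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        (left, farmer_left), path = frontier.popleft()
        if (left, farmer_left) == goal:
            return path
        here = left if farmer_left else ITEMS - left
        for cargo in [None] + sorted(here):     # cross alone or with one item
            new_left = set(left)
            if cargo is not None:
                (new_left.remove if farmer_left else new_left.add)(cargo)
            new_left = frozenset(new_left)
            state = (new_left, not farmer_left)
            # After the crossing the farmer is on the other bank.
            if unsafe(new_left, not farmer_left) or unsafe(ITEMS - new_left, farmer_left):
                continue
            if state not in seen:
                seen.add(state)
                frontier.append((state, path + [cargo]))
    return None
```

Following the tree from the root to a goal leaf yields the familiar seven-crossing solution, beginning and ending with the goose.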
Definitions may be organised into four categories.
To pass this test requires: natural language processing, knowledge representation, automated reasoning, and machine learning.
Example: GPS (General Problem Solver) was an early computer program that attempted to model human thinking. The developers were not so much interested in whether or not GPS solved problems correctly. They were more interested in showing that it solved problems like people, going through the same steps and taking around the same amount of time to perform those steps.
Example: All computers use energy. Using energy always generates heat. Therefore, all computers generate heat.
This initiated the field of logic. Formal logic was developed in the late nineteenth century; this was the first step toward enabling computer programs to reason logically.
By 1965, programs existed that could, given enough time and memory, take a description of the problem in logical notation and find the solution, if one existed. The logicist tradition in AI hopes to build on such programs to create intelligence.
There are two main obstacles to this approach: First, it is difficult to make informal knowledge precise enough to use the logicist approach particularly when there is uncertainty in the knowledge. Second, there is a big difference between being able to solve a problem in principle and doing so in practice.
In the logical approach to AI, the emphasis is on correct inferences. This is often part of being a rational agent, because one way to act rationally is to reason logically and then act on one's conclusions. But this is not all of rationality, because agents often find themselves in situations where there is no provably correct thing to do, yet they must do something.
There are also ways to act rationally that do not seem to involve inference, e.g., reflex actions.
The study of AI as rational agent design has two advantages:
Philosophers staked out most of the important ideas of AI, but to move to a formal science requires a level of mathematical formalism in three main areas: computation, logic and probability.
Mathematicians have proved that there exists an algorithm to prove any true statement in first-order logic. However, if one adds the principle of induction required to capture the semantics of the natural numbers, then this is no longer the case. Specifically, the incompleteness theorem showed that in any language expressive enough to describe the properties of the natural numbers, there are true statements that are undecidable: their truth cannot be established by any algorithm.
Analogously, Turing showed that there are some functions that no Turing machine can compute.
Although undecidability and noncomputability are important to an understanding of computation, the notion of intractability has had a much greater impact on computer science and AI. A class of problems is called intractable if the time required to solve instances of the class grows at least exponentially with the size of the instances.
Exponential vs. polynomial. In between is nondeterministic polynomial.
Even moderately sized instances of intractable problem classes cannot be solved in reasonable amounts of time. Therefore, one should strive to divide the overall problem of generating intelligent behaviour into tractable subproblems rather than intractable ones.
Another important concept from mathematics is problem reduction. A reduction is a general transformation from one class of problems to another such that the solutions to the first class can be found by reducing them to problems in the second class and then solving those.
One notion for recognizing intractable problems is that of NP-completeness: NP-complete problems are those that can be solved in nondeterministic polynomial time. Any problem class to which an NP-complete problem can be reduced is likely to be intractable.
Probability is the principal mathematical tool that we have to represent and reason about uncertainty. Bayes proposed a rule for updating subjective probabilities in the light of new evidence. This rule forms the basis of the modern approach to uncertain reasoning in AI.
Decision theory combines probability theory with utility theory (which provides a framework for specifying the preferences of an agent) to give a general theory that can distinguish good actions from bad ones.
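To make Bayes' rule concrete, here is a small Python sketch; the numbers are purely hypothetical and only illustrate how a subjective probability is updated by new evidence:

```python
def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """P(H|E) = P(E|H)P(H) / [P(E|H)P(H) + P(E|~H)P(~H)]."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1.0 - prior))

# Hypothetical: a hypothesis with a 1% prior, and evidence with a 90%
# true-positive rate and a 5% false-positive rate.
posterior = bayes_update(0.01, 0.90, 0.05)   # the prior rises to about 0.15
```

Even strong evidence only lifts a small prior to about 15% here, which is exactly the kind of trade-off a decision-theoretic agent must weigh.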
Psychology
The principal characteristic of cognitive psychology is the view that the brain possesses and processes information. The claim is that beliefs, goals, and reasoning steps can be useful components of a theory of human behaviour. The knowledge-based agent has three key steps: the stimulus is translated into an internal representation; the representation is manipulated by cognitive processes to derive new internal representations; and these are in turn translated back into actions.
Having a theory of how humans successfully process natural language is an AI-complete problem: if we could solve this problem then we would have created a model of intelligence.
Much of the early work in knowledge representation was done in support of programs that attempted natural language understanding.
The microworlds approach to AI was pioneered in the 1960s and tried to solve problems in limited domains.
The ANALOGY program could solve geometric analogy problems of the kind that appear in IQ tests.
The most famous microworld is the Blocks World (1970s). A command such as "Pick up the red block" could be used to manipulate the world.
Figure 1.3
The microworld approach turned out to have problems because the advances made in writing programs for microworlds turned out not to be generalisable.
Early work in the logicist camp also had problems because of the use of weak methods (these use only weak information about a domain). However, knowledge-intensive approaches have been more successful. A key development from the logicist tradition was knowledge-based systems in the 1980s.
In the late 1980s neural networks became fashionable again (they had been popular in the 1960s) due to improved learning algorithms and faster processors.
AI Time Line
1943 - McCulloch and Pitts propose modelling neurons using on/off devices.
1950s - Claude Shannon and Alan Turing try to write chess-playing programs.
1956 - John McCarthy coins the name "Artificial Intelligence".
1960s - Logic Theorist, GPS, microworlds, neural networks.
1971 - NP-completeness theory (Cook and Karp) casts doubt on the general applicability of AI methods.
1970s - Knowledge-based systems and expert systems.
1980s - AI techniques in widespread use, neural networks rediscovered.
1990s - Deep Blue wins against the world chess champion. Image and speech recognition become practical.
Our aim is to design agents.
A rational agent is one that performs the actions that cause the agent to be most successful.
We use the term performance measure for the criteria that determine how successful an agent is. We will insist on an objective performance measure imposed by some authority.
Example Consider the case of an agent that is supposed to vacuum a dirty floor. A plausible performance measure would be amount of dirt cleaned in a certain period of time. A more sophisticated measure would include the amount of electricity consumed and amount of noise generated.
We need to be careful to distinguish between rationality and omniscience. If an agent is omniscient, it knows the actual outcomes of its actions. Rationality is concerned with expected success given what has been perceived. In other words, we cannot blame an agent for not taking into account something it could not perceive or for failing to take an action that it is not capable of taking.
What is rational at any given time depends on four things: the performance measure that defines success; the agent's percept sequence to date; the agent's knowledge of the environment; and the actions the agent can perform.
Ideal mapping from percept sequences to actions
For an ideal agent, we can simply make a table of the action it should take in response to each possible percept sequence. (For most agents, this would be an infinite table.) This table is called a mapping from the percept sequences to actions.
Specifying which action an agent ought to take in response to any given percept sequence provides a design for an ideal agent.
It is, of course, possible to specify the mapping for an ideal agent without creating a table for every possible percept sequence.
Example: The sqrt agent
The percept sequence for this agent is a sequence of keystrokes representing a number, and an action is to display a number on a screen. The ideal mapping, when the percept sequence represents a positive number x, is to display a positive number z such that z^{2}=x. This specification does not require the designer to actually construct a table of square roots.
Algorithms exist that make it possible to encode the ideal sqrt agent very compactly. It turns out that the same is true for much more general agents.
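One such compact encoding is Newton's method. The sketch below assumes the keystrokes have already been parsed into a float:

```python
def sqrt_agent(percept: float) -> float:
    """Ideal mapping for the sqrt agent: for a positive percept x,
    return z with z*z approximately equal to x -- no lookup table."""
    x = percept
    z = max(x, 1.0)            # initial guess
    for _ in range(100):       # Newton iteration: z <- (z + x/z) / 2
        z = 0.5 * (z + x / z)
    return z
```

A few lines of iteration replace what would otherwise be an infinite table of percept-to-action entries.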
One more requirement for agents: Autonomy
If an agent's actions are based completely on built-in knowledge, such that it need pay no attention to its percepts, then we say that the agent lacks autonomy.
An agent's behaviour can depend both on its built-in knowledge and its experience. A system is autonomous if its behaviour is determined by its own experience.
It seems likely that the most successful agents will have some built-in knowledge and will also have the ability to learn.
The job of AI is to design the agent program: a function that implements the agent mapping from percepts to actions. We assume this program will run on some sort of computing device called the architecture.
The architecture makes the percepts from the sensors available to the agent program, runs the program, and feeds the program's action choices to the effectors as they are generated.
Example A robot designed to inspect parts as they go by on a conveyor belt can make use of a number of simplifying assumptions: that the lighting will always be the same, that the only things on the conveyor belt will be parts of a certain kind, and that there are only two actions: accept and reject.
In contrast, some software agents (softbots) exist in rich unlimited domains, e.g., a robot designed to fly a 747 flight simulator or one designed to scan online news sources and show the interesting items to its customers. For the latter to do well, it must be able to process natural language, learn the interests of its customers and must be able to dynamically change its plans as news sources become available or unavailable.
Agent Programs
All agent programs have roughly the same skeleton; they accept percepts from the environment and generate actions.
Each agent uses some internal data structure that is updated as new percepts arrive. These data structures are operated on by the agent's decision-making procedures to generate an action choice, which is then passed to the architecture for execution. Good data structures are often very important in AI.
Agents will receive only single percepts as input. It is up to the agent to build up the percept sequence in memory if it so desires. In some environments it is possible to be quite successful without storing the percept sequence, and in complex domains it is infeasible to store the complete sequence.
Why not do table lookup?
An agent that can play chess would need a table with about 35^{100} entries.
Furthermore, such an agent has no autonomy at all because the calculation of best actions is entirely builtin. If the environment changed in some way, the agent would be entirely lost.
Learning in the context of such a large table is hopeless.
Example: The taxi driver agent
The full taxi driver task is extremely open-ended: there is no limit to the novel situations that can arise.
We start with percepts, actions, and goals
The taxi driver will need to know where it is, what else is on the road, and how fast it is going. This information can be obtained from the percepts provided by one or more controllable cameras, a speedometer, and an odometer. To control the vehicle properly it should have an accelerometer. It will need to know the state of the vehicle, so it will need the usual array of engine and electrical sensors. It might also have instruments like a GPS to give exact position with respect to an electronic map, and perhaps infrared or sonar sensors. Finally, it will need some way for the customer to communicate the destination.
The actions will include control over the engine through the accelerator pedal and control over steering and braking, some way of talking to passengers, and perhaps some way to communicate with other vehicles.
Performance measures?
Getting to the correct destination, minimizing fuel consumption and wear and tear, minimizing trip time and cost, minimizing traffic violations and disturbances of other drivers, maximizing safety and passenger comfort. Some of these goals conflict and so there will be tradeoffs.
Operating environment?
City streets? Highways? Snow and other road hazards? Driving on the right or on the left?
The more controlled the environment, the easier the problem.
We will now consider four types of agent programs: simple reflex agents, agents that keep track of the world, goal-based agents, and utility-based agents.
Constructing a lookup table is out of the question. The visual input from a simple camera comes in at the rate of 50 megabytes per second, so the lookup table for an hour of driving would have 2^{60×60×50M} entries. However, we can summarize certain portions of the table by noting certain commonly occurring input/output associations. For example, if the car in front brakes, then the driver should also brake.
In other words, some processing is done on the visual input to establish the condition "brake lights in front are on", and this triggers some established connection to the action "start braking". Such a connection is called a condition-action rule, written as: if car-in-front-is-braking then initiate-braking.
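A simple reflex agent built from such condition-action rules can be sketched as follows; the percept encoding and rule set here are illustrative, not part of any real system:

```python
# Each rule pairs a condition on the current percept with an action.
RULES = [
    (lambda p: p.get("car_in_front_is_braking", False), "initiate-braking"),
    (lambda p: p.get("obstacle_ahead", False), "swerve"),
]

def simple_reflex_agent(percept):
    """Choose an action by matching the current percept against the rules."""
    for condition, action in RULES:
        if condition(percept):
            return action
    return "no-op"
```

Note that the agent consults nothing but the current percept, which is exactly the limitation discussed next.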
Agents that keep track of the world
Simple reflex agents only work if the correct action can be chosen based only on the current percept.
Even for the simple braking rule above, we need some sort of internal description of the world state. (To determine if the car in front is braking, we would probably need to compare the current image with the previous to see if the brake light has come on.)
Another example
From time to time the driver looks in the rear view mirror to check on the location of nearby vehicles. When the driver is not looking in the mirror, vehicles in the next lane are invisible. However, deciding on a lane change requires that the driver know the location of vehicles in the next lane.
This problem illustrates that for any complex domain, sensors do not provide access to the complete world state. In such domains, the agent must maintain an internal state that it updates as new sensor information becomes available.
Updating the state requires the agent to have two kinds of information: first, it needs information about how the world changes over time; second, it needs information about how its actions affect the world.
Goal-based agents
Knowing about the state of the world is not always enough for the agent to know what to do next. For example, at an intersection, the taxi driver can either turn left, right, or go straight. Which turn it should make depends on where it is trying to get to: its goal.
Goal information describes states that are desirable and that the agent should try to achieve.
The agent can combine goal information with information about what its actions achieve in order to plan sequences of actions that achieve those goals.
Search and planning are the subfields of AI devoted to finding action sequences that achieve goals.
Decision making of this kind is fundamentally different from condition-action rules, in that it involves consideration of the future. In the reflex agent design this information is not used because the designer has precomputed the correct action for the various cases. A goal-based agent could reason that if the car in front has its brake lights on, it will slow down. From the way in which the world evolves, the only action that will achieve the goal of not hitting the braking car is to slow down. To do so requires hitting the brakes. The goal-based agent is more flexible but takes longer to decide what to do.
Utility-based agents
Goals alone are not enough to generate high-quality behaviour. For example, there are many action sequences that will get the taxi to its destination, but some are quicker, safer, more reliable, cheaper, etc.
Goals just provide a crude distinction between "happy" and "unhappy" states whereas a more general performance measure should allow a comparison of different world states. "Happiness" of an agent is called utility.
Utility can be represented as a function that maps states into real numbers. The larger the number the higher the utility of the state.
A complete specification of the utility function allows rational decisions in two kinds of cases where goals have trouble. First, when there are conflicting goals, only some of which can be achieved (e.g., speed vs. safety), the utility function specifies the appropriate tradeoff. Second, when there are several goals that the agent can aim for, none of which can be achieved with certainty, utility provides a way in which the likelihood of success can be weighed up against the importance of the goals.
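The resulting decision rule can be sketched in a few lines; the `result` model, the routes, and the utility numbers below are all hypothetical:

```python
def utility_based_choice(state, actions, result, utility):
    """Pick the action whose predicted successor state has the highest
    utility, where result(state, action) is the agent's model of what
    each action achieves and utility maps states to real numbers."""
    return max(actions, key=lambda action: utility(result(state, action)))

# Toy example with made-up numbers: two routes to the destination,
# where the utility already folds together speed, safety, and cost.
route_utility = {"motorway": 8.0, "back-roads": 5.0}
best = utility_based_choice("at-depot", list(route_utility),
                            lambda state, action: action,   # successor = route
                            route_utility.__getitem__)
```

Changing the utility numbers (say, penalising the motorway in bad weather) changes the choice without rewriting any rules, which is the point of the utility-based design.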
Properties of environments
Environment Programs
The following pseudocode illustrates the basic relationship between agents and environments. The environment simulator takes one or more agents as input and arranges to repeatedly give each agent the right percepts and receive back an action. The simulator then updates the environment based on the actions and possibly other dynamic processes in the environment that are not considered to be agents.
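A sketch of such a simulator in Python, with illustrative names:

```python
def run_eval_environment(state, update_fn, agents, percept_fn, perf_fn, steps):
    """Repeatedly hand each agent its percept, collect the actions,
    and update the environment state; finally score the outcome."""
    for _ in range(steps):
        actions = [agent(percept_fn(state, i)) for i, agent in enumerate(agents)]
        state = update_fn(state, actions)
    return perf_fn(state)

# Toy environment: the state is a counter and each agent's action adds to it.
score = run_eval_environment(
    state=0,
    update_fn=lambda s, acts: s + sum(acts),
    agents=[lambda percept: 1],          # an agent that always acts with "1"
    percept_fn=lambda s, i: s,           # every agent perceives the full state
    perf_fn=lambda s: s,
    steps=5,
)
```

Here the initial state and the update function fully define the environment, as described below.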
The environment is therefore defined by an initial state and an update function. Obviously, an agent that works in a simulated environment ought also to work in the real thing.
RunEvalEnvironment returns the performance measure for a single environment defined by a single initial state and a particular update function. Usually, an agent is designed to work in an environment class, a whole set of different environments. So, in order to measure the performance of an agent, we need an environment generator that selects particular environments in which to run the agent. We are then interested in average performance.
A possible confusion arises between the state variable in the environment simulator and the state variable in the agent itself. These must be kept strictly separate!
Problem-Solving Agents
Intelligent agents are supposed to act in such a way that the environment goes through a sequence of states that maximizes the performance measure. Unfortunately, this specification is difficult to translate into a successful agent design. The task is simplified if the agent can adopt a goal and aim to satisfy it.
Example Suppose the agent is in Auckland and wishes to get to Wellington. There are a number of factors to consider e.g. cost, speed and comfort of journey.
Goals such as this help to organize behaviour by limiting the objectives that the agent is trying to achieve. Goal formulation, based on the current situation, is the first step in problem solving. In addition to formulating a goal, the agent may wish to decide on some other factors that affect the desirability of different ways of achieving the goal.
We will consider a goal to be a set of states: just those states in which the goal is satisfied.
Actions can be viewed as causing transitions between states.
How can the agent decide on what types of actions to consider?
Problem formulation is the process of deciding what actions and states to consider. For now let us assume that the agent will consider actions at the level of driving from one city to another.
The states will then correspond to being in particular towns along the way.
The agent has now adopted the goal of getting to Wellington, so unless it is already there, it must transform the current state into the desired one. Suppose that there are three roads leaving Auckland but that none of them lead directly to Wellington. What should the agent do? If it does not know the geography it can do no better than to pick one of the roads at random.
However, suppose the agent has a map of the area. The purpose of a map is to provide the agent with information about the states it might get itself into and the actions it can take. The agent can use the map to consider subsequent steps of a hypothetical journey that will eventually reach its goal.
In general, an agent with several intermediate options of unknown value can decide what to do by first examining different possible sequences of actions that lead to states of known value and then choosing the best one.
This process is called search. A search algorithm takes a problem as input and returns a solution in the form of an action sequence. Once a solution is found, the actions it recommends can be carried out. This is called the execution phase. Hence, we have a simple "formulate, search, execute" design for the agent.
Formulating Problems
Formulating problems is an art. First, we look at the different amounts of knowledge that an agent can have concerning its actions and the state that it is in. This depends on how the agent is connected to its environment.
There are four essentially different types of problems.
First, suppose that the agent's sensors give it enough information to tell exactly which state it is in (i.e., the world is completely accessible) and suppose it knows exactly what each of its actions does.
Then it can calculate exactly what state it will be in after any sequence of actions. For example, if the initial state is 5, then it can calculate the result of the action sequence {right, suck}.
This simplest case is called a single-state problem.
Now suppose the agent knows all of the effects of its actions, but world access is limited. For example, in the extreme case, the agent has no sensors so it knows only that its initial state is one of the set {1,2,3,4,5,6,7,8}. In this simple world, the agent can succeed even though it has no sensors. It knows that the action {right} will cause it to be in one of the states {2,4,6,8}. In fact, it can discover that the sequence {right, suck, left, suck} is guaranteed to reach a goal state no matter what the initial state is.
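This reasoning can be checked mechanically. The sketch below encodes the two-square vacuum world as (position, dirt-left, dirt-right) triples (my own encoding of the eight numbered states) and tracks the set of states a sensorless agent might be in:

```python
from itertools import product

# A state is (position, dirt_left, dirt_right); position is "L" or "R".
def step(state, action):
    pos, dirt_l, dirt_r = state
    if action == "left":
        return ("L", dirt_l, dirt_r)
    if action == "right":
        return ("R", dirt_l, dirt_r)
    if action == "suck":   # deterministic suck: clean the current square
        return (pos, dirt_l and pos != "L", dirt_r and pos != "R")
    raise ValueError(action)

def update_belief(belief, action):
    """A sensorless agent applies the action to every state it might be in."""
    return {step(s, action) for s in belief}

belief = set(product("LR", [True, False], [True, False]))   # all 8 states
for action in ["right", "suck", "left", "suck"]:
    belief = update_belief(belief, action)
# The belief set shrinks to the single state "left square, all clean".
```

After {right, suck, left, suck} the agent knows exactly where it is and that both squares are clean, even though it never sensed anything.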
In this case, when the world is not fully accessible, the agent must reason about sets of states that it might get to, rather than single states. This is called the multiple-state problem.
The case of ignorance about the effects of actions can be treated similarly.
Example Suppose the suck action sometimes deposits dirt when there is none. Then if the agent is in state 4, sucking will place it in one of {2,4}. There is still a sequence of actions that is guaranteed to reach the goal state.
However, sometimes ignorance prevents the agent from finding a guaranteed solution sequence. Suppose that the agent has the nondeterministic suck action as above and that it has a position sensor and a local dirt sensor. Suppose the agent is in one of the states {1,3}. The agent might formulate the action sequence {suck, right, suck}. The first action would change the state to one of {5,7}; moving right would then change the state to {6,8}. If the agent is, in fact, in state 6, the plan will succeed, otherwise it will fail. It turns out there is no fixed action sequence that guarantees a solution to this problem.
The agent can solve the problem if it can perform sensing actions during execution. For example, starting from one of {1,3}: first suck dirt, then move right, then suck only if there is dirt there. In this case the agent must calculate a whole tree of actions rather than a single sequence, i.e., plans now have conditionals in them that are based on the results of sensing actions.
For this reason, we call this a contingency problem. Many problems in the real world are contingency problems. This is why most people keep their eyes open when walking and driving around.
Single-state and multiple-state problems can be handled by similar search techniques. Contingency problems require more complex algorithms. They also lend themselves to an agent design in which planning and execution are interleaved.
We will only consider cases where guaranteed solutions consist of a single action sequence.
The last type of problem is the exploration problem. In this type of problem, the agent has no information about the effects of its actions. The agent must experiment, gradually discovering what its actions do and what sorts of states exist. This is search in the real world rather than in a model.
Well-defined problems and solutions
We have seen that the basic elements of a problem definition are the states and actions. To capture these ideas more precisely, we need the following: the initial state; the set of operators (the actions available to the agent); a goal test, which the agent can apply to a state to determine whether it is a goal; and a path cost function that assigns a cost to each path.
A path in the state space is simply any sequence of actions leading from one state to another.
To deal with multiple-state problems, a problem consists of an initial state set; a set of operators specifying, for each action, the set of states reached from any given state; and a goal test and path cost function. The state space is replaced by the state set space.
Measuring problem-solving performance
The effectiveness of a search can be measured in at least three ways: does it find a solution at all; is it a good solution (one with a low path cost); and what is the search cost (the time and memory required to find a solution)?
Toy problems
The 8-puzzle problem
One important trick is to notice that rather than use operators such as "move the 3 tile into the blank space," it is more sensible to have operators such as "the blank space changes places with the tile to its left." This is because there are fewer of the latter kind of operator. This leads to the following formulation:
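The blank-centred formulation can be sketched as follows, where a state is a 9-tuple read row by row with 0 standing for the blank (this encoding is my own):

```python
# Operators move the blank, so there are at most four of them.
MOVES = {"blank-up": -3, "blank-down": 3, "blank-left": -1, "blank-right": 1}

def successors(state):
    """Yield (operator, state) pairs: the blank swaps with a neighbour."""
    b = state.index(0)
    for name, delta in MOVES.items():
        t = b + delta
        if not 0 <= t <= 8:
            continue                              # off the board
        if delta in (-1, 1) and t // 3 != b // 3:
            continue                              # no wrapping between rows
        s = list(state)
        s[b], s[t] = s[t], s[b]
        yield name, tuple(s)

centre = (1, 2, 3, 4, 0, 5, 6, 7, 8)   # blank in the middle: four operators
corner = (0, 1, 2, 3, 4, 5, 6, 7, 8)   # blank in a corner: only two
```

With tile-moving operators there would be eight operator schemes (one per tile); with blank-moving operators there are never more than four.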
The Travelling Salesperson Problem is a famous touring problem in which each city in a network of connected cities must be visited exactly once. The goal is to find the shortest tour. This problem is NP-hard.
VLSI layout: a VLSI chip can have millions of gates, and the positioning of, and connections to, every gate are crucial to the successful operation of the chip. CAD tools are used in every phase of the process. Two difficult subtasks are cell layout and channel routing. The goal in cell layout is to place every cell on the chip with no overlap and enough room in between to place connecting wires. There are additional constraints, such as placing particular cells near each other. Channel routing is the process of placing the connecting wires.
Robot navigation: guide a robot about Mars.
Assembly sequencing: robot assembly of complex objects.
Generating action sequences
To solve the route planning example problem, the first step is to test whether the current state is the goal state. If not, we must consider some other states. This is done by applying the operators to the current state, thereby generating a new set of states. This process is called expanding the state. Whenever there are multiple possibilities, a choice must be made about which one to consider further.
This is the essence of search: choosing one option and putting the others aside for later, in case the first choice does not lead to a solution. We continue choosing, testing, and expanding until a solution is found or until there are no more states that can be expanded. The choice of which state to expand first is determined by the search strategy.
It is helpful to think of the process as building a search tree that is superimposed over the state space. The root of the tree is a search node corresponding to the initial state. The leaves of the tree are nodes that do not have successors either because they are goal states or because no operators can be applied to them.
The general search algorithm is described as follows:
Data structures for search trees
We will assume a search tree node is a five-component structure with the following components: the state the node corresponds to; the node that generated it (its parent); the operator that was applied to generate it; the node's depth in the tree; and the path cost from the initial state to the node.
The EXPAND function is responsible for calculating each of the five components above for the nodes that it generates.
We also need to represent the collection of nodes that are waiting to be expanded: this collection is called the fringe or frontier. For efficiency, the collection of nodes on the fringe will be implemented as a queue, using operations such as MAKE-QUEUE, EMPTY?, REMOVE-FRONT, and QUEUING-FN (which inserts new elements according to the search strategy).
Now we can write a more formal version of the general search algorithm.
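A Python sketch of that general algorithm, parameterised by the queuing function, which is all that distinguishes the uninformed strategies below:

```python
def general_search(initial, goal_test, successors, queuing_fn):
    """GENERAL-SEARCH sketch: the queuing function decides where newly
    generated nodes join the fringe."""
    fringe = [(initial, [])]               # node = (state, path of actions)
    while fringe:
        state, path = fringe.pop(0)        # always expand the front node
        if goal_test(state):
            return path
        children = [(s, path + [a]) for a, s in successors(state)]
        fringe = queuing_fn(fringe, children)
    return None                            # failure

def enqueue_at_end(fringe, new):           # -> breadth-first behaviour
    return fringe + new

def enqueue_at_front(fringe, new):         # -> depth-first behaviour
    return new + fringe

# Tiny illustrative state space: successors(state) -> [(action, state)].
graph = {"A": [("A->B", "B"), ("A->C", "C")],
         "B": [("B->G", "G")], "C": [("C->G", "G")], "G": []}
plan = general_search("A", lambda s: s == "G", graph.__getitem__, enqueue_at_end)
```

This simple version does not detect repeated states, so with ENQUEUE-AT-FRONT it can loop forever on state spaces with cycles.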
Even though uninformed search is less effective than heuristic search, uninformed search is still important because there are many problems for which information used to make informed choices is not available.
We now present six uninformed search strategies:
Breadth-first search
In this strategy, the root node is expanded first, then all of the nodes generated by the root are expanded before any of their successors. Then these successors are all expanded before any of their successors. In general, all the nodes at depth d in the search tree are expanded before any nodes at depth d+1. This kind of search can be implemented by using the queuing function ENQUEUE-AT-END.
Breadth-first search is complete, and it is optimal provided the path cost is a nondecreasing function of the depth of the node. Let's look at the amount of time and memory it requires. Consider a hypothetical search tree with b successors for every node. Such a tree is said to have a branching factor of b. If the first solution is at depth d, then the maximum number of nodes expanded before reaching a solution is 1+b+b^{2}+b^{3}+...+b^{d}.
This is not good. For example consider the following table for b=10, 1 node/ms, 100 bytes/node:
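Rows of such a table can be computed directly under those assumptions (b = 10, one node per millisecond, 100 bytes per node):

```python
def bfs_cost(depth, b=10, nodes_per_second=1000, bytes_per_node=100):
    """Nodes generated, seconds, and bytes for breadth-first to a depth."""
    nodes = sum(b ** i for i in range(depth + 1))   # 1 + b + ... + b^d
    return nodes, nodes / nodes_per_second, nodes * bytes_per_node

nodes, seconds, memory = bfs_cost(4)
# Depth 4 costs 11,111 nodes, about 11 seconds and roughly 1 megabyte;
# by depth 8 the figures reach about 31 hours and 11 gigabytes.
```

The exponential blow-up is visible after only a few rows.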
Note that memory requirements are a bigger problem here than execution time. In general, exponential complexity search problems cannot be solved for any but the smallest instances.
Uniform cost search (Dijkstra's Algorithm)
This strategy modifies breadth-first search to account for general path cost. Now the lowest cost node on the fringe is always the first one expanded (recall that the cost of a node is the cost of the path to that node).
When certain conditions are met, the first solution found is guaranteed to be the cheapest.
Consider an example
Uniform cost search finds the cheapest solution provided the cost of a path never decreases as it gets longer, i.e., g(SUCCESSOR(n)) >= g(n) for every node n.
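A sketch of uniform cost search using a priority queue keyed on g, the path cost so far (the road map is a made-up example):

```python
import heapq

def uniform_cost_search(start, goal, neighbours):
    """neighbours(state) -> iterable of (successor, step_cost >= 0)."""
    fringe = [(0, start, [start])]          # entries are (g, state, path)
    expanded = set()
    while fringe:
        g, state, path = heapq.heappop(fringe)
        if state == goal:
            return g, path                  # first goal popped is cheapest
        if state in expanded:
            continue
        expanded.add(state)
        for nxt, cost in neighbours(state):
            heapq.heappush(fringe, (g + cost, nxt, path + [nxt]))
    return None

# Toy map: the direct road S->G is dearer than going via B.
roads = {"S": [("A", 1), ("B", 5), ("G", 12)],
         "A": [("G", 10)], "B": [("G", 5)], "G": []}
```

The direct road costs 12 and the route via A costs 11, but the search correctly returns the route via B at cost 10.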
Depth-first search
This search strategy always expands one node to the deepest level of the tree. Only when a dead end is encountered does the search back up and expand nodes at shallower levels.
This strategy can be implemented by calling GENERAL-SEARCH with a queuing function that always puts the newest node on the front (ENQUEUE-AT-FRONT).
Depth-first search has modest memory requirements, storing only a single path from root to the current node along with the remaining unexpanded sibling nodes for each node on the path. For a state space with branching factor b and maximum depth m, depth-first search requires storage of only bm nodes, in contrast to b^{d} for breadth-first search. Using the same assumptions as our analysis of breadth-first search, depth-first search would require 12 kilobytes instead of 111 terabytes at depth 12, a factor of 10 billion times less space.
For problems that have a lot of solutions, depth-first search is often faster than breadth-first search because it has a good chance of finding a solution after exploring only a small portion of the search space. However, depth-first search can get stuck on a wrong path in infinite search spaces.
It is common to implement depthfirst search with a recursive function that calls itself on each of its children in turn. In this case, the queue is stored implicitly in the local state of each invocation on the calling stack.
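Such a recursive implementation might look like the following sketch (assuming a finite, acyclic state space, with the same illustrative `successors`/`is_goal` callbacks as before):

```python
def depth_first_search(node, successors, is_goal, path=None):
    """Recursive depth-first search; the 'queue' lives implicitly
    on the call stack. Assumes a finite, acyclic state space."""
    path = (path or []) + [node]
    if is_goal(node):
        return path
    for child in successors(node):
        result = depth_first_search(child, successors, is_goal, path)
        if result is not None:
            return result
    return None                      # dead end: back up to the caller
```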
Depth-limited search
This strategy avoids the pitfalls of depth-first search by imposing a cutoff on the maximum depth of a path. Depth-first search is used to search to the given depth.
Depth-limited search is complete but not optimal, and if we choose a depth limit that is too shallow it is not even complete.
Iterative deepening search
The hard part about depth-limited search is picking the right depth limit. Most of the time we will not know what to pick for a depth limit. Iterative deepening search is a strategy that sidesteps the issue of choosing the depth limit by trying all possible limits in increasing order.
In effect, this search strategy keeps the best features of both breadth-first and depth-first search. It has the modest memory requirements of depth-first search but is optimal and complete like breadth-first search. Also it does not get stuck in dead ends.
Iterative deepening may seem wasteful because many states are expanded multiple times. For most problems, however, the overhead of this multiple expansion is actually rather small because a majority of the nodes are at the bottom of the tree.
So instead of:
1 + b + b^{2} + ... + b^{d-2} + b^{d-1} + b^{d}
we have:
(d+1)·1 + d·b + (d-1)·b^{2} + ... + 3·b^{d-2} + 2·b^{d-1} + 1·b^{d}
For b=10 and d=5, the totals are 111,111 and 123,456, so iterative deepening is about 11% more costly than a single depth-limited search in this case.
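These totals are easy to check with a short script evaluating the two sums above (a sketch; the function names are made up for the example):

```python
def bfs_nodes(b, d):
    # 1 + b + b^2 + ... + b^d  (single search to depth d)
    return sum(b**i for i in range(d + 1))

def ids_nodes(b, d):
    # (d+1)*1 + d*b + (d-1)*b^2 + ... + 1*b^d  (iterative deepening)
    return sum((d + 1 - i) * b**i for i in range(d + 1))

print(bfs_nodes(10, 5), ids_nodes(10, 5))  # 111111 123456
```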
In terms of complexity, iterative deepening has the same asymptotic time complexity as breadth-first search, O(b^{d}), but has space complexity of only O(bd).
In general, iterative deepening is the preferred search method when
there is a large search space and the depth of the solution is unknown.
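A sketch of depth-limited search and the iterative deepening loop around it (the callback names and the `max_depth` safety cap are illustrative assumptions):

```python
def depth_limited_search(node, successors, is_goal, limit, path=None):
    """Depth-first search cut off at depth `limit`; returns a path or None."""
    path = (path or []) + [node]
    if is_goal(node):
        return path
    if limit == 0:
        return None                  # cutoff reached
    for child in successors(node):
        result = depth_limited_search(child, successors, is_goal,
                                      limit - 1, path)
        if result is not None:
            return result
    return None

def iterative_deepening_search(start, successors, is_goal, max_depth=50):
    """Try depth limits 0, 1, 2, ... until a solution appears."""
    for limit in range(max_depth + 1):
        result = depth_limited_search(start, successors, is_goal, limit)
        if result is not None:
            return result
    return None
```

Because the limit grows one level at a time, the first solution found is a shallowest one, which is where the breadth-first-like optimality comes from.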
Bidirectional Search
The idea here is to search forward from the initial state and backward
from the goal state simultaneously. When it works well, the search meets
in the middle. For problems where the branching factor is b in both directions,
bidirectional search can make a big difference.
The solution will be found in O(b^{d/2}) steps because both searches have to go only half way. For example, if b=10 and d=6, breadth-first search generates 1,111,111 nodes, whereas bidirectional search generates only 2,222 nodes.
Even though this sounds great in theory, there are several issues that
must be addressed before this algorithm can be implemented.
Even when the tree is finite, avoiding repeated states can yield exponential speedups. The classic example is a space that contains only m+1 states; the tree contains every possible path, which comes to 2^{m} branches.
There are three ways to deal with repeated states, in increasing order of effectiveness and complexity:
Example The 8-queens problem can be posed as a CSP in which the variables are the locations of the eight queens; the possible values are squares on the board; and the constraints state that no two queens can be in the same row, column, or diagonal.
Cryptarithmetic and VLSI layout can also be posed as CSPs.
Constraints can be unary, binary, or higher-order:
Each variable v_{i} in a CSP has a domain D_{i}, which is the set of possible values that the variable can take on. The domain can be either discrete or continuous.
A unary constraint specifies the allowable subset of a variable's domain, and a binary constraint specifies the allowable subset of the cross-product of the two domains. In discrete CSPs with finite domains, constraints can be represented simply by enumerating the allowable combinations of values.
Example In 8-queens, the no-attack constraint can be represented by a set of pairs of allowable values: {<1,3>, <1,4>, <1,5>, ..., <2,4>, <2,5>, ...}.
Constraints involving continuous variables cannot be enumerated in this way. Here we consider only CSPs with discrete, absolute, binary or unary constraints.
The initial state will be the state in which all variables are unassigned. Operators will assign a value to a variable from the set of possible values. The goal test will check if all variables are assigned and all constraints satisfied.
We can use depth-limited search because all solutions are of depth n, where n is the number of variables.
In the most naive implementation, any unassigned variable in a state can be assigned a value by an operator. Hence, the branching factor is the sum of the cardinalities of the different domains, i.e., Σ_{i}|D_{i}|, e.g., 64 in the 8-queens problem. Then the search space would be 64^{8}. But note that the order of assignments makes no difference to the final solution, so we can choose a value for a single variable at each node. This results in a search space that is Π_{i}|D_{i}|, or 8^{8} in 8-queens.
A straightforward depth-limited search will examine all of these possible states. CSPs include some well-known NP-complete problems, so in the worst case any algorithm for solving CSPs will be exponential.
In most real problems, however, we can take advantage of problem structure to eliminate large parts of the search space.
In CSPs, the goal test is a set of constraints which can be considered individually, as opposed to all together. Depth-limited search on a CSP wastes time searching when constraints have already been violated. For example, suppose we put the first two queens in the top row. DLS will examine all 8^{6} possibilities for the remaining queens before discovering that no solution exists in that subtree.
A substantial improvement is to insert a test before the successor generation step to check whether any constraint has been violated by the variable assignments up to this point. The resulting algorithm backtracks immediately to try something else. This is called backtracking search.
Backtracking still has some failings. Suppose the squares chosen for the first six queens make it impossible to place the seventh queen, because they attack all 8 squares in the last column. Backtracking will still try all possible placings of the seventh queen. Forward checking avoids this problem by looking ahead to detect unsolvability. Each time a variable is instantiated, forward checking deletes from the domains of the as-yet-uninstantiated variables all of those values that conflict with the variables assigned so far. If any of the domains becomes empty, then the search backtracks immediately.
Forward checking is a special case of arc consistency checking. A state is arcconsistent if every variable has a value in its domain that is consistent with each of the constraints on that variable. This can be achieved by successive deletion of values that are inconsistent with some constraint. As values are deleted, other values may become inconsistent because they relied on the deleted values. This behaviour is called constraint propagation: choices are gradually narrowed down. Sometimes arc consistency is all that is needed to completely solve a problem.
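The backtracking-with-forward-checking idea can be sketched as follows; the function name and the `conflicts(x, vx, y, vy)` callback, which reports whether two assignments violate a constraint, are assumed interfaces made up for the example:

```python
def backtracking_fc(variables, domains, conflicts, assignment=None):
    """Backtracking search with forward checking (a sketch)."""
    assignment = assignment or {}
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        # Forward check: delete conflicting values from the domains of
        # the as-yet-uninstantiated variables; give up on an empty domain.
        pruned, ok = {}, True
        for other in variables:
            if other == var or other in assignment:
                continue
            remaining = [w for w in domains[other]
                         if not conflicts(var, value, other, w)]
            if not remaining:
                ok = False
                break
            pruned[other] = remaining
        if ok:
            new_domains = {**domains, **pruned, var: [value]}
            result = backtracking_fc(variables, new_domains, conflicts,
                                     {**assignment, var: value})
            if result is not None:
                return result
    return None
```

Because each assignment prunes the future variables' domains, values left in a domain are always consistent with everything assigned so far, so no separate consistency test is needed when a variable is chosen.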
Minimize estimated cost to reach a goal: Greedy search
The simplest best-first strategy is to minimize the estimated cost to reach the goal, i.e., always expand the node that appears to be closest to the goal. A function that calculates such cost estimates is called a heuristic function.
To get an idea of what a heuristic function looks like, let's look at a particular problem. Here we can use as h the straight-line distance to the goal. To do this, we need the map coordinates of each city. This heuristic works because roads tend to head in more or less of a straight line.
This figure shows the progress of a greedy search to find a path from
Arad to Bucharest.
For this problem, greedy search leads to a minimal-cost search because no node off the solution path is expanded. However, it does not find the optimal path: the path it found via Sibiu and Fagaras to Bucharest is 32 miles longer than the path through Rimnicu Vilcea and Pitesti.
Hence, the algorithm always chooses what looks locally best, rather than worrying about whether or not it will be best in the long run. (This is why it's called greedy search.)
Greedy search is susceptible to false starts. Consider the problem of getting from Iasi to Fagaras. h suggests that Neamt be expanded first, but it is a dead end. The solution is to go first to Vaslui and then continue to Urziceni, Bucharest and Fagaras.
Note that if we are not careful to detect repeated states, the solution will never be found; the search will oscillate between Neamt and Iasi.
Greedy search resembles depth-first search in the way that it prefers to follow a single path to the goal, backing up only when a dead end is encountered. It suffers from the same defects as depth-first search: it is not optimal, and it is incomplete because it can start down an infinite path and never try other possibilities.
The worst-case time complexity for greedy search is O(b^{m}), where m is the maximum depth of the search space. Its space complexity is the same as its time complexity, but the worst case can be substantially reduced with a good heuristic function.
Minimizing the total path cost: A^{*} search
Recall that uniform cost search minimizes the cost of the path so far, g(n): it is optimal and complete, but can be very inefficient. We can combine the use of g(n) and h(n) simply by summing them: f(n) = g(n) + h(n).
Hence, if we are trying to find the cheapest solution, a reasonable thing to try is the node with the lowest value for f. It is possible to show that this strategy is complete and optimal, given a simple restriction on the h function: h must never overestimate the cost to reach the goal. Such an h is called an admissible heuristic.
If h is admissible, f(n) never overestimates the actual cost of the best solution through n.
Best-first search using f and an admissible h is known as A^{*} search.
Notice that h_{sld} (straight-line distance) is an admissible heuristic because the shortest path between two points is a straight line. The following diagram shows the first few steps of the A^{*} search for Bucharest using h_{sld}.
Notice that A^{*} prefers to expand from Rimnicu Vilcea rather
than Fagaras. Even though Fagaras is closer to Bucharest, the path taken
to get to Fagaras is not as efficient in getting close to Bucharest
as the path taken to get to Rimnicu.
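A minimal sketch of A^{*} in Python, always expanding the frontier node with the lowest f = g + h; the `successors` and `h` callbacks are assumptions for the example:

```python
import heapq

def a_star(start, successors, h, is_goal):
    """A* search: expand the frontier node with lowest f = g + h.
    `h(n)` must never overestimate the true cost (admissible)."""
    frontier = [(h(start), 0, start, [start])]   # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if is_goal(node):
            return g, path
        for child, cost in successors(node):
            new_g = g + cost
            if new_g < best_g.get(child, float('inf')):
                best_g[child] = new_g
                heapq.heappush(frontier,
                               (new_g + h(child), new_g, child, path + [child]))
    return None
```

With h = 0 this degenerates to uniform cost search, which matches the "circular contours" observation below.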
The behaviour of A^{*} search
An important thing to notice about the example A^{*} search above: along any path from the root, the f-cost never decreases. This fact holds true for almost all admissible heuristics. A heuristic with this property is said to exhibit monotonicity.
Monotonicity is not quite the same as admissibility. In a few cases, heuristics are admissible but not monotonic.
When our heuristics are not monotonic, we can make them so; hence we will assume that if a heuristic is admissible it is also monotonic.
Now, we can take a look at what is going on in A^{*}. If f never decreases along any path out of the root, we can conceptually draw contours in the state space.
Because A^{*} expands the leaf node of lowest f, an A^{*}
search fans out from the start node, adding nodes in concentric bands of
increasing fcost.
With uniform-cost search (A^{*} using h=0), the bands will be "circular" around the start node. With more accurate heuristics, the bands will stretch towards the goal state and become narrowly focused around the optimal path. If we define f^{*} to be the cost of the optimal solution path, we can say:
A^{*} is optimally efficient for any given admissible heuristic function. This is because any algorithm that does not expand all nodes in the contours between the root and the goal runs the risk of missing the optimal solution.
Complexity of A^{*}
The catch with A^{*} is that even though it is complete, optimal and optimally efficient, it still can't always be used, because for most problems the number of nodes within the goal contour is still exponential in the length of the solution.
Similarly to breadthfirst search, however, the major difficulty with A^{*} is the amount of space that it uses.
If we want to find shortest solutions, we need a function that never overestimates the number of steps to the goal. Here are two possibilities:
Usually, the effective branching factor of a heuristic is fairly constant over problem instances; hence, experimental measurements of b^{*} on a small set of problems provide a good approximation of the heuristic's usefulness.
A well-designed heuristic should have a value of b^{*} that is close to 1. The following is a table of the performance of A^{*} with h1 and h2 above on 100 randomly generated problems. It shows that h2 is better than h1 and that both are much better than iterative deepening search.
Note that h2 is always a better heuristic function than h1.
Inventing heuristic functions
Is it possible for heuristic functions to be invented mechanically?
In many cases, the answer is yes. In the 8-puzzle, h1 and h2 can be considered perfectly accurate path lengths for simplified versions of the problem. That is, if the rules of the puzzle were changed so that a tile could be moved anywhere instead of just into an opening, h1 would accurately give the number of steps in the shortest solution.
Is there a simplified problem in which h2 gives correct path lengths? Yes: one in which a tile could move one square in any direction, even onto an occupied square.
A problem with fewer restrictions on its operators is called a relaxed problem. It is often the case that the cost of an exact solution to a relaxed problem is a good heuristic for the original problem.
If a problem definition is written down in a formal language, it is possible to automatically generate relaxations. Again in the 8-puzzle, the operators can be described as:
It turns out that if there are several admissible heuristic functions with none clearly dominating the others, they can be combined by taking the maximum: h(n) = max(h_{1}(n), ..., h_{m}(n)).
Often it is possible to pick out features of a state that contribute to its heuristic evaluation, even if it is hard to say what the contribution should be. For example, the goal in chess is to checkmate and the relevant features include number of pieces of each kind belonging to each side, the number of pieces that are attacked by opponent's pieces, etc.
Usually, the evaluation function is assumed to be a linear combination of the features and even if we don't know the weights, it is possible to use a learning algorithm to acquire them.
We have not considered the cost of computing the heuristic function in our discussion. So long as it is comparable to the cost of expanding a node, this is not an issue; but when computing the heuristic costs as much as expanding several hundred nodes, it can no longer be ignored.
Heuristics for constraint satisfaction
We have so far examined variants of depth-first search for solving CSPs. Now we extend the analysis by considering heuristics for selecting a variable to instantiate and for choosing a value for the variable.
Suppose we are trying to colour the following map with three colours
so that no two adjacent regions have the same colour.
Clearly the best next move is to colour region E first because the
only possible colour for it is blue. All the other regions have a choice
and we might make the wrong one requiring us to backtrack.
This idea is called the most-constrained-variable heuristic. It is used with forward checking. At each point in the search, the variable with the fewest possible values is chosen to have a value assigned next. In this way the branching factor in the search tends to be minimized. When this heuristic is used in solving the n-queens problem, the maximum feasible value of n goes from 30 for forward checking alone to around 100.
The most-constraining-variable heuristic attempts to reduce the branching factor on future choices by assigning a value to the variable that is involved in the largest number of constraints on other unassigned variables.
Once a variable has been selected, a value must still be chosen for it. For example, suppose we decide to assign a value to region C after A and B. Red is a better choice than blue because it leaves more freedom for future choices. This is the least-constraining-value heuristic: choose a value that rules out the smallest number of values in variables connected to the current variable by constraints. When applied to the n-queens problem, it allows problems up to n=1000 to be solved.
IDA^{*} is complete and optimal, but because it is depth-first it only requires space proportional to the longest path that it explores; bd is a good estimate of the storage requirement.
The algorithm's time complexity depends on the number of different values that the h function can take on. The city-block distance function used in the 8-puzzle takes on only a small number of integer values, so typically f only increases two or three times along any solution path. Hence, IDA^{*} only goes through two or three iterations. Because IDA^{*} does not need to maintain a queue of nodes, its overhead per node is lower.
Unfortunately, IDA^{*} has difficulty in more complex domains. In the travelling salesperson problem, for example, the heuristic value is different for every state. As a result, the search is much more like iterative deepening: if A^{*} expands N nodes, IDA^{*} will have to go through N iterations and will therefore expand 1+2+...+N=O(N^{2}) nodes.
One way around this problem is to increase the f-cost limit by a fixed amount ε on each iteration, so that the total number of iterations is proportional to 1/ε. This can reduce the search cost at the expense of returning solutions that can be worse than optimal by at most ε. Such an algorithm is called ε-admissible.
SMA^{*} search
IDA^{*}'s difficulties in some problem spaces can be traced to using too little memory. Between iterations, it retains only a single number, the current f-cost limit.
SMA^{*} (Simplified Memory-bounded A^{*}) can make use of all available memory to carry out the search. SMA^{*} has the following properties:
When SMA^{*} needs to make space on its queue, it drops a node from the queue. It prefers to drop unpromising nodes, that is, nodes with high f-cost. To avoid re-exploring subtrees that it has dropped from memory, it retains in the ancestor nodes information about the quality of the best path in the forgotten subtree. In this way, it only regenerates a subtree when all other paths have been shown to look worse than the path it has forgotten.
Example A hand simulation of SMA^{*} illustrates how it works.
In this case, it would return D which is a reachable, nonoptimal solution.
Given a reasonable amount of memory, SMA^{*} can solve significantly more difficult problems than A^{*} without incurring significant overhead in terms of extra nodes generated. It performs well on problems with highly connected state spaces and real-valued heuristics, on which IDA^{*} has difficulty.
For this kind of algorithm, consider a grid containing all of the problem states. The height of any point corresponds to the value of the evaluation function on the state at that point.
The idea of iterative improvement algorithms is to move around the grid, trying to find the highest peaks, which are optimal solutions.
These algorithms usually keep track only of the current state and do not look ahead beyond immediate neighbors.
There are two major classes of iterative improvement algorithms: hill-climbing algorithms try to make changes that improve the current state; simulated annealing algorithms sometimes make changes that make things worse.
Hill-Climbing Search
Always move in a direction of increasing quality. This simple policy has three well-known drawbacks:
Ridges: a ridge can have steeply sloping sides, so that the search reaches the top with ease, but the top may slope only gently toward a peak. Unless there happen to be operators that move directly along the top of the ridge, the search may oscillate from side to side, making little or no progress.
Simulated Annealing
Allow hillclimbing to take some downhill steps to escape local maxima.
The innermost loop of simulated annealing is quite similar to hill-climbing. Instead of picking the best move, however, it picks a random move. If the move actually improves the situation, it is executed. Otherwise, the algorithm makes the move with some probability less than 1. The probability decreases exponentially with the badness of the move, the amount ΔE by which the evaluation is worsened.
A second parameter T is also used to determine the probability. At higher values of T, bad moves are more likely to be allowed. As T tends towards 0, they become more and more unlikely, until the algorithm behaves like hill-climbing. The schedule input determines the value of T as a function of how many cycles have already been completed.
Simulated annealing was developed from an explicit analogy with annealing, the process of gradually cooling a liquid until it freezes. The value function corresponds to the total energy of the atoms in the material and T corresponds to the temperature. The schedule determines the rate at which the temperature is lowered. Individual moves in the state space correspond to random fluctuations due to thermal noise. One can prove that if the temperature is lowered sufficiently slowly, the material will attain a lowest-energy configuration. This corresponds to the statement that if the schedule lowers T slowly enough, the algorithm will find a global optimum.
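The inner loop described above can be sketched as follows; `value`, `neighbour`, and `schedule` are caller-supplied callbacks, and the function name and `steps` cap are assumptions for the example:

```python
import math
import random

def simulated_annealing(state, value, neighbour, schedule, steps=10000):
    """Hill-climbing that sometimes accepts downhill moves with
    probability exp(delta_E / T), where T follows schedule(t)."""
    current = state
    for t in range(steps):
        T = schedule(t)
        if T <= 0:
            break
        nxt = neighbour(current)
        delta_e = value(nxt) - value(current)
        # Always accept improvements; accept a bad move (delta_e < 0)
        # with probability e^{delta_e / T}, which shrinks as T cools.
        if delta_e > 0 or random.random() < math.exp(delta_e / T):
            current = nxt
    return current
```

As T decays toward 0, `math.exp(delta_e / T)` underflows to 0 for any bad move, so the loop smoothly turns into plain hill-climbing.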
Applications to constraint satisfaction problems
CSPs can be solved by iterative improvement methods by first assigning values to all the variables and then applying modification operators to move the configuration towards a solution. These operators simply assign a different value to a variable, e.g., moving a queen to a different square in 8-queens.
Algorithms that solve CSPs in this fashion are called heuristic repair methods. In choosing a new value for a variable, the most obvious heuristic is to select the value that results in the minimum number of conflicts with other variables. This is called the min-conflicts heuristic.
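A sketch of heuristic repair with the min-conflicts heuristic on n-queens (one queen per row; the function name, step cap, and tie-breaking policy are assumptions for the example):

```python
import random

def min_conflicts_queens(n, max_steps=10000):
    """Heuristic repair for n-queens: start from a random complete
    assignment, then repeatedly move a conflicted queen to the column
    that minimises conflicts (the min-conflicts heuristic)."""
    cols = [random.randrange(n) for _ in range(n)]    # cols[row] = column

    def conflicts(row, col):
        # Count queens in other rows attacking square (row, col).
        return sum(1 for r in range(n) if r != row and
                   (cols[r] == col or abs(cols[r] - col) == abs(r - row)))

    for _ in range(max_steps):
        conflicted = [r for r in range(n) if conflicts(r, cols[r]) > 0]
        if not conflicted:
            return cols                               # no attacks: solved
        row = random.choice(conflicted)
        best = min(conflicts(row, c) for c in range(n))
        cols[row] = random.choice(                    # break ties randomly
            [c for c in range(n) if conflicts(row, c) == best])
    return None
```

Random tie-breaking matters here: a deterministic choice can cycle between the same few states on a plateau.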
Need a way to rank game outcomes. The ranking function is called a utility function. A simple utility function may give -1, 0 or +1 for a loss, a draw and a win.
Consider noughts and crosses with two players called MAX and MIN. If
MAX is X and goes first then the tree is as follows.
MAX will always try to maximize the utility function for the terminal state and MIN will always try to minimise it.
An exhaustive search of the game tree for a solution is usually not
possible.
Consider the following simple game tree.
The utility function is only applied to terminal nodes. If MAX makes move A_{1} then MIN should make move A_{11}. Thus the result of making move A_{1} is a value of 3 for the utility function. Similarly, A_{2} leads to a value of 2 and A_{3} to a value of 2. MAX wants to maximise the utility function and so A_{1} is the best move to make.
Minimax consists of 5 steps.
If the maximum depth of the tree is m and there are b moves at each node, then the time complexity of minimax is O(b^{m}). The algorithm is depth-first, so the memory requirement is O(bm).
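Plain minimax can be sketched as a short recursive function; the `children` and `utility` callbacks are assumed interfaces, and the example tree matches the one discussed above (A_{1} gives 3, A_{2} gives 2, A_{3} gives 2):

```python
def minimax(node, depth, maximizing, children, utility):
    """Plain depth-first minimax. `children(node)` lists successor
    states ([] at terminals) and `utility` scores terminal nodes."""
    succ = children(node)
    if depth == 0 or not succ:
        return utility(node)
    values = [minimax(c, depth - 1, not maximizing, children, utility)
              for c in succ]
    return max(values) if maximizing else min(values)

# The example tree from the text: MAX to move at the root.
tree = {'R': ['A1', 'A2', 'A3'],
        'A1': [3, 12, 8], 'A2': [2, 4, 6], 'A3': [14, 5, 2]}
children = lambda n: tree.get(n, [])   # leaves (ints) have no children
print(minimax('R', 3, True, children, lambda n: n))  # -> 3
```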
Instead of doing this, we could use a heuristic evaluation function to estimate which moves lead to better terminal states. Now a cutoff test must be used to limit the search.
Choosing a good evaluation function is very important to the efficiency of this method. The evaluation function must agree with the utility function for terminal states and it must be a good predictor of the terminal values. If the evaluation function is infallible then no search is necessary: just choose the move that gives the best position. The better the evaluation function, the less search needs to be done.
e.g. the following shows some evaluations of chess positions.
The evaluation function is often a linear combination of features.
In general:
The analysis may stop just before your opponent captures one of your pieces, or just before you capture one of your opponent's pieces.
This is called the horizon effect: a good (or bad) move may be just over the horizon.
e.g. consider the following chess position. Black is ahead in material, but if White can reach the eighth row with its pawn then White can win. Black can stall this for some time by checking the white king, and so will never see that this is a bad position.
If you have an idea that is surely bad, do not take time to see how terrible it is.
Consider the following example:
Search proceeds exactly as for minimax, evaluating A_{11}, A_{12}, A_{13} and A_{1} until, after looking at A_{21}, we find that A_{2} must have a utility no greater than 2. We now know that A_{2} is a bad choice because A_{1} has a higher utility, and we need not evaluate any more branches below A_{2}.
The following diagram shows this for a general case.
If we have a better choice m, either at the parent node of n
or at any choice point further up, then n will never be reached
in actual play.
Minimax is depth first, so all that need be remembered is the best
choice found so far for the player (MAX) and the best choice found so far
for the opponent (MIN). These are called alpha and beta respectively.
The following algorithm illustrates this.
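A Python sketch of minimax with alpha-beta pruning (the `children` and `utility` callbacks are illustrative assumptions, as in the minimax sketch):

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, utility):
    """Minimax with alpha-beta pruning: alpha is the best value found
    so far for MAX, beta the best for MIN; prune when alpha >= beta."""
    succ = children(node)
    if depth == 0 or not succ:
        return utility(node)
    if maximizing:
        value = float('-inf')
        for c in succ:
            value = max(value, alphabeta(c, depth - 1, alpha, beta,
                                         False, children, utility))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                      # beta cut-off
        return value
    else:
        value = float('inf')
        for c in succ:
            value = min(value, alphabeta(c, depth - 1, alpha, beta,
                                         True, children, utility))
            beta = min(beta, value)
            if alpha >= beta:
                break                      # alpha cut-off
        return value
```

On the A_{1}/A_{2}/A_{3} example above, the cut-off fires as soon as A_{21} shows that A_{2} can be worth at most 2, exactly as described.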
Suppose the tree is ordered as follows:
[Game-tree figure: branching factor 3, depth 3; leaf values, left to right: 14 15 16 17 18 19 20 21 22 13 14 15 26 27 28 29 30 31 12 13 14 35 36 37 38 39 40. Certain nodes are marked (*).]
Only those nodes marked (*) need be evaluated.
How many static evaluations are needed?
If b is the branching factor (3 above) and d is the depth (3 above), then the number of static evaluations s for a perfectly ordered tree is:
s = 2b^{d/2} - 1 if d is even
s = b^{(d+1)/2} + b^{(d-1)/2} - 1 if d is odd
For our tree d=3, b=3 and so s=11.
This holds only for the perfectly arranged tree. It gives a lower bound on the number of evaluations of approximately 2b^{d/2}. The worst case is b^{d} (plain minimax).
In practice, for reasonable games, the complexity is O(b^{3d/4}).
Using minimax with alpha-beta pruning allows us to look ahead about half as far again as without it.
Now each position has no known outcome, only an expected outcome. The expected value for a chance node C above a max node is:
expectimax(C) = Σ_{i} P(d_{i}) max_{s in S(C,d_{i})} utility(s)
where the d_{i} are the possible dice rolls, P(d_{i}) is the probability of roll d_{i}, and S(C,d_{i}) is the set of positions reachable from C using roll d_{i}.
see: http://www.chess.ibm.com/meet/html/d.3.2.html for more details of Deep Blue
A knowledgebased agent needs to know: the current state of the world, how to infer unseen properties from percepts, how the world evolves over time, what it wants to achieve, and what its own actions do in various circumstances.
The basic elements of a reasoning agent's design are: a formal language in which knowledge can be expressed and a means of carrying out reasoning in such a language. These two elements constitute a logic.
There must be a way to add new sentences to the knowledge base and a way to query what is known. We will call these standard functions TELL and ASK, respectively.
Determining what follows from a KB is the job of the inference mechanism.
The agent maintains a KB which may initially contain some background knowledge.
In the process of answering this query, logical reasoning is used to prove which action is better than all others, given what the agent knows and what its goals are.
MAKE-PERCEPT-SENTENCE takes a percept and a time and returns a sentence representing the fact that the agent perceived the percept at time t.
MAKE-ACTION-QUERY takes a time and returns a sentence that is suitable for asking what action should be performed at that time.
At any point, we can describe a knowledgebased agent at three levels:
The agent's task is to find the gold and return to square [1,1].
In most environments in this class, it is possible for the agent to return successfully with the gold. However, in some, the agent must choose between going home empty-handed or taking a chance that could lead to death or the gold. And in about 21% of the environments, there is no way the agent can get a positive score.
Acting and reasoning in the wumpus world
Why does an agent that is successful in this world need to be able to reason? The following figure shows the state of the agent's knowledge after it has received its initial percept in the world above.
The agent detects a breeze in [2,1], so there must be a pit in a neighbouring
square, either [2,2] or [3,1]. (The notation P? indicates a possible pit.)
The pit cannot be in [1,1] because the agent has already been there and
did not fall in. At this point, there is only one known square that is
OK and has not yet been visited. So the prudent agent turns back and goes
to [1,1] and then [1,2], giving us (a) below:
We want to generate sentences that are necessarily true given that the
old sentences are true. This relationship between sentences is called entailment.
Example Consider the sentence E=mc^{2}. The syntax of the "equation language" allows two expressions to be connected by =. An expression can be a symbol or a number, a concatenation of two expressions, two expressions joined by +, and so on.
The semantics of the language says that the two expressions on each side of the = refer to the same quantity; that the concatenation of two expressions refers to the quantity that is the product of the two quantities referred to by each of the expressions; and so on.
From the semantics, we can show that a new sentence can be generated by, for example, concatenating the same expression to both sides of the equation. ET=mc^{2}T.
Representation
Representations need to be good at the job for which they are intended, but what does this mean? We start by looking at some more familiar examples than representation languages.
Programming languages are good for describing algorithms and concrete data structures. It is easy to imagine using a 4x4 array to represent the wumpus world, and world[2,2] ← Pit is a natural way to place a pit in square [2,2] of a world. However, when we want to say less specific things, we run into trouble. For example, "there is a pit in [2,2] or [3,1]" or "there is a wumpus somewhere."
The reason we run into this problem is that programming languages are designed to completely describe the state of the computer and how it changes as a program executes. But we would like to be able to describe situations where complete information is not available. A language that does not allow us to do this is not sufficiently expressive.
A good knowledge representation language should be expressive and concise so that what we say today will still be understandable tomorrow. It should be unambiguous and independent of context. Also, it should be possible to design an inference procedure for it.
Semantics
In logic, the meaning of a sentence is what it states about the world. So, how do we establish the correspondence between facts and sentences? In order to say what a sentence means, an interpretation must be provided to say what fact it corresponds to. Without an interpretation, a sentence is meaningless. This idea is easiest to see in fabricated languages.
The languages that we will deal with are all compositional  the meaning of a sentence is a function of the meaning of its parts. Just as the meaning of x^{2} + y^{2} is determined from the meanings of x^{2} and y^{2}, we want the meaning of "s1 and s2" to be determined from the meanings of "s1" and "s2".
A sentence is true under a particular interpretation if the state of affairs it represents is the case. Hence, truth depends both on the interpretation and the actual state of affairs in the world.
Example
"S1,2" would be true under the interpretation in which it means that there is a stench in [1,2], in the world we explored earlier, but it would be false in worlds that do not have a stench in [1,2]. Also it would be false in our example world if its interpretation were "there is a breeze in [1,2]."
Inference
We are concerned with sound inference which is called logical inference or deduction. Logical inference is a process that implements the entailment relation between sentences. We begin with the idea of necessary truth.
Validity and satisfiability
A sentence is valid or necessarily true if and only if it is true under all possible interpretations in all possible worlds.
Example "There is a stench at [1,1] implies there is a stench at [1,1]." By contrast, "There is an open area in the square in front of me or there is a wall in the square in front of me" is not valid.
A sentence is satisfiable if and only if there is some interpretation in some world that makes it true, e.g., "there is a wumpus at [1,2]". If there is no interpretation and no world in which a sentence is true, it is called unsatisfiable.
Example There is a wall in front of me if and only if there is no wall in front of me.
Inference in computers
It turns out that validity and unsatisfiability are crucial to the ability of a computer to reason logically.
A computer does not know the interpretation you are using for the sentences in the knowledge base, nor does it know anything more than what appears in the knowledge base (it has no direct experience with the world).
Example Suppose we ask if it is OK to move to square [2,2]. The computer does not know what OK means, nor does it know what a wumpus or a pit is. So it cannot reason informally as we do. All it can do is check whether KB entails "[2,2] is OK". The inference procedure must show that the sentence, "If KB is true then [2,2] is OK" is a valid sentence. If it can show this then [2,2] is OK under any interpretation and model.
Hence, the powerful thing about formal inference is that it can be used to derive valid conclusions even if the reasoner does not know the intended interpretation.
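To make this concrete, here is a minimal sketch in C (all function names are illustrative) of checking validity by enumerating every interpretation: the Modus Ponens sentence (P ∧ (P => Q)) => Q is true in every row of its truth table, while P => Q alone is not.

```c
#include <stdbool.h>

/* Sketch: a sentence is valid iff it is true under ALL interpretations,
   so enumerate every truth assignment to the proposition symbols. */

static bool implies(bool a, bool b) { return !a || b; }

/* "(P and (P => Q)) => Q" -- Modus Ponens as a sentence; valid. */
bool mp_valid(void) {
    for (int p = 0; p <= 1; p++)
        for (int q = 0; q <= 1; q++)
            if (!implies(p && implies(p, q), q))
                return false;        /* falsifying interpretation found */
    return true;
}

/* "P => Q" alone is satisfiable but not valid (false when P=1, Q=0). */
bool impl_valid(void) {
    for (int p = 0; p <= 1; p++)
        for (int q = 0; q <= 1; q++)
            if (!implies(p, q))
                return false;
    return true;
}
```

Because the first sentence is true under every interpretation, it can be used as an inference rule regardless of what the symbols are taken to mean.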
Logics
As we have said, logics consist of the following:
In propositional logic symbols represent whole propositions (what we have been calling facts), e.g., D might have the interpretation, "the wumpus is dead". These symbols can be combined into more complicated sentences using boolean connectives.
First-order logic commits to the representation of the world in terms of objects and predicates on those objects, as well as connectives (as in propositional logic) and quantifiers.
It is illuminating to consider logics in the light of their ontological and epistemological commitments. Ontological commitments have to do with the nature of reality. For example, propositional logic assumes that there are facts that either hold or do not hold in the world. First-order logic further assumes that the world consists of objects with certain relationships between them that either hold or do not hold. Special-purpose logics make further commitments.
Epistemological commitments have to do with the possible states of knowledge an agent can have using various types of logics. In propositional and first-order logic, a sentence represents a fact and the agent either believes the sentence to be true, believes it to be false, or cannot make a determination. Hence, these logics have three possible states of belief about any sentence. Systems using probability theory can have any degree of belief in a sentence, ranging from 0 (false) to 1 (true).
The symbols are True and False, propositional symbols such as P or Q, the logical connectives ∧ (and), ∨ (or), ¬ (not), => (implication), <=> (equivalence), and parentheses. All sentences are made by putting these symbols together using the following rules:
Semantics
A propositional symbol can be interpreted as any fact. The propositional constants have the obvious interpretations.
For a complex sentence, the meaning (as a function of the truth values of the constituent parts) is given by truth tables.
Validity and inference
Truth tables can also be used to test the validity of sentences.
Example
Models
Any world in which a sentence is true under a particular interpretation is called a model of that sentence under that interpretation.
The more we claim, the fewer models there will be of those claims, e.g., there are fewer models of "There is a pit in [1,2] and there is a Wumpus in [4,2]" than of "There is a pit in [1,2]."
Models are very important in logic because of the relationship between models and entailment. If KB ⊨ α, then every model of KB is also a model of α. Hence, whenever KB is true, so also is α.
We could define the meaning of a sentence by means of set operations on sets of models. For example, the set of models of P ∧ Q is the intersection of the models of P and the models of Q.
The process by which the soundness of an inference rule is established through truth tables can be extended to entire classes of inferences. There are certain patterns of inferences that occur over and over, and their soundness can be shown once and for all. Then the pattern can be captured in what is called an inference rule. Once this has been done, the rule can be used to make inferences without going through the tedious process of building truth tables.
We have already seen the notation KB ⊢ α to say that α can be derived from KB by inference. An alternate notation, with the premise α written above a line and the conclusion β below it, emphasizes that this is not a sentence, but rather an inference rule: α / β. Here α and β match sentences.
Whenever something in the knowledge base matches the pattern above the line, the inference rule concludes the sentence below the line.
An inference rule is sound if the conclusion is true in all cases in which the premises are true.
Some example inference rules:
Modus Ponens: from α => β and α, infer β.
And-Elimination: from α1 ∧ α2 ∧ ... ∧ αn, infer αi.
Or-Introduction: from αi, infer α1 ∨ α2 ∨ ... ∨ αn.
Double-Negation Elimination: from ¬¬α, infer α.
Unit Resolution: from α ∨ β and ¬β, infer α.
Resolution: from α ∨ β and ¬β ∨ γ, infer α ∨ γ.
Complexity of Propositional Inference
Since there are 2^{n} rows in a truth table, one might suspect that the complexity of propositional inference is exponential. One might wonder if there is a polynomial-time procedure for propositional inference. This very problem was the motivation for the class of NP-complete problems.
There is a useful class of sentences for which a polynomial-time inference procedure exists. This is the class of Horn sentences, which have the form:
P1 ∧ P2 ∧ ... ∧ Pn => Q
Let S1,1 mean there is a stench in [1,1].
Let B1,1 mean there is a breeze in [1,1].
Let W1,1 mean there is a Wumpus in [1,1].
Let P1,1 mean there is a pit in [1,1].
What we know about the wumpus world is:
¬S1,1
¬S2,1
S1,2
¬B1,1
B2,1
¬B1,2
How can we express the rules about Wumpuses and Pits?
r1: ¬S1,1 => ¬W1,1 ∧ ¬W1,2 ∧ ¬W2,1
r2: ¬S2,1 => ¬W1,1 ∧ ¬W2,1 ∧ ¬W2,2 ∧ ¬W3,1
r3: S1,2 => W1,3 ∨ W1,2 ∨ W2,2 ∨ W1,1
Now to find the wumpus using the inference rules:
modus ponens with ¬S1,1 and r1 gives:
¬W1,1 ∧ ¬W1,2 ∧ ¬W2,1
and-elimination gives:
¬W1,1
¬W1,2
¬W2,1
modus ponens followed by and-elimination with ¬S2,1 and r2 gives:
¬W1,1
¬W2,1
¬W2,2
¬W3,1
modus ponens with S1,2 and r3 gives:
W1,3 ∨ W1,2 ∨ W2,2 ∨ W1,1
applying unit resolution where α is W1,3 ∨ W1,2 ∨ W2,2 and β is W1,1 (we have ¬W1,1) gives:
W1,3 ∨ W1,2 ∨ W2,2
applying unit resolution where α is W1,3 ∨ W1,2 and β is W2,2 gives:
W1,3 ∨ W1,2
applying unit resolution where α is W1,3 and β is W1,2 gives:
W1,3
We have found the wumpus: it is in [1,3]!
Examples
Objects: people, houses, numbers, planets, ...
Relations: parent, brother-of, greater-than, ...
Properties: red, round, prime, ...
Functions: father-of, one-more-than, ...
Examples:
"One plus one equals two."
"Squares neighbouring the Wumpus are smelly."
First-order logic is universal in the sense that it can express anything that can be programmed.
<Sentence> := <AtomicSentence>
  | <Sentence> <Connective> <Sentence>
  | <Quantifier> <Variable>,... <Sentence>
  | ¬<Sentence>
  | (<Sentence>)
<AtomicSentence> := <Predicate>(<Term>,...)
  | <Term> = <Term>
<Term> := <Function>(<Term>,...)
  | <Constant>
  | <Variable>
<Connective> := ∧ | ∨ | <=> | =>
<Quantifier> := ∀ | ∃
<Constant> := Martin | 59302 | Cat | X | ...
<Variable> := a | x | s | ...
<Predicate> := Before | Likes | Raining | Fails | ...
<Function> := Father | Hairof | 304gradefor | ...
BNF for first-order logic
Constant symbols: A, B, C, 1, John,...
Each constant symbol names exactly one object in the world; not all objects need to have names, and some can have more than one name.
Predicate symbols: Round, Brother,...
A predicate symbol refers to a particular relation in the model. For example, Brother refers to the brotherhood relation; since Brother is a binary predicate symbol, the relation it refers to must also be binary, i.e., it must hold or fail to hold between pairs of objects.
Function symbols: Cosine, Fatherof, Leftlegof
A functional relation relates an object to exactly one other object. The last element in the tuple is the value of the function for the other elements, e.g., Officeof(Martin,sc1.40).
Terms
A term is a logical expression that refers to an object. Constant
symbols are terms. Terms can also be constructed from function symbols
and constant symbols, e.g., Fatherof(John).
The formal semantics of terms is as follows: An interpretation specifies a functional relation referred to by the function symbol and objects referred to by the constant symbols. Hence, a function term refers to the (n+1)th object in a tuple whose first n elements are the objects referred to by the arguments of the function.
Atomic sentences
An atomic sentence is formed from a predicate symbol followed by a parenthesized list of terms; for example, Brother(Richard,John) states that the object referred to by Richard is the brother of the object referred to by John.
Atomic sentences can have arguments that are complex terms:
Examples
Brother(Richard,John) ∧ Brother(John,Richard)
is true just in case John is the brother of Richard and Richard is the brother of John.
Older(John,30) ∨ Younger(John,30)
is true when John is older than 30 or he is younger than 30.
Quantifiers
We now have a logic that allows objects; quantifiers let us express properties of entire collections of objects. There are two quantifiers in first-order logic: universal and existential.
Universal quantification (∀)
Using this we can say things like, "All cats are mammals": ∀x Cat(x) => Mammal(x).
Hence, a sentence ∀x P(x) is true in a model only if P is true of every object in the model.
Existential quantification (∃)
Where universal quantification makes statements about every object, existential quantification makes statements about some object.
To say, for example, that Spot has a sister that is a cat, we write ∃x Sister(x,Spot) ∧ Cat(x).
Nested quantifiers
Very complex statements can be made if one nests quantifiers. Starting simply, without mixing types of quantifiers, we can say things like ∀x,y Brother(x,y) => Sibling(x,y).
Connections between ∀ and ∃
There is an intimate connection between the two quantifiers: asserting that a property holds of every object is the same as denying that some object lacks it. The equivalences (which mirror De Morgan's laws for ∧ and ∨) are:
∀x ¬P <=> ¬∃x P        ¬(P ∧ Q) <=> ¬P ∨ ¬Q
¬∀x P <=> ∃x ¬P        ¬(P ∨ Q) <=> ¬P ∧ ¬Q
∀x P <=> ¬∃x ¬P        P ∧ Q <=> ¬(¬P ∨ ¬Q)
∃x P <=> ¬∀x ¬P        P ∨ Q <=> ¬(¬P ∧ ¬Q)
Equality
The Kinship domain
Examples
∀m,c Mother(c) = m <=> Female(m) ∧ Parent(m,c)
∀w,h Husband(h,w) <=> Male(h) ∧ Spouse(h,w)
∀x Male(x) <=> ¬Female(x)
∀g,c Grandparent(g,c) <=> ∃p Parent(g,p) ∧ Parent(p,c)
∀x,y Sibling(x,y) <=> x ≠ y ∧ ∃p Parent(p,x) ∧ Parent(p,y)
Axioms, Definitions and Theorems
Axioms capture the basic facts about a domain, e.g., the examples
above are axioms. The axioms are then used to prove theorems.
We can have too few or too many axioms. If we have too few, then we
cannot prove theorems that we expect should actually follow in a domain.
If we have too many axioms, then some axioms follow from (combinations
of) others. An axiom is said to be independent of a set of axioms
if it does not follow from that set.
None of the primitive predicates (or functions) is defined; instead we give partial specifications of their properties.
Asking questions and getting answers
To add sentences to the knowledge base, we call TELL, e.g.,
TELL(KB, ∀m,c Mother(c) = m <=> Female(m) ∧ Parent(m,c))
This goes both for axioms (as above) and specific facts about a particular
situation such as
TELL(KB, Female(Maxi) ∧ Parent(Maxi,Spot) ∧ Parent(Spot,Boots))
With these facts added, we can
ASK(KB,Grandparent(Maxi,Boots))
and receive a yes/no answer. We can also ask questions that return
additional information in the answers such as
ASK(KB, ∃x Child(x,Spot)).
Here we do not want simply a yes/no answer. In addition we would like
to know a term for x denoting an object in the domain. In general, for
a query involving existentially quantified variables, we want to know bindings
for those variables. Hence, ASK returns a binding list, e.g., {x/Boots}.
The first step is to define the interface between the agent and the world. The percept sequence must contain both the percepts and the time at which they occurred, otherwise the agent will not be able to tell when things were observed. We will use integers for time steps, so a typical percept sentence would be:
Percept([Stench, Breeze, Glitter, None, None], 5)
Situation Calculus
Situation Calculus is the name for a particular way of describing
change in FOL. It conceives of the world as a sequence of situations,
each of which is a snapshot of the state of the world. Situations are generated
from previous situations by actions.
Every relation whose truth can change over time is handled by giving an extra situation argument to the corresponding predicate symbol. By convention, we make the situation argument always the last. Hence, instead of At(Agent,Location), we would have At(Agent,[1,1],S0) and At(Agent,[1,2],S1).
Relations or properties that do not change over time need not have
the extra argument, e.g., WallAt([0,1]).
The next step is to represent how the world changes from one situation to the next. One way to do this is to have a function Result that maps an action and a situation to the new situation. Then we would have effect axioms such as:
∀x,s Present(x,s) ∧ Portable(x) => Holding(x,Result(Grab,s))
∀x,s ¬Holding(x,Result(Release,s))
A more elegant representation combines the effects and frame axioms
into a single axiom that describes how to compute the next time step. These
axioms have the structure
∀a,x,s Holding(x,Result(a,s)) <=> [(a = Grab ∧ Present(x,s) ∧ Portable(x)) ∨ (Holding(x,s) ∧ a ≠ Release)]
This type of axiom is called a successor-state axiom. One such axiom is needed for each predicate that may change its value over time. The axiom must list all the ways in which a predicate can become true and all the ways in which it can become false.
Keeping track of location
In the Wumpus world, the agent cannot directly perceive its location, so it needs to keep track of where it has been. It's also a good idea for it to keep track of what it has seen in previous locations. The agent needs to know its location and orientation:
We need to write some rules that relate various aspects of single world
states (as opposed to across states). There are two main kinds of such
rules:
The important thing to remember about all this is that if the axioms correctly and completely describe the way the world works and the way percepts are produced, the inference procedure will correctly infer the strongest possible description of the world state given the available percepts.
The first step is to describe the desirability of actions independent
of each other. In doing this we will use a simple scale: actions can be
Great, Good, Medium, Risky, or Deadly. Obviously, the agent should always
do the best action it can find:
Feature Extraction:
It is easy to learn about the training set: just remember every pattern in it. Unfortunately, this will be of little use for classifying the test set; this is overtraining. Similarly, if the system knows too little about the training set then it will classify badly.
The system must not overgeneralise or undergeneralise about patterns.
Bayesian Classification
If x is the feature vector and w_i is class i, then:
Pr(w_i) is the prior probability of the pattern being in class i.
p(x | w_i) is the class-conditional probability density function for the feature vector.
p(x) = sum_i( p(x | w_i) Pr(w_i) )
From Bayes' Rule:
Pr(w_i | x) = p(x | w_i) Pr(w_i) / sum_j( p(x | w_j) Pr(w_j) )
Bayesian theory gives us the optimum decision boundary but in practice we don't have enough information to calculate this boundary.
For a feature vector x = {x_1, x_2, x_3, ..., x_n} and a training pattern t = {t_1, t_2, t_3, ..., t_n}, the distance between them can be measured with the Euclidean metric d(x,t) = sqrt(sum_i (x_i - t_i)^2).
Using the K nearest neighbours, take a majority vote of their classes to give a classification.
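A minimal sketch of this nearest-neighbour vote in C (two classes, two features, Euclidean distance; the names, the fixed feature count and the 64-pattern limit are illustrative assumptions):

```c
/* Sketch of K-nearest-neighbour classification: majority vote among the
   K training patterns whose feature vectors are closest to x. */

#define NFEAT 2

typedef struct { double x[NFEAT]; int label; } Pattern;

static double dist2(const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < NFEAT; i++)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;     /* squared Euclidean distance: same ordering as sqrt */
}

/* returns the majority class (0 or 1) among the K nearest neighbours */
int knn_classify(const Pattern *train, int n, const double *x, int K) {
    int used[64] = {0};          /* assumes n <= 64 in this sketch */
    int votes[2] = {0, 0};
    for (int k = 0; k < K; k++) {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (!used[i] &&
                (best < 0 || dist2(train[i].x, x) < dist2(train[best].x, x)))
                best = i;
        used[best] = 1;
        votes[train[best].label]++;
    }
    return votes[1] > votes[0];
}
```

Using an odd K avoids ties in the two-class vote.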
Consider the plane shown above; this can be used to make decisions about patterns in the x-y plane. For every point in the plane x-y there is a corresponding value for z where (x,y,z) lies on the angled plane. If z is negative then the point lies above the dotted line, otherwise it is below the dotted line.
The equation for z is
z=wx+vy+c
We can generalise this for more dimensions
o = w1*i1 + w2*i2 + w3*i3 + ...
o = sum[j=1..n](wj * ij)
where o is the output, the wj are the weights and the ij are the inputs.
This can be drawn as shown below.
Generally we don't care about the size of the output, it will be a yes/no answer for a class; so use the following:
here:
y = f(sum_i(w_i x_i) + theta)
The function in the box is called the sigmoid function and is:
f(x) = 1/(1+exp(-ax))
If t is the expected output then there will be an error between t and o. To train the system we need to minimise this error. For a number of these functions there will be one error: the sum of the individual errors.
E = (sum_k (t^k - y^k)^2) / 2
E = (sum_k (t^k - f(sum_i(w_i x_i^k) + theta))^2) / 2
To train the classifier, just change each w so that E becomes smaller.
Now
dE/dw_i = -sum_k (t^k - y^k) * dy^k/dw_i
dE/dw_i = -sum_k (t^k - y^k) * f'(sum_i(w_i x_i^k) + theta) * x_i^k
dE/dw_i = -sum_k (d^k x_i^k)
where
d^k = (t^k - y^k) * f'(sum_i(w_i x_i^k) + theta)
f(x) = 1/(1+exp(-ax))
therefore
f'(x) = a f(x)(1 - f(x))
so (taking a = 1)
d^k = (t^k - y^k) * y^k (1 - y^k)
Weight updating rule:
dw_i = n sum_k (d^k x_i^k)
where n is a parameter used to control the speed of gradient descent. Note that we want to reduce the error, so the minus sign disappears.
For theta:
dE/dtheta = -sum_k (d^k)
therefore
dtheta = n sum_k (d^k)
This device is sometimes called a perceptron. The training rule is called the delta rule.
repeat {
    set dw_1 = dw_2 = dw_3 = dtheta = 0
    for every pattern x^k do {
        calculate y^k
        calculate d^k
        add n d^k x_i^k to dw_i for i = 1..n
        add n d^k to dtheta
    }
    add dw_i to w_i for i = 1..n
    add dtheta to theta
} until E becomes small or changes little.
This perceptron can only learn simple problems.
A linearly separable problem is one in which the classes can be separated by a single hyperplane.
It is often the case that a problem is not linearly separable. To solve such problems we use a Multi-Layer Perceptron (MLP), in which one layer feeds into the next. Consider the following problem.
For two inputs x_{1} and x_{2}, the output is the exclusive OR of the inputs.
The pattern space for this problem looks like this:
This cannot be solved using a single line; the solution uses two lines:
A two-layer Multi-Layer Perceptron to solve this problem looks like this:
The shape of regions in pattern space that can be separated by a MultiLayer Perceptron is shown in the following table.
We can see that a three-layer MLP can learn arbitrary regions while a two-layer MLP can learn convex regions. (A region is convex if a straight line drawn between any two points in the region never leaves the region.)
The following procedure can be used to train a backpropagation network.
t is the target
units[l] is the number of units in layer l
n[l][i] is unit i in layer l
n[l][i].output is the output
n[l][i].delta is the delta
n[l][i].weight[j] is weight j
ek is the learning constant
adapt()
{
    int i, j, k, l;
    /* propagate deltas back from the output layer */
    for (l = layers-1; l >= 0; l--)
        for (i = 0; i < units[l]; i++)
            if (l == layers-1)
                n[l][i].delta = ek * n[l][i].output *
                                (1.0 - n[l][i].output) *
                                (t[i] - n[l][i].output);
            else {
                n[l][i].delta = 0.0;
                for (k = 0; k < units[l+1]; k++)
                    n[l][i].delta += n[l+1][k].delta * n[l+1][k].weight[i];
                n[l][i].delta = n[l][i].delta * ek * n[l][i].output *
                                (1.0 - n[l][i].output);
            }
    /* update the weights */
    for (l = layers-1; l >= 1; l--)
        for (i = 0; i < units[l]; i++)
            for (j = 0; j < weights; j++)
                n[l][i].weight[j] += n[l-1][j].output * n[l][i].delta;
    for (i = 0; i < units[0]; i++)
        for (j = 0; j < weights; j++)
            n[0][i].weight[j] += input[j] * n[0][i].delta;
}
When this algorithm is applied to the XOR problem we get the following output.
iteration no 0, inputs 0 1, target 1, output 0.477995
iteration no 20, inputs 0 0, target 1, output 0.447816
iteration no 40, inputs 1 0, target 0, output 0.450292
iteration no 60, inputs 0 0, target 1, output 0.549096
iteration no 80, inputs 1 0, target 0, output 0.460706
iteration no 100, inputs 0 0, target 1, output 0.507636
iteration no 120, inputs 0 1, target 1, output 0.571619
iteration no 140, inputs 1 0, target 0, output 0.451493
iteration no 160, inputs 0 1, target 1, output 0.570574
iteration no 180, inputs 0 0, target 1, output 0.575979
iteration no 200, inputs 0 1, target 1, output 0.744079
iteration no 220, inputs 1 0, target 0, output 0.233541
iteration no 240, inputs 0 1, target 1, output 0.755600
iteration no 260, inputs 1 1, target 0, output 0.185273
iteration no 280, inputs 0 1, target 1, output 0.788309
iteration no 300, inputs 1 1, target 0, output 0.167068
iteration no 320, inputs 1 0, target 0, output 0.123461
iteration no 340, inputs 1 1, target 0, output 0.132892
iteration no 360, inputs 1 1, target 0, output 0.133583
iteration no 380, inputs 1 1, target 0, output 0.116641
iteration no 400, inputs 1 0, target 0, output 0.088269
iteration no 420, inputs 0 0, target 1, output 0.861810
iteration no 440, inputs 1 1, target 0, output 0.102406
iteration no 460, inputs 1 0, target 0, output 0.080179
iteration no 480, inputs 1 0, target 0, output 0.075584
iteration no 500, inputs 0 0, target 1, output 0.884442
iteration no 520, inputs 0 0, target 1, output 0.892789
iteration no 540, inputs 0 1, target 1, output 0.923969
iteration no 560, inputs 1 0, target 0, output 0.064146
iteration no 580, inputs 1 1, target 0, output 0.071938
iteration no 600, inputs 1 1, target 0, output 0.075764
iteration no 620, inputs 1 1, target 0, output 0.074536
iteration no 640, inputs 1 1, target 0, output 0.069014
iteration no 660, inputs 1 1, target 0, output 0.066534
iteration no 680, inputs 0 0, target 1, output 0.918422
iteration no 700, inputs 0 0, target 1, output 0.924860
iteration no 720, inputs 1 1, target 0, output 0.065864
iteration no 740, inputs 1 0, target 0, output 0.052634
iteration no 760, inputs 0 0, target 1, output 0.927081
iteration no 780, inputs 1 0, target 0, output 0.050964
iteration no 800, inputs 0 1, target 1, output 0.948869
iteration no 820, inputs 1 0, target 0, output 0.049082
iteration no 840, inputs 1 0, target 0, output 0.048074
iteration no 860, inputs 1 1, target 0, output 0.057916
iteration no 880, inputs 1 1, target 0, output 0.056088
iteration no 900, inputs 0 1, target 1, output 0.954659
iteration no 920, inputs 1 1, target 0, output 0.057337
iteration no 940, inputs 0 0, target 1, output 0.944243
iteration no 960, inputs 1 0, target 0, output 0.045653
iteration no 980, inputs 0 0, target 1, output 0.946199
Convert English text into the vowel and consonant sounds (phonemes) that make up speech. For example, consider the following words: Albany, Albania, misled and bedraggled.
if a ``c'' appears before an ``i'', ``e'', or ``y'' then pronounce it like an ``s''
else pronounce it like a ``k''
Exception: celtic
Advantages of the Network Approach
Patterns are often created by processes we are not able to describe well. In these cases we can try to build a classifier based on inherent properties of the patterns.
This is Unsupervised Learning
example:
If we take speech signals and transform them to the frequency domain we get a speech spectrogram.
The spectrogram is a two-dimensional signal showing how the frequencies change over time. The spectrogram above shows the phrase, "In nineteen ninety two".
For human speech we find that the signal consists of two or three main bands called formants. We could try to classify speech by using the positions of the formants. (This is not easy because there is a wide variation in the way the formants change for a particular sound.)
Since there is some inherent structure in the signal we can use unsupervised learning to reduce the quantity of information in the formants.
We need to construct a classifier that minimises not the error between output and target, but the error in the fit of a particular model to the data.
If we use a very complex model it will be able to fit the data quite well.
An example model is a set of points in the pattern space, each point represents the centre of a cluster. An algorithm must be devised to move the points so that an error is reduced.
Repeat from step 2 until the cluster centres stop moving or the error changes little.
To modify the K-means algorithm we need to add an extra step 4 which can split or merge clusters.
Clustering algorithms perform dimensionality reduction. They do this because they take a high dimensional pattern space and produce a lower dimensional space. Hopefully the lower dimensional space still contains most of the information from the original space.
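The basic clustering loop can be sketched as follows (one-dimensional data and two cluster centres for brevity; all names and the iteration limit are illustrative): assign each pattern to its nearest centre, move each centre to the mean of its patterns, and stop when the centres no longer move.

```c
#include <math.h>

/* Sketch of a two-centre K-means pass over one-dimensional data. */
void kmeans2(const double *data, int n, double *c0, double *c1) {
    for (int iter = 0; iter < 100; iter++) {
        double s0 = 0.0, s1 = 0.0;
        int n0 = 0, n1 = 0;
        for (int i = 0; i < n; i++) {
            /* step 2: assign each pattern to its nearest centre */
            if (fabs(data[i] - *c0) < fabs(data[i] - *c1)) { s0 += data[i]; n0++; }
            else                                           { s1 += data[i]; n1++; }
        }
        /* step 3: move each centre to the mean of its patterns */
        double new0 = n0 ? s0 / n0 : *c0;   /* keep an empty cluster's centre */
        double new1 = n1 ? s1 / n1 : *c1;
        if (fabs(new0 - *c0) < 1e-9 && fabs(new1 - *c1) < 1e-9)
            break;                          /* centres stopped moving */
        *c0 = new0;
        *c1 = new1;
    }
}
```

For data with two well-separated groups the centres converge to the group means.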
Image Processing
Image coding reduces the storage requirement for an image by removing redundant information. This may be done in two ways.
The image may be reconstructed exactly. Standard coding
techniques may be used.
Compression of 2:1 is typically achieved.
An image may need to be processed so that it may be viewed.
For example, an Xray image may be displayed with greater contrast in some
regions so that very faint features can be seen.
Original
Enhanced
If an image has known faults then these may often be corrected.
A blurred image may be 'deblurred' or a noisy image may have noise removed.
Restoration often relies on knowledge of the process that caused the degradation.
Original Image
Noisy Image
Noise removal applied once
Noise Removal applied twice
Original Image
Sharpened Image
Sometimes similar areas in an image must be identified,
for example from an aerial photograph the land and the sea may need to
be segmented.
Original Image
Segmented Image
Often, only part of an image is of interest.
Feature detection creates an image that contains information about the presence of certain types of features. Edges are common features.
Original Image
Edges in Image
Image of Road
Hough Transform of Road Image
The convolution operation is a weighted sum of all the pixels in the neighbourhood. For example, shown below is an area in an image and a weight array:
I_00  I_10  I_20        M_00  M_10  M_20
I_01  I_11  I_21        M_01  M_11  M_21
I_02  I_12  I_22        M_02  M_12  M_22
If I is a 3x3 neighbourhood from the image, and M is an array the same size as I, then a pixel I'_11 to replace I_11 is found using:
I'_11 = sum[i=0..2] sum[j=0..2] (I_ij * M_ij)
M is called the convolution mask, and by choosing values for the elements of M, many different operations may be performed. The convolution mask may be of any size, but is commonly 3x3, 5x5 or 7x7. The sum of the elements of M is commonly either 0 or 1. In order to see why, consider an area in an image with pixels of a uniform grey level.
If the sum of the elements in M is 0 then the operator produces an image in which all such pixels are zero. Thus such operators are used to detect properties of non-uniform regions, for example edges.
If the sum of the elements in M is 1 then the operator
produces an image with all pixels unchanged. Thus such operators are used
to change the pixels in an image that are in non uniform areas. They may
be used, for example, to blur or sharpen an image.
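The weighted-sum step itself is small; here is a sketch for a single 3x3 neighbourhood (names are illustrative):

```c
/* Sketch of one convolution step: the weighted sum of a 3x3
   neighbourhood I with a mask M, giving the new centre pixel I'_11. */
double convolve3x3(const double I[3][3], const double M[3][3]) {
    double sum = 0.0;
    for (int j = 0; j < 3; j++)
        for (int i = 0; i < 3; i++)
            sum += I[j][i] * M[j][i];
    return sum;
}
```

On a uniform region, a mask whose elements sum to 1 (e.g. the averaging mask) leaves the pixel unchanged, while a mask whose elements sum to 0 (e.g. the Laplacian) gives zero.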
The symbol used to denote a convolution is a star: *
If all the mask coefficients are positive, then the result
will be a blurred image.
1/9  1/9  1/9         1/12  1/12  1/12
1/9  1/9  1/9    or   1/12  1/3   1/12
1/9  1/9  1/9         1/12  1/12  1/12
The mask to the left uses the average in the neighbourhood. The mask to the right uses a weighted average. Commonly, the weighted average is calculated using a Gaussian distribution so that there is no sudden discontinuity in the mask.
The gradient in an image is important because it gives information about edges. The first and second derivative are often used in edge detection.
An image is a two-dimensional function, thus the gradient at any point is a vector:
g(x,y) = (dI/dx, dI/dy)
The gradient in the x direction in an image I at a point
x,y may be approximated as:
g_x(x,y) = I(x+0.5,y) - I(x-0.5,y) and, since x and y are integers,
g_x(x-0.5,y) = I(x,y) - I(x-1,y)
or 2g_x(x,y) = I(x+1,y) - I(x-1,y)
Similarly the gradient in the y direction is:
g_y(x,y-0.5) = I(x,y) - I(x,y-1)
or 2g_y(x,y) = I(x,y+1) - I(x,y-1)
The convolution masks for this operation are:
In the x direction: -1  0  1   (or the two-element mask -1  1)
In the y direction: the same masks written as columns:
-1        -1
 0   or    1
 1
Other masks are:
Prewitt x mask:    Prewitt y mask:    Sobel x mask:    Sobel y mask:
-1  0  1           -1 -1 -1           -1  0  1         -1 -2 -1
-1  0  1            0  0  0           -2  0  2          0  0  0
-1  0  1            1  1  1           -1  0  1          1  2  1
Isotropic x mask:    Isotropic y mask:
-1   0   1           -1  -√2  -1
-√2  0   √2           0   0    0
-1   0   1            1   √2   1

Roberts x mask:    Roberts y mask:
1   0              0   1
0  -1             -1   0
The magnitude and direction of the gradient may be found using:
|g| = sqrt(g_x^2 + g_y^2), direction = atan2(g_y, g_x)
Often, only the magnitude of the gradient is required. In this case an approximation, such as |g| ≈ |g_x| + |g_y|, is usually used.
This needs only a single mask:
0  1
1  2
The magnitude of the gradient may be used to detect edges by thresholding. However, the width of the line in the thresholded image will be proportional to the strength of the edge. If the position of the edge is required then the second derivative can be used. Any place where the second derivative crosses zero will be on an edge.
The centre of the edge occurs at the zero crossing
of the second derivative
In one dimension, an approximation to the second derivative can be obtained by using the approximation to the first derivative and applying it again.
f'(x) = f(x+0.5) - f(x-0.5)
f''(x) = f'(x+0.5) - f'(x-0.5)
f''(x) = f(x+1) - f(x) - (f(x) - f(x-1))
f''(x) = f(x+1) - 2f(x) + f(x-1)
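This discrete second derivative is easy to check numerically: for f(x) = x^2 the true second derivative is 2 everywhere, and the discrete form reproduces it exactly (a sketch with illustrative names):

```c
/* Sketch: the discrete second derivative f''(x) = f(x+1) - 2 f(x) + f(x-1). */
double second_diff(double (*f)(double), double x) {
    return f(x + 1.0) - 2.0 * f(x) + f(x - 1.0);
}

double square(double x) { return x * x; }
```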
In two dimensions, the second derivative is the Laplacian: ∇²I = d²I/dx² + d²I/dy².
Convolution masks that will approximate the Laplacian or the negative Laplacian are:
0   1   0          0  -1   0
1  -4   1    or   -1   4  -1
0   1   0          0  -1   0
The following mask will give the average for the pixels in a 3x3 neighbourhood.
1/9  1/9  1/9
1/9  1/9  1/9
1/9  1/9  1/9
If this is applied to an image and then the Laplacian is applied, the result will be equivalent to using the following mask.
 0     1/9   1/9   1/9   0
 1/9  -2/9  -1/9  -2/9   1/9
 1/9  -1/9   0    -1/9   1/9
 1/9  -2/9  -1/9  -2/9   1/9
 0     1/9   1/9   1/9   0
Thus, two convolutions may be replaced by one, and vice versa. This is useful when two small convolution masks can be used to replace one large mask. This is only possible for certain masks. If a mask can be split into two one dimensional masks it is said to be separable. The Laplacian operator is not separable, but a similar mask is:
the column mask
-1
 2
-1
followed by the row mask
-1  2  -1
is equivalent to
 1  -2   1
-2   4  -2
 1  -2   1
In order to be separable each row in the mask must be a multiple of all other rows and each column must be a multiple of all other columns.
Using separable masks it is possible to use large neighbourhood sizes because the number of computations is proportional to the width of the neighbourhood and not its area.
As described earlier, the edges in an image correspond to the zero crossings of the Laplacian. These may be found by examining the pixels in a 2x2 neighbourhood, i.e.
a  b
d  c
If a×c < t, or a×b < t, or a×d < t, there is an edge at a (here t is a small negative threshold, so the test detects a change of sign between neighbouring pixels). The value of t must be set so that noise pixels do not produce edges.
The Laplacian locates edges, but it does not differentiate between edges of different widths. In practice some edges are more important than others. For example, given a face image a single hair will generate an edge, as will the border between face and background. In order to detect edges with a large width, a larger convolution mask is necessary.
If the image is blurred by a Gaussian then the Laplacian will only detect wide edges. These two operations are usually combined by finding the Laplacian of a Gaussian operator.
For a one-dimensional Gaussian G(x) = exp(-x²/2σ²), we have G''(x) = ((x² - σ²)/σ⁴) exp(-x²/2σ²).
This may be extended to two dimensions and used to generate a convolution mask.
The convolution operation is a linear process. If there is a conditional step in the process of obtaining a pixel value from a neighbourhood, then the operation is said to be non-linear.
The maximum value in the neighbourhood is chosen. The maximum operation causes light areas to expand; it is sometimes called the dilation or expand operation. It is useful for displaying data which contains isolated points and in segmentation.
This operation involves choosing the minimum in the neighbourhood. Sometimes called an erode or shrink operation, it is often used in conjunction with the maximum operator to perform fuzzy morphological operations on grey scale images.
The most commonly used non linear neighbourhood operator is the median filter.
The pixel is replaced by the median value from within the neighbourhood. The pixel values must be sorted and the middle value chosen. For example, given the following neighbourhood:
10  20  30
15  10  40
100 18  20

These values are sorted to give
10, 10, 15, 18, 20, 20, 30, 40, 100
In this case the median value is 20.
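A minimal median filter sketch in numpy, using the example neighbourhood above (the `median_filter` name and the edge-replication border handling are my own choices):

```python
import numpy as np

def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighbourhood.

    Borders are handled by replicating edge pixels; np.median does
    the sorting and middle-value selection described in the text.
    """
    half = size // 2
    padded = np.pad(image, half, mode="edge")
    out = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.median(padded[i:i + size, j:j + size])
    return out

# The 3x3 neighbourhood from the example above:
block = np.array([[10, 20, 30],
                  [15, 10, 40],
                  [100, 18, 20]])
```

For the central pixel of `block` this reproduces the worked example: the nine values sort to 10, 10, 15, 18, 20, 20, 30, 40, 100, so the outlier 100 is replaced and the centre becomes 20.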
The median filter is important because it removes pixels with values that are not similar to those in the neighbourhood while preserving edges. To illustrate this, consider the one dimensional case shown below.
The example to the left shows a feature that has a size less than half the width of the neighbourhood. In this case there is no point where the median value will be taken from inside the feature. Thus it is removed.
The example to the right shows a feature that has a size greater than half the width of the neighbourhood. In this case the median will always be the same as the value in the centre of the neighbourhood. This feature remains intact.
This property of the median filter is important because other noise removal methods, for example blurring, will change edge features.
The disadvantage of the median filter is that it is computationally intensive because it involves sorting. It is possible to use a moving histogram to reduce the number of calculations for large neighbourhood sizes.
Original image; 1/4 of all pixels replaced by random values; 3x3 median filter applied once; 3x3 median filter applied twice.
The mode filter replaces a pixel with the most common pixel value in the neighbourhood. This is only useful if there are a small number of grey levels. If grey levels are used to represent segmented regions then noise can be removed by using the mode filter.
Instead of using all the pixels in the neighbourhood, use the K pixels closest in value to the central pixel and replace it with their average. For example, if K=6:
10  20  30
15  10  40
100 18  20

The 6 pixels nearest in value to the central 10 are:
10, 15, 18, 20, 20 and 30
Average = 113/6 ≈ 19
This operator performs an operation without including extreme values (the 100 in this case).
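The K-nearest averaging step can be sketched as follows, reproducing the worked example above. The function name is illustrative; the central pixel is excluded from the candidates, which matches the arithmetic in the example:

```python
import numpy as np

def k_nearest_mean(neighbourhood, k=6):
    """Average the k neighbours whose values are closest to the central pixel.

    The central pixel itself is excluded; the k neighbours closest in
    value to it are selected and their mean replaces the pixel.
    """
    flat = neighbourhood.flatten()
    centre_index = flat.size // 2
    centre = flat[centre_index]
    neighbours = np.delete(flat, centre_index)
    # Pick the k values with the smallest absolute difference from the centre.
    nearest = neighbours[np.argsort(np.abs(neighbours - centre))[:k]]
    return nearest.mean()

block = np.array([[10, 20, 30],
                  [15, 10, 40],
                  [100, 18, 20]])
```

For `block` the six neighbours nearest in value to the central 10 are 10, 15, 18, 20, 20 and 30, so the result is 113/6 ≈ 19, with the extreme value 100 excluded.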
Calculate the average for a number of neighbourhoods around a pixel and use the closest to the pixel value.
[Diagram: five overlapping neighbourhoods, n1 to n4 placed around the corners of the central pixel and n5 centred on it]
Calculate the average in each neighbourhood n1..n5 and
use the closest of these to the original pixel value.
The Laplacian operation may be used to enhance an image.
[Figure: f(x), f″(x), and f(x) − f″(x)]
If the second derivative of a function is subtracted from
the function then sharp transitions in the function are emphasised as shown
above. Similarly, the Laplacian may be used to enhance the edges in an
image.
0  0  0      0  1  0       0 -1  0
0  1  0  −   1 -4  1  =   -1  5 -1
0  0  0      0  1  0       0 -1  0

Image − Laplacian = Edge Enhancement
Original image and image sharpened using the Laplacian.
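The sharpening mask, identity minus the Laplacian mask, can be applied with a small hand-rolled convolution. The helper names are illustrative, and because the mask is symmetric, correlation and convolution coincide here:

```python
import numpy as np

# Identity minus the 3x3 Laplacian mask gives the sharpening mask.
identity = np.array([[0, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0]])
laplacian = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]])
sharpen = identity - laplacian

def convolve3x3(image, mask):
    """Apply a 3x3 mask with edge replication at the borders.

    Since the mask is symmetric this correlation equals a true convolution.
    """
    padded = np.pad(image, 1, mode="edge")
    out = np.zeros_like(image)
    h, w = image.shape
    for di in range(3):
        for dj in range(3):
            out += mask[di, dj] * padded[di:di + h, dj:dj + w]
    return out
```

Because the sharpening mask's coefficients sum to one, flat regions are unchanged while sharp transitions are emphasised, as in the one-dimensional f(x) − f″(x) illustration.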
Often it is necessary to process binary images. Since each pixel may take only two values, the pixel is primarily used to denote the presence or absence of a feature. This feature may be the result of previous processing, for example edge detection. Often the binary image gives information about the shape of an object. Neighbourhood operators for binary images are called morphological operators because they deal with shape information.
In order to use morphological operators it is necessary to decide how one pixel is connected to another pixel.
The simplest method for deciding if a pixel is joined to one of its neighbours is to look at all eight of the neighbours. If a pixel in this 8-neighbourhood is set, then it is joined to the central pixel. Eight connectedness correctly connects diagonal lines, but it also connects the background across a diagonal line. Thus when eight connectedness is used the background must be treated differently to the foreground.
Pixels are four connected, if they are joined to the left,
right, above or below, but not diagonally.
Eight Connected Figure Four Connected Figure
Two important morphological operators are shrink and expand. Shrink will clear the pixels that have any neighbours clear. Expand will set the pixels that have any neighbours set.
An expand followed by a shrink is called a Close because it will remove small gaps between objects.
A shrink followed by an expand is called an Open because it will keep small gaps between objects open.
Shrink, Expand, Open and Close are illustrated below.
original expand shrink close open
Shrink and expand are analogous to the neighbourhood operators min and max.
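A minimal sketch of shrink, expand, open and close on binary numpy arrays, using the min/max duality noted above. The function names are mine, and the border handling is one reasonable assumption:

```python
import numpy as np

def expand(image):
    """Set every pixel that has any neighbour set (8-connected dilation)."""
    padded = np.pad(image, 1)          # outside the image is treated as clear
    out = np.zeros_like(image)
    h, w = image.shape
    for di in range(3):
        for dj in range(3):
            out |= padded[di:di + h, dj:dj + w]
    return out

def shrink(image):
    """Clear every pixel that has any neighbour clear (8-connected erosion).

    By duality: eroding the foreground is dilating the background,
    so the image border is treated as set here.
    """
    return 1 - expand(1 - image)

def close(image):
    """Expand then shrink: removes small gaps between objects."""
    return shrink(expand(image))

def open_(image):
    """Shrink then expand (trailing underscore: 'open' is a Python builtin)."""
    return expand(shrink(image))
```

An isolated set pixel grows to a 3x3 block under `expand`, survives `close` unchanged, and is removed entirely by `open_`.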
The edges in an image may be found by taking the original away from the expanded image, or taking the shrunk image away from the original. These correspond to the outer 4 connected edge and the inner 4 connected edge respectively.
The shrink operator may be used to reduce the size of a region of set pixels but if it is repeatedly applied then the region will eventually disappear. If the shape of the region is important then rather than shrink to nothing an operator should shrink the region to its skeleton.
The skeleton (or medial axis) is a set of points which are equidistant from the two or more closest edge points in the image.
Skeletonisation may be performed by repeatedly applying an algorithm which thins the image until the skeleton remains.
Binary morphological operators may be implemented by using
a lookup table. The pixels in the neighbourhood are combined to form a
binary word. This word will be between 0 and 511, for a 3x3 neighbourhood.
This word is used to address a table which contains a new value for the
central pixel.
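The lookup-table scheme can be sketched directly: pack the 3x3 neighbourhood into a 9-bit word and index a 512-entry table. The names are illustrative; the example table implements expand:

```python
import numpy as np

def neighbourhood_index(padded, i, j):
    """Pack the 3x3 neighbourhood of a binary image into a 9-bit word (0..511)."""
    word = 0
    for di in range(3):
        for dj in range(3):
            word = (word << 1) | int(padded[i + di, j + dj])
    return word

def apply_lut(image, table):
    """Apply any 3x3 binary morphological operator given as a 512-entry table."""
    padded = np.pad(image, 1)
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = table[neighbourhood_index(padded, i, j)]
    return out

# Example table: expand, i.e. set the pixel if any bit of the word is set.
expand_table = np.array([1 if w else 0 for w in range(512)])
```

Any 3x3 operator, including one thinning pass of a skeletonisation algorithm, can be expressed this way by filling in a different table.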
The Hough transform is a technique which can be used to isolate features of a particular shape within an image.
Because it requires that the desired features be specified in some parametric form, the classical Hough transform is most commonly used for the detection of regular curves such as lines, circles, ellipses, etc.
A generalized Hough transform can be employed in applications where a simple analytic description of a feature(s) is not possible.
The Hough transform has many applications, as most manufactured parts (and many anatomical parts investigated in medical imagery) contain feature boundaries which can be described by regular curves or straight lines.
The main advantage of the Hough transform is that it is tolerant of gaps in feature boundary descriptions and is relatively unaffected by image noise.
The Hough technique is useful for computing a global description of a feature(s) (where the number of solution classes need not be known a priori), given (possibly noisy) local measurements.
The motivating idea behind the Hough technique for line detection is that each input measurement (e.g. coordinate point) indicates its contribution to a globally consistent solution (e.g. the physical line which gave rise to that image point).
As a simple example, consider the common problem of fitting a set of line segments to a set of discrete image points (e.g. pixel locations output from an edge detector). The diagram below shows some possible solutions to this problem.
a) Coordinate points. b) and c) Possible straight line fittings.
We can analytically describe a line segment in a number of forms. However, a convenient equation for describing a set of lines uses the parametric or normal notation:
x cos θ + y sin θ = r
where r is the length of a normal from the origin to this line and θ is the orientation of the normal with respect to the X-axis. (See Figure 2.) For any point (x, y) on this line, r and θ are constant.
Figure 2 Parametric description of a straight line.
In an image analysis context, the coordinates of the point(s) of edge segments (i.e. (x_i, y_i)) in the image are known and therefore serve as constants in the parametric line equation, while r and θ are the unknown variables we seek. If we plot the possible (r, θ) values defined by each (x_i, y_i), points in cartesian image space map to curves (i.e. sinusoids) in the polar Hough parameter space. This point-to-curve transformation is the Hough transformation for straight lines. When viewed in Hough parameter space, points which are collinear in the cartesian image space become readily apparent as they yield curves which intersect at a common (r, θ) point.
The transform is implemented by quantizing the Hough parameter space into finite intervals or accumulator cells (i.e. a multidimensional array). As the algorithm runs, each (x_i, y_i) is transformed into a discretized (r, θ) curve and the accumulator cells which lie along this curve are incremented.
Peaks in the accumulator array represent strong evidence that a corresponding straight line exists in the image.
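The accumulation step can be sketched as follows for straight lines. This is a minimal illustration rather than an optimised implementation; the function name, the test points and the image size are invented:

```python
import numpy as np

def hough_lines(points, width, height, n_theta=180):
    """Vote in a quantized (r, theta) accumulator for a set of edge points.

    Each point (x, y) votes along its sinusoid r = x*cos(theta) + y*sin(theta);
    r is offset by the image diagonal so that negative values index the array.
    """
    diag = int(np.ceil(np.hypot(width, height)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    cols = np.arange(n_theta)
    for x, y in points:
        r = x * np.cos(thetas) + y * np.sin(thetas)
        acc[np.round(r).astype(int) + diag, cols] += 1
    return acc, thetas, diag

# Twenty collinear points on the horizontal line y = 5 of a 20x20 image:
acc, thetas, diag = hough_lines([(x, 5) for x in range(20)], 20, 20)
peak_r, peak_t = np.unravel_index(acc.argmax(), acc.shape)
```

The peak lands at r = 5, θ ≈ 90°, i.e. the normal form of the horizontal line y = 5, illustrating how collinear points concentrate their votes in one accumulator cell.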
We can use this same procedure to detect other features with analytical descriptions. For instance, in the case of circles, the parametric equation is
(x − a)² + (y − b)² = r²
where a and b are the coordinates of the center of the circle and r is the radius. In this case, the computational complexity of the algorithm begins to increase as we now have three coordinates in the parameter space and a 3-D accumulator. (In general, the computation and the size of the accumulator array increase polynomially with the number of parameters. Thus, the basic Hough technique described here is only practical for simple curves.)
The Hough transform can be used to identify the parameter(s) of a curve which best fits a set of given edge points.
This edge description is commonly obtained by using an edge detector such as the zero crossings of the Laplacian. The edge image may be noisy, i.e. it may contain multiple edge fragments corresponding to a single whole feature.
Since the output of an edge detector defines only where features are in an image, the work of the Hough transform is to determine both what the features are (i.e. to detect the feature(s) for which it has a parametric description) and how many of them exist in the image.
In order to illustrate the Hough transform in detail, we begin with the simple image of two occluding rectangles,
An edge detector can produce a set of boundary descriptions for this part, as shown in
Here we see the overall boundaries in the image, but this result tells us nothing about the identity (and quantity) of feature(s) within this boundary description. In this case, we can use the Hough (line detecting) transform to detect the eight separate straight lines segments of this image and thereby identify the true geometric structure of the subject.
If we use these edge/boundary points as input to the Hough transform, a curve is generated in polar space for each edge point in cartesian space. The accumulator array, when viewed as an intensity image, looks like
Enhancing the contrast of this image allows us to see the patterns of information contained in the low intensity pixel values.
Although r and θ are notionally polar coordinates, the accumulator space is plotted rectangularly.
Note that the accumulator space wraps around at the vertical edge of the image such that, in fact, there are only 8 real peaks.
Curves generated by collinear points in the gradient image intersect in peaks in the Hough transform space. These intersection points characterize the straight line segments of the original image.
There are a number of methods which one might employ to extract these bright points, or local maxima, from the accumulator array.
For example, a simple method involves thresholding and then applying some thinning to the isolated clusters of bright spots in the accumulator array image.
Here we use a relative threshold to extract the unique points corresponding to each of the straight line edges in the original image.
(In other words, we take only those local maxima in the accumulator array whose values are equal to or greater than some fixed percentage of the global maximum value.)
Mapping back from Hough transform space (i.e. deHoughing) into cartesian space yields a set of line descriptions of the image subject. By overlaying this image on an inverted version of the original, we can confirm the result that the Hough transform found the 8 true sides of the two rectangles.
The accuracy of alignment of detected and original image lines is not perfect due to the quantization of the accumulator array.
Note also that the lines generated by the Hough transform are infinite in length. If we wish to identify the actual line segments which generated the transform parameters, further image analysis is required in order to see which portions of these infinitely long lines actually have points on them.
To illustrate the Hough technique's robustness to noise, the edge description has been corrupted by noise.
The result, plotted in Hough space, is
DeHoughing this result (and overlaying it on the original) yields
(As in the above case, the relative threshold is 40%.)
The sensitivity of the Hough transform to gaps in the feature boundary can be investigated by transforming the following image
The Hough representation is
The deHoughed image is
In this case, because the accumulator space did not receive as many entries as in previous examples, only 7 peaks were found, but these are all structurally relevant lines.
We will now show some examples with natural images. In the first case, we have a city scene where the buildings are obstructed in fog,
If we want to find the true edges of the buildings, an edge detector cannot recover this information very well, as shown in:
However, the Hough transform can detect some of the straight lines representing building edges within the obstructed region. The accumulator space representation of the original image is shown here:
If we set the relative threshold to 70%, we get the following deHoughed image
Only a few of the long edges are detected here, and there is a lot of duplication where many lines or edge fragments are nearly collinear. Applying a more generous relative threshold, i.e. 50%, yields
This yields more of the expected lines, but at the expense of many spurious lines arising from the many collinear edge fragments.
Another example comes from a remote sensing application. Here we would like to detect the streets in the image
of a reasonably rectangular city sector. We can edge detect the image as follows:
However, street information is not available as output of the edge detector alone. The image
shows that the Hough line detector is able to recover some of this information. Because the contrast in the original image is poor, a limited set of features (i.e. streets) is identified.
Generalized Hough Transform
The generalized Hough transform is used when the shape of the feature that we wish to isolate does not have a simple analytic equation describing its boundary. In this case, instead of using a parametric equation of the curve, we use a lookup table to define the relationship between the boundary positions and orientations and the Hough parameters.
For example, suppose that we know the shape and orientation of the desired feature. (See Figure 3.) We can specify an arbitrary reference point within the feature, with respect to which the shape of the feature is defined (i.e. the distance and angle of normal lines drawn from the boundary to this reference point). Our lookup table (i.e. R-table) will consist of these distance and direction pairs, indexed by the orientation of the boundary.
Figure 3: Description of R-table components.
The Hough transform space is now defined in terms of the possible positions of the shape in the image, i.e. the possible ranges of the reference point (x_c, y_c). In other words, the transformation is defined by:
x_c = x + r cos(α)
y_c = y + r sin(α)
(The r and α values are derived from the R-table for particular known boundary orientations φ.) If the orientation of the desired feature is unknown, this procedure is complicated by the fact that we must extend the accumulator by incorporating an extra parameter to account for changes in orientation.
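A minimal sketch of R-table construction and the voting step, under the assumption that each boundary point carries a single gradient orientation. All names and the diamond-shaped example are illustrative:

```python
import numpy as np
from collections import defaultdict

def build_r_table(boundary_points, gradient_angles, reference):
    """For each boundary orientation phi, store the offset (r, alpha)
    from the boundary point to the chosen reference point."""
    table = defaultdict(list)
    xr, yr = reference
    for (x, y), phi in zip(boundary_points, gradient_angles):
        r = np.hypot(xr - x, yr - y)
        alpha = np.arctan2(yr - y, xr - x)
        table[round(phi, 2)].append((r, alpha))   # quantize phi as the key
    return table

def accumulate(edge_points, edge_angles, table, shape):
    """Vote for possible reference-point positions (x_c, y_c)."""
    acc = np.zeros(shape, dtype=int)
    for (x, y), phi in zip(edge_points, edge_angles):
        for r, alpha in table.get(round(phi, 2), []):
            xc = int(round(x + r * np.cos(alpha)))
            yc = int(round(y + r * np.sin(alpha)))
            if 0 <= xc < shape[1] and 0 <= yc < shape[0]:
                acc[yc, xc] += 1
    return acc

# Hypothetical example: four boundary points of a diamond with its
# reference point at the centre (2, 2); the angles are illustrative.
boundary = [(2, 0), (4, 2), (2, 4), (0, 2)]
angles = [0.0, 1.57, 3.14, 4.71]
table = build_r_table(boundary, angles, reference=(2, 2))
acc = accumulate(boundary, angles, table, shape=(5, 5))
```

When the same boundary points are fed back through the table, every vote lands on the reference point, so the accumulator peak recovers the shape's position.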