1 - DNA Search
Problem statement: Genes are commonly represented in computer software as a sequence of the characters A, C, G, and T. Each letter represents a nucleotide, and the combination of three nucleotides is called a codon. A codon codes for a specific amino acid that together with other amino acids can form a protein. A classic task in bioinformatics software is to find a particular codon within a gene.
1.1 - Storing DNA
- A nucleotide can be represented as an IntEnum with four cases (use IntEnum instead of just Enum, because IntEnum gives us comparison operators (<, >=, and so on) “for free”).
- A codon can be defined as a tuple of three Nucleotides.
- A gene may be defined as a list of Codons.
- Typically, genes on the internet will be in a file format that contains a giant string representing all of the nucleotides in the gene’s sequence. We will define such a string for an imaginary gene and call it gene_str (see the sketch after this list).
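A minimal sketch of these definitions might look like the following; the contents of gene_str and the string_to_gene() converter are illustrative, not taken from any real gene:

```python
from enum import IntEnum
from typing import Tuple, List

# IntEnum gives us comparison operators (<, >=, and so on) "for free"
Nucleotide = IntEnum('Nucleotide', ('A', 'C', 'G', 'T'))

Codon = Tuple[Nucleotide, Nucleotide, Nucleotide]  # type alias for codons
Gene = List[Codon]                                 # type alias for genes

# imaginary gene, as one giant string of nucleotides
gene_str: str = "ACGTGGCTCTCTAACGTACGTACGTACGGGGTTTATATATACCCTAGGACTCCCTTT"

def string_to_gene(s: str) -> Gene:
    """Convert a nucleotide string into a Gene (a list of Codons)."""
    gene: Gene = []
    for i in range(0, len(s), 3):
        if (i + 2) >= len(s):  # don't run off the end of the string
            return gene
        # build a codon out of three consecutive nucleotides
        codon: Codon = (Nucleotide[s[i]], Nucleotide[s[i + 1]], Nucleotide[s[i + 2]])
        gene.append(codon)
    return gene

my_gene: Gene = string_to_gene(gene_str)
```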
 
1.2 - Linear search
- One basic operation: searching a gene for a particular codon. The goal is simply to find out whether the codon exists within the gene or not.
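A linear search just walks the gene codon by codon. A sketch, continuing from the definitions above (the example codons and the expected output assume the illustrative gene_str):

```python
def linear_contains(gene: Gene, key_codon: Codon) -> bool:
    # look at every codon in order until we find the key or run out
    for codon in gene:
        if codon == key_codon:
            return True
    return False

acg: Codon = (Nucleotide.A, Nucleotide.C, Nucleotide.G)
gat: Codon = (Nucleotide.G, Nucleotide.A, Nucleotide.T)
print(linear_contains(my_gene, acg))  # True for the example gene_str above
print(linear_contains(my_gene, gat))  # False for the example gene_str above
```

This is O(n) in the number of codons: in the worst case, every codon is inspected.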
 
1.3 - Binary search
- There is a faster way to search than looking at every element, but it requires us to know something about the order of the data structure ahead of time. If we know that the structure is sorted, and we can instantly access any item within it by its index, we can perform a binary search.
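Binary search requires the gene to be sorted, so the sketch below sorts it up front; that sort costs O(n lg n), which only pays off if the same gene will be searched many times:

```python
def binary_contains(gene: Gene, key_codon: Codon) -> bool:
    # assumes gene is sorted
    low: int = 0
    high: int = len(gene) - 1
    while low <= high:              # while there is still a search space
        mid: int = (low + high) // 2
        if gene[mid] < key_codon:
            low = mid + 1           # key must be in the upper half
        elif gene[mid] > key_codon:
            high = mid - 1          # key must be in the lower half
        else:
            return True             # found it
    return False

my_sorted_gene: Gene = sorted(my_gene)
print(binary_contains(my_sorted_gene, acg))
print(binary_contains(my_sorted_gene, gat))
```

Each iteration halves the remaining search space, so the search itself is O(lg n).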
 
1.4 - A generic example
- The two search functions can be written generically, so they work on almost any Python sequence, by using the typing module and defining a Comparable protocol (see the sketch below).
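A sketch of the generic versions; the Comparable protocol expresses “any type that supports the comparison operators,” which is exactly what binary search needs:

```python
from typing import TypeVar, Iterable, Sequence, Protocol, Any

T = TypeVar("T")

def linear_contains(iterable: Iterable[T], key: T) -> bool:
    for item in iterable:
        if item == key:
            return True
    return False

C = TypeVar("C", bound="Comparable")

class Comparable(Protocol):
    def __eq__(self, other: Any) -> bool:
        ...

    def __lt__(self: C, other: C) -> bool:
        ...

    def __gt__(self: C, other: C) -> bool:
        return (not self < other) and self != other

    def __le__(self: C, other: C) -> bool:
        return self < other or self == other

    def __ge__(self: C, other: C) -> bool:
        return not self < other

def binary_contains(sequence: Sequence[C], key: C) -> bool:
    low: int = 0
    high: int = len(sequence) - 1
    while low <= high:
        mid: int = (low + high) // 2
        if sequence[mid] < key:
            low = mid + 1
        elif sequence[mid] > key:
            high = mid - 1
        else:
            return True
    return False

print(linear_contains([1, 5, 15, 15, 15, 20], 5))                      # True
print(binary_contains(["a", "d", "e", "f", "z"], "f"))                 # True
print(binary_contains(["john", "mark", "ronald", "sarah"], "sheila"))  # False
```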
 
2 - Maze solving - finding a path through a maze, to illustrate the breadth-first search, depth-first search, and A* algorithms
 
2.1 - Storing the maze
- Our maze will be a two-dimensional grid of Cells. A Cell is an enum with str values where " " will represent an empty space and "X" will represent a blocked space.
 - Our Maze class will internally keep track of a grid (a list of lists) representing its state. It will also have instance variables for the number of rows, number of columns, start location, and goal location. Its grid will be randomly filled with blocked cells.
 
 
- MazeLocation - we also need a way to refer to an individual location in the maze: a NamedTuple with properties representing the row and column of the location (see the sketch after this list).
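A sketch of the maze representation; the sparseness parameter, goal_test(), and successors() are assumptions about how the class will be used by the search algorithms later:

```python
import random
from enum import Enum
from typing import List, NamedTuple

class Cell(str, Enum):
    EMPTY = " "
    BLOCKED = "X"
    START = "S"
    GOAL = "G"
    PATH = "*"

class MazeLocation(NamedTuple):
    row: int
    column: int

class Maze:
    def __init__(self, rows: int = 10, columns: int = 10, sparseness: float = 0.2,
                 start: MazeLocation = MazeLocation(0, 0),
                 goal: MazeLocation = MazeLocation(9, 9)) -> None:
        self._rows: int = rows
        self._columns: int = columns
        self.start: MazeLocation = start
        self.goal: MazeLocation = goal
        # start with an empty grid, then randomly block cells
        self._grid: List[List[Cell]] = [[Cell.EMPTY for _ in range(columns)]
                                        for _ in range(rows)]
        self._randomly_fill(rows, columns, sparseness)
        self._grid[start.row][start.column] = Cell.START
        self._grid[goal.row][goal.column] = Cell.GOAL

    def _randomly_fill(self, rows: int, columns: int, sparseness: float) -> None:
        for row in range(rows):
            for column in range(columns):
                if random.uniform(0, 1.0) < sparseness:
                    self._grid[row][column] = Cell.BLOCKED

    def goal_test(self, ml: MazeLocation) -> bool:
        return ml == self.goal

    def successors(self, ml: MazeLocation) -> List[MazeLocation]:
        # neighbors (up, down, left, right) that are in bounds and not blocked
        locations: List[MazeLocation] = []
        if ml.row + 1 < self._rows and self._grid[ml.row + 1][ml.column] != Cell.BLOCKED:
            locations.append(MazeLocation(ml.row + 1, ml.column))
        if ml.row - 1 >= 0 and self._grid[ml.row - 1][ml.column] != Cell.BLOCKED:
            locations.append(MazeLocation(ml.row - 1, ml.column))
        if ml.column + 1 < self._columns and self._grid[ml.row][ml.column + 1] != Cell.BLOCKED:
            locations.append(MazeLocation(ml.row, ml.column + 1))
        if ml.column - 1 >= 0 and self._grid[ml.row][ml.column - 1] != Cell.BLOCKED:
            locations.append(MazeLocation(ml.row, ml.column - 1))
        return locations

    def __str__(self) -> str:
        return "\n".join("".join(cell.value for cell in row) for row in self._grid)
```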
 
2.2 - Search algorithms
We need a Node class, which will be used to keep track of how we got from one state to another state (or from one place to another place) as we search. You can think of a Node as a wrapper around a state.
- In the case of our maze-solving problem, those states are of type MazeLocation
- We’ll call the Node that a state came from its parent. We will also define our Node class as having cost and heuristic properties and with __lt__() implemented, so we can reuse it later in the A* algorithm (see the sketch after this list).
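A sketch of the Node wrapper, plus a node_to_path() helper, which is an assumption: once a goal Node is found, something has to walk the parent links backward to recover the actual path.

```python
from __future__ import annotations
from typing import TypeVar, Generic, List, Optional

T = TypeVar("T")

class Node(Generic[T]):
    def __init__(self, state: T, parent: Optional[Node[T]],
                 cost: float = 0.0, heuristic: float = 0.0) -> None:
        self.state: T = state
        self.parent: Optional[Node[T]] = parent
        self.cost: float = cost            # g(n), used by A*
        self.heuristic: float = heuristic  # h(n), used by A*

    def __lt__(self, other: Node[T]) -> bool:
        # order nodes by f(n) = g(n) + h(n), so a priority queue can pick the best one
        return (self.cost + self.heuristic) < (other.cost + other.heuristic)

def node_to_path(node: Node[T]) -> List[T]:
    # follow parent links from the goal node back to the start, then reverse
    path: List[T] = [node.state]
    while node.parent is not None:
        node = node.parent
        path.append(node.state)
    path.reverse()
    return path
```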
 
2.2.1 - Depth-first search
An in-progress depth-first search needs to keep track of two data structures: the stack of states (or “places”) that we are considering searching, which we will call the frontier; and the set of states that we have already searched, which we will call explored.
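A sketch of that bookkeeping, written generically against the Node class above; passing goal_test and successors in as callables is how the Maze methods can be plugged in later:

```python
from typing import TypeVar, Generic, List, Set, Callable, Optional

T = TypeVar("T")

class Stack(Generic[T]):
    def __init__(self) -> None:
        self._container: List[T] = []

    @property
    def empty(self) -> bool:
        return not self._container

    def push(self, item: T) -> None:
        self._container.append(item)

    def pop(self) -> T:
        return self._container.pop()  # LIFO

def dfs(initial: T, goal_test: Callable[[T], bool],
        successors: Callable[[T], List[T]]) -> Optional[Node[T]]:
    frontier: Stack[Node[T]] = Stack()   # where we've yet to go
    frontier.push(Node(initial, None))
    explored: Set[T] = {initial}         # where we've already been
    while not frontier.empty:
        current_node: Node[T] = frontier.pop()
        current_state: T = current_node.state
        if goal_test(current_state):
            return current_node          # goal found
        for child in successors(current_state):
            if child in explored:        # skip states we've already searched
                continue
            explored.add(child)
            frontier.push(Node(child, current_node))
    return None                          # exhausted everything, no goal
```

If dfs() returns a Node, node_to_path() can turn it into the list of MazeLocations that were visited on the way to the goal.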
2.2.2 - Breadth-first search
The solution paths to the mazes found by depth-first traversal seem unnatural. They are usually not the shortest paths. Breadth-first search (BFS) always finds the shortest path by systematically looking one layer of nodes farther away from the start state in each iteration of the search. There are particular problems in which a depth-first search is likely to find a solution more quickly than a breadth-first search, and vice versa. Therefore, choosing between the two is sometimes a trade-off between the possibility of finding a solution quickly and the certainty of finding the shortest path to the goal (if one exists).
- To implement BFS, a data structure known as a queue is required.
- the algorithm for a breadth-first search is identical to the algorithm for a depth-first search, with the frontier changed from a stack to a queue.
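A sketch, reusing Node and the typing imports from the dfs() sketch; the only change is the frontier type, with a deque from the standard library giving O(1) pops from the left end:

```python
from collections import deque
from typing import Deque

class Queue(Generic[T]):
    def __init__(self) -> None:
        self._container: Deque[T] = deque()

    @property
    def empty(self) -> bool:
        return not self._container

    def push(self, item: T) -> None:
        self._container.append(item)

    def pop(self) -> T:
        return self._container.popleft()  # FIFO: oldest item comes out first

def bfs(initial: T, goal_test: Callable[[T], bool],
        successors: Callable[[T], List[T]]) -> Optional[Node[T]]:
    frontier: Queue[Node[T]] = Queue()   # a queue instead of a stack
    frontier.push(Node(initial, None))
    explored: Set[T] = {initial}
    while not frontier.empty:
        current_node: Node[T] = frontier.pop()
        current_state: T = current_node.state
        if goal_test(current_state):
            return current_node
        for child in successors(current_state):
            if child in explored:
                continue
            explored.add(child)
            frontier.push(Node(child, current_node))
    return None
```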
 
2.2.3 - A* search
Like a BFS, an A* search aims to find the shortest path from start state to goal state. Unlike BFS, an A* search uses a combination of a cost function and a heuristic function to focus its search on pathways most likely to get to the goal quickly.
- The cost function, g(n), examines the cost to get to a particular state.
 - In the case of our maze, this would be how many previous steps we had to go through to get to the state in question.
 
 
- The heuristic function, h(n), gives an estimate of the cost to get from the state in question to the goal state.
 - It can be proved that if h(n) is an admissible heuristic, then the final path found will be optimal. An admissible heuristic is one that never overestimates the cost to reach the goal. On a two-dimensional plane, one example is a straight-line distance heuristic, because a straight line is always the shortest path.
 
 
- The total cost for any state being considered is f(n), which is simply the combination of g(n) and h(n). In fact, f(n) = g(n) + h(n). When choosing the next state to explore from the frontier, an A* search picks the one with the lowest f(n). This is how it distinguishes itself from BFS and DFS.
 
- PRIORITY QUEUES
 - To pick the state on the frontier with the lowest f(n), an A* search uses a priority queue as the data structure for its frontier. A priority queue keeps its elements in an internal order, such that the first element popped out is always the highest-priority element. (In our case, the highest-priority item is the one with the lowest f(n).) Usually this means the internal use of a binary heap, which results in O(lg n) pushes and O(lg n) pops.
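A minimal sketch built on the standard-library heapq module; it relies on Node.__lt__() comparing nodes by f(n), so whatever is popped is always the lowest-f(n) node currently on the frontier:

```python
from heapq import heappush, heappop

class PriorityQueue(Generic[T]):
    def __init__(self) -> None:
        self._container: List[T] = []

    @property
    def empty(self) -> bool:
        return not self._container

    def push(self, item: T) -> None:
        heappush(self._container, item)   # in by priority (uses __lt__)

    def pop(self) -> T:
        return heappop(self._container)   # out by priority (lowest first)
```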
 
 
- HEURISTICS
 - A heuristic is an intuition about the way to solve a problem. In the case of maze solving, a heuristic aims to choose the best maze location to search next, in the quest to get to the goal. In other words, it is an educated guess about which nodes on the frontier are closest to the goal. As was mentioned previously, if a heuristic used with an A* search produces an accurate relative result and is admissible (never overestimates the distance), then A* will deliver the shortest path. Heuristics that calculate smaller values end up leading to a search through more states, whereas heuristics closer to the exact real distance (but not over it, which would make them inadmissible) lead to a search through fewer states. Therefore, ideal heuristics come as close to the real distance as possible without ever going over it.
 
 
- EUCLIDEAN DISTANCE
 - As we learn in geometry, the shortest path between two points is a straight line. It makes sense, then, that a straight-line heuristic will always be admissible for the maze-solving problem. The Euclidean distance, derived from the Pythagorean theorem, is distance = √((difference in x)² + (difference in y)²). For our mazes, the difference in x is equivalent to the difference in columns between two maze locations, and the difference in y is equivalent to the difference in rows. Note that we are implementing this back in maze.py.
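A sketch; it is written as a function that closes over the goal location, so the returned function has the single-argument signature that astar() will expect for its heuristic:

```python
from math import sqrt
from typing import Callable

def euclidean_distance(goal: MazeLocation) -> Callable[[MazeLocation], float]:
    def distance(ml: MazeLocation) -> float:
        xdist: int = ml.column - goal.column   # difference in columns
        ydist: int = ml.row - goal.row         # difference in rows
        return sqrt((xdist * xdist) + (ydist * ydist))
    return distance
```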
 
 
- MANHATTAN DISTANCE
 - Euclidean distance is great, but for our particular problem (a maze in which you can move only in one of four directions) we can do even better. The Manhattan distance is derived from navigating the streets of Manhattan, the most famous of New York City’s boroughs, which is laid out in a grid pattern. To get from anywhere to anywhere in Manhattan, one needs to walk a certain number of horizontal blocks and a certain number of vertical blocks. (There are almost no diagonal streets in Manhattan.) The Manhattan distance is derived by simply finding the difference in rows between two maze locations and summing it with the difference in columns.
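A sketch, following the same closure pattern as euclidean_distance():

```python
def manhattan_distance(goal: MazeLocation) -> Callable[[MazeLocation], float]:
    def distance(ml: MazeLocation) -> float:
        xdist: int = abs(ml.column - goal.column)
        ydist: int = abs(ml.row - goal.row)
        return float(xdist + ydist)   # horizontal blocks + vertical blocks
    return distance
```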
 
 
- THE A* ALGORITHM
 - To go from BFS to A* search, we need to make several small modifications.
 - The first is changing the frontier from a queue to a priority queue. This way, the frontier will pop nodes with the lowest f(n).
- The second is changing the explored set to a dictionary. A dictionary will allow us to keep track of the lowest cost (g(n)) of each node we may visit. With the heuristic function now at play, it is possible some nodes may be visited twice if the heuristic is inconsistent. If the node found through the new direction has a lower cost to get to than the prior time we visited it, we will prefer the new route.
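A sketch showing both modifications; the step cost of 1 between adjacent states is an assumption that fits a grid maze where every move costs the same:

```python
from typing import Dict

def astar(initial: T, goal_test: Callable[[T], bool],
          successors: Callable[[T], List[T]],
          heuristic: Callable[[T], float]) -> Optional[Node[T]]:
    frontier: PriorityQueue[Node[T]] = PriorityQueue()  # pops lowest f(n) first
    frontier.push(Node(initial, None, 0.0, heuristic(initial)))
    explored: Dict[T, float] = {initial: 0.0}           # state -> lowest g(n) seen so far
    while not frontier.empty:
        current_node: Node[T] = frontier.pop()
        current_state: T = current_node.state
        if goal_test(current_state):
            return current_node
        for child in successors(current_state):
            new_cost: float = current_node.cost + 1     # assumes uniform step cost of 1
            # revisit a state only if we found a cheaper way to reach it
            if child not in explored or explored[child] > new_cost:
                explored[child] = new_cost
                frontier.push(Node(child, current_node, new_cost, heuristic(child)))
    return None
```

With the Maze sketch above, a call might look like astar(m.start, m.goal_test, m.successors, manhattan_distance(m.goal)).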
 
 
 
3 - Missionaries and cannibals
3.1 - Representing the problem
We will represent the problem by having a structure that keeps track of the west bank. How many missionaries and cannibals are on the west bank? Is the boat on the west bank?  Once we have this knowledge, we can figure out what is on the east bank, because anything not on the west bank is on the east bank.
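A sketch of such a state; the is_legal check (missionaries may never be outnumbered on either bank) and goal_test() are included because the search will need them, and MAX_NUM is assumed to be 3 missionaries and 3 cannibals:

```python
MAX_NUM: int = 3  # number of missionaries (and of cannibals) in the puzzle

class MCState:
    def __init__(self, missionaries: int, cannibals: int, boat: bool) -> None:
        self.wm: int = missionaries            # missionaries on the west bank
        self.wc: int = cannibals               # cannibals on the west bank
        self.em: int = MAX_NUM - missionaries  # the east bank is whatever isn't west
        self.ec: int = MAX_NUM - cannibals
        self.boat: bool = boat                 # True if the boat is on the west bank

    @property
    def is_legal(self) -> bool:
        # missionaries may never be outnumbered by cannibals on either bank
        if self.wm < self.wc and self.wm > 0:
            return False
        if self.em < self.ec and self.em > 0:
            return False
        return True

    def goal_test(self) -> bool:
        # everyone has made it safely to the east bank
        return self.is_legal and self.em == MAX_NUM and self.ec == MAX_NUM
```

A successors() method (not shown here) would generate every legal state reachable by moving one or two people across the river in the boat.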
3.2 - Solving the problem
- The function display_solution() converts a solution path into printed output - a human-readable solution to the problem. It works by iterating through all of the states in the solution path while keeping track of the last state as well. It looks at the difference between the last state and the state it is currently iterating on to find out how many missionaries and cannibals moved across the river and in which direction.
- The last thing we need to do is actually solve the missionaries-and-cannibals problem. To do so, we can conveniently reuse the search functions that we have already implemented, because we implemented them generically.
 - use bfs() (because using dfs() would require marking referentially different states with the same value as equal, and astar() would require a heuristic).
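A sketch of the glue code, assuming the successors() method just mentioned and the display_solution() function described above exist, alongside the generic bfs() and node_to_path() from earlier:

```python
if __name__ == "__main__":
    start: MCState = MCState(MAX_NUM, MAX_NUM, True)  # everyone starts on the west bank
    solution: Optional[Node[MCState]] = bfs(start, MCState.goal_test, MCState.successors)
    if solution is None:
        print("No solution found!")
    else:
        path: List[MCState] = node_to_path(solution)
        display_solution(path)  # narrate each river crossing in human-readable form
```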
 
 
4 - Real-world applications
Search plays some role in all useful software. Knowing the correct search algorithm to apply to a data structure is essential for performance. 
A* is one of the most widely deployed path-finding algorithms. It is only beaten by algorithms that do pre-calculation in the search space. For a blind search, A* is yet to be reliably beaten in all scenarios, and this has made it an essential component of everything from route planning to figuring out the shortest way to parse a programming language. Most directions-providing map software (think Google Maps) uses Dijkstra’s algorithm (which A* is a variant of) to navigate. 
Breadth-first search and depth-first search are often the basis for more complex search algorithms like uniform-cost search and backtracking search. Breadth-first search is often a sufficient technique for finding the shortest path in a fairly small graph. But due to its similarity to A*, it is easy to swap out for A* if a good heuristic exists for a larger graph.