# Content of the library
The main purpose of this library is to present easy-to-implement algorithms, so priority was put on simplicity and readability rather than on efficiency. For example, even though Dijkstra’s shortest path algorithm has the best worst-case time complexity when implemented with a Fibonacci heap, we chose simpler implementations, which have worse but still acceptable time complexities.
The content of our library is organized by problem classes as follows.
## Basic algorithms and data structures
We illustrate how to read from standard input and write to standard output using Freivalds’ test, see freivalds. Given n by n matrices A, B, C, the goal is to decide whether AB=C. A naive test would have a time complexity of \(O(n^3)\). But Freivalds’ test generates a random vector x and tests in time \(O(n^2)\) whether \(ABx=Cx\). It can be shown that the probability that the test fails to detect a difference between AB and C is negligibly small.
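The idea fits in a few lines of Python. Here is a minimal self-contained sketch (the function name and signature are ours, not necessarily the library’s API):

```python
import random

def freivalds_test(A, B, C, trials=30):
    """Probabilistic check whether A*B == C for n-by-n matrices,
    in O(trials * n^2) time.  A single trial with a random 0/1
    vector fails to notice a difference with probability at most 1/2."""
    n = len(A)

    def mat_vec(M, x):                # M times x, in O(n^2)
        return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]

    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(n)]
        if mat_vec(A, mat_vec(B, x)) != mat_vec(C, x):
            return False              # witness found: certainly AB != C
    return True                       # very probably AB == C
```

With t trials the error probability drops to at most \(2^{-t}\), while the running time stays quadratic per trial.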
A simple implementation of a first-in-first-out queue is presented in our_queue, as well as of a heap in our_heap.
The union-find structure maintains a partitioning of n items (typically vertices of a graph), and implements the operation union(x,y) to merge the two sets containing x and y, as well as find(x) to return a canonical element of the set containing x. It is implemented in kruskal.
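To give the flavor of the structure, here is a compact union-find with path compression and union by rank (our own class, not the library’s exact code):

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # follow parent pointers to the root, halving the path on the way
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False                  # already in the same set
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx               # attach the shallower tree below
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True
```

Both operations run in near-constant amortized time.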
Python orders tuples lexicographically. We illustrate this in majority with a small example identifying a majority element in a given list. It is quite simple to sort a list in Python; we illustrate this in closest_values, a small example identifying the two closest values of a given list.
The sweepline algorithmic technique is important for many computational geometric problems. In max_interval_intersec we illustrate it on the problem of computing a value that is contained in a maximum number of intervals among a given set of intervals.
Other important techniques include the greedy algorithm, which we illustrate in scalar on the problem of permuting a vector x so as to minimize its scalar product with a given vector y.
Sometimes it is convenient to encode sets over a small universe as bit vectors in integers. We illustrate this technique on the problem of generating an arithmetic expression with a small given set of values to reach as close as possible some target value, see arithm_expr_target.
In binary_search we illustrate different variants of binary search. For example, ternary_search permits to find the maximum of a bitonic function in time logarithmic in the size of the search interval.
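As an illustration of the technique, here is a sketch of ternary search for the maximum of a bitonic (increasing then decreasing) function over a real interval (our own function, not necessarily the library’s signature):

```python
def ternary_search_max(f, lo, hi, eps=1e-9):
    """Return x close to the maximizer of a bitonic f on [lo, hi].
    Each iteration discards a third of the interval."""
    while hi - lo > eps:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1       # the maximum cannot lie left of m1
        else:
            hi = m2       # the maximum cannot lie right of m2
    return (lo + hi) / 2
```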
A very interesting data structure is called PartitionRefinement. It maintains a partition over the set {0,1,…,n-1}. The main operation is called refine(S) which splits each part P of the current partition into elements that are in S and elements that are not in S. The complexity of this operation is linear in the size of S.
A similar data structure is called PC_tree. It maintains a set of permutations over a ground set, which all satisfy a list of constraints of the form: in the permutation, all elements of a given set S need to be consecutive. Basically one can add a constraint, check if there is still a valid permutation, and generate one of those valid permutations. This data structure needs a doubly linked list Sequence, which is adapted from a code by David Eppstein.
## Strings
A word x is an anagram of a word y, if the letters of x can be permuted to form y. In anagrams we show how to detect all anagrams among a given list of words.
The generation of mobile phones that existed before smartphones offered the possibility to type text messages with the keys 2 to 9, using a letter-to-digit mapping printed on the keys. The phone predicted the most likely word the user wanted to type, using a frequency-augmented dictionary of words. In predictive_text we show how this can be done efficiently.
An elegant but not so well known algorithm by Manacher permits to detect all palindromes which are substrings of a given string, see manacher.
A well studied problem on strings is searching for a string t inside a string s. This is generally solved using the algorithm by Knuth, Morris and Pratt, see knuth_morris_pratt. Our implementation is based on the problem of finding the longest border of each prefix of a given string w, see maximum_border_length, which permits also to solve many other problems. One such application is to compute for a string x the largest power k such that there is a string y and x is the result of concatenating k copies of y.
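The border table at the heart of this approach can be computed in linear time. A sketch (our own code; the names only mirror the modules mentioned above), together with the power application:

```python
def maximum_border_length(w):
    """f[i] = length of the longest proper border of w[:i+1],
    a border being a string that is both a prefix and a suffix."""
    n = len(w)
    f = [0] * n
    k = 0                        # length of the current border
    for i in range(1, n):
        while k > 0 and w[i] != w[k]:
            k = f[k - 1]         # fall back to a shorter border
        if w[i] == w[k]:
            k += 1
        f[i] = k
    return f

def largest_power(x):
    """Largest k such that x is the concatenation of k copies of some y."""
    n = len(x)
    period = n - maximum_border_length(x)[-1]
    return n // period if n % period == 0 else 1
```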
An alternative algorithm for the string matching problem is due to Rabin and Karp, see rabin_karp. This algorithm has the advantage of generalizing to searching for multiple strings, to searching patterns in two dimensions, or even to finding the longest common substring of two given strings.
A trie, or prefix tree, is a data structure storing a set of words, which permits to search for a given word, allowing spelling errors. This is the data structure one needs in a spell checker. Our implementation is in trie.
## Sequences and Arrays
Many problems on sequences (which are basically strings) can be solved with dynamic programming. A classical example is the computation of the edit distance (also called Levenshtein distance) between two given strings, see levenshtein. Other examples include the longest common subsequence (longest_common_subsequence) and the longest increasing subsequence (longest_increasing_subsequence).
A very easy problem consists of computing the sorted union of two given sorted lists, see merge_ordered_lists. In the area of caching it is interesting to compute maximal time intervals where no cache fault happened. For this purpose one is given a list of items, and needs to identify all inclusion-wise maximal intervals that contain at most k distinct items. This problem is solved with the sliding window technique, which we illustrate in windows_k_distinct.
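A sketch of the sliding window technique for this problem (our own code; the library’s module may differ in details):

```python
from collections import Counter

def windows_k_distinct(items, k):
    """All inclusion-wise maximal intervals [lo, hi) of `items`
    containing at most k distinct values, by a sliding window in O(n)."""
    result = []
    count = Counter()                # multiset of values inside the window
    lo = hi = 0
    n = len(items)
    while hi < n:
        # grow to the right while at most k distinct values remain
        while hi < n and (items[hi] in count or len(count) < k):
            count[items[hi]] += 1
            hi += 1
        result.append((lo, hi))      # window cannot grow on either side
        if hi == n:
            break
        # shrink from the left until one distinct value disappears
        while len(count) == k:
            count[items[lo]] -= 1
            if count[items[lo]] == 0:
                del count[items[lo]]
            lo += 1
    return result
```

Both endpoints only move forward, which gives the linear running time.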
A data structure for an array is an object that stores the array (possibly in an implicit manner) and allows access and modification to the array, as well as different query operations. For example the range minimum query structure permits to modify individual items of an array and to query the minimum in a given interval of indices. This is solved generally with a segment tree, allowing logarithmic complexity (in the array size) for these operations, see range_minimum_query. There is a cute variant — called LazySegmentTree — which allows updates over index ranges, setting to a value or adding a value, and at the same time allows queries over index ranges, asking for the maximum, the minimum or the sum.
A similar structure is the Fenwick tree also called BIT or binary indexed tree. It permits again to modify individual items of the array, but also to query the sum over a given interval of indices, see fenwick.
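The classical Fenwick tree fits in a few lines. An illustrative sketch (our own class, not the library’s exact code):

```python
class Fenwick:
    """Binary indexed tree over t[1..n]: point update and prefix-sum
    query, both in O(log n)."""

    def __init__(self, n):
        self.s = [0] * (n + 1)

    def add(self, i, delta):          # t[i] += delta, 1-based index
        while i < len(self.s):
            self.s[i] += delta
            i += i & -i               # jump to the next responsible node

    def prefix_sum(self, i):          # sum of t[1..i]
        total = 0
        while i > 0:
            total += self.s[i]
            i -= i & -i               # strip the lowest set bit
        return total

    def interval_sum(self, a, b):     # sum of t[a..b]
        return self.prefix_sum(b) - self.prefix_sum(a - 1)
```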
## Intervals
The tryalgo library covers two nice problems on intervals. The first one is a data structure called interval tree, which stores a set of intervals and permits to select all those that contain a given value, see interval_tree.
The other problem consists in finding a smallest hitting set for a given set of intervals, that is, a smallest set of values that intersects each interval. This problem is solved with the sweep line technique mentioned above, see interval_cover.
## Graphs
For this library we decided to encode graphs in two different manners, at your convenience. In the adjacency list representation, a graph on n vertices is encoded as a list of n elements, where the i-th element is the adjacency list of the i-th vertex. Labels on edges or on vertices are stored in separate matrices or lists. We call it the listlist graph format. In contrast, in the adjacency dictionary representation, a graph is encoded as a list of n dictionaries, such that the keys of the i-th dictionary are the neighbors of the i-th vertex, and the values are the attached edge weights. We call this one the listdict graph format. Undirected graphs are represented as directed graphs by duplicating each edge into two opposite facing arcs. All functions in our library work with the listlist graph format, and most work also with the listdict graph format. This is documented for each function.
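To make the two formats concrete, here is the same small undirected weighted graph in both encodings (an illustrative snippet, not part of the library):

```python
# a path graph 0 -- 1 -- 2 with edge weights 1 and 3;
# undirected edges are stored as two opposite arcs

# listlist format: adjacency lists plus a separate weight matrix
graph_ll = [[1], [0, 2], [1]]
weight = [[None, 1, None],
          [1, None, 3],
          [None, 3, None]]

# listdict format: one dict per vertex, mapping neighbor -> edge weight
graph_ld = [{1: 1}, {0: 1, 2: 3}, {1: 3}]

# both encodings answer the same questions
assert sorted(graph_ll[1]) == sorted(graph_ld[1])   # neighbors of vertex 1
assert weight[1][2] == graph_ld[1][2] == 3          # weight of edge (1,2)
```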
Some graph libraries, like PADS, choose to represent graphs as dictionaries, where graph[u] would again be a dictionary mapping each neighbor v to the arc weight graph[u][v]. In this representation, nodes can be any hashable objects, like strings or tuples. We call it the dictdict graph format, and provide functions to convert between the different graph representations, namely matrix_to_listlist, listlist_and_matrix_to_listdict, listdict_to_listlist_and_matrix and dictdict_to_listdict. In addition we provide a class GraphNamedVertices, which allows vertices to be hashable objects like strings or tuples.
We use several representations for trees. A tree can be represented as an adjacency table, like a graph. In case the tree is rooted, it can be represented in the form of a node structure that contains references to descendant nodes, or in the form of an antecedent table, storing at index i the antecedent vertex of the i-th vertex in the tree, using None for the root.
In graph we provide several helper functions to read a graph from a file, or to write it into a file in the DOT format. This module contains also functions to convert between different tree representations and between graph representations.
Important operations on graphs are explorations along the edges, for example to detect connected components or shortest paths. The depth-first search is implemented in dfs, and illustrated in its iterative and recursive forms, as well as in the special case of exploring grids. The breadth-first search is implemented in bfs. Such a graph exploration can be used to check if a given graph is_bipartite. A variant called bfs_implicit can be used if the graph is given implicitly by a function mapping a vertex to its neighbors.
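As an illustration, a breadth-first search on a listlist graph computing edge-distances from a source (our own sketch, not the library’s exact code):

```python
from collections import deque

def bfs(graph, start=0):
    """Breadth-first search on a listlist graph; returns the distance
    (in number of edges) from start to every vertex, None if unreachable."""
    dist = [None] * len(graph)
    dist[start] = 0
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if dist[v] is None:          # first visit = shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```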
The problem of detecting the connected components of a graph is best solved using the union-find structure, which is implemented in kruskal.
A cut vertex is a vertex whose removal splits a connected component. A cut edge is defined similarly. Detecting cut vertices and cut edges is important in order to determine biconnected components, which are particular vertex sets such that each pair of vertices is connected by two vertex-disjoint paths. These sets are important for communication networks. A subtle modification of the depth-first search permits to detect these cut vertices and cut edges, see biconnected_components.
For directed graphs there are two important problems. The first one is the topological sorting, which consists in ordering the vertices, such that every arc points only from left to right, see topological_order.
Another important problem consists in determining strongly connected components, which are vertex sets such that for each vertex pair there is a directed path connecting them. These can be computed by an algorithm by Tarjan or by an algorithm by Kosaraju, see strongly_connected_components. The main application is the resolution of 2-SAT boolean formulas, see two_sat. Another polynomial variant of SAT is Horn-SAT, see horn_sat.
## Cycles
The library contains implementations of several cycle finding algorithms. The most basic problem consists of finding any cycle in a given undirected graph. In the second problem we are given an edge weighted graph and want to compute a cycle of minimum total weight. For the third problem we want to minimize the ratio of the total cycle weight over the cycle length. In the Eulerian cycle problem we want to find a cycle that visits every edge exactly once. In the iterated function cycle detection problem we are given implicitly a directed graph with outdegree 1, in the form of a function. Applying the function iteratively from a source vertex leads to an infinite sequence, which after d iterations cycles with a period c. The goal is to compute c and d.
| problem | graph | complexity | algorithm | implementation |
|---|---|---|---|---|
| find a cycle | undirected | \(O(\vert V\vert + \vert E\vert)\) | depth-first search | |
| shortest cycle | undirected | \(O(\vert V\vert \cdot \vert E\vert)\) | breadth-first search | |
| minimum weight cycle | directed | \(O(\vert V\vert \cdot \vert E\vert)\) | | |
| minimum mean cycle | directed | \(O(\vert V\vert \cdot \vert E\vert)\) | | |
| Eulerian cycle | both | \(O(\vert V\vert + \vert E\vert)\) | | |
| Hamiltonian cycle | complete | \(O(n^2 2^n)\) | | |
| iterated function cycle | outdeg=1 | \(O(\vert V\vert)\) | | |
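The iterated function cycle problem from the last row can for instance be solved with Floyd’s tortoise-and-hare technique, in constant memory. A sketch (our own code and names):

```python
def tortoise_hare(f, source):
    """Return (c, d): iterating f from source enters, after d steps,
    a cycle of length c.  Uses O(1) memory."""
    # phase 1: the hare moves twice as fast and meets the tortoise
    # somewhere inside the cycle
    tortoise, hare = f(source), f(f(source))
    while tortoise != hare:
        tortoise = f(tortoise)
        hare = f(f(hare))
    # phase 2: restart one pointer at the source; moving both at the
    # same speed, they meet at the cycle entrance, d steps from source
    d = 0
    tortoise = source
    while tortoise != hare:
        tortoise = f(tortoise)
        hare = f(hare)
        d += 1
    # phase 3: walk once around the cycle to measure its length c
    c = 1
    hare = f(tortoise)
    while hare != tortoise:
        hare = f(hare)
        c += 1
    return c, d

succ = [1, 2, 3, 4, 2]        # 0 -> 1 -> 2 -> 3 -> 4 -> 2 -> ...
```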
## Shortest paths
Several shortest path algorithms are included in the library, applying to different classes of graphs. They are summarized in the following table. For the complexity indication we assume that \(\vert E\vert \geq \vert V\vert\).
| problem | complexity | algorithm | implementation |
|---|---|---|---|
| unweighted graph | \(O(\vert E\vert)\) | | |
| grid | \(O(\vert E\vert)\) | breadth-first search adapted to the grid graph | |
| {0,1} weighted graph | \(O(\vert E\vert)\) | | |
| non-negative weighted graph | \(O(\vert E\vert \log \vert V\vert)\) | | |
| with lower bound on distance | \(O(\vert E\vert \log \vert V\vert)\) | | |
| arbitrary weighted graph | \(O(\vert E\vert \cdot \vert V\vert)\) | | |
| all source destination pairs | \(O(\vert V\vert^3)\) | | |
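As an example of the \(O(\vert E\vert \log \vert V\vert)\) row, here is a sketch of Dijkstra’s algorithm with a lazy-deletion binary heap on the listlist format (our own code, not the library’s exact implementation):

```python
import heapq

def dijkstra(graph, weight, source=0):
    """Single-source shortest paths for non-negative edge weights, on a
    listlist graph with a separate weight matrix.  Stale heap entries
    are skipped instead of being deleted."""
    dist = [float('inf')] * len(graph)
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                  # stale entry, u was already improved
        for v in graph[u]:
            nd = d + weight[u][v]
            if nd < dist[v]:          # relaxation step
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```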
## Trees
A classical example of a problem solved by the greedy algorithm is the problem of constructing optimal Huffman codes. An implementation can be found in the module huffman.
A similar but different problem concerns finding an optimal binary search tree, given the access frequencies of the keys. The module dyn_prog_tricks permits to solve this task in quadratic time.
Another example, just as classical and famous, is the problem of constructing a minimum weight spanning tree of a given edge-weighted connected graph. It is solved with Kruskal’s greedy algorithm, see kruskal.
The lowest common ancestor problem consists of building a data structure that stores a rooted tree and can answer efficiently queries of the form: “Which vertex is the closest common ancestor of two given vertices?” The most elegant solution consists in a reduction to the range minimum query problem, see lowest_common_ancestor.
## Sets
A simple data structure to store an ordered set allowing insertions and deletions is the skip tree. The expected cost of an update is \(O(\log n)\).
Many problems defined on sets can be solved by dynamic programming. This is the case of the knapsack problem. We are given n items, each with a size and a value, and we wish to find a subset of maximum total value whose total size does not exceed a given capacity C. This problem is NP-hard, but can be solved efficiently in time O(nC) if the capacity is bounded by a small value, see knapsack.
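The O(nC) dynamic program fits in a few lines. An illustrative sketch (our own function, not the library’s exact code):

```python
def knapsack(sizes, values, capacity):
    """0/1 knapsack by dynamic programming in O(n * capacity) time.
    best[c] = maximum value achievable with total size at most c."""
    best = [0] * (capacity + 1)
    for size, value in zip(sizes, values):
        # scan capacities downwards so that each item is used at most once
        for c in range(capacity, size - 1, -1):
            best[c] = max(best[c], best[c - size] + value)
    return best[capacity]
```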
In the coin change problem, we are given a collection of coins of n different values, an unbounded number of coins of each value, and a target value C. The goal is to find a set of coins of total value C. Again this problem can be solved by dynamic programming in time O(nC), see subsetsum. A similar problem is called the subset sum problem and consists of finding a subset of n given values that sums up to a target value C. It can be solved with the same method. When n is small and C large, there is a different algorithm with complexity \(O(2^{\lceil n/2 \rceil})\), see subsetsum_divide.
## Geometry
A very classical problem in computational geometry is the computation of the convex hull of a given point set in the Euclidean space. Generally text books present Graham’s algorithm. But for this library we made the choice of Andrew’s sweepline algorithm, which has the advantage of avoiding trigonometric operations, see convex_hull. (With some work Graham’s algorithm can also be implemented without trigonometric operations, but it is a bit more tricky than Andrew’s algorithm.)
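A sketch of Andrew’s monotone-chain algorithm, to show that only comparisons, subtractions and multiplications are needed (our own code, not the library’s exact implementation):

```python
def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order,
    in O(n log n) time, without any trigonometric operation."""
    def cross(o, a, b):   # z-component of (a - o) x (b - o)
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

    points = sorted(set(points))
    if len(points) <= 2:
        return points
    lower, upper = [], []
    for p in points:                      # build the lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(points):            # build the upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]        # endpoints appear in both chains
```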
Another no less classical problem is determining a closest pair among a given point set. It can be solved in time O(n log n) with a sweep line algorithm or with a divide and conquer approach. In this library we present a very simple randomized algorithm with expected linear running time, see closest_points.
Also among problems related to points is the computation of the Pareto set of a given set of points. It consists of all non-dominated points, where a point dominates another point if it has coordinates that are no larger in each dimension, and strictly smaller in at least one dimension. This can be computed in time O(n log n), both in 2 and in 3 dimensions.
The area of a given simple polygon can be computed in linear time, see polygon. Whether a given rectilinear polygon is simple can be tested with a sweepline algorithm in time O(n log n), see is_simple.
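The linear-time area computation is the classical shoelace formula. A minimal sketch (our own function):

```python
def polygon_area(polygon):
    """Signed area of a simple polygon given as a list of vertices
    (positive if counter-clockwise), by the shoelace formula, in O(n)."""
    area = 0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]   # next vertex, wrapping around
        area += x1 * y2 - x2 * y1
    return area / 2
```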
Here is an algorithmic puzzle that we like a lot. Given a set of n points in the plane, we wish to find out how many 4-tuples we can form such that they are the 4 corners of a rectangle. The solution can be found in rectangles_from_points.
Speaking of rectangles, a nice problem illustrating the amortized analysis consists in finding a largest rectangle under a given histogram. A linear time algorithm is implemented in rectangles_from_histogram. This algorithm is the key to solve another interesting problem. Given a binary matrix, we want to find the largest rectangular sub-matrix consisting only of ones. The linear time solution can be found in rectangles_from_grid.
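A sketch of the stack-based linear-time algorithm for the histogram problem (our own code; the library’s module may differ in details):

```python
def largest_rectangle_in_histogram(heights):
    """Largest rectangle area under a histogram, in O(n) amortized time.
    The stack keeps (start index, height) of rectangles still open to
    the right, with strictly increasing heights."""
    best = 0
    stack = []
    for i, h in enumerate(heights + [0]):    # sentinel 0 closes everything
        start = i
        while stack and stack[-1][1] >= h:
            s, sh = stack.pop()              # rectangle of height sh ends here
            best = max(best, sh * (i - s))
            start = s                        # the new bar extends back to s
        stack.append((start, h))
    return best
```

Each bar is pushed and popped at most once, which gives the amortized linear bound.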
Computing the area of the union of n given rectilinear rectangles can be done in time O(n log n) using a sweep line algorithm and a dynamic data structure called segment tree, see union_rectangles.
## Arithmetic
All prime numbers less than some given integer n are easily generated with Eratosthenes’ sieve, see primes. Its complexity is \(O(n \log\log n)\). This is improved by the Gries-Misra sieve, which not only has complexity \(O(n)\), but also produces a table indicating for every positive integer less than n its smallest prime factor.
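A sketch of the linear sieve, returning both the primes and the smallest-prime-factor table (our own code, not the library’s exact implementation):

```python
def linear_sieve(n):
    """Linear sieve: primes below n and spf[x] = smallest prime factor
    of x.  Every composite is crossed out exactly once, as p * x with
    p the smallest prime factor of the product."""
    primes = []
    spf = [0] * n
    for x in range(2, n):
        if spf[x] == 0:           # x was never crossed out: it is prime
            spf[x] = x
            primes.append(x)
        for p in primes:
            if p > spf[x] or p * x >= n:
                break
            spf[p * x] = p        # p is the smallest prime factor of p*x
    return primes, spf
```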
The library contains functions to compute the greatest common divisor (GCD in English, PGCD in French), the Bézout coefficients and the binomial coefficients, see arithm.
Fast exponentiation is a very powerful technique, which applies also to exponentiation of matrices, see fast_exponentiation.
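A sketch of fast (binary) exponentiation modulo q (our own function, not the library’s exact signature); the same square-and-multiply loop works verbatim for matrices under matrix multiplication:

```python
def fast_exponentiation(a, b, q):
    """Compute a**b mod q with O(log b) multiplications, by squaring."""
    result = 1
    a %= q
    while b > 0:
        if b & 1:               # current binary digit of b is 1
            result = result * a % q
        a = a * a % q           # square a for the next binary digit
        b >>= 1
    return result
```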
An arithmetic expression given in form of a string can be evaluated in different manners. The library contains a simple method using a stack for the operations and for the intermediate values, see arithm_expr_eval.
For solving a system of linear equations, a classical method is to use the Gauss-Jordan triangulation technique, see gauss_jordan.
When multiplying a sequence of matrices, the order of evaluation does not change the result, but placing the parentheses well permits to minimize the number of arithmetic operations needed for the computation. This is a classical problem which can be solved by dynamic programming, see matrix_chain_mult.
The module roman_numbers provides functions to convert an integer into its roman number representation string and vice-versa.
Suppose you want to multiply two polynomials given by the lists of their coefficients, and let n be their length. Karatsuba’s algorithm does exactly this, in time \(O(n^{1.585})\). The same module contains functions to add, subtract or evaluate polynomials. A more efficient method however uses the fast Fourier transform. The module fft provides a function fft() to compute the fast Fourier transform of a given vector, and the corresponding inverse function inv_fft(). They are used in the function mul_poly_fft(), which multiplies two polynomials in time \(O(n\log n)\).
## Backtracking
Sometimes all our known techniques fail on some problems, and then we need to attack them with brute force and backtracking. This technique is illustrated in laser_mirrors on a problem consisting of a grid containing, in some cells, two-sided mirrors which can be oriented at angles of 45 or 225 degrees. The goal is to find an orientation of the mirrors that deflects the trajectory of a laser beam entering at a specific position on the left border of the grid so that it reaches a specific position on the right border of the grid.
The Rolls-Royce of backtracking algorithms is the dancing links algorithm, which solves quite efficiently the NP-hard exact set cover problem. It is implemented in dancing_links and is illustrated on the classical Sudoku problem in sudoku.
Finally, a useful procedure is next_permutation(), which takes as input a table of size n containing a permutation of the integers 1 to n and rearranges it into the lexicographically next permutation, see next_permutation.
## Last words
We hope that you find the library instructive and useful. If you miss some functionality, let us know; you might also want to have a look at PADS and NetworkX.