13.2. Searching#

This section develops search methods from foundational to efficient, then evaluates them with correctness checks and timing evidence.

Learning Goals

  • Implement linear and binary search correctly

  • Explain when sorted order changes what is possible

  • Use hash-based lookup for repeated membership checks

  • Compare asymptotic cost with measured runtime

  • Select a search strategy from data and workload constraints

Section Roadmap

  • Establish correctness and edge-case behavior first.

  • Compare scaling behavior as input size and lookup count increase.

  • Use the decision guide to choose an approach for concrete scenarios.

from time import perf_counter   # High-resolution timer for benchmark cells.
import random                   # Randomized sample generation for lookup demos.

13.2.3. The bisect Module#

The bisect module, a Python implementation of binary search, provides production-ready tools for searching and insertion in sorted lists. The term “bisect” means “cut in half,” which is conceptually the same as “binary search.”

| Function | What it does |
| --- | --- |
| bisect_left(a, x) | Leftmost index where x can be inserted while preserving sort order |
| bisect_right(a, x) | Rightmost insertion index (also available as bisect.bisect) |
| insort(a, x) | Inserts x into its sorted position (in-place) |

import bisect

a = [1, 3, 5, 7, 9]
print(bisect.bisect_left(a, 5))   # returns 2: index of the existing 5 (leftmost position)
print(bisect.bisect_right(a, 5))  # returns 3: one position past the existing 5
2
3

13.2.3.1. Why bisect#

The hand-written binary_search function from the previous section is a clean illustration, but bisect is almost always the better choice in practice for three reasons:

  1. Correctness. Off-by-one errors in binary search are notoriously easy to introduce. The three most common mistakes are: using left < right instead of left <= right (exits one step early), left = mid instead of left = mid + 1 (infinite loop), and right = mid instead of right = mid - 1 (same). Each fails only on certain inputs, making the bug easy to miss. The bisect module is part of the Python standard library and has been validated across many Python versions and platforms.

  2. Insertion point. binary_search returns only an index or -1. bisect_left and bisect_right always return a valid insertion point, which enables tasks beyond simple lookup: finding the nearest value, counting elements in a range, or inserting while preserving order. A -1 sentinel cannot express these results. If the target is larger than all elements, bisect_left returns len(a).

  3. Built-in insertion support. bisect.insort keeps a list sorted as elements arrive, combining the position search and the insert in one call. Replicating this correctly with a hand-written implementation requires both a correct binary search and careful use of list.insert.

When bisect is the right choice:

  • The data is already sorted (or can be kept sorted).

  • Lookups, insertions, or range queries on sorted data are needed.

When a custom implementation may be warranted:

  • The search logic differs from a simple equality check (e.g., searching a rotated array).

  • The data structure is not a plain list (e.g., a sorted linked list or a binary search tree).

The most important practical pattern is: locate with bisect_left, then verify equality. This prevents false positives when the target is absent and avoids re-implementing binary search logic.
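One task mentioned earlier, finding the nearest value, falls out of the insertion point directly. A minimal sketch under that idea (closest_value is our helper name for illustration, not part of the bisect module):

```python
import bisect

def closest_value(sorted_items, target):
    """Return the element of sorted_items closest to target (list must be non-empty)."""
    i = bisect.bisect_left(sorted_items, target)
    if i == 0:                       # target is <= every element
        return sorted_items[0]
    if i == len(sorted_items):       # target is > every element
        return sorted_items[-1]
    before, after = sorted_items[i - 1], sorted_items[i]
    # bisect_left guarantees before < target <= after; pick the nearer neighbor.
    return before if target - before < after - target else after

a = [1, 4, 10, 12]
print(closest_value(a, 9))    # 10 (distance 1 beats distance 5)
print(closest_value(a, 5))    # 4
print(closest_value(a, 100))  # 12 (target beyond the last element)
```

Note that the two boundary checks mirror the `i < len(sorted_items)` guard in the bisect_search idiom: bisect_left can legitimately return 0 or len(a).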

13.2.3.2. Insertion with insort#

bisect.insort(a, x) keeps a list sorted as new elements arrive. Internally it calls bisect_right to find the correct position, then calls list.insert() to place the element there.

The two-step cost matters in practice:

  • Finding the position takes \(O(\log n)\) — binary search locates the slot.

  • Shifting elements to make room takes \(O(n)\) — every item after the insertion point moves one index to the right.

The overall cost per insort call is therefore \(O(n)\), dominated by the shift. This is acceptable when insertions are infrequent and the list must remain sorted at all times (an online scenario: items arrive one at a time and future items are unknown). It becomes a bottleneck when adding many items at once: appending all items and calling list.sort() once costs \(O(n \log n)\) total, versus \(O(n^2)\) for \(n\) individual insort calls.
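A quick measurement makes the \(O(n^2)\) versus \(O(n \log n)\) gap concrete. This is a sketch; the absolute times will vary by machine, but the ordering should not:

```python
import bisect
import random
from time import perf_counter

random.seed(0)
items = [random.random() for _ in range(20_000)]

start = perf_counter()
a = []
for x in items:
    bisect.insort(a, x)          # O(n) per call: binary search plus element shift
t_insort = perf_counter() - start

start = perf_counter()
b = list(items)
b.sort()                         # one O(n log n) sort at the end
t_sort = perf_counter() - start

assert a == b                    # both approaches produce the same sorted list
print(f"insort loop: {t_insort:.4f}s, append then sort once: {t_sort:.4f}s")
```

Use insort when the list must be queryable in sorted order between insertions; batch with one sort() when it does not.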

The code below illustrates all three ideas: the absent/present distinction between bisect_left and bisect_right, the insort operation, and the bisect_search idiom.

13.2.3.3. The bisect_search() Idiom#

The bisect_search function is the idiom to use whenever you need to check existence or find an index. It includes the bounds check and equality check. The bounds check must come first due to Python’s short-circuit evaluation: if the index is out of bounds, the second condition is never evaluated, avoiding an IndexError.

import bisect

data = [1, 3, 4, 7, 9, 14, 20]

print(f"bisect_left(5)  → index {bisect.bisect_left(data, 5)}")    # 3 — absent: insert at index 3 (between 4 and 7)
print(f"bisect_right(5) → index {bisect.bisect_right(data, 5)}")   # 3 — absent: same index as bisect_left

print(f"bisect_left(7)  → index {bisect.bisect_left(data, 7)}")    # 3 — present: index of existing 7
print(f"bisect_right(7) → index {bisect.bisect_right(data, 7)}")   # 4 — present: one past existing 7

bisect.insort(data, 5)
print(f"after insort(5): {data}")                                   # 5 inserted at index 3

def bisect_search(sorted_items, target):
    """Return index of target in sorted_items, or -1 if not found."""
    i = bisect.bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:         # sorted_items[i] is the first item >= target; check if it's a match
        return i
    return -1                                                       # target not found

nums = [1, 3, 4, 7, 9, 14, 20]
print(bisect_search(nums, 9))   # 4
print(bisect_search(nums, 6))   # -1

# Edge case: target larger than all elements — bisect_left returns len(nums)
print(f"bisect_left(999) → index {bisect.bisect_left(nums, 999)}")  # 7 — equals len(nums); out of bounds, which is why bisect_search checks i < len(sorted_items)
bisect_left(5)  → index 3
bisect_right(5) → index 3
bisect_left(7)  → index 3
bisect_right(7) → index 4
after insort(5): [1, 3, 4, 5, 7, 9, 14, 20]
4
-1
bisect_left(999) → index 7

When searching sorted data, several related questions arise, and bisect handles them all:

  • Does the target exist? — bisect_search idiom handles this

  • Where is the first occurrence? — bisect_left + verify

  • Where is the last occurrence? — bisect_right - 1 + verify

  • How many occurrences? — bisect_right(a, x) - bisect_left(a, x)

  • What is the range of indices? — combines the above two

This is a powerful pattern because bisect handles all of these without any custom loop.

import bisect

a = [1, 2, 2, 2, 3]

left  = bisect.bisect_left(a, 2)    # 1
right = bisect.bisect_right(a, 2)   # 4
count = right - left                # 3
exists = left < len(a) and a[left] == 2  # True

print(f"2 appears {count} times in {a}, exists: {exists}")
2 appears 3 times in [1, 2, 2, 2, 3], exists: True
### Exercise: Using the bisect Module
#   scores = [72, 85, 88, 90, 95]
#   1. Use bisect.insort() to insert 87. Print the updated list.
#   2. Use bisect.bisect_left() and bisect.bisect_right() to count
#      how many times 88 appears in the updated list.
#   3. Use the bisect_search idiom to check whether 90 exists.
### Your code starts here.


### Your code ends here.


### Solution
import bisect
scores = [72, 85, 88, 90, 95]
bisect.insort(scores, 87)
print(scores)                                         # [72, 85, 87, 88, 90, 95]

left  = bisect.bisect_left(scores, 88)
right = bisect.bisect_right(scores, 88)
print(right - left)                                   # 1

idx = bisect.bisect_left(scores, 90)
print(idx < len(scores) and scores[idx] == 90)        # True
[72, 85, 87, 88, 90, 95]
1
True

13.2.4. Hash-Based Lookup#

The previous methods assume sorted data. If order doesn’t matter and you only need membership testing, a hash-based approach is often faster and simpler.

When the task is membership testing, a set or dict is often the best tool available. Hash tables use key-based addressing rather than scanning: a hash function maps each key to a memory slot directly, so a lookup does not need to examine other elements. Lookup time is therefore \(O(1)\) on average, regardless of collection size.

Common misconception: The in operator is not uniformly fast. target in my_list is \(O(n)\) — it performs a linear scan, equivalent to linear_search. Only target in my_set and target in my_dict are \(O(1)\) average case. The syntax looks identical; the data structure determines the cost.

When this approach fits

  • You need many membership checks on the same collection.

  • Sorted access or index-based operations are not required.

  • Extra memory to hold the hash table is acceptable.

Key trade-off: building the hash table has an upfront \(O(n)\) cost, but repeated lookups are fast.
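A rough timing contrast illustrates the trade-off. This sketch uses arbitrary sizes and a fixed seed; exact times vary by machine, but the list should be dramatically slower:

```python
import random
from time import perf_counter

random.seed(1)
haystack = [random.randint(1, 1_000_000) for _ in range(50_000)]
haystack_set = set(haystack)      # O(n) one-time build cost
probes = [random.randint(1, 1_000_000) for _ in range(500)]

start = perf_counter()
hits_list = sum(p in haystack for p in probes)       # O(n) scan per probe
t_list = perf_counter() - start

start = perf_counter()
hits_set = sum(p in haystack_set for p in probes)    # O(1) average per probe
t_set = perf_counter() - start

assert hits_list == hits_set      # same answers, very different cost
print(f"list membership: {t_list:.4f}s, set membership: {t_set:.4f}s")
```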

random.seed(42)  # Fixed seed keeps the demonstration reproducible across runs.
numbers = [random.randint(1, 20_000) for _ in range(15_000)]
lookup_set = set(numbers)

target = numbers[len(numbers) // 2]  # Use one present value for contrast.
missing = 999_999  # Use one absent value for contrast.

print('target in list:', target in numbers)
print('target in set :', target in lookup_set)
print('missing in list:', missing in numbers)
print('missing in set :', missing in lookup_set)
target in list: True
target in set : True
missing in list: False
missing in set : False
### Exercise: Hash-Based vs. List Lookup
#   data = list(range(0, 10_000, 3))   # every third integer from 0 to 9999
#   1. Build data_set = set(data).
#   2. For each target in [9, 100, 999, 1000, 9999], print whether it is
#      in data (list) and in data_set (set) — one line per target.
#   3. Are results the same? Add a comment explaining which is faster and why.
### Your code starts here.


### Your code ends here.


### Solution
data = list(range(0, 10_000, 3))
data_set = set(data)
for t in [9, 100, 999, 1000, 9999]:
    print(t, t in data, t in data_set)  # results identical
# data_set is O(1) per lookup via hash table; data (list) is O(n) per lookup.
# For repeated membership tests, set is significantly faster.
9 True True
100 False False
999 True True
1000 False False
9999 True True

13.2.5. Correctness Checks#

Before benchmarking, verify correctness on edge cases that often reveal hidden bugs:

  • empty input

  • one-element input (present and absent targets)

  • duplicates (index semantics matter)

  • a target near the end (worst case for linear search)

assert turns expectations into executable checks: if a condition is false, execution stops immediately with an AssertionError, exposing the failing case.
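A failing assert surfaces a bug immediately. The deliberately broken search below (buggy_search is our illustration, not code from this chapter) misses the last element, exactly the "target near the end" edge case listed above:

```python
def buggy_search(items, target):
    # Deliberately wrong: range(len(items) - 1) skips the final element.
    for i in range(len(items) - 1):
        if items[i] == target:
            return i
    return -1

try:
    assert buggy_search([1, 2, 3], 3) != -1, "target 3 should be found"
except AssertionError as e:
    print("caught:", e)   # prints: caught: target 3 should be found
```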

tests = [  # Each pair is (input_sequence, target).
    ([], 5),
    ([10], 10),
    ([10], 5),
    ([1, 2, 2, 2, 3], 2),
    (list(range(100)), 99),
]

for arr, target in tests:           # Unpack each (input_sequence, target) test case.
    lin_idx = linear_search(arr, target)
    arr_sorted = sorted(arr)
    bin_idx = binary_search(arr_sorted, target)

    assert (target in set(arr)) == (target in arr)  # Set/list membership should agree as booleans.

    if target in arr:
        assert lin_idx != -1
        assert arr[lin_idx] == target  # Found case: index must point to a matching value.
        assert bin_idx != -1
        assert arr_sorted[bin_idx] == target
    else:
        assert lin_idx == -1  # Absent case: both searches should report -1.
        assert bin_idx == -1

print('All checks passed.')
All checks passed.

13.2.5.1. Finding First Duplicate#

Standard binary search returns a matching index, but not necessarily the leftmost one. With duplicates (for example, [1, 2, 2, 2, 4, 4, 5]), binary_search may land on any of the 2s. Many tasks need the first (leftmost) occurrence: finding the start of a range, counting elements, or inserting before all copies of a value.

The fix: when a match is found, record the index as a candidate and keep searching left (right = mid - 1). The loop terminates with the smallest index that matches.

Contract: binary_search_first() returns the index of the leftmost occurrence of target, or -1 if absent.

def binary_search_first(sorted_items, target):
    left, right = 0, len(sorted_items) - 1
    result = -1

    while left <= right:
        mid = (left + right) // 2
        guess = sorted_items[mid]

        if guess == target:
            result = mid  # Record candidate and keep searching left for first occurrence.
            right = mid - 1
        elif guess < target:
            left = mid + 1
        else:
            right = mid - 1

    return result

dup_data = [1, 2, 2, 2, 4, 4, 5]
assert binary_search_first([], 2) == -1  # Empty input check.
assert binary_search_first([2, 2, 2], 2) == 0  # All-duplicate input check.
assert binary_search_first(dup_data, 2) == 1
assert binary_search_first(dup_data, 4) == 4
assert binary_search_first(dup_data, 3) == -1  # Absent target check.

print(binary_search_first(dup_data, 2))  # expected: 1
print(binary_search_first(dup_data, 4))  # expected: 4
print(binary_search_first(dup_data, 3))  # expected: -1
print('binary_search_first checks passed.')
1
4
-1
binary_search_first checks passed.
### Exercise: First Occurrence with Duplicates
#   sorted_items = [1, 2, 2, 2, 4, 4, 5]
#   1. Use binary_search_first() to find the first index of 2 and of 4.
#   2. Use binary_search() for the same targets.
#   3. Are the results always the same? Add a comment explaining the difference.
### Your code starts here.


### Your code ends here.


### Solution
sorted_items = [1, 2, 2, 2, 4, 4, 5]
print(binary_search_first(sorted_items, 2))  # 1 — guaranteed leftmost 2
print(binary_search_first(sorted_items, 4))  # 4 — guaranteed leftmost 4
print(binary_search(sorted_items, 2))         # 3 in this run — a valid index, not guaranteed leftmost
print(binary_search(sorted_items, 4))         # 5 in this run — a valid index, not guaranteed leftmost
# binary_search returns *a* matching index; binary_search_first always returns the *first*.
# The difference matters when you need to iterate forward from the match.
1
4
3
5

13.2.6. Timing Comparison#

This section evaluates a practical question: When is preprocessing worth it?

Two workloads are benchmarked to represent common usage patterns:

  • One lookup: measure the full workflow cost, including any preprocessing (sorting or building a set).

  • Many repeated lookups: perform preprocessing once, then reuse the resulting structure across many queries.

This side-by-side design separates two ideas that are often conflated:

  • Asymptotic lookup cost (for example, \(O(\log n)\) or \(O(1)\)), and

  • Total workflow cost (setup + lookup).

Interpretation guide:

  • If only one lookup is needed, preprocessing may dominate total time.

  • If many lookups are needed, preprocessing is often amortized, so faster lookup methods become more beneficial.

Why wrapper functions are used. linear_search and binary_search are passed directly to average_runtime because their signatures match fn(data, target). Set membership (target in lookup_set) is an expression, not a callable with that signature, so it is wrapped in named functions. This keeps the benchmark interface consistent and makes each timing represent a comparable workflow.

We use time.perf_counter() for higher-resolution timing. The coarser time.time() from earlier chapters is acceptable for rough timing, but less reliable for short operations.

Predict before running: For n = 100,000 with a single lookup, which method do you expect to be fastest — linear search, sort-then-binary, or build-set-then-lookup? Write your prediction and reasoning before executing the cell below.

(Hint: think about the full workflow cost — preprocessing included — not just the per-lookup complexity.)

def average_runtime(fn, *args, repeats=30):
    total = 0.0
    for _ in range(repeats):
        start = perf_counter()
        fn(*args)
        total += perf_counter() - start
    return total / repeats  # Average repeated measurements to reduce timing noise.


def sort_then_binary_once(data, target):
    sorted_data = sorted(data)
    return binary_search(sorted_data, target)  # One-time workflow: sort, then one binary lookup.


def build_set_once(data, target):
    lookup_set = set(data)
    return target in lookup_set  # One-time workflow: build hash table, then one membership test.


def linear_many(data, targets):
    return sum(linear_search(data, target) != -1 for target in targets)  # Repeated lookups, no preprocessing.


def sort_then_binary_many(data, targets):
    sorted_data = sorted(data)
    return sum(binary_search(sorted_data, target) != -1 for target in targets)  # Preprocess once, reuse sorted data.


def build_set_many(data, targets):
    lookup_set = set(data)
    return sum(target in lookup_set for target in targets)  # Preprocess once, reuse hash table.


sizes = [10_000, 50_000, 100_000, 200_000]
print('One lookup: includes preprocessing cost')
print(f"{'n':>10} {'linear total':>15} {'sort+binary':>15} {'build+set':>15}")

for n in sizes:
    data = list(range(n))
    target = n - 1

    t_linear_one = average_runtime(linear_search, data, target)
    t_binary_one = average_runtime(sort_then_binary_once, data, target)
    t_set_one = average_runtime(build_set_once, data, target)

    print(f"{n:>10} {t_linear_one:>15.6f} {t_binary_one:>15.6f} {t_set_one:>15.6f}")

print()
print('Many lookups: preprocessing is paid once and reused')
print(f"{'n':>10} {'linear total':>15} {'sort+binary':>15} {'build+set':>15}")

for n in sizes:
    data = list(range(n))
    targets = [0, n // 4, n // 2, (3 * n) // 4, n - 1] * 20  # Mix positions to avoid one-case bias.

    t_linear_many = average_runtime(linear_many, data, targets)
    t_binary_many = average_runtime(sort_then_binary_many, data, targets)
    t_set_many = average_runtime(build_set_many, data, targets)

    print(f"{n:>10} {t_linear_many:>15.6f} {t_binary_many:>15.6f} {t_set_many:>15.6f}")
One lookup: includes preprocessing cost
         n    linear total     sort+binary       build+set
     10000        0.000155        0.000032        0.000055
     50000        0.000782        0.000143        0.000246
    100000        0.001531        0.000322        0.000551
    200000        0.003064        0.000667        0.001144

Many lookups: preprocessing is paid once and reused
         n    linear total     sort+binary       build+set
     10000        0.007790        0.000111        0.000051
     50000        0.038162        0.000247        0.000239
    100000        0.076312        0.000424        0.000523
    200000        0.163739        0.001260        0.001941

13.2.6.1. Interpreting the Results#

One-lookup results: Asymptotically, a single lookup should favor linear search: sorting costs \(O(n \log n)\) and building a hash table costs \(O(n)\), both at least as expensive as one \(O(n)\) scan. The measured numbers above disagree because the benchmark data, list(range(n)), is already sorted — Timsort detects sorted input and finishes in \(O(n)\) — and because sorted() and set() run in C while linear_search loops in Python. Constant factors dominate at these sizes; with shuffled data, the one-lookup comparison tightens and preprocessing becomes harder to justify.

Many-lookups results: Here the asymptotic story and the measurements agree. Preprocessing is paid once and amortized across all queries. As \(n\) grows, linear search scales poorly (\(O(n)\) per lookup), while binary search (\(O(\log n)\)) and set lookup (\(O(1)\) average) stay comparatively fast. The set typically wins on raw lookup speed; binary search is competitive when memory is constrained or the data is already sorted.

Takeaway: Total workflow cost = preprocessing cost + (number of lookups × per-lookup cost). The asymptotically fastest isolated lookup is not always the fastest workflow.
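The takeaway formula can be applied directly to estimate a break-even point. The per-operation costs below are illustrative placeholders in arbitrary units, not measurements:

```python
def total_cost(setup, per_lookup, k):
    """Total workflow cost = preprocessing cost + k lookups at per-lookup cost."""
    return setup + k * per_lookup

# Illustrative costs (arbitrary units, NOT measured values):
# linear search: no setup, expensive O(n) lookups;
# hash set: O(n) setup, cheap O(1) lookups.
linear = dict(setup=0.0, per_lookup=100.0)
hashed = dict(setup=500.0, per_lookup=1.0)

for k in (1, 5, 10, 100):
    lin = total_cost(linear["setup"], linear["per_lookup"], k)
    hsh = total_cost(hashed["setup"], hashed["per_lookup"], k)
    winner = "linear" if lin < hsh else "set"
    print(f"{k:>4} lookups: linear={lin:>7.0f}  set={hsh:>7.0f}  -> {winner}")
```

With these numbers the break-even sits near \(k = 500 / (100 - 1) \approx 5\): below it linear wins, above it the set wins.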

13.2.7. Search Strategy Summary#

No single search method is optimal in every context. Selection should be based on workload, data properties, and total cost.

| Strategy | Time | Needs sorted data? | Best for |
| --- | --- | --- | --- |
| Linear search | \(O(n)\) | No | One-off lookups; small or unsorted input |
| Binary search | \(O(\log n)\) | Yes | Repeated lookups on sorted data |
| bisect module | \(O(\log n)\) | Yes | Production search/insertion on sorted lists |
| set / dict lookup | \(O(1)\) average | No | High-volume membership checks |

Cost of sorting: Binary search and bisect require sorted input. Since sorting costs \(O(n \log n)\), it may not pay off for only one or two lookups.

Preprocessing trade-off: Building a set costs \(O(n)\) time and \(O(n)\) space, but each later membership check is typically \(O(1)\).

Decision checklist

  • Is the data already sorted?

  • How many lookups will be performed?

  • Is insertion while preserving order required?

  • Is extra memory acceptable?

In practice, choose the method with the lowest total workflow cost (setup + lookup), not just the lowest isolated lookup complexity.
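The checklist can be folded into a small helper. The function name and its rules are our illustration of the guidance above, not a library API, and real decisions should also weigh measured constants:

```python
def choose_strategy(is_sorted, n_lookups, needs_ordered_insert, memory_ok):
    """Map the decision checklist to a suggested search strategy (illustrative rules)."""
    if needs_ordered_insert:
        return "bisect (insort for ordered insertion)"
    if n_lookups <= 1:
        return "linear search (no preprocessing to amortize)"
    if memory_ok and not is_sorted:
        return "set/dict (O(n) build, O(1) average lookups)"
    if is_sorted:
        return "bisect (O(log n) lookups, no extra memory)"
    return "sort once, then bisect (O(n log n) setup amortized over lookups)"

# One-off lookup on unsorted data: nothing to amortize.
print(choose_strategy(is_sorted=False, n_lookups=1,
                      needs_ordered_insert=False, memory_ok=True))
# Many lookups on already-sorted data under a memory constraint.
print(choose_strategy(is_sorted=True, n_lookups=1000,
                      needs_ordered_insert=False, memory_ok=False))
```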