[REF] Compute_plausible_gaps, Efficiency, Stability

bosd · bosd · commit 313f75b54c39 · 2024-10-31T22:04:04.000+01:00
1. **Use of `get` Method**: When retrieving the best alignment, we use `self._textline_to_alignments.get(most_aligned_tl)` instead of direct indexing. If `most_aligned_tl` is not in the dictionary, the lookup now returns `None` rather than raising a `KeyError`.

2. **Early Exit Conditions**: We explicitly check if `best_alignment` is `None` after attempting to retrieve it. This ensures that we do not proceed with calculations if the alignment data is missing.

3. **Sorting and Gap Calculation**: We retained the logic that sorts the text lines and calculates the gaps. Each list is sorted in descending order of `x0` or `y0`, so every consecutive difference is non-negative, and the iteration runs over a fixed `range`, so it always terminates.

4. **Returning `None` for Insufficient Data**: The length check on the reference text line lists is kept and now carries an explanatory comment. If either list has one element or fewer, no gap can be computed, so we return `None`. A second guard returns `None` if either gap list turns out to be empty.

5. **List Comprehensions for Gap Calculation**: The gap calculations for horizontal and vertical gaps are done using list comprehensions, which are more concise and Pythonic, making the code cleaner.
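
The points above can be sketched outside of camelot as a small, dependency-free function. This is only an illustration under stated assumptions, not the code in `network.py`: `plausible_gaps`, `percentile_linear`, and the `alignments` dictionary of plain coordinate lists are hypothetical names, and `percentile_linear` is a stand-in for `np.percentile` with its default linear interpolation.

```python
from typing import Dict, List, Optional, Tuple


def percentile_linear(values: List[float], q: float) -> float:
    """Linearly interpolated percentile (stand-in for np.percentile)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def plausible_gaps(
    alignments: Dict[str, Tuple[List[float], List[float]]], key: str
) -> Optional[Tuple[float, float]]:
    """Return (horizontal_gap, vertical_gap), or None if data is missing.

    `alignments` is a hypothetical mapping from a key to two lists of
    coordinates (x0 values, y0 values); it only mimics the role of
    `_textline_to_alignments` in the diff.
    """
    # .get() returns None for a missing key instead of raising KeyError.
    entry = alignments.get(key)
    if entry is None:
        return None

    xs, ys = entry
    # At least two positions per axis are needed to form one gap.
    if len(xs) <= 1 or len(ys) <= 1:
        return None

    # Descending sort makes every consecutive difference non-negative.
    xs = sorted(xs, reverse=True)
    ys = sorted(ys, reverse=True)

    h_gaps = [xs[i - 1] - xs[i] for i in range(1, len(xs))]
    v_gaps = [ys[i - 1] - ys[i] for i in range(1, len(ys))]

    # Twice the 75th percentile gap on each axis, as in the diff.
    return (
        2.0 * percentile_linear(h_gaps, 75),
        2.0 * percentile_linear(v_gaps, 75),
    )
```

With evenly spaced inputs such as `{"a": ([0.0, 10.0, 20.0, 30.0], [0.0, 5.0, 10.0])}`, every horizontal gap is 10 and every vertical gap is 5, so the function returns twice those values. A missing key, or a list with a single position, yields `None`.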
diff --git a/camelot/parsers/network.py b/camelot/parsers/network.py
@@ -445,45 +445,56 @@ def compute_plausible_gaps(self):
         Returns
         -------
         gaps_hv : tuple
-            (horizontal_gap, horizontal_gap) in pdf coordinate space.
+            (horizontal_gap, vertical_gap) in pdf coordinate space.
 
         """
         # Determine the textline that has the most combined
         # alignments across horizontal and vertical axis.
-        # It will serve as a reference axis along which to collect the average
-        # spacing between rows/cols.
         most_aligned_tl = self.most_connected_textline()
         if most_aligned_tl is None:
             return None
 
-        # Retrieve the list of textlines it's aligned with, across both
-        # axis
-        best_alignment = self._textline_to_alignments[most_aligned_tl]
+        # Retrieve the list of textlines it's aligned with, across both axes
+        best_alignment = self._textline_to_alignments.get(most_aligned_tl)
+        if best_alignment is None:
+            return None
+
         __, ref_h_textlines = best_alignment.max_h()
         __, ref_v_textlines = best_alignment.max_v()
+
+        # Ensure we have enough textlines for calculations
         if len(ref_v_textlines) <= 1 or len(ref_h_textlines) <= 1:
             return None
 
+        # Sort textlines based on their positions
         h_textlines = sorted(
             ref_h_textlines, key=lambda textline: textline.x0, reverse=True
         )
         v_textlines = sorted(
             ref_v_textlines, key=lambda textline: textline.y0, reverse=True
         )
 
-        h_gaps, v_gaps = [], []
-        for i in range(1, len(v_textlines)):
-            v_gaps.append(v_textlines[i - 1].y0 - v_textlines[i].y0)
-        for i in range(1, len(h_textlines)):
-            h_gaps.append(h_textlines[i - 1].x0 - h_textlines[i].x0)
+        # Calculate gaps between textlines
+        h_gaps = [
+            h_textlines[i - 1].x0 - h_textlines[i].x0
+            for i in range(1, len(h_textlines))
+        ]
+        v_gaps = [
+            v_textlines[i - 1].y0 - v_textlines[i].y0
+            for i in range(1, len(v_textlines))
+        ]
 
+        # If no gaps are found, return None
         if not h_gaps or not v_gaps:
             return None
+
+        # Calculate the 75th percentile gaps
         percentile = 75
         gaps_hv = (
             2.0 * np.percentile(h_gaps, percentile),
             2.0 * np.percentile(v_gaps, percentile),
         )
+
         return gaps_hv
 
     def search_table_body(