Skip to content

Conversation

@ikappaki
Copy link
Contributor

@ikappaki ikappaki commented Dec 30, 2024

Update1: this PR was repurposed to focus on optimizing to seq-consuming library functions to coerce their inputs into seqs before operating on them. It fixes #1234.

Support generators in seq-consuming functions via the when-let [coll (seq coll)] idiom where applicable. Fixes #1192

@ikappaki ikappaki marked this pull request as draft December 30, 2024 15:16
@ikappaki ikappaki changed the title [wip] support generators in seq processing fns support generators in seq transforming fns Dec 30, 2024
@ikappaki ikappaki force-pushed the support/generators-in-seq-fns branch from 19ad10e to d4b1994 Compare February 20, 2025 05:33
@ikappaki
Copy link
Contributor Author

Hi @chrisrink10,

I attempted to properly support the generator case in the seq transforming functions (map, etc.) as discussed in #1192. However, I found it too finicky due to the large number of functions involved need updating, and I don’t think it's worth the effort, even though I’ve updated and created test cases for most of them (see changes in this PR). There are still more to address (sort-by, for etc), and for in particular doesn’t seem straightforward. It is also easy to miss some, since there is no definite list where seq is used in the core functions.

I’m leaning toward another approach: explicitly dropping support for generators in seq. Instead, when seq is called on a generator, we would simply throw an error. This way, users receive a clear notification that generators aren’t supported and must result to functions like into or to-array to convert them into a seqable structure. I believe this leads to a cleaner abstraction for these constructs that doesn't require intrusive changes.

What do you think?

Thanks

@ikappaki ikappaki force-pushed the support/generators-in-seq-fns branch from 6675abd to b22ad57 Compare February 20, 2025 06:59
@chrisrink10
Copy link
Member

Hi @chrisrink10,

I attempted to properly support the generator case in the seq transforming functions (map, etc.) as discussed in #1192. However, I found it too finicky due to the large number of functions involved need updating, and I don’t think it's worth the effort, even though I’ve updated and created test cases for most of them (see changes in this PR). There are still more to address (sort-by, for etc), and for in particular doesn’t seem straightforward. It is also easy to miss some, since there is no definite list where seq is used in the core functions.

I’m leaning toward another approach: explicitly dropping support for generators in seq. Instead, when seq is called on a generator, we would simply throw an error. This way, users receive a clear notification that generators aren’t supported and must result to functions like into or to-array to convert them into a seqable structure. I believe this leads to a cleaner abstraction for these constructs that doesn't require intrusive changes.

What do you think?

Thanks

I thought the intent was to call seq at the beginning of every seq library function as in Clojure. Is it more complicated than that?

@ikappaki
Copy link
Contributor Author

ikappaki commented Feb 24, 2025

I thought the intent was to call seq at the beginning of every seq library function as in Clojure. Is it more complicated than that?

After reconsidering the issue while working on it, my opinion has changed and I no longer believe that generators should be supported by seq.

  • seq is the suggested idiomatic way for checking if a collection is empty?, as in (not (seq coll)). If seq is called on a generator for this purpose, it will consume the first element, and users must handle this case explicitly. Imagine a user writing a general function that checks for an empty collection using seq. If a generator is passed in, it will break in a non-obvious way. I don't expect every user to be aware of this special handling (I wasn't even aware until this came up), making it a subtle and hard-to-diagnose issue when used with generators.
    https://github.com/clojure/clojure/blob/ce55092f2b2f5481d25cff6205470c1335760ef6/src/clj/clojure/core.clj#L6241-L6246
(defn empty?
  "Returns true if coll has no items - same as (not (seq coll)).
  Please use the idiom (seq x) rather than (not (empty? x))"
  {:added "1.0"
   :static true}
  [coll] (not (seq coll)))
  • There is no definitive list of seq-related functions (e.g., map, reduce, last, first etc) in the standard clojure/basilisp library. I'm currently scanning the standard library, updating cases with (if/when-let [x (seq x)] ...), but I worry about missing some. Any overlooked function would behave incorrectly with generators (i.e. they will be buggy), and users wouldn’t receive a clear indication that something went wrong.
  • I'm thginking that the (if/when-let [x (seq x)] ...) is an optimisation in Clojure, rather than something to support "consumable" collections, whose elements are "removed" from the original collections (in our case a generator) when accessed.

Please consider this a checkpoint for deciding how to proceed. Options:

  1. Continue current work: Identify and modify the remaining seq-based functions from the Basilisp standard library (not already addressed in this PR) that could irrevocably consume elements from a generator (e.g., for, sort-by, ???...). If any function is missed, using it on a generator could cause hard-to-detect bugs. The same applies to user-defined functions that rely on seq without accounting for generators. Update documentation to warn users.
  2. Disallow seq on generators: Modify seq to throw an error when passed a generator. Users must convert generators to persistent collections (e.g., via to-array or into) before using in seq functions.

I'm more in favor of (2) now. Let me know of your thoughts and insight how to proceed forward.

Thanks

@amano-kenji
Copy link
Contributor

amano-kenji commented Apr 2, 2025

https://clojure.org/reference/sequences says

Clojure uses the ISeq interface to allow many data structures to provide access to their elements as sequences. The seq function yields an implementation of ISeq appropriate to the collection. Seqs differ from iterators in that they are persistent and immutable, not stateful cursors into a collection. As such, they are useful for much more than foreach - functions can consume and produce seqs, they are thread safe, they can share structure etc.

When seq is used on objects that implement Iterable, the resulting sequence is still immutable and persistent, and will represent a single pass across the data. Because that pass might happen lazily, the pass might see changes that happen after seq has been called. Also, if the backing iterator is subject to ConcurrentModificationException, then so too is the resulting seq. When seq is used on native Java arrays, changes to the underlying array will be reflected in the seq - you must copy the source array to get full immutability. That said, there is still a lot of utility to using seq on Iterables and arrays since seqs support multi-pass and lazy algorithms. Robust programs should not mutate arrays or Iterables that have seqs on them.

Many of the functions in the seq library take one or more collections, call seq on them, and then operate on the resulting seq. In other words, many of these functions take collections but operate on their seqs.

From the above quotes, we can deduce that functions take host Iterable collections and call seq on them to improve immutability and structural sharing.

seq is not optimization. It is about immutability and structural sharing. Host Iterable collections behind clojure collections are mutable, but clojure collections are immutable and persistent.

I think clojure functions were designed to inter-operate with host collections seamlessly while making them as immutable and persistent as possible.

Rich Hickey wants you to seamlessly interoperate with host collections.

@chrisrink10
Copy link
Member

Please consider this a checkpoint for deciding how to proceed. Options:

  1. Continue current work: Identify and modify the remaining seq-based functions from the Basilisp standard library (not already addressed in this PR) that could irrevocably consume elements from a generator (e.g., for, sort-by, ???...). If any function is missed, using it on a generator could cause hard-to-detect bugs. The same applies to user-defined functions that rely on seq without accounting for generators. Update documentation to warn users.
  2. Disallow seq on generators: Modify seq to throw an error when passed a generator. Users must convert generators to persistent collections (e.g., via to-array or into) before using in seq functions.

I'm more in favor of (2) now. Let me know of your thoughts and insight how to proceed forward.

I think I still prefer (1), but I do believe that the idiomatic empty check issue is somewhat of a wrinkle. I think that could be solved by documentation. I would tend to agree with @amano-kenji (and Rich Hickey) that we should be allowing and interoperating freely with host collections, so I don't want to create a situation where calling (seq generator) throws an exception.

@ikappaki
Copy link
Contributor Author

Please consider this a checkpoint for deciding how to proceed. Options:

  1. Continue current work: Identify and modify the remaining seq-based functions from the Basilisp standard library (not already addressed in this PR) that could irrevocably consume elements from a generator (e.g., for, sort-by, ???...). If any function is missed, using it on a generator could cause hard-to-detect bugs. The same applies to user-defined functions that rely on seq without accounting for generators. Update documentation to warn users.
  2. Disallow seq on generators: Modify seq to throw an error when passed a generator. Users must convert generators to persistent collections (e.g., via to-array or into) before using in seq functions.

I'm more in favor of (2) now. Let me know of your thoughts and insight how to proceed forward.

I think I still prefer (1), but I do believe that the idiomatic empty check issue is somewhat of a wrinkle. I think that could be solved by documentation. I would tend to agree with @amano-kenji (and Rich Hickey) that we should be allowing and interoperating freely with host collections, so I don't want to create a situation where calling (seq generator) throws an exception.

I agree with the point of interoperating with host collections. However, I think when it comes to iterators (see generators are iterators), Clojure doesn't interop with them directly, calling seq on them throws an exception. Instead, it provides dedicated functions to convert them to sequences. This supports my intuition that generators shouldn't be treated as seqs directly, but should be explicitly converted first through the dedicated functions.

For example, both java.util.Iterators and java.util.stream fail when passed to seq and Clojure provides iterator-seq and stream-seq! to handle them:

user=> (import [java.util Iterator]
               [java.util.stream Stream])
java.util.stream.Stream
user=> (defn count-iterator [max]
         (let [state (atom 0)]
           (reify Iterator
             (hasNext [_]
               (< @state max))
             (next [_]
               (let [val @state]
                 (swap! state inc)
                 val)))))
#'user/count-iterator
user=> (def it (count-iterator 3))
#'user/it
user=> (seq it)
Execution error (IllegalArgumentException) at user/eval141 (REPL:1).
Don't know how to create ISeq from: user$count_iterator$reify__138
user=> (iterator-seq it)
(0 1 2)


user=> (defn stream-count-iterator [max]
         (-> (Stream/iterate 0 (reify java.util.function.UnaryOperator
                                 (apply [_ n] (inc n))))
             (.limit max)))
#'user/stream-count-iterator
user=> (def sit (stream-count-iterator 3))
#'user/sit
user=> (seq sit)
Execution error (IllegalArgumentException) at user/eval148 (REPL:1).
Don't know how to create ISeq from: java.util.stream.SliceOps$1
user=> (stream-seq! sit)
(0 1 2)

thanks

@amano-kenji
Copy link
Contributor

amano-kenji commented Apr 10, 2025

Python streams

On JVM, java streams are a sequence of discrete objects which can be mapped to clojure sequence. Python streams are asynchnorous iterables suitable for async for.

stream-seq! doesn't make sense on python.

Python iterators

Generator is both Iterable and Iterator.

>>> def abc():
...   yield 2
... 
>>> isinstance(abc(), Iterable)
True
>>> isinstance(abc(), Iterator)
True

On python and JVM, an Iterable object produces an Iterator. To get an iterator, you have to call .__iter__() method on an Iterable. Even though iterator-seq exists for JVM iterators that don't implement Iterable interface, clojure functions still convert host Iterable collections to clojure sequence by convention.

On Difference between Python's Generators and Iterators, a comment says an Iterator is an Iterable which requires __iter__ method. An iterator has __iter__ method that returns itself.

If a generator wasn't Iterable, iterator-seq would be needed for a generator. On JVM, an iterator is not an iterable. On python, it is.

Conclusion

Every conceivable discrete sequence in python is Iterable. Just convert Iterables to clojure sequence via seq in clojure functions.

The list of Iterables:

You don't need iterator-seq and stream-seq! in python.

@chrisrink10
Copy link
Member

For example, both java.util.Iterators and java.util.stream fail when passed to seq and Clojure provides iterator-seq and stream-seq! to handle them:

But in the iterator-seq documentation you linked, there's a counterexample to this point:

;; Note this is not strictly necessary since keySet is a collection
;; implementing Iterable but it does show the usage.

user=> (iterator-seq (.iterator (.keySet (java.lang.System/getProperties))))

Is the intent that something like this would fail too?

user=> (seq (.keySet (java.lang.System/getProperties)))
("java.specification.version" "sun.jnu.encoding" ...)

@ikappaki
Copy link
Contributor Author

For example, both java.util.Iterators and java.util.stream fail when passed to seq and Clojure provides iterator-seq and stream-seq! to handle them:

But in the iterator-seq documentation you linked, there's a counterexample to this point:

;; Note this is not strictly necessary since keySet is a collection
;; implementing Iterable but it does show the usage.

user=> (iterator-seq (.iterator (.keySet (java.lang.System/getProperties))))

The HashMap keySet method just returns a Java Set. I think the commenter simply wanted a quick example using iterator-seq, and chose this even though, in typical code, they wouldn’t bother extracting an iterator. I agree, though, the comment is a bit confusing and open to interpretation.

Is the intent that something like this would fail too?

user=> (seq (.keySet (java.lang.System/getProperties)))
("java.specification.version" "sun.jnu.encoding" ...)

No, this simply extracts the keys as a Java Set, it's not an iterator and doesn't consume the elements.

However, note that iterator-seq will consume the elements, whereas calling seq on the Java Set will not.

user=> (def props (.keySet (java.lang.System/getProperties)))
#'user/props
user=> (seq props)
("java.specification.version" "sun.cpu.isalist" "sun.jnu.encoding" "java.class.path" "java.vm.vendor" "sun.arch.data.model" "user.variant" "java.vendor.url" "user.timezone" "clojure.basis" "java.vm.specification.version" "os.name" "sun.java.launcher" "user.country" "sun.boot.library.path" "sun.java.command" "jdk.debug" "sun.cpu.endian" "user.home" "user.language" "sun.stderr.encoding" "java.specification.vendor" "java.version.date" "java.home" "file.separator" "java.vm.compressedOopsMode" "line.separator" "sun.stdout.encoding" "java.vm.specification.vendor" "java.specification.name" "user.script" "sun.management.compiler" "java.runtime.version" "user.name" "path.separator" "os.version" "java.runtime.name" "file.encoding" "java.vm.name" "java.vendor.url.bug" "java.io.tmpdir" "java.version" "user.dir" "os.arch" "java.vm.specification.name" "sun.os.patch.level" "native.encoding" "java.library.path" "java.vm.info" "java.vendor" "java.vm.version" "sun.io.unicode.encoding" "java.class.version")
user=> (seq props)
("java.specification.version" "sun.cpu.isalist" "sun.jnu.encoding" "java.class.path" "java.vm.vendor" "sun.arch.data.model" "user.variant" "java.vendor.url" "user.timezone" "clojure.basis" "java.vm.specification.version" "os.name" "sun.java.launcher" "user.country" "sun.boot.library.path" "sun.java.command" "jdk.debug" "sun.cpu.endian" "user.home" "user.language" "sun.stderr.encoding" "java.specification.vendor" "java.version.date" "java.home" "file.separator" "java.vm.compressedOopsMode" "line.separator" "sun.stdout.encoding" "java.vm.specification.vendor" "java.specification.name" "user.script" "sun.management.compiler" "java.runtime.version" "user.name" "path.separator" "os.version" "java.runtime.name" "file.encoding" "java.vm.name" "java.vendor.url.bug" "java.io.tmpdir" "java.version" "user.dir" "os.arch" "java.vm.specification.name" "sun.os.patch.level" "native.encoding" "java.library.path" "java.vm.info" "java.vendor" "java.vm.version" "sun.io.unicode.encoding" "java.class.version")




user=> (def iprops (.iterator props))
#'user/iprops
user=> (iterator-seq iprops)
("java.specification.version" "sun.cpu.isalist" "sun.jnu.encoding" "java.class.path" "java.vm.vendor" "sun.arch.data.model" "user.variant" "java.vendor.url" "user.timezone" "clojure.basis" "java.vm.specification.version" "os.name" "sun.java.launcher" "user.country" "sun.boot.library.path" "sun.java.command" "jdk.debug" "sun.cpu.endian" "user.home" "user.language" "sun.stderr.encoding" "java.specification.vendor" "java.version.date" "java.home" "file.separator" "java.vm.compressedOopsMode" "line.separator" "sun.stdout.encoding" "java.vm.specification.vendor" "java.specification.name" "user.script" "sun.management.compiler" "java.runtime.version" "user.name" "path.separator" "os.version" "java.runtime.name" "file.encoding" "java.vm.name" "java.vendor.url.bug" "java.io.tmpdir" "java.version" "user.dir" "os.arch" "java.vm.specification.name" "sun.os.patch.level" "native.encoding" "java.library.path" "java.vm.info" "java.vendor" "java.vm.version" "sun.io.unicode.encoding" "java.class.version")
user=> (iterator-seq iprops) # the iterator was consumed
nil
user=> (seq props)  # the original Set is still intact
("java.specification.version" "sun.cpu.isalist" "sun.jnu.encoding" "java.class.path" "java.vm.vendor" "sun.arch.data.model" "user.variant" "java.vendor.url" "user.timezone" "clojure.basis" "java.vm.specification.version" "os.name" "sun.java.launcher" "user.country" "sun.boot.library.path" "sun.java.command" "jdk.debug" "sun.cpu.endian" "user.home" "user.language" "sun.stderr.encoding" "java.specification.vendor" "java.version.date" "java.home" "file.separator" "java.vm.compressedOopsMode" "line.separator" "sun.stdout.encoding" "java.vm.specification.vendor" "java.specification.name" "user.script" "sun.management.compiler" "java.runtime.version" "user.name" "path.separator" "os.version" "java.runtime.name" "file.encoding" "java.vm.name" "java.vendor.url.bug" "java.io.tmpdir" "java.version" "user.dir" "os.arch" "java.vm.specification.name" "sun.os.patch.level" "native.encoding" "java.library.path" "java.vm.info" "java.vendor" "java.vm.version" "sun.io.unicode.encoding" "java.class.version")

thanks

@ikappaki
Copy link
Contributor Author

Python streams

On JVM, java streams are a sequence of discrete objects which can be mapped to clojure sequence. Python streams are asynchnorous iterables suitable for async for.

stream-seq! doesn't make sense on python.

Python iterators

Generator is both Iterable and Iterator.

>>> def abc():
...   yield 2
... 
>>> isinstance(abc(), Iterable)
True
>>> isinstance(abc(), Iterator)
True

...
You don't need iterator-seq and stream-seq! in python.

Hi @amano-kenji, this reads like unedited AI-generated content and misses the core point if so. This isn't about implementing iterator-seq or stream-seq! in Basilisp. The point is that iterators are consumable and not directly supported by seq in Clojure, they require explicit conversion, and using seq on them throw an exception. The same should apply in Basilisp: generators (which are Python iterators) should not be treated as seq-able. Instead, Basilisp should provide dedicated operators for converting generators to sequences, just like Clojure does for Java iterators and streams, and throw an exception when seq is used directly on them.

thanks

@amano-kenji
Copy link
Contributor

amano-kenji commented Apr 11, 2025

I can write like chatbots because I talk with chatbots instead of humans most of the time. Chatbots tend to write in structured markdown format with headers and bullet points.

The point is that iterators are consumable and not directly supported by seq in Clojure, they require explicit conversion, and using seq on them throw an exception.

It took a while to wrap my head around what seq is supposed to be compatible with. Chatbots helped me understand this issue more quickly. It seems seq is supposed to be compatible with iterables that produce new iterators whenever iterators are requested. While a python iterator is an iterable, it doesn't produce a new iterator when .__iter__() method is called on it. iterator-seq exists to warn programmers that iterables can be traversed multiple times and iterators can only be traversed once. I think seq is supposed to yield the same results even if it's called on the same object multiple times.

If we accept the theory that seq is supposed to be compatible with iterables that produce new iterators whenever .__iter__ method is called, then when an iterator is passed to seq, seq may throw an error that says seq doesn't support iterators and iterators must be converted to seq via iterator-seq. We want good error messages. seqable? can look like this.

(defn seqable?
  [x]
  (if (isinstance x Iterator)
    false
    (isinstance x Iterable)))

In python, iterator is iterable, so you need to first check whether x is Iterator.

@chrisrink10
Copy link
Member

@ikappaki I think there may be two separate issues being discussed here, so it may be useful for us to separate those out into two separate discussions.

The first issue is that seq library functions don't coerce their inputs into seqs before operating on them. Clojure does this in most of it's seq library:

(when-let [s (seq coll)]
  ;; ...
  )

This seems like a fairly uncontroversial change that we should make and merge.

The second (and more contentious issue) is what happens when we attempt to call seq on a generator. I think you have made some convincing arguments, but I also still have some reservations because it feels like a bad user experience to just prevent users from being able to seamlessly work with Python generators.

Perhaps we can make some progress on the former without blocking on the latter. Let me know what you think.

@amano-kenji
Copy link
Contributor

amano-kenji commented Apr 13, 2025

In java, it seems you rarely encounter iterators instead of iterables. On the other hand, python generators seem common. Having to use iterator-seq often can be a bit bad for user experience.

@ikappaki
Copy link
Contributor Author

I can write like chatbots because I talk with chatbots instead of humans most of the time. Chatbots tend to write in structured markdown format with headers and bullet points.
👍

The point is that iterators are consumable and not directly supported by seq in Clojure, they require explicit conversion, and using seq on them throw an exception.

It took a while to wrap my head around what seq is supposed to be compatible with. Chatbots helped me understand this issue more quickly. It seems seq is supposed to be compatible with iterables that produce new iterators whenever iterators are requested. While a python iterator is an iterable, it doesn't produce a new iterator when .__iter__() method is called on it. iterator-seq exists to warn programmers that iterables can be traversed multiple times and iterators can only be traversed once. I think seq is supposed to yield the same results even if it's called on the same object multiple times.

Right, if I understood this correctly, this is what my argument is. seq should produce the same result if called on the same collection twice. That fails for iterators/generators, since their contents are consumed ("lost") on every next iteration. Clojure doesn’t allow calling seq directly on such Iterable's. In Java, to get an iterator from an Iterable, one must explicitly call iterator() or one of the two methods explicitly (i.e. seq doesn't work with Iterable out of the box):

image

If we accept the theory that seq is supposed to be compatible with iterables that produce new iterators whenever .__iter__ method is called, then when an iterator is passed to seq, seq may throw an error that says seq doesn't support iterators and iterators must be converted to seq via iterator-seq. We want good error messages. seqable? can look like this.

(defn seqable?
  [x]
  (if (isinstance x Iterator)
    false
    (isinstance x Iterable)))

In python, iterator is iterable, so you need to first check whether x is Iterator.

Yes, I advocate something along these lines.

In java, it seems you rarely encounter iterators instead of iterables. On the other hand, python generators seem common. Having to use iterator-seq often can be a bit bad for user experience.

Indeed, but allowing seq to accept iterators directly risks subtle bugs and at least breaks the idiom (seq x) as a check for empty?-ness, the first element will be for ever lost :

(empty? coll)
Returns true if coll has no items - same as (not (seq coll)).
Please use the idiom (seq x) rather than (not (empty? x))

I consider supporting generators seamlessly with seq might be convenient but is risky, despite their ubiquity in Python.

thanks

@ikappaki
Copy link
Contributor Author

ikappaki commented Apr 13, 2025

@ikappaki I think there may be two separate issues being discussed here, so it may be useful for us to separate those out into two separate discussions.

The first issue is that seq library functions don't coerce their inputs into seqs before operating on them. Clojure does this in most of it's seq library:

(when-let [s (seq coll)]
  ;; ...
  )

This seems like a fairly uncontroversial change that we should make and merge.

Sure, would you mind opening a ticket for this please so I can commit the changes against it? I'm not entirely sure how to phrase it. This looks like a straightforward optimization with no functional changes to the current supported implementation. I believe what's in the MR now is sufficient, and we don't need any tests, since there's no change in functionality.

The second (and more contentious issue) is what happens when we attempt to call seq on a generator. I think you have made some convincing arguments, but I also still have some reservations because it feels like a bad user experience to just prevent users from being able to seamlessly work with Python generators.

Unfortunately, I can't think of a way to make generators safely compatible with seq. The closest idea I have is introducing a new gen-seq! function, similar to iterator-seq or stream-seq!, that returns a lazy sequence by consuming the original generator, while seq would throw an error if used directly. This is the closer I can think of getting to the Clojure standard (i.e. to an "agreeable" user experience). But please feel free to suggest alternative suggestions.

Please consider a couple of more example where using seq on generators can lead to unexpected behavior:

  1. Using seq as the recommended idiom instead of empty?. This makes use of seq to check for emptyness an invalid operation, since the generator is consumed in the process.
  2. Using count to count the elements of a generator. This consumes the element while counting, making it unsuitable for such use:
> basilisp repl
basilisp.user=> (defn gen []
                  (dotimes [i 10]
                    (yield i)))
#'basilisp.user/gen

basilisp.user=> (def git (gen))
#'basilisp.user/git

basilisp.user=> (count git)
10

basilisp.user=> (count git)
0

Perhaps we can make some progress on the former withot blocking on the latter. Let me know what you think.

Sure, as outlined above. thanks

@amano-kenji
Copy link
Contributor

amano-kenji commented Apr 14, 2025

If we decide to support generators with a separate function, the function should be iterator-seq? If iterator-seq exists, then basilisp programmers should be educated adequately on iterator-seq through good error messages (from seq) or good documentation.

I'm not against iterator-seq as long as programmers have no difficulty with finding iterator-seq when they need to pass a generator to clojure functions.

But, I'm not the maintainer. Chris is.

I read https://clojure.org/reference/sequences again. It says

Clojure uses the ISeq interface to allow many data structures to provide access to their elements as sequences. The seq function yields an implementation of ISeq appropriate to the collection. Seqs differ from iterators in that they are persistent and immutable, not stateful cursors into a collection.

Technically, seq is supposed to expose an ISeq interface to a data structure or a collection rather than a stateful cursor(iterator) in a collection. An iterator is not a collection but a stateful cursor into a collection.

@chrisrink10
Copy link
Member

Ok I think the more time I've spent thinking about it @ikappaki is right and we should just introduce iterator-seq for generator types. We can have seq throw an error when a generator type is given indicating the use of iterator-seq as a wrapper to address the issue.

iterator-seq will wrap the generator as by basilisp.lang.seq.sequence, as seq does right now.

I'm not sure there's a clear solution for actual Python Iterator types (e.g. those exposing an __iter__ magic method), since it is not always clear if they will be consumed in the way generators are.

@amano-kenji
Copy link
Contributor

amano-kenji commented Apr 15, 2025

Iterator types expose __next__ method? I think we can treat generator as just another iterator?

Iterable exposes __iter__ method which create a new Iterator which exposes __next__ and also __iter__ which returns itself instead of a new iterator.

As far as I know, there are only iterators and iterables. A generator can be passed to a for statement because an iterator happens to also be an iterable that returns itself as the iterator.

https://peps.python.org/pep-0255/ says

a Python generator is a kind of Python iterator, but of an especially powerful kind.

A generator function creates a generator which is a lazy iterator. It is a lazy iterator. If you call __iter__ on a list, you get an eager iterator. If you write a generator function and call it, you get a lazy iterator.

A generator function is a syntactic sugar for an iterator class.

class Iterator
  def __iter__(self):
    return self

  def __next__(self, ...):

@ikappaki
Copy link
Contributor Author

ikappaki commented Apr 16, 2025

I believe I’ve come up with a plan that incorporates and addresses most points and concerns raised above.

Basically, any iterable object can be coerced into a sequence using seq. In fact, this is what Basilisp does under the hood to support fundamental types (like lists, tuples, etc.) as sequences, and specifically, as LazySeq objects:

def sequence(s: Iterable[T]) -> ISeq[T]:
"""Create a Sequence from Iterable s."""
i = iter(s)
def _next_elem() -> ISeq[T]:
try:
e = next(i)
except StopIteration:
return EMPTY
else:
return Cons(e, LazySeq(_next_elem))
return LazySeq(_next_elem)

However, this only works well in general if the input object is re-iterable—that is, if calling iter() on it (as sequence does above) returns a new iterator each time.

Consider the following simple example, where we check the number of items in the iterable, and if there’s more than one, we return the first item:

$ basilisp repl
basilisp.user=> (defn count-first [input]
                  (when (> (count input) 1)
                    (first input)))
 
#'basilisp.user/count-first

basilisp.user=> (count-first #py [1 2 3])
1

This works fine with Python’s built-in collection types, such as lists and tuples, since calling iter() on them returns a fresh iterator each time. But it fails with single-use iterators, like generators or file objects, because the initial pass (e.g., for counting) exhausts the iterator, and there's nothing left to return when we call first next:

basilisp.user=> (import os)
nil

basilisp.user=> (count (os/scandir))
18

basilisp.user=> (first (os/scandir))
<DirEntry 'LICENSE'>

basilisp.user=> (count-first (os/scandir))
nil

Therefore, single-use iterables should not be implicitly coerced into sequences, as this can lead to subtle and hard-to-diagnose bugs.

Fortunately, there’s a way to detect and exclude these single-use iterables from implicit coercion. As @amano-kenji alluded, we can check whether iter(obj) returns the object itself. If it does, then the object is an iterator (i.e., single-use).

Based on that, I plan to revise the sequence function (which maps to seq) to raise an error for single-use iterables, and instead introduce a separate iterator_sequence function (which will map to iterator-seq) to explicitly handle them. Here’s the idea:

def sequence(s: Iterable[T], fail_single_use: bool = True) -> ISeq[T]:
    """Create a Sequence from Iterable s."""
    i = iter(s)

    if fail_single_use and i is s:
        raise TypeError(f"Single-use iterator type detected, please use iterator-seq instead :type {type(s)}")

    def _next_elem() -> ISeq[T]:
        try:
            e = next(i)
        except StopIteration:
            return EMPTY
        else:
            return Cons(e, LazySeq(_next_elem))

    return LazySeq(_next_elem)


def iterator_sequence(s: Iterable[T]) -> ISeq[T]:
    """Create a Sequence from Iterable s."""
    return sequence(s, fail_single_use=False)

I’ve already reviewed the Basilisp test suite and found at least one case where single-use iterators were likely dropping elements during iteration. I’ll open a separate PR to address that issue by promoting the fix described above. This PR will remain focused on optimizing the standard library by reducing unnecessary seq calls.

Please let me know your thoughts.

Thanks

@chrisrink10
Copy link
Member

@ikappaki that seems reasonable to me. Thanks!

@amano-kenji
Copy link
Contributor

amano-kenji commented Apr 17, 2025

Hmm? Is isinstance(x, Iterator) unreliable for checking whether something is an iterator? Or, did you write sequence as above because it is more efficient to check whether an iterator is the same as iterable?

@ikappaki
Copy link
Contributor Author

ikappaki commented Apr 17, 2025

Hmm? Is isinstance(x, Iterator) unreliable for checking whether something is an iterator? Or, did you write sequence as above because it is more efficient to check whether an iterator is the same as iterable?

I find the object comparison method to be more in line with my analysis: I'm looking to identify Iterable objects that are single-use iterables, i.e., objects that return themselves when iter() is called. The isinstance(x, Iterator) method identifies an object as a typing.Iterator, meaning it has both an __iter__ and a __next__ method. However, while calling iter() on the Iterator object is likely to return itself, it does not guarantee this behavior.

I can imagine that one could write a custom class with these two methods, but the __iter__() could return a different iterator each time it is called, meaning the although it is identifies as a typing.Iterator, it is not a single-use Iterable.

@amano-kenji
Copy link
Contributor

amano-kenji commented Apr 17, 2025

Okay. I think we have finally discovered a good approach. But, how will you implement seqable?? In practice, if you are going to create a different iterator each time __iter__ is called, you wouldn't implement __next__ method in an iterable.

Is something like this not going to be a bit inefficient?

(defn seqable?
  [x]
  (and (isinstance x Iterable)
       (not= x (iter x))))

I don't know how often seqable? is called.

@amano-kenji
Copy link
Contributor

This is another implementation.

(defn seqable?
  [x]
  (try
    (not= x (iter x))
    (catch TypeError _ false)))

@ikappaki
Copy link
Contributor Author

ikappaki commented Apr 19, 2025

Both of the above suggestions are on the right track in my opinion. isinstance should be fast enough, and if I were doing early optimization, I might consider using hasattr(obj, '__iter__'). For now, though, I'll stick with isinstance and let @chrisrink10 weigh in during the review if he sees any performance concerns. New PR incoming ... #1235

chrisrink10 pushed a commit that referenced this pull request Apr 19, 2025
Hi, 

could you please consider patch to remove explicit support of single-use
iterables in sequences. It fixes #1192.

single-use iterables now throw a `TypeError`, when they are implicitly
or explicitly coerced into a sequence. A new `iterator-seq` is
introduced to explicitly convert them into a seq instead . See
#1198 (comment)
for the rationale.

The `count` fn was also updated to behave the same.

There were a few places where single use iterators were converted
implicitly to a seq, which are now explicitly converted using a
`iterator-seq` or the python equivalent instead.

Tests are added for the same, as well as documentation section under
Python Iterators.

Thanks

---------

Co-authored-by: ikappaki <[email protected]>
@ikappaki ikappaki changed the title support generators in seq transforming fns optimize seq-consuming library functions Apr 20, 2025
@ikappaki ikappaki force-pushed the support/generators-in-seq-fns branch from b22ad57 to 176baca Compare April 20, 2025 15:49
@ikappaki ikappaki force-pushed the support/generators-in-seq-fns branch from 176baca to b064f93 Compare April 20, 2025 17:46
@ikappaki ikappaki marked this pull request as ready for review April 20, 2025 17:49
@ikappaki
Copy link
Contributor Author

ikappaki commented Apr 20, 2025

Hi @chrisrink10 ,

as discussed above, I've repurposed this PR as an optimisation patch to coerce key seq-consuming functions inputs to a sequence up front. It addresses #1234.

There are no functional changes, thus the existing test coverage should be enough. Can you please review.

Thanks

Copy link
Member

@chrisrink10 chrisrink10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me. Had one minor question, but it doesn't materially change my opinion.

(apply map f (rest coll) (map rest colls)))))))
(when-let [coll (seq coll)]
(let [colls (map seq colls)]
(when (every? some? colls)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why some? rather than identity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks

I was reasoning at the semantics level, just checking that every value wasn't nil, so I used (every? some?). I hadn't considered using identity, which relies on knowing that (map seq colls) returns either a seq or nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I missed the fact that the PR was already merged 😅

@chrisrink10 chrisrink10 merged commit 0fead05 into basilisp-lang:main Apr 20, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

optimisation: standard library seq functions should coerce inputs before using them Unexpected behavior when passing a Python generator to a map

3 participants