@SamHames
Contributor

This fixes bugs in timelines and conversations, which use the
search_all endpoint without the fix for the new API behaviour, therefore
limiting them to the last 30 days only even when using the academic
archive endpoints.

This fix just moves the default start_time set in the CLI app to
the client app, so it will apply by default to all uses of the
search_all method.
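
A minimal sketch of what moving the default into the client amounts to. `SketchClient`, `TWITTER_EPOCH`, and the returned dict are hypothetical stand-ins for illustration, not twarc's actual code; the 2006-03-21 date is the one used elsewhere in this thread.

```python
from datetime import datetime, timezone

# Assumption: an illustrative "start of the archive" date, taken from the
# 2006-03-21 value mentioned in this thread.
TWITTER_EPOCH = datetime(2006, 3, 21, tzinfo=timezone.utc)


class SketchClient:
    """Hypothetical stand-in for the client; not twarc's real class."""

    def search_all(self, query, start_time=None, end_time=None):
        # The default lives in the client, so every caller of search_all
        # (search, timelines, conversations) inherits it.
        if start_time is None:
            start_time = TWITTER_EPOCH
        # Return the parameters instead of calling the API, to keep the
        # sketch self-contained.
        return {"query": query, "start_time": start_time, "end_time": end_time}

    def conversation(self, tweet_id):
        # A caller that never passes start_time still covers the archive.
        return self.search_all(f"conversation_id:{tweet_id}")
```

With the default in one place, `SketchClient().conversation(21)` searches from the archive start, while an explicit `start_time` is still respected.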

@edsu edsu merged commit bfd59e1 into main Aug 17, 2021
@edsu
Member

edsu commented Aug 17, 2021

Thanks for catching this @SamHames! It was just released in v2.4.2.

@igorbrigadir
Contributor

Hmm I remember doing this for a reason - does this still work as expected with progress bars?

@edsu
Member

edsu commented Aug 17, 2021

@igorbrigadir I think you are the best judge of that.

@igorbrigadir
Contributor

Ah, it does break it:

For example:

twarc2 search "from:DarpaDan" out.jsonl

is /6 days, as expected.

But

twarc2 search --archive "from:DarpaDan" out.jsonl

is now also /6 days, when it used to work and show /15 years as expected, because the CLI is now missing the start time for the progress bar.
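
A rough sketch of why the total shrinks, assuming the progress bar derives its span from the start time the CLI itself knows about. `progress_span` and the 7-day fallback window are hypothetical and not taken from twarc's code.

```python
from datetime import datetime, timedelta, timezone


def progress_span(start_time, end_time, default_window):
    """Return the time span a progress bar would cover.

    Hypothetical helper: when the CLI has no start_time of its own, it can
    only fall back to a default window, whatever the client does later.
    """
    if start_time is None:
        start_time = end_time - default_window
    return end_time - start_time


now = datetime(2021, 8, 17, tzinfo=timezone.utc)
recent = timedelta(days=7)  # assumed fallback window, for illustration only

# The CLI knows the archive start time: the bar spans the whole archive.
with_start = progress_span(datetime(2006, 3, 21, tzinfo=timezone.utc), now, recent)

# The CLI leaves start_time unset because the default moved into the client:
# the bar only spans the fallback window, although the search goes further back.
without_start = progress_span(None, now, recent)
```

Here `with_start` covers more than 15 years, while `without_start` is only the fallback window, which mirrors the mismatch described above.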

They're kind of hard to automate, but these are the tests I did for the progress bars:

#490 (comment)

The timelines ones did work as expected, but maybe there was another bug? conversation would call the search API too, with the exact same parameters. timelines was a bit more complicated: it didn't have --archive, but inferred the start time when using --use-search.

I don't think I understand what the original error was in the first place:

without the fix for the new API behaviour

The way it was previously, the API client, client2.py, behaved exactly as the API docs describe, defaulting to the last 30 days when start time was not set, and the command2.py CLI had an --archive option that set the start time for you. It should have worked for conversations the same way, and timelines had --use-search, not --archive.

@edsu
Member

edsu commented Aug 17, 2021

Thanks @igorbrigadir. I apologize if I merged/released this too quickly. I'll create a new issue to address the desired progress bar behavior that you've documented here.

The way it was previously, the API, client2.py behaved exactly as the API docs - defaulting to last 30 days when start time was not set, and the command2.py CLI had --archive option that set the start time for you - it should have worked for conversations the same way, and timelines had --use-search not --archive.

I think perhaps this is an area where the API behavior and twarc should diverge. As a user when I say:

twarc2 conversation 21 --archive

I expect to get the entire conversation thread for tweet id 21, not only the last 30 days. I think it's asking too much to expect twarc users to know that the default is the last 30 days and to remember to:

twarc2 conversation 21 --archive --start-time 2006-03-21

While we can expect users of twarc as a library to be more knowledgeable about the Twitter API defaults, I think:

client.search_all('fiddlesticks')

should search all tweets by default...not just the last 30 days. But maybe I'm being wrongheaded?

@edsu
Member

edsu commented Aug 17, 2021

I opened issue #519, which hopefully characterizes the problem. I'll grab it since I introduced the problem by merging too quickly.

@SamHames
Contributor Author

SamHames commented Aug 17, 2021 via email

@SamHames
Contributor Author

SamHames commented Aug 17, 2021 via email

@igorbrigadir
Contributor

I think the library should align with the API as closely as possible - but the command line should be more user friendly.

So these should work:

twarc2 conversation 21 --archive

(sets start_time to 2006-03-21)

but

client.search_all('fiddlesticks')

should follow exactly how the API works: https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all (By default, a request will return Tweets from up to 30 days ago if you do not include this parameter.)

Alternatively, we can make start_time non-optional in the library API, and throw an error if it's not set, so

client.search_all('fiddlesticks')

will throw an IllegalArgumentException but

client.search_all('fiddlesticks', start_time="..", end_time="..") 

will work - this way the command line and progress bars and stuff will work, and the library will also be clearer to use with a good error message.
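
A minimal sketch of that alternative, assuming a hypothetical `StrictClient` rather than twarc's real class. Python has no IllegalArgumentException, so `ValueError` is used here as the nearest built-in; the parameters are returned instead of calling the API.

```python
class StrictClient:
    """Hypothetical client that refuses to guess a start time."""

    def search_all(self, query, start_time=None, end_time=None):
        # Fail loudly rather than silently inheriting the API's
        # 30-day default window.
        if start_time is None:
            raise ValueError(
                "search_all requires start_time; without it the API "
                "only returns Tweets from the last 30 days"
            )
        return {"query": query, "start_time": start_time, "end_time": end_time}
```

Under this design the command line would always pass an explicit `start_time`, so the same value could also feed the progress bar.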

@edsu
Member

edsu commented Aug 17, 2021

But making start_time required is diverging from the Twitter API. I think the Twitter API defaults in this case are really, truly awful, and it pains me to reproduce them in twarc. But there is value in consistency, assuming we are taking this approach elsewhere in twarc.client2. So we'll have to keep an eye on the Twitter API defaults, and if they change we'll have to change twarc?

@igorbrigadir
Contributor

Yeah, I don't think twitter API is going to change so fast and often that we won't be able to keep up.

I'd also prefer to stick to the API, but if we want to override start_time in the client, we also have to do it in the command line to fix the progress bars - I added this in #520.

@edsu edsu deleted the fix_search_all_start_time branch October 4, 2021 12:07