Downloading

The TweetDownloader object allows you to download and store tweets for further post-processing and exporting.

Overview

There are three main functions used to download tweets:

Function	Description
`get_tweets`	Gets tweets from the Twitter API that match user-defined queries.
`get_replies`	Gets replies in conversations that originated from already downloaded tweets.
`tweets_from_csv`	Gets tweets from the Twitter API that match user-defined queries, using a table in a csv file as input.

Get tweets

To download tweets, initialize the TweetDownloader class by passing the credentials YAML file to the class constructor. You can also name your project so that all exported files start with that name. Additionally, you can specify the path to the folder in which you want to save the results.

from gtdownloader import TweetDownloader

# create downloader using Twitter API credentials
gtd = TweetDownloader(name='Tennis_players_project', credentials='twitter_keys.yaml', output_folder='Tennis_project_downloads')

Once the object is initialized, call the get_tweets() method with the string you want to look up on Twitter and any additional parameters. This example specifies a language and a date range defined by start_time and end_time, to see what people were saying about Rafael Nadal on the 6th and 7th of July 2022, right after he beat Taylor Fritz at Wimbledon. Notice that max_tweets caps the download at 1000 tweets, although there is no guarantee that this amount will be reached.

gtd.get_tweets(
    query='(Nadal) OR (Rafael Nadal)',
    lang='en',
    start_time='07/06/2022',
    end_time='07/07/2022',
    max_tweets=1000
)
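The query string follows the Twitter search syntax: words separated by a space are combined with AND, and groups are combined with the OR operator. As a minimal sketch, a query like the one above could be assembled from keyword groups with a small helper. The name build_query is hypothetical and is not part of gtdownloader.

```python
def build_query(groups):
    """Join keyword groups with OR; words inside a group are ANDed by a space."""
    if not groups:
        raise ValueError("at least one keyword group is required")
    return " OR ".join("(" + " ".join(words) + ")" for words in groups)


# Each inner list is one group of words that must all appear in a tweet.
query = build_query([["Nadal"], ["Rafael", "Nadal"]])
print(query)  # (Nadal) OR (Rafael Nadal)
```

The resulting string can be passed as the query argument of get_tweets().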

The download pages through the Twitter API results using a next-page-token system. Because the Twitter API has monthly caps on the number of tweets you can download, a timestamped temp_ csv file containing the progress so far is saved after each page, so if the download is interrupted there is no need to re-download tweets that were already retrieved. The download finishes either when the maximum number of tweets is reached or when no more tweets satisfy the query parameters.

Downloading tweets...
Current progress saved at: Tennis_project_downloads\temp_Tennis_players_project_08032022_171311.csv
Ending page 1 with next_token=b26v89c19zqg8o3fpz2m17r4qqlvzhsejuwhysusao1a5. 496 tweets retrieved (496 total)
Current progress saved at: Tennis_project_downloads\temp_Tennis_players_project_08032022_171311.csv
Tweets download done. A total of 766 tweets were retrieved.
csv files: Tennis_project_downloads\Tennis_players_project_tweets_08032022_171311.csv, Tennis_project_downloads\Tennis_players_project_places_08032022_171311.csv, and Tennis_project_downloads\Tennis_players_project_authors_08032022_171311.csv were generated
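The paging behaviour described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: fetch_page, FAKE_PAGES, and download are hypothetical names, and the "API" is a dictionary of fake pages. It only shows the general pattern of following a next token, saving a checkpoint after every page, and stopping on either the cap or the last page.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the API: three pages of fake tweets keyed by token.
FAKE_PAGES = {
    None: {"tweets": [{"id": i} for i in range(0, 5)], "next_token": "t1"},
    "t1": {"tweets": [{"id": i} for i in range(5, 10)], "next_token": "t2"},
    "t2": {"tweets": [{"id": i} for i in range(10, 12)], "next_token": None},
}


def fetch_page(token):
    """Return one page of results; a real client would call the API here."""
    return FAKE_PAGES[token]


def download(max_tweets, checkpoint_path):
    """Page through results, rewriting a checkpoint csv after every page."""
    collected, token = [], None
    while True:
        page = fetch_page(token)
        collected.extend(page["tweets"])
        # Save progress so an interrupted run keeps what it already has.
        with open(checkpoint_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["id"])
            writer.writeheader()
            writer.writerows(collected)
        token = page["next_token"]
        # Stop when the cap is reached or there are no further pages.
        if len(collected) >= max_tweets or token is None:
            return collected


checkpoint = os.path.join(tempfile.mkdtemp(), "temp_progress.csv")
tweets = download(max_tweets=8, checkpoint_path=checkpoint)
print(len(tweets))
```

In this sketch the cap is only checked between pages, so a run can end with fewer tweets than the cap (when the pages run out) or with a full last page that crosses it.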

The previous method saves three dataframes in csv format. As the log above shows, a timestamp is appended to each file name:

File	Description
{Project name}_tweets.csv	Tweets table. Each row represents an individual tweet with its corresponding attributes
{Project name}_places.csv	Places table. Each row represents a location from which one or more tweets came
{Project name}_authors.csv	Authors table. Each row represents a user that wrote one or more tweets

The corresponding dataframes can be accessed as the attributes: gtd.tweetds_df, gtd.places_df, and gtd.authors_df, which are pandas DataFrames:

gtd.tweetds_df.head()

	created_at	text	id	conversation_id	author_id	geo	public_metrics	place_id	date	likes	replies	retweets
0	2022-07-06T23:53:53.000Z	@christophclarey Nadal won his 1st GS on his 1st attempt at RG in 05 when he was 19. It was a watershed moment. Conversely, in 86 TM was never the same after his loss to IL. As for TF time will tell.	1544832034429816832	1544773073928425472	2200548513	{'place_id': '01fb944c0dff3d86'}	{'retweet_count': 0, 'reply_count': 0, 'like_count': 0, 'quote_count': 0}	01fb944c0dff3d86	2022-07-06 23:53:53+00:00	0	0	0
1	2022-07-06T23:43:21.000Z	@guygavrielkay What a beautiful image. And appropriate, given the number of people who talk about Nadal as if he's Aslan. :)	1544829383092879360	1544752022171586560	14703552	{'place_id': '58a65d4a55d1b7f6'}	{'retweet_count': 0, 'reply_count': 0, 'like_count': 0, 'quote_count': 0}	58a65d4a55d1b7f6	2022-07-06 23:43:21+00:00	0	0	0
2	2022-07-06T23:32:12.000Z	@AnnaK_4ever Agreed. I think people are harder on Fritz because of all Nadal’s fake injuries.	1544826579989184512	1544767392626212866	388514822	{'place_id': '64ab889e24887e12'}	{'retweet_count': 0, 'reply_count': 1, 'like_count': 1, 'quote_count': 0}	64ab889e24887e12	2022-07-06 23:32:12+00:00	1	1	0
3	2022-07-06T23:25:08.000Z	@rollxadvertisers Grow Your Buissness #itshappening #HBDIconOfMillionsDhoni #Nadal #cryptocurrencies https://t.co/FeBknzlnk8	1544824799540793347	1544824799540793347	1535277320830865408	{'place_id': '00cc0d5640394308'}	{'retweet_count': 0, 'reply_count': 0, 'like_count': 0, 'quote_count': 0}	00cc0d5640394308	2022-07-06 23:25:08+00:00	0	0	0
4	2022-07-06T23:20:47.000Z	Which kind play b this 😅 #Nadal #itshappening Jordan/ Inaki Williams/. KNUST SRC https://t.co/3qshCRv7tg	1544823707448840194	1544823707448840194	953033567549915136	{'place_id': '0085c4a6640325a8'}	{'retweet_count': 3, 'reply_count': 0, 'like_count': 2, 'quote_count': 0}	0085c4a6640325a8	2022-07-06 23:20:47+00:00	2	0	3

gtd.places_df.head()

	name	id	country_code	full_name	place_type	country	geo
0	Peñalolén	01fb944c0dff3d86	CL	Peñalolén, Chile	city	Chile	{'type': 'Feature', 'bbox': [-70.5912832, -33.5127583, -70.4388729, -33.4591303], 'properties': {}}
1	Amherst	58a65d4a55d1b7f6	CA	Amherst, Nova Scotia	city	Canada	{'type': 'Feature', 'bbox': [-64.232955, 45.802245, -64.179066, 45.844832], 'properties': {}}
2	Collierville	64ab889e24887e12	US	Collierville, TN	city	United States	{'type': 'Feature', 'bbox': [-89.7444626, 35.006217, -89.640889, 35.110826], 'properties': {}}
3	Punjab	00cc0d5640394308	PK	Punjab, Pakistan	admin	Pakistan	{'type': 'Feature', 'bbox': [69.328873, 27.708226, 75.382124, 34.019989], 'properties': {}}
4	Oyarifa	0085c4a6640325a8	GH	Oyarifa, Ghana	city	Ghana	{'type': 'Feature', 'bbox': [-0.2508272, 5.6545401, -0.1145835, 5.7836066], 'properties': {}}

gtd.authors_df.head()

	name	public_metrics	location	username	id
0	Gary Counsil	{'followers_count': 123, 'following_count': 581, 'tweet_count': 4368, 'listed_count': 0}	New York, N.Y. Santiago, Chile	garyecounsil	2200548513
1	Frederick Lane 🇺🇸🇮🇪🏴󠁧󠁢󠁳󠁣󠁴󠁿🇺🇦	{'followers_count': 4569, 'following_count': 5025, 'tweet_count': 94035, 'listed_count': 108}	Brooklyn, NY	fsl3	14703552
2	Chris Sahm	{'followers_count': 225, 'following_count': 372, 'tweet_count': 6703, 'listed_count': 9}	New Palestine, IN	ChrisSahm	388514822
3	Roll-X Advertisers	{'followers_count': 55, 'following_count': 182, 'tweet_count': 26, 'listed_count': 0}	Islamabad, Pakistan	rollxadvertiser	1535277320830865408
4	N E N E 💎 O S U Q U A Y E 💭	{'followers_count': 1661, 'following_count': 1257, 'tweet_count': 6570, 'listed_count': 0}	Accra, Ghana	sirdesmond3	953033567549915136
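Columns such as public_metrics and geo hold nested dictionaries. The sketch below assumes that, once a saved csv file is read back, those cells arrive as the text form of a Python dict (as displayed in the tables above), and shows one way to turn them back into dictionaries and to derive a center point from a place's bounding box. The helper names parse_cell and bbox_center are hypothetical, and the bounding box is assumed to be ordered west, south, east, north.

```python
import ast

# Sample values copied from the tables above, as they would appear as text.
public_metrics = "{'retweet_count': 3, 'reply_count': 0, 'like_count': 2, 'quote_count': 0}"
place_geo = "{'type': 'Feature', 'bbox': [-0.2508272, 5.6545401, -0.1145835, 5.7836066], 'properties': {}}"


def parse_cell(cell):
    """Turn the text form of a Python dict back into a dict, safely."""
    return ast.literal_eval(cell)


def bbox_center(bbox):
    """Return (longitude, latitude) of the middle of a west, south, east, north box."""
    west, south, east, north = bbox
    return ((west + east) / 2, (south + north) / 2)


metrics = parse_cell(public_metrics)
lon, lat = bbox_center(parse_cell(place_geo)["bbox"])
print(metrics["like_count"], round(lon, 4), round(lat, 4))
```

ast.literal_eval is used instead of eval because it only accepts literals, which is safer for text coming from a file.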

Get replies

The get_tweets() method can retrieve tweets whose replies do not satisfy the query parameters, so those replies do not show up in the results. Call get_replies() to download the replies to the already downloaded tweets. Warning: proceed with caution, as this method can significantly increase the number of API calls. Make sure to cap the number of replies with the max_replies parameter.

Example:

gtd.get_replies(max_replies=20)

Downloading replies... this might take some time
getting replies for tweet 1 out of 189 (total replies so far:0)
Current progress saved at: downloads\temp_Tennis_players_project_08032022_174714.csv
getting replies for tweet 2 out of 189 (total replies so far:14)
Current progress saved at: downloads\temp_Tennis_players_project_08032022_174714.csv
Replies download done. 28 reply tweets were downloaded

The resulting replies dataframe can be accessed as gtd.repliesdf. Notice that the total number of replies went slightly over the maximum. This happens because there is no way to know in advance how many tweets each API call will return.
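The overshoot can be reproduced with a few lines of arithmetic. This is only an illustration of the reasoning, not the library's code: collect_replies is a hypothetical function, and the reply counts mimic the log above, where each of the first two conversations returned 14 replies.

```python
def collect_replies(replies_per_tweet, max_replies):
    """Fetch replies tweet by tweet, checking the cap only between calls."""
    total = 0
    for count in replies_per_tweet:
        if total >= max_replies:
            break
        # The size of a batch is only known after the call returns.
        total += count
    return total


# Two conversations with 14 replies each, then one that is never requested.
print(collect_replies([14, 14, 9], max_replies=20))  # 28
```

After the first call the total is 14, still under the cap of 20, so a second call is made and the total lands at 28.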

Search parameters in csv file

Some users may want to use this library while writing as little code as possible. For these cases, the tweets_from_csv() function is available. To use it, set up a table in a csv file containing the following information. Please keep the parameter names as indicated. The description column is not needed:

	parameter	value	description
0	query	(Nadal) OR (Rafael Nadal)	use a single space for the AND operator and "OR" for the OR operator. Example: apples OR (grapes bananas). For more information and operators go to: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#build
1	start_time	2022-07-06T00:00:00z	oldest possible date and time of the retrieved tweets in YYYY-MM-DDTHH:mm:ssZ format
2	end_time	2022-07-07T00:00:00z	most recent possible date and time of the retrieved tweets in YYYY-MM-DDTHH:mm:ssZ format
3	max_tweets	1000	total maximum amount of tweets.
4	max_tweets_page	500	maximum amount of tweets per tweet page. Must be between 10 and 500
5	language	en	if specific language is needed
6	place		if specific place is needed
7	include_retweets	no	"yes" or "no" depending on whether the user wants retweets to be included
8	only_georreferenced	yes	"yes" or "no" depending on whether the user wants only geo-referenced tweets
9	filename		The name of the file. A timestamp will be added at the end to avoid file overwriting when file name stays the same
10	wordcloud	no	"yes" or "no" depending on whether the user wants to get a wordcloud or not
11	stopwords	no	comma-separated words to be excluded from the wordcloud. Example: "http", "https", "TN", "Nashville", "County", "Putnam", "tornado", "tornadoes", "tennesse"
12	barplot	no	"yes" or "no" depending on whether the user wants to get a barplot or not

After building the parameters table and saving it in a csv file, call tweets_from_csv(), passing the csv file as a parameter:

gtd.tweets_from_csv('parameters.csv')
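If you prefer to generate the parameters file from a script, a sketch along the following lines could work. It assumes that a header row with the columns parameter and value is all the file needs, which the table above suggests but which is not confirmed here. The helper api_time is hypothetical, and the file is written to a temporary folder only to keep the example self-contained.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def api_time(moment):
    """Format an aware datetime as YYYY-MM-DDTHH:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Parameter names and values copied from the table above.
parameters = {
    "query": "(Nadal) OR (Rafael Nadal)",
    "start_time": api_time(datetime(2022, 7, 6, tzinfo=timezone.utc)),
    "end_time": api_time(datetime(2022, 7, 7, tzinfo=timezone.utc)),
    "max_tweets": 1000,
    "max_tweets_page": 500,
    "language": "en",
    "place": "",
    "include_retweets": "no",
    "only_georreferenced": "yes",
    "filename": "",
    "wordcloud": "no",
    "stopwords": "no",
    "barplot": "no",
}

path = os.path.join(tempfile.mkdtemp(), "parameters.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["parameter", "value"])  # assumed header row
    writer.writerows(parameters.items())

print(path)
```

The printed path can then be passed to tweets_from_csv() in place of 'parameters.csv'.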

Non-georeferenced tweets

Sometimes you might want to work with tweets that are not necessarily geo-tagged. You can download such tweets by setting the has_geo parameter to False in the get_tweets() method:

gtd.get_tweets(
    query='(Nadal) OR (Rafael Nadal)',
    lang='en',
    start_time='07/06/2022',
    end_time='07/07/2022',
    max_tweets=1000,
    has_geo=False
)