Downloading
The TweetDownloader object allows you to download and store tweets for further post-processing and exporting.
Overview
There are three main functions used to download tweets:
| Function | Description |
|---|---|
| get_tweets | Gets tweets from the Twitter API and user defined queries. |
| get_replies | Gets replies on conversations originated from already downloaded tweets. |
| tweets_from_csv | Gets tweets from the Twitter API and user defined queries using a csv file table as input. |
Get tweets
To download tweets, you need to initialize the TweetDownloader class by passing the credentials YAML file as a parameter to the class constructor. You can also name your project so that the names of all exported files start with this string. Additionally, you can specify the path to the folder in which you want to save the results.
from gtdownloader import TweetDownloader
# create downloader using Twitter API credentials
gtd = TweetDownloader(name='Tennis_players_project', credentials='twitter_keys.yaml', output_folder='Tennis_project_downloads')
Once the object is initialized, you can call the get_tweets() method, passing the string you want to look up on Twitter along with any additional parameters. Here we specify a language and a date range defined by start_time and end_time, because we want to see what people were saying about Rafael Nadal on the 6th and 7th of July 2022, right after he beat Taylor Fritz at Wimbledon. Notice that we set a maximum of 1000 downloaded tweets, although there is no guarantee that this amount will be reached.
gtd.get_tweets(
    query='(Nadal) OR (Rafael Nadal)',
    lang='en',
    start_time='07/06/2022',
    end_time='07/07/2022',
    max_tweets=1000
)
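The query above joins two alternatives with OR, each wrapped in parentheses. When the list of search terms is long, the string can be assembled programmatically. The following is only a minimal sketch: `or_query` is a hypothetical helper written for this illustration and is not part of gtdownloader.

```python
def or_query(terms):
    """Join search terms into one OR query, one parenthesised group per term."""
    # Drop empty or whitespace-only entries so they do not produce "()" groups.
    cleaned = [t.strip() for t in terms if t and t.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty term is required")
    return " OR ".join(f"({t})" for t in cleaned)


query = or_query(["Nadal", "Rafael Nadal"])
print(query)  # (Nadal) OR (Rafael Nadal)
```

The resulting string can then be passed as the query argument of get_tweets().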
The download proceeds through the Twitter API pagination, which uses a next-page-token system. Because the Twitter API has monthly caps on the number of tweets that can be downloaded, a temp_ csv file with a timestamp in its name is saved after each page with the progress so far, so if the download is interrupted there is no need to re-download tweets that were already retrieved. The download finishes either when the maximum number of tweets is reached or when no more tweets satisfy the query parameters.
Downloading tweets...
Current progress saved at: Tennis_project_downloads\temp_Tennis_players_project_08032022_171311.csv
Ending page 1 with next_token=b26v89c19zqg8o3fpz2m17r4qqlvzhsejuwhysusao1a5. 496 tweets retrieved (496 total)
Current progress saved at: Tennis_project_downloads\temp_Tennis_players_project_08032022_171311.csv
Tweets download done. A total of 766 tweets were retrieved.
csv files: Tennis_project_downloads\Tennis_players_project_tweets_08032022_171311.csv, Tennis_project_downloads\Tennis_players_project_places_08032022_171311.csv, and Tennis_project_downloads\Tennis_players_project_authors_08032022_171311.csv were generated
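The page-by-page loop with a progress file can be pictured with the sketch below. It is not the library's implementation: `fetch_page` is a made-up stand-in for one API request, and the checkpoint file name and columns are assumptions chosen for the illustration.

```python
import csv
import os
import tempfile


def fetch_page(token):
    """Stand-in for one API request: returns (tweets, next_token)."""
    pages = {
        None: ([{"id": 1, "text": "a"}, {"id": 2, "text": "b"}], "t1"),
        "t1": ([{"id": 3, "text": "c"}], None),
    }
    return pages[token]


def download(max_tweets, checkpoint_path):
    """Follow next tokens page by page, saving progress after every page."""
    tweets, token = [], None
    while True:
        page, token = fetch_page(token)
        tweets.extend(page)
        # Rewrite the checkpoint after each page so an interruption
        # loses at most the page that was in flight.
        with open(checkpoint_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "text"])
            writer.writeheader()
            writer.writerows(tweets)
        # Stop when there are no more pages or the cap has been reached.
        if token is None or len(tweets) >= max_tweets:
            break
    return tweets


checkpoint = os.path.join(tempfile.mkdtemp(), "temp_progress.csv")
result = download(max_tweets=1000, checkpoint_path=checkpoint)
print(len(result), "tweets saved to", checkpoint)
```

Because the cap is only checked after a whole page has been added, the final count depends on page sizes as well as on max_tweets.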
The get_tweets() method saves three dataframes in csv format, each with the run's timestamp appended to the file name:
| File | Description |
|---|---|
| {Project name}_tweets.csv | Tweets table. Each row represents an individual tweet with its corresponding attributes |
| {Project name}_places.csv | Places table. Each row represents a location from which one or more tweets came |
| {Project name}_authors.csv | Authors table. Each row represents a user that wrote one or more tweets |
The corresponding dataframes can be accessed as the attributes: gtd.tweetds_df, gtd.places_df, and gtd.authors_df, which are pandas DataFrames:
gtd.tweetds_df.head()
| | created_at | text | id | conversation_id | author_id | geo | public_metrics | place_id | date | likes | replies | retweets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-07-06T23:53:53.000Z | @christophclarey Nadal won his 1st GS on his 1st attempt at RG in 05 when he was 19. It was a watershed moment. Conversely, in 86 TM was never the same after his loss to IL. As for TF time will tell. | 1544832034429816832 | 1544773073928425472 | 2200548513 | {'place_id': '01fb944c0dff3d86'} | {'retweet_count': 0, 'reply_count': 0, 'like_count': 0, 'quote_count': 0} | 01fb944c0dff3d86 | 2022-07-06 23:53:53+00:00 | 0 | 0 | 0 |
| 1 | 2022-07-06T23:43:21.000Z | @guygavrielkay What a beautiful image. And appropriate, given the number of people who talk about Nadal as if he's Aslan. :) | 1544829383092879360 | 1544752022171586560 | 14703552 | {'place_id': '58a65d4a55d1b7f6'} | {'retweet_count': 0, 'reply_count': 0, 'like_count': 0, 'quote_count': 0} | 58a65d4a55d1b7f6 | 2022-07-06 23:43:21+00:00 | 0 | 0 | 0 |
| 2 | 2022-07-06T23:32:12.000Z | @AnnaK_4ever Agreed. I think people are harder on Fritz because of all Nadal’s fake injuries. | 1544826579989184512 | 1544767392626212866 | 388514822 | {'place_id': '64ab889e24887e12'} | {'retweet_count': 0, 'reply_count': 1, 'like_count': 1, 'quote_count': 0} | 64ab889e24887e12 | 2022-07-06 23:32:12+00:00 | 1 | 1 | 0 |
| 3 | 2022-07-06T23:25:08.000Z | @rollxadvertisers Grow Your Buissness #itshappening #HBDIconOfMillionsDhoni #Nadal #cryptocurrencies https://t.co/FeBknzlnk8 | 1544824799540793347 | 1544824799540793347 | 1535277320830865408 | {'place_id': '00cc0d5640394308'} | {'retweet_count': 0, 'reply_count': 0, 'like_count': 0, 'quote_count': 0} | 00cc0d5640394308 | 2022-07-06 23:25:08+00:00 | 0 | 0 | 0 |
| 4 | 2022-07-06T23:20:47.000Z | Which kind play b this 😅 #Nadal #itshappening Jordan/ Inaki Williams/. KNUST SRC https://t.co/3qshCRv7tg | 1544823707448840194 | 1544823707448840194 | 953033567549915136 | {'place_id': '0085c4a6640325a8'} | {'retweet_count': 3, 'reply_count': 0, 'like_count': 2, 'quote_count': 0} | 0085c4a6640325a8 | 2022-07-06 23:20:47+00:00 | 2 | 0 | 3 |
gtd.places_df.head()
| | name | id | country_code | full_name | place_type | country | geo |
|---|---|---|---|---|---|---|---|
| 0 | Peñalolén | 01fb944c0dff3d86 | CL | Peñalolén, Chile | city | Chile | {'type': 'Feature', 'bbox': [-70.5912832, -33.5127583, -70.4388729, -33.4591303], 'properties': {}} |
| 1 | Amherst | 58a65d4a55d1b7f6 | CA | Amherst, Nova Scotia | city | Canada | {'type': 'Feature', 'bbox': [-64.232955, 45.802245, -64.179066, 45.844832], 'properties': {}} |
| 2 | Collierville | 64ab889e24887e12 | US | Collierville, TN | city | United States | {'type': 'Feature', 'bbox': [-89.7444626, 35.006217, -89.640889, 35.110826], 'properties': {}} |
| 3 | Punjab | 00cc0d5640394308 | PK | Punjab, Pakistan | admin | Pakistan | {'type': 'Feature', 'bbox': [69.328873, 27.708226, 75.382124, 34.019989], 'properties': {}} |
| 4 | Oyarifa | 0085c4a6640325a8 | GH | Oyarifa, Ghana | city | Ghana | {'type': 'Feature', 'bbox': [-0.2508272, 5.6545401, -0.1145835, 5.7836066], 'properties': {}} |
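Each place carries a bounding box rather than a single point. If a representative coordinate is needed, the midpoint of the box is a simple choice. The sketch below assumes the bbox values are ordered west, south, east, north (the GeoJSON convention, which matches the sample rows above); `bbox_center` is a hypothetical helper, not a gtdownloader function.

```python
def bbox_center(geo):
    """Return (longitude, latitude) of the midpoint of a place's bounding box."""
    # Assumed order: [west, south, east, north], as in GeoJSON.
    west, south, east, north = geo["bbox"]
    return ((west + east) / 2, (south + north) / 2)


# The geo value of the Amherst row shown above.
geo = {
    "type": "Feature",
    "bbox": [-64.232955, 45.802245, -64.179066, 45.844832],
    "properties": {},
}
lon, lat = bbox_center(geo)
print(lon, lat)
```

A plain midpoint is only a rough summary: it does not handle boxes that cross the antimeridian, and for large areas such as a whole province it may fall far from where a tweet was actually written.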
gtd.authors_df.head()
| | name | public_metrics | location | username | id |
|---|---|---|---|---|---|
| 0 | Gary Counsil | {'followers_count': 123, 'following_count': 581, 'tweet_count': 4368, 'listed_count': 0} | New York, N.Y. Santiago, Chile | garyecounsil | 2200548513 |
| 1 | Frederick Lane 🇺🇸🇮🇪🏴🇺🇦 | {'followers_count': 4569, 'following_count': 5025, 'tweet_count': 94035, 'listed_count': 108} | Brooklyn, NY | fsl3 | 14703552 |
| 2 | Chris Sahm | {'followers_count': 225, 'following_count': 372, 'tweet_count': 6703, 'listed_count': 9} | New Palestine, IN | ChrisSahm | 388514822 |
| 3 | Roll-X Advertisers | {'followers_count': 55, 'following_count': 182, 'tweet_count': 26, 'listed_count': 0} | Islamabad, Pakistan | rollxadvertiser | 1535277320830865408 |
| 4 | N E N E 💎 O S U Q U A Y E 💭 | {'followers_count': 1661, 'following_count': 1257, 'tweet_count': 6570, 'listed_count': 0} | Accra, Ghana | sirdesmond3 | 953033567549915136 |
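The three tables are linked by identifiers: place_id in the tweets table matches id in the places table, and author_id matches id in the authors table. A minimal sketch of joining them with pandas follows. It uses small hand-built DataFrames with a few of the columns and values shown above instead of the real attributes, and it assumes the id columns share the same dtype on both sides of each join.

```python
import pandas as pd

# Toy stand-ins for the tweets, places, and authors tables.
tweets_df = pd.DataFrame({
    "id": [1544832034429816832, 1544829383092879360],
    "author_id": [2200548513, 14703552],
    "place_id": ["01fb944c0dff3d86", "58a65d4a55d1b7f6"],
})
places_df = pd.DataFrame({
    "id": ["01fb944c0dff3d86", "58a65d4a55d1b7f6"],
    "full_name": ["Peñalolén, Chile", "Amherst, Nova Scotia"],
})
authors_df = pd.DataFrame({
    "id": [2200548513, 14703552],
    "username": ["garyecounsil", "fsl3"],
})

# Rename the lookup keys so they match the tweet columns, then left-join
# so that tweets without a matching place or author are kept.
enriched = (
    tweets_df
    .merge(places_df.rename(columns={"id": "place_id"}), on="place_id", how="left")
    .merge(authors_df.rename(columns={"id": "author_id"}), on="author_id", how="left")
)
print(enriched[["id", "username", "full_name"]])
```

If the ids were read back from the csv files, it is worth checking that they were parsed as the same type (all strings or all integers) before merging.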
Get replies
The tweets retrieved by get_tweets() can generate replies that do not satisfy the query parameters and therefore do not show up in the results. Call get_replies() to download the replies to the already downloaded tweets. Warning: proceed with caution, as this method can significantly increase the number of API calls. Make sure to cap the maximum number of replies with the max_replies parameter.
Example:
gtd.get_replies(max_replies=20)
Downloading replies... this might take some time
getting replies for tweet 1 out of 189 (total replies so far:0)
Current progress saved at: downloads\temp_Tennis_players_project_08032022_174714.csv
getting replies for tweet 2 out of 189 (total replies so far:14)
Current progress saved at: downloads\temp_Tennis_players_project_08032022_174714.csv
Replies download done. 28 reply tweets were downloaded
The resulting replies dataframe can be accessed as gtd.repliesdf. Notice that the total number of replies (28) went slightly over the maximum (20). This happens because there is no way to know in advance how many tweets each API call will return.
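The overshoot can be reproduced with a small simulation. This is not the library's code, only a sketch of the general pattern in which the cap is checked between requests: `collect_replies` and the fake reply lists are invented for the illustration, with sizes chosen to mirror the log above.

```python
def collect_replies(replies_per_tweet, max_replies):
    """Gather replies tweet by tweet, checking the cap only between requests."""
    collected = []
    for replies in replies_per_tweet.values():
        if len(collected) >= max_replies:
            break
        # One request returns everything it found; its size is unknown
        # until the response arrives, so the cap can be exceeded here.
        collected.extend(replies)
    return collected


fake_api = {
    "tweet_1": [f"reply_{i}" for i in range(14)],
    "tweet_2": [f"reply_{i}" for i in range(14, 28)],
    "tweet_3": [f"reply_{i}" for i in range(28, 30)],
}
replies = collect_replies(fake_api, max_replies=20)
print(len(replies))  # 28
```

After the first tweet the total (14) is still under the cap, so the second request is made and brings the total to 28 before the check stops the loop.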
Search parameters in csv file
Some users of this library might want to write as little code as possible. For these cases, the tweets_from_csv() method is available.
To use it, you need to set up a table in a csv file containing the following information. Please keep the row names as indicated. The description column is not needed:
| | parameter | value | description |
|---|---|---|---|
| 0 | query | (Nadal) OR (Rafael Nadal) | Use a single space for the AND operator and "OR" for the OR operator. Example: apples OR (grapes bananas). For more information and operators, go to: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#build |
| 1 | start_time | 2022-07-06T00:00:00z | oldest possible date and time of the retrieved tweets in YYYY-MM-DDTHH:mm:ssZ format |
| 2 | end_time | 2022-07-07T00:00:00z | the most recent possible date and time for retrieved tweets in YYYY-MM-DDTHH:mm:ssZ format |
| 3 | max_tweets | 1000 | total maximum amount of tweets. |
| 4 | max_tweets_page | 500 | maximum amount of tweets per tweet page. Must be between 10 and 500 |
| 5 | language | en | if specific language is needed |
| 6 | place | | if specific place is needed |
| 7 | include_retweets | no | "yes" or "no" depending on whether the user wants retweets to be included |
| 8 | only_georreferenced | yes | "yes" or "no" depending on whether the user wants only georeferenced tweets |
| 9 | filename | | The name of the file. A timestamp will be added at the end to avoid overwriting files when the file name stays the same |
| 10 | wordcloud | no | "yes" or "no" depending on whether the user wants to get a wordcloud or not |
| 11 | stopwords | no | comma separated words to be excluded from the wordcloud. Example: "http", "https", "TN", "Nashville", "County", "Putnam", "tornado", "tornadoes", "tennesse" |
| 12 | barplot | no | "yes" or "no" depending on whether the user wants to get a barplot or not |
After building the parameters table and saving it in a csv file, you can call tweets_from_csv(), passing the path to the csv file as a parameter:
gtd.tweets_from_csv('parameters.csv')
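The parameters file can also be written from Python instead of a spreadsheet. The sketch below is an assumption-heavy illustration: it writes only the parameter and value columns with the row names from the table above, and whether tweets_from_csv() also expects the leading index column shown there is not confirmed here, so compare the output with a known-good file before relying on it. `to_iso` is a hypothetical helper that converts an MM/DD/YYYY date into the YYYY-MM-DDTHH:mm:ssZ form.

```python
import csv
import os
import tempfile
from datetime import datetime


def to_iso(date_text):
    """Convert an MM/DD/YYYY date into the YYYY-MM-DDTHH:mm:ssZ form."""
    return datetime.strptime(date_text, "%m/%d/%Y").strftime("%Y-%m-%dT%H:%M:%SZ")


# Row names and values copied from the parameters table; empty strings
# stand for the values left blank there.
rows = [
    ("query", "(Nadal) OR (Rafael Nadal)"),
    ("start_time", to_iso("07/06/2022")),
    ("end_time", to_iso("07/07/2022")),
    ("max_tweets", "1000"),
    ("max_tweets_page", "500"),
    ("language", "en"),
    ("place", ""),
    ("include_retweets", "no"),
    ("only_georreferenced", "yes"),
    ("filename", ""),
    ("wordcloud", "no"),
    ("stopwords", "no"),
    ("barplot", "no"),
]

path = os.path.join(tempfile.mkdtemp(), "parameters.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["parameter", "value"])  # assumed column layout
    writer.writerows(rows)

print("parameters written to", path)
```

Generating the file this way keeps the row names consistent between runs and makes it easy to vary only the query or the date range.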
Non-georeferenced tweets
Sometimes you might want to work with tweets that are not necessarily geo-tagged. It is now possible to download such tweets using the has_geo parameter in the get_tweets() method:
gtd.get_tweets(
    query='(Nadal) OR (Rafael Nadal)',
    lang='en',
    start_time='07/06/2022',
    end_time='07/07/2022',
    max_tweets=1000,
    has_geo=False
)