We’re working with a very large Twitter database containing around 4.9 million entries. Every entry is either a tweet or a reply to a tweet (or, of course, a reply to a reply). Since this data was collected using the Twitter API, tweets and their replies are not neatly grouped in the DataFrame; many unrelated entries sit in between.
We are trying to group the tweets with their corresponding replies so we can perform a sentiment analysis on each conversation, but this is where we are stuck.
We started by reversing the DataFrame, as it is easier to search from the last reply back to the original tweet than the other way around.
Now we’ll be using the column id (the ID of the entry itself) and the column in_reply_to_status_id (the ID of the tweet to which the entry replies).
In essence, we want to create some kind of for loop that detects the first row where in_reply_to_status_id is an integer and then links that row to the tweet or reply it answers by matching the value against the id column. The process has to continue until it finds a row where in_reply_to_status_id is None, as this means we’ve found the original tweet (an original tweet evidently cannot be a reply to something).
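To make the loop described above concrete, here is a minimal sketch of what we have in mind, not a tested solution. It assumes pandas, uses a small made-up DataFrame with hypothetical IDs, and stores in_reply_to_status_id as the nullable Int64 dtype so the IDs stay integers. Note that it swaps the row-by-row search for a dictionary lookup, and that the helper name trace_to_original is our own invention.

```python
import pandas as pd

# Hypothetical miniature of the data: small made-up IDs, same two columns.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "in_reply_to_status_id": pd.array([None, 1, 2, None, 4], dtype="Int64"),
})

# Map each entry's id to the id it replies to (pd.NA for original tweets).
parent_of = dict(zip(df["id"], df["in_reply_to_status_id"]))


def trace_to_original(tweet_id, parent_of):
    """Return the ids from tweet_id up to the original tweet, in that order."""
    chain = [tweet_id]
    parent = parent_of.get(tweet_id)
    # Keep climbing while there is a parent, it was actually collected,
    # and we have not seen it before (guards against malformed cycles).
    while pd.notna(parent) and parent in parent_of and parent not in chain:
        chain.append(int(parent))
        parent = parent_of.get(parent)
    return chain


chain = trace_to_original(3, parent_of)
```

With this sample data, starting from entry 3 should walk to entry 2 and then to entry 1, where in_reply_to_status_id is missing and the loop stops.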
So the first entry here would have in_reply_to_status_id = 1244694453190897664; we store this entry and use this value to search for its “original” tweet.
But this gives us a new in_reply_to_status_id of 1243885949697888263, so we store this entry as well and then have to look for its original tweet using this new in_reply_to_status_id. We want to continue this process until we arrive at an entry where in_reply_to_status_id is None, as this marks the end of a conversation.
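Repeating that walk for every row would label each entry with the original tweet it ultimately belongs to, which is the grouping we are after. The following is only a rough sketch under the same assumptions as before: pandas, made-up IDs, and a nullable Int64 reply column. The names find_root, root_cache and conversation_id are hypothetical, and the cache is just one way to avoid re-walking chains that were already resolved.

```python
import pandas as pd

# Hypothetical sample: two conversations whose rows are interleaved.
df = pd.DataFrame({
    "id": [10, 20, 11, 21, 12],
    "in_reply_to_status_id": pd.array([None, None, 10, 20, 11], dtype="Int64"),
})

parent_of = dict(zip(df["id"], df["in_reply_to_status_id"]))
root_cache = {}  # id -> id of the tweet where its reply chain ends


def find_root(tweet_id):
    """Walk up the reply chain and return the id where it ends."""
    path = []
    current = tweet_id
    while current not in root_cache:
        path.append(current)
        parent = parent_of.get(current)
        # End of chain: original tweet, uncollected parent, or a cycle.
        if pd.isna(parent) or parent not in parent_of or parent in path:
            root_cache[current] = current
            break
        current = int(parent)
    root = root_cache[current]
    # Remember the answer for every id visited on the way up.
    for seen_id in path:
        root_cache[seen_id] = root
    return root


# Label each row with its conversation, then collect the ids per conversation.
df["conversation_id"] = df["id"].map(find_root)
conversations = {
    root: group["id"].tolist()
    for root, group in df.groupby("conversation_id")
}
```

Each group could then be passed to the sentiment analysis as one conversation. A reply whose parent was never collected ends up as the start of its own group in this sketch, which may or may not be what we want.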
Would anyone have any ideas on how to start on such an operation?