Borrowing Twitter's Data

I described previously here how to set up a system to connect to twitter.com using Python. This post follows on from that with some basic data scraping. I don't want to teach Python here, as there are far, far better sites for that. So instead, partly to save time, I am going to switch to merely outlining the working principle of the code. The actual code is at the end if you want to copy it. Here is the idea behind the code:

Connect to Twitter.
Select a main Twitter user.

SECTION 1:
1/ Get a list of all those the user follows. (The API calls these friends.)
2/ For each friend, record how many followers they have and obtain their last 40 tweets.
3/ Scan each of those 40 tweets and record all the Twitter users mentioned.
4/ Write each friend, the number of followers they have, and a list of all the users they mentioned in their last 40 tweets to a results file.

Notes:
To avoid running into problems with the Twitter API limits, we add long pauses between API calls if at any stage the number of calls we have left gets low. Boiled down, the whole section looks something like the sketch below.
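(A minimal sketch, assuming the same 2012-era tweepy interface as the full script at the end; a regular expression stands in here for the script's character-by-character mention scan, scan_friends is just an illustrative name, and results are printed rather than written to a file.)

import re
import time
import tweepy

def scan_friends(api, username):
    #Walk through everyone the user follows
    for friend in tweepy.Cursor(api.friends, id=username).items():
        #Back off whenever the hourly API allowance runs low
        if api.rate_limit_status()['remaining_hits'] < 100:
            time.sleep(60)
        tweets = friend.timeline(count=40)  #Costs 2 API calls (20 tweets/call)
        mentioned = []
        for tweet in tweets:
            mentioned += re.findall(r'@(\w+)', tweet.text)
        print friend.screen_name, friend.followers_count, mentioned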

SECTION 2:
1/ Reload the results file from section 1.
2/ Scan through the file and replace each mentioned user with the number of followers they have.
3/ Save the results into a new results file.

Notes:
To avoid wasting API calls, if we have previously asked how many followers a specific user has we won't ask again but will merely reuse the older value; the sketch below shows the idea.
Again, we will pause if at any stage the number of API calls we have left gets low.
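(The caching idea as a minimal sketch: follower_cache and count_followers are illustrative names, and the real script below uses two parallel lists rather than a dictionary, but the logic is the same.)

import time

follower_cache = {}  #Screen name -> follower count

def count_followers(api, name):
    if name in follower_cache:
        return follower_cache[name]  #Reuse the older value: no API call
    if api.rate_limit_status()['remaining_hits'] < 100:
        time.sleep(60)  #Pause when the remaining calls get low
    try:
        count = api.get_user(name).followers_count
    except:
        count = 0  #Zero marks deleted, suspended or mistyped users
    follower_cache[name] = count
    return count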

SECTION 3:
1/ Reload the results file from section 2.
2/ Scan it and, for every friend, sort the mentioned users into bins according to how many followers they have.
3/ Write out each friend, the number of followers they have and their histogram of mentioned users.

Notes:
I used the following bins for number of followers (a short binning sketch follows the list):
6-50
51-100
101-500
501-1000
1001-5000
5001-10000
10001-50000
50001+
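(A minimal sketch of the binning: histogram_for is an illustrative name, and the bisect lookup replaces the chain of if-statements the full script uses. Counts of 5 or fewer followers are ignored, as in the script.)

import bisect

EDGES = [5, 50, 100, 500, 1000, 5000, 10000, 50000]

def histogram_for(follower_counts):
    hist = [0]*8
    for n in follower_counts:
        if n > 5:  #Five or fewer followers falls outside the first bin
            hist[bisect.bisect_left(EDGES, n) - 1] += 1
    return hist

print histogram_for([7, 60, 99, 2000000])  #-> [1, 2, 0, 0, 0, 0, 0, 1]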

And so what did the final results look like? Well, they were not that easy to interpret; I needed to reduce the histogram to simpler values. Firstly, I can count the total number of mentions to give a simple measure of the interactivity of each friend. Next, I want a measure of how selective the user is in their mentioning. I had planned to derive a function from the global Twitter distribution of number of followers, and then measure the fit of this function to each user's histogram. To get the function I needed data on the global distribution of the number of followers of Twitter users. This was helpfully provided here.

[Figure: the distribution of Twitter users versus number of followers]
The distribution looked to be roughly approximated by a log-normal plot. I started to fit to this, then almost immediately found another webpage that just gave a power law. Then I found a quite angry rebuttal to that here. At which point I got bored and decided merely to measure the number of mentions of those with more followers than the user minus the number of mentions of those with fewer followers. I then decided this was unfair on accounts with large numbers of followers, so I also added a measure of mentions above versus below the average follower count, which is about 100. Both scores boil down to sums over the histogram bins, as the sketch below shows.
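(Both scores as a minimal sketch over the 8-bin histogram: mention_scores is an illustrative name, and the sign convention makes a positive score mean the friend mostly mentions smaller accounts.)

def mention_scores(hist, followers):
    #Bins 0 and 1 cover everyone up to ~100 followers, the rough average
    average_score = sum(hist[:2]) - sum(hist[2:])
    #Find which bin the friend themselves would land in
    rank = 0
    for i, edge in enumerate([50, 100, 500, 1000, 5000, 10000, 50000]):
        if followers > edge:
            rank = i + 1
    #Mentions of smaller accounts minus mentions of bigger ones
    personal_score = sum(hist[:rank]) - sum(hist[rank:])
    return sum(hist), average_score, personal_score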

So here are the results.

Top 5 followees (or friends, to use Twitter's term) by follower count:
11694782 twitter
7304211 BillGates
5524720 nytimes
2527750 Schwarzenegger
2462414 NASA

Not really many surprises there.

Top 10 most interactive friends (bracketed number is follower count):
86 _BTO (8087)
77 KrustyAllslopp (1613)
64 mickskeptic (807)
62 Grumpydev (1479)
59 jimbobthomas (930)
58 WilliamShatner (1108888)
55 pupaid (2625)
54 laurenlaverne (170837)
49 OwenJones84 (43098)
49 BelfastSkeptics (714)

Congratulations to all the nice people on this list. These types of users are the folks who make Twitter most interesting.

Top 5 accounts most interacting with small users:
18 metoffice (76443)
10 thepHbar (59)
9 ChrisGPackham (38648)
9 WHO (468027)
6 htc_linux (9039)

The only notable addition here is ChrisGPackham. He was promoting his new (and actually truly excellent) show "Secrets of our Living Planet" during the searched 40-tweet period. If anything this shows the weakness of looking over such a small time period.

Top 5 accounts most interacting with small users relative to themselves:
5 ChrisGPackham (38648)
4 FloB_CMB (15)
2 zarbs89 (13)
1 PhoneDog (19140)
1 geekologie (11947)

Again Chris Packham appears, along with some very small accounts and two accounts linked to websites. This shows that the metric I chose wasn't really that good. Also, some of those accounts aren't really used, which again shows the weakness of not vetting tweets by date.

There are so many limitations to this analysis that I really don't want anyone to read too much into any of it. It's more a simple example of Twitter data scraping than anything I would want to bet on. I only use the last 40 tweets and am very easily skewed by local events and vocal Twitter friendships. There are a huge number of ways to do much more with more robust statistical analysis. However, it's worth noting that many techniques require many more API calls. Unless you are very patient this is a problem: this simple script took about 24 hours to run, and I only follow ~400 people, who mentioned ~10000 users. If you followed 1000 users and looked at the last 100 tweets it might take half a week even with this simple analysis. With 10,000 users and the last 1000 tweets the results would start to get interesting, but the waiting time would be massive. There are, however, ways around this, like using different Twitter APIs or caching results as you go. If you are an American security agency you can secretly demand access and data-crunch just as fast as Twitter does. So it's a safe bet there is likely a pandemonium of powerful monitoring agents running through the Twittersphere at any moment. I'll end on that cheerful note.
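(As a footnote on those run times: almost all of the cost is the fixed 15-second pause taken before each API call, so the estimates are easy to sanity-check. The distinct-mention counts below are illustrative guesses, not measured values.)

def estimated_hours(friends, tweets_each, distinct_mentions):
    timeline_calls = friends * (tweets_each // 20)  #20 tweets per timeline call
    lookup_calls = distinct_mentions  #One call per user not already cached
    return (timeline_calls + lookup_calls) * 15 / 3600.0  #15s sleep per call

print estimated_hours(400, 40, 5000)     #Roughly a day
print estimated_hours(1000, 100, 15000)  #Roughly half a week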

The Code:
Below is the actual code. Now, before I get ripped to shreds by purists, I should point out this was stream-of-consciousness coding (or, yes, lazy coding if you prefer). Basically I wrote it in one go, without planning and without tidying up afterwards. It is therefore, to say the very least, not exactly optimized nor pretty. However, it is hopefully easy to follow, as you can pretty much see my thoughts in the code. Think of it like this:
[Image: a messy prototyping breadboard next to a production Arduino]
If the sight of it fills you with rage, please feel free to copy it, go through it, tidy it up, and send me your improved code, which I will append to the end. Corrections or typos in my script will be fixed if you let me know (see the contact me tab).

Note:
Even non-coders can probably guess what the Python lines are doing. But to help if you are trying to follow this: barring the first line (which is a little special), any text to the right of a pound/hash symbol ( # ) in Python is a comment. Comments are ignored when the script runs and are solely there to help explain the code.
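For example, in a made-up line like this, the text after the hash changes nothing:

answer = 6*7 #Computes 42; everything from the hash onwards is ignored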

 

#!/usr/bin/env python
import tweepy
import time

#This is a simple script to scan through the people you follow on twitter
#and record the people they have mentioned in the last few tweets. Use as
#you wish but there are absolutely no guarantees it works in any way whatsoever.
#Thanks to the tweepy folks for making this easy.
#Author: @kasilas
#Date: 8th July 2012

#Target User and Results Storage Location
#Alter as required
USERNAME='kasilas'
FINALDESTINATION='TwitterResults.txt'

#Correct extension if needed
if not FINALDESTINATION.lower().endswith('.txt'):
    FINALDESTINATION=FINALDESTINATION+'.txt'

#Connect with Oauth
#Replace XXXX with your specific values.
#You get these from Twitter.com, see my previous post for details.
#It will run without these values but you get 350 API calls/hr
#with Oauth and only 150/hr without.
CONSUMER_KEY = 'XXXX'
CONSUMER_SECRET = 'XXXX'
ACCESS_KEY = 'XXXX'
ACCESS_SECRET = 'XXXX'
try:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
    api = tweepy.API(auth)
    limit = api.rate_limit_status()['remaining_hits']
except:
    print "Can't connect. Is tweepy installed and is twitter.com accessible?"
    exit()

#Open Final Result file
try:
    Output=open(FINALDESTINATION,'w')
except:
    print "Can't write to",FINALDESTINATION
    exit()

#API Limit test
print 'You have',limit,'twitter API calls left.'
if limit < 120:
    print "Please wait until you have >120 calls."
    print "This will likely be within 1 hour or so."
    exit()

#Quick check user exists
try:
    FRIENDS=tweepy.Cursor(api.friends, id=USERNAME).items()
    USER=api.get_user(USERNAME)
    NOOFFRIENDS=USER.friends_count #We loop over friends so count those
    print "Searching",USERNAME,"friends."
    print "Note: They follow",NOOFFRIENDS,"accounts. To comply with"
    print "twitter.com this processes only 4 friends a minute so it"
    print "will take at least",NOOFFRIENDS/4,"minutes to complete."
    print "Final results are stored at",FINALDESTINATION
except:
    print "Can't access user",USERNAME
    exit()

print "SECTION 1:"
NOOFLINES=0
#The Followee loop
for Followee in FRIENDS:
    #Pausing merely to avoid exceeding API limits
    time.sleep(15)
    limit= api.rate_limit_status()['remaining_hits']
    if limit<200:
        print "Due to low API limits pausing."
        time.sleep(30)
        if limit<100 : time.sleep(30)
        if limit< 30 : time.sleep(60)
    #Get username and see how many follow them
    try:
        TheirUserName=Followee.screen_name
        NoOfFollowers=Followee.followers_count
    except:
        Output.write("ERROR: This follower failed for an unknown reason.\n\n\n")
        NOOFLINES=NOOFLINES+1
        continue
    print "Processing",TheirUserName,":",limit,"calls left."
    #Get their last 40 tweets, this costs 2 API calls (20tw/call)
    try:
        LastTweets=Followee.timeline(count=40)
    except:
        Output.write("ERROR: "+str(TheirUserName)+" "+str(NoOfFollowers))
        Output.write("\nCouldn't get tweets.\n\n\n")
        NOOFLINES=NOOFLINES+1
        continue
    #Search each tweet for users
    MentionedUsers=[]
    for EachTweet in LastTweets:
        TweetText=EachTweet.text
        Stringplace=0
        Started=False
        #Extract the usernames mentioned
        #(a mention at the very end of a tweet is missed)
        for character in TweetText:
            if character=='@':
                Started=True
                StartOfName=Stringplace
            if Started and character==' ':
                Started=False
                EndOfName=Stringplace
                UserName=TweetText[StartOfName+1:EndOfName]
                if len(UserName)>1 :
                    MentionedUsers.append(UserName)
            Stringplace=Stringplace+1
    #Just massaging the list into a nice string for later reading
    MentionedUsers='\n'.join(MentionedUsers)#One name per line
    MentionedUsers=MentionedUsers.replace(':','')#Remove more user formatting
    MentionedUsers=MentionedUsers.replace(';','')#Remove silly user formatting
    MentionedUsers=MentionedUsers.replace(',','\n')#Trailing commas end a name too
    MentionedUsers=MentionedUsers.replace('\"','')#Remove quotes
    MentionedUsers=MentionedUsers.replace(' ','')#Remove blanks
    Output.write("USER: "+str(TheirUserName)+" "+str(NoOfFollowers))
    Output.write("\n"+MentionedUsers+"\n\n\n")
    NOOFLINES=NOOFLINES+1

#Tidy up this section
Output.close()

print "SECTION 2:"
#Reload results file
Input=open(FINALDESTINATION,'r')
Output=open(FINALDESTINATION[:-4]+".no.txt",'w')

#API Limit test
limit= api.rate_limit_status()['remaining_hits']
print 'You have',limit,'twitter API calls left.'
print 'You have',NOOFLINES,'friends to process.'
print 'WARNING: This could potentially take',NOOFLINES/4,'mins to process.'

#Two lists to cache follower count requests to attempt to
#avoid hitting the Twitter API limits.
preUsers=[]
preCount=[]

#Replace all user names by their follower count
#Zero is given if the user doesn't exist
CURRENTLINE=0
for line in Input:
    CURRENTLINE=CURRENTLINE+1
    if len( line.strip() )<1 : continue
    print CURRENTLINE,":",
    if line.find('USER:')!=-1 or line.find('ERROR:')!=-1:
        Output.write(line)
        print "USER:",line[:-1]
    else:
        FollowerCount=-1
        line=line.strip()#Drop line return and spaces
        line=line.rstrip('.!?}])')#Drop trailing punctuation
        if len(line)<1 : continue #Nothing left after trimming
        print line,
        #Use precached value if available
        if line in preUsers:
            FollowerCount=preCount[preUsers.index(line)]
        else:
            try:
                #Pausing merely to avoid exceeding API limits
                time.sleep(15)
                limit= api.rate_limit_status()['remaining_hits']
                if limit<200:
                    print "Due to low API limits pausing."
                    time.sleep(30)
                    if limit<100 : time.sleep(30)
                FollowerCount=api.get_user(line).followers_count
                #Add results to cache
                preUsers.append(line)
                preCount.append(FollowerCount)
                print "loaded (",limit,"calls left) and",
            except:
                FollowerCount=0
        print "converted to",FollowerCount
        Output.write(str(FollowerCount)+"\n")

#Tidy up after this stage
Output.close()
Input.close()

#Okay so then double check the zeros and print out possible errors
Input1=open(FINALDESTINATION,'r')
Input2=open(FINALDESTINATION[:-4]+".no.txt",'r')
Names=[]
FolNo=[]
for line in Input1 : Names.append(line.strip())
for line in Input2 : FolNo.append(line.strip())
Input1.close()
Input2.close()
Output=open("Zeroes.txt",'w')
placeHolder=0
for element in FolNo:
    placeHolder=placeHolder+1
    try:
        if int(element)==0:
            PossibleIssue=str(Names[placeHolder-1])+" at line "
            PossibleIssue=PossibleIssue+str(placeHolder)+"\n"
            Output.write(PossibleIssue)
    except:
        continue
Output.close()

print "SECTION 3:"
#Make a simple histogram binning in
#50,100,500,1000,5000,10000,50000+
Input=open(FINALDESTINATION[:-4]+".no.txt",'r')
Output=open("Histo_Results.txt",'w')
notFirstLine=False
for line in Input :
    if line.find("USER:")!=-1:
        print line,notFirstLine
        if notFirstLine :
            #Write out the previous user before starting the next
            strHist=str(Histogram)
            Output.write(str(Username)+" Fol: "+str(Followers)+" Hist: "+strHist)
            Output.write("\n")
        Histogram=[0,0,0,0,0,0,0,0]
        Username=line.split()[1]
        Followers=line.split()[2]
    else:
        try:
            Folls=int(line.strip())
            if Folls>50000 : Histogram[7]=Histogram[7]+1; continue
            if Folls>10000 : Histogram[6]=Histogram[6]+1; continue
            if Folls> 5000 : Histogram[5]=Histogram[5]+1; continue
            if Folls> 1000 : Histogram[4]=Histogram[4]+1; continue
            if Folls>  500 : Histogram[3]=Histogram[3]+1; continue
            if Folls>  100 : Histogram[2]=Histogram[2]+1; continue
            if Folls>   50 : Histogram[1]=Histogram[1]+1; continue
            if Folls>    5 : Histogram[0]=Histogram[0]+1; continue
        except:
            continue
    if notFirstLine==False : notFirstLine=True
#Write out the final user too
strHist=str(Histogram)
Output.write(str(Username)+" Fol: "+str(Followers)+" Hist: "+strHist+"\n")
#TidyUp
Output.close()
Input.close()

print "Analysis:"
#Basic analysis
Input=open("Histo_Results.txt",'r')
Output=open("Final.txt",'w')
lineNumber=0
for line in Input :
    name=line.split()[0]
    fols=int(line.split()[2])#Stored as text so convert before comparing
    hist=line[line.find(' [')+2:line.find(']')].split(',')
    for i in range(len(hist)):
        hist[i]=int(hist[i].strip())
    tots=sum(hist)
    averageMentionScore=sum(hist[2:])-sum(hist[:2])
    #Tests run smallest to largest so the last true one wins
    rank=0
    if fols>    50 : rank=1
    if fols>   100 : rank=2
    if fols>   500 : rank=3
    if fols>  1000 : rank=4
    if fols>  5000 : rank=5
    if fols> 10000 : rank=6
    if fols> 50000 : rank=7
    personalMentionScore=sum(hist[rank:])-sum(hist[:rank])
    newline=str(lineNumber)+" Un: "
    newline=newline+name+" Fo: "+str(fols)+" Tm: "+str(tots)+" Am: "
    newline=newline+str(-1*averageMentionScore)+" Pm: "
    newline=newline+str(-1*personalMentionScore)+"\n"
    Output.write(newline)
    lineNumber=lineNumber+1
#TidyUp
Input.close()
Output.close()
#The end
exit()