Counting Keywords from Reddit and Digg
January 3rd, 2007
I program to program, but lately I am finding an interest in statistics. Social bookmarking sites have risen in popularity, and there is a rush to beat Google at its own game with improved vertical search. Paul Graham writes about using a Bayesian filter to classify mail as spam or non-spam, but the same idea could be used to classify other kinds of text as well, as I’m sure is already being done by any major search engine contender. Bayesian analysis could be used, for example, to separate the word Java in the context of coffee from Java in the context of computing. Instead of only spam/non-spam probabilities, it may be practical to determine probabilities for a whole set of classifications, such as humor, romance, or tutorials, as they apply to web content.
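To make the Java example concrete, here is a minimal sketch of a naive Bayes text classifier in Ruby. It is not Paul Graham's filter: the class name `TinyBayes`, the four training sentences, and the add-one smoothing are all assumptions chosen for illustration, and a real classifier would need far more training text.

```ruby
# A toy naive Bayes classifier. Everything here (class name, labels,
# training text) is hypothetical and only illustrates the idea.
class TinyBayes
  def initialize
    # label => { word => count }
    @word_counts = Hash.new { |h, label| h[label] = Hash.new(0) }
    @doc_counts = Hash.new(0) # label => number of training texts
    @vocabulary = {}          # every word seen during training
  end

  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Log-probability score of each label for the text. Add-one smoothing
  # keeps unseen words from driving a probability to zero.
  def scores(text)
    total_docs = @doc_counts.values.sum
    words = tokenize(text)
    @doc_counts.keys.each_with_object({}) do |label, result|
      label_total = @word_counts[label].values.sum
      score = Math.log(@doc_counts[label].to_f / total_docs)
      words.each do |word|
        count = @word_counts[label][word]
        score += Math.log((count + 1.0) / (label_total + @vocabulary.size))
      end
      result[label] = score
    end
  end

  # The label with the highest score.
  def classify(text)
    scores(text).max_by { |_label, score| score }.first
  end
end

bayes = TinyBayes.new
bayes.train(:coffee, "java beans roast dark coffee cup")
bayes.train(:coffee, "brew a cup of java coffee in the morning")
bayes.train(:computing, "java class compiler virtual machine code")
bayes.train(:computing, "write java code and compile the program")

puts bayes.classify("a dark roast java")              # prints "coffee"
puts bayes.classify("java compiler error in my code") # prints "computing"
```

The word Java appears equally often under both labels, so it is the surrounding words that tip the score one way or the other. Adding more labels (humor, romance, tutorials) only means calling `train` with more label names.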
Searching the web is a complex endeavor, and I am more curious than anything, so I wrote a simple script to scrape the popular pages of Digg and Reddit. The program’s output is just a list of words in order of popularity for each website. I present here some results of a scan on New Year’s Eve, 2006. I compiled the following list in no particular order; the words were hand-selected from the output based on their ability to indicate whether the text was more politically or technologically inclined. PPM is the number of times a word would appear in a corpus of 1 million words. % is the word’s count divided by the number of pages scanned, times 100. That makes it closer to occurrences per 100 pages than to a true percent chance, which is why it can exceed 100.
| Word | Reddit PPM | Reddit % | Digg PPM | Digg % |
|---|---:|---:|---:|---:|
| Bush | 1185 | 75 | 359 | 47 |
| Saddam | 359 | 47 | 453 | 60 |
| oil | 79 | 5 | 151 | 20 |
| war | 434 | 27 | 113 | 15 |
| Blair | 39 | 2 | 0 | 0 |
| Wii | 237 | 15 | 794 | 105 |
| Linux | 276 | 17 | 1928 | 255 |
| Windows | 118 | 7 | 2836 | 375 |
| computer | 158 | 10 | 151 | 20 |
| AJAX | 0 | 0 | 132 | 17 |
| Sony | 79 | 5 | 245 | 32 |
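As a rough check on the table, the PPM columns can be averaged per category. The sketch below hard-codes the values from the table; the split into `political` and `technology` words follows the hand selection described above, and the helper name `average_ppm` is made up for this example.

```ruby
# PPM values copied from the table above: word => [Reddit PPM, Digg PPM].
political = {
  "Bush" => [1185, 359], "Saddam" => [359, 453], "oil" => [79, 151],
  "war" => [434, 113], "Blair" => [39, 0]
}
technology = {
  "Wii" => [237, 794], "Linux" => [276, 1928], "Windows" => [118, 2836],
  "computer" => [158, 151], "AJAX" => [0, 132], "Sony" => [79, 245]
}

# Hypothetical helper: unweighted mean PPM of one column (0 = Reddit, 1 = Digg).
def average_ppm(words, column)
  values = words.values.map { |pair| pair[column] }
  values.sum.to_f / values.size
end

[["political", political], ["technology", technology]].each do |name, words|
  printf("%-10s Reddit %.1f  Digg %.1f\n", name, average_ppm(words, 0), average_ppm(words, 1))
end
```

With these numbers the political average comes to about 419 PPM on Reddit against 215 on Digg, and the technology average to about 145 against 1014. It is an unweighted mean over a handful of hand-picked words, so it says nothing about significance.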
It can be seen from these results that a filter trained to separate political from technological discussion would tend to label Digg the more technological of the two, at least in the sense of popular technology. As for political words, Bush, Blair, and war were more common on Reddit, while oil and Saddam lean slightly toward Digg. An average of the PPM scores for the political words would place Reddit on top. A real statistician might kill me for these “conclusions,” so in the spirit of learning something, or helping someone else learn something, here’s the code I used to collect the data. Draw your own conclusions if you wish.
```ruby
require 'uri'
require 'net/http'
require 'thread'

# Counts word occurrences across a group of pages.
class WordList
  attr_reader :group_count # number of pages added
  attr_reader :total_count # number of words added

  # Strip the markup from an HTML string and split what is left into words.
  def WordList.parse(text)
    text.
      gsub(/<!--.*?-->/m, " "). # remove comments, including multi-line ones
      gsub(/<[^>]*>/, " ").     # remove tags
      gsub(/&#?\w+;/, " ").     # remove entities
      gsub(/[^\w\s]/, " ").     # replace everything except word characters and whitespace
      gsub(/\b\d+\b/, " ").     # remove pure numbers
      split                     # split on runs of whitespace
  end

  def initialize
    @words = Hash.new(0)
    @total_count = 0
    @group_count = 0
    # Created up front so that every thread shares the same lock.
    @mutex = Mutex.new
  end

  # Yield the list while holding its lock, so that only one thread
  # at a time updates the counts.
  def synchronize
    @mutex.synchronize { yield self }
  end

  def add(word)
    return unless word and word.length > 0
    @words[word.to_sym] += 1
    @total_count += 1
  end

  # Add every word from one page and count the page itself.
  def add_each(words)
    words.each { |w| add w }
    @group_count += 1
  end

  def add_string(s)
    add_each self.class.parse(s)
  end

  # Word/count pairs, most frequent first.
  def sort
    @words.to_a.sort_by { |x| -x[1] }
  end

  # One line per word: the word, its PPM, and its count per 100 pages.
  def output
    lines = sort.map do |word, count|
      sprintf("%20s %07d %03d\n", word, count * 1_000_000 / @total_count, count * 100 / @group_count)
    end
    "Word, PPM of words, % of pages\n" + lines.join
  end
end

# Fetches a site's pages in parallel and appends the word counts to a file.
# Subclasses supply base_url, page_to_url, filename and num_pages.
class SiteScanner
  def initialize
    @word_list = WordList.new
  end

  def base_url
    raise NotImplementedError, 'override me'
  end

  def page_to_url(n)
    raise NotImplementedError, 'override me'
  end

  def filename
    raise NotImplementedError, 'override me'
  end

  def num_pages
    raise NotImplementedError, 'override me'
  end

  def get_page(n)
    url = base_url + page_to_url(n)
    response = Net::HTTP.get_response(URI.parse(url))
    response.body
  end

  def scan
    began_at = Time.now
    threads = (1..num_pages).map do |n|
      # Hand n to the thread as an argument so each thread keeps its own
      # page number; a `for` loop would share one n across all threads.
      Thread.new(n) do |page|
        s = get_page(page)
        print "."; $stdout.flush
        @word_list.synchronize { |w| w.add_string s }
      end
    end
    threads.each { |t| t.join }
    ended_at = Time.now

    File.open(filename, File::WRONLY|File::APPEND|File::CREAT, 0666) do |file|
      file << "Scan began: #{began_at}\n"
      file << "Scan ended: #{ended_at}\n"
      file << "Words scanned: #{@word_list.total_count}\n"
      file << "Pages scanned: #{@word_list.group_count}\n"
      file << @word_list.output
    end
  end
end

class DiggScanner < SiteScanner
  def base_url
    "http://digg.com/news/page"
  end

  def page_to_url(n)
    n.to_s
  end

  def filename
    "digg.txt"
  end

  def num_pages
    100
  end
end

class RedditScanner < SiteScanner
  def base_url
    "http://reddit.com/?offset="
  end

  # Reddit pages are addressed by item offset, 25 items per page.
  def page_to_url(n)
    ((n - 1) * 25).to_s
  end

  def filename
    "reddit.txt"
  end

  def num_pages
    40
  end
end

RedditScanner.new.scan
DiggScanner.new.scan
```
This project also gave me a chance to try my hand at Ruby threads. I’m not sure whether the mutex on the word list is required; I have no idea what would happen if two threads called WordList#add_string at the same time. Perhaps someone with more knowledge can comment on thread safety and on what happens when two threads call the same method.
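For anyone who wants to experiment with that question, here is a small sketch. An update like `counts[key] += 1` is a read followed by a write rather than one atomic step, so two threads can in principle interleave and lose an increment; how often that happens in practice depends on the Ruby implementation and its thread scheduling. The `SafeCounter` class and the thread and iteration counts below are assumptions made up for illustration; only the pattern, a Mutex guarding a shared hash, matches the scanner.

```ruby
# Hypothetical stand-in for WordList: a shared hash of counts behind a Mutex.
class SafeCounter
  def initialize
    @counts = Hash.new(0)
    @mutex = Mutex.new
  end

  # The read-modify-write of += happens while the lock is held,
  # so increments from different threads cannot interleave.
  def increment(key)
    @mutex.synchronize { @counts[key] += 1 }
  end

  def [](key)
    @mutex.synchronize { @counts[key] }
  end
end

counter = SafeCounter.new
threads = (1..10).map do
  Thread.new do
    1000.times { counter.increment(:word) }
  end
end
threads.each(&:join)

puts counter[:word] # 10 threads x 1000 increments each, prints 10000
```

Dropping the `synchronize` call may well still print 10000 on the standard interpreter, but nothing guarantees it. With the lock in place the total does not depend on how the threads happen to be scheduled.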
It’s a long, long way from a useful application, but the program output was interesting and encouraging. I would enjoy your comments.
On January 6th, 2007 at 04:04 PM: Sam, great article. You truly are an abstract thinker. Just one suggestion, though: I think it would be good if you could include a time period in your scanner, e.g. 30 PPM over 35 hours, 12 minutes, and 33 seconds. You could even start your own site called socialwords.com. Hehe, seriously, great idea. With this scanner you could also determine the 'quality' level and the focus of the different sites.