Activist Event Bot Initial Success!

I'm working on my bot/spider to find activist events. It's going well!

I collected a data set of 10,000 webpages by spidering the groups and events listed on campusactivism.org/activismnetwork.org (using sphider from sphider.eu). This gave me around 1,000 event pages (mostly from the event webpages listed there).

I'm using logistic regression in SPSS.

I also considered using Bayesian filtering.

Currently the model classifies 63% of the "activist events" correctly, and 98.6% of the non-events.
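
As a rough illustration of how these two percentages are computed, here is a minimal Python sketch (the model itself was run in SPSS, not Python). The function names and the toy labels are made up for the example; 1 means "activist event" and 0 means "non-event".

```python
def per_class_accuracy(labels, predictions):
    """Share of events and of non-events that were classified correctly."""
    event_hits = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    other_hits = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    n_events = sum(1 for y in labels if y == 1)
    n_others = len(labels) - n_events
    return event_hits / n_events, other_hits / n_others


def overall_accuracy(event_rate, other_rate, event_share):
    """Overall accuracy as the class-weighted average of the two rates."""
    return event_rate * event_share + other_rate * (1 - event_share)


# Toy data, invented for illustration only.
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 0, 1]
event_rate, other_rate = per_class_accuracy(labels, predictions)
print(event_rate, other_rate)

# Assumes about 1 page in 10 is an event (around 1,000 of 10,000 pages).
print(overall_accuracy(0.63, 0.986, 0.10))
```

Under that 1-in-10 assumption the overall accuracy works out to about 95%, which shows how the large non-event class can hide the weaker 63% rate on events.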

The R^2 is 0.328 (Cox and Snell) or 0.677 (Nagelkerke). There are about 80 significant variables, including some weird ones: the word "will", for example, is significant, with a B ten times the size of its standard error.
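
The two R^2 figures are related: Nagelkerke's value is the Cox and Snell value divided by the largest value Cox and Snell can reach, which depends only on the intercept-only (null) model. A minimal Python sketch of that relation follows. The function names are invented for the example, and the 10% event share is an assumption based on the rough counts above, not a number taken from the SPSS output.

```python
import math


def cox_snell_ceiling(event_share):
    """Largest possible Cox and Snell R^2 for a given share of positive cases.

    The ceiling is 1 - L0^(2/n), where L0 is the likelihood of the
    intercept-only model. Per page, its log-likelihood is
    p*ln(p) + (1-p)*ln(1-p).
    """
    null_loglik_per_page = (event_share * math.log(event_share)
                            + (1 - event_share) * math.log(1 - event_share))
    return 1 - math.exp(2 * null_loglik_per_page)


def nagelkerke_from_cox_snell(cox_snell, event_share):
    """Rescale Cox and Snell R^2 so that its maximum is 1."""
    return cox_snell / cox_snell_ceiling(event_share)


# Assumption: roughly 1,000 event pages out of 10,000, so a share of 0.10.
estimate = nagelkerke_from_cox_snell(0.328, 0.10)
print(round(estimate, 3))
```

With a 10% share this gives about 0.69, close to the reported 0.677. The small gap is what you would expect if the true share of event pages is slightly above 10%.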

The strongest term is "conference".

The rate of correct classifications rises to 85% if I only include pages with 1,000 words or more, as short pages are harder to classify. Unfortunately, I cannot exclude the short pages as they make up the majority.

A substantial number of the incorrectly classified webpages are "in the middle". There are many pages that briefly mention an event. A conference can have 10-30 webpages about it on a website - for instance it can have a page for every major speaker, or a blog that includes lots of conference updates. In this case, the webpage really is in the middle - so the model is accurate. Some of this can be solved by looking for duplicate events listed on the same domain and by finding the one with the highest "is_event" score.
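
The duplicate handling described above could be sketched like this in Python. It is only an illustration under assumptions: the `event_key` field (some label that says which event a page is about) is hypothetical, and so are the URLs and scores. How to decide that two pages cover the same event is the hard part and is not solved here.

```python
from urllib.parse import urlparse


def best_page_per_event(pages):
    """Keep the highest-scoring page for each (domain, event) pair."""
    best = {}
    for page in pages:
        # Group by the host part of the URL plus the hypothetical event label.
        key = (urlparse(page["url"]).netloc, page["event_key"])
        if key not in best or page["is_event"] > best[key]["is_event"]:
            best[key] = page
    return list(best.values())


# Invented sample data: three pages about one conference, one about a rally.
pages = [
    {"url": "http://example.org/conf/", "event_key": "spring-conference", "is_event": 0.91},
    {"url": "http://example.org/conf/speakers", "event_key": "spring-conference", "is_event": 0.55},
    {"url": "http://example.org/blog/update-3", "event_key": "spring-conference", "is_event": 0.48},
    {"url": "http://example.net/rally", "event_key": "rally", "is_event": 0.72},
]

kept = best_page_per_event(pages)
for page in kept:
    print(page["url"], page["is_event"])
```

The speaker page and the blog update drop out, and the main conference page is kept as the single entry for that event on that domain.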

I'm currently mostly using single word terms (and I tested the 1,000 most popular ones for statistical significance). I'm going to try adding two word terms, but it is tricky as there are many more possibilities. I have found over 100,000 words, many of which aren't really words. So there are up to 100,000*100,000 possible two word combinations = 10 billion. Even if it was just 100 million combinations, it'd be too much.

I'm mostly using two word combinations with the term "conference" in them.
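
Restricting two word terms to those containing one fixed word keeps the candidate list small: with 100,000 words there are at most about 200,000 such pairs (the fixed word first or second), instead of 10 billion. A minimal Python sketch of extracting them from page text is below. The function name, the tokenizing rule, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def bigrams_with_term(text, term="conference"):
    """Count adjacent word pairs in which either word equals `term`."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    # It ignores punctuation, so pairs can span sentence boundaries.
    words = re.findall(r"[a-z']+", text.lower())
    pairs = zip(words, words[1:])
    return Counter(" ".join(pair) for pair in pairs if term in pair)


# Invented sample text.
sample = "The annual conference opens Friday. Conference registration closes soon."
counts = bigrams_with_term(sample)
for bigram, n in counts.items():
    print(bigram, n)
```

Each resulting pair can then be tested for significance the same way as a single word term.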

Currently, I'm discarding HTML. So the model doesn't use things like HTML tags, meta keywords, or random things like the number of images. I'm not sure if those things would be a significant factor.

I might try storing data about domains - so I can tell if a domain is more likely to have an activist event.

The algorithm is good for finding conferences. It doesn't do as good a job finding days or weeks of action - so I might need to have a separate algorithm for that, and perhaps another one for major protests.