Bayesian Filtering Outside of Email?
clonebarkins asks: "Is anybody out there using Bayesian filtering for stuff other than to get rid of spam? For example, how useful would Bayesian filtering be to identify news stories/blog entries in the RSS feeds I monitor? Is there any software out there using Bayesian filtering to do this sort of thing already? Are other types of filters better for these purposes?" What other areas can you think of where Bayesian filtering may prove useful?
I'd like to use it for Slashdot.. (Score:3, Funny)
Re:I'd like to use it for Slashdot.. (Score:1)
Re:I'd like to use it for Slashdot.. (Score:1)
I'm just thinking out loud, though, & I'm not a filtering expert.
Re:I'd like to use it for Slashdot.. (Score:1)
1. Run the filter on a number of posts, trolls, karmawhores, the rest.
2. Write a comment.
3. Run the filter on the comment; if the score is too low, try to improve it by using words that will give you more karma, such as CowboyNeal, SCO and Micro$oft.
Nyuk Nyuk Nyuk (Score:2)
And, as a more insightful suggestion, troll posts marked as redundant in slashdot stories. There have been a few "attacks" on slashdot which could have been prevented by simply blocking 'repeat' posts.
Re:Nyuk Nyuk Nyuk (Score:4, Interesting)
Filtering out GNAA posts would be nice. Not that I've run into it lately, but there was a story a couple of months back that had nearly 1,000 GNAA posts. Impressive organization on the behalf of the trolls, but it did take a while to suss out. (I wonder how many mods burned up mod points that night...)
Bayesian isn't the right approach (Score:5, Informative)
Re:Bayesian isn't the right approach (Score:4, Insightful)
There are "clustering" techniques which attempt to identify similar bunches of data, without respect to any pre-determined bins, but they are not as useful for programmatically dealing with information. This is simply because you don't know what the clusters will contain, so you cannot make assumptions about what you will want to do with each cluster.
Classification systems are used when you WANT to fit things into one of a number of bins that you already have decided what to do with (e.g. SPAM - delete, From Mistress - show now, From Boss - file for later, From Debt collector - return "Deceased", etc.) Bayesian filtering is simply one form of classification.
For more information and ideas, check out KD Nuggets [kdnuggets.com]
Nice work on the newsbot, BTW.
Re:Bayesian isn't the right approach (Score:2)
Re:Bayesian isn't the right approach (Score:2)
Re:Bayesian isn't the right approach (Score:2)
The email client would have to somehow "record" every time you moved or copied something into a folder (or numerous folders), and then, when a message fits those criteria, it would have to replicate that action, move/copy, to the specified folder or folders. I don't think it's all that hard, but I don't think it's been done in major email clients.
Provided you find a bayesian filter which can use arbitrary destinations, Sylpheed Claws [sylpheed.org] can easily take care of the automatic filtering using its folder process
Re:Bayesian isn't the right approach (Score:2)
This is really not that hard. Check out POPfile, an open-source Perl program that's intended for spam filtering, but can be used and adapted for much more. It's as good as or better than Mozilla's bayesian engine - I would still be using it except that the Mozilla approach does offer some integration benefits. For othe
Re:Bayesian isn't the right approach (Score:2)
Re:Bayesian isn't the right approach (Score:2)
For newsfeeds you could set a subject (for example: "Presidential elections") and sort into "About presidential elections" and "Not about presidential elections". You just make an initial suggestion (a few articles maybe) and judge the first few artic
Bookmark Filing (Score:2)
Re:Bookmark Filing (Score:1)
I get the feeling they've been slashdotted before. Once bitten, twice shy...
Re:Bookmark Filing (Score:1)
1. Enter http://bugzilla.mozilla.org/ directly in your browser's navigation bar.
2. Enter bug # 235076 and click show.
3. View suggestion.
4. ???
5. profit !
Re:Bookmark Filing (Score:2)
2. Go back here, hit F12, and uncheck Enable Referrer Logging.
3. Click the link, and view the suggestion.
4. ???
5. Well, if you want to get rid of the Opera ad banner, it's not profit, but hey...
Autonomy's been doing this for years (Score:3, Interesting)
Bayesian Approaches to Phylogenetics (Score:5, Informative)
For those of you who don't know, phylogenetics is a set of techniques for working out a 'family tree' of taxa (taxa = basically units of analysis, normally species or genetic sequences). The main reason for doing this is that it gives an objective way of testing evolutionary hypotheses. For example - If I predict a certain protein has evolved through stages A, B then C, but my tree shows a pattern of A - C - B, I can reject that hypothesis.
Phylogenetics is extremely powerful and has allowed us to investigate many many cool things (like the origin of modern humans in Africa, and the migrations out of it). The problem is that there is a *huge* number of trees to search to find the optimal set of trees. The formula (IIRC) is (2N-3)!! for rooted trees, where N is the number of taxa and !! is the double factorial. So, 10 taxa (species or whatever) has 34 million trees, and when you get up to a real dataset it gets much worse: There are 10^132 ways of connecting my 77 taxa dataset.
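For anyone who wants to check the numbers: the figures quoted above (about 34 million trees for 10 taxa, about 10^132 for 77) agree with the standard count of rooted, bifurcating, labelled trees, (2N-3)!!. A minimal sketch follows; the function names are made up for illustration and are not taken from any phylogenetics package.

```python
def double_factorial(n: int) -> int:
    """Return n * (n-2) * (n-4) * ... down to 1 or 2; 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def rooted_tree_count(num_taxa: int) -> int:
    """Number of rooted, bifurcating, labelled trees on num_taxa taxa: (2N-3)!!.

    Assumption: rooted trees are meant, since that matches the quoted figures.
    """
    if num_taxa < 2:
        return 1
    return double_factorial(2 * num_taxa - 3)


if __name__ == "__main__":
    print(rooted_tree_count(10))                 # 34459425
    print(len(str(rooted_tree_count(77))) - 1)   # order of magnitude (power of ten)
```

Python integers are arbitrary precision, so the 77-taxa count comes out exactly rather than as a floating-point estimate.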
Bayesian approaches can really really speed up this process. We used to have to do a large number (100-1000) of heuristic analyses and then bootstrap (a resampling procedure) these to get a confidence interval, of say, a date of a divergence time or a model fit. These Bayesian techniques allow us to do, say, 10 long runs whilst simultaneously estimating parameters.
Sooo much faster (ie - that 77 taxa dataset mentioned before - instead of ~250 hours x 1,000, I can do the same in about ~100 hours x 10).
There are some problems - it possibly over-estimates support (ie underestimates uncertainty in the data) for taxa groupings, compared to the bootstrap method. This isn't terribly surprising given the hill-climbing approach these algorithms use, but no-one's really sure whether this is a good or bad thing (since no-one's really sure how to interpret the alternative bootstrap support).
Fantastic software: Mr Bayes: Bayesian Inference of Phylogeny [ebc.uu.se]
and BAMBE: Bayesian Analysis in Molecular Biology and Evolution [duq.edu]
Re:Bayesian Approaches to Phylogenetics (Score:2)
But basically, the Bayesian approach is a probability approach, not a statistics approach (i.e. what is reality like based on my data and on previous data).
Re:Bayesian Approaches to Phylogenetics (Score:4, Informative)
Also, hidden Markov models (which are used for phylogenetic analysis and involve Bayesian statistics) have been used longest in speech recognition.
i was just thinking about this (Score:1)
NNTP/Usenet (Score:2)
Re:NNTP/Usenet (Score:3, Informative)
For those who still bravely (foolishly) venture onto usenet, it would be nice to replace kill files with something Bayesian. There may be such a reader already but I haven't seen it (never mind something cross-platform, which is a must for me).
There is one newsreader I know of which uses Bayesian filtering for articles in its latest version, but it's Mac only: MT-NewsWatcher [smfr.org].
JP
MT-Newswatcher (Score:3, Informative)
pr0n! (Score:1, Funny)
moderation (Score:2)
Re:moderation (Score:2)
Why...yes. (Score:5, Informative)
It sounds like you want to extend the naive bayes classifier to more than two categories and, in the best case, learn new categories from the data. Both can be done and have been done with varying degrees of success. You might try here [psu.edu] for some pointers to more information about how it is done (the algorithm itself has been around since the '60s---people only think it's something new). Unfortunately for things like RSS and email you're going to run into two problems: you really want to do your classification on-line, and your data are actually quite sparse and your prior is usually uninformative, so it's going to be hard to do the actual classification. But, who knows, it's still an active topic of research.
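To make the "more than two categories" point concrete, here is a rough sketch of a multinomial naive Bayes classifier that accepts any number of category labels and can be trained one document at a time (so it can run on-line). Everything in it is an assumption for illustration: the class name, the tokenizer, the add-one smoothing, and the toy training strings. It is not the code behind the link above or behind any of the tools mentioned in this thread.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over words, with additive (Laplace) smoothing."""

    def __init__(self, smoothing: float = 1.0):
        self.smoothing = smoothing
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training documents seen
        self.vocabulary = set()

    @staticmethod
    def tokenize(text: str) -> list:
        # Hypothetical tokenizer: lowercase runs of letters, digits, apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text: str, category: str) -> None:
        """Incremental update; a new category is created the first time it is seen."""
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def scores(self, text: str) -> dict:
        """Unnormalised log posterior for every known category."""
        if not self.doc_counts:
            raise ValueError("train on at least one document first")
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + self.smoothing * vocab_size
            log_prob = math.log(n_docs / total_docs)  # prior from document counts
            for word in words:
                log_prob += math.log((counts[word] + self.smoothing) / denom)
            result[category] = log_prob
        return result

    def classify(self, text: str) -> str:
        s = self.scores(text)
        return max(s, key=s.get)


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("election candidate votes poll senate", "politics")
    nb.train("kernel patch linux release driver", "tech")
    nb.train("goal match striker league cup", "sport")
    print(nb.classify("new linux kernel release"))
```

The sparsity problem shows up directly in this sketch: a title-only feed item contributes only a handful of words, so the smoothing term and the prior end up dominating the score.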
Classifier4J, NNTP//RSS &Bayesian Blog Classif (Score:2, Informative)
"I now have Classifier4J and nntp//rss working together to do Bayesian classification of RSS feeds. There are a few things still to work out (perfomance and usability to name two), but I'm pretty pleased with it, since it was something I whipped up in a couple of hours. AFAIK it is the first Bayesian/RSS thing that has got far enough to have a screenshot..."
Yes, this has been done for RSS feeds (Score:2)
Also, I'm not certain, but I strongly suspect that Google is using some sort of Bayesian filtering as at least part of their criteria for Google News [google.com].
Re:Yes, this has been done for RSS feeds (Score:2)
Hey, that's me!
Yeah, I tried it. It tends to suck, actually. RSS feeds don't have quite enough information to usefully classify every article that comes up. Especially when a lot of your RSS feeds contain nothing but the title of an article.
But you can see it kinda in action on my own aggregator [stompstompstomp.com]. The software works well, but the bayesian classification is not too useful. I guess part of the problem is also that the majority of my RSS feeds I actually want to read.
Similarly... identifying webpage blocking (Score:3, Interesting)
Now, obviously for webpages it's a bit easier to say 'good' 'bad', but this app (www.bandwidtharbitrator.com) already has some regular expressions for apps like Kazaa, Bittorrent, in the hopes of limiting the bandwidth. I wonder if a Bayesian system could be adapted to this domain? I considered it, but the person in charge of that part of the project is using a diff-like method (which I find silly).
Are there easy-to-plug-into APIs and libraries like that we could use to do all the 'hard work'? Is SpamBayes up to the task?
oh yeah (Score:4, Funny)
Family discussions?
the paperclip (Score:4, Informative)
http://www.wired.com/news/print/0,1294,43065,00.h
Re:the paperclip (Score:2)
Stock speculation (Score:2)
opera (Score:2)
Control algorithms (Score:5, Interesting)
Consider, for instance, the total amount of sunlight hitting your computer screen. Most people would like an automatic system to control their window blinds to keep that amount to an acceptable level, but the system cannot know a priori what that level will be for a given user. So we let the system set the blinds to a setting deemed acceptable for the average user and use the user's manual interventions to build up a list of bad settings, corresponding to the setting immediately before the intervention, and good settings, corresponding to the setting immediately after the intervention.
The system will then attempt to minimize the probability of the user rejecting its settings by applying Bayes' theorem.
I've done only preliminary exploration of this idea so far but the results are encouraging, and we plan to do a full-scale experiment this summer.
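A toy sketch of how the good/bad lists described above could feed Bayes' theorem. This is not the poster's system: the discrete blind positions, the class and method names, and the add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter


class BlindController:
    """Picks the blind setting with the lowest estimated probability of rejection."""

    def __init__(self, num_settings: int, default_setting: int, smoothing: float = 1.0):
        self.num_settings = num_settings        # assumption: positions 0..num_settings-1
        self.default_setting = default_setting  # setting acceptable to the "average user"
        self.smoothing = smoothing
        self.bad = Counter()    # settings in force immediately before an intervention
        self.good = Counter()   # settings chosen by the user in an intervention

    def record_intervention(self, before: int, after: int) -> None:
        self.bad[before] += 1
        self.good[after] += 1

    def _likelihood(self, counts: Counter, setting: int) -> float:
        # Smoothed estimate of P(setting | class)
        total = sum(counts.values())
        return (counts[setting] + self.smoothing) / (total + self.smoothing * self.num_settings)

    def p_reject(self, setting: int) -> float:
        """Bayes' theorem: P(bad | setting) from P(setting | bad) and P(setting | good)."""
        n_bad, n_good = sum(self.bad.values()), sum(self.good.values())
        if n_bad + n_good == 0:
            return 0.5  # no evidence yet
        prior_bad = n_bad / (n_bad + n_good)
        joint_bad = self._likelihood(self.bad, setting) * prior_bad
        joint_good = self._likelihood(self.good, setting) * (1.0 - prior_bad)
        return joint_bad / (joint_bad + joint_good)

    def choose_setting(self) -> int:
        if not self.bad and not self.good:
            return self.default_setting
        # Lowest rejection probability wins; ties go to the setting nearest the default.
        return min(
            range(self.num_settings),
            key=lambda s: (self.p_reject(s), abs(s - self.default_setting)),
        )


if __name__ == "__main__":
    controller = BlindController(num_settings=5, default_setting=2)
    controller.record_intervention(before=2, after=4)
    controller.record_intervention(before=2, after=3)
    controller.record_intervention(before=3, after=4)
    print(controller.choose_setting(), controller.p_reject(2))
```

A real controller would presumably also condition on time of day or measured light level; the sketch ignores that and treats the setting as the only variable.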
Short answer... (Score:2)
My students and I are building a filter for the web. We're really not ready to talk about it yet, but it is working well and we hope to get something "out there" soon (next year?).
Has anyone seen a content filter? (Score:2)
We typically set up squid and squidguard for them and grab blacklists from a regional database the schools put together.
The first thing you can't help but notice is that it sucks. Even with the various schools' additions it doesn't block much of what it should and blocks quite a bit it shouldn't. All of the same problems come into play with these hardcoded blacklists that come into pla
Re: Bayesian Filtering Outside of Email? (Score:1)
Look out for most content management systems - most of them happen to make use of some or other form of Bayesian algorithms to "cleanse" the content and/or extract attributes. After all, your "filter" is nothing but a set of rules built on a test/clean data, with which you compare your actual data.
For example, how useful would Bayesian filtering be to identify news stories/blog entries in the RSS feeds I monitor?
D
Popfile for mailing lists (Score:1)
System Logs (Score:2)
bug/suggestion tracking (Score:1)
Re:bug/suggestion tracking (Score:2)
Do you have a link or other info ??
Re:bug/suggestion tracking (Score:1)
Kind of ... (Score:3, Interesting)
News Site uses Bayesian... (Score:1)