
Subject: [FAQ] Gathering Traffic Data for Proposed Newsgroups

This article was archived around: 30 May 2006 04:19:53 GMT

All FAQs in Directory: usenet/creating-newsgroups
All FAQs posted in: alt.config, news.groups
Source: Usenet Version

Archive-name: usenet/creating-newsgroups/justification
Last-modified: 10 June 2001
Posting-Frequency: Monthly (on the 1st)
URL: http://www.alt-config.org/justification.htm
Maintainer: Rob Maxwell <rob@alt-config.org>
Disclaimer: Approval for *.answers is based on form, not content.

Gathering Traffic Data for Proposed Newsgroups
Or How to use Google Groups

The traditional expectation that a newsgroup justify its existence by virtue of existing Usenet traffic goes back to the earliest days. It precedes the birth of alt.*, the Great Renaming that brought forth the Big 7 (later the more familiar Big 8, with the creation of the humanities.* hierarchy in 1995), and even the rise and eventual fall of the backbone Cabal.

In the early 1980s, if discussion of a topic became significant enough, a new newsgroup was created to centralize the discussion. With only a relatively few corporate and university mainframes providing the Unix Users' Network (Usenet) to a similarly small number of readers, it was fairly easy to see when a topic was worthy of receiving its own newsgroup.

Today, over three gigabytes of text-only discussion occur on a daily basis, and abuse of the alt.* newsgroup creation process has led to a significant number of alt.* newsgroups not being carried on any given news server. As a result, it has become effectively impossible to see when a topic becomes popular enough to warrant a newsgroup of its own.

This is where Google Groups comes into the picture. Its archive began in 1995, when Deja decided to begin archiving Usenet text postings. By 2000 the task had become too overwhelming and expensive. Deja tried different things, but its efforts ultimately proved futile, and it sold its archive and name to the Internet search engine company Google. After a rough start, Google was finally able to bring together Deja's massive archive with its own recent efforts at archiving Usenet under the name of Google Groups <http://groups.google.com/>.

Getting started

The journey to Justification begins at Google Groups' Advanced Group Search <http://groups.google.com/advanced_group_search>. What you will be looking for is how often the topic is discussed in English on Usenet.

The customary method is to search for the keyword or phrase over the last ninety days. The recommended quantity of on-topic posts is ten (10) per day on average. For the sake of this demonstration we will be trying to justify a newsgroup for the ABC television show "20/20".

Start by typing 20/20 into "Find Messages with all of the words". Then change the dropdown box from "10 messages" to "100 messages", change Language (Return messages written in) from "any language" to "English", and under Message Dates (Return messages posted between) change the start date from 29 Mar 1995 to the date three months before today's date. A visual example is available at: <http://www.alt-config.org/20-20a.gif>

This search for "20/20" on 27 May 2001 produced these results:

    Relevant English Messages for 20/20 from 28 Feb 2001 to 27 May 2001
    Results 1 - 100 of about 12,400.

<http://www.alt-config.org/20-20b.gif>

That averages out to 137.78 posts per day (12,400 hits over 90 days), which clearly meets the 10 per day recommendation. Or does it?

Refining the search results

Taking a closer look at the 20/20 example shows that the first on-topic mention of the show is the 14th search result. <http://www.alt-config.org/20-20c.gif> This is an extreme example, badly contaminated by "%20", which is the way a space is represented in a URL, where literal spaces are not allowed. It often appears in the URL of a search result, as seen in the third search result for 20/20. Repeating the search for 20/20 and adding "abc", the network the show is on, produces radically different results:

    Relevant English Messages for 20/20 abc from 28 Feb 2001 to 27 May 2001
    Results 1-100 of about 374

<http://www.alt-config.org/20-20d.gif>

Three hundred seventy-four averages out to a mere 4.16 posts per day, less than half of the recommended ten.

This is why your initial search results must be checked carefully before attempting to use them.

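The averaging used above is simple enough to check by hand, but a minimal sketch may help when comparing several candidate searches. The function and constant names here are illustrative assumptions, not part of any Google Groups interface; the hit counts are the ones quoted in this FAQ. The last lines show how percent-encoding produces the "%20" that contaminates the plain 20/20 search.

```python
from urllib.parse import quote

RECOMMENDED_PER_DAY = 10  # recommended on-topic posts per day (from this FAQ)
WINDOW_DAYS = 90          # the customary ninety-day search window


def posts_per_day(total_hits: int, days: int = WINDOW_DAYS) -> float:
    """Average number of hits per day, rounded to two decimal places."""
    return round(total_hits / days, 2)


def meets_recommendation(total_hits: int, days: int = WINDOW_DAYS) -> bool:
    """True when the average reaches the recommended ten posts per day."""
    return total_hits / days >= RECOMMENDED_PER_DAY


if __name__ == "__main__":
    # Hit counts quoted in this FAQ for the searches run on 27 May 2001.
    for label, hits in (("20/20", 12_400), ("20/20 abc", 374)):
        print(label, posts_per_day(hits), meets_recommendation(hits))

    # A space in a URL is percent-encoded as "%20", so a search for
    # "20/20" can match unrelated messages that merely contain such URLs.
    print(quote("20/20 abc"))
```

Run as written, this reproduces the 137.78 and 4.16 averages given above. Only the second figure reflects posts that are plausibly on topic, and it falls short of the recommendation.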
First off, there is a known glitch in the software Google acquired from Deja, which usually produces a poor (sometimes comically poor) estimate of "about" how many results were found. A blatant example of this was a search for "infertility insurance":

    Relevant English Messages for "infertility insurance" from 18 Feb 2001 to 18 May 2001
    Results 1 - 4 of about 6.

<http://www.alt-config.org/20-20e.gif>

The quick way to see the actual total, or at least enough of it to see whether there is justification (which would be 900 on-topic messages over 90 days), is to scroll down to the bottom of the page (or press the [End] key) and double-click the 9 under Goooooooooogle, which will take you to the 901st message if there is one. [Note: This is why "100 messages" is selected instead of the default "10 messages".] The glitch is meaningless if the top line is:

    Relevant English Messages for "_______" from 28 Feb 2001 to 27 May 2001
    Results 901-1000 of about #,###.

Things to avoid

Most of the things that can falsely inflate results show up on the last pages. A weekly Frequently Asked Questions (FAQ) posting on the topic, or one containing a reference to it, will produce 12-14 identical results, only one of which is valid. Far worse than this is when the subject ends up in someone's signature: if they post a few messages per day, they can create a few hundred false hits in the 90-day period. A sig hit requires a search in the same time frame for the author, to determine the total number of hits the sig has caused, followed by a count of the actual posts that author made on the subject being searched.

... END ...
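The corrections described under "Things to avoid" amount to subtracting false hits from the raw total and comparing what remains with the 900-message threshold. The following is only a sketch of that bookkeeping. The function names and the sample numbers are hypothetical, chosen for illustration, and the counts themselves still have to be gathered by hand from the search results.

```python
REQUIRED_HITS = 10 * 90  # ten posts per day over ninety days = 900


def faq_false_hits(faq_copies: int) -> int:
    """Repeated FAQ postings count once; the other copies are false hits."""
    return max(faq_copies - 1, 0)


def sig_false_hits(hits_by_author: int, on_topic_posts: int) -> int:
    """Hits caused by an author's signature, minus their real on-topic posts."""
    return max(hits_by_author - on_topic_posts, 0)


def adjusted_total(raw_hits: int, faq_copies: int = 0, sig_authors=()) -> int:
    """Raw hit count with FAQ duplicates and signature hits removed.

    sig_authors is a sequence of (hits_by_author, on_topic_posts) pairs,
    one pair per author whose signature mentions the subject.
    """
    false_hits = faq_false_hits(faq_copies)
    for hits_by_author, on_topic_posts in sig_authors:
        false_hits += sig_false_hits(hits_by_author, on_topic_posts)
    return raw_hits - false_hits


def is_justified(total_hits: int) -> bool:
    """True when the adjusted total reaches the 900-message threshold."""
    return total_hits >= REQUIRED_HITS


if __name__ == "__main__":
    # Hypothetical figures: 1,000 raw hits, a weekly FAQ seen 13 times, and
    # one author whose sig produced 250 hits, 20 of them genuinely on topic.
    total = adjusted_total(1000, faq_copies=13, sig_authors=[(250, 20)])
    print(total, is_justified(total))
```

With these made-up numbers, a raw count that appears to clear the threshold drops below it once the duplicates and signature hits are removed. This is the situation the checks above are meant to catch.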