a__kc Posted March 25, 2006 (edited)

This is awful. I've noticed significant jumps in bandwidth consumption in the last month or two. This is what Awstats shows now:

Traffic viewed = 2.37 GB
Traffic not viewed = 14.29 GB

As Awstats tells us, "not viewed" includes "traffic generated by robots, worms, or replies with special HTTP status codes." Sadly, as of this writing, I've exceeded my allotted bandwidth.

I run a "modest" personal site, both in terms of scale and popularity. Some possible reasons:

* It looks like most of the traffic can be blamed on busy/nosy search engine bots (some consuming several gigabytes per visit -- that ain't right).
* I host a few blogs. These get spammed a lot (the vast majority of ads are caught by filters, though, not that this is relevant).
* I run a web feed aggregator with some 100 blog RSS feeds, etc. Until recently the feed cache had not been configured correctly, so maybe this could account for some usage (not sure how much).
* I used to host a calendar that ran into infinity, but quite a while ago I marked it "disallowed" in my robots file, so that should not be an issue (rogue bots notwithstanding).

Is your bandwidth anything near this ridiculous proportion (mostly non-human traffic)? What do you do about it? Any suggestions? Thanks.

Edited March 25, 2006 by a__kc
TCH-Andy Posted March 25, 2006

I'd have a look at the raw log file. If you open a ticket with the details, I'll take a look if you like.
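For anyone who wants to look at the raw log before opening a ticket, here is a minimal sketch that totals response bytes per user agent. It assumes the server writes the Apache "combined" log format; the sample lines, the `ExampleBot/1.0` agent, and the IP addresses are made up for illustration and are not taken from the thread.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format. Adjust the pattern if the
# server logs in a different layout.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def bytes_by_agent(lines):
    """Sum response bytes per user-agent string."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        size = match.group("size")
        # A size of "-" means no body was sent (for example a 304 reply).
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals


# Hypothetical sample lines, only here so the sketch runs on its own.
SAMPLE = [
    '203.0.113.5 - - [25/Mar/2006:10:00:00 -0500] "GET /a HTTP/1.1" 200 1500 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [25/Mar/2006:10:00:01 -0500] "GET /b HTTP/1.1" 200 2500 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [25/Mar/2006:10:00:02 -0500] "GET / HTTP/1.1" 304 - "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    # Largest consumers first.
    for agent, total in bytes_by_agent(SAMPLE).most_common():
        print(f"{total:>10} bytes  {agent}")
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the function only needs an iterable of lines.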
a__kc Posted March 25, 2006

Andy, thanks a lot for the offer. I'll do that. (Yeah, TCH rulez!) Fact is, I've pored over the Awstats summaries quite a few times, and I still have a lot to learn about reading logs.

Btw, thanks to Bruce for moving the post here, where it belongs.
a__kc Posted April 1, 2006

OK, I thought I'd give a brief update on the situation, especially regarding search engine bots. The top hits I received for March were:

Unknown robot (identified by 'spider') 6.66 GB (mostly Baidu)
Googlebot 124063+182 6.02 GB (Google)
Inktomi Slurp 114994+6645 1.73 GB (Yahoo)
MSNBot 4433+679 155.22 MB (MSN)

Clearly the first two entries sucked up more than half of my allotted bandwidth for the month. The first entry, I discovered, can be attributed mostly to the Chinese Baidu SE.

Given the nature of my contents, I was reluctant to ban both Baidu and Google outright in spite of their ridiculous appetite. So I disallowed a couple more sections of my site, notably an aggregator with lots of interconnected and dynamically generated links.

The funny thing is, all major bots obeyed the new rules within hours. Except Baidu. Okay, so maybe Baidu needed more time to re-analyze my robots file. So I gave it a few more days. And it kept coming and coming. So I disallowed it from visiting root. And it still kept on coming (like the damn Energizer -- or is it Duracell? -- rabbit).

By this time I'd had enough. I banned Baidu's IP range. You'd think it'd have given up. Nope, it still visited! The good news is it would no longer grab 6 GB per month. And I think this should keep the bandwidth in check for the immediate future (we'll see).

The take-home lesson: ban Baidu! I don't care if it's China's largest and best SE. It's evil.
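The two steps described above (robots rules first, then an IP ban) can be sketched in Python. This is only an illustration under stated assumptions: the `/aggregator/` and `/calendar/` paths are hypothetical stand-ins for the disallowed sections, and `203.0.113.0/24` is a reserved documentation range used as a placeholder, not Baidu's real address range. `Baiduspider` is the user-agent token Baidu's crawler identifies itself with. Keep in mind that robots.txt is advisory, which is why a crawler can keep coming after being disallowed; an actual ban has to be enforced by the web server or firewall.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut one crawler out entirely and keep
# every other crawler away from the expensive dynamic sections.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /aggregator/
Disallow: /calendar/
"""


def build_parser(text):
    """Load robots.txt rules from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Placeholder network (TEST-NET-3), NOT a real crawler range.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_blocked(ip):
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent, path in [
        ("Baiduspider", "/index.html"),
        ("Googlebot", "/index.html"),
        ("Googlebot", "/aggregator/feeds"),
    ]:
        verdict = "allowed" if rules.can_fetch(agent, path) else "disallowed"
        print(f"{agent} -> {path}: {verdict}")
    print("203.0.113.42 blocked:", is_blocked("203.0.113.42"))
```

Checking the rules this way before uploading them helps catch a typo that would otherwise lock out a well-behaved crawler or leave a section open.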
Kevan Posted April 1, 2006

Quoting a__kc:

The top hits I received for March were:

Unknown robot (identified by 'spider') 6.66 GB (mostly Baidu)
Googlebot 124063+182 6.02 GB (Google)
Inktomi Slurp 114994+6645 1.73 GB (Yahoo)
MSNBot 4433+679 155.22 MB (MSN)

Clearly the first two entries sucked up more than half of my allotted bandwidth for the month. The first entry, I discovered, can be attributed mostly to the Chinese Baidu SE. Given the nature of my contents, I was reluctant to ban both Baidu and Google outright in spite of their ridiculous appetite.

Kevan's reply:

You can also add a Google Sitemap to your website to cut down on Google crawler traffic. Basically YOU tell Google what pages or sections of your site are new, and Google (mostly) pulls only that. It still does a complete crawl about once a month, though.

It took about a week before I started to see a difference on my site, but the Google traffic dropped to half by the next month (below both MSN and Slurp).

Plus I now have a much better idea of how users get to my site through Google AND I can see how the Google crawler views my site. Of course your results may vary.

Check it out: http:://www.google.com/webmasters or http://www.google.com/webmasters/sitemaps
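As a rough idea of what such a sitemap file contains, here is a minimal Python sketch that builds one from a list of pages and their last-modified dates. The example.com URLs and dates are invented, and the namespace used is the sitemaps.org 0.9 schema, which is an assumption on my part; the thread does not say which schema version Google expected, so check the webmaster documentation before submitting anything.

```python
import xml.etree.ElementTree as ET

# Assumption: sitemaps.org 0.9 schema namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is a date string such as "2006-04-01".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, only for illustration.
PAGES = [
    ("http://www.example.com/", "2006-04-01"),
    ("http://www.example.com/blog/", "2006-03-28"),
]

if __name__ == "__main__":
    print('<?xml version="1.0" encoding="UTF-8"?>')
    print(build_sitemap(PAGES))
```

The `lastmod` value is what lets a crawler skip pages that have not changed, which is presumably where the bandwidth saving described above comes from.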
TCH-Tim Posted April 1, 2006

Baidu eats a lot of my bandwidth too. All in the name of bringing freedom of information to China, I suppose.
jayson Posted April 1, 2006

Quoting Kevan:

You can also add a Google Sitemap to your website to cut down on Google crawler traffic. Basically YOU tell Google what pages or sections of your site are new, and Google (mostly) pulls only that. It still does a complete crawl about once a month, though. It took about a week before I started to see a difference on my site, but the Google traffic dropped to half by the next month (below both MSN and Slurp). Plus I now have a much better idea of how users get to my site through Google AND I can see how the Google crawler views my site. Of course your results may vary. Check it out: http:://www.google.com/webmasters or http://www.google.com/webmasters/sitemaps

jayson's reply:

Link does not work; you need to add the .com after google.
jayson Posted April 1, 2006

I must be getting off easy:

Traffic viewed *
Unique visitors: 196
Number of visits: 386 (1.96 visits/visitor)
Pages: 286 (7.41 pages/visit)
Hits: 11636 (30.14 hits/visit)
Bandwidth: 128.95 MB (342.08 KB/visit)

Traffic not viewed *
Pages: 2343
Hits: 8322
Bandwidth: 26.28 MB
a__kc Posted April 2, 2006 (edited)

Kevan, thanks for the tip. I'll take a look at Google Sitemaps -- it looks promising, and I must say I'm happy some search engines are trying to refine the old-fashioned "brute force" approach.

----

timhodge, I don't want to get too political here, but I'm pretty sure Baidu censors some of the info it finds (e.g. Tibet, Falungong, Taiwan independence stuff -- and p0rn), both by hiding it from view AND by ranking pro-government stuff higher. The law demands it.

PS: Okay, please don't ask me what constitutes "pro-government p0rn".

Edited April 2, 2006 by a__kc
Kevan Posted April 3, 2006

"I have fixed this for you Kevan."

Thanks Thomas. Looks like I added an extra colon on the first one too... It should have been http://www.google.com/webmasters/ or http://www.google.com/webmasters/sitemaps/

I'll use Copy/Paste next time.