Apparent Robots Scraping Site


I've had bots on my site before, but I recently had visits from two that I don't know if I should be concerned about. The IP 203.82.67.205 traces to Celcom, an Internet service provider in Malaysia. The IP 61.247.22.121 traces to FastNet in India. Both bots scraped my site, and both tried to get the following file:

 

 

GET /home/******/public_html/jendar.tchmachines.com/%7E******/graphics/web_logo.gif HTTP/1.1

 

(****** replaces my user name for my site)

 

The MY bot tried to get this file 5 times within 16 seconds. The server returned a 302 response.

The IN bot tried to get this file 32 times within 11 seconds and, if I'm reading my logs right, made 2 or 3 attempts in a single second?!? The server returned a 302 response.

 

My questions:

1. How does someone obtain my user name for the backend of my site?

2. What file is the bot actually trying to get? I don't have a path like that for the graphic.

3. Why is it hitting on the server itself?

4. Is this anything to be concerned about and, if so, what do I need to do to prevent it from happening again from other bots?

5. Any idea why the bots were so interested in this gif? It's my business logo.

 

I have already banned the IP addresses.

 

marlene

 


GET /home/******/public_html/jendar.tchmachines.com/%7E******/graphics/web_logo.gif HTTP/1.1

(****** replaces my user name for my site)

If that request is copied from your HTTP access log, it is ridiculously malformed. Request paths in the access log are relative to public_html, and public_html itself is never shown, so a normal request starts from "/".

 

So in other words, while /home/******/public_html/ isn't inconceivable as a legitimate path internally on the server, it wasn't the correct one for requesting a file through Apache by HTTP, and following it with the server name is absurd because by that point they've already reached the server, and following it with a 2nd instance of your userID is just as absurd. It's like trying to go to a file on your PC: "C:\Documents and Settings\Owner\C:\My Documents\..." It's a meaningless path that couldn't possibly result in success.
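To make that concrete, here is a rough sketch of how you could flag this kind of malformed request when going through a log. The marker list, the hostname pattern and the function name are my own assumptions for illustration, not anything Apache provides, and the hostname check is only a heuristic (a folder with a dot in its name could trigger it).

```python
import re
from urllib.parse import unquote

# Assumed markers: pieces of a server-side filesystem path that should
# never appear in a URL requested over HTTP.
FILESYSTEM_MARKERS = ("/home/", "/public_html/")

# Heuristic: a path segment that looks like a hostname, e.g. "a.b.com".
HOSTNAME_RE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,}", re.IGNORECASE)


def is_suspicious_request_path(path: str) -> bool:
    """Return True if a requested path looks like a pasted-together junk URL."""
    decoded = unquote(path)  # turns %7E back into "~"
    if any(marker in decoded for marker in FILESYSTEM_MARKERS):
        return True
    segments = [s for s in decoded.split("/") if s]
    # Ignore the last segment, which is normally a file name like "logo.gif".
    return any(HOSTNAME_RE.fullmatch(s) for s in segments[:-1])


if __name__ == "__main__":
    bad = "/home/******/public_html/jendar.tchmachines.com/%7E******/graphics/web_logo.gif"
    print(is_suspicious_request_path(bad))
    print(is_suspicious_request_path("/graphics/web_logo.gif"))
```

A normal request such as /graphics/web_logo.gif passes, while the request quoted above is flagged on both counts.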

 

When I see things like this (especially when the IP traces to an ISP), I sometimes guess that it's a user's PC that has got virus-infected and is now being hijacked to send out hack attacks or DDoS attacks on websites.

 

1. I think you correctly identified the one issue of concern, how they obtained your userID. I went through the source code of some of your site pages to see if it might be exposed anywhere by accident, but didn't find anything. More about this later.

 

2. I did find some indication that you have a /graphics/ folder. Maybe the bot was just using it to paste together junk URLs. Also, some bots are dumb and can't construct a path properly even if they technically have the info that would allow creating it properly. Even some browsers seem to do the same, based on things I see in my logs: requesting the wrong files from the wrong folders even when the hyperlinks in the source code are unquestionably correct and followed correctly by all other browsers.

 

3. The other puzzle is that although it's easy to identify the server you're on, it's unclear why anyone, automated or manual, would bother to do so just to put together requests like this. No professional hacker or hacking organization would bother with this sort of thing. Thus, it could be two people trying to find a way to hack your site. However, if this is the case, they are utterly incompetent and not worth the trouble of worrying about. There are plenty of script kiddies and hacker wannabes in the part of the world those requests came from.

 

4. If it were me, I would not worry about this incident, except as discussed below. Do watch your logs, as you have been doing, for any signs giving a clearer indication what these people are up to, if anything.

 

5. A web search didn't turn up any reason for that particular gif name to be of any interest (i.e. it isn't used by any common software, so it couldn't be used to determine if you're running WordPress, or something like that). Probably just a random choice of file name. If you have a file by that name, but not in that folder, that's probably where they got the name.
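For watching the logs (point 4), a small script can pick out IPs that send several requests within the same second, like the 2 or 3 per second you noticed. This is only a sketch: it assumes the common Apache log layout (IP first, timestamp in square brackets), and the sample lines, timestamps and threshold are made up for illustration.

```python
import re
from collections import Counter

# Assumes Apache common/combined log format: IP, two fields, [timestamp], "request"
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "')

# Made-up sample lines; the timestamps are NOT from the real log.
SAMPLE_LINES = [
    '61.247.22.121 - - [01/Jan/2010:00:00:01 +0000] "GET /graphics/web_logo.gif HTTP/1.1" 302 0',
    '61.247.22.121 - - [01/Jan/2010:00:00:01 +0000] "GET /graphics/web_logo.gif HTTP/1.1" 302 0',
    '61.247.22.121 - - [01/Jan/2010:00:00:01 +0000] "GET /graphics/web_logo.gif HTTP/1.1" 302 0',
    '203.82.67.205 - - [01/Jan/2010:00:00:05 +0000] "GET /graphics/web_logo.gif HTTP/1.1" 302 0',
]


def requests_per_second(lines):
    """Count requests for each (ip, timestamp) pair; timestamps have 1-second resolution."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match:  # skip lines that don't look like access-log entries
            counts[(match.group(1), match.group(2))] += 1
    return counts


def find_bursts(lines, threshold=2):
    """Return the (ip, timestamp) pairs with at least `threshold` requests."""
    return {key: n for key, n in requests_per_second(lines).items() if n >= threshold}


if __name__ == "__main__":
    for (ip, when), n in find_bursts(SAMPLE_LINES).items():
        print(f"{ip} made {n} requests at {when}")
```

To run it against a real log, read the file line by line and pass the lines to find_bursts; raise the threshold if ordinary visitors loading a page with many images show up in the results.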

 

One thing that could conceivably obtain your userID and a partially correct path to use for trying to access your server would be spyware on your PC. There is a particularly bad exploit going around called gumblar or martuz. It infects PCs, then steals FTP logins, and uses them to hack the remote websites. If you actually had this virus, it would most likely have logged into your FTP account rather than trying this other stuff, but to be on the safe side I suggest doing a thorough antivirus/antispyware scan with a good AV program (not AVG free).

 

I would ordinarily think attacks like this are automated, but going to the trouble of finding the name of your server and using it to create absurdly wrong URLs seems to point to humans who don't know what they're doing trying to play hacker. Obviously, the available info is limited, so I could easily be wrong.

 

The 302 response is of some concern to me for other reasons. The file doesn't exist at that location, so the response should be 404. It's possible you are using a 302 to redirect 404's to your home page or something like that, but that's not a good practice. Search engines don't like it because it makes it impossible for them to tell which pages exist on your site and which don't. If they request 500 pages and get your home page for each of them, they can give you a duplicate content penalty for having 500 duplicates of the same page, and think you're trying to deceive them. A 404 should just be a 404. It can have a link to the home page, but it shouldn't redirect there automatically.
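If you want to check what your server does for a page that doesn't exist, something along these lines would do it. It's a sketch under my own assumptions: the function names and the category labels are hypothetical, and the host and path in the usage note are placeholders. It uses http.client because that module does not follow redirects on its own, so you see the first status code the server sends.

```python
import http.client


def classify_missing_page_status(status: int) -> str:
    """Classify the status code returned for a URL that should not exist."""
    if status in (404, 410):
        return "ok"          # the server says the page is missing
    if status in (301, 302, 303, 307, 308):
        return "redirect"    # missing pages are being redirected somewhere
    if status == 200:
        return "soft-404"    # a normal page is served for a missing URL
    return "other"


def fetch_status(host: str, path: str) -> int:
    """Request `path` from `host` over plain HTTP and return the raw status code."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

For example, classify_missing_page_status(fetch_status("www.example.com", "/no-such-page-12345")) should come back as "ok" on a site that handles missing pages the way search engines expect.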

Edited by SteveW

Wow! Excellent information, Steve. THANK YOU! ;)

 

The file requested was copied from my access log, so that was what was actually requested.

 

I haven't created a 302 response myself, so whatever the default is was served. Actually, I haven't created any error pages for my site. I tried once on a different site and I couldn't get it to work right, so I haven't tried since.

 

I have done a scan, but as you mentioned, it was with AVG. ;) I will get a better virus scanner and rescan.

 

Thanks again for taking the time to respond.

 

marlene


The reason I give AVG free a thumbs down is that on several active forums where I follow security-related posts, I've seen reports where people have later found viruses on their PCs even though they were using an AV program, and in some cases their sites were hacked because a virus planted spyware on their computer. So many of these people said they had been using AVG free that it became a predictable pattern. No other AV program has been mentioned often enough in this context that such a pattern emerged. Actually, at the moment I can't recall any other AV program being mentioned in that context at all, so I've come to consider AVG as the worst of the lot, especially against password-stealing Trojans. I can't claim that opinion is completely up to date; even if they've improved, it will take a lot longer to notice that the pattern isn't there anymore than it was to form the opinion in the first place.

 

-----

 

After all that, I think I just figured out why you are getting these requests. Will send you a PM with details that I don't want to post publicly.

Edited by SteveW

Well, most AV programs don't catch every kind of malware or spyware. That's why you can't depend on AV software alone. Scan with MalwareBytes, Spybot and AdAware as well as a good AV program (I use Avast).

