WebsiteBaker Community Forum

General Community => Global WebsiteBaker 2.8.x discussion => Topic started by: Argos on July 11, 2008, 06:27:02 PM

Title: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: Argos on July 11, 2008, 06:27:02 PM: I searched Google for "website baker media" and found a lot of results titled "WebsiteBaker Administration - Media-" with URLs that point to "www.domainname.com/wb/admin/media/basic_header.html".

I feel a bit uncomfortable about it somehow. Can it be dangerous?
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: kweitzel on July 11, 2008, 06:36:01 PM: You can always use a robots.txt to protect the folders.

cheers

Klaus
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: Ruud on July 11, 2008, 11:48:48 PM: There are lots of WB sites out there.
Every site has the /admin/media/basic_header.html file (and all the other non-.php files in the admin area).
Anybody can download it and look at the source code to find out what it can do.

It should not be dangerous.
Typically an HTML page will not do much more than display data.
PHP pages in the admin area (or the modules area) should all have a bit of code to prevent them from running without the WB framework (and its security).

That said, I can imagine that using HTML templates in the admin area together with .htaccess mods that allow HTML files to run PHP code could be dangerous.

Ruud
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: Argos on July 12, 2008, 01:27:19 AM: Although I understand your explanation, I do feel it's weird to have loose admin header files linked to in Google. I'll do some robots.txt stuff, but maybe it's an idea to have an updated WB version just prevent this possibility.
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: Ruud on July 12, 2008, 06:11:39 PM: I agree on the last part.
It should not be that difficult to prevent that. (using robots.txt or .php templates instead of .html)

BTW: I wonder how Google found those pages.
There are no links pointing to those files, and Google doesn't look for pages without links pointing to them.
I think the only way that could happen is when a WB tree is installed on a server without PHP enabled and with directory browsing enabled.

While writing this message, I found one that had a google_sitemap.xml document (generated with some external generator) that included ALL admin .html, .js, .gif, .png etc. files.
This is the opposite of robots.txt. It's asking Google to "please index my hidden stuff".
(Do a search on Google like this: "/media/basic_header.html google_sitemap.xml". You will see what I mean.)

Conclusion: don't worry. Your pages will not be indexed by Google unless you ask it to.

Ruud
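The sitemap problem Ruud describes can be checked for on your own site. Below is a minimal Python sketch, assuming the sitemap is an ordinary XML file with `<loc>` elements. The `private_urls` helper, the `PRIVATE_SEGMENTS` set and the example.com URLs are made up for illustration and are not part of WB.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Assumption: these folder names should never show up in a public sitemap.
PRIVATE_SEGMENTS = {"admin", "modules"}


def private_urls(sitemap_xml):
    """Return every <loc> URL whose path contains a private folder name."""
    root = ET.fromstring(sitemap_xml)
    hits = []
    for element in root.iter():
        # Ignore the XML namespace so older and newer sitemap schemas both work.
        if element.tag.rsplit("}", 1)[-1] != "loc":
            continue
        url = (element.text or "").strip()
        segments = set(urlparse(url).path.split("/"))
        if PRIVATE_SEGMENTS & segments:
            hits.append(url)
    return hits


# Hypothetical sitemap content, used only to exercise the helper.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/pages/about.php</loc></url>
  <url><loc>http://www.example.com/wb/admin/media/basic_header.html</loc></url>
</urlset>"""

print(private_urls(SAMPLE))
```

If the list is not empty, the sitemap generator is advertising files from the admin or modules area and should be reconfigured to skip those folders.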
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: albatros on July 12, 2008, 09:24:44 PM: Hi,

Correct me if I am wrong, but I am sure that robots.txt as a solution is very insecure. Search engines can respect robots.txt, but they don't have to. Reading robots.txt files might even be a help and inspiration for the bad guys. :-D

And what if a bad guy knows the structure of WB (or any other CMS)? He only has to find a website built with this CMS, and he is able to see the admin files. Whether the files are in the Google index or not doesn't matter.

The very simple and only safe solution is password protection of admin via .htaccess. Then nobody can see any file in any subfolder, and the admin functions are locked safely.

So why don't you use this? Am I completely wrong, or thinking much too simply?

Regards

Uwe
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: Ruud on July 12, 2008, 09:43:07 PM: The serious search engines will respect robots.txt. But they are not your worry.
The bad guys will use your robots.txt for finding places where you don't want the search engines to go.
Scanning for /admin will (on most WB sites) tell them directly that they are dealing with a WB site.
You can be sure the admin structure is already known to anyone serious about hacking.

Protecting the admin area with .htaccess is something you could do, but you will need to log on twice every time you want to do something over there. The admin area is already protected at the PHP level.

There is nothing wrong with the HTML templates or images being accessible. They will not do anything anyway.
Personally I am a pretty paranoid guy, but as long as no vulnerabilities are popping up on sites like http://secunia.com/advisories/23828/ I don't worry too much.

As I explained in my previous post, pages in the admin area are not indexed in Google at all, unless you make the effort of asking Google to do so.

Ruud
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: chio on November 09, 2008, 09:29:24 AM: In some cases the Apache "index of" pages might cause the problem.

I always disallow "modules" and "admin" in robots.txt. Of course I know a "bad guy" can also read it, but it's more dangerous when a bad guy can simply use a Google search to find security holes.
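A rule set like the one chio describes can be sanity-checked with Python's standard-library `urllib.robotparser`. This is only a sketch: the `/admin/` and `/modules/` paths assume WB is installed in the web root (with a `/wb/` subfolder as in the first post, the paths would need that prefix), and the test URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: WB in the web root, admin and modules hidden from crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /modules/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, used only to exercise the rules.
for url in ["/admin/media/basic_header.html", "/modules/", "/pages/about.php"]:
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(url, verdict)
```

As discussed in the thread, this only tells well-behaved crawlers to stay away. It does not stop anyone from requesting the files directly.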
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: kweitzel on November 09, 2008, 11:04:11 AM: Guys,

I would actually disallow everything to everybody and, in the next step, allow the pages directory and, if wanted, selected other directories. This way you do not give away your whole folder structure.

Code:
User-agent: *
Disallow: /
Then you open the pages directory, since you want to have the pages crawled.

Code:
Allow: /pages/
Secure it like host systems used to be ... close everything and then open the required folders for indexing.

cheers

Klaus
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: chio on November 09, 2008, 11:23:41 AM: Quote
User-agent: *
Disallow: /
Allow: /pages/

Uh-oh, and what about the homepage?
If you do this, your whole site isn't crawled if you have no deep links.
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: kweitzel on November 09, 2008, 11:35:33 AM: That is what we want to achieve ... the crawling allow list is in part 2 ...

Code:
Allow: /pages/
cheers

Klaus
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: chio on November 09, 2008, 02:44:34 PM: www.domain.de/ (index.php) is NOT in /pages/
it is in /
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: kweitzel on November 09, 2008, 04:49:38 PM: Yes, you're right ... but then why don't you simply write the addition:
Code:
Allow: /index.php
That might have been helpful ...

Regards

Klaus
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: chio on November 09, 2008, 04:58:44 PM: Because the homepage is not called index.php (hence the parentheses)
but: /
If you block the root directory via robots.txt, the crawler never even gets onto the domain, no matter what is opened up after that. Where is it supposed to start?
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: kweitzel on November 09, 2008, 05:18:12 PM: Well, Chio, I'm not quite sure where your knowledge comes from, or how old it is ...

The rule block I set up does the following:

Code:
User-agent: * # applies to ALL user agents
Disallow: / # at first nothing at all may be crawled, except:
Allow: /index.php # the index.php page
Allow: /pages/ # the pages folder
Maybe you could also read up on it here: http://janeandrobot.com/post/Managing-Robots-Access-To-Your-Website.aspx

The good crawlers have one trait ... they read these rules and stick to them, every one of them.

Regards

Klaus
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: chio on November 09, 2008, 05:56:06 PM: OK,
What is the URL of the homepage?
1) www.domain.de
or
2) www.domain.de/index.php
?

The correct answer is: 1)
A normal WB installation never links to www.domain.de/index.php, but always to www.domain.de

www.domain.de == /
That is also where the bot starts (inevitably; without deep links it knows nothing else).
But now it is not allowed to, because / is blocked.
So it never even takes an interest in /index.php or /pages/. How would it know that those pages exist?

Sure: you can say that if the homepage is blocked via robots.txt, the spider may crawl it anyway, and then it would find the links, and then it would find the pages too.

But I wonder whether Google says the same.

And the source you cited as well?:
http://janeandrobot.com/robots.txt
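The disagreement above can be checked mechanically. The sketch below assumes the matching rule that Google documents and that RFC 9309 describes: the longest matching pattern wins, Allow wins a tie, and a trailing `$` anchors the end of the path. It is hand-written rather than built on Python's `urllib.robotparser`, because that parser applies rules in file order and would treat `Disallow: /` as blocking everything listed after it. The helper names, the test paths and the extra `Allow: /$` line are assumptions for illustration and do not appear in the thread.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches the URL path."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


def is_allowed(rules, path):
    """Longest matching pattern wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not pattern or not rule_matches(pattern, path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


# The rule block from the thread.
THREAD_RULES = [("Disallow", "/"), ("Allow", "/index.php"), ("Allow", "/pages/")]
# Hypothetical variant: additionally allow exactly the bare root URL.
ANCHORED_RULES = THREAD_RULES + [("Allow", "/$")]

for path in ["/", "/index.php", "/pages/about.php", "/admin/media/basic_header.html"]:
    print(path, is_allowed(THREAD_RULES, path), is_allowed(ANCHORED_RULES, path))
```

Under these assumptions the rule block from the thread leaves `/index.php` and everything under `/pages/` crawlable but blocks the bare `/`, which is chio's point. Adding `Allow: /$` opens the root URL while the admin files stay blocked.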
Title: Re: Google results linking to www.domainname.com/wb/admin/media/basic_header.html
Post by: diodak on December 04, 2008, 11:40:17 PM: The easiest way is to add .htaccess password protection to /admin.