How to make a web crawler access pages that need authentication? [closed]

Closed. This question needs to be more objective and is not currently accepting answers.

Want to improve this question? Update the question so it focuses on just one problem by editing it.

Closed 6 years ago.


I need to develop a web crawler that would access a page (which requires a login, and I have the credentials), find all the links on that page, and list them somewhere, whether in a memo or even a .txt file. It would be a process similar to Firefox's DownThemAll plugin. The site's authentication is simple and done via HTTPS, but I also have the option of typing a captcha to reach the page with the files.

Author: mgibsonbr, 2014-03-24

1 answer

I have a few crawlers in PHP that access pages requiring credentials. It depends on the case, since each site has its own form of authentication. In my case, I know the forms involved. For example, I access a site whose login page contains the following form:

<form class="onclick-submit card grid-3" accept-charset="utf-8" method="post" action="https://painel2.oculto.net/conectorPainel.php" id="frmLogin" >
    <input class="hidden" type="text" name="email" id="txtUserName" value="[email protected]" />
    <input class="hidden" type="password" name="senha" id="txtPassword" value="senha" />
    <input class="hidden" type="checkbox" name="permanecerlogado" tabindex="6" id="chkRemember" checked="checked" />
    <input class="hidden" type="hidden" value="login" name="acao" />
    ...
</form>

In this case, my PHP crawler authenticates with the site before processing the content:

$curl = new cURL(); // cURL here is a wrapper class, not PHP's built-in extension
$curl->post('https://painel2.oculto.net/conectorPainel.php', 'email=[email protected]&senha=senha&permanecerlogado=1&acao=login');

The site creates a session for my subsequent requests, and the program can then access the content normally. I don't even check the site's response, since the chances of the login failing are minimal; if access is denied because of some other failure (connection error, server down, etc.), the program simply stops and tries again later.
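
For reference, `new cURL()` above assumes a cURL wrapper class rather than PHP's built-in extension. A rough sketch of the same login using only the native curl functions (the cookie-file path is just an illustrative choice) could look like this:

// Sketch of the same POST with PHP's native curl extension.
// The URL and field names come from the form shown above; the cookie jar
// stores the session so later requests stay authenticated.
$ch = curl_init('https://painel2.oculto.net/conectorPainel.php');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query([
        'email'            => '[email protected]',
        'senha'            => 'senha',
        'permanecerlogado' => 1,
        'acao'             => 'login',
    ]),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_COOKIEJAR      => '/tmp/crawler-cookies.txt', // written when the session is created
    CURLOPT_COOKIEFILE     => '/tmp/crawler-cookies.txt', // sent back on subsequent requests
]);
$response = curl_exec($ch);
curl_close($ch);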

Most sites, therefore, require only three basic pieces of information (see the sketch after this list):

  • login
  • password
  • URL
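
Once those three values have produced a session, fetching a protected page and listing its links (what the question asks for) is just a matter of reusing the cookie jar. A minimal sketch, where the page URL and the output file name are illustrative:

// Sketch: reuse the cookie jar from the login to fetch a protected page
// and dump all of its links to a text file. URL and file name are illustrative.
$ch = curl_init('https://exemplo.com/pagina-protegida');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => '/tmp/crawler-cookies.txt', // same jar used at login
]);
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from imperfect real-world HTML
$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '') {
        $links[] = $href;
    }
}
file_put_contents('links.txt', implode(PHP_EOL, $links));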

But this certainly doesn't work everywhere, since some sites create tokens for each session (e.g. icloud.com) or use some algorithm that makes automation difficult. In those cases, case-by-case manual programming is required.
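
For the token case, the usual pattern is to GET the login page first, pull the token out of the HTML, and send it along with the credentials. A rough sketch, where the URL and the field name token are made up for illustration:

// Sketch of a token-based login: fetch the login page (reusing the cookie jar),
// then extract the hidden token field before posting the credentials.
// The URL and the field name 'token' are illustrative.
$ch = curl_init('https://exemplo.com/login');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => '/tmp/crawler-cookies.txt',
    CURLOPT_COOKIEFILE     => '/tmp/crawler-cookies.txt',
]);
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$token = '';
foreach ($dom->getElementsByTagName('input') as $input) {
    if ($input->getAttribute('name') === 'token') { // illustrative field name
        $token = $input->getAttribute('value');
    }
}
// The token would then be included in the POST alongside email/senha,
// exactly as in the login sketch above.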

Author: , 2014-03-24 19:24:31