What exactly identifies a site user?

Question

While studying backend development, I ran into something I do not understand: there is a site, a Node.js server (in principle any server, but I am interested in Node.js), and a visitor arrives at the site. Regardless of whether the site has registration, what exactly identifies the user?

I know that this is handled by sessions. I have read about them here, but I still do not understand where exactly, and from what data, the server works out that this is the same user as before. Suppose two visitors come to the site, one from Moscow and one from Vorkuta, and the site has a chat. The chat is as simple as can be: you can post without registering. In the chat, messages from the Vorkuta user must appear on a blue background and messages from the Moscow user on a green background. Then a new visitor arrives, also from Moscow and from the same IP address, but on another computer in the next room (say, the Muscovite's wife), and her messages must have an orange background...

So I am not asking for code (unless it is needed as an illustration). Please explain the logic by which the server perceives visitors. How does it determine who is who?

46

server backend logic

Author: Kromster, 2017-11-03

3 answers

In addition to what has already been said about the drawbacks of cookies:

A transparent server-side request cache (whether Varnish or transparent caching in nginx) becomes impossible when cookies are used. So if the server does not set cookies when they are not needed, the site's pages can be served from the server cache and will open faster. Caching responses that carry a Set-Cookie header makes no sense, so those bypass the cache as well.
Setting up a CDN for static resources requires care when cookies are used. If your site opens without www, for example at test.ru, then you should assume that a cookie set on the site will be sent with requests to all subdomains, including, for example, cdn.test.ru. That is why sites that traditionally open without www often serve static resources from a separate second-level domain rather than a subdomain. For example, yastatic.net at Yandex.

The last issue was supposed to be fixed by RFC 6265, but at the time of writing some major browsers still do not fully support it. Those older browsers cannot be dismissed easily, because their share is not equally low everywhere: in Japan, for example, IE accounted for 16% of all requests as of March 2018, and that is a country of 127 million people. If you run a global service, you will have to act as if RFC 6265 did not exist yet either way.

Besides JWT, there are other "long" cookies, such as ViewState in ASP.NET, which have the same caching and CDN problems as ordinary random session cookies, only worse. Such cookies can be very large, easily ten kilobytes, and if they are sent with every request for every image or static file on your site, this will certainly slow the site down. Microsoft explicitly recommends not using them if site speed matters to you.

What else?

Instead of cookies, you can use various headers that end up in the browser cache. For example, if you once sent an ETag header with an image or file, the browser will send that header's value back the next time it requests the same resource. The user cannot see that you are tracking them this way, that is, that you know the same person has come back, so such techniques are frowned upon.

Summary

If you can avoid identifying your users, your site will be able to run faster.

9

Author: sanmai, 2018-11-19 05:38:59

As a rule, cookies are used: a string of data that is stored in the user's browser. You can write the algorithm that generates them yourself, or use a built-in mechanism, for example PHPSESSID in PHP (see the session_start() function).

You can also identify the user without cookies, but then you have to rely on other parameters available to you. These are primarily the User-Agent (the user's browser) and the user's IP address. In PHP, these values are stored in $_SERVER:

1. $_SERVER['HTTP_USER_AGENT']
2. $_SERVER['REMOTE_ADDR']

The second option is less accurate, but it works even if the user has disabled cookie storage. An example of where it fails: two users are behind the same NAT and use the same browser.
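
A minimal sketch of this cookie-less fallback, written in Python rather than PHP purely for illustration. The helper name visitor_key, the User-Agent string, and the sample addresses are all made up. It hashes the IP address together with the User-Agent and shows the NAT collision mentioned above.

```python
import hashlib


def visitor_key(remote_addr: str, user_agent: str) -> str:
    """Derive a rough visitor key from IP address and User-Agent (hypothetical helper)."""
    raw = f"{remote_addr}\n{user_agent}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


# Made-up User-Agent and documentation-range IP addresses.
ua = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"
husband = visitor_key("203.0.113.7", ua)
wife = visitor_key("203.0.113.7", ua)        # same NAT, same browser build
stranger = visitor_key("198.51.100.20", ua)  # different IP address

print(husband == wife)      # two people behind one NAT look identical
print(husband == stranger)  # a different IP yields a different key
```

With only these two inputs, the husband and wife from the question cannot be given different colors, which is why a cookie is the usual choice.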

8

Author: Ossa, 2017-11-03 09:27:04

score 78 · Accepted Answer

I will list every way I know of to identify a user.

IP address

I mention this method because the IP address is the only identifier that cannot be faked. It can be borrowed from someone else (a proxy, a VPN, Tor, or simply a dynamic IP), but that usually takes more effort than, say, clearing cookies. Nor can an IP address be deleted the way cookies can: there will always be one. Because of this relative reliability (not everyone can be bothered to keep hundreds of proxy servers ready for switching IPs), it is often used to strengthen security, for example by limiting the number of requests per second/minute/hour from a single IP. However, different people sharing one Internet connection cannot be told apart by IP, which contradicts the conditions of the question, so let's move on.
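
The per-IP request limit mentioned above could look roughly like the sketch below. It is a simple in-memory sliding window, one of several possible approaches; the class name IpRateLimiter and its parameters are made up for illustration, and a real deployment would more likely rely on the web server or a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class IpRateLimiter:
    """Allow at most max_requests per window_seconds for each IP (illustrative only)."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = IpRateLimiter(max_requests=3, window_seconds=60)
# Explicit timestamps keep the example deterministic.
results = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3, 61)]
print(results)
```

Note that everyone behind the same NAT shares one quota, which is the same limitation described in the paragraph above.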

Plain username and password

The idea is simple: send the username and password with every request. One implementation of this method is already part of the HTTP protocol itself, via the Authorization header, and is supported by all major web browsers and web servers.

In the HTTP version, it works as follows:

On the first visit to the site, the client has nothing and sends no additional information to the server. The server responds with a 401 Unauthorized error and adds a WWW-Authenticate HTTP header describing the login methods (for a simple username and password, this is Basic realm="default")
The client receives this and asks the user for a username and password. It then sends its request again, this time with an Authorization HTTP header that contains the username and password in base64: Basic YWRtaW46MTIzNDU2. Decoding this example gives admin:123456, the username and password separated by a colon
The site checks this and either responds normally or returns 401 again and asks for a new username and password
This Authorization: Basic YWRtaW46MTIzNDU2 header is then sent with every subsequent request.
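
The encoding used in these steps can be reproduced in a few lines. This is a minimal Python sketch; the function names build_basic_auth and parse_basic_auth are made up for illustration, and a real server would normally rely on its framework's parser.

```python
import base64
from typing import Optional, Tuple


def build_basic_auth(username: str, password: str) -> str:
    """Build the value of the Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def parse_basic_auth(header_value: str) -> Optional[Tuple[str, str]]:
    """Return (username, password), or None if the header is not valid Basic auth."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return None
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return None
    username, sep, password = decoded.partition(":")
    if not sep:
        return None
    return username, password


header = build_basic_auth("admin", "123456")
print(header)                    # Basic YWRtaW46MTIzNDU2
print(parse_basic_auth(header))  # ('admin', '123456')
```

Decoding needs no key at all, which is why this scheme is only acceptable over HTTPS.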

Advantages:

Simplicity. HTTP Basic Auth is already implemented in most web browsers and web servers, so there is nothing to invent. If you build your own version, it is enough to check the username and password on each request, with no further complications.

Problems:

Without HTTPS, there is no security at all: the username and password travel over the Internet essentially in clear text (base64 is not encryption). The client also has to remember the password in plain text, and the server knows the password too (there are authentication schemes in which the server does not need to know the password, but that is beyond the scope of this question);
HTTP Basic Auth in browsers only works within the current session; after restarting the browser, you need to enter your username and password again.

To be fair, HTTP authentication is not limited to a bare username and password (there is a possibly complete list of authentication schemes), but I will not dwell on the other methods because they are rarely used.

Random string

The simplest and most popular method of identification, and the best balanced in terms of security versus convenience. PHPSESSID, probably the most common cookie in the world, is exactly this. The gist is as follows:

On the first visit to the site, the client has nothing. The site notices this, creates a new random string (a long one, so that it is hard to guess; at least 30 characters) and, along with the usual response, sends the generated string one way or another (Set-Cookie, a redirect to a special link, or simply in the response body if this is, for example, a JSON API)
The client receives this string with the response and stores it somewhere (a browser stores it in cookies by itself, an SPA can put it in localStorage, and so on)
On subsequent visits, the client adds this string to its requests (cookies, the HTTP Authorization header, or simply a GET parameter in the requested address; on some old PHP forums you can still see the session id right in the address bar)
If the client needs to be identified more specifically (by username and password, for example), then after login the site records in its database that this random string corresponds to this username, and reads that information back from the database on subsequent requests.
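
Applied to the chat from the question, the steps above can be sketched as follows. This is a minimal Python illustration, not a Node.js implementation; the in-memory SESSIONS dictionary stands in for a database, and the names handle_request and COLORS are made up.

```python
import secrets
from typing import Dict, Optional, Tuple

SESSIONS: Dict[str, dict] = {}        # in-memory stand-in for a database table
COLORS = ["blue", "green", "orange"]  # background colors handed out in turn


def handle_request(session_id: Optional[str]) -> Tuple[str, dict]:
    """Return (session_id, session_data), creating a session for unknown visitors."""
    if session_id is None or session_id not in SESSIONS:
        # Cryptographically strong random string, about 43 characters long.
        session_id = secrets.token_urlsafe(32)
        color = COLORS[len(SESSIONS) % len(COLORS)]
        SESSIONS[session_id] = {"color": color, "login": None}
    return session_id, SESSIONS[session_id]


# Three first visits: none of the clients sends a session id yet.
vorkuta_id, vorkuta = handle_request(None)
moscow_id, moscow = handle_request(None)
wife_id, wife = handle_request(None)   # same IP as Moscow, but a new session

# The Moscow visitor returns and presents the stored string.
_, moscow_again = handle_request(moscow_id)

print(vorkuta["color"], moscow["color"], wife["color"])
print(moscow_again["color"])
```

The IP address is never consulted: the husband and wife are told apart only because each browser stores and sends back its own string.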

As for PHP, all of this is built in: calling the session_start() function creates a PHPSESSID cookie made of random letters and digits (or reads the existing one, if there is one). The data associated with this cookie is available through the $_SESSION array (and by default is physically stored in a special directory on the server), and you can read and modify it. On subsequent requests from the user, the session contents are read from the file automatically when session_start() is called, and all the data you put into $_SESSION while processing earlier requests is available again. Details are in the documentation.

Advantages:

Simplicity: comparing strings is trivial;
When the IP address changes (a common occurrence on mobile phones), the identification is not lost;
Implementing a "Log me out on all devices" button comes down to deleting all the user's records in the database, and if you create a separate string for each device, you can log devices out selectively (some sites, such as VK, offer this option).

Problems:

The random string generator must be truly random (or, if not truly random, at least cryptographically strong; not uniqid()), because an attacker can try to predict pseudo-random output (for example, by recovering the generator state in PHP or Python, or by guessing sessions created via uniqid(), as in Invision Power Board). Under no circumstances use a login hash, a password hash, the current time, a single pre-prepared string, or other non-random things as the string, since that makes guessing far easier. For how to obtain real randomness, read the documentation for your programming language. Or just use a ready-made implementation such as session_start() in PHP;
Additional load on the server. To find out which user is behind a random string, the server has to query the database. This is not a problem for the vast majority of sites, but for giants like Google it already is;
Cookie handling is sometimes buggy: for example, IE11 sends cookies to subdomains even when not asked to (already fixed in Edge), which can leak data to third-party CDNs, for instance. So keep an eye on how the browsers you target handle cookies. And do not forget HttpOnly, so that cookies cannot be stolen via XSS (and Secure, if the site uses HTTPS).

Non-random, but protected string (for example, JWT)

The idea is this: we blatantly violate the ban on non-random data mentioned above and put into the string, for example, the user ID and, optionally, the granted access rights (for example, whether the user is an administrator), the string's expiration date, and some other data. But! Alongside this data we add a hash computed over the data plus a secret string that only the site knows and gives to no one. When a request arrives from the client, the site checks that the hash is correct. This protects against guessing and forgery: to forge the data, the hash must be recalculated, and an attacker who does not know the secret string cannot do that. (The secret string must be VERY long, around a hundred characters, so that it cannot be guessed at all, since all the security rests on it.) (In JWT you can also sign with RSA instead of a plain secret string, which improves security, but I will not go into all the implementation details; this is long enough already.)
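
The signing idea can be illustrated with a keyed hash (HMAC) from the Python standard library. This is a simplified sketch, not the JWT wire format; the SECRET value, the payload fields, and the function names issue_token and verify_token are assumptions for illustration. In practice a maintained JWT library is the safer choice.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Placeholder secret: a real one must be long, random, and never leave the server.
SECRET = b"replace-with-a-long-random-secret-known-only-to-the-site"


def _sign(data: str) -> str:
    return hmac.new(SECRET, data.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(payload: dict, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Pack the payload plus an expiry time and append a keyed hash of it."""
    now = time.time() if now is None else now
    body = dict(payload, exp=int(now) + ttl_seconds)
    raw = json.dumps(body, sort_keys=True).encode("utf-8")
    data = base64.urlsafe_b64encode(raw).decode("ascii")
    return data + "." + _sign(data)


def verify_token(token: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the payload if the hash matches and the token has not expired."""
    now = time.time() if now is None else now
    data, sep, signature = token.rpartition(".")
    if not sep:
        return None
    # Constant-time comparison of the received and the recomputed hash.
    if not hmac.compare_digest(signature.encode("utf-8"), _sign(data).encode("utf-8")):
        return None
    body = json.loads(base64.urlsafe_b64decode(data))
    return body if body["exp"] > now else None


token = issue_token({"user_id": 42, "admin": False}, ttl_seconds=1800, now=1000)
print(verify_token(token, now=1001))  # valid: payload plus exp
print(verify_token(token, now=2800))  # expired: None

# Forgery attempt: change the data but reuse the old hash.
fake_raw = json.dumps({"user_id": 42, "admin": True, "exp": 2800}, sort_keys=True)
fake_data = base64.urlsafe_b64encode(fake_raw.encode("utf-8")).decode("ascii")
forged = fake_data + "." + token.rpartition(".")[2]
print(verify_token(forged, now=1001))  # rejected: None
```

No database lookup happens in verify_token: everything the server needs is either in the token or in SECRET.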

Advantages:

Less load on the server. The client has already sent all the necessary data, the server can only calculate the hash from this data and the secret string and check that it matches the sent one. You don't need to go to the database: the secret string is usually in some variable nearby, so it's all done quickly;
Independence from a centralized database makes it easy to verify authentication on independent and unrelated microservices, including geographically scattered around the world, because they only need to know the secret string, which changes very rarely, to check the hash sent by the user, and do not need to contact other microservices or the database;
The client can read the JWT itself and understand who it is (if the data is only protected by a hash, and not encrypted);
It also survives a change of IP address.

Problems:

The implementation becomes more complicated. If you do everything yourself, you can make a mistake and end up with a security hole, so it is better to use ready-made implementations such as JWT (although holes are sometimes found in those too, so be sure to follow the news and read Habr);
A "Log me out on all devices" button cannot really be implemented. To invalidate a given data string, you must either change the secret string or record somewhere in a database that this particular string has become invalid. Both are quite troublesome and negate the advantages of this method of identification. That is why such strings are usually made short-lived: for example, Google's API issues JWTs that are valid for only half an hour (the expiration date is stored in the JWT itself and is also protected by the hash, so no database lookup is needed).
The information can go stale. For example, if the JWT says the user is an administrator and the administrator rights are then revoked, the site, relying on the JWT data, will keep treating the client as an administrator until the JWT itself expires. You can take the information from the database instead, but then it is easier to go back to a random string.
Due to the fact that JWTs and their analogues contain all the necessary information, they are usually long; with a large amount of data, the string may, for example, not fit into cookies. However, if you store only the user ID, then this is not a problem.

Supercookies and other fingerprinting

The idea is to use technology for purposes it was not designed for. Every browser and every OS has its own behavioral quirks, and those quirks can reveal quite accurately who exactly has visited. For example, they render text slightly differently, and browsers can be told apart by small differences in the text's pixels; or the size of a browser window maximized to full screen differs slightly between computers (which is why Tor Browser recommends not maximizing it and blocks sites from reading the canvas). I will not go into all the details here.

Advantages:

It is very hard to get rid of. You can if you really want to, of course, but it is a lot of hassle: you have to imitate a popular browser/OS/device combination to "blend in with the crowd". It is no longer a matter of clicking a "Clear cookies" button. Without special measures, the client's device will be identified regardless of whether it has changed its IP address, cleared cookies, and so on.

Problems:

The accuracy is not one hundred percent. All iPhones are pretty much the same, and telling one iPhone X from another iPhone X is unlikely to work (although this applies only to fingerprinting; with supercookies it is easier);
Users will find you and hurt you.