User:Andrevan/Alternative to checkuser


Checkuser is an attempt to solve the problem of sockpuppetry. However, the problem continues to occur: checkuser is limited, relies on unreliable IP address info, and can be cheated easily. IP addresses are a bad way to track people anyway and are only going to become less useful over time. An alternative approach has been behavioral or sentiment analysis, but a modicum of deception is enough to throw off such analysis, after which a sockpuppet may be caught only in extraordinary circumstances.

Sockpuppetry is certainly an epidemic on the English Wikipedia and I imagine it is common on many wikis. Wikimedia projects do not require confirmation of identity in most circumstances. Since wikis are a social platform for collaborative work and dispute resolution, the appearance of a mob of like-minded people throws off many social consensus-building processes. I propose a system of technical identity checking which, with the requisite social ceremonies, could deter sockpuppetry.

Consider a scheme in which a unique key is generated on a successful login and stored in a user's session cookie or local storage. Along with this key we store a dictionary of one-way salted hashes of all the relevant information about this user: browser user agent, resolution, IP address, and connection speed. We can also store a small model of the amount of time the user spends looking at different pages on the site. Together, this is an anonymized unique fingerprint of the user's behavioral profile on the website.
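The scheme above can be sketched roughly as follows. This is a minimal illustration, not a worked-out design: the function names, the attribute names, and the example values are all hypothetical, and it assumes a single secret site-wide salt kept on the server, since fingerprints from different accounts can only be compared if they were hashed with the same salt.

```python
import hashlib
import hmac
import secrets


def new_session_key() -> str:
    """Generate the unique key issued on a successful login."""
    return secrets.token_hex(16)


def hash_attribute(salt: bytes, name: str, value: str) -> str:
    """One-way salted hash of a single attribute (HMAC-SHA256 keyed by the salt)."""
    message = f"{name}={value}".encode("utf-8")
    return hmac.new(salt, message, hashlib.sha256).hexdigest()


def build_fingerprint(salt: bytes, attributes: dict[str, str]) -> dict[str, str]:
    """Map each attribute name to the salted hash of its value."""
    return {
        name: hash_attribute(salt, name, value)
        for name, value in attributes.items()
    }


# Assumption: one secret salt for the whole site, stored server-side only.
SITE_SALT = secrets.token_bytes(32)

# Hypothetical attribute values, for illustration only.
example_attributes = {
    "user_agent": "ExampleBrowser/1.0",
    "resolution": "1920x1080",
    "ip_address": "192.0.2.10",
    "connection_speed": "fast",
    "reading_time_model": "short-visits",
}

session_key = new_session_key()
fingerprint = build_fingerprint(SITE_SALT, example_attributes)
```

Only `session_key` and `fingerprint` would be stored; the raw attribute values are discarded after hashing. Hashing each attribute separately, rather than hashing them all together, is what lets a later comparison say which variables changed and which did not.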

Every time a user with the same key seems to have a new fingerprint, an alert can be raised for an administrator to review. This would indicate a multi-user or role account. Similarly, if the same fingerprint appears on two different keys, this would indicate a sockpuppet. It sounds a little Orwellian but would involve no storage or knowledge of a user's actual personal info, if done properly. If the fingerprints change due to switching browsers or devices, the system could store which fingerprints it has seen before for that key. A user with 2-3 different editing habits might not necessarily be a problem, because other variables in the fingerprint would not change. If everything changes at once in a weird way, or if 2 users have everything in common except for connection/IP, that might be a problem.
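As a rough sketch of the two alerts described above, the following keeps a record of which fingerprints have been seen for each key and which keys have been seen for each fingerprint. The class name `FingerprintRegistry`, its method, and the alert wording are hypothetical, and a fingerprint is assumed to be a dictionary of attribute names to hashed values. It only handles exact matches; partial matches are a separate question.

```python
from collections import defaultdict


class FingerprintRegistry:
    """Tracks which fingerprints were seen under which keys (hypothetical design)."""

    def __init__(self) -> None:
        self.by_key = defaultdict(set)          # key -> fingerprints seen before
        self.by_fingerprint = defaultdict(set)  # fingerprint -> keys it appeared on

    def observe(self, key: str, fingerprint: dict[str, str]) -> list[str]:
        """Record one observation and return any alerts for an administrator."""
        # Dictionaries are not hashable, so use a sorted tuple as the identity.
        fp_id = tuple(sorted(fingerprint.items()))
        alerts = []

        # Same key, fingerprint not seen before: possible multi-user or role account.
        if self.by_key[key] and fp_id not in self.by_key[key]:
            alerts.append(f"new fingerprint for key {key}")

        # Same fingerprint on a different key: possible sockpuppet.
        other_keys = self.by_fingerprint[fp_id] - {key}
        if other_keys:
            alerts.append(
                f"fingerprint of key {key} also seen on {sorted(other_keys)}"
            )

        # Remember the fingerprint so a known device does not alert again.
        self.by_key[key].add(fp_id)
        self.by_fingerprint[fp_id].add(key)
        return alerts
```

Because previously seen fingerprints are remembered per key, a user who alternates between, say, a laptop and a phone would trigger a review once per device rather than on every switch.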

Since it's theoretically possible that distinct users could be indistinguishable until the algorithm is trained to weight the relevant variables properly, this would probably produce a few false positives at first.
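One way to picture the weighting is a similarity score in which each variable counts for a different amount. The sketch below is only illustrative: the weights are made-up placeholders rather than trained values, the attribute names are hypothetical, and the fingerprints are assumed to be dictionaries of hashed values.

```python
def similarity(fp_a: dict[str, str], fp_b: dict[str, str],
               weights: dict[str, float]) -> float:
    """Weighted fraction of shared attributes whose hashes match (0.0 to 1.0)."""
    # Only compare attributes present in both fingerprints and given a weight.
    shared = set(fp_a) & set(fp_b) & set(weights)
    total = sum(weights[name] for name in shared)
    if total == 0:
        return 0.0
    matched = sum(weights[name] for name in shared if fp_a[name] == fp_b[name])
    return matched / total


# Placeholder weights; in practice these would have to be tuned or trained.
EXAMPLE_WEIGHTS = {
    "user_agent": 1.0,
    "resolution": 0.5,
    "ip_address": 0.5,
    "connection_speed": 0.5,
    "reading_time_model": 3.0,
}

# Two accounts with everything in common except connection and IP.
account_one = {
    "user_agent": "h-ua", "resolution": "h-res", "ip_address": "h-ip-1",
    "connection_speed": "h-speed-1", "reading_time_model": "h-model",
}
account_two = {
    "user_agent": "h-ua", "resolution": "h-res", "ip_address": "h-ip-2",
    "connection_speed": "h-speed-2", "reading_time_model": "h-model",
}

score = similarity(account_one, account_two, EXAMPLE_WEIGHTS)
```

With these placeholder weights the two accounts score 4.5 out of a possible 5.5, because the variables that differ are the ones given the least weight. Choosing a threshold above which a pair is sent for review, and adjusting the weights as false positives are found, is the part that would need training.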