CHARACTERIZING AND DETECTING
PASSWORD GUESSING ATTACKS

A Dissertation

Presented to the Faculty of the Graduate School

of Cornell University

in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

by

Marina Sanusi Bohuk

May 2024


© 2024 Marina Sanusi Bohuk

ALL RIGHTS RESERVED


CHARACTERIZING AND DETECTING PASSWORD GUESSING ATTACKS

Marina Sanusi Bohuk, Ph.D.

Cornell University 2024

Modern authentication systems still mainly rely on passwords for authentica-

tion, but little is known about legitimate and malicious user behavior during

the authentication process due to the difficulty of collecting information on such

a sensitive field. Because passwords are hard to remember and often reused

across websites, they are prone to remote guessing attacks in which an attacker

iterates through a guess list of credentials, submitting them against a live lo-

gin system; but existing defenses do not leverage password-based information

because of the challenge of collecting such information in a secure way.

We address this challenge first by developing a measurement frame-

work called Gossamer for securely recording password-derived measurements,

which we used to collect data on 34 million login requests at two universities.

Then, we show how we used the data collected by Gossamer to develop a clus-

tering approach called Araña that detects and groups login requests into attack

campaigns. Finally, we explore existing timely attack detection mechanisms and

evaluate them on Gossamer data along with three new detection methods based

on Directed Anomaly Scoring. We also show that these detection methods are

vulnerable to evasion attacks by an adaptive attacker.


BIOGRAPHICAL SKETCH

Marina Bohuk was born and raised in Charlottesville, Virginia. She attended

the University of Virginia where she earned an undergraduate degree in Com-

puter Science. While at UVA, she co-founded a cybersecurity training com-

pany called MetaCTF with her now-husband and two friends. After gradu-

ating, she worked for a year as a software engineer at Mastercard before be-

ginning her studies at Cornell University. During her PhD, she has focused on

improving password-based authentication systems and protecting them from

attacks through empirical observations on real-world data. She also interned at

Cloudflare where she helped implement Might I Get Pwned, a tool for detecting

leaked credential use.


For Roman and Nikolai.


ACKNOWLEDGEMENTS

I would first and foremost like to thank Tom, without whom none of this work

would have been possible. His advice and support along the way have been

instrumental to my journey here at Cornell.

Next I would like to thank my collaborators, especially Rahul Chatterjee and

Mazharul Islam at the University of Wisconsin-Madison. I always benefit from

hearing their ideas, and I’ve really enjoyed working with them on three consec-

utive papers.

Finally, I am eternally grateful to my parents for their constant support and

to my husband Roman for his encouragement, optimism, IT support, and trips

with me back and forth to New York.


CONTENTS

Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

1 Introduction 1

2 Gossamer: Securely Measuring Password-based Logins 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . 9
2.3 Designing a Secure Measurement Architecture . . . . . . . . . . . 11
2.4 Password-Derived Measurements and Security Analysis . . . . . 17

2.4.1 Password-derived measurements . . . . . . . . . . . . . . 18
2.4.2 Security analysis of measurements . . . . . . . . . . . . . . 19

2.5 Deploying Gossamer . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Login Statistics, Patterns, & Observations . . . . . . . . . . . . . . 30

2.6.1 Characteristics of high volume attacks . . . . . . . . . . . . 32
2.6.2 User and client statistics . . . . . . . . . . . . . . . . . . . . 34
2.6.3 Password security . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.4 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3 Araña: Discovering and Characterizing Password Guessing Attacks in
Practice 49
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . 53
3.3 Gossamer Logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4 Towards Detecting Attack Campaigns . . . . . . . . . . . . . . . . 61

3.4.1 Login Sets and Features . . . . . . . . . . . . . . . . . . . . 62
3.4.2 Initial Attempts Using Compromise Reports . . . . . . . . 65

3.5 Campaign Discovery Pipeline . . . . . . . . . . . . . . . . . . . . . 66
3.5.1 Filtering Likely Benign Requests . . . . . . . . . . . . . . . 67
3.5.2 Clustering Potentially Malicious L sets . . . . . . . . . . . 71
3.5.3 Attack Campaigns Discovered . . . . . . . . . . . . . . . . 74

3.6 Analysis of Attack Campaigns . . . . . . . . . . . . . . . . . . . . 76
3.6.1 Example Attack Campaigns . . . . . . . . . . . . . . . . . . 76
3.6.2 Higher-Level Attack Behaviors Observed . . . . . . . . . . 83

3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.8 Real-World Deployment Constraints . . . . . . . . . . . . . . . . . 90
3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92


4 The Practicality and Robustness of Timely Malicious Login Detection 93
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 Background & Related Work . . . . . . . . . . . . . . . . . . . . . . 97
4.3 Evaluating Prior Approaches . . . . . . . . . . . . . . . . . . . . . 101

4.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3.2 Performance of Volumetric Approaches . . . . . . . . . . . 104
4.3.3 Performance of ML Approaches . . . . . . . . . . . . . . . 106

4.4 Anomaly-Based Detection Approaches . . . . . . . . . . . . . . . . 110
4.4.1 Malicious Login Detection Using DAS . . . . . . . . . . . . 113

4.5 Attack Detection Performance . . . . . . . . . . . . . . . . . . . . . 117
4.6 Robustness of Malicious Login Detection . . . . . . . . . . . . . . 122

4.6.1 Model for Evasion Attacks . . . . . . . . . . . . . . . . . . 122
4.6.2 Evasion Strategies . . . . . . . . . . . . . . . . . . . . . . . 125
4.6.3 Evasion Results . . . . . . . . . . . . . . . . . . . . . . . . . 127

4.7 Discussion & Future Work . . . . . . . . . . . . . . . . . . . . . . . 132
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

5 Conclusion 137
5.1 Overarching Observations . . . . . . . . . . . . . . . . . . . . . . . 137
5.2 The Authentication Ecosystem . . . . . . . . . . . . . . . . . . . . 138

A Gossamer: Securely Measuring Password-based Logins 140
A.0.1 Measurements Taken . . . . . . . . . . . . . . . . . . . . . . 140
A.0.2 Filtering Out Attacks . . . . . . . . . . . . . . . . . . . . . 140
A.0.3 Login Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 142

B Araña: Discovering and Characterizing Password Guessing Attacks in
Practice 145
B.1 Reasons for Compromise Reports . . . . . . . . . . . . . . . . . . . 145
B.2 Using DAS to Detect Attacks . . . . . . . . . . . . . . . . . . . . . 146
B.3 Additional Clustering Results . . . . . . . . . . . . . . . . . . . . . 148
B.4 Comparing Araña to Prior Work [29] . . . . . . . . . . . . . . . . . 152
B.5 Geographical Source of Attacks . . . . . . . . . . . . . . . . . . . . 152

C The Practicality and Robustness of Timely Malicious Login Detection 154
C.1 Finding k for Volumetric Flagging . . . . . . . . . . . . . . . . . . 154
C.2 StopGuessing Implementation Details . . . . . . . . . . . . . . . . 155
C.3 WhoAreYou Implementation Details . . . . . . . . . . . . . . . . . 158

C.3.1 Estimating Reputation Score . . . . . . . . . . . . . . . . . 158
C.3.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . 159

C.4 Additional Evasion Results . . . . . . . . . . . . . . . . . . . . . . 160
C.5 Attack Clusters Detected by Araña [64] . . . . . . . . . . . . . . . 161

Bibliography 163


LIST OF FIGURES

2.1 The main components of Gossamer. . . . . . . . . . . . . . . . . . 12
2.2 Relative improvement in password guessing success rate (∆q)

due to access to password-derived statistics from Gossamer over
the baseline in q ≤ 108 guesses. . . . . . . . . . . . . . . . . . . . . 22

2.3 Successful and failed login attempts per day at two universities
(for a total of 196 K unique users at U1 and 309 K at U2). Potential
high-volume attack campaigns we discovered are also shown. . 27

2.4 Summary statistics of login requests recorded by Gossamer at U1
and U2. More statistics can be found in Figure A.3. . . . . . . . . 30

2.5 Cumulative distributions of unique passwords per username
and IP for February 2021. The X-axis is log scale. . . . . . . . . . . 37

2.6 Cumulative distribution of interarrival time between requests
for different inactivity thresholds for February 2021. The X-axis
is log scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.7 Average session sizes (number of attempts per session) for dif-
ferent inactivity thresholds . . . . . . . . . . . . . . . . . . . . . . 43

2.8 Cumulative distributions of number of sessions per user per day
for an inactivity threshold of 360s for February 2021. The X-axis
is log scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.9 Cumulative distributions of session sizes (number of attempts
per session) for mobile devices for February 2021. The X-axis for
U2 is limited to 10 for comparison with U1 (although the maxi-
mum session size consisted of 54 attempts). . . . . . . . . . . . . 46

3.1 Summary statistics on the compromised accounts reported at
each university during the measurement period. . . . . . . . . . . 60

3.2 Features for L sets that we use for analysis. Nominal features
are marked with ©; all others are numerical. . . . . . . . . . . . . 63

3.3 Araña’s filter, cluster, analyze pipeline for discovering attack
campaigns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.4 Percentile of NR (left) and FF (right) for both universities (shown
up to the 99th percentile for viewing). By choosing the 90th per-
centile at U1 and 80th at U2, we set `U1 = 10, `U2 = 7 and
fU1 = 0.77, fU2 = 0.8 at the universities respectively. . . . . . . . 69

3.5 Number of L sets matching each of the filtering criteria at either
university. The last row shows the remaining number of sets
after all filtering steps. Threshold values ` and f for U1 and U2
are indicated in Figure 3.4. . . . . . . . . . . . . . . . . . . . . . . 71

3.6 Our distance function (logsetsim) to measure the similarity be-
tween two L sets (Left) and the hierarchical backoff distance
(HBD) calculation for nominal features (Right). . . . . . . . . . . . 73


3.7 Number of requests per day for the 1,752 and 6,408 poten-
tially malicious L sets at U1 and U2 respectively as shown
in Figure 3.5. Attack clusters we found in Section 3.4 are shown
in red boxes. Clusters marked with a * were identified in prior
work [29] as high volume attacks. Note that the x- and y-axes are
different for the two universities for better visualization. . . . . 75

3.8 Abbreviations for measurements taken . . . . . . . . . . . . . . . 76
3.9 Attack clusters detected using a set of heuristics and manual re-

view. The first column notes the ID of each cluster as we refer
to them in the paper [64]. The attack IDs we describe in detail
(Section 3.6) are shown in bold font. We discovered a total of 41
and 1,116 unique compromised users at U1 and U2 respectively.
Abbreviations are described in Figure 3.8. . . . . . . . . . . . . . 77

3.10 Different attack behaviors we observed. . . . . . . . . . . . . . . . 84

4.1 Breakdown of the Gossamer logs used in our simulations. Login
requests associated with Araña-identified clusters are labeled as
malicious, and all others are labeled as benign. . . . . . . . . . . . 101

4.2 Evaluation of the volumetric, ML-based, and DAS-based ap-
proaches on the Gossamer holdout dataset using attacks re-
ported in [64] as ground truth. We report the Precision, Recall,
and F1 scores (by percentage (%)) broken down both by indi-
vidual requests and L sets. The highest value in each column
is boldfaced. StopGuessing-G and WhoAreYou do not flag any
requests as malicious at U1. . . . . . . . . . . . . . . . . . . . . . . 105

4.3 The detection performance (recall) of the three DAS-based ap-
proaches on the four attack clusters at U1 and seven at U2 re-
ported by Araña [64] over the holdout period. Multi-day clusters
are highlighted with an asterisk (∗), and the # L sets is boldfaced
when all L sets of the attack cluster are detected by the approach. 114

4.4 Manual analysis of the false positives (FPs) generated by the
three DAS-based approaches. . . . . . . . . . . . . . . . . . . . . . 117

4.5 For the three DAS-based approaches, we report the total num-
ber of alerts (# of L sets flagged malicious) generated per day
(red line) and how many of them are true alerts (blue line) over
the testing period at U2. We also report the number of known
malicious L sets on each day (green line) which is the same for
all approaches. A smaller gap between “total alerts” and “true
alerts” indicates higher precision, and a smaller gap between
“true alerts” and “known attacks” indicates higher recall. . . . . 118

4.6 Evasion results for the Comparison Set approach when attempt-
ing to evade attack clusters by redistributing the attack transcript
over θip IP addresses or inserting θip chaff L sets. . . . . . . . . . . 128


4.7 Results of the three blending evasion strategies against Per-day
Clustering technique. None of the attacks at U1 could evade Per-
day Clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

A.1 Measurements we log in ephemeral (Eph.) and persistent (Pers.)
storage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

A.2 The distribution of operating systems (OS) as detected in the
user-agent of all the requests (after removing requests contain-
ing empty user-agent string at U2). . . . . . . . . . . . . . . . . . 142

A.3 Summary statistics of login requests recorded by Gossamer at U1
(from Dec. ’20 to July ’21) and U2 (from Dec ’20 to Mar ’21). . . . 143

A.4 Top 10 most common user agents at U1 (top) and at U2 (bottom) 144

B.1 The reasons for compromise as noted at U1 and the number of
instances of such compromise. Users reported multiple times are
counted as distinct instances. . . . . . . . . . . . . . . . . . . . . 146

B.2 Silhouette scores of different clustering models. . . . . . . . . . . 148
B.3 Confusion matrices of the number of L sets corresponding to

the three attack campaigns found in prior work [29] using their
manual approach versus our FCA approach. . . . . . . . . . . . . 152

B.4 The five most common countries at U1 (left) and U2 (right) by
number of attack requests. . . . . . . . . . . . . . . . . . . . . . . 153

C.1 The F1 scores for different values of k over the training period of
Gossamer logs at both universities. . . . . . . . . . . . . . . . . . 154

C.2 Evaluation of the volumetric, ML-based, and DAS-based ap-
proaches on the Gossamer train and holdout datasets using at-
tacks reported in [64] as ground truth. We report the Precision,
Recall, and F1 scores (in percentage (%)) broken down both by
individual requests and L sets. StopGuessing and WhoAreYou
do not flag any requests as malicious at U1. . . . . . . . . . . . . . 155

C.3 % of unseen IP addresses, ISPs, AS numbers, and countries at
each level ` observed by U2 during the holdout period. . . . . . . 156

C.4 Results of the evasion attacks when redistributing the login re-
quests across θip IP addresses involved in the corresponding at-
tack cluster, adding θr additional login requests, and lengthening
the attack by θt. We show the results over the 19 attack clusters
detected by the Per-day Clustering approach at U2. For this ex-
periment we assume the attacker has access to all login requests
as observed by the login server. A “-” symbol means the attacker
does not require the corresponding resource for evasion. . . . . . 160

C.5 Attack clusters detected by Araña [?] on Gossamer dataset. Sim-
ilar to [64], we refer to each attack cluster using the ID in the first
column. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162


CHAPTER 1

INTRODUCTION

The first digital password was invented in 1961 by computer science profes-

sor Fernando Corbato at MIT [126]. Little did he know that passwords would

become the most common method of digital authentication for at least the next

sixty years despite proposals to move to other authentication mechanisms like

biometrics, hardware tokens, or passkeys [30, 55]. However, these early pass-

words wered stored in plaintext, and the very first password theft occurred two

years later by simply printing out the system’s password file [28], starting a

pattern of credential theft that would continue to the present day.

In modern authentication systems, passwords are hashed before being

stored to the database. Then when a user attempts to login, the server hashes

the submitted password and compares it to the hash stored in the database; if

they match, the server allows the user to login. However, hashing passwords

does not eliminate the possibility of an attacker stealing or guessing a user’s cre-

dentials. If an attacker were to gain access to the database of password hashes,

they could attempt to recover the original passwords via brute force by taking

a list of common passwords, hashing each of them, and checking if the result-

ing hash matches any hash in the database. Indeed, such attackers have created

“rainbow tables”—tables mapping a plaintext password to its hash—to make

these lookups faster and easier. The introduction of a “salt”—a random string

appended to the password before being hashed—makes such offline guessing

attacks more difficult and computationally intensive to execute.

An attacker may still discover a user’s password, though, via an online guess-

ing attack in which they directly query the web service. For example, they


might launch a password spraying attack by trying a few very common pass-

words against many different usernames. Using a stronger, more complex pass-

word makes these attacks more difficult; but the average user today owns 240

online accounts [43], and human brains were not designed to memorize hun-

dreds of random strings of text. Thus users often choose easily memorable

passwords or reuse the same password across multiple websites. In fact, Das

et al. found that 43% of users reuse exactly the same password across different

websites [42].

The prevalence of breached credentials only further enables online guessing

attacks. When a password database is leaked, an attacker can take a user’s

password breached from website A and submit it to website B’s login form;

because of users’ tendency to reuse passwords, these attacks—called credential

stuffing—are widely observed and highly damaging. In fact, Okta reported that

credential stuffing accounts for 34% of login traffic in 2022 [99]. Similarly, an

attacker could take a user’s breached password from website A, apply small

tweaks to the password, and try to log in with that tweaked password to website

B. Pal et al. showed via simulation that these so-called “credential tweaking”

attacks can be highly effective, compromising 16% of user accounts in under

1,000 guesses [93].

More insight into the passwords submitted during login attempts would

help to improve the design of authentication systems and their defenses against

guessing attacks. However, it is difficult to gather data on passwords due to

their sensitive nature. To solve this issue, we designed a measurement frame-

work called Gossamer that records password-derived measurements on real lo-

gin requests [29]. To ensure the security and privacy of users, we performed a

2


simulation to choose a set of safe-to-collect measurements that can give insight

into the password without leaking too much information and compromising

user security. We deployed Gossamer at two universities over a seven month

period and collected data on over 34 million login requests.

Next, we designed a clustering approach called Araña that can group lo-

gin requests into clusters that are likely to be malicious attack campaigns [64].

In this approach, we used password-derived information to describe behavior

resembling credential stuffing, credential tweaking, targeted, and other kinds

of password guessing attacks. We tested the approach on Gossamer data and

found 29 attack campaigns across the two universities displaying different at-

tack behaviors.

However, an analyst might prefer to investigate attacks as they occur and,

for example, receive a report of attack campaigns at the end of each day. Thus

we performed an evaluation of existing timely detection methods [65], as well as

three new detection methods we designed based off Directed Anomaly Scoring

(DAS), a risk scoring mechanism designed by Ho et al [61]. Since an adapative

attacker may try to evade detection, we also consider whether the detection

methods are robust to evasion attacks. To do this, we performed a simulation of

evasion attacks against the three DAS-based detection methods by applying a

series of evasive strategies to the attack campaigns found by Araña and checked

whether or not the attack campaign was still detected. We find that evasion

attacks are possible even with limited resources and should be considered when

designing authentication systems.

In summary, modern password-based authentications are vulnerable to dif-

ferent types of online password guessing attacks, and existing defenses remain

3


primitive and susceptible to evasion attacks. This dissertation discusses how to

design better authentication systems and attack countermeasures using empiri-

cal observations on real-world login data.

4


CHAPTER 2

GOSSAMER: SECURELY MEASURING PASSWORD-BASED LOGINS

2.1 Introduction

Despite the prevalence of password-based authentication across the internet,

little is known about the passwords submitted to login systems. Knowing the

characteristics of such login information would help practitioners make better

security policies to improve both usability as well as attack detection. A key

challenge hindering progress is that passwords are highly sensitive, and as a

result prior work has only performed very limited measurements.

Two prior works are particularly relevant. Bonneau et al. [30] instrumented

Yahoo login servers for 48 hours to learn the distribution of actual user pass-

words; but his technique could not record other information about submitted

(valid or invalid) passwords, such as the similarity between the successive pass-

word submissions. Chatterjee et al. [38] were the first to investigate incorrect

password submissions from the viewpoint of a login server. They instrumented

Dropbox’s login service for 24 hours to investigate how often users submit a

fixed set of easy-to-correct typos; however, their study was limited to only a

specific set of typos and does not provide a general framework for analyzing

submitted passwords. Thus the question remains: Can we build login measure-

ment infrastructure that monitors password submissions but doesn’t endanger

users’ security?

In this work, we design, build, and deploy a measurement system called

Gossamer that securely records login requests, including statistics about sub-


mitted passwords. Doing this safely required extreme care, and our main

contribution is a holistic approach that combines systems security features, a

simulation-based framework to guide selection of password-derived statistics,

and procedural safeguards. Ultimately, our initial deployment at two large uni-

versities is able to answer, for the first time, basic questions about submitted

passwords — such as how often legitimate users are making typos or repeat-

edly submitting the same incorrect password, whether attacks are detectable as

credential stuffing, and more.

Performing such measurements requires jointly analyzing passwords sub-

mitted at different points in time. Prior measurement studies computed a

(keyed) hash over a correctly submitted password [30] or compared the hashes

of a small handful of variants of a submitted password to the real user’s pass-

word hash [38]. Neither approach allows inferring whether users are submitting

the same password multiple times or how many unique passwords they submit.

To enable such measurements, Gossamer’s design uses a two-service logging

infrastructure to ensure least privilege. Gossamer has a specialized measurement

service that receives a copy of login requests from login servers, processes them

by computing password statistics and encrypting submitted usernames, and

outputs sanitized logs to a persistent database on a different machine. The mea-

surement service, like login servers, has access to plaintext passwords. Thus

we designed it to match or exceed the security properties of login servers: It

is safe-on-reboot [86] (no sensitive data such as passwords are ever stored on

disk), deletes all in-memory data periodically to limit the scope of what would

be exposed in the case of a breach, and is administered by the same security

staff in charge of login servers.

6


Researchers use a separate analysis service to access the sanitized logs stored

in the persistent database. The sanitized logs and analysis service are still

treated as sensitive, and cannot be made publically available. To assess the risk

to user passwords in the unlikely case of complete exposure of both a login

system’s password hash databases plus Gossamer logs, we developed a new

simulation-based approach to analyze the speed-up of brute-force cracking at-

tacks that attempt to additionally exploit Gossamer logs. For example, simula-

tions show that storing raw strength scores (as measured by zxcvbn [123]) can

provide up to a 20% increase in cracking efficacy (using up to 109 hash compu-

tations), leading us to reduce the granularity of strength scores. Ultimately, our

simulations suggest that the best performing attack increases password recov-

ery rates using Gossamer logs by less than 2% using 109 queries.

To showcase the utility of Gossamer, we worked in close collaboration with

two large universities’ information technology (IT) security departments to per-

form a measurement study of login behavior. Our measurement study proto-

cols, including Gossamer’s design and implementation, went through a thor-

ough, multi-step review process that included reviews by the security engineer-

ing teams from both universities, representatives of each university’s adminis-

tration, and the relevant IRBs. This process culminated in a determination that

Gossamer poses minimal risk. We deployed Gossamer for seven months at Uni-

versity 1 (U1) and for three months at University 2 (U2). We observed 34 million

login requests (combined) for approximately 500 K users who regularly log in to

access various university-provided critical online services such as email, course

enrollment, and employment information.

This enables first-of-their-kind measurements of password usability and se-

7


curity. We saw that 1.9% of valid users at U1 and 4.6% at U2 changed their

password in the data collection period. We found that 6.5% of usernames at U1

appearing in public breaches are still using a password that is only a small vari-

ant of one of their leaked passwords. This motivates deployment of password

breach alerting services that take into account similarity [94]. On the usability

front, while the Dropbox study reported that 5% of failed attempts were due to

easily correctable typos, our measurements indicate that 65% of failed attempts

could be typos (within edit distance two from the actual password), suggesting

this is a much larger cause of user frustration than previously imagined. We

also report on the rate of login retries, the success and failure rate of app-based

two-factor authentication, and the possible adoption of password managers. Fi-

nally, we are able to report a few high-volume attacks, with insights enabled by

Gossamer to characterize the attacker behavior involved.

Summary. In summary, our project is the first to propose a measurement frame-

work for passwords that can safely help answer basic questions about password

use. Our contributions include:

• Design of Gossamer, which combines systems security, simulation-based se-

lection of password statistics, and procedural safeguards to enable measure-

ment studies of password-based login behavior.

• We worked with two large universities’ IT departments to deploy Gossamer

for multi-month measurement periods.

• We report for the first time on a variety of aspects of password-related us-

ability and security, and discuss the implications of these measurements. For

example, our measurements motivate the need for password breach alerting,

suggest ways to improve lockout mechanisms, and more.

8


Finally, we hope that Gossamer can serve as a platform to help drive future

research on improving usability and security of passwords. As such we are re-

leasing Gossamer as a public, open source project that may be useful for security

researchers both in industry and academia.

2.2 Background and Related Work

Current login systems still heavily rely on password-based authentication.

Users typically enter their usernames and passwords to a form on a web client,

which submits them along with other information relating to the user or ma-

chine such as HTTP headers, cookies, IP, and user agent to the login server over

HTTPS. The server hashes the password (and a salt), checks if the username and

hash pair is present in the login database, and if so, allows the user to log in or

prompts for further authentication checks. Otherwise the request fails.

Single sign-on (SSO) systems allow a user to log into multiple different web

services using the same username and password. When a user accesses a ser-

vice, the service provider (SP) redirects the user to obtain a proof of authentica-

tion from the identity provider (IdP). The IdP provides the proof immediately if

the user has recently authenticated with it, or requires the user to authenticate

and provides the proof if the authentication is successful. The OAuth frame-

work [12] is a common way to achieve SSO.

Looking ahead we perform measurements at two large universities. Both

U1 and U2 use SSO with Microsoft Active Directory Federation Service (ADFS).

While at U2 all login traffic goes through ADFS, at U1, only a portion of traffic

is via ADFS. This is part of the reason we see a lower rate of logins per day at

9


U1 compared to U2 (Section 2.4).

Studies about passwords. Prior works [60, 79, 85, 88, 122] have investigated

guessability of user-chosen passwords. Most of these rely on breached pass-

word data to understand the distribution of user-chosen passwords. Several

studies have used Amazon Mechanical Turk (AMT) to understand user choice

of passwords [69, 72, 114], under different factors such as password require-

ments [72], presence of a password strength meter [114], and use of a password

blocklist [57].

As passwords chosen in this environment may not represent passwords on

real websites, several studies have inspected real user passwords through client-

side, server-side, or offline instrumentation [48, 50, 96]. On the client side, Flo-

rencio and Herley [48] installed a Windows Live Toolbar component for five

hundred thousand volunteer participants and analyzed their password behav-

ior over 85 days. Similarly, Forget et al. [50] created a client-side data collection

tool to observe user’s password behavior in its natural environment [50, 96].

Measurement studies with login systems. To our knowledge, three studies

have looked at user passwords by instrumenting the login servers. Bonneau et

al. [30] instrumented Yahoo’s login servers to receive login requests (including

user passwords) and construct histograms of password characteristics based on

user demographics. Mazurek et al. [83] correlated password strength with de-

mographic information in an offline study with reversibly encrypted passwords

on an access-restricted computer. Chatterjee et al. [38] instrumented the login

code at Dropbox for failed login attempts to test whether applying a typo cor-

10


rection to the submitted password would have produced the correct password.

Open questions. Many open questions remain about the characteristics of the

passwords submitted to a login system. For example, how often do users log in

from multiple devices, how often do users submit the same incorrect password

multiple times, and how often do users submit passwords that are similar to

one of their leaked passwords? More importantly, can we collect information

about the submitted password that allows analyzing login characteristics with-

out degrading the security of user passwords? Such a framework would help

practitioners make data-driven login security policies, such as account lockout

thresholds, that better balance between usability and security.

2.3 Designing a Secure Measurement Architecture

To analyze the passwords submitted to a login system, we need to instrument

live login services and monitor login requests, including the submitted user-

name and passwords. User passwords are highly sensitive and should never be

logged. We therefore designed a secure instrumentation architecture that pre-

serves the privacy of login requests while allowing meaningful analysis. We

refer to it as Gossamer and deploy it at two login systems used at two univer-

sities in the United States. Gossamer is designed in close collaboration with the

security engineers at these universities. Below we describe the built-in security

considerations in our design and the integration with the existing login infras-

tructure at these universities.

The architecture. Gossamer enables instrumentation of typical web login

servers, such as those used for single sign-on (SSO). An overview of Gossamer’s

11


Login Server

Measurement 

Service

Ephemeral 

Database

Requires high 

security clearance

Researchers’ access

Persistent 

Database

Analysis 

Service

Login 

requests

Reported Account 

Compromise Log

Encrypted 

passwords
Anon. login data,

pw heuristics,

compromise logsKey Management 

Service (KMS)

Gossamer

Figure 2.1: The main components of Gossamer.

architecture appears in Figure 2.1. A lightweight hook is deployed within the

login server that, on every received login request, sends a stripped-down copy

of the request to our instrumentation infrastructure on a separate, in-network

machine. This is done using a separate thread to avoid any noticeable latency

impact by the instrumentation on login behavior. A login request includes the

username, password, IP address, a subset1 of the HTTP headers, timestamp, lo-

gin result (success or failure), and finally an application-specific result code for

the login attempt.

A measurement service receives this forwarded login information. It is respon-

sible for processing the raw login data in a secure manner, converting it into san-

itized logs, and storing them in a persistent database. The persistent database

can be accessed by analysts (in our case, researchers) via a dedicated analysis ser-

vice for understanding user login behaviors. As such, we partition Gossamer’s

architecture into two security levels: The lightweight login hook and measure-

ment service run at a higher privilege level and are administered by IT security

1The login server removes sensitive cookies specified by the security engineers from the
HTTP header, as some content can be more sensitive than user passwords. For example, an
“authentication cookie” could bypass MFA requirements.

12


staff; the analysis service is instead at a lower privilege level, accessible by ana-

lysts (researchers).

We explain more about these two services further below, but first describe

our security and design goals.

Security properties and design goals. We design Gossamer to resist a variety

of attacks. We note that all our network traffic is encrypted using TLS, and as

such we do not discuss network adversaries further. Instead, we focus on the

threat of complete compromise of each (or both) of the services, as well as the

weaker adversarial threat of exposure of logs generated by Gossamer.

To protect against these threats, we design Gossamer to conform to four se-

curity properties.

• Least privilege access to password data. The system must ensure that the analyst

receives only the information necessary for analyzing login behaviors, while

plaintext passwords remain restricted to particular services.

• Bounded-leakage logging. The system persistently logs a small set of statistics

about user passwords. The set of statistics is carefully designed to bound the

improvement in guessing attacks against user passwords, even in the case of

complete compromise of the analysis service.

• Periodic deletion. The system should expunge all raw, sensitive data older than

24 hours to reduce the exposure of any data should the system get compro-

mised.

• Safe-on-reboot. Finally, we must ensure that all sensitive data from raw HTTP

requests is destroyed on reboot. This property was first introduced in [86] for

the Bunker secure network tracing system.

13


We will return to these properties as we elaborate on the details of the architec-

ture.

Security considerations in Gossamer. As mentioned, Gossamer uses two ser-

vices running at different privilege levels on two different machines — a mea-

surement service for processing the raw login data and storing the anonymized

statistics of user logins into a secure, persistent relational database, and an anal-

ysis service for analyzing the data from the persistent database. Separating mea-

surement from analysis enables us to maintain the same privilege requirements

for access to password data as there are without Gossamer. The high-level archi-

tecture diagram of Gossamer including privilege levels is shown in Figure 2.1.

The measurement service runs on a heavily access-restricted machine that

receives a copy of the login request over an encrypted channel from the login

servers. The service then computes measurements over the submitted pass-

word and stores them in the persistent database. Some interesting statistics re-

quire the plaintext submitted password across multiple requests — for example,

the number of unique passwords submitted by a single user or from a single IP

address. Therefore, the measurement service stores passwords encrypted using

an in-memory key in the ephemeral database. The key is stored only in memory

and is automatically replaced with a new key every day at midnight local time.

The ephemeral database is placed in a memory-based file system, such as the

/tmp directory. The key rotation cryptographically erases the data stored in the

ephemeral database every 24 hours.2 If the measurement service is killed or the

device is rebooted, all ephemeral data is effectively deleted. This ensures our

2Key rotation at midnight deletes the data received, say, an hour before midnight, limiting
our ability to correlate between passwords received before and after midnight. It is an open
question how to design an efficient key-rotation technique that will allow secure deletion with-
out requiring storing linear number keys in memory.

14


periodic deletion and safe-on-reboot requirements.

The ephemeral database allows us to calculate a number of measurements

referencing the passwords submitted across multiple logins. These measure-

ments (given in Section 2.4) allow us to characterize user behavior and could

help in building attack detection mechanisms. The measurement result is stored

in a persistent database outside the privilege boundary, where it can be accessed

by the researchers. This database is placed in a disk-encrypted volume, provid-

ing another layer of protection in case the volume is backed up to an unpro-

tected machine or is compromised. The key to the encrypted volume is only

known to a subset of researchers, as the IT security engineers did not need it .

Protecting user privacy. To protect the privacy of users, Gossamer anonymizes

the usernames before storing in the persistent database by encrypting them us-

ing a deterministic encryption scheme [58]. The encryption key is only accessi-

ble from within the measurement service. The deterministic nature of encryp-

tion allows us to cross-reference the logins against a username without knowing

the actual username, while also allowing us to report compromised usernames

to the security engineers, should we discover any. We do not record any per-

sonally identifiable information about the user, including their real name, affili-

ation, or account type (such as student, faculty, or staff). We do record the source

IP address for requests, which is needed to analyze client and attack behaviors.

Of course re-identification attacks [89, 108] may be possible given access to

these logs, and for this reason alone logs are not suitable for public release. We

have strict policies against re-identification for the limited set of researchers who

access the analysis service. All analysis is performed on the analysis server with

encrypted usernames, and only summary statistics leave the analysis service.

15


Both the persistent and ephemeral databases are instantiated as MySQL

databases. The measurement service is a Python Flask application running on

an Apache server. We use the Python Fernet [7] library to encrypt user pass-

words in the ephemeral database using AES-256, and we use the Python Mis-

creant [8] library to deterministically encrypt and decrypt the usernames using

AES-SIV.

Integration with other data sources. Looking ahead, in both of our deploy-

ments there are other relevant data sources available that we would like to ana-

lyze. Namely, it is common for organizations to have a database that contains re-

ports about potentially compromised accounts which can provide insights into

what attacks are (not) being caught by current security mechanisms. At the

universities we worked with, these reports are generated when a user alerts IT

security to a compromise, or because existing alert generation mechanisms (in-

cluding third-party breach alerting services the universities subscribe to) flag an

account. In both cases, an IT analyst manually inspects existing logs to attempt

to determine if a compromise occurred.

To leverage these compromise reports, we add to the measurement service

the ability to accept such logs, anonymize them by encrypting all usernames

(using the same key as above), and transfer the resulting data to the persistent

database. A similar approach can be used in other deployments of Gossamer to

incorporate other relevant data sources, such as logs from MFA services.

Process of designing and deploying Gossamer. The process of designing a se-

cure architecture to measure information about passwords, getting all the nec-

essary approvals, and deploying the Gossamer framework to production was

long and involved.

16


We met with the engineers in the IT departments multiple times and went

through several iterations to agree on a design that would work with their sys-

tems and meet a level of security that would receive approval from the IT secu-

rity directors. Every element was carefully thought out across several dimen-

sions — such as who should have access to it, what operating system or soft-

ware version should be used, and what extra protections can we put in place to

ensure that it is locked down as much as possible.

Once we were satisfied with the design, we worked with the Chief Informa-

tion Security Officer and the Associate Chief Information Security Officer to get

their feedback and approvals. Then we wrote and submitted an application for

IRB; and after several rounds of emails, the IRB representative determined that

our research does not meet the definition of Human Subjects Research because

we were working with deidentified data, and she granted us IRB exemption.

Finally, we wrote the code, set up the infrastructure needed within the IT

security network, and tested our measurement service using a dummy SSO ser-

vice sending fake data.

2.4 Password-Derived Measurements and Security Analysis

Gossamer enables analyses based on the passwords submitted during login,

which will improve our ability to characterize user login behaviors. Passwords,

however, cannot be made available to analysts for security reasons. We therefore

design Gossamer to only collect limited, useful statistics about the submitted

passwords without storing passwords persistently. However, even just statistics

computed over user passwords could leak information about the password, so

17


care must be taken on which statistics are persistently stored and made available

to the analysts.

In this section, we discuss how we assessed the security implications of the

Gossamer logs storing different kinds of password-derived measurements.

2.4.1 Password-derived measurements

To assess risks related to password-derived measurements, we adopt an itera-

tive, simulation-based methodology. We consider a potential logging schema,

namely the set of measurements that we log about login attempts. For a can-

didate schema, we perform a simulation-based risk analysis that consists of (1)

defining a threat model including log exposure; (2) determining a baseline at-

tack that does not exploit exposed logs; (2) developing a log-exploiting attack

that incorporates information leaked via fields from the candidate schema; and

(3) running simulations using leaked passwords to assess the increased success

rate of the log-exploiting attack over the baseline. This allows quantifying risk,

and if it is too high we adjust the logging schema and repeat the process until

we are satisfied that risk is relatively low.

First we identify potential fields to include in a schema. Figure A.1 lists the

fields we ultimately utilize in Gossamer. We also considered several other fields

that we eventually discarded as too risky, as we now explore.

To understand password strength, we consider including a zxcvbn strength

score [123], and whether or not it belongs to one of the popular password guess-

ing lists (the most frequent 5,000 passwords in RockYou [10] or the top 5,000

18


passwords generated3 by Hashcat [106]).

To understand how often breached passwords are submitted, we consider

marking passwords as being in well-known, public breaches, which makes

those users vulnerable to credential stuffing attacks. For this we use a dataset

of 1.3 billion breached username-password pairs [120] and the Compilation of

Many Breaches (COMB) containing 3.2 billion pairs [103] released in February

2021. For each attempt, we logged whether the username, the password, or the

username-password pair appeared in this breach dataset. We also consider vul-

nerability to credential tweaking attacks [42, 93, 119] that target passwords sim-

ilar to a user’s other breached passwords. We therefore consider recording the

submitted password’s edit distance, PPSM similarity [93], and Pass2Path simi-

larity [93] from each breached password for the given username, if it is present

in the breach. We also consider logging the edit distance of the submitted pass-

word from previous passwords submitted for that username and IP in the same

24 hour window, which would shed light on whether users are making typos or

submitting distinct passwords, and what types of password-guessing strategies

attackers employ.

2.4.2 Security analysis of measurements

We now turn to making risk assessments about candidate measurements

schemas and, in particular, how exposure of Gossamer logs using a candidate

3We use the rule list best_64.rule [11] (a rule list compiled by the Hashcat community of
what are considered to be the best 64 rules) with RockYou to generate the guesses.

19


schema can be exploited to improve password guessing attacks.

Threat models. As discussed in the last section, we designed Gossamer and our

deployment procedures to limit the risk of illicit access to Gossamer logs, but

the principle of defense-in-depth suggests that we consider when these mecha-

nisms and procedures fail. For example, an insider attacker could leak the logs

to the public internet, or a smash-and-grab attack could somehow compromise

the analysis service and exfiltrate the logs.

We therefore consider two threat models. Both threat models assume the

attacker obtains a copy of Gossamer logs, can re-identify4 usernames within the

dataset, and seeks to infer the password associated to some particular username.

In the first threat model, the attacker can mount an online guessing attack by

querying the login service. In the second threat model, the attacker is assumed

to additionally have access to some salted hash of the password and so can

perform an offline guessing attack. The only difference between the two threat

models for our purposes is the expected guessing budget q, which would be on

the order of tens to thousands in the online case and hundreds of millions in the

offline case.

Looking ahead, it will also be material whether or not the targeted username

appears in an attacker-known password breach. If not, the best strategy is for

the attacker to modify a target-agnostic password guessing attack (e.g., a dictio-

nary attack) to incorporate relevant information leaked via the log file. In the

targeted guessing threat model, the username appears in data breaches known

to the attacker, and so the best strategy is to modify a targeted password guess-

4Recall that the persistent database contains masked usernames, and it is not exactly clear
how attackers would re-identify in this setting. Nevertheless, we conservatively assume that
re-identification is perfect.

20


ing attack (e.g., [93]) to incorporate relevant information leaked via the log file.

We detail particular attacks more below.

In each threat model we consider also a baseline attack (specified below)

which performs either targeted or untargeted guessing without exploiting the

log files.

Dataset for simulations. The simulations discussed below are based on

the breach data used in prior work [76, 93] containing 1.3 billion username-

password pairs. There are 370 million unique passwords between length 6 and

30, associated with 1.12 billion usernames. We removed passwords shorter than

6 characters and longer than 30 characters as done in [76, 93]. We split this data

so that the attacker has access to 80% to inform a guess list, and we randomly

sampled 10,000 passwords with replacement from the remaining 20% as target

user passwords that the attacker is trying to guess.

Password strength measurements. We first focus on four of the password-

derived measurements: (1) whether the password is in the top 5,000 RockYou

passwords (RY), (2) whether it’s in the first 5,000 Hashcat-generated passwords

(HC), (3) the binary zxcvbn (ZB) score of the password that we explain below,

and (4) the raw zxcvbn [123] score of the password for comparison. By default,

the zxcvbn password strength meter returns a password strength between 0 and

4. We hypothesized that this level of granularity would leak too much infor-

mation about the password and speed up a guessing attack. Therefore, we also

consider a modified zxcvbn score, where we record 0 if the zxcvbn score is 0 and

1 otherwise; we refer to this as the binary zxcvbn score (ZB). We also consider a

combined measurement of all three measurements described above (except the

21


100 101 102 103 104 105 106 107 108 109

0

5

10

15

20

# of guesses q (log scale)
∆

q
(%

)

Hashcat (HC)
RockYou (RY)
zxcvbn
Binary zxcvbn score (ZB)
HC+RY+ZB

Figure 2.2: Relative improvement in password guessing success rate (∆q) due
to access to password-derived statistics from Gossamer over the baseline in q ≤
108 guesses.

original zxcvbn score) — that is, whether the values of all three measurements

match between two given passwords.

Each of these password-derived measurements can be represented as a func-

tion M(w) that outputs the result of the measurement as a boolean, an integer,

or a tuple. Given a measurement value m = M(w̃) about a randomly chosen tar-

get password w̃, the attacker wins if they can guess the target password within

q guesses. This is also called the q-success rate (λq). We measure λq for dif-

ferent values of q. An attacker, given a measurement m = M(w̃), can filter its

list of guesses W to only the passwords w ∈ W that match the measurement,

i.e., m = M(w). We let λM
q be the success rate of this attack. The baseline success

rate λ0
q is given by the recovery rate of the attack that simply queries the attacker

guess list in descending frequency order (without filtering). Thus, we measure

the increase in attacker success as ∆q(M) = λM
q − λ0

q.

The results of our simulations are shown in Figure 2.2. We first discuss the

online context, where q ≤ 1000. Among different measurements, revealing the

zxcvbn score provides the largest improvement in attack efficacy, enabling an

attacker to guess 1.6% more passwords in less than 1,000 guesses, compared to

22


the baseline (of 1.8% passwords). The binary zxcvbn score reduces the guessing

advantage by a modest amount. Overall, all three measurements combined —

RockYou top 5,000, Hashcat top 5,000, and the binary zxcvbn score — would

enable an attacker to guess ∆103 = 3.1% more passwords compared to not having

access to the password-derived measurements. Note that this is without any

password policy (except the minimum length requirement of 6 characters).

We also check the success rates for passwords containing at least three char-

acter classes (uppercase, lowercase, digits, and symbols), a common require-

ment. Both universities have three character classes as part of the password

policy for current students, staff, and faculty, although old alumni accounts may

not satisfy this requirement. When looking at passwords that meet this pol-

icy, only 0.8% of passwords are guessed within the first 1,000 guesses without

any password-derived information, and the combination of password-derived

fields brings this percentage up to just 1.3% (∆103 = 0.5%).

Next we discuss the success rate of an attacker for a large guessing budget

q ∈ [103, 109]. As before, the zxcvbn score can be damaging to the privacy of

user passwords, resulting in as high as a 20% increase in attacker success. The

binary zxcvbn score provides less information and never leads to more than a

2% increase in attacker success even with a very high number of guesses. Com-

bined measurements also lead to a bounded increase in the fraction of pass-

words cracked by the attacker, who can guess at most 4% more passwords with

the measurement information in an untargeted attack compared to an attacker

without the information. More importantly, most of the improvement occurs for

passwords guessed in less than 10,000 guesses because the statistics allow an at-

tacker to rule out many popular passwords when guessing already vulnerable,

23


weak passwords.

Thus we conclude that while including full zxcvbn would be too risky, the

increase in attacker success when given the other password strength measures

(with the binary zxcvbn score) is sufficiently small.

Edit distance measurements. Other measurements we consider are the edit

distance from breached passwords for the appropriate username and the edit

distance from other submitted passwords, including the correct ones. Early ver-

sions of Gossamer recorded the precise edit distance; we instead now suggest

quantizing to just indicate whether a submitted password is within edit distance

two of a breached password or other submitted password. Our current imple-

mentation does so, and we quantized the data in previously gathered logs.

To come to this conclusion, we observe that an attacker in our threat model

can check whether a username is in a breach. (Recall that we conservatively as-

sume the attacker can perfectly reidentify usernames and that they have access

to all the breach data used by the measurement system). If the username is not

in a breach, then the attacker can proceed as above through a general guess list.

If it is, then the attacker can mount a targeted credential tweaking attack in the

following way. They start by generating guesses using the state-of-the-art cre-

dential tweaking attack based on pass2path [93], seeding it with the passwords

in the breach for the targeted username. They can then use the edit distance

fields in the log data to filter this guess list by removing any guesses that are

the incorrect edit distance from the breached passwords, or not within the ap-

propriate quantized edit distance from the breached password, depending on

which schema option we are evaluating.

24


To assess the improvement in attacks, we compare the modified attack to a

baseline one that just applies pass2path. For simulations, we randomly selected

10,000 username-password pairs from the 20% test data described above, but

conditioned on the usernames also appearing in the 80% leaked dataset. We

exclude cases where the password is the same on both sides of the split (such

passwords would be easily guessed via credential stuffing). For each target

username-password pair, we give the attacker the edit distance of the target

password from all the breached passwords associated with that username. Then

λed
q is the success rate for pass2path with guesses filtered to only those that have

matching edit distance information. The baseline attacker’s success λ0
q is vanilla

pass2path’s success rate.

The baseline success rate in the online setting is λ0
103 = 21.6%. With precise

edit distances knowledge, the attacker can instead achieve λed
103 = 85.2%. This

is an uncomfortably large jump in attacker’s success. Quantizing to just edit

distance ≤ 2 yields a 22.5% success rate, just a 0.9% increase over the baseline.

For larger query budgets (relevant in offline cracking attacks), the improvement

for quantized edit distance is even less — with a 0.25% increase over the baseline

for q = 108.

Discussion. Our simulations suggest that including even moderately granular

data such as zxcvbn scores or edit distances in log files might be a risk factor

in the case that persistant logs are somehow leaked to adversaries. Therefore

we suggest a conservative approach and select logging schemas that avoid im-

proving guessing attacks significantly. contribution of this work, as it allows

reasoning in a structured way about risk of password-derived fields.

One current limitation of the framework is that it focuses thus far on attacks

25


against a single user, and so we do not yet know how best to assess the risk

of measures capturing similarity of passwords across usernames. Future work

could look at extending the framework to look at multi-user attacks. Another

limitation is that we rely on best-known attacks (such as pass2path), and as such

future work could yield improved attacks. It is therefore important to retain the

ability to sanitize or delete older logs should new results surface previously

unforeseen risks.

A full list of measurements logged by Gossamer can be found in Figure A.1.

2.5 Deploying Gossamer

We partnered with security engineers at two large universities to deploy

Gossamer and collect data, beginning in December 2020. We collected data for

seven months at U1 and three months at U2 (a shorter timeframe due to the

preferences of the IT department) . This timeframe encompassed mid-semester,

exam, and break periods, so we were able to observe different levels of activi-

ties. Throughout this timeframe, we occasionally made updates to some of the

measurement mechanisms; these updates were done after careful review of the

code by pulling from a git repository accessible to the virtual machine running

the measurement service, and any summary statistic that may be affected an

update was calculated using the data after the update was made .

Review process and ethical considerations. Although our research could help

understand characteristics of password submissions received by login systems,

we must also consider the privacy and security risks of inadvertent exposure

26


0

0.5

1

Lo
gi

n
at

te
m

pt
s

pe
r

da
y

(×
10

5 )

U1

2020-12-06

2020-12-26

2021-01-15

2021-02-04

2021-02-24

2021-03-16

2021-04-05

2021-04-25

2021-05-15

2021-06-04

2021-06-240

2

4 Failures
Successes

U2

Figure 2.3: Successful and failed login attempts per day at two universities (for a
total of 196 K unique users at U1 and 309 K at U2). Potential high-volume attack
campaigns we discovered are also shown.

of the sensitive information in the logs. Our design of Gossamer, therefore,

strives to balance these risks with the benefits of protecting user accounts. We

conducted a nearly year-long design and implementation effort that entailed

a number of external review processes to help guide our research study and

reduce potential privacy and security risks to users.

As noted in Section 2.3, Gossamer logs do not include PII. Researchers (with

one exception mentioned below for compromise reporting) do not have access

to usernames or email addresses. They do have access to IP addresses from

where login requests originate.

To protect sensitive data, Gossamer uses a layered approach to security with

an encrypted disk, strict firewall rules, and MFA login to access the analysis ser-

vice. Moreover, Gossamer never stores plaintext passwords on disk; it instead

stores a set of password-derived measurements in the persistent database for

analysis. Even in the case that these are somehow leaked the chosen measure-

ments represent little additional risk of password disclosure (see Section 2.4).

The research group had ground rules for handling the data, including min-

imizing granularity of information shared outside the confines of analysis sys-

27


tems, restricting persistent database access to only a subset of researchers, and

setting clear expectations about (in)appropriate use of access (e.g., prohibiting

re-identification attacks attempting to identify a user from the obfuscated data).

We also went through a careful vetting and approval process with university

leadership and their information technology (IT) security departments. This in-

volved presenting to the university leaderships about the goals, design, and

procedures associated with the measurement studies, and working closely with

our universities’ security engineers to design and implement Gossamer. Sat-

isfied with our process and the potential benefits to university account hold-

ers, we received approval from university leadership for the deployment of

Gossamer.

Before deploying the system, we submitted our study designs for IRB review

at each university. The study protocols at the two universities were slightly dif-

ferent in how we report to IT security compromised user accounts. At U1, re-

searchers never had access to plaintext usernames, and the security engineers

handled the decryption of reported (encrypted) usernames. Therefore, we re-

ceived an IRB exemption at U1, which found that the research study does not

qualify as human subjects research. At U2, security engineers requested that we

report the plaintext usernames for operational simplicity. One researcher de-

crypts the username before reporting compromises. Therefore, we received IRB

approval at U2, finding the study as a minimal risk human subject research. We

did not seek consent from individual users, as we do not know their usernames

or email addresses. Instead we obtained explicit approval for conducting this

study from the universities’ leadership and IT departments who provide the lo-

gin services. Such waivers of explicit consent from participants were used in

28


prior work (e.g., [66]) and are discussed as an acceptable approach in the Menlo

report [70].

Deployment configuration. As mentioned in Section 2.3, Gossamer consists

of two services — a measurement service and an analysis service. For U1, we

used Amazon EC2 in a virtual private network to host the measurement ser-

vice as recommended by the U1 security engineers, and we used an on-premise

dedicated server for analysis. For U2, we used two separate on-premise virtual

machines for running the measurement service and the analysis service. The

persistent storage is hosted inside the VM running the analysis.

Strict firewalls were set up for all machines (on-premise or EC2) that block

all incoming and outgoing requests except from inside the private network of

the respective universities. Only a subset of the researchers have access to these

machines via SSH, and the access requires two-factor authentication. All incom-

ing and outgoing network connections are logged by the firewall and regularly

checked by the security engineers for signs of intrusion attempts. Relevant se-

curity patches are checked regularly and applied immediately. The persistent

storage uses a MySQL database, which uses TLS for all communication and the

underlying disk is encrypted at rest. The login server and the measurement

service also use TLS with pinned certificates [47] for all communications. All

passwords used are longer than 12 characters, randomly generated, and stored

in a password manager. The security engineers also ran scans on the code of

Gossamer and the VMs to check for known vulnerabilities.

29


U1 U2

Session Statistics (with a 360s threshold)
Avg. session size 2.25 2.21
99th percentile session size 10 6
% abandoned sessions 5.47% 1.96%

User Statistics
# of unique usernames seen 196,424 309,801
# of valid users 177,286 169,774
# of active users 130,695 110,476
% valid users w/ weak passwords 0.03% 0.06%
% valid users w/ username in breach† 5.79% 3.27%
% valid users w/ passwords in breach† 2.92% 9.34%
% valid users w/ user-pw pair in breach† 0.01% 0.15%
% valid users w/ tweaked password 1.22% 0.66%
% valid users who may be using password managers 24.76% 27.34%
Avg. devices per user per day 1.51 1.91
Avg. devicesper user (over whole time period) 14.51 14.97
Avg. num unique passwords per user 1.96 9.59

Login Statistics
Avg. Login requests per day 49,302 246,274
Avg. # of submitted usernames per day 24,822 61,798
% of requests succeeded 94.99% 92.35%
Avg. # requests per day per user 1.99 2.05
% of requests from mobile device 31.00% 35.57%
% IPs from VPNs, proxies, or Tor nodes 22.08% 4.91%

Submitted password statistics
% req. w/ password in breach† 2.71% 0.10%
% req. w/ user-pwd pair in breach† 0.07% 0.01%
% failed req. containing a typo 29.67% 12.04%
% failed req. (with edit dist msmt) containing a typo 62.39% 58.37%
% failed req. from mobile device containing a typo 38.63% 38.36%
% failed req. (with edit dist msmt) from mobile device containing a typo 72.69% 81.87%
% pwds tweaked 0.92% 0.34%

† Statistics related to breach data were calculated for data beginning Jan 27 ’21 after we added more breach data to the
instrumentation.

Figure 2.4: Summary statistics of login requests recorded by Gossamer at U1
and U2. More statistics can be found in Figure A.3.

2.6 Login Statistics, Patterns, & Observations

We collected data at U1 for seven months and at U2 for three months starting in

December 2020. Overall, we observed 10 million requests at U1 and 24 million

30


requests at U2. We show the daily successful and failed requests in Figure 3.7.

On average, 5–8% of requests failed; however, on a few days we observed a

spike in the failure rate (> 50%). These were high volume attacks that we dis-

cuss below. After removing the noise caused by these attacks, we found that

users submitted login requests on average 1.99 times per day at U1 and 2.05

times per day at U2. We see fewer login requests at U1 because some login re-

quests are handled via a different authentication server that is outside of our

instrumentation.

We saw 196 K unique usernames submitted to U1, out of which 177 K were

valid usernames, of which 154 K users had a successful login at least once dur-

ing our instrumentation period. At U2, we saw 170 K users with at least one

successful login during our instrumentation period and 15 K additional users

who tried to log in with a valid username but could not complete login due to

errors. We consider a user active if they have at least one successful login every

month. We found about 130 K active users at U1 and 110 K at U2.

Requests originated from 526 K unique IP addresses (approximately 88 K per

month) at U1 and 513 K (171 K per month) at U2, and they were associated with

44 K unique user agents at U1 and 31 K at U2. We also used the GeoIP2 database

published by MaxMind [80] to find location information for IPs; the majority

of login requests at both schools originated from within the United States, fol-

lowed by China and India. Summary statistics that we report on these login

requests can be found in Figure 2.4, with additional statistics in Figure A.3.

31


2.6.1 Characteristics of high volume attacks

We observed three high volume attacks during our instrumentation. Since we

are focusing on understanding the full picture of user behavior, we first report

on these attacks and then remove them from the dataset to avoid skewing other

statistics we report. In total, we removed 54 K requests at U1 and 81 K requests

at U2.

Attack 1: Naïve, multiple-IP, high-volume credential stuffing attack campaign at U1.

Over January 25–26, 2021, four IPs conducted a credential stuffing campaign

consisting of 36 K attempts to 19 K users. Two of these IPs were identified by the

MaxMind GeoIP database [81] as coming from NordVPN, one from Microsoft,

and one from Inwi Mobile [5]. All IPs were active in non-overlapping time pe-

riods and submitted up to 100 requests per second. More than 99% of requests

from these IPs in this time frame had null user agents. Almost a third of the

attempts (29%) of submitted username-password pairs from these IPs were di-

rectly from prior breaches, and 60% of submitted passwords were present in

prior breaches.

The attack campaign successfully compromised 23 accounts at U1, all of

which had been flagged by security engineers and had their passwords scram-

bled to prevent access. We observed some duplicate username-password pairs

submitted across IPs; thus we hypothesize that the attacker used an automated

script that iterates through an unfiltered list of breach data with a variety of IPs.

Attack 2: Credential stuffing attack using Sentry MBA tool against U1. An IP

hosted in Google Cloud [4] executed another credential stuffing attack at U1

on March 14th, 2021 from 18:42 - 22:22 UTC. This IP submitted over 17 K re-

32


quests to 15 K unique usernames, submitting approximately 80 requests per

second. Of these attempts, 22% of the submitted username-password pairs and

56% of the submitted passwords were directly from the breach data we used

with Gossamer. The attacker successfully guessed the passwords for 14 users.

Among those, 13 were already recorded as compromised by security engineers;

we reported the remaining encrypted username to the security engineers as a

potentially compromised account.

Although both Attacks 1 and 2 were credential stuffing campaigns, we sus-

pect that the respective attackers were using two different sets of breach data,

as there was little overlap in the users targeted.

We noticed that the attack traffic in this campaign was evenly split between

five distinct user agents that were not present in the rest of our data; these five

user agents are the default user agents for a tool called Sentry MBA [110]. Sentry

MBA is a credential stuffing tool where the user can specify a list of usernames

and passwords, a config file for specifying the HTML fields on the target login

page, and a list of proxies from which to send traffic.

Attack 3: High volume, password spraying attack at U2. On December 22nd, 2020,

a total of 12 unique IPs belonging to Digital Ocean Cloud [3] carried out a high

volume password spraying attack by targeting 76 K unique users with 169 K re-

quests at an average of 262 requests per minute. The attacker pretended to send

requests from SMTP and IMAP clients. The number of usernames targeted by

each IP was evenly distributed among the IPs, and these IPs were active only

during the attack period. Less than 3% of submitted passwords were from prior

breaches, and none of the submitted usernames were present in prior breaches.

We also noticed that all of the login attempts were for non-U2 usernames, in-

33


dicating that the breach data was not filtered before the attack. Consequently

during this attack campaign, there were no successful logins.

Filtering out attacks. We remove requests from IP addresses corresponding to

these three attacks on the respective days for all subsequent statistics to avoid

skewing the statistics. Although we did have access to compromised user-

names, there was no clear way to determine which IP address compromised

a given user. In Section A.0.2, we show why excluding all IP addresses that con-

tacted a compromised user would not have significantly affected the statistics.

This filtering approach works for our setup but may not generalize to other sys-

tems. Similarly, although there could be other low-volume attacks that we did

not detect, we believe they will not impact the statistics we report.

2.6.2 User and client statistics

Gossamer observed login attempts for 196 K unique usernames at U1. Among

the users who could never login, 42 K (21%) of them used a username that does

not exist in the U1 login database; however, only 0.1% of these usernames ap-

peared in our breach data. At U2, we saw 310 K users who tried to login, 170 K

(55%) that were successful, and a staggering 139 K (45%) usernames that do not

exist. We are not sure what caused such a high volume of login submissions

with invalid usernames.

More summary statistics on these login requests can be found in Figure 2.4

and are elaborated in more detail below.

IP and client characteristics. Requests originated from 539 K unique IP ad-

34


dresses at U1 and 2.47 M at U2. There were 44 K IPs that sent requests to both

universities; of these, 621 IPs had no successful logins.

We also recorded the user agent strings present in the HTTP header of the

login requests. We observed about 5 K unique user agent strings at U1, and

about 31 K at U2. The top 10 user agents at each school are listed in Figure A.4.

In 22% of requests at U2, the user agent string was set to an empty string; all

of these requests were through basic authentication [51]. U1 does not support

basic authentication, and less than 1% of the requests had an empty user agent

field. We suspect that attempts with an empty user agent field were submitted

via an automated script that neglected to set the user agent field. A breakdown

of operating systems mentioned in the user agent string appears in Figure A.2.

Prior work [52] has suggested that attempts from multiple devices for a user

is suspicious and should require additional authentication steps. Freeman et

al. defined a device as a pair of a unique IP address and a user agent [52]. We

observed that on average, 3% of users at U1 and 19% of users at U2 log in from

two different devices per day. 11% of users at U1 and 6% of users at U2 have

logged from more than fifty devices over the course of the study period.

To find what fraction of IPs were public VPNs, proxies, or Tor exit nodes, we

used the Blackbox5 API. We found that 22% of IPs at U1 were VPNs, proxies,

or Tor exit nodes. However, at U1, 16% of all IPs are 10-space IPs, all of which

are marked as VPNs/proxies, contributing to this high percentage. These IP ad-

dresses are set up by the university IT department and are not accessible outside

of the university network. At U2 about 5% IPs were flagged as VPN/proxies by

Blackbox API, and 3.6% of all IPs are 10-space IP addresses. Because users may

5https://blackbox.ipinfo.app/

35


be sharing a VPN or proxy network, it is possible that some users may have the

same IP; thus we report most statistics based on device, since it is less likely two

different users would also have the same user agent.

The IP address of a user’s device might change over time due to DHCP churn

or switching between multiple networks. Therefore a login from the same de-

vice and browser may appear as if it’s from multiple different devices, accord-

ing to the previous definition. So, we conservatively estimate the number of

different browsers per user based on the user agent string. At U1, 21% of valid

users, and at U2 29%, successfully logged in with ten or more user agents. Thus,

many users log in from different browsers, and login security mechanisms must

consider this while designing policies.

We were also interested in determining what percentage of users logged in

on their mobile device (smartphone or tablet) versus on a laptop or desktop

computer. To do this, we used a regular expression that matched mobile de-

vices on the user agent. We found that at U1, 31% of requests originated from

a mobile device; out of those, 84% of requests originated from iOS devices and

16% from Android devices. At U2, 18% of requests (80% of which were iOS and

20% Android) came from a mobile device. At both schools, iOS devices were

significantly more popular.

These findings may be useful in developing attack detection mechanisms.

For example, if a user always logs in from an Android device, it may be suspi-

cious if they attempt to log in from an iOS device.

36


100 101 102

0

0.2

0.4

0.6

0.8

1

Number of unique passwords per day

Fr
ac

ti
on

of
re

qu
es

ts

Per IP address (U1)
Per username (U1)
Per username (U2)

Figure 2.5: Cumulative distributions of unique passwords per username and IP
for February 2021. The X-axis is log scale.

2.6.3 Password security

The strength and guessability of a password directly affect the security of a

user’s account. Using the visibility into passwords provided by Gossamer, we

investigate password security in terms of strength, the number of unique pass-

words submitted for a username, and the use of breached credentials.

Password strength. We used four different measurements in Gossamer to mea-

sure password strength. We record the bucketized zxcvbn score, as described in

Section 2.4; whether the password appears in the top 5k most common RockYou

passwords; whether the password appears in the top 5k passwords generated

by Hashcat on the RockYou dataset with the best64 ruleset [11]; and finally,

whether the password appears in the top 1000 most common passwords in our

breach compilation dataset.

Exact percentages of requests matching each of these password strength met-

rics can be found in Appendix ?? Figure A.3. In summary, we found that the

vast majority of valid users were using strong passwords by these metrics. Both

universities use strong password policies, requiring a minimum length of 8 and

37


at U1 three different character classes; so this finding is not surprising.

Unique passwords. We also measure the number of unique passwords sub-

mitted for a given username or from a given IP address on a certain day.

We observe that a median of one and a 99th percentile of seven unique pass-

words are submitted against a single username per day. Slightly more unique

passwords — a median of one and 99th percentile of nine — are submitted from

a single IP address at U1 (unfortunately we do not have this measurement at U2

because of a configuration issue) . Both distributions are shown in Figure 2.5.

More unique passwords are submitted from a single IP address than for a sin-

gle username, which makes sense because an IP address may submit to multiple

different users, especially if it is a VPN/proxy.

We noted earlier that some organizations may lock accounts that receive a

certain number of failed attempts in a given time period. However, these lock-

out mechanisms do not take into account whether a user submitted the same

password multiple times. Creating a lockout threshold based on the number of

unique passwords tried instead of the total number of attempts would improve

usability without any improvement in attacker success rates.

To demonstrate this, we consider a lockout threshold of 10 and calculate the

number of accounts that would have been locked by counting the number of

attempts on a given day instead of the number of unique passwords. We find

that 17,863 accounts would have been locked under the simple policy, versus

just 2,220 accounts under the policy that counts by unique passwords — an 88%

decrease. Similarly for a lockout threshold of 5, implementing the more complex

policy would decrease the number of lockouts by 91% from 280,360 to 23,919.

38


Such a lockout policy could be implemented relatively simply by storing the

password hashes submitted for a given user for the last day. Then a lockout pol-

icy would check the number of unique hashes submitted in the designated time

period. This new policy would improve usability with only a slight increase in

implementation complexity and no benefit to a potential attacker.

Breached credential use. To measure the usage of breached credentials,

Gossamer records for each attempt whether the username, password, or

username-password pair appeared in our breach dataset. Usernames were

stripped of a domain name, if applicable, before performing the match. We

find that nearly 6% of valid users at U1 and 3% of valid users at U2 appear in

our breach dataset, indicating that they have appeared in some data breach in

the past. Additionally, we find that 3% of submitted passwords at U1 and 0.1%

at U2 appeared in a breach; finally, 0.07 % of username-password pairs at U1

and 0.01% at U2 appeared in a breach. Most of these were failed attempts, but

we find that 23 users (0.01% of valid users) at U1 and 254 users at U2 (0.15% of

valid users) were still using a breached password as their actual password.

These users are vulnerable to credential stuffing attacks, and this motivates

the deployment of breach alerting tools in login systems that would prevent

users from continuing to use breached passwords.

For each attempt, we also check for tweaked passwords by recording the

similarity of the submitted password to a breached password using three met-

rics — edit distance, PPSM similarity, and pass2path rank [93]. We define a pass-

word as being tweaked if the edit distance is less than or equal to two, the PPSM

similarity is zero (indicating that they are similar), or the pass2path rank is less

than or equal to 1,000. We found that at U1, 0.92% of all passwords submit-

39


ted were tweaked, and 2,164 users (1.22% of valid users) were using a tweaked

password. At U2, 0.34% of passwords submitted were tweaked, and 1,125 users

(0.66% of valid users) were using a tweaked password.

However, due to the design of the system, we can only determine whether

a user has a tweaked password if they had a previously breached password in

our dataset to which we can compare. When we take this into account, we find

that nearly 7% of users at both universities with at least one breached password

were using a tweaked password. A recent study by Pal et al. [93] reported a

slightly higher rate of 8.4%.

We hypothesize that users may append a single character to their old,

breached password, causing them to be vulnerable to credential tweaking at-

tacks, in which an attacker tries close variants of breached passwords in an at-

tempt to guess user passwords. Implementing a breach alerting system such as

Might I Get Pwned [94] or using a personalized password strength meter [93] in

the password change flow could alert users when they attempt to change their

password to one that is vulnerable to credential tweaking attacks.

Password changes. Neither of the universities have any periodic password

change policies. At U2, users are recommended to change their passwords twice

a year, but this is not enforced. New passwords are recommended to not be the

same or similar to any previous passwords, but this is not enforced either. Al-

though we did not instrument the password change system, we are able to

estimate a subset of the password changes made using the password’s edit dis-

tance from the previous submissions for a user logged by Gossamer. Because

previous submissions are cleared every 24 hours, though, we can only use this

measurement in instances when the user logged in with their old and new pass-

40


word on the same day. We find 9,011 total password change events made by

2,893 unique users — 78 of which appeared in the compromise database — at

U1. At U2, we saw 42,827 total password change events made by 7,812 unique

users — 211 of which appeared in the compromise database. Of the password

change events at U1, 100 resulted in a new password that was in our breach

dataset, and 125 resulted in a new password that was a tweaked version of one

in our breach dataset. At U1, at least 1.9% of valid users changed their pass-

word at one point during the seven month measurement period (about 1.3 K per

month) and at U2, at least 4.6% (14 K per month). This is only a lower bound,

since we can only measure a fraction of password changes, and prior work is

consistent with this bound, reporting higher password change rates [30, 48].

2.6.4 Usability

A longstanding complaint about passwords is their usability. Users often have

trouble remembering strong passwords, and they may commit typos especially

if they do choose a stronger, more complex password [38, 104, 105]. A key

benefit of Gossamer is that it provides a new observation point for measuring

password-based login usability.

Login sessions. A user can retry logging in if a login attempt fails (probably

due to submitting an incorrect password). To better understand a user’s pattern

of login retries, we define a login session as a sequence of login attempts to a

username from the same device ending either in a successful attempt or in a

period of inactivity (indicating that the user has given up after a series of failed

41


100 101 102 103

0.2

0.4

0.6

0.8

1

Interarrival time (seconds, in log scale)

Fr
ac

ti
on

of
re

qu
es

ts

U1

No inactivity thresholds
Threshold = 600s
Threshold = 360s
Threshold = 300s
Threshold = 120s
Threshold = 60s

100 101 102 103

0

0.2

0.4

0.6

0.8

1

Interarrival time (seconds, in log scale)

Fr
ac

ti
on

of
re

qu
es

ts

U2

No inactivity thresholds
Threshold = 600s
Threshold = 360s
Threshold = 300s
Threshold = 120s
Threshold = 60s

Figure 2.6: Cumulative distribution of interarrival time between requests for
different inactivity thresholds for February 2021. The X-axis is log scale.

attempts). We define a device as a tuple of an IP address and a user agent. It

is possible that more than one user may use the same internet connection and

thus have the same IP address. In this case, they may appear in the same login

session if they both submit login requests to the same username in a period of

time and share the same user agent as well; however, this seems an unlikely

scenario. The definition of login session is parameterized by the length of the

period of inactivity, which we call the inactivity threshold. We refer to a session

that did not end in a successful login as an abandoned session and the number of

login attempts in a login session as session size.

We examine the time between successive login requests, termed as inter-

arrival time, to determine the inactivity threshold that provides stable session

size. Because a successful login request indicates the end of a login session, we

specifically investigate pairs of successive login requests where the first request

was not successful. We show this distribution in Figure 2.6 at both universities

for the month of February 2021, as a representative month for which the data

collection periods overlap. We limit the X-axis to 15 minutes for easier viewing.

42


Inactivity threshold Average session size
(seconds) U1 U2

60 2.16 2.44
360 2.29 2.22
600 2.32 2.99

Figure 2.7: Average session sizes (number of attempts per session) for different
inactivity thresholds

At U1, 81% of successive attempts are executed within 60 seconds of the previ-

ous attempt, 90% of requests are executed within 361 seconds (6 minutes), and

95% of requests are executed within 13 minutes.

We can also see by looking at the average session size for multiple thresholds

(Figure 2.7) that the choice of inactivity threshold does not significantly affect

the session size. Thus, we choose 6 minutes as our inactivity threshold for future

statistics involving sessions, since 90% of successive attempts occurred within

that window.

With this definition, we find that 51% of sessions at U1 required more than

one attempt, and nearly 5% of sessions were abandoned. For U2, 38% of ses-

sions required more than one attempt, and 2% of sessions were abandoned.

With so many sessions requiring more than one attempt, there is much room for

improvement in usability of password-based logins, which we elaborate on in

our discussions on password typos and lockout thresholds.

Retries. The session size indicates how many retries were required before a

successful login or the user giving up. Thus we show the average session sizes

for different inactivity thresholds in Figure 2.7. At U1, we found that 22% of

successful sessions and 14% of abandoned sessions required more than one at-

tempt; at U2, 38% of sessions required more than one attempt. Some login

43


100 101 102 103

0

0.2

0.4

0.6

0.8

1

Number of sessions per user per day

C
um

ul
at

iv
e

fr
ac

ti
on

of
us

er
na

m
e-

da
te

pa
ir

s

U1
U2

Figure 2.8: Cumulative distributions of number of sessions per user per day for
an inactivity threshold of 360s for February 2021. The X-axis is log scale.

systems have a lockout policy, in which they lock a user’s account after a cer-

tain number of failed attempts have been made. In this case, the 99th percentile

of session size is 10 attempts per session, providing empirical support for a stan-

dard choice for lockout threshold.

We measure the number of sessions per user in a single day, as this will

inform how often a user needs to go through the login process, how important

it is to have a frictionless login process, and how effective single sign-on is. We

find that a user attempts to log in a median of one session per day and a 99th

percentile of six sessions per day; we show the distribution of the number of

sessions per user per day in Figure 2.8. However, the tail end of the graph shows

that some users attempted up to 112 sessions in a single day. Given that the

99th percentile is six sessions per day, this is probably indicative of suspicious

behavior, and in future work this metric may be used in conjunction with the

other metrics we’ve reported to further investigate possible attacks.

To investigate the password-based login usability of mobile devices, we

compare the session size and frequency of sessions per day for mobile and

non-mobile sessions. We break mobile devicesdown even further by compar-

44


ing iPhone and Android devices; these distributions are shown in Figure 2.9.

We can see from these graphs that users of iOS devices tend to require more

attempts than Android ones.

Password typos. One of the areas of friction in a login system may be password

typos, especially for stronger, more complex passwords. With the ephemeral

datastore in Gossamer, we computed whether a password submission was

within edit distance two of the actual password.

We can estimate the number of password typos using this measurement;

however, we only have this measurement for users that later logged in success-

fully on the same day, which is only 45% of all failed attempts at U1. Thus we

find that at U1, 62% of failed requests where we have this measurement con-

tained typos of edit distance two or less, or 30% of all failed requests. In a study

at Dropbox, Chatterjee et al. [38] estimated the number of typos to be at least 9%

of all failed requests.

Requests originating from mobile devices tend to contain typos even more

frequently. Out of all failed mobile requests with the edit distance measurement

at U1, 72% were typos (or 39% of all failed mobile requests).

When investigating sessions with two or more attempts, we found that 12%

of eventually successful sessions of size two or greater initially failed because of

a typo. The remaining failures may be explained by user memory errors (using

the wrong password). These findings further underscore the utility of password

managers, since they help avoid both typos and memory errors.

Password managers. Although we cannot identify users with password man-

45


2 4 6 8 10

0

0.2

0.4

0.6

0.8

1

Session size

Fr
ac

ti
on

of
re

qu
es

ts

All mobile sessions
iOS sessions
Android sessions

Figure 2.9: Cumulative distributions of session sizes (number of attempts per
session) for mobile devices for February 2021. The X-axis for U2 is limited to
10 for comparison with U1 (although the maximum session size consisted of 54
attempts).

agers with certainty, we can find the number of users with a certain number of

successful login attempts and no failures, and we conjecture that this may be

an approximation. For example, we found 25% of valid users at U1 and 27% at

U2 have at least ten logins and never had a failed login over the course of our

measurements. Since, as we show in Figure 2.4, the majority of users — 99.97%

users in U1 and 99.94% users in U2 — are using strong passwords (according to

the zxcvbn score), we believe that these users are using password managers to

enter their passwords.

Duo logs. Both U1 and U2 have introduced the use of Duo two factor authen-

tication [13] to further strengthen account security. At U2, we were able to an-

alyze Duo logs for successful login attempts and combine them with Gossamer

logs to explore the tension between usability and security with respect to MFA.

Unfortunately, our logs and Duo log entries do not share any unique identifiers.

Instead, we try to match each successful login attempt to a corresponding Duo

push within two minutes of the login request originating from the same user. If

there were multiple matches to a single successful login attempt, we chose the

46


Duo push closest in time to the login request.

At U2, out of 15.9 M successful login requests, 62% were using Basic

Auth [51], which does not require users to enroll in two-factor authentication.

Among the remaining 6.0 M successful logins, we could match 89% of requests,

and the remaining 11% already had a previously obtained Duo cookie which

remained active for 12 hours after a successful authentication. Among the Duo

pushes we could successfully match with a login attempt, we found that 96.7%

were successful, 3.2% were denied, and only 46 (< 0.01%) were marked as fraud

by users. Users on average took 14 seconds to mark their Duo push as valid,

slowing down logins for honest users.

2.7 Conclusion

We designed, built, and carefully deployed Gossamer, a framework for securely

recording statistics about login requests and submitted passwords during lo-

gin. Our approach combines system security mechanisms, a simulation-based

approach to assessing risk of different measurements, and procedural mecha-

nisms to enable new kinds of measurement studies. In studies conducted at

two large universities in collaboration with their IT security teams, we were

able to gather first-of-their-kind measurements about login behavior that shed

light on usability, security, and attacker behavior.

Through these measurements, we observed several patterns in user login be-

havior. First, we saw that login friction is still high. Many users require more

than one attempt before successfully logging in, and Duo two-factor authenti-

cation added on average an additional 14 seconds to logins. One suggestion we

47


have for reducing login friction is to count unique passwords submitted for a

user, rather than every request; we find that this would reduce lockouts by 88%.

Second, we observe that reuse of breached credentials is a serious problem:

nearly 3% of valid users at U1 and more than 9% at U2 are using a breached

password that appears in our breach dataset. During our data collection pe-

riod, some users changed their passwords to either exactly equal to one of their

breached passwords or very similar to their breached passwords. To reduce this

number, we encourage the deployment of breach alerting services such as Might

I Get Pwned [94] that could indicate when a user’s password is vulnerable to

credential stuffing.

We also find that visibility into submitted passwords is helpful for designing

security mechanisms and informing new policies. Similar to how we proposed a

new lockout policy, authentication system developers can use a framework such

as Gossamer to test the effect of new policies before their deployment. Future

analyses with Gossamer may investigate how password-derived signals help in

detecting different types of password guessing attacks and the efficacy of Duo

two-factor authentication in stopping different types of attacks. Also, Gossamer

can be extended with additional measurements using the same framework and

simulation method for assessing the risk of different measurements. Thus both

the tool and approach can be useful in the future for helping analysts detect and

develop mitigations against attacks. We will be releasing Gossamer as a public,

open-source project.

48


CHAPTER 3

ARAÑA: DISCOVERING AND CHARACTERIZING PASSWORD

GUESSING ATTACKS IN PRACTICE

3.1 Introduction

Remote password guessing attacks are one of the most effective and preva-

lent causes of account compromise for password-based authentication sys-

tems [111,124]. Password guessing attacks are easy to mount for attackers, who

may attempt logging in under known usernames with widely popular pass-

words or perform credential stuffing by submitting username-password pairs

that have appeared in previous breaches. Despite being straightforward, such

attacks can nevertheless be highly damaging. As a result, the most advanced lo-

gin services in practice use proprietary mechanisms in an attempt to detect ma-

licious logins using more than just the correctness of the submitted password,

e.g., via user risk profiles [52, 59, 111].

Improving such mechanisms requires understanding attacker behavior as

observed by login services — a delicate task given the sensitive nature of login

data, especially passwords. As such, no in-depth measurement studies of attack

behavior as seen from login services have been conducted.

In our prior work [29], we designed a new login service instrumentation

tool called Gossamer. It securely records information about login requests, in-

cluding certain carefully chosen statistics about the passwords used in the re-

quests. Gossamer was deployed for seven months at one university (U1) and

three months at another (U2). We analyzed various aspects of user login behav-


ior but were only able to identify three obvious high-volume attacks. How to

find and characterize more attacks was left as an open question.

In this work, we make progress on answering this open question. To do

so, we obtained approval—from our university IRBs and IT security teams—to

perform a fresh analysis of the datasets generated by the previous study. These

datasets include a rich amount of information on over 34 million login requests,

Duo two-factor authentication (2FA) logs (just for U2), and more than 2,000 com-

promise reports.

Even with the measurements provided by Gossamer, discovering attacks

represents a tricky bootstrapping problem. The compromise logs do not im-

plicate specific login requests and, more broadly, there is no ground truth any-

where in the dataset for attacks. This makes the supervised machine learning

approaches used in prior work [52, 59] inapplicable to our setting. At the same

time, the scale of the problem renders manual analysis prohibitive, and so prior

work only used a simple, known method for detecting attacks: Flag IP addresses

that individually flooded the login service with requests. This of course misses

many attacks.

We develop a new analysis pipeline that we call Araña. Crucially, it focuses

not on individual requests but sets of login requests emanating from the same

IP within a day. This ensures sufficient signal about client behavior for patterns

to emerge. We then built an application-specific filtering, clustering, and man-

ual analysis approach. We filter out likely-benign login request sets using cus-

tom heuristics, such as filtering out IP addresses exhibiting high login success

rates. We then use an unsupervised learning approach, specifically agglomera-

tive clustering [87], with a custom distance function tailored to login sets. This

50


helps us identify clusters of login sets that can be manually analyzed with little

effort to discover attack campaigns — sets of login requests likely to be submitted

by the same attacker.

Araña is effective at finding these attack campaigns, including coordinated

attacks involving multiple IP addresses and those spread out over long stretches

of time. We use it to discover and characterize a diverse set of 29 attack clusters.

Many are high volume credential stuffing attacks that submit on average of one

password per targeted username and exhibit a notable fraction of username-

password pairs appearing in breaches. In addition to the more simplistic stuff-

ing attacks that quickly flood requests from a handful of IP addresses, Araña

allows for discovering widely distributed attacks that may originate from hun-

dreds of different IP addresses. Some of these attacks try to be “low and slow”,

submitting a low number of requests per day at a slow rate. This confirms that

credential stuffing attackers attempt to evade countermeasures that focus on in-

dividual high volume IPs. These credential stuffing attacks are unfortunately

effective: We uncover hundreds of potential successful logins by attackers for

usernames not flagged as compromised already by the security engineers.

We are also able to discover lower volume, targeted attacks by focusing on

Araña-identified clusters that exhibit a large number of requests to individual

usernames. For example, we identified an attack campaign made up of two

clusters that targeted 127 users with on average 25 guessed passwords per user.

These attacks included successful logins, suggesting this targeted strategy can

also work for attackers.

We discuss many other attack campaigns in the body. Overall, our anal-

yses highlight a number of important takeaways for authentication system

51


designers. First, they suggest that, perhaps unsurprisingly, credential stuff-

ing attacks remain a primary vector for account compromise. This, plus the

fact that most compromised users in our study also had passwords in known

breaches, underscores the urgent need for broader use of breach alerting APIs

(e.g., [63, 76, 111, 113]).

Second, we saw a large number of attacks on Microsoft’s basic authentication

endpoint at U2. It is the least protected of all of U2’s authentication services, as

it does not support rate limiting or Duo two-factor authentication. It is also

easy for attackers to target any organization using basic authentication by just

changing the destination URL. We observed that attackers regularly target such

weak points, underlining a challenge for large organizations with heterogenous

authentication infrastructure.

Third, mechanisms that look for a large number of submissions per unit time

from an IP or to a username are ineffective against many attacks already being

deployed in practice. A better approach would be to work towards operational-

izing mechanisms like Araña to detect in (near) real time distributed and sneaky

attacks. However, future work will be needed to explore how to make such

mechanisms robust against evasion by future adversaries.

Summary. Our contributions include the following:

• We perform the first in-depth analysis of a dataset with over 34 million login

events generated by a recent system called Gossamer to discover and charac-

terize remote password guessing attacks.

• We design an analysis framework called Araña that shows how to filter and

cluster Gossamer logs, enabling easy manual analysis and identification of

attack campaigns.

52


• We use Araña to discover and characterize 29 attack clusters against two ma-

jor universities that compromised hundreds of user accounts in total.

• We identify key characteristics and patterns of attacks received by authentica-

tion systems at these universities, and we discuss how authentication systems

should evolve to counter such threats.

Our work has already had some practical impact in terms of discovering new at-

tacks. We have worked with security engineers from the two universities to per-

form responsible disclosure of potentially compromised usernames. We believe

that Araña will be useful in developing more robust attack detection methods,

and we release it as a public, open source project [1].

3.2 Background and Related Work

Passwords remain the most widely used mechanism for user authentication,

despite efforts to move past them [31]. We focus our discussion on the literature

most closely related to our topic: characterizing password guessing attacks as

seen by login services.

Measurement studies on user passwords. User behavior with respect to pass-

word choice has been extensively researched. Many studies simulate login

services (e.g., via Mechanical Turk) to perform user studies [69, 72]. Others

look at password breach data to characterize aspects of user password selec-

tion [42, 122]. A handful of studies have measured user passwords in real de-

ployments [31, 39, 48, 83].

Most recently, the Gossamer system [29] was used to measure not only le-

53


gitimate user password strength, but also various user password submission

behaviors as observed by the single-sign on (SSO) login services at two large

universities. This project uses the same measurement datasets as Gossamer but

focuses on characterizing malicious login behavior.

Password guessing attacks. The literature discussed so far concerned itself

with legitimate user behavior, trying in part to assess whether users are select-

ing passwords that resist various types of guessing attacks. A traditional focus

has been on offline password cracking attacks, which occur when an attacker at-

tempts to crack password hashes found in exposed password hash databases

using tools like Hashcat [106] and John the Ripper [92]. Researchers have also

developed natural language processing techniques to improve guess genera-

tion [60, 71, 79, 85, 88, 122].

Breached username-password pairs — either obtained via offline cracking,

gathered via phishing or malware, or stolen from another web service stor-

ing plaintext passwords — can be used in credential stuffing attacks, where an

attacker tries to log into a system using breached username-password pairs.

Users frequently reuse passwords across multiple services [42, 93, 119], mak-

ing credential stuffing one of the most prevalent forms of account compromise

attacks [111, 124]. In an effort to reduce credential stuffing attacks, breach alert-

ing services such as HaveIBeenPwned [63], Google Password Checkup [112],

and Cloudflare’s exposed credential checks [94, 113] provide APIs that check

whether a user’s passwords have been compromised.

Credential stuffing is one kind of online guessing attack — an attack that re-

motely attempts to log into a service with guessed credentials. Other examples

of guessing attacks include password spraying attacks that submit a few very pop-

54


ular passwords against a large number of user accounts and credential tweaking

attacks [93] that submit slight variants of breached passwords.

Detecting malicious logins. Resisting online password guessing attacks re-

quires determining which login attempts are malicious. As Bonneau et al. [32]

discuss, login services increasingly should treat login as more of a classifica-

tion problem, taking into account more than just the correctness of a submitted

password. But only a handful of prior works have focused on how to do so.

Freeman et al. [52] were the first to report on a statistical framework using

the client IP address and user agent to differentiate between valid and invalid

login attempts; this study used real-world LinkedIn login data, but did not in-

clude password-based features. They also do not report on observed attack

campaigns. Schechter et al. [102] built a malicious login detection system that

also utilizes password-based features, including differentiating password typos

from other failures and using privacy-preserving data structures such as a bi-

nomial ladder filter for detecting frequently used passwords. Their study used

simulated data to argue the system’s efficacy at detecting malicious logins. Fi-

nally, Thomas et al. [111] studied underground forums and tools used to steal

credentials — their only measurements using logins are used to report on the

lack of evidence of targeted guessing attacks that try multiple queries against a

single account.

The Gossamer paper [29] reports on three attack campaigns that are easily

detected using known techniques (i.e., a huge number of requests from an IP

in a relatively short amount of time). Such techniques will only catch obvious

attacks and are blind to “low and slow” attacks that purposefully use a small

number of guesses per target account (low) and per unit time (slow), or dis-

55


tributed attacks that perform guesses from many different IP addresses. While

missing these attacks does not appear to affect the results on benign user be-

havior reported in [29], understanding attacker behavior requires finding and

investigating these attacks.

In summary, no prior work has characterized the behavior of password

guessing attacks as seen from login services.

Two-factor authentication. Although effective at preventing account com-

promise, two-factor authentication (2FA) has yet to achieve widespread adop-

tion [40, 44], and user friction is still very high [19, 45]. As we show, many older

accounts at universities are not enrolled in 2FA; and more importantly, even

when 2FA is used, recent attacks have shown that attackers can spoof push-

based 2FA and hide spoofed pushes by sending them soon after the victim has

logged in [67]. Regardless of whether 2FA is bypassed in an attack, we would

like to know when an attacker successfully guesses a user’s password so that IT

security personnel and/or the user can take preventative steps such as chang-

ing the password for the indicated account and for any other account that uses

the same (or a similar) password.

Unsupervised techniques for attack detection. In this work, we therefore de-

velop a new approach for detecting and characterizing password guessing at-

tack campaigns using Gossamer logs. As we explain in Section 3.4.2, we did

not have success using supervised machine learning approaches and so instead

focus on unsupervised techniques, which have a long history of use in attack

detection. For example, they are frequently used in intrusion detection systems

(c.f., [35]) and for various kinds of anomaly detection (c.f., [21]). To the best of

56


our knowledge, no prior work has developed mechanisms to detect whether

password-based logins are malicious.

3.3 Gossamer Logs

In this work, we use a dataset of login requests compiled from two universities

(U1 and U2) via our prior work Gossamer [29]. For completeness, we provide

some details about Gossamer and how the resulting datasets were collected. We

refer readers to [29] for more details.

Gossamer logs. Our prior work introduced a new, privacy-preserving instru-

mentation approach for use with login systems [29]. The resulting system,

Gossamer, provides a secure way to collect measurement statistics about login

requests, including a subset of HTTP headers, source IP address, target user-

name, success or failure of a login request, and carefully chosen measurements

on the submitted password. Gossamer uses two levels of storage to provide ex-

tra security for particularly sensitive information. The submitted passwords are

encrypted and stored in an in-memory database. They are cryptographically

erased every 24 hours, providing a good trade-off between the ability to calcu-

late password statistics over one-day windows (e.g., the edit distance between

two submitted passwords) and the ability to protect the secrecy of passwords

even in the low-likelihood case of a full compromise of the instrumentation ser-

vice. The system also deterministically encrypts the usernames to preserve their

privacy and blind researchers from them.

Gossamer was designed in close collaboration with the information technol-

ogy (IT) security teams at two large universities. We received approval from

57


the IRB and IT security teams to deploy Gossamer for 7 months at U1 and for

3 months at U2. In total, we collected more than 34 million login requests from

347 K valid usernames. In [29], Bohuk et al. detailed the design of Gossamer

and reported on an analysis of this dataset in terms of characterizing legitimate

user behavior. At the time we were only able to detect a few obvious attacks

that stood out due to their high rate of requests, and left finding more attacks as

an open question. This project is a first step at addressing this open question.

Duo logs. Both universities use Duo 2FA for most login requests; users are

prompted to provide this second factor after they successfully submit the pass-

word (and if they do not have a valid “Duo cookie” stored in their browser).

The datasets we received include sanitized Duo logs for U2 only.

Unfortunately, there is no identifier in the Duo logs that can be used to

uniquely associate a log entry with a particular login request. We therefore had

to use a timestamp-based heuristic similar to that used previously in [29], to

find the Duo prompt that likely corresponds to a successful password submis-

sion. Using the timestamp and encrypted username associated with the Duo

request, we correlate the Duo requests with login requests within a 2-minute

time window. Out of these requests, 96.7% were successfully completed, 3.2%

were denied, and only 46 (< 0.001%) were marked as fraud by the user.

Compromised user logs. At both universities, the security analysts have pro-

cesses for identifying and reporting compromised accounts. These processes

include user reporting mechanisms and third-party breach notification services

(e.g., U1 uses a Microsoft product to detect accounts that are sending spam).

We were given access to compromised account logs for the period of our previ-

58


ous measurement study, containing the encrypted username, the timestamp re-

ported, the estimated timestamp of the attack (only available at U1), and the rea-

son for compromise. Importantly, compromise reports at U1 were logged from

multiple authentication services, but the Gossamer deployment only instru-

mented the main one (U1 web login); so the number of compromised accounts

will be an upper bound for those that were actually compromised through U1

web login. Also, the IT departments consider an account compromised if the

attacker made a successful login request, even if they did not bypass two-factor

authentication. Indeed, many older accounts may not be enrolled in 2FA. We

adopt the same terminology and refer to account compromises as a successful

login from an attacker independent of whether 2FA was also bypassed. In most

cases, analysts respond to compromise reports by scrambling the passwords of

compromised accounts to prevent further damage.

Our prior work using Gossamer logs contains some basic details about the

compromise logs; we elaborate more here since it will be relevant to interpret-

ing some results later. We considered compromise reports for the measurement

time period at each university, plus one additional week (which allowed time

for analysts to enter compromise reports for attacks that may have occurred

during the measurement period). We give summary statistics on this compro-

mise database in Figure 3.1. On average, 190 usernames (0.59% of all valid user-

names) are reported every month at U1, and 163 usernames (0.28% of all valid

usernames) at U2. At U1, 323 usernames were reported multiple times, 32 of

which were reported for two different reasons. Users who are reported twice

are compromised on average 26 days apart.

We discuss a breakdown of the reasons for compromise in more detail in

59


U1 U2
(7 mo.) (3 mo.)

Total reports 1,818 611
Unique usernames

- reported 1,468 489
- more than once for same reason 323 122
- more than once for diff. reasons 32 0

% of unique usernames compromised / month 0.59% 0.28%

# IPs tried to log in
- with a compromised username 4,633 6,409
- with multiple compromised usernames 716 156

Max. comp. usernames associated with an IP 382 24
Avg. usernames associated with a single IP 1.98 1.05

Figure 3.1: Summary statistics on the compromised accounts reported at each
university during the measurement period.

Appendix B.1. In summary, we found that the majority of accounts were re-

ported as compromised through large-scale automated attacks. At U1, the es-

timated time of attack is also reported for each compromised user, and so we

investigated the time it takes for an attack to be recorded in the compromise

database. We found that 17% of compromised accounts are reported within

the first hour after the attack, and 90% of compromised accounts are reported

within 61 hours. However, the “timestamp of attack” field is an approximation

of the time of attack based on the analyst’s best guess. There is no such field

recorded at U2.

We also approximate the time of attack by taking the last successful login for

a given username before that username is entered into the account compromise

database. In doing so, we find that only 4% of compromised accounts at U1

are reported within the first hour after their last successful login, and 90% are

reported within 46 days. This large difference shows that either the analyst’s

estimated time of attack was not very accurate, or the last successful login before

a compromise report is not a good approximation for the time of an attack—

60


an important challenge in using compromised logs for detecting attacks, as we

discuss in the next section.

Ethical considerations. Throughout our analysis, we worked closely with the

IT security offices at both universities similar to how we did for Gossamer [29].

We obtained approval for both this study and Gossamer from the IT security

offices and university administrators. Both studies were reviewed by our uni-

versity IRBs and received IRB exemptions, as the Gossamer logs do not include

usernames or other data considered to be PII by IRB. We could not request con-

sent from individual users, as we do not know the usernames of users. To fur-

ther protect the privacy of users, we used similar security measures as we used

for Gossamer study for data analysis: We used a machine only accessible to a

subset of researchers logging in from a specific computer in the university net-

work and requiring a strong password and 2FA for access. We also notified

the IT administrations at both universities of all previously unreported compro-

mised accounts we found using the encrypted usernames.

3.4 Towards Detecting Attack Campaigns

We now turn to the challenge of discovering remote password guessing attacks

using Gossamer logs. Prior work has used supervised machine learning ap-

proaches to flag likely malicious login requests [52, 102]. However, they relied

on simulated logins, for which they marked each request as benign or malicious.

For login requests observed in practice, flagging each as either benign or ma-

licious represents a challenging bootstrapping problem. Not only is the number

of login requests received too massive for a reasonable set of security engineers

61


to manually analyze, but also there is no clear set of criteria to flag a login re-

quest as malicious. Moreover, even if an account has been flagged as compro-

mised by a security engineer or reported as compromised by a user, there is no

obvious way to correlate this with individual login attempts because, as dis-

cussed in Section 3.3, compromise databases do not include information on the

specific login sessions.

Individual login requests often do not contain enough information for even

a human analyst to determine if they are malicious. We instead aim to identify

attack campaigns—that is, sets of login requests submitted by the same attacker.

Attackers often use automated scripts to send guesses for usernames and pass-

words from one or more IP addresses over a period of hours, days, or weeks.

To identify attack campaigns we will cluster login requests into groups that are

likely to be part of the same attack campaign, as we explain below.

3.4.1 Login Sets and Features

The main Gossamer logs consist of a sequence of login requests. Each request

entry includes (1) a timestamp; (2) the (deterministically encrypted) username;

(3) the client IP address; (4) the client user agent string; (5) whether the sub-

mitted password is weak (has a bucketized zxcvbn score of zero, as explained

in prior work [29]); (5) the edit distances between the submitted password and

previous passwords submitted by the same IP or username; (6) whether the re-

quest succeeded or failed due to an invalid username or the wrong password;

(7) whether the submitted password is a tweaked variant of a breached pass-

word known to Gossamer; or (8) whether the submitted password, username,

62


Group Features Acronym

Client IP address © IP
ISP © ISP
User agent string © UA

Volume Total # requests submitted NR
Total # unique usernames submitted NU
Avg. # unique password per user AUP

Login
timing

Date© DATE
Mean interarrival time MIT
Std. interarrival time SIT

Success rates Fraction of submitted requests
that failed FF
w/ invalid usernames FIU

Password
guessability

Fraction of submitted requests
w/ a weak pw, zxcvbn(w) = 0 FWP
w/ pw in breach FPIB
w/ username in breach FUIB
w/ username-password pair in breach FCIB
w/ a tweaked pw in breach FTP

Figure 3.2: Features for L sets that we use for analysis. Nominal features are
marked with ©; all others are numerical.

or username-password pair appear in a breach dataset known to Gossamer.

We group login requests based on the client IP address and date it was re-

ceived. We call this grouping a login set, which we denote by L. The set of all L
sets within a Gossamer log we denote as L = {L1, . . . ,Ln}, where n is the number

of L sets in the log. Each Li contains all login requests received in a day from

an IP address. Note that L defines a partition over all login attempts.

We define an attack campaign as a set of one or more L sets. This assumes

that all logins in a L set are either malicious or all are benign. Although it is

possible that legitimate users might share the same VPN or proxy IP address as

an attacker, we expect such scenarios to be rare (and did not encounter them in

our analyses). We leave to future work how to differentiate benign and attack

login attempts from a single IP.

63


To determine if an L set is potentially malicious or benign, we utilize a rich

set of features describing a L set based on Gossamer logs. The full set of fea-

tures, consisting of four nominal and 12 numerical features, are summarized in

Figure 3.2. At a high level, these 16 features can be roughly divided into follow-

ing groups:

• Client features include the source IP address and user agent (UA) within the

request. We also look up the ISP associated with each IP address (as reported

via the MaxMind Geolocation API [80]). These are useful for determining

whether requests are from the same client device.

• Volumetric features include the total number of requests in a L set (NR), the

number of unique usernames targeted in the set (NU), and the average num-

ber of unique passwords submitted to a particular username (AUP) for a L
set.

• Login timing features include the date of the L set, as well as the mean (MIT)

and standard deviation (SIT) of the interarrival time between requests within

the set.

• Success rate features measure the fraction of invalid usernames submitted (FIU)

and the fraction of invalid username-password pairs submitted (FF).

• Password guessability features include the fraction of submitted passwords in

an L set that have a zxcvbn score of zero (FWP), indicating a weak password,

as well as the fraction of submitted passwords in a known breach (FPIB),

usernames in a known breach (FUIB), username-password pairs in a known

breach (FCIB), and the fraction of passwords submitted that are a close vari-

ant of a breach password—called a “tweaked password” (FTP).

Together, these features serve as the basis for our attack campaign detection

mechanisms.

64


Gossamer
logs

Potentially
malicious
L sets

Pairwise
distance
matrix

Clusters
Attack

campaigns

Filtering
benign
L sets

Similarity
measures

Agglomerative
clustering

Manual
analysis

Figure 3.3: Araña’s filter, cluster, analyze pipeline for discovering attack cam-
paigns.

3.4.2 Initial Attempts Using Compromise Reports

An obvious potential approach for detecting attack campaigns is to utilize com-

promised account reports to label L sets as potentially malicious. However, as

discussed in Section 3.3, those reports do not record information sufficient to

identify exactly which L set contains the compromising login attempt. We note

that this ambiguity is not just a limitation at the universities we investigated;

anytime a login system allows compromise reports from users or third party

services, they will not be directly associated with logins. Nevertheless, we ex-

perimented with various ways of using compromise reports as ground truth for

supervised approaches. We briefly report on two here.

Classifier based approaches. We identified 23,016 L sets where the corre-

sponding IP address contacted at least one eventually-compromised user ac-

count (at any point during the data collection period). Let C denote the set of all

those L sets. Of course, not all of these are necessarily malicious; but we mark

them all as malicious since it is unclear how to label them more precisely.

We first tried to develop a classifier that can identify L sets likely to have a

compromised user based on the features we identified in Figure 3.2. We added

to C an equal number ofL sets not associated with a compromised user account,

labeled them as benign, and performed an 80/20 training and testing split on

the combined set. We then trained linear regression, decision tree, random for-

65


est, logistic regression, and SVM models to predict maliciousness. All the mod-

els exhibited very bad precision; the linear regression classifier performed best,

with a recall 0.79 but a precision of just 0.13, making it essentially useless. The

primary reason is the crude labeling of training data as malicious or benign,

presumably including many false positives.

Directed anomaly scoring. We also explored using Ho et al.’s directed anomaly

scoring (DAS) [61], tuned via compromise reports, to rank L sets in order of

“suspiciousness.” These experiments were promising at detecting new types

of attacks, but failed to help us detect attacks spanning multiple source IP ad-

dresses, making it less useful for driving an analysis and characterization of

attack campaigns. See Appendix B.2 for details.

3.5 Campaign Discovery Pipeline

We developed an analysis pipeline, called Araña, that combines heuristic-based

filtering, unsupervised machine learning to cluster behavior, and a manual anal-

ysis review step as illustrated in Figure 3.3. We refer to this as an FCA (filter,

cluster, analyze) approach. While the general notion of an FCA-type analysis

pipeline is not novel, our application-specific heuristics, similarity measures,

and manual review processes are new. Our goal here is not completeness; iden-

tifying all malicious logins is impossible. Rather, we build a high precision (low

false positive) pipeline that helps us discover a wide range of attack campaigns

with minimal manual effort.

We first work to develop heuristics for filtering out likely benignL sets. First,

we remove L sets that do not exhibit a high failure rate (HFR), and then we re-

66


move some anomalous known-benign behaviors (such as misconfigured clients

repeatedly making failed requests).

We then use the wide range of features described in the previous section to

help us define an application-specific similarity measure logsetsim(Li,L j) that

outputs a similarity score for any L set pair Li,L j. This helps us capture the

likelihood that twoL sets are part of the same attack campaign. Given logsetsim,

we can create a pairwise distance matrix and apply unsupervised agglomerative

clustering [87] to discover clusters of L sets that could each be part of a single

attack campaign. The candidates can then be manually inspected by analysts;

we report on our findings in Section 3.6.

3.5.1 Filtering Likely Benign Requests

Most logins are benign, and benign requests are a source of obfuscating noise

in unsupervised algorithms. We therefore first develop a set of heuristics for

filtering out likely benign behavior or, equivalently, identifying potentially ma-

licious L sets. Our heuristics are based on three assumptions regarding attacker

behavior reported in prior work [52, 59, 102]: (a) malicious logins are only a

small fraction of the total login traffic; (b) a large fraction of the malicious lo-

gin attempts fail; and (c) attackers use automated scripts or tools to minimize

the time and effort required to send a large number of requests from IP pools

(purchased or free proxies, VPNs), hiding their own IP. Based on these assump-

tions, we first use heuristics to filter out obviously benign behaviors (thereby

identifying potentially malicious behavior).

67


Filtering based on failure rate. We start by focusing attention on L sets that

exhibit a high failure rate (HFR). We choose thresholds for two features: the

total number of login requests NR within L and the fraction FF of these login

requests that are failures. We say a L set meets the HFR condition if NR > ` and

FF > f for two thresholds ` and f that we set below. Intuitively, benign logins

should succeed most of the time, and failure rates are not meaningful for one or

two logins.

To choose these thresholds ` and f , we first plot the distributions of NR and

FF over the subset of L that includes all L sets that have at least two login

requests (NR > 1) and one failed login request (FF > 0). We show their distribu-

tions in Figure 3.4. If we choose extreme thresholds, the HFR heuristic may miss

potentially malicious L sets; but on the other hand, choosing a relaxed thresh-

old may flag benign L sets as potentially malicious, increasing the noise in our

final attack campaign detection procedure.

Our goal was to build a semi-automated tool for discovering high-volume

attack campaigns and help IT analysts discover them with minimal manual ef-

fort. Since security analysts are concerned with finding the most damaging

attacks (that is, attacks compromising the highest number of users) with high

precision, we chose aggressive thresholds for considering malicious login sets.

Thus, at U1, we used the 90th percentile as the threshold, making `U1 = 9

and fU1 = 0.77. However, for U2, the 90th percentile yielded fU2 = 1.0, the

highest possible value; this is because U2 had more high volume unsuccessful

attacks which inflated the 90th percentile. Therefore, we decreased the thresh-

old to the 80th percentile for U2, setting `U2 = 6 and fU2 = 0.8. Although these

68


0 20 40 60 80

0

50

100

150

Percentile

N
R

U1
U2

0 20 40 60 80

0

0.2

0.4

0.6

0.8

1

Percentile

FF

U1
U2

Figure 3.4: Percentile of NR (left) and FF (right) for both universities (shown
up to the 99th percentile for viewing). By choosing the 90th percentile at U1 and
80th at U2, we set `U1 = 10, `U2 = 7 and fU1 = 0.77, fU2 = 0.8 at the universities
respectively.
aggressive thresholds may miss certain attacks, lowering them runs the risk of

degrading the quality (precision) of clusters by mixing L sets with starkly dif-

ferent behavior in the same cluster. Lower thresholds may also cause benign

clusters to be mistakenly flagged as malicious, which would quickly cause an-

alysts to lose trust in the tool. We investigate the effects of lowering thresholds

by running additional experiments with more relaxed thresholds as explained

in Appendix B.3 .

Using the HFR heuristic, we filtered out a large number of outright benign

L sets for both universities, as shown in Figure 3.5. This allowed us to focus on

L sets exhibiting more anomalous behavior and potentially malicious behavior.

Filtering out benign behavior. After removing L sets that do not match the

HFR heuristic, we further filter out other likely benign behaviors.

First, we filter out L sets with IPs that have successfully completed Duo

requests for all target users at least once on the same day because we assume

that a remote attacker conducting large scale password guessing attacks does

69


not have physical access to a victim’s Duo authentication device. As we only

have the Duo logs at U2, we applied this filtering to the U2 data. We found 563

L sets matching the HFR heuristic, but all of them contacted only one user and

had successfully completed at least one Duo request for that user. Around 90%

of these L sets had submitted ≤ 5 unique passwords, and the failures were due

to incorrect password entry — probably just a typo.

Second, we filter out any L sets having an IP within the university networks

or belonging to school proxy/VPN servers. Access to these IPs is restricted, and

members of the two universities can only use them after successful authentica-

tion and completing the Duo request.

Lastly, we filter out L sets that seem to emanate from a malfunctioning or

misconfigured (benign) client. Specifically, we noticed many IPs submitting a

large number of failed login requests with the same incorrect password. Upon

further inspection, we observe that these were automated requests belonging to

email clients, Outlook Exchange Web Services, and calendar auto-sync agents

that were configured with an invalid password. Since any rational attacker

would not try the same invalid username-password pair repeatedly, we re-

moved all L sets reusing the same incorrect username-password pair for more

than 90% of their login requests.

After filtering benign L sets based on the HFR condition, failure rate, suc-

cessful Duo submission, school IP addresses, and number of unique username-

password pairs, there were 1,717 remaining potentially malicious L sets at U1

and 6,408 at U2. We show a breakdown of these filtering steps in Figure 3.5.

70


Filter # L sets
U1 U2

L with NR > 1 & FF > 0 225,468 227,893

L with NR > ` & FF > f (HFR) 2,074 11,929

L with HFR and
w/ successful Duo request on DATE n/a 563
IP belongs to school IP pool protected by Duo 60 825
>90% password reuse 419 4,133

Remaining L sets 1,717 6,408

Figure 3.5: Number of L sets matching each of the filtering criteria at either
university. The last row shows the remaining number of sets after all filtering
steps. Threshold values ` and f for U1 and U2 are indicated in Figure 3.4.

3.5.2 Clustering Potentially Malicious L sets

After the filtering step, we cluster the remaining L sets using the features de-

scribed in Figure 3.2. The key question for clustering is how to define a distance

function that gauges similarity betweenL sets, such that twoL sets have a small

distance if they belong to the same attack campaign.

Similarity modeling. We design a distance function that can measure the like-

liness that two given L sets belong to the same campaign. For the numerical

features presented in Figure 3.2, we use the normalized difference between two

values x and y, defined as ND(x, y) = |x − y|/(x + y). We chose this particular dis-

tance function since ND(x, y) ≈ 0 when x ≈ y, and it is a high value when x and y

have a high difference. That said, we believe other numerical distance measures

would work well too.

For the nominal features IP, ISP, UA, and DATE, we cannot do a straightfor-

ward equality checking to measure their similarity, since these nominal features

are sparse. Therefore, similar to Freeman et. al. [52], we use a hierarchical back-

off distance (HBD) approach. This technique defines a number of levels: If the

71


two feature values do not match at a lower-level, we “backoff” and use values

of the feature from a higher level, while incurring a dissimilarity cost. We set

this dissimilarity to e−`−e−(`+1) for backing off from the lower hierarchical level `

to a higher hierarchical level ` + 1, and we accumulate the dissimilarity costs

from each level with each backoff.

For the IP feature, we use the hierarchical structure of Internet routing to

define four levels. At level ` = 0 we check for strict IP equality; at level ` = 1

we check for /24 subnet equality; at level ` = 2 we check for ISP equality (as

reported in the ISP feature); and at level ` = 3 everything is considered equal.

Thus the max level for this feature is `IP
max = 3. For the UA feature, we use five

levels. At level ` = 0 we check if the UA strings are identical. For subsequent

levels we extract from the UA string to obtain the application (desktop, mobile,

unknown), browser (chrome, edge, other), and OS (Windows, iOS, Mac OS,

Linux, other). Equality checks for each of these three define levels ` = 1 through

` = 3. The final level ` = 4 indicates that the user agents do not have anything

in common.

For the final nominal feature, DATE, we find the number of days d between

two L sets and compute 1 − e−d. Thus two L sets with the same DATE have a

distance of 0; with 1 day apart, the distance is 0.63, and so on. Note that this

backoff can be calculated (less efficiently) via the same approach as for the other

nominal features, by setting `DATE
max equal to the maximum number of days.

Pseudocode for the complete logsetsim and the hierarchical backoff distance

(HBD) is shown in Figure 3.6. The set Γ̄ includes all feature we used (shown on

Figure 3.2), and we abuse notation slightly by letting γ ∈ Γ̄ define a function that

maps a L set to the relevant feature value. For the nominal features, we further

72


logsetsim(L1,L2):
s← 0
for γ ∈ Γ̄:

x← γ (L1); y← γ (L2)
if type(γ) is numerical:

s← s + wγ · ND (x, y)
else

s← s + wγ · HBD(γ, x, y)
return s

HBD(γ, x, y) :
s← 0
` ← 0
while ` < `γmax or γ`(x) , γ`(y) do

s← s + e−` − e−(`+1)

` ← ` + 1
return s

Figure 3.6: Our distance function (logsetsim) to measure the similarity between
two L sets (Left) and the hierarchical backoff distance (HBD) calculation for
nominal features (Right).

let γ` denote the function that extracts the `th level from the feature value. We

also define a predicate type over features that indicates whether the feature is

nominal or numerical. We weight the distance values computed for each feature

based on a hyper-parameter called feature weight wγ ∈ [0, 1].

Clustering. We used an agglomerative clustering technique [87] that can work

with a non-metric distance function. Agglomerative clustering starts with each

data point as a single cluster and only merges two clusters if their distance is

smaller than a given threshold. Adjusting the threshold can help avoid clus-

tering together L sets showing different login behavior. To set the distance

threshold appropriately, we rely on a knee locator method [100] frequently used

in clustering algorithms to find the correct threshold for distances. We set the

linkage type to “average” and set the distance threshold for U1 and U2 to 0.44

and 0.51 respectively after applying the knee locator method at each university

separately. The silhouette score [23] for agglomerative clustering was 0.19 for

U1 and 0.17 for U2, which beat other approaches we tried (see Appendix B.3).

Implementation details. We implemented our similarity model logsetsim in

240 lines of code written in Python 3.6. To extract the four hierarchical levels

73


for the IP feature, we used the MaxMind GeoIP database [80]; and for the UA

feature, we used the ua-parser package [9]. Given an L of size n, we computed

an n×n distance matrix to be used for various clustering algorithms. Computing

the distance matrix takes O(n2) time; however, it can be easily parallelized. We

used 40 threads and were able to compute the distance matrix for n = 1, 717

within 9 minutes at U1 and for n = 6, 408 within 28 minutes at U2 (using an Intel

Xeon Linux machine with 56 cores and 125 GB of memory). For the clustering

step, we used the sklearn [36] library, and clustering completed in less than a

minute for both universities.

3.5.3 Attack Campaigns Discovered

Based on our designed similarity modeling, the agglomerative clustering ap-

proach described in Section 3.5.2 produced 366 clusters from 1,717 L sets at U1

and 640 clusters from 6,408 L sets at U2. For each cluster we recompute the fea-

ture values mentioned in Figure 3.2, after taking the union of L sets. Next, we

sample a few of the top most interesting clusters for manually analyzing them.

To identify the large volumetric attack campaigns, we sampled clusters con-

taining a high number of requests or high number of unique usernames. At U1,

we found eight such clusters with NR ≥ 1, 000∨NU ≥ 1, 000. At U2, we sampled

12 clusters with NR ≥ 5, 000 ∨NU ≥ 5, 000. We chose these thresholds by manu-

ally observing that the clusters found after these thresholds do not clearly show

malicious behavior. As our goal is to characterize attacks, we focus on precise

identification of attack campaigns, foregoing high recall. Thus we sorted possi-

ble attacks by a few different metrics. For example, we sorted by the number of

74


01-01
01-21

02-10
03-02

03-22
04-11

05-01
05-21

06-10
06-30

102
103
104

N
um

be
r

of
re

qu
es

ts
pe

r
da

y
in

lo
g

sc
al

e.
U1

*Clusters 1 & 6
*Cluster 2 Cluster 8

Cluster 7

Cluster 5
Cluster 9

Clusters 3 & 4

12-20
12-25

12-30
01-04

01-09
01-14

01-19
01-24

01-29
02-03

02-08
02-13

02-18
02-23

02-28
103

104

105
U2

*Cluster 10

Cluster 11

Cluster 13Cluster 14 Cluster 23

Cluster 22

Figure 3.7: Number of requests per day for the 1,752 and 6,408 potentially ma-
licious L sets at U1 and U2 respectively as shown in Figure 3.5. Attack clusters
we found in Section 3.4 are shown in red boxes. Clusters marked with a * were
identified in prior work [29] as high volume attacks. Note that the x- and y-axes
are different for the two universities for better visualization.

unique usernames and found one additional attack at U1; all the top clusters at

U2 had already been found using the volumetric thresholds.

All of the above sample clusters did not exhibit any targeted behavior —

sending roughly one unique password to each user on average. Therefore,

to capture attack campaigns exhibiting targeted behavior, we consider clusters

sending a high number of unique passwords submitted per username on aver-

age. Although at U1 we did not find any such clusters appearing to be attacks,

we discovered eight clusters from U2 that submitted an average of at least 25

unique passwords per user. Altogether, we sampled nine attack clusters at U1

and 20 attack clusters at U2. Attack clusters at U1 sent on average 8,432 re-

quests to 1,294 unique usernames, with a total number of 41 successful logins

among these 9 clusters. At U2, we saw an average of 14,358 requests submitted

per cluster to 7,614 unique usernames, with a total number of 1,116 successful

75


Abbreviation Measurement

NU Number of unique users
NR Number of requests
AUP Average unique passwords per user
FWP Fraction of requests with a weak password
FPIB Fraction of requests with the password in a breach
FCIB Fraction of requests with the username-password pair in a breach
FUIB Fraction of requests with the username in a breach
FTP Fraction of requests with a tweaked password
FVU Fraction of valid usernames
FF Fraction of requests failed

Figure 3.8: Abbreviations for measurements taken

logins among all 20 clusters. We describe these clusters further in Section 3.6.1.

3.6 Analysis of Attack Campaigns

As seen in the last section, Araña’s FCA pipeline helped us identify 29 clusters

that are probable attack campaigns. We first describe a subset of these cam-

paigns in more detail, to better understand attack behavior as seen in practice.

In Section 3.6.2 we generalize from these case studies to identify a variety of

observed higher-level attacker behaviors.

3.6.1 Example Attack Campaigns

First we describe some attack clusters representative of different types of attack

behavior we observed and show how we group some of them into campaigns

involving more than one cluster. We show the timeline of the attack campaigns

76


ID
/ Univ.

Dura
tio

n

# IP
s

# ISP
s

NU NR AUP
FW

P
FPIB

FCIB
FUIB

FTP
FVU

FF Com
p.

use
rs Com

p.

use
rs

flag
ged

.∗

1§ / U1 5 h 3 3 10,424 18,093 1.21 0.12 0.56 0.41 0.69 0.06 0.94 1.00 1 1
2§ / U1 4 h 1 1 15,209 17,827 1.10 0.04 0.56 0.22 0.42 0.06 0.97 1.00 14 13

3‡ / U1 14 h 555 4 12,659 15,117 1.00 0.01 0.14 0.02 0.58 0.46 0.78 1.00 14 12

4‡ / U1 11 h 77 2 12,318 14,603 1.00 0.01 0.14 0.02 0.58 0.45 0.78 1.00 7 6

5 / U1 41 m 1 1 4,480 4,593 1.03 0.08 0.63 0.21 0.44 0.08 0.97 1.00 1 1

6§ / U1 4 h 3 1 2,101 2,683 1.21 0.07 0.66 0.48 0.69 0.04 0.98 1.00 0 0
7 / U1 9 h 1 1 1,347 1,481 1.00 0.00 0.32 0.10 0.22 0.05 0.99 1.00 2 2

8 / U1 19 h 85 13 894 1,246 1.00 0.01 0.33 0.08 0.92 0.84 0.92 1.00 2 2

9 / U1 37m 30 7 219 241 1.00 0.06 0.92 0.78 0.96 0.10 0.93 1.00 0 0

10§ / U2 11 h 12 1 76,321 169,573 1.01 0.00 0.03 0.00 0.00 0.00 0.00 1.00 0 0

11 / U2 10 m 663 10 27,488 33,304 1.00 0.10 0.79 0.00 0.00 0.00 0.01 1.00 0 0

12 / U2 15d 1 1 8,192 13,289 1.02 0.04 0.61 0.07 0.54 0.02 0.63 0.93 501 6

13 / U2 3 d 1 1 7,939 12,240 1.30 0.05 0.66 0.10 0.75 0.00 0.35 0.99 120 0

14 / U2 23h 843 13 7,771 10,535 1.01 0.08 0.66 0.00 0.71 0.00 0.48 0.97 258 0

15 / U2 5 d 1 1 4,662 9,714 1.07 0.09 0.84 0.00 0.91 0.00 0.40 0.99 32 0

16 / U2 2 d 2 1 4,934 7,323 1.00 0.06 0.58 0.00 0.23 0.00 0.81 0.84 786 0

17 / U2 2d 3 2 2,513 6,290 1.55 0.03 0.40 0.18 0.91 0.70 0.37 0.99 35 0

18 / U2 10 d 1 1 1,902 5,434 1.37 0.04 0.55 0.27 0.82 0.41 0.39 0.99 13 0

19 / U2 5 d 1 1 3,584 5,261 1.04 0.07 0.76 0.80 0.98 0.18 0.42 1.00 6 0
20 / U2 2 d 3 2 1,756 5,199 1.56 0.04 0.50 0.22 0.92 0.72 0.35 0.98 83 0

21 / U2 23 h 1 1 5,076 5,103 1.00 0.06 0.59 0.00 0.26 0.00 0.80 0.85 777 0

22 / U2 13 h 73 43 75 1,878 25.04 0.82 0.88 0.01 0.31 0.29 0.96 1.00 1 0

23 / U2 21 h 52 38 52 1,296 24.92 0.82 0.89 0.01 0.34 0.32 0.88 1.00 1 0

24 / U2 12 h 4 4 4 101 25.25 0.80 0.86 0.00 0.55 0.50 0.51 0.99 1 0

25 / U2 16 h 3 3 3 80 26.67 0.81 0.75 0.01 0.65 0.58 0.30 0.99 1 0

26 / U2 8 m 1 1 1 27 27.00 0.85 0.85 0.00 0.00 0.00 0.00 1.00 0 0

27 / U2 8 m 1 1 1 27 27.00 0.85 0.78 0.04 1.00 0.96 1.00 1.00 0 0

28 / U2 13 m 1 1 1 25 25.00 0.84 0.88 0.00 0.00 0.00 0.00 1.00 0 0

29 / U2 4 d 1 1 3 468 38.75 0.15 0.59 0.00 0.67 0.16 0.33 1.00 0 0

‡ Since the measurement collection ended during these clusters, some of these measurements
(such as NR and NU) may be lower bounds.
§ These clusters are part of campaigns that were discussed in previous work [29].
† Duration units are days (d), hours (h), and minutes (m).
∗ The last column represents the accounts that were flagged independently by security
engineers at the respective universities.

Figure 3.9: Attack clusters detected using a set of heuristics and manual review.
The first column notes the ID of each cluster as we refer to them in the paper [64].
The attack IDs we describe in detail (Section 3.6) are shown in bold font. We
discovered a total of 41 and 1,116 unique compromised users at U1 and U2
respectively. Abbreviations are described in Figure 3.8.

77


discussed in Figure 3.7 and the full list of attack clusters in Figure C.5.

Previously reported attacks. Three attacks were manually identified and dis-

cussed in prior work [29]. All three were also detected via our FCA pipeline:

Attack #1 from [29] corresponds to Clusters 1 & 6, Attack #2 corresponds to

Cluster 2, and Attack #3 corresponds to Cluster 10 (also shown in Figure 3.7).

Attack #1 from [29] was a credential stuffing attack distributed over four IPs

that were active on different days. Each of the IPs submitted lots of requests

very quickly, with a peak rate of over 100 requests per second. As such, these

were easily identified manually as related attacks by the similar attack behaviors

and time period when they were active. This still required significant manual

analysis of Gossamer logs; however, our FCA approach automates this analy-

sis. Two clusters identified by the FCA method capture the bulk of the L sets

associated with these attacks. One L set containing 16,035 requests that were

previously manually identified was omitted; this L set had a lower failure rate

(0.67) than our heuristic filtering thresholds (which was 0.77 at U1).

Attack #2 from [29] was a high volume credential stuffing attack using Sen-

tryMBA [110] to send requests from a single source IP address. Since the attack

consisted of only one IP address on a single date, the cluster consisted of one

L set, ranked second by volume of requests in our FCA approach.

Attack #3 from [29] was observed at U2 and involved 12 distinct IP addresses

with an average rate of 188 requests per second. This attack was easy to detect

due to its sheer volume and rate of requests. Manual inspection revealed that

the attack requests mimicked SMTP and IMAP clients, which helped in manu-

ally clustering them. Again, our FCA pipeline automated this step, placing all

78


activity from these IP addresses into a single attack cluster.

Our new FCA approach helps automatically identify these attack campaigns

with little error relative to the manual analysis used in prior work [29]. How-

ever, it did split the first attack into two clusters, and it also missed an IP address

from that attack. Across these three attack campaigns, 17 L sets were identified

by both the manual and FCA approach as being part of attack campaigns, and

one L set was identified by the manual approach but not the FCA approach.

We show the corresponding confusion matrices in Appendix B.4. Note that the

FCA approach also found many attack campaigns hard to manually detect that

we describe next.

Future work can look into whether unsupervised clustering approaches can

be improved to be more accurate than manual inspection. For now it is clear

that FCA helps find and characterize attack campaigns in an automated fashion

that can later be investigated by an analyst. We now discuss in further detail the

new attacks discovered using our pipeline.

Clusters 5 & 7: Repeated attacks from the same IP. We saw two credential

stuffing attacks at U1 (Clusters 5 & 7) a month apart (May 22 and June 20) from

the same IP address and user agent (and thus, likely from the same attacker)

attempting to login to 4,480 and 1,347 users respectively. The five user agents

used in the June 20 attack were a subset of the seven used in the May 22 attack,

and the two attacks targeted 295 overlapping usernames. Thus we believe that

the IP was under the same attacker’s control for both attacks. This IP was active

on 24 other days, but only submitted between one and four requests per day to

nine distinct usernames over that time period. We believe the attack exhibited

credential stuffing behavior, as the fractions of breached passwords submitted

79


were 0.63 and 0.32 respectively, and less than 8% of the passwords submitted

were weak passwords. Other features such as the AUP, FTP, and FSP were rel-

atively similar between the two clusters. All but two usernames were incorrect

across the clusters, so we believe the attack was curated for U1.

Cluster 12: Multi-day attack from a single IP. Although most of the attacks

we saw were finished within a 24 hour period, some attackers may spread out

the attack over multiple days in an attempt to stealthily avoid detection. We

saw this behavior in Cluster 12 at U2. In this attack, one IP from Microsoft Cor-

poration ISP sent one request per minute on 15 days in a two month period. In

total, the IP sent 13,289 requests to 8,192 unique usernames. There was evidence

of credential stuffing, as 61% of passwords submitted appeared in breach data

known to Gossamer, and fewer than 0.04% of them were weak passwords. We

did not see any evidence of targeted behavior, as there was an average of only

one unique password submitted per username; however 63% of the attempted

usernames were valid, meaning the attacker curated their password guessing

attack for U2. The IP attacked the basic authentication protocol that does not

have two factor authentication set up at U2. This IP successfully guessed the

passwords of 501 user accounts over this time period, only 163 of which were

detected by the security engineers.

We saw similar multi-day attacks in Clusters 15, 18, 19, and 29, albeit for

shorter time periods. Such attacks may easily avoid simple volumetric detection

methods (as they were missed in the prior work) and thus show the utility of a

richer feature-based clustering approach.

Cluster 14: High volume, distributed, credential stuffing. We saw several

cases exhibiting high volume, distributed, curated credential stuffing. For ex-

80


ample, in Cluster 14 at U2, 843 IPs belonging to 13 ISPs submitted 10,535 re-

quests to 7,771 usernames over the course of 23 hours. Since each IP submitted

only around 12 requests, it avoided detection by the manual analysis performed

in the prior work [29]. This attack exhibited credential stuffing, with 66% of

passwords present in breach data; and 48% of the submitted request contained

a valid usernames from U2. During this attack, the involved IPs successfully

guessed the passwords of 258 unique user accounts, 74 of which were confirmed

by security engineers independently. Clusters 3, 4, 8, 9, 11-13, and 15-19 showed

similar signs of high volume, distributed, credential stuffing with varying levels

of curation to the target university.

Cluster 17: Possible credential stuffing with unknown breach data. A few at-

tacks showed a high fraction of tweaked passwords (passwords that are a close

variant of a breached password for the targeted user), but a lower fraction of

breached passwords. For example, in Cluster 17, we saw three IPs submit 6,290

requests to 2,513 unique usernames over the course of two days. Although 91%

of the submitted usernames in this attack campaign are present in the breach

data used by Gossamer, only 18% of the submitted passwords appear with those

usernames. Of the remaining 82% submitted passwords for those users, we that

found 70% are only small tweaks of the password(s) present in the breach data

with the corresponding username, and 40% of the submitted passwords exactly

match the breach data. This may indicate that the attacker is using a breach

dataset that is unknown to Gossamer; but because users choose similar pass-

words across web services [42], we can still detect these breached passwords

not present in Gossamer’s breach data.

This attack produced successful logins to 35 users, two of which were inde-

81


pendently detected by the security engineers. Cluster 20 showed similar signs

of possible credential stuffing with unknown breach data, with a high fraction of

tweaked passwords. Neither of these attacks exhibited targeted behavior (hav-

ing AUPs of 1.55 and 1.56 respectively); therefore, we do not believe these are

credential tweaking attacks [93].

Clusters 22 & 23: Targeted attacks. In a few cases, we saw a higher average

number of unique passwords tried per username. For example, in Cluster 22,

we saw 73 IPs submit 1,878 requests to 75 unique usernames, with an aver-

age of 25 unique passwords tried per username, each attacking only one user.

Cluster 23 was active on the same day and executed a very similar attack: 52

IPs submitted 1,296 requests to 52 usernames, each submitting an average of 25

unique passwords to one unique username, just like Cluster 22. Thus we believe

they are part of the same campaign.

In this attack campaign, the attacker clearly used a popular password dictio-

nary, as 82% of passwords submitted had weak passwords (zxcvbn score of 0).

The fraction of valid usernames for Clusters 22 and 23 were 0.96 and 0.88 respec-

tively, indicating that the attack data was probably curated to the university. The

campaign was successful in guessing the passwords for two usernames, neither

of which was detected by the security engineers.

Clusters 24-28 exhibited similar targeted dictionary attack behavior, al-

though with fewer IPs and a lower number of total requests. Our clustering

approach, however, failed to cluster all of these L sets together. This is a com-

mon limitation in agglomerative clustering [109].

82


3.6.2 Higher-Level Attack Behaviors Observed

Attack campaigns employ various strategies to maximize their success in com-

promising user accounts. Broadly, there are two components in an attack cam-

paign that the attacker has to choose: (a) the types of username-password pairs

to submit, and (b) how login requests with those username-password pairs

are delivered to the target service. We noticed different approaches to each of

the two components in the attack campaigns we found at U1 and U2, as we

discuss in more detail below. The list of different behaviors we observed as

well as some example clusters showing that behavior are shown in Figure 3.10.

We also explore the geographical distribution of attacks, but relegate details to

Appendix B.5.

Types of submitted username-password pairs. A key component in an at-

tack is the set of guessed username-password pairs, as the success of the attack

depends on it. There are different strategies for picking username-password

pairs — e.g., an attacker can choose usernames belonging to the university and

try several popular passwords against those users, or an attacker can choose

breached username-password pairs with or without filtering the usernames

specific to the university.

Curates usernames to the target university. We found that all attacks at U1 cu-

rated their set of usernames, with more than 75% of requests containing a user-

name present in U1. However, at U2, only 7 (out of 20) attack clusters contained

more than 50% valid usernames. Clusters 10 and 11 combined submitted nearly

200 K requests, but only 286 of them contained a username present at U2. Even

when an attacker attempted to log in multiple times for a particular username,

83


Attack component Behavior Ex. clusters

Type of
username-
password
pairs

Curates to the target university 2, 21
Targets certain users 22, 23
Uses breach data 9, 19
Submits popular passwords 22, 23

Delivery of
requests

Distributed among multiple IPs 3, 11
Distributed among multiple days 12, 19
Ends quickly (within 24 hours) 1, 11
Exhibits low interarrival time 5, 11

Figure 3.10: Different attack behaviors we observed.

we found that in several cases the username did not exist. Some attacks are

therefore rather indiscriminate, trying arbitrary usernames without checking

first whether they are valid for the target authentication service.

Targets certain users. An attacker may submit requests against one, a few, or

many unique usernames. Among attacks we detected, it was more common for

an attack to target a large number of usernames in a “horizontal attack.” Most of

the attacks we found were horizontal attacks — trying one or two passwords per

username but for a large number of usernames. An attack submitting multiple

unique passwords against fewer usernames may be exhibiting targeted behavior.

We found six attack clusters targeting 137 users in total, each with more than 25

unique popular passwords. We do not know if the set of passwords used were

the same for different users, as Gossamer logs at U2 do not allow comparing

passwords submitted to different usernames.

Uses breach data. In our analysis, we found that attackers often use prior

breaches to source their username-password pairs in what is popularly known

as a credential stuffing attack. In six out of nine clusters at U1 and in one cluster

at U2, 50% of the submitted username-password pairs are present in the breach

data used by Gossamer. Among the remaining 22 clusters, eight clusters (all

84


at U2) had more than 50% of the targeted usernames in a known breach, and

all but one attack cluster at U2 submitted passwords that were found in prior

breaches. This reiterates the threat of credential stuffing.

Uses popular passwords. Although most attacks used breached passwords, some

attacks relied on especially popular passwords. Attackers may use popular

password dictionaries [125] curated from prior password breaches for such at-

tacks, and such passwords will, by definition, also be flagged as having ap-

peared in a known breach. We found a number of attack clusters submitting

popular weak passwords, such as Clusters 22 & 23. More than 81% of the pass-

words submitted in these attacks appear in the 1000 most frequent passwords

found in the breach data known to Gossamer. Notably, all of these attacks

were targeted, submitting more than 25 passwords per user. Both universities

have password selection policies that aim to disallow popular, weak passwords.

Nevertheless, we found that such attacks were successful in compromising two

users at U2.

Delivery methods for attack requests. After choosing the set of username-

password pairs to submit, the attacker must decide how to submit them to the

target service. Primarily, the attacker must identify how quickly they can sub-

mit all the username-password pairs without being detected. To do so, attackers

can use multiple IP addresses to parallelize the attacks, and/or spread the attack

over a long period of time.

Distributed among multiple IPs. We found several attack campaigns that dis-

tributed their login requests over multiple IP addresses. In particular, we iden-

tified three large attack clusters using more than 500 IP addresses and five other

85


clusters each using more than 50 IP addresses. In the two clusters with the most

IP addresses at U2 (Clusters 11 & 14), the majority of IPs were flagged as proxies

by Blackbox API [2]. By distributing over multiple IPs, an attacker can achieve a

higher throughput, as exhibited in cluster 11 which used multiple IPs to achieve

a very fast request rate. This distribution of requests across IPs can help to avoid

volumetric detection mechanisms, as exhibited in clusters 22 and 23.

Duration of the attack. Attacks can be short-lived and bursty or spread across

multiple days. All attack clusters at U1 were short-lived, finishing within 19

hours and four of them completing within five hours. At U2, however, we found

that most high volume attacks (submitting more than 5 K requests) span mul-

tiple days, and two clusters were active for over 10 and 15 days, respectively.

Interestingly, these attacks would be very difficult to detect by looking at their

behavior on only one day; however, our clustering approach helps to see the

full attack by combining the attack behaviors from multiple days.

Interarrival time. Finally, sending requests too quickly could also trigger alarm

or lockout. Therefore, some attackers try to submit requests at a lower rate.

However, we see at least four attack campaigns that submitted as much as 60

requests per minute at U1 and U2. These attacks are not curated, meaning that

the majority of the submitted usernames do not belong to the target university

and thus could not result in a successful login. Both U1 and U2 have a soft rate

limiting policy, meaning that too many unsuccessful attempts against a user-

name could lead to account lockout for 15–30 minutes. However, we did not

find any attack that submitted requests fast enough to a single user that could

trigger that lockout. We are unsure if attackers adapted their attacks to this

86


lockout constraint or if their typical behavior avoids such lockouts.

Endpoints targeted. An attacker may choose an attack endpoint with the least

safeguards in place. We found that most of the attacks at U2 targeted the Mi-

crosoft basic email authentication endpoint. We suspect that there are four

reasons for this high usage of the basic authentication endpoint: (a) this end-

point does not support two factor authentication; (b) it has poor (non-existent)

rate limiting of requests, as basic authentication requires frequent submission of

passwords; (c) the requesting IP seems to be from Microsoft (as seen by the au-

thentication server); and (d) attackers would have to make only a small change

(the URL) to attack different organizations using basic authentication.

3.7 Discussion

We design Araña, an attack analysis system based on our FCA — filter, clus-

ter, analyze — framework. Using Araña with the logs created by our prior

work [29], we identified several attack campaigns in two university login sys-

tems. These attack campaigns are spread across 29 clusters as detailed in

Figure C.5 and compromised 1,157 users across two universities. Although our

study is limited to the attacks received by two universities, we believe the pat-

terns we observed across attacks can form a basis for building future defenses

against password guessing attacks.

At U1, the most common type of attack was a combination of a high vol-

ume, distributed, curated, short-lived, credential stuffing attack; the interarrival

times of the attacks varied from 55 milliseconds to 108 seconds. At U2, almost

87


all the attacks were high volume credential stuffing attacks, but they varied in

whether they were distributed, curated to the university, short-lived, or exhibit-

ing a low interarrival time. Of the eight lower volume attacks we found, all were

curated, short-lived, targeted credential stuffing; and they varied as to whether

they were distributed across IPs. The distributions of attacks observed at the

two universities are quite different, possibly because Gossamer gathered data

from all login endpoints at U2, but only the main login endpoint at U1; thus

attackers targeting different endpoints at U1 are not reflected in Gossamer logs.

Efficacy of breach alerting services. To determine how much a breach alerting

service could mitigate password guessing attacks, we investigated the charac-

teristics of the password from the last successful login before a user was re-

ported as compromised. We found that 2% of compromise reports were asso-

ciated with a breached username-password pair (that is, the last successful lo-

gin to that username before the compromise report was a breached username-

password pair), 12% were associated with at least a breached password, and

19% were associated with a tweaked password. At U2, we observed that 11% of

compromised users’ last logins were made with a breached username-password

pair, 46% were made with at least a breached password, and 1.51% were made

with a tweaked password. Thus automatically resetting passwords of users us-

ing vulnerable, breached, or tweaked credentials could have prevented a signif-

icant fraction of account compromises. At U1, 71% of eventually compromised

users used a breach password at some point during the instrumentation period.

This further underpins the need for proactive breach alerting [76] services that

could have saved 47% of account takeovers as stated earlier.

88


Key observations from attacks. Our findings reiterate the ongoing threat of

credential stuffing, which has been the most prevalent and successful form of

attack. We also observed a few low volume targeted attacks against specific

users. Attackers often use multiple IPs from cloud providers, VPNs, and net-

work proxies to distribute and hide their attacks, but our FCA approach can

identify such campaigns even when each individual IP makes only a handful

of requests. Such observations should be taken into account when building

defense policies. For example, locking user accounts due to a small number

of incorrect attempts rarely translates to higher security, whereas discouraging

users from reusing passwords from other websites and using breach alerting

services can be very effective. Proactive breach alerting [76] using services such

as HIBP [63] would be helpful in combating credential stuffing attacks.

Using an FCA approach for attack analysis. Our FCA approach helps ana-

lyze attack campaigns by clustering L sets with similar attack behavior. This

enables a security engineer to look at the whole attack, instead of considering

login activities from a single IP or on a single day. As we show, several attacks

are spread across multiple days and use multiple IPs; in some cases, they may

use only one IP and spread the attack over multiple days, making them very

hard to detect. We believe clustering seemingly unclear behaviors into groups

can help security engineers see the attack pattern, detect hard-to-detect attacks,

and have confidence in their judgment. We envision that our FCA approach can

be used on authentication logs such as the ones produced by Gossamer [29] to

group possible attack campaigns and sort by those that are more likely to be ac-

tual attacks. Then analysts can manually investigate further, using the features

89


and groupings produced by the clustering to inform their decision. Through-

out the design of our FCA approach, we focused on minimizing false positives

by choosing aggressive thresholds and evaluating threshold effects so that an

analyst can have confidence that clusters found by Araña are highly likely to

be attacks. Thus our FCA approach could reduce the effort needed to analyze

the complete logs of all IPs; such a tool could prepare daily (or weekly) reports

about potential attacks.

3.8 Real-World Deployment Constraints

Despite the many campaigns detected by Araña, there are certain challenges

that may occur during deployment in real-world settings. Here we talk about a

few that we have faced.

Recording password-derived features. Most organizations do not record

password-based features for good reason, and while Gossamer [29] showed

how to record them safely, it is still an open question whether widespread de-

ployment of recording password-derived features is wise or necessary. Araña

provides evidence that certain password-derived features can be useful for at-

tack detection. For example, we identified that password breach statistics are

very effective at detecting credential stuffing attacks. On the other hand, pass-

word similarity statistics — which require storing passwords in memory for 24

hours — are not as effective for attack detection in the FCA approach. So or-

ganizations could log a small set of already proven-helpful password-derived

features and apply clustering using those features. We detail clustering results

without password-based features in Appendix B.3 and leave to future work the

90


open question of which features are the most helpful for detecting different

types of attacks.

Reporting compromised accounts. Based on the attack campaigns discovered

by our FCA approach, we believe 41 unique user accounts were compromised

at U1, and 1,116 were compromised at U2 (some accounts were compromised

multiple times by different campaigns).

At U1, 37 of the 41 compromised accounts detected by Araña were already

detected via other mechanisms, and manually recorded by the security engi-

neers as compromised. We reported the remaining accounts and received feed-

back that they were indeed compromised, and the remaining account had al-

ready been deleted.

At U2, only six out of 1,116 compromised users detected by Araña had been

independently flagged by the security engineers. We reported the rest to U2’s

security engineers in two rounds. In the first round, we reported 823 compro-

mised accounts, 373 (44%) of which the security engineers confirmed as defi-

nitely compromised. For the remaining 450 accounts, U2’s security engineers

said they could not verify fully due to unavailability of adequate logs but that

their best guess was that they were compromised based on what logs were avail-

able. In the second round, we reported another 293 compromised usernames;

however, we received a similar response mentioning that the IT department did

not have adequate logs to determine whether these were compromised.

In practice, confirming if accounts are compromised is nuanced and often

relies on indirect indicators like an account being used to send spam or a report

from an account owner. Elsewhere, compromise status can be inherently am-

91


biguous. Even so, our experience working with the security engineers suggests

that more detailed and persistent logging may help. Likewise, our experience

with Araña suggests that password-derived measurements can be helpful to

analysts when attempting to characterize and confirm compromises. We also

note that this difficulty in confirming account compromise makes it difficult to

automatically finetune many of the hyperparameters in Araña (e.g., distance

thresholds and percentile thresholds for filtering L sets). Therefore we had

to rely on experimentation with different thresholds and manual analysis on

password-derived and volumetric-based measurements which took compara-

tively much more effort. Nevertheless, Araña provides a way to use Gossamer

logs to identify high-volume and distributed attacks that were not detected by

existing mechanisms.

3.9 Conclusion

Using data collected at two universities from a measurement framework for

logging password-derived information behavior, we designed a set of features

that describe an IP address on a given date, along with a clustering algorithm

to group IP addresses active on certain dates together into probable attack cam-

paigns. We describe several of these clusters in full detail to show the differ-

ences and similarities between attacks, and we discuss our observations about

behavioral patterns of the attacks as a whole.

Our results indicate that clustering approaches can aid an analyst in detect-

ing and labeling suspicious groups of requests that may be part of the same

attack campaign. They also provide initial progress towads the future design

and deployment of real-time, automated attack campaign detection tools.

92


CHAPTER 4

THE PRACTICALITY AND ROBUSTNESS OF TIMELY MALICIOUS

LOGIN DETECTION

4.1 Introduction

Remote password guessing attacks pose a serious risk to users’ online accounts.

The prevalence of large password breaches [63] along with poor password prac-

tices (e.g., users’ tendencies to pick weak, easy-to-guess passwords [96,115] and

reuse them across web services [83, 118, 121]) have made it easy for attackers to

guess a user’s password and gain access to their account [95]. Despite ongoing

efforts to move beyond passwords, such as by transitioning to device-hosted

cryptographic keys [56,73], passwords remain the primary mode of authentica-

tion. Therefore, password guessing attacks will likely persist as a severe security

threat for the foreseeable future [24, 78].

To prevent password guessing attacks, several server-side defenses have

been proposed that analyze more features of a login request beyond just pass-

word correctness [32]. Examples include volumetric approaches like block-

ing IP addresses or locking an account after too many login failures [34].

Prior academic works have proposed supervised and unsupervised machine

learning (ML)-based mechanisms that train models using prior known at-

tacks [52, 101, 102] or model anomalous login behavior [59, 64].

While these approaches hold significant promise, no prior work has evalu-

ated the efficacy of such approaches in the timely detection of real-world login

attacks. Moreover, most previous studies [52, 59, 101, 102] have relied on syn-


thetic login data for simulated evaluations, raising concerns about their appli-

cability in practice. This limitation stems from the challenges associated with ac-

quiring and analyzing sensitive login traffic. For example, Schechter et al. [102]

require the authentication server to log fine-grained sensitive information about

users’ passwords.

In prior work, we addressed this challenge by designing a framework called

Gossamer [29], which we used to collect a dataset of 34 million login requests at

two universities. To preserve the security and privacy of users’ passwords, we

specified a set of coarse-grained measurements deemed safe to collect without

leaking too much information about a user’s password. For example, Gossamer

records a binary password strength score using Zxcvbn [123] for submitted

passwords instead of the raw Zxcvbn score (between 0-4), as it may endanger

users’ security.

Therefore, a key question remains: How do existing volumetric and ML-

based timely detection approaches perform when authentication servers collect

coarse-grained information from login requests to safeguard the privacy and se-

curity of their users? In our work on Araña, we further analyzed the Gossamer

dataset and detected 29 attack campaigns using a new FCA (filter, cluster, an-

alyze) pipeline [64]. However, this approach relied on months of data to iden-

tify malicious activity, rendering this technique unsuitable to detect attacks fast

enough to prevent exploitation of a compromised account. Additionally, no

prior research has assessed the threat of attackers adapting to detection mecha-

nisms, leaving open the question of their robustness against more sophisticated

adversaries.

To begin answering these questions, we first perform a comprehensive in-

94


vestigation of existing volumetric and ML-based detection approaches on the

real-world large-scale dataset collected by the Gossamer framework. We mea-

sure the effectiveness of the existing detection approaches by assessing how ac-

curately they can detect the 29 attack campaigns without also raising too many

false alerts. Our findings indicate that existing volumetric and ML-based ap-

proaches perform poorly, indicating that performance on simulated attacks may

not generalize to real-world attacks — at least on real login datasets that must

record coarser grained information about submitted passwords to avoid com-

promising user security.

We explore three new detection approaches that leverage Directed Anomaly

Scoring (DAS), which was designed by Ho et al. [61] for the context of

spearphishing detection. Following our approach in Araña [64], we first parti-

tion logins by grouping requests from each client IP address on the same day —

referred to as anL set. We design four sets of features called sub-detectors, each

corresponding to a distinct guessing attack behavior reported in prior work. We

explore three ways of using sub-detectors: using DAS to score an L set against

previously identified suspicious L sets, using DAS to score against a sliding

window of L sets, and applying the Araña clustering algorithm on an indi-

vidual day’s worth of L sets and scoring the resulting clusters using DAS. All

approaches are timely in that they could generate a daily report of likely attacks

for an analyst to further investigate.

We evaluate these approaches on the Gossamer dataset and find that they

perform much better than prior approaches at detecting attacks. For example,

the per-day clustering approach discovered >98% of the attacks reported by

Arañawhile maintaining a precision of >52% at both universities. The sliding

95


window approach even discovered new attacks not reported in prior work, but

generates more false alerts and had precision of 22%. We manually inspected a

subset of these false alerts following the approach specified in [64] and found

that a large fraction of the false alerts are likely attacks missed by [64].

While this shows that DAS-based approaches can detect existing attack

strategies, it is unclear if they will remain effective even when the attacker

adapts to such defense mechanisms by actively modifying their login requests

to evade being flagged as malicious. We therefore propose a new, data-driven

approach for evaluating evasion attacks: We use real attack campaign login

data, modify it as per an evasion strategy, and then place it back into the stream

of otherwise untouched background login traffic. We then simulate running the

defense mechanism on the new combined stream of logins to allow a realistic,

in situ analysis of detectability.

We devise four evasion strategies that either distract the detector from the

actual attack or help the attack blend in with other benign logins. We evalu-

ate them using our approach for varying amounts of attacker resources—the

number of additional IP addresses, login requests, and time spent on the at-

tack campaign. We find that all attack detection mechanisms can be evaded

when using enough resources, though certain detection approaches require the

attacker to spend more resources to succeed. While future work may improve

our evasion attacks, for now our results suggest that deploying timely detection

mechanisms would help catch unsophisticated attacks and increase the cost of

more sophisticated attacks.

Contributions. In summary, we make the following contributions in our work.

• We perform the first in-depth performance analysis of existing volumetric-

96


and ML-based attack detection approaches on a real login dataset containing

34 million login requests.

• We propose three new timely detection approaches leveraging Directed

Anomaly Scoring (DAS) and evaluate them on the Gossamer dataset.

• We show how to measure the robustness of detection approaches against eva-

sion attacks and evaluate it for the three DAS-based detection mechanisms.

We are in the process of contacting the universities to perform responsible dis-

closure about the new attacks discovered in the course of this work. We have

open sourced the codebase we developed for this study.1

4.2 Background & Related Work

This project focuses on the timely detection of online password guessing attacks.

Online guessing attacks. In a password guessing attack, attackers submit can-

didate username-password combinations to an authentication server to guess

users’ correct passwords. They may employ different strategies—for example,

they may deploy a password spraying attack by trying a few common passwords

against many different user accounts.

Studies have shown that users tend to reuse the same or similar passwords

across different websites [90, 119]; an attacker can leverage this in a credential

stuffing attack by guessing breached passwords from one website against an-

other. Billions of breached passwords have made credential stuffing a powerful

1Available at
https://anonymous.4open.science/r/real-time-malicious-login-detection-884B/

97

https://anonymous.4open.science/r/real-time-malicious-login-detection-884B/


tool for attackers to compromise user accounts. In fact, automated bot-driven

credential stuffing attacks account for 91% of login traffic against ecommerce

websites [98] and contribute to 49% of all successful data breaches [46]. Despite

this, existing server-side defenses have not been evaluated on credential stuffing

attacks in practice.

Client-side defenses. Modern websites use various proactive client-side mea-

sures to protect users against online guessing attacks. These measures include

enforcing password composition policies, banning the use of common, eas-

ily guessed passwords like “123456”, and deploying password strength me-

ters [93, 122, 123] and breach alerting services [74, 94, 112]. However, these

client-side approaches rely on user motivation, and users may bypass or ignore

them [53, 115]. Users often resist changing their passwords even after receiving

warnings from breach alerting services [22]. Password managers and two-factor

authentication (2FA) could help address these issues, but they are not yet widely

adopted in practice, and user friction still remains high [19,41,44,62,82,97,127].

Recent studies have also identified several key security issues with both pass-

word managers [91, 107] and 2FA [54, 68] Therefore, modern websites must

also implement server-side defenses complementary to the above-mentioned

approaches to protect their user accounts.

Volumetric defenses. To prevent online password guessing attacks, web-

sites can (temporarily) prevent login to an account if too many prior login at-

tempts failed. Prior work has suggested various thresholds including three [49],

ten [34], and a dynamically adjusted threshold that gets adjusted based on as-

pects of the submitted password [27]. Such account lockout strategies can slow

down targeted password guessing attacks. However, attackers can abuse this

98


defense mechanism to launch a simple denial of service attack, locking users out

of their accounts [14]; as such, it is not widely used. Indeed, 72% (131) of Alexa

Top 500 websites do not use account lockout [77]. More importantly, 21 out of 29

attack campaigns discovered by Islam et al. [64] exhibited untargeted behavior,

sending only one request per username to a large number of usernames; thus

account lockout defenses would be ineffective against such attacks.

Login servers can also restrict or reject login from an IP addresses if too

many login attempts from that IP address fail. For example, to slow down

online guessing attacks, websites might silently block requests from such an

IP address [17], present a CAPTCHA challenge [117], or require additional

knowledge-based challenges [113].

Machine learning defenses. Another set of approaches to detect password

guessing attacks utilizes machine learning (ML) [52, 59, 64, 101, 102]. For exam-

ple, Freeman et al. [52] proposed WhoAreYou which uses the reputation of IP

addresses and user agents to classify logins as malicious or benign. Building on

this, Schechter et al. [102] proposed StopGuessing which goes further to include

password-based features in creating a classifier for detecting malicious login

attempts. Both approaches, however, have primarily been evaluated using sim-

ulated attacks. Thus the effectiveness of these approaches against real-world

password guessing attacks remains uncertain. We analyze the efficacy of these

approaches on the Gossamer dataset in Section 4.3.3.

Herley et al. [59] and Schechter et al. [101] also proposed similar ML-based

login traffic classification approaches using different features. However, these

rely on unrealistic assumptions for deployment—such as the presence of an

attack-free period for collecting benign login traffic [59] or requiring users to

99


change their password behavior prior to attack detection [101]. We do not con-

sider them for further analysis.

Recently, Islam et al. [64] proposed Araña,which uses volumetric, client-

level, and password-based features to group IP addresses with similar suspi-

cious behavior together into attack campaigns. Arañawas tested on a large lo-

gin dataset of 34 million login requests (called the Gossamer dataset, described

in Section 4.3.1) and discovered 29 attack clusters. Manual analysis revealed di-

verse characteristics including short and bursty behavior, as well as attempts

to conceal behavior by distributing guesses across multiple IP addresses over

longer duration. Notably, the majority of successful attacks (21 out of 29) exhibit

credential stuffing behavior, while the eight other targeted attacks are mostly

unsuccessful. All attack clusters are illustrated in Figure C.5.

Arañarequires months of login data for creating clusters; thus it is uncertain

if it can be used to detect attacks in a timely manner (say, within a day). Also,

Arañarelies on a handcrafted filtering rule to separate attack clusters from the

benign clusters, making Arañamore challenging to deploy in practice.

Our work focuses on timely server-side detection of online password guess-

ing attacks. To the best of our knowledge, both the volumetric and ML-based

approaches that are deployed in practice have not been evaluated comprehen-

sively for their efficacy against such powerful real-world password guessing at-

tacks. In Section 4.3, we fill this gap by attempting to measure the effectiveness

of a few volumetric and ML-based approaches on a real-world login dataset.

100


U1 U2
benign malicious benign malicious

Login requests
train 8,693,670 41,330 1,892,509 275,584
holdout 1,444,446 34,554 525,924 11,583
all 10,138,116 75,884 2,418,433 287,167

L sets
train 3,962 93 185,776 1,568
holdout 194 663 40,408 141
all 4,156 756 226,184 1,709

Target usernames
train 175,503 23,951 101,557 10,614
holdout 96,462 24,339 50,195 3,077
all 271,965 48,290 151,752 13,691

Figure 4.1: Breakdown of the Gossamer logs used in our simulations. Login
requests associated with Araña-identified clusters are labeled as malicious, and
all others are labeled as benign.

4.3 Evaluating Prior Approaches

We evaluate the performance of the existing volumetric and machine learn-

ing (ML)-based attack detection approaches on a large-scale, real-world login

dataset we obtained from the authors of Gossamer [29]. In this section, we re-

port on the efficacy of prior detection approaches when tested on a real-world

login dataset such as Gossamer’s.

4.3.1 Dataset

To measure the effectiveness of existing approaches, we use a large dataset of

34M login requests collected via the Gossamer framework [29]. Gossamer en-

ables secure logging of information about login requests, including the source

IP address, the anonymized username, user agent, HTTP headers, timestamp,

submission result, and a set of carefully chosen password-derived features from

each login request.

101


Bohuk et al. [29] deployed this framework at two large universities that we

will refer to as U1 and U2. At U1, the instrumentation collected logins over

a period of seven months (212 days); for U2, it collected logins for two and a

half months (72 days). Altogether, the instrumentation collected information

on 34M login requests originating from 1,039 K unique IP addresses to 505 K

unique usernames — making Gossamer-collected logs the largest and most di-

verse real-world login dataset used in academic research to date.

In subsequent work, Islam et al. [64] used Gossamer logs to identify ten high-

volume attack campaigns at U1 and 19 at U2 using a new clustering-based ap-

proach. This approach, called Araña, takes login requests over the whole instru-

mentation period and groups these login requests by their client IP address and

the corresponding day of the login. Similar to [64], we refer to each group of

logins associated with an (IP address, day) pair using the notation L set. Araña,

then, filters L sets likely to be benign (using simple heuristics) and applies an

application-specific clustering algorithm producing an output that allows man-

ual analysis to identify the 29 attack campaigns.

Bohuk at al. [29] did not publicly release the Gossamer dataset at the time

of publication. We worked in close collaboration with the authors of [29] to get

access to the Gossamer dataset and handled the dataset following the same set

of ethical guidelines as described below.

Ethical considerations. Gossamer dataset is anonymized (meaning it does not

contain actual usernames), but it includes fields that were derived from pass-

words. Therefore, analyzing such data requires careful ethical considerations.

We maintained the same set of strict ethical guidelines to safeguard the privacy

and security of Gossamer dataset as [29].

102


For example, Gossamer dataset was safely stored on a VM protected by

multi-factor authentication, and only researchers have access to it for perform-

ing the experiments. Similar to prior work [29,64], we could not ask for explicit

consent from users since usernames were deterministically encrypted with keys

that were kept secret from researchers. Our study was reviewed by the IRB and

received exemption, as the research poses no more than minimal risk, and the

Gossamer logs do not contain any usernames or other data considered to be PII

by the IRB.

Simulating defense mechanisms. Gossamer logs provide a rich dataset to

drive simulations of attack detection mechanisms. We therefore build a data-

driven simulation pipeline. To evaluate a defense mechanism, the simulator

iterates through all login requests from Gossamer logs. Given the nature of

Gossamer logs, our simulator works for a wide range of defense mechanisms,

but not all — for example, it will not work if a mechanisms needs the plaintext

password or username string, since neither are included in the logs.

Some of the mechanisms we evaluate require training on login data. To

simulate these, we divide the instrumentation period at each university into

two parts. The login data collected during the initial 80% of each instrumen-

tation period is used for training malicious login detection models, as detailed

in Section 4.3.3. Then the last 20% of each instrumentation period is used as a

holdout for evaluating the efficacy of the defense mechanism.

For the ground truth in evaluating mechanisms, we classify all L sets and

login requests associated with the 29 attack campaigns previously documented

(by Araña [64]) as malicious and the remaining data as benign. Figure 4.1 pro-

vides a summary of the number of benign and malicious L sets and login re-

103


quests across the train and holdout splits.

Characterizing malicious login detectors. For malicious login detection to be

useful in practice, it must identify malicious login requests (high recall) without

flagging too many benign login requests (high precision). We use precision and

recall as opposed to other metrics (e.g., AUC-ROC) given the highly skewed na-

ture of the dataset (few attacks relative to benign traffic — see Figure 4.1) [20].

Along with precision and recall, we also show the harmonic mean of precision

and recall, called the F1 score, as an easy way to interpret the performance using

only one score. Detectors should allow the analyst to identify guessing attacks

in a timely manner, ideally before the exploitation of the compromised accounts.

Additionally, the detector should provide clear reasons to the analyst for flag-

ging login requests from a specific IP as malicious. This clarity ensures that

analysts can easily confirm genuinely compromised accounts and avoid misre-

porting accounts that are not compromised.

4.3.2 Performance of Volumetric Approaches

Volumetric detection approaches flag an IP address as suspicious if the number

of failed requests from an IP exceeds a certain threshold within a period of time.

First, we consider a strict threshold of k = 10 failed login requests within a day;

Bohuk et al. [29] suggest this threshold based on their empirical observation that

99% of login attempt sequences from an IP contain 10 or fewer login requests

within a day.

We also try to find a k that would improve the precision of detection, pre-

sumably at the expense of missing some low-volume attacks. We test different

104


U1 U2
Request level L set level Request level L set level

P R F P R F P R F P R F

Volumetric
k = 10 51 84 63 82 46 59 6 99 11 7 16 10
k = 1, 000 100 10 19 100 < 1 < 1 8 35 14 10 1 2

ML-based
SGG n/a 0 0 n/a 0 0 11 5 7 17 19 18
WAY n/a 0 0 n/a 0 0 39 9 15 6 13 8

Anomaly-based§

CS 84 (86) 24 38 19 (28) 13 15 23 (93) 61 37 32 (78) 43 47
SW 85 (86) 82 84 16 (18) 47 23 22 (86) 99 36 27 (46) 95 42
PDC 94 (94) 100 97 79 (79) 100 88 33 (83) 81 47 52 (93) 98 68

§ Manual inspection of the false alerts generated by anomaly-based approaches (Section 4.5)
revealed 49 (at U1) and 221 L sets (at U2) as clearly malicious. We show within parenthesis (·)
the adjusted precision if these new attacks are included in Arañaground truth.

Figure 4.2: Evaluation of the volumetric, ML-based, and DAS-based approaches
on the Gossamer holdout dataset using attacks reported in [64] as ground truth.
We report the Precision, Recall, and F1 scores (by percentage (%)) broken down
both by individual requests and L sets. The highest value in each column is
boldfaced. StopGuessing-G and WhoAreYou do not flag any requests as mali-
cious at U1.

values of k (k ∈ {10, 102, 103, 104}) and find that k = 103 produces the highest F1

score for requests at both U1 and U2 (82% at U1 and 75% at U2). Detailed results

of this experiment are given in Appendix C.1.

Over the holdout data, we observe that this volumetric approach effectively

flags IP addresses when using the relaxed lockout threshold of k = 103, achiev-

ing an F1 score of 82% for requests at U1. However, on the L set level, the F1

score drops significantly to 12% primarily due to a very low recall of 6%. This

indicates that the relaxed threshold misses many highly distributed attack cam-

paigns at U1. In contrast, the more strict lockout threshold of k = 10 results in an

F1 score of 23% for requests and 9% for L sets. At U2, both lockout thresholds

k = 10 and k = 103 performed poorly with request-level F1 scores of 11% and

105


14% respectively (Figure 4.2). For L sets, the F1 scores are 10% and 2% for these

two thresholds.

In summary, volumetric approaches perform poorly at detecting most at-

tacks, only flagging a few high volume ones. This can lead to high recall rates

when measured on individual requests because the detected attacks are so high

volume compared to the others not detected. Prior work also suggests that the

higher threshold of k = 1, 000 is not sufficiently strict: Pal et al. [93] demon-

strated a credential tweaking attack for which their simulations indicate that

16% of user accounts can be compromised within 1,000 guesses.

4.3.3 Performance of ML Approaches

We consider two ML-based approaches for detecting password guessing at-

tacks: WhoAreYou [52] and StopGuessing [102].

WhoAreYou. Freeman et al. [52] proposed WhoAreYou (WAY) to estimate the

risk score of a login request from its associated IP address and user agent string,

flagging the request as malicious if the score is above a threshold. They tested

their approach on simulated attacks in which the attacker attempts to imitate the

IP addresses and user agents of the victim user, and on one real botnet attack

experienced by the LinkedIn website.

We implement WhoAreYou following the authors’ description and train the

model on the training split of Gossamer logs. The performance of this model

necessarily relies on a reliable reputation score of the IP addresses and user

agents. Freeman et al. used the internal reputation scores at LinkedIn [15]

106


for IP addresses and user agents, but neither the scoring API nor the details

of the scoring mechanism are available publicly. In our model, we calculate

the reputation scores using the number of attack logins (as per [64]) observed

during the training split of Gossamer data. Intuitively, this means that the more

malicious logins we observe from an IP address, the riskier its reputation score

will be. We follow the same approach to calculate reputation scores for different

countries, autonomous system (AS) numbers, internet service providers (ISPs),

and user agent strings. We give more details in Section C.3.1.

Since WhoAreYou does not take characteristics of the submitted password

into consideration, unlike StopGuessing, we are not restricted to considering

only successful requests; we therefore We use all 75,884 and 287,167 malicious

logins from U1 and U2 (Figure 4.1) for the training and subsequent evaluation

of the model. We randomly sample the same number of benign login requests

as the number of malicious logins.

We evaluate the performance of WhoAreYou on the Gossamer holdout data.

We follow the same smoothing technique suggested by Freeman et al. [52] to es-

timate risk scores for login requests having unseen IP addresses or user agents

during the holdout period of Gossamer logs. At U1, WhoAreYou does not per-

form well, failing to flag any malicious logins. At U2, WhoAreYou only correctly

flags 2,480 malicious login requests (only 9% of the total malicious logins during

the holdout period) with an F1-score of 15%.

We attribute this issue, as detailed further in Appendix C.3.2, to the sparse

nature of Gossamer logs — only a few IP addresses, ISPs, or even AS num-

bers are present in both the holdout and the training data. Even the smooth-

ing technique proves ineffective in addressing this issue. While it’s possible

107


that WhoAreYou and the underlying smoothing technique may demonstrate

efficacy on other real-world settings, our evaluation clearly shows its limiting

performance on Gossamer logs, which is currently the most diverse and largest

real-world login dataset available.

StopGuessing. Schechter et al. [102] go further to develop an ML model that

uses five features to score IP addresses; if the score is above a threshold, all

attempts from that IP address for a period of time are deemed malicious.

When a login request from an IP address is successful, the model computes

the following features from the prior login requests submitted from the IP ad-

dress within a period of time: (a) whether the submitted the password is from

the frequently submitted password list of the attacker, (b) the expected number

of attempts needed to guess the correct password of the account with the suc-

cessful login request, (d) the fraction of submitted passwords to the username

from the IP address that were a typo, (e) whether the IP address has a high num-

ber of historical failed login attempts, and (f) whether the login is from a client

with a prior successful login (identified by an authentication cookie). Addition-

ally, this approach ignores repeated username-password pair submissions from

the IP address, and it calculates feature (a) twice from two distinct sets of login

requests from the IP address: the set of all login requests with valid usernames

and the set with invalid usernames.

This approach was only evaluated on synthetic dataset generated by simu-

lating benign users’ behavior and various types of password spraying attacks

— a key limitation of the work as noted by the authors themselves [102]. Al-

though some of the code is available at [18], a few exact details related to calcu-

lating password-derived features are missing — for example, what edit distance

108


threshold should be used to differentiate typos from incorrect submissions and

how to measure whether a password is frequently submitted.

Given the sensitive nature of passwords, Gossamer logs contain only

password-derived measurements that the authors deemed safe to collect from

login requests. Authors in [102] sidestepped this problem by populating

password-derived features from different distributions. Thus, to evaluate

StopGuessing on a real-world login dataset, we approximate these features from

the available safe measurements collected from Gossamer logs, and for clarity

we refer to it as StopGuessing-G (SGG).

For example, StopGuessing-G uses a coarse-grained password strength

measurement computed by the zxcvbn strength meter [123] to identify weak

accounts. To consider a password as frequently submitted, StopGuessing-G

checks if it is within the top 1,000 frequently submitted incorrect passwords

in the last 24 hours. Furthermore, to identify logins from the same device,

StopGuessing-G uses a tuple of the normalized user agent and username, as

Gossamer logs filter the authentication cookie value from the HTTP headers of

the login request. We provide more details of StopGuessing-G, our best effort

implementation of StopGuessing, in Appendix C.2.

To train the model, we treat each successful malicious login request during

the training period as a positive sample. However, the training period only

contains 12 such positive samples at U1 and 3,005 at U2. Following Schechter

et al. [102], we randomly sample the same number of successful benign login

requests (negative samples) during the training period. Finally, we calculate the

aforementioned seven features for each login request from the training samples.

109


If any successful request from an IP is flagged malicious by the model, all

requests from that IP on that day are considered malicious. The trained model

achieves an F1 score of 52% at U2 and 0% at U1 over the training dataset. More

details of our implementation of StopGuessing are given in Appendix C.2.

StopGuessing-G performed poorly at both universities in detecting success-

ful malicious login requests during the holdout period. At U1, it did not flag

any login requests as malicious, and at U2, it only flagged 5% of malicious lo-

gins with a poor precision of 11%. We hypothesize that the poor performance of

StopGuessing-G is primarily because there were very few successful malicious

login requests in the training dataset.

Open questions. We show that both the volumetric and ML-based approaches

exhibit poor performance when applied to the Gossamer dataset. Specifically,

volumetric approaches prove to be too simplistic for capturing the complex and

diverse attack strategies, while ML-based approaches struggle to perform well

under real-world constraints, such as sparse IP addresses and coarse-grained

password derived measurements. A fundamental question persists: Can we de-

velop an efficient and timely malicious login detection mechanism that achieves

high precision and recall in a real-world dataset such as Gossamer logs?

4.4 Anomaly-Based Detection Approaches

We now describe three new malicious login detection methods based on

anomaly detection techniques. We leverage an approach called directed

anomaly scoring (DAS) introduced by Ho et al. [61]. Briefly, their approach

involves first choosing a set of features that describes the suspiciousness of an

110


event and labeling for each feature which direction—higher or lower—is “sus-

picious.” For example, the higher the number of requests from an IP address,

the more likely it’s malicious. Thus the number of requests (NR) can be a DAS

feature with higher being the direction of maliciousness. Then DAS uses these

per-feature comparisons to define an event comparison: event A is more sus-

picious than event B if all of A’s feature values are more malicious than the

respective feature values of B. Finally, the DAS score of an event compared to a

set of other events is equal to the number of other events for which it is more

suspicious.

Two observations motivated us to adopt DAS-based approaches. First, as

Ho et al. argue, supervised ML approaches do not handle well the heavy im-

balance between normal and anomaly events. While Ho et al. investigated this

in the context of spearphishing, we conjecture that similar dynamics arise in

malicious login detection. For example, in Gossamer logs, only 18% of logins

at U1 and 11% at U2 are malicious (Figure 4.1). Also, previous supervised ML

approaches such as those proposed in [52,102] proved ineffective when applied

to our datasets (see Section 4.3.3). Secondly, in most real-world login datasets,

there is no way to describe benign behavior as a set of patterns or distributions,

which is essential for standard anomaly detection approaches [37, 75]. There-

fore, we use DAS which can overcome these limitations.

DAS scores of L sets. We group all login requests within a given period (say,

a day) by their source IP address, which we refer to as an L set. We then de-

sign a set of features Γ that quantifies the suspiciousness of an L set. Thus, the

DAS score of an L set given a set of other L sets L and a set of features Γ is

DAS_Score(L,L,Γ) = |{L′ ∈ L|∀γ ∈ Γ, γ(L) > γ(L′)}. We later describe how we

111


use DAS_Score to determine if an L set is malicious. If an L set is labeled as

malicious, we consider all login requests belonging to that L set as malicious.

Designing feature sets. Online guessing attacks through an IP address will

exhibit a number of distinct characteristics. For example, malicious IP addresses

often have an abnormally high number of failed login requests (NF), since the

majority of an attacker’s guesses fail. Often, attackers send popular passwords

to many users in a “password spraying attack”; other times, they may focus on

testing many passwords against one user in a targeted attack. Attackers may use

publicly available or privately sold breached credentials to execute a guessing

attack. They may draw out their attack over a longer period of time or send a

burst of login requests in a short time period. We synthesize these malicious

characteristics into four distinct attack behaviors and use sets of features that

describe each of these behaviors.

• Targeted attacks (ΓTar). The attacker targets certain users by sending many pass-

word guesses to only one or a few users. We use the average number of

unique passwords submitted to a user (AUP) from an L set to detect this sus-

picious behavior.

• High volume bursty activity (ΓBursty). The attacker sends a burst of logins for a

very short period of time. We use the mean and standard deviation of the

interarrival time between successive login requests (called MIT and SIT re-

spectively) to identify L sets with bursty behavior. We augment these two

features with the number of requests (NR) and the number of unique user-

names (NU) originating from the L set, which reduces false alerts.

• Known breach credential stuffing attack(ΓKnwnCrd). We use the fraction of requests

with a breached password (FPIB), the fraction of requests with a username

112


from a breach (FUIB), and the fraction of requests with a username-password

pair in the breach (FCIB).2 Similar to the bursty activity feature set, we aug-

ment these three features with NR and NU.

• Unknown breach credential stuffing attack (ΓUnkwnCrd). In some cases, anL set may

appear to indicate credential stuffing behavior, but the credentials used do

not directly appear in Gossamer’s breach data. We hypothesize that these at-

tacks are using a different breach dataset than that of Gossamer, and we refer

to such attacks as unknown breach credential stuffing attacks. L sets exhibiting

such behavior would have a low FPIB but a higher fraction of tweaked pass-

words (FTP) and fraction of usernames in the breach (FUIB). Thus we rely on

a feature set of {FTP,FUIB,FCIB,NR,NU} to detect such attacks.

For a given L set, we calculate the DAS score separately for each set of features

Γ, and we consider an L set as malicious if it has a positive DAS score using any

of the four feature sets. We also experimented with other DAS score thresholds

larger than zero while identifying an L set as malicious; however they did not

perform well.

4.4.1 Malicious Login Detection Using DAS

We design three different approaches for performing timely attack detection us-

ing DAS scoring. By timely, we envision that an analyst could deploy these

approaches and after a certain period (at the end of a day), collect and analyze

2After 29 days into instrumentation, the U2 deployment of Gossamer had university-specific
breach data added. We therefore use this FCIB feature only after that date for U2, as the FCIB
for the L sets in the first 29 days is zero, making it unhelpful for DAS scoring.

113


Attack type ID / Univ # of
L sets NR CS SW PDC

# L sets % req # L sets % req # L sets % req

Credential
stuffing

3 / U1 555 15,117 47 14 208 61 555 100
4 / U1 77 14,603 8 10 71 98 77 100
5 / U1 1 4,593 1 100 1 100 1 100
9 / U1 30 241 29 97 30 97 30 100
12 / U2 ∗ 3 2,601 0 0 3 100 1 42
19 / U2 1 428 1 100 1 100 1 100
20 / U2 4 5,199 4 100 4 100 3 86

Targeted

22 / U2 74 1,878 1 1 74 100 74 100
23 / U2 52 1,296 48 96 48 94 52 100
24 / U2 4 101 4 100 4 100 4 100
25 / U2 3 80 3 100 3 100 3 100

Figure 4.3: The detection performance (recall) of the three DAS-based ap-
proaches on the four attack clusters at U1 and seven at U2 reported by
Araña [64] over the holdout period. Multi-day clusters are highlighted with
an asterisk (∗), and the # L sets is boldfaced when all L sets of the attack cluster
are detected by the approach.

the L sets flagged.

Comparison Set (CS). In our first approach, we use a comparison set of mali-

cious L sets to which we compare a given L set. Specifically, to formulate the

comparison set for a given L set on day d, we first gather all L sets from the last

δ days (L[d−δ,d)). Then we calculate the DAS scores of each L set in L[d−δ,d) with

respect to L[d−δ,d) given a feature set Γ. The comparison set then consists of the

δ ·β L sets with the highest DAS scores, where β is an estimate of the number of

L sets an analyst would be able to analyze per day for each sub-detector. Ties

are broken based on the number of requests in the L set (NR) and then by the

number of unique users contacted (NU).

We set δ = 7 and configure each sub-detector with a budget of β = 5. Then

to flag an L set on day d, we evaluate the DAS score of the L set against the

comparison sets of that day for each sub-detector; if the DAS score is positive

114


for any sub-detector, we flag the L set as malicious.

Already, with this configuration, Comparison Set outperforms the previ-

ously tested ML approaches on the training period of Gossamer logs, producing

F1 scores of 82% at U1 and 52% at U2 for detecting malicious logins (Figure C.2).

It missed someL sets with comparatively fewer logins—yielding a recall of 53%

at U1 and 45% at U2 for L sets. However, the performance of this approach on

the holdout data was modest, as we describe in Section 4.5.

Sliding Window (SW). Since the Comparison Set approach requires the last

δ = 7 days of L sets for detecting malicious logins, we explore an alternate

approach where we look at L sets in a much shorter time interval (or window)

of size swin—say one hour. We consider the requests within a time window

[ts, ts + swin]. We group all login requests within this window into L sets and

then score each of the resulting L sets using DAS w.r.t. all L sets in that time

period for each sub-detector Γ. We start by setting ts to the beginning of a given

day and “slide the window” to increment ts by ssld, and we repeat the process

until the end of the day. Then we reset ts to the start of the next day and repeat

the process.

Similar to other approaches, anyL sets with a positive DAS score are flagged

as malicious. If an IP address corresponding to an L set is flagged in any win-

dow on a given day, we consider all login requests from the IP address through-

out the day as malicious.

We perform experiments with swin = 1 hour and ssld = 10 mins for both

universities. (We also experimented with different swin and ssld values but found

that the F1 score did not change much at both universities over the training

115


dataset). After evaluating on our training dataset, we observe that this approach

flags more L sets as malicious—with a recall of 99% at U1 and 98% at U2 for

malicious requests. It also has a precision of 71% (U1) and 61% (U2). Thus, the

Sliding Window approach outperforms the Comparison Set method in terms of

both recall and precision during the training period of Gossamer logs.

This tendency to flag manyL sets as malicious can be a double-edged sword.

On the one hand, it may hurt the precision of Sliding Window; on the other, it

may help discover previously undiscovered attacks. We delve more into this

tradeoff in Section 4.5.

Per-day Clustering (PDC). In the clustering approach used for Araña, Islam et

al. [64] clustered L sets into attacks using a set of volumetric, client-level, and

password-based features. However, their clustering approach relied on seeing

the whole time period of data before clustering into attacks, which does not

allow an analyst to detect attacks in a timely fashion.

We now test whether this clustering approach could be applied at the end

of each day, allowing for timely detection. We apply Arañato each day of login

requests to obtain a set of clusters and calculate the combined feature values of

the resulting clusters (normally taking an average or sum of the values). We

then compute the DAS score for each cluster w.r.t. each other cluster, and we do

this using each set of sub-detector features separately. As before, any positive

DAS score indicates that a cluster is malicious, and if a cluster is flagged mali-

cious, we consider all L sets (and therefore all requests) within that cluster as

malicious. This approach exhibited the best performance so far with F1 scores

of 82% and 55% for requests at U1 and U2, respectively. More importantly, this

approach maintains its performance over the holdout period, as we highlight

116


# of L sets U1 U2

False alerts based on Araña 1,851 367
Manually analyzed w/ NR > 5 229 367
Manually flagged as malicious 49 221
Manually flagged malicious and detected by
• Comparison Set 38 48
• Sliding Window 11 133
• Per-day Clustering 0 110

Figure 4.4: Manual analysis of the false positives (FPs) generated by the three
DAS-based approaches.

in Section 4.5.

4.5 Attack Detection Performance

We now evaluate the performance of the three DAS-based detection approaches

described in Section 4.4.1 on the holdout period of Gossamer logs. We focus

on answering two questions: how well each approach can capture the attacks

reported in prior work [64] and how often it generates false alerts. We report the

precision, recall, and F1 scores of the three approaches over the holdout period

in Figure 4.2.

In general, DAS-based attack detection approaches performed better than

the volumetric and ML-based ones. The Comparison Set approach performed

relatively poorly with an F1 score of 38% for requests at U1 and 37% at U2.

Interestingly, the Sliding Window approach performed quite well at U1 with a

precision (85%) and recall (82%) but had too many false positives (with respect

to the prior approaches) at U2. Per-day Clustering performed the best among

the three DAS-based approaches at U1, achieving perfect recall for both requests

and L sets, as well as good precision. For U2, Per-day Clustering in general

117


Comparison Set

2021-02-18

2021-02-20

2021-02-22

2021-02-24

2021-02-26

2021-02-28

2021-03-020

20

40

60

80

Days

#
of

al
er

ts
(L

se
ts

) Total alerts
True alerts
Known attacks

Sliding Window

2021-02-18

2021-02-20

2021-02-22

2021-02-24

2021-02-26

2021-02-28

2021-03-020

20

40

60

80

Days

#
of

al
er

ts
(L

se
ts

) Total alerts
True alerts
Known attacks

Per-day Clustering

2021-02-18

2021-02-20

2021-02-22

2021-02-24

2021-02-26

2021-02-28

2021-03-020

20

40

60

80

Days

#
of

al
er

ts
(L

se
ts

) Total alerts
True alerts
Known attacks

Figure 4.5: For the three DAS-based approaches, we report the total number of
alerts (# ofL sets flagged malicious) generated per day (red line) and how many
of them are true alerts (blue line) over the testing period at U2. We also report
the number of known malicious L sets on each day (green line) which is the
same for all approaches. A smaller gap between “total alerts” and “true alerts”
indicates higher precision, and a smaller gap between “true alerts” and “known
attacks” indicates higher recall.

performed better than the approaches but had a low precision. Since Per-day

Clustering is based on Araña, this high performance is not surprising. The low

precision at U2 is due to potential attacks that were not detected by Araña [64].

Capturing known attacks. Figure 4.3 illustrates the performance of the DAS-

based detection mechanisms on the 11 attack clusters present in the holdout

period.

118


At U1, we observe that both the Comparison Set and Sliding Window ap-

proaches struggle to detect two high volume attack clusters (#3 and #4) which

are likely part of the same attack, as they occurred on the same day and dis-

played similar behavior. The Per-day Clustering approach performed the best

at U1, with a 100% recall. We believe this is because clustering can group the low

volumeL sets exhibiting similar malicious behavior together (such as in clusters

#3 and #4); and since Per-day Clustering calculates DAS scoring at the cluster

level, these low volume L sets are unable to escape detection when grouped

together.

At U2, Comparison Set misses one attack cluster #12 completely, along with

all but one L set from cluster #22, four L sets from cluster #23, producing a low

recall of 61% for requests and 43% for L sets. On the other hand, the Sliding

Window and Per-day Clustering approaches display a higher recall of > 81%

for requests and > 95% for L sets. In particular, the Sliding Window approach

captures all attack L sets at U2 but has a lower precision of 27%. The Per-day

Clustering approach captures all but threeL sets, resulting in a recall of 98% and

81% for L sets and requests respectively. However, unlike the Sliding Window

approach, Per-day Clustering maintains a higher precision.

Analyzing false alerts. At both universities, the DAS-based detection ap-

proaches flagged a number of L sets as false positives according to the Araña-

ground truth [64]. We manually analyzed these false positives to confirm

them as either benign or malicious (meaning they were initially missed by

Araña [64]).

To do this, we take all the false positive L sets (1,851 at U1 and 367 at U2)

and filter out those containing five or fewer requests, as it is too difficult to make

119


a judgment on such a low volume L set. This filtered out 1,622 L sets at U1 and

none at U2. We collaboratively performed a manual analysis of theseL sets and

inspected if they were clearly malicious as described in prior work [64]; if not,

we consider them benign.

Figure 4.4 presents a quantitative breakdown of false positive L sets (based

on Arañaground truth) that our conservative manual analysis confirms as ma-

licious. We find 49 L sets at U1 and 221 at U2 to be clearly malicious

For example, at U1, the Comparison Set approach flagged 378 low volume

L sets that appear on the same day as attack cluster #9. We manually analyze

38 of these 378 L sets having NR > 5 and find all 38 L sets to be part of a

larger attack encompassing attack cluster #9. We do not make a decision about

the other 340 L sets with NR ≤ 5 because it is difficult to make a judgment

for an L set sending less than five requests. These 38 L sets were not flagged

by Arañabecause they were discarded by the “HFR” heuristic that filtered out

L sets sending less than 10 login requests at U1 (NR ≤ 10). By treating those 38

L sets as true alerts, precision improves slightly from 19% to 28% at the L set

level and 84% to 86% at the request level. The Sliding Window approach flagged

an overwhelming 1,684 false positive L sets at U1, 22 of which had NR > 5.

After manually analyzing these 22, we found that 11 were indeed malicious.

This increased the precision slightly from 16% to 18% at the L set level and 85%

to 86% at the request level for Sliding Window. Although the Per-day Clustering

approach flagged 168 false positiveL sets at U1, none appeared malicious under

manual inspection, as all have NR ≤ 5.

At U2, we find that 221 (60%) of the false positiveL sets are clearly malicious.

These L sets were missed by Arañabecause Arañafocused on high volume at-

120


tack clusters, only looking at clusters with NR ≥ 5, 000 ∨ NU ≥ 5, 000. Thus, it

missed these L sets that were part of lower volume clusters, even though they

exhibit similar malicious behavior as the high volume attack clusters. We aug-

ment our ground truth of known malicious L sets from Araña [64] with these

manually confirmed 221 malicious L sets at U2 and reassess the precision of

all three detection approaches. For the Comparison Set approach, the precision

improves to 92% (from 23%) for requests and 78% (from 32%) forL sets. For the

Sliding Window approach, we observe a similar trend—the precision improves

to 86% (from 22%) and 46% (from 27%) at the request and L set levels respec-

tively. Similarly for Per-day Clustering, precision improved to 83% (from 33%),

and 93% (from 52%) at the request and L set levels respectively.

Per-day workload analysis. Since the goal of these detection approaches is to

flag malicious L sets that a human analyst can verify, it is important to under-

stand the total number of alerts an analyst would receive each day. On average,

the Comparison Set approach raises 11 alerts per day at U2, Sliding Window

raises 38, and Per-day Clustering raises 21. However, some of these alerts could

be false positives, burdening the analyst.

To better understand this, we show (in red) the total number of L sets

flagged as malicious at U2 during the holdout period in Figure 4.5. We aug-

ment the Arañaground truth with the manually labeled 221 malicious L sets

at U2 and show in green the number of known attacks on each day (this is the

same for all detection approaches). The blue line represents how many of the

alerts flagged by the detection approach are true attacks.

We observe that the Comparison Set approach does not perform well, miss-

ing many attacks and raising many false alerts. The Sliding Window approach

121


captures all known attacks (thus the known attacks and true alerts lines overlap)

but sacrifices precision by raising too many false alerts. The Per-day Clustering

approach prioritizes precision (thus the total alerts and true alerts lines over-

lap), although it does miss a few known attacks that Sliding Window is able to

capture. Considering also that Per-day Clustering has the best performance at

U1, we believe it would be suitable for an analyst who prefers high precision

and moderate recall.

4.6 Robustness of Malicious Login Detection

A major concern when deploying an attack detection approach is how robust

such an approach would be against adversaries that modify their attack strategy

based on the defense to evade detection. In this section, we model such evasion

attacks and empirically measure their efficacy.

4.6.1 Model for Evasion Attacks

Let L denote a request — a tuple of information submitted to the login server.

This tuple includes a username L.u, a password L.w, the IP address of the

client L.ip, the user agent of the client browser L.ua, and a timestamp L.t. We

assume for simplicity that all timestamps of the requests (adversarial or other-

wise) are unique, allowing a total ordering over any set of requests. A tran-

script T then is a sequence of some number of requests ordered by timestamp

— T = [L1, L2, . . .] . We denote combining two transcripts via T = T1 ∪ T2. The

set of 〈u,w〉 corresponding to a T we call a “guess set” G = {(L.u, L.w) | L ∈ T }.

122


A login server receives login requests from many other users in addition to

the attacker-controlled login requests; we refer to this as the background traffic

and denote it by a transcript B — it may include a mix of malicious and benign

traffic. A defense strategy D is an algorithm that takes as input a transcript of

requests T and outputs for each attempt whether it is malicious or benign.

An evasion attack converts an existing malicious attack transcript A =

[L1, . . . , Lm] of some length m into a modified transcript A′ ∪ S . We allow the

attacker to modify the timestamp, user agent, or IP address of requests in A (or

leave them unchanged) to produce a new transcript A′; the evasion attacker

is not allowed to change the username and password in these requests and

|A| = |A′| = m. However, the evasion attacker may add some number of ad-

ditional spurious requests S with whatever values they like.

To gauge the efficacy of an evasion, we compare how easily the unmodified

attack A and the modified one A can be detected relative to some background

traffic B. In more detail, we runD on B ∪ A and let c be the number of attempts

from A marked by D as malicious. We then run D on B ∪ A′ ∪ S and let c′ be

the number of attempts from A′ that are marked as malicious. Then the success

rate of an evasion attacker is simply the fraction of attempts that are rendered

benign after introducing evasion, namely (c − c′)/m . We assume the attacker

knows howDworks but not the contents of B.

Modeling resources for A. Given sufficient resources, there are always

straightforward evasion attacks for any of the defense mechanisms we have

considered. For example, given sufficient time, an attacker could send one re-

quest per day. Given sufficiently many IP addresses, the attacker could send one

request per IP address, which no detection mechanism can currently differenti-

123


ate from benign login traffic. Thus, the goal of existing detection approaches is

to ensure that evading detection requires a large amount of resources. Con-

versely, an optimal evasion attacker is one who minimizes the resources re-

quired for a maximal evasion success rate. Since solving this minimax opti-

mization problem is challenging, we adopt a practical approach as follows.

We assume that the attackers have constraints on the resources used. We

consider three types of resources: (1) I(A) denotes the number of IP addresses

used in the attack traffic A; (2) R(A) denotes the number requests sent over the

internet to the login server, which is equal to |A|; and (3) T (A) denotes the total

time used to send all the attack traffic, that is T (A) = maxL∈A L.t −minL∈A L.t.

We are interested in the success rate of evasion attackers that operate within

some restricted amount of additional resources. Relative to an attack transcript

A, let θip be the number of additional IP addresses the attacker has access to

beyond those in A, let θr be the number |S | of additional spurious requests the

attacker is allowed to send, and let θt be the amount of extra time to execute the

attack. Thus, an evasion attack A for a transcript A will be valid for resource

constraints θip, θr, θt if any possible output A′, S from it is such that I(A′ ∪ S ) ≤
I(A) + θip and R(A′ ∪ S ) ≤ R(A) + θr and T (A′ ∪ S ) ≤ T (A) + θt.

Simulating evasion attacks. With the above formalization, simulating evasion

attacks is simple. To do this, we give the transcript of an unmodified attack

cluster (A) to the attacker. The attacker is allowed to spend a fixed amount of

resources 〈θip, θr, θt〉 to generate B ∪ A′ ∪ S .

124


4.6.2 Evasion Strategies

Based on the way resources are used by the evasive attacker, we devise different

evasive strategies to test the robustness of the three DAS-based detection ap-

proaches. These evasive strategies are easy to launch and do not require signif-

icantly modifying existing attack scripts or tools (e.g., OpenBullet2 [16], Sentry

MBA [?]); yet, as we will show, they are effective in evading many attack cam-

paigns. Broadly, these evasive strategies can be categorized into two classes:

introducing chaff which distracts the detection algorithms from the actual attack

traffic and blending the attack traffic into the benign background traffic. We do

not experiment with evading detection from the volumetric or machine learning

approaches discussed earlier, as they already do not perform well.

1 Introducing chaff L sets. Attackers can introduce chaff to distract the de-

tector from the actual attack by sending malicious-appearing, non-attack logins

from their set of IP addresses. For example, recall that the Comparison Set de-

tection method creates a comparison set consisting of the top δ × β L sets with

the highest DAS score in the last δ days. An attack L set is flagged as malicious

if it has a positive DAS score against any other L sets inside the comparison

set. By introducing chaff, the attacker could create new L sets that score higher

than the existing L sets in the comparison set so that the chaff L sets entirely

populate the comparison set; these chaff L sets would be designed so that the

attack L set never has a relative DAS score higher than these chaff L sets.

2 Blending in using sacrificial logins. Attackers can also attempt to blend

their malicious login requests into the background benign login requests. To do

this, the attacker can add additional spurious login requests with their attack,

125


which we call sacrificial requests, as they aim to reduce (or hide) the suspicious-

ness of malicious L sets by making them appear more like the surrounding

benign traffic.

For example, consider an attack L set originally containing 100 requests.

If 40 requests from this L set contain a breached password, the fraction of

breached passwords (FPIB) feature would be 0.4. Now if the attacker sends

100 additional sacrificial requests containing a password not present in breach

data, that attack L set would now send 200 requests in total with a less suspi-

cious FPIB value of 0.2; this reduced suspiciousness could help the malicious

L set blend into the benign L sets and avoid detection.

3 Blending in by redistributing the guess set. The attacker could also at-

tempt to blend into background benign traffic by redistributing their guess set

across additional IP addresses (by changing the IP address fields in their attack

transcript). For example, consider the case of detecting targeted behavior using

the AUP feature. If an attack L set exhibits targeted attack behavior by send-

ing 25 different password guesses to a single user, then it would have a AUP

of 25 and would likely be detected for such a high AUP value. However, the

attacker can evade detection by sending these 25 different password guesses

from 25 different L sets. Each of these 25 L sets would now have an AUP of

one, effectively avoiding detection, as an AUP value of one does not exhibit tar-

geted behavior. For the sake of evaluation, we assume that the attacker evenly

splits the total number of requests and the number of requests exhibiting any

suspicious feature across all L sets.

4 Blending in by slowing the attack. Finally, an attacker can insert additional

126


delay between successive login requests to minimize the suspiciousness of the

timestamp-based features (i.e., MIT, SIT). This strategy is effective for evading

malicious L sets that send many requests within a short period of time.

For example, consider an attacker sending NR login requests from an IP

within t seconds. Assuming the attacker is evenly distributing these NR lo-

gins over the time period t, the attack L set would have a mean interarrival

time of MIT = t/NR. By remaining active for an additional θt period, the at-

tacker could increase (and thus reduce suspiciousness of) its MIT feature value

to (t + θt)/NR. Attack tools (e.g., OpenBullet2 [16]) typically provide such an op-

tion to reschedule logins to a given interval. Of course, this would increase the

cost for the attacker, requiring them to hold the IP addresses for a longer period

of time.

This list is not exhaustive; an attacker may find other ways to craft evasive

attacks, and we detail a few of these at the end of Section 4.6.3. However, we

focus on the four described evasion strategies in evaluating our three proposed

detection approaches, as we believe they are indicative of what an attacker may

try and could be effective in evading detection mechanisms.

4.6.3 Evasion Results

To investigate the efficacy of the evasion strategies, we explore a set of reason-

able amounts of resources an evasion attacker could use for evading the attack

clusters: θip ∈ {10, 20, 35}, θr = 1, 000, and θt = 2 hours. The results of our exper-

iments are shown in Figures 4.6 and 4.7. In these figures and when describing

the evasion results, we refer to each attack cluster using the same ID as from

127


ID / Univ. Type L NR
% NR evaded

Redistributing Chaff L sets
θip = 10 20 35 θip = 10 20 35

1 / U1 C 3 18,093 0 0 0 0 0 100
2 / U1 C 1 17,827 0 0 100 0 0 100
3 / U1 § C 47 2,080 0 1 37 49 49 100
4 / U1 § C 8 1,467 61 22 58 30 96 100
5 / U1 § C 1 4,593 100 100 100 100 100 100
6 / U1 C 1 2,345 100 100 100 0 0 100
7 / U1 C 1 1,481 0 0 14 100 100 100
8 / U1 C 28 472 18 56 90 69 69 100
9 / U1 C 29 30 100 100 100 97 97 100

11 / U2 §F B 663 27,750 100 100 100 100 100 100
12 / U2 § C 8 9,123 0 24 45 97 100 100
14 / U2 C 2 65 100 100 100 100 100 100
17 / U2 C 32 4,925 24 33 56 100 100 100
18 / U2 C 6 4,454 0 0 31 52 100 100
19 / U2 § C 4 4,783 0 0 9 100 100 100
20 / U2 § C 4 5,199 0 62 70 49 100 100

22 / U2 § T 1 28 0 100 100 100 100 100
23 / U2 §F T, B 48 1,250 0 0 23 71 81 100
24 / U2 § T 4 101 0 0 53 100 100 100
25 / U2 § T 3 80 0 0 100 34 100 100
26 / U2 T 1 27 0 0 100 100 100 100
27 / U2 T 1 27 0 0 100 100 100 100
28 / U2 T 1 25 0 100 100 100 100 100
29 / U2 T 4 468 0 0 0 0 0 0

Figure 4.6: Evasion results for the Comparison Set approach when attempting to
evade attack clusters by redistributing the attack transcript over θip IP addresses
or inserting θip chaff L sets.

prior work [64] (as indicated by the “ID” column). For completeness, the list

of all attack clusters from [64] is shown in Appendix C.5. For each attack clus-

ter, we use the symbols B, C, and T in the “Type” column to indicate whether

the cluster was detected by the ΓBursty, ΓKnwnCrd or ΓUnkwnCrd, or ΓTar feature sets,

respectively.

Comparison Set. We apply the evasion strategies to the nine attack clusters

at U1 and 15 at U2 detected by the Comparison Set approach. Specifically, we

apply the redistribution and chaff strategies with θip ∈ {10, 20, 35} IP addresses;

128


ID / Univ. Type L NR
% of NR evaded

θip = 10 20 35

10 / U2 F B 12 169,573 0 0 0
11 / U2 F B 663 33,304 100 100 100
12 / U2 § C 5 4,521 66 0 100
13 / U2 C 3 12,240 58 45 100
14 / U2 C 843 7,104 67 24 16
15 / U2 C 1 5,287 100 100 100
17 / U2 C 3 4,925 100 100 100
18 / U2 C 6 4,219 83 0 100
19 / U2 § C 4 3,938 0 0 100
20 / U2 § C 4 5,199 0 0 100
21 / U2 C 2 5,103 50 0 100

22 / U2 F§ T, B 74 1,878 49 66 77
23 / U2 F§ T, B 52 1,250 43 72 76
24 / U2 § T 4 101 100 100 100
25 / U2 § T 3 80 100 100 100
26 / U2 T 1 27 100 100 100
27 / U2 T 1 27 100 100 100
28 / U2 T 1 25 100 100 100
29 / U2 T 4 468 0 0 0

§ These attack clusters appear during the holdout period; the others occur during the training
period.
F For attack clusters that exhibit bursty activity, we slow down the attack by θt = 2 hours;

Figure 4.7: Results of the three blending evasion strategies against Per-day Clus-
tering technique. None of the attacks at U1 could evade Per-day Clustering.

and for the two attack clusters (#11 and #23 at U2) exhibiting bursty behavior,

we also add an additional θt = 2 hours to slow down the attack. The results are

shown in Figure 4.6.

We first try redistributing the guess set across additional IP addresses; this

evasion is partially effective, with ten attack clusters evading completely with

θip = 35 IP addresses. Next we try inserting chaff L sets to replace the existing

L sets in the comparison set. For this evasion attack to work, an attacker would

need the chaff L sets such that they have higher DAS score than the ones cur-

rently in the comparison set, but not strictly less suspicious than the attackL set,

as that would prevent evasion. We accomplish this by setting the feature values

129


of these chaff L sets as follows. For the number of requests per L set NR, we

take the 50th percentile value of allL sets inside the comparison set and thus set

NR = 19 at U1 and NR = 27 at U2 (An attacker could potentially compute these

percentiles from historical login data they might have access to). For the other

relevant features, we max out their suspiciousness by setting FPIB = FUIB = 1

and FCIB = FTP = 0.5 at both universities. We also set AUP = NR to allow for

evasion of targeted attack clusters. Thus we create chaff L sets with very high

suspiciousness along all features, producing a higher DAS score which is neces-

sary to be added to the comparison set. This also ensures that the actual attack

L sets do not dominate any of the chaff L sets and get flagged. Chaff L sets and

requests must be sent at least a day before the attack is executed.

Using chaff IP addresses in this manner allows an attacker to fully populate

the comparison set with chaff L sets for θip = 35, effectively evading all but

one attack cluster. This evasion only requires 665 requests at U1 and 945 at U2.

However, this approach fails against attack #29, as none of the evasion strategies

are effective. This is because attack #29 had an AUP of 37, exceeding the AUP of

the employed chaff L sets (≤ 27). Consequently, the attack dominates the chaff

L sets for the ΓTar sub-detector.

Sliding Window. Since the Sliding Window approach considers all L sets

within a window, introducing new chaff L sets would not help distract a detec-

tor from an actual attack. Instead, attackers might try the blending in approach

by inserting θr sacrificial logins, redistributing the guess set evenly across θip

extra IP addresses, or slowing down the attack with an extra time period θt.

Therefore, we attempt to evade all 29 attack clusters detected by Sliding Win-

dow with θr = 1, 000, θip ∈ {10, 20, 35}, and θt = 2 hours.

130


We find that this blending in strategy is only effective in evading attack clus-

ters exhibiting targeted behavior. Specifically, the eight targeted attack clusters

at U2 can be evaded by the redistributing login requests across only θip = 10 IP

addresses. The other 21 attack clusters from both universities exhibiting bursty,

credential stuffing behavior can not be evaded even with θr = 1, 000, θip = 35,

and θt = 2 hours.

Per-day Clustering. Similar to Sliding Window, Per-day Clustering also con-

siders all clusters within a single day when assessing whether a cluster is mali-

cious. Therefore, it is relatively resistant to the evasion strategy of introducing

chaff L sets.

However, an evasion attacker can still exploit the Agglomerative clustering

mechanism by using the three blending in evasion strategies to merge malicious

and benign clusters into one or split an attack cluster such that DAS scoring can-

not flag it as malicious. Existing work on poisoning Agglomerative clustering

focuses on single [26] or complete [25] linkage, but Per-day Clustering uses mean

linkage. Future work can look at how to poison Agglomerative clustering with

mean linkage and how to use that poisoning technique to evade attack clusters

detected by Per-day Clustering.

For now, we rely on using the three blending in strategies with θip =

[10, 20, 35] IP addresses, θr = 1, 000 login requests, and θt = 2 hours. Thus, the

attacker could first redistribute their attack transcript A across these additional

θip IP addresses. Then to hide A further, they can add θr = 1, 000 sacrificial logins

evenly distributed among the L sets in the attack clusters to decrease the suspi-

ciousness of the feature values. Lastly, to hide any bursty activity, the attacker

could lengthen their attack by θt = 2 hours. Figure 4.7 (right) shows the results

131


of our evasion attempts.

At U1, evasion attacks against Per-day Clustering are unsuccessful for all

ten attack clusters. At U2, we observe a progressive trend in evasion success as

we increase the number of additional IP addresses θip while holding θr = 1, 000

and θt = 2 constant for all 16 attack clusters. In particular, with θip = 35, 13 out

of 16 attack clusters evade completely. Attack #11 can evade detection just by

lengthening the attack by θt = 2 hours.

Discussion of other evasion strategies An attacker could also try other attack

strategies that don’t use extra resources. For example, since Per-day Clustering

approach considers client-level features (i.e., the ISP of the IP address and the

user agent string) while clustering L sets, the attacker could try to manipulate

these features of the attack L sets to avoid detection by randomizing the user

agents or using an IP addresses belonging to a cloud ISP popular among benign

traffic. We find that by changing the ISP of all theL sets to a Microsoft Azure ISP

and setting the user agent string to that of a browser on a mobile device, attack

#22 can completely evade detection from Per-day Clustering without needing

any extra resources.

4.7 Discussion & Future Work

Deploying detection mechanisms in practice. The three DAS-based ap-

proaches rely on four feature sets, including two involving password-derived

features. Safely collecting these features requires a framework such as

Gossamer [29] that can log such information without compromising user pri-

vacy and security.

132


Implementing and running these approaches is not difficult in practice. DAS

is straightforward to implement and runs in O(n2) time, where n is the number

of L sets the login server observes in a day. The Per-day Clustering approach

also requires O(n2) time to cluster the n L sets before performing DAS scoring.

We run our experiments on an Intel i7-4790 Ubuntu Linux machine with

eight cores and eight GB of memory at U1 and an Intel Xeon Linux machine

with 56 cores and 125 GB of memory at U2, and we parallelize the DAS and

Arañaclustering calculations using 40 threads. On average, all three approaches

finish in less than a minute per day.

Comparing results between the two universities. Although the Per-day Clus-

tering approach performs well at both universities, we observe contrasting

trends when comparing the performance of the other two approaches (Com-

parison Set and Sliding Window) between the two universities (Figure 4.2). At

times, drawing a direct comparison between these two universities is difficult

since the attacks they experienced exhibit different behaviors, both quantita-

tively and qualitatively in many cases. For example, at U2, nine attack clusters

exhibited targeted behavior, and four exhibited bursty behavior; whereas at U1,

no attack clusters exhibited either behavior. This is possibly because, as sug-

gested in [64], Gossamer collected data from every login endpoint at U2. But at

U1, Gossamer collected logins from only the main login endpoint.

We also observe differences in manually analyzing the false positives. At

U2, 367 L sets were initially labeled as false positives by the three approaches.

After manually analyzing, we found that 221 (60%) of them were actually ma-

licious. In contrast, at U1 1,851 L sets were initially labeled as false positives,

and only 49 (2.6%) of them were confirmed as malicious after manual review;

133


this was partly because the majority of false positives L sets were too low vol-

ume (NR ≤ 5) to make a judgment. Another contributing factor may be the

thresholds selected to filter clusters. At U2, Arañaconsiders attack clusters as

malicious if they meet the criteria NR ≥ 5, 000 ∨ NU ≥ 5, 000; however, these

thresholds were decreased to 1,000 at U1 because it experienced lower volume

attacks. Thus Arañamay have originally missed many attacks at U2 due to this

higher threshold. Future work can investigate how the performance of detec-

tion mechanisms generalizes or changes based on the organization concerned.

Optimizing evasion attacks. It would be interesting to know, given an attack

cluster, what are the minimum resources needed to evade detection for a given

approach. While this is a complex (and probably NP-hard) optimization prob-

lem to tackle, we perform some initial exploration using a simple heuristic to

optimize resources θip, θr, and θt needed by attackers. As mentioned earlier, op-

timizing the required resources requires the attacker to have knowledge about

other login requests, which we assume the attacker has for this experiment. We

find that attack cluster #11 (which is not evadable with θt = 2 hours) can be

evaded with θt = 5 hours. For cluster #29 which also did not evade with θt = 5,

θip = 45 is required for a successful evasion. Attack clusters not exhibiting bursty

behavior do not need any extra time. More details about our heuristics and ex-

periments are in Appendix C.4.

Preventing evasion attacks. We explore two adaptations to the DAS detection

mechanisms that may improve their robustness to evasion attacks. Our initial

attempt was to randomly sample L sets to populate the comparison set instead

of choosing the ones with the highest DAS scores. Such randomization would

make it more difficult for an attacker to control whether their chaff L sets get

134


into the comparison set and thus limit their ability to distract from an attack.

While this performed well on the training period at U1, it performed poorly on

the test period with F1 scores of 38% at U1 and 43% at U2.

In our next attempt, instead of scoring an L set with DAS against other

L sets, we take the median value for each feature in the four sub-detector feature

sets, and we flag an L set as malicious if it has a more suspicious feature value

than the median for all features in a sub-detector. Because of the simplicity of

this approach, it would be easy to deploy in web authentication systems and

will be efficient to compute. We found this approach performed well at U1 with

an F1 Score of 87% for the test period. However, it performed poorly at U2 with

an F1 Score of 31%. Adjusting the thresholds to a higher percentile (rather than

the median) improved performance slightly, and we found that the best perfor-

mance at U2 occurred at the 34th percentile with an F1 Score of 54% over the

test period. Because of the varied performance, this approach may not be gen-

eralizable across different services. Further work on detection methods should

investigate how login servers can combine different approaches that reduce the

ease of evasion while preserving the ability to detect attacks.

4.8 Conclusion

We conduct a comprehensive investigation into the efficacy and resilience of

timely malicious login detection approaches, evaluating existing volumetric

and machine learning-based defenses on the Gossamer dataset comprised of 34

million real-world login requests. Our findings indicate that current defenses

underperform against the intricate attack behaviors observed in real-world sce-

135


narios. We then explore how to develop approaches that are better at captur-

ing such complex attacks without raising too many false alerts by leveraging

an anomaly event scoring mechanism. Lastly, we underscore the importance

of considering evasion attackers when deploying such defenses and show how

current detection mechanisms can be broken by an evasive attacker.

136


CHAPTER 5

CONCLUSION

In this work, we designed and built a framework to collect information on sub-

mitted measurements, and then we used the resulting measurements to detect

password guessing attacks. Our goal in starting these projects was to improve

password-based authentication through observations on real data and develop

new approaches for detecting guessing attacks, and we feel that we accom-

plished this goal.

5.1 Overarching Observations

Through this work we make the following observations.

Passwords provide valuable information for detecting attacks. We showed

in Gossamer [29] how password-derived information can be logged in a secure

manner. Different types of guessing attacks can then be characterized using dif-

ferent combinations of password-derived measurements, and such information

can be helpful when designing attack detectors or even for an analyst when

manually reviewing the data. Existing login systems could gather such infor-

mation using Gossamer [29]; or if that is not feasible due to the implementation

effort or system requirements, a credential checking tool such as Might I Get

Pwned (MIGP) [94] can be used to check whether credentials and their variants

have appeared in a breach.

Clustering is helpful for detecting attack campaigns. Through our work on

Araña [64] and in investigating timely detection approaches [65], we saw sev-


eral distributed attacks containing dozens or hundreds of IP addresses that may

not have been easily identified as the same attack without a clustering approach.

More work is needed to develop detection methods that are resistent to

evasion attacks. We showed in our final project [65] that attackers can use dif-

ferent approaches to attempt to conceal their attacks, and these approaches can

be very effective in evading detection. Login system developers should consider

an evasive attacker when designing attack detection mechanisms.

5.2 The Authentication Ecosystem

In recent years, outsourcing authentication to third parties such as Okta [6] has

become more and more common. This could help reduce compromises, since

these larger, authentication-focused companies have more resources to build

more secure systems and better attack countermeasures, although a compro-

mise at such an authentication company could be much more damaging.

Passkeys and biometrics are also quickly becoming the recommended

method of authentication [33, 55], which is another positive for the authenti-

cation space because they are less prone to guessing attacks. However, pass-

words will probably remain the dominant authentication mechanism for quite

some time both to support backwards compatibility (e.g., rolling out passkeys

for most users but allowing alumni users to log into their school accounts with

their passwords) and because many companies may not have the resources to

roll out a new authentication mechanism.

Our work underlines some of the weaknesses of passwords and confirms

138


some findings in prior research — such as that password typos are very com-

mon and credential stuffing is highly prevalent and damaging. Our findings

support the use of password-derived information to detect guessing attacks.

Cloudflare has already started allowing companies to use tools such as

Might I Get Pwned [94, 116] on their login traffic to detect credential stuffing

attacks. Other companies could also consider doing so; adding such checks

could be done easily without deploying a full password measurement infras-

tructure. Companies should also consider proactive breach alerting during lo-

gins and password changes to prevent credential stuffing attacks before they

happen. In particular, Might I Get Pwned checks whether a password or close

variant of the password has appeared in a breach, and our measurements at

Cloudflare showed that it flags 20% more vulnerable passwords than an exact

match breach alerting service [94].

As more and more companies roll out guessing attack countermeasures,

though, we can expect to see more advanced attacks such as credential tweaking

(which we did not observe with Gossamer, perhaps because automated tools

have not advanced to that level yet). We can also expect attackers to work

harder to conceal their attacks, and as alternative authentication mechanisms

like passkeys become more ubiquitous, attackers may move more towards tar-

geting account recovery methods.

As the authentication ecosystem moves and changes, we hope that our work

can help pave the way for more measurement studies on passwords and other

authentication mechanisms.

139


APPENDIX A

GOSSAMER: SECURELY MEASURING PASSWORD-BASED LOGINS

A.0.1 Measurements Taken

We show all data stored in the ephemeral and persistent databases in

Figure A.1. Note that the raw password is only stored in encrypted format for

24 hours in the ephemeral database. The symbol × indicates that a field was

stored in plaintext, and
⊗

indicates it was encrypted before storing in the in-

dicated level of storage.

A.0.2 Filtering Out Attacks

We excluded high volume attacks before reporting summary statistics to cap-

ture an accurate description of regular user behavior. The remaining harder-

to-detect adversarial behavior appears to be an insignificant fraction of logins.

We were given access to compromise account reports from the instrumentation

time period (as described in Section 2.3. We found that an average of 190 com-

promised accounts were reported every month. Overall, less than 1% of the

total user population in the monitoring period were compromised, less than

1% of all logins were to accounts that were compromised at some point in the

measurement period, and 2% of IPs were associated with those requests.

The compromise report database logged compromise usernames, but not

specific requests or IP addresses corresponding to the user’s compromise; thus

it was difficult to exclude all attacks without filtering out an even larger por-


Field Eph. Pers.

Basic statistics
username

⊗ ⊗
password

⊗
IP address × ×
receipt timestamp at the login server and at Gossamer × ×
receipt timestamp at Gossamer × ×
HTTP headers × ×
result (success or failure), result code × ×
zxcvbn score (bucketized to 0, 1) ×
password was malformed ×
Credential stuffing measurements
username appeared in the breach data ×
password appeared in the breach data ×
username-password pair appeared in the breach data ×
breach source of the username, password, or pair ×
Credential spraying & dictionary-based guessing measurements
password appeared in

– top {10, 100, 1,000} most common breached pws ×
– top {2,000, 5,000} hashcat-generated pws ×
– top {2,000, 5,000} RockYou pws ×

was password frequently submitted today? ×
was username frequently submitted today? ×
Credential tweaking measurements
PPSM [93] strength of password ×
guess rank due to credential tweaking attack [93] ×
edit dist. ≤ 2 of pw from

– other submissions for same username ×
– other submissions for same IP ×

Figure A.1: Measurements we log in ephemeral (Eph.) and persistent (Pers.)
storage.

tion of benign behavior. Excluding all login requests to usernames that were

compromised at any point in our instrumentation period did not significantly

change the summary statistics we reported. However, filtering out the high-

volume attacks, as we did in the paper, did change some statistics.

To concretize this, one statistic that we believe is more sensitive to whether

attack traffic is included in measurements is the fraction of requests using a

breached password. With the high volume attacks included in the calculation,

this fraction was 4.73%. This percentage decreased to 2.36% when we filtered

141


out high volume attacks, but only further decreased to 2.34% when we filtered

out all requests associated with a compromised username.

Characterizing benign behavior in the presence of attacks is a fundamen-

tal challenge, given the lack of ground truth. Future work on improving the

identification of adversial behavior can give further insights and confirm our

characterization of benign user behavior.

A.0.3 Login Statistics

We show some additional statistics about the login requests recorded by

Gossamer. Figure A.3 shows some additional statistics we reported earlier in

Figure 2.4. Figure A.2 shows the distribution of operation systems as parsed

from the user agents of the requests at both universities; Figure A.4 shows the

top 10 most common user agents we saw at both schools.

OS U1 U2

Windows 36.13% 23.96%
Mac OS X 29.47% 19.58%
iOS 26.20% 43.47%
Android 5.04% 3.18%
Linux 2.43% 0.31%
Other 0.73% 8.96%

Figure A.2: The distribution of operating systems (OS) as detected in the user-
agent of all the requests (after removing requests containing empty user-agent
string at U2).

142


U1 U2

Submitted password statistics
% req. w/ password in breach† 2.71% 0.10%
% req. w/ username in breach† 5.31% 3.08%
% req. w/ user-pwd pair in breach† 0.07% 0.01%
% failed req.

– containing a typo 29.67% 12.04%
– containing a typo (with edit dist msmt) 62.39% 58.37%
– from mobile device containing a typo 38.63% 38.36%
– from mobile device containing a typo (with edit dist msmt) 72.69% 81.87%

% pwds tweaked 0.92% 0.34%
% pwds w/ zxcvbn score of 0 0.06% 0.40%
% pwds in top 5k hashcat < 0.01% 0.06%
% pwds in top 5k rockyou 0.02% 0.17%
% pwds in top 1k breach comp 0.01% 0.10%

Session Statistics (with a 360s threshold)
Avg. session size 2.25 2.21
99th percentile session size 10 6
% abandoned sessions 5.47% 1.96%
% sessions with at least two attempts 22.24% 38.22%
% mobile sessions 41.32% 35.45%
% sessions with a typo 2.64% 0.85%
% mobile sessions with a typo 0.01% 0.20%
Avg. num sessions per user per day 1.74 9.23

User Statistics
# of unique usernames seen 196,424 309,801
# of valid users 177,286 169,774
# of active users 130,695 110,476
% valid users w/ weak passwords 0.03% 0.06%
% valid users w/ username in breach† 5.79% 3.27%
% valid users w/ passwords in breach† 2.92% 9.34%
% valid users w/ user-pw pair in breach† 0.01% 0.15%
% valid users w/ tweaked password 1.22% 0.66%
% valid users w/ no failed attempts 33.21% 58.02%
% valid users who may be using pw managers 24.76% 27.34%
Avg. fails before a success 1.18 1.19
Avg. devices per user per day 1.51 1.91
Avg. devices per user 14.51 14.97
Avg. IPs per user 8.70 10.56
Avg. successful IPs per user 10.65 17.63
Avg. user agents per user 6.15 3.99
Avg. unique passwords per user 1.96 9.59
Avg. attempts per unique IP per user 5.86 5.21

Login Statistics
Avg. Login requests per day 49,302 246,274
Avg. # of submitted usernames per day 24,822 61,798
% of requests succeeded 94.99% 92.35%
% of requests with null user agent 0.48% 34.31%
# of requests per day per user

– Average 1.99 2.05
– median 1 2
– 99th-percentile 7 12

% of requests from mobile device 31.00% 35.57%
% of failed requests from mobile device 24.90% 11.96%

Figure A.3: Summary statistics of login requests recorded by Gossamer at U1
(from Dec. ’20 to July ’21) and U2 (from Dec ’20 to Mar ’21).

143


User agent % req.

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/87.0.4280.88 Safari/537.36

1.54%

Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWe-
bKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Mobile/15E148 Sa-
fari/604.1

0.99%

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

0.36%

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Fire-
fox/78.0

0.23%

Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 (KHTML, like
Gecko) PingdomTMS/0.8.5 Safari/534.34

0.22%

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

0.21%

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15
(KHTML, like Gecko) Version/14.0.1 Safari/605.1.15

0.15%

Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

0.14%

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Fire-
fox/84.0

0.13%

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66

0.13%

User agent % req.

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/87.0.4280.88 Safari/537.36

3.15%

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/87.0.4280.141 Safari/537.36

1.56%

Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWe-
bKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Mobile/15E148 Sa-
fari/604.1

1.54%

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/88.0.4324.150 Safari/537.36

1.32%

Mozilla/5.0 (iPhone; CPU iPhone OS 14_3 like Mac OS X) AppleWe-
bKit/605.1.15 (KHTML, like Gecko) Version/14.0.2 Mobile/15E148 Sa-
fari/604.1

1.16%

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/88.0.4324.104 Safari/537.36

1.12%

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/86.0.4240.198 Safari/537.36

0.88%

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/88.0.4324.182 Safari/537.36

0.76%

Mozilla/5.0 (iPhone; CPU iPhone OS 14_4 like Mac OS X) AppleWe-
bKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Sa-
fari/604.1

0.70%

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

0.62%

Figure A.4: Top 10 most common user agents at U1 (top) and at U2 (bottom)

144


APPENDIX B

ARAÑA: DISCOVERING AND CHARACTERIZING PASSWORD

GUESSING ATTACKS IN PRACTICE

B.1 Reasons for Compromise Reports

As described in Section 3.3, we received the logs of compromised account re-

ports from both universities. We show the breakdown of the reasons for com-

promise as recorded at U1 in Figure B.1. We that found 69% of accounts were

reported as compromised due to large-scale automated attacks (referred to as

“Bulk credential testing” at U1). For compromised accounts reported within

one hour of the time of attack, 46% were classified as “Bulk credential testing”,

17% as self-reported password compromise, and the remaining 37% were split

between the other 10 reasons appearing in the dataset. For the six compromised

accounts that took longer than 300 hours (12.5 days) to report, all were reported

as “Simultaneous use from different locales.”

At U2, the recorded reasons for compromise are very coarse: 99% are re-

ported simply as “compromised accounts”. Three accounts were disabled based

on requests from the human resources department (former employee), and two

were disabled as the user is “deceased”. No username is reported more than

once for different reasons.


Reason Count Percent

Bulk credential testing 1255 69.3%
Simultaneous use from different locales 438 24.2%
Self-reported password compromise 45 2.5%
Blocked for spamming via Office 365 23 1.3%
Password misused 20 1.1%
Spamming via Office 365 / Gmail 11 0.6%
Other (incl. reports from Duo fraud, third-party, etc.) 20 1.1%

Figure B.1: The reasons for compromise as noted at U1 and the number of in-
stances of such compromise. Users reported multiple times are counted as dis-
tinct instances.

B.2 Using DAS to Detect Attacks

The DAS algorithm [61] has been used successfully in other contexts, such as de-

tecting spearphishing attacks. Applied to our context, the DAS algorithm takes

a set L1,L2, . . . and orders the sets as follows. For some configured subset of

numerical features, we first associate an ordering operator over possible feature

values (e.g., higher failure rate FF is more suspicious). Then we associate to Li

a score that is equal to the number of other sets L j (i , j) such that Li’s features

are all strictly more suspicious than L j’s features. We can then obtain a partial

ordering over the set of L sets based on these scores.

In finding a configuration of features for DAS, we optimized for the fraction

of the top 50L sets as ordered by DAS that were associated with a compromised

username, as reported by the security engineers. In doing so, we found that

using two simple volumetric features — the number of requests submitted (NR)

and the number of unique users contacted (NU) — yielded one of the highest

fractions ofL sets associated with a compromised username. In fact, computing

DAS with the features NU and NR yielded 41 out of the top 50 L sets and 87 out

of the top 100 L sets associated with a compromised username. Adding as a

146


feature the number of consecutive days an IP has been active (CD) increased

it to 50 out of the top 50 and 98 out of the top 100. Thus we hypothesize that

these three features are most correlated with the current mechanisms used by

the university IT offices for detecting attacks that cause compromise reports.

DAS with NR, NU, and CD may only be useful for discovering attacks al-

ready being caught by existing countermeasures, so we explore extending to

further features that the IT offices may not be considering. For example, features

such as the fraction of passwords in a breach (FPIB), the fraction of tweaked

passwords (FTP), and average unique passwords per user (AUP) may all help in

detecting attacks. We tried running DAS on a configuration with a much richer

feature set: NR, NU, FPIB, FCIB, FTP, AUP, and FF. Despite manually flagging

all 50 as probable attacks, only 19 were flagged as attacks in the compromise

database. This indicates that the IT offices may be missing many attacks that

could be found with a richer combination of features, and it also suggests that

the compromise reports may not be a good ground truth.

Our experiments with DAS show that it has promise for discovering attacks.

While naively using it with just volumetric features seems to miss more subtle

attack behaviors, when used with richer Gossamer-enabled features DAS can

even discover (successful) attacks that are not being caught by existing counter-

measures. This suggests that future deployments may want to consider using

DAS-style approaches for remote guessing attack detection, similar to its origi-

nal use with spearphishing. However, from the perspective of our goal of better

characterizing attack campaigns, DAS has various limitations. In particular, it

cannot group IP addresses into attack campaigns, and distributed attacks that

use multiple IPs will be treated as separate attacks.

147


Clustering algorithm Silhouette score ↑
U1 U2

PCA and K-Means++ −0.13 −0.09
DBSCAN −0.39 −0.30
HDBSCAN −0.12 −0.08

Agglomerative +0.19 +0.17

Figure B.2: Silhouette scores of different clustering models.

Thus we consider clustering as an unsupervised approach to grouping sus-

picious IP addresses into attack campaigns.

B.3 Additional Clustering Results

Clustering Quality. Before settling on agglomerative clustering for group-

ing L sets into potential campaigns, we tried a number of other clustering

techniques. We considered the K-means++, DBSCAN, HDBSCAN, and ag-

glomerative clustering techniques in this study on the same set of features

from Figure 3.2. Since K-means++ cannot work with a custom similarity model,

we applied principal component analysis (PCA) on all feature values of L sets

and projected them in a two dimensional space to run K-means++ clustering.

We also tried projecting to more than two dimensions but did not observe any

noticeable effect on the clustering quality. To judge the quality of the clustering

techniques, we used the silhouette score [23], which computes the normalized

difference between the average inter-class and intra-class distances. We did a

grid search over all hyperparameters for each of the clustering techniques and

reported the best silhouette score in Figure B.2.

All clustering methods except agglomerative clustering received a negative

silhouette score, signifying poor quality clusters. We hypothesize that the poor

148


performance is because our similarity model is not a metric similar to Euclidean

distance, which is essential for K-means++ to produce meaningful clusters. DB-

SCAN performed well when the clusters were of the same density — when

L sets belonging to the same cluster are uniformly distributed inside a cluster.

However, in our use case, it is common to have attack campaigns of different

densities. While HDBSCAN [84] is specifically designed to handle non-uniform

clusters and produced relatively better clusters for some attack campaigns, it

still received a lower silhouette score than agglomerative clustering.

Sensitivity of clustering threshold. To analyze how sensitive our clustering

results are for different thresholds, we run two experiments by changing (a)

the filtering threshold which is applied to L sets and (b) the distance threshold

that dictates whether two clusters would be merged or not. Our findings show

that the clustering results are sensitive to changes in the distance threshold but

remain relatively the same for changes to filtering thresholds.

For the first experiment, we change the filtering threshold to a less aggressive

70th percentile instead of the 80th percentile as we did originally, while keeping

the distance threshold the same as before at 0.51. After manual analysis, we find

no noticeable change in the likely malicious clusters in comparison to Araña’s

clustering results. This indicates that clustering results may not be sensitive to

changes in the filtering thresholds.

In the second experiment, in addition to the filtering thresholds, we also

change the distance threshold by applying a knee locator method on L sets

filtered by the 70th percentile. This gives a distance threshold of 0.41, which is

lower than the original distance threshold of 0.51. Although the majority of the

resulting clusters remain the same, we observe a few noticeable changes in the

149


clusters as we describe below.

We sampled the top 20 (16 untargeted and four targeted) likely malicious

clusters using the same sampling criteria used in Araña. The first 16 clusters

had NR ≥ 5, 000 ∨ NU ≥ 5, 000, exhibiting high volume, untargeted behavior.

Out of these 16, we discover that nine clusters were exactly the same as the ones

found in Araña’s clustering results. However, we identify five new untargeted

clusters which we did not see in Araña’s clustering result. Our manual analy-

sis confirms one of them to be a malicious attack campaign, one to be clearly

benign behavior, and the other three to contain a mix of benign and malicious

behavior. Furthermore, we observe that two attack campaigns which were pre-

viously split into five clusters by Araña, are more accurately represented by two

distinct clusters with this new distance threshold. The remaining four clusters

with AUP ≥ 25 exhibited targeted behavior and were identical to those found

by Araña.

Lastly, we notice that, with the new relaxed distance threshold, two targeted

attack campaigns detected by Araña are completely missed. This is because

lowering the distance threshold introduced spurious L sets that did not exhibit

the same targeted behavior as the other malicious L sets showing clear targeted

behavior. Thus, the AUP value of the mixed cluster was reduced below our

selection threshold of 25.

In conclusion, we find that Araña’s clustering results are particularly sen-

sitve to changes to the distance thresholds. While lowering the distance thesh-

old can allow for the discovery of more potential attack campaigns, it also in-

creases the chance of mixing benign and malicious L sets in the same cluster.

To prioritize precision and a low false positive rate, we choose the higher per-

150


centile filtering thresholds which result in a high distance threshold, minimizing

the chance of benign and malicious L sets appearing in the same cluster.

Clustering without password-based features. To understand the importance

of recording users’ password-based information, we reran Araña at U2 with-

out the six password-based features; We found that without these features, we

were still able detect 92% (1,570) of 1,709 malicious L sets and 80% (16) of 20

attack clusters discovered by Araña as shown in Figure C.5. However, it missed

one multi-day credential stuffing attack from a single IP address (cluster #15)

and three targeted attacks (clusters #22, #23, #29). We hypothesize that since

all L sets in attack cluster #15 have similar values across five password guess-

ability features and all L sets in attack clusters #22, #23, and #29 have similar

feature values for AUP, Araña placed them in the same respective attack clusters

predicting that they originate from the same attackers even if they were spread

across multiple days.

Additionally, clustering without password-based features flagged 46 new

L sets. We manually analyzed these 46 L sets and suspect that at least four

L sets are malformed clients, since they submitted the same password against

a single user. Moreover, we found that L sets within the same cluster have

significantly different values for password-based features. For example, eight

L sets formed a new cluster in which one IP address had an FPIB value of 0, but

the other seven L sets had FPIB > 0.55 with an average FPIB of 0.66. For the

remaining 34 L sets, it is difficult for us to judge them as either fully malicious

or benign, since the password-based features varied within a cluster. Thus we

believe password-based features are important for grouping attack traffic into

attack campaigns, as well as manually judging a cluster as malicious or benign.

151


# L sets # L sets flagged by FCA
Attack # [29] flagged [29] Clusters Flagged Not Flagged

#1 (at U1) 7 1,6 6 1
#2 (at U1) 1 2 1 0
#3 (at U2) 12 10 12 0

Figure B.3: Confusion matrices of the number of L sets corresponding to the
three attack campaigns found in prior work [29] using their manual approach
versus our FCA approach.

B.4 Comparing Araña to Prior Work [29]

In Section 3.6.1, we discuss the three attack campaigns manually identified in

prior work [29]. Here we compare in more detail how their attack identification

compares to the corresponding clusters found using Araña. In Figure B.3, we

present three confusion matrices — one for each attack. These show how many

L sets were labeled as part of the attack or not for both methods. For two attacks,

agreement was perfect (unsurprising for attack #2 which only emanated from

a single IP on a single day), and for attack #1 Araña missed just one L set. We

believe that this false negative arose because that L set set fell on the next day

(after midnight) compared to the priorL sets, suggesting that there can be some

noise introduced by our 24-hour cut-offs for login sets. Even so, an analyst using

Araña could, in this case, easily detect the false negative manually since the IP

address is the same.

B.5 Geographical Source of Attacks

We use the ISP of a request to determine the country or countries of origin. Since

a campaign often contains multiple IPs, there may be more than one country of

origin. Figure B.4 shows the most common countries from which attacks orig-

152


inated. At U1 and U2, the vast majority of malicious IP addresses originated

from within the United States.

Country # of requests

United States 65,474
Germany 6,170
Morocco 3,467
Canada 152
Brazil 112

Country # of requests

United States 234,515
Russia 19,003
Ireland 13,130
Netherlands 5,678
Canada 1,571

Figure B.4: The five most common countries at U1 (left) and U2 (right) by num-
ber of attack requests.

153


APPENDIX C

THE PRACTICALITY AND ROBUSTNESS OF TIMELY MALICIOUS

LOGIN DETECTION

C.1 Finding k for Volumetric Flagging

To find a k that performs well at both universities, we investigate different val-

ues of k and their F1 scores over the training period of Gossamer logs. As shown

in Figure C.1, we observe that after k = 500, the F1 score at U1 becomes > 80

and does not change much till k = 1, 500. At U2, for k = 1, 000, the F1 score

is the highest at 76%. Increasing k further decreases the F1 score. We also ex-

perimented with k = 10, 000 (not shown in Figure C.1) which performs poorly

at both universities. Therefore, we chose k = 1, 000 as a relaxed threshold for

IP-based volumetric flagging, as it performs well at both universities.

0 500 1,000 1,500
0

20

40

60

80

100

k

F1
sc

or
e

(%
)

U1
U2

Figure C.1: The F1 scores for different values of k over the training period of
Gossamer logs at both universities.


Train Test
Request level L set level Request level L set level
P R F P R F P R F P R F

Volumetric
k = 10 9 40 23 5 81 9 45 96 62 54 76 63
k = 1, 000 71 96 82 86 6 12 100 13 23 100 0 0

Anomaly-based
Comparison Set 71 97 82 97 37 53 84 24 38 19 13 15
Sliding Window 71 99 83 64 71 67 85 82 84 16 47 23
Per-day Clustering 99 100 99 73 100 84 94 100 97 79 100 88

(a) U1

Train Test
Request level L set level Request level L set level
P R F P R F P R F P R F

Volumetric
k = 10 14 81 24 35 31 33 51 84 63 82 46 59
k = 1, 000 71 65 68 86 < 1 1 100 10 19 100 < 1 < 1

ML-based
StopGuessing 43 65 52 20 61 31 11 5 7 17 19 18
WhoAreYou 33 62 43 27 95 42 39 9 15 6 13 8

Anomaly-based
Comparison Set 43 65 52 68 33 45 23 61 37 32 43 47
Sliding Window 61 98 75 33 62 43 22 99 36 27 95 42
Per-day Clustering 83 81 82 72 45 55 33 81 47 52 98 68

(b) U2

Figure C.2: Evaluation of the volumetric, ML-based, and DAS-based ap-
proaches on the Gossamer train and holdout datasets using attacks reported
in [64] as ground truth. We report the Precision, Recall, and F1 scores (in per-
centage (%)) broken down both by individual requests and L sets. StopGuess-
ing and WhoAreYou do not flag any requests as malicious at U1.

C.2 StopGuessing Implementation Details

StopGuessing [102] incorporates five features of traditional volumetric IP-based

blocking. Some of these techniques require fine-grained information about

submitted passwords or browser authentication cookies which are privacy-

sensitive and can endanger user security [29]. StopGuessing sidestepped this

crucial problem by relying on a simulator, available at [18] to generate synthetic

155


` Level Name Total # Unseen

4 IP address 136 > 99%
3 ISP 77 93%
2 Autonomous System (AS) 45 93%
1 Country 3 33%

Figure C.3: % of unseen IP addresses, ISPs, AS numbers, and countries at each
level ` observed by U2 during the holdout period.

logins of benign users and attackers.

Developed by Bohuk et al. [29], Gossamer is a password measurement

framework that can collect password-related information from real world user

logins without endangering user privacy and security. Gossamer keeps the

password encrypted in an ephemeral database which rotates the encryption key

at the end of each day and stores certain safe-to-log password-derived measure-

ments to a persistent database. It does not log the authentication cookie to pre-

serve user security. We estimate the five features used in StopGuessing with

the password-related information recorded by Gossamer and without using au-

thentication cookies as follows.

(1) Penalize frequently-guessed passwords. A strong signal for capturing

malicious IP addresses is to check whether a request contains a frequently

guessed/popular password on a given day. Schechter et al. [102], estimated

this by storing a data structure called a binomial ladder frequency filter that can

detect if the submitted password is one of the most frequently guessed or not.

Similarly, Gossamer uses a counting bloom filter in the ephemeral database to

store if the submitted password is within the top 1,000 frequently submitted

password in the last 24 hours, and it flushes a binary 0/1 value to persistent

storage at the end of the 24 hours for each login request. We use the fraction

of passwords having a zero in this field to penalize the IP addresses that send

156


frequently guessed passwords. This feature is calculated twice—once for login

requests with a valid username and again for those with invalid usernames.

(2) Defend weaker accounts more. Accounts with weak passwords need more

protection. Schechter et al. [102] mention that they estimate this using the most

frequently guessed incorrect passwords observed in previous login attempts. We use

the binary Zxcvbn [123] score of the submitted password to identify accounts

with weak passwords that need more protection, since Zxcvbn is a popular mea-

sure of password strength recorded by Gossamer.

(3) Penalize typos less than other failures. Schechter et al. [102] simulate users

committing typos by randomly introducing error (i.e., making small changes to

the password) with a probability of one in 50. We instead use the Gossamer

measurement that records if a failed login attempt has edit distance ≤ 2 to the

correct password (if there is a successful login to that username within the day).

(4) Ignore repeated account/password pairs. Gossamer logs do not contain

hashed or encrypted passwords in the persistent storage. Therefore, identify-

ing and removing requests with repeated username-password pairs is difficult.

Nevertheless, we use information about the username and password derived

from the breach compilation dataset used by Gossamer. For example, Gossamer

records if the username and password appear in a breach dataset; it also records

the edit distance of the submitted password from any passwords associated

with the same username in the breach dataset. Although this information is

not available for usernames not appearing in the breach dataset, we can use

this edit distance to identify repeated submitted username-password pairs for

157


usernames in the breach.

(5) Account for prior successful logins. To identify if the user is logging in

from a client where they have prior successful logins, Schechter et al. [102]

assign a device cookie if the client is logging into the account for the first time;

subsequent logins would be indicated with that cookie to the login server. How-

ever, Gossamer does not log this cookie field from the HTTP header to protect

user privacy. Therefore, to identify if the user is logging in from the same client,

we consider the tuple of the normalized user agent and username.

Lastly, StopGuessing offsets the calculated score for a carefully selected sub-

set of IP addresses to accommodate the observation that many users log in from

IP addresses that are proxies or NAT. Since Gossamer stores the result of the

login request, we can estimate this by considering the number of user accounts

to which the IP address has successfully logged in.

C.3 WhoAreYou Implementation Details

We now detail how we calculated the reputation scores, and we describe our ab-

lation study to understand why it failed in yielding a reliable reputation scoring

for IP addresses.

C.3.1 Estimating Reputation Score

Reputation score refers to the chances of an IP address x sending malicious login

requests p(x|mal). We use the 18 attack clusters reported in [64] on the Gossamer

158


logs that occurred during the training period for estimating reputation scores.

For example, if during the training period the IP address x sends nx malicious

logins, then p(x|mal) = nx/N, where N is the number of malicious logins received

during the training period.

C.3.2 Ablation Study

Smoothing technique to solve the sparsity problem. Real-world login data

is sparse, and thus the p(x|mal) calculation becomes difficult for unseen IP ad-

dresses that appear only during the testing period. Freeman et al. [52] suggested

using a smoothing technique to solve this problem.

To do this, they first consider a hierarchical level for IP addresses. At the

lowest level, the IP addresses itself (` = 4), ISP (` = 3), autonomous system

numbers (ASN) (` = 2), country (` = 1) of the IP addresses, and finally world

(the highest level) (` = 0). This hierarchy allows us to calculate the reputation

score of an IP address x at different levels p(x`|mal) and then sum them up to get

the final reputation score; that is, p(x|mal) =
∑`=4
`=1 p(x`|mal).

Why doesn’t the smoothing technique work? For calculating p(x|mal) of a

given x, we also use the same smoothing technique described above. However,

this technique does not yield a reliable reputation score. This is because the

sparsity problem is more intense for Gossamer logs as shown in Figure C.3.

For example, for U2, at ` = 4 (IP-level) 135 out of 136 IP addresses with ma-

licious logins in the testing period were unseen in the training period. Moving

to a higher level does not increase the percentage of seen IP addresses either. At

159


ID / Univ. Type L sets NR θip θr θt

10 / U2 B 12 169,573 - - 5
11 / U2 B 663 33,304 - - 2
12 / U2 C 5 4,521 5 91 -
13 / U2 C 3 12,240 5 10,097 -
14 / U2 C, B 843 7,104 30 1883 2
15 / U2 C 1 5,287 15 2494 -
18 / U2 C 7 6,290 10 2487 -
19 / U2 C 4 3,938 5 3956 -
20 / U2 C 3 4,463 5 447 -
22 / U2 T, B 74 1878 15 732 1
23 / U2 T, B 52 1,250 25 993 1
24 / U2 T 4 468 15 632 -
25 / U2 T 4 101 5 78 -
26 / U2 T 3 80 5 101 -
27 / U2 T 1 27 5 125 -
28 / U2 T 1 27 5 632 -
29 / U2 T 1 25 40 94 -

Figure C.4: Results of the evasion attacks when redistributing the login requests
across θip IP addresses involved in the corresponding attack cluster, adding θr

additional login requests, and lengthening the attack by θt. We show the results
over the 19 attack clusters detected by the Per-day Clustering approach at U2.
For this experiment we assume the attacker has access to all login requests as
observed by the login server. A “-” symbol means the attacker does not require
the corresponding resource for evasion.

` = 3, only five out of 77 ISPs are seen during the training period; at ` = 2, only

three out of 45 are. At ` = 1 (the country level) only one out of three are seen,

but scoring at this level is also less precise in general.

C.4 Additional Evasion Results

We now explain how an attacker can optimize their spent resources by conser-

vatively assuming the attacker has knowledge about all other login requests. We

develop a simple heuristic and empirically measure the resources required by

the attacker. We describe our heuristic below and show the results in Figure C.4.

160


At first, we give the attacker no additional IP addresses (θip = 0). Then, we

perform a binary search over θr with an upper limit of 106 to attempt evasion

of the targeted and credential stuffing behavior exhibited by the attack cluster.

If successful, then we perform another binary search over the additional time θt

with an upper limit of 24 hours to evade the bursty behavior. Finally, if we are

successful in evading all malicious behavior exhibited by the attack cluster, we

stop. Otherwise, we increase θip by five and perform the two binary searches

over θr and θt again until the evasion succeeds.

As shown in Figure C.4, we find two attack clusters (#10 and #29) which

remain unevadable with our experimental setting in Section 4.6 can be evaded

with parameters θt = 5 and θip = 40. We also observe that spending θt is not

necessary for evasion except for five clusters (#10, #11, #14, #22, #23) that exhibit

bursty behavior.

C.5 Attack Clusters Detected by Araña [64]

Using a novel clustering technique along with manual analysis, Arañadetected

nine attack clusters at U1 and 20 at U2. We use these attack clusters as the

ground truth of attacks for our analysis. Details of the attack clusters including

the number of requests in each cluster, number of IPs involved, number of days

the attack was active, how fast the requests were submitted, and information

about the submitted password are shown in Figure C.5. This table is replicated

from [64].

161


ID
/ Univ.

Dura
tio

n

# IP
s

# ISP
s

NU NR AUP
FPIB

FCIB
FUIB

FTP
FVU

FF

1 / U1 5 h 3 3 10,424 18,093 1.21 0.56 0.41 0.69 0.06 0.94 1.00
2 / U1 4 h 1 1 15,209 17,827 1.10 0.56 0.22 0.42 0.06 0.97 1.00

3 / U1 14 h 555 4 12,659 15,117 1.00 0.14 0.02 0.58 0.46 0.78 1.00

4 / U1 11 h 77 2 12,318 14,603 1.00 0.14 0.02 0.58 0.45 0.78 1.00

5 / U1 41 m 1 1 4,480 4,593 1.03 0.63 0.21 0.44 0.08 0.97 1.00

6 / U1 4 h 3 1 2,101 2,683 1.21 0.66 0.48 0.69 0.04 0.98 1.00
7 / U1 9 h 1 1 1,347 1,481 1.00 0.32 0.10 0.22 0.05 0.99 1.00

8 / U1 19 h 85 13 894 1,246 1.00 0.33 0.08 0.92 0.84 0.92 1.00

9 / U1 37m 30 7 219 241 1.00 0.92 0.78 0.96 0.10 0.93 1.00

10/ U2 11 h 12 1 76,321 169,573 1.01 0.03 0.00 0.00 0.00 0.00 1.00

11 / U2 10 m 663 10 27,488 33,304 1.00 0.79 0.00 0.00 0.00 0.01 1.00

12 / U2 15d 1 1 8,192 13,289 1.02 0.61 0.07 0.54 0.02 0.63 0.93
13 / U2 3 d 1 1 7,939 12,240 1.30 0.66 0.10 0.75 0.00 0.35 0.99

14 / U2 23h 843 13 7,771 10,535 1.01 0.66 0.00 0.71 0.00 0.48 0.97

15 / U2 5 d 1 1 4,662 9,714 1.07 0.84 0.00 0.91 0.00 0.40 0.99
16 / U2 2 d 2 1 4,934 7,323 1.00 0.58 0.00 0.23 0.00 0.81 0.84
17 / U2 2d 3 2 2,513 6,290 1.55 0.40 0.18 0.91 0.70 0.37 0.99

18 / U2 10 d 1 1 1,902 5,434 1.37 0.55 0.27 0.82 0.41 0.39 0.99

19 / U2 5 d 1 1 3,584 5,261 1.04 0.76 0.80 0.98 0.18 0.42 1.00
20 / U2 2 d 3 2 1,756 5,199 1.56 0.50 0.22 0.92 0.72 0.35 0.98

21 / U2 23 h 1 1 5,076 5,103 1.00 0.59 0.00 0.26 0.00 0.80 0.85

22 / U2 13 h 73 43 75 1,878 25.04 0.88 0.01 0.31 0.29 0.96 1.00

23 / U2 21 h 52 38 52 1,296 24.92 0.89 0.01 0.34 0.32 0.88 1.00

24 / U2 12 h 4 4 4 101 25.25 0.86 0.00 0.55 0.50 0.51 0.99

25 / U2 16 h 3 3 3 80 26.67 0.75 0.01 0.65 0.58 0.30 0.99

26 / U2 8 m 1 1 1 27 27.00 0.85 0.00 0.00 0.00 0.00 1.00

27 / U2 8 m 1 1 1 27 27.00 0.78 0.04 1.00 0.96 1.00 1.00

28 / U2 13 m 1 1 1 25 25.00 0.88 0.00 0.00 0.00 0.00 1.00

29 / U2 4 d 1 1 3 468 38.75 0.59 0.00 0.67 0.16 0.33 1.00

Figure C.5: Attack clusters detected by Araña [?] on Gossamer dataset. Similar
to [64], we refer to each attack cluster using the ID in the first column.

162


BIBLIOGRAPHY

[1] https://github.com/islamazhar/Arana-Public.

[2] https://blackbox.ipinfo.app.

[3] DigitalOcean – The developer cloud. https://www.digitalocean.com.

[4] Google Cloud: Cloud Computing Services. https://cloud.google.com.

[5] Inwi – Opérateur téléphonique mobile, fixe & internet au Maroc. https:
//www.inwi.ma.

[6] Okta. https://www.okta.com/.

[7] Python Fernet Cryptography Library. https://cryptography.io/en/
latest/fernet.html.

[8] Python Miscreant Encryption Library. https://github.com/miscreant/
miscreant.py.

[9] UA-parser. https://github.com/ua-parser/uap-python.

[10] RockYou Hack: From Bad To Worse. https://techcrunch.com/2009/12/
14/rockyou-hack-security-myspace-facebook-passwords/, 2009.

[11] Best64 Rule List. https://github.com/hashcat/hashcat/blob/master/
rules/best64.rule, 2018.

[12] OAuth. https://oauth.net/2/, 2020.

[13] Duo Security: Two-Factor Authentication & End Point Security. https:
//duo.com/, 2021.

[14] Account lockout threshold. https://learn.microsoft.com/en-us/
windows/security/threat-protection/security-policy-settings/account-
lockout-threshold, 2023.

[15] LinkedIn Corporation. https://www.linkedin.com/, 2023.

[16] OpenBullet2. https://github.com/openbullet/OpenBullet2, 2023.

https://github.com/islamazhar/Arana-Public
https://blackbox.ipinfo.app
https://www.digitalocean.com
https://cloud.google.com
https://www.inwi.ma
https://www.inwi.ma
https://www.okta.com/
https://cryptography.io/en/latest/fernet.html
https://cryptography.io/en/latest/fernet.html
https://github.com/miscreant/miscreant.py
https://github.com/miscreant/miscreant.py
https://github.com/ua-parser/uap-python
https://techcrunch.com/2009/12/14/rockyou-hack-security-myspace-facebook-passwords/
https://techcrunch.com/2009/12/14/rockyou-hack-security-myspace-facebook-passwords/
https://github.com/hashcat/hashcat/blob/master/rules/best64.rule
https://github.com/hashcat/hashcat/blob/master/rules/best64.rule
https://oauth.net/2/
https://duo.com/
https://duo.com/
https://learn.microsoft.com/en-us/windows/security/threat-protection/security-policy-settings/account-lockout-threshold
https://learn.microsoft.com/en-us/windows/security/threat-protection/security-policy-settings/account-lockout-threshold
https://learn.microsoft.com/en-us/windows/security/threat-protection/security-policy-settings/account-lockout-threshold
https://www.linkedin.com/
https://github.com/openbullet/OpenBullet2


[17] R-fx Networks Projects. https://www.rfxn.com/, 2023.

[18] StopGuessing. https://github.com/microsoft/StopGuessing, 2023.

[19] Jacob Abbott and Sameer Patil. How Mandatory Second Factor Affects
the Authentication User Experience. In 2020 CHI Conference on Human
Factors in Computing Systems, CHI ’20, page 1–13, New York, NY, USA,
2020. Association for Computing Machinery.

[20] Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke,
Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad
Rieck. Dos and Don’ts of Machine Learning in Computer Security. In
USENIX Security 22, 2022.

[21] Mahdi Hassan Aysa, Abdullahi Abdu Ibrahim, and Alaa Hamid Mo-
hammed. IoT DDOS Attack Detection Using Machine Learning. In 2020
4th International Symposium on Multidisciplinary Studies and Innovative Tech-
nologies (ISMSIT), 2020.

[22] Sruti Bhagavatula, Lujo Bauer, and Apu Kapadia. (How) Do people
change their passwords after a breach? USENIX ;login, 2021.

[23] Ashutosh Bhardwaj. Silhouette coefficient. https://towardsdatascience.
com/silhouette-coefficient-validating-clustering-techniques-
e976bb81d10c, 2020.

[24] Kemal Bicakci and Yusuf Uzunay. Is FIDO2 Passwordless Authentication
a Hype or for Real?: A Position Paper. In 2022 15th International Conference
on Information Security and Cryptography (ISCTURKEY), pages 68–73. IEEE,
2022.

[25] Battista Biggio, Samuel Rota Bulò, Ignazio Pillai, Michele Mura,
Eyasu Zemene Mequanint, Marcello Pelillo, and Fabio Roli. Poison-
ing complete-linkage hierarchical clustering. In Structural, Syntactic,
and Statistical Pattern Recognition: Joint IAPR International Workshop, S+
SSPR 2014, Joensuu, Finland, August 20-22, 2014. Proceedings, pages 42–52.
Springer, 2014.

[26] Battista Biggio, Ignazio Pillai, Samuel Rota Bulò, Davide Ariu, Marcello
Pelillo, and Fabio Roli. Is data clustering in adversarial settings secure?
In 2013 ACM workshop on Artificial Intelligence and Security, pages 87–98,
2013.

164

https://www.rfxn.com/
https://github.com/microsoft/StopGuessing
https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c
https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c
https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c


[27] Jeremiah Blocki and Wuwei Zhang. DALock: Password Distribution-
Aware Throttling. Proceedings on Privacy Enhancing Technologies, 2022.

[28] Beyond Identity Blog. The history and future of passwords.
https://www.beyondidentity.com/blog/history-and-future-passwords,
Sep 2021.

[29] Marina Sanusi Bohuk, Mazharul Islam, Syed Suleman Ahmad, Michael
Swift, Thomas Ristenpart, and Rahul Chatterjee. Gossamer: Securely
Measuring Password-based Logins. In 31nd USENIX Security Symposium
(USENIX Security 22), USA, 2022. USENIX Association.

[30] Joseph Bonneau, Cormac Herley, Paul C. van Oorschot, and Frank Sta-
jano. The quest to replace passwords: A framework for comparative eval-
uation of web authentication schemes. In 2012 IEEE Symposium on Security
and Privacy, pages 553–567, 2012.

[31] Joseph Bonneau, Cormac Herley, Paul C. Van Oorschot, and Frank Sta-
jano. The quest to replace passwords: A framework for comparative eval-
uation of web authentication schemes. In 2012 IEEE Symposium on Security
and Privacy, 2012.

[32] Joseph Bonneau, Cormac Herley, Paul C Van Oorschot, and Frank Stajano.
Passwords and the evolution of imperfect authentication. Communications
of the ACM, 58(7), 2015.

[33] Christlaan Brand and Sriram Karra. The beginning of the end of the pass-
word. https://blog.google/technology/safety-security/the-beginning-
of-the-end-of-the-password/.

[34] AMR Brostoff and MA Sasse. ’Ten strikes and you’re out’: Increasing
the number of login attempts can improve password usability. In CHI
2003 Workshop on Human-Computer Interaction and Security Systems, Fort
Lauderdale, 2003.

[35] Anna L. Buczak and Erhan Guven. A Survey of Data Mining and Machine
Learning Methods for Cyber Security Intrusion Detection. IEEE Commu-
nications Surveys & Tutorials, 18(2), 2016.

[36] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, An-
dreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre
Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly,

165

https://www.beyondidentity.com/blog/history-and-future-passwords
https://blog.google/technology/safety-security/the-beginning-of-the-end-of-the-password/
https://blog.google/technology/safety-security/the-beginning-of-the-end-of-the-password/


Brian Holt, and Gaël Varoquaux. API design for machine learning soft-
ware: experiences from the scikit-learn project. In ECML PKDD Workshop:
Languages for Data Mining and Machine Learning, pages 108–122, 2013.

[37] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detec-
tion: A survey. ACM computing surveys (CSUR), 41(3):1–58, 2009.

[38] Rahul Chatterjee, Anish Athalye, Devdatta Akhawe, Ari Juels, and
Thomas Ristenpart. pASSWORD tYPOS and how to correct them
securely. 2016. Full version of the paper can be found at
https://cs.cornell.edu/ rahul/papers/pwtypos.pdf.

[39] Rahul Chatterjee, Anish Athayle, Devdatta Akhawe, Ari Juels, and
Thomas Ristenpart. pASSWORD tYPOS and how to correct them se-
curely. In 2016 IEEE Symposium on Security and Privacy, 2016.

[40] Jessica Colnago, Summer Devlin, Maggie Oates, Chelse Swoopes, Lujo
Bauer, Lorrie Cranor, and Nicolas Christin. "It’s not actually that horri-
ble": Exploring Adoption of Two-Factor Authentication at a University.
In 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18).
Association for Computing Machinery, 2018.

[41] Jessica Colnago, Summer Devlin, Maggie Oates, Chelse Swoopes, Lujo
Bauer, Lorrie Cranor, and Nicolas Christin. “It’s Not Actually That Horri-
ble”: Exploring Adoption of Two-Factor Authentication at a University, page
1–11. Association for Computing Machinery, New York, NY, USA, 2018.

[42] Anupam Das, Joseph Bonneau, Matthew Caesar, Nikita Borisov, and Xi-
aoFeng Wang. The Tangled Web of Password Reuse. In NDSS, volume 14,
pages 23–26, 2014.

[43] Dashlane. Report: A global look at password health. https://www.
dashlane.com/global-password-health-score-report-2022.

[44] Katie Deighton. Tech Companies Push Users to Adopt Two-Factor
Authentication. https://www.wsj.com/articles/tech-companies-push-
users-to-adopt-two-factor-authentication-11635807088, 2021.

[45] Periwinkle Doerfler, Kurt Thomas, Maija Marincenko, Juri Ranieri,
Yu Jiang, Angelika Moscicki, and Damon Mccoy. Evaluating Login Chal-
lenges as a Defense Against Account Takeover. In 28th International Con-
ference on World Wide Web (WWW ’19), 2019.

166

https://www.dashlane.com/global-password-health-score-report-2022
https://www.dashlane.com/global-password-health-score-report-2022
https://www.wsj.com/articles/tech-companies-push-users-to-adopt-two-factor-authentication-11635807088
https://www.wsj.com/articles/tech-companies-push-users-to-adopt-two-factor-authentication-11635807088


[46] Verizon Enterprise. DBIR: 2023 Databreach Investigation Report.
https://www.verizon.com/business/resources/Tf0f/reports/2023-
data-breach-investigations-report-dbir.pdf, 2023.

[47] Chris Evans, Chris Palmer, and Ryan Sleevi. Public key pinning exten-
sion for http. Internet Engineering Task Force. 27Available: http://www. ietf.
org/internet-drafts/draft-ietf-websec-key-pinning-09. txt, 2015.

[48] Dinei Florencio and Cormac Herley. A large-scale study of web password
habits. In 16th International Conference on World Wide Web, WWW ’07, page
657–666, New York, NY, USA, 2007. Association for Computing Machin-
ery.

[49] Dinei Florêncio, Cormac Herley, and Baris Coskun. Do strong web pass-
words accomplish anything? HotSec, 7(6):159, 2007.

[50] Alain Forget, Saranga Komanduri, Alessandro Acquisti, Nicolas Christin,
Lorrie Faith Cranor, and Rahul Telang. Building the Security Behavior
Observatory: An Infrastructure for Long-Term Monitoring of Client Ma-
chines. In 2014 Symposium and Bootcamp on the Science of Security, HotSoS
’14. Association for Computing Machinery, 2014.

[51] John Franks, Phillip Hallam-Baker, Jeffrey Hostetler, Scott Lawrence, Paul
Leach, Ari Luotonen, and Lawrence Stewart. HTTP authentication: Basic
and digest access authentication, 1999.

[52] David Freeman, Sakshi Jain, Markus Duermuth, Battista Biggio, and Gior-
gio Giacinto. Who Are You? A Statistical Approach to Measuring User
Authenticity. In NDSS, 2016.

[53] Eva Gerlitz, Maximilian Häring, and Matthew Smith. Please do not
use or your License Plate Number: Analyzing Password Policies in Ger-
man Companies. In Seventeenth Symposium on Usable Privacy and Security
(SOUPS 2021), pages 17–36. USENIX Association, August 2021.

[54] Conor Gilsenan, Fuzail Shakir, Noura Alomar, and Serge Egelman. Secu-
rity and Privacy Failures in Popular 2FA Apps. In 32nd USENIX Security
Symposium (USENIX Security 23), 2023.

[55] Samuel Greengard. Passkeys unlock a new era for authentica-
tion. https://cacm.acm.org/news/270924-passkeys-unlock-a-new-era-
for-authentication/fulltext, Mar 2023.

167

https://www.verizon.com/business/resources/Tf0f/reports/2023-data-breach-investigations-report-dbir.pdf
https://www.verizon.com/business/resources/Tf0f/reports/2023-data-breach-investigations-report-dbir.pdf
https://cacm.acm.org/news/270924-passkeys-unlock-a-new-era-for-authentication/fulltext
https://cacm.acm.org/news/270924-passkeys-unlock-a-new-era-for-authentication/fulltext


[56] W3C Web Authentication Working Group. Passkeys: A Web Authenti-
cation API for Secure and Passwordless Sign-in. "https://w3.org/TR/
webauthn-passkeys/".

[57] Hana Habib, Jessica Colnago, William Melicher, Blase Ur, Sean Segreti,
Lujo Bauer, Nicolas Christin, and Lorrie Cranor. Password creation in the
presence of blacklists. Proc. USEC, 2017.

[58] D. Harkings. Synthetic initialization vector (SIV) authenticated encryp-
tion using the advanced encryption standard (AES).

[59] Cormac Herley and Stuart E. Schechter. Distinguishing Attacks from Le-
gitimate Authentication Traffic at Scale. In Network and Distributed Systems
Security (NDSS) Symposium 2019, 2019.

[60] Briland Hitaj, Paolo Gasti, Giuseppe Ateniese, and Fernando Perez-Cruz.
PassGAN: A Deep Learning Approach for Password Guessing. In Interna-
tional conference on applied cryptography and network security, pages 217–237.
Springer, 2019.

[61] Grant Ho, Aashish Sharma, Mobin Javed, Vern Paxson, and David Wag-
ner. Detecting Credential Spearphishing Attacks in Enterprise Settings.
In 26th USENIX Security Symposium (USENIX Security 17), SEC’17, page
469–485, USA, 2017. USENIX Association.

[62] Nicolas Huaman, Sabrina Amft, Marten Oltrogge, Yasemin Acar, and
Sascha Fahl. They would do better if they worked together: The case of
interaction problems between password managers and websites. In 2021
IEEE Symposium on Security and Privacy (SP), pages 1367–1381. IEEE, 2021.

[63] Troy Hunt. Have I Been Pwned? https://haveibeenpwned.com/
Passwords/, 2018.

[64] Mazharul Islam, Marina Sanusi Bohuk, Paul Chung, Thomas Ristenpart,
and Rahul Chatterjee. Araña: Discovering and Characterizing Pass-
word Guessing Attacks in Practice. In 32nd USENIX Security Sympo-
sium (USENIX Security 23), pages 1019–1036, Anaheim, CA, August 2023.
USENIX Association.

[65] Mazharul Islam, Marina Sanusi Bohuk, Thomas Ristenpart, and Rahul
Chatterjee. The Practicality and Robustness of Timely Malicious Login
Detection. In In Preparation.

168

"https://w3.org/TR/webauthn-passkeys/"
"https://w3.org/TR/webauthn-passkeys/"
https://haveibeenpwned.com/Passwords/
https://haveibeenpwned.com/Passwords/


[66] Mobin Javed and Vern Paxson. Detecting stealthy, distributed ssh brute-
forcing. In 2013 ACM SIGSAC conference on Computer & communications
security, pages 85–96, 2013.

[67] Mohammed Jubur, Prakash Shrestha, Nitesh Saxena, and Jay Prakash. By-
passing push-based second factor and passwordless authentication with
human-indistinguishable notifications. In 2021 ACM Asia Conference on
Computer and Communications Security, ASIA CCS ’21, page 447–461, New
York, NY, USA, 2021. Association for Computing Machinery.

[68] Mohammed Jubur, Prakash Shrestha, Nitesh Saxena, and Jay Prakash. By-
passing push-based second factor and passwordless authentication with
human-indistinguishable notifications. In 2021 ACM Asia Conference on
Computer and Communications Security, ASIA CCS ’21, page 447–461, New
York, NY, USA, 2021. Association for Computing Machinery.

[69] P. G. Kelley, S. Komanduri, M. L. Mazurek, R. Shay, T. Vidas, L. Bauer,
N. Christin, L. F. Cranor, and J. Lopez. Guess Again (and Again and
Again): Measuring Password Strength by Simulating Password-Cracking
Algorithms. In 2012 IEEE Symposium on Security and Privacy, 2012.

[70] Erin Kenneally and David Dittrich. The menlo report: Ethical principles
guiding information and communication technology research. Available
at SSRN 2445102, 2012.

[71] Saranga Komanduri. Modeling the Adversary to Evaluate Password
Strength With Limited Samples. https://kilthub.cmu.edu/articles/
thesis/Modeling_the_Adversary_to_Evaluate_Password_Strength_
With_Limited_Samples/6720701/1, 2018.

[72] Saranga Komanduri, Richard Shay, Patrick Gage Kelley, Michelle L.
Mazurek, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and Serge
Egelman. Of Passwords and People: Measuring the Effect of Password-
Composition Policies. In SIGCHI Conference on Human Factors in Comput-
ing Systems (CHI ’11). Association for Computing Machinery, 2011.

[73] Leona Lassak, Elleen Pan, Blase Ur, and Maximilian Golla. Why Aren’t
We Using Passkeys? Obstacles Companies Face Deploying FIDO2 Pass-
wordless Authentication. In 33rd USENIX Security Symposium (USENIX
Security 24), Philadelphia, PA, 2024. USENIX Association.

[74] Kristin Lauter. Password Monitor: Safeguarding passwords in Microsoft

169

https://kilthub.cmu.edu/articles/thesis/Modeling_the_Adversary_to_Evaluate_Password_Strength_With_Limited_Samples/6720701/1
https://kilthub.cmu.edu/articles/thesis/Modeling_the_Adversary_to_Evaluate_Password_Strength_With_Limited_Samples/6720701/1
https://kilthub.cmu.edu/articles/thesis/Modeling_the_Adversary_to_Evaluate_Password_Strength_With_Limited_Samples/6720701/1


Edge. https://www.microsoft.com/en-us/research/blog/password-
monitor-safeguarding-passwords-in-microsoft-edge/, 2021.

[75] Aleksandar Lazarevic, Levent Ertoz, Vipin Kumar, Aysel Ozgur, and
Jaideep Srivastava. A comparative study of anomaly detection schemes
in network intrusion detection. In 2003 SIAM international conference on
data mining, pages 25–36. SIAM, 2003.

[76] Lucy Li, Bijeeta Pal, Junade Ali, Nick Sullivan, Rahul Chatterjee, and
Thomas Ristenpart. Protocols for checking compromised credentials. In
2019 ACM SIGSAC Conference on Computer and Communications Security,
CCS ’19, page 1387–1403, New York, NY, USA, 2019. Association for Com-
puting Machinery.

[77] Bo Lu, Xiaokuan Zhang, Ziman Ling, Yinqian Zhang, and Zhiqiang Lin. A
measurement study of authentication rate-limiting mechanisms of mod-
ern websites. In 34th Annual Computer Security Applications Conference,
ACSAC ’18, page 89–100, New York, NY, USA, 2018. Association for Com-
puting Machinery.

[78] Sanam Ghorbani Lyastani, Michael Schilling, Michaela Neumayr, Michael
Backes, and Sven Bugiel. Is FIDO2 the kingslayer of user authentication?
A comparative usability study of FIDO2 passwordless authentication. In
2020 IEEE Symposium on Security and Privacy (SP), pages 268–285. IEEE,
2020.

[79] J. Ma, W. Yang, M. Luo, and N. Li. A study of probabilistic password
models. In 2014 IEEE Symposium on Security and Privacy, pages 689–704,
2014.

[80] MaxMind. GeoIP2 Database. https://www.maxmind.com/en/geoip2-
databases.

[81] LLC MaxMind. GeoIP. https://www.maxmind.com/en/geoip-demo.

[82] Peter Mayer, Collins W Munyendo, Michelle L Mazurek, and Adam J
Aviv. Why Users (Don’t) Use Password Managers at a Large Educational
Institution. In 31st USENIX Security Symposium (USENIX Security 22),
pages 1849–1866, 2022.

[83] Michelle L. Mazurek, Saranga Komanduri, Timothy Vidas, Lujo Bauer,
Nicolas Christin, Lorrie Faith Cranor, Patrick Gage Kelley, Richard Shay,

170

https://www.microsoft.com/en-us/research/blog/password-monitor-safeguarding-passwords-in-microsoft-edge/
https://www.microsoft.com/en-us/research/blog/password-monitor-safeguarding-passwords-in-microsoft-edge/
https://www.maxmind.com/en/geoip2-databases
https://www.maxmind.com/en/geoip2-databases
https://www.maxmind.com/en/geoip-demo


and Blase Ur. Measuring Password Guessability for an Entire University.
In 2013 ACM SIGSAC Conference on Computer and Communications Security
(CCS ’13). Association for Computing Machinery, 2013.

[84] Leland McInnes, John Healy, and Steve Astels. HDBSCAN: Hierarchical
Density-Based Clustering. The Journal of Open Source Software, 2(11), 2017.

[85] William Melicher, Blase Ur, Sean M. Segreti, Saranga Komanduri, Lujo
Bauer, Nicolas Christin, and Lorrie Faith Cranor. Fast, Lean, and Ac-
curate: Modeling Password Guessability Using Neural Networks. In
USENIX Security 16. USENIX Association, 2016.

[86] Andrew Miklas, Stefan Saroiu, Alec Wolman, and Angela Demke Brown.
Bunker: A privacy-oriented platform for network tracing. In 6th USENIX
Symposium on Networked Systems Design and Implementation (NSDI), 2009.

[87] Daniel Müllner. Modern hierarchical, agglomerative clustering algo-
rithms. arXiv preprint arXiv:1109.2378, 2011.

[88] Arvind Narayanan and Vitaly Shmatikov. Fast Dictionary Attacks on
Passwords Using Time-Space Tradeoff. In 12th ACM Conference on Com-
puter and Communications Security, CCS ’05, page 364–372, New York, NY,
USA, 2005. Association for Computing Machinery.

[89] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of
large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (sp
2008), pages 111–125. IEEE, 2008.

[90] Alexandra Nisenoff, Maximilian Golla, Miranda Wei, Juliette Hainline,
Hayley Szymanek, Annika Braun, Annika Hildebrandt, Blair Christensen,
David Langenberg, and Blase Ur. A {Two-Decade} retrospective analysis
of a university’s vulnerability to attacks exploiting reused passwords. In
32nd USENIX Security Symposium (USENIX Security 23), pages 5127–5144,
2023.

[91] Sean Oesch and Scott Ruoti. That Was Then, This Is Now: A Security Eval-
uation of Password Generation, Storage, and Autofill in Browser-Based
Password Managers. In 29th USENIX Security Symposium (USENIX Secu-
rity 20), pages 2165–2182. USENIX Association, August 2020.

[92] Openwall. John The Ripper. https://www.openwall.com/john/.

171

https://www.openwall.com/john/


[93] Bijeeta Pal, Tal Daniel, Rahul Chatterjee, and Thomas Ristenpart. Beyond
credential stuffing: Password similarity models using neural networks.
In 2019 IEEE Symposium on Security and Privacy (SP), pages 417–434, May
2019.

[94] Bijeeta Pal, Mazharul Islam, Thomas Ristenpart, and Rahul Chatterjee.
Might I get pwned: A second generation password breach alerting ser-
vice, 2021.

[95] Danny Palmer. These systems are facing billions of attacks every month as
hackers try to guess passwords. https://www.zdnet.com/article/these-
systems-are-facing-billions-of-attacks-every-month-as-hackers-try-to-
guess-passwords/, 2021.

[96] Sarah Pearman, Jeremy Thomas, Pardis Emami Naeini, Hana Habib, Lujo
Bauer, Nicolas Christin, Lorrie Faith Cranor, Serge Egelman, and Alain
Forget. Let’s go in for a closer look: Observing passwords in their natural
habitat. In 2017 ACM SIGSAC Conference on Computer and Communications
Security (CCS ’17). Association for Computing Machinery, 2017.

[97] Sarah Pearman, Shikun Aerin Zhang, Lujo Bauer, Nicolas Christin, and
Lorrie Faith Cranor. Why people (don’t) use password managers effec-
tively. In Fifteenth Symposium on Usable Privacy and Security (SOUPS 2019),
pages 319–338, 2019.

[98] Soc Radar. 91% of e-commerce login traffic is credential stuff-
ing. https://socradar.io/91-of-e-commerce-login-traffic-is-credential-
stuffing-attempts/, 2022.

[99] Nabeel Saeed. Top insights from our 2022 state of secure identity report.

[100] Ville Satopaa, Jeannie Albrecht, David Irwin, and Barath Raghavan. Find-
ing a "Kneedle" in a Haystack: Detecting Knee Points in System Behav-
ior. In 2011 31st International Conference on Distributed Computing Systems
Workshops, 2011.

[101] Stuart Schechter, Cormac Herley, and Michael Mitzenmacher. Popularity
Is Everything: A New Approach to Protecting Passwords from Statistical-
Guessing Attacks. In 5th USENIX Workshop on Hot Topics in Security (Hot-
Sec 10), Washington, DC, August 2010. USENIX Association.

[102] Stuart Schechter, Yuan Tian, and Cormac Herley. StopGuessing: Using

172

https://www.zdnet.com/article/these-systems-are-facing-billions-of-attacks-every-month-as-hackers-try-to-guess-passwords/
https://www.zdnet.com/article/these-systems-are-facing-billions-of-attacks-every-month-as-hackers-try-to-guess-passwords/
https://www.zdnet.com/article/these-systems-are-facing-billions-of-attacks-every-month-as-hackers-try-to-guess-passwords/
https://socradar.io/91-of-e-commerce-login-traffic-is-credential-stuffing-attempts/
https://socradar.io/91-of-e-commerce-login-traffic-is-credential-stuffing-attempts/


Guessed Passwords to Thwart Online Guessing. In 2019 IEEE European
Symposium on Security and Privacy (EuroS&P), 2019.

[103] Tara Seals. Billions of passwords offered for $2 in cyber-underground.
ThreatPost, 2021.

[104] Richard Shay, Patrick Gage Kelley, Saranga Komanduri, Michelle L.
Mazurek, Blase Ur, Timothy Vidas, Lujo Bauer, Nicolas Christin, and Lor-
rie Faith Cranor. Correct horse battery staple: Exploring the usability of
system-assigned passphrases. In Eighth Symposium on Usable Privacy and
Security, SOUPS ’12, New York, NY, USA, 2012. Association for Comput-
ing Machinery.

[105] Richard Shay, Saranga Komanduri, Adam L. Durity, Phillip (Seyoung)
Huh, Michelle L. Mazurek, Sean M. Segreti, Blase Ur, Lujo Bauer, Nico-
las Christin, and Lorrie Faith Cranor. Can long passwords be secure and
usable? In SIGCHI Conference on Human Factors in Computing Systems, CHI
’14, page 2927–2936, New York, NY, USA, 2014. Association for Comput-
ing Machinery.

[106] Jens Steube and Gabriele Gristina. Hashcat. https://hashcat.net/
hashcat/.

[107] Schechter Stuart. Before You Use a Password Manager. https:
//stuartschechter.medium.com/before-you-use-a-password-manager-
9f5949ccf168, 2019.

[108] Latanya Sweeney. Weaving technology and policy together to maintain
confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997.

[109] Pang-Ning Tan, Michael Steinbach, Anuj Karpatneand, and Vipin Kumar.
Introduction to Data Mining, 2nd Edition. People’s Posts and Telecommuni-
cations Publishing House, Beijing, 2019.

[110] Danna Thee. Sentry MBA. https://e.cyberint.com/hubfs/Sentry%
20MBA%20Report/CyberInt_Sentry-MBA_Report.pdf.

[111] Kurt Thomas, Frank Li, Ali Zand, Jacob Barrett, Juri Ranieri, Luca Inv-
ernizzi, Yarik Markov, Oxana Comanescu, Vijay Eranti, Angelika Mosci-
cki, Daniel Margolis, Vern Paxson, and Elie Bursztein. Data Breaches,
Phishing, or Malware? Understanding the Risks of Stolen Credentials.
In 2017 ACM SIGSAC Conference on Computer and Communications Security
(CCS ’17). Association for Computing Machinery, 2017.

173

https://hashcat.net/hashcat/
https://hashcat.net/hashcat/
https://stuartschechter.medium.com/before-you-use-a-password-manager-9f5949ccf168
https://stuartschechter.medium.com/before-you-use-a-password-manager-9f5949ccf168
https://stuartschechter.medium.com/before-you-use-a-password-manager-9f5949ccf168
https://e.cyberint.com/hubfs/Sentry%20MBA%20Report/CyberInt_Sentry-MBA_Report.pdf
https://e.cyberint.com/hubfs/Sentry%20MBA%20Report/CyberInt_Sentry-MBA_Report.pdf


[112] Kurt Thomas, Jennifer Pullman, Kevin Yeo, Ananth Raghunathan,
Patrick Gage Kelley, Luca Invernizzi, Borbala Benko, Tadek Pietraszek,
Sarvar Patel, Dan Boneh, and Elie Bursztein. Protecting Accounts from
Credential Stuffing with Password Breach Alerting. In USENIX Security
19. USENIX Association, 2019.

[113] Michael Tremante. End User Security: Account Takeover Protections with
Cloudflare. https://blog.cloudflare.com/account-takeover-protection/,
2021.

[114] Blase Ur, Patrick Gage Kelley, Saranga Komanduri, Joel Lee, Michael
Maass, Michelle L. Mazurek, Timothy Passaro, Richard Shay, Timothy Vi-
das, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. How does
your password measure up? the effect of strength meters on password
creation. In USENIX Security 22. USENIX Association, 2012.

[115] Blase Ur, Fumiko Noma, Jonathan Bees, Sean M Segreti, Richard Shay,
Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. " I Added’!’at the
End to Make It Secure": Observing Password Creation in the Lab. In
Eleventh symposium on usable privacy and security (SOUPS 2015), pages 123–
140, 2015.

[116] Luke Valenta, Cefan Daniel Rubin, and Christopher Wood. Privacy-
Preserving Compromised Credential Checking. https://blog.cloudflare.
com/privacy-preserving-compromised-credential-checking, 2021.

[117] Luis Von Ahn, Manuel Blum, Nicholas J Hopper, and John Langford.
Captcha: Using hard ai problems for security. In International conference
on the theory and applications of cryptographic techniques, pages 294–311.
Springer, 2003.

[118] Chun Wang, Steve T.K. Jan, Hang Hu, Douglas Bossart, and Gang Wang.
The next domino to fall: Empirical analysis of user passwords across on-
line services. In Eighth ACM Conference on Data and Application Security
and Privacy, CODASPY ’18, page 196–203, New York, NY, USA, 2018. As-
sociation for Computing Machinery.

[119] Ding Wang, Zijian Zhang, Ping Wang, Jeff Yan, and Xinyi Huang. Tar-
geted online password guessing: An underestimated threat. 2016.

[120] WAQAS. A trove of 1.4 billion clear text credentials file found on dark
web, 2017.

174

https://blog.cloudflare.com/account-takeover-protection/
https://blog.cloudflare.com/privacy-preserving-compromised-credential-checking
https://blog.cloudflare.com/privacy-preserving-compromised-credential-checking


[121] Rick Wash, Emilee Rader, Ruthie Berman, and Zac Wellmer. Under-
standing password choices: How frequently entered passwords are re-
used across websites. In Twelfth Symposium on Usable Privacy and Security
(SOUPS 2016), pages 175–188, 2016.

[122] Matt Weir, Sudhir Aggarwal, Breno de Medeiros, and Bill Glodek. Pass-
word Cracking Using Probabilistic Context-Free Grammars. In 2009 30th
IEEE Symposium on Security and Privacy, 2009.

[123] Dan Lowe Wheeler. zxcvbn: Low-Budget Password Strength Estimation.
In Proc. USENIX Security, 2016.

[124] Suzanne Widup, Bob Rudis, Dave Hylender, and Marc Spitler. 2015
Data Breach Investigations Report. https://www.researchgate.net/
publication/289254638_2015_Verizon_Data_Breach_Investigations_
Report, 2015.

[125] Wikipedia. List of the most common passwords. https://en.wikipedia.
org/wiki/List_of_the_most_common_passwords.

[126] Danny Yadron. Man behind the first computer password: It’s become a
nightmare. Wall Street Journal, May 2014.

[127] Samira Zibaei, Dinah Rinoa Malapaya, Benjamin Mercier, Amirali Salehi-
Abari, and Julie Thorpe. Do password managers nudge secure (ran-
dom) passwords? In Eighteenth Symposium on Usable Privacy and Security
(SOUPS 2022), pages 581–597, 2022.

175

https://www.researchgate.net/publication/289254638_2015_Verizon_Data_Breach_Investigations_Report
https://www.researchgate.net/publication/289254638_2015_Verizon_Data_Breach_Investigations_Report
https://www.researchgate.net/publication/289254638_2015_Verizon_Data_Breach_Investigations_Report
https://en.wikipedia.org/wiki/List_of_the_most_common_passwords
https://en.wikipedia.org/wiki/List_of_the_most_common_passwords

	Biographical Sketch
	Dedication
	Acknowledgements
	Contents
	List of Figures
	Introduction
	Gossamer: Securely Measuring Password-based Logins
	Introduction
	Background and Related Work
	Designing a Secure Measurement Architecture
	Password-Derived Measurements and Security Analysis
	Password-derived measurements
	Security analysis of measurements

	Deploying Gossamer
	Login Statistics, Patterns, & Observations
	Characteristics of high volume attacks
	User and client statistics
	Password security
	Usability

	Conclusion

	Araña: Discovering and Characterizing Password Guessing Attacks in Practice
	Introduction
	Background and Related Work
	Gossamer Logs
	Towards Detecting Attack Campaigns
	Login Sets and Features
	Initial Attempts Using Compromise Reports

	Campaign Discovery Pipeline
	Filtering Likely Benign Requests
	Clustering Potentially Malicious  sets
	Attack Campaigns Discovered

	Analysis of Attack Campaigns
	Example Attack Campaigns
	Higher-Level Attack Behaviors Observed

	Discussion
	Real-World Deployment Constraints
	Conclusion

	The Practicality and Robustness of Timely Malicious Login Detection
	Introduction
	Background & Related Work
	Evaluating Prior Approaches
	Dataset
	Performance of Volumetric Approaches
	Performance of ML Approaches

	Anomaly-Based Detection Approaches
	Malicious Login Detection Using DAS

	Attack Detection Performance
	Robustness of Malicious Login Detection
	Model for Evasion Attacks
	Evasion Strategies
	Evasion Results

	Discussion & Future Work
	Conclusion

	Conclusion
	Overarching Observations
	The Authentication Ecosystem

	Gossamer: Securely Measuring Password-based Logins
	Measurements Taken
	Filtering Out Attacks 
	Login Statistics


	Araña: Discovering and Characterizing Password Guessing Attacks in Practice
	Reasons for Compromise Reports
	Using DAS to Detect Attacks
	Additional Clustering Results
	Comparing Araña to Prior Work gossamer
	Geographical Source of Attacks

	The Practicality and Robustness of Timely Malicious Login Detection
	Finding k for Volumetric Flagging
	StopGuessing Implementation Details
	WhoAreYou Implementation Details
	Estimating Reputation Score
	Ablation Study

	Additional Evasion Results
	Attack Clusters Detected by Araña arana

	Bibliography