CHARACTERIZING AND DETECTING PASSWORD GUESSING ATTACKS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Marina Sanusi Bohuk May 2024 © 2024 Marina Sanusi Bohuk ALL RIGHTS RESERVED CHARACTERIZING AND DETECTING PASSWORD GUESSING ATTACKS Marina Sanusi Bohuk, Ph.D. Cornell University 2024 Modern authentication systems still mainly rely on passwords for authentica- tion, but little is known about legitimate and malicious user behavior during the authentication process due to the difficulty of collecting information on such a sensitive field. Because passwords are hard to remember and often reused across websites, they are prone to remote guessing attacks in which an attacker iterates through a guess list of credentials, submitting them against a live lo- gin system; but existing defenses do not leverage password-based information because of the challenge of collecting such information in a secure way. We address this challenge first by developing a measurement frame- work called Gossamer for securely recording password-derived measurements, which we used to collect data on 34 million login requests at two universities. Then, we show how we used the data collected by Gossamer to develop a clus- tering approach called Araña that detects and groups login requests into attack campaigns. Finally, we explore existing timely attack detection mechanisms and evaluate them on Gossamer data along with three new detection methods based on Directed Anomaly Scoring. We also show that these detection methods are vulnerable to evasion attacks by an adaptive attacker. BIOGRAPHICAL SKETCH Marina Bohuk was born and raised in Charlottesville, Virginia. She attended the University of Virginia where she earned an undergraduate degree in Com- puter Science. While at UVA, she co-founded a cybersecurity training com- pany called MetaCTF with her now-husband and two friends. After gradu- ating, she worked for a year as a software engineer at Mastercard before be- ginning her studies at Cornell University. During her PhD, she has focused on improving password-based authentication systems and protecting them from attacks through empirical observations on real-world data. She also interned at Cloudflare where she helped implement Might I Get Pwned, a tool for detecting leaked credential use. For Roman and Nikolai. ACKNOWLEDGEMENTS I would first and foremost like to thank Tom, without whom none of this work would have been possible. His advice and support along the way have been instrumental to my journey here at Cornell. Next I would like to thank my collaborators, especially Rahul Chatterjee and Mazharul Islam at the University of Wisconsin-Madison. I always benefit from hearing their ideas, and I’ve really enjoyed working with them on three consec- utive papers. Finally, I am eternally grateful to my parents for their constant support and to my husband Roman for his encouragement, optimism, IT support, and trips with me back and forth to New York. CONTENTS Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 1 Introduction 1 2 Gossamer: Securely Measuring Password-based Logins 5 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . 9 2.3 Designing a Secure Measurement Architecture . . . . . . . . . . . 11 2.4 Password-Derived Measurements and Security Analysis . . . . . 17 2.4.1 Password-derived measurements . . . . . . . . . . . . . . 18 2.4.2 Security analysis of measurements . . . . . . . . . . . . . . 19 2.5 Deploying Gossamer . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.6 Login Statistics, Patterns, & Observations . . . . . . . . . . . . . . 30 2.6.1 Characteristics of high volume attacks . . . . . . . . . . . . 32 2.6.2 User and client statistics . . . . . . . . . . . . . . . . . . . . 34 2.6.3 Password security . . . . . . . . . . . . . . . . . . . . . . . 37 2.6.4 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3 Araña: Discovering and Characterizing Password Guessing Attacks in Practice 49 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . 53 3.3 Gossamer Logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4 Towards Detecting Attack Campaigns . . . . . . . . . . . . . . . . 61 3.4.1 Login Sets and Features . . . . . . . . . . . . . . . . . . . . 62 3.4.2 Initial Attempts Using Compromise Reports . . . . . . . . 65 3.5 Campaign Discovery Pipeline . . . . . . . . . . . . . . . . . . . . . 66 3.5.1 Filtering Likely Benign Requests . . . . . . . . . . . . . . . 67 3.5.2 Clustering Potentially Malicious L sets . . . . . . . . . . . 71 3.5.3 Attack Campaigns Discovered . . . . . . . . . . . . . . . . 74 3.6 Analysis of Attack Campaigns . . . . . . . . . . . . . . . . . . . . 76 3.6.1 Example Attack Campaigns . . . . . . . . . . . . . . . . . . 76 3.6.2 Higher-Level Attack Behaviors Observed . . . . . . . . . . 83 3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.8 Real-World Deployment Constraints . . . . . . . . . . . . . . . . . 90 3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4 The Practicality and Robustness of Timely Malicious Login Detection 93 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.2 Background & Related Work . . . . . . . . . . . . . . . . . . . . . . 97 4.3 Evaluating Prior Approaches . . . . . . . . . . . . . . . . . . . . . 101 4.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.3.2 Performance of Volumetric Approaches . . . . . . . . . . . 104 4.3.3 Performance of ML Approaches . . . . . . . . . . . . . . . 106 4.4 Anomaly-Based Detection Approaches . . . . . . . . . . . . . . . . 110 4.4.1 Malicious Login Detection Using DAS . . . . . . . . . . . . 113 4.5 Attack Detection Performance . . . . . . . . . . . . . . . . . . . . . 117 4.6 Robustness of Malicious Login Detection . . . . . . . . . . . . . . 122 4.6.1 Model for Evasion Attacks . . . . . . . . . . . . . . . . . . 122 4.6.2 Evasion Strategies . . . . . . . . . . . . . . . . . . . . . . . 125 4.6.3 Evasion Results . . . . . . . . . . . . . . . . . . . . . . . . . 127 4.7 Discussion & Future Work . . . . . . . . . . . . . . . . . . . . . . . 132 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 5 Conclusion 137 5.1 Overarching Observations . . . . . . . . . . . . . . . . . . . . . . . 137 5.2 The Authentication Ecosystem . . . . . . . . . . . . . . . . . . . . 138 A Gossamer: Securely Measuring Password-based Logins 140 A.0.1 Measurements Taken . . . . . . . . . . . . . . . . . . . . . . 140 A.0.2 Filtering Out Attacks . . . . . . . . . . . . . . . . . . . . . 140 A.0.3 Login Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 142 B Araña: Discovering and Characterizing Password Guessing Attacks in Practice 145 B.1 Reasons for Compromise Reports . . . . . . . . . . . . . . . . . . . 145 B.2 Using DAS to Detect Attacks . . . . . . . . . . . . . . . . . . . . . 146 B.3 Additional Clustering Results . . . . . . . . . . . . . . . . . . . . . 148 B.4 Comparing Araña to Prior Work [29] . . . . . . . . . . . . . . . . . 152 B.5 Geographical Source of Attacks . . . . . . . . . . . . . . . . . . . . 152 C The Practicality and Robustness of Timely Malicious Login Detection 154 C.1 Finding k for Volumetric Flagging . . . . . . . . . . . . . . . . . . 154 C.2 StopGuessing Implementation Details . . . . . . . . . . . . . . . . 155 C.3 WhoAreYou Implementation Details . . . . . . . . . . . . . . . . . 158 C.3.1 Estimating Reputation Score . . . . . . . . . . . . . . . . . 158 C.3.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . 159 C.4 Additional Evasion Results . . . . . . . . . . . . . . . . . . . . . . 160 C.5 Attack Clusters Detected by Araña [64] . . . . . . . . . . . . . . . 161 Bibliography 163 LIST OF FIGURES 2.1 The main components of Gossamer. . . . . . . . . . . . . . . . . . 12 2.2 Relative improvement in password guessing success rate (∆q) due to access to password-derived statistics from Gossamer over the baseline in q ≤ 108 guesses. . . . . . . . . . . . . . . . . . . . . 22 2.3 Successful and failed login attempts per day at two universities (for a total of 196 K unique users at U1 and 309 K at U2). Potential high-volume attack campaigns we discovered are also shown. . 27 2.4 Summary statistics of login requests recorded by Gossamer at U1 and U2. More statistics can be found in Figure A.3. . . . . . . . . 30 2.5 Cumulative distributions of unique passwords per username and IP for February 2021. The X-axis is log scale. . . . . . . . . . . 37 2.6 Cumulative distribution of interarrival time between requests for different inactivity thresholds for February 2021. The X-axis is log scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.7 Average session sizes (number of attempts per session) for dif- ferent inactivity thresholds . . . . . . . . . . . . . . . . . . . . . . 43 2.8 Cumulative distributions of number of sessions per user per day for an inactivity threshold of 360s for February 2021. The X-axis is log scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.9 Cumulative distributions of session sizes (number of attempts per session) for mobile devices for February 2021. The X-axis for U2 is limited to 10 for comparison with U1 (although the maxi- mum session size consisted of 54 attempts). . . . . . . . . . . . . 46 3.1 Summary statistics on the compromised accounts reported at each university during the measurement period. . . . . . . . . . . 60 3.2 Features for L sets that we use for analysis. Nominal features are marked with ©; all others are numerical. . . . . . . . . . . . . 63 3.3 Araña’s filter, cluster, analyze pipeline for discovering attack campaigns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.4 Percentile of NR (left) and FF (right) for both universities (shown up to the 99th percentile for viewing). By choosing the 90th per- centile at U1 and 80th at U2, we set `U1 = 10, `U2 = 7 and fU1 = 0.77, fU2 = 0.8 at the universities respectively. . . . . . . . 69 3.5 Number of L sets matching each of the filtering criteria at either university. The last row shows the remaining number of sets after all filtering steps. Threshold values ` and f for U1 and U2 are indicated in Figure 3.4. . . . . . . . . . . . . . . . . . . . . . . 71 3.6 Our distance function (logsetsim) to measure the similarity be- tween two L sets (Left) and the hierarchical backoff distance (HBD) calculation for nominal features (Right). . . . . . . . . . . . 73 3.7 Number of requests per day for the 1,752 and 6,408 poten- tially malicious L sets at U1 and U2 respectively as shown in Figure 3.5. Attack clusters we found in Section 3.4 are shown in red boxes. Clusters marked with a * were identified in prior work [29] as high volume attacks. Note that the x- and y-axes are different for the two universities for better visualization. . . . . 75 3.8 Abbreviations for measurements taken . . . . . . . . . . . . . . . 76 3.9 Attack clusters detected using a set of heuristics and manual re- view. The first column notes the ID of each cluster as we refer to them in the paper [64]. The attack IDs we describe in detail (Section 3.6) are shown in bold font. We discovered a total of 41 and 1,116 unique compromised users at U1 and U2 respectively. Abbreviations are described in Figure 3.8. . . . . . . . . . . . . . 77 3.10 Different attack behaviors we observed. . . . . . . . . . . . . . . . 84 4.1 Breakdown of the Gossamer logs used in our simulations. Login requests associated with Araña-identified clusters are labeled as malicious, and all others are labeled as benign. . . . . . . . . . . . 101 4.2 Evaluation of the volumetric, ML-based, and DAS-based ap- proaches on the Gossamer holdout dataset using attacks re- ported in [64] as ground truth. We report the Precision, Recall, and F1 scores (by percentage (%)) broken down both by indi- vidual requests and L sets. The highest value in each column is boldfaced. StopGuessing-G and WhoAreYou do not flag any requests as malicious at U1. . . . . . . . . . . . . . . . . . . . . . . 105 4.3 The detection performance (recall) of the three DAS-based ap- proaches on the four attack clusters at U1 and seven at U2 re- ported by Araña [64] over the holdout period. Multi-day clusters are highlighted with an asterisk (∗), and the # L sets is boldfaced when all L sets of the attack cluster are detected by the approach. 114 4.4 Manual analysis of the false positives (FPs) generated by the three DAS-based approaches. . . . . . . . . . . . . . . . . . . . . . 117 4.5 For the three DAS-based approaches, we report the total num- ber of alerts (# of L sets flagged malicious) generated per day (red line) and how many of them are true alerts (blue line) over the testing period at U2. We also report the number of known malicious L sets on each day (green line) which is the same for all approaches. A smaller gap between “total alerts” and “true alerts” indicates higher precision, and a smaller gap between “true alerts” and “known attacks” indicates higher recall. . . . . 118 4.6 Evasion results for the Comparison Set approach when attempt- ing to evade attack clusters by redistributing the attack transcript over θip IP addresses or inserting θip chaff L sets. . . . . . . . . . . 128 4.7 Results of the three blending evasion strategies against Per-day Clustering technique. None of the attacks at U1 could evade Per- day Clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 A.1 Measurements we log in ephemeral (Eph.) and persistent (Pers.) storage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 A.2 The distribution of operating systems (OS) as detected in the user-agent of all the requests (after removing requests contain- ing empty user-agent string at U2). . . . . . . . . . . . . . . . . . 142 A.3 Summary statistics of login requests recorded by Gossamer at U1 (from Dec. ’20 to July ’21) and U2 (from Dec ’20 to Mar ’21). . . . 143 A.4 Top 10 most common user agents at U1 (top) and at U2 (bottom) 144 B.1 The reasons for compromise as noted at U1 and the number of instances of such compromise. Users reported multiple times are counted as distinct instances. . . . . . . . . . . . . . . . . . . . . 146 B.2 Silhouette scores of different clustering models. . . . . . . . . . . 148 B.3 Confusion matrices of the number of L sets corresponding to the three attack campaigns found in prior work [29] using their manual approach versus our FCA approach. . . . . . . . . . . . . 152 B.4 The five most common countries at U1 (left) and U2 (right) by number of attack requests. . . . . . . . . . . . . . . . . . . . . . . 153 C.1 The F1 scores for different values of k over the training period of Gossamer logs at both universities. . . . . . . . . . . . . . . . . . 154 C.2 Evaluation of the volumetric, ML-based, and DAS-based ap- proaches on the Gossamer train and holdout datasets using at- tacks reported in [64] as ground truth. We report the Precision, Recall, and F1 scores (in percentage (%)) broken down both by individual requests and L sets. StopGuessing and WhoAreYou do not flag any requests as malicious at U1. . . . . . . . . . . . . . 155 C.3 % of unseen IP addresses, ISPs, AS numbers, and countries at each level ` observed by U2 during the holdout period. . . . . . . 156 C.4 Results of the evasion attacks when redistributing the login re- quests across θip IP addresses involved in the corresponding at- tack cluster, adding θr additional login requests, and lengthening the attack by θt. We show the results over the 19 attack clusters detected by the Per-day Clustering approach at U2. For this ex- periment we assume the attacker has access to all login requests as observed by the login server. A “-” symbol means the attacker does not require the corresponding resource for evasion. . . . . . 160 C.5 Attack clusters detected by Araña [?] on Gossamer dataset. Sim- ilar to [64], we refer to each attack cluster using the ID in the first column. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 CHAPTER 1 INTRODUCTION The first digital password was invented in 1961 by computer science profes- sor Fernando Corbato at MIT [126]. Little did he know that passwords would become the most common method of digital authentication for at least the next sixty years despite proposals to move to other authentication mechanisms like biometrics, hardware tokens, or passkeys [30, 55]. However, these early pass- words wered stored in plaintext, and the very first password theft occurred two years later by simply printing out the system’s password file [28], starting a pattern of credential theft that would continue to the present day. In modern authentication systems, passwords are hashed before being stored to the database. Then when a user attempts to login, the server hashes the submitted password and compares it to the hash stored in the database; if they match, the server allows the user to login. However, hashing passwords does not eliminate the possibility of an attacker stealing or guessing a user’s cre- dentials. If an attacker were to gain access to the database of password hashes, they could attempt to recover the original passwords via brute force by taking a list of common passwords, hashing each of them, and checking if the result- ing hash matches any hash in the database. Indeed, such attackers have created “rainbow tables”—tables mapping a plaintext password to its hash—to make these lookups faster and easier. The introduction of a “salt”—a random string appended to the password before being hashed—makes such offline guessing attacks more difficult and computationally intensive to execute. An attacker may still discover a user’s password, though, via an online guess- ing attack in which they directly query the web service. For example, they might launch a password spraying attack by trying a few very common pass- words against many different usernames. Using a stronger, more complex pass- word makes these attacks more difficult; but the average user today owns 240 online accounts [43], and human brains were not designed to memorize hun- dreds of random strings of text. Thus users often choose easily memorable passwords or reuse the same password across multiple websites. In fact, Das et al. found that 43% of users reuse exactly the same password across different websites [42]. The prevalence of breached credentials only further enables online guessing attacks. When a password database is leaked, an attacker can take a user’s password breached from website A and submit it to website B’s login form; because of users’ tendency to reuse passwords, these attacks—called credential stuffing—are widely observed and highly damaging. In fact, Okta reported that credential stuffing accounts for 34% of login traffic in 2022 [99]. Similarly, an attacker could take a user’s breached password from website A, apply small tweaks to the password, and try to log in with that tweaked password to website B. Pal et al. showed via simulation that these so-called “credential tweaking” attacks can be highly effective, compromising 16% of user accounts in under 1,000 guesses [93]. More insight into the passwords submitted during login attempts would help to improve the design of authentication systems and their defenses against guessing attacks. However, it is difficult to gather data on passwords due to their sensitive nature. To solve this issue, we designed a measurement frame- work called Gossamer that records password-derived measurements on real lo- gin requests [29]. To ensure the security and privacy of users, we performed a 2 simulation to choose a set of safe-to-collect measurements that can give insight into the password without leaking too much information and compromising user security. We deployed Gossamer at two universities over a seven month period and collected data on over 34 million login requests. Next, we designed a clustering approach called Araña that can group lo- gin requests into clusters that are likely to be malicious attack campaigns [64]. In this approach, we used password-derived information to describe behavior resembling credential stuffing, credential tweaking, targeted, and other kinds of password guessing attacks. We tested the approach on Gossamer data and found 29 attack campaigns across the two universities displaying different at- tack behaviors. However, an analyst might prefer to investigate attacks as they occur and, for example, receive a report of attack campaigns at the end of each day. Thus we performed an evaluation of existing timely detection methods [65], as well as three new detection methods we designed based off Directed Anomaly Scoring (DAS), a risk scoring mechanism designed by Ho et al [61]. Since an adapative attacker may try to evade detection, we also consider whether the detection methods are robust to evasion attacks. To do this, we performed a simulation of evasion attacks against the three DAS-based detection methods by applying a series of evasive strategies to the attack campaigns found by Araña and checked whether or not the attack campaign was still detected. We find that evasion attacks are possible even with limited resources and should be considered when designing authentication systems. In summary, modern password-based authentications are vulnerable to dif- ferent types of online password guessing attacks, and existing defenses remain 3 primitive and susceptible to evasion attacks. This dissertation discusses how to design better authentication systems and attack countermeasures using empiri- cal observations on real-world login data. 4 CHAPTER 2 GOSSAMER: SECURELY MEASURING PASSWORD-BASED LOGINS 2.1 Introduction Despite the prevalence of password-based authentication across the internet, little is known about the passwords submitted to login systems. Knowing the characteristics of such login information would help practitioners make better security policies to improve both usability as well as attack detection. A key challenge hindering progress is that passwords are highly sensitive, and as a result prior work has only performed very limited measurements. Two prior works are particularly relevant. Bonneau et al. [30] instrumented Yahoo login servers for 48 hours to learn the distribution of actual user pass- words; but his technique could not record other information about submitted (valid or invalid) passwords, such as the similarity between the successive pass- word submissions. Chatterjee et al. [38] were the first to investigate incorrect password submissions from the viewpoint of a login server. They instrumented Dropbox’s login service for 24 hours to investigate how often users submit a fixed set of easy-to-correct typos; however, their study was limited to only a specific set of typos and does not provide a general framework for analyzing submitted passwords. Thus the question remains: Can we build login measure- ment infrastructure that monitors password submissions but doesn’t endanger users’ security? In this work, we design, build, and deploy a measurement system called Gossamer that securely records login requests, including statistics about sub- mitted passwords. Doing this safely required extreme care, and our main contribution is a holistic approach that combines systems security features, a simulation-based framework to guide selection of password-derived statistics, and procedural safeguards. Ultimately, our initial deployment at two large uni- versities is able to answer, for the first time, basic questions about submitted passwords — such as how often legitimate users are making typos or repeat- edly submitting the same incorrect password, whether attacks are detectable as credential stuffing, and more. Performing such measurements requires jointly analyzing passwords sub- mitted at different points in time. Prior measurement studies computed a (keyed) hash over a correctly submitted password [30] or compared the hashes of a small handful of variants of a submitted password to the real user’s pass- word hash [38]. Neither approach allows inferring whether users are submitting the same password multiple times or how many unique passwords they submit. To enable such measurements, Gossamer’s design uses a two-service logging infrastructure to ensure least privilege. Gossamer has a specialized measurement service that receives a copy of login requests from login servers, processes them by computing password statistics and encrypting submitted usernames, and outputs sanitized logs to a persistent database on a different machine. The mea- surement service, like login servers, has access to plaintext passwords. Thus we designed it to match or exceed the security properties of login servers: It is safe-on-reboot [86] (no sensitive data such as passwords are ever stored on disk), deletes all in-memory data periodically to limit the scope of what would be exposed in the case of a breach, and is administered by the same security staff in charge of login servers. 6 Researchers use a separate analysis service to access the sanitized logs stored in the persistent database. The sanitized logs and analysis service are still treated as sensitive, and cannot be made publically available. To assess the risk to user passwords in the unlikely case of complete exposure of both a login system’s password hash databases plus Gossamer logs, we developed a new simulation-based approach to analyze the speed-up of brute-force cracking at- tacks that attempt to additionally exploit Gossamer logs. For example, simula- tions show that storing raw strength scores (as measured by zxcvbn [123]) can provide up to a 20% increase in cracking efficacy (using up to 109 hash compu- tations), leading us to reduce the granularity of strength scores. Ultimately, our simulations suggest that the best performing attack increases password recov- ery rates using Gossamer logs by less than 2% using 109 queries. To showcase the utility of Gossamer, we worked in close collaboration with two large universities’ information technology (IT) security departments to per- form a measurement study of login behavior. Our measurement study proto- cols, including Gossamer’s design and implementation, went through a thor- ough, multi-step review process that included reviews by the security engineer- ing teams from both universities, representatives of each university’s adminis- tration, and the relevant IRBs. This process culminated in a determination that Gossamer poses minimal risk. We deployed Gossamer for seven months at Uni- versity 1 (U1) and for three months at University 2 (U2). We observed 34 million login requests (combined) for approximately 500 K users who regularly log in to access various university-provided critical online services such as email, course enrollment, and employment information. This enables first-of-their-kind measurements of password usability and se- 7 curity. We saw that 1.9% of valid users at U1 and 4.6% at U2 changed their password in the data collection period. We found that 6.5% of usernames at U1 appearing in public breaches are still using a password that is only a small vari- ant of one of their leaked passwords. This motivates deployment of password breach alerting services that take into account similarity [94]. On the usability front, while the Dropbox study reported that 5% of failed attempts were due to easily correctable typos, our measurements indicate that 65% of failed attempts could be typos (within edit distance two from the actual password), suggesting this is a much larger cause of user frustration than previously imagined. We also report on the rate of login retries, the success and failure rate of app-based two-factor authentication, and the possible adoption of password managers. Fi- nally, we are able to report a few high-volume attacks, with insights enabled by Gossamer to characterize the attacker behavior involved. Summary. In summary, our project is the first to propose a measurement frame- work for passwords that can safely help answer basic questions about password use. Our contributions include: • Design of Gossamer, which combines systems security, simulation-based se- lection of password statistics, and procedural safeguards to enable measure- ment studies of password-based login behavior. • We worked with two large universities’ IT departments to deploy Gossamer for multi-month measurement periods. • We report for the first time on a variety of aspects of password-related us- ability and security, and discuss the implications of these measurements. For example, our measurements motivate the need for password breach alerting, suggest ways to improve lockout mechanisms, and more. 8 Finally, we hope that Gossamer can serve as a platform to help drive future research on improving usability and security of passwords. As such we are re- leasing Gossamer as a public, open source project that may be useful for security researchers both in industry and academia. 2.2 Background and Related Work Current login systems still heavily rely on password-based authentication. Users typically enter their usernames and passwords to a form on a web client, which submits them along with other information relating to the user or ma- chine such as HTTP headers, cookies, IP, and user agent to the login server over HTTPS. The server hashes the password (and a salt), checks if the username and hash pair is present in the login database, and if so, allows the user to log in or prompts for further authentication checks. Otherwise the request fails. Single sign-on (SSO) systems allow a user to log into multiple different web services using the same username and password. When a user accesses a ser- vice, the service provider (SP) redirects the user to obtain a proof of authentica- tion from the identity provider (IdP). The IdP provides the proof immediately if the user has recently authenticated with it, or requires the user to authenticate and provides the proof if the authentication is successful. The OAuth frame- work [12] is a common way to achieve SSO. Looking ahead we perform measurements at two large universities. Both U1 and U2 use SSO with Microsoft Active Directory Federation Service (ADFS). While at U2 all login traffic goes through ADFS, at U1, only a portion of traffic is via ADFS. This is part of the reason we see a lower rate of logins per day at 9 U1 compared to U2 (Section 2.4). Studies about passwords. Prior works [60, 79, 85, 88, 122] have investigated guessability of user-chosen passwords. Most of these rely on breached pass- word data to understand the distribution of user-chosen passwords. Several studies have used Amazon Mechanical Turk (AMT) to understand user choice of passwords [69, 72, 114], under different factors such as password require- ments [72], presence of a password strength meter [114], and use of a password blocklist [57]. As passwords chosen in this environment may not represent passwords on real websites, several studies have inspected real user passwords through client- side, server-side, or offline instrumentation [48, 50, 96]. On the client side, Flo- rencio and Herley [48] installed a Windows Live Toolbar component for five hundred thousand volunteer participants and analyzed their password behav- ior over 85 days. Similarly, Forget et al. [50] created a client-side data collection tool to observe user’s password behavior in its natural environment [50, 96]. Measurement studies with login systems. To our knowledge, three studies have looked at user passwords by instrumenting the login servers. Bonneau et al. [30] instrumented Yahoo’s login servers to receive login requests (including user passwords) and construct histograms of password characteristics based on user demographics. Mazurek et al. [83] correlated password strength with de- mographic information in an offline study with reversibly encrypted passwords on an access-restricted computer. Chatterjee et al. [38] instrumented the login code at Dropbox for failed login attempts to test whether applying a typo cor- 10 rection to the submitted password would have produced the correct password. Open questions. Many open questions remain about the characteristics of the passwords submitted to a login system. For example, how often do users log in from multiple devices, how often do users submit the same incorrect password multiple times, and how often do users submit passwords that are similar to one of their leaked passwords? More importantly, can we collect information about the submitted password that allows analyzing login characteristics with- out degrading the security of user passwords? Such a framework would help practitioners make data-driven login security policies, such as account lockout thresholds, that better balance between usability and security. 2.3 Designing a Secure Measurement Architecture To analyze the passwords submitted to a login system, we need to instrument live login services and monitor login requests, including the submitted user- name and passwords. User passwords are highly sensitive and should never be logged. We therefore designed a secure instrumentation architecture that pre- serves the privacy of login requests while allowing meaningful analysis. We refer to it as Gossamer and deploy it at two login systems used at two univer- sities in the United States. Gossamer is designed in close collaboration with the security engineers at these universities. Below we describe the built-in security considerations in our design and the integration with the existing login infras- tructure at these universities. The architecture. Gossamer enables instrumentation of typical web login servers, such as those used for single sign-on (SSO). An overview of Gossamer’s 11 Login Server Measurement Service Ephemeral Database Requires high security clearance Researchers’ access Persistent Database Analysis Service Login requests Reported Account Compromise Log Encrypted passwords Anon. login data, pw heuristics, compromise logsKey Management Service (KMS) Gossamer Figure 2.1: The main components of Gossamer. architecture appears in Figure 2.1. A lightweight hook is deployed within the login server that, on every received login request, sends a stripped-down copy of the request to our instrumentation infrastructure on a separate, in-network machine. This is done using a separate thread to avoid any noticeable latency impact by the instrumentation on login behavior. A login request includes the username, password, IP address, a subset1 of the HTTP headers, timestamp, lo- gin result (success or failure), and finally an application-specific result code for the login attempt. A measurement service receives this forwarded login information. It is respon- sible for processing the raw login data in a secure manner, converting it into san- itized logs, and storing them in a persistent database. The persistent database can be accessed by analysts (in our case, researchers) via a dedicated analysis ser- vice for understanding user login behaviors. As such, we partition Gossamer’s architecture into two security levels: The lightweight login hook and measure- ment service run at a higher privilege level and are administered by IT security 1The login server removes sensitive cookies specified by the security engineers from the HTTP header, as some content can be more sensitive than user passwords. For example, an “authentication cookie” could bypass MFA requirements. 12 staff; the analysis service is instead at a lower privilege level, accessible by ana- lysts (researchers). We explain more about these two services further below, but first describe our security and design goals. Security properties and design goals. We design Gossamer to resist a variety of attacks. We note that all our network traffic is encrypted using TLS, and as such we do not discuss network adversaries further. Instead, we focus on the threat of complete compromise of each (or both) of the services, as well as the weaker adversarial threat of exposure of logs generated by Gossamer. To protect against these threats, we design Gossamer to conform to four se- curity properties. • Least privilege access to password data. The system must ensure that the analyst receives only the information necessary for analyzing login behaviors, while plaintext passwords remain restricted to particular services. • Bounded-leakage logging. The system persistently logs a small set of statistics about user passwords. The set of statistics is carefully designed to bound the improvement in guessing attacks against user passwords, even in the case of complete compromise of the analysis service. • Periodic deletion. The system should expunge all raw, sensitive data older than 24 hours to reduce the exposure of any data should the system get compro- mised. • Safe-on-reboot. Finally, we must ensure that all sensitive data from raw HTTP requests is destroyed on reboot. This property was first introduced in [86] for the Bunker secure network tracing system. 13 We will return to these properties as we elaborate on the details of the architec- ture. Security considerations in Gossamer. As mentioned, Gossamer uses two ser- vices running at different privilege levels on two different machines — a mea- surement service for processing the raw login data and storing the anonymized statistics of user logins into a secure, persistent relational database, and an anal- ysis service for analyzing the data from the persistent database. Separating mea- surement from analysis enables us to maintain the same privilege requirements for access to password data as there are without Gossamer. The high-level archi- tecture diagram of Gossamer including privilege levels is shown in Figure 2.1. The measurement service runs on a heavily access-restricted machine that receives a copy of the login request over an encrypted channel from the login servers. The service then computes measurements over the submitted pass- word and stores them in the persistent database. Some interesting statistics re- quire the plaintext submitted password across multiple requests — for example, the number of unique passwords submitted by a single user or from a single IP address. Therefore, the measurement service stores passwords encrypted using an in-memory key in the ephemeral database. The key is stored only in memory and is automatically replaced with a new key every day at midnight local time. The ephemeral database is placed in a memory-based file system, such as the /tmp directory. The key rotation cryptographically erases the data stored in the ephemeral database every 24 hours.2 If the measurement service is killed or the device is rebooted, all ephemeral data is effectively deleted. This ensures our 2Key rotation at midnight deletes the data received, say, an hour before midnight, limiting our ability to correlate between passwords received before and after midnight. It is an open question how to design an efficient key-rotation technique that will allow secure deletion with- out requiring storing linear number keys in memory. 14 periodic deletion and safe-on-reboot requirements. The ephemeral database allows us to calculate a number of measurements referencing the passwords submitted across multiple logins. These measure- ments (given in Section 2.4) allow us to characterize user behavior and could help in building attack detection mechanisms. The measurement result is stored in a persistent database outside the privilege boundary, where it can be accessed by the researchers. This database is placed in a disk-encrypted volume, provid- ing another layer of protection in case the volume is backed up to an unpro- tected machine or is compromised. The key to the encrypted volume is only known to a subset of researchers, as the IT security engineers did not need it . Protecting user privacy. To protect the privacy of users, Gossamer anonymizes the usernames before storing in the persistent database by encrypting them us- ing a deterministic encryption scheme [58]. The encryption key is only accessi- ble from within the measurement service. The deterministic nature of encryp- tion allows us to cross-reference the logins against a username without knowing the actual username, while also allowing us to report compromised usernames to the security engineers, should we discover any. We do not record any per- sonally identifiable information about the user, including their real name, affili- ation, or account type (such as student, faculty, or staff). We do record the source IP address for requests, which is needed to analyze client and attack behaviors. Of course re-identification attacks [89, 108] may be possible given access to these logs, and for this reason alone logs are not suitable for public release. We have strict policies against re-identification for the limited set of researchers who access the analysis service. All analysis is performed on the analysis server with encrypted usernames, and only summary statistics leave the analysis service. 15 Both the persistent and ephemeral databases are instantiated as MySQL databases. The measurement service is a Python Flask application running on an Apache server. We use the Python Fernet [7] library to encrypt user pass- words in the ephemeral database using AES-256, and we use the Python Mis- creant [8] library to deterministically encrypt and decrypt the usernames using AES-SIV. Integration with other data sources. Looking ahead, in both of our deploy- ments there are other relevant data sources available that we would like to ana- lyze. Namely, it is common for organizations to have a database that contains re- ports about potentially compromised accounts which can provide insights into what attacks are (not) being caught by current security mechanisms. At the universities we worked with, these reports are generated when a user alerts IT security to a compromise, or because existing alert generation mechanisms (in- cluding third-party breach alerting services the universities subscribe to) flag an account. In both cases, an IT analyst manually inspects existing logs to attempt to determine if a compromise occurred. To leverage these compromise reports, we add to the measurement service the ability to accept such logs, anonymize them by encrypting all usernames (using the same key as above), and transfer the resulting data to the persistent database. A similar approach can be used in other deployments of Gossamer to incorporate other relevant data sources, such as logs from MFA services. Process of designing and deploying Gossamer. The process of designing a se- cure architecture to measure information about passwords, getting all the nec- essary approvals, and deploying the Gossamer framework to production was long and involved. 16 We met with the engineers in the IT departments multiple times and went through several iterations to agree on a design that would work with their sys- tems and meet a level of security that would receive approval from the IT secu- rity directors. Every element was carefully thought out across several dimen- sions — such as who should have access to it, what operating system or soft- ware version should be used, and what extra protections can we put in place to ensure that it is locked down as much as possible. Once we were satisfied with the design, we worked with the Chief Informa- tion Security Officer and the Associate Chief Information Security Officer to get their feedback and approvals. Then we wrote and submitted an application for IRB; and after several rounds of emails, the IRB representative determined that our research does not meet the definition of Human Subjects Research because we were working with deidentified data, and she granted us IRB exemption. Finally, we wrote the code, set up the infrastructure needed within the IT security network, and tested our measurement service using a dummy SSO ser- vice sending fake data. 2.4 Password-Derived Measurements and Security Analysis Gossamer enables analyses based on the passwords submitted during login, which will improve our ability to characterize user login behaviors. Passwords, however, cannot be made available to analysts for security reasons. We therefore design Gossamer to only collect limited, useful statistics about the submitted passwords without storing passwords persistently. However, even just statistics computed over user passwords could leak information about the password, so 17 care must be taken on which statistics are persistently stored and made available to the analysts. In this section, we discuss how we assessed the security implications of the Gossamer logs storing different kinds of password-derived measurements. 2.4.1 Password-derived measurements To assess risks related to password-derived measurements, we adopt an itera- tive, simulation-based methodology. We consider a potential logging schema, namely the set of measurements that we log about login attempts. For a can- didate schema, we perform a simulation-based risk analysis that consists of (1) defining a threat model including log exposure; (2) determining a baseline at- tack that does not exploit exposed logs; (2) developing a log-exploiting attack that incorporates information leaked via fields from the candidate schema; and (3) running simulations using leaked passwords to assess the increased success rate of the log-exploiting attack over the baseline. This allows quantifying risk, and if it is too high we adjust the logging schema and repeat the process until we are satisfied that risk is relatively low. First we identify potential fields to include in a schema. Figure A.1 lists the fields we ultimately utilize in Gossamer. We also considered several other fields that we eventually discarded as too risky, as we now explore. To understand password strength, we consider including a zxcvbn strength score [123], and whether or not it belongs to one of the popular password guess- ing lists (the most frequent 5,000 passwords in RockYou [10] or the top 5,000 18 passwords generated3 by Hashcat [106]). To understand how often breached passwords are submitted, we consider marking passwords as being in well-known, public breaches, which makes those users vulnerable to credential stuffing attacks. For this we use a dataset of 1.3 billion breached username-password pairs [120] and the Compilation of Many Breaches (COMB) containing 3.2 billion pairs [103] released in February 2021. For each attempt, we logged whether the username, the password, or the username-password pair appeared in this breach dataset. We also consider vul- nerability to credential tweaking attacks [42, 93, 119] that target passwords sim- ilar to a user’s other breached passwords. We therefore consider recording the submitted password’s edit distance, PPSM similarity [93], and Pass2Path simi- larity [93] from each breached password for the given username, if it is present in the breach. We also consider logging the edit distance of the submitted pass- word from previous passwords submitted for that username and IP in the same 24 hour window, which would shed light on whether users are making typos or submitting distinct passwords, and what types of password-guessing strategies attackers employ. 2.4.2 Security analysis of measurements We now turn to making risk assessments about candidate measurements schemas and, in particular, how exposure of Gossamer logs using a candidate 3We use the rule list best_64.rule [11] (a rule list compiled by the Hashcat community of what are considered to be the best 64 rules) with RockYou to generate the guesses. 19 schema can be exploited to improve password guessing attacks. Threat models. As discussed in the last section, we designed Gossamer and our deployment procedures to limit the risk of illicit access to Gossamer logs, but the principle of defense-in-depth suggests that we consider when these mecha- nisms and procedures fail. For example, an insider attacker could leak the logs to the public internet, or a smash-and-grab attack could somehow compromise the analysis service and exfiltrate the logs. We therefore consider two threat models. Both threat models assume the attacker obtains a copy of Gossamer logs, can re-identify4 usernames within the dataset, and seeks to infer the password associated to some particular username. In the first threat model, the attacker can mount an online guessing attack by querying the login service. In the second threat model, the attacker is assumed to additionally have access to some salted hash of the password and so can perform an offline guessing attack. The only difference between the two threat models for our purposes is the expected guessing budget q, which would be on the order of tens to thousands in the online case and hundreds of millions in the offline case. Looking ahead, it will also be material whether or not the targeted username appears in an attacker-known password breach. If not, the best strategy is for the attacker to modify a target-agnostic password guessing attack (e.g., a dictio- nary attack) to incorporate relevant information leaked via the log file. In the targeted guessing threat model, the username appears in data breaches known to the attacker, and so the best strategy is to modify a targeted password guess- 4Recall that the persistent database contains masked usernames, and it is not exactly clear how attackers would re-identify in this setting. Nevertheless, we conservatively assume that re-identification is perfect. 20 ing attack (e.g., [93]) to incorporate relevant information leaked via the log file. We detail particular attacks more below. In each threat model we consider also a baseline attack (specified below) which performs either targeted or untargeted guessing without exploiting the log files. Dataset for simulations. The simulations discussed below are based on the breach data used in prior work [76, 93] containing 1.3 billion username- password pairs. There are 370 million unique passwords between length 6 and 30, associated with 1.12 billion usernames. We removed passwords shorter than 6 characters and longer than 30 characters as done in [76, 93]. We split this data so that the attacker has access to 80% to inform a guess list, and we randomly sampled 10,000 passwords with replacement from the remaining 20% as target user passwords that the attacker is trying to guess. Password strength measurements. We first focus on four of the password- derived measurements: (1) whether the password is in the top 5,000 RockYou passwords (RY), (2) whether it’s in the first 5,000 Hashcat-generated passwords (HC), (3) the binary zxcvbn (ZB) score of the password that we explain below, and (4) the raw zxcvbn [123] score of the password for comparison. By default, the zxcvbn password strength meter returns a password strength between 0 and 4. We hypothesized that this level of granularity would leak too much infor- mation about the password and speed up a guessing attack. Therefore, we also consider a modified zxcvbn score, where we record 0 if the zxcvbn score is 0 and 1 otherwise; we refer to this as the binary zxcvbn score (ZB). We also consider a combined measurement of all three measurements described above (except the 21 100 101 102 103 104 105 106 107 108 109 0 5 10 15 20 # of guesses q (log scale) ∆ q (% ) Hashcat (HC) RockYou (RY) zxcvbn Binary zxcvbn score (ZB) HC+RY+ZB Figure 2.2: Relative improvement in password guessing success rate (∆q) due to access to password-derived statistics from Gossamer over the baseline in q ≤ 108 guesses. original zxcvbn score) — that is, whether the values of all three measurements match between two given passwords. Each of these password-derived measurements can be represented as a func- tion M(w) that outputs the result of the measurement as a boolean, an integer, or a tuple. Given a measurement value m = M(w̃) about a randomly chosen tar- get password w̃, the attacker wins if they can guess the target password within q guesses. This is also called the q-success rate (λq). We measure λq for dif- ferent values of q. An attacker, given a measurement m = M(w̃), can filter its list of guesses W to only the passwords w ∈ W that match the measurement, i.e., m = M(w). We let λM q be the success rate of this attack. The baseline success rate λ0 q is given by the recovery rate of the attack that simply queries the attacker guess list in descending frequency order (without filtering). Thus, we measure the increase in attacker success as ∆q(M) = λM q − λ0 q. The results of our simulations are shown in Figure 2.2. We first discuss the online context, where q ≤ 1000. Among different measurements, revealing the zxcvbn score provides the largest improvement in attack efficacy, enabling an attacker to guess 1.6% more passwords in less than 1,000 guesses, compared to 22 the baseline (of 1.8% passwords). The binary zxcvbn score reduces the guessing advantage by a modest amount. Overall, all three measurements combined — RockYou top 5,000, Hashcat top 5,000, and the binary zxcvbn score — would enable an attacker to guess ∆103 = 3.1% more passwords compared to not having access to the password-derived measurements. Note that this is without any password policy (except the minimum length requirement of 6 characters). We also check the success rates for passwords containing at least three char- acter classes (uppercase, lowercase, digits, and symbols), a common require- ment. Both universities have three character classes as part of the password policy for current students, staff, and faculty, although old alumni accounts may not satisfy this requirement. When looking at passwords that meet this pol- icy, only 0.8% of passwords are guessed within the first 1,000 guesses without any password-derived information, and the combination of password-derived fields brings this percentage up to just 1.3% (∆103 = 0.5%). Next we discuss the success rate of an attacker for a large guessing budget q ∈ [103, 109]. As before, the zxcvbn score can be damaging to the privacy of user passwords, resulting in as high as a 20% increase in attacker success. The binary zxcvbn score provides less information and never leads to more than a 2% increase in attacker success even with a very high number of guesses. Com- bined measurements also lead to a bounded increase in the fraction of pass- words cracked by the attacker, who can guess at most 4% more passwords with the measurement information in an untargeted attack compared to an attacker without the information. More importantly, most of the improvement occurs for passwords guessed in less than 10,000 guesses because the statistics allow an at- tacker to rule out many popular passwords when guessing already vulnerable, 23 weak passwords. Thus we conclude that while including full zxcvbn would be too risky, the increase in attacker success when given the other password strength measures (with the binary zxcvbn score) is sufficiently small. Edit distance measurements. Other measurements we consider are the edit distance from breached passwords for the appropriate username and the edit distance from other submitted passwords, including the correct ones. Early ver- sions of Gossamer recorded the precise edit distance; we instead now suggest quantizing to just indicate whether a submitted password is within edit distance two of a breached password or other submitted password. Our current imple- mentation does so, and we quantized the data in previously gathered logs. To come to this conclusion, we observe that an attacker in our threat model can check whether a username is in a breach. (Recall that we conservatively as- sume the attacker can perfectly reidentify usernames and that they have access to all the breach data used by the measurement system). If the username is not in a breach, then the attacker can proceed as above through a general guess list. If it is, then the attacker can mount a targeted credential tweaking attack in the following way. They start by generating guesses using the state-of-the-art cre- dential tweaking attack based on pass2path [93], seeding it with the passwords in the breach for the targeted username. They can then use the edit distance fields in the log data to filter this guess list by removing any guesses that are the incorrect edit distance from the breached passwords, or not within the ap- propriate quantized edit distance from the breached password, depending on which schema option we are evaluating. 24 To assess the improvement in attacks, we compare the modified attack to a baseline one that just applies pass2path. For simulations, we randomly selected 10,000 username-password pairs from the 20% test data described above, but conditioned on the usernames also appearing in the 80% leaked dataset. We exclude cases where the password is the same on both sides of the split (such passwords would be easily guessed via credential stuffing). For each target username-password pair, we give the attacker the edit distance of the target password from all the breached passwords associated with that username. Then λed q is the success rate for pass2path with guesses filtered to only those that have matching edit distance information. The baseline attacker’s success λ0 q is vanilla pass2path’s success rate. The baseline success rate in the online setting is λ0 103 = 21.6%. With precise edit distances knowledge, the attacker can instead achieve λed 103 = 85.2%. This is an uncomfortably large jump in attacker’s success. Quantizing to just edit distance ≤ 2 yields a 22.5% success rate, just a 0.9% increase over the baseline. For larger query budgets (relevant in offline cracking attacks), the improvement for quantized edit distance is even less — with a 0.25% increase over the baseline for q = 108. Discussion. Our simulations suggest that including even moderately granular data such as zxcvbn scores or edit distances in log files might be a risk factor in the case that persistant logs are somehow leaked to adversaries. Therefore we suggest a conservative approach and select logging schemas that avoid im- proving guessing attacks significantly. contribution of this work, as it allows reasoning in a structured way about risk of password-derived fields. One current limitation of the framework is that it focuses thus far on attacks 25 against a single user, and so we do not yet know how best to assess the risk of measures capturing similarity of passwords across usernames. Future work could look at extending the framework to look at multi-user attacks. Another limitation is that we rely on best-known attacks (such as pass2path), and as such future work could yield improved attacks. It is therefore important to retain the ability to sanitize or delete older logs should new results surface previously unforeseen risks. A full list of measurements logged by Gossamer can be found in Figure A.1. 2.5 Deploying Gossamer We partnered with security engineers at two large universities to deploy Gossamer and collect data, beginning in December 2020. We collected data for seven months at U1 and three months at U2 (a shorter timeframe due to the preferences of the IT department) . This timeframe encompassed mid-semester, exam, and break periods, so we were able to observe different levels of activi- ties. Throughout this timeframe, we occasionally made updates to some of the measurement mechanisms; these updates were done after careful review of the code by pulling from a git repository accessible to the virtual machine running the measurement service, and any summary statistic that may be affected an update was calculated using the data after the update was made . Review process and ethical considerations. Although our research could help understand characteristics of password submissions received by login systems, we must also consider the privacy and security risks of inadvertent exposure 26 0 0.5 1 Lo gi n at te m pt s pe r da y (× 10 5 ) U1 2020-12-06 2020-12-26 2021-01-15 2021-02-04 2021-02-24 2021-03-16 2021-04-05 2021-04-25 2021-05-15 2021-06-04 2021-06-240 2 4 Failures Successes U2 Figure 2.3: Successful and failed login attempts per day at two universities (for a total of 196 K unique users at U1 and 309 K at U2). Potential high-volume attack campaigns we discovered are also shown. of the sensitive information in the logs. Our design of Gossamer, therefore, strives to balance these risks with the benefits of protecting user accounts. We conducted a nearly year-long design and implementation effort that entailed a number of external review processes to help guide our research study and reduce potential privacy and security risks to users. As noted in Section 2.3, Gossamer logs do not include PII. Researchers (with one exception mentioned below for compromise reporting) do not have access to usernames or email addresses. They do have access to IP addresses from where login requests originate. To protect sensitive data, Gossamer uses a layered approach to security with an encrypted disk, strict firewall rules, and MFA login to access the analysis ser- vice. Moreover, Gossamer never stores plaintext passwords on disk; it instead stores a set of password-derived measurements in the persistent database for analysis. Even in the case that these are somehow leaked the chosen measure- ments represent little additional risk of password disclosure (see Section 2.4). The research group had ground rules for handling the data, including min- imizing granularity of information shared outside the confines of analysis sys- 27 tems, restricting persistent database access to only a subset of researchers, and setting clear expectations about (in)appropriate use of access (e.g., prohibiting re-identification attacks attempting to identify a user from the obfuscated data). We also went through a careful vetting and approval process with university leadership and their information technology (IT) security departments. This in- volved presenting to the university leaderships about the goals, design, and procedures associated with the measurement studies, and working closely with our universities’ security engineers to design and implement Gossamer. Sat- isfied with our process and the potential benefits to university account hold- ers, we received approval from university leadership for the deployment of Gossamer. Before deploying the system, we submitted our study designs for IRB review at each university. The study protocols at the two universities were slightly dif- ferent in how we report to IT security compromised user accounts. At U1, re- searchers never had access to plaintext usernames, and the security engineers handled the decryption of reported (encrypted) usernames. Therefore, we re- ceived an IRB exemption at U1, which found that the research study does not qualify as human subjects research. At U2, security engineers requested that we report the plaintext usernames for operational simplicity. One researcher de- crypts the username before reporting compromises. Therefore, we received IRB approval at U2, finding the study as a minimal risk human subject research. We did not seek consent from individual users, as we do not know their usernames or email addresses. Instead we obtained explicit approval for conducting this study from the universities’ leadership and IT departments who provide the lo- gin services. Such waivers of explicit consent from participants were used in 28 prior work (e.g., [66]) and are discussed as an acceptable approach in the Menlo report [70]. Deployment configuration. As mentioned in Section 2.3, Gossamer consists of two services — a measurement service and an analysis service. For U1, we used Amazon EC2 in a virtual private network to host the measurement ser- vice as recommended by the U1 security engineers, and we used an on-premise dedicated server for analysis. For U2, we used two separate on-premise virtual machines for running the measurement service and the analysis service. The persistent storage is hosted inside the VM running the analysis. Strict firewalls were set up for all machines (on-premise or EC2) that block all incoming and outgoing requests except from inside the private network of the respective universities. Only a subset of the researchers have access to these machines via SSH, and the access requires two-factor authentication. All incom- ing and outgoing network connections are logged by the firewall and regularly checked by the security engineers for signs of intrusion attempts. Relevant se- curity patches are checked regularly and applied immediately. The persistent storage uses a MySQL database, which uses TLS for all communication and the underlying disk is encrypted at rest. The login server and the measurement service also use TLS with pinned certificates [47] for all communications. All passwords used are longer than 12 characters, randomly generated, and stored in a password manager. The security engineers also ran scans on the code of Gossamer and the VMs to check for known vulnerabilities. 29 U1 U2 Session Statistics (with a 360s threshold) Avg. session size 2.25 2.21 99th percentile session size 10 6 % abandoned sessions 5.47% 1.96% User Statistics # of unique usernames seen 196,424 309,801 # of valid users 177,286 169,774 # of active users 130,695 110,476 % valid users w/ weak passwords 0.03% 0.06% % valid users w/ username in breach† 5.79% 3.27% % valid users w/ passwords in breach† 2.92% 9.34% % valid users w/ user-pw pair in breach† 0.01% 0.15% % valid users w/ tweaked password 1.22% 0.66% % valid users who may be using password managers 24.76% 27.34% Avg. devices per user per day 1.51 1.91 Avg. devicesper user (over whole time period) 14.51 14.97 Avg. num unique passwords per user 1.96 9.59 Login Statistics Avg. Login requests per day 49,302 246,274 Avg. # of submitted usernames per day 24,822 61,798 % of requests succeeded 94.99% 92.35% Avg. # requests per day per user 1.99 2.05 % of requests from mobile device 31.00% 35.57% % IPs from VPNs, proxies, or Tor nodes 22.08% 4.91% Submitted password statistics % req. w/ password in breach† 2.71% 0.10% % req. w/ user-pwd pair in breach† 0.07% 0.01% % failed req. containing a typo 29.67% 12.04% % failed req. (with edit dist msmt) containing a typo 62.39% 58.37% % failed req. from mobile device containing a typo 38.63% 38.36% % failed req. (with edit dist msmt) from mobile device containing a typo 72.69% 81.87% % pwds tweaked 0.92% 0.34% † Statistics related to breach data were calculated for data beginning Jan 27 ’21 after we added more breach data to the instrumentation. Figure 2.4: Summary statistics of login requests recorded by Gossamer at U1 and U2. More statistics can be found in Figure A.3. 2.6 Login Statistics, Patterns, & Observations We collected data at U1 for seven months and at U2 for three months starting in December 2020. Overall, we observed 10 million requests at U1 and 24 million 30 requests at U2. We show the daily successful and failed requests in Figure 3.7. On average, 5–8% of requests failed; however, on a few days we observed a spike in the failure rate (> 50%). These were high volume attacks that we dis- cuss below. After removing the noise caused by these attacks, we found that users submitted login requests on average 1.99 times per day at U1 and 2.05 times per day at U2. We see fewer login requests at U1 because some login re- quests are handled via a different authentication server that is outside of our instrumentation. We saw 196 K unique usernames submitted to U1, out of which 177 K were valid usernames, of which 154 K users had a successful login at least once dur- ing our instrumentation period. At U2, we saw 170 K users with at least one successful login during our instrumentation period and 15 K additional users who tried to log in with a valid username but could not complete login due to errors. We consider a user active if they have at least one successful login every month. We found about 130 K active users at U1 and 110 K at U2. Requests originated from 526 K unique IP addresses (approximately 88 K per month) at U1 and 513 K (171 K per month) at U2, and they were associated with 44 K unique user agents at U1 and 31 K at U2. We also used the GeoIP2 database published by MaxMind [80] to find location information for IPs; the majority of login requests at both schools originated from within the United States, fol- lowed by China and India. Summary statistics that we report on these login requests can be found in Figure 2.4, with additional statistics in Figure A.3. 31 2.6.1 Characteristics of high volume attacks We observed three high volume attacks during our instrumentation. Since we are focusing on understanding the full picture of user behavior, we first report on these attacks and then remove them from the dataset to avoid skewing other statistics we report. In total, we removed 54 K requests at U1 and 81 K requests at U2. Attack 1: Naïve, multiple-IP, high-volume credential stuffing attack campaign at U1. Over January 25–26, 2021, four IPs conducted a credential stuffing campaign consisting of 36 K attempts to 19 K users. Two of these IPs were identified by the MaxMind GeoIP database [81] as coming from NordVPN, one from Microsoft, and one from Inwi Mobile [5]. All IPs were active in non-overlapping time pe- riods and submitted up to 100 requests per second. More than 99% of requests from these IPs in this time frame had null user agents. Almost a third of the attempts (29%) of submitted username-password pairs from these IPs were di- rectly from prior breaches, and 60% of submitted passwords were present in prior breaches. The attack campaign successfully compromised 23 accounts at U1, all of which had been flagged by security engineers and had their passwords scram- bled to prevent access. We observed some duplicate username-password pairs submitted across IPs; thus we hypothesize that the attacker used an automated script that iterates through an unfiltered list of breach data with a variety of IPs. Attack 2: Credential stuffing attack using Sentry MBA tool against U1. An IP hosted in Google Cloud [4] executed another credential stuffing attack at U1 on March 14th, 2021 from 18:42 - 22:22 UTC. This IP submitted over 17 K re- 32 quests to 15 K unique usernames, submitting approximately 80 requests per second. Of these attempts, 22% of the submitted username-password pairs and 56% of the submitted passwords were directly from the breach data we used with Gossamer. The attacker successfully guessed the passwords for 14 users. Among those, 13 were already recorded as compromised by security engineers; we reported the remaining encrypted username to the security engineers as a potentially compromised account. Although both Attacks 1 and 2 were credential stuffing campaigns, we sus- pect that the respective attackers were using two different sets of breach data, as there was little overlap in the users targeted. We noticed that the attack traffic in this campaign was evenly split between five distinct user agents that were not present in the rest of our data; these five user agents are the default user agents for a tool called Sentry MBA [110]. Sentry MBA is a credential stuffing tool where the user can specify a list of usernames and passwords, a config file for specifying the HTML fields on the target login page, and a list of proxies from which to send traffic. Attack 3: High volume, password spraying attack at U2. On December 22nd, 2020, a total of 12 unique IPs belonging to Digital Ocean Cloud [3] carried out a high volume password spraying attack by targeting 76 K unique users with 169 K re- quests at an average of 262 requests per minute. The attacker pretended to send requests from SMTP and IMAP clients. The number of usernames targeted by each IP was evenly distributed among the IPs, and these IPs were active only during the attack period. Less than 3% of submitted passwords were from prior breaches, and none of the submitted usernames were present in prior breaches. We also noticed that all of the login attempts were for non-U2 usernames, in- 33 dicating that the breach data was not filtered before the attack. Consequently during this attack campaign, there were no successful logins. Filtering out attacks. We remove requests from IP addresses corresponding to these three attacks on the respective days for all subsequent statistics to avoid skewing the statistics. Although we did have access to compromised user- names, there was no clear way to determine which IP address compromised a given user. In Section A.0.2, we show why excluding all IP addresses that con- tacted a compromised user would not have significantly affected the statistics. This filtering approach works for our setup but may not generalize to other sys- tems. Similarly, although there could be other low-volume attacks that we did not detect, we believe they will not impact the statistics we report. 2.6.2 User and client statistics Gossamer observed login attempts for 196 K unique usernames at U1. Among the users who could never login, 42 K (21%) of them used a username that does not exist in the U1 login database; however, only 0.1% of these usernames ap- peared in our breach data. At U2, we saw 310 K users who tried to login, 170 K (55%) that were successful, and a staggering 139 K (45%) usernames that do not exist. We are not sure what caused such a high volume of login submissions with invalid usernames. More summary statistics on these login requests can be found in Figure 2.4 and are elaborated in more detail below. IP and client characteristics. Requests originated from 539 K unique IP ad- 34 dresses at U1 and 2.47 M at U2. There were 44 K IPs that sent requests to both universities; of these, 621 IPs had no successful logins. We also recorded the user agent strings present in the HTTP header of the login requests. We observed about 5 K unique user agent strings at U1, and about 31 K at U2. The top 10 user agents at each school are listed in Figure A.4. In 22% of requests at U2, the user agent string was set to an empty string; all of these requests were through basic authentication [51]. U1 does not support basic authentication, and less than 1% of the requests had an empty user agent field. We suspect that attempts with an empty user agent field were submitted via an automated script that neglected to set the user agent field. A breakdown of operating systems mentioned in the user agent string appears in Figure A.2. Prior work [52] has suggested that attempts from multiple devices for a user is suspicious and should require additional authentication steps. Freeman et al. defined a device as a pair of a unique IP address and a user agent [52]. We observed that on average, 3% of users at U1 and 19% of users at U2 log in from two different devices per day. 11% of users at U1 and 6% of users at U2 have logged from more than fifty devices over the course of the study period. To find what fraction of IPs were public VPNs, proxies, or Tor exit nodes, we used the Blackbox5 API. We found that 22% of IPs at U1 were VPNs, proxies, or Tor exit nodes. However, at U1, 16% of all IPs are 10-space IPs, all of which are marked as VPNs/proxies, contributing to this high percentage. These IP ad- dresses are set up by the university IT department and are not accessible outside of the university network. At U2 about 5% IPs were flagged as VPN/proxies by Blackbox API, and 3.6% of all IPs are 10-space IP addresses. Because users may 5https://blackbox.ipinfo.app/ 35 be sharing a VPN or proxy network, it is possible that some users may have the same IP; thus we report most statistics based on device, since it is less likely two different users would also have the same user agent. The IP address of a user’s device might change over time due to DHCP churn or switching between multiple networks. Therefore a login from the same de- vice and browser may appear as if it’s from multiple different devices, accord- ing to the previous definition. So, we conservatively estimate the number of different browsers per user based on the user agent string. At U1, 21% of valid users, and at U2 29%, successfully logged in with ten or more user agents. Thus, many users log in from different browsers, and login security mechanisms must consider this while designing policies. We were also interested in determining what percentage of users logged in on their mobile device (smartphone or tablet) versus on a laptop or desktop computer. To do this, we used a regular expression that matched mobile de- vices on the user agent. We found that at U1, 31% of requests originated from a mobile device; out of those, 84% of requests originated from iOS devices and 16% from Android devices. At U2, 18% of requests (80% of which were iOS and 20% Android) came from a mobile device. At both schools, iOS devices were significantly more popular. These findings may be useful in developing attack detection mechanisms. For example, if a user always logs in from an Android device, it may be suspi- cious if they attempt to log in from an iOS device. 36 100 101 102 0 0.2 0.4 0.6 0.8 1 Number of unique passwords per day Fr ac ti on of re qu es ts Per IP address (U1) Per username (U1) Per username (U2) Figure 2.5: Cumulative distributions of unique passwords per username and IP for February 2021. The X-axis is log scale. 2.6.3 Password security The strength and guessability of a password directly affect the security of a user’s account. Using the visibility into passwords provided by Gossamer, we investigate password security in terms of strength, the number of unique pass- words submitted for a username, and the use of breached credentials. Password strength. We used four different measurements in Gossamer to mea- sure password strength. We record the bucketized zxcvbn score, as described in Section 2.4; whether the password appears in the top 5k most common RockYou passwords; whether the password appears in the top 5k passwords generated by Hashcat on the RockYou dataset with the best64 ruleset [11]; and finally, whether the password appears in the top 1000 most common passwords in our breach compilation dataset. Exact percentages of requests matching each of these password strength met- rics can be found in Appendix ?? Figure A.3. In summary, we found that the vast majority of valid users were using strong passwords by these metrics. Both universities use strong password policies, requiring a minimum length of 8 and 37 at U1 three different character classes; so this finding is not surprising. Unique passwords. We also measure the number of unique passwords sub- mitted for a given username or from a given IP address on a certain day. We observe that a median of one and a 99th percentile of seven unique pass- words are submitted against a single username per day. Slightly more unique passwords — a median of one and 99th percentile of nine — are submitted from a single IP address at U1 (unfortunately we do not have this measurement at U2 because of a configuration issue) . Both distributions are shown in Figure 2.5. More unique passwords are submitted from a single IP address than for a sin- gle username, which makes sense because an IP address may submit to multiple different users, especially if it is a VPN/proxy. We noted earlier that some organizations may lock accounts that receive a certain number of failed attempts in a given time period. However, these lock- out mechanisms do not take into account whether a user submitted the same password multiple times. Creating a lockout threshold based on the number of unique passwords tried instead of the total number of attempts would improve usability without any improvement in attacker success rates. To demonstrate this, we consider a lockout threshold of 10 and calculate the number of accounts that would have been locked by counting the number of attempts on a given day instead of the number of unique passwords. We find that 17,863 accounts would have been locked under the simple policy, versus just 2,220 accounts under the policy that counts by unique passwords — an 88% decrease. Similarly for a lockout threshold of 5, implementing the more complex policy would decrease the number of lockouts by 91% from 280,360 to 23,919. 38 Such a lockout policy could be implemented relatively simply by storing the password hashes submitted for a given user for the last day. Then a lockout pol- icy would check the number of unique hashes submitted in the designated time period. This new policy would improve usability with only a slight increase in implementation complexity and no benefit to a potential attacker. Breached credential use. To measure the usage of breached credentials, Gossamer records for each attempt whether the username, password, or username-password pair appeared in our breach dataset. Usernames were stripped of a domain name, if applicable, before performing the match. We find that nearly 6% of valid users at U1 and 3% of valid users at U2 appear in our breach dataset, indicating that they have appeared in some data breach in the past. Additionally, we find that 3% of submitted passwords at U1 and 0.1% at U2 appeared in a breach; finally, 0.07 % of username-password pairs at U1 and 0.01% at U2 appeared in a breach. Most of these were failed attempts, but we find that 23 users (0.01% of valid users) at U1 and 254 users at U2 (0.15% of valid users) were still using a breached password as their actual password. These users are vulnerable to credential stuffing attacks, and this motivates the deployment of breach alerting tools in login systems that would prevent users from continuing to use breached passwords. For each attempt, we also check for tweaked passwords by recording the similarity of the submitted password to a breached password using three met- rics — edit distance, PPSM similarity, and pass2path rank [93]. We define a pass- word as being tweaked if the edit distance is less than or equal to two, the PPSM similarity is zero (indicating that they are similar), or the pass2path rank is less than or equal to 1,000. We found that at U1, 0.92% of all passwords submit- 39 ted were tweaked, and 2,164 users (1.22% of valid users) were using a tweaked password. At U2, 0.34% of passwords submitted were tweaked, and 1,125 users (0.66% of valid users) were using a tweaked password. However, due to the design of the system, we can only determine whether a user has a tweaked password if they had a previously breached password in our dataset to which we can compare. When we take this into account, we find that nearly 7% of users at both universities with at least one breached password were using a tweaked password. A recent study by Pal et al. [93] reported a slightly higher rate of 8.4%. We hypothesize that users may append a single character to their old, breached password, causing them to be vulnerable to credential tweaking at- tacks, in which an attacker tries close variants of breached passwords in an at- tempt to guess user passwords. Implementing a breach alerting system such as Might I Get Pwned [94] or using a personalized password strength meter [93] in the password change flow could alert users when they attempt to change their password to one that is vulnerable to credential tweaking attacks. Password changes. Neither of the universities have any periodic password change policies. At U2, users are recommended to change their passwords twice a year, but this is not enforced. New passwords are recommended to not be the same or similar to any previous passwords, but this is not enforced either. Al- though we did not instrument the password change system, we are able to estimate a subset of the password changes made using the password’s edit dis- tance from the previous submissions for a user logged by Gossamer. Because previous submissions are cleared every 24 hours, though, we can only use this measurement in instances when the user logged in with their old and new pass- 40 word on the same day. We find 9,011 total password change events made by 2,893 unique users — 78 of which appeared in the compromise database — at U1. At U2, we saw 42,827 total password change events made by 7,812 unique users — 211 of which appeared in the compromise database. Of the password change events at U1, 100 resulted in a new password that was in our breach dataset, and 125 resulted in a new password that was a tweaked version of one in our breach dataset. At U1, at least 1.9% of valid users changed their pass- word at one point during the seven month measurement period (about 1.3 K per month) and at U2, at least 4.6% (14 K per month). This is only a lower bound, since we can only measure a fraction of password changes, and prior work is consistent with this bound, reporting higher password change rates [30, 48]. 2.6.4 Usability A longstanding complaint about passwords is their usability. Users often have trouble remembering strong passwords, and they may commit typos especially if they do choose a stronger, more complex password [38, 104, 105]. A key benefit of Gossamer is that it provides a new observation point for measuring password-based login usability. Login sessions. A user can retry logging in if a login attempt fails (probably due to submitting an incorrect password). To better understand a user’s pattern of login retries, we define a login session as a sequence of login attempts to a username from the same device ending either in a successful attempt or in a period of inactivity (indicating that the user has given up after a series of failed 41 100 101 102 103 0.2 0.4 0.6 0.8 1 Interarrival time (seconds, in log scale) Fr ac ti on of re qu es ts U1 No inactivity thresholds Threshold = 600s Threshold = 360s Threshold = 300s Threshold = 120s Threshold = 60s 100 101 102 103 0 0.2 0.4 0.6 0.8 1 Interarrival time (seconds, in log scale) Fr ac ti on of re qu es ts U2 No inactivity thresholds Threshold = 600s Threshold = 360s Threshold = 300s Threshold = 120s Threshold = 60s Figure 2.6: Cumulative distribution of interarrival time between requests for different inactivity thresholds for February 2021. The X-axis is log scale. attempts). We define a device as a tuple of an IP address and a user agent. It is possible that more than one user may use the same internet connection and thus have the same IP address. In this case, they may appear in the same login session if they both submit login requests to the same username in a period of time and share the same user agent as well; however, this seems an unlikely scenario. The definition of login session is parameterized by the length of the period of inactivity, which we call the inactivity threshold. We refer to a session that did not end in a successful login as an abandoned session and the number of login attempts in a login session as session size. We examine the time between successive login requests, termed as inter- arrival time, to determine the inactivity threshold that provides stable session size. Because a successful login request indicates the end of a login session, we specifically investigate pairs of successive login requests where the first request was not successful. We show this distribution in Figure 2.6 at both universities for the month of February 2021, as a representative month for which the data collection periods overlap. We limit the X-axis to 15 minutes for easier viewing. 42 Inactivity threshold Average session size (seconds) U1 U2 60 2.16 2.44 360 2.29 2.22 600 2.32 2.99 Figure 2.7: Average session sizes (number of attempts per session) for different inactivity thresholds At U1, 81% of successive attempts are executed within 60 seconds of the previ- ous attempt, 90% of requests are executed within 361 seconds (6 minutes), and 95% of requests are executed within 13 minutes. We can also see by looking at the average session size for multiple thresholds (Figure 2.7) that the choice of inactivity threshold does not significantly affect the session size. Thus, we choose 6 minutes as our inactivity threshold for future statistics involving sessions, since 90% of successive attempts occurred within that window. With this definition, we find that 51% of sessions at U1 required more than one attempt, and nearly 5% of sessions were abandoned. For U2, 38% of ses- sions required more than one attempt, and 2% of sessions were abandoned. With so many sessions requiring more than one attempt, there is much room for improvement in usability of password-based logins, which we elaborate on in our discussions on password typos and lockout thresholds. Retries. The session size indicates how many retries were required before a successful login or the user giving up. Thus we show the average session sizes for different inactivity thresholds in Figure 2.7. At U1, we found that 22% of successful sessions and 14% of abandoned sessions required more than one at- tempt; at U2, 38% of sessions required more than one attempt. Some login 43 100 101 102 103 0 0.2 0.4 0.6 0.8 1 Number of sessions per user per day C um ul at iv e fr ac ti on of us er na m e- da te pa ir s U1 U2 Figure 2.8: Cumulative distributions of number of sessions per user per day for an inactivity threshold of 360s for February 2021. The X-axis is log scale. systems have a lockout policy, in which they lock a user’s account after a cer- tain number of failed attempts have been made. In this case, the 99th percentile of session size is 10 attempts per session, providing empirical support for a stan- dard choice for lockout threshold. We measure the number of sessions per user in a single day, as this will inform how often a user needs to go through the login process, how important it is to have a frictionless login process, and how effective single sign-on is. We find that a user attempts to log in a median of one session per day and a 99th percentile of six sessions per day; we show the distribution of the number of sessions per user per day in Figure 2.8. However, the tail end of the graph shows that some users attempted up to 112 sessions in a single day. Given that the 99th percentile is six sessions per day, this is probably indicative of suspicious behavior, and in future work this metric may be used in conjunction with the other metrics we’ve reported to further investigate possible attacks. To investigate the password-based login usability of mobile devices, we compare the session size and frequency of sessions per day for mobile and non-mobile sessions. We break mobile devicesdown even further by compar- 44 ing iPhone and Android devices; these distributions are shown in Figure 2.9. We can see from these graphs that users of iOS devices tend to require more attempts than Android ones. Password typos. One of the areas of friction in a login system may be password typos, especially for stronger, more complex passwords. With the ephemeral datastore in Gossamer, we computed whether a password submission was within edit distance two of the actual password. We can estimate the number of password typos using this measurement; however, we only have this measurement for users that later logged in success- fully on the same day, which is only 45% of all failed attempts at U1. Thus we find that at U1, 62% of failed requests where we have this measurement con- tained typos of edit distance two or less, or 30% of all failed requests. In a study at Dropbox, Chatterjee et al. [38] estimated the number of typos to be at least 9% of all failed requests. Requests originating from mobile devices tend to contain typos even more frequently. Out of all failed mobile requests with the edit distance measurement at U1, 72% were typos (or 39% of all failed mobile requests). When investigating sessions with two or more attempts, we found that 12% of eventually successful sessions of size two or greater initially failed because of a typo. The remaining failures may be explained by user memory errors (using the wrong password). These findings further underscore the utility of password managers, since they help avoid both typos and memory errors. Password managers. Although we cannot identify users with password man- 45 2 4 6 8 10 0 0.2 0.4 0.6 0.8 1 Session size Fr ac ti on of re qu es ts All mobile sessions iOS sessions Android sessions Figure 2.9: Cumulative distributions of session sizes (number of attempts per session) for mobile devices for February 2021. The X-axis for U2 is limited to 10 for comparison with U1 (although the maximum session size consisted of 54 attempts). agers with certainty, we can find the number of users with a certain number of successful login attempts and no failures, and we conjecture that this may be an approximation. For example, we found 25% of valid users at U1 and 27% at U2 have at least ten logins and never had a failed login over the course of our measurements. Since, as we show in Figure 2.4, the majority of users — 99.97% users in U1 and 99.94% users in U2 — are using strong passwords (according to the zxcvbn score), we believe that these users are using password managers to enter their passwords. Duo logs. Both U1 and U2 have introduced the use of Duo two factor authen- tication [13] to further strengthen account security. At U2, we were able to an- alyze Duo logs for successful login attempts and combine them with Gossamer logs to explore the tension between usability and security with respect to MFA. Unfortunately, our logs and Duo log entries do not share any unique identifiers. Instead, we try to match each successful login attempt to a corresponding Duo push within two minutes of the login request originating from the same user. If there were multiple matches to a single successful login attempt, we chose the 46 Duo push closest in time to the login request. At U2, out of 15.9 M successful login requests, 62% were using Basic Auth [51], which does not require users to enroll in two-factor authentication. Among the remaining 6.0 M successful logins, we could match 89% of requests, and the remaining 11% already had a previously obtained Duo cookie which remained active for 12 hours after a successful authentication. Among the Duo pushes we could successfully match with a login attempt, we found that 96.7% were successful, 3.2% were denied, and only 46 (< 0.01%) were marked as fraud by users. Users on average took 14 seconds to mark their Duo push as valid, slowing down logins for honest users. 2.7 Conclusion We designed, built, and carefully deployed Gossamer, a framework for securely recording statistics about login requests and submitted passwords during lo- gin. Our approach combines system security mechanisms, a simulation-based approach to assessing risk of different measurements, and procedural mecha- nisms to enable new kinds of measurement studies. In studies conducted at two large universities in collaboration with their IT security teams, we were able to gather first-of-their-kind measurements about login behavior that shed light on usability, security, and attacker behavior. Through these measurements, we observed several patterns in user login be- havior. First, we saw that login friction is still high. Many users require more than one attempt before successfully logging in, and Duo two-factor authenti- cation added on average an additional 14 seconds to logins. One suggestion we 47 have for reducing login friction is to count unique passwords submitted for a user, rather than every request; we find that this would reduce lockouts by 88%. Second, we observe that reuse of breached credentials is a serious problem: nearly 3% of valid users at U1 and more than 9% at U2 are using a breached password that appears in our breach dataset. During our data collection pe- riod, some users changed their passwords to either exactly equal to one of their breached passwords or very similar to their breached passwords. To reduce this number, we encourage the deployment of breach alerting services such as Might I Get Pwned [94] that could indicate when a user’s password is vulnerable to credential stuffing. We also find that visibility into submitted passwords is helpful for designing security mechanisms and informing new policies. Similar to how we proposed a new lockout policy, authentication system developers can use a framework such as Gossamer to test the effect of new policies before their deployment. Future analyses with Gossamer may investigate how password-derived signals help in detecting different types of password guessing attacks and the efficacy of Duo two-factor authentication in stopping different types of attacks. Also, Gossamer can be extended with additional measurements using the same framework and simulation method for assessing the risk of different measurements. Thus both the tool and approach can be useful in the future for helping analysts detect and develop mitigations against attacks. We will be releasing Gossamer as a public, open-source project. 48 CHAPTER 3 ARAÑA: DISCOVERING AND CHARACTERIZING PASSWORD GUESSING ATTACKS IN PRACTICE 3.1 Introduction Remote password guessing attacks are one of the most effective and preva- lent causes of account compromise for password-based authentication sys- tems [111,124]. Password guessing attacks are easy to mount for attackers, who may attempt logging in under known usernames with widely popular pass- words or perform credential stuffing by submitting username-password pairs that have appeared in previous breaches. Despite being straightforward, such attacks can nevertheless be highly damaging. As a result, the most advanced lo- gin services in practice use proprietary mechanisms in an attempt to detect ma- licious logins using more than just the correctness of the submitted password, e.g., via user risk profiles [52, 59, 111]. Improving such mechanisms requires understanding attacker behavior as observed by login services — a delicate task given the sensitive nature of login data, especially passwords. As such, no in-depth measurement studies of attack behavior as seen from login services have been conducted. In our prior work [29], we designed a new login service instrumentation tool called Gossamer. It securely records information about login requests, in- cluding certain carefully chosen statistics about the passwords used in the re- quests. Gossamer was deployed for seven months at one university (U1) and three months at another (U2). We analyzed various aspects of user login behav- ior but were only able to identify three obvious high-volume attacks. How to find and characterize more attacks was left as an open question. In this work, we make progress on answering this open question. To do so, we obtained approval—from our university IRBs and IT security teams—to perform a fresh analysis of the datasets generated by the previous study. These datasets include a rich amount of information on over 34 million login requests, Duo two-factor authentication (2FA) logs (just for U2), and more than 2,000 com- promise reports. Even with the measurements provided by Gossamer, discovering attacks represents a tricky bootstrapping problem. The compromise logs do not im- plicate specific login requests and, more broadly, there is no ground truth any- where in the dataset for attacks. This makes the supervised machine learning approaches used in prior work [52, 59] inapplicable to our setting. At the same time, the scale of the problem renders manual analysis prohibitive, and so prior work only used a simple, known method for detecting attacks: Flag IP addresses that individually flooded the login service with requests. This of course misses many attacks. We develop a new analysis pipeline that we call Araña. Crucially, it focuses not on individual requests but sets of login requests emanating from the same IP within a day. This ensures sufficient signal about client behavior for patterns to emerge. We then built an application-specific filtering, clustering, and man- ual analysis approach. We filter out likely-benign login request sets using cus- tom heuristics, such as filtering out IP addresses exhibiting high login success rates. We then use an unsupervised learning approach, specifically agglomera- tive clustering [87], with a custom distance function tailored to login sets. This 50 helps us identify clusters of login sets that can be manually analyzed with little effort to discover attack campaigns — sets of login requests likely to be submitted by the same attacker. Araña is effective at finding these attack campaigns, including coordinated attacks involving multiple IP addresses and those spread out over long stretches of time. We use it to discover and characterize a diverse set of 29 attack clusters. Many are high volume credential stuffing attacks that submit on average of one password per targeted username and exhibit a notable fraction of username- password pairs appearing in breaches. In addition to the more simplistic stuff- ing attacks that quickly flood requests from a handful of IP addresses, Araña allows for discovering widely distributed attacks that may originate from hun- dreds of different IP addresses. Some of these attacks try to be “low and slow”, submitting a low number of requests per day at a slow rate. This confirms that credential stuffing attackers attempt to evade countermeasures that focus on in- dividual high volume IPs. These credential stuffing attacks are unfortunately effective: We uncover hundreds of potential successful logins by attackers for usernames not flagged as compromised already by the security engineers. We are also able to discover lower volume, targeted attacks by focusing on Araña-identified clusters that exhibit a large number of requests to individual usernames. For example, we identified an attack campaign made up of two clusters that targeted 127 users with on average 25 guessed passwords per user. These attacks included successful logins, suggesting this targeted strategy can also work for attackers. We discuss many other attack campaigns in the body. Overall, our anal- yses highlight a number of important takeaways for authentication system 51 designers. First, they suggest that, perhaps unsurprisingly, credential stuff- ing attacks remain a primary vector for account compromise. This, plus the fact that most compromised users in our study also had passwords in known breaches, underscores the urgent need for broader use of breach alerting APIs (e.g., [63, 76, 111, 113]). Second, we saw a large number of attacks on Microsoft’s basic authentication endpoint at U2. It is the least protected of all of U2’s authentication services, as it does not support rate limiting or Duo two-factor authentication. It is also easy for attackers to target any organization using basic authentication by just changing the destination URL. We observed that attackers regularly target such weak points, underlining a challenge for large organizations with heterogenous authentication infrastructure. Third, mechanisms that look for a large number of submissions per unit time from an IP or to a username are ineffective against many attacks already being deployed in practice. A better approach would be to work towards operational- izing mechanisms like Araña to detect in (near) real time distributed and sneaky attacks. However, future work will be needed to explore how to make such mechanisms robust against evasion by future adversaries. Summary. Our contributions include the following: • We perform the first in-depth analysis of a dataset with over 34 million login events generated by a recent system called Gossamer to discover and charac- terize remote password guessing attacks. • We design an analysis framework called Araña that shows how to filter and cluster Gossamer logs, enabling easy manual analysis and identification of attack campaigns. 52 • We use Araña to discover and characterize 29 attack clusters against two ma- jor universities that compromised hundreds of user accounts in total. • We identify key characteristics and patterns of attacks received by authentica- tion systems at these universities, and we discuss how authentication systems should evolve to counter such threats. Our work has already had some practical impact in terms of discovering new at- tacks. We have worked with security engineers from the two universities to per- form responsible disclosure of potentially compromised usernames. We believe that Araña will be useful in developing more robust attack detection methods, and we release it as a public, open source project [1]. 3.2 Background and Related Work Passwords remain the most widely used mechanism for user authentication, despite efforts to move past them [31]. We focus our discussion on the literature most closely related to our topic: characterizing password guessing attacks as seen by login services. Measurement studies on user passwords. User behavior with respect to pass- word choice has been extensively researched. Many studies simulate login services (e.g., via Mechanical Turk) to perform user studies [69, 72]. Others look at password breach data to characterize aspects of user password selec- tion [42, 122]. A handful of studies have measured user passwords in real de- ployments [31, 39, 48, 83]. Most recently, the Gossamer system [29] was used to measure not only le- 53 gitimate user password strength, but also various user password submission behaviors as observed by the single-sign on (SSO) login services at two large universities. This project uses the same measurement datasets as Gossamer but focuses on characterizing malicious login behavior. Password guessing attacks. The literature discussed so far concerned itself with legitimate user behavior, trying in part to assess whether users are select- ing passwords that resist various types of guessing attacks. A traditional focus has been on offline password cracking attacks, which occur when an attacker at- tempts to crack password hashes found in exposed password hash databases using tools like Hashcat [106] and John the Ripper [92]. Researchers have also developed natural language processing techniques to improve guess genera- tion [60, 71, 79, 85, 88, 122]. Breached username-password pairs — either obtained via offline cracking, gathered via phishing or malware, or stolen from another web service stor- ing plaintext passwords — can be used in credential stuffing attacks, where an attacker tries to log into a system using breached username-password pairs. Users frequently reuse passwords across multiple services [42, 93, 119], mak- ing credential stuffing one of the most prevalent forms of account compromise attacks [111, 124]. In an effort to reduce credential stuffing attacks, breach alert- ing services such as HaveIBeenPwned [63], Google Password Checkup [112], and Cloudflare’s exposed credential checks [94, 113] provide APIs that check whether a user’s passwords have been compromised. Credential stuffing is one kind of online guessing attack — an attack that re- motely attempts to log into a service with guessed credentials. Other examples of guessing attacks include password spraying attacks that submit a few very pop- 54 ular passwords against a large number of user accounts and credential tweaking attacks [93] that submit slight variants of breached passwords. Detecting malicious logins. Resisting online password guessing attacks re- quires determining which login attempts are malicious. As Bonneau et al. [32] discuss, login services increasingly should treat login as more of a classifica- tion problem, taking into account more than just the correctness of a submitted password. But only a handful of prior works have focused on how to do so. Freeman et al. [52] were the first to report on a statistical framework using the client IP address and user agent to differentiate between valid and invalid login attempts; this study used real-world LinkedIn login data, but did not in- clude password-based features. They also do not report on observed attack campaigns. Schechter et al. [102] built a malicious login detection system that also utilizes password-based features, including differentiating password typos from other failures and using privacy-preserving data structures such as a bi- nomial ladder filter for detecting frequently used passwords. Their study used simulated data to argue the system’s efficacy at detecting malicious logins. Fi- nally, Thomas et al. [111] studied underground forums and tools used to steal credentials — their only measurements using logins are used to report on the lack of evidence of targeted guessing attacks that try multiple queries against a single account. The Gossamer paper [29] reports on three attack campaigns that are easily detected using known techniques (i.e., a huge number of requests from an IP in a relatively short amount of time). Such techniques will only catch obvious attacks and are blind to “low and slow” attacks that purposefully use a small number of guesses per target account (low) and per unit time (slow), or dis- 55 tributed attacks that perform guesses from many different IP addresses. While missing these attacks does not appear to affect the results on benign user be- havior reported in [29], understanding attacker behavior requires finding and investigating these attacks. In summary, no prior work has characterized the behavior of password guessing attacks as seen from login services. Two-factor authentication. Although effective at preventing account com- promise, two-factor authentication (2FA) has yet to achieve widespread adop- tion [40, 44], and user friction is still very high [19, 45]. As we show, many older accounts at universities are not enrolled in 2FA; and more importantly, even when 2FA is used, recent attacks have shown that attackers can spoof push- based 2FA and hide spoofed pushes by sending them soon after the victim has logged in [67]. Regardless of whether 2FA is bypassed in an attack, we would like to know when an attacker successfully guesses a user’s password so that IT security personnel and/or the user can take preventative steps such as chang- ing the password for the indicated account and for any other account that uses the same (or a similar) password. Unsupervised techniques for attack detection. In this work, we therefore de- velop a new approach for detecting and characterizing password guessing at- tack campaigns using Gossamer logs. As we explain in Section 3.4.2, we did not have success using supervised machine learning approaches and so instead focus on unsupervised techniques, which have a long history of use in attack detection. For example, they are frequently used in intrusion detection systems (c.f., [35]) and for various kinds of anomaly detection (c.f., [21]). To the best of 56 our knowledge, no prior work has developed mechanisms to detect whether password-based logins are malicious. 3.3 Gossamer Logs In this work, we use a dataset of login requests compiled from two universities (U1 and U2) via our prior work Gossamer [29]. For completeness, we provide some details about Gossamer and how the resulting datasets were collected. We refer readers to [29] for more details. Gossamer logs. Our prior work introduced a new, privacy-preserving instru- mentation approach for use with login systems [29]. The resulting system, Gossamer, provides a secure way to collect measurement statistics about login requests, including a subset of HTTP headers, source IP address, target user- name, success or failure of a login request, and carefully chosen measurements on the submitted password. Gossamer uses two levels of storage to provide ex- tra security for particularly sensitive information. The submitted passwords are encrypted and stored in an in-memory database. They are cryptographically erased every 24 hours, providing a good trade-off between the ability to calcu- late password statistics over one-day windows (e.g., the edit distance between two submitted passwords) and the ability to protect the secrecy of passwords even in the low-likelihood case of a full compromise of the instrumentation ser- vice. The system also deterministically encrypts the usernames to preserve their privacy and blind researchers from them. Gossamer was designed in close collaboration with the information technol- ogy (IT) security teams at two large universities. We received approval from 57 the IRB and IT security teams to deploy Gossamer for 7 months at U1 and for 3 months at U2. In total, we collected more than 34 million login requests from 347 K valid usernames. In [29], Bohuk et al. detailed the design of Gossamer and reported on an analysis of this dataset in terms of characterizing legitimate user behavior. At the time we were only able to detect a few obvious attacks that stood out due to their high rate of requests, and left finding more attacks as an open question. This project is a first step at addressing this open question. Duo logs. Both universities use Duo 2FA for most login requests; users are prompted to provide this second factor after they successfully submit the pass- word (and if they do not have a valid “Duo cookie” stored in their browser). The datasets we received include sanitized Duo logs for U2 only. Unfortunately, there is no identifier in the Duo logs that can be used to uniquely associate a log entry with a particular login request. We therefore had to use a timestamp-based heuristic similar to that used previously in [29], to find the Duo prompt that likely corresponds to a successful password submis- sion. Using the timestamp and encrypted username associated with the Duo request, we correlate the Duo requests with login requests within a 2-minute time window. Out of these requests, 96.7% were successfully completed, 3.2% were denied, and only 46 (< 0.001%) were marked as fraud by the user. Compromised user logs. At both universities, the security analysts have pro- cesses for identifying and reporting compromised accounts. These processes include user reporting mechanisms and third-party breach notification services (e.g., U1 uses a Microsoft product to detect accounts that are sending spam). We were given access to compromised account logs for the period of our previ- 58 ous measurement study, containing the encrypted username, the timestamp re- ported, the estimated timestamp of the attack (only available at U1), and the rea- son for compromise. Importantly, compromise reports at U1 were logged from multiple authentication services, but the Gossamer deployment only instru- mented the main one (U1 web login); so the number of compromised accounts will be an upper bound for those that were actually compromised through U1 web login. Also, the IT departments consider an account compromised if the attacker made a successful login request, even if they did not bypass two-factor authentication. Indeed, many older accounts may not be enrolled in 2FA. We adopt the same terminology and refer to account compromises as a successful login from an attacker independent of whether 2FA was also bypassed. In most cases, analysts respond to compromise reports by scrambling the passwords of compromised accounts to prevent further damage. Our prior work using Gossamer logs contains some basic details about the compromise logs; we elaborate more here since it will be relevant to interpret- ing some results later. We considered compromise reports for the measurement time period at each university, plus one additional week (which allowed time for analysts to enter compromise reports for attacks that may have occurred during the measurement period). We give summary statistics on this compro- mise database in Figure 3.1. On average, 190 usernames (0.59% of all valid user- names) are reported every month at U1, and 163 usernames (0.28% of all valid usernames) at U2. At U1, 323 usernames were reported multiple times, 32 of which were reported for two different reasons. Users who are reported twice are compromised on average 26 days apart. We discuss a breakdown of the reasons for compromise in more detail in 59 U1 U2 (7 mo.) (3 mo.) Total reports 1,818 611 Unique usernames - reported 1,468 489 - more than once for same reason 323 122 - more than once for diff. reasons 32 0 % of unique usernames compromised / month 0.59% 0.28% # IPs tried to log in - with a compromised username 4,633 6,409 - with multiple compromised usernames 716 156 Max. comp. usernames associated with an IP 382 24 Avg. usernames associated with a single IP 1.98 1.05 Figure 3.1: Summary statistics on the compromised accounts reported at each university during the measurement period. Appendix B.1. In summary, we found that the majority of accounts were re- ported as compromised through large-scale automated attacks. At U1, the es- timated time of attack is also reported for each compromised user, and so we investigated the time it takes for an attack to be recorded in the compromise database. We found that 17% of compromised accounts are reported within the first hour after the attack, and 90% of compromised accounts are reported within 61 hours. However, the “timestamp of attack” field is an approximation of the time of attack based on the analyst’s best guess. There is no such field recorded at U2. We also approximate the time of attack by taking the last successful login for a given username before that username is entered into the account compromise database. In doing so, we find that only 4% of compromised accounts at U1 are reported within the first hour after their last successful login, and 90% are reported within 46 days. This large difference shows that either the analyst’s estimated time of attack was not very accurate, or the last successful login before a compromise report is not a good approximation for the time of an attack— 60 an important challenge in using compromised logs for detecting attacks, as we discuss in the next section. Ethical considerations. Throughout our analysis, we worked closely with the IT security offices at both universities similar to how we did for Gossamer [29]. We obtained approval for both this study and Gossamer from the IT security offices and university administrators. Both studies were reviewed by our uni- versity IRBs and received IRB exemptions, as the Gossamer logs do not include usernames or other data considered to be PII by IRB. We could not request con- sent from individual users, as we do not know the usernames of users. To fur- ther protect the privacy of users, we used similar security measures as we used for Gossamer study for data analysis: We used a machine only accessible to a subset of researchers logging in from a specific computer in the university net- work and requiring a strong password and 2FA for access. We also notified the IT administrations at both universities of all previously unreported compro- mised accounts we found using the encrypted usernames. 3.4 Towards Detecting Attack Campaigns We now turn to the challenge of discovering remote password guessing attacks using Gossamer logs. Prior work has used supervised machine learning ap- proaches to flag likely malicious login requests [52, 102]. However, they relied on simulated logins, for which they marked each request as benign or malicious. For login requests observed in practice, flagging each as either benign or ma- licious represents a challenging bootstrapping problem. Not only is the number of login requests received too massive for a reasonable set of security engineers 61 to manually analyze, but also there is no clear set of criteria to flag a login re- quest as malicious. Moreover, even if an account has been flagged as compro- mised by a security engineer or reported as compromised by a user, there is no obvious way to correlate this with individual login attempts because, as dis- cussed in Section 3.3, compromise databases do not include information on the specific login sessions. Individual login requests often do not contain enough information for even a human analyst to determine if they are malicious. We instead aim to identify attack campaigns—that is, sets of login requests submitted by the same attacker. Attackers often use automated scripts to send guesses for usernames and pass- words from one or more IP addresses over a period of hours, days, or weeks. To identify attack campaigns we will cluster login requests into groups that are likely to be part of the same attack campaign, as we explain below. 3.4.1 Login Sets and Features The main Gossamer logs consist of a sequence of login requests. Each request entry includes (1) a timestamp; (2) the (deterministically encrypted) username; (3) the client IP address; (4) the client user agent string; (5) whether the sub- mitted password is weak (has a bucketized zxcvbn score of zero, as explained in prior work [29]); (5) the edit distances between the submitted password and previous passwords submitted by the same IP or username; (6) whether the re- quest succeeded or failed due to an invalid username or the wrong password; (7) whether the submitted password is a tweaked variant of a breached pass- word known to Gossamer; or (8) whether the submitted password, username, 62 Group Features Acronym Client IP address © IP ISP © ISP User agent string © UA Volume Total # requests submitted NR Total # unique usernames submitted NU Avg. # unique password per user AUP Login timing Date© DATE Mean interarrival time MIT Std. interarrival time SIT Success rates Fraction of submitted requests that failed FF w/ invalid usernames FIU Password guessability Fraction of submitted requests w/ a weak pw, zxcvbn(w) = 0 FWP w/ pw in breach FPIB w/ username in breach FUIB w/ username-password pair in breach FCIB w/ a tweaked pw in breach FTP Figure 3.2: Features for L sets that we use for analysis. Nominal features are marked with ©; all others are numerical. or username-password pair appear in a breach dataset known to Gossamer. We group login requests based on the client IP address and date it was re- ceived. We call this grouping a login set, which we denote by L. The set of all L sets within a Gossamer log we denote as L = {L1, . . . ,Ln}, where n is the number of L sets in the log. Each Li contains all login requests received in a day from an IP address. Note that L defines a partition over all login attempts. We define an attack campaign as a set of one or more L sets. This assumes that all logins in a L set are either malicious or all are benign. Although it is possible that legitimate users might share the same VPN or proxy IP address as an attacker, we expect such scenarios to be rare (and did not encounter them in our analyses). We leave to future work how to differentiate benign and attack login attempts from a single IP. 63 To determine if an L set is potentially malicious or benign, we utilize a rich set of features describing a L set based on Gossamer logs. The full set of fea- tures, consisting of four nominal and 12 numerical features, are summarized in Figure 3.2. At a high level, these 16 features can be roughly divided into follow- ing groups: • Client features include the source IP address and user agent (UA) within the request. We also look up the ISP associated with each IP address (as reported via the MaxMind Geolocation API [80]). These are useful for determining whether requests are from the same client device. • Volumetric features include the total number of requests in a L set (NR), the number of unique usernames targeted in the set (NU), and the average num- ber of unique passwords submitted to a particular username (AUP) for a L set. • Login timing features include the date of the L set, as well as the mean (MIT) and standard deviation (SIT) of the interarrival time between requests within the set. • Success rate features measure the fraction of invalid usernames submitted (FIU) and the fraction of invalid username-password pairs submitted (FF). • Password guessability features include the fraction of submitted passwords in an L set that have a zxcvbn score of zero (FWP), indicating a weak password, as well as the fraction of submitted passwords in a known breach (FPIB), usernames in a known breach (FUIB), username-password pairs in a known breach (FCIB), and the fraction of passwords submitted that are a close vari- ant of a breach password—called a “tweaked password” (FTP). Together, these features serve as the basis for our attack campaign detection mechanisms. 64 Gossamer logs Potentially malicious L sets Pairwise distance matrix Clusters Attack campaigns Filtering benign L sets Similarity measures Agglomerative clustering Manual analysis Figure 3.3: Araña’s filter, cluster, analyze pipeline for discovering attack cam- paigns. 3.4.2 Initial Attempts Using Compromise Reports An obvious potential approach for detecting attack campaigns is to utilize com- promised account reports to label L sets as potentially malicious. However, as discussed in Section 3.3, those reports do not record information sufficient to identify exactly which L set contains the compromising login attempt. We note that this ambiguity is not just a limitation at the universities we investigated; anytime a login system allows compromise reports from users or third party services, they will not be directly associated with logins. Nevertheless, we ex- perimented with various ways of using compromise reports as ground truth for supervised approaches. We briefly report on two here. Classifier based approaches. We identified 23,016 L sets where the corre- sponding IP address contacted at least one eventually-compromised user ac- count (at any point during the data collection period). Let C denote the set of all those L sets. Of course, not all of these are necessarily malicious; but we mark them all as malicious since it is unclear how to label them more precisely. We first tried to develop a classifier that can identify L sets likely to have a compromised user based on the features we identified in Figure 3.2. We added to C an equal number ofL sets not associated with a compromised user account, labeled them as benign, and performed an 80/20 training and testing split on the combined set. We then trained linear regression, decision tree, random for- 65 est, logistic regression, and SVM models to predict maliciousness. All the mod- els exhibited very bad precision; the linear regression classifier performed best, with a recall 0.79 but a precision of just 0.13, making it essentially useless. The primary reason is the crude labeling of training data as malicious or benign, presumably including many false positives. Directed anomaly scoring. We also explored using Ho et al.’s directed anomaly scoring (DAS) [61], tuned via compromise reports, to rank L sets in order of “suspiciousness.” These experiments were promising at detecting new types of attacks, but failed to help us detect attacks spanning multiple source IP ad- dresses, making it less useful for driving an analysis and characterization of attack campaigns. See Appendix B.2 for details. 3.5 Campaign Discovery Pipeline We developed an analysis pipeline, called Araña, that combines heuristic-based filtering, unsupervised machine learning to cluster behavior, and a manual anal- ysis review step as illustrated in Figure 3.3. We refer to this as an FCA (filter, cluster, analyze) approach. While the general notion of an FCA-type analysis pipeline is not novel, our application-specific heuristics, similarity measures, and manual review processes are new. Our goal here is not completeness; iden- tifying all malicious logins is impossible. Rather, we build a high precision (low false positive) pipeline that helps us discover a wide range of attack campaigns with minimal manual effort. We first work to develop heuristics for filtering out likely benignL sets. First, we remove L sets that do not exhibit a high failure rate (HFR), and then we re- 66 move some anomalous known-benign behaviors (such as misconfigured clients repeatedly making failed requests). We then use the wide range of features described in the previous section to help us define an application-specific similarity measure logsetsim(Li,L j) that outputs a similarity score for any L set pair Li,L j. This helps us capture the likelihood that twoL sets are part of the same attack campaign. Given logsetsim, we can create a pairwise distance matrix and apply unsupervised agglomerative clustering [87] to discover clusters of L sets that could each be part of a single attack campaign. The candidates can then be manually inspected by analysts; we report on our findings in Section 3.6. 3.5.1 Filtering Likely Benign Requests Most logins are benign, and benign requests are a source of obfuscating noise in unsupervised algorithms. We therefore first develop a set of heuristics for filtering out likely benign behavior or, equivalently, identifying potentially ma- licious L sets. Our heuristics are based on three assumptions regarding attacker behavior reported in prior work [52, 59, 102]: (a) malicious logins are only a small fraction of the total login traffic; (b) a large fraction of the malicious lo- gin attempts fail; and (c) attackers use automated scripts or tools to minimize the time and effort required to send a large number of requests from IP pools (purchased or free proxies, VPNs), hiding their own IP. Based on these assump- tions, we first use heuristics to filter out obviously benign behaviors (thereby identifying potentially malicious behavior). 67 Filtering based on failure rate. We start by focusing attention on L sets that exhibit a high failure rate (HFR). We choose thresholds for two features: the total number of login requests NR within L and the fraction FF of these login requests that are failures. We say a L set meets the HFR condition if NR > ` and FF > f for two thresholds ` and f that we set below. Intuitively, benign logins should succeed most of the time, and failure rates are not meaningful for one or two logins. To choose these thresholds ` and f , we first plot the distributions of NR and FF over the subset of L that includes all L sets that have at least two login requests (NR > 1) and one failed login request (FF > 0). We show their distribu- tions in Figure 3.4. If we choose extreme thresholds, the HFR heuristic may miss potentially malicious L sets; but on the other hand, choosing a relaxed thresh- old may flag benign L sets as potentially malicious, increasing the noise in our final attack campaign detection procedure. Our goal was to build a semi-automated tool for discovering high-volume attack campaigns and help IT analysts discover them with minimal manual ef- fort. Since security analysts are concerned with finding the most damaging attacks (that is, attacks compromising the highest number of users) with high precision, we chose aggressive thresholds for considering malicious login sets. Thus, at U1, we used the 90th percentile as the threshold, making `U1 = 9 and fU1 = 0.77. However, for U2, the 90th percentile yielded fU2 = 1.0, the highest possible value; this is because U2 had more high volume unsuccessful attacks which inflated the 90th percentile. Therefore, we decreased the thresh- old to the 80th percentile for U2, setting `U2 = 6 and fU2 = 0.8. Although these 68 0 20 40 60 80 0 50 100 150 Percentile N R U1 U2 0 20 40 60 80 0 0.2 0.4 0.6 0.8 1 Percentile FF U1 U2 Figure 3.4: Percentile of NR (left) and FF (right) for both universities (shown up to the 99th percentile for viewing). By choosing the 90th percentile at U1 and 80th at U2, we set `U1 = 10, `U2 = 7 and fU1 = 0.77, fU2 = 0.8 at the universities respectively. aggressive thresholds may miss certain attacks, lowering them runs the risk of degrading the quality (precision) of clusters by mixing L sets with starkly dif- ferent behavior in the same cluster. Lower thresholds may also cause benign clusters to be mistakenly flagged as malicious, which would quickly cause an- alysts to lose trust in the tool. We investigate the effects of lowering thresholds by running additional experiments with more relaxed thresholds as explained in Appendix B.3 . Using the HFR heuristic, we filtered out a large number of outright benign L sets for both universities, as shown in Figure 3.5. This allowed us to focus on L sets exhibiting more anomalous behavior and potentially malicious behavior. Filtering out benign behavior. After removing L sets that do not match the HFR heuristic, we further filter out other likely benign behaviors. First, we filter out L sets with IPs that have successfully completed Duo requests for all target users at least once on the same day because we assume that a remote attacker conducting large scale password guessing attacks does 69 not have physical access to a victim’s Duo authentication device. As we only have the Duo logs at U2, we applied this filtering to the U2 data. We found 563 L sets matching the HFR heuristic, but all of them contacted only one user and had successfully completed at least one Duo request for that user. Around 90% of these L sets had submitted ≤ 5 unique passwords, and the failures were due to incorrect password entry — probably just a typo. Second, we filter out any L sets having an IP within the university networks or belonging to school proxy/VPN servers. Access to these IPs is restricted, and members of the two universities can only use them after successful authentica- tion and completing the Duo request. Lastly, we filter out L sets that seem to emanate from a malfunctioning or misconfigured (benign) client. Specifically, we noticed many IPs submitting a large number of failed login requests with the same incorrect password. Upon further inspection, we observe that these were automated requests belonging to email clients, Outlook Exchange Web Services, and calendar auto-sync agents that were configured with an invalid password. Since any rational attacker would not try the same invalid username-password pair repeatedly, we re- moved all L sets reusing the same incorrect username-password pair for more than 90% of their login requests. After filtering benign L sets based on the HFR condition, failure rate, suc- cessful Duo submission, school IP addresses, and number of unique username- password pairs, there were 1,717 remaining potentially malicious L sets at U1 and 6,408 at U2. We show a breakdown of these filtering steps in Figure 3.5. 70 Filter # L sets U1 U2 L with NR > 1 & FF > 0 225,468 227,893 L with NR > ` & FF > f (HFR) 2,074 11,929 L with HFR and w/ successful Duo request on DATE n/a 563 IP belongs to school IP pool protected by Duo 60 825 >90% password reuse 419 4,133 Remaining L sets 1,717 6,408 Figure 3.5: Number of L sets matching each of the filtering criteria at either university. The last row shows the remaining number of sets after all filtering steps. Threshold values ` and f for U1 and U2 are indicated in Figure 3.4. 3.5.2 Clustering Potentially Malicious L sets After the filtering step, we cluster the remaining L sets using the features de- scribed in Figure 3.2. The key question for clustering is how to define a distance function that gauges similarity betweenL sets, such that twoL sets have a small distance if they belong to the same attack campaign. Similarity modeling. We design a distance function that can measure the like- liness that two given L sets belong to the same campaign. For the numerical features presented in Figure 3.2, we use the normalized difference between two values x and y, defined as ND(x, y) = |x − y|/(x + y). We chose this particular dis- tance function since ND(x, y) ≈ 0 when x ≈ y, and it is a high value when x and y have a high difference. That said, we believe other numerical distance measures would work well too. For the nominal features IP, ISP, UA, and DATE, we cannot do a straightfor- ward equality checking to measure their similarity, since these nominal features are sparse. Therefore, similar to Freeman et. al. [52], we use a hierarchical back- off distance (HBD) approach. This technique defines a number of levels: If the 71 two feature values do not match at a lower-level, we “backoff” and use values of the feature from a higher level, while incurring a dissimilarity cost. We set this dissimilarity to e−`−e−(`+1) for backing off from the lower hierarchical level ` to a higher hierarchical level ` + 1, and we accumulate the dissimilarity costs from each level with each backoff. For the IP feature, we use the hierarchical structure of Internet routing to define four levels. At level ` = 0 we check for strict IP equality; at level ` = 1 we check for /24 subnet equality; at level ` = 2 we check for ISP equality (as reported in the ISP feature); and at level ` = 3 everything is considered equal. Thus the max level for this feature is `IP max = 3. For the UA feature, we use five levels. At level ` = 0 we check if the UA strings are identical. For subsequent levels we extract from the UA string to obtain the application (desktop, mobile, unknown), browser (chrome, edge, other), and OS (Windows, iOS, Mac OS, Linux, other). Equality checks for each of these three define levels ` = 1 through ` = 3. The final level ` = 4 indicates that the user agents do not have anything in common. For the final nominal feature, DATE, we find the number of days d between two L sets and compute 1 − e−d. Thus two L sets with the same DATE have a distance of 0; with 1 day apart, the distance is 0.63, and so on. Note that this backoff can be calculated (less efficiently) via the same approach as for the other nominal features, by setting `DATE max equal to the maximum number of days. Pseudocode for the complete logsetsim and the hierarchical backoff distance (HBD) is shown in Figure 3.6. The set Γ̄ includes all feature we used (shown on Figure 3.2), and we abuse notation slightly by letting γ ∈ Γ̄ define a function that maps a L set to the relevant feature value. For the nominal features, we further 72 logsetsim(L1,L2): s← 0 for γ ∈ Γ̄: x← γ (L1); y← γ (L2) if type(γ) is numerical: s← s + wγ · ND (x, y) else s← s + wγ · HBD(γ, x, y) return s HBD(γ, x, y) : s← 0 ` ← 0 while ` < `γmax or γ`(x) , γ`(y) do s← s + e−` − e−(`+1) ` ← ` + 1 return s Figure 3.6: Our distance function (logsetsim) to measure the similarity between two L sets (Left) and the hierarchical backoff distance (HBD) calculation for nominal features (Right). let γ` denote the function that extracts the `th level from the feature value. We also define a predicate type over features that indicates whether the feature is nominal or numerical. We weight the distance values computed for each feature based on a hyper-parameter called feature weight wγ ∈ [0, 1]. Clustering. We used an agglomerative clustering technique [87] that can work with a non-metric distance function. Agglomerative clustering starts with each data point as a single cluster and only merges two clusters if their distance is smaller than a given threshold. Adjusting the threshold can help avoid clus- tering together L sets showing different login behavior. To set the distance threshold appropriately, we rely on a knee locator method [100] frequently used in clustering algorithms to find the correct threshold for distances. We set the linkage type to “average” and set the distance threshold for U1 and U2 to 0.44 and 0.51 respectively after applying the knee locator method at each university separately. The silhouette score [23] for agglomerative clustering was 0.19 for U1 and 0.17 for U2, which beat other approaches we tried (see Appendix B.3). Implementation details. We implemented our similarity model logsetsim in 240 lines of code written in Python 3.6. To extract the four hierarchical levels 73 for the IP feature, we used the MaxMind GeoIP database [80]; and for the UA feature, we used the ua-parser package [9]. Given an L of size n, we computed an n×n distance matrix to be used for various clustering algorithms. Computing the distance matrix takes O(n2) time; however, it can be easily parallelized. We used 40 threads and were able to compute the distance matrix for n = 1, 717 within 9 minutes at U1 and for n = 6, 408 within 28 minutes at U2 (using an Intel Xeon Linux machine with 56 cores and 125 GB of memory). For the clustering step, we used the sklearn [36] library, and clustering completed in less than a minute for both universities. 3.5.3 Attack Campaigns Discovered Based on our designed similarity modeling, the agglomerative clustering ap- proach described in Section 3.5.2 produced 366 clusters from 1,717 L sets at U1 and 640 clusters from 6,408 L sets at U2. For each cluster we recompute the fea- ture values mentioned in Figure 3.2, after taking the union of L sets. Next, we sample a few of the top most interesting clusters for manually analyzing them. To identify the large volumetric attack campaigns, we sampled clusters con- taining a high number of requests or high number of unique usernames. At U1, we found eight such clusters with NR ≥ 1, 000∨NU ≥ 1, 000. At U2, we sampled 12 clusters with NR ≥ 5, 000 ∨NU ≥ 5, 000. We chose these thresholds by manu- ally observing that the clusters found after these thresholds do not clearly show malicious behavior. As our goal is to characterize attacks, we focus on precise identification of attack campaigns, foregoing high recall. Thus we sorted possi- ble attacks by a few different metrics. For example, we sorted by the number of 74 01-01 01-21 02-10 03-02 03-22 04-11 05-01 05-21 06-10 06-30 102 103 104 N um be r of re qu es ts pe r da y in lo g sc al e. U1 *Clusters 1 & 6 *Cluster 2 Cluster 8 Cluster 7 Cluster 5 Cluster 9 Clusters 3 & 4 12-20 12-25 12-30 01-04 01-09 01-14 01-19 01-24 01-29 02-03 02-08 02-13 02-18 02-23 02-28 103 104 105 U2 *Cluster 10 Cluster 11 Cluster 13Cluster 14 Cluster 23 Cluster 22 Figure 3.7: Number of requests per day for the 1,752 and 6,408 potentially ma- licious L sets at U1 and U2 respectively as shown in Figure 3.5. Attack clusters we found in Section 3.4 are shown in red boxes. Clusters marked with a * were identified in prior work [29] as high volume attacks. Note that the x- and y-axes are different for the two universities for better visualization. unique usernames and found one additional attack at U1; all the top clusters at U2 had already been found using the volumetric thresholds. All of the above sample clusters did not exhibit any targeted behavior — sending roughly one unique password to each user on average. Therefore, to capture attack campaigns exhibiting targeted behavior, we consider clusters sending a high number of unique passwords submitted per username on aver- age. Although at U1 we did not find any such clusters appearing to be attacks, we discovered eight clusters from U2 that submitted an average of at least 25 unique passwords per user. Altogether, we sampled nine attack clusters at U1 and 20 attack clusters at U2. Attack clusters at U1 sent on average 8,432 re- quests to 1,294 unique usernames, with a total number of 41 successful logins among these 9 clusters. At U2, we saw an average of 14,358 requests submitted per cluster to 7,614 unique usernames, with a total number of 1,116 successful 75 Abbreviation Measurement NU Number of unique users NR Number of requests AUP Average unique passwords per user FWP Fraction of requests with a weak password FPIB Fraction of requests with the password in a breach FCIB Fraction of requests with the username-password pair in a breach FUIB Fraction of requests with the username in a breach FTP Fraction of requests with a tweaked password FVU Fraction of valid usernames FF Fraction of requests failed Figure 3.8: Abbreviations for measurements taken logins among all 20 clusters. We describe these clusters further in Section 3.6.1. 3.6 Analysis of Attack Campaigns As seen in the last section, Araña’s FCA pipeline helped us identify 29 clusters that are probable attack campaigns. We first describe a subset of these cam- paigns in more detail, to better understand attack behavior as seen in practice. In Section 3.6.2 we generalize from these case studies to identify a variety of observed higher-level attacker behaviors. 3.6.1 Example Attack Campaigns First we describe some attack clusters representative of different types of attack behavior we observed and show how we group some of them into campaigns involving more than one cluster. We show the timeline of the attack campaigns 76 ID / Univ. Dura tio n # IP s # ISP s NU NR AUP FW P FPIB FCIB FUIB FTP FVU FF Com p. use rs Com p. use rs flag ged .∗ 1§ / U1 5 h 3 3 10,424 18,093 1.21 0.12 0.56 0.41 0.69 0.06 0.94 1.00 1 1 2§ / U1 4 h 1 1 15,209 17,827 1.10 0.04 0.56 0.22 0.42 0.06 0.97 1.00 14 13 3‡ / U1 14 h 555 4 12,659 15,117 1.00 0.01 0.14 0.02 0.58 0.46 0.78 1.00 14 12 4‡ / U1 11 h 77 2 12,318 14,603 1.00 0.01 0.14 0.02 0.58 0.45 0.78 1.00 7 6 5 / U1 41 m 1 1 4,480 4,593 1.03 0.08 0.63 0.21 0.44 0.08 0.97 1.00 1 1 6§ / U1 4 h 3 1 2,101 2,683 1.21 0.07 0.66 0.48 0.69 0.04 0.98 1.00 0 0 7 / U1 9 h 1 1 1,347 1,481 1.00 0.00 0.32 0.10 0.22 0.05 0.99 1.00 2 2 8 / U1 19 h 85 13 894 1,246 1.00 0.01 0.33 0.08 0.92 0.84 0.92 1.00 2 2 9 / U1 37m 30 7 219 241 1.00 0.06 0.92 0.78 0.96 0.10 0.93 1.00 0 0 10§ / U2 11 h 12 1 76,321 169,573 1.01 0.00 0.03 0.00 0.00 0.00 0.00 1.00 0 0 11 / U2 10 m 663 10 27,488 33,304 1.00 0.10 0.79 0.00 0.00 0.00 0.01 1.00 0 0 12 / U2 15d 1 1 8,192 13,289 1.02 0.04 0.61 0.07 0.54 0.02 0.63 0.93 501 6 13 / U2 3 d 1 1 7,939 12,240 1.30 0.05 0.66 0.10 0.75 0.00 0.35 0.99 120 0 14 / U2 23h 843 13 7,771 10,535 1.01 0.08 0.66 0.00 0.71 0.00 0.48 0.97 258 0 15 / U2 5 d 1 1 4,662 9,714 1.07 0.09 0.84 0.00 0.91 0.00 0.40 0.99 32 0 16 / U2 2 d 2 1 4,934 7,323 1.00 0.06 0.58 0.00 0.23 0.00 0.81 0.84 786 0 17 / U2 2d 3 2 2,513 6,290 1.55 0.03 0.40 0.18 0.91 0.70 0.37 0.99 35 0 18 / U2 10 d 1 1 1,902 5,434 1.37 0.04 0.55 0.27 0.82 0.41 0.39 0.99 13 0 19 / U2 5 d 1 1 3,584 5,261 1.04 0.07 0.76 0.80 0.98 0.18 0.42 1.00 6 0 20 / U2 2 d 3 2 1,756 5,199 1.56 0.04 0.50 0.22 0.92 0.72 0.35 0.98 83 0 21 / U2 23 h 1 1 5,076 5,103 1.00 0.06 0.59 0.00 0.26 0.00 0.80 0.85 777 0 22 / U2 13 h 73 43 75 1,878 25.04 0.82 0.88 0.01 0.31 0.29 0.96 1.00 1 0 23 / U2 21 h 52 38 52 1,296 24.92 0.82 0.89 0.01 0.34 0.32 0.88 1.00 1 0 24 / U2 12 h 4 4 4 101 25.25 0.80 0.86 0.00 0.55 0.50 0.51 0.99 1 0 25 / U2 16 h 3 3 3 80 26.67 0.81 0.75 0.01 0.65 0.58 0.30 0.99 1 0 26 / U2 8 m 1 1 1 27 27.00 0.85 0.85 0.00 0.00 0.00 0.00 1.00 0 0 27 / U2 8 m 1 1 1 27 27.00 0.85 0.78 0.04 1.00 0.96 1.00 1.00 0 0 28 / U2 13 m 1 1 1 25 25.00 0.84 0.88 0.00 0.00 0.00 0.00 1.00 0 0 29 / U2 4 d 1 1 3 468 38.75 0.15 0.59 0.00 0.67 0.16 0.33 1.00 0 0 ‡ Since the measurement collection ended during these clusters, some of these measurements (such as NR and NU) may be lower bounds. § These clusters are part of campaigns that were discussed in previous work [29]. † Duration units are days (d), hours (h), and minutes (m). ∗ The last column represents the accounts that were flagged independently by security engineers at the respective universities. Figure 3.9: Attack clusters detected using a set of heuristics and manual review. The first column notes the ID of each cluster as we refer to them in the paper [64]. The attack IDs we describe in detail (Section 3.6) are shown in bold font. We discovered a total of 41 and 1,116 unique compromised users at U1 and U2 respectively. Abbreviations are described in Figure 3.8. 77 discussed in Figure 3.7 and the full list of attack clusters in Figure C.5. Previously reported attacks. Three attacks were manually identified and dis- cussed in prior work [29]. All three were also detected via our FCA pipeline: Attack #1 from [29] corresponds to Clusters 1 & 6, Attack #2 corresponds to Cluster 2, and Attack #3 corresponds to Cluster 10 (also shown in Figure 3.7). Attack #1 from [29] was a credential stuffing attack distributed over four IPs that were active on different days. Each of the IPs submitted lots of requests very quickly, with a peak rate of over 100 requests per second. As such, these were easily identified manually as related attacks by the similar attack behaviors and time period when they were active. This still required significant manual analysis of Gossamer logs; however, our FCA approach automates this analy- sis. Two clusters identified by the FCA method capture the bulk of the L sets associated with these attacks. One L set containing 16,035 requests that were previously manually identified was omitted; this L set had a lower failure rate (0.67) than our heuristic filtering thresholds (which was 0.77 at U1). Attack #2 from [29] was a high volume credential stuffing attack using Sen- tryMBA [110] to send requests from a single source IP address. Since the attack consisted of only one IP address on a single date, the cluster consisted of one L set, ranked second by volume of requests in our FCA approach. Attack #3 from [29] was observed at U2 and involved 12 distinct IP addresses with an average rate of 188 requests per second. This attack was easy to detect due to its sheer volume and rate of requests. Manual inspection revealed that the attack requests mimicked SMTP and IMAP clients, which helped in manu- ally clustering them. Again, our FCA pipeline automated this step, placing all 78 activity from these IP addresses into a single attack cluster. Our new FCA approach helps automatically identify these attack campaigns with little error relative to the manual analysis used in prior work [29]. How- ever, it did split the first attack into two clusters, and it also missed an IP address from that attack. Across these three attack campaigns, 17 L sets were identified by both the manual and FCA approach as being part of attack campaigns, and one L set was identified by the manual approach but not the FCA approach. We show the corresponding confusion matrices in Appendix B.4. Note that the FCA approach also found many attack campaigns hard to manually detect that we describe next. Future work can look into whether unsupervised clustering approaches can be improved to be more accurate than manual inspection. For now it is clear that FCA helps find and characterize attack campaigns in an automated fashion that can later be investigated by an analyst. We now discuss in further detail the new attacks discovered using our pipeline. Clusters 5 & 7: Repeated attacks from the same IP. We saw two credential stuffing attacks at U1 (Clusters 5 & 7) a month apart (May 22 and June 20) from the same IP address and user agent (and thus, likely from the same attacker) attempting to login to 4,480 and 1,347 users respectively. The five user agents used in the June 20 attack were a subset of the seven used in the May 22 attack, and the two attacks targeted 295 overlapping usernames. Thus we believe that the IP was under the same attacker’s control for both attacks. This IP was active on 24 other days, but only submitted between one and four requests per day to nine distinct usernames over that time period. We believe the attack exhibited credential stuffing behavior, as the fractions of breached passwords submitted 79 were 0.63 and 0.32 respectively, and less than 8% of the passwords submitted were weak passwords. Other features such as the AUP, FTP, and FSP were rel- atively similar between the two clusters. All but two usernames were incorrect across the clusters, so we believe the attack was curated for U1. Cluster 12: Multi-day attack from a single IP. Although most of the attacks we saw were finished within a 24 hour period, some attackers may spread out the attack over multiple days in an attempt to stealthily avoid detection. We saw this behavior in Cluster 12 at U2. In this attack, one IP from Microsoft Cor- poration ISP sent one request per minute on 15 days in a two month period. In total, the IP sent 13,289 requests to 8,192 unique usernames. There was evidence of credential stuffing, as 61% of passwords submitted appeared in breach data known to Gossamer, and fewer than 0.04% of them were weak passwords. We did not see any evidence of targeted behavior, as there was an average of only one unique password submitted per username; however 63% of the attempted usernames were valid, meaning the attacker curated their password guessing attack for U2. The IP attacked the basic authentication protocol that does not have two factor authentication set up at U2. This IP successfully guessed the passwords of 501 user accounts over this time period, only 163 of which were detected by the security engineers. We saw similar multi-day attacks in Clusters 15, 18, 19, and 29, albeit for shorter time periods. Such attacks may easily avoid simple volumetric detection methods (as they were missed in the prior work) and thus show the utility of a richer feature-based clustering approach. Cluster 14: High volume, distributed, credential stuffing. We saw several cases exhibiting high volume, distributed, curated credential stuffing. For ex- 80 ample, in Cluster 14 at U2, 843 IPs belonging to 13 ISPs submitted 10,535 re- quests to 7,771 usernames over the course of 23 hours. Since each IP submitted only around 12 requests, it avoided detection by the manual analysis performed in the prior work [29]. This attack exhibited credential stuffing, with 66% of passwords present in breach data; and 48% of the submitted request contained a valid usernames from U2. During this attack, the involved IPs successfully guessed the passwords of 258 unique user accounts, 74 of which were confirmed by security engineers independently. Clusters 3, 4, 8, 9, 11-13, and 15-19 showed similar signs of high volume, distributed, credential stuffing with varying levels of curation to the target university. Cluster 17: Possible credential stuffing with unknown breach data. A few at- tacks showed a high fraction of tweaked passwords (passwords that are a close variant of a breached password for the targeted user), but a lower fraction of breached passwords. For example, in Cluster 17, we saw three IPs submit 6,290 requests to 2,513 unique usernames over the course of two days. Although 91% of the submitted usernames in this attack campaign are present in the breach data used by Gossamer, only 18% of the submitted passwords appear with those usernames. Of the remaining 82% submitted passwords for those users, we that found 70% are only small tweaks of the password(s) present in the breach data with the corresponding username, and 40% of the submitted passwords exactly match the breach data. This may indicate that the attacker is using a breach dataset that is unknown to Gossamer; but because users choose similar pass- words across web services [42], we can still detect these breached passwords not present in Gossamer’s breach data. This attack produced successful logins to 35 users, two of which were inde- 81 pendently detected by the security engineers. Cluster 20 showed similar signs of possible credential stuffing with unknown breach data, with a high fraction of tweaked passwords. Neither of these attacks exhibited targeted behavior (hav- ing AUPs of 1.55 and 1.56 respectively); therefore, we do not believe these are credential tweaking attacks [93]. Clusters 22 & 23: Targeted attacks. In a few cases, we saw a higher average number of unique passwords tried per username. For example, in Cluster 22, we saw 73 IPs submit 1,878 requests to 75 unique usernames, with an aver- age of 25 unique passwords tried per username, each attacking only one user. Cluster 23 was active on the same day and executed a very similar attack: 52 IPs submitted 1,296 requests to 52 usernames, each submitting an average of 25 unique passwords to one unique username, just like Cluster 22. Thus we believe they are part of the same campaign. In this attack campaign, the attacker clearly used a popular password dictio- nary, as 82% of passwords submitted had weak passwords (zxcvbn score of 0). The fraction of valid usernames for Clusters 22 and 23 were 0.96 and 0.88 respec- tively, indicating that the attack data was probably curated to the university. The campaign was successful in guessing the passwords for two usernames, neither of which was detected by the security engineers. Clusters 24-28 exhibited similar targeted dictionary attack behavior, al- though with fewer IPs and a lower number of total requests. Our clustering approach, however, failed to cluster all of these L sets together. This is a com- mon limitation in agglomerative clustering [109]. 82 3.6.2 Higher-Level Attack Behaviors Observed Attack campaigns employ various strategies to maximize their success in com- promising user accounts. Broadly, there are two components in an attack cam- paign that the attacker has to choose: (a) the types of username-password pairs to submit, and (b) how login requests with those username-password pairs are delivered to the target service. We noticed different approaches to each of the two components in the attack campaigns we found at U1 and U2, as we discuss in more detail below. The list of different behaviors we observed as well as some example clusters showing that behavior are shown in Figure 3.10. We also explore the geographical distribution of attacks, but relegate details to Appendix B.5. Types of submitted username-password pairs. A key component in an at- tack is the set of guessed username-password pairs, as the success of the attack depends on it. There are different strategies for picking username-password pairs — e.g., an attacker can choose usernames belonging to the university and try several popular passwords against those users, or an attacker can choose breached username-password pairs with or without filtering the usernames specific to the university. Curates usernames to the target university. We found that all attacks at U1 cu- rated their set of usernames, with more than 75% of requests containing a user- name present in U1. However, at U2, only 7 (out of 20) attack clusters contained more than 50% valid usernames. Clusters 10 and 11 combined submitted nearly 200 K requests, but only 286 of them contained a username present at U2. Even when an attacker attempted to log in multiple times for a particular username, 83 Attack component Behavior Ex. clusters Type of username- password pairs Curates to the target university 2, 21 Targets certain users 22, 23 Uses breach data 9, 19 Submits popular passwords 22, 23 Delivery of requests Distributed among multiple IPs 3, 11 Distributed among multiple days 12, 19 Ends quickly (within 24 hours) 1, 11 Exhibits low interarrival time 5, 11 Figure 3.10: Different attack behaviors we observed. we found that in several cases the username did not exist. Some attacks are therefore rather indiscriminate, trying arbitrary usernames without checking first whether they are valid for the target authentication service. Targets certain users. An attacker may submit requests against one, a few, or many unique usernames. Among attacks we detected, it was more common for an attack to target a large number of usernames in a “horizontal attack.” Most of the attacks we found were horizontal attacks — trying one or two passwords per username but for a large number of usernames. An attack submitting multiple unique passwords against fewer usernames may be exhibiting targeted behavior. We found six attack clusters targeting 137 users in total, each with more than 25 unique popular passwords. We do not know if the set of passwords used were the same for different users, as Gossamer logs at U2 do not allow comparing passwords submitted to different usernames. Uses breach data. In our analysis, we found that attackers often use prior breaches to source their username-password pairs in what is popularly known as a credential stuffing attack. In six out of nine clusters at U1 and in one cluster at U2, 50% of the submitted username-password pairs are present in the breach data used by Gossamer. Among the remaining 22 clusters, eight clusters (all 84 at U2) had more than 50% of the targeted usernames in a known breach, and all but one attack cluster at U2 submitted passwords that were found in prior breaches. This reiterates the threat of credential stuffing. Uses popular passwords. Although most attacks used breached passwords, some attacks relied on especially popular passwords. Attackers may use popular password dictionaries [125] curated from prior password breaches for such at- tacks, and such passwords will, by definition, also be flagged as having ap- peared in a known breach. We found a number of attack clusters submitting popular weak passwords, such as Clusters 22 & 23. More than 81% of the pass- words submitted in these attacks appear in the 1000 most frequent passwords found in the breach data known to Gossamer. Notably, all of these attacks were targeted, submitting more than 25 passwords per user. Both universities have password selection policies that aim to disallow popular, weak passwords. Nevertheless, we found that such attacks were successful in compromising two users at U2. Delivery methods for attack requests. After choosing the set of username- password pairs to submit, the attacker must decide how to submit them to the target service. Primarily, the attacker must identify how quickly they can sub- mit all the username-password pairs without being detected. To do so, attackers can use multiple IP addresses to parallelize the attacks, and/or spread the attack over a long period of time. Distributed among multiple IPs. We found several attack campaigns that dis- tributed their login requests over multiple IP addresses. In particular, we iden- tified three large attack clusters using more than 500 IP addresses and five other 85 clusters each using more than 50 IP addresses. In the two clusters with the most IP addresses at U2 (Clusters 11 & 14), the majority of IPs were flagged as proxies by Blackbox API [2]. By distributing over multiple IPs, an attacker can achieve a higher throughput, as exhibited in cluster 11 which used multiple IPs to achieve a very fast request rate. This distribution of requests across IPs can help to avoid volumetric detection mechanisms, as exhibited in clusters 22 and 23. Duration of the attack. Attacks can be short-lived and bursty or spread across multiple days. All attack clusters at U1 were short-lived, finishing within 19 hours and four of them completing within five hours. At U2, however, we found that most high volume attacks (submitting more than 5 K requests) span mul- tiple days, and two clusters were active for over 10 and 15 days, respectively. Interestingly, these attacks would be very difficult to detect by looking at their behavior on only one day; however, our clustering approach helps to see the full attack by combining the attack behaviors from multiple days. Interarrival time. Finally, sending requests too quickly could also trigger alarm or lockout. Therefore, some attackers try to submit requests at a lower rate. However, we see at least four attack campaigns that submitted as much as 60 requests per minute at U1 and U2. These attacks are not curated, meaning that the majority of the submitted usernames do not belong to the target university and thus could not result in a successful login. Both U1 and U2 have a soft rate limiting policy, meaning that too many unsuccessful attempts against a user- name could lead to account lockout for 15–30 minutes. However, we did not find any attack that submitted requests fast enough to a single user that could trigger that lockout. We are unsure if attackers adapted their attacks to this 86 lockout constraint or if their typical behavior avoids such lockouts. Endpoints targeted. An attacker may choose an attack endpoint with the least safeguards in place. We found that most of the attacks at U2 targeted the Mi- crosoft basic email authentication endpoint. We suspect that there are four reasons for this high usage of the basic authentication endpoint: (a) this end- point does not support two factor authentication; (b) it has poor (non-existent) rate limiting of requests, as basic authentication requires frequent submission of passwords; (c) the requesting IP seems to be from Microsoft (as seen by the au- thentication server); and (d) attackers would have to make only a small change (the URL) to attack different organizations using basic authentication. 3.7 Discussion We design Araña, an attack analysis system based on our FCA — filter, clus- ter, analyze — framework. Using Araña with the logs created by our prior work [29], we identified several attack campaigns in two university login sys- tems. These attack campaigns are spread across 29 clusters as detailed in Figure C.5 and compromised 1,157 users across two universities. Although our study is limited to the attacks received by two universities, we believe the pat- terns we observed across attacks can form a basis for building future defenses against password guessing attacks. At U1, the most common type of attack was a combination of a high vol- ume, distributed, curated, short-lived, credential stuffing attack; the interarrival times of the attacks varied from 55 milliseconds to 108 seconds. At U2, almost 87 all the attacks were high volume credential stuffing attacks, but they varied in whether they were distributed, curated to the university, short-lived, or exhibit- ing a low interarrival time. Of the eight lower volume attacks we found, all were curated, short-lived, targeted credential stuffing; and they varied as to whether they were distributed across IPs. The distributions of attacks observed at the two universities are quite different, possibly because Gossamer gathered data from all login endpoints at U2, but only the main login endpoint at U1; thus attackers targeting different endpoints at U1 are not reflected in Gossamer logs. Efficacy of breach alerting services. To determine how much a breach alerting service could mitigate password guessing attacks, we investigated the charac- teristics of the password from the last successful login before a user was re- ported as compromised. We found that 2% of compromise reports were asso- ciated with a breached username-password pair (that is, the last successful lo- gin to that username before the compromise report was a breached username- password pair), 12% were associated with at least a breached password, and 19% were associated with a tweaked password. At U2, we observed that 11% of compromised users’ last logins were made with a breached username-password pair, 46% were made with at least a breached password, and 1.51% were made with a tweaked password. Thus automatically resetting passwords of users us- ing vulnerable, breached, or tweaked credentials could have prevented a signif- icant fraction of account compromises. At U1, 71% of eventually compromised users used a breach password at some point during the instrumentation period. This further underpins the need for proactive breach alerting [76] services that could have saved 47% of account takeovers as stated earlier. 88 Key observations from attacks. Our findings reiterate the ongoing threat of credential stuffing, which has been the most prevalent and successful form of attack. We also observed a few low volume targeted attacks against specific users. Attackers often use multiple IPs from cloud providers, VPNs, and net- work proxies to distribute and hide their attacks, but our FCA approach can identify such campaigns even when each individual IP makes only a handful of requests. Such observations should be taken into account when building defense policies. For example, locking user accounts due to a small number of incorrect attempts rarely translates to higher security, whereas discouraging users from reusing passwords from other websites and using breach alerting services can be very effective. Proactive breach alerting [76] using services such as HIBP [63] would be helpful in combating credential stuffing attacks. Using an FCA approach for attack analysis. Our FCA approach helps ana- lyze attack campaigns by clustering L sets with similar attack behavior. This enables a security engineer to look at the whole attack, instead of considering login activities from a single IP or on a single day. As we show, several attacks are spread across multiple days and use multiple IPs; in some cases, they may use only one IP and spread the attack over multiple days, making them very hard to detect. We believe clustering seemingly unclear behaviors into groups can help security engineers see the attack pattern, detect hard-to-detect attacks, and have confidence in their judgment. We envision that our FCA approach can be used on authentication logs such as the ones produced by Gossamer [29] to group possible attack campaigns and sort by those that are more likely to be ac- tual attacks. Then analysts can manually investigate further, using the features 89 and groupings produced by the clustering to inform their decision. Through- out the design of our FCA approach, we focused on minimizing false positives by choosing aggressive thresholds and evaluating threshold effects so that an analyst can have confidence that clusters found by Araña are highly likely to be attacks. Thus our FCA approach could reduce the effort needed to analyze the complete logs of all IPs; such a tool could prepare daily (or weekly) reports about potential attacks. 3.8 Real-World Deployment Constraints Despite the many campaigns detected by Araña, there are certain challenges that may occur during deployment in real-world settings. Here we talk about a few that we have faced. Recording password-derived features. Most organizations do not record password-based features for good reason, and while Gossamer [29] showed how to record them safely, it is still an open question whether widespread de- ployment of recording password-derived features is wise or necessary. Araña provides evidence that certain password-derived features can be useful for at- tack detection. For example, we identified that password breach statistics are very effective at detecting credential stuffing attacks. On the other hand, pass- word similarity statistics — which require storing passwords in memory for 24 hours — are not as effective for attack detection in the FCA approach. So or- ganizations could log a small set of already proven-helpful password-derived features and apply clustering using those features. We detail clustering results without password-based features in Appendix B.3 and leave to future work the 90 open question of which features are the most helpful for detecting different types of attacks. Reporting compromised accounts. Based on the attack campaigns discovered by our FCA approach, we believe 41 unique user accounts were compromised at U1, and 1,116 were compromised at U2 (some accounts were compromised multiple times by different campaigns). At U1, 37 of the 41 compromised accounts detected by Araña were already detected via other mechanisms, and manually recorded by the security engi- neers as compromised. We reported the remaining accounts and received feed- back that they were indeed compromised, and the remaining account had al- ready been deleted. At U2, only six out of 1,116 compromised users detected by Araña had been independently flagged by the security engineers. We reported the rest to U2’s security engineers in two rounds. In the first round, we reported 823 compro- mised accounts, 373 (44%) of which the security engineers confirmed as defi- nitely compromised. For the remaining 450 accounts, U2’s security engineers said they could not verify fully due to unavailability of adequate logs but that their best guess was that they were compromised based on what logs were avail- able. In the second round, we reported another 293 compromised usernames; however, we received a similar response mentioning that the IT department did not have adequate logs to determine whether these were compromised. In practice, confirming if accounts are compromised is nuanced and often relies on indirect indicators like an account being used to send spam or a report from an account owner. Elsewhere, compromise status can be inherently am- 91 biguous. Even so, our experience working with the security engineers suggests that more detailed and persistent logging may help. Likewise, our experience with Araña suggests that password-derived measurements can be helpful to analysts when attempting to characterize and confirm compromises. We also note that this difficulty in confirming account compromise makes it difficult to automatically finetune many of the hyperparameters in Araña (e.g., distance thresholds and percentile thresholds for filtering L sets). Therefore we had to rely on experimentation with different thresholds and manual analysis on password-derived and volumetric-based measurements which took compara- tively much more effort. Nevertheless, Araña provides a way to use Gossamer logs to identify high-volume and distributed attacks that were not detected by existing mechanisms. 3.9 Conclusion Using data collected at two universities from a measurement framework for logging password-derived information behavior, we designed a set of features that describe an IP address on a given date, along with a clustering algorithm to group IP addresses active on certain dates together into probable attack cam- paigns. We describe several of these clusters in full detail to show the differ- ences and similarities between attacks, and we discuss our observations about behavioral patterns of the attacks as a whole. Our results indicate that clustering approaches can aid an analyst in detect- ing and labeling suspicious groups of requests that may be part of the same attack campaign. They also provide initial progress towads the future design and deployment of real-time, automated attack campaign detection tools. 92 CHAPTER 4 THE PRACTICALITY AND ROBUSTNESS OF TIMELY MALICIOUS LOGIN DETECTION 4.1 Introduction Remote password guessing attacks pose a serious risk to users’ online accounts. The prevalence of large password breaches [63] along with poor password prac- tices (e.g., users’ tendencies to pick weak, easy-to-guess passwords [96,115] and reuse them across web services [83, 118, 121]) have made it easy for attackers to guess a user’s password and gain access to their account [95]. Despite ongoing efforts to move beyond passwords, such as by transitioning to device-hosted cryptographic keys [56,73], passwords remain the primary mode of authentica- tion. Therefore, password guessing attacks will likely persist as a severe security threat for the foreseeable future [24, 78]. To prevent password guessing attacks, several server-side defenses have been proposed that analyze more features of a login request beyond just pass- word correctness [32]. Examples include volumetric approaches like block- ing IP addresses or locking an account after too many login failures [34]. Prior academic works have proposed supervised and unsupervised machine learning (ML)-based mechanisms that train models using prior known at- tacks [52, 101, 102] or model anomalous login behavior [59, 64]. While these approaches hold significant promise, no prior work has evalu- ated the efficacy of such approaches in the timely detection of real-world login attacks. Moreover, most previous studies [52, 59, 101, 102] have relied on syn- thetic login data for simulated evaluations, raising concerns about their appli- cability in practice. This limitation stems from the challenges associated with ac- quiring and analyzing sensitive login traffic. For example, Schechter et al. [102] require the authentication server to log fine-grained sensitive information about users’ passwords. In prior work, we addressed this challenge by designing a framework called Gossamer [29], which we used to collect a dataset of 34 million login requests at two universities. To preserve the security and privacy of users’ passwords, we specified a set of coarse-grained measurements deemed safe to collect without leaking too much information about a user’s password. For example, Gossamer records a binary password strength score using Zxcvbn [123] for submitted passwords instead of the raw Zxcvbn score (between 0-4), as it may endanger users’ security. Therefore, a key question remains: How do existing volumetric and ML- based timely detection approaches perform when authentication servers collect coarse-grained information from login requests to safeguard the privacy and se- curity of their users? In our work on Araña, we further analyzed the Gossamer dataset and detected 29 attack campaigns using a new FCA (filter, cluster, an- alyze) pipeline [64]. However, this approach relied on months of data to iden- tify malicious activity, rendering this technique unsuitable to detect attacks fast enough to prevent exploitation of a compromised account. Additionally, no prior research has assessed the threat of attackers adapting to detection mecha- nisms, leaving open the question of their robustness against more sophisticated adversaries. To begin answering these questions, we first perform a comprehensive in- 94 vestigation of existing volumetric and ML-based detection approaches on the real-world large-scale dataset collected by the Gossamer framework. We mea- sure the effectiveness of the existing detection approaches by assessing how ac- curately they can detect the 29 attack campaigns without also raising too many false alerts. Our findings indicate that existing volumetric and ML-based ap- proaches perform poorly, indicating that performance on simulated attacks may not generalize to real-world attacks — at least on real login datasets that must record coarser grained information about submitted passwords to avoid com- promising user security. We explore three new detection approaches that leverage Directed Anomaly Scoring (DAS), which was designed by Ho et al. [61] for the context of spearphishing detection. Following our approach in Araña [64], we first parti- tion logins by grouping requests from each client IP address on the same day — referred to as anL set. We design four sets of features called sub-detectors, each corresponding to a distinct guessing attack behavior reported in prior work. We explore three ways of using sub-detectors: using DAS to score an L set against previously identified suspicious L sets, using DAS to score against a sliding window of L sets, and applying the Araña clustering algorithm on an indi- vidual day’s worth of L sets and scoring the resulting clusters using DAS. All approaches are timely in that they could generate a daily report of likely attacks for an analyst to further investigate. We evaluate these approaches on the Gossamer dataset and find that they perform much better than prior approaches at detecting attacks. For example, the per-day clustering approach discovered >98% of the attacks reported by Arañawhile maintaining a precision of >52% at both universities. The sliding 95 window approach even discovered new attacks not reported in prior work, but generates more false alerts and had precision of 22%. We manually inspected a subset of these false alerts following the approach specified in [64] and found that a large fraction of the false alerts are likely attacks missed by [64]. While this shows that DAS-based approaches can detect existing attack strategies, it is unclear if they will remain effective even when the attacker adapts to such defense mechanisms by actively modifying their login requests to evade being flagged as malicious. We therefore propose a new, data-driven approach for evaluating evasion attacks: We use real attack campaign login data, modify it as per an evasion strategy, and then place it back into the stream of otherwise untouched background login traffic. We then simulate running the defense mechanism on the new combined stream of logins to allow a realistic, in situ analysis of detectability. We devise four evasion strategies that either distract the detector from the actual attack or help the attack blend in with other benign logins. We evalu- ate them using our approach for varying amounts of attacker resources—the number of additional IP addresses, login requests, and time spent on the at- tack campaign. We find that all attack detection mechanisms can be evaded when using enough resources, though certain detection approaches require the attacker to spend more resources to succeed. While future work may improve our evasion attacks, for now our results suggest that deploying timely detection mechanisms would help catch unsophisticated attacks and increase the cost of more sophisticated attacks. Contributions. In summary, we make the following contributions in our work. • We perform the first in-depth performance analysis of existing volumetric- 96 and ML-based attack detection approaches on a real login dataset containing 34 million login requests. • We propose three new timely detection approaches leveraging Directed Anomaly Scoring (DAS) and evaluate them on the Gossamer dataset. • We show how to measure the robustness of detection approaches against eva- sion attacks and evaluate it for the three DAS-based detection mechanisms. We are in the process of contacting the universities to perform responsible dis- closure about the new attacks discovered in the course of this work. We have open sourced the codebase we developed for this study.1 4.2 Background & Related Work This project focuses on the timely detection of online password guessing attacks. Online guessing attacks. In a password guessing attack, attackers submit can- didate username-password combinations to an authentication server to guess users’ correct passwords. They may employ different strategies—for example, they may deploy a password spraying attack by trying a few common passwords against many different user accounts. Studies have shown that users tend to reuse the same or similar passwords across different websites [90, 119]; an attacker can leverage this in a credential stuffing attack by guessing breached passwords from one website against an- other. Billions of breached passwords have made credential stuffing a powerful 1Available at https://anonymous.4open.science/r/real-time-malicious-login-detection-884B/ 97 https://anonymous.4open.science/r/real-time-malicious-login-detection-884B/ tool for attackers to compromise user accounts. In fact, automated bot-driven credential stuffing attacks account for 91% of login traffic against ecommerce websites [98] and contribute to 49% of all successful data breaches [46]. Despite this, existing server-side defenses have not been evaluated on credential stuffing attacks in practice. Client-side defenses. Modern websites use various proactive client-side mea- sures to protect users against online guessing attacks. These measures include enforcing password composition policies, banning the use of common, eas- ily guessed passwords like “123456”, and deploying password strength me- ters [93, 122, 123] and breach alerting services [74, 94, 112]. However, these client-side approaches rely on user motivation, and users may bypass or ignore them [53, 115]. Users often resist changing their passwords even after receiving warnings from breach alerting services [22]. Password managers and two-factor authentication (2FA) could help address these issues, but they are not yet widely adopted in practice, and user friction still remains high [19,41,44,62,82,97,127]. Recent studies have also identified several key security issues with both pass- word managers [91, 107] and 2FA [54, 68] Therefore, modern websites must also implement server-side defenses complementary to the above-mentioned approaches to protect their user accounts. Volumetric defenses. To prevent online password guessing attacks, web- sites can (temporarily) prevent login to an account if too many prior login at- tempts failed. Prior work has suggested various thresholds including three [49], ten [34], and a dynamically adjusted threshold that gets adjusted based on as- pects of the submitted password [27]. Such account lockout strategies can slow down targeted password guessing attacks. However, attackers can abuse this 98 defense mechanism to launch a simple denial of service attack, locking users out of their accounts [14]; as such, it is not widely used. Indeed, 72% (131) of Alexa Top 500 websites do not use account lockout [77]. More importantly, 21 out of 29 attack campaigns discovered by Islam et al. [64] exhibited untargeted behavior, sending only one request per username to a large number of usernames; thus account lockout defenses would be ineffective against such attacks. Login servers can also restrict or reject login from an IP addresses if too many login attempts from that IP address fail. For example, to slow down online guessing attacks, websites might silently block requests from such an IP address [17], present a CAPTCHA challenge [117], or require additional knowledge-based challenges [113]. Machine learning defenses. Another set of approaches to detect password guessing attacks utilizes machine learning (ML) [52, 59, 64, 101, 102]. For exam- ple, Freeman et al. [52] proposed WhoAreYou which uses the reputation of IP addresses and user agents to classify logins as malicious or benign. Building on this, Schechter et al. [102] proposed StopGuessing which goes further to include password-based features in creating a classifier for detecting malicious login attempts. Both approaches, however, have primarily been evaluated using sim- ulated attacks. Thus the effectiveness of these approaches against real-world password guessing attacks remains uncertain. We analyze the efficacy of these approaches on the Gossamer dataset in Section 4.3.3. Herley et al. [59] and Schechter et al. [101] also proposed similar ML-based login traffic classification approaches using different features. However, these rely on unrealistic assumptions for deployment—such as the presence of an attack-free period for collecting benign login traffic [59] or requiring users to 99 change their password behavior prior to attack detection [101]. We do not con- sider them for further analysis. Recently, Islam et al. [64] proposed Araña,which uses volumetric, client- level, and password-based features to group IP addresses with similar suspi- cious behavior together into attack campaigns. Arañawas tested on a large lo- gin dataset of 34 million login requests (called the Gossamer dataset, described in Section 4.3.1) and discovered 29 attack clusters. Manual analysis revealed di- verse characteristics including short and bursty behavior, as well as attempts to conceal behavior by distributing guesses across multiple IP addresses over longer duration. Notably, the majority of successful attacks (21 out of 29) exhibit credential stuffing behavior, while the eight other targeted attacks are mostly unsuccessful. All attack clusters are illustrated in Figure C.5. Arañarequires months of login data for creating clusters; thus it is uncertain if it can be used to detect attacks in a timely manner (say, within a day). Also, Arañarelies on a handcrafted filtering rule to separate attack clusters from the benign clusters, making Arañamore challenging to deploy in practice. Our work focuses on timely server-side detection of online password guess- ing attacks. To the best of our knowledge, both the volumetric and ML-based approaches that are deployed in practice have not been evaluated comprehen- sively for their efficacy against such powerful real-world password guessing at- tacks. In Section 4.3, we fill this gap by attempting to measure the effectiveness of a few volumetric and ML-based approaches on a real-world login dataset. 100 U1 U2 benign malicious benign malicious Login requests train 8,693,670 41,330 1,892,509 275,584 holdout 1,444,446 34,554 525,924 11,583 all 10,138,116 75,884 2,418,433 287,167 L sets train 3,962 93 185,776 1,568 holdout 194 663 40,408 141 all 4,156 756 226,184 1,709 Target usernames train 175,503 23,951 101,557 10,614 holdout 96,462 24,339 50,195 3,077 all 271,965 48,290 151,752 13,691 Figure 4.1: Breakdown of the Gossamer logs used in our simulations. Login requests associated with Araña-identified clusters are labeled as malicious, and all others are labeled as benign. 4.3 Evaluating Prior Approaches We evaluate the performance of the existing volumetric and machine learn- ing (ML)-based attack detection approaches on a large-scale, real-world login dataset we obtained from the authors of Gossamer [29]. In this section, we re- port on the efficacy of prior detection approaches when tested on a real-world login dataset such as Gossamer’s. 4.3.1 Dataset To measure the effectiveness of existing approaches, we use a large dataset of 34M login requests collected via the Gossamer framework [29]. Gossamer en- ables secure logging of information about login requests, including the source IP address, the anonymized username, user agent, HTTP headers, timestamp, submission result, and a set of carefully chosen password-derived features from each login request. 101 Bohuk et al. [29] deployed this framework at two large universities that we will refer to as U1 and U2. At U1, the instrumentation collected logins over a period of seven months (212 days); for U2, it collected logins for two and a half months (72 days). Altogether, the instrumentation collected information on 34M login requests originating from 1,039 K unique IP addresses to 505 K unique usernames — making Gossamer-collected logs the largest and most di- verse real-world login dataset used in academic research to date. In subsequent work, Islam et al. [64] used Gossamer logs to identify ten high- volume attack campaigns at U1 and 19 at U2 using a new clustering-based ap- proach. This approach, called Araña, takes login requests over the whole instru- mentation period and groups these login requests by their client IP address and the corresponding day of the login. Similar to [64], we refer to each group of logins associated with an (IP address, day) pair using the notation L set. Araña, then, filters L sets likely to be benign (using simple heuristics) and applies an application-specific clustering algorithm producing an output that allows man- ual analysis to identify the 29 attack campaigns. Bohuk at al. [29] did not publicly release the Gossamer dataset at the time of publication. We worked in close collaboration with the authors of [29] to get access to the Gossamer dataset and handled the dataset following the same set of ethical guidelines as described below. Ethical considerations. Gossamer dataset is anonymized (meaning it does not contain actual usernames), but it includes fields that were derived from pass- words. Therefore, analyzing such data requires careful ethical considerations. We maintained the same set of strict ethical guidelines to safeguard the privacy and security of Gossamer dataset as [29]. 102 For example, Gossamer dataset was safely stored on a VM protected by multi-factor authentication, and only researchers have access to it for perform- ing the experiments. Similar to prior work [29,64], we could not ask for explicit consent from users since usernames were deterministically encrypted with keys that were kept secret from researchers. Our study was reviewed by the IRB and received exemption, as the research poses no more than minimal risk, and the Gossamer logs do not contain any usernames or other data considered to be PII by the IRB. Simulating defense mechanisms. Gossamer logs provide a rich dataset to drive simulations of attack detection mechanisms. We therefore build a data- driven simulation pipeline. To evaluate a defense mechanism, the simulator iterates through all login requests from Gossamer logs. Given the nature of Gossamer logs, our simulator works for a wide range of defense mechanisms, but not all — for example, it will not work if a mechanisms needs the plaintext password or username string, since neither are included in the logs. Some of the mechanisms we evaluate require training on login data. To simulate these, we divide the instrumentation period at each university into two parts. The login data collected during the initial 80% of each instrumen- tation period is used for training malicious login detection models, as detailed in Section 4.3.3. Then the last 20% of each instrumentation period is used as a holdout for evaluating the efficacy of the defense mechanism. For the ground truth in evaluating mechanisms, we classify all L sets and login requests associated with the 29 attack campaigns previously documented (by Araña [64]) as malicious and the remaining data as benign. Figure 4.1 pro- vides a summary of the number of benign and malicious L sets and login re- 103 quests across the train and holdout splits. Characterizing malicious login detectors. For malicious login detection to be useful in practice, it must identify malicious login requests (high recall) without flagging too many benign login requests (high precision). We use precision and recall as opposed to other metrics (e.g., AUC-ROC) given the highly skewed na- ture of the dataset (few attacks relative to benign traffic — see Figure 4.1) [20]. Along with precision and recall, we also show the harmonic mean of precision and recall, called the F1 score, as an easy way to interpret the performance using only one score. Detectors should allow the analyst to identify guessing attacks in a timely manner, ideally before the exploitation of the compromised accounts. Additionally, the detector should provide clear reasons to the analyst for flag- ging login requests from a specific IP as malicious. This clarity ensures that analysts can easily confirm genuinely compromised accounts and avoid misre- porting accounts that are not compromised. 4.3.2 Performance of Volumetric Approaches Volumetric detection approaches flag an IP address as suspicious if the number of failed requests from an IP exceeds a certain threshold within a period of time. First, we consider a strict threshold of k = 10 failed login requests within a day; Bohuk et al. [29] suggest this threshold based on their empirical observation that 99% of login attempt sequences from an IP contain 10 or fewer login requests within a day. We also try to find a k that would improve the precision of detection, pre- sumably at the expense of missing some low-volume attacks. We test different 104 U1 U2 Request level L set level Request level L set level P R F P R F P R F P R F Volumetric k = 10 51 84 63 82 46 59 6 99 11 7 16 10 k = 1, 000 100 10 19 100 < 1 < 1 8 35 14 10 1 2 ML-based SGG n/a 0 0 n/a 0 0 11 5 7 17 19 18 WAY n/a 0 0 n/a 0 0 39 9 15 6 13 8 Anomaly-based§ CS 84 (86) 24 38 19 (28) 13 15 23 (93) 61 37 32 (78) 43 47 SW 85 (86) 82 84 16 (18) 47 23 22 (86) 99 36 27 (46) 95 42 PDC 94 (94) 100 97 79 (79) 100 88 33 (83) 81 47 52 (93) 98 68 § Manual inspection of the false alerts generated by anomaly-based approaches (Section 4.5) revealed 49 (at U1) and 221 L sets (at U2) as clearly malicious. We show within parenthesis (·) the adjusted precision if these new attacks are included in Arañaground truth. Figure 4.2: Evaluation of the volumetric, ML-based, and DAS-based approaches on the Gossamer holdout dataset using attacks reported in [64] as ground truth. We report the Precision, Recall, and F1 scores (by percentage (%)) broken down both by individual requests and L sets. The highest value in each column is boldfaced. StopGuessing-G and WhoAreYou do not flag any requests as mali- cious at U1. values of k (k ∈ {10, 102, 103, 104}) and find that k = 103 produces the highest F1 score for requests at both U1 and U2 (82% at U1 and 75% at U2). Detailed results of this experiment are given in Appendix C.1. Over the holdout data, we observe that this volumetric approach effectively flags IP addresses when using the relaxed lockout threshold of k = 103, achiev- ing an F1 score of 82% for requests at U1. However, on the L set level, the F1 score drops significantly to 12% primarily due to a very low recall of 6%. This indicates that the relaxed threshold misses many highly distributed attack cam- paigns at U1. In contrast, the more strict lockout threshold of k = 10 results in an F1 score of 23% for requests and 9% for L sets. At U2, both lockout thresholds k = 10 and k = 103 performed poorly with request-level F1 scores of 11% and 105 14% respectively (Figure 4.2). For L sets, the F1 scores are 10% and 2% for these two thresholds. In summary, volumetric approaches perform poorly at detecting most at- tacks, only flagging a few high volume ones. This can lead to high recall rates when measured on individual requests because the detected attacks are so high volume compared to the others not detected. Prior work also suggests that the higher threshold of k = 1, 000 is not sufficiently strict: Pal et al. [93] demon- strated a credential tweaking attack for which their simulations indicate that 16% of user accounts can be compromised within 1,000 guesses. 4.3.3 Performance of ML Approaches We consider two ML-based approaches for detecting password guessing at- tacks: WhoAreYou [52] and StopGuessing [102]. WhoAreYou. Freeman et al. [52] proposed WhoAreYou (WAY) to estimate the risk score of a login request from its associated IP address and user agent string, flagging the request as malicious if the score is above a threshold. They tested their approach on simulated attacks in which the attacker attempts to imitate the IP addresses and user agents of the victim user, and on one real botnet attack experienced by the LinkedIn website. We implement WhoAreYou following the authors’ description and train the model on the training split of Gossamer logs. The performance of this model necessarily relies on a reliable reputation score of the IP addresses and user agents. Freeman et al. used the internal reputation scores at LinkedIn [15] 106 for IP addresses and user agents, but neither the scoring API nor the details of the scoring mechanism are available publicly. In our model, we calculate the reputation scores using the number of attack logins (as per [64]) observed during the training split of Gossamer data. Intuitively, this means that the more malicious logins we observe from an IP address, the riskier its reputation score will be. We follow the same approach to calculate reputation scores for different countries, autonomous system (AS) numbers, internet service providers (ISPs), and user agent strings. We give more details in Section C.3.1. Since WhoAreYou does not take characteristics of the submitted password into consideration, unlike StopGuessing, we are not restricted to considering only successful requests; we therefore We use all 75,884 and 287,167 malicious logins from U1 and U2 (Figure 4.1) for the training and subsequent evaluation of the model. We randomly sample the same number of benign login requests as the number of malicious logins. We evaluate the performance of WhoAreYou on the Gossamer holdout data. We follow the same smoothing technique suggested by Freeman et al. [52] to es- timate risk scores for login requests having unseen IP addresses or user agents during the holdout period of Gossamer logs. At U1, WhoAreYou does not per- form well, failing to flag any malicious logins. At U2, WhoAreYou only correctly flags 2,480 malicious login requests (only 9% of the total malicious logins during the holdout period) with an F1-score of 15%. We attribute this issue, as detailed further in Appendix C.3.2, to the sparse nature of Gossamer logs — only a few IP addresses, ISPs, or even AS num- bers are present in both the holdout and the training data. Even the smooth- ing technique proves ineffective in addressing this issue. While it’s possible 107 that WhoAreYou and the underlying smoothing technique may demonstrate efficacy on other real-world settings, our evaluation clearly shows its limiting performance on Gossamer logs, which is currently the most diverse and largest real-world login dataset available. StopGuessing. Schechter et al. [102] go further to develop an ML model that uses five features to score IP addresses; if the score is above a threshold, all attempts from that IP address for a period of time are deemed malicious. When a login request from an IP address is successful, the model computes the following features from the prior login requests submitted from the IP ad- dress within a period of time: (a) whether the submitted the password is from the frequently submitted password list of the attacker, (b) the expected number of attempts needed to guess the correct password of the account with the suc- cessful login request, (d) the fraction of submitted passwords to the username from the IP address that were a typo, (e) whether the IP address has a high num- ber of historical failed login attempts, and (f) whether the login is from a client with a prior successful login (identified by an authentication cookie). Addition- ally, this approach ignores repeated username-password pair submissions from the IP address, and it calculates feature (a) twice from two distinct sets of login requests from the IP address: the set of all login requests with valid usernames and the set with invalid usernames. This approach was only evaluated on synthetic dataset generated by simu- lating benign users’ behavior and various types of password spraying attacks — a key limitation of the work as noted by the authors themselves [102]. Al- though some of the code is available at [18], a few exact details related to calcu- lating password-derived features are missing — for example, what edit distance 108 threshold should be used to differentiate typos from incorrect submissions and how to measure whether a password is frequently submitted. Given the sensitive nature of passwords, Gossamer logs contain only password-derived measurements that the authors deemed safe to collect from login requests. Authors in [102] sidestepped this problem by populating password-derived features from different distributions. Thus, to evaluate StopGuessing on a real-world login dataset, we approximate these features from the available safe measurements collected from Gossamer logs, and for clarity we refer to it as StopGuessing-G (SGG). For example, StopGuessing-G uses a coarse-grained password strength measurement computed by the zxcvbn strength meter [123] to identify weak accounts. To consider a password as frequently submitted, StopGuessing-G checks if it is within the top 1,000 frequently submitted incorrect passwords in the last 24 hours. Furthermore, to identify logins from the same device, StopGuessing-G uses a tuple of the normalized user agent and username, as Gossamer logs filter the authentication cookie value from the HTTP headers of the login request. We provide more details of StopGuessing-G, our best effort implementation of StopGuessing, in Appendix C.2. To train the model, we treat each successful malicious login request during the training period as a positive sample. However, the training period only contains 12 such positive samples at U1 and 3,005 at U2. Following Schechter et al. [102], we randomly sample the same number of successful benign login requests (negative samples) during the training period. Finally, we calculate the aforementioned seven features for each login request from the training samples. 109 If any successful request from an IP is flagged malicious by the model, all requests from that IP on that day are considered malicious. The trained model achieves an F1 score of 52% at U2 and 0% at U1 over the training dataset. More details of our implementation of StopGuessing are given in Appendix C.2. StopGuessing-G performed poorly at both universities in detecting success- ful malicious login requests during the holdout period. At U1, it did not flag any login requests as malicious, and at U2, it only flagged 5% of malicious lo- gins with a poor precision of 11%. We hypothesize that the poor performance of StopGuessing-G is primarily because there were very few successful malicious login requests in the training dataset. Open questions. We show that both the volumetric and ML-based approaches exhibit poor performance when applied to the Gossamer dataset. Specifically, volumetric approaches prove to be too simplistic for capturing the complex and diverse attack strategies, while ML-based approaches struggle to perform well under real-world constraints, such as sparse IP addresses and coarse-grained password derived measurements. A fundamental question persists: Can we de- velop an efficient and timely malicious login detection mechanism that achieves high precision and recall in a real-world dataset such as Gossamer logs? 4.4 Anomaly-Based Detection Approaches We now describe three new malicious login detection methods based on anomaly detection techniques. We leverage an approach called directed anomaly scoring (DAS) introduced by Ho et al. [61]. Briefly, their approach involves first choosing a set of features that describes the suspiciousness of an 110 event and labeling for each feature which direction—higher or lower—is “sus- picious.” For example, the higher the number of requests from an IP address, the more likely it’s malicious. Thus the number of requests (NR) can be a DAS feature with higher being the direction of maliciousness. Then DAS uses these per-feature comparisons to define an event comparison: event A is more sus- picious than event B if all of A’s feature values are more malicious than the respective feature values of B. Finally, the DAS score of an event compared to a set of other events is equal to the number of other events for which it is more suspicious. Two observations motivated us to adopt DAS-based approaches. First, as Ho et al. argue, supervised ML approaches do not handle well the heavy im- balance between normal and anomaly events. While Ho et al. investigated this in the context of spearphishing, we conjecture that similar dynamics arise in malicious login detection. For example, in Gossamer logs, only 18% of logins at U1 and 11% at U2 are malicious (Figure 4.1). Also, previous supervised ML approaches such as those proposed in [52,102] proved ineffective when applied to our datasets (see Section 4.3.3). Secondly, in most real-world login datasets, there is no way to describe benign behavior as a set of patterns or distributions, which is essential for standard anomaly detection approaches [37, 75]. There- fore, we use DAS which can overcome these limitations. DAS scores of L sets. We group all login requests within a given period (say, a day) by their source IP address, which we refer to as an L set. We then de- sign a set of features Γ that quantifies the suspiciousness of an L set. Thus, the DAS score of an L set given a set of other L sets L and a set of features Γ is DAS_Score(L,L,Γ) = |{L′ ∈ L|∀γ ∈ Γ, γ(L) > γ(L′)}. We later describe how we 111 use DAS_Score to determine if an L set is malicious. If an L set is labeled as malicious, we consider all login requests belonging to that L set as malicious. Designing feature sets. Online guessing attacks through an IP address will exhibit a number of distinct characteristics. For example, malicious IP addresses often have an abnormally high number of failed login requests (NF), since the majority of an attacker’s guesses fail. Often, attackers send popular passwords to many users in a “password spraying attack”; other times, they may focus on testing many passwords against one user in a targeted attack. Attackers may use publicly available or privately sold breached credentials to execute a guessing attack. They may draw out their attack over a longer period of time or send a burst of login requests in a short time period. We synthesize these malicious characteristics into four distinct attack behaviors and use sets of features that describe each of these behaviors. • Targeted attacks (ΓTar). The attacker targets certain users by sending many pass- word guesses to only one or a few users. We use the average number of unique passwords submitted to a user (AUP) from an L set to detect this sus- picious behavior. • High volume bursty activity (ΓBursty). The attacker sends a burst of logins for a very short period of time. We use the mean and standard deviation of the interarrival time between successive login requests (called MIT and SIT re- spectively) to identify L sets with bursty behavior. We augment these two features with the number of requests (NR) and the number of unique user- names (NU) originating from the L set, which reduces false alerts. • Known breach credential stuffing attack(ΓKnwnCrd). We use the fraction of requests with a breached password (FPIB), the fraction of requests with a username 112 from a breach (FUIB), and the fraction of requests with a username-password pair in the breach (FCIB).2 Similar to the bursty activity feature set, we aug- ment these three features with NR and NU. • Unknown breach credential stuffing attack (ΓUnkwnCrd). In some cases, anL set may appear to indicate credential stuffing behavior, but the credentials used do not directly appear in Gossamer’s breach data. We hypothesize that these at- tacks are using a different breach dataset than that of Gossamer, and we refer to such attacks as unknown breach credential stuffing attacks. L sets exhibiting such behavior would have a low FPIB but a higher fraction of tweaked pass- words (FTP) and fraction of usernames in the breach (FUIB). Thus we rely on a feature set of {FTP,FUIB,FCIB,NR,NU} to detect such attacks. For a given L set, we calculate the DAS score separately for each set of features Γ, and we consider an L set as malicious if it has a positive DAS score using any of the four feature sets. We also experimented with other DAS score thresholds larger than zero while identifying an L set as malicious; however they did not perform well. 4.4.1 Malicious Login Detection Using DAS We design three different approaches for performing timely attack detection us- ing DAS scoring. By timely, we envision that an analyst could deploy these approaches and after a certain period (at the end of a day), collect and analyze 2After 29 days into instrumentation, the U2 deployment of Gossamer had university-specific breach data added. We therefore use this FCIB feature only after that date for U2, as the FCIB for the L sets in the first 29 days is zero, making it unhelpful for DAS scoring. 113 Attack type ID / Univ # of L sets NR CS SW PDC # L sets % req # L sets % req # L sets % req Credential stuffing 3 / U1 555 15,117 47 14 208 61 555 100 4 / U1 77 14,603 8 10 71 98 77 100 5 / U1 1 4,593 1 100 1 100 1 100 9 / U1 30 241 29 97 30 97 30 100 12 / U2 ∗ 3 2,601 0 0 3 100 1 42 19 / U2 1 428 1 100 1 100 1 100 20 / U2 4 5,199 4 100 4 100 3 86 Targeted 22 / U2 74 1,878 1 1 74 100 74 100 23 / U2 52 1,296 48 96 48 94 52 100 24 / U2 4 101 4 100 4 100 4 100 25 / U2 3 80 3 100 3 100 3 100 Figure 4.3: The detection performance (recall) of the three DAS-based ap- proaches on the four attack clusters at U1 and seven at U2 reported by Araña [64] over the holdout period. Multi-day clusters are highlighted with an asterisk (∗), and the # L sets is boldfaced when all L sets of the attack cluster are detected by the approach. the L sets flagged. Comparison Set (CS). In our first approach, we use a comparison set of mali- cious L sets to which we compare a given L set. Specifically, to formulate the comparison set for a given L set on day d, we first gather all L sets from the last δ days (L[d−δ,d)). Then we calculate the DAS scores of each L set in L[d−δ,d) with respect to L[d−δ,d) given a feature set Γ. The comparison set then consists of the δ ·β L sets with the highest DAS scores, where β is an estimate of the number of L sets an analyst would be able to analyze per day for each sub-detector. Ties are broken based on the number of requests in the L set (NR) and then by the number of unique users contacted (NU). We set δ = 7 and configure each sub-detector with a budget of β = 5. Then to flag an L set on day d, we evaluate the DAS score of the L set against the comparison sets of that day for each sub-detector; if the DAS score is positive 114 for any sub-detector, we flag the L set as malicious. Already, with this configuration, Comparison Set outperforms the previ- ously tested ML approaches on the training period of Gossamer logs, producing F1 scores of 82% at U1 and 52% at U2 for detecting malicious logins (Figure C.2). It missed someL sets with comparatively fewer logins—yielding a recall of 53% at U1 and 45% at U2 for L sets. However, the performance of this approach on the holdout data was modest, as we describe in Section 4.5. Sliding Window (SW). Since the Comparison Set approach requires the last δ = 7 days of L sets for detecting malicious logins, we explore an alternate approach where we look at L sets in a much shorter time interval (or window) of size swin—say one hour. We consider the requests within a time window [ts, ts + swin]. We group all login requests within this window into L sets and then score each of the resulting L sets using DAS w.r.t. all L sets in that time period for each sub-detector Γ. We start by setting ts to the beginning of a given day and “slide the window” to increment ts by ssld, and we repeat the process until the end of the day. Then we reset ts to the start of the next day and repeat the process. Similar to other approaches, anyL sets with a positive DAS score are flagged as malicious. If an IP address corresponding to an L set is flagged in any win- dow on a given day, we consider all login requests from the IP address through- out the day as malicious. We perform experiments with swin = 1 hour and ssld = 10 mins for both universities. (We also experimented with different swin and ssld values but found that the F1 score did not change much at both universities over the training 115 dataset). After evaluating on our training dataset, we observe that this approach flags more L sets as malicious—with a recall of 99% at U1 and 98% at U2 for malicious requests. It also has a precision of 71% (U1) and 61% (U2). Thus, the Sliding Window approach outperforms the Comparison Set method in terms of both recall and precision during the training period of Gossamer logs. This tendency to flag manyL sets as malicious can be a double-edged sword. On the one hand, it may hurt the precision of Sliding Window; on the other, it may help discover previously undiscovered attacks. We delve more into this tradeoff in Section 4.5. Per-day Clustering (PDC). In the clustering approach used for Araña, Islam et al. [64] clustered L sets into attacks using a set of volumetric, client-level, and password-based features. However, their clustering approach relied on seeing the whole time period of data before clustering into attacks, which does not allow an analyst to detect attacks in a timely fashion. We now test whether this clustering approach could be applied at the end of each day, allowing for timely detection. We apply Arañato each day of login requests to obtain a set of clusters and calculate the combined feature values of the resulting clusters (normally taking an average or sum of the values). We then compute the DAS score for each cluster w.r.t. each other cluster, and we do this using each set of sub-detector features separately. As before, any positive DAS score indicates that a cluster is malicious, and if a cluster is flagged mali- cious, we consider all L sets (and therefore all requests) within that cluster as malicious. This approach exhibited the best performance so far with F1 scores of 82% and 55% for requests at U1 and U2, respectively. More importantly, this approach maintains its performance over the holdout period, as we highlight 116 # of L sets U1 U2 False alerts based on Araña 1,851 367 Manually analyzed w/ NR > 5 229 367 Manually flagged as malicious 49 221 Manually flagged malicious and detected by • Comparison Set 38 48 • Sliding Window 11 133 • Per-day Clustering 0 110 Figure 4.4: Manual analysis of the false positives (FPs) generated by the three DAS-based approaches. in Section 4.5. 4.5 Attack Detection Performance We now evaluate the performance of the three DAS-based detection approaches described in Section 4.4.1 on the holdout period of Gossamer logs. We focus on answering two questions: how well each approach can capture the attacks reported in prior work [64] and how often it generates false alerts. We report the precision, recall, and F1 scores of the three approaches over the holdout period in Figure 4.2. In general, DAS-based attack detection approaches performed better than the volumetric and ML-based ones. The Comparison Set approach performed relatively poorly with an F1 score of 38% for requests at U1 and 37% at U2. Interestingly, the Sliding Window approach performed quite well at U1 with a precision (85%) and recall (82%) but had too many false positives (with respect to the prior approaches) at U2. Per-day Clustering performed the best among the three DAS-based approaches at U1, achieving perfect recall for both requests and L sets, as well as good precision. For U2, Per-day Clustering in general 117 Comparison Set 2021-02-18 2021-02-20 2021-02-22 2021-02-24 2021-02-26 2021-02-28 2021-03-020 20 40 60 80 Days # of al er ts (L se ts ) Total alerts True alerts Known attacks Sliding Window 2021-02-18 2021-02-20 2021-02-22 2021-02-24 2021-02-26 2021-02-28 2021-03-020 20 40 60 80 Days # of al er ts (L se ts ) Total alerts True alerts Known attacks Per-day Clustering 2021-02-18 2021-02-20 2021-02-22 2021-02-24 2021-02-26 2021-02-28 2021-03-020 20 40 60 80 Days # of al er ts (L se ts ) Total alerts True alerts Known attacks Figure 4.5: For the three DAS-based approaches, we report the total number of alerts (# ofL sets flagged malicious) generated per day (red line) and how many of them are true alerts (blue line) over the testing period at U2. We also report the number of known malicious L sets on each day (green line) which is the same for all approaches. A smaller gap between “total alerts” and “true alerts” indicates higher precision, and a smaller gap between “true alerts” and “known attacks” indicates higher recall. performed better than the approaches but had a low precision. Since Per-day Clustering is based on Araña, this high performance is not surprising. The low precision at U2 is due to potential attacks that were not detected by Araña [64]. Capturing known attacks. Figure 4.3 illustrates the performance of the DAS- based detection mechanisms on the 11 attack clusters present in the holdout period. 118 At U1, we observe that both the Comparison Set and Sliding Window ap- proaches struggle to detect two high volume attack clusters (#3 and #4) which are likely part of the same attack, as they occurred on the same day and dis- played similar behavior. The Per-day Clustering approach performed the best at U1, with a 100% recall. We believe this is because clustering can group the low volumeL sets exhibiting similar malicious behavior together (such as in clusters #3 and #4); and since Per-day Clustering calculates DAS scoring at the cluster level, these low volume L sets are unable to escape detection when grouped together. At U2, Comparison Set misses one attack cluster #12 completely, along with all but one L set from cluster #22, four L sets from cluster #23, producing a low recall of 61% for requests and 43% for L sets. On the other hand, the Sliding Window and Per-day Clustering approaches display a higher recall of > 81% for requests and > 95% for L sets. In particular, the Sliding Window approach captures all attack L sets at U2 but has a lower precision of 27%. The Per-day Clustering approach captures all but threeL sets, resulting in a recall of 98% and 81% for L sets and requests respectively. However, unlike the Sliding Window approach, Per-day Clustering maintains a higher precision. Analyzing false alerts. At both universities, the DAS-based detection ap- proaches flagged a number of L sets as false positives according to the Araña- ground truth [64]. We manually analyzed these false positives to confirm them as either benign or malicious (meaning they were initially missed by Araña [64]). To do this, we take all the false positive L sets (1,851 at U1 and 367 at U2) and filter out those containing five or fewer requests, as it is too difficult to make 119 a judgment on such a low volume L set. This filtered out 1,622 L sets at U1 and none at U2. We collaboratively performed a manual analysis of theseL sets and inspected if they were clearly malicious as described in prior work [64]; if not, we consider them benign. Figure 4.4 presents a quantitative breakdown of false positive L sets (based on Arañaground truth) that our conservative manual analysis confirms as ma- licious. We find 49 L sets at U1 and 221 at U2 to be clearly malicious For example, at U1, the Comparison Set approach flagged 378 low volume L sets that appear on the same day as attack cluster #9. We manually analyze 38 of these 378 L sets having NR > 5 and find all 38 L sets to be part of a larger attack encompassing attack cluster #9. We do not make a decision about the other 340 L sets with NR ≤ 5 because it is difficult to make a judgment for an L set sending less than five requests. These 38 L sets were not flagged by Arañabecause they were discarded by the “HFR” heuristic that filtered out L sets sending less than 10 login requests at U1 (NR ≤ 10). By treating those 38 L sets as true alerts, precision improves slightly from 19% to 28% at the L set level and 84% to 86% at the request level. The Sliding Window approach flagged an overwhelming 1,684 false positive L sets at U1, 22 of which had NR > 5. After manually analyzing these 22, we found that 11 were indeed malicious. This increased the precision slightly from 16% to 18% at the L set level and 85% to 86% at the request level for Sliding Window. Although the Per-day Clustering approach flagged 168 false positiveL sets at U1, none appeared malicious under manual inspection, as all have NR ≤ 5. At U2, we find that 221 (60%) of the false positiveL sets are clearly malicious. These L sets were missed by Arañabecause Arañafocused on high volume at- 120 tack clusters, only looking at clusters with NR ≥ 5, 000 ∨ NU ≥ 5, 000. Thus, it missed these L sets that were part of lower volume clusters, even though they exhibit similar malicious behavior as the high volume attack clusters. We aug- ment our ground truth of known malicious L sets from Araña [64] with these manually confirmed 221 malicious L sets at U2 and reassess the precision of all three detection approaches. For the Comparison Set approach, the precision improves to 92% (from 23%) for requests and 78% (from 32%) forL sets. For the Sliding Window approach, we observe a similar trend—the precision improves to 86% (from 22%) and 46% (from 27%) at the request and L set levels respec- tively. Similarly for Per-day Clustering, precision improved to 83% (from 33%), and 93% (from 52%) at the request and L set levels respectively. Per-day workload analysis. Since the goal of these detection approaches is to flag malicious L sets that a human analyst can verify, it is important to under- stand the total number of alerts an analyst would receive each day. On average, the Comparison Set approach raises 11 alerts per day at U2, Sliding Window raises 38, and Per-day Clustering raises 21. However, some of these alerts could be false positives, burdening the analyst. To better understand this, we show (in red) the total number of L sets flagged as malicious at U2 during the holdout period in Figure 4.5. We aug- ment the Arañaground truth with the manually labeled 221 malicious L sets at U2 and show in green the number of known attacks on each day (this is the same for all detection approaches). The blue line represents how many of the alerts flagged by the detection approach are true attacks. We observe that the Comparison Set approach does not perform well, miss- ing many attacks and raising many false alerts. The Sliding Window approach 121 captures all known attacks (thus the known attacks and true alerts lines overlap) but sacrifices precision by raising too many false alerts. The Per-day Clustering approach prioritizes precision (thus the total alerts and true alerts lines over- lap), although it does miss a few known attacks that Sliding Window is able to capture. Considering also that Per-day Clustering has the best performance at U1, we believe it would be suitable for an analyst who prefers high precision and moderate recall. 4.6 Robustness of Malicious Login Detection A major concern when deploying an attack detection approach is how robust such an approach would be against adversaries that modify their attack strategy based on the defense to evade detection. In this section, we model such evasion attacks and empirically measure their efficacy. 4.6.1 Model for Evasion Attacks Let L denote a request — a tuple of information submitted to the login server. This tuple includes a username L.u, a password L.w, the IP address of the client L.ip, the user agent of the client browser L.ua, and a timestamp L.t. We assume for simplicity that all timestamps of the requests (adversarial or other- wise) are unique, allowing a total ordering over any set of requests. A tran- script T then is a sequence of some number of requests ordered by timestamp — T = [L1, L2, . . .] . We denote combining two transcripts via T = T1 ∪ T2. The set of 〈u,w〉 corresponding to a T we call a “guess set” G = {(L.u, L.w) | L ∈ T }. 122 A login server receives login requests from many other users in addition to the attacker-controlled login requests; we refer to this as the background traffic and denote it by a transcript B — it may include a mix of malicious and benign traffic. A defense strategy D is an algorithm that takes as input a transcript of requests T and outputs for each attempt whether it is malicious or benign. An evasion attack converts an existing malicious attack transcript A = [L1, . . . , Lm] of some length m into a modified transcript A′ ∪ S . We allow the attacker to modify the timestamp, user agent, or IP address of requests in A (or leave them unchanged) to produce a new transcript A′; the evasion attacker is not allowed to change the username and password in these requests and |A| = |A′| = m. However, the evasion attacker may add some number of ad- ditional spurious requests S with whatever values they like. To gauge the efficacy of an evasion, we compare how easily the unmodified attack A and the modified one A can be detected relative to some background traffic B. In more detail, we runD on B ∪ A and let c be the number of attempts from A marked by D as malicious. We then run D on B ∪ A′ ∪ S and let c′ be the number of attempts from A′ that are marked as malicious. Then the success rate of an evasion attacker is simply the fraction of attempts that are rendered benign after introducing evasion, namely (c − c′)/m . We assume the attacker knows howDworks but not the contents of B. Modeling resources for A. Given sufficient resources, there are always straightforward evasion attacks for any of the defense mechanisms we have considered. For example, given sufficient time, an attacker could send one re- quest per day. Given sufficiently many IP addresses, the attacker could send one request per IP address, which no detection mechanism can currently differenti- 123 ate from benign login traffic. Thus, the goal of existing detection approaches is to ensure that evading detection requires a large amount of resources. Con- versely, an optimal evasion attacker is one who minimizes the resources re- quired for a maximal evasion success rate. Since solving this minimax opti- mization problem is challenging, we adopt a practical approach as follows. We assume that the attackers have constraints on the resources used. We consider three types of resources: (1) I(A) denotes the number of IP addresses used in the attack traffic A; (2) R(A) denotes the number requests sent over the internet to the login server, which is equal to |A|; and (3) T (A) denotes the total time used to send all the attack traffic, that is T (A) = maxL∈A L.t −minL∈A L.t. We are interested in the success rate of evasion attackers that operate within some restricted amount of additional resources. Relative to an attack transcript A, let θip be the number of additional IP addresses the attacker has access to beyond those in A, let θr be the number |S | of additional spurious requests the attacker is allowed to send, and let θt be the amount of extra time to execute the attack. Thus, an evasion attack A for a transcript A will be valid for resource constraints θip, θr, θt if any possible output A′, S from it is such that I(A′ ∪ S ) ≤ I(A) + θip and R(A′ ∪ S ) ≤ R(A) + θr and T (A′ ∪ S ) ≤ T (A) + θt. Simulating evasion attacks. With the above formalization, simulating evasion attacks is simple. To do this, we give the transcript of an unmodified attack cluster (A) to the attacker. The attacker is allowed to spend a fixed amount of resources 〈θip, θr, θt〉 to generate B ∪ A′ ∪ S . 124 4.6.2 Evasion Strategies Based on the way resources are used by the evasive attacker, we devise different evasive strategies to test the robustness of the three DAS-based detection ap- proaches. These evasive strategies are easy to launch and do not require signif- icantly modifying existing attack scripts or tools (e.g., OpenBullet2 [16], Sentry MBA [?]); yet, as we will show, they are effective in evading many attack cam- paigns. Broadly, these evasive strategies can be categorized into two classes: introducing chaff which distracts the detection algorithms from the actual attack traffic and blending the attack traffic into the benign background traffic. We do not experiment with evading detection from the volumetric or machine learning approaches discussed earlier, as they already do not perform well. 1 Introducing chaff L sets. Attackers can introduce chaff to distract the de- tector from the actual attack by sending malicious-appearing, non-attack logins from their set of IP addresses. For example, recall that the Comparison Set de- tection method creates a comparison set consisting of the top δ × β L sets with the highest DAS score in the last δ days. An attack L set is flagged as malicious if it has a positive DAS score against any other L sets inside the comparison set. By introducing chaff, the attacker could create new L sets that score higher than the existing L sets in the comparison set so that the chaff L sets entirely populate the comparison set; these chaff L sets would be designed so that the attack L set never has a relative DAS score higher than these chaff L sets. 2 Blending in using sacrificial logins. Attackers can also attempt to blend their malicious login requests into the background benign login requests. To do this, the attacker can add additional spurious login requests with their attack, 125 which we call sacrificial requests, as they aim to reduce (or hide) the suspicious- ness of malicious L sets by making them appear more like the surrounding benign traffic. For example, consider an attack L set originally containing 100 requests. If 40 requests from this L set contain a breached password, the fraction of breached passwords (FPIB) feature would be 0.4. Now if the attacker sends 100 additional sacrificial requests containing a password not present in breach data, that attack L set would now send 200 requests in total with a less suspi- cious FPIB value of 0.2; this reduced suspiciousness could help the malicious L set blend into the benign L sets and avoid detection. 3 Blending in by redistributing the guess set. The attacker could also at- tempt to blend into background benign traffic by redistributing their guess set across additional IP addresses (by changing the IP address fields in their attack transcript). For example, consider the case of detecting targeted behavior using the AUP feature. If an attack L set exhibits targeted attack behavior by send- ing 25 different password guesses to a single user, then it would have a AUP of 25 and would likely be detected for such a high AUP value. However, the attacker can evade detection by sending these 25 different password guesses from 25 different L sets. Each of these 25 L sets would now have an AUP of one, effectively avoiding detection, as an AUP value of one does not exhibit tar- geted behavior. For the sake of evaluation, we assume that the attacker evenly splits the total number of requests and the number of requests exhibiting any suspicious feature across all L sets. 4 Blending in by slowing the attack. Finally, an attacker can insert additional 126 delay between successive login requests to minimize the suspiciousness of the timestamp-based features (i.e., MIT, SIT). This strategy is effective for evading malicious L sets that send many requests within a short period of time. For example, consider an attacker sending NR login requests from an IP within t seconds. Assuming the attacker is evenly distributing these NR lo- gins over the time period t, the attack L set would have a mean interarrival time of MIT = t/NR. By remaining active for an additional θt period, the at- tacker could increase (and thus reduce suspiciousness of) its MIT feature value to (t + θt)/NR. Attack tools (e.g., OpenBullet2 [16]) typically provide such an op- tion to reschedule logins to a given interval. Of course, this would increase the cost for the attacker, requiring them to hold the IP addresses for a longer period of time. This list is not exhaustive; an attacker may find other ways to craft evasive attacks, and we detail a few of these at the end of Section 4.6.3. However, we focus on the four described evasion strategies in evaluating our three proposed detection approaches, as we believe they are indicative of what an attacker may try and could be effective in evading detection mechanisms. 4.6.3 Evasion Results To investigate the efficacy of the evasion strategies, we explore a set of reason- able amounts of resources an evasion attacker could use for evading the attack clusters: θip ∈ {10, 20, 35}, θr = 1, 000, and θt = 2 hours. The results of our exper- iments are shown in Figures 4.6 and 4.7. In these figures and when describing the evasion results, we refer to each attack cluster using the same ID as from 127 ID / Univ. Type L NR % NR evaded Redistributing Chaff L sets θip = 10 20 35 θip = 10 20 35 1 / U1 C 3 18,093 0 0 0 0 0 100 2 / U1 C 1 17,827 0 0 100 0 0 100 3 / U1 § C 47 2,080 0 1 37 49 49 100 4 / U1 § C 8 1,467 61 22 58 30 96 100 5 / U1 § C 1 4,593 100 100 100 100 100 100 6 / U1 C 1 2,345 100 100 100 0 0 100 7 / U1 C 1 1,481 0 0 14 100 100 100 8 / U1 C 28 472 18 56 90 69 69 100 9 / U1 C 29 30 100 100 100 97 97 100 11 / U2 §F B 663 27,750 100 100 100 100 100 100 12 / U2 § C 8 9,123 0 24 45 97 100 100 14 / U2 C 2 65 100 100 100 100 100 100 17 / U2 C 32 4,925 24 33 56 100 100 100 18 / U2 C 6 4,454 0 0 31 52 100 100 19 / U2 § C 4 4,783 0 0 9 100 100 100 20 / U2 § C 4 5,199 0 62 70 49 100 100 22 / U2 § T 1 28 0 100 100 100 100 100 23 / U2 §F T, B 48 1,250 0 0 23 71 81 100 24 / U2 § T 4 101 0 0 53 100 100 100 25 / U2 § T 3 80 0 0 100 34 100 100 26 / U2 T 1 27 0 0 100 100 100 100 27 / U2 T 1 27 0 0 100 100 100 100 28 / U2 T 1 25 0 100 100 100 100 100 29 / U2 T 4 468 0 0 0 0 0 0 Figure 4.6: Evasion results for the Comparison Set approach when attempting to evade attack clusters by redistributing the attack transcript over θip IP addresses or inserting θip chaff L sets. prior work [64] (as indicated by the “ID” column). For completeness, the list of all attack clusters from [64] is shown in Appendix C.5. For each attack clus- ter, we use the symbols B, C, and T in the “Type” column to indicate whether the cluster was detected by the ΓBursty, ΓKnwnCrd or ΓUnkwnCrd, or ΓTar feature sets, respectively. Comparison Set. We apply the evasion strategies to the nine attack clusters at U1 and 15 at U2 detected by the Comparison Set approach. Specifically, we apply the redistribution and chaff strategies with θip ∈ {10, 20, 35} IP addresses; 128 ID / Univ. Type L NR % of NR evaded θip = 10 20 35 10 / U2 F B 12 169,573 0 0 0 11 / U2 F B 663 33,304 100 100 100 12 / U2 § C 5 4,521 66 0 100 13 / U2 C 3 12,240 58 45 100 14 / U2 C 843 7,104 67 24 16 15 / U2 C 1 5,287 100 100 100 17 / U2 C 3 4,925 100 100 100 18 / U2 C 6 4,219 83 0 100 19 / U2 § C 4 3,938 0 0 100 20 / U2 § C 4 5,199 0 0 100 21 / U2 C 2 5,103 50 0 100 22 / U2 F§ T, B 74 1,878 49 66 77 23 / U2 F§ T, B 52 1,250 43 72 76 24 / U2 § T 4 101 100 100 100 25 / U2 § T 3 80 100 100 100 26 / U2 T 1 27 100 100 100 27 / U2 T 1 27 100 100 100 28 / U2 T 1 25 100 100 100 29 / U2 T 4 468 0 0 0 § These attack clusters appear during the holdout period; the others occur during the training period. F For attack clusters that exhibit bursty activity, we slow down the attack by θt = 2 hours; Figure 4.7: Results of the three blending evasion strategies against Per-day Clus- tering technique. None of the attacks at U1 could evade Per-day Clustering. and for the two attack clusters (#11 and #23 at U2) exhibiting bursty behavior, we also add an additional θt = 2 hours to slow down the attack. The results are shown in Figure 4.6. We first try redistributing the guess set across additional IP addresses; this evasion is partially effective, with ten attack clusters evading completely with θip = 35 IP addresses. Next we try inserting chaff L sets to replace the existing L sets in the comparison set. For this evasion attack to work, an attacker would need the chaff L sets such that they have higher DAS score than the ones cur- rently in the comparison set, but not strictly less suspicious than the attackL set, as that would prevent evasion. We accomplish this by setting the feature values 129 of these chaff L sets as follows. For the number of requests per L set NR, we take the 50th percentile value of allL sets inside the comparison set and thus set NR = 19 at U1 and NR = 27 at U2 (An attacker could potentially compute these percentiles from historical login data they might have access to). For the other relevant features, we max out their suspiciousness by setting FPIB = FUIB = 1 and FCIB = FTP = 0.5 at both universities. We also set AUP = NR to allow for evasion of targeted attack clusters. Thus we create chaff L sets with very high suspiciousness along all features, producing a higher DAS score which is neces- sary to be added to the comparison set. This also ensures that the actual attack L sets do not dominate any of the chaff L sets and get flagged. Chaff L sets and requests must be sent at least a day before the attack is executed. Using chaff IP addresses in this manner allows an attacker to fully populate the comparison set with chaff L sets for θip = 35, effectively evading all but one attack cluster. This evasion only requires 665 requests at U1 and 945 at U2. However, this approach fails against attack #29, as none of the evasion strategies are effective. This is because attack #29 had an AUP of 37, exceeding the AUP of the employed chaff L sets (≤ 27). Consequently, the attack dominates the chaff L sets for the ΓTar sub-detector. Sliding Window. Since the Sliding Window approach considers all L sets within a window, introducing new chaff L sets would not help distract a detec- tor from an actual attack. Instead, attackers might try the blending in approach by inserting θr sacrificial logins, redistributing the guess set evenly across θip extra IP addresses, or slowing down the attack with an extra time period θt. Therefore, we attempt to evade all 29 attack clusters detected by Sliding Win- dow with θr = 1, 000, θip ∈ {10, 20, 35}, and θt = 2 hours. 130 We find that this blending in strategy is only effective in evading attack clus- ters exhibiting targeted behavior. Specifically, the eight targeted attack clusters at U2 can be evaded by the redistributing login requests across only θip = 10 IP addresses. The other 21 attack clusters from both universities exhibiting bursty, credential stuffing behavior can not be evaded even with θr = 1, 000, θip = 35, and θt = 2 hours. Per-day Clustering. Similar to Sliding Window, Per-day Clustering also con- siders all clusters within a single day when assessing whether a cluster is mali- cious. Therefore, it is relatively resistant to the evasion strategy of introducing chaff L sets. However, an evasion attacker can still exploit the Agglomerative clustering mechanism by using the three blending in evasion strategies to merge malicious and benign clusters into one or split an attack cluster such that DAS scoring can- not flag it as malicious. Existing work on poisoning Agglomerative clustering focuses on single [26] or complete [25] linkage, but Per-day Clustering uses mean linkage. Future work can look at how to poison Agglomerative clustering with mean linkage and how to use that poisoning technique to evade attack clusters detected by Per-day Clustering. For now, we rely on using the three blending in strategies with θip = [10, 20, 35] IP addresses, θr = 1, 000 login requests, and θt = 2 hours. Thus, the attacker could first redistribute their attack transcript A across these additional θip IP addresses. Then to hide A further, they can add θr = 1, 000 sacrificial logins evenly distributed among the L sets in the attack clusters to decrease the suspi- ciousness of the feature values. Lastly, to hide any bursty activity, the attacker could lengthen their attack by θt = 2 hours. Figure 4.7 (right) shows the results 131 of our evasion attempts. At U1, evasion attacks against Per-day Clustering are unsuccessful for all ten attack clusters. At U2, we observe a progressive trend in evasion success as we increase the number of additional IP addresses θip while holding θr = 1, 000 and θt = 2 constant for all 16 attack clusters. In particular, with θip = 35, 13 out of 16 attack clusters evade completely. Attack #11 can evade detection just by lengthening the attack by θt = 2 hours. Discussion of other evasion strategies An attacker could also try other attack strategies that don’t use extra resources. For example, since Per-day Clustering approach considers client-level features (i.e., the ISP of the IP address and the user agent string) while clustering L sets, the attacker could try to manipulate these features of the attack L sets to avoid detection by randomizing the user agents or using an IP addresses belonging to a cloud ISP popular among benign traffic. We find that by changing the ISP of all theL sets to a Microsoft Azure ISP and setting the user agent string to that of a browser on a mobile device, attack #22 can completely evade detection from Per-day Clustering without needing any extra resources. 4.7 Discussion & Future Work Deploying detection mechanisms in practice. The three DAS-based ap- proaches rely on four feature sets, including two involving password-derived features. Safely collecting these features requires a framework such as Gossamer [29] that can log such information without compromising user pri- vacy and security. 132 Implementing and running these approaches is not difficult in practice. DAS is straightforward to implement and runs in O(n2) time, where n is the number of L sets the login server observes in a day. The Per-day Clustering approach also requires O(n2) time to cluster the n L sets before performing DAS scoring. We run our experiments on an Intel i7-4790 Ubuntu Linux machine with eight cores and eight GB of memory at U1 and an Intel Xeon Linux machine with 56 cores and 125 GB of memory at U2, and we parallelize the DAS and Arañaclustering calculations using 40 threads. On average, all three approaches finish in less than a minute per day. Comparing results between the two universities. Although the Per-day Clus- tering approach performs well at both universities, we observe contrasting trends when comparing the performance of the other two approaches (Com- parison Set and Sliding Window) between the two universities (Figure 4.2). At times, drawing a direct comparison between these two universities is difficult since the attacks they experienced exhibit different behaviors, both quantita- tively and qualitatively in many cases. For example, at U2, nine attack clusters exhibited targeted behavior, and four exhibited bursty behavior; whereas at U1, no attack clusters exhibited either behavior. This is possibly because, as sug- gested in [64], Gossamer collected data from every login endpoint at U2. But at U1, Gossamer collected logins from only the main login endpoint. We also observe differences in manually analyzing the false positives. At U2, 367 L sets were initially labeled as false positives by the three approaches. After manually analyzing, we found that 221 (60%) of them were actually ma- licious. In contrast, at U1 1,851 L sets were initially labeled as false positives, and only 49 (2.6%) of them were confirmed as malicious after manual review; 133 this was partly because the majority of false positives L sets were too low vol- ume (NR ≤ 5) to make a judgment. Another contributing factor may be the thresholds selected to filter clusters. At U2, Arañaconsiders attack clusters as malicious if they meet the criteria NR ≥ 5, 000 ∨ NU ≥ 5, 000; however, these thresholds were decreased to 1,000 at U1 because it experienced lower volume attacks. Thus Arañamay have originally missed many attacks at U2 due to this higher threshold. Future work can investigate how the performance of detec- tion mechanisms generalizes or changes based on the organization concerned. Optimizing evasion attacks. It would be interesting to know, given an attack cluster, what are the minimum resources needed to evade detection for a given approach. While this is a complex (and probably NP-hard) optimization prob- lem to tackle, we perform some initial exploration using a simple heuristic to optimize resources θip, θr, and θt needed by attackers. As mentioned earlier, op- timizing the required resources requires the attacker to have knowledge about other login requests, which we assume the attacker has for this experiment. We find that attack cluster #11 (which is not evadable with θt = 2 hours) can be evaded with θt = 5 hours. For cluster #29 which also did not evade with θt = 5, θip = 45 is required for a successful evasion. Attack clusters not exhibiting bursty behavior do not need any extra time. More details about our heuristics and ex- periments are in Appendix C.4. Preventing evasion attacks. We explore two adaptations to the DAS detection mechanisms that may improve their robustness to evasion attacks. Our initial attempt was to randomly sample L sets to populate the comparison set instead of choosing the ones with the highest DAS scores. Such randomization would make it more difficult for an attacker to control whether their chaff L sets get 134 into the comparison set and thus limit their ability to distract from an attack. While this performed well on the training period at U1, it performed poorly on the test period with F1 scores of 38% at U1 and 43% at U2. In our next attempt, instead of scoring an L set with DAS against other L sets, we take the median value for each feature in the four sub-detector feature sets, and we flag an L set as malicious if it has a more suspicious feature value than the median for all features in a sub-detector. Because of the simplicity of this approach, it would be easy to deploy in web authentication systems and will be efficient to compute. We found this approach performed well at U1 with an F1 Score of 87% for the test period. However, it performed poorly at U2 with an F1 Score of 31%. Adjusting the thresholds to a higher percentile (rather than the median) improved performance slightly, and we found that the best perfor- mance at U2 occurred at the 34th percentile with an F1 Score of 54% over the test period. Because of the varied performance, this approach may not be gen- eralizable across different services. Further work on detection methods should investigate how login servers can combine different approaches that reduce the ease of evasion while preserving the ability to detect attacks. 4.8 Conclusion We conduct a comprehensive investigation into the efficacy and resilience of timely malicious login detection approaches, evaluating existing volumetric and machine learning-based defenses on the Gossamer dataset comprised of 34 million real-world login requests. Our findings indicate that current defenses underperform against the intricate attack behaviors observed in real-world sce- 135 narios. We then explore how to develop approaches that are better at captur- ing such complex attacks without raising too many false alerts by leveraging an anomaly event scoring mechanism. Lastly, we underscore the importance of considering evasion attackers when deploying such defenses and show how current detection mechanisms can be broken by an evasive attacker. 136 CHAPTER 5 CONCLUSION In this work, we designed and built a framework to collect information on sub- mitted measurements, and then we used the resulting measurements to detect password guessing attacks. Our goal in starting these projects was to improve password-based authentication through observations on real data and develop new approaches for detecting guessing attacks, and we feel that we accom- plished this goal. 5.1 Overarching Observations Through this work we make the following observations. Passwords provide valuable information for detecting attacks. We showed in Gossamer [29] how password-derived information can be logged in a secure manner. Different types of guessing attacks can then be characterized using dif- ferent combinations of password-derived measurements, and such information can be helpful when designing attack detectors or even for an analyst when manually reviewing the data. Existing login systems could gather such infor- mation using Gossamer [29]; or if that is not feasible due to the implementation effort or system requirements, a credential checking tool such as Might I Get Pwned (MIGP) [94] can be used to check whether credentials and their variants have appeared in a breach. Clustering is helpful for detecting attack campaigns. Through our work on Araña [64] and in investigating timely detection approaches [65], we saw sev- eral distributed attacks containing dozens or hundreds of IP addresses that may not have been easily identified as the same attack without a clustering approach. More work is needed to develop detection methods that are resistent to evasion attacks. We showed in our final project [65] that attackers can use dif- ferent approaches to attempt to conceal their attacks, and these approaches can be very effective in evading detection. Login system developers should consider an evasive attacker when designing attack detection mechanisms. 5.2 The Authentication Ecosystem In recent years, outsourcing authentication to third parties such as Okta [6] has become more and more common. This could help reduce compromises, since these larger, authentication-focused companies have more resources to build more secure systems and better attack countermeasures, although a compro- mise at such an authentication company could be much more damaging. Passkeys and biometrics are also quickly becoming the recommended method of authentication [33, 55], which is another positive for the authenti- cation space because they are less prone to guessing attacks. However, pass- words will probably remain the dominant authentication mechanism for quite some time both to support backwards compatibility (e.g., rolling out passkeys for most users but allowing alumni users to log into their school accounts with their passwords) and because many companies may not have the resources to roll out a new authentication mechanism. Our work underlines some of the weaknesses of passwords and confirms 138 some findings in prior research — such as that password typos are very com- mon and credential stuffing is highly prevalent and damaging. Our findings support the use of password-derived information to detect guessing attacks. Cloudflare has already started allowing companies to use tools such as Might I Get Pwned [94, 116] on their login traffic to detect credential stuffing attacks. Other companies could also consider doing so; adding such checks could be done easily without deploying a full password measurement infras- tructure. Companies should also consider proactive breach alerting during lo- gins and password changes to prevent credential stuffing attacks before they happen. In particular, Might I Get Pwned checks whether a password or close variant of the password has appeared in a breach, and our measurements at Cloudflare showed that it flags 20% more vulnerable passwords than an exact match breach alerting service [94]. As more and more companies roll out guessing attack countermeasures, though, we can expect to see more advanced attacks such as credential tweaking (which we did not observe with Gossamer, perhaps because automated tools have not advanced to that level yet). We can also expect attackers to work harder to conceal their attacks, and as alternative authentication mechanisms like passkeys become more ubiquitous, attackers may move more towards tar- geting account recovery methods. As the authentication ecosystem moves and changes, we hope that our work can help pave the way for more measurement studies on passwords and other authentication mechanisms. 139 APPENDIX A GOSSAMER: SECURELY MEASURING PASSWORD-BASED LOGINS A.0.1 Measurements Taken We show all data stored in the ephemeral and persistent databases in Figure A.1. Note that the raw password is only stored in encrypted format for 24 hours in the ephemeral database. The symbol × indicates that a field was stored in plaintext, and ⊗ indicates it was encrypted before storing in the in- dicated level of storage. A.0.2 Filtering Out Attacks We excluded high volume attacks before reporting summary statistics to cap- ture an accurate description of regular user behavior. The remaining harder- to-detect adversarial behavior appears to be an insignificant fraction of logins. We were given access to compromise account reports from the instrumentation time period (as described in Section 2.3. We found that an average of 190 com- promised accounts were reported every month. Overall, less than 1% of the total user population in the monitoring period were compromised, less than 1% of all logins were to accounts that were compromised at some point in the measurement period, and 2% of IPs were associated with those requests. The compromise report database logged compromise usernames, but not specific requests or IP addresses corresponding to the user’s compromise; thus it was difficult to exclude all attacks without filtering out an even larger por- Field Eph. Pers. Basic statistics username ⊗ ⊗ password ⊗ IP address × × receipt timestamp at the login server and at Gossamer × × receipt timestamp at Gossamer × × HTTP headers × × result (success or failure), result code × × zxcvbn score (bucketized to 0, 1) × password was malformed × Credential stuffing measurements username appeared in the breach data × password appeared in the breach data × username-password pair appeared in the breach data × breach source of the username, password, or pair × Credential spraying & dictionary-based guessing measurements password appeared in – top {10, 100, 1,000} most common breached pws × – top {2,000, 5,000} hashcat-generated pws × – top {2,000, 5,000} RockYou pws × was password frequently submitted today? × was username frequently submitted today? × Credential tweaking measurements PPSM [93] strength of password × guess rank due to credential tweaking attack [93] × edit dist. ≤ 2 of pw from – other submissions for same username × – other submissions for same IP × Figure A.1: Measurements we log in ephemeral (Eph.) and persistent (Pers.) storage. tion of benign behavior. Excluding all login requests to usernames that were compromised at any point in our instrumentation period did not significantly change the summary statistics we reported. However, filtering out the high- volume attacks, as we did in the paper, did change some statistics. To concretize this, one statistic that we believe is more sensitive to whether attack traffic is included in measurements is the fraction of requests using a breached password. With the high volume attacks included in the calculation, this fraction was 4.73%. This percentage decreased to 2.36% when we filtered 141 out high volume attacks, but only further decreased to 2.34% when we filtered out all requests associated with a compromised username. Characterizing benign behavior in the presence of attacks is a fundamen- tal challenge, given the lack of ground truth. Future work on improving the identification of adversial behavior can give further insights and confirm our characterization of benign user behavior. A.0.3 Login Statistics We show some additional statistics about the login requests recorded by Gossamer. Figure A.3 shows some additional statistics we reported earlier in Figure 2.4. Figure A.2 shows the distribution of operation systems as parsed from the user agents of the requests at both universities; Figure A.4 shows the top 10 most common user agents we saw at both schools. OS U1 U2 Windows 36.13% 23.96% Mac OS X 29.47% 19.58% iOS 26.20% 43.47% Android 5.04% 3.18% Linux 2.43% 0.31% Other 0.73% 8.96% Figure A.2: The distribution of operating systems (OS) as detected in the user- agent of all the requests (after removing requests containing empty user-agent string at U2). 142 U1 U2 Submitted password statistics % req. w/ password in breach† 2.71% 0.10% % req. w/ username in breach† 5.31% 3.08% % req. w/ user-pwd pair in breach† 0.07% 0.01% % failed req. – containing a typo 29.67% 12.04% – containing a typo (with edit dist msmt) 62.39% 58.37% – from mobile device containing a typo 38.63% 38.36% – from mobile device containing a typo (with edit dist msmt) 72.69% 81.87% % pwds tweaked 0.92% 0.34% % pwds w/ zxcvbn score of 0 0.06% 0.40% % pwds in top 5k hashcat < 0.01% 0.06% % pwds in top 5k rockyou 0.02% 0.17% % pwds in top 1k breach comp 0.01% 0.10% Session Statistics (with a 360s threshold) Avg. session size 2.25 2.21 99th percentile session size 10 6 % abandoned sessions 5.47% 1.96% % sessions with at least two attempts 22.24% 38.22% % mobile sessions 41.32% 35.45% % sessions with a typo 2.64% 0.85% % mobile sessions with a typo 0.01% 0.20% Avg. num sessions per user per day 1.74 9.23 User Statistics # of unique usernames seen 196,424 309,801 # of valid users 177,286 169,774 # of active users 130,695 110,476 % valid users w/ weak passwords 0.03% 0.06% % valid users w/ username in breach† 5.79% 3.27% % valid users w/ passwords in breach† 2.92% 9.34% % valid users w/ user-pw pair in breach† 0.01% 0.15% % valid users w/ tweaked password 1.22% 0.66% % valid users w/ no failed attempts 33.21% 58.02% % valid users who may be using pw managers 24.76% 27.34% Avg. fails before a success 1.18 1.19 Avg. devices per user per day 1.51 1.91 Avg. devices per user 14.51 14.97 Avg. IPs per user 8.70 10.56 Avg. successful IPs per user 10.65 17.63 Avg. user agents per user 6.15 3.99 Avg. unique passwords per user 1.96 9.59 Avg. attempts per unique IP per user 5.86 5.21 Login Statistics Avg. Login requests per day 49,302 246,274 Avg. # of submitted usernames per day 24,822 61,798 % of requests succeeded 94.99% 92.35% % of requests with null user agent 0.48% 34.31% # of requests per day per user – Average 1.99 2.05 – median 1 2 – 99th-percentile 7 12 % of requests from mobile device 31.00% 35.57% % of failed requests from mobile device 24.90% 11.96% Figure A.3: Summary statistics of login requests recorded by Gossamer at U1 (from Dec. ’20 to July ’21) and U2 (from Dec ’20 to Mar ’21). 143 User agent % req. Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 1.54% Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWe- bKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Mobile/15E148 Sa- fari/604.1 0.99% Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 0.36% Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Fire- fox/78.0 0.23% Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 (KHTML, like Gecko) PingdomTMS/0.8.5 Safari/534.34 0.22% Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 0.21% Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.15 0.15% Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 0.14% Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Fire- fox/84.0 0.13% Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66 0.13% User agent % req. Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 3.15% Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 1.56% Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWe- bKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Mobile/15E148 Sa- fari/604.1 1.54% Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36 1.32% Mozilla/5.0 (iPhone; CPU iPhone OS 14_3 like Mac OS X) AppleWe- bKit/605.1.15 (KHTML, like Gecko) Version/14.0.2 Mobile/15E148 Sa- fari/604.1 1.16% Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36 1.12% Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 0.88% Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 0.76% Mozilla/5.0 (iPhone; CPU iPhone OS 14_4 like Mac OS X) AppleWe- bKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Sa- fari/604.1 0.70% Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 0.62% Figure A.4: Top 10 most common user agents at U1 (top) and at U2 (bottom) 144 APPENDIX B ARAÑA: DISCOVERING AND CHARACTERIZING PASSWORD GUESSING ATTACKS IN PRACTICE B.1 Reasons for Compromise Reports As described in Section 3.3, we received the logs of compromised account re- ports from both universities. We show the breakdown of the reasons for com- promise as recorded at U1 in Figure B.1. We that found 69% of accounts were reported as compromised due to large-scale automated attacks (referred to as “Bulk credential testing” at U1). For compromised accounts reported within one hour of the time of attack, 46% were classified as “Bulk credential testing”, 17% as self-reported password compromise, and the remaining 37% were split between the other 10 reasons appearing in the dataset. For the six compromised accounts that took longer than 300 hours (12.5 days) to report, all were reported as “Simultaneous use from different locales.” At U2, the recorded reasons for compromise are very coarse: 99% are re- ported simply as “compromised accounts”. Three accounts were disabled based on requests from the human resources department (former employee), and two were disabled as the user is “deceased”. No username is reported more than once for different reasons. Reason Count Percent Bulk credential testing 1255 69.3% Simultaneous use from different locales 438 24.2% Self-reported password compromise 45 2.5% Blocked for spamming via Office 365 23 1.3% Password misused 20 1.1% Spamming via Office 365 / Gmail 11 0.6% Other (incl. reports from Duo fraud, third-party, etc.) 20 1.1% Figure B.1: The reasons for compromise as noted at U1 and the number of in- stances of such compromise. Users reported multiple times are counted as dis- tinct instances. B.2 Using DAS to Detect Attacks The DAS algorithm [61] has been used successfully in other contexts, such as de- tecting spearphishing attacks. Applied to our context, the DAS algorithm takes a set L1,L2, . . . and orders the sets as follows. For some configured subset of numerical features, we first associate an ordering operator over possible feature values (e.g., higher failure rate FF is more suspicious). Then we associate to Li a score that is equal to the number of other sets L j (i , j) such that Li’s features are all strictly more suspicious than L j’s features. We can then obtain a partial ordering over the set of L sets based on these scores. In finding a configuration of features for DAS, we optimized for the fraction of the top 50L sets as ordered by DAS that were associated with a compromised username, as reported by the security engineers. In doing so, we found that using two simple volumetric features — the number of requests submitted (NR) and the number of unique users contacted (NU) — yielded one of the highest fractions ofL sets associated with a compromised username. In fact, computing DAS with the features NU and NR yielded 41 out of the top 50 L sets and 87 out of the top 100 L sets associated with a compromised username. Adding as a 146 feature the number of consecutive days an IP has been active (CD) increased it to 50 out of the top 50 and 98 out of the top 100. Thus we hypothesize that these three features are most correlated with the current mechanisms used by the university IT offices for detecting attacks that cause compromise reports. DAS with NR, NU, and CD may only be useful for discovering attacks al- ready being caught by existing countermeasures, so we explore extending to further features that the IT offices may not be considering. For example, features such as the fraction of passwords in a breach (FPIB), the fraction of tweaked passwords (FTP), and average unique passwords per user (AUP) may all help in detecting attacks. We tried running DAS on a configuration with a much richer feature set: NR, NU, FPIB, FCIB, FTP, AUP, and FF. Despite manually flagging all 50 as probable attacks, only 19 were flagged as attacks in the compromise database. This indicates that the IT offices may be missing many attacks that could be found with a richer combination of features, and it also suggests that the compromise reports may not be a good ground truth. Our experiments with DAS show that it has promise for discovering attacks. While naively using it with just volumetric features seems to miss more subtle attack behaviors, when used with richer Gossamer-enabled features DAS can even discover (successful) attacks that are not being caught by existing counter- measures. This suggests that future deployments may want to consider using DAS-style approaches for remote guessing attack detection, similar to its origi- nal use with spearphishing. However, from the perspective of our goal of better characterizing attack campaigns, DAS has various limitations. In particular, it cannot group IP addresses into attack campaigns, and distributed attacks that use multiple IPs will be treated as separate attacks. 147 Clustering algorithm Silhouette score ↑ U1 U2 PCA and K-Means++ −0.13 −0.09 DBSCAN −0.39 −0.30 HDBSCAN −0.12 −0.08 Agglomerative +0.19 +0.17 Figure B.2: Silhouette scores of different clustering models. Thus we consider clustering as an unsupervised approach to grouping sus- picious IP addresses into attack campaigns. B.3 Additional Clustering Results Clustering Quality. Before settling on agglomerative clustering for group- ing L sets into potential campaigns, we tried a number of other clustering techniques. We considered the K-means++, DBSCAN, HDBSCAN, and ag- glomerative clustering techniques in this study on the same set of features from Figure 3.2. Since K-means++ cannot work with a custom similarity model, we applied principal component analysis (PCA) on all feature values of L sets and projected them in a two dimensional space to run K-means++ clustering. We also tried projecting to more than two dimensions but did not observe any noticeable effect on the clustering quality. To judge the quality of the clustering techniques, we used the silhouette score [23], which computes the normalized difference between the average inter-class and intra-class distances. We did a grid search over all hyperparameters for each of the clustering techniques and reported the best silhouette score in Figure B.2. All clustering methods except agglomerative clustering received a negative silhouette score, signifying poor quality clusters. We hypothesize that the poor 148 performance is because our similarity model is not a metric similar to Euclidean distance, which is essential for K-means++ to produce meaningful clusters. DB- SCAN performed well when the clusters were of the same density — when L sets belonging to the same cluster are uniformly distributed inside a cluster. However, in our use case, it is common to have attack campaigns of different densities. While HDBSCAN [84] is specifically designed to handle non-uniform clusters and produced relatively better clusters for some attack campaigns, it still received a lower silhouette score than agglomerative clustering. Sensitivity of clustering threshold. To analyze how sensitive our clustering results are for different thresholds, we run two experiments by changing (a) the filtering threshold which is applied to L sets and (b) the distance threshold that dictates whether two clusters would be merged or not. Our findings show that the clustering results are sensitive to changes in the distance threshold but remain relatively the same for changes to filtering thresholds. For the first experiment, we change the filtering threshold to a less aggressive 70th percentile instead of the 80th percentile as we did originally, while keeping the distance threshold the same as before at 0.51. After manual analysis, we find no noticeable change in the likely malicious clusters in comparison to Araña’s clustering results. This indicates that clustering results may not be sensitive to changes in the filtering thresholds. In the second experiment, in addition to the filtering thresholds, we also change the distance threshold by applying a knee locator method on L sets filtered by the 70th percentile. This gives a distance threshold of 0.41, which is lower than the original distance threshold of 0.51. Although the majority of the resulting clusters remain the same, we observe a few noticeable changes in the 149 clusters as we describe below. We sampled the top 20 (16 untargeted and four targeted) likely malicious clusters using the same sampling criteria used in Araña. The first 16 clusters had NR ≥ 5, 000 ∨ NU ≥ 5, 000, exhibiting high volume, untargeted behavior. Out of these 16, we discover that nine clusters were exactly the same as the ones found in Araña’s clustering results. However, we identify five new untargeted clusters which we did not see in Araña’s clustering result. Our manual analy- sis confirms one of them to be a malicious attack campaign, one to be clearly benign behavior, and the other three to contain a mix of benign and malicious behavior. Furthermore, we observe that two attack campaigns which were pre- viously split into five clusters by Araña, are more accurately represented by two distinct clusters with this new distance threshold. The remaining four clusters with AUP ≥ 25 exhibited targeted behavior and were identical to those found by Araña. Lastly, we notice that, with the new relaxed distance threshold, two targeted attack campaigns detected by Araña are completely missed. This is because lowering the distance threshold introduced spurious L sets that did not exhibit the same targeted behavior as the other malicious L sets showing clear targeted behavior. Thus, the AUP value of the mixed cluster was reduced below our selection threshold of 25. In conclusion, we find that Araña’s clustering results are particularly sen- sitve to changes to the distance thresholds. While lowering the distance thesh- old can allow for the discovery of more potential attack campaigns, it also in- creases the chance of mixing benign and malicious L sets in the same cluster. To prioritize precision and a low false positive rate, we choose the higher per- 150 centile filtering thresholds which result in a high distance threshold, minimizing the chance of benign and malicious L sets appearing in the same cluster. Clustering without password-based features. To understand the importance of recording users’ password-based information, we reran Araña at U2 with- out the six password-based features; We found that without these features, we were still able detect 92% (1,570) of 1,709 malicious L sets and 80% (16) of 20 attack clusters discovered by Araña as shown in Figure C.5. However, it missed one multi-day credential stuffing attack from a single IP address (cluster #15) and three targeted attacks (clusters #22, #23, #29). We hypothesize that since all L sets in attack cluster #15 have similar values across five password guess- ability features and all L sets in attack clusters #22, #23, and #29 have similar feature values for AUP, Araña placed them in the same respective attack clusters predicting that they originate from the same attackers even if they were spread across multiple days. Additionally, clustering without password-based features flagged 46 new L sets. We manually analyzed these 46 L sets and suspect that at least four L sets are malformed clients, since they submitted the same password against a single user. Moreover, we found that L sets within the same cluster have significantly different values for password-based features. For example, eight L sets formed a new cluster in which one IP address had an FPIB value of 0, but the other seven L sets had FPIB > 0.55 with an average FPIB of 0.66. For the remaining 34 L sets, it is difficult for us to judge them as either fully malicious or benign, since the password-based features varied within a cluster. Thus we believe password-based features are important for grouping attack traffic into attack campaigns, as well as manually judging a cluster as malicious or benign. 151 # L sets # L sets flagged by FCA Attack # [29] flagged [29] Clusters Flagged Not Flagged #1 (at U1) 7 1,6 6 1 #2 (at U1) 1 2 1 0 #3 (at U2) 12 10 12 0 Figure B.3: Confusion matrices of the number of L sets corresponding to the three attack campaigns found in prior work [29] using their manual approach versus our FCA approach. B.4 Comparing Araña to Prior Work [29] In Section 3.6.1, we discuss the three attack campaigns manually identified in prior work [29]. Here we compare in more detail how their attack identification compares to the corresponding clusters found using Araña. In Figure B.3, we present three confusion matrices — one for each attack. These show how many L sets were labeled as part of the attack or not for both methods. For two attacks, agreement was perfect (unsurprising for attack #2 which only emanated from a single IP on a single day), and for attack #1 Araña missed just one L set. We believe that this false negative arose because that L set set fell on the next day (after midnight) compared to the priorL sets, suggesting that there can be some noise introduced by our 24-hour cut-offs for login sets. Even so, an analyst using Araña could, in this case, easily detect the false negative manually since the IP address is the same. B.5 Geographical Source of Attacks We use the ISP of a request to determine the country or countries of origin. Since a campaign often contains multiple IPs, there may be more than one country of origin. Figure B.4 shows the most common countries from which attacks orig- 152 inated. At U1 and U2, the vast majority of malicious IP addresses originated from within the United States. Country # of requests United States 65,474 Germany 6,170 Morocco 3,467 Canada 152 Brazil 112 Country # of requests United States 234,515 Russia 19,003 Ireland 13,130 Netherlands 5,678 Canada 1,571 Figure B.4: The five most common countries at U1 (left) and U2 (right) by num- ber of attack requests. 153 APPENDIX C THE PRACTICALITY AND ROBUSTNESS OF TIMELY MALICIOUS LOGIN DETECTION C.1 Finding k for Volumetric Flagging To find a k that performs well at both universities, we investigate different val- ues of k and their F1 scores over the training period of Gossamer logs. As shown in Figure C.1, we observe that after k = 500, the F1 score at U1 becomes > 80 and does not change much till k = 1, 500. At U2, for k = 1, 000, the F1 score is the highest at 76%. Increasing k further decreases the F1 score. We also ex- perimented with k = 10, 000 (not shown in Figure C.1) which performs poorly at both universities. Therefore, we chose k = 1, 000 as a relaxed threshold for IP-based volumetric flagging, as it performs well at both universities. 0 500 1,000 1,500 0 20 40 60 80 100 k F1 sc or e (% ) U1 U2 Figure C.1: The F1 scores for different values of k over the training period of Gossamer logs at both universities. Train Test Request level L set level Request level L set level P R F P R F P R F P R F Volumetric k = 10 9 40 23 5 81 9 45 96 62 54 76 63 k = 1, 000 71 96 82 86 6 12 100 13 23 100 0 0 Anomaly-based Comparison Set 71 97 82 97 37 53 84 24 38 19 13 15 Sliding Window 71 99 83 64 71 67 85 82 84 16 47 23 Per-day Clustering 99 100 99 73 100 84 94 100 97 79 100 88 (a) U1 Train Test Request level L set level Request level L set level P R F P R F P R F P R F Volumetric k = 10 14 81 24 35 31 33 51 84 63 82 46 59 k = 1, 000 71 65 68 86 < 1 1 100 10 19 100 < 1 < 1 ML-based StopGuessing 43 65 52 20 61 31 11 5 7 17 19 18 WhoAreYou 33 62 43 27 95 42 39 9 15 6 13 8 Anomaly-based Comparison Set 43 65 52 68 33 45 23 61 37 32 43 47 Sliding Window 61 98 75 33 62 43 22 99 36 27 95 42 Per-day Clustering 83 81 82 72 45 55 33 81 47 52 98 68 (b) U2 Figure C.2: Evaluation of the volumetric, ML-based, and DAS-based ap- proaches on the Gossamer train and holdout datasets using attacks reported in [64] as ground truth. We report the Precision, Recall, and F1 scores (in per- centage (%)) broken down both by individual requests and L sets. StopGuess- ing and WhoAreYou do not flag any requests as malicious at U1. C.2 StopGuessing Implementation Details StopGuessing [102] incorporates five features of traditional volumetric IP-based blocking. Some of these techniques require fine-grained information about submitted passwords or browser authentication cookies which are privacy- sensitive and can endanger user security [29]. StopGuessing sidestepped this crucial problem by relying on a simulator, available at [18] to generate synthetic 155 ` Level Name Total # Unseen 4 IP address 136 > 99% 3 ISP 77 93% 2 Autonomous System (AS) 45 93% 1 Country 3 33% Figure C.3: % of unseen IP addresses, ISPs, AS numbers, and countries at each level ` observed by U2 during the holdout period. logins of benign users and attackers. Developed by Bohuk et al. [29], Gossamer is a password measurement framework that can collect password-related information from real world user logins without endangering user privacy and security. Gossamer keeps the password encrypted in an ephemeral database which rotates the encryption key at the end of each day and stores certain safe-to-log password-derived measure- ments to a persistent database. It does not log the authentication cookie to pre- serve user security. We estimate the five features used in StopGuessing with the password-related information recorded by Gossamer and without using au- thentication cookies as follows. (1) Penalize frequently-guessed passwords. A strong signal for capturing malicious IP addresses is to check whether a request contains a frequently guessed/popular password on a given day. Schechter et al. [102], estimated this by storing a data structure called a binomial ladder frequency filter that can detect if the submitted password is one of the most frequently guessed or not. Similarly, Gossamer uses a counting bloom filter in the ephemeral database to store if the submitted password is within the top 1,000 frequently submitted password in the last 24 hours, and it flushes a binary 0/1 value to persistent storage at the end of the 24 hours for each login request. We use the fraction of passwords having a zero in this field to penalize the IP addresses that send 156 frequently guessed passwords. This feature is calculated twice—once for login requests with a valid username and again for those with invalid usernames. (2) Defend weaker accounts more. Accounts with weak passwords need more protection. Schechter et al. [102] mention that they estimate this using the most frequently guessed incorrect passwords observed in previous login attempts. We use the binary Zxcvbn [123] score of the submitted password to identify accounts with weak passwords that need more protection, since Zxcvbn is a popular mea- sure of password strength recorded by Gossamer. (3) Penalize typos less than other failures. Schechter et al. [102] simulate users committing typos by randomly introducing error (i.e., making small changes to the password) with a probability of one in 50. We instead use the Gossamer measurement that records if a failed login attempt has edit distance ≤ 2 to the correct password (if there is a successful login to that username within the day). (4) Ignore repeated account/password pairs. Gossamer logs do not contain hashed or encrypted passwords in the persistent storage. Therefore, identify- ing and removing requests with repeated username-password pairs is difficult. Nevertheless, we use information about the username and password derived from the breach compilation dataset used by Gossamer. For example, Gossamer records if the username and password appear in a breach dataset; it also records the edit distance of the submitted password from any passwords associated with the same username in the breach dataset. Although this information is not available for usernames not appearing in the breach dataset, we can use this edit distance to identify repeated submitted username-password pairs for 157 usernames in the breach. (5) Account for prior successful logins. To identify if the user is logging in from a client where they have prior successful logins, Schechter et al. [102] assign a device cookie if the client is logging into the account for the first time; subsequent logins would be indicated with that cookie to the login server. How- ever, Gossamer does not log this cookie field from the HTTP header to protect user privacy. Therefore, to identify if the user is logging in from the same client, we consider the tuple of the normalized user agent and username. Lastly, StopGuessing offsets the calculated score for a carefully selected sub- set of IP addresses to accommodate the observation that many users log in from IP addresses that are proxies or NAT. Since Gossamer stores the result of the login request, we can estimate this by considering the number of user accounts to which the IP address has successfully logged in. C.3 WhoAreYou Implementation Details We now detail how we calculated the reputation scores, and we describe our ab- lation study to understand why it failed in yielding a reliable reputation scoring for IP addresses. C.3.1 Estimating Reputation Score Reputation score refers to the chances of an IP address x sending malicious login requests p(x|mal). We use the 18 attack clusters reported in [64] on the Gossamer 158 logs that occurred during the training period for estimating reputation scores. For example, if during the training period the IP address x sends nx malicious logins, then p(x|mal) = nx/N, where N is the number of malicious logins received during the training period. C.3.2 Ablation Study Smoothing technique to solve the sparsity problem. Real-world login data is sparse, and thus the p(x|mal) calculation becomes difficult for unseen IP ad- dresses that appear only during the testing period. Freeman et al. [52] suggested using a smoothing technique to solve this problem. To do this, they first consider a hierarchical level for IP addresses. At the lowest level, the IP addresses itself (` = 4), ISP (` = 3), autonomous system numbers (ASN) (` = 2), country (` = 1) of the IP addresses, and finally world (the highest level) (` = 0). This hierarchy allows us to calculate the reputation score of an IP address x at different levels p(x`|mal) and then sum them up to get the final reputation score; that is, p(x|mal) = ∑`=4 `=1 p(x`|mal). Why doesn’t the smoothing technique work? For calculating p(x|mal) of a given x, we also use the same smoothing technique described above. However, this technique does not yield a reliable reputation score. This is because the sparsity problem is more intense for Gossamer logs as shown in Figure C.3. For example, for U2, at ` = 4 (IP-level) 135 out of 136 IP addresses with ma- licious logins in the testing period were unseen in the training period. Moving to a higher level does not increase the percentage of seen IP addresses either. At 159 ID / Univ. Type L sets NR θip θr θt 10 / U2 B 12 169,573 - - 5 11 / U2 B 663 33,304 - - 2 12 / U2 C 5 4,521 5 91 - 13 / U2 C 3 12,240 5 10,097 - 14 / U2 C, B 843 7,104 30 1883 2 15 / U2 C 1 5,287 15 2494 - 18 / U2 C 7 6,290 10 2487 - 19 / U2 C 4 3,938 5 3956 - 20 / U2 C 3 4,463 5 447 - 22 / U2 T, B 74 1878 15 732 1 23 / U2 T, B 52 1,250 25 993 1 24 / U2 T 4 468 15 632 - 25 / U2 T 4 101 5 78 - 26 / U2 T 3 80 5 101 - 27 / U2 T 1 27 5 125 - 28 / U2 T 1 27 5 632 - 29 / U2 T 1 25 40 94 - Figure C.4: Results of the evasion attacks when redistributing the login requests across θip IP addresses involved in the corresponding attack cluster, adding θr additional login requests, and lengthening the attack by θt. We show the results over the 19 attack clusters detected by the Per-day Clustering approach at U2. For this experiment we assume the attacker has access to all login requests as observed by the login server. A “-” symbol means the attacker does not require the corresponding resource for evasion. ` = 3, only five out of 77 ISPs are seen during the training period; at ` = 2, only three out of 45 are. At ` = 1 (the country level) only one out of three are seen, but scoring at this level is also less precise in general. C.4 Additional Evasion Results We now explain how an attacker can optimize their spent resources by conser- vatively assuming the attacker has knowledge about all other login requests. We develop a simple heuristic and empirically measure the resources required by the attacker. We describe our heuristic below and show the results in Figure C.4. 160 At first, we give the attacker no additional IP addresses (θip = 0). Then, we perform a binary search over θr with an upper limit of 106 to attempt evasion of the targeted and credential stuffing behavior exhibited by the attack cluster. If successful, then we perform another binary search over the additional time θt with an upper limit of 24 hours to evade the bursty behavior. Finally, if we are successful in evading all malicious behavior exhibited by the attack cluster, we stop. Otherwise, we increase θip by five and perform the two binary searches over θr and θt again until the evasion succeeds. As shown in Figure C.4, we find two attack clusters (#10 and #29) which remain unevadable with our experimental setting in Section 4.6 can be evaded with parameters θt = 5 and θip = 40. We also observe that spending θt is not necessary for evasion except for five clusters (#10, #11, #14, #22, #23) that exhibit bursty behavior. C.5 Attack Clusters Detected by Araña [64] Using a novel clustering technique along with manual analysis, Arañadetected nine attack clusters at U1 and 20 at U2. We use these attack clusters as the ground truth of attacks for our analysis. Details of the attack clusters including the number of requests in each cluster, number of IPs involved, number of days the attack was active, how fast the requests were submitted, and information about the submitted password are shown in Figure C.5. This table is replicated from [64]. 161 ID / Univ. Dura tio n # IP s # ISP s NU NR AUP FPIB FCIB FUIB FTP FVU FF 1 / U1 5 h 3 3 10,424 18,093 1.21 0.56 0.41 0.69 0.06 0.94 1.00 2 / U1 4 h 1 1 15,209 17,827 1.10 0.56 0.22 0.42 0.06 0.97 1.00 3 / U1 14 h 555 4 12,659 15,117 1.00 0.14 0.02 0.58 0.46 0.78 1.00 4 / U1 11 h 77 2 12,318 14,603 1.00 0.14 0.02 0.58 0.45 0.78 1.00 5 / U1 41 m 1 1 4,480 4,593 1.03 0.63 0.21 0.44 0.08 0.97 1.00 6 / U1 4 h 3 1 2,101 2,683 1.21 0.66 0.48 0.69 0.04 0.98 1.00 7 / U1 9 h 1 1 1,347 1,481 1.00 0.32 0.10 0.22 0.05 0.99 1.00 8 / U1 19 h 85 13 894 1,246 1.00 0.33 0.08 0.92 0.84 0.92 1.00 9 / U1 37m 30 7 219 241 1.00 0.92 0.78 0.96 0.10 0.93 1.00 10/ U2 11 h 12 1 76,321 169,573 1.01 0.03 0.00 0.00 0.00 0.00 1.00 11 / U2 10 m 663 10 27,488 33,304 1.00 0.79 0.00 0.00 0.00 0.01 1.00 12 / U2 15d 1 1 8,192 13,289 1.02 0.61 0.07 0.54 0.02 0.63 0.93 13 / U2 3 d 1 1 7,939 12,240 1.30 0.66 0.10 0.75 0.00 0.35 0.99 14 / U2 23h 843 13 7,771 10,535 1.01 0.66 0.00 0.71 0.00 0.48 0.97 15 / U2 5 d 1 1 4,662 9,714 1.07 0.84 0.00 0.91 0.00 0.40 0.99 16 / U2 2 d 2 1 4,934 7,323 1.00 0.58 0.00 0.23 0.00 0.81 0.84 17 / U2 2d 3 2 2,513 6,290 1.55 0.40 0.18 0.91 0.70 0.37 0.99 18 / U2 10 d 1 1 1,902 5,434 1.37 0.55 0.27 0.82 0.41 0.39 0.99 19 / U2 5 d 1 1 3,584 5,261 1.04 0.76 0.80 0.98 0.18 0.42 1.00 20 / U2 2 d 3 2 1,756 5,199 1.56 0.50 0.22 0.92 0.72 0.35 0.98 21 / U2 23 h 1 1 5,076 5,103 1.00 0.59 0.00 0.26 0.00 0.80 0.85 22 / U2 13 h 73 43 75 1,878 25.04 0.88 0.01 0.31 0.29 0.96 1.00 23 / U2 21 h 52 38 52 1,296 24.92 0.89 0.01 0.34 0.32 0.88 1.00 24 / U2 12 h 4 4 4 101 25.25 0.86 0.00 0.55 0.50 0.51 0.99 25 / U2 16 h 3 3 3 80 26.67 0.75 0.01 0.65 0.58 0.30 0.99 26 / U2 8 m 1 1 1 27 27.00 0.85 0.00 0.00 0.00 0.00 1.00 27 / U2 8 m 1 1 1 27 27.00 0.78 0.04 1.00 0.96 1.00 1.00 28 / U2 13 m 1 1 1 25 25.00 0.88 0.00 0.00 0.00 0.00 1.00 29 / U2 4 d 1 1 3 468 38.75 0.59 0.00 0.67 0.16 0.33 1.00 Figure C.5: Attack clusters detected by Araña [?] on Gossamer dataset. Similar to [64], we refer to each attack cluster using the ID in the first column. 162 BIBLIOGRAPHY [1] https://github.com/islamazhar/Arana-Public. [2] https://blackbox.ipinfo.app. [3] DigitalOcean – The developer cloud. https://www.digitalocean.com. [4] Google Cloud: Cloud Computing Services. https://cloud.google.com. [5] Inwi – Opérateur téléphonique mobile, fixe & internet au Maroc. https: //www.inwi.ma. [6] Okta. https://www.okta.com/. [7] Python Fernet Cryptography Library. https://cryptography.io/en/ latest/fernet.html. [8] Python Miscreant Encryption Library. https://github.com/miscreant/ miscreant.py. [9] UA-parser. https://github.com/ua-parser/uap-python. [10] RockYou Hack: From Bad To Worse. https://techcrunch.com/2009/12/ 14/rockyou-hack-security-myspace-facebook-passwords/, 2009. [11] Best64 Rule List. https://github.com/hashcat/hashcat/blob/master/ rules/best64.rule, 2018. [12] OAuth. https://oauth.net/2/, 2020. [13] Duo Security: Two-Factor Authentication & End Point Security. https: //duo.com/, 2021. [14] Account lockout threshold. https://learn.microsoft.com/en-us/ windows/security/threat-protection/security-policy-settings/account- lockout-threshold, 2023. [15] LinkedIn Corporation. https://www.linkedin.com/, 2023. [16] OpenBullet2. https://github.com/openbullet/OpenBullet2, 2023. https://github.com/islamazhar/Arana-Public https://blackbox.ipinfo.app https://www.digitalocean.com https://cloud.google.com https://www.inwi.ma https://www.inwi.ma https://www.okta.com/ https://cryptography.io/en/latest/fernet.html https://cryptography.io/en/latest/fernet.html https://github.com/miscreant/miscreant.py https://github.com/miscreant/miscreant.py https://github.com/ua-parser/uap-python https://techcrunch.com/2009/12/14/rockyou-hack-security-myspace-facebook-passwords/ https://techcrunch.com/2009/12/14/rockyou-hack-security-myspace-facebook-passwords/ https://github.com/hashcat/hashcat/blob/master/rules/best64.rule https://github.com/hashcat/hashcat/blob/master/rules/best64.rule https://oauth.net/2/ https://duo.com/ https://duo.com/ https://learn.microsoft.com/en-us/windows/security/threat-protection/security-policy-settings/account-lockout-threshold https://learn.microsoft.com/en-us/windows/security/threat-protection/security-policy-settings/account-lockout-threshold https://learn.microsoft.com/en-us/windows/security/threat-protection/security-policy-settings/account-lockout-threshold https://www.linkedin.com/ https://github.com/openbullet/OpenBullet2 [17] R-fx Networks Projects. https://www.rfxn.com/, 2023. [18] StopGuessing. https://github.com/microsoft/StopGuessing, 2023. [19] Jacob Abbott and Sameer Patil. How Mandatory Second Factor Affects the Authentication User Experience. In 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20, page 1–13, New York, NY, USA, 2020. Association for Computing Machinery. [20] Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad Rieck. Dos and Don’ts of Machine Learning in Computer Security. In USENIX Security 22, 2022. [21] Mahdi Hassan Aysa, Abdullahi Abdu Ibrahim, and Alaa Hamid Mo- hammed. IoT DDOS Attack Detection Using Machine Learning. In 2020 4th International Symposium on Multidisciplinary Studies and Innovative Tech- nologies (ISMSIT), 2020. [22] Sruti Bhagavatula, Lujo Bauer, and Apu Kapadia. (How) Do people change their passwords after a breach? USENIX ;login, 2021. [23] Ashutosh Bhardwaj. Silhouette coefficient. https://towardsdatascience. com/silhouette-coefficient-validating-clustering-techniques- e976bb81d10c, 2020. [24] Kemal Bicakci and Yusuf Uzunay. Is FIDO2 Passwordless Authentication a Hype or for Real?: A Position Paper. In 2022 15th International Conference on Information Security and Cryptography (ISCTURKEY), pages 68–73. IEEE, 2022. [25] Battista Biggio, Samuel Rota Bulò, Ignazio Pillai, Michele Mura, Eyasu Zemene Mequanint, Marcello Pelillo, and Fabio Roli. Poison- ing complete-linkage hierarchical clustering. In Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshop, S+ SSPR 2014, Joensuu, Finland, August 20-22, 2014. Proceedings, pages 42–52. Springer, 2014. [26] Battista Biggio, Ignazio Pillai, Samuel Rota Bulò, Davide Ariu, Marcello Pelillo, and Fabio Roli. Is data clustering in adversarial settings secure? In 2013 ACM workshop on Artificial Intelligence and Security, pages 87–98, 2013. 164 https://www.rfxn.com/ https://github.com/microsoft/StopGuessing https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c [27] Jeremiah Blocki and Wuwei Zhang. DALock: Password Distribution- Aware Throttling. Proceedings on Privacy Enhancing Technologies, 2022. [28] Beyond Identity Blog. The history and future of passwords. https://www.beyondidentity.com/blog/history-and-future-passwords, Sep 2021. [29] Marina Sanusi Bohuk, Mazharul Islam, Syed Suleman Ahmad, Michael Swift, Thomas Ristenpart, and Rahul Chatterjee. Gossamer: Securely Measuring Password-based Logins. In 31nd USENIX Security Symposium (USENIX Security 22), USA, 2022. USENIX Association. [30] Joseph Bonneau, Cormac Herley, Paul C. van Oorschot, and Frank Sta- jano. The quest to replace passwords: A framework for comparative eval- uation of web authentication schemes. In 2012 IEEE Symposium on Security and Privacy, pages 553–567, 2012. [31] Joseph Bonneau, Cormac Herley, Paul C. Van Oorschot, and Frank Sta- jano. The quest to replace passwords: A framework for comparative eval- uation of web authentication schemes. In 2012 IEEE Symposium on Security and Privacy, 2012. [32] Joseph Bonneau, Cormac Herley, Paul C Van Oorschot, and Frank Stajano. Passwords and the evolution of imperfect authentication. Communications of the ACM, 58(7), 2015. [33] Christlaan Brand and Sriram Karra. The beginning of the end of the pass- word. https://blog.google/technology/safety-security/the-beginning- of-the-end-of-the-password/. [34] AMR Brostoff and MA Sasse. ’Ten strikes and you’re out’: Increasing the number of login attempts can improve password usability. In CHI 2003 Workshop on Human-Computer Interaction and Security Systems, Fort Lauderdale, 2003. [35] Anna L. Buczak and Erhan Guven. A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Commu- nications Surveys & Tutorials, 18(2), 2016. [36] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, An- dreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, 165 https://www.beyondidentity.com/blog/history-and-future-passwords https://blog.google/technology/safety-security/the-beginning-of-the-end-of-the-password/ https://blog.google/technology/safety-security/the-beginning-of-the-end-of-the-password/ Brian Holt, and Gaël Varoquaux. API design for machine learning soft- ware: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013. [37] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detec- tion: A survey. ACM computing surveys (CSUR), 41(3):1–58, 2009. [38] Rahul Chatterjee, Anish Athalye, Devdatta Akhawe, Ari Juels, and Thomas Ristenpart. pASSWORD tYPOS and how to correct them securely. 2016. Full version of the paper can be found at https://cs.cornell.edu/ rahul/papers/pwtypos.pdf. [39] Rahul Chatterjee, Anish Athayle, Devdatta Akhawe, Ari Juels, and Thomas Ristenpart. pASSWORD tYPOS and how to correct them se- curely. In 2016 IEEE Symposium on Security and Privacy, 2016. [40] Jessica Colnago, Summer Devlin, Maggie Oates, Chelse Swoopes, Lujo Bauer, Lorrie Cranor, and Nicolas Christin. "It’s not actually that horri- ble": Exploring Adoption of Two-Factor Authentication at a University. In 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). Association for Computing Machinery, 2018. [41] Jessica Colnago, Summer Devlin, Maggie Oates, Chelse Swoopes, Lujo Bauer, Lorrie Cranor, and Nicolas Christin. “It’s Not Actually That Horri- ble”: Exploring Adoption of Two-Factor Authentication at a University, page 1–11. Association for Computing Machinery, New York, NY, USA, 2018. [42] Anupam Das, Joseph Bonneau, Matthew Caesar, Nikita Borisov, and Xi- aoFeng Wang. The Tangled Web of Password Reuse. In NDSS, volume 14, pages 23–26, 2014. [43] Dashlane. Report: A global look at password health. https://www. dashlane.com/global-password-health-score-report-2022. [44] Katie Deighton. Tech Companies Push Users to Adopt Two-Factor Authentication. https://www.wsj.com/articles/tech-companies-push- users-to-adopt-two-factor-authentication-11635807088, 2021. [45] Periwinkle Doerfler, Kurt Thomas, Maija Marincenko, Juri Ranieri, Yu Jiang, Angelika Moscicki, and Damon Mccoy. Evaluating Login Chal- lenges as a Defense Against Account Takeover. In 28th International Con- ference on World Wide Web (WWW ’19), 2019. 166 https://www.dashlane.com/global-password-health-score-report-2022 https://www.dashlane.com/global-password-health-score-report-2022 https://www.wsj.com/articles/tech-companies-push-users-to-adopt-two-factor-authentication-11635807088 https://www.wsj.com/articles/tech-companies-push-users-to-adopt-two-factor-authentication-11635807088 [46] Verizon Enterprise. DBIR: 2023 Databreach Investigation Report. https://www.verizon.com/business/resources/Tf0f/reports/2023- data-breach-investigations-report-dbir.pdf, 2023. [47] Chris Evans, Chris Palmer, and Ryan Sleevi. Public key pinning exten- sion for http. Internet Engineering Task Force. 27Available: http://www. ietf. org/internet-drafts/draft-ietf-websec-key-pinning-09. txt, 2015. [48] Dinei Florencio and Cormac Herley. A large-scale study of web password habits. In 16th International Conference on World Wide Web, WWW ’07, page 657–666, New York, NY, USA, 2007. Association for Computing Machin- ery. [49] Dinei Florêncio, Cormac Herley, and Baris Coskun. Do strong web pass- words accomplish anything? HotSec, 7(6):159, 2007. [50] Alain Forget, Saranga Komanduri, Alessandro Acquisti, Nicolas Christin, Lorrie Faith Cranor, and Rahul Telang. Building the Security Behavior Observatory: An Infrastructure for Long-Term Monitoring of Client Ma- chines. In 2014 Symposium and Bootcamp on the Science of Security, HotSoS ’14. Association for Computing Machinery, 2014. [51] John Franks, Phillip Hallam-Baker, Jeffrey Hostetler, Scott Lawrence, Paul Leach, Ari Luotonen, and Lawrence Stewart. HTTP authentication: Basic and digest access authentication, 1999. [52] David Freeman, Sakshi Jain, Markus Duermuth, Battista Biggio, and Gior- gio Giacinto. Who Are You? A Statistical Approach to Measuring User Authenticity. In NDSS, 2016. [53] Eva Gerlitz, Maximilian Häring, and Matthew Smith. Please do not use or your License Plate Number: Analyzing Password Policies in Ger- man Companies. In Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021), pages 17–36. USENIX Association, August 2021. [54] Conor Gilsenan, Fuzail Shakir, Noura Alomar, and Serge Egelman. Secu- rity and Privacy Failures in Popular 2FA Apps. In 32nd USENIX Security Symposium (USENIX Security 23), 2023. [55] Samuel Greengard. Passkeys unlock a new era for authentica- tion. https://cacm.acm.org/news/270924-passkeys-unlock-a-new-era- for-authentication/fulltext, Mar 2023. 167 https://www.verizon.com/business/resources/Tf0f/reports/2023-data-breach-investigations-report-dbir.pdf https://www.verizon.com/business/resources/Tf0f/reports/2023-data-breach-investigations-report-dbir.pdf https://cacm.acm.org/news/270924-passkeys-unlock-a-new-era-for-authentication/fulltext https://cacm.acm.org/news/270924-passkeys-unlock-a-new-era-for-authentication/fulltext [56] W3C Web Authentication Working Group. Passkeys: A Web Authenti- cation API for Secure and Passwordless Sign-in. "https://w3.org/TR/ webauthn-passkeys/". [57] Hana Habib, Jessica Colnago, William Melicher, Blase Ur, Sean Segreti, Lujo Bauer, Nicolas Christin, and Lorrie Cranor. Password creation in the presence of blacklists. Proc. USEC, 2017. [58] D. Harkings. Synthetic initialization vector (SIV) authenticated encryp- tion using the advanced encryption standard (AES). [59] Cormac Herley and Stuart E. Schechter. Distinguishing Attacks from Le- gitimate Authentication Traffic at Scale. In Network and Distributed Systems Security (NDSS) Symposium 2019, 2019. [60] Briland Hitaj, Paolo Gasti, Giuseppe Ateniese, and Fernando Perez-Cruz. PassGAN: A Deep Learning Approach for Password Guessing. In Interna- tional conference on applied cryptography and network security, pages 217–237. Springer, 2019. [61] Grant Ho, Aashish Sharma, Mobin Javed, Vern Paxson, and David Wag- ner. Detecting Credential Spearphishing Attacks in Enterprise Settings. In 26th USENIX Security Symposium (USENIX Security 17), SEC’17, page 469–485, USA, 2017. USENIX Association. [62] Nicolas Huaman, Sabrina Amft, Marten Oltrogge, Yasemin Acar, and Sascha Fahl. They would do better if they worked together: The case of interaction problems between password managers and websites. In 2021 IEEE Symposium on Security and Privacy (SP), pages 1367–1381. IEEE, 2021. [63] Troy Hunt. Have I Been Pwned? https://haveibeenpwned.com/ Passwords/, 2018. [64] Mazharul Islam, Marina Sanusi Bohuk, Paul Chung, Thomas Ristenpart, and Rahul Chatterjee. Araña: Discovering and Characterizing Pass- word Guessing Attacks in Practice. In 32nd USENIX Security Sympo- sium (USENIX Security 23), pages 1019–1036, Anaheim, CA, August 2023. USENIX Association. [65] Mazharul Islam, Marina Sanusi Bohuk, Thomas Ristenpart, and Rahul Chatterjee. The Practicality and Robustness of Timely Malicious Login Detection. In In Preparation. 168 "https://w3.org/TR/webauthn-passkeys/" "https://w3.org/TR/webauthn-passkeys/" https://haveibeenpwned.com/Passwords/ https://haveibeenpwned.com/Passwords/ [66] Mobin Javed and Vern Paxson. Detecting stealthy, distributed ssh brute- forcing. In 2013 ACM SIGSAC conference on Computer & communications security, pages 85–96, 2013. [67] Mohammed Jubur, Prakash Shrestha, Nitesh Saxena, and Jay Prakash. By- passing push-based second factor and passwordless authentication with human-indistinguishable notifications. In 2021 ACM Asia Conference on Computer and Communications Security, ASIA CCS ’21, page 447–461, New York, NY, USA, 2021. Association for Computing Machinery. [68] Mohammed Jubur, Prakash Shrestha, Nitesh Saxena, and Jay Prakash. By- passing push-based second factor and passwordless authentication with human-indistinguishable notifications. In 2021 ACM Asia Conference on Computer and Communications Security, ASIA CCS ’21, page 447–461, New York, NY, USA, 2021. Association for Computing Machinery. [69] P. G. Kelley, S. Komanduri, M. L. Mazurek, R. Shay, T. Vidas, L. Bauer, N. Christin, L. F. Cranor, and J. Lopez. Guess Again (and Again and Again): Measuring Password Strength by Simulating Password-Cracking Algorithms. In 2012 IEEE Symposium on Security and Privacy, 2012. [70] Erin Kenneally and David Dittrich. The menlo report: Ethical principles guiding information and communication technology research. Available at SSRN 2445102, 2012. [71] Saranga Komanduri. Modeling the Adversary to Evaluate Password Strength With Limited Samples. https://kilthub.cmu.edu/articles/ thesis/Modeling_the_Adversary_to_Evaluate_Password_Strength_ With_Limited_Samples/6720701/1, 2018. [72] Saranga Komanduri, Richard Shay, Patrick Gage Kelley, Michelle L. Mazurek, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and Serge Egelman. Of Passwords and People: Measuring the Effect of Password- Composition Policies. In SIGCHI Conference on Human Factors in Comput- ing Systems (CHI ’11). Association for Computing Machinery, 2011. [73] Leona Lassak, Elleen Pan, Blase Ur, and Maximilian Golla. Why Aren’t We Using Passkeys? Obstacles Companies Face Deploying FIDO2 Pass- wordless Authentication. In 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, 2024. USENIX Association. [74] Kristin Lauter. Password Monitor: Safeguarding passwords in Microsoft 169 https://kilthub.cmu.edu/articles/thesis/Modeling_the_Adversary_to_Evaluate_Password_Strength_With_Limited_Samples/6720701/1 https://kilthub.cmu.edu/articles/thesis/Modeling_the_Adversary_to_Evaluate_Password_Strength_With_Limited_Samples/6720701/1 https://kilthub.cmu.edu/articles/thesis/Modeling_the_Adversary_to_Evaluate_Password_Strength_With_Limited_Samples/6720701/1 Edge. https://www.microsoft.com/en-us/research/blog/password- monitor-safeguarding-passwords-in-microsoft-edge/, 2021. [75] Aleksandar Lazarevic, Levent Ertoz, Vipin Kumar, Aysel Ozgur, and Jaideep Srivastava. A comparative study of anomaly detection schemes in network intrusion detection. In 2003 SIAM international conference on data mining, pages 25–36. SIAM, 2003. [76] Lucy Li, Bijeeta Pal, Junade Ali, Nick Sullivan, Rahul Chatterjee, and Thomas Ristenpart. Protocols for checking compromised credentials. In 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS ’19, page 1387–1403, New York, NY, USA, 2019. Association for Com- puting Machinery. [77] Bo Lu, Xiaokuan Zhang, Ziman Ling, Yinqian Zhang, and Zhiqiang Lin. A measurement study of authentication rate-limiting mechanisms of mod- ern websites. In 34th Annual Computer Security Applications Conference, ACSAC ’18, page 89–100, New York, NY, USA, 2018. Association for Com- puting Machinery. [78] Sanam Ghorbani Lyastani, Michael Schilling, Michaela Neumayr, Michael Backes, and Sven Bugiel. Is FIDO2 the kingslayer of user authentication? A comparative usability study of FIDO2 passwordless authentication. In 2020 IEEE Symposium on Security and Privacy (SP), pages 268–285. IEEE, 2020. [79] J. Ma, W. Yang, M. Luo, and N. Li. A study of probabilistic password models. In 2014 IEEE Symposium on Security and Privacy, pages 689–704, 2014. [80] MaxMind. GeoIP2 Database. https://www.maxmind.com/en/geoip2- databases. [81] LLC MaxMind. GeoIP. https://www.maxmind.com/en/geoip-demo. [82] Peter Mayer, Collins W Munyendo, Michelle L Mazurek, and Adam J Aviv. Why Users (Don’t) Use Password Managers at a Large Educational Institution. In 31st USENIX Security Symposium (USENIX Security 22), pages 1849–1866, 2022. [83] Michelle L. Mazurek, Saranga Komanduri, Timothy Vidas, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, Patrick Gage Kelley, Richard Shay, 170 https://www.microsoft.com/en-us/research/blog/password-monitor-safeguarding-passwords-in-microsoft-edge/ https://www.microsoft.com/en-us/research/blog/password-monitor-safeguarding-passwords-in-microsoft-edge/ https://www.maxmind.com/en/geoip2-databases https://www.maxmind.com/en/geoip2-databases https://www.maxmind.com/en/geoip-demo and Blase Ur. Measuring Password Guessability for an Entire University. In 2013 ACM SIGSAC Conference on Computer and Communications Security (CCS ’13). Association for Computing Machinery, 2013. [84] Leland McInnes, John Healy, and Steve Astels. HDBSCAN: Hierarchical Density-Based Clustering. The Journal of Open Source Software, 2(11), 2017. [85] William Melicher, Blase Ur, Sean M. Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. Fast, Lean, and Ac- curate: Modeling Password Guessability Using Neural Networks. In USENIX Security 16. USENIX Association, 2016. [86] Andrew Miklas, Stefan Saroiu, Alec Wolman, and Angela Demke Brown. Bunker: A privacy-oriented platform for network tracing. In 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2009. [87] Daniel Müllner. Modern hierarchical, agglomerative clustering algo- rithms. arXiv preprint arXiv:1109.2378, 2011. [88] Arvind Narayanan and Vitaly Shmatikov. Fast Dictionary Attacks on Passwords Using Time-Space Tradeoff. In 12th ACM Conference on Com- puter and Communications Security, CCS ’05, page 364–372, New York, NY, USA, 2005. Association for Computing Machinery. [89] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pages 111–125. IEEE, 2008. [90] Alexandra Nisenoff, Maximilian Golla, Miranda Wei, Juliette Hainline, Hayley Szymanek, Annika Braun, Annika Hildebrandt, Blair Christensen, David Langenberg, and Blase Ur. A {Two-Decade} retrospective analysis of a university’s vulnerability to attacks exploiting reused passwords. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5127–5144, 2023. [91] Sean Oesch and Scott Ruoti. That Was Then, This Is Now: A Security Eval- uation of Password Generation, Storage, and Autofill in Browser-Based Password Managers. In 29th USENIX Security Symposium (USENIX Secu- rity 20), pages 2165–2182. USENIX Association, August 2020. [92] Openwall. John The Ripper. https://www.openwall.com/john/. 171 https://www.openwall.com/john/ [93] Bijeeta Pal, Tal Daniel, Rahul Chatterjee, and Thomas Ristenpart. Beyond credential stuffing: Password similarity models using neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pages 417–434, May 2019. [94] Bijeeta Pal, Mazharul Islam, Thomas Ristenpart, and Rahul Chatterjee. Might I get pwned: A second generation password breach alerting ser- vice, 2021. [95] Danny Palmer. These systems are facing billions of attacks every month as hackers try to guess passwords. https://www.zdnet.com/article/these- systems-are-facing-billions-of-attacks-every-month-as-hackers-try-to- guess-passwords/, 2021. [96] Sarah Pearman, Jeremy Thomas, Pardis Emami Naeini, Hana Habib, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, Serge Egelman, and Alain Forget. Let’s go in for a closer look: Observing passwords in their natural habitat. In 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS ’17). Association for Computing Machinery, 2017. [97] Sarah Pearman, Shikun Aerin Zhang, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. Why people (don’t) use password managers effec- tively. In Fifteenth Symposium on Usable Privacy and Security (SOUPS 2019), pages 319–338, 2019. [98] Soc Radar. 91% of e-commerce login traffic is credential stuff- ing. https://socradar.io/91-of-e-commerce-login-traffic-is-credential- stuffing-attempts/, 2022. [99] Nabeel Saeed. Top insights from our 2022 state of secure identity report. [100] Ville Satopaa, Jeannie Albrecht, David Irwin, and Barath Raghavan. Find- ing a "Kneedle" in a Haystack: Detecting Knee Points in System Behav- ior. In 2011 31st International Conference on Distributed Computing Systems Workshops, 2011. [101] Stuart Schechter, Cormac Herley, and Michael Mitzenmacher. Popularity Is Everything: A New Approach to Protecting Passwords from Statistical- Guessing Attacks. In 5th USENIX Workshop on Hot Topics in Security (Hot- Sec 10), Washington, DC, August 2010. USENIX Association. [102] Stuart Schechter, Yuan Tian, and Cormac Herley. StopGuessing: Using 172 https://www.zdnet.com/article/these-systems-are-facing-billions-of-attacks-every-month-as-hackers-try-to-guess-passwords/ https://www.zdnet.com/article/these-systems-are-facing-billions-of-attacks-every-month-as-hackers-try-to-guess-passwords/ https://www.zdnet.com/article/these-systems-are-facing-billions-of-attacks-every-month-as-hackers-try-to-guess-passwords/ https://socradar.io/91-of-e-commerce-login-traffic-is-credential-stuffing-attempts/ https://socradar.io/91-of-e-commerce-login-traffic-is-credential-stuffing-attempts/ Guessed Passwords to Thwart Online Guessing. In 2019 IEEE European Symposium on Security and Privacy (EuroS&P), 2019. [103] Tara Seals. Billions of passwords offered for $2 in cyber-underground. ThreatPost, 2021. [104] Richard Shay, Patrick Gage Kelley, Saranga Komanduri, Michelle L. Mazurek, Blase Ur, Timothy Vidas, Lujo Bauer, Nicolas Christin, and Lor- rie Faith Cranor. Correct horse battery staple: Exploring the usability of system-assigned passphrases. In Eighth Symposium on Usable Privacy and Security, SOUPS ’12, New York, NY, USA, 2012. Association for Comput- ing Machinery. [105] Richard Shay, Saranga Komanduri, Adam L. Durity, Phillip (Seyoung) Huh, Michelle L. Mazurek, Sean M. Segreti, Blase Ur, Lujo Bauer, Nico- las Christin, and Lorrie Faith Cranor. Can long passwords be secure and usable? In SIGCHI Conference on Human Factors in Computing Systems, CHI ’14, page 2927–2936, New York, NY, USA, 2014. Association for Comput- ing Machinery. [106] Jens Steube and Gabriele Gristina. Hashcat. https://hashcat.net/ hashcat/. [107] Schechter Stuart. Before You Use a Password Manager. https: //stuartschechter.medium.com/before-you-use-a-password-manager- 9f5949ccf168, 2019. [108] Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997. [109] Pang-Ning Tan, Michael Steinbach, Anuj Karpatneand, and Vipin Kumar. Introduction to Data Mining, 2nd Edition. People’s Posts and Telecommuni- cations Publishing House, Beijing, 2019. [110] Danna Thee. Sentry MBA. https://e.cyberint.com/hubfs/Sentry% 20MBA%20Report/CyberInt_Sentry-MBA_Report.pdf. [111] Kurt Thomas, Frank Li, Ali Zand, Jacob Barrett, Juri Ranieri, Luca Inv- ernizzi, Yarik Markov, Oxana Comanescu, Vijay Eranti, Angelika Mosci- cki, Daniel Margolis, Vern Paxson, and Elie Bursztein. Data Breaches, Phishing, or Malware? Understanding the Risks of Stolen Credentials. In 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS ’17). Association for Computing Machinery, 2017. 173 https://hashcat.net/hashcat/ https://hashcat.net/hashcat/ https://stuartschechter.medium.com/before-you-use-a-password-manager-9f5949ccf168 https://stuartschechter.medium.com/before-you-use-a-password-manager-9f5949ccf168 https://stuartschechter.medium.com/before-you-use-a-password-manager-9f5949ccf168 https://e.cyberint.com/hubfs/Sentry%20MBA%20Report/CyberInt_Sentry-MBA_Report.pdf https://e.cyberint.com/hubfs/Sentry%20MBA%20Report/CyberInt_Sentry-MBA_Report.pdf [112] Kurt Thomas, Jennifer Pullman, Kevin Yeo, Ananth Raghunathan, Patrick Gage Kelley, Luca Invernizzi, Borbala Benko, Tadek Pietraszek, Sarvar Patel, Dan Boneh, and Elie Bursztein. Protecting Accounts from Credential Stuffing with Password Breach Alerting. In USENIX Security 19. USENIX Association, 2019. [113] Michael Tremante. End User Security: Account Takeover Protections with Cloudflare. https://blog.cloudflare.com/account-takeover-protection/, 2021. [114] Blase Ur, Patrick Gage Kelley, Saranga Komanduri, Joel Lee, Michael Maass, Michelle L. Mazurek, Timothy Passaro, Richard Shay, Timothy Vi- das, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. How does your password measure up? the effect of strength meters on password creation. In USENIX Security 22. USENIX Association, 2012. [115] Blase Ur, Fumiko Noma, Jonathan Bees, Sean M Segreti, Richard Shay, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. " I Added’!’at the End to Make It Secure": Observing Password Creation in the Lab. In Eleventh symposium on usable privacy and security (SOUPS 2015), pages 123– 140, 2015. [116] Luke Valenta, Cefan Daniel Rubin, and Christopher Wood. Privacy- Preserving Compromised Credential Checking. https://blog.cloudflare. com/privacy-preserving-compromised-credential-checking, 2021. [117] Luis Von Ahn, Manuel Blum, Nicholas J Hopper, and John Langford. Captcha: Using hard ai problems for security. In International conference on the theory and applications of cryptographic techniques, pages 294–311. Springer, 2003. [118] Chun Wang, Steve T.K. Jan, Hang Hu, Douglas Bossart, and Gang Wang. The next domino to fall: Empirical analysis of user passwords across on- line services. In Eighth ACM Conference on Data and Application Security and Privacy, CODASPY ’18, page 196–203, New York, NY, USA, 2018. As- sociation for Computing Machinery. [119] Ding Wang, Zijian Zhang, Ping Wang, Jeff Yan, and Xinyi Huang. Tar- geted online password guessing: An underestimated threat. 2016. [120] WAQAS. A trove of 1.4 billion clear text credentials file found on dark web, 2017. 174 https://blog.cloudflare.com/account-takeover-protection/ https://blog.cloudflare.com/privacy-preserving-compromised-credential-checking https://blog.cloudflare.com/privacy-preserving-compromised-credential-checking [121] Rick Wash, Emilee Rader, Ruthie Berman, and Zac Wellmer. Under- standing password choices: How frequently entered passwords are re- used across websites. In Twelfth Symposium on Usable Privacy and Security (SOUPS 2016), pages 175–188, 2016. [122] Matt Weir, Sudhir Aggarwal, Breno de Medeiros, and Bill Glodek. Pass- word Cracking Using Probabilistic Context-Free Grammars. In 2009 30th IEEE Symposium on Security and Privacy, 2009. [123] Dan Lowe Wheeler. zxcvbn: Low-Budget Password Strength Estimation. In Proc. USENIX Security, 2016. [124] Suzanne Widup, Bob Rudis, Dave Hylender, and Marc Spitler. 2015 Data Breach Investigations Report. https://www.researchgate.net/ publication/289254638_2015_Verizon_Data_Breach_Investigations_ Report, 2015. [125] Wikipedia. List of the most common passwords. https://en.wikipedia. org/wiki/List_of_the_most_common_passwords. [126] Danny Yadron. Man behind the first computer password: It’s become a nightmare. Wall Street Journal, May 2014. [127] Samira Zibaei, Dinah Rinoa Malapaya, Benjamin Mercier, Amirali Salehi- Abari, and Julie Thorpe. Do password managers nudge secure (ran- dom) passwords? In Eighteenth Symposium on Usable Privacy and Security (SOUPS 2022), pages 581–597, 2022. 175 https://www.researchgate.net/publication/289254638_2015_Verizon_Data_Breach_Investigations_Report https://www.researchgate.net/publication/289254638_2015_Verizon_Data_Breach_Investigations_Report https://www.researchgate.net/publication/289254638_2015_Verizon_Data_Breach_Investigations_Report https://en.wikipedia.org/wiki/List_of_the_most_common_passwords https://en.wikipedia.org/wiki/List_of_the_most_common_passwords Biographical Sketch Dedication Acknowledgements Contents List of Figures Introduction Gossamer: Securely Measuring Password-based Logins Introduction Background and Related Work Designing a Secure Measurement Architecture Password-Derived Measurements and Security Analysis Password-derived measurements Security analysis of measurements Deploying Gossamer Login Statistics, Patterns, & Observations Characteristics of high volume attacks User and client statistics Password security Usability Conclusion Araña: Discovering and Characterizing Password Guessing Attacks in Practice Introduction Background and Related Work Gossamer Logs Towards Detecting Attack Campaigns Login Sets and Features Initial Attempts Using Compromise Reports Campaign Discovery Pipeline Filtering Likely Benign Requests Clustering Potentially Malicious sets Attack Campaigns Discovered Analysis of Attack Campaigns Example Attack Campaigns Higher-Level Attack Behaviors Observed Discussion Real-World Deployment Constraints Conclusion The Practicality and Robustness of Timely Malicious Login Detection Introduction Background & Related Work Evaluating Prior Approaches Dataset Performance of Volumetric Approaches Performance of ML Approaches Anomaly-Based Detection Approaches Malicious Login Detection Using DAS Attack Detection Performance Robustness of Malicious Login Detection Model for Evasion Attacks Evasion Strategies Evasion Results Discussion & Future Work Conclusion Conclusion Overarching Observations The Authentication Ecosystem Gossamer: Securely Measuring Password-based Logins Measurements Taken Filtering Out Attacks Login Statistics Araña: Discovering and Characterizing Password Guessing Attacks in Practice Reasons for Compromise Reports Using DAS to Detect Attacks Additional Clustering Results Comparing Araña to Prior Work gossamer Geographical Source of Attacks The Practicality and Robustness of Timely Malicious Login Detection Finding k for Volumetric Flagging StopGuessing Implementation Details WhoAreYou Implementation Details Estimating Reputation Score Ablation Study Additional Evasion Results Attack Clusters Detected by Araña arana Bibliography