This file is 'spambase.DOCUMENTATION' at the UCI Machine Learning Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html

--> Edited by David Draper (15 Mar 2019) to describe the data set 'spam.csv',
    which was created by Efron and Hastie (2016) in close correspondence to
    the UCI Machine Learning Repository data set.

1. Title: SPAM E-mail Database

2. Sources:
   (a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
       Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
   (b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835
   (c) Generated: June-July 1999

3. Past Usage:
   (a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
   (b) Determine whether a given e-mail is spam or not.
   (c) ~7% misclassification error.
       False positives (marking good mail as spam) are very undesirable.
       If we insist on zero false positives in the training/testing set,
       20-25% of the spam passed through the filter.

4. Relevant Information:
   The "spam" concept is diverse: advertisements for products/web sites,
   make-money-fast schemes, chain letters, pornography ...

   Our collection of spam e-mails came from our postmaster George Forman and
   from individuals who had filed spam. Our collection of non-spam e-mails
   came from filed work and personal e-mails, and hence the word 'george' and
   the area code '650' are indicators of non-spam. These are useful when
   constructing a personalized spam filter. To build a general-purpose spam
   filter, one would either have to blind such non-spam indicators or obtain
   a very wide collection of non-spam.

   For background on spam:
   Cranor, Lorrie F., LaMacchia, Brian A. Spam!
   Communications of the ACM, 41(8):74-83, 1998.

5. Number of Instances: 4601 (1813 Spam = 39.4%)

6. Number of Attributes: 59 (57 continuous, 2 nominal class labels:
   1 outcome, 1 indicator of train/test cross-validation status)

7. Attribute Information:
   Most of the predictor variables indicate whether a particular word or
   character occurred frequently in the e-mail. The run-length predictors
   (variables 55-57) measure the lengths of sequences of consecutive capital
   letters.

   Here are definitions of the predictor variables:

   48 continuous real [0,100] attributes of type word_freq_WORD
      = percentage of words in the e-mail that match WORD,
        i.e. 100 * (number of times the WORD appears in the e-mail) /
        total number of words in e-mail. A "word" in this case is any
        string of alphanumeric characters bounded by non-alphanumeric
        characters or end-of-string.

   6 continuous real [0,100] attributes of type char_freq_CHAR
      = percentage of characters in the e-mail that match CHAR,
        i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

   1 continuous real [1,...] attribute of type capital_run_length_average
      = average length of uninterrupted sequences of capital letters

   1 continuous integer [1,...] attribute of type capital_run_length_longest
      = length of longest uninterrupted sequence of capital letters

   1 continuous integer [1,...] attribute of type capital_run_length_total
      = sum of length of uninterrupted sequences of capital letters
      = total number of capital letters in the e-mail

   Here are definitions of the other variables:

   1 logical { F, T } class attribute of type 'spam'
      = denotes whether the e-mail was considered spam, i.e. unsolicited
        commercial e-mail (T), or not (F)

   1 logical { F, T } class attribute of type 'testid'
      = denotes whether the e-mail was randomly assigned to the
        cross-validation partition subset 'test'/'modeling' (T) or
        'train'/'validation' (F)

8. Missing Attribute Values: None

9. Class Distribution:
   Spam      1813 (39.4%)
   Non-Spam  2788 (60.6%)
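
Illustration of the predictor definitions in section 7: the following is a
minimal Python sketch of how the three kinds of predictors could be computed
from the raw text of one e-mail. It is not the code used to build the data
set. The function names (word_freq, char_freq, capital_run_stats) are
hypothetical, and three choices are assumptions not stated above: words are
matched case-insensitively, "alphanumeric" means ASCII letters and digits, and
an e-mail with no capital letters gets run-length values of 0 (the documented
range starts at 1, so the original handling of that case is unknown).

```python
import re


def word_freq(text: str, word: str) -> float:
    """100 * (occurrences of `word`) / (total words) for one e-mail.

    A word is a maximal run of alphanumeric characters.
    ASSUMPTION: ASCII alphanumerics only, case-insensitive matching.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    target = word.lower()
    hits = sum(1 for w in words if w.lower() == target)
    return 100 * hits / len(words)


def char_freq(text: str, char: str) -> float:
    """100 * (occurrences of `char`) / (total characters) for one e-mail."""
    if not text:
        return 0.0
    return 100 * text.count(char) / len(text)


def capital_run_stats(text: str) -> tuple[float, int, int]:
    """Return (average, longest, total) length of runs of capital letters.

    ASSUMPTION: returns (0.0, 0, 0) when the text has no capital letters.
    """
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return (0.0, 0, 0)
    total = sum(runs)
    return (total / len(runs), max(runs), total)


if __name__ == "__main__":
    sample = "FREE money!!! Call GEORGE now, free FREE"
    print("word_freq_free:", word_freq(sample, "free"))
    print("char_freq_!:", char_freq(sample, "!"))
    print("capital run (average, longest, total):", capital_run_stats(sample))
```

Because the total run length is the sum over all runs, it equals the number of
capital letters in the text, which matches the second definition given for
capital_run_length_total.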