Association rule mining | Computer Science homework help

1 Association rule mining

In this notebook, you’ll implement the basic pairwise association rule mining algorithm.

To keep the implementation simple, you will apply your implementation to a simpliﬁed

dataset, namely, letters (“items”) in words (“receipts” or “baskets”). Having ﬁnished that code,

you will then apply that code to some grocery store market basket data. If you write the code

well, it will not be difﬁcult to reuse building blocks from the letter case in the basket data case.

1.1 Problem deﬁnition

Let’s say you have a fragment of text in some language. You wish to know whether there are

association rules among the letters that appear in a word. In this problem:

• Words are “receipts”

• Letters within a word are “items”

b, where a and b are

letters. You will write code to do that by calculating for each rule its conﬁdence, conf(a =

)

You want to know whether there are association rules of the form, a =

)

b).

“Conﬁdence” will be another name for an estimate of the conditional probability of b given a, or

Pr[b

a].

1.2 Sample text input

Let’s carry out this analysis on a “dummy” text fragment, which graphic designers refer to as the

lorem ipsum:

In [ ]: latin_text = “””

Sed ut perspiciatis, unde omnis iste natus error sit

voluptatem accusantium doloremque laudantium, totam

rem aperiam eaque ipsa, quae ab illo inventore

veritatis et quasi architecto beatae vitae dicta

sunt, explicabo. Nemo enim ipsam voluptatem, quia

voluptas sit, aspernatur aut odit aut fugit, sed

quia consequuntur magni dolores eos, qui ratione

voluptatem sequi nesciunt, neque porro quisquam est,

qui dolorem ipsum, quia dolor sit amet consectetur

adipisci[ng] velit, sed quia non numquam [do] eius

modi tempora inci[di]dunt, ut labore et dolore

magnam aliquam quaerat voluptatem. Ut enim ad minima

veniam, quis nostrum exercitationem ullam corporis

suscipit laboriosam, nisi ut aliquid ex ea commodi

consequatur? Quis autem vel eum iure reprehenderit,

qui in ea voluptate velit esse, quam nihil molestiae

consequatur, vel illum, qui dolorem eum fugiat, quo

voluptas nulla pariatur?

At vero eos et accusamus et iusto odio dignissimos

ducimus, qui blanditiis praesentium voluptatum

deleniti atque corrupti, quos dolores et quas

molestias excepturi sint, obcaecati cupiditate non

provident, similique sunt in culpa, qui officia

deserunt mollitia animi, id est laborum et dolorum

fuga. Et harum quidem rerum facilis est et expedita

distinctio. Nam libero tempore, cum soluta nobis est

eligendi optio, cumque nihil impedit, quo minus id,

quod maxime placeat, facere possimus, omnis voluptas

assumenda est, omnis dolor repellendus. Temporibus

autem quibusdam et aut officiis debitis aut rerum

necessitatibus saepe eveniet, ut et voluptates

repudiandae sint et molestiae non recusandae. Itaque

earum rerum hic tenetur a sapiente delectus, ut aut

reiciendis voluptatibus maiores alias consequatur

aut perferendis doloribus asperiores repellat.

“””

print(“First 100 characters:n {} …”.format(latin_text[:100]))

Exercise 0 (ungraded). Look up and read the translation of lorem ipsum!

Data cleaning. Like most data in the real world, this dataset is noisy. It has both uppercase and

lowercase letters, words have repeated letters, and there are all sorts of non-alphabetic characters.

For our analysis, we should keep all the letters and spaces (so we can identify distinct words), but

we should ignore case and ignore repetition within a word.

For example, the eighth word of this text is “error.” As an itemset, it consists of the three unique

letters,

e, o, r

. That is, treat the word as a set, meaning you only keep the unique letters.

This itemset has three possible itempairs:

e, o

e, r

, and

o, r

Start by writing some code to help “clean up” the input.

Exercise 1 (normalize_string_test: 2 points). Complete the following function,

normalize_string(s). The input s is a string (str object). The function should return a new

string with (a) all characters converted to lowercase and (b) all non-alphabetic, non-whitespace

characters removed.

Clariﬁcation. Scanning the sample text, latin_text, you may see things that look like

special cases. For instance, inci[di]dunt and [do]. For these, simply remove the

non-alphabetic characters and only separate the words if there is explicit whitespace.

For instance, inci[di]dunt would become incididunt (as a single word) and [do]

would become do as a standalone word because the original string has whitespace on

either side. A period or comma without whitespace would, similarly, just be treated

as a non-alphabetic character inside a word unless there is explicit whitespace. So

e pluribus.unum basium would become e pluribusunum basium even though your

common-sense understanding might separate pluribus and unum.

Hint. Regard as a whitespace character anything “whitespace-like.” That is, consider

not just regular spaces, but also tabs, newlines, and perhaps others. To detect whitespaces

easily, look for a “high-level” function that can help you do so rather than checking

for literal space characters.

In [ ]: def normalize_string(s):

assert type (s) is str

# YOUR CODE HERE

# Demo:

print(latin_text[:100], “…n=>”, normalize_string(latin_text[:100]), “…”)

In [ ]: # `normalize_string_test`: Test cell

norm_latin_text = normalize_string(latin_text)

assert type(norm_latin_text) is str

assert len(norm_latin_text) == 1694

assert all([c.isalpha() or c.isspace() for c in norm_latin_text])

assert norm_latin_text == norm_latin_text.lower()

print(“n(Passed!)”)

Exercise 2 (get_normalized_words_test: 1 point). Implement the following function,

get_normalized_words(s). It takes as input a string s (i.e., a str object). It should return a

list of the words in s, after normalization per the deﬁnition of normalize_string(). (That is, the

input s may not be normalized yet.)

In [ ]: def get_normalized_words (s):

assert type (s) is str

# YOUR CODE HERE

# Demo:

print (“First five words:n{}”.format (get_normalized_words (latin_text)[:5]))

In [ ]: # `get_normalized_words_test`: Test cell

norm_latin_words = get_normalized_words(norm_latin_text)

assert len(norm_latin_words) == 250

Calculate the price of your order

Type of paper needed:

Pages:

550 words

Academic level:

We'll send you the first draft for approval by September 11, 2018 at 10:52 AM

Total price:

$26

The price is based on these factors:

Academic level

Number of pages

Urgency

Our guarantees

Delivering a high-quality product at a reasonable price is not enough anymore.
That’s why we have developed 5 beneficial guarantees that will make your experience with our service enjoyable, easy, and safe.

Money-back guarantee

You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.

Zero-plagiarism guarantee

Each paper is composed from scratch, according to your instructions. It is then checked by our plagiarism-detection software. There is no gap where plagiarism could squeeze in.

Free-revision policy

Thanks to our free revisions, there is no way for you to be unsatisfied. We will work on your paper until you are completely happy with the result.

Privacy policy

Your email is safe, as we store it according to international data protection rules. Your bank details are secure, as we use only reliable payment systems.

Fair-cooperation guarantee

By sending us your money, you buy the service we provide. Check out our terms and conditions if you prefer business talks to be laid out in official language.