Frequency count on a range of data

Challenge: I have thousands of rows of data, each containing multiple columns of piece of information. I want a frequency count of each piece of information that appears.

I received some data the other day. In total, there were 1,000 rows of raw data. Each record contained a set of comma-separated tags. The most tag-heavy record had 200 tags; other records had but one or two. My mission was to create a single, deduplicated list of tags, together with their frequencies.

First, load the data into the first column of Excel. The next step is to get each tag into a separate cell.

Data | Text to Columns

This gives you the option to declare the comma as the separator, and what was a single column becokmes 200 columns of data. The next step is to stack them. To do this requires two functions: ADDRESS and INDIRECT.

The ADDRESS function brings back the actual text of a cell reference. So =ADDRESS(2,3) will bring back the cell reference of row 2, column 3, i.e. "C2". But it brings this back as text.

This is where the INDIRECT function comes into play. INDIRECT takes some text and interprets it as a formula. So =INDIRECT(C2) will give you the contents of C2.

Combine the two, and you have:

=INDIRECT(ADDRESS(2,3))

This will bring back the contents of cell C2. But if you make the two arguments of the ADDRESS formula into cell references, you have poetry.

The idea is to create two lists containing the vertical and horizontal reference numbers of each cell in the 1,000-by-200 matrix, and use that to inform the above formula.

So in a new sheet, create two columns—one containing the row references of the tags and one containing their column references. In the first, put 200 1s followed by 200 2s, 200 3s etc. until we hit 1,000—200,000 rows in total. The second column contained the numbers 1 through 200, repeated 1,000 times.

In the third column, apply the above formula to each of those row–column combinations. This will bring back every single tag in a list, with zeros where there was no tag. (I.e. a record with five tags will have those listed followed by 195 zeros.)

Copy and paste values, and you can then sort the list alphabetically by tag, removing the zeros.

A simple PivotTable then gives you the frequency count for each tag.

This entry was posted in How to and tagged , , , , . Bookmark the permalink.