I paid five dollars for a copy of my own Acxiom profile and this is all I got

Acxiom, the “world’s largest commercial database on consumers,” does not make it easy.

Or rather, they don’t make it easy for the consumers to find out information about themselves. It’s probably a whole lot easier to use if you yourself are a marketer.

For you as a consumer, however, Acxiom provides two ways to access your own data. Neither of them is complete and one of them involved me mailing a five dollar check to the company’s office in Arkansas.

I will get to that later but first let’s dwell a bit in anger. Acxiom makes consumers, whose very data they are selling as a product without any permission whatsoever, pay five dollars to get a report about themselves. Five dollars! After having to mail a paper check to the company I expected a thick paper report to come by mail in return. Instead I got a two-page PDF by email. Gah. But I’m getting all out of order.

First approach: AboutTheData

Acxiom launched AboutTheData in 2013 in order to give consumers “a glimpse of some of the details the company has collected about them.” I dipped my toe in the Acxiom water by creating an account. The site asks you a series of sort of trick questions in order to do so, such as: “which address have you not lived at?” The first time I went through the process I boycotted before finishing; I didn’t want my answers to contribute to making my Acxiom profile even more accurate. Two days later, I gave in to curiosity. They knew the answers anyway.

After creating an account the site shows you data about yourself in six different categories: characteristic data, home data, household vehicle data, household economic data, household purchase data, household interests data.

For me some of these sections showed up as empty. For some, this totally made sense. I do not own, nor have I ever owned, a car. Other categories suspiciously showed up as empty on AboutTheData but were pretty comprehensive in the reference report I later requested. Hm.

What did I learn from AboutTheData? I think they think I am my own child. Or somehow, that I have a 16-17 year old daughter I didn’t know about.

They were on point with my credit card data though.

Second approach: US Reference Report

All of Acxiom’s web design budget seems to have been blown on AboutTheData. The page on their main site informing you how to file a reference report is pretty sparse. The gist is this: If you submit a request for a report through their website, respond affirmatively to several emails and separately mail a check to Arkansas made payable to Acxiom you too can receive a two-page PDF about yourself.

The details were sparse but Acxiom does indeed know the address of every apartment I’ve ever lived in and the length of my stay.

I was interested to note too, the well filled-in voter info section of the report. I was particularly interested to note the ways that it was on the nose (I am indeed a registered Democrat) and the ways it was not (I’ve voted much more recently than 2013).

My victories, meaning their inaccuracies, felt small. In the end, Acxiom’s got my five dollars and my consumer profile too.

Excellent Internetty Podcasts

I’m completely obsessed with podcasts. It’s a big part of the reason I started one, Internet School Podcast on Internet Studies, with Ellie Marshall. I listen to podcasts almost constantly – and on all subjects too. But in thinking about shaping the Internet School Podcast, I wanted to highlight those podcasts that address some element of Internet Studies. This will be an ongoing blog post of particularly cool podcast episodes/segments on tech. Most of the ones listed below are recent, because my memory is shot, but that’s all the more reason to start keeping track.

Let me know if there are any podcasts that I should be listening to but I’m not. Always looking for new recs.


Planet Money – “The People Inside Your Machine” (23 mins) 

Planet Money provides primers on all kinds of financial topics and sometimes that includes the financial implications of technology. This episode does a great job of synthesizing what Amazon’s Mechanical Turk is and why it’s interesting and strange. (Mechanical Turk is an Amazon marketplace for work, typically very small tasks such as identifying an image, that are then paid on a task basis.) To be fair, Planet Money does this sort of great explanation job with every topic they address but Mechanical Turk is a particularly thorny problem. What’s especially cool about this episode too is the interviews with Mechanical Turkers themselves on why they work for the site, what kind of work they get, and how that work is changing.

TLDR – “Hunting for YouTube’s Saddest Comments” (8 minutes)

TLDR is the tech-focused spin-off of On the Media. I’m a weirdo who always likes to read the comments so I’m predisposed to like this episode BUT, the podcast does a good job of highlighting the highly personal and nostalgic side to many YouTube comments. Dig through the layers of hate in the comment section of any video and you might find a gem of a memory of a first kiss.

Start-Up – “How to not Pitch a Billionaire” (26 minutes)

Start-Up documents Alex Blumberg’s attempts to create a podcasting company. The first episode deals with the pitching process, which is where most start-up journeys begin. The title gives away that the initial pitch does not go well and man, this episode has uncomfortable moments. It also reveals how VCs think through what they will, and won’t, agree to invest in.

Reply All – “This Website is for Sale” (19 minutes)

Former TLDR hosts PJ Vogt and Alex Goldman switched to the Start-Up creator’s new podcasting company and started up Reply All. (The podcasting world is, er, small.) This episode delves into the market for sales of domains. Who are the people selling domains exactly? How are prices determined? The world of domain sales, it turns out, is totally bananas.

New Tech City – “The ‘Bi-Literate’ Brain” (23 minutes)

New Tech City deals with the personal implications of technology. This episode addresses the difference between reading on paper and reading on a screen, and validates everything I ever thought about using e-readers. Basically, I’m not alone in having difficulty remembering the plots of books read through e-readers. New Tech City’s host Manoush Zomorodi doesn’t just diagnose the problem, but offers a way to train your brain to be “bi-literate.”

Radio Berkman – “Copyright XXX” (20 minutes)

Radio Berkman is made up of interviews with people doing research at the very cool Berkman Center for Internet & Society at Harvard University. This episode is an interview with Kate Darling on her research on perspectives on copyright in porn. Basically, Darling explains how the porn industry has completely lost the battle against piracy and why many producers don’t care. To be real, this episode completely changed the way I think about intellectual property regulation.


Finding First Names in Open Data

The other day I read a plaintive email from a man with two kids who was asking a Sarasota County Commissioner for advice on looking for a job. I hadn’t hacked the man’s Gmail account – I was just checking out the Sarasota County Commissioner email database. All emails sent to and from Sarasota County Commissioners are a matter of public record in Florida and so the County’s website makes it very easy to search through them. The Sarasota commissioners must be well aware that every hasty reply will end up online, but did the job seeker know that his full name, his description of his job situation and his personal email address would be online for me to see?

This started me thinking about Personally Identifiable Information (PII) in government open datasets. PII is defined by the Department of Homeland Security as “any information that permits the identity of an individual to be directly or indirectly inferred.” The DHS then distinguishes between plain PII and “sensitive” PII which includes social security numbers and information on a person’s religious affiliation, sexual orientation or ethnicity. How much PII is in government open datasets? I took a look at all the data in one portal, the New York State Open New York site.

I combed through all 1,383 datasets on the site, looking for any set that included information on individual people. For the sake of this investigation I defined PII as any one of the following:

1) The name of a specific person who is not an elected official

2) The personal email of a specific person

3) The personal phone number of a specific person

4) The home address of a specific person

I found PII in 69, or about 5%, of the datasets on the site. This PII ranged from the names of all CEOs of active corporations registered in New York to the jockey drivers involved in horse deaths or breakdowns to the names of all building managers for any structures with an oil boiler.
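I went through the datasets by hand, but a rough first pass over column names could be automated. Here is a minimal Python sketch of that idea. The keyword hints and the two sample datasets are assumptions made up for illustration; they are not the portal's real schemas, and a name-based scan like this would miss PII hidden in free-text fields.

```python
# Hypothetical keyword hints for the four PII categories defined above.
# These are illustrative guesses, not an official or complete list.
PII_HINTS = {
    "name": ("first_name", "last_name", "full_name", "owner_name"),
    "email": ("email",),
    "phone": ("phone", "telephone"),
    "address": ("home_address", "street_address"),
}


def pii_categories(column_names):
    """Return the PII categories whose hints appear in any column name."""
    found = set()
    for column in column_names:
        # Normalize "Full Name" -> "full_name" so the hints can match.
        normalized = column.strip().lower().replace(" ", "_")
        for category, hints in PII_HINTS.items():
            if any(hint in normalized for hint in hints):
                found.add(category)
    return found


def share_with_pii(datasets):
    """Fraction of datasets that have at least one PII-like column."""
    flagged = [name for name, columns in datasets.items() if pii_categories(columns)]
    return len(flagged) / len(datasets)


# Made-up example datasets (not real portal schemas).
sample = {
    "disciplined_plumbers": ["Full Name", "License Status", "Street Address"],
    "wifi_hotspots": ["Latitude", "Longitude", "Provider"],
}
```

Anything a scan like this flags would still need a human look, since a column called "name" could just as easily hold the name of a park.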

The most accessed datasets on the website are relatively more likely to involve PII. Three out of ten of the most accessed datasets, or 30%, contain information related to specific people.

By definition, open data must be usable for secondary purposes other than the reason for which it was collected. There are very good reasons why the government should maintain a list of unlicensed plumbers who have received disciplinary action against them – who knows what further pipe havoc these plumbers could wreak? Yet should these unlicensed plumbers have their mistakes held against them in other aspects of their life? What happens when this data is combined with the other information that exists online about this individual? The New York State Open Data Handbook acknowledges that some of the datasets may contain PII and states that “even if there are no legal impediments to publishing the data, releasing the data may have unintended or undesirable effects.” It is time to consider these unintended effects.

Judging Open Data Aggregators

Let’s forget about Google’s domination of search for a min; I believe open data aggregators* have a future as an alternative sort of search engine.  Why? They enable a user to find:

1. Information that is not easily available elsewhere. (Open datasets are generally not indexed for search by Google or other search platforms)

2. Relevant information without knowing exactly the right source.

* Let’s note that I’m defining open data aggregators as any site that allows a user to search through datasets from at least two sources with a single key term. Generally the datasets are from open government sources but they don’t have to be.
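To make that definition concrete, here is a toy Python sketch of what "one key term, at least two sources" means. The Dataset fields, the catalog entries and the function names are all assumptions invented for the example; none of the aggregators discussed here is claimed to work this way internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    source: str       # who published the data
    title: str
    description: str


def search(catalog, term):
    """Return every dataset whose title or description mentions the term."""
    needle = term.lower()
    return [
        d for d in catalog
        if needle in d.title.lower() or needle in d.description.lower()
    ]


def is_aggregator(catalog):
    """True when the catalog draws on at least two distinct sources."""
    return len({d.source for d in catalog}) >= 2


# Made-up catalog entries for illustration only.
catalog = [
    Dataset("City A", "Budget appropriations", "Appropriations by department"),
    Dataset("State B", "Vendor payments", "Payments drawn from the state budget"),
    Dataset("City A", "WiFi hotspots", "Public hotspot locations"),
]

hits = search(catalog, "budget")
```

In this toy catalog the single term "budget" pulls back one dataset from each source, which is the whole appeal: I didn't need to know in advance which government published what.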

Open data aggregators can be even more useful than a search engine for certain purposes. If I am interested in finding out about a particularly private company for instance, searching public filings might provide me with more information than the organization’s website. If I’m a journalist or an academic looking for numbers to back up a claim, an open data aggregator could be just the resource to turn to.

Open data aggregators could do for open data what search did for the web. Which is the Google of open data aggregators in that strained analogy? That part remains to be seen.

I’m impatient though, so I evaluated a few myself to see which might win out.

The contenders: Quandl, Knoema, Datahub, Engage, Enigma and Google Public Data.

What’s their deal?

1) was started in a partnership between the University of Chicago, Urban Center for Computation and Data and DataMade. Right now focuses on government data but lists “unstructured data such as tweets and crowdsourced observations” in its roadmap. The site just launched and is still only in beta.

2) Quandl is a Canadian company that positions itself as a numerical data marketplace. Public data is available for free on Quandl, but the site also mediates the sale of proprietary datasets.

3) Knoema calls itself a “knowledge platform” – what that means is that it is more of an interactive aggregator than the other ones here. Users can upload their own datasets.

4) Datahub is the Open Knowledge Foundation’s data management platform. Lots of government datasets are available but they are meant for developers to download. There is little way to interact with the data on the website itself. Like with Knoema, users can upload their own datasets.

5) Engage, funded by the European Commission, aims to bring together public datasets from all over the European Union for researchers to access.

6) Enigma is beautiful. I’ve said this before on this blog, but it’s still true. The company’s site has a range of datasets available to search. Users can look at whatever for free but have to pay for substantial API access.

7) Google Public Data feels empty and sad.

Who’s the best?

Coolest interface:

Easiest APIs:

Prettiest visualizations:

Most datasets: Quandl

Fewest datasets: Google Public Data (Only 11! Gooooooooogle…)

Winner: Disappointingly no one aggregator has got it all (yet). It depends what you’re looking to do.

Open Data Portal in an Open Data Aggregator Part II

After the last post I wanted to investigate whether Enigma had simply not included expenditure data that was available, or whether that data was missing from municipal open data portals more generally. So I checked out a sample of four cities: Seattle, Chicago, San Francisco and Edmonton. The cities were all pulled from Socrata’s customer spotlight page.


Seattle

  • Does the city’s open data portal have any budget data? Seattle’s portal lists the budget for the last few years but the “budget control levels” datasets included are generalized to the level of department or program.
  • Does it include specific expenditures? No. Not at the level of the specific companies that the city paid out of a program’s budget.
  • Does Enigma have this budget/expenditure data? No. Enigma has two other datasets for Seattle but nothing on the budget.


Chicago

  • Does the city’s open data portal have any budget data? Chicago’s open data portal has a number of datasets listing budget appropriations.
  • Does it include specific expenditures? Yes, but not at the level of company paid. The Chicago budget data includes appropriations for data hardware, data circuits and outsourcing to data centers but nothing on city spending on Socrata.
  • Does Enigma have this budget/expenditure data? No. Enigma does not have any expenditure data for Chicago.

San Francisco

  • Does the city’s open data portal have any budget data? Yes, and interestingly San Francisco is one of the few cities that I’ve seen that puts budget data from every fiscal year all in one dataset, instead of separating them out by year.
  • Does it include specific expenditures? San Francisco’s budget is much more specific than Seattle’s (for example, it lists “copy machine” instead of simply “Information Technology”) but it also does not list the company paid for the item.
  • Does Enigma have this budget/expenditure data? Enigma does not have any government data from San Francisco.


Edmonton

  • Does the city’s open data portal have any budget data? Well, no. But the Edmonton open data portal does list several years’ worth of the results of the city’s “Budget Consultation Survey.”
  • Does it include specific expenditures? N/A
  • Does Enigma have this budget/expenditure data? No. Enigma does not have any government data from any municipality in Canada.

tl;dr Enigma is growing but it is still missing a lot of municipal data. (See the datasets currently requested on Enigma here.) Cities with data portals often don’t include specific expenditures or even any budget data at all in the data they open.

An Open Data Platform in an Open Data Aggregator

So I just spent the last few hours tooling around on the open data aggregator site, Enigma.

If you haven’t spent time on Enigma before, I’d recommend it. The site aggregates thousands of datasets from governments (local, state, national), international organizations and companies (like Crunchbase and interestingly, BP) into one slick and intuitive interface. There are a few things I would still like to see from Enigma but those petty complaints will be part of a later blog post comparing different free open data aggregators. Suffice to say, it’s hard to not be impressed by Enigma.

Today though, I was wondering what I could find in Enigma on the open data platform Socrata. Open data on open data, you know?

The answer interestingly was, not much. My search itself was easy because Enigma enables keyword search over all of its databases at once. So I could peruse any dataset that mentioned Socrata by name. First, I looked through the common company filings, IRS Form 5500 on Employee Benefit Plans, H1-B Visa Applications from the Department of Labor. Then, I found the expenditure reports from the cities and states that are Socrata’s clients and things got more intriguing.

There are three cities or states in Enigma’s datasets that have expenditure reports including how much they have paid Socrata: Austin, Missouri and Oregon. Comparing the spending data from the most recent year that they all have data on (2012), these governments are each spending around $30,000 a year on Socrata. That price seems completely reasonable for deployment or maintenance of an open data site.

What’s odd is that these three governments are a small fraction of Socrata’s customers. According to the customer spotlight on Socrata’s website, the governments of Seattle, Chicago, New Orleans, Oklahoma and San Francisco are all clients as well. This raises the question, why are so many of these open data portals missing expenditure data in Enigma? Stay tuned. I’ll be looking into this more in the next week.


Why I Keep my Instagram Private (Or the Problem with Third Party Use)

My Instagram is kept private not because I care if other Instagrammers can see my photos, but because I worry about what uses third parties might find for them. This privacy decision reveals the importance of consent regarding use of personal information.

Why would I want my photos to be public? I like the idea of contributing to a photo collage on Instagram of an event, a place or a hashtag. Strangers viewing my photos within the app will at least see them within the context I choose to place them. Yet I do not like that if my Instagram account is public, anyone can use my photo for any other purpose as well. My Twitter account is public but because of that I rarely post photos to that site. Photos feel more like my personal property than the nonsense I tweet.

For the first two years of its life Instagram was accessible only through the app. Then, in November 2012, the company placed all Instagram profiles on the web, making public photos visible to anyone, instead of just within the (admittedly fairly wide) circle of Instagram users.

Public Instagram photos are catalogued with the rest of the web, so there is no telling who will use them for what in the future. There are a number of services such as Instasave, Downgram and Instaport that allow users to download their photos and the photos of others en masse. Other third party applications ease the search for photos by location. Once my photos are on the web, I no longer have control over how they’re used.

At the extreme end, recent reports describe users “role-playing” with public baby photos. This means that the user has copied the photo(s) of the baby and then reposted them, pretending the baby is their own. While Instagram users could certainly leave unsavory comments within the app, this kind of behavior is only possible with the ownership over the photos that Instagram on the web allows. Meaning, role-players can more easily download photos from the web.

Of course by having any account at all I am still allowing Instagram (and its owner Facebook) access to my photos. Instagram briefly changed its terms of service to allow itself to sell user photos to advertisers without alerting the users, but a barrage of complaints led to a reversal of this position. The site’s terms of service now state that users own the photos they upload.

I wish that I was always able to decide whether to consent to third party use of my information. This decision is rare. So, when given the opportunity, I will limit third party use of my photos and this is why I keep my Instagram private.

Why Doesn’t Yelp Use (More) Open Data?

Spending as much time as I do on both the New York Open Data portal and on Yelp, I often wonder why there isn’t more overlap between the two. Yelp is a good test case of the difficulties companies face in integrating municipal open data into their product.

What kind of open data? The most obvious would be health inspection results. While browsing potential snack spots you might want to know that the 5-for-a-dollar dumpling place has a four and a half star rating but received a C in its last inspection while this other place is similar but got an A.

Yelp has taken steps towards using health inspection data. In 2012 Yelp launched LIVES, Local Inspector Value Entry Specification, a health inspection open data standard. According to the page for LIVES, any city can participate in the standard but the original launch included just San Francisco and New York. A 2013 article from Slate says that health inspection scores would soon be available in Yelp for Philadelphia, Boston and Chicago as well.

Health scores definitely exist in San Francisco Yelp but I couldn’t find a single New York restaurant with one, nor could I find any listed for restaurants in Philadelphia, Boston or Chicago. The LIVES standard seems to have stalled. What would be the difficulties?

Yelp needs LIVES in order to use health inspection data because each city has its own, slightly different, format. For example, San Francisco scores food establishments on a 1-100 scale while New York health inspectors assign letter grades. The LIVES formatting requires 1-100 scores and so cities like New York would have to decide to which numbers their letter grades correspond. Having different scoring systems across Yelp for health inspections would be confusing to users and so the translation from open data to Yelp is the stumbling block.
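The translation problem is easy to see in a few lines of Python. This is only a sketch: the numbers assigned to each letter are placeholders I picked for illustration, not values chosen by New York, Yelp or the LIVES standard. Picking them is exactly the decision a letter-grade city would have to make.

```python
# Placeholder mapping from letter grades to 1-100 scores.
# These numbers are illustrative assumptions, not official values.
LETTER_TO_SCORE = {"A": 95, "B": 85, "C": 75}


def to_numeric_score(result):
    """Convert one city's inspection result to a 1-100 score.

    Strings are treated as letter grades; numbers are treated as
    scores already on the 1-100 scale and are only range-checked.
    """
    if isinstance(result, str):
        grade = result.strip().upper()
        if grade not in LETTER_TO_SCORE:
            raise ValueError(f"no numeric equivalent chosen for grade {result!r}")
        return LETTER_TO_SCORE[grade]
    if not 1 <= result <= 100:
        raise ValueError(f"score {result!r} is outside the 1-100 range")
    return int(result)
```

With this in place a letter-grade city and a numeric city can sit side by side, but only because someone decided that a C "means" 75, which is a policy choice and not a data-cleaning step.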

What other kinds of New York open data could Yelp use? The sidewalk café dataset. Yelp already lists whether or not restaurants or bars have outdoor seating and using that dataset would improve that. The WiFi hotspot data could guide freelancers to the best WiFi-equipped café nearby.

Additionally using open data requires Yelp to determine matches in the government database for their user-added businesses. This is likely a headache for Yelp as government databases generally list businesses by their legal name while users will add a business under its trade name.
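Here is a rough Python sketch of why that matching is a headache and one way around it: lean on the address when the names disagree. Everything in it is an assumption for illustration, including the record fields, the equal weighting of name and address, the 0.5 threshold and the made-up businesses. I have no idea how Yelp actually does this. The only library call is difflib.SequenceMatcher from the Python standard library.

```python
from difflib import SequenceMatcher

# Common legal suffixes to ignore when comparing names (illustrative list).
LEGAL_SUFFIXES = ("llc", "inc", "corp", "ltd")


def normalize(text):
    """Lowercase, replace punctuation with spaces, drop legal suffixes."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(word for word in cleaned.split() if word not in LEGAL_SUFFIXES)


def similarity(a, b):
    """Similarity between 0.0 and 1.0 after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(listing, records, threshold=0.5):
    """Pick the government record most likely to be the same business.

    Name and address are weighted equally, so an identical address can
    carry a match even when the legal name looks nothing like the
    trade name. Returns None when nothing reaches the threshold.
    """
    best, best_score = None, 0.0
    for record in records:
        name_score = similarity(listing["name"], record["legal_name"])
        address_score = similarity(listing["address"], record["address"])
        score = 0.5 * name_score + 0.5 * address_score
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= threshold else None


# Made-up example: a user-added listing and two government records.
listing = {"name": "Lucky Dumpling", "address": "12 Main St"}
records = [
    {"legal_name": "LD Foods LLC", "address": "12 Main St."},
    {"legal_name": "Golden Harbor Seafood Inc", "address": "400 Ocean Ave"},
]
match = best_match(listing, records)
```

In the made-up example the dumpling place is found through its address, since "LD Foods LLC" shares almost nothing with the name a user would type. Real data would be messier: two businesses at one address, or one business with several.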

If cities want to encourage companies to use the open data then a standard across cities is very helpful. However, a city might not intend for that to be the purpose of its open data program.  Local civic app developers will find it just as easy to use non-standard open data as long as the definitions for each column are clear. That is though, unfortunately, a big if and one that shared municipal open data standards could fix.

TL;DR? OK, I got you below.

Why doesn’t Yelp use (more) open data?

  • Each city has its open data in its own format. Yelp would need to standardize all of these to keep their product looking consistent across markets.
  • Yelp is a database of user-generated info. Reconciling the conflicts between what Yelp has for a certain place and what a government database lists would be difficult.

What kind of New York open data could Yelp use?

How do you get context to trend?

Who’s a reporter anyway these days? My blog can get me a press pass but it doesn’t mean I follow a journalist’s ethical framework. The members of the Smart Global Communications panel at one of the Social Good Summit’s morning masterclasses discussed the increasingly blurry line between news and activism.

The panel included Claire Wardle of UNHCR, Stephen Keppel of Univision, Niall Dunne of BT and Bryn Mooser, co-founder of the new news outlet RYOT. Rajesh Mirchandani of BBC News moderated.

Traditional media outlets are beginning to help their audience to act on stories and participate in the news while aid organizations are starting to create content themselves. Both changes are due to different parts of web 2.0. Social media allows viewers to respond to stories and to send in their own reports. The Internet has also led to the decimation of newspapers and in turn, a sharp reduction in the number of foreign correspondents available to cover breaking stories.

Wardle described how UNHCR has stepped into the void to create coverage of internal crises. She said initially they were happy when news outlets would use their photos or a quote. Now news organizations are routinely embedding entire clips by UNHCR in their stories. As Wardle pointed out, while news organizations may have one or two reporters at most in Syria, UNHCR has hundreds of staff on the ground.

According to Mirchandani the aims of UNHCR are not necessarily at odds with the role of journalists. “You can be an advocate without being an apologist,” he said.  RYOT is actively encouraging this sort of participatory journalism.

“We’re installing in our newsroom the idea of solutions journalism,” Mooser said. RYOT seeks to distinguish itself by integrating a call to action into every story. Mooser and the other panelists repeated both the concept that “good stories will trend” and the idea that detailing ways that the audience can participate is the best way to increase engagement. Yet despite the tech-positive tone of Social Good Summit, social media does not inherently improve discourse and news coverage. Memes and shocking anecdotes are more likely to gain attention than considered reporting. If I retweet, does that mean that I understand?