Platform Architect, TEN7
Ye olde site counter
The dawn (and ubiquity) of Google Analytics
Matomo, an excellent candidate to replace Google Analytics
Problems with reCAPTCHA
IVAN STEGIC: Hey everyone! You’re listening to The TEN7 Podcast, where we get together every fortnight, and sometimes more often, to talk about technology, business and the humans in it. I’m your host Ivan Stegic. My guest today is Tess Flynn, who has been on the show a number of times before. She is DevOps Engineer here at TEN7, though today we’ll be talking a little bit about Google Analytics and reCAPTCHA and the vast amount of data that companies like Google have about us, and the implications that has. Welcome back to the show, Tess.
TESS FLYNN: Hello.
IVAN: Hello! So, this episode idea came from a tweet of yours.
TESS: Cause I like raising trouble whenever I get the chance.
IVAN: Yes, and not only do you raise trouble on Twitter, you post it in Slack and see what people say. [laughing] I’m going to read it for our listeners. So, the tweet went out and you said the following: “I really wish that we in the Drupal community didn’t just roll over and play dead when it comes to Google Analytics, Tag Manager and reCAPTCHA. Each of these have serious privacy implications and we often don’t even bring that up with clients.”
First of all, how’s my Tess imitation?
TESS: Well, what you have to do is you have to prevent the top part of your larynx from actually vibrating and then you’ll get closer.
IVAN: [laughing] I don’t think I can do that. [laughing] Okay, so what’s the genesis of those tweets? Why did you say this?
TESS: So, I was talking with some friends on a Mastodon instance, and one of them brought up that they really don’t want to work on any site that ever touches Google Analytics or any kind of privacy-violating technology. And that got me thinking about this, about the number of different sites that I have worked on over my career, all of which use Google Analytics, and it just occurred to me that I never really bothered to consider: Is that a good thing? What do they do with this data? Where does it go?
For the most part we really have had two different eras in website tracking, or I suppose I could say three different eras. So the first era is the ye olde-fashioned counter application from way back in the stone age when we hewed GIFs and CGI scripts with bear skins and bone saws and all of those things. [laughing]
IVAN: One GIF per number of the counter so you only had 10 to use.
TESS: And then came a server-side analytics period where you had some built-in application that either was loosely or tightly coupled with either your application or your web server, and then you would log into a private portal and look at your stats and they were all rendered in hideous tables because it’s all about the data, nobody cares ever about the presentation, right?
But, now we’re in this era where technology is very pervasive. You can’t really operate in society without some level of technology. I get incensed when a lot of government officials for example, say that you don’t need the internet in order to live in present-day society. And then I go to their website and say, Okay, so how do I apply for unemployment benefits? Oh, it’s a website. I can’t call anybody. Hmm.
Yeah, it’s things like that that start getting on my nerves, and we’re in this period where we have so much data, so much tracking, and so much of a panopticon, that we don’t really realize how much we’ve contributed to it, either directly or indirectly, because we’re only concerned with our little corner of it, and Google Analytics is one of those things that contributes to the larger panopticon.
IVAN: It really is, but it’s so easy Tess, it’s so ubiquitous. Come on.
TESS: Well, so are some of the alternatives that you could use too.
IVAN: Really? Okay, let’s talk about what alternatives there are, because we’ve all heard of other measurement apps, right? There’s Kissmetrics, Heap, Optimizely and Mixpanel, but they all kind of have the same problem, right? They’re these big companies that gobble your data up and then sell them to other companies.
TESS: Exactly. So, they have all the tracking data, all the behavioral data, and then they resell that data, anonymized or not. And anonymizing data is kind of a misnomer. I have worked with private data, HIPAA compliant data before, and I know that there are ways of scraping information even out of that in aggregate, and that is the primary problem with technology today, is there’s a lot of aggregate data. It’s the old story of, Well, I’m not signed in on YouTube on my TV, and then about four weeks later it’s giving you the same exact recommendations that you have on your phone, which you’re logged in on. How does it do that? Behavioral aggregation.
IVAN: Yep. We don’t like that do we?
TESS: All of those solutions that we’ve already talked about are all data that you’re giving away. The reason why they don’t cost you anything or cost you so little is because you’re the product.
IVAN: That’s the old adage isn’t it? If something’s free and you’re not paying for the service, you are likely the product.
TESS: They’re using it as a con in order to get your information, then they turn that information over to advertisers and that’s where the real lucrative income is.
IVAN: Oh man, okay, so let’s talk about some privacy-respecting solutions then. I read about Simple Analytics a while ago, and that seems to be their business model, right? A privacy-respecting analytics platform. They’re based in the Netherlands, you pay them for the service, they promise never to sell your data. They still have your data.
TESS: It’s kind of a ProtonMail approach isn’t it?
IVAN: Exactly. Tell people what you mean by that.
TESS: Okay. So, ProtonMail is, I believe, a Swiss email provider. It provides webmail access, like Gmail, but it has the same kind of marketing pitch, that they exist in a “politically neutral country” insert laugh track here [laughing]. They ostensibly don’t sell your data again, it’s always private and always encrypted. Okay, yeah. How long until you switch CEOs and you break every one of those promises? Because I saw what happened to Keybase. [laughing]
IVAN: Tell me about Keybase before we go on here. What happened to Keybase?
TESS: Keybase was a very promising project where you could share public GPG keys and other security keys in order to facilitate secure communication with other individuals by verifying identity publicly via web servers, social media posts and so on. The problem is that it got all of that Silicon Valley tech-bro VC money into it, and it turned into a blockchain-riddled bitcoin wallet piece of garbage. [laughing]
IVAN: Yeah, I wondered what happened to that. I didn’t do any research into it, but I still have the Keybase account.
TESS: I still have it too, but I don’t even use it anymore. The primary promise that it had, in addition to sharing keys, was the ability to have encrypted communication that was fully anonymous, where you didn’t have to surrender some critical piece of information such as your phone number, which is the problem Signal has.
IVAN: Yeah, yeah, yeah. Well, it sounds like big VC money corrupts good ideas and privacy, is what I think we are saying here in the last few minutes, right? [laughing]
TESS: [laughing] Yeah, but let’s get back to analytics.
IVAN: Yes. I was talking about Simple Analytics and then I was going to tell our audience how you were telling me about something else called Matomo last week. Something that you can host yourself. I did a little bit of research myself on them, but you’ve played around with the product as well. So, Matomo actually goes back to 2007.
TESS: Yeah, it’s one of these second-wave analytics tracker projects inspired by that. But unlike a lot of those projects, this one actually managed to continue to be maintained, continued to get UI and UX improvements during that entire time. And now, it is a very modern application, which has a UI that’s if not comparable, I would almost say better than Analytics.
IVAN: Yeah, don’t say Analytics, say Google Analytics, because as soon as you just keep saying Analytics then it changes it into Xerox and that’s not what we need. [laughing] So, Matomo had another name, they were called “Piwik,” which I think is “kiwi” with the “p” added on backwards.
TESS: Because it’s a PHP application.
IVAN: That’s just brilliant isn’t it? [laughing]
TESS: [laughing] There’s a reason why they probably don’t have that name anymore. Matomo is actually a lot easier to say.
IVAN: And it’s also a word that means “decent” in Japanese.
TESS: That’s actually their third name by the way. They had even older names than that too.
IVAN: Oh, they did? What were they before Piwik?
TESS: Something like “My PHP Analytics” or something like that, going off of “My PHP Web Admin” and the like.
IVAN: Oh, I remember that.
TESS: It goes that far back. [laughing]
IVAN: Wow. Okay. So that’s good actually, right? Okay, so one of their values is transparency, and they say you own the data. So, what does Matomo give us?
TESS: So, there’s two different products with Matomo. There’s Matomo Cloud, which is a hosted solution very much like Google Analytics, Simple Analytics, and all of these other products. It has, again, the ProtonMail promise of Yes, we won’t ever sell your data, but how much do you really trust that? And then they have a fully self-hosted open source version that’s on GitHub. And that’s what got my attention, because now this is an open governance, open release, open code approach that allows me to actually host it myself, maintain the data myself, and make sure that it’s compliant with the laws of my client’s host country.
For example, if I have a client in, say Canada, they don’t want to give their data to an American company which might hand it over to the FBI, or whomever. They want to make sure that it stays inside their country that’s subject to their privacy laws. By hosting it within their country, it is automatically subject to their privacy laws. So, that is a key advantage.
IVAN: That is a really important advantage. Okay, so they offer a cloud hosting solution so you could basically do the same thing as Google Analytics, except now it’s hosted by Matomo. Or you could download it and you can get it from GitHub and host it yourself. How hard is that typically, do you think?
TESS: If you’ve installed Drupal before, this is actually about the same, if not slightly easier. It’s not very difficult. You get an archive of—the techies will call that a Dist—that has the full files already set up, you don’t need to run Composer or any other applications in order to compile it. You put it on a web server, you make sure that you have SSL on it and you’re good to go. Give it a database and you’re ready to rock.
IVAN: Really? And it’s a PHP/MySQL stack?
TESS: Yeah, and it can run behind FPM or nginx or mod_php with Apache. All of those work.
IVAN: What about bandwidth usage and the amount of server resources you might need to run this? I would imagine not very much.
TESS: Well, that’s a very good question, because I actually haven’t had the opportunity yet to hard install it myself. I am still just trialing the cloud version, and I spent most of the weekend trying to build a container because, me, obviously, that will install Matomo itself. And their Kubernetes and container support isn’t really where I want it to be, so I’m trying to fix that and make my own version of it. I had some success making the container, but I haven’t gotten around to developing a Helm chart to put it on Kubernetes yet.
IVAN: So that’s the goal for you then?
TESS: That’s the next thing I’m going to try, and then once it’s on my server I’ll be able to see how much server draw that it actually takes. And then if it’s good enough, I’m going to pull Google Analytics off of my site. I’m not going to use it anymore.
IVAN: I think that’s a good aspirational place to be. I’m not sold that we need Google Analytics on our ten7.com site, and I’m also not sold that we have to be talking to our clients about keeping their Google Analytics either. If we can make that Matomo deployment into something we put into our Kubernetes hosting, I think there’s value to that.
TESS: Yeah, I think it’s actually very possible to set that up as a cluster-wide solution that can be multi-tenant. That’s actually one thing that’s really nice about Matomo: it is a multi-tenant solution out of the box.
IVAN: Tell me about what that means, multi-tenant.
TESS: It means you can configure multiple user accounts with multiple settings.
IVAN: Multiple domain names, right? So, you don’t have to create a new account for every single Google Analytics client. They don’t all have to have their own accounts. You don’t have to worry about them having to create it all, you just create them for your client, they’re all on the same server, and if you wanted to share information between sites and clients, in theory, could you do that?
TESS: I believe so. I haven’t checked to see what the access restrictions are per user account. The indications have suggested that yes, that is possible, but I need to check on that.
TESS: And if not, it’s not a very difficult application to spin up. Once you have the necessary database behind it, you just spin up as many of those as you need.
IVAN: So, besides us owning our own data, and besides this information not feeding into the greater Google Analytics machine and the Google corporation or Alphabet, what else do we get from Matomo?
TESS: It also has another thing where it does real-time live tracking. So, one thing that is part of Matomo’s marketing pitch is that [Google] Analytics tends to be an aggregate solution, it doesn’t show you the exact data for every individual user that visits your site. Whereas Matomo does have the ability to do that every single time, and you can see what their behavior is, and what pages they go to. And it’s actually on your dashboard in Matomo. It has a nice little river of visits on your site which shows you the technology, the device type, how long they were on your site, their geographic locality if you have that enabled, and also which sites they visit in what order, which is really, really useful.
Another thing is that if you have something like the do not track headers enabled, it doesn’t track you, at all. It doesn’t have any weird, scummy, We’ll track you no matter what. If you decide to opt out of tracking it will not track you. Period.
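The opt-out behavior Tess describes can be sketched in a few lines. This is only an illustration of honoring the `DNT: 1` request header on the server side, not Matomo's actual code; the function names `should_track` and `record_visit` and the in-memory `visits` list are hypothetical.

```python
def should_track(headers: dict[str, str]) -> bool:
    """Return False when the visitor sent a Do Not Track opt-out (DNT: 1)."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("dnt") != "1"


def record_visit(headers: dict[str, str], path: str, log: list) -> bool:
    """Append a visit to the log unless the visitor opted out."""
    if not should_track(headers):
        return False  # opted out: store nothing at all
    log.append({"path": path, "user_agent": headers.get("User-Agent", "")})
    return True


# Hypothetical in-memory store standing in for an analytics database.
visits = []
record_visit({"DNT": "1", "User-Agent": "ExampleBrowser"}, "/blog", visits)
record_visit({"User-Agent": "ExampleBrowser"}, "/about", visits)
print(visits)  # only the /about visit is recorded
```

The point of the design is that the check happens before anything is written, so an opted-out visit leaves no record to anonymize or delete later.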
IVAN: That’s nice. That’s really nice. So, I know that one of the things that Matomo’s using in their marketing to differentiate themselves from Google Analytics is this idea that Google Analytics isn’t actually giving you all the data when you look at your results. And Google refers to it as Google Analytics Data Sampling, and they basically say that “sampling is the practice of analyzing a subset of all data in order to uncover the meaningful information in the larger data set.” So, Google doesn’t actually give you all your data or analyze it.
TESS: Mm-hm. They keep it for themselves.
IVAN: They keep it for themselves. But Matomo lets you have it all right? And it basically crunches the numbers for you so that you don’t miss out on things, something like sampling is going to cause.
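To make the sampling point concrete, here is a minimal sketch of the general idea: count matches in a random subset and scale up, versus counting every row. It assumes a simple uniform random sample and is not a description of how Google Analytics actually samples; the data and function names are invented for illustration.

```python
import random


def exact_count(visits, predicate):
    """Count matching visits by looking at every row."""
    return sum(1 for visit in visits if predicate(visit))


def estimate_from_sample(visits, sample_rate, predicate, seed=0):
    """Estimate the count from a random subset, scaled up by the sample rate."""
    rng = random.Random(seed)
    sample = [visit for visit in visits if rng.random() < sample_rate]
    matches = sum(1 for visit in sample if predicate(visit))
    return matches / sample_rate


# Invented data: 10,000 visits, one in fifty of them to /pricing.
visits = [{"path": "/pricing" if i % 50 == 0 else "/blog"} for i in range(10_000)]


def is_pricing(visit):
    return visit["path"] == "/pricing"


print("exact:", exact_count(visits, is_pricing))
print("estimate from 10% sample:", estimate_from_sample(visits, 0.1, is_pricing))
```

The estimate will usually land near the exact figure, but rare events (a page with a handful of visits, say) can be over- or under-counted, which is the kind of thing unsampled reporting avoids.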
TESS: And it also does other things. It also allows you to run campaigns, just like Google Analytics, track particular subsets of your site, and it has its own tag manager built in.
IVAN: So, it actually competes with Google Tag Manager as well. That’s interesting.
TESS: And there’s a Drupal module for it.
IVAN: For Matomo?
TESS: It’s actually pretty easy. You just install it like any other Drupal module, give it a few pieces of information, and then you’re good to go.
IVAN: Version 7 and 8?
TESS: I only installed the 8 version. I’m pretty sure there’s a 7 version. It’s been around for a while.
IVAN: Yeah, I would imagine that there’s a 7 version as well. Wow. And I also saw that Matomo has GDPR and other relevant privacy-respecting banners and cookie options, right, that I don’t think you see with Google Analytics. At least I haven’t seen it.
TESS: It’s a lot more of a respectful solution compared to Google Analytics.
IVAN: So, what do we do as a community then, and not, like you say, roll over and play dead? I suppose talking about it is one thing, that’s why we’re doing this podcast. What else can we do, do you think, to promote privacy-respecting solutions like Matomo?
TESS: I usually would’ve liked to start a conversation with a client like that by asking, “What do you expect to get out of your Analytics? Are you looking for particular pages that are popular? Are you looking to get into particular market sets? Are you just looking for technology compliance information? What are you actually looking for?” Because you don’t want to turn into Google and just say, “I don’t know what we’re looking to do, we’ll take it all.” [laughing]
TESS: Because taking it all is also not particularly great either, and that’s kind of similar to the whole reCAPTCHA discussion as well.
IVAN: And I suppose if the client’s requirements are We run a shop that is a commerce site on Drupal, and it’s integrated with something on Shopify, and we’re using Google Ads, and we’re using AdWords and Facebook ads, and we’re really trying to track and optimize our return on our investment, then switching out Google Analytics for something like Matomo might not be a good business decision for that particular client. On the other hand, if it’s a nonprofit, and they are simply trying to gauge visitors, and they don’t have all these deep integrations with commerce, then Matomo’s probably an easier solution, and probably something that your constituents of the nonprofit, if they’re interested in privacy, will respect.
TESS: Mostly I keep thinking that a lot of clients, like a lot of individuals, don’t realize the privacy implications of the technology that we decide that we’re going to use even though it’s everywhere. Like, one thing a lot of political activists will tell you is Don’t take your phone with you to an event, ever. Not even if it’s turned off, don’t even take it with you.
TESS: Because it can be used to geolocate you either roughly or exactly without your knowledge.
IVAN: And you don’t want that.
TESS: No, because that can be used even if it’s a perfectly legal, perfectly peaceful event, that can be used to get you fired later, and that actually has been shown to happen.
IVAN: In the tweet you sent out you also mentioned reCAPTCHA. So, let’s talk about that. Remind everybody what reCAPTCHA is please.
TESS: So reCAPTCHA is a visual authentication mechanism. It usually presents you with an image or series of images, and you are to select the images that correspond to a list of text. Like, select all traffic cones, for example. And then you select the number of traffic cones, and then it validates you through. Sometimes it will also gather data silently, like your mouse wiggles, if you’ve changed windows, if that behavior is nonlinear or non-regular. As a result, it can actually determine from there that you’re probably a human with relatively high confidence, because you’re not acting like a script along a particular determined programmatic path to interact with the site. This is used as a stopgap in order to keep sites from having registrations from any number of bot accounts and bot systems and click farms in order to make sure that the people who sign up for your site are “real people.”
There’s a number of different problems with this. One, it doesn’t always work. There are ways of defeating reCAPTCHA even though it’s there and it works most of the time, and that’s a practical concern. The next concern is, it is a nightmare for accessibility. This is nearly impossible to work if you’re using a screen reader. So, if you have visual issues or motor control issues and you can’t use a mouse, this is just going to make your day that much worse. And that is actually a legal problem for a lot of sites.
Then I usually want to ask a client, Why do you need it? What are you hoping to use it for? And then we have to think about that more strategically, Why do you have users registering for your site? Is that really necessary? How else can you track them? Are there other mechanisms that you can use to delay click farms and other bot scripts from actually accessing your site?
An old-fashioned pattern is you sign up for the site, and you don’t get to log in immediately, you actually have to wait for an email. Some sites like publishing firms that rely primarily on commenting in order to make sure that they have high engagement with their readership, that might not be ideal, but it is an effective method even today in order to prevent people from accessing it. And these were all the label-on-the-tin concerns; when you open the can [laughing] to find the worms inside, I have to drop a bombshell on you.
IVAN: Tell me what it is.
TESS: So, let’s say that you are using an iPhone. You pull out your iPhone and you’re using Face ID to unlock it. Perfectly normal, perfectly innocent.
TESS: The same machine vision algorithm that is used to unlock your phone can also be used for target acquisition for a weaponized drone. Do we really want to contribute to that kind of technology, even as consumers?
IVAN: It doesn’t sound like we do. [laughing]
TESS: [laughing] I wouldn’t want to do that either. It is very uncomfortable and does not sit well with my conscience. And reCAPTCHA is a shadow method to do machine vision learning. That is exactly what it’s for. That’s why it gives you pictures. That’s also why Google Voice existed. Because it was a means to train vocal recognition so they could turn that technology around and sell it.
IVAN: Oh, that’s right. I’d forgotten about that.
TESS: Modern technology companies, if they are selling a product and it seems to be free, it’s not free. In fact, you’re two products at the same time. The one that they’re using to gain your behavioral data, and they’re using it to train something else that they’ll also turn around and sell later.
IVAN: There’s a trend here right? I mean, Google Analytics was an acquisition that Google made of Urchin, and we’ve talked about this on the podcast in the past with Dan Antonson. And so that’s why the Google Analytics UA numbers all start with UA, because it used to be called Urchin Analytics. reCAPTCHA is a Google product now. That was previously open source and was acquired by Google in 2009 I think, or 2008, something like that, so, more than 10 years ago. So, reCAPTCHA has really been this machine learning exercise that Google has been producing and using to the benefit of its shareholders and to the detriment of the community that is giving all this information to them. So, that’s the trend. But, there are alternatives to reCAPTCHA. You mentioned honeypots and different ways of determining whether or not a user is actually human. But there’s this other CAPTCHA alternative that I read about the other day here, hCAPTCHA. Cloudflare is actually implementing hCAPTCHA across its end.
TESS: Cloudflare, that doesn’t make me suspicious already.
IVAN: Yes, so I have my own thoughts about Cloudflare as well.
TESS: I trust Cloudflare as far as I could throw them.
IVAN: At least, at least, it’s not going to Cloudflare and Google if you’re using Cloudflare endpoints now. I suppose that’s a silver lining.
TESS: One of the things that immediately stood out to me with hCAPTCHA when I was looking at it is, it primarily relies on you installing a browser extension, and that’s already nope city for me.
IVAN: Yeah. Is that really how it works? You have to have an extension?
TESS: That was right on the main info page, yeah.
IVAN: So, you can’t actually visit a site that has an hCAPTCHA on it without having the extension on it? That doesn’t seem like it’s going to work.
TESS: No. It’s probably not a very good option. A lot of these technologies are really trying to work around one fundamental thing, which is, we want to keep bots out of particular sites, keep humans in those sites, so that they don’t start posting ads all over the place. And, the thing is, there’s already an existing solution for that that works.
IVAN: What is it?
TESS: It’s called a content moderator. You pay someone.
IVAN: Yeah, okay.
TESS: [laughing] Because human beings are generally pretty good at figuring out if another person’s a human being or posting an ad when they shouldn’t be. [laughing]
IVAN: That doesn’t scale well.
TESS: And the problem is that it costs a lot of money and it’s very stressful and even then it has ethical implications for the moderator themselves. There are distinct studies that have shown that being a content moderator causes a lot of people to have PTSD issues for years and years and years after they leave that particular business, because it’s just that nasty. There are so many nasty people out there.
IVAN: Lots of trolls. I can’t imagine how it is to deal with that kind of PTSD after moderating the content.
TESS: This is why I keep going back to the same question of, Is this really something that you want to do? Is this something that’s really necessary for your site? Because if it’s not, then maybe you shouldn’t. Maybe it’s not that important for your business. Maybe there’s alternative ways of doing it.
IVAN: I like the idea of putting in your phone number or even your email address to log in to receive a special token, maybe it’s even a one-time use token, so that you can comment.
TESS: An email address would probably be my preferred one. Phone numbers have different implications that are also worrying.
IVAN: Yeah, phone numbers are also a dime a dozen too. You can very easily get them and very cheaply use them with APIs as well, so it’s not going to provide you a great deal of protection, right?
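The emailed one-time token idea from this exchange can be sketched roughly as follows. This is a minimal in-memory illustration, not a production design: the `OneTimeTokens` class, its method names, and the 15-minute lifetime are assumptions, and actually sending the email is left out.

```python
import hashlib
import secrets
import time


class OneTimeTokens:
    """Issue single-use tokens tied to an email address (in-memory sketch)."""

    def __init__(self, ttl_seconds=900):
        self.ttl_seconds = ttl_seconds
        self._pending = {}  # sha256(token) -> (email, expiry timestamp)

    def issue(self, email, now=None):
        """Create a token to be emailed to the user; only its hash is stored."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        digest = hashlib.sha256(token.encode()).hexdigest()
        self._pending[digest] = (email, now + self.ttl_seconds)
        return token

    def redeem(self, token, now=None):
        """Return the email if the token is valid, and consume it either way."""
        now = time.time() if now is None else now
        digest = hashlib.sha256(token.encode()).hexdigest()
        entry = self._pending.pop(digest, None)  # pop makes the token single-use
        if entry is None:
            return None
        email, expires_at = entry
        return email if now <= expires_at else None


tokens = OneTimeTokens()
emailed = tokens.issue("reader@example.com")
print(tokens.redeem(emailed))  # the email address, first time only
print(tokens.redeem(emailed))  # None: the token has already been used
```

Storing only a hash of the token means a leaked token table cannot be replayed directly, and removing the entry on the first redemption attempt is what makes the token one-time.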
IVAN: Okay. So, what’s the moral of the story here before we wrap up, do you think?
TESS: I think the moral of the story is that we need to talk to clients a lot more about the privacy implications of the technologies that we’re pulling off the shelf. If we’re going to be recommending a particular solution, we really should take a moment to consider, Is this the right solution? Do we actually want to implement that? Do we want to feed these other corporations who do who knows what with this data later, of which we will become subsequently culpable for supporting, even vicariously? So, that’s something that I think we do have to be aware of and we have to raise the knowledge and the understanding of ourselves and our clients in order to bring this forward. And only then can we hopefully make a small, tiny infinitesimal step towards a more ethical industry.
IVAN: I think you’re right. I think the moral of the story here is education. And as the vendors and the service providers in this situation, we should be educating our clients about options and alternatives to Google Analytics. Maybe Google Analytics is okay in some cases. Most cases, probably alternatives would be just fine, and that goes for reCAPTCHA as well.
Well, Tess, I hope this episode didn’t give you PTSD. [laughing] I really appreciate you spending your time with me today.
TESS: I just want people to think about things that they haven’t thought of before.
IVAN: Thank you very much. Tess Flynn is DevOps Engineer at TEN7, and you can find her online as @socketwench. That’s “wench” not “wrench.” And she’s on Twitter, Mastodon, Drupal.org, Patreon and more. You can check out her website as well at deninet.com.
You’ve been listening to The TEN7 Podcast. Find us online at ten7.com/podcast. And if you have a second, do send us a message. We love hearing from you. Our email address is [email protected]. Until next time, this is Ivan Stegic. Thank you for listening.