Piwik - Open Analytics Platform

How to remove data of disabled websites from the database ?

Some words of context :

A few weeks ago, I enabled import of webserver logs into Piwik to get better information about who's visiting me, and especially static contents. This worked fine BUT had serious downsides : it is insanely CPU and HDD greedy (which I don't have a lot of since I'm running on a low-cost virtual machine). So this is why I had to disable it all, and remove entries from the database.

Let's proceed :

The procedure below also shows the preliminary steps while searching for details about the database itself and the data I'm trying to delete. These steps are still there "for future reference" but also because I'm not sure so far this procedure is final and erased all the unnecessary information. I'll amend it if need be.

Before manually deleting information about one or more websites from the database, make sure you have :
  • Deleted these websites using Piwik's web UI
  • Disabled data import + processing + archiving for these websites (CRON jobs)
MariaDB [(none)]> use piwik;
	Database changed

MariaDB [piwik]> show tables;
	+--------------------------------+
	| Tables_in_piwik                |
	+--------------------------------+
	| piwik_access                   |
	| piwik_archive_blob_YYYY_MM    |
	...
	| piwik_archive_numeric_YYYY_MM |
	...
	| piwik_goal                     |
	| piwik_log_action               |
	| piwik_log_conversion           |
	| piwik_log_conversion_item      |
	| piwik_log_link_visit_action    |
	| piwik_log_profiling            |
	| piwik_log_visit                |
	| piwik_logger_message           |
	| piwik_option                   |
	| piwik_report                   |
	| piwik_segment                  |
	| piwik_sequence                 |
	| piwik_session                  |
	| piwik_site                     |
	| piwik_site_setting             |
	| piwik_site_url                 |
	| piwik_user                     |
	| piwik_user_dashboard           |
	| piwik_user_language            |
	+--------------------------------+
	68 rows in set (0.00 sec)

MariaDB [piwik]> DESCRIBE piwik_site;
	+--------------------------------+------------------+------+-----+---------+----------------+
	| Field                          | Type             | Null | Key | Default | Extra          |
	+--------------------------------+------------------+------+-----+---------+----------------+
	| idsite                         | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
	| name                           | varchar(90)      | NO   |     | NULL    |                |
	...

MariaDB [piwik]> SELECT idsite, name FROM piwik_site;
	+--------+-----------+
	| idsite | name      |
	+--------+-----------+
	|      1 | mySite_1 |
	|      2 | mySite_2 |
	|      5 | mySite_5 |
	|      7 | mySite_7 |
	+--------+-----------+
	4 rows in set (0.00 sec)

The sites with ids 1, 2, 5 and 7 are the sites that are registered and still appear in Piwik's web UI.
Sites with ids 4 and 6 (see below) have been deleted : they are not displayed anymore in the web UI, but they still appear in the database.

MariaDB [piwik]> SELECT idsite, count(idvisit) FROM piwik_log_visit WHERE idsite in (4, 6) GROUP BY idsite;
	+--------+----------------+
	| idsite | count(idvisit) |
	+--------+----------------+
	|      4 |           9022 |
	|      6 |          72890 |
	+--------+----------------+
	3 rows in set (0.19 sec)

MariaDB [piwik]> DESCRIBE piwik_log_link_visit_action;
	+------------------------------+------------------+------+-----+---------+----------------+
	| Field                        | Type             | Null | Key | Default | Extra          |
	+------------------------------+------------------+------+-----+---------+----------------+
	| idlink_va                    | int(11) unsigned | NO   | PRI | NULL    | auto_increment |
	| idsite                       | int(10) unsigned | NO   | MUL | NULL    |                |
	...
	+------------------------------+------------------+------+-----+---------+----------------+
	27 rows in set (0.51 sec)

MariaDB [piwik]> SELECT idsite, count(idlink_va) FROM piwik_log_link_visit_action WHERE idsite in (4, 6) GROUP BY idsite;
	+--------+------------------+
	| idsite | count(idlink_va) |
	+--------+------------------+
	|      4 |           281508 |
	|      6 |          4231128 |
	+--------+------------------+
	3 rows in set (1 min 42.07 sec)

At that point, you may be interested in the size of the tables :

dbName='piwik'; ls -Shl /var/lib/mysql/$dbName | head -20
	-rw-rw---- 1 mysql mysql 956M Dec 16 17:50 piwik_log_link_visit_action.ibd
	-rw-rw---- 1 mysql mysql 120M Dec 16 17:28 piwik_archive_blob_2015_01.ibd
	-rw-rw---- 1 mysql mysql 120M Dec 16 17:28 piwik_archive_blob_2015_12.ibd
	...

MariaDB [piwik]> SELECT
	table_name,
	table_rows,
	ROUND(( data_length + index_length ) / 1024 / 1024, 0) AS table_size_MB
FROM
	information_schema.tables
WHERE
	table_schema="piwik"
ORDER BY
	table_size_MB DESC,
	table_rows DESC;

	+--------------------------------+------------+---------------+
	| table_name                     | table_rows | table_size_MB |
	+--------------------------------+------------+---------------+
	| piwik_log_link_visit_action    |     436605 |           365 |
	| piwik_archive_blob_YYYY_MM    |      35384 |            41 |
	| piwik_archive_blob_YYYY_MM    |      14581 |            22 |
	| piwik_log_visit                |      35806 |            15 |
	| piwik_archive_numeric_YYYY_MM |      14217 |            12 |
	| piwik_archive_blob_YYYY_MM    |      12322 |            12 |
	| piwik_archive_numeric_YYYY_MM |       6270 |             4 |
	| piwik_log_action               |       5947 |             4 |
	...
	+--------------------------------+------------+---------------+
	68 rows in set (0.72 sec)
	This is the "after" version.
Time to flush out some data, now :
MariaDB [piwik]> DELETE FROM piwik_log_link_visit_action WHERE idsite in (4, 6);
	Query OK, 4512645 rows affected (1 hour 9 min 35.56 sec)

MariaDB [piwik]> DELETE FROM piwik_log_visit WHERE idsite in (4, 6);
	Query OK, 81914 rows affected (3 min 53.47 sec)

Now, let's reclaim some HDD space to MySQL.
This step in not mandatory at all, but since the database inflated so much, it occupied up to 75% of the total storage space, and I needed some megabytes back

MariaDB [piwik]> OPTIMIZE TABLE piwik_log_link_visit_action;
	+-----------------------------------+----------+----------+-------------------------------------------------------------------+
	| Table                             | Op       | Msg_type | Msg_text                                                          |
	+-----------------------------------+----------+----------+-------------------------------------------------------------------+
	| piwik.piwik_log_link_visit_action | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
	| piwik.piwik_log_link_visit_action | optimize | status   | OK                                                                |
	+-----------------------------------+----------+----------+-------------------------------------------------------------------+
	2 rows in set (31 min 45.15 sec)

MariaDB [piwik]> OPTIMIZE TABLE piwik_log_visit;
	+-----------------------+----------+----------+-------------------------------------------------------------------+
	| Table                 | Op       | Msg_type | Msg_text                                                          |
	+-----------------------+----------+----------+-------------------------------------------------------------------+
	| piwik.piwik_log_visit | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
	| piwik.piwik_log_visit | optimize | status   | OK                                                                |
	+-----------------------+----------+----------+-------------------------------------------------------------------+
	2 rows in set (3 min 46.07 sec)

How to track static resources with Piwik ?

When it's about tracking dynamic content, it's fairly easy : update templates so that a dedicated Javascript snippet is inserted into every generated page. Once in the browser (if not blocked by browser settings/plugins), the snippet is executed and fires an HTTP request toward our Piwik host with some specific parameters : the site ID, the page title, the local IP + browser + plugins + ...
But when it comes to static content such as images, software packages or PDF files, there's no way to inject + run this Javascript snippet. There are still some solutions to track hits on these resources.

Build a content delivery page :

  1. intercept requests to http://www.mydomain.tld/static/document.pdf as they arrive with webserver rewrites.
  2. transfer them internally to /deliverStatic.php?target=document.pdf
  3. in deliverStatic.php, send a custom-built HTTP request to the Piwik Tracking HTTP API (example), then deliver document.pdf

Count hits from the webserver logs (source) :

You should create a dedicated "site" in the Piwik console to use the log analytics method to avoid counting twice the hits on dynamic content. Indeed, server-generated pages already carry the Javascript snippet that will hit our Piwik instance once displayed in a web browser, but sending those pages already created a hit on our webserver, which is already written in access logs. (source)

If you already have a running Piwik instance, this will be rather straightforward :

  1. Find the misc/log-analytics/import_logs.py script on your Piwik installation directory.
    This may default to something like : /var/www/piwik/misc/log-analytics/import_logs.py
  2. Retrieve some parameters :
    • the URL of your Piwik instance : http://piwik.mydomain.tld
    • the Piwik ID of the site you'll be monitoring with webserver logs : idSite (this is possibly a site you've just created)
    • the number of available CPU cores (threads, actually)
    • additionnal settings about whether or not to import hits related to (source) :
      • static content
      • HTTP redirects (HTTP 3xx status codes)
      • HTTP errors (HTTP 4xx and HTTP 5xx status codes)
      • hits made by bots
    • the location of your webserver access logs : /var/log/myWebser/www.mydomain.tld.access.log
  3. then open a shell on your server to launch the log import command (source) : python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://piwik.mydomain.tld --idsite=n --recorders=n --enable-http-errors --enable-http-redirects --enable-static --enable-bots /var/log/myWebser/www.mydomain.tld.access.log
  4. Should be worth doing this on a regular basis via crontab.

Who's that visitor from Sunnyvale, California (199.19.249.196) ?

Situation :

Piwik is set up and running fine, counts visitors as expected, and reports their location. I've configured it to exclude my own visits from the statistics and also to exclude hits from my company's web proxy, but there's a mysterious visitor from Sunnyvale, California (199.19.249.196) who's visiting the same pages my colleagues and I are visiting. What's going on ?

Details :

I first thought something was wrong with the configuration of the excluded IP addresses, or that geotagging was misplacing visitors on Earth.

Solution :

Turns out that the IP address 199.19.249.196 belongs to Blue Coat Systems, a company located in Sunnyvale, California. This IP address actually leads to somewhere in London (says monadresseip.info, confirmed by traceroute).
Blue Coat Systems specializes in network appliances, among which are some web filters that are massively used by companies throughout the world to protect you from NSFW websites (eBay, Facebook, pr0n, ...). So every time one of my colleagues or myself visit my website, the webfilter "checks" it again.

If you prefer to ignore such visits, just append this IP address to the blacklist on the Administration | Web sites page.

Piwik sees no visitors and complains There is no data for this report

This is strongly inspired by this FAQ.
Here are some points you should check :

How to track a new website with Piwik ?

This is fairly straightforward, but since it failed the last time I did it, I'm writing what I did so that I won't forget anything... just in case it doesn't work again ...
  1. Log into Piwik as the user you've created to be the administrator
  2. Open the Settings menu (gear icon in the top left corner
  3. Then Websites and
  4. Name it, enter its main URL : http://www.mydomain.tld, then
  5. This new website should now be listed with others. It should have a Display code hyperlink, which leads to a new page with some extra options and a Javascript snippet that must be loaded on every page you wish to be tracked. You can paste this snippet into your page HTML header template, or save it in an external file and "include" this file from your pages : both work !
  6. That's it. Visit one of your tracked pages (and view its source code to make sure the Javascript snippet is there), you should be tracked !
  7. Open the Piwik dashboard, select your website, then open the Visitors | Log page : the visit you just made should be there.
  8. You may have forgotten something, read : Piwik sees no visitors and complains There is no data for this report.

Since I installed Varnish, all visitors are reported to be from 127.0.0.1 by Piwik

Situation :

Before : my webserver is running Lighttpd and Piwik and everything is going extremely well : visitors are identified (as well as their location) using their IP address.

After : I installed Varnish (on the same server) and made a basic configuration so that "it works", but now, everybody seems to connect from 127.0.0.1

Details :

This is because Varnish intercepts client requests and forwards them to Lighttpd. Thus, as for Lighttpd's point of view, the only client is the local Varnish, hence the 127.0.0.1. To fix this, we'll have to :

Solution :

Varnish : forward the client IP address via the X-Forwarded-For HTTP header :

  1. Make /etc/varnish/default.vcl look like :
    sub vcl_recv {
    	if (req.restarts == 0) {
    		if (req.http.x-forwarded-for) {
    			set req.http.X-Forwarded-For = req.http.X-Forwarded-For + ", " + client.ip;
    			}
    		else {
    			set req.http.X-Forwarded-For = client.ip;
    			}
    		}
    	}
  2. Then restart Varnish

Lighttpd : be aware of the reverse proxy (source) :

  1. Enable the mod_extforward module in /etc/lighttpd/lighttpd.conf :
    server.modules = (
    	...,
    	"mod_extforward",
    	...
    	)
  2. Tell it to search for the client IP address in the X-Forwarded-For HTTP header sent by the trusted proxy 127.0.0.1 :
    extforward.headers   = ("X-Forwarded-For")
    extforward.forwarder = ("127.0.0.1" => "trust")
  3. Then restart Lighttpd

To make sure this is working as expected :

  1. on the webserver : tail -f /var/log/lighttpd/myGreatWebsite.access.log
  2. from anywhere (let's call this place "the client") : wget -S -O /dev/null http://myGreatWebsite/?$RANDOM
    The $RANDOM querystring parameter is to make sure we won't get a HIT from Varnish and that the request will be forwarded to Lighttpd
  3. on the client side, you should get :
    HTTP request sent, awaiting response...
      HTTP/1.1 200 OK
      Cache-Control: public, max-age=600
      Vary: Accept-Encoding
      Last-Modified: Mon, 20 Oct 2014 19:48:47 GMT
      Content-Type: text/html
      Expires: Tue, 21 Oct 2014 09:44:22 GMT
      Date: Tue, 21 Oct 2014 09:34:22 GMT
      Server: lighttpd/1.4.31
      X-Varnish: 196645
      Age: 0
      Via: 1.1 varnish-v4
      ETag: W/"3135140236"
      X-Cache: MISS
      Transfer-Encoding: chunked
      Connection: keep-alive
      Accept-Ranges: bytes
    Length: unspecified [text/html]
    Saving to: `/dev/null'
  4. while on the server :
    a.b.c.d myGreatWebsite - [21/Oct/2014:11:34:22 +0200] "GET /?13574 HTTP/1.1" 200 2221 "-" "Wget/1.13.4 (linux-gnu)"
    where a.b.c.d is the client IP address.

Configure Piwik accordingly (source) :

looks like there's nothing special to do. Forget about stuff below.

To do so, all we have to do is to instruct Piwik to get the client IP from the X-Forwarded-For HTTP header. This can be done in /pathTo/piwik/documentRoot/config/config.ini.php :

proxy_client_headers[] = HTTP_X_FORWARDED_FOR

When making tests, keep in mind that Piwik can not detect Wget hits since no Javascript is executed then.

Piwik howto's

How to view IP addresses from the piwik_log_visit table (source) ?

In the piwik_log_visit table, the contents of the location_ip field is stored as varbinary(16). To transcode it into the usual dotted-quad representation :
SELECT inet_ntoa(conv(hex(location_ip), 16, 10)) AS IP FROM piwik_log_visit;

How to delete some visits (mine, actually ) from the database (feature request) ?

As said on the Piwik forum, it's as simple as :

  1. Deleting records from the log_visit table and log_link_visit_action tables
  2. Dropping the corresponding archive_blob_yyyy_mm and archive_numeric_yyyy_mm tables.

Upon installation, you may have provided a prefix to table names, so effective table name could be different.
  1. Just in case something goes weird : backup the Piwik database
  2. Find the ID of the site having visits to delete :
    SELECT idsite, name, main_url FROM piwik_site; outputs :
    +--------+-------+--------------------------+
    | idsite | name  | main_url                 |
    +--------+-------+--------------------------+
    |      1 | site1 | http://site1.example.com |
    |      2 | site2 | http://site2.example.com |
    +--------+-------+--------------------------+
    
  3. Find the IDs of visits to delete :
    SELECT idvisit FROM piwik_log_visit WHERE idsite=1 AND inet_ntoa(conv(hex(location_ip), 16, 10)) IN ('1.2.3.4', '5.6.7.8'); outputs :
    +---------+
    | idvisit |
    +---------+
    |       1 |
    |       2 |
    |       3 |
    |      20 |
    +---------+
  4. Then delete those visits from log_visit table : DELETE FROM piwik_log_visit WHERE idsite=1 AND idvisit in (1, 2, 3, 20);
  5. Same operation with piwik_log_link_visit_action : DELETE FROM piwik_log_link_visit_action WHERE idsite=1 AND idvisit in (1, 2, 3, 20);
  6. Drop the archive tables containing consolidated data about the visits being deleted : drop table piwik_archive_blob_2014_06; drop table piwik_archive_blob_2014_07; drop table piwik_archive_numeric_2014_06; drop table piwik_archive_numeric_2014_07;
  7. That's it !