First import data from a tsv file (Tab Separated Values) with file() virtual table function. sales.tsv

>>> sql("""create table sales as
...             select Product,Region,Month,cast(Sales as int) as Sales
...                 from
...                     (file 'testing/sales.tsv' 'dialect:tsv' header:t)""")
>>> sql("select * from sales")
Product | Region | Month   | Sales
Cars    | Athens | 2010-01 | 200
Cars    | Athens | 2010-02 | 130
Bikes   | NY     | 2010-01 | 10
Bikes   | NY     | 2010-02 | 30
Cars    | NY     | 2010-01 | 100
Cars    | NY     | 2010-02 | 160
Cars    | Paris  | 2010-01 | 70
Cars    | Paris  | 2010-02 | 20
Bikes   | Paris  | 2010-01 | 100
Bikes   | Paris  | 2010-02 | 20
Boats   | Paris  | 2010-01 | 200

Let’s say we want the result to be a table with columns:

  • Product
  • NY: the total sales in NY for this product
  • Paris: the total sales in Paris for this product
  • Athens: the total sales in Athens for this product

We will perform the aggregate function sum over Sales column, grouping by Product and Region columns. Then vecpack() must be performed over Region and Sales sums, grouping on Product.

To use vecpack() the first argument must be a pack of the dimensions. This is the result of packing the distinct Region values, grouping over all the table.

>>> sql("""select Product,unpackcol(vpck)
...      from
...        (select Product,vecpack(rpk,Region,salessum) as vpck
...         from
...             (select pack(distinct Region) as rpk from sales),
...             (select Product,Region,sum(sales) as salessum
...                 from sales group by Product,Region)
...         group by Product)""")
Product | Paris | NY  | Athens
Bikes   | 120   | 40  | 0
Boats   | 200   | 0   | 0
Cars    | 90    | 260 | 330

Application scenario

In this example we implement a simple application for query recommendation based on user’s country of origin. The input data come from a web portal’s logs.

The query mining workflow includes the following five main steps:

  1. Import the portal log files into a relational table
  2. Use time-heuristics so as to identify coherent query sessions. Assign and store a new, reconstructed session id for each record in the logs
  3. Preprocess and clean logged queries text
  4. Retrieve IP-to-country information from the web and combine this with the IP information from portal log files so as to assign a country code to each one of the logged sessions
  5. Apply an Apriori-based technique for extracting frequent term sets per country.

A sample from the log file used is presented below:

36506        guest   X.X.X.134       p93t9q5eqaa0u0p8isd9skp1n6      en      ("paradis riedinger")   search_sim  2010-01-01 00:28:23
36507        guest   X.X.X.134       p93t9q5eqaa0u0p8isd9skp1n6      en      ("paradis riedinger")   view_brief  2010-01-01 00:28:34
36508        guest   X.X.X.134       p93t9q5eqaa0u0p8isd9skp1n6      en      ("paradis riedinger")   view_brief  2010-01-01 00:28:34
36509        guest   X.X.X.134       p93t9q5eqaa0u0p8isd9skp1n6      en      ("paradis riedinger")   view_full  2010-01-01 00:28:42
36510        guest   X.X.X.55        a9qo379qnl5hbbria6tj6nlp95      de  (creator all "frank leonhard")  search_adv  2010-01-01 00:28:44
36511        guest   X.X.X.55        a9qo379qnl5hbbria6tj6nlp95      de  (creator all "frank leonhard")      view_brief  2010-01-01 00:29:10
36512        guest   X.X.X.55        a9qo379qnl5hbbria6tj6nlp95      de  (creator all "frank leonhard")      search_res  2010-01-01 00:29:18

Step 1

Initially, log files which are available in the Tab Separated Value format, are imported into a relational table. The import process is easily implemented in madIS using the file() function, by selecting the appropriate columns from the “.tsv” file.

create table logs as
   select C1 as id, C2 as userid, C3 as userip, C4 as sesid, C6 as query, c7 as action, C8 as date
   from file('raw_logs.tsv','delimiter:\t','quoting:QUOTE_NONE');

Step 2

Thereafter, session reconstruction, a common task in web usage mining, is performed. In this example it uses a predefined inactivity thresholds to break in-coming sessions. Such functionality is performed in madIS through the datediffbreak() function, which takes as an argument the inactivity threshold (as well as some additional parameters) and returns the new session ids. In the following madSQL segment, where the efficient employment of the described function is presented, the inactivity threshold has been set to 30 minutes (provided in milliseconds).

alter table logs add sesidnew text;
update logs
     set sesidnew=
     (select bgroupid
             ( cache select datediffbreak(sesid,id,date,30*60*1000,'order',date,id)
                             from logs
                             where sesid not null group by sesid)
             where C1=id )
     where sesid is not null;

Step 3

Then, focusing on issued queries, the distinct queries per session are retrieved from the logs and a variety of query text processing steps take place. However, a great amount of malformed queries has been observed, so it is only queries of valid text in utf8 encoding that are selected. This filtering step is performed using the isvalidutf8() function in the where clause of the corresponding madSQL fragment.

Thereafter, stop word removal is performed over the selected queries through the corresponding function. Moreover, since queries are issued and logged using the Common Query Language (CQL) syntax, an additional filtering step is executed for removing CQL constructs, through the madIS function cqlkeywords(). The processed queries are then stored in table QueriesPerSession. One of the advantages offered by madIS framework, is that all the powerful Python capabilities and open source libraries for text processing can be exploited for the rapid implementation of customized functions, hence significantly easing the “flow-level programming”, in madSQL.

create table QueriesPerSession as
select filterstopwords(cqlkeywords(query)) as cleanquery , sesidnew, userid
from logs
where action like 'search%' and cleanquery!='' and isvalidutf8(query)
group by query, sesidnew;

Step 4

Aiming towards the extraction of term associations per country, an external data source is fetched from the web, containing the mapping between IP ranges and countries. Again, the file() function is employed for fetching the URL resource and importing the corresponding data in a main memory indexed table using the cache() function of madIS. The fetched data source specifies the IP ranges in IP long number format, so the logged IPs have to be converted to the same format by using ip2long() function, and then to be subsequently matched to the imported ranges. However, since each session may contain requests from multiple IPs (due to dynamic IPs, etc.), it is only the first encountered IP that is considered for each session. The first IP for each session is obtained using the minrow() function, comparing the corresponding records’ ids. The generated mapping from each session to a country code is stored in table Session2Country.

create temporary table Session2Country as
select sesidnew, CountryCode
     (select ip2long(minrow(id,userip)) as iplong, sesidnew
             from logs
             group by sesidnew),
     (cache select cast(C3 as integer) as ipfrom, cast(C4 as integer) as ipto, C5 as CountryCode
             from file('http://.../GeoIPCountryCSV.zip','dialect:csv','compression:t'))
where iplong>=ipfrom and iplong <= ipto;

Step 5

Then information regarding text of queries, and session to country mappings is jointly used so as to extract term associations, with an apriori-like technique, using the aggregate function freqitemsets(). Frequent query term sets are computed over each one of the country codes that occur in Session2Country table.

create table FrequentItemsets as
select 'nat', CountryCode,
from QueriesPerSession as qs, Session2Country as sc
where  qs.sesidnew = sc.sesidnew
group by CountryCode;