Examples¶
Pivot¶
First, import the data from a TSV (Tab Separated Values) file with the file()
virtual table function.
sales.tsv
>>> sql("""create table sales as
... select Product,Region,Month,cast(Sales as int) as Sales
... from
... (file 'testing/sales.tsv' 'dialect:tsv' header:t)""")
>>> sql("select * from sales")
Product | Region | Month | Sales
----------------------------------
Cars | Athens | 2010-01 | 200
Cars | Athens | 2010-02 | 130
Bikes | NY | 2010-01 | 10
Bikes | NY | 2010-02 | 30
Cars | NY | 2010-01 | 100
Cars | NY | 2010-02 | 160
Cars | Paris | 2010-01 | 70
Cars | Paris | 2010-02 | 20
Bikes | Paris | 2010-01 | 100
Bikes | Paris | 2010-02 | 20
Boats | Paris | 2010-01 | 200
Let’s say we want the result to be a table with columns:
- Product
- NY: the total sales in NY for this product
- Paris: the total sales in Paris for this product
- Athens: the total sales in Athens for this product
We will perform the aggregate function sum over the Sales column, grouping by the Product and Region columns; this yields one sales total per (Product, Region) pair, as shown below.
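The intermediate aggregation, computed directly from the sample data, looks as follows (row order may vary, since no order by is specified):
>>> sql("select Product,Region,sum(Sales) as salessum from sales group by Product,Region")
Product | Region | salessum
----------------------------
Bikes | NY | 40
Bikes | Paris | 120
Boats | Paris | 200
Cars | Athens | 330
Cars | NY | 260
Cars | Paris | 90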
Then vecpack()
must be performed over Region and the Sales sums, grouping on Product.
The first argument of vecpack()
must be a pack of the dimensions, i.e. the
result of packing the distinct Region values over the whole table.
>>> sql("""select Product,unpackcol(vpck)
... from
... (select Product,vecpack(rpk,Region,salessum) as vpck
... from
... (select pack(distinct Region) as rpk from sales),
... (select Product,Region,sum(sales) as salessum
... from sales group by Product,Region)
... group by Product)""")
Product | Paris | NY | Athens
------------------------------
Bikes | 120 | 40 | 0
Boats | 200 | 0 | 0
Cars | 90 | 260 | 330
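The same pattern pivots on any dimension. For instance, packing the distinct Month values instead of the Region values yields monthly totals per product. The totals below follow from the sample data; the column order depends on the order in which the distinct values are packed:
>>> sql("""select Product,unpackcol(vpck)
... from
... (select Product,vecpack(mpk,Month,salessum) as vpck
... from
... (select pack(distinct Month) as mpk from sales),
... (select Product,Month,sum(Sales) as salessum
... from sales group by Product,Month)
... group by Product)""")
Product | 2010-01 | 2010-02
----------------------------
Bikes | 110 | 50
Boats | 200 | 0
Cars | 370 | 310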
Application scenario¶
In this example we implement a simple application for query recommendation based on the user’s country of origin. The input data come from a web portal’s logs.
The query mining workflow includes the following five main steps:
- Import the portal log files into a relational table
- Use time heuristics to identify coherent query sessions, and assign and store a new, reconstructed session id for each record in the logs
- Preprocess and clean the logged query text
- Retrieve IP-to-country information from the web and combine it with the IP information from the portal log files, so as to assign a country code to each of the logged sessions
- Apply an Apriori-based technique for extracting frequent term sets per country
A sample from the log file used is presented below:
36506 guest X.X.X.134 p93t9q5eqaa0u0p8isd9skp1n6 en ("paradis riedinger") search_sim 2010-01-01 00:28:23
36507 guest X.X.X.134 p93t9q5eqaa0u0p8isd9skp1n6 en ("paradis riedinger") view_brief 2010-01-01 00:28:34
36508 guest X.X.X.134 p93t9q5eqaa0u0p8isd9skp1n6 en ("paradis riedinger") view_brief 2010-01-01 00:28:34
36509 guest X.X.X.134 p93t9q5eqaa0u0p8isd9skp1n6 en ("paradis riedinger") view_full 2010-01-01 00:28:42
36510 guest X.X.X.55 a9qo379qnl5hbbria6tj6nlp95 de (creator all "frank leonhard") search_adv 2010-01-01 00:28:44
36511 guest X.X.X.55 a9qo379qnl5hbbria6tj6nlp95 de (creator all "frank leonhard") view_brief 2010-01-01 00:29:10
36512 guest X.X.X.55 a9qo379qnl5hbbria6tj6nlp95 de (creator all "frank leonhard") search_res 2010-01-01 00:29:18
Step 1
Initially, the log files, which are available in the Tab Separated Values format, are imported into a relational table.
The import process is easily implemented in madIS using the file()
function,
selecting the appropriate columns from the “.tsv” file.
create table logs as
select C1 as id, C2 as userid, C3 as userip, C4 as sesid, C6 as query, C7 as action, C8 as date
from file('raw_logs.tsv','delimiter:\t','quoting:QUOTE_NONE');
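Since raw_logs.tsv carries no header row, file() names the columns C1, C2, ... in order. A quick way to check which physical column is which, before writing the create table statement, is to preview a few rows:
select * from file('raw_logs.tsv','delimiter:\t','quoting:QUOTE_NONE') limit 3;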
Step 2
Thereafter, session reconstruction, a common task in web usage mining, is performed.
In this example, a predefined inactivity threshold is used to break incoming sessions.
This functionality is provided in madIS through the datediffbreak()
function,
which takes as arguments the inactivity threshold (along with some additional parameters) and returns the new session ids.
In the following madSQL fragment, which demonstrates the use of the function, the inactivity threshold has been set to 30 minutes (expressed in milliseconds).
alter table logs add sesidnew text;

update logs
set sesidnew =
  (select bgroupid
   from
     (cache select datediffbreak(sesid, id, date, 30*60*1000, 'order', date, id)
      from logs
      where sesid is not null
      group by sesid)
   where C1 = id)
where sesid is not null;
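A minimal sketch of datediffbreak() in isolation, over three hand-made rows with a 2-second threshold (the inline data are illustrative; following the call pattern above, the function emits a bgroupid together with the row id):
select datediffbreak(sesid, id, tm, 2*1000, 'order', tm, id)
from
(select 'a' as sesid, 1 as id, '2010-01-01 00:00:00' as tm union all
 select 'a', 2, '2010-01-01 00:00:01' union all
 select 'a', 3, '2010-01-01 01:00:00')
group by sesid;
The gap between the second and third rows exceeds the threshold, so the third row is assigned a new subsession id.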
Step 3
Then, focusing on the issued queries, the distinct queries per session are retrieved from the logs
and a series of query text processing steps take place.
However, a large number of malformed queries has been observed, so only queries whose text is valid UTF-8 are selected.
This filtering step is performed using the isvalidutf8()
function in the where clause of the corresponding madSQL fragment.
Thereafter, stop word removal is performed over the selected queries through the filterstopwords()
function.
Moreover, since queries are issued and logged using the Common Query Language (CQL) syntax, an additional filtering step removes CQL constructs
through the madIS function cqlkeywords()
. The processed queries are then stored in the table QueriesPerSession.
One of the advantages offered by the madIS framework is that all the powerful Python capabilities and
open source libraries for text processing can be exploited for the rapid implementation of customized functions,
hence significantly easing “flow-level programming” in madSQL.
create table QueriesPerSession as
select filterstopwords(cqlkeywords(query)) as cleanquery, sesidnew, userid
from logs
where action like 'search%' and cleanquery!='' and isvalidutf8(query)
group by query, sesidnew;
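For instance, applied to one of the logged CQL queries shown earlier, the two functions strip the CQL field and operator constructs and any stop words, leaving the bare search terms (the exact result depends on the installed stop word list):
select filterstopwords(cqlkeywords('(creator all "frank leonhard")'));
-- expected result (illustrative): frank leonhard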
Step 4
Aiming at the extraction of term associations per country, an external data source containing the mapping
between IP ranges and countries is fetched from the web.
Again, the file()
function is employed for fetching the URL resource and importing the corresponding data into a main-memory
indexed table, using the cache()
function of madIS. The fetched data source specifies the IP ranges in IP long number format,
so the logged IPs have to be converted to the same format using the ip2long()
function and subsequently matched
against the imported ranges. However, since each session may contain requests from multiple IPs (due to dynamic IPs, etc.),
only the first encountered IP is considered for each session.
The first IP for each session is obtained using the minrow()
function, comparing the corresponding records’ ids.
The generated mapping from each session to a country code is stored in table Session2Country.
create temporary table Session2Country as
select sesidnew, CountryCode
from
(select ip2long(minrow(id,userip)) as iplong, sesidnew
from logs
group by sesidnew),
(cache select cast(C3 as integer) as ipfrom, cast(C4 as integer) as ipto, C5 as CountryCode
from file('http://.../GeoIPCountryCSV.zip','dialect:csv','compression:t'))
where iplong >= ipfrom and iplong <= ipto;
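The conversion performed by ip2long() is plain base-256 arithmetic over the four address octets, which is what makes the range comparison above possible; for example (the address is illustrative):
select ip2long('192.168.1.1');
-- 192*2^24 + 168*2^16 + 1*2^8 + 1 = 3232235777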
Step 5
Then the query text information and the session-to-country mappings are jointly used to extract term associations
with an Apriori-like technique, using the aggregate function freqitemsets()
. Frequent query term sets are computed for each one of the country codes that occur in the Session2Country table.
create table FrequentItemsets as
select 'nat', CountryCode,
freqitemsets(cleanquery,'threshold:2','maxlen:5')
from QueriesPerSession as qs, Session2Country as sc
where qs.sesidnew = sc.sesidnew
group by CountryCode;
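Here 'threshold:2' presumably sets the minimum support (a term set must occur in at least two sessions) and 'maxlen:5' bounds the size of the reported term sets. A quick way to inspect the result for a single country (the country code 'DE' is illustrative, and the output columns of freqitemsets() depend on the madIS version):
select * from FrequentItemsets where CountryCode = 'DE' limit 10;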