Examples ******** .. _pivoting: .. highlight:: mysql Pivot ===== First import data from a tsv file (Tab Separated Values) with :func:`~functions.vtable.file.file` virtual table function. :download:`sales.tsv <../../functions/row/testing/sales.tsv>` .. code-block:: python >>> sql("""create table sales as ... select Product,Region,Month,cast(Sales as int) as Sales ... from ... (file 'testing/sales.tsv' 'dialect:tsv' header:t)""") >>> sql("select * from sales") Product | Region | Month | Sales ---------------------------------- Cars | Athens | 2010-01 | 200 Cars | Athens | 2010-02 | 130 Bikes | NY | 2010-01 | 10 Bikes | NY | 2010-02 | 30 Cars | NY | 2010-01 | 100 Cars | NY | 2010-02 | 160 Cars | Paris | 2010-01 | 70 Cars | Paris | 2010-02 | 20 Bikes | Paris | 2010-01 | 100 Bikes | Paris | 2010-02 | 20 Boats | Paris | 2010-01 | 200 Let's say we want the result to be a table with columns: - Product - NY: the total sales in NY for this product - Paris: the total sales in Paris for this product - Athens: the total sales in Athens for this product We will perform the aggregate function *sum* over *Sales* column, grouping by *Product* and *Region* columns. Then :func:`~functions.aggregate.packing.vecpack` must be performed over *Region* and *Sales* sums, grouping on *Product*. To use :func:`~functions.aggregate.packing.vecpack` the first argument must be a pack of the dimensions. This is the result of packing the distinct *Region* values, grouping over all the table. .. code-block:: python >>> sql("""select Product,unpackcol(vpck) ... from ... (select Product,vecpack(rpk,Region,salessum) as vpck ... from ... (select pack(distinct Region) as rpk from sales), ... (select Product,Region,sum(sales) as salessum ... from sales group by Product,Region) ... group by Product)""") Product | Paris | NY | Athens ------------------------------ Bikes | 120 | 40 | 0 Boats | 200 | 0 | 0 Cars | 90 | 260 | 330 .. _applexample: Application scenario ==================== In this example we implement a simple application for query recommendation based on user's country of origin. The input data come from a web portal's logs. The query mining workflow includes the following five main steps: 1. Import the portal log files into a relational table 2. Use time-heuristics so as to identify coherent query sessions. Assign and store a new, reconstructed session id for each record in the logs 3. Preprocess and clean logged queries text 4. Retrieve IP-to-country information from the web and combine this with the IP information from portal log files so as to assign a country code to each one of the logged sessions 5. Apply an `Apriori-based `_ technique for extracting frequent term sets per country. A sample from the log file used is presented below: .. code-block:: none 36506 guest X.X.X.134 p93t9q5eqaa0u0p8isd9skp1n6 en ("paradis riedinger") search_sim 2010-01-01 00:28:23 36507 guest X.X.X.134 p93t9q5eqaa0u0p8isd9skp1n6 en ("paradis riedinger") view_brief 2010-01-01 00:28:34 36508 guest X.X.X.134 p93t9q5eqaa0u0p8isd9skp1n6 en ("paradis riedinger") view_brief 2010-01-01 00:28:34 36509 guest X.X.X.134 p93t9q5eqaa0u0p8isd9skp1n6 en ("paradis riedinger") view_full 2010-01-01 00:28:42 36510 guest X.X.X.55 a9qo379qnl5hbbria6tj6nlp95 de (creator all "frank leonhard") search_adv 2010-01-01 00:28:44 36511 guest X.X.X.55 a9qo379qnl5hbbria6tj6nlp95 de (creator all "frank leonhard") view_brief 2010-01-01 00:29:10 36512 guest X.X.X.55 a9qo379qnl5hbbria6tj6nlp95 de (creator all "frank leonhard") search_res 2010-01-01 00:29:18 **Step 1** Initially, log files which are available in the Tab Separated Value format, are imported into a relational table. The import process is easily implemented in madIS using the :func:`~functions.vtable.file.file` function, by selecting the appropriate columns from the ".tsv" file. :: create table logs as select C1 as id, C2 as userid, C3 as userip, C4 as sesid, C6 as query, c7 as action, C8 as date from file('raw_logs.tsv','delimiter:\t','quoting:QUOTE_NONE'); **Step 2** Thereafter, session reconstruction, a common task in web usage mining, is performed. In this example it uses a predefined inactivity thresholds to break in-coming sessions. Such functionality is performed in madIS through the :func:`~functions.aggregate.subgroup.datediffbreak` function, which takes as an argument the inactivity threshold (as well as some additional parameters) and returns the new session ids. In the following madSQL segment, where the efficient employment of the described function is presented, the inactivity threshold has been set to 30 minutes (provided in milliseconds). :: alter table logs add sesidnew text; update logs set sesidnew= (select bgroupid from ( cache select datediffbreak(sesid,id,date,30*60*1000,'order',date,id) from logs where sesid not null group by sesid) where C1=id ) where sesid is not null; **Step 3** Then, focusing on issued queries, the distinct queries per session are retrieved from the logs and a variety of query text processing steps take place. However, a great amount of malformed queries has been observed, so it is only queries of valid text in utf8 encoding that are selected. This filtering step is performed using the :func:`~functions.row.text.isvalidutf8` function in the where clause of the corresponding madSQL fragment.  Thereafter, stop word removal is performed over the selected queries through the corresponding function. Moreover, since queries are issued and logged using the Common Query Language (CQL) syntax, an additional filtering step is executed for removing *CQL* constructs, through the madIS function :func:`~functions.row.text.cqlkeywords`. The processed queries are then stored in table *QueriesPerSession*. One of the advantages offered by madIS framework, is that all the powerful Python capabilities and open source libraries for text processing can be exploited for the rapid implementation of customized functions, hence significantly easing the “flow-level programming”, in madSQL. :: create table QueriesPerSession as select filterstopwords(cqlkeywords(query)) as cleanquery , sesidnew, userid from logs where action like 'search%' and cleanquery!='' and isvalidutf8(query) group by query, sesidnew; **Step 4** Aiming towards the extraction of term associations per country, an external data source is fetched from the web, containing the mapping between IP ranges and countries. Again, the :func:`~functions.vtable.file.file` function is employed for fetching the URL resource and importing the corresponding data in a main memory indexed table using the :func:`~functions.vtable.cache.cache` function of madIS. The fetched data source specifies the IP ranges in IP long number format, so the logged IPs have to be converted to the same format by using :func:`~functions.row.iptools.ip2long` function, and then to be subsequently matched to the imported ranges. However, since each session may contain requests from multiple IPs (due to dynamic IPs, etc.), it is only the first encountered IP that is considered for each session. The first IP for each session is obtained using the :func:`~functions.aggregate.selection.minrow` function, comparing the corresponding records’ ids. The generated mapping from each session to a country code is stored in table Session2Country. :: create temporary table Session2Country as select sesidnew, CountryCode from (select ip2long(minrow(id,userip)) as iplong, sesidnew from logs group by sesidnew), (cache select cast(C3 as integer) as ipfrom, cast(C4 as integer) as ipto, C5 as CountryCode from file('http://.../GeoIPCountryCSV.zip','dialect:csv','compression:t')) where iplong>=ipfrom and iplong <= ipto; **Step 5** Then information regarding text of queries, and session to country mappings is jointly used so as to extract term associations, with an `apriori-like `_ technique, using the aggregate function :func:`~functions.aggregate.mining.freqitemsets`. Frequent query term sets are computed over each one of the country codes that occur in Session2Country table. :: create table FrequentItemsets as select 'nat', CountryCode, freqitemsets(cleanquery,'threshold:2','maxlen:5') from QueriesPerSession as qs, Session2Country as sc where qs.sesidnew = sc.sesidnew group by CountryCode;