.. _aggregate-functions-list: Aggregate functions list ======================================================= .. automodule:: functions.aggregate.__init__ :mod:`date` functions --------------------- .. module:: functions.aggregate.date **avgdtdiff function** .. function:: avgdtdiff(date) Returns the average difference between consecutive *date* values of the group in milliseconds. Input dates should be in :ref:`ISO 8601 format <iso8601>`. Examples: >>> table1(''' ... '2007-01-01 00:04:37' ... '2007-01-01 00:04:39' ... '2007-01-01 00:04:40' ... '2007-01-01 00:04:49' ... ''') >>> sql("select avgdtdiff(a) from table1") avgdtdiff(a) ------------ 3000.0 .. doctest:: :hide: >>> sql("select avgdtdiff(a) from (select '2005-01-01' as a) ") avgdtdiff(a) ------------ None >>> sql("select avgdtdiff(a) from (select 5 as a where a!=5) ") avgdtdiff(a) ------------ None **dategroupduration function** .. function:: dategroupduration(date) Returns the duration of the group of dates in seconds. Input dates should be in :ref:`ISO 8601 format <iso8601>`. Examples: >>> table1(''' ... '2007-01-01 00:04:37' ... '2007-01-01 00:04:39' ... '2007-01-01 00:04:40' ... '2007-01-01 00:04:49' ... ''') >>> sql("select dategroupduration(a) from table1") dategroupduration(a) -------------------- 12 >>> sql("select dategroupduration(a) from (select '2005-01-01' as a) ") dategroupduration(a) -------------------- 0 **frecencyindex function** .. function:: frecencyindex(date) Returns the Frecency Index, which is computed based on a set of *date* values, using predefined time-windows. Input dates should be in :ref:`ISO 8601 format <iso8601>`. Examples: >>> table1(''' ... '2011-04-01 00:04:37' ... '2011-01-01 00:04:39' ... '2011-02-12 00:04:40' ... '2011-02-14 00:04:49' ... ''') >>> sql("select frecencyindex(a) from table1") frecencyindex(a) ---------------- 2.9 **mindtdiff function** .. function:: mindtdiff(date) Returns the minimum difference between consecutive *date* values of the group in milliseconds. Input dates should be in :ref:`ISO 8601 format <iso8601>`. 
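The gap computation behind mindtdiff (and avgdtdiff) can be sketched in plain Python. This is a simplified, hypothetical sketch using the standard library's ``datetime.fromisoformat`` parser; the names ``consecutive_gaps_ms`` and ``mindtdiff_sketch`` are illustrative and the actual madIS implementation may parse dates and handle edge cases differently:

```python
from datetime import datetime

def consecutive_gaps_ms(dates):
    """Millisecond gaps between consecutive (sorted) ISO 8601 dates."""
    parsed = sorted(datetime.fromisoformat(d) for d in dates)
    return [(b - a).total_seconds() * 1000.0
            for a, b in zip(parsed, parsed[1:])]

def mindtdiff_sketch(dates):
    """Smallest gap in ms, or None for groups with fewer than two dates."""
    gaps = consecutive_gaps_ms(dates)
    return min(gaps) if gaps else None

dates = [
    '2007-01-01 00:03:13',
    '2007-01-01 00:03:27',
    '2007-01-01 00:03:36',
    '2007-01-01 00:04:39',
    '2007-01-01 00:04:40',
    '2007-01-01 00:04:49',
]
print(mindtdiff_sketch(dates))           # 1000.0
print(mindtdiff_sketch(['2005-01-01']))  # None
```

Averaging the same gap list instead of taking its minimum gives the avgdtdiff behaviour; single-value groups yield None, matching the hidden doctests below.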
Examples: >>> table1(''' ... '2007-01-01 00:03:13' ... '2007-01-01 00:03:27' ... '2007-01-01 00:03:36' ... '2007-01-01 00:04:39' ... '2007-01-01 00:04:40' ... '2007-01-01 00:04:49' ... ''') >>> sql("select mindtdiff(a) from table1") mindtdiff(a) ------------ 1000 .. doctest:: :hide: >>> sql("select mindtdiff(a) from (select '2005-01-01' as a) ") mindtdiff(a) ------------ None >>> sql("select mindtdiff(a) from (select 5 as a where a!=5) ") mindtdiff(a) ------------ None :mod:`graph` functions ---------------------- .. module:: functions.aggregate.graph **graphcliques function** .. function:: graphcliques(node1, node2) -> graph cliques Finds and returns the cliques in the graph defined by the node1<->node2 links. Examples: >>> table1(''' ... n1 n2 ... n2 n3 ... n1 n3 ... n3 n4 ... n4 n5 ... n5 n3 ... n1 n6 ... ''') >>> sql("select graphcliques(a,b) from table1") # doctest: +NORMALIZE_WHITESPACE cliqueid | nodeid ----------------- 0 | n1 0 | n2 0 | n3 1 | n3 1 | n4 1 | n5 **graphpowerhash function** .. function:: graphpowerhash(steps, [undirected_edge], node1, node2, [node1_details, edge_details, node2_details]) -> jpack of graph node hashes Graph power hashing is based on a `power iteration algorithm `_ that calculates hashes on every processing step. The produced output contains, for every node in the input graph, a hash that "describes" its "surroundings". Parameters: :steps: The *steps* option controls the number of steps that the power hashing will be executed for. Another way to conceptualize the *steps* parameter is to think of it as the radius of the graph around a particular node that the node's hash covers. The *steps* parameter's possible values are: - null (default). When steps=null, then steps is automatically set to number_of_nodes/2 - Positive integer value. - -1. Steps is set to number_of_nodes - Negative integers. Steps is set to number_of_nodes / absolute_value(steps) :undirected_edge: This option can only have the *null* value. - Parameter absent. 
The graph is assumed to be directed. - Parameter present and having a *null* value. The graph is assumed to be undirected. :node1,node2: Node1 connects to Node2. If node1 doesn't connect to any node, then *node2*'s value should be null. :node and edge details: Optional details that are processed with the graph's structure. In essence these parameters define "tags" on the nodes and edges of the graph. .. note:: The graph power hash algorithm is an experimental algorithm created by me, Lefteris Stamatogiannakis. I haven't proved its correctness, so please use it with care. Due to its hash usage, there is a (very low) probability that two different graphs could hash to the same power hash. I would be very, very thankful to anyone knowledgeable in graph theory who could prove it wrong (or correct). If the reader knows of a paper that describes another algorithm similar to this one, I would be glad to be pointed towards it. .. note:: The computational complexity of the powerhash algorithm is O(n * steps * average_node_degree). The optimal value for the hash to fully cover the graph is to set the steps parameter to *graph_diameter* / 2. Right now, for steps=null, we take the worst-case upper bound of n / 2, so the computational complexity becomes O(n * ~(n/2) * average_node_degree). Examples: Directed graph: >>> table1(''' ... 1 2 ... 2 3 ... 3 4 ... 4 5 ... 5 3 ... ''') >>> sql("select graphpowerhash(null, a,b) from table1") graphpowerhash(null, a,b) ------------------------------------------------------------------------------------------------------------------------------ ["OaNj+OtIZPqcwjc3QVvKpg","Um7OU79ApcRNA2TKrdcBcA","ZyQT/AoKyjIkwWMNvceK2A","3vaHWSLU/H32HvHTVBkpUQ","+3uZjYUMSXwyZs7HFHKNVg"] Above graph having its nodes renumbered (its powerhash is the same as above): >>> table2(''' ... 2 5 ... 5 4 ... 4 1 ... 1 3 ... 3 4 ... 
''') >>> sql("select graphpowerhash(null, a,b) from table2") graphpowerhash(null, a,b) ------------------------------------------------------------------------------------------------------------------------------ ["OaNj+OtIZPqcwjc3QVvKpg","Um7OU79ApcRNA2TKrdcBcA","ZyQT/AoKyjIkwWMNvceK2A","3vaHWSLU/H32HvHTVBkpUQ","+3uZjYUMSXwyZs7HFHKNVg"] Above graph with a small change (its hash differs from above graphs): >>> table3(''' ... 2 5 ... 5 4 ... 4 1 ... 1 3 ... 3 5 ... ''') >>> sql("select graphpowerhash(null, a,b) from table3") graphpowerhash(null, a,b) ------------------------------------------------------------------------------------------------------------------------------ ["APq1eISun1GpYjgUhiMrLA","NPPh9FLzC5cUedxldXV77Q","VVZ93zo6gePuMeRf6f00Zg","df/4yDABlitCTfOGut0NvA","lqo+lY4fcjqujlgsYr+3Yw"] Actual testing of equality or inequality of above graphs: >>> sql("select hashmd5( (select graphpowerhash(null, a,b) from table1) )=hashmd5( (select graphpowerhash(null, a,b) from table2) ) as grapheq") grapheq ------- 1 >>> sql("select hashmd5( (select graphpowerhash(null, a,b) from table1) )=hashmd5( (select graphpowerhash(null, a,b) from table3) ) as grapheq") grapheq ------- 0 Graph with only one node: >>> sql("select graphpowerhash(null, a, null) from (select * from table1 limit 1)") graphpowerhash(null, a, null) ----------------------------- ["TOiuilAk4RLkg01tIwyvcg"] Undirected version of table1's graph: >>> sql("select graphpowerhash(null, null, a,b) from table1") graphpowerhash(null, null, a,b) ------------------------------------------------------------------------------------------------------------------------------ ["JudlYSkYV7rFHjk94abY/A","W88IN4kgDSeVX9kaY36SJg","W88IN4kgDSeVX9kaY36SJg","6ez9ee0N2ogdvKJVQ8VKWA","7gz+LT/LtsyFc+GxMUlL8g"] Same graph as above, but some of the edges have been reversed (the undirected powerhash matches the powerhash above): >>> table4(''' ... 2 1 ... 2 3 ... 3 4 ... 4 5 ... 3 5 ... 
''') >>> sql("select graphpowerhash(null, null, a,b) from table4") graphpowerhash(null, null, a,b) ------------------------------------------------------------------------------------------------------------------------------ ["JudlYSkYV7rFHjk94abY/A","W88IN4kgDSeVX9kaY36SJg","W88IN4kgDSeVX9kaY36SJg","6ez9ee0N2ogdvKJVQ8VKWA","7gz+LT/LtsyFc+GxMUlL8g"] Graph similarity, using the step parameter (value of step defines the radius of the similar subgraphs that can be found): >>> sql("select jaccard( (select graphpowerhash(3, a, b) from table1), (select graphpowerhash(3, a, b) from table3) ) as jacsim") jacsim ------ 0.0 >>> sql("select jaccard( (select graphpowerhash(1, a, b) from table1), (select graphpowerhash(1, a, b) from table3) ) as jacsim") jacsim ------ 0.25 Powerhash of graph having details (using a chemical composition): >>> table5(''' ... 1 2 O = C ... 2 3 C = O ... ''') First without details: >>> sql("select graphpowerhash(null, null, a, b) from table5") graphpowerhash(null, null, a, b) ---------------------------------------------------------------------------- ["Rw3sDN24TI7YARBNOOmYSg","9m5wcZf9iUxDwgzQkzu6Ag","9m5wcZf9iUxDwgzQkzu6Ag"] Second with all details: >>> sql("select graphpowerhash(null, null, a, b, c, d, e) from table5") graphpowerhash(null, null, a, b, c, d, e) ---------------------------------------------------------------------------- ["CPebw+eZYzw5bWgx47/tkg","CPebw+eZYzw5bWgx47/tkg","WNn4aDDBKcoMMi+nrz5JEA"] **graphtodot function** .. function:: graphtodot(graphname, [undirected_edge], node1, node2, [node1_details, edge_details, node2_details]) -> graphviz dot graph Returns the *Graphviz* DOT representation of the input graph. Examples: Directed graph: >>> table1(''' ... 1 2 ... 2 3 ... 3 4 ... 4 5 ... 5 3 ... 
''') >>> sql("select graphtodot(null, a,b) from table1") graphtodot(null, a,b) ------------------------------------------------------------------------ digraph { "1" -> "2"; "2" -> "3"; "3" -> "4"; "4" -> "5"; "5" -> "3"; } Undirected graph: >>> table2(''' ... 2 5 ... 5 4 ... 4 1 ... 1 3 ... 3 4 ... ''') >>> sql("select graphtodot(null, null, a,b) from table2") graphtodot(null, null, a,b) ---------------------------------------------------------------------- graph { "1" -- "3"; "2" -- "5"; "3" -- "4"; "4" -- "1"; "5" -- "4"; } Graph with details: >>> table5(''' ... 1 2 O = C ... 2 3 C = O ... ''') >>> sql("select graphtodot('chem_comp_1', null, a, b, c, d, e) from table5") graphtodot('chem_comp_1', null, a, b, c, d, e) ------------------------------------------------------------------------------------------------------------------------ graph chem_comp_1 { "1" [label="O"]; "1" -- "2" [label="="]; "2" [label="C"]; "2" -- "3" [label="="]; "3" [label="O"]; } **graphtotgf function** .. function:: graphtotgf(node1, node2, [node1_details, edge_details, node2_details]) -> TGF graph Returns the TGF representation of the input graph. Examples: >>> table1(''' ... 1 2 ... 2 3 ... 3 4 ... 4 5 ... 5 3 ... ''') >>> sql("select graphtotgf(a,b) from table1") # doctest: +NORMALIZE_WHITESPACE graphtotgf(a,b) ------------------------------------------ 1 2 3 4 5 # 1 2 2 3 3 4 4 5 5 3 Graph with details: >>> table5(''' ... 1 2 O = C ... 2 3 C = O ... ''') >>> sql("select graphtotgf(a, b, c, d, e) from table5") graphtotgf(a, b, c, d, e) -------------------------- 1 O 2 C 3 O # 1 2 = 2 3 = :mod:`jpacks` functions ----------------------- .. module:: functions.aggregate.jpacks **jdictgroup function** .. function:: jdictgroup(columns) Groups columns of a group into a jdict. Example: >>> table1(''' ... word1 1 ... word2 1 ... word3 2 ... word4 2 ... 
''') >>> sql("select jdictgroup(a) from table1 group by b") jdictgroup(a) --------------------------- {"word1":null,"word2":null} {"word3":null,"word4":null} >>> sql("select jdictgroup(a,b) from table1") jdictgroup(a,b) ----------------------------------------- {"word1":1,"word2":1,"word3":2,"word4":2} >>> table2(''' ... [1,2] 1 ... [3,4] 1 ... [5,6] 2 ... [7,8] 2 ... ''') >>> sql("select jdictgroup(a) from table2") jdictgroup(a) ----------------------------------------------------- {"[1,2]":null,"[3,4]":null,"[5,6]":null,"[7,8]":null} >>> sql("select jdictgroup(a,b) from table2") jdictgroup(a,b) ----------------------------------------- {"[1,2]":1,"[3,4]":1,"[5,6]":2,"[7,8]":2} **jdictgroupunion function** .. function:: jdictgroupunion(jdicts) -> jdict Calculates the union of all jdicts inside a group. The returned jdict's key values are calculated as the max length of the lists (or dictionaries) that have been found inside the individual jdicts of the group. Example: >>> table1(''' ... '{"b":1, "a":1}' ... '{"c":1, "d":[1,2,3]}' ... '{"b":{"1":2,"3":4}, "d":1}' ... ''') >>> sql("select jdictgroupunion(a) from table1") jdictgroupunion(a) ------------------------- {"b":2,"a":1,"c":1,"d":3} **jgroup function** .. function:: jgroup(columns) Groups columns of a group into a jpack. Example: >>> table1(''' ... word1 1 ... word2 1 ... word3 2 ... word4 2 ... ''') >>> sql("select jgroup(a) from table1 group by b") jgroup(a) ----------------- ["word1","word2"] ["word3","word4"] >>> sql("select jgroup(a,b) from table1") jgroup(a,b) ------------------------------------------------- [["word1",1],["word2",1],["word3",2],["word4",2]] >>> table2(''' ... [1,2] 1 ... [3,4] 1 ... [5,6] 2 ... [7,8] 2 ... 
''') >>> sql("select jgroup(a) from table2") jgroup(a) ------------------------- [[1,2],[3,4],[5,6],[7,8]] >>> sql("select jgroup(a,b) from table2") jgroup(a,b) ----------------------------------------- [[[1,2],1],[[3,4],1],[[5,6],2],[[7,8],2]] >>> sql("select jgroup(jdict('a',a,'b',b)) from table2") jgroup(jdict('a',a,'b',b)) ------------------------------------------------------------------------- [{"a":[1,2],"b":1},{"a":[3,4],"b":1},{"a":[5,6],"b":2},{"a":[7,8],"b":2}] **jgroupintersection function** .. function:: jgroupintersection(columns) -> jpack Calculates the intersection of all jpacks (by treating them as sets) inside a group. Example: >>> table1(''' ... '[1,2]' 2 ... '[2,3]' 2 ... '[2,4]' '[2,11]' ... 2 2 ... ''') >>> sql("select jgroupintersection(a,b) from table1") jgroupintersection(a,b) ----------------------- 2 >>> sql("select jgroupintersection(1)") jgroupintersection(1) --------------------- 1 **jgroupunion function** .. function:: jgroupunion(columns) -> jpack Calculates the union of the jpacks (by treating them as sets) inside a group. Example: >>> table1(''' ... '[1,2]' 6 ... '[2,3]' 7 ... '[2,4]' '[8,11]' ... 5 9 ... ''') >>> sql("select jgroupunion(a,b) from table1") jgroupunion(a,b) ---------------------- [1,2,6,3,7,4,8,11,5,9] >>> sql("select jgroupunion(1)") jgroupunion(1) -------------- 1 **jgroupunionkeys function** .. function:: jgroupunionkeys(columns) -> jpack Calculates the union of the jdict keys. Use it with care, because for performance reasons the input data are not checked at all. They should all be jdicts. Example: >>> table1(''' ... '{"1":1, "2":3}' '{"a":5}' ... '{"2":1, "3":3}' '{}' ... ''') >>> sql("select jgroupunionkeys(a,b) from table1") jgroupunionkeys(a,b) -------------------- ["1","2","a","3"] >>> sql("select jgroupunionkeys('{}')") jgroupunionkeys('{}') --------------------- [] **jgroupuniquelimit function** .. 
function:: jgroupuniquelimit(jpack, k, limit) -> jpack Returns the value of *k* at which the number of unique values gathered from the jpacks reaches *limit*. Example: >>> table1(''' ... '[1,2]' 1 ... '[2,3,4,5]' 2 ... '[2,4]' 3 ... 5 4 ... ''') >>> sql("select jgroupuniquelimit(a,b,3) from table1") jgroupuniquelimit(a,b,3) ------------------------ 2 :mod:`mining` functions ----------------------- .. module:: functions.aggregate.mining **freqitemsets function** .. function:: freqitemsets(datacol, [threshold, noautothres, stats, maxlen]) -> [itemset_id:int, itemset_length:int, itemset_frequency:int, item:text] Calculates frequent itemsets on a given column (datacol). The algorithm is tuned for the case when we have many different items (in the order of millions), many input itemsets, but small itemset length (10-20). Returned table schema: :itemset_id: Automatic itemset id :itemset_length: Length of itemset :itemset_frequency: How many times an itemset has been found :item: Itemset's item value Parameters: :datacol: Column on which to calculate frequent itemsets :threshold: Default is 2 How many times a frequent itemset must appear for it to be included in the results :noautothres: 1/0 (Default is 0) Do not calculate the threshold automatically :stats: 1/0 (Default is 0) Return frequent itemset statistics :maxlen: NUMBER (Default is no limit at all) Maximum itemset length to search Examples: >>> table1(''' ... 'car wood bike' 'first group' ... 'car car wood' 'first group' ... 'car wood' 'first group' ... 'car wood ice' 'first group' ... 'ice' 'second group' ... 'car ice' 'second group' ... 'car cream toy' 'second group' ... 'icecream ice car toy' 'second group' ... 
''') >>> sql("select b,freqitemsets(a, 'threshold:2', 'noautothres:1', 'maxlen:2') from table1 group by b") b | itemset_id | itemset_length | itemset_frequency | item --------------------------------------------------------------------- first group | 1 | 1 | 4 | wood first group | 2 | 1 | 4 | car first group | 3 | 2 | 4 | car first group | 3 | 2 | 4 | wood second group | 1 | 1 | 3 | ice second group | 2 | 1 | 3 | car second group | 3 | 1 | 2 | toy second group | 4 | 2 | 2 | car second group | 4 | 2 | 2 | ice second group | 5 | 2 | 2 | car second group | 5 | 2 | 2 | toy >>> sql("select b,freqitemsets(a, 'stats:1') from table1 group by b") b | MaxTransactionLength | CombinationCount | PassedTransactions | ValidKeywords ------------------------------------------------------------------------------------------- first group | 3 | 2 | 3 | 2 first group | 3 | 1 | 1 | 2 first group | 3 | 0 | 0 | 0 second group | 4 | 3 | 3 | 3 second group | 4 | 0 | 3 | 0 **sampledistvals function** .. function:: sampledistvals(sample_size, C1, C2, C3) -> [C1, C2, C3] Sampledistvals returns sample_size distinct values for each of the input C1..Cn columns. >>> table1(''' ... test1 2 3 ... test1 2 3 ... test2 4 2 ... test4 2 t ... ''') >>> sql("select sampledistvals(3, a, b, c) from table1") C1 | C2 | C3 --------------------------------------------- ["test1","test2","test4"] | [2,4] | [2,3,"t"] **samplegroup function** .. function:: samplegroup(sample_size, C1, C2, C3) Returns a random sample_size set of rows. >>> table1(''' ... test1 2 3 ... test1 2 3 ... test2 4 2 ... test4 2 t ... ''') >>> sql("select samplegroup(2, a, b, c) from table1") # doctest: +ELLIPSIS C1 | C2 | C3 --------------- ... >>> sql("select samplegroup(2) from (select 5 where 5=6)") # doctest: +ELLIPSIS :mod:`partialsort` functions ---------------------------- .. module:: functions.aggregate.partialsort **partialsort function** .. function:: partialsort(n,col1,col2,col3,....) 
Sorts the input rows on the first *n* of its columns. :Returned multiset schema: Columns are automatically named as col1, col2 ... Examples: >>> table1(''' ... aa 43 ... ac 34 ... ab 21 ... as 23 ... ''') >>> sql("select partialsort(1,a,b) from table1") c1 | c2 ------- aa | 43 ab | 21 ac | 34 as | 23 :mod:`selection` functions -------------------------- .. module:: functions.aggregate.selection **groupdiff function** .. function:: groupdiff(compare,values) Compares group members over the first argument (i.e. *compare*). It keeps only the rows where *compare* changes. It appends a column at the end containing how many times *compare* repeats. Examples: >>> table1(''' ... 0 a ... 0 b ... 1 c ... 1 d ... 2 e ... 3 e ... 3 f ... 3 g ... ''') >>> sql("select groupdiff(a,b) as b from table1") b1 | b2 | b3 ------------ 0 | a | 2 1 | c | 2 2 | e | 1 3 | e | 3 >>> sql("select groupdiff(a) as a from table1") a1 | a2 ------- 0 | 2 1 | 2 2 | 1 3 | 3 >>> sql("select groupdiff(b,a) as a from table1") a1 | a2 | a3 ------------ a | 0 | 1 b | 0 | 1 c | 1 | 1 d | 1 | 1 e | 2 | 2 f | 3 | 1 g | 3 | 1 **imax function** .. function:: imax(i,value) Returns the i-th max value of the group. Examples: >>> table1(''' ... 34 la ... 18 lo ... 120.0 all ... ''') >>> sql("select imax(1,a) as first from table1") first ----- 120 >>> sql("select imax(3,a) as third from table1") third ----- 18 >>> sql("select imax(2,cast( a as text)) as secstr from table1") secstr ------ 18 >>> sql("select imax(4,a) from table1") imax(4,a) --------- None >>> sql("select imax(4,a) from (select 5 as a where a!=5)") imax(4,a) --------- None **maxrow function** .. function:: maxrow(compare,value) Compares group members over the first argument (i.e. *compare*). When the maximum is located, it returns the corresponding value in the second argument (i.e. *value*). Examples: >>> table1(''' ... 34 la ... 18 lo ... 120.0 all ... 
''') >>> sql("select maxrow(a,b) as b from table1") b --- all >>> sql("select maxrow(a,a) as a from table1") a --- 120 >>> sql("select maxrow(b,a) as a from table1") a -- 18 **minrow function** .. function:: minrow(compare,value) Compares group members over the first argument (i.e. *compare*). When the minimum is located, it returns the corresponding value in the second argument (i.e. *value*). Examples: >>> table1(''' ... 34 la ... 18 lo ... 120.0 all ... ''') >>> sql("select minrow(a,b) as b from table1") b -- lo >>> sql("select minrow(a,a) as a from table1") a -- 18 .. doctest:: :hide: >>> sql("select minrow(a,a) as a from (select 5 as a where a!=5)") a ---- None **ontop function** .. function:: ontop(n,compare,value1,value2,....) -> [colname1, colname2 ...] Compares group members over the second argument (i.e. *compare*), so as to locate the top *n* members (specified in the first argument) and then returns the corresponding data under the specified columns *value1, value2, ....*. :Returned multiset schema: Columns are automatically named as *colname1 text, colname2 text...* .. seealso:: * :ref:`tutmultiset` functions Examples: >>> table1(''' ... 34 la ... 18 lo ... 120.0 all ... ''') >>> sql("select ontop(1,a,b) from table1") top1 ---- all >>> sql("select ontop(2,a,a,b) from table1") top1 | top2 ----------- 120 | all 34 | la >>> sql("select ontop(2,a,a,b,a,b) from table1") top1 | top2 | top3 | top4 ------------------------- 120 | all | 120 | all 34 | la | 34 | la >>> sql("select ontop(pk) from (select 5 as pk where pk!=5)") top1 - :mod:`skcluster` functions -------------------------- .. module:: functions.aggregate.skcluster **skcluster function** .. 
function:: skcluster(str,rowid cols) :NOTE: The present operator requires the following packages: 'numpy', 'scipy', 'sklearn' numpy & scipy: https://www.scipy.org/scipylib/download.html sklearn: http://scikit-learn.org/stable/install.html Implements the clustering algorithm initialized by *str* and returns each sample's id, its coordinates in space and its cluster label. *cols* refer to the features, except for the first one, which contains each datum's id. In case the algorithm computes centroids for the clusters, #features additional columns are returned: the coordinates of the cluster centroid to which the datum belongs. For instance, if the dataset consists of 4 columns (= 4 features), 4 more columns will be added to the result table in case centroids have been computed. Parameters: :initstr: Initialization string of the algorithm with optional parameters (from the scikit-learn API, e.g. KMeans(n_clusters=3)). :cols: Names of the input-variables Examples: A dataset with 3 columns (1 column for the id and 2 features): >>> table1(''' ... 0.52 0.40 ... 0.63 0.33 ... 0.77 0.15 ... 0.81 0.74 ... ''') >>> sql("select skcluster('AffinityPropagation()',c1,c2) from table1") --------------------------- id | label | center1 | center2 -------------- 1 | 0 | 0.63 | 0.33 2 | 0 | 0.63 | 0.33 3 | 0 | 0.63 | 0.33 4 | 1 | 0.81 | 0.74 In case the algorithm doesn't compute centroids, only the 'id' and 'label' columns appear in the result: >>> sql("select skcluster('SpectralClustering(n_clusters=2)',c1,c2) from table1") --------------------------- id | label -------------- 1 | 0 | 2 | 0 | 3 | 0 | 4 | 1 | :mod:`skdimred` functions ------------------------- .. module:: functions.aggregate.skdimred **skdimred function** .. 
function:: skdimred(initstr,cols) :NOTE: The operator requires the following packages: 'numpy', 'scipy', 'sklearn' numpy & scipy: https://www.scipy.org/scipylib/download.html sklearn: http://scikit-learn.org/stable/install.html Implements dimensionality reduction on table t (based on algorithms from the machine learning package scikit-learn). Standardization is also performed on the features, by removing the mean and scaling to unit variance. Returns: the transformed data in the new space. Parameters: :initstr: Initialization string of the algorithm with optional parameters (from the scikit-learn API, e.g. PCA(n_components=3)). Three methods are supported: PCA, SVD and TSNE. :cols: Names of the input-variables Examples: Sample from the iris dataset with 4 columns (SepalLength, SepalWidth, PetalLength, PetalWidth): >>> table1(''' ... 5.1 3.5 1.4 0.2 ... 4.9 3 1.4 0.2 ... 4.7 3.2 1.3 0.2 ... 4.6 3.1 1.5 0.2 ... 5 3.6 1.4 0.2 ... --- [0|Column names --- ... [1|SL [2|SW [3|PL [4]PW ... ''') >>> sql("select skdimred('PCA(n_components=2)',SL,SW,PL,PW) from table1;") eig1 | eig2 ------------------------------ -1.52434877924 | -0.436029188708 0.669710216202 | 0.234613817817 0.378259496001 | 1.74252845419 1.9247054679 | -1.10077422234 -1.44832640086 | -0.440338860953 :mod:`statistics` functions --------------------------- .. module:: functions.aggregate.statistics **amean function** .. function:: amean(X) -> [amean float] Computes the arithmetic mean, i.e. the average, thus providing an alternative choice to the traditional *avg* offered by SQLite. Examples: >>> table1(''' ... 1 ... 2 ... 2 ... 3 ... 'text is ignored, as well as null values' ... 'none' ... ''') >>> sql("select amean(a) from table1") amean(a) -------- 2.0 .. doctest:: :hide: >>> sql("delete from table1") >>> sql("select amean(a) from table1") amean(a) -------- None **frecency function** .. 
function:: frecency(actiondate[,points[,now:date]]) Returns a float weighted sum, assigning to each action *points* or less, depending on the distance of *actiondate* from the current date (or *now:date*). In detail, the action points decrease by 30% at a distance of 10-30 days, by 50% at 1-3 months, by 70% at 3-6 months and by 90% at greater distances. Date parameters should be in ISO 8601 format. .. _iso8601: **ISO 8601 format** : Year: YYYY (eg 1997) Year and month: YYYY-MM (eg 1997-07) Complete date: YYYY-MM-DD (eg 1997-07-16) Complete date plus hours and minutes: YYYY-MM-DD hh:mmTZD (eg 1997-07-16 19:20+01:00) Complete date plus hours, minutes and seconds: YYYY-MM-DD hh:mm:ssTZD (eg 1997-07-16 19:20:30+01:00) Complete date plus hours and minutes: YYYY-MM-DDThh:mmTZD (eg 1997-07-16T19:20+01:00) Complete date plus hours, minutes and seconds: YYYY-MM-DDThh:mm:ssTZD (eg 1997-07-16T19:20:30+01:00) Examples: >>> table1(''' ... 2009-06-01 1 ... 2009-08-28 2 ... 2009-09-17 3 ... ''') >>> sql("select frecency(a,'now:2009-09-26 04:38:30') from table1") frecency(a,'now:2009-09-26 04:38:30') ------------------------------------- 200.0 **fsum function** .. function:: fsum(X) -> json Computes the sum using fractional computation. It returns the result in JSON format. Examples: >>> table1(''' ... 1 ... 2 ... 2 ... 10 ... ''') >>> sql("select fsum(a) from table1") fsum(a) ------- [15, 1] >>> table1(''' ... 0.99999999 ... 3.99999999 ... 0.78978989 ... 1.99999999 ... ''') >>> sql("select fsum(a) from table1") fsum(a) ------------------------------------- [70164189421580937, 9007199254740992] **gmean function** .. function:: gmean(X,[m]) -> [gmean float] Computes the generalized mean (also known as the power mean or Hölder mean), which is an abstraction of the *Pythagorean means*, including the *arithmetic*, *geometric*, and *harmonic* means. It is defined for a set of *n* positive real numbers as follows: .. 
math:: gmean = \Big ( {1 \over N} \sum_{i=1}^{N} x_i ^p \Big ) ^{1/p} The (optional) second argument stands for the *p* parameter, thus determining the exact mean type: - p=2 : *Quadratic mean* (computed for both negative and positive values) - p=1 : *Arithmetic mean* - p=0 : *Geometric mean* (only for positive real numbers) - p=-1: *Harmonic mean* (only for positive real numbers) By default, i.e. in the absence of a second argument, p is set to 0, computing the geometric mean. Examples: >>> table1(''' ... 6 ... 50 ... 9 ... 1200 ... 'text is ignored, as well as None values' ... 'None' ... ''') >>> sql("select gmean(a) from table1") gmean(a) ------------- 42.4264068712 >>> table2(''' ... 34 ... 27 ... 45 ... 55 ... 22 ... 34 ... ''') >>> sql("select gmean(a,1) from table2") gmean(a,1) ------------- 36.1666666667 >>> sql("select gmean(a,0) from table2") gmean(a,0) ------------- 34.5451100372 >>> sql("select gmean(a) from table2") gmean(a) ------------- 34.5451100372 >>> sql("select gmean(a,-1) from table2") gmean(a,-1) ------------- 33.0179836512 >>> sql("select gmean(a,2) from table2") gmean(a,2) ------------- 37.8043207407 **median function** .. function:: median(X) -> [median float] Returns the median, i.e. the numeric value separating the higher half of a sample, a population, or a probability distribution, from the lower half. It is computed by arranging all the observations from the lowest value to the highest value and picking the middle one. If there is an even number of observations, then there is no single middle value, so the mean of the two middle values is used. Incoming textual values are simply ignored. Examples: >>> table1(''' ... 1 ... 3 ... 6 ... 6 ... 6 ... 6 ... 7 ... 7 ... 7 ... 7 ... 12 ... 12 ... 17 ... ''') >>> sql("select median(a) from table1") median(a) --------- 7.0 >>> table2(''' ... 1 ... 2 ... 2 ... 3 ... 3 ... 9 ... ''') >>> sql("select median(a) from table2") median(a) --------- 2.5 >>> table3(''' ... 1 ... 2 ... maria ... lala ... 
null ... 'None' ... 3 ... 9 ... ''') >>> sql("select median(a) from table3") median(a) --------- 2.5 .. doctest:: :hide: >>> sql("delete from table3") >>> sql("select median(a) from table3") median(a) --------- None **modeop function** .. function:: modeop(X) -> [ModeOpElements int/str, ModeOpValue int] Returns the mode (i.e. the value that occurs the most frequently in a data set), along with the modevalue (i.e. the maximum frequency of occurrence). When more than one mode is found in a data set (i.e. when more than one value appears with the maximum frequency), all such values are returned. For a sample from a continuous distribution, such as [0.935..., 1.211..., 2.430..., 3.668..., 3.874...], the concept of mode is unusable in its raw form, since each value will occur precisely once. Following the usual practice, data is discretized by rounding to the closest int value. For a textual sample, values are first converted to lowercase. :Returned multiset schema: Columns are automatically named as *ModeOpElements, ModeOpValue* .. seealso:: * :ref:`tutmultiset` functions Examples: >>> table1(''' ... 1 ... 3 ... 6 ... 6 ... 6 ... 6 ... 7 ... 7 ... 7 ... 7 ... 12 ... 12 ... 17 ... ''') >>> sql("select modeop(a) from table1") ModeOpElements | ModeOpValue ---------------------------- 6 | 4 7 | 4 >>> table2(''' ... 1.1235 ... 1 ... 5.1 ... 5.2 ... 5.3 ... 5.5 ... 5.6 ... 5.7 ... ''') >>> sql("select modeop(a) from table2") ModeOpElements | ModeOpValue ---------------------------- 5 | 3 6 | 3 >>> table3(''' ... leuteris ... maria ... marialena ... Meili ... meili ... ''') >>> sql("select modeop(a) from table3") ModeOpElements | ModeOpValue ---------------------------- meili | 2 .. doctest:: :hide: >>> sql("delete from table3") >>> sql("select modeop(a) from table3") ModeOpElements | ModeOpValue ---------------------------- None | None **pearson function** .. 
function:: pearson(X,Y) -> float Computes the Pearson correlation coefficient of the X and Y datasets. Examples: >>> sql("select pearson(c1,1/c1) from range(1,91)") pearson(c1,1/c1) ---------------- -0.181568259801 >>> sql("select pearson(c1,17*c1+5) from range(1,91)") pearson(c1,17*c1+5) ------------------- 1.0 >>> sql("select pearson(c1,pyfun('math.pow',2,c1)) from range(1,41)") pearson(c1,pyfun('math.pow',2,c1)) ---------------------------------- 0.456349821382 >>> sql("select pearson(a,b) from (select 1 as a, 2 as b)") pearson(a,b) ------------ 0 **rangef function** .. function:: rangef(X) -> [rangef float] Computes the numerical range for a dataset X, subtracting the minimum value from the maximum value. Textual and NULL data entries are simply ignored. Examples: >>> table1(''' ... 1 ... 3 ... 6 ... 6 ... 7 ... 12 ... 12 ... 17 ... 'text is ignored' ... 'None' ... ''') >>> sql("select rangef(a) from table1") rangef(a) --------- 16.0 .. doctest:: :hide: >>> sql("delete from table1") >>> sql("select rangef(a) from table1") rangef(a) --------- None **stdev function** .. function:: stdev(X,[type]) -> [stdev float] Computes standard deviation of a dataset X, i.e. the square root of its variance. Sample standard deviation is determined by default; population standard deviation can be determined by setting the (optional) second argument to values 'true' or 'population'. When values 'false' or 'sample' are entered for type, the default sample standard deviation computation is performed. Examples: >>> table1(''' ... 3 ... 7 ... 7 ... 19 ... 'text is ignored' ... 'none' ... ''') >>> sql("select stdev(a) from table1") stdev(a) ------------- 6.92820323028 >>> sql("select stdev(a,'population') from table1") stdev(a,'population') --------------------- 6.0 >>> sql("select stdev(a,'true') from table1") stdev(a,'true') --------------- 6.0 .. doctest:: :hide: >>> sql("delete from table1") >>> sql("select stdev(a) from table1") stdev(a) -------- None **variance function** .. 
.. function:: variance(X,[type]) -> [variance float]

    Determines the measure of the spread of the data set about the mean. The sample variance is computed by default; the population variance can be requested by setting the (optional) second argument to 'true' or 'population'. When 'false' or 'sample' is given for *type*, the default sample variance is computed.

    Examples:

    >>> table1('''
    ... 1
    ... 2
    ... 3
    ... 4
    ... 5
    ... 6
    ... 'text is ignored'
    ... 'none'
    ... ''')
    >>> sql("select variance(a) from table1")
    variance(a)
    -----------
    3.5

    >>> sql("select variance(a,'false') from table1")
    variance(a,'false')
    -------------------
    3.5

    >>> sql("select variance(a,'sample') from table1")
    variance(a,'sample')
    --------------------
    3.5

    >>> sql("select variance(a,'True') from table1")
    variance(a,'True')
    ------------------
    2.91666666667

    >>> sql("select variance(a,'Population') from table1")
    variance(a,'Population')
    ------------------------
    2.91666666667

    .. doctest::
        :hide:

        >>> sql("delete from table1")
        >>> sql("select variance(a) from table1")
        variance(a)
        -----------
        None

**wamean function**

.. function:: wamean(W,X) -> [wamean float]

    Computes the weighted arithmetic mean, i.e. the weighted average. The first column contains the weights and the second column contains the actual data values.

    .. math:: wamean = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}

    Examples:

    >>> table1('''
    ... 2 1
    ... 2 2
    ... 1 2
    ... 'text is ignored, as well as null values' 3
    ... 'none' 2
    ... 1 'text is ignored, as well as null values'
    ... 2 'none'
    ... 2 3
    ... ''')
    >>> sql("select wamean(a,b) from table1")
    wamean(a,b)
    -----------
    2.0

    .. doctest::
        :hide:

        >>> sql("delete from table1")
        >>> sql("select wamean(a) from table1")
        wamean(a)
        ---------
        None

:mod:`subgroup` functions
-------------------------

.. module:: functions.aggregate.subgroup

**condbreak function**

.. function:: condbreak(groupid, C1, C2 ,...., condition, orderby) -> [bgroupid,C1,C2....]
    Returns an expanded *groupid* and the *value1...valueN*, performing new groupings whenever *condition* is true. Rows that are grouped together are those that, when ordered by the *orderby* column, have no intermediate true values of *condition*.

    :Returned multiset schema:
        - *bgroupid*
            *groupid* appended with an integer value indicating the subgroup of the row.
        - *C1, C2 ..*
            The input values of the row.

    .. seealso::

        * :ref:`tutmultiset` functions

    >>> table1('''
    ... 1 user1 open
    ... 2 user1 read
    ... 3 user1 close
    ... 4 user1 open
    ... 5 user1 write
    ... 6 user1 close
    ... 7 user2 open
    ... 8 user2 write
    ... ''')
    >>> sql("select condbreak(b,c,c='open',a) from table1 group by b")
    bgroupid | C1
    ----------------
    user11   | open
    user11   | read
    user11   | close
    user12   | open
    user12   | write
    user12   | close
    user21   | open
    user21   | write
    >>> sql("select condbreak(b,c,c='open',a) from (select 4 as a, 6 as b, 9 as c where c!=9)")

**datediffbreak function**

.. function:: datediffbreak(groupid, C1, C2 ,...., date, maxdiff[,'order',orderbycol1,orderbycol2,...]) -> [bgroupid,C1,C2....]

    Returns an expanded *groupid* and the *value1...valueN*, performing new groupings when subsequent rows' *date* values differ by more than *maxdiff* milliseconds. Rows that are grouped together are those that, ordered by the *orderby* columns (or, if omitted, in the given order), have a *date* distance of less than *maxdiff*. Input dates should be in :ref:`ISO 8601 format `.

    :Returned multiset schema:
        - *bgroupid*
            *groupid* appended with an integer value indicating the subgroup of the row.
        - *C1, C2 ..*
            The input values of the row.

    .. seealso::

        * :ref:`tutmultiset` functions

    >>> table1('''
    ... 1 session1 '2007-01-01 00:03:13'
    ... 2 session1 '2007-01-01 00:03:27'
    ... 3 session1 '2007-01-01 00:03:36'
    ... 4 session2 '2007-01-01 00:04:39'
    ... 5 session2 '2007-01-01 00:04:40'
    ... 6 session3 '2007-01-01 00:04:49'
    ... 7 session3 '2007-01-01 00:04:59'
    ... ''')
    >>> sql("select datediffbreak(b,a,c,10*1000,'order',c,a) from table1 group by b")
    bgroupid  | C1
    --------------
    session10 | 1
    session11 | 2
    session11 | 3
    session20 | 4
    session20 | 5
    session30 | 6
    session30 | 7

    .. doctest::
        :hide:

        >>> sql("select datediffbreak(b,c,c='open',a) from (select 4 as a, 6 as b, 9 as c where c!=9)")
        bgroupid | C1
        ---------------
        None     | None
        >>> sql("select datediffbreak(b,a,c,10*1000,a,c) from table1 group by b") #doctest:+ELLIPSIS +NORMALIZE_WHITESPACE
        Traceback (most recent call last):
        ...
        OperatorError: Madis SQLError: Operator DATEDIFFBREAK: Wrong date format: 1

**datedifffilter function**

.. function:: datedifffilter(maxdiff, date, C1, C2 ....) -> [date,C1,C2....]

    Returns only a subset of the provided entries, performing a form of entry clustering based on the entries' date differences. Each cluster is represented by its latest entry. The first argument defines the time difference threshold employed for entry clustering, provided in seconds. The second argument is assumed to contain the date column. Entries are assumed to be provided in ascending order of the date column. Input dates should be in :ref:`ISO 8601 format `. All subsequent columns remain unchanged.

    :Returned multiset schema:
        - *date, C1, C2 ..*
            The selected input values of the row.

    .. seealso::

        * :ref:`tutmultiset` functions

    >>> table1('''
    ... 2010-01-01T01:32:03Z value1
    ... 2010-01-01T01:32:04Z value2
    ... 2010-01-01T01:32:06Z value3
    ... 2010-01-01T01:32:08Z value4
    ... 2010-01-01T01:32:29Z value5
    ... 2010-01-01T02:35:03Z value6
    ... 2010-01-01T02:35:04Z value7
    ... 2010-01-01T03:55:04Z value8
    ... ''')
    >>> sql("select datedifffilter(20, a,b) from table1")
    date                 | C1
    -----------------------------
    2010-01-01T01:32:08Z | value4
    2010-01-01T01:32:29Z | value5
    2010-01-01T02:35:04Z | value7
    2010-01-01T03:55:04Z | value8

    >>> table1('''
    ... 2010-01-01T01:32:03Z value1
    ... ''')
    >>> sql("select datedifffilter(20, a,b) from table1")
    date                 | C1
    -----------------------------
    2010-01-01T01:32:03Z | value1

    >>> table1('''
    ... '2010-01-01 01:32:03' value1
    ... '2010-01-01 01:32:04' value2
    ... '2010-01-01 01:32:06' value3
    ... '2010-01-01 01:32:08' value4
    ... '2010-01-01 01:32:29' value5
    ... '2010-01-01 02:35:03' value6
    ... '2010-01-01 02:35:04' value7
    ... '2010-01-01 03:55:04' value8
    ... ''')
    >>> sql("select datedifffilter(30, a,b) from table1")
    date                | C1
    ----------------------------
    2010-01-01 01:32:29 | value5
    2010-01-01 02:35:04 | value7
    2010-01-01 03:55:04 | value8

**datediffgroup function**

.. function:: datediffgroup(maxdiff, date, C1, C2 ....) -> [groupid, date,C1,C2....]

    Performs a form of entry clustering based on the entries' date differences. The cluster id assigned to each entry is returned in the first column, followed by the entry's original contents. The first argument defines the time difference threshold employed for entry clustering, provided in seconds. The second argument is assumed to contain the date column. Entries are assumed to be provided in ascending order of the date column. Input dates should be in :ref:`ISO 8601 format `. All subsequent columns remain unchanged.

    :Returned multiset schema:
        - *groupid, date, C1, C2 ..*
            The cluster id, followed by the input values of the row.

    .. seealso::

        * :ref:`tutmultiset` functions

    >>> table1('''
    ... 2010-01-01T01:32:03Z value1
    ... 2010-01-01T01:32:04Z value2
    ... 2010-01-01T01:32:06Z value3
    ... 2010-01-01T01:32:08Z value4
    ... 2010-01-01T01:32:29Z value5
    ... 2010-01-01T02:35:03Z value6
    ... 2010-01-01T02:35:04Z value7
    ... 2010-01-01T03:55:04Z value8
    ... ''')
    >>> sql("select datediffgroup(20,a,b) from table1")
    groupid | date                 | C1
    ---------------------------------------
    1       | 2010-01-01T01:32:03Z | value1
    1       | 2010-01-01T01:32:04Z | value2
    1       | 2010-01-01T01:32:06Z | value3
    1       | 2010-01-01T01:32:08Z | value4
    2       | 2010-01-01T01:32:29Z | value5
    3       | 2010-01-01T02:35:03Z | value6
    3       | 2010-01-01T02:35:04Z | value7
    4       | 2010-01-01T03:55:04Z | value8

**datediffnewsesid function**

.. function:: datediffnewsesid(maxdiff, date, groupid, C1, C2 ,....) -> [bgroupid, C1, C2, ...]

    Returns only the *C1, C2, ...* values that should be updated with *bgroupid*, so that the input groups are effectively broken at points where consecutive *date* values differ by more than *maxdiff*. Input dates should be in :ref:`ISO 8601 format `.

    :Returned multiset schema:
        - *bgroupid*
            *groupid* appended with an integer value indicating the subgroup of the row.
        - *C1, C2 ..*
            The input values of the row.

    >>> table1('''
    ... 1 session1 '2007-01-01 00:03:13'
    ... 2 session1 '2007-01-01 00:03:27'
    ... 3 session1 '2007-01-01 00:03:36'
    ... 4 session2 '2007-01-01 00:04:39'
    ... 5 session2 '2007-01-01 00:05:40'
    ... 6 session3 '2007-01-01 00:04:49'
    ... 7 session3 '2007-01-01 00:06:59'
    ... 8 session3 '2007-01-01 00:06:59'
    ... 9 session4 '2007-01-01 00:04:59'
    ... ''')
    >>> sql("select datediffnewsesid(10, c, b, a) from table1 group by b")
    bgroupid  | C1
    --------------
    session11 | 2
    session11 | 3
    session21 | 5
    session31 | 7
    session31 | 8

    .. doctest::
        :hide:

        >>> sql("select datediffnewsesid(10, c, b, a) from (select 4 as a, 6 as b, 9 as c where c!=9)")
        bgroupid | C1
        ---------------
        None     | None
        >>> table2('''
        ... 1 session1 '2007-05-01T21:10:51Z'
        ... 9 session1 '2012-01-08T18:24:32Z'
        ... ''')
        >>> sql("select datediffnewsesid(1800, c, b, a) from table2 group by b")
        bgroupid  | C1
        --------------
        session11 | 9

**groupmax function**

.. function:: groupmax(n,col1,col2,col3,....)

    Groups by the first *n* columns of the input, and returns the maximum value of the rest.

    :Returned schema:
        Columns are automatically named as *col1, col2 ...*
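The grouping performed by groupmax can be sketched in plain Python. This is a minimal illustration, not the actual madis implementation; the helper name and data layout below are assumptions for the sketch:

```python
# Sketch (not the actual madis code) of groupmax-style aggregation:
# group rows by the first n columns, then keep the column-wise
# maximum of the remaining columns for each group.
from collections import OrderedDict

def groupmax(n, rows):
    groups = OrderedDict()  # preserves first-seen group order
    for row in rows:
        key, rest = tuple(row[:n]), row[n:]
        if key not in groups:
            groups[key] = list(rest)
        else:
            # column-wise maximum over the non-key columns
            groups[key] = [max(old, new) for old, new in zip(groups[key], rest)]
    return [list(k) + v for k, v in groups.items()]

# Same data as the groupmax/groupsum examples in this section.
rows = [('aa', 't1', 43), ('ac', 't2', 34), ('aa', 't3', 12),
        ('ab', 't4', 21), ('ac', 't5', 14), ('as', 't6', 23)]
```

Note that, unlike the SQL examples, this sketch emits groups in first-seen order; the actual function makes no ordering guarantee.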
    Examples:

    >>> table1('''
    ... aa t1 43
    ... ac t2 34
    ... aa t3 12
    ... ab t4 21
    ... ac t5 14
    ... as t6 23
    ... ''')
    >>> sql("select groupmax(1,a,b,c) from table1")
    c1 | c2 | c3
    ------------
    ac | t5 | 34
    aa | t3 | 43
    ab | t4 | 21
    as | t6 | 23

**groupsum function**

.. function:: groupsum(n,col1,col2,col3,....)

    Groups by the first *n* columns of the input, and sums/jsets the rest.

    :Returned schema:
        Columns are automatically named as *col1, col2 ...*

    Examples:

    >>> table1('''
    ... aa t1 43
    ... ac t2 34
    ... aa t3 12
    ... ab t4 21
    ... ac t5 14
    ... as t6 23
    ... ''')
    >>> sql("select groupsum(1,a,b,c) from table1")
    c1 | c2          | c3
    ---------------------
    ac | ["t2","t5"] | 48
    aa | ["t1","t3"] | 55
    ab | t4          | 21
    as | t6          | 23

:mod:`text` functions
---------------------

.. module:: functions.aggregate.text

**concatgroup function**

.. function:: concatgroup(X)

    Concatenates the strings in a group/dataset X.

    Example:

    >>> table1('''
    ... word1 1
    ... word2 1
    ... word3 2
    ... word4 2
    ... ''')
    >>> sql("select concatgroup(a) from table1 group by b")
    concatgroup(a)
    --------------
    word1word2
    word3word4

**concatlines function**

.. function:: concatlines(X)

    Concatenates the strings in a group/dataset X.

    Example:

    >>> table1('''
    ... word1
    ... word2
    ... word3
    ... word4
    ... ''')
    >>> sql("select concatlines(a) from table1")
    concatlines(a)
    -----------------------
    word1 word2 word3 word4

**concatterms function**

.. function:: concatterms(text1, text2,...)

    Concatenates the strings in a group/dataset, while keeping them disjoint, i.e. separating them with a single space delimiter.

    Examples:

    >>> table1('''
    ... word1 1
    ... word2 1
    ... word3 2
    ... word4 2
    ... ''')
    >>> sql("select concatterms(a) from table1 group by b")
    concatterms(a)
    --------------
    word1 word2
    word3 word4

:mod:`util` functions
---------------------

.. module:: functions.aggregate.util

**expandgroups function**

.. function:: expandgroups(args) -> args

    Expands the contents of groups. Useful for debugging group-related queries.

    Examples:

    Directed graph:

    >>> table1('''
    ... 1 2
    ... 2 3
    ... 1 4
    ... 2 5
    ... 3 3
    ... ''')
    >>> sql("select expandgroups(a,b) from table1")
    C1 | C2
    -------
    1  | 2
    2  | 3
    1  | 4
    2  | 5
    3  | 3

    >>> sql("select expandgroups(a,b) as gr from table1")
    gr1 | gr2
    ---------
    1   | 2
    2   | 3
    1   | 4
    2   | 5
    3   | 3

    >>> sql("select a,expandgroups(b) as gr from table1 group by a")
    a | gr
    ------
    1 | 2
    1 | 4
    2 | 3
    2 | 5
    3 | 3

**showgroups function**

.. function:: showgroups(args) -> string

    Shows the contents of groups. Useful for debugging group-related queries.

    Examples:

    Directed graph:

    >>> table1('''
    ... 1 2
    ... 2 3
    ... 1 4
    ... 2 5
    ... 3 3
    ... ''')
    >>> sql("select showgroups(a,b) from table1") # doctest: +NORMALIZE_WHITESPACE
    showgroups(a,b)
    --------------------
    1 2
    2 3
    1 4
    2 5
    3 3

    >>> sql("select showgroups(b) as gr from table1 group by a")
    gr
    ----
    2 4
    3 5
    3
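As a cross-check of the date-difference clustering performed by the datedifffilter and datediffgroup functions above, here is a minimal Python sketch. It is an illustration under the same assumptions (ISO 8601 input, rows already in ascending date order), not the actual madis code; the helper names are hypothetical:

```python
# Sketch (not the actual madis code) of date-difference clustering:
# consecutive rows stay in the same cluster while the gap between
# their dates is at most maxdiff seconds; a larger gap starts a new
# cluster. datedifffilter keeps only the last entry of each cluster.
from datetime import datetime

def parse_iso(s):
    # Handles the two date formats used in the examples above.
    fmt = '%Y-%m-%dT%H:%M:%S' if 'T' in s else '%Y-%m-%d %H:%M:%S'
    return datetime.strptime(s.rstrip('Z'), fmt)

def datediffgroup(maxdiff, rows):
    groupid, out, prev = 0, [], None
    for date, *rest in rows:
        d = parse_iso(date)
        if prev is None or (d - prev).total_seconds() > maxdiff:
            groupid += 1            # gap exceeded: start a new cluster
        out.append((groupid, date, *rest))
        prev = d
    return out

def datedifffilter(maxdiff, rows):
    # Represent each cluster by its latest (last) entry.
    grouped = datediffgroup(maxdiff, rows)
    return [row[1:] for i, row in enumerate(grouped)
            if i + 1 == len(grouped) or grouped[i + 1][0] != row[0]]
```

Running `datediffgroup(20, ...)` on the eight-row example table reproduces the cluster ids 1, 1, 1, 1, 2, 3, 3, 4 shown in the datediffgroup example.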