Saturday, March 17, 2018

FW: statistics in hitlist

-----Original Message-----
From: Joel Bernstein [mailto:joelsolr@gmail.com]
Sent: 16 March 2018 23:41
To: solr-user@lucene.apache.org
Subject: Re: statistics in hitlist

With regression you're looking at how the change in one variable affects the
change in another variable. So you need to have values that are changing.
What you described is an average of field X which is not changing, regressed
against the value of X.

I think one approach to this is to regress the moving average of X against the
actual value of X. We can do this with the math library, but before
exploring the code for this, spend some time thinking about whether that's the
problem you're trying to solve. Take a look at how moving averages work:
https://en.wikipedia.org/wiki/Moving_average
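
As a rough illustration of the idea (plain Python, not Solr's math library), the sketch below smooths a series with a simple moving average and then regresses the actual values against the smoothed ones. The production numbers are made up, and the helper names `moving_average` and `linear_regression` are hypothetical, chosen only for this example.

```python
from statistics import mean


def moving_average(values, window):
    """Simple trailing moving average; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]


def linear_regression(x, y):
    """Ordinary least squares fit of y = slope * x + intercept, plus R^2."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        # A predictor that never changes (e.g. a single average repeated
        # for every row) gives the regression nothing to work with.
        raise ValueError("x is constant; slope is undefined")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # SSE: squared errors of the prediction; SST: squared deviations from the mean.
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    r_squared = 1 - sse / sst if sst else float("nan")
    return {"slope": slope, "intercept": intercept, "r_squared": r_squared}


# Hypothetical production values, for illustration only.
production = [12.0, 15.0, 11.0, 18.0, 20.0, 17.0, 22.0, 25.0]
window = 3

smoothed = moving_average(production, window)
# Drop the first window - 1 points so both arrays line up one-to-one.
actual = production[window - 1:]

result = linear_regression(smoothed, actual)
print(result)
```

Passing a column that holds one repeated average as `x` raises the `ValueError` above, which is the "values that are not changing" problem in concrete form.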





Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Mar 16, 2018 at 9:26 AM, John Smith <localdevjs@gmail.com> wrote:

> Thanks for the link to the documentation, that will probably come in
> useful.
>
> I didn't see a way though, to get my avg function working? So instead
> of doing a linear regression on two fields, X and Y, in a hitlist, we
> need to do a linear regression on field X, and the average value of X.
> Is that possible? To pass in a function to the regress function instead of
> a field?
>
>
>
>
>
> On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein <joelsolr@gmail.com>
> wrote:
>
> > I've been working on the user guide for the math expressions. Here
> > is the page on regression:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
> >
> > This page is part of the larger math expression documentation. The TOC is
> > here:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
> >
> > The docs are still very rough but you can get an idea of the coverage.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
> >
> > > If you want to get everything in one query you can do this:
> > >
> > > let(echo="d,e",
> > >     a=search(tx_prod_production,
> > >              q="oil_first_90_days_production:[1 TO *]",
> > >              fq="isParent:true",
> > >              rows="1500000",
> > >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > >              sort="id asc"),
> > >     b=col(a, oil_first_90_days_production),
> > >     c=col(a, oil_last_30_days_production),
> > >     d=regress(b, c),
> > >     e=someExpression())
> > >
> > > The echo parameter tells the let expression which variables to output.
> > >
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> > >
> > >> What does the fq clause look like?
> > >>
> > > >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith <localdevjs@gmail.com> wrote:
> > > >> > Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> > > >> > have nulls in our data; the document contains many fields, we don't always
> > > >> > have values for each field, but we can't set the nulls to 0 either (or any
> > > >> > other value, really) as that will mess up other calculations (such as when
> > > >> > calculating average etc); we would normally just ignore fields with null
> > > >> > values when calculating stats manually ourselves.
> > >> >
> > > >> > Adding a check in the "q" parameter to ensure that the fields used in the
> > > >> > calculations are > 0 does work now. Thanks for the tip (and sorry, should
> > > >> > have caught that myself). But I am unable to use "fq" for these checks,
> > > >> > they have to be added to the q instead. Adding fq's doesn't have any effect.
> > >> >
> > >> >
> > > >> > Anyway, I'm trying to change this up a little. This is what I'm currently
> > > >> > using (switched from "random" to "search" since I actually need the full
> > > >> > hitlist not just a random subset):
> > > >> >
> > > >> > let(a=search(tx_prod_production,
> > > >> >              q="oil_first_90_days_production:[1 TO *]",
> > > >> >              fq="isParent:true",
> > > >> >              rows="1500000",
> > > >> >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > > >> >              sort="id asc"),
> > > >> >     b=col(a, oil_first_90_days_production),
> > > >> >     c=col(a, oil_last_30_days_production),
> > > >> >     d=regress(b, c))
> > >> >
> > > >> > So I have 2 fields there defined, that works great (in terms of a test and
> > > >> > running the query); but I need to replace the second field,
> > > >> > "oil_last_30_days_production" with the avg value in
> > > >> > oil_first_90_days_production.
> > > >> >
> > > >> > I can get the avg with this expression:
> > > >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > > >> >       fq="isParent:true", rows="1500000",
> > > >> >       avg(oil_first_90_days_production))
> > >> >
> > > >> > But I don't know how to push that avg value into the first streaming
> > > >> > expression; guessing I have to set "c=...." but that is where I'm getting
> > > >> > lost, since avg only returns 1 value and the first parameter, "b", returns
> > > >> > a list of sorts. Somehow I have to get the avg value stuffed inside a
> > > >> > "col", where it is the same value for every row in the hitlist...?
> > >> >
> > >> > Thanks for your help!
> > >> >
> > >> >
> > > >> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
> > > >> >
> > > >> >> I suspect you've got nulls in your data. I just tested with null values and
> > > >> >> got the same error. For testing purposes try loading the data with default
> > > >> >> values of zero.
> > >> >>
> > >> >>
> > >> >> Joel Bernstein
> > >> >> http://joelsolr.blogspot.com/
> > >> >>
> > > >> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
> > > >> >>
> > > >> >> > Let's break the expression down and build it up slowly. Let's start with:
> > > >> >> >
> > > >> >> > let(echo="true",
> > > >> >> >     a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> > > >> >> >              fl="oil_first_90_days_production,oil_last_30_days_production"),
> > > >> >> >     b=col(a, oil_first_90_days_production))
> > > >> >> >
> > > >> >> >
> > > >> >> > This should return variables a and b. Let's see what the data looks like.
> > > >> >> > I changed the rows from 15000 to 15. If it all looks good we can expand the
> > > >> >> > rows and continue adding functions.
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > Joel Bernstein
> > >> >> > http://joelsolr.blogspot.com/
> > >> >> >
> > > >> >> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <localdevjs@gmail.com> wrote:
> > > >> >> >
> > > >> >> >> Thanks Joel for your help on this.
> > > >> >> >>
> > > >> >> >> What I've done so far:
> > > >> >> >> - unzip downloaded solr-7.2
> > > >> >> >> - modify the _default "managed-schema" to add the random field type and
> > > >> >> >>   the dynamic random field
> > > >> >> >> - start solr7 using "solr start -c"
> > > >> >> >> - indexed my data using pint/pdouble/boolean field types etc
> > > >> >> >>
> > > >> >> >> I can now run the random function all by itself, it returns random
> > > >> >> >> results as expected. So far so good!
> > >> >> >>
> > >> >> >> However... now trying to get the regression stuff working:
> > >> >> >>
> > > >> >> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15000",
> > > >> >> >>              fl="oil_first_90_days_production,oil_last_30_days_production"),
> > > >> >> >>     b=col(a, oil_first_90_days_production),
> > > >> >> >>     c=col(a, oil_last_30_days_production),
> > > >> >> >>     d=regress(b, c))
> > > >> >> >>
> > > >> >> >> Posted directly into solr admin UI. Run the streaming expression and I
> > > >> >> >> get this error message:
> > > >> >> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> > > >> >> >> expected but found type java.lang.String for value
> > > >> >> >> oil_first_90_days_production"
> > >> >> >>
> > > >> >> >> It thinks my numeric field is defined as a string? But when I view the
> > > >> >> >> schema, those 2 fields are defined as ints:
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> When I run a normal query and choose xml as output format, then it also
> > > >> >> >> puts "int" elements into the hitlist, so the schema appears to be correct;
> > > >> >> >> it's just when using this regress function that something goes wrong and
> > > >> >> >> solr thinks the field is string.
> > >> >> >>
> > >> >> >> Any suggestions?
> > >> >> >> Thanks!
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > > >> >> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
> > > >> >> >>
> > > >> >> >>> The field type will also need to be in the schema:
> > > >> >> >>>
> > > >> >> >>> <!-- The "RandomSortField" is not used to store or search any
> > > >> >> >>>      data. You can declare fields of this type it in your schema
> > > >> >> >>>      to generate pseudo-random orderings of your docs for sorting
> > > >> >> >>>      or function purposes. The ordering is generated based on the
> > > >> >> >>>      field name and the version of the index. As long as the index
> > > >> >> >>>      version remains unchanged, and the same field name is reused,
> > > >> >> >>>      the ordering of the docs will be consistent.
> > > >> >> >>>      If you want different psuedo-random orderings of documents,
> > > >> >> >>>      for the same version of the index, use a dynamicField and
> > > >> >> >>>      change the field name in the request.
> > > >> >> >>> -->
> > > >> >> >>>
> > > >> >> >>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> Joel Bernstein
> > >> >> >>> http://joelsolr.blogspot.com/
> > >> >> >>>
> > > >> >> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
> > > >> >> >>>
> > > >> >> >>> > You'll need to have this field in your schema:
> > > >> >> >>> >
> > > >> >> >>> > <dynamicField name="random_*" type="random" />
> > > >> >> >>> >
> > > >> >> >>> > I'll check to see if the default schema used with solr start -c has this
> > > >> >> >>> > field, if not I'll add it. Thanks for pointing this out.
> > > >> >> >>> >
> > > >> >> >>> > I checked and right now the random expression is only accepting one fq,
> > > >> >> >>> > but I consider this a bug. It should accept multiple. I'll create a ticket
> > > >> >> >>> > for getting this fixed.
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>> > Joel Bernstein
> > >> >> >>> > http://joelsolr.blogspot.com/
> > >> >> >>> >
> > > >> >> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <localdevjs@gmail.com> wrote:
> > > >> >> >>> >
> > > >> >> >>> >> Joel, thanks for the pointers to the streaming feature. I had no idea solr
> > > >> >> >>> >> had that (and also just discovered the very interesting sql feature! I will
> > > >> >> >>> >> be sure to investigate that in more detail in the future).
> > > >> >> >>> >>
> > > >> >> >>> >> However I'm having some trouble getting basic streaming functions working.
> > > >> >> >>> >> I've already figured out that I had to move to "solr cloud" instead of
> > > >> >> >>> >> "solr standalone" because I was getting errors about "cannot find zk
> > > >> >> >>> >> instance" or whatever which went away when using "solr start -c" instead.
> > > >> >> >>> >>
> > > >> >> >>> >> But now I'm trying to use the random function since that was one of the
> > > >> >> >>> >> functions used in your example.
> > > >> >> >>> >>
> > > >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> > > >> >> >>> >>
> > > >> >> >>> >> I posted that directly in the "stream" section of the solr admin UI. This
> > > >> >> >>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in case
> > > >> >> >>> >> it was a bug in one)
> > >> >> >>> >>
> > > >> >> >>> >> I get back an error message:
> > > >> >> >>> >> *sort param could not be parsed as a query, and is not a field that exists
> > > >> >> >>> >> in the index: random_-255009774*
> > > >> >> >>> >>
> > > >> >> >>> >> I'm not passing in any sort field anywhere. But the solr logs show these
> > > >> >> >>> >> three log entries:
> > >> >> >>> >>
> > >> >> >>> >> 2018-03-01 21:41:18.954 INFO (qtp257513673-21)
> > >> >> >>> >> [c:tx_header
> > >> >> s:shard1
> > >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
> o.a.s.c.S.Request
> > >> >> >>> >> [tx_header_shard1_replica_n1] webapp=/solr
> > >> >> >>> >> path=/select
> > >> >> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> > >> >> >>> >> *&sort=random_-255009774+asc*&
> rows=100&wt=javabin&version=2}
> > >> >> >>> status=400
> > >> >> >>> >> QTime=19
> > >> >> >>> >>
> > >> >> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17)
> > >> >> >>> >> [c:tx_header
> > >> >> s:shard1
> > >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
> > >> >> >>> o.a.s.c.s.i.CloudSolrClient
> > >> >> >>> >> Request to collection [tx_header] failed due to (400)
> > >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
> > >> >> RemoteSolrException:
> > >> >> >>> >> Error
> > >> >> >>> >> from server at http://192.168.13.31:8983/solr/tx_header:
> sort
> > >> param
> > >> >> >>> could
> > >> >> >>> >> not be parsed as a query, and is not a field that
> > >> >> >>> >> exists in
> > the
> > >> >> index:
> > >> >> >>> >> random_-255009774, retry? 0
> > >> >> >>> >>
> > >> >> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17)
> > >> >> >>> >> [c:tx_header
> > >> >> s:shard1
> > >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1]
> > >> >> >>> o.a.s.c.s.i.s.ExceptionStream
> > >> >> >>> >> java.io.IOException:
> > >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$
> > >> >> RemoteSolrException:
> > >> >> >>> >> Error
> > >> >> >>> >> from server at http://192.168.13.31:8983/solr/tx_header:
> sort
> > >> param
> > >> >> >>> could
> > >> >> >>> >> not be parsed as a query, and is not a field that
> > >> >> >>> >> exists in
> > the
> > >> >> index:
> > >> >> >>> >> random_-255009774
> > >> >> >>> >>
> > >> >> >>> >>
> > > >> >> >>> >> So basically it looks like solr is injecting the "sort=random_" stuff into
> > > >> >> >>> >> my query and of course that is failing on the search since that
> > > >> >> >>> >> field/column doesn't exist in my schema. Every time I run the random
> > > >> >> >>> >> function, I get a slightly different field name that it injects, but they
> > > >> >> >>> >> all start with "random_" etc.
> > > >> >> >>> >>
> > > >> >> >>> >> I have tried adding my own sort field instead, hoping solr wouldn't inject
> > > >> >> >>> >> one for me, but it still injected a random sort fieldname:
> > > >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname asc")
> > >> >> >>> >>
> > >> >> >>> >>
> > > >> >> >>> >> Assuming I can fix that whole problem, my second question is: can I add
> > > >> >> >>> >> multiple "fq=" parameters to the random function? I build a pretty
> > > >> >> >>> >> complicated query using many fq= fields, and then want to run some stats
> > > >> >> >>> >> on that hitlist; so somehow I have to pass in the query that made up the
> > > >> >> >>> >> exact hitlist to these various functions, but when I used multiple "fq="
> > > >> >> >>> >> values it only seemed to use the last one I specified and just ignored all
> > > >> >> >>> >> the previous fq's?
> > >> >> >>> >>
> > >> >> >>> >> Thanks in advance for any comments/suggestions...!
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> >>
> > >> >> >>> >>
> > > >> >> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
> > > >> >> >>> >>
> > > >> >> >>> >> > This is going to be a complex answer because Solr actually now has multiple
> > > >> >> >>> >> > ways of doing regression analysis as part of the Streaming Expression
> > > >> >> >>> >> > statistical programming library. The basic documentation is here:
> > > >> >> >>> >> >
> > > >> >> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
> > > >> >> >>> >> >
> > > >> >> >>> >> > Here is a sample expression that performs a simple linear regression in
> > > >> >> >>> >> > Solr 7.2:
> > > >> >> >>> >> >
> > > >> >> >>> >> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA, fieldB"),
> > > >> >> >>> >> >     b=col(a, fieldA),
> > > >> >> >>> >> >     c=col(a, fieldB),
> > > >> >> >>> >> >     d=regress(b, c))
> > >> >> >>> >> >
> > >> >> >>> >> >
> > > >> >> >>> >> > The expression above takes a random sample of 15000 results from
> > > >> >> >>> >> > collection1. The result set will include fieldA and fieldB in each
> > > >> >> >>> >> > record. The result set is stored in variable "a".
> > > >> >> >>> >> >
> > > >> >> >>> >> > Then the "col" function creates arrays of numbers from the results stored
> > > >> >> >>> >> > in variable a. The values in fieldA are stored in the variable "b". The
> > > >> >> >>> >> > values in fieldB are stored in variable "c".
> > > >> >> >>> >> >
> > > >> >> >>> >> > Then the regress function performs a simple linear regression on arrays
> > > >> >> >>> >> > stored in variables "b" and "c".
> > > >> >> >>> >> >
> > > >> >> >>> >> > The output of the regress function is a map containing the regression
> > > >> >> >>> >> > result. This result includes RSquared and other attributes of the
> > > >> >> >>> >> > regression model such as R (correlation), slope, y intercept etc...
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> >
> > >> >> >>> >> > Joel Bernstein
> > >> >> >>> >> > http://joelsolr.blogspot.com/
> > >> >> >>> >> >
> > > >> >> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <localdevjs@gmail.com> wrote:
> > > >> >> >>> >> >
> > > >> >> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
> > > >> >> >>> >> > > result of all this is supposed to be obtaining R^2. Is there no way of
> > > >> >> >>> >> > > obtaining this value, then (short of iterating over all the results in
> > > >> >> >>> >> > > the hitlist and calculating it myself)?
> > >> >> >>> >> > >
> > > >> >> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
> > > >> >> >>> >> > >
> > > >> >> >>> >> > > > Typically SSE is the sum of the squared errors of the prediction in a
> > > >> >> >>> >> > > > regression analysis. The stats component doesn't perform regression,
> > > >> >> >>> >> > > > although it might be a nice feature.
> > >> >> >>> >> > > >
> > >> >> >>> >> > > >
> > >> >> >>> >> > > >
> > >> >> >>> >> > > > Joel Bernstein
> > >> >> >>> >> > > > http://joelsolr.blogspot.com/
> > >> >> >>> >> > > >
> > > >> >> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <localdevjs@gmail.com> wrote:
> > > >> >> >>> >> > > >
> > > >> >> >>> >> > > > > I'm using solr, and enabling stats as per this page:
> > > >> >> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> > > >> >> >>> >> > > > >
> > > >> >> >>> >> > > > > I want to get more stat values though. Specifically I'm looking for
> > > >> >> >>> >> > > > > r-squared (coefficient of determination). This value is not present in
> > > >> >> >>> >> > > > > solr, however some of the pieces used to calculate r^2 are in the stats
> > > >> >> >>> >> > > > > element, for example:
> > >> >> >>> >> > > > >
> > > >> >> >>> >> > > > > <double name="min">0.0</double>
> > > >> >> >>> >> > > > > <double name="max">10.0</double>
> > > >> >> >>> >> > > > > <long name="count">15</long>
> > > >> >> >>> >> > > > > <long name="missing">17</long>
> > > >> >> >>> >> > > > > <double name="sum">85.0</double>
> > > >> >> >>> >> > > > > <double name="sumOfSquares">603.0</double>
> > > >> >> >>> >> > > > > <double name="mean">5.666666666666667</double>
> > > >> >> >>> >> > > > > <double name="stddev">2.943920288775949</double>
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > >
> > > >> >> >>> >> > > > > So I have the sumOfSquares available (SST), and using this calculation, I
> > > >> >> >>> >> > > > > can get R^2:
> > > >> >> >>> >> > > > >
> > > >> >> >>> >> > > > > R^2 = 1 - SSE/SST
> > > >> >> >>> >> > > > >
> > > >> >> >>> >> > > > > All I need then is SSE. Is there any way I can get SSE from those other
> > > >> >> >>> >> > > > > stats in solr?
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > Thanks in advance!
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > >
> > >> >> >>> >> > >
> > >> >> >>> >> >
> > >> >> >>> >>
> > >> >> >>> >
> > >> >> >>> >
> > >> >> >>>
> > >> >> >>
> > >> >> >>
> > >> >> >
> > >> >>
> > >>
> > >
> > >
> >
>
