Working with data: statistics & visualization

#1

Analyzing Data with Elm

Evan wrote, "I think science and statistics are under-explored in @elmlang packages. I think it would be great to parse xslx and run regressions in Elm. Help scientists get interesting data online!"

The little demo app above is one contribution towards addressing this need. You can run the app online from here; the code is in the examples folder of jxxcarlson/elm-stat, the library on which the app relies. The idea (for now) is that you can upload and analyze CSV files. There is a link at the bottom left (footer) that gives you a sample file to download.

I’d like to make the library as useful as possible. For that to happen, I need to try it on as many different data sets as possible. If you could point me to your favorites, that would be wonderful, as would suggestions of what a good stats/visualization package should do. Also, if you could report cases in which the app choked on the data you gave it, that would be great. (I expect this to happen; the app and library are in a primitive state.)

My approach so far has been to rely on other libraries to the greatest extent possible, e.g., zgohr/elm-csv for parsing CSV files and terezka/line-charts for rendering the graphs. On top of these I add data-transformation glue and statistical functions, e.g., computation of the coefficients of the regression line. The elm/file library is used to upload files.
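
For readers who want the regression-line computation spelled out, here is a minimal least-squares sketch. It is written in Python purely for illustration; it is not the code in elm-stat, and the function name `regression_coefficients` is hypothetical.

```python
from typing import List, Tuple


def regression_coefficients(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of squared deviations in x, and the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For points that lie exactly on a line, the function recovers that line's slope and intercept up to rounding.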

Below is my current list of topics that need to be addressed. I am sure there are others.

1. Different kinds of data. (a) The data used in the figure above is time series data, in this case global yearly temperature anomalies from 1880 to 2016. The anomaly is the deviation from the average over some reference period. (b) We need to be able to work with other kinds of data besides time series, e.g., to generate scatter plots. Perhaps the data consists of (x,y) coordinates of outbreaks of a disease (a famous example comes from the London cholera epidemic). Etc.

2. Data format and data extraction. A piece of the temperature data is displayed below. It is a CSV file preceded by a metadata header and including column headings that describe the data. Such a file is very easy to parse: one can extract the metadata, extract the data itself, and automatically apply labels to the x and y axes. The library does this. BUT: not all files are so nicely constructed, or are in this format, so one has to have ways of dealing with them.

  Global Land and Ocean Temperature Anomalies
  January-December 1880-2016
  Units: Degrees Celsius
  Base Period: 1901-2000
  Missing: -999
  Year,Value
  1880,-0.12
  1881,-0.07
  1882,-0.08
  ...
  2014,0.75
  2015,0.91
  2016,0.95
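
The header-plus-CSV layout shown above could be handled roughly as follows. This is a Python sketch for illustration only, not the library's parser. It assumes that metadata lines contain no commas, that the first line with a comma is the column-heading row, and that the `Missing:` entry, if present, is numeric. The name `parse_with_header` is hypothetical.

```python
from typing import Dict, List, Tuple

# The first lines of the sample file shown above.
SAMPLE = """\
Global Land and Ocean Temperature Anomalies
January-December 1880-2016
Units: Degrees Celsius
Base Period: 1901-2000
Missing: -999
Year,Value
1880,-0.12
1881,-0.07
1882,-0.08
"""


def parse_with_header(text: str) -> Tuple[Dict[str, str], List[str], List[List[float]]]:
    """Split a file into (metadata, column names, numeric rows)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Assumption: the first line containing a comma is the column-heading row.
    header_index = next((i for i, ln in enumerate(lines) if "," in ln), None)
    if header_index is None:
        raise ValueError("no column-heading row found")
    # Metadata lines of the form "Key: value" become dictionary entries;
    # free-text lines such as the title are ignored in this sketch.
    metadata: Dict[str, str] = {}
    for ln in lines[:header_index]:
        if ":" in ln:
            key, value = ln.split(":", 1)
            metadata[key.strip()] = value.strip()
    columns = [c.strip() for c in lines[header_index].split(",")]
    missing = float(metadata["Missing"]) if "Missing" in metadata else None
    rows: List[List[float]] = []
    for ln in lines[header_index + 1:]:
        fields = ln.split(",")
        if len(fields) != len(columns):
            continue  # malformed row
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # non-numeric row
        if missing is not None and missing in values:
            continue  # row carries the "missing" sentinel
        rows.append(values)
    return metadata, columns, rows
```

The column names returned here are what one would use to label the x and y axes automatically.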

I’ve made a start on other file formats with the module DataParser in the latest version of the code on the repo. It parses and cleans up files that look like the one below, which has fifty lines of metadata and over 900 lines of data in 12 columns. Even for files like this, it is possible to design intelligent filters which extract the actual data for analysis. To plot the data, one extracts columns i and j as floating point numbers and feeds them to the chart function.

Source: https://climate.nasa.gov/vital-signs/sea-level/
HDR Global Mean Sea Level Data
HDR
HDR This file contains Global Mean Sea Level (GMSL) variations computed at the NASA Goddard Space Flight Center under the
..
HDR column description
HDR 1 altimeter type 0=dual-frequency  999=single frequency (ie Poseidon-1)
HDR 2 merged file cycle #
...
HDR* 12 smoothed (60-day Gaussian type filter) GMSL (GIA applied) variation (mm); annual and semi-annual signal removed )  with respect to 20-year mean
...
HDR Header_End---------------------------------------
   0  11  1993.0115260    466462 337277.00    -37.24     92.66    -37.02    -37.24     92.66    -37.02    -37.52
   0  12  1993.0386920    460889 334037.31    -40.35     95.39    -38.20    -40.34     95.39    -38.19    -38.05
 ...
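
One simple filter for a file like the one above keeps only the lines whose whitespace-separated fields all parse as numbers, then pulls out a pair of columns. The Python sketch below illustrates the idea; it is not the DataParser module, and the names `parse_numeric_rows` and `column_pair` are hypothetical.

```python
from typing import List, Tuple

# A shortened copy of the sample above: header lines followed by two data rows.
SAMPLE = """\
HDR Global Mean Sea Level Data
HDR
HDR column description
HDR 2 merged file cycle #
HDR Header_End---------------------------------------
   0  11  1993.0115260    466462 337277.00    -37.24     92.66    -37.02    -37.24     92.66    -37.02    -37.52
   0  12  1993.0386920    460889 334037.31    -40.35     95.39    -38.20    -40.34     95.39    -38.19    -38.05
"""


def parse_numeric_rows(text: str) -> List[List[float]]:
    """Keep only lines in which every whitespace-separated field is a number."""
    rows: List[List[float]] = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        try:
            rows.append([float(t) for t in tokens])
        except ValueError:
            continue  # header or descriptive line, e.g. one starting with "HDR"
    return rows


def column_pair(rows: List[List[float]], i: int, j: int) -> List[Tuple[float, float]]:
    """Extract columns i and j (0-based) as (x, y) pairs, skipping short rows."""
    return [(row[i], row[j]) for row in rows if len(row) > max(i, j)]
```

With the sample, `column_pair(parse_numeric_rows(SAMPLE), 2, 11)` pairs the third column with the twelfth. Treating the third column as the time axis is a guess made for this example, since the full column description is elided above.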

A weakness of the discussion so far is that I have only tested the library on two files, both representing time series!! I haven’t yet integrated DataParser into the app.

3. Data selection and manipulation. One feature, implemented in the library but so far not in the app, is the ability to extract an arbitrary pair of columns from the data for plotting. One should be able to do sets of these, so as to show several superimposed graphs. Another feature, implemented in the demo app, is the ability to restrict the range of the data via an inequality on one of the columns. For example, in the temperature data, it might be instructive to restrict to the years (a) 1880-1975 and (b) 1975-2016. You can do this yourself using the demo app. Do it and compare the slopes of the regression lines.
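
The range restriction described here amounts to filtering the points on one column before fitting. A minimal Python sketch follows, with hypothetical names `restrict` and `slope`, and with synthetic numbers made up for the example (not the temperature data).

```python
from typing import List, Tuple

Point = Tuple[float, float]


def restrict(points: List[Point], low: float, high: float) -> List[Point]:
    """Keep the points whose x value satisfies low <= x <= high."""
    return [(x, y) for x, y in points if low <= x <= high]


def slope(points: List[Point]) -> float:
    """Least-squares slope of y against x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic series: slope 1 up to x = 2, slope 3 afterwards.
series = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 5.0), (4.0, 8.0)]
early_slope = slope(restrict(series, 0.0, 2.0))
late_slope = slope(restrict(series, 2.0, 4.0))
```

Comparing `early_slope` with `late_slope` is the same comparison suggested above for the two year ranges.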

These are just two kinds of manipulation of the data. Perhaps smoothing is another. I would very much like to hear comments on this topic.

4. Saving output. Anyone who uses such an app would like a way to export high-quality graphs, not just take screenshots. There ought to be a good pure-Elm way of extracting the SVG output of the graph and downloading it as a file.

I look forward to your comments.

#2

For scientific data visualization, it would be really good to be able to not just read in x,y data but x, delta_x, y, delta_y or x, x_lower, x_upper, y, y_upper, y_lower style data (and variants where only x or y has deltas) for the rendering of error bars. Much scientific data is essentially useless for drawing conclusions from unless you know the estimated size of the measurement/sampling errors…
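
To make the two layouts concrete, here is one way they could be normalized into a single record before rendering. This is a Python sketch for illustration; `ErrorPoint`, `from_deltas`, and `from_bounds` are hypothetical names, not part of any existing package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorPoint:
    """A data point with explicit lower and upper bounds on both axes."""
    x: float
    x_low: float
    x_high: float
    y: float
    y_low: float
    y_high: float


def from_deltas(x: float, dx: float, y: float, dy: float) -> ErrorPoint:
    """Build a point from symmetric errors: x, delta_x, y, delta_y."""
    return ErrorPoint(x, x - dx, x + dx, y, y - dy, y + dy)


def from_bounds(x: float, x_lower: float, x_upper: float,
                y: float, y_upper: float, y_lower: float) -> ErrorPoint:
    """Build a point from explicit bounds, in the column order
    x, x_lower, x_upper, y, y_upper, y_lower."""
    return ErrorPoint(x, x_lower, x_upper, y, y_lower, y_upper)
```

Data with errors on only one axis fits the same record by passing a zero delta for the other axis, e.g. `from_deltas(x, 0.0, y, dy)`.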

#3

Good points Jess, thanks! Do you have any data that I can work with in this regard?

#4

Nice work @jxxcarlson, it’s great to see more data analysis being done in Elm. If you want to make richer interactive SVG visualizations in the future, I can’t recommend gampleman’s elm-visualization library enough; it’s really fantastic. https://package.elm-lang.org/packages/gampleman/elm-visualization/latest/

#5

Thanks very much! Indeed, I have been looking at gampleman’s elm-visualization and plan to adopt it in the very near future.

Addendum (Feb 19): I’ve started using gampleman’s elm-visualization in the app – with it I can now do scatterplots as well as line charts.

#6

Following @Jess_Bromley's suggestion above, I have added support for error bars in the new module ErrorBars. See the latest version of jxxcarlson/elm-stat (3.1). What other missing features should I be working on?

PS. Haven’t integrated error bars into the app yet.

#7

Here is a list of features in typical fitting software I have used in the past, just for reference:

  • Plot X% confidence interval
  • Provide an estimated uncertainty for the fitted parameters
  • Regarding 4. saving the output: it used to be that I could right-click on the SVG element and save it as an image (in older versions of Firefox), but that doesn’t seem to be the case anymore. It seems like the only way is to use an <img> tag for your image. I don’t know how that would work with a JavaScript-generated SVG tree.
  • For more sophisticated plots (or models) it is often useful to plot the residuals

Just some ideas! Great work as always!

#8

Thanks Salo! Very helpful suggestions. Working on confidence intervals now. The latest version does scatterplots as well as line charts. If you have any data you could share with me, that would be great (jxxcarlson@gmail.com)

#9

After jxxcarlson’s request for data with error intervals, and having recently stumbled upon some cool historical speed of light data with confidence intervals, I got caught up in making my own implementation of error bars. Perhaps it could be of use in furthering elm-stat, since I also chose to use the same tools zgohr/elm-csv, terezka/line-charts, and additionally ericgj/elm-csv-decode. (I never really figured out csv-decode, and parsed everything as strings the first time instead of using proper decoders. But the more interesting part is the chart!)

Have a look at https://github.com/MichaelLAnderson/error-bars
or just use Ellie: https://ellie-app.com/4MM5YdJX55Wa1

First post here, btw. I’ll figure out better formatting as I go…

#10

@mikela, That is very nice indeed (both the data and the esthetics of your graph)! I have taken a different approach so far, which is to compute confidence intervals from data that has multiple y-values for a given x-value. But your example shows that I should also have a way of presenting data in the form that you do, with given uncertainties.
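
For concreteness, here is one way to get intervals from data with repeated x-values. It is a Python sketch, not the elm-stat module, and the function name is hypothetical. It uses mean ± z × standard error with z = 1.96, a normal approximation that is my assumption; for small groups a t-based multiplier would be more appropriate.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Tuple


def confidence_intervals(points: List[Tuple[float, float]],
                         z: float = 1.96) -> List[Tuple[float, float, float, float]]:
    """Group y-values by x and return (x, mean, lower, upper) for each x.

    The interval is mean +/- z * (sample standard deviation / sqrt(n)).
    An x with a single y-value gets a zero-width interval.
    """
    groups: Dict[float, List[float]] = defaultdict(list)
    for x, y in points:
        groups[x].append(y)
    result = []
    for x in sorted(groups):
        ys = groups[x]
        m = mean(ys)
        half_width = z * stdev(ys) / sqrt(len(ys)) if len(ys) > 1 else 0.0
        result.append((x, m, m - half_width, m + half_width))
    return result
```

Each tuple in the output has the same shape as a point with given uncertainties, so the same error-bar rendering could serve both kinds of data.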

I’m now at version 4.01 of elm-stat, which now does both line charts and scatter plots. I’ve written a module for computing error bars from data, but haven’t integrated it into the demo app yet.

closed #11

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.