Analyzing Data with Elm
Evan wrote: "I think science and statistics are under-explored in @elmlang packages. I think it would be great to parse xslx and run regressions in Elm. Help scientists get interesting data online!"
The little demo app above is one contribution towards addressing this need. You can run the app online from here; the code is in the examples folder of jxxcarlson/elm-stat, the library on which the app relies. The idea (for now) is that you can upload and analyze CSV files. There is a link at the bottom left (footer) that gives you a sample file to download.
I’d like to make the library as useful as possible. For that to happen, I need to try it out on as many different data sets as possible. If you could point me to your favorites, that would be wonderful, as would suggestions of what a good stats/visualization package should do. Also, if you could report instances in which the app choked on the data that you gave it, that would be great. (I expect this to happen; the app and library are in a primitive state.)
My approach so far has been to rely on other libraries to the greatest extent possible, e.g., zgohr/elm-csv for parsing CSV files and terezka/line-charts for rendering the graphs. To these I add data-transformation glue and statistical functions, e.g., computation of the coefficients for the regression line. The elm/file library is used to upload files.
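As a rough illustration of the kind of statistical function involved, here is a minimal sketch of a least-squares regression line written with Elm's core library only. It is not taken from elm-stat; the names `Point`, `mean`, and `regressionLine` are hypothetical and chosen just for this example.

```elm
module Regression exposing (Point, regressionLine)

-- Hypothetical sketch: not the elm-stat implementation.


type alias Point =
    ( Float, Float )


mean : List Float -> Float
mean xs =
    List.sum xs / toFloat (List.length xs)


{-| Least-squares slope and intercept of y against x.
Returns Nothing when there are fewer than two points or when
all x values are equal (the slope is then undefined).
-}
regressionLine : List Point -> Maybe { slope : Float, intercept : Float }
regressionLine points =
    let
        xs =
            List.map Tuple.first points

        ys =
            List.map Tuple.second points

        xBar =
            mean xs

        yBar =
            mean ys

        -- Sum of (x - xBar)(y - yBar)
        sxy =
            List.sum (List.map (\( x, y ) -> (x - xBar) * (y - yBar)) points)

        -- Sum of (x - xBar)^2
        sxx =
            List.sum (List.map (\x -> (x - xBar) ^ 2) xs)
    in
    if List.length points < 2 || sxx == 0 then
        Nothing

    else
        Just
            { slope = sxy / sxx
            , intercept = yBar - (sxy / sxx) * xBar
            }
```

For example, `regressionLine [ ( 0, 1 ), ( 1, 3 ), ( 2, 5 ) ]` evaluates to `Just { slope = 2, intercept = 1 }`, since those three points lie exactly on the line y = 2x + 1.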
Below is my current list of topics that need to be addressed. I am sure there are others.
1. Different kinds of data. (a) The data used in the figure above is time series data, in this case global yearly temperature anomalies from 1880 to 2016. The anomaly is the deviation from the average in some reference period. (b) We need to be able to work with other kinds of data besides time series, e.g., to generate scatter plots. Perhaps the data consists of (x,y) coordinates of outbreaks of a disease (a famous example comes from the London cholera epidemic).
2. Data format and data extraction. A piece of the temperature file is displayed below. It is a CSV file preceded by a metadata header and including column headings which describe the data. Such a file is very easy to handle: parse it, extract the metadata, extract the data itself, and automatically apply labels to the x and y axes. The library does this. BUT: not all files are so nicely constructed, or are in this format, so one has to have ways of dealing with them.
Global Land and Ocean Temperature Anomalies
January-December 1880-2016
Units: Degrees Celsius
Base Period: 1901-2000
Missing: -999
Year,Value
1880,-0.12
1881,-0.07
1882,-0.08
...
2014,0.75
2015,0.91
2016,0.95
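To make the extraction step concrete, here is a minimal sketch of how a file shaped like this one could be turned into (year, value) pairs. It is an assumption-laden illustration, not the library's code: it uses only core `String` and `List` functions rather than zgohr/elm-csv, the names `parseRow`, `missingValue`, and `parseTemperatures` are hypothetical, and it assumes exactly two comma-separated columns and a `Missing:` line in the header.

```elm
module TemperatureFile exposing (parseTemperatures)

-- Hypothetical sketch: not the elm-stat implementation.


{-| Parse one "x,y" line. Metadata lines and the column-heading
line fail to parse as two floats and yield Nothing.
-}
parseRow : String -> Maybe ( Float, Float )
parseRow line =
    case List.map String.trim (String.split "," line) of
        [ a, b ] ->
            Maybe.map2 Tuple.pair (String.toFloat a) (String.toFloat b)

        _ ->
            Nothing


{-| Read the sentinel from a header line such as "Missing: -999". -}
missingValue : List String -> Maybe Float
missingValue lines =
    lines
        |> List.filter (String.startsWith "Missing:")
        |> List.head
        |> Maybe.map (String.dropLeft (String.length "Missing:"))
        |> Maybe.andThen (\s -> String.toFloat (String.trim s))


{-| Keep the numeric rows and drop those whose value equals
the missing-data sentinel, if the header declares one.
-}
parseTemperatures : String -> List ( Float, Float )
parseTemperatures text =
    let
        lines =
            String.lines text

        missing =
            missingValue lines
    in
    lines
        |> List.filterMap parseRow
        |> List.filter (\( _, value ) -> Just value /= missing)
```

Skipping every line that does not parse as two numbers is a deliberately blunt choice: it copes with the header without counting its lines, but it would also silently discard malformed data rows, so a real parser would probably want to report them.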
I’ve made a start on other file formats with the module DataParser in the latest version of the code on the repo. It parses and cleans up files that look like the one below, which has fifty lines of metadata and over 900 lines of data in 12 columns. Even so, it is possible to design intelligent filters which extract the actual data for analysis. To plot it, one extracts columns i and j as floating-point numbers and feeds them to the chart function.
Source: https://climate.nasa.gov/vital-signs/sea-level/
HDR Global Mean Sea Level Data
HDR
HDR This file contains Global Mean Sea Level (GMSL) variations computed at the NASA Goddard Space Flight Center under the
..
HDR column description
HDR 1 altimeter type 0=dual-frequency 999=single frequency (ie Poseidon-1)
HDR 2 merged file cycle #
...
HDR* 12 smoothed (60-day Gaussian type filter) GMSL (GIA applied) variation (mm); annual and semi-annual signal removed ) with respect to 20-year mean
...
HDR Header_End---------------------------------------
0 11 1993.0115260 466462 337277.00 -37.24 92.66 -37.02 -37.24 92.66 -37.02 -37.52
0 12 1993.0386920 460889 334037.31 -40.35 95.39 -38.20 -40.34 95.39 -38.19 -38.05
...
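A filter for a file like this one might look roughly as follows. This is only a sketch under stated assumptions, not the actual DataParser code: it assumes that every metadata line starts with `HDR`, that data fields are separated by whitespace, and that columns are counted from zero. The names `column` and `extractColumns` are hypothetical.

```elm
module ColumnExtract exposing (extractColumns)

-- Hypothetical sketch: not the DataParser implementation.


{-| The i-th field (zero-based) of a row, as a Float, if it
exists and is numeric.
-}
column : Int -> List String -> Maybe Float
column i fields =
    fields
        |> List.drop i
        |> List.head
        |> Maybe.andThen String.toFloat


{-| Drop header lines, split each remaining line on whitespace,
and keep the rows in which both requested columns are numeric.
-}
extractColumns : Int -> Int -> String -> List ( Float, Float )
extractColumns i j text =
    text
        |> String.lines
        |> List.filter (\line -> not (String.startsWith "HDR" line))
        |> List.map String.words
        |> List.filterMap
            (\fields ->
                Maybe.map2 Tuple.pair (column i fields) (column j fields)
            )
```

With the rows shown above, `extractColumns 2 5` would pair the third field (the decimal year, e.g. 1993.0115260) with the sixth field (e.g. -37.24). Lines that are neither header nor data, such as the `Source:` line, are dropped because they lack numeric fields in those positions.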
A weakness of the discussion so far is that I have only tested the library on two files, both representing time series!! I haven’t yet integrated DataParser into the app.
3. Data selection and manipulation. One feature, implemented in the library but so far not in the app, is the ability to extract an arbitrary pair of columns from the data for plotting. One should be able to do sets of these, so as to show several superimposed graphs. Another feature, implemented in the demo app, is the ability to restrict the range of the data via an inequality on one of the columns. For example, in the temperature data, it might be instructive to restrict to the years (a) 1880-1975 and (b) 1975-2016. You can do this yourself using the demo app. Do it and compare the slopes of the regression lines.
These are just two kinds of manipulation of the data. Perhaps smoothing is another. I would very much like to hear comments on this topic.
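Restricting the data by an inequality on one column comes down to a filter over the rows. Here is a minimal sketch, assuming the data has already been parsed into rows of floats; the name `restrict` and the row representation are hypothetical and not the app's actual code.

```elm
module Restrict exposing (restrict)

-- Hypothetical sketch: not the demo app's implementation.


{-| Keep the rows whose value in column `col` (zero-based) lies
in the closed interval [lo, hi]. Rows that are too short to
have that column are dropped.
-}
restrict : Int -> Float -> Float -> List (List Float) -> List (List Float)
restrict col lo hi rows =
    let
        inRange row =
            case List.head (List.drop col row) of
                Just v ->
                    lo <= v && v <= hi

                Nothing ->
                    False
    in
    List.filter inRange rows
```

For the temperature data, `restrict 0 1880 1975 rows` and `restrict 0 1975 2016 rows` would give the two subsets mentioned above, each of which can then be passed to the regression computation separately.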
4. Saving output. Anyone who uses such an app would like a way to export high-quality graphs, not just take screenshots. There ought to be a good pure-Elm way of extracting the SVG output of the graph and downloading it as a file.
I look forward to your comments.