Overview

Dataset statistics

Number of variables: 3
Number of observations: 208
Missing cells: 146
Missing cells (%): 23.4%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 5.0 KiB
Average record size in memory: 24.6 B
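
The percentages in this block follow from the raw counts. The sketch below recomputes them from the figures reported above; the variable names are illustrative, and the record size is only approximate because the reported total size (5.0 KiB) is itself rounded.

```python
# Figures copied from the "Dataset statistics" block of the report.
n_variables = 3
n_observations = 208
missing_cells = 146
total_size_kib = 5.0  # rounded value as shown in the report

# Every variable contributes one cell per observation.
total_cells = n_variables * n_observations

# Share of all cells that are missing, in percent.
missing_pct = 100 * missing_cells / total_cells

# Approximate bytes per row (1 KiB = 1024 bytes).
avg_record_bytes = total_size_kib * 1024 / n_observations

print(f"Total cells: {total_cells}")
print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```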

Variable types

NUM: 2
DATE: 1

Reproduction

Analysis started: 2020-08-18 00:53:39.329315
Analysis finished: 2020-08-18 00:53:42.956855
Duration: 3.63 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml
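
The reported duration is the difference between the two timestamps above. A minimal check using only the standard library (the variable names are illustrative):

```python
from datetime import datetime

# Timestamps copied from the "Reproduction" block of the report.
started = datetime.fromisoformat("2020-08-18 00:53:39.329315")
finished = datetime.fromisoformat("2020-08-18 00:53:42.956855")

# Subtracting two datetimes yields a timedelta.
duration_seconds = (finished - started).total_seconds()

print(f"Duration: {duration_seconds:.2f} seconds")
```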

Warnings

Recoveries is highly correlated with Predicted Recoveries (High correlation)
Predicted Recoveries is highly correlated with Recoveries (High correlation)
Recoveries has 146 (70.2%) missing values (Missing)
df_index has unique values (Unique)
Predicted Recoveries has unique values (Unique)

Variables

df_index
Date

UNIQUE

Distinct count: 208
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 1.8 KiB
Minimum: 2020-01-22 00:00:00
Maximum: 2020-08-16 00:00:00
Histogram

Predicted Recoveries
Real number (ℝ≥0)

HIGH CORRELATION
UNIQUE

Distinct count: 208
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 67044.86530012194
Minimum: 28.0
Maximum: 86889.08734432753
Zeros: 0
Zeros (%): 0.0%
Memory size: 1.8 KiB

Quantile statistics

Minimum: 28
5-th percentile: 215.1488657
Q1: 65623.22521
Median: 82818.89174
Q3: 84171.81443
95-th percentile: 85595.70052
Maximum: 86889.08734
Range: 86861.08734
Interquartile range (IQR): 18548.58921

Descriptive statistics

Standard deviation: 28839.30875
Coefficient of variation (CV): 0.4301494025
Kurtosis: 0.7213659654
Mean: 67044.8653
Median Absolute Deviation (MAD): 2238.00389
Skewness: -1.531728373
Sum: 13945331.98
Variance: 831705729.3
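
Several of the statistics listed for Predicted Recoveries are derived from others. The sketch below recomputes them from the rounded values printed in the report, so small differences in the last digit are expected; the variable names are illustrative.

```python
import math

# Values copied from the Predicted Recoveries statistics above (rounded).
mean = 67044.8653
std = 28839.30875
q1, q3 = 65623.22521, 84171.81443
minimum, maximum = 28.0, 86889.08734

# Coefficient of variation: standard deviation relative to the mean.
cv = std / mean

# Interquartile range: spread of the middle 50% of the values.
iqr = q3 - q1

# Range: distance between the largest and smallest value.
value_range = maximum - minimum

# Variance is the square of the standard deviation.
variance = std ** 2

print(f"CV: {cv:.7f}")
print(f"IQR: {iqr:.4f}")
print(f"Range: {value_range:.5f}")
print(f"Variance: {variance:.1f}")
```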
Histogram with fixed size bins (bins=10)
Value               Count  Frequency (%)
84444.72682         1      0.5%
83340.54212         1      0.5%
83752.91135         1      0.5%
84988.96822         1      0.5%
51297.70501         1      0.5%
23287.2702          1      0.5%
80846.36946         1      0.5%
74156.62074         1      0.5%
80693.07469         1      0.5%
84000.21038         1      0.5%
75516.55169         1      0.5%
84878.42231         1      0.5%
83425.45168         1      0.5%
82603.52691         1      0.5%
79411.77536         1      0.5%
84661.10845         1      0.5%
83794.80432         1      0.5%
80533.8115          1      0.5%
84468.19594         1      0.5%
81134.46405         1      0.5%
76294.99055         1      0.5%
86506.48809         1      0.5%
82694.21002         1      0.5%
85349.42623         1      0.5%
83567.39632         1      0.5%
Other values (183)  183    88.0%
 
Value   Count  Frequency (%)
28      1      0.5%
30      1      0.5%
36      1      0.5%
39      1      0.5%
49      1      0.5%
58      1      0.5%
101     1      0.5%
120     1      0.5%
135     1      0.5%
163.91  1      0.5%
 
Value        Count  Frequency (%)
86889.08734  1      0.5%
86761.81435  1      0.5%
86633.99392  1      0.5%
86506.48809  1      0.5%
86378.56784  1      0.5%
86249.07294  1      0.5%
86118.41177  1      0.5%
85989.80835  1      0.5%
85864.47135  1      0.5%
85742.195    1      0.5%
 

Recoveries
Real number (ℝ≥0)

HIGH CORRELATION
MISSING

Distinct count: 62
Unique (%): 100.0%
Missing: 146
Missing (%): 70.2%
Infinite: 0
Infinite (%): 0.0%
Mean: 28826.0
Minimum: 28.0
Maximum: 72814.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 1.8 KiB

Quantile statistics

Minimum: 28
5-th percentile: 39.5
Q1: 1607.5
Median: 20701.5
Q3: 56925.75
95-th percentile: 71229.45
Maximum: 72814
Range: 72786
Interquartile range (IQR): 55318.25

Descriptive statistics

Standard deviation: 27386.31517
Coefficient of variation (CV): 0.9500560318
Kurtosis: -1.522122967
Mean: 28826
Median Absolute Deviation (MAD): 20574
Skewness: 0.3736451882
Sum: 1787212
Variance: 750010258.8
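
Because Recoveries has missing values, its statistics appear to be computed over the non-missing observations only. The sketch below checks this reading against the reported figures; the variable names are illustrative, and the interpretation of "Unique (%)" as relative to non-missing values is an assumption that happens to match the numbers.

```python
# Figures copied from the report.
n_observations = 208     # rows in the dataset
n_missing = 146          # missing Recoveries values
distinct_count = 62      # distinct non-missing Recoveries values
total_sum = 1787212      # reported Sum

# Observations that actually carry a Recoveries value.
n_present = n_observations - n_missing

# The mean is the sum divided by the number of non-missing values.
mean = total_sum / n_present

# Missing share is relative to all rows.
missing_pct = 100 * n_missing / n_observations

# Assumption: unique share is relative to the non-missing rows.
unique_pct = 100 * distinct_count / n_present

print(f"Non-missing observations: {n_present}")
print(f"Mean: {mean}")
print(f"Missing (%): {missing_pct:.1f}%")
print(f"Unique (%): {unique_pct:.1f}%")
```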
Histogram with fixed size bins (bins=10)
Value              Count  Frequency (%)
67910              1      0.5%
7977               1      0.5%
61644              1      0.5%
15962              1      0.5%
60181              1      0.5%
67017              1      0.5%
3918               1      0.5%
1477               1      0.5%
52292              1      0.5%
62901              1      0.5%
18014              1      0.5%
70535              1      0.5%
55539              1      0.5%
32930              1      0.5%
57388              1      0.5%
50001              1      0.5%
22699              1      0.5%
614                1      0.5%
463                1      0.5%
275                1      0.5%
214                1      0.5%
135                1      0.5%
120                1      0.5%
101                1      0.5%
58                 1      0.5%
Other values (37)  37     17.8%
(Missing)          146    70.2%
 
Value  Count  Frequency (%)
28     1      0.5%
30     1      0.5%
36     1      0.5%
39     1      0.5%
49     1      0.5%
58     1      0.5%
101    1      0.5%
120    1      0.5%
135    1      0.5%
214    1      0.5%
 
Value  Count  Frequency (%)
72814  1      0.5%
72362  1      0.5%
71857  1      0.5%
71266  1      0.5%
70535  1      0.5%
69755  1      0.5%
68798  1      0.5%
67910  1      0.5%
67017  1      0.5%
65660  1      0.5%
 

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a linear relationship the steepness of the line does not affect the magnitude of r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
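
The calculation described above can be sketched in plain Python. This is a minimal illustration rather than the implementation used by the report; `pearson_r` is a hypothetical helper name, and the sample data are the first ten rows of Predicted Recoveries and Recoveries from the Sample section.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance and standard deviations (the 1/n factors cancel in r).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

# First ten rows of the dataset, as listed under "First rows".
predicted = [28.0, 30.0, 36.0, 39.0, 49.0, 58.0, 101.0, 120.0, 135.0, 163.91]
actual = [28.0, 30.0, 36.0, 39.0, 49.0, 58.0, 101.0, 120.0, 135.0, 214.0]

r_sample = pearson_r(predicted, actual)
print(f"Pearson's r on the first ten rows: {r_sample:.4f}")

# Shifting and rescaling one variable leaves r unchanged.
rescaled = [3 * value + 5 for value in actual]
print(math.isclose(pearson_r(predicted, rescaled), r_sample))
```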

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
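
As a sketch of that recipe, the code below ranks both variables and then applies the Pearson formula to the ranks. The helper names `average_ranks`, `pearson_r`, and `spearman_rho` are illustrative, and assigning tied values the average of their positions is an assumption about tie handling (a common convention).

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Pearson correlation of the rank variables of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# A monotonic but nonlinear relationship: y = x ** 3.
xs = [1, 2, 3, 4, 5]
ys = [v ** 3 for v in xs]
print(f"Spearman: {spearman_rho(xs, ys):.4f}")
print(f"Pearson:  {pearson_r(xs, ys):.4f}")
```

On this data the ranks of both variables are identical, so ρ reaches its maximum while r stays below it.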

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
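
The pair-counting definition above can be written directly as code. This sketch implements the simple form just described (often called tau-a), which counts tied pairs as neither concordant nor discordant; library implementations may use a tie-corrected variant instead. The function name `kendall_tau_a` is illustrative.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        # A pair is concordant when both variables move in the same direction.
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # direction == 0 means a tie in x or y; it counts toward neither.
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# One swapped pair out of six: 5 concordant, 1 discordant.
print(kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4]))
```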

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available in the Phik project documentation.

Missing values

Sample

First rows

     df_index    Predicted Recoveries  Recoveries
0    2020-01-22  28.00                 28.0
1    2020-01-23  30.00                 30.0
2    2020-01-24  36.00                 36.0
3    2020-01-25  39.00                 39.0
4    2020-01-26  49.00                 49.0
5    2020-01-27  58.00                 58.0
6    2020-01-28  101.00                101.0
7    2020-01-29  120.00                120.0
8    2020-01-30  135.00                135.0
9    2020-01-31  163.91                214.0

Last rows

     df_index    Predicted Recoveries  Recoveries
198  2020-08-07  85742.194997          NaN
199  2020-08-08  85864.471347          NaN
200  2020-08-09  85989.808353          NaN
201  2020-08-10  86118.411768          NaN
202  2020-08-11  86249.072944          NaN
203  2020-08-12  86378.567838          NaN
204  2020-08-13  86506.488090          NaN
205  2020-08-14  86633.993923          NaN
206  2020-08-15  86761.814349          NaN
207  2020-08-16  86889.087344          NaN