Overview

Dataset statistics

Number of variables: 3
Number of observations: 208
Missing cells: 146
Missing cells (%): 23.4%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 5.0 KiB
Average record size in memory: 24.6 B
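The derived figures above can be cross-checked from the raw counts. The sketch below is only an illustration: it assumes the missing-cell percentage is taken over all cells (variables × observations) and that the rounded "5.0 KiB" corresponds to exactly 5120 bytes, neither of which the report states explicitly.

```python
# Raw counts copied from the dataset statistics above.
n_variables = 3
n_observations = 208
missing_cells = 146

# Assumption: the percentage is relative to every cell in the table.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

# Assumption: "5.0 KiB" is exactly 5 * 1024 bytes before rounding.
total_bytes = 5 * 1024
avg_record_bytes = total_bytes / n_observations

print(f"Missing cells: {missing_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```

Under those assumptions both values round to the figures shown in the report.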

Variable types

NUM: 2
DATE: 1

Reproduction

Analysis started: 2020-08-18 00:53:46.889499
Analysis finished: 2020-08-18 00:53:50.186742
Duration: 3.3 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Missing: Recoveries has 146 (70.2%) missing values
Unique: df_index has unique values
Zeros: Predicted Recoveries has 9 (4.3%) zeros
Zeros: Recoveries has 23 (11.1%) zeros

Variables

df_index
Date

UNIQUE

Distinct count: 208
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 1.8 KiB
Minimum: 2020-01-22 00:00:00
Maximum: 2020-08-16 00:00:00
Histogram

Predicted Recoveries
Real number (ℝ≥0)

ZEROS

Distinct count: 200
Unique (%): 96.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1059029.6129939533
Minimum: 0.0
Maximum: 4164682.9299011193
Zeros: 9
Zeros (%): 4.3%
Memory size: 1.8 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0.18079005
Q1: 34.41636553
Median: 596634.8526
Q3: 1852301.542
95-th percentile: 3548071.754
Maximum: 4164682.93
Range: 4164682.93
Interquartile range (IQR): 1852267.126

Descriptive statistics

Standard deviation: 1204392.14
Coefficient of variation (CV): 1.137260116
Kurtosis: -0.2883747944
Mean: 1059029.613
Median Absolute Deviation (MAD): 596630.5754
Skewness: 0.9170950852
Sum: 220278159.5
Variance: 1.450560427e+12
Histogram with fixed size bins (bins=10)
Value               Count  Frequency (%)
0                   9      4.3%
1036824.947         1      0.5%
454.2400848         1      0.5%
167.1978957         1      0.5%
161563.1322         1      0.5%
9.77493558          1      0.5%
2430537.93          1      0.5%
12.211249           1      0.5%
5.788534164         1      0.5%
1425749.206         1      0.5%
41325.44024         1      0.5%
1061974.901         1      0.5%
1803096.364         1      0.5%
504022.112          1      0.5%
787.4692494         1      0.5%
10.14069009         1      0.5%
1309058.45          1      0.5%
3509548.854         1      0.5%
3689515.546         1      0.5%
1846762.308         1      0.5%
3930409.207         1      0.5%
2152336.101         1      0.5%
1087297.548         1      0.5%
13093.04521         1      0.5%
2392486.603         1      0.5%
Other values (175)  175    84.1%
 
Value               Count  Frequency (%)
0                   9      4.3%
0.07                1      0.5%
0.1351              1      0.5%
0.265643            1      0.5%
0.38704799          1      0.5%
0.7099546307        1      0.5%
1.010257807         1      0.5%
1.28953976          1      0.5%
1.549271977         1      0.5%
1.790822939         1      0.5%
 
Value               Count  Frequency (%)
4164682.93          1      0.5%
4106193.634         1      0.5%
4047680.532         1      0.5%
3989256.163         1      0.5%
3930409.207         1      0.5%
3871463.879         1      0.5%
3811496.601         1      0.5%
3750596.947         1      0.5%
3689515.546         1      0.5%
3628886.554         1      0.5%
 

Recoveries
Real number (ℝ≥0)

MISSING
ZEROS

Distinct count: 9
Unique (%): 14.5%
Missing: 146
Missing (%): 70.2%
Infinite: 0
Infinite (%): 0.0%
Mean: 6.887096774193548
Minimum: 0.0
Maximum: 178.0
Zeros: 23
Zeros (%): 11.1%
Memory size: 1.8 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 3
Q3: 7
95-th percentile: 12
Maximum: 178
Range: 178
Interquartile range (IQR): 7

Descriptive statistics

Standard deviation: 22.49962101
Coefficient of variation (CV): 3.266923894
Kurtosis: 57.36243821
Mean: 6.887096774
Median Absolute Deviation (MAD): 3
Skewness: 7.442931096
Sum: 427
Variance: 506.2329455
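Because Recoveries has only nine distinct values, its descriptive statistics can be recomputed from the value counts reported for this variable. The following is a minimal sketch using only the Python standard library; the dictionary is transcribed by hand from this report, and the use of the sample (n − 1) variance is an assumption that happens to agree with the figures shown.

```python
import statistics

# Value -> count for the 62 non-missing Recoveries observations,
# transcribed from the value-count table in this report.
value_counts = {0: 23, 3: 12, 5: 4, 6: 3, 7: 11, 8: 2, 12: 4, 17: 2, 178: 1}

# Expand the counts back into a flat list of observations.
observations = [value for value, count in value_counts.items()
                for _ in range(count)]

n = len(observations)
total = sum(observations)
mean = statistics.mean(observations)
median = statistics.median(observations)
variance = statistics.variance(observations)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(observations)      # square root of the sample variance
cv = std_dev / mean                           # coefficient of variation

print(n, total, median)
print(round(mean, 9), round(variance, 7), round(std_dev, 8), round(cv, 9))
```

The single observation of 178 accounts for most of the variance, which is consistent with the large skewness and kurtosis reported above.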
Histogram with fixed size bins (bins=10)
Value      Count  Frequency (%)
0          23     11.1%
3          12     5.8%
7          11     5.3%
12         4      1.9%
5          4      1.9%
6          3      1.4%
17         2      1.0%
8          2      1.0%
178        1      0.5%
(Missing)  146    70.2%
 
Value      Count  Frequency (%)
0          23     11.1%
3          12     5.8%
5          4      1.9%
6          3      1.4%
7          11     5.3%
8          2      1.0%
12         4      1.9%
17         2      1.0%
178        1      0.5%
 
Value      Count  Frequency (%)
178        1      0.5%
17         2      1.0%
12         4      1.9%
8          2      1.0%
7          11     5.3%
6          3      1.4%
5          4      1.9%
3          12     5.8%
0          23     11.1%
 

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) measures the linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, so for an exact linear relationship the steepness of the line (its angle to the x-axis) does not affect the magnitude of r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
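The calculation described above can be sketched in a few lines of plain Python. The function name `pearson_r` and the toy data are illustrative only and are not taken from the profiled dataset; the sample (n − 1) normalisation is one possible choice and cancels out in the ratio.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # A constant input has zero standard deviation, so r is undefined there.
    return cov / (sd_x * sd_y)

# Hypothetical toy data, roughly linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

r = pearson_r(xs, ys)
# Shifting and positively rescaling x leaves r unchanged.
r_rescaled = pearson_r([3.0 * x + 10.0 for x in xs], ys)
print(r, r_rescaled)
```

Rescaling by a negative factor would flip the sign of r but keep its magnitude.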

Spearman's ρ

Spearman's rank correlation coefficient (ρ) measures the monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
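As a rough illustration of the rank-based calculation, the sketch below ranks each variable (ties receive the average of their positions, a common convention assumed here) and then applies the covariance-over-standard-deviations formula to the ranks. All names and data are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

def spearman_rho(xs, ys):
    """Pearson's r computed on the rank variables."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical monotonic but nonlinear relationship: y = x ** 3.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(spearman_rho(xs, ys), pearson_r(xs, ys))
```

On this toy data ρ reaches its maximum because the relationship is perfectly monotonic, while Pearson's r stays below 1 because it is not linear.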

Kendall's τ

Like Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
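The pair-counting definition above corresponds to what is usually called tau-a. The sketch below implements exactly that definition on hypothetical data; it makes no adjustment for ties, whereas statistical libraries often report a tie-corrected variant such as tau-b, so results may differ when ties are present.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """(concordant - discordant) / total number of pairs, with no tie correction."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move in the same direction
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
        # direction == 0 means a tie in x or y; it counts toward neither.
    n = len(xs)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Hypothetical data: one swapped pair out of six.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
tau = kendall_tau_a(xs, ys)
print(tau)
```

Here five pairs are concordant and one is discordant, giving (5 − 1) / 6.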

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

Sample

First rows

     df_index    Predicted Recoveries  Recoveries
0    2020-01-22  0.00                  0.0
1    2020-01-23  0.00                  0.0
2    2020-01-24  0.00                  0.0
3    2020-01-25  0.00                  0.0
4    2020-01-26  0.00                  0.0
5    2020-01-27  0.00                  0.0
6    2020-01-28  0.00                  0.0
7    2020-01-29  0.00                  0.0
8    2020-01-30  0.00                  0.0
9    2020-01-31  0.07                  0.0

Last rows

     df_index    Predicted Recoveries  Recoveries
198  2020-08-07  3.628887e+06          NaN
199  2020-08-08  3.689516e+06          NaN
200  2020-08-09  3.750597e+06          NaN
201  2020-08-10  3.811497e+06          NaN
202  2020-08-11  3.871464e+06          NaN
203  2020-08-12  3.930409e+06          NaN
204  2020-08-13  3.989256e+06          NaN
205  2020-08-14  4.047681e+06          NaN
206  2020-08-15  4.106194e+06          NaN
207  2020-08-16  4.164683e+06          NaN