Statistics in Trade: Part 1

Different Types of Datasets:

In economics, we mainly deal with four categories of data:

1. Time Series Data: X(t)

Data that change over time, like a stock price, which changes frequently: every minute, hour, day, and so on.

2. Cross-Sectional Data: X(i)

These data sets are taken at a specific point in time, like the unemployment rate in the year 2021, observed across many units i. The data are obtained by random sampling.

3. Pooled Cross-sectional:

The data have both cross-sectional and time-series features: separate random samples are drawn at different points in time.

4. Panel or Longitudinal Data:

In this category, the same cross-section is observed at different times.
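As a rough illustration of how the four categories differ in shape, here is a minimal sketch using plain Python structures. All unit names and numbers are made up for the example, and the helper functions `units_by_year` and `is_panel` are hypothetical names, not part of any library.

```python
# Made-up illustrative numbers; all unit names and values are hypothetical.

# 1. Time series X(t): one variable observed at successive times.
time_series = {"09:00": 101.2, "09:15": 101.9, "09:30": 100.7}

# 2. Cross-section X(i): many units observed at one point in time.
cross_section = {"unit_A": 5.1, "unit_B": 7.3, "unit_C": 4.4}

# 3. Pooled cross-section: a new random sample is drawn in each period,
#    so the units in 2020 and 2021 are generally different.
pooled = [
    (2020, "unit_A", 5.0), (2020, "unit_B", 7.1),
    (2021, "unit_C", 4.4), (2021, "unit_D", 6.2),
]

# 4. Panel: the same units are followed through every period.
panel = [
    (2020, "unit_A", 5.0), (2020, "unit_B", 7.1),
    (2021, "unit_A", 5.1), (2021, "unit_B", 7.3),
]


def units_by_year(rows):
    """Group the unit names of (year, unit, value) rows by year."""
    grouped = {}
    for year, unit, _value in rows:
        grouped.setdefault(year, set()).add(unit)
    return grouped


def is_panel(rows):
    """True when every year contains exactly the same set of units."""
    unit_sets = list(units_by_year(rows).values())
    return all(units == unit_sets[0] for units in unit_sets)


print("pooled is a panel:", is_panel(pooled))
print("panel is a panel:", is_panel(panel))
```

The only structural difference between the last two categories is whether the same units reappear in every period, which is what `is_panel` checks.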

Sample and Population:

Covariance:

It measures the direction of the linear relationship between two numerical variables (x, y). Its magnitude depends on the units of the variables, so it is not a standardized measure of strength, and it says nothing about a causal effect.

Sample Covariance

Cov(X, Y) > 0 --> the two variables tend to move in the same direction.

Cov(X, Y) = 0 --> no linear relationship (this alone does not guarantee independence).

Cov(X, Y) < 0 --> one variable tends to move in the opposite direction of the other.

However, covariance tells us nothing about causation.
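The three cases can be checked with a small sketch. It assumes the usual sample covariance with an (n - 1) denominator; the function name `sample_covariance` and the toy numbers are chosen only for illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves with x
falling = [10, 8, 6, 4, 2]   # moves against x
curved = [4, 1, 0, 1, 4]     # equals (x - 3)**2, fully determined by x

print(sample_covariance(x, rising))   # positive
print(sample_covariance(x, falling))  # negative
print(sample_covariance(x, curved))   # zero, although y depends on x
```

The last series is completely determined by x and still has zero covariance, which shows that zero covariance only rules out a linear relationship.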

Figure 2. Positive, negative, and zero covariance.

Coefficient of correlation:

To find a correlation between the stock price and trade, we can use the coefficient of correlation. The presence of correlation alone does not show a causal relationship between the two data series.

It answers the question: how closely are the two variables changing together?

Figure 1. The correlation coefficient shown for different data samples.
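A minimal sketch of the Pearson correlation coefficient, written out by hand so that each step is visible. The function name `pearson_r` and the numbers are illustrative assumptions. The coefficient is the covariance divided by the product of the two standard deviations, so it always lies between -1 and 1 and does not depend on the units.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
y_rescaled = [100 * value for value in y]  # same data in different units

print(pearson_r(x, y))
print(pearson_r(x, y_rescaled))  # unchanged by the change of units
```

Unlike the covariance, the result stays the same when one series is rescaled, which is why the coefficient is easier to compare across data sets.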

Causation can produce correlation, but causation is only one of the possible reasons for it.

Econometrics investigates by how much Y changes if X changes. In econometrics, we are dealing with finding the reasons.

On the website tylervigen.com you can find different examples of correlated variables that have no causal link.

Figure 1: Not all correlated variables are causally related to each other.

Example:

NORM.INV(probability, mean, standard_dev)

This is the Excel function used for generating normally distributed data.

Note: The correlation coefficient may be misleading with a small number of data points. To check whether a correlation can be trusted, we should consider its statistical significance. If we use random probabilities to make two different sets, the two sets are independent and their correlation should be close to 0, although a small sample can show a noticeably nonzero value by chance.
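The note can be tried out with a short simulation. This is a sketch under assumptions: `norm_inv` is a hypothetical helper that mimics the Excel call by using `NormalDist.inv_cdf` from the Python standard library, and the seed, sample sizes, mean and standard deviation are arbitrary choices.

```python
import math
import random
from statistics import NormalDist


def norm_inv(probability, mean, standard_dev):
    """Inverse normal CDF, analogous to Excel's NORM.INV."""
    return NormalDist(mean, standard_dev).inv_cdf(probability)


def random_normal_sample(n, mean, standard_dev, rng):
    """Draw n normal values by feeding uniform probabilities to norm_inv."""
    sample = []
    while len(sample) < n:
        p = rng.random()
        if p > 0.0:  # inv_cdf is not defined at exactly 0
            sample.append(norm_inv(p, mean, standard_dev))
    return sample


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Two independent sets with only 10 points each.
small_r = pearson_r(random_normal_sample(10, 100.0, 15.0, rng),
                    random_normal_sample(10, 100.0, 15.0, rng))

# Two independent sets with 5000 points each.
large_r = pearson_r(random_normal_sample(5000, 100.0, 15.0, rng),
                    random_normal_sample(5000, 100.0, 15.0, rng))

print("r with 10 points:  ", round(small_r, 3))
print("r with 5000 points:", round(large_r, 3))
```

With only 10 points the coefficient can land far from zero purely by chance, while with thousands of points it stays close to zero, which is why the significance of the result has to be checked for small samples.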

Probability Distribution:

A statistical distribution describes how often each possible outcome occurs in a sample; a probability distribution describes how likely each outcome is.

A normal distribution is a continuous distribution in which values are most likely to be near the mean and much less likely to be far from it.

Figure 2. Normal Distribution

In a normal distribution, about 68% of the data lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
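The share of a normal distribution that lies within a given number of standard deviations of the mean can be computed with the standard library. The function name `share_within` is an illustrative choice.

```python
from statistics import NormalDist


def share_within(k):
    """Probability that a normal value lies within k standard deviations."""
    standard_normal = NormalDist(0.0, 1.0)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The three values come out at roughly 0.68, 0.95 and 0.997, regardless of the mean and standard deviation of the original data.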

Normalization:

To make the data easier to work with, we can normalize data sets that follow a normal distribution. By normalizing, we shift the distribution curve so that the mean of the new curve becomes 0, and squeeze or stretch the data so that the standard deviation becomes 1. The values on the horizontal axis are then called z-scores instead of x. The interesting thing about the z-score is that it expresses the distance of a data point from the mean as a multiple of the standard deviation.

In normalization, we make the mean equal to zero and SD = 1.

We also change the X axis to a new variable, giving us a Z axis.

Z = (x - X̄) / SD, where X̄ is the mean of the data.
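A minimal sketch of the normalization step: subtract the mean from every value, then divide by the standard deviation. The price list is made up, and using the population standard deviation (`pstdev`) rather than the sample version is an assumption made for this example.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw values to z-scores: (value - mean) / standard deviation."""
    center = mean(values)
    spread = pstdev(values)  # population SD; an assumption for this sketch
    if spread == 0:
        raise ValueError("all values are identical, z-scores are undefined")
    return [(value - center) / spread for value in values]


prices = [10.0, 12.0, 11.0, 13.0, 14.0]  # hypothetical data
normalized = z_scores(prices)

print([round(z, 3) for z in normalized])
print("mean:", round(mean(normalized), 6), "SD:", round(pstdev(normalized), 6))
```

After the conversion, a value of 1.5 reads directly as "one and a half standard deviations above the mean".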

T-Distribution:

In probability and statistics, Student’s t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arise when estimating the mean of a normally-distributed population in situations where the sample size is small and the population’s standard deviation is unknown.

The t-distribution plays a role in a number of widely used statistical analyses, including Student’s t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and in linear regression analysis. The Student’s t-distribution also arises in the Bayesian analysis of data from a normal family.

If we take a sample of observations from a normal distribution, then the t-distribution with (n-1) degrees of freedom can be defined as the distribution of the location of the sample mean relative to the true mean, divided by the sample standard deviation, after multiplying by the standardizing term Sqrt(n). In this way, the t-distribution can be used to construct a confidence interval for the true mean.

The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean. This makes it useful for understanding the statistical behavior of certain types of ratios of random quantities, in which variation in the denominator is amplified and may produce outlying values when the denominator of the ratio falls close to zero. The Student’s t-distribution is a special case of the generalized hyperbolic distribution.
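The definition above (sample mean minus true mean, divided by the sample standard deviation, multiplied by Sqrt(n)) can be written as a short sketch. The sample values and the hypothesized true mean of 4.0 are made up, and `t_statistic` is an illustrative name.

```python
import math
from statistics import mean, stdev


def t_statistic(sample, true_mean):
    """(sample mean - true mean) / sample SD, multiplied by sqrt(n)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least 2 observations")
    return (mean(sample) - true_mean) / stdev(sample) * math.sqrt(n)


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
t_value = t_statistic(sample, 4.0)
degrees_of_freedom = len(sample) - 1

print("t =", round(t_value, 4), "with", degrees_of_freedom, "degrees of freedom")
```

The resulting value is then compared with a critical value of the t-distribution with n - 1 degrees of freedom.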

Hypothesis Testing:

In any research, we have a main hypothesis, while for statistical tests we have to state Null and Alternative Hypotheses. The Null Hypothesis is the default statement that there is no effect. We design the test to find out whether the data give enough evidence to reject it.

The Alternative Hypothesis, on the other side, is the statement we believe may be true and that we accept only if the Null Hypothesis is rejected.

Example: Education Affects Wages:

Null Hypothesis (H0): Education has no effect or a negative effect on the wage.

Alternative Hypothesis (Ha): Education affects the wage positively.

Significance Level:

Alpha is the probability of rejecting the null hypothesis when it is actually true (a Type I error). For example, with Alpha = 5%, if H0 is true there is still a 5% chance that the test rejects it. This is called the significance level of our experiment.

Power of Test:

If the null hypothesis is false, there is still a probability that the test fails to reject it (a Type II error). This probability is called Beta. The value 1-Beta is called the power of the test.
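To see how Alpha and the power relate, here is a sketch for the simplest textbook case: a one-sided z-test of a mean with a known standard deviation. That test is an assumption chosen for illustration; it is not the test used elsewhere in this article. The parameter names `effect`, `sigma` and `n` are hypothetical.

```python
import math
from statistics import NormalDist


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power (1 - Beta) of a one-sided z-test for a mean shift of `effect`.

    Assumes normally distributed data with known standard deviation sigma.
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1.0 - alpha)
    shift = effect * math.sqrt(n) / sigma  # how far the true mean moves the statistic
    return 1.0 - standard_normal.cdf(z_critical - shift)


# With no true effect, the rejection rate equals Alpha (Type I error rate).
print("effect 0.0:", round(power_one_sided_z(0.0, 1.0, 25), 3))

# With a real effect, the power grows with the size of the effect.
print("effect 0.2:", round(power_one_sided_z(0.2, 1.0, 25), 3))
print("effect 0.5:", round(power_one_sided_z(0.5, 1.0, 25), 3))
```

When there is no real effect, the test still rejects in a share Alpha of cases; when there is an effect, Beta is one minus the value returned here.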

Types of Errors in Hypothesis testing

Hypothesis Testing:

In statistics, we deal with a lot of different tests to check the validity of a hypothesis. In most hypothesis testing, we look for 95% confidence in our results, so the level of Alpha (significance) most commonly used is 5%.

Pearson Correlation Test:

The test is designed to show how correlated two variables are. In fact, it is a measure of how linear a set of data is. The values range from -1 to 1. If the slope of the best fit line is positive, then the coefficient is positive; if negative, then the coefficient is negative. If the slope of the best fit line is 0 (horizontal), then the coefficient is 0; if one of the variables does not vary at all, the coefficient is undefined.

Considerations of Application:

Under heavy noise conditions, extracting the correlation coefficient between two sets of stochastic variables is nontrivial, in particular where Canonical Correlation Analysis reports degraded correlation values due to the heavy noise contributions. A generalization of the approach is given elsewhere.

In simpler situations, when the data follow a Normal Distribution (for example, Bivariate Normal) and the null hypothesis is no correlation between the datasets, the Pearson test can be converted to a Student’s t-distribution test.

Bivariate Normal Distribution
An increase in the number of data points leads to a smaller critical value of the coefficient.

How we convert the data for the t-test:

Conversion to t
How the critical value is calculated: an increase in the number of samples makes the critical value smaller.

The critical value of r is the threshold that we use in the calculations. Consider the case where we are dealing with 2 datasets, each with 1000 data points: the critical value of r is then much smaller than for a small sample, and it keeps shrinking toward zero as the number of data points grows.
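A sketch of the conversion, assuming the standard relation t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, and its inverse for the critical r. The value 2.228 is the two-tailed 5% critical t for 10 degrees of freedom taken from a standard t table; for the large sample, the normal quantile is used as an approximation of the critical t, which is an assumption of this example.

```python
import math
from statistics import NormalDist


def r_to_t(r, n):
    """Convert a correlation coefficient from n pairs to a t statistic."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def critical_r(t_critical, n):
    """Smallest |r| that is significant, given the critical t for n - 2 df."""
    return t_critical / math.sqrt(n - 2 + t_critical ** 2)


# Small sample: n = 12 pairs, so 10 degrees of freedom.
small_sample_r = critical_r(2.228, 12)

# Large sample: n = 1000 pairs; normal approximation of the critical t.
z_critical = NormalDist().inv_cdf(0.975)
large_sample_r = critical_r(z_critical, 1000)

print("critical r for n = 12:  ", round(small_sample_r, 3))
print("critical r for n = 1000:", round(large_sample_r, 3))
```

With 12 pairs a correlation needs to exceed roughly 0.58 in absolute value to be significant at the 5% level, while with 1000 pairs a value a little above 0.06 is already enough.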

One-tailed and two-tailed considerations in the Student’s t-test:

The main concepts that we have to know in implementing this test are one-tailed and two-tailed tests.

When our hypotheses only ask whether a correlation exists, i.e.

H0: r = 0

Ha: r != 0

we have to check r with a two-tailed test, because a deviation in either direction counts against H0.

When the alternative hypothesis states a direction for the correlation, we use a one-tailed test, i.e.

H0: r ≤ 0

Ha: r > 0
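The number of tails changes the critical value at the same significance level. The sketch below uses the normal distribution as a large-sample approximation of the t-distribution, which is an assumption made to keep the example within the standard library; `critical_z` is an illustrative name.

```python
from statistics import NormalDist


def critical_z(alpha, two_tailed):
    """Critical value at level alpha, normal approximation of the t-test."""
    # A two-tailed test splits alpha between both tails of the distribution.
    tail_area = alpha / 2.0 if two_tailed else alpha
    return NormalDist().inv_cdf(1.0 - tail_area)


one_tailed = critical_z(0.05, two_tailed=False)  # Ha states a direction
two_tailed = critical_z(0.05, two_tailed=True)   # Ha only says "not equal"

print("one-tailed critical value:", round(one_tailed, 3))
print("two-tailed critical value:", round(two_tailed, 3))
```

A test with a directional alternative puts all of Alpha in one tail and therefore has the lower threshold, while a "not equal" alternative splits Alpha across both tails and needs a larger statistic to reject H0.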

Table of critical values:

Student’s t critical value table

There is also a tool that calculates the values for different tests.

Application of the Normal Distribution in Trade:

When looking for a Stop Loss or Take Profit level, I can base my calculations on the 15-minute trade volume data.


An Enthusiast Equity Analyst and Independent Financial Researcher with a passion for Fundamental Analysis. I use Medium for the daily records.

Iman Najafi
