
Week 4 Weekend

This weekend we went to Lands End in San Francisco. We were on the trail for no more than 3 minutes when Alex, Will and I decided to go over the ropes and wander off trail. Alex and I went on an adventure, scaling down the rocks and getting closer to the water. It was a lot of fun until we realized how far down we were and had to climb back up. However, we got to take a lot of amazing pictures! When we came back up we lasted another 5 minutes before wandering off trail again on another adventure. This time we ended up getting our socks wet, but again it allowed us to get great pictures. This adventure was through a small cave. When we came out we were in an area where the water would hit the rocks and fly up, like in The Little Mermaid (excellent movie). After this side adventure we got back on the trail and walked for a while until we came to an area where we could go in the water a bit and have our feet in the sand. Will and I stood back where the water would hit our ...

Week 4

This week I got to work a lot on the code for the project! I used the FLUXNET data from 2002-2004 and calculated the similarities of the datasets using the Minimum Jump Cost dissimilarity measure (MJC), Euclidean distance, and the Fourier coefficients method. These measures produce some very interesting graphs.
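To give a rough sense of two of these measures (this is a minimal sketch, not the project's actual code; the synthetic series and the number of Fourier coefficients kept are assumptions for the example):

```python
import numpy as np

def euclidean_distance(x, y):
    """Plain Euclidean distance between two equal-length time series."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def fourier_distance(x, y, n_coeffs=10):
    """Distance over the first n_coeffs Fourier coefficients.

    Truncating the DFT keeps the low-frequency shape of each series and
    ignores high-frequency noise, which is the idea behind the Fourier
    coefficients method.
    """
    fx = np.fft.rfft(x)[:n_coeffs]
    fy = np.fft.rfft(y)[:n_coeffs]
    return np.sqrt(np.sum(np.abs(fx - fy) ** 2))

# Two synthetic daily series standing in for FLUXNET measurements (assumption).
t = np.linspace(0, 2 * np.pi, 365)
series_a = np.sin(t) + 0.1 * np.random.randn(365)
series_b = np.sin(t + 0.2) + 0.1 * np.random.randn(365)

print("Euclidean:", euclidean_distance(series_a, series_b))
print("Fourier  :", fourier_distance(series_a, series_b))
```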

Week 3 (06/17/19 - 06/21/19)

The approach for analyzing the FLUXNET dataset has changed: instead of Locality-Sensitive Hashing, we will now use statistical machine learning to identify patterns in the dataset and track how they change. I finished the research proposal.

Week Starting 06/17/19 (Week 3)

This week the research dataset was changed from the Sloan Digital Sky Survey to the FLUXNET dataset.  FLUXNET measures the exchanges of carbon dioxide, water vapor, and energy between terrestrial ecosystems and the atmosphere. I finished my research proposal, which can be found here: https://drive.google.com/open?id=1yXAvF-o3i_VXOL7Tl7_QvyDVWCXOopM69zbvg--vtpM. I have started to research the FLUXNET data and so far I have read four papers on the subject.

Week Starting 6/10/19 (Week 2)

This week I extensively researched different algorithms for determining the similarities within a given dataset at different levels. I found out that my dataset will be the Sloan Digital Sky Survey, which takes pictures of the sky in an attempt to create a map of the universe. I have researched two of the three main methods that are currently widely accepted: trees, mainly kd-trees, and graph-based algorithms, mainly Navigable Small World graphs and Hierarchical Navigable Small World graphs (HNSW). I have started to run some of these algorithms on a practice dataset so that I can better understand their usage; a sketch of that kind of experiment is below. The other main method is hashing, but it doesn't really apply to the dataset that I will be working on. As of today, I am working on using map on the test dataset.
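Here is a minimal sketch of what running these methods on a practice dataset might look like (my assumption of the setup, with random vectors standing in for the practice data), using scipy's kd-tree and nmslib's HNSW index:

```python
import numpy as np
from scipy.spatial import cKDTree
import nmslib

# Random vectors standing in for the practice dataset (assumption for the example).
rng = np.random.default_rng(0)
data = rng.random((1000, 16)).astype(np.float32)
query = data[0]

# kd-tree: exact nearest neighbors, works well in lower dimensions.
tree = cKDTree(data)
kd_dists, kd_ids = tree.query(query, k=5)

# HNSW graph: approximate nearest neighbors, scales to higher dimensions.
hnsw = nmslib.init(method='hnsw', space='l2')
hnsw.addDataPointBatch(data)
hnsw.createIndex({'M': 16, 'efConstruction': 200}, print_progress=False)
hnsw_ids, hnsw_dists = hnsw.knnQuery(query, k=5)

print("kd-tree neighbors:", kd_ids)
print("HNSW neighbors:   ", hnsw_ids)
```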

Week 2

I have read a lot of papers and feel I am getting the hang of what is going on from a theoretical standpoint. I have started coding and am attempting to use the faiss, nmslib, and falconn libraries to apply Locality-Sensitive Hashing (LSH) to a small dataset. Eventually I want to be able to use these algorithms to analyze a larger dataset from FLUXNET.
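A minimal sketch of what that LSH experiment might look like with faiss's binary LSH index (the random data here is just a placeholder for the small dataset, and the dimensions are assumptions):

```python
import numpy as np
import faiss

d = 64          # vector dimension (assumption for the example)
nbits = 128     # number of LSH hash bits

# Random vectors standing in for the small practice dataset.
rng = np.random.default_rng(0)
xb = rng.random((2000, d)).astype('float32')   # database vectors
xq = rng.random((5, d)).astype('float32')      # query vectors

# IndexLSH hashes vectors to binary codes and searches by Hamming distance.
index = faiss.IndexLSH(d, nbits)
index.add(xb)

distances, ids = index.search(xq, k=4)  # 4 approximate nearest neighbors per query
print(ids)
```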

General Description

Our goal is to create algorithms to detect duplicated or unnecessary data in large datasets. Duplication can easily occur when a scientist collects data, and when it happens the data becomes more difficult to deal with and wastes the finite resources the computer has at its disposal. It can also lead to slowdowns, among other issues. Once we detect the unnecessary or duplicated data, we would like to find a way to dispose of it.

Description of project

When scientists collect their data there tends to be overlap in the acquired data. Because of this, computer resources and storage space are wasted on duplicated data. In this project we will focus on techniques to help scientists track and identify this duplicated data so that it can be dealt with. We will do this by developing MinHash-based algorithms, which we hope will help scientists find duplicated data without wasting computer space and resources. A rough sketch of the MinHash idea follows.
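This is a minimal illustration, not the project's actual code; the records and token format are made up for the example. Each record is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity between records, so near-duplicates can be flagged cheaply.

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value seen over the set of tokens."""
    signature = []
    for seed in range(num_hashes):
        min_val = min(
            int(hashlib.md5(f"{seed}:{tok}".encode()).hexdigest(), 16)
            for tok in tokens
        )
        signature.append(min_val)
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Two mostly overlapping records standing in for duplicated scientific data (made up).
record_a = {"site=US-Ha1", "year=2003", "var=CO2_flux", "qc=0"}
record_b = {"site=US-Ha1", "year=2003", "var=CO2_flux", "qc=1"}

sig_a = minhash_signature(record_a)
sig_b = minhash_signature(record_b)
print("Estimated Jaccard similarity:", estimated_jaccard(sig_a, sig_b))
```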