Notice: Mini Project for BDA 3rd and 5th Semester

All students of 3rd and 5th semester BDA are required to complete one project from their respective semester list. Apart from this, you may also opt for a project from the CS list. You will get extra credit for each additional project you complete.
For 3rd Sem
1. This project uses one of the simplest and easiest datasets in the field of Big Data. It is known as the Iris dataset (available at https://archive.ics.uci.edu/ml/datasets/Iris). This is perhaps the best-known database in the pattern recognition literature. It was introduced by Fisher in 1936, is a classic in the field, and is referenced frequently to this day. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Your task: Predict the class (Iris Setosa, Iris Versicolour or Iris Virginica) based on the available attributes. Divide the dataset into training and testing sets in a 70:30 or 75:25 ratio.
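
The 70:30 division can be sketched as below. This is only an illustration in plain Python, not a required method: the function name stratified_split and the placeholder rows are assumptions made for this example, and the split is done per class so that each of the 3 classes keeps the same share in both sets.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_index, test_fraction=0.3, seed=42):
    """Split rows into (train, test), keeping class proportions equal in both."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_index]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Placeholder rows standing in for the real file: 3 classes x 50 instances.
# The first field is just a running index, NOT a real Iris measurement.
rows = [(i, label)
        for label in ("setosa", "versicolor", "virginica")
        for i in range(50)]

train, test = stratified_split(rows, label_index=1, test_fraction=0.3)
print(len(train), len(test))
```

For the 75:25 option, test_fraction would be set to 0.25 instead.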

2. This project uses a popular dataset in the field of Big Data known as the Titanic dataset. It is available at https://www.kaggle.com/c/titanic/data (You need to create a Kaggle account to download this dataset. Eventually every Big Data student should have a Kaggle account to work on real-life data science problems).
Your task: Predict whether a passenger is likely to survive or not based on the given attributes.
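
Before training any model, it can help to know the accuracy reached by always predicting the most common outcome, so that later results have a reference point. The sketch below is one possible way to compute this; the helper names and the label lists are made up for illustration (0 = did not survive, 1 = survived) and are not taken from the dataset.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Made-up labels for illustration only (0 = did not survive, 1 = survived).
train_labels = [0, 0, 1, 0, 1, 0, 0, 1]
test_labels = [0, 1, 0, 0]

guess = majority_baseline(train_labels)
baseline_acc = accuracy([guess] * len(test_labels), test_labels)
print(guess, baseline_acc)
```

A trained classifier is only useful if its test accuracy is clearly above this baseline.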

For 5th Sem

1. In this project students are required to perform analytics on the Complete FIFA 2017 Player dataset. This dataset is available at https://www.kaggle.com/artimous/complete-fifa-2017-player-dataset-global (You will need a Kaggle account to download this dataset). A total of seven analytics are required: 5 are listed below and 2 have to be decided by the students themselves.
(i) Analytics on the best goal scorers
(ii) Analytics on on-field behaviours of the players
(iii) Analytics on the best attributes of the players of a particular continent (for example, European players are better dribblers, African players have better stamina, etc.)
(iv) Analytics on which attributes different clubs prefer
(v) Analytics on contract period
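
Most of the analytics above reduce to grouping players by some field and summarising an attribute within each group. A minimal sketch of that pattern follows. The column names (Continent, Dribbling, Stamina) and the sample rows are assumptions made up for this example; check the real CSV header, which may list only a nationality, in which case a nationality-to-continent mapping would be needed first.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample rows; column names are assumptions, not the real header.
SAMPLE_CSV = """Name,Continent,Dribbling,Stamina
Player A,Continent X,80,70
Player B,Continent X,90,60
Player C,Continent Y,70,90
Player D,Continent Y,60,80
"""


def mean_by_group(rows, group_col, value_col):
    """Average a numeric column within each value of the grouping column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(float(row[value_col]))
    return {group: mean(values) for group, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
dribbling = mean_by_group(rows, "Continent", "Dribbling")
stamina = mean_by_group(rows, "Continent", "Stamina")
print(dribbling)
print(stamina)
```

The same helper could be reused with a club column as group_col for analytics (iv).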

2. In this project students are required to perform analytics on the World Bank Projects & Operations dataset. This dataset is available at https://data.worldbank.org/data-catalog/projects-portfolio
World Bank Projects & Operations provides access to basic information on all of the World Bank’s lending projects from 1947 to the present. The dataset includes basic information such as the project title, task manager, country, project id, sector, themes, commitment amount, product line, procurement notices, contract awards, and financing. It also provides links to publicly disclosed online documents. A total of six analytics are required: 5 are listed below and 1 has to be decided by the students themselves.
(i) Analytics on which continents are getting maximum benefits
(ii) Analytics on the sectors
(iii) Analytics on the individual countries of a particular continent
(iv) Analytics on the gradual change of sectors over the years (for example, whether agriculture was more prominent in the 1950s than in recent years, etc.)
(v) Analytics on the development status of a continent based on projects (you can ponder over questions such as: If the World Bank is providing more projects, does it indicate that the area is less developed? Do developed continents require World Bank projects? Has the number of projects gone down over the years, indicating more self-sufficiency? etc.)
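
For analytics (iv), one possible approach is to compute each sector's share of projects per decade. The sketch below assumes each record has an approval year and a sector; the field names approval_year and sector and all the records are made up for illustration and must be replaced with the actual columns of the downloaded file.

```python
from collections import Counter

# Made-up records; field names are assumptions, not the real column names.
projects = [
    {"approval_year": 1952, "sector": "Agriculture"},
    {"approval_year": 1958, "sector": "Agriculture"},
    {"approval_year": 1955, "sector": "Transport"},
    {"approval_year": 2011, "sector": "Education"},
    {"approval_year": 2015, "sector": "Agriculture"},
    {"approval_year": 2018, "sector": "Education"},
]


def sector_share_by_decade(records):
    """Map (decade, sector) to that sector's fraction of the decade's projects."""
    counts = Counter()
    totals = Counter()
    for rec in records:
        decade = rec["approval_year"] // 10 * 10  # e.g. 1958 -> 1950
        counts[(decade, rec["sector"])] += 1
        totals[decade] += 1
    return {key: counts[key] / totals[key[0]] for key in counts}


shares = sector_share_by_decade(projects)
for (decade, sector), share in sorted(shares.items()):
    print(decade, sector, round(share, 2))
```

Counting projects is only one measure; the commitment amount could be summed instead of counting if the question is about money rather than the number of projects.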

3. This project aims to identify the species of a bird from a given image. The dataset to be used in this project is publicly available (http://www.vision.caltech.edu/visipedia/CUB-200-2011.html). All the attributes that will be used are visual, such as the color or shape of a specific part of the bird. The dataset should be divided in a 70:30 ratio for training and testing. Students will likely need to try more than one training method to achieve better training and testing accuracy. Python or R is suggested for implementation.