**Which industry can benefit from data mining?**

All of these

Retail

Manufacturing

Finance/Banking

**A function used by a node in a neural net to transform input data from any domain of values into a finite range of values is known as a(n):**

Confusion matrix

Activation Function

Chi-square

Antecedent
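
As an illustration of the idea in this question, here is a minimal sketch of one widely used squashing function, the logistic sigmoid, which maps any real input into the range between 0 and 1. The function name `sigmoid` and the sample inputs are chosen for this example only.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: squashes any real number into the range 0..1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative inputs use an equivalent form that cannot overflow exp().
    z = math.exp(x)
    return z / (1.0 + z)


# Inputs from a wide domain all land inside a finite range.
for value in (-10.0, 0.0, 10.0):
    print(value, sigmoid(value))
```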

**Changes to parts of a code could lead to the problem of ______________ data.**

granular

inconsistent

nonintegrated

dirty

**What is data visualization?**

A structured and developed prediction of data
results

The visual interpretation of complex relationships
in multidimensional data

The technical term for the act of data being stored
in a server

**Decision trees are able to handle missing values without using any impute transformation. True or False?**

True

False

**Data items grouped into relationships and preferences are known as:**

Functional Organizations

Predictable Sets

Degrees of Fit

Clusters

**Which of the following is valid XML?**

All are valid

<valid>This One</valid>

<valid>"This One"</valid>

<body answer="valid">This
One</body>
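
One way to check candidates like these is to hand them to an XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; note that it tests well-formedness only, since validity in the strict sense would also need a DTD or schema, which the question does not supply. The helper name `is_well_formed` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as a single XML element."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


candidates = [
    '<valid>This One</valid>',
    '<valid>"This One"</valid>',
    '<body answer="valid">This\nOne</body>',  # line break inside the text
]
results = [is_well_formed(c) for c in candidates]
print(results)
```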

**A(n) _____ algorithm creates rules that describe how often events have occurred together.**

associative

artificial

pruning

CHAID

**What are decision trees?**

Complex reports generated by a qualified data
scientist

Structures that generate rules for the
classification of a dataset

Hierarchical dimensions that can be created with a
hyper cube browser

Data not collected by the organization, such as
data available from a reference book

**Which of the following is not a relational database?**

Google Big Table

MongoDB

All of the above

Apache Cassandra

**In predictive models, the values or classes to be predicted are called the:**

Response

Target variables

Dependent

All of these

**In a neural net, to what does topology refer?**

The number of layers and the number of nodes in
each layer

The graphical visualization of the data

The range of variables in a set

The number of nodes utilized

**You are a credit risk manager at a retail bank. Some information about customers is available for analysis. Based on this data, you have to decide whether a person will be a good or a bad customer. Choose the appropriate data mining task for this business problem.**

Segmentation

Classification

Regression

**Which of the following clustering algorithms can find clusters of arbitrary shape?**

None of these

Single-Link

DBSCAN

Both of these

**True or False? Loose coupling data mining architecture is mainly for memory-based data mining systems that do not require high scalability and high performance.**

True

False

**What is CRISP-DM?**

Microsoft's linear regression algorithm

A decision tree developed in the 1980s but almost
entirely replaced by the CART method today

A six-phase method for predicting e-commerce buying
habits

A cross-industry standard process for data mining

**The annual revenue of an international company is correlated with other attributes such as advertising, exchange rate, inflation rate, etc. Given these values (or reliable estimates of them for the next year), the company has to calculate its expected revenue for the next year. Choose the appropriate data mining task for this business problem.**

Regression

Classification

Segmentation

**Which of these is NOT considered an internal data factor?**

Product Positioning

Economic downturns

Staff Skills

Price

**With which of these layers does a neural network start?**

Transparent layer

Output Layer

Hidden Layer

Input layer

**What is the measure of how much two random variables change together?**

covariance

stochastic inertia

polyconvergence

binary standard deviation
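
To make the notion of two variables changing together concrete, here is a small sketch that computes the sample covariance by hand. The function name and the toy data are assumptions for illustration; a positive result means the variables tend to rise together, a negative one that they move in opposite directions.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equally long sequences (n - 1 denominator)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average product of the paired deviations from each mean.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4]
print(sample_covariance(xs, [2, 4, 6, 8]))  # positive: move together
print(sample_covariance(xs, [8, 6, 4, 2]))  # negative: move oppositely
```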

**A hyperplane is a**

decision boundary separating classes of data

non-terminating error condition

collection of linked hypertext files

variant of the C4.5 algorithm

**Suppose that a company's marketing department collects data from customers and wants to form customer groups so that each offer can be targeted at the most appropriate group. Choose the appropriate data mining task for this business problem.**

Classification

Regression

Segmentation

**The level of the model that specifies (often graphically) which variables are locally dependent on each other.**

Qualitative Level

Structural Level

Primary Level

Quantitative Level

**To increase the confidence of your estimate of classification performance on the entire population, you should:**

Decrease the size of the training dataset

Increase the size of the test dataset

Decrease the size of the test dataset

Increase the size of the training dataset

**Data not collected by the organization, such as data from a proprietary database, that is combined with the organization’s own data is known as:**

Noise

Non-applicable data

Overlay

Overfitting

**Which data mining technique organizes sets of data into predefined groups?**

Classification

Sequential Patterning

Clustering

Gamification

**What is the front end layer of data mining architecture?**

Firewalls established to protect data from
malicious sources

The team of programmers who designed the software
utilized in a particular mining project

An intuitive and user friendly user interface

The hardware designed specifically for storage of
massive amounts of data

**Which of these is an example of a sequential pattern relationship?**

Using business experience and gut instinct to
design a new floorplan in a grocery store

Reorganizing your basketball team's starting lineup
based on an analysis of performance

Placing two frequently purchased items next to each
other on the shelf

Predicting the likelihood of a backpack being
purchased based on a consumer's purchase of sleeping bags and hiking shoes

**In the analysis of time-series data, the mean value over a given time period (usually some interval in the past up to the present) is called a(n)**

unbiased mean

compounded mean

partial average

moving average
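
A mean taken over a sliding window of the most recent observations can be sketched as follows. The function name, the window size, and the sample series are assumptions for this example; one value is emitted for each position at which a full window is available.

```python
from collections import deque


def moving_average(values, window):
    """Mean of the last `window` values, for every full window in `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)  # holds at most `window` latest values
    total = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # the oldest value is about to drop out
        recent.append(value)
        total += value
        if len(recent) == window:
            averages.append(total / window)
    return averages


print(moving_average([1, 2, 3, 4, 5], window=3))
```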

**True or False? Tests in CART are always Binary.**

False

True

**In the association between two variables, what is the difference between the antecedent and the consequent?**

The antecedent is on the left, the consequent on
the right

Nothing, they are interchangeable

The antecedent is on the right, the consequent is
on the left.

The antecedent is always a very complex variable

**The algorithm powering the Google search engine is:**

The Brin-Page Method

PageRank

GoogleCrawler

AdaBoost

**What is Dependency Modeling?**

A task which consists of techniques for estimating,
from data, the joint multi-variate probability density function of all of the
variables/fields in the database.

Learning a function that maps a data item into one
of several predefined groups or clusters.

A multi-step process involving data preparation,
pattern searching, knowledge evaluation, and refinement with iteration after
modification.

The process of finding a model which describes
significant dependencies between variables

**What is Regression?**

Learning a function that maps a data item to a
real-valued prediction variable.

A descriptive task where one seeks to identify a
finite set of categories to describe the data.

An expression E in a language L describing facts in
a subset FE of F.

Learning a function that maps a data item into one
of several predefined groups.

**What is Change and Deviation Detection?**

A task focusing on discovering the most significant
changes in the data from previously measured or normative values

The process of finding a model which describes
significant dependencies between variables

A task which consists of techniques for estimating,
from data, the joint multi-variate probability density function of all of the
variables/fields in the database.

Methods for finding a compact description for a
subset of data.

**Sharding refers to:**

a measure of the noise in a database's contents

none of the above

simultaneously accessing multiple object databases
over SSH

partitioning a database for distribution across
different servers

**Which of these is NOT a common description of layers?**

Functional

Output

Hidden

Input

**What is the type of data mining that drives the Amazon.com recommendation system?**

Clustering Algorithms

Anomaly Detection

Association Learning

Fuzzy Logic

**Support Vector Machines have an advantage over Neural Networks because SVMs are**

parametric

easier to train via online learning

none of the above

more resistant to local minima convergence

**True or False? Economic indicators are external data factors.**

False

True

**Which of these are NOT types of analytical software:**

Neural network

Machine learning

All are valid types

Statistical

**Which of the following algorithms is generally suitable for unsupervised learning tasks?**

k-nearest neighbor

info-fuzzy networks

k-means algorithm

Restricted Boltzmann machine

**Which of the following storage solutions is most appropriate for a semi-structured dataset whose members do not all have the same attributes?**

MariaDB

SQLite

MySQL

MongoDB

**In order to estimate classification performance on an entire population, you need _______**

(None of these)

disjoint training and test datasets

Test Datasets

Disjoint training

**What is the extraction of useful if-then rules from data based on statistical significance?**

Preliminary Method Mapping

Dynamic Information Inference

Fuzzy Logic Application

Rule Induction

**What is a KDD Process?**

K-mean Data Discovery

Knoop-hardness measured through high-impact
dimension

Differential Decryption

Knowledge Discovery in Databases

**Generalization error is a consequence of**

Overfit

Underfit

Poorly defined Chernoff Bound

Parametric analysis

**Which of the following is NOT a common source system?**

DB Connect

UDC

SAP source

Node

**Which of these are evolutionary computational methods?**

Clustering algorithms

Bayesian inference algorithms

Genetic algorithms

Heuristic algorithms

**True or False? The MARS algorithm cannot produce rules.**

True

False

**A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset is:**

Nearest Neighbor

Decision Treeing

Logistic Regression

Association Model Query

**Which of the following is most appropriate for finding the shortest chain of friends linking two people in a social graph who are not friends with each other?**

Neural Networks

k-means algorithm

Dijkstra's algorithm

Markov chains
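
A shortest-path search over a friendship graph might look like the sketch below, which runs Dijkstra's algorithm with every friendship given a cost of 1. The graph, the names in it, and the function `shortest_chain` are invented for illustration. Because all edges cost the same here, a plain breadth-first search would find the same chain.

```python
import heapq


def shortest_chain(friends, start, goal):
    """Return the shortest list of people linking start to goal, or None."""
    dist = {start: 0}
    prev = {}
    queue = [(0, start)]
    while queue:
        d, person = heapq.heappop(queue)
        if person == goal:
            break
        if d > dist[person]:
            continue  # stale queue entry, a shorter route was already found
        for friend in friends.get(person, ()):
            nd = d + 1  # assumption: every friendship link costs 1
            if nd < dist.get(friend, float("inf")):
                dist[friend] = nd
                prev[friend] = person
                heapq.heappush(queue, (nd, friend))
    if goal not in dist:
        return None
    # Walk the predecessor links back from goal to start.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


friends = {
    "ann": ["bob"],
    "bob": ["ann", "cat", "dan"],
    "cat": ["bob", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}
print(shortest_chain(friends, "ann", "eve"))
```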

**Which of the following is NOT a function of data warehouses?**

Extracting data

Cleaning data

Cleaning dirty data

Storing purchased data

**What is Interestingness?**

A discovered pattern that is true on new data with
some degree of certainty, and generalizes to other data.

An overall measure of pattern value, combining
validity, novelty, usefulness, and simplicity.

An expression E in a language L describing facts in
a subset FE of F.

A multi-step process involving data preparation,
pattern searching, knowledge evaluation, and refinement with iteration after
modification.

**What is a genetic algorithm?**

A classic algorithm for frequent item set mining
and association rule learning over transactional databases. It proceeds by
identifying the frequent individual items in the database and extending them to
larger and larger item sets as long as those item sets appear sufficiently
often in the database.

An algorithm that estimates how well a particular
pattern (a model and its parameters) meets the criteria of the KDD process.
Evaluation of predictive accuracy (validity) is based on cross validation.
Evaluation of descriptive quality involves predictive accuracy, novelty,
utility, and understandability of the fitted model. Both logical and
statistical criteria can be used for model evaluation.

A search algorithm that enables us to locate
an optimal binary string by processing an initial random population of binary
strings by performing operations such as artificial mutation, crossover and
selection.

**In the MapReduce model, Map and Reduce functions act directly on which kind of data structure?**

MySQL matrices

key-value pair

linked lists

relational databases
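
The data flow of the MapReduce model can be imitated in a single process, as in the word-count sketch below. This is not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions for this example, and it only shows what the map and reduce steps receive and emit.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit one (word, 1) pair per word in the document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single pair."""
    return (key, sum(values))


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["a b a", "b c"]))
```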

**Which of the following is not a common goal of the KDD Process:**

Description

Performance

Prediction

**Which of the following disciplines overlaps Data Mining?**

Artificial Intelligence

Statistics

All of the above

Linguistics

**What is Classification?**

Learning a function that maps a data item into one
of several predefined groups.

A descriptive task where one seeks to identify a
finite set of categories to describe the data.

Methods for finding a compact description for a
subset of data.

A discovered pattern that is true on new data with
some degree of certainty, and generalizes to other data.

**Which of the following clustering algorithms can optimize an objective function?**

DBSCAN and Single Link

k-means only

k-means and CLARANS

Subspace Clustering Algorithms

**What is Clustering?**

Learning a function that maps a data item into one
of several predefined groups or clusters.

A descriptive task where one seeks to identify a finite
set of categories to describe the data.

A task which consists of techniques for estimating,
from data, the joint multi-variate probability density function of all of the
variables/fields in the database.

The process of finding a model which describes
significant dependencies between variables

**Which of the following properties applies to Single-Layer Perceptrons?**

backpropagation

continuous output

random initialization of weights

able to learn non-linear separations

**In Natural Language Processing, what is the role of a lexical analyzer?**

generates a context-free grammar

splits the stream of input characters into tokens

processes the parse tree for semantic meaning

checks the validity of a token

**A DBMS reduces data redundancy and inconsistency by**

Enforcing referential integrity

Utilizing a data dictionary

Minimizing isolated files with repeated data

uncoupling program and data

**In which type of analysis is a Kohonen feature map typically employed?**

Predictive analysis

Descriptive modeling analysis

Cluster analysis

Exploratory data analysis

**Information converted to provide insights about historical patterns and future trends is known as:**

Meta-data

Clustering

Knowledge

Linear regression

**What is Summarization?**

A task focusing on discovering the most significant
changes in the data from previously measured or normative values

A descriptive task where one seeks to identify a
finite set of categories to describe the data.

The process of finding a model which describes
significant dependencies between variables

Methods for finding a compact description for a
subset of data.

**Which of the following is NOT a method of combining multiple models into an ensemble model?**

Bootstrapping

Averaging

Stacking

Voting

**"In 2% of the purchases at the hardware store, both a pick and a shovel were bought," is an example of:**

Support

Topology

Supervised learning

Validation
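
A statement of the form "X% of purchases contained both items" can be computed directly from transaction data. The sketch below does so for a made-up set of 50 baskets; the function name `support` and the basket contents are assumptions for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        raise ValueError("need at least one transaction")
    wanted = set(itemset)
    hits = sum(1 for basket in transactions if wanted <= set(basket))
    return hits / len(transactions)


# Hypothetical hardware-store data: 50 baskets, one with both items.
baskets = [{"pick", "shovel"}] + [{"nails"}] * 30 + [{"pick"}] * 19
print(support(baskets, {"pick", "shovel"}))
```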

**Which of the following applications are usually used to classify students' performances?**

Regression analysis

If...then... analysis

Cluster analysis

Market-basket analysis

**Which of the following properties is a constraint on a RESTful application?**

returns JSON output

stateful

linearly separable

stateless

**Which of the following algorithms produces decision trees?**

DBSCAN

ID3

logistic regression

none of the above

**The component of the Hadoop Distributed Filesystem responsible for storing metadata is called the**

FS Shell

Datanode

Namenode

DFSAdmin

**Which XPath selector expression captures all link elements of the form 'http://example.com/profile/12345' in an HTML page while excluding all links of the form 'http://example.com/casenumber/12345'?**

//href/profile

//a/profile

//a[contains(@href, "profile")]/@href

//a[contains(@href, "profile")]

**What is Pig?**

A programming language that enables Hadoop to
operate as a data warehouse.

A programming language that simplifies the common
tasks of working with Hadoop.

None of these

**If more than one value occurs the same number of times, the data is:**

Multi-leafed

Multi-modal

Multi-faceted

Multivariated

**Taking multiple random samples of data and building a classification model for each is known as:**

Clustering

Boosting

Binning

Fuzzy Sampling

**A commonly used continuous alternative to the step function in multi-layered neural network output is the**

logistic function

hyperbolic function

logarithmic function

multi-layered NN cannot compute continuous output

**The authentication protocol used by many significant web APIs is called:**

SSL

PGP

OAuth

HTTPS

**What is CURL?**

A methodology for classifying hidden features of
data

The part of HTTP that specifies access permission

Combinatorial Unsupervised Recursive Learning
algorithm

A command-line tool for retrieving files

**Apriori is a seminal algorithm for finding frequent item sets using:**

Candidate generation

Normal mixture models

None of these

Overfitting methods

**What is the first step in the business understanding phase?**

Firmly grasp business objectives and needs

Create data mining goals to achieve the business
objectives

Create a list of all relevant algorithms to be
applied to the task

Assess the current situation by finding out the
resources, assumptions, constraints etc.

**In any numerical data set with a meaningful mean value, what is the minimum fraction of data that will fall within n standard deviations of the mean?**

1/n^2

1/n

1-1/n^2

1/2n
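
The bound this question refers to is Chebyshev's inequality: for any distribution with a finite mean and a nonzero standard deviation, at least 1 - 1/n^2 of the values lie within n standard deviations of the mean. The sketch below checks that numerically on a deliberately skewed, made-up data set; the helper name `fraction_within` is an assumption for this example.

```python
import statistics


def fraction_within(data, n):
    """Fraction of values at most n population standard deviations from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data, mu=mean)
    return sum(1 for x in data if abs(x - mean) <= n * sd) / len(data)


data = [1, 1, 1, 1, 2, 2, 3, 4, 10, 50]  # skewed, with one large outlier
for n in (2, 3):
    bound = 1 - 1 / n**2
    print(n, fraction_within(data, n), ">=", bound)
```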

**Which of these is a possible architecture of a data mining system?**

Transitive coupling

Magnetic coupling

No-coupling

Quickstart coupling

**Which of these is not a step in the KDD process?**

Data Mining

Data Cleaning

Data Integration

Data Quantification

**Which of the following methods can be used for modeling a categorical target variable?**

Regression

Non-Linear Regression

All of the Above

ARIMA

Logistic Regression

**The level of the model that specifies the strengths of the dependencies using some numerical scale.**

Quantitative Level

Numeric Level

Primary Level

Dependency Level

**Which of the following is not a primary phase of a Hadoop Reducer?**

Reduce

Map

Sort

Shuffle

**The measured differences between a model and its predictions are known as:**

Non-applicable data

Noise

Range

Outliers

**True or False? Artificial neural networks are linear predictive models.**

False

True

**Which are popular data mining methods?**

Probabilistic Graphical Dependency Models

Decision Trees and Rules

Relational Learning Models

All of these

**Which decision tree method performs multi-level splits when computing classification trees?**

ID3 (Iterative Dichotomiser 3)

CHAID (Chi Square Automatic Interaction Detection)

C4.5 algorithm

CART (Classification and Regression Trees)

**What is the advantage of the k-Medoids Clustering Algorithm over the k-Means Clustering (Lloyd's) Algorithm?**

represents clusters by center

more resistant to outliers

all of the above

uses iterative refinement

**Which of the following is not an appropriate tool for harvesting data from a website that accesses its database through JavaScript/AJAX calls?**

wget

PhantomJS

Selenium

All of the above are appropriate

**Which of the following is part of a retail customer data mining strategy?**

loyalty cards

holiday sale

customer testimonials

money-back guarantee

**Hash-based technique, Transaction Reduction, Partitioning, Sampling, and Dynamic Itemset Counting are all examples of what?**

Techniques to improve the efficiency of an Apriori
algorithm

Methods to repeatedly scan the database and
check a large set of candidates by pattern matching.

Methods of generating frequent item sets without
candidate generation.

Methods for finding a compact description for a
subset of data.

**Which of the following is not valid JSON?**

All are valid

{"answer": "this one"}

{["answer": "this one"]}

{"answer": ["this one"]}
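
Candidates like these can be checked with a JSON parser. The sketch below uses Python's standard `json` module; the helper name `is_valid_json` is an assumption for this example.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as a JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


candidates = [
    '{"answer": "this one"}',
    '{["answer": "this one"]}',  # an object member must start with a string key
    '{"answer": ["this one"]}',
]
results = [is_valid_json(c) for c in candidates]
print(results)
```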

**The two major functions of BI servers are:**

Management and delivery

Processing and management

Application and delivery

Source and results

**How do you measure interestingness in association patterns?**

measure accuracy

measure relevance

measure variance

measure lift
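
For reference, lift compares how often two item sets occur together with how often they would be expected to if they were independent: lift(A, B) = support(A and B) / (support(A) * support(B)). The sketch below computes it on made-up baskets; the function names and data are assumptions for this example. A value above 1 suggests the items appear together more often than chance alone would predict.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)


def lift(transactions, antecedent, consequent):
    """Joint support divided by the product of the individual supports."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / (support(transactions, antecedent) * support(transactions, consequent))


# Hypothetical data: bread in 5 of 10 baskets, butter in 4, both in 3.
baskets = (
    [{"bread", "butter"}] * 3
    + [{"bread"}] * 2
    + [{"butter"}]
    + [{"milk"}] * 4
)
print(lift(baskets, {"bread"}, {"butter"}))
```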

**A descriptive approach to exploring data that can help identify relationships among values in a database is:**

Clustering

Predictive analysis

Link analysis

Function activation

**Where can a website operator generally find data on her customers' IP addresses?**

cookies

all of the above

HTTP request headers

server logfiles

**Data mining provides a link between:**

Online analytical processing and dynamic
information

Genetic algorithms and logistic regression

Parallel processing and RAID

Separate transactional and analytical systems

**What is Hive?**

Both of these

Hive is a programming language that simplifies the
common tasks of working with Hadoop.

Hive enables Hadoop to operate as a data warehouse.

**What is the purpose of the Hadoop Distributed File System (HDFS)?**

Creating a context in which there are no
restrictions on the data, enabling it to be unstructured and schemaless.

To enable computation to take place by allowing
each server to have access to the data.

Ensuring that data is replicated with redundancy
across the cluster.

All of these.

**The silhouette coefficient can be used to determine the natural number of clusters for ________.**

Hierarchical Algorithms

Partitioning Algorithms

Density Based Algorithms

Subspace Clustering Algorithms
