Is this ALLOW FILTERING efficient?
By "this" you mean the query in the context of your data model. However, the efficiency of an ALLOW FILTERING query depends mostly on how much data it has to filter, so without some real data this is a hard question to answer.
I am expecting that cassandra will filter in this order...
Yes, this is what will happen. However, needing an ALLOW FILTERING clause in a query usually points to a poor table design; that is, the table does not follow the Cassandra modeling guidelines (specifically "one query <--> one table").
As a solution, I would suggest including the action field in the clustering key just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You then would rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
The only minor issue is that if a record "switches" action values, you cannot update the action field alone (because it is now part of the clustering key), so you need to delete the row with the old action value and insert it again with the new one. But if you have Cassandra 3.0+, all of this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.
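To see why placing action ahead of start in the clustering key removes the need for filtering, here is a minimal Python sketch. It is a toy model of one partition, not Cassandra's storage engine; the sample rows and the helper name range_slice are invented for illustration.

```python
from bisect import bisect_left

# Toy model of ONE partition (day=1). Rows are kept sorted by the clustering
# columns (action, start, id), mirroring PRIMARY KEY((day),action,start,id).
# The sample rows are invented for illustration.
partition = sorted([
    ("accept", 1475485000, "a"),
    ("accept", 1475485500, "b"),
    ("accept", 1480000000, "c"),
    ("accept", 1485785700, "d"),
    ("reject", 1475485600, "e"),
    ("reject", 1480000001, "f"),
])

def range_slice(rows, action, start_gt, start_lt):
    """Rows with the given action and start_gt < start < start_lt.

    Because rows are sorted by (action, start, id), the matches form one
    contiguous run that two binary searches can locate. Assumes integer
    start values, so "start > x" is the same as "start >= x + 1".
    """
    lo = bisect_left(rows, (action, start_gt + 1))
    hi = bisect_left(rows, (action, start_lt))
    return rows[lo:hi]

matches = range_slice(partition, "accept", 1475485412, 1485785654)
print(matches)
```

With start first and action second, the "accept" rows would be scattered across the sort order, and every row in the range would have to be inspected and filtered.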
Right now the table only has about 320k records and I can use ALLOW FILTERING with no problem, but I realize this might not always be the case.
So here's the thing: Cassandra is very good at querying data by a specific key. It is also good at retrieving a range of data within a partition.
"SELECT * FROM {}.{} WHERE timestamp > {} ALLOW FILTERING;"
But due to its distributed nature, it is not good at scanning an entire table to compile a result set. And that's what you are asking it to do with the above query.
Network traffic is expensive, so the main goal with Cassandra is to ensure that your query is served by a single node. Using ALLOW FILTERING without specifying your partition key (name) requires the coordinator node to check each node in your cluster for values that may match your WHERE clause.
Essentially, the more nodes there are in your cluster, the more detrimental ALLOW FILTERING becomes to performance (unless you specify at least your partition key; only then are you guaranteed that your query can be served by a single node). Note that your slower query actually does this right, and solves that problem for you.
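The fan-out described above can be sketched in a few lines of Python. This is a toy placement function under stated assumptions: the four node names are hypothetical, replication is ignored, and MD5 stands in for Cassandra's real partitioner.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical 4-node cluster

def owner(partition_key):
    """Toy placement: hash the partition key and map it onto one node.

    Real Cassandra uses partitioner tokens and replication; this only shows
    that knowing the partition key pins the query to a known node.
    """
    digest = hashlib.md5(partition_key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

def nodes_contacted(partition_key=None):
    """Nodes the coordinator must ask in order to answer a query."""
    if partition_key is None:
        # No partition key in the WHERE clause: any node may hold matches.
        return list(NODES)
    return [owner(partition_key)]

print(len(nodes_contacted("bitcoin")))  # 1
print(len(nodes_contacted()))           # 4
```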
I had an idea to add another field that would be the day of the week, and maybe a month field as well.
And this is a good idea!
It solves two problems.
Cassandra has a limit of 2 billion cells per partition. As your partition key is "name" and you keep adding unique timestamps inside it, you will progress toward that limit until you either reach it, or your partition becomes too big to use (probably the latter).
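As a rough back-of-the-envelope check, the growth can be estimated like this. Everything except the 2 billion figure is an assumption made for illustration: eight value columns per row, one sample per coin per minute, and 500 coins. Plug in your own numbers.

```python
CELL_LIMIT = 2_000_000_000   # per-partition cell limit mentioned above
VALUE_COLUMNS = 8            # assumption: eight non-key columns per row
ROWS_PER_DAY = 24 * 60       # assumption: one sample per coin per minute

def days_until_limit(rows_per_day, value_columns=VALUE_COLUMNS):
    """Days of inserts before a never-bucketed partition reaches CELL_LIMIT."""
    cells_per_day = rows_per_day * value_columns
    return CELL_LIMIT // cells_per_day

def cells_per_day_bucket(coins, rows_per_day=ROWS_PER_DAY,
                         value_columns=VALUE_COLUMNS):
    """Cells in one day-sized partition holding every coin's samples."""
    return coins * rows_per_day * value_columns

print(days_until_limit(ROWS_PER_DAY))   # partition keyed by name only
print(cells_per_day_bucket(coins=500))  # partition keyed by day
```

Under these assumptions the hard limit is a long way off, which is why the partition becoming too big to use is the more likely failure. A partition bounded by a day stays the same size no matter how long you keep collecting.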
Here is how I would solve this:
CREATE TABLE cryptocoindb.worldcoinindex_byday (
daybucket text,
name text,
datetime timestamp,
label text,
price_btc double,
price_cny double,
price_eur double,
price_gbp double,
price_rur double,
price_usd double,
volume_24h double,
PRIMARY KEY (daybucket, datetime, name)
) WITH CLUSTERING ORDER BY (datetime DESC, name ASC);
Now you could query like this:
SELECT * FROM cryptocoindb.worldcoinindex_byday
WHERE daybucket='20170825' AND datetime > '2017-08-25 17:20';
Additionally, by clustering your rows on "datetime" descending, you are ensuring that the most-recent data is at the top of each partition (giving Cassandra less to have to parse through).
I moved "name" to be the last clustering column, just to maintain uniqueness. If you're never going to query by "name," then it doesn't make sense to use it as your partition key.
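With this model the application has to compute daybucket itself, both when writing and when reading. A minimal Python sketch, assuming buckets are cut on UTC days and that you pass timezone-aware datetimes (the helper names day_bucket and buckets_between are made up for this example):

```python
from datetime import datetime, timedelta, timezone

def day_bucket(ts):
    """Format an aware datetime as the 'YYYYMMDD' text used for daybucket."""
    return ts.astimezone(timezone.utc).strftime("%Y%m%d")

def buckets_between(start, end):
    """Every day bucket touched by the closed interval [start, end]."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.strftime("%Y%m%d"))
        day += timedelta(days=1)
    return buckets

since = datetime(2017, 8, 25, 17, 20, tzinfo=timezone.utc)
until = datetime(2017, 8, 27, 3, 0, tzinfo=timezone.utc)
print(day_bucket(since))
print(buckets_between(since, until))
```

A time range that spans several days then becomes one query per bucket, each of which hits a single partition.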
Hope this helps.
Note: I changed your "timestamp int" to "datetime timestamp" because it added clarity to the example. You can use whatever works for you, but just be careful of confusion arising from naming a column after a data type.
Edit 20170826
Is the following the same as your code or different?
PRIMARY KEY ((daybucket, datetime), name)
No, that's not the same. That's using something called a composite partition key. It will give you better data distribution in your cluster, but it will make querying harder for you, and basically set you back to doing table scans.
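The difference can be illustrated with a toy Python model in which a dict stands in for the cluster and each dict key is a partition key. The sample rows are invented, and this is only a sketch of the lookup behaviour, not of how Cassandra stores data.

```python
from datetime import datetime

# Invented sample rows: (daybucket, datetime, name)
rows = [
    ("20170825", datetime(2017, 8, 25, 17, 10), "bitcoin"),
    ("20170825", datetime(2017, 8, 25, 17, 30), "bitcoin"),
    ("20170825", datetime(2017, 8, 25, 17, 30), "ethereum"),
    ("20170826", datetime(2017, 8, 26, 9, 0), "bitcoin"),
]

def group(rows, key_len):
    """Group rows into partitions keyed by their first key_len columns."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[:key_len], []).append(row)
    return partitions

by_day = group(rows, 1)           # PRIMARY KEY (daybucket, datetime, name)
by_day_and_time = group(rows, 2)  # PRIMARY KEY ((daybucket, datetime), name)

cutoff = datetime(2017, 8, 25, 17, 20)

# Single-column partition key: one lookup, then a range over clustering rows.
hits = [r for r in by_day[("20170825",)] if r[1] > cutoff]

# Composite partition key: the exact datetime is part of the lookup key, so a
# range predicate has no single partition to go to; every partition is checked.
scanned = list(by_day_and_time)
hits_scan = [r for k in scanned for r in by_day_and_time[k]
             if k[0] == "20170825" and k[1] > cutoff]

print(len(by_day), len(by_day_and_time), len(hits), len(scanned))
```

Both approaches return the same rows, but the second one has to visit every partition to find them.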
For a good, comprehensive description of Cassandra primary keys, Carlo Bertuccini has a great answer on StackOverflow:
https://stackoverflow.com/questions/24949676/difference-between-partition-key-composite-key-and-clustering-key-in-cassandra/24953331#24953331
Is there a way to alter the way Cassandra reads timestamps or an easy way to make changes to that whole datafield to alter the timestamp so it will be correctly read?
Not really. Cassandra timestamps can be tricky to work with. They store with millisecond precision, but don't actually show that full precision when queried. Also, as of one of the 2.1 patches, it automatically displays time in GMT; so that can be confusing to people as well. If your way of managing timestamps on the application side is working for you, just stick with that.
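If you do handle timestamps on the application side, the conversions are small. A minimal Python sketch, assuming your application holds epoch seconds as integers (an assumption about your data, not something Cassandra requires); the function names are made up for this example:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def epoch_seconds_to_millis(seconds):
    """Cassandra timestamps count milliseconds since the Unix epoch."""
    return seconds * 1000

def millis_to_utc(millis):
    """Decode a millisecond epoch value as a timezone-aware UTC datetime."""
    # timedelta arithmetic keeps the millisecond part exact (no float rounding).
    return EPOCH + timedelta(milliseconds=millis)

ms = epoch_seconds_to_millis(1503681600)
print(ms)
print(millis_to_utc(ms).isoformat())
```

Keeping everything in UTC on the application side avoids surprises from the GMT display mentioned above.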
CREATE TABLE campaigns (
SchduleStartdate text,
SchduleEndDate text,
id int,
scheduletime text,
enable boolean,
PRIMARY KEY ((SchduleStartdate, SchduleEndDate),id));
You can run the queries below against the table:
select * from campaigns where SchduleStartdate = 'xxx' and SchduleEndDate = 'xx'; -- to get the answer to the above question
select * from campaigns where SchduleStartdate = 'xxx' and SchduleEndDate = 'xx' and id = 1; -- if you want to filter the data further for specific ids
Here SchduleStartdate and SchduleEndDate are used as the partition key, and id is used as the clustering key to make sure the entries are unique.
This way, you can filter on start, end and then id if needed.
One downside is that filtering by id alone won't be possible, because you first need to restrict the partition key columns.
But what I understand from the manual is that allow filtering should not apply on this query, so basically why is this happening?
I am afraid your understanding is wrong. You want to filter on a non-primary-key column. In Cassandra, you need to add ALLOW FILTERING for this purpose.
Can you try
select * from users where last_name ='Jansen' ALLOW FILTERING
But please remember this is equivalent to doing select * from users and filtering the data from the result. So this is a very heavy operation and has a huge performance impact.
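In other words, the work done is roughly what this Python sketch does: read every row, then throw away the ones that do not match. The sample users and the helper name allow_filtering are invented for illustration.

```python
# Invented sample of a users table; all values are made up.
users = [
    {"id": 1, "first_name": "Anna", "last_name": "Jansen"},
    {"id": 2, "first_name": "Bram", "last_name": "de Vries"},
    {"id": 3, "first_name": "Cees", "last_name": "Jansen"},
    {"id": 4, "first_name": "Dana", "last_name": "Bakker"},
]

def allow_filtering(rows, predicate):
    """Read every row, keep the matches, and report how many rows were read."""
    read = 0
    kept = []
    for row in rows:
        read += 1
        if predicate(row):
            kept.append(row)
    return kept, read

kept, read = allow_filtering(users, lambda r: r["last_name"] == "Jansen")
print(len(kept), read)  # 2 4
```

The number of rows read grows with the table, not with the size of the result.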
It appears this feature is deceptive, and what you're trying to do is not yet possible. It is slated for v3.
https://issues.apache.org/jira/browse/CASSANDRA-6377